
UNIT III:

Building good training datasets:


Dealing with missing data, Handling categorical data, partitioning a data
set into separate training and test datasets, bringing features onto the
same scale, selecting meaningful features, assessing feature importance
with random forests.
Compressing data via dimensionality reduction: Unsupervised
dimensionality reduction via PCA, Supervised data compression via linear
discriminant analysis

Building Good Training Datasets

A well-constructed training dataset is crucial for the success of any machine learning model. Here are some key factors to consider when building your dataset:

1. Quality and Relevance:

 Data Quality: Ensure your data is clean, accurate, and free from
errors or inconsistencies.
 Relevance: Ensure the data is relevant to the problem you're
trying to solve.

2. Size and Diversity:

 Sufficient Size: The dataset should be large enough to represent the problem domain adequately.
 Diversity: The data should be diverse to capture different
scenarios and avoid bias.

3. Labeling and Annotation:

 Accuracy: Ensure that the labels or annotations are accurate and consistent.
 Clarity: The labeling or annotation process should be clear and
well-defined.

4. Feature Engineering:

 Informative Features: Create features that are informative and relevant to the problem.
 Avoid Redundancy: Minimize redundancy among features.

5. Class Balance:

 Balanced Classes: If dealing with classification problems, ensure that the classes are balanced to avoid bias.

6. Data Preprocessing:

 Normalization and Standardization: Normalize or standardize numerical features to a common scale.
 Handling Missing Values: Impute missing values or remove
rows with missing data.
 Outlier Detection and Handling: Identify and handle outliers if
necessary.

7. Data Augmentation (Optional):

 Increase Dataset Size: For tasks like image or text classification, augment the dataset by applying transformations like rotations, scaling, or adding noise.

8. Domain Knowledge:

 Leverage Expertise: Incorporate domain knowledge to guide data collection and feature engineering.

9. Bias and Fairness:

 Avoid Bias: Be aware of potential biases in the data and take steps to mitigate them.

10. Privacy and Ethics:

 Data Privacy: Handle data responsibly and comply with relevant privacy regulations.
 Ethical Considerations: Ensure that the use of the data is
ethical and aligns with ethical guidelines.

By following these guidelines, you can build high-quality training
datasets that will enable your machine learning models to perform
effectively.

Dealing with Missing Data

Missing data is a common challenge in machine learning. Here are some strategies to handle it:

1. Deletion:

 Listwise Deletion: Remove entire rows (or columns) that contain missing values.
 Pairwise Deletion: Use all available data for each calculation, excluding a data point only from the statistics that involve its missing variables.

When to use:

 When missing data is minimal and doesn't significantly affect the dataset.
 When the missing values are not systematically related to other
variables.

Drawbacks:

 Can lead to information loss if a significant amount of data is deleted.
 May introduce bias if missing values are not random.

2. Imputation:

 Mean/Median Imputation: Replace missing values with the mean or median of the corresponding column.
 Mode Imputation: Replace missing categorical values with the
most frequent category.
 Regression Imputation: Predict missing values using regression
models.
 Hot Deck Imputation: Replace missing values with values from
similar data points.
 Multiple Imputation: Create multiple imputed datasets and
combine the results.

When to use:

 When missing values are not too numerous and the imputation
method is appropriate for the data type.
 For regression imputation, when there's a strong relationship
between the missing variable and other variables.

Drawbacks:

 Can introduce bias if the imputation method is not appropriate.
 May not accurately capture the uncertainty associated with missing values.

3. Ignoring Missing Values:

 Some algorithms: Certain algorithms (e.g., decision trees, random forests) can handle missing values directly.

When to use:

 When the algorithm can handle missing values without requiring imputation.

Drawbacks:

 May not be optimal for all algorithms or datasets.

4. Feature Engineering:

 Create Indicator Variables: Create binary indicator variables to indicate whether a value is missing.
 Impute with a Special Value: Assign a special value (e.g., -999) to missing values.

When to use:

 When you want to preserve information about missingness.

Drawbacks:

 May introduce additional complexity to the model.
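
Example:

A minimal sketch of listwise deletion, a missingness indicator, and mean/mode imputation with pandas and scikit-learn; the small DataFrame and its column names are hypothetical:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy DataFrame with missing values
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 33.0],
                   "city": ["A", "B", np.nan, "B"]})

# 1. Listwise deletion: drop any row that contains a missing value
df_dropped = df.dropna()

# 2. Indicator column so the fact of missingness is not lost
df["city_missing"] = df["city"].isna().astype(int)

# 3. Mean imputation for the numerical column
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# 4. Mode (most frequent) imputation for the categorical column
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()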

The best approach for handling missing data depends on the specific
dataset, the algorithm being used, and the impact of missing values on
the analysis. It's often a good practice to experiment with different
methods and evaluate their impact on the model's performance.

Handling Categorical Data in Machine Learning

Categorical data is a type of data that represents different categories or groups. It can be either nominal (no inherent order) or ordinal (with an inherent order).

Here are some common techniques for handling categorical data in machine learning:

1. One-Hot Encoding: (Dummy Variables)

 Create a binary vector for each category, where only one element
is 1 and the rest are 0.
 Suitable for nominal data.
 Increases the dimensionality of the data.

2. Label Encoding:

 Assign a unique integer to each category.
 Suitable for ordinal data.
 Preserves the order of categories.

3. Target Encoding:

 Replace each category with the mean or median of the target variable for that category.
 Can capture useful information compactly, but risks target leakage and overfitting unless smoothing or cross-validation is used.

4. Embedding:

 Learn a low-dimensional representation of categorical variables.
 Useful for high-cardinality categorical features.
 Commonly used in deep learning models.

5. Hashing:

 Map each category to an integer bucket using a hash function.
 Reduces dimensionality (compared with one-hot encoding) but may introduce collisions.
6. Count Encoding:

 Replace each category with its frequency in the dataset.
 Can be useful for imbalanced datasets.

7. Frequency Encoding:

 Replace each category with its proportion in the dataset.
 Similar to count encoding but scales the values between 0 and 1.

Choosing the Right Technique:

 Nominal Data: One-hot encoding is generally preferred.
 Ordinal Data: Label encoding or target encoding can be used.
 High-Cardinality Data: Embedding or hashing can be effective.
 Imbalanced Data: Count or frequency encoding can help
address class imbalance.
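
Example:

A minimal sketch of one-hot and ordinal encoding with pandas and scikit-learn; the DataFrame and its column names are hypothetical:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one nominal feature and one ordinal feature
df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size": ["S", "M", "L", "M"]})

# One-hot encoding for the nominal feature
df_encoded = pd.get_dummies(df, columns=["color"])

# Ordinal (integer) encoding that respects the order S < M < L
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
df_encoded["size"] = encoder.fit_transform(df_encoded[["size"]]).ravel()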

It's important to consider the specific characteristics of your data and the algorithm you're using when selecting a technique for handling categorical data.

Experimentation and evaluation can help you determine the best approach for your particular problem.

Partitioning a Dataset into Training and Test Sets

Partitioning a dataset into separate training and test sets is a crucial step in machine learning to evaluate the model's performance and prevent overfitting.

Why Partition?

 Training: The training set is used to teach the model the patterns
and relationships in the data.
 Testing: The test set is used to evaluate how well the model
generalizes to new, unseen data.

Common Partitioning Methods:

1. Random Split:

o The most common method.
o Randomly divide the dataset into two subsets: training and
testing.
o Typically, a 70/30 or 80/20 split is used.
2. Stratified Sampling:
o Ensures that the distribution of classes or target variables is
similar in both sets.
o Useful for imbalanced datasets.
3. K-Fold Cross-Validation:
o Divide the dataset into k folds.
o Train the model k times, each time using k-1 folds for
training and 1 fold for testing.
o Average the performance across all k folds.
4. Leave-One-Out Cross-Validation (LOOCV):
o A special case of k-fold cross-validation where k equals the
number of data points.
o Time-consuming for large datasets.

Factors to Consider:

 Dataset Size: For smaller datasets, k-fold cross-validation or LOOCV can give a more reliable performance estimate than a single train/test split.
 Class Distribution: Stratified sampling is useful for imbalanced
datasets.
 Computational Resources: Consider the computational cost of
different methods, especially for large datasets.

Example:

from sklearn.model_selection import train_test_split

# Assuming X is your feature matrix and y is your target vector
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

This code splits the dataset into a training set (80%) and a testing set
(20%) using random splitting.
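
Stratified sampling and k-fold cross-validation, mentioned above, are also available in scikit-learn. A minimal sketch, assuming the same X and y; LogisticRegression is only a placeholder estimator:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Stratified random split: preserves the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation: average performance over five train/test partitions
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())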

By carefully partitioning your dataset, you can ensure a reliable
evaluation of your model's performance and avoid overfitting.

Bringing Features onto the Same Scale

Scaling features is a common preprocessing step in machine learning to ensure that different features contribute equally to the model's predictions.

This is particularly important for algorithms that rely on distance calculations or gradient descent, such as k-nearest neighbors, support vector machines, and neural networks.

Why Scale Features?

 Equal Contribution: Features with larger magnitudes can dominate the model, leading to biased results. Scaling ensures that all features contribute equally.
 Convergence Speed: Scaling can improve the convergence
speed of gradient-based algorithms.
 Interpretability: Scaled features can make it easier to interpret
the model's coefficients.

Common Scaling Techniques:

1. Min-Max Scaling:
o Scales features to a specific range, typically between 0 and
1.
o Formula:
o scaled_feature = (feature - min(feature)) / (max(feature) -
min(feature))
2. Standardization (Z-score Scaling):
o Scales features to have a mean of 0 and a standard
deviation of 1.
o Formula:
o scaled_feature = (feature - mean(feature)) / std(feature)
3. Robust Scaling:
o Similar to standardization but uses the median and
interquartile range instead of mean and standard deviation.
o Less sensitive to outliers.

When to Use Which Technique?

 Min-Max Scaling: When you want all values bounded to a fixed range such as [0, 1] (e.g., for image pixel data).
 Standardization: When you want to center the data and scale it
to a unit variance.
 Robust Scaling: When your data contains outliers that
significantly affect the mean and standard deviation.

Example:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max Scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

# Standardization
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

Note: Fit the scaler on the training data only, then apply the same fitted scaler to the test data; this keeps the two sets on a consistent scale and avoids information leaking from the test set.
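
A minimal sketch of that practice, assuming X_train and X_test have already been produced by train_test_split:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the same fitted scaler on the test data
X_test_scaled = scaler.transform(X_test)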

By scaling features, you can improve the performance and interpretability of your machine learning models.

Selecting Meaningful Features

Feature selection is a crucial step in machine learning, as it involves identifying the most relevant features that contribute significantly to the target variable.

Choosing meaningful features can improve model performance, reduce overfitting, and enhance interpretability.

Why Feature Selection?

 Improved Performance: Relevant features can lead to better model accuracy and generalization.

 Reduced Overfitting: Removing irrelevant features can help
prevent overfitting.
 Interpretability: A smaller set of features can make the model
easier to understand and explain.

Feature Selection Techniques:

1. Filter Methods:
o Correlation Analysis: Measure the correlation between
features and the target variable.
o Chi-Square Test: For categorical features and categorical
targets.
o ANOVA (Analysis of Variance): For numerical features
and categorical targets.
o Mutual Information: A general-purpose measure of
dependence between two variables.
2. Wrapper Methods:
o Forward Selection: Start with an empty set of features
and add one feature at a time based on performance.
o Backward Elimination: Start with all features and remove
one feature at a time based on performance.
o Recursive Feature Elimination (RFE): Rank features
based on their importance and recursively eliminate the least
important ones.
3. Embedded Methods:
o Regularization: L1 (Lasso) regularization can automatically select features by driving some coefficients exactly to zero; L2 (Ridge) shrinks coefficients but does not eliminate features on its own.
o Tree-Based Methods: Decision trees and random forests
can identify important features based on their information
gain or Gini impurity.

Considerations:

 Domain Knowledge: Leverage domain expertise to identify potentially relevant features.
 Feature Engineering: Create new features that may be more
informative.
 Feature Importance: Use techniques like permutation
importance or SHAP values to assess the importance of features.

 Computational Cost: Some feature selection methods can be
computationally expensive, especially for large datasets.

Example:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Create a Lasso model (L1 regularization drives some coefficients to zero)
lasso = Lasso(alpha=0.1)

# Fit the model to the data
lasso.fit(X, y)

# Keep only the features whose coefficients are non-zero
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
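
For comparison, here is a wrapper-method sketch using Recursive Feature Elimination (RFE), assuming the same X and y; LogisticRegression and n_features_to_select=5 are illustrative choices, not prescriptions:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Rank features with a simple estimator and keep the top five
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# Boolean mask of the selected features
print(rfe.support_)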

By carefully selecting meaningful features, you can improve the performance and interpretability of your machine learning models.

Assessing Feature Importance with Random Forests

Random forests are a powerful ensemble learning method that can be used to assess the importance of features in a machine learning model.

By measuring how much each feature contributes to the splits in the decision trees that make up the forest, we can gain insight into its contribution to the model's predictions.

Methods for Assessing Feature Importance:

1. Mean Decrease in Impurity (MDI):
o Measures the average decrease in impurity (e.g., Gini impurity or entropy) across all decision trees in the forest when a feature is split on.
o A higher MDI value indicates a more important feature.

2. Mean Decrease in Accuracy (MDA):
o Randomly permutes the values of a feature in the out-of-bag
(OOB) data and measures the decrease in accuracy.
o A higher MDA value indicates a more important feature.

Python Example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model to the training data
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Print feature importances
for feature, importance in zip(iris.feature_names, importances):
    print(f"{feature}: {importance:.2f}")
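
The permutation-based view (MDA) is also available in scikit-learn. A minimal sketch continuing from the fitted rf above; note that it uses the held-out test set rather than the out-of-bag samples:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the resulting drop in accuracy
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

for feature, importance in zip(iris.feature_names, result.importances_mean):
    print(f"{feature}: {importance:.2f}")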

Interpreting Feature Importance:

 Higher values: Features with higher importance values contribute more to the model's predictions.
 Relative importance: Compare the values of different features to determine their relative importance.

 Contextual understanding: Consider the domain knowledge
and the nature of the features to interpret their importance in the
context of the problem.

Note: Feature importance can be influenced by factors such as the correlation between features and the distribution of data.

It's essential to interpret feature importance in conjunction with other information about the data and the model.

--------------------------------End for Mid1 ----------------------------------------

Dimensionality Reduction:
Dimensionality reduction is a technique used to reduce the number
of features (dimensions) in a dataset while preserving the most
important information. This can be beneficial for several reasons:

 Computational efficiency: Reducing the number of features can significantly speed up training and prediction processes.
 Improved performance: Removing irrelevant or redundant
features can sometimes improve model performance.
 Visualization: Dimensionality reduction can make it easier to
visualize and understand high-dimensional data.

Common Dimensionality Reduction Techniques:

1. Principal Component Analysis (PCA):
o Finds a new set of uncorrelated features (principal components) that capture the most variance in the data.
o Can be used to visualize high-dimensional data in 2D or 3D.
2. Linear Discriminant Analysis (LDA):
o Similar to PCA but specifically designed for classification
problems.
o Finds a projection that maximizes the separation between
classes.
3. t-SNE (t-Distributed Stochastic Neighbor Embedding):
o A non-linear dimensionality reduction technique that
preserves local structure in the data.
o Often used for visualizing high-dimensional data.

4. Autoencoders:
o Neural networks trained to reconstruct the input data.
o Can be used for both dimensionality reduction and feature
learning.
5. Feature Selection:
o Selecting a subset of features based on their importance or
relevance.
o Can be considered a form of dimensionality reduction.

Choosing the Right Technique:

 Goal: Consider the specific goals of your analysis. Are you interested in visualization, feature engineering, or computational efficiency?
 Data Characteristics: The choice of technique may depend on
the distribution of your data and the relationships between
features.
 Desired Properties: Some techniques preserve linear
relationships, while others focus on preserving local structure.

Example:

from sklearn.decomposition import PCA

# Reduce dimensionality to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

By applying dimensionality reduction techniques, you can often simplify your data while preserving the most important information, leading to more efficient and interpretable models.

Unsupervised Dimensionality Reduction via PCA

Principal Component Analysis (PCA) is a popular unsupervised dimensionality reduction technique that transforms a high-dimensional dataset into a lower-dimensional space while preserving the most important information.

How PCA Works:

1. Centering: The data is centered by subtracting the mean of each feature from all data points.
2. Covariance Matrix: The covariance matrix of the centered data
is calculated.
3. Eigenvalue Decomposition: The eigenvectors and eigenvalues
of the covariance matrix are computed.
4. Projection: The data is projected onto the eigenvectors
corresponding to the largest eigenvalues. These eigenvectors are
called principal components.
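
The four steps above can be written directly in NumPy. This is a minimal sketch, assuming X is a NumPy array of shape (n_samples, n_features) and k is the number of components to keep:

import numpy as np

def pca_project(X, k):
    # 1. Centering: subtract the mean of each feature
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the centered data
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvalue decomposition (eigh handles symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Sort the eigenvectors by eigenvalue in descending order
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:k]]
    # 4. Projection onto the top-k principal components
    return X_centered @ components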

Key Concepts:

 Principal Components: Linear combinations of the original features that capture the most variance in the data.
 Eigenvalues: Represent the variance explained by each principal
component.
 Dimensionality Reduction: By selecting the principal
components with the highest eigenvalues, you can reduce the
dimensionality of the data while preserving the most important
information.

Benefits of PCA:

 Visualization: PCA can help visualize high-dimensional data in 2D or 3D.
 Noise Reduction: PCA can reduce the impact of noise by
removing components with low variance.
 Feature Engineering: PCA can create new features that are
linear combinations of the original features.
 Computational Efficiency: PCA can reduce the computational
cost of subsequent analysis.

Choosing the Number of Components:

 Scree Plot: A plot of the eigenvalues in descending order can help you determine the number of components to keep.
 Explained Variance: Calculate the cumulative explained variance
by the selected components.
 Domain Knowledge: Consider the specific characteristics of your
data and the goals of your analysis.

Example:

from sklearn.decomposition import PCA

# Create a PCA object with 2 components
pca = PCA(n_components=2)

# Fit the PCA model to the data
pca.fit(X)

# Transform the data
X_reduced = pca.transform(X)
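
To choose the number of components as discussed above, you can inspect the explained variance ratio. A short sketch that fits PCA with all components and prints the cumulative explained variance:

# Fit PCA without limiting the number of components
pca_full = PCA()
pca_full.fit(X)

# Cumulative share of variance explained by the leading components
cumulative = pca_full.explained_variance_ratio_.cumsum()
print(cumulative)  # e.g., keep enough components to reach about 95%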

By applying PCA, you can effectively reduce the dimensionality of your data while preserving the most important information, leading to improved performance and interpretability in many machine learning tasks.

PCA is an unsupervised dimensionality reduction technique. LDA, on the other hand, is a supervised technique that aims to find a projection that maximizes the separation between classes.

Linear Discriminant Analysis (LDA):

 Goal: To find a linear combination of features that maximizes the separation between different classes.
 Process:
1. Calculate the mean of each class.
2. Compute the within-class scatter matrix and the between-
class scatter matrix.
3. Find the projection that maximizes the ratio of the between-
class scatter to the within-class scatter.
4. Project the data onto this projection.
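
A minimal scikit-learn sketch of this process, assuming feature matrix X and class labels y; keeping two discriminant components is an illustrative choice (LDA allows at most n_classes - 1 components):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Unlike PCA, LDA uses the class labels y
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)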

Advantages of LDA:

 Class-specific: LDA specifically considers the class labels, making it more suitable for classification tasks.
 Separation Maximization: LDA aims to maximize the separation
between classes, which can improve classification accuracy.
 Dimensionality Reduction: LDA can reduce the dimensionality
of the data while preserving class information.

Disadvantages of LDA:

 Assumption of Normality: LDA assumes that the data within each class is normally distributed (and, in the classical formulation, that the classes share a common covariance matrix).
 Sensitivity to Outliers: Outliers can significantly affect the
results.
 Limited Flexibility: LDA is a linear technique and may not be
suitable for complex non-linear relationships.

In summary, while PCA is an unsupervised technique that focuses on preserving variance, LDA is a supervised technique that focuses on maximizing class separation.

The choice between PCA and LDA depends on the specific task and the
characteristics of the data.

