
UNIT III:

Building good training datasets:


Dealing with missing data, Handling categorical data, partitioning a data
set into separate training and test datasets, bringing features onto the
same scale, selecting meaningful features, assessing feature importance
with random forests.
Compressing data via dimensionality reduction: Unsupervised
dimensionality reduction via PCA, Supervised data compression via linear
discriminant analysis

Building Good Training Datasets

A well-constructed training dataset is crucial for the success of any machine learning model. Here are some key factors to consider when building your dataset:

1. Quality and Relevance:

 Data Quality: Ensure your data is clean, accurate, and free from
errors or inconsistencies.
 Relevance: Ensure the data is relevant to the problem you're
trying to solve.

2. Size and Diversity:

 Sufficient Size: The dataset should be large enough to represent the problem domain adequately.
 Diversity: The data should be diverse to capture different
scenarios and avoid bias.

3. Labeling and Annotation:

 Accuracy: Ensure that the labels or annotations are accurate and consistent.
 Clarity: The labeling or annotation process should be clear and
well-defined.

4. Feature Engineering:

 Informative Features: Create features that are informative and relevant to the problem.
 Avoid Redundancy: Minimize redundancy among features.

5. Class Balance:

 Balanced Classes: If dealing with classification problems, ensure that the classes are balanced to avoid bias.

6. Data Preprocessing:

 Normalization and Standardization: Normalize or standardize numerical features to a common scale.
 Handling Missing Values: Impute missing values or remove
rows with missing data.
 Outlier Detection and Handling: Identify and handle outliers if
necessary.

7. Data Augmentation (Optional):

 Increase Dataset Size: For tasks like image or text classification, augment the dataset by applying transformations like rotations, scaling, or adding noise.

8. Domain Knowledge:

 Leverage Expertise: Incorporate domain knowledge to guide data collection and feature engineering.

9. Bias and Fairness:

 Avoid Bias: Be aware of potential biases in the data and take steps to mitigate them.

10. Privacy and Ethics:

 Data Privacy: Handle data responsibly and comply with relevant privacy regulations.
 Ethical Considerations: Ensure that the use of the data is
ethical and aligns with ethical guidelines.

By following these guidelines, you can build high-quality training
datasets that will enable your machine learning models to perform
effectively.

Dealing with Missing Data

Missing data is a common challenge in machine learning. Here are some strategies to handle it:

1. Deletion:

 Listwise Deletion: Remove entire rows (or columns) that contain missing values.
 Pairwise Deletion: Use all available data for each calculation, excluding a data point only from the statistics that involve its missing variables.

When to use:

 When missing data is minimal and doesn't significantly affect the dataset.
 When the missing values are not systematically related to other
variables.

Drawbacks:

 Can lead to information loss if a significant amount of data is deleted.
 May introduce bias if missing values are not random.

2. Imputation:

 Mean/Median Imputation: Replace missing values with the mean or median of the corresponding column.
 Mode Imputation: Replace missing categorical values with the
most frequent category.
 Regression Imputation: Predict missing values using regression
models.
 Hot Deck Imputation: Replace missing values with values from
similar data points.
 Multiple Imputation: Create multiple imputed datasets and
combine the results.

When to use:

 When missing values are not too numerous and the imputation
method is appropriate for the data type.
 For regression imputation, when there's a strong relationship
between the missing variable and other variables.

Drawbacks:

 Can introduce bias if the imputation method is not appropriate.
 May not accurately capture the uncertainty associated with missing values.

3. Ignoring Missing Values:

 Some algorithms: Certain algorithms (e.g., decision trees, random forests) can handle missing values directly.

When to use:

 When the algorithm can handle missing values without requiring imputation.

Drawbacks:

 May not be optimal for all algorithms or datasets.

4. Feature Engineering:

 Create Indicator Variables: Create binary indicator variables to indicate whether a value is missing.
 Impute with a Special Value: Assign a special value (e.g., -999) to missing values.

When to use:

 When you want to preserve information about missingness.

Drawbacks:

 May introduce additional complexity to the model.
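
Example:

A minimal sketch of listwise deletion, a missingness indicator, and mean/mode imputation with pandas and scikit-learn; the small DataFrame and its column names are hypothetical:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy DataFrame with missing values
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 33.0],
                   "city": ["A", "B", np.nan, "B"]})

# 1. Listwise deletion: drop any row that contains a missing value
df_dropped = df.dropna()

# 2. Indicator column so the fact of missingness is not lost
df["city_missing"] = df["city"].isna().astype(int)

# 3. Mean imputation for the numerical column
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# 4. Mode (most frequent) imputation for the categorical column
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()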

The best approach for handling missing data depends on the specific
dataset, the algorithm being used, and the impact of missing values on
the analysis. It's often a good practice to experiment with different
methods and evaluate their impact on the model's performance.

Handling Categorical Data in Machine Learning

Categorical data is a type of data that represents different categories or groups. It can be either nominal (no inherent order) or ordinal (with an inherent order).

Here are some common techniques for handling categorical data in machine learning:

1. One-Hot Encoding: (Dummy Variables)

 Create a binary vector for each category, where only one element
is 1 and the rest are 0.
 Suitable for nominal data.
 Increases the dimensionality of the data.

2. Label Encoding:

 Assign a unique integer to each category.
 Suitable for ordinal data.
 Preserves the order of categories.

3. Target Encoding:

 Replace each category with the mean or median of the target variable for that category.
 Can capture useful information compactly, but risks target leakage and overfitting unless smoothing or cross-validation is used.

4. Embedding:

 Learn a low-dimensional representation of categorical variables.
 Useful for high-cardinality categorical features.
 Commonly used in deep learning models.

5. Hashing:

 Map each category to an integer bucket using a hash function.
 Reduces dimensionality (compared with one-hot encoding) but may introduce collisions.
6. Count Encoding:

 Replace each category with its frequency in the dataset.
 Can be useful for imbalanced datasets.

7. Frequency Encoding:

 Replace each category with its proportion in the dataset.
 Similar to count encoding but scales the values between 0 and 1.

Choosing the Right Technique:

 Nominal Data: One-hot encoding is generally preferred.
 Ordinal Data: Label encoding or target encoding can be used.
 High-Cardinality Data: Embedding or hashing can be effective.
 Imbalanced Data: Count or frequency encoding can help
address class imbalance.
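
Example:

A minimal sketch of one-hot and ordinal encoding with pandas and scikit-learn; the DataFrame and its column names are hypothetical:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one nominal feature and one ordinal feature
df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size": ["S", "M", "L", "M"]})

# One-hot encoding for the nominal feature
df_encoded = pd.get_dummies(df, columns=["color"])

# Ordinal (integer) encoding that respects the order S < M < L
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
df_encoded["size"] = encoder.fit_transform(df_encoded[["size"]]).ravel()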

It's important to consider the specific characteristics of your data and the algorithm you're using when selecting a technique for handling categorical data.

Experimentation and evaluation can help you determine the best approach for your particular problem.

Partitioning a Dataset into Training and Test Sets

Partitioning a dataset into separate training and test sets is a crucial step in machine learning to evaluate the model's performance and prevent overfitting.

Why Partition?

 Training: The training set is used to teach the model the patterns
and relationships in the data.
 Testing: The test set is used to evaluate how well the model
generalizes to new, unseen data.

Common Partitioning Methods:

1. Random Split:

o The most common method.
o Randomly divide the dataset into two subsets: training and
testing.
o Typically, a 70/30 or 80/20 split is used.
2. Stratified Sampling:
o Ensures that the distribution of classes or target variables is
similar in both sets.
o Useful for imbalanced datasets.
3. K-Fold Cross-Validation:
o Divide the dataset into k folds.
o Train the model k times, each time using k-1 folds for
training and 1 fold for testing.
o Average the performance across all k folds.
4. Leave-One-Out Cross-Validation (LOOCV):
o A special case of k-fold cross-validation where k equals the
number of data points.
o Time-consuming for large datasets.

Factors to Consider:

 Dataset Size: For smaller datasets, k-fold cross-validation or LOOCV can give a more reliable performance estimate than a single train/test split.
 Class Distribution: Stratified sampling is useful for imbalanced
datasets.
 Computational Resources: Consider the computational cost of
different methods, especially for large datasets.

Example:

from sklearn.model_selection import train_test_split

# Assuming X is your feature matrix and y is your target vector
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

This code splits the dataset into a training set (80%) and a testing set
(20%) using random splitting.
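
Stratified sampling and k-fold cross-validation, mentioned above, are also available in scikit-learn. A minimal sketch, assuming the same X and y; LogisticRegression is only a placeholder estimator:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Stratified random split: preserves the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation: average performance over five train/test partitions
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())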

By carefully partitioning your dataset, you can ensure a reliable
evaluation of your model's performance and avoid overfitting.

Bringing Features onto the Same Scale

Scaling features is a common preprocessing step in machine learning to ensure that different features contribute equally to the model's predictions.

This is particularly important for algorithms that rely on distance calculations or gradient descent, such as k-nearest neighbors, support vector machines, and neural networks.

Why Scale Features?

 Equal Contribution: Features with larger magnitudes can dominate the model, leading to biased results. Scaling ensures that all features contribute equally.
 Convergence Speed: Scaling can improve the convergence
speed of gradient-based algorithms.
 Interpretability: Scaled features can make it easier to interpret
the model's coefficients.

Common Scaling Techniques:

1. Min-Max Scaling:
o Scales features to a specific range, typically between 0 and
1.
o Formula:
o scaled_feature = (feature - min(feature)) / (max(feature) -
min(feature))
2. Standardization (Z-score Scaling):
o Scales features to have a mean of 0 and a standard
deviation of 1.
o Formula:
o scaled_feature = (feature - mean(feature)) / std(feature)
3. Robust Scaling:
o Similar to standardization but uses the median and
interquartile range instead of mean and standard deviation.
o Less sensitive to outliers.

When to Use Which Technique?

 Min-Max Scaling: When you want all values bounded to a fixed range such as [0, 1] (e.g., for image pixel data).
 Standardization: When you want to center the data and scale it
to a unit variance.
 Robust Scaling: When your data contains outliers that
significantly affect the mean and standard deviation.

Example:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max Scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

# Standardization
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

Note: Fit the scaler on the training data only, then apply the same fitted scaler to the test data; this keeps the two sets on a consistent scale and avoids information leaking from the test set.
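
A minimal sketch of that practice, assuming X_train and X_test have already been produced by train_test_split:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the same fitted scaler on the test data
X_test_scaled = scaler.transform(X_test)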

By scaling features, you can improve the performance and interpretability of your machine learning models.

Selecting Meaningful Features

Feature selection is a crucial step in machine learning, as it involves identifying the most relevant features that contribute significantly to the target variable.

Choosing meaningful features can improve model performance, reduce overfitting, and enhance interpretability.

Why Feature Selection?

 Improved Performance: Relevant features can lead to better model accuracy and generalization.

 Reduced Overfitting: Removing irrelevant features can help
prevent overfitting.
 Interpretability: A smaller set of features can make the model
easier to understand and explain.

Feature Selection Techniques:

1. Filter Methods:
o Correlation Analysis: Measure the correlation between
features and the target variable.
o Chi-Square Test: For categorical features and categorical
targets.
o ANOVA (Analysis of Variance): For numerical features
and categorical targets.
o Mutual Information: A general-purpose measure of
dependence between two variables.
2. Wrapper Methods:
o Forward Selection: Start with an empty set of features
and add one feature at a time based on performance.
o Backward Elimination: Start with all features and remove
one feature at a time based on performance.
o Recursive Feature Elimination (RFE): Rank features
based on their importance and recursively eliminate the least
important ones.
3. Embedded Methods:
o Regularization: L1 (Lasso) regularization can automatically select features by driving some coefficients exactly to zero; L2 (Ridge) shrinks coefficients but does not eliminate features on its own.
o Tree-Based Methods: Decision trees and random forests
can identify important features based on their information
gain or Gini impurity.

Considerations:

 Domain Knowledge: Leverage domain expertise to identify potentially relevant features.
 Feature Engineering: Create new features that may be more
informative.
 Feature Importance: Use techniques like permutation
importance or SHAP values to assess the importance of features.

 Computational Cost: Some feature selection methods can be
computationally expensive, especially for large datasets.

Example:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Create a Lasso model (L1 regularization drives some coefficients to zero)
lasso = Lasso(alpha=0.1)

# Fit the model to the data
lasso.fit(X, y)

# Keep only the features whose coefficients are non-zero
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
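
For comparison, here is a wrapper-method sketch using Recursive Feature Elimination (RFE), assuming the same X and y; LogisticRegression and n_features_to_select=5 are illustrative choices, not prescriptions:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Rank features with a simple estimator and keep the top five
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# Boolean mask of the selected features
print(rfe.support_)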

By carefully selecting meaningful features, you can improve the performance and interpretability of your machine learning models.

Assessing Feature Importance with Random Forests

Random forests are a powerful ensemble learning method that can be used to assess the importance of features in a machine learning model.

By measuring how much each feature contributes to the splits in the decision trees that make up the forest, we can gain insight into its contribution to the model's predictions.

Methods for Assessing Feature Importance:

1. Mean Decrease in Impurity (MDI):
o Measures the average decrease in impurity (e.g., Gini impurity or entropy) across all decision trees in the forest when a feature is split on.
o A higher MDI value indicates a more important feature.

2. Mean Decrease in Accuracy (MDA):
o Randomly permutes the values of a feature in the out-of-bag
(OOB) data and measures the decrease in accuracy.
o A higher MDA value indicates a more important feature.

Python Example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model to the training data
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Print feature importances
for feature, importance in zip(iris.feature_names, importances):
    print(f"{feature}: {importance:.2f}")
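
The permutation-based view (MDA) is also available in scikit-learn. A minimal sketch continuing from the fitted rf above; note that it uses the held-out test set rather than the out-of-bag samples:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the resulting drop in accuracy
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

for feature, importance in zip(iris.feature_names, result.importances_mean):
    print(f"{feature}: {importance:.2f}")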

Interpreting Feature Importance:

 Higher values: Features with higher importance values contribute more to the model's predictions.
 Relative importance: Compare the values of different features to determine their relative importance.

 Contextual understanding: Consider the domain knowledge
and the nature of the features to interpret their importance in the
context of the problem.

Note: Feature importance can be influenced by factors such as the correlation between features and the distribution of data.

It's essential to interpret feature importance in conjunction with other information about the data and the model.

--------------------------------End for Mid1 ----------------------------------------

Dimensionality Reduction:
Dimensionality reduction is a technique used to reduce the number
of features (dimensions) in a dataset while preserving the most
important information. This can be beneficial for several reasons:

 Computational efficiency: Reducing the number of features can significantly speed up training and prediction processes.
 Improved performance: Removing irrelevant or redundant
features can sometimes improve model performance.
 Visualization: Dimensionality reduction can make it easier to
visualize and understand high-dimensional data.

Common Dimensionality Reduction Techniques:

1. Principal Component Analysis (PCA):
o Finds a new set of uncorrelated features (principal components) that capture the most variance in the data.
o Can be used to visualize high-dimensional data in 2D or 3D.
2. Linear Discriminant Analysis (LDA):
o Similar to PCA but specifically designed for classification
problems.
o Finds a projection that maximizes the separation between
classes.
3. t-SNE (t-Distributed Stochastic Neighbor Embedding):
o A non-linear dimensionality reduction technique that
preserves local structure in the data.
o Often used for visualizing high-dimensional data.

4. Autoencoders:
o Neural networks trained to reconstruct the input data.
o Can be used for both dimensionality reduction and feature
learning.
5. Feature Selection:
o Selecting a subset of features based on their importance or
relevance.
o Can be considered a form of dimensionality reduction.

Choosing the Right Technique:

 Goal: Consider the specific goals of your analysis. Are you interested in visualization, feature engineering, or computational efficiency?
 Data Characteristics: The choice of technique may depend on
the distribution of your data and the relationships between
features.
 Desired Properties: Some techniques preserve linear
relationships, while others focus on preserving local structure.

Example:

from sklearn.decomposition import PCA

# Reduce dimensionality to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

By applying dimensionality reduction techniques, you can often simplify your data while preserving the most important information, leading to more efficient and interpretable models.

Unsupervised Dimensionality Reduction via PCA

Principal Component Analysis (PCA) is a popular unsupervised dimensionality reduction technique that transforms a high-dimensional dataset into a lower-dimensional space while preserving the most important information.

How PCA Works:

1. Centering: The data is centered by subtracting the mean of each feature from all data points.
2. Covariance Matrix: The covariance matrix of the centered data
is calculated.
3. Eigenvalue Decomposition: The eigenvectors and eigenvalues
of the covariance matrix are computed.
4. Projection: The data is projected onto the eigenvectors
corresponding to the largest eigenvalues. These eigenvectors are
called principal components.
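
The four steps above can be written directly in NumPy. This is a minimal sketch, assuming X is a NumPy array of shape (n_samples, n_features) and k is the number of components to keep:

import numpy as np

def pca_project(X, k):
    # 1. Centering: subtract the mean of each feature
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the centered data
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvalue decomposition (eigh handles symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Sort the eigenvectors by eigenvalue in descending order
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:k]]
    # 4. Projection onto the top-k principal components
    return X_centered @ components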

Key Concepts:

 Principal Components: Linear combinations of the original features that capture the most variance in the data.
 Eigenvalues: Represent the variance explained by each principal
component.
 Dimensionality Reduction: By selecting the principal
components with the highest eigenvalues, you can reduce the
dimensionality of the data while preserving the most important
information.

Benefits of PCA:

 Visualization: PCA can help visualize high-dimensional data in 2D or 3D.
 Noise Reduction: PCA can reduce the impact of noise by
removing components with low variance.
 Feature Engineering: PCA can create new features that are
linear combinations of the original features.
 Computational Efficiency: PCA can reduce the computational
cost of subsequent analysis.

Choosing the Number of Components:

 Scree Plot: A plot of the eigenvalues in descending order can help you determine the number of components to keep.
 Explained Variance: Calculate the cumulative explained variance
by the selected components.
 Domain Knowledge: Consider the specific characteristics of your
data and the goals of your analysis.

Example:

from sklearn.decomposition import PCA

# Create a PCA object with 2 components
pca = PCA(n_components=2)

# Fit the PCA model to the data
pca.fit(X)

# Transform the data
X_reduced = pca.transform(X)
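
To choose the number of components as discussed above, you can inspect the explained variance ratio. A short sketch that fits PCA with all components and prints the cumulative explained variance:

# Fit PCA without limiting the number of components
pca_full = PCA()
pca_full.fit(X)

# Cumulative share of variance explained by the leading components
cumulative = pca_full.explained_variance_ratio_.cumsum()
print(cumulative)  # e.g., keep enough components to reach about 95%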

By applying PCA, you can effectively reduce the dimensionality of your data while preserving the most important information, leading to improved performance and interpretability in many machine learning tasks.

PCA is an unsupervised dimensionality reduction technique. LDA, on the other hand, is a supervised technique that aims to find a projection that maximizes the separation between classes.

Linear Discriminant Analysis (LDA):

 Goal: To find a linear combination of features that maximizes the separation between different classes.
 Process:
1. Calculate the mean of each class.
2. Compute the within-class scatter matrix and the between-
class scatter matrix.
3. Find the projection that maximizes the ratio of the between-
class scatter to the within-class scatter.
4. Project the data onto this projection.
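
A minimal scikit-learn sketch of this process, assuming feature matrix X and class labels y; keeping two discriminant components is an illustrative choice (LDA allows at most n_classes - 1 components):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Unlike PCA, LDA uses the class labels y
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)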

Advantages of LDA:

 Class-specific: LDA specifically considers the class labels, making it more suitable for classification tasks.
 Separation Maximization: LDA aims to maximize the separation
between classes, which can improve classification accuracy.
 Dimensionality Reduction: LDA can reduce the dimensionality
of the data while preserving class information.

Disadvantages of LDA:

 Assumption of Normality: LDA assumes that the data within each class is normally distributed (and, in the classical formulation, that the classes share a common covariance matrix).
 Sensitivity to Outliers: Outliers can significantly affect the
results.
 Limited Flexibility: LDA is a linear technique and may not be
suitable for complex non-linear relationships.

In summary, while PCA is an unsupervised technique that focuses on preserving variance, LDA is a supervised technique that focuses on maximizing class separation.

The choice between PCA and LDA depends on the specific task and the
characteristics of the data.

