ML unit 3
Data Quality: Ensure your data is clean, accurate, and free from
errors or inconsistencies.
Relevance: Ensure the data is relevant to the problem you're
trying to solve.
BY A.RAJENDRA PRASAD 1
4. Feature Engineering:
5. Class Balance:
6. Data Preprocessing:
8. Domain Knowledge:
By following these guidelines, you can build high-quality training
datasets that will enable your machine learning models to perform
effectively.
1. Deletion:
When to use:
Drawbacks:
2. Imputation:
When to use:
When missing values are not too numerous and the imputation
method is appropriate for the data type.
For regression imputation, when there's a strong relationship
between the missing variable and other variables.
Drawbacks:
When to use:
Drawbacks:
4. Feature Engineering:
When to use:
Drawbacks:
The best approach for handling missing data depends on the specific
dataset, the algorithm being used, and the impact of missing values on
the analysis. It's often a good practice to experiment with different
methods and evaluate their impact on the model's performance.
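As a concrete illustration of the imputation approach above, here is a minimal sketch of mean imputation using scikit-learn's SimpleImputer on a small toy array (the data values are invented for the example):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with missing values (np.nan) in both columns
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Mean imputation: replace each missing value with its column mean,
# computed from the observed values only
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
# col 0 mean = (1 + 7) / 2 = 4.0, col 1 mean = (2 + 3) / 2 = 2.5
```

Other strategies ("median", "most_frequent") suit skewed numerical or categorical data, respectively.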
1. One-Hot Encoding:
Create a binary vector for each category, where only one element
is 1 and the rest are 0.
Suitable for nominal data.
Increases the dimensionality of the data.
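A minimal sketch of one-hot encoding using pandas' get_dummies (scikit-learn's OneHotEncoder works similarly); the color values are invented for the example:

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"])

# One binary column per category; exactly one 1 per row
one_hot = pd.get_dummies(colors)
# Columns are sorted alphabetically: blue, green, red,
# so "red" encodes as (0, 0, 1) and "blue" as (1, 0, 0)
```

Note how three categories produce three columns: this is the dimensionality increase mentioned above.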
2. Label Encoding:
3. Target Encoding:
4. Embedding:
5. Hashing:
7. Frequency Encoding:
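Two of the schemes listed above, label encoding and frequency encoding, can be sketched in a few lines of pandas (the city values are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# Label encoding: map each category to an integer code
# (categories are sorted, so LA=0, NY=1, SF=2)
df["city_label"] = df["city"].astype("category").cat.codes

# Frequency encoding: replace each category by how often it occurs
freq = df["city"].value_counts()
df["city_freq"] = df["city"].map(freq)
# "NY" appears 3 times, "LA" twice, "SF" once
```

Label encoding imposes an artificial ordering, so it is best reserved for ordinal data or tree-based models.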
Why Partition?
Training: The training set is used to teach the model the patterns
and relationships in the data.
Testing: The test set is used to evaluate how well the model
generalizes to new, unseen data.
1. Random Split:
o The most common method.
o Randomly divide the dataset into two subsets: training and
testing.
o Typically, a 70/30 or 80/20 split is used.
2. Stratified Sampling:
o Ensures that the distribution of classes or target variables is
similar in both sets.
o Useful for imbalanced datasets.
3. K-Fold Cross-Validation:
o Divide the dataset into k folds.
o Train the model k times, each time using k-1 folds for
training and 1 fold for testing.
o Average the performance across all k folds.
4. Leave-One-Out Cross-Validation (LOOCV):
o A special case of k-fold cross-validation where k equals the
number of data points.
o Time-consuming for large datasets.
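The k-fold procedure above can be sketched with scikit-learn's cross_val_score; the iris dataset and logistic regression model are stand-ins chosen for the example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train 5 times, each time using 4 folds
# for training and holding out the remaining fold for testing
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

# Average the performance across all k folds
mean_accuracy = scores.mean()
```

Setting cv equal to the number of samples would give LOOCV, at the computational cost noted above.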
Factors to Consider:
Example:
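A minimal sketch of the split described here, assuming scikit-learn's train_test_split and a randomly generated toy dataset standing in for the real one:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples, 3 features (stands in for real data)
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# 80/20 random split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```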
This code splits the dataset into a training set (80%) and a testing set
(20%) using random splitting.
By carefully partitioning your dataset, you can ensure a reliable
evaluation of your model's performance and avoid overfitting.
1. Min-Max Scaling:
o Scales features to a specific range, typically between 0 and
1.
o Formula:
o scaled_feature = (feature - min(feature)) / (max(feature) -
min(feature))
2. Standardization (Z-score Scaling):
o Scales features to have a mean of 0 and a standard
deviation of 1.
o Formula:
o scaled_feature = (feature - mean(feature)) / std(feature)
3. Robust Scaling:
o Similar to standardization but uses the median and
interquartile range instead of mean and standard deviation.
o Less sensitive to outliers.
When to Use Which Technique?
Example:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

data = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100.0 is an outlier

# Min-Max Scaling
scaled_data = MinMaxScaler().fit_transform(data)

# Standardization
scaled_data = StandardScaler().fit_transform(data)

# Robust Scaling (median/IQR, so the outlier has little influence)
scaled_data = RobustScaler().fit_transform(data)
Note: Fit the scaler on the training data only, then use that same
fitted scaler to transform the test data. Fitting a separate scaler on
the test set (or on the full dataset) leaks information and produces
inconsistent scales.
Reduced Overfitting: Removing irrelevant features can help
prevent overfitting.
Interpretability: A smaller set of features can make the model
easier to understand and explain.
1. Filter Methods:
o Correlation Analysis: Measure the correlation between
features and the target variable.
o Chi-Square Test: For categorical features and categorical
targets.
o ANOVA (Analysis of Variance): For numerical features
and categorical targets.
o Mutual Information: A general-purpose measure of
dependence between two variables.
2. Wrapper Methods:
o Forward Selection: Start with an empty set of features
and add one feature at a time based on performance.
o Backward Elimination: Start with all features and remove
one feature at a time based on performance.
o Recursive Feature Elimination (RFE): Rank features
based on their importance and recursively eliminate the least
important ones.
3. Embedded Methods:
o Regularization: Techniques like L1 (Lasso) and L2 (Ridge)
regularization can automatically select features by penalizing
large coefficients.
o Tree-Based Methods: Decision trees and random forests
can identify important features based on their information
gain or Gini impurity.
Considerations:
Computational Cost: Some feature selection methods can be
computationally expensive, especially for large datasets.
Example:
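A minimal sketch of a filter method, assuming scikit-learn's SelectKBest with the ANOVA F-test score (the iris dataset is a stand-in chosen for the example):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest ANOVA F-scores
# (numerical features, categorical target)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
# X had 4 features; X_selected keeps the 2 most informative
```

Swapping the score function for chi2 or mutual_info_classif covers the other filter criteria listed above.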
2. Mean Decrease in Accuracy (MDA):
o Randomly permutes the values of a feature in the out-of-bag
(OOB) data and measures the decrease in accuracy.
o A higher MDA value indicates a more important feature.
Python Example:
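A minimal sketch of the permutation idea behind MDA. Note one assumption: scikit-learn's permutation_importance shuffles features on a held-out test set rather than on the out-of-bag data, but the principle (permute a feature, measure the drop in accuracy) is the same:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the
# average decrease in accuracy over n_repeats shuffles
result = permutation_importance(forest, X_test, y_test,
                                n_repeats=10, random_state=42)
# result.importances_mean holds one score per feature;
# higher values indicate more important features
```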
Contextual understanding: Consider the domain knowledge
and the nature of the features to interpret their importance in the
context of the problem.
Dimensionality Reduction:
Dimensionality reduction is a technique used to reduce the number
of features (dimensions) in a dataset while preserving the most
important information. This can be beneficial for several reasons:
4. Autoencoders:
o Neural networks trained to reconstruct the input data.
o Can be used for both dimensionality reduction and feature
learning.
5. Feature Selection:
o Selecting a subset of features based on their importance or
relevance.
o Can be considered a form of dimensionality reduction.
Example:
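As a sketch of point 5, feature selection as a form of dimensionality reduction: scikit-learn's VarianceThreshold drops uninformative features, shrinking the feature space (the toy matrix is invented for the example):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# 5 features; columns 2 and 4 are constant, so they carry no information
X = np.array([[1.0, 2.0, 0.5, 4.0, 7.0],
              [2.0, 1.0, 0.5, 5.0, 7.0],
              [3.0, 4.0, 0.5, 6.0, 7.0]])

# Drop features whose variance is below the threshold (default 0.0)
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X)
# Constant columns removed: 5 features -> 3
```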
Unsupervised Dimensionality Reduction via PCA
Key Concepts:
Benefits of PCA:
Choosing the Number of Components:
Example:
from sklearn.decomposition import PCA
pca = PCA(n_components=2).fit(X)  # fit on the training data
X_reduced = pca.transform(X)
Advantages of LDA:
Disadvantages of LDA:
The choice between PCA and LDA depends on the specific task and the
characteristics of the data.
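For contrast with PCA, a minimal LDA sketch using scikit-learn (the iris dataset is a stand-in chosen for the example):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA is supervised: it uses the class labels y to find directions
# that maximize class separability (at most n_classes - 1 components)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
# 4 features -> 2 discriminant components
```

Unlike PCA's fit(X), LDA's fit requires y, which is exactly the supervised/unsupervised distinction behind the choice discussed above.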