Machine Learning Notes
- Tejas Bankar
• Data Science is the process of extracting knowledge and insights from raw data by using scientific tools and methods.
• Scientific methods include:
Statistics
Machine Learning
Deep Learning
Visualization
Neural Network
Natural Language Processing
Time Series
• Machine Learning is a combination of statistics and computer science in which we use past data to answer questions.
• We train different statistical models on past data, using computer science to find the patterns and rules with which we can predict the answer for unseen data.
Supervised ML:
• When a model is trained on a labelled dataset, it is called supervised learning.
• Supervised learning is categorized into two types:
1. Regression : when the output variable contains continuous data.
2. Classification : when the output variable contains categorical data.
• Types of supervised ML algorithms:
1) Linear Regression
2) Logistic Regression
3) KNN Classifier, KNN Regressor
4) Naive Bayes Classifier
5) Decision Trees
6) Support Vector Machine ( Regression, Classification )
7) Random Forest
8) Ada Boost
9) Gradient Boosting (GBM)
10) XGBoost
Unsupervised ML :
• When a model is trained on an unlabeled dataset, it is called unsupervised learning.
• Unsupervised learning is categorized into two categories:
1. Clustering : when we want to discover the inherent grouping in the data.
2. Association : when we want to discover rules that describe large portions of the data.
• Types of Unsupervised ML algorithm:
1) Hierarchical Clustering
2) K-means clustering
3) DBSCAN
4) Principal Component Analysis
Statistical Concepts
Mean
• It is the average of all the values: the sum of all the values divided by the total number of values.
Median
• The median is the middle value in a sorted list of numbers. If the number of values is odd, the median is the middle number. If even, it is the average of the two middle
numbers.
Mode
• Mode is the most frequently appearing value in the data.
Standard Deviation
• Standard deviation is a measure of the variation or dispersion in a set of values.
Variance
• Variance is the average of the squared differences from the mean.
• Variance is also a measure of spread in the data, like standard deviation. It gives a sense of the overall spread in the data.
• The difference between variance and standard deviation is that variance is expressed in the squared unit of the original data. For example, if the data is in meters then the variance will be in square meters. Standard deviation, on the other hand, is expressed in the same unit as the data, so it is easier to interpret than variance.
Covariance
• Covariance is the measure of the directional relationship or joint variability between two random variables.
• Cov(x, y) = sum( (xi - x_mean) * (yi - y_mean) ) / n, where xi and yi are the values of the two variables, x_mean and y_mean are the means of the x and y variables respectively, and n = number of datapoints.
Correlation Coefficient
• The coefficient of correlation is the covariance scaled by the standard deviations: r = Cov(x, y) / (σx * σy)
• The coefficient of correlation ranges from -1 to +1. If the correlation coefficient is close to +1 there is a strong positive linear relationship between the two variables; if it is close to -1 there is a strong negative linear relationship. If the correlation coefficient is between -0.3 and +0.3 it is a weak relationship. If the value is 0, it means there is no linear relationship.
• It helps in understanding how one variable changes with respect to another variable. For example, a positive correlation indicates that as one variable increases, the other also tends to increase.
Normal Distribution
• Normal distribution is a continuous probability distribution which is symmetrical around its mean; most of the observations are clustered around the central peak, and the probabilities for values further away from the mean taper off equally on both sides.
• It is also known as the Gaussian distribution.
• The shape of the curve is called a bell-shaped curve.
• The empirical rule for the normal distribution is that 68% of observations fall within 1 standard deviation of the mean, 95% of observations fall within 2 standard deviations of the mean, and 99.7% of observations fall within 3 standard deviations of the mean.
Skewness :
• If there is a distortion or asymmetry in the data, it deviates from the symmetrical bell-shaped curve. This is called skewness.
• There are two types of skewness:
1. Positive skewness / right skewness :
• When the tail of the distribution is longer towards the right-hand side of the curve, it is positive skewness.
• Example: if we take the age of India's population it will show a positively skewed curve, because the younger population is larger than the senior population.
2. Negative skewness / left skewness :
• When the tail of the distribution is longer towards the left-hand side of the curve, it is negative skewness.
Hypothesis Testing
• Hypothesis testing is a statistical tool which is used to confirm our observation, assumption, or hypothesis about a population using sample data.
• First we study a sample of the data and assume that the observation made on the sample also holds for the population.
• So, using hypothesis testing, we can determine whether we have enough statistical evidence to conclude whether the hypothesis about the population is true or not.
• There are two hypothesis statements:
1. Null Hypothesis (H0) :
• It states that there is no significant difference between the sample and population variables.
• It treats everything as the same or equal.
2. Alternative Hypothesis (H1) :
• It states that there is a significant difference; it is the statement we accept when the null hypothesis is rejected.
• Type I Error :
• If we reject the null hypothesis when it is actually true.
• Type II Error :
• If we accept the null hypothesis when it is actually false.
• T-Test : used to compare the mean of a sample with the population mean (or the means of two groups) | H0 : the means are the same
• It assumes that the data is normally distributed.
• It is used when the population standard deviation is not known.
• It is used when the sample size is less than 30.
• from scipy.stats import ttest_1samp
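• A minimal usage sketch of ttest_1samp (the sample values below are made-up illustration data, and 50 is a hypothetical population mean):
from scipy.stats import ttest_1samp
sample = [52, 49, 51, 48, 50, 53, 47, 50, 49, 51]   # hypothetical sample values
stat, p_value = ttest_1samp(sample, popmean=50)      # H0: the sample comes from a population with mean 50
print(stat, p_value)                                  # reject H0 if p_value < 0.05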
• z-Test : used to compare the mean of a sample with the population mean | H0 : the means are the same
• It is used when the sample size is more than 30.
• It is used when the population standard deviation is known.
• Residuals should follow a normal distribution. We can check this assumption using a Q-Q plot, histogram, KDE plot, or hypothesis tests such as the Shapiro test, KS test, and normality test.
If not : there may be a problem with the stability and reliability of the model on unseen data. The model will give random errors, so to have generalized performance of the model the residuals should be normally distributed.
• There should be homoscedastic behaviour in the model, meaning the variance of the residuals should be constant along the values of the dependent variable. We can check this assumption using a scatter plot of the residuals against the dependent (or predicted) variable.
If not : the model will not give constant error over the range of the dependent variable. So to have generalized performance of the model, homoscedasticity should be present. A transformation can be applied if homoscedasticity is not present.
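• A minimal sketch of both checks; the y_true / y_pred arrays here are synthetic stand-ins for a fitted regression model's test targets and predictions:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import shapiro
rng = np.random.default_rng(0)
y_true = rng.normal(100, 10, 200)                 # placeholder targets
y_pred = y_true + rng.normal(0, 5, 200)           # placeholder predictions
residuals = y_true - y_pred
stat, p_value = shapiro(residuals)                # H0: residuals are normally distributed
print('Shapiro p-value:', p_value)                # p > 0.05 -> do not reject normality
plt.scatter(y_pred, residuals)                    # homoscedasticity: points should form a flat band
plt.axhline(0, color='red')
plt.xlabel('Predicted'); plt.ylabel('Residuals')
plt.show()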
r2 score :
• The r2 score is also called the coefficient of determination; it is a measure that indicates how close the datapoints are fitted to the regression line. The formula for the r2 score is: R² = 1 - (sum of squared residuals / total sum of squares) = 1 - Σ(yi - yi_pred)² / Σ(yi - y_mean)².
Drawbacks of r2 score :
• The r2 value never decreases when features are added. It stays the same or increases even if we add non-correlated features.
Adjusted r2 :
• Adjusted r2 is a modified version of r2 that has been adjusted for the number of predictors in the model.
• Adjusted r2 increases only when correlated (useful) features are added.
• Adjusted r2 will not increase when non-correlated features are added.
• Adjusted r2 = 1 - [ (1 - R²)(n - 1) / (n - p - 1) ], where n = number of samples and p = number of predictors.
VIF :
• VIF is the variance inflation factor, which is used to check multicollinearity.
• It gives the strength of correlation between one independent variable and the other independent variables.
• Its range is from 1 to infinity.
• Its formula is : VIF = 1 / (1 - R²), where R² is obtained by regressing that independent variable on all the other independent variables.
• Therefore when R² is 0, VIF is 1, which means there is no correlation between the independent variables.
• If R² is 0.8 then VIF is 5, and if R² is 0.9 then VIF is 10, so whether to use a VIF cutoff of 5 or 10 depends on the problem complexity. Generally we select 10.
• If we find VIF > 10 in two or more features then we have to drop one of the features. We drop the feature having the highest VIF; it is an iterative process repeated until all features have VIF < 10.
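• A minimal sketch of computing VIF with statsmodels; the DataFrame X and its columns f1/f2/f3 are made-up illustration data (f3 is deliberately correlated with f1):
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
rng = np.random.default_rng(1)
X = pd.DataFrame({'f1': rng.normal(size=100), 'f2': rng.normal(size=100)})
X['f3'] = X['f1'] * 0.9 + rng.normal(scale=0.1, size=100)     # highly correlated with f1
X_const = sm.add_constant(X)                                   # VIF calculation needs an intercept column
vif = pd.Series([variance_inflation_factor(X_const.values, i)
                 for i in range(1, X_const.shape[1])], index=X.columns)
print(vif)     # drop the feature with the highest VIF if it exceeds the chosen cutoff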
Logistic Regression
• It is a supervised learning algorithm used for classification problems. It calculates the probabilities of the classes of the target variable to make a prediction. By default it is a binary classifier.
• There are two types of logistic regression: binomial, where the target variable has only two classes, and multinomial or multiclass, where the target variable has multiple classes.
• Logistic regression uses the sigmoid function to create a sigmoid curve onto which probabilities are mapped, and from which we predict which class an observation will fall into; for that we use a threshold value, which by default is 0.5.
• The sigmoid function is a transformation of the linear regression function which estimates the probability of a class.
• The sigmoid function is : p = 1 / (1 + e^(-z)), where z = m*x + c is the output of the linear function.
• Gradient descent takes the partial derivative of the cost function with respect to m and c, finds new values of the coefficients at each iteration, and at the global minimum we get the values of the coefficients which give the best sigmoid curve.
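• A minimal sketch of fitting a logistic regression with scikit-learn on a synthetic toy dataset:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=500, n_features=5, random_state=42)   # toy binary data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]       # sigmoid probabilities of the positive class
preds = (probs >= 0.5).astype(int)              # default threshold of 0.5
print(model.score(X_test, y_test))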
Advantages :
1. Easy to implement
2. Less likely to overfit
3. Performs well on linearly separable datasets.
4. Performs well on simple datasets ( fewer features )
Disadvantages :
1. When the independent variables are highly correlated with each other, it may affect the performance of the model.
2. If there is no linear relationship between the independent variables and the log-odds, it may affect the performance of the model.
3. It is highly sensitive to outliers.
4. Linearly separable datasets are rarely available in the real world.
Confusion Matrix :
• The confusion matrix is an evaluation metric for a classifier.
• It has 4 elements :
1. True Positive : the actual value is positive and the predicted value is also positive
2. True Negative : the actual value is negative and the predicted value is also negative
3. False Positive : the actual value is negative but the predicted value is positive
4. False Negative : the actual value is positive but the predicted value is negative
TPR :
• It is the proportion of the positive class that is correctly classified by the classifier.
e.g. Out of 100 COVID+ patients, 80 are correctly classified as COVID+ by the classifier.
TNR / Specificity :
• It is the proportion of the negative class that is correctly classified by the classifier.
e.g. Out of 100 non-COVID persons, 70 are correctly classified as non-COVID by the classifier.
FPR :
• It is the proportion of the negative class that is incorrectly classified by the classifier.
e.g. Out of 100 COVID negative persons, 20 are incorrectly classified as COVID +ve by the classifier.
Accuracy Score :
• The accuracy score measures how correctly the classifier is predicting.
• It is the ratio of correct predictions to the total number of predictions.
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Getting a good accuracy_score does not necessarily mean that the model is performing well.
• accuracy_score is a good metric when we are dealing with balanced data, but most of the time the target classes in the data are imbalanced, and on such data accuracy_score is not a good metric to use.
• e.g. if we have 95 non-spam mails and 5 spam mails and we train a model on this imbalanced data, the model will get biased towards the majority class. On test data which is imbalanced in the same proportion, we will definitely get a good accuracy_score because the model performs well on the majority class, but the score will not magnify the false predictions it made on the minority class. So we can't just rely on the accuracy score; we need to use other metrics like precision, recall, and f1 score to understand model performance well according to our use case.
Precision :
• It tells us, out of the total predicted positive values, how many are actually positive.
• Precision = TP / (TP + FP)
Increasing the precision might decrease the recall, i.e. decreasing the FP might increase the FN.
Recall :
• It is the proportion of the positive class that is correctly classified by the classifier.
• It tells us, out of the total actual positive values, how many are predicted as positive.
• Recall = TP / (TP + FN)
f1 score :
• f1 score combines precision and recall into a single metric.
• f1 score is the harmonic mean of precision and recall.
• f1 = 2 * (Precision * Recall) / (Precision + Recall)
• The regular mean treats all values equally, but the harmonic mean gives much more weight to low values. As a result the classifier will only get a high f1 score if both precision and recall are high.
• Hence the f1 score is important and gives a clearer evaluation when we want both precision and recall to be high.
• Suppose we have precision = 0.6 and recall = 0.95; the simple mean is 0.775 but the harmonic mean is 0.735, so the harmonic mean represents the importance of both precision and recall. It will be high only if both precision and recall are high.
Classification Report :
• The classification report consists of multiple metrics such as accuracy_score, precision, recall, and f1 score. It is a very useful report to evaluate model performance.
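• A minimal sketch of these metrics with sklearn; the y_test / y_pred arrays here are made-up labels standing in for a fitted classifier's output:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]      # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]      # hypothetical predictions
print(confusion_matrix(y_test, y_pred))       # layout: [[TN, FP], [FN, TP]]
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred),
      recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))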
ROC AUC :
• ROC_AUC is an evaluation metric used for binary classification.
• The Receiver Operating Characteristic (ROC) is a probability curve that plots the TPR (True Positive Rate) against the FPR (False Positive Rate) at various threshold values.
• FPR is on the x axis and TPR is on the y axis.
• The Area Under the Curve (AUC) is a measure of the ability of a classifier to distinguish between classes; it summarizes the ROC graph with a single number. It ranges from 0 to 1. The greater the AUC, the better the performance of the model.
• A perfect classifier will have AUC = 1, which means a high value of TPR and a low value of FPR.
• When AUC = 0.5 the classifier is random, so generally in real life AUC >= 0.7 is considered good.
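• A minimal sketch of the ROC curve and AUC with sklearn; y_test and the positive-class probabilities y_prob are made-up values standing in for a fitted binary classifier's output:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
y_test = [0, 0, 1, 1, 0, 1, 1, 0]                        # hypothetical true labels
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]       # hypothetical predicted probabilities
fpr, tpr, thresholds = roc_curve(y_test, y_prob)          # TPR vs FPR at each threshold
print('AUC:', roc_auc_score(y_test, y_prob))
plt.plot(fpr, tpr); plt.xlabel('FPR'); plt.ylabel('TPR'); plt.show()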
Decision Tree
• DT is a supervised learning algorithm which can be used for both classification and regression problems. It is mostly preferred for classification problems.
• It gives a tree-like structure with 3 main parts: first is the root node or decision node, which represents the feature that is going to be split further; second is the branches, which represent the decision rules on whose basis the node is split; and third is the leaf nodes, which represent the outcomes.
• To build a DT there are 2 methods, ID3 and CART.
• In ID3 we find the Information Gain for all features. IG is the measure of the reduction in entropy, and entropy is the measure of impurity in a node.
• Then whichever feature has the highest Information Gain is split first and becomes the decision node (in CART, the split with the lowest Gini impurity is chosen).
• The difference between entropy and Gini is that the time complexity is higher when we use entropy, as it uses a log term, while Gini uses simple arithmetic. Entropy ranges from 0 to 1 while Gini ranges from 0 to 0.5. For a pure node both entropy and Gini are 0, and for a totally impure node entropy is 1 and Gini is 0.5.
• For regression it takes the best threshold value as the decision splitter. It splits the target variable into two regions and the predicted value is the mean of the particular region.
• Initially it takes the mean of 2 consecutive rows as a threshold value and splits the target variable into two regions; then it finds the MSE for both regions and sums them as the MSE for that threshold. This step is repeated till the last row, and whichever threshold has the lowest MSE is the best threshold and becomes the decision node or splitter.
• A decision tree is prone to overfitting, so to reduce overfitting we can use hyperparameter tuning; there are multiple hyperparameters like criterion, splitter, max_depth, min_samples_split, min_samples_leaf.
• Another way to reduce overfitting of a DT is pruning. Pruning is a technique used to reduce the size of the tree by removing unwanted, non-critical sections of the tree. It reduces the complexity of the model and slightly increases the training error but drastically reduces the testing error.
• There are two types of pruning. The first is pre-pruning, also called forward pruning; it stops the non-significant branches from being generated, and hyperparameter tuning (max_depth, min_samples_split, min_samples_leaf) is used to achieve this. The second is post-pruning, also called backward pruning; here the tree is generated first and then the non-significant branches are removed, using the cost complexity pruning technique controlled by ccp_alpha. The higher the value of ccp_alpha, the more nodes are pruned.
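• A minimal sketch of pre- and post-pruning with sklearn on a toy dataset; the ccp_alpha value 0.02 is an arbitrary illustrative choice:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)     # pre-pruning via hyperparameters
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)     # post-pruning via cost complexity
# candidate alphas can be inspected with DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train)
print(pre.score(X_test, y_test), post.score(X_test, y_test))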
Random Forest
• Random Forest is an extension of the bagging method used for both classification and regression problems. It uses bagging and random feature selection to create an uncorrelated forest of decision trees.
• Bagging is an ensemble technique also called bootstrap aggregation. Here bootstrapping means creating different subsets of the training dataset by randomly selecting samples with replacement, which means an individual datapoint can be selected multiple times.
• Models are then trained independently on these bootstrap samples.
• Aggregation means the outputs or predictions of all the models are aggregated together to get the best optimal prediction. In regression the mean of all the outputs is taken, and in classification there are 2 methods, soft voting and hard voting: in soft voting we take the mean of the predicted probabilities of each class and whichever class has the highest mean is the predicted class, and in hard voting we take the majority vote of all the outputs.
• The 2 most important hyperparameters in RF are n_estimators, which represents the number of DTs to be trained, and max_features, which represents the maximum number of features to consider when looking for the best split.
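• A minimal sketch of a random forest using these two hyperparameters on a synthetic toy dataset:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=0)   # 200 trees, random feature subsets
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))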
Advantages :
• It is less likely to overfit than a DT.
• It works well without tuning.
Disadvantages :
• Time complexity is higher.
• The Random Forest model may change considerably with a small change in the data.
• Sometimes it fails to determine the significance of each feature.
AdaBoost
Advantages :
• The accuracy of a weak classifier can be improved by using AdaBoost.
Disadvantages :
• It is sensitive to outliers and needs a quality dataset.
Gradient Boosting
• GB is a supervised machine learning algorithm and a boosting method from the ensemble family. It is one of the most popular algorithms; we can use it for regression as well as classification problems.
• GB builds models sequentially, and each new model tries to reduce the residuals of the previous model.
• There are 3 main components of GB:
1. Loss Function : It depends on the problem we are solving; for regression there are MSE, RMSE etc., and for classification there are LogLoss, Hinge Loss etc. The condition is that the loss function should be differentiable.
2. Weak Learner : weak learners are the individual models, shallow decision trees, which means highly pruned DTs with a maximum number of leaves between 8 and 32.
3. Additive Model : It is a sequential model, and at each iteration the loss function should be reduced.
GB Regressor :
• Initially we create a base model which has a constant output or predicted value F0. We find this constant in such a way that the loss is minimum; for that we take the first-order derivative of the loss function with respect to the predicted value. In regression the loss function is simply the squared residual, L = (1/2) * (y - y_pred)^2, and the minimizing constant turns out to be the mean of the target values.
• Now we have to find the pseudo-residuals of this previous model; the formula is the negative derivative of the loss function with respect to the predicted value, r = -(dL / dy_pred) = y - y_pred, where y is the actual value and y_pred is the predicted value.
• Now we create the next model, and for that the output or dependent variable is the residual of the previous model, so we are predicting the residuals of the previous model. Again it finds the constant predicted value for this model in such a way that the loss is minimum.
• This process of finding constant predicted values and residuals is repeated until the residuals are minimized, so that we get less variance and good results. By default it takes 100 DTs, and the final output is F(x) = F0 + sum of (learning rate * residual prediction of each model), i.e. the prediction of the base model plus the summation of lambda * residual for each model.
• In GB the models are distinct shallow DTs, i.e. highly pruned trees which are dissimilar for each model, so that the speed increases.
GB Classifier :
• It is used when the target variable is binary.
• All the steps are similar to regression; only the loss function in classification is the log-likelihood, also called the log loss function, which we also use in logistic regression.
• To initialize the base model we have to find a constant value; for that we use log(odds), because when we differentiate the loss function we get a function of log(odds).
• After transforming the loss function the equation becomes : L = -y * log(odds) + log(1 + e^log(odds))
• log(odds) gives us the probability of the class through the sigmoid transformation p = 1 / (1 + e^(-log(odds))).
• Now we have to find the value of log(odds) for which the loss function is minimum. For that we take the derivative of this loss function with respect to log(odds) and set it equal to 0, which gives us a probability.
• Now we find the residual between the actual probability and the predicted probability, and similar to regression we predict these residuals in the next model. Again it finds the constant predicted value of log(odds) for this model in such a way that the loss is minimum.
• These steps are repeated until the residuals are minimized.
Hyperparameters in GB are:
1. n_estimators
2. Learning Rate
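• A minimal sketch of a gradient boosting classifier using these two hyperparameters on a synthetic toy dataset:
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=1)   # 100 shallow trees
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))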
Advantages of GB:
1. The performance of GB is very good.
2. Little preprocessing is required; many implementations can be used with missing values and outliers as they are.
3. It is a very flexible model; we can use different loss functions, and it provides several hyperparameter tuning options that make the function fit very flexible.
Disadvantages of GB:
1. A small change in the training dataset can create a radical change in the model.
2. The predictions are not easy to interpret.
3. Gradient boosting models keep improving to minimize all errors. This can overemphasize outliers and cause overfitting.
4. Computationally expensive - it often requires many trees (>1000), which can be time and memory exhaustive.
Support Vector Machine (SVM)
Advantages :
• SVM works relatively well when there is a clear margin of separation between classes.
• SVM is more effective in high dimensional spaces.
• SVM is effective in cases where the number of dimensions is greater than the number of samples.
• SVM is relatively memory efficient.
Disadvantages :
• The SVM algorithm is not suitable for large datasets.
• SVM does not perform well when classes are overlapping.
Naive Bayes
• Naive Bayes is a supervised learning algorithm which is based on Bayes' theorem and used for solving classification problems.
• Naive means all the features are assumed to be independent of each other, and Bayes refers to Bayes' theorem.
• Bayes' theorem is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability.
• Using Bayes' theorem we can find the posterior probability, which is the probability of event 'A' happening given that event 'B' has occurred; 'A' is the hypothesis and 'B' is the evidence. The formula for the posterior probability is P(A|B) = P(B|A) * P(A) / P(B), i.e. the likelihood times the prior probability divided by the marginal probability.
• There are 3 variants of Naive Bayes: 1st Gaussian NB, which can be used when there are continuous variables; 2nd Multinomial NB, which can be used when there are discrete (count) variables; 3rd Bernoulli NB, which can be used when there are binary variables.
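• A minimal sketch of Gaussian Naive Bayes on a toy continuous-feature dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_test, y_test))     # MultinomialNB / BernoulliNB follow the same fit/predict API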
Applications :
• NB is mainly used in text classification, which involves high dimensional training datasets.
Advantages :
• It is one of the fastest and most effective algorithms used for classification.
• It performs well on high dimensional datasets.
Disadvantage :
• It assumes that all the variables are independent of each other.
Ensemble Techniques
• Ensemble methods combine multiple models to solve a complex problem and get better predictions.
• There are multiple methods in ensemble learning like Bagging, Boosting, Voting, and Stacking.
• Bagging is a parallel approach and boosting is a sequential approach.
• In bagging we create multiple bootstrap datasets on which we train models independently and then aggregate the outputs of all the models together to get the optimum prediction.
• In boosting we sequentially convert weak learners into a strong learner by reducing the error at each iteration.
• In the voting ensemble method we train multiple models on the same data independently and then aggregate the outputs of all the models together to get the final output.
• In stacking we train multiple base models, and using the outputs of these models we train a meta model which gives us the final output.
Voting Ensemble
• In the voting ensemble method we train multiple models on the same data independently and then aggregate the outputs of all the models together to get the final output.
VotingClassifier :
• In classification there are two types of voting.
1. Hard Voting : in hard voting we take the majority vote of the predicted classes.
2. Soft Voting : in soft voting we take the mean of the predicted probabilities of each class, and whichever class has the highest mean probability is the output class.
VotingRegressor :
• In regression we simply take the mean of all the models' outputs.
• Voting ensembles are a very important concept for tuning and getting better results. Sometimes a single model can give better results than a voting ensemble, but sometimes a voting ensemble can perform very well by combining multiple independent models.
• It's all about experimentation.
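• A minimal sketch of hard and soft voting with sklearn on a synthetic toy dataset:
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
estimators = [('lr', LogisticRegression()), ('dt', DecisionTreeClassifier()), ('nb', GaussianNB())]
soft = VotingClassifier(estimators, voting='soft').fit(X_train, y_train)   # mean of predicted probabilities
hard = VotingClassifier(estimators, voting='hard').fit(X_train, y_train)   # majority vote of predicted classes
print(soft.score(X_test, y_test), hard.score(X_test, y_test))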
K-Fold method :
• In the K-fold method, first we split the data into 2 parts (train and test).
• Then we create K folds of the training data.
• Then we train each base model K times on different combinations of the K folds and make out-of-fold predictions accordingly. We do this for multiple base models, and we use the predictions of these models as the input data for the meta model.
• Then we train our meta model.
• Now we forget the previous K folds and base models, and we train our base models on the whole training data.
• So here we first get the meta model and then the base models.
• Now we have the base models and the meta model, and we test them on the test data.
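• sklearn's StackingClassifier follows a similar K-fold recipe internally; a minimal sketch on a synthetic toy dataset:
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
base = [('rf', RandomForestClassifier(n_estimators=50, random_state=0)), ('nb', GaussianNB())]
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)           # cv=5 produces the out-of-fold predictions used to train the meta model
print(stack.score(X_test, y_test))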
Unsupervised ML
K-means clustering
• It is an unsupervised learning algorithm used to create clusters or groups of similar instances or data points.
• In K-means clustering, K clusters are created in the dataset and every data point is allocated to the nearest cluster.
• Initially it takes K centroids at random positions which are used as starting points for each cluster. Then it creates clusters by assigning each datapoint to its nearest centroid, measured using Euclidean distance.
• Then it finds the mean of all data points of a particular cluster and updates the centroid to that mean position.
• These steps are repeated until the centroids stabilize in one place, and then we get good clusters such that the sum of the distances between each centroid and the datapoints of its cluster is minimum.
• To find the best number of clusters we can use the elbow method, in which we find the WCSS for a range of K values. WCSS is the Within-Cluster Sum of Squares: the summation of the squared distances between each datapoint and its centroid, over all clusters.
• Then we plot a graph of WCSS against the K values, and the point where the last sharp bend is found is considered the best value of K.
• Disadvantage :
• We have to predefine the number of clusters, and it always tries to create clusters of similar size.
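• A minimal sketch of K-means and the elbow method with sklearn on synthetic blob data (inertia_ is the WCSS):
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data with 4 true clusters
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)           # within-cluster sum of squares for this k
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('K'); plt.ylabel('WCSS'); plt.show()   # look for the last sharp bend (elbow)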
Hierarchical clustering
• Hierarchical clustering is an unsupervised learning algorithm which is used for clustering.
• In this algorithm we create a hierarchical series of clusters which looks like a tree; this is also called a dendrogram.
• To create this hierarchy of clusters there are two methods, agglomerative and divisive.
• Agglomerative hierarchical clustering is a bottom-up approach, in which the algorithm starts by taking all data points as separate clusters and then keeps combining the closest pair of clusters until only one cluster is left.
• Divisive hierarchical clustering is a top-down approach; it is the opposite of agglomerative hierarchical clustering. In this we start with only one cluster which contains all the datapoints, and then at each iteration we split the farthest points out of the cluster and repeat this process until each cluster contains only a single data point.
• To measure the distance between two clusters there are various methods, which are also called linkage methods:
- Single linkage : the shortest distance between the closest points of two clusters.
- Complete linkage : the farthest distance between two points of two clusters. It is one of the popular linkage methods as it forms tighter clusters than single linkage.
- Average linkage : the linkage method in which the distances between each pair of data points (one from each cluster) are added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.
- Centroid linkage : the distance between the centroids of the clusters.
• Now, using this dendrogram, to find the optimal number of clusters for our model we find the longest vertical distance that does not cut any horizontal bar and draw a threshold line through it. The number of clusters is the number of vertical lines intersected by the line drawn using the threshold.
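• A minimal sketch of a dendrogram and agglomerative clustering using scipy and sklearn on synthetic blob data:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
Z = linkage(X, method='ward')          # other linkage options: 'single', 'complete', 'average', 'centroid'
dendrogram(Z)
plt.show()                             # pick the number of clusters from the dendrogram
labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
print(labels[:10])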
Advantages :
• No need to predefine the number of clusters.
Disadvantage :
• It takes more time than K-means clustering because it has to build the hierarchy from all the individual data points up to a single cluster.
K-means and hierarchical clustering both fail to create clusters of arbitrary shapes. They are not able to form clusters based on varying densities. That's why we need DBSCAN clustering.
DBSCAN Clustering
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
• It is a popular unsupervised learning algorithm used for clustering problems.
• It groups densely packed data points into a single cluster.
• In DBSCAN there are 2 main parameters, epsilon and minPoints. Epsilon is the radius of the circle to be created around each data point to check the density, and minPoints is the minimum number of data points required inside that circle.
• DBSCAN creates a circle of epsilon radius around every data point and classifies the points into Core points, Border points, and Noise.
• A data point is a Core point if the circle around it contains at least 'minPoints' data points, which we initially define as a parameter.
• If the number of points inside the circle is less than minPoints but the point lies within the epsilon radius of a Core point, it is classified as a Border point.
• If a data point is neither a Core point nor a Border point, it is considered Noise, which is an outlier and is kept separate from any cluster.
• For locating datapoints it uses Euclidean distance.
• DBSCAN is very sensitive to the values of epsilon and minPoints, therefore it is very important to select the right values, because even a small change in these parameters can significantly affect the result of DBSCAN.
• The value of minPoints should be at least 1 more than the number of dimensions, and at least 3. The value of minPoints is often taken as twice the number of dimensions, but there is no fixed rule; it also depends on the domain and the complexity of the problem.
• The value of epsilon can be chosen from a k-distance graph.
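• A minimal sketch of DBSCAN with sklearn on synthetic non-spherical data; the eps and min_samples values are illustrative and would need tuning on real data:
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)    # arbitrary-shaped clusters
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))      # label -1 marks noise points (outliers)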
Advantages :
• It is robust to outliers
• No need to predefine the number of clusters.
• It can identify clusters in large spatial datasets by looking at the local density of the data points.
silhouette score
• The silhouette score is a metric used to evaluate the goodness of a clustering. Its value ranges from -1 to 1.
• If its value is close to 1 then the clusters are well apart from each other. If the value is close to -1 then datapoints have been assigned to the wrong clusters.
• silhouette score = (b - a) / max(a, b)
where,
a = average intra-cluster distance, i.e. the average distance between a point and all the other points within its own cluster.
b = average inter-cluster distance, i.e. the average distance between a point and all the points in the nearest neighbouring cluster.
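• A minimal sketch of computing the silhouette score for a K-means clustering on synthetic blob data:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))    # closer to 1 means well-separated clusters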
ML Concepts
ML Project Pipeline :
1. Understanding the problem statement and business requirements
2. Gathering the required quality data from relevant sources
3. Understanding the data, all the features, and their impact on the business
4. Cleaning and structuring the data using relevant tools or feature engineering
5. Exploratory Data Analysis (EDA)
6. Feature Selection
7. Model training
8. Model evaluation and hyperparameter tuning
9. Automation of the end-to-end pipeline
10. Deployment
11. Continuous maintenance
Outliers
• An outlier is a datapoint which is far away from the rest of the observations.
Impact of outliers :
• It impacts the mean and std of the data.
• It reduces the power of some algorithms ( distance-based and gradient-based algorithms are sensitive to outliers ).
z-score :
• The z-score is also called the standard score, because its formula is the same as the standard scaler : (X - Xmean) / std.
• It centres the data around its mean with unit std, so it tells us how many standard deviations away a data point is.
• If the z-score is > 3 or < -3, then we consider the point an outlier.
IQR :
• IQR is the inter-quartile range, which can be calculated as Q3 - Q1. It covers the middle 50% of the values.
• Q1 is the first quartile; it is the 25th percentile value of the data.
• Q3 is the third quartile; it is the 75th percentile value of the data.
• We can find these values with the df.describe() method, np.percentile(data, q) with q between 0 and 100, or np.quantile(data, q) with q between 0 and 1.
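• A minimal sketch of IQR-based outlier detection with numpy, using the common 1.5*IQR fence; the data array is made-up:
import numpy as np
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])   # 102 is an obvious outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # 1.5*IQR fences
outliers = data[(data < lower) | (data > upper)]
print(outliers)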
Mean/Median replacement :
• In this method the outlier is replaced by the mean or median.
• If the data is continuous, then the mean of the data excluding the outliers can be used for replacement.
• If the data is discrete, then the median can be used.
Dropping outliers:
• This should be the last preference, because data is the fuel for ML.
Missing Values
• To check for missing values we can use the df.isna().sum() function.
mean/median/mode imputation :
• We can impute missing values with the mean in the case of continuous data, the median in the case of discrete data, and the mode in the case of categorical data. We can also customize this imputation method as per the nature of the data and the business.
• Suppose our target variable contains the binary classes 0 and 1, and there is an independent feature which contains missing values whose mean or median or mode is different for the data having target class 0 and the data having target class 1. In this case imputing all missing values with a single mean/median/mode of all the data is not a good strategy, so we can take the mean/median/mode of the specific data with respect to the target classes. Similarly, if the target variable is continuous, then we can separate the independent variables with respect to some specific range of the target value as per the business complexity.
IterativeImputer :
• It is a multivariate method in which 2 or more features are considered for imputing the missing values in one feature.
• It is imported from sklearn.impute.
• It uses all features as independent features except the feature in which we have to impute missing values; the feature which contains the missing values acts as the target variable.
• For all the rows where this target feature does not have a missing value, sklearn's IterativeImputer fits a regression model and then makes predictions where the values are missing. These predicted values become the imputed values.
KNNImputer :
• It uses the KNN algorithm at the backend and uses Euclidean distance to find the nearest neighbours.
• Suppose there are 3 features and 1 feature has a missing value; it finds the rows with the closest values for the other two features, and then based on these neighbours it imputes the value for the missing place.
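• A minimal sketch of both imputers on a tiny made-up array (note that IterativeImputer still requires the experimental enable import in sklearn):
import numpy as np
from sklearn.experimental import enable_iterative_imputer   # noqa: needed before importing IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])   # toy data with one missing value
print(IterativeImputer(random_state=0).fit_transform(X))            # regression-based imputation
print(KNNImputer(n_neighbors=2).fit_transform(X))                   # nearest-neighbour imputation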
Anomaly detection:
• In some cases, like fraud detection, detecting the outliers is more important than the regular datapoints. This is called anomaly detection.
Encoding
• Encoding is a method of converting categorical features into numeric ones, because machines only understand numbers.
Types of encoding :
One Hot Encoding :
• It is used for nominal categorical variables.
• It splits the categories into different columns and puts 0 if the category is not in the row and 1 if it is present.
• We can implement this method by using pd.get_dummies() or OneHotEncoder.
• The disadvantage of this method is that if there are a lot of categories in a variable, it will lead to the curse of dimensionality and can affect the performance of the model.
Label Encoding :
• This method is used for ordinal categorical data.
• It assigns a label to each category, starting from 0.
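• A minimal sketch of one-hot and ordinal/label encoding with pandas; the DataFrame and its columns are made-up illustration data:
import pandas as pd
df = pd.DataFrame({'city': ['Pune', 'Mumbai', 'Pune'],          # nominal feature -> one-hot
                   'size': ['small', 'large', 'medium']})       # ordinal feature -> ordered labels
encoded = pd.get_dummies(df, columns=['city'])                  # one column per city category
encoded['size'] = df['size'].map({'small': 0, 'medium': 1, 'large': 2})   # preserve the order
print(encoded)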
Feature Scaling
Why is feature scaling important? For which types of models is it important? How do features with different scales impact model performance?
• Feature scaling is required in gradient descent based and distance based algorithms, because :
Gradient Descent Based Models:
• In gradient descent based models, if we have features with different scales, then the step size during convergence of gradient descent will be different for different features and we will not have smooth convergence. So for smooth convergence, and to have a comparable step size for all features, we need to scale all features to the same scale.
• Linear regression and logistic regression are gradient descent based models.
Standardization :
• Standardization is another scaling technique where the values are centred around the mean with a unit standard deviation. This means that the mean of the data becomes
zero and the resultant distribution has a unit standard deviation.
• Its formula is:
X' = (X - u) / sigma
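• A minimal sketch of standardization (with min-max normalization shown alongside for comparison) on a made-up array:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # features on very different scales
print(StandardScaler().fit_transform(X))    # each column centred to mean 0 with unit std
print(MinMaxScaler().fit_transform(X))      # each column rescaled to the [0, 1] range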
Normalization Vs Standardization :
• Normalization is good to use when data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-
Nearest Neighbours and Neural Networks.
• Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Also, unlike
normalization, standardization does not have a bounding range. So, even if you have outliers in your data, they will not be affected by standardization.
• At the end of the day, the choice of using normalization or standardization will depend on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to the raw, normalized, and standardized data and compare the performance for the best results.
Imbalance Data
• When the target class has an uneven distribution of data, it is imbalanced data. Imbalanced data can impact our accuracy.
• Examples of imbalanced data :
Fraud detection
Spam filtering
Disease screening
Resampling technique :
• Oversampling : adding copies of the minority class
• Undersampling : removing samples from the majority class
• Oversampling is the more widely used method, because in undersampling data is lost.
By using the imblearn library we can use different resampling methods.
Random Sampler :
• For oversampling it randomly adds copies of the minority class.
• For undersampling it randomly deletes or removes samples of the majority class.
• from imblearn.over_sampling import RandomOverSampler
• X_rs, y_rs = RandomOverSampler().fit_resample(X, y)
• Similarly for RandomUnderSampler ( from imblearn.under_sampling ).
SMOTETomek :
• It uses oversampling and undersampling at the same time. It does oversampling of the data using SMOTE and undersampling using Tomek links.
• First it does oversampling using SMOTE, and then the undersampling process starts using TomekLinks.
• If two datapoints are mutual closest neighbours and they belong to two different classes, then they form a Tomek link pair. These pairs of samples are often at the ambiguous borderline between classes, where it is easy to get false classifications.
• The idea behind SMOTETomek is to make the training dataset cleaner by removing these Tomek links at the borderlines.
• from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks
resample = SMOTETomek(tomek=TomekLinks(sampling_strategy='all'), random_state=1)
X_rs, y_rs = resample.fit_resample(X, y)
SMOTEENN :
• It also does oversampling and undersampling, like SMOTETomek.
• It uses Wilson's Edited Nearest Neighbour rule (ENN) in the under-sampling step to remove instances of the majority classes.
• For each sample it calculates the K Nearest Neighbours (kNN).
• Then it calculates the ratio of the majority class among the kNN, which is called 'r'.
• If the sample is from the minority class and r is > 0.5, then the majority-class neighbours are removed.
• If the sample is from the majority class and r is < 0.5, then the sample is removed.
• SMOTEENN removes more samples than SMOTETomek.
• from imblearn.combine import SMOTEENN
resample = SMOTEENN(random_state=0)
X_rs, y_rs = resample.fit_resample(X, y)
Similarly, many other variants of SMOTE exist, like BorderlineSMOTE, SVMSMOTE, KMeansSMOTE, and so on...
Feature Selection
• It is a way of selecting a subset of the most relevant features by removing redundant, irrelevant, or noisy features.
• It helps to reduce overfitting, improves accuracy, reduces training and testing time, and also helps in avoiding the curse of dimensionality.
Wrapper Method :
• It uses ML algorithms to find the best subset of features.
• In wrapper methods, we use a subset of features to train a model; based on the results of this model, we decide to add or remove features from the subset.
• These methods are computationally more expensive.
1. Forward Feature Selection :
○ It is an iterative method in which we start with no features in the model.
○ In each iteration, we keep adding the feature which most improves the performance of our model, until the addition of a new feature no longer improves the performance of the model.
2. Backward Elimination :
○ It is the opposite of the forward feature selection method.
○ In this we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. This is repeated until no improvement is observed on removal of a feature.
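• A minimal sketch of forward and backward wrapper selection using sklearn's SequentialFeatureSelector on a synthetic toy dataset:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
forward = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=4, direction='forward').fit(X, y)
backward = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=4, direction='backward').fit(X, y)
print(forward.get_support(), backward.get_support())   # boolean masks of the selected features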
Dimensionality Reduction
Principal Component Analysis (PCA)
• PCA is a dimensionality reduction method used to reduce the dimensions of a large dataset by using an orthogonal transformation. These new transformed features are called Principal Components.
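• A minimal sketch of PCA with sklearn; the features are standardized first, since PCA is sensitive to scale:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)                       # keep the top 2 principal components
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)            # variance captured by each component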