Machine Learning Notes
Lecture 1
Machine Learning Life-Cycle
The machine learning life cycle refers to the series of steps involved in developing and
deploying a machine learning model.
1. Acquiring Data: Gather relevant data required to train and evaluate your model. This
may involve collecting data from various sources, such as databases, APIs, or external
datasets.
2. Data Preprocessing: Prepare and preprocess the acquired data to make it suitable for
training. This involves tasks like data cleaning, handling missing values, outlier
detection, feature selection, feature engineering, and data normalization.
3. Data Analysis/Model Selection: Choose an appropriate machine learning algorithm or
model architecture that is well-suited for your problem. Consider factors such as the
nature of the data, the type of problem (classification, regression, etc.), and the
available computational resources.
4. Training: Use the prepared data to train the selected model. The training process aims
to minimize the difference between the model's predictions and the actual values in the
training data.
5. Model Evaluation/Testing: Assess the performance of the trained model on held-out data.
6. Deployment: Once satisfied with the model's performance, deploy it in a production
environment where it can make predictions on new, unseen data.
Supervised Learning
Supervised learning is a ML approach where the model learns from labeled examples; each example is made up of input data (features) and the corresponding output (the label). This approach is commonly used for classification or regression (predicting a value).
The model’s goal is to learn how to map the features to the label so that it can make
accurate predictions on new, unseen data. The model learns by trying to minimize the
difference (the loss) between its predicted labels and the real labels in the training data.
Ex. A supervised learning model can be trained on a dataset of emails labeled as spam or
not spam. The model learns to categorize the emails based on the contents of the email.
Unsupervised Learning
In unsupervised learning the model tries to identify patterns or relationships within the
data on its own without being given labels. Unsupervised learning is useful for tasks like
data exploration, anomaly detection, and recommendation systems.
Reinforcement Learning
In reinforcement learning the model or agent learns through trial and error: when it takes an action that furthers its goal it is given a reward, and the agent tries to find the best strategy (policy) to maximize its rewards.
Lecture 2
Data Cleaning
Data collected from various sources can contain errors, missing values, outliers, and
inconsistencies. Data cleaning helps improve the quality and reliability of the dataset.
A dataset may contain missing values. These can be dealt with by replacing each missing value with an estimated one (imputation), where the estimate is typically the mean or the mode of that field. Alternatively, any rows containing missing values can be deleted (deletion).
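A minimal sketch of both approaches using pandas (the column name "age" and the toy values are hypothetical examples):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan]})  # toy data with gaps

# Imputation: fill missing values with the column mean (use the mode for categorical fields)
df_imputed = df.fillna({"age": df["age"].mean()})

# Deletion: drop any row that contains a missing value
df_deleted = df.dropna()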
Data Normalization
Data normalization/feature scaling is a technique used to bring different features or
variables to a similar scale or range. This is done to ensure that all features are given
equal importance during the learning process.
Z-Score normalization is a technique that transforms the data to have zero mean and unit variance. It is used when we want to allow for extreme values within the dataset. The formula for Z-Score normalization is:

X_{normalized} = \frac{X - \mu}{\sigma}

Mean normalization rescales the data to a range centred on zero:

X_{normalized} = \frac{X - \mu}{\max(X) - \min(X)}
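A short numpy sketch of both techniques, assuming x is a 1-D feature array with made-up values:

import numpy as np

x = np.array([12.0, 15.0, 9.0, 30.0, 14.0])

z_score = (x - x.mean()) / x.std()                 # zero mean, unit variance
mean_norm = (x - x.mean()) / (x.max() - x.min())   # centred on zero, range-scaled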
Lecture 3
Feature Extraction
Feature extraction is the process of transforming raw data into a format that we can train a
ML model on. Feature extraction is used to select only the relevant features for the model we are training (dimensionality reduction). This allows the model to learn patterns faster and eliminates noise (irrelevant/redundant features) from the data.
Image features can be global, meaning they describe some aspect of the image as a whole
(useful for classification), or they can be local, meaning they describe a region/patch of the
image (useful for object recognition).
Ex. Convolving a 3×3 patch of pixel values with the Sobel_x kernel:

\begin{bmatrix} 152 & 76 & 125 \\ 78 & 85 & 89 \\ 214 & 68 & 200 \end{bmatrix} * Sobel_x \rightarrow \begin{bmatrix} -152 & 0 & 125 \\ -156 & 0 & 178 \\ -214 & 0 & 200 \end{bmatrix} \rightarrow G_x = -19

The element-wise products are summed to give the horizontal gradient G_x at the centre pixel.
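The same computation as a small numpy sketch:

import numpy as np

patch = np.array([[152, 76, 125],
                  [78,  85,  89],
                  [214, 68, 200]], dtype=float)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

gx = np.sum(patch * sobel_x)  # element-wise multiply, then sum -> -19.0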
HOG (Histogram of Oriented Gradients) descriptors are computed in several steps:
1-2. Gradient computation: the horizontal and vertical gradients are computed for every pixel (e.g., with Sobel filters). The gradient magnitudes and directions for each pixel in the image are stored in 2 matrices.
3. Gradient histograms: the image is divided into small overlapping cells (e.g., 8x8 pixels).
The gradient orientations/directions for all the pixels in the cell are then grouped into
bins that cover the entire range of orientations (e.g., 9 bins covering 0 to 180 degrees).
The gradient magnitude is used as the weight of each cast into a bin.
Ex. A pixel with gradient magnitude 122.44 casts a vote, weighted by that magnitude, into the bin matching its gradient direction (bin edges at 0, π/9, 2π/9, π/3, 4π/9, 5π/9, 2π/3, 7π/9, 8π/9).
The result is a histogram for all the pixels within the cell that represents the gradient
orientations within that region.
4. Cell normalization: the histogram for each cell (or group of cells) is normalized to make
the HOG descriptor more robust to changes in lighting and contrast.
5. Descriptor formation: the normalized histograms are concatenated to form one giant 1D
vector, this is the final HOG descriptor for the image and can be used as input in a ML
model.
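A hedged sketch of extracting a HOG descriptor with scikit-image; the parameter values mirror the examples above, and the image path is hypothetical:

from skimage import io, color
from skimage.feature import hog

image = color.rgb2gray(io.imread("example.jpg"))  # hypothetical input image

# 9 orientation bins over 0-180 degrees, 8x8-pixel cells, block normalization
descriptor = hog(image, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), block_norm="L2-Hys")
print(descriptor.shape)  # one long 1-D feature vector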
Local Feature Extraction
Local feature extraction is used for tasks like object recognition, image stitching
(combining multiple images to produce one large image), or structure from motion
(estimating how an object is moving in 3D space using images).
The Difference of Gaussians (DoG) is a method used to detect local features, especially
blob-like structures, in an image. DoG works by taking the difference between two
Gaussian-smoothed/blurred versions of the image.
DoG process:
1. Smoothing: the image is blurred with two Gaussian filters of different standard deviations (σ).
2. Subtraction: one blurred image is subtracted from the other to produce the DoG image.
3. Feature detection: local minima (dark areas) and maxima (light areas) are identified in the DoG image by comparing them to neighboring pixels. Extrema suppression might be applied to remove weak extrema (e.g., edges) that are not robust.
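A minimal sketch of the DoG computation using scipy; the σ values are arbitrary choices for illustration:

import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.rand(128, 128)  # stand-in for a grayscale image

blur_small = gaussian_filter(image, sigma=1.0)   # fine-scale smoothing
blur_large = gaussian_filter(image, sigma=1.6)   # coarser smoothing
dog = blur_small - blur_large                    # difference of Gaussians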
Once the features/key points have been detected in an image, there are several ways of
describing them in a way we can use in a ML model:
• Scale Invariant Feature Transform (SIFT): designed to capture the local texture and
shape around a key point as floating point variables in a way that is invariant to scale
and rotation.
SIFT works in a similar way to HOG in that it creates histograms and concatenates them
to produce a one dimensional descriptor.
• Binary descriptors (BRIEF, ORB): designed to represent features efficiently as binary
strings. This makes them easier to store and compute with than a floating point
descriptor.
Binary descriptors work by comparing pairs of pixels around a key point. If the first
pixel's intensity is greater than the second, a 1 is assigned; otherwise, a 0 is assigned.
These comparisons are concatenated to form the descriptor.
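A hedged sketch of detecting key points and computing binary descriptors with OpenCV's ORB (the image path is hypothetical):

import cv2

image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

orb = cv2.ORB_create(nfeatures=500)  # detector plus binary descriptor
keypoints, descriptors = orb.detectAndCompute(image, None)
# descriptors: one 32-byte (256-bit) binary string per key point
print(descriptors.shape)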
Lecture 4
Univariate Linear Regression
Univariate Linear Regression is a ML algorithm used to model the relationship between a single independent variable (predictor) and a dependent variable (target).
The relationship between the independent variable (x) and the dependent variable (y) is
represented as:
\hat{y} = h_\beta(x) = \beta_0 + \beta_1 x
*𝛽0 is the intercept term. It represents the predicted value of 𝑦 when 𝑥 is zero.
*𝛽1 is the slope of the line.
The objective of univariate linear regression is to find the parameter values ( 𝛽0 and 𝛽1 )
that minimize the cost function/loss (sum of squared errors), which measures how well the
model fits the data.
J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
1. The model starts with some arbitrary values for the parameters, takes some input features X, and computes its prediction \hat{Y}.
2. The predicted value \hat{Y} and the actual value Y are put through the cost function to compute the loss.
3. The values for the parameters are adjusted until the loss converges.
Gradient Descent
Gradient descent is an algorithm for minimizing the cost function 𝐽(𝛽). Gradient descent
works by moving parameters in the direction of the negative gradient of the cost function.
The gradient of the cost function is calculated with respect to each parameter. The
parameters are then updated in the negative gradient direction.
\beta = \beta - \alpha \nabla J(\beta)

For each individual parameter:

\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)
*𝛼 is the learning rate
The learning rate or step size determines how big of a step is taken each iteration. A step
size too low makes learning take too long, and a step size too large might overshoot the
minimum.
The ideal step size is the one that arrives at the minimum in the least number of steps.
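A minimal numpy sketch of univariate linear regression trained with gradient descent; the learning rate, iteration count, and toy data are illustrative choices:

import numpy as np

# Toy data: y is roughly 2x + 1 with noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.5, 100)

b0, b1 = 0.0, 0.0   # arbitrary starting parameters
alpha = 0.01        # learning rate

for _ in range(2000):
    y_hat = b0 + b1 * x
    error = y_hat - y
    # Gradients of J = (1/2n) * sum(error^2) with respect to b0 and b1
    b0 -= alpha * error.mean()
    b1 -= alpha * (error * x).mean()

print(b0, b1)  # should approach 1 and 2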
Lecture 5
Vectors and Matrices
A vector is a one dimensional array of numbers. Vectors are used to represent datapoints
or parameters.
Vectors are represented as column vectors with their height representing the vector’s
dimension.
\vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
Operations on vectors:
1. Addition/subtraction: \vec{x} \pm \vec{y} = \begin{bmatrix} x_1 \pm y_1 \\ x_2 \pm y_2 \\ \vdots \\ x_n \pm y_n \end{bmatrix}

2. Scaling: c\vec{x} = \begin{bmatrix} cx_1 \\ cx_2 \\ \vdots \\ cx_n \end{bmatrix}

3. Dot product: \vec{x} \cdot \vec{y} = x_1 y_1 + x_2 y_2 + \dots + x_n y_n
A matrix is a two dimensional array of numbers. Matrices are used to represent datasets
and transformations.
Operations on matrices:
Multiplication: each entry of the product AB is given by the dot product of the corresponding row in A and the corresponding column in B.
*If A is an m \times n matrix and B is an n \times p matrix, then their matrix product AB is an m \times p matrix

4. Transposition: \begin{bmatrix} 1 & 2 & 3 \\ 0 & 6 & 7 \end{bmatrix}^T = \begin{bmatrix} 1 & 0 \\ 2 & 6 \\ 3 & 7 \end{bmatrix}

5. Inverse: A \times A^{-1} = I
*only defined for m \times m /square matrices
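A quick numpy sketch of these operations:

import numpy as np

A = np.array([[1, 2, 3],
              [0, 6, 7]])
B = np.array([[1, 0],
              [2, 6],
              [3, 7]])

product = A @ B    # (2x3) @ (3x2) -> 2x2 matrix
transpose = A.T    # 3x2 matrix

square = np.array([[2.0, 1.0],
                   [1.0, 3.0]])
inverse = np.linalg.inv(square)   # only defined for square matrices
print(square @ inverse)           # approximately the identity matrix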
For multivariate linear regression with n features, the hypothesis becomes:

y = \beta_0 x_0 + \beta_1 x_1 + \dots + \beta_n x_n

The features and parameters are represented as vectors (with x_0 = 1 for the intercept term):

x^{(i)} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{bmatrix}

y = \sum_{j=0}^{n} \beta_j x_j = \beta^T x
The cost function is represented as:
J(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{2n} \sum_{i=1}^{n} \big(\beta^T x^{(i)} - y_i\big)^2 = \frac{1}{2n} \sum_{i=1}^{n} \Big( \sum_{j=0}^{n} \beta_j x_j^{(i)} - y_i \Big)^2
To speed up gradient descent the feature values should be normalized using one of the
normalization techniques.
The learning rate \alpha should be chosen carefully: small enough that gradient descent converges, but not so small that convergence becomes slow.
Correlation analysis between features should also be done to detect redundancies. This is
done by finding the correlation coefficient between the 2 features we want to investigate
r_{A,B} = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{(n-1)\, \sigma_A \sigma_B}
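A sketch of computing this with numpy (np.corrcoef gives the same result); the feature values are made up:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

n = len(a)
r = np.sum((a - a.mean()) * (b - b.mean())) / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
print(r, np.corrcoef(a, b)[0, 1])  # both close to 1 for strongly correlated features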
Normal Equation
The normal equation is an analytical approach for finding the optimal parameters
(coefficients) in linear regression as opposed to the gradient descent method.
The goal is to find the optimal coefficients 𝛽 to minimize the cost function 𝐽(𝛽).
First the data is represented in matrix form:
Y = X\beta

\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \dots & x_n^{(1)} \\ 1 & x_1^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}

where X is the m \times (n+1) design matrix, with one row per training example and a leading 1 for the intercept.
The cost function is then minimized:
\frac{\partial}{\partial \beta_j} J(\beta) = 0 \quad \Rightarrow \quad \beta = (X^T X)^{-1} X^T Y
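A minimal numpy sketch of the normal equation on made-up data (np.linalg.solve is used in place of an explicit inverse, a standard numerical-stability choice):

import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1, (25, 6))          # m = 25 examples, n = 6 features
X = np.column_stack([np.ones(25), X_raw])   # prepend the intercept column x0 = 1
true_beta = np.arange(7, dtype=float)
Y = X @ true_beta + rng.normal(0, 0.01, 25)

beta = np.linalg.solve(X.T @ X, X.T @ Y)    # equivalent to (X^T X)^-1 X^T Y
print(beta)  # close to true_beta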
Example.
Suppose you have 𝑚 = 25 training examples with 𝑛 = 6 features. The normal equation is
θ = (𝑋 𝑇 𝑋)−1 𝑋 𝑇 𝑌 . For the given values of m and n what are the dimensions of 𝜃, 𝑋, and 𝑌 in
this equation?
Answer: with the intercept column x_0 = 1 included, X is 25 × 7, Y is 25 × 1, and θ is 7 × 1.
Lecture 6
Logistic Regression
Logistic regression is a classification algorithm that models the probability that an input belongs to the positive class by applying the sigmoid function f(z) = \frac{1}{1+e^{-z}} to a linear combination of the features:

h_\theta(x) = f(\theta^T x)
Where ℎ𝜃 (𝑥) is the probability that 𝑦 = 1 given input 𝑥. The predicted probability is
converted into a binary outcome using a threshold. If ℎ𝜃 (𝑥) ≥ 0.5, the predicted class is 1;
otherwise, it's 0.
The decision boundary is the line that separates the two classes. It is determined by the
weights 𝜃 learned during the training process.
The cost function for logistic regression is the log loss (cross-entropy loss):
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
The goal is to minimize the cost function by adjusting the parameters 𝜃 during training.
Gradient descent can be used to do this. The update rule for each parameter is given by:
\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}
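A minimal numpy sketch of logistic regression trained with gradient descent; the toy data, learning rate, and iteration count are arbitrary choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(0, 1, (200, 2))])  # intercept + 2 features
true_theta = np.array([-0.5, 2.0, -1.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=200)).astype(float)

theta = np.zeros(3)
alpha = 0.1
for _ in range(3000):
    h = sigmoid(X @ theta)
    gradient = X.T @ (h - y) / len(y)   # dJ/dtheta for the log loss
    theta -= alpha * gradient

predictions = (sigmoid(X @ theta) >= 0.5).astype(int)  # threshold at 0.5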
L1 regularization (Lasso) adds the absolute values of the coefficients as a penalty term to
the cost function. L1 regularization encourages sparsity in the model, meaning it tends to
drive some of the feature weights to exactly zero. This can be useful for feature selection.
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \lambda \sum_{j=1}^{n} |\theta_j|
L2 regularization (Ridge) adds the squared values of the coefficients as a penalty term to
the cost function.
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \lambda \sum_{j=1}^{n} \theta_j^2
L2 regularization tends to shrink the weights of the features towards zero but usually
doesn't make them exactly zero. It addresses multicollinearity, where features are highly
correlated.
The gradient of the L2 regularized cost function with respect to a coefficient \theta_j is:

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big) x_j^{(i)} + \frac{\lambda}{m} \theta_j

(For L1 regularization the penalty term contributes \frac{\lambda}{m} \operatorname{sign}(\theta_j) instead.)
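A hedged sketch using scikit-learn, which implements both penalties (C is the inverse of the regularization strength λ); the data here is made up:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 5))
y = (X[:, 0] - 2 * X[:, 1] > 0).astype(int)

# L1 (Lasso) tends to zero out weights; the liblinear solver supports the l1 penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
# L2 (Ridge) shrinks weights toward zero without making them exactly zero
l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

print(l1_model.coef_)
print(l2_model.coef_)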
Lecture 7
KNN (k-Nearest-Neighbours)
KNN is a simple and intuitive classification algorithm that falls under the category of
instance-based learning or lazy learning. It makes predictions based on the majority class
of the k nearest neighbours in the feature space.
The training phase for a KNN model involves only storing the training dataset. The
prediction phase involves finding the k training examples with the closest feature values to
the new input (query point), it then assigns the majority class among the k neighbours to
the query point.
A distance metric is used to determine the layout of the example space i.e., which points
are considered nearest to the query point.
- Euclidean Distance (L2 Norm): measures the straight-line distance between two
points in Euclidean space.
Distance(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
- Manhattan Distance (L1 Norm): represents the sum of the absolute differences
between corresponding coordinates.
Distance(P, Q) = \sum_{i=1}^{n} |p_i - q_i|
The parameter 'k' represents the number of neighbours to consider. A small 'k' may lead to
noise sensitivity, while a large 'k' may include points from other classes, reducing the
algorithm's sensitivity to local patterns.
KNN classification can be weighted by distance, meaning that closer neighbours have a higher influence on the prediction.
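A minimal numpy sketch of KNN classification with Euclidean distance; the toy points and k = 3 are illustrative choices:

import numpy as np

# Toy training set: 2-D points with binary labels
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(query, k=3):
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))  # Euclidean (L2)
    nearest = np.argsort(distances)[:k]                        # indices of the k closest
    votes = np.bincount(y_train[nearest])                      # majority vote
    return votes.argmax()

print(knn_predict(np.array([1.1, 0.9])))  # -> 0
print(knn_predict(np.array([4.0, 4.0])))  # -> 1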
Cross-Validation
Cross-validation is a technique used to assess the performance of a model. It involves
splitting the dataset into multiple subsets, training the model on different subsets, and
evaluating its performance on the remaining data.
The most common form of cross-validation is k-Fold Cross-Validation, where the dataset is divided into k equally sized folds or subsets. The model is trained k times, each time using k-1 folds for training and the remaining fold for validation.
The performance metric is averaged over the k iterations to obtain the final performance estimate. This provides a more accurate estimate of the model's performance.
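A short sketch of k-fold cross-validation with scikit-learn; the 5 folds, the iris dataset, and the KNN model are arbitrary example choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Train on 4 folds, validate on the 5th, rotating through all 5 splits
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # averaged performance estimate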
Lecture 8
Performance Metrics
A confusion matrix is a table that summarizes the performance of a classification
algorithm. It shows the number of true positives, true negatives, false positives, and false
negatives.
Several performance metrics can be calculated using values from the confusion matrix:
- Accuracy
\frac{TP + TN}{TP + TN + FP + FN}
- Precision
\frac{TP}{TP + FP}
- Recall (Sensitivity)
\frac{TP}{TP + FN}
- F1 Score
2 \times \frac{Precision \times Recall}{Precision + Recall}
- Specificity (True Negative Rate)
\frac{TN}{TN + FP}
Ex. Consider the following confusion matrix (rows: actual class, columns: predicted class):

                 Predicted class
                 Apple  Orange  Pear
Actual  Apple      50      5     50
        Orange     10     50     20
        Pear        5      5      0

- True Positives: the diagonal entries, where the predicted class matches the actual class (Apple: 50, Orange: 50, Pear: 0).
- False Positives for the Orange class: samples predicted as Orange that are actually another class, i.e., the Orange column excluding the diagonal (5 + 5 = 10).
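A small numpy sketch that reads these values off the confusion matrix above:

import numpy as np

# Rows: actual class, columns: predicted class (Apple, Orange, Pear)
cm = np.array([[50,  5, 50],
               [10, 50, 20],
               [ 5,  5,  0]])

tp = np.diag(cm)             # true positives per class -> [50 50 0]
fp = cm.sum(axis=0) - tp     # column totals minus diagonal -> [15 10 70]
fn = cm.sum(axis=1) - tp     # row totals minus diagonal
precision = tp / (tp + fp)   # per-class precision
recall = tp / (tp + fn)      # per-class recall
print(fp[1])                 # false positives for Orange -> 10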
Model Diagnosis
Diagnostics are tests that are run to gain insight on what is/isn’t working with a learning
algorithm.
Bias is the error introduced by approximating a real-world problem with a model that is too simple. High bias can lead to underfitting, where the model is too simplistic to capture the underlying patterns in the data. If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
Variance is the amount by which the model's predictions would change if it were trained on
a different dataset. It measures the model's sensitivity to variations in the training data.
High variance can lead to overfitting, where the model performs well on the training data
but fails to generalize to new, unseen data. If a learning algorithm is suffering from high
variance, getting more training data is likely to help.
Ensemble Learning
Ensemble learning combines the predictions of multiple models to produce better results than any single model on its own. Due to the need for training and storing multiple models, and combining their outputs, ensemble models are computationally expensive and time consuming.
Voting
Voting is an ensemble technique in machine learning where multiple models are trained
independently, and their predictions are combined to make a final prediction.
There are different types of voting methods, each with its own way of aggregating the
individual model predictions:
- Hard Voting: the final prediction is determined by a simple majority vote. Each model
in the ensemble "votes" for a class, and the class with the most votes is chosen as
the final prediction.
Ex. If three models predict class A, and two models predict class B, the final
prediction using hard voting would be class A.
- Soft Voting: The final prediction is the class with the highest average
probability/confidence.
Ex. If three models predict class A with probabilities 0.8, 0.7, and 0.9 [average: 0.8], and two models predict class B with probabilities 0.4 and 0.6 [average: 0.5], the final prediction using soft voting would be class A.
- Average Voting: the average (arithmetic mean) of the predictions is used to make
the final prediction. Used for regression problems.
Ex. If three models predict 3.0, 3.5, and 4.0, the final prediction using average voting
would be (3.0 + 3.5 + 4.0) / 3 = 3.5.
- Weighted voting: assigns different weights to the predictions of each model. The
weights reflect the confidence or performance of each model.
Ex. If there are three models, and you assign weights of 0.5, 0.3, and 0.2 to their
predictions, the final prediction would be 0.5 × 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛1 + 0.3 × 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛2 +
0.2 × 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛3
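A hedged sketch of hard and soft voting using scikit-learn's VotingClassifier; the base models and dataset are arbitrary example choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base_models = [("lr", LogisticRegression(max_iter=1000)),
               ("knn", KNeighborsClassifier()),
               ("tree", DecisionTreeClassifier())]

hard = VotingClassifier(estimators=base_models, voting="hard").fit(X, y)  # majority vote
soft = VotingClassifier(estimators=base_models, voting="soft").fit(X, y)  # average probabilities
# passing weights=[0.5, 0.3, 0.2] would give a weighted vote instead
print(hard.predict(X[:3]), soft.predict(X[:3]))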
Bagging/Bootstrap Aggregating
Bagging is an ensemble learning technique that trains multiple instances of the same
model on different subsets of the training data. Bagging aims to reduce the variance and
overfitting associated with a single model by combining predictions from multiple models.
The first step is creating multiple subsets of the training data by randomly sampling with replacement; this is called bootstrap sampling.
We randomly select 𝑛 samples with replacement from the original training dataset to
create a new training subset. This process is repeated 𝑘 times, resulting in 𝑘 diverse
subsets.
One example of Bagging is Random Forest, which builds an ensemble of decision trees,
where each tree is trained on a different bootstrap sample. Additionally, at each node, a
random subset of features is considered for splitting. This reduces the correlation between
outputs of each model.
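A short sketch of bagging with scikit-learn (the estimator keyword follows scikit-learn 1.2+; older versions call it base_estimator):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# k = 100 bootstrap samples, each training its own decision tree
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=100, bootstrap=True).fit(X, y)

# Random forest additionally considers a random feature subset at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)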
Boosting
Boosting is an ensemble learning technique that builds a sequence of models, where each
subsequent model focuses on correcting the errors of the previous ones. Boosting relies
on the use of weak learners, which are models that perform slightly better than random
chance. Weak learners are typically simple models.
Each data point in the training set is assigned a weight, and the weights are adjusted after
each model is trained. Misclassified points are given higher weights to make them more
influential in the subsequent model's training.
Boosting uses a weighted sum of the predictions from individual models to make the final
prediction. The weights are determined by the performance of each model on the training
data.
- AdaBoost (Adaptive Boosting): assigns weights to each training sample based on its
classification error. It focuses more on misclassified samples in subsequent
iterations.
Each new model is trained to correct the errors of the combined predictions of the previous models.
- Gradient Boosting: optimizes the model by minimizing a loss function using gradient descent. Each new tree is trained to predict the residuals (the differences between the actual and predicted values) of the ensemble.
The final prediction is the sum of the predictions from all trees in the ensemble.
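A brief sketch of both boosting variants with scikit-learn (dataset and hyperparameters chosen arbitrarily):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# AdaBoost: reweights misclassified samples after each weak learner
ada = AdaBoostClassifier(n_estimators=50).fit(X, y)

# Gradient boosting: each new tree fits the residual errors of the ensemble
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X, y)

print(ada.score(X, y), gbm.score(X, y))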
Stacking
Stacking is an ensemble learning technique that involves training multiple diverse models
and combining their predictions using another model called a meta-model or blender. The
meta-model learns how to best combine the predictions of the individual models.
The effectiveness of stacking relies on the diversity of the base models. Models should
capture different aspects of the data and make different types of errors.
The training set is used to train the base models, and a separate validation set is often used
to generate predictions from the base models for training the meta-model.
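A hedged sketch of stacking with scikit-learn; the base models and meta-model are arbitrary choices, and StackingClassifier generates the validation-style base predictions internally via cross-validation:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base_models = [("knn", KNeighborsClassifier()),
               ("tree", DecisionTreeClassifier())]

# The meta-model (blender) learns to combine the base models' predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5).fit(X, y)
print(stack.score(X, y))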
Model Characteristics
The performance of an ensemble model is influenced by various characteristics:
- Dependency: the degree to which the individual models in the ensemble are
correlated or dependent on each other.
Models can be sequential, meaning each model builds on the predictions of the
previous model, or parallel, where models are trained all at the same time.
- Fusion Method: the techniques used to combine the predictions of individual models
in the ensemble.
We can describe the ensemble algorithms we've covered using these characteristics:
- Voting: parallel; fusion by (weighted) majority vote or averaged probabilities
- Bagging: parallel; fusion by majority vote or averaging
- Boosting: sequential; fusion by a weighted sum of predictions
- Stacking: parallel base models; fusion by a trained meta-model
Lecture 9
K-Means Clustering
K-means is an unsupervised clustering algorithm that partitions the data into k clusters by repeatedly assigning each data point to the nearest cluster mean (centroid) and recomputing each centroid as the mean of its assigned points.
The goal is to minimize the within-cluster sum of squares, meaning that the sum of the squared distances between each data point and the mean of its assigned cluster is minimized.
The algorithm does not guarantee convergence to the global optimum; the result may depend on the initial clusters. It is common to run it multiple times with different starting conditions and choose the run with the lowest cost.
The choice of 𝑘 also affects the performance of the algorithm. When the algorithm is used
for exploratory analysis the value of 𝑘 can be optimized with methods such as elbow
method.
However, when using k-means for some other downstream process 𝑘 is set according to
the number of clusters needed by the process.
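A minimal numpy sketch of the k-means loop described above; k, the toy data, and the iteration count are illustrative choices (a production version would handle empty clusters and rerun with several initializations):

import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

k = 2
centroids = X[rng.choice(len(X), k, replace=False)]  # random initial clusters

for _ in range(10):
    # Assignment step: each point joins its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

cost = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
print(centroids, cost)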
Lecture 10
Dimensionality Reduction
Dimensionality reduction is a technique used to reduce the number of input variables
(features) in a dataset while preserving the essential information present in the data. The
goal is to simplify the data and remove redundant or irrelevant features.
The covariance values for a set of features is represented using a covariance matrix.
Ex. Covariance matrix with 3 features A, B, C:

\Sigma = \begin{bmatrix} \operatorname{var}(A) & \operatorname{cov}(A,B) & \operatorname{cov}(A,C) \\ \operatorname{cov}(B,A) & \operatorname{var}(B) & \operatorname{cov}(B,C) \\ \operatorname{cov}(C,A) & \operatorname{cov}(C,B) & \operatorname{var}(C) \end{bmatrix}

An eigenvector of a square matrix A is a nonzero vector X that satisfies:

A X = \lambda X

Here X is the eigenvector and \lambda is the eigenvalue.
Ex.
A = \begin{bmatrix} 0 & 5 & -10 \\ 0 & 22 & 16 \\ 0 & -9 & -2 \end{bmatrix}

If we compute the product AX for the following:

X = \begin{bmatrix} -5 \\ -4 \\ 3 \end{bmatrix}

AX = \begin{bmatrix} 0 & 5 & -10 \\ 0 & 22 & 16 \\ 0 & -9 & -2 \end{bmatrix} \begin{bmatrix} -5 \\ -4 \\ 3 \end{bmatrix} = \begin{bmatrix} -50 \\ -40 \\ 30 \end{bmatrix} = 10 \begin{bmatrix} -5 \\ -4 \\ 3 \end{bmatrix}
The product AX resulted in a vector equal to 10 times the vector X; in other words, AX = 10X, so X is an eigenvector of A with eigenvalue \lambda = 10.
The next step is to compute the eigenvalue decomposition of the covariance matrix.
Eigenvalue decomposition is a factorization of a square matrix 𝐴 into 3 matrices:
A = V \Lambda V^{-1}

Where V is the matrix of all the eigenvectors of A, \Lambda is a diagonal matrix whose diagonal elements are the eigenvalues of A, and V^{-1} is the inverse of V.
The eigenvalues are then sorted in descending order. We choose the top 𝑘 eigenvalues,
where 𝑘 is our desired degree of dimensionality. The eigenvectors corresponding to the
highest eigenvalues form a matrix 𝑊 and are the principal components.
Once the principal components have been selected, the original data is projected onto the new subspace using a matrix multiplication:

Z = XW

Where X is the (mean-centred) data matrix and Z is the reduced-dimensional representation of the data.
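A minimal numpy sketch of this PCA procedure; the data is made up and k = 2 is an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 5))          # 100 samples, 5 features
X = X - X.mean(axis=0)                  # centre the data

cov = np.cov(X, rowvar=False)           # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh is suited to symmetric matrices

order = np.argsort(eigvals)[::-1]       # sort eigenvalues in descending order
k = 2
W = eigvecs[:, order[:k]]               # top-k eigenvectors = principal components

Z = X @ W                               # reduced-dimensional representation
print(Z.shape)                          # (100, 2)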