Deep Learning Module 1

lOMoARcPSD|51109247
Deep Learning Module 1
Machine Learning (MVJ College of Engineering)
Scan to open on Studocu
Studocu is not sponsored or endorsed by any college or university

Downloaded by Anusha (anusha.mitts@gmail.com)
lOMoARcPSD|51109247
DEEP LEARNING
MODULE 1
Machine Learning Basics: Learning Algorithms, Capacity, Overfitting and Underfitting,

Hyperparameters and Validation Sets, Estimator, Bias and Variance, Maximum Likelihood
Estimation, Bayesian Statistics, Supervised Learning Algorithms, Unsupervised Learning
Algorithms, Stochastic Gradient Decent, building a Machine Learning Algorithm, Challenges
Motivating Deep Learning.
Machine learning:
Machine learning is a subset of artificial intelligence (AI) that allows machines
to learn and improve from experience without being explicitly programmed.
Machine Learning Basics:
 Definition and Example of Learning Algorithm:

 A learning algorithm is introduced with an example: the linear
regression algorithm.
 Fitting vs. Generalization:
 Distinguishes between fitting the training data and finding patterns
that generalize to new data.
 Hyperparameters:
 Discussion on hyperparameters and the need to set them using
additional data external to the learning algorithm itself.
 Machine Learning as Applied Statistics:
 Machine learning is likened to applied statistics with a focus on
computational estimation of functions rather than proving confidence
intervals.
 Frequentist Estimators and Bayesian Inference:

lOMoARcPSD|51109247
 Introduction to the two central approaches to statistics: frequentist

estimators and Bayesian inference.
 Categories of Machine Learning Algorithms:
 Supervised Learning: Learning with labeled data.
 Unsupervised Learning: Learning with unlabeled data.
 Optimization Algorithm: Stochastic Gradient Descent:
 Many deep learning algorithms are based on an optimization algorithm
called stochastic gradient descent.
 Components of a Machine Learning Algorithm:
 Description of how to combine components such as an optimization
algorithm, a cost function, a model, and a dataset to build a machine
learning algorithm.
 Limitations of Traditional Machine Learning:
 Discussion in section 5.11 on factors that limit the ability of traditional
machine learning to generalize, motivating the development of deep
learning algorithms to overcome these obstacles.
Learning Algorithms:
Definition of a Learning Algorithm:
A machine learning algorithm is defined as an algorithm that learns from data.
Mitchell's Definition:
Mitchell (1997) defines learning as: “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E.”
Components of Learning:
 Experience (E): The data or experience from which the algorithm learns.
 Tasks (T): The specific tasks the algorithm is designed to perform.
 Performance Measure (P): The criteria used to measure the algorithm's performance on
the tasks.
1 .The Task, T
 Definition of Task (T):

 In the context of machine learning, the task is not the learning process itself but the
ability to perform a specific function (e.g., walking for a robot).
 Example of Task:
 If the goal is for a robot to walk, then walking is the task. Learning to walk is the means
to achieve this task.

lOMoARcPSD|51109247
1.1Classification:
 Definition of Classification Task:

 The task involves specifying which of k categories an input belongs to.
 Example 1: Object recognition:
 An image is represented by a numeric code.
 The goal is to recognize objects within the image.
 Example 2: Robot task:
 A robot should be able to identify different objects and then act on them
based on commands.
 Example 3: People recognition:
 People in images should be automatically recognized and tagged.

1.2.classification with missing inputs:
Normally, when you have all the information (temperature, blood pressure,
heart rate, and blood test results), the model takes all of these inputs and
makes a prediction—either "Disease Present" or "No Disease."
Now, imagine that some patients don't get a blood test because it’s
expensive. For others, maybe the heart rate is not available. This means
your model won't always have a complete set of inputs for every patient.
Solution:
Instead of training the model for just one set of inputs (all available), the
model needs to handle all combinations of missing inputs.
 Missing inputs make the problem harder because the model needs to
handle many scenarios (different inputs missing for different
patients).
 The model learns to classify (diagnose) even when some
information is unavailable by focusing on what it does know.
This approach is efÏcient because instead of building a separate model for

every possible missing input scenario, the model learns a single function
that can handle all cases by filling in the missing gaps using probabilities.
1.3. Regression
Regression: Predicting a numerical value given input data.
Eg:
 Predicting Housing Prices
 Predicting Stock Prices

lOMoARcPSD|51109247
1.4. Transcription
Transcribing unstructured data into discrete, textual form.
 Examples: Optical character recognition, speech recognition.
1.5. Machine translation:
Converting text from one language to another.
1.6. Structured Output:
Structured output tasks are a type of machine learning problem where the
output is not just a single value, but a collection of values that are
interconnected or interdependent.
Examples:
1. Parsing:
o Input: A sentence in natural language.
o Output: A parse tree representing the grammatical structure of
the sentence, with nodes labeled as nouns, verbs, adjectives,
etc.
2.Pixel-wise Segmentation:
 Input: An image.
 Output: A labeled image where each pixel is assigned to a specific
category (e.g., "sky," "road," "car").
3. Image Captioning:
 Input: An image.
 Output: A natural language sentence describing the image.
1.7 Anomaly detection:
 Identifying unusual or atypical events or objects.

 Example: Credit card fraud detection.
1.8. Synthesis and Sampling:
Synthesis and sampling are techniques used in machine learning to create

new data that is similar to the data the model has been trained on.
Data augmentation: Increasing the size and diversity of the training
dataset.
1.9. Imputation of missing values:

lOMoARcPSD|51109247
Often in real-world datasets, certain values are missing. This can be due to
various reasons like data collection errors, equipment failures, or privacy
concerns. Predicting these missing values is crucial for data analysis,
machine learning, and decision-making.
1.10. Denoising:
 Predicting a clean example (image)from a corrupted version.
1.11. Density estimation or probability mass function estimation:
 Learning a probability distribution over the data.

 Applications: Solving other tasks, such as missing value imputation.
2. Performance Measures,P
 A quantitative measure of a machine learning algorithm's performance.

 Specific to the task being carried out.
 Often measured using accuracy or error rate for classification tasks.
 For density estimation, average log-probability is commonly used.
 Evaluated on a test set of data separate from the training data.
Choosing Performance Measures:
 DifÏcult to choose a measure that corresponds well to desired

behavior.
 Depends on the application and specific task.
 May involve trade-offs between different criteria.
 Can be impractical to measure certain quantities directly.
 Alternative criteria or approximations may be necessary.
3.The Experience, E
machine learning algorithms can experience different types of data, which influence
the learning process and the types of tasks they can perform. Unsupervised learning
focuses on understanding the data structure, while supervised learning involves
predicting targets based on features.
Unsupervised learning:
Unsupervised learning is a type of machine learning where the algorithm is not
provided with any labels or target values. Instead, it is given a dataset of raw
data and must learn to find patterns(similarity), structures, or relationships within
the data on its own.
Supervised learning:

lOMoARcPSD|51109247
Supervised learning is a type of machine learning where the algorithm is provided

with a dataset containing both input features and corresponding output
labels or targets. The goal is to learn a model that can accurately predict the
target value for new, unseen data.
Datasets:
 A collection of examples, which are collections of features.

 Often represented as a design matrix, where each row is an example
and each column is a feature.
 Can be heterogeneous with different sizes for examples.
Example:
Consider a dataset of information about houses. Each example (row) might

represent a different house, and the features (columns) could include:
 Square footage
 Number of bedrooms
 Number of bathrooms
 Location
 Price
Other Learning Paradigms:
 Semi-supervised learning: Some examples have labels, others do

not.
 Multi-instance learning: A collection of examples is labeled, but
individual members are not.
 Reinforcement learning: Interacts with an environment and receives
feedback.
Example: Linear Regression

Linear regression is a statistical method used to model the
relationship between a dependent variable and one or more
independent variables.

lOMoARcPSD|51109247
Types of Linear Regression:
Simple Linear Regression: Involves only one independent variable.

Multiple Linear Regression: Involves multiple independent variables.
Simple Linear Regression:

lOMoARcPSD|51109247
Here x and y are actual data(Data set data).In case new value of x=6,we have to
predict y.For that We have to draw bestfit line.
Fitting
 Model Adaptation: Fitting refers to the process of adjusting a machine learning

model's parameters to best represent the patterns in the data.
 Optimization: This involves minimizing a loss function, which measures the
difference between the model's predictions and the actual values.
 Examples:
o In linear regression, fitting involves finding the optimal slope and intercept
for the line that best fits the data points.
Equation of line.

lOMoARcPSD|51109247
Y=mx+c
3.6=0.4*3+c
3.6=1.2+c
C=3.6-1.2
C=2.4

lOMoARcPSD|51109247
Distance betweenactual value and predicted value-error.We need to choose

optimal value for weight to reduce error.
Line with least error-Regression line.
Best fit line:Try for different values of m.compare the difference between the
actual value and predicted value is minimum will be selected as bestfit line.
The least squares method is a widely used mathematical technique for

finding the best fit line
The least squares method is based on the principle of minimizing the overall
error between the observed data points and the predicted values from the model.
Formula:
The least squares method involves minimizing the sum of squared residuals (RSS),
which is given by:
RSS = Σ(yi - ŷi)²
where:
 yi is the observed value of the dependent variable for the i-th data point
 ŷi is the predicted value of the dependent variable for the i-th data point

lOMoARcPSD|51109247
Multiple Linear Regression:

lOMoARcPSD|51109247
Goal: The goal of linear regression is to find the best values for the regression
coefÏcients that minimize the sum of squared errors between the predicted values
and the actual values.
Applications:
 Predicting house prices based on factors like square footage, number of

bedrooms, and location.
 Forecasting sales based on marketing spend and economic indicators.
 Analyzing the relationship between income and education level.
 Developing pricing models for products and services.
Limitations:
 Linearity assumption: If the relationship is not linear, linear regression may

not provide accurate predictions.
 Outliers: Outliers can significantly impact the results.
 Multicollinearity: If the independent variables are highly correlated, it can
be difÏcult to interpret the results.
Capacity, Overfitting and Underfitting

Patterns
 Underlying Relationships: Patterns are the underlying
relationships or structures present in data. These can be
simple or complex.

lOMoARcPSD|51109247
 Discovering Patterns: The goal of machine learning is to

discover and learn these patterns from data.
 Examples:
o In image classification, patterns might be the shapes,
colors, or textures that define different objects.
o Capacity
o In machine learning, a model's capacity refers to its ability to fit a
wide variety of functions. A model with high capacity can learn
complex patterns, while a model with low capacity can only learn
simpler patterns.
 High Capacity Models:
o Can fit complex patterns in the data.
o Are more prone to overfitting.
o Often have more parameters.
 Low Capacity Models:
o Can only fit simple patterns in the data.
o Are less prone to overfitting.
o Often have fewer parameters.
o Overfitting
o Overfitting occurs when a model learns the training data too well, to
the point where it starts to memorize the noise or idiosyncrasies in the
data rather than learning the underlying patterns. This leads to poor
performance on new, unseen data.
o Signs of Overfitting:
 Very low training error but high test error.
 A complex model with many parameters.
 The model performs exceptionally well on the training data but poorly on the
validation or test data.
o Underfitting
o Underfitting occurs when a model is too simple to capture the underlying
patterns in the data. This leads to poor performance on both the training and
test data.
o Signs of Underfitting:
 High training error and high test error.
 A simple model with few parameters.
 The model consistently performs poorly on both the training and test data.
o Balancing Capacity:
o The goal is to find a model with just the right capacity to capture the
underlying patterns without overfitting. This is often achieved through
techniques like:

lOMoARcPSD|51109247
 Regularization: Penalizing complex models to prevent overfitting.

 Cross-validation: Evaluating the model's performance on multiple subsets
of the data to avoid overfitting.
 Early stopping: Stopping training when the performance on a validation set
starts to deteriorate.
o By carefully considering the model's capacity and using these techniques,
you can improve the generalization performance of your machine learning
model.
o Sources and related content
The image illustrates the concepts of underfitting, appropriate capacity, and

overfitting in machine learning, using a simple regression example.
Underfitting:
 Description: The model is too simple to capture the underlying pattern in

the data.
 Visualization: The regression line is a straight line that does not fit the data
points well.
 Impact: The model fails to generalize to new data and has high error on both
the training and test sets.
Appropriate Capacity:
 Description: The model has the right complexity to capture the underlying
pattern in the data.
 Visualization: The regression line fits the data points well without
overfitting.
 Impact: The model generalizes well to new data and has low error on both
the training and test sets.
Overfitting:

lOMoARcPSD|51109247
 Description: The model is too complex and learns the noise in the data
rather than the underlying pattern.
 Visualization: The regression line is a highly complex curve that fits the
training data points perfectly but may not generalize well to new data.
 Impact: The model performs well on the training set but poorly on the test
set.
Key Elements:
 Training Error: The error measured on the training data.

 Generalization Error: The error measured on new, unseen data.
 Capacity: A measure of a model's complexity or ability to fit complex
patterns.
 Underfitting Zone: The region where the model is too simple to capture the
underlying patterns in the data.
 Overfitting Zone: The region where the model is too complex and learns
the noise in the training data.
 Optimal Capacity: The region where the model has the right balance of
complexity and simplicity to generalize well.
 Generalization Gap: The difference between the training error and the
generalization error.
Explanation:
2. Training Error: As the model's capacity increases, the training error

generally decreases. This is because more complex models can better fit the
training data.
3. Generalization Error: Initially, as the model's capacity increases, the
generalization error decreases. However, after reaching a certain point, the
generalization error starts to increase. This is due to overfitting, where the

lOMoARcPSD|51109247
model learns the noise in the training data rather than the underlying
patterns.
4. Optimal Capacity: The optimal capacity is the point where the
generalization error is minimized. This is the region where the model has just
the right amount of complexity to capture the underlying patterns without
overfitting.
5. Underfitting and Overfitting Zones: The underfitting zone is where the
model is too simple, leading to high training and generalization errors. The
overfitting zone is where the model is too complex, leading to low training
error but high generalization error.
1.No Free Lunch Theorem:

 every classification algorithm has the same error rate.
 Implies that no machine learning algorithm is universally better than any
other.
 Optimal Capacity: The optimal model capacity is the complexity level that
best balances underfitting and overfitting for a given task and dataset.
 Training Examples: The number of examples in the training set affects the
optimal capacity.
 Increasing Training Examples: As the number of training examples
increases, the optimal capacity generally increases. This is because with
more data, the model can learn more complex patterns without overfitting.
 Plateau: Eventually, the optimal capacity plateaus, meaning that adding
more training examples does not significantly improve the model's
performance. This is because the model has reached its maximum capacity
for the given task.
 Error Bars: The error bars indicate the uncertainty in the estimated optimal
capacity. They represent the range of possible values for the optimal capacity.

lOMoARcPSD|51109247
2. Regularization
Regularization:
 A technique used to modify a learning algorithm to reduce its generalization

error.
Other Forms of Regularization:
 L1 regularization: Penalizes the absolute value of the weights.
 L2 regularization:
 The model is encouraged to learn smaller weights, which can help prevent
overfitting.

lOMoARcPSD|51109247

 Dropout: Randomly drops neurons during training to prevent overfitting.
 Early stopping: Stops training when the performance on a validation set
starts to deteriorate.
Hyperparameters and Validation Sets

Hyperparameters:
Definition:
 Hyperparameters are settings used to control the behavior of a

machine learning algorithm.
 They are not learned from the training data but are set before the
learning process.
Examples:
 Polynomial Regression: The degree of the polynomial is a

hyperparameter controlling model capacity.
 Weight Decay: The regularization strength λ\lambdaλ is another
hyperparameter.
Validation Sets
Role:
 A validation set is used to estimate the generalization error of a model

and to guide hyperparameter tuning.

lOMoARcPSD|51109247
 It is crucial because hyperparameters cannot be learned from the

training set alone, as this would lead to overfitting.
Construction:
 The training data is split into two subsets:

o Training Set: Used to learn the model parameters.
o Validation Set: Used to estimate generalization error and adjust
hyperparameters.
 Typically, about 80% of the data is used for training, and 20% is used
for validation.
Validation vs. Test Sets:
 Test Set: Used only after model training and hyperparameter tuning to
estimate true generalization error.
 Validation Set:Hyperparameter tuning and model selection. The
validation set helps to fine-tune hyperparameters like learning rate,
regularization strength, or the degree of a polynomial in polynomial
regression.
1. Cross-Validation
Cross-validation is a technique used in machine learning

to evaluate the performance of a model on unseen data.
It helps to prevent overfitting and provides a more
reliable estimate of the model's generalization
performance.
How Cross-Validation Works:

lOMoARcPSD|51109247
Here
K=5
Steps/Algorithm
1. Split Data:
a. Divide the dataset D into k mutually exclusive subsets.
2. Iterate:
a. For each fold i from 1 to k:
i. Train the model A on the dataset D excluding the i-
th fold.
ii. Evaluate the model's performance on the i-th fold
using the loss function L.
iii. Record the error for each example in the i-th fold.
3. Return Average Error:
a. Calculate the average error over all examples in all
folds.
Estimators, Bias and Variance:

lOMoARcPSD|51109247
Estimators
1.Point estimation is a statistical method used to
estimate a population parameter based on a sample of data. It
involves calculating a single value (a point) that best represents
the true value of the population parameter.
Common Point Estimators:
 Mean: The average of the sample values.

 Median: The middle value in a sorted list of sample values.
 Mode: The most frequent value in the sample.
 Maximum Likelihood Estimator (MLE): The estimator
that maximizes the likelihood function, which is the
probability of observing the sample data given the
parameter value.
 Least Squares Estimator: Used in regression analysis to
minimize the sum of squared errors between the predicted
values and the actual values.
Example:
Suppose you want to estimate the average height of all adult

males in a city. You can collect a random sample of adult males,
measure their heights, and calculate the mean of the sample
heights. This mean value would be your point estimate for the
average height of all adult males in the city.
Factors Affecting Point Estimation:
 Sample Size: A larger sample size generally leads to a more

accurate point estimate.
 Sampling Method: The method used to select the sample
can affect the accuracy of the estimate.
 Distribution of the Population: The distribution of the
population can influence the choice of point estimator.

lOMoARcPSD|51109247
2. Function estimation is a fundamental task in machine

learning where the goal is to learn a function that maps inputs to
outputs based on a given dataset. This function can be used to
make predictions on new, unseen data.
Key Components:
 Input: A set of features or variables.

 Output: A predicted value or label.
 Function: A mathematical mapping that relates the input to
the output.
 Dataset: A collection of input-output pairs used to train the
function.
Types of Function Estimation:
 Regression: Predicting a continuous numerical value (e.g.,

house prices, stock prices).
 Classification: Predicting a categorical label (e.g., spam or
not spam, cat or dog).
 Structured Output: Predicting a complex output structure,
such as a sequence or a tree.
Approaches to Function Estimation:
 Statistical Learning: Using statistical methods to model

the relationship between input and output variables.
 Machine Learning Algorithms: Employing algorithms like
linear regression, decision trees, support vector machines,
neural networks, etc.
 Deep Learning: Using deep neural networks for complex
tasks like image and natural language processing.
Bias:
Bias in machine learning refers to systematic errors that occur due to the
inherent characteristics of the data, algorithm, or model. It can lead to unfair
or inaccurate predictions.

lOMoARcPSD|51109247
 Definition: The bias of an estimator θ_hat is the difference

between its expected value and the true parameter value θ.
 Mathematical Expression: bias(θ_hat) = E[θ_hat] - θ
 Interpretation: A biased estimator systematically
overestimates or underestimates the true parameter value.
Unbiased Estimator:
 Definition: An estimator is unbiased if its bias is zero.

 Mathematical Expression: E[θ_hat] = θ
 Interpretation: On average, an unbiased estimator gives
the true value of the parameter.
Asymptotically Unbiased Estimator:
 Definition: An estimator is asymptotically unbiased if its

bias approaches zero as the sample size m increases to
infinity.
 Mathematical Expression: lim_{m → ∞} bias(θ_hat) =
0
 Interpretation: While the estimator may be biased for
small sample sizes, the bias becomes negligible as the
sample size increases.
Variance and Standard Error:

1. Variance:
 Definition: Variance measures how much a set of
data points differ from their mean (or expected
value).

lOMoARcPSD|51109247
Interpretation:
 A small variance means the data points are close to

the mean.
 A large variance indicates that the data points are
more spread out.
2. Standard Error (SE):
 Definition: The standard error measures the

accuracy with which a sample represents the
population.
Interpretation:

lOMoARcPSD|51109247
 Standard error decreases as the sample size increases, implying that larger
samples yield more precise estimates of population parameters.
Standard Deviation (SD):

It indicates how much individual data points differ from the mean (average)
of the dataset.
Standard Deviation (σ) = √(Variance)
Trading off Bias and Variance to Minimize

Mean Squared Error
Trade-off:
 Underfitting Zone: As capacity increases, bias decreases (the model

becomes more complex and can better fit the data). However, variance
remains low.
 Overfitting Zone: As capacity increases further, variance increases (the
model becomes more sensitive to noise in the training data). Bias remains
low.
 Optimal Capacity: The goal is to find the optimal capacity where the
generalization error is minimized. This involves balancing bias and variance.
Maximum Likelihood Estimation

MLE finds the parameters of a model that make the observed data most
likely to occur. Maximum Likelihood Estimation (MLE) is a statistical

lOMoARcPSD|51109247
method used to estimate the parameters of a probability distribution given a

set of data. It seeks to find the parameter values that maximize the
likelihood of observing the given data.
How MLE Works:
1. Assume a Probabilistic Model:

a. Choose a suitable probability distribution (e.g., Gaussian, Bernoulli,
Poisson) for the data.
3.Maximizing the Likelihood:
Example
Basic Idea:
 We have some data.

 We assume that this data comes from a specific distribution (e.g., Normal,
Bernoulli).
 We want to find the parameter(s) of this distribution that make the data most
likely.
Step 1: Define the Problem
 We toss the coin 10 times, and we get 7 heads and 3 tails.

 We assume that the coin follows a Bernoulli distribution with parameter p (the
probability of getting heads).

lOMoARcPSD|51109247
Step 2: Write the Likelihood Function
Step 3: Maximize the Likelihood Function

The goal of MLE is to find the value of pthat maximizes the likelihood L(p).
Properties of Maximum Likelihood
 Consistency: MLE converges to the true parameter as the sample size

increases.
 Asymptotic Normality: For large samples, MLE estimates follow a normal
distribution.

lOMoARcPSD|51109247
 Asymptotic EfÏciency: MLE has the minimum possible variance for large
samples.
 Invariance: The MLE of a transformed parameter is the transformed MLE of
the original parameter.
 Bias: MLE may be biased for small samples but becomes unbiased as the
sample size grows.
 Flexibility: MLE can be applied to a wide range of probabilistic models.
 Sample Size Dependence: MLE performs better with larger datasets.
 Likelihood Principle: The entire process of MLE estimation is based on the
likelihood of the observed data.
Bayesian Statistics
Bayesian statistics is a branch of statistics that uses Bayes' theorem to update the
probability of a hypothesis as more evidence or data becomes available.
Key Components:
1. Prior: This is the initial belief or knowledge about a parameter before

observing the data. It is represented as a probability distribution.
2. Likelihood: The likelihood represents the probability of observing the data
given a certain parameter value.
3. Posterior: After observing the data, the prior is updated to form the
posterior, which is the new belief about the parameter.
4. Bayes' Theorem: This theorem is the mathematical rule that updates the
prior with the likelihood to give the posterior.
Bayes' Theorem Formula:

lOMoARcPSD|51109247
Example: Estimating the Probability of Rain
Let’s say you want to estimate the probability that it will rain tomorrow. You have
some prior knowledge based on weather patterns, and you can also observe a
weather report.
Step 1: Define the Problem
 Prior: You initially believe there's a 30% chance of rain based on past
experience. This is your prior knowledge.
Data (Likelihood): The weather report says there's a 90% chance it will rain if
clouds appear, and you observe that it is cloudy.
However, even on a clear day (when it doesn’t rain), there is still a 50% chance that
clouds might appear.
Step 2: Apply Bayes' Theorem
You now want to update your belief about the probability of rain (posterior) after
observing that it is cloudy. You use Bayes’ theorem to compute this.
Bayes’ theorem formula:

lOMoARcPSD|51109247
Advantages of Bayesian Statistics:
 Flexibility: Can incorporate prior knowledge, which is useful in situations

with limited data.
 Interpretability: Provides a natural interpretation of probability as a degree
of belief.
 Uncertainty Quantification: Posterior distributions allow you to fully
express uncertainty about model parameters.
Disadvantages:
 Computational Complexity: Bayesian methods can be computationally

intensive, especially for complex models.
 Choice of Prior: The choice of prior can significantly influence results, and
selecting an appropriate prior can be challenging.
Supervised Learning Algorithms
Supervised learning algorithms are a category of machine learning
techniques in which the model is trained on labeled data. The goal is to learn
a function that maps input data (features) to desired output labels. Once the
model is trained, it can make predictions on new, unseen data. Supervised
learning is typically used for tasks like classification and regression.
1 .Probabilistic Supervised Learning:
1. Linear Regression
 Type: Regression

lOMoARcPSD|51109247
 Use Case: Predict a continuous value (e.g., house prices,

stock prices).
 How it Works: Linear regression models the relationship
between input variables and the output by fitting a linear
equation to the data.
2. Logistic Regression
 Type: Classification
 Use Case: Binary classification tasks (e.g., spam detection, fraud detection).
 How it Works: Logistic regression estimates probabilities using a logistic
(sigmoid) function and maps inputs to discrete labels (0 or 1).
2 .Support Vector Machines:

Support Vector Machines (SVMs) are a class of supervised
machine learning algorithms that are widely used for classification
and regression tasks. They are based on the concept of finding a
hyperplane in high-dimensional space that separates data points
of different classes.
Key Concepts:
 Hyperplane: A decision boundary that separates data

points into different classes.
 Margin: The distance between the hyperplane and the
closest data points (support vectors).
 Support Vectors: The data points that lie closest to the
hyperplane and influence the position of the hyperplane.

lOMoARcPSD|51109247
Goal of SVMs:
 To find the hyperplane that maximizes the margin between the two classes.
 This helps to improve generalization performance and reduce overfitting.
 Misclassification should be less.
Hard Margin
The data is perfectly linearly separable.
Soft Margin

lOMoARcPSD|51109247
 Relaxation: Allows for some misclassifications to account for noise or

outliers in the data.
SVM uses soft margin technique.It allows misclassification.

lOMoARcPSD|51109247
Kernel Trick:
Used to map data into a higher-dimensional feature space without explicitly
computing the coordinates of the transformed data points. This is achieved by using
a kernel function.
Kernel Function
A kernel function is a function that takes two inputs and returns a scalar value
representing the similarity between them in a transformed feature space.

lOMoARcPSD|51109247
Advantages of SVM:
 Effective in high-dimensional spaces
 Robust to overfitting
 Works well with clear margin of separation
 Handles non-linear data
Disadvantages of SVM:
 Not suitable for large datasets
 DifÏcult to choose the right kernel
 Poor performance with overlapping classes
3 .Other Simple Supervised Learning Algorithms:

Assignment:
1.KNN
2.Decision Trees
Unsupervised Learning Algorithms

Unsupervised learning algorithms are used to find patterns or
structures in data without any labeled outcomes.
1. Principal Components Analysis
Principal Component Analysis (PCA) is a statistical technique

used to reduce the dimensionality of a dataset while preserving
the most important information. It transforms a high-dimensional
dataset into a lower-dimensional dataset by identifying the
principal components, which are the directions of maximum
variance in the data.

lOMoARcPSD|51109247
Steps Involved in PCA:
 Standardization: Normalize the data to have a mean of 0 and standard

deviation of 1 (important when variables have different scales).
 Covariance Matrix: Calculate the covariance matrix of the data to
understand relationships between variables.

 Eigenvalues and Eigenvectors: Derive eigenvalues (which tell the amount
of variance captured) and eigenvectors (which give the direction of the
principal components).
 Projection: Project the data onto the eigenvectors corresponding to the
largest eigenvalues to obtain the reduced dataset.

lOMoARcPSD|51109247
EXAMPLE :
https://youtu.be/ZtS6sQUAh0c?si=JxbHv8GN_3foLsAB
Applications of PCA:
 Data Compression: Reducing the dataset's size without much loss of

information.
 Noise Reduction: Eliminating components with low variance that often
represent noise.
 Visualization: Reducing data to 2 or 3 dimensions for easier visualization.
Stochastic Gradient Decent:

Stochastic Gradient Descent (SGD) is a powerful and efficient optimization
method for training machine learning models, particularly when dealing with large
datasets. Its ability to perform frequent updates and handle large-scale data makes
it the backbone of many deep learning frameworks.
1. Batch Gradient Descent:

lOMoARcPSD|51109247
a. Uses the entire dataset to compute the gradient of the loss function.
b. Pros: Exact gradient.
c. Cons: Expensive for large datasets (slow for training large models).
2. Stochastic Gradient Descent (SGD):
a. Instead of using the entire dataset, it updates the parameters for each
training example (or for a small batch).
b. Pros: Much faster for large datasets, more memory efficient.
c. Cons: Noisy updates can lead to fluctuations in the loss function.
How Stochastic Gradient Descent Works
In SGD, the parameters (weights) are updated after every individual training
example (or sometimes after a small subset, called a mini-batch). The weight
update rule in SGD is given by:
Steps of Stochastic Gradient Descent
1. Random Initialization: Randomly initialize the parameters (weights) of the

model.
2. ShufÒe the Data: ShufÒe the training dataset to avoid any bias from the
ordering of data.
3. Iterative Updates:
a. For each training example , compute the gradient of the

loss function with respect to the model parameters.
b. Update the parameters using the gradient and the learning rate.
4. Convergence: Repeat the updates for each training sample for multiple
iterations until the loss function converges to a minimum value (or stops
improving).

lOMoARcPSD|51109247
building a Machine Learning Algorithm

1. Problem Definition
 Identify the Objective: Clearly define what you want to

predict or achieve. For example, is it a classification problem
(e.g., spam detection) or a regression problem (e.g.,
predicting house prices)?
 Define the Success Metrics: Accuracy, precision, recall,
F1-score, or mean squared error (MSE) depending on the
problem.
2. Data Collection
 Source Data: Gather the data relevant to the problem. It

can come from databases, sensors, APIs, or online datasets.
 Types of Data: Structured (e.g., spreadsheets, CSV) or
unstructured (e.g., images, text, audio).
 Labeling: For supervised learning, make sure the data is
labeled (e.g., “cat” or “dog” for image classification).

lOMoARcPSD|51109247
3. Data Preprocessing
 Data Cleaning: Handle missing values, remove duplicates,

and deal with noisy data.
 Feature Selection/Extraction: Select important features
(variables) and discard irrelevant ones. Feature extraction
can include methods like PCA (Principal Component Analysis)
for dimensionality reduction.
 Normalization/Standardization: Scale numerical data to a
common scale (e.g., normalize between 0-1 or standardize to
zero mean and unit variance).
 Encoding Categorical Variables: Convert categorical
variables into numerical form (e.g., one-hot encoding).
4. Splitting the Dataset
 Training Set: Typically 70-80% of the data, used to train the

model.
 Test Set: Typically 20-30% of the data, used to evaluate
model performance on unseen data.
 Validation Set (optional): Used for fine-tuning the model
and preventing overfitting.
5. Choose an Algorithm
Select the appropriate machine learning algorithm based on the

problem type:
 Supervised Learning:
o Classification: Logistic Regression, Decision Trees,
Random Forest, Support Vector Machines (SVM), k-
Nearest Neighbors (k-NN), Neural Networks.
o Regression: Linear Regression, Ridge Regression,
Lasso, Support Vector Regressor, Neural Networks.
 Unsupervised Learning: K-means clustering, Hierarchical
clustering, DBSCAN, PCA, t-SNE.
 Reinforcement Learning: Q-learning, Deep Q Networks
(DQN), etc.

lOMoARcPSD|51109247
6. Model Training
 Fit the Model: Use the training set to train the model by
minimizing the loss function.
 Hyperparameter Tuning: Adjust hyperparameters (e.g.,
learning rate, number of trees in Random Forest, number of
layers in a Neural Network) to improve performance.
 Cross-Validation: Apply k-fold cross-validation to avoid
overfitting and ensure the model generalizes well to unseen
data.
7. Model Evaluation
 Evaluate the model using metrics relevant to the problem:

o Classification Metrics: Accuracy, Precision, Recall, F1-
score, ROC-AUC.
o Regression Metrics: Mean Squared Error (MSE), Root
Mean Squared Error (RMSE), R-squared.
 Confusion Matrix: For classification tasks, a confusion
matrix helps visualize true positives, true negatives, false
positives, and false negatives.
8. Model Improvement
 Feature Engineering: Create new features based on

domain knowledge.
 Tuning: Continue fine-tuning hyperparameters or try more
complex algorithms (e.g., switching from logistic regression
to a neural network).
 Ensemble Methods: Combine multiple models (e.g.,
bagging, boosting, stacking) to improve accuracy.
 Reduce Overfitting: Techniques like regularization (L1, L2),
dropout (in neural networks), or increasing the amount of
data can help reduce overfitting.
9. Model Deployment
 Save the Model: Serialize the trained model using libraries

like pickle or joblib in Python.

lOMoARcPSD|51109247
 Deployment: Use APIs or cloud platforms (e.g., AWS,

Google Cloud) to deploy the model for production use.
 Monitoring: Continuously monitor model performance in
production and retrain if necessary.
10. Iterate and Improve
Machine learning models typically require iterations. After

deployment, track how the model performs on new data and
periodically retrain and refine the model.
ChallengesMotivating Deep Learning

High-Dimensional Data (Curse of Dimensionality)
 Challenge: In traditional machine learning algorithms, as

the number of features (or dimensions) increases, the
volume of the feature space grows exponentially. This makes
it difÏcult for the algorithm to generalize and leads to
overfitting.
 Deep Learning Motivation: Deep learning, with its ability
to automatically learn hierarchical feature representations,
excels in handling high-dimensional data like images, videos,
and text. Neural networks can capture complex relationships
in large datasets without explicit feature engineering.
2. Feature Engineering
 Challenge: In traditional machine learning, the quality of

the model often depends heavily on manually engineered
features. This process is labor-intensive, requires domain
expertise, and doesn't always capture complex patterns.
 Deep Learning Motivation: Deep learning models,
especially Convolutional Neural Networks (CNNs) for image
data and Recurrent Neural Networks (RNNs) for sequential
data, automatically learn features from raw data. This
reduces the need for manual feature engineering and allows
the model to discover relevant features on its own.

lOMoARcPSD|51109247
3. Complex Data Structures and Unstructured Data
 Challenge: Traditional algorithms struggle to work directly

with unstructured data, such as images, audio, and natural
language text. They often require converting these forms of
data into structured formats, which can lead to a loss of
information.
 Deep Learning Motivation: Deep learning models are
designed to handle complex, unstructured data. For
example:
o CNNs process image data by learning spatial
hierarchies.
o RNNs and Long Short-Term Memory (LSTM) networks
handle sequential data like time series or language
models.
o Transformers, such as those used in NLP models (e.g.,
GPT and BERT), excel in natural language
understanding and generation.
4. Scalability with Large Datasets
 Challenge: As data grows, traditional machine learning

algorithms often face performance degradation due to
limitations in memory and processing power. Large datasets
require advanced techniques for efÏcient processing.
 Deep Learning Motivation: Deep learning architectures,
particularly deep neural networks, scale better with large
datasets. These models often improve their performance as
more data is provided, leveraging advances in parallel
computing (e.g., GPUs) to efÏciently process massive
datasets.
5. Nonlinear and Complex Relationships
 Challenge: Many real-world problems exhibit complex, non-

linear relationships between inputs and outputs. Traditional
algorithms like linear regression or decision trees may not

lOMoARcPSD|51109247
fully capture these complexities, requiring manual feature

transformation or more complex models.
 Deep Learning Motivation: Neural networks inherently
capture complex, non-linear patterns due to their layered
architecture, where each layer can apply nonlinear
transformations to the input data. This allows them to
approximate intricate relationships that simpler models
cannot.
6. End-to-End Learning
 Challenge: Traditional machine learning pipelines often

require separate stages for feature extraction,
transformation, and modeling, leading to inefÏciencies and
potential error propagation across stages.
 Deep Learning Motivation: Deep learning models are
designed for end-to-end learning. For example, in a CNN-
based image recognition model, the raw image is fed into
the network, and the model learns to perform feature
extraction, pattern recognition, and classification all in one
go.
7. Transfer Learning
 Challenge: In traditional machine learning, models are

usually built from scratch for each task, and knowledge
learned from one task often doesn’t transfer easily to
another.
 Deep Learning Motivation: Transfer learning in deep
learning allows a model trained on a large dataset (e.g.,
ImageNet) to be fine-tuned on a different, smaller dataset.
This significantly reduces training time and computational
resources while improving model accuracy on tasks with
limited data.

lOMoARcPSD|51109247
8. Overfitting and Generalization
 Challenge: Traditional machine learning algorithms often

suffer from overfitting, especially when the dataset is small,
or the model is overly complex for the available data.
 Deep Learning Motivation: Although deep learning
models are complex and prone to overfitting, several
regularization techniques—such as dropout, batch
normalization, and data augmentation—help mitigate
overfitting and improve generalization on unseen data.
9. Parallel Computing and Hardware Advances
 Challenge: Traditional machine learning algorithms often do

not fully leverage modern hardware advancements,
especially parallel computing on GPUs.
 Deep Learning Motivation: Deep learning models,
particularly deep neural networks, are highly parallelizable
and can be efÏciently trained using GPUs and TPUs (Tensor
Processing Units), reducing training time significantly for
large models and datasets.

Deep Learning Module 1

Uploaded by

Deep Learning Module 1

Uploaded by

lOMoARcPSD|51109247

Deep Learning Module 1

Machine Learning (MVJ College of Engineering)

Scan to open on Studocu

Studocu is not sponsored or endorsed by any college or university

Machine Learning Basics: Learning Algorithms, Capacity, Overfitting and Underfitting,

Machine Learning Basics:

 Definition and Example of Learning Algorithm:

Downloaded by Anusha (anusha.mitts@gmail.com)

 Introduction to the two central approaches to statistics: frequentist

A machine learning algorithm is defined as an algorithm that learns from data.

 Definition of Task (T):

Downloaded by Anusha (anusha.mitts@gmail.com)

 Definition of Classification Task:

This approach is efÏcient because instead of building a separate model for

Downloaded by Anusha (anusha.mitts@gmail.com)

1.7 Anomaly detection:

 Identifying unusual or atypical events or objects.

Synthesis and sampling are techniques used in machine learning to create

1.9. Imputation of missing values:

Downloaded by Anusha (anusha.mitts@gmail.com)

 Predicting a clean example (image)from a corrupted version.

1.11. Density estimation or probability mass function estimation:

 Learning a probability distribution over the data.

 A quantitative measure of a machine learning algorithm's performance.

Choosing Performance Measures:

 DifÏcult to choose a measure that corresponds well to desired

Downloaded by Anusha (anusha.mitts@gmail.com)

Supervised learning is a type of machine learning where the algorithm is provided

 A collection of examples, which are collections of features.

Consider a dataset of information about houses. Each example (row) might

Other Learning Paradigms:

 Semi-supervised learning: Some examples have labels, others do

Example: Linear Regression

Downloaded by Anusha (anusha.mitts@gmail.com)

Types of Linear Regression:

Simple Linear Regression: Involves only one independent variable.

Downloaded by Anusha (anusha.mitts@gmail.com)

 Model Adaptation: Fitting refers to the process of adjusting a machine learning

Downloaded by Anusha (anusha.mitts@gmail.com)

Downloaded by Anusha (anusha.mitts@gmail.com)

Distance betweenactual value and predicted value-error.We need to choose

Line with least error-Regression line.

The least squares method is a widely used mathematical technique for

RSS = Σ(yi - ŷi)²

Downloaded by Anusha (anusha.mitts@gmail.com)

Multiple Linear Regression:

Downloaded by Anusha (anusha.mitts@gmail.com)

 Predicting house prices based on factors like square footage, number of

 Linearity assumption: If the relationship is not linear, linear regression may

Capacity, Overfitting and Underfitting

Downloaded by Anusha (anusha.mitts@gmail.com)

 Discovering Patterns: The goal of machine learning is to

Downloaded by Anusha (anusha.mitts@gmail.com)

 Regularization: Penalizing complex models to prevent overfitting.

The image illustrates the concepts of underfitting, appropriate capacity, and

 Description: The model is too simple to capture the underlying pattern in

Downloaded by Anusha (anusha.mitts@gmail.com)

 Training Error: The error measured on the training data.

2. Training Error: As the model's capacity increases, the training error

Downloaded by Anusha (anusha.mitts@gmail.com)

1.No Free Lunch Theorem:

Downloaded by Anusha (anusha.mitts@gmail.com)

 A technique used to modify a learning algorithm to reduce its generalization

Other Forms of Regularization:

 L1 regularization: Penalizes the absolute value of the weights.

Downloaded by Anusha (anusha.mitts@gmail.com)

Hyperparameters and Validation Sets

 Hyperparameters are settings used to control the behavior of a

 Polynomial Regression: The degree of the polynomial is a

 A validation set is used to estimate the generalization error of a model

Downloaded by Anusha (anusha.mitts@gmail.com)

 It is crucial because hyperparameters cannot be learned from the

 The training data is split into two subsets:

Validation vs. Test Sets:

Cross-validation is a technique used in machine learning