DEEP LEARNING
MODULE 1
Machine learning:
Machine learning is a subset of artificial intelligence (AI) that allows machines
to learn and improve from experience without being explicitly programmed.
Mitchell's Definition:
Mitchell (1997) defines learning as: “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E.”
Components of Learning:
Experience (E): The data or experience from which the algorithm learns.
Tasks (T): The specific tasks the algorithm is designed to perform.
Performance Measure (P): The criteria used to measure the algorithm's performance on
the tasks.
1. The Task, T
1.1. Classification:
Classification: Assigning an input to one of k discrete categories (e.g., diagnosing a disease from patient data, recognizing objects in images).
1.2. Classification with Missing Inputs:
Solution:
Instead of training the model for just one set of inputs (all of them available), the model needs to handle every combination of missing inputs.
Missing inputs make the problem harder because the model must handle many scenarios (different inputs missing for different patients).
The model learns to classify (diagnose) even when some information is unavailable by focusing on what it does know.
1.3. Regression
Regression: Predicting a numerical value given input data.
Examples:
Predicting Housing Prices
Predicting Stock Prices
1.4. Transcription
Transcribing unstructured data into discrete, textual form.
Examples: Optical character recognition, speech recognition.
1.5. Machine translation:
Converting text from one language to another.
1.6. Structured Output:
Structured output tasks are a type of machine learning problem where the
output is not just a single value, but a collection of values that are
interconnected or interdependent.
Examples:
1. Parsing:
o Input: A sentence in natural language.
o Output: A parse tree representing the grammatical structure of
the sentence, with nodes labeled as nouns, verbs, adjectives,
etc.
2. Pixel-wise Segmentation:
Input: An image.
Output: A labeled image where each pixel is assigned to a specific
category (e.g., "sky," "road," "car").
3. Image Captioning:
Input: An image.
Output: A natural language sentence describing the image.
1.9. Imputation of Missing Values:
Often in real-world datasets, certain values are missing. This can be due to various reasons like data collection errors, equipment failures, or privacy concerns. Predicting these missing values is crucial for data analysis, machine learning, and decision-making.
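As an illustration (my own sketch, not from the original notes), missing values can be filled in with a simple strategy such as column means, here using scikit-learn's SimpleImputer on a made-up array:

# Minimal imputation sketch: replace each missing entry with its column mean.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])      # hypothetical data with missing entries

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)                     # missing values replaced by column means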
1.10. Denoising:
Predicting a clean (uncorrupted) example from a corrupted version of it.
2. Performance Measures, P
A quantitative measure of how well the model performs task T, for example accuracy or error rate for classification and mean squared error for regression. It is usually evaluated on a test set of data separate from the data used for training.
3. The Experience, E
Machine learning algorithms can experience different kinds of data, which influence the learning process and the types of tasks they can perform. Unsupervised learning focuses on understanding the structure of the data, while supervised learning involves predicting targets based on features.
Unsupervised learning:
Unsupervised learning is a type of machine learning where the algorithm is not
provided with any labels or target values. Instead, it is given a dataset of raw
data and must learn to find patterns (similarities), structures, or relationships within
the data on its own.
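As a small illustration of finding structure without labels (my own sketch, not from the notes), k-means clustering groups unlabeled 2-D points into clusters:

# Minimal unsupervised-learning sketch: cluster unlabeled points into 2 groups.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [8, 8], [8, 9], [1, 0.5], [9, 8]])  # toy unlabeled data

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment found for each point
print(kmeans.cluster_centers_)  # learned cluster centres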
Supervised learning:
Supervised learning is a type of machine learning where each example in the dataset is paired with a label or target value, and the algorithm learns to predict the target from the input features.
Datasets:
A dataset is a collection of examples, each described by a set of features (and, for supervised learning, an associated target).
Example:
Square footage
Number of bedrooms
Number of bathrooms
Location
Price
Here x and y are actual values from the dataset. Given a new value x = 6, we have to predict y. To do that, we draw the best-fit line.
Fitting
Equation of a line: y = mx + c
Taking slope m = 0.4 and the data point (x, y) = (3, 3.6):
3.6 = 0.4 × 3 + c
3.6 = 1.2 + c
c = 3.6 − 1.2 = 2.4
Best-fit line: Try different values of m; the line for which the difference between the actual values and the predicted values is smallest is selected as the best-fit line.
The least squares method is based on the principle of minimizing the overall
error between the observed data points and the predicted values from the model.
Formula:
The least squares method involves minimizing the sum of squared residuals (RSS), which is given by:
RSS = Σ from i = 1 to n of (yi − ŷi)²
where:
yi is the observed value of the dependent variable for the i-th data point
ŷi is the predicted value of the dependent variable for the i-th data point
Goal: The goal of linear regression is to find the best values for the regression coefficients that minimize the sum of squared errors between the predicted values and the actual values.
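A minimal least-squares sketch in code (my own illustration, using made-up points that lie on the line from the example above, y = 0.4x + 2.4):

# Least-squares sketch: fit y = m*x + c and predict y at x = 6.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical feature values
y = np.array([2.8, 3.2, 3.6, 4.0, 4.4])      # hypothetical targets on y = 0.4x + 2.4

m, c = np.polyfit(x, y, deg=1)               # minimizes the sum of squared residuals
print(m, c)                                  # ~0.4 and ~2.4
print(m * 6 + c)                             # prediction at x = 6 -> ~4.8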
Applications:
Limitations:
o Overfitting
o Overfitting occurs when a model learns the training data too well, to
the point where it starts to memorize the noise or idiosyncrasies in the
data rather than learning the underlying patterns. This leads to poor
performance on new, unseen data.
o Signs of Overfitting:
Very low training error but high test error.
A complex model with many parameters.
The model performs exceptionally well on the training data but poorly on the
validation or test data.
o Underfitting
o Underfitting occurs when a model is too simple to capture the underlying
patterns in the data. This leads to poor performance on both the training and
test data.
o Signs of Underfitting:
High training error and high test error.
A simple model with few parameters.
The model consistently performs poorly on both the training and test data.
o Balancing Capacity:
o The goal is to find a model with just the right capacity to capture the underlying patterns without overfitting. This is often achieved through techniques like regularization, cross-validation, and early stopping (discussed below).
Underfitting:
Description: The model is too simple to capture the underlying pattern in the data.
Visualization: The regression line misses the trend of the data points.
Impact: The model has high error on both the training and test sets.
Appropriate Capacity:
Description: The model has the right complexity to capture the underlying
pattern in the data.
Visualization: The regression line fits the data points well without
overfitting.
Impact: The model generalizes well to new data and has low error on both
the training and test sets.
Overfitting:
Description: The model is too complex and learns the noise in the data
rather than the underlying pattern.
Visualization: The regression line is a highly complex curve that fits the
training data points perfectly but may not generalize well to new data.
Impact: The model performs well on the training set but poorly on the test
set.
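To make the three cases concrete, here is a small sketch (my own, with made-up noisy data): a degree-1 polynomial underfits, degree 3 is about right, and degree 9 chases the noise:

# Capacity sketch: compare training error of too-simple, appropriate, and too-complex fits.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy sine wave

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)        # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y - y_hat) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.4f}")
# Training error keeps falling as the degree grows, but the high-degree fit follows
# the noise and generalizes poorly to new x values.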
Key Elements:
Explanation:
1. Training Error: Decreases steadily as model capacity increases.
2. Generalization Error: Decreases at first, then rises again at high capacity.
3. Generalization Gap: The difference between training and generalization error, which widens as the model learns the noise in the training data rather than the underlying patterns.
4. Optimal Capacity: The optimal capacity is the point where the
generalization error is minimized. This is the region where the model has just
the right amount of complexity to capture the underlying patterns without
overfitting.
5. Underfitting and Overfitting Zones: The underfitting zone is where the
model is too simple, leading to high training and generalization errors. The
overfitting zone is where the model is too complex, leading to low training
error but high generalization error.
Optimal Capacity: The optimal model capacity is the complexity level that
best balances underfitting and overfitting for a given task and dataset.
Training Examples: The number of examples in the training set affects the
optimal capacity.
Increasing Training Examples: As the number of training examples
increases, the optimal capacity generally increases. This is because with
more data, the model can learn more complex patterns without overfitting.
Plateau: Eventually, the optimal capacity plateaus, meaning that adding
more training examples does not significantly improve the model's
performance. This is because the model has reached its maximum capacity
for the given task.
Error Bars: The error bars indicate the uncertainty in the estimated optimal
capacity. They represent the range of possible values for the optimal capacity.
2. Regularization
Regularization:
Regularization is any modification made to a learning algorithm that is intended to reduce its generalization (test) error but not its training error.
L2 regularization:
Adds a penalty proportional to the sum of the squared weights to the loss function. The model is encouraged to learn smaller weights, which can help prevent overfitting.
Dropout: Randomly drops neurons during training to prevent overfitting.
Early stopping: Stops training when the performance on a validation set
starts to deteriorate.
Examples:
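For instance, here is a minimal sketch (my own illustration, not from the notes) of L2 regularization in the form of ridge regression with scikit-learn, where alpha controls how strongly large weights are penalized:

# L2-regularization sketch: ridge regression shrinks the weights toward zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                                     # hypothetical features
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.1, size=30)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)    # alpha sets the strength of the L2 penalty

print(plain.coef_)   # unregularized weights
print(ridge.coef_)   # smaller (shrunken) weights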
Validation Sets
Role: Used during model development to compare candidate models and hyperparameter settings without touching the test set.
Construction: Typically carved out of the training data (e.g., roughly 80% of the training examples for training and 20% for validation), while the test set is kept separate.
Test Set: Used only after model training and hyperparameter tuning to
estimate true generalization error.
Validation Set: Hyperparameter tuning and model selection. The
validation set helps to fine-tune hyperparameters like learning rate,
regularization strength, or the degree of a polynomial in polynomial
regression.
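A small sketch (my own) of how such splits are often carved out with scikit-learn's train_test_split, using made-up arrays X and y:

# Split sketch: 60% train / 20% validation / 20% test.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # hypothetical features
y = np.arange(50)                   # hypothetical targets

# First hold out 20% as the test set, then split the rest into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 30, 10, 10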
1. Cross-Validation
Example: k = 5 (5-fold cross-validation).
Steps/Algorithm
1. Split Data:
a. Divide the dataset D into k mutually exclusive subsets.
2. Iterate:
a. For each fold i from 1 to k:
i. Train the model A on the dataset D excluding the i-th fold.
ii. Evaluate the model's performance on the i-th fold using the loss function L.
iii. Record the error for each example in the i-th fold.
3. Return Average Error:
a. Calculate the average error over all examples in all
folds.
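A minimal sketch of these steps (my own illustration, with k = 5 as above) using scikit-learn's KFold on made-up data:

# 5-fold cross-validation sketch: train on k-1 folds, evaluate on the held-out fold.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                     # hypothetical dataset D
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on D minus fold i
    y_hat = model.predict(X[test_idx])                           # evaluate on fold i
    errors.append(mean_squared_error(y[test_idx], y_hat))        # record the fold's error

print(np.mean(errors))                                           # average error over all folds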
Estimators
1. Point estimation is a statistical method used to estimate a population parameter based on a sample of data. It involves calculating a single value (a point) that best represents the true value of the population parameter.
Example: The sample mean computed from a random sample is a point estimate of the population mean.
Key Components:
Unbiased Estimator:
An estimator whose expected value equals the true value of the parameter being estimated (i.e., its bias is zero).
Interpretation:
Standard error decreases as the sample size increases, implying that larger
samples yield more precise estimates of population parameters.
Trade-off: Reducing an estimator's bias often increases its variance (and vice versa); a good estimator balances the two, for example by minimizing the mean squared error.
Example:
Maximum Likelihood Estimation (MLE)
Basic Idea: Choose the parameter values that make the observed data most probable, i.e., the values that maximize the likelihood function of the data.
Asymptotic Efficiency: MLE has the minimum possible variance for large
samples.
Invariance: The MLE of a transformed parameter is the transformed MLE of
the original parameter.
Bias: MLE may be biased for small samples but becomes unbiased as the
sample size grows.
Flexibility: MLE can be applied to a wide range of probabilistic models.
Sample Size Dependence: MLE performs better with larger datasets.
Likelihood Principle: The entire process of MLE estimation is based on the
likelihood of the observed data.
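As a concrete sketch (mine, not from the notes): for i.i.d. Gaussian data the MLE has a closed form, namely the sample mean and the biased (divide-by-n) sample variance, which matches the bias remark above:

# MLE sketch for a Gaussian: mu_hat = sample mean, sigma2_hat = mean squared deviation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # hypothetical observed data

mu_hat = x.mean()                               # maximizes the likelihood over mu
sigma2_hat = np.mean((x - mu_hat) ** 2)         # biased MLE of the variance (divides by n)

print(mu_hat, sigma2_hat)                       # close to 5.0 and 4.0 for a large sample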
Bayesian Statistics
Bayesian statistics is a branch of statistics that uses Bayes' theorem to update the
probability of a hypothesis as more evidence or data becomes available.
Key Components:
Let’s say you want to estimate the probability that it will rain tomorrow. You have
some prior knowledge based on weather patterns, and you can also observe a
weather report.
Prior: You initially believe there's a 30% chance of rain based on past
experience. This is your prior knowledge.
Data (Likelihood): According to the weather report, if it is going to rain there is a 90% chance that clouds appear, i.e., P(cloudy | rain) = 0.9, and you observe that it is cloudy.
However, even on a day when it does not rain, there is still a 50% chance that clouds appear, i.e., P(cloudy | no rain) = 0.5.
You now want to update your belief about the probability of rain (posterior) after
observing that it is cloudy. You use Bayes’ theorem to compute this.
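As a worked application of Bayes' theorem (my own arithmetic, interpreting the report's 90% as P(cloudy | rain) and using the prior P(rain) = 0.3):
P(rain | cloudy) = P(cloudy | rain) × P(rain) / [P(cloudy | rain) × P(rain) + P(cloudy | no rain) × P(no rain)]
= (0.9 × 0.3) / (0.9 × 0.3 + 0.5 × 0.7)
= 0.27 / 0.62 ≈ 0.44
So, after seeing the clouds, the belief that it will rain rises from 30% to roughly 44%.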
Disadvantages:
The choice of prior can be subjective, and computing the posterior can be computationally expensive for complex models.
1. Linear Regression
Type: Regression
Use Case: Predicting continuous numerical values (e.g., house prices, stock prices).
How it Works: Linear regression fits a linear function of the input features by minimizing the sum of squared errors between the predicted and actual values.
2. Logistic Regression
Type: Classification
Use Case: Binary classification tasks (e.g., spam detection, fraud detection).
How it Works: Logistic regression estimates probabilities using a logistic
(sigmoid) function and maps inputs to discrete labels (0 or 1).
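A minimal sketch (mine, on made-up data) of the sigmoid mapping and a scikit-learn LogisticRegression classifier:

# Logistic-regression sketch: sigmoid squashes scores into probabilities in (0, 1).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))                 # ~0.12, 0.5, ~0.88

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])   # hypothetical feature
y = np.array([0, 0, 0, 1, 1, 1])                           # binary labels

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]]))                          # estimated P(class 0), P(class 1) at x = 2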
3. Support Vector Machine (SVM)
Type: Classification
Key Concepts:
Goal of SVMs:
To find the hyperplane that maximizes the margin between the two classes.
This helps to improve generalization performance and reduce overfitting.
Misclassifications should be kept to a minimum.
Hard Margin: No misclassifications are allowed; the data must be perfectly linearly separable.
Soft Margin: Some misclassifications are allowed (controlled by slack variables and the penalty parameter C), which makes the classifier robust to noisy or overlapping data.
Kernel Trick:
Used to map data into a higher-dimensional feature space without explicitly
computing the coordinates of the transformed data points. This is achieved by using
a kernel function.
Kernel Function
A kernel function is a function that takes two inputs and returns a scalar value
representing the similarity between them in a transformed feature space.
Advantages of SVM:
Effective in high-dimensional spaces
Robust to overfitting
Works well with clear margin of separation
Handles non-linear data
Disadvantages of SVM:
Not suitable for large datasets
Difficult to choose the right kernel
Poor performance with overlapping classes
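A minimal sketch (my own, on made-up data) of a soft-margin SVM with an RBF kernel in scikit-learn; C controls the margin/misclassification trade-off and the kernel handles a non-linear boundary:

# SVM sketch: soft-margin classifier with an RBF kernel on a toy non-linear problem.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # label = point lies outside the unit circle

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))                                # training accuracy
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))          # inside -> 0, far outside -> 1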
Principal Component Analysis (PCA):
PCA is a dimensionality-reduction technique. After centering the data and computing its covariance matrix:
Eigenvalues and Eigenvectors: Derive eigenvalues (which tell the amount of variance captured) and eigenvectors (which give the direction of the principal components).
Projection: Project the data onto the eigenvectors corresponding to the
largest eigenvalues to obtain the reduced dataset.
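A minimal NumPy sketch of these steps (my own illustration, on made-up data): center the data, compute the covariance matrix, take its eigenvectors, and project onto the top components:

# PCA sketch: reduce toy 3-D data to its top-2 principal components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # hypothetical data: 100 samples, 3 features

Xc = X - X.mean(axis=0)                  # 1. center the data
cov = np.cov(Xc, rowvar=False)           # 2. covariance matrix (3 x 3)
eigvals, eigvecs = np.linalg.eigh(cov)   # 3. eigenvalues/eigenvectors (ascending order)

order = np.argsort(eigvals)[::-1]        # sort components by explained variance
top2 = eigvecs[:, order[:2]]
X_reduced = Xc @ top2                    # 4. project onto the top-2 components

print(eigvals[order])                    # variance captured by each component
print(X_reduced.shape)                   # (100, 2)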
EXAMPLE :
https://youtu.be/ZtS6sQUAh0c?si=JxbHv8GN_3foLsAB
Applications of PCA:
Dimensionality reduction, data visualization, noise reduction, and image compression.
Gradient Descent: An optimization algorithm that iteratively updates the model parameters in the direction of the negative gradient of the loss function.
1. Batch Gradient Descent:
a. Uses the entire dataset to compute the gradient of the loss function.
b. Pros: Exact gradient computation.
c. Cons: Expensive for large datasets (slow for training large models).
2. Stochastic Gradient Descent (SGD):
a. Instead of using the entire dataset, it updates the parameters for each
training example (or for a small batch).
b. Pros: Much faster for large datasets, more memory efficient.
c. Cons: Noisy updates can lead to fluctuations in the loss function.
In SGD, the parameters (weights) are updated after every individual training example (or sometimes after a small subset, called a mini-batch). The weight update rule in SGD is given by:
w ← w − η · ∇L(w; xi, yi)
where η is the learning rate and ∇L(w; xi, yi) is the gradient of the loss computed on the single example (xi, yi).
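A minimal sketch (mine, on made-up data) of this update rule for linear regression with squared-error loss:

# SGD sketch: update w and b after every single example using the squared-error gradient.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 0.4 * x + 2.4 + rng.normal(scale=0.05, size=200)   # data near the line from the earlier example

w, b = 0.0, 0.0
eta = 0.1                                              # learning rate
for epoch in range(20):
    for xi, yi in zip(x, y):
        y_hat = w * xi + b
        grad_w = 2 * (y_hat - yi) * xi                 # d/dw of (y_hat - y)^2
        grad_b = 2 * (y_hat - yi)                      # d/db of (y_hat - y)^2
        w -= eta * grad_w                              # parameter update after one example
        b -= eta * grad_b

print(w, b)                                            # approaches 0.4 and 2.4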
2. Data Collection
Gather data relevant to the problem from available sources (databases, sensors, files, APIs).
3. Data Preprocessing
Clean the data: handle missing values and outliers, encode categorical features, and scale or normalize numerical features. Split the data into training, validation, and test sets.
Model Selection: choose an algorithm suited to the task, for example:
Supervised Learning:
o Classification: Logistic Regression, Decision Trees,
Random Forest, Support Vector Machines (SVM), k-
Nearest Neighbors (k-NN), Neural Networks.
o Regression: Linear Regression, Ridge Regression,
Lasso, Support Vector Regressor, Neural Networks.
Unsupervised Learning: K-means clustering, Hierarchical
clustering, DBSCAN, PCA, t-SNE.
Reinforcement Learning: Q-learning, Deep Q Networks
(DQN), etc.
6. Model Training
Fit the Model: Use the training set to train the model by
minimizing the loss function.
Hyperparameter Tuning: Adjust hyperparameters (e.g.,
learning rate, number of trees in Random Forest, number of
layers in a Neural Network) to improve performance.
Cross-Validation: Apply k-fold cross-validation to avoid
overfitting and ensure the model generalizes well to unseen
data.
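A compact sketch (my own illustration) that ties these training steps together with scikit-learn: fit a random forest on the training split and use 5-fold cross-validation to check generalization:

# Model-training sketch: fit a random forest and check it with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)   # hyperparameter: number of trees
model.fit(X_train, y_train)                                        # minimize loss on the training set

print(cross_val_score(model, X_train, y_train, cv=5).mean())       # k-fold CV accuracy
print(model.score(X_test, y_test))                                 # held-out test accuracy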
7. Model Evaluation
Evaluate the trained model on the held-out test set using appropriate metrics (e.g., accuracy, precision/recall, F1-score for classification; MSE or R² for regression).
9. Model Deployment
Deploy the trained model to production and monitor its performance on real-world data so it can be retrained when performance degrades.
6. End-to-End Learning