Stochastic Gradient Descent
(SGD)
Jaskaran Singh
12591001
C.S.E., 1st Sem.
Introduction to Gradient Descent
• Gradient Descent is an optimization algorithm used to minimize the cost function in machine learning.
• It works by iteratively adjusting the parameters in the direction opposite to the gradient of the cost function (a minimal sketch follows below).
• It is used in training linear regression, logistic regression, neural networks, and more.
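A minimal sketch of the update rule described above, assuming a generic cost J and a hypothetical helper grad_J that returns its gradient; the learning rate and the toy quadratic cost are illustrative choices, not part of the original slides.

```python
import numpy as np

# Minimal sketch (assumed names): one step of gradient descent on a generic cost J.
# grad_J returns the gradient of the cost at the current parameter values.
def gradient_descent_step(theta, grad_J, learning_rate=0.1):
    # Move the parameters in the direction opposite to the gradient of the cost.
    return theta - learning_rate * grad_J(theta)

# Toy example: minimizing J(theta) = theta**2, whose gradient is 2*theta.
theta = np.array([5.0])
for _ in range(50):
    theta = gradient_descent_step(theta, lambda t: 2 * t)
print(theta)  # moves toward the minimum at 0
```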
What is Stochastic Gradient Descent (SGD)?
• Instead of using the entire dataset, SGD updates the parameters using only one sample at a time (see the update sketch below).
• This makes each update faster and introduces randomness, which helps the optimizer escape local minima.
• It is widely used for training large-scale machine learning models and deep learning networks.
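To make the "one sample at a time" idea concrete, here is a sketch of a single SGD update, assuming squared-error loss for linear regression; the loss choice and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

# One SGD update: gradient from a SINGLE randomly chosen sample, not the whole dataset.
def sgd_update(w, X, y, learning_rate=0.01):
    i = rng.integers(len(y))             # pick one random training sample
    x_i, y_i = X[i], y[i]
    grad = 2 * x_i * (x_i @ w - y_i)     # gradient of the squared error on that sample
    return w - learning_rate * grad      # cheap, noisy parameter update
```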
Why Escape Local Minima?
• Local minimum = a small valley (not the best solution).
• Global minimum = the deepest valley (lowest error, best solution).
• If the model gets stuck in a local minimum, its accuracy is not the best possible.
• SGD’s randomness acts like a “shake,” helping the model escape small valleys and move closer to the best valley.
How SGD Works
1. Initialize the model parameters.
2. Select a random sample from the training data.
3. Compute the gradient of the loss function for that sample.
4. Update the parameters using the gradient.
5. Repeat steps 2–4 until convergence (a runnable sketch follows below).
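A runnable sketch of the five steps above, applied to linear regression with squared-error loss; the synthetic data, learning rate, and stopping rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # synthetic training data (assumption)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)                                # 1. initialize model parameters
learning_rate = 0.01
for epoch in range(100):
    w_old = w.copy()
    for i in rng.permutation(len(y)):          # 2. visit samples in random order
        x_i, y_i = X[i], y[i]
        grad = 2 * x_i * (x_i @ w - y_i)       # 3. gradient of the loss for that sample
        w -= learning_rate * grad              # 4. update the parameters
    if np.linalg.norm(w - w_old) < 1e-4:       # 5. repeat until the change is negligible
        break

print(w)  # should end up close to true_w
```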
Visual Representation
SGD vs Gradient Descent: Comparison
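One way to see the contrast: batch gradient descent averages the gradient over the entire dataset before each update, while SGD uses just one sample. A minimal sketch, again assuming squared-error loss for linear regression:

```python
import numpy as np

def batch_gradient(w, X, y):
    # Gradient Descent: average gradient over ALL samples -> smooth but costly per update.
    return 2 * X.T @ (X @ w - y) / len(y)

def stochastic_gradient(w, X, y, i):
    # SGD: gradient from sample i only -> cheap and noisy, many updates per pass.
    x_i, y_i = X[i], y[i]
    return 2 * x_i * (x_i @ w - y_i)
```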
Advantages & Disadvantages of SGD
• Advantages:
  • Faster updates on large datasets.
  • Helps escape local minima.
  • Suitable for real-time/online learning.
• Disadvantages:
  • Noisy convergence.
  • Requires careful tuning of the learning rate (a common remedy is sketched below).
  • May oscillate around the minimum.
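A common remedy for the tuning and oscillation issues above is to shrink the learning rate as training proceeds; a minimal sketch, where the schedule and constants are illustrative assumptions:

```python
def learning_rate(step, eta0=0.1, decay=0.001):
    # Step size decays over time, so late updates stay small and oscillate less.
    return eta0 / (1.0 + decay * step)
```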
Applications of SGD
• Training deep learning models (CNNs, RNNs, Transformers); see the framework sketch below.
• Online recommendation systems.
• Natural Language Processing (NLP).
• Large-scale optimization problems.
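As an illustration of SGD in deep-learning practice, here is a sketch using PyTorch's built-in SGD optimizer on a tiny placeholder network; the model, random data, and hyperparameters are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

# Placeholder model and data (assumptions), optimized with torch.optim.SGD.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

X = torch.randn(64, 10)            # random placeholder inputs
y = torch.randn(64, 1)             # random placeholder targets

for step in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(X), y)    # forward pass and loss
    loss.backward()                # backpropagation
    optimizer.step()               # SGD parameter update
```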