Stochastic Gradient Descent Algorithm With Python and NumPy - Real
Table of Contents
Basic Gradient Descent Algorithm
Cost Function: The Goal of Optimization
Gradient of a Function: Calculus Refresher
Intuition Behind Gradient Descent
Implementation of Basic Gradient Descent
Learning Rate Impact
Application of the Gradient Descent Algorithm
Short Examples
Ordinary Least Squares
Improvement of the Code
Stochastic Gradient Descent Algorithms
Minibatches in Stochastic Gradient Descent
Momentum in Stochastic Gradient Descent
Random Start Values
Gradient Descent in Keras and TensorFlow
Conclusion
Stochastic gradient descent is an optimization algorithm often used in machine learning applications to find the
model parameters that correspond to the best fit between predicted and actual outputs. It’s an inexact but powerful
technique.
The gradient descent algorithm is an approximate and iterative method for mathematical optimization. You can use it
to approach the minimum of any differentiable function.
Note: There are many optimization methods and subfields of mathematical programming. If you want to learn
how to use some of them with Python, then check out Scientific Python: Using SciPy for Optimization and
Hands-On Linear Programming: Optimization With Python.
Although gradient descent sometimes gets stuck in a local minimum or a saddle point instead of finding the global
minimum, it’s widely used in practice. Data science and machine learning methods often apply it internally to
optimize model parameters. For example, neural networks find weights and biases with gradient descent.
In a regression problem, you typically have the vectors of input variables 𝐱 = (𝑥₁, …, 𝑥ᵣ) and the actual outputs 𝑦. You
want to find a model that maps 𝐱 to a predicted response 𝑓(𝐱) so that 𝑓(𝐱) is as close as possible to 𝑦. For example, you
might want to predict an output such as a person’s salary given inputs like the person’s number of years at the
company or level of education.
Your goal is to minimize the difference between the prediction 𝑓(𝐱) and the actual data 𝑦. This difference is called the
residual.
In this type of problem, you want to minimize the sum of squared residuals (SSR), where SSR = Σᵢ(𝑦ᵢ − 𝑓(𝐱ᵢ))² for all
observations 𝑖 = 1, …, 𝑛, where 𝑛 is the total number of observations. Alternatively, you could use the mean squared
error (MSE = SSR / 𝑛) instead of SSR.
Both SSR and MSE use the square of the difference between the actual and predicted outputs. The lower the
difference, the more accurate the prediction. A difference of zero indicates that the prediction is equal to the actual
data.
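As a quick illustration of the two cost functions, here is a minimal sketch that computes SSR and MSE with NumPy. The arrays y and y_pred are made-up values chosen only for this example, not data from the tutorial:

```python
import numpy as np

# Hypothetical observations and predictions, made up for illustration
y = np.array([5.0, 20.0, 14.0, 32.0])
y_pred = np.array([8.0, 14.0, 19.0, 25.0])

residuals = y - y_pred          # actual minus predicted
ssr = np.sum(residuals ** 2)    # sum of squared residuals
mse = ssr / y.size              # mean squared error = SSR / n

print(ssr, mse)
```

Because MSE is just SSR divided by the constant 𝑛, both costs are minimized by the same model parameters.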
SSR or MSE is minimized by adjusting the model parameters. For example, in linear regression, you want to find the
function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, so you need to determine the weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ that minimize SSR or MSE.
In logistic regression, which is often used to solve classification problems, the functions 𝑝(𝐱) and 𝑓(𝐱) are defined as
the following: 𝑝(𝐱) = 1 / (1 + exp(−𝑓(𝐱))) and 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ.
Gradient of a Function: Calculus Refresher
In calculus, the derivative of a function shows you how much a value changes when you modify its argument (or
arguments). Derivatives are important for optimization because the zero derivatives might indicate a minimum,
maximum, or saddle point.
The gradient of a function 𝐶 of several independent variables 𝑣₁, …, 𝑣ᵣ is denoted with ∇𝐶(𝑣₁, …, 𝑣ᵣ) and defined as
the vector function of the partial derivatives of 𝐶 with respect to each independent variable: ∇𝐶 = (∂𝐶/∂𝑣₁, …, ∂𝐶/∂𝑣ᵣ).
The symbol ∇ is called nabla.
The nonzero value of the gradient of a function 𝐶 at a given point defines the direction and rate of the fastest increase
of 𝐶. When working with gradient descent, you’re interested in the direction of the fastest decrease in the cost
function. This direction is determined by the negative gradient, −∇𝐶.
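To make the definition concrete, the following sketch approximates a gradient with central differences and compares it with the analytical result. The helper numerical_gradient() and the step size h are assumptions introduced for this illustration; they aren't part of the tutorial's code:

```python
import numpy as np

def numerical_gradient(func, v, h=1e-6):
    """Approximate the gradient of func at v with central differences."""
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h  # perturb only the i-th variable
        grad[i] = (func(v + step) - func(v - step)) / (2 * h)
    return grad

# C(v1, v2) = v1**2 + v2**4 has the analytical gradient (2*v1, 4*v2**3)
cost = lambda v: v[0] ** 2 + v[1] ** 4
point = np.array([1.0, 2.0])

approx = numerical_gradient(cost, point)
exact = np.array([2 * point[0], 4 * point[1] ** 3])
print(approx, exact)
```

The negative of either vector points in the direction in which 𝐶 decreases fastest at that point.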
The idea behind gradient descent is similar: you start with an arbitrarily chosen position of the point or vector 𝐯 = (𝑣₁,
…, 𝑣ᵣ) and move it iteratively in the direction of the fastest decrease of the cost function. As mentioned, this is the
direction of the negative gradient vector, −∇𝐶.
Once you have a random starting point 𝐯 = (𝑣₁, …, 𝑣ᵣ), you update it, or move it to a new position in the direction of
the negative gradient: 𝐯 → 𝐯 − 𝜂∇𝐶, where 𝜂 (pronounced “ee-tah”) is a small positive value called the learning rate.
The learning rate determines how large the update or moving step is. It’s a very important parameter. If 𝜂 is too small,
then the algorithm might converge very slowly. Large 𝜂 values can also cause issues with convergence or make the
algorithm divergent.
This is a basic implementation of the algorithm that starts with an arbitrary point, start, iteratively moves it toward
the minimum, and returns a point that is hopefully at or near the minimum:
1. gradient is the function or any Python callable object that takes a vector and returns the gradient of the
function you’re trying to minimize.
2. start is the point where the algorithm starts its search, given as a sequence (tuple, list, NumPy array, and so on)
or scalar (in the case of a one-dimensional problem).
3. learn_rate is the learning rate that controls the magnitude of the vector update.
4. n_iter is the number of iterations.
This function does exactly what’s described above: it takes a starting point (line 2), iteratively updates it according to
the learning rate and the value of the gradient (lines 3 to 5), and finally returns the last position found.
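The listing for this basic version isn't reproduced here, so the following is a minimal sketch that matches the description above: four parameters, a starting point, a fixed number of updates, and a return of the last position. The name basic_gradient_descent is an assumption chosen to keep it distinct from the later versions of gradient_descent():

```python
def basic_gradient_descent(gradient, start, learn_rate, n_iter):
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)  # step along the negative gradient
        vector += diff                         # move to the new position
    return vector

# Minimize C = v**2, whose gradient is 2*v
result = basic_gradient_descent(
    gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2, n_iter=50
)
print(result)
```

With these arguments, every update multiplies 𝑣 by 0.6, so the result shrinks toward zero as the number of iterations grows.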
Before you apply gradient_descent(), you can add another termination criterion:
Python
 1 import numpy as np
 2
 3 def gradient_descent(
 4     gradient, start, learn_rate, n_iter=50, tolerance=1e-06
 5 ):
 6     vector = start
 7     for _ in range(n_iter):
 8         diff = -learn_rate * gradient(vector)
 9         if np.all(np.abs(diff) <= tolerance):
10             break
11         vector += diff
12     return vector
You now have the additional parameter tolerance (line 4), which specifies the minimal allowed movement in each
iteration. You’ve also defined the default values for tolerance and n_iter, so you don’t have to specify them each
time you call gradient_descent().
Lines 9 and 10 enable gradient_descent() to stop iterating and return the result before n_iter is reached if the
vector update in the current iteration is less than or equal to tolerance. This often happens near the minimum,
where gradients are usually very small. Unfortunately, it can also happen near a local minimum or a saddle point.
Line 9 uses the convenient NumPy functions numpy.all() and numpy.abs() to compare the absolute values of diff
and tolerance in a single statement. That’s why you import numpy on line 1.
Now that you have the first version of gradient_descent(), it’s time to test your function. You’ll start with a small
example and find the minimum of the function 𝐶 = 𝑣².
This function has only one independent variable (𝑣), and its gradient is the derivative 2𝑣. It’s a differentiable convex
function, and the analytical way to find its minimum is straightforward. However, in practice, analytical differentiation
can be difficult or even impossible and is often approximated with numerical methods.
You need only one statement to test your gradient descent implementation:
Python >>>
>>> gradient_descent(
... gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2
... )
2.210739197207331e-06
You use the lambda function lambda v: 2 * v to provide the gradient of 𝑣². You start from the value 10.0 and set the
learning rate to 0.2. You get a result that’s very close to zero, which is the correct minimum.
The figure below shows the movement of the solution through the iterations:
You start from the rightmost green dot (𝑣 = 10) and move toward the minimum (𝑣 = 0). The updates are larger at first
because the value of the gradient (and slope) is higher. As you approach the minimum, they become lower.
Python >>>
>>> gradient_descent(
... gradient=lambda v: 2 * v, start=10.0, learn_rate=0.8
... )
-4.77519666596786e-07
You get another solution that’s very close to zero, but the internal behavior of the algorithm is different. This is what
happens with the value of 𝑣 through the iterations:
In this case, you again start with 𝑣 = 10, but because of the high learning rate, you get a large change in 𝑣 that passes to
the other side of the optimum and becomes −6. It crosses zero a few more times before settling near it.
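One way to see this overshooting without a plot is to record every position the algorithm visits. The helper descent_path() below is a hypothetical addition for illustration and isn't part of the tutorial's gradient_descent():

```python
def descent_path(gradient, start, learn_rate, n_iter):
    """Return every position visited, beginning with the start value."""
    path = [start]
    for _ in range(n_iter):
        # Each new position is the last one minus the scaled gradient
        path.append(path[-1] - learn_rate * gradient(path[-1]))
    return path

path = descent_path(lambda v: 2 * v, start=10.0, learn_rate=0.8, n_iter=5)
print([round(v, 3) for v in path])
```

For 𝐶 = 𝑣² and a learning rate of 0.8, each update multiplies 𝑣 by 1 − 0.8 · 2 = −0.6, so the sign flips on every step while the magnitude shrinks.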
Python >>>
>>> gradient_descent(
... gradient=lambda v: 2 * v, start=10.0, learn_rate=0.005
... )
6.050060671375367
The result is now 6.05, which is nowhere near the true minimum of zero. This is because the changes in the vector are
very small due to the small learning rate:
The search process starts at 𝑣 = 10 as before, but it can’t reach zero in fifty iterations. However, with a hundred
iterations, the error will be much smaller, and with a thousand iterations, you’ll be very close to zero:
Python >>>
>>> gradient_descent(
... gradient=lambda v: 2 * v, start=10.0, learn_rate=0.005,
... n_iter=100
... )
3.660323412732294
>>> gradient_descent(
... gradient=lambda v: 2 * v, start=10.0, learn_rate=0.005,
... n_iter=1000
... )
0.0004317124741065828
>>> gradient_descent(
... gradient=lambda v: 2 * v, start=10.0, learn_rate=0.005,
... n_iter=2000
... )
9.952518849647663e-05
Nonconvex functions might have local minima or saddle points where the algorithm can get trapped. In such
situations, your choice of learning rate or starting point can make the difference between finding a local minimum
and finding the global minimum.
Consider the function 𝑣⁴ − 5𝑣² − 3𝑣. It has a global minimum at 𝑣 ≈ 1.7 and a local minimum at 𝑣 ≈ −1.4. The gradient of
this function is 4𝑣³ − 10𝑣 − 3. Let’s see how gradient_descent() works here:
Python >>>
>>> gradient_descent(
... gradient=lambda v: 4 * v**3 - 10 * v - 3, start=0,
... learn_rate=0.2
... )
-1.4207567437458342
During the first two iterations, your vector was moving toward the global minimum, but then it crossed to the
opposite side and stayed trapped in the local minimum. You can prevent this with a smaller learning rate:
Python >>>
>>> gradient_descent(
... gradient=lambda v: 4 * v**3 - 10 * v - 3, start=0,
... learn_rate=0.1
... )
1.285401330315467
When you decrease the learning rate from 0.2 to 0.1, the result (𝑣 ≈ 1.29) lands in the basin of the global minimum
at 𝑣 ≈ 1.7 instead of the local one. It isn’t exactly at the minimum, which is a reminder that gradient descent is an
approximate method. This time, you avoid the jump to the other side:
A lower learning rate prevents the vector from making large jumps, and in this case, the vector remains closer to the
global optimum.
Adjusting the learning rate is tricky. You can’t know the best value in advance. There are many techniques and
heuristics that try to help with this. In addition, machine learning practitioners often tune the learning rate during
model selection and evaluation.
Besides the learning rate, the starting point can affect the solution significantly, especially with nonconvex functions.
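The effect of the starting point can be sketched with the same nonconvex function. The block below repeats the tolerance-based gradient_descent() so that it runs on its own; the learning rate of 0.01, the iteration limit, and the two start values are assumptions picked for illustration:

```python
import numpy as np

def gradient_descent(gradient, start, learn_rate, n_iter=50, tolerance=1e-06):
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector

# Gradient of v**4 - 5*v**2 - 3*v
gradient = lambda v: 4 * v**3 - 10 * v - 3

# Same learning rate, two starting points on opposite sides of the hump
from_left = gradient_descent(gradient, start=-2.0, learn_rate=0.01, n_iter=1000)
from_right = gradient_descent(gradient, start=2.0, learn_rate=0.01, n_iter=1000)
print(from_left, from_right)
```

With a small learning rate, each run should slide into the minimum on its own side: the run that starts at −2 ends near the local minimum, and the run that starts at 2 ends near the global one.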
Short Examples
First, you’ll apply gradient_descent() to another one-dimensional problem. Take the function 𝑣 − log(𝑣). The
gradient of this function is 1 − 1/𝑣. With this information, you can find its minimum:
Python >>>
>>> gradient_descent(
...     gradient=lambda v: 1 - 1 / v, start=2.5, learn_rate=0.5
... )
1.0000011077232125
You can also use gradient_descent() with functions of more than one variable. The application is the same, but you
need to provide the gradient and starting points as vectors or arrays. For example, you can find the minimum of the
function 𝑣₁² + 𝑣₂⁴ that has the gradient vector (2𝑣₁, 4𝑣₂³):
Python >>>
>>> gradient_descent(
... gradient=lambda v: np.array([2 * v[0], 4 * v[1]**3]),
... start=np.array([1.0, 1.0]), learn_rate=0.2, tolerance=1e-08
... )
array([8.08281277e-12, 9.75207120e-02])
In this case, your gradient function returns an array, and the start value is an array, so you get an array as the result.
The first value is practically zero, while the second (about 0.098) is still approaching zero: the gradient component
4𝑣₂³ becomes very small near the minimum, so the updates along 𝑣₂ shrink quickly. Both values are moving toward the
minimum of this function at 𝑣₁ = 𝑣₂ = 0.
This is an optimization problem. It finds the values of weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ that minimize the sum of squared residuals
SSR = Σᵢ(𝑦ᵢ − 𝑓(𝐱ᵢ))² or the mean squared error MSE = SSR / 𝑛. Here, 𝑛 is the total number of observations and 𝑖 = 1, …, 𝑛.
You can also use the cost function 𝐶 = SSR / (2𝑛), which is mathematically more convenient than SSR or MSE.
The most basic form of linear regression is simple linear regression. It has only one set of inputs 𝑥 and two weights: 𝑏₀
and 𝑏₁. The equation of the regression line is 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥. Although the optimal values of 𝑏₀ and 𝑏₁ can be calculated
analytically, you’ll use gradient descent to determine them.
First, you need calculus to find the gradient of the cost function 𝐶 = Σᵢ(𝑦ᵢ − 𝑏₀ − 𝑏₁𝑥ᵢ)² / (2𝑛). Since you have two
decision variables, 𝑏₀ and 𝑏₁, the gradient ∇𝐶 is a vector with two components:
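Applying the chain rule to the cost function gives the two partial derivatives. They're written out here in LaTeX as a worked derivation, using the same symbols as the surrounding text:

```latex
\begin{aligned}
C &= \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_i\right)^2 \\
\frac{\partial C}{\partial b_0} &= \frac{1}{n}\sum_{i=1}^{n}\left(b_0 + b_1 x_i - y_i\right) \\
\frac{\partial C}{\partial b_1} &= \frac{1}{n}\sum_{i=1}^{n}\left(b_0 + b_1 x_i - y_i\right) x_i
\end{aligned}
```

The factor of 2 from differentiating the square cancels the 2 in the denominator, which is why 𝐶 = SSR / (2𝑛) is convenient. Each component is simply a mean over all observations.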
You need the values of 𝑥 and 𝑦 to calculate the gradient of this cost function. Your gradient function will have as inputs
not only 𝑏₀ and 𝑏₁ but also 𝑥 and 𝑦. This is how it might look:
Python
ssr_gradient() takes the arrays x and y, which contain the observation inputs and outputs, and the array b that
holds the current values of the decision variables 𝑏₀ and 𝑏₁. This function first calculates the array of the residuals for
each observation (res) and then returns the pair of values of ∂𝐶/∂𝑏₀ and ∂𝐶/∂𝑏₁.
In this example, you can use the convenient NumPy method ndarray.mean() since you pass NumPy arrays as the
arguments.
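The listing for ssr_gradient() isn't reproduced here. A minimal sketch consistent with the description (the arrays x and y, the weights b, the intermediate res, and the use of .mean()) might look like this. The sample arrays are an assumption: they're believed to match the Linear Regression in Python tutorial's data, but treat them as illustrative:

```python
import numpy as np

def ssr_gradient(x, y, b):
    res = b[0] + b[1] * x - y             # residuals: predicted minus actual
    return res.mean(), (res * x).mean()   # dC/db0 and dC/db1

# Assumed sample data, for illustration only
x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])

g0, g1 = ssr_gradient(x, y, b=[0.5, 0.5])
print(g0, g1)
```

Both components are negative at the starting weights (0.5, 0.5), so a gradient descent step increases 𝑏₀ and 𝑏₁.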
import numpy as np

def gradient_descent(
    gradient, x, y, start, learn_rate=0.1, n_iter=50, tolerance=1e-06
):
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * np.array(gradient(x, y, vector))
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector
gradient_descent() now accepts the observation inputs x and outputs y and can use them to calculate the gradient.
Converting the output of gradient(x, y, vector) to a NumPy array enables elementwise multiplication of the
gradient elements by the learning rate, which isn’t necessary in the case of a single-variable function.
Now apply your new version of gradient_descent() to find the regression line for some arbitrary values of x and y:
Python >>>
>>> gradient_descent(
... ssr_gradient, x, y, start=[0.5, 0.5], learn_rate=0.0008,
... n_iter=100_000
... )
array([5.62822349, 0.54012867])
The result is an array with two values that correspond to the decision variables: 𝑏₀ = 5.63 and 𝑏₁ = 0.54. The best
regression line is 𝑓(𝑥) = 5.63 + 0.54𝑥. As in the previous examples, this result heavily depends on the learning rate. You
might not get such a good result with too low or too high of a learning rate.
This example isn’t entirely random–it’s taken from the tutorial Linear Regression in Python. The good news is that
you’ve obtained almost the same result as the linear regressor from scikit-learn. The data and regression results are
visualized in the section Simple Linear Regression.
 1 import numpy as np
 2
 3 def gradient_descent(
 4     gradient, x, y, start, learn_rate=0.1, n_iter=50, tolerance=1e-06,
 5     dtype="float64"
 6 ):
 7     # Checking if the gradient is callable
 8     if not callable(gradient):
 9         raise TypeError("'gradient' must be callable")
10
11     # Setting up the data type for NumPy arrays
12     dtype_ = np.dtype(dtype)
13
14     # Converting x and y to NumPy arrays
15     x, y = np.array(x, dtype=dtype_), np.array(y, dtype=dtype_)
16     if x.shape[0] != y.shape[0]:
17         raise ValueError("'x' and 'y' lengths do not match")
18
19     # Initializing the values of the variables
20     vector = np.array(start, dtype=dtype_)
21
22     # Setting up and checking the learning rate
23     learn_rate = np.array(learn_rate, dtype=dtype_)
24     if np.any(learn_rate <= 0):
25         raise ValueError("'learn_rate' must be greater than zero")
26
27     # Setting up and checking the maximal number of iterations
28     n_iter = int(n_iter)
29     if n_iter <= 0:
30         raise ValueError("'n_iter' must be greater than zero")
31
32     # Setting up and checking the tolerance
33     tolerance = np.array(tolerance, dtype=dtype_)
34     if np.any(tolerance <= 0):
35         raise ValueError("'tolerance' must be greater than zero")
36
37     # Performing the gradient descent loop
38     for _ in range(n_iter):
39         # Recalculating the difference
40         diff = -learn_rate * np.array(gradient(x, y, vector), dtype_)
41
42         # Checking if the absolute difference is small enough
43         if np.all(np.abs(diff) <= tolerance):
44             break
45
46         # Updating the values of the variables
47         vector += diff
48
49     return vector if vector.shape else vector.item()
gradient_descent() now accepts an additional dtype parameter that defines the data type of NumPy arrays inside
the function. For more information about NumPy types, see the official documentation on data types.
In most applications, you won’t notice a difference between 32-bit and 64-bit floating-point numbers, but when you
work with big datasets, this might significantly affect memory use and maybe even processing speed. For example,
although NumPy uses 64-bit floats by default, TensorFlow often uses 32-bit floats.
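The memory side of this trade-off is easy to check. The sketch below compares the storage and machine precision of the two float types; the array length of one million is an arbitrary choice for illustration:

```python
import numpy as np

n = 1_000_000
a64 = np.zeros(n, dtype="float64")
a32 = np.zeros(n, dtype="float32")

# Bytes used by the array data: 8 per item versus 4 per item
print(a64.nbytes, a32.nbytes)

# Machine epsilon: the price of the smaller type is lower precision
print(np.finfo("float64").eps, np.finfo("float32").eps)
```

Halving the item size halves the memory footprint, at the cost of roughly seven significant decimal digits instead of about sixteen.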
In addition to considering data types, the code above introduces a few modifications related to type checking and
ensuring the use of NumPy capabilities:
Lines 8 and 9 check if gradient is a Python callable object and whether it can be used as a function. If not, then
the function will raise a TypeError.
Line 12 sets an instance of numpy.dtype, which will be used as the data type for all arrays throughout the
function.
Line 15 takes the arguments x and y and produces NumPy arrays with the desired data type. The arguments x
and y can be lists, tuples, arrays, or other sequences.
Lines 16 and 17 compare the sizes of x and y. This is useful because you want to be sure that both arrays have
the same number of observations. If they don’t, then the function will raise a ValueError.
Line 20 converts the argument start to a NumPy array. This is an interesting trick: if start is a Python scalar,
then it’ll be transformed into a corresponding NumPy object (an array with one item and zero dimensions). If
you pass a sequence, then it’ll become a regular NumPy array with the same number of elements.
Line 23 does the same thing with the learning rate. This can be very useful because it enables you to specify
different learning rates for each decision variable by passing a list, tuple, or NumPy array to
gradient_descent().
Lines 24 and 25 check if the learning rate value (or values for all variables) is greater than zero.
Lines 28 to 35 similarly set n_iter and tolerance and check that they are greater than zero.
Line 49 conveniently returns the resulting array if you have several decision variables or a Python scalar if you
have a single variable.
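The scalar-to-array conversion on line 20 and the return expression on line 49 can be tried in isolation. In this sketch, unwrap() is a hypothetical helper that only isolates the return expression, and the start values are arbitrary:

```python
import numpy as np

scalar_start = np.array(2.5, dtype="float64")
vector_start = np.array([0.5, 0.5], dtype="float64")

# A Python scalar becomes a zero-dimensional array with an empty shape
print(scalar_start.shape, scalar_start.ndim)
print(vector_start.shape, vector_start.ndim)

def unwrap(vector):
    # An empty shape tuple is falsy, so zero-dimensional arrays are
    # converted back to plain Python scalars with .item()
    return vector if vector.shape else vector.item()

print(unwrap(scalar_start), unwrap(vector_start))
```

This is why the same function body can serve both one-dimensional and multidimensional problems.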
Your gradient_descent() is now finished. Feel free to add some additional capabilities or polishing. The next step of
this tutorial is to use what you’ve learned so far to implement the stochastic version of gradient descent.
Online stochastic gradient descent is a variant of stochastic gradient descent in which you estimate the gradient of
the cost function for each observation and update the decision variables accordingly. This can help you find the
global minimum, especially if the objective function is convex.
Batch stochastic gradient descent is somewhere between ordinary gradient descent and the online method. The
gradients are calculated and the decision variables are updated iteratively with subsets of all observations, called
minibatches. This variant is very popular for training neural networks.
You can imagine the online algorithm as a special kind of batch algorithm in which each minibatch has only one
observation. Classical gradient descent is another special case in which there’s only one batch containing all
observations.
Stochastic gradient descent randomly divides the set of observations into minibatches.
For each minibatch, the gradient is computed and the vector is moved.
Once all minibatches are used, you say that the iteration, or epoch, is finished and start the next one.
This algorithm randomly selects observations for minibatches, so you need to simulate this random (or
pseudorandom) behavior. You can do that with random number generation. Python has the built-in random module,
and NumPy has its own random generator. The latter is more convenient when you work with arrays.
You’ll create a new function called sgd() that is very similar to gradient_descent() but uses randomly selected
minibatches to move along the search space:
Python
 1 import numpy as np
 2
 3 def sgd(
 4     gradient, x, y, start, learn_rate=0.1, batch_size=1, n_iter=50,
 5     tolerance=1e-06, dtype="float64", random_state=None
 6 ):
 7     # Checking if the gradient is callable
 8     if not callable(gradient):
 9         raise TypeError("'gradient' must be callable")
10
11     # Setting up the data type for NumPy arrays
12     dtype_ = np.dtype(dtype)
13
14     # Converting x and y to NumPy arrays
15     x, y = np.array(x, dtype=dtype_), np.array(y, dtype=dtype_)
16     n_obs = x.shape[0]
17     if n_obs != y.shape[0]:
18         raise ValueError("'x' and 'y' lengths do not match")
19     xy = np.c_[x.reshape(n_obs, -1), y.reshape(n_obs, 1)]
20
21     # Initializing the random number generator
22     seed = None if random_state is None else int(random_state)
23     rng = np.random.default_rng(seed=seed)
24
25     # Initializing the values of the variables
26     vector = np.array(start, dtype=dtype_)
27
28     # Setting up and checking the learning rate
29     learn_rate = np.array(learn_rate, dtype=dtype_)
30     if np.any(learn_rate <= 0):
31         raise ValueError("'learn_rate' must be greater than zero")
32
33     # Setting up and checking the size of minibatches
34     batch_size = int(batch_size)
35     if not 0 < batch_size <= n_obs:
36         raise ValueError(
37             "'batch_size' must be greater than zero and less than "
38             "or equal to the number of observations"
39         )
40
41     # Setting up and checking the maximal number of iterations
42     n_iter = int(n_iter)
43     if n_iter <= 0:
44         raise ValueError("'n_iter' must be greater than zero")
45
46     # Setting up and checking the tolerance
47     tolerance = np.array(tolerance, dtype=dtype_)
48     if np.any(tolerance <= 0):
49         raise ValueError("'tolerance' must be greater than zero")
50
51     # Performing the gradient descent loop
52     for _ in range(n_iter):
53         # Shuffle x and y
54         rng.shuffle(xy)
55
56         # Performing minibatch moves
57         for start in range(0, n_obs, batch_size):
58             stop = start + batch_size
59             x_batch, y_batch = xy[start:stop, :-1], xy[start:stop, -1:]
60
61             # Recalculating the difference
62             grad = np.array(gradient(x_batch, y_batch, vector), dtype_)
63             diff = -learn_rate * grad
64
65             # Checking if the absolute difference is small enough
66             if np.all(np.abs(diff) <= tolerance):
67                 break
68
69             # Updating the values of the variables
70             vector += diff
71
72     return vector if vector.shape else vector.item()
You have a new parameter here. With batch_size, you specify the number of observations in each minibatch. This is
an essential parameter for stochastic gradient descent that can significantly affect performance. Lines 34 to 39 ensure
that batch_size is a positive integer no larger than the total number of observations.
Another new parameter is random_state. It defines the seed of the random number generator on line 22. The seed is
used on line 23 as an argument to default_rng(), which creates an instance of Generator.
If you pass the argument None for random_state, then the random number generator will return different numbers
each time it’s instantiated. If you want each instance of the generator to behave exactly the same way, then you need
to specify seed. The easiest way is to provide an arbitrary integer.
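The effect of the seed can be shown with two generators that share one. In this sketch, the seed value 0 and the ten-item arrays are arbitrary choices for illustration:

```python
import numpy as np

rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)   # same seed as rng_a
rng_c = np.random.default_rng()         # no seed: fresh entropy on every run

data_a = np.arange(10)
data_b = np.arange(10)
data_c = np.arange(10)

rng_a.shuffle(data_a)   # shuffles in place
rng_b.shuffle(data_b)
rng_c.shuffle(data_c)

print(data_a, data_b, data_c)
```

The first two arrays come out in the same order on every run, while the order of the third generally changes from run to run.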
Line 16 deduces the number of observations with x.shape[0]. If x is a one-dimensional array, then this is its size. If x
has two dimensions, then .shape[0] is the number of rows.
On line 19, you use .reshape() to make sure that both x and y become two-dimensional arrays with n_obs rows and
that y has exactly one column. numpy.c_[] conveniently concatenates the columns of x and y into a single array, xy.
This is one way to make data suitable for random selection.
Finally, on lines 52 to 70, you implement the for loop for the stochastic gradient descent. It differs from
gradient_descent(). On line 54, you use the random number generator and its method .shuffle() to shuffle the
observations. This is one of the ways to choose minibatches randomly.
The inner for loop is repeated for each minibatch. The main difference from the ordinary gradient descent is that, on
line 62, the gradient is calculated for the observations from a minibatch (x_batch and y_batch) instead of for all
observations (x and y).
On line 59, x_batch becomes a part of xy that contains the rows of the current minibatch (from start to stop) and the
columns that correspond to x. y_batch holds the same rows from xy but only the last column (the outputs). For more
information about how indices work in NumPy, see the official documentation on indexing.
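The stacking, shuffling, and slicing steps can be tried outside of sgd(). The sketch below uses assumed sample arrays and a batch size of 4, both chosen only for illustration:

```python
import numpy as np

# Assumed sample data, for illustration only
x = np.array([5, 15, 25, 35, 45, 55], dtype="float64")
y = np.array([5, 20, 14, 32, 22, 38], dtype="float64")

n_obs = x.shape[0]
xy = np.c_[x.reshape(n_obs, -1), y.reshape(n_obs, 1)]  # one row per observation

rng = np.random.default_rng(seed=0)
rng.shuffle(xy)  # shuffles rows in place, so each x stays with its y

batch_size = 4
batches = []
for start in range(0, n_obs, batch_size):
    stop = start + batch_size
    x_batch, y_batch = xy[start:stop, :-1], xy[start:stop, -1:]
    batches.append((x_batch, y_batch))
    print(x_batch.shape, y_batch.shape)
```

Because six observations don't divide evenly into batches of four, the last minibatch is smaller than the others. Slicing past the end of an array is safe in NumPy, so no special handling is needed.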
Now you can test your implementation of stochastic gradient descent:
Python >>>
>>> sgd(
... ssr_gradient, x, y, start=[0.5, 0.5], learn_rate=0.0008,
... batch_size=3, n_iter=100_000, random_state=0
... )
array([5.63093736, 0.53982921])
The result is almost the same as you got with gradient_descent(). If you omit random_state or use None, then you’ll
get somewhat different results each time you run sgd() because the random number generator will shuffle xy
differently.
You can use momentum to correct the effect of the learning rate. The idea is to remember the previous update of the
vector and apply it when calculating the next one. You don’t move the vector exactly in the direction of the negative
gradient, but you also tend to keep the direction and magnitude from the previous move.
The parameter called the decay rate or decay factor defines how strong the contribution of the previous update is.
To include the momentum and the decay rate, you can modify sgd() by adding the parameter decay_rate and use it
to calculate the direction and magnitude of the vector update (diff):
Python
 1 import numpy as np
 2
 3 def sgd(
 4     gradient, x, y, start, learn_rate=0.1, decay_rate=0.0, batch_size=1,
 5     n_iter=50, tolerance=1e-06, dtype="float64", random_state=None
 6 ):
 7     # Checking if the gradient is callable
 8     if not callable(gradient):
 9         raise TypeError("'gradient' must be callable")
10
11     # Setting up the data type for NumPy arrays
12     dtype_ = np.dtype(dtype)
13
14     # Converting x and y to NumPy arrays
15     x, y = np.array(x, dtype=dtype_), np.array(y, dtype=dtype_)
16     n_obs = x.shape[0]
17     if n_obs != y.shape[0]:
18         raise ValueError("'x' and 'y' lengths do not match")
19     xy = np.c_[x.reshape(n_obs, -1), y.reshape(n_obs, 1)]
20
21     # Initializing the random number generator
22     seed = None if random_state is None else int(random_state)
23     rng = np.random.default_rng(seed=seed)
24
25     # Initializing the values of the variables
26     vector = np.array(start, dtype=dtype_)
27
28     # Setting up and checking the learning rate
29     learn_rate = np.array(learn_rate, dtype=dtype_)
30     if np.any(learn_rate <= 0):
31         raise ValueError("'learn_rate' must be greater than zero")
32
33     # Setting up and checking the decay rate
34     decay_rate = np.array(decay_rate, dtype=dtype_)
35     if np.any(decay_rate < 0) or np.any(decay_rate > 1):
36         raise ValueError("'decay_rate' must be between zero and one")
37
38     # Setting up and checking the size of minibatches
39     batch_size = int(batch_size)
40     if not 0 < batch_size <= n_obs:
41         raise ValueError(
42             "'batch_size' must be greater than zero and less than "
43             "or equal to the number of observations"
44         )
45
46     # Setting up and checking the maximal number of iterations
47     n_iter = int(n_iter)
48     if n_iter <= 0:
49         raise ValueError("'n_iter' must be greater than zero")
50
51     # Setting up and checking the tolerance
52     tolerance = np.array(tolerance, dtype=dtype_)
53     if np.any(tolerance <= 0):
54         raise ValueError("'tolerance' must be greater than zero")
55
56     # Setting the difference to zero for the first iteration
57     diff = 0
58
59     # Performing the gradient descent loop
60     for _ in range(n_iter):
61         # Shuffle x and y
62         rng.shuffle(xy)
63
64         # Performing minibatch moves
65         for start in range(0, n_obs, batch_size):
66             stop = start + batch_size
67             x_batch, y_batch = xy[start:stop, :-1], xy[start:stop, -1:]
68
69             # Recalculating the difference
70             grad = np.array(gradient(x_batch, y_batch, vector), dtype_)
71             diff = decay_rate * diff - learn_rate * grad
72
73             # Checking if the absolute difference is small enough
74             if np.all(np.abs(diff) <= tolerance):
75                 break
76
77             # Updating the values of the variables
78             vector += diff
79
80     return vector if vector.shape else vector.item()
In this implementation, you add the decay_rate parameter on line 4, convert it to a NumPy array of the desired type
on line 34, and check if it’s between zero and one on lines 35 and 36. On line 57, you initialize diff before the
iterations start to ensure that it’s available in the first iteration.
The most important change happens on line 71. You recalculate diff with the learning rate and gradient but also add
the product of the decay rate and the old value of diff. Now diff has two components:
1. decay_rate * diff is the momentum, or impact of the previous move.
2. -learn_rate * grad is the impact of the current gradient.
The decay and learning rates serve as the weights that define the contributions of the two.
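A few hand-checkable steps show how the two components combine. The helper momentum_path() below is a hypothetical, stripped-down illustration of the update on line 71 (no minibatches, no tolerance check), applied to 𝐶 = 𝑣² with arbitrarily chosen rates:

```python
def momentum_path(gradient, start, learn_rate, decay_rate, n_iter):
    """Return the positions visited when each step reuses the previous one."""
    vector, diff = start, 0.0
    path = [vector]
    for _ in range(n_iter):
        # Momentum term plus the contribution of the current gradient
        diff = decay_rate * diff - learn_rate * gradient(vector)
        vector = vector + diff
        path.append(vector)
    return path

grad = lambda v: 2 * v  # gradient of C = v**2

plain = momentum_path(grad, 10.0, learn_rate=0.05, decay_rate=0.0, n_iter=3)
with_momentum = momentum_path(grad, 10.0, learn_rate=0.05, decay_rate=0.5, n_iter=3)
print(plain)
print(with_momentum)
```

The first step is identical in both runs because diff starts at zero. After that, the run with momentum carries half of each previous step forward, so it reaches about 6.14 after three steps, compared with about 7.29 without momentum.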
You now have the new parameter n_vars that defines the number of decision variables in your problem. The
parameter start is optional and has the default value None. Lines 27 to 31 initialize the starting values of the decision
variables:
If you provide a start value other than None, then it’s used for the starting values.
If start is None, then your random number generator creates the starting values using the standard normal
distribution and the NumPy method .normal().
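The listing that introduces n_vars isn't reproduced here. The initialization described above could be sketched as follows; init_vector() is a hypothetical helper written for this illustration, not the tutorial's actual code:

```python
import numpy as np

def init_vector(start, n_vars, rng, dtype="float64"):
    """Use the given start values, or draw them from a standard normal."""
    dtype_ = np.dtype(dtype)
    if start is None:
        # One random starting value per decision variable
        return rng.normal(size=int(n_vars)).astype(dtype_)
    return np.array(start, dtype=dtype_)

rng = np.random.default_rng(seed=0)
random_vector = init_vector(None, n_vars=2, rng=rng)
given_vector = init_vector([0.5, 0.5], n_vars=2, rng=rng)
print(random_vector, given_vector)
```

Drawing the start values from the same seeded generator that shuffles the data keeps the whole run reproducible for a given random_state.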
Python >>>
>>> sgd(
... ssr_gradient, x, y, n_vars=2, learn_rate=0.0001,
... decay_rate=0.8, batch_size=3, n_iter=100_000, random_state=0
... )
array([5.63014443, 0.53901017])
You’ve learned how to write the functions that implement gradient descent and stochastic gradient descent. The code
above can be made more robust and polished. You can also find different implementations of these methods in
well-known machine learning libraries.
Adam
Adagrad
Adadelta
RMSProp
These optimization libraries are usually called internally when neural network software is trained. However, you can
use them independently as well:
Python >>>
You can find more information on these algorithms in the Keras and TensorFlow documentation. The article An
overview of gradient descent optimization algorithms offers a comprehensive list with explanations of gradient
descent variants.
Conclusion
You now know what gradient descent and stochastic gradient descent algorithms are and how they work. They’re
widely used in the applications of artificial neural networks and are implemented in popular libraries like Keras and
TensorFlow.
How to write your own functions for gradient descent and stochastic gradient descent
How to apply your functions to solve optimization problems
What the key features and concepts of gradient descent are, like learning rate or momentum, as well as its
limitations
You’ve used gradient descent and stochastic gradient descent to find the minima of several functions and to fit the
regression line in a linear regression problem. You’ve also seen how to apply the class SGD from TensorFlow that’s
used to train neural networks.
If you have questions or comments, then please put them in the comment section below.
Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who
worked on this tutorial are:
Joanna Jacob