Gradient Descent

Any lover of mathematics and programming is sure to have come across the term "gradient descent" before. In short, it is an algorithm for calculating the parameters of a simple machine learning model, linear regression. In this post, I'll explain the underlying mathematics of the algorithm and show my quick implementation in Python.

Mathematical Model

We want to use the idea of least squared residuals. A residual is just the difference between the actual value $y$ and the value predicted by the model, $h(x_i)$, hence the $h(x_i)-y_i$ in the expression. However, we need to square them so that negative (underestimates) and positive (overestimates) values are treated in the same way. By squaring instead of taking the absolute value (which is still possible, a technique known as L1 regression), 1) the maths becomes easier by taking the derivative of a polynomial instead of an absolute value and more importantly, 2) we "punish" errors more severely for values that are very inaccurate, so that it is better to miss by a little many times than to miss by a lot few times. $$ J(\theta_0, \theta_1)=\frac{1}{2m} \sum_{i=1}^m{(h(x_i)-y_i)^2} $$ Note that we have half the mean, $\frac{1}{2m}$, just to make our calculation of the partial derivative easier (as you will see, the 1/2 factor will be cancelled). Now, we will calculate the partial derivative of our cost function with respect to the two parameters $\theta_0$ and $\theta_1$. Don't worry, it's not hard at all! $$ \begin{aligned} \\ J'_{\theta_0} &= \frac{\partial}{\partial\theta_0}\frac{1}{2m} \sum_{i=1} ^ m{(h(x_i)-y_i)^2} \\ &=\frac{1}{2m} \sum_{i=1} ^ m \frac{\partial}{\partial\theta_0}((\theta_1x_i+\theta_0)-y_i)^2 \\ &= \frac{1}{2m} \sum_{i=1} ^ m 2(\theta_1x_i+\theta_0)-y_i)\cdot1 \\ &= \frac{1}{m} \sum_{i=1} ^ m (h_{\theta}(x_i)-y_i) \end{aligned} $$ $$ \begin{aligned} \\ J'_{\theta_1} &= \frac{\partial}{\partial\theta_1}\frac{1}{2m} \sum_{i=1} ^ m{(h(x_i)-y_i)^2} \\ &=\frac{1}{2m} \sum_{i=1} ^ m \frac{\partial}{\partial\theta_1}((\theta_1x_i+\theta_0)-y_i)^2 \\ &= \frac{1}{2m} \sum_{i=1} ^ m 2(\theta_1x_i+\theta_0)-y_i)\cdot x_i \\ &= \frac{1}{m} \sum_{i=1} ^ m ((h_{\theta}(x_i)-y_i)x_i) \end{aligned} $$ As you can see, the only difference between the two derivatives is when we differentiate our hypothesis, $h(x_i)=\theta_1 x_i+ \theta_0$. With respect to $\theta_0$, $\theta_1 x_i$ is just a constant so the derivative is the derivative of $\theta_0$, or $1$. With respect to $\theta_1$, $\theta_0$ is a constant, so the derivative is just the derivative of $\theta_1x_i$, or $x_i$.

Gradient Descent Algorithm

Now that we have competed the partial derivatives of the least squares residuals function $J(\theta_0, \theta_1)$ we want to write an algorithm that will iteratively approach the least value of this function. Essentially, on every iteration, we want to update the value of $\theta_j$ with the current value subtracted by the derivative (to descend to the next "step" in the function) multiplied by the learning rate, $\alpha$, that determines how much we should move by on each step.

$$\\ \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)$$Here, $:=$ indicates the assignment operator, i.e. updating the value of $\theta_j$ with the new calculated value.

Python Implementation

This is a very rough code that I wrote to implement the above algorithm without using any machine learning libraries such as sklearn. As you can see in the sample outputs below, it works quite well.

import matplotlib.pyplot as plt
import numpy as np
from math import isnan

def run(x, y):
    t0, t1 = 0, 1 #initial values for parameters
    m = len(x) #size of training set
    alpha = 0.01 #alpha, learning rate
    dp = 10 #desired level of decimal places to be deemed convergent

    for i in range(10000):
        h = lambda t0, t1, x: t1*x+t0 #define hypothesis, h(x) = t0 + t1 * x
        dev0 = sum([(h(t0, t1, x[i])-y[i]) for i in range(m)]) #partial derivative
        dev1 = sum([((h(t0, t1, x[i])-y[i])*x[i]) for i in range(m)]) #partial derivative

        #simultaneous updating once derivatives have been computed
        old_t0, old_t1 = t0, t1
        t0 -= alpha*1/m*dev0
        t1 -= alpha*1/m*dev1
        print(t0, t1, old_t0, old_t1)

        if isnan(t0) or isnan(t1): #diverge to +inf or -inf
            print(f"Diverged to infinity, decrease alpha ({alpha})")
            break
        if ("{0}".format(round(old_t0, ndigits=dp)) == "{0}".format(round(t0, ndigits=dp))
            and "{0}".format(round(old_t1, ndigits=dp)) == "{0}".format(round(t1, ndigits=dp))): #check precision
            print(f"Converged after {i} iterations")
            break
    else: #loop was not broken after 10000 iterations, diverge up and down
        print(f"Could not converge in 10000 iterations, decrease alpha ({alpha})")
    return t0, t1

if __name__ == "__main__":
    x = [i for i in range(9)]
    y = [13, 11.2, 11.4, 8.3, 6.4, 1.2, 2.3, 0.1, -3]
    # y = [2, 3.2, 3.6, 4.9, 5.2, 7.2, 9.0, 10, 13]

    t0, t1 = run(x, y)
    _x = np.linspace(min(x), max(x), num=100)
    _y = t1*_x+t0

    plt.plot(_x, _y, '-r', label=f'y={t1}x+{t0}')
    plt.title(f'Regression: y={t1}x+{t0}')
    plt.scatter(x, y)
    plt.grid()
    plt.show()

One thing to note is that because the gradient descent algorithm will automatically take smaller steps when it approaches the local minimum, there is no need to decrease $\alpha$ over time, and we can keep it fixed as $0.01$.

This dataset converged to 10 decimal places of accuracy in 6953 iterations:

Here is another sample run:

Multivariate linear regression

In the case where we have multiple features (such as the number of floors, number of garages, number of square metres, etc to predict the selling price of a house), we can redefine the features and values to be vectors.

$\theta=\begin{bmatrix}\theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$ and $x=\begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ where $\theta_i$ are the features such as price per floor, price per square metre, price per garage, etc and $x_i$ are the values such as the number of floors, number of square metres, number of garages, etc.

Then, the new hypothesis function is given by $$\begin{aligned} h_\theta(x) &=\theta_0 x_0+\theta_1 x_1+....+\theta_n x_n \\ &= \begin{bmatrix}\theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}\begin{bmatrix}\theta_1 \ \theta_2 \ \dots \ \theta_n \end{bmatrix} \\&=\theta^Tx \end{aligned}$$, where $\theta^T$ is the transpose of the column vector (and thus yields a row vector).

For convenience's sake, we will simplify the notation so that $J(\theta)=J(\theta_0, \theta_1, \theta_2...\theta_n)$. Then, it is obvious that the computation of the partial derivative has exactly the same form as in the case with a single parameter $\theta_1$, and so the algorithm for multivariate linear regression is given by $$\\ \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j} J(\theta)$$ for $j=0, 1, \dots n$.

Tricks to improve gradient descent

There are various ways to clean the data so that the gradient descent algorithm can converge quickly. While I won't go into the details here, below are some of them:

Feature scaling
We want each of our input values $x_i$ to be roughly in the same range because $\theta$ converges the fastest when the range is minimal. To do this, we can divide each value by either the range or the standard deviation of the feature set to reduce its range to just 1 (and of course, they do give different results, but either can be used).
Mean normaliation
Similarly, after the range has been decreased, we want to shift it to a similar order of magnitude by subtracting the mean so that the range becomes approximately $-0.5
Choosing the learning rate
In order to choose the learning rate, we can plot the cost function against the number of iterations to find out what the behaviour is. As seen in the code above, if the learning rate is too high, then it may diverge to infinity or -infinity, or it may even oscillate up and down (on rare occasions, it may cause slow convergence). On the other hand, if the learning rate is too low, the convergence will be extremely slow. In order to find the sweet spot, we can study the graph to deduce whether our learning rate is too high for too low.

There are still so many things we can do with gradient descent that it's impossible to cover it all in a single post. I could, for instance, do polynomial regression (instead of linear) by creating new features based on the polynomial (e.g. $x_i^2, x_i^3, etc$ and reducing it to a problem of multivariate linear regression. I could implement the code for this, and I could also include some contour plots to show the minimising cost function.

If you want to learn more of this cool machine learning stuff, go check out the Machine Learning course on Coursera by Andrew Ng, the cofounder of Coursera and a leading pioneer of machine learning.

SoySoy4444 2021/05/07

I like programming (Python, Java and some frontend), chess, xiangqi, maths, puzzles and more. You can find me on Chess.com @SoySoy4444. I'm always up for a challenge!

Gradient Descent

Mathematical Model