Lesson 2

Welcome to the lesson on **Basic Gradient Descent**! In this lesson, we will learn about an optimization algorithm called `gradient descent`

. It helps us find the minimum value of a function, which is very important in fields such as machine learning and data science.

Imagine you're on a hike in the mountains. You want to find the lowest point in a valley. `Gradient descent`

is like taking small steps downhill until you reach the bottom. By the end of this lesson, you'll understand what `gradient descent`

is, how it works, and how to implement it in *Python*.

Let's get started!

First, let's talk about what `gradient descent`

is. `Gradient descent`

is an algorithm used to minimize a function. It works similarly to the Newton's method. This is important for us, because in machine learning we usually optimize a multivariable **loss function** which measures the algorithm's error.

In mathematical terms, we want to minimize a function $f(x, y)$ by iteratively moving in the direction of the negative gradient (the steepest descent) of the function. The gradient is a vector of partial derivatives that point in the direction of the steepest ascent. By taking steps against the gradient, we move towards the minimum value.

Imagine you’re playing a game where you need to find the treasure buried at the lowest point of a landscape. `Gradient descent`

helps you figure out which direction to dig by looking at the slope of the land around you.

To understand `gradient descent`

better, let's use a simple example:

$f(x, y) = x^2 + y^2$

Here, $f(x, y)$ represents a bowl-shaped surface, and we want to find the lowest point or the minimum value.

Think of this function as a stretched rubber sheet. If you press down at any point, the steepest increase in height is indicated by the gradient, and moving in the opposite direction will take you towards the bottom of the sheet.

To apply `gradient descent`

, we need to calculate the gradient of the function. For the quadratic function $f(x, y) = x^2 + y^2$, the gradient is:

$\nabla f(x, y) = [2x, 2y]$

The gradient tells us how the function changes at a given point.

Imagine rolling a ball down a hill. The direction in which the ball accelerates the fastest is the gradient direction. The steeper the slope, the larger the gradient.

Let's visualize our quadratic function using a contour plot in Python.

Now, let's see how to implement a basic version of `gradient descent`

in *Python*. Let's see the formula first:

$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)$

where:

- $\mathbf{x}_t$ is the current point.
- $\eta$ is the learning rate.
- $\nabla f(\mathbf{x}_t)$ is the gradient at the current point.

We'll start by initializing a point and updating it by moving against the gradient.

Python`1# Gradient descent without path output 2def simple_gradient_descent(f_grad, init_point, learning_rate=0.1, iterations=100): 3 point = list(init_point) 4 for _ in range(iterations): 5 grad = f_grad(point) 6 point = [p - learning_rate * g for p, g in zip(point, grad)] 7 return point`

Here are the inputs of the `simple_gradient_descent`

function:

`f_grad`

: The gradient function of the target function.`init_point`

: The starting point for the gradient descent algorithm.`learning_rate`

: A hyperparameter that controls the step size in each iteration.`iterations`

: The number of iterations to update the point.

The starting point is our initial guess of where the function's minimum might be. In our example, we start from `[2, 2]`

.

The learning rate is critical for the algorithm's performance:

- A high learning rate might cause overshooting and divergence.
- A low learning rate will make convergence slow.

Typically, you experiment with these values to find the best combination.

Next, we will define the function we want to minimize and its gradient.

Python`1def quadratic_function(point): 2 x, y = point 3 return x**2 + y**2 4 5def quadratic_gradient(point): 6 x, y = point 7 return [2 * x, 2 * y]`

Now, we run the `gradient descent`

algorithm.

Python`1optimal_point = simple_gradient_descent(quadratic_gradient, [2, 2], learning_rate=0.1, iterations=100) 2print("Optimal point after basic gradient descent:", optimal_point) # Optimal point after basic gradient descent: close to [0, 0]`

In `simple_gradient_descent`

:

- We take in the gradient function, initial point, learning rate, and number of iterations.
- We initialize the point and update it using the gradient and learning rate.
- After iterating, we return the optimal point.

For the example functions:

`quadratic_function`

computes $x^2 + y^2$.`quadratic_gradient`

calculates the gradient of this function.

Finally, let's visualize the gradient descent path. We will mark each gradient step as a red point on the plot, starting with $[2, 2]$. To make the plot clear, let's limit the gradient descent to `10`

iterations.

You can observe how the gradient moves towards the minimum point. As we approach the minimum point, the function's steepness decreases, so we take smaller steps. If we take more iterations, we will find this minimum point efficiently, as demonstrated earlier.

Great job! You've learned about `gradient descent`

, its importance in minimizing functions, and how to implement it in *Python*. You now understand how we can find the minimum value of a function by iteratively moving against the gradient.

Just like finding the lowest point in a valley by taking steps downhill, `gradient descent`

helps us find the minimum value of complex functions, essential in optimization problems.

Next, you'll practice by coding your version of `gradient descent`

and applying it to various functions. Dive into the practice tasks and solidify your understanding of using `gradient descent`

for optimization tasks!