Optimizing Machine Learning with Mini-Batch Gradient Descent

Lesson 2

Introduction

Let's recall that Stochastic Gradient Descent (SGD) updates the model's parameters after every single training example, which makes each update cheap to compute. However, because each update relies on only one example, the gradient estimates are noisy, and the loss can fluctuate considerably from one step to the next. To address this, we'll discuss Mini-Batch Gradient Descent (MBGD) in this session, a technique that combines the strengths of SGD and Batch Gradient Descent. By the end of today's lesson, you'll understand the theory behind MBGD and be ready to implement it using Python.

Understanding the drawbacks of SGD

While SGD's power lies in its efficiency, especially when dealing with large datasets, it has limitations. Because the model's parameters are updated after every single example, each update follows a noisy estimate of the true gradient, so the loss can jump around instead of decreasing smoothly. This instability is one of the primary challenges that MBGD aims to overcome.
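To make this noise concrete, the sketch below compares gradients computed on randomly sampled batches with the gradient computed on the full dataset. Everything in it is an illustrative assumption rather than part of the lesson's code: the toy data, the fixed seed, and the helper names mse_gradient and mean_gradient_error.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical toy regression data: 1000 samples, 3 features
X = rng.random((1000, 3))
true_theta = np.array([[5.0], [-3.0], [2.0]])
y = X @ true_theta + rng.standard_normal((1000, 1))

theta = np.zeros((3, 1))  # arbitrary point at which the gradients are evaluated


def mse_gradient(X_batch, y_batch, theta):
    """Gradient of the mean squared error on one batch."""
    return 2 / len(X_batch) * X_batch.T @ (X_batch @ theta - y_batch)


full_gradient = mse_gradient(X, y, theta)


def mean_gradient_error(batch_size, trials=200):
    """Average distance between a sampled batch gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        batch_gradient = mse_gradient(X[idx], y[idx], theta)
        errors.append(np.linalg.norm(batch_gradient - full_gradient))
    return float(np.mean(errors))


noise_by_batch_size = {b: mean_gradient_error(b) for b in (1, 16, 256)}
for b, err in noise_by_batch_size.items():
    print(f"batch_size={b:>3}: mean gradient error = {err:.3f}")
```

A batch size of 1 corresponds to SGD. The reported error shrinks as the batch grows, which is the effect that averaging over more examples is expected to have.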

Introduction to Mini-Batch Gradient Descent

MBGD offers a middle ground between SGD and Batch Gradient Descent. Unlike either of them, MBGD divides the dataset into small subsets, or mini-batches. It then computes the gradient of the cost function with respect to the model's parameters on one mini-batch at a time and updates the parameters accordingly.

A distinguishing feature of MBGD is that the size of the mini-batches can be tuned. If the batch size equals the dataset size, MBGD behaves as Batch Gradient Descent. If the batch size is 1, it acts like SGD. In practice, however, a mini-batch size between 10 and 1000 is typically selected.
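The effect of the batch size on the number of parameter updates per epoch can be sketched as follows. The helper batch_slices and the dataset size of 1000 are hypothetical, introduced only for illustration.

```python
def batch_slices(m, batch_size):
    """Return (start, stop) index pairs that cover m samples in order."""
    return [(i, min(i + batch_size, m)) for i in range(0, m, batch_size)]


m = 1000  # hypothetical dataset size
for batch_size in (1, 32, m):
    slices = batch_slices(m, batch_size)
    last_size = slices[-1][1] - slices[-1][0]
    print(f"batch_size={batch_size:>4}: {len(slices):>4} updates per epoch, "
          f"last batch has {last_size} samples")
```

With 1000 samples, a batch size of 1 gives 1000 updates per epoch (the SGD case), a batch size of 1000 gives a single update (the Batch Gradient Descent case), and a batch size of 32 gives 32 updates, the last of which sees only 8 samples.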

Implementing Mini-Batch Gradient Descent in Python

Now, let's implement MBGD in Python, using NumPy for the numerical computations. The gradient_descent function below carries out Mini-Batch Gradient Descent for a linear model:

Python
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, batch_size=16, epochs=100):
    m, n = X.shape
    theta = np.random.randn(n, 1)  # random initialization

    for epoch in range(epochs):
        # Reshuffle every epoch so the mini-batches differ between passes
        shuffled_indices = np.random.permutation(m)
        X_shuffled = X[shuffled_indices]
        y_shuffled = y[shuffled_indices]

        for i in range(0, m, batch_size):
            xi = X_shuffled[i:i + batch_size]
            yi = y_shuffled[i:i + batch_size]

            # MSE gradient on this batch; 2 / len(xi) equals 2 / batch_size
            # except on a smaller final batch
            gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
            theta = theta - learning_rate * gradients

    return theta

The code above starts by initializing random weights and then iterates through the shuffled dataset in small batches. For each batch, it calculates the gradient, which points in the direction of steepest increase of the error in parameter space, and moves the weights a small step in the opposite direction. This process is repeated for several epochs.
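One way to check that the updates are actually reducing the error is to record the loss after every epoch. The sketch below is a variant written for this purpose, not the lesson's function. The name gradient_descent_with_history, the seed parameter, and the low-noise toy data are all assumptions.

```python
import numpy as np


def gradient_descent_with_history(X, y, learning_rate=0.01, batch_size=16,
                                  epochs=100, seed=0):
    """Mini-batch gradient descent that also records the full-data MSE per epoch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = rng.standard_normal((n, 1))
    history = []

    for _ in range(epochs):
        order = rng.permutation(m)
        X_shuffled, y_shuffled = X[order], y[order]
        for i in range(0, m, batch_size):
            xi = X_shuffled[i:i + batch_size]
            yi = y_shuffled[i:i + batch_size]
            gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)
            theta = theta - learning_rate * gradients
        # Loss on the whole dataset after this epoch's updates
        history.append(float(np.mean((X @ theta - y) ** 2)))

    return theta, history


# Hypothetical toy data with little noise, so the trend is easy to see
rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = X @ np.array([[5.0], [-3.0], [2.0]]) + 0.1 * rng.standard_normal((200, 1))

theta, history = gradient_descent_with_history(X, y)
print(f"MSE after epoch 1: {history[0]:.4f}")
print(f"MSE after epoch {len(history)}: {history[-1]:.4f}")
```

If the learning rate is suitable, the recorded loss should end well below where it started. Plotting history is a quick way to spot a learning rate that is too high or too low.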

The 2 / batch_size factor in the gradient calculation comes from differentiating the mean squared error loss function. The 2 comes from the power rule applied to the squared error term, and the division by the batch size comes from averaging the error over the samples in the batch.
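For readers who want the calculus spelled out, the derivation can be sketched as follows. The notation is an assumption introduced here: X_b and y_b denote the rows of one mini-batch, x_k and y_k are its individual samples, and b is the number of samples in it.

```latex
\begin{align}
J(\theta) &= \frac{1}{b} \sum_{k=1}^{b} \left( x_k^\top \theta - y_k \right)^2
           = \frac{1}{b} \lVert X_b \theta - y_b \rVert^2 \\
\nabla_\theta J(\theta)
          &= \frac{1}{b} \sum_{k=1}^{b} 2 \left( x_k^\top \theta - y_k \right) x_k
           = \frac{2}{b} X_b^\top \left( X_b \theta - y_b \right)
\end{align}
```

The final expression has the same form as the line that computes gradients in the code, with xi and yi playing the roles of X_b and y_b.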

Applying Mini-Batch Gradient Descent to a real-world problem

Let's demonstrate the effectiveness of Mini-Batch Gradient Descent by applying it to a simple dataset:

Python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Sample linear regression problem; y has shape (100, 1) to match theta
X = np.random.rand(100, 3)
true_theta = np.array([[5.0], [-3.0], [2.0]])
y = X.dot(true_theta) + np.random.randn(100, 1)
theta = gradient_descent(X, y)

# Predict and calculate MAE
predictions = X.dot(theta)
mae = mean_absolute_error(y, predictions)
print(f"MAE: {mae}")  # the exact value varies from run to run (no random seed is set)

After generating the data, we optimize theta using MBGD, which starts from randomly initialized parameters. Finally, predictions are generated by multiplying the data X by the optimized theta.

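Mathematically, these predictions lie on the plane that MBGD has fitted to the dataset, the multi-feature analogue of a line of best fit. Because the model has no intercept term, this plane passes through the origin.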
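As a point of reference, the exact least-squares solution for the same kind of problem can be computed in closed form with np.linalg.lstsq. MBGD approximates this solution iteratively. The sketch below assumes freshly generated toy data and a fixed seed, so its numbers are not comparable to the MAE printed above.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen only for reproducibility

# Hypothetical data with the same structure as the lesson's example
X = rng.random((100, 3))
true_theta = np.array([[5.0], [-3.0], [2.0]])
y = X @ true_theta + rng.standard_normal((100, 1))

# Closed-form least-squares fit (no intercept, matching the lesson's model)
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

mse_exact = float(np.mean((y - X @ theta_exact) ** 2))
mse_true = float(np.mean((y - X @ true_theta) ** 2))
mae_exact = float(np.mean(np.abs(y - X @ theta_exact)))

print("least-squares theta:", theta_exact.ravel())
print(f"MSE of least-squares fit: {mse_exact:.4f}")
print(f"MSE of the generating coefficients: {mse_true:.4f}")
print(f"MAE of least-squares fit: {mae_exact:.4f}")
```

The least-squares fit minimizes the mean squared error on this sample, so its MSE cannot exceed that of the coefficients used to generate the data. Comparing a theta found by MBGD against theta_exact on the same data is a simple way to judge how close the iterative solution has come.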

Lesson Summary and Practice

Today, we examined how Mini-Batch Gradient Descent works and how it improves on both SGD and Batch Gradient Descent. Now, work through the practice sessions to solidify these concepts. Let's keep progressing!
