Lesson 2

Recall that **Stochastic Gradient Descent** (SGD) is an efficient optimization algorithm that updates the model's parameters one training example at a time. That efficiency comes at a cost: the per-example updates are noisy, so the loss can fluctuate instead of decreasing smoothly. In this lesson we'll discuss **Mini-Batch Gradient Descent** (MBGD), a technique that combines the efficiency of SGD with the stability of Batch Gradient Descent. By the end of today's lesson, you'll understand the theory behind MBGD and be ready to implement it using Python.

SGD's power lies in its efficiency, especially on large datasets, but it has limitations. Because the model's parameters are updated after every single example, each step follows a gradient estimated from one data point, and the loss can jump up and down from one iteration to the next. This instability is one of the primary challenges that MBGD aims to overcome.

MBGD offers a conceptual middle ground between SGD and Batch Gradient Descent. SGD updates the parameters after each example, and Batch Gradient Descent updates them once per full pass over the dataset. MBGD instead divides the dataset into small subsets, or mini-batches. For each mini-batch, it computes the gradient of the cost function with respect to the model's parameters and updates the parameters accordingly.

A distinguishing feature of MBGD is that the size of the mini-batches can be tuned. If the batch size equals the dataset size, MBGD behaves as Batch Gradient Descent. If the batch size is 1, it acts like SGD. In practice, a mini-batch size between 10 and 1000 is typically selected.
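To make the effect of the batch size concrete, here is a minimal sketch that measures how much the mini-batch gradient estimate varies at a fixed parameter vector. The toy data, the seed, and the helper names `batch_gradient` and `gradient_spread` are assumptions made for illustration and are not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy regression data: 200 rows, 3 features
X = rng.random((200, 3))
y = X @ np.array([[5.0], [-3.0], [2.0]]) + rng.standard_normal((200, 1))
theta = np.zeros((3, 1))  # fixed point at which every gradient is evaluated


def batch_gradient(X_batch, y_batch, theta):
    """Gradient of the mean squared error on one batch."""
    return 2 / len(X_batch) * X_batch.T @ (X_batch @ theta - y_batch)


def gradient_spread(batch_size, trials=200):
    """Mean per-coordinate standard deviation of the gradient over random batches."""
    grads = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        grads.append(batch_gradient(X[idx], y[idx], theta).ravel())
    return float(np.std(grads, axis=0).mean())


# Batch size 1 mimics SGD, len(X) mimics Batch Gradient Descent
spread = {b: gradient_spread(b) for b in (1, 32, len(X))}
for b, s in spread.items():
    print(f"batch_size={b:>3}: gradient spread = {s:.4f}")
```

In this setup the spread shrinks as the batch grows and is essentially zero for the full batch, since every full-batch draw contains the same rows. That is the trade-off the batch size controls: smaller batches give cheaper but noisier updates.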

Now, we'll delve into Python to implement MBGD. For this, we'll use `numpy` for numerical computations. The `gradient_descent` function carries out the Mini-Batch Gradient Descent:

```python
import numpy as np


def gradient_descent(X, y, learning_rate=0.01, batch_size=16, epochs=100):
    m, n = X.shape
    theta = np.random.randn(n, 1)  # random initialization

    for epoch in range(epochs):
        # Reshuffle every epoch so the mini-batches differ between passes
        shuffled_indices = np.random.permutation(m)
        X_shuffled = X[shuffled_indices]
        y_shuffled = y[shuffled_indices]

        for i in range(0, m, batch_size):
            xi = X_shuffled[i:i + batch_size]
            yi = y_shuffled[i:i + batch_size]

            # 2 / batch_size, using the actual row count so a short final batch is averaged correctly
            gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
            theta = theta - learning_rate * gradients

    return theta
```

The code above starts by initializing the weights randomly and then iterates through the shuffled dataset in small batches. For each batch, it calculates the gradient of the loss with respect to the weights. The gradient points in the direction in parameter space along which the error increases fastest, so the weights are moved a small step in the opposite direction. This process is repeated for several epochs.
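One way to check that these updates actually reduce the error is to record the loss after every epoch. The sketch below does this with the same update rule. The names `mse` and `gradient_descent_with_history`, the `seed` parameter, and the toy data are assumptions added for illustration.

```python
import numpy as np


def mse(X, y, theta):
    """Mean squared error of the linear model X @ theta."""
    return float(np.mean((X @ theta - y) ** 2))


def gradient_descent_with_history(X, y, learning_rate=0.01, batch_size=16,
                                  epochs=100, seed=0):
    """Mini-batch gradient descent that also records the loss after each epoch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = rng.standard_normal((n, 1))  # random initialization
    history = [mse(X, y, theta)]         # loss before any update

    for _ in range(epochs):
        order = rng.permutation(m)       # new shuffle every epoch
        for start in range(0, m, batch_size):
            rows = order[start:start + batch_size]
            xi, yi = X[rows], y[rows]
            gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)
            theta = theta - learning_rate * gradients
        history.append(mse(X, y, theta))

    return theta, history


# Hypothetical toy data with y stored as a column vector
rng = np.random.default_rng(1)
X = rng.random((100, 3))
y = X @ np.array([[5.0], [-3.0], [2.0]]) + rng.standard_normal((100, 1))

theta, history = gradient_descent_with_history(X, y)
print(f"MSE before training: {history[0]:.3f}")
print(f"MSE after training:  {history[-1]:.3f}")
```

Plotting `history` against the epoch number is a simple way to see how the learning rate and batch size affect convergence.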

The `2 / batch_size` term in the gradient calculation comes from differentiating the mean squared error loss function. Dividing by the batch size averages the squared errors over the batch, and the `2` comes from applying the power rule to each squared error term.
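The derivation behind that factor can be written out as follows. This is a sketch in notation introduced here, not in the lesson: $B$ is the current mini-batch, $X_B$ and $y_B$ are its rows of features and targets (`xi` and `yi` in the code), and $\eta$ is the learning rate.

```latex
\begin{aligned}
L_B(\theta) &= \frac{1}{|B|} \sum_{i \in B} \left( x_i^\top \theta - y_i \right)^2
             = \frac{1}{|B|} \left\lVert X_B \theta - y_B \right\rVert^2 \\
\nabla_\theta L_B(\theta) &= \frac{2}{|B|} \sum_{i \in B} x_i \left( x_i^\top \theta - y_i \right)
             = \frac{2}{|B|} X_B^\top \left( X_B \theta - y_B \right) \\
\theta &\leftarrow \theta - \eta \, \nabla_\theta L_B(\theta)
\end{aligned}
```

The second line corresponds to the `gradients` expression in the code, and the third line to the `theta` update.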

Let's demonstrate the effectiveness of Mini-Batch Gradient Descent by applying it to a simple dataset:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Apply function to some data
X = np.random.rand(100, 3)
# Sample linear regression problem; the signal is reshaped to a column so that
# adding the (100, 1) noise does not broadcast to a (100, 100) array
y = (5 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2]).reshape(-1, 1) + np.random.randn(100, 1)
theta = gradient_descent(X, y)

# Predict and calculate MAE
predictions = X.dot(theta)
mae = mean_absolute_error(y, predictions)
print(f"MAE: {mae}")  # value varies from run to run (no random seed is set)
```

After the data is generated, `gradient_descent` initializes `theta` randomly and optimizes it using MBGD. Finally, predictions are generated by multiplying the data `X` by the optimized `theta` parameter.

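Mathematically, these predictions approximate the least-squares best fit to the dataset. With three features and no intercept term, the fitted model is a hyperplane through the origin rather than a single line.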
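To see how close MBGD gets to that best fit, one option is to compare it with the exact least-squares solution from `np.linalg.lstsq`. The sketch below assumes its own toy data and seed, and `minibatch_fit` is a hypothetical stand-in that uses the same update rule as the lesson's function.

```python
import numpy as np


def minibatch_fit(X, y, learning_rate=0.01, batch_size=16, epochs=100, seed=0):
    """Mini-batch gradient descent for a linear model with squared error loss."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = rng.standard_normal((n, 1))
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            rows = order[start:start + batch_size]
            xi, yi = X[rows], y[rows]
            theta = theta - learning_rate * (2 / len(xi)) * xi.T @ (xi @ theta - yi)
    return theta


# Hypothetical toy data with y stored as a column vector
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([[5.0], [-3.0], [2.0]]) + rng.standard_normal((100, 1))

theta_mbgd = minibatch_fit(X, y)
# Exact least-squares solution of X @ theta = y, used as the reference
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

mse_mbgd = float(np.mean((X @ theta_mbgd - y) ** 2))
mse_exact = float(np.mean((X @ theta_exact - y) ** 2))
print(f"MSE with MBGD:          {mse_mbgd:.4f}")
print(f"MSE with least squares: {mse_exact:.4f}")
```

The least-squares MSE is a lower bound for any `theta` on this data, so the gap between the two numbers indicates how far MBGD still is from convergence. More epochs or a larger learning rate would be expected to narrow it.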

Today, we examined how Mini-Batch Gradient Descent works and how it improves on both SGD and Batch Gradient Descent. Now, work through the practice sessions to solidify these concepts. Each one moves you further toward mastering optimization techniques. Let's keep progressing!