Let's recall that Stochastic Gradient Descent (SGD) is an efficient optimization algorithm that updates the model's parameters using one training example at a time. However, these single-example updates are noisy, which can make the loss function unstable during training. To overcome this limitation, we'll discuss Mini-Batch Gradient Descent (MBGD) in this session - a technique that combines the best attributes of SGD and Batch Gradient Descent. By the end of today's lesson, you'll understand the theory behind MBGD and be ready to implement it using Python.
While SGD's power lies in its efficiency, especially when dealing with large datasets, it has limitations. Because the model's parameters are updated after every individual example, each update relies on a noisy estimate of the true gradient, and the loss can fluctuate from step to step instead of decreasing steadily. This instability is one of the primary challenges that MBGD aims to overcome.
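To make the instability concrete, here is a minimal sketch that measures how far a gradient computed on a random subset of the data lands from the full-dataset gradient. The toy dataset, the seed, and the helper names mse_gradient and gradient_spread are assumptions made for this illustration and are not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data: 1000 examples, 3 features, Gaussian noise.
X = rng.random((1000, 3))
true_theta = np.array([[5.0], [-3.0], [2.0]])
y = X @ true_theta + rng.standard_normal((1000, 1))

theta = np.zeros((3, 1))  # fixed point at which every gradient is evaluated


def mse_gradient(X_part, y_part, theta):
    """Gradient of the mean squared error over the given rows."""
    return 2 / len(X_part) * X_part.T @ (X_part @ theta - y_part)


full_gradient = mse_gradient(X, y, theta)


def gradient_spread(batch_size, trials=500):
    """Average distance between a random-batch gradient and the full gradient."""
    distances = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        batch_gradient = mse_gradient(X[idx], y[idx], theta)
        distances.append(np.linalg.norm(batch_gradient - full_gradient))
    return float(np.mean(distances))


spread = {b: gradient_spread(b) for b in (1, 16, 256)}
for batch_size, value in spread.items():
    print(f"batch_size={batch_size:>3}: mean distance from full gradient = {value:.3f}")
```

In a run of this sketch, the distance should shrink as the batch grows, which is the sense in which larger batches give steadier updates than single examples.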
MBGD offers a conceptual middle ground between SGD and Batch Gradient Descent. Rather than using a single example for each update (as SGD does) or the entire dataset (as Batch Gradient Descent does), MBGD divides the dataset into small subsets called mini-batches. It then computes the gradient of the cost function with respect to the model's parameters on one mini-batch at a time and updates the parameters accordingly.
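Written as a formula, one MBGD step could be expressed as follows. The notation is an assumption chosen for this sketch: θ stands for the parameters, η for the learning rate, B for the current mini-batch, and ℓ for the per-example loss.

```latex
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J_{B}(\theta),
\qquad
J_{B}(\theta) = \frac{1}{|B|} \sum_{i \in B} \ell\bigl(\theta; x_i, y_i\bigr)
```

With |B| = 1 this reduces to the SGD update, and with B equal to the whole dataset it reduces to the Batch Gradient Descent update.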
A distinguishing feature of MBGD is that the size of the mini-batches can be tuned. MBGD behaves as Batch Gradient Descent if the batch size equals the dataset size. If the batch size is 1, it acts like SGD. In practice, however, a mini-batch size between 10 and 1000 is typically selected.
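The effect of the batch size on the number of parameter updates per epoch can be sketched in a few lines. The dataset size of 100 and the helper name batch_sizes are hypothetical, chosen only for this illustration.

```python
def batch_sizes(m, batch_size):
    """Sizes of the consecutive mini-batches obtained by slicing m examples."""
    return [min(batch_size, m - start) for start in range(0, m, batch_size)]


m = 100  # hypothetical dataset size
for batch_size in (1, 16, m):
    sizes = batch_sizes(m, batch_size)
    print(f"batch_size={batch_size:>3}: {len(sizes):>3} updates per epoch, "
          f"last batch holds {sizes[-1]} examples")
```

Note that when the batch size does not divide the dataset size evenly, the final mini-batch of each epoch is smaller than the others.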
Now, we'll delve into Python to implement MBGD. For this, we'll use numpy for numerical computations. The gradient_descent function carries out the Mini-Batch Gradient Descent:
```python
import numpy as np


def gradient_descent(X, y, learning_rate=0.01, batch_size=16, epochs=100):
    m, n = X.shape
    theta = np.random.randn(n, 1)  # random initialization

    for epoch in range(epochs):
        # Reshuffle once per epoch so each pass uses different mini-batches
        shuffled_indices = np.random.permutation(m)
        X_shuffled = X[shuffled_indices]
        y_shuffled = y[shuffled_indices]

        for i in range(0, m, batch_size):
            xi = X_shuffled[i:i + batch_size]
            yi = y_shuffled[i:i + batch_size]

            # Gradient of the mean squared error over this mini-batch.
            # len(xi) equals batch_size except possibly for the last, smaller batch.
            gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
            theta = theta - learning_rate * gradients

    return theta
```
The code above starts by initializing random weights and then iterates through the shuffled dataset in small batches. For each batch, it calculates the gradient, which points in the direction in parameter space where the error increases fastest, and moves the weights a small step in the opposite direction to decrease the error. This process is repeated for several epochs.
The 2 / batch_size term in the gradient calculation is part of the derivative of the mean squared error loss function. Dividing by the number of examples in the batch averages the squared errors, and the 2 comes from the derivative of each squared error term (the power rule in calculus).
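The derivation behind that factor can be sketched as follows. The symbols are assumptions introduced for this note: b is the number of examples in the mini-batch, X_B and y_B are the batch's feature matrix and target vector, and x_i is the i-th row of X_B written as a column vector.

```latex
\begin{aligned}
J_B(\theta) &= \frac{1}{b} \sum_{i=1}^{b} \bigl(x_i^{\top}\theta - y_i\bigr)^2
             = \frac{1}{b} \,\lVert X_B\theta - y_B \rVert^2 \\
\nabla_{\theta} J_B(\theta) &= \frac{2}{b} \sum_{i=1}^{b} x_i \bigl(x_i^{\top}\theta - y_i\bigr)
             = \frac{2}{b} \, X_B^{\top} \bigl(X_B\theta - y_B\bigr)
\end{aligned}
```

The last expression has the same form as the gradient line in the code, with xi playing the role of X_B and yi the role of y_B.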
Let's demonstrate the effectiveness of Mini-Batch Gradient Descent by applying it to a simple dataset:
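```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Apply the function to some data: a sample linear regression problem
X = np.random.rand(100, 3)
# Reshape the signal into a column so that adding the (100, 1) noise
# does not broadcast y to shape (100, 100)
y = (5 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2]).reshape(-1, 1) + np.random.randn(100, 1)
theta = gradient_descent(X, y)

# Predict and calculate MAE
predictions = X.dot(theta)
mae = mean_absolute_error(y, predictions)
# The value varies from run to run (no seed is set); the original lesson shows MAE: 1.0887166179544072
print(f"MAE: {mae}")
```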
After generating our data, we call gradient_descent, which initializes theta randomly and optimizes it using MBGD. Finally, predictions are generated for the same data by multiplying X with the optimized theta parameter.
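Geometrically, these predictions approximate the best-fit hyperplane through the dataset; with a single feature, this would be the familiar line of best fit. Note that the model here has no intercept term, so the fitted hyperplane passes through the origin.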
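One way to sanity-check such a fit is to compare it against the closed-form least-squares solution, which numpy provides through np.linalg.lstsq. The sketch below is self-contained and makes several assumptions that are not in the lesson: a fixed seed, a compact re-implementation named mini_batch_gd, and a larger learning rate and epoch count so that the comparison is made close to convergence.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, an assumption made for repeatability

# Same kind of synthetic problem as in the lesson: 3 features plus Gaussian noise.
X = rng.random((100, 3))
y = X @ np.array([[5.0], [-3.0], [2.0]]) + rng.standard_normal((100, 1))


def mini_batch_gd(X, y, learning_rate=0.05, batch_size=16, epochs=300):
    """Hypothetical compact mini-batch gradient descent for MSE linear regression."""
    m, n = X.shape
    theta = rng.standard_normal((n, 1))
    for _ in range(epochs):
        order = rng.permutation(m)  # reshuffle the rows every epoch
        for start in range(0, m, batch_size):
            rows = order[start:start + batch_size]
            xb, yb = X[rows], y[rows]
            gradient = (2 / len(rows)) * (xb.T @ (xb @ theta - yb))
            theta = theta - learning_rate * gradient
    return theta


theta_gd = mini_batch_gd(X, y)
theta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)  # exact least-squares solution

mse_gd = float(np.mean((X @ theta_gd - y) ** 2))
mse_closed = float(np.mean((X @ theta_closed - y) ** 2))

print("mini-batch GD theta:", theta_gd.ravel())
print("closed-form theta:  ", theta_closed.ravel())
print(f"training MSE: mini-batch GD = {mse_gd:.4f}, closed form = {mse_closed:.4f}")
```

The closed-form solution minimizes the training MSE exactly, so the MBGD result can only match or slightly exceed it. A small gap suggests the optimizer has essentially converged, while a large gap would point to too few epochs or an unsuitable learning rate.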
Today, we examined how Mini-Batch Gradient Descent works and how it improves on both SGD and Batch Gradient Descent. Now, work through the practice sessions to solidify these concepts. Each practice session moves you further along in mastering optimization techniques. Let's keep progressing!