Lesson 2
Enhancing Machine Learning Expertise: Mini-Batch K-Means Clustering Explained
Introduction

Welcome back to our exploration of clustering algorithms! Today, we'll cover a faster variant of the k-means algorithm: mini-batch k-means. While closely related to classic k-means, this variant greatly improves computational speed while preserving good clustering quality. Let's discuss its Python implementation.

Understanding the Mini-Batch Concept

In machine learning, a mini-batch is a small subset of the data selected at random for each iteration of an algorithm. Because each iteration processes only this subset rather than the full dataset, the cost per iteration drops substantially. For mini-batch k-means in particular, this technique significantly accelerates the clustering process.
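
As a minimal sketch of the idea (the toy array, seed, and batch size here are arbitrary, chosen only for illustration):

Python
import numpy as np

np.random.seed(42)
points = np.random.normal(size=(1000, 2))  # a toy dataset of 1000 2-D points

# One "iteration" of mini-batch selection: sample 20 random rows
batch_idx = np.random.choice(len(points), size=20, replace=False)
mini_batch = points[batch_idx]
print(mini_batch.shape)  # (20, 2)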

Generating the Dataset and Preliminaries

Before delving into the mini-batch k-means implementation, we must establish some preparatory functions and a working dataset. Our dataset consists of two distinct clusters. We'll need a function to calculate the Euclidean distance and another to randomly initialize our centroids, so that each data point can be assigned to its closest centroid.

We calculate the Euclidean distance using the formula d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}, which represents the straight-line distance between two points. For example, the distance between (0, 0) and (3, 4) is \sqrt{3^2 + 4^2} = 5.

Python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# Two Gaussian blobs of 100 points each, centered at (3, 3) and (-3, -3)
data = np.vstack([np.random.normal(loc=3, scale=1, size=(100, 2)),
                  np.random.normal(loc=-3, scale=1, size=(100, 2))])

def euclidean_distance(a, b):
    # L2 norm of a - b along the last axis; works on broadcast arrays
    return np.linalg.norm(a - b, axis=-1)

def initialize_centers(data, k):
    # Pick k distinct data points as the initial centroids
    idx = np.random.choice(len(data), size=k, replace=False)
    return data[idx, :]

This implementation of the euclidean_distance function is more versatile than the one from the previous lesson: it assumes a and b are numpy arrays of potentially multidimensional data. It calls the numpy function linalg.norm, which here computes the Euclidean (L2) norm. The axis=-1 parameter means the norm is taken along the last axis of the (broadcast) difference a - b, so whole arrays of distances can be computed in a single call.
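
To make the broadcasting concrete, here is a small check using the euclidean_distance function defined above (the batch and centers values are made up purely for illustration):

Python
batch = np.array([[0.0, 0.0], [3.0, 4.0]])    # shape (2, 2): two points
centers = np.array([[0.0, 0.0], [6.0, 8.0]])  # shape (2, 2): two centers

# (2, 1, 2) - (1, 2, 2) broadcasts to (2, 2, 2); the norm over the last
# axis yields a (2, 2) matrix of point-to-center distances
dists = euclidean_distance(batch[:, None, :], centers[None, :, :])
print(dists)
# [[ 0. 10.]
#  [ 5.  5.]]

This same pattern reappears inside the algorithm below, where it computes the distance from every point in a mini-batch to every centroid at once.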

Python Mini-Batch K-Means Algorithm

Let's put theory into practice by implementing the mini-batch k-means. The mini_batch_kMeans function accepts the following parameters:

  • data: Our sample dataset, containing the 2-dimensional coordinates of each data point.
  • k: The number of clusters our algorithm should identify.
  • iterations: The number of iterations the algorithm will perform. Each iteration moves the centroids, gradually refining the clustering.
  • batch_size: The number of data points randomly selected in each iteration. Using only a subset of the dataset in each iteration keeps the per-iteration cost low.

Our mini-batch k-means algorithm starts by initializing the centroids. Then, it enters an iterative process: in each iteration, it randomly selects a mini-batch, calculates Euclidean distances, assigns each point in the batch to the closest centroid, and recalculates each centroid as the mean of the batch points assigned to it.

Python
# Implement mini-batch k-means
def mini_batch_kMeans(data, k, iterations=10, batch_size=20):
    centers = initialize_centers(data, k)
    for _ in range(iterations):
        # Randomly sample a mini-batch of points
        idx = np.random.choice(len(data), size=batch_size)
        batch = data[idx, :]
        # Distance from every batch point to every center: shape (batch_size, k)
        dists = euclidean_distance(batch[:, None, :], centers[None, :, :])
        # Assign each batch point to its nearest center
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of the batch points assigned to it
        for i in range(k):
            if np.sum(labels == i) > 0:
                centers[i] = np.mean(batch[labels == i], axis=0)
    return centers

centers = mini_batch_kMeans(data, k=2)
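
Because each iteration works on a fresh random mini-batch, the returned centers vary slightly from run to run. A quick, hypothetical experiment (the iteration count and batch sizes below are arbitrary choices for illustration) shows how batch_size affects the result:

Python
# Compare recovered centers for several batch sizes, reseeding for comparability
for bs in (10, 50, 200):
    np.random.seed(1)
    c = mini_batch_kMeans(data, k=2, iterations=50, batch_size=bs)
    print(bs, np.round(c, 2))
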
Interpreting the Results

After obtaining the final centroids, it's time to visualize the result. The scatter plot shows the data points together with the final centroids (the semi-transparent red dots). Successful clustering will reveal two distinct groups of points, with each red dot sitting near the center of its respective cluster.

This readily interpretable graph serves as a visual sanity check for the algorithm.

Python
# Plot the data points and overlay the final centroids in red
plt.scatter(data[:, 0], data[:, 1], s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.show()

[Resulting plot: the two clusters of data points, with a semi-transparent red dot marking each final centroid.]

Strengths, Drawbacks, and Applications

The mini-batch k-means algorithm, though a robust tool, has both benefits and limitations. Its strengths are computational speed and scalability to large datasets. However, because each update sees only a sample of the data, it may not be as precise as classic k-means. The algorithm excels in large-scale data mining settings where time and computational resources are critical constraints.
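
For such large-scale work you would typically reach for a library implementation rather than our educational version. As a minimal sketch (assuming scikit-learn is installed), its MiniBatchKMeans estimator exposes the same ideas:

Python
from sklearn.cluster import MiniBatchKMeans

# Fit mini-batch k-means to the same dataset; n_clusters and batch_size
# mirror the k and batch_size parameters of our own implementation
mbk = MiniBatchKMeans(n_clusters=2, batch_size=20, random_state=0)
mbk.fit(data)
print(mbk.cluster_centers_)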

Lesson Summary and Practice

Today's exploration of mini-batch k-means has introduced a more efficient take on k-means, implemented in Python. Experimenting with different parameters and observing how the output changes will solidify your understanding and skill set. Now, prepare for some exercises that let you delve deeper into the intriguing world of clustering!
