Lesson 2

Welcome back to our exploration of clustering algorithms! Today, we'll cover a faster variant of the **k-means** algorithm: **mini-batch k-means**. It follows the same idea as k-means but processes only a small random sample of the data at each step, which speeds up computation while usually keeping clustering quality close to that of the standard algorithm. Let's discuss its Python implementation.

In machine learning, *mini-batches* are small subsets of the data, randomly drawn anew for each iteration of an algorithm. Because each iteration touches only a fraction of the dataset, it costs far less to compute. For mini-batch k-means, this significantly accelerates clustering on large datasets.
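To make the idea concrete, here is a minimal sketch of drawing one mini-batch with NumPy. The dataset, the batch size of 20, and the use of `replace=False` (so that no row appears twice in the batch) are illustrative assumptions, not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded generator, for reproducibility
data = rng.normal(size=(1000, 2))     # hypothetical dataset: 1000 points in 2-D

batch_size = 20
# Pick 20 distinct row indices at random, then slice out those rows.
idx = rng.choice(len(data), size=batch_size, replace=False)
batch = data[idx]

print(batch.shape)  # (20, 2)
```

Each iteration of a mini-batch algorithm would repeat the last two steps to get a fresh sample.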

Before delving into the mini-batch k-means implementation, we need a working dataset and a few helper functions. Our dataset consists of two distinct clusters. We'll write one function that calculates Euclidean distances, used to assign each data point to its closest centroid, and another that randomly initializes the centroids.

We calculate the Euclidean distance using the formula $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$, where $a_i$ and $b_i$ are the coordinates of the two points. This formula gives the straight-line distance between them.
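As a quick worked example of the formula, the points below are chosen for illustration so that the arithmetic is easy to check by hand: the coordinate differences are 3 and 4, so the distance is $\sqrt{9 + 16} = 5$.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Square the coordinate differences, sum them, take the square root.
d = np.sqrt(np.sum((a - b) ** 2))
print(d)  # 5.0
```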

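```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# Two Gaussian blobs of 100 points each, centered at (3, 3) and (-3, -3)
data = np.vstack([np.random.normal(loc=3, scale=1, size=(100, 2)),
                  np.random.normal(loc=-3, scale=1, size=(100, 2))])

def euclidean_distance(a, b):
    # Norm of the difference, taken over the last (coordinate) axis
    return np.linalg.norm(a - b, axis=-1)

def initialize_centers(data, k):
    # Pick k distinct data points as the starting centroids;
    # replace=False prevents choosing the same point twice
    idx = np.random.choice(len(data), size=k, replace=False)
    return data[idx, :]
```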

This implementation of the `euclidean_distance` function is more versatile than the one from the previous lesson: it assumes `a` and `b` are NumPy arrays, possibly with more than one dimension. It calls the NumPy function `linalg.norm`, which by default returns the Euclidean norm (2-norm) of a vector. The `axis=-1` argument tells it to compute that norm along the last axis of `a - b`, the coordinate axis, so arrays holding many points at once yield one distance per pair of points.
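The following sketch shows how the shapes work out when distances from several points to several centers are computed in one call. The arrays `points` and `centers` are small made-up examples.

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])   # shape (3, 2)
centers = np.array([[0.0, 0.0], [3.0, 4.0]])              # shape (2, 2)

# Insert length-1 axes so the subtraction broadcasts to shape (3, 2, 2):
# one difference vector per (point, center) pair.
diff = points[:, None, :] - centers[None, :, :]

# Taking the norm over the last axis collapses each difference vector
# to a single distance, leaving a (3, 2) matrix.
dists = np.linalg.norm(diff, axis=-1)
print(diff.shape, dists.shape)
print(dists)
```

Row `i`, column `j` of `dists` holds the distance from point `i` to center `j`; for instance, the last point lies 10 units from the first center and 5 from the second.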

Let's put theory into practice by implementing mini-batch k-means. The `mini_batch_kMeans` function accepts the following parameters:

- `data`: Our sample dataset, which contains the 2-dimensional coordinates of each data point.
- `k`: The number of clusters our algorithm should identify.
- `iterations`: The number of iterations the algorithm performs. Each iteration moves the centroids based on a new mini-batch, so the estimate generally improves over the run, although individual steps can be noisy.
- `batch_size`: The number of data points randomly selected in each iteration. We save computation by not using the entire dataset in each iteration.

Our mini-batch k-means algorithm starts by initializing the centroids. Then, it enters an iterative process: in each iteration, it randomly selects a mini-batch, calculates Euclidean distances, assigns each batch point to the closest centroid, and recalculates each centroid as the mean of the batch points assigned to it.

```python
# Implement mini-batch k-means
def mini_batch_kMeans(data, k, iterations=10, batch_size=20):
    centers = initialize_centers(data, k)
    for _ in range(iterations):
        # Draw a random mini-batch of rows from the dataset
        idx = np.random.choice(len(data), size=batch_size)
        batch = data[idx, :]
        # Distance from every batch point to every center: shape (batch_size, k)
        dists = euclidean_distance(batch[:, None, :], centers[None, :, :])
        labels = np.argmin(dists, axis=1)
        for i in range(k):
            # Move a center only if the batch assigned points to it
            if np.sum(labels == i) > 0:
                centers[i] = np.mean(batch[labels == i], axis=0)
    return centers

centers = mini_batch_kMeans(data, k=2)
```
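The loop above replaces each centroid with the mean of the current batch alone. A commonly used variant of mini-batch k-means instead keeps a per-center count and nudges each centroid toward every newly assigned point with a step size of 1/count, so that a centroid averages over all points it has seen so far. The following is a minimal sketch of that variant, not this lesson's implementation; the function name `mini_batch_kmeans_running`, the `seed` parameter, and the default of 100 iterations are illustrative assumptions.

```python
import numpy as np

def mini_batch_kmeans_running(data, k, iterations=100, batch_size=20, seed=0):
    """Mini-batch k-means sketch with per-center running-average updates."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # how many points each center has absorbed so far
    for _ in range(iterations):
        batch = data[rng.choice(len(data), size=batch_size)]
        dists = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            lr = 1.0 / counts[j]  # step size shrinks as the center sees more points
            centers[j] = (1.0 - lr) * centers[j] + lr * x
    return centers

# Same kind of two-blob dataset as in the lesson, rebuilt here so the
# sketch runs on its own.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(3, 1, size=(100, 2)),
                  rng.normal(-3, 1, size=(100, 2))])
centers = mini_batch_kmeans_running(data, k=2)
print(centers)
```

Because the step size decays, later batches move the centroids less and less, which tends to make the final positions less sensitive to the last batch drawn.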

After obtaining the final centroids, it's time to visualize the result. The plot shows the data points together with the final centroids, drawn as large red dots. Successful clustering will reveal two distinct groups of points.

This plot is a quick visual check on the algorithm: if it worked, each red dot should sit near the center of its cluster.

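```python
# Assign every point to its nearest final centroid so the plot can color by cluster
labels = np.argmin(euclidean_distance(data[:, None, :], centers[None, :, :]), axis=1)
plt.scatter(data[:, 0], data[:, 1], c=labels, s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.show()
```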

Here is the resulting image:

Mini-batch k-means has both strengths and limitations. Its strengths are computational speed and applicability to large datasets. However, because each update is based on a small random sample rather than the full dataset, its centroids are noisier, and the final clustering may be slightly less precise than that of classic k-means. The algorithm is most useful in large-scale data mining, where time and computational resources are the critical constraints.
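One common way to put a number on "precision" is the sum of squared distances from each point to its nearest centroid, often called inertia; lower is better. The sketch below is an illustrative helper, not part of the lesson's code, and both sets of centers are made-up values chosen so the result can be checked by hand.

```python
import numpy as np

def inertia(data, centers):
    """Sum of squared distances from each point to its nearest center."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
    return float(np.sum(np.min(dists, axis=1) ** 2))

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([[0.0, 1.0], [10.0, 1.0]])    # centers at the true cluster means
noisy = np.array([[0.0, 0.0], [10.0, 2.0]])   # hypothetical, slightly off centers

print(inertia(data, good))   # 4.0
print(inertia(data, noisy))  # 8.0
```

Computing this value for centroids from mini-batch k-means and from classic k-means on the same data gives a direct comparison of clustering quality.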

Today's exploration of mini-batch k-means has introduced us to a more efficient approach to k-means, implemented in Python. Experimenting with different values of `k`, `iterations`, and `batch_size` and observing how the output changes will solidify your understanding. Get ready for some exercises that let you delve deeper into the world of clustering!