Lesson 1
Mastering K-means Clustering with Python: From Theory to Practical Implementation
Introduction

Welcome to our exploration of Unsupervised Learning and Clustering. In this lesson, we'll delve into K-means clustering, clarify its underlying principles, and navigate through the implementation of the K-means clustering algorithm in Python.

Understanding Unsupervised Learning

Unsupervised Learning uses a dataset without labels to identify inherent patterns. Unlike Supervised Learning, which leverages known outcomes to predict labels for new data, Unsupervised Learning operates without any ground truth. One application is market basket analysis, which uncovers associations between products that customers tend to buy together.

K-means Clustering: Theory and Implementation Overview

Let's encapsulate the essence of K-means clustering: this iterative algorithm partitions a set of data points into a predefined number of clusters, grouping each point with its nearest cluster center. The K in K-means denotes the number of clusters. K-means relies on a distance metric, the most common choice being the Euclidean distance.
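
For reference, the Euclidean distance between two points p and q with d coordinates each is:

\text{dist}(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_d - q_d)^2}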

In subsequent sections, we'll adopt a hands-on approach to implement K-means clustering in Python. We'll be using libraries like numpy for numerical operations and matplotlib for visualizations. Let's get started!

Initializing and Preparing for K-means Clustering

First, we load the necessary libraries and define our data points:

Python
import numpy as np
import matplotlib.pyplot as plt

# Generate two Gaussian blobs of 100 two-dimensional points each
np.random.seed(0)
x1 = np.random.normal(loc=5, scale=1, size=(100, 2))
x2 = np.random.normal(loc=10, scale=2, size=(100, 2))
x = np.concatenate([x1, x2])

plt.scatter(x[:, 0], x[:, 1], label='True Position')
plt.show()

Next, we ready our dataset for K-means clustering: we set the number of clusters and initialize the centroids by picking k random data points. We also introduce helper functions for computing Euclidean distances and assigning each point to its closest centroid.

Python
k = 3
# Initialize centroids by picking k random data points
centroids = x[np.random.choice(range(x.shape[0]), size=k, replace=False), :]

def calc_distance(X1, X2):
    # Euclidean distance between two points
    return (sum((X1 - X2)**2))**0.5

def find_closest_centroids(ic, X):
    # Assign each point in X the index of its nearest centroid in ic
    assigned_centroid = []
    for i in X:
        distance = []
        for j in ic:
            distance.append(calc_distance(i, j))
        assigned_centroid.append(np.argmin(distance))
    return assigned_centroid
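
As a quick sanity check, we can already assign every point to its nearest initial centroid; this minimal snippet uses the x and centroids defined above:

Python
assignments = find_closest_centroids(centroids, x)
print(assignments[:10])  # nearest-centroid index (0..k-1) for the first 10 points
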
New Centroids Calculation

An important part of the K-means algorithm is updating the centroid positions. Once we have assigned data points to their nearest centroids, each centroid must move to the mean of all data points now in its cluster. In our code, the calc_centroids function serves this purpose:

Python
import pandas as pd

def calc_centroids(clusters, X):
    # Recompute each centroid as the mean of the points assigned to it
    new_centroids = []
    new_df = pd.concat([pd.DataFrame(X), pd.DataFrame(clusters, columns=['cluster'])], axis=1)
    for c in set(new_df['cluster']):
        current_cluster = new_df[new_df['cluster'] == c][new_df.columns[:-1]]
        cluster_mean = current_cluster.mean(axis=0)
        new_centroids.append(cluster_mean)
    return new_centroids

This function concatenates the cluster assignments with our numpy data array X as a new column, forming the DataFrame new_df. Iterating over each unique cluster, it computes the mean of all points in that cluster and appends the result to new_centroids.

Calculating the cluster mean rests on the fact that the centroid of a set of multivariate points is the collection of their coordinate-wise means; in two dimensions:

\text{centroid} = \left(\frac{x_1 + x_2 + \ldots + x_n}{n}, \frac{y_1 + y_2 + \ldots + y_n}{n}\right)

Here, n is the number of points in the cluster, while the individual x's and y's are the coordinates of each of those points.
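
To make the coordinate-wise mean concrete, here is a minimal sketch (with hypothetical values) of the computation calc_centroids performs for a single cluster:

Python
import numpy as np

cluster_points = np.array([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 6.0]])  # hypothetical members of one cluster
print(cluster_points.mean(axis=0))       # coordinate-wise mean: [3. 4.]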

The function returns new_centroids, a list of the updated centroids, one for each unique cluster.

Updating centroid positions iteratively refines the clusters, driving the K-means algorithm until it converges and the cluster assignments stop changing. This iterative optimization is what allows K-means to separate the data into distinct clusters.
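
Formally, each iteration reduces (or leaves unchanged) the within-cluster sum of squared distances that K-means minimizes:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of points assigned to cluster i and \mu_i is its centroid. Because J never increases, the algorithm always converges, though only to a local optimum.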

Performing and Visualizing K-means Clustering

We'll now see the K-means clustering logic at work. The first step is to create a function, kmeans_clustering, that encapsulates the main logic of K-means clustering. This function takes in the data points and the number of clusters, and returns the centroid coordinates and the assigned centroid for each point after the iterations.

Here's how we can define this function:

Python
def kmeans_clustering(x, k):
    # Initialize centroids - picking random samples
    centroids = x[np.random.choice(range(x.shape[0]), size=k, replace=False), :]

    for i in range(10):
        # Assign every data point to the closest centroid
        get_centroids = find_closest_centroids(centroids, x)
        # Recalculate centroid coordinates based on cluster members
        centroids = calc_centroids(get_centroids, x)

    return centroids, get_centroids

This function runs our K-means algorithm for a fixed 10 iterations, repeatedly assigning data points to the closest centroids and recalculating the centroid coordinates. Ten iterations is a simplification; in practice you would iterate until the assignments stop changing.
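
If you would rather stop as soon as the assignments stabilize, here is a minimal sketch of such a variant; kmeans_until_converged is a hypothetical helper built on the same find_closest_centroids and calc_centroids functions defined above:

Python
def kmeans_until_converged(x, k, max_iters=100):
    # Hypothetical variant: same random initialization as kmeans_clustering
    centroids = x[np.random.choice(range(x.shape[0]), size=k, replace=False), :]
    prev_assignments = None
    for _ in range(max_iters):
        assignments = find_closest_centroids(centroids, x)
        if assignments == prev_assignments:
            break  # no point changed cluster, so we have converged
        centroids = calc_centroids(assignments, x)
        prev_assignments = assignments
    return centroids, assignments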

Let's apply kmeans_clustering to our previously defined data x and glance into the results:

Python
k = 3
centroids, get_centroids = kmeans_clustering(x, k)
print("Centroids:", centroids)

The centroids array now holds the final centroid coordinates of our clusters.

Each data point is now assigned to a particular centroid (cluster). We can visualize this using matplotlib:

Python
plt.scatter(x[:, 0], x[:, 1], c=get_centroids)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], c='red')
plt.show()

Your plot now shows the data points grouped into clusters: each point's color indicates the cluster it belongs to, and the centroids are marked in red.

By organizing the main logic into a function, we have made our K-means algorithm reusable for different datasets and cluster configurations, giving us a tool for future data analysis tasks.

K-means Clustering with sklearn

For applications that require quick prototyping or deal with large multidimensional datasets, implementing the K-means clustering algorithm from scratch may not be practical. Thankfully, Python provides the Scikit-Learn library, also known as sklearn, which comes with many efficient tools for machine learning and statistical modeling, one of which is KMeans.

Firstly, we import the necessary libraries and initialize our data:

Python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Let's define our data
x = np.array([[2, 10],
              [2, 5],
              [8, 4],
              [5, 8],
              [7, 5],
              [6, 4],
              [1, 2],
              [4, 9]])

Let's instantiate a KMeans object and fit the model to our data. Here we set n_clusters to 3, the number of clusters we want. init is set to 'k-means++', which initializes the centroids to be (generally) distant from each other, typically leading to better results than random initialization.

Python
kmeans_model = KMeans(n_clusters=3, init='k-means++')
kmeans_model.fit(x)

After the model is fitted to the data, the labels of the clusters can be obtained by calling kmeans_model.labels_, and the cluster centers (or 'centroids') can be obtained by calling kmeans_model.cluster_centers_.

Python
centroid = kmeans_model.cluster_centers_
labels = kmeans_model.labels_
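
A fitted KMeans model can also assign clusters to points it was not trained on via its predict method; the new_points values below are just hypothetical examples:

Python
new_points = np.array([[0, 0], [6, 6]])  # hypothetical unseen observations
print(kmeans_model.predict(new_points))  # index of the nearest centroid for each point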

Now we shall plot the data points, with the color of each point denoting the cluster it belongs to, and the centroids marked in red.

Python
plt.scatter(x[:, 0], x[:, 1], c=labels)
plt.scatter(centroid[:, 0], centroid[:, 1], c='red')
plt.show()

This shows our data points now clustered into three distinct clusters. It's important to note that while the sklearn implementation automates many parts of K-means clustering, understanding the underlying processes and principles is crucial in better interpreting the results and troubleshooting, if necessary.

When we use K-means clustering in machine learning, we aim to partition our dataset into 'k' distinct clusters. The algorithm works by randomly initializing points as cluster centers and iteratively refining the cluster assignments and the center points. But there's a catch: the algorithm's success can depend hugely on how those initial centers are chosen. With a lucky draw, you get a good set of starting points and hence quicker, better convergence. With an unlucky one, the algorithm can converge to a poor local optimum. In machine learning, we're not fans of relying on such luck!

The n_init parameter tells the KMeans algorithm, "Hey, don't just start once; try multiple times with different random initializations and pick the best outcome." Essentially, n_init represents the number of times the algorithm will run with different centroid seeds.

To use n_init, you simply specify it when creating a KMeans instance:

Python
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)

Setting random_state at the same time ensures that your 'luck' is reproducible, meaning you can get the same result every time you run the algorithm with that state. It's like your very own space-time anchor in the realm of randomness!
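
As a quick check on what "best outcome" means here: after fitting, the model's inertia_ attribute holds the within-cluster sum of squared distances for the selected run, which is exactly what sklearn compares across the n_init restarts:

Python
kmeans.fit(x)  # x as defined earlier in this section
print("Inertia of the best run:", kmeans.inertia_)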

And that, my dear pioneering data astronaut, is how n_init equips you with the power to navigate the stochastic stars of K-means clustering.

Conclusion and Practice Mentoring

Congratulations! You have understood unsupervised learning and grasped the essence of K-means clustering, inching closer to proficiency in its Python implementation. Keep practicing to solidify your understanding. Modify the clusters or experiment with varying datasets for a broader scope of exploration. Our upcoming lessons will reveal several more exciting aspects such as clustering visualizations using matplotlib and the evaluation of K-means performance. Looking forward to your continued journey!
