Lesson 1
Understanding Clustering with k-Means Algorithm Basics
Introduction

Welcome to the fascinating landscape of Unsupervised Learning and Clustering. In this course, we'll explore the popular k-Means clustering algorithm, a simple yet powerful form of clustering. Although clustering might seem technical, if you've ever sorted your clothes into piles based on their colors or types, you've unknowingly performed a form of "clustering" — grouping similar items into different categories or clusters. Intrigued? Let's get started!

Understanding Clustering

Supervised learning is like learning with a teacher. In this type of machine learning, you provide the computer with labeled data, which means the algorithm is given the input data and the correct answers. The aim is to find a model that makes the best predictions based on the given examples.

Unsupervised learning, on the other hand, is like learning on your own. In this type, the algorithm is given data but no specific directions about what it should be looking for. The computer is expected to explore the data and find its own patterns or structures. It is called 'unsupervised' because there are no correct answers and no teacher.

Algorithms for unsupervised learning, like clustering, aim to group objects so that objects in the same group (a cluster) are more similar to each other than to those in other groups.

Consider an example: you have a list of fruits with their corresponding weights and volumes, and you want to sort them into two groups, but you don't know what the fruits are. You can perform clustering to segment the data into two clusters. Although we don't know what the fruits are, we can predict that data points in the same cluster are likely the same type of fruit.

Given data for a new piece of fruit, you could attempt to classify which group it belongs to by seeing which cluster center it is closest to.

This lesson will focus on the widely used k-Means clustering method.

k-Means Clustering

The k-Means clustering algorithm aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest centroid (mean). The steps involved are:

  1. Initialization: Random initialization of k centroids.
  2. Assignment: Allocation of each data point to the closest centroid.
  3. Update: Updating each centroid by computing the mean of all points allocated to its cluster.

We repeat steps 2 and 3 until the centroids cease to change significantly. For now, we will manually set k, the number of clusters.

Implementing k-Means: Setup

Now, let's translate this algorithm into Python code, using a dataset of 2D points for visual demonstration. Let's begin!

First, we initialize a simple dataset and define the number of clusters, k. We set k=2 and randomly initialize our cluster centroids from our data points.

Python
import random

# Toy dataset with 2D points
data = [(2, 3), (5, 3.4), (1.3, 1), (3, 4), (2, 3.5), (7, 5)]

# k-Means settings
k = 2
centers = random.sample(data, k)

Next, we create a distance() function to calculate the Euclidean distance between two points. This concept is integral to the k-Means algorithm, and the formula for Euclidean distance is:

$\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

for points $(x_1, y_1)$ and $(x_2, y_2)$.

Python
# Definition of Euclidean distance
def distance(point1, point2):
    return ((point1[0] - point2[0])**2 + (point1[1] - point2[1])**2) ** 0.5
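
As a quick sanity check, the classic 3-4-5 right triangle gives $\sqrt{3^2 + 4^2} = 5$:

Python
# Verify the helper on a known case
print(distance((0, 0), (3, 4)))  # 5.0
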
Implementing k-Means: Algorithm

Now, we proceed to implement the k-Means algorithm. This code begins by setting up the clusters, assigning each data point to the nearest centroid, and then updating the centroid based on the mean of all points in its cluster. The process is repeated until the centroids stabilize.

Python
# k-Means algorithm
def k_means(data, centers, k):
    while True:
        clusters = [[] for _ in range(k)]

        # Assign each data point to the closest center
        for point in data:
            distances = [distance(point, center) for center in centers]
            index = distances.index(min(distances))
            clusters[index].append(point)

        # Update centers to be the mean of the points in each cluster
        new_centers = []
        for cluster in clusters:
            center = (sum(point[0] for point in cluster) / len(cluster),
                      sum(point[1] for point in cluster) / len(cluster))
            new_centers.append(center)

        # Stop once the centers no longer change significantly
        if max(distance(new, old) for new, old in zip(new_centers, centers)) < 0.0001:
            break
        centers = new_centers

    return clusters, centers

In the loop, an empty list is first created for each of the k clusters to collect the points that belong to it.

Then, for each data point, the distance to each centroid is calculated using a helper distance function. The index of the smallest distance is found, which represents the nearest centroid. That data point is then added to the cluster represented by that centroid.

Once all the points are assigned to their nearest centroid and all the clusters are filled, the centroids are updated. Each new centroid is recalculated as the mean (average) of all the points in its respective cluster.
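
To see these two steps in isolation, here is a minimal standalone sketch using illustrative values (the centers and cluster below are placeholders, not the ones the algorithm computes):

Python
# Assignment step: find the nearest of two illustrative centers
demo_centers = [(2, 3), (7, 5)]
point = (5, 3.4)
dists = [distance(point, c) for c in demo_centers]
print(dists.index(min(dists)))  # 1 → (5, 3.4) is closer to (7, 5)

# Update step: the new centroid is the mean of a cluster's points
cluster = [(2, 3), (3, 4), (2, 3.5)]
new_center = (sum(p[0] for p in cluster) / len(cluster),
              sum(p[1] for p in cluster) / len(cluster))
print(new_center)  # (2.3333333333333335, 3.5)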

Implementing k-Means: Run

It's finally time to run our k-Means clustering algorithm and witness it perform its magic!

Python
clusters, centers = k_means(data, centers, k)

# Let's print the cluster centers
for i, center in enumerate(centers):
    print(f"Cluster{i+1} center is : {center}")
# Cluster1 center is : (2.66, 2.98)
# Cluster2 center is : (7.0, 5.0)

# Let's print the clusters
for i, cluster in enumerate(clusters):
    print(f"Cluster{i+1} points are : {cluster}")
# Cluster1 points are : [(2, 3), (5, 3.4), (1.3, 1), (3, 4), (2, 3.5)]
# Cluster2 points are : [(7, 5)]

Our output consists of the cluster centers and the clusters themselves. Cluster centers, or 'means,' are the centroids representing the central point of each cluster, and each point in our dataset is assigned to the cluster whose center is closest. Note that because the initial centers are sampled randomly, the exact output may vary between runs. While versatile, k-Means performs best when clusters have roughly equal density.
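
As mentioned earlier, a new data point can be classified by finding the cluster center it is closest to. Here is a minimal sketch; the point (6, 4.5) is just an illustrative value:

Python
# Classify a new point by its nearest cluster center
new_point = (6, 4.5)  # hypothetical new observation
dists = [distance(new_point, center) for center in centers]
nearest = dists.index(min(dists))
print(f"The new point belongs to Cluster{nearest+1}")
# e.g., Cluster2 for the centers shown above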

Implementing k-Means: Visualization

Let's see the obtained clusters on the plot using the following code:

Python
import matplotlib.pyplot as plt

colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()

# Plot the points of each cluster in its own color
for i, cluster in enumerate(clusters):
    for point in cluster:
        ax.scatter(*point, color=colors[i])

# Plot the centers as black crosses
for center in centers:
    ax.scatter(*center, color='black', marker='x', s=300)

ax.set_title('Clusters and their centers')
plt.show()

In the resulting plot, crosses represent the cluster centers, and points of different colors belong to different clusters.

We have set the number of clusters equal to 2; hence, the algorithm separated the data into two clusters.

Lesson Summary and Practice

Congratulations on successfully navigating the core aspects of clustering and implementing the k-Means algorithm! Moving forward, practice exercises are available to help solidify these concepts. I look forward to seeing you in the next lesson!
