Before moving on to the practical application, let's refresh our memory with a recap of K-means clustering. K-means clustering is a core method in unsupervised learning. Its main principle is quite simple: it partitions the data into a chosen number of clusters by assigning each point to its nearest cluster center, aiming to minimize the within-cluster variance, also known as inertia.
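To make that idea concrete, here is a minimal sketch of a single K-means (Lloyd) iteration in NumPy. The toy data, the value of `k`, and the naive initialization are made up purely for illustration; real implementations such as scikit-learn's add smarter initialization and convergence checks:

```python
import numpy as np

# Toy data: six two-dimensional points (made up for illustration)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
k = 2
centroids = X[:k].copy()  # naive initialization: first k points

# 1. Assignment step: attach each point to its nearest centroid
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# 2. Update step: move each centroid to the mean of its assigned points
centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Repeating these two steps until assignments stop changing minimizes inertia:
# the sum of squared distances from each point to its cluster centroid
inertia = ((X - centroids[labels]) ** 2).sum()
print(labels, inertia)
```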
We will now apply K-means clustering to a well-known dataset: the Iris dataset.
The Iris dataset, as we've discussed in previous lessons, consists of four measurements (sepal length, sepal width, petal length, and petal width) taken from 150 iris flowers across three distinct species. Imagine being a botanist searching for a systematic way to categorize new iris flowers based on these features. Doing so manually would be burdensome; hence, resorting to machine learning, specifically K-means clustering, becomes a logical choice!
Let's load this dataset using the `sklearn` library in Python and convert it into a pandas DataFrame:
```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.head()
```
We're now going to implement K-means clustering on the Iris dataset. For this, we'll use the `KMeans` class from sklearn's `cluster` module. To keep our initial implementation straightforward, let's focus on just two dataset features: sepal length and sepal width.
```python
from sklearn.cluster import KMeans

# Assigning the features for our model
X = iris_df.iloc[:, [0, 1]].values

# Defining the KMeans clustering model
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
```
In the above block:

- `n_clusters`: stipulates the number of clusters to form.
- `init`: sets the method for initialization. The `"k-means++"` method selects initial cluster centers intelligently to speed up convergence.
- `max_iter`: limits the maximum number of iterations for a single run.
- `n_init`: specifies the number of times the algorithm runs with different centroid seeds; the run with the lowest inertia is kept.
- `tol`: the tolerance with regard to inertia required to declare convergence. We did not include this, so the default tolerance is used.
- `algorithm`: specifies which variant of the algorithm to use. The classical EM-style algorithm is `"full"` (renamed `"lloyd"` in newer scikit-learn versions); the Elkan variant can be more efficient but is not available for sparse data. To keep things simple, we chose not to specify this option.
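Before plotting, it can be instructive to peek at what the fitted model learned. This optional sanity check uses standard scikit-learn attributes (`inertia_`, `n_iter_`, and `cluster_centers_`); the exact values you see will depend on the data and parameters above:

```python
# Optional sanity check on the fitted model
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)
print("Iterations until convergence (best run):", kmeans.n_iter_)
print("Cluster centers:\n", kmeans.cluster_centers_)
```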
Next, let's visualize our data points and their respective clusters using `matplotlib`, a powerful plotting library in Python. This visualization will help us better understand our K-means clustering and evaluate how well it performs:
```python
import matplotlib.pyplot as plt

# Note: K-means labels (0, 1, 2) are arbitrary; the species names below are illustrative
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Iris-setosa')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Iris-versicolour')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Iris-virginica')

# Plotting the centroids of the clusters as yellow stars
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', marker='*', label='Centroids')

plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.legend()
plt.show()
```
This code plots the data points, coloring each one according to the cluster it belongs to. The centroids, the centers of each cluster, are also plotted as yellow stars. A visual inspection shows that the clustering broadly separates the three iris species, although, with only sepal measurements, Iris-versicolour and Iris-virginica overlap somewhat.
An integral part of implementing K-means clustering is evaluating the effectiveness of the formed clusters. The `silhouette_score` provides a quantifiable way to accomplish this. It measures how similar each data point is to its own cluster compared with the nearest neighboring cluster. The score ranges from -1 to +1; a high value indicates that a data point fits well within its own cluster and is poorly matched to neighboring clusters.
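Concretely, for a single point i, the silhouette value is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster; the `silhouette_score` function reports the average of s(i) over all points.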
Let's calculate this score using the following code:
```python
from sklearn.metrics import silhouette_score

score = silhouette_score(X, y_kmeans)
print(f'Silhouette Score(n=3): {score}')
# Silhouette Score(n=3): 0.4450525692083638
```
The `silhouette_score` can guide us in understanding how well our data points have been clustered.
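Because the score is comparable across models, one common (if informal) way to use it is to fit K-means for several candidate values of `n_clusters` and prefer the value with the highest silhouette score. Here is a short sketch of that idea, reusing the `X` defined earlier; the exact scores it prints will depend on the features chosen:

```python
# Compare silhouette scores across several candidate cluster counts
for k in range(2, 6):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels = km.fit_predict(X)
    print(f'n_clusters={k}: silhouette = {silhouette_score(X, labels):.4f}')
```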
Congratulations! We've successfully delved into both the theoretical and practical aspects of K-means clustering, using the Iris dataset for practical understanding. We have visualized our clustering and evaluated it using the `silhouette_score`. We've made great strides toward mastering unsupervised learning, and we will revisit this theme as we dive deeper into this exciting field.
Are you ready to get your hands dirty with some practical exercises? Practice is key to reinforcing what you've learned and helping you uncover potential challenges and gaps in your understanding. So, gear up for the hands-on activities. Enjoy your journey into the captivating world of K-means clustering. Happy coding!