Lesson 3

Mastering Principal Component Analysis with Scikit-learn

Introduction

Welcome to this lesson on Principal Component Analysis (PCA), a powerful technique widely used in data analysis and machine learning to reduce high-dimensional data to fewer dimensions while retaining most of the relevant information. In this lesson, we'll see how to prepare our data, how to apply PCA using Scikit-learn, how to interpret the proportion of variance explained by each principal component (the explained variance ratio), and finally, how to visualize the results of our PCA.

Preparing the Data

Before moving forward, let's apply what we've learned about standardization to a sample dataset:

Python

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Define the dataset
data = {
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72]
}
df = pd.DataFrame(data)

sc = StandardScaler()
df_scaled = pd.DataFrame(sc.fit_transform(df), columns=df.columns)
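As a quick sanity check, here is a minimal, self-contained sketch (rebuilding the same DataFrame) confirming that each standardized column ends up with mean near 0 and unit variance — note that StandardScaler divides by the population standard deviation (ddof=0):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = {
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72]
}
df = pd.DataFrame(data)
df_scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Each standardized column has mean ~0 and population std ~1
print(df_scaled.mean().abs().max() < 1e-9)              # True
print((df_scaled.std(ddof=0) - 1).abs().max() < 1e-9)   # True
```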
PCA with Scikit-learn

Next, we apply PCA, a technique that first computes the covariance matrix of the data and then finds its eigenvectors and eigenvalues. The eigenvectors corresponding to the n largest eigenvalues are used to project the data into an n-dimensional subspace.

Python

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)
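Under the hood, this matches the covariance-and-eigenvector procedure described above. The following self-contained sketch (using NumPy, and recreating the standardized data) reproduces Scikit-learn's projection up to the sign of each component — sign is arbitrary for eigenvectors, so we compare absolute values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([
    [150, 68, 172.72], [160, 72, 182.88], [155, 66, 167.64],
    [165, 69, 175.26], [170, 71, 180.34], [160, 65, 165.10],
    [158, 67, 170.18], [175, 70, 177.80], [180, 73, 185.42],
    [170, 68, 172.72],
])
X = StandardScaler().fit_transform(X)

# Covariance matrix of the (already centered) data
cov = np.cov(X, rowvar=False)

# Eigendecomposition; sort eigenvectors by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]

# Project the data onto the top-2 eigenvectors
manual = X @ top2

# Scikit-learn's result agrees up to the sign of each component
sk = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(manual), np.abs(sk)))  # True
```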
The Importance of Explained Variance Ratio

An informative aspect of PCA is the explained variance ratio, which gives the proportion of the data's variance that lies along each principal component. This information is key because it tells us how much information we would lose if we discarded the less important components and kept only those contributing most to the variance.

Python

print("Explained Variance: ", pca.explained_variance_ratio_)

The output will be [0.84009963 0.15990037]. This means the first principal component explains about 84% of the variance, while the second explains about 16%. The first principal component is therefore the most important one, as it captures the majority of the variance in the data.
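Scikit-learn can also choose the number of components for you: passing a float between 0 and 1 as n_components keeps the fewest components needed to reach that fraction of explained variance. Here is a short self-contained sketch on the same dataset (the 80% threshold is just an illustrative choice):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = {
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72]
}
df_scaled = StandardScaler().fit_transform(pd.DataFrame(data))

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance reaches that threshold (here 80%)
pca = PCA(n_components=0.80)
pca.fit(df_scaled)

print(pca.n_components_)                         # 1, since PC1 alone explains ~84%
print(np.cumsum(pca.explained_variance_ratio_))
```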

Visualizing the Results of PCA

Finally, let's visualize our PCA results via a scatter plot using Matplotlib. Such visualization can be incredibly insightful when dealing with high-dimensional data, letting us see the structure of our data in lower-dimensional space.

Python

import matplotlib.pyplot as plt

plt.scatter(df_pca[:, 0], df_pca[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Measurement Dataset')
plt.grid()
plt.show()

We will see a scatter plot showing the data projected onto the first two principal components.


Lesson Summary and Practice

Congratulations! You've learned how to prepare data for PCA, how to perform PCA with Scikit-learn, how to interpret the explained variance ratio, and how to visualize PCA results. Try different datasets and experiment with different numbers of principal components to see how these choices affect the explained variance and the plot. Happy coding!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.