Welcome to this lesson on Principal Component Analysis (PCA), a technique widely used in data analysis and machine learning to reduce high-dimensional data to fewer dimensions, simplifying the dataset while retaining most of the relevant information. In this lesson, we'll look at how to prepare our data, how to apply PCA using Scikit-learn, how to interpret the percentage of variance explained by each principal component (the explained variance ratio), and finally, how to visualize the results of our PCA.
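Before applying PCA, let's standardize a small dataset so that every feature has zero mean and unit variance. PCA is sensitive to scale, so features measured in larger units would otherwise dominate the result: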
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Define the dataset
data = {
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72]
}
df = pd.DataFrame(data)

# Standardize each column, keeping the original column names
sc = StandardScaler()
df_scaled = pd.DataFrame(sc.fit_transform(df), columns=df.columns)
```
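As a quick sanity check, the sketch below confirms that the scaled columns have a mean of roughly 0 and a standard deviation of roughly 1, and that the result matches the formula (x - mean) / std computed by hand. It assumes the same dataset as above; the names `scaled`, `manual`, `col_means`, and `col_stds` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72],
})

scaled = StandardScaler().fit_transform(df)

# Manual standardization. StandardScaler divides by the population
# standard deviation (ddof=0), whereas pandas defaults to ddof=1,
# so ddof=0 is passed explicitly here.
manual = (df - df.mean()) / df.std(ddof=0)

col_means = scaled.mean(axis=0)  # expected to be close to 0
col_stds = scaled.std(axis=0)    # NumPy's default is ddof=0; expected close to 1

print("Column means:", col_means)
print("Column stds: ", col_stds)
print("Matches manual formula:", np.allclose(scaled, manual.to_numpy()))
```

The means will typically print as tiny floating-point values rather than exact zeros, which is expected.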
Next, we apply PCA. Conceptually, the technique first computes the covariance matrix of the data and then finds that matrix's eigenvectors and eigenvalues. The eigenvectors corresponding to the largest eigenvalues are then used to project the data into a k-dimensional subspace, where k is the number of components we choose to keep.
```python
from sklearn.decomposition import PCA

# Keep the two components that capture the most variance
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)
```
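To make the covariance-and-eigenvector description concrete, here is a minimal NumPy-only sketch of the same steps done by hand. It is an illustration of the idea rather than a reproduction of Scikit-learn's internals, and the names `X`, `Z`, `cov`, `k`, `projected`, and `explained_ratio` are assumptions introduced for this example.

```python
import numpy as np

# Same measurements as the lesson's dataset: weight (lbs), height (in), height (cm)
X = np.array([
    [150, 68, 172.72], [160, 72, 182.88], [155, 66, 167.64],
    [165, 69, 175.26], [170, 71, 180.34], [160, 65, 165.10],
    [158, 67, 170.18], [175, 70, 177.80], [180, 73, 185.42],
    [170, 68, 172.72],
], dtype=float)

# Step 1: standardize each column (population std, ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features (columns are variables)
cov = np.cov(Z, rowvar=False)

# Step 3: eigen-decomposition. eigh is meant for symmetric matrices
# and returns eigenvalues in ascending order, so we re-sort descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the top-k eigenvectors
k = 2
projected = Z @ eigvecs[:, :k]

# Share of the total variance carried by each direction
explained_ratio = eigvals / eigvals.sum()

print(projected.shape)       # (10, 2)
print(explained_ratio[:k])
```

The projected coordinates may differ in sign from what Scikit-learn returns, since an eigenvector and its negation describe the same direction. The variance shares are unaffected by that choice.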
A useful output of PCA is the explained variance ratio: the proportion of the data's variance that lies along the direction of each principal component. It tells us how much information we would lose if we dropped the less important components and kept only the ones contributing most to the variance.
```python
print("Explained Variance: ", pca.explained_variance_ratio_)
```
The output will be [0.84009963 0.15990037]. This means that the first principal component explains about 84% of the variance, while the second explains about 16%. The first principal component is therefore the most important one, as it captures the majority of the variance in the data.
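One common way to use these ratios is to accumulate them and see how many components are needed to reach a target share of the variance. The sketch below fits PCA with all components on the same dataset. The 95% `threshold` and the names `pca_full`, `ratios`, `cumulative`, and `n_needed` are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([
    [150, 68, 172.72], [160, 72, 182.88], [155, 66, 167.64],
    [165, 69, 175.26], [170, 71, 180.34], [160, 65, 165.10],
    [158, 67, 170.18], [175, 70, 177.80], [180, 73, 185.42],
    [170, 68, 172.72],
], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

# With n_components left at its default, PCA keeps every component
pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_

# Running total of variance explained by the first 1, 2, 3 components
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95  # hypothetical target, adjust to taste
n_needed = int(np.argmax(cumulative >= threshold) + 1)

print("Ratios:    ", ratios)
print("Cumulative:", cumulative)
print("Components needed:", n_needed)
```

In this dataset, Height (cm) is exactly 2.54 times Height (inches), so the two columns are identical after standardization. The third component should therefore carry essentially no variance, which is why two components are enough here.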
Finally, let's visualize our PCA results via a scatter plot using Matplotlib. Such visualization can be incredibly insightful when dealing with high-dimensional data, letting us see the structure of our data in lower-dimensional space.
```python
import matplotlib.pyplot as plt

# Plot each sample by its coordinates on the first two principal components
plt.scatter(df_pca[:, 0], df_pca[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Measurement Dataset')
plt.grid()
plt.show()
```
The result is a scatter plot showing the data projected onto the first two principal components.
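To help interpret the axes of such a plot, it can be useful to look at how each original feature contributes to each component. The sketch below reads these weights from the fitted model's `components_` attribute. The `loadings` table and the 'PC1'/'PC2' row labels are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72],
})

scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2).fit(scaled)

# Each row of components_ is one principal axis, expressed as
# weights on the original (standardized) features.
loadings = pd.DataFrame(pca.components_, columns=df.columns, index=['PC1', 'PC2'])

print(loadings.round(3))
```

Because the two height columns carry the same information, they should receive matching weights in each component.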
Congratulations! You've learned how to prepare data for PCA, perform PCA with Scikit-learn, interpret the explained variance ratio, and visualize PCA results. Try these steps on different datasets and experiment with different numbers of principal components to see how those choices affect the explained variance and the plot. Happy coding!