Let's dive into Principal Component Analysis (PCA), a technique often used in machine learning to simplify complex data while keeping important details. PCA transforms a dataset with many correlated features into a smaller set of uncorrelated features called principal components. Think of it like organizing a messy room and putting everything into clear, separate bins.
We can start exploring PCA by creating our own small dataset. For this lesson, we'll make a 3D (three-dimensional) dataset of 200 points:
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

np.random.seed(0)
# Creating 200-point 3D dataset
X = np.dot(np.random.random(size=(3, 3)), np.random.normal(size=(3, 200))).T

# Plotting the dataset
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
plt.title("Scatter Plot of Original Dataset")
plt.show()
```
Before applying PCA, we need to bring all features of our dataset onto a common scale so that no single feature dominates just because its values are larger. This just means making sure every feature's average value is 0 and its spread (standard deviation) is 1:
```python
# Calculate the mean and the standard deviation of each feature
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)

# Standardize the dataset
X = (X - X_mean) / X_std
```
The above code calculates each feature's average (`np.mean`) and spread (`np.std`) and then rescales every point accordingly.
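If you'd like to confirm the standardization worked, here is a small optional sanity check (not part of the lesson's required code); both lines should print `True`:

```python
# After standardization, each feature should have mean ~0 and
# standard deviation ~1 (up to floating-point error)
print(np.allclose(X.mean(axis=0), 0))  # True
print(np.allclose(X.std(axis=0), 1))   # True
```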
The next step is to calculate the covariance matrix. This is just a fancy math term for a matrix that tells us how much each pair of features varies together:
```python
# Calculate the covariance matrix
cov_matrix = np.cov(X.T)
```
We use `np.cov` to compute the covariance matrix. Note that `np.cov` treats each row as a variable by default, which is why we pass the transposed dataset `X.T`.
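To see what `np.cov` is doing under the hood, here is a short sketch that reproduces the same matrix by hand. It assumes the data has already been standardized, so the feature means are (essentially) zero:

```python
# For zero-mean data, the covariance matrix is X^T X divided by (n - 1),
# where n is the number of samples
n_samples = X.shape[0]
cov_manual = X.T.dot(X) / (n_samples - 1)
print(np.allclose(cov_manual, cov_matrix))  # True
```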
Next, we break our covariance matrix into eigenvectors and eigenvalues. This is like taking a box of Lego and sorting it into different shapes and sizes:
```python
# Break into eigenvectors and eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
```
This gives us two important elements: eigenvalues (which measure how much of the data's spread lies along each direction) and eigenvectors (which give those directions).
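Before deciding how many directions to keep, it can help to see what fraction of the total spread each eigenvalue accounts for. Here is a small optional sketch using the eigenvalues we just computed (`explained_variance_ratio` is just an illustrative variable name):

```python
# Each eigenvalue measures the variance along its eigenvector;
# dividing by the total gives the fraction each direction explains
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```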
Now we pair up the eigenvalues with their corresponding eigenvectors and sort them from largest to smallest:
```python
# Pair each eigenvalue with its eigenvector and sort from largest to smallest
eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:, i]) for i in range(len(eigenvalues))]
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)
```
Sorting in descending order lets us select the top `k` eigenvectors, the ones paired with the `k` largest eigenvalues, as our principal components. Here we keep `k = 2` of the three eigenvectors, stack them into a projection matrix `W`, and use it to transform the dataset:
```python
# Build the projection matrix from the top 2 eigenvectors
W = np.hstack((eigen_pairs[0][1].reshape(3, 1), eigen_pairs[1][1].reshape(3, 1)))
# Project the original dataset onto the new axes
X_pca = X.dot(W)
```
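A quick look at the shapes makes the reduction explicit:

```python
# W has one column per kept component, so it maps 3-D points to 2-D
print(W.shape)      # (3, 2)
print(X_pca.shape)  # (200, 2)
```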
Finally, we can look at our simplified dataset and appreciate how PCA made it easier to understand:
```python
plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("Scatter Plot of Transformed Dataset Using PCA")
plt.show()
```
This shows that we reduced our data from a three-dimensional form to a two-dimensional form while keeping most of the variation in the original data.
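If you have scikit-learn installed, you can optionally cross-check our from-scratch result against its built-in PCA. This is just a sketch for verification; the two results should agree up to a possible sign flip in each component, since eigenvector signs are arbitrary:

```python
# Optional cross-check against scikit-learn (assumes scikit-learn is installed)
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X)

# Compare absolute values because individual components may be sign-flipped
print(np.allclose(np.abs(X_sklearn), np.abs(X_pca)))  # should typically print True
```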
Well done! You've just learned about Principal Component Analysis (PCA), a technique to simplify data without losing important details. Now it's time for you to practice! Remember, practice is the key to grasping any new concept. Keep learning!