Let's dive into Principal Component Analysis (PCA), a technique often used in machine learning to simplify complex data while keeping important details. PCA transforms a dataset with many correlated features into a smaller set of uncorrelated features called principal components. Think of it like organizing a messy room and putting everything into clear, separate bins.
We can start exploring PCA by creating our own small dataset. For this lesson, we'll make a 3D (three-dimensional) dataset of 200 points:
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

np.random.seed(0)
# Creating 200-point 3D dataset
X = np.dot(np.random.random(size=(3, 3)), np.random.normal(size=(3, 200))).T

# Plotting the dataset
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
plt.title("Scatter Plot of Original Dataset")
plt.show()
```
Before applying PCA, we need to bring all features of our dataset onto a common scale so that no single feature dominates just because its values are larger. This just means making sure every feature's average value is 0 and its spread (standard deviation) is 1:
```python
# Calculate the mean and the standard deviation of each feature
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)

# Standardize the dataset
X = (X - X_mean) / X_std
```
The above code calculates each feature's average (`np.mean`) and spread (`np.std`) and then rescales every point accordingly.
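If you'd like to confirm the standardization worked, here is a small optional sanity check (not part of the lesson's required code); both lines should print `True`:

```python
# After standardization, each feature should have mean ~0 and
# standard deviation ~1 (up to floating-point error)
print(np.allclose(X.mean(axis=0), 0))  # True
print(np.allclose(X.std(axis=0), 1))   # True
```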
The next step is to calculate the covariance matrix. This is just a fancy math term for a matrix that tells us how much each pair of features varies together:
```python
# Calculate the covariance matrix
cov_matrix = np.cov(X.T)
```
We use `np.cov` to compute the covariance matrix. Note that `np.cov` treats each row as a variable by default, which is why we pass the transposed dataset `X.T`.
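To see what `np.cov` is doing under the hood, here is a short sketch that reproduces the same matrix by hand. It assumes the data has already been standardized, so the feature means are (essentially) zero:

```python
# For zero-mean data, the covariance matrix is X^T X divided by (n - 1),
# where n is the number of samples
n_samples = X.shape[0]
cov_manual = X.T.dot(X) / (n_samples - 1)
print(np.allclose(cov_manual, cov_matrix))  # True
```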
Next, we break our covariance matrix into eigenvectors and eigenvalues. This is like taking a box of Lego and sorting it into different shapes and sizes:
```python
# Break into eigenvectors and eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
```
This gives us two important elements: eigenvalues (which measure how much of the data's spread lies along each direction) and eigenvectors (which give those directions).
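Before deciding how many directions to keep, it can help to see what fraction of the total spread each eigenvalue accounts for. Here is a small optional sketch using the eigenvalues we just computed (`explained_variance_ratio` is just an illustrative variable name):

```python
# Each eigenvalue measures the variance along its eigenvector;
# dividing by the total gives the fraction each direction explains
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```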
Now we pair up the eigenvalues with their corresponding eigenvectors and sort them from largest to smallest:
```python
# Pair each eigenvalue with its eigenvector and sort from largest to smallest
eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:, i]) for i in range(len(eigenvalues))]
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)
```
Sorting in descending order lets us select the top `k` eigenvectors, the ones paired with the `k` largest eigenvalues, as our principal components. Here we keep `k = 2` of the three eigenvectors, stack them into a projection matrix `W`, and use it to transform the dataset:
```python
# Build the projection matrix from the top 2 eigenvectors
W = np.hstack((eigen_pairs[0][1].reshape(3, 1), eigen_pairs[1][1].reshape(3, 1)))
# Project the original dataset onto the new axes
X_pca = X.dot(W)
```
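A quick look at the shapes makes the reduction explicit:

```python
# W has one column per kept component, so it maps 3-D points to 2-D
print(W.shape)      # (3, 2)
print(X_pca.shape)  # (200, 2)
```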
Finally, we can look at our simplified dataset and appreciate how PCA made it easier to understand:
```python
plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("Scatter Plot of Transformed Dataset Using PCA")
plt.show()
```
This shows that we reduced our data from a three-dimensional form to a two-dimensional form while keeping most of the variation in the original data.
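If you have scikit-learn installed, you can optionally cross-check our from-scratch result against its built-in PCA. This is just a sketch for verification; the two results should agree up to a possible sign flip in each component, since eigenvector signs are arbitrary:

```python
# Optional cross-check against scikit-learn (assumes scikit-learn is installed)
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X)

# Compare absolute values because individual components may be sign-flipped
print(np.allclose(np.abs(X_sklearn), np.abs(X_pca)))  # should typically print True
```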
Well done! You've just learned about Principal Component Analysis (PCA), a technique to simplify data without losing important details. Now it's time for you to practice! Remember, practice is the key to grasping any new concept. Keep learning!