Welcome to our journey into high-dimensional data and the challenges it presents. We'll focus on Principal Component Analysis (PCA), one of the most widely used methods for dimensionality reduction, and implement it in Python while working through a real-world example.
High-dimensional data refers to a dataset with a large number of features or attributes. One good example of high-dimensional data that would benefit from Principal Component Analysis (PCA) is a dataset from a customer survey.
This dataset may have many different features (dimensions), including age, income, frequency of shopping, amount spent per shopping trip, preferred shopping time, location, and scores on several opinion and satisfaction questions, like product variety, staff helpfulness, store cleanliness, etc.
If many of these features contribute roughly equally to the variance in the dataset, or if they are correlated with one another, it can be hard to visualize the data or draw useful conclusions from it directly. By using PCA, we can reduce the dimensionality of the dataset without significant loss of information and identify the directions (principal components) that explain the most variance among customers.
Let's take a look at an example where we wish to model the relationship between height and weight. Our dataset has 3 features, but we want to reduce it to only 2.
In our example, we examine a dataset recording individuals' weights along with their heights in two different units: inches and centimeters. This redundancy inflates the dimensionality of our dataset without adding any new information.
```python
import pandas as pd

data = {
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72]
}
df = pd.DataFrame(data)
```
Plotting height in inches against height in centimeters reveals the redundancy: the data points fall on a straight line.
```python
import matplotlib.pyplot as plt

# Creating 2D scatter plot
plt.scatter(df['Height (inches)'], df['Height (cm)'])

# Setting labels
plt.xlabel('Height (inches)')
plt.ylabel('Height (cm)')
plt.title('Scatter Plot of Heights (inches vs cm)')

# Show plot
plt.grid(True)
plt.show()
```
High-dimensional data poses numerous challenges: the curse of dimensionality makes the data sparse, which in turn encourages overfitting, and with too many features a model can perform poorly. This is where PCA comes in; in our example, it will remove the redundancy between the two height features.
PCA is a technique that captures the dataset's primary patterns. It seeks the directions of maximum variability in the data and projects the data onto a new subspace with fewer dimensions.
The steps of PCA are:
Standardizing the data: This process adjusts the variables to have a mean of 0 and a standard deviation of 1.
Computing the covariance matrix: The covariance matrix represents the covariance between all pairs of features. Covariance is a measure of how much two random variables vary together.
Obtaining the eigenvalues and eigenvectors of the covariance matrix: The eigenvectors (principal components) represent the directions of the new feature space and the eigenvalues explain the variance of the data along these new feature axes.
Sorting eigenvalues and selecting eigenvectors: The eigenvectors corresponding to the largest eigenvalues capture the most variance and are kept as the principal components.
Projecting the data onto the new subspace: This leads to the transformation of the original dataset to a reduced dimensional dataset.
Let's define some functions to standardize our data and compute the covariance matrix. Don't worry about the implementations for now. We will cover them in the next lesson.
```python
import numpy as np

def standardize(X):
    # Center each feature at 0 and scale it to unit standard deviation
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

def compute_covariance_matrix(X):
    # Sample covariance matrix of the (already standardized) data
    return np.dot(X.T, X) / (X.shape[0] - 1)
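As a quick optional aside (not part of the lesson code, just a sanity test), we can confirm that compute_covariance_matrix agrees with NumPy's built-in np.cov on our standardized height data:

```python
# Optional sanity check: our covariance matrix should match NumPy's np.cov
X_check = standardize(df[['Height (inches)', 'Height (cm)']].to_numpy())

ours = compute_covariance_matrix(X_check)
builtin = np.cov(X_check, rowvar=False)  # rowvar=False treats columns as variables

print(np.allclose(ours, builtin))  # expected: True
```

With these helpers in place, we can assemble the full PCA function: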
```python
def PCA(X, num_components):
    X = standardize(X)
    covariance_matrix = compute_covariance_matrix(X)
    eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

    # Sort eigenvalues (and their eigenvectors) from largest to smallest
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx][:num_components]
    eigenvectors = eigenvectors[:, idx][:, :num_components]  # keep the first num_components eigenvectors
    return eigenvalues, eigenvectors
```
Let's break down each step. First, we standardize the data and compute the covariance matrix. We then obtain the eigenvalues and eigenvectors using the np.linalg.eig function.
Then, we sort the eigenvalues, since higher eigenvalues indicate directions of higher variability in our data. We take the num_components largest eigenvalues and their corresponding eigenvectors; these eigenvectors are the principal components. Note that num_components is the number of dimensions we want to reduce the dataset to.
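If the indexing in that last step looks dense, here is a tiny, purely illustrative example (the eigenvalues and eigenvectors below are made up) showing how the sorting and slicing pick out the top components:

```python
# Hypothetical values, just to illustrate the sorting-and-selection step
vals = np.array([0.2, 2.5, 0.8])
vecs = np.eye(3)                 # pretend each column is an eigenvector

idx = np.argsort(vals)[::-1]     # indices sorted by eigenvalue, largest first -> [1, 2, 0]
top_vals = vals[idx][:2]         # two largest eigenvalues -> [2.5, 0.8]
top_vecs = vecs[:, idx][:, :2]   # the matching eigenvector columns (columns 1 and 2)

print(top_vals)
print(top_vecs)
```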
Let's use PCA to combine the two height features into a single principal component.
```python
X = df[['Height (inches)', 'Height (cm)']].to_numpy()
eigenvalues, eigenvectors = PCA(X, 1)

# Project the standardized data onto the principal component
X_std = standardize(X)
X_pca = np.dot(X_std, eigenvectors)

# Plotting original data
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], color='b')
plt.xlabel('Height (inches)')
plt.ylabel('Height (cm)')
plt.title('Original Data')
plt.grid(True)

# Plotting data after PCA
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], [0] * len(X_pca), color='r')
plt.xlabel('Principal Component 1')
plt.title('Data after PCA')
plt.grid(True)

plt.show()
```
We've successfully merged the two height features into a single principal component while retaining the essential information.
Now that we've seen how to reduce 2-dimensional data to 1-dimensional data, let's reduce our original 3-dimensional data to 2 dimensions.
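One optional way to convince ourselves of that (an extra check, not part of the original example) is to project the single component back onto the two original axes and compare the result with the standardized data. Because the two height columns are perfectly correlated, the reconstruction should be essentially exact:

```python
# Optional check: reconstruct the standardized heights from the single component
X_std = standardize(X)
X_reconstructed = np.dot(X_pca, eigenvectors.T)  # back-project onto the original two axes

print(np.allclose(X_std, X_reconstructed))  # expected: True
```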
```python
X = df[['Height (inches)', 'Height (cm)', 'Weight (lbs)']].to_numpy()
eigenvalues, eigenvectors = PCA(X, 2)

# Project the standardized data onto the two principal components
X_std = standardize(X)
X_pca = np.dot(X_std, eigenvectors)

plt.scatter(X_pca[:, 0], X_pca[:, 1])

# Setting labels
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Scatter Plot of 2 Principal Components')

# Show plot
plt.grid(True)
plt.show()
```
The plot shows our 3-dimensional data successfully compressed into two principal components.
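As an optional follow-up (not part of the lesson code), we can quantify how much information those two components retain by requesting all three eigenvalues and comparing the top two to their total:

```python
# Optional: what fraction of the total variance do the first two components explain?
all_eigenvalues, _ = PCA(X, 3)
explained = all_eigenvalues[:2].sum() / all_eigenvalues.sum()
print(f"Variance explained by the first 2 components: {explained:.2%}")
# This should be essentially 100% here, because the two height columns carry the same information.
```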
Today, we unraveled high-dimensional data and its challenges, discussed the importance of PCA, and implemented PCA in Python. Coming up, we've prepared practice exercises to bolster your understanding and expertise. Let's dive deeper into the PCA cosmos! Happy coding!