Welcome to our journey into high-dimensional data and the challenges it presents. We'll focus on Principal Component Analysis (PCA), one of the most widely used methods for dimensionality reduction, and implement it in Python while working through a real-world example.
High-dimensional data refers to a dataset with a large number of features or attributes. One good example of high-dimensional data that would benefit from Principal Component Analysis (PCA) is a dataset from a customer survey.
This dataset may have many different features (dimensions), including age, income, frequency of shopping, amount spent per shopping trip, preferred shopping time, location, and scores on several opinion and satisfaction questions, like product variety, staff helpfulness, store cleanliness, etc.
If many of these features contribute roughly equally to the variance in the dataset, or if they are correlated with one another, it can be hard to visualize the data or draw useful conclusions from it directly. By using PCA, we can reduce the dimensionality of the dataset without significant loss of information and identify the directions (principal components) that explain the most variance among customers.
Let's take a look at an example where we wish to model the relationship between height and weight. Our dataset has 3 features, but we want to reduce it to only 2.
In our example, we examine a dataset recording individuals' weights along with their heights in two different units: inches and centimeters. This redundancy inflates the dimensionality of our dataset without adding any new information.
```python
import pandas as pd

data = {
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72]
}
df = pd.DataFrame(data)
```
Plotting height in inches against height in centimeters reveals the redundancy: the data points fall on a straight line.
```python
import matplotlib.pyplot as plt

# Creating 2D scatter plot
plt.scatter(df['Height (inches)'], df['Height (cm)'])

# Setting labels
plt.xlabel('Height (inches)')
plt.ylabel('Height (cm)')
plt.title('Scatter Plot of Heights (inches vs cm)')

# Show plot
plt.grid(True)
plt.show()
```
High-dimensional data poses numerous challenges: the curse of dimensionality makes the data sparse, which in turn encourages overfitting, and with too many features a model can perform poorly. This is where PCA comes in; in our example, it will remove the redundancy between the two height features.
PCA is a technique that captures the dataset's primary patterns. It seeks the directions of maximum variability in the data and projects the data onto a new subspace with fewer dimensions.
The steps of PCA are:
Standardizing the data: This process adjusts the variables to have a mean of 0 and a standard deviation of 1.
Computing the covariance matrix: The covariance matrix represents the covariance between all pairs of features. Covariance is a measure of how much two random variables vary together.
Obtaining the eigenvalues and eigenvectors of the covariance matrix: The eigenvectors (principal components) represent the directions of the new feature space and the eigenvalues explain the variance of the data along these new feature axes.
Sorting eigenvalues and selecting eigenvectors: The eigenvectors corresponding to the largest eigenvalues capture the most variance and are kept as the principal components.
Projecting the data onto the new subspace: This leads to the transformation of the original dataset to a reduced dimensional dataset.
Let's define some functions to standardize our data and compute the covariance matrix. Don't worry about the implementations for now. We will cover them in the next lesson.
```python
import numpy as np

def standardize(X):
    # Center each feature at 0 and scale it to unit standard deviation
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

def compute_covariance_matrix(X):
    # Sample covariance matrix of the (already standardized) data
    return np.dot(X.T, X) / (X.shape[0] - 1)
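As a quick optional aside (not part of the lesson code, just a sanity test), we can confirm that compute_covariance_matrix agrees with NumPy's built-in np.cov on our standardized height data:

```python
# Optional sanity check: our covariance matrix should match NumPy's np.cov
X_check = standardize(df[['Height (inches)', 'Height (cm)']].to_numpy())

ours = compute_covariance_matrix(X_check)
builtin = np.cov(X_check, rowvar=False)  # rowvar=False treats columns as variables

print(np.allclose(ours, builtin))  # expected: True
```

With these helpers in place, we can assemble the full PCA function: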
```python
def PCA(X, num_components):
    X = standardize(X)
    covariance_matrix = compute_covariance_matrix(X)
    eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

    # Sort eigenvalues (and their eigenvectors) from largest to smallest
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx][:num_components]
    eigenvectors = eigenvectors[:, idx][:, :num_components]  # keep the first num_components eigenvectors
    return eigenvalues, eigenvectors
```
Let's break down each step. First, we standardize the data and compute the covariance matrix. We then obtain the eigenvalues and eigenvectors using the np.linalg.eig function.
Then, we sort the eigenvalues, since higher eigenvalues indicate directions of higher variability in our data. We take the num_components largest eigenvalues and their corresponding eigenvectors; these eigenvectors are the principal components. Note that num_components is the number of dimensions we want to reduce the dataset to.
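If the indexing in that last step looks dense, here is a tiny, purely illustrative example (the eigenvalues and eigenvectors below are made up) showing how the sorting and slicing pick out the top components:

```python
# Hypothetical values, just to illustrate the sorting-and-selection step
vals = np.array([0.2, 2.5, 0.8])
vecs = np.eye(3)                 # pretend each column is an eigenvector

idx = np.argsort(vals)[::-1]     # indices sorted by eigenvalue, largest first -> [1, 2, 0]
top_vals = vals[idx][:2]         # two largest eigenvalues -> [2.5, 0.8]
top_vecs = vecs[:, idx][:, :2]   # the matching eigenvector columns (columns 1 and 2)

print(top_vals)
print(top_vecs)
```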
Let's use PCA to combine the two height features into a single principal component.
```python
X = df[['Height (inches)', 'Height (cm)']].to_numpy()
eigenvalues, eigenvectors = PCA(X, 1)

# Project the standardized data onto the principal component
X_std = standardize(X)
X_pca = np.dot(X_std, eigenvectors)

# Plotting original data
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], color='b')
plt.xlabel('Height (inches)')
plt.ylabel('Height (cm)')
plt.title('Original Data')
plt.grid(True)

# Plotting data after PCA
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], [0] * len(X_pca), color='r')
plt.xlabel('Principal Component 1')
plt.title('Data after PCA')
plt.grid(True)

plt.show()
```
We've successfully merged the two height features into a single principal component while retaining the essential information.
Now that we've seen how to reduce 2-dimensional data to 1-dimensional data, let's reduce our original 3-dimensional data to 2 dimensions.
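One optional way to convince ourselves of that (an extra check, not part of the original example) is to project the single component back onto the two original axes and compare the result with the standardized data. Because the two height columns are perfectly correlated, the reconstruction should be essentially exact:

```python
# Optional check: reconstruct the standardized heights from the single component
X_std = standardize(X)
X_reconstructed = np.dot(X_pca, eigenvectors.T)  # back-project onto the original two axes

print(np.allclose(X_std, X_reconstructed))  # expected: True
```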
```python
X = df[['Height (inches)', 'Height (cm)', 'Weight (lbs)']].to_numpy()
eigenvalues, eigenvectors = PCA(X, 2)

# Project the standardized data onto the two principal components
X_std = standardize(X)
X_pca = np.dot(X_std, eigenvectors)

plt.scatter(X_pca[:, 0], X_pca[:, 1])

# Setting labels
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Scatter Plot of 2 Principal Components')

# Show plot
plt.grid(True)
plt.show()
```
The plot shows our 3-dimensional data successfully compressed into two principal components.
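As an optional follow-up (not part of the lesson code), we can quantify how much information those two components retain by requesting all three eigenvalues and comparing the top two to their total:

```python
# Optional: what fraction of the total variance do the first two components explain?
all_eigenvalues, _ = PCA(X, 3)
explained = all_eigenvalues[:2].sum() / all_eigenvalues.sum()
print(f"Variance explained by the first 2 components: {explained:.2%}")
# This should be essentially 100% here, because the two height columns carry the same information.
```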
Today, we unraveled high-dimensional data and its challenges, discussed the importance of PCA, and implemented PCA in Python. Coming up, we've prepared practice exercises to bolster your understanding and expertise. Let's dive deeper into the PCA cosmos! Happy coding!