Embark on an exciting journey through the world of Principal Component Analysis (PCA). We will explore the indispensable roles of eigenvalues and eigenvectors in understanding the PCA framework, and dive into computing these mathematical constructs using Python. Our adventure will also cover the essential role of the covariance matrix and how to compute it. Ready? Set? Let's start!
To start, we have a dataset containing different physical measurements: weight (in lbs), height (in inches), and height (in cm). We capture these in a Python dictionary and convert it to a pandas DataFrame for easy manipulation:
```python
import pandas as pd

# Given data
data = {
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72]
}

# Create a DataFrame
df = pd.DataFrame(data)
```
Here, the DataFrame `df` represents our collected dataset.
Before performing Principal Component Analysis, we need to standardize the data. This means rescaling the data so that each feature has a mean of 0 and a standard deviation of 1.
PCA is sensitive to the scale of the features. Features with larger scales will dominate the variance calculations and may bias the results towards those features. Standardizing the data ensures that each feature contributes equally to the analysis, preventing this bias.
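To see how unequal scales show up in practice, here is a quick check (a minimal sketch, assuming the `df` defined above) comparing the raw variances of the three columns. The columns measured on larger numeric scales (weight in lbs, height in cm) have much larger variances than height in inches, even though inches and centimetres describe the same quantity:

```python
# Compare raw (unstandardized) variances; larger-scale features dominate
raw_variances = df.var()
print(raw_variances.sort_values(ascending=False))
```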
To standardize the data, each feature is transformed using the following formula:

$$z = \frac{x - \mu}{\sigma}$$

Where:
- $z$ is the standardized value of the feature.
- $x$ is the original value of the feature.
- $\mu$ is the mean of the feature.
- $\sigma$ is the standard deviation of the feature.
Let's standardize just the two height columns in our dataset:
```python
import numpy as np
import matplotlib.pyplot as plt

def standardize(X):
    # Subtract each column's mean and divide by its standard deviation
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

X = df[['Height (inches)', 'Height (cm)']].to_numpy()
X_standard = standardize(X)

plt.scatter(X_standard[:, 0], X_standard[:, 1], color='b')
plt.title('Standardized Data')
plt.xlabel('Height (inches)')
plt.ylabel('Height (cm)')
plt.grid(True)
plt.show()
```
After standardization, our data is centered and scaled, making the variables directly comparable.
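If you want to confirm the transformation worked, a quick sanity check (a minimal sketch, assuming `X_standard` from above) is to verify that each column now has a mean of approximately 0 and a standard deviation of approximately 1:

```python
# Each column should now have mean ~0 and standard deviation ~1
print("Means:", X_standard.mean(axis=0))  # expected: approximately [0, 0]
print("Stds: ", X_standard.std(axis=0))   # expected: approximately [1, 1]
```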
Before calculating the covariance matrix, let's understand what it signifies and why it's important in PCA.
Covariance measures the extent to which deviations from the mean in one variable tend to move together with deviations in another. In other words, it indicates how one variable changes in relation to another. Covariance between two variables can be positive, implying the variables increase or decrease together, or negative, meaning one variable increases when the other decreases.
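For intuition, the covariance between two variables can be computed directly as the average product of their deviations from their means. Here is a minimal sketch (the helper name `covariance` is just for illustration) that mirrors what `np.cov` does, up to the choice of denominator:

```python
import numpy as np

def covariance(x, y):
    # Average product of deviations from the means (population covariance, divides by n)
    return np.mean((x - np.mean(x)) * (y - np.mean(y)))

heights_in = df['Height (inches)'].to_numpy()
heights_cm = df['Height (cm)'].to_numpy()
print(covariance(heights_in, heights_cm))  # positive: the two heights move together
```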
By identifying the directions of greatest variance in the data, the covariance matrix lays the foundation for PCA. The eigenvectors derived from the covariance matrix form the new axes along which the data is expressed, and the corresponding eigenvalues give the variances along these new axes.
We compute the covariance matrix in Python using NumPy's `cov` function.
```python
import numpy as np

# Compute the covariance matrix. rowvar=False indicates that columns
# represent variables and rows represent observations.
cov_matrix = np.cov(X_standard, rowvar=False)

print("Covariance Matrix:")
print(cov_matrix)
```
The covariance matrix, `cov_matrix`, stores the covariance between every pair of features in our standardized dataset. We will see the following output:
```
[[1.11111111 1.11111111]
 [1.11111111 1.11111111]]
```
The covariance matrix is symmetric, with the diagonal elements representing the variance of each feature and the off-diagonal elements representing the covariance between features.
We can understand the following from the covariance matrix:
- Both variables have a variance of 1.11111111 rather than exactly 1. This is because `np.cov` divides by n − 1 (here 9) while our `standardize` function used `np.std`, which divides by n (here 10), so the diagonal comes out to 10/9 ≈ 1.11 (see the short check after this list). The equal values indicate a similar spread in both variables, which is expected since we standardized them.
- The covariance between the two variables is also 1.11111111, the same as the variances. This signals a perfect positive relationship: the two height columns are perfectly correlated (centimetres are simply inches multiplied by 2.54), so they increase and decrease together in lockstep.
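As a quick illustration of the n versus n − 1 point above, here is a minimal sketch (assuming `X` from earlier) showing that standardizing with the sample standard deviation (`ddof=1`) puts exactly 1.0 on the diagonal of the covariance matrix:

```python
# Standardize using the sample standard deviation (ddof=1) instead of the population one
X_sample_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# np.cov also uses ddof=1, so the diagonal now equals 1.0
print(np.cov(X_sample_std, rowvar=False))
```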
Prior to performing eigendecomposition on our covariance matrix, let's take a brief detour to understand the mathematical meaning of eigenvalues and eigenvectors and the instrumental role they play in PCA.
Eigenvectors and eigenvalues are two fundamental concepts of linear algebra. They are generally associated with linear equations and matrices, which happen to be the bedrock of most machine learning and data science algorithms.
An eigenvector is a non-zero vector whose direction is unchanged by a linear transformation: the transformation only scales it. If we denote our transformation by a matrix A and our vector by v, then v is an eigenvector of A if Av is a scalar multiple of v. That scalar is the eigenvalue (λ). We can express this relationship mathematically as follows:

$$A\mathbf{v} = \lambda \mathbf{v}$$
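To make this concrete, here is a tiny toy example (separate from our height data) that checks the relationship Av = λv numerically for a simple 2 × 2 matrix:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])          # a simple diagonal matrix
v = np.array([0.0, 1.0])            # an eigenvector of A
lam = 3.0                           # its eigenvalue

# A @ v scales v by lam without changing its direction
print(np.allclose(A @ v, lam * v))  # True
```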
Let's introduce eigendecomposition into our process. This technique decomposes a matrix into its constituent eigenvalues and eigenvectors, helping us understand and simplify complex matrix operations, which is crucial in PCA.
Here, we calculate the eigenvalues and eigenvectors of the covariance matrix using `np.linalg.eig`.
```python
# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print("\nEigenvalues:")
print(eigenvalues)

print("\nEigenvectors:")
print(eigenvectors)
```
The `eig` function returns the eigenvalues and their corresponding eigenvectors, which help decipher PCA's underlying structure.
Let's interpret the output:
```
Eigenvalues:
[2.22222222 0.        ]

Eigenvectors:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
```
The eigenvalues signify the variance captured along each eigenvector's direction. The first eigenvalue (about 2.222, i.e. 20/9) accounts for essentially all of the variance, while the second is effectively zero; depending on floating-point rounding it may print as 0 or as a vanishingly small number. This is exactly what we should expect: the two height columns are perfectly correlated, so the data really only varies along one direction.
The eigenvectors represent the directions of maximum variance in the data, and `np.linalg.eig` returns them as the columns of the eigenvector matrix. The first column [0.70710678, 0.70710678] points along the direction of maximum variance, while the second column [-0.70710678, 0.70710678] points along the perpendicular direction, which captures the remaining (here essentially zero) variance.
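In general, the eigenvalues returned by `np.linalg.eig` are not guaranteed to come out sorted, so a common next step (a minimal sketch, assuming `eigenvalues` and `eigenvectors` from above) is to order the eigenpairs by decreasing eigenvalue and look at the fraction of total variance each one explains:

```python
# Sort the eigenpairs by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[order]
eigenvectors_sorted = eigenvectors[:, order]

# Fraction of the total variance explained along each eigenvector's direction
explained_ratio = eigenvalues_sorted / eigenvalues_sorted.sum()
print(explained_ratio)  # here, the first direction explains essentially all the variance
```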
We can plot the eigenvectors on a graph to visualize their directions relative to the data. Let's plot the eigenvectors of the covariance matrix we calculated earlier.
```python
# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Plot the standardized data along with the eigenvectors of the covariance matrix
plt.scatter(X_standard[:, 0], X_standard[:, 1], color='b')
plt.quiver(0, 0, eigenvectors[0, 0], eigenvectors[1, 0], color='r', scale=3, label='Eigenvector 1')
plt.quiver(0, 0, eigenvectors[0, 1], eigenvectors[1, 1], color='g', scale=3, label='Eigenvector 2')
plt.title('Eigenvectors of Covariance Matrix')
plt.legend()
plt.grid(True)
plt.show()
```
The red line corresponds to the eigenvector associated with the first eigenvalue, which captures the direction of maximum variance in the data. The green line represents the eigenvector associated with the second eigenvalue, capturing the direction of the second highest variance.
In our case, the maximum variance lies along the diagonal direction in which both height variables increase together (the line y = x in the standardized feature space), while the second eigenvector points along the perpendicular direction, which carries almost no variance because the two height columns are perfectly correlated.
Eigenvectors and eigenvalues are pivotal in PCA. Eigenvectors represent the directions of maximum variance in the data, while eigenvalues signify the variance captured by each eigenvector.
Notice how the eigenvector with the highest eigenvalue points in the direction of maximum variance. This eigenvector becomes the first principal component in PCA. Subsequent eigenvectors capture the remaining variance in descending order of eigenvalues.
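To preview how these pieces fit together, here is a minimal sketch (assuming `X_standard` and the `eigenvectors_sorted` array from the earlier sketch; not yet the full PCA workflow) of projecting the standardized data onto the first principal component:

```python
# Project the standardized data onto the first principal component
first_pc = eigenvectors_sorted[:, 0]   # eigenvector with the largest eigenvalue
projected = X_standard @ first_pc      # one value per observation

# The variance of the projected data equals the largest eigenvalue
# (using ddof=1 to match np.cov's convention)
print(projected.var(ddof=1))
```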
Congrats! You've comfortably voyaged through understanding and calculating eigenvectors, eigenvalues, and the Covariance Matrix in PCA using Python.
In our next exploration, we delve into PCA implementation using Scikit-learn with more datasets and practical examples. Practice, learn, and venture further into PCA! Happy coding!