Embark on an exciting journey through the world of Principal Component Analysis (PCA). We will explore the indispensable roles of eigenvalues and eigenvectors in the PCA framework and dive into computing these mathematical constructs using Python. Our adventure will also cover the essential role of the covariance matrix and how to compute it. Ready? Set? Let's start!
To start, we work with a small dataset of physical measurements: weight (in lbs), height (in inches), and height (in cm). We capture these in a Python dictionary and convert it to a pandas DataFrame for easy manipulation:
```python
import pandas as pd

# Given data
data = {
    'Weight (lbs)': [150, 160, 155, 165, 170, 160, 158, 175, 180, 170],
    'Height (inches)': [68, 72, 66, 69, 71, 65, 67, 70, 73, 68],
    'Height (cm)': [172.72, 182.88, 167.64, 175.26, 180.34, 165.1, 170.18, 177.8, 185.42, 172.72]
}

# Create a DataFrame
df = pd.DataFrame(data)
```
Here, the DataFrame df represents our collected dataset.
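Before moving on, it can help to take a quick look at the data; a minimal inspection (using only the df defined above) might be:

```python
# Peek at the first few rows and the basic summary statistics
print(df.head())
print(df.describe())
```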
Before performing Principal Component Analysis, we need to standardize the data. This just means changing the scale of our data so each feature has a mean of 0 and a standard deviation of 1.
PCA is sensitive to the scale of the features. Features with larger scales will dominate the variance calculations and may bias the results towards those features. Standardizing the data ensures that each feature contributes equally to the analysis, preventing this bias.
To standardize the data, each feature is transformed using the following formula:

z = (x − μ) / σ

Where:
- x is an original value of the feature,
- μ is the mean of that feature, and
- σ is its standard deviation.
Let's standardize just the 2 height columns in our dataset:
```python
import numpy as np
import matplotlib.pyplot as plt

def standardize(X):
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

X = df[['Height (inches)', 'Height (cm)']].to_numpy()
X_standard = standardize(X)

plt.scatter(X_standard[:, 0], X_standard[:, 1], color='b')
plt.title('Standardized Data')
plt.xlabel('Height (inches)')
plt.ylabel('Height (cm)')
plt.grid(True)
plt.show()
```
After standardization, our data is now centered and scaled, making variables more comparable.
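As a quick sanity check, a small sketch (assuming the X_standard array from the previous snippet) can confirm that each column now has a mean of roughly 0 and a standard deviation of roughly 1:

```python
# After standardization, each column should have mean ~0 and standard deviation ~1
print(np.mean(X_standard, axis=0))  # approximately [0. 0.]
print(np.std(X_standard, axis=0))   # approximately [1. 1.]
```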
Before calculating the covariance matrix, let's understand what it signifies and why it's important in PCA.
Covariance measures the extent to which deviations from the mean in one variable tend to move together with deviations in another. In other words, it describes how one variable changes in relation to another. Covariance between two variables can be positive, meaning the variables tend to increase or decrease together, or negative, meaning one tends to increase when the other decreases.
By helping identify the directions with the most variance in the data, the covariance matrix lays the foundation for PCA. The eigenvectors derived from the covariance matrix form the new axes onto which our data will be projected, and the corresponding eigenvalues give the variances along these new axes.
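To make the sign of covariance concrete before we compute it for our dataset, here is a small sketch using made-up toy values and the sample covariance formula applied by hand:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.0, 6.0, 8.0, 10.0])    # moves with x
y_down = np.array([10.0, 8.0, 6.0, 4.0, 2.0])  # moves against x

def sample_covariance(a, b):
    # Average product of deviations from the mean, with n - 1 in the denominator
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

print(sample_covariance(x, y_up))    # positive (5.0): the variables rise together
print(sample_covariance(x, y_down))  # negative (-5.0): one rises while the other falls
```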
We compute the covariance matrix in Python using NumPy's cov function.
```python
import numpy as np

# Compute the covariance matrix.
# rowvar=False indicates that columns represent variables and rows represent observations.
cov_matrix = np.cov(X_standard, rowvar=False)

print("Covariance Matrix:")
print(cov_matrix)
```
The covariance matrix, cov_matrix, stores the covariance between each pair of features in our standardized dataset. We will see the following output:
```
[[1.11111111 1.11111111]
 [1.11111111 1.11111111]]
```
The covariance matrix is symmetric, with the diagonal elements representing the variance of each feature and the off-diagonal elements representing the covariance between features.
We can understand the following from the covariance matrix:
- The diagonal entries (about 1.11) are the variances of the two standardized height columns. They equal 10/9 rather than exactly 1 because np.cov uses the sample (n − 1) normalization, while our standardization divided by the population standard deviation.
- The off-diagonal entries are just as large as the diagonal ones, which tells us the two height columns move together almost perfectly; this makes sense because centimeters are simply inches multiplied by 2.54.
A quick check of that relationship is sketched below.
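As a quick check (a sketch, assuming the X_standard array from earlier), the correlation between the two columns confirms why the off-diagonal covariances are as large as the variances on the diagonal:

```python
# Correlation matrix of the two standardized height columns;
# the off-diagonal entry should be essentially 1 (perfect positive correlation).
print(np.corrcoef(X_standard, rowvar=False))
```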
Prior to performing eigendecomposition on our covariance matrix, let's take a brief detour to understand the mathematical meaning of eigenvalues and eigenvectors and the instrumental role they play in PCA.
Eigenvectors and eigenvalues are two fundamental concepts of linear algebra. They are generally associated with linear equations and matrices, which happen to be the bedrock of most machine learning and data science algorithms.
An eigenvector is a non-zero vector whose direction is unchanged by a transformation; the linear transformation only scales it. If we denote our transformation by a matrix A and our vector by v, then v is an eigenvector of A if Av is a scalar multiple of v. That scalar is known as the eigenvalue (λ). We can express this relationship mathematically as follows:

Av = λv
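As a tiny illustration with a made-up matrix (not our covariance matrix yet), multiplying an eigenvector by the matrix only scales it:

```python
import numpy as np

# A small made-up transformation matrix and a candidate eigenvector
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
v = np.array([1.0, 0.0])

# A @ v scales v by 2, so v is an eigenvector of A with eigenvalue λ = 2
print(A @ v)                      # [2. 0.]
print(np.allclose(A @ v, 2 * v))  # True
```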
Let's now introduce eigendecomposition into our process. This technique decomposes a matrix into its eigenvalues and eigenvectors, which helps us understand and simplify complex matrix operations, a step that is crucial in PCA.
Here, we calculate the eigenvalues and eigenvectors of the covariance matrix using np.linalg.eig.
```python
# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print("\nEigenvalues:")
print(eigenvalues)

print("\nEigenvectors:")
print(eigenvectors)
```
The eig function returns eigenvalues and their corresponding eigenvectors, which help decipher PCA's underlying structure.
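One caveat: np.linalg.eig does not guarantee that eigenvalues are returned in any particular order. For PCA we usually want them sorted from largest to smallest; a small sketch of that step (in our case they already happen to come out in descending order) is:

```python
# Sort eigenvalues in descending order and reorder the eigenvector columns to match
order = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[order]
eigenvectors_sorted = eigenvectors[:, order]
```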
Let's interpret the output:
```
Eigenvalues:
[2.21398148 0.00824074]

Eigenvectors:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
```
The eigenvalues signify the variance captured by each eigenvector. The first eigenvalue (2.21398148) is significantly higher than the second (0.00824074), indicating the first eigenvector captures most of the variance in the data.
The eigenvectors are the columns of the matrix above and represent the directions of maximum variance in the data. The first eigenvector [0.70710678, 0.70710678] points in the direction of maximum variance, while the second eigenvector [-0.70710678, 0.70710678] points in the perpendicular direction that captures the second highest variance.
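A common way to quantify how much each direction matters is the explained variance ratio, the share of total variance carried by each eigenvalue; here is a quick sketch using the values computed above:

```python
# Fraction of the total variance captured by each eigenvector
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)  # roughly [0.996 0.004] for the values above
```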
We can plot eigenvectors on a graph to visualize their direction and magnitude. Let's plot the eigenvectors of the covariance matrix we calculated earlier.
```python
# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Plot the standardized data along with the eigenvectors of the covariance matrix we computed for the height columns
plt.scatter(X_standard[:, 0], X_standard[:, 1], color='b')
plt.quiver(0, 0, eigenvectors[0, 0], eigenvectors[1, 0], color='r', scale=3, label='Eigenvector 1')
plt.quiver(0, 0, eigenvectors[0, 1], eigenvectors[1, 1], color='g', scale=3, label='Eigenvector 2')
plt.title('Eigenvectors of Covariance Matrix')
plt.legend()
plt.grid(True)
plt.show()
```
The red arrow corresponds to the eigenvector associated with the first eigenvalue, which points in the direction of maximum variance in the data. The green arrow represents the eigenvector associated with the second eigenvalue, capturing the direction of the second highest variance.
In our case, the direction of maximum variance runs along the diagonal of the standardized feature space, because the two height columns are essentially perfectly correlated; the perpendicular direction captures almost none of the variance.
Eigenvectors and eigenvalues are pivotal in PCA. Eigenvectors represent the directions of maximum variance in the data, while eigenvalues signify the variance captured by each eigenvector.
Notice how the eigenvector with the highest eigenvalue points in the direction of maximum variance. This eigenvector becomes the first principal component in PCA. Subsequent eigenvectors capture the remaining variance in descending order of eigenvalues.
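To make this concrete, here is a short sketch (using the X_standard, eigenvalues, and eigenvectors computed above) that projects the standardized data onto the first principal component:

```python
# Select the eigenvector with the largest eigenvalue and project the data onto it
first_pc = eigenvectors[:, np.argmax(eigenvalues)]
projected = X_standard @ first_pc  # one coordinate per observation along the first PC

print(projected.shape)  # (10,)
print(projected)
```

Each observation is reduced to a single coordinate along the direction of maximum variance, which is exactly what the first principal component gives us.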
Congrats! You've successfully journeyed through understanding and calculating eigenvectors, eigenvalues, and the covariance matrix for PCA using Python.
In our next exploration, we delve into PCA implementation using Scikit-learn with more datasets and practical examples. Practice, learn, and venture further into PCA! Happy coding!