Welcome to the world of Linear Discriminant Analysis (LDA), a technique widely used for dimensionality reduction in supervised learning. In this lesson, we're delving into LDA and building a Python implementation. We'll use the popular Iris dataset for a practical example.
LDA reduces dimensionality by constructing a feature space that optimally separates the classes in the data. The axes of this space are linear combinations of the original features, given by eigenvectors of a matrix built from the data. The LDA algorithm consists of several steps, such as calculating class mean vectors and scatter matrices, then finding the eigenvalues and eigenvectors that map the original feature space onto a lower-dimensional space.
To understand LDA, let's start with a simple two-dimensional example. Suppose we have data points scattered across a 2-D space, with each point belonging to one of two possible classes.
In LDA, we want to project these points onto a line such that, after projection, the two classes are as well separated as possible. The goal here is twofold:

- Maximize the distance between the means of the two classes (between-class separation).
- Minimize the spread of samples within each class (within-class scatter).
This forms the intuition behind LDA. The crux of an LDA transformation involves formulating a weight matrix and transforming our input data by multiplying it with this weight matrix. The weights here help in increasing class separability. Let's see how we can achieve this mathematically using scatter matrices.
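To make the idea of "projecting onto a line" concrete, here is a minimal sketch with made-up 2-D points and an arbitrary direction `w` (both purely illustrative; LDA's job is precisely to choose `w` well):

```python
import numpy as np

# Two made-up 2-D points per class (illustrative values only)
X_toy = np.array([[1.0, 2.0],
                  [1.5, 1.8],   # class 0
                  [5.0, 8.0],
                  [5.5, 8.2]])  # class 1

# An arbitrary unit-length direction; LDA will later choose this direction
# so that the projected classes overlap as little as possible
w = np.array([0.6, 0.8])

# Projecting each point onto the line spanned by w is just a dot product
projections = X_toy.dot(w)
print(projections)  # one 1-D coordinate per sample
```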
Scatter matrices are a critical part of LDA. To understand them, we need to differentiate between two types of scatter matrices:

- The within-class scatter matrix, $S_W$, which captures how samples spread around their own class mean.
- The between-class scatter matrix, $S_B$, which captures how far the class means are from one another.
Mathematically, the within-class scatter matrix is computed as the sum of scatter matrices for each class, and the between-class scatter matrix is calculated based on the means of different classes.
Given the mathematical foundations of within-class and between-class scatter matrices, the steps involved in the LDA algorithm can be conceptualized as:

1. Compute the mean vector for each class.
2. Compute the within-class scatter matrix $S_W$ and the between-class scatter matrix $S_B$.
3. Solve the eigenvalue problem for $S_W^{-1} S_B$.
4. Sort the eigenvectors by decreasing eigenvalue and keep the top ones as the transformation matrix $W$.
5. Project the data onto the new subspace using $W$.
Let's translate this into Python code.
Let's load the Iris dataset for LDA. It is a 3-class dataset with 4 features (sepal length, sepal width, petal length, petal width).
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Scale the data to zero mean and unit variance
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
```
The code above loads the Iris dataset and standardizes the features. Scaling puts the input variables on a comparable footing before they are fed into the algorithm, so no single feature dominates the scatter matrices simply because of its units.
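As an optional sanity check, you can confirm that the scaling worked: after standardization each feature should have a mean of roughly zero and a standard deviation of roughly one.

```python
# Each feature should now have mean ~0 and std ~1 (up to floating-point error)
print(np.round(X.mean(axis=0), 6))
print(np.round(X.std(axis=0), 6))
```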
To build our LDA from scratch, we start by defining a Python class called `LDA`. The class has two methods for calculating the within-class and between-class scatter matrices and, finally, a `fit` method for calculating the transformation matrix.
```python
class LDA:
    def __init__(self):
        self.W = None
```
The methods of the `LDA` class take `self` as their first parameter so that instance variables and other methods can be accessed inside the class. We define the `__init__` method, which Python calls when you create a new instance of the class. This method sets up the object's state by defining the variable `W`, which will hold the transformation matrix.
The within-class scatter matrix is a crucial component in the LDA algorithm: it encapsulates the total scatter within each class. To calculate it, we first define the scatter matrix for each class, and then add them all to give us the within-class scatter matrix, $S_W$.
Mathematically, the scatter matrix for a specific class $i$ is computed via:

$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T, \qquad S_W = \sum_{i=1}^{c} S_i$$
Here, $x$ ranges over $D_i$, the set of samples belonging to class $i$, while $m_i$ is the mean vector for that class, i.e. the average of all the samples in that class. The calculation gives us the deviation of each sample from the class mean. Each deviation (a column vector) is multiplied by its own transpose, producing an outer product, and summing these outer products yields a scatter matrix (an unnormalized covariance matrix) that summarizes the dispersion of samples around the class mean. Note that $c$ is the number of classes and $N_i$ is the number of samples in the $i$-th class.
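To see the formula in action, here is a small, self-contained sketch (with made-up numbers, not from the Iris data) that computes the scatter matrix of a single class around its own mean:

```python
import numpy as np

# Three made-up samples with two features, all belonging to one class
X_class = np.array([[2.0, 3.0],
                    [3.0, 4.0],
                    [4.0, 5.0]])
m = X_class.mean(axis=0)  # class mean vector

# Sum the outer products of each sample's deviation from the class mean
S = np.zeros((2, 2))
for x in X_class:
    d = (x - m).reshape(-1, 1)  # deviation as a column vector
    S += d.dot(d.T)

print(S)  # the 2x2 scatter matrix for this class
```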
To further understand this, consider an example. Suppose we're analyzing the taste preferences of children, separating them into two groups: those who prefer chocolate flavor and those who prefer vanilla. In this case, the within-class scatter matrix will summarize the individual preference strength variation for each flavor among children who prefer that flavor.
That is, if we consider chocolate-loving children as a class, the scatter matrix will capture how their chocolate preference strengths deviate from the average preference strength. This tells us how much variation is present when comparing children within the same flavor preference.
This class method computes the within-class scatter matrix, which measures how spread out individual class instances are around the mean. This matrix helps LDA understand the overall structure of classes within our data.
```python
    def within_class_scatter_matrix(self, X, y):
        class_labels = np.unique(y)
        mean_vectors = [np.mean(X[y == class_id], axis=0) for class_id in class_labels]
        S_W = np.zeros((X.shape[1], X.shape[1]))
        for class_id, mean_vec in zip(class_labels, mean_vectors):
            class_sc_mat = np.zeros((X.shape[1], X.shape[1]))
            for row in X[y == class_id]:
                row, mv = row.reshape(X.shape[1], 1), mean_vec.reshape(X.shape[1], 1)
                class_sc_mat += (row - mv).dot((row - mv).T)
            S_W += class_sc_mat

        return S_W
```
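As an optional cross-check (not part of the lesson's algorithm), once the class methods are assembled you can verify this result against NumPy's covariance function: each class's scatter matrix equals its sample covariance matrix scaled by $N_i - 1$.

```python
# Optional cross-check: S_W should equal the sum over classes of
# (N_i - 1) * covariance matrix of that class
lda_check = LDA()
S_W = lda_check.within_class_scatter_matrix(X, y)

S_W_alt = np.zeros((X.shape[1], X.shape[1]))
for class_id in np.unique(y):
    X_c = X[y == class_id]
    S_W_alt += (len(X_c) - 1) * np.cov(X_c, rowvar=False)

print(np.allclose(S_W, S_W_alt))  # True
```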
Another pivotal component in the LDA algorithm is the between-class scatter matrix. Unlike the within-class scatter matrix, which characterizes the variance within individual classes, the between-class scatter matrix is all about capturing the variance between different classes—hence the name.
To compute the between-class scatter matrix, we need to look at the means, or central tendencies, of our different classes. Just as in our example from the previous section—children with a preference for chocolate versus vanilla flavor—if we measure the average preference strength for chocolate among all chocolate-loving children and do the same for the vanilla-loving children, we'll have the two class mean vectors.
The between-class scatter matrix measures the scatter of these class means around the overall mean. Simply put, it captures how different the classes are from each other in terms of their central tendencies.
Just imagine a scatter on a graph: Points that represent individual children are scattered around the mean point (average taste preference strength) for their class—either chocolate or vanilla. The scatter within class captures how far apart these children are spread out within their flavor preference.
Now think about the means of both classes on this graph: Are the lovers of chocolate, in general (on average), similar in taste strength to the lovers of vanilla, or are they different? The between-class scatter captures this difference—the scatter among the class means.
In other words, the between-class scatter matrix captures the 'distance' between the two classes—in our example, the distance between chocolate-lovers and vanilla-lovers, based on their average preference strength.
Mathematically, the between-class scatter matrix is calculated as follows:

$$S_B = \sum_{i=1}^{c} N_i (m_i - m)(m_i - m)^T$$

Here, $c$ is the number of classes, $N_i$ is the number of samples in the $i$-th class, $m_i$ is the mean vector of the $i$-th class, and $m$ is the overall mean vector of the dataset.
Next, we write a method to calculate the between-class scatter matrix. This matrix measures how far the mean vector of each class is from the mean vector of the whole dataset. It is a measure of the distance between different classes.
```python
    def between_class_scatter_matrix(self, X, y):
        class_labels = np.unique(y)
        mean_vectors = [np.mean(X[y == class_id], axis=0) for class_id in class_labels]
        overall_mean = np.mean(X, axis=0)
        S_B = np.zeros((X.shape[1], X.shape[1]))
        for class_id, mean_vec in zip(class_labels, mean_vectors):
            n_i = len(X[y == class_id])
            # Reshape to a column vector so the product below is an outer product
            mean_diff = (mean_vec - overall_mean).reshape(X.shape[1], 1)
            S_B += n_i * mean_diff.dot(mean_diff.T)
        return S_B
```
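A useful property for debugging (an optional check, not required by the algorithm) is that the within-class and between-class scatter matrices add up to the total scatter matrix of the whole dataset, i.e. the scatter of all samples around the overall mean:

```python
# Optional check: S_W + S_B equals the total scatter matrix S_T
lda_check = LDA()
S_W = lda_check.within_class_scatter_matrix(X, y)
S_B = lda_check.between_class_scatter_matrix(X, y)

D = X - X.mean(axis=0)   # deviations from the overall mean
S_T = D.T.dot(D)         # total scatter matrix

print(np.allclose(S_W + S_B, S_T))  # True
```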
Now that we've calculated the scatter matrices, we can write the `fit` method. This method finds the eigenvalues and corresponding eigenvectors of $S_W^{-1} S_B$ and selects a subset of the eigenvectors to form the transformation matrix.
```python
    def fit(self, X, y, n_components):
        S_W = self.within_class_scatter_matrix(X, y)
        S_B = self.between_class_scatter_matrix(X, y)

        # Eigenvalue problem for S_W^{-1} S_B (not symmetric, so use eig, not eigh)
        eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
        eig_vals, eig_vecs = eig_vals.real, eig_vecs.real
        eig_pairs = sorted([(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))],
                           key=lambda x: x[0], reverse=True)

        # Forming a transformation matrix W from the top n_components eigenvectors
        self.W = np.hstack([eig_pairs[i][1].reshape(X.shape[1], 1) for i in range(n_components)])
```
Additionally, we define a `transform` method that applies the final projection:
```python
    def transform(self, X):
        return X.dot(self.W)
```
Let's understand the implementation step by step. The heart of the LDA algorithm lies in solving an eigenvalue problem. After computing the scatter matrices $S_W$ and $S_B$, we solve the eigenvalue problem for the matrix $S_W^{-1} S_B$ (equivalent to the generalized eigenvalue problem $S_B v = \lambda S_W v$), which is done in the following line of the `fit` method:
```python
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
```
Here, the `np.linalg.eig` function computes the eigenvalues (`eig_vals`) and right eigenvectors (`eig_vecs`) of a square array. The product of the inverse of $S_W$ and $S_B$ is fed into it. Because $S_W^{-1} S_B$ is generally not symmetric, we use `np.linalg.eig` rather than `np.linalg.eigh` (which assumes a symmetric matrix) and keep only the real parts of the results.
Why are these eigenvectors important? Each eigenvector provides us with a direction in our feature space along which our data, when projected, will be spread out in a way that optimizes the class separability. The corresponding eigenvalue tells us how much of the variance in the data is captured by this particular direction or eigenvector.
But we don't want to use all of these eigenvectors; we want to trim our dimensions down to a smaller number, `n_components`. So we need to sort our eigenvectors based on their eigenvalues, in descending order. This is done in the following line, which produces a list of (eigenvalue, eigenvector) pairs sorted by eigenvalue:
```python
eig_pairs = sorted([(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))],
                   key=lambda x: x[0], reverse=True)
```
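As an aside, the sorted eigenvalues also tell you how much each direction contributes to class separation. For a $c$-class problem, $S_B$ has rank at most $c - 1$, so for the 3-class Iris data only the two largest eigenvalues are meaningfully non-zero. As a purely illustrative sketch, you could temporarily drop something like the following inside `fit` to inspect them:

```python
# Purely for inspection: how much "discriminability" each direction carries.
# For c classes, at most c - 1 eigenvalues are meaningfully non-zero.
total = sum(pair[0] for pair in eig_pairs)
for i, (val, _) in enumerate(eig_pairs):
    print(f"LD{i + 1}: eigenvalue = {val:.4f}, ratio = {val / total:.4f}")
```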
Subsequently, we form a transformation matrix `W` by horizontally stacking the top `n_components` eigenvectors:
```python
self.W = np.hstack([eig_pairs[i][1].reshape(X.shape[1], 1) for i in range(n_components)])
```
This transformation matrix `W` now represents the directions, or axes, onto which we want to project our data.
The `transform` method does this final step. It takes the dot product of the data `X` and the transformation matrix `W`. In other words, it projects the data onto the directions that maximize between-class scatter and minimize within-class scatter:
```python
    def transform(self, X):
        return X.dot(self.W)
```
This completes the fitting and transformation process for LDA from scratch. The fitted LDA model now contains a way of projecting any new data input onto the subspace that optimizes for class separability, giving us our reduced-dimension output.
Next, we create an instance of the `LDA` class and call the `fit` method, passing in our original feature space, the target variable, and the number of components we need for the transformation.
We then project the dataset from the original feature space onto the new subspace. Afterward, we plot the projected samples.
```python
lda = LDA()
lda.fit(X, y, n_components=2)
X_lda = lda.transform(X)

# Scatter plot of transformed data
colors = ['red', 'blue', 'green']
labels = iris.target_names
for label, color in zip(np.unique(y), colors):
    plt.scatter(X_lda[y == label, 0], X_lda[y == label, 1], color=color, label=labels[label])

plt.title('LDA: Projected data onto the first two components')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend()
plt.show()
```
The resulting scatter plot shows the three Iris classes laid out along the first two linear discriminants, with setosa clearly isolated and versicolor and virginica largely separated as well.
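If scikit-learn is available, an optional cross-check (not part of the from-scratch implementation) is to compare our projection against the library's `LinearDiscriminantAnalysis`. The coordinates may differ in sign or scale, but the class separation in the plot should look essentially the same.

```python
# Optional cross-check against scikit-learn; directions may differ by sign/scale
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

sk_lda = LinearDiscriminantAnalysis(n_components=2)
X_sk = sk_lda.fit_transform(X, y)

plt.scatter(X_sk[:, 0], X_sk[:, 1], c=y)
plt.title('scikit-learn LDA projection (for comparison)')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.show()
```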
Great job! You've successfully understood the concepts of LDA and its mathematical foundation, and built a simple LDA from scratch in Python. Now, get ready for practice exercises to apply your newly acquired knowledge and consolidate what you've learned. Happy learning!