Lesson 1

Welcome to the world of Linear Discriminant Analysis (LDA), a technique widely used for dimensionality reduction in supervised learning. In this lesson, we're delving into LDA and building a Python implementation. We'll use the popular Iris dataset for a practical example.

LDA reduces dimensionality by constructing a feature space that optimally separates the classes in the data. The axes in this space are linear combinations of the original features and are known as eigenvectors. The LDA algorithm consists of several steps, such as calculating class mean vectors and scatter matrices, then finding eigenvalues and eigenvectors that map the original feature space on to a lower-dimensional space.

To understand LDA, let's start with a simple two-dimensional example. Suppose we have data points scattered across a 2-D space, with each point belonging to one of two possible classes.

In LDA, we want to project these points onto a line such that when the points are projected, the two classes are as separated as possible. The goal here is twofold:

- Maximize the distance between the means of the two classes.
- Minimize the variation (in other words, the scatter) within each category.

This forms the intuition behind LDA. The crux of an LDA transformation involves formulating a weight matrix and transforming our input data by multiplying it with this weight matrix. The weights here help in increasing class separability. Let's see how we can achieve this mathematically using scatter matrices.

Scatter matrices are a critical part of LDA. To understand them, we need to differentiate between two types of scatter matrices:

- The
**within-class scatter matrix**measures the variability of data points within each class. - The
**between-class scatter matrix**captures the variability between different classes.

Mathematically, the within-class scatter matrix $S_W$ is computed as the sum of scatter matrices for each class, and the between-class scatter matrix $S_B$ is calculated based on the means of different classes.

Given the mathematical foundations of within-class and between-class scatter matrices, the steps involved in the LDA algorithm can be conceptualized as:

- Compute class means: Calculate the average value for each class.
- Compute the within-class scatter matrix $S_{W}$.
- Compute the weights that maximize separability: This is obtained by multiplying the inverse of $S_{W}$ and the difference between the class means.
- Use the weights to transform our input data.

Let's translate this into Python code.

Let's load the Iris dataset for LDA which is a 3-class dataset with 4 features (sepal length, sepal width, petal length, petal width)

Python`1import numpy as np 2import matplotlib.pyplot as plt 3from sklearn.datasets import load_iris 4 5# Load the Iris dataset 6iris = load_iris() 7X = iris.data 8y = iris.target 9 10# Scale the data to zero mean and unit variance 11X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)`

The code above loads the Iris dataset and normalizes the data. Normalization helps balance different input variables before they're input into the machine learning algorithm.

To build our LDA from scratch, we start by defining a Python class called `LDA`

. The class has two methods for calculating within-class and between-class scatter matrices, and finally, a `fit`

method for calculating the transformation matrix.

Python`1class LDA: 2 def __init__(self): 3 self.W = None`

The `LDA`

class has `self`

as its first parameter so that instance variables and methods can be accessed in the class. We define the `__init__`

method that Python calls when you create a new instance of this class. This method sets up a state for the object by defining the variable `W`

which will hold the transformation matrix.

The within-class scatter matrix is a crucial component in the LDA algorithm—it encapsulates the total scatter within each class. To calculate it, we first define the scatter matrix for each class, and then add them all to give us the within-class scatter matrix, $S_{W}$.

Mathematically, the scatter matrix for a specific class is computed via:

$S_{W} = \Sigma_{i=1}^{g}\Sigma_{j=1}^{N_i} (x_{ij} - m)(x_{ij} - m)^T$Here, $x_{ij}$ refers to the samples belonging to the class, while $m$ is the mean vector for that class—it's the average of all the samples in that class. The calculation $(x_{ij} - m)$ gives us the deviation of individual samples from the class mean. This deviation is then transposed and multiplied with itself, resulting in a covariance matrix (scatter matrix) which summarizes the dispersion of samples around the class mean. Note, that $g$ is the number of classes and $N_i$ is the number of samples in the $i$-th class.

To further understand this, consider an example. Suppose we're analyzing the taste preferences of children, separating them into two groups: those who prefer chocolate flavor and those who prefer vanilla. In this case, the within-class scatter matrix will summarize the individual preference strength variation for each flavor among children who prefer that flavor.

That is, if we consider chocolate-loving children as a class, the scatter matrix will capture how their chocolate preference strengths deviate from the average preference strength. This tells us how much variation is present when comparing children within the same flavor preference.

This class method computes the within-class scatter matrix, which measures how spread out individual class instances are around the mean. This matrix helps LDA understand the overall structure of classes within our data.

Python`1 def within_class_scatter_matrix(self, X, y): 2 class_labels = np.unique(y) 3 mean_vectors = [np.mean(X[y == class_id], axis=0) for class_id in class_labels] 4 S_W = np.zeros((X.shape[1], X.shape[1])) 5 for class_id, mean_vec in zip(class_labels, mean_vectors): 6 class_sc_mat = np.zeros((X.shape[1], X.shape[1])) 7 for row in X[y == class_id]: 8 row, mv = row.reshape(X.shape[1], 1), mean_vec.reshape(X.shape[1], 1) 9 class_sc_mat += (row - mv).dot((row - mv).T) 10 S_W += class_sc_mat 11 12 return S_W`

Another pivotal component in the LDA algorithm is the between-class scatter matrix. Unlike the within-class scatter matrix, which characterizes the variance within individual classes, the between-class scatter matrix is all about capturing the variance between different classes—hence the name.

To compute the between-class scatter matrix, we need to look at the means, or central tendencies, of our different classes. Just as in our example from the previous section—children with a preference for chocolate versus vanilla flavor—if we measure the average preference strength for chocolate among all chocolate-loving children and do the same for the vanilla-loving children, we'll have the two class mean vectors.

The between-class scatter matrix computes the covariance of these class means. Simply put, it measures how different the two classes are from each other in terms of their central tendencies.

Just imagine a scatter on a graph: Points that represent individual children are scattered around the mean point (average taste preference strength) for their class—either chocolate or vanilla. The scatter within class captures how far apart these children are spread out within their flavor preference.

Now think about the means of both classes on this graph: Are the lovers of chocolate, in general (on average), similar in taste strength to the lovers of vanilla, or are they different? The between-class scatter captures this difference—the scatter among the class means.

In other words, the between-class scatter matrix captures the 'distance' between the two classes—in our example, the distance between chocolate-lovers and vanilla-lovers, based on their average preference strength.

Mathematically, the between-class scatter matrix $S_{B}$ is calculated as follows:

$S_{B} = \Sigma_{i=1}^{g} N_i(m_i - m)(m_i - m)^T$Here, $g$ is the number of classes, $N_i$ is the number of samples in the $i$-th class, $m_i$ is the mean vector of the $i$-th class, and $m$ is the overall mean vector of the dataset.

Next, we write a method to calculate the between-class scatter matrix. This matrix measures how far the mean vector of each class is from the mean vector of the whole dataset. It is a measure of the distance between different classes.

Python`1 def between_class_scatter_matrix(self, X, y): 2 class_labels = np.unique(y) 3 mean_vectors = [np.mean(X[y == class_id], axis=0) for class_id in class_labels] 4 overall_mean = np.mean(X, axis=0) 5 S_B = np.zeros((X.shape[1], X.shape[1])) 6 for class_id, mean_vec in zip(class_labels, mean_vectors): 7 S_B += len(X[y==class_id]) * (mean_vec - overall_mean).dot((mean_vec - overall_mean).T) 8 return S_B`

Now that we've calculated the scatter matrices, we can write the `fit`

method. This method finds the generalized eigenvalues, and corresponding eigenvectors, and selects a subset for transformation.

Python`1 def fit(self, X, y, n_components): 2 S_W = self.within_class_scatter_matrix(X, y) 3 S_B = self.between_class_scatter_matrix(X, y) 4 5 # Eigenvalue problem 6 eig_vals, eig_vecs = np.linalg.eigh(np.linalg.inv(S_W).dot(S_B)) 7 eig_pairs = sorted([(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))], key=lambda x: x[0], reverse=True) 8 9 # Forming a transformation matrix W 10 self.W = np.hstack([eig_pairs[i][1].reshape(X.shape[1], 1) for i in range(n_components)])`

Additionally, we define a function for final transformation:

Python`1 def transform(self, X): 2 return X.dot(self.W)`

Let's understand the implementation step-by-step — the heart of the LDA algorithm lies in solving a traditional eigenvalue problem. After computing the scatter matrices $S_W$ and $S_B$, we solve the generalized eigenvalue problem for the matrix $S_W^{-1}S_B$, which is done in the following line of the `fit`

method:

Python`1eig_vals, eig_vecs = np.linalg.eigh(np.linalg.inv(S_W).dot(S_B))`

Here, `np.linalg.eigh`

function computes the eigenvalues (`eig_vals`

) and right eigenvectors (`eig_vecs`

) of a square array. The dot product of the inverse of $S_W$ and $S_B$ is fed into it.

Why are these eigenvectors important? Each eigenvector provides us with a direction in our feature space along which our data, when projected, will be spread out in a way that optimizes the class separability. The corresponding eigenvalue tells us how much of the variance in the data is captured by this particular direction or eigenvector.

But we don't want to use all these eigenvectors; we want to trim down our dimensions to a lower value `n_components`

. So we need to sort our eigenvectors based on the eigenvalues, in descending order. This is done in the following line, obtaining a list of tuple pairs (eigenvalue, eigenvector), sorted by the eigenvalue:

Python`1eig_pairs = sorted([(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))], key=lambda x: x[0], reverse=True)`

Subsequently, we form a transformation matrix `W`

by horizontally stacking the top `n_components`

eigenvectors:

Python`1self.W = np.hstack([eig_pairs[i][1].reshape(X.shape[1], 1) for i in range(n_components)])`

This transformation matrix `W`

now represents the directions or axes onto which we want to project our data.

The `transform`

method does this final step. It takes the dot product of the data `X`

and transformation matrix `W`

. In other words, it projects the data onto the directions that maximize between-class scatter and minimize within-class scatter:

Python`1def transform(self, X): 2 return X.dot(self.W)`

This completes the fitting and transformation process for LDA from scratch. The fitted LDA model now contains a way of projecting any new data input onto the subspace that optimizes for class separability, giving us our reduced-dimension output.

Next, we create an instance of the `LDA`

class and call the `fit`

method passing in our original feature space, the target variable, and the number of components we need for transformation.

We then perform projection of the dataset from the original features space to the new subspace. Afterward, we plot the projected samples.

Python`1lda = LDA() 2lda.fit(X, y, n_components=2) 3X_lda = lda.transform(X) 4 5# Scatter plot of transformed data 6colors = ['red', 'blue', 'green'] 7labels = iris.target_names 8for label, color in zip(np.unique(y), colors): 9 plt.scatter(X_lda[y == label, 0], X_lda[y == label, 1], color=color, label=labels[label]) 10 11plt.title('LDA: Projected data onto the first two components') 12plt.xlabel('LD1') 13plt.ylabel('LD2') 14plt.legend() 15plt.show()`

Finally, we can plot the transformed data:

Great job! You've successfully understood the concepts of LDA and its mathematical foundation, and built a simple LDA from scratch using Python. Now, get ready for practice exercises to apply your newly acquired knowledge and consolidate learning! Happy learning!