Welcome back to our journey through "Linear Landscapes of Dimensionality Reduction". Today's lesson focuses on applying Linear Discriminant Analysis (LDA) using Scikit-learn and then diving into feature extraction and selection. Primarily, we'll be utilizing the `LinearDiscriminantAnalysis` class from `sklearn.discriminant_analysis`. Ready to take the plunge?
Before diving in, let's do a quick recap of LDA: what it is, how it works, and how it's implemented. As we transition to Scikit-learn today, our first step is to load the dataset we'll be working with. We'll use the Iris dataset, which is included in Scikit-learn's datasets:
```python
from sklearn.datasets import load_iris

# Load the Iris dataset
data = load_iris()
X = data.data
```
Here, `data` is a `Bunch` object containing the data and targets (species), among other attributes, while `X` captures the feature data.
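If you'd like to peek at what else the `Bunch` object holds, a quick sketch:

```python
# Explore other attributes of the Bunch object
print(data.feature_names)   # names of the four measured features
print(data.target_names)    # the three species names
print(data.target[:5])      # species labels encoded as integers 0, 1, 2
```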
Our next step is to preprocess the data. We scale the features to zero mean and unit variance, an important step for optimal LDA performance:
```python
import numpy as np

# Scale the features to zero mean and unit variance
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
```
This code subtracts the mean and divides by the standard deviation for each feature column, effectively standardizing it.
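Scikit-learn's `StandardScaler` performs the same standardization and is a common alternative to doing it by hand; a minimal sketch:

```python
from sklearn.preprocessing import StandardScaler

# Equivalent standardization via StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(data.data)
```

In a production pipeline you would typically fit the scaler on the training data only and reuse it to transform the test data, avoiding information leakage from the test set.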
Next, we'll split our dataset into training and testing data:
```python
X_train = X[:120]
y_train = data.target[:120]

X_test = X[120:]
y_test = data.target[120:]
```
`X_train` and `y_train` make up our training set, while `X_test` and `y_test` make up our test set.
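One caveat: the Iris samples are stored grouped by species, so slicing at index 120 leaves a test set containing only Iris virginica. We keep the simple slice here to match the lesson, but a shuffled, stratified split via `train_test_split` is the more common approach; a sketch:

```python
from sklearn.model_selection import train_test_split

# Randomized, class-balanced split; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, data.target, test_size=0.2, stratify=data.target, random_state=42
)
```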
Let's apply LDA to our data:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_train, y_train)
```
We initialize the Linear Discriminant Analysis object with `n_components=2`, the maximum for the Iris dataset since LDA can produce at most n_classes - 1 components, fit the model to our training data, and transform it into the lower-dimensional space.
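If you're curious how the discriminative power is split between the two components, the fitted model exposes an `explained_variance_ratio_` attribute; a quick sketch:

```python
# Fraction of between-class variance captured by each discriminant
print(lda.explained_variance_ratio_)
```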
To better understand our LDA model's results, let's visualize them with Matplotlib:
```python
import matplotlib.pyplot as plt

# Scatter plot of transformed data
colors = ['red', 'blue', 'green']
labels = ['Setosa', 'Versicolor', 'Virginica']

for i in range(3):
    plt.scatter(X_lda[y_train == i, 0], X_lda[y_train == i, 1], c=colors[i], label=labels[i])

plt.title('LDA: Projected data onto the first two components')
plt.xlabel('Discriminant 1')
plt.ylabel('Discriminant 2')
plt.legend()
plt.show()
```
This visualization showcases the class separation achieved by LDA, with each species easily distinguished by its color. The 'Discriminant 1' and 'Discriminant 2' axes represent the two components we selected for our LDA model; the transformed data is projected onto these axes.
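The same projection can be applied to new observations. For instance, a quick sketch that places our (already scaled) test samples in this space via `transform`:

```python
# Project the already-scaled test samples onto the same discriminant axes
X_test_lda = lda.transform(X_test)
print(X_test_lda.shape)  # (30, 2): 30 test samples, 2 discriminants
```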
Additionally, LDA can be used as a classifier when the classes are well separated and its assumptions are met. It's particularly useful for multi-class classification problems.
The `LinearDiscriminantAnalysis` class in Scikit-learn provides a `predict` method to classify new samples. Let's see how to use it:
```python
# Use the predict method to predict the classes of the test samples
y_pred = lda.predict(X_test)

# Calculate the accuracy of the model
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')  # Prints: Accuracy: 0.90
```
Here, we use the `predict` method of our trained LDA model to predict the classes of the test data, and measure its accuracy by comparing the predictions with the actual classes.
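For a fuller picture than a single accuracy number, Scikit-learn's metrics module can break the results down per class; a quick sketch:

```python
from sklearn.metrics import confusion_matrix

# score() is a built-in shortcut for accuracy
print(lda.score(X_test, y_test))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred, labels=[0, 1, 2]))
```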
We applied LDA using Scikit-learn, performed feature extraction through the LDA projection, visualized the results, and calculated the accuracy of our model. In the next lesson, we'll dive into some hands-on exercises. Remember, practice is pivotal to mastering new concepts and reinforcing learning!