Welcome back to our journey through "Linear Landscapes of Dimensionality Reduction". Today's lesson focuses on applying Linear Discriminant Analysis (LDA) using Scikit-learn and then diving into feature extraction and selection. Primarily, we'll be utilizing the `LinearDiscriminantAnalysis` class from `sklearn.discriminant_analysis`. Ready to take the plunge?
Before diving in, let's do a quick recap of LDA: what it is, how it works, and how it's implemented. As we transition to Scikit-learn today, our first step is to load the dataset we'll be working with. We'll use the Iris dataset, which is included in Scikit-learn's datasets:
```python
from sklearn.datasets import load_iris

# Load the Iris dataset
data = load_iris()
X = data.data
```
Here, `data` is a `Bunch` object containing the data and targets (species), among other attributes, while `X` captures the feature data.
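If you'd like to peek at what else the `Bunch` object holds, a quick sketch:

```python
# Explore other attributes of the Bunch object
print(data.feature_names)   # names of the four measured features
print(data.target_names)    # the three species names
print(data.target[:5])      # species labels encoded as integers 0, 1, 2
```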
Our next step is to preprocess the data. We scale the features to zero mean and unit variance, an important step for optimal LDA performance:
```python
import numpy as np

# Scale the features to zero mean and unit variance
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
```
This code subtracts the mean and divides by the standard deviation for each feature column, effectively standardizing it.
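Scikit-learn's `StandardScaler` performs the same standardization and is a common alternative to doing it by hand; a minimal sketch:

```python
from sklearn.preprocessing import StandardScaler

# Equivalent standardization via StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(data.data)
```

In a production pipeline you would typically fit the scaler on the training data only and reuse it to transform the test data, avoiding information leakage from the test set.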
Next, we'll split our dataset into training and testing data:
```python
X_train = X[:120]
y_train = data.target[:120]

X_test = X[120:]
y_test = data.target[120:]
```
`X_train` and `y_train` make up our training set, while `X_test` and `y_test` make up our test set.
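One caveat: the Iris samples are stored grouped by species, so slicing at index 120 leaves a test set containing only Iris virginica. We keep the simple slice here to match the lesson, but a shuffled, stratified split via `train_test_split` is the more common approach; a sketch:

```python
from sklearn.model_selection import train_test_split

# Randomized, class-balanced split; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, data.target, test_size=0.2, stratify=data.target, random_state=42
)
```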
Let's apply LDA to our data:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_train, y_train)
```
We initialize the Linear Discriminant Analysis object with `n_components=2`, the maximum for the Iris dataset since LDA can produce at most n_classes - 1 components, fit the model to our training data, and transform it into the lower-dimensional space.
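If you're curious how the discriminative power is split between the two components, the fitted model exposes an `explained_variance_ratio_` attribute; a quick sketch:

```python
# Fraction of between-class variance captured by each discriminant
print(lda.explained_variance_ratio_)
```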
To better understand our LDA model's results, let's visualize them with Matplotlib:
```python
import matplotlib.pyplot as plt

# Scatter plot of transformed data
colors = ['red', 'blue', 'green']
labels = ['Setosa', 'Versicolor', 'Virginica']

for i in range(3):
    plt.scatter(X_lda[y_train == i, 0], X_lda[y_train == i, 1], c=colors[i], label=labels[i])

plt.title('LDA: Projected data onto the first two components')
plt.xlabel('Discriminant 1')
plt.ylabel('Discriminant 2')
plt.legend()
plt.show()
```
This visualization showcases the class separation achieved by LDA, with each species easily distinguished by its color. The 'Discriminant 1' and 'Discriminant 2' axes represent the two components we selected for our LDA model; the transformed data is projected onto these axes.
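The same projection can be applied to new observations. For instance, a quick sketch that places our (already scaled) test samples in this space via `transform`:

```python
# Project the already-scaled test samples onto the same discriminant axes
X_test_lda = lda.transform(X_test)
print(X_test_lda.shape)  # (30, 2): 30 test samples, 2 discriminants
```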
Additionally, LDA can be used as a classifier when the classes are well separated and its assumptions are met. It's particularly useful for multi-class classification problems.
The `LinearDiscriminantAnalysis` class in Scikit-learn provides a `predict` method to classify new samples. Let's see how to use it:
```python
# Use the predict method to predict the classes of the test samples
y_pred = lda.predict(X_test)

# Calculate the accuracy of the model
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy:.2f}')  # Prints: Accuracy: 0.90
```
Here, we use the `predict` method of our trained LDA model to predict the classes of the test data, and measure its accuracy by comparing the predictions with the actual classes.
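For a fuller picture than a single accuracy number, Scikit-learn's metrics module can break the results down per class; a quick sketch:

```python
from sklearn.metrics import confusion_matrix

# score() is a built-in shortcut for accuracy
print(lda.score(X_test, y_test))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred, labels=[0, 1, 2]))
```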
We applied LDA using Scikit-learn, performed feature extraction through the LDA projection, visualized the results, and calculated the accuracy of our model. In the next lesson, we'll dive into some hands-on exercises. Remember, practice is pivotal to mastering new concepts and reinforcing learning!