In this fascinating edition of our dimensionality reduction course, we'll dig into a side-by-side comparison of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Our exploration will involve identifying contexts where PCA and LDA shine, examining real-world scenarios where LDA proves beneficial, and getting our hands dirty with a Python script that performs LDA on the famous Iris dataset.
Possessing contrasting methodologies to reduce dimensions, PCA and LDA are critical tools for dealing with high-dimensional data. PCA, an unsupervised method, converts a set of features into linearly uncorrelated principal components based on maximum variance, while LDA focuses on maximizing the separability between data classes using supervised learning.
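To make that distinction concrete, here is a minimal sketch on synthetic, made-up data contrasting the two scikit-learn APIs: PCA is fit on the features alone, while LDA also requires the class labels.

```python
# A minimal sketch on synthetic data: PCA fits on features alone,
# while LDA also needs class labels (two hypothetical classes here).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # 100 samples, 4 features (made up)
y = rng.integers(0, 2, size=100)     # binary labels (made up)

X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # supervised
print(X_pca.shape, X_lda.shape)  # (100, 2) (100, 1)
```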
Deciding between PCA and LDA depends on the nature of the dataset and the problem at hand. PCA is well suited to larger datasets where class labels are unreliable or absent, while LDA excels on smaller, well-labeled datasets whose classes show low within-class but high between-class variability.
Because it preserves class separability while reducing dimensionality, LDA has found broad application in image recognition, customer segmentation in marketing, disease detection in healthcare, and protein analysis.
We'll now walk through a Python script that applies LDA and PCA to the Iris dataset, using the seaborn and scikit-learn libraries. We'll break down the script piece by piece for a smooth learning experience.
Our first code block loads the Iris dataset:
```python
import seaborn as sns

# Load iris dataset from seaborn
iris = sns.load_dataset("iris")
```
We use seaborn's load_dataset() function to load the Iris dataset, which is widely used for ML experimentation. Seaborn is a Python data visualization library that provides a high-level interface for creating informative and attractive statistical graphics.
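If you want to see what was loaded, a quick optional check shows the four numeric feature columns and the species label:

```python
# Optional: a quick look at the loaded DataFrame
print(iris.head())               # four numeric feature columns plus 'species'
print(iris['species'].unique())  # ['setosa' 'versicolor' 'virginica']
```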
This block standardizes the features:
```python
from sklearn.preprocessing import StandardScaler

# Standardizing the features
sc = StandardScaler()
X = sc.fit_transform(iris.iloc[:, 0:4])
```
The standardization of a dataset is a common requirement for many machine learning estimators: they might behave poorly if the individual features don't resemble standard normally distributed data. Here, StandardScaler standardizes features by removing the mean and scaling to unit variance.
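As an optional sanity check, you can confirm the scaler's effect: each standardized column should have (approximately) zero mean and unit variance.

```python
import numpy as np

# Optional sanity check: standardized features have ~zero mean and unit variance
print(np.round(X.mean(axis=0), 6))  # close to [0. 0. 0. 0.]
print(np.round(X.std(axis=0), 6))   # close to [1. 1. 1. 1.]
```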
We then split the dataset into training and testing sets:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, iris['species'], test_size=0.6, random_state=0)
```
Using train_test_split from the sklearn.model_selection module, we divide the dataset into a training set (40% of the samples) and a testing set (60%), as specified by test_size=0.6.
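A quick optional check of the resulting shapes makes the split explicit (Iris has 150 samples, so 60 go to training and 90 to testing):

```python
# Optional: confirm the split sizes (Iris has 150 samples)
print(X_train.shape, X_test.shape)  # (60, 4) (90, 4)
```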
Now, we apply LDA:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=1)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
```
Using LinearDiscriminantAnalysis (imported here as LDA) from sklearn.discriminant_analysis, we fit the transformation on the training data and apply it to both sets. Note that fit_transform receives the class labels y_train because LDA is supervised; with three Iris species, LDA can produce at most two discriminant components, and here we keep one.
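You can optionally inspect what the transformation produced; each sample is now described by a single discriminant value:

```python
# Optional: each sample is now a single discriminant value
print(X_train_lda.shape)              # (60, 1)
print(lda.explained_variance_ratio_)  # share of between-class variance per component
```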
We start by training a classifier on the original, four-feature data:
```python
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

classifier = LogisticRegression(random_state=0)

# Train the model using the original data
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  # 0.95
```
We use a logistic regression model here to classify the Iris species. Trained on the original features, it achieves an accuracy of 95%.
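If you want more detail than a single accuracy number, an optional confusion matrix shows which species the baseline model confuses:

```python
from sklearn.metrics import confusion_matrix

# Optional: per-class breakdown of the baseline model's predictions
print(confusion_matrix(y_test, y_pred))
```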
We now train the model using the LDA data:
```python
classifier = LogisticRegression(random_state=0)

# Train the model using the LDA data
classifier.fit(X_train_lda, y_train)
y_pred_lda = classifier.predict(X_test_lda)
print("Accuracy with LDA:", metrics.accuracy_score(y_test, y_pred_lda))  # 0.96
```
After switching to the LDA-transformed data, the accuracy improves slightly to 96%. By projecting onto the single most class-separating direction, LDA discards dimensions that contribute mostly noise, which can yield a small accuracy gain.
Now, let's add PCA and train the model using the PCA-transformed data.
```python
from sklearn.decomposition import PCA

# Project the data onto the first two principal components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_pca, y_train)
y_pred_pca = classifier.predict(X_test_pca)
print("Accuracy with PCA:", metrics.accuracy_score(y_test, y_pred_pca))  # 0.84
```
This time we transformed the data with PCA, and accuracy dropped to 84%. This illustrates that LDA can often outperform PCA on labeled data, because LDA takes class labels into account while PCA only preserves overall variance.
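One optional way to see what PCA did keep is its explained variance ratio: the two components retain most of the overall variance, yet accuracy still drops because variance alone does not guarantee class separability.

```python
# Optional: variance captured by each of the two principal components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance retained
```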
In this lesson, we compared PCA and LDA, delved into scenarios where one might choose one over the other, explored real-world LDA applications, and implemented PCA and LDA using Python and the Iris dataset. In the upcoming practical sessions, you will gain further hands-on experience applying PCA and LDA to various datasets, indelibly cementing your understanding of these techniques. Let's move ahead!