In this fascinating edition of our dimensionality reduction course, we'll dig into a side-by-side comparison of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Our exploration will involve identifying contexts where PCA and LDA shine, examining real-world scenarios where LDA proves beneficial, and getting our hands dirty with a Python script that performs LDA on the famous Iris dataset.
Possessing contrasting methodologies to reduce dimensions, PCA and LDA are critical tools for dealing with high-dimensional data. PCA, an unsupervised method, converts a set of features into linearly uncorrelated principal components based on maximum variance, while LDA focuses on maximizing the separability between data classes using supervised learning.
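To make that distinction concrete, here is a minimal sketch on synthetic, made-up data contrasting the two scikit-learn APIs: PCA is fit on the features alone, while LDA also requires the class labels.

```python
# A minimal sketch on synthetic data: PCA fits on features alone,
# while LDA also needs class labels (two hypothetical classes here).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # 100 samples, 4 features (made up)
y = rng.integers(0, 2, size=100)     # binary labels (made up)

X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # supervised
print(X_pca.shape, X_lda.shape)  # (100, 2) (100, 1)
```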
Deciding between PCA and LDA depends on the nature of the dataset and the problem at hand. PCA is well suited to larger datasets where class labels are unreliable or absent, while LDA excels on smaller, well-labeled datasets whose classes show low within-class but high between-class variability.
Because it preserves class separability while reducing dimensionality, LDA has found broad application in image recognition, customer segmentation in marketing, disease detection in healthcare, and protein analysis.
We'll now walk through a Python script that applies LDA and PCA to the Iris dataset, using the seaborn and scikit-learn libraries. We'll break down the script piece by piece for a smooth learning experience.
Our first code block loads the Iris dataset:
```python
import seaborn as sns

# Load iris dataset from seaborn
iris = sns.load_dataset("iris")
```
We use seaborn's load_dataset() function to load the Iris dataset, which is widely used for ML experimentation. Seaborn is a Python data visualization library that provides a high-level interface for creating informative and attractive statistical graphics.
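If you want to see what was loaded, a quick optional check shows the four numeric feature columns and the species label:

```python
# Optional: a quick look at the loaded DataFrame
print(iris.head())               # four numeric feature columns plus 'species'
print(iris['species'].unique())  # ['setosa' 'versicolor' 'virginica']
```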
This block standardizes the features:
```python
from sklearn.preprocessing import StandardScaler

# Standardizing the features
sc = StandardScaler()
X = sc.fit_transform(iris.iloc[:, 0:4])
```
The standardization of a dataset is a common requirement for many machine learning estimators: they might behave poorly if the individual features don't resemble standard normally distributed data. Here, StandardScaler standardizes features by removing the mean and scaling to unit variance.
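As an optional sanity check, you can confirm the scaler's effect: each standardized column should have (approximately) zero mean and unit variance.

```python
import numpy as np

# Optional sanity check: standardized features have ~zero mean and unit variance
print(np.round(X.mean(axis=0), 6))  # close to [0. 0. 0. 0.]
print(np.round(X.std(axis=0), 6))   # close to [1. 1. 1. 1.]
```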
We then split the dataset into training and testing sets:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, iris['species'], test_size=0.6, random_state=0)
```
Using train_test_split from the sklearn.model_selection module, we divide the dataset into a training set (40% of the samples) and a testing set (60%), as specified by test_size=0.6.
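A quick optional check of the resulting shapes makes the split explicit (Iris has 150 samples, so 60 go to training and 90 to testing):

```python
# Optional: confirm the split sizes (Iris has 150 samples)
print(X_train.shape, X_test.shape)  # (60, 4) (90, 4)
```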
Now, we apply LDA:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=1)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
```
Using LinearDiscriminantAnalysis (imported here as LDA) from sklearn.discriminant_analysis, we fit the transformation on the training data and apply it to both sets. Note that fit_transform receives the class labels y_train because LDA is supervised; with three Iris species, LDA can produce at most two discriminant components, and here we keep one.
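You can optionally inspect what the transformation produced; each sample is now described by a single discriminant value:

```python
# Optional: each sample is now a single discriminant value
print(X_train_lda.shape)              # (60, 1)
print(lda.explained_variance_ratio_)  # share of between-class variance per component
```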
We start by training a classifier on the original, four-feature data:
```python
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

classifier = LogisticRegression(random_state=0)

# Train the model using the original data
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  # 0.95
```
We use a logistic regression model here to classify the Iris species. Trained on the original features, it achieves an accuracy of 95%.
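If you want more detail than a single accuracy number, an optional confusion matrix shows which species the baseline model confuses:

```python
from sklearn.metrics import confusion_matrix

# Optional: per-class breakdown of the baseline model's predictions
print(confusion_matrix(y_test, y_pred))
```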
We now train the model using the LDA data:
```python
classifier = LogisticRegression(random_state=0)

# Train the model using the LDA data
classifier.fit(X_train_lda, y_train)
y_pred_lda = classifier.predict(X_test_lda)
print("Accuracy with LDA:", metrics.accuracy_score(y_test, y_pred_lda))  # 0.96
```
After switching to the LDA-transformed data, the accuracy improves slightly to 96%. By projecting onto the single most class-separating direction, LDA discards dimensions that contribute mostly noise, which can yield a small accuracy gain.
Now, let's add PCA and train the model using the PCA-transformed data.
```python
from sklearn.decomposition import PCA

# Project the data onto the first two principal components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_pca, y_train)
y_pred_pca = classifier.predict(X_test_pca)
print("Accuracy with PCA:", metrics.accuracy_score(y_test, y_pred_pca))  # 0.84
```
This time we transformed the data with PCA, and accuracy dropped to 84%. This illustrates that LDA can often outperform PCA on labeled data, because LDA takes class labels into account while PCA only preserves overall variance.
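One optional way to see what PCA did keep is its explained variance ratio: the two components retain most of the overall variance, yet accuracy still drops because variance alone does not guarantee class separability.

```python
# Optional: variance captured by each of the two principal components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance retained
```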
In this lesson, we compared PCA and LDA, delved into scenarios where one might choose one over the other, explored real-world LDA applications, and implemented PCA and LDA using Python and the Iris dataset. In the upcoming practical sessions, you will gain further hands-on experience applying PCA and LDA to various datasets, indelibly cementing your understanding of these techniques. Let's move ahead!