Welcome back to our course, Intro to Unsupervised Machine Learning. Up until now, we have explored the captivating realm of the Iris dataset, mastered K-means clustering, and observed cluster formations among the Iris species. Today, we shift our focus to dimensionality reduction. This lesson's objective is to provide a comprehensive understanding of the concept of dimensionality reduction, its real-world applications, and a particular emphasis on Principal Component Analysis (PCA). This technique is widely used to simplify high-dimensional data while preserving its most significant structures and relationships.
Think of dimensionality reduction as reducing a complex book to a one-page summary, where this summary retains the crucial information from the book. But why would we need a summary if we have the entire book? Here's where high-dimensional data comes into play. High-dimensional data refers to data with a large number of features or dimensions. Visualizing such data for human comprehension is exceedingly challenging. In this context, PCA acts as a summarizer, condensing high-dimensional data into a 2D or 3D format, perfect for visualization.
Dimensionality reduction can be viewed as a data transformation technique extensively used in machine learning. This process simplifies high-dimensional data by projecting it onto a lower-dimensional space, ensuring that the core information and structures from the original data remain intact. Essentially, it is akin to squashing a three-dimensional object into a two-dimensional space while retaining most of its original patterns and textures.
One commonly used dimensionality reduction method is Principal Component Analysis (PCA). Imagine a literary critic summarizing the main themes and motifs of a long novel into a cohesive review. Similarly, PCA transforms high-dimensional data into a lower-dimensional form, compressing the information while maintaining as much of the original data's nuances as possible. This transformation aids in the removal of noise and redundancy in data, thereby enhancing the performance of machine learning models.
Like a comprehensive encyclopedia, high-dimensional data can make interpretation and comprehension genuinely challenging. It is like trying to find a particular topic without an index: the more volumes (dimensions) there are, the harder the search becomes. The difficulty is further accentuated when the encyclopedia, or the dataset, is so large that even a computer struggles to process it efficiently. This problem is known as the "curse of dimensionality".
Fortunately, PCA rides to the rescue and helps break this curse. By using PCA, we can sidestep these high-dimensionality issues and simplify our data without sacrificing much valuable information.
Principal Component Analysis (PCA) makes high-dimensional data more comprehensible. It transforms the data so that the original features are replaced with new ones, called principal components. These components are linear combinations of the original features. They are constructed to be uncorrelated with each other (orthogonal) and are ordered by the amount of variance (information) they capture from the data. Suppose you're trying to pack as much information as possible into a brief news report: PCA is like selecting the most meaningful yet concise words so that your report carries the highest possible amount of information.
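To make the idea of orthogonal, variance-ordered linear combinations concrete, here is a small, self-contained sketch on made-up toy data (the array names here are purely illustrative; we will apply PCA to the Iris dataset properly below):

```python
# Toy sketch: principal components are orthogonal linear combinations,
# ordered by the variance they capture (toy data, made up for illustration)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += 2 * X[:, 0]          # make two features correlated

pca = PCA().fit(X)

# Each row of components_ holds the weights of one linear combination
print(pca.components_)

# Orthogonality: pairwise dot products form (approximately) the identity matrix
print(np.round(pca.components_ @ pca.components_.T, 6))

# Variance ratios always arrive sorted from largest to smallest
print(pca.explained_variance_ratio_)

# The first new feature is just the centered data projected onto component 1
first_score = (X - X.mean(axis=0)) @ pca.components_[0]
print(np.allclose(first_score, pca.transform(X)[:, 0]))   # True
```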
Let's now see PCA in action using Python and `sklearn`. Here's how you would import all the necessary libraries:
```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
```
Before applying PCA, standardizing the data is crucial. Variables measured at different scales do not contribute equally to the model's functioning and might create a bias. For instance, a variable that ranges between 0 and 1000 will outweigh a variable that ranges between 0 and 1. Without standardizing these variables, the larger variables may unintentionally hold more weight. Centering features around 0 with a standard deviation of 1 is not only crucial when variables are measured in different units (e.g., kilograms, kilometers, centimeters), but it is also a broad requirement of many machine learning algorithms. Implementing PCA typically entails zero-centering the data, calculating its covariance matrix, and utilizing eigen-decomposition on this covariance matrix.
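Those steps can be sketched directly in NumPy. This is a simplified illustration of the recipe just described, not scikit-learn's internal implementation (which reaches the same result via singular value decomposition):

```python
# A rough NumPy sketch of the recipe described above: zero-center (and scale),
# build the covariance matrix, then eigen-decompose it.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# 1. Zero-center and scale to unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (4 x 4 for Iris)
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order,
#    so we re-sort them from largest to smallest
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of the total variance captured by each principal component
print(eigenvalues / eigenvalues.sum())
# Roughly [0.7296 0.2285 0.0367 0.0052], matching the sklearn results below
```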
Since PCA is sensitive to the variances of the original variables, we need to standardize our features to have unit variance. This standardization is required for the optimal performance of many machine learning algorithms and ensures each feature has an equal chance to contribute to the PCA.
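The short comparison below illustrates this sensitivity. It is only a sketch: it fits PCA on the raw Iris measurements and then on the standardized ones, and the feature with the largest raw spread dominates the first component in the unscaled case.

```python
# Sketch: PCA on raw vs. standardized Iris features
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data

# Raw measurements: the feature with the largest spread dominates component 1
print(PCA().fit(X).explained_variance_ratio_)

# Standardized measurements: every feature starts with unit variance
print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)
```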
```python
# Load the iris dataset
iris = load_iris()

# Standardize the features
x = StandardScaler().fit_transform(iris.data)

# PCA transformation
# By setting n_components equal to None, all components are kept
pca = PCA(n_components=None)
principalComponents = pca.fit_transform(x)

# Explained variance
explained_variance = pca.explained_variance_ratio_
```
When running PCA, the `n_components` parameter determines how many components we aim to retain. In this example, we left it at `None`, so all components are preserved. Another interesting parameter, `svd_solver`, determines which solver is used to compute the principal components. Setting it to `svd_solver='auto'` lets scikit-learn choose the best option based on the dataset.
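For example, here is a brief sketch of the same pipeline that keeps only the first two components and leaves the solver choice to scikit-learn (the variable names are illustrative):

```python
# Sketch: keep only two components and let scikit-learn pick the solver
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

x = StandardScaler().fit_transform(load_iris().data)

pca_2d = PCA(n_components=2, svd_solver='auto')
components_2d = pca_2d.fit_transform(x)

print(components_2d.shape)               # (150, 2): one row per flower, two new features
print(pca_2d.explained_variance_ratio_)  # only the first two ratios are reported
```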
Explained variance indicates the ability of each principal component to capture variance. In other words, it tells us how much of the dataset's total variance each principal component accounts for.
In our case, the first principal component captures approximately 72.96% of the variance, while the second accounts for about 22.85%. Together, these first two components cover roughly 95.81% of the information. This means that, despite reducing our 4D data to 2D, we still retain about 95.81% of the dataset's original variance.
```python
print('Explained variance: ', explained_variance)
# Explained variance: [0.72962445 0.22850762 0.03668922 0.00517871]
```
The printed results of the explained variance show us that the first two principal components encompass the majority of the variance in the entire dataset.
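Continuing from the `explained_variance` array computed above, a quick sketch of the cumulative sum makes this explicit:

```python
import numpy as np

# Cumulative share of variance as we keep more and more components,
# continuing from the explained_variance array computed earlier
print(np.cumsum(explained_variance))
# Roughly [0.7296 0.9581 0.9948 1.0]
```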
To picture the power of PCA on high-dimensional data, imagine adjusting the focus knob of a pair of binoculars: as the view sharpens, the important parts of the panorama stand out. In the same way, PCA offers a clearer, simplified view of the most significant features in our data.
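As a small preview of the visualization work we will do in the next lesson, here is a hedged sketch that scatters the Iris flowers in the plane of the first two principal components (it assumes matplotlib is available):

```python
# Sketch: scatter the Iris flowers in the plane of the first two components
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
x = StandardScaler().fit_transform(iris.data)
points = PCA(n_components=2).fit_transform(x)

# One color per species, using the labels only for plotting
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    plt.scatter(points[mask, 0], points[mask, 1], label=name)

plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.show()
```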
PCA is widely adopted for reducing the dimensionality of image datasets, visualizing high-dimensional data, eliminating redundant features, and working with inherently high-dimensional datasets such as gene expression data or social network data.
Congratulations! Your data science learning journey just took another remarkable step. We delved into the world of dimensionality reduction and examined the benefits of Principal Component Analysis (PCA). More importantly, we understood how PCA can offer us a different perspective on high-dimensional data, significantly simplifying visualization and interpretation.
As we embark on the next phase of our course, we'll dig deeper into PCA, learning more about interpreting variances and visualizing reduced-dimension data.
By now, you've garnered compelling insights about PCA and how it can tame high-dimensional data. It's time to put on your coder's hat and dive into some hands-on exercises! Here, you'll try applying PCA to different datasets and explore various visualizations. This practice will provide greater insight into how PCA can influence your data and how to interpret the results. Mastery comes from doing, so roll up your sleeves and let's get started!