Lesson 1
Exploring t-SNE for Dimensionality Reduction in Machine Learning
Introduction

Embark on a journey into non-linear dimensionality reduction, with a specific focus on t-Distributed Stochastic Neighbor Embedding (t-SNE). Our goal is to understand the theory behind t-SNE and apply it using Scikit-learn's TSNE. Along the way, we'll distinguish linear from non-linear dimensionality reduction, grasp the core concepts of t-SNE, implement it in code, and discuss its potential pitfalls.

Linear vs. Non-Linear Dimensionality Reduction

Dimensionality reduction is a practical exercise that condenses the number of random variables under consideration into a smaller set of principal variables. By understanding whether the relationships in our data are linear or non-linear, we can select the technique that best suits our needs.

Imagine a dataset that records a person's height in both inches and centimeters. The two measurements convey the same information, so one can be removed; this is an example of the redundancy that linear dimensionality reduction exploits. Non-linear techniques like t-SNE take a different approach from PCA, a popular linear technique: rather than preserving exact global distances, they capture complex relationships by preserving which points are neighbors of which, regardless of how high-dimensional the original space is.
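To make the linear case concrete, here is a minimal sketch (with made-up height values) showing that PCA collapses the two perfectly correlated height columns into a single component:

Python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical heights; the second column is just the first converted to centimeters
height_in = np.array([60.0, 65.0, 70.0, 72.0, 68.0])
heights = np.column_stack([height_in, height_in * 2.54])

pca = PCA(n_components=2)
pca.fit(heights)

# The first component explains essentially all of the variance,
# so a single dimension is enough to represent the data
print(pca.explained_variance_ratio_)  # approximately [1. 0.]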

Understanding t-SNE: High-dimensional Space Calculations

t-SNE aims to keep similar data points close and dissimilar ones far apart in a lower-dimensional space. It achieves this by minimizing a cost function over the locations of the points in the lower-dimensional space.

The Gaussian conditional probability is mathematically defined as:

p_{j|i} = \frac{e^{-\|x_{i}-x_{j}\|^{2} / 2\sigma_{i}^{2}}}{\sum_{k \neq i} e^{-\|x_{i}-x_{k}\|^{2} / 2\sigma_{i}^{2}}}

Here, p_{j|i} is the conditional probability that x_i would pick x_j as its neighbor, measured relative to the similarity of x_i to all other points, and σ_i is the variance of the Gaussian centered on x_i. The variance is determined by the perplexity parameter, which controls the number of neighbors effectively considered for each point.
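To make this formula concrete, here is a minimal NumPy sketch that computes the conditional probabilities for a tiny, made-up dataset. For simplicity it uses one fixed sigma for every point, whereas the actual algorithm tunes each σ_i individually (via a binary search) to match the chosen perplexity:

Python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    # Pairwise squared Euclidean distances
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized Gaussian affinities
    affinities = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)  # a point is not its own neighbor
    # Normalize each row so that the p_{j|i} sum to 1 over j
    return affinities / affinities.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P_cond = conditional_probabilities(X)
print(P_cond.round(3))  # points 0 and 1 assign high probability to each other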

From these conditional distributions, we calculate the joint probability distribution using the following equation:

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

Here, N is the number of data points. This joint probability distribution is used to measure the similarity between points in the high-dimensional space. Using the symmetrized joint distribution rather than the raw conditional distributions ensures that every point, including outliers, contributes meaningfully to the cost function.
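Continuing the sketch above (reusing the hypothetical conditional_probabilities helper and the toy X), the symmetrization is a one-liner:

Python
# Symmetrize the conditional distributions into joint probabilities
N = X.shape[0]
P = (P_cond + P_cond.T) / (2 * N)
print(P.sum())  # the joint probabilities sum to 1 over all pairs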

t-SNE constructs these probability distributions so that similar points receive high joint probabilities and dissimilar points receive low ones. The embedding is then found by minimizing the Kullback-Leibler divergence between the joint probabilities in the high-dimensional space and those in the low-dimensional space.

Understanding t-SNE: Low-dimensional Space Calculations

In the lower-dimensional map, t-SNE employs a Student's t-distribution. Because its tails are heavier than those of a Gaussian, dissimilar points can be placed far apart and modeled more effectively. The joint probabilities in the low-dimensional space are defined as:

q_{ij} = \frac{(1+\|y_{i}-y_{j}\|^{2})^{-1}}{\sum_{k \neq l}(1+\|y_{k}-y_{l}\|^{2})^{-1}}

Here, q_{ij} is the probability that y_i and y_j are neighbors in the low-dimensional map. t-SNE minimizes the divergence between the two distributions with respect to the locations of the points y in the map. Here are the steps involved in the t-SNE algorithm, followed by a short sketch of the low-dimensional similarity computation:

  1. Compute the pairwise similarities in the high-dimensional space as Gaussian probabilities, using the formula for p_{j|i}, and symmetrize them to obtain p_{ij}.
  2. Compute the pairwise similarities in the low-dimensional space with the t-distribution, using the formula for q_{ij}. Note that the t-distribution has a heavier tail than the Gaussian distribution and is more robust to outliers, which helps to avoid clumping of points in the center of the map.
  3. Minimize the divergence between the two distributions by adjusting the locations of the points in the low-dimensional space using gradient descent.
  4. Repeat the process until the distributions are similar.
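Mirroring the high-dimensional side, a minimal NumPy sketch of the low-dimensional similarities (using hypothetical embedding coordinates Y) could look like this:

Python
import numpy as np

def joint_probabilities_low_dim(Y):
    # Pairwise squared Euclidean distances in the embedding
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    # Student's t kernel with one degree of freedom (heavy tails)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)  # exclude self-pairs
    # Normalize over all pairs so that the q_{ij} sum to 1
    return inv / inv.sum()

# t-SNE typically starts from a small random initialization of the map points
Y = np.random.default_rng(42).normal(scale=1e-4, size=(3, 2))
Q = joint_probabilities_low_dim(Y)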

t-SNE uses Kullback-Leibler divergence to measure the difference between the two distributions. Kullback-Leibler divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. The cost function is defined as follows:

C = KL(P\|Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

Here, P is the joint probability distribution in the high-dimensional space, Q is the joint probability distribution in the low-dimensional space, and C is the cost function. The goal is to minimize the cost function by adjusting the locations of the points in the low-dimensional space.
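With the hypothetical P and Q matrices from the sketches above, the cost can be evaluated directly; in the full algorithm, its gradient with respect to the map coordinates drives the gradient-descent updates:

Python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    # Sum over all pairs; eps guards against log(0) and division by zero
    return np.sum(P * np.log((P + eps) / (Q + eps)))

cost = kl_divergence(P, Q)
print(f"KL(P||Q) = {cost:.4f}")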

Implementing t-SNE: Python Implementation

Now, let's see how to implement t-SNE in Scikit-learn, a popular machine learning library in Python. Once our dataset is loaded, we'll build a t-SNE model using Scikit-learn's TSNE and then apply it to our data, showcasing the power and simplicity of TSNE.

Python Sample code for t-SNE and Analysis
Python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# Generate a non-linearly separable dataset
X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=42)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Plot the original data
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.title("Original Data")

# Plot the t-SNE data
plt.subplot(1, 2, 2)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.title("t-SNE")

plt.show()

In this code segment, we first import the necessary libraries, generate a non-linearly separable dataset with make_circles, create a t-SNE model, apply it to the data, and finally visualize the original data alongside the reduced data:

[Figure: the original concentric-circles data (left) and its t-SNE embedding (right), colored by class]

Pitfalls when Using t-SNE

Though modern and effective, t-SNE comes with its share of pitfalls. Firstly, the global structure of a t-SNE plot can be hard to interpret: the algorithm prioritizes preserving local neighborhoods, so distances between well-separated clusters in the embedding are not necessarily meaningful. Secondly, reproducibility presents a challenge: the random initialization can lead to different results across runs unless the random seed is fixed. Finally, t-SNE is sensitive to hyperparameters such as perplexity and learning_rate, whose tuning will be covered in later lessons.
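For example, fixing random_state and trying a few perplexity values (both are standard TSNE parameters; the specific values below are only illustrative, and X is the make_circles data from the earlier example) is a simple way to keep runs reproducible and to see how sensitive the embedding is:

Python
from sklearn.manifold import TSNE

for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    X_embedded = tsne.fit_transform(X)
    # Plot or compare the embeddings to see how perplexity changes the picture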

Lesson Summary and Practice

Great job! We've distinguished between linear and non-linear dimensionality reduction, explored the theory behind t-SNE, implemented it with Scikit-learn's TSNE, and discussed potential pitfalls that might arise. In future lessons, we will focus on visualizing t-SNE results, tuning its parameters, and applying it to real-world examples. Let's continue to deepen your understanding in the next stage of this educational journey!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.