Uncovering Hidden Patterns with t-SNE: A Beginner’s Guide to Nonlinear Dimensionality Reduction
Nonlinear dimensionality reduction is a family of techniques for mapping high-dimensional data into a lower-dimensional space, often two or three dimensions for visualization, while preserving as much of the data's structure as possible. One popular method is t-distributed stochastic neighbor embedding (t-SNE).
t-SNE is a powerful tool for uncovering hidden patterns in complex data, and it has been widely used in a variety of applications, including image classification, natural language processing, and gene expression analysis. In this article, we will give a beginner’s guide to t-SNE and demonstrate how it can be used to visualize high-dimensional data.
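Before diving into the code, it helps to know roughly what t-SNE optimizes. For every pair of points $x_i, x_j$ in the high-dimensional space, t-SNE converts their distance into a similarity $p_{ij}$ using a Gaussian kernel whose bandwidth $\sigma_i$ is tuned per point so that the resulting distribution has a user-chosen perplexity (roughly, the effective number of neighbors). In the low-dimensional map, similarities $q_{ij}$ between the embedded points $y_i, y_j$ use a heavy-tailed Student-t kernel, $q_{ij} \propto (1 + \lVert y_i - y_j \rVert^2)^{-1}$, and the embedding is found by minimizing the Kullback–Leibler divergence between the two distributions with gradient descent:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$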
To start, let’s apply t-SNE to an example dataset and visualize it in two dimensions. We will use the MNIST dataset, which consists of 70,000 grayscale 28×28 images of handwritten digits; to keep the runtime manageable, we will work with only the first 3,000 images.
First, we will load the data and split it into training and test sets:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
# Load the MNIST data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
# Keep only the first 3,000 images
X = X[:3000]
y = y[:3000]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
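As a quick sanity check, the training set should now contain 2,400 flattened 28×28 images (784 features each) and the test set the remaining 600:
# Each image is a flattened 28x28 array, i.e. 784 features
print(X_train.shape)  # (2400, 784)
print(X_test.shape)   # (600, 784)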
Next, we will apply t-SNE to the training data:
from sklearn.manifold import TSNE
# Apply t-SNE to the training data
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train)
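Note that scikit-learn’s TSNE does not learn a reusable mapping from the original space to the embedding: there is no transform method for projecting new points such as X_test, so each call to fit_transform computes an embedding from scratch. We have also left the perplexity at its default value of 30; we will come back to this hyperparameter below.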
Now that we have applied t-SNE to the training data, we can visualize the resulting two-dimensional embedding. To do this, we will use matplotlib to draw a scatter plot of the points, colored by their digit labels:
import matplotlib.pyplot as plt
# Scatterplot the data points colored by their labels
plt.scatter(X_train_tsne[:, 0], X_train_tsne[:, 1], c=y_train.astype(int), cmap="jet")
plt.colorbar()
plt.show()
From the plot, we can see that t-SNE has reduced the dimensionality of the data from 784 to 2 while preserving much of its structure: points belonging to the same digit class form clearly visible clusters. Keep in mind, though, that t-SNE mainly preserves local neighborhoods, so the relative distances between clusters and their apparent sizes should not be over-interpreted.
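Visual inspection is useful, but we can also quantify how well the local neighborhood structure is preserved. One option is scikit-learn’s trustworthiness score, which measures on a 0–1 scale how often neighbors in the embedding are also neighbors in the original space; here is a minimal sketch using the variables defined above:
from sklearn.manifold import trustworthiness
# Fraction of nearest neighborhoods in the 2D map that are also close in the 784-D space
score = trustworthiness(X_train, X_train_tsne, n_neighbors=5)
print(f"Trustworthiness: {score:.3f}")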
While t-SNE is a powerful tool for visualizing high-dimensional data in lower dimensions, it is important to keep in mind that it is sensitive to its hyperparameters: the result can change noticeably depending on the random initialization and, in particular, the choice of perplexity, which roughly controls how many neighbors each point “pays attention to”. It is always a good idea to try a few different values for these parameters and compare the results.
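As a sketch of how you might explore this sensitivity, the snippet below re-runs t-SNE with a few perplexity values (5, 30, and 50 are purely illustrative choices, not recommendations) and plots the resulting embeddings side by side:
# Compare embeddings for a few perplexity values (illustrative choices only)
perplexities = [5, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(15, 5))
for ax, perplexity in zip(axes, perplexities):
    embedding = TSNE(n_components=2, perplexity=perplexity, random_state=42).fit_transform(X_train)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y_train.astype(int), cmap="jet", s=5)
    ax.set_title(f"perplexity = {perplexity}")
plt.show()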
In conclusion, t-SNE is a valuable tool for uncovering hidden patterns in high-dimensional data and visualizing them in lower dimensions. By reducing the dimensionality of the data, t-SNE can help us better understand the structure and relationships within the data and uncover insights that may not be readily apparent in the original data.