Faisal Z. Qureshi
faisal.qureshi@ontariotechu.ca
http://www.vclab.ca
These notes draw heavily upon the excellent t-SNE lecture put together by Ethan Fetaya, James Lucas and Emad Andrews.
Place a Guassian distribution at each data point $\mathbf{x}_i \in \mathbb{R}^D$ then the probability of picking point $\mathbf{x}_j$ when starting at $\mathbf{x}_i$ is given by the Guassian sitting at $\mathbf{x}_i$. Points that are closer to $\mathbf{x}_i$ are more likely to be picked than those that are farther away.
The probability $i \rightarrow j$ that $\mathbf{x}_j$ is chosen when starting at $\mathbf{x}_i$ is
$$ i \rightarrow j = P_{j|i}= \begin{cases} \frac{ \exp \left( - \| \mathbf{x}_i - \mathbf{x}_j \|^2 / 2 \sigma_i^2 \right) }{ \sum_{k \ne i} \exp \left( - \| \mathbf{x}_i - \mathbf{x}_k \|^2 / 2 \sigma_i^2 \right) } & \text{when i $\ne$ j}\\ 0 & \mathrm{otherwise} \end{cases} $$where $\sigma_i$ is the standard deviation of the Gaussian sitting at point $\mathbf{x}_i$.
Parameter $\sigma_i$ is best seen as a scale paramter:
This suggests that it is important to choose $\sigma_i$ carefully in order to preserve the neighbourhood structure that we care about. Furthermore, $\sigma_i$ is selected separately for each data point.
Lastly, final distribution over a pair of points is symmetrized:
$$ P_{ij} = \frac{1}{2N} \left( P_{i|j} + P_{j|i} \right). $$Pick $i$ (or $j$) uniformly and then jump to $j$ (or $i$) using $P_{i|j}$ (or $P_{j|i}$).
Definition: Perplexity is the measurement of how well a probability distribution or probability model predicts a sample.
In our case, for each $P_{j|i}$ we define the perpexlity as follows
$$ perp\left( P_{j|i}\right) = 2^{H\left(P_{j|i} \right)}, $$where
$$ H(P) = - \sum_i P_i \log P_i $$is the entropy.
Small $\sigma$ leads to low perplexity and large $\sigma$ leads to large perplexity. Also, if $P$ is uniform over $k$ elements then perplexity is $k$.
Set the desired perplexity and set $\sigma_i$ to achieve that. You can use bisection method to do so.
Convergence conditions are:
Given $\mathbf{x}_1, \cdots, \mathbf{x}_N \in \mathbb{R}^D$ define distribution $P_{ij}$, find an embedding $\mathbf{y}_1, \cdots, \mathbf{y}_N \in \mathbb{R}^d$. $d$ is typically set to $2$ or $3$, since SNE is mostly used for visualization.
The quality of the embedding is measured by minimizing the KL-divergence between $P_{ij}$ and $Q_{ij}$, where the latter is the neighbourhood distribution defined our the embedded data points as follows
$$ Q_{ij} = \frac{ \exp \left( - \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right) }{ \sum_k \sum_{k \ne l} \exp \left( - \| \mathbf{y}_l - \mathbf{y}_k \|^2 \right) }. $$Note that we do not have scale parameters $\sigma$ in this definition. Additionally, it is not symmetric.
We optimize on $\mathbf{y}_1, \cdots, \mathbf{y}_N$ to minimize the KL-divergence; therefore, there isn't any embedding function. There is no easy way to embed new points.
We are given $P$, we also have a mechanism for estimate $Q$ given $\mathbf{y}_i,\cdots,\mathbf{y}_N \in \mathbb{R}^d$, and we desire to estimate $\mathbf{y}$'s to minimize $D_{KL}(P || Q)$.
We know that
$$ \begin{align} KL(P || Q) &= \sum_{ij} P_{ij} \log \left( \frac{P_{ij}}{Q_{ij}} \right) \\ &= - \sum_{ij} P_{ij} \log Q_{ij} + \text{const} \end{align} $$Taking its derivative w.r.t. $\mathbf{y}$'s provides a mechanism to minimize it.
$$ \frac{\partial D_{KL}(P || Q)}{\partial \mathbf{y}_i} = \sum_j (P_{ij} - Q_{ij})(\mathbf{y}_i - \mathbf{y}_j) $$Note that this is not a convex optimization problem and there are no guarantees of a good solution! One way to solve such problems is to through multiple restarts.
SNE suffers from crowding problem. Simply stated, in higher dimensions there is a lot of room. Points can have a lot more neighbours. As we seek low-dimensional representation, there is simply not enough room to accomodate all neighbours.
t-SNE aims to solve this problem by replacing Gaussian (in $Q$) to a heavy-tailed distribution.
Redine $Q$ as follows
$$ Q_{ij} = \frac{ \left( 1 + \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right)^{-1} }{ \sum_k \sum_{l \ne k} \left( 1 + \| \mathbf{y}_k - \mathbf{y}_l \|^2 \right)^{-1} } $$This is achieved by replacing the Gaussian in $Q$ with Student-t Probability density
$$ p(x) \propto \left( 1 + \frac{x^2}{v} \right)^{-(v+1)/2} $$and setting $v = 1$ we get
$$ p(x) \propto \frac{1}{1 + x^2} $$Note that this is a heavy-tailed distribution. It goes to zero much slower than a Gaussian.
The gradients for t-SNE objective function are
$$ \frac{\partial D_{KL} (P || Q)}{\partial \mathbf{y}^i} = \sum_j \left(P_{ij} - Q_{ij}\right)\left(\mathbf{y}_i - \mathbf{y}_j\right) \left( 1 + \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right)^{-1} $$This gradient pushes away dissimilar points and attracts similar points (similar to the gradient term for SNE). However, the gradient for t-SNE has a smaller attraction compared to the gradient term for SNE, which solves the crowding problem.
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
Getting MNIST dataset
dataset_mnist = torchvision.datasets.MNIST('../datasets/common', download=True)
images_, labels = list(map(list, zip(*dataset_mnist)))
images = [np.array(i).reshape(28*28) for i in images_]
Viewing one MNIST digit
i = int(np.random.rand()*60000)
plt.figure(figsize=(5,5))
plt.title(f'index = {i}, digit={labels[i]}')
plt.imshow(images[i].reshape(28,28), cmap='gray');
Selecting first $N$ MNIST digits. t-SNE takes a lot of time for the entire MNIST dataset ($60000$) on my laptop.
N = 1000
data = np.vstack(images[:N])
Embedding $28 \times 28 = 784$ MNIST digits to 2D space using t-SNE.
tsne_model = TSNE(n_components=2, random_state=0, init='random', learning_rate='auto')
tsne_data = tsne_model.fit_transform(data)
tsne_data.shape
Now embedding MNIST digits to 2D space ussing PCA.
pca_model = PCA(n_components=2)
pca_data = pca_model.fit_transform(data)
pca_data.shape
vis_df = pd.DataFrame(data = np.vstack((tsne_data.T, pca_data.T, labels[:N])).T, columns=('t-sne-1','t-sne-2','pca-1','pca-2','label'))
vis_df
plt.figure(figsize=(16,7))
plt.suptitle(f'Embedding first {N} MNIST digits to 2D space')
plt.subplot(1,2,1)
plt.title('t-SNE')
sn.scatterplot(data=vis_df, x='t-sne-1', y='t-sne-2', hue='label', palette='husl')
plt.subplot(1,2,2)
plt.title('PCA')
sn.scatterplot(data=vis_df, x='pca-1', y='pca-2', hue='label', palette='husl');
Say you are tasked with developing an MNIST classifier that uses only two dimensional input data (i.e., $784$-dimensional MNIST digits must be embedded into 2D space prior to handing these off to the classifier). What embedding scheme would you use: t-SNE or PCA? And why?