t-SNE

Faisal Z. Qureshi
faisal.qureshi@ontariotechu.ca
http://www.vclab.ca

Credits

These notes draw heavily upon the excellent t-SNE lecture put together by Ethan Fetaya, James Lucas and Emad Andrews.

Lesson Plan

  • Stochastic Neighbour Embedding (SNE)
  • t-Distributed Stochastic Neighbour Embedding (t-SNE)

t-SNE

  • t-SNE is a dimensionality reduction algorithm, which tries to preserve local structure
    • Neighbourhoods should be preserved in their low-dimensional realizations
  • t-SNE is widely used for data visualization
    • No easy way to embed new data points

t-SNE vs. Principal Component Analysis (PCA)

  • PCA is a widely-used dimensionality reduction algorithm, which attempts to find the global structure
  • PCA allows data to be projected into a low-dimensional subspace
    • It is straightforward to embed new data points into this subspace
  • Neighbourhoods are not preserved in this low-dimensional subspace
    • Data points that are far from each other can become nearest neighbours in the low-dimensional subspace

SNE

  • Encode neighbourhood information as a probability distribution
    • Intuition: if we perform a random walk between data points, then the probability of jumping to a neighbouring (closer) data point is higher
  • Find a low-dimensional representation such that the neighbourhood probability distribution is similar
  • Use KL-divergence to measure how well the desired, low-dimensional neighbourhood distribution matches the original, high-dimensional one

Setting up neighbourhood distributions

Place a Gaussian distribution at each data point $\mathbf{x}_i \in \mathbb{R}^D$; then the probability of picking point $\mathbf{x}_j$ when starting at $\mathbf{x}_i$ is given by the Gaussian sitting at $\mathbf{x}_i$. Points that are closer to $\mathbf{x}_i$ are more likely to be picked than those that are farther away.

The probability $P_{j|i}$ that $\mathbf{x}_j$ is chosen when starting at $\mathbf{x}_i$ (i.e., the jump $i \rightarrow j$) is

$$ P_{j|i}= \begin{cases} \frac{ \exp \left( - \| \mathbf{x}_i - \mathbf{x}_j \|^2 / 2 \sigma_i^2 \right) }{ \sum_{k \ne i} \exp \left( - \| \mathbf{x}_i - \mathbf{x}_k \|^2 / 2 \sigma_i^2 \right) } & \text{when } i \ne j \\ 0 & \text{otherwise} \end{cases} $$

where $\sigma_i$ is the standard deviation of the Gaussian sitting at point $\mathbf{x}_i$.

Parameter $\sigma_i$ is best seen as a scale parameter:

  • when $\sigma_i$ is small, only the nearest neighbours receive appreciable probability (all other points appear far away)
  • when $\sigma_i$ is large, distances are washed out and all points appear roughly equidistant

This suggests that it is important to choose $\sigma_i$ carefully in order to preserve the neighbourhood structure that we care about. Furthermore, $\sigma_i$ is selected separately for each data point.

Lastly, the final distribution over a pair of points is symmetrized:

$$ P_{ij} = \frac{1}{2N} \left( P_{i|j} + P_{j|i} \right). $$

This corresponds to picking $i$ (or $j$) uniformly at random and then jumping to $j$ (or $i$) using $P_{j|i}$ (or $P_{i|j}$).
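
The following is a minimal NumPy sketch of these two definitions. The function names are illustrative, and a single arbitrary value of $\sigma_i$ is used for every point; choosing $\sigma_i$ properly is the topic of the next sections.

import numpy as np

def conditional_probs(X, sigmas):
    """P[j|i]: probability of jumping from x_i to x_j under the Gaussian centred at x_i."""
    # squared pairwise distances ||x_i - x_j||^2
    sq_d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_d / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)             # P[i|i] = 0
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)       # each row sums to one

def symmetrize(P_cond):
    """P_ij = (P_{i|j} + P_{j|i}) / (2N); sums to one over all pairs."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)

# toy usage: 100 points in 5 dimensions, sigma_i = 1 for every point
X = np.random.randn(100, 5)
P_cond = conditional_probs(X, sigmas=np.full(100, 1.0))
P = symmetrize(P_cond)
assert np.isclose(P.sum(), 1.0)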

Perplexity

Definition: Perplexity is a measure of how well a probability distribution or probability model predicts a sample.

In our case, for each conditional distribution $P_{j|i}$ we define the perplexity as follows:

$$ perp\left( P_{j|i}\right) = 2^{H\left(P_{j|i} \right)}, $$

where

$$ H(P) = - \sum_i P_i \log_2 P_i $$

is the (base-$2$) entropy.

Small $\sigma$ leads to low perplexity and large $\sigma$ leads to large perplexity. Also, if $P$ is uniform over $k$ elements then perplexity is $k$.
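
A small sketch of this definition (the helper name is illustrative); the uniform case below recovers a perplexity of $k$.

import numpy as np

def perplexity(p):
    """2**H(P) with the entropy measured in bits; zero entries contribute nothing."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

# a uniform distribution over k elements has perplexity k
k = 8
print(perplexity(np.full(k, 1.0 / k)))   # prints (approximately) 8.0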

Going back to SNE

Choose a desired perplexity and then solve for the $\sigma_i$ that achieves it. The bisection method works well for this (a sketch follows the outline below).

Aside: bisection method for root finding

  • Consider a function $f(x)$ that is continuous on the interval $[a, b]$ with $f(a) \cdot f(b) < 0$ (i.e., $f$ changes sign on the interval)
  • Repeat
    • $c = \frac{a+b}{2}$
    • If $f(a) \cdot f(c) < 0$ then set $b = c$; else set $a = c$
  • Until the convergence conditions are satisfied

Possible convergence conditions are:

  • A total number of iterations fixed a priori
  • $|c_i - c_{i-1}|$ falls below a tolerance specified a priori, where $c_i$ and $c_{i-1}$ are the midpoints at two consecutive iterations
  • $|f(c_i)|$ falls below a tolerance specified a priori, where $c_i$ is the midpoint at iteration $i$
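
Below is a sketch of this calibration for a single point, assuming its squared distances to all other points are given. Perplexity grows monotonically with $\sigma_i$, so $f(\sigma) = \mathrm{perp}(\sigma) - \text{target}$ changes sign on a wide enough interval and bisection applies directly. The interval bounds, tolerance, and function names are illustrative; practical implementations often bisect over the precision $\beta_i = 1/2\sigma_i^2$ instead.

import numpy as np

def perplexity_for_sigma(sq_d_row, sigma):
    """Perplexity of P[.|i] induced by sigma_i; sq_d_row holds ||x_i - x_k||^2 for k != i."""
    logits = -sq_d_row / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))

def sigma_for_perplexity(sq_d_row, target, lo=1e-3, hi=1e3, tol=1e-4, max_iter=100):
    """Bisection on sigma_i so that the induced perplexity matches the target."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if perplexity_for_sigma(sq_d_row, mid) < target:
            lo = mid     # perplexity too small -> increase sigma
        else:
            hi = mid     # perplexity too large -> decrease sigma
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# toy usage: distances from 200 random points to a separate query point, target perplexity 30
X, x_i = np.random.randn(200, 10), np.random.randn(10)
sigma_i = sigma_for_perplexity(np.sum((X - x_i) ** 2, axis=1), target=30.0)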

Back to SNE

Given $\mathbf{x}_1, \cdots, \mathbf{x}_N \in \mathbb{R}^D$, define the distribution $P_{ij}$ and find an embedding $\mathbf{y}_1, \cdots, \mathbf{y}_N \in \mathbb{R}^d$. Here $d$ is typically set to $2$ or $3$, since SNE is mostly used for visualization.

The quality of the embedding is measured by the KL-divergence between $P_{ij}$ and $Q_{ij}$, where the latter is the neighbourhood distribution defined over the embedded data points as follows

$$ Q_{ij} = \frac{ \exp \left( - \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right) }{ \sum_{k} \sum_{l \ne k} \exp \left( - \| \mathbf{y}_k - \mathbf{y}_l \|^2 \right) }. $$

Note that we do not have scale parameters $\sigma$ in this definition. Additionally, since the normalization is over all pairs, $Q_{ij}$ is symmetric by construction and needs no separate symmetrization step.

We optimize directly over $\mathbf{y}_1, \cdots, \mathbf{y}_N$ to minimize the KL-divergence; therefore, there is no explicit embedding function, and no easy way to embed new points.

SNE algorithm

We are given $P$, we have a mechanism for computing $Q$ given $\mathbf{y}_1,\cdots,\mathbf{y}_N \in \mathbb{R}^d$, and we wish to estimate the $\mathbf{y}$'s that minimize $D_{KL}(P || Q)$.

We know that

$$ \begin{align} D_{KL}(P || Q) &= \sum_{ij} P_{ij} \log \left( \frac{P_{ij}}{Q_{ij}} \right) \\ &= - \sum_{ij} P_{ij} \log Q_{ij} + \text{const} \end{align} $$

Taking its derivative w.r.t. the $\mathbf{y}$'s provides a mechanism to minimize it. Up to a constant factor, the gradient is

$$ \frac{\partial D_{KL}(P || Q)}{\partial \mathbf{y}_i} = \sum_j (P_{ij} - Q_{ij})(\mathbf{y}_i - \mathbf{y}_j) $$

Note that this is not a convex optimization problem and there are no guarantees of a good solution! One way to deal with such problems is through multiple restarts.
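
The following is a minimal sketch of this optimization using gradient descent with momentum, the gradient above, and a $P$ matrix such as the one produced by the earlier sketch. The learning rate, momentum, and iteration count are illustrative, and practical tricks such as adaptive gains and momentum schedules are omitted.

import numpy as np

def sne_affinities(Y):
    """Q_ij for SNE: unit-scale Gaussian kernel, normalized over all pairs."""
    sq_d = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_d)
    np.fill_diagonal(K, 0.0)
    return K / K.sum()

def sne_grad(P, Q, Y):
    """grad_i = sum_j (P_ij - Q_ij)(y_i - y_j)."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sum((P - Q)[:, :, None] * diff, axis=1)

def sne_embed(P, d=2, iters=500, lr=50.0, momentum=0.8, seed=0):
    """Gradient descent with momentum on KL(P || Q) over the embedding Y."""
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], d))
    V = np.zeros_like(Y)                          # momentum buffer
    for _ in range(iters):
        Q = sne_affinities(Y)
        V = momentum * V - lr * sne_grad(P, Q, Y)
        Y = Y + V
    return Y

# usage with the symmetrized P from the earlier sketch: Y = sne_embed(P)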

Crowding problem

SNE suffers from the crowding problem. Simply stated, in higher dimensions there is a lot of room: points can have many more neighbours. When we seek a low-dimensional representation, there is simply not enough room to accommodate all the neighbours.

t-SNE aims to solve this problem by replacing the Gaussian (in $Q$) with a heavy-tailed distribution.

t-SNE

Redefine $Q$ as follows:

$$ Q_{ij} = \frac{ \left( 1 + \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right)^{-1} }{ \sum_k \sum_{l \ne k} \left( 1 + \| \mathbf{y}_k - \mathbf{y}_l \|^2 \right)^{-1} } $$

This is achieved by replacing the Gaussian in $Q$ with the Student-t probability density

$$ p(x) \propto \left( 1 + \frac{x^2}{v} \right)^{-(v+1)/2} $$

and setting $v = 1$ we get

$$ p(x) \propto \frac{1}{1 + x^2} $$

Note that this is a heavy-tailed distribution. It goes to zero much slower than a Gaussian.
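
A quick, unnormalized comparison of the two kernels makes the heavier tail visible (the plotting range is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6, 200)
plt.figure(figsize=(6, 4))
plt.plot(x, np.exp(-x ** 2), label=r'Gaussian kernel $e^{-x^2}$')
plt.plot(x, 1.0 / (1.0 + x ** 2), label=r'Student-t ($v=1$) kernel $1/(1+x^2)$')
plt.xlabel('distance')
plt.ylabel('unnormalized density')
plt.legend();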

Algorithm

The gradient of the t-SNE objective function is (again up to a constant factor)

$$ \frac{\partial D_{KL} (P || Q)}{\partial \mathbf{y}_i} = \sum_j \left(P_{ij} - Q_{ij}\right)\left(\mathbf{y}_i - \mathbf{y}_j\right) \left( 1 + \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right)^{-1} $$

This gradient attracts similar points and pushes dissimilar points apart (much like the gradient for SNE). However, because of the heavy-tailed kernel, the t-SNE gradient exerts a weaker attraction on distant points than the SNE gradient, which alleviates the crowding problem.
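
A sketch of the corresponding t-SNE update, mirroring the SNE sketch above but with the Student-t kernel. Hyperparameters are again illustrative, and real implementations (e.g. scikit-learn's TSNE) add further tricks such as early exaggeration and the Barnes-Hut approximation.

import numpy as np

def tsne_affinities(Y):
    """Q_ij with the Student-t (v = 1) kernel, normalized over all pairs.
    Also returns the unnormalized kernel, which reappears in the gradient."""
    sq_d = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = 1.0 / (1.0 + sq_d)
    np.fill_diagonal(K, 0.0)
    return K / K.sum(), K

def tsne_grad(P, Q, K, Y):
    """grad_i = sum_j (P_ij - Q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sum(((P - Q) * K)[:, :, None] * diff, axis=1)

def tsne_embed(P, d=2, iters=1000, lr=100.0, momentum=0.8, seed=0):
    """Gradient descent with momentum on KL(P || Q), now with a heavy-tailed Q."""
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], d))
    V = np.zeros_like(Y)
    for _ in range(iters):
        Q, K = tsne_affinities(Y)
        V = momentum * V - lr * tsne_grad(P, Q, K, Y)
        Y = Y + V
    return Y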

Example: Visualizing MNIST

In [1]:
import torchvision
import numpy as np  
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

Getting MNIST dataset

In [2]:
dataset_mnist = torchvision.datasets.MNIST('../datasets/common', download=True)
In [3]:
images_, labels = list(map(list, zip(*dataset_mnist)))
In [4]:
images = [np.array(i).reshape(28*28) for i in images_]

Viewing one MNIST digit

In [5]:
i = int(np.random.rand()*60000)

plt.figure(figsize=(5,5))
plt.title(f'index = {i}, digit={labels[i]}')
plt.imshow(images[i].reshape(28,28), cmap='gray');

Selecting the first $N$ MNIST digits. t-SNE takes a lot of time on the entire MNIST dataset ($60000$ digits) on my laptop.

In [6]:
N = 1000
In [7]:
data = np.vstack(images[:N])

t-SNE

Embedding the $28 \times 28 = 784$-dimensional MNIST digits into 2D space using t-SNE.

In [8]:
tsne_model = TSNE(n_components=2, random_state=0, init='random', learning_rate='auto')
tsne_data = tsne_model.fit_transform(data)

tsne_data.shape
Out[8]:
(1000, 2)

PCA

Now embedding the MNIST digits into 2D space using PCA.

In [9]:
pca_model = PCA(n_components=2)
pca_data = pca_model.fit_transform(data)

pca_data.shape
Out[9]:
(1000, 2)

Plotting t-SNE and PCA

In [10]:
vis_df = pd.DataFrame(data = np.vstack((tsne_data.T, pca_data.T, labels[:N])).T, columns=('t-sne-1','t-sne-2','pca-1','pca-2','label')) 

vis_df
Out[10]:
t-sne-1 t-sne-2 pca-1 pca-2 label
0 -12.065825 17.051611 203.116588 337.557261 5.0
1 -27.100622 29.734253 1105.174182 425.702354 0.0
2 -8.354569 -32.103477 10.385071 -417.405150 4.0
3 23.272263 -1.002394 -1012.901908 653.166328 1.0
4 8.090386 -24.260347 -335.813702 -662.653032 9.0
... ... ... ... ... ...
995 19.643333 -36.449905 274.194718 -663.303647 7.0
996 -24.794991 3.228676 -41.405453 -57.496426 6.0
997 -19.727980 32.550842 767.801826 193.614300 0.0
998 -12.604946 11.974849 148.166065 621.110377 3.0
999 -34.469353 10.839848 -203.148163 166.331364 6.0

1000 rows × 5 columns

In [11]:
plt.figure(figsize=(16,7))
plt.suptitle(f'Embedding first {N} MNIST digits to 2D space')
plt.subplot(1,2,1)
plt.title('t-SNE')
sn.scatterplot(data=vis_df, x='t-sne-1', y='t-sne-2', hue='label', palette='husl')
plt.subplot(1,2,2)
plt.title('PCA')
sn.scatterplot(data=vis_df, x='pca-1', y='pca-2', hue='label', palette='husl');

A thought experiment

Say you are tasked with developing an MNIST classifier that uses only two-dimensional input data (i.e., the $784$-dimensional MNIST digits must be embedded into 2D space before being handed off to the classifier). Which embedding scheme would you use: t-SNE or PCA? And why?

Takeaway

  • t-SNE is a great way to visualize data
  • It allows us to peer inside black-box algorithms, such as deep neural networks
  • t-SNE uses a heavy-tailed $Q$ distribution in order to reduce the crowding problem exhibited by SNE
  • t-SNE involves solving a non-convex optimization problem
    • This is often solved using gradient descent with momentum
    • Stochastic gradient descent is not used to solve t-SNE
  • t-SNE does not provide an embedding function to project new (high-dimensional) points into 2D space
    • Therefore, it is often only used for data visualization