t-SNE

Faisal Z. Qureshi
faisal.qureshi@ontariotechu.ca
http://www.vclab.ca

Credits

These notes draw heavily upon the excellent t-SNE lecture put together by Ethan Fetaya, James Lucas and Emad Andrews.

Lesson Plan

  • Stochastic Neighbour Embedding (SNE)
  • t-Distributed Stochastic Neighbour Embedding (t-SNE)

t-SNE

  • t-SNE is a dimensionality reduction algorithm, which tries to preserve local structure
    • Neighbourhoods should be preserved in their low-dimensional realizations
  • t-SNE is widely used for data visualization
    • No easy way to embed new data points

t-SNE vs. Principal Component Analysis (PCA)

  • PCA is a widely-used dimensionality reduction algorithm, which attempts to find the global structure
  • PCA allows data to be projected into a low-dimensional subspace
    • It is straightforward to embed new data points into this subspace
  • Neighbourhoods are not preserved in this low-dimensional subspace
    • Data points that are far from each other can become nearest neighbours in the low-dimensional subspace

SNE

  • Encode neighbourhood information as a probability distribution
    • Intuition: if we perform a random walk between data points, then the probability of jumping to a neighbouring (closer) data point is higher
  • Find a low-dimensional representation such that the neighbourhood probability distribution is similar
  • Use KL-divergence to measure how well the desired, low-dimensional neighbourhood distribution matches the original, high-dimensional one

Setting up neighbourhood distributions

Place a Gaussian distribution at each data point $\mathbf{x}_i \in \mathbb{R}^D$; then the probability of picking point $\mathbf{x}_j$ when starting at $\mathbf{x}_i$ is given by the Gaussian sitting at $\mathbf{x}_i$. Points that are closer to $\mathbf{x}_i$ are more likely to be picked than those that are farther away.

The probability $P_{j|i}$ that $\mathbf{x}_j$ is chosen when starting at $\mathbf{x}_i$ (i.e., the jump $i \rightarrow j$) is

$$ P_{j|i}= \begin{cases} \frac{ \exp \left( - \| \mathbf{x}_i - \mathbf{x}_j \|^2 / 2 \sigma_i^2 \right) }{ \sum_{k \ne i} \exp \left( - \| \mathbf{x}_i - \mathbf{x}_k \|^2 / 2 \sigma_i^2 \right) } & \text{when } i \ne j \\ 0 & \text{otherwise} \end{cases} $$

where $\sigma_i$ is the standard deviation of the Gaussian sitting at point $\mathbf{x}_i$.

Parameter $\sigma_i$ is best seen as a scale parameter:

  • when $\sigma_i$ is small, only the nearest neighbours receive appreciable probability (all other points appear far away)
  • when $\sigma_i$ is large, distances are washed out and all points appear roughly equidistant

This suggests that it is important to choose $\sigma_i$ carefully in order to preserve the neighbourhood structure that we care about. Furthermore, $\sigma_i$ is selected separately for each data point.

Lastly, the final distribution over a pair of points is symmetrized:

$$ P_{ij} = \frac{1}{2N} \left( P_{i|j} + P_{j|i} \right). $$

This corresponds to picking $i$ (or $j$) uniformly at random and then jumping to $j$ (or $i$) using $P_{j|i}$ (or $P_{i|j}$).
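
The following is a minimal NumPy sketch of these two definitions. The function names are illustrative, and a single arbitrary value of $\sigma_i$ is used for every point; choosing $\sigma_i$ properly is the topic of the next sections.

import numpy as np

def conditional_probs(X, sigmas):
    """P[j|i]: probability of jumping from x_i to x_j under the Gaussian centred at x_i."""
    # squared pairwise distances ||x_i - x_j||^2
    sq_d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_d / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)             # P[i|i] = 0
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)       # each row sums to one

def symmetrize(P_cond):
    """P_ij = (P_{i|j} + P_{j|i}) / (2N); sums to one over all pairs."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)

# toy usage: 100 points in 5 dimensions, sigma_i = 1 for every point
X = np.random.randn(100, 5)
P_cond = conditional_probs(X, sigmas=np.full(100, 1.0))
P = symmetrize(P_cond)
assert np.isclose(P.sum(), 1.0)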

Perplexity

Definition: Perplexity is a measure of how well a probability distribution or probability model predicts a sample.

In our case, for each conditional distribution $P_{j|i}$ we define the perplexity as follows:

$$ perp\left( P_{j|i}\right) = 2^{H\left(P_{j|i} \right)}, $$

where

$$ H(P) = - \sum_i P_i \log_2 P_i $$

is the (base-$2$) entropy.

Small $\sigma$ leads to low perplexity and large $\sigma$ leads to large perplexity. Also, if $P$ is uniform over $k$ elements then perplexity is $k$.
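
A small sketch of this definition (the helper name is illustrative); the uniform case below recovers a perplexity of $k$.

import numpy as np

def perplexity(p):
    """2**H(P) with the entropy measured in bits; zero entries contribute nothing."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

# a uniform distribution over k elements has perplexity k
k = 8
print(perplexity(np.full(k, 1.0 / k)))   # prints (approximately) 8.0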

Going back to SNE

Choose a desired perplexity and then solve for the $\sigma_i$ that achieves it. The bisection method works well for this (a sketch follows the outline below).

Aside: bisection method for root finding

  • Consider a function $f(x)$ that is continuous on the interval $[a, b]$ with $f(a) \cdot f(b) < 0$ (i.e., $f$ changes sign on the interval)
  • Repeat
    • $c = \frac{a+b}{2}$
    • If $f(a) \cdot f(c) < 0$ then set $b = c$; else set $a = c$
  • Until the convergence conditions are satisfied

Possible convergence conditions are:

  • A total number of iterations fixed a priori
  • $|c_i - c_{i-1}|$ falls below a tolerance specified a priori, where $c_i$ and $c_{i-1}$ are the midpoints at two consecutive iterations
  • $|f(c_i)|$ falls below a tolerance specified a priori, where $c_i$ is the midpoint at iteration $i$
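
Below is a sketch of this calibration for a single point, assuming its squared distances to all other points are given. Perplexity grows monotonically with $\sigma_i$, so $f(\sigma) = \mathrm{perp}(\sigma) - \text{target}$ changes sign on a wide enough interval and bisection applies directly. The interval bounds, tolerance, and function names are illustrative; practical implementations often bisect over the precision $\beta_i = 1/2\sigma_i^2$ instead.

import numpy as np

def perplexity_for_sigma(sq_d_row, sigma):
    """Perplexity of P[.|i] induced by sigma_i; sq_d_row holds ||x_i - x_k||^2 for k != i."""
    logits = -sq_d_row / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))

def sigma_for_perplexity(sq_d_row, target, lo=1e-3, hi=1e3, tol=1e-4, max_iter=100):
    """Bisection on sigma_i so that the induced perplexity matches the target."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if perplexity_for_sigma(sq_d_row, mid) < target:
            lo = mid     # perplexity too small -> increase sigma
        else:
            hi = mid     # perplexity too large -> decrease sigma
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# toy usage: distances from 200 random points to a separate query point, target perplexity 30
X, x_i = np.random.randn(200, 10), np.random.randn(10)
sigma_i = sigma_for_perplexity(np.sum((X - x_i) ** 2, axis=1), target=30.0)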

Back to SNE

Given $\mathbf{x}_1, \cdots, \mathbf{x}_N \in \mathbb{R}^D$, define the distribution $P_{ij}$ and find an embedding $\mathbf{y}_1, \cdots, \mathbf{y}_N \in \mathbb{R}^d$. Here $d$ is typically set to $2$ or $3$, since SNE is mostly used for visualization.

The quality of the embedding is measured by the KL-divergence between $P_{ij}$ and $Q_{ij}$, where the latter is the neighbourhood distribution defined over the embedded data points as follows

$$ Q_{ij} = \frac{ \exp \left( - \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right) }{ \sum_{k} \sum_{l \ne k} \exp \left( - \| \mathbf{y}_k - \mathbf{y}_l \|^2 \right) }. $$

Note that we do not have scale parameters $\sigma$ in this definition. Additionally, since the normalization is over all pairs, $Q_{ij}$ is symmetric by construction and needs no separate symmetrization step.

We optimize directly over $\mathbf{y}_1, \cdots, \mathbf{y}_N$ to minimize the KL-divergence; therefore, there is no explicit embedding function, and no easy way to embed new points.

SNE algorithm

We are given $P$, we have a mechanism for computing $Q$ given $\mathbf{y}_1,\cdots,\mathbf{y}_N \in \mathbb{R}^d$, and we wish to estimate the $\mathbf{y}$'s that minimize $D_{KL}(P || Q)$.

We know that

$$ \begin{align} D_{KL}(P || Q) &= \sum_{ij} P_{ij} \log \left( \frac{P_{ij}}{Q_{ij}} \right) \\ &= - \sum_{ij} P_{ij} \log Q_{ij} + \text{const} \end{align} $$

Taking its derivative w.r.t. the $\mathbf{y}$'s provides a mechanism to minimize it. Up to a constant factor, the gradient is

$$ \frac{\partial D_{KL}(P || Q)}{\partial \mathbf{y}_i} = \sum_j (P_{ij} - Q_{ij})(\mathbf{y}_i - \mathbf{y}_j) $$

Note that this is not a convex optimization problem and there are no guarantees of a good solution! One way to deal with such problems is through multiple restarts.
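
The following is a minimal sketch of this optimization using gradient descent with momentum, the gradient above, and a $P$ matrix such as the one produced by the earlier sketch. The learning rate, momentum, and iteration count are illustrative, and practical tricks such as adaptive gains and momentum schedules are omitted.

import numpy as np

def sne_affinities(Y):
    """Q_ij for SNE: unit-scale Gaussian kernel, normalized over all pairs."""
    sq_d = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_d)
    np.fill_diagonal(K, 0.0)
    return K / K.sum()

def sne_grad(P, Q, Y):
    """grad_i = sum_j (P_ij - Q_ij)(y_i - y_j)."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sum((P - Q)[:, :, None] * diff, axis=1)

def sne_embed(P, d=2, iters=500, lr=50.0, momentum=0.8, seed=0):
    """Gradient descent with momentum on KL(P || Q) over the embedding Y."""
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], d))
    V = np.zeros_like(Y)                          # momentum buffer
    for _ in range(iters):
        Q = sne_affinities(Y)
        V = momentum * V - lr * sne_grad(P, Q, Y)
        Y = Y + V
    return Y

# usage with the symmetrized P from the earlier sketch: Y = sne_embed(P)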

Crowding problem

SNE suffers from the crowding problem. Simply stated, in higher dimensions there is a lot of room: points can have many more neighbours. When we seek a low-dimensional representation, there is simply not enough room to accommodate all the neighbours.

t-SNE aims to solve this problem by replacing the Gaussian (in $Q$) with a heavy-tailed distribution.

t-SNE

Redefine $Q$ as follows:

$$ Q_{ij} = \frac{ \left( 1 + \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right)^{-1} }{ \sum_k \sum_{l \ne k} \left( 1 + \| \mathbf{y}_k - \mathbf{y}_l \|^2 \right)^{-1} } $$

This is achieved by replacing the Gaussian in $Q$ with the Student-t probability density

$$ p(x) \propto \left( 1 + \frac{x^2}{v} \right)^{-(v+1)/2} $$

and setting $v = 1$ we get

$$ p(x) \propto \frac{1}{1 + x^2} $$

Note that this is a heavy-tailed distribution. It goes to zero much slower than a Gaussian.
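
A quick, unnormalized comparison of the two kernels makes the heavier tail visible (the plotting range is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6, 200)
plt.figure(figsize=(6, 4))
plt.plot(x, np.exp(-x ** 2), label=r'Gaussian kernel $e^{-x^2}$')
plt.plot(x, 1.0 / (1.0 + x ** 2), label=r'Student-t ($v=1$) kernel $1/(1+x^2)$')
plt.xlabel('distance')
plt.ylabel('unnormalized density')
plt.legend();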

Algorithm

The gradient of the t-SNE objective function is (again up to a constant factor)

$$ \frac{\partial D_{KL} (P || Q)}{\partial \mathbf{y}_i} = \sum_j \left(P_{ij} - Q_{ij}\right)\left(\mathbf{y}_i - \mathbf{y}_j\right) \left( 1 + \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right)^{-1} $$

This gradient attracts similar points and pushes dissimilar points apart (much like the gradient for SNE). However, because of the heavy-tailed kernel, the t-SNE gradient exerts a weaker attraction on distant points than the SNE gradient, which alleviates the crowding problem.
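
A sketch of the corresponding t-SNE update, mirroring the SNE sketch above but with the Student-t kernel. Hyperparameters are again illustrative, and real implementations (e.g. scikit-learn's TSNE) add further tricks such as early exaggeration and the Barnes-Hut approximation.

import numpy as np

def tsne_affinities(Y):
    """Q_ij with the Student-t (v = 1) kernel, normalized over all pairs.
    Also returns the unnormalized kernel, which reappears in the gradient."""
    sq_d = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = 1.0 / (1.0 + sq_d)
    np.fill_diagonal(K, 0.0)
    return K / K.sum(), K

def tsne_grad(P, Q, K, Y):
    """grad_i = sum_j (P_ij - Q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sum(((P - Q) * K)[:, :, None] * diff, axis=1)

def tsne_embed(P, d=2, iters=1000, lr=100.0, momentum=0.8, seed=0):
    """Gradient descent with momentum on KL(P || Q), now with a heavy-tailed Q."""
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], d))
    V = np.zeros_like(Y)
    for _ in range(iters):
        Q, K = tsne_affinities(Y)
        V = momentum * V - lr * tsne_grad(P, Q, K, Y)
        Y = Y + V
    return Y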

Example: Visualizing MNIST

In [1]:
import torchvision
import numpy as np  
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

Getting MNIST dataset

In [2]:
dataset_mnist = torchvision.datasets.MNIST('../datasets/common', download=True)
In [3]:
images_, labels = list(map(list, zip(*dataset_mnist)))
In [4]:
images = [np.array(i).reshape(28*28) for i in images_]

Viewing one MNIST digit

In [5]:
i = int(np.random.rand()*60000)

plt.figure(figsize=(5,5))
plt.title(f'index = {i}, digit={labels[i]}')
plt.imshow(images[i].reshape(28,28), cmap='gray');

Selecting the first $N$ MNIST digits. t-SNE takes a lot of time on the entire MNIST dataset ($60000$ digits) on my laptop.

In [6]:
N = 1000
In [7]:
data = np.vstack(images[:N])

t-SNE

Embedding the $28 \times 28 = 784$-dimensional MNIST digits into 2D space using t-SNE.

In [8]:
tsne_model = TSNE(n_components=2, random_state=0, init='random', learning_rate='auto')
tsne_data = tsne_model.fit_transform(data)

tsne_data.shape
Out[8]:
(1000, 2)

PCA

Now embedding the MNIST digits into 2D space using PCA.

In [9]:
pca_model = PCA(n_components=2)
pca_data = pca_model.fit_transform(data)

pca_data.shape
Out[9]:
(1000, 2)

Plotting t-SNE and PCA

In [10]:
vis_df = pd.DataFrame(data = np.vstack((tsne_data.T, pca_data.T, labels[:N])).T, columns=('t-sne-1','t-sne-2','pca-1','pca-2','label')) 

vis_df
Out[10]:
t-sne-1 t-sne-2 pca-1 pca-2 label
0 -12.065825 17.051611 203.116588 337.557261 5.0
1 -27.100622 29.734253 1105.174182 425.702354 0.0
2 -8.354569 -32.103477 10.385071 -417.405150 4.0
3 23.272263 -1.002394 -1012.901908 653.166328 1.0
4 8.090386 -24.260347 -335.813702 -662.653032 9.0
... ... ... ... ... ...
995 19.643333 -36.449905 274.194718 -663.303647 7.0
996 -24.794991 3.228676 -41.405453 -57.496426 6.0
997 -19.727980 32.550842 767.801826 193.614300 0.0
998 -12.604946 11.974849 148.166065 621.110377 3.0
999 -34.469353 10.839848 -203.148163 166.331364 6.0

1000 rows × 5 columns

In [11]:
plt.figure(figsize=(16,7))
plt.suptitle(f'Embedding first {N} MNIST digits to 2D space')
plt.subplot(1,2,1)
plt.title('t-SNE')
sn.scatterplot(data=vis_df, x='t-sne-1', y='t-sne-2', hue='label', palette='husl')
plt.subplot(1,2,2)
plt.title('PCA')
sn.scatterplot(data=vis_df, x='pca-1', y='pca-2', hue='label', palette='husl');

A thought experiment

Say you are tasked with developing an MNIST classifier that uses only two-dimensional input data (i.e., the $784$-dimensional MNIST digits must be embedded into 2D space before being handed off to the classifier). Which embedding scheme would you use: t-SNE or PCA? And why?

Takeaway

  • t-SNE is a great way to visualize data
  • It allows us to peer inside black-box algorithms, such as deep neural networks
  • t-SNE uses a heavy-tailed $Q$ distribution in order to reduce the crowding problem exhibited by SNE
  • t-SNE involves solving a non-convex optimization problem
    • This is often solved using gradient descent with momentum
    • Stochastic gradient descent is not used to solve t-SNE
  • t-SNE does not provide an embedding function to project new (high-dimensional) points into 2D space
    • Therefore, it is often only used for data visualization