
t-SNE¶

Faisal Z. Qureshi
faisal.qureshi@ontariotechu.ca
http://www.vclab.ca

Credits¶

These notes draw heavily upon the excellent t-SNE lecture put together by Ethan Fetaya, James Lucas and Emad Andrews.

Lesson Plan¶

  • Stochastic Neighbour Embedding (SNE)
  • t-Distributed Stochastic Neighbour Embedding (t-SNE)

t-SNE¶

  • t-SNE is a dimensionality reduction algorithm, which tries to preserve local structure
    • Neighbourhoods should be preserved in their low-dimensional realizations
  • t-SNE is widely used for data visualization
    • No easy way to embed new data points

t-SNE vs. Principal Component Analysis (PCA)¶

  • PCA is a widely used dimensionality reduction algorithm, which attempts to find the global structure
  • PCA allows data to be projected into a low-dimensional subspace
    • It is straightforward to embed new data points into this subspace
  • Neighbourhoods are not preserved in this low-dimensional subspace
    • Data points that are far from each other can become nearest neighbours in the low-dimensional subspace

SNE¶

  • Encode neighbourhood information as a probability distribution
    • Intuition: if we perform a random walk between data points, then the probability of jumping to a neighbouring (closer) data point is higher
  • Find a low-dimensional representation such that the neighbourhood probability distribution is similar
  • Use the KL-divergence to measure how close the two neighbourhood distributions are (the original, high-dimensional one and the desired, low-dimensional one)

Setting up neighbourhood distributions¶

Place a Gaussian distribution at each data point $\mathbf{x}_i \in \mathbb{R}^D$; then the probability of picking point $\mathbf{x}_j$ when starting at $\mathbf{x}_i$ is given by the Gaussian sitting at $\mathbf{x}_i$. Points that are closer to $\mathbf{x}_i$ are more likely to be picked than those that are farther away.

The probability $i \rightarrow j$ that $\mathbf{x}_j$ is chosen when starting at $\mathbf{x}_i$ is

$$ i \rightarrow j = P_{j|i}= \begin{cases} \frac{ \exp \left( - \| \mathbf{x}_i - \mathbf{x}_j \|^2 / 2 \sigma_i^2 \right) }{ \sum_{k \ne i} \exp \left( - \| \mathbf{x}_i - \mathbf{x}_k \|^2 / 2 \sigma_i^2 \right) } & \text{when } i \ne j\\ 0 & \text{otherwise} \end{cases} $$

where $\sigma_i$ is the standard deviation of the Gaussian sitting at point $\mathbf{x}_i$.

Parameter $\sigma_i$ is best seen as a scale parameter:

  • when $\sigma_i$ is small, only the very closest points receive appreciable probability; everything else appears far away
  • when $\sigma_i$ is large, the distribution becomes nearly uniform and all points appear to be at roughly the same distance

This suggests that it is important to choose $\sigma_i$ carefully in order to preserve the neighbourhood structure that we care about. Furthermore, $\sigma_i$ is selected separately for each data point.

Lastly, the final distribution over a pair of points is symmetrized:

$$ P_{ij} = \frac{1}{2N} \left( P_{i|j} + P_{j|i} \right), $$

where $N$ is the number of data points.

This corresponds to picking $i$ (or $j$) uniformly at random and then jumping to $j$ (or $i$) using $P_{j|i}$ (or $P_{i|j}$).
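As a concrete illustration, here is a minimal NumPy sketch of these definitions. It assumes a data matrix X of shape (N, D) and a length-N array sigma of per-point scales; the names and helper functions are illustrative, not part of any standard library.

import numpy as np

def conditional_p(X, sigma):
    # P[i, j] = P(j|i): probability of jumping from x_i to x_j.
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    P = np.exp(-D2 / (2.0 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                                     # P(i|i) = 0
    return P / P.sum(axis=1, keepdims=True)                      # normalize each row over k != i

def symmetric_p(P_cond):
    # Symmetrized joint distribution P_ij = (P(i|j) + P(j|i)) / (2N).
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)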

Perplexity¶

Definition: Perplexity is a measure of how well a probability distribution or probability model predicts a sample.

In our case, for each conditional distribution $P_{j|i}$ we define the perplexity as follows

$$ perp\left( P_{j|i}\right) = 2^{H\left(P_{j|i} \right)}, $$

where

$$ H(P) = - \sum_k P_k \log_2 P_k $$

is the entropy (measured in bits).

Small $\sigma$ leads to low perplexity and large $\sigma$ leads to large perplexity. Also, if $P$ is uniform over $k$ elements then perplexity is $k$.
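As a quick sanity check of this definition, the following small sketch computes the perplexity of a discrete distribution (entropy taken in base 2); for a uniform distribution over $k$ elements it returns approximately $k$. The function name is purely illustrative.

import numpy as np

def perplexity(p, eps=1e-12):
    # Perplexity 2^H(p) of a discrete distribution p, with the entropy in bits.
    p = np.asarray(p, dtype=float)
    H = -np.sum(p * np.log2(p + eps))   # eps guards against log2(0)
    return 2.0 ** H

print(perplexity(np.ones(8) / 8))       # uniform over 8 elements -> approximately 8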

Going back to SNE¶

Set the desired perplexity and choose each $\sigma_i$ to achieve it. You can use the bisection method to do so (a short code sketch follows the aside below).

Aside: bisection method for root finding¶

  • Consider a function $f(x)$ that is continuous on the interval $[a, b]$, with $f(a) \cdot f(b) < 0$
  • Repeat
    • $c = \frac{a+b}{2}$
    • If $f(a) \cdot f(c) < 0$ then set $b = c$; else set $a = c$
  • Until the convergence conditions are satisfied

Convergence conditions are:

  • A fixed total number of iterations, specified a priori, has been reached
  • $|c_i - c_{i-1}|$ is less than some tolerance specified a priori, where $c_i$ and $c_{i-1}$ are the midpoints at two consecutive iterations
  • $|f(c_i)|$ is less than some tolerance specified a priori, where $c_i$ is the midpoint at iteration $i$
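Below is a hedged sketch of how bisection might be used to pick $\sigma_i$ for a single point so that the perplexity of $P_{\cdot|i}$ matches a target value. It assumes d2_i holds the squared distances from $\mathbf{x}_i$ to every other point, reuses the illustrative perplexity helper from above, and assumes the target perplexity is attainable within the initial bracket.

import numpy as np

def sigma_for_perplexity(d2_i, target_perp, lo=1e-3, hi=1e3, max_iter=100, tol=1e-5):
    # Bisection on sigma_i so that the perplexity of P(.|i) approaches target_perp.
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        p = np.exp(-d2_i / (2.0 * mid ** 2))
        p /= p.sum()
        # Perplexity increases with sigma, so shrink the bracketing interval accordingly.
        if perplexity(p) > target_perp:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)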

Back to SNE¶

Given $\mathbf{x}_1, \cdots, \mathbf{x}_N \in \mathbb{R}^D$ and the distribution $P_{ij}$ defined above, find an embedding $\mathbf{y}_1, \cdots, \mathbf{y}_N \in \mathbb{R}^d$. Here $d$ is typically set to $2$ or $3$, since SNE is mostly used for visualization.

The quality of the embedding is measured by the KL-divergence between $P_{ij}$ and $Q_{ij}$, where the latter is the neighbourhood distribution defined over the embedded data points as follows

$$ Q_{ij} = \frac{ \exp \left( - \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right) }{ \sum_k \sum_{l \ne k} \exp \left( - \| \mathbf{y}_k - \mathbf{y}_l \|^2 \right) }. $$

Note that there is no scale parameter $\sigma$ in this definition. Also, unlike $P_{ij}$, $Q_{ij}$ does not require an explicit symmetrization step.

We optimize directly over $\mathbf{y}_1, \cdots, \mathbf{y}_N$ to minimize the KL-divergence; there is no embedding function, and hence no easy way to embed new points.
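For reference, here is a small NumPy sketch of this $Q$, assuming Y is the (N, d) matrix of embedded points (the function name is illustrative):

import numpy as np

def q_sne(Y):
    # Q_ij: Gaussian kernel with unit scale, normalized over all pairs.
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = np.exp(-D2)
    np.fill_diagonal(num, 0.0)   # exclude the i == j terms from the normalization
    return num / num.sum()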

SNE algorithm¶

We are given $P$, we have a mechanism for estimating $Q$ given $\mathbf{y}_1,\cdots,\mathbf{y}_N \in \mathbb{R}^d$, and we want to choose the $\mathbf{y}$'s to minimize $D_{KL}(P || Q)$.

We know that

$$ \begin{align} D_{KL}(P || Q) &= \sum_{ij} P_{ij} \log \left( \frac{P_{ij}}{Q_{ij}} \right) \\ &= - \sum_{ij} P_{ij} \log Q_{ij} + \text{const} \end{align} $$

Taking its derivative w.r.t. $\mathbf{y}$'s provides a mechanism to minimize it.

$$ \frac{\partial D_{KL}(P || Q)}{\partial \mathbf{y}_i} = \sum_j (P_{ij} - Q_{ij})(\mathbf{y}_i - \mathbf{y}_j) $$

Note that this is not a convex optimization problem and there are no guarantees of a good solution! One way to deal with such problems is through multiple restarts.
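Using the gradient above, one plain gradient-descent update could look like the following sketch. P and Y are as before, lr is a step size, and q_sne is the illustrative helper sketched earlier; practical implementations typically add momentum and other tricks.

import numpy as np

def sne_gradient(P, Q, Y):
    # dKL/dy_i = sum_j (P_ij - Q_ij)(y_i - y_j), stacked into an (N, d) array.
    diff = Y[:, None, :] - Y[None, :, :]               # diff[i, j] = y_i - y_j
    return np.sum((P - Q)[:, :, None] * diff, axis=1)

def sne_step(P, Y, lr=10.0):
    # One plain gradient-descent update of the embedding.
    return Y - lr * sne_gradient(P, q_sne(Y), Y)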

Crowding problem¶

SNE suffers from the crowding problem. Simply stated, in high dimensions there is a lot of room, so points can have many more neighbours. When we seek a low-dimensional representation, there is simply not enough room to accommodate all of those neighbours.

t-SNE aims to solve this problem by replacing the Gaussian (in $Q$) with a heavy-tailed distribution.

t-SNE¶

Redefine $Q$ as follows

$$ Q_{ij} = \frac{ \left( 1 + \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right)^{-1} }{ \sum_k \sum_{l \ne k} \left( 1 + \| \mathbf{y}_k - \mathbf{y}_l \|^2 \right)^{-1} } $$

This is achieved by replacing the Gaussian in $Q$ with the Student's t probability density

$$ p(x) \propto \left( 1 + \frac{x^2}{v} \right)^{-(v+1)/2} $$

Setting $v = 1$, we get

$$ p(x) \propto \frac{1}{1 + x^2} $$

Note that this is a heavy-tailed distribution. It goes to zero much more slowly than a Gaussian.
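The redefined $Q$ translates almost directly into code; this sketch mirrors the q_sne helper above but swaps in the Student-t kernel (again, names are illustrative):

import numpy as np

def q_tsne(Y):
    # Q_ij: Student-t kernel with one degree of freedom, normalized over all pairs.
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + D2)
    np.fill_diagonal(num, 0.0)   # exclude the i == j terms from the normalization
    return num / num.sum()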

Algorithm¶

The gradient of the t-SNE objective function is

$$ \frac{\partial D_{KL} (P || Q)}{\partial \mathbf{y}_i} = \sum_j \left(P_{ij} - Q_{ij}\right)\left(\mathbf{y}_i - \mathbf{y}_j\right) \left( 1 + \| \mathbf{y}_i - \mathbf{y}_j \|^2 \right)^{-1} $$

This gradient pushes dissimilar points apart and pulls similar points together, much like the SNE gradient. However, the attraction term in the t-SNE gradient is weaker than in SNE, which alleviates the crowding problem.
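Putting the pieces together, a bare-bones (and deliberately unoptimized) t-SNE loop might look like the sketch below. It uses gradient descent with momentum, which is how t-SNE is typically optimized, and reuses the illustrative q_tsne helper; practical implementations add tricks such as early exaggeration and approximate nearest-neighbour schemes.

import numpy as np

def tsne_gradient(P, Q, Y):
    # dKL/dy_i = sum_j (P_ij - Q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1.
    diff = Y[:, None, :] - Y[None, :, :]                 # diff[i, j] = y_i - y_j
    W = (P - Q) / (1.0 + np.sum(diff ** 2, axis=-1))
    return np.sum(W[:, :, None] * diff, axis=1)

def tsne_embed(P, d=2, n_iter=1000, lr=100.0, momentum=0.8, seed=0):
    # Bare-bones t-SNE optimizer: gradient descent with momentum on the map points.
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.standard_normal((P.shape[0], d))      # small random initialization
    V = np.zeros_like(Y)                                  # velocity for momentum
    for _ in range(n_iter):
        V = momentum * V - lr * tsne_gradient(P, q_tsne(Y), Y)
        Y = Y + V
    return Y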

Example: Visualizing MNIST¶

In [1]:
import torchvision
import numpy as np  
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

Getting MNIST dataset

In [2]:
dataset_mnist = torchvision.datasets.MNIST('../datasets/common', download=True)
In [3]:
images_, labels = list(map(list, zip(*dataset_mnist)))
In [4]:
images = [np.array(i).reshape(28*28) for i in images_]

Viewing one MNIST digit

In [5]:
i = int(np.random.rand()*60000)

plt.figure(figsize=(5,5))
plt.title(f'index = {i}, digit={labels[i]}')
plt.imshow(images[i].reshape(28,28), cmap='gray');
[Output: the randomly selected MNIST digit displayed as a 28×28 grayscale image]

Selecting the first $N$ MNIST digits. Running t-SNE on the entire MNIST dataset ($60000$ images) takes a long time on my laptop.

In [6]:
N = 1000
In [7]:
data = np.vstack(images[:N])

t-SNE¶

Embedding the $28 \times 28 = 784$-dimensional MNIST digits into 2D space using t-SNE.

In [8]:
tsne_model = TSNE(n_components=2, random_state=0, init='random', learning_rate='auto')
tsne_data = tsne_model.fit_transform(data)

tsne_data.shape
Out[8]:
(1000, 2)
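Aside: the perplexity discussed earlier is exposed as the perplexity parameter of scikit-learn's TSNE (it defaults to 30); smaller values emphasize local structure, larger values give more weight to global structure. For example:

tsne_model = TSNE(n_components=2, perplexity=30.0, random_state=0,
                  init='random', learning_rate='auto')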

PCA¶

Now embedding the MNIST digits into 2D space using PCA.

In [9]:
pca_model = PCA(n_components=2)
pca_data = pca_model.fit_transform(data)

pca_data.shape
Out[9]:
(1000, 2)

Plotting t-SNE and PCA¶

In [10]:
vis_df = pd.DataFrame(data = np.vstack((tsne_data.T, pca_data.T, labels[:N])).T, columns=('t-sne-1','t-sne-2','pca-1','pca-2','label')) 

vis_df
Out[10]:
t-sne-1 t-sne-2 pca-1 pca-2 label
0 -17.149544 14.014745 203.116589 337.553733 5.0
1 -23.898777 29.808411 1105.173566 425.693151 0.0
2 34.776981 -3.884329 10.385630 -417.412867 4.0
3 -5.514741 -21.196997 -1012.904300 653.160022 1.0
4 19.694548 -13.817496 -335.812039 -662.657908 9.0
... ... ... ... ... ...
995 21.551165 -24.926790 274.191120 -663.309488 7.0
996 1.615674 22.288330 -41.406113 -57.500225 6.0
997 -28.282730 24.071201 767.802972 193.626733 0.0
998 -12.466372 12.977660 148.166559 621.104501 3.0
999 -4.993327 33.094105 -203.148422 166.350435 6.0

1000 rows × 5 columns

In [11]:
plt.figure(figsize=(16,7))
plt.suptitle(f'Embedding first {N} MNIST digits to 2D space')
plt.subplot(1,2,1)
plt.title('t-SNE')
sn.scatterplot(data=vis_df, x='t-sne-1', y='t-sne-2', hue='label', palette='husl')
plt.subplot(1,2,2)
plt.title('PCA')
sn.scatterplot(data=vis_df, x='pca-1', y='pca-2', hue='label', palette='husl');
[Output: side-by-side scatter plots of the 2D t-SNE (left) and PCA (right) embeddings of the first 1000 MNIST digits, coloured by digit label]

A thought experiment¶

Say you are tasked with developing an MNIST classifier that uses only two-dimensional input data (i.e., the $784$-dimensional MNIST digits must be embedded into 2D space before being handed to the classifier). Which embedding scheme would you use: t-SNE or PCA? And why?

Takeaway¶

  • t-SNE is a great way to visualize data
  • It allows us to peer inside black-box algorithms, such as deep neural networks
  • t-SNE uses a heavy-tailed $Q$ distribution in order to reduce the crowding problem exhibited by SNE
  • t-SNE involves solving a non-convex optimization problem
    • This is often solved using gradient descent with momentum
    • Stochastic gradient descent is not used to solve t-SNE
  • t-SNE does not provide an embedding function to project new (high-dimensional) points into 2D space
    • Therefore, it is often only used for data visualization