The t-SNE algorithm starts by randomly placing each example as a point in the indicated number of dimensions (usually two). Then, in a series of iterations, the algorithm pulls together points that represent similar examples in the dataset and pushes apart points that represent dissimilar ones. After enough iterations, similar points should arrange themselves in clusters separated from the other points.
Creating an example with image data
An example with image data demonstrates how to apply the tool and how to get insight from the clusters. Handwritten digits vary naturally, because there are several ways to write the same digit.
Each datapoint is an 8x8 image of a digit.
Dimensionality: 64
Features: integers 0-16
The example begins by loading the digits CSV and assigning the data to a variable.
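The loading step might look like the sketch below. The name of the CSV file is not given here, so this sketch substitutes scikit-learn's bundled copy of the digits dataset (`load_digits`), which matches the description above (8x8 images, 64 features, integer values 0-16). The variable names `X` and `y` are assumptions made for illustration.

```python
from sklearn.datasets import load_digits

# Assumption: scikit-learn's bundled digits dataset stands in for the CSV file.
digits = load_digits()
X = digits.data    # one flattened 8x8 image per row (64 features)
y = digits.target  # the digit (0-9) that each image shows

print(X.shape)  # (1797, 64)
```

If you work from a CSV file instead, the only requirement for the later steps is a numeric array with one row per image and, for labeling the plot, a matching array of digit labels.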
After loading the dataset, you run the t-SNE algorithm to reduce the data to two dimensions.
This example sets the perplexity, early_exaggeration and n_iter parameters, which contribute to the quality of the final representation.
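A minimal sketch of this step with scikit-learn's `TSNE` follows. The parameter values are illustrative assumptions, not the ones used in the original example, and the data again comes from `load_digits` as a stand-in. Newer scikit-learn releases call the iteration-count argument `max_iter` rather than `n_iter`, so the sketch checks which name the installed version accepts.

```python
import inspect

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Assumption: the bundled digits dataset stands in for the digits CSV.
X, y = load_digits(return_X_y=True)

# The iteration count is n_iter in older scikit-learn releases and
# max_iter in newer ones; use whichever name this installation accepts.
tsne_params = inspect.signature(TSNE).parameters
iter_arg = "max_iter" if "max_iter" in tsne_params else "n_iter"

tsne = TSNE(
    n_components=2,          # target number of dimensions
    init="random",           # start from a random placement of the points
    random_state=0,          # fix the seed so reruns give the same layout
    perplexity=30,           # illustrative value (assumption)
    early_exaggeration=12,   # illustrative value (assumption)
    **{iter_arg: 500},       # illustrative iteration count (assumption)
)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # (1797, 2)
```

Roughly speaking, perplexity controls how many neighbors each point takes into account, early_exaggeration controls how strongly clusters are separated during the first phase of the optimization, and the iteration count controls how long the layout is refined. Changing any of them can noticeably change the picture, so it is worth trying a few settings.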
When the dataset is reduced, you can plot it and place each original digit label in the area of the plot where most of its examples lie.
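One way to place the labels is to write each digit at the median position of its points, as in the sketch below. The helper `plot_labeled_embedding` is a hypothetical name introduced here, and the demo uses synthetic 2-D blobs in place of a real t-SNE result so that the block runs on its own. Integer labels are assumed.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_labeled_embedding(points, labels):
    """Scatter 2-D points and write each label at the median of its points."""
    points = np.asarray(points)
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10, alpha=0.5)
    centers = {}
    for label in np.unique(labels):
        # The median is less sensitive than the mean to stray points.
        center = np.median(points[labels == label], axis=0)
        centers[int(label)] = center
        ax.text(center[0], center[1], str(label), fontsize=16,
                weight="bold", ha="center", va="center")
    return ax, centers


# Synthetic stand-in for a t-SNE result: three well-separated blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo_labels = np.repeat([0, 1, 2], 50)
demo_points = blob_centers[demo_labels] + rng.normal(scale=1.0, size=(150, 2))

ax, centers = plot_labeled_embedding(demo_points, demo_labels)
# Call plt.show() in a script; a notebook renders the figure automatically.
```

For the digits example, pass the two-dimensional t-SNE output and the digit labels in place of the synthetic arrays.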
t-SNE on Some Database
Using the provided data.txt and ground_truth.txt, try to
1. Perform the t-SNE algorithm. Set the perplexity, early_exaggeration and n_iter parameters as in the example for now; you can adjust them after completing step 2. (You can read the data using pd.read_table('data.txt', sep=r'\s+', header=None).)
2. Plot the resulting 2D embedding.