embedding

embedding notes

Embedding creates a low-dimensional, continuous numeric vector representation of

discrete / categorical variables.

For example, to study people's interests, we have the names of books people have read.

The book name serves as a data point/variable. However, encoding book names is a challenge.

One-hot encoding simply creates a massive vector with size equal to the number of books.

Say there are 20000 books; then each vector is 20000 in size. Every book is a sparse vector with

only one bit set to 1 and all the other bits 0. Obviously there are two problems here:

1. the number of dimensions is too high

2. one-hot encoding doesn't really represent anything; similar books are not close to each other

    in the vector space.


So we introduce embedding: we encode books in a much lower-dimensional space (e.g. 100 in size) and

make sure similar books are close to each other. I.e. the embedding space is continuous instead

of discrete. You can search for nearest neighbours in the embedding space to find similar items.
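The nearest-neighbour idea can be sketched as follows. The embedding vectors here are made-up toy values in 3 dimensions (a real setup would learn them, e.g. in 100 dimensions), placed so that the two sci-fi books sit close together:

```python
import numpy as np

# Hypothetical hand-picked embeddings; real ones would be learned from data.
embeddings = {
    "dune":        np.array([0.9, 0.1, 0.0]),
    "neuromancer": np.array([0.8, 0.2, 0.1]),  # sci-fi, near "dune"
    "emma":        np.array([0.1, 0.9, 0.2]),
    "hamlet":      np.array([0.0, 0.8, 0.3]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbour(name):
    """Return the most similar other book in the embedding space."""
    query = embeddings[name]
    scored = [(cosine(query, vec), other)
              for other, vec in embeddings.items() if other != name]
    return max(scored)[1]
```

Here `nearest_neighbour("dune")` returns `"neuromancer"`: similar books sit close together, which one-hot vectors can never express.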

The continuous numeric vector can then serve as a good variable for further machine learning purposes.

To study people's interests, for example, similar numeric vectors mean similar books, which in turn imply

similar interests. With one-hot encoding, a book vector is rubbish, telling pretty much nothing.