Good Stuff‎ > ‎Data‎ > ‎

Papers NLP

Speed up:
model parallelism 

 data parallelism

Word embedding word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method
 Continuous bag of words
Given w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2} output w_i

For CBOW, you will need more since you are conditioning on context, which can get exponentially huge.

the output is a hierarchical softmax
Skip grams
Given w_i output w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}

Chainer code:
class ContinuousBoW(chainer.Chain):
def __init__(self, n_vocab, n_units, loss_func):       # init dic
super(ContinuousBoW, self).__init__(
    embed=F.EmbedID( n_vocab, n_units, initialW=I.Uniform(1. / n_units)),
def __call__(self, x, context):
e = self.embed(context)              # get embeddings for context
h = F.sum(e, axis=1) * (1. /[1])  # sum context then avg
loss = self.loss_func(h, x)                     # loss{'loss': loss}, self)
return loss

class SkipGram(chainer.Chain):
def __init__(self, n_vocab, n_units, loss_func):      # init dic
super(SkipGram, self).__init__(
    embed=L.EmbedID(n_vocab, n_units, initialW=I.Uniform(1. / n_units)),
def __call__(self, x, context):
e = self.embed(context)
shape =
x = F.broadcast_to(x[:, None], (shape[0], shape[1]))   # repeat x to same size as context
e = F.reshape(e, (shape[0] * shape[1], shape[2]))
x = F.reshape(x, (shape[0] * shape[1],)) # set e and x to same shape.. compare x to each of the context basically
loss = self.loss_func(e, x)        # loss{'loss': loss}, self)
return loss

if args.out_type == 'hsm':
    HSM = L.BinaryHierarchicalSoftmax
    tree = HSM.create_huffman_tree(counts)
    loss_func = HSM(args.unit, tree)[...] = 0
elif args.out_type == 'ns':
    cs = [counts[w] for w in range(len(counts))]
    loss_func = L.NegativeSampling(args.unit, cs, args.negative_size)[...] = 0
elif args.out_type == 'original':
    loss_func = SoftmaxCrossEntropyLoss(args.unit, n_vocab)
    raise Exception('Unknown output type: {}'.format(args.out_type))

      v_c * v_w
   sum(v_c1 * v_w)

The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have

      v_c * v_w
   sum(v_c1 * v_w)

The numerator is basically the similarity between words c (the context) and w (the target) word. The denominator computes the similarity of all other contexts c1 and the target word w. Maximising this ratio ensures words that appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts c1. Negative sampling is one of the ways of addressing this problem- just select a couple of contexts c1 at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measures by their dot product) than the vectors of several other randomly chosen words (e.g. democracygreedFreddy), instead of all other words in language. This makes word2vec much much faster to train.

-- How to avoid softmax:

Computing Softmax or specifically the probability is expensive since this requires summing over all words in V (denominator), which is generally very large.

What can be done?

Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact, but modify its architecture to improve its efficiency (e.g hierarchical softmax). Sampling-based approaches on the other hand completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (They do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute like negative sampling).


x is a one hot vector [0 0 .... 1 ... 0] 
h = wx (is like picking one row in w)
x is sparse

so negative sampling approximates softmax in this way: instead of calculating all the softmax outputs, just pick 5 up to 20 other words in the output and set the prob as to vs them, instead of versus all 1B words that you might have. We dont bother to update all the output losses, just the true one and a few negative ones to update params. dont bother to update the rest

One hot, takes the slice that belongs to that particular word from the word embeddigns lookup table. Then multply its vector with the vector of each context word, (v*d  * d*1 -> v*1) (approximate or real softmax) calculate loss and backpropagate


we want to have our embeddings in 200 dimensions. 

Create a word co-occurence matrix of 7Mx7M of words

apprximate this matrix by two matrix multipications of 7Mx200 and 200x7M

at first randomly initialize these two matrices

run an stochastic gradient decent to minimize the differnece of the value the matrix generates and the actual co-occurence value:

min sigma_i sigma_j sqrt((r_i * c_j + b_i) - (w_ij b_ij))


Ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and so look to encode this information as vector differences. They propose a weighted least squares objective J that directly aims to reduce the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:

where wi and bi are the word vector and bias respectively of word i, w̃ j and bj are the context word vector and bias respectively of word j, Xij is the number of times word i occurs in the context of word j, and f is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.As co-occurrence counts can be directly encoded in a word-context co-occurrence matrix, GloVe takes such a matrix rather than the entire corpus as input.

Other word embedding resources:
sebastianruder    The secret ingredients of word2vec
Hugo's class   negative sampling chainer impl and this
word2vec Explained: Deriving Mikolov et al.’sNegative-Sampling Word-Embedding Method
Hierarchical softmax in Torch Clement, FBCunn

So, basically to train wntity embeddigns I have four options:
- train a next word predicting LSTM where instead of the mention take the embedding from entity embedding.
- this works on mentions
- for wikipage, it doesnt work. Maybe use CBOW or approximate softmax after passing lstm through all words then predicting the entity at $ sign.
- train skip gram model with negative sampling to estimate softmax
- train a paragraph embedding based on all the sentences that belong to that paragraph
- train facebook's fast text word vectors using subword information.

Facebook FastText: github.
Enriching Word Vectors with Subword Information paper kjnfg

Bag of Tricks for Efficient Text Classification paper
many orders of magnitude faster for training and evaluation.
train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.

we show that linear models with a rank constraint and a fast loss approximation
Sentence: bag of N n-grams (e.g. bi-grams), embedded, averaged
Tag prediction / sentiment analysis

Entity Resolution
CitationsYear Authors Paper  Summary
 210 2014  Entity Linking meets Word Sense Disambiguation: a Unified Approach Solve entity linking and word sense disambiguation as graphical model
  2017FB  Reading Wikipedia to Answer Open-Domain Questions 
    Paragraph embedding Each paragraph is given a random embedding,  the paragraph vector is concatenated with local context word vectors to predict the next word. The prediction task changes the word vectors and the paragraph vector.

This is called the skip-gram model. The next word is predicted. The paragraph vector can be further simplified when we use no local context in the prediction task.
distributed bag of words model of Paragraph Vector: same as before, paragraph embedding is first randomly initialized, and is updated through prediction of each sentence. 

At inference time, the parameters of the classifier and the word vectors are not needed and backpropagation is used to tune the paragraph vectors. As the distributed bag of words model is more efficient, the experiments in this paper focuses on this implementation of the Paragraph Vector.

- all words were lower-cased before the datasets were use.  
- also jointly trained word embeddings with the paragraph vectors since preliminary experiments showed that this can improve 2 the quality of the paragraph vectors
- frequency cutoff to obtain a vocabulary of 1M words
Hellinger distance for LDA and cosine distance for Paragraph Vectors for nearest neighbor comparison

This way of training the model brings all semantically related entities to same sate of space (films, species, athletes, etc)
 Wikipedia entity embedding all the description as one paragraph

  2017  Vector Embedding of Wikipedia Concepts and Entities Wikipedia entity embedding:
plain text of all these pages were extracted
- anchors belonging to the pruned list of Wikipedia pages were replaced with their Wikipedia page ID (the redirects were also handled),
for other anchors, the surface form of them was substituted.

We merged the plain text of all pages in a single text file in which, each line is a clean text of a Wikipedia page. This dataset contains 2.1B tokens.

As the size of a document increases the Doc2Vec approach for document embedding results in a lower performance.
    A World of Difference: Divergent Word Interpretations among People 
 3211 2013  Skipgram - Efficient Estimation of Word Representations in Vector Space  First skipgram came out then Word2vec

 36502013 Mikolov, Sutskever,  Word2Vec - Distributed Representations of Words and Phrases and their Compositionality 
 17 2016 Google Contextual LSTM (CLSTM) models for Large scale NLP task

  2017 Google, Chris Manning Get To The Point: Summarization with Pointer-Generator Networks

Subpages (1): Code