Continuous bag of words: given w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}, output w_i
For CBOW you need more capacity, since you are conditioning on a whole context, and the number of possible contexts grows exponentially with the window size.
the output is a hierarchical softmax
hierarchical softmax: instead of predicting the output words themselves, predict the turns we need to take in a binary tree to reach that label's leaf. So instead of scoring n items we only have to predict log(n) steps (branches) in the tree | Skip-grams: given w_i, output w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}
Chainer code:
```python
import chainer
import chainer.functions as F
import chainer.links as L
import chainer.initializers as I
from chainer import reporter

class ContinuousBoW(chainer.Chain):
    def __init__(self, n_vocab, n_units, loss_func):
        super(ContinuousBoW, self).__init__(
            embed=L.EmbedID(n_vocab, n_units, initialW=I.Uniform(1. / n_units)),
            loss_func=loss_func,
        )

    def __call__(self, x, context):
        e = self.embed(context)                              # embeddings for each context word
        h = F.sum(e, axis=1) * (1. / context.data.shape[1])  # sum context, then average
        loss = self.loss_func(h, x)
        reporter.report({'loss': loss}, self)
        return loss
```
```python
class SkipGram(chainer.Chain):
    def __init__(self, n_vocab, n_units, loss_func):
        super(SkipGram, self).__init__(
            embed=L.EmbedID(n_vocab, n_units, initialW=I.Uniform(1. / n_units)),
            loss_func=loss_func,
        )

    def __call__(self, x, context):
        e = self.embed(context)
        shape = e.data.shape
        x = F.broadcast_to(x[:, None], (shape[0], shape[1]))  # repeat x to the same size as context
        e = F.reshape(e, (shape[0] * shape[1], shape[2]))
        x = F.reshape(x, (shape[0] * shape[1],))              # flatten so x is compared to each context word
        loss = self.loss_func(e, x)
        reporter.report({'loss': loss}, self)
        return loss
```
```python
if args.out_type == 'hsm':
    HSM = L.BinaryHierarchicalSoftmax
    tree = HSM.create_huffman_tree(counts)
    loss_func = HSM(args.unit, tree)
    loss_func.W.data[...] = 0
elif args.out_type == 'ns':
    cs = [counts[w] for w in range(len(counts))]
    loss_func = L.NegativeSampling(args.unit, cs, args.negative_size)
    loss_func.W.data[...] = 0
elif args.out_type == 'original':
    loss_func = SoftmaxCrossEntropyLoss(args.unit, n_vocab)
else:
    raise Exception('Unknown output type: {}'.format(args.out_type))
```
---------------
The idea of `word2vec` is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have

```
v_c * v_w
-------------------
sum(v_c1 * v_w)
```
The numerator is basically the similarity between the context word `c` and the target word `w`. The denominator computes the similarity of all other contexts `c1` and the target word `w`. Maximising this ratio ensures that words which appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts `c1`. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts `c1` at random. The end result is that if `cat` appears in the context of `food`, then the vector of `food` is more similar to the vector of `cat` (as measured by their dot product) than the vectors of **several other randomly chosen words** (e.g. `democracy`, `greed`, `Freddy`), instead of **all other words in the language**. This makes `word2vec` much, much faster to train.

-- How to avoid softmax:
Computing the *softmax*, or specifically the normalised probability, is expensive since it requires summing over all words in **V** (the denominator), which is generally very large. *What can be done?*
Different strategies have been proposed to **approximate** the softmax. These approaches can be grouped into **softmax-based** and **sampling-based** approaches. *Softmax-based* approaches keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). *Sampling-based* approaches, on the other hand, do away with the softmax layer entirely and instead optimise some other loss function that approximates the softmax (they approximate the normalisation in the denominator of the softmax with some other loss that is cheap to compute, such as negative sampling). ----------------
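As a concrete sketch of the hierarchical-softmax idea (a toy tree with illustrative names and shapes, not any library's API): a word's probability is the product of sigmoid branch decisions along its root-to-leaf path, so only ~log2(V) dot products are evaluated instead of V.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy vocabulary of 4 words in a balanced binary tree: 3 internal nodes,
# each with its own parameter vector. A word's probability is the product
# of branch decisions (sigmoids) along its root-to-leaf path.
rng = np.random.default_rng(0)
d = 5                                # embedding size
node_vecs = rng.normal(size=(3, d))  # one vector per internal node

# Path for one word: (node index, direction), +1 = left, -1 = right.
path = [(0, -1), (2, +1)]

h = rng.normal(size=d)               # hidden/context vector
p = 1.0
for node, direction in path:
    p *= sigmoid(direction * (node_vecs[node] @ h))

# Only len(path) ~ log2(V) sigmoids were computed, not V softmax terms.
print(p)
```

Because sigmoid(z) + sigmoid(-z) = 1 at every branch, the probabilities of all leaves still sum to 1.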
x is a one-hot vector [0 0 .... 1 ... 0]; h = Wx is like picking one row of W; x is sparse
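A minimal NumPy check of this (toy sizes, illustrative names): multiplying by a one-hot vector and indexing the row give the same h, which is why implementations use a lookup table instead of a dense multiply.

```python
import numpy as np

V, d = 6, 3
W = np.arange(V * d, dtype=float).reshape(V, d)  # embedding matrix

x = np.zeros(V)
x[4] = 1.0            # one-hot vector for word id 4

h_matmul = x @ W      # dense multiply: O(V*d)
h_lookup = W[4]       # sparse row lookup: O(d)

assert np.array_equal(h_matmul, h_lookup)
```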
-------------
So negative sampling approximates the softmax in this way: instead of calculating all the softmax outputs, just pick 5 to 20 other words in the output layer and set the probability of the true word versus them, instead of versus all 1B words you might have. We don't update all the output losses; we update the parameters for just the true word and a few negative ones, and don't bother with the rest.
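That update can be sketched as a loss computation in NumPy (toy sizes; all names are illustrative, and real implementations draw negatives from a smoothed unigram distribution rather than uniformly):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                 # vocab size, dims, negatives per positive

W_in = rng.normal(0, 0.1, (V, d))     # input (center-word) embeddings
W_out = rng.normal(0, 0.1, (V, d))    # output (context-word) embeddings

center, context = 7, 42
negatives = rng.integers(0, V, size=k)  # k random "negative" word ids

h = W_in[center]
# loss = -log sigmoid(v_ctx . h) - sum log sigmoid(-v_neg . h)
# Only k+1 output rows are touched, not all V of them.
loss = -np.log(sigmoid(W_out[context] @ h))
loss -= np.sum(np.log(sigmoid(-(W_out[negatives] @ h))))
print(loss)
```

Backpropagating this loss updates only `W_in[center]` and the k+1 rows of `W_out` involved.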
One-hot input takes the slice that belongs to that particular word from the word-embedding lookup table. Then multiply its vector with the vector of each output word (v*d * d*1 -> v*1), compute the (approximate or real) softmax, calculate the loss, and backpropagate.
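For contrast with negative sampling, here is that lookup-and-score step with the real (full) softmax, as a minimal NumPy sketch (toy sizes, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 100, 16
W_in = rng.normal(0, 0.1, (V, d))   # lookup table of word embeddings
W_out = rng.normal(0, 0.1, (V, d))  # output-side vectors

center, context = 3, 17
h = W_in[center]                    # slice from the lookup table (one-hot pick)
scores = W_out @ h                  # (V,d) @ (d,) -> (V,): one score per word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                # softmax over the whole vocabulary
loss = -np.log(probs[context])      # cross-entropy against the observed context word
```

The `probs.sum()` normalisation over all V rows is exactly the expensive denominator that hierarchical softmax and negative sampling avoid.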
| GloVe
we want to have our embeddings in 200 dimensions.
create a word co-occurrence matrix of 7M x 7M words
approximate this matrix by the product of two matrices of size 7M x 200 and 200 x 7M
at first, randomly initialize these two matrices
run stochastic gradient descent to minimize the difference between the value the factorization generates and the (log of the) actual co-occurrence value:
min sigma_i sigma_j (r_i * c_j + b_i + b_j - log(X_ij))^2
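The steps above can be sketched with plain NumPy SGD (toy sizes, illustrative names; the weighting function `f` and its constants are an assumption of this sketch, following the standard GloVe-style form that down-weights rare pairs and caps frequent ones):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 20, 8, 0.05

X = rng.integers(0, 5, size=(V, V)).astype(float)  # toy co-occurrence counts
R = rng.normal(0, 0.1, (V, d))   # word (row) vectors, randomly initialized
C = rng.normal(0, 0.1, (V, d))   # context (column) vectors
br = np.zeros(V)                 # word biases
bc = np.zeros(V)                 # context biases

def f(x, x_max=10.0, alpha=0.75):      # down-weights rare, caps frequent pairs
    return min((x / x_max) ** alpha, 1.0)

def total_loss():
    return sum(f(X[i, j]) * (R[i] @ C[j] + br[i] + bc[j] - np.log(X[i, j])) ** 2
               for i, j in zip(*np.nonzero(X)))

loss_before = total_loss()
for epoch in range(50):
    for i, j in zip(*np.nonzero(X)):   # only observed (nonzero) co-occurrences
        err = R[i] @ C[j] + br[i] + bc[j] - np.log(X[i, j])
        g = f(X[i, j]) * err
        gR, gC = g * C[j], g * R[i]    # gradients w.r.t. the two vectors
        R[i] -= lr * gR
        C[j] -= lr * gC
        br[i] -= lr * g
        bc[j] -= lr * g
loss_after = total_loss()
```

Iterating only over nonzero entries is what makes this tractable: the sum runs over observed co-occurrences, not all V^2 cells.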
===================================
The ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information, so GloVe looks to encode this information as vector differences. They propose a weighted least-squares objective J that directly aims to reduce the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:

J = sum_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)^2

where w_i and b_i are the word vector and bias respectively of word i, w̃_j and b̃_j are the context word vector and bias respectively of word j, X_ij is the number of times word i occurs in the context of word j, and f is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences. As co-occurrence counts can be directly encoded in a word-context co-occurrence matrix, GloVe takes such a matrix rather than the entire corpus as input. |