An RNN is a neural network that is represented mathematically by a recurrent function: the same computation is applied at every step of a sequence.
The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other, but for many tasks that is a bad assumption: if you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations; in that sense they have a "memory" which captures information about what has been calculated so far.
Ref: https://medium.com/@purnasaigudikandula/recurrent-neural-networks-and-lstm-explained-7f51c7f6bbb9
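A minimal NumPy sketch of that recurrence (all dimensions are made up), showing how the same function is applied at every step while the hidden state carries the "memory":

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(8, 4))   # input -> hidden weights (hypothetical sizes)
W_hh = rng.normal(size=(8, 8))   # hidden -> hidden weights (the recurrent part)
b_h = np.zeros(8)

def rnn_step(x_t, h_prev):
    # the same function at every time step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(8)
for x_t in rng.normal(size=(5, 4)):   # a toy sequence of 5 tokens
    h = rnn_step(x_t, h)              # h accumulates information about past tokens
```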
For a classification problem with two classes, the sigmoid function is used; for more than two classes, softmax is used.
The sigmoid function is used for two-class logistic regression, whereas the softmax function is used for multiclass logistic regression.
Ref: https://stats.stackexchange.com/questions/233658/softmax-vs-sigmoid-function-in-logistic-classifier
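A small NumPy illustration of the distinction, using made-up logits:

```python
import numpy as np

def sigmoid(z):
    # two-class case: one logit -> probability of the positive class
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # multiclass case: one logit per class -> probabilities summing to 1
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(sigmoid(0.8))                         # e.g. P(class=1) for a 2-class problem
print(softmax(np.array([2.0, 1.0, 0.1])))   # e.g. probabilities over 3 classes
```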
A regression problem is one where the output variable is a real or continuous value, such as "salary" or "weight". Many different models can be used; the simplest is linear regression.
Ref: https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/
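For illustration, a least-squares linear fit on made-up salary-vs-experience numbers (NumPy only; the data is invented):

```python
import numpy as np

years = np.array([1, 2, 3, 4, 5], dtype=float)        # made-up experience in years
salary = np.array([40, 45, 52, 58, 66], dtype=float)  # made-up salary values

# fit salary ≈ w * years + b by ordinary least squares
A = np.vstack([years, np.ones_like(years)]).T
(w, b), *_ = np.linalg.lstsq(A, salary, rcond=None)
print(w * 6 + b)   # predicted (continuous) salary for 6 years of experience
```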
Logistic regression is another technique borrowed by machine learning from the field of statistics.
It is the go-to method for binary classification problems (problems with two class values).
Ref: https://machinelearningmastery.com/logistic-regression-for-machine-learning/
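A minimal two-class sketch with scikit-learn on toy data (assuming scikit-learn is installed; the numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # toy single feature
y = np.array([0, 0, 0, 1, 1, 1])                          # two class values

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.8]]))        # predicted class
print(clf.predict_proba([[2.8]]))  # sigmoid-based class probabilities
```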
It focuses on producing a document as output, e.g., a question-answering (Q-A) problem.
It focuses on classifying a document, e.g., the spam vs. not-spam problem.
Semantic hashing is a method to map documents to a code (e.g., a 32-bit memory address) so that documents with semantically close content are mapped to close addresses. This method can be used to implement an information retrieval (IR) system where the query is a document and the search results contain documents with similar content (semantics). This method was published by G. Hinton in this paper.
Refer here
This approach is based on rules, grammars and dictionaries. The main drawbacks of this approach are:
Accuracy issues
Difficulty identifying spelling errors
Complexity increases with complex rules, which are made by humans and are error prone as well as time consuming
This approach is used wherever there is not enough data for training a machine-learning model.
This is an automated approach.
In this approach, a human does the feature engineering and the NLP engine is configured based on it. The main drawback of this approach is the feature-engineering effort.
This is the modern approach. In this approach, the machine learns from a training set. It is powerful enough to interpret unknown sentences as well. The main challenge is efficiency (in terms of training time and computational resources).
In this approach, LSTMs were used. Tokens are passed in sequentially. The algorithm has an encoder and a decoder, and also an attention mechanism. Attention is computed from the hidden state of the previous decoder LSTM and the hidden states of all the LSTMs in the encoder. Refer to the diagram below, taken from here.
Ref: https://medium.com/@purnasaigudikandula/recurrent-neural-networks-and-lstm-explained-7f51c7f6bbb9
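A rough NumPy sketch of how such attention weights might be computed from the previous decoder hidden state and all encoder hidden states (this is the additive, Bahdanau-style scoring; toy sizes, and only an illustration of the idea, not the exact model in the diagram):

```python
import numpy as np

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(7, 16))   # hidden states of all 7 encoder LSTM steps
dec_prev = rng.normal(size=16)          # hidden state of the previous decoder step

# additive scoring: score_i = v . tanh(W_e h_i + W_d s_{t-1})
W_e = rng.normal(size=(16, 16))
W_d = rng.normal(size=(16, 16))
v = rng.normal(size=16)

scores = np.tanh(enc_states @ W_e + dec_prev @ W_d) @ v
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax over encoder positions
context = weights @ enc_states          # weighted sum of encoder states fed to the decoder
```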
It is a kind of RNN with memory.
Remembering information for long periods of time is the default behaviour of LSTMs.
Look at the figure below: every LSTM module has three gates, named the forget gate, input gate, and output gate.
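A minimal NumPy sketch of one LSTM step showing those three gates (the parameter layout is simplified and hypothetical; real implementations fuse these matrices):

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step with forget, input and output gates.
    W, U, b are dicts of per-gate parameters (a simplification)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])

    c_t = f * c_prev + i * c_tilde    # cell state: the long-term "memory"
    h_t = o * np.tanh(c_t)            # hidden state / output
    return h_t, c_t
```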
Drawbacks
Slow to train (Refer https://www.youtube.com/watch?v=xI0HHN5XKDo)
Tokens/words are processed sequentially
Attention/context learning is not as good as in the Transformer
Questions
What is a source embedding?
Ans- It is the encoding of a token (word), i.e., a representation of the word. There are source embeddings and target embeddings. There are sentence embeddings as well (the BERT CLS output is a sentence embedding).
Why do the y's appear in two places in the decoder?
Ans- They are used for training: this y is matched against the predicted y value, and the difference is then back-propagated through the system.
This model was published early in 2018 and uses Recurrent Neural Networks (RNNs) in the form of the Long Short-Term Memory (LSTM) architecture to generate contextualized word embeddings.
Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding.
Reference: https://blog.floydhub.com/when-the-best-nlp-model-is-not-the-best-choice/
Before the Transformer algorithm, LSTMs were used for NLP (NMT above). The Transformer outperforms the Google Neural Machine Translation model on specific tasks. The biggest benefit, however, comes from how the Transformer lends itself to parallelization. It is in fact Google Cloud's recommendation to use the Transformer as a reference model for their Cloud TPU offering (refer here).
This 2017 paper proposes a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Quote from the paper
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Encoder
In sequence-to-sequence problems such as neural machine translation, the initial proposals were based on the use of RNNs in an encoder-decoder architecture. These architectures have a great limitation when working with long sequences: their ability to retain information from the first elements is lost as new elements are incorporated into the sequence. In the encoder, the hidden state at every step is associated with a certain word in the input sentence, usually one of the most recent. Therefore, if the decoder only accesses the last hidden state of the encoder, it will lose relevant information about the first elements of the sequence. To deal with this limitation, a new concept was introduced: the attention mechanism.
We can explain the relationship between words in one sentence or close context. When we see “eating”, we expect to encounter a food word very soon. The color term describes the food, but probably not so much with “eating” directly.
Refer: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Instead of paying attention only to the last state of the encoder, as is usually done with RNNs, at each step of the decoder we look at all the states of the encoder, and can therefore access information about all the elements of the input sequence. This is what attention does: it extracts information from the whole sequence as a weighted sum of all the past encoder states. This allows the decoder to assign greater weight or importance to a certain element of the input for each element of the output, learning at every step to focus on the right element of the input to predict the next output element.
But this approach still has an important limitation: each sequence must be treated one element at a time. Both the encoder and the decoder have to wait until the completion of t-1 steps to process the t-th step. So when dealing with a huge corpus it is very time consuming and computationally inefficient.
The Transformer model extracts features for each word using a self-attention mechanism to figure out how important all the other words in the sentence are w.r.t. the aforementioned word. No recurrent units are used to obtain these features; they are just weighted sums and activations, so they are very parallelizable and efficient.
Quote from paper
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations
There exist three types of attention mechanism in the Transformer model, as below:
Ref: https://www.kdnuggets.com/2020/02/illustrating-reformer.html
Human attention saves time by focusing on the relevant part.
Machine attention wastes time since it computes all possibilities and then decides what is relevant.
Global attention vs local attention
Additive attention
Dot product attention
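As a concrete illustration of the dot-product flavour, a minimal NumPy sketch of scaled dot-product self-attention (toy dimensions; not any particular library's implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 64))                    # 5 tokens, toy model dimension 64
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)  # every token attends to every token
```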
One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.
To address this, the transformer adds a vector to each input embedding.
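One common choice, used in the original Transformer paper, is the sinusoidal positional encoding; a small NumPy sketch of the vector added to each input embedding:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# added element-wise to the (seq_len, d_model) matrix of input embeddings
pe = positional_encoding(seq_len=10, d_model=512)
```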
Only the encoder part of the Transformer is used.
It can take up to 512 tokens in one go.
BERT Base has 12 encoder layers and 12 heads in multi-head self-attention. For BERT Large, these numbers are 24 and 16 respectively.
These also have larger feed-forward networks (768 [= 12×64] and 1024 [= 16×64] hidden units respectively) and more attention heads (12 and 16 respectively).
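These sizes can be cross-checked against the default configuration in the Hugging Face transformers library; a minimal sketch, assuming the library is installed:

```python
from transformers import BertConfig

base = BertConfig()                 # defaults correspond to BERT Base
print(base.num_hidden_layers,       # 12 encoder layers
      base.num_attention_heads,     # 12 attention heads
      base.hidden_size)             # 768 = 12 * 64
```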
The Transformer is for NMT, but BERT can be used for:
Sentiment analysis
Chat-bot
Word tagging
Text summary
BERT training phases
Pre-training to understand language
Fill-in-the-blanks (masked language model) training; see the sketch after this list
Next sentence prediction
Training (fine-tuning) to learn a specific task
Question-answering task
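As an illustration of the fill-in-the-blanks (masked language model) objective, a small sketch using the Hugging Face transformers fill-mask pipeline (assumes the library is installed and the bert-base-uncased checkpoint can be downloaded):

```python
from transformers import pipeline

# masked language modelling: BERT predicts the [MASK] token using context from both directions
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Paris is the [MASK] of France."))  # top candidate tokens with scores
```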
Questions
What is DistilBERT?
It is a reduced version of BERT and can be used on mobile-like devices for prediction.
Why is "bidirectional" in the name?
Ref: http://jalammar.github.io/illustrated-bert/
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
It is a reduced version of BERT.
In this work, the authors propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, they leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster.
Ref: https://huggingface.co/transformers/model_doc/distilbert.html
The first novelty in the reformer comes from replacing dot-product attention with locality-sensitive hashing (LSH) to change the complexity from O(L²) to O(L log L).
LSH is a well-known algorithm for an efficient and approximate way of nearest neighbors search in high dimensional datasets. The main idea behind LSH is to select hash functions such that for two points ‘p’ and ‘q’, if ‘q’ is close to ‘p’ then with good enough probability we have ‘hash(q) == hash(p)’.
The simplest way to achieve this is to keep cutting the space by random hyperplanes and append sign(pᵀH) to the hash code of each point. Let's look at an example below:
Once we find hash codes of a desired length, we divide the points into buckets based on their hash codes — in the above example, ‘a’ and ‘b’ belong to the same bucket since hash(a) == hash(b). Now the search space to find the nearest neighbors of each point reduces dramatically from the whole data set into the bucket where it belongs to.
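A small NumPy sketch of this random-hyperplane hashing and bucketing (toy data and sizes; not the Reformer's actual implementation):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
points = rng.normal(size=(100, 64))      # toy high-dimensional points
H = rng.normal(size=(64, 8))             # 8 random hyperplanes -> 8-bit hash codes

codes = (points @ H > 0).astype(int)     # sign(p^T H) recorded as 0/1 bits

buckets = defaultdict(list)
for idx, code in enumerate(codes):
    buckets[tuple(code)].append(idx)     # nearby points tend to share a bucket

# nearest-neighbour search for a point now only scans its own bucket,
# instead of the whole data set
```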
Now the basic idea behind LSH attention is as follows. Looking back into the standard attention formula above, instead of computing attention over all of the vectors in Q and K matrices, we do the following:
Find LSH hashes of Q and K matrices.
Compute standard attention only for the k and q vectors within the same hash buckets.
Multi-round LSH attention: Repeat the above procedure a few times to increase the probability that similar items do not fall into different buckets.
The animation below illustrates a simplified version of LSH Attention based on the figure from the paper.
It takes a lot of memory due to the attention computation (Ref: https://www.youtube.com/watch?v=i4H0kjxrias).
Computing attention on sequences of length L is O(L²) (in both time and memory). Imagine what happens if we have a sequence of length 64K. Note the self-attention computation: each token needs to compute attention with every other token.
A model with N layers consumes N-times larger memory than a single-layer model, as activations in each layer need to be stored for back-propagation.
The depth of intermediate feed-forward layers is often much larger than the depth of attention activations.
Ref: https://www.kdnuggets.com/2020/02/illustrating-reformer.html
The authors introduced the first Transformer architectures, Performers, capable of provably accurate and practical estimation of regular (softmax) full-rank attention, but with only linear space and time complexity, and not relying on any priors such as sparsity or low-rankness.
Uses the Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism
It improves the Transformer by reducing:
the memory requirement in the Z calculation.
It uses four directional buckets and locality-sensitive hashing (refer here).
The authors introduce the Reformer, a Transformer model designed to handle context windows of up to 1 million words, all on a single accelerator and using only 16 GB of memory.
The first challenge when applying a Transformer model to a very large text sequence is how to handle the attention layer. LSH accomplishes this by computing a hash function that matches similar vectors together, instead of searching through all possible pairs of vectors.
The second novel approach implemented in Reformer is to recompute the input of each layer on-demand during back-propagation, rather than storing it in memory.
Refer: https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html
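A toy sketch of the reversible-residual idea behind that second point: a layer's inputs can be recomputed from its outputs during back-propagation, so they do not need to be stored (F and G are hypothetical stand-ins for the attention and feed-forward blocks):

```python
import numpy as np

F = lambda x: np.tanh(x)           # stand-in for the attention block
G = lambda x: np.maximum(x, 0.0)   # stand-in for the feed-forward block

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2                  # only the outputs need to be kept

def recover_inputs(y1, y2):
    # during back-propagation the inputs are recomputed instead of stored
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2
```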
Generally, language modelling requires you to train your model for the specific task you are trying to solve. So, if you want to translate text from English to French, you train your model on lots of French and English text examples. If you want to train it for German, you do the same for English to German, and so on. Obviously this requires a lot of data, since you need it for every language you are trying to solve.
Zero-shot learning is where you train one universal model on either a very large dataset or a very varied dataset. Then you can apply this model to any task. In the translation example, you would train one model and use it as a kind of universal translator for other languages. A paper published at the end of 2018 did just this and was able to learn sentence representations for 93 different languages.
Refer: https://blog.floydhub.com/ten-trends-in-deep-learning-nlp/#zero-shot
What is the purpose of the encoder?
In BERT, encoder is used to understand the language.
Quote from paper
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1,...,xn) to a sequence of continuous representations z = (z1,...,zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive
Is local attention good to use?
How is attention related to RNNs? What is its mathematical representation?
Attention doesn't use an RNN or LSTM.
What is multi-head attention?
It is for increasing the model's capacity; in other words, it is used to perceive the input in different forms (different learned projections of the same tokens).
Quote from paper
Instead of performing a single attention function with dmodel-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv -dimensional output values. These are concatenated and once again projected, resulting in the final values
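A rough NumPy sketch of that scheme: project the input h times, run scaled dot-product attention per head, concatenate, and project once more (toy sizes; illustrative only):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(x, per_head, Wo):
    # per_head: list of (Wq, Wk, Wv) triples, one per head, projecting d_model -> d_k
    heads = [attention(x @ Wq, x @ Wk, x @ Wv) for Wq, Wk, Wv in per_head]
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate heads, project back to d_model

rng = np.random.default_rng(4)
d_model, h, d_k = 64, 8, 8                       # toy sizes (d_k = d_model / h)
x = rng.normal(size=(5, d_model))                # 5 tokens
per_head = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
out = multi_head_attention(x, per_head, rng.normal(size=(h * d_k, d_model)))
```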
Why do the encoder and decoder both have attention?
Encoder attention is for self-relevance among the tokens.
Decoder self-attention is masked, meaning that only already-predicted values are visible and the others are masked. The other decoder attention is for auto-regression and is similar to RNN-based attention.
The decoder has self-attention as well as encoder-decoder attention. Why?
Decoder self-attention is masked, meaning that only already-predicted values are visible and the others are masked. The encoder-decoder attention is for auto-regression and is similar to RNN-based attention.
https://monkeylearn.com/blog/definitive-guide-natural-language-processing/
https://arxiv.org/abs/1706.03762?context=cs.CL
https://towardsdatascience.com/attention-is-all-you-need-discovering-the-transformer-paper-73e5ff5e0634
https://www.coursera.org/lecture/language-processing/how-to-deal-with-a-vocabulary-mvV6t
https://towardsdatascience.com/transformers-state-of-the-art-natural-language-processing-1d84c4c7462b