Transformer-based models like BERT are now a standard part of the NLP toolkit (as demonstrated in Kaggle competitions, for example). Still, BERT and its relatives are far from the final word on NLP tasks such as summarization and question answering.
BERT can effectively cope only with short contexts. This is exactly where the Reformer, an incremental improvement over the Transformer, comes into the picture.
The Reformer pushes the limits of long-sequence modeling with its ability to process up to half a million tokens at once.
For comparison, a conventional bert-base-uncased model limits the input length to only 512 tokens. In the Reformer, each part of the standard Transformer architecture is re-engineered to minimize memory requirements without a significant drop in performance.
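As a quick check of that 512-token limit (this snippet assumes the Hugging Face transformers library, which the text above does not mention), the tokenizer itself reports the model's maximum input length:

```python
from transformers import AutoTokenizer

# bert-base-uncased tokenizers report the model's maximum input length,
# which includes the special [CLS] and [SEP] tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)  # 512
```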
Earlier the LSTM, and now the Transformer, have been popular and effective models. However, both struggle with the demands of recent NLP applications.
With the rise of data mining, developers are looking for a model that can remember past information for longer than LSTMs can. The example below illustrates the need.
(Video example illustrating the need for long-sequence memory.)
Understanding sequential data, such as language, music or videos, is a challenging task, especially when it depends on extensive surrounding context. For example, if a person or an object disappears from view in a video only to re-appear much later, many models will forget how it looked. An LSTM does not help in this case.
The more recent Transformer model not only improved performance in sentence-by-sentence translation, but could also be used to generate entire Wikipedia articles through multi-document summarisation. This is possible because the context window used by the Transformer extends to thousands of words.
However, extending the Transformer to even larger context windows runs into the following limitations.
Impractical computation need - for a text of 100K words, the attention layer of a Transformer would have to assess 100K x 100K word pairs, or 10 billion pairs at each step, which is impractical (see the sketch after this list).
Impractical memory need - for applications using large context windows, the memory required to store the outputs of multiple model layers quickly becomes prohibitive (from gigabytes with a few layers to terabytes in models with thousands of layers).
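To make these numbers concrete, here is a back-of-the-envelope calculation in plain Python (no framework assumed) of what full self-attention over a 100K-word text costs:

```python
# Cost of full self-attention over a 100K-word text.
seq_len = 100_000

# Every position attends to every other position.
pairs = seq_len * seq_len
print(f"{pairs:,} word pairs per attention step")  # 10,000,000,000 (10 billion)

# Memory for a single float32 attention matrix of that size (one head, one layer).
bytes_needed = pairs * 4
print(f"~{bytes_needed / 1e9:.0f} GB for one attention matrix")  # ~40 GB
```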
The Reformer improves on the Transformer by reducing both the computational and the memory needs. With these improvements, the Reformer can handle context windows of up to 1 million words, all on a single accelerator and using only 16 GB of memory.
The first challenge when applying a Transformer model to a very long text sequence is how to handle the attention layer. Locality-sensitive hashing (LSH) addresses this by computing a hash function that groups similar vectors together, instead of searching through all possible pairs of vectors.
Angular LSH projects the points onto a unit sphere that has been divided into predefined regions, each with a distinct code. A series of random rotations of the points then determines which bucket each point belongs to.
The example uses four directional buckets for the locality-sensitive hashing.
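A minimal NumPy sketch of this bucketing idea is shown below; the function name, the default bucket count and the random-rotation setup are illustrative choices, not the Reformer's actual implementation.

```python
import numpy as np

def angular_lsh_buckets(vectors, n_buckets=4, n_rounds=1, seed=0):
    """Assign each vector to a bucket using angular LSH via random rotations.

    Sketch: vectors are normalized onto the unit sphere, rotated by a random
    projection, and bucketed by the direction (argmax over [xR, -xR]) of the
    rotated coordinates.
    """
    rng = np.random.default_rng(seed)
    d = vectors.shape[-1]
    # Normalize onto the unit sphere so only the angle matters.
    unit = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
    buckets = []
    for _ in range(n_rounds):
        # Random rotation with n_buckets // 2 output directions.
        R = rng.standard_normal((d, n_buckets // 2))
        rotated = unit @ R
        # Concatenating [xR, -xR] and taking the argmax picks one of n_buckets regions.
        scores = np.concatenate([rotated, -rotated], axis=-1)
        buckets.append(np.argmax(scores, axis=-1))
    return np.stack(buckets, axis=0)  # shape: (n_rounds, n_vectors)

# Similar vectors tend to land in the same bucket, so attention only needs to
# compare queries and keys that share a bucket instead of all O(L^2) pairs.
x = np.random.randn(8, 64)
print(angular_lsh_buckets(x, n_buckets=4))
```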
The second novel approach implemented in the Reformer is to recompute the input of each layer on demand during back-propagation, rather than storing it in memory.
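This is done with reversible residual layers. The sketch below illustrates the idea; F and G are toy stand-ins for the attention and feed-forward sub-layers, not the Reformer's actual code.

```python
import numpy as np

# Toy sub-layers standing in for attention (F) and feed-forward (G).
def F(x): return np.tanh(x)
def G(x): return 0.5 * x

def reversible_forward(x1, x2):
    """One reversible residual block. Its inputs need not be stored for
    back-propagation, because they can be recomputed from the outputs."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    """Recompute the block's inputs from its outputs on demand."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```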
Image generation
Long text processing - the Reformer can process entire novels, all at once and on a single device.
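For example, a pretrained Reformer language model can be run on long inputs with the Hugging Face transformers library; the snippet below is an illustrative usage sketch that assumes the google/reformer-crime-and-punishment checkpoint discussed in the Hugging Face blog post linked below.

```python
import torch
from transformers import ReformerModelWithLMHead, ReformerTokenizer

# Reformer language model trained on the novel "Crime and Punishment".
tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
model = ReformerModelWithLMHead.from_pretrained("google/reformer-crime-and-punishment")

# Generate a continuation of a prompt; long contexts fit in memory thanks to
# LSH attention and reversible layers.
inputs = tokenizer("A few months later", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(inputs["input_ids"], max_length=100, do_sample=True)
print(tokenizer.decode(output_ids[0]))
```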
https://www.geeksforgeeks.org/understanding-of-lstm-networks/
https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html
https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0
https://huggingface.co/blog/reformer
https://ranko-mosic.medium.com/reformer-the-efficient-transformer-a2329c8410d1