The advent of commodity hardware like the GPU, and of cloud computing technologies that use it, has led to the recent Dawn of AI [1]. In the past decade, we have seen a tremendous surge in the use of these technologies to build neural networks with complex goals such as natural language understanding [2] or generation [3]. The goal of this project is to combine these ideas to create an open question answering system capable of generating human-like answers to users’ natural language questions. The result of the project, QABot, will be a Facebook bot capable of taking questions from users and responding in a human-like way. This bot will act as a real-time assistant, turning any device in the world into a system that can clarify topical queries for students, provide financial tips, or act as an encyclopedia on the go. Because the responses are human-like, the bot should feel like a friendly chat system rather than a standard search engine interface. The chat interface will always answer in a friendly manner, no matter the disposition of the human. Currently, the end-user interface I have chosen is a Facebook bot, but the model can be deployed to other systems such as Slack, Telegram, WhatsApp, or even a generic web or Android app. This ease of deployment expands the potential of the system and allows a large number of users to take advantage of my model.
The bot will use a natural language model that I will create by training a neural network on data [4] from a real search engine, which includes responses annotated by users as acceptable answers. I intend to use Rouge-L [5] and BLEU [6] scores to evaluate the quality of the responses generated by my bot.
Image credit: Thomas [23]
The dataset I am using was acquired from Microsoft’s Machine Reading Comprehension (MS MARCO) project site. It provides datasets that are focused on deep learning for search. The dataset includes 101,093 records and six columns.
A preliminary analysis shows that the dataset contains the following columns: query_id (a unique id for each question), query (a unique question based on initial Bing usage), passages (relevant information for the answers), query_type (the question category: LOCATION, NUMERIC, PERSON, DESCRIPTION, or ENTITY), answers (an appropriate answer; about 1% of questions have multiple answers), and wellFormedAnswers (a human annotation denoting whether there is a well-formed answer; about 7% of questions have multiple answers). I also analyzed the distribution of question and answer lengths, which I present in my project IPython notebook at the GitHub link above.
By the next delivery deadline, I plan to dive deeper into the types of questions in the dataset and complete my exploratory data analysis.
Image credit: Malaiyandi [22]
I am planning to build QABot using a sequence-to-sequence deep learning model with an encoder-decoder architecture, combined with an attention mechanism, to answer users' search questions. I will create the model by training a deep neural network using the PyTorch deep learning framework. I plan to use the following neural network designs for training my models:
Seq2Seq approach with attention mechanism. The Seq2Seq algorithm trains a denoising auto-encoder over sequences, and I will use an RNN to handle the sequences. During training I will randomly choose between the teacher forcing and auto-regressive approaches; for evaluation I will use the auto-regressive approach.
BERT (Bidirectional Encoder Representations from Transformers) for tokenization, combined with a Transformer and GPT-2 for model fine-tuning.
Image credit: Dynamic Consultants [21]
According to Global Web Index statistics, 75% of internet users are adopting one or more messenger platforms [7]. Research shows that users use 24 different apps a month on average, yet 80% of their usage time is spent in just 5 apps. This strongly implies that creating yet another app to help users might not be that useful. Integrating the system as a chatbot on a popular platform like Facebook Messenger is likely to be more useful to users.
Market experts at Verizon Ventures have stated that “Chatbots represent a new trend in how people access information, make decisions, and communicate” [8]. For the current generation of users, chatbots are a natural extension of texting. Having grown up with internet-driven services and modern mobile devices, they have moved from expecting search engine responses that link to concert ticket sites to expecting chat interfaces that sell them the tickets directly. All this gives me a strong motivation for creating chatbot solutions for open questions.
Image credit: Patel [20]
Open question answering research has to date primarily focused on using knowledge bases (KBs). Early examples of such systems [9], [10], [11] focused on hand-curated KBs like Freebase or on information automatically extracted from unstructured text. Open QA (OQA) [12] was the first to take a new approach to the problem. It decomposed questions into sub-problems, mined millions of rules from an unlabeled question corpus across multiple KBs, and used a latent-variable structured perceptron algorithm to train on question-answer pairs. Before that, the DeepQA project [13] from IBM’s Watson team was a leader in question answering systems. It combined a plethora of QA techniques that included machine translation-based approaches to logical form analysis. DeepQA extended the previous question answering state of the art [14] by creating the concept of “Challenge questions” and ranking answers to those questions with a massively parallel, probabilistic, evidence-based architecture. The current state of the art [15] in open question answering evolves this line of work using machine comprehension; it builds upon the learnings from DeepQA and relies on datasets like SQuAD and CNN/Daily Mail. Built by researchers at Stanford and Facebook AI lab, it addresses document retrieval at scale together with machine comprehension, and trains a multi-layer neural network model that is able to detect answers in Wikipedia articles. Teaching a machine to carry out a meaningful conversation with a human in multiple domains is a research question that is far from solved [16]. Recently, the deep learning boom has enabled powerful generative models like Google’s Neural Conversational Model [17], which marks a large step towards multi-domain generative conversational models [18]. In this project I will create a new neural network and implement it using the PyTorch machine learning library.
The main research goal of this project is to train a neural network model that chats in as human-like a way as possible while responding to user questions via a chat interface.
Image credit: Viper [19]
[1] M. Versace, "2020 And The Dawn Of AI Learning At The Edge," 12 February 2020. [Online]. Available: https://www.forbes.com/sites/forbestechcouncil/2020/02/12/2020-and-the-dawn-of-ai-learning-at-the-edge/#7334b1bf2029.
[2] L. Xu, et al., "NLUBroker: a flexible and responsive broker for cloud-based natural language understanding services," 2019.
[3] E. A. Platanios, et al., "Contextual parameter generation for universal neural machine translation," 2018.
[4] T. Nguyen, et al., "MS MARCO: A human-generated machine reading comprehension dataset," 2016.
[5] C.-Y. Lin, et al., "Rouge: A package for automatic evaluation of summaries," 2004.
[6] K. Papineni, et al., "BLEU: a method for automatic evaluation of machine translation," 2002.
[7] B. Live!, "Top 10 benefits Chatbots Brings to Customers and Business Owners," 27 August 2019. [Online]. Available: https://tutorials.botsfloor.com/top-10-benefits-chatbots-brings-to-customers-and-business-owners-4dfd3ad60605.
[8] C. Crandell, "Chatbots Will Be Your New Best Friend," 23 October 2016. [Online]. Available: https://www.forbes.com/sites/christinecrandell/2016/10/23/chatbots-will-be-your-new-best-friend/#219f6de2a245.
[9] J. Berant, et al., "Semantic parsing on freebase from question-answer pairs," 2013.
[10] A. Yates, et al., "Textrunner: open information extraction on the web," 2007.
[11] M. Banko, et al., "Askmsr: Question answering using the worldwide web," 2002.
[12] A. Fader, et al., "Open question answering over curated and extracted knowledge bases," in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '14), New York, NY, USA, 2014.
[13] D. Ferrucci, et al., "Building Watson: An overview of the DeepQA project," 2010.
[14] D. Ferrucci, et al., "Towards the open advancement of question answering systems," in IBM, Armonk, NY, 2009.
[15] D. Chen, et al., "Reading wikipedia to answer open-domain questions," 2017.
[16] R. Lowe, et al., "Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses," 23 August 2017. [Online]. Available: https://arxiv.org/abs/1708.07149.
[17] O. Vinyals, et al., "A Neural Conversational Model," 19 June 2015. [Online]. Available: https://arxiv.org/abs/1506.05869.
[18] M.-T. Luong, et al., "Effective Approaches to Attention-based Neural Machine Translation," 17 August 2015. [Online]. Available: https://arxiv.org/abs/1508.04025.
[19] Viper, "Job Searching, Paper-Filing, and Buster-Worrying," 22 April 2010. [Online]. Available: https://www.grose.us/blog3/2010/04/22/job-searching-paper-filing-and-buster-worrying/
[20] K. Patel, "Here’s Why Chatbots are The Future of Consumer Engagement," 2020. [Online]. Available: https://www.greengeeks.com/blog/heres-why-chatbots-are-the-future-of-consumer-engagement/
[21] Dynamic Consultants, "Quick Start Implementation - Fast Financials," 2020. [Online]. Available: https://dynamics-consultants.co.uk/services/implementation/
[22] P. Malaiyandi, "Enable big data analytics and enhance compliance with the cloud," 26 September 2019. [Online]. Available: https://www.druva.com/blog/enable-big-data-analytics-and-enhance-compliance-with-the-cloud/
[23] A. Thomas, "Top 7 AI-Based Chatbots To Choose For Your Business," 17 May 2020. [Online]. Available: https://analyticsindiamag.com/top-7-ai-based-chatbots-to-choose-for-your-business/
During the second phase of the project, I carried out an exploratory data analysis (EDA) of my chosen dataset. I also worked on creating my neural network and generating some initial results. While doing so, one of the suggestions made by the forum and Professor Simsek during the first phase of presentations started making more sense: I came to the conclusion that focusing on a specific domain of questions would make my model creation task easier.
I noticed that the text generation quality of my model was not as good or as human-like as I wanted it to be. After some investigation, I concluded that the probable cause is that my model suffers from diversity of context. Even though the chosen dataset (MS Marco [4]) contains 100k records, the data comes from an unbounded set of domains. Therefore, the vocabulary available for each domain is small, which results in poor model quality.
Via my EDA, shown below, I was able to determine that my chosen dataset does not contain any domain information.
MS-Marco dataset - No domain information
MS-Marco dataset - No domain information
If you look at the two images above, you will notice that the dataset has information that allows me to look at question types from an NLP perspective. That is, I can find the distribution of "wh" questions: who, what, when, and where. But that does not tell me whether a "who" query is about a sportsman, an actor, a politician, etc. Similarly, I have information indicating whether a question asks for a description of an entity or for the location of a place, but it does not tell me whether the entity is a scientific or a literary concept, or whether the place is a restaurant or a hospital.
Due to this challenge, I decided to search for a supplementary dataset that would allow me to demonstrate the feasibility of my goal: using neural networks to create a QA system with a natural chat-like interface and, possibly, human-like responses. The supplementary dataset consists of Twitter-based customer service data [1]. I made it even more domain-specific by filtering it down to Apple's customer support queries only. In the following sections, I provide a comparative EDA report of the two datasets I am using.
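The filtering of the customer service data to Apple's support queries can be sketched as follows. This is a minimal illustration and not the code used in the project: the column names (tweet_id, author_id, inbound, text, in_response_to_tweet_id) and the AppleSupport handle are assumptions about the layout of the Twitter customer support data [1], and the sample rows and the apple_qa_pairs helper are hypothetical.

```python
# Hypothetical rows that mimic the assumed column layout of the dataset.
tweets = [
    {"tweet_id": 1, "author_id": "115712", "inbound": True,
     "text": "@AppleSupport my iPhone keeps restarting",
     "in_response_to_tweet_id": None},
    {"tweet_id": 2, "author_id": "AppleSupport", "inbound": False,
     "text": "@115712 We can help. Which iOS version are you on?",
     "in_response_to_tweet_id": 1},
    {"tweet_id": 3, "author_id": "115713", "inbound": True,
     "text": "@SpotifyCares the app will not load",
     "in_response_to_tweet_id": None},
    {"tweet_id": 4, "author_id": "SpotifyCares", "inbound": False,
     "text": "@115713 Please try reinstalling the app",
     "in_response_to_tweet_id": 3},
]


def apple_qa_pairs(rows, support_handle="AppleSupport"):
    """Pair each reply from the support handle with the customer tweet it answers."""
    by_id = {row["tweet_id"]: row for row in rows}
    pairs = []
    for row in rows:
        if row["author_id"] != support_handle:
            continue  # keep only replies written by the chosen support account
        question = by_id.get(row["in_response_to_tweet_id"])
        if question is not None and question["inbound"]:
            pairs.append((question["text"], row["text"]))
    return pairs


pairs = apple_qa_pairs(tweets)
for question, answer in pairs:
    print("Q:", question)
    print("A:", answer)
```

Keeping only (customer tweet, support reply) pairs for a single account yields the kind of (question, answer) data that a Seq2Seq model can be trained on.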
Above, I have drawn the neural network I designed. In the following section, I explain the details of the network. For my project, I used the Seq2Seq algorithm [3, 6, 8, 10] to train a denoising auto-encoder over sequences. It takes as input a sequence $Q = q_1, q_2, \ldots, q_T$ and generates a new sequence $A = a_1, a_2, \ldots, a_{T'}$ as output. The two sequences need not be the same (that is, $q_j \neq a_j$ in general) and may have different lengths, so $T \neq T'$ in general. This training approach is called a denoising auto-encoder because the question sequence $Q$ gets mapped to some related answer sequence $A$, as if $Q$ were a "noisy" version of $A$. There is no one-to-one relationship between question and answer. The heart of my chatbot is this sequence-to-sequence (seq2seq) model, whose goal is to take a variable-length question sequence as input and return a variable-length answer sequence as output.
In this project, I created question-and-answer pairs and a vocabulary, which is a dictionary in which each unique string or token is assigned its own unique ID. In natural language, not every sentence or sequence of words is of the same length. Therefore, I use <_PADDING_> to pad each sequence to the length of the longest sequence. This way, every sequence has the same length, which allows me to represent a batch of sequences as a tensor. The beginning and end of a sequence are represented using the <SOS> and <EOS> tokens, respectively. This helps the model by indicating when to stop generating the output sequence, even when the outputs have differing lengths.
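The vocabulary and padding steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: whitespace tokenization, lowercasing, the specific ID assignments, and the helper names build_vocab, encode, and pad_batch are hypothetical and not taken from the project code.

```python
PAD, SOS, EOS = "<_PADDING_>", "<SOS>", "<EOS>"


def build_vocab(sentences):
    """Assign each unique token its own integer ID; special tokens come first."""
    vocab = {PAD: 0, SOS: 1, EOS: 2}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Wrap a sentence in <SOS>/<EOS> and map every token to its ID."""
    tokens = [SOS] + sentence.lower().split() + [EOS]
    return [vocab[token] for token in tokens]


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


questions = ["my phone is slow", "battery drains fast after update"]
vocab = build_vocab(questions)
batch = pad_batch([encode(q, vocab) for q in questions])
for row in batch:
    print(row)
```

Because every row of batch now has the same length, the nested list could be converted directly into a rectangular tensor.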
The final batch of sequences from the dataset is in the form of nested tuples, $((Q, A), A)$. This is needed because seq2seq_model requires both $Q$ and $A$ during training. The train_seq2seq function expects tuples of (question, answer).
The components of my neural network include an embedding layer, an encoding layer RNN, and a decoding layer RNN [9, 10].
For the embedding layer, I used an nn.Embedding layer. The purpose of the embedding layer is to convert tokens into feature vectors.
For the encoding layer, I used a Gated Recurrent Unit (GRU). The input to the encoder is the embedded question sequence $Q = q_1, q_2, \ldots, q_T$. The purpose of the encoding layer RNN is to encode a variable-length question sequence into a fixed-length context vector. This context vector contains semantic information about the question.
For the decoding layer, I have three separate components.
The first component, the GRUCells layer, performs the decoding operation one step at a time, token by token.
The second component, the attention mechanism [5, 7], takes as input the encoder outputs, the output of the previous GRUCells layer, and a mask on the question sequence. This layer is the key to the seq2seq network. Its goal is to pay attention to the important inputs and selectively ignore the unimportant, superfluous, and distracting ones. The attention mechanism helps the decoder learn how to ignore the noise in the input. In my project I used the dot score to compute the attention weights, which indicate the importance of each token.
The third component is a fully connected network that predicts the next token from the concatenation of the attention context and the local decoded context.
The end goal of the decoding layer is to take a word and a context vector as input and predict the next word in the sequence.
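To make the dot-score attention step concrete, here is a small sketch in plain Python rather than PyTorch. It assumes a single decoder state, a list of encoder output vectors of the same dimension, and a boolean mask that is False on padding positions; the function name dot_attention and the toy numbers are hypothetical, and at least one position is assumed to be unmasked.

```python
import math


def dot_attention(decoder_state, encoder_outputs, mask):
    """Return (attention weights, context vector) using the dot score."""
    scores = []
    for vector, keep in zip(encoder_outputs, mask):
        if keep:
            # Dot score: inner product of the decoder state and one encoder output.
            scores.append(sum(d * e for d, e in zip(decoder_state, vector)))
        else:
            # Masked (padding) positions get -inf so their softmax weight is zero.
            scores.append(float("-inf"))

    # Numerically stable softmax over the question positions.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]

    # Context vector: attention-weighted sum of the encoder outputs.
    dim = len(encoder_outputs[0])
    context = [
        sum(weight * vector[i] for weight, vector in zip(weights, encoder_outputs))
        for i in range(dim)
    ]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # last row is padding
mask = [True, True, False]
decoder_state = [2.0, 0.0]
weights, context = dot_attention(decoder_state, encoder_outputs, mask)
print(weights)
print(context)
```

In the decoder described above, a context vector like this would be concatenated with the local decoded context before the fully connected layer predicts the next token.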
Training an RNN decoder relies on either the auto-regressive or the teacher forcing approach [2].
The auto-regressive approach uses the token predicted at time step $t$ as the input for the next time step $t+1$, sampling the next token based on the predicted probabilities. The auto-regressive approach can be slower to learn, but it makes the model generalize better because it does not need to be given the answer, as teacher forcing does. The auto-regressive approach is used in testing, since teacher forcing requires knowing the answer.
The teacher forcing approach gives the model the correct token at each time step and lets it continue its predictions for the next token. This makes it easier to predict all subsequent tokens correctly.
In my project, I have used a combination of these two methods.
For training I used a combination of the auto-regressive and teacher forcing approaches by randomly deciding which approach to take.
For prediction I used the auto-regressive approach since teacher forcing requires us to know the answer.
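The difference between the two decoding modes can be illustrated with a toy decoder. This sketch makes several assumptions that are not stated in the text: the choice between modes is made once per sequence, the hypothetical predict_next function is a deterministic lookup instead of a sampled network output, and the name decode_training_step is invented for illustration.

```python
import random


def decode_training_step(predict_next, answer, sos_token,
                         teacher_forcing_prob=0.5, rng=random):
    """Decode one answer, randomly picking teacher forcing or auto-regression."""
    use_teacher_forcing = rng.random() < teacher_forcing_prob
    previous = sos_token
    predictions = []
    for gold in answer:
        guess = predict_next(previous)
        predictions.append(guess)
        # Teacher forcing feeds the correct token; auto-regression feeds the guess.
        previous = gold if use_teacher_forcing else guess
    return predictions, use_teacher_forcing


# Toy "model": a lookup table that makes one deliberate mistake after "please".
table = {"<SOS>": "please", "please": "update", "restart": "your", "your": "phone"}


def predict_next(token):
    return table.get(token, "<EOS>")


answer = ["please", "restart", "your", "phone"]
forced, _ = decode_training_step(predict_next, answer, "<SOS>", teacher_forcing_prob=1.0)
free, _ = decode_training_step(predict_next, answer, "<SOS>", teacher_forcing_prob=0.0)
print(forced)
print(free)
```

With teacher forcing the single mistake stays local, while in auto-regressive mode it propagates to every later step, which is why the latter is harder to learn from but matches what happens at prediction time.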
TWCS dataset
MS-Marco dataset
On the left is a word cloud from Apple's customer questions. As you can clearly see, the words come from a specific domain and pertain to Apple or its products. The one on the right shows words ranging from geographical areas like a county, to the weather, to symptoms of medical diagnoses. Similarly, in the figures below, on the left we have words from a specific domain related to computer, mobile, or internet concepts, while on the right the generic domain talks about pain (a medical concept), years (a temporal concept), or even "yes", which can refer to any field. This kind of diversity makes it difficult for the network to learn concepts on its own, or even with contextual information.
TWCS dataset
MS-Marco dataset
TWCS dataset
MS-Marco dataset
Another issue I observed was that the questions in the customer service dataset contain significantly more words than the ones in the MS-Marco dataset. Observe the graphs just above and below: the customer service dataset, on the left, has fairly long questions, while the questions in the MS-Marco dataset are consistently very short. The same holds for the answer lengths. This is another reason why the models do not have a fair amount of words in the vocabulary to learn from and thus perform poorly when it comes to prediction/generation of natural language sentences.
The graphs above show the distribution of number of words in questions and answers for the two datasets I have chosen. The graphs below show the distribution of length in questions and answers by number of characters for the two datasets.
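Length distributions of this kind can be computed with a few lines of standard-library Python. The sketch below is an illustration only: the sample questions are made up, words are split on whitespace, and the helper name length_distributions is hypothetical.

```python
from collections import Counter


def length_distributions(texts):
    """Count how many texts have each word length and each character length."""
    words = Counter(len(text.split()) for text in texts)
    chars = Counter(len(text) for text in texts)
    return words, chars


sample_questions = ["what is ios", "how do i reset my iphone"]
words, chars = length_distributions(sample_questions)
print(sorted(words.items()))
print(sorted(chars.items()))
```

Each counter maps a length to the number of texts with that length, which is the data behind a histogram of question or answer lengths.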
TWCS Dataset
MS-Marco dataset
About image references: All images in part II P3 and notebooks of this project delivery are created by me.
[1] Thought Vector and S. Axelbrooke. "Customer Support on Twitter, version 10". 3 December 2017. Retrieved September 30, 2020 from https://www.kaggle.com/thoughtvector/customer-support-on-twitter.
[2] J. Brownlee. "What is Teacher Forcing for Recurrent Neural Networks?". 4 August 2019 [Online]. Available: https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
[3] A. Wearne. "Seq2Seq with Pytorch". 25 June 2019 [Online]. Available: https://medium.com/@adam.wearne/seq2seq-with-pytorch-46dc00ff5164
[4] T. Nguyen, et al. "MS MARCO: A human-generated machine reading comprehension dataset". 2016.
[5] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho & Y. Bengio. "Attention-based models for speech recognition". In Advances in neural information processing systems (pp. 577-585). 2015.
[6] A. M. Dai & Q. V. Le. "Semi-supervised sequence learning". In Advances in neural information processing systems (pp. 3079-3087). 2015.
[7] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk & Y. Bengio. "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". arXiv preprint arXiv:1406.1078. 2014.
[8] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville & J. Pineau. "Building end-to-end dialogue systems using generative hierarchical neural network models". arXiv preprint arXiv:1507.04808. 2015.
[9] D. Bahdanau, K. Cho & Y. Bengio. "Neural machine translation by jointly learning to align and translate". arXiv preprint arXiv:1409.0473. 2014.
[10] I. Sutskever, O. Vinyals & Q. V. Le. "Sequence to sequence learning with neural networks". In Advances in neural information processing systems. 2014.
A chatbot at its core is essentially a sequence-to-sequence (seq2seq) model [4 - 10]. The goal of the seq2seq model is to take a variable-length question sequence as input and return a variable-length answer sequence as output. The Seq2Seq algorithm can be implemented using a Gated Recurrent Unit (GRU), which is a type of Recurrent Neural Network (RNN), or using Transformers.
Deep learning models are often created by training neural networks several layers deep on hundreds of gigabytes of data, and they require multiple GPUs to handle the computations. This is where the Transformer approach is valuable. RNNs and Transformers are similar in design and can both handle sequential data, for example text in natural language [1]. The key difference is that Transformers do not need to process the data in order. Thus, if the input is text from a natural language source, the Transformer does not need to process the words in sequence from beginning to end. This allows the Transformer to take advantage of massive parallelization on GPUs. Because it has several layers and performs a massive number of parallel computations, the Transformer does really well as sequences get longer. Transformers also benefit from larger layers and more layers. RNNs with three layers and 512 neurons per layer do not do as well and cannot scale up to perform massively parallel computations the way a Transformer can. However, the Transformer can also suffer from this massive amount of parallel computation, as it increases the associated computation cost. Transformers can require hundreds or even thousands of GPUs, which can make them prohibitively expensive.
Given these advantages, Transformers have become one of the most popular models in deep learning. They tend to work well at improving accuracy on problems that require training on lots of data, by performing massive amounts of parallel computation.
In the following section, I have described the various components I used for creating my Seq2Seq model using Transformers and GPT2.
The first component is an embedding layer implemented using DistilBertModel [12 - 16] (i.e. distilbert-base-uncased). The purpose of the embedding layer is to convert tokens into feature vectors.
Next, for the encoder, I first used a positional encoder (i.e. PositionalEncoding) followed by a transformer encoder. The positional encoder adds position information to the embedded tensors so that the order of the sequence is taken into account. The transformer encoder I used was nn.TransformerEncoder, which takes a tensor of shape (T, B, D), that is, sequence length, batch size, and feature dimension, and can process all T items at once. The purpose of the encoder is to encode a variable-length question sequence into a context representation that carries semantic information about the question.
Finally, I have the decoder layer, which I implemented using TransformerDecoder and GPT2LMHeadModel and which generates the output one item at a time. The end goal of the decoder layer is to take a word and a context vector as input and predict the next word in the sequence.
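The text names the positional encoder only as PositionalEncoding and does not say how it is computed. As an illustration, the sketch below assumes the sinusoidal scheme from "Attention Is All You Need" [1] and builds the table in plain Python; the function name positional_encoding and the tiny sizes are hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table of sinusoidal position features."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each even/odd pair of dimensions shares one wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, d_model=4)
for row in pe:
    print([round(value, 3) for value in row])
```

Under this assumption, the row for position t would be added to the embedding of the token at position t before the sequence enters the transformer encoder, which is how word order is made visible to a model that otherwise processes all T items at once.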
Before we can train the model, we need to create the dataset. For that purpose, we use the tokenizer that matches each layer. Because I used DistilBertModel for the embedding layer, I used DistilBertTokenizer to tokenize my input data. Because I used GPT2 [11, 14] for the decoding layer, I used GPT2Tokenizer there, so as to have the exact same decoding process that the original GPT2LMHeadModel uses to convert tensors into strings.
The figure below shows the neural network architecture I built to create the Seq2Seq model using Transformers and GPT2.
Network architecture: GPT2 fine tuning with Transformer
The results above show the quality measures for the two datasets used in this study. The TWCS dataset [2], being the domain-focused one, shows better quality measures than the MS-Marco dataset [3]. The various issues with that dataset were presented above in part II of this project delivery. With GPT2 I observed only a marginal improvement in quality measures. The reason is that I was only able to use cloud-based GPUs for this project, and each iteration for GPT2 took around four hours. As a result, I would hit the Google Colab timeout after about four epochs. With access to better hardware it might be possible to improve this result.
Presented below is a sample chat using the model trained on the TWCS dataset [2] with the Seq2Seq approach. As you can see, a domain-specific dataset containing customer service interactions about Apple devices produces a much better and more human-like chat response than what we saw with the MS-Marco dataset [3].
TWCS using Seq2Seq model with RNN
The MS-Marco dataset [3] suffered from diversity of context. Even though the MS Marco dataset [3] contains 100k records, the data is from an unbounded set of domains and leads to a model that is unable to generate meaningful sentences, as can be seen below.
MS-Marco using Seq2Seq model with RNN
Using GPT2 fine tuning with Transformer provided marginal improvement and the chat results for both TWCS [2] and MS-Marco [3] datasets were very similar to what I got with the Seq2Seq model. See below for chat results for TWCS [2] followed by MS-Marco [3] datasets when using GPT2.
TWCS dataset using Seq2Seq model with GPT2 fine tuning and Transformer
MS-Marco dataset using Seq2Seq model with GPT2 fine tuning and Transformer
In conclusion, I was able to create a natural language model capable of chatting with users in a fairly human-like way. I created this model using customer service data focused on Apple's iOS devices. I built the model using two different Seq2Seq approaches: an RNN, and a Transformer combined with GPT2 fine-tuning. I used the Transformer because of the numerous advantages it has over standard Seq2Seq RNNs. Given that I was able to run only four epochs while training the Transformer model, I concluded that I would need a more powerful, and perhaps local, GPU to take full advantage of this network.
My models achieved a Rouge-L score of 0.96 on the TWCS dataset [2], while the MS-Marco dataset [3], which suffered from diversity of context, achieved a Rouge-L score of 0.51. As explained in Part II of this project's deliverable above, it makes sense that a domain-specific dataset would produce a better language model than a dataset spanning diverse domains.
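For readers unfamiliar with the metric, Rouge-L scores a generated answer by the longest common subsequence (LCS) it shares with the reference answer. The sketch below is a minimal sentence-level version and not the scoring code used to produce the numbers above: whitespace tokenization, lowercasing, and the equal weighting of precision and recall (beta = 1) are assumptions, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """Sentence-level Rouge-L F-measure based on LCS precision and recall."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


score = rouge_l("please restart your phone", "please restart the phone")
print(score)
```

In this example the two sentences share the subsequence "please restart phone" (three of four tokens on each side), so precision and recall are both 0.75.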
There are a few goals that I can strive towards in the future to improve the current chat solution and model.
While running on Google's Colab infrastructure, I realized that for the Transformer, which requires massive amounts of parallel computation on hardware like a GPU, it is better to have a local machine with a decent GPU. I would like to explore this path for creating my models in the future.
I would also like to try Google’s TPU (Tensor Processing Unit). Converting my input data into tensors was beyond the scope of this project, so I think it could be a good goal for a future project.
Deploy the model to a live server and connect it to chatbot platforms like Facebook. I initially wanted to deploy my model and connect it to a platform like Facebook. However, that would have required me to obtain and pay for a hosting service like Heroku, which I decided not to do in this project. If I am able to do so in the future, that could be exciting. It would also allow me to use the chatbot platform to create a feedback loop and improve my model by chatting with real users.
About image references: All images in part III P3 and notebooks of this project delivery are created by me.
[1] A. Vaswani et al. 2017. "Attention Is All You Need." arXiv:1706.03762 [cs.CL].
[2] Thought Vector et al. 2017. "Customer Support on Twitter, version 10." https://www.kaggle.com/thoughtvector/customer-support-on-twitter
[3] T. Nguyen et al. 2016. "MS Marco: A human-generated machine reading comprehension dataset." arXiv:1611.09268 [cs.CL].
[4] A. Wearne. 2019. "Seq2Seq with Pytorch." https://medium.com/@adam.wearne/seq2seq-with-pytorch-46dc00ff5164
[5] J. K. Chorowski et al. 2015. "Attention-based models for speech recognition." In Advances in neural information processing systems (pp. 577-585).
[6] A. M. Dai & Q. V. Le. 2015. "Semi-supervised sequence learning." In Advances in neural information processing systems (pp. 3079-3087).
[7] K. Cho et al. 2014. "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation." arXiv preprint arXiv:1406.1078.
[8] I. V. Serban et al. 2015. "Building end-to-end dialogue systems using generative hierarchical neural network models." arXiv preprint arXiv:1507.04808.
[9] D. Bahdanau et al. 2014. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473.
[10] I. Sutskever et al. 2014. "Sequence to sequence learning with neural networks." In Advances in neural information processing systems.
[11] Dejan Batanjac. 2020. "GPT2 receipt example." https://dejanbatanjac.github.io/gpt2-example/
[12] Huggingface. 2020. "BERT." https://huggingface.co/transformers/model_doc/bert.html
[13] Huggingface. 2020. "Loading Google AI or OpenAI pre-trained weights or PyTorch dump." https://huggingface.co/transformers/v2.1.1/serialization.html
[14] Huggingface. 2020. "OpenAI GPT2." https://huggingface.co/transformers/model_doc/gpt2.html
[15] Huggingface. 2020. "DistilBERT." https://huggingface.co/transformers/model_doc/bert.html
[16] Victor Sanh et al. 2020. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv:1910.01108 [cs.CL].