Text summarization is the technique for generating a concise and precise summary of voluminous texts while focusing on the sections that convey useful information, and without losing the overall meaning. Automatic text summarization aims to transform lengthy documents into shortened versions, something which could be difficult and costly to undertake if done manually. Machine learning algorithms can be trained to comprehend documents and identify the sections that convey important facts and information before producing the required summarized texts.
With the present explosion of data circulating in the digital space, most of which is unstructured textual data, there is a need for automatic text summarization tools that allow people to extract insights from it easily. Currently, we enjoy quick access to enormous amounts of information. However, much of this information is redundant or insignificant and may not convey the intended meaning. Hence, automatic text summarizers capable of extracting the useful information while leaving out the inessential and insignificant data are becoming vital. Summarization can enhance the readability of documents, reduce the time spent searching for information, and allow more information to be fitted into a given space.
There are two main forms of Text Summarization:
Extractive: A method to algorithmically find the most informative sentences within a large body of text which are used to form a summary.
Abstractive: A method to algorithmically generate concise phrases that are semantically consistent with the large body of text. This is far more in line with how humans summarize text, but it is also far more difficult to implement.
This project focuses on Abstractive Summarization.
As part of the TIPSTER Text Summarization Evaluation Conference (SUMMAC), a corpus of 183 documents (research papers) from the Computation and Language collection has been marked up in XML and made available as a general resource.
The documents are scientific papers which appeared in Association for Computational Linguistics (ACL) sponsored conferences. The markup is based on automatic conversion from LaTeX to XML, and as a result is fairly minimal. It includes tags covering core information such as title, author and date, as well as basic structure such as abstract, body, sections and lists.
This project aims to provide a text analysis of a given document and to summarize it (abstractive summarization) to a fraction of its length. This is an important and complicated task in natural language processing. For the scope of this project, we will be using an encoder-decoder architecture with an attention mechanism, which is used for natural language processing problems that generate variable-length output sequences.
Data cleaning is a crucial step in any machine learning project, but even more so for an NLP task. Without cleaning, the dataset is often a jumble of words from which a machine cannot infer the implicit meaning. For this project, I performed two separate pre-processing passes on the text: one for the EDA and the other to prepare the data for modeling.
From the above histogram we can see that an average article in the dataset contains 5093 words of text, whereas an average summary contains 114 words. The average summary is therefore only a small fraction (roughly 2%) of the length of its article.
We need to clean our dataset before we perform an Exploratory Data Analysis on the data. Data cleaning is performed in the following manner:
Removal of punctuation, numbers, special symbols and stop-words. Though they are responsible for maintaining the context in a paragraph, we remove these items because they do not carry any meaning by themselves.
Conversion of text to lower-case and tokenizing it. I have used NLTK (Natural Language Toolkit) module to tokenize the text.
Lemmatize the tokens and perform Topic Modeling, Word Cloud and Named-Entity Recognition (a sketch of this cleaning pipeline is shown after this list).
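A minimal sketch of this cleaning pipeline, assuming NLTK's English stop-word list and WordNet lemmatizer (the regular expression and example sentence are illustrative only):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_for_eda(text):
    # Remove punctuation, numbers and special symbols, then lower-case the text.
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    # Tokenize with NLTK and drop stop-words.
    tokens = [t for t in nltk.word_tokenize(text) if t not in stop_words]
    # Lemmatize the remaining tokens.
    return [lemmatizer.lemmatize(t) for t in tokens]

print(clean_for_eda("The parsers were handling 183 papers on zero-anaphora resolution."))
```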
Word clouds or tag clouds are graphical representations of word frequency that give greater prominence to words that appear more frequently in a source text. The larger the word in the visual the more common the word was in the document.
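For instance, a word cloud can be generated from the cleaned text with the wordcloud package (the package choice, parameters and toy text here are assumptions, shown only as a sketch):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

cleaned_text = "parser grammar anaphora resolution japanese sentence clause subject parser grammar"

# Word sizes in the image are proportional to word frequency in the text.
wc = WordCloud(width=800, height=400, background_color="white").generate(cleaned_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```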
NER is a sub-task of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, time expressions etc. GPE refers to Geo-Political Entity, a category covering locations such as countries, cities and states associated with a person or organization.
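Named entities, including GPE labels, can be extracted for example with NLTK's chunker (an illustrative sketch; the labels shown in the comment are what such a sentence would typically yield):

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

sentence = "The workshop was organized by researchers from Google in Japan."
# NER runs over part-of-speech-tagged tokens and yields PERSON / ORGANIZATION / GPE chunks.
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
entities = [(" ".join(word for word, _ in chunk), chunk.label())
            for chunk in tree if hasattr(chunk, "label")]
print(entities)  # e.g. [('Google', 'ORGANIZATION'), ('Japan', 'GPE')]
```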
I have used NMF (Non-negative Matrix Factorization) to extract the key topics from the documents. NMF converts high-dimensional vectors into a low-dimensional representation, similar to PCA. Each document is first represented by term frequency-inverse document frequency (TF-IDF) weights; NMF then factorizes this TF-IDF matrix into a document-topic matrix and a topic-word coefficient matrix by minimizing a reconstruction objective.
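A sketch of this topic extraction with scikit-learn's TfidfVectorizer and NMF (the toy documents, number of topics and number of top words are arbitrary choices for illustration):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "unification grammar parser for japanese sentences",
    "zero anaphora resolution with centering theory",
    "statistical machine translation of news articles",
]

# Build the TF-IDF matrix, then factorize it into document-topic (W) and topic-word (H) matrices.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

terms = tfidf.get_feature_names_out()
for k, topic in enumerate(H):
    top_terms = [terms[i] for i in topic.argsort()[::-1][:3]]
    print(f"Topic {k}: {top_terms}")
```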
For this project, I have chosen PEGASUS pre-trained model, for the abstractive summarization which is based on a transformer encoder-decoder model. Before we examine the PEGASUS model, let us look at the architecture of encoder-decoder model and transformers.
The encoder-decoder architecture is applied as follows:
Encoder: The encoder is responsible for reading the source document and encoding it into an internal representation. Each word is embedded as a vector, and the resulting sequence of vectors is padded and fed to the encoder model.
Decoder: The decoder is responsible for generating each word in the output summary using the encoded representation of the sequence. At every step it generates a probability distribution over the vocabulary of the dataset, from which the word with the highest probability is picked for the output sequence.
Attention: We use an attention mechanism to extract the context (meaning) from the source data. It scores the encoder hidden states against the current decoder state using a dot-score or general-score function, and the resulting context vector is fed to the decoder together with the hidden-state representations of the encoder (a sketch of these scoring functions follows this list).
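A minimal sketch of the dot and general (Luong-style) scores between a decoder hidden state and the encoder hidden states, assuming PyTorch (the tensor sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size = 256
encoder_states = torch.randn(12, hidden_size)          # one hidden state per source token
decoder_state = torch.randn(hidden_size)               # current decoder hidden state
W_a = nn.Linear(hidden_size, hidden_size, bias=False)  # learned matrix used by the "general" score

# Dot score: similarity is a plain dot product between decoder and encoder states.
dot_scores = encoder_states @ decoder_state
# General score: the decoder state is first transformed by the learned matrix.
general_scores = encoder_states @ W_a(decoder_state)

# Softmax turns the scores into attention weights; the context vector is their weighted sum.
weights = F.softmax(dot_scores, dim=0)
context = weights @ encoder_states                      # fed to the decoder along with its hidden state
```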
Encoder-decoder models often use Recurrent Neural Networks (RNNs), which process a sequence one token per time step. Because of this, they take a lot of processing time to generate text. Transformers address this problem by replacing recurrence with attention mechanisms that can process all tokens in parallel.
Transformers have an architecture similar to the encoder-decoder model and use self-attention to boost the speed of text generation. Each encoder block consists of two sub-layers, a self-attention layer followed by a feed-forward network, each followed by a normalization layer.
The encoder's inputs first flow through the self-attention layer, which helps the encoder to look at the other words in the input sequence as it encodes a specific word.
The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three weight matrices that are learned during training. They are abstractions that are useful for calculating and thinking about attention.
In the second step we calculate a score for every word in the input sequence against the word currently being encoded, by taking the dot product of that word's Query vector with each Key vector. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
In the third step we scale the scores and normalize them with a softmax, so that they sum to one and irrelevant words receive weights close to zero. We then sum up the weighted Value vectors, which produces the output of the self-attention layer. These outputs are then fed to the feed-forward network, a position-wise fully connected network applied to each position independently.
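These three steps amount to (single-head) scaled dot-product self-attention; a compact sketch, assuming PyTorch:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, seq_len = 64, 10
x = torch.randn(seq_len, d_model)                     # one embedding vector per input word

# Step 1: project each embedding into Query, Key and Value vectors with learned matrices.
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

# Step 2: score every word against every other word (scaled dot products).
scores = Q @ K.T / math.sqrt(d_model)

# Step 3: softmax pushes irrelevant words toward zero weight; the weighted sum of
# Value vectors is the self-attention output, which goes to the feed-forward network.
attn = F.softmax(scores, dim=-1)
out = attn @ V
```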
PEGASUS uses a self-supervised pre-training objective called gap-sentence generation for Transformer encoder-decoder models, designed to improve fine-tuning performance on abstractive summarization. In PEGASUS pre-training, several whole sentences are removed from a document and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together.
The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.
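As a toy illustration of gap-sentence generation (the [MASK1] token follows the PEGASUS paper; the choice of which sentence to remove is arbitrary here, whereas PEGASUS selects sentences by importance):

```python
document = [
    "Our parser is based on unification grammar.",
    "It handles zero anaphora in Japanese sentences.",
    "Centering theory guides the pronoun resolution.",
]
masked_idx = {1}  # sentence(s) chosen to be removed

# Pre-training input: the document with the chosen sentences replaced by a mask token.
model_input = " ".join("[MASK1]" if i in masked_idx else s for i, s in enumerate(document))
# Pre-training target: the removed sentences concatenated together.
target = " ".join(s for i, s in enumerate(document) if i in masked_idx)

print(model_input)
print(target)
```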
PEGASUS was pre-trained on two large corpora of text:
C4, or the Colossal and Cleaned version of Common Crawl, introduced in Raffel et al. (2019); consists of text from 350M Web-pages (750GB).
HugeNews, a dataset of 1.5B articles (3.8TB) collected from news and news-like websites from 2013 to 2019. A whitelist of domains ranging from high-quality news publishers to lower-quality sites such as high-school newspapers and blogs was curated and used to seed a web crawler. Heuristics were used to identify news-like articles, and only the main article text was extracted as plain text.
For pre-processing of text for the model, I used the custom PegasusTokenizer which serves as the pipeline for cleaning, tokenizing and vectorizing the data for the PEGASUS model.
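A sketch of this tokenization step with the Hugging Face PegasusTokenizer (the checkpoint name, maximum lengths and example strings are assumptions for illustration):

```python
from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

article = "our aim is to formalize constraints needed to develop a parser based on unification grammar ..."
summary = "the aim of this paper is to deal with constraints needed to develop a parser ..."

# Articles become the encoder inputs; the reference summaries become the decoder labels.
inputs = tokenizer(article, truncation=True, max_length=1024,
                   padding="max_length", return_tensors="pt")
labels = tokenizer(summary, truncation=True, max_length=128,
                   padding="max_length", return_tensors="pt")
```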
I divided my dataset into train and test sets in the ratio 60:40. I didn't create a validation set since my dataset is small. I trained the model for 10 epochs using Adafactor to optimize the gradients, and then evaluated the results using the BLEU (Bilingual Evaluation Understudy) metric to measure the quality of the generated summaries.
Adafactor is a stochastic optimization algorithm based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity by maintaining a factored representation of the squared gradient accumulator across training steps. By tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, the authors were able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For this project I have set the hyperparameters to the default values proposed in the research paper.
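A condensed sketch of the fine-tuning loop, assuming the Hugging Face PegasusForConditionalGeneration model and the Adafactor implementation shipped with transformers; the checkpoint name is an assumption, the hyperparameters shown are the relative-step defaults, and train_loader is an assumed DataLoader yielding the tokenized batches from above:

```python
from transformers import PegasusForConditionalGeneration
from transformers.optimization import Adafactor

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
# Adafactor with relative step sizes and parameter scaling (no fixed learning rate).
optimizer = Adafactor(model.parameters(), lr=None, relative_step=True,
                      scale_parameter=True, warmup_init=True)

model.train()
for epoch in range(10):
    for batch in train_loader:  # assumed: dicts with input_ids, attention_mask, labels
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"])
        loss = outputs.loss      # cross-entropy over the output vocabulary
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```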
The CrossEntropyLoss in PyTorch combines LogSoftmax and NLLLoss in one single class. It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning a weight to each of the classes; this is particularly useful when you have an unbalanced training set. The input is expected to contain raw, unnormalized scores for each class and has to be a Tensor of size (minibatch, C), or (minibatch, C, d1, d2, ..., dK) with K ≥ 1 for the K-dimensional case. This criterion expects a class index in the range [0, C-1] as the target for each value of a 1D tensor of size minibatch; if ignore_index is specified, the criterion also accepts this class index (this index may not necessarily be in the class range).
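For example, with raw logits of shape (minibatch, C), where C is the output vocabulary size (the sizes and indices below are illustrative):

```python
import torch
import torch.nn as nn

vocab_size = 96000                                # C: size of the output vocabulary (illustrative)
loss_fn = nn.CrossEntropyLoss(ignore_index=0)     # e.g. ignore targets equal to the padding id

logits = torch.randn(4, vocab_size)               # raw, unnormalized scores for each class
targets = torch.tensor([15, 0, 873, 42])          # class indices in [0, C-1]; the 0 target is ignored
loss = loss_fn(logits, targets)
print(loss.item())
```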
The BLEU algorithm compares consecutive phrases of the automatic translation with the consecutive phrases it finds in the reference translation, and counts the number of matches in a weighted fashion. These matches are position independent. A higher match degree indicates a higher degree of similarity with the reference translation and a higher score. Intelligibility and grammatical correctness are not taken into account. BLEU's strength is that it correlates well with human judgment by averaging out individual sentence judgment errors over a test corpus, rather than attempting to devise the exact human judgment for every sentence.
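BLEU can be computed, for instance, with NLTK's sentence_bleu (smoothing is added here because generated summaries are short; the example strings are illustrative):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the aim of this paper is to formalize constraints for a unification grammar parser".split()
candidate = "this paper formalizes constraints for a parser based on unification grammar".split()

# Weighted n-gram precision (up to 4-grams) of the candidate against the reference summary.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```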
The difference in training loss for PEGASUS after each epoch was negligible. This is because the model was pre-trained on roughly 1.8 billion documents of text and has been fine-tuned on many TensorFlow datasets such as XSum, CNN/DailyMail and Gigaword. After training, I compared the accuracy of the model with the Gensim and T5 summarizers.
Gensim summarizes a given document based on the ranks of text sentences using a variation of the TextRank algorithm. The output summary will consist of the most representative sentences from a given document.
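A sketch of the Gensim baseline, assuming gensim < 4.0 (the summarization module was removed in gensim 4.x); the ratio parameter controls the fraction of sentences kept, and the toy article is illustrative:

```python
from gensim.summarization import summarize

article = ("Our aim is to formalize constraints for a unification-grammar parser of Japanese. "
           "Just parsing syntactically is not enough for natural language understanding. "
           "Zero anaphora resolution requires syntactic, semantic and pragmatic constraints. "
           "Centering theory is a promising candidate for the pragmatic constraints.")

# TextRank-style extractive summary: the most representative sentences are returned verbatim.
print(summarize(article, ratio=0.25))
```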
T5 is also a transformer encoder-decoder model, developed by researchers at Google, which is pre-trained on a multi-task mixture of supervised and unsupervised tasks, where each task is converted into a text-to-text format.
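The T5 baseline can be run through the same transformers interface by prefixing the task name to the input (the checkpoint and generation parameters here are assumptions):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

article = "our aim is to formalize constraints needed to develop a parser based on unification grammar ..."
# T5 casts summarization as text-to-text by prepending a task prefix to the input.
inputs = tokenizer("summarize: " + article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(inputs["input_ids"], max_length=60, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```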
From the below results we can see that Gensim had the highest accuracy in summarizing the article, with PEGASUS the lowest. Gensim is an extractive summarizer: it selects the most important sentences in the article and appends them to the summary. PEGASUS works in a related way, in that whole sentences are masked in the input during pre-training and the decoder is tasked with recovering them. The dataset consists of research papers from the Computation and Language collection, where the text is composed of scientific information. Since PEGASUS was trained predominantly on news articles and other dumps of text, it is a demanding task for PEGASUS to retain all the scientific information from an article in its summary.
Even during pre-training, the sequences (articles) were truncated to 1024 words/tokens due to resource constraints, and the model cannot encode sequences longer than 1024 tokens. But a research paper contains about 5,000 words on average, so when we encode articles longer than 1024 tokens we lose part of the information.
The articles are arranged in ascending order with respect to their length.
introduction our aim is to formalize constraints that are needed to develop a parser based on unification grammar called ug henceforth so that our parser can deal with variety of types of sentences in japanese however just parsing syntactically is not enough for natural language understanding one important and necessary task to be done when a parser processes a discourse in japanese is the so called zero anaphora resolution all of syntactic semantic and pragmatic constraints are to be involved to resolve zero anaphora of course some of omitted pronouns are syntactically resolved for instance vp with suffix te is not regarded as a clause but a conjunct vp therefore the subject of the vp with te which is possibly omitted from surface should corefer with the subject of the sentence one example is hanako felt cold and closed the window where both of zero subjects and refer to the sentential topic hanako in this example one of the possible accounts for this interpretation is the following zero subject of te phrase is anaphoric pronominal or pro in gb term as the result is controlled by the subject of the main vp which is also zero subject is in gb term anaphoric pronominal or pro the sentential topic hanako is the only possible antecedent of this zero subject in this example however in complex sentences things are quite different consider the following sentence 1 since hanako behaved like feeling cold i closed the window 2 since i behaved like feeling cold hanako closed the window if contextually we can take only hanako and the speaker of this sentence as candidates of antecedent of or intuitively the following two interpretations are equally likely a hanako speaker b speaker hanako therefore and are both pro in fact this fact is well known among japanese linguists i e as a result zero anaphora resolution of complex sentence is not only to be done syntactically but also to be done pragmatically and or semantically one of the promising candidate for this is the centering theory to apply the centering theory that is originally for a sequence of sentences namely discourse we regard the subordinate clause
the aim of this paper is to deal with constraints that are needed to develop a based on unification grammar so that our parser can deal with some types of sentences in variety. <n> the main result is the following. <n>
our parser can deal with variety of types of sentences in Japanese.\nparser processes a discourse in Japanese, is the so called zero\nomitted from surface, should corefer with the subject of the sentence.\nof the main VP, which is also zero subject.\ntopic Hanako is the only possible antecedent of this zero\n2. `Since I behaved like feeling cold, Hanako closed the window.'\n2. `Since I behaved like feeling cold, Hanako closed the window.
u.s. linguists have proposed a 'conjunctive particle' in the main clause. the subordinate clause is the subject of the VP withsuffix te, which is possibly omitted from surface, or pro
Despite obtaining the lowest BLEU score of the three models, our model was able to generate some high-quality summaries. However, it is best to use this model for articles with no more than 1000 words, since longer sequences are truncated in the encoder module. Abstractive summarizers perform better where generic summaries are needed, while extractive summarizers are better at generating context-specific summaries.
Building an abstractive summarizer requires a high level of optimization to train the model on long sequences.
The vocabulary used to build the word2vec embeddings is not lemmatized, which results in a very large corpus of words.
The domain ontology for a specific summarizer has to be defined by the domain experts.
Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis
https://www.cs.bham.ac.uk/~pxt/IDA/text_summary.pdf
A Deep Reinforced Model for Abstractive Summarization
https://arxiv.org/pdf/1705.04304.pdf
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
https://arxiv.org/pdf/1912.08777.pdf
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
https://arxiv.org/pdf/1804.04235.pdf
Transformer Neural Network
https://deepai.org/machine-learning-glossary-and-terms/transformer-neural-network
PEGASUS Modules
https://github.com/google-research/pegasus
Dataset - TIPSTER
https://www-nlpir.nist.gov/related_projects/tipster_summac/cmp_lg.html
Domain-specific Informative and Indicative Summarization for Information Retrieval
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.385&rep=rep1&type=pdf