Mahima Thakkar

COVID-19 Research Paper Text Summarization


Phase I

Background

With the growth of digitized text, automatic text summarization helps readers both discover relevant information and consume it faster.
Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document.[1]
During this pandemic, a large number of research papers have been published, covering areas such as health, vaccination, and the impact of COVID-19 on the economy and on people's social and mental well-being.
With the COVID-19 pandemic, there is a growing urgency for the medical community to keep up with the accelerating growth of new coronavirus-related literature.[2]

AIM

The aim of the project is to bridge the gap between researchers and the rapidly growing body of publications. The work could help overburdened health workers find scientific answers during a time of crisis.

Dataset Description

The data was collected from the COVID 19 NLP Insight Hub portal, which provides links to the ScienceDirect research paper repository: https://www.sciencedirect.com/articlelist/covid

Because of access restrictions on the website, automatic scraping of the data was not possible, so the research papers were downloaded manually to form the dataset.
These papers cover the areas that have been impacted by the COVID-19 pandemic. The dataset contains 120 articles of varying length.

Phase II

Data Pre-Processing

Data cleaning is a crucial step in the machine learning pipeline. As the saying goes, "What we feed is what we get back."

Data is arguably one of the most vital assets an organization has to help support and guide its success.[5]
When working with language modelling and NLP, it is necessary to clean the text before applying any model to the data.

The text of each PDF file was stored in a nested dictionary. Each top-level key was the title of the PDF, and its value was a dictionary with the key 'Content' and the text of the PDF as the value.
Because the text was raw data extracted from the PDF files, cleaning it was extremely important.
The text cleaning process included the following:
- Removing URLs.
- Removing punctuation.
- Removing stop words.
- Removing numbers.
- Removing short words.
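The cleaning steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the `STOP_WORDS` set is a tiny hypothetical stand-in for a full stop-word list, the minimum word length of 3 is an assumed threshold, and the sample title and text are invented.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to",
              "is", "are", "was", "were", "for", "with", "at"}
MIN_WORD_LENGTH = 3  # assumed threshold for what counts as a "short word"

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Apply the five cleaning steps and return a lower-cased, space-joined string."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)                   # remove numbers
    text = text.translate(PUNCT_TO_SPACE)              # remove punctuation
    words = [w for w in text.split()
             if w not in STOP_WORDS and len(w) >= MIN_WORD_LENGTH]  # stop words, short words
    return " ".join(words)


# Nested dictionary keyed by PDF title, as described above (sample values are invented).
papers = {
    "Sample Paper Title": {
        "Content": "In 2020, the lockdown was in place; see https://example.org/report for 3 details."
    }
}
cleaned = {title: {"Content": clean_text(entry["Content"])}
           for title, entry in papers.items()}
print(cleaned["Sample Paper Title"]["Content"])
```

URLs are removed before punctuation so that a link is dropped as a whole rather than being split into fragments that survive as ordinary words.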

Original Text

Cleaned Text

As can be seen, the cleaned text still contains human names, and those names also appear in the generated summaries. Several techniques were tried to remove them. One used the nameparser library, whose HumanName class parses human names. However, this approach also matched geographical locations such as state and country names, so using it would have discarded crucial knowledge such as geographical information. Another option is to eliminate low-frequency words, that is, words that occur only a few times in the text. This is not viable either, because any count threshold would also eliminate words that occur only a few times but are important for the context. Human name elimination was therefore dropped from data preprocessing.

Exploratory Data Analysis

Word Cloud

A word cloud is used as a visualization technique for textual data. The words that appear in the word cloud say a lot about the topics in the text. From it, we can readily infer that the paper is about COVID, chronic disease, people, mental health, education, South Asia, etc.

Frequency Distribution of Words

The formation of sentences is a highly structured and history-dependent process.[4] The probability of using a specific word in a sentence strongly depends on the 'history' of word usage earlier in that sentence.[4] The cited work studies a simple history-dependent model of text generation in which the sample space of word usage shrinks, on average, as a sentence is formed.[4] Thus, to get the gist of a paper, the frequency distribution of words helps reveal the keywords used in the text.
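A word frequency distribution of this kind can be computed in a few lines. The sketch below is illustrative only: it assumes the text has already been cleaned into space-separated lower-case words, and the sample string is invented.

```python
from collections import Counter


def top_keywords(cleaned_text, k=5):
    """Return the k most frequent words with their counts, most frequent first."""
    counts = Counter(cleaned_text.split())
    return counts.most_common(k)


sample = "covid lockdown health covid mental health covid education lockdown covid"
for word, count in top_keywords(sample, k=3):
    print(f"{word}: {count}")
```

Plotting these counts against their rank is one way to turn the same data into the frequency distribution chart.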

Keywords

Montemurro and Zanette algorithm

The algorithm identifies words that are significant to the structure of the document - these often correspond to the major themes. It does so independently of a corpus.

We used the mz_keywords function from the gensim library, which extracts weighted keywords based on the MZ algorithm.
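The idea behind the algorithm can be sketched without gensim. The code below is a simplified illustration, not gensim's implementation and not the full published method: it splits the text into equal blocks, measures how unevenly each word is spread across them (entropy), and compares that with the same measurement on randomly shuffled copies of the text. The function names, the block count, and the number of shuffles are all assumptions made for this example; the sketch also avoids gensim because, as far as we know, the `gensim.summarization` package that provided `mz_keywords` was removed in Gensim 4.0.

```python
import math
import random
from collections import Counter


def block_entropy(counts):
    """Shannon entropy (bits) of how one word's occurrences are spread over the blocks."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def per_word_entropy(words, n_blocks):
    """Split the word sequence into n_blocks equal parts and score every word."""
    size = math.ceil(len(words) / n_blocks)
    counts = {}
    for i, w in enumerate(words):
        counts.setdefault(w, [0] * n_blocks)[i // size] += 1
    return {w: block_entropy(c) for w, c in counts.items()}


def mz_style_scores(words, n_blocks=4, n_shuffles=20, seed=0):
    """Weight = word frequency * (average entropy in shuffled text - entropy in real text)."""
    observed = per_word_entropy(words, n_blocks)
    rng = random.Random(seed)
    shuffled, baseline = list(words), Counter()
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for w, h in per_word_entropy(shuffled, n_blocks).items():
            baseline[w] += h / n_shuffles
    freq = Counter(words)
    return {w: freq[w] / len(words) * (baseline[w] - observed[w]) for w in freq}


# Toy document of 4 blocks x 10 words: "vaccine" is concentrated in the first block,
# while "study" appears exactly once in every block.
fillers = (f"w{i}" for i in range(32))
blocks = [["vaccine"] * 4 + ["study"] + [next(fillers) for _ in range(5)]]
blocks += [["study"] + [next(fillers) for _ in range(9)] for _ in range(3)]
words = [w for block in blocks for w in block]

scores = mz_style_scores(words)
for word in sorted(scores, key=scores.get, reverse=True)[:2]:
    print(word, round(scores[word], 4))
```

A word that clusters in one part of the document scores higher than an equally frequent word that is spread evenly, which is why the weights tend to pick out thematic terms without needing a reference corpus.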

Weighted Keywords

Keywords

Topics derived from NMF Topic Modeling

NMF stands for "non-negative matrix factorization".[6] NMF achieves its interpretability by decomposing samples as sums of their parts.[6] For example, NMF decomposes documents as combinations of common themes, and images as combinations of common patterns.[6] NMF has components which it learns from the samples.[6] The entries of the NMF components are always non-negative, as are the NMF feature values.[6]
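The decomposition can be illustrated with a small pure-Python sketch. It uses multiplicative update rules for the squared reconstruction error, which is one common way to fit NMF; it is not necessarily what the project or any particular library does, and the vocabulary, the document-term counts, and the parameter values are invented for the example.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor v (documents x terms) into w (documents x topics) and h (topics x terms)."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(len(h[0]))]
             for k in range(n_topics)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(len(w))]
    return w, h


# Invented document-term counts: two health-themed and two economy-themed documents.
vocab = ["covid", "vaccine", "lockdown", "economy", "income"]
v = [[3, 2, 1, 0, 0],
     [2, 3, 0, 0, 0],
     [0, 0, 1, 3, 2],
     [0, 0, 0, 2, 3]]

w, h = nmf(v, n_topics=2)
for k, topic in enumerate(h):
    top_terms = sorted(range(len(vocab)), key=lambda j: topic[j], reverse=True)[:3]
    print(f"topic {k}:", [vocab[j] for j in top_terms])
```

Because every update multiplies non-negative numbers, both factors stay non-negative, and each row of `h` can be read as a topic whose largest entries are its most characteristic terms.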

From the above snapshot, we can tell a lot about the paper and get the gist of the topics it presents. This kind of analysis can be extremely helpful when dealing with a large number of papers and trying to extract useful information.

Algorithms

The final aim of the project is to create summaries.

Two types of summaries are generated:

- Extractive Summary
- Abstractive Summary

Extractive Summary

Extractive summarization aims to identify the salient information in a document, extract it, and group it together to form a concise summary.[7] Extractive summaries are relatively easy to generate and require less computational power. The underlying methods are comparatively simple algorithms that select existing sentences rather than generate new ones.
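The select-and-concatenate idea can be shown with a deliberately simple heuristic. The sketch below is not the method used by Gensim or BERT-Summarizer: it scores each sentence by the average document-wide frequency of its words and keeps the top-scoring sentences in their original order. The function name, the scoring rule, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=2):
    """Pick the n highest-scoring sentences and return them in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


text = ("Lockdown measures affected mental health. "
        "The weather was pleasant. "
        "Lockdown measures also affected household income.")
print(extractive_summary(text, n_sentences=2))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive summary.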

The following models were used to obtain extractive summaries:

- Gensim
- BERT-Summarizer

Gensim

"Gensim can process arbitrarily large corpora, using data-streamed algorithms. There are no 'dataset must fit in RAM' limitations. The Gensim community also publishes pre-trained models for specific domains like legal or health, via the Gensim-data project. Thus Gensim is a ready-to-use model. It is the fastest library for training vector embeddings – Python or otherwise. The core algorithms in Gensim use battle-hardened, highly optimized & parallelized C routines."[8]

OUTPUT

low uptake covid prevention behaviours high socioeconomic impact lockdown measures south asia evidence large scale multi country surveillance programme journal pre proof low uptake covid prevention behaviours high socioeconomic impact lockdown measures south asia evidence large scale multi country surveillance programme dian kusuma rajendra pradeepa khadija khawaja mehedi hasan samreen siddiqui sara mahmood syed mohsin ali shah chamini silva laksara silva manoja gamage menka loomba vindya rajakaruna abu hanif rajan babu kamalesh balachandran kumarendran marie loh archa misra asma tassawar akansha tyagi swati waghdhare saira burney sajjad ahmad viswanathan mohan malabika sarker ian goon anuradhani kasturiratne jaspal kooner prasad katulanda sujeet jha ranjit mohan anjana malay mridha franco sassi john chambers behalf nihr global health research unit diabetes cardiovascular disease south asia pii s2352 reference ssmph appear ssm population health received date november revised date january accepted date february please cite article kusuma pradeepa khawaja k.i hasan siddiqui mahmood ali shah s.m .mridha m.k sassi chambers j.c behalf nihr global health research unit diabetes cardiovascular disease south asia low uptake covid prevention behaviours high socioeconomic impact lockdown measures south asia evidence large scale multi country surveillance programme ssm population health j.ssmph.2021.100751 . 
low uptake covid prevention behaviours high socioeconomic impact lockdown measures south asia evidence large scale multi country surveillance programme first name last name affiliation conception design data acquisition analysis interpretation drafted work revised work approved agreed dian kusuma imperial college business school rajendra pradeepa madras diabetes research foundation chennai india khadija khawaja services institute medical sciences mehedi hasan brac james grant school public health brac university samreen siddiqui max healthcare sara mahmood services institute medical sciences syed mohsin ali shah punjab institute cardiology chamini silva faculty medicine university kelaniya laksara silva faculty medicine university colombo manoja gamage faculty medicine university colombo menka loomba max healthcare vindya rajakaruna faculty medicine university kelaniya abu hanif brac james grant school public health brac university rajan babu kamalesh madras diabetes research foundation chennai india balachandran kumarendran faculty medicine university jaffna marie loh lee kong chian school medicine imperial college london archa misra max healthcare asma tassawar punjab institute cardiology akansha tyagi max healthcare swati waghdhare max healthcare saira burney services institute medical sciences sajjad ahmad punjab institute cardiology viswanathan mohan madras diabetes research foundation chennai india malabika sarker brac james grant school public health brac university ian goon school public health imperial college london anuradhani kasturiratne faculty medicine university kelaniya jaspal kooner nhli imperial college london prasad katulanda faculty medicine university colombo sujeet jha max healthcare ranjit mohan anjana madras diabetes research foundation chennai india malay mridha brac james grant school public health brac university franco sassi imperial college business school john chambers lee kong chian school medicine imperial college london criteria 
substantial contributions conception design work acquisition analysis interpretation data work criteria drafting work revising critically important intellectual content criteria final approval version published criteria agreement accountable aspects work ensuring questions related accuracy integrity part work appropriately investigated resolved .icmje criteria please check apply page low uptake covid prevention behaviours high socioeconomic impact lockdown measures south asia evidence large scale multi country surveillance programme dian kusuma rajendra pradeepa khadija khawaja mehedi hasan samreen siddiqui sara mahmood syed mohsin ali shah chamini silva laksara silva manoja gamage menka loomba vindya rajakaruna abu hanif rajan babu kamalesh balachandran kumarendran marie loh archa misra asma tassawar akansha tyagi swati waghdhare saira burney sajjad ahmad viswanathan mohan malabika sarker ian goon anuradhani kasturiratne jaspal kooner prasad katulanda sujeet jha ranjit mohan anjana malay mridha franco sassi john chambers behalf nihr global health research unit diabetes cardiovascular disease south asia centre health economics policy innovation imperial college business school madras diabetes research foundation chennai india services institute medical sciences lahore pakistan brac james grant school public health brac university dhaka bangladesh max healthcare new delhi india punjab institute cardiology lahore pakistan faculty medicine university kelaniya ragama sri lanka faculty medicine university colombo colombo sri lanka faculty medicine university jaffna jaffna sri lanka lee kong chian school medicine nanyang technological university singapore singapore school public health imperial college london london national heart lung institute imperial college london london correspondence john chambers department epidemiology biostatistics school public health imperial college london london 1pg john.chambers ic.ac.uk acknowledgements research funded national institute 
health research nihr using aid government support global health research wellcome trust . low uptake covid prevention behaviours high socioeconomic impact lockdown measures south asia evidence large scale multi country surveillance programme background . results identified important knowledge access uptake barriers prevention covid south asia demonstrated major adverse impacts pandemic chronic disease treatment mental health health related behaviours employment household finances . common residents lower middle income countries lmics people south asia face multiple challenges including high rates communicable non communicable disease fragile health education systems food financial insecurity limited formal economic social support .however test positivity rate rate increase new cases substantially lowered mid september january furthermore increasing evidence national lockdowns adverse effects physical mental health children education behaviours relevant chronic disease well severe social financial consequences . discussion assessed knowledge behaviours health socio economic circumstances men women surveillance sites four south asian countries national lockdowns implemented covid march july 2020.

BERT-Summarizer

"Unlike other extractive summarizers, it makes use of embeddings for indicating different sentences and it has only two labels namely sentence A and sentence B rather than multiple sentences. These embeddings are modified accordingly to generate required summaries."[9]

BERT (Bidirectional Encoder Representations from Transformers)

"BERT (Bidirectional transformer) is a transformer used to overcome the limitations of RNN and other neural networks as Long term dependencies. It is a pre-trained model that is naturally bidirectional. This pre-trained model can be tuned to easily perform the NLP tasks as specified, like summarization. Being trained as a masked model the output vectors are tokened instead of sentences."[9]

OUTPUT

silva c.k silva gamage loomba rajakaruna v.p hanif kamalesh r.b . study approved irbs country consent obtained participants round data collection . study team attempted contact participants surveillance study . results consistent assessment flu like symptoms amongst healthcare workers india period contrast findings serology studies delhi urban south asian settings indicate high proportion populations tested infected covid lockdown . observations contribute explaining continued sustained spread covid epidemic south asian communities highlight population groups benefit awareness raising measures improved access personal protection resources . adverse effects financial circumstances greatest younger people less educated backgrounds . community based assessment knowledge attitude practices risk factors regarding covid among pakistanis residents recent outbreak cross sectional survey.

Abstractive Summary

Abstractive summarization rewrites the document by building an internal semantic representation, from which a summary is then generated using NLP.[7] Abstractive summaries are harder to generate and require more computational power. Usually, an encoder-decoder network architecture is used to obtain the summaries.

The following models were used to obtain abstractive summaries:

- T5
- GPT2

T5 Summarizer

"T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: translate English to German: ..., for summarization: summarize: .... T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing. This means that for training we always need an input sequence and a target sequence."[10]

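The task-prefix mechanism described in the quotation can be sketched as follows. This is an assumed setup, not the project's code: the `t5-small` checkpoint, the token limits, and the beam-search settings are illustrative choices, and the helper names are hypothetical. The model call relies on the Hugging Face `transformers` library and is imported lazily so that the prefix helper works on its own.

```python
def build_t5_input(text, task_prefix="summarize: "):
    """T5 selects the task through a plain-text prefix prepended to the input."""
    return task_prefix + " ".join(text.split())


def summarize_with_t5(text, model_name="t5-small",
                      max_input_tokens=512, max_summary_tokens=150):
    """Generate an abstractive summary; checkpoint and limits are assumed values."""
    # Imported here so build_t5_input can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(build_t5_input(text), return_tensors="pt",
                       truncation=True, max_length=max_input_tokens)
    output_ids = model.generate(inputs["input_ids"],
                                attention_mask=inputs["attention_mask"],
                                max_length=max_summary_tokens,
                                num_beams=4, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(build_t5_input("Lockdown measures   affected\nmental health."))
```

Calling `summarize_with_t5` downloads the checkpoint on first use. Because the input is truncated to a fixed number of tokens, a long paper would need to be summarized in chunks or shortened first.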
OUTPUT

icmje criteria please check apply page low uptake covid prevention behaviours high socioeconomic impact lockdown measures south asia evidence large scale multi country surveillance programme journal pre proof. dian kusuma rajendra pradeepa, khadija, mahmood syed mohsin ali shah chamini silva laksara, manoja gamage menka loomba vindya malay mridha franco 'home' g nypd''. high d.p' co &... high-quality i'ds y ad: o.d; c ng r t b p essm vs co-sympathetic k.i hraasiyidd

GPT2

"GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data."[11]

OUTPUT

Conclusion

When a large language model is trained on a sufficiently large and diverse dataset, it is able to perform well across many domains and datasets. In this project, pre-trained models were used to achieve the project's goal. The extractive pre-trained models are computationally cheaper than the abstractive ones. Extractive summaries consist of text taken verbatim from the original data, whereas abstractive summaries are paraphrased summaries produced by state-of-the-art models. The model that produced the best extractive summary is BERT-Summarizer. The model that produced the best abstractive summary is GPT2 (by OpenAI).


References:

[1] Brownlee, J. (2019, August 7). A Gentle Introduction to Text Summarization. Machine Learning Mastery. https://machinelearningmastery.com/gentle-introduction-text-summarization/
[2] Kieuvongngam, V., Tan, B., & Niu, Y. (2020). Automatic text summarization of COVID-19 medical research articles using BERT and GPT-2. arXiv preprint arXiv:2006.01997.
[3] Esteva, A., Kale, A., Paulus, R., Hashimoto, K., Yin, W., Radev, D., & Socher, R. (2020). CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization. ArXiv, abs/2006.09595.
[4] Thurner, S., Hanel, R., Liu, B., & Corominas-Murtra, B. (2015). Understanding Zipf's law of word frequencies through sample-space collapse in sentence formation. J. R. Soc. Interface, 12, 20150330. http://doi.org/10.1098/rsif.2015.0330
[5] Data Cleansing | AOK Leads | Lead Generation. (2021). Aokleads. https://www.aokleads.com.au/data-cleansing#
[6] Discovering interpretable features - Non-negative matrix factorization (NMF). (2021). DataCamp. https://campus.datacamp.com/courses/unsupervised-learning-in-python/discovering-interpretable-features?ex=1
[7] Belmondo, M. (2020, July 30). Extractive and abstractive summary - Blog Text-Summarize. Text Summarize - Blog. https://blog.text-summarize.com/summarization/differences-between-extractive-and-abstractive-summary/
[8] Řehůřek, R., & Sojka, P. (2010). Software framework for topic modeling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks.
[9] Vashisht, A. (2020, July 27). BERT for text summarization. OpenGenus IQ: Learn Computer Science. https://iq.opengenus.org/bert-for-text-summarization/#
[10] T5. (2020). T5 Summarizer. https://huggingface.co/transformers/model_doc/t5.html
[11] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
