Personal Research Knowledge Graphs
Personal knowledge graphs contain structured information about entities personally related to their users [1]. We have proposed Personal Research Knowledge Graphs (PRKGs) as a new means for a researcher to represent structured information about their research and related activities. PRKGs can power intelligent personal assistants and personalize various applications. In our vision papers [2,3], we explore which entities should be included in a PRKG, how they should be collected, and what issues arise when a PRKG is shared with other researchers.
Chakraborty, P., & Sanyal, D. K. (2023). A comprehensive survey of personal knowledge graphs. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1513.
Chakraborty, P., Dutta, S., Sanyal, D. K., Majumdar, S., & Das, P. P. (2023). Bringing order to chaos: Conceptualizing a personal research knowledge graph for scientists. IEEE Data Engineering Bulletin, 47(4), 43-56.
Chakraborty, P., Dutta, S., & Sanyal, D. K. (2022). Personal research knowledge graphs. In Companion Proc. The Web Conference (WWW).
Topic modeling
We are studying neural topic models with an aim to improve their performance and apply them to analyze important text corpora.
In [1] (an extension of [5]), we proposed improvements to a VAE-based neural topic model using a negative sampling method that works in an unsupervised setting: the documents used as negative samples are constructed by perturbing the topics present in the original document.
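The idea of perturbing a document's topics can be illustrated with a minimal sketch. It assumes one plausible perturbation: zero out the k most probable topics in the document-topic vector and renormalize the rest. The function name `perturb_topics`, the parameter `k`, and the sample vector are hypothetical, and the exact procedure in [1] may differ.

```python
def perturb_topics(theta, k=1):
    """Build a negative-sample topic vector from a document-topic vector.

    Assumed perturbation (illustrative only): set the k largest topic
    proportions to zero and renormalize the remaining ones to sum to 1.
    """
    if not 0 < k < len(theta):
        raise ValueError("k must be between 1 and len(theta) - 1")
    # Indices of the k most probable topics in the original document.
    top = sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k]
    masked = [0.0 if i in top else p for i, p in enumerate(theta)]
    total = sum(masked)
    if total == 0:
        raise ValueError("the remaining topics carry no probability mass")
    return [p / total for p in masked]


# Hypothetical topic proportions for one document (four topics).
theta = [0.5, 0.25, 0.125, 0.125]
negative = perturb_topics(theta, k=1)
print(negative)  # [0.0, 0.5, 0.25, 0.25]
```

In a full model, a decoder would turn the perturbed vector into a negative document that is contrasted with the original; that part is omitted here.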
We have proposed a novel method to extract topics from a document corpus in which each input document is represented as a combination of a TF-IDF vector and the embedding of a document graph [2]. In [3], we proposed a method to distill knowledge from a large Contextualized Topic Model (CTM) into a leaner CTM using optimal transport theory. We analyzed the role of dropout in neural topic models in [4] and showed empirically that dropout must be tuned carefully to achieve good performance. Often, very low or no dropout works better than high dropout.
As a concrete application of topic modeling, we have extracted topics from the questions and answers discussed in the Lower House of the Parliament of India during the Question Hour between 1999 and 2019. We have applied a dynamic topic model for this purpose. The extracted topics capture the contemporary social, political and economic issues in a surprisingly clear pattern [6].
Adhya, S., Lahiri, A., Sanyal, D. K., & Das, P. P. (2024). Evaluating Negative Sampling Approaches for Neural Topic Models. IEEE Transactions on Artificial Intelligence, 5(11), 5630-5642.
Adhya, S., Sanyal, D. K. (2024). GINopic: Topic modeling with graph isomorphism network. In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Adhya, S. & Sanyal, D. K. (2023). Improving neural topic models with Wasserstein knowledge distillation. In Proc. ECIR.
Adhya, S., Lahiri, A., & Sanyal, D. K. (2023). Do neural topic models really need dropout? Analysis of the effect of dropout in topic modeling. In Proc. EACL.
Adhya, S., Lahiri, A., Sanyal, D. K. & Das, P. P. (2022). Improving contextualized topic models with negative sampling. In Proc. ICON 2022.
Adhya, S. & Sanyal, D. K. (2022). What does the Indian Parliament discuss? An exploratory analysis of the Question Hour in the Lok Sabha. In Proc. LREC 2022 PoliticalNLP workshop.
Entity and relation extraction from scientific papers
Extraction of entities and their relations from scientific papers is an important prerequisite for the construction of scientific knowledge graphs. We have proposed deep learning-based models that achieve high performance on benchmark datasets.
Lahiri, A., Sarkar, P., Sen, M., Sanyal, D. K., Mukherjee I. (2024). Few-TK: A Dataset for Few-shot Scientific Typed Keyphrase Recognition. In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) Findings.
De, S., Sanyal, D. K., & Mukherjee, I. (2023). AgriNER: An NER dataset of agricultural entities for the semantic web. In Proc. ESWC.
Santosh, T. Y. S. S., Chakraborty, P., Dutta, S., Sanyal, D. K., & Das, P. P. (2021). Joint entity and relation extraction from scientific documents: role of linguistic information and entity types. In Proc. EEKE @ JCDL.
Summarization of scientific papers
Given the enormous growth in scientific publications, there is an urgent need to build effective summarization tools that distil the main contributions of a paper (or a collection of papers) into a clear and concise gist. In [1], we explore how well pretrained and large language models can generate the titles of research papers from their abstracts. In the remaining papers, we have developed deep neural models to generate research highlights from a scientific paper: [2] explores SciBERT embeddings, [3] uses ELMo embeddings, and [5] uses embeddings trained from scratch -- all in the Pointer-Generator network architecture -- to produce highlights from papers. In [4], the model in [5] is augmented to treat each entity mention in the input document as a single token during subsequent processing -- this prevents entity mentions from being split in the generated output.
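The single-token treatment of entity mentions can be pictured with a small preprocessing sketch. It assumes the entity spans are already available (for example, from a named entity recognizer) as non-overlapping `(start, end)` token offsets with an exclusive end. The function name, the underscore joiner, and the sample sentence are illustrative assumptions, not the implementation in [4].

```python
def merge_entity_mentions(tokens, spans, joiner="_"):
    """Collapse every entity span into one token.

    tokens: list of word tokens.
    spans:  non-overlapping (start, end) offsets, end exclusive.
    """
    ends = {start: end for start, end in spans}
    merged, i = [], 0
    while i < len(tokens):
        end = ends.get(i)
        if end is not None and end > i:
            # Join the words of the mention so later stages see one unit.
            merged.append(joiner.join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "we train a pointer generator network on research papers".split()
spans = [(3, 6)]  # hypothetical span covering "pointer generator network"
merged = merge_entity_mentions(tokens, spans)
print(merged)
```

Because the mention is a single vocabulary item, a generator that copies or emits it cannot output only part of the phrase.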
Rehman, T., Sanyal, D. K., & Chattopadhyay, S. (2025). Can pre-trained language models generate titles for research papers? In International Conference on Asian Digital Libraries (pp. 154-170). Springer, Singapore. (🏆 Best Student Paper Runner-Up Award)
Rehman, T., Sanyal, D. K., Chattopadhyay, S., Bhowmick, P. K., & Das, P. P. (2023). Generation of highlights from research papers using pointer-generator networks and SciBERT embeddings. IEEE Access, 11, 91358-91374.
Rehman, T., Sanyal, D. K., & Chattopadhyay, S. (2023). Research highlight generation with ELMo contextual embeddings. Scalable Computing: Practice and Experience, 24(2), 181-190.
Rehman, T., Sanyal, D. K., Majumder, P. & Chattopadhyay, S. (2022). Named entity recognition based automatic generation of research highlights. In Proc. SDP @ COLING.
Rehman, T., Sanyal, D. K., Chattopadhyay, S., Bhowmick, P. K., & Das, P. P. (2021). Automatic generation of research highlights from scientific abstracts. In Proc. EEKE @ JCDL.
Free surrogates of paywalled papers
How many times have you hit a paywall after clicking on a paper title? Paywalls limit access and make it difficult for researchers to learn about the state of the art in their fields.
But all is not bleak! Authors often leave on their institutional webpages or other open repositories either a preprint or an article very similar -- if not identical -- to their published (paywalled) paper. Sometimes papers published by an author within a time window are quite similar, and if some of them are freely available, they give a fair idea about their research and even about their paywalled articles. These surrogates are invaluable to a researcher challenged by a paywall.
We designed Surrogator -- a tool to discover free surrogates of paywalled articles.
Read about our work here:
Sanyal, D. K., Bhowmick, P. K., Das, P. P., Chattopadhyay, S., & Santosh, T. Y. S. S. (2019). Enhancing access to scholarly publications with surrogate resources. Scientometrics, 121(2), 1129–1164. Springer. [ResearchGate]
Sanyal, D. K., Banerjee, S., Agarwal, G., Chattopadhyay, S., Bhowmick, P. K., & Das, P. P. (2019). Illumine: a tool to augment the National Digital Library of India with full texts of research papers. In Proc. INDICON. [ResearchGate]
Santosh, T. Y. S. S., Sanyal, D. K., Bhowmick, P. K., & Das, P. P. (2018). Surrogator: a tool to enrich a digital library with open access surrogate resources. In Proc. JCDL. [ResearchGate]
Keyphrase extraction/generation from research papers
Keyphrases capture the salient areas of a research paper, and help in indexing and searching for the papers. But not all papers contain keyphrases. Wouldn't it be great if we could automatically generate keyphrases for a research paper?
We designed several innovative supervised deep learning-based methods to produce keyphrases for a scholarly article. While some of our methods are extractive, others are abstractive.
Here are our papers:
Santosh, T. Y. S. S., Varimalla, N. R., Vallabhajosyula, A., Sanyal, D. K., & Das, P. P. (2021). HiCoVA: Hierarchical conditional variational autoencoder for keyphrase generation. In Proc. CIKM.
Santosh, T. Y. S. S., Sanyal, D. K., Bhowmick, P. K., & Das, P. P. (2021). Gazetteer-guided keyphrase generation from research papers. In Proc. PAKDD.
Santosh, T. Y. S. S., Sanyal, D. K., Bhowmick, P. K., & Das, P. P. (2020). SaSAKE: syntax and semantics aware keyphrase extraction from research papers. In Proc. COLING.
Santosh, T. Y. S. S., Sanyal, D. K., Bhowmick, P. K., & Das, P. P. (2020). DAKE: document-level attention for keyphrase extraction. In Proc. ECIR.
Discourse segmentation of scientific abstracts
A scientific paper begins with an abstract. In domains like computer science, unlike the main body of the paper, the abstract is usually not divided into sections. We study the problem of automatically segmenting the abstract into several sections like BACKGROUND, TECHNIQUE, and OBSERVATION. We model it as a sequential classification problem in which each sentence of the abstract is to be labeled with one of the given classes. We have leveraged transformers in [1] for this problem. In [2], we have used transfer learning to handle the scenario where labeled data of segmented abstracts is scarce.
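To make the link between sentence labels and segments concrete, here is a minimal sketch of the post-processing step only: consecutive sentences with the same label are grouped into one section. The labels below are hand-assigned stand-ins for a classifier's predictions, and the function name and sample sentences are hypothetical.

```python
from itertools import groupby


def segments_from_labels(sentences, labels):
    """Group consecutive sentences that share a label into segments."""
    if len(sentences) != len(labels):
        raise ValueError("need exactly one label per sentence")
    segments = []
    pairs = zip(labels, sentences)
    # groupby merges only adjacent items with equal keys, which is what
    # a left-to-right segmentation of the abstract requires.
    for label, group in groupby(pairs, key=lambda pair: pair[0]):
        segments.append((label, [sentence for _, sentence in group]))
    return segments


sentences = [
    "Topic models are widely used.",
    "Neural variants have recently gained attention.",
    "We add a negative sampling step.",
    "Topic coherence improves on three corpora.",
]
# Stand-in for model output: one class per sentence, in order.
labels = ["BACKGROUND", "BACKGROUND", "TECHNIQUE", "OBSERVATION"]

segments = segments_from_labels(sentences, labels)
for label, group in segments:
    print(label, len(group))
```

The classifier itself, which is the substance of [1] and [2], is deliberately left out; only the input and output shapes of the task are shown.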
Santosh, T. Y. S. S., Aluru, S. S., Vallabhajosyula, A., Sanyal, D. K., & Das, P. P. (2023). Label informed hierarchical transformers for sequential sentence classification in scientific abstracts. Expert Systems, e13238.
Banerjee, S., Sanyal, D. K., Chattopadhyay, S., Bhowmick, P. K., & Das, P. P. (2020). Segmenting scientific abstracts into discourse categories: a deep learning-based approach for sparse labeled data. In Proc. JCDL.
We have analyzed the intent of a citation (e.g., whether it relates to "Background", "Method", or "Result") in a scholarly paper using pre-trained language models and prompt-based techniques, and have also studied the problem in few-shot and zero-shot scenarios [3].
Lahiri, A., Sanyal, D. K., & Mukherjee, I. (2023). CitePrompt: Using Prompts to Identify Citation Intent in Scientific Papers. In Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL). ACM.
Author name disambiguation
Two different authors may have the same name, and the same author may write their name in different formats. This is a persistent problem in digital libraries. We surveyed the author name disambiguation techniques proposed for the PubMed database. We also proposed a random forest classifier to disambiguate author names, and tested our method on a large public dataset from PubMed.
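The format problem can be illustrated with a minimal sketch of a blocking key, a common first step that groups name strings which might refer to the same person before any classifier compares them. This is not the random forest method described above. The function name is hypothetical, and the code assumes that the last name is either the part before a comma or the final word, which does not hold for every naming convention.

```python
def blocking_key(name):
    """Map a name string to a coarse (last name, first initial) key.

    Assumption: the last name precedes a comma ("Last, First") or is
    the final whitespace-separated word ("First Middle Last").
    """
    name = name.strip()
    if not name:
        raise ValueError("empty name")
    if "," in name:
        last, _, given = name.partition(",")
    else:
        parts = name.split()
        last, given = parts[-1], " ".join(parts[:-1])
    last = last.strip().lower()
    given = given.strip().lower()
    initial = given[0] if given else ""
    return (last, initial)


variants = ["Sanyal, D. K.", "Debarshi Kumar Sanyal", "D. K. Sanyal"]
keys = {blocking_key(v) for v in variants}
print(keys)  # {('sanyal', 'd')}
```

Names that share a key are only candidates for the same author; two different people can still collide on one key, which is why a learned classifier is needed afterwards.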
Sanyal, D. K., Bhowmick, P. K., & Das, P. P. (2021). A review of author name disambiguation techniques for the PubMed bibliographic database. Journal of Information Science, 47(2), 227–254, SAGE.
Jhawar, K., Sanyal, D. K., Chattopadhyay, S., Bhowmick, P. K., & Das, P. P. (2020). Author name disambiguation in PubMed using ensemble-based classification algorithms. In Proc. JCDL.
We also wrote a paper on identifying the last name in an author name. This is an important task when preparing a bibliographic index or a book catalog. We found the Character-BiLSTM-CRF model to be the top performer.
Santosh, T. Y. S. S., Sanyal, D. K., & Das, P. P. (2019). Person name segmentation with deep neural networks. In Proc. MIKE.