Indian Legal BERT - for Legal NLP in the Indian context
BERT-based foundational Pre-trained Language Models (PLMs), pre-trained on large volumes of Indian legal text. These models substantially improve the state of the art on various legal NLP tasks.
InLegalBERT: the best-performing Pre-trained Language Model [1 million+ downloads on Hugging Face to date]
InCaseLawBERT and CustomInLawBERT: other PLMs with competitive performance
Supporting publication: Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law, ICAIL 2023 [pdf (ArXiv)]
Translation of legal text into Indian languages
[Dataset] The first dataset for evaluating Machine Translation systems on translating legal text from English to nine Indian languages. Can also be used to evaluate MT systems on translating from one Indian language to another. Supporting publication: MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages, ACM TALLIP, 2024.
InLegalTrans-En2Indic-1B: model for translating legal text from English to nine Indian languages
IL-TUR: A Benchmark of Indian Legal Text Understanding and Reasoning
[Website] IL-TUR contains monolingual (English, Hindi) and multi-lingual (9 Indian languages) domain-specific tasks from the point of view of understanding and reasoning over Indian legal documents. Supporting publication: IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning, ACL 2024.
Identifying charges / statutes from facts of a situation
[Code + Data] Identifying relevant statutes given the natural language (English) description of a situation. Experiments on Indian and European cases and statutes. Supporting publication: Legal Statute Identification: A Case Study using State-of-the-Art Datasets and Methods, SIGIR 2024.
[Code + Data] Identifying relevant Indian Penal Code (IPC) Sections given the natural language (English) description of a situation. Supporting publication: LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification from Indian Legal Documents, AAAI 2022.
[Code + Data] Identifying charges/crimes under the Indian Penal Code given the natural language (English) description of a situation. Supporting publication: Automatic Charge Identification from Facts: A Few Sentence-Level Charge Annotations is All You Need, COLING 2020.
Summarization of court case judgements
[Data + Code] Three datasets for summarizing legal case judgements, together with implementations of several summarization algorithms and pretrained models for this task. Supporting publications: (1) A Comparative Study of Summarization Algorithms applied to Legal Case Judgements, ECIR 2019. (2) Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation, AACL-IJCNLP 2022.
[Code] Implementation of DELSUMM, an unsupervised extractive summarization algorithm for legal case judgements. Supporting publication: Incorporating Domain Knowledge for Extractive Summarization of Legal Case Documents, ICAIL 2021.
Document image analysis of First Information Report (FIR) forms
[Dataset] The first hybrid (containing both handwritten and printed text) semi-structured document analysis dataset consisting of Indian legal documents (First Information Reports from several police stations). Can be used for document image segmentation, handwriting recognition, etc. Supporting publication: TransDocAnalyser: A framework for semi-structured offline handwritten documents analysis with an application to legal domain, ICDAR 2023.
Estimating the similarity between two court case judgements
[Dataset] Two datasets for estimating the semantic similarity, in the range [0, 1], between two court case judgements. Each dataset contains pairs of case documents with a similarity value assigned by law experts. Supporting publication: Legal Case Document Similarity: You Need Both Network and Text, Information Processing and Management, 2022.
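One common way to use expert-assigned similarity scores like these is to correlate them with the scores produced by an automatic method. The snippet below is a minimal sketch of that idea using the Pearson correlation coefficient; the choice of metric and the example numbers are assumptions made for illustration and are not taken from the dataset or the supporting publication.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values: expert similarity scores in [0, 1] for four document
# pairs, and the scores an automatic similarity method gave the same pairs.
expert_scores = [0.0, 0.3, 0.6, 0.9]
predicted = [0.1, 0.2, 0.7, 0.8]

correlation = pearson(expert_scores, predicted)
print(f"Pearson correlation with expert scores: {correlation:.3f}")
```

A value close to 1 would suggest that the automatic method ranks and scales document pairs much like the experts do.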
Identifying the rhetorical role of sentences in court case judgements
[Code + Data] A dataset of 50 case judgments of the Indian Supreme Court, where the rhetorical role of every sentence is labeled (by law students), and an implementation of our proposed model for identifying the rhetorical roles of sentences. Supporting publication: Identification of Rhetorical Roles of Sentences in Indian Legal Judgments, JURIX 2019.
[Data] A dataset of more Indian Supreme Court judgements and a set of UK Supreme Court judgements, where the rhetorical role of every sentence is labeled (by Law students). Supporting publication: MARRO: Multi-headed Attention for Rhetorical Role Labeling in Legal Documents, Artificial Intelligence and Law, 2025.
Extracting legal catchphrases (keywords) from court case judgements
[Code] An unsupervised algorithm for extracting legal catchphrases from court case judgements. Supporting publication: Automatic Catchphrase Identification from Legal Court Case Documents, CIKM 2017.
[Code] A supervised algorithm for extracting legal catchphrases from court case judgements. Supporting publication: A Sequence Labeling Model for Catchphrase Identification from Legal Case Documents. Artificial Intelligence and Law, 2021.
FIRE 2019 Track on Artificial Intelligence for Legal Assistance (AILA)
[Dataset] Dataset for two tasks -- (1) Identifying relevant prior cases for a given situation, (2) Identifying most relevant statutes for a given situation. The datasets are based on legal documents (cases, statutes) from the Indian judicial system. Supporting publication: Overview of the FIRE 2019 AILA track: Artificial Intelligence for Legal Assistance, FIRE 2019.
FIRE 2017 Track on Information Retrieval from Legal Documents (IRLeD)
[Dataset] For two tasks -- (1) Catchphrase extraction from Indian legal documents, (2) Identifying prior cases relevant to a given case. Supporting publication: Overview of the FIRE 2017 IRLeD Track: Information Retrieval from Legal Documents, FIRE 2017.
Dataset of tweets reflecting hesitancy towards COVID vaccines
[Dataset] English anti-vaccine tweets labeled with specific anti-vaccine concerns (classes / labels) and the text-spans that reflect these concerns (explanations). Also contains class-wise summaries. Can be used for multi-label classification, explainable classification, and tweet summarization. Supporting publication: CAVES: A dataset to facilitate explainable classification and summarization of concerns towards COVID vaccines, ACM SIGIR (Resource Track) 2022.
Dataset of ICPR 2024 Competition on Claim Span Identification
[Dataset] Two datasets of English and Hindi social media posts for the task of "Claim Span Identification" in which, given a text, parts/spans that correspond to claims are to be identified. Supporting publication: ICPR 2024 Competition on Multilingual Claim-Span Identification, ICPR 2024.
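Span identification tasks of this kind are often modelled as token-level tagging, with the predicted tags then decoded into spans. The sketch below shows that decoding step under the assumption of a B/I/O tagging scheme; the tagging scheme, the function name `bio_to_spans`, and the example sentence are illustrative assumptions, not the dataset's documented format.

```python
def bio_to_spans(tags):
    """Convert per-token B/I/O tags into (start, end) token spans, end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # Tolerate an "I" with no preceding "B" by opening a span here.
            if start is None:
                start = i
        else:
            # Any other tag (e.g. "O") closes the open span, if there is one.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hypothetical tokenised post and predicted tags.
tokens = ["They", "say", "the", "bridge", "has", "collapsed", "."]
tags = ["O", "O", "B", "I", "I", "I", "O"]

claim_spans = bio_to_spans(tags)
claim_texts = [" ".join(tokens[s:e]) for s, e in claim_spans]
print(claim_texts)
```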
Summarization of tweets posted during disasters
[Dataset] This dataset can be used to develop methods for summarization of tweets that are posted during disasters. Supporting publication: Ensemble Algorithms for Microblog Summarization. IEEE Intelligent Systems, 2018.
[Code] Implementation of a method for classification and summarization of tweets posted during disasters. The method first classifies tweets/tweet fragments as 'situational' or 'non-situational', and then summarizes the situational tweets. Supporting publications: (1) Extracting Situational Information from Microblogs during Disaster Events: a Classification-Summarization Approach, ACM CIKM 2015, (2) Extracting and Summarizing Situational Information from the Twitter Social Media during Disasters, ACM Transactions on the Web 2018.
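The two-stage structure described above (classify first, then summarize only the situational tweets) can be sketched as follows. This is a toy illustration and not the published method: the keyword-based classifier, the cue-word list, and the frequency-based extractive scorer are all stand-ins chosen to keep the example self-contained.

```python
from collections import Counter

# Hypothetical cue words; a real system would use a trained classifier.
SITUATIONAL_CUES = {"injured", "dead", "collapsed", "rescued", "shelter", "damaged"}


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    stripped = (w.strip(".,!?#:;").lower() for w in text.split())
    return [tok for tok in stripped if tok]


def is_situational(tweet):
    """Stage 1: flag a tweet as situational if it contains any cue word."""
    return any(tok in SITUATIONAL_CUES for tok in tokenize(tweet))


def summarize(tweets, max_tweets=2):
    """Stage 2: keep the tweets whose words are most frequent across the set."""
    freq = Counter(tok for t in tweets for tok in set(tokenize(t)))

    def score(tweet):
        toks = set(tokenize(tweet))
        return sum(freq[tok] for tok in toks) / max(len(toks), 1)

    return sorted(tweets, key=score, reverse=True)[:max_tweets]


def classify_then_summarize(tweets, max_tweets=2):
    """Run stage 1, then summarize only the tweets flagged as situational."""
    situational = [t for t in tweets if is_situational(t)]
    return summarize(situational, max_tweets)


# Made-up example tweets.
tweets = [
    "Bridge collapsed near the station, 12 injured",
    "Praying for everyone affected",
    "Shelter open at the city school",
    "20 injured rescued from collapsed building",
]
summary = classify_then_summarize(tweets)
print(summary)
```

Filtering before summarizing keeps sentiment or opinion tweets from competing with actionable updates for the limited space in the summary.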
Tweets informing about resource-need and resource-availability in post-disaster situations
[Dataset] This dataset can be used to develop methods for identifying tweets that inform about resource-needs (called 'need-tweets') and resource availabilities (called 'availability tweets'). Contains tweets relevant to two earthquake events -- 2015 Nepal earthquake and 2016 Italy earthquake. Supporting publication: Extracting Resource Needs and Availabilities from Microblogs for Aiding Post-Disaster Relief Operations. IEEE Transactions on Computational Social Systems, 2019.
Algorithms for Stance Detection from posts on Web and Social Media
[Code] Implementations of several stance detection algorithms for microblogs (e.g., the SemEval dataset), news articles, etc. Supporting publication: Stance Detection in Web and Social Media: A Comparative Study, CLEF 2019.
FIRE 2018 Track on Information Retrieval from Microblogs during Disasters (IRMiDis)
[Dataset] This dataset can be used for developing algorithms for identification of factual or fact-checkable tweets from tweets posted during a disaster event. Supporting publication: Overview of the FIRE 2018 track: Information Retrieval from Microblogs during Disasters (IRMiDis), FIRE 2018.
FIRE 2017 Track on Information Retrieval from Microblogs during Disasters (IRMiDis)
[Dataset] This dataset can be used for developing methods for (i) identification and (ii) matching of resource-needs and resource-availabilities from tweets posted during a disaster event. Supporting publication: Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis), FIRE 2017.
ECIR 2017 Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP 2017)
[Dataset] This dataset consists of microblogs posted during the August 2016 earthquake in central Italy, and can be used to develop algorithms for retrieval and summarization of microblogs that are useful for post-disaster relief operations. Supporting publication: ECIR 2017 Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP 2017), ACM SIGIR Forum Newsletter, 2017.
FIRE 2016 Track on Information Retrieval from Microblogs during Disasters (IRMiDis)
[Dataset] This dataset can be used to develop methods for identifying tweets relevant to some practical queries/topics that are important for conducting post-disaster relief operations. Supporting publication: Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters, FIRE 2016.
[Code] Implementation of FaiRIR, a suite of Fair Related Item Recommendation Algorithms. Supporting publication: FaiRIR: Mitigating Exposure Bias from Related Item Recommendations in Two-Sided Platforms, IEEE Transactions on Computational Social Systems 2022.
[Code + Data] Implementation of FairSumm, a fairness-preserving text summarization algorithm, and the three datasets used in the paper. Supporting publication: Summarizing User-generated Textual Content: Motivation and Methods for Fairness in Algorithmic Summaries, ACM CSCW 2019.
Algorithm for cleaning both machine-generated and human-generated noise in text
[Code] Implementation of an unsupervised text cleaning / normalization algorithm (UnsupClean) that can be used to clean different types of noisy text, including text containing machine-generated noise (e.g., OCR noise) as well as text containing human-generated noise (e.g., informally written microblogs). Supporting publication: An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection. ACM Journal of Data and Information Quality, 2020.
Algorithm for stemming informally written microblogs
[Code] Implementation of an algorithm for stemming informally-written microblogs/tweets. Supporting publication: Combining Local and Global Word Embeddings for Microblog Stemming. ACM CIKM 2017.
Algorithm for zero-shot retrieval of images from textual descriptions
[Code] Implementation of a novel GAN-based model that retrieves images from given textual descriptions in a zero-shot setting. Supporting publication: ZSCRGAN: A GAN-based Expectation-Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions, ACM CIKM 2020.