Resources

Legal analytics


Indian Legal BERT - for Legal NLP in the Indian context

BERT-based Language Models (LMs) pre-trained on large volumes of Indian legal text. These models substantially improve the state of the art in various NLP tasks over Indian legal text, as well as over legal text from other countries.

InLegalBERT: best-performing LM on Indian legal text [170,000+ downloads to date]

InCaseLawBERT and CustomInLawBERT: other LMs for Indian legal text, with competitive performance

Supporting publication: Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law, ICAIL 2023  [pdf (ArXiv)]


Document image analysis of First Information Report (FIR) forms 

[Dataset] A first-of-its-kind hybrid (containing both handwritten and printed text) dataset for semi-structured document analysis, consisting of Indian legal documents (First Information Reports from several police stations). It can be used for document image segmentation, handwriting recognition, etc. Supporting publication: TransDocAnalyser: A framework for semi-structured offline handwritten documents analysis with an application to legal domain, ICDAR 2023.


Identifying charges/crimes from facts of a situation

[Code + Data] for identifying relevant Indian Penal Code (IPC) Sections, given the natural language (English) description of a situation. Supporting publication: LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification from Indian Legal Documents, AAAI 2022.

[Code + Data] for identifying charges/crimes under the Indian Penal Code, given the natural language (English) description of a situation. Supporting publication: Automatic Charge Identification from Facts: A Few Sentence-Level Charge Annotations is All You Need, COLING 2020.


Summarization of court case judgements

[Data + Code] Three datasets for summarizing legal case judgements, together with implementations of several summarization algorithms and pretrained models for the same task. Supporting publications: (1) A Comparative Study of Summarization Algorithms applied to Legal Case Judgements, ECIR 2019. (2) Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation, AACL-IJCNLP 2022.

[Code] Implementation of DELSUMM, an unsupervised extractive summarization algorithm for legal case judgements. Supporting publication: Incorporating Domain Knowledge for Extractive Summarization of Legal Case Documents, ICAIL 2021. 


Estimating the similarity between two court case judgements

[Dataset] Two datasets for the task of estimating the semantic similarity between two court case judgements, in the range [0, 1]. The datasets contain pairs of case documents, each with a similarity value assigned by law experts. Supporting publication: Legal Case Document Similarity: You Need Both Network and Text, Information Processing and Management, 2022.
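
To illustrate what a similarity score in [0, 1] for a document pair can look like, here is a minimal sketch of a text-only baseline: cosine similarity over word counts. This is not the method of the supporting publication (whose title argues for combining network and text signals); the function name and the simple tokenization are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity over lowercase word counts; the result lies in [0, 1]."""
    # Hypothetical tokenization: runs of ASCII letters only.
    counts_a = Counter(re.findall(r"[a-z]+", doc_a.lower()))
    counts_b = Counter(re.findall(r"[a-z]+", doc_b.lower()))

    # Dot product over the terms that both documents share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Because word counts are non-negative, the score cannot fall below 0, so it can be compared directly against expert-assigned values in the same range.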


FIRE 2019 Track on Artificial Intelligence for Legal Assistance (AILA)

[Dataset] Dataset for two tasks -- (1) Identifying relevant prior cases for a given situation, (2) Identifying most relevant statutes for a given situation. The datasets are based on legal documents (cases, statutes) from the Indian judicial system. Supporting publication: Overview of the FIRE 2019 AILA track: Artificial Intelligence for Legal Assistance, FIRE 2019.


Identifying the rhetorical role of sentences in court case judgements 

[Code + Data] A dataset of 50 case judgements of the Indian Supreme Court, in which the rhetorical role of every sentence is annotated, and an implementation of our proposed model for identifying the rhetorical roles of sentences. Supporting publication: Identification of Rhetorical Roles of Sentences in Indian Legal Judgments, JURIX 2019.



Extracting legal catchphrases (keywords) from court case judgements

[Code] An unsupervised algorithm for extracting legal catchphrases from court case judgements. Supporting publication: Automatic Catchphrase Identification from Legal Court Case Documents, CIKM 2017.

[Code] A supervised algorithm for extracting legal catchphrases from court case judgements. Supporting publication: A Sequence Labeling Model for Catchphrase Identification from Legal Case Documents. Artificial Intelligence and Law, 2021.


FIRE 2017 Track on Information Retrieval from Legal Documents (IRLeD)

[Dataset] For two tasks -- (1) Catchphrase extraction from Indian legal documents, (2) Identifying prior cases relevant to a given case. Supporting publication: Overview of the FIRE 2017 IRLeD Track: Information Retrieval from Legal Documents, FIRE 2017.

Utilizing social media in disaster / emergency situations


Dataset of tweets reflecting hesitancy towards COVID vaccines

[Dataset] This dataset contains English anti-vaccine tweets labeled with specific anti-vaccine concerns (classes / labels) and the text-spans that reflect these concerns (explanations); it also contains class-wise summaries. The dataset can be used to develop methods for multi-label classification, explainable classification, and summarization of tweets. Supporting publication: CAVES: A dataset to facilitate explainable classification and summarization of concerns towards COVID vaccines. ACM SIGIR (Resource Track) 2022.


Summarization of tweets posted during disasters

[Dataset] This dataset can be used to develop methods for summarization of tweets that are posted during disasters. Supporting publication: Ensemble Algorithms for Microblog Summarization. IEEE Intelligent Systems, 2018.  

[Code] Implementation of a method for classification and summarization of tweets posted during disasters. The method first classifies tweets/tweet fragments into 'situational' and 'non-situational', and subsequently summarizes the situational tweets. Supporting publications: (1) Extracting Situational Information from Microblogs during Disaster Events: a Classification-Summarization Approach, ACM CIKM 2015, (2) Extracting and Summarizing Situational Information from the Twitter Social Media during Disasters, ACM Transactions on the Web 2018.
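
The two-stage shape of this approach (classify first, then summarize only the situational tweets) can be sketched as follows. This is a toy illustration and not the published implementation: the cue-word classifier, the greedy coverage-based selection, and all names (`SITUATIONAL_CUES`, `is_situational`, `summarize`) are assumptions standing in for the trained classifier and the summarizer described in the papers.

```python
import re

# Hypothetical cue words; a real system would use a trained classifier instead.
SITUATIONAL_CUES = {"injured", "dead", "killed", "collapsed", "rescue", "trapped", "damaged"}


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in a tweet."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_situational(tweet):
    """Stage 1: label a tweet as situational if it contains any cue word."""
    return bool(tokenize(tweet) & SITUATIONAL_CUES)


def summarize(tweets, word_budget):
    """Stage 2: greedily pick situational tweets that add the most new tokens."""
    situational = [t for t in tweets if is_situational(t)]
    summary, covered, used = [], set(), 0
    while True:
        best, best_gain = None, 0
        for tweet in situational:
            length = len(tweet.split())
            if tweet in summary or used + length > word_budget:
                continue
            gain = len(tokenize(tweet) - covered)  # tokens not yet in the summary
            if gain > best_gain:
                best, best_gain = tweet, gain
        if best is None:  # nothing fits the budget or adds new content
            return summary
        summary.append(best)
        covered |= tokenize(best)
        used += len(best.split())
```

Filtering before summarizing keeps opinion and sympathy tweets from consuming the word budget, which is the motivation for the classification step.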


Tweets informing about resource-need and resource-availability in post-disaster situations

[Dataset] This dataset can be used to develop methods for identifying tweets that inform about resource-needs (called 'need-tweets') and resource-availabilities (called 'availability-tweets'). It contains tweets related to two earthquake events -- the 2015 Nepal earthquake and the 2016 Italy earthquake. Supporting publication: Extracting Resource Needs and Availabilities from Microblogs for Aiding Post-Disaster Relief Operations. IEEE Transactions on Computational Social Systems, 2019.


FIRE 2018 Track on Information Retrieval from Microblogs during Disasters (IRMiDis)

[Dataset] This dataset can be used for developing algorithms for identification of factual or fact-checkable tweets from tweets posted during a disaster event. Supporting publication: Overview of the FIRE 2018 track: Information Retrieval from Microblogs during Disasters (IRMiDis), FIRE 2018.


FIRE 2017 Track on Information Retrieval from Microblogs during Disasters (IRMiDis)

[Dataset] This dataset can be used for developing methods for (i) identification and (ii) matching of resource-needs and resource-availabilities from tweets posted during a disaster event. Supporting publication: Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis), FIRE 2017.
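
As a rough picture of the matching sub-task, the sketch below pairs each need-tweet with the availability-tweets that mention at least one of the same resources. It is only an illustration under stated assumptions: the resource vocabulary `RESOURCE_TERMS` and the function names are hypothetical, and methods developed on this dataset would typically go well beyond exact term overlap.

```python
import re

# Hypothetical resource vocabulary, chosen only for this example.
RESOURCE_TERMS = {"water", "food", "tents", "medicine", "blankets", "blood"}


def resources_in(tweet):
    """Return the resource terms mentioned in a tweet."""
    return set(re.findall(r"[a-z]+", tweet.lower())) & RESOURCE_TERMS


def match_needs(need_tweets, availability_tweets):
    """Map each need-tweet to the availability-tweets sharing a resource term."""
    matches = {}
    for need in need_tweets:
        wanted = resources_in(need)
        matches[need] = [a for a in availability_tweets if wanted & resources_in(a)]
    return matches
```

A need-tweet that shares no resource term with any availability-tweet maps to an empty list, which makes unmet needs easy to spot.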


ECIR 2017 Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP 2017)

[Dataset] This dataset consists of microblogs posted during the August 2016 earthquake in central Italy, and can be used to develop algorithms for retrieval and summarization of microblogs that are useful for post-disaster relief operations. Supporting publication: ECIR 2017 Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP 2017), ACM SIGIR Forum Newsletter, 2017. 


FIRE 2016 Track on Information Retrieval from Microblogs during Disasters (IRMiDis)

[Dataset] This dataset can be used to develop methods for identifying tweets relevant to some practical queries/topics that are important for conducting post-disaster relief operations. Supporting publication: Overview of the FIRE 2016 Microblog track: Information Extraction from Microblogs Posted during Disasters, FIRE 2016.

Algorithmic bias & fairness


[Code] Implementation of FaiRIR, a suite of Fair Related Item Recommendation algorithms. Supporting publication: FaiRIR: Mitigating Exposure Bias from Related Item Recommendations in Two-Sided Platforms, IEEE Transactions on Computational Social Systems 2022.

[Code + Data] Implementation of FairSumm, a fairness-preserving text summarization algorithm, and the three datasets used in the paper. Supporting publication: Summarizing User-generated Textual Content: Motivation and Methods for Fairness in Algorithmic Summaries, ACM CSCW 2019.

Stance detection

Algorithms for Stance Detection from posts on Web and Social Media

[Code] Implementations of several stance detection algorithms, meant for microblogs (e.g., the SemEval dataset), news articles, etc. Supporting publication: Stance Detection in Web and Social Media: A Comparative Study, CLEF 2019.

Cleaning / normalization of noisy text

Algorithm for cleaning both machine-generated and human-generated noise in text

[Code] Implementation of an unsupervised text cleaning / normalization algorithm (UnsupClean) that can be used to clean different types of noisy text, including text containing machine-generated noise (e.g., OCR noise) as well as text containing human-generated noise (e.g., informally written microblogs). The algorithm can handle text in English as well as in non-English languages. Supporting publication: An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection. ACM Journal of Data and Information Quality, 2020.


Algorithm for stemming informally written microblogs

[Code] Implementation of an algorithm for stemming informally written microblogs/tweets. Supporting publication: Combining Local and Global Word Embeddings for Microblog Stemming. ACM CIKM 2017.

Multimodal learning

Algorithm for zero-shot retrieval of images from textual descriptions

[Code] Implementation of a novel GAN-based model for zero-shot text-to-image retrieval, i.e., retrieving images that match given textual descriptions for classes not seen during training. Supporting publication: ZSCRGAN: A GAN-based Expectation-Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions, ACM CIKM 2020.