Mining and Learning in the Legal Domain

The 3rd International Workshop on Mining and Learning in the Legal Domain (MLLD-2023)

Important Dates


The increasing accessibility of legal corpora and databases creates opportunities to develop data-driven techniques and advanced tools that can facilitate a variety of tasks in the legal domain, such as legal search and research, legal document review and summary, legal contract drafting, and legal outcome prediction. Compared with other application domains, the legal domain is characterized by the huge scale of natural language text data, the high complexity of specialist knowledge, and the critical importance of ethical considerations. The MLLD workshop aims to bring together researchers and practitioners to share the latest research findings and innovative approaches in employing data mining, machine learning, information retrieval, and knowledge management techniques to transform the legal sector. Building upon the success of the previous editions, the third edition of the MLLD workshop will emphasize the exploration of new research opportunities brought about by recent rapid advances in Large Language Models and Generative AI. We encourage submissions that intersect computer science and law, from both academia and industry, embodying the interdisciplinary spirit of CIKM.


We encourage submissions on novel mining- and learning-based solutions in various aspects of legal data analysis, such as legislation, litigation, court cases, contracts, patents, NDAs and bylaws. Topics of interest include, but are not limited to:


All submissions must be in English, in PDF format, and in ACM two-column format (sigconf). The ACM LaTeX templates are available from the ACM website and the Overleaf online editor.

To enable double-blind reviewing, authors are required to take all reasonable measures to conceal their identity. The anonymous option of the acmart class must be used.  Furthermore, ACM copyright and permission information should be removed by using the nonacm option. Therefore, the first line of your main LaTeX document should be as follows.
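
Based on the options named above (sigconf format, anonymous, nonacm), that first line would presumably look like the following. This is a reconstruction from the text rather than a copy of the official instructions, so please check it against the current ACM template.

```latex
% Assumed reconstruction: acmart class with the sigconf format,
% anonymous mode for double-blind review, and nonacm to drop
% the ACM copyright and permission block.
\documentclass[sigconf,anonymous,nonacm]{acmart}
```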

To facilitate the exchange of ideas, this year we adopt a policy similar to that of ICTIR'23, which allows submissions of any length between 2 and 9 pages plus unrestricted space for references. Authors are expected to submit a paper whose length reflects what is needed for the content of the work, i.e., page length should be commensurate with contribution size. Reviewers will assess whether the contribution is appropriate for the given length. Consequently, there is no longer a distinction between long and short papers, nor any need to condense or enlarge medium-length ones. We will probably allocate more presentation time to longer papers during the workshop.

As in the previous editions of MLLD, each paper will be reviewed by at least 3 reviewers from the Program Committee. 

We are going to produce non-archival proceedings for this workshop on, similar to IPA'20. Thus, authors can refine their accepted papers and submit them to formal conferences/journals after the workshop.

Submissions should be made electronically via EasyChair:


The CIKM-2023 conference will be held in person in Birmingham, UK. Therefore, it is expected that most (if not all) of the authors will present their accepted papers in person at this workshop. Some invited speakers and/or participants may have the flexibility to attend online.


Registration for the workshop is handled through the main conference, CIKM-2023.

CIKM will open registration in a few weeks. If you would like to express your interest in the workshop and be notified when registration opens, please send an email to Alina Petrova.

Programme Committee

Keynote Talk


Ensuring Reliability in Legal LLM Applications


The usage of large language models (LLMs) has exploded over the past year, especially since OpenAI introduced ChatGPT in November 2022; however, ensuring accuracy and reliability in LLM-generated outputs remains a challenge, especially in knowledge-intensive domains such as law. In this talk, we will present some methods that we use to ensure reliability in CoCounsel, Casetext's GPT-4-based legal AI assistant, touching upon topics including retrieval-augmented generation for legal research, methods for reducing hallucinations, managing the cost vs. reliability tradeoff, evaluating LLMs in the legal context, and generating synthetic data from GPT-4.


Javed Qadrud-Din is the Director of Research & Development at Casetext, where he builds early systems to push the boundaries of legal technology, including Casetext's deep learning-based semantic search system and, more recently, Casetext's GPT-4-based products. Prior to Casetext, Javed was a machine learning engineer at Meta and held engineering and product roles at IBM. He has been working in machine learning for the past decade, but, before that, he worked as a lawyer for startup companies at the law firm Fenwick & West. He holds a JD from Harvard Law School and a BA from Cornell University.

Martin Gajek is the Head of Machine Learning at Casetext. His team researches and develops low-latency contextual information retrieval systems and language generation systems, including LLMs. These technologies form the backbone of Casetext’s flagship product, CoCounsel. Martin holds a PhD in Applied Physics from Sorbonne University (UPMC) and held postdoctoral positions at UC Berkeley and IBM Research. Prior to joining Casetext, he was involved in R&D for semiconductor hardware, specifically optimizing memory architectures for deep learning accelerators.

Shang Gao is a senior machine learning researcher at Casetext, where he designs, develops, and deploys solutions for legal and transactional language understanding, generative question answering, and knowledge retrieval. His recent work includes the development of CoCounsel, Casetext's AI legal assistant based on OpenAI’s GPT-4, and demonstrating that GPT-4 can pass all portions of the Uniform Bar Exam. Prior to Casetext, Shang was a research scientist at Oak Ridge National Laboratory, where he led a research team building clinical NLP solutions for the National Cancer Institute. Shang has a PhD in Data Science from the University of Tennessee.



If you have any questions regarding this workshop, please email

Previous Workshops