Mining and Learning in the Legal Domain
The 3rd International Workshop on Mining and Learning in the Legal Domain (MLLD-2023)
In conjunction with the 32nd ACM International Conference on Information and Knowledge Management (CIKM-2023)
University of Birmingham and Eastside Rooms, UK
Sunday 22nd October 2023.
Time: 2pm - 5:30pm (BST)
Venue: Alan Walters - 112
Programme
2:00pm-2:05pm Short Intro
2:05pm-2:50pm Keynote Talk (remote)
2:50pm-3:30pm Paper Presentations (remote): [1,2]
3:30pm-4:00pm Coffee Break
4:00pm-4:40pm Paper Presentations (in-person): [3,4]
4:40pm-5:25pm Panel Discussion (in-person)
5:25pm-5:30pm Wrap-up
6:00pm-8:00pm CIKM-2023 Drink Reception at the Great Hall
Keynote Talk
Title:
Ensuring Reliability in Legal LLM Applications
Abstract:
The use of large language models (LLMs) has exploded over the past year, especially since OpenAI introduced ChatGPT in November 2022; however, ensuring accuracy and reliability in LLM-generated outputs remains a challenge, especially in knowledge-intensive domains such as law. In this talk, we will present some of the methods we use to ensure reliability in CoCounsel, Casetext's GPT-4-based legal AI assistant, touching upon topics including retrieval-augmented generation for legal research, methods for reducing hallucinations, managing the cost vs. reliability tradeoff, evaluating LLMs in the legal context, and generating synthetic data from GPT-4.
Speakers:
Javed Qadrud-Din is the Director of Research & Development at Casetext, where he builds early systems to push the boundaries of legal technology, including Casetext's deep learning-based semantic search system and, more recently, Casetext's GPT-4-based products. Prior to Casetext, Javed was a machine learning engineer at Meta and held engineering and product roles at IBM. He has been working in machine learning for the past decade, but, before that, he worked as a lawyer for startup companies at the law firm Fenwick & West. He holds a JD from Harvard Law School and a BA from Cornell University.
Martin Gajek is the Head of Machine Learning at Casetext. His team researches and develops low-latency contextual information retrieval systems and language generation systems, including LLMs. These technologies form the backbone of Casetext’s flagship product, CoCounsel. Martin holds a PhD in Applied Physics from Sorbonne University (UPMC) and held postdoctoral positions at UC Berkeley and IBM Research. Prior to joining Casetext, he was involved in R&D for semiconductor hardware, specifically optimizing memory architectures for deep learning accelerators.
Shang Gao is a senior machine learning researcher at Casetext, where he designs, develops, and deploys solutions for legal and transactional language understanding, generative question answering, and knowledge retrieval. His recent work includes the development of CoCounsel, Casetext's AI legal assistant based on OpenAI’s GPT-4, and demonstrating that GPT-4 can pass all portions of the Uniform Bar Exam. Prior to Casetext, Shang was a research scientist at Oak Ridge National Laboratory, where he led a research team building clinical NLP solutions for the National Cancer Institute. Shang has a PhD in Data Science from the University of Tennessee.
Paper Presentations
[1] Nishchal Prasad, Mohand Boughanem and Taoufiq Dkaki. A Hierarchical Neural Framework for Classification and its Explanation in Large Unstructured Legal Documents. arXiv:2309.10563
[2] Subinay Adhikary, Procheta Sen, Dwaipayan Roy and Kripabandhu Ghosh. Automated Attribute Extraction from Legal Proceedings. arXiv:2310.12131
[3] Zongyue Xue, Huanghai Liu, Yiran Hu, Kangle Kong, Chenlu Wang, Yun Liu and Weixing Shen. LEEC: A Legal Element Extraction Dataset with an Extensive Domain-Specific Label System. arXiv:2310.01271
[4] Oscar Tuvey and Procheta Sen. Automated Argument Generation from Legal Facts. arXiv:2310.05680
Panel Discussion
Title:
LLM Meets LLP: An Industry Insider’s Experience over the Past 12 Months
Abstract:
In this engaging session, our expert panelists from Thomson Reuters and Simmons & Simmons will explore the dynamic future of the legal industry in the wake of the advent of generative AI, shedding light on the profound impact it has on law firms and legal practitioners. We will delve into the opportunities and challenges AI presents within the legal sector, while also addressing the real-world hurdles in adopting AI solutions. This discussion should be of equal interest to industry professionals seeking valuable insights and to academics in pursuit of novel research questions uncovered in practical industry experience.
Moderator:
John Armour, Dean of the Faculty of Law, University of Oxford
Panelists:
Morgane Van Ermengem is Head of Legal Operations Consulting in the Simmons Wavelength team. Based in London, she helps clients understand and implement legal technology to create more efficient workflows and drive high-value performance. Her work centres on legal operations, including process optimisation, data strategy and contract lifecycle management (CLM). Her background as a lawyer helps her tailor Simmons Wavelength’s services to the specific needs of in-house legal teams. Morgane has previous experience working at a top tier law firm in commercial litigation, before moving to a legal tech start-up with a mission to improve access to legal services through commoditisation and the use of technology. She joined Wavelength as a legal engineer in 2018.
Daisy Mak is a Lead Data Scientist (LLM) at Simmons Wavelength with a background in mathematics, computation, and data-intensive fields, and has extensive experience in data analytics across multiple disciplines. She is passionate about finding insights in data and uses technology to provide innovative solutions that drive efficiency and transform operational processes in legal matters. Working alongside international legal teams at Simmons & Simmons and technical teams at Simmons Wavelength, she leads the development of technical solutions and the project management of client matters, with a particular focus on the application of Generative AI and Large Language Models. Before joining Simmons & Simmons, she received her PhD in Astrophysics from the University of Southern California and gained research experience at academic institutes and space agencies in the US and Europe. She has also worked at various tech companies specialising in different fields (e.g. healthcare and mining), and hence has diverse interests and experience in applying data science skills across many subject areas.
Ben Ridgway serves as Chief Technology Officer for Simmons Wavelength, where he is responsible for setting the technical direction of the business, developing and leading our technical teams, and ensuring we have the appropriate technical capabilities to meet the needs of our business. Prior to Simmons Wavelength, Ben co-founded a legal technology business in the transaction management space and before that was a lawyer at Clifford Chance. Ben holds a law degree from the University of Warwick.
Alina Petrova is an Applied Research Scientist at Thomson Reuters Labs in London. She is currently working on various features for TR's flagship legal research product, Westlaw. Her research interests include legal NLP, in particular, legal outcome prediction and legal information extraction, as well as generative AI and prompt engineering. Alina received her PhD in Computer Science from the University of Oxford and later worked there as a postdoctoral researcher focusing on NLP, deep learning, and data science in the legal and economics domains. Prior to that, Alina worked as a research assistant in BIOTEC TU Dresden, specialising in biomedical text mining, as well as for several startups in the areas of machine translation, biomedical NLP, and knowledge graphs.
Dell Zhang currently leads the Applied Research team at Thomson Reuters Labs in London, UK. Prior to this role, he was a Tech Lead Manager at ByteDance AI Lab and TikTok UK, a Staff Research Scientist at Blue Prism AI Labs, and a Reader in Computer Science at Birkbeck College, University of London. He is a Senior Member of ACM, a Senior Member of IEEE, and a Fellow of RSS. He received his PhD from Southeast University (SEU) in Nanjing, China, and then worked as a Research Fellow at the Singapore-MIT Alliance (SMA) until he moved to the UK in 2005. His main research interests include Machine Learning, Information Retrieval and Natural Language Processing. He has published 110+ papers, graduated 11 PhD students, received multiple best paper awards and won several prizes in international data science competitions. He has co-chaired five international workshops, such as IPA at AAAI-2020.
Abstract
The increasing accessibility of legal corpora and databases creates opportunities to develop data-driven techniques and advanced tools that can facilitate a variety of tasks in the legal domain, such as legal search and research, legal document review and summary, legal contract drafting, and legal outcome prediction. Compared with other application domains, the legal domain is characterized by the huge scale of natural language text data, the high complexity of specialist knowledge, and the critical importance of ethical considerations. The MLLD workshop aims to bring together researchers and practitioners to share the latest research findings and innovative approaches in employing data mining, machine learning, information retrieval, and knowledge management techniques to transform the legal sector. Building upon the previous successes, the third edition of the MLLD workshop will emphasize the exploration of new research opportunities brought about by recent rapid advances in Large Language Models and Generative AI. We encourage submissions that intersect computer science and law, from both academia and industry, embodying the interdisciplinary spirit of CIKM.
Topics
We encourage submissions on novel mining- and learning-based solutions for various kinds of legal data, such as legislation, litigation, court cases, contracts, patents, NDAs and bylaws. Topics of interest include, but are not limited to:
Applications of Large Language Models (LLMs) and Generative AI in the legal domain
Prompt engineering and automated prompting for legal NLP tasks
LLMs for legal contract drafting
Legal assistance using conversational AI
Risks and limitations of LLMs in the legal domain
Applications of data mining techniques in the legal domain
Classifying, clustering, and identifying anomalies in big corpora of legal records
Legal analytics
Citation analysis for case law
Applications of machine learning and NLP techniques for legal textual data
Information extraction, information retrieval, question answering and entity extraction/resolution for legal document reviews
Summarization of legal documents
eDiscovery in legal research
Case outcome prediction
Legal language modelling and legal document embedding and representation
Recommender systems for legal applications
Topic modeling in large amounts of legal documents
Training data for the legal domain
Acquisition, representation, indexing, storage, and management of legal data
Automatic annotation and learning with human in the loop
Data augmentation techniques for legal data
Semi-supervised and transfer learning, domain adaptation, distant supervision
Ethical issues in mining legal data
Privacy and GDPR in legal analytics
Bias and trust in the applications of data mining
Transparency in legal data mining
Emerging topics in the intersection of AI and law
Digital lawyers and legal machines
Smart contracts
Future of law practice in the era of Generative AI
Submission
All submissions must be in English, in PDF format, and in ACM two-column format (sigconf). The ACM LaTeX templates are available from the ACM website and the Overleaf online editor.
To enable double-blind reviewing, authors are required to take all reasonable measures to conceal their identity. The anonymous option of the acmart class must be used. Furthermore, ACM copyright and permission information should be removed by using the nonacm option. Therefore, the first line of your main LaTeX document should be as follows.
\documentclass[sigconf,review,anonymous,nonacm]{acmart}
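As a rough illustration only, a minimal document skeleton built around that first line might look like the sketch below. The title, author, affiliation, and section text are placeholders (assumptions, not workshop requirements); with the anonymous option, the author details should not appear in the compiled PDF.

```latex
% Minimal sketch of an anonymized submission; all content below is placeholder text.
\documentclass[sigconf,review,anonymous,nonacm]{acmart}

\begin{document}

\title{Placeholder Title}

% Author details are still entered here; the `anonymous` option hides them in the output.
\author{Placeholder Author}
\affiliation{%
  \institution{Placeholder Institution}
  \country{Placeholder Country}}

% In acmart, the abstract is declared before \maketitle.
\begin{abstract}
Placeholder abstract.
\end{abstract}

\maketitle

\section{Introduction}
Placeholder text.

\end{document}
```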
To facilitate the exchange of ideas, this year we adopt a policy similar to that of ICTIR'23, which allows submissions of any length between 2 and 9 pages plus unrestricted space for references. Authors are expected to submit a paper whose length reflects what is needed for the content of the work, i.e., page length should be commensurate with contribution size. Reviewers will assess whether the contribution is appropriate for the given length. Consequently, there is no longer a distinction between long and short papers, nor any need to condense or expand medium-length ones. We will probably allocate more presentation time to longer papers during the workshop.
As in the previous editions of MLLD, each paper will be reviewed by at least 3 reviewers from the Program Committee.
We are going to produce non-archival proceedings for this workshop on arXiv.org, similar to IPA'20. Thus, authors can refine their accepted papers and submit them to formal conferences/journals after the workshop.
Submissions should be made electronically via EasyChair:
https://easychair.org/conferences/?conf=mlld2023
Important Dates
Paper submission deadline: September 1st, 2023 (AoE)
Paper acceptance notification: September 17th, 2023
Paper final version due: October 1st, 2023
Workshop date: October 22nd, 2023
Attendance
The CIKM-2023 conference will be held in person in Birmingham, UK. Therefore, it is expected that most (if not all) of the authors will present their accepted papers in person at this workshop. Some invited speakers and/or participants may have the flexibility to attend online.
Registration
Registration for the workshop is handled through the main conference, CIKM-2023.
CIKM will be opening registration in a few weeks. If you would like to express your interest in the workshop and be notified when registration opens, please drop an email to Alina Petrova.
Programme Committee
Arian Askari, Leiden University, Netherlands
Pan Du, Thomson Reuters Labs, Canada
Shang Gao, Casetext, USA
Shoaib Jameel, University of Southampton, UK
Evangelos Kanoulas, University of Amsterdam, Netherlands
Dave Lewis, Redgrave Data, USA
Haiming Liu, University of Southampton, UK
Yiqun Liu, Tsinghua University, China
Miguel Martinez, Law Business Research, UK
Isabelle Moulinier, Thomson Reuters Labs, USA
Aileen Nielsen, Harvard University, USA
Joel Niklaus, Stanford University, USA
Milda Norkute, Thomson Reuters Labs, Switzerland
Douglas Oard, University of Maryland, USA
Jaromir Savelka, Carnegie Mellon University, USA
Frank Schilder, Thomson Reuters Labs, USA
Shohreh Shaghaghian, Amazon, Canada
Dietrich Trautmann, Thomson Reuters Labs, Switzerland
Xiaoling Wang, East China Normal University, China
Gineke Wiggers, Wolters Kluwer, Netherlands
Josef Valvoda, University of Cambridge, UK
Jun Xu, Renmin University, China
Fattane Zarrinkalam, University of Guelph, Canada
Organizers
Masoud Makrehchi, Thomson Reuters Labs & OntarioTech University, Canada
Dell Zhang, Thomson Reuters Labs, UK
Alina Petrova, Thomson Reuters Labs, UK
John Armour, University of Oxford, UK
Contact
If you have any questions regarding this workshop, please email mlld23@easychair.org.