Thank you for attending

Full details of our sessions are provided below for the record; materials used at the sessions can be found in the AI4LAM Community Drive.

Conference Opening

November 30 2022, 19:00 UTC

11 AM California | 2 PM DC | 19:00 UK | 20:00 Paris and Oslo | 06:00 (+1) Sydney | 08:00 (+1) Wellington

Two hour session.

Keynote Speakers:

BigScience: A community effort to create a large open language model, Margaret Mitchell

Over the past decade, large language models have had a massive impact on the ability of computers to work with text. Language models are trained on large amounts of unstructured text, and the resulting model can be applied to other tasks through a technique known as transfer learning. Search, named-entity recognition, text classification, text summarisation, translation and many other Natural Language Processing tasks have seen significant performance improvements thanks to large language models.

However, several challenges and issues exist around developing and using large language models. Language models can exhibit various types of bias. For example, when a language model completes the sentence "the man worked as a ___," often the model will answer differently than "the woman worked as a ___".

Training large language models is expensive and resource-intensive in terms of the computational power required and the extensive amounts of data necessary to train robust models. Because of this, these models have primarily been developed by large companies. As a result, many models, or the data used to train them, have not been fully open. Beyond this, many language models are significantly biased towards English and other dominant languages.

BigScience is a community-focused effort to help overcome some of these challenges. The BigScience project has trained and open-sourced BLOOM, a large language model; created the ROOTS corpus, a 1.6TB multilingual dataset; and has done extensive work on data governance and model evaluation. In this presentation, Margaret Mitchell will discuss the project's goals, focusing on data governance and GLAM institutions' potential role in supporting these efforts.

Margaret Mitchell is an AI researcher at Hugging Face and co-chairs the BigScience Data Governance Working Group. She has led significant work in understanding and managing the biases associated with machine learning models.

Reproducibility Crisis in Machine Learning Based Science, Sayash Kapoor

The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems.

We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don't perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.
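The kind of overoptimism that leakage produces can be illustrated with a toy sketch (a hypothetical example for this summary, not drawn from the paper): when rows from the evaluation set also appear in the training set, a memorising model looks perfect even though the labels carry no signal at all.

```python
import random

random.seed(0)

# Toy dataset: random 2-D points with random (uninformative) labels,
# so no model should genuinely beat ~50% accuracy.
data = [((random.random(), random.random()), random.randint(0, 1))
        for _ in range(200)]

def nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbour classifier fit on `train`, scored on `test`."""
    correct = 0
    for (x, y), label in test:
        nearest = min(train, key=lambda r: (r[0][0] - x) ** 2 + (r[0][1] - y) ** 2)
        correct += nearest[1] == label
    return correct / len(test)

# Honest evaluation: disjoint train/test split.
honest = nn_accuracy(data[:100], data[100:])

# Leaky evaluation: the test rows are also present in the training set
# (no train/test separation), so the model simply memorises them.
leaky = nn_accuracy(data, data[100:])

print(f"honest accuracy: {honest:.2f}")  # close to chance (0.5)
print(f"leaky accuracy:  {leaky:.2f}")   # perfect, and entirely spurious
```

Reading either paper alone would not reveal the problem; only inspecting how the split was made does, which is the kind of information the proposed model info sheets are meant to surface.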

Sayash Kapoor is a Ph.D. candidate at Princeton University's Center for Information Technology Policy. His research critically investigates machine learning methods and their use in science, and has been featured in WIRED, the LA Times, and Nature, among other media outlets. At Princeton University, he organized a workshop titled The Reproducibility Crisis in ML-based Science, which saw more than 1,700 registrations. He has worked on machine learning at several institutions in industry and academia, including Facebook, Columbia University, and EPFL Switzerland. He is a recipient of a Best Paper award from ACM FAccT and an Impact Recognition award from ACM CSCW.

Chapitre Francophone / French Speaking Chapter Meeting

November 30 2022, 14:00 UTC

6 AM California | 9 AM DC | 14:00 UK | 15:00 Paris and Oslo | 03:00 (+1) Sydney | 05:00 (+1) Wellington

Two hour session.

Ordre du jour

  • actualités du bureau du chapitre (15 min) (Florence Clavaud, Archives nationales de France)

  • "le groupe de travail du chapitre sur l'HTR : d'où partons-nous, où voulons-nous aller ?" (Aurélia Rostaing et Jean-François Moufflet, Archives nationales) (30 min)

  • "actualités de la feuille de route IA de la BnF" (Céline Leclaire et Jean-Philippe Moreux, BnF) (30 min)

  • "vers une solution de mutualisation des outils de mise en oeuvre d’IA" (Luc Bellier, direction des Bibliothèques de l'Université Paris Saclay) (20 min)

  • divers (discussion sur tout autre sujet suggéré par les membres avant ou pendant la réunion)

  • en guise de conclusion : date de la prochaine réunion (et renouvellement du bureau) (Luc Bellier)

In English (meeting will be held in French):

  • news from the bureau of the chapter (15 min) (Florence Clavaud, National Archives of France)

  • "the working group of the chapter on HTR: where are we starting from, where do we want to go?" (Aurélia Rostaing and Jean-François Moufflet, National Archives of France) (30 min)

  • "news on the AI roadmap of the BnF" (Céline Leclaire and Jean-Philippe Moreux, BnF) (30 min)

  • "towards a solution for pooling AI implementation tools" (Luc Bellier, Library Directorate, Université Paris-Saclay) (20 min)

  • miscellaneous (discussion on any other subject suggested by the members before or during the meeting)

  • as a conclusion: date of the next meeting (and renewal of the bureau) (Luc Bellier)

Community Led Discussion - AI4LAM: Let's Talk

December 1 2022, 02:30 UTC

6:30 PM (-1) California | 9:30 PM (-1) DC | 02:30 UK | 13:00 Adelaide | 13:30 Sydney | 15:30 Wellington

One hour session.

Let's Talk

Facilitators: Alexis Tindall (University of Adelaide) and Rowan Payne (DigitalNZ), with support from Ingrid Mason


Abstract:

Looking ahead, we'll be discussing what's on the agenda for the rest of the AI4LAM 2022 Virtual Event. With some talking points to kick us off, we're also looking to you to bring your thoughts, questions, and topics for discussion. In this open session, we're hoping to explore all the topics that you want to talk about, so come prepared and ready to chat.


Chapter Recap

Kick-off questions:

  • What do you see as the most pressing capability needs for AI/ML in our sector?

  • What is the maturity of AI/ML capability and resources in our sector?

Community Led Discussion - Applying AI in the Smaller Research Library

December 1 2022, 07:00 UTC

11 PM (-1) California | 2 AM DC | 07:00 UK | 08:00 Paris and Oslo | 18:00 Sydney | 20:00 Wellington

Two hour session.

Session full.

Applying AI in the Smaller Research Library

Facilitators: Andrew Cox (University of Sheffield), Emmanuelle Bermès (École nationale des chartes) and Sam Thomas (NHS), in association with the IFLA Artificial Intelligence SIG


Abstract:

The AI4LAM community is focused on the application of AI techniques to collections as data, but most realised use cases come from larger institutions with the capacity to invest heavily in innovation and R&D. Is AI within reach for smaller organisations? Many research libraries with special collections would like to experiment with and develop services using the same techniques, but lack the resources, skills and confidence to do so.


Participants are invited to share their perspectives, ideas and proposed solutions in an open discussion.


Questions to address:

  • Can we identify AI use cases for smaller organisations?

  • What types of projects could be low resource / risk starting points?

  • What resources exist to support skill development?

  • What existing open source or commercial platforms offer pathways into use of AI?

  • What kind of collaborations could be built to support the wider library community to engage with AI, e.g. sharing technical resources or learning experiences?

Community Led Discussion - Teaching and Learning AI for GLAM

December 1 2022, 16:00 UTC

8 AM California | 11 AM DC | 16:00 UK | 17:00 Paris and Oslo | 03:00 (+1) Sydney | 05:00 (+1) Wellington

Two hour session.

Teaching and Learning AI for GLAM

Facilitators: Claudia Engel (Stanford University), Mike Trizna (Smithsonian Institution), Daniel van Strien (British Library)


Abstract:

The AI4LAM Teaching and Learning Working Group was officially formed and approved in summer 2020. The facilitators will present what the working group has achieved over the last couple of years and discuss community feedback on future directions for the group.

Hands-on Session - Introduction to Computer Vision

December 2 2022, 09:00 UTC

1 AM California | 4 AM DC | 09:00 UK | 10:00 Paris and Oslo | 20:00 Sydney | 22:00 Wellington

Two hour session.

Introduction to Computer Vision

Facilitator: Giles Bergel, Senior Researcher in Digital Humanities, University of Oxford


Abstract:

This hands-on workshop will introduce the application of visual AI to cultural heritage collections such as printed books, photographs, paintings and audiovisual content. Using the example of collaborations between Oxford’s Visual Geometry Group (VGG) and researchers and curators within the GLAM sector, the workshop will provide a hands-on introduction to VGG’s free and open-source tools for visual search, classification, comparison and annotation. The workshop will also outline some of the critical and ethical issues facing LAM institutions seeking to deploy machine learning, such as questions around privacy, bias and accreditation of labour and ownership of trained models. No prior knowledge of computer vision or coding is required to participate in the workshop.

Attendees will:

  • gain an understanding of how collections of materials such as printed books, paintings, maps, photographs and audio-visual materials can be made searchable;

  • learn how visual AI is being used in GLAM institutions and by affiliated researchers;

  • and discover how they can use AI for themselves on their own collections.

Hands-on Session - Integration of Data Sheets into GLAM Practice

December 2 2022, 16:00 UTC

8 AM California | 11 AM DC | 16:00 UK | 17:00 Paris and Oslo | 03:00 (+1) Sydney | 05:00 (+1) Wellington

One hour session.

Integration of Data Sheets into GLAM Practice

Facilitator: Claudia Engel, Stanford University

Abstract:

As one way to address bias in machine learning algorithms, attention has been directed towards the data that serve as training sets for those algorithms. One proposed mitigation is to provide a description of the data with the aim of creating awareness of potential shortcomings for their use in predictive modelling. While descriptive metadata are a core practice in GLAM organisations, such descriptions were proposed as a practice for the machine learning community only recently, perhaps most prominently by Gebru et al. (2021)[1]. These descriptions, also known as data biographies or datasheets, provide information about the provenance, use, and limitations of a digital data set.

In this session we will review relevant ongoing efforts in the GLAM sector and discuss how information from datasheets might be incorporated into GLAM machine learning models and practices. How are digital data sets currently described in GLAMs, and are there other models that might be useful? What are the various applications of machine learning within GLAM organisations, and how might that affect the creation of datasheets? What might be pathways towards best practices for integrating datasheets into machine learning workflows?

[1] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723
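As a concrete sketch of the idea, a machine-readable datasheet record covering the kinds of questions Gebru et al. raise (motivation, composition, provenance, limitations, intended use) might look like the following. Every field name and value here is illustrative; there is no fixed schema, and real datasheets answer the paper's question prompts in prose.

```python
from dataclasses import asdict, dataclass, field

@dataclass
class Datasheet:
    """Minimal, illustrative datasheet record for a digital collection.

    Field names are hypothetical: real datasheets follow the question
    prompts in Gebru et al. (2021) rather than a fixed schema.
    """
    title: str
    motivation: str                     # why the dataset was created
    composition: str                    # what the instances are, and how many
    provenance: str                     # how the data was collected or digitised
    known_gaps: list = field(default_factory=list)        # potential shortcomings
    recommended_uses: list = field(default_factory=list)
    discouraged_uses: list = field(default_factory=list)

# A made-up example record for a fictional collection.
sheet = Datasheet(
    title="Digitised 19th-century newspaper OCR text",
    motivation="Support full-text search of the collection",
    composition="1.2M pages of OCR output with page-level metadata",
    provenance="Scanned from microfilm; OCR produced by an off-the-shelf engine",
    known_gaps=[
        "OCR quality varies with print condition",
        "Collection over-represents urban English-language titles",
    ],
    recommended_uses=["search", "corpus statistics with error modelling"],
    discouraged_uses=["training models that assume clean, representative text"],
)

# Serialised alongside the dataset, the record carries the caveats to
# downstream machine learning users.
record = asdict(sheet)
print(record["known_gaps"][1])
```

The design point is simply that the caveats travel with the data in a structured form, which is close to existing descriptive-metadata practice in GLAMs.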

Hands-on Session - A Proposed Framework for Operationalising AI in LAMs

December 2 2022, 20:00 UTC

12 PM California | 3 PM DC | 20:00 UK | 21:00 Paris and Oslo | 07:00 (+1) Sydney | 09:00 (+1) Wellington

One hour session.

A Proposed Framework for Operationalising AI in LAMs

Facilitators: Abigail Potter and Meghan Ferriter, LC Labs / Digital Strategy Division / Office of the Chief Information Officer, Library of Congress


Abstract:

Libraries, archives, museums and other public cultural heritage organisations share challenges in operationalising AI technologies in ways that help them achieve their goals and be responsible stewards of heritage collections. Through research, experimentation and collaboration, the LC Labs team has developed a set of tools to document, analyse, prioritise and assess AI technologies in a LAM context. This framework is in draft form and in need of additional use cases and perspectives.


The facilitators will introduce the framework and ask participants to use it to evaluate their own proposed or in-process ML or AI project, system or task as a use case.


Sharing the framework elements and gathering feedback are the goals of the workshop.


Sample Elements and Prompts from the framework:

  • Organisational Profile: How does, or how will, your organisation want to use AI or machine learning?

  • Define the Problem you are trying to solve.

  • Write a user story about the AI/ML task or system you are planning or doing

  • Risks and Benefits: What are the benefits and risks to users, staff and the organisation when an AI/ML technology is/will be used?

  • What systems or policies will/do the AI/ML task or system impact or touch?

  • What are the limitations of future use of any training, target, validation or derived data?

  • Data Processing Plan: What documentation will you require when using AI or ML technologies?

  • What are the success metrics and measures for the AI/ML task?

  • What are the quality benchmarks for the AI/ML output? What could come next?