ICDAR 2024 Tutorial: Hands-On Deep Learning for Document Analysis

Course Materials for the ICDAR 2024 Tutorial

Original Announcement

Organizer: Thomas M. Breuel (NVIDIA, USA) | Contact: tmb@9x9.com

In recent years, OCR, document analysis, and linguistic tasks on document databases have seen a resurgence, in large part due to the success of transformers in large language models and OCR.

The tutorial will cover basic deep learning and data processing techniques for document analysis and LLM training. The objective is to give participants the basic practical tools for working with very large datasets in this research area:

survey of available model architectures and pretrained models
large-scale document collections in PDF and image formats; Common Crawl
training transformer models in PyTorch for LLMs and OCR
LLM fine-tuning for document analysis tasks
WebDataset for training, managing, and transforming large datasets
Ray for large-scale distributed processing
data augmentation and generation
evaluation metrics and performance measurements
Hugging Face and libraries for LLMs
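
As a small illustration of the "evaluation metrics" topic above, the sketch below computes a character error rate (CER), a metric commonly used to score OCR output against ground-truth text. It is not taken from the tutorial materials; the function names edit_distance and character_error_rate are hypothetical, and the code assumes unit cost for insertions, deletions, and substitutions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length (an assumed,
    commonly used definition of CER)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, character_error_rate("document analysis", "docunent analysis") counts one substitution against a 17-character reference. Because the score is normalized by the reference length, it can exceed 1.0 when the hypothesis is much longer than the reference.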

The tutorial will consist of a series of exercises in Jupyter Notebooks.

Prerequisites: basic knowledge of Python, PyTorch, and deep learning is recommended; participants are encouraged to set up a working PyTorch / Jupyter environment (laptop, remote desktop, and/or Google Colab).

Thomas Breuel is a research scientist at NVIDIA, focusing on petascale deep learning, distributed learning tools, text recognition, and the relationship between deep learning and statistics. He has over 30 years of experience in machine learning and computer vision. Breuel's career includes a position as a research scientist at Google, where he was part of the Google Brain team working on machine learning, pattern recognition, and computer vision. He served as a professor of computer science and director of the IUPR Research Lab at the University of Kaiserslautern, Germany, leading research in pattern recognition, machine learning, and image understanding. His work at the University of Kaiserslautern involved collaborations with Google, Microsoft, Smiths Detection, Deutsche Telekom, and the BMBF. Before his academic tenure, Breuel was a member of the research staff at Xerox PARC, focusing on computer vision, pattern recognition, and document layout analysis. He developed the layout analysis technology behind UbiText and initiated the GroupFire project for collaborative and personalized Internet search methods. Additionally, he was a member of the research staff at IBM Almaden Research Center, where he contributed to IBM's DCS 2000 team for the Year 2000 US Census and was part of the team that developed QBIC, an early content-based multimedia image and video database retrieval system. Breuel earned his Ph.D. in Computational Neuroscience from MIT, where he researched geometric aspects of visual object recognition.
