Registration closed
The “Hemeroteca Digital” is part of the Hispanic Digital Library project, which aims to make the Spanish Bibliographic Heritage held by the Biblioteca Nacional de España available for public consultation and dissemination over the Internet. The Hemeroteca was created in March 2007 to provide public access to the digital collection of Spanish historical press housed in the Library, with an initial collection of 143 press and magazine titles. Today, it contains millions of publicly available digitised pages. This digital collection has been created to serve as a key reference for the study and consultation of Spanish historical press and magazines. In addition to offering access to the texts for reading and consultation, it also provides information on major digital newspaper collections, facilitating greater awareness of and access to the still partly unexplored Spanish newspaper heritage. The guiding criterion for curating this collection has been the selection of newspapers and magazines that are representative of their time and showcase the thematic diversity of Hispanic press publishing. Visitors to the newspaper library will thus find publications spanning politics, satire, humor, science, religion, illustration, entertainment, sports, art, literature, and more.
The digital publications are provided in PDF format with OCR, enabling users to search for any desired topic within the text. These text search capabilities make the Digital Newspaper Library an invaluable tool for research. Generating a final transcription from scanned pages is a challenging task that currently requires a vast amount of human resources. The process involves not only a high-performing OCR system but also a robust error-correction approach, as many pages are of poor quality or the OCR system is simply unable to reproduce the original text (because of spots, stains, non-standard spelling, and other issues). In addition, newspapers pose several OCR difficulties related to their structure: they tend to be arranged in columns that are not always regular, they include different types of images with and without text, and news items often start on one page and continue on later, non-consecutive pages, with the abbreviated title at the beginning of the news item and ‘continued on page xx’ at the end. This task aims to advance the automation of this process.
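To make the idea of an error-correction step more concrete, the following is a minimal sketch of lexicon-based post-correction of OCR output. It is only an illustration and not the task baseline or any participant's system: the function name `correct_ocr_text`, the tiny lexicon, and the similarity cutoff of 0.8 are all assumptions chosen for the example. Each word that is missing from the lexicon is replaced by its closest lexicon entry, if one is similar enough.

```python
import difflib
import re


def correct_ocr_text(text, lexicon, cutoff=0.8):
    """Replace words missing from the lexicon with their closest entry.

    Hypothetical helper for illustration only. Words with no lexicon
    entry at or above the similarity cutoff are left unchanged.
    Replacements are returned in lowercase.
    """
    known = {word.lower() for word in lexicon}
    vocabulary = sorted(known)

    def fix(match):
        word = match.group(0)
        if word.lower() in known:
            return word
        # difflib ranks candidates by SequenceMatcher similarity ratio.
        candidates = difflib.get_close_matches(
            word.lower(), vocabulary, n=1, cutoff=cutoff
        )
        return candidates[0] if candidates else word

    # [^\W\d_]+ matches runs of letters, including accented ones.
    return re.sub(r"[^\W\d_]+", fix, text)


if __name__ == "__main__":
    # Assumed toy lexicon; a real system would need a period-appropriate one.
    lexicon = ["periódico", "diario", "noticia"]
    print(correct_ocr_text("El periódlco publica la notlcia", lexicon))
    # prints: El periódico publica la noticia
```

A fixed modern lexicon would probably over-correct the non-standard spellings found in historical press, so a step like this would need to be tuned or replaced by a context-aware model in practice.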
If you want to participate in the PastReader@IberLEF2025 shared task, please fill in this form. Once you are registered, you can ask any questions through the Google Group of the shared task, PastReader@IberLEF2025.
Participants will be required to submit their runs and are asked to describe their systems in paper submissions. We encourage participating teams to highlight the real contribution of their systems by identifying successful approaches, along with failed attempts and findings on how to progress towards better-performing solutions. This description must contain the following details:
Architecture: modules, components, data flow…
Additional data used for training (if any): augmented data, additional datasets…
Additional technologies employed (if any): existing OCR systems along with selection criteria
Pre-trained models used (if any): source of the model, selection criteria…
Experiments conducted and training parameters: configuration, hyperparameters used…
Analysis of results: findings from results, ranking according to different metrics, interpretation, and validation…
Error analysis: a study of failed predictions and their characterization, possible improvements, and lessons learned…
This information is the minimum required for submission approval and is therefore mandatory.
If you have any specific questions about the PastReader 2025 task, please let us know through the Google Group PastReader@IberLEF2025.
For any other questions that do not directly concern the shared task, please contact Isabel Cabrera de Castro or Arturo Montejo Ráez.
PastReader at IberLEF2025
SINAI Research Group
Twitter: @NLP_SINAI
This task is partially funded by the projects CONSENSO (Ref. PID2021-122263OB-C21), SocialTox (Ref. PDC2022-133146-C21), MODERATES (Ref. TED2021-130145B-I00) and GRESEL-UNED (Ref. PID2023-151280OB-C22), financed by the Plan Nacional I+D+i of the Spanish Government.