Dataset

The historical press publications in the public domain, digitized by the National Library of Spain (BNE) and available for free access through Hemeroteca Digital, form a corpus of 298 press titles, 88,748 issues and 8,302,407 pages as of the date of this proposal. This collection is continuously growing thanks to the BNE's ongoing digitization processes.

The collection spans from the 17th to the 20th century and covers a wide range of topics. The content is accessible in PDF format via the Hemeroteca Digital viewer, while the OCR (optical character recognition) output of specific digital objects can also be accessed via URL.

The full text files derived from the OCR process are available for full download via the Hemeroteca Digital interface. A .txt file is provided for each issue, and all issues of a specific publication are included in a compressed .zip file. A summary of publications offering full text can be accessed at: https://hemerotecadigital.bne.es/hd/es/fulltext-csv.
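As an illustration of working with the per-publication archives described above, the following minimal Python sketch reads every .txt member of a .zip file into memory. The member names, the UTF-8 encoding and the in-memory demo archive are assumptions made for the example; the actual archive layout and encoding should be checked against a real download.

```python
import io
import zipfile


def read_issue_texts(archive, encoding="utf-8"):
    """Return {member name: text} for every .txt member of a publication archive.

    `archive` may be a file path or a binary file-like object.
    The encoding is an assumption; undecodable bytes are replaced, not dropped.
    """
    texts = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".txt"):
                texts[name] = zf.read(name).decode(encoding, errors="replace")
    return texts


# Demo with a small in-memory archive (hypothetical member names and content).
demo_buffer = io.BytesIO()
with zipfile.ZipFile(demo_buffer, "w") as zf:
    zf.writestr("issue_0001.txt", "Texto del primer número")
    zf.writestr("issue_0002.txt", "Texto del segundo número")
    zf.writestr("README.md", "not an issue file")
demo_buffer.seek(0)

demo_texts = read_issue_texts(demo_buffer)
```

For a downloaded archive, the same function can be called with the local path of the .zip file instead of the in-memory buffer.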

The quality of OCR results varies depending on several factors: date of digitization and technology available (press materials have been digitized by the BNE since the late 1990s), quality of optical technology, state of preservation of the originals, text structure complexity, etc.

The BNE has undertaken efforts to improve these resulting texts through various approaches. One of the most significant initiatives involves open and collaborative OCR correction (among other types of projects) through the ComunidadBNE platform: https://comunidad.bne.es/.

The manually corrected output is a valuable resource, for example as "ground truth" data for testing and training technology. Project proposals for collaborative correction are selected based on general interest (frequently consulted publications with poor OCR), uniqueness of the publications, or feasibility of an open correction process accessible to any user (avoiding overly complex orthotypographic or structural issues).
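One common way to use corrected text as ground truth is to score raw OCR output against it. This page does not name an evaluation metric, so the sketch below assumes character error rate (CER), defined here as the Levenshtein edit distance between the OCR output and the corrected text divided by the length of the corrected text. The function names and sample strings are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, corrected_text):
    """Edit distance normalized by the length of the corrected (reference) text."""
    if not corrected_text:
        raise ValueError("corrected_text must not be empty")
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)


# Hypothetical sample: the OCR output misreads two characters of "Madrid".
sample_cer = character_error_rate("Madnd", "Madrid")
```

A CER of 0 means the OCR output matches the corrected text exactly; values can exceed 1 when the OCR output is much longer than the reference.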

Train, development and test sets will be prepared from the above collection as stratified partitions. The dataset is expected to be organized as follows:

Train partition: 8,959 pages (Scanned PDF, OCR output and corrected text) [NOW AVAILABLE!] [LINK TO DATA]
Development partition: 500 pages (Scanned PDF, OCR output and corrected text) [NOW AVAILABLE!] [LINK TO DATA]
Test partition [NOW AVAILABLE!] [LINK TO DATA]:
- Task 1: 2,736 pages (OCR output only released to participants)
- Task 2: 2,736 pages (Scanned PDF only released to participants)
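The stratification criteria used to build these partitions are not described here. As a rough illustration of what a stratified split involves, the Python sketch below assumes that pages are stratified by publication title and that each title is divided among the partitions in the same proportions. The fractions, page tuples and function name are hypothetical and do not reflect the actual partition sizes.

```python
import random
from collections import defaultdict


def stratified_split(items, key, fractions, seed=0):
    """Split items so that each stratum is divided in roughly the given fractions.

    `key` maps an item to its stratum; `fractions` maps partition name to its
    share and should sum to 1. The last partition takes whatever remains.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    names = list(fractions)
    parts = {name: [] for name in names}
    for group in strata.values():
        rng.shuffle(group)
        start, cumulative = 0, 0.0
        for i, name in enumerate(names):
            cumulative += fractions[name]
            is_last = i == len(names) - 1
            end = len(group) if is_last else round(cumulative * len(group))
            parts[name].extend(group[start:end])
            start = end
    return parts


# Hypothetical pages: (publication title, page number).
pages = [(title, n) for title in ("Title A", "Title B") for n in range(100)]
parts = stratified_split(
    pages,
    key=lambda page: page[0],
    fractions={"train": 0.7, "dev": 0.1, "test": 0.2},
)
```

Fixing the seed makes the split reproducible, and splitting within each stratum keeps every publication represented in every partition in similar proportions.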
