From CIDEHUSDigital: creating and developing an annotation team
The CD_Nota project aims to use texts or other digital artefacts to generate data that is fully processable by humans and machines.
The research team established a pipeline that aims to "transform" historical texts into data with historical and linguistic value. The team is multidisciplinary (History, Linguistics, Computer Science), and all its members work toward this common goal.
The textual data have been processed along two lines: spelling normalisation and semantic annotation (Named Entities). Some of the data obtained are already available to the scientific community. Importantly, they can also be linked to other repositories, demonstrating the project's scalability and potential for wider impact.
The Parish Memories collection (1758) is the starting point and serves as a laboratory for this process. We perform several tasks: revision of the transcriptions, manual spelling normalisation, assisted manual annotation, analysis of the results obtained, and training of large language models.
The data will feed a range of studies on the South and be linked to other repositories, such as the World Historical Gazetteer (WHG) for Portuguese toponymy, in a pipeline that extracts, processes, and publishes textual data.
Some texts are already available at CIDEHUSDigital through a collaborative transcription process that started in 2008. In addition, the team transcribed some documents and revised them against the original manuscripts to obtain linguistic data from these texts.
Manual spelling normalisation, aiming to obtain:
normalised version (21st-century spelling pattern - European Portuguese)
pairs of 18th-century texts and their corresponding contemporary versions
lexical lists, in both patterns
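As an illustration of how the normalisation outputs listed above relate to one another, the following minimal sketch derives lexical lists in both spelling patterns from pairs of original and normalised text. It is not the project's implementation: the function name `build_lexical_lists`, the assumption that each pair is aligned token by token, and the two sample sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_lexical_lists(pairs):
    """Build lexical lists from (original, normalised) text pairs.

    Assumption: the two texts in each pair have the same number of
    whitespace-separated tokens, so they can be aligned one-to-one.
    """
    # original form -> counts of the normalised forms it was mapped to
    mapping = defaultdict(Counter)
    for original, normalised in pairs:
        orig_tokens = original.split()
        norm_tokens = normalised.split()
        if len(orig_tokens) != len(norm_tokens):
            raise ValueError("pair is not token-aligned")
        for orig_tok, norm_tok in zip(orig_tokens, norm_tokens):
            mapping[orig_tok.lower()][norm_tok.lower()] += 1

    # One lexical list per spelling pattern, sorted by code point
    original_lexicon = sorted(mapping)
    normalised_lexicon = sorted(
        {form for counts in mapping.values() for form in counts}
    )
    return mapping, original_lexicon, normalised_lexicon


# Hypothetical sample pairs (not taken from the project's data)
pairs = [
    ("Esta villa he antiga", "Esta vila é antiga"),
    ("A freguezia tem hum rio", "A freguesia tem um rio"),
]
mapping, original_lexicon, normalised_lexicon = build_lexical_lists(pairs)
```

Keeping a count per normalised form, rather than a single target, leaves room for original spellings that were normalised in more than one way in different contexts.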
INCEpTION - A semantic annotation platform offering intelligent assistance and knowledge management
https://inception-project.github.io/
1st DataSet obtained
Phase #4 - Data Analysis
Phase #5 - Training LLMs (Large Language Models)