Toolset for temporal processing of documents
Time is an important dimension for understanding textual information. These tools were developed to take advantage of that dimension and to make it usable in several contexts. The temporal processing is applied to texts written in Portuguese. The toolset finds temporal information in the document content and converts it into a machine-readable format. Temporal processing also makes it possible to establish temporal relationships between words. [P1,P2,P3]
The toolset was built for research purposes, so all the tools should be considered beta software. Try it!
The Extraction tool extracts temporal information using two main modules.
1. Annotator. It identifies and classifies the temporal expressions found in texts written in Portuguese and annotates them in the original text with specific tags. For the sentence "Os quenianos dominaram a corrida de São Silvestre ontem"(1) ("Yesterday Kenyans dominated the Saint Silvester Road Race."), the following example shows the Annotator output.
Os quenianos dominaram a corrida de São Silvestre <EM ID="1" CATEG="TEMPO" TIPO="TEMPO_CALEND" SUBTIPO="DATA">ontem</EM>.
2. Resolver. It converts the temporal expressions identified by the Annotator into a machine-readable format by adding a VAL_NORM attribute to the annotation tag, as the following example illustrates.
Os quenianos dominaram a corrida de São Silvestre <EM ID="1" CATEG="TEMPO" TIPO="TEMPO_CALEND" SUBTIPO="DATA" VAL_NORM="1993-12-31">ontem</EM>.
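The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the toolset's actual implementation: the `resolve` function, the `RELATIVE_OFFSETS` table, and the reference date of 1994-01-01 are all assumptions made for the example (the reference date is simply the one that makes "ontem" resolve to the VAL_NORM shown above).

```python
import re
from datetime import date, timedelta

# Hypothetical lookup table for a few relative day expressions.
# The real Resolver's rules are not described in this document.
RELATIVE_OFFSETS = {"anteontem": -2, "ontem": -1, "hoje": 0, "amanhã": 1}

# Matches an <EM ...>text</EM> annotation as produced by the Annotator example.
EM_PATTERN = re.compile(r"<EM (?P<attrs>[^>]*)>(?P<text>[^<]*)</EM>")


def resolve(annotated: str, reference: date) -> str:
    """Add VAL_NORM to every <EM> tag whose text is a known relative expression."""

    def add_val_norm(match: "re.Match[str]") -> str:
        text = match.group("text")
        offset = RELATIVE_OFFSETS.get(text.strip().lower())
        if offset is None:
            # Unknown expression: leave the annotation untouched.
            return match.group(0)
        value = (reference + timedelta(days=offset)).isoformat()
        return f'<EM {match.group("attrs")} VAL_NORM="{value}">{text}</EM>'

    return EM_PATTERN.sub(add_val_norm, annotated)


annotated = (
    "Os quenianos dominaram a corrida de São Silvestre "
    '<EM ID="1" CATEG="TEMPO" TIPO="TEMPO_CALEND" SUBTIPO="DATA">ontem</EM>.'
)
# Assumed document date; it is not stated in the source.
resolved = resolve(annotated, date(1994, 1, 1))
print(resolved)
```

Under these assumptions the printed line carries VAL_NORM="1993-12-31", matching the Resolver example above. Expressions that are not in the lookup table are returned unchanged.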
The Segmentation tool (Time4Word) performs temporal segmentation of Portuguese texts that have already been annotated and normalized by the Extraction tool. Segmentation is based on the time discontinuities found in the text and is used to obtain temporal relationships between words. The segments are marked up in the text with a specific tag, as the following example illustrates.
<SEGMENT DN="1993-12-31">Os quenianos dominaram a corrida de São Silvestre <EM ID="1" CATEG="TEMPO" TIPO="TEMPO_CALEND" SUBTIPO="DATA" VAL_NORM="1993-12-31">ontem</EM>.</SEGMENT>
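As a rough illustration of how segmented output like this could be consumed, the following Python sketch groups the words of each segment under the segment's DN value. The function name `words_by_date` and the regular expressions are assumptions made for this example, and DN is read simply as the date attached to the segment, as in the example above. The sketch also assumes segments are not nested.

```python
import re
from typing import Dict, List

# Matches one <SEGMENT DN="...">...</SEGMENT> block (assumed not to be nested).
SEGMENT_PATTERN = re.compile(
    r'<SEGMENT DN="(?P<dn>[^"]*)">(?P<body>.*?)</SEGMENT>', re.DOTALL
)
# Matches any remaining markup tag, such as <EM ...> and </EM>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def words_by_date(segmented: str) -> Dict[str, List[str]]:
    """Map each segment's DN value to the words found inside that segment."""
    result: Dict[str, List[str]] = {}
    for match in SEGMENT_PATTERN.finditer(segmented):
        plain = TAG_PATTERN.sub("", match.group("body"))  # drop inner tags
        words = re.findall(r"\w+", plain)  # \w is Unicode-aware for str patterns
        result.setdefault(match.group("dn"), []).extend(words)
    return result


segmented = (
    '<SEGMENT DN="1993-12-31">Os quenianos dominaram a corrida de São Silvestre '
    '<EM ID="1" CATEG="TEMPO" TIPO="TEMPO_CALEND" SUBTIPO="DATA" '
    'VAL_NORM="1993-12-31">ontem</EM>.</SEGMENT>'
)
result = words_by_date(segmented)
print(result)
```

Each word in the sentence ends up associated with the date 1993-12-31, which is one simple way to read the word-to-time relationship that the segmentation is meant to provide.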
(1) Sentence extracted from CHAVE, a Portuguese text collection available at http://www.linguateca.pt/chave.
Collections
The following collections were created for research purposes only, within the scope of temporal extraction and segmentation of documents. More information is available in [P1,P2,P3].
The collections consist of texts written in Portuguese. All documents were extracted from the Second HAREM collection created by Linguateca.
myCol01_normalizated is composed of 30 documents that were annotated by PorTexTO and normalized manually.
myCol01_normalizated (expanded version) extends the collection with 4 more documents.
myCol02_segmented is composed of 4 documents that were annotated by PorTexTO and then normalized and segmented manually.
myCol03_segmented is composed of 28 documents that were annotated by PorTexTO and then normalized and segmented manually.