The system has three components: a preprocessing module that formats input poems uniformly, an NLP pipeline, and the enjambment-detection module itself. The diagram below depicts the workflow.
System Workflow Diagram
The poems were originally in HTML. The preprocessing module removes the HTML markup and identifies the content that belongs to the poem (i.e. the two quatrains and the two tercets in the case of a sonnet).
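As a rough illustration of this kind of preprocessing, the sketch below strips markup with Python's standard `html.parser` and groups the remaining text into stanzas. It is not the system's actual code: the assumptions that each stanza is a `<p>` element and that lines are separated by `<br>`, as well as the names `PoemTextExtractor`, `extract_stanzas` and `looks_like_sonnet`, are hypothetical.

```python
from html.parser import HTMLParser


class PoemTextExtractor(HTMLParser):
    """Keeps only text found inside <p> elements (assumed to be stanzas)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stanzas = []      # list of stanzas, each a list of line strings
        self._lines = []       # lines of the stanza currently being read
        self._buffer = []      # text chunks of the line currently being read
        self._in_stanza = False

    def _end_line(self):
        # Join the buffered chunks and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text:
            self._lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_stanza = True
        elif tag == "br":
            self._end_line()

    def handle_endtag(self, tag):
        if tag == "p":
            self._end_line()
            if self._lines:
                self.stanzas.append(self._lines)
            self._lines = []
            self._in_stanza = False

    def handle_data(self, data):
        # Text outside <p> (titles, navigation, footers) is discarded.
        if self._in_stanza:
            self._buffer.append(data)


def extract_stanzas(html_text):
    parser = PoemTextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.stanzas


def looks_like_sonnet(stanzas):
    # Two quatrains followed by two tercets.
    return [len(stanza) for stanza in stanzas] == [4, 4, 3, 3]


# Placeholder input: the verse text is dummy content, not a real poem.
SAMPLE_HTML = """
<html><body>
<h1>Título</h1>
<p>verso 1<br>verso 2<br>verso 3<br>verso 4</p>
<p>verso 5<br>verso 6<br>verso 7<br>verso 8</p>
<p>verso 9<br>verso 10<br>verso 11</p>
<p>verso 12<br>verso 13<br>verso 14</p>
<div class="footer">site navigation</div>
</body></html>
"""

if __name__ == "__main__":
    stanzas = extract_stanzas(SAMPLE_HTML)
    print(len(stanzas), looks_like_sonnet(stanzas))
```

Real pages may mark up stanzas differently, so the tag choices above would need adapting to the source at hand.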
The NLP pipeline is IXA Pipes (Agerri et al., 2014). Its results for contemporary Spanish are competitive.
We used the Spanish modules provided with this library, which were trained on the AnCora corpus for part-of-speech tagging, constituency parsing, and syntactic dependency parsing. Dependency parsing in this library is based on MATE tools.
Our system uses this NLP pipeline to obtain part-of-speech tags, syntactic constituency (e.g. verb-phrase, noun-phrase) and syntactic dependencies (e.g. direct object).
The enjambment-detection module is rule- and dictionary-based, and exploits the information provided by the NLP pipeline. A set of 30 rules of varying kinds identifies enjambed lines and assigns each of them one of 12 types, based on the typology available at this link.
A basic description follows. More details, and examples, are available elsewhere on this site, [here].
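To give a sense of what a rule over part-of-speech tags could look like, here is a minimal sketch. It does not reproduce any of the system's 30 rules or its 12 types: the coarse tag names (`DET`, `NOUN`, `ADJ`, `PUNCT`), the rule table `BOUNDARY_RULES`, the type labels and the example tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    form: str
    pos: str  # simplified coarse tag (assumed), e.g. "DET", "NOUN", "ADJ"


# Hypothetical rules: (tag ending a line, tag starting the next line) -> label.
BOUNDARY_RULES = {
    ("DET", "NOUN"): "det_noun",
    ("ADJ", "NOUN"): "adj_noun",
    ("NOUN", "ADJ"): "noun_adj",
}


def detect_enjambment(line: List[Token], next_line: List[Token]) -> Optional[str]:
    """Return a label if the tags on both sides of the line break match a rule."""
    if not line or not next_line:
        return None
    last, first = line[-1], next_line[0]
    if last.pos == "PUNCT":
        # Punctuation at the line end signals a pause; these rules skip it.
        return None
    return BOUNDARY_RULES.get((last.pos, first.pos))


def annotate(poem: List[List[Token]]) -> List[Tuple[int, str]]:
    """Return (1-based line number, label) for each line matched by a rule."""
    results = []
    for index in range(len(poem) - 1):
        label = detect_enjambment(poem[index], poem[index + 1])
        if label is not None:
            results.append((index + 1, label))
    return results


# Made-up example: a determiner ends line 1 and its noun opens line 2.
EXAMPLE_POEM = [
    [Token("veo", "VERB"), Token("la", "DET")],
    [Token("casa", "NOUN"), Token("blanca", "ADJ"), Token(".", "PUNCT")],
]

if __name__ == "__main__":
    print(annotate(EXAMPLE_POEM))
```

A rule that relies on constituency or dependency information would need a richer token representation than the one assumed here.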
The system outputs enjambment annotations in two formats.
Since both formats contain the poem ID and line ID together with the system's annotations, it is possible to derive other output formats from them (e.g. with the TEI Verse module).
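The following sketch suggests how such a derivation might go, using Python's standard `xml.etree.ElementTree`. The record layout (`LineAnnotation`, including its `text` field), the function name `to_tei` and the type label in the example are hypothetical, since the two output formats are not specified here. The sketch stores the annotation in the `enjamb` attribute that the TEI verse module provides for `<l>`; how the system's type labels should map onto that attribute's values is left open.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, NamedTuple, Optional


class LineAnnotation(NamedTuple):
    # Assumed record layout; the real output formats may differ.
    poem_id: str
    line_id: int
    text: str
    enjambment_type: Optional[str]  # None when no enjambment was detected


def to_tei(records: Iterable[LineAnnotation]) -> ET.Element:
    """Build one <lg> per poem and one <l> per line from annotation records."""
    root = ET.Element("div", {"type": "collection"})
    groups = {}
    for rec in records:
        lg = groups.get(rec.poem_id)
        if lg is None:
            lg = ET.SubElement(root, "lg", {"type": "poem", "n": rec.poem_id})
            groups[rec.poem_id] = lg
        attrs = {"n": str(rec.line_id)}
        if rec.enjambment_type is not None:
            attrs["enjamb"] = rec.enjambment_type
        line = ET.SubElement(lg, "l", attrs)
        line.text = rec.text
    return root


# Dummy records for illustration only.
EXAMPLE_RECORDS = [
    LineAnnotation("poem1", 1, "veo la", "det_noun"),
    LineAnnotation("poem1", 2, "casa blanca.", None),
]

if __name__ == "__main__":
    print(ET.tostring(to_tei(EXAMPLE_RECORDS), encoding="unicode"))
```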