Enjambment Detection System

The system has three components: a preprocessing module to format input poems uniformly, an NLP pipeline, and the enjambment-detection module itself. A diagram below depicts the workflow.

System Workflow Diagram

Preprocessing

The poems were originally in HTML. This module removes the HTML markup and identifies the content that belongs to the poem (i.e. the two quatrains and the two tercets in the case of a sonnet).
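As a rough illustration of this step, the sketch below strips markup with Python's standard `html.parser` and groups the remaining text into stanzas. It is not the project's actual preprocessing code: the assumption that verse lines are separated by `<br>` and stanzas are wrapped in `<p>` elements is ours, and the names `PoemTextExtractor`, `extract_stanzas` and `looks_like_sonnet` are hypothetical.

```python
from html.parser import HTMLParser


class PoemTextExtractor(HTMLParser):
    """Collects verse lines; assumes <br> ends a line and </p> ends a stanza."""

    def __init__(self):
        super().__init__()
        self.lines = []   # finished lines; "" marks a stanza break
        self._buf = []    # text fragments of the line being read

    def flush(self):
        # Join fragments and normalise whitespace; skip empty lines.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text:
            self.lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.flush()

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            if self.lines and self.lines[-1] != "":
                self.lines.append("")

    def handle_data(self, data):
        self._buf.append(data)


def extract_stanzas(html):
    """Returns a list of stanzas, each a list of verse lines."""
    parser = PoemTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    stanzas, current = [], []
    for line in parser.lines:
        if line == "":
            if current:
                stanzas.append(current)
                current = []
        else:
            current.append(line)
    if current:
        stanzas.append(current)
    return stanzas


def looks_like_sonnet(stanzas):
    """True for two quatrains followed by two tercets."""
    return [len(s) for s in stanzas] == [4, 4, 3, 3]
```

A call such as `extract_stanzas(page_source)` would return the stanzas as lists of strings, which `looks_like_sonnet` can then check against the expected 4-4-3-3 layout.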

Natural Language Processing Toolkit

The NLP pipeline is IXA Pipes (Agerri et al., 2014). Its results for contemporary Spanish are competitive.

We used the Spanish modules provided with this library, which were trained on the AnCora corpus for part-of-speech tagging, constituency parsing, and syntactic dependency parsing. Dependency parsing in this library is based on MATE tools.

Our system uses this NLP pipeline to obtain part-of-speech tags, syntactic constituents (e.g. verb phrase, noun phrase) and syntactic dependencies (e.g. direct object).

Enjambment detection module

The enjambment detection module is rule- and dictionary-based, and exploits the information provided by the NLP pipeline. A total of 30 rules of different kinds identify enjambed lines and assign each of them one of 12 types, following the typology available at this link.

Some rules are very shallow and only take parts of speech into account.
Some rules additionally exploit constituency information.
Some rules use dependency information, e.g. to detect subject / object / verb relations.
For any type of rule, custom dictionaries can restrict rule application to a set of terms. E.g. certain verbs govern arguments introduced by one specific preposition; we itemized these verbs and their prepositions in a dictionary, to complement information provided by the NLP pipeline or correct parsing errors.
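To make the rule kinds above more concrete, here is a minimal sketch of one shallow part-of-speech rule and one dictionary-restricted rule. None of it is taken from the system itself: the `Token` structure, the tag names (`DET`, `NOUN`, `VERB`), the type labels (`det_noun`, `verb_prep`) and the two dictionary entries are illustrative assumptions, and the real rules and the typology's 12 types are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    pos: str  # coarse part-of-speech label (tag names are illustrative)


# Hypothetical dictionary: verb lemma -> preposition that introduces its argument.
GOVERNED_PREPOSITIONS = {"depender": "de", "confiar": "en"}


def det_noun_rule(line, next_line):
    """Shallow rule: a line ends in a determiner and the next starts with a noun."""
    if line[-1].pos == "DET" and next_line[0].pos == "NOUN":
        return "det_noun"
    return None


def verb_preposition_rule(line, next_line):
    """Dictionary rule: a verb is split from the preposition it governs."""
    last, first = line[-1], next_line[0]
    if last.pos == "VERB" and GOVERNED_PREPOSITIONS.get(last.lemma) == first.lemma:
        return "verb_prep"
    return None


RULES = [det_noun_rule, verb_preposition_rule]


def detect(lines):
    """Maps the index of each enjambed line to the rule types that fired on it."""
    found = {}
    for i in range(len(lines) - 1):
        if not lines[i] or not lines[i + 1]:
            continue  # skip empty lines
        types = []
        for rule in RULES:
            label = rule(lines[i], lines[i + 1])
            if label is not None:
                types.append(label)
        if types:
            found[i] = types
    return found
```

Keeping each rule as a small function over a pair of adjacent lines makes it easy to mix shallow rules with ones that consult a dictionary, and a rule that needs constituency or dependency information could follow the same shape with richer token attributes.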

Output formatters

A basic description follows. More details, and examples, are available elsewhere on this site, [here].

The system outputs enjambment annotations in two formats.

An inline-annotation delimited format that shows the poem ID and line ID, each word in the line of verse with its part of speech and term ID, and finally the enjambment types found for the line. This format is informative for human experts. The term ID makes it possible to look up constituency and dependency information for each word in the NLP outputs.
A standoff annotation delimited format, where you see a line's ID and the enjambment types found for that line. This format is convenient for automatic evaluation of the system against the manually annotated reference, with our evaluation scripts.
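The two formats could be produced along the following lines. This is only a sketch under stated assumptions: the tab delimiter, the column order, the `word/pos/term-id` token notation and the function names are hypothetical, and the system's actual layout is documented on the page referred to above.

```python
def format_inline(poem_id, line_id, tokens, types):
    """One inline row: IDs, then word/pos/term-id triples, then enjambment types.

    `tokens` is a list of (form, pos, term_id) tuples; the layout is assumed.
    """
    words = " ".join(f"{form}/{pos}/{term_id}" for form, pos, term_id in tokens)
    return "\t".join([poem_id, str(line_id), words, ",".join(types)])


def format_standoff(poem_id, annotations):
    """One standoff row per annotated line: IDs and enjambment types only.

    `annotations` maps a line ID to the list of types found for that line.
    """
    rows = []
    for line_id in sorted(annotations):
        rows.append("\t".join([poem_id, str(line_id), ",".join(annotations[line_id])]))
    return "\n".join(rows)


def parse_standoff(text):
    """Reads standoff rows back into {(poem_id, line_id): set of types}."""
    result = {}
    for row in text.splitlines():
        poem_id, line_id, types = row.split("\t")
        result[(poem_id, int(line_id))] = set(types.split(","))
    return result
```

Parsing the standoff rows back into a dictionary keyed by poem and line ID gives a structure that can be compared directly with a reference annotation in the same shape.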

Since these formats contain a poem's and line's ID together with the system's annotations, it is possible to derive other output formats (e.g. with the TEI Verse module) from them.
