Text as Data
Image credit: Prof. Hall's photo of notes about data creation and cleaning, March 8, 2023.
Digital Humanities, as a field, has struggled with the Anglophone priorities of tool design and digitization (see Fiormonte 2012). Our work encountered the typical challenges: standard tools for the computational analysis of Italian do not transfer easily to pre-modern Italian because of its orthography and the evolution of the language. This was most notable when handling 'stop words', the frequent terms such as articles, conjunctions, and pronouns that often distract from the content-bearing language of character and plot descriptions. At the same time, quantitative comparisons allowed us to capture many aspects of the poem that are lost in translation.
For the purposes of our work, we used the stop word lists found in Voyant Tools for both English and Italian. For the Italian corpus, we added this list of stop words to improve upon the default.
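The effect of extending a default stop word list can be sketched in a few lines of Python. This is only an illustration, not the procedure Voyant Tools follows internally: both word sets below are tiny hypothetical stand-ins for the real Voyant default and for our additions, and the tokenizer simply splits on anything that is not a letter.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the real lists are the Voyant Tools default
# and the project's own additions, neither of which is reproduced here.
DEFAULT_ITALIAN_STOPWORDS = {"il", "la", "le", "i", "gli", "e", "di", "che"}
PREMODERN_ADDITIONS = {"l", "et", "de", "ch"}  # illustrative only


def content_word_counts(text, stopwords):
    """Count the lowercase word tokens in text that are not stop words."""
    # Runs of letters only, so "l'arme" yields the tokens "l" and "arme".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


sample = "Le donne, i cavallier, l'arme, gli amori"
extended = DEFAULT_ITALIAN_STOPWORDS | PREMODERN_ADDITIONS

with_default = content_word_counts(sample, DEFAULT_ITALIAN_STOPWORDS)
with_extended = content_word_counts(sample, extended)
```

With the default set alone, the elided article "l" survives as a token; the extended set removes it and leaves only the content words of the line.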
Full Text
English
The translation of Lodovico Ariosto’s Orlando furioso used for the subsequent English corpus analysis was done by William Stewart Rose and released in 1910 in London. This version of the epic poem is accessible via Project Gutenberg and hypertextualised with wordlists and concordances via IntraText. (Importantly, IntraText does not provide KWIC information for some of the words that were most important in our analysis - see Ariosto's Asides - requiring a workaround.) Ariosto personally superintended three editions of the Furioso, in 1516, 1521 and 1532, when he added the last six canti which now define this 46-canto epic poem as we know it (Gardner 1906, 288). The 1910 English text on Project Gutenberg follows the 1532 Italian revision. Like other popular translations by David R. Slavitt, Barbara Reynolds, and A. S. Kline, the Rose translation used in this corpus includes 46 canti. Ariosto had, however, started an expansion of the poem (Cinque Canti), which was left unfinished at his death in 1533 and thus remains outside of this data set. We chose William Stewart Rose’s translation due to its demonstrated effort to stay close to the original and its wide availability as hypertextualised text.
Rose's translation was not without its problems. Certain lines seem to have been inadvertently skipped during digitization, in octaves V.68, V.69, XIII.7, XIV.131, XV.97, XVI.60, XIX.76, XX.133, XXVII.14, XXX.81, XXXI.75, XXXIII.105, XXXVII.44, XXXVIII.39, XLII.77, XLIII.21, XLIII.35, and XLIII.185. Others were deliberately censored by Rose for sexual content: VIII.49-50, XXV.65-90, XXVIII.43, XXVIII.54, XXVIII.61, XXVIII.62, XXVIII.70, XXVIII.79-83. Professor Hall supplied translations for the missing lines and octaves in consultation with the Reynolds translation and the Italian critical editions. (Documentation)
Our final version of the English lines of the poem can be consulted at: OF English lines.
Italian
The Italian text is the one provided by the Biblioteca italiana (BIBIT), transcribed by Daniela Amicucci (CIBIT-Roma) in 2007 from a copy of the 1532 edition of the Orlando furioso held by the University of Bologna: Orlando furioso - 1532. The associated XML file indicates that octave numbers were added. We removed the "Argomento" from each canto (8 lines that summarize the events) in order to achieve equivalency with the English text that did not have these introductory octaves. We also needed to consult a digitization of the print volume and the critical edition by Lanfranco Caretti (Turin: Einaudi, 1992) to identify missing lines that were inadvertently skipped during transcription of the following octaves: II.41, VI.20, VI.62, VII.75, X.29, XII.27, XIII.12, XIII.15. (Documentation)
Our final version of the Italian lines of the poem can be consulted at: OF Italian lines.
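A minimal sketch of the two clean-up steps described above, assuming each canto is held as a plain list of verse lines with the Argomento as its first eight entries. The function names are hypothetical, and this is not a record of how the transcription was actually processed. Because every octave has eight lines, a canto whose line count is not a multiple of eight must be missing at least one line.

```python
OCTAVE_LENGTH = 8     # lines per octave
ARGOMENTO_LENGTH = 8  # summary lines at the head of each canto (assumed position)


def strip_argomento(canto_lines):
    """Return the canto's lines without the leading Argomento."""
    return canto_lines[ARGOMENTO_LENGTH:]


def lines_short_of_full_octaves(canto_lines):
    """Lines needed to round the canto up to whole octaves (0 if complete)."""
    return -len(canto_lines) % OCTAVE_LENGTH


# Toy canto: an 8-line Argomento followed by two octaves, one line of which is lost.
toy_canto = [f"argomento {n}" for n in range(1, 9)]
toy_canto += [f"verse {n}" for n in range(1, 16)]

body = strip_argomento(toy_canto)
shortfall = lines_short_of_full_octaves(body)
```

A check of this kind only flags that a canto is short. It cannot say which octave is affected, and it misses a canto from which exactly eight lines have been lost, so comparison against the print volume and the critical edition remains necessary.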
For both texts
All font formatting (e.g. italics) has been removed from the text.
No visualizations are included.
Footnotes and editorial apparatus have been removed.
Open Questions
Is the data sufficiently accurate to support the study of enjambment from line to line and the "spilling over" of ideas from octave to octave?
Indices
Barbara Reynolds' translation of the Orlando furioso (London: Penguin, 1975; 2 vols.) has a comprehensive, although not exhaustive, index of people and places mentioned or implied in Ariosto's verses. The regular formatting of the index entries suggested that converting this information into a spreadsheet automatically would be relatively straightforward. Nonetheless, the Optical Character Recognition (OCR) process stumbled on Roman numerals and punctuation. As a team, we cleaned the data by hand, dedicating some class time to it so that we could co-create rules for handling anomalies. We also added metadata columns: Type, PlaceLevel, PlaceNotes, NondefinitiveEthnicity, PersonGender.
These are described in the "DataBiography" tab of the following spreadsheets:
Index Data (represents Reynolds' index with event descriptions and octave ranges; descriptions were not systematically edited for typos)
Expanded Data (every octave range is expressed with one row per octave in the range; descriptions were not systematically edited for typos)
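The step from the Index Data to the Expanded Data can be illustrated as follows. This is a sketch under stated assumptions rather than the process we used: it assumes references written in the canto.octave notation used elsewhere on this page (for example "XXVIII.79-83"), with a plain hyphen for ranges, and the function names are hypothetical.

```python
# Numerals needed for canto numbers up to XLVI (46).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XLIII' to an integer."""
    total = 0
    previous = 0
    # Read right to left: a smaller value before a larger one is subtracted.
    for character in reversed(numeral.upper()):
        value = ROMAN_VALUES[character]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def expand_reference(reference):
    """Expand 'XXVIII.79-83' into one (canto, octave) pair per octave."""
    canto, _, octaves = reference.partition(".")
    start, _, end = octaves.partition("-")
    first = int(start)
    last = int(end) if end else first  # a single octave has no range end
    canto_number = roman_to_int(canto.strip())
    return [(canto_number, octave) for octave in range(first, last + 1)]


rows = expand_reference("XXVIII.79-83")
single = expand_reference("V.68")
```

Each pair would become one spreadsheet row carrying the same name and description as the original index entry. Input that OCR has damaged (for instance a misread numeral) raises an error here instead of being silently guessed at, which fits a workflow in which anomalies are resolved by hand.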
Limitations of the Reynolds Index
Barbara Reynolds' index of character names and places provided the basis for much of our digital analysis of the Furioso. We used OCR on the Reynolds Index to build a dataset of the names, ethnicity groups, and gender assigned to each character in the poem. While this dataset is immensely useful for modeling the trends in character identities throughout the poem, the index has some important limitations.
First, the index is not comprehensive. We located several characters and places that were not included. While we did not have time to identify every character or place that Reynolds missed, the fact that some were not coded undermines the legitimacy of the dataset - the most we can model right now is Reynolds' (or her indexer's) reading of the poem. Reynolds also provides a long list of minor characters that are not indexed, but that are essential to the unfolding of the poem and could change any of the results that we report. These include servants, shepherds, hermits, etc.
Secondly, Reynolds’ characterization of gender and ethnicity groups was based on her understanding of the poem and might not accurately reflect the ethnicity or gender that Ariosto intended to assign to each character. While there is good reason to believe Reynolds accurately characterized the majority of characters, the legitimacy of her analysis is slightly diminished by the fact that she did not write the Furioso. This same limitation applies to our efforts to fill in the blanks for characters that were not included in the summary lists of Christians, Saracens, Pagans, etc. in the translation. The most accurate account of character ethnicities and genders would have been done by Ariosto himself.
Finally, Reynolds also included some 'implied' characters in the index: ones who are not directly mentioned in the poem but whom Reynolds inferred Ariosto to be implying. For example, Reynolds puts the character 'Sichaeus' in the index as implied in Canto 35, Octave 28. Below is the octave in question:
what fame eliza she so chaste of sprite
on the other hand has left behind her hear
who widely is a wanton baggage hight
solely that she to maro was not dear
marvel not this should cause me sore despite
and if my speech diffusive should appear
authors i love and pay the debt i owe
speaking their praise an author i below
As we can see, Sichaeus is not explicitly mentioned in the octave, but he was Dido's husband and her (in)fidelity is the subject of these lines. By relying on Reynolds’ inclusion of implied characters, we are assuming that Reynolds correctly identified the characters that Ariosto was implicitly referring to. This assumption slightly decreases the validity of the resulting quantitative and digital models, because it is impossible to verify with 100% certainty that all implied characters are indexed.
More Data Needed
As we worked through our reading of the text and the computational analyses, we quickly realized that we needed more and better data. Among these were data related to:
point of view and narration
poetic style
literary references
interactions rather than co-mentions in an octave
episode lengths rather than mentions in the index