Abstract
Stylometry is the linguistic discipline that evaluates an author's style by applying statistical analysis to a body of their work. In this project, you will perform a stylometric analysis to identify the authors behind a set of translated sentences, extending it with keystroke logs and other behavioral data collected during the translation process.
Description
For this project, a small corpus of 430 sentences taken from different Wikipedia sources has been translated from English into Italian by three professional translators, whom we will call subjects. Each sentence has been translated exactly once by each subject, in one of the following modalities:
Translation from scratch, in which the translator performs the translation without having access to anything but the source text in English.
Post-editing of a commercial MT system, in which the translator post-edits a machine translation of the original source text produced by the Google Translate API.
Post-editing of a multilingual research MT model, in which the translator post-edits a machine translation of the original source text produced by an mBART model fine-tuned for multilingual machine translation from English into 50 languages.
For each sentence, the three modalities are assigned to subjects at random, so that every sentence is translated exactly once in each modality.
PET, the platform used to carry out the translations, collects a fine-grained history of keystroke logs, edit times, and other information related to the translation process. This information has also been post-processed to extract useful aggregates, such as the types of edits performed by the translator while post-editing a sentence. The full list of features is provided on the dataset page in the Dataset Hub.
The data have already been split into a training set and a test set. The test set has been replicated into three variants, each masking respectively (i) the subject information, (ii) the translation modality information, and (iii) the temporal information, in order to prevent involuntary leakage of information at test time. Your main goal is to use the training set to select the features that are most effective in identifying a subject as the author of a translation in the test set lacking subject information, exploring both the linguistic and the behavioral information available in the training set. The analysis should not be limited to simply fitting a model on the provided scores.
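As a point of departure (not an end point), a subject classifier fit on a few behavioral features might look like the following sketch. The file name train.csv and all column names (edit_time, n_insertions, and so on) are hypothetical placeholders; substitute the actual features listed on the dataset card:

# Minimal baseline sketch: predict the subject from a few behavioral features.
# All column names below are hypothetical placeholders for the real features
# documented on the dataset card.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")  # hypothetical local export of the training split

features = ["edit_time", "n_insertions", "n_deletions", "pause_count"]  # placeholders
X, y = train[features], train["subject"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# Cross-validated accuracy on the training set only; the real evaluation
# targets the subject-masked test set.
print(cross_val_score(clf, X, y, cv=5).mean())

# Which features drive the prediction? Impurity-based importances give a
# first (imperfect) answer.
clf.fit(X, y)
for name, imp in sorted(zip(features, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")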
Ideas for research directions
[Challenge] A good amount of information can be found by analyzing the edits the translators performed on the errors of the MT systems. What are the most frequent errors made by the two MT systems that get corrected during post-editing? Annotate the data with POS/NER/morphological information (a spaCy annotation sketch follows this list). Does the addition of linguistic information extracted from the source, the target, and the edits help in identifying the subject?
Assess whether it is possible to predict the translation modality, using as target the test set in which the modality and related information are masked. Which features are the most relevant for predicting whether the translation was performed from scratch or from an existing machine translation? Is it easier when both post-editing settings are treated as a single category for the purpose of modality prediction? What properties are common to sentences for which it is hard to predict the modality?
Assess whether it is possible to accurately predict editing times from the available data, using the test set in which temporal information is masked (a minimal regression sketch also follows this list). Which features are the most relevant for predicting total edit times? Can the duration of pauses during editing also be predicted?
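For the annotation idea in the first direction above, a minimal sketch with spaCy's Italian pipeline could look like the following. It assumes the it_core_news_sm model has been installed (python -m spacy download it_core_news_sm), and the example sentence is invented; in practice you would iterate over the dataset's target column:

import spacy

# Load spaCy's small Italian pipeline (POS tagger, morphologizer, NER).
nlp = spacy.load("it_core_news_sm")

# Invented example target sentence.
doc = nlp("Il gatto dorme sulla sedia in cucina.")
for token in doc:
    print(token.text, token.pos_, token.morph, token.ent_type_ or "-")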
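For the edit-time direction, a correspondingly minimal regression sketch (again with hypothetical column names) might be:

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")  # hypothetical export, as above

# Placeholder predictors; temporal features must be excluded here, since
# they are masked in the corresponding test set.
features = ["src_len", "n_insertions", "n_deletions", "modality_id"]
X, y = train[features], train["edit_time"]

reg = GradientBoostingRegressor(random_state=0)
# scikit-learn negates error scores; flip the sign when reporting.
scores = cross_val_score(reg, X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())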
Materials
The folder containing all the data necessary for the project is provided on Nestor. Sharing the data on any public platform is strictly forbidden at the moment: if you work on GitHub, be sure to gitignore its contents. A HuggingFace dataset associated with the data is available on the Dataset Hub and contains the instructions to load the dataset locally from your downloaded folder.
Refer to the dataset card on the Dataset Hub for all information related to available features and an example from the dataset.
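The exact loading call is documented on the dataset card; under the assumption that the data ship with a loading script, a local load with the HuggingFace datasets library would look roughly like:

from datasets import load_dataset

# Both paths are placeholders: point them at the loading script and the
# data folder downloaded from Nestor, as described on the dataset card.
dataset = load_dataset("path/to/loading_script.py", data_dir="path/to/downloaded_folder")
print(dataset["train"][0])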
References
Toral, Antonio et al. "Post-editing Effort of a Novel With Statistical and Neural Machine Translation." Frontiers in Digital Humanities 5 (2018): 9.
Guerberof-Arenas, Ana et al. "The Impact of Post-editing and Machine Translation on Creativity and Reading Experience." arXiv abs/2101.06125 (2021).
Lee, Changsoo. "How do machine translators measure up to human literary translators in stylometric tests." Digital Scholarship in the Humanities (2021).
Eder, Maciej et al. "Stylometry with R: A Package for Computational Text Analysis." The R Journal 8 (2016): 107.
Freitag, Markus et al. "Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation." Transactions of the Association for Computational Linguistics 9 (2021): 1460-1474.