TDA - project

Summer Project: Translation Data Analytics (TDA)

From July 13 to August 15, 2014, the CRITT plans to conduct a Translation Data Analytics (TDA) project.

The aim of the TDA project is to explore and analyse user activity data collected in advanced man-machine communication situations. The TDA project will assess and elaborate methods to produce data-driven user profiles, to investigate differences in communication styles, and to identify patterns of user behavior for more and less successful man-machine communication. Data repositories such as the CRITT TPR-DB will be taken as a basis to analyze concrete and specialized forms of professional man-machine communication, such as translators' behaviour in advanced computer-assisted collaborative production environments.

An introductory PhD summer course on Translation Process Research (TPR) and a one-day workshop will precede the TDA project, in which students get acquainted with the peculiarities of the TPR data and data acquisition methods. The TDA project will use exploratory statistical approaches to discover new features and dependencies in the TPR data and to formulate hypotheses about the causal relations. Statistical hypothesis tests will be deployed for confirming or falsifying existing hypotheses.

End-of-Project Presentations

After four weeks of intense work on the TDA project, we will present preliminary results during four morning sessions from August 11 to August 14 in the Mødehuset, Spejdercenter Holmen.

Monday 11/08/2014:
     9:30 - 10:15 Annegret Sturm, Bergljot Behrens, Moritz Schaeffer, Arndt Heilmann, Maheshwar Ghankot: Syntactic entropy
    10:15 - 11:00 Tanik Saikh: Predicting source text gaze fixation durations - a machine learning approach

Tuesday 12/08/2014:
     9:30 - 10:15 Karan Singla, David Orrego Carmona, Ashleigh Gonzales: Predicting translator behaviour
    10:15 - 11:00 Akshay Minocha, Alena Konina: Segmentation in translation: analysis and visualisation

Wednesday 13/08/2014:
     9:30 - 10:15 Tatiana Fedorchenko, Arlene Koglin, Bartolomé Mesa-Lao, Mercedes García Martinez, Julián Zapata: CasMaCat Field Trial data analysis 2014
    10:15 - 11:00 Andreas Søeborg Kirkedal, Dipti Pandey, Julián Zapata: ASR systems - Speech recognition and translation/post-editing

Thursday 14/08/2014:
     9:30 - 10:15 Pintu Lohar, Ambarish Jadhav: Correlation between human translation entropy and machine translation entropy
    10:15 - 11:00 Joke Daems: The usage of external resources in post-editing vs. human translation

Research outputs

Samuel Läubli:

  • Statistical Modelling of Human Translation Processes (PDF file)

Annegret Sturm, Bergljot Behrens, Moritz Schaeffer, Arndt Heilmann, Maheshwar Ghankot:

  • Syntactic entropy

Abstract: The present study investigates whether the co-activation of both source and target language has an influence on the translator's behaviour. One way to measure co-activation is to compare how the entropy of different syntactic realizations in the target language affects gaze time and production duration during translation and post-editing. We measure syntactic choice in terms of entropy, which quantifies the distribution of different translation realizations of a given source segment. High entropy indicates the effort of selecting the final translation from among many available realizations. We investigate the impact of syntactic entropy on cognitive effort. In a first step, as research by Jensen et al. (2010) suggests, source text segments which need reordering yield longer gaze times than segments which do not need reordering. In a second step, based on the assumption that syntax is shared across languages (Hartsuiker & Pickering 2004), a recently activated syntactic structure is likely to influence subsequent processing, thus “priming” it. Low syntactic entropy of translation choices could, thus, be taken to be a sign of priming. To test the hypothesis that syntactic choices have an influence on cognitive effort, we compared four datasets comprising translation and post-editing data of the same English source texts translated into Danish, German, Spanish and Hindi. Data was manually annotated for syntactic structure along three relevant features: clause type, valency and voice. Our analyses reveal a positive correlation between the syntactic entropy of translation realizations and both gaze time and production time in all four languages. However, no effect of syntactic entropy could be detected in the post-editing data, suggesting that the post-editors were primed by the MT output.
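
As a rough illustration of the entropy measure described in this abstract, the sketch below computes the Shannon entropy of the distribution of syntactic realizations observed for one source segment. The function name `realization_entropy` and the labels are hypothetical and are not taken from the study, whose actual annotation combined clause type, valency and voice.

```python
import math
from collections import Counter

def realization_entropy(realizations):
    """Shannon entropy (in bits) of the observed realization distribution."""
    counts = Counter(realizations)
    total = sum(counts.values())
    # p * log2(1/p) summed over all distinct realizations
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical annotations: one label per translator for the same source segment.
all_same = ["main-active"] * 4
all_diff = ["main-active", "main-passive", "sub-active", "sub-passive"]

print(realization_entropy(all_same))  # 0.0: every translator chose the same structure
print(realization_entropy(all_diff))  # 2.0: four equally frequent structures
```

A segment on which all translators agree has zero entropy, while a segment with many equally frequent realizations has high entropy.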

Tanik Saikh:

  • Predicting source text gaze fixation durations - a machine learning approach

Abstract: I tried to predict gaze fixation duration on the source text using supervised learning. GazeS represents the sum total of all fixations on a particular source text word during the whole task; it summarizes the visual attention that word has received. The work can be described as building machine learning models that predict gaze fixation duration on the source text from lexical, syntactic and semantic information. For this experiment I used the TPR-DB (KTHJ08) dataset, which has been prepared and is maintained by CRITT, CBS. Different combinations of features were extracted from the dataset and different models were built. The dependent variable was gaze on the source text, and the independent variables were the different features. The models were trained with a support vector machine and achieved different classification accuracies. So far I get close to 50% classification accuracy against a baseline of 25%. Across the several sets of experiments, my finding is that unigram and bigram frequency of the source word, word length, translator identity, perplexity, lexical entropy, syntactic entropy, word polysemy and SuperTag are very good predictors of gaze on the source text.

Karan Singla, David Orrego Carmona, Ashleigh Gonzales:

  • Predicting translator behaviour

Abstract: Computer-assisted translation remains a progressive field of research, and there is an ever-growing interest in providing translators and post-editors with better software tools to facilitate their work and increase productivity. The purpose of the current investigation is to predict post-editor profiles based on user behaviour and demographics using novel machine learning techniques to gain a better understanding of post-editor styles. Our study extracts process unit features from the CasMaCat LS14 study in the CRITT Translation Process Research Database (TPR-DB). The analysis has two main research goals: we create n-gram models based on user activity and part-of-speech sequences to automatically cluster post-editors, and we use discriminative classifier models to characterize post-editors based on a diverse range of translation process features. Our results allow for more targeted consideration of requirements unique to post-editors in the user interface design of a translator’s workbench.

Akshay Minocha, Alena Konina:

  • Segmentation in translation: analysis and visualisation

Abstract: The analysis of translation process data depends on many factors, and the TPR-DB holds a lot of untapped potential for this research field, some of which we explored in this work. Our aim is to identify the areas of interest for translators while they translate a source text into a target text. This research shows, and promises to predict, the parts of a sentence on which the translator's attention rests for a specific time while translating a chunk. This would tell us which parts of a sentence are more difficult to understand and how the patterns change across different translators and across studies involving many target languages. It would also allow us to predict the look-ahead buffer, which would help translation systems and other machine translation and speech recognition tools by focussing their interest on this particular area.

Tatiana Fedorchenko, Arlene Koglin, Bartolomé Mesa-Lao, Mercedes García Martinez, Julián Zapata:

  • CasMaCat Field Trials 2014

Abstract: Our group looked into the user activity data collected in two different studies, i.e. LS14 and CFT14, in the CRITT Translation Process Research Database. Both datasets were collected in the framework of the CasMaCat project. LS14 is a longitudinal study involving 5 post-editors working with interactive machine translation over a period of six weeks. The aim was to find out whether they became faster working with interactivity as they became acquainted with this type of assistive technology. Results show that participants became faster over time after working with interactivity during the post-editing process (week 1 - week 5). Baseline: traditional post-editing vs. interactive machine translation. However, there was an unexpected increase in time in week 6 that could be explained by the quality of the MT provided during this last week. The CFT14 dataset originates in the third CasMaCat field trial. The aim of this study was to explore the benefits of working with interactive machine translation combined with online learning techniques for post-editing purposes. Baseline: traditional post-editing vs. interactive machine translation with online learning. As expected, results show that working with online learning techniques made the post-editing process faster, but only when the time used by the post-editors to make Internet searches is not taken into account. Our analyses make clear that productivity metrics in terms of overall time to complete the task might not be a good indicator of performance when the post-editor needs to conduct Internet searches in order to verify the quality of the MT provided.

Andreas Søeborg Kirkedal, Dipti Pandey, Julián Zapata:

  • ASR systems - Speech recognition and translation/post-editing

Abstract: This work is divided into two parts. First, we present a pilot study that investigates the usefulness of automatic speech recognition (ASR) as an input method for translation and post-editing in a native or a foreign language. The pilot study uses different hardware and software configurations as well as Spanish native and English and French non-native speakers in the experiments. The experiments used a monolingual dictation setting (i.e., the mental translation process was not involved). The pilot study concludes that not only native speakers but also non-native speakers with a lower recognition accuracy could benefit from the use of the speech modality in translation and post-editing. Even after 2 iterations of dictation with commands and manual post-editing of ASR output, the participants still showed improvements with respect to time duration. This leads to a tentative lower bound on the recognition accuracy for useful ASR in dictation settings.
The second part presents work on creating baselines and infrastructure for the development of ASR systems for Hindi. The goal is to create a baseline ASR system and infrastructure that makes it possible to iteratively improve ASR systems by adding more data and experimenting with settings and parameters. Additionally, the focus is on using free and open source software so the infrastructure can be shared between research institutions.

Pintu Lohar, Ambarish Jadhav:

  • Correlation between human translation entropy and machine translation entropy

Joke Daems:

  • The usage of external resources in post-editing vs. human translation

Abstract: My main goal was to clean up my data and to get it into an analyzable format. I worked on data which I gathered as part of my PhD project: 72 translation and post-editing sessions from master's students of translation logged with Casmacat, Inputlog and an EyeLink eyetracker. The main challenges were merging the EyeLink and Casmacat data (which was successful), and merging the Inputlog and Casmacat data (which turned out to be harder than anticipated). Other than preparing the data, I also looked into the usage of external resources during translation and post-editing. I found that concordancers and dictionaries were the most frequently used tools, along with Google search (which was often a way to get to other sources rather than a source in itself). Both concordancers and dictionaries were used more frequently when translating than when post-editing, though closer inspection revealed that this also depends on the participant as well as the text. More research is needed to identify exactly why these participants or these texts differ with regard to the external resources consulted. Extra factors that will be taken into account are the time spent looking at external resources, meta-data relating to participants' attitude towards machine translation and post-editing, the type of search queries, text topics, structure and vocabulary of the source text, and quality of the machine translation. In a later stage, information on external resources will be linked to the final translation quality and productivity of the participants.

TDA Challenges

The idea of the TDA project is to approach the CRITT TPR-DB from different possible angles. In order to do so, we have formulated a few challenges. Participants are invited to add further challenges according to their interests, or to extend/modify the list below:

  • Exploratory approach to translation process data: Under this aspect we investigate the latent relations of variables in translation production as coded and accessible through the TPR-DB. The goal is to detect, characterize and classify behavioral patterns, to uncover different styles of reading, writing, translation and post-editing, and to describe what constitutes more or less successful human-machine interaction. Given typing activity, eye movement data, textual information and metadata of translators, the exploratory challenge proposes to discover:

    • Translator expertise: What characterizes an expert translator and what does translator expertise consist of? Which features are indicative of whether the translator is native in the source or target language? Do novice and experienced translators prefer different kinds of translation assistance? Is post-editing of machine translation equally helpful for L1 and L2 translators, for novices and for experienced translators?
    • Translation styles: What are styles of translation and how are they different from post-editing styles? Are there fundamentally different behavioural patterns in translation and in post-editing, or are these basically similar types of activity? Are the mental states and representations comparable during translation and post-editing or are they completely different? Is it possible for monolinguals to translate or to post-edit machine translation output and how is that different from bi- (or multi-)lingual translators?
    • Productivity information: How fast do translators translate / post-edit? How long are 'normal' production pauses and editing 'bursts'? What textual characteristics and/or translator styles determine the length of pauses and sequences of typing activity? Are there combinations that are more (or less) effective for translation and for post-editing? What text characteristics cause translation difficulties? What impact does the quality of the MT output have on human post-editing productivity?
    • Segment information: How do translators mentally segment texts when translating and when post-editing? Do they work with translation units of similar size? How much context is needed for a translator to produce, and for a post-editor to change, a translation segment? What types of segments are cognitively most demanding for translation and for post-editing?
    • Gaze information: How often and how long do translators look at ST and TT during translation and during post-editing? When do translators/post-editors gaze at source and when at the target text segments? Is there different gazing behaviour for segments that are easier to translate than for segments that are more difficult? To what extent does gaze behaviour inform us about translation problems?
    • Cross-lingual comparison: What is the impact of translating into closer languages (e.g. English-Danish) or quite different languages (e.g. English-Hindi)? How do gaze and production patterns change for different types of languages? Is it easier to post-edit more different languages or more related languages?

In order to investigate these and similar questions, we will examine the TPR-DB, which consists of richly annotated reading, writing, translation and post-editing data. Besides gaze, keystroke and word alignment data, some parts of the data also contain quality information from reviewers and evaluators, as well as meta information about the translator/post-editor experience.
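
To make the notions of production pauses and typing bursts from the productivity questions concrete, here is a minimal sketch that splits a sequence of keystroke timestamps into bursts wherever the gap between two keystrokes exceeds a threshold. The 1000 ms threshold, the function name `split_into_bursts` and the sample timestamps are assumptions for illustration only; they are not values or formats prescribed by the TPR-DB.

```python
def split_into_bursts(timestamps_ms, pause_threshold_ms=1000):
    """Group sorted keystroke times into bursts separated by long pauses."""
    bursts = []
    for t in timestamps_ms:
        if bursts and t - bursts[-1][-1] <= pause_threshold_ms:
            bursts[-1].append(t)   # short gap: continue the current burst
        else:
            bursts.append([t])     # long gap (or first key): start a new burst
    return bursts

# Hypothetical keystroke times in milliseconds since the start of a session.
keystrokes = [0, 150, 320, 2500, 2650, 6000]
bursts = split_into_bursts(keystrokes)
# A pause is the gap between the end of one burst and the start of the next.
pauses = [nxt[0] - cur[-1] for cur, nxt in zip(bursts, bursts[1:])]

print(bursts)  # [[0, 150, 320], [2500, 2650], [6000]]
print(pauses)  # [2180, 3350]
```

Varying the threshold changes what counts as a pause, which is itself one of the open questions listed above.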

  • Gaze-to-word mapping: Due to free head movements, changing light conditions, etc., the gaze data that we obtain from the eye-trackers may be very noisy, so that the mapping of the physical fixation location on the screen (X/Y position) onto the symbols actually gazed at is often distorted. This task investigates possibilities to rectify the gaze path over the text post hoc so as to obtain the most likely sequence of fixated words during a session. We propose two alternative problem formulations:
  1. In the first formulation, participants need to select a likely reading path through a confusion network of possible gaze-to-word mappings. Successive nodes may be connected by up to 5 different edges that correspond to hypotheses of gaze-to-word mappings.
  2. In the second formulation, only the coordinates of the fixations and the coordinates of words are provided, and participants need to map every fixation onto its likely intended word.
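
For the second formulation, a naive baseline is to assign each fixation to the word whose centre is closest in Euclidean distance. The sketch below assumes hypothetical input formats (fixations as `(x, y)` tuples and words as `(token, x, y)` centre coordinates); the data format actually provided for the task may differ, and a serious approach would also need to correct for systematic drift.

```python
import math

def map_fixations_to_words(fixations, words):
    """Return, for each fixation, the index of the nearest word centre."""
    mapping = []
    for fx, fy in fixations:
        nearest = min(
            range(len(words)),
            key=lambda i: math.hypot(fx - words[i][1], fy - words[i][2]),
        )
        mapping.append(nearest)
    return mapping

# Hypothetical screen layout: (token, centre x, centre y) in pixels.
words = [("The", 50, 100), ("cat", 110, 100), ("sat", 50, 140)]
fixations = [(55, 104), (105, 95), (60, 133)]

print(map_fixations_to_words(fixations, words))  # [0, 1, 2]
```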

Information from the TPR-DB will be used for this task and a set of gold standard gaze-to-word mappings will be provided for training and evaluation. Success of different approaches will be evaluated by comparing the predicted gaze paths and word mappings to the gold standard using the Levenshtein distance, or Precision and Recall scores.
Substantial advancements in gaze-to-word mapping would ease analysis of user behavior in unconstrained tasks such as translation or natural reading behavior.
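
The Levenshtein-based comparison mentioned above could be sketched as follows, assuming that a gaze path is represented as a list of word indices. This is the standard dynamic-programming edit distance; the function name and the sample paths are hypothetical, and the exact scoring used for the task is not specified here.

```python
def levenshtein(predicted, gold):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(gold) + 1))
    for i, p in enumerate(predicted, start=1):
        current = [i]
        for j, g in enumerate(gold, start=1):
            current.append(min(
                previous[j] + 1,             # delete p from the prediction
                current[j - 1] + 1,          # insert g into the prediction
                previous[j - 1] + (p != g),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

# Hypothetical gaze paths as sequences of fixated word indices.
gold_path = [0, 1, 2, 3, 2, 4]
predicted_path = [0, 1, 3, 2, 4]

print(levenshtein(predicted_path, gold_path))  # 1: one fixation is missing
```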

  • Keystroke prediction: Participants need to build prediction mechanisms that anticipate the next keystrokes that human translators are going to produce. Thus, given at time t:
    1. the source text
    2. the target text produced up to time t
    3. a translation memory with word alignments
    4. the sequence of fixations on the source and the target texts up to time t
    5. the sequence of keystrokes that produced the target text up to time t

the task is to predict the next keystroke (at time t+1) that the translator or post-editor is going to produce. Success of participants will be measured by comparing the identity and the time of the predicted keystrokes with those of the actual keystrokes produced by human translators.
Advancements in keystroke prediction would increase the anticipation power of machines and the efficiency of human-computer interaction.
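
As a deliberately simple point of reference for this challenge, the sketch below predicts the next character from the last two typed characters, using counts collected from previously seen target text. It ignores the source text, translation memory, gaze data and keystroke timing listed above, and the class name `NgramKeystrokePredictor` and the training string are hypothetical.

```python
from collections import Counter, defaultdict

class NgramKeystrokePredictor:
    """Predict the next character from the preceding `order` characters."""

    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(Counter)

    def train(self, text):
        # Count which character follows each context of length `order`.
        for i in range(self.order, len(text)):
            context = text[i - self.order:i]
            self.counts[context][text[i]] += 1

    def predict(self, typed_so_far):
        context = typed_so_far[-self.order:]
        candidates = self.counts.get(context)
        if not candidates:
            return None  # unseen context: no prediction
        return candidates.most_common(1)[0][0]

predictor = NgramKeystrokePredictor(order=2)
predictor.train("the translator translates the text")

print(predictor.predict("the tr"))  # a
print(predictor.predict("xyz"))     # None
```

Any real entry to the challenge would have to condition on far more than the typed characters, but a baseline of this kind gives a floor against which richer models can be compared.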


Location: The TDA project will take place in the Spejdercenter Holmen, Arsenalvej 10, 1436, København, Denmark.

Accommodation: Dormitory accommodation for up to 10 students can be provided in the Spejdercenter Holmen.

Structure: According to the interests of each participant, we will define small tasks and sub-groups to work on a focussed sub-project and discuss its relation to the TDA project on a daily basis. As in last year's SEECAT project, we'll have lectures and discussion sessions in the mornings and team work in the afternoons.

Application: Participation is free of charge. A number of grants for accommodation and travel are available for students and early career researchers. Applications (cover letter + full CV) must be sent to Michel Carl before March 15, 2014.


Barto Mesa,
28 Feb 2015, 13:56