Automating translation by determining text difficulty

Building a model to determine the features of text which make it more difficult for machine translation

Project status



One of the interesting unresolved problems in translation practice is determining how difficult a text, or a fragment of it, is for machine translation (MT). Starting from the source text, its raw MT output, the final published version and the time taken to translate the text, this project aims to build a model that identifies the features of the source text which make it more difficult for MT. The work will therefore investigate the link between MT and human translators, which remains an under-researched area.

Explaining the science

The proportion of linguistic features in a text, such as time adverbials, noun phrases or verbs in the past tense, can be used as an initial estimate of its difficulty. Various regression methods were used to predict the rate at which documents were translated, based on these linguistic features. To improve on this feature engineering approach, Facebook’s open-source cross-lingual language model (XLM), a neural network, was used to produce sentence embeddings. XLM takes a sentence as input and outputs a vector representation learned from a training objective such as translation.
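The feature-proportion idea can be illustrated with a minimal sketch. It assumes the text has already been tagged by some external tool; the tag names (`TIME_ADV`, `NOUN_PHRASE`, `PAST_VERB`), the helper functions and the toy data are all hypothetical and are not taken from the project. A one-predictor least-squares fit stands in for the "various regression methods" mentioned above.

```python
from collections import Counter

# Hypothetical tag names; a real tagger would define its own tag set.
FEATURE_TAGS = ("TIME_ADV", "NOUN_PHRASE", "PAST_VERB")


def feature_proportions(tags):
    """Return the share of each feature tag among all tags in a text."""
    counts = Counter(tags)
    total = len(tags)
    if total == 0:
        return {tag: 0.0 for tag in FEATURE_TAGS}
    return {tag: counts[tag] / total for tag in FEATURE_TAGS}


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up illustration data: tag sequences and translation rates (words per hour).
docs = [
    ["NOUN_PHRASE", "PAST_VERB", "OTHER", "OTHER"],
    ["NOUN_PHRASE", "NOUN_PHRASE", "PAST_VERB", "OTHER"],
    ["NOUN_PHRASE", "NOUN_PHRASE", "NOUN_PHRASE", "TIME_ADV"],
]
rates = [400.0, 350.0, 300.0]

# Use the noun-phrase proportion as the single predictor of translation rate.
xs = [feature_proportions(d)["NOUN_PHRASE"] for d in docs]
slope, intercept = fit_line(xs, rates)
```

In practice several feature proportions would be combined in a multivariate model; the single-predictor fit is only meant to show the shape of the approach.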

Due to the limited availability of timed data, translation edit rate (TER) was used as a proxy for difficulty. TER is the number of edits needed to make a machine translation match a reference human translation, normalised by the length of the reference. A score of 0 indicates a perfect translation with no edits required, while a score of 1 corresponds to the entire sentence being changed. TER scores were computed for over 10 million sentences found in the UN-parallel corpus for both Spanish and French translations. Sentences were fed through XLM and the output vectors were used as inputs for regression and classification.
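As a rough illustration of the score, the sketch below computes a simplified TER: the word-level edit distance (insertions, deletions, substitutions) divided by the number of words in the reference. This is an assumption-laden simplification, since full TER also counts shifts of word blocks as single edits; the function names are hypothetical and this is not the tooling used in the project.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the list `hyp` into the list `ref`."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_ter(mt_output, reference):
    """Simplified TER: word edits divided by reference length (no shifts)."""
    hyp = mt_output.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


perfect = simple_ter("the cat sat", "the cat sat")
one_missing = simple_ter("the cat sat on mat", "the cat sat on the mat")
```

Here `perfect` is 0 because no edits are needed, and `one_missing` is one edit over a six-word reference. Note that without capping, this ratio can exceed 1 when the machine output is much longer than the reference.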

To provide ground-truth data, a small subset of 300 timed sentences was produced. These gave a closer understanding of the difficulty of human translation.

Project aims

The aim of this study was to better inform decision-making in the field of translation by predicting the difficulty of human translation and the usefulness of machine translation. A set of 300 timed documents was provided by the United Nations Office at Geneva. Each recorded the times at which translation started and ended.


This work is relevant to the translation industry as a whole, with a specific focus on translation in the context of inter-governmental organisations. The study led to collaboration with the United Nations and the World Trade Organisation.


Researchers and collaborators