Distributed training for machine translation

Training neural networks, and developing related hardware, to better translate millions of words of online text

Project status

Finished

Introduction

Neural networks have huge potential for a wide range of applications, among them machine translation between languages. Training these networks takes significant time and resources, which can be reduced by distributing the training across multiple machines. In collaboration with Intel, this work aimed to improve the training process, and the related hardware, for translating millions of words of online text.

Explaining the science

Neural networks are multi-layered systems, inspired by the way the human brain is wired, that are trained to learn a mapping from inputs to outputs. Each individual ‘neuron’ or node in the network has a set of parameters (weights and biases) which are iteratively adjusted. These adjustments form, in effect, ‘neural pathways’ that optimise the network’s ability to perform a given task, e.g. accurately translating a word from one language to another.
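As a minimal illustration (with made-up sizes and random values, not the project’s actual model), the sketch below shows a single layer of such ‘neurons’: each combines its inputs using a set of weights and a bias, and training consists of iteratively adjusting those numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

inputs = rng.normal(size=4)          # a small example input vector
weights = rng.normal(size=(3, 4))    # 3 neurons, each with 4 weights
biases = np.zeros(3)                 # one bias per neuron

# Each neuron combines its inputs with its weights and bias, then applies a
# non-linearity; training iteratively adjusts `weights` and `biases`.
activations = np.tanh(weights @ inputs + biases)
print(activations)                   # the layer's output, fed to the next layer
```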

A training method known as ‘stochastic gradient descent’, or SGD, is often used. The ‘gradient descent’ part describes how, at each iteration, the parameters are adjusted in the direction indicated by the gradient of a ‘cost function’, a measure of how inaccurate the network currently is – the smaller the value of the cost function, the better the network, so the aim is to minimise (descend) it. ‘Stochastic’ (i.e. random) refers to the fact that it is more efficient to compute each small step from a small, random sample of the training data than from the entire dataset every time.
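The sketch below shows SGD on a deliberately simple linear model; the dataset, batch size and learning rate are illustrative assumptions, not the settings used for the translation networks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # training inputs
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy training targets

w = np.zeros(5)                                # parameters to learn
learning_rate = 0.1

for step in range(500):
    # 'Stochastic': take a small random sample (mini-batch) of the data...
    batch = rng.choice(len(X), size=32, replace=False)
    Xb, yb = X[batch], y[batch]

    # ...compute the gradient of the cost (here, mean squared error)...
    error = Xb @ w - yb
    gradient = 2 * Xb.T @ error / len(batch)

    # ...and 'descend': nudge the parameters in the direction that lowers the cost.
    w -= learning_rate * gradient

print(w, true_w)                               # w ends up close to true_w
```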

Despite recent advances in training methods, as well as in GPU (graphics processing unit) hardware and network architectures, training these neural networks can take an impractically long time on a single machine. Distributing the training across multiple machines, however, allows networks to be developed significantly more efficiently.

The most common form of distributed training is data parallelism, in which each machine or ‘worker’ receives a different portion of the input data but a complete copy of the network, and the workers’ results are then combined. It is therefore important that the combined result is of the same quality as one trained on a single machine.
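A minimal sketch of this idea, again on a toy linear model rather than a translation network: the batch is split into shards, each ‘worker’ computes a gradient on its shard using an identical copy of the parameters, and the gradients are averaged so the shared update matches what a single machine would compute on the full batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))                   # one full batch of inputs
y = rng.normal(size=64)                        # and its targets
w = np.zeros(5)                                # identical copy held by every worker
num_workers = 4

def local_gradient(w, X_shard, y_shard):
    """Mean-squared-error gradient computed on one worker's shard of the batch."""
    error = X_shard @ w - y_shard
    return 2 * X_shard.T @ error / len(X_shard)

# Split the batch across the workers (in practice, each shard lives on a different machine).
shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
gradients = [local_gradient(w, Xs, ys) for Xs, ys in shards]

# Combine: average the workers' gradients and apply one shared update,
# equivalent to a single machine processing the whole batch.
w -= 0.1 * np.mean(gradients, axis=0)
```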

Project aims

When using distributed training, communication between machines can saturate the available bandwidth and add significant overhead. This project looked at ways to improve the efficiency of the training process, within the specific application of machine translation of text.

The machine translation group at the University of Edinburgh have collected 236TB of text, partly from trawling the internet and partly from EU parliamentary transcripts, covering the following languages: French, Spanish, Portuguese, Italian, Danish, Polish, Czech, and Mandarin. Each word in this text is encoded as a vector of 1,024 numbers, naturally leading to very large datasets.

Training efficiency is increased by reducing the size of the data ‘vectors’ that have to be communicated, exploiting the fact that most words don’t appear in most sentences. By looking at the probability of words being used and discarding 99% of the data related to unused words, it is possible to produce a ‘sparse vector’ that requires 50 times less bandwidth. Efficiency is also increased by ensuring that each distributed machine or ‘worker’ finishes in as similar a time as possible, by giving each worker sentences of the same, or similar, length to work with.
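The sketch below illustrates both ideas with hypothetical helper functions; the names, the 1% threshold and the batching scheme are assumptions for illustration, not the project’s implementation.

```python
import numpy as np

def sparsify(gradient, keep_fraction=0.01):
    """Keep only the largest entries of an update vector (by magnitude) and send
    them as (index, value) pairs, greatly reducing the bandwidth needed to share
    updates between workers."""
    k = max(1, int(len(gradient) * keep_fraction))
    top = np.argsort(np.abs(gradient))[-k:]    # indices of the largest 1% of entries
    return top, gradient[top]                  # everything else is simply dropped

def length_batches(sentences, batch_size):
    """Sort sentences by length and batch neighbours together, so that in any one
    step every worker handles sentences of similar length and all workers finish
    at roughly the same time."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```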

To turn this crawled text into useful training examples, a low-quality, first-pass translation to English is run first. Information retrieval is then used to identify good matching pairs, which are added to the training data to iteratively improve the network. The work has already required in excess of 400,000 GPU hours.
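A loose sketch of that selection loop is below; `rough_translate` and `similarity` are simple placeholders standing in for the first-pass translation system and the information-retrieval matching, not real components of the project.

```python
def rough_translate(sentence):
    # Placeholder for the low-quality, first-pass translation to English.
    return sentence

def similarity(a, b):
    # Placeholder IR-style score: word overlap between two sentences.
    a_words, b_words = set(a.split()), set(b.split())
    return len(a_words & b_words) / max(1, len(a_words | b_words))

def mine_training_pairs(foreign_sentences, english_sentences, threshold=0.5):
    """Keep (foreign, English) pairs that match well enough to become training data."""
    selected = []
    for foreign in foreign_sentences:
        rough = rough_translate(foreign)       # low-quality first pass
        best = max(english_sentences, key=lambda e: similarity(rough, e))
        if similarity(rough, best) >= threshold:
            selected.append((foreign, best))   # a good matching pair: add it
    return selected
```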

The main goals of the project have been to translate as much text as possible, improve the quality of translation, reduce overheads, and make the resulting neural networks applicable to as many different use cases as possible.

Applications

The main aim of the collaboration with Intel is to run neural networks faster on Intel hardware. Advances in hardware in turn enable more advanced models and more efficient training. By applying new distributed training techniques to the challenge of machine translation, Turing researchers have been able to inform Intel which specific aspects of their hardware would make it better suited to neural network workloads.

Recent updates


September 2017: The Edinburgh group tied for first place in eleven of the twelve translation directions it entered at the 2017 Conference on Machine Translation.


August 2016: The machine translation group at Edinburgh University won first place in seven of twelve language pairs at the 2016 Conference on Machine Translation, notably beating Google Translate in Mandarin and Czech.

Contact info

[email protected]

Funders