Introduction to Neural Machine Translation

By Antonio Castaldo

Machine Translation (MT) is one of the most important applications of natural language processing and artificial intelligence. It helps bridge the digital divide in access to information, giving us the ability to read and write in languages we may not understand, and thereby making knowledge available to a wider society. Another common use of machine translation is to aid human translators: MT systems are often used to produce a draft translation, which is then corrected and improved in a process called Machine Translation Post-Editing (MTPE). A more recent application of MT is speech translation, which enables, to some extent, simultaneous interpretation.

Until a few years ago, state-of-the-art machine translation performed far below what is possible today. Early systems, such as Systran, were rule-based; by the 2000s the state of the art had shifted to statistical models, with Moses as the best-known toolkit 1. This translation paradigm, called Statistical Machine Translation (SMT), generated translations using statistical models whose parameters were derived from the analysis of bilingual parallel corpora. The resulting translations often lacked fluency and context awareness, and were limited by the size of the system's vocabulary.

In 2016, a submission by Bentivogli et al. 2 at IWSLT 2016 demonstrated that Neural Machine Translation clearly outperformed the state-of-the-art phrase-based SMT on the English-German language pair. Since then, NMT has become the de facto standard for machine translation, replacing the paradigms that preceded it.

How Does Neural Machine Translation Work?

Neural Machine Translation (NMT) operates through artificial neural networks, loosely inspired by the structure and function of the human brain 3. These networks consist of interconnected nodes, called neurons, organized into an input layer, one or more hidden layers, and an output layer. The process starts with a training phase, in which the network learns from large datasets of parallel corpora: collections of source texts paired with their translations. During this phase, the network identifies patterns and relationships in the data, refining the weights of the connections between its neurons. At translation time, the input layer receives the source text and passes it through the hidden layers, which extract semantic and syntactic information from the context of the entire sentence, allowing the network to build a rich representation of the source text.
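As a rough, purely illustrative sketch, the three kinds of layers might look like this in PyTorch; the vocabulary sizes, dimensions, and token ids below are arbitrary toy values, not those of a real system:

```python
import torch
import torch.nn as nn

# Toy values, chosen only for illustration.
SRC_VOCAB = 1000   # assumed source vocabulary size
TGT_VOCAB = 1000   # assumed target vocabulary size
EMB_DIM, HID_DIM = 64, 128

class TinyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Input layer: turns word ids into vectors (embeddings).
        self.input_layer = nn.Embedding(SRC_VOCAB, EMB_DIM)
        # Hidden layer: transforms those vectors.
        self.hidden_layer = nn.Linear(EMB_DIM, HID_DIM)
        # Output layer: produces a score for every target word.
        self.output_layer = nn.Linear(HID_DIM, TGT_VOCAB)

    def forward(self, token_ids):
        x = self.input_layer(token_ids)
        x = torch.relu(self.hidden_layer(x))
        return self.output_layer(x)

net = TinyNetwork()
logits = net(torch.tensor([[5, 42, 7]]))  # a toy "sentence" of three token ids
print(logits.shape)  # torch.Size([1, 3, 1000])
```

Real NMT networks are far larger and, as described below, split this structure into an encoder and a decoder.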

In standard encoder-decoder systems, which we will explore in more detail in a later post, the process unfolds in four steps (a code sketch after the list illustrates them):

  1. Training: In this phase, the neural network learns to map input text in one language to the corresponding translation in another language. It uses large-scale datasets of parallel texts to understand the relationships and patterns across languages. The network adjusts the weights of the connections between neurons to improve translation accuracy. The Gradient Descent algorithm is commonly used to optimize these weights.

  2. Encoding: In this step, the input layer takes the source text and transforms it into a numerical representation that the neural network can process. This process involves converting words into vectors, typically using embeddings, which capture semantic and syntactic information. The encoder's role is to analyze the input text and create a compressed, abstract representation that summarizes the essential information from the source language. This encoded representation is then passed to the next stage for further processing.

  3. Decoding: After the encoding step, the decoder takes the compressed representation from the encoder and converts it back into a sequence of words in the target language. The decoder works with these representations, leveraging context and learned patterns to generate an accurate translation. It does this by iteratively predicting the next word in the target language, using not only the encoded information but also its own internal states and the words generated thus far. This iterative process continues until the entire target sentence is produced.

  4. Output: After processing through the hidden layers, the data reaches the output layer, which turns the decoder's internal states into a probability distribution over the target vocabulary and thus generates the translated text. Because the connection weights were optimized during training, the network can pick the most likely word at each step.
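Below is a minimal sketch of these four steps, assuming a toy GRU-based encoder-decoder; every size, token id, and the single "parallel sentence pair" are invented for illustration, and real systems train on millions of pairs:

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128
BOS, EOS = 1, 2  # assumed special token ids

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):
        # Encoding: compress the source sentence into a hidden state.
        _, state = self.rnn(self.embed(src))
        return state

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt, state):
        # Decoding: predict target words from the encoded state and
        # the words seen (or generated) so far.
        h, state = self.rnn(self.embed(tgt), state)
        return self.out(h), state

encoder, decoder = Encoder(), Decoder()
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)  # plain gradient descent
loss_fn = nn.CrossEntropyLoss()

# 1. Training: one step on one fake sentence pair.
src = torch.tensor([[4, 8, 15, EOS]])         # source token ids
tgt = torch.tensor([[BOS, 16, 23, 42, EOS]])  # reference translation
optimizer.zero_grad()
logits, _ = decoder(tgt[:, :-1], encoder(src))
loss = loss_fn(logits.reshape(-1, TGT_VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
optimizer.step()  # adjust connection weights to reduce the error

# 2-4. Encoding, decoding, output: greedy word-by-word generation.
state = encoder(src)
word = torch.tensor([[BOS]])
translation = []
for _ in range(10):  # cap the output length
    logits, state = decoder(word, state)
    word = logits.argmax(-1)  # output layer: pick the most likely word
    if word.item() == EOS:
        break
    translation.append(word.item())
print(translation)  # target-language token ids
```

Modern systems replace the GRU with Transformer layers and greedy search with beam search, but the training-encoding-decoding-output loop is the same in spirit.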

Compared to SMT, which relies on fixed sequences of words (n-grams), NMT considers the broader context of the entire sentence, allowing for more nuanced and contextually accurate translations. This ability to leverage deep learning techniques and wider context is what makes NMT so effective and has led to its dominance in the field of machine translation.
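To make the contrast concrete, here is a small, hypothetical illustration of the context window an n-gram model conditions on (the sentence is made up):

```python
# Purely illustrative: the limited context an n-gram model sees.
sentence = "the bank by the river was closed".split()

# A trigram model predicts each word from only the two preceding words.
# When it reaches "bank", its context is just "the": the disambiguating
# word "river" appears later in the sentence and is never seen.
for i, word in enumerate(sentence):
    context = " ".join(sentence[max(0, i - 2):i]) or "<start>"
    print(f"P({word} | {context})")

# An NMT encoder, by contrast, conditions every prediction on a
# representation of the entire source sentence.
```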

Footnotes

  1. Rivera-Trigueros, I. (2022). Machine translation systems and quality assessment: a systematic review. Language Resources & Evaluation, 56, 593–619.

  2. Bentivogli, L., Bisazza, A., Cettolo, M., & Federico, M. (2016). Neural versus Phrase-Based Machine Translation Quality: a Case Study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 257–267). Association for Computational Linguistics.

  3. Monti, J. (2018). Dalla Zairja alla traduzione automatica. Riflessioni sulla traduzione nell'era digitale. Paolo Loffredo.