MACHINE TRANSLATION – OVERVIEW – Natural Language Processing

WHAT IS MACHINE TRANSLATION?

Machine translation (MT) is the automatic translation of text from one natural language into another. The ideal aim of an MT system is to produce the best possible translation without human assistance. Every MT system requires translation programs, together with automated dictionaries and grammars that support them.

The translation quality of an MT system can be improved by pre-editing the input. Pre-editing means adjusting the input by marking prefixes, suffixes, clause boundaries, etc. Quality can also be improved by controlling the vocabulary, that is, by restricting the input to an agreed set of words. The output of the system should be post-edited to bring it up to the required standard. Post-editing is especially important for health-related information.
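As a rough illustration of vocabulary control during pre-editing, the sketch below flags input words that fall outside an approved word list so that an author can rephrase them before translation. The word list and the function name `out_of_vocabulary` are hypothetical and chosen only for this example; a real controlled language would also constrain grammar and style.

```python
import re

# Hypothetical controlled vocabulary: the only words authors may use.
CONTROLLED_VOCABULARY = {"take", "one", "tablet", "twice", "a", "day", "with", "water"}

def out_of_vocabulary(text, vocabulary=CONTROLLED_VOCABULARY):
    """Return the distinct words in `text` that are not in the vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    flagged = []
    for word in words:
        if word not in vocabulary and word not in flagged:
            flagged.append(word)
    return flagged

# "daily" is not in the approved list, so it is reported for rewording.
print(out_of_vocabulary("Take one tablet twice daily with water."))
```

An empty result would mean the sentence already conforms to the word list and can be passed to the MT system unchanged.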

TYPES OF MACHINE TRANSLATION SYSTEMS

Machine translation systems that translate between only two particular languages are called bilingual systems, and those that translate between any given pair of languages from a larger set are called multilingual systems. A system may be either uni-directional (translating in one direction only) or bi-directional. Multilingual systems are preferably bi-directional, so that they can translate from any given language to any other given language and vice versa.

Figure: Machine Translation Pyramid

DIRECT MACHINE TRANSLATION APPROACH

The direct translation approach is the oldest and the least popular. Systems that use this approach translate one language, called the source language (SL), directly into another, called the target language (TL). Direct translation systems are basically bilingual and uni-directional. The approach needs only a little syntactic and semantic analysis, and the analysis of SL texts is oriented specifically to producing representations appropriate for one particular TL.
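A minimal sketch of the direct approach, assuming a toy English-to-Spanish lexicon: words are looked up one by one in a bilingual dictionary, and a single local reordering rule stands in for the "little syntactic analysis" mentioned above. The lexicon, the word classes and the function name `direct_translate` are illustrative assumptions, not part of any real system.

```python
# Toy English -> Spanish lexicon (illustrative only).
LEXICON = {"the": "el", "red": "rojo", "car": "coche", "is": "es", "new": "nuevo"}
ADJECTIVES = {"red", "new"}
NOUNS = {"car"}

def direct_translate(sentence):
    """Translate word for word, with one SL->TL-specific reordering rule."""
    words = sentence.lower().split()
    # Minimal local reordering: English "adjective noun" -> Spanish "noun adjective".
    i = 0
    while i < len(words) - 1:
        if words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2
        else:
            i += 1
    # Word-for-word lookup; unknown words are passed through unchanged.
    return " ".join(LEXICON.get(w, w) for w in words)

print(direct_translate("the red car is new"))  # el coche rojo es nuevo
```

Both the lexicon and the reordering rule are tied to this one language pair and direction, which is why such systems are bilingual and uni-directional.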

INTERLINGUA APPROACH

The interlingua approach is intended for translating SL texts into more than one target language. Translation proceeds in two stages: from the SL to an intermediate form called the interlingua (IL), and then from the IL to the TL. The interlingua may be an artificial representation or an auxiliary language such as Esperanto with a universal vocabulary. The approach requires complete resolution of all ambiguities in the SL text.
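The two stages can be sketched as follows, under the strong simplifying assumption that the interlingua is just a list of language-neutral concept labels and that every word maps to exactly one concept. The lexicons and function names are hypothetical. The last line also shows why the design suits multilingual systems: with n languages, an interlingua needs n analysis modules plus n generation modules, whereas direct translation needs one uni-directional system per ordered language pair, n(n-1) in total.

```python
# Toy lexicons mapping words to language-neutral concept labels (hypothetical).
TO_CONCEPT = {
    "en": {"water": "WATER", "house": "HOUSE"},
    "es": {"agua": "WATER", "casa": "HOUSE"},
    "fr": {"eau": "WATER", "maison": "HOUSE"},
    "de": {"wasser": "WATER", "haus": "HOUSE"},
}

def analyze(text, source):
    """Stage 1 (SL -> IL): map each source word to a concept label."""
    return [TO_CONCEPT[source][w] for w in text.lower().split()]

def generate(concepts, target):
    """Stage 2 (IL -> TL): map each concept label to a target word."""
    from_concept = {concept: word for word, concept in TO_CONCEPT[target].items()}
    return " ".join(from_concept[c] for c in concepts)

def translate(text, source, target):
    # No module here knows about a specific language pair.
    return generate(analyze(text, source), target)

n = len(TO_CONCEPT)
print(translate("agua", "es", "fr"))  # eau
print("interlingua modules:", 2 * n, "direct systems:", n * (n - 1))
```

The one-word-one-concept assumption is exactly where the requirement of complete ambiguity resolution shows up: an ambiguous source word would have to be assigned a single concept before generation could begin.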

TRANSFER APPROACH

Unlike the interlingua approach, which has two stages, the transfer approach has three. In the first stage, SL texts are converted into abstract SL-oriented representations. In the second stage, the SL-oriented representations are converted into equivalent TL-oriented representations. The final texts are generated in the third stage. The transfer approach does not require complete resolution of the ambiguities of the SL text; only the ambiguities inherent in the language itself are tackled. Three types of dictionaries are required: SL dictionaries, TL dictionaries and a bilingual transfer dictionary. Transfer systems have separate grammars for SL analysis, for TL generation and for the transformation of SL structures into equivalent TL forms.
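The three stages and three dictionaries can be sketched as below for a toy English-to-Spanish noun phrase. All dictionary contents, the part-of-speech labels and the function names are assumptions made for illustration; real transfer systems work on much richer syntactic representations.

```python
# SL dictionary: English word -> part of speech.
SL_DICT = {"the": "DET", "red": "ADJ", "house": "NOUN", "car": "NOUN"}
# Bilingual transfer dictionary: English lemma -> Spanish lemma.
TRANSFER_DICT = {"the": "el", "red": "rojo", "house": "casa", "car": "coche"}
# TL dictionary: Spanish noun gender and feminine inflected forms.
TL_GENDER = {"casa": "f", "coche": "m"}
TL_FORMS = {("el", "f"): "la", ("rojo", "f"): "roja"}

def analyze(text):
    """Stage 1: SL text -> SL-oriented representation of (word, POS) pairs."""
    return [(w, SL_DICT[w]) for w in text.lower().split()]

def transfer(sl_repr):
    """Stage 2: SL-oriented representation -> TL-oriented representation."""
    tl = [(TRANSFER_DICT[w], pos) for w, pos in sl_repr]  # lexical transfer
    out, i = [], 0
    while i < len(tl):
        # Structural transfer: ADJ NOUN -> NOUN ADJ.
        if i + 1 < len(tl) and tl[i][1] == "ADJ" and tl[i + 1][1] == "NOUN":
            out += [tl[i + 1], tl[i]]
            i += 2
        else:
            out.append(tl[i])
            i += 1
    return out

def generate(tl_repr):
    """Stage 3: TL representation -> TL text, applying gender agreement."""
    gender = next((TL_GENDER[w] for w, pos in tl_repr if pos == "NOUN"), "m")
    return " ".join(TL_FORMS.get((w, gender), w) for w, _ in tl_repr)

def translate(text):
    return generate(transfer(analyze(text)))

print(translate("the red house"))  # la casa roja
print(translate("the red car"))    # el coche rojo
```

Only `transfer` and its dictionary depend on the language pair; `analyze` and `generate` each depend on a single language, so in principle they could be reused when another pair is added. Words missing from the toy dictionaries raise a `KeyError` in this sketch.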

EMPIRICAL MACHINE TRANSLATION APPROACH

The empirical approach is an emerging one that uses large amounts of raw data in the form of parallel corpora, that is, texts paired with their translations. Example-based MT, analogy-based MT, memory-based MT, and case-based MT are techniques that use the empirical approach; all of them rely on a corpus or database of translated examples.

Statistical machine translation is also corpus-based, but differs in that it depends on statistical modelling of the word order of the target language and of source-target word equivalences. It automatically learns lexical and structural preferences from corpora. Statistical models offer a good solution to the ambiguity problem, and they are robust: they continue to work reasonably well when the input contains errors or previously unseen data. IBM researchers pioneered the first statistical approach to machine translation in the 1980s. The IBM group relied on the source-channel approach, a framework for combining a word-based translation model and a language model. The translation model ensures that the system produces target hypotheses corresponding to the source sentence; the language model favours grammatically correct output.
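In the source-channel formulation, the system picks the target sentence e that maximizes P(e) x P(f | e) for a source sentence f, where P(f | e) comes from the translation model and P(e) from the language model. The sketch below scores a few candidate translations this way. All probabilities are invented for the example, the translation score is a simplified word-based formula (no NULL word and no length model), and choosing among a fixed candidate list stands in for a real search procedure.

```python
import math

# Toy word-translation table t(f | e): hypothetical probabilities.
T = {
    ("la", "the"): 0.9, ("maison", "house"): 0.8, ("bleue", "blue"): 0.8,
    ("maison", "home"): 0.2,
}
# Toy English bigram language model P(w2 | w1): hypothetical probabilities.
BIGRAM = {
    ("<s>", "the"): 0.5, ("the", "blue"): 0.2, ("blue", "house"): 0.3,
    ("the", "house"): 0.3, ("house", "blue"): 0.01,
}
FLOOR = 1e-6  # probability assigned to unseen events

def log_translation_model(f_words, e_words):
    """log P(f | e): each source word may be explained by any target word."""
    total = 0.0
    for f in f_words:
        p = sum(T.get((f, e), 0.0) for e in e_words) / len(e_words)
        total += math.log(max(p, FLOOR))
    return total

def log_language_model(e_words):
    """log P(e) under the bigram model, starting from the <s> marker."""
    total, prev = 0.0, "<s>"
    for w in e_words:
        total += math.log(BIGRAM.get((prev, w), FLOOR))
        prev = w
    return total

def decode(source, candidates):
    """Return the candidate e maximizing log P(e) + log P(f | e)."""
    f_words = source.split()
    def score(e):
        e_words = e.split()
        return log_translation_model(f_words, e_words) + log_language_model(e_words)
    return max(candidates, key=score)

candidates = ["the house blue", "the blue house", "the blue blue"]
print(decode("la maison bleue", candidates))  # the blue house
```

In this toy setup the translation model cannot tell "the house blue" from "the blue house", because both contain the same words; the language model breaks the tie in favour of the natural English word order. The translation model in turn rules out "the blue blue", which leaves "maison" unexplained.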