Posts Tagged ‘Machine Translation’
Machine translation is the process of translating source language text into text in a target language. The following diagram shows all the phases involved.
This is the first phase in the machine translation process and the first module in any MT system. Sentences can be classified by their degree of translation difficulty. Sentences that express relations, expectations, assumptions, and conditions are very difficult for an MT system to understand. A speaker's intentions and mental state expressed in the sentences require discourse analysis for interpretation, because of the inter-relationships among adjacent sentences. World knowledge and commonsense knowledge may be required for interpreting some sentences.
Deformatting and Reformatting
This makes the machine translation process easier and improves quality. The source language text may contain figures, flowcharts, etc. that do not require any translation, so the portions that do require translation should be identified. Once the text is translated, the target text is reformatted after post-editing. Reformatting ensures that the target text also contains the non-translated portions.
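To make the deformat/reformat idea concrete, here is a minimal Python sketch; the `[fig]…[/fig]` markers and the placeholder format are invented for illustration, not part of any real system:

```python
import re

def deformat(text):
    """Stash non-translatable spans (here, anything inside hypothetical
    [fig]...[/fig] markers) and replace them with numbered placeholders."""
    spans = []
    def stash(match):
        spans.append(match.group(0))
        return "__NT%d__" % (len(spans) - 1)
    return re.sub(r"\[fig\].*?\[/fig\]", stash, text), spans

def reformat(translated, spans):
    """Restore the stashed spans into the (post-edited) target text."""
    for i, span in enumerate(spans):
        translated = translated.replace("__NT%d__" % i, span)
    return translated

source = "See [fig]chart-1.png[/fig] for details."
stripped, spans = deformat(source)   # only `stripped` goes to the engine
print(stripped)                      # See __NT0__ for details.
print(reformat(stripped, spans) == source)  # True
```

In a real pipeline the translated text, not the stripped source, would be passed to `reformat`, but the placeholder mechanism is the same.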
Pre-editing and Post-editing
The level of pre-editing and post-editing depends on the efficiency of the particular MT system. For some systems, segmenting long sentences into short ones may be required. Fixing punctuation marks and blocking material that does not require translation are also done during pre-editing. Post-editing is done to make sure that the quality of the translation is up to the mark. Post-editing is unavoidable, especially for the translation of crucial information such as health information. Post-editing will remain necessary until MT systems reach human-like quality.
Analysis, Transfer and Generation
Morphological analysis determines the word form: inflections, tense, number, part of speech, etc. Syntactic analysis determines whether a word is the subject or the object. Semantic and contextual analysis determines a proper interpretation of a sentence from the results produced by the syntactic analysis. Syntactic and semantic analysis are often executed simultaneously, producing a syntactic tree structure and a semantic network respectively. This results in an internal structure of the sentence. The sentence generation phase is just the reverse of the analysis process.
Morphological analysis and generation
Computational morphology deals with the recognition, analysis and generation of words. Some of the morphological processes are inflection, derivation, affixation and combining forms. Inflection is the most regular and productive morphological process across languages. Inflection alters the form of a word in number, gender, mood, tense, aspect, person, and case. A morphological analyser gives information about the morphological properties of the words it analyses.
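A toy morphological analyser along these lines can be sketched with a handful of suffix rules. The rules below are illustrative only: they handle a few regular English inflections and nothing else.

```python
# toy analyser: ordered suffix-stripping rules (assumption: regular
# English inflection only; longest/most specific suffixes first)
SUFFIX_RULES = [
    ("ies", "y", {"number": "plural"}),      # babies -> baby
    ("ing", "",  {"aspect": "progressive"}), # walking -> walk
    ("ed",  "",  {"tense": "past"}),         # walked -> walk
    ("s",   "",  {"number": "plural"}),      # cats -> cat
]

def analyse(word):
    """Return (stem, morphological features) for a word form."""
    for suffix, replacement, features in SUFFIX_RULES:
        # require some stem left over so we don't strip whole words
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement, features
    return word, {}  # no rule applied: treat as an uninflected base form

print(analyse("babies"))   # ('baby', {'number': 'plural'})
print(analyse("walked"))   # ('walk', {'tense': 'past'})
```

A real analyser would consult a lexicon to rule out false stems ("ring" is not "r" + progressive), which is exactly the kind of knowledge the paragraph above refers to.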
Syntactic analysis and generation
As words are the foundation of speech and language processing, syntax can be considered the skeleton. Syntactic analysis is concerned with how words are grouped into classes called parts of speech, how they group with their neighbours into phrases, and the ways in which words depend on other words in a sentence.
A grammar formalism is a framework for explaining the basic structure of a language. Researchers have proposed the following grammar formalisms:
Phrase Structure Grammar (PSG)
The variants of PSG are
Context Free PSG
Context Sensitive PSG
Augmented Transition Network Grammar (ATN)
Definite Clause (DC) Grammar
Lexical Functional Grammar (LFG)
Head Driven PSG
Tree Adjoining Grammar (TAG)
Not all grammars suit a particular language. PSG, for example, does not suit Japanese, while dependency grammar does. Case grammar is popular because sentences in different languages that express the same content may have the same case frames.
Parsing and Tagging
Tagging means the identification of the linguistic properties of individual words, and parsing is the assessment of the functions of the words in relation to each other.
Semantic and Contextual analysis and Generation
Semantic analysis composes meaning representations and assigns them to the linguistic inputs. The semantic analyser uses a lexicon and a grammar to create context-independent meanings. The sources of knowledge include the meanings of words, meanings associated with grammatical structures, knowledge about the discourse context, and commonsense knowledge.
The basic idea of Example-Based Machine Translation (EBMT) is to reuse examples of already existing translations as the basis for new translation. The process of EBMT is broken down into three stages: matching, alignment and recombination.
The matching stage in example-based machine translation finds examples that will contribute to the translation on the basis of their similarity to the input. How the matching stage should be implemented depends on how the examples are stored. In older systems, examples were stored as annotated tree structures, and the constituents in the two languages were connected by explicit links. The input to be matched is parsed, using the grammar that was used to build the example database, into a tree, which is then compared with the trees in the example database.
The input and examples can also be matched by comparing them character by character; this process is called sequence comparison. Alignment and recombination are more difficult with this approach. Examples may be annotated with part-of-speech tags, and several simple examples may be combined into a single, more general example containing variables. The examples should be analysed to see whether they are suitable for further processing, and overlapping or contradictory examples should be dealt with properly.
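A minimal sketch of similarity-based matching using character-level sequence comparison; the example database and its translations are invented toy data, and a real system would use a much larger corpus and a more refined similarity measure:

```python
from difflib import SequenceMatcher

# invented toy example database: source sentences and their translations
EXAMPLES = {
    "the weather is nice today": "das Wetter ist heute schön",
    "the food is good": "das Essen ist gut",
    "where is the station": "wo ist der Bahnhof",
}

def best_match(source):
    """Sequence comparison: score each stored example character by
    character against the input and return the closest (source,
    translation) pair."""
    return max(EXAMPLES.items(),
               key=lambda pair: SequenceMatcher(None, source,
                                                pair[0]).ratio())

example, translation = best_match("the weather is nice")
print(example)      # the weather is nice today
print(translation)  # das Wetter ist heute schön
```

The retrieved translation then feeds the alignment and recombination stages described next.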
Alignment identifies which parts of the corresponding translation are to be reused. It is done by using a bilingual dictionary or by comparing with other examples. The alignment process in example-based machine translation must be automated.
Recombination is the final phase in the example-based machine translation approach. Recombination makes sure that the reusable parts identified during alignment are put together in a legitimate way. It takes source language sentences and a set of translation patterns as inputs and produces target language sentences as outputs. The design of the recombination strategy depends on the preceding matching and alignment phases.
What is Machine Translation?
The term machine translation (MT) refers to the translation of text from one language to another. The ideal aim of machine translation systems is to produce the best possible translation without human assistance. Basically, every machine translation system requires programs for translation, along with automated dictionaries and grammars to support translation.
The translation quality of machine translation systems can be improved by pre-editing the input. Pre-editing means adjusting the input, for example by marking prefixes, suffixes and clause boundaries. Translation quality can also be improved by controlling the vocabulary. The output of machine translation should be post-edited to perfect it. Post-editing is required especially for health-related information.
Types Of Machine Translation Systems
Machine translation systems that produce translations between only two particular languages are called bilingual systems, and those that produce translations for any given pair of languages are called multilingual systems. Multilingual systems may be either uni-directional or bi-directional. Bi-directional systems are preferred, as they can translate from any given language to any other and vice versa.
Direct Machine Translation Approach
The direct translation approach is the oldest and now least popular approach. Machine translation systems that use this approach translate a language, called the source language (SL), directly into another language, called the target language (TL). Direct translation systems are basically bilingual and uni-directional. The direct translation approach needs only a little syntactic and semantic analysis, since SL analysis is oriented specifically to the production of representations appropriate for one particular TL.
Interlingua Machine Translation Approach
The interlingua approach is intended to translate SL texts into more than one language. Translation is from the SL to an intermediate form called an interlingua (IL), and then from the IL to the TL. The interlingua may be an artificial one, or an auxiliary language such as Esperanto with a universal vocabulary. The interlingua approach requires complete resolution of all ambiguities in the SL text.
Transfer Machine Translation Approach
Unlike the interlingua approach, the transfer approach involves three stages. In the first stage, SL texts are converted into abstract SL-oriented representations. In the second stage, the SL-oriented representations are converted into equivalent TL-oriented representations. Final texts are generated in the third stage. The transfer approach does not require complete resolution of the ambiguities of the SL text; only the ambiguities inherent in the language itself are tackled. Three types of dictionaries are required: SL dictionaries, TL dictionaries and a bilingual transfer dictionary. Transfer systems have separate grammars for SL analysis, TL analysis and the transformation of SL structures into equivalent TL forms.
Empirical Machine Translation Approach
The empirical approach is an emerging one that uses a large amount of raw data in the form of parallel corpora. The raw data consists of texts and their translations. Example-based MT, analogy-based MT, memory-based MT and case-based MT are techniques that use the empirical approach; basically, all of them use a corpus or database of translated examples. Statistical machine translation is also corpus-based, but slightly different in that it depends on statistical modelling of the word order of the target language and of source-target word equivalences. Statistical machine translation automatically learns lexical and structural preferences from corpora. Statistical models offer a good solution to the ambiguity problem; they are robust and work well even in the presence of errors and new data. IBM researchers pioneered the first statistical approach to machine translation in the 1980s. The IBM group relied on the source-channel approach, a framework for combining a word-based translation model and a language model. The translation model ensures that the system produces target hypotheses corresponding to the source sentence, while the language model ensures grammatically correct output.
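The source-channel combination can be sketched as choosing the target sentence e that maximises P(e)·P(f|e) for a source sentence f. The probabilities below are invented toy values, not estimates from any corpus:

```python
import math

# toy language model P(e): the well-formed hypothesis scores higher
# (probabilities are invented for illustration)
LM = {"the house is small": 0.4, "house the small is": 0.01}
# toy translation model P(f|e) for a source sentence such as
# "das Haus ist klein": both hypotheses cover the source words equally
TM = {"the house is small": 0.5, "house the small is": 0.5}

def decode(candidates):
    """Source-channel decoding: pick the hypothesis e maximising
    log P(e) + log P(f|e)."""
    return max(candidates,
               key=lambda e: math.log(LM[e]) + math.log(TM[e]))

print(decode(LM.keys()))  # the house is small
```

Here the translation model cannot distinguish the two word orders, so the language model alone rules out the ungrammatical hypothesis, which is exactly the division of labour described above.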
The following is a list of challenges one has to face when attempting machine translation.
Not all words in one language have equivalent words in another language. In some cases a word in one language must be expressed by a group of words in another.
Two given languages may have completely different structures. For example, English has an SVO structure while Tamil has an SOV structure.
Sometimes there is a lack of one-to-one correspondence of parts of speech between two languages. For example, color terms of Tamil are nouns whereas in English they are adjectives.
The way sentences are put together also differs among languages.
Words can have more than one meaning, and sometimes a group of words or a whole sentence may have more than one meaning in a language. This problem is called ambiguity.
Not all translation problems can be solved by applying the rules of grammar.
It is very difficult for software programs to predict meaning.
Translation requires not only vocabulary and grammar but also knowledge gathered from past experience.
The programmer should understand the rules under which complex human language operates and how the mechanism of this operation can be simulated by automatic means.
The simulation of human language behaviour by automatic means is almost impossible to achieve, as language is an open and dynamic system in constant change. More importantly, the system is not yet completely understood.
What is a rule based machine translation system?
A rule-based machine translation system consists of a collection of rules called grammar rules, a lexicon, and software programs to process the rules. It is extensible and maintainable. The rule-based approach is the first strategy ever developed in the field of machine translation. Rules are written with linguistic knowledge gathered from linguists. Rules play a major role in the various stages of translation: syntactic processing, semantic interpretation, and contextual processing of language.
Structure of rule based machine translation system
A tree structure is used to represent the structure of a sentence. A typical English sentence consists of two major parts: a noun phrase (NP) and a verb phrase (VP). These two parts can be further divided as per the structure of the sentence. 'Rewrite rules' are used to describe which tree structures are allowable for a given sentence. Only a sentence with the right structure can lead to a correct translation. The following rules represent a simple grammar.
S -> NP VP
VP -> V NP
NP -> Name
NP -> ART N
where S stands for sentence, V for verb, N for noun and ART for article. A grammar can derive a sentence if there is a sequence of rules that rewrite the start symbol, S, into that sentence.
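These rewrite rules can be checked mechanically. Below is a minimal top-down recogniser for the four rules above; the lexicon is an invented toy one, and a real system would also build the tree rather than just accept or reject:

```python
# the four rewrite rules from the text, as lists of alternatives
GRAMMAR = {
    "S":  [["NP", "VP"]],           # S  -> NP VP
    "VP": [["V", "NP"]],            # VP -> V NP
    "NP": [["Name"], ["ART", "N"]], # NP -> Name | ART N
}
# invented toy lexicon: word -> pre-terminal category
LEXICON = {"Joe": "Name", "the": "ART", "dog": "N", "saw": "V"}

def match(symbols, words):
    """Top-down check: can the symbol list rewrite into exactly `words`?"""
    if not symbols:
        return not words              # success only if no words remain
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:               # non-terminal: try each rewrite rule
        return any(match(rule + rest, words) for rule in GRAMMAR[head])
    # pre-terminal: consume one word if its category matches
    return bool(words) and LEXICON.get(words[0]) == head \
        and match(rest, words[1:])

print(match(["S"], "Joe saw the dog".split()))  # True
print(match(["S"], "saw the Joe dog".split()))  # False
```

The recogniser simply tries every sequence of rewrites from S, which is the "sequence of rules" notion made executable.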
Logical form is commonly used in semantic interpretation. For example the sentence, Joe was happy, can be written in logical form as:
(<PAST HAPPY> (NAME j1 "Joe"))
where PAST stands for past tense. Semantic interpretation is a compositional process in that interpretations can be built incrementally from the interpretations of subphrases. The lexicon plays a major role in semantic interpretation. Grammar rules are used to compute the logical form of a given sentence. Consider the grammar rule given below.
(S SEM (?semvp ?semnp)) -> (NP SEM ?semnp) (VP SEM ?semvp)
where SEM stands for the semantic feature. This rule says that a sentence consists of a noun phrase and a verb phrase, and that the meaning of the sentence is obtained by applying the VP's meaning to the NP's meaning.
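A minimal sketch of how such a rule composes meanings for "Joe was happy"; the tuple representations are invented stand-ins for real logical forms:

```python
# NP meaning: the logical form (NAME j1 "Joe") as a tuple
sem_np = ("NAME", "j1", "Joe")

# VP meaning for "was happy": a function awaiting its subject,
# mirroring ?semvp in the rule (S SEM (?semvp ?semnp))
sem_vp = lambda subject: ("<PAST HAPPY>", subject)

# the rule composes S.SEM by applying the VP meaning to the NP meaning
sem_s = sem_vp(sem_np)
print(sem_s)  # ('<PAST HAPPY>', ('NAME', 'j1', 'Joe'))
```

This is the compositional step in miniature: the sentence's logical form is assembled purely from the logical forms of its subphrases.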
How does translation happen in rule based machine translation system?
Translation in a rule-based machine translation system is done by pattern matching of the rules; success lies in avoiding the pattern matching of unfruitful rules. Knowledge and reasoning are used for language understanding. General world knowledge is required for solving interpretation problems such as disambiguation. Context-specific knowledge can be used to determine the referents of noun phrases and to disambiguate word senses based on what makes sense in the current situation. A knowledge representation consists of a knowledge base and inference techniques; inference techniques apply inference rules to derive new sentences from the knowledge base.
Anaphora is the linguistic phenomenon of pointing back to a previously mentioned item in the text. The pointing back word or phrase is called anaphor and the entity to which it refers or for which it stands is its antecedent. For example,
Joe is not yet here but he is expected to arrive in the next one hour.
Here 'he' is the anaphor and 'Joe' is the antecedent. When the anaphor refers to an antecedent and both have the same referent in the real world, they are termed coreferential. Coreference is the act of referring to the same referent in the real world. The process of determining the antecedent of an anaphor is called anaphora resolution, and the rules used for resolution are called resolution rules. These rules are based on different sources of knowledge. Needless to say, the interpretation of anaphora is crucial for the successful operation of a machine translation system.
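A toy resolution rule can be sketched as follows. The agreement-based heuristic and the small lexicons are assumptions for illustration; real resolution rules draw on syntax, salience and world knowledge as well:

```python
# invented toy lexicons with (gender, number) features
NOUNS = {"Joe": ("masc", "sg"), "Mary": ("fem", "sg")}
PRONOUNS = {"he": ("masc", "sg"), "she": ("fem", "sg")}

def resolve(words, pronoun_index):
    """Resolution rule (a heuristic): the antecedent is the nearest
    preceding noun that agrees with the anaphor in gender and number."""
    gender, number = PRONOUNS[words[pronoun_index]]
    for word in reversed(words[:pronoun_index]):  # scan right to left
        if word in NOUNS and NOUNS[word] == (gender, number):
            return word
    return None  # no agreeing antecedent found in this sentence

words = "Joe is not yet here but he is expected to arrive soon".split()
print(resolve(words, words.index("he")))  # Joe
```

On the example sentence this correctly links the anaphor 'he' back to its antecedent 'Joe'.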
Advantage of rule based machine translation approach
The advantage of the rule-based machine translation approach is that it can analyse deeply at the syntactic and semantic levels. Its drawbacks include the huge amount of linguistic knowledge required and the very large number of rules needed to cover all the features of a language.