Posts Tagged ‘Introduction’

Introduction To Tamil Language

Tamil is a classical language spoken by more than 77 million people across the world. It is a member of the South Dravidian family and one of the most ancient languages still spoken. It is written in a script derived from southern Brahmi. Tamil has 12 vowels and 18 consonants.

Tamil Vowels

Tamil is a relatively free word order, verb-final, inflectional language. Its third person pronouns distinguish singular from plural and masculine from feminine.

In Tamil, the subject is identified by the case marker it takes. As Tamil is an agglutinative language, words are formed by combining several morphemes. A Tamil word consists of a root combined with other grammatical accretions. Irrespective of the length, complexity and type of a Tamil word, its root can be traced down to the monosyllabic level by carefully removing successive accretions. Traditionally, a Tamil word is divided into a maximum of six parts, namely pakuthy (prime-stem), sandhi (junction), vihaaram (variation), iTainilai (middle part), saariyai (enunciator) and vikuti (terminator), in that order. For example, the word ndaTantanan, meaning ‘(He) walked’, is made up of the morphemes naTa + t(n) + t + an + an. The middle part and the terminator are grammatical additions to the prime-stem: the middle part marks the tense and the terminator marks the gender. Usually, the prime-stem is the main part of the word and carries its meaning.
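The idea of tracing a word back to its prime-stem by removing accretions can be sketched in a few lines of Python. This is only an illustration built on the single breakdown given above (naTa + t(n) + t + an + an); the labels attached to each part, and the reading of t(n) as a junction t that surfaces as n, are assumptions, and the surface form is spelled as the listed morphemes compose it. It is not a general morphological analyser.

```python
# Labelled parts of the example word, following naTa + t(n) + t + an + an.
# Assumption: t(n) is read as the junction "t" surfacing as "n".
PARTS = [
    ("pakuthy", "naTa"),             # prime-stem
    ("sandhi after vihaaram", "n"),  # junction, shown in its varied form
    ("iTainilai", "t"),              # middle part, marks tense
    ("saariyai", "an"),              # enunciator
    ("vikuti", "an"),                # terminator, marks gender
]


def compose(parts):
    """Join the parts in order to obtain the surface form."""
    return "".join(form for _, form in parts)


def strip_accretions(word, parts):
    """Remove the accretions right to left, recording each intermediate form."""
    steps = []
    for label, form in reversed(parts[1:]):
        if not word.endswith(form):
            raise ValueError(f"{word!r} does not end with {form!r}")
        word = word[: -len(form)]
        steps.append((label, form, word))
    return steps


word = compose(PARTS)
for label, form, rest in strip_accretions(word, PARTS):
    print(f"remove {label} '{form}' -> {rest}")
```

The last form recorded is the prime-stem naTa, which is what remains once every accretion has been removed.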

Machine Translation – Overview

What is Machine Translation?

The term machine translation (MT) refers to the automated translation of text from one natural language into another. The ideal aim of a machine translation system is to produce the best possible translation without human assistance. Every machine translation system requires translation programs, together with automated dictionaries and grammars to support them.

The translation quality of a machine translation system can be improved by pre-editing the input. Pre-editing means adjusting the input by marking prefixes, suffixes, clause boundaries, etc. Translation quality can also be improved by controlling the vocabulary. The output of machine translation should be post-edited to correct the remaining errors. Post-editing is especially necessary for health-related information.

Types Of Machine Translation Systems

Machine translation systems that translate between only two particular languages are called bilingual systems, and those that translate between any given pair of languages are called multilingual systems. Systems may be either uni-directional or bi-directional. Multilingual systems are preferably bi-directional, since they can then translate from any given language to any other given language and vice versa.

Figure: Machine Translation Pyramid

Direct Machine Translation Approach

The direct translation approach is the oldest and the least popular. Systems that use it translate one language, called the source language (SL), directly into another, called the target language (TL). Direct translation systems are basically bilingual and uni-directional. They need only a little syntactic and semantic analysis, and the SL analysis is oriented specifically to producing representations appropriate for one particular TL.

Interlingua Approach

The interlingua approach aims to translate SL texts into more than one target language. Translation goes from the SL to an intermediate form called the interlingua (IL) and then from the IL to the TL. The interlingua may be an artificial language or an auxiliary language such as Esperanto, with a universal vocabulary. The interlingua approach requires complete resolution of all ambiguities in the SL text.

Transfer Approach

Unlike the interlingua approach, the transfer approach involves three stages. In the first stage, SL texts are converted into abstract SL-oriented representations. In the second stage, the SL-oriented representations are converted into equivalent TL-oriented representations. The final texts are generated in the third stage. The transfer approach does not require complete resolution of the ambiguities of the SL text; only the ambiguities inherent in the language itself are tackled. Three types of dictionaries are required: SL dictionaries, TL dictionaries and a bilingual transfer dictionary. Transfer systems have separate grammars for SL analysis, for TL analysis and for the transformation of SL structures into equivalent TL forms.
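The three stages of the transfer approach can be sketched as three small functions. This is a toy illustration, not a real system: it assumes a three-word subject-verb-object source sentence and a verb-final target language, and the transfer dictionary entries (`TL_dogs` and so on) are hypothetical placeholders rather than words of any actual language.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """A minimal abstract representation of a simple clause."""
    subject: str
    verb: str
    obj: str


# Hypothetical bilingual transfer dictionary (placeholder target forms).
TRANSFER_DICT = {"dogs": "TL_dogs", "chase": "TL_chase", "cats": "TL_cats"}


def analyse(sl_text):
    """Stage 1: SL text -> SL-oriented representation (assumes S V O input)."""
    subject, verb, obj = sl_text.split()
    return Clause(subject, verb, obj)


def transfer(sl_clause):
    """Stage 2: SL-oriented representation -> TL-oriented representation."""
    return Clause(
        TRANSFER_DICT[sl_clause.subject],
        TRANSFER_DICT[sl_clause.verb],
        TRANSFER_DICT[sl_clause.obj],
    )


def generate(tl_clause):
    """Stage 3: TL-oriented representation -> TL text in verb-final order."""
    return f"{tl_clause.subject} {tl_clause.obj} {tl_clause.verb}"


def translate(sl_text):
    return generate(transfer(analyse(sl_text)))


print(translate("dogs chase cats"))
```

Keeping analysis, transfer and generation separate mirrors the separate grammars and dictionaries described above: only the middle stage needs knowledge of both languages.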

Empirical Machine Translation Approach

The empirical approach is an emerging one that uses large amounts of raw data in the form of parallel corpora. The raw data consists of texts and their translations. Example-based MT, analogy-based MT, memory-based MT and case-based MT are techniques that use the empirical approach. All of them rely on a corpus or database of translated examples.

Statistical machine translation is also corpus-based, but differs slightly in that it depends on statistical modelling of the word order of the target language and of source-target word equivalences. It learns lexical and structural preferences automatically from corpora. Statistical models offer a good solution to the ambiguity problem. They are robust and work well even in the presence of errors and new data. IBM researchers pioneered the first statistical approach to machine translation in the 1980s. The IBM group relied on the source-channel approach, a framework for combining a word-based translation model with a language model. The translation model ensures that the system produces a target hypothesis corresponding to the source sentence. The language model ensures that the output is grammatically correct.
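The source-channel framework is usually written as a single equation. In the standard formulation below, the symbols are conventional notation rather than anything defined in this post: f is the source sentence, e is a candidate target sentence, P(f | e) is the translation model and P(e) is the language model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

The second equality follows from Bayes' rule, since P(f) is the same for every candidate e and can be dropped from the maximisation.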

Related Articles

Machine Translation Process

Challenges In Machine Translation

Rule-based Machine Translation

Example-based Machine Translation

For further study

Machine Translation: AI Methods for Translating from One Language to Another

Machine Translation Book

Linguistics-Overview

What is Linguistics?

Linguistics is the study of human languages. Because it follows a scientific approach, it is also referred to as linguistic science. Linguistics deals with describing and explaining the nature of human languages. It treats language and the ways people use it as phenomena to be studied. A linguist is an expert in linguistics who studies the general principles of language organization and language behavior.

Linguistic analysis is concerned with identifying the structural units and classes of language. Linguists also attempt to describe how smaller units combine to form larger grammatical units: how words combine to form phrases, phrases combine to form clauses, and so on. They are also concerned with what constrains the possible meanings of a sentence. Linguists use intuitions about well-formedness and meaning, along with mathematical models of structure such as formal language theory and model-theoretic semantics.

The structure of language includes morphemes, words, phrases, and grammatical classes.

Sub-fields with respect to linguistic structure are phonetics, phonology, morphology, syntax, semantics, pragmatics, and discourse analysis.

There are many branches of linguistics including applied linguistics, computational linguistics, evolutionary linguistics, neurolinguistics, cognitive linguistics and psycholinguistics.

What is Corpus?
Corpus Linguistics

Natural Language Processing: Overview

Natural Language Processing (NLP) aims to acquire, understand and generate human languages such as English, French, Tamil and Hindi.

Symbolic Approaches to Natural Language Processing

Symbolic approaches, also known as rationalist approaches, hold that a significant part of the knowledge in the human mind is not derived through the senses but is fixed in advance, presumably by genetic inheritance. Noam Chomsky was a strong advocate of this view. It was believed that a machine could be made to function like the human brain by giving it some basic knowledge and reasoning mechanisms. Linguistic knowledge is explicitly encoded in rules or other forms of representation, which helps in the automatic processing of natural languages.

Natural Language Analysis

Analysis proceeds in several stages, namely tokenization, lexical analysis, syntactic analysis, semantic analysis, and pragmatic analysis.

Syntactic analysis determines the order and structure of each sentence in the text. Semantic analysis finds the literal meaning, and pragmatic analysis determines the meaning of the text in context. These major tasks are further broken down into subtasks such as parsing.
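Tokenization, the first of the stages listed above, can be illustrated with a short Python sketch. The regular expression used here is one simple choice among many (runs of word characters, or single punctuation marks); it is an assumption for illustration and ignores cases such as abbreviations or hyphenated words that a practical tokenizer would have to handle.

```python
import re

# Match either a run of word characters or a single non-space,
# non-word character (punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Tamil is a verb-final language."))
```

The later stages (lexical, syntactic, semantic and pragmatic analysis) would each take the output of the previous stage as their input.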

Natural Language Generation

Natural language generation produces fluent and coherent multi-sentence texts from an underlying source of information. Depending on the context, the generated text can range from a single word or phrase answering a question to full-page explanations, and may even be rendered as speech.

Empirical Approaches to Natural Language Processing

Empirical approaches focus on the use of large amounts of data and on procedures involving statistical manipulation. A corpus, a large body of data in a particular format, comes in handy for analysis. Crucial tasks that use these approaches include POS tagging, alignment, collocation extraction and word-sense disambiguation.
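As a small illustration of the empirical style, candidate collocations can be found by counting adjacent word pairs in a corpus. The sketch below uses raw bigram frequency only, which is the simplest possible criterion; the one-line corpus is invented for the example, and real work would use a large corpus and a statistical association measure.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count how often each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


# Tiny invented corpus, for illustration only.
corpus = "machine translation needs data and machine translation needs rules".split()
counts = bigram_counts(corpus)

# Pairs that occur more than once are candidate collocations.
candidates = [pair for pair, n in counts.items() if n > 1]
print(candidates)
```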

Challenges In Natural Language Processing

A perfect natural language processing system has yet to be developed. Many problems remain, such as flexibility in sentence structure and ambiguity.

Natural language processing applications require the availability of Lexical Resources, Corpora and Computational Models.

For Further Study

Foundations of Statistical Natural Language Processing

Handbook of Natural Language Processing, Second Edition (Chapman & Hall/CRC Machine Learning & Pattern Recognition)

Natural Language Understanding (2nd Edition)

Natural Language Processing with Python

Related Articles

Natural Language Understanding

Natural Language Generation

Open Problems

Linguistics: Overview

Tokenization: Overview

Parts Of Speech Tagging

Machine Translation