Rule-based Parts-Of-Speech Tagging
Rule-based part-of-speech tagging is the oldest approach to tagging and relies on hand-written rules. Rule-based taggers depend on a dictionary or lexicon to get the possible tags for each word to be tagged. Hand-written rules are then used to identify the correct tag when a word has more than one possible tag. Disambiguation is done by analysing the linguistic features of the word, its preceding word, its following word and other aspects of the context. For example, a rule may state that if the preceding word is an article, then the word in question must be a noun. This information is coded in the form of rules.
The rules may be written as context-pattern rules or as regular expressions compiled into finite-state automata that are intersected with lexically ambiguous sentence representations. TAGGIT, the first large rule-based tagger, used context-pattern rules. TAGGIT used a set of 71 tags and 3300 disambiguation rules. These rules disambiguated 77% of the words in the million-word Brown University corpus.
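The lexicon lookup and context-pattern disambiguation described above can be sketched as follows. This is a minimal illustration, not TAGGIT itself: the toy lexicon, the tag names, the pronoun rule and the choice of noun as the default tag for unknown words are all assumptions made for the example.

```python
# Hypothetical toy lexicon: each word maps to its possible tags.
LEXICON = {
    "the": ["ART"],
    "a": ["ART"],
    "i": ["PRON"],
    "can": ["MODAL", "N", "V"],
    "book": ["N", "V"],
    "rusts": ["V"],
    "flight": ["N"],
}


def tag(words):
    """Tag each word, using hand-written context rules for ambiguous words."""
    tags = []
    for i, word in enumerate(words):
        # Assumption: words missing from the lexicon default to noun.
        candidates = LEXICON.get(word.lower(), ["N"])
        choice = candidates[0]
        if len(candidates) > 1:
            prev = tags[i - 1] if i > 0 else None
            # Rule from the text: after an article, choose the noun reading.
            if prev == "ART" and "N" in candidates:
                choice = "N"
            # Illustrative extra rule: after a pronoun, choose the verb reading.
            elif prev == "PRON" and "V" in candidates:
                choice = "V"
        tags.append(choice)
    return list(zip(words, tags))


print(tag(["the", "can", "rusts"]))
# [('the', 'ART'), ('can', 'N'), ('rusts', 'V')]
```

Only the preceding tag is inspected here. A real rule set would also look at the following word and at other features, as the text notes.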
What is a rule based machine translation system?
A rule-based machine translation system consists of a collection of rules called grammar rules, a lexicon, and software programs to process the rules. It is extensible and maintainable. The rule-based approach was the first strategy developed in the field of machine translation. Rules are written using linguistic knowledge gathered from linguists. Rules play a major role in the various stages of translation: syntactic processing, semantic interpretation, and contextual processing of language.
Structure of rule based machine translation system
A tree structure is used to represent the structure of a sentence. A typical English sentence consists of two major parts: a noun phrase (NP) and a verb phrase (VP). These two parts can be further divided according to the structure of the sentence. ‘Rewrite rules’ describe which tree structures are allowable for a given sentence. Only a sentence with the right structure can lead to a correct translation. The following rules represent a simple grammar.
S -> NP VP
VP -> V NP
NP -> Name
NP -> ART N
Here S stands for sentence, V for verb, N for noun and ART for article. A grammar can derive a sentence if there is a sequence of rules that rewrites the start symbol, S, into that sentence.
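Whether this grammar can derive a sentence can be checked with a small top-down recogniser. The sketch below assumes the input has already been reduced to a sequence of word categories (Name, ART, N, V). The function names `parse` and `derives` are illustrative.

```python
# The four rewrite rules from the text, keyed by left-hand side.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["Name"], ["ART", "N"]],
}


def parse(symbol, tags, start):
    """Yield every end position reachable by deriving `symbol` from tags[start:]."""
    if symbol not in GRAMMAR:
        # Terminal category: it must match the tag at the current position.
        if start < len(tags) and tags[start] == symbol:
            yield start + 1
        return
    for rhs in GRAMMAR[symbol]:
        positions = [start]
        for part in rhs:
            # Extend every partial derivation by the next right-hand-side symbol.
            positions = [end for p in positions for end in parse(part, tags, p)]
        yield from positions


def derives(tags):
    """True if the start symbol S can be rewritten into the whole tag sequence."""
    return len(tags) in parse("S", tags, 0)


print(derives(["Name", "V", "ART", "N"]))  # True
print(derives(["Name", "V"]))              # False
```

This simple recursion terminates because the grammar has no left-recursive rules. A grammar with a rule such as NP -> NP PP would need a different parsing strategy.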
Logical form is commonly used in semantic interpretation. For example, the sentence ‘Joe was happy’ can be written in logical form as:
(<PAST HAPPY> (NAME j1 "Joe"))
where PAST stands for past tense. Semantic interpretation is a compositional process: the interpretation of a phrase can be built incrementally from the interpretations of its subphrases. The lexicon plays a major role in semantic interpretation, since it supplies the meanings of individual words. Grammar rules are then used to compute the logical form of the given sentence. Consider the grammar rule given below.
(S SEM (?semvp ?semnp)) -> (NP SEM ?semnp) (VP SEM ?semvp)
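where SEM stands for the semantic feature. This rule says that a sentence consists of a noun phrase followed by a verb phrase, and that the semantic interpretation of the sentence is formed by applying the interpretation of the verb phrase (?semvp) to the interpretation of the noun phrase (?semnp).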
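The compositional step encoded by this rule can be sketched in a few lines. The representation below (nested tuples printed in the bracketed notation used above) and the helper names `np_sem`, `vp_sem`, `s_sem` and `show` are assumptions made for illustration only.

```python
def np_sem(name, constant):
    """Interpretation of a proper-name noun phrase, e.g. (NAME j1 "Joe")."""
    return ("NAME", constant, f'"{name}"')


def vp_sem(tense, predicate):
    """Interpretation of a simple verb phrase, e.g. <PAST HAPPY>."""
    return f"<{tense} {predicate}>"


def s_sem(semnp, semvp):
    """(S SEM (?semvp ?semnp)): the VP meaning applied to the NP meaning."""
    return (semvp, semnp)


def show(form):
    """Print a nested tuple in the bracketed logical-form notation."""
    if isinstance(form, tuple):
        return "(" + " ".join(show(part) for part in form) + ")"
    return form


joe = np_sem("Joe", "j1")
was_happy = vp_sem("PAST", "HAPPY")
print(show(s_sem(joe, was_happy)))
# (<PAST HAPPY> (NAME j1 "Joe"))
```

In a fuller system the subphrase interpretations would come from the lexicon rather than being built by hand.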
How does translation happen in rule based machine translation system?
Translation in a rule-based machine translation system is done by pattern matching against the rules. Success depends on avoiding attempts to match unfruitful rules. Knowledge and reasoning are used for language understanding. General world knowledge is required for solving interpretation problems such as disambiguation. Context-specific knowledge can be used to determine the referents of noun phrases and to disambiguate word senses based on what makes sense in the current situation. A knowledge representation consists of a knowledge base and inference techniques. Inference techniques apply inference rules to derive new sentences from the knowledge base.
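One way to picture how inference rules derive new sentences from a knowledge base is forward chaining. The text does not name a specific inference technique, so this choice, the string-based facts and the word-sense example are all assumptions made for this sketch.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no new sentence can be derived.

    Each rule is a (premises, conclusion) pair; the conclusion is added
    to the knowledge base once every premise is already known.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Hypothetical knowledge used to pick a sense for the word "bank".
facts = {"mentions(river)"}
rules = [
    (["mentions(river)"], "topic(geography)"),
    (["topic(geography)"], "sense(bank, riverside)"),
]
print("sense(bank, riverside)" in forward_chain(facts, rules))  # True
```

The loop checks every rule on every pass, which is fine for a handful of rules. A practical system would index rules by their premises so that unfruitful rules are not matched at all.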
Anaphora is the linguistic phenomenon of pointing back to a previously mentioned item in the text. The word or phrase that points back is called the anaphor, and the entity to which it refers or for which it stands is its antecedent. For example,
Joe is not yet here but he is expected to arrive in the next one hour.
Here ‘he’ is the anaphor and ‘Joe’ is the antecedent. When the anaphor refers to an antecedent and both have the same referent in the real world, they are termed coreferential. Coreference is the act of referring to the same referent in the real world. The process of determining the antecedent of an anaphor is called anaphora resolution. The rules used for resolution are called resolution rules, and they draw on different sources of knowledge. Correct interpretation of anaphora is crucial for the successful operation of a machine translation system.
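A single resolution rule can be sketched as follows: link a pronoun to the most recent earlier mention that agrees with it in gender and number. The feature tables and the function name `resolve` are hypothetical, and real resolution rules combine many more sources of knowledge.

```python
# Hypothetical feature tables: (gender, number) for pronouns and known names.
PRONOUNS = {"he": ("masc", "sg"), "she": ("fem", "sg")}
ENTITIES = {"Joe": ("masc", "sg"), "Mary": ("fem", "sg")}


def resolve(tokens):
    """Map each pronoun's position to the position of its chosen antecedent."""
    links = {}
    mentions = []  # (position, word) of candidate antecedents seen so far
    for i, token in enumerate(tokens):
        if token in ENTITIES:
            mentions.append((i, token))
        elif token.lower() in PRONOUNS:
            features = PRONOUNS[token.lower()]
            # Resolution rule: prefer the most recent mention that agrees.
            for j, candidate in reversed(mentions):
                if ENTITIES[candidate] == features:
                    links[i] = j
                    break
    return links


sentence = "Joe is not yet here but he is expected to arrive in the next one hour"
print(resolve(sentence.split()))
# {6: 0}
```

The result says that the pronoun at position 6 (‘he’) points back to the mention at position 0 (‘Joe’).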
Advantage of rule based machine translation approach
The advantage of the rule-based machine translation approach is that it can analyze text deeply at both the syntactic and semantic levels. Its drawbacks are that it requires a huge amount of linguistic knowledge and a very large number of rules to cover all the features of a language.