# Maximum Entropy Tagging

Maximum entropy tagging aims to find a model with maximum entropy, where maximum entropy means maximum randomness, or minimum additional structure beyond what the data requires. It combines some of the strengths of transformation-based learning and Markov model tagging, and allows flexibility in the cues used to disambiguate words. The output of a maximum entropy tagger is a set of tags together with their probabilities.

The maximum entropy framework finds the single probability model that is consistent with the constraints of the training data and maximally agnostic beyond what the training data indicates. The probability model is defined over a space H × T, where H is the set of environments (histories) in which a word appears and T is the set of possible POS tags. A maximum entropy model specifies a set of features of the environment for tag prediction; these features are reminiscent of the transformation rules in transformation-based learning.

A typical environment is specified as

h_{i} = {w_{i}, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}}

where h stands for environment, w for word, t for tag, and i for index. The equation above is for the i-th word w_{i}, whose preceding two words are w_{i-1} and w_{i-2}, whose succeeding two words are w_{i+1} and w_{i+2}, and whose previous two tags are t_{i-1} and t_{i-2}.
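The environment above can be sketched in code. The function name, the dictionary keys, and the `<PAD>` placeholder for positions outside the sentence are illustrative assumptions, not part of the original text:

```python
def environment(words, tags, i, pad="<PAD>"):
    """Build h_i = {w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}}.

    `tags` holds only the tags assigned so far (left context);
    out-of-range positions are filled with a padding token.
    """
    def w(k):
        return words[k] if 0 <= k < len(words) else pad

    def t(k):
        return tags[k] if 0 <= k < len(tags) else pad

    return {
        "w_i": w(i), "w_i+1": w(i + 1), "w_i+2": w(i + 2),
        "w_i-1": w(i - 1), "w_i-2": w(i - 2),
        "t_i-1": t(i - 1), "t_i-2": t(i - 2),
    }

words = ["the", "cat", "was", "purring", "loudly"]
tags = ["DT", "NN", "VBD"]            # tags assigned to the first three words
h = environment(words, tags, 3)       # environment for "purring"
```

Note that only the two previous tags appear in the environment: tags to the right are not yet known at tagging time.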

Given the environment, a set of binary features can be defined. The following is the j-th feature, which is on or off depending on properties of the environment:

f_{j}(h_{i}, t_{i}) = 1 if suffix(w_{i}) = *ing* and t_{i} = PastPartVerb, and 0 otherwise

That is, the feature above is on (i.e., 1) if the suffix of the word in question is *ing* and the tag is past-participle verb, and off (i.e., 0) otherwise.
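The binary feature above can be written directly as a function. The tag label `PastPartVerb` follows the text (the Penn Treebank equivalent would be `VBN`); the environment representation is the same illustrative dictionary assumed earlier:

```python
def f_j(h, t):
    """1 if suffix(w_i) = "ing" and t_i = PastPartVerb, else 0."""
    return 1 if h["w_i"].endswith("ing") and t == "PastPartVerb" else 0

h = {"w_i": "purring"}
f_j(h, "PastPartVerb")   # -> 1 (suffix and tag both match)
f_j(h, "Noun")           # -> 0 (tag does not match)
```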

Features are generated from feature templates. For the above feature, the template is

X is a suffix of w_{i}, |X| < 5 AND t_{i} = T

where X and T are variables.
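Instantiating the template means binding X and T to values seen in the training data: each observed (word, tag) pair yields one candidate feature for every suffix of length less than 5. A minimal sketch, with an assumed function name:

```python
def suffix_features(word, tag, max_len=4):
    """All (suffix, tag) instantiations of the template for one observation.

    Binds X to each suffix of `word` with |X| < 5 (i.e. length <= max_len)
    and T to the observed tag.
    """
    return {(word[-k:], tag) for k in range(1, min(max_len, len(word)) + 1)}

feats = suffix_features("purring", "PastPartVerb")
# yields ("g", ...), ("ng", ...), ("ing", ...), ("ring", ...)
```

In practice only features that occur frequently enough in the training set are kept, to avoid unreliable constraints.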

A set of features and their observed probabilities is extracted from the training set. The Generalized Iterative Scaling (GIS) method is then used to create the maximum entropy model consistent with the observed feature probabilities, yielding the trained model.
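The GIS update can be sketched on a toy scale. This is a simplified version (it omits the usual correction feature that makes the feature sum exactly constant, so it is only approximate in general); the function names and data layout are assumptions:

```python
import math

def gis(events, tags, feature_fns, iters=50):
    """Toy Generalized Iterative Scaling.

    events:      list of observed (environment, tag) pairs
    tags:        the tag set T
    feature_fns: list of binary feature functions f_j(h, t)
    Returns one weight lambda_j per feature.
    """
    lam = [0.0] * len(feature_fns)
    n = len(events)
    # C: maximum total feature count over any (h, t) pair
    C = max(sum(f(h, t) for f in feature_fns) for h, _ in events for t in tags)
    # observed feature expectations (empirical averages)
    obs = [sum(f(h, t) for h, t in events) / n for f in feature_fns]
    for _ in range(iters):
        # model expectations under the current weights
        exp = [0.0] * len(feature_fns)
        for h, _ in events:
            scores = {t: math.exp(sum(l * f(h, t)
                                      for l, f in zip(lam, feature_fns)))
                      for t in tags}
            z = sum(scores.values())
            for j, f in enumerate(feature_fns):
                exp[j] += sum(scores[t] / z * f(h, t) for t in tags) / n
        # scale each weight toward the observed expectation
        for j in range(len(lam)):
            if obs[j] > 0 and exp[j] > 0:
                lam[j] += math.log(obs[j] / exp[j]) / C
    return lam
```

Each iteration nudges every weight so that the model's expected feature counts move toward the counts observed in training; at convergence the two agree, which is exactly the maximum entropy solution under those constraints.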

As in Markov model tagging, the most probable tag sequence according to the probability model is constructed. Beam search is used for this purpose, keeping the n most likely tag sequences up to the word being tagged. Unlike the Markov model approach, there is a great deal of flexibility in the contextual cues that can be used.

The maximum entropy method achieves accuracy on POS tagging tasks comparable to that of other state-of-the-art statistical taggers.