MAXIMUM ENTROPY – Natural Language Processing

MAXIMUM ENTROPY TAGGING

Maximum entropy tagging aims to find the model with maximum entropy, where maximum entropy means maximum randomness or minimum additional structure. It combines some of the good properties of transformation-based learning and Markov model tagging, and it allows flexibility in the cues used to disambiguate words. The output of a maximum entropy tagger is a set of tags together with their probabilities.
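To make "maximum randomness" concrete, the short sketch below computes Shannon entropy for two made-up distributions over four tags. The function name and the numbers are illustrative assumptions, not part of any particular tagger.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two hypothetical distributions over four tags.
uniform = [0.25, 0.25, 0.25, 0.25]   # no extra structure assumed
skewed = [0.70, 0.10, 0.10, 0.10]    # strongly prefers one tag

print(entropy(uniform))  # 2.0 bits, the maximum for four outcomes
print(entropy(skewed))   # lower than the uniform case
```

With no constraints at all, the uniform distribution is the maximum entropy choice; constraints taken from training data move the model away from it only as far as the data requires.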

The maximum entropy framework finds a single probability model that is consistent with the constraints of the training data and maximally agnostic beyond what the training data indicates. The probability model is defined over a space H × T, where H is the set of environments in which a word appears and T is the set of possible POS tags. A maximum entropy model specifies a set of features from the environment for tag prediction. These features resemble the transformation rules of transformation-based learning.

A typical environment is specified as

h_i = {w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}}

where h stands for environment, w for word, t for tag, and i for index. The environment above is for the i-th word, w_i, whose preceding two words are w_{i-1} and w_{i-2}, whose succeeding two words are w_{i+1} and w_{i+2}, and whose previous two tags are t_{i-1} and t_{i-2}.
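As a rough illustration, the environment for position i could be assembled as below. The function name, the dictionary keys, and the "<PAD>" placeholder for positions outside the sentence are assumptions made for this sketch.

```python
def history(words, tags, i, pad="<PAD>"):
    """Build the environment h_i for the word at index i.

    words: all words of the sentence.
    tags:  tags already assigned to the words before index i.
    """
    def word(k):
        return words[k] if 0 <= k < len(words) else pad

    def tag(k):
        return tags[k] if 0 <= k < len(tags) else pad

    return {
        "w_i": word(i),
        "w_i+1": word(i + 1),
        "w_i+2": word(i + 2),
        "w_i-1": word(i - 1),
        "w_i-2": word(i - 2),
        "t_i-1": tag(i - 1),
        "t_i-2": tag(i - 2),
    }

# Hypothetical sentence with the first three words already tagged.
words = ["the", "dog", "is", "running"]
tags_so_far = ["DT", "NN", "VBZ"]
h = history(words, tags_so_far, 3)
print(h["w_i"], h["w_i-1"], h["t_i-1"], h["w_i+1"])
```

Only the two previous tags are part of the environment because tags to the right of the current word have not been assigned yet.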

Given the environment, a set of binary features can be defined. The following is the j-th feature, which is on or off depending on properties of the environment.

f_j(h_i, t_i) = { 1 if suffix(w_i) = "ing" and t_i = PresPartVerb
                { 0 otherwise

That is, the feature above is on (i.e., 1) if the suffix of the word in question is "ing" and the tag is present participle verb, and off (i.e., 0) otherwise.
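A binary feature of this kind can be written as a small function of the environment and a candidate tag. In the sketch below, the factory name, the dictionary layout of the environment, and the tag label VBG (the Penn Treebank tag for -ing verb forms) are assumptions; any tag from the tag set in use can be substituted.

```python
def make_suffix_feature(suffix, tag):
    """Return a binary feature f(h, t) that fires on one suffix/tag pair."""
    def feature(h, t):
        # h is the environment; only the current word w_i is inspected here.
        return 1 if h["w_i"].endswith(suffix) and t == tag else 0
    return feature

f_j = make_suffix_feature("ing", "VBG")

print(f_j({"w_i": "running"}, "VBG"))  # 1: suffix and tag both match
print(f_j({"w_i": "running"}, "NN"))   # 0: tag differs
print(f_j({"w_i": "dog"}, "VBG"))      # 0: suffix differs
```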

Features are generated from feature templates. For the above feature, the template is

X is a suffix of w_i, |X| < 5, AND t_i = T

where X and T are variables.
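One way to picture template instantiation is to scan tagged training words and emit every (X, T) pair the template allows. The function name and the toy data below are hypothetical, and the sketch allows a short word to count as its own suffix.

```python
def suffix_tag_features(tagged_words, max_len=4):
    """Instantiate the template 'X is a suffix of w_i, |X| < 5, t_i = T'.

    Returns the set of (X, T) pairs seen in the data; each pair stands
    for one binary feature.
    """
    features = set()
    for word, tag in tagged_words:
        # Suffix lengths 1..4 satisfy |X| < 5.
        for n in range(1, min(max_len, len(word)) + 1):
            features.add((word[-n:], tag))
    return features

# Hypothetical tagged training words.
training = [("running", "VBG"), ("dogs", "NNS")]
feats = suffix_tag_features(training)
print(sorted(feats))
```

Each generated pair corresponds to one feature of the form shown earlier, so a single template can yield a large number of features from a training set.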

A set of features and their observed probabilities is extracted from the training set. The Generalized Iterative Scaling (GIS) method is then used to build the maximum entropy model consistent with the observed feature probabilities. This yields the trained model.
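The training step can be sketched as follows. This is a minimal illustration under several assumptions, not a reference implementation: it fits the conditional form p(t | h) rather than a joint model over H × T, it adds a slack feature so that the feature counts sum to a constant C as GIS requires, and it simply skips features never observed in the data. All names and the toy data are hypothetical.

```python
import math

def train_gis(data, tags, features, iterations=100):
    """Fit p(t | h) with Generalized Iterative Scaling.

    data:     list of (h, t) training pairs
    features: list of binary functions f(h, t) -> 0 or 1
    Returns a function prob(h) -> {tag: probability}.
    """
    # C bounds the number of active features on any (h, t) pair.
    C = max(1, max(sum(f(h, t) for f in features)
                   for h, _ in data for t in tags))

    def fvec(h, t):
        v = [f(h, t) for f in features]
        return v + [C - sum(v)]          # slack feature pads the sum to C

    n, N = len(features) + 1, len(data)
    lam = [0.0] * n                      # one weight per feature

    # Observed (empirical) expectation of each feature.
    observed = [0.0] * n
    for h, t in data:
        for j, v in enumerate(fvec(h, t)):
            observed[j] += v / N

    def prob(h):
        scores = {t: math.exp(sum(l * v for l, v in zip(lam, fvec(h, t))))
                  for t in tags}
        z = sum(scores.values())
        return {t: s / z for t, s in scores.items()}

    for _ in range(iterations):
        # Expectation of each feature under the current model.
        expected = [0.0] * n
        for h, _ in data:
            p = prob(h)
            for t in tags:
                for j, v in enumerate(fvec(h, t)):
                    expected[j] += p[t] * v / N
        # GIS update: move each weight toward matching the observed value.
        for j in range(n):
            if observed[j] > 0 and expected[j] > 0:
                lam[j] += math.log(observed[j] / expected[j]) / C
    return prob

# Hypothetical toy data: environments reduced to the current word.
data = [({"w_i": "running"}, "VBG"), ({"w_i": "singing"}, "VBG"),
        ({"w_i": "dog"}, "NN"), ({"w_i": "thing"}, "NN")]
features = [
    lambda h, t: 1 if h["w_i"].endswith("ing") and t == "VBG" else 0,
    lambda h, t: 1 if not h["w_i"].endswith("ing") and t == "NN" else 0,
]
prob = train_gis(data, ["VBG", "NN"], features)
print(prob({"w_i": "jumping"}))
print(prob({"w_i": "dog"}))
```

In the toy data, two of the three "-ing" words are tagged VBG, so the trained model should assign an unseen "-ing" word a VBG probability close to that proportion.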

As in Markov model tagging, the tagger then builds the most probable tag sequence according to the probability model. Beam search is used for this purpose, keeping the n most likely tag sequences up to the word being tagged. Unlike the Markov model approach, there is a great deal of flexibility in which contextual cues can be used.
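The decoding step can be sketched as a simple beam search. The scoring function `toy_prob` below is a hand-written stand-in for a trained model, and its probabilities, the tag names, and the function signatures are all assumptions for illustration.

```python
import math

def beam_search(words, tags, prob, beam_size=3):
    """Return the highest-scoring tag sequence found by beam search.

    prob(words, prev_tags, i) -> {tag: probability} for the word at index i.
    """
    beam = [(0.0, [])]                       # (log probability, tags so far)
    for i in range(len(words)):
        candidates = []
        for logp, seq in beam:
            dist = prob(words, seq, i)
            for t in tags:
                p = dist.get(t, 0.0)
                if p > 0:
                    candidates.append((logp + math.log(p), seq + [t]))
        # Keep only the beam_size most likely partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][1] if beam else []

def toy_prob(words, prev_tags, i):
    """Hypothetical stand-in for a trained maximum entropy model."""
    if words[i] == "the":
        return {"DT": 0.90, "NN": 0.05, "VB": 0.05}
    if prev_tags and prev_tags[-1] == "DT":
        return {"DT": 0.05, "NN": 0.80, "VB": 0.15}
    return {"DT": 0.10, "NN": 0.30, "VB": 0.60}

best = beam_search(["the", "dog", "barks"], ["DT", "NN", "VB"], toy_prob,
                   beam_size=2)
print(best)  # ['DT', 'NN', 'VB']
```

Because the scoring function receives the whole sentence and the tags chosen so far, it can use any of the contextual cues in the environment, which is where the flexibility comes from.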

The maximum entropy method is powerful enough to achieve high accuracy on POS tagging tasks.
