ZIPF’S LAW

Zipf’s observation about the distribution of words in natural languages is known as Zipf’s law. It describes the behaviour of words across an entire corpus and can be regarded as a roughly accurate characterization of certain empirical facts. According to Zipf’s law,

Frequency × rank ≈ constant.

That is, the frequency of a word multiplied by its rank in a large corpus is approximately constant. Here, frequency is the number of times a word occurs in the corpus. If we compute the frequencies of the words in a corpus and arrange them in decreasing order of frequency, then the product of a word’s frequency and its rank (its position in the list) is more or less equal to the product of the frequency and rank of any other word. In other words, the frequency of a word is inversely proportional to its rank.
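To illustrate (a minimal sketch in Python; the toy text below is a made-up example, not drawn from any real corpus), we can count word frequencies, arrange them in decreasing order, and inspect the product of frequency and rank:

    from collections import Counter

    # A made-up toy text; Zipf's law only shows up clearly on a large corpus.
    text = "the cat sat on the mat and the dog sat on the rug the cat and the dog ran to the mat"

    # frequency = number of times a word occurs in the text
    frequencies = Counter(text.split())

    # Arrange words in decreasing order of frequency; rank 1 is the most frequent word.
    for rank, (word, freq) in enumerate(frequencies.most_common(), start=1):
        # Under Zipf's law, freq * rank should be roughly constant.
        print(f"{word:>5}  frequency={freq}  rank={rank}  product={freq * rank}")

On a genuinely large corpus the products stay roughly constant across ranks; on a toy text like this the agreement is only approximate.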

Researchers have tested Zipf’s law experimentally on large corpora. They found that a small number of words occur very frequently, while a large number of words occur with low frequency; between these two extremes lie the medium-frequency words. This distribution has implications for information retrieval, where only the medium-frequency words, which tend to be content-bearing terms, are used for indexing. High- and low-frequency words are dropped: the former because they have little discriminating power, and the latter because they are unlikely to appear in a query.
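As a rough illustration of such frequency-based pruning (a sketch only; the cut-off values are arbitrary assumptions and would be tuned empirically in a real system), one can keep only those words whose corpus frequency falls between a lower and an upper threshold:

    from collections import Counter

    def select_index_terms(tokens, low_cutoff=2, high_cutoff=4):
        """Keep medium-frequency words as candidate index terms.

        Words with frequency >= high_cutoff are dropped as having little
        discriminating power; words with frequency < low_cutoff are dropped
        as too rare to be useful. The cut-offs here are illustrative only.
        """
        frequencies = Counter(tokens)
        return {word for word, freq in frequencies.items()
                if low_cutoff <= freq < high_cutoff}

    # Hypothetical usage on a small tokenised text:
    tokens = "the cat sat on the mat the cat ran the dog ran".split()
    print(select_index_terms(tokens))   # prints {'cat', 'ran'} (set order may vary)

Frequent function words such as “the” fall above the upper cut-off, singletons fall below the lower one, and the medium-frequency content words remain as index terms.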

According to Zipf’s theory, a speaker’s effort is conserved by using a small vocabulary of common words, while a listener’s effort is minimised by having a large vocabulary of rarer words, which reduces ambiguity. Both speaker and listener thus attempt to minimise their effort, and the distribution described by Zipf’s law can be seen as a compromise between these two opposing pressures.