What is collocation?
Collocation, or lexical collocation, means that two or more words co-occur in a sentence more frequently than would be expected by chance. A collocation is an expression that carries a specific meaning. It may be a noun phrase like large house, a verbal phrase like pick up, an idiom, a cliche or a technical term. Collocations are characterized by limited compositionality, that is, it is difficult to predict the meaning of a collocation from the meanings of its parts. For example,
He is known for his fair and square dealings and everybody trusts his work.
Here fair and square means honest. If we take the words individually, fair comes close to that meaning because it means just, but square is confusing. So instead of interpreting the individual words, one should take the collocation fair and square as a unit and find its meaning. This shows that collocations play a key role in understanding sentences. Collocations are recursive, so a collocational phrase may contain more than two words.
Types of Collocations
Grammatical Collocations: These contain prepositions and include paired syntactic categories such as verb+preposition (e.g. come to, put on), adjective+preposition (e.g. afraid that, fond of), and noun+preposition (e.g. by accident, witness to). The open-class word is called the base, and it determines the words it can collocate with, which are called the collocators.
Semantic Collocations: These are lexically restricted word pairs, for which only a subset of the synonyms of the collocator can be used in the same lexical context.
Collocations are also categorized into compounds and flexible word pairs.
Compounds: Compounds include word pairs that occur consecutively in language and are typically immutable in function. For example, noun+noun pairs not only occur consecutively but also function as a constituent. Compounds form a bridge between collocations and idioms, as they are quite invariable but need not be semantically opaque.
Flexible Word Pairs: Flexible word pairs include collocations between subject and verb, or verb and object. Any number of intervening words may occur between the words of the collocation.
Approaches to finding collocations
There are many approaches to finding collocations in a text corpus. The important ones are:
1. Frequency
2. Mean and Variance
3. Hypothesis Testing
4. Mutual Information
1. Frequency
Frequency is the simplest method for finding collocations in a text corpus. This approach assumes that if two words occur together many times in a text, then they could form a collocation. But just selecting the most frequently occurring bigrams (sequences of two adjacent words) does not always yield good results. The following are a few bigrams that resulted from an experiment.
In the above list of bigrams, only New York is a collocation. We can improve these results with a heuristic: pass the candidate phrases through a part-of-speech filter that only lets through those patterns that are likely to be phrases. Justeson and Katz introduced a part-of-speech filter that uses patterns such as adjective-noun and noun-noun to decide which candidates to let through. The following table shows some of the results from an experiment that employs the Justeson and Katz part-of-speech filter.
Here A stands for adjective, N for noun and P for preposition. Note that the bigrams last week and last year cannot be regarded as non-compositional phrases. The frequency-based method works well only for fixed phrases.
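The frequency-plus-filter idea can be sketched in a few lines of Python. This is only an illustration, not the Justeson and Katz implementation: the tag set (A, N, P, and O for everything else), the hand-tagged token list, and the names ALLOWED_PATTERNS and filtered_bigram_counts are all assumptions made for the example, and only the two-word patterns named in the text are allowed through.

```python
from collections import Counter

# Tag patterns allowed through the filter (A = adjective, N = noun).
# Only the two-word patterns mentioned in the text are listed; this set
# is an assumption for illustration, not the full published filter.
ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}


def filtered_bigram_counts(tagged_tokens):
    """Count adjacent word pairs whose tag pair is in ALLOWED_PATTERNS."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in ALLOWED_PATTERNS:
            counts[(w1.lower(), w2.lower())] += 1
    return counts


# Hypothetical hand-tagged input; "O" marks any other part of speech.
tagged = [
    ("He", "O"), ("visited", "O"), ("New", "A"), ("York", "N"),
    ("last", "A"), ("year", "N"), ("and", "O"), ("in", "P"),
    ("the", "O"), ("summer", "N"), ("he", "O"), ("saw", "O"),
    ("New", "A"), ("York", "N"), ("again", "O"),
]

counts = filtered_bigram_counts(tagged)
for bigram, count in counts.most_common():
    print(bigram, count)
```

Function-word pairs such as "in the" never reach the counter, which is the point of the filter. A pair like last year still passes, so the filter reduces noise but does not by itself test for non-compositionality.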
2. Mean and Variance
Collocations are not always fixed phrases. Consider the following sentences.
Obama cemented relations with the East.
Jill cemented his relations with Jack.
John cemented Mark Taylor’s relations with Tom.
While cement and relations occur together frequently, relations does not always immediately follow cement. In the above sentences, the two words are adjacent in the first sentence, there is one word (his) in between in the second sentence, and there are two words in between in the third sentence. The intervening words may vary, and so may the distance between the two words. But there is enough regularity in the patterns that we can determine that cement is the right verb to use in this situation.
Collocational windows are used to capture bigrams at a specific distance. Let us use a three word collocational window for the following sentence.
Plane crashed as climate worsened.
We get the following bigrams.
plane crashed, plane as, plane climate
crashed as, crashed climate, crashed worsened
as climate, as worsened
climate worsened
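A minimal sketch of the windowing step, assuming that each word is paired with up to three words that follow it and that tokens are obtained by splitting on whitespace. The function name window_bigrams is made up for this example.

```python
def window_bigrams(tokens, window=3):
    """Pair every token with each of the next `window` tokens."""
    pairs = []
    for i, left in enumerate(tokens):
        # Slicing past the end of the list is safe: it just yields fewer items.
        for right in tokens[i + 1 : i + 1 + window]:
            pairs.append((left, right))
    return pairs


tokens = "plane crashed as climate worsened".split()
pairs = window_bigrams(tokens)
for left, right in pairs:
    print(left, right)
```

Words near the end of the sentence have fewer followers, so they contribute fewer pairs. A pair such as plane worsened is not produced because the two words are four positions apart.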
The mean and variance based methods look at the pattern of varying distance between two words. If that pattern of distance is relatively predictable then we have evidence for a collocation of variable phrases like cement … relations.
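As a rough illustration, the offsets between cemented and relations in the three example sentences can be collected and summarized with Python's statistics module. The whitespace tokenization and the helper name pair_offsets are assumptions made for this sketch; a real system would use a proper tokenizer and a much larger corpus.

```python
from statistics import mean, variance


def pair_offsets(sentences, first, second):
    """Signed distance (in tokens) from `first` to `second` in each sentence."""
    result = []
    for sentence in sentences:
        # Naive tokenization: lowercase, drop the final period, split on spaces.
        tokens = sentence.lower().rstrip(".").split()
        if first in tokens and second in tokens:
            result.append(tokens.index(second) - tokens.index(first))
    return result


sentences = [
    "Obama cemented relations with the East.",
    "Jill cemented his relations with Jack.",
    "John cemented Mark Taylor's relations with Tom.",
]

offsets = pair_offsets(sentences, "cemented", "relations")
print("offsets:", offsets)
print("mean:", mean(offsets))
print("sample variance:", variance(offsets))
```

A low variance means the second word tends to appear at a predictable distance from the first, which is the evidence for a collocation that this method looks for. Three sentences are far too few to draw a conclusion, so the numbers here only show the mechanics.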
3. Hypothesis Testing
Hypothesis testing is used to check whether two words co-occur more often than chance would predict. We formulate a null hypothesis H0 that there is no association between the words beyond chance occurrences, compute the probability p that the observed event would occur if H0 were true, and then reject H0 if p is too low and retain H0 as possible otherwise.
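The text does not name a particular test, so the sketch below uses the t-test as one possible choice. It assumes that each bigram position in the corpus is a Bernoulli trial that succeeds when the pair occurs, and that under H0 the two words are independent, so the expected rate is P(w1) * P(w2). The function name t_score and all counts are invented for illustration.

```python
import math


def t_score(c1, c2, c12, n):
    """t statistic for a bigram seen c12 times in a corpus of n bigrams.

    c1 and c2 are the counts of the individual words.
    """
    if c12 <= 0:
        raise ValueError("the bigram must occur at least once")
    x_bar = c12 / n              # observed rate of the bigram
    mu = (c1 / n) * (c2 / n)     # rate expected under independence (H0)
    s2 = x_bar * (1 - x_bar)     # variance of a Bernoulli trial
    return (x_bar - mu) / math.sqrt(s2 / n)


# Hypothetical counts, not taken from any real corpus.
t_assoc = t_score(c1=1000, c2=500, c12=20, n=1_000_000)
t_chance = t_score(c1=1000, c2=1000, c12=1, n=1_000_000)

CRITICAL = 2.576  # one-sided critical value at a significance level of 0.005
print("associated pair:", round(t_assoc, 3), "reject H0:", t_assoc > CRITICAL)
print("chance-level pair:", round(t_chance, 3), "reject H0:", t_chance > CRITICAL)
```

In the second case the pair occurs exactly as often as independence predicts, so the statistic is close to zero and H0 is retained.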
4. Mutual Information
By the chain rule for entropy,
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
Rearranging the two right-hand sides gives
H(X) - H(X|Y) = H(Y) - H(Y|X)
where H represents entropy and X and Y are random variables.
This difference is called the mutual information between X and Y. It is the amount of information one random variable contains about another. Mutual information is a symmetric, non-negative measure of the common information in the two variables.
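The identity can be checked numerically on a small joint distribution. The sketch below computes both sides independently, H(X) - H(X|Y) and H(Y) - H(Y|X), in bits. The joint table and the function names are assumptions chosen for the example.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def conditional_entropy(joint_rows):
    """H(column variable | row variable) for joint_rows[r][c] = P(row=r, col=c)."""
    total = 0.0
    for row in joint_rows:
        p_row = sum(row)
        if p_row > 0:
            total += p_row * entropy([p / p_row for p in row])
    return total


def mutual_information_both_ways(joint):
    """Return (H(X) - H(X|Y), H(Y) - H(Y|X)) for joint[x][y] = P(X=x, Y=y)."""
    transposed = [list(col) for col in zip(*joint)]
    p_x = [sum(row) for row in joint]
    p_y = [sum(row) for row in transposed]
    via_x = entropy(p_x) - conditional_entropy(transposed)  # H(X) - H(X|Y)
    via_y = entropy(p_y) - conditional_entropy(joint)       # H(Y) - H(Y|X)
    return via_x, via_y


# Hypothetical joint distributions over two binary variables.
dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]

mi_x, mi_y = mutual_information_both_ways(dependent)
zero_x, zero_y = mutual_information_both_ways(independent)
print("dependent:", mi_x, mi_y)
print("independent:", zero_x, zero_y)
```

The two values agree for each table, which illustrates the symmetry, and the independent table gives a value of zero because neither variable tells us anything about the other.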
Church and Hanks used mutual information (MI) to evaluate the correlation between a pair of words. They took a window size of five words for co-occurrence and conducted experiments on a corpus. They were able to extract interesting pairs of related words such as doctors and nurses, and doctors and treating. They also found many two-word verbs such as set up and set off. Note that here the verbs precede the prepositions.
Calzolari and Bindi used mutual information to extract lexical information from an Italian corpus. They used a measure, dispersion, to show how the second word is distributed within the window under consideration. They were able to acquire the following patterns from the corpus: proper nouns, titles, compounds, technical terms, idioms, modification relations between adjectives and nouns, verbs and their associated prepositions to form two-word verbs, nouns and their support verbs, and so on.
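For a specific pair of words, the score is usually computed pointwise as log2 of P(x, y) / (P(x) P(y)), with the probabilities estimated from corpus counts. The sketch below assumes that form and simple relative-frequency estimates. The function name pmi and the counts are hypothetical, and counting co-occurrences within the five-word window is left out.

```python
import math


def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information in bits from raw counts.

    c_xy: number of times x and y co-occur, c_x and c_y: counts of each
    word, n: corpus size used to turn counts into probabilities.
    """
    if c_xy == 0:
        return float("-inf")  # the pair was never observed together
    p_xy = c_xy / n
    p_x = c_x / n
    p_y = c_y / n
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts, not taken from any real corpus.
strong = pmi(c_xy=20, c_x=1000, c_y=500, n=1_000_000)
chance = pmi(c_xy=1, c_x=1000, c_y=1000, n=1_000_000)
print("strongly associated pair:", round(strong, 3))
print("chance-level pair:", round(chance, 3))
```

Unlike the mutual information between two random variables, the pointwise score for a single pair can be negative, which happens when the pair co-occurs less often than independence would predict. Scores based on very small counts are unreliable, so a minimum-count threshold is a common precaution.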