Posts Tagged ‘corpus’
Zipf’s observation on the distribution of words in natural languages is called Zipf’s law. It describes the behaviour of words across an entire corpus and can be regarded as a roughly accurate characterization of certain empirical facts. According to Zipf’s law,
Frequency * rank = constant.
That is, the frequency of a word multiplied by its rank in a large corpus is approximately constant. Here, frequency is the number of times a word occurs in the corpus. If we compute the frequencies of the words in a corpus and arrange them in decreasing order of frequency, then the product of the frequency of a word and its rank (its position in the list) is more or less equal to the product of the frequency and rank of any other word. So the frequency of a word is roughly inversely proportional to its rank.
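The rank-frequency relationship can be checked on any text with a few lines of Python. The following is a minimal sketch, not a reference implementation: the function name `rank_frequency`, the simple lowercase tokenizer, and the toy sample sentence are all assumptions made for illustration. On a real corpus the product in the last column should stay roughly stable over the middle ranks; on a toy sentence it will not.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first.

    Hypothetical helper: tokenization is a plain lowercase letter match,
    which is an assumption and far cruder than a real corpus tokenizer.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    rows = []
    # most_common() sorts by decreasing frequency; rank is the list position.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy sample, far too small to show Zipf's law; it only shows the mechanics.
sample = "the cat and the dog and the bird saw the cat"
for row in rank_frequency(sample)[:3]:
    print(row)
```

Words with equal frequency still receive consecutive ranks in this sketch, so the order among tied words is arbitrary.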
Researchers who have tested Zipf’s law on large corpora have found that a small number of words occur very often, while a large number of words occur with low frequency. In between these two extremes there are medium-frequency words. This distribution affects information retrieval, in which only the medium-frequency words, which tend to be the content-bearing terms, are used for indexing. High- and low-frequency words are dropped: the former have little discriminating power, and the latter are unlikely to be included in a query.
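The idea of keeping only medium-frequency words as index terms can be sketched as a simple filter. This is an illustration under stated assumptions: the function name `medium_frequency_terms` and the two cutoff values are hypothetical, and the source does not say where the cutoffs should lie. In practice they would be tuned to the collection.

```python
from collections import Counter

def medium_frequency_terms(tokens, low_cutoff=2, high_cutoff=5):
    """Keep words whose frequency lies between the two cutoffs (inclusive).

    The cutoff values are placeholders chosen for the toy example below,
    not values taken from the source.
    """
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if low_cutoff <= c <= high_cutoff)

# Toy token list: one very frequent word, two medium ones, one rare one.
tokens = ["the"] * 6 + ["corpus"] * 3 + ["zipf"] * 2 + ["hapax"]
print(medium_frequency_terms(tokens))
```

Here "the" is dropped as too frequent and "hapax" as too rare, leaving the medium-frequency words as candidate index terms.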
According to Zipf’s theory, a speaker’s effort is conserved by having a small vocabulary of common words, while the listener’s effort is minimised by having a large vocabulary of rarer words, which reduces ambiguity. So both speaker and listener attempt to reduce their effort.
This article gives a brief overview of what a corpus is, the types of corpora, their applications, and a short note on the British National Corpus.
What is a Corpus?
A corpus is a large collection of texts. It is a body of written or spoken material upon which a linguistic analysis is based. The plural form of corpus is corpora. Some popular corpora are the British National Corpus (BNC), the COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus. Monolingual corpora represent only one language, while bilingual corpora represent two languages. The European Corpus Initiative (ECI) corpus is multilingual, with 98 million words in Turkish, Japanese, Russian, Chinese, and other languages. A corpus may be composed of written language, spoken language, or both. A spoken corpus is usually in the form of audio recordings. A corpus may be open or closed. An open corpus does not claim to contain all data from a specific area, while a closed corpus does claim to contain all or nearly all data from a particular field. Historical corpora, for example, are closed, as there can be no further input to the area.
What is the use of a Corpus?
A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving either all the occurrences of a particular word or structure for inspection, or a randomly selected sample of them. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
Linguistic information is provided by concordance and frequency counts.
What is a Concordance?
Concordances are listings of the occurrences of a particular feature or combination of features in a corpus. Each occurrence found (or hit) is displayed with a certain amount of context: the text preceding and following it. The most commonly used concordance type is KWIC, which stands for Key Word In Context. It shows one hit per line of screen or print-out, with the principal search feature (or focus) highlighted in the centre. A concordance is used to determine the syntactic context in which a form is embedded. Concordances can be generated with Corpus Presenter and Corpus Presenter Flash, programs that allow one to retrieve the contexts in which a word occurs.
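A KWIC listing of the kind described above can be sketched in a few lines of Python. This is only an illustration and does not show how Corpus Presenter works: the function name `kwic`, the `width` parameter (number of context words on each side), and the square brackets used to mark the focus are all assumptions.

```python
def kwic(tokens, keyword, width=3):
    """Return one formatted line per hit, with the keyword in the centre.

    Hypothetical helper: `width` is the number of context words kept on
    each side, and matching is case-insensitive on whole tokens.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines

text = "a corpus is a large collection of texts and a corpus may be open"
hits = kwic(text.split(), "corpus", width=2)
for line in hits:
    print(line)
print("hits:", len(hits))
```

The number of lines returned is also the frequency count of the keyword.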
A frequency count is the number of hits. Frequency counts require finding all the occurrences of a particular feature in the corpus, so they are implicit in concordancing. Software is used for this purpose. Frequency counts can then be interpreted statistically.
British National Corpus
The British National Corpus (BNC) is a sample collection intended to represent the universe of contemporary British English. The BNC is a balanced corpus in the sense that it attempts to capture the full range of varieties of language use. It is also a mixed corpus, containing both written and spoken texts. The spoken texts are transcriptions of naturally occurring speech. The BNC is estimated to contain 100 million words. Ninety percent of the BNC is made up of written texts.
Applications of Corpus
Corpora are used in the development of NLP tools. Applications include spell-checking, grammar-checking, speech recognition, text-to-speech and speech-to-text conversion, automatic abstracting and indexing, information retrieval, and machine translation. Corpora are also used to create new dictionaries and grammars for learners.