This article gives a brief overview of what a corpus is, its types and applications, and a short note on the British National Corpus.
What is a Corpus?
A corpus is a large collection of texts: a body of written or spoken material on which a linguistic analysis is based. The plural form of corpus is corpora. Some popular corpora are the British National Corpus (BNC), the COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus. Monolingual corpora represent only one language, while bilingual corpora represent two languages. The European Corpus Initiative (ECI) corpus is multilingual, with 98 million words in Turkish, Japanese, Russian, Chinese, and other languages. A corpus may be composed of written language, spoken language, or both. A spoken corpus is usually in the form of audio recordings. A corpus may also be open or closed. An open corpus does not claim to contain all the data from a specific area, while a closed corpus does claim to contain all or nearly all the data from a particular field. Historical corpora, for example, are closed because no further material can be added to the area they cover.
What is the use of a Corpus?
A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving either all the occurrences of a particular word or structure for inspection, or randomly selected samples of them. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
This linguistic information is obtained through concordances and frequency counts.
What is a Concordance?
Concordances are listings of the occurrences of a particular feature or combination of features in a corpus. Each occurrence found (or hit) is displayed with a certain amount of context: the text preceding and following it. The most commonly used concordance type is KWIC, which stands for Key Word In Context. It shows one hit per line of screen or print-out, with the principal search feature (or focus) highlighted in the centre. A concordance is used to determine the syntactic context in which a form is embedded. Concordances can be generated with Corpus Presenter and with Corpus Presenter Flash, programs that allow one to retrieve the contexts in which a word occurs.
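The KWIC layout described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how Corpus Presenter or any other concordancer works internally: the function name kwic, the width parameter, the square brackets used to mark the focus, and the sample sentence are all assumptions made for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit of `keyword`, with up to `width`
    characters of context on each side (hypothetical helper)."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context and left-align the right context
        # so that the focus lines up in the centre of every line.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Made-up sample text standing in for a corpus.
sample = "The cat sat on the mat. A cat is a small animal. The dog chased the cat."

for line in kwic(sample, "cat", width=20):
    print(line)
```

Each printed line holds one hit with the focus in a fixed column, so the words to its left and right can be compared at a glance. A real concordancer would typically work on tokenised text and offer sorting by left or right context.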
A frequency count is the number of hits. Frequency counts require finding all the occurrences of a particular feature in the corpus, so they are implicit in concordancing. Software is used for this purpose. Frequency counts can be analysed statistically.
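As a rough illustration, a frequency count over a small text might look like the following Python sketch. The tokenisation rule (lower-cased runs of letters and apostrophes), the function name word_frequencies, and the per-thousand-words normalisation are assumptions chosen for this example, not a standard prescribed by any particular corpus tool.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word form occurs (hypothetical helper).
    Tokens are lower-cased runs of letters and apostrophes."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Made-up sample text standing in for a corpus.
sample = "The cat sat on the mat. A cat is a small animal. The dog chased the cat."

freq = word_frequencies(sample)
total = sum(freq.values())

# Raw counts depend on corpus size, so a relative frequency
# (here per 1,000 words) makes counts comparable across corpora.
for word, count in freq.most_common(3):
    print(word, count, f"{1000 * count / total:.1f} per 1,000 words")
```

Normalising the raw counts in this way is one simple first step towards the statistical treatment of frequencies, since it allows figures from corpora of different sizes to be compared.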
British National Corpus
The British National Corpus (BNC) is a sample collection representing the universe of contemporary British English. The BNC is a balanced corpus in the sense that it attempts to capture the full range of varieties of language use. It is also a mixed corpus, containing both written and spoken texts. The spoken texts are transcriptions of naturally occurring speech. The BNC is estimated to contain 100 million words, ninety percent of which come from written texts.
Applications of Corpus
Corpora are used in the development of NLP tools. Applications include spell-checking, grammar-checking, speech recognition, text-to-speech and speech-to-text conversion, automatic abstraction and indexing, information retrieval, and machine translation. Corpora are also used to create new dictionaries and grammars for learners.