What is Corpus?

This article gives a brief overview of what a corpus is, its types and applications, and a short note on the British National Corpus.

What is Corpus?

A corpus is a large collection of texts: a body of written or spoken material on which a linguistic analysis is based. The plural form of corpus is corpora. Some popular corpora are the British National Corpus (BNC), the COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus.

Monolingual corpora represent only one language, while bilingual corpora represent two languages. The European Corpus Initiative (ECI) corpus is multilingual, with 98 million words in Turkish, Japanese, Russian, Chinese, and other languages. A corpus may be composed of written language, spoken language, or both. A spoken corpus is usually in the form of audio recordings.

A corpus may be open or closed. An open corpus does not claim to contain all data from a specific area, while a closed corpus does claim to contain all or nearly all data from a particular field. Historical corpora, for example, are closed because no further input to the area is possible.

What is the use of Corpus?

A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability: they can retrieve all the occurrences of a particular word or structure for inspection, or work with randomly selected samples. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.

This linguistic information is obtained through concordances and frequency counts.

What is Concordance?

Concordances are listings of the occurrences of a particular feature or combination of features in a corpus. Each occurrence found (or hit) is displayed with a certain amount of context, that is, the text preceding and following it. The most commonly used concordance type is KWIC, which stands for Key Word In Context. It shows one hit per line of screen or print-out, with the principal search feature (or focus) highlighted in the centre. A concordance is used to determine the syntactic context in which a form is embedded. Concordances can be generated with Corpus Presenter and with Corpus Presenter Flash, programs that allow one to retrieve the contexts in which a word occurs.
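The KWIC layout described above can be illustrated with a short Python sketch. This is only a minimal illustration, not how Corpus Presenter or any other concordancer works internally; the function name kwic, the width parameter, and the sample text are assumptions made up for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one KWIC line per hit of `keyword` in `text`.

    Hypothetical helper for illustration: `width` is the number of
    context characters kept on each side of the hit.
    """
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every hit lines up in one column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Made-up sample text standing in for a corpus.
sample = (
    "A corpus is a large collection of texts. "
    "A spoken corpus is usually stored as audio recordings. "
    "Each corpus may be open or closed."
)

for line in kwic(sample, "corpus", width=25):
    print(line)
```

Each printed line shows one hit with the focus word in square brackets, aligned in the same column, and up to 25 characters of context on either side.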

Frequency Counts

A frequency count is the number of hits. Frequency counts require finding all the occurrences of a particular feature in the corpus, so counting is implicit in concordancing. Software is used for this purpose. Frequency counts can then be interpreted statistically.
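A frequency count can be sketched in a few lines of Python. The example below is a simplified illustration, assuming a very crude tokenizer (lowercased runs of letters) and a made-up sample text; real corpus software handles tokenization and annotation far more carefully. The name word_frequencies is hypothetical.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word form occurs in `text`.

    Assumption: a word is a run of ASCII letters, optionally with an
    internal apostrophe; everything is lowercased before counting.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

# Made-up sample text standing in for a corpus.
sample = "The corpus is large. The corpus is balanced. A corpus may be open."

freq = word_frequencies(sample)
total = sum(freq.values())

# Print the most frequent words with raw and relative frequency.
for word, count in freq.most_common(3):
    print(word, count, f"{count / total:.3f}")
```

The relative frequency (count divided by the total number of tokens) is the simplest statistical treatment of a raw count, and it makes counts from corpora of different sizes comparable.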

British National Corpus

The British National Corpus (BNC) consists of a sample collection representing the universe of contemporary British English. The BNC is a balanced corpus in the sense that it attempts to capture the full range of varieties of language use. It is also a mixed corpus, containing both written and spoken texts. The spoken texts are transcriptions of naturally occurring speech. The BNC is estimated to contain 100 million words. Ninety percent of the BNC is made up of written texts.

Applications of Corpus

Corpora are used in the development of NLP tools. Applications include spell-checking, grammar-checking, speech recognition, text-to-speech and speech-to-text conversion, automatic abstracting and indexing, information retrieval, and machine translation. Corpora are also used to create new dictionaries and grammars for learners.

Linguistics-Overview

What is Linguistics?

Linguistics is the study of human languages. It follows a scientific approach, so it is also referred to as linguistic science. Linguistics deals with describing and explaining the nature of human languages. It treats language and the ways people use it as phenomena to be studied. A linguist is an expert in linguistics who studies the general principles of language organization and language behavior.

Linguistic analysis is concerned with identifying the structural units and classes of language. Linguists also attempt to describe how smaller units can be combined to form larger grammatical units: how words can be combined to form phrases, phrases can be combined to form clauses, and so on. They are also concerned with what constrains the possible meanings of a sentence. Linguists use intuitions about well-formedness and meaning, together with mathematical models of structure such as formal language theory and model-theoretic semantics.

The structure of language includes morphemes, words, phrases, and grammatical classes.

Sub-fields with respect to linguistic structure are phonetics, phonology, morphology, syntax, semantics, pragmatics, and discourse analysis.

There are many branches of linguistics including applied linguistics, computational linguistics, evolutionary linguistics, neurolinguistics, cognitive linguistics and psycholinguistics.

