Glossary of NLP Terms

Glossary of Terms in Natural Language Processing

What is what?

Agenda
A prioritized list of parser tasks still to be executed.
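
An agenda is often realized as a priority queue. The sketch below is one possible illustration, assuming tasks are plain strings and that a lower number means a more urgent task; the `Agenda` class and its method names are hypothetical and not taken from any particular parser.

```python
import heapq
import itertools


class Agenda:
    """Priority queue of pending parser tasks (lower number = more urgent)."""

    def __init__(self):
        self._heap = []
        # Running counter used as a tie-breaker so that tasks with equal
        # priority come out in insertion order.
        self._counter = itertools.count()

    def push(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        _priority, _order, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)


agenda = Agenda()
agenda.push(2, "extend NP edge")
agenda.push(1, "scan next word")
agenda.push(2, "predict VP")

# Drain the agenda: the most urgent task comes out first.
order = [agenda.pop() for _ in range(len(agenda))]
```

Here `order` starts with "scan next word", followed by the two priority-2 tasks in the order they were added.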

Alignment
Mapping the text segments of a parallel text onto each other.

Ambiguity
Ambiguity is a situation in which a word, phrase or sentence conveys more than one meaning.

Back-off
A mechanism for smoothing the estimates of the probabilities of rare events by relying on less specific models.
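
As a rough illustration, the sketch below scores a word given the previous word, using the bigram estimate when the bigram was observed and otherwise falling back to a discounted unigram estimate. It follows the simple "stupid backoff" style, so the results are scores rather than normalized probabilities; the function names and the discount `alpha=0.4` are assumptions made for this example.

```python
from collections import Counter


def train(tokens):
    """Count unigrams and bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def backoff_score(prev, word, unigrams, bigrams, alpha=0.4):
    """Score `word` after `prev`, backing off to the unigram model if needed."""
    if bigrams[(prev, word)] > 0:
        # Specific model: relative frequency of the observed bigram.
        return bigrams[(prev, word)] / unigrams[prev]
    # Less specific model: discounted relative frequency of the word alone.
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total


tokens = "the cat sat on the mat".split()
unigrams, bigrams = train(tokens)

seen = backoff_score("the", "cat", unigrams, bigrams)    # bigram was observed
unseen = backoff_score("cat", "mat", unigrams, bigrams)  # backs off to unigram
```

The unseen bigram "cat mat" still receives a non-zero score because the unigram "mat" occurs in the training text.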

Bilingual corpus
A collection of texts in which each text appears in two languages.

Bilingual dictionary
A dictionary that provides translations of words into another language.

Chunk
A sequence of words in text that constitutes a non-recursive, elementary grouping of a particular syntactic category.

Concordance
A list showing all the occurrences and contexts of a given word or phrase, as found in a corpus, typically in the form of a KWIC (keyword in context) index.
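
A minimal keyword-in-context sketch is shown below, assuming the corpus is already split into tokens and that a fixed window of `width` tokens on each side is enough context. The function name `kwic` and the bracket notation for the keyword are choices made for this example.

```python
def kwic(tokens, keyword, width=2):
    """Return one line per occurrence of `keyword`, with `width` tokens of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # strip() removes the padding space when one side has no context.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = "the cat sat on the mat while the dog slept".split()
lines = kwic(tokens, "the")
for line in lines:
    print(line)
```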

Corpus
A body of text. In Latin, corpus means "body".

Corpora
The plural form of corpus: collections of texts.

Derivation tree
The structure characterized by a set of nodes and a dominance relation between them.

Dialogue
Communicative linguistic activity in which at least two speakers or agents participate.

Discourse
An extended sequence of sentences produced by one or more people with the aim of conveying or exchanging information.

Entropy
The degree of disorder or randomness in a system, often taken as a measure of how difficult it is to predict the outcome of a random variable.
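
For a discrete random variable, this is usually quantified as Shannon entropy, H = -sum of p(x) * log2 p(x) over all outcomes x, measured in bits. The sketch below estimates it from a list of observed outcomes; the function name `entropy` and the use of relative frequencies as probabilities are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(outcomes):
    """Shannon entropy, in bits, of the empirical distribution of `outcomes`."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


fair_coin = entropy(["H", "T"])     # two equally likely outcomes
certain = entropy(["H", "H", "H"])  # outcome is fully predictable
uniform4 = entropy(list("abcd"))    # four equally likely outcomes
```

A fair coin gives 1 bit, a certain outcome gives 0 bits, and four equally likely outcomes give 2 bits: the harder the outcome is to predict, the higher the entropy.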

Formal language
Any set of strings over an alphabet.

Glossing
Providing a rough word-by-word translation of a document.

Headword
A word or phrase that is defined or explained in a dictionary.

Interlingua
A language-neutral text representation used in machine translation.

Lambda calculus
A universal model of computation, used widely in semantics and computer science to model the functional behaviour of linguistic expressions.
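
As an informal illustration of how function application can model meaning composition, the sketch below treats a transitive verb as a curried function that first takes its object and then its subject. Python lambdas only approximate the lambda calculus here, and the names `loves`, `verb_phrase` and the tuple-based meaning representation are hypothetical.

```python
# A transitive verb as a curried function: object first, then subject.
loves = lambda obj: lambda subj: ("loves", subj, obj)

# Applying the verb to its object yields a verb-phrase meaning,
# which is itself a function still waiting for a subject.
verb_phrase = loves("Mary")

# Applying the verb phrase to the subject yields the sentence meaning.
sentence = verb_phrase("John")
```

The result is a simple predicate-argument structure for "John loves Mary", built purely by function application.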

Lexicalization
The process of generating an appropriate lexical item for given semantic content, typically a phase of the automatic text generation process.

Mal-rules
Rules inappropriate to the true structure of a language, but which describe the systematic errors that learners make. Mal-rules are used for finding and diagnosing learners' errors.

n-gram
A sequence of n tokens.
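
A short sketch of extracting all n-grams from a token list follows; the function name `ngrams` and whitespace tokenization are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A text of k tokens contains k - n + 1 n-grams, so the six-token example has five bigrams and four trigrams.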

Natural language generation
The automatic production of natural language text.

Ontolingua
A standardized knowledge representation formalism developed at Stanford University.

Ontology
An inventory of the objects or processes in a domain, together with a specification of some or all of the relations that hold among them, generally arranged as a hierarchy.

Overacceptance
The error of returning too many phrases for a grammatical input.

Phrase-structure grammar
A type of grammar where no restriction is imposed on the form of its productions.

Polysemy
The phenomenon of words having multiple meanings.

Production
A rewriting rule in grammar.

Regular grammar
A type of grammar where every production has one of the forms A->wB or A->w, where A and B are nonterminal symbols and w is any string of terminals.
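
The restriction can be checked mechanically. The sketch below assumes each production is a pair of a left-hand symbol and a tuple of right-hand symbols, and that the set of nonterminals is given explicitly; the function name `is_right_linear` and this representation are choices made for the example.

```python
def is_right_linear(productions, nonterminals):
    """Check that every production has the form A->wB or A->w."""
    for lhs, rhs in productions:
        # The left-hand side must be a single nonterminal.
        if lhs not in nonterminals:
            return False
        # Allow at most one nonterminal, and only in the final position.
        body = rhs[:-1] if rhs and rhs[-1] in nonterminals else rhs
        if any(symbol in nonterminals for symbol in body):
            return False
    return True


nonterminals = {"S"}
regular = [("S", ("a", "S")), ("S", ("b",))]  # generates a*b
not_regular = [("S", ("a", "S", "b"))]        # nonterminal in the middle

regular_ok = is_right_linear(regular, nonterminals)
not_regular_ok = is_right_linear(not_regular, nonterminals)
```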

Semantics
The study of linguistic meaning.

Sublanguage
A proper subset of expressions in a natural or artificial language which exhibits language-like behaviour.

Synset
A set of one or more words that are considered to be synonyms in some or all contexts.

Treebank
A syntactically analysed text corpus.

Word-token
An occurrence in text of a word from a language vocabulary.
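
The difference between word-tokens (occurrences) and distinct vocabulary items can be seen by counting both. The sketch below assumes a deliberately crude tokenizer that lowercases the text and keeps runs of letters and apostrophes; real tokenizers handle many more cases.

```python
import re


def word_tokens(text):
    """Split text into lowercase word-tokens (crude regex tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


tokens = word_tokens("The cat saw the other cat.")
n_tokens = len(tokens)      # every occurrence counts
n_types = len(set(tokens))  # each distinct word counts once
```

The sentence has six word-tokens but only four distinct words, since "the" and "cat" each occur twice.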

Zero anaphora
Anaphora where an elliptically omitted word or expression acts as an anaphor. Example: "Sullivan attacked the lion but was not able to find it." Here the anaphor "he" is omitted after "but".

Abbreviations