This article presents an overview of Tokenization and the challenges associated with it.
What is Tokenization?
Tokenization is the process of breaking a given text into units called tokens. Tokens may be words, numbers, or punctuation marks. Tokenization does this by locating word boundaries: the point where one word ends and the next begins. Tokenization is also known as word segmentation.
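To make the idea concrete, here is a minimal sketch of a tokenizer for a space-delimited language like English. It uses a simple regular expression that treats runs of word characters as one token and each punctuation mark as its own token; the function name `tokenize` and the pattern are illustrative, and real tokenizers handle many more cases (contractions, hyphenation, abbreviations).

```python
import re

def tokenize(text):
    # \w+ matches a run of letters or digits (a word or number token);
    # [^\w\s] matches any single non-word, non-space character
    # (a punctuation token). This is a deliberately minimal sketch.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world! It costs 5 dollars."))
# ['Hello', ',', 'world', '!', 'It', 'costs', '5', 'dollars', '.']
```

Note that the white space itself is discarded: it only marks the boundaries, and the tokens are what lie between them.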
Challenges in Tokenization
The challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited, as most words are separated from each other by white space. Languages such as Chinese and Thai are referred to as unsegmented, as words do not have clear boundaries. Tokenizing sentences in unsegmented languages requires additional lexical and morphological information. Tokenization is also affected by the writing system and the typographical structure of the words. Language structures can be grouped into three categories:
Isolating: Words do not divide into smaller units. Example: Mandarin Chinese
Agglutinative: Words divide into smaller units. Example: Japanese, Tamil
Inflectional: Boundaries between morphemes are unclear and ambiguous in terms of grammatical meaning. Example: Latin.
Please note that the languages given as examples for each category can also exhibit traces of the other categories.