
Python code for Tokenization

In the field of natural language processing it is often necessary to parse sentences and analyze them, and tokenization is the key first step. Python's str.split() method splits a given text or sentence on a given delimiter or separator.

The following code splits the given text and generates a list of tokens.

if __name__ == '__main__':

    # No separator: split() breaks on any run of whitespace
    text = 'This is a text for testing tokenization'
    tokens = text.split()
    print(tokens)

    # Explicit single-space separator
    tokens = text.split(' ')
    print(tokens)

    # Incorrect separator: '|' never occurs, so the whole text is one token
    tokens = text.split('|')
    print(tokens)

    # Text with more than one space between words
    text = 'This  is  a  text  for  testing  tokenization'
    tokens = text.split()
    print(tokens)

    # Splitting on a single space leaves empty strings between tokens
    tokens = text.split(' ')
    print(tokens)

    # Splitting on the exact two-space separator
    tokens = text.split('  ')
    print(tokens)

    # Comma-separated text
    text = 'This,is,a,text,for,testing,tokenization'
    tokens = text.split(',')
    print(tokens)

Running the above program produces the following output.

['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
['This is a text for testing tokenization']
['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
['This', '', 'is', '', 'a', '', 'text', '', 'for', '', 'testing', '', 'tokenization']
['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
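As the output shows, split() with no argument handles runs of whitespace, while split(' ') does not. When the separator is more irregular than a single character, one alternative is the standard library's re module, which can split on a pattern. A minimal sketch (the sample text is an arbitrary illustration):

import re

if __name__ == '__main__':
    # \s+ matches any run of whitespace (spaces, tabs, newlines),
    # so irregular spacing no longer produces empty tokens
    text = 'This  is a\ttext for testing tokenization'
    tokens = re.split(r'\s+', text)
    print(tokens)
    # ['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']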

Tokenization: Overview

This article presents an overview of tokenization and the challenges associated with it.

What is Tokenization?

Tokenization is the process of breaking the given text into units called tokens. The tokens may be words, numbers, or punctuation marks. Tokenization does this by locating word boundaries: the point where one word ends and the next begins. Tokenization is also known as word segmentation.
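As an illustration, a simple tokenizer can locate word boundaries with a regular expression that treats letter sequences, digit sequences, and individual punctuation marks as separate tokens. A minimal sketch (the pattern shown is one illustrative choice among many, not a complete tokenizer):

import re

def tokenize(text):
    # Letter sequences, digit sequences, and single punctuation
    # marks each become their own token.
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

if __name__ == '__main__':
    print(tokenize('Words, numbers (like 42) and punctuation!'))
    # ['Words', ',', 'numbers', '(', 'like', '42', ')', 'and', 'punctuation', '!']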

Challenges in Tokenization

Challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited, as most of their words are separated from each other by white space. Languages such as Chinese and Thai are referred to as unsegmented, as words do not have clear boundaries. Tokenizing sentences in unsegmented languages requires additional lexical and morphological information (a dictionary-based segmentation sketch follows the list below). Tokenization is also affected by the writing system and the typographical structure of the words. The structures of languages can be grouped into three categories:

Isolating: Words do not divide into smaller units. Example: Mandarin Chinese

Agglutinative: Words divide into smaller units. Examples: Japanese, Tamil

Inflectional: Boundaries between morphemes are not clear and are ambiguous in terms of grammatical meaning. Example: Latin.

Please note that languages given as examples for each category can exhibit some traces of other categories as well.
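To see why unsegmented languages need lexical information, consider greedy maximum matching, a common baseline segmenter: at each position it takes the longest dictionary word that matches the text. A minimal sketch, using a toy hypothetical dictionary for illustration only:

def max_match(text, dictionary, max_len=4):
    # Greedy left-to-right segmentation: at each position take the
    # longest dictionary entry that matches, else a single character.
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

if __name__ == '__main__':
    # Toy dictionary (hypothetical entries, for illustration)
    dictionary = {'北京', '大学', '北京大学', '在'}
    print(max_match('在北京大学', dictionary))
    # ['在', '北京大学']

Greedy matching is only a baseline: it has no notion of context, so ambiguous character sequences can be segmented incorrectly, which is one reason practical segmenters add morphological and statistical information.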
