Python code for Tokenization

In natural language processing it is often necessary to parse sentences and analyze them, and tokenization is the key first step. Python's built-in str.split() method splits a given text or sentence on a given delimiter (separator); with no argument, it splits on any run of whitespace.

The following code splits the given text and generates a list of tokens.

def tokenize(text, separator):
    # A thin wrapper around str.split() for a given separator
    return text.split(separator)

if __name__ == '__main__':
    # No separator: split() breaks on any run of whitespace
    text = 'This is a text for testing tokenization'
    tokens = text.split()
    print(tokens)

    tokens = tokenize(text, ' ')
    print(tokens)

    # Incorrect separator: the whole string stays as one token
    tokens = text.split('|')
    print(tokens)

    # With more than one space between words
    text = 'This  is  a  text  for  testing  tokenization'
    tokens = text.split()
    print(tokens)

    # Splitting on a single space leaves an empty string for each extra space
    tokens = text.split(' ')
    print(tokens)

    # Splitting on the two-space separator itself
    tokens = text.split('  ')
    print(tokens)

    # Comma-separated text
    text = 'This,is,a,text,for,testing,tokenization'
    tokens = text.split(',')
    print(tokens)

Running the above program produces the following output.

['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
['This is a text for testing tokenization']
['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
['This', '', 'is', '', 'a', '', 'text', '', 'for', '', 'testing', '', 'tokenization']
['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
['This', 'is', 'a', 'text', 'for', 'testing', 'tokenization']
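As the output above shows, split() with no argument already collapses runs of whitespace, while split(' ') does not. When the delimiters are irregular (tabs, commas with stray spaces, and so on), the standard library's re.split() generalizes this to any pattern. A small sketch, with the sample strings being my own illustrative inputs:

```python
import re

# Mixed runs of spaces and tabs: plain split() handles this already
text = 'This  is a\ttext  for testing tokenization'
print(text.split())

# Commas with optional surrounding spaces: use a regex pattern
csv_text = 'This, is,a ,text'
print(re.split(r'\s*,\s*', csv_text))
```

The pattern r'\s*,\s*' consumes the comma together with any neighboring whitespace, so no empty or padded tokens appear in the result.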