Module nltk_lite.tokenize.simple

Functions for tokenizing a text, based on a regular expression which matches tokens or gaps.

Functions
 
space(s)
Tokenize the text by splitting it at each single space character.
 
line(s)
Tokenize the text into lines.
 
blankline(s)
Tokenize the text into paragraphs (separated by blank lines).
 
shoebox(s)
Tokenize a Shoebox entry into its fields (separated by backslash markers).
 
demo()
A demonstration that shows the output of several different tokenizers on the same string.
Variables
  SPACE = ' '
  NEWLINE = '\n'
  BLANKLINE = '\n\n'
  SHOEBOXSEP = '^\\\\'
Function Details

space(s)

Tokenize the text by splitting it at each single space character.

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens
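
A minimal sketch of this behaviour, assuming the function amounts to splitting on the SPACE constant from the Variables list. The name space_sketch is a hypothetical stand-in written with the standard library only; it is not the nltk_lite source.

```python
SPACE = ' '  # value taken from the Variables list

def space_sketch(s):
    """Yield tokens from a string, or from each string of an iterable."""
    # Assumption: an iterator argument is tokenized one string at a time.
    chunks = [s] if isinstance(s, str) else s
    for chunk in chunks:
        for token in chunk.split(SPACE):
            yield token

tokens = list(space_sketch("Good muffins  cost"))
```

Under this assumption, two adjacent spaces produce an empty token, which is where splitting at a single space differs from the whitespace-based str.split() with no argument.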

line(s)

Tokenize the text into lines.

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens
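
The same idea for lines, again as a sketch under the assumption that the text is split on the NEWLINE constant. The name line_sketch is hypothetical and the code uses only the standard library.

```python
NEWLINE = '\n'  # value taken from the Variables list

def line_sketch(s):
    """Yield the lines of a string, split at each newline character."""
    # Assumption: a plain split on NEWLINE, with no stripping of whitespace.
    for token in s.split(NEWLINE):
        yield token

lines = list(line_sketch("first line\nsecond line\n"))
```

In this sketch a trailing newline leaves an empty final token, unlike str.splitlines(), which drops it.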

blankline(s)

Tokenize the text into paragraphs (separated by blank lines).

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens
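
A sketch of paragraph splitting, assuming the separator is exactly the BLANKLINE constant (two consecutive newline characters). The name blankline_sketch is hypothetical; the real function may treat whitespace-only lines differently.

```python
BLANKLINE = '\n\n'  # value taken from the Variables list

def blankline_sketch(s):
    """Yield paragraphs of a string, split at each empty line."""
    # Assumption: only two adjacent newlines count as a paragraph break.
    for token in s.split(BLANKLINE):
        yield token

text = "First line\nsecond line\n\nNew paragraph"
paragraphs = list(blankline_sketch(text))
```

With this literal separator, a line that contains only spaces does not end a paragraph, so "a\n \nb" stays a single token in the sketch.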

shoebox(s)

Tokenize a Shoebox entry into its fields (separated by backslash markers).

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens
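
The SHOEBOXSEP value shown in the Variables list is the escaped form of the regular expression ^\\, which matches a backslash at the start of a line. The sketch below assumes the pattern is applied in multi-line mode; shoebox_sketch and the sample entry are hypothetical, and the code uses only the standard re module.

```python
import re

SHOEBOXSEP = r'^\\'  # same string as '^\\\\' in the Variables list

def shoebox_sketch(s):
    """Yield the fields of a Shoebox entry, split at line-initial backslashes."""
    # Assumption: re.MULTILINE, so that ^ matches at the start of every line.
    for token in re.compile(SHOEBOXSEP, re.MULTILINE).split(s):
        yield token

entry = "\\lx kaa\n\\ps N\n\\ge gag\n"
fields = list(shoebox_sketch(entry))
```

Because the entry starts with a marker, the first token in this sketch is an empty string, and each remaining token keeps its marker name (lx, ps, ge) without the leading backslash.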