Package nltk_lite :: Package tokenize :: Module regexp

Module regexp


Functions for tokenizing a text, based on a regular expression which matches tokens or gaps.

Functions

_compile(regexp)

_display(tokens)
    A helper function for demo() that displays a list of tokens.

_remove_group_identifiers(parsed_re)
    Modifies the given parsed regular expression, replacing all groupings (as indicated by parentheses in the regular expression string) with non-grouping variants (indicated with '(?:...)').

blankline(s)
    Tokenize the text into paragraphs (separated by blank lines).

demo()
    A demonstration that shows the output of several different tokenizers on the same string.

line(s)
    Tokenize the text into lines.

regexp(text, pattern, gaps=True, advanced=True)
    Tokenize the text according to the regular expression pattern.

shoebox(s)
    Tokenize a Shoebox entry into its fields (separated by backslash markers).

token_split(text, pattern, advanced=True)
    Returns an iterator that generates tokens and the gaps between them.

treebank(s)
    Tokenize a Treebank file into its tree strings.

whitespace(s)
    Tokenize the text at whitespace.

word(s)
    Tokenize the text into sequences of word characters (a-zA-Z0-9).

wordpunct(s)
    Tokenize the text into sequences of alphabetic and non-alphabetic characters.
Variables
  BLANKLINE = '\\s*\\n\\s*\\n\\s*'
  NEWLINE = '\\n'
  SHOEBOXSEP = '^\\\\'
  TREEBANK = '^\\(.*?(?=^\\(|\\Z)'
  WHITESPACE = '\\s+'
  WORD = '\\w+'
  WORDPUNCT = '[a-zA-Z]+|[^a-zA-Z\\s]+'
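
These gap and token patterns can be exercised directly with Python's re module. A minimal sketch (the sample sentence is just an illustration, not from the module): gap patterns such as WHITESPACE split the text, while token patterns such as WORD and WORDPUNCT select the matches themselves.

```python
import re

# Patterns copied from the module's variables, written as raw strings.
WHITESPACE = r'\s+'
WORD = r'\w+'
WORDPUNCT = r'[a-zA-Z]+|[^a-zA-Z\s]+'

text = "Good muffins cost $3.88 in New York."

# A gap pattern separates tokens, so we split on it.
print(re.split(WHITESPACE, text))
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']

# Token patterns match the tokens themselves, so we collect the matches.
print(re.findall(WORD, text))
# ['Good', 'muffins', 'cost', '3', '88', 'in', 'New', 'York']
print(re.findall(WORDPUNCT, text))
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']
```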
Function Details

_remove_group_identifiers(parsed_re)

Modifies the given parsed regular expression, replacing all groupings (as indicated by parentheses in the regular expression string) with non-grouping variants (indicated with '(?:...)'). This works on the output of sre_parse.parse, modifying the group identifiers in SUBPATTERN structures to None.

Parameters:
  • parsed_re (SubPattern) - the output of sre_parse.parse(string)
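
The function itself operates on the sre_parse tree, but the effect it achieves can be illustrated at the string level. The sketch below is a rough stand-in, not the module's implementation: it assumes the pattern contains no escaped '\(' and no '(' inside character classes.

```python
import re

def neutralize_groups(pattern):
    """Rewrite '(' as '(?:' unless it already starts an extension group.

    String-level illustration of the effect _remove_group_identifiers
    achieves on the parsed tree; assumes no escaped parentheses and no
    parentheses inside character classes.
    """
    return re.sub(r'\((?!\?)', '(?:', pattern)

print(neutralize_groups('(abc)|(def)+'))   # '(?:abc)|(?:def)+'
```

With the group identifiers neutralized, re.findall on the pattern returns whole matches rather than the contents of the first group, which is why a tokenizer needs this transformation for patterns that use '()'.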

blankline(s)


Tokenize the text into paragraphs (separated by blank lines).

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens
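
The paragraph-splitting behaviour can be previewed with the module's BLANKLINE gap pattern and re.split (a sketch on a plain string; the real function also accepts string iterators):

```python
import re

BLANKLINE = r'\s*\n\s*\n\s*'   # the module's gap pattern for paragraphs

text = "First paragraph.\n\nSecond paragraph,\nstill the same one.\n\nThird."
paragraphs = re.split(BLANKLINE, text)
print(paragraphs)
# ['First paragraph.', 'Second paragraph,\nstill the same one.', 'Third.']
```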

line(s)


Tokenize the text into lines.

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens

regexp(text, pattern, gaps=True, advanced=True)


Tokenize the text according to the regular expression pattern.

Parameters:
  • text (string or iter(string)) - the string or string iterator to be tokenized
  • pattern (string) - the regular expression
  • gaps (boolean) - set to True if the pattern matches material between tokens
  • advanced (boolean) - set to True if the pattern is complex, making use of () groups
Returns:
An iterator over tokens
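
The gaps parameter can be pictured with plain re calls. The sketch below is illustrative, not the module's actual implementation: it omits string-iterator input and the advanced flag (group neutralizing), and it drops the empty strings that splitting can produce.

```python
import re

def regexp_sketch(text, pattern, gaps=True):
    """Illustrative stand-in for regexp(); yields tokens lazily.

    gaps=True  -> the pattern matches material *between* tokens (split).
    gaps=False -> the pattern matches the tokens themselves (findall).
    """
    if gaps:
        for tok in re.split(pattern, text):
            if tok:                     # skip empty strings from splitting
                yield tok
    else:
        for m in re.finditer(pattern, text):
            yield m.group()

print(list(regexp_sketch("a, b; c", r'[,;]\s*')))          # ['a', 'b', 'c']
print(list(regexp_sketch("a, b; c", r'\w+', gaps=False)))  # ['a', 'b', 'c']
```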

shoebox(s)


Tokenize a Shoebox entry into its fields (separated by backslash markers).

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens
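
Shoebox fields begin with a backslash marker at the start of a line, which is what the SHOEBOXSEP pattern ('^\\') matches. A sketch of the splitting behaviour, assuming the pattern is applied in multiline mode (the entry below is an invented example):

```python
import re

SHOEBOXSEP = r'^\\'   # a literal backslash at the start of a line

entry = "\\lx kaa\n\\ps N\n\\ge gall\n"
fields = [f for f in re.split(SHOEBOXSEP, entry, flags=re.MULTILINE) if f]
print(fields)
# ['lx kaa\n', 'ps N\n', 'ge gall\n']
```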

token_split(text, pattern, advanced=True)

Separate the text into the tokens matched by the pattern and the gaps between them.

Parameters:
  • text (string or iter(string)) - the string or string iterator to be tokenized
  • pattern (string) - the regular expression
  • advanced (boolean) - set to True if the pattern is complex, making use of () groups
Returns:
An iterator that generates tokens and the gaps between them
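
The alternating tokens-and-gaps output can be sketched with re.finditer. This is an illustrative stand-in, not the module's implementation; the real function also neutralizes () groups when advanced=True.

```python
import re

def token_split_sketch(text, pattern):
    """Yield matches of pattern and the stretches between them, in order."""
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            yield text[pos:m.start()]   # gap before this match
        yield m.group()                 # the match itself
        pos = m.end()
    if pos < len(text):
        yield text[pos:]                # trailing gap

print(list(token_split_sketch("ab  cd", r'\s+')))  # ['ab', '  ', 'cd']
```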

treebank(s)


Tokenize a Treebank file into its tree strings.

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens
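
The TREEBANK pattern selects everything from one line-initial '(' up to the next (or end of input), so it relies on the MULTILINE and DOTALL semantics for '^' and '.'. A sketch with an invented two-tree file:

```python
import re

TREEBANK = r'^\(.*?(?=^\(|\Z)'   # the module's tree-string pattern

text = "(S (NP Alice) (VP runs))\n(S (NP Bob) (VP sits))\n"
trees = re.findall(TREEBANK, text, flags=re.MULTILINE | re.DOTALL)
print(trees)
# ['(S (NP Alice) (VP runs))\n', '(S (NP Bob) (VP sits))\n']
```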

whitespace(s)


Tokenize the text at whitespace.

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens

word(s)


Tokenize the text into sequences of word characters (a-zA-Z0-9).

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens

wordpunct(s)


Tokenize the text into sequences of alphabetic and non-alphabetic characters. For example, "She said 'hello.'" would be tokenized as ["She", "said", "'", "hello", ".'"].

Parameters:
  • s (string or iter(string)) - the string or string iterator to be tokenized
Returns:
An iterator over tokens
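
The example in the description can be reproduced directly with the WORDPUNCT pattern and re.findall:

```python
import re

WORDPUNCT = r'[a-zA-Z]+|[^a-zA-Z\s]+'

print(re.findall(WORDPUNCT, "She said 'hello.'"))
# ['She', 'said', "'", 'hello', ".'"]
```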