Package nltk_lite :: Package tokenize :: Module regexp |
|
Function Summary | |
---|---|
blankline(s) | Tokenize the text into paragraphs (separated by blank lines). |
demo() | A demonstration that shows the output of several different tokenizers on the same string. |
line(s) | Tokenize the text into lines. |
regexp(text, pattern, gaps=False, advanced=False) | Tokenize the text according to the regular expression pattern. |
shoebox(s) | Tokenize a Shoebox entry into its fields (separated by backslash markers). |
token_split(text, pattern, advanced=False) | Return an iterator that generates tokens and the gaps between them. |
treebank(s) | Tokenize a Treebank file into its tree strings. |
whitespace(s) | Tokenize the text at whitespace. |
wordpunct(s) | Tokenize the text into sequences of alphabetic and non-alphabetic characters. |
_compile(regexp) | |
_display(tokens) | A helper function for demo that displays a list of tokens. |
_remove_group_identifiers(parsed_re) | Modifies the given parsed regular expression, replacing all groupings (as indicated by parentheses in the regular expression string) with non-grouping variants (indicated with '(?:...)'). |
Variable Summary | |
---|---|
str |
BLANKLINE = '\\s*\\n\\s*\\n\\s*'
|
str |
NEWLINE = '\\n'
|
str |
SHOEBOXSEP = '^\\\\'
|
str |
TREEBANK = '^\\(.*?(?=^\\(|\\Z)'
|
str |
WHITESPACE = '\\s+'
|
str |
WORDPUNCT = '[a-zA-Z]+|[^a-zA-Z\\s]+'
|
Function Details |
---|
blankline(s)
Tokenize the text into paragraphs (separated by blank lines).
|
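As a rough illustration of what blankline(s) is described as doing, the sketch below splits a string on the BLANKLINE pattern from the Variable Summary. It uses only the standard re module rather than nltk_lite itself; the helper name blankline_sketch and the decision to drop empty strings are assumptions, not the module's actual implementation.

```python
import re

# Pattern copied from the Variable Summary (BLANKLINE), written as a raw string.
BLANKLINE = r'\s*\n\s*\n\s*'

def blankline_sketch(s):
    """Hypothetical stand-in: split s into paragraphs at blank lines."""
    # Drop empty strings that appear when s starts or ends with a blank line.
    return [p for p in re.split(BLANKLINE, s) if p]

text = "First paragraph,\nstill first.\n\nSecond paragraph.\n  \n\nThird."
paragraphs = blankline_sketch(text)
print(paragraphs)
```

A single newline inside a paragraph is kept, because the pattern needs at least two newlines (optionally surrounded by other whitespace) to match.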
demo()
A demonstration that shows the output of several different tokenizers on the same string. |
line(s)
Tokenize the text into lines.
|
regexp(text, pattern, gaps=False, advanced=False)
Tokenize the text according to the regular expression pattern.
|
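This page does not say what the gaps parameter does. Assuming it follows the common convention that gaps=True means the pattern describes the separators rather than the tokens, a minimal sketch of the two modes might look like the following. The name regexp_sketch is hypothetical, and the advanced parameter is left out because its meaning is not documented here.

```python
import re

def regexp_sketch(text, pattern, gaps=False):
    """Hypothetical stand-in for regexp(); 'advanced' is not modelled."""
    if not gaps:
        # The pattern describes the tokens themselves.
        for match in re.finditer(pattern, text):
            yield match.group()
        return
    # Assumed meaning of gaps=True: the pattern describes the separators.
    pos = 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            yield text[pos:match.start()]
        pos = match.end()
    if pos < len(text):
        yield text[pos:]

by_tokens = list(regexp_sketch("a, b,c", r'\w+'))
by_gaps = list(regexp_sketch("a, b,c", r',\s*', gaps=True))
print(by_tokens, by_gaps)
```

Under that assumption both calls produce the same three tokens, one by matching the words and one by matching the commas between them.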
shoebox(s)
Tokenize a Shoebox entry into its fields (separated by backslash markers).
|
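To make the "backslash markers" concrete, here is a small sketch that splits a made-up Shoebox entry on the SHOEBOXSEP pattern using the standard re module. The sample entry, the use of re.MULTILINE so that ^ matches at every line start, and the handling of trailing newlines are all assumptions for illustration; the real shoebox(s) may differ.

```python
import re

# Pattern copied from the Variable Summary (SHOEBOXSEP): a backslash at line start.
SHOEBOXSEP = r'^\\'

# Invented sample entry; each field begins with a backslash marker.
entry = "\\lx kaa\n\\ps N\n\\ge gag\n"

# re.MULTILINE is assumed so that '^' matches at the start of every line.
fields = [f for f in re.split(SHOEBOXSEP, entry, flags=re.MULTILINE) if f]
print(fields)
```

In this sketch each field keeps its trailing newline and loses the leading backslash, since the backslash is the separator being split on.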
token_split(text, pattern, advanced=False)
Return an iterator that generates tokens and the gaps between them.
|
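The exact shape of the values produced by token_split is not given on this page. One plausible reading, sketched below with the standard re module, is an iterator of (gap, token) pairs, where the gap is the text preceding each token. The name token_split_sketch, the pair ordering, and the final pair carrying any trailing text are assumptions.

```python
import re

def token_split_sketch(text, pattern):
    """Hypothetical stand-in: yield (preceding_gap, token) pairs."""
    last_end = 0
    for match in re.finditer(pattern, text):
        if not match.group():
            continue  # ignore empty matches
        start, end = match.span()
        yield text[last_end:start], match.group()
        last_end = end
    if last_end < len(text):
        # Trailing text after the last token, paired with an empty token.
        yield text[last_end:], ''

pairs = list(token_split_sketch("one, two!", r'[a-z]+'))
print(pairs)
```

Concatenating every gap and token in order reproduces the original string, which is a convenient property to check for any implementation of this kind.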
treebank(s)
Tokenize a Treebank file into its tree strings.
|
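The TREEBANK pattern from the Variable Summary can be tried directly with the standard re module. The sketch below assumes the pattern is compiled with re.MULTILINE (so ^ matches at each line start) and re.DOTALL (so . spans newlines); this page does not state the flags, and the sample trees are invented.

```python
import re

# Pattern copied from the Variable Summary (TREEBANK), written as a raw string.
TREEBANK = r'^\(.*?(?=^\(|\Z)'

# Two invented bracketed trees; nested lines are indented, so only the
# outermost '(' of each tree sits at the start of a line.
s = "(S\n  (NP I)\n  (VP ran))\n(S\n  (NP You)\n  (VP sat))\n"

# Flags are an assumption: MULTILINE for '^', DOTALL so '.' matches newlines.
trees = re.findall(TREEBANK, s, flags=re.MULTILINE | re.DOTALL)
print(len(trees))
```

Each match runs from a line-initial "(" up to, but not including, the next line-initial "(" or the end of the string, so indented sub-brackets stay inside their tree.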
whitespace(s)
Tokenize the text at whitespace.
|
wordpunct(s)
Tokenize the text into sequences of alphabetic and non-alphabetic characters. E.g. "She said 'hello.'" would be tokenized to ["She", "said", "'", "hello", ".'"]
|
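The wordpunct example can be reproduced with the WORDPUNCT pattern from the Variable Summary and the standard re module. This is only a sketch of the pattern's behaviour, not a call into nltk_lite.

```python
import re

# Pattern copied from the Variable Summary (WORDPUNCT), written as a raw string.
WORDPUNCT = r'[a-zA-Z]+|[^a-zA-Z\s]+'

# Alternation: a run of letters, or a run of non-letter, non-space characters.
tokens = re.findall(WORDPUNCT, "She said 'hello.'")
print(tokens)
```

Note that adjacent punctuation characters are grouped into one token (".'"), and that digits fall into the non-alphabetic class.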
_display(tokens)
A helper function for demo that displays a list of tokens.
|
_remove_group_identifiers(parsed_re)
Modifies the given parsed regular expression, replacing all groupings (as indicated by parentheses in the regular expression string) with non-grouping variants (indicated with '(?:...)'). This works on the output of sre_parse.parse, modifying the group identifier in SUBPATTERN structures to None.
|
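This page does not explain why groupings are replaced. One likely motivation is that capturing groups change what functions such as re.findall return. The sketch below does not reproduce the sre_parse-based rewrite; it only contrasts a capturing pattern with its hand-written non-grouping equivalent, using the standard re module.

```python
import re

text = "ababc abc"

# With a capturing group, findall returns the group's text, not the whole match.
capturing = re.findall(r'(ab)+c', text)

# With the non-grouping form (?:...), findall returns each whole match.
non_capturing = re.findall(r'(?:ab)+c', text)

print(capturing, non_capturing)
```

For a tokenizer, the second behaviour is the useful one, since each whole match is meant to be a token.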
Variable Details |
---|
BLANKLINE
|
NEWLINE
|
SHOEBOXSEP
|
TREEBANK
|
WHITESPACE
|
WORDPUNCT
|
Generated by Epydoc 2.1 on Tue Sep 5 09:37:22 2006 | http://epydoc.sf.net |