Module nltk_lite.parse.chunk

Classes and interfaces for identifying non-overlapping linguistic
groups (such as base noun phrases) in unrestricted text.  This task is
called X{chunk parsing} or X{chunking}, and the identified groups are
called X{chunks}.  The chunked text is represented using a shallow
tree called a "chunk structure."  A X{chunk structure} is a tree
containing tokens and chunks, where each chunk is a subtree containing
only tokens.  For example, the chunk structure for base noun phrase
chunks in the sentence "I saw the big dog on the hill" is::

  (SENTENCE:
    (NP: <I>)
    <saw>
    (NP: <the> <big> <dog>)
    <on>
    (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the
chunk structure's L{leaves<Tree.leaves>} method.
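
As a rough illustration of what flattening a chunk structure involves, the sketch below models the example sentence with plain Python lists and tuples. The C{leaves} function here is a hypothetical stand-in written for this example; it is not the library's C{Tree.leaves} method, and the data layout is an assumption.

```python
# A chunk structure modelled with built-in types: bare strings are tokens
# and (label, [tokens]) tuples are chunks.  This layout is an assumption
# made for illustration; nltk_lite uses its own Tree class.
sentence = [
    ("NP", ["I"]),
    "saw",
    ("NP", ["the", "big", "dog"]),
    "on",
    ("NP", ["the", "hill"]),
]


def leaves(chunk_structure):
    """Return the tokens of a chunk structure in their original order."""
    tokens = []
    for node in chunk_structure:
        if isinstance(node, tuple):   # a chunk: (label, tokens)
            tokens.extend(node[1])
        else:                         # a token outside any chunk
            tokens.append(node)
    return tokens


print(leaves(sentence))
```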

The C{parse.chunk} module defines L{ChunkParseI}, a standard interface
for chunking texts; and L{RegexpChunk}, a regular-expression based
implementation of that interface.  It uses the L{tree.chunk} and
L{tree.conll_chunk} methods, which tokenize strings containing chunked
and tagged texts.  It also defines L{ChunkScore}, a utility class for
scoring chunk parsers.

RegexpChunk
===========

  C{parse.RegexpChunk} is an implementation of the chunk parser interface
  that uses regular expressions over tags to chunk a text.  Its
  C{parse} method first constructs a C{ChunkString}, which encodes a
  particular chunking of the input text.  Initially, nothing is
  chunked.  C{parse.RegexpChunk} then applies a sequence of
  C{RegexpChunkRule}s to the C{ChunkString}, each of which modifies
  the chunking that it encodes.  Finally, the C{ChunkString} is
  transformed back into a chunk structure, which is returned.

  C{RegexpChunk} can only be used to chunk a single kind of phrase.
  For example, you can use a C{RegexpChunk} to chunk the noun
  phrases in a text, or the verb phrases in a text; but you cannot
  use it to simultaneously chunk both noun phrases and verb phrases in
  the same text.  (This is a limitation of C{RegexpChunk}, not of
  chunk parsers in general.)

  RegexpChunkRules
  ----------------
    C{RegexpChunkRule}s are transformational rules that update the
    chunking of a text by modifying its C{ChunkString}.  Each
    C{RegexpChunkRule} defines the C{apply} method, which modifies
    the chunking encoded by a C{ChunkString}.  The
    L{RegexpChunkRule} class itself can be used to implement any
    transformational rule based on regular expressions.  There are
    also a number of subclasses, which can be used to implement
    simpler types of rules:

      - L{ChunkRule} chunks anything that matches a given regular
        expression.
      - L{ChinkRule} chinks anything that matches a given regular
        expression.
      - L{UnChunkRule} will un-chunk any chunk that matches a given
        regular expression.
      - L{MergeRule} can be used to merge two contiguous chunks.
      - L{SplitRule} can be used to split a single chunk into two
        smaller chunks.
      - L{ExpandLeftRule} will expand a chunk to incorporate new
        unchunked material on the left.
      - L{ExpandRightRule} will expand a chunk to incorporate new
        unchunked material on the right.
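
The idea of rules that rewrite a string encoding can be sketched with Python's standard C{re} module. The encoding below (tags in angle brackets, chunks in braces) and the C{chunk} and C{chink} helpers are assumptions made for illustration; they are not the library's C{ChunkString}, C{ChunkRule}, or C{ChinkRule} classes, and they take ordinary regular expressions rather than tag patterns.

```python
import re

# Lookaheads that restrict a match to material outside or inside a chunk.
OUTSIDE_CHUNK = r"(?=[^}]*(?:\{|$))"
INSIDE_CHUNK = r"(?=[^{]*\})"


def chunk(chunk_string, tag_regex):
    """Wrap each unchunked match of tag_regex in braces."""
    return re.sub("(" + tag_regex + ")" + OUTSIDE_CHUNK, r"{\1}", chunk_string)


def chink(chunk_string, tag_regex):
    """Cut each match of tag_regex out of the chunk that contains it."""
    result = re.sub("(" + tag_regex + ")" + INSIDE_CHUNK, r"}\1{", chunk_string)
    return result.replace("{}", "")   # drop chunks that were emptied


# Tags for "I saw the big dog on the hill"; initially nothing is chunked.
tags = "<PRP><VBD><DT><JJ><NN><IN><DT><NN>"

# Strategy 1: chunk runs of pronouns, determiners, adjectives and nouns.
by_chunking = chunk(tags, r"(?:<PRP>|<DT>|<JJ>|<NN>)+")

# Strategy 2: chunk everything, then chink verbs and prepositions.
by_chinking = chink("{" + tags + "}", r"(?:<VBD>|<IN>)+")

print(by_chunking)
print(by_chinking)
```

Both strategies arrive at the same encoding, which corresponds to the three noun phrase chunks in the example sentence at the top of this page.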

    Tag Patterns
    ~~~~~~~~~~~~
      C{RegexpChunkRule}s use a modified version of regular
      expression patterns, called X{tag patterns}.  Tag patterns are
      used to match sequences of tags.  Examples of tag patterns are::

         r'(<DT>|<JJ>|<NN>)+'
         r'<NN>+'
         r'<NN.*>'

      The differences between regular expression patterns and tag
      patterns are:

        - In tag patterns, C{'<'} and C{'>'} act as parentheses; so
          C{'<NN>+'} matches one or more repetitions of C{'<NN>'}, not
          C{'<NN'} followed by one or more repetitions of C{'>'}.
        - Whitespace in tag patterns is ignored.  So
          C{'<DT> | <NN>'} is equivalent to C{'<DT>|<NN>'}.
        - In tag patterns, C{'.'} is equivalent to C{'[^{}<>]'}; so
          C{'<NN.*>'} matches any single tag starting with C{'NN'}.

      The function L{tag_pattern2re_pattern} can be used to transform
      a tag pattern to an equivalent regular expression pattern.
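
A minimal sketch of such a conversion, based only on the three differences listed above, might look as follows. The function name C{tag_pattern_to_regex} is hypothetical, the sketch skips validation and does not handle escaped characters, and the real C{tag_pattern2re_pattern} may differ in its details.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Convert a tag pattern to an ordinary regular expression (sketch)."""
    # Whitespace in tag patterns is ignored.
    pattern = re.sub(r"\s", "", tag_pattern)
    # Make '<' and '>' act like parentheses: the outer group lets a
    # following quantifier apply to the whole tag, and the inner group
    # keeps '|' from reaching the angle brackets.
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # '.' must not match across tag or chunk boundaries.
    pattern = pattern.replace(".", "[^{}<>]")
    return pattern


print(tag_pattern_to_regex("<NN>+"))   # -> (?:<(?:NN)>)+
print(bool(re.fullmatch(tag_pattern_to_regex("<NN.*>"), "<NNS>")))
```

The bracket substitution is done before the C{'.'} substitution so that the angle brackets inside the inserted character class are not themselves rewritten.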

  Efficiency
  ----------
    Preliminary tests indicate that C{RegexpChunk} can chunk at a
    rate of about 300 tokens/second, with a moderately complex rule
    set.

    There may be problems if C{RegexpChunk} is used with more than
    5,000 tokens at a time.  In particular, evaluation of some regular
    expressions may cause the Python regular expression engine to
    exceed its maximum recursion depth.  We have attempted to minimize
    these problems, but it is impossible to avoid them completely.  We
    therefore recommend that you apply the chunk parser to a single
    sentence at a time.
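
One way to follow this recommendation is to split the tagged text into sentences first and call the chunk parser once per sentence. The sketch below assumes tokens are C{(word, tag)} pairs and that sentences end with the tag C{'.'}; both helper names are hypothetical, and C{chunk_parse} stands for whatever callable runs your chunk parser on one sentence.

```python
def split_sentences(tagged_tokens, end_tag="."):
    """Yield lists of (word, tag) pairs, one list per sentence."""
    sentence = []
    for word, tag in tagged_tokens:
        sentence.append((word, tag))
        if tag == end_tag:            # sentence-final punctuation
            yield sentence
            sentence = []
    if sentence:                      # trailing tokens with no end tag
        yield sentence


def chunk_each_sentence(chunk_parse, tagged_tokens):
    """Apply chunk_parse to one sentence at a time and collect the results."""
    return [chunk_parse(s) for s in split_sentences(tagged_tokens)]


tokens = [("I", "PRP"), ("ran", "VBD"), (".", "."),
          ("Dogs", "NNS"), ("bark", "VBP"), (".", ".")]

# len is used here only as a placeholder for a real chunk parser.
print(chunk_each_sentence(len, tokens))
```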

  Emacs Tip
  ---------
    If you evaluate the following elisp expression in emacs, it will
    colorize C{ChunkString}s when you use an interactive python shell
    with emacs or xemacs ("C-c !")::

      (let ()
        (defconst comint-mode-font-lock-keywords 
          '(("<[^>]+>" 0 'font-lock-reference-face)
            ("[{}]" 0 'font-lock-function-name-face)))
        (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

    You can evaluate this code by copying it to a temporary buffer,
    placing the cursor after the last close parenthesis, and typing
    "C{C-x C-e}".  You should evaluate it before running the interactive
    session.  The change will last until you close emacs.

  Unresolved Issues
  -----------------
    If we use the C{re} module for regular expressions, Python's
    regular expression engine generates "maximum recursion depth
    exceeded" errors when processing very large texts, even for
    regular expressions that should not require any recursion.  We
    therefore use the C{pre} module instead.  But note that C{pre}
    does not include Unicode support, so this module will not work
    with Unicode strings.  Note also that C{pre} regular expressions
    are not quite as advanced as C{re} ones (e.g., no leftward
    zero-length assertions).

@type _VALID_CHUNK_STRING: C{regexp}
@var _VALID_CHUNK_STRING: A regular expression to test whether a chunk
     string is valid.
@type _VALID_TAG_PATTERN: C{regexp}
@var _VALID_TAG_PATTERN: A regular expression to test whether a tag
     pattern is valid.

Classes
ChinkRule A rule specifying how to add chinks to a ChunkString, using a matching tag pattern.
ChunkParseI A processing interface for identifying non-overlapping groups in unrestricted text.
ChunkRule A rule specifying how to add chunks to a ChunkString, using a matching tag pattern.
ChunkScore A utility class for scoring chunk parsers.
ChunkString A string-based encoding of a particular chunking of a text.
ExpandLeftRule A rule specifying how to expand chunks in a ChunkString to the left, using two matching tag patterns: a left pattern, and a right pattern.
ExpandRightRule A rule specifying how to expand chunks in a ChunkString to the right, using two matching tag patterns: a left pattern, and a right pattern.
MergeRule A rule specifying how to merge chunks in a ChunkString, using two matching tag patterns: a left pattern, and a right pattern.
RegexpChunk A regular expression based chunk parser.
RegexpChunkRule A rule specifying how to modify the chunking in a ChunkString, using a transformational regular expression.
SplitRule A rule specifying how to split chunks in a ChunkString, using two matching tag patterns: a left pattern, and a right pattern.
UnChunkRule A rule specifying how to remove chunks from a ChunkString, using a matching tag pattern.

Function Summary
  demo()
A demonstration for the RegexpChunk class.
  demo_cascade(chunkparsers, text)
Demonstration code for cascading chunk parsers.
  demo_eval(chunkparser, text)
Demonstration code for evaluating a chunk parser, using a ChunkScore.
string tag_pattern2re_pattern(tag_pattern)
Convert a tag pattern to a regular expression pattern.

Function Details

demo()

A demonstration for the RegexpChunk class. A single text is parsed with four different chunk parsers, using a variety of rules and strategies.

demo_cascade(chunkparsers, text)

Demonstration code for cascading chunk parsers.
Parameters:
text - The chunked tagged text that should be used for evaluation.
           (type=string)

demo_eval(chunkparser, text)

Demonstration code for evaluating a chunk parser, using a ChunkScore. This function assumes that text contains one sentence per line, and that each sentence has the form expected by tree.chunk. It runs the given chunk parser on each sentence in the text, and scores the result. It prints the final score (precision, recall, and f-measure), and reports the set of chunks that were missed and the set of chunks that were incorrect. (At most 10 missing chunks and 10 incorrect chunks are reported.)
Parameters:
chunkparser - The chunkparser to be tested
           (type=ChunkParseI)
text - The chunked tagged text that should be used for evaluation.
           (type=string)
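
The scores reported by this kind of evaluation can be sketched with plain set arithmetic. The example below is not the C{ChunkScore} class: the C{score_chunks} function is hypothetical, chunks are assumed to be hashable values such as C{(start, end)} token spans, and the f-measure is assumed to be the balanced harmonic mean of precision and recall.

```python
def score_chunks(guessed, correct, limit=10):
    """Compare guessed chunks against correct chunks (illustrative sketch)."""
    guessed, correct = set(guessed), set(correct)
    hits = guessed & correct

    precision = len(hits) / len(guessed) if guessed else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Balanced f-measure (an assumption; other weightings exist).
        f_measure = 2 * precision * recall / (precision + recall)

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        # Report at most `limit` chunks of each kind.
        "missed": sorted(correct - guessed)[:limit],
        "incorrect": sorted(guessed - correct)[:limit],
    }


# Chunks as (start, end) token spans for one sentence.
correct = [(0, 1), (2, 5), (6, 8)]
guessed = [(0, 1), (2, 4), (6, 8)]
print(score_chunks(guessed, correct))
```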

tag_pattern2re_pattern(tag_pattern)

Convert a tag pattern to a regular expression pattern. A tag pattern is a modified version of a regular expression, designed for matching sequences of tags. The differences between regular expression patterns and tag patterns are:
  • In tag patterns, '<' and '>' act as parentheses; so '<NN>+' matches one or more repetitions of '<NN>', not '<NN' followed by one or more repetitions of '>'.
  • Whitespace in tag patterns is ignored. So '<DT> | <NN>' is equivalent to '<DT>|<NN>'.
  • In tag patterns, '.' is equivalent to '[^{}<>]'; so '<NN.*>' matches any single tag starting with 'NN'.
In particular, tag_pattern2re_pattern performs the following transformations on the given pattern:
  • Replace '.' with '[^<>{}]'
  • Remove any whitespace
  • Add extra parens around '<' and '>', to make '<' and '>' act like parentheses. E.g., so that in '<NN>+', the '+' has scope over the entire '<NN>'; and so that in '<NN|IN>', the '|' has scope over 'NN' and 'IN', but not '<' or '>'.
  • Check to make sure the resulting pattern is valid.
Parameters:
tag_pattern - The tag pattern to convert to a regular expression pattern.
           (type=string)
Returns:
A regular expression pattern corresponding to tag_pattern.
           (type=string)
Raises:
ValueError - If tag_pattern is not a valid tag pattern. In particular, tag_pattern should not include braces; and it should not contain nested or mismatched angle-brackets.
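
The validity conditions described above (no braces, and no nested or mismatched angle brackets) can be approximated with a single regular expression. This is a sketch only: the names C{SIMPLE_TAG_PATTERN} and C{check_tag_pattern} are hypothetical, and the module's own C{_VALID_TAG_PATTERN} may accept or reject a different set of patterns.

```python
import re

# A pattern is accepted if it is a sequence of (a) characters that are
# not brackets or braces and (b) well-formed '<...>' groups whose
# contents contain no brackets or braces.
SIMPLE_TAG_PATTERN = re.compile(r"(?:[^<>{}]|<[^<>{}]*>)*")


def check_tag_pattern(tag_pattern):
    """Raise ValueError if tag_pattern fails the simplified validity test."""
    if not SIMPLE_TAG_PATTERN.fullmatch(tag_pattern):
        raise ValueError("Bad tag pattern: %r" % tag_pattern)
    return tag_pattern


for candidate in ["(<DT>|<JJ>)*<NN>", "<NN", "<<NN>>", "<NN>{2}"]:
    try:
        check_tag_pattern(candidate)
        print(candidate, "is accepted")
    except ValueError as error:
        print(error)
```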

Generated by Epydoc 2.1 on Tue Sep 5 09:37:22 2006 http://epydoc.sf.net