"""
Classes and interfaces for identifying non-overlapping linguistic
groups (such as base noun phrases) in unrestricted text.  This task is
called X{chunk parsing} or X{chunking}, and the identified groups are
called X{chunks}.  The chunked text is represented using a shallow
tree called a "chunk structure."  A X{chunk structure} is a tree
containing tokens and chunks, where each chunk is a subtree containing
only tokens.  For example, the chunk structure for base noun phrase
chunks in the sentence "I saw the big dog on the hill" is::

  (SENTENCE:
    (NP: <I>)
    <saw>
    (NP: <the> <big> <dog>)
    <on>
    (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the
chunk structure's L{leaves<Tree.leaves>} method.

The C{parser.chunk} module defines L{ChunkI}, a standard interface for
chunking texts; and L{RegexpChunk}, a regular-expression based
implementation of that interface.  It uses the L{tree.chunk} and
L{tree.conll_chunk} methods, which tokenize strings containing chunked
and tagged texts.  It also defines L{ChunkScore}, a utility class for
scoring chunk parsers.

RegexpChunk
===========

C{parse.RegexpChunk} is an implementation of the chunk parser interface
that uses regular expressions over tags to chunk a text.  Its
C{parse} method first constructs a C{ChunkString}, which encodes a
particular chunking of the input text.  Initially, nothing is
chunked.  C{parse.RegexpChunk} then applies a sequence of
C{RegexpChunkRule}s to the C{ChunkString}, each of which modifies
the chunking that it encodes.  Finally, the C{ChunkString} is
transformed back into a chunk structure, which is returned.

C{RegexpChunk} can only be used to chunk a single kind of phrase.
For example, you can use a C{RegexpChunk} to chunk the noun
phrases in a text, or the verb phrases in a text; but you cannot
use it to simultaneously chunk both noun phrases and verb phrases in
the same text.  (This is a limitation of C{RegexpChunk}, not of
chunk parsers in general.)
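The pipeline above boils down to string rewriting.  The following
self-contained sketch uses only the standard C{re} module with made-up
tags (it does not call C{RegexpChunk} itself) to show a single
chunking step on a C{ChunkString}-style encoding:

```python
import re

# A tagged sentence, encoded ChunkString-style: one angle-bracketed
# tag per token, with braces (none yet) marking chunk boundaries.
tag_string = '<DT><JJ><NN><VBD><DT><NN>'

# One chunking step: wrap every maximal run of DT/JJ/NN tags in
# braces.  This is the kind of transformation a RegexpChunkRule
# performs on the underlying string.
chunked = re.sub(r'((<DT>|<JJ>|<NN>)+)', r'{\1}', tag_string)
# chunked is now '{<DT><JJ><NN>}<VBD>{<DT><NN>}'
```

A real rule set applies several such substitutions in sequence before
converting the braced string back into a tree.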

RegexpChunkRules
----------------

C{RegexpChunkRule}s are transformational rules that update the
chunking of a text by modifying its C{ChunkString}.  Each
C{RegexpChunkRule} defines the C{apply} method, which modifies
the chunking encoded by a C{ChunkString}.  The
L{RegexpChunkRule} class itself can be used to implement any
transformational rule based on regular expressions.  There are
also a number of subclasses, which can be used to implement
simpler types of rules:

    - L{ChunkRule} chunks anything that matches a given regular
      expression.
    - L{ChinkRule} chinks anything that matches a given regular
      expression.
    - L{UnChunkRule} will un-chunk any chunk that matches a given
      regular expression.
    - L{MergeRule} can be used to merge two contiguous chunks.
    - L{SplitRule} can be used to split a single chunk into two
      smaller chunks.
    - L{ExpandLeftRule} will expand a chunk to incorporate new
      unchunked material on the left.
    - L{ExpandRightRule} will expand a chunk to incorporate new
      unchunked material on the right.
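Several of these rules reduce to one-line substitutions on the braced
encoding.  This sketch uses plain C{re} and hypothetical tags, not the
rule classes themselves, to illustrate chinking and merging:

```python
import re

# ChinkRule: excise <IN> from inside a chunk by closing the brace
# before it and reopening the brace after it.
s = '{<DT><NN><IN><DT><NN>}'
chinked = re.sub(r'<IN>', r'}<IN>{', s)
# chinked is now '{<DT><NN>}<IN>{<DT><NN>}'

# MergeRule: join two adjacent chunks by deleting the '}{' between them.
t = '{<DT><JJ>}{<NN>}'
merged = re.sub(r'\}\{', '', t)
# merged is now '{<DT><JJ><NN>}'
```

Note that this naive chink pattern would leave an empty C{'{}'} chunk
behind if C{<IN>} sat at a chunk boundary; the real rule classes guard
against producing such degenerate chunks.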

Tag Patterns
~~~~~~~~~~~~

C{RegexpChunkRule}s use a modified version of regular
expression patterns, called X{tag patterns}.  Tag patterns are
used to match sequences of tags.  Examples of tag patterns are::

    r'(<DT>|<JJ>|<NN>)+'
    r'<NN>+'
    r'<NN.*>'

The differences between regular expression patterns and tag
patterns are:

    - In tag patterns, C{'<'} and C{'>'} act as parentheses; so
      C{'<NN>+'} matches one or more repetitions of C{'<NN>'}, not
      C{'<NN'} followed by one or more repetitions of C{'>'}.
    - Whitespace in tag patterns is ignored.  So
      C{'<DT> | <NN>'} is equivalent to C{'<DT>|<NN>'}.
    - In tag patterns, C{'.'} is equivalent to C{'[^{}<>]'}; so
      C{'<NN.*>'} matches any single tag starting with C{'NN'}.

The function L{tag_pattern2re_pattern} can be used to transform
a tag pattern to an equivalent regular expression pattern.
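To make the translation concrete, here is a simplified, hypothetical
re-implementation of that conversion.  The name
C{tag_pattern_to_regex}, and the omission of pattern validation and
escaped-dot handling, are this sketch's simplifications, not the
module's actual code:

```python
import re

CHUNK_TAG_CHAR = r'[^\{\}<>]'

def tag_pattern_to_regex(tag_pattern):
    # Hypothetical, simplified converter for illustration only.
    pattern = re.sub(r'\s', '', tag_pattern)        # whitespace is ignored
    pattern = pattern.replace('<', '(<(')           # '<' opens a group
    pattern = pattern.replace('>', ')>)')           # '>' closes it
    pattern = pattern.replace('.', CHUNK_TAG_CHAR)  # '.' must not cross tags
    return pattern

# '<NN.*>' becomes a regexp matching any single tag whose name
# starts with 'NN':
nn_regex = tag_pattern_to_regex('<NN.*>')
```

A real converter must also reject malformed tag patterns and avoid
rewriting a literal (escaped) dot; C{tag_pattern2re_pattern} handles
those cases.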

Efficiency
----------

Preliminary tests indicate that C{RegexpChunk} can chunk at a
rate of about 300 tokens/second, with a moderately complex rule set.

There may be problems if C{RegexpChunk} is used with more than
5,000 tokens at a time.  In particular, evaluation of some regular
expressions may cause the Python regular expression engine to
exceed its maximum recursion depth.  We have attempted to minimize
these problems, but it is impossible to avoid them completely.  We
therefore recommend that you apply the chunk parser to a single
sentence at a time.

Emacs Tip
---------

If you evaluate the following elisp expression in emacs, it will
colorize C{ChunkString}s when you use an interactive python shell
with emacs or xemacs ("C-c !")::

    (let ()
      (defconst comint-mode-font-lock-keywords
        '(("<[^>]+>" 0 'font-lock-reference-face)
          ("[{}]" 0 'font-lock-function-name-face)))
      (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer,
placing the cursor after the last close parenthesis, and typing
"C{C-x C-e}".  You should evaluate it before running the interactive
session.  The change will last until you close emacs.

Unresolved Issues
-----------------

If we use the C{re} module for regular expressions, Python's
regular expression engine generates "maximum recursion depth
exceeded" errors when processing very large texts, even for
regular expressions that should not require any recursion.  We
therefore use the C{pre} module instead.  But note that C{pre}
does not include Unicode support, so this module will not work
with unicode strings.  Note also that C{pre} regular expressions
are not quite as advanced as C{re} ones (e.g., no leftward
zero-length assertions).

@type CHUNK_TAG_PATTERN: C{regexp}
@var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
    pattern is valid.
"""

import re
import types

from nltk_lite.parse import ParseI
from tree import Tree  # assumed: sibling module providing Tree (used below)
from convert import tree2conlltags

class ChunkParseI(ParseI):
    """
    A processing interface for identifying non-overlapping groups in
    unrestricted text.  Typically, chunk parsers are used to find base
    syntactic constituents, such as base noun phrases.  Unlike
    L{ParseI}, C{ChunkParseI} guarantees that the C{parse} method
    will always generate a parse.
    """
    def parse(self, tokens):
        """
        Find the best chunk structure for the given tokens
        and return a tree.

        @param tokens: The list of (word, tag) tokens to be chunked.
        @type tokens: L{list} of L{tuple}
        """
        assert 0, "ChunkParseI is an abstract interface"


CHUNK_TAG_CHAR = r'[^\{\}<>]'
CHUNK_TAG = r'(<%s+?>)' % CHUNK_TAG_CHAR

class ChunkString(object):
    """
    A string-based encoding of a particular chunking of a text.
    Internally, the C{ChunkString} class uses a single string to
    encode the chunking of the input text.  This string contains a
    sequence of angle-bracket delimited tags, with chunking indicated
    by braces.  An example of this encoding is::

        {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

    C{ChunkString}s are created from tagged texts (i.e., C{list}s of
    C{tokens} whose type is C{TaggedType}).  Initially, nothing is
    chunked.

    The chunking of a C{ChunkString} can be modified with the C{xform}
    method, which uses a regular expression to transform the string
    representation.  These transformations should only add and remove
    braces; they should I{not} modify the sequence of angle-bracket
    delimited tags.

    @type _str: C{string}
    @ivar _str: The internal string representation of the text's
        encoding.  This string representation contains a sequence of
        angle-bracket delimited tags, with chunking indicated by
        braces.  An example of this encoding is::

            {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

    @type _pieces: C{list} of pieces (tagged tokens and chunks)
    @ivar _pieces: The tagged tokens and chunks encoded by this C{ChunkString}.
    @ivar _debug: The debug level.  See the constructor docs.

    @cvar IN_CHUNK_PATTERN: A zero-width regexp pattern string that
        will only match positions that are in chunks.
    @cvar IN_CHINK_PATTERN: A zero-width regexp pattern string that
        will only match positions that are in chinks.
    """
    IN_CHUNK_PATTERN = r'(?=[^\{]*\})'
    IN_CHINK_PATTERN = r'(?=[^\}]*(\{|$))'

    # Building blocks for rule patterns, plus the patterns used by
    # _verify() to sanity-check the encoding after transformations.
    _CHUNK = r'(\{%s+?\})+?' % CHUNK_TAG
    _CHINK = r'(%s+?)+?' % CHUNK_TAG
    _VALID = re.compile(r'(\{?%s\}?)*?' % CHUNK_TAG)
    _BRACKETS = re.compile(r'[^\{\}]+')
    _BALANCED_BRACKETS = re.compile(r'(\{\})*$')

    def __init__(self, chunk_struct, debug_level=3):
        """
        Construct a new C{ChunkString} that encodes the chunking of
        the text C{tagged_tokens}.

        @type chunk_struct: C{Tree}
        @param chunk_struct: The chunk structure to be further chunked.
        @type debug_level: int
        @param debug_level: The level of debugging which should be
            applied to transformations on the C{ChunkString}.  The
            valid levels are:
                - 0: no checks
                - 1: full check on to_chunkstruct
                - 2: full check on to_chunkstruct and cursory check after
                  each transformation.
                - 3: full check on to_chunkstruct and full check after
                  each transformation.
            We recommend you use at least level 1.  You should
            probably use level 3 if you use any non-standard
            subclasses of C{RegexpChunkRule}.
        """
        self._top_node = chunk_struct.node
        self._pieces = chunk_struct[:]
        tags = [self._tag(tok) for tok in self._pieces]
        self._str = '<' + '><'.join(tags) + '>'
        self._debug = debug_level

    def _tag(self, tok):
        if isinstance(tok, tuple):
            return tok[1]
        elif isinstance(tok, Tree):
            return tok.node
        else:
            raise ValueError('chunk structures must contain tokens and trees')

    def _verify(self, verify_tags):
        """
        Check to make sure that C{_str} still corresponds to some chunked
        version of C{_pieces}.

        @type verify_tags: C{boolean}
        @param verify_tags: Whether the individual tags should be
            checked.  If this is false, C{_verify} will check to make
            sure that C{_str} encodes a chunked version of I{some}
            list of tokens.  If this is true, then C{_verify} will
            check to make sure that the tags in C{_str} match those in
            C{_pieces}.

        @raise ValueError: if this C{ChunkString}'s internal string
            representation is invalid or not consistent with _pieces.
        """
        # Check the overall form of the encoding.
        if not ChunkString._VALID.match(self._str):
            raise ValueError('Transformation generated invalid chunkstring: %s'
                             % self._str)

        # Check that the braces are balanced.  If the string is long,
        # do this in 5,000-character pieces, to avoid exceeding the
        # regular expression engine's maximum recursion depth.
        brackets = ChunkString._BRACKETS.sub('', self._str)
        for i in range(1 + len(brackets) / 5000):
            substr = brackets[i*5000:i*5000+5000]
            if not ChunkString._BALANCED_BRACKETS.match(substr):
                raise ValueError('Transformation generated invalid chunkstring: %s'
                                 % substr)

        if verify_tags <= 0: return

        # Check that the tags match the original pieces.
        tags1 = (re.split(r'[\{\}<>]+', self._str))[1:-1]
        tags2 = [self._tag(piece) for piece in self._pieces]
        if tags1 != tags2:
            raise ValueError('Transformation generated invalid chunkstring: %s / %s'
                             % (tags1, tags2))

    def to_chunkstruct(self, chunk_node='CHUNK'):
        """
        @return: the chunk structure encoded by this C{ChunkString}.
        @rtype: C{Tree}
        @raise ValueError: If a transformation has generated an
            invalid chunkstring.
        """
        if self._debug > 0: self._verify(1)

        # Split the string at braces; the resulting pieces alternate
        # between chinks (outside braces) and chunks (inside braces).
        pieces = []
        index = 0
        piece_in_chunk = 0
        for piece in re.split('[{}]', self._str):
            # Find the list of tokens contained in this piece.
            length = piece.count('<')
            subsequence = self._pieces[index:index+length]

            # Add this list of tokens to our pieces.
            if piece_in_chunk:
                pieces.append(Tree(chunk_node, subsequence))
            else:
                pieces += subsequence

            # Update index, piece_in_chunk.
            index += length
            piece_in_chunk = not piece_in_chunk

        return Tree(self._top_node, pieces)

    def __repr__(self):
        """
        @rtype: C{string}
        @return: A string representation of this C{ChunkString}.  This
            string representation has the form::

                <ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>

        """
        return '<ChunkString: %s>' % repr(self._str)

    def __str__(self):
        """
        @rtype: C{string}
        @return: A formatted representation of this C{ChunkString}'s
            string encoding.  This representation will include extra
            spaces to ensure that tags will line up with the
            representation of other C{ChunkStrings} for the same text,
            regardless of the chunking.
        """
        # Add spaces to make everything line up.
        s = re.sub(r'>(?!\})', r'> ', self._str)
        s = re.sub(r'([^\{])<', r'\1 <', s)
        if s[0] == '<': s = ' ' + s
        return s



from nltk_lite import evaluate

def accuracy(chunker, gold):
    """
    Score the accuracy of the chunker against the gold standard.
    Strip the chunk information from the gold standard and rechunk it
    using the chunker, then compute the accuracy score.

    @type chunker: C{ChunkParseI}
    @param chunker: The chunker being evaluated.
    @type gold: C{tree}
    @param gold: The chunk structures to score the chunker on.
    @rtype: C{float}
    """
    gold_tags = []
    test_tags = []
    for gold_tree in gold:
        test_tree = chunker.parse(gold_tree.flatten())
        gold_tags += tree2conlltags(gold_tree)
        test_tags += tree2conlltags(test_tree)

    return evaluate.accuracy(gold_tags, test_tags)


class ChunkScore(object):
    """
    A utility class for scoring chunk parsers.  C{ChunkScore} can
    evaluate a chunk parser's output, based on a number of statistics
    (precision, recall, f-measure, missed chunks, incorrect chunks).
    It can also combine the scores from the parsing of multiple texts;
    this makes it significantly easier to evaluate a chunk parser that
    operates one sentence at a time.

    Texts are evaluated with the C{score} method.  The results of
    evaluation can be accessed via a number of accessor methods, such
    as C{precision} and C{f_measure}.  A typical use of the
    C{ChunkScore} class is::

        >>> chunkscore = ChunkScore()
        >>> for correct in correct_sentences:
        ...     guess = chunkparser.parse(correct.leaves())
        ...     chunkscore.score(correct, guess)
        >>> print 'F Measure:', chunkscore.f_measure()
        F Measure: 0.823

    @ivar kwargs: Keyword arguments:

        - max_tp_examples: The maximum number of actual examples of
          true positives to record.  This affects the C{correct}
          member function: C{correct} will not return more than this
          number of true positive examples.  This does *not* affect
          any of the numerical metrics (precision, recall, or
          f-measure).

        - max_fp_examples: The maximum number of actual examples of
          false positives to record.  This affects the C{incorrect}
          member function and the C{guessed} member function:
          C{incorrect} will not return more than this number of
          examples, and C{guessed} will not return more than this
          number of true positive examples.  This does *not* affect
          any of the numerical metrics (precision, recall, or
          f-measure).

        - max_fn_examples: The maximum number of actual examples of
          false negatives to record.  This affects the C{missed}
          member function and the C{correct} member function:
          C{missed} will not return more than this number of examples,
          and C{correct} will not return more than this number of true
          negative examples.  This does *not* affect any of the
          numerical metrics (precision, recall, or f-measure).

    @type _tp: C{list} of C{Token}
    @ivar _tp: List of true positives
    @type _fp: C{list} of C{Token}
    @ivar _fp: List of false positives
    @type _fn: C{list} of C{Token}
    @ivar _fn: List of false negatives

    @type _tp_num: C{int}
    @ivar _tp_num: Number of true positives
    @type _fp_num: C{int}
    @ivar _fp_num: Number of false positives
    @type _fn_num: C{int}
    @ivar _fn_num: Number of false negatives.
    """
    def __init__(self, **kwargs):
        self._correct = set()
        self._guessed = set()
        self._tp = set()
        self._fp = set()
        self._fn = set()
        self._max_tp = kwargs.get('max_tp_examples', 100)
        self._max_fp = kwargs.get('max_fp_examples', 100)
        self._max_fn = kwargs.get('max_fn_examples', 100)
        self._tp_num = 0
        self._fp_num = 0
        self._fn_num = 0
        self._count = 0

        self._measuresNeedUpdate = False

    def _updateMeasures(self):
        if self._measuresNeedUpdate:
            self._tp = self._guessed & self._correct
            self._fn = self._correct - self._guessed
            self._fp = self._guessed - self._correct
            self._tp_num = len(self._tp)
            self._fp_num = len(self._fp)
            self._fn_num = len(self._fn)
            self._measuresNeedUpdate = False

    def score(self, correct, guessed):
        """
        Given a correctly chunked sentence, score another chunked
        version of the same sentence.

        @type correct: chunk structure
        @param correct: The known-correct ("gold standard") chunked
            sentence.
        @type guessed: chunk structure
        @param guessed: The chunked sentence to be scored.
        """
        self._correct |= _chunksets(correct, self._count)
        self._guessed |= _chunksets(guessed, self._count)
        self._count += 1
        self._measuresNeedUpdate = True

    def precision(self):
        """
        @return: the overall precision for all texts that have been
            scored by this C{ChunkScore}.
        @rtype: C{float}
        """
        self._updateMeasures()
        div = self._tp_num + self._fp_num
        if div == 0: return 0
        else: return float(self._tp_num) / div

    def recall(self):
        """
        @return: the overall recall for all texts that have been
            scored by this C{ChunkScore}.
        @rtype: C{float}
        """
        self._updateMeasures()
        div = self._tp_num + self._fn_num
        if div == 0: return 0
        else: return float(self._tp_num) / div

    def f_measure(self, alpha=0.5):
        """
        @return: the overall F measure for all texts that have been
            scored by this C{ChunkScore}.
        @rtype: C{float}

        @param alpha: the relative weighting of precision and recall.
            Larger alpha biases the score towards the precision value,
            while smaller alpha biases the score towards the recall
            value.  C{alpha} should have a value in the range [0,1].
        @type alpha: C{float}
        """
        self._updateMeasures()
        p = self.precision()
        r = self.recall()
        if p == 0 or r == 0:
            return 0
        return 1 / (alpha/p + (1-alpha)/r)

    def missed(self):
        """
        @rtype: C{list} of chunks
        @return: the chunks which were included in the
            correct chunk structures, but not in the guessed chunk
            structures, listed in input order.
        """
        self._updateMeasures()
        chunks = list(self._fn)
        return [c[1] for c in chunks]

    def incorrect(self):
        """
        @rtype: C{list} of chunks
        @return: the chunks which were included in the
            guessed chunk structures, but not in the correct chunk
            structures, listed in input order.
        """
        self._updateMeasures()
        chunks = list(self._fp)
        return [c[1] for c in chunks]

    def correct(self):
        """
        @rtype: C{list} of chunks
        @return: the chunks which were included in the correct
            chunk structures, listed in input order.
        """
        chunks = list(self._correct)
        return [c[1] for c in chunks]

    def guessed(self):
        """
        @rtype: C{list} of chunks
        @return: the chunks which were included in the guessed
            chunk structures, listed in input order.
        """
        chunks = list(self._guessed)
        return [c[1] for c in chunks]

    def __len__(self):
        # The number of chunks in the correct structures (tp + fn).
        self._updateMeasures()
        return self._tp_num + self._fn_num

    def __repr__(self):
        """
        @rtype: C{String}
        @return: a concise representation of this C{ChunkScoring}.
        """
        return '<ChunkScoring of %d chunks>' % len(self)

    def __str__(self):
        """
        @rtype: C{String}
        @return: a verbose representation of this C{ChunkScoring}.
            This representation includes the precision, recall, and
            f-measure scores.  For other information about the score,
            use the accessor methods (e.g., C{missed()} and
            C{incorrect()}).
        """
        return ("ChunkParse score:\n" +
                ("    Precision: %5.1f%%\n" % (self.precision()*100)) +
                ("    Recall:    %5.1f%%\n" % (self.recall()*100)) +
                ("    F-Measure: %5.1f%%" % (self.f_measure()*100)))

    """
    @return: The list of tokens contained in C{text}.
    """
    return [tok for tok in text if isinstance(tok, AbstractTree)]


from regexp import *
from convert import *
677