
Source Code for Module nltk_lite.corpora.chat80

# Natural Language Toolkit: Chat-80 KB Reader
# See http://www.w3.org/TR/swbp-skos-core-guide/
#
# Copyright (C) 2001-2007 University of Pennsylvania
# Author: Ewan Klein <ewan@inf.ed.ac.uk>,
# URL: <http://nltk.sourceforge.net>
# For license information, see LICENSE.TXT

  9  """ 
 10  Overview 
 11  ======== 
 12   
 13  Chat-80 was a natural language system which allowed the user to 
 14  interrogate a Prolog knowledge base in the domain of world 
 15  geography. It was developed in the early '80s by Warren and Pereira; see 
 16  U{http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf} for a description and 
 17  U{http://www.cis.upenn.edu/~pereira/oldies.html} for the source 
 18  files. 
 19   
 20  This module contains functions to extract data from the Chat-80 
 21  relation files ('the world database'), and convert then into a format 
 22  that can be incorporated in the FOL models of 
 23  L{nltk_lite.semantics.evaluate}. The code assumes that the Prolog 
 24  input files are available in the NLTK corpora directory. 
 25   
 26  The Chat-80 World Database consists of the following files:: 
 27   
 28      world0.pl 
 29      rivers.pl 
 30      cities.pl 
 31      countries.pl 
 32      contain.pl 
 33      borders.pl 
 34   
 35  This module uses a slightly modified version of C{world0.pl}, in which 
 36  a set of Prolog rules have been omitted. The modified file is named 
 37  C{world1.pl}. Currently, the file C{rivers.pl} is not read in, since 
 38  it uses a list rather than a string in the second field. 
 39   
Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation acts as the name of the table; the first argument acts as the
'primary key'; and subsequent arguments are further fields in the
table. In general, the name of the table provides a label for a unary
predicate whose extension is all the primary keys. For example,
relations in C{cities.pl} are of the following form::

   'city(athens,greece,1368).'

Here, C{'athens'} is the key, and will be mapped to a member of the
unary predicate M{city}.

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
M{population_of}, whose extension is a set of pairs such as
C{('athens', '1368')}.
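
Schematically, then, the single fact above contributes the following
tuples to the extracted extensions (only the tuples derived from this
fact are shown; the binary predicate for the second field is labelled
in the same way as M{population_of})::

   city            'athens'
   country_of      ('athens', 'greece')
   population_of   ('athens', '1368')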

An exception to this general framework is required by the relations in
the files C{borders.pl} and C{contain.pl}. These contain facts of the
following form::

    'borders(albania,greece).'

    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation just to be C{'border'}/C{'contain'} respectively.

In order to drive the extraction process, we use 'relation metadata bundles',
which are Python dictionaries such as the following::

  city = {'rel_name': 'city',
          'closures': [],
          'schema': ['city', 'country', 'population'],
          'filename': 'cities.pl'}

According to this, the file C{city['filename']} contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is C{city['rel_name']} and whose
relational schema is C{city['schema']}. The notion of a C{closure} is
discussed in the next section.
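
For example, the C{city} bundle above can be passed field-by-field to
the L{clause2concepts} function defined below (this assumes that the
Chat-80 files are installed in the NLTK corpora directory)::

   city_concepts = clause2concepts(city['filename'], city['rel_name'],
                                   city['closures'], city['schema'])

The call returns one unary L{Concept} for the primary key (C{'city'})
and one binary L{Concept} for each remaining field (C{'country_of'} and
C{'population_of'}).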

Concepts
========
In order to encapsulate the results of the extraction, a class of
L{Concept}s is introduced.  A L{Concept} object has a number of
attributes, in particular a C{prefLabel} and C{extension}, which make
it easier to inspect the output of the extraction. In addition, the
C{extension} can be further processed: in the case of the C{'border'}
relation, we compute the B{symmetric closure}, and in the case of the
C{'contain'} relation, we compute the B{transitive closure}. The
closure properties associated with a concept are specified in its
relation metadata bundle, as described earlier.

The C{extension} of a L{Concept} object is then incorporated into a
L{Valuation} object.
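
For example (again assuming that the Chat-80 files are installed in the
NLTK corpora directory), a valuation covering just the city data can be
built with the functions defined below::

   city_concepts = concepts('city')
   val = make_valuation(city_concepts, read=True)

Setting C{read=True} reads the C{(label, extension)} pairs into a
L{Valuation} object, and also adds an individual constant for each
entity in the resulting domain.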

Persistence
===========
The functions L{val_dump} and L{val_load} are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.

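For example, assuming C{rels} is the list of relation metadata bundles
defined in this module, a valuation can be dumped under a file name of
our choosing (here, the hypothetical C{'chat80_db'}) and re-loaded
later::

   val_dump(rels, 'chat80_db')
   ...
   val = val_load('chat80_db')

The suffix C{'.db'} is appended automatically by L{val_dump} and should
be omitted when calling L{val_load}.
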
Individuals and Lexical Items
=============================
As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as C{'zloty'}, we add to the valuation
a pair C{('zloty', 'zloty')}. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

   PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file C{chat_pnames.cfg} in the
current directory.

"""

import re
import nltk_lite.semantics.evaluate as evaluate
import shelve, os, sys
from nltk_lite.corpora import get_basedir

###########################################################################
# Chat-80 relation metadata bundles needed to build the valuation
###########################################################################

borders = {'rel_name': 'borders',
           'closures': ['symmetric'],
           'schema': ['region', 'border'],
           'filename': 'borders.pl'}

contains = {'rel_name': 'contains0',
            'closures': ['transitive'],
            'schema': ['region', 'contain'],
            'filename': 'contain.pl'}

city = {'rel_name': 'city',
        'closures': [],
        'schema': ['city', 'country', 'population'],
        'filename': 'cities.pl'}

country = {'rel_name': 'country',
           'closures': [],
           'schema': ['country', 'region', 'latitude', 'longitude',
                      'area', 'population', 'capital', 'currency'],
           'filename': 'countries.pl'}

circle_of_lat = {'rel_name': 'circle_of_latitude',
                 'closures': [],
                 'schema': ['circle_of_latitude', 'degrees'],
                 'filename': 'world1.pl'}

circle_of_long = {'rel_name': 'circle_of_longitude',
                  'closures': [],
                  'schema': ['circle_of_longitude', 'degrees'],
                  'filename': 'world1.pl'}

continent = {'rel_name': 'continent',
             'closures': [],
             'schema': ['continent'],
             'filename': 'world1.pl'}

region = {'rel_name': 'in_continent',
          'closures': [],
          'schema': ['region', 'continent'],
          'filename': 'world1.pl'}

ocean = {'rel_name': 'ocean',
         'closures': [],
         'schema': ['ocean'],
         'filename': 'world1.pl'}

sea = {'rel_name': 'sea',
       'closures': [],
       'schema': ['sea'],
       'filename': 'world1.pl'}


items = ['borders', 'contains', 'city', 'country', 'circle_of_lat',
         'circle_of_long', 'continent', 'region', 'ocean', 'sea']

item_metadata = {
    'borders': borders,
    'contains': contains,
    'city': city,
    'country': country,
    'circle_of_lat': circle_of_lat,
    'circle_of_long': circle_of_long,
    'continent': continent,
    'region': region,
    'ocean': ocean,
    'sea': sea
    }

rels = item_metadata.values()

not_unary = ['borders.pl', 'contain.pl']

###########################################################################

class Concept(object):
    """
    A Concept class, loosely
    based on SKOS (U{http://www.w3.org/TR/swbp-skos-core-guide/}).
    """
    def __init__(self, prefLabel, arity, altLabels=[], closures=[], extension=set()):
        """
        @param prefLabel: the preferred label for the concept
        @type prefLabel: str
        @param arity: the arity of the concept
        @type arity: int
        @keyword altLabels: other (related) labels
        @type altLabels: list
        @keyword closures: closure properties of the extension
            (list items can be C{symmetric}, C{reflexive}, C{transitive})
        @type closures: list
        @keyword extension: the extensional value of the concept
        @type extension: set
        """
        self.prefLabel = prefLabel
        self.arity = arity
        self.altLabels = altLabels
        self.closures = closures
        self.extension = extension

    def __str__(self):
        return "Label = '%s'\nArity = %s\nExtension = %s" % \
               (self.prefLabel, self.arity, sorted(self.extension))

    def __repr__(self):
        return "Concept('%s')" % self.prefLabel

    def augment(self, data):
        """
        Add more data to the C{Concept}'s extension set.

        @param data: a new semantic value
        @type data: string or pair of strings
        @rtype: set
        """
        self.extension.add(data)
        return self.extension

    def _make_graph(self, s):
        """
        Convert a set of pairs into an adjacency linked list encoding of a graph.
        """
        g = {}
        for (x, y) in s:
            if x in g:
                g[x].append(y)
            else:
                g[x] = [y]
        return g

    def _transclose(self, g):
        """
        Compute the transitive closure of a graph represented as a linked list.
        """
        for x in g:
            for adjacent in g[x]:
                # check that adjacent is a key
                if adjacent in g:
                    for y in g[adjacent]:
                        if y not in g[x]:
                            g[x].append(y)
        return g

    def _make_pairs(self, g):
        """
        Convert an adjacency linked list back into a set of pairs.
        """
        pairs = []
        for node in g:
            for adjacent in g[node]:
                pairs.append((node, adjacent))
        return set(pairs)

    def close(self):
        """
        Close a binary relation in the C{Concept}'s extension set.

        @return: a new extension for the C{Concept} in which the
            relation is closed under a given property
        """
        assert evaluate.isrel(self.extension)
        if 'symmetric' in self.closures:
            pairs = []
            for (x, y) in self.extension:
                pairs.append((y, x))
            sym = set(pairs)
            self.extension = self.extension.union(sym)
        if 'transitive' in self.closures:
            all = self._make_graph(self.extension)
            closed = self._transclose(all)
            trans = self._make_pairs(closed)
            #print sorted(trans)
            self.extension = self.extension.union(trans)

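# A minimal usage sketch of the Concept class (illustrative only, not
# executed at import time); it assumes the 'borders(albania,greece).' fact
# shown in the module docstring:
#
#   c = Concept('border', arity=2, closures=['symmetric'], extension=set())
#   c.augment(('albania', 'greece'))
#   c.close()
#   assert ('greece', 'albania') in c.extension
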
def clause2concepts(filename, rel_name, closures, schema):
    """
    Convert a file of Prolog clauses into a list of L{Concept} objects.

    @param filename: filename containing the relations
    @type filename: string
    @param rel_name: name of the relation
    @type rel_name: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param schema: the schema used in a set of relational tuples
    @type schema: list
    @return: a list of L{Concept}s
    @rtype: list
    """
    concepts = []
    # position of the subject of a binary relation
    subj = 0
    # label of the 'primary key'
    pkey = schema[0]
    # fields other than the primary key
    fields = schema[1:]

    # convert a file into a list of lists
    records = _str2records(filename, rel_name)

    # add a unary concept corresponding to the set of entities
    # in the primary key position
    # relations in 'not_unary' are more like ordinary binary relations
    if filename not in not_unary:
        concepts.append(unary_concept(pkey, subj, records))

    # add a binary concept for each non-key field
    for field in fields:
        obj = schema.index(field)
        concepts.append(binary_concept(field, closures, subj, obj, records))

    return concepts

def _str2records(filename, rel):
    """
    Read a file into memory and convert each relation clause into a list.
    """
    recs = []
    path = os.path.join(get_basedir(), "chat80", filename)
    for line in open(path):
        if line.startswith(rel):
            line = re.sub(rel+r'\(', '', line)
            line = re.sub(r'\)\.$', '', line)
            line = line[:-1]
            record = line.split(',')
            recs.append(record)
    return recs

def unary_concept(label, subj, records):
    """
    Make a unary concept out of the primary key in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key.

    @param label: the preferred label for the concept
    @type label: string
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 1
    @rtype: L{Concept}
    """
    c = Concept(label, arity=1, extension=set())
    for record in records:
        c.augment(record[subj])
    return c

def binary_concept(label, closures, subj, obj, records):
    """
    Make a binary concept out of the primary key and another field in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key, and C{'paris'} stands in the C{'capital_of'} relation to
    C{'france'}.

    More generally, given a record such as C{['a', 'b', 'c']}, where
    label is bound to C{'B'}, and C{obj} bound to 1, the derived
    binary concept will have label C{'B_of'}, and its extension will
    be a set of pairs such as C{('a', 'b')}.

    @param label: the base part of the preferred label for the concept
    @type label: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param obj: position in the record of the object of the predicate
    @type obj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 2
    @rtype: L{Concept}
    """
    if not label == 'border' and not label == 'contain':
        label = label + '_of'
    c = Concept(label, arity=2, closures=closures, extension=set())
    for record in records:
        c.augment((record[subj], record[obj]))
    # close the concept's extension according to the properties in closures
    c.close()
    return c


def process_bundle(rels):
    """
    Given a list of relation metadata bundles, make a corresponding
    dictionary of concepts, indexed by the relation name.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @return: a dictionary of concepts, indexed by the relation name.
    @rtype: dict
    """
    concepts = {}
    for rel in rels:
        rel_name = rel['rel_name']
        closures = rel['closures']
        schema = rel['schema']
        filename = rel['filename']

        concept_list = clause2concepts(filename, rel_name, closures, schema)
        for c in concept_list:
            label = c.prefLabel
            if label in concepts:
                for data in c.extension:
                    concepts[label].augment(data)
                concepts[label].close()
            else:
                concepts[label] = c
    return concepts


def make_valuation(concepts, read=False, lexicon=False):
    """
    Convert a list of C{Concept}s into a list of (label, extension) pairs;
    optionally create a C{Valuation} object.

    @param concepts: concepts
    @type concepts: list of L{Concept}s
    @param read: if C{True}, C{(symbol, set)} pairs are read into a C{Valuation}
    @type read: bool
    @param lexicon: if C{True}, also write a file of lexical entries for the
        individual constants (implies C{read=True})
    @type lexicon: bool
    @rtype: list or a L{Valuation}
    """
    vals = []

    for c in concepts:
        vals.append((c.prefLabel, c.extension))
    if lexicon: read = True
    if read:
        val = evaluate.Valuation()
        val.read(vals)
        # add labels for individuals
        val = label_indivs(val, lexicon=lexicon)
        return val
    else: return vals


def val_dump(rels, db):
    """
    Make a L{Valuation} from a list of relation metadata bundles and dump to
    persistent database.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @param db: name of file to which data is written.
        The suffix '.db' will be automatically appended.
    @type db: string
    """
    concepts = process_bundle(rels).values()
    valuation = make_valuation(concepts, read=True)
    db_out = shelve.open(db, 'n')

    db_out.update(valuation)

    db_out.close()


def val_load(db):
    """
    Load a L{Valuation} from a persistent database.

    @param db: name of file from which data is read.
        The suffix '.db' should be omitted from the name.
    @type db: string
    """
    dbname = db+".db"

    if not os.access(dbname, os.R_OK):
        sys.exit("Cannot read file: %s" % dbname)
    else:
        db_in = shelve.open(db)
        val = evaluate.Valuation(db_in)
        # val.read(db_in.items())
        return val


def alpha(str):
    """
    Utility to filter out non-alphabetic constants.

    @param str: candidate constant
    @type str: string
    @rtype: bool
    """
    try:
        int(str)
        return False
    except ValueError:
        # some unknown values in records are labeled '?'
        if not str == '?':
            return True


def label_indivs(valuation, lexicon=False):
    """
    Assign individual constants to the individuals in the domain of a C{Valuation}.

    Given a valuation with an entry of the form {'rel': {'a': True}},
    add a new entry {'a': 'a'}.

    @type valuation: L{Valuation}
    @rtype: L{Valuation}
    """
    # collect all the individuals into a domain
    domain = valuation.domain
    # convert the domain into a sorted list of alphabetic terms
    entities = sorted(e for e in domain if alpha(e))
    # use the same string as a label
    pairs = [(e, e) for e in entities]
    if lexicon:
        lex = make_lex(entities)
        open("chat_pnames.cfg", mode='w').writelines(lex)
    # read the pairs into the valuation
    valuation.read(pairs)
    return valuation

def make_lex(symbols):
    """
    Create lexical CFG rules for each individual symbol.

    Given a valuation with an entry of the form {'zloty': 'zloty'},
    create a lexical rule for the proper name 'Zloty'.

    @param symbols: a list of individual constants in the semantic representation
    @type symbols: sequence
    @rtype: list
    """
    lex = []
    header = """
##################################################################
# Lexical rules automatically generated by running 'chat80.py -x'.
##################################################################

"""
    lex.append(header)
    template = "PropN[num=sg, sem=<\P.(P %s)>] -> '%s'\n"

    for s in symbols:
        parts = s.split('_')
        caps = [p.capitalize() for p in parts]
        pname = ('_').join(caps)
        rule = template % (s, pname)
        lex.append(rule)
    return lex


###########################################################################
# Interface function to emulate other corpus readers
###########################################################################

def concepts(items = items):
    """
    Build a list of concepts corresponding to the relation names in C{items}.

    @param items: names of the Chat-80 relations to extract
    @type items: list of strings
    @return: the L{Concept}s which are extracted from the relations
    @rtype: list
    """
    if type(items) is str: items = (items,)

    rels = [item_metadata[r] for r in items]

    concept_map = process_bundle(rels)
    return concept_map.values()


###########################################################################


def main():
    import sys
    from optparse import OptionParser
    description = \
    """
    Extract data from the Chat-80 Prolog files and convert them into a
    Valuation object for use in the NLTK semantics package.
    """

    opts = OptionParser(description=description)
    opts.set_defaults(verbose=True, lex=False, vocab=False)
    opts.add_option("-s", "--store", dest="outdb",
                    help="store a valuation in DB", metavar="DB")
    opts.add_option("-l", "--load", dest="indb",
                    help="load a stored valuation from DB", metavar="DB")
    opts.add_option("-c", "--concepts", action="store_true",
                    help="print concepts instead of a valuation")
    opts.add_option("-r", "--relation", dest="label",
                    help="print concept with label REL (check possible labels with '-v' option)",
                    metavar="REL")
    opts.add_option("-q", "--quiet", action="store_false", dest="verbose",
                    help="don't print out progress info")
    opts.add_option("-x", "--lex", action="store_true", dest="lex",
                    help="write a file of lexical entries for country names, then exit")
    opts.add_option("-v", "--vocab", action="store_true", dest="vocab",
                    help="print out the vocabulary of concept labels and their arity, then exit")

    (options, args) = opts.parse_args()
    if options.outdb and options.indb:
        opts.error("Options --store and --load are mutually exclusive")


    if options.outdb:
        # write the valuation to a persistent database
        if options.verbose:
            outdb = options.outdb+".db"
            print "Dumping a valuation to %s" % outdb
        val_dump(rels, options.outdb)
        sys.exit(0)
    else:
        # try to read in a valuation from a database
        if options.indb is not None:
            dbname = options.indb+".db"
            if not os.access(dbname, os.R_OK):
                sys.exit("Cannot read file: %s" % dbname)
            else:
                valuation = val_load(options.indb)
        # we need to create the valuation from scratch
        else:
            # build some concepts
            concept_map = process_bundle(rels)
            concepts = concept_map.values()
            # just print out the vocabulary
            if options.vocab:
                items = [(c.arity, c.prefLabel) for c in concepts]
                items.sort()
                for (arity, label) in items:
                    print label, arity
                sys.exit(0)
            # show all the concepts
            if options.concepts:
                for c in concepts:
                    print c
                    print
            if options.label:
                print concept_map[options.label]
                sys.exit(0)
            else:
                # turn the concepts into a Valuation
                if options.lex:
                    if options.verbose:
                        print "Writing out lexical rules"
                    make_valuation(concepts, lexicon=True)
                else:
                    valuation = make_valuation(concepts, read=True)
                    print valuation


if __name__ == '__main__':
    main()