"""
Overview
========

Chat-80 was a natural language system which allowed the user to
interrogate a Prolog knowledge base in the domain of world
geography. It was developed in the early '80s by Warren and Pereira; see
U{http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf} for a description and
U{http://www.cis.upenn.edu/~pereira/oldies.html} for the source
files.

This module contains functions to extract data from the Chat-80
relation files ('the world database'), and convert them into a format
that can be incorporated in the FOL models of
L{nltk_lite.semantics.evaluate}. The code assumes that the Prolog
input files are available in the NLTK corpora directory.

The Chat-80 World Database consists of the following files::

    world0.pl
    rivers.pl
    cities.pl
    countries.pl
    contain.pl
    borders.pl

This module uses a slightly modified version of C{world0.pl}, in which
a set of Prolog rules has been omitted. The modified file is named
C{world1.pl}. Currently, the file C{rivers.pl} is not read in, since
it uses a list rather than a string in the second field.

Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation acts as the name of the table; the first argument acts as the
'primary key'; and subsequent arguments are further fields in the
table. In general, the name of the table provides a label for a unary
predicate whose extension is all the primary keys. For example,
relations in C{cities.pl} are of the following form::

   'city(athens,greece,1368).'

Here, C{'athens'} is the key, and will be mapped to a member of the
unary predicate M{city}.

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
M{population_of}, whose extension is a set of pairs such as C{'(athens,
1368)'}.
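
The mapping from records to predicates can be sketched in a few lines
of standalone Python. This is only an illustration under stated
assumptions, not the code used by this module: the helper name
C{extensions_from_records} is hypothetical, and the second record is
invented for the example:

```python
def extensions_from_records(records, schema):
    # Hypothetical helper: the first schema entry labels the unary
    # predicate; every other entry yields a binary predicate '<field>_of'.
    pkey, fields = schema[0], schema[1:]
    exts = {pkey: set()}
    for field in fields:
        exts[field + '_of'] = set()
    for record in records:
        key = record[0]
        # the primary key joins the extension of the unary predicate
        exts[pkey].add(key)
        # each remaining field is paired with the primary key
        for position, field in enumerate(fields, start=1):
            exts[field + '_of'].add((key, record[position]))
    return exts

# The first record comes from the example above; the second is made up.
records = [['athens', 'greece', '1368'], ['exampleville', 'nowhere', '42']]
schema = ['city', 'country', 'population']
exts = extensions_from_records(records, schema)
```

With these inputs, C{exts['city']} holds the two keys and
C{exts['population_of']} holds one pair per record.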

An exception to this general framework is required by the relations in
the files C{borders.pl} and C{contain.pl}. These contain facts of the
following form::

    'borders(albania,greece).'

    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation just to be C{'border'}/C{'contain'} respectively.

In order to drive the extraction process, we use 'relation metadata bundles'
which are Python dictionaries such as the following::

  city = {'rel_name': 'city',
          'closures': [],
          'schema': ['city', 'country', 'population'],
          'filename': 'cities.pl'}

According to this, the file C{city['filename']} contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is C{city['rel_name']} and whose
relational schema is C{city['schema']}. The notion of a C{closure} is
discussed in the next section.
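
As a rough sketch of how a bundle can drive the reading of Prolog
clauses, the following standalone Python turns matching lines into
records. The helper name C{parse_clause} is hypothetical and is not
the reader used by this module; it assumes one clause per line with
plain comma-separated atoms, so it would not cope with the list-valued
field in C{rivers.pl}:

```python
def parse_clause(line, rel_name):
    # Hypothetical helper: turn 'city(athens,greece,1368).' into
    # ['athens', 'greece', '1368']; return None for any other line.
    line = line.strip()
    prefix = rel_name + '('
    if not (line.startswith(prefix) and line.endswith(').')):
        return None
    # drop the 'city(' prefix and the trailing ').'
    body = line[len(prefix):-2]
    return body.split(',')

city = {'rel_name': 'city',
        'closures': [],
        'schema': ['city', 'country', 'population'],
        'filename': 'cities.pl'}

lines = ['city(athens,greece,1368).', 'borders(albania,greece).']
parsed = [parse_clause(line, city['rel_name']) for line in lines]
records = [record for record in parsed if record is not None]
```

Only the first line matches the predicate symbol C{city['rel_name']},
so C{records} contains a single record.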

Concepts
========
In order to encapsulate the results of the extraction, a class of
L{Concept}s is introduced. A L{Concept} object has a number of
attributes, in particular a C{prefLabel} and C{extension}, which make
it easier to inspect the output of the extraction. In addition, the
C{extension} can be further processed: in the case of the C{'border'}
relation, we make the relation B{symmetric} by adding the converse of
each pair, and in the case of the C{'contain'} relation, we compute the
B{transitive closure}. The closure properties associated with a concept
are indicated in the relation metadata, as described earlier.
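
The two closure operations can be illustrated on plain sets of pairs.
The sketch below uses a simple fixed-point iteration for the
transitive closure, which is a different technique from the adjacency
list traversal in L{Concept}; the function names are hypothetical and
the second 'contain' pair is included purely as sample data:

```python
def symmetric_closure(pairs):
    # add the converse (y, x) of every pair (x, y)
    return set(pairs) | set((y, x) for (x, y) in pairs)

def transitive_closure(pairs):
    # repeatedly add (x, z) whenever (x, y) and (y, z) are present,
    # stopping once an iteration produces nothing new
    closed = set(pairs)
    while True:
        new = set((x, z)
                  for (x, y) in closed
                  for (y2, z) in closed
                  if y == y2)
        if new <= closed:
            return closed
        closed |= new

border = symmetric_closure([('albania', 'greece')])
contain = transitive_closure([('africa', 'central_africa'),
                              ('central_africa', 'chad')])
```

Here C{border} gains the pair C{('greece', 'albania')}, and C{contain}
gains the derived pair C{('africa', 'chad')}.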

The C{extension} of a L{Concept} object is then incorporated into a
L{Valuation} object.

Persistence
===========
The functions L{val_dump} and L{val_load} are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.
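
The general idea of storing and re-loading a mapping with the standard
C{shelve} module can be sketched as follows. The helper names
C{dump_mapping} and C{load_mapping}, the sample data and the temporary
path are all assumptions made for the illustration; they are not part
of this module:

```python
import os
import shelve
import tempfile

def dump_mapping(mapping, path):
    # the 'n' flag always creates a new, empty database
    db = shelve.open(path, 'n')
    try:
        db.update(mapping)
    finally:
        db.close()

def load_mapping(path):
    # the 'r' flag opens an existing database read-only
    db = shelve.open(path, 'r')
    try:
        return dict(db)
    finally:
        db.close()

# Hypothetical fragment of a valuation, stored in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'chat_demo')
dump_mapping({'city': set(['athens'])}, path)
restored = load_mapping(path)
```

Note that the name of the file actually created on disk depends on
the underlying dbm backend, so this sketch passes the same base name
to both functions rather than testing for a particular suffix.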

Individuals and Lexical Items
=============================
As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as C{'zloty'}, we add to the valuation
a pair C{('zloty', 'zloty')}. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

   PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file C{chat_pnames.cfg} in the
current directory.
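
A minimal sketch of how one such rule can be derived from an
individual constant is given below. The function name
C{proper_name_rule} is hypothetical; the rule format simply follows
the example above:

```python
def proper_name_rule(symbol):
    # capitalise each underscore-separated part of the constant,
    # e.g. 'south_africa' becomes 'South_Africa'
    parts = symbol.split('_')
    pname = '_'.join(part.capitalize() for part in parts)
    template = "PropN[num=sg, sem=<\\P.(P %s)>] -> '%s'"
    return template % (symbol, pname)

rule = proper_name_rule('zloty')
```

The constant itself appears unchanged in the semantics, while the
capitalised form is used as the terminal symbol of the rule.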

"""

import re
import nltk_lite.semantics.evaluate as evaluate
import shelve, os, sys
from nltk_lite.corpora import get_basedir

borders = {'rel_name': 'borders',
           'closures': ['symmetric'],
           'schema': ['region', 'border'],
           'filename': 'borders.pl'}

contains = {'rel_name': 'contains0',
            'closures': ['transitive'],
            'schema': ['region', 'contain'],
            'filename': 'contain.pl'}

city = {'rel_name': 'city',
        'closures': [],
        'schema': ['city', 'country', 'population'],
        'filename': 'cities.pl'}

country = {'rel_name': 'country',
           'closures': [],
           'schema': ['country', 'region', 'latitude', 'longitude',
                      'area', 'population', 'capital', 'currency'],
           'filename': 'countries.pl'}

circle_of_lat = {'rel_name': 'circle_of_latitude',
                 'closures': [],
                 'schema': ['circle_of_latitude', 'degrees'],
                 'filename': 'world1.pl'}

circle_of_long = {'rel_name': 'circle_of_longitude',
                  'closures': [],
                  'schema': ['circle_of_longitude', 'degrees'],
                  'filename': 'world1.pl'}

continent = {'rel_name': 'continent',
             'closures': [],
             'schema': ['continent'],
             'filename': 'world1.pl'}

region = {'rel_name': 'in_continent',
          'closures': [],
          'schema': ['region', 'continent'],
          'filename': 'world1.pl'}

ocean = {'rel_name': 'ocean',
         'closures': [],
         'schema': ['ocean'],
         'filename': 'world1.pl'}

sea = {'rel_name': 'sea',
       'closures': [],
       'schema': ['sea'],
       'filename': 'world1.pl'}

items = ['borders', 'contains', 'city', 'country', 'circle_of_lat',
         'circle_of_long', 'continent', 'region', 'ocean', 'sea']

# every name in 'items' needs an entry here, including 'circle_of_long'
item_metadata = {
    'borders': borders,
    'contains': contains,
    'city': city,
    'country': country,
    'circle_of_lat': circle_of_lat,
    'circle_of_long': circle_of_long,
    'continent': continent,
    'region': region,
    'ocean': ocean,
    'sea': sea
    }

rels = item_metadata.values()

# files whose relations do not give rise to a unary concept
not_unary = ['borders.pl', 'contain.pl']


class Concept(object):
    """
    A Concept class, loosely
    based on SKOS (U{http://www.w3.org/TR/swbp-skos-core-guide/}).
    """
    def __init__(self, prefLabel, arity, altLabels=None, closures=None, extension=None):
        """
        @param prefLabel: the preferred label for the concept
        @type prefLabel: str
        @param arity: the arity of the concept
        @type arity: int
        @keyword altLabels: other (related) labels
        @type altLabels: list
        @keyword closures: closure properties of the extension \
        (list items can be C{symmetric}, C{reflexive}, C{transitive})
        @type closures: list
        @keyword extension: the extensional value of the concept
        @type extension: set
        """
        self.prefLabel = prefLabel
        self.arity = arity
        # create fresh containers for each instance; mutable default
        # arguments would be shared between all Concepts
        if altLabels is None:
            altLabels = []
        if closures is None:
            closures = []
        if extension is None:
            extension = set()
        self.altLabels = altLabels
        self.closures = closures
        self.extension = extension

    def __str__(self):
        return "Label = '%s'\nArity = %s\nExtension = %s" % \
               (self.prefLabel, self.arity, sorted(self.extension))

    def __repr__(self):
        return "Concept('%s')" % self.prefLabel

    def augment(self, data):
        """
        Add more data to the C{Concept}'s extension set.

        @param data: a new semantic value
        @type data: string or pair of strings
        @rtype: set

        """
        self.extension.add(data)
        return self.extension


    def _make_graph(self, s):
        """
        Convert a set of pairs into an adjacency list encoding of a graph.
        """
        g = {}
        for (x, y) in s:
            if x in g:
                g[x].append(y)
            else:
                g[x] = [y]
        return g

    def _transclose(self, g):
        """
        Compute the transitive closure of a graph represented as an adjacency list.
        """
        for x in g:
            for adjacent in g[x]:
                # g[x] grows while we iterate over it, so the nodes
                # appended below are themselves visited later in this loop
                if adjacent in g:
                    for y in g[adjacent]:
                        if y not in g[x]:
                            g[x].append(y)
        return g

    def _make_pairs(self, g):
        """
        Convert an adjacency list back into a set of pairs.
        """
        pairs = []
        for node in g:
            for adjacent in g[node]:
                pairs.append((node, adjacent))
        return set(pairs)


    def close(self):
        """
        Close a binary relation in the C{Concept}'s extension set.

        The extension is updated in place, so that the relation is
        closed under each property listed in C{self.closures}.
        """
        assert evaluate.isrel(self.extension)
        if 'symmetric' in self.closures:
            pairs = []
            for (x, y) in self.extension:
                pairs.append((y, x))
            sym = set(pairs)
            self.extension = self.extension.union(sym)
        if 'transitive' in self.closures:
            # 'graph' rather than 'all', which would shadow the builtin
            graph = self._make_graph(self.extension)
            closed = self._transclose(graph)
            trans = self._make_pairs(closed)
            self.extension = self.extension.union(trans)


def clause2concepts(filename, rel_name, closures, schema):
    """
    Convert a file of Prolog clauses into a list of L{Concept} objects.

    @param filename: filename containing the relations
    @type filename: string
    @param rel_name: name of the relation
    @type rel_name: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param schema: the schema used in a set of relational tuples
    @type schema: list
    @return: a list of L{Concept}s
    @rtype: list
    """
    concepts = []
    # position of the subject of a binary relation
    subj = 0
    # label of the 'primary key'
    pkey = schema[0]
    # fields other than the primary key
    fields = schema[1:]

    # convert the file into a list of records
    records = _str2records(filename, rel_name)

    # add a unary concept for the entities in the primary key position,
    # except for relations in the files listed in 'not_unary'
    if filename not in not_unary:
        concepts.append(unary_concept(pkey, subj, records))

    # add a binary concept for each non-key field
    for field in fields:
        obj = schema.index(field)
        concepts.append(binary_concept(field, closures, subj, obj, records))

    return concepts

def unary_concept(label, subj, records):
    """
    Make a unary concept out of the primary key in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key.

    @param label: the preferred label for the concept
    @type label: string
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 1
    @rtype: L{Concept}
    """
    c = Concept(label, arity=1, extension=set())
    for record in records:
        c.augment(record[subj])
    return c

def binary_concept(label, closures, subj, obj, records):
    """
    Make a binary concept out of the primary key and another field in a record.

    A record is a list of entities in some relation, such as
    C{['france', 'paris']}, where C{'france'} is acting as the primary
    key, and C{'paris'} stands in the C{'capital_of'} relation to
    C{'france'}.

    More generally, given a record such as C{['a', 'b', 'c']}, where
    label is bound to C{'B'}, and C{obj} bound to 1, the derived
    binary concept will have label C{'B_of'}, and its extension will
    be a set of pairs such as C{('a', 'b')}.


    @param label: the base part of the preferred label for the concept
    @type label: string
    @param closures: closure properties for the extension of the concept
    @type closures: list
    @param subj: position in the record of the subject of the predicate
    @type subj: int
    @param obj: position in the record of the object of the predicate
    @type obj: int
    @param records: a list of records
    @type records: list of lists
    @return: L{Concept} of arity 2
    @rtype: L{Concept}
    """
    # 'border' and 'contain' keep their bare labels
    if label not in ('border', 'contain'):
        label = label + '_of'
    c = Concept(label, arity=2, closures=closures, extension=set())
    for record in records:
        c.augment((record[subj], record[obj]))
    # close the concept's extension according to the properties in closures
    c.close()
    return c


def process_bundle(rels):
    """
    Given a list of relation metadata bundles, make a corresponding
    dictionary of concepts, indexed by the relation name.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @return: a dictionary of concepts, indexed by the relation name.
    @rtype: dict
    """
    concepts = {}
    for rel in rels:
        rel_name = rel['rel_name']
        closures = rel['closures']
        schema = rel['schema']
        filename = rel['filename']

        concept_list = clause2concepts(filename, rel_name, closures, schema)
        for c in concept_list:
            label = c.prefLabel
            if label in concepts:
                # merge with the concept already stored under this label
                for data in c.extension:
                    concepts[label].augment(data)
                concepts[label].close()
            else:
                concepts[label] = c
    return concepts


def make_valuation(concepts, read=False, lexicon=False):
    """
    Convert a list of C{Concept}s into a list of (label, extension) pairs;
    optionally create a C{Valuation} object.

    @param concepts: concepts
    @type concepts: list of L{Concept}s
    @param read: if C{True}, C{(symbol, set)} pairs are read into a C{Valuation}
    @type read: bool
    @param lexicon: if C{True}, also write out lexical rules (implies C{read})
    @type lexicon: bool
    @rtype: list or a L{Valuation}
    """
    vals = []

    for c in concepts:
        vals.append((c.prefLabel, c.extension))
    if lexicon:
        read = True
    if read:
        val = evaluate.Valuation()
        val.read(vals)
        # add labels for individuals
        val = label_indivs(val, lexicon=lexicon)
        return val
    else:
        return vals


def val_dump(rels, db):
    """
    Make a L{Valuation} from a list of relation metadata bundles and dump to
    persistent database.

    @param rels: bundle of metadata needed for constructing a concept
    @type rels: list of dictionaries
    @param db: name of file to which data is written.
    The suffix '.db' will be automatically appended.
    @type db: string
    """
    concepts = process_bundle(rels).values()
    valuation = make_valuation(concepts, read=True)
    db_out = shelve.open(db, 'n')

    db_out.update(valuation)

    db_out.close()


def val_load(db):
    """
    Load a L{Valuation} from a persistent database.

    @param db: name of file from which data is read.
    The suffix '.db' should be omitted from the name.
    @type db: string
    """
    dbname = db+".db"

    if not os.access(dbname, os.R_OK):
        sys.exit("Cannot read file: %s" % dbname)
    else:
        db_in = shelve.open(db)
        val = evaluate.Valuation(db_in)

        return val


def alpha(str):
    """
    Utility to filter out non-alphabetic constants.

    @param str: candidate constant
    @type str: string
    @rtype: bool
    """
    try:
        int(str)
        return False
    except ValueError:
        # '?' is excluded as well; always return an explicit bool
        return str != '?'


def label_indivs(valuation, lexicon=False):
    """
    Assign individual constants to the individuals in the domain of a C{Valuation}.

    Given a valuation with an entry of the form {'rel': {'a': True}},
    add a new entry {'a': 'a'}.

    @type valuation: L{Valuation}
    @rtype: L{Valuation}
    """
    # collect all the individuals into a domain
    domain = valuation.domain
    # convert the domain into a sorted list of alphabetic terms
    entities = sorted(e for e in domain if alpha(e))
    # use the same string as a label
    pairs = [(e, e) for e in entities]
    if lexicon:
        lex = make_lex(entities)
        # close the file explicitly so that the rules are flushed to disk
        f = open("chat_pnames.cfg", mode='w')
        f.writelines(lex)
        f.close()
    # read the pairs into the valuation
    valuation.read(pairs)
    return valuation

def make_lex(symbols):
    """
    Create lexical CFG rules for each individual symbol.

    Given a valuation with an entry of the form {'zloty': 'zloty'},
    create a lexical rule for the proper name 'Zloty'.

    @param symbols: a list of individual constants in the semantic representation
    @type symbols: sequence
    @rtype: list
    """
    lex = []
    header = """
##################################################################
# Lexical rules automatically generated by running 'chat80.py -x'.
##################################################################

"""
    lex.append(header)
    # the backslash is escaped explicitly; the resulting text is unchanged
    template = "PropN[num=sg, sem=<\\P.(P %s)>] -> '%s'\n"

    for s in symbols:
        # capitalise each underscore-separated part of the symbol
        parts = s.split('_')
        caps = [p.capitalize() for p in parts]
        pname = ('_').join(caps)
        rule = template % (s, pname)
        lex.append(rule)
    return lex


def concepts(items=items):
    """
    Build a list of concepts corresponding to the relation names in C{items}.

    @param items: names of the Chat-80 relations to extract
    @type items: list of strings
    @return: the L{Concept}s which are extracted from the relations
    @rtype: list
    """
    # allow a single relation name to be passed as a string
    if isinstance(items, str):
        items = (items,)

    rels = [item_metadata[r] for r in items]

    concept_map = process_bundle(rels)
    return concept_map.values()


def main():
    from optparse import OptionParser
    description = \
    """
Extract data from the Chat-80 Prolog files and convert them into a
Valuation object for use in the NLTK semantics package.
    """

    opts = OptionParser(description=description)
    opts.set_defaults(verbose=True, lex=False, vocab=False)
    opts.add_option("-s", "--store", dest="outdb",
                    help="store a valuation in DB", metavar="DB")
    opts.add_option("-l", "--load", dest="indb",
                    help="load a stored valuation from DB", metavar="DB")
    opts.add_option("-c", "--concepts", action="store_true",
                    help="print concepts instead of a valuation")
    opts.add_option("-r", "--relation", dest="label",
                    help="print concept with label REL (check possible labels with '-v' option)", metavar="REL")
    opts.add_option("-q", "--quiet", action="store_false", dest="verbose",
                    help="don't print out progress info")
    opts.add_option("-x", "--lex", action="store_true", dest="lex",
                    help="write a file of lexical entries for country names, then exit")
    opts.add_option("-v", "--vocab", action="store_true", dest="vocab",
                    help="print out the vocabulary of concept labels and their arity, then exit")

    (options, args) = opts.parse_args()
    if options.outdb and options.indb:
        opts.error("Options --store and --load are mutually exclusive")

    if options.outdb:
        # write the valuation to a persistent database
        if options.verbose:
            outdb = options.outdb+".db"
            print "Dumping a valuation to %s" % outdb
        val_dump(rels, options.outdb)
        sys.exit(0)
    else:
        # try to read in a valuation from a database
        if options.indb is not None:
            dbname = options.indb+".db"
            if not os.access(dbname, os.R_OK):
                sys.exit("Cannot read file: %s" % dbname)
            else:
                valuation = val_load(options.indb)
        # otherwise, create the valuation from scratch
        else:
            # build the concepts
            concept_map = process_bundle(rels)
            concepts = concept_map.values()
            # just print out the vocabulary
            if options.vocab:
                vocab = [(c.arity, c.prefLabel) for c in concepts]
                vocab.sort()
                for (arity, label) in vocab:
                    print label, arity
                sys.exit(0)
            # show all the concepts
            if options.concepts:
                for c in concepts:
                    print c
                    print
            if options.label:
                print concept_map[options.label]
                sys.exit(0)
            else:
                # turn the concepts into a Valuation
                if options.lex:
                    if options.verbose:
                        print "Writing out lexical rules"
                    make_valuation(concepts, lexicon=True)
                else:
                    valuation = make_valuation(concepts, read=True)
                    print valuation


if __name__ == '__main__':
    main()
