pennimport reads corpus data in Penn Treebank format and converts the data to MQL statements for later importing into Emdros.
Each filename given after the options on the command line is interpreted as one Penn Treebank document containing one or more trees. If no filenames are given, the input is read from stdin, and the whole stream is interpreted as one document.
If no -o switch is given, the output is printed on stdout.
If an error occurs, the string "FAILURE" or the string "ERROR" is printed on stderr, along with an error message.
If no error occurs, a string of the form "SUCCESS: next_monad is X next_id_d is Y" is printed on stderr, where X and Y are positive integers denoting the next monad and the next id_d to be used by the next invocation of the program, respectively. This is useful if you've got several directories' worth of documents to import.
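When several invocations are chained from a script, the two numbers can be recovered from the captured stderr text. The following is a minimal sketch, not part of pennimport itself: the function name parse_success is hypothetical, and the exact whitespace in the real message is an assumption, so the pattern accepts any run of whitespace between the words.

```python
import re

# Pattern for the success line described above. X and Y are positive integers.
# Assumption: words are separated by whitespace; the exact spacing is not verified.
SUCCESS_RE = re.compile(r"SUCCESS:\s+next_monad\s+is\s+(\d+)\s+next_id_d\s+is\s+(\d+)")


def parse_success(stderr_text):
    """Return (next_monad, next_id_d) as integers, or None if no success line is found."""
    match = SUCCESS_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A caller would treat a None result as a failed run and inspect stderr for the "FAILURE" or "ERROR" message instead.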
Coreference links are assumed to be unique within a document, but not across documents. Thus you should exercise care if you choose to use stdin as the input method -- all of the data on stdin will be treated as a single document, and coreference links will be assumed to be unique within the entire stream. If this is not the case, then use the method of putting files on the command line rather than using stdin.
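The effect of the document scope can be illustrated with a small sketch. This is not pennimport's actual implementation; the function resolve_coref, the tuple layout, and the file names are hypothetical and serve only to show why the same link number in two documents must not be merged.

```python
from collections import defaultdict


def resolve_coref(nodes):
    """Map each id_d to the id_ds of the other nodes sharing its link in the same document.

    nodes is a list of (document, id_d, link) tuples; link is None when a node
    carries no coreference link. This layout is an illustrative assumption.
    """
    groups = defaultdict(list)
    for document, id_d, link in nodes:
        if link is not None:
            groups[(document, link)].append(id_d)

    coref = {}
    for document, id_d, link in nodes:
        if link is None:
            coref[id_d] = []
        else:
            # Every member of the group except the node itself.
            coref[id_d] = [other for other in groups[(document, link)] if other != id_d]
    return coref


# Two files that both happen to use link "1".
per_file = resolve_coref([("a.mrg", 10, "1"), ("a.mrg", 11, "1"),
                          ("b.mrg", 20, "1"), ("b.mrg", 21, "1")])

# The same trees concatenated on stdin are one document, so the links collide.
on_stdin = resolve_coref([("stdin", 10, "1"), ("stdin", 11, "1"),
                          ("stdin", 20, "1"), ("stdin", 21, "1")])
```

With separate files, object 10 is coreferent only with object 11. With everything on stdin, object 10 is also linked to objects 20 and 21, which is the situation the paragraph above warns about.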
Support for importing the BLIPP corpus is implemented. In particular, the "hash" sign (number sign) as a delimiter for coreference relations is supported.
The schema can be seen by giving the program the -s switch, with an optional -d switch.
Briefly, a "Document" corresponds to one file given on the command line (or, in the case of using stdin, to the entire stream on stdin).
A "Root" is one stand-alone tree. Its "parent" feature points to the
"Document" which is its parent.
A "Nonterminal" is a nonterminal in the tree which is not a root. Its
"parent" feature points to its parent. Its "coref" feature gives a
list of id_ds of other objects which are coreferent with this
nonterminal. Its "mytype" feature gives the first part of the
nonterminal name. Its "function" feature gives the rest of the
nonterminal name, apart from any coreference pointer. For example,
"NP-SUBJ-1036" would have "NP" for the "mytype" feature, "SUBJ" for
the "function" feature, and the 1036 coreference link would be
translated to the id_d(s) of the other object(s) with the same
coreference link.
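The split of a nonterminal name into its parts can be sketched as follows. This is an illustration under stated assumptions, not pennimport's own parser: the function name split_label is hypothetical, a trailing all-digit component is assumed to be the coreference link, multiple function tags are assumed to stay joined by hyphens, and the BLIPP "#" delimiter is not handled.

```python
def split_label(label):
    """Split a nonterminal name into (mytype, function, link).

    link is the coreference link as a string, or None if the name has none.
    """
    parts = label.split("-")
    link = None
    # Assumption: a trailing all-digit component is the coreference link.
    if len(parts) > 1 and parts[-1].isdigit():
        link = parts.pop()
    mytype = parts[0]
    # Assumption: any remaining components form the function, joined by hyphens.
    function = "-".join(parts[1:])
    return mytype, function, link
```

For "NP-SUBJ-1036" this yields "NP", "SUBJ", and the link "1036", which would then be translated to id_ds as described above. A bare "NP" yields an empty function and no link.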
A "Token" is a terminal node in the tree. Its "parent" feature points to its parent. Its "coref" feature gives a list of id_ds of other objects which are coreferent with this terminal (only for traces). Its "surface" feature gives the string associated with the terminal; for traces, any coreference link will have been stripped off. Its "mytype" feature gives its part-of-speech tag, and its "function" feature is defined analogously to "function" on "Nonterminal".