PENNIMPORT

Section: User Commands (1)
Updated: January 20, 2007
 

NAME

pennimport - A tool to convert Penn Treebank format to Emdros MQL  

SYNOPSIS

pennimport [ options ] [input_filename ...]
 

DESCRIPTION

pennimport is a command-line tool to convert Penn Treebank format to Emdros MQL for later importing into Emdros.

 

OPTIONS

pennimport supports the following command-line switches:
--help
show help, then quit
-V , --version
show version, then quit
-b , --backend backend
set database backend to `backend'. Valid values are: For PostgreSQL: "p", "pg", "postgres", and "postgresql". For MySQL: "m", "my", and "mysql". For SQLite 2.X.X: "2", "s", "l", "lt", "sqlite", and "sqlite2". For SQLite 3.X.X: "3", "s3", "lt3", and "sqlite3".
-d , --dbname dbname
set database name. If used with -s, the string "CREATE DATABASE
-o , --output filename
dump to file filename. The default is "-", which means "standard output".
--start-monad monad
The start monad to use. Must be >= 1. The default is 1.
--start-id_d id_d
The start id_d to use. Must be >= 1. The default is 1.
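
The aliases accepted by -b can be collected in a small lookup table, for instance in a wrapper script that checks its arguments before calling pennimport. The following Python sketch is not part of Emdros: the names BACKEND_ALIASES and normalize_backend are hypothetical, and the labels on the left are chosen here only for illustration.

```python
# Hypothetical helper, not part of Emdros: map the -b aliases documented
# above to one label per backend.
BACKEND_ALIASES = {
    "postgresql": ("p", "pg", "postgres", "postgresql"),
    "mysql": ("m", "my", "mysql"),
    "sqlite2": ("2", "s", "l", "lt", "sqlite", "sqlite2"),
    "sqlite3": ("3", "s3", "lt3", "sqlite3"),
}


def normalize_backend(value):
    """Return the label for a documented alias, or raise ValueError."""
    # Assumption: aliases are matched exactly as written; how pennimport
    # itself treats other letter cases is not stated in this page.
    for label, aliases in BACKEND_ALIASES.items():
        if value in aliases:
            return label
    raise ValueError(f"unknown backend alias: {value!r}")
```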
 

OPERATION

pennimport reads corpus data in Penn Treebank format and converts the data to MQL statements for later importing into Emdros.

Each filename given after the options on the command line is treated as one Penn Treebank document containing one or more trees. If no filenames are given, the input is read from stdin, and the whole stream is treated as a single document.

If no -o switch is given, the output is printed on stdout.
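
A wrapper script might assemble the command line from the switches described under OPTIONS. The sketch below only builds the argument list and does not run anything. The function name build_pennimport_argv and its parameters are hypothetical, and it assumes the program is installed on the PATH as "pennimport".

```python
def build_pennimport_argv(input_files, output="-", start_monad=1, start_id_d=1):
    """Build an argument list for pennimport from the documented switches."""
    # Both start values must be >= 1 according to the OPTIONS section.
    if start_monad < 1 or start_id_d < 1:
        raise ValueError("start monad and start id_d must be >= 1")
    argv = ["pennimport"]
    # "-" is the documented default (standard output), so -o is only
    # added when another file name is requested.
    if output != "-":
        argv += ["-o", output]
    argv += ["--start-monad", str(start_monad)]
    argv += ["--start-id_d", str(start_id_d)]
    # Each input file is treated as one document; an empty list means
    # that pennimport will read a single document from stdin.
    argv += list(input_files)
    return argv
```

The resulting list can be handed to subprocess.run(argv, capture_output=True, text=True) from the Python standard library, which also makes the text on stderr available for inspection.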

If an error occurs, the string "FAILURE" or the string "ERROR" is printed on stderr, along with an error message.

If no error occurs, a string of the form "SUCCESS: next_monad is X next_id_d is Y" is printed on stderr, where X and Y are positive integers denoting the next monad and the next id_d to be used by the next invocation of the program, respectively. This is useful if you've got several directories' worth of documents to import.
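
To chain several invocations, a script can pick X and Y out of the SUCCESS line and pass them as --start-monad and --start-id_d to the next run. The following Python sketch shows one way to do the parsing. The name parse_next_values is hypothetical, and the pattern assumes the message has exactly the form quoted above, with only whitespace between its parts.

```python
import re

# Pattern for the documented message:
#   SUCCESS: next_monad is X next_id_d is Y
_SUCCESS_RE = re.compile(
    r"SUCCESS:\s*next_monad is\s+(\d+)\s+next_id_d is\s+(\d+)"
)


def parse_next_values(stderr_text):
    """Return (next_monad, next_id_d), or None if no SUCCESS line is found."""
    match = _SUCCESS_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A return value of None means that the text contained no SUCCESS line, in which case the script should look for "FAILURE" or "ERROR" instead of starting the next run.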

Coreference links are assumed to be unique within a document, but not across documents. Thus you should exercise care if you choose to use stdin as the input method -- all of the data on stdin will be treated as a single document, and coreference links will be assumed to be unique within the entire stream. If this is not the case, then use the method of putting files on the command line rather than using stdin.

Support for importing the BLIPP corpus is implemented. In particular, the "hash" sign (number sign) as a delimiter for coreference relations is supported.

 

SCHEMA

The schema can be seen by giving the program the -s switch, optionally together with the -d switch.

Briefly, a "Document" corresponds to one file given on the command line (or, in the case of using stdin, to the entire stream on stdin).

A "Root" is one stand-alone tree. Its "parent" feature points to the "Document" which is its parent.

A "Nonterminal" is a nonterminal in the tree which is not a root. Its "parent" feature points to its parent. Its "coref" feature gives a list of id_ds of other objects which are coreferent with this nonterminal. Its "mytype" feature gives the first part of the nonterminal name. Its "function" feature gives the rest of the nonterminal name, apart from any coreference pointer. For example, "NP-SUBJ-1036" would have "NP" for the "mytype" feature, "SUBJ" for the "function" feature, and the 1036 coreference link would be translated to the id_d(s) of the other object(s) with the same coreference link.

A "Token" is a terminal node in the tree. Its "parent" feature points to its parent. Its "coref" feature gives a list of id_ds of other objects which are coreferent with this terminal (only for traces). Its "surface" feature gives the string associated with the terminal. For traces, any coreference link will have been stripped off. Its "mytype" feature gives its part-of-speech tag, and its "function" is defined analogously to "function" on "Nonterminal".
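
The way a label such as "NP-SUBJ-1036" divides into "mytype", "function" and a coreference link can be illustrated with a few lines of Python. This is only a sketch of the rule as described above, not the parser that pennimport uses. The name split_label is hypothetical. The sketch assumes that the parts are separated by hyphens and that a trailing all-digit part is the coreference link. It returns the raw link number rather than id_ds, and it does not handle the BLIPP "#" delimiter or special tags such as "-NONE-".

```python
def split_label(label):
    """Split a label such as "NP-SUBJ-1036" into (mytype, function, coref)."""
    parts = label.split("-")
    mytype = parts[0]      # first part of the name
    rest = parts[1:]       # everything after the first part
    coref = None
    # Assumption: a trailing part made up only of digits is the
    # coreference link.
    if rest and rest[-1].isdigit():
        coref = int(rest.pop())
    # Whatever remains is the function; it is empty if nothing remains.
    function = "-".join(rest)
    return mytype, function, coref
```

For a label without a function or a coreference link, such as "NP", the sketch returns an empty function string and None for the link.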

 

RETURN VALUES

0 Success
1 Wrong usage
2 Connection to backend server could not be established
3 An exception occurred (the type is printed on stderr)
4 Could not open file
5 Database error
6 Compiler error (internal error)
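
A calling script can turn these return values into readable messages. The table below is copied from the list above. The names EXIT_CODES and describe_exit_code are hypothetical and are not part of Emdros.

```python
# Return values as documented in this section.
EXIT_CODES = {
    0: "Success",
    1: "Wrong usage",
    2: "Connection to backend server could not be established",
    3: "An exception occurred (the type is printed on stderr)",
    4: "Could not open file",
    5: "Database error",
    6: "Compiler error (internal error)",
}


def describe_exit_code(code):
    """Return the documented meaning of a pennimport return value."""
    # Values that are not listed above are reported as undocumented
    # instead of being guessed at.
    return EXIT_CODES.get(code, f"Undocumented return value {code}")
```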
 

AUTHORS

Copyright 2001-2006 by Ulrik Petersen (ulrikp@users.sourceforge.net). Note that this software is distributed under the GNU GPL. See the sources for details.


 
