TIGERXMLIMPORT

Section: User Commands (1)
Updated: April 4, 2007
 

NAME

tigerxmlimport - A tool to convert TIGER XML format to Emdros MQL  

SYNOPSIS

tigerxmlimport [ options ] [input_filename ...]
 

DESCRIPTION

tigerxmlimport is a command-line tool for converting a corpus stored as TIGER XML to Emdros MQL statements that create object types and/or objects. These can then later be imported into Emdros, e.g., with the mql(1) program.

Only the data definition part of TIGER XML is supported, not the parts that represent queries or matches. This is not really a bug, since the purpose of the program is to import data, not queries or matches.

 

OPTIONS

tigerxmlimport supports the following command-line switches:
--help
show help, then quit
-V , --version
show version, then quit
-b , --backend backend
set database backend to `backend'. Valid values are:
    PostgreSQL: "p", "pg", "postgres", and "postgresql"
    MySQL: "m", "my", and "mysql"
    SQLite 2.X.X: "2", "s", "l", "lt", "sqlite", and "sqlite2"
    SQLite 3.X.X: "3", "s3", "lt3", and "sqlite3"
-s , --schema
print MQL schema to stdout between reading corpus and dumping objects as MQL (can be used with -d)
-d , --dbname dbname
set database name. If used with -s, the string "CREATE DATABASE
-o , --output filename
dump to file filename. The default is "-", which means "standard output".
--start-monad monad
The start monad to use. Must be >= 1. The default is 1.
--start-id_d id_d
The start id_d to use. Must be >= 1. The default is 1.
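
As an illustration of how the switches above combine, here is a minimal Python sketch that assembles a tigerxmlimport command line. The names build_argv, canonical_backend and BACKEND_ALIASES are hypothetical helpers, not part of Emdros. The sketch assumes that backend values are matched exactly as listed, and it only uses switches documented above.

```python
# Hypothetical helper table: canonical name -> values documented for -b.
BACKEND_ALIASES = {
    "postgresql": ("p", "pg", "postgres", "postgresql"),
    "mysql": ("m", "my", "mysql"),
    "sqlite2": ("2", "s", "l", "lt", "sqlite", "sqlite2"),
    "sqlite3": ("3", "s3", "lt3", "sqlite3"),
}


def canonical_backend(value):
    """Map a documented -b value to one canonical spelling (exact match assumed)."""
    for canonical, aliases in BACKEND_ALIASES.items():
        if value in aliases:
            return canonical
    raise ValueError("unknown backend: %r" % (value,))


def build_argv(inputs, output="-", dbname=None, schema=False,
               backend=None, start_monad=1, start_id_d=1):
    """Build an argument list for tigerxmlimport from the documented switches."""
    # Both counters must be >= 1 according to the option descriptions.
    if start_monad < 1 or start_id_d < 1:
        raise ValueError("start monad and start id_d must be >= 1")
    argv = ["tigerxmlimport", "-o", output,
            "--start-monad", str(start_monad),
            "--start-id_d", str(start_id_d)]
    if backend is not None:
        argv += ["-b", canonical_backend(backend)]
    if dbname is not None:
        argv += ["-d", dbname]
    if schema:
        argv.append("-s")
    return argv + list(inputs)
```

The resulting list can be handed to any process-spawning call. Passing the defaults explicitly is a design choice of the sketch; the tool does not require it.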
 

OPERATION

tigerxmlimport reads corpus data in TIGER XML format, and converts the data to MQL statements for later importing into Emdros.

The filename given after the options on the command line is opened, and it must contain a TIGER XML document. If no filename is given, the input is read from stdin.

If no -o switch is given, the output is printed on stdout.

If an error occurs, the string "FAILURE" or the string "ERROR" is printed on stderr, along with an error message.

If no error occurs, a string of the form "SUCCESS: next_monad is X next_id_d is Y" is printed on stderr, where X and Y are positive integers denoting the next monad and the next id_d to be used by the next invocation of the program, respectively. This is useful if you've got several directories' worth of documents to import.
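
The chaining described above can be sketched in Python as follows. This is only an illustration under stated assumptions: parse_success and import_files are hypothetical names, the regular expression assumes the success line looks exactly like the form quoted above, and writing each result to "<input>.mql" is an arbitrary choice made for the example.

```python
import re
import subprocess

# Pattern for the documented success line on stderr.
SUCCESS_RE = re.compile(
    r"SUCCESS:\s*next_monad is\s+(\d+)\s+next_id_d is\s+(\d+)")


def parse_success(stderr_text):
    """Return (next_monad, next_id_d), or None if no success line is present."""
    match = SUCCESS_RE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def import_files(xml_files, run=subprocess.run):
    """Convert each file in turn, feeding the reported counters to the next run."""
    monad, id_d = 1, 1  # documented defaults for the first invocation
    for xml_file in xml_files:
        argv = ["tigerxmlimport",
                "--start-monad", str(monad),
                "--start-id_d", str(id_d),
                "-o", xml_file + ".mql",  # output naming is an assumption
                xml_file]
        result = run(argv, capture_output=True, text=True)
        counters = parse_success(result.stderr)
        if counters is None:
            # No SUCCESS line: surface whatever the tool printed on stderr.
            raise RuntimeError("%s: %s" % (xml_file, result.stderr.strip()))
        monad, id_d = counters
    return monad, id_d
```

The run parameter exists only so that the loop can be exercised without the tool installed; by default it is subprocess.run.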

 

SCHEMA

The schema can be seen by giving the program the -s switch, optionally together with the -d switch.

A "Sentence" corresponds to one top-level sentence. It has a single feature, called "id", giving the corpus id of the sentence. The "id" feature is of type STRING.

A "Nonterminal" is a nonterminal in the tree which is not a top-level sentence. Note that this encompasses clauses ("S") as well as phrases. Its "parent" feature is an id_d that points to its immediate parent. Its "edge" feature is a STRING holding the label of the edge to that parent.

The rest of the features of "Nonterminal" are computed as follows: Inside the <annotation> tag, each <feature> tag whose domain is either "NT" or "FREC" becomes an Emdros feature of type STRING FROM SET, with the name given in the "name" attribute of the <feature> tag. For example, <feature name="cat" domain="NT"> will become one Emdros feature, with the name "cat" and the type "STRING FROM SET".

Secondary edges are given as secedgeX (a STRING FROM SET) and secedgeparentX (an id_d pointing to the secondary edge parent), where X is a positive integer (1, 2, 3, ...). The number of secedgeX/secedgeparentX pairs equals the maximum number of secondary edges found on any single node in the imported corpus. For example, if no node in the corpus has more than two secondary edges, the features will be secedge1, secedgeparent1, secedge2, and secedgeparent2.
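
The naming rule for secondary edge features can be sketched in a few lines of Python. The function name secedge_feature_names is hypothetical; the sketch only reproduces the numbering described above and says nothing about how the tool computes the maximum.

```python
def secedge_feature_names(max_secondary_edges):
    """List the secedgeX/secedgeparentX feature names for a given maximum."""
    names = []
    # X counts from 1 up to the largest number of secondary edges on any node.
    for x in range(1, max_secondary_edges + 1):
        names.append("secedge%d" % x)
        names.append("secedgeparent%d" % x)
    return names
```

With a maximum of two, this yields the four feature names given in the example above; with a maximum of zero, it yields none.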

A "Terminal" is a terminal node in the tree (i.e., a token). Its features are calculated precisely as for Nonterminal, except that the domain of the <feature> tag is either "T" or "FREC", but not "NT".
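
The domain rule for "Nonterminal" and "Terminal" features can be illustrated with a short Python sketch that reads the <feature> tags of a header and sorts their names by object type. The header in SAMPLE_HEADER is invented for the example, and features_by_object_type is a hypothetical name; this is not the program's own code, only a restatement of the rule described above.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal TIGER XML header used only for illustration.
SAMPLE_HEADER = """
<corpus id="demo">
  <head>
    <annotation>
      <feature name="word" domain="T"/>
      <feature name="pos" domain="T"><value name="NN"/></feature>
      <feature name="cat" domain="NT"><value name="S"/></feature>
      <feature name="note" domain="FREC"/>
    </annotation>
  </head>
</corpus>
"""


def features_by_object_type(header_xml):
    """Group <feature> names by the Emdros object type that receives them."""
    root = ET.fromstring(header_xml)
    nonterminal, terminal = [], []
    for feature in root.iter("feature"):
        name = feature.get("name")
        domain = feature.get("domain")
        # "FREC" features go to both object types.
        if domain in ("NT", "FREC"):
            nonterminal.append(name)
        if domain in ("T", "FREC"):
            terminal.append(name)
    return {"Nonterminal": nonterminal, "Terminal": terminal}
```

For the sample header, "cat" and "note" land on Nonterminal, while "word", "pos" and "note" land on Terminal.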

Since the schema is created from the <feature> parts of the header, the program needs to read the corpus before it can emit the schema. If you give the -s switch to the program, the schema will be printed on stdout.

 

EXAMPLES

tigerxmlimport -o data.mql -d mycorpus --schema corpus.xml > schema.mql

This example reads the corpus.xml corpus (which must be in TIGER XML format) and writes the data to the data.mql file, while writing the schema to schema.mql. The Emdros database will be called 'mycorpus'.

Afterwards, the following commands will create the database and load the data:

mql schema.mql

mql data.mql

 

BUGS

The subcorpus feature of TIGER XML is not supported.

 

RETURN VALUES

0 Success
1 Wrong usage
2 Connection to backend server could not be established
3 An exception occurred (the type is printed on stderr)
4 Could not open file
5 Database error
6 Compiler error (internal error)
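
A calling script may want to turn these return values into readable messages. The following Python sketch shows one way to do so; EXIT_CODES and describe_exit are hypothetical names, and the fallback text for undocumented values is an assumption of the example.

```python
# Return values as documented above.
EXIT_CODES = {
    0: "Success",
    1: "Wrong usage",
    2: "Connection to backend server could not be established",
    3: "An exception occurred (the type is printed on stderr)",
    4: "Could not open file",
    5: "Database error",
    6: "Compiler error (internal error)",
}


def describe_exit(code):
    """Return the documented meaning of a tigerxmlimport return value."""
    # Anything outside the documented range gets a generic fallback.
    return EXIT_CODES.get(code, "Unknown return value %d" % code)
```

The function would typically be applied to the returncode attribute of a finished subprocess.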
 

AUTHORS

Copyright 2001-2007 by Ulrik Petersen (ulrikp@users.sourceforge.net). Note that this software is distributed under the GNU GPL. See the sources for details.


 
