Only the data definition part of TIGER XML is supported, not the parts that represent queries or matches. This is not really a bug, since the purpose of the program is to import data, not queries or matches.
tigerxmlimport reads corpus data in TIGER XML format, and converts the data to MQL statements for later importing into Emdros.
The filename given after the options on the command line is opened, and it must contain a TIGER XML document. If no filename is given, the input is read from stdin.
If no -o switch is given, the output is printed on stdout.
If an error occurs, the string "FAILURE" or the string "ERROR" is printed on stderr, along with an error message.
If no error occurs, a string of the form "SUCCESS: next_monad is X next_id_d is Y" is printed on stderr, where X and Y are positive integers denoting the next monad and the next id_d to be used by the next invocation of the program, respectively. This is useful if you've got several directories' worth of documents to import.
The schema can be seen by giving the program -s switch, with an optional -d switch.
A "Sentence" corresponds to one top-level sentence. It has a single feature, called "id", giving the corpus id of the sentence. The "id" feature is of type STRING.
If the --ntotf switch is not given, all nonterminals are gathered into a single object type, called "Nonterminal". A "Nonterminal" is a nonterminal in the tree which is not a top-level sentence. Note that this encompasses clauses ("S") as well as phrases. Its "parent" feature is an id_d that points to its immediate parent. Its "edge" feature is a STRING which shows the edge label <annotation> tag held.
The rest of the features of a "Nonterminal" object are computed as follows: Inside the <annotation> tag, each <feature> tag whose domain is either "NT" or "FREC" becomes an Emdros feature of type STRING FROM SET, with the name given in the "name" attribute of the <feature> tag. For example, <feature name="cat" domain="NT"> will become one Emdros feature, with the name "cat" and the type "STRING FROM SET".
Secondary edges are given as secedgeX (a STRING FROM SET) and secedgeparentX (an id_d pointing to the secondary edge parent), where X is an integer (1,2,3,4.... etc.). The number of pairs of secedgeX/secedgeparentX features depends on the maximum number of secondary edges actually found in the corpus imported. For example, if there are a maximum of two secondary edges on any node in the corpus, then the features will be secedge1, secedgeparent1, secedge2, secedgeparent2.
If the --ntotf switch is given, the same as the above occurs, except that the TigerXML feature named by the --ntotf switch will not be added to any nonterminal object. Instead, the feature named by the switch will be interpreted as an object type name, and all nonterminals with the same value for this feature will be placed into an object type of the same name. For example, if the switch + value "--ntotf cat" are given, then the TigerXML feature "cat" is used. If a Nonterminal has a "cat" of "CL", then that nonterminal will be placed into the object type named "CL". If the value is not a valid object type name (i.e., if it is not a valid C identifier), then the value is renamed in order to become a valid object type name. This is done as follows: All characters which do not belong to the set [A-Za-z0-9_] are first made into underscores. Then, if the result starts with a digit ("[0-9]"), then an underscore is prepended. Then, if the result is either a reserved MQL word, or if the name already exists in the set of object types to be used, underscores are appended until the name is unique and not an MQL reserved word.
If the "--termotn" switch is not given, all terminals are gathered into an object type by the name "Terminal". A "Terminal" is a terminal node in the tree (i.e., a token). Its features are calculated precisely as for Nonterminal, except that the domain of the <feature> tag is either "T" or "FREC", but not "NT".
If the "--termotn" switch, on the other hand, is given, then all terminals are gathered into an object type with the name given by the switch.
Since the schema is created from the <feature> parts of the header, the program needs to read the corpus before it can emit the schema. If you give the -s switch to the program, the schema will be printed on stdout.
tigerxmlimport -o data.mql -d mycorpus --schema \ corpus.xml > schema.mql
This example reads the corpus.xml corpus (which must be in TIGER XML format), and writes the data to the data.mql file, while writing the schema to schema.mql. The Emdros database will be called 'mycorpus'
Afterwards, the following will create the database:
mql schema.mql
mql data.mql
The subcorpus feature of TIGER XML is not supported.