it.unimi.dsi.mg4j.tool
Class ZerothPass

java.lang.Object
  extended by it.unimi.dsi.mg4j.tool.ZerothPass

public final class ZerothPass
extends Object

Builds the list of terms appearing in a sequence of documents read from standard input.

This class reads a sequence of documents from standard input and produces two files containing the terms that appear in the documents, one in order of appearance and one in sorted order. The only mandatory argument is a basename, which is used as the prefix for the names of all generated files.

By default, the whole set of terms contained in the document collection is accumulated in a set, and then dumped. For large collections, memory usage can be limited by dumping the set of terms accumulated so far at regular intervals; the dumped sets are then merged. Note that in this case the unsorted list of terms may contain duplicates.
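The effect of periodic dumping on the unsorted list can be illustrated with a minimal sketch. This is not the ZerothPass implementation: the class TermDumpSketch, its methods, and the maxTerms threshold are hypothetical names introduced here, an in-memory list stands in for the unsorted term file, and sorting by String's natural order is an assumption about what "lexicographically" means.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not the actual ZerothPass code. */
class TermDumpSketch {

    /**
     * Returns the terms in appearance order. The in-memory set is dumped
     * and cleared whenever it reaches maxTerms (a hypothetical threshold),
     * so a term seen again after a dump is emitted a second time.
     */
    static List<String> unsortedTerms(Iterable<String> terms, int maxTerms) {
        List<String> dumped = new ArrayList<>();      // stands in for the unsorted term file
        Set<String> current = new LinkedHashSet<>();  // remembers insertion order
        for (String term : terms) {
            current.add(term);
            if (current.size() >= maxTerms) {
                dumped.addAll(current);
                current.clear();
            }
        }
        dumped.addAll(current);                       // final dump
        return dumped;
    }

    /** Merges the dumps into a duplicate-free list sorted by String's natural order. */
    static List<String> sortedTerms(List<String> unsorted) {
        return new ArrayList<>(new TreeSet<>(unsorted));
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("b", "a", "b", "c", "a");
        System.out.println(unsortedTerms(terms, Integer.MAX_VALUE)); // [b, a, c]
        System.out.println(unsortedTerms(terms, 2));                 // [b, a, b, c, a]
        System.out.println(sortedTerms(unsortedTerms(terms, 2)));    // [a, b, c]
    }
}
```

With a single dump the unsorted list is duplicate-free; with a threshold of two terms, "b" and "a" reappear after the first dump, which is the situation the duplicates warning describes. The merged, sorted list is the same in both cases.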

These are the files currently generated:

basename.terms.unsorted[.dups]
For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely, the i-th line of the file (starting from 0) contains the literal string corresponding to term index i. Terms are in appearance order. If several dumps were necessary to build the index, the suffix .dups is added to signal that the list may contain duplicates.
basename.terms
For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely, the i-th line of the file (starting from 0) contains the literal string corresponding to term index i. Terms are sorted lexicographically.
basename.properties
A Java property file containing information about the index. Currently, the following keys are generated:
basename
the basename (see above);
documents
number of documents in the collection;
terms
number of indexed terms.
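
As a rough sketch of how a client might consume these files, the following reads a term list so that list position i holds the term with index i, and reads the numeric keys from the property file. The class TermFilesSketch and its methods are hypothetical and not part of MG4J; the sketch assumes the property values are plain decimal integers, and it uses in-memory readers in place of real files so that it runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical reader for the generated files; not part of MG4J. */
class TermFilesSketch {

    /** Reads one term per line; the term with index i ends up at list position i. */
    static List<String> readTerms(Reader in) throws IOException {
        List<String> terms = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                terms.add(line);
            }
        }
        return terms;
    }

    /** Parses a numeric property, assuming it is stored as a decimal integer. */
    static int intProperty(Properties properties, String key) {
        return Integer.parseInt(properties.getProperty(key).trim());
    }

    public static void main(String[] args) throws IOException {
        // For real files, open them with an explicit UTF-8 reader instead, e.g.
        // Files.newBufferedReader(Paths.get(basename + ".terms"), StandardCharsets.UTF_8)
        List<String> terms = readTerms(new StringReader("alpha\nbeta\ngamma\n"));
        System.out.println(terms.get(1)); // beta

        Properties properties = new Properties();
        properties.load(new StringReader("basename=example\ndocuments=2\nterms=3\n"));
        System.out.println(intProperty(properties, "documents")); // 2
        System.out.println(intProperty(properties, "terms"));     // 3
    }
}
```

Opening the term files with an explicit UTF-8 charset matters here, since the platform default encoding may differ from the encoding the files are written in.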

Since:
0.9
Author:
Sebastiano Vigna

Method Summary
static void main(String[] arg)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

main

public static void main(String[] arg)
                 throws UnsupportedEncodingException,
                        FileNotFoundException
Throws:
UnsupportedEncodingException
FileNotFoundException