it.unimi.dsi.mg4j.tool
Class ZerothPass
java.lang.Object
it.unimi.dsi.mg4j.tool.ZerothPass
- public final class ZerothPass
- extends Object
Builds the list of terms appearing in a sequence of documents read from standard input.
This class reads a sequence of documents from standard input and produces
two files containing the terms that appear in the documents: one in order of appearance, the other sorted.
The only mandatory argument is a basename, which is used as the
stem of the names of all generated files.
By default, the whole set of terms contained in the document collection is accumulated in a
set, and then dumped. For large collections, it is possible to limit the amount of memory used
by dumping at regular intervals the set of terms accumulated so far. The sets are then merged.
Note that in this case it is possible that the unsorted list of terms contains duplicates.
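The batched accumulation described above can be illustrated with a minimal sketch. It is not the actual implementation of this class: the class name TermAccumulationSketch, the maxTermsInMemory threshold and the in-memory list standing in for the dumped files are all hypothetical, and the sort uses Java's natural String ordering as an assumption. It only shows why the appearance-order list can contain duplicates once the set is dumped more than once, while the merged sorted list does not.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the MG4J implementation.
class TermAccumulationSketch {
    /** Terms dumped so far, in appearance order; stands in for the unsorted file. */
    final List<String> unsorted = new ArrayList<>();
    /** Terms seen since the last dump; a LinkedHashSet keeps appearance order. */
    private final Set<String> current = new LinkedHashSet<>();
    /** Hypothetical memory limit, expressed as a number of distinct terms. */
    private final int maxTermsInMemory;
    /** Number of dumps performed so far. */
    int dumps = 0;

    TermAccumulationSketch(int maxTermsInMemory) {
        this.maxTermsInMemory = maxTermsInMemory;
    }

    /** Records a term, dumping the current set when it reaches the limit. */
    void add(String term) {
        current.add(term);
        if (current.size() >= maxTermsInMemory) dump();
    }

    /** Appends the current set to the unsorted list and empties it. */
    void dump() {
        if (current.isEmpty()) return;
        unsorted.addAll(current);
        current.clear();
        dumps++;
    }

    /** Merges everything dumped so far into a sorted, duplicate-free list. */
    List<String> sortedTerms() {
        dump();
        return new ArrayList<>(new TreeSet<>(unsorted));
    }

    public static void main(String[] args) {
        TermAccumulationSketch acc = new TermAccumulationSketch(2);
        for (String t : new String[] {"b", "a", "b", "c", "a"}) acc.add(t);
        List<String> sorted = acc.sortedTerms();
        System.out.println("unsorted: " + acc.unsorted);
        System.out.println("sorted:   " + sorted);
        System.out.println("dumps:    " + acc.dumps);
    }
}
```

A term that reappears after a dump is no longer in the in-memory set, so it is written again; this is the situation that the .dups suffix described below is meant to flag.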
These are the files currently generated:
- basename.terms.unsorted[.dups]
- For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely,
the i-th line of the file (starting from 0) contains the literal
string corresponding to term index i. Terms are in appearance order.
If several dumps were necessary to build the index, the suffix .dups is added
as a reminder that the list may contain duplicates.
- basename.terms
- For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely,
the i-th line of the file (starting from 0) contains the literal
string corresponding to term index i. Terms are sorted lexicographically.
- basename.properties
- A Java property file containing information about the index.
Currently, the following keys are generated:
- basename
- the basename (see above);
- documents
- number of documents in the collection;
- terms
- number of indexed terms.
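As an illustration of the file layout listed above, the following sketch reads basename.terms and basename.properties back. The class ZerothPassOutputReader and its methods are hypothetical helpers, not part of MG4J. The sketch assumes that the term file has one term per line in UTF-8, as described, and that the documents and terms values are plain decimal integers, which the description suggests but does not state.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of MG4J.
class ZerothPassOutputReader {
    /** Reads basename.terms; element i of the result is the term with index i. */
    static List<String> readTerms(String basename) throws IOException {
        return Files.readAllLines(Paths.get(basename + ".terms"), StandardCharsets.UTF_8);
    }

    /** Loads basename.properties as a standard Java property file. */
    static Properties readProperties(String basename) throws IOException {
        Properties properties = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(basename + ".properties"))) {
            properties.load(in);
        }
        return properties;
    }

    public static void main(String[] args) throws IOException {
        String basename = args[0];
        List<String> terms = readTerms(basename);
        Properties properties = readProperties(basename);
        // Assumption: both values are written as plain decimal integers.
        int declaredTerms = Integer.parseInt(properties.getProperty("terms"));
        System.out.println("documents: " + properties.getProperty("documents"));
        System.out.println("terms in file: " + terms.size() + ", declared: " + declaredTerms);
    }
}
```

Comparing the number of lines in basename.terms with the terms key is a cheap consistency check before the files are passed to later stages.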
- Since:
- 0.9
- Author:
- Sebastiano Vigna
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
main
public static void main(String[] arg)
        throws UnsupportedEncodingException, FileNotFoundException
- Throws:
UnsupportedEncodingException
FileNotFoundException