|
||||||||||
PREV PACKAGE NEXT PACKAGE | FRAMES NO FRAMES |
See:
Description
Class Summary | |
FirstPass | Builds batched occurrence files from a list of documents read from standard input. |
MiddlePass | Permutes a set of batches and related files so that the resulting index has lexicographically ordered terms. |
Occurrence | A class denoting an occurrence. |
SecondPass | Builds an inverted index by merging occurrence batches produced by FirstPass . |
ZerothPass | Builds the list of terms appearing in a sequence of documents read from standard input. |
Line-command tools for index construction. The classes in this package contain a main method, and can be used to build indices starting from a document stream.
Index construction in MG4J is applied to a stream of documents separated by a given Unicode character. The tools in this package do not perform any kind of tokenization: every maximal subsequence of letters and digits (in the Java sense) will be indexed. More sophisticated tokenisation must be provided by a preprocessor, whose output can then be piped into MG4J (this allows to exploit more easily parallelism).
Indices built by MG4J are full text. This means that the position of every occurrence of every term in every document is recorded by the index (this is necessary to run proximity queries). However, you can optionally turn this feature off, and just record the number of occurrences of each term in each document.
Indices are built in several phases using command-line tools contained in the public classes of this package. You can use the -h option to get help on the possible options. More information is also contained in the documentation of each class.
Warning: several options require a size to be specified. Sizes are in bytes unless you explicitly use a unit (Ki, Mi, Gi, Ti, K, M, G, or T).
This pass is optional, but it is highly recommended if you are indexing a very large collection (the overall indexing time is usually smaller, and memory issues are much less likely), provided that your document stream is easily reproducible (e.g., it is in a file).
Firstly, you call ZerothPass
, that reads the
stream of documents and outputs some files to be used in the following stages;
most importantly, it produces a file containing the list of terms in lexicographic order:
this list is used to build the indices (but you can, at this stage, sort it in another
way, if you want the terms to be numbered differently; by the way, pass 0 is not optional
if you want to use a custom order for terms).
Then, you call MinimalPerfectHash
, that reads
the list of terms and produces a file containing a minimal perfect hash map used in the following
stages.
You call FirstPass
to
analyse the document stream and produce a number of batches. Each
batch contains information about a fixed, settable number of
occurrences. You should try to make batches as large as possible, avoiding
however swap and excessive garbage collection. This usually requires some tuning of the maximum available
memory options of the JVM. This phase is the slowest, and most
computationally intensive but it can be made sensibly faster if the zeroth pass is run. In this phase,
you can give a permutation to renumber the documents in an order that differs
from their natural order (i.e., the order in which they appear in the input
stream).
Optionally you can call MiddlePass
to permute the list of terms so that they are ordered lexicographically. If
ordering is not a concern, or if the zeroth pass was run, you can skip this pass.
Finally, you call SecondPass
to merge
the batches into an index. At this time you can specify the kind of coding you want to use
for the various parts of the index.
In this example, we do a (rather raw) indexing of a mailbox in standard format. We assume that the mailbox is contained in a file called INBOX, encoded in UTF-8 (the default encoding scheme used by MG4J), and that no Unicode NUL appears in the text.
We first filter the mailbox with sed so to insert a Unicode NUL
at the start of each message, and then pipe the result into ZerothPass
.
Note that we provide a large amount of memory to the JVM (this quantity should be as large as possible,
avoiding swap, say 3/4 of core memory; in the event of an OutOfMemoryError
, you should
try again with a larger size); also observe that we provide the document separator explicitly.
sed -e 's/^From /\x00From /' <INBOX | java -Xmx128M it.unimi.dsi.mg4j.tool.ZerothPass -d0 mail
This command produces some files stemmed from the basename mail. In particular, we now build a minimal perfect hash map (file mail.mph) using the sorted term list mail.terms, with the following command:
java -Xmx128M it.unimi.dsi.mg4j.util.MinimalPerfectHash -o mail.terms mail.mph
Now, we must proceed with the actual text indexing, with the command:
sed -e 's/^From /\x00From /' <INBOX | java -Xmx128M it.unimi.dsi.mg4j.tool.FirstPass -z -d0 mail
Again, the amount of memory for the JVM should be chosen as large as possible; if you have much memory,
you can also try to increase the batch size (using the suitable option to FirstPass
), which will
speed up the last phase. Note that the option -z
is used because we have already run the
zeroth pass.
Finally, we produce the index with the command:
java -Xmx128M it.unimi.dsi.mg4j.tool.SecondPass -Q32 -H8 mail
In this case, we are producing an index with skips, with a quantum of size 32 and towers of height ≤ 8 (i.e., looking at most 28 entries farther).
If you plan to access the index, it can be a good idea to produce a signed minimal perfect hash table recording the indexed terms. Such a table allows to compute in constant time the index of a given term and whether the term was indexed at all. This can be accomplished, for instance, as follows:
java -Xmx128M it.unimi.dsi.mg4j.util.MinimalPerfectHash -c it.unimi.dsi.mg4j.util.HashCodeSignedMinimalPerfectHash -o mail.terms mail.smph
You might do so after the zeroth pass, but it is wiser to proceed as suggested above and then build the signed map only at the end, because the signed map is significantly larger, and it is useless in the construction of the index.
Suppose that we want to skip the zeroth pass, for example, because we think that the sed-filtering phase is too slow. In this case, we can proceed directly with the actual text indexing, with the command:
sed -e 's/^From /\x00From /' <INBOX | java -Xmx128M it.unimi.dsi.mg4j.tool.FirstPass -d0 mail
Again, the amount of memory for the JVM should be chosen as large as possible. One should however
know that skipping the zeroth pass makes the first pass much more memory-consuming and, more importantly,
the amount of memory required by the first pass will increase during the text indexing; this may cause
an OutOfMemoryError
at a late time, something that cannot occur if the zeroth pass has
been run successfully.
At this point, if we want terms to be ordered lexicographically, we need to run the middle pass (the first pass has produced an index where terms are numbered in appearence order, instead). This step is entirely optional, and usually it runs faster than any other pass:
java -Xmx128M it.unimi.dsi.mg4j.tool.MiddlePass mail
Then, we conclude the index construction as in the previous case, by running SecondPass
,
and possibly building a signed minimal perfect hash map.
|
||||||||||
PREV PACKAGE NEXT PACKAGE | FRAMES NO FRAMES |