Packages | |
it.unimi.dsi.mg4j.index | Index generation and access. |
it.unimi.dsi.mg4j.io | Bit-level I/O classes. |
it.unimi.dsi.mg4j.query | Classes for handling queries. |
it.unimi.dsi.mg4j.query.parser | |
it.unimi.dsi.mg4j.search | Iterators over documents, and composition thereof. |
it.unimi.dsi.mg4j.search.score | Classes for assigning scores to documents. |
it.unimi.dsi.mg4j.tool | Command-line tools for index construction. |
it.unimi.dsi.mg4j.util | General-purpose utility classes. |
MG4J (Managing Gigabytes for Java) is a collaborative effort aimed at providing a free Java implementation of inverted-index compression techniques; as a by-product, it offers several general-purpose optimised classes, including fast and compact mutable strings, bit-level I/O, fast unsynchronised buffered streams and (possibly signed) minimal perfect hashing.
Generating full-text inverted indices for very large sets of documents (say, hundreds of millions) and accessing them efficiently is a nontrivial task. MG4J tries to make the techniques described in the book Managing Gigabytes, by Ian Witten, Alistair Moffat and Timothy Bell, accessible in a clean, object-oriented environment, without having to deal with bit-level operations.
MG4J provides layered access to index construction and access. At the highest level, you can build an index using the command-line tools, open it using an Index, and then interrogate it using our QueryParser, which will turn a query into a DocumentIterator. Or you can get from an Index an IndexReader, from which, given a term, you can obtain an IndexIterator returning all documents containing the term (and the positions of the term in the document, if the index is full text).
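To make the lower layer concrete, the following is a minimal, self-contained sketch of what "given a term, obtain all documents containing the term, with positions" means. It does not use the MG4J API: the class ToyInvertedIndex and its methods add and documents are hypothetical names invented for this illustration, the index lives uncompressed in memory, and tokenisation is a plain whitespace split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory positional inverted index. Illustrative only; NOT MG4J code. */
public class ToyInvertedIndex {
    // term -> (document number -> positions of the term in that document)
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Splits the text on whitespace and records every term occurrence. */
    public void add(int document, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            if (tokens[position].isEmpty()) continue; // empty input yields one empty token
            postings.computeIfAbsent(tokens[position], t -> new TreeMap<>())
                    .computeIfAbsent(document, d -> new ArrayList<>())
                    .add(position);
        }
    }

    /** Documents containing the term, in increasing document order, with positions. */
    public Map<Integer, List<Integer>> documents(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeMap<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "managing gigabytes for java");
        index.add(1, "java bit streams and java strings");
        index.documents("java").forEach((doc, positions) ->
                System.out.println("document " + doc + " at positions " + positions));
    }
}
```

A real index of the kind described here would store the same information in compressed form and iterate over it lazily instead of materialising maps and lists.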
MG4J is distributed under the GNU Lesser General Public License.
MG4J is a spin-off of the Ubi Project: after the development of a distributed, fault-tolerant crawler, a set of tools to index the results of a crawl was clearly a necessity. Since all the techniques implemented are standard, distributing the resulting software seemed a good idea.
Writing in Java code that (essentially) has to roll bits over and over may seem a Bad Thing™. However, one should take into consideration the following points:
None of the classes is synchronised. If multiple threads access an instance of one of these classes concurrently, and at least one of the threads modifies it, access must be synchronised externally. Iterators will behave unpredictably in the presence of concurrent modifications.
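External synchronisation can be sketched as follows. This is only an illustration under stated assumptions: java.lang.StringBuilder stands in for any unsynchronised class, and the names ExternalSyncSketch and appendConcurrently are hypothetical. The point is that every access to the shared instance, reads included, goes through the same lock.

```java
public class ExternalSyncSketch {
    /**
     * Appends to one shared, unsynchronised buffer from several threads,
     * guarding each access with a single external lock.
     * Returns the final length of the buffer.
     */
    static int appendConcurrently(int threads, int appendsPerThread) {
        final StringBuilder shared = new StringBuilder(); // stand-in for an unsynchronised class
        final Object lock = new Object();                 // the external lock
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < appendsPerThread; i++) {
                    synchronized (lock) {
                        shared.append('x');
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        synchronized (lock) {
            return shared.length();
        }
    }

    public static void main(String[] args) {
        // With the lock in place no append is lost, so the length is threads * appendsPerThread.
        System.out.println(appendConcurrently(4, 10_000));
    }
}
```

Without the synchronized blocks the same program may lose appends or throw, which is the unpredictable behaviour the paragraph above warns about.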
MG4J uses three packages providing high-performance containers and
algorithms, that is, the COLT distribution,