MG4J (Managing Gigabytes for Java) is a collaborative effort aimed at providing a free Java implementation of inverted-index compression techniques; as a by-product, it offers several general-purpose optimised classes, including fast and compact mutable strings, bit-level I/O, fast unsynchronised buffered streams and (possibly signed) minimal perfect hashing.

Packages
it.unimi.dsi.mg4j.index Index generation and access.
it.unimi.dsi.mg4j.io Bit-level I/O classes.
it.unimi.dsi.mg4j.query Classes for handling queries.
it.unimi.dsi.mg4j.query.parser  
it.unimi.dsi.mg4j.search Iterators over documents, and composition thereof.
it.unimi.dsi.mg4j.search.score Classes for assigning scores to documents.
it.unimi.dsi.mg4j.tool Command-line tools for index construction.
it.unimi.dsi.mg4j.util General-purpose utility classes.

 

Generating full-text inverted indices for very large sets of documents (say, hundreds of millions) and accessing them efficiently is a nontrivial task. MG4J tries to make the techniques described in the book Managing Gigabytes, by Ian Witten, Alistair Moffat and Timothy Bell, accessible in a clean, object-oriented environment, without the user having to deal with bit-level operations.
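
As an illustration of the bit-level work that such a library hides, the sketch below codes the gaps between successive document pointers of a posting list with Elias gamma codes, one of the standard techniques covered in Managing Gigabytes. The class and method names (GammaGapSketch, writeGamma, encode, decode) are hypothetical and are not MG4J's API; bits are kept as '0'/'1' characters for readability rather than packed into bytes, and document pointers are assumed to be non-negative and strictly increasing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of MG4J. */
final class GammaGapSketch {

    /** Appends the Elias gamma code of x (x >= 1) to out as '0'/'1' characters. */
    static void writeGamma(StringBuilder out, int x) {
        if (x < 1) throw new IllegalArgumentException("gamma codes start at 1");
        String binary = Integer.toBinaryString(x);                  // e.g. 5 -> "101"
        for (int i = 1; i < binary.length(); i++) out.append('0'); // unary length prefix
        out.append(binary);                                         // the number, most significant bit first
    }

    /** Encodes strictly increasing, non-negative document pointers as gamma-coded gaps. */
    static String encode(int[] docs) {
        StringBuilder out = new StringBuilder();
        int previous = -1;                       // makes the first gap docs[0] + 1, which is >= 1
        for (int d : docs) {
            writeGamma(out, d - previous);
            previous = d;
        }
        return out.toString();
    }

    /** Inverse of encode: reads gamma-coded gaps and rebuilds the document pointers. */
    static int[] decode(String bits) {
        List<Integer> docs = new ArrayList<>();
        int pos = 0;
        int previous = -1;
        while (pos < bits.length()) {
            int zeros = 0;
            while (bits.charAt(pos) == '0') {    // count the unary prefix
                zeros++;
                pos++;
            }
            int gap = Integer.parseInt(bits.substring(pos, pos + zeros + 1), 2);
            pos += zeros + 1;
            previous += gap;
            docs.add(previous);
        }
        int[] result = new int[docs.size()];
        for (int i = 0; i < result.length; i++) result[i] = docs.get(i);
        return result;
    }

    public static void main(String[] args) {
        String bits = encode(new int[] {3, 4, 10});
        System.out.println(bits);                // 00100100110
        System.out.println(decode(bits).length); // 3
    }
}
```

Small gaps get short codes (a gap of 1 takes a single bit), which is why coding gaps rather than absolute document pointers pays off for frequent terms.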

MG4J provides layered access to index construction and access. At the highest level, you can build an index using the command-line tools, open it using an Index, and then interrogate it using our QueryParser, which will turn a query into a DocumentIterator. Alternatively, you can get an IndexReader from an Index; given a term, the reader provides an IndexIterator returning all documents containing the term (and the positions of the term in each document, if the index is full text).
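
The two levels of access can be pictured with a toy in-memory index. Everything in this sketch is hypothetical (TinyIndex, add, postings, and); none of it is MG4J's actual API, whose classes and signatures should be taken from the package documentation. The low-level call returns the documents and positions for a single term, while the high-level call composes such lists into the answer to a conjunctive query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy index; it only mirrors the layering, not MG4J's classes. */
final class TinyIndex {
    // term -> (document number -> positions of the term in that document)
    private final Map<String, TreeMap<Integer, List<Integer>>> index = new TreeMap<>();

    /** Indexes one document, splitting on whitespace and lower-casing the terms. */
    void add(int doc, String text) {
        String[] terms = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            index.computeIfAbsent(terms[pos], t -> new TreeMap<>())
                 .computeIfAbsent(doc, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    /** Low level: documents containing the term, each with the term's positions. */
    Map<Integer, List<Integer>> postings(String term) {
        TreeMap<Integer, List<Integer>> found = index.get(term.toLowerCase());
        if (found == null) {
            return Collections.emptyMap();
        }
        return Collections.unmodifiableMap(found);
    }

    /** High level: documents containing every term of a whitespace-separated query. */
    List<Integer> and(String query) {
        List<Integer> result = null;
        for (String term : query.trim().split("\\s+")) {
            List<Integer> docs = new ArrayList<>(postings(term).keySet());
            if (result == null) {
                result = docs;           // first term: start from its documents
            } else {
                result.retainAll(docs);  // later terms: keep only the common documents
            }
        }
        if (result == null) {
            return Collections.emptyList();
        }
        return result;
    }
}
```

For instance, after indexing document 0 as "managing gigabytes for java" and document 2 as "managing java streams in java", postings("java") maps document 2 to positions 1 and 4, and and("managing java") returns documents 0 and 2.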

MG4J is distributed under the GNU Lesser General Public License.

History and Motivation

MG4J is a spin-off of the Ubi Project: after the development of a distributed, fault-tolerant crawler, a set of tools to index the results of a crawl was clearly a necessity. Since all the techniques implemented are standard, distributing the resulting software seemed a good idea.

Writing Java code that (essentially) has to roll bits over and over may seem a Bad Thing™. However, one should take into consideration the following points:

Conventions

None of the classes is synchronised. If multiple threads access an instance of one of these classes concurrently, and at least one of the threads modifies it, access must be synchronised externally. Iterators will behave unpredictably in the presence of concurrent modifications.
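
A minimal sketch of external synchronisation follows, using java.lang.StringBuilder as a stand-in for any unsynchronised class. The MG4J classes themselves are not used here, and the names ExternalSync and appendConcurrently are hypothetical. Every thread that touches the shared object does so while holding the same lock.

```java
/** Hypothetical example: guarding an unsynchronised object with an external lock. */
final class ExternalSync {

    /** Appends from several threads, holding the same lock for every access. */
    static int appendConcurrently(int threads, int appendsPerThread) {
        final StringBuilder shared = new StringBuilder(); // not thread-safe by itself
        final Object lock = new Object();                 // the external lock

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < appendsPerThread; i++) {
                    synchronized (lock) {                 // every modification holds the lock
                        shared.append('x');
                    }
                }
            });
            workers[t].start();
        }

        try {
            for (Thread w : workers) {
                w.join();                                 // wait for all workers to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }

        synchronized (lock) {                             // reads use the same lock
            return shared.length();
        }
    }

    public static void main(String[] args) {
        System.out.println(appendConcurrently(4, 10_000)); // 40000
    }
}
```

Without the synchronized blocks, concurrent appends could lose characters or corrupt the builder's internal state; with them, the final length is always the number of threads times the number of appends per thread.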

Package Dependencies

MG4J uses three packages providing high-performance containers and algorithms: the COLT distribution, Jal and fastutil. Moreover, all tools require the Java port of GNU getopt, and compiling MG4J requires javacc. Most utility and I/O classes, however, are completely self-contained.