Package it.unimi.dsi.mg4j.index

Index generation and access.

See:
          Description

Interface Summary
CompressionFlags This interface provides constants to be used as compression flags for the constructors of the IndexReader and IndexWriter classes.
IndexIterator An iterator over an inverted list.
TermMap A map from term to term indices.
 

Class Summary
AbstractTermMap An abstract implementation of a map from term to term indices.
Index An abstract representation of an index.
IndexProperties This class provides symbolic names for properties of an index.
IndexReader Provides facilities to read directly an inverted index.
IndexWriter Provides facilities to write an inverted index.
SkipIndex An abstract representation of an index with skips.
SkipIndexProperties This class provides symbolic names for properties of a SkipIndex.
SkipIndexReader Provides facilities to read a skip inverted index from an InputBitStream.
SkipIndexWriter Provides facilities to write skip inverted indices, that is, inverted indices with an additional skip structure.
SkipIndexWriter.TowerData A structure maintaining statistical data about tower construction.
 

Package it.unimi.dsi.mg4j.index Description

Index generation and access.

This package contains the classes that handle index generation and access. The interval iterators defined in it.unimi.dsi.mg4j.search build upon the classes of this package to provide answer to queries using interval semantics, but it is also possible to access an index directly.

You can easily build indices using the tools in it.unimi.dsi.mg4j.tool. Once an index has been built, it can be opened using an Index object, which gathers metadata that is necessary to access the index. You do not create an Index with a constructor: rather, you use the static factory Index.getInstance(CharSequence) (or one of its variants) to create an instance. This is necessary so that different kind of indices can be treated transparently: for example, the factory may return a SkipIndex if the index was created with skipping information, but you do not need to know that.

From an Index, you can easily obtain either an IndexReader, which allows to scan sequentially or randomly the index. In turn from an IndexReader you can obtain a IndexIterator returning the documents containing a certain term and the position of the term within the document.

But there is more: an IndexIterator is a kind of DocumentIterator, and DocumentIterators can be combined in several ways using the classes of the package it.unimi.dsi.mg4j.search: for instance, you can combine document iterators using AND/OR. Note that you can combine document iterators on different indices, but of course the operation is meaningful only if the two indices contain different information about the same document collection (e.g., title and main text).

More importantly, if the index is full text (the default) for each document containing the term you can get interval iterators that return intervals representing extents of text satisfying the query: for instance, in case of an AND of two terms, the intervals will contain both terms.

Structure of an inverted index

An inverted index is made by a sequence of inverted lists (one inverted list for each term). Inverted lists are made by document records: each document record contains information about the occurrences of the term within a certain document.

More precisely, each inverted list starts with a suitably encoded integer, called the frequency, which is the number of document records that will follow (i.e., the number of documents in which the term appears). After that, there are exactly as many document records as the frequency.

Each document record is made by two parts:

As a basic and fundamental implementation, the classes of this package provide methods that write and read document data in a default form. In this default structure, each document data is a suitable coding of a (strictly increasing) sequence of integers, that correspond to the positions where the term occurs within the document. The length of the sequence (i.e., the number of positions in at which the term appears) is called the count (it is also common to call it "frequency", but we find this usage confusing).