it.unimi.dsi.mg4j.index
Class Index

java.lang.Object
  extended by it.unimi.dsi.mg4j.index.Index
All Implemented Interfaces:
CompressionFlags
Direct Known Subclasses:
SkipIndex

public class Index
extends Object
implements CompressionFlags

An abstract representation of an index.

An instance of this class stores index data such as the basename, flags, etc. It makes it easy to build index readers and document iterators over the given index.

This class also provides a main method that can be used to dump data about an index for diagnostic purposes.

Since:
0.9
Author:
Paolo Boldi, Sebastiano Vigna

Field Summary
 String basename
          The basename of this index.
 int countCoding
          The coding for counts.
 it.unimi.dsi.mg4j.index.Index.EmptyDocumentIterator emptyDocumentIterator
          A singleton for an iterator returning no documents based on this index.
 int frequencyCoding
          The coding for frequencies.
 boolean hasCounts
          Whether this index contains counts.
 boolean hasPositions
          Whether this index contains positions.
 File indexFile
          The file containing the index.
 boolean isCaseSensitive
          Whether this index is case sensitive.
 int maxDocPos
          The maximum number of positions in a position list, or -1 if it is not known.
 int numberOfDocuments
          The number of documents of the collection.
 int numberOfTerms
          The number of terms of the collection.
 LongList offsets
          The offset of each term, if offsets were loaded or specified at creation time, or null.
 int pointerCoding
          The coding for pointers.
 int positionCoding
          The coding for positions.
 Properties properties
          The properties of this index.
 Set singletonSet
          A singleton set containing just this index.
 IntList sizes
          The size of each document, or null if sizes are not necessary in this index.
 TermMap termMap
          The term list for this index, or null if the term list was not loaded.
 
Fields inherited from interface it.unimi.dsi.mg4j.index.CompressionFlags
ARITH, CODING_NAME, COUNTS_DEFAULT, COUNTS_DELTA, COUNTS_GAMMA, COUNTS_SHIFT, DELTA, FREQUENCIES_DEFAULT, FREQUENCIES_DELTA, FREQUENCIES_GAMMA, FREQUENCIES_SHIFT, GAMMA, GOLOMB, INTERP, NIBBLE, NO_COUNTS, NO_POSITIONS, NONE, POINTERS_DEFAULT, POINTERS_DELTA, POINTERS_GAMMA, POINTERS_GOLOMB, POINTERS_SHIFT, POSITIONS_ARITH, POSITIONS_DEFAULT, POSITIONS_DELTA, POSITIONS_GAMMA, POSITIONS_GOLOMB, POSITIONS_INTERP, POSITIONS_SHIFT, POSITIONS_SKEWED_GOLOMB, SKEWED_GOLOMB, UNARY, ZETA
 
Constructor Summary
protected Index(CharSequence basename, TermMap termMap, boolean loadOffsets, ProgressMeter pm)
          Creates a new index using the given basename.
  Index(String basename, File indexFile, Properties properties, int numberOfDocuments, int numberOfTerms, int maxDocPos, boolean isCaseSensitive, long flags, TermMap termMap, LongList offsets, IntList sizes)
          Creates a new index using the given data.
 
Method Summary
static Index getInstance(CharSequence basename)
          Creates a new index using the given basename.
static Index getInstance(CharSequence basename, boolean loadOffsets, ProgressMeter pm)
          Creates a new index using the given basename, optionally loading offsets.
static Index getInstance(CharSequence basename, ProgressMeter pm)
          Creates a new index using the given basename, loading offsets.
static Index getInstance(CharSequence basename, TermMap termMap, boolean loadOffsets, ProgressMeter pm)
          Creates a new index using the given basename, optionally loading offsets.
 IndexReader getReader()
          Creates and returns a new IndexReader based on this index.
 IndexReader getReader(int bufferSize)
          Creates and returns a new IndexReader based on this index.
static TermMap loadTermMap(String filename)
          Utility static method that loads a term map.
static void main(String[] arg)
           
static LongList readOffsets(InputBitStream in, int T, ProgressMeter pm)
          Utility method to load a compressed offset file into a list.
static IntList readSizes(InputBitStream in, int N, ProgressMeter pm)
          Utility method to load a compressed size file into a list.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

basename

public final String basename
The basename of this index. All file names are derived from the basename. It may be null if this index has been built using Index(String, File, Properties, int, int, int, boolean, long, TermMap, LongList, IntList).


properties

public final Properties properties
The properties of this index. It may be null if this index has been built using Index(String, File, Properties, int, int, int, boolean, long, TermMap, LongList, IntList).


indexFile

public final File indexFile
The file containing the index. It may be null if this index has been built using Index(String, File, Properties, int, int, int, boolean, long, TermMap, LongList, IntList).


numberOfDocuments

public final int numberOfDocuments
The number of documents of the collection.


numberOfTerms

public final int numberOfTerms
The number of terms of the collection.


maxDocPos

public final int maxDocPos
The maximum number of positions in a position list, or -1 if it is not known.


frequencyCoding

public final int frequencyCoding
The coding for frequencies.


pointerCoding

public final int pointerCoding
The coding for pointers.


countCoding

public final int countCoding
The coding for counts.


positionCoding

public final int positionCoding
The coding for positions.


hasCounts

public final boolean hasCounts
Whether this index contains counts.


hasPositions

public final boolean hasPositions
Whether this index contains positions.


isCaseSensitive

public final boolean isCaseSensitive
Whether this index is case sensitive.


singletonSet

public final Set singletonSet
A singleton set containing just this index.


offsets

public final LongList offsets
The offset of each term, if offsets were loaded or specified at creation time, or null.


sizes

public final IntList sizes
The size of each document, or null if sizes are not necessary in this index.


termMap

public final TermMap termMap
The term list for this index, or null if the term list was not loaded.


emptyDocumentIterator

public final it.unimi.dsi.mg4j.index.Index.EmptyDocumentIterator emptyDocumentIterator
A singleton for an iterator returning no documents based on this index.

Constructor Detail

Index

protected Index(CharSequence basename,
                TermMap termMap,
                boolean loadOffsets,
                ProgressMeter pm)
         throws IOException
Creates a new index using the given basename. TODO: describe what files are needed, and spend some time on properties also!

Parameters:
basename - the basename of the index.
termMap - the term list for this index, or null if there is no term list.
loadOffsets - offsets are loaded only if this parameter is true.
pm - an optional progress meter. If null, no progress information will be displayed.

Index

public Index(String basename,
             File indexFile,
             Properties properties,
             int numberOfDocuments,
             int numberOfTerms,
             int maxDocPos,
             boolean isCaseSensitive,
             long flags,
             TermMap termMap,
             LongList offsets,
             IntList sizes)
      throws IOException
Creates a new index using the given data.

This constructor provides an index that is initialised exactly to the provided data. It is mainly useful for debugging and testing purposes, when you are creating an index file (for instance, as a memory-based bit stream) but have no property file to use with getInstance(CharSequence).

It is usually safe to provide null for basename, indexFile and properties, but the caller is responsible for data consistency.

Parameters:
basename - the basename of this index, or null.
indexFile - the file containing this index, or null.
properties - the properties of this index, or null.
numberOfDocuments - the number of documents in this index.
numberOfTerms - the number of terms in this index.
maxDocPos - the maximum length of an occurrence list, or -1 if it is not known.
isCaseSensitive - whether this index is case sensitive.
flags - a bit mask setting the coding techniques to be used (see CompressionFlags).
termMap - the term list for this index, or null if there is no term list.
offsets - the offset list; may be null if you do not plan to use IndexReader.position(int).
sizes - the size list; may be null if your code does not require it.
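
The flags parameter is described as a bit mask, and the class exposes one int coding per component (frequencies, pointers, counts, positions). The following is a minimal sketch of how such a mask can be packed and unpacked in general. Every constant in it (the DEMO_ names, the 8-bit field width, the shift amounts and the numeric codes) is hypothetical and chosen only for illustration; the actual values are defined by CompressionFlags and are not given on this page.

```java
// Illustration only: this layout is NOT taken from MG4J's CompressionFlags.
final class FlagMaskSketch {
    // Hypothetical field width and shifts, one field per index component.
    static final int FIELD_BITS = 8;
    static final long FIELD_MASK = (1L << FIELD_BITS) - 1;
    static final int DEMO_FREQUENCIES_SHIFT = 0;
    static final int DEMO_POINTERS_SHIFT = 8;
    static final int DEMO_COUNTS_SHIFT = 16;
    static final int DEMO_POSITIONS_SHIFT = 24;

    // Hypothetical numeric codes for three coding techniques.
    static final int DEMO_GAMMA = 1;
    static final int DEMO_DELTA = 2;
    static final int DEMO_GOLOMB = 3;

    /** Places one coding value into its field, rejecting values that do not fit. */
    static long field(int coding, int shift) {
        if (coding < 0 || coding > FIELD_MASK) {
            throw new IllegalArgumentException("coding does not fit in " + FIELD_BITS + " bits: " + coding);
        }
        return (long) coding << shift;
    }

    /** Combines the four codings into a single long bit mask. */
    static long pack(int frequencies, int pointers, int counts, int positions) {
        return field(frequencies, DEMO_FREQUENCIES_SHIFT)
             | field(pointers, DEMO_POINTERS_SHIFT)
             | field(counts, DEMO_COUNTS_SHIFT)
             | field(positions, DEMO_POSITIONS_SHIFT);
    }

    /** Extracts the coding stored at the given shift. */
    static int unpack(long flags, int shift) {
        return (int) ((flags >>> shift) & FIELD_MASK);
    }

    public static void main(String[] args) {
        long flags = pack(DEMO_GAMMA, DEMO_DELTA, DEMO_GAMMA, DEMO_GOLOMB);
        System.out.println("pointers coding  = " + unpack(flags, DEMO_POINTERS_SHIFT));
        System.out.println("positions coding = " + unpack(flags, DEMO_POSITIONS_SHIFT));
    }
}
```

In real code the mask would be built from the constants inherited from CompressionFlags rather than from these stand-ins.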
Method Detail

readOffsets

public static LongList readOffsets(InputBitStream in,
                                   int T,
                                   ProgressMeter pm)
                            throws IOException
Utility method to load a compressed offset file into a list.

Parameters:
in - the input bit stream providing the offsets (see IndexWriter).
T - the number of terms indexed.
pm - an optional progress meter. If null, no progress information will be displayed.
Returns:
a list of longs backed by an array; the list has an additional final element of index T that gives the number of bytes of the index file.
Throws:
IOException
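
Because the returned list has T + 1 elements, the extent of each term's data can be obtained by subtracting consecutive entries. The sketch below illustrates that arithmetic with a plain long[] standing in for the LongList. It assumes that all entries, including the final one, are expressed in the same unit, which this page does not state explicitly; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Stand-in sketch: a long[] plays the role of the list returned by readOffsets.
final class OffsetExtents {
    /** Returns, for each of the T terms, the distance from its offset to the next entry. */
    static long[] extents(long[] offsets) {
        if (offsets.length == 0) {
            throw new IllegalArgumentException("the list must contain at least the final element");
        }
        final int numberOfTerms = offsets.length - 1; // T: the last entry marks the end of the file
        final long[] result = new long[numberOfTerms];
        for (int t = 0; t < numberOfTerms; t++) {
            if (offsets[t + 1] < offsets[t]) {
                throw new IllegalArgumentException("offsets must be nondecreasing");
            }
            result[t] = offsets[t + 1] - offsets[t];
        }
        return result;
    }

    public static void main(String[] args) {
        long[] offsets = { 0, 12, 40, 45 }; // made-up values for T = 3 terms
        System.out.println(Arrays.toString(extents(offsets))); // prints [12, 28, 5]
    }
}
```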

readSizes

public static IntList readSizes(InputBitStream in,
                                int N,
                                ProgressMeter pm)
                         throws IOException
Utility method to load a compressed size file into a list.

Parameters:
in - the input bit stream providing the sizes (see IndexWriter).
N - the number of documents indexed.
pm - an optional progress meter. If null, no progress information will be displayed.
Returns:
a list of integers backed by an array.
Throws:
IOException

loadTermMap

public static TermMap loadTermMap(String filename)
                           throws IOException
Utility static method that loads a term map.

Parameters:
filename - the name of the file containing the term map.
Returns:
the map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.
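
The contract above (null for a missing file, an exception for any other I/O failure) can be sketched with standard Java I/O. The code below is not MG4J's implementation: loadOrNull and termMapFilename are hypothetical helpers, raw bytes stand in for the TermMap, and the file name is formed by appending ".termmap" to the basename, which is an assumed reading of how the getInstance methods locate the term map file.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Stand-in sketch of the "null if the file is missing" loading pattern.
final class OptionalFileLoad {
    /** Assumed naming rule: the basename followed by ".termmap". */
    static String termMapFilename(CharSequence basename) {
        return basename + ".termmap";
    }

    /** Returns the file content, or null if the file cannot be found; other I/O errors propagate. */
    static byte[] loadOrNull(String filename) throws IOException {
        try (InputStream in = new FileInputStream(filename)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (FileNotFoundException e) {
            return null; // a missing file is an expected case, not an error
        }
    }

    public static void main(String[] args) throws IOException {
        String filename = termMapFilename("no-such-index-basename");
        byte[] data = loadOrNull(filename);
        System.out.println(filename + (data == null ? " is missing" : " has " + data.length + " bytes"));
    }
}
```

Callers of such a method need a null check before using the result, just as code reading the termMap field must allow for null.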

getInstance

public static Index getInstance(CharSequence basename,
                                TermMap termMap,
                                boolean loadOffsets,
                                ProgressMeter pm)
                         throws IOException
Creates a new index using the given basename, optionally loading offsets.

Parameters:
basename - the basename of the index.
termMap - the term list for this index, or null if there is no term list.
loadOffsets - offsets are loaded only if this parameter is true.
pm - an optional progress meter. If null, no progress information will be displayed.
Throws:
IOException

getInstance

public static Index getInstance(CharSequence basename,
                                boolean loadOffsets,
                                ProgressMeter pm)
                         throws IOException
Creates a new index using the given basename, optionally loading offsets. If there is a term map file (basename stemmed with ".termmap"), it is loaded; otherwise, the term map is set to null.

Parameters:
basename - the basename of the index.
loadOffsets - offsets are loaded only if this parameter is true.
pm - an optional progress meter. If null, no progress information will be displayed.
Throws:
IOException

getInstance

public static Index getInstance(CharSequence basename,
                                ProgressMeter pm)
                         throws IOException
Creates a new index using the given basename, loading offsets. If there is a term map file (basename stemmed with ".termmap"), it is loaded; otherwise, the term map is set to null.

Parameters:
basename - the basename of the index.
pm - an optional progress meter. If null, no progress information will be displayed.
Throws:
IOException

getInstance

public static Index getInstance(CharSequence basename)
                         throws IOException
Creates a new index using the given basename. If there is a term map file (basename stemmed with ".termmap"), it is loaded; otherwise, the term map is set to null.

This method provides no progress report.

Parameters:
basename - the basename of the index.
Throws:
IOException

getReader

public IndexReader getReader()
                      throws IOException
Creates and returns a new IndexReader based on this index. After that, you can use the reader to read this index.

Returns:
a new IndexReader to read this index.
Throws:
IOException

getReader

public IndexReader getReader(int bufferSize)
                      throws IOException
Creates and returns a new IndexReader based on this index. After that, you can use the reader to read this index.

Parameters:
bufferSize - the size of the buffer to be used when opening the InputBitStream underlying the IndexReader that is going to be returned.
Returns:
a new IndexReader to read this index.
Throws:
IOException

toString

public String toString()

main

public static void main(String[] arg)
                 throws IOException
Throws:
IOException