it.unimi.dsi.mg4j.index
Class DiskBasedIndex

java.lang.Object
  extended by it.unimi.dsi.mg4j.index.DiskBasedIndex

public class DiskBasedIndex
extends Object

A static container providing facilities to load an index based on data stored on disk.

This class contains several useful static methods, such as readOffsets(InputBitStream, int) and readSizes(InputBitStream, int), and static factory methods, such as getInstance(CharSequence, boolean, boolean, boolean, EnumMap), that take care of reading the properties associated with the index, identifying the correct Index implementation that should be used to load the index, and loading the necessary data into memory.

As an option, a disk-based index can be loaded into main memory (key: Index.UriKeys.INMEMORY), returning an InMemoryIndex/InMemoryHPIndex, or mapped into main memory (key: Index.UriKeys.MAPPED), returning a MemoryMappedIndex/InMemoryHPIndex (note that the value assigned to the keys is irrelevant). In both cases, some insurmountable Java problems prevent using indices whose size exceeds two gigabytes (but see MemoryMappedIndex for some elaboration on this topic).
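
The keys above reach the factory methods as an EnumMap. The following is a minimal sketch of how such a map could be built from a query string. UriKeys is a stand-in enum declared locally that lists only the three keys named on this page (it is not MG4J's Index.UriKeys), and the "key=value;key=value" syntax and the parse helper are assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.Locale;

final class QueryPropertiesSketch {
    /** Stand-in for Index.UriKeys; only the keys named on this page are listed. */
    enum UriKeys { INMEMORY, MAPPED, OFFSETSTEP }

    /** Parses a query such as "inmemory=1;offsetstep=64" (hypothetical syntax) into an EnumMap. */
    static EnumMap<UriKeys, String> parse(String query) {
        EnumMap<UriKeys, String> result = new EnumMap<>(UriKeys.class);
        if (query == null || query.isEmpty()) return result;
        for (String pair : query.split(";")) {
            if (pair.isEmpty()) continue;
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // valueOf throws IllegalArgumentException on an unknown key.
            result.put(UriKeys.valueOf(key.trim().toUpperCase(Locale.ROOT)), value);
        }
        return result;
    }

    public static void main(String[] args) {
        EnumMap<UriKeys, String> q = parse("inmemory=1;offsetstep=64");
        // For INMEMORY and MAPPED only the presence of the key matters, not its value.
        System.out.println(q.containsKey(UriKeys.INMEMORY));
        System.out.println(q.get(UriKeys.OFFSETSTEP));
    }
}
```

With the real library, a map of this shape (keyed by Index.UriKeys) would be passed as the queryProperties argument of the factory methods.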

Moreover, by default the term-offset list is accessed using a SemiExternalOffsetList with a step of DEFAULT_OFFSET_STEP. This behaviour can be changed using the URI key Index.UriKeys.OFFSETSTEP.
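
To give an idea of what a step means for an offset list, here is a minimal sketch of the general sampling technique: every step-th offset is kept explicitly, and the remaining ones are rebuilt by decoding at most step - 1 compressed gaps. This is not the code of SemiExternalOffsetList. The class name, the variable-length byte coding of the gaps and the in-memory byte array are all assumptions chosen to keep the example self-contained.

```java
import java.io.ByteArrayOutputStream;

final class SampledOffsetsSketch {
    private final long[] sampleValue; // sampleValue[k] == offsets[k * step]
    private final int[] samplePos;    // position in gaps of the gap that follows offsets[k * step]
    private final byte[] gaps;        // gaps between consecutive offsets, 7 bits per byte
    private final int step;
    private final int size;

    /** Builds the structure from a nondecreasing array of offsets. */
    SampledOffsetsSketch(long[] offsets, int step) {
        this.step = step;
        this.size = offsets.length;
        int samples = (offsets.length + step - 1) / step;
        sampleValue = new long[samples];
        samplePos = new int[samples];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < offsets.length; i++) {
            if (i % step == 0) {
                sampleValue[i / step] = offsets[i];
                samplePos[i / step] = out.size();
            }
            if (i + 1 < offsets.length) writeGap(out, offsets[i + 1] - offsets[i]);
        }
        gaps = out.toByteArray();
    }

    private static void writeGap(ByteArrayOutputStream out, long gap) {
        while ((gap & ~0x7FL) != 0) {
            out.write((int) ((gap & 0x7F) | 0x80)); // high bit set: more bytes follow
            gap >>>= 7;
        }
        out.write((int) gap);
    }

    /** Returns the index-th offset, decoding at most step - 1 gaps. */
    long get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException("Index: " + index);
        int k = index / step;
        long value = sampleValue[k];
        int pos = samplePos[k];
        for (int i = k * step; i < index; i++) {
            long gap = 0;
            int shift = 0;
            byte b;
            do {
                b = gaps[pos++];
                gap |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            value += gap;
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = { 0, 5, 300, 301, 70000 };
        SampledOffsetsSketch list = new SampledOffsetsSketch(offsets, 2);
        for (int i = 0; i < offsets.length; i++) System.out.println(list.get(i));
    }
}
```

A larger step keeps fewer samples in memory at the price of decoding more gaps per access; a step of 1 degenerates into a plain array of offsets.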

Disk-based indices are the workhorse of MG4J. All other indices (clustered, remote, etc.) ultimately rely on disk-based indices to provide results.

Note that not all data produced by Scan and by the other indexing utilities are actually necessary to run a disk-based index. Usually the property file and the index file (plus the positions file, for high-performance indices) are sufficient; if one needs random access, the offsets file must also be present, and if the compression method requires document sizes or if sizes are requested explicitly, the sizes file must also be present. A StringMap and possibly a PrefixMap will be fetched automatically by getInstance(CharSequence, boolean, boolean) using standard extensions.
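
The rule above can be sketched as a small helper that lists the files expected for a given basename. The helper and its name are hypothetical, and the extension strings are assumed values of the corresponding *_EXTENSION constants; the actual values are on the Constant Field Values page.

```java
import java.util.ArrayList;
import java.util.List;

final class RequiredFilesSketch {
    // Assumed values of the *_EXTENSION constants of DiskBasedIndex.
    static final String PROPERTIES = ".properties";
    static final String INDEX = ".index";
    static final String POSITIONS = ".positions";
    static final String OFFSETS = ".offsets";
    static final String SIZES = ".sizes";

    /** Lists the files that should exist for the given basename and requirements. */
    static List<String> requiredFiles(String basename, boolean highPerformance,
            boolean randomAccess, boolean documentSizes) {
        List<String> files = new ArrayList<>();
        files.add(basename + PROPERTIES);                       // always needed
        files.add(basename + INDEX);                            // always needed
        if (highPerformance) files.add(basename + POSITIONS);   // high-performance indices only
        if (randomAccess) files.add(basename + OFFSETS);        // needed for random access
        if (documentSizes) files.add(basename + SIZES);         // needed when sizes are required or requested
        return files;
    }

    public static void main(String[] args) {
        System.out.println(requiredFiles("corpus-text", false, true, false));
    }
}
```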

Thread safety

A disk-based index is thread safe as long as the offset list, the size list and the term/prefix map are. The static factory methods provided by this class load offsets and sizes using data structures that are thread safe. If you use a constructor directly, instead, it is your responsibility to pass thread-safe data structures.

Since:
1.1
Author:
Sebastiano Vigna

Field Summary
static int DEFAULT_OFFSET_STEP
          The default value for the query parameter Index.UriKeys.OFFSETSTEP.
static String FREQUENCIES_EXTENSION
          Standard extension for the file of frequencies.
static String GLOBCOUNTS_EXTENSION
          Standard extension for the file of global counts.
static String INDEX_EXTENSION
          Standard extension for the index bitstream.
static String OFFSETS_EXTENSION
          Standard extension for the file of offsets.
static String POSITIONS_EXTENSION
          Standard extension for the positions bitstream of a high-performance index.
static String PREFIXMAP_EXTENSION
          Standard extension for the prefix map.
static String PROPERTIES_EXTENSION
          Standard extension for the index properties.
static String SIZES_EXTENSION
          Standard extension for the file of sizes.
static String STATS_EXTENSION
          Standard extension for the stats file.
static String TERMMAP_EXTENSION
          Standard extension for the term map.
static String TERMS_EXTENSION
          Standard extension for the file of terms.
static String UNSORTED_TERMS_EXTENSION
          Standard extension for the file of terms, unsorted.
 
Method Summary
static BitStreamIndex getInstance(CharSequence basename)
          Returns a new local index, trying to guess reasonable term and prefix maps from the basename, loading offsets but loading document sizes only if it is necessary.
static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess)
          Returns a new local index, trying to guess reasonable term and prefix maps from the basename, and loading document sizes only if it is necessary.
static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes)
          Returns a new disk-based index, guessing reasonable term and prefix maps from the basename.
static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps)
          Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties)
          Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
static BitStreamIndex getInstance(CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties)
          Returns a new disk-based index, using preloaded Properties and possibly guessing reasonable term and prefix maps from the basename.
static BitStreamIndex getInstance(CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<Index.UriKeys,String> queryProperties)
          Returns a new disk-based index, loading exactly the specified parts and using preloaded Properties.
static PrefixMap<? extends CharSequence> loadPrefixMap(String filename)
          Utility static method that loads a prefix map.
static StringMap<? extends CharSequence> loadStringMap(String filename)
          Utility static method that loads a term map.
static LongList readOffsets(InputBitStream in, int T)
          Utility method to load a compressed offset file into a list.
static IntList readSizes(InputBitStream in, int N)
          Utility method to load a compressed size file into a list.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_OFFSET_STEP

public static final int DEFAULT_OFFSET_STEP
The default value for the query parameter Index.UriKeys.OFFSETSTEP.

See Also:
Constant Field Values

INDEX_EXTENSION

public static final String INDEX_EXTENSION
Standard extension for the index bitstream.

See Also:
Constant Field Values

POSITIONS_EXTENSION

public static final String POSITIONS_EXTENSION
Standard extension for the positions bitstream of a high-performance index.

See Also:
Constant Field Values

PROPERTIES_EXTENSION

public static final String PROPERTIES_EXTENSION
Standard extension for the index properties.

See Also:
Constant Field Values

SIZES_EXTENSION

public static final String SIZES_EXTENSION
Standard extension for the file of sizes.

See Also:
Constant Field Values

OFFSETS_EXTENSION

public static final String OFFSETS_EXTENSION
Standard extension for the file of offsets.

See Also:
Constant Field Values

GLOBCOUNTS_EXTENSION

public static final String GLOBCOUNTS_EXTENSION
Standard extension for the file of global counts.

See Also:
Constant Field Values

FREQUENCIES_EXTENSION

public static final String FREQUENCIES_EXTENSION
Standard extension for the file of frequencies.

See Also:
Constant Field Values

TERMS_EXTENSION

public static final String TERMS_EXTENSION
Standard extension for the file of terms.

See Also:
Constant Field Values

UNSORTED_TERMS_EXTENSION

public static final String UNSORTED_TERMS_EXTENSION
Standard extension for the file of terms, unsorted.

See Also:
Constant Field Values

TERMMAP_EXTENSION

public static final String TERMMAP_EXTENSION
Standard extension for the term map.

See Also:
Constant Field Values

PREFIXMAP_EXTENSION

public static final String PREFIXMAP_EXTENSION
Standard extension for the prefix map.

See Also:
Constant Field Values

STATS_EXTENSION

public static final String STATS_EXTENSION
Standard extension for the stats file.

See Also:
Constant Field Values

Method Detail

readOffsets

public static LongList readOffsets(InputBitStream in,
                                   int T)
                            throws IOException
Utility method to load a compressed offset file into a list.

Parameters:
in - the input bit stream providing the offsets (see BitStreamIndexWriter).
T - the number of terms indexed.
Returns:
a list of longs backed by an array; the list has an additional final element of index T that gives the number of bytes of the index file.
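
Because the returned list has T + 1 elements, the extent of every term can be computed in the same way, including the last one. A minimal sketch, assuming that element t marks where the data for term t starts in the index file; it uses a plain long[] instead of a LongList to stay self-contained, and the class and method names are hypothetical.

```java
final class OffsetExtentsSketch {
    /** Returns the extent of the given term, in the same unit used by the offsets. */
    static long extent(long[] offsets, int term) {
        int numberOfTerms = offsets.length - 1; // the last element is the end marker
        if (term < 0 || term >= numberOfTerms) throw new IndexOutOfBoundsException("No such term: " + term);
        return offsets[term + 1] - offsets[term];
    }

    public static void main(String[] args) {
        long[] offsets = { 0, 10, 25, 25, 40 }; // four terms plus the final marker
        for (int t = 0; t < offsets.length - 1; t++)
            System.out.println("term " + t + ": " + extent(offsets, t));
    }
}
```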
Throws:
IOException

readSizes

public static IntList readSizes(InputBitStream in,
                                int N)
                         throws IOException
Utility method to load a compressed size file into a list.

Parameters:
in - the input bit stream providing the sizes (see BitStreamIndexWriter).
N - the number of documents indexed.
Returns:
a list of integers backed by an array.
Throws:
IOException

loadStringMap

public static StringMap<? extends CharSequence> loadStringMap(String filename)
                                                       throws IOException
Utility static method that loads a term map.

Parameters:
filename - the name of the file containing the term map.
Returns:
the map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.
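
The contract above (null for a missing file, an exception for any other I/O problem) can be sketched with standard Java serialization. This is only an illustration of the contract, not the library's implementation: the class and method names are hypothetical, and, unlike loadStringMap, the sketch also declares ClassNotFoundException.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.ObjectInputStream;

final class LoadOrNullSketch {
    /** Deserializes an object from the given file, or returns null if the file cannot be found. */
    static Object loadObjectOrNull(String filename) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(filename)))) {
            return in.readObject();
        } catch (FileNotFoundException e) {
            // A missing file is not an error: the caller simply gets no map.
            return null;
        }
        // Any other IOException (e.g., a corrupted stream) propagates to the caller.
    }

    public static void main(String[] args) throws Exception {
        Object map = loadObjectOrNull("no-such-basename.termmap");
        System.out.println(map == null ? "no term map available" : "term map loaded");
    }
}
```

Callers should therefore test the result for null before using it as a term map.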

loadPrefixMap

public static PrefixMap<? extends CharSequence> loadPrefixMap(String filename)
                                                       throws IOException
Utility static method that loads a prefix map.

Parameters:
filename - the name of the file containing the prefix map.
Returns:
the map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.

getInstance

public static BitStreamIndex getInstance(CharSequence basename,
                                         Properties properties,
                                         StringMap<? extends CharSequence> termMap,
                                         PrefixMap<? extends CharSequence> prefixMap,
                                         boolean randomAccess,
                                         boolean documentSizes,
                                         EnumMap<Index.UriKeys,String> queryProperties)
                                  throws ClassNotFoundException,
                                         IOException,
                                         InstantiationException,
                                         IllegalAccessException
Returns a new disk-based index, loading exactly the specified parts and using preloaded Properties.

Parameters:
basename - the basename of the index.
properties - the properties obtained from the given basename.
termMap - the term map for this index, or null for no term map.
prefixMap - the prefix map for this index, or null for no prefix map.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException

getInstance

public static BitStreamIndex getInstance(CharSequence basename,
                                         Properties properties,
                                         boolean randomAccess,
                                         boolean documentSizes,
                                         boolean maps,
                                         EnumMap<Index.UriKeys,String> queryProperties)
                                  throws ClassNotFoundException,
                                         IOException,
                                         InstantiationException,
                                         IllegalAccessException
Returns a new disk-based index, using preloaded Properties and possibly guessing reasonable term and prefix maps from the basename.

Parameters:
basename - the basename of the index.
properties - the properties obtained by stemming basename.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded.
maps - if true, term and prefix maps will be guessed and loaded.
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
IllegalAccessException
InstantiationException
ClassNotFoundException
IOException
See Also:
getInstance(CharSequence, Properties, StringMap, PrefixMap, boolean, boolean, EnumMap)

getInstance

public static BitStreamIndex getInstance(CharSequence basename,
                                         boolean randomAccess,
                                         boolean documentSizes,
                                         boolean maps,
                                         EnumMap<Index.UriKeys,String> queryProperties)
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IOException,
                                         InstantiationException,
                                         IllegalAccessException
Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.

If there is a term map file (basename stemmed with .termmap), it is used as term map and, in case it implements PrefixMap, also as prefix map. Otherwise, we search for a prefix map (basename stemmed with .prefixmap) and, if it implements StringMap and no term map has been found, we use it as term map, too.
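
One possible reading of this guessing rule is sketched below: a term map that also implements PrefixMap doubles as prefix map, and a prefix map that also implements StringMap doubles as term map when no term map file was found. This is an interpretation of the description, not the library's code. StringMap and PrefixMap here are empty stand-in interfaces declared locally, and Maps, Both and resolve are hypothetical names.

```java
final class MapGuessingSketch {
    interface StringMap {}  // stand-in for the real term map type
    interface PrefixMap {}  // stand-in for the real prefix map type

    /** A map implementing both interfaces, used in the demonstration. */
    static final class Both implements StringMap, PrefixMap {}

    /** The pair of maps selected for the index; either field may be null. */
    static final class Maps {
        final StringMap termMap;
        final PrefixMap prefixMap;
        Maps(StringMap termMap, PrefixMap prefixMap) {
            this.termMap = termMap;
            this.prefixMap = prefixMap;
        }
    }

    /** Both arguments are the content of the respective file, or null if the file is missing. */
    static Maps resolve(StringMap fromTermMapFile, PrefixMap fromPrefixMapFile) {
        // A term map that is also a prefix map serves both purposes.
        if (fromTermMapFile instanceof PrefixMap) return new Maps(fromTermMapFile, (PrefixMap) fromTermMapFile);
        // A plain term map is paired with whatever the prefix map file provided.
        if (fromTermMapFile != null) return new Maps(fromTermMapFile, fromPrefixMapFile);
        // No term map: a prefix map that is also a string map serves both purposes.
        if (fromPrefixMapFile instanceof StringMap) return new Maps((StringMap) fromPrefixMapFile, fromPrefixMapFile);
        return new Maps(null, fromPrefixMapFile);
    }

    public static void main(String[] args) {
        Both both = new Both();
        Maps maps = resolve(null, both);
        System.out.println(maps.termMap == both && maps.prefixMap == both);
    }
}
```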

Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException

getInstance

public static BitStreamIndex getInstance(CharSequence basename,
                                         boolean randomAccess,
                                         boolean documentSizes,
                                         boolean maps)
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IOException,
                                         InstantiationException,
                                         IllegalAccessException
Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.

If there is a term map file (basename stemmed with .termmap), it is used as term map and, in case it implements PrefixMap, also as prefix map. Otherwise, we search for a prefix map (basename stemmed with .prefixmap) and, if it implements StringMap and no term map has been found, we use it as term map, too.

Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
See Also:
getInstance(CharSequence, boolean, boolean, boolean, EnumMap)

getInstance

public static BitStreamIndex getInstance(CharSequence basename,
                                         boolean randomAccess,
                                         boolean documentSizes)
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IOException,
                                         InstantiationException,
                                         IllegalAccessException
Returns a new disk-based index, guessing reasonable term and prefix maps from the basename.

Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException

getInstance

public static BitStreamIndex getInstance(CharSequence basename,
                                         boolean randomAccess)
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IOException,
                                         InstantiationException,
                                         IllegalAccessException
Returns a new local index, trying to guess reasonable term and prefix maps from the basename, and loading document sizes only if it is necessary.

Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException

getInstance

public static BitStreamIndex getInstance(CharSequence basename)
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IOException,
                                         InstantiationException,
                                         IllegalAccessException
Returns a new local index, trying to guess reasonable term and prefix maps from the basename, loading offsets but loading document sizes only if it is necessary.

Parameters:
basename - the basename of the index.
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException