it.unimi.dsi.mg4j.index
Class SkipBitStreamIndexWriter

java.lang.Object
  extended by it.unimi.dsi.mg4j.index.AbstractBitStreamIndexWriter
      extended by it.unimi.dsi.mg4j.index.BitStreamIndexWriter
          extended by it.unimi.dsi.mg4j.index.SkipBitStreamIndexWriter
All Implemented Interfaces:
IndexWriter

public class SkipBitStreamIndexWriter
extends BitStreamIndexWriter

Provides facilities to write skip inverted indices, that is, inverted indices with an additional skip structure. A skip inverted index allows a reader to skip ahead when reading inverted lists. More specifically, when reading the inverted list of a given term, one may want to skip all document records whose document pointer is smaller than a given integer. In a plain inverted index this is impossible: all document records must be read sequentially.
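
The skipping idea described above can be illustrated with a minimal, self-contained sketch. It is not the structure this class writes: SkipListSketch, its quantum parameter and its recordsRead counter are hypothetical names, and the single level of skip entries is an assumption chosen for brevity.

```java
/** Hypothetical illustration only; not part of MG4J. */
final class SkipListSketch {
    /** Sorted document pointers of one inverted list. */
    private final int[] pointers;
    /** Number of records between consecutive skip entries (an assumed "quantum"). */
    private final int quantum;
    /** skips[k] is the pointer of the record at index k * quantum. */
    private final int[] skips;
    /** Number of document records actually examined so far. */
    int recordsRead = 0;

    SkipListSketch(int[] pointers, int quantum) {
        this.pointers = pointers;
        this.quantum = quantum;
        skips = new int[(pointers.length + quantum - 1) / quantum];
        for (int k = 0; k < skips.length; k++) skips[k] = pointers[k * quantum];
    }

    /** Returns the index of the first record with pointer >= target, or pointers.length if none. */
    int skipTo(int target) {
        // Jump over every quantum whose successor still starts at or below the target.
        int block = 0;
        while (block + 1 < skips.length && skips[block + 1] <= target) block++;
        // Read records sequentially, starting from the selected quantum.
        for (int i = block * quantum; i < pointers.length; i++) {
            recordsRead++;
            if (pointers[i] >= target) return i;
        }
        return pointers.length;
    }
}
```

With nine pointers and a quantum of three, skipping to pointer 21 examines only the records of the second quantum instead of all records from the start of the list.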

The skipping structure used by this class is new: details can be found here.

Since:
0.6
Author:
Paolo Boldi, Sebastiano Vigna

Nested Class Summary
static class SkipBitStreamIndexWriter.TowerData
          A structure maintaining statistical data about tower construction.
 
Field Summary
 long bitsForEntryBitLengths
          The number of bits written for entry lengths.
 long bitsForQuantumBitLengths
          The number of bits written for quantum lengths.
static int DEFAULT_TEMP_BUFFER_SIZE
          The size of the buffer for the temporary file used to build an inverted list.
 long numberOfBlocks
          The number of written blocks.
 int prevEntryBitLength
          An estimate of the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.
 int prevQuantumBitLength
          An estimate of the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.
 SkipBitStreamIndexWriter.TowerData towerData
          The sum of all tower data computed so far.
 
Fields inherited from class it.unimi.dsi.mg4j.index.BitStreamIndexWriter
b, BEFORE_COUNT, BEFORE_DOCUMENT_RECORD, BEFORE_FREQUENCY, BEFORE_INVERTED_LIST, BEFORE_PAYLOAD, BEFORE_POINTER, BEFORE_POSITIONS, currentDocument, FIRST_UNUSED_STATE, frequency, lastDocument, log2b, maxCount, obs, state, writtenDocuments
 
Fields inherited from class it.unimi.dsi.mg4j.index.AbstractBitStreamIndexWriter
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding
 
Constructor Summary
SkipBitStreamIndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int q, int h)
          Creates a new skip index writer, with the specified basename.
SkipBitStreamIndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int q, int h)
          Creates a new skip index writer, with the specified basename.
SkipBitStreamIndexWriter(OutputBitStream obs, OutputBitStream offset, int N, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int q, int h)
          Creates a new skip index writer.
 
Method Summary
 void close()
          Closes this index writer, completing the index creation process and releasing all resources.
 OutputBitStream newDocumentRecord()
          Starts a new document record.
 long newInvertedList()
          Starts a new inverted list.
 void printStats(PrintStream stats)
          Writes to the given print stream statistical information about the index just built.
 Properties properties()
          Returns properties of the index generated by this index writer.
 int writeDocumentPointer(OutputBitStream out, int pointer)
          Writes a document pointer.
 int writeFrequency(int frequency)
          Writes the frequency.
 long writtenBits()
          Returns the overall number of bits written onto the underlying stream(s).
 
Methods inherited from class it.unimi.dsi.mg4j.index.BitStreamIndexWriter
writeDocumentPositions, writePayload, writePositionCount
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_TEMP_BUFFER_SIZE

public static final int DEFAULT_TEMP_BUFFER_SIZE
The size of the buffer for the temporary file used to build an inverted list. Inverted lists shorter than this number of bytes will be directly rebuilt from the buffer, and never flushed to disk.

See Also:
Constant Field Values
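
The buffering behaviour described for DEFAULT_TEMP_BUFFER_SIZE can be sketched as follows. This is an assumption-laden illustration, not the code used by this class: SpillBufferSketch and its methods are hypothetical, and the decision to spill as soon as the in-memory data would exceed the capacity is a guess at one reasonable policy.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration only; not MG4J's implementation. */
final class SpillBufferSketch implements AutoCloseable {
    private final int capacity;
    private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path spillFile;      // created lazily, only if the buffer overflows
    private OutputStream spill;  // non-null once data has been moved to disk

    SpillBufferSketch(int capacity) { this.capacity = capacity; }

    void write(byte[] data) throws IOException {
        if (spill == null && memory.size() + data.length > capacity) {
            // The list outgrew the buffer: move what is held in memory to a temporary file.
            spillFile = Files.createTempFile("inverted-list", ".tmp");
            spill = Files.newOutputStream(spillFile);
            memory.writeTo(spill);
            memory.reset();
        }
        if (spill != null) spill.write(data);
        else memory.write(data);
    }

    /** True if the list was too long to be kept entirely in memory. */
    boolean spilledToDisk() { return spill != null; }

    @Override
    public void close() throws IOException {
        if (spill != null) {
            spill.close();
            Files.deleteIfExists(spillFile);
        }
    }
}
```

A short list never touches the file system in this sketch, which is the property the field description refers to.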

towerData

public final SkipBitStreamIndexWriter.TowerData towerData
The sum of all tower data computed so far.


bitsForQuantumBitLengths

public long bitsForQuantumBitLengths
The number of bits written for quantum lengths.


bitsForEntryBitLengths

public long bitsForEntryBitLengths
The number of bits written for entry lengths.


numberOfBlocks

public long numberOfBlocks
The number of written blocks.


prevEntryBitLength

public int prevEntryBitLength
An estimate of the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.


prevQuantumBitLength

public int prevQuantumBitLength
An estimate of the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.

Constructor Detail

SkipBitStreamIndexWriter

public SkipBitStreamIndexWriter(CharSequence basename,
                                int numberOfDocuments,
                                boolean writeOffsets,
                                Map<CompressionFlags.Component,CompressionFlags.Coding> flags,
                                int q,
                                int h)
                         throws IOException
Creates a new skip index writer with the specified basename. The index is written to a file whose name is the basename followed by .index. If writeOffsets is true, an offset file (the basename followed by .offsets) is also produced.

The size of the internal temporary buffer will be DEFAULT_TEMP_BUFFER_SIZE.

Parameters:
basename - the basename.
numberOfDocuments - the number of documents in the collection to be indexed.
writeOffsets - if true, the offset file will also be produced.
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
q - the cache contains at most 2^h document records.
h - the maximum height of a skip tower.
Throws:
IOException
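
To make the role of h more concrete, here is a small sketch that reads the cache bound as two to the power h and assigns tower heights in the style of a deterministic skip list. The height rule is an assumption made for illustration and is not taken from this documentation; TowerHeightSketch and its methods are hypothetical names.

```java
/** Hypothetical illustration only; the height rule below is an assumption, not MG4J's. */
final class TowerHeightSketch {
    /** Upper bound on the number of cached records for maximum height h, read as 2^h. */
    static int cacheCapacity(int h) {
        return 1 << h;
    }

    /**
     * Assumed rule: entry k of a block gets one level, plus one more for each
     * trailing zero bit of k, never exceeding h; entry 0 gets the full height h.
     */
    static int towerHeight(int k, int h) {
        if (k == 0) return h;
        return Math.min(h, Integer.numberOfTrailingZeros(k) + 1);
    }
}
```

Under this assumed rule with h = 3, a block holds at most 8 entries and their heights are 3, 1, 2, 1, 3, 1, 2, 1, so taller towers are rarer and allow longer jumps.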

SkipBitStreamIndexWriter

public SkipBitStreamIndexWriter(CharSequence basename,
                                int numberOfDocuments,
                                boolean writeOffsets,
                                int tempBufferSize,
                                Map<CompressionFlags.Component,CompressionFlags.Coding> flags,
                                int q,
                                int h)
                         throws IOException
Creates a new skip index writer with the specified basename. The index is written to a file whose name is the basename followed by .index. If writeOffsets is true, an offset file (the basename followed by .offsets) is also produced.

Parameters:
basename - the basename.
numberOfDocuments - the number of documents in the collection to be indexed.
writeOffsets - if true, the offset file will also be produced.
tempBufferSize - the size in bytes of the internal temporary buffer (inverted lists shorter than this size will never be flushed to disk).
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
q - the cache contains at most 2^h document records.
h - the maximum height of a skip tower.
Throws:
IOException

SkipBitStreamIndexWriter

public SkipBitStreamIndexWriter(OutputBitStream obs,
                                OutputBitStream offset,
                                int N,
                                int tempBufferSize,
                                Map<CompressionFlags.Component,CompressionFlags.Coding> flags,
                                int q,
                                int h)
                         throws IOException
Creates a new skip index writer.

Parameters:
obs - the underlying output bit stream.
offset - the offset bit stream.
N - the number of documents in the collection to be indexed.
tempBufferSize - the size in bytes of the internal temporary buffer (inverted lists shorter than this size will never be flushed to disk).
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
q - the cache contains at most 2^h document records.
h - the maximum height of a skip tower.
Throws:
IOException

Method Detail

newInvertedList

public long newInvertedList()
                     throws IOException
Description copied from interface: IndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.

Specified by:
newInvertedList in interface IndexWriter
Overrides:
newInvertedList in class BitStreamIndexWriter
Returns:
the position (in bytes) of the underlying bit stream where the new inverted list starts.
Throws:
IOException

writeFrequency

public int writeFrequency(int frequency)
                   throws IOException
Description copied from interface: IndexWriter
Writes the frequency.

Specified by:
writeFrequency in interface IndexWriter
Overrides:
writeFrequency in class BitStreamIndexWriter
Parameters:
frequency - the (positive) number of document records that this inverted list will contain.
Returns:
the number of bits written.
Throws:
IOException

newDocumentRecord

public OutputBitStream newDocumentRecord()
                                  throws IOException
Description copied from interface: IndexWriter
Starts a new document record.

This method must be called exactly f times, where f is the frequency specified with IndexWriter.writeFrequency(int).

Specified by:
newDocumentRecord in interface IndexWriter
Overrides:
newDocumentRecord in class BitStreamIndexWriter
Returns:
the output bit stream where the next document record data should be written.
Throws:
IOException

writeDocumentPointer

public int writeDocumentPointer(OutputBitStream out,
                                int pointer)
                         throws IOException
Description copied from interface: IndexWriter
Writes a document pointer.

This method must be called immediately after IndexWriter.newDocumentRecord().

Specified by:
writeDocumentPointer in interface IndexWriter
Overrides:
writeDocumentPointer in class BitStreamIndexWriter
Parameters:
out - the output bit stream where the pointer will be written.
pointer - the document pointer.
Returns:
the number of bits written.
Throws:
IOException
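
The ordering constraints stated for newInvertedList(), writeFrequency(int), newDocumentRecord() and writeDocumentPointer(OutputBitStream, int) can be pictured as a small state machine. The sketch below is hypothetical and not this class's code: the state names are borrowed from the inherited BEFORE_* fields, but the transitions are an assumption covering only a pointer-only index with no counts, positions or payloads, and the methods take simplified arguments.

```java
/** Hypothetical illustration of the call protocol; not MG4J code. */
final class CallOrderSketch {
    private enum State { BEFORE_INVERTED_LIST, BEFORE_FREQUENCY, BEFORE_DOCUMENT_RECORD, BEFORE_POINTER }

    private State state = State.BEFORE_INVERTED_LIST;
    private int frequency;        // number of records promised by writeFrequency()
    private int writtenDocuments; // number of records started in the current list

    void newInvertedList() {
        boolean listComplete = state == State.BEFORE_DOCUMENT_RECORD && writtenDocuments == frequency;
        if (state != State.BEFORE_INVERTED_LIST && !listComplete)
            throw new IllegalStateException("previous inverted list is incomplete");
        state = State.BEFORE_FREQUENCY;
        writtenDocuments = 0;
    }

    void writeFrequency(int frequency) {
        if (state != State.BEFORE_FREQUENCY) throw new IllegalStateException("call newInvertedList() first");
        if (frequency <= 0) throw new IllegalArgumentException("frequency must be positive");
        this.frequency = frequency;
        state = State.BEFORE_DOCUMENT_RECORD;
    }

    void newDocumentRecord() {
        if (state != State.BEFORE_DOCUMENT_RECORD) throw new IllegalStateException("unexpected newDocumentRecord()");
        if (writtenDocuments == frequency) throw new IllegalStateException("more records than the declared frequency");
        writtenDocuments++;
        state = State.BEFORE_POINTER;
    }

    void writeDocumentPointer(int pointer) {
        if (state != State.BEFORE_POINTER) throw new IllegalStateException("call newDocumentRecord() first");
        // A real writer would encode the pointer here; the sketch only tracks the state.
        state = State.BEFORE_DOCUMENT_RECORD;
    }
}
```

In this sketch a list declared with frequency 2 accepts exactly two record/pointer pairs; a third newDocumentRecord() is rejected, and a new inverted list can be started only once the declared number of records has been written.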

close

public void close()
           throws IOException
Description copied from interface: IndexWriter
Closes this index writer, completing the index creation process and releasing all resources.

Specified by:
close in interface IndexWriter
Overrides:
close in class BitStreamIndexWriter
Throws:
IOException

writtenBits

public long writtenBits()
Description copied from interface: IndexWriter
Returns the overall number of bits written onto the underlying stream(s).

Specified by:
writtenBits in interface IndexWriter
Overrides:
writtenBits in class BitStreamIndexWriter
Returns:
the number of bits written, according to the variables keeping statistical records.

properties

public Properties properties()
Description copied from interface: IndexWriter
Returns properties of the index generated by this index writer.

This method should only be called after IndexWriter.close(). It returns a new property object containing values for (whenever appropriate) Index.PropertyKeys.DOCUMENTS, Index.PropertyKeys.TERMS, Index.PropertyKeys.POSTINGS, Index.PropertyKeys.MAXCOUNT, Index.PropertyKeys.INDEXCLASS, Index.PropertyKeys.CODING, Index.PropertyKeys.PAYLOADCLASS, BitStreamIndex.PropertyKeys.SKIPQUANTUM, and BitStreamIndex.PropertyKeys.SKIPHEIGHT.

Specified by:
properties in interface IndexWriter
Overrides:
properties in class BitStreamIndexWriter
Returns:
a new set of properties for the newly created index.

printStats

public void printStats(PrintStream stats)
Description copied from interface: IndexWriter
Writes to the given print stream statistical information about the index just built. This method must be called after IndexWriter.close().

Specified by:
printStats in interface IndexWriter
Overrides:
printStats in class AbstractBitStreamIndexWriter
Parameters:
stats - a print stream where statistical information will be written.