it.unimi.dsi.mg4j.index
Class IndexWriter

java.lang.Object
  extended byit.unimi.dsi.mg4j.index.IndexWriter
All Implemented Interfaces:
CompressionFlags
Direct Known Subclasses:
SkipIndexWriter

public class IndexWriter
extends Object
implements CompressionFlags

Provides facilities to write an inverted index.

This class is used to write inverted lists in sequential order, as follows:

The newDocumentRecord(int) writes the (suitable coding of) the pointer, and returns an OutputBitStream that may be used to write the document data. Note that there is no guarantee that the returned OutputBitStream coincides with the underlying bit stream. Moreover, there is no guarantee as to when the bits will be actually written on the underlying stream, except that when starting a new inverted list, the previous inverted list, if any, will be written onto the underlying stream.

The method writeDocumentPositions(OutputBitStream,int[],int,int,int) is a facility that writes, on the given OutputBitStream, the list of positions using the specified compression scheme: the method should be called, if needed, just after a newDocumentRecord(int) call, on the same output bit stream returned by the latter.

Compression flags

As explained above, this class needs to know the compression method that it should use to write the frequency, the document pointer, and the positions; moreover, for positions, the method writes the number of positions first (called the count) followed by the sequence of positions. Hence, there are four different compression methods that one should choose (the method used for frequencies, pointers, counts and positions).

Such methods can be chosen when creating an instance of IndexWriter using the available compression flags specified in CompressionFlags.

Offset bit stream

Every inverted index may have an associated OutputBitStream of offsets: this file contains T+1 integers, where T is the number of inverted lists (i.e., the number of terms), and the i-th entry is a suitable coding of the position in bytes where the i-th inverted list starts (the last entry is actually the length, in bytes, of the inverted index file itself). The coding used for the offset stream is a γ code of the difference between the current position and the last position.

Statistics

This class performs some bookkeeping of the number of bits written the output stream for several purposes: see, for instance, bitsForPositions, bitsForAlignment, etc.

Since:
0.6
Author:
Paolo Boldi, Sebastiano Vigna

Field Summary
protected  int b
          The parameter b for Golomb coding of pointers.
protected static int BEFORE_COUNT
          This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.
protected static int BEFORE_DOCUMENT_RECORD
          This value of state means that we are ready to call newDocumentRecord().
protected static int BEFORE_FREQUENCY
          This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(int).
protected static int BEFORE_INVERTED_LIST
          This value of state means that we should call newInvertedList().
protected static int BEFORE_POINTER
          This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, int).
protected static int BEFORE_POSITIONS
          This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.
 long bitsForAlignment
          The number of bits written for alignment.
 long bitsForCounts
          The number of bits written for counts.
 long bitsForFrequencies
          The number of bits written for frequencies.
 long bitsForPointers
          The number of bits written for document pointers.
 long bitsForPositions
          The number of bits written for positions.
protected  int currentDocument
          The current document pointer.
protected  int currentTerm
          The current term.
protected static int FIRST_UNUSED_STATE
          This is the first unused state.
 long flags
          The coding flags.
protected  int frequency
          The number of document records that the current inverted list will contain.
 boolean hasCounts
          Whether this index contains counts.
 boolean hasPositions
          Whether this index contains positions.
protected  int lastDocument
          The last document pointer in the current list.
protected  int log2b
          The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.
protected  int maxDocPos
          The maximum number of positions in a document record so far.
protected  int numberOfDocuments
          The number of documents of the collection to be indexed.
protected  OutputBitStream obs
          The underlying OutputBitStream.
protected  int pointerCoding
          The coding for pointers.
protected  double relativeFrequency
          The number of document records that the current inverted list will contain, divided by the number of documents.
protected  int state
          The current state of the writer.
protected  int writtenDocuments
          The number of document records already written for the current inverted list.
 
Fields inherited from interface it.unimi.dsi.mg4j.index.CompressionFlags
ARITH, CODING_NAME, COUNTS_DEFAULT, COUNTS_DELTA, COUNTS_GAMMA, COUNTS_SHIFT, DELTA, FREQUENCIES_DEFAULT, FREQUENCIES_DELTA, FREQUENCIES_GAMMA, FREQUENCIES_SHIFT, GAMMA, GOLOMB, INTERP, NIBBLE, NO_COUNTS, NO_POSITIONS, NONE, POINTERS_DEFAULT, POINTERS_DELTA, POINTERS_GAMMA, POINTERS_GOLOMB, POINTERS_SHIFT, POSITIONS_ARITH, POSITIONS_DEFAULT, POSITIONS_DELTA, POSITIONS_GAMMA, POSITIONS_GOLOMB, POSITIONS_INTERP, POSITIONS_SHIFT, POSITIONS_SKEWED_GOLOMB, SKEWED_GOLOMB, UNARY, ZETA
 
Constructor Summary
IndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, long flags)
          Creates a new index writer, with the specified basename.
IndexWriter(OutputBitStream obs, int numberOfDocuments, int flags)
          Creates a new index writer, with the specified underlying OutputBitStream, without an associated offset bit stream.
IndexWriter(OutputBitStream obs, OutputBitStream offset, int numberOfDocuments, long flags)
          Creates a new index writer, with the specified underlying OutputBitStream.
 
Method Summary
 void close()
          Closes the underlying bit stream.
static String flags2String(long flags)
          This method unpacks the flags, expressed as a long, and returns a string with the compression flags.
 OutputBitStream newDocumentRecord()
          Starts a new document record.
 OutputBitStream newDocumentRecord(int pointer)
          Deprecated. Use newDocumentRecord() followed by writeDocumentPointer(OutputBitStream, int).
 long newInvertedList()
          Starts a new inverted list.
 long newInvertedList(int frequency)
          Deprecated. Use newInvertedList() followed by writeFrequency(int).
 Properties properties()
          This method should only be called after close().
 int writeDocumentPointer(OutputBitStream out, int pointer)
          Writes a document pointer.
 int writeDocumentPositions(OutputBitStream out, int[] occ, int offset, int len, int docSize)
          Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.
 int writeFrequency(int frequency)
          Writes the frequency.
 int writePositionCount(OutputBitStream out, int count)
          Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.
 long writtenBits()
          This method returns the overall number of bits written onto the underlying stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BEFORE_INVERTED_LIST

protected static final int BEFORE_INVERTED_LIST
This value of state means that we should call newInvertedList().

See Also:
Constant Field Values

BEFORE_FREQUENCY

protected static final int BEFORE_FREQUENCY
This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(int).

See Also:
Constant Field Values

BEFORE_DOCUMENT_RECORD

protected static final int BEFORE_DOCUMENT_RECORD
This value of state means that we are ready to call newDocumentRecord().

See Also:
Constant Field Values

BEFORE_POINTER

protected static final int BEFORE_POINTER
This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, int).

See Also:
Constant Field Values

BEFORE_COUNT

protected static final int BEFORE_COUNT
This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.

See Also:
Constant Field Values

BEFORE_POSITIONS

protected static final int BEFORE_POSITIONS
This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.

See Also:
Constant Field Values

FIRST_UNUSED_STATE

protected static final int FIRST_UNUSED_STATE
This is the first unused state. Subclasses may start from this value to define new states.

See Also:
Constant Field Values

obs

protected OutputBitStream obs
The underlying OutputBitStream.


state

protected int state
The current state of the writer.


numberOfDocuments

protected int numberOfDocuments
The number of documents of the collection to be indexed.


frequency

protected int frequency
The number of document records that the current inverted list will contain.


relativeFrequency

protected double relativeFrequency
The number of document records that the current inverted list will contain, divided by the number of documents.


writtenDocuments

protected int writtenDocuments
The number of document records already written for the current inverted list.


currentDocument

protected int currentDocument
The current document pointer.


currentTerm

protected int currentTerm
The current term.


lastDocument

protected int lastDocument
The last document pointer in the current list.


b

protected int b
The parameter b for Golomb coding of pointers.


log2b

protected int log2b
The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.


maxDocPos

protected int maxDocPos
The maximum number of positions in a document record so far.


bitsForAlignment

public long bitsForAlignment
The number of bits written for alignment.


bitsForFrequencies

public long bitsForFrequencies
The number of bits written for frequencies.


bitsForPointers

public long bitsForPointers
The number of bits written for document pointers.


bitsForCounts

public long bitsForCounts
The number of bits written for counts.


bitsForPositions

public long bitsForPositions
The number of bits written for positions.


flags

public long flags
The coding flags.


pointerCoding

protected int pointerCoding
The coding for pointers.


hasCounts

public final boolean hasCounts
Whether this index contains counts.


hasPositions

public final boolean hasPositions
Whether this index contains positions.

Constructor Detail

IndexWriter

public IndexWriter(CharSequence basename,
                   int numberOfDocuments,
                   boolean writeOffsets,
                   long flags)
            throws IOException
Creates a new index writer, with the specified basename. The index will be written on a file (stemmed with .index). If writeOffsets, also an offset file will be produced (stemmed with .offsets). When close() will be called, the property file will also be produced (stemmed with .properties), or enriched if it already exists.

Parameters:
basename - the basename.
numberOfDocuments - the number of documents in the collection to be indexed.
writeOffsets - if true, the offset file will also be produced.
flags - a bit mask setting the coding techniques to be used (see the IndexWriter introduction).

IndexWriter

public IndexWriter(OutputBitStream obs,
                   OutputBitStream offset,
                   int numberOfDocuments,
                   long flags)
Creates a new index writer, with the specified underlying OutputBitStream.

Parameters:
obs - the underlying output bit stream.
offset - the offset bit stream.
numberOfDocuments - the number of documents in the collection to be indexed.
flags - a bit mask setting the coding techniques to be used (see the IndexWriter introduction).

IndexWriter

public IndexWriter(OutputBitStream obs,
                   int numberOfDocuments,
                   int flags)
Creates a new index writer, with the specified underlying OutputBitStream, without an associated offset bit stream.

Parameters:
obs - the underlying output bit stream.
numberOfDocuments - the number of documents in the collection to be indexed.
flags - a bit mask setting the coding techniques to be used (see the IndexWriter introduction).
Method Detail

newInvertedList

public long newInvertedList()
                     throws IOException,
                            IllegalStateException
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream; then, the stream is byte-aligned.

Returns:
the position (in bytes) of the underlying bit stream where the new inverted list starts.
Throws:
IOException - if an exception is thrown by the underlying stream.
IllegalStateException - if too few records were written for the previous inverted list.

writeFrequency

public int writeFrequency(int frequency)
                   throws IOException
Writes the frequency.

Parameters:
frequency - the (positive) number of document records that this inverted list will contain.
Returns:
the number of bits written.
Throws:
IOException - if an exception is thrown by the underlying stream.

newInvertedList

public long newInvertedList(int frequency)
                     throws IOException,
                            IllegalStateException
Deprecated. Use newInvertedList() followed by writeFrequency(int).

Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream; then, the stream is byte-aligned, and frequency is written onto the underlying stream.

Parameters:
frequency - the number of document records that this inverted list will contain.
Returns:
the position (in bytes) of the underlying bit stream where the new inverted list starts.
Throws:
IOException - if an exception is thrown by the underlying stream.
IllegalStateException - if too few records were written for the previous inverted list.

newDocumentRecord

public OutputBitStream newDocumentRecord()
                                  throws IOException,
                                         IllegalStateException
Starts a new document record.

This method must be called exactly n times after a call to newInvertedList(n).

Returns:
the output bit stream where the next document record data should be written.
Throws:
IOException - if an exception is thrown by the underlying stream.
IllegalStateException - if too many records were written for the current inverted list, or if there is no current inverted list.

writeDocumentPointer

public int writeDocumentPointer(OutputBitStream out,
                                int pointer)
                         throws IOException,
                                IllegalStateException
Writes a document pointer.

This method must be called immediately after newDocumentRecord().

Parameters:
out - the output bit stream where the pointer will be written.
pointer - the document pointer.
Returns:
the number of bits written.
Throws:
IOException - if an exception is thrown by the underlying stream.
IllegalStateException

newDocumentRecord

public OutputBitStream newDocumentRecord(int pointer)
                                  throws IOException,
                                         IllegalStateException
Deprecated. Use newDocumentRecord() followed by writeDocumentPointer(OutputBitStream, int).

Starts a new document record. This method should be called exactly n times after a call to newInvertedList(n).

Parameters:
pointer - the document pointer (will be written to the bit stream).
Returns:
the OutputBitStream where the next document record should be written.
Throws:
IOException - if an exception is thrown by the underlying stream.
IllegalStateException - if too many records were written for the current inverted list, or if there is no current inverted list.

close

public void close()
           throws IOException,
                  IllegalStateException
Closes the underlying bit stream.

Throws:
IOException - if an exception is thrown by the underlying stream
IllegalStateException - if too few records were written for the previous inverted list.

writePositionCount

public int writePositionCount(OutputBitStream out,
                              int count)
                       throws IOException,
                              IllegalStateException
Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.

Parameters:
out - the output stream where the occurrences should be written.
count - the count.
Returns:
the number of bits written.
Throws:
IOException
IllegalStateException

writeDocumentPositions

public int writeDocumentPositions(OutputBitStream out,
                                  int[] occ,
                                  int offset,
                                  int len,
                                  int docSize)
                           throws IOException,
                                  IllegalStateException
Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream. First the count (number of positions) is written using the specified coding. Then, the specified vector of occurrences is written using the specified coding.

Parameters:
out - the output stream where the occurrences should be written.
occ - the position vector (a sequence of strictly increasing natural numbers).
offset - the first valid entry in occ.
len - the number of valid entries in occ.
docSize - the size of the current document (only for Golomb and interpolative coding; you can safely pass 0 otherwise).
Returns:
the number of bits written.
Throws:
IOException - if an exception is thrown by the underlying stream
IllegalStateException - if there is no current inverted list.

writtenBits

public long writtenBits()
This method returns the overall number of bits written onto the underlying stream.

Returns:
the number of bits written, according to the variables keeping statistical records.

properties

public Properties properties()
This method should only be called after close(). It returns the values for all the properties specified in IndexProperties.

Returns:
the properties specified in IndexProperties along with their values.

flags2String

public static String flags2String(long flags)
This method unpacks the flags, expressed as a long, and returns a string with the compression flags. The string is a pipe-separated sequence of compression flags, as described in CompressionFlags.

Parameters:
flags - the flags.
Returns:
a string corresponding to the given flags.