it.unimi.dsi.mg4j.index
Class BitStreamIndexWriter

java.lang.Object
  extended by it.unimi.dsi.mg4j.index.AbstractBitStreamIndexWriter
      extended by it.unimi.dsi.mg4j.index.BitStreamIndexWriter
All Implemented Interfaces:
IndexWriter
Direct Known Subclasses:
SkipBitStreamIndexWriter

public class BitStreamIndexWriter
extends AbstractBitStreamIndexWriter

Writes a bitstream-based interleaved index.

Offsets bit stream

An inverted index may have an associated OutputBitStream of offsets: this file contains T+1 integers, where T is the number of inverted lists (i.e., the number of terms), and the i-th entry is the position in bits where the i-th inverted list starts (the last entry is the length, in bits, of the inverted index file itself).

The file actually contains γ-coded gaps: thus, in practice, it is formed by the number zero (the offset of the first list) followed by the length of each inverted list.
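
The relationship between list lengths, gaps and offsets can be illustrated with a minimal sketch. The class OffsetGapsSketch and its methods are hypothetical helpers, not part of MG4J; the sketch assumes every value is measured in bits and leaves out the γ coding itself.

```java
import java.util.Arrays;

/** Hypothetical helper (not an MG4J class): relates list lengths, gaps and offsets. */
final class OffsetGapsSketch {
    /** Returns the T+1 gaps: a leading zero, then the length in bits of each list. */
    static long[] gaps(long[] listLengthsInBits) {
        long[] gaps = new long[listLengthsInBits.length + 1];
        // gaps[0] stays 0: it is the offset of the first list.
        System.arraycopy(listLengthsInBits, 0, gaps, 1, listLengthsInBits.length);
        return gaps;
    }

    /** Rebuilds the T+1 absolute offsets by prefix-summing the gaps. */
    static long[] offsets(long[] gaps) {
        long[] offsets = new long[gaps.length];
        long position = 0;
        for (int i = 0; i < gaps.length; i++) {
            position += gaps[i];
            offsets[i] = position;
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] gaps = gaps(new long[] { 10, 25, 7 });
        System.out.println(Arrays.toString(gaps));
        System.out.println(Arrays.toString(offsets(gaps)));
    }
}
```

For three lists of 10, 25 and 7 bits, the gaps are 0, 10, 25, 7 and the offsets are 0, 10, 35, 42; the last offset equals the total length of the three lists.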

Since:
0.6
Author:
Paolo Boldi, Sebastiano Vigna

Field Summary
protected  int b
          The parameter b for Golomb coding of pointers.
protected static int BEFORE_COUNT
          This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.
protected static int BEFORE_DOCUMENT_RECORD
          This value of state means that we are ready to call newDocumentRecord().
protected static int BEFORE_FREQUENCY
          This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(int).
protected static int BEFORE_INVERTED_LIST
          This value of state means that we should call newInvertedList().
protected static int BEFORE_PAYLOAD
          This value of state can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.
protected static int BEFORE_POINTER
          This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, int).
protected static int BEFORE_POSITIONS
          This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.
protected  int currentDocument
          The current document pointer.
protected static int FIRST_UNUSED_STATE
          This is the first unused state.
protected  int frequency
          The number of document records that the current inverted list will contain.
protected  int lastDocument
          The last document pointer in the current list.
protected  long lastInvertedListPos
          The position (in bytes) where the last inverted list started.
protected  int log2b
          The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.
 int maxCount
          The maximum number of positions in a document record so far.
protected  OutputBitStream obs
          The underlying OutputBitStream.
protected  int state
          The current state of the writer.
protected  int writtenDocuments
          The number of document records already written for the current inverted list.
 
Fields inherited from class it.unimi.dsi.mg4j.index.AbstractBitStreamIndexWriter
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding
 
Constructor Summary
BitStreamIndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
          Creates a new index writer, with the specified basename.
BitStreamIndexWriter(OutputBitStream obs, int numberOfDocuments, Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
          Creates a new index writer, with the specified underlying OutputBitStream, without additional bit streams.
BitStreamIndexWriter(OutputBitStream obs, OutputBitStream offset, OutputBitStream posNumBits, int numberOfDocuments, Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
          Creates a new index writer with payloads using the specified underlying OutputBitStream.
 
Method Summary
 void close()
          Closes this index writer, completing the index creation process and releasing all resources.
 OutputBitStream newDocumentRecord()
          Starts a new document record.
 long newInvertedList()
          Starts a new inverted list.
 Properties properties()
          Returns properties of the index generated by this index writer.
 int writeDocumentPointer(OutputBitStream out, int pointer)
          Writes a document pointer.
 int writeDocumentPositions(OutputBitStream out, int[] occ, int offset, int len, int docSize)
          Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.
 int writeFrequency(int frequency)
          Writes the frequency.
 int writePayload(OutputBitStream out, Payload payload)
          Writes the payload for the current document.
 int writePositionCount(OutputBitStream out, int count)
          Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.
 long writtenBits()
          Returns the overall number of bits written onto the underlying stream(s).
 
Methods inherited from class it.unimi.dsi.mg4j.index.AbstractBitStreamIndexWriter
printStats
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BEFORE_INVERTED_LIST

protected static final int BEFORE_INVERTED_LIST
This value of state means that we should call newInvertedList().

See Also:
Constant Field Values

BEFORE_FREQUENCY

protected static final int BEFORE_FREQUENCY
This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(int).

See Also:
Constant Field Values

BEFORE_DOCUMENT_RECORD

protected static final int BEFORE_DOCUMENT_RECORD
This value of state means that we are ready to call newDocumentRecord().

See Also:
Constant Field Values

BEFORE_POINTER

protected static final int BEFORE_POINTER
This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, int).

See Also:
Constant Field Values

BEFORE_PAYLOAD

protected static final int BEFORE_PAYLOAD
This value of state can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.

See Also:
Constant Field Values

BEFORE_COUNT

protected static final int BEFORE_COUNT
This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.

See Also:
Constant Field Values

BEFORE_POSITIONS

protected static final int BEFORE_POSITIONS
This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.

See Also:
Constant Field Values

FIRST_UNUSED_STATE

protected static final int FIRST_UNUSED_STATE
This is the first unused state. Subclasses may start from this value to define new states.

See Also:
Constant Field Values
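
The call order implied by the state constants above can be modelled as a small state machine. This is only a sketch: WriterProtocolSketch is a hypothetical class, the transitions are inferred from the state descriptions rather than taken from the MG4J source, and no bits are actually written.

```java
/** Hypothetical model of the call protocol; not the MG4J implementation. */
final class WriterProtocolSketch {
    enum State { BEFORE_INVERTED_LIST, BEFORE_FREQUENCY, BEFORE_DOCUMENT_RECORD,
                 BEFORE_POINTER, BEFORE_PAYLOAD, BEFORE_COUNT, BEFORE_POSITIONS }

    private final boolean hasPayloads, hasCounts, hasPositions;
    private State state = State.BEFORE_INVERTED_LIST;
    private int frequency, writtenDocuments;

    WriterProtocolSketch(boolean hasPayloads, boolean hasCounts, boolean hasPositions) {
        this.hasPayloads = hasPayloads;
        this.hasCounts = hasCounts;
        this.hasPositions = hasPositions;
    }

    State state() { return state; }

    private void expect(State expected) {
        if (state != expected)
            throw new IllegalStateException("Expected " + expected + " but was " + state);
    }

    void newInvertedList() { expect(State.BEFORE_INVERTED_LIST); state = State.BEFORE_FREQUENCY; }

    void writeFrequency(int f) {
        expect(State.BEFORE_FREQUENCY);
        if (f <= 0) throw new IllegalArgumentException("The frequency must be positive");
        frequency = f;
        writtenDocuments = 0;
        state = State.BEFORE_DOCUMENT_RECORD;
    }

    void newDocumentRecord() {
        expect(State.BEFORE_DOCUMENT_RECORD);
        writtenDocuments++;
        state = State.BEFORE_POINTER;
    }

    void writeDocumentPointer() {
        expect(State.BEFORE_POINTER);
        state = hasPayloads ? State.BEFORE_PAYLOAD : afterPayload();
    }

    void writePayload() { expect(State.BEFORE_PAYLOAD); state = afterPayload(); }

    void writePositionCount() {
        expect(State.BEFORE_COUNT);
        state = hasPositions ? State.BEFORE_POSITIONS : endOfRecord();
    }

    void writeDocumentPositions() { expect(State.BEFORE_POSITIONS); state = endOfRecord(); }

    // Assumption: optional components are skipped when the index does not contain them.
    private State afterPayload() { return hasCounts ? State.BEFORE_COUNT : endOfRecord(); }

    // Assumption: after the last of the f records, the writer expects a new inverted list.
    private State endOfRecord() {
        return writtenDocuments == frequency ? State.BEFORE_INVERTED_LIST : State.BEFORE_DOCUMENT_RECORD;
    }
}
```

In this model, calling a method in the wrong state throws IllegalStateException; a subclass that needs further steps would add its own states, much as FIRST_UNUSED_STATE allows for the integer constants.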

obs

protected OutputBitStream obs
The underlying OutputBitStream.


state

protected int state
The current state of the writer.


frequency

protected int frequency
The number of document records that the current inverted list will contain.


writtenDocuments

protected int writtenDocuments
The number of document records already written for the current inverted list.


currentDocument

protected int currentDocument
The current document pointer.


lastDocument

protected int lastDocument
The last document pointer in the current list.


lastInvertedListPos

protected long lastInvertedListPos
The position (in bytes) where the last inverted list started.


b

protected int b
The parameter b for Golomb coding of pointers.


log2b

protected int log2b
The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.
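
As an illustration of the relationship between b and log2b, the index of the most significant bit can be computed with the standard library alone. GolombParameterSketch is a hypothetical helper, not an MG4J class, and the sketch does not show how the writer chooses b.

```java
/** Hypothetical helper (not an MG4J class). */
final class GolombParameterSketch {
    /** Returns the index of the most significant set bit of b, that is, floor(log2(b)). */
    static int mostSignificantBit(int b) {
        if (b <= 0) throw new IllegalArgumentException("b must be positive");
        return 31 - Integer.numberOfLeadingZeros(b);
    }
}
```

For example, both b = 8 and b = 9 give 3, and b = 1000 gives 9.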


maxCount

public int maxCount
The maximum number of positions in a document record so far.

Constructor Detail

BitStreamIndexWriter

public BitStreamIndexWriter(CharSequence basename,
                            int numberOfDocuments,
                            boolean writeOffsets,
                            Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
                     throws IOException
Creates a new index writer with the specified basename. The index will be written to a file (stemmed with .index). If writeOffsets is true, an offset file will also be produced (stemmed with .offsets). When close() is called, the property file will also be produced (stemmed with .properties), or enriched if it already exists.

Parameters:
basename - the basename.
numberOfDocuments - the number of documents in the collection to be indexed.
writeOffsets - if true, the offset file will also be produced.
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
Throws:
IOException
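
The file names implied by the description above can be sketched as follows. IndexFileNamesSketch is a hypothetical helper that only builds names from the basename and the extensions mentioned in the text; it does not create any file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an MG4J class): lists the files named after a basename. */
final class IndexFileNamesSketch {
    static List<String> fileNames(CharSequence basename, boolean writeOffsets) {
        List<String> names = new ArrayList<>();
        names.add(basename + ".index");
        // The offset file exists only if offsets were requested.
        if (writeOffsets) names.add(basename + ".offsets");
        // The property file is produced (or enriched) when close() is called.
        names.add(basename + ".properties");
        return names;
    }
}
```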

BitStreamIndexWriter

public BitStreamIndexWriter(OutputBitStream obs,
                            OutputBitStream offset,
                            OutputBitStream posNumBits,
                            int numberOfDocuments,
                            Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
Creates a new index writer with payloads using the specified underlying OutputBitStream.

Parameters:
obs - the underlying output bit stream.
offset - the offset bit stream, or null if offsets should not be written.
posNumBits - the bit stream for positions bit lengths, or null if such lengths should not be written.
numberOfDocuments - the number of documents in the collection to be indexed.
flags - a flag map setting the coding techniques to be used (see CompressionFlags).

BitStreamIndexWriter

public BitStreamIndexWriter(OutputBitStream obs,
                            int numberOfDocuments,
                            Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
Creates a new index writer, with the specified underlying OutputBitStream, without additional bit streams.

Parameters:
obs - the underlying output bit stream.
numberOfDocuments - the number of documents in the collection to be indexed.
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
Method Detail

newInvertedList

public long newInvertedList()
                     throws IOException
Description copied from interface: IndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.

Returns:
the position (in bits) of the underlying bit stream where the new inverted list starts.
Throws:
IOException

writeFrequency

public int writeFrequency(int frequency)
                   throws IOException
Description copied from interface: IndexWriter
Writes the frequency.

Parameters:
frequency - the (positive) number of document records that this inverted list will contain.
Returns:
the number of bits written.
Throws:
IOException

newDocumentRecord

public OutputBitStream newDocumentRecord()
                                  throws IOException
Description copied from interface: IndexWriter
Starts a new document record.

This method must be called exactly f times, where f is the frequency specified with IndexWriter.writeFrequency(int).

Returns:
the output bit stream where the next document record data should be written.
Throws:
IOException

writeDocumentPointer

public int writeDocumentPointer(OutputBitStream out,
                                int pointer)
                         throws IOException
Description copied from interface: IndexWriter
Writes a document pointer.

This method must be called immediately after IndexWriter.newDocumentRecord().

Parameters:
out - the output bit stream where the pointer will be written.
pointer - the document pointer.
Returns:
the number of bits written.
Throws:
IOException

writePayload

public int writePayload(OutputBitStream out,
                        Payload payload)
                 throws IOException
Description copied from interface: IndexWriter
Writes the payload for the current document.

This method must be called immediately after IndexWriter.writeDocumentPointer(OutputBitStream, int).

Parameters:
out - the output bit stream where the payload will be written.
payload - the payload.
Returns:
the number of bits written.
Throws:
IOException

close

public void close()
           throws IOException
Description copied from interface: IndexWriter
Closes this index writer, completing the index creation process and releasing all resources.

Throws:
IOException

writePositionCount

public int writePositionCount(OutputBitStream out,
                              int count)
                       throws IOException
Description copied from interface: IndexWriter
Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.

Parameters:
out - the output stream where the count should be written.
count - the count.
Returns:
the number of bits written.
Throws:
IOException

writeDocumentPositions

public int writeDocumentPositions(OutputBitStream out,
                                  int[] occ,
                                  int offset,
                                  int len,
                                  int docSize)
                           throws IOException
Description copied from interface: IndexWriter
Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.

Parameters:
out - the output stream where the occurrences should be written.
occ - the position vector (a sequence of strictly increasing natural numbers).
offset - the first valid entry in occ.
len - the number of valid entries in occ.
docSize - the size of the current document (only for Golomb and interpolative coding; you can safely pass -1 otherwise).
Returns:
the number of bits written.
Throws:
IOException
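
The way occ, offset and len select a strictly increasing slice can be sketched as follows. PositionsSketch is a hypothetical helper, not an MG4J class; turning positions into plain differences is an assumption made for illustration, and the encoding actually used by the writer depends on the selected coding and may differ.

```java
/** Hypothetical helper (not an MG4J class): validates a position slice and computes gaps. */
final class PositionsSketch {
    /**
     * Checks that occ[offset], ..., occ[offset + len - 1] are strictly increasing
     * natural numbers and returns the first position followed by the differences
     * between consecutive positions.
     */
    static int[] toGaps(int[] occ, int offset, int len) {
        if (offset < 0 || len < 0 || offset + len > occ.length)
            throw new IllegalArgumentException("Invalid slice");
        int[] gaps = new int[len];
        int previous = 0;
        for (int i = 0; i < len; i++) {
            int position = occ[offset + i];
            if (position < 0 || (i > 0 && position <= previous))
                throw new IllegalArgumentException("Positions must be strictly increasing natural numbers");
            // The first entry is the position itself; later entries are differences.
            gaps[i] = i == 0 ? position : position - previous;
            previous = position;
        }
        return gaps;
    }
}
```

With occ = {99, 3, 7, 8, 20, 99}, offset = 1 and len = 4, only the positions 3, 7, 8, 20 are considered, and the sketch returns 3, 4, 1, 12.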

writtenBits

public long writtenBits()
Description copied from interface: IndexWriter
Returns the overall number of bits written onto the underlying stream(s).

Returns:
the number of bits written, according to the variables keeping statistical records.

properties

public Properties properties()
Description copied from interface: IndexWriter
Returns properties of the index generated by this index writer.

This method should only be called after IndexWriter.close(). It returns a new property object containing values for (whenever appropriate) Index.PropertyKeys.DOCUMENTS, Index.PropertyKeys.TERMS, Index.PropertyKeys.POSTINGS, Index.PropertyKeys.MAXCOUNT, Index.PropertyKeys.INDEXCLASS, Index.PropertyKeys.CODING, Index.PropertyKeys.PAYLOADCLASS, BitStreamIndex.PropertyKeys.SKIPQUANTUM, and BitStreamIndex.PropertyKeys.SKIPHEIGHT.

Returns:
a new set of properties for the newly created index.