java.lang.Object
  it.unimi.dsi.mg4j.index.AbstractBitStreamIndexWriter
    it.unimi.dsi.mg4j.index.BitStreamIndexWriter
public class BitStreamIndexWriter
Writes a bitstream-based interleaved index.
An inverted index may have an associated OutputBitStream of offsets: this file contains T+1 integers, where T is the number of inverted lists (i.e., the number of terms), and the i-th entry is a suitable coding of the position in bits where the i-th inverted list starts (the last entry is actually the length, in bytes, of the inverted index file itself). The coding used for the offset stream is a γ code of the difference between the current position and the last position.
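To make the offset coding concrete, here is a minimal sketch that turns a sequence of absolute positions into γ codes of successive differences. It is not the library's implementation: the class and method names are hypothetical, the codes are built as strings of '0' and '1' rather than written to an OutputBitStream, the initial "last position" is assumed to be 0, and each difference is shifted by one so that a zero difference is representable (the exact convention used by the library may differ).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of MG4J.
final class OffsetGammaSketch {
    /** Elias gamma code of n >= 1 as a bit string: (length - 1) zeros, then n in binary. */
    static String gamma(long n) {
        if (n < 1) throw new IllegalArgumentException("gamma is defined for n >= 1");
        String binary = Long.toBinaryString(n);
        StringBuilder code = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) code.append('0');
        return code.append(binary).toString();
    }

    /** Encodes absolute positions as gamma codes of the difference from the previous position. */
    static List<String> encodeOffsets(long[] positions) {
        List<String> codes = new ArrayList<>();
        long last = 0; // assumption: the first difference is taken from position 0
        for (long position : positions) {
            // Assumption: shift by one so that a zero difference can be coded.
            codes.add(gamma(position - last + 1));
            last = position;
        }
        return codes;
    }

    public static void main(String[] args) {
        // Three inverted lists (T = 3) give T + 1 = 4 entries.
        System.out.println(encodeOffsets(new long[] { 0, 13, 40, 41 }));
    }
}
```

Because only differences are stored, small gaps between consecutive lists cost few bits, which is why a γ code is a natural fit here.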
Field Summary | |
---|---|
protected int | b: The parameter b for Golomb coding of pointers. |
protected static int | BEFORE_COUNT: This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record. |
protected static int | BEFORE_DOCUMENT_RECORD: This value of state means that we are ready to call newDocumentRecord(). |
protected static int | BEFORE_FREQUENCY: This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(int). |
protected static int | BEFORE_INVERTED_LIST: This value of state means that we should call newInvertedList(). |
protected static int | BEFORE_PAYLOAD: This value of state can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record. |
protected static int | BEFORE_POINTER: This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, int). |
protected static int | BEFORE_POSITIONS: This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record. |
protected int | currentDocument: The current document pointer. |
protected static int | FIRST_UNUSED_STATE: This is the first unused state. |
protected int | frequency: The number of document records that the current inverted list will contain. |
protected int | lastDocument: The last document pointer in the current list. |
protected int | log2b: The parameter log2b for Golomb coding of pointers; it is the most significant bit of b. |
int | maxCount: The maximum number of positions in a document record so far. |
protected OutputBitStream | obs: The underlying OutputBitStream. |
protected int | state: The current state of the writer. |
protected int | writtenDocuments: The number of document records already written for the current inverted list. |
Fields inherited from class it.unimi.dsi.mg4j.index.AbstractBitStreamIndexWriter |
---|
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding |
Constructor Summary | |
---|---|
BitStreamIndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, Map<CompressionFlags.Component,CompressionFlags.Coding> flags) | Creates a new index writer, with the specified basename. |
BitStreamIndexWriter(OutputBitStream obs, int numberOfDocuments, Map<CompressionFlags.Component,CompressionFlags.Coding> flags) | Creates a new index writer, with the specified underlying OutputBitStream, without an associated offset bit stream. |
BitStreamIndexWriter(OutputBitStream obs, OutputBitStream offset, int numberOfDocuments, Map<CompressionFlags.Component,CompressionFlags.Coding> flags) | Creates a new index writer with payloads using the specified underlying OutputBitStream. |
Method Summary | |
---|---|
void | close(): Closes this index writer, completing the index creation process and releasing all resources. |
OutputBitStream | newDocumentRecord(): Starts a new document record. |
long | newInvertedList(): Starts a new inverted list. |
Properties | properties(): Returns properties of the index generated by this index writer. |
int | writeDocumentPointer(OutputBitStream out, int pointer): Writes a document pointer. |
int | writeDocumentPositions(OutputBitStream out, int[] occ, int offset, int len, int docSize): Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream. |
int | writeFrequency(int frequency): Writes the frequency. |
int | writePayload(OutputBitStream out, Payload payload): Writes the payload for the current document. |
int | writePositionCount(OutputBitStream out, int count): Writes the count of the occurrences of the current term in the current document to the given OutputBitStream. |
long | writtenBits(): Returns the overall number of bits written onto the underlying stream(s). |
Methods inherited from class it.unimi.dsi.mg4j.index.AbstractBitStreamIndexWriter |
---|
printStats |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final int BEFORE_INVERTED_LIST
This value of state means that we should call newInvertedList().
protected static final int BEFORE_FREQUENCY
This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(int).
protected static final int BEFORE_DOCUMENT_RECORD
This value of state means that we are ready to call newDocumentRecord().
protected static final int BEFORE_POINTER
This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, int).
protected static final int BEFORE_PAYLOAD
This value of state can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.
protected static final int BEFORE_COUNT
This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.
protected static final int BEFORE_POSITIONS
This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.
protected static final int FIRST_UNUSED_STATE
This is the first unused state.
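protected OutputBitStream obs
The underlying OutputBitStream.
protected int state
The current state of the writer.
protected int frequency
The number of document records that the current inverted list will contain.
protected int writtenDocuments
The number of document records already written for the current inverted list.
protected int currentDocument
The current document pointer.
protected int lastDocument
The last document pointer in the current list.
protected int b
The parameter b for Golomb coding of pointers.
protected int log2b
The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.
public int maxCount
The maximum number of positions in a document record so far.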
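The relationship between b and log2b stated above ("the most significant bit of b") can be illustrated with a short sketch. This is only one way to compute the index of the highest set bit of a positive int; the class and method names are hypothetical, and nothing here shows how the writer itself derives b.

```java
// Hypothetical illustration class; not part of MG4J.
final class GolombParameterSketch {
    /** Returns the index of the highest set bit of b, i.e. floor(log2(b)), for b > 0. */
    static int mostSignificantBit(int b) {
        if (b <= 0) throw new IllegalArgumentException("b must be positive");
        // An int has 32 bits; subtracting the leading zeros from 31 gives the top bit index.
        return 31 - Integer.numberOfLeadingZeros(b);
    }

    public static void main(String[] args) {
        for (int b : new int[] { 1, 8, 9, 1000 })
            System.out.println("b=" + b + " log2b=" + mostSignificantBit(b));
    }
}
```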
Constructor Detail |
---|
public BitStreamIndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, Map<CompressionFlags.Component,CompressionFlags.Coding> flags) throws IOException
Creates a new index writer, with the specified basename. If writeOffsets is true, an offset file will also be produced (stemmed with .offsets). When close() is called, the property file will also be produced (stemmed with .properties), or enriched if it already exists.
Parameters:
basename - the basename.
numberOfDocuments - the number of documents in the collection to be indexed.
writeOffsets - if true, the offset file will also be produced.
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
Throws:
IOException
public BitStreamIndexWriter(OutputBitStream obs, OutputBitStream offset, int numberOfDocuments, Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
Creates a new index writer with payloads using the specified underlying OutputBitStream.
Parameters:
obs - the underlying output bit stream.
offset - the offset bit stream, or null if offsets should not be written.
numberOfDocuments - the number of documents in the collection to be indexed.
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
public BitStreamIndexWriter(OutputBitStream obs, int numberOfDocuments, Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
Creates a new index writer, with the specified underlying OutputBitStream, without an associated offset bit stream.
Parameters:
obs - the underlying output bit stream.
numberOfDocuments - the number of documents in the collection to be indexed.
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
Method Detail |
---|
public long newInvertedList() throws IOException
Description copied from interface: IndexWriter
Starts a new inverted list.
Throws:
IOException
public int writeFrequency(int frequency) throws IOException
Description copied from interface: IndexWriter
Writes the frequency.
Parameters:
frequency - the (positive) number of document records that this inverted list will contain.
Throws:
IOException
public OutputBitStream newDocumentRecord() throws IOException
Description copied from interface: IndexWriter
Starts a new document record. This method must be called exactly f times, where f is the frequency specified with IndexWriter.writeFrequency(int).
Throws:
IOException
public int writeDocumentPointer(OutputBitStream out, int pointer) throws IOException
Description copied from interface: IndexWriter
Writes a document pointer. This method must be called immediately after IndexWriter.newDocumentRecord().
Parameters:
out - the output bit stream where the pointer will be written.
pointer - the document pointer.
Throws:
IOException
public int writePayload(OutputBitStream out, Payload payload) throws IOException
Description copied from interface: IndexWriter
Writes the payload for the current document. This method must be called immediately after IndexWriter.writeDocumentPointer(OutputBitStream, int).
Parameters:
out - the output bit stream where the payload will be written.
payload - the payload.
Throws:
IOException
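public void close() throws IOException
Description copied from interface: IndexWriter
Closes this index writer, completing the index creation process and releasing all resources.
Throws:
IOException
public int writePositionCount(OutputBitStream out, int count) throws IOException
Description copied from interface: IndexWriter
Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.
Parameters:
out - the output stream where the occurrences should be written.
count - the count.
Throws:
IOException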
public int writeDocumentPositions(OutputBitStream out, int[] occ, int offset, int len, int docSize) throws IOException
Description copied from interface: IndexWriter
Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.
Parameters:
out - the output stream where the occurrences should be written.
occ - the position vector (a sequence of strictly increasing natural numbers).
offset - the first valid entry in occ.
len - the number of valid entries in occ.
docSize - the size of the current document (only for Golomb and interpolative coding; you can safely pass -1 otherwise).
Throws:
IOException
public long writtenBits()
Description copied from interface: IndexWriter
Returns the overall number of bits written onto the underlying stream(s).
public Properties properties()
Description copied from interface: IndexWriter
Returns properties of the index generated by this index writer. This method should only be called after IndexWriter.close(). It returns a new property object containing values for (whenever appropriate) Index.PropertyKeys.DOCUMENTS, Index.PropertyKeys.TERMS, Index.PropertyKeys.POSTINGS, Index.PropertyKeys.MAXCOUNT, Index.PropertyKeys.INDEXCLASS, Index.PropertyKeys.CODING, Index.PropertyKeys.PAYLOADCLASS, BitStreamIndex.PropertyKeys.SKIPQUANTUM, and BitStreamIndex.PropertyKeys.SKIPHEIGHT.
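The call-order constraints in the method descriptions above, together with the state constants in the field list, suggest a small state machine. The following is a simplified sketch of that idea, not the writer's actual code: the class is hypothetical, no bits are written, and the exact transitions (in particular, that payload, count and positions follow the pointer in that order, and that a new inverted list may start once all f records have been written) are assumptions inferred from the descriptions.

```java
// Hypothetical illustration class; it mimics the documented states only.
final class WriterStateSketch {
    enum State {
        BEFORE_INVERTED_LIST, BEFORE_FREQUENCY, BEFORE_DOCUMENT_RECORD,
        BEFORE_POINTER, BEFORE_PAYLOAD, BEFORE_COUNT, BEFORE_POSITIONS
    }

    final boolean hasPayloads, hasCounts, hasPositions;
    State state = State.BEFORE_INVERTED_LIST;
    int frequency, writtenDocuments;

    WriterStateSketch(boolean hasPayloads, boolean hasCounts, boolean hasPositions) {
        this.hasPayloads = hasPayloads;
        this.hasCounts = hasCounts;
        this.hasPositions = hasPositions;
    }

    private void require(State expected) {
        if (state != expected)
            throw new IllegalStateException("Expected " + expected + " but was " + state);
    }

    void newInvertedList() {
        if (state == State.BEFORE_DOCUMENT_RECORD) {
            // Assumption: a list can be closed only after exactly f document records.
            if (writtenDocuments != frequency)
                throw new IllegalStateException("Only " + writtenDocuments + " of " + frequency + " records written");
        } else require(State.BEFORE_INVERTED_LIST);
        state = State.BEFORE_FREQUENCY;
    }

    void writeFrequency(int f) {
        require(State.BEFORE_FREQUENCY);
        if (f <= 0) throw new IllegalArgumentException("The frequency must be positive");
        frequency = f;
        writtenDocuments = 0;
        state = State.BEFORE_DOCUMENT_RECORD;
    }

    void newDocumentRecord() {
        require(State.BEFORE_DOCUMENT_RECORD);
        if (writtenDocuments == frequency) throw new IllegalStateException("Too many document records");
        writtenDocuments++;
        state = State.BEFORE_POINTER;
    }

    void writeDocumentPointer() {
        require(State.BEFORE_POINTER);
        state = hasPayloads ? State.BEFORE_PAYLOAD : afterPayload();
    }

    void writePayload() {
        require(State.BEFORE_PAYLOAD);
        state = afterPayload();
    }

    void writePositionCount() {
        require(State.BEFORE_COUNT);
        state = hasPositions ? State.BEFORE_POSITIONS : State.BEFORE_DOCUMENT_RECORD;
    }

    void writeDocumentPositions() {
        require(State.BEFORE_POSITIONS);
        state = State.BEFORE_DOCUMENT_RECORD;
    }

    private State afterPayload() {
        return hasCounts ? State.BEFORE_COUNT : State.BEFORE_DOCUMENT_RECORD;
    }

    public static void main(String[] args) {
        WriterStateSketch w = new WriterStateSketch(false, true, true);
        w.newInvertedList();
        w.writeFrequency(1);
        w.newDocumentRecord();
        w.writeDocumentPointer();
        w.writePositionCount();
        w.writeDocumentPositions();
        System.out.println(w.state);
    }
}
```

Calling a method out of order in this sketch raises an IllegalStateException, which mirrors the "must be called immediately after" wording of the method descriptions.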