|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectit.unimi.dsi.mg4j.index.IndexWriter
it.unimi.dsi.mg4j.index.SkipIndexWriter
Provides facilities to write skip inverted indices,
that is, inverted indices with an additional skip structure. A skip inverted index
allows one to skip ahead when reading inverted lists. More specifically,
when reading the inverted list relative to a certain term, one may want to
decide to skip all document records that concern documents with pointer
less than a given integer. In a normal inverted index this is impossible:
one would have to read all document records sequentially (by the way,
this is what IndexReader.skipTo(int)
does).
Before reading this section, please consult it.unimi.dsi.mg4j.index
to understand what is the usual structure of an inverted index.
A skip inverted index is just like a usual inverted index, with the only difference that some of the document records contain additional information that allow to skip document records without actually reading them.
The classical approach to skipping is to embed pointers that allow to skip a fixed amount of document records. MG4J, instead, embeds a perfect, static skip list in every inverted list using a new method to represent and compress skip towers. As a result, MG4J can embed tall towers with small quanta (e.g., quantum 64 and height 8) with a very small growth of the index size. The details of the method will be presented in a forthcoming paper.
The document records that contain a skip tower are called skip document records; skip towers are stored within a document record just after the document pointer and just before the document data.
The construction of a skip inverted index depends on two positive integers, q and h, called the quantum and the height. For sake of simplicity, let H=2h and w=Hq.
Consider the records in a certain inverted list; these records are (logically) grouped from the left into blocks of width w (the last block, of course, may contain less than w records; see below). Hence, the records within a certain block are indexed from 0 to w-1. The skip records are only those whose index (within the block) is of the form kq, where k ranges from 0 to H-1; the number k is called the skip index of the skip record (within the block that contains it).
The skip tower of the skip record with skip index k contains LSB(k)+1 entries, where LSB(k) is the index of the least significant bit of k, a.k.a. the number of trailing zeroes in the h-long binary expansion of k. Hence, the 0-th skip tower has h+1 entries, the H/2-th skip tower has h entries, and so on. Skip entries within each skip tower have an index, called the skip tower index: if a skip tower contains s+1 entries, their skip tower index will range from 0 to s (inclusive).
Each entry within a skip tower is really made of two numbers: the skip pointer and the skip amount. Consider the i-th entry of the k-th skip tower (the one contained in the record kq within the block); let P be the skip pointer and A be the skip amount.
The entry we are considering points to another skip record: the one whose skip index is k+2i. For example, the last entry (the entry with skip tower index h) in the skip tower of skip index 0 points to the H-th skip record within the block (that is: the first record in the following block, if any). The next-to-last entry (skip tower index h-1) in the same skip tower points to the H/2-th skip record within the block (that is: the record in the middle of the block). The entry number 0 in the same skip tower points to the skip record with skip index 1.
Now, P is the document pointer of the pointed skip document record, and A is the number of bits to skip from the end of the skip tower to the start of the pointed (the k+2i-th) tower.
The last block within an inverted list may contain less than w records, and in this case we say that it is defective. If a defective block contains only C < w records, the number of entries in the k-th skip tower is not LSB(k)+1, but rather 1+min(LSB(k),MSB(⌊C/q⌋ - k)), where MSB(k) is the most significant bit of k, and moreover MSB(0) = -1 by definition. In particular, defective quanta have an empty tower.
This class is used in the same way as IndexWriter
. It further collects
more statistics concerning the compression of skips
.
Nested Class Summary | |
static class |
SkipIndexWriter.TowerData
A structure maintaining statistical data about tower construction. |
Field Summary | |
long |
bitsForEntryBitLengths
The number of bits written for entry lenghts. |
long |
bitsForQuantumBitLengths
The number of bits written for quantum lengths. |
long |
numberOfBlocks
The number of written blocks. |
int |
prevEntryBitLength
An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list. |
int |
prevQuantumBitLength
An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list. |
SkipIndexWriter.TowerData |
towerData
The sum of all tower data computed so far. |
Fields inherited from class it.unimi.dsi.mg4j.index.IndexWriter |
b, BEFORE_COUNT, BEFORE_DOCUMENT_RECORD, BEFORE_FREQUENCY, BEFORE_INVERTED_LIST, BEFORE_POINTER, BEFORE_POSITIONS, bitsForAlignment, bitsForCounts, bitsForFrequencies, bitsForPointers, bitsForPositions, currentDocument, currentTerm, FIRST_UNUSED_STATE, flags, frequency, hasCounts, hasPositions, lastDocument, log2b, maxDocPos, numberOfDocuments, obs, pointerCoding, relativeFrequency, state, writtenDocuments |
Fields inherited from interface it.unimi.dsi.mg4j.index.CompressionFlags |
ARITH, CODING_NAME, COUNTS_DEFAULT, COUNTS_DELTA, COUNTS_GAMMA, COUNTS_SHIFT, DELTA, FREQUENCIES_DEFAULT, FREQUENCIES_DELTA, FREQUENCIES_GAMMA, FREQUENCIES_SHIFT, GAMMA, GOLOMB, INTERP, NIBBLE, NO_COUNTS, NO_POSITIONS, NONE, POINTERS_DEFAULT, POINTERS_DELTA, POINTERS_GAMMA, POINTERS_GOLOMB, POINTERS_SHIFT, POSITIONS_ARITH, POSITIONS_DEFAULT, POSITIONS_DELTA, POSITIONS_GAMMA, POSITIONS_GOLOMB, POSITIONS_INTERP, POSITIONS_SHIFT, POSITIONS_SKEWED_GOLOMB, SKEWED_GOLOMB, UNARY, ZETA |
Constructor Summary | |
SkipIndexWriter(OutputBitStream obs,
int N,
int flags,
int q,
int h)
Creates a new skip index writer, with the specified underlying OutputBitStream ,
without an associated offset bit stream. |
|
SkipIndexWriter(OutputBitStream obs,
OutputBitStream offset,
int N,
int flags,
int q,
int h)
Creates a new skip index writer, with the specified underlying OutputBitStream . |
Method Summary | |
void |
close()
Closes the underlying bit stream. |
OutputBitStream |
newDocumentRecord()
Starts a new document record. |
OutputBitStream |
newDocumentRecord(int pointer)
Starts a new document record. |
long |
newInvertedList()
Starts a new inverted list. |
long |
newInvertedList(int frequency)
Deprecated. Use newInvertedList() followed by writeFrequency(int) . |
Properties |
properties()
This method should only be called after close() . |
int |
writeDocumentPointer(OutputBitStream out,
int pointer)
Writes a document pointer. |
int |
writeFrequency(int frequency)
Writes the frequency. |
long |
writtenBits()
This method returns the overall number of bits written onto the underlying stream. |
Methods inherited from class it.unimi.dsi.mg4j.index.IndexWriter |
flags2String, writeDocumentPositions, writePositionCount |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public SkipIndexWriter.TowerData towerData
public long bitsForQuantumBitLengths
public long bitsForEntryBitLengths
public long numberOfBlocks
public int prevEntryBitLength
public int prevQuantumBitLength
Constructor Detail |
public SkipIndexWriter(OutputBitStream obs, OutputBitStream offset, int N, int flags, int q, int h)
OutputBitStream
.
obs
- the underlying output bit stream.offset
- the offset bit stream.N
- the number of documents in the collection to be indexed.flags
- a bit mask setting the coding techniques to be used (see the IndexWriter
introduction).q
- the cache contains at most 2h document records.h
- the maximum height of a skip tower.public SkipIndexWriter(OutputBitStream obs, int N, int flags, int q, int h)
OutputBitStream
,
without an associated offset bit stream.
obs
- the underlying output bit stream.N
- the number of documents in the collection to be indexed.flags
- a bit mask setting the coding techniques to be used (see the IndexWriter
introduction).q
- the cache contains at most 2h document records.h
- the maximum height of a skip tower.Method Detail |
public long newInvertedList() throws IOException, IllegalStateException
IndexWriter
newInvertedList
in class IndexWriter
IllegalStateException
- if too few records were written for the previous inverted
list.
IOException
- if an exception is thrown by the underlying stream.public int writeFrequency(int frequency) throws IOException, IllegalStateException
IndexWriter
writeFrequency
in class IndexWriter
frequency
- the (positive) number of document records that this inverted list will contain.
IOException
- if an exception is thrown by the underlying stream.
IllegalStateException
public long newInvertedList(int frequency) throws IOException, IllegalStateException
newInvertedList()
followed by writeFrequency(int)
.
frequency
is written onto the underlying stream
using the specified coding (the number is decremented by one first, as it cannot be 0).
newInvertedList
in class IndexWriter
frequency
- the number of document records that this inverted list will contain.
IOException
- if an exception is thrown by the underlying stream
IllegalStateException
- if too few records were written for the previous inverted
list.public OutputBitStream newDocumentRecord() throws IOException, IllegalStateException
IndexWriter
This method must be called exactly n
times
after a call to newInvertedList(n)
.
newDocumentRecord
in class IndexWriter
IOException
- if an exception is thrown by the underlying stream.
IllegalStateException
- if too many records were written for the current inverted list,
or if there is no current inverted list.public int writeDocumentPointer(OutputBitStream out, int pointer) throws IOException, IllegalStateException
IndexWriter
This method must be called immediately after IndexWriter.newDocumentRecord()
.
writeDocumentPointer
in class IndexWriter
out
- the output bit stream where the pointer will be written.pointer
- the document pointer.
IOException
- if an exception is thrown by the underlying stream.
IllegalStateException
public OutputBitStream newDocumentRecord(int pointer) throws IOException, IllegalStateException
newDocumentRecord()
followed by writeDocumentPointer(OutputBitStream, int)
.
IndexWriter
n
times
after a call to newInvertedList(n)
.
newDocumentRecord
in class IndexWriter
pointer
- the document pointer (will be written to the bit stream).
OutputBitStream
where the next document record
should be written.
IllegalStateException
- if too many records were written for the current inverted list,
or if there is no current inverted list.
IOException
- if an exception is thrown by the underlying stream.public void close() throws IOException, IllegalStateException
close
in class IndexWriter
IOException
- if an exception is thrown by the underlying stream
IllegalStateException
- if too few records were written for the previous inverted
list.public long writtenBits()
writtenBits
in class IndexWriter
public Properties properties()
IndexWriter
IndexWriter.close()
. It returns the values for all the properties
specified in IndexProperties
.
properties
in class IndexWriter
IndexProperties
along with their values.
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |