|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectit.unimi.dsi.mg4j.index.IndexReader
Provides facilities to read directly an inverted index.
An inverted index is made by a sequence of inverted lists (one inverted list
for each term). Inverted lists are made by document records. For more information
on the structure of an inverted index, see the documentation of it.unimi.dsi.mg4j.index
.
Note that Index
provides a much easier interface to
index access. Instead of creating an IndexReader
explicitly, you can use Index.getReader()
.
The simplest way of getting data from an IndexReader
is to ask for
the documents matching a term.
Otherwise, you can access the index directly.
The method readFrequency()
should be called only when the underlying
stream is positioned on the start of an inverted list; in such case, it returns
the frequency of the corresponding term.
After that, you can read the document records sequentially: you have to read
exactly as many document records as specified in the frequency, at the beginning of the list.
For each document record, you call (possibly: many times) readDocumentPointer()
to read the document
pointer, and then readDocumentPositions(int[])
(just once) to read the
positions. After reading the last record, you will be positioned on the beginning of
the next inverted list, if any.
The skipTo(int)
method can be called either after or before
reading a document pointer, and allows one to skip ahead in the inverted list.
More precisely, assume that you are currently on a record with pointer p
,
and you call skipTo(n)
; if there is a document record with pointer
≥ n
in the current inverted list, you will be positioned on the
beginning of the first such record following the current one (in particular, if
p
≥n
, the reader will not move); if there is no
document record satisfying the condition, you will be positioned on the beginning
of the next inverted list (or at the end of the index, if this is the last
inverted list).
Hence, as explained above, the index is read sequentially, one inverted list
after another, one document record after another. The skipTo(int)
method
can be used to jump ahead (but not backwards!) to another document record
inside a single inverted list.
As explained in the IndexWriter
documentation, an inverted index may
be created with an associated offset bit stream: the offset bit stream stores
the positions where each inverted list starts inside the index. You can read
an offset bit stream using the Index.readOffsets(InputBitStream, int, ProgressMeter)
static
method: this method reads the offset bit stream and returns its content as
a list.
Such a list may be passed to the IndexReader
constructor: in this case,
you can use the position(t)
method to position the index reader on
the beginning of the inverted list relative to a the t
-th term.
In order to do so, two conditions must be satisfied.
IndexReader
must have been created with an associated offset map;
positionable
.
position(int)
method may be called at any time.
After calling readFrequency()
, the reader is at the beginning of the
first document record. We assume that this is an invariant: the reader is always
at the beginning of a document record; the variable numberOfDocumentRecord
keeps track of the number of the document record: this ranges from 0 to the
frequency (when equal to the frequency, we are not really at the beginning of
a document record but rather at the beginning of the next inverted list, or
at the end of the file).
Field Summary | |
protected int |
b
The parameter b for Golomb coding of pointers. |
protected static int |
BEFORE_COUNT
This value of state can be assumed only in indices that contain counts; it
means that we are positioned just before the count for the current document record. |
protected static int |
BEFORE_FREQUENCY
This value of state means that we are positioned at the start of an inverted list (just before the frequency). |
protected static int |
BEFORE_POINTER
This value of state means that we are at the start of a new document record,
unless we already read all documents (i.e., numberOfDocumentRecord == frequency ),
in which case we are at the end of the inverted list, and endOfList() is true. |
protected static int |
BEFORE_POSITIONS
This value of state can be assumed only in indices that contain document positions;
it means that we are positioned just before the position list of the current document record. |
protected int |
count
The current count (if this index contains counts). |
protected int |
currentDocument
The last document pointer in the current list, or -1 if we just read the frequency. |
protected static int |
FIRST_UNUSED_STATE
This is the first unused state. |
protected int |
frequency
The current frequency. |
protected InputBitStream |
ibs
The underlying input bit stream. |
protected Index |
index
The reference index. |
protected int |
log2b
The parameter log2b for Golomb coding of pointers; it is the most significant bit of b . |
protected int |
numberOfDocumentRecord
The number of the document record we are going to read inside the current inverted list. |
protected double |
relativeFrequency
The number of document records that the current inverted list will contain, divided by the number of documents. |
protected int |
state
This variable tracks the current state of the reader. |
protected int |
term
The current term. |
Fields inherited from interface it.unimi.dsi.mg4j.index.CompressionFlags |
ARITH, CODING_NAME, COUNTS_DEFAULT, COUNTS_DELTA, COUNTS_GAMMA, COUNTS_SHIFT, DELTA, FREQUENCIES_DEFAULT, FREQUENCIES_DELTA, FREQUENCIES_GAMMA, FREQUENCIES_SHIFT, GAMMA, GOLOMB, INTERP, NIBBLE, NO_COUNTS, NO_POSITIONS, NONE, POINTERS_DEFAULT, POINTERS_DELTA, POINTERS_GAMMA, POINTERS_GOLOMB, POINTERS_SHIFT, POSITIONS_ARITH, POSITIONS_DEFAULT, POSITIONS_DELTA, POSITIONS_GAMMA, POSITIONS_GOLOMB, POSITIONS_INTERP, POSITIONS_SHIFT, POSITIONS_SKEWED_GOLOMB, SKEWED_GOLOMB, UNARY, ZETA |
Constructor Summary | |
IndexReader(Index index)
Creates a new index reader, with the specified underlying Index and default buffer size. |
|
IndexReader(Index index,
InputBitStream ibs)
Creates a new index reader, with the specified underlying Index and input bit stream. |
|
IndexReader(Index index,
int bufferSize)
Creates a new index reader, with the specified underlying Index . |
|
IndexReader(InputBitStream ibs,
int N,
long flags)
Deprecated. Please use Index.getReader() or IndexReader(Index) . |
|
IndexReader(InputBitStream ibs,
LongList offsets,
IntList sizes,
int N,
long flags)
Deprecated. Please use Index.getReader() or IndexReader(Index) . |
|
IndexReader(InputBitStream ibs,
LongList offsets,
int N,
long flags)
Deprecated. Please use Index.getReader() or IndexReader(Index) . |
Method Summary | |
void |
close()
Closes the underlying bit stream. |
int |
currentDocumentPointer()
Returns the current document pointer. |
DocumentIterator |
documents(CharSequence term)
Returns a document iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present. |
DocumentIterator |
documents(int term)
Returns a document iterator over the documents containing a term. |
boolean |
endOfList()
Returns whether there are no more document records in the current inverted list. |
protected boolean |
justAfterPointer()
Returns true if the index is positioned just after a pointer. |
void |
position(int term)
Positions the index on the inverted list of a given term. |
long |
readBits()
Returns the number of bits read from the last call to position(int) . |
int |
readDocumentPointer()
Reads a document pointer from a given InputBitStream . |
int |
readDocumentPositions(int[] occ)
Reads the positions of the occurrences of the current term in the current document. |
int |
readFrequency()
Reads and returns the frequency (number of documents where the term appears). |
int |
readPositionCount()
Reads the position count for the current document. |
boolean |
skipTo(int p)
Skips to the first document record whose document pointer is greater than or equal to the given value. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
protected final Index index
protected InputBitStream ibs
protected int b
b
for Golomb coding of pointers.
protected int log2b
log2b
for Golomb coding of pointers; it is the most significant bit of b
.
protected int term
protected int frequency
protected int count
protected double relativeFrequency
protected int currentDocument
protected int numberOfDocumentRecord
protected int state
protected static final int BEFORE_FREQUENCY
state
means that we are positioned at the start of an inverted list (just before the frequency).
protected static final int BEFORE_POINTER
state
means that we are at the start of a new document record,
unless we already read all documents (i.e., numberOfDocumentRecord
== frequency
),
in which case we are at the end of the inverted list, and endOfList()
is true.
protected static final int BEFORE_COUNT
state
can be assumed only in indices that contain counts; it
means that we are positioned just before the count for the current document record.
protected static final int BEFORE_POSITIONS
state
can be assumed only in indices that contain document positions;
it means that we are positioned just before the position list of the current document record.
protected static final int FIRST_UNUSED_STATE
Constructor Detail |
public IndexReader(Index index) throws IOException
Index
and default buffer size.
index
- the index.IndexReader(Index, int)
public IndexReader(Index index, int bufferSize) throws IOException
Index
.
index
- the index.bufferSize
- the size of the buffer of the underlying input bit stream.public IndexReader(Index index, InputBitStream ibs) throws IOException
Index
and input bit stream.
index
- the index.ibs
- the underlying bit stream.public IndexReader(InputBitStream ibs, LongList offsets, IntList sizes, int N, long flags) throws IOException
Index.getReader()
or IndexReader(Index)
.
InputBitStream
.
ibs
- the underlying input bit stream.offsets
- the offset list; may be null
if you do not plan using position(int)
.sizes
- the size list; may be null
if your code does not require it.N
- the number of documents in the collection.flags
- a bit mask setting the coding techniques to be used (see the IndexReader
introduction).public IndexReader(InputBitStream ibs, LongList offsets, int N, long flags) throws IOException
Index.getReader()
or IndexReader(Index)
.
InputBitStream
.
ibs
- the underlying input bit stream.N
- the number of documents in the collection.flags
- a bit mask setting the coding techniques to be used (see the IndexReader
introduction).public IndexReader(InputBitStream ibs, int N, long flags) throws IOException
Index.getReader()
or IndexReader(Index)
.
InputBitStream
.
ibs
- the underlying input bit stream.N
- the number of documents in the collection.flags
- a bit mask setting the coding techniques to be used (see the IndexReader
introduction).Method Detail |
public void position(int term) throws IOException
term
- a term.
IOException
public boolean endOfList()
public int readFrequency() throws IOException, IllegalStateException
endOfList()
must
return true, but also all data contained in the last document record must
have been read).
After that, you are ready to read the document records.
IOException
- if an exception is thrown by the underlying stream
IllegalStateException
public int readDocumentPointer() throws IOException
InputBitStream
. This
method returns the document pointer; after this, you can use the stream
to read the positions.
IOException
- if an exception is thrown by the underlying streampublic int currentDocumentPointer()
protected boolean justAfterPointer()
public int readPositionCount() throws IOException, NoSuchElementException
IOException
- if an exception is thrown by the underlying stream
IllegalStateException
- if you try to read a count but you have not just read a pointer.
NoSuchElementException
- if you try to read beyond the end of an inverted list.public int readDocumentPositions(int[] occ) throws IOException, NoSuchElementException
First the count (number of positions) is read using the specified coding (unless it has
already been read using readPositionCount()
). Then,
the vector of positions is read using the specified coding.
This method can also skip the list of document positions by passing a null
vector.
occ
- the vector of positions; this vector must be at least as large as the number of
occurrences that shall be read; if it is null
, positions are discarded.
occ
); if the returned value is
negative, occ
has not enough space to store positions (the absolute returned value is the number
of positions).
IOException
- if an exception is thrown by the underlying stream
IllegalStateException
- if you try to read positions without known the frequency.
NoSuchElementException
- if you try to read beyond the end of an inverted list.public boolean skipTo(int p) throws IOException
If this method returns true, then there is one record satisfying the condition, and the index is positioned over it. If this method returns false, then the condition is not satisfied and the index is positioned on the last document record.
In any case, this method leaves the index in the same state as after a call to
readDocumentPointer()
, unless it is called when endOfList()
is true,
in which case it does nothing and returns false.
p
- a document pointer.
p
.
IOException
public long readBits()
position(int)
.
public void close() throws IOException, IllegalStateException
IOException
- if an exception is thrown by the underlying stream
IllegalStateException
- if too few records were written for the previous inverted
list.public DocumentIterator documents(int term) throws IOException
term
- a term.
IOException
public DocumentIterator documents(CharSequence term) throws IOException
term
- a term (the term will be downcased if the index is case insensitive).
IOException
- if an exception occurred while accessing the index.
UnsupportedOperationException
- if the term map is not available for the underlying index.
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |