Interface Summary | |
---|---|
IndexIterator | An iterator over an inverted list. |
IndexReader | Provides access to an inverted index. |
IndexWriter | An interface for classes that generate indices. |
PrefixMap | Deprecated. As of MG4J 2.1, replaced by PrefixMap. |
TermMap | Deprecated. As of MG4J 2.1, replaced by StringMap. |
TermProcessor | A term processor, implementing term/prefix transformation and possibly term/prefix filtering. |
Class Summary | |
---|---|
AbstractBitStreamIndexWriter | An abstract bitstream-based index writer, providing common variables and a basic AbstractBitStreamIndexWriter.printStats(PrintStream) implementation. |
AbstractIndexIterator | A very basic abstract implementation of an index iterator, providing an obvious implementation of AbstractIndexIterator.term(), AbstractIndexIterator.id(), and of the visiting methods. |
AbstractIndexReader | An abstract, safely closeable implementation of an index reader. |
AbstractPrefixMap | Deprecated. Use PrefixMap and related classes. |
AbstractTermMap | Deprecated. Use StringMap and related classes. |
BitStreamHPIndex | A high-performance bitstream-based index. |
BitStreamHPIndexReader | A bitstream-based index reader for high-performance indices. |
BitStreamHPIndexReader.BitStreamHPIndexReaderIndexIterator | |
BitStreamHPIndexWriter | Writes a bitstream-based high-performance index. |
BitStreamHPIndexWriter.TowerData | A structure maintaining statistical data about tower construction. |
BitStreamIndex | A bitstream-based index. |
BitStreamIndexReader | A bitstream-based index reader. |
BitStreamIndexReader.BitStreamIndexReaderIndexIterator | |
BitStreamIndexWriter | Writes a bitstream-based interleaved index. |
CachingOutputBitStream | A special output bit stream with an additional method CachingOutputBitStream.buffer() that returns the internal buffer if the internal buffer contains all that has been written since the last call to position(0). |
CompressionFlags | A container for constants and enums related to index compression. |
DiskBasedIndex | A static container providing facilities to load an index based on data stored on disk. |
DowncaseTermProcessor | A term processor downcasing all characters. |
FileHPIndex | A file-based high-performance index. |
FileIndex | A file-based index. |
Index | An abstract representation of an index. |
IndexIterators | A class providing static methods and objects that do useful things with index iterators. |
InMemoryHPIndex | A BitStreamHPIndex index loaded in memory. |
InMemoryIndex | A local bitstream index loaded in memory. |
MemoryMappedHPIndex | A memory-mapped BitStreamHPIndex. |
MemoryMappedIndex | A local memory-mapped bitstream index. |
MultiTermIndexIterator | A virtual index iterator that merges several component index iterators. |
NullTermProcessor | A term processor that accepts all terms and does not do any processing. |
SkipBitStreamIndexWriter | Provides facilities to write skip inverted indices, that is, inverted indices with an additional skip structure. |
SkipBitStreamIndexWriter.TowerData | A structure maintaining statistical data about tower construction. |
TermMaps | Deprecated. Use StringMap and related classes. |
TermMaps.SynchronizedPrefixMap | |
TermMaps.SynchronizedTermMap | |
TermMaps.SynchronizedTermPrefixMap | |
Enum Summary | |
---|---|
BitStreamIndex.PropertyKeys | Symbolic names for additional properties of a BitStreamIndex. |
CompressionFlags.Coding | A coding for an index component. |
CompressionFlags.Component | A component of the index. |
Index.PropertyKeys | Symbolic names for properties of an Index. |
Index.UriKeys | Keys to be used (downcased) in specifying additional parameters to an MG4J URI. |
Exception Summary | |
---|---|
TooManyTermsException | Thrown to indicate that a prefix query generated too many terms. |
Index generation and access.
This package contains the classes that handle index generation and access. The interval iterators defined in it.unimi.dsi.mg4j.search build upon the classes of this package to provide answers to queries using interval semantics, but it is also possible to access an index directly.
You can easily build indices using the tools in it.unimi.dsi.mg4j.tool. Once an index has been built, it can be opened using an Index object, which gathers the metadata necessary to access the index. You do not create an Index with a constructor: rather, you use the static factory Index.getInstance(CharSequence) (or one of its variants) to create an instance. This is necessary so that different kinds of indices can be treated transparently: for example, the factory may return an IndexCluster if the index is actually a cluster, but you do not need to know that.
From an Index, you can easily obtain an IndexReader, which allows you to scan the index sequentially or randomly. In turn, from an IndexReader you can obtain an IndexIterator returning the documents containing a certain term and the positions of the term within each document. But there is more: an IndexIterator is a kind of DocumentIterator, and DocumentIterators can be combined in several ways using the classes of the package it.unimi.dsi.mg4j.search: for instance, you can combine document iterators using AND/OR. Note that you can combine document iterators on different indices, but of course the operation is meaningful only if the two indices contain different information about the same document collection (e.g., title and main text).
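As an illustration of what combining document iterators with AND/OR amounts to, here is a minimal, self-contained Java sketch. It does not use the MG4J classes: the class `PostingCombination` and its methods are hypothetical names, and each iterator is simplified to a sorted array of document numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MG4J: each "iterator" is modelled
// as a strictly increasing array of document numbers.
class PostingCombination {

    /** AND: documents that appear in both sorted arrays. */
    static List<Integer> and(int[] left, int[] right) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) i++;           // left is behind: advance it
            else if (left[i] > right[j]) j++;      // right is behind: advance it
            else { result.add(left[i]); i++; j++; } // same document in both
        }
        return result;
    }

    /** OR: documents that appear in at least one sorted array, without duplicates. */
    static List<Integer> or(int[] left, int[] right) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length || j < right.length) {
            int next;
            if (j == right.length || (i < left.length && left[i] < right[j])) next = left[i++];
            else if (i == left.length || right[j] < left[i]) next = right[j++];
            else { next = left[i]; i++; j++; }     // same document: emit once
            result.add(next);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] title = {0, 2, 5, 7};
        int[] text = {2, 3, 7, 9};
        System.out.println(and(title, text));
        System.out.println(or(title, text));
    }
}
```

Both methods advance through the two arrays in a single pass, which is why the document numbers must be sorted and must refer to the same document collection.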
More importantly, if the index is full text (the default), for each document containing the term you can get interval iterators that return intervals representing extents of text satisfying the query: for instance, in the case of an AND of two terms, the intervals will contain both terms.
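To make the AND case concrete, the following self-contained Java sketch computes, for a single document, the minimal intervals that contain both terms. It is only an illustration and not how MG4J implements interval iterators: the class `AndIntervals` is a hypothetical name, and the sketch assumes that the two position arrays are strictly increasing and share no position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MG4J.
class AndIntervals {

    /**
     * Given the positions of two terms within one document, returns the
     * minimal intervals {left, right} containing at least one occurrence
     * of each term. Assumes the arrays are sorted and disjoint.
     */
    static List<int[]> minimalIntervals(int[] a, int[] b) {
        List<int[]> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                // Move to the last occurrence of the first term before b[j].
                while (i + 1 < a.length && a[i + 1] < b[j]) i++;
                result.add(new int[] { a[i], b[j] });
                i++;
            } else {
                // Symmetric case: last occurrence of the second term before a[i].
                while (j + 1 < b.length && b[j + 1] < a[i]) j++;
                result.add(new int[] { b[j], a[i] });
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] interval : minimalIntervals(new int[] {1, 5, 9}, new int[] {3, 6})) {
            System.out.println("[" + interval[0] + ".." + interval[1] + "]");
        }
    }
}
```

Each returned interval starts at an occurrence of one term and ends at the nearest following occurrence of the other, so no shorter interval inside it can still contain both.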
An inverted index is made up of a sequence of inverted lists (one inverted list for each term). Inverted lists are made up of document records: each document record contains information about the occurrences of the term within a certain document.
More precisely, each inverted list starts with a suitably encoded integer, called the frequency, which is the number of document records that will follow (i.e., the number of documents in which the term appears). After that, there are exactly as many document records as the frequency.
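The layout just described can be modelled in memory with a few lines of Java. This is a sketch for illustration only: the classes `InvertedListSketch`, `InvertedList` and `DocumentRecord` are hypothetical and do not correspond to MG4J classes, which store the same information as a compressed bit stream. The `document` field identifying the document is likewise an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model, not part of MG4J.
class InvertedListSketch {

    /** One document record: a document and the term occurrences in it. */
    static final class DocumentRecord {
        final int document;    // assumed identifier of the document
        final int[] positions; // where the term occurs in that document

        DocumentRecord(int document, int[] positions) {
            this.document = document;
            this.positions = positions;
        }
    }

    /** The inverted list of a single term. */
    static final class InvertedList {
        final String term;
        final List<DocumentRecord> records = new ArrayList<>();

        InvertedList(String term) {
            this.term = term;
        }

        void add(int document, int... positions) {
            records.add(new DocumentRecord(document, positions));
        }

        /** The frequency: the number of documents in which the term appears. */
        int frequency() {
            return records.size();
        }
    }

    public static void main(String[] args) {
        InvertedList list = new InvertedList("index");
        list.add(0, 4, 17);
        list.add(3, 1);
        System.out.println(list.term + " appears in " + list.frequency() + " documents");
    }
}
```

In a serialized list the frequency is written first, so a reader knows in advance how many document records to decode.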
Each document record is made up of two parts:
As a basic and fundamental implementation, the classes of this package provide methods that write and read document data in a default form. In this default structure, the document data of each record is a suitable coding of a (strictly increasing) sequence of integers that correspond to the positions where the term occurs within the document. The length of the sequence (i.e., the number of positions at which the term appears) is called the count (it is also common to call it “within-document frequency”, but we find this usage confusing).
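Because the positions are strictly increasing, one common way to prepare them for a compact integer code is to store the gaps between consecutive positions. The sketch below shows that transformation and how the count relates to it. Gap coding is used here as an assumption for illustration; the text above only says that a "suitable coding" is used, and the class `PositionGaps` is a hypothetical name, not an MG4J class.

```java
// Hypothetical helper, not part of MG4J.
class PositionGaps {

    /** Turns strictly increasing, nonnegative positions into gaps. */
    static int[] toGaps(int[] positions) {
        int[] gaps = new int[positions.length];
        int previous = -1;
        for (int k = 0; k < positions.length; k++) {
            if (positions[k] <= previous) {
                throw new IllegalArgumentException("positions must be strictly increasing");
            }
            // Subtract one more because two positions always differ by at least 1.
            gaps[k] = positions[k] - previous - 1;
            previous = positions[k];
        }
        return gaps;
    }

    /** Inverse of toGaps: rebuilds the positions from the gaps. */
    static int[] fromGaps(int[] gaps) {
        int[] positions = new int[gaps.length];
        int previous = -1;
        for (int k = 0; k < gaps.length; k++) {
            positions[k] = previous + gaps[k] + 1;
            previous = positions[k];
        }
        return positions;
    }

    /** The count: the number of positions at which the term appears. */
    static int count(int[] positions) {
        return positions.length;
    }

    public static void main(String[] args) {
        int[] positions = {3, 4, 10};
        System.out.println(java.util.Arrays.toString(toGaps(positions)));
        System.out.println("count = " + count(positions));
    }
}
```

The gaps are small nonnegative integers, which is what makes them cheaper to encode than the raw positions; the count tells a reader how many of them to decode for the current document.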