Package it.unimi.dsi.mg4j.search

Iterators over documents, and composition thereof.

Interface Summary
DocumentIterator An iterator over documents and their intervals.
IntervalIterator An interface that allows one to iterate over intervals.
 

Class Summary
AbstractIntersectionDocumentIterator An abstract iterator on documents, generating the intersection of the documents returned by a number of document iterators.
AndDocumentIterator An iterator on documents that returns the AND of a number of document iterators.
ConsecutiveDocumentIterator An iterator returning documents containing consecutive intervals satisfying the underlying queries; the intervals must be in query order.
DocumentIterators A class providing static methods and objects that do useful things with document iterators.
Interval An integral interval.
IntervalIterators A class providing static methods and objects that do useful things with interval iterators.
IntervalIterators.EmptyIntervalIterator An iterator returning no intervals.
IntervalIterators.FakeIterator An iterator that throws an exception on all method calls, except for IntervalIterators.FakeIterator.hasNext(), which has a settable value.
Intervals A class providing static methods and objects that do useful things with intervals.
LowPassDocumentIterator A document iterator that filters another document iterator, returning just intervals (and containing documents) whose length does not exceed a given threshold.
NotDocumentIterator A document iterator that returns the documents not returned by its underlying iterator, and returns just IntervalIterators.TRUE whenever an interval iterator is requested.
OrDocumentIterator A document iterator that ORs given component iterators.
 

Package it.unimi.dsi.mg4j.search Description

Iterators over documents, and composition thereof.

This package contains the classes that allow one to compose iterators over documents. Such iterators are returned, for instance, by IndexReader.documents(int).

Minimal-interval semantics

MG4J provides minimal-interval semantics. That is, if the index is full-text, a document iterator will provide a list of documents and, for each document, a list of minimal intervals. These intervals denote ranges of positions in the document that satisfy the iterator: for instance, if you compose two document iterators using an AndDocumentIterator, you will get as a result the intersection of the document lists of the underlying iterators. Moreover, for each document you will get the minimal intervals that contain one interval from the first iterator and one from the second.
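
The interval part of an AND composition can be illustrated with a small, self-contained sketch. It does not use MG4J's classes: MinimalIntervalSketch, Span and and() are hypothetical names introduced here, intervals are assumed to be closed ranges of positions, and the brute-force strategy is chosen for readability rather than taken from the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these are not MG4J's Interval or AndDocumentIterator.
final class MinimalIntervalSketch {

    /** A closed interval [left, right] of positions (illustrative type). */
    static final class Span {
        final int left, right;

        Span(int left, int right) { this.left = left; this.right = right; }

        /** True if the other span lies entirely inside this one. */
        boolean contains(Span o) { return left <= o.left && o.right <= right; }

        @Override public boolean equals(Object o) {
            return o instanceof Span && ((Span) o).left == left && ((Span) o).right == right;
        }

        @Override public int hashCode() { return 31 * left + right; }

        @Override public String toString() { return "[" + left + ".." + right + "]"; }
    }

    /** Returns the minimal spans containing one span from a and one from b, sorted by left extreme. */
    static List<Span> and(List<Span> a, List<Span> b) {
        // Each candidate is the smallest span covering one interval from each list.
        List<Span> candidates = new ArrayList<>();
        for (Span x : a) {
            for (Span y : b) {
                Span c = new Span(Math.min(x.left, y.left), Math.max(x.right, y.right));
                if (!candidates.contains(c)) candidates.add(c);
            }
        }
        // Keep a candidate only if no other candidate lies inside it.
        List<Span> minimal = new ArrayList<>();
        for (Span c : candidates) {
            boolean isMinimal = true;
            for (Span d : candidates) {
                if (!d.equals(c) && c.contains(d)) { isMinimal = false; break; }
            }
            if (isMinimal) minimal.add(c);
        }
        minimal.sort((p, q) -> Integer.compare(p.left, q.left));
        return minimal;
    }
}
```

For a document where the first term occurs at positions 1 and 5 and the second at positions 3 and 6, the sketch yields [1..3], [3..5] and [5..6]; the candidate [1..6] is dropped because it contains [1..3].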

This information is of course very useful if you're going to assign a score to the document, as smaller intervals mean a more precise match. At the basic level (e.g., iterators returned by an index), the intervals returned for a document are intervals of length one containing the term that was used to generate the iterator. Intervals for compound iterators are built in a natural way, preserving minimality. More details can be found in Charles L. A. Clarke and Gordon V. Cormack, Shortest-Substring Retrieval and Ranking (ACM Transactions on Information Systems, vol. 18, no. 1, Jan 2000, pages 44–78). Scorers for documents may be found in the it.unimi.dsi.mg4j.search.score package.
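
To make the link between interval length and match precision concrete, here is a minimal sketch of a length-based score. It is not one of the scorers in it.unimi.dsi.mg4j.search.score: IntervalScoreSketch, length() and score() are hypothetical names, intervals are assumed to be closed {left, right} pairs (so a single term occurrence has length one), and the reciprocal-of-shortest-length formula is chosen purely for illustration.

```java
// Hypothetical illustration only: not an MG4J scorer.
final class IntervalScoreSketch {

    /** Length of the closed interval [left, right], counted in positions. */
    static int length(int left, int right) {
        return right - left + 1;
    }

    /**
     * Illustrative score: the reciprocal of the shortest interval length,
     * or 0 when the document has no intervals.
     */
    static double score(int[][] intervals) {
        int shortest = Integer.MAX_VALUE;
        for (int[] interval : intervals) {
            shortest = Math.min(shortest, length(interval[0], interval[1]));
        }
        return shortest == Integer.MAX_VALUE ? 0.0 : 1.0 / shortest;
    }
}
```

Under these assumptions a document whose tightest match spans two positions scores 0.5, while a match of length one scores 1.0.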

Note that MG4J provides minimal-interval semantics for a set of indices. This extension is a significant improvement over single-index semantics. However, defining the exact meaning of a query is a nontrivial problem that will be fully dealt with in a forthcoming paper.