it.unimi.dsi.mg4j.tool
Class Scan

java.lang.Object
  extended by it.unimi.dsi.mg4j.tool.Scan

public class Scan
extends Object

Scans a document sequence, dividing it into batches of occurrences and writing a corresponding subindex for each batch.

This class (more precisely, its run() method) reads a document sequence and produces several batches, that is, subindices corresponding to subsets of term/document pairs of the collection. A set of batches is generated for each indexed field of the collection. A main method invokes the above method setting its parameters using suitable options.

Unless a serialised DocumentSequence is specified using the suitable option, an implicit InputStreamDocumentSequence is created using a separator byte (default is 10, i.e., newline). In the latter case, the factory and its properties can be set with command-line options.

The only mandatory argument is a basename, which will be used to stem the names of all files generated. The first batch of a field named field will use the basename basename-field@0, the second batch basename-field@1 and so on. It is also possible to specify a separate directory for batch files (e.g., for easier cleanup when they are no longer necessary).

Since documents are read sequentially, every document has a natural index starting from 0. If no remapping (i.e., renumbering) is specified, the document index of each document corresponds to its natural index. If, however, a remapping is specified, under the form of a list of integers, the document index of a document is the integer found in the corresponding position of the list. More precisely, a remapping for N documents is a list of N distinct integers, and a document with natural index i has document index given by the i-th element of the list. This is useful when indexing statically ranked documents (e.g., if you are indexing a part of the web and would like the index to return documents with higher static rank first). If the remapping file is provided, it must be a sequence of integers, written using the DataOutputStream.writeInt(int) method; if N is the number of documents, the file is to contain exactly N distinct integers. The integers need not be between 0 and N-1, to allow the remapping of subindices (but a warning will be logged in this case, just to be sure you know what you're doing).
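The remapping file format described above can be produced with nothing more than the standard library. The following is a minimal sketch, not part of MG4J: the class name RemappingFileSketch, its method names and the sample ranking are all hypothetical. It writes N integers with DataOutputStream.writeInt(int) and reads them back, checking that the file holds exactly N distinct values.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of MG4J: it only illustrates the file format.
final class RemappingFileSketch {
    /** Writes one 32-bit integer per document using DataOutputStream.writeInt(int). */
    static void write(File file, int[] remapping) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file))) {
            // Position i holds the document index of the document with natural index i.
            for (int documentIndex : remapping) out.writeInt(documentIndex);
        }
    }

    /** Reads the file back, checking that it holds exactly n distinct integers. */
    static int[] readAndCheck(File file, int n) throws IOException {
        if (file.length() != 4L * n) throw new IOException("Expected exactly " + n + " integers");
        int[] remapping = new int[n];
        Set<Integer> seen = new HashSet<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            for (int i = 0; i < n; i++) {
                remapping[i] = in.readInt();
                if (!seen.add(remapping[i]))
                    throw new IOException("Duplicate document index " + remapping[i]);
            }
        }
        return remapping;
    }

    public static void main(String[] args) throws IOException {
        // Made-up static ranking: the document with natural index 2 gets document index 0.
        int[] remapping = { 1, 2, 0 };
        File file = File.createTempFile("remapping", ".bin");
        file.deleteOnExit();
        write(file, remapping);
        int[] back = readAndCheck(file, remapping.length);
        for (int i = 0; i < back.length; i++)
            System.out.println("natural index " + i + " -> document index " + back[i]);
    }
}
```

The check does not require values to lie between 0 and N-1, in line with the remark above about remapping subindices.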

Every term also has an associated number starting from 0, assigned in lexicographic order.

Index types and indexing types

A standard index contains a list of terms, and for each term a posting list. Each posting must contain a document pointer and may optionally contain the count and the positions of the term (whether the last two elements appear can be specified using suitable compression flags).

The indexing type of a standard index can be Scan.IndexingType.STANDARD, Scan.IndexingType.REMAPPED or Scan.IndexingType.VIRTUAL. In the first case, we index the words occurring in documents as usual. In the second case, before writing the index all documents are renumbered following a provided map. In the third case (used only with DocumentFactory.FieldType.VIRTUAL fields) indexing is performed on a virtual document obtained by collating a number of fragments. Fragments are associated to documents by some key, and a VirtualDocumentResolver turns a key into a document natural number, so that the collation process can take place (a settable gap is inserted between fragments).
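To make the collation of fragments more concrete, here is a small sketch of how word positions could be assigned when several fragments end up in the same virtual document. It is an illustration under assumptions, not the code of this class: the name VirtualGapSketch and the method addFragment are hypothetical, and the bookkeeping (a per-document "next position" array in the spirit of currMaxPos, advanced by the fragment length plus the gap) is one plausible reading of the description above.

```java
import java.util.Arrays;

// Hypothetical illustration of collating fragments into virtual documents.
final class VirtualGapSketch {
    /** For each virtual document, the position the next fragment will start from. */
    private final int[] currMaxPos;
    /** The artificial gap left between two consecutive fragments. */
    private final int virtualDocumentGap;

    VirtualGapSketch(int numVirtualDocs, int virtualDocumentGap) {
        this.currMaxPos = new int[numVirtualDocs];
        this.virtualDocumentGap = virtualDocumentGap;
    }

    /** Returns the positions assigned to the words of a fragment and reserves the gap after it. */
    int[] addFragment(int documentPointer, int fragmentLength) {
        int start = currMaxPos[documentPointer];
        int[] positions = new int[fragmentLength];
        for (int i = 0; i < fragmentLength; i++) positions[i] = start + i;
        // The next fragment of the same virtual document starts after the gap.
        currMaxPos[documentPointer] = start + fragmentLength + virtualDocumentGap;
        return positions;
    }

    public static void main(String[] args) {
        // Two virtual documents and a gap of 64 positions (an arbitrary value chosen for the example).
        VirtualGapSketch sketch = new VirtualGapSketch(2, 64);
        System.out.println(Arrays.toString(sketch.addFragment(0, 3))); // [0, 1, 2]
        System.out.println(Arrays.toString(sketch.addFragment(0, 2))); // [67, 68]
        System.out.println(Arrays.toString(sketch.addFragment(1, 1))); // [0]
    }
}
```

In the real process the document pointer would come from a VirtualDocumentResolver applied to the key of the fragment. The gap keeps words from different fragments far apart, so that positional queries are unlikely to match across fragment boundaries.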

Besides storing document pointers, counts, and positions, MG4J makes it possible to store an arbitrary payload with each posting. This feature is presently used only to create payload-based indices, that is, indices without counts and positions that contain a single, dummy word #. They are actually used to store arbitrary data associated with each document, such as dates and integers: using a special syntax, it is then possible to specify range queries on the values of such fields.

The main difference between standard and payload-based indices is that the first type is handled by instances of this class, whereas the second type is handled by instances of Scan.PayloadAccumulator. The run() method creates a set of suitable instances, one for each indexed field, and feeds them in parallel with data from the appropriate field of the same document.

Batch subdivision and content

The scanning process uses a user-settable number of documents per batch, and will try to build batches containing exactly that number of documents (for all indexed fields). There are of course space constraints that could make building exact batches impossible, as the entire data of a batch must fit into core memory. If memory is too low, a batch will be generated with fewer documents than expected.
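In the ideal case, where memory never forces a shorter batch, the subdivision is plain arithmetic. The sketch below computes the batch boundaries for a given number of documents per batch. The class and method names are hypothetical, and the convention used for the result (the first document of each batch, followed by the total number of documents) is an assumption made for the example, not a statement about the exact content of the cutPoints field.

```java
import java.util.Arrays;

// Hypothetical illustration of the ideal batch subdivision (no emergency dumps).
final class CutPointsSketch {
    /**
     * Returns k + 1 boundaries for k batches: entry i is the first document of
     * batch i, and the last entry is the total number of documents.
     */
    static int[] cutPoints(int numberOfDocuments, int documentsPerBatch) {
        if (numberOfDocuments < 0 || documentsPerBatch <= 0)
            throw new IllegalArgumentException("Need a nonnegative collection size and a positive batch size");
        // Number of batches, rounding up; the last batch may be shorter.
        int batches = numberOfDocuments / documentsPerBatch
                + (numberOfDocuments % documentsPerBatch == 0 ? 0 : 1);
        int[] cut = new int[batches + 1];
        for (int i = 0; i < batches; i++) cut[i] = i * documentsPerBatch;
        cut[batches] = numberOfDocuments;
        return cut;
    }

    public static void main(String[] args) {
        // Ten documents, four per batch: batches cover [0..4), [4..8) and [8..10).
        System.out.println(Arrays.toString(cutPoints(10, 4))); // [0, 4, 8, 10]
    }
}
```

When a batch has to be dumped early, the corresponding boundary simply falls at a smaller document index than the one computed here.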

In some extreme cases, it could be impossible to produce cleanly a set of batches for all fields: in that case, emergency dumps will create fragmented batches, that is, instead of a single batch containing k documents a certain field will generate two separate batches. As a consequence, different fields may have a different number of batches, but a simple inspection of the property files (see below) will reveal the details of the emergency dumps (and Combine can be used to rebuild the desired exact batches, if necessary).

The larger the number of documents in a batch, the quicker index construction will be. Usually, a few experiments and a look at the logs are enough to find good values for the Java virtual machine maximum memory setting and for the number of documents per batch.

These are the files currently generated for each batch (basename denotes the basename of the batch, not of the index):

basename.terms
For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely, the i-th line of the file (starting from 0) contains the literal string corresponding to term index i.
basename.terms.unsorted
The list of indexed terms in the same order in which they were met in the document collection. This list is not produced unless you ask for it explicitly with a suitable option.
basename.frequencies
For each term, the number of documents in which the term appears, in γ coding. More precisely, the i-th integer of the file (starting from 0) is the number of documents in which the term of index i appears.
basename.sizes (not generated for payload-based indices)
For each indexed document, the corresponding size (i.e., the number of words) in γ coding. More precisely, the i-th integer of the file (starting from 0) is the size in words of the document of index i.
basename.index
The inverted index.
basename.offsets (not generated for payload-based indices)
For each term, the bit offset in basename.index at which the inverted lists start. More precisely, the first integer is the offset for term 0 in γ coding, and then the i-th integer is the difference between the i-th and the i−1-th offset in γ coding. If T terms were indexed, this file will contain T+1 integers, the last being the difference (in bits) between the length of the entire inverted index and the offset of the last inverted list.
basename.globcounts (not generated for payload-based indices)
For each term, the number of its occurrences throughout the whole document collection, in γ coding. More precisely, the i-th integer of the file (starting from 0) is the number of occurrences of the term of index i.
basename.properties
A Java property file containing information about the index. Currently, the following keys (taken from Index.PropertyKeys) are generated:
indexclass
the class used to generate the batch (presently, BitStreamIndexWriter);
documents
number of documents in the collection;
terms
number of indexed terms;
occurrences
number of words throughout the whole collection;
postings
number of postings (pairs term/document) throughout the whole collection;
maxdocsize
maximum size of a document in words;
termprocessor
the term processor (if any) used during the index construction;
coding
one or more items, each defining a key/value pair for the flag map of the index; each pair is of the form component:coding (see CompressionFlags);
field
the name of the field that generated this batch (optional);
maxcount
the maximum count in the collection, that is, the maximum count of a term maximised on all terms and documents;
size
the index size in bits;
basename.cluster.properties
A Java property file containing information about the set of batches seen as a DocumentalCluster. The keys are the same as in the previous case, but additionally a number of localindex entries specify the basenames of the batches, and a splitstrategy entry is present. After manually creating suitable term maps for each batch, you will be able to access the set of batches as a single index (but note that standard batches have no skip structure, and should not be used in production; if you intend to do so, you have to write a customised scanning procedure).
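As a worked example of the layout of basename.offsets described in the list above, the following sketch turns the stored differences back into absolute bit offsets. It assumes the T+1 integers have already been decoded from γ coding into a long array (the decoding itself is not shown), and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration: rebuilding absolute offsets from the stored differences.
final class OffsetsSketch {
    /**
     * Given the T + 1 decoded integers of an offsets file, returns T + 1 absolute
     * values: entry i (for i < T) is the bit offset of the inverted list of term i,
     * and entry T is the length in bits of the whole inverted index.
     */
    static long[] absoluteOffsets(long[] gaps) {
        long[] offsets = new long[gaps.length];
        long current = 0;
        for (int i = 0; i < gaps.length; i++) {
            current += gaps[i]; // running sum of the differences
            offsets[i] = current;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Made-up data for T = 3 terms: the first offset, then three differences.
        long[] gaps = { 0, 120, 35, 400 };
        long[] offsets = absoluteOffsets(gaps);
        System.out.println(Arrays.toString(offsets)); // [0, 120, 155, 555]
        // The inverted list of term i occupies the bits from offsets[i] (inclusive)
        // to offsets[i + 1] (exclusive), so its length is gaps[i + 1].
        for (int i = 0; i + 1 < offsets.length; i++)
            System.out.println("term " + i + ": " + (offsets[i + 1] - offsets[i]) + " bits");
    }
}
```

Storing differences rather than absolute offsets keeps the integers small, which is what makes γ coding effective for this file.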

Since:
1.0
Author:
Sebastiano Vigna

Nested Class Summary
static class Scan.IndexingType
           
protected static class Scan.PayloadAccumulator
          An accumulator for payloads.
static interface Scan.VirtualDocumentFragment
          An interface that describes a virtual document fragment.
 
Field Summary
static String CLUSTER_PROPERTIES_EXTENSION
          The extension of the strategy for the cluster associated to a scan.
protected  int[] currMaxPos
          The current maximum position for each document, if the field indexed is virtual.
protected  IntArrayList cutPoints
          The cutpoints of the batches (for building later a ContiguousDocumentalStrategy).
static int DEFAULT_BATCH_SIZE
          The default batch size.
static int DEFAULT_BUFFER_SIZE
          The default buffer size.
static int DEFAULT_DELIMITER
          The default delimiter separating two documents read from standard input (a newline).
static int DEFAULT_VIRTUAL_DOCUMENT_GAP
          The default virtual field gap.
 boolean outOfMemoryError
          If true, this class experienced an OutOfMemoryError during some buffer reallocation.
protected  int virtualDocumentGap
          The width of the artificial gap introduced between virtual-document fragments.
 
Constructor Summary
Scan(String basename, String field, TermProcessor termProcessor, boolean documentsAreInOrder, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir)
          Creates a new scanner instance.
Scan(String basename, String field, TermProcessor termProcessor, Scan.IndexingType indexingType, int numVirtualDocs, int virtualDocumentGap, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir)
          Creates a new scanner instance.
Scan(String basename, String field, TermProcessor termProcessor, Scan.IndexingType indexingType, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir)
          Creates a new scanner instance.
 
Method Summary
protected static String batchBasename(int batch, String basename, File batchDir)
          Returns the name of a batch.
static void cleanup(String basename, int batches, File batchDir)
          Cleans all intermediate files generated by a run of this class.
 void close()
          Closes this pass, releasing all resources.
protected  long dumpBatch()
          Dumps the current batch on disk as an index.
static DocumentSequence getSequence(String sequenceName, Class<?> factoryClass, String[] property, int delimiter, Logger logger)
          Returns the document sequence to be indexed.
static void main(String[] arg)
           
protected  void openSizeBitStream()
           
static int[] parseFieldNames(String[] indexedFieldName, DocumentFactory factory, boolean allSupported)
           
static int[] parseQualifiedSizes(String[] qualifiedSizes, String defaultSize, int[] indexedField, DocumentFactory factory)
           
static int[] parseVirtualDocumentGap(String[] virtualDocumentGapSpec, int[] indexedField, DocumentFactory factory)
           
static VirtualDocumentResolver[] parseVirtualDocumentResolver(String[] virtualDocumentSpec, int[] indexedField, DocumentFactory factory)
           
 void processDocument(int documentPointer, WordReader wordReader)
          Processes a document.
static void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, String renumberingFile, long logInterval, String tempDirName)
          Runs in parallel a number of instances.
static void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName)
          Runs in parallel a number of instances.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

CLUSTER_PROPERTIES_EXTENSION

public static final String CLUSTER_PROPERTIES_EXTENSION
The extension of the strategy for the cluster associated to a scan.

See Also:
Constant Field Values

outOfMemoryError

public boolean outOfMemoryError
If true, this class experienced an OutOfMemoryError during some buffer reallocation.


currMaxPos

protected int[] currMaxPos
The current maximum position for each document, if the field indexed is virtual.


virtualDocumentGap

protected int virtualDocumentGap
The width of the artificial gap introduced between virtual-document fragments.


cutPoints

protected final IntArrayList cutPoints
The cutpoints of the batches (for building later a ContiguousDocumentalStrategy).


DEFAULT_DELIMITER

public static final int DEFAULT_DELIMITER
The default delimiter separating two documents read from standard input (a newline).

See Also:
Constant Field Values

DEFAULT_BATCH_SIZE

public static final int DEFAULT_BATCH_SIZE
The default batch size.

See Also:
Constant Field Values

DEFAULT_BUFFER_SIZE

public static final int DEFAULT_BUFFER_SIZE
The default buffer size.

See Also:
Constant Field Values

DEFAULT_VIRTUAL_DOCUMENT_GAP

public static final int DEFAULT_VIRTUAL_DOCUMENT_GAP
The default virtual field gap.

See Also:
Constant Field Values
Constructor Detail

Scan

public Scan(String basename,
            String field,
            TermProcessor termProcessor,
            boolean documentsAreInOrder,
            int bufferSize,
            ZipDocumentCollectionBuilder builder,
            File batchDir)
     throws FileNotFoundException
Creates a new scanner instance.

Parameters:
basename - the basename (usually a global filename followed by the field name, separated by a dash).
field - the field to be indexed.
termProcessor - the term processor for this index.
documentsAreInOrder - if true, documents will be served in increasing order.
bufferSize - the buffer size used in all I/O.
builder - a builder used to create a compressed document collection on the fly.
batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null.
Throws:
FileNotFoundException

Scan

public Scan(String basename,
            String field,
            TermProcessor termProcessor,
            Scan.IndexingType indexingType,
            int bufferSize,
            ZipDocumentCollectionBuilder builder,
            File batchDir)
     throws FileNotFoundException
Creates a new scanner instance.

Throws:
FileNotFoundException

Scan

public Scan(String basename,
            String field,
            TermProcessor termProcessor,
            Scan.IndexingType indexingType,
            int numVirtualDocs,
            int virtualDocumentGap,
            int bufferSize,
            ZipDocumentCollectionBuilder builder,
            File batchDir)
     throws FileNotFoundException
Creates a new scanner instance.

Parameters:
basename - the basename (usually a global filename followed by the field name, separated by a dash).
field - the field to be indexed.
termProcessor - the term processor for this index.
indexingType - the type of indexing procedure.
numVirtualDocs - the number of virtual documents that will be used, in case of a virtual index; otherwise, immaterial.
virtualDocumentGap - the artificial gap introduced between virtual-document fragments, in case of a virtual index; otherwise, immaterial.
bufferSize - the buffer size used in all I/O.
builder - a builder used to create a compressed document collection on the fly.
batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null.
Throws:
FileNotFoundException
Method Detail

cleanup

public static void cleanup(String basename,
                           int batches,
                           File batchDir)
                    throws IOException
Cleans all intermediate files generated by a run of this class.

Parameters:
basename - the basename of the run.
batches - the number of generated batches.
batchDir - if not null, a temporary directory where the batches are located.
Throws:
IOException

batchBasename

protected static String batchBasename(int batch,
                                      String basename,
                                      File batchDir)
Returns the name of a batch.

You can override this method if you prefer a different batch naming scheme.

Parameters:
batch - the batch number.
basename - the index basename.
batchDir - if not null, a temporary directory for batches.
Returns:
simply basename@batch, if batchDir is null; otherwise, we relativise the name to batchDir.
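
The naming rule can be sketched as follows. This is an illustration and not the actual method body: the class name BatchNameSketch is hypothetical, and resolving the name against batchDir with the two-argument File constructor is only one plausible reading of "relativise".

```java
import java.io.File;

// Hypothetical illustration of the batch naming scheme.
final class BatchNameSketch {
    /** Returns basename@batch, placed inside batchDir when batchDir is not null. */
    static String batchBasename(int batch, String basename, File batchDir) {
        String name = basename + "@" + batch;
        // Assumption: "relativising" means resolving the name against the directory.
        return batchDir == null ? name : new File(batchDir, name).toString();
    }

    public static void main(String[] args) {
        // First batch of a field named "text" for an index with basename "index".
        System.out.println(batchBasename(0, "index-text", null)); // index-text@0
        // The same batch placed in a separate directory (path separator is platform-dependent).
        System.out.println(batchBasename(0, "index-text", new File("batches")));
    }
}
```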

dumpBatch

protected long dumpBatch()
                  throws IOException,
                         ConfigurationException
Dumps the current batch on disk as an index.

Returns:
the number of occurrences contained in the batch.
Throws:
IOException
ConfigurationException

openSizeBitStream

protected void openSizeBitStream()
                          throws FileNotFoundException
Throws:
FileNotFoundException

run

public static void run(String basename,
                       DocumentSequence documentSequence,
                       TermProcessor termProcessor,
                       String zipCollectionBasename,
                       int bufferSize,
                       int documentsPerBatch,
                       int[] indexedField,
                       String renumberingFile,
                       long logInterval,
                       String tempDirName)
                throws ConfigurationException,
                       IOException
Runs in parallel a number of instances.

Throws:
ConfigurationException
IOException

run

public static void run(String basename,
                       DocumentSequence documentSequence,
                       TermProcessor termProcessor,
                       String zipCollectionBasename,
                       int bufferSize,
                       int documentsPerBatch,
                       int[] indexedField,
                       VirtualDocumentResolver[] virtualDocumentResolver,
                       int[] virtualGap,
                       String mapFile,
                       long logInterval,
                       String tempDirName)
                throws ConfigurationException,
                       IOException
Runs in parallel a number of instances.

This convenience method takes care of instantiating one instance per indexed field, and of passing the right information to each instance. All options are common to all fields, except for the number of occurrences in a batch, which can be tuned for each field separately.

Parameters:
basename - the index basename.
documentSequence - a document sequence.
termProcessor - the term processor for this index.
zipCollectionBasename - if not null, the basename of a new GZIP'd collection built using documentSequence.
bufferSize - the buffer size used in all I/O.
documentsPerBatch - the number of documents that we should try to put in each batch.
indexedField - the fields that should be indexed, in increasing order.
virtualDocumentResolver - the array of virtual document resolvers to be used, parallel to indexedField: it can safely contain anything (even null) in correspondence to non-virtual fields, and can safely be null if no fields are virtual.
virtualGap - the array of virtual field gaps to be used, parallel to indexedField: it can safely contain anything in correspondence to non-virtual fields, and can safely be null if no fields are virtual.
mapFile - the name of a file containing a map to be applied to document indices.
logInterval - the minimum time interval between activity logs in milliseconds.
tempDirName - a directory for temporary files.
Throws:
IOException
ConfigurationException

processDocument

public void processDocument(int documentPointer,
                            WordReader wordReader)
                     throws IOException
Processes a document.

Parameters:
documentPointer - the integer pointer associated to the document.
wordReader - the word reader associated to the document.
Throws:
IOException

close

public void close()
           throws ConfigurationException,
                  IOException
Closes this pass, releasing all resources.

Throws:
ConfigurationException
IOException

toString

public String toString()
Overrides:
toString in class Object

parseQualifiedSizes

public static int[] parseQualifiedSizes(String[] qualifiedSizes,
                                        String defaultSize,
                                        int[] indexedField,
                                        DocumentFactory factory)
                                 throws com.martiansoftware.jsap.ParseException
Throws:
com.martiansoftware.jsap.ParseException

parseVirtualDocumentResolver

public static VirtualDocumentResolver[] parseVirtualDocumentResolver(String[] virtualDocumentSpec,
                                                                     int[] indexedField,
                                                                     DocumentFactory factory)

parseVirtualDocumentGap

public static int[] parseVirtualDocumentGap(String[] virtualDocumentGapSpec,
                                            int[] indexedField,
                                            DocumentFactory factory)

parseFieldNames

public static int[] parseFieldNames(String[] indexedFieldName,
                                    DocumentFactory factory,
                                    boolean allSupported)

getSequence

public static DocumentSequence getSequence(String sequenceName,
                                           Class<?> factoryClass,
                                           String[] property,
                                           int delimiter,
                                           Logger logger)
                                    throws IllegalAccessException,
                                           InvocationTargetException,
                                           NoSuchMethodException,
                                           IOException,
                                           ClassNotFoundException,
                                           InstantiationException
Returns the document sequence to be indexed.

Parameters:
sequenceName - the name of a serialised document sequence, or null for standard input.
factoryClass - the class of the DocumentFactory that should be passed to the document sequence.
property - an array of property strings to be used in the factory initialisation.
delimiter - a delimiter in case we want to use standard input.
logger - a logger.
Returns:
the document sequence to be indexed.
Throws:
IllegalAccessException
InvocationTargetException
NoSuchMethodException
IOException
ClassNotFoundException
InstantiationException

main

public static void main(String[] arg)
                 throws com.martiansoftware.jsap.JSAPException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        ConfigurationException,
                        ClassNotFoundException,
                        IOException,
                        IllegalAccessException,
                        InstantiationException
Throws:
com.martiansoftware.jsap.JSAPException
InvocationTargetException
NoSuchMethodException
ConfigurationException
ClassNotFoundException
IOException
IllegalAccessException
InstantiationException