it.unimi.dsi.mg4j.tool
Class PartitionDocumentally

java.lang.Object
  extended by it.unimi.dsi.mg4j.tool.PartitionDocumentally

public class PartitionDocumentally
extends Object

Partitions an index documentally.

A global index is partitioned documentally by providing a DocumentalPartitioningStrategy that specifies a destination local index for each document, and a local document pointer. The global index is scanned, and the postings are partitioned among the local indices using the provided strategy. For instance, a ContiguousDocumentalStrategy divides an index into blocks of contiguous documents.
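To make the mapping concrete, here is a minimal sketch of what a contiguous strategy has to compute: for each global document pointer, the destination local index and the local document pointer. The class name, the cut-point array and both method names are hypothetical and chosen for illustration; this is not the MG4J DocumentalPartitioningStrategy interface.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not MG4J's ContiguousDocumentalStrategy. */
final class ContiguousSplitSketch {
    /** cutPoints[k] is the first global document of block k; the last entry is the number of documents. */
    private final int[] cutPoints;

    ContiguousSplitSketch(int... cutPoints) {
        // Assumed to be strictly increasing and to start at 0.
        this.cutPoints = cutPoints.clone();
    }

    /** Returns the local index that receives the given global document. */
    int localIndex(int globalPointer) {
        if (globalPointer < 0 || globalPointer >= cutPoints[cutPoints.length - 1])
            throw new IndexOutOfBoundsException("No such document: " + globalPointer);
        int pos = Arrays.binarySearch(cutPoints, globalPointer);
        // A miss is reported as -(insertionPoint) - 1; the block is the one before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the pointer of the document inside its local index. */
    int localPointer(int globalPointer) {
        return globalPointer - cutPoints[localIndex(globalPointer)];
    }

    public static void main(String[] args) {
        // Ten documents split into two blocks: documents 0..3 and documents 4..9.
        ContiguousSplitSketch strategy = new ContiguousSplitSketch(0, 4, 10);
        for (int doc = 0; doc < 10; doc++)
            System.out.println(doc + " -> index " + strategy.localIndex(doc)
                    + ", local pointer " + strategy.localPointer(doc));
    }
}
```

A non-contiguous strategy would replace the binary search with any other deterministic assignment, as long as local pointers stay increasing within each local index.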

Since each local index contains a (proper) subset of the original set of documents, it contains in general a (proper) subset of the terms in the global index. Thus, the local term numbers and the global term numbers will not in general coincide. As a result, when a set of local indices is accessed transparently as a single index using a DocumentalCluster, a call to Index.documents(int) will throw an UnsupportedOperationException, because there is no way to map the global term numbers to local term numbers.
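The mismatch between global and local term numbers can be seen on a toy index. The sketch below is self-contained and uses hypothetical names; it assumes that a term number is simply the rank of the term in the sorted term list of its index, and it splits the documents into two blocks at a fixed cut point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; none of these names belong to MG4J. */
final class LocalTermNumbersSketch {
    /** Splits a toy index (term -> global document pointers) into two local indices at document cut. */
    static List<TreeMap<String, List<Integer>>> split(Map<String, List<Integer>> global, int cut) {
        List<TreeMap<String, List<Integer>>> local = new ArrayList<>();
        local.add(new TreeMap<>());
        local.add(new TreeMap<>());
        for (Map.Entry<String, List<Integer>> entry : global.entrySet())
            for (int doc : entry.getValue()) {
                int k = doc < cut ? 0 : 1;
                int localPointer = k == 0 ? doc : doc - cut;
                // A term enters a local index only if one of its documents lands there.
                local.get(k).computeIfAbsent(entry.getKey(), t -> new ArrayList<>()).add(localPointer);
            }
        return local;
    }

    /** Assumption: a term number is the rank of the term in the sorted term list; -1 if absent. */
    static int termNumber(TreeMap<String, ?> index, String term) {
        return index.containsKey(term) ? index.headMap(term).size() : -1;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> global = new TreeMap<>();
        global.put("apple", Arrays.asList(0, 1));
        global.put("banana", Arrays.asList(5));
        global.put("cherry", Arrays.asList(1, 6));
        List<TreeMap<String, List<Integer>>> local = split(global, 4);
        System.out.println("global number of cherry: " + termNumber(global, "cherry"));
        for (int k = 0; k < local.size(); k++)
            System.out.println("local index " + k + ": terms " + local.get(k).keySet()
                    + ", number of cherry: " + termNumber(local.get(k), "cherry"));
    }
}
```

In this toy example the first local index never sees "banana", so every term that follows it is numbered differently than in the global index. A lookup by term string is unaffected, which is why only the lookup by term number is unsupported.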

On the other hand, a call to Index.documents(CharSequence) will be passed to each local index to build a global iterator. To speed up this phase for not-so-frequent terms, when partitioning an index you can request the construction of Bloom filters, which are used to avoid querying local indices that do not contain a term. The precision of the filters is configurable.
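The role of the filters can be sketched with a small, self-contained Bloom filter over term strings. Everything here is an assumption made for illustration: the class is hypothetical and is not the filter implementation used by MG4J, the hashing scheme is a simple double hashing over String.hashCode(), and "precision" is modeled as the number of hash functions, with about numHashes / ln 2 bits per expected term (which gives a false-positive rate of roughly 2^-numHashes under the usual idealized analysis).

```java
import java.util.BitSet;

/** Hypothetical illustration only; this is not the Bloom filter class used by MG4J. */
final class TermBloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    /** Assumption: about numHashes / ln 2 bits per expected term. */
    TermBloomFilterSketch(int expectedTerms, int numHashes) {
        if (expectedTerms < 1 || numHashes < 1) throw new IllegalArgumentException();
        this.numHashes = numHashes;
        this.numBits = (int) Math.ceil((double) expectedTerms * numHashes / Math.log(2));
        this.bits = new BitSet(numBits);
    }

    /** Derives the i-th bit position from two hashes of the term (double hashing). */
    private int position(CharSequence term, int i) {
        int h1 = term.toString().hashCode();
        int h2 = (Integer.rotateLeft(h1, 15) * 0x9E3779B9) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(CharSequence term) {
        for (int i = 0; i < numHashes; i++) bits.set(position(term, i));
    }

    /** False means the term is certainly absent; true means it may be present. */
    boolean mightContain(CharSequence term) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(position(term, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        // One filter per local index, filled with the terms of that index.
        String[][] localTerms = { { "apple", "cherry" }, { "banana", "cherry" } };
        TermBloomFilterSketch[] filters = new TermBloomFilterSketch[localTerms.length];
        for (int k = 0; k < localTerms.length; k++) {
            filters[k] = new TermBloomFilterSketch(localTerms[k].length, 8);
            for (String term : localTerms[k]) filters[k].add(term);
        }
        // Only the local indices whose filter answers "maybe" need to be queried.
        for (int k = 0; k < filters.length; k++)
            if (filters[k].mightContain("banana")) System.out.println("query local index " + k);
    }
}
```

A filter never reports a stored term as absent, so skipping a local index on a negative answer cannot lose results; a false positive only costs one unnecessary lookup.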

The property file will use a DocumentalMergedCluster unless you provide a ContiguousDocumentalStrategy, in which case a DocumentalConcatenatedCluster will be used instead. Note that there might be other cases in which the latter is appropriate; in such cases you can edit the property file manually.

Important: this class just partitions the index. No auxiliary files (most notably, term maps or prefix maps) will be generated. To build them, please refer to a StringMap implementation (e.g., ShiftAddXorSignedStringMap or ImmutableExternalPrefixMap).

Warning: variable quanta are not supported by this class, as it is impossible to predict accurately the number of bits used for positions when partitioning documentally. If you want to use variable quanta, use simple interleaved indices without skips as an intermediate step, and pass them through Combine.
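The choice between a merged and a concatenated cluster, mentioned above, comes down to how local results map back to global document order. The following sketch only illustrates that ordering argument; it is not the code of DocumentalMergedCluster or DocumentalConcatenatedCluster, and the class, method and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration only; not the MG4J cluster implementations. */
final class ClusterOrderSketch {
    /**
     * Contiguous blocks: shifting each local list by the first global document of its block
     * and appending the lists in block order already yields increasing global pointers.
     */
    static List<Integer> concatenate(int[][] localPostings, int[] firstGlobalDocument) {
        List<Integer> result = new ArrayList<>();
        for (int k = 0; k < localPostings.length; k++)
            for (int local : localPostings[k]) result.add(firstGlobalDocument[k] + local);
        return result;
    }

    /**
     * Arbitrary strategy: localToGlobal[k][local] is the global pointer of a local document.
     * Global pointers from different local indices interleave, so a priority queue is used
     * here to restore global order.
     */
    static List<Integer> merge(int[][] localPostings, int[][] localToGlobal) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int k = 0; k < localPostings.length; k++)
            for (int local : localPostings[k]) queue.add(localToGlobal[k][local]);
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) result.add(queue.poll());
        return result;
    }

    public static void main(String[] args) {
        // Two contiguous blocks starting at global documents 0 and 4.
        System.out.println(concatenate(new int[][] { { 0, 2 }, { 1 } }, new int[] { 0, 4 }));
        // Even documents in local index 0, odd documents in local index 1.
        System.out.println(merge(new int[][] { { 1, 2 }, { 0 } },
                new int[][] { { 0, 2, 4 }, { 1, 3, 5 } }));
    }
}
```

Concatenation needs no comparisons at all, which is presumably why it is preferred whenever the strategy guarantees that each local index covers a contiguous range of documents.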

Write-once output and distributed index partitioning

Please see PartitionLexically: the same comments apply.

Since:
1.0.1
Author:
Alessandro Arrabito, Sebastiano Vigna

Field Summary
static int DEFAULT_BUFFER_SIZE
          The default buffer size for all involved indices.
 
Constructor Summary
PartitionDocumentally(String inputBasename, String outputBasename, DocumentalPartitioningStrategy strategy, String strategyFilename, int bloomFilterPrecision, int bufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval)
           
 
Method Summary
static void main(String[] arg)
           
 void run()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_BUFFER_SIZE

public static final int DEFAULT_BUFFER_SIZE
The default buffer size for all involved indices.

See Also:
Constant Field Values

Constructor Detail

PartitionDocumentally

public PartitionDocumentally(String inputBasename,
                             String outputBasename,
                             DocumentalPartitioningStrategy strategy,
                             String strategyFilename,
                             int bloomFilterPrecision,
                             int bufferSize,
                             Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags,
                             boolean interleaved,
                             boolean skips,
                             int quantum,
                             int height,
                             int skipBufferSize,
                             long logInterval)
                      throws ConfigurationException,
                             IOException,
                             ClassNotFoundException,
                             SecurityException,
                             InstantiationException,
                             IllegalAccessException
Throws:
ConfigurationException
IOException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException

Method Detail

run

public void run()
         throws Exception
Throws:
Exception

main

public static void main(String[] arg)
                 throws ConfigurationException,
                        IOException,
                        URISyntaxException,
                        ClassNotFoundException,
                        Exception
Throws:
ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
Exception