it.unimi.dsi.mg4j.document
Class ZipDocumentCollectionBuilder

java.lang.Object
  extended by it.unimi.dsi.mg4j.document.ZipDocumentCollectionBuilder

public class ZipDocumentCollectionBuilder
extends Object

A builder to create ZipDocumentCollections.

After creating an instance of this class, you can add new documents incrementally. Each document must be started with startDocument(CharSequence, CharSequence) and ended with endDocument(). Inside each document, each non-text field must be written by passing an object to nonTextField(Object), whereas each text field must be started with startTextField() and ended with endTextField(); in between, add(MutableString, MutableString) must be called once for each word/nonword pair retrieved from the original collection. At the end, close() returns a ZipDocumentCollection that must be serialised.

Alternatively, you can just call build(DocumentSequence) and all the above will be handled for you.
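The incremental call order described above can be illustrated with a small stand-in. This is only a sketch: RecordingBuilder is a hypothetical class that mirrors the method names of this builder and records the order of calls; it is not the MG4J class, takes CharSequence instead of MutableString, and writes no zip file.

```java
import java.util.ArrayList;
import java.util.List;

public class CallOrderSketch {
    /** Hypothetical stand-in mirroring the builder's method names; NOT the MG4J class. */
    static class RecordingBuilder {
        final List<String> calls = new ArrayList<>();
        private boolean inTextField;

        void startDocument(CharSequence title, CharSequence uri) { calls.add("startDocument"); }
        void nonTextField(Object o) { calls.add("nonTextField"); }
        void startTextField() { inTextField = true; calls.add("startTextField"); }
        void add(CharSequence word, CharSequence nonWord) {
            // Mirrors the documented behaviour: ignored unless a text field is open.
            if (!inTextField) return;
            calls.add("add:" + word);
        }
        void endTextField() { inTextField = false; calls.add("endTextField"); }
        void endDocument() { calls.add("endDocument"); }
        void close() { calls.add("close"); }
    }

    /** Runs one document through the stand-in and returns the recorded calls. */
    static List<String> demo() {
        RecordingBuilder b = new RecordingBuilder();
        b.startDocument("A title", "http://example.com/a");
        b.nonTextField(Integer.valueOf(2024)); // one non-text field
        b.startTextField();                    // one text field
        b.add("hello", ", ");
        b.add("world", "!");
        b.endTextField();
        b.add("ignored", " ");                 // outside a text field: no effect
        b.endDocument();
        b.close();
        return b.calls;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the real builder, the same sequence would be followed by serialising the ZipDocumentCollection returned by close().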

Each Zip entry corresponds to a document: the title is recorded in the comment field, whereas the URI is written with MutableString.writeSelfDelimUTF8(java.io.OutputStream) directly to the zipped output stream. When building an exact ZipDocumentCollection, subsequent word/nonword pairs are written in the same way and delimited by two empty strings. If the collection is not exact, only words are written, delimited by an empty string. Non-text fields are written directly to the zipped output stream.
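The layout described above can be approximated using only java.util.zip. The following is a sketch under stated assumptions, not the actual on-disk format: DataOutputStream.writeUTF stands in for MutableString.writeSelfDelimUTF8 (the two encodings are not claimed to be byte-compatible), the entry name "0" is a hypothetical choice, and non-text fields are omitted.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipLayoutSketch {
    /** Writes one document as one zip entry: title in the comment, URI and text in the body. */
    static void writeDocument(ZipOutputStream zip, String name, String title, String uri,
                              String[][] pairs, boolean exact) throws IOException {
        ZipEntry entry = new ZipEntry(name);   // entry name is an assumption of this sketch
        entry.setComment(title);               // title goes into the entry comment
        zip.putNextEntry(entry);
        DataOutputStream out = new DataOutputStream(zip);
        out.writeUTF(uri);                     // stand-in for writeSelfDelimUTF8
        for (String[] pair : pairs) {
            out.writeUTF(pair[0]);             // word
            if (exact) out.writeUTF(pair[1]);  // nonword, kept only in exact mode
        }
        out.writeUTF("");                      // delimiter: one empty string...
        if (exact) out.writeUTF("");           // ...or two in exact mode
        out.flush();
        zip.closeEntry();
    }

    /** Writes an exact one-document archive to a temporary file and reads it back. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("zip-layout-sketch", ".zip");
            try {
                try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(tmp))) {
                    writeDocument(zip, "0", "A title", "http://example.com/a",
                            new String[][] { { "hello", ", " }, { "world", "!" } }, true);
                }
                try (ZipFile zipFile = new ZipFile(tmp.toFile())) {
                    ZipEntry entry = zipFile.getEntry("0");
                    StringBuilder sb = new StringBuilder(entry.getComment()).append('|');
                    try (DataInputStream in = new DataInputStream(zipFile.getInputStream(entry))) {
                        sb.append(in.readUTF()).append('|');
                        while (true) {
                            String word = in.readUTF();
                            String nonWord = in.readUTF();
                            if (word.isEmpty() && nonWord.isEmpty()) break; // two empty strings end the field
                            sb.append(word).append(nonWord);
                        }
                    }
                    return sb.toString();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch reads the archive back with ZipFile rather than ZipInputStream because entry comments are stored in the zip central directory, which a streaming reader does not expose.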


Constructor Summary
ZipDocumentCollectionBuilder(String zipFilename, DocumentFactory factory, boolean exact, ProgressLogger progressLogger)
          Creates a new zipped collection builder.
 
Method Summary
 void add(MutableString word, MutableString nonWord)
          Adds a word and a nonword to the current text field, provided that a text field has started but not yet ended; otherwise, it does nothing.
 ZipDocumentCollection build(DocumentSequence inputSequence)
          A utility method copying all documents of an input sequence to a zipped collection.
 ZipDocumentCollection close()
          Terminates the construction of the zipped collection and returns it.
 void endDocument()
          Ends a document entry.
 void endTextField()
          Ends the current text field.
static void main(String[] arg)
           
 void nonTextField(Object o)
          Adds a non-text field.
 void startDocument(CharSequence title, CharSequence uri)
          Starts a document entry.
 void startTextField()
          Starts a new text field.
 void virtualField(ObjectList<Scan.VirtualDocumentFragment> fragments)
          Adds a virtual field.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ZipDocumentCollectionBuilder

public ZipDocumentCollectionBuilder(String zipFilename,
                                    DocumentFactory factory,
                                    boolean exact,
                                    ProgressLogger progressLogger)
                             throws FileNotFoundException
Creates a new zipped collection builder.

Parameters:
zipFilename - the filename of the zip file.
factory - the factory of the base document sequence.
exact - true iff nonwords should be preserved as well.
progressLogger - a progress logger.
Throws:
FileNotFoundException
Method Detail

startDocument

public void startDocument(CharSequence title,
                          CharSequence uri)
                   throws IOException
Starts a document entry.

Parameters:
title - the document title (usually, the result of Document.title()).
uri - the document URI (usually, the result of Document.uri()).
Throws:
IOException

endDocument

public void endDocument()
                 throws IOException
Ends a document entry.

Throws:
IOException

startTextField

public void startTextField()
Starts a new text field.


nonTextField

public void nonTextField(Object o)
                  throws IOException
Adds a non-text field.

Parameters:
o - the content of the non-text field.
Throws:
IOException

virtualField

public void virtualField(ObjectList<Scan.VirtualDocumentFragment> fragments)
                  throws IOException
Adds a virtual field.

Parameters:
fragments - the virtual fragments to be added.
Throws:
IOException

endTextField

public void endTextField()
                  throws IOException
Ends the current text field.

Throws:
IOException

add

public void add(MutableString word,
                MutableString nonWord)
         throws IOException
Adds a word and a nonword to the current text field, provided that a text field has started but not yet ended; otherwise, it does nothing.

Usually, word and nonWord are just the result of a call to WordReader.next(MutableString, MutableString).

Parameters:
word - a word.
nonWord - a nonword.
Throws:
IOException
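
To make the word/nonword pairing concrete, the sketch below splits a string into such pairs. It is a hypothetical stand-in, not MG4J's WordReader: it assumes that a word is a maximal run of letters or digits and that a nonword is whatever follows up to the next word. Each resulting pair is what one call to add would receive.

```java
import java.util.ArrayList;
import java.util.List;

public class WordPairSketch {
    /**
     * Splits text into {word, nonword} pairs. Hypothetical stand-in for a word
     * reader: words are runs of letters/digits, nonwords are everything else.
     */
    static List<String[]> pairs(String text) {
        List<String[]> result = new ArrayList<>();
        int i = 0;
        final int n = text.length();
        while (i < n) {
            int wordStart = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) i++;
            int wordEnd = i;
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) i++;
            // Either part may be empty, e.g. when the text starts with a nonword.
            result.add(new String[] { text.substring(wordStart, wordEnd), text.substring(wordEnd, i) });
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : pairs("Hello, world!")) {
            System.out.println("[" + pair[0] + "][" + pair[1] + "]");
        }
    }
}
```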

close

public ZipDocumentCollection close()
                            throws IOException
Terminates the construction of the zipped collection and returns it.

Throws:
IOException

build

public ZipDocumentCollection build(DocumentSequence inputSequence)
                            throws IOException
A utility method copying all documents of an input sequence to a zipped collection.

Throws:
IOException

main

public static void main(String[] arg)
                 throws com.martiansoftware.jsap.JSAPException,
                        IOException,
                        ClassNotFoundException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        IllegalAccessException,
                        InstantiationException
Throws:
com.martiansoftware.jsap.JSAPException
IOException
ClassNotFoundException
InvocationTargetException
NoSuchMethodException
IllegalAccessException
InstantiationException