it.unimi.dsi.mg4j.document
Class ZipDocumentCollection

java.lang.Object
  extended by it.unimi.dsi.mg4j.document.AbstractDocumentSequence
      extended by it.unimi.dsi.mg4j.document.AbstractDocumentCollection
          extended by it.unimi.dsi.mg4j.document.ZipDocumentCollection
All Implemented Interfaces:
SafelyCloseable, FlyweightPrototype<DocumentCollection>, DocumentCollection, DocumentSequence, Closeable, Serializable

public class ZipDocumentCollection
extends AbstractDocumentCollection
implements Serializable

A DocumentCollection produced from a document sequence using ZipDocumentCollectionBuilder.

The collection produces the same documents as the sequence from which it was built, subject to the caveat below.

Warning: the Reader returned by Document.content(int) for documents produced by this collection's factory is simply the concatenation of the words and non-words returned by the word reader for that field.
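
To illustrate the warning above, here is a minimal sketch of how content rebuilt from alternating words and non-words can be exposed as a Reader. The class TokenConcatenation and its methods are hypothetical stand-ins written against the standard library only; they are not MG4J types and do not use MG4J's word reader API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only: not part of MG4J.
final class TokenConcatenation {
    /** Rebuilds field content by appending each word followed by its non-word. */
    static Reader concatenate(String[] words, String[] nonWords) {
        if (words.length != nonWords.length) {
            throw new IllegalArgumentException("Each word needs a matching non-word");
        }
        StringBuilder content = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            content.append(words[i]).append(nonWords[i]);
        }
        return new StringReader(content.toString());
    }

    /** Drains a reader into a string. */
    static String readAll(Reader reader) {
        try {
            StringBuilder result = new StringBuilder();
            char[] buffer = new char[1024];
            for (int n; (n = reader.read(buffer)) != -1; ) {
                result.append(buffer, 0, n);
            }
            return result.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader content = concatenate(new String[] { "Hello", "world" }, new String[] { ", ", "!" });
        System.out.println(readAll(content)); // prints "Hello, world!"
    }
}
```

In this sketch the text seen by the caller is whatever the tokens add up to, so anything the tokenizer did not emit as a word or a non-word cannot reappear in the content.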

Like any other collection, this collection is serialized to a file, but it refers to a separate zip file that contains the documents themselves.
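
The split between a small serialized object and a separate zip file can be sketched with the standard library alone. The class ZipBackedHandle below is a hypothetical stand-in, not the MG4J implementation: it assumes the serialized form keeps only the zip filename and reopens the zip file lazily after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only: not the MG4J implementation.
final class ZipBackedHandle implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String zipFilename;  // serialized: only a reference to the zip file
    private transient ZipFile zipFile; // not serialized: reopened on first use

    ZipBackedHandle(String zipFilename) { this.zipFilename = zipFilename; }

    /** Reads one zip entry as UTF-8 text, opening the zip file if needed. */
    String read(String entryName) {
        try {
            if (zipFile == null) zipFile = new ZipFile(zipFilename);
            ZipEntry entry = zipFile.getEntry(entryName);
            if (entry == null) throw new IllegalArgumentException("No entry " + entryName);
            try (InputStream in = zipFile.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void close() {
        try {
            if (zipFile != null) { zipFile.close(); zipFile = null; }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a one-entry zip, serializes a handle, restores it, and reads the entry back. */
    static String demo() {
        try {
            File zip = File.createTempFile("docs", ".zip");
            zip.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip))) {
                out.putNextEntry(new ZipEntry("0"));
                out.write("first document".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                oos.writeObject(new ZipBackedHandle(zip.getPath()));
            }
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                ZipBackedHandle restored = (ZipBackedHandle) ois.readObject();
                try { return restored.read("0"); } finally { restored.close(); }
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

One consequence of this layout is that the serialized object is only usable while the zip file it names is still reachable at the recorded path.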

See Also:
Serialized Form

Nested Class Summary
protected static class ZipDocumentCollection.ZipFactory
          A factory tightly coupled to a ZipDocumentCollection.
 
Field Summary
 
Fields inherited from interface it.unimi.dsi.mg4j.document.DocumentCollection
DEFAULT_EXTENSION
 
Constructor Summary
ZipDocumentCollection(String zipFilename, DocumentFactory underlyingFactory, int numberOfDocuments, boolean exact)
          Constructs a document collection (for reading) corresponding to a given zip collection file.
 
Method Summary
 void close()
          Closes this document sequence, releasing all resources.
 ZipDocumentCollection copy()
           
 Document document(int index)
          Returns the document given its index.
 DocumentFactory factory()
          Returns the factory used by this sequence.
 DocumentIterator iterator()
          Returns an iterator over the sequence of documents.
 Reference2ObjectMap<Enum<?>,Object> metadata(int index)
          Returns the metadata map for a document.
 int size()
          Returns the number of documents in this collection.
 InputStream stream(int index)
          Returns an input stream for the raw content of a document.
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentCollection
ensureDocumentIndex, main, printAllDocuments, toString
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentSequence
finalize
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ZipDocumentCollection

public ZipDocumentCollection(String zipFilename,
                             DocumentFactory underlyingFactory,
                             int numberOfDocuments,
                             boolean exact)
                      throws IOException
Constructs a document collection (for reading) corresponding to a given zip collection file.

Parameters:
zipFilename - the filename of the zip collection.
underlyingFactory - the underlying document factory.
numberOfDocuments - the number of documents.
exact - true iff this is an exact reproduction of the original sequence.
Throws:
IOException
Method Detail

copy

public ZipDocumentCollection copy()
Specified by:
copy in interface FlyweightPrototype<DocumentCollection>
Specified by:
copy in interface DocumentCollection

factory

public DocumentFactory factory()
Description copied from interface: DocumentSequence
Returns the factory used by this sequence.

Every document sequence is based on a document factory that transforms raw bytes into a sequence of characters. The factory contains useful information such as the number of fields.

Specified by:
factory in interface DocumentSequence
Returns:
the factory used by this sequence.

size

public int size()
Description copied from interface: DocumentCollection
Returns the number of documents in this collection.

Specified by:
size in interface DocumentCollection
Returns:
the number of documents in this collection.

document

public Document document(int index)
                  throws IOException
Description copied from interface: DocumentCollection
Returns the document given its index.

Specified by:
document in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the index-th document.
Throws:
IOException
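
The index contract above (0 inclusive, size() exclusive) can be sketched as follows. IndexedTitles is a hypothetical stand-in that stores plain strings; it is not an MG4J type, and the choice of IndexOutOfBoundsException for invalid indices is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an indexed collection; not an MG4J type.
final class IndexedTitles {
    private final List<String> titles;

    IndexedTitles(String... titles) { this.titles = Arrays.asList(titles); }

    int size() { return titles.size(); }

    /** Returns the index-th element, rejecting indices outside [0, size()). */
    String document(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("Index " + index + " is not in [0, " + size() + ")");
        }
        return titles.get(index);
    }

    /** Visits every valid index exactly once, in order. */
    static String joinAll(IndexedTitles collection) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < collection.size(); i++) {
            if (i > 0) joined.append(',');
            joined.append(collection.document(i));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinAll(new IndexedTitles("a", "b", "c"))); // prints "a,b,c"
    }
}
```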

metadata

public Reference2ObjectMap<Enum<?>,Object> metadata(int index)
Description copied from interface: DocumentCollection
Returns the metadata map for a document.

Specified by:
metadata in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the metadata map for the document.
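
The return type is a map from enum constants to arbitrary objects with keys compared by reference. A rough standard-library analogue is java.util.IdentityHashMap, used in the sketch below. The enum Key and the method metadataFor are hypothetical names invented for illustration; the keys actually stored by MG4J are not listed on this page.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical illustration only: the keys and values are invented.
final class MetadataSketch {
    enum Key { TITLE, URI }

    /** Builds a reference-keyed metadata map for the document at the given index. */
    static Map<Enum<?>, Object> metadataFor(int index) {
        Map<Enum<?>, Object> metadata = new IdentityHashMap<>();
        metadata.put(Key.TITLE, "Document " + index);
        metadata.put(Key.URI, "doc:" + index);
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(metadataFor(2).get(Key.TITLE)); // prints "Document 2"
    }
}
```

Enum constants are singletons, so reference comparison and equals() agree on them, which is why a reference-keyed map is a natural fit here.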

stream

public InputStream stream(int index)
                   throws IOException
Description copied from interface: DocumentCollection
Returns an input stream for the raw content of a document.

Specified by:
stream in interface DocumentCollection
Parameters:
index - an index between 0 (inclusive) and DocumentCollection.size() (exclusive).
Returns:
the raw content of the document as an input stream.
Throws:
IOException

iterator

public DocumentIterator iterator()
Description copied from interface: DocumentSequence
Returns an iterator over the sequence of documents.

Warning: this method can safely be called only once. For instance, implementations based on standard input will usually throw an exception if this method is called twice.

Implementations may decide to override this restriction (in particular, if they implement DocumentCollection). Usually, however, it is not possible to obtain two iterators at the same time on a collection.

Specified by:
iterator in interface DocumentSequence
Overrides:
iterator in class AbstractDocumentCollection
Returns:
an iterator over the sequence of documents.
See Also:
DocumentCollection
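
The single-call restriction described above can be sketched with a guard flag. OneShotSequence is a hypothetical class, not an MG4J type, and throwing IllegalStateException on the second call is an assumption of this sketch rather than documented MG4J behaviour.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical one-shot sequence; not an MG4J type.
final class OneShotSequence implements Iterable<String> {
    private final List<String> items;
    private boolean iteratorRequested;

    OneShotSequence(String... items) { this.items = Arrays.asList(items); }

    /** Returns an iterator on the first call and fails on any later call. */
    @Override
    public Iterator<String> iterator() {
        if (iteratorRequested) {
            throw new IllegalStateException("iterator() may be called only once");
        }
        iteratorRequested = true;
        return items.iterator();
    }

    /** Returns true if a second call to iterator() is rejected. */
    static boolean secondCallFails() {
        OneShotSequence sequence = new OneShotSequence("a", "b");
        sequence.iterator();
        try {
            sequence.iterator();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondCallFails()); // prints "true"
    }
}
```

A collection that supports random access can lift this restriction by creating a fresh iterator on every call instead of keeping a guard flag.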

close

public void close()
           throws IOException
Description copied from interface: DocumentSequence
Closes this document sequence, releasing all resources.

You should always call this method after having finished with this document sequence. Implementations are invited to call this method in a finaliser as a safety net (even better, implement SafelyCloseable), but since there is no guarantee as to when finalisers are invoked, you should not depend on this behaviour.

Specified by:
close in interface DocumentSequence
Specified by:
close in interface Closeable
Overrides:
close in class AbstractDocumentSequence
Throws:
IOException
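
Because the class implements Closeable, the advice above maps naturally onto try-with-resources. The sketch below uses a hypothetical TrackedResource class in place of a real collection, so that it runs with the standard library alone.

```java
import java.io.Closeable;

// Hypothetical resource used to show the closing pattern; not an MG4J type.
final class TrackedResource implements Closeable {
    private boolean closed;

    boolean isClosed() { return closed; }

    String read() {
        if (closed) throw new IllegalStateException("Already closed");
        return "data";
    }

    /** Releases the resource; this override does not throw a checked exception. */
    @Override
    public void close() { closed = true; }

    /** Uses the resource in try-with-resources and reports whether it was closed. */
    static boolean closedAfterUse() {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource inUse = resource) {
            inUse.read();
        }
        return resource.isClosed();
    }

    public static void main(String[] args) {
        System.out.println(closedAfterUse()); // prints "true"
    }
}
```

With try-with-resources, close() runs even when the body throws, so no finaliser is needed as a safety net for the common case.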