it.unimi.dsi.mg4j.util
Class ImmutableExternalPrefixDictionary

java.lang.Object
  extended by it.unimi.dsi.fastutil.objects.AbstractObjectCollection<K>
      extended by it.unimi.dsi.fastutil.objects.AbstractObjectList<CharSequence>
          extended by it.unimi.dsi.mg4j.util.ImmutableExternalPrefixDictionary
All Implemented Interfaces:
ObjectCollection<CharSequence>, ObjectIterable<CharSequence>, ObjectList<CharSequence>, Stack<CharSequence>, PrefixMap, TermMap, Serializable, Comparable<List<? extends CharSequence>>, Iterable<CharSequence>, Collection<CharSequence>, List<CharSequence>
Direct Known Subclasses:
ImmutableExternalTreePrefixDictionary, ImmutableExternalTriePrefixDictionary

Deprecated. Moved to dsiutils.

@Deprecated
public abstract class ImmutableExternalPrefixDictionary
extends AbstractObjectList<CharSequence>
implements TermMap, PrefixMap, Serializable

An immutable prefix dictionary mostly stored in external memory.

A prefix dictionary is a data structure that stores efficiently a lexicographically ordered list of character sequences so that it is possible to obtain, given a character sequence s, its position in the sequence, and also the (possibly empty) interval of the sequence composed by character sequences that are extensions of s. In other words, a prefix dictionary is an implementation of a TermMap and of a PrefixMap at the same time.

This class releases on a dump stream most of the data that would be contained in the corresponding internal-memory dictionary (e.g., a TernaryIntervalSearchTree). More precisely, each block (with user-definable length, possibly the size of a basic disk I/O operation) is filled as much as possible with strings front coded and compressed with a PrefixCodec provided by the abstract factory method getPrefixCodec(). Each block starts with the length of the first string in unary, followed by the encoding of the string. Then, for each string we write in unary the length of the common prefix (in characters) with the previous string, the length of the remaning suffix (in characters) and finally the encoded suffix. Note that if the encoding of a string is longer than a block, the string will occupy more than one block.

We keep track using an IntervalApproximator returned by the abstract factory method getIntervalApproximator() of the strings at the start of each block: thus, we are able to retrieve the interval corresponding to a given prefix by calling getApproximatedInterval() and scanning at most two blocks.

The two abovementioned abstract factory methods are the only methods that must be implemented by concrete subclasses. For instance, ImmutableExternalTreePrefixDictionary uses a HuffmanCodec to compress the strings, and keeps the block delimiters in a TernaryIntervalSearchTree, whereas ImmutableExternalTriePrefixDictionary uses a HuTuckerCodec to compress the strings, and the same codec to store the block delimiters in an ImmutableTriePrefixTree.

Besides handling externalisation and compression, this abstract implementation has a number of useful features:

Self-contained or non-self-contained

There are two kinds of external prefix dictionaries: self-contained and non-self-contained. In the first case, you get a serialised object that you can load at any time. The dump stream is serialised with the object and expanded at each deserialisation in the Java temporary directory. If you deserialise a dictionary several times, you will get correspondingly many copies of the dump stream in the temporary directory. The dump streams are deleted when the JVM exits. This mechanism is not very efficient, but since this class implements several interfaces it is essential that clients can make the thing work in a standard way.

Alternatively, you can give at creation time a filename for the dump stream. The resulting non-self-contained external prefix dictionary can be serialised, but after deserialisation you need to set back the dump stream filename or even directly the dump stream (for instance, to an output bit stream wrapping a byte array where the dump stream has been loaded). You can deserialise many copies of an external prefix dictionary, letting all copies share the same dump stream.

This data structure is not synchronised, and concurrent reads may cause problems because of clashes in the usage of the underlying input bit stream. It would not be a good idea in any case to open a new stream for each caller, as that would certainly lead to disk thrashing.

The main method of this class, as usual with immutable containers in MG4J, helps in building large external prefix dictionaries. Of course, you must provide a concrete subclass of this class.

Since:
0.9.3
Author:
Sebastiano Vigna
See Also:
Serialized Form

Nested Class Summary
static interface ImmutableExternalPrefixDictionary.IntervalApproximator
          Deprecated. A data structure providing queries for approximated prefix intervals.
 
Nested classes/interfaces inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectList
AbstractObjectList.ObjectSubList<K>
 
Field Summary
protected  long blockSize
          Deprecated. The block size of this dictionary (in bits).
protected  Char2IntOpenHashMap char2symbol
          Deprecated. A map from characters to symbols of the coder.
protected  Decoder decoder
          Deprecated. A decoder used to read data from the dump stream.
protected  ImmutableExternalPrefixDictionary.IntervalApproximator intervalApproximator
          Deprecated. The in-memory data structure used to approximate intervals..
static long serialVersionUID
          Deprecated.  
protected  int size
          Deprecated. The number of terms in this dictionary.
static int STD_BLOCK_SIZE
          Deprecated. The standard block size (in bytes).
protected  char[] symbol2char
          Deprecated. A map (given by an array) from symbols in the coder to characters.
 
Constructor Summary
ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms)
          Deprecated. Creates an external prefix dictionary with block size STD_BLOCK_SIZE.
ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms, CharSequence dumpStreamFilename)
          Deprecated. Creates an external dictionary with block size STD_BLOCK_SIZE and specified dump stream.
ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms, int blockSizeInBytes)
          Deprecated. Creates an external dictionary with specified block size.
ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms, int blockSizeInBytes, CharSequence dumpStreamFilename)
          Deprecated. Creates an external dictionary.
 
Method Summary
 boolean contains(CharSequence term)
          Deprecated.  
 CharSequence get(int index)
          Deprecated.  
 int getIndex(CharSequence s)
          Deprecated. 
 Interval getInterval(CharSequence prefix)
          Deprecated. Returns the interval of terms starting with the given prefix.
protected abstract  ImmutableExternalPrefixDictionary.IntervalApproximator getIntervalApproximator(List<? extends CharSequence> delimiters, PrefixCoder prefixCoder)
          Deprecated. An abstract factory method returning an interval approximator for this external dictionary.
 int getNumber(CharSequence term)
          Deprecated. Returns the ordinal number corresponding to the given term, or possibly (but not necessarily) -1 if the term was not indexed.
 CharSequence getPrefix(Interval interval)
          Deprecated. Returns the maximum prefix common to all terms in the given nonempty interval (optional operation).
 MutableString getPrefix(Interval interval, MutableString prefix)
          Deprecated. Writes in the given mutable string the maximum prefix common to all terms in the given nonempty interval (optional operation).
protected abstract  PrefixCodec getPrefixCodec(int[] frequency)
          Deprecated. An abstract factory method returning a prefix codec that will be used to compress the dump file.
 CharSequence getTerm(int index)
          Deprecated. Returns the term corresponding to the given ordinal number (optional operation).
 MutableString getTerm(int index, MutableString s)
          Deprecated. Writes in the given mutable string the term corresponding to the given ordinal number (optional operation).
 boolean hasPrefixes()
          Deprecated. Returns true if this prefix map supports prefix retrieval.
 boolean hasTerms()
          Deprecated. Returns true if this prefix map supports term retrieval.
 int indexOf(Object o)
          Deprecated.  
 ObjectIterator<CharSequence> iterator()
          Deprecated. Returns an iterator over the dictionary.
 int lastIndexOf(Object o)
          Deprecated.  
static void main(String[] arg)
          Deprecated.  
 void setDumpStream(CharSequence dumpStreamFilename)
          Deprecated. Sets the dump stream of this external prefix dictionary to a given filename.
 void setDumpStream(InputBitStream dumpStream)
          Deprecated. Sets the dump stream of this external prefix dictionary to a given input bit stream.
 int size()
          Deprecated. Returns the number of terms in this prefix map.
 
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectList
add, add, addAll, addAll, addElements, addElements, compareTo, contains, ensureIndex, ensureRestrictedIndex, equals, getElements, hashCode, listIterator, listIterator, objectListIterator, objectListIterator, objectSubList, peek, pop, push, remove, removeElements, set, size, subList, top, toString
 
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectCollection
clear, containsAll, isEmpty, objectIterator, remove, removeAll, retainAll, toArray, toArray
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
clear, containsAll, isEmpty, remove, removeAll, retainAll, toArray, toArray
 
Methods inherited from interface it.unimi.dsi.fastutil.objects.ObjectCollection
objectIterator, toArray
 
Methods inherited from interface it.unimi.dsi.fastutil.Stack
isEmpty
 

Field Detail

serialVersionUID

public static final long serialVersionUID
Deprecated. 
See Also:
Constant Field Values

STD_BLOCK_SIZE

public static final int STD_BLOCK_SIZE
Deprecated. 
The standard block size (in bytes).

See Also:
Constant Field Values

intervalApproximator

protected final ImmutableExternalPrefixDictionary.IntervalApproximator intervalApproximator
Deprecated. 
The in-memory data structure used to approximate intervals..


blockSize

protected final long blockSize
Deprecated. 
The block size of this dictionary (in bits).


decoder

protected final Decoder decoder
Deprecated. 
A decoder used to read data from the dump stream.


symbol2char

protected final char[] symbol2char
Deprecated. 
A map (given by an array) from symbols in the coder to characters.


char2symbol

protected final Char2IntOpenHashMap char2symbol
Deprecated. 
A map from characters to symbols of the coder.


size

protected final int size
Deprecated. 
The number of terms in this dictionary.

Constructor Detail

ImmutableExternalPrefixDictionary

public ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms,
                                         int blockSizeInBytes,
                                         CharSequence dumpStreamFilename)
                                  throws IOException
Deprecated. 
Creates an external dictionary.

This constructor does not assume that strings returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

Parameters:
terms - an iterable whose iterator will enumerate in lexicographical order the terms for the dictionary.
blockSizeInBytes - the block size (in bytes).
dumpStreamFilename - the name of the dump stream, or null for a dictionary with an automatic dump stream.
Throws:
IOException

ImmutableExternalPrefixDictionary

public ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms,
                                         CharSequence dumpStreamFilename)
                                  throws IOException
Deprecated. 
Creates an external dictionary with block size STD_BLOCK_SIZE and specified dump stream.

This constructor does not assume that strings returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

Parameters:
terms - a collection whose iterator will enumerate in lexicographical order the terms for the dictionary.
dumpStreamFilename - the name of the dump stream, or null for a dictionary with an automatic dump stream.
Throws:
IOException

ImmutableExternalPrefixDictionary

public ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms,
                                         int blockSizeInBytes)
                                  throws IOException
Deprecated. 
Creates an external dictionary with specified block size.

This constructor does not assume that strings returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

Parameters:
blockSizeInBytes - the block size (in bytes).
terms - a collection whose iterator will enumerate in lexicographical order the terms for the dictionary.
Throws:
IOException

ImmutableExternalPrefixDictionary

public ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms)
                                  throws IOException
Deprecated. 
Creates an external prefix dictionary with block size STD_BLOCK_SIZE.

This constructor does not assume that strings returned by terms.iterator() will be distinct. Thus, it can be safely used with FileLinesCollection.

Parameters:
terms - a collection whose iterator will enumerate in lexicographical order the terms for the dictionary.
Throws:
IOException
Method Detail

getPrefixCodec

protected abstract PrefixCodec getPrefixCodec(int[] frequency)
Deprecated. 
An abstract factory method returning a prefix codec that will be used to compress the dump file.

Implementing subclasses must provide a codec. The coder will be used at construction time, and will be passed to getIntervalApproximator(List, PrefixCoder), whereas the decoder will be available in decoder.

Parameters:
frequency - the symbol frequencies.
Returns:
the prefix codec that will be used to compress the dump file.

getIntervalApproximator

protected abstract ImmutableExternalPrefixDictionary.IntervalApproximator getIntervalApproximator(List<? extends CharSequence> delimiters,
                                                                                                  PrefixCoder prefixCoder)
Deprecated. 
An abstract factory method returning an interval approximator for this external dictionary.

Parameters:
delimiters - a list of the words appearing at the start of each block.
prefixCoder - the prefix coder provided by the prefix coded previously returned by getPrefixCodec(int[]); the returned approximator can use its coder/decoder.
Returns:
an interval approximator for this external dictionary.

setDumpStream

public void setDumpStream(CharSequence dumpStreamFilename)
                   throws FileNotFoundException
Deprecated. 
Sets the dump stream of this external prefix dictionary to a given filename.

This method sets the dump file used by this dictionary, and should be only called after deserialisation, providing exactly the file generated at creation time. Essentially anything can happen if you do not follow the rules.

Note that this method will attempt to close the old stream, if present.

Parameters:
dumpStreamFilename - the name of the dump file.
Throws:
FileNotFoundException
See Also:
setDumpStream(InputBitStream)

setDumpStream

public void setDumpStream(InputBitStream dumpStream)
Deprecated. 
Sets the dump stream of this external prefix dictionary to a given input bit stream.

This method sets the dump file used by this dictionary, and should be only called after deserialisation, providing a repositionable stream containing exactly the file generated at creation time. Essentially anything can happen if you do not follow the rules.

Using this method you can load an external prefix dictionary in core memory, enjoying the compactness of the data structure, but getting much more speed.

Note that this method will attemp to close the old stream, if present.

Parameters:
dumpStream - a repositionable input bit stream containing exactly the dump stream generated at creation time.
See Also:
setDumpStream(CharSequence)

getInterval

public Interval getInterval(CharSequence prefix)
Deprecated. 
Description copied from interface: PrefixMap
Returns the interval of terms starting with the given prefix.

Specified by:
getInterval in interface PrefixMap
Parameters:
prefix - a prefix.
Returns:
the interval of terms starting with prefix (Intervals.EMPTY_INTERVAL in case no term starts with prefix).

getTerm

public MutableString getTerm(int index,
                             MutableString s)
Deprecated. 
Description copied from interface: TermMap
Writes in the given mutable string the term corresponding to the given ordinal number (optional operation).

Specified by:
getTerm in interface TermMap
Parameters:
index - a term ordinal number.
s - a mutable string that will be filled with the corresponding term.
Returns:
term, or possibly (but not necessarily) null if the term was not indexed.

getTerm

public CharSequence getTerm(int index)
Deprecated. 
Description copied from interface: TermMap
Returns the term corresponding to the given ordinal number (optional operation).

Specified by:
getTerm in interface TermMap
Parameters:
index - a term ordinal number.
Returns:
the corresponding term, or possibly (but not necessarily) null if the term was not indexed.

getIndex

@Deprecated
public int getIndex(CharSequence s)
Deprecated. 


getNumber

public int getNumber(CharSequence term)
Deprecated. 
Description copied from interface: TermMap
Returns the ordinal number corresponding to the given term, or possibly (but not necessarily) -1 if the term was not indexed.

We intentionally prefer “ordinal number” to “index” because of the obvious confusion that the latter term can cause.

Specified by:
getNumber in interface TermMap
Parameters:
term - a term.
Returns:
its ordinal number, or possibly (but not necessarily) -1 if the term was not indexed.

indexOf

public int indexOf(Object o)
Deprecated. 
Specified by:
indexOf in interface List<CharSequence>
Overrides:
indexOf in class AbstractObjectList<CharSequence>

lastIndexOf

public int lastIndexOf(Object o)
Deprecated. 
Specified by:
lastIndexOf in interface List<CharSequence>
Overrides:
lastIndexOf in class AbstractObjectList<CharSequence>

get

public CharSequence get(int index)
Deprecated. 
Specified by:
get in interface List<CharSequence>

contains

public boolean contains(CharSequence term)
Deprecated. 

hasPrefixes

public boolean hasPrefixes()
Deprecated. 
Description copied from interface: PrefixMap
Returns true if this prefix map supports prefix retrieval.

Specified by:
hasPrefixes in interface PrefixMap
Returns:
true if this prefix map supports prefix retrieval.

hasTerms

public boolean hasTerms()
Deprecated. 
Description copied from interface: TermMap
Returns true if this prefix map supports term retrieval.

Specified by:
hasTerms in interface TermMap
Returns:
true if this prefix map supports term retrieval.

iterator

public ObjectIterator<CharSequence> iterator()
Deprecated. 
Returns an iterator over the dictionary.

The iterator returned by this method scans directly the dump stream. List.listIterator(), instead, makes a call to get(int) for every call to hasNext() and hasPrevious().

Note that the returned iterator uses the same stream as all get methods. Calling such methods while the iterator is being used will produce an IllegalStateException.

Specified by:
iterator in interface ObjectCollection<CharSequence>
Specified by:
iterator in interface ObjectIterable<CharSequence>
Specified by:
iterator in interface ObjectList<CharSequence>
Specified by:
iterator in interface Iterable<CharSequence>
Specified by:
iterator in interface Collection<CharSequence>
Specified by:
iterator in interface List<CharSequence>
Overrides:
iterator in class AbstractObjectList<CharSequence>
Returns:
an iterator over the dictionary that just scans the dump stream.

size

public int size()
Deprecated. 
Description copied from interface: PrefixMap
Returns the number of terms in this prefix map.

Specified by:
size in interface PrefixMap
Specified by:
size in interface TermMap
Specified by:
size in interface Collection<CharSequence>
Specified by:
size in interface List<CharSequence>
Returns:
the number of terms in this prefix map.

getPrefix

public CharSequence getPrefix(Interval interval)
Deprecated. 
Description copied from interface: PrefixMap
Returns the maximum prefix common to all terms in the given nonempty interval (optional operation).

Specified by:
getPrefix in interface PrefixMap
Parameters:
interval - an interval.
Returns:
the maximum prefix common to all terms in the given nonempty interval.

getPrefix

public MutableString getPrefix(Interval interval,
                               MutableString prefix)
Deprecated. 
Description copied from interface: PrefixMap
Writes in the given mutable string the maximum prefix common to all terms in the given nonempty interval (optional operation).

Specified by:
getPrefix in interface PrefixMap
Parameters:
interval - an interval.
prefix - a mutable string that will be filled with the maximum prefix common to all terms in the given nonempty interval.
Returns:
prefix.

main

public static void main(String[] arg)
                 throws ClassNotFoundException,
                        IOException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        com.martiansoftware.jsap.JSAPException
Deprecated. 
Throws:
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
com.martiansoftware.jsap.JSAPException