|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectit.unimi.dsi.fastutil.objects.AbstractObjectCollection<K>
it.unimi.dsi.fastutil.objects.AbstractObjectList<CharSequence>
it.unimi.dsi.mg4j.util.ImmutableExternalPrefixDictionary
dsiutils
.
@Deprecated public abstract class ImmutableExternalPrefixDictionary
An immutable prefix dictionary mostly stored in external memory.
A prefix dictionary is a data structure that stores efficiently a lexicographically ordered list
of character sequences so that it is possible to obtain, given a character sequence s,
its position in the sequence, and also the (possibly empty) interval of the sequence composed by
character sequences that are extensions of s. In other words, a prefix dictionary
is an implementation of a TermMap
and of a PrefixMap
at the same time.
This class releases on a dump stream most of the data that
would be contained in the corresponding internal-memory dictionary
(e.g., a TernaryIntervalSearchTree
).
More precisely, each
block (with user-definable length, possibly the size of a basic disk I/O operation)
is filled as much as possible with strings front coded and compressed with a
PrefixCodec
provided by the abstract factory method getPrefixCodec()
.
Each block starts with the length of the first string in unary, followed by the encoding of the
string. Then, for each string we write in unary the length of the common prefix (in characters)
with the previous string, the length of the remaning suffix (in characters)
and finally the encoded suffix. Note that if the encoding of a string is longer than a block, the string will occupy more than one block.
We keep track using an IntervalApproximator
returned
by the abstract factory method getIntervalApproximator()
of the strings at the start of each block: thus, we are able to retrieve the interval corresponding
to a given prefix by calling getApproximatedInterval()
and scanning at most two blocks.
The two abovementioned abstract factory methods
are the only methods that must be implemented by concrete subclasses. For instance,
ImmutableExternalTreePrefixDictionary
uses a HuffmanCodec
to compress the strings, and keeps the block delimiters in a TernaryIntervalSearchTree
, whereas
ImmutableExternalTriePrefixDictionary
uses a HuTuckerCodec
to compress the strings, and the same codec to store the block delimiters in an ImmutableTriePrefixTree
.
Besides handling externalisation and compression, this abstract implementation has a number of useful features:
ObjectArrayList
,
exposing the list of character sequences, and by means of indexOf()
,
also the inverse mapping; TermMap
and PrefixMap
;
TermMap
's
and PrefixMap
's optional operations).
iterator()
will return an iterator
that just scans linearly the dump stream (instead of using the superclass implementation,
which scans the list using calls to get(int)
in a for
loop).
There are two kinds of external prefix dictionaries: self-contained and non-self-contained. In the first case, you get a serialised object that you can load at any time. The dump stream is serialised with the object and expanded at each deserialisation in the Java temporary directory. If you deserialise a dictionary several times, you will get correspondingly many copies of the dump stream in the temporary directory. The dump streams are deleted when the JVM exits. This mechanism is not very efficient, but since this class implements several interfaces it is essential that clients can make the thing work in a standard way.
Alternatively, you can give at creation time a filename for the dump stream. The resulting non-self-contained external prefix dictionary can be serialised, but after deserialisation you need to set back the dump stream filename or even directly the dump stream (for instance, to an output bit stream wrapping a byte array where the dump stream has been loaded). You can deserialise many copies of an external prefix dictionary, letting all copies share the same dump stream.
This data structure is not synchronised, and concurrent reads may cause problems because of clashes in the usage of the underlying input bit stream. It would not be a good idea in any case to open a new stream for each caller, as that would certainly lead to disk thrashing.
The main method of this class, as usual with immutable containers in MG4J, helps in building large external prefix dictionaries. Of course, you must provide a concrete subclass of this class.
Nested Class Summary | |
---|---|
static interface |
ImmutableExternalPrefixDictionary.IntervalApproximator
Deprecated. A data structure providing queries for approximated prefix intervals. |
Nested classes/interfaces inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectList |
---|
AbstractObjectList.ObjectSubList<K> |
Field Summary | |
---|---|
protected long |
blockSize
Deprecated. The block size of this dictionary (in bits). |
protected Char2IntOpenHashMap |
char2symbol
Deprecated. A map from characters to symbols of the coder. |
protected Decoder |
decoder
Deprecated. A decoder used to read data from the dump stream. |
protected ImmutableExternalPrefixDictionary.IntervalApproximator |
intervalApproximator
Deprecated. The in-memory data structure used to approximate intervals.. |
static long |
serialVersionUID
Deprecated. |
protected int |
size
Deprecated. The number of terms in this dictionary. |
static int |
STD_BLOCK_SIZE
Deprecated. The standard block size (in bytes). |
protected char[] |
symbol2char
Deprecated. A map (given by an array) from symbols in the coder to characters. |
Constructor Summary | |
---|---|
ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms)
Deprecated. Creates an external prefix dictionary with block size STD_BLOCK_SIZE . |
|
ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms,
CharSequence dumpStreamFilename)
Deprecated. Creates an external dictionary with block size STD_BLOCK_SIZE and specified dump stream. |
|
ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms,
int blockSizeInBytes)
Deprecated. Creates an external dictionary with specified block size. |
|
ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms,
int blockSizeInBytes,
CharSequence dumpStreamFilename)
Deprecated. Creates an external dictionary. |
Method Summary | |
---|---|
boolean |
contains(CharSequence term)
Deprecated. |
CharSequence |
get(int index)
Deprecated. |
int |
getIndex(CharSequence s)
Deprecated. |
Interval |
getInterval(CharSequence prefix)
Deprecated. Returns the interval of terms starting with the given prefix. |
protected abstract ImmutableExternalPrefixDictionary.IntervalApproximator |
getIntervalApproximator(List<? extends CharSequence> delimiters,
PrefixCoder prefixCoder)
Deprecated. An abstract factory method returning an interval approximator for this external dictionary. |
int |
getNumber(CharSequence term)
Deprecated. Returns the ordinal number corresponding to the given term, or possibly (but not necessarily) -1 if the term was not indexed. |
CharSequence |
getPrefix(Interval interval)
Deprecated. Returns the maximum prefix common to all terms in the given nonempty interval (optional operation). |
MutableString |
getPrefix(Interval interval,
MutableString prefix)
Deprecated. Writes in the given mutable string the maximum prefix common to all terms in the given nonempty interval (optional operation). |
protected abstract PrefixCodec |
getPrefixCodec(int[] frequency)
Deprecated. An abstract factory method returning a prefix codec that will be used to compress the dump file. |
CharSequence |
getTerm(int index)
Deprecated. Returns the term corresponding to the given ordinal number (optional operation). |
MutableString |
getTerm(int index,
MutableString s)
Deprecated. Writes in the given mutable string the term corresponding to the given ordinal number (optional operation). |
boolean |
hasPrefixes()
Deprecated. Returns true if this prefix map supports prefix retrieval. |
boolean |
hasTerms()
Deprecated. Returns true if this prefix map supports term retrieval. |
int |
indexOf(Object o)
Deprecated. |
ObjectIterator<CharSequence> |
iterator()
Deprecated. Returns an iterator over the dictionary. |
int |
lastIndexOf(Object o)
Deprecated. |
static void |
main(String[] arg)
Deprecated. |
void |
setDumpStream(CharSequence dumpStreamFilename)
Deprecated. Sets the dump stream of this external prefix dictionary to a given filename. |
void |
setDumpStream(InputBitStream dumpStream)
Deprecated. Sets the dump stream of this external prefix dictionary to a given input bit stream. |
int |
size()
Deprecated. Returns the number of terms in this prefix map. |
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectList |
---|
add, add, addAll, addAll, addElements, addElements, compareTo, contains, ensureIndex, ensureRestrictedIndex, equals, getElements, hashCode, listIterator, listIterator, objectListIterator, objectListIterator, objectSubList, peek, pop, push, remove, removeElements, set, size, subList, top, toString |
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectCollection |
---|
clear, containsAll, isEmpty, objectIterator, remove, removeAll, retainAll, toArray, toArray |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Methods inherited from interface java.util.List |
---|
clear, containsAll, isEmpty, remove, removeAll, retainAll, toArray, toArray |
Methods inherited from interface it.unimi.dsi.fastutil.objects.ObjectCollection |
---|
objectIterator, toArray |
Methods inherited from interface it.unimi.dsi.fastutil.Stack |
---|
isEmpty |
Field Detail |
---|
public static final long serialVersionUID
public static final int STD_BLOCK_SIZE
protected final ImmutableExternalPrefixDictionary.IntervalApproximator intervalApproximator
protected final long blockSize
protected final Decoder decoder
protected final char[] symbol2char
protected final Char2IntOpenHashMap char2symbol
protected final int size
Constructor Detail |
---|
public ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms, int blockSizeInBytes, CharSequence dumpStreamFilename) throws IOException
This constructor does not assume that strings returned by terms.iterator()
will be distinct. Thus, it can be safely used with FileLinesCollection
.
terms
- an iterable whose iterator will enumerate in lexicographical order the terms for the dictionary.blockSizeInBytes
- the block size (in bytes).dumpStreamFilename
- the name of the dump stream, or null
for a dictionary
with an automatic dump stream.
IOException
public ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms, CharSequence dumpStreamFilename) throws IOException
STD_BLOCK_SIZE
and specified dump stream.
This constructor does not assume that strings returned by terms.iterator()
will be distinct. Thus, it can be safely used with FileLinesCollection
.
terms
- a collection whose iterator will enumerate in lexicographical order the terms for the dictionary.dumpStreamFilename
- the name of the dump stream, or null
for a dictionary
with an automatic dump stream.
IOException
public ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms, int blockSizeInBytes) throws IOException
This constructor does not assume that strings returned by terms.iterator()
will be distinct. Thus, it can be safely used with FileLinesCollection
.
blockSizeInBytes
- the block size (in bytes).terms
- a collection whose iterator will enumerate in lexicographical order the terms for the dictionary.
IOException
public ImmutableExternalPrefixDictionary(Iterable<? extends CharSequence> terms) throws IOException
STD_BLOCK_SIZE
.
This constructor does not assume that strings returned by terms.iterator()
will be distinct. Thus, it can be safely used with FileLinesCollection
.
terms
- a collection whose iterator will enumerate in lexicographical order the terms for the dictionary.
IOException
Method Detail |
---|
protected abstract PrefixCodec getPrefixCodec(int[] frequency)
Implementing subclasses must provide a codec. The coder will be used
at construction time, and will be passed to getIntervalApproximator(List, PrefixCoder)
,
whereas the decoder will be available in decoder
.
frequency
- the symbol frequencies.
protected abstract ImmutableExternalPrefixDictionary.IntervalApproximator getIntervalApproximator(List<? extends CharSequence> delimiters, PrefixCoder prefixCoder)
delimiters
- a list of the words appearing at the start of each block.prefixCoder
- the prefix coder provided by the prefix coded previously returned by getPrefixCodec(int[])
;
the returned approximator can use its coder/decoder.
public void setDumpStream(CharSequence dumpStreamFilename) throws FileNotFoundException
This method sets the dump file used by this dictionary, and should be only called after deserialisation, providing exactly the file generated at creation time. Essentially anything can happen if you do not follow the rules.
Note that this method will attempt to close the old stream, if present.
dumpStreamFilename
- the name of the dump file.
FileNotFoundException
setDumpStream(InputBitStream)
public void setDumpStream(InputBitStream dumpStream)
This method sets the dump file used by this dictionary, and should be only called after deserialisation, providing a repositionable stream containing exactly the file generated at creation time. Essentially anything can happen if you do not follow the rules.
Using this method you can load an external prefix dictionary in core memory, enjoying the compactness of the data structure, but getting much more speed.
Note that this method will attemp to close the old stream, if present.
dumpStream
- a repositionable input bit stream containing exactly the dump stream generated
at creation time.setDumpStream(CharSequence)
public Interval getInterval(CharSequence prefix)
PrefixMap
getInterval
in interface PrefixMap
prefix
- a prefix.
prefix
(Intervals.EMPTY_INTERVAL
in case no term starts with prefix
).public MutableString getTerm(int index, MutableString s)
TermMap
getTerm
in interface TermMap
index
- a term ordinal number.s
- a mutable string that will be filled with the corresponding term.
term
, or possibly (but not necessarily) null
if the term was not indexed.public CharSequence getTerm(int index)
TermMap
getTerm
in interface TermMap
index
- a term ordinal number.
null
if the term was not indexed.@Deprecated public int getIndex(CharSequence s)
public int getNumber(CharSequence term)
TermMap
We intentionally prefer “ordinal number” to “index” because of the obvious confusion that the latter term can cause.
getNumber
in interface TermMap
term
- a term.
public int indexOf(Object o)
indexOf
in interface List<CharSequence>
indexOf
in class AbstractObjectList<CharSequence>
public int lastIndexOf(Object o)
lastIndexOf
in interface List<CharSequence>
lastIndexOf
in class AbstractObjectList<CharSequence>
public CharSequence get(int index)
get
in interface List<CharSequence>
public boolean contains(CharSequence term)
public boolean hasPrefixes()
PrefixMap
hasPrefixes
in interface PrefixMap
public boolean hasTerms()
TermMap
hasTerms
in interface TermMap
public ObjectIterator<CharSequence> iterator()
The iterator returned by this method scans directly the dump stream. List.listIterator()
,
instead, makes a call to get(int)
for every call to hasNext()
and hasPrevious()
.
Note that the returned iterator uses the same stream as all get methods. Calling such methods while
the iterator is being used will produce an IllegalStateException
.
iterator
in interface ObjectCollection<CharSequence>
iterator
in interface ObjectIterable<CharSequence>
iterator
in interface ObjectList<CharSequence>
iterator
in interface Iterable<CharSequence>
iterator
in interface Collection<CharSequence>
iterator
in interface List<CharSequence>
iterator
in class AbstractObjectList<CharSequence>
public int size()
PrefixMap
size
in interface PrefixMap
size
in interface TermMap
size
in interface Collection<CharSequence>
size
in interface List<CharSequence>
public CharSequence getPrefix(Interval interval)
PrefixMap
getPrefix
in interface PrefixMap
interval
- an interval.
public MutableString getPrefix(Interval interval, MutableString prefix)
PrefixMap
getPrefix
in interface PrefixMap
interval
- an interval.prefix
- a mutable string that will be filled with the maximum prefix common to all terms in the given nonempty interval.
prefix
.public static void main(String[] arg) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, com.martiansoftware.jsap.JSAPException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
com.martiansoftware.jsap.JSAPException
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |