|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectit.unimi.dsi.mg4j.index.AbstractTermMap
it.unimi.dsi.mg4j.util.MinimalPerfectHash
@Deprecated public class MinimalPerfectHash
Order-preserving minimal perfect hash tables.
Given a list of terms without duplicates,
the constructors of this class finds an order-preserving minimal perfect hash function for
the list. Subsequent calls to the getNumber(CharSequence)
method will return the ordinal position of
the provided character sequence (if it appeared in the list; otherwise, the result is a random position). The class
can then be saved by serialisation and reused later.
If you also need to establish whether a given term was or not in the
list, you should consider using a SignedMinimalPerfectHash
implementing class instead.
This class is very scalable, and if you have enough memory it will handle efficiently hundreds of millions of terms: in particular, the offline constructor can build a map without loading the terms in memory.
To do its job, the class must store three vectors of weights that are used to compute
certain hash functions. By default, the vectors are long as the longest term, but if
your collection is sorted you can ask (passing WEIGHT_UNKNOWN_SORTED_TERMS
to a constructor)
for the shortest possible vector length. This optimisation does not change the
memory footprint, but it can improve performance.
As a commodity, this class provides a main method that reads from
standard input a (possibly gzip'd) sequence of newline-separated terms, and
writes a serialised minimal perfect hash table for the given list. You can
specify on the command line the kind of table (e.g.,
HashCodeSignedMinimalPerfectHash
) and have it fetched by reflection.
As a commodity, all signed classes in MG4J have a main method invoking the parameterised main method of this class,
which accept the default class to be built. In this way, running the main method of
any signed class provides the same features.
For efficiency, there are also method that access a minimal perfect hash using byte arrays interpreted as ISO-8859-1 characters.
An object of this class uses about 1.26n integers between 0 and n-1 inclusive to hash n terms. This is asymptotically optimal, but for small collections using an integer wastes a large number of bits in each entry. At construction time, however, about 15n integers (i.e., 60n bytes) are necessary.
The technique used here is suggested (but not fully described) in the second edition of Managing Gigabytes. It is due to Havas, Majewski, Wormald and Czech, and it is fully described in the monograph by Czech, Havas and Majewski on perfect hashing (“Perfect Hashing”, Theoret. Comput. Sci. 182:1−143, 1997).
First, a triple of intermediate hash functions (generated via universal hashing) define for each term a 3-hyperedge in a hypergraph with 1.26n vertices. Each intermediate hash function uses a vector of random weights; the length of the vector is by default the length of the longest term, but if the collection is sorted it is possible to compute the minimum length of a prefix that will distinguish any pair of term.
Then, by successively stripping hyperedges with one vertex of degree one, we create an ordering on the hyperedges. Finally, we assign (in reverse order) to each vertex of a hyperedge a number in the range 0 and n-1 in such a way that the sum of the three numbers modulo n is exactly the original index of the term corresponding to the hyperedge. This is possible with high probability, so we try until we succeed.
Note that the mathematical results guaranteeing that it is possible to find a
function in expected constant time are asymptotic.
Thus, when the number of terms is less than TERM_THRESHOLD
, an instance
of this class just stores the terms in a vector and scans them linearly. This behaviour,
however, is completely transparent to the user.
Building a minimal perfect hash map may take hours. As it happens with all probabilistic algorithms, one can just give estimates of the expected time.
There are two probabilistic sources of problems: degenerate hyperedges and non-acyclic hypergraphs. The ratio between the number of vertices and the number of terms guarantee that acyclicity is true with high probability. However, when generating the hypergraph we use three hash functions, and it must never happen that the value of two of those functions coincide on a given term. Because of the birthday paradox, the probability of getting a nondegerate edge is just
(m-1)(m-2)/m2,where m=1.26n. Since this must happen for n times, we must raise this probability to n, and as n grows the value quickly (and monotonically) reaches e-3/1.26. So the expected number of trials to generate a random 3-hypergraph would be bounded by e3/1.26, which is about 11. Note that this bound does not depend on n.
However, starting from MG4J 1.2 this class will patch deterministically the hash functions so that
degenerate edges are extremely unlikely (in fact, so unlikely they never happen). One round of
generation should be sufficient for generating a valid hypergraph. The fix is performed by adding
NODE_OVERHEAD
nodes to the graph, and using them to remap the random hash functions in case
of clashes. The fix must be repeated, of course, each time getNumber(MutableString)
is called,
but in the vaste majority of cases it reduces to two checks for equality with negative result.
Once the hypergraph has been generated, the stripping procedure may fail. However, the expected number of trials tends to 1 as n approaches infinity (Czech, Havas and Majewski, for instance, report that on a set of 50,000 terms they obtained consistently one trial for more than 5000 experiments).
To help diagnosing problem with the generation process
class, this class will log at INFO
level
what's happening. In particular, it will signal whenever a degenerate
hyperedge is generated, and the various construction phases.
Note that if during the generation process the log warns more than once about duplicate hyperedges, you should suspect that there are duplicates in the term list, as duplicate hyperedges are extremely unlikely.
Field Summary | |
---|---|
static float |
ENLARGEMENT_FACTOR
Deprecated. The number of nodes the graph will actually have. |
protected int[] |
g
Deprecated. The final magick—the function turning the intermediate hash values into the final bucket. |
protected int[] |
init
Deprecated. Initialisation values for the intermediate hash functions. |
protected int |
m
Deprecated. The number of vertices of the intermediate hypergraph, excluding the overhead nodes. |
protected int |
n
Deprecated. The number of buckets. |
protected long |
n4
Deprecated. Four times the number of buckets. |
static int |
NODE_OVERHEAD
Deprecated. The overhead in term of graphs nodes that is used to solve deterministically node clashes. |
protected int |
rightShift
Deprecated. The maximum amount of right shift to perform on a 32-bit number so to obtain something greater than or equal to m . |
static long |
serialVersionUID
Deprecated. |
protected CharSequence[] |
t
Deprecated. If n is smaller than TERM_THRESHOLD , a vector containing the terms. |
static int |
TERM_THRESHOLD
Deprecated. The minimum number of terms that will trigger the construction of a minimal perfect hash; overwise, terms are simply stored in a vector. |
static int |
WEIGHT_UNKNOWN
Deprecated. A special value denoting that the weight length is unknown, and should be computed using the maximum length of a term. |
static int |
WEIGHT_UNKNOWN_SORTED_TERMS
Deprecated. A special value denoting that the weight length is unknown, and should be computed assuming that the terms appear in lexicographic order. |
protected int[] |
weight0
Deprecated. Vector of weights to compute the first intermediate hash function. |
protected int[] |
weight1
Deprecated. Vector of weights to compute the second intermediate hash function. |
protected int[] |
weight2
Deprecated. Vector of weights to compute the third intermediate hash function. |
protected int |
weightLength
Deprecated. The length of the components of the weight vectors (it's faster than asking the length of the vectors). |
Constructor Summary | |
---|---|
|
MinimalPerfectHash(Iterable<? extends CharSequence> terms)
Deprecated. Creates a new order-preserving minimal perfect hash table for the given terms, using as many weights as the longest term in the collection. |
|
MinimalPerfectHash(Iterable<? extends CharSequence> terms,
int weightLength)
Deprecated. Creates a new order-preserving minimal perfect hash table for the given terms using the given number of weights. |
protected |
MinimalPerfectHash(MinimalPerfectHash mph)
Deprecated. Creates a new minimal perfect hash by copying a given one; non-transient fields are (shallow) copied. |
|
MinimalPerfectHash(String termFile,
String encoding)
Deprecated. Creates a new order-preserving minimal perfect hash table for the given file of terms. |
|
MinimalPerfectHash(String termFile,
String encoding,
boolean zipped)
Deprecated. Creates a new order-preserving minimal perfect hash table for the (possibly gzip'd) given file of terms. |
|
MinimalPerfectHash(String termFile,
String encoding,
int weightLength)
Deprecated. Creates a new order-preserving minimal perfect hash table for the given file of terms using the given number of weights. |
|
MinimalPerfectHash(String termFile,
String encoding,
int weightLength,
boolean zipped)
Deprecated. Creates a new order-preserving minimal perfect hash table for the (possibly gzip'd) given file of terms using the given number of weights. |
Method Summary | |
---|---|
protected int |
getFromT(CharSequence term)
Deprecated. Gets a term out of the stored array t . |
int |
getNumber(byte[] a)
Deprecated. Hashes a term given as a byte array interpreted in the ISO-8859-1 charset encoding. |
int |
getNumber(byte[] a,
int off,
int len)
Deprecated. Hashes a term given as a byte-array fragment interpreted in the ISO-8859-1 charset encoding. |
int |
getNumber(CharSequence term)
Deprecated. Hashes a given term. |
int |
getNumber(MutableString term)
Deprecated. Hashes a given term. |
protected void |
hash(CharSequence term,
int[] h)
Deprecated. Hashes a given term using the intermediate hash functions. |
boolean |
hasTerms()
Deprecated. Returns true if this prefix map supports term retrieval. |
protected static void |
main(Class<? extends MinimalPerfectHash> klass,
String[] arg)
Deprecated. A main method for minimal perfect hash construction that accepts a default class. |
static void |
main(String[] arg)
Deprecated. |
int |
size()
Deprecated. Returns the number of terms hashed. |
int |
weightLength()
Deprecated. Returns the length of the weight vectors. |
Methods inherited from class it.unimi.dsi.mg4j.index.AbstractTermMap |
---|
getIndex, getTerm, getTerm |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final float ENLARGEMENT_FACTOR
public static final int NODE_OVERHEAD
public static final int TERM_THRESHOLD
public static final int WEIGHT_UNKNOWN
public static final int WEIGHT_UNKNOWN_SORTED_TERMS
protected final int n
protected final int m
protected final int rightShift
m
.
protected final int[] init
protected final int[] weight0
protected final int[] weight1
protected final int[] weight2
protected final int weightLength
protected final int[] g
protected transient long n4
protected transient CharSequence[] t
n
is smaller than TERM_THRESHOLD
, a vector containing the terms.
public static final long serialVersionUID
Constructor Detail |
---|
public MinimalPerfectHash(Iterable<? extends CharSequence> terms)
Caution: if the collection contains twice the same term, this constructor will never return.
terms
- some terms to hash; it is assumed that there are no duplicates.public MinimalPerfectHash(Iterable<? extends CharSequence> terms, int weightLength)
This constructor can be used to force the use of a reduced number of weights if you are sure that
the first weightLength
characters of each term in terms
are
distinct.
If you do not know your weight length in advance, you can pass two
special values as weightLength
: WEIGHT_UNKNOWN
will force the computation of weightLength
as the maximum
length of a term, whereas WEIGHT_UNKNOWN_SORTED_TERMS
forces
the assumption that terms are sorted: in this case, we search for
the minimum prefix that will disambiguate all terms in the collection
(a shorter prefix yields faster lookups).
Caution: if two terms in the collection have a
common prefix of length weightLength
this constructor will
never return.
terms
- some terms to hash; if weightLength
is specified explicitly, it is assumed that there are no terms with a common prefix of
weightLength
characters.weightLength
- the number of weights used generating the
intermediate hash functions, WEIGHT_UNKNOWN
or WEIGHT_UNKNOWN_SORTED_TERMS
.MinimalPerfectHash(Iterable)
public MinimalPerfectHash(String termFile, String encoding, int weightLength, boolean zipped)
termFile
- an file containing one term on each line; it is assumed that
it does not contain terms with a common prefix of
weightLength
characters.encoding
- the encoding of termFile
; if null
, it
is assumed to be the platform default encoding.weightLength
- the number of weights used generating the
intermediate hash functions, WEIGHT_UNKNOWN
or WEIGHT_UNKNOWN_SORTED_TERMS
.zipped
- if true, the provided file is zipped and will be opened using a GZIPInputStream
.MinimalPerfectHash(Iterable, int)
public MinimalPerfectHash(String termFile, String encoding, boolean zipped)
termFile
- an file containing one term on each line; it is assumed that
it does not contain terms with a common prefix of
weightLength
characters.encoding
- the encoding of termFile
; if null
, it
is assumed to be the platform default encoding.zipped
- if true, the provided file is zipped and will be opened using a GZIPInputStream
.MinimalPerfectHash(Iterable, int)
public MinimalPerfectHash(String termFile, String encoding, int weightLength)
termFile
- an file containing one term on each line; it is assumed that
it does not contain terms with a common prefix of
weightLength
characters.encoding
- the encoding of termFile
; if null
, it
is assumed to be the platform default encoding.weightLength
- the number of weights used generating the
intermediate hash functions, WEIGHT_UNKNOWN
or WEIGHT_UNKNOWN_SORTED_TERMS
.MinimalPerfectHash(Iterable, int)
public MinimalPerfectHash(String termFile, String encoding)
termFile
- an file containing one term on each line; it is assumed that
it does not contain terms with a common prefix of
weightLength
characters.encoding
- the encoding of termFile
; if null
, it
is assumed to be the platform default encoding.MinimalPerfectHash(Iterable, int)
protected MinimalPerfectHash(MinimalPerfectHash mph)
mph
- the perfect hash to be copied.Method Detail |
---|
protected void hash(CharSequence term, int[] h)
term
- a term to hash.h
- a three-element vector that will be filled with the three intermediate hash values.public int getNumber(CharSequence term)
getNumber
in interface TermMap
term
- a term to be hashed.
public int getNumber(MutableString term)
term
- a term to be hashed.
public int getNumber(byte[] a, int off, int len)
a
- a byte array.off
- the first valid byte in a
.len
- the number of bytes composing the term, starting at off
.
len
bytes starting at off
(interpreted
as ISO-8859-1 characters) in the generating collection, starting from 0. If the
term was not in the original collection, the result is a random position.public int getNumber(byte[] a)
a
- a byte array.
a
(interpreted
as ISO-8859-1 characters) in the generating collection, starting from 0. If the
term was not in the original collection, the result is a random position.protected int getFromT(CharSequence term)
t
.
Note: This method does not check for t
being non-null
.
term
- a term.
public int weightLength()
public int size()
size
in interface TermMap
public boolean hasTerms()
TermMap
hasTerms
in interface TermMap
protected static void main(Class<? extends MinimalPerfectHash> klass, String[] arg) throws InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, IOException, com.martiansoftware.jsap.JSAPException, ClassNotFoundException
klass
- the default class to be built.arg
- the usual argument array.
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
IOException
com.martiansoftware.jsap.JSAPException
ClassNotFoundException
public static void main(String[] arg) throws InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, IOException, com.martiansoftware.jsap.JSAPException, ClassNotFoundException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
IOException
com.martiansoftware.jsap.JSAPException
ClassNotFoundException
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |