it.unimi.dsi.mg4j.util
Class MinimalPerfectHash

java.lang.Object
  extended by it.unimi.dsi.mg4j.index.AbstractTermMap
      extended by it.unimi.dsi.mg4j.util.MinimalPerfectHash
All Implemented Interfaces:
TermMap, Serializable
Direct Known Subclasses:
SignedMinimalPerfectHash

Deprecated. Use the new hashing stuff in Sux4J.

@Deprecated
public class MinimalPerfectHash
extends AbstractTermMap
implements Serializable

Order-preserving minimal perfect hash tables.

Given a list of terms without duplicates, the constructors of this class finds an order-preserving minimal perfect hash function for the list. Subsequent calls to the getNumber(CharSequence) method will return the ordinal position of the provided character sequence (if it appeared in the list; otherwise, the result is a random position). The class can then be saved by serialisation and reused later.

If you also need to establish whether a given term was or not in the list, you should consider using a SignedMinimalPerfectHash implementing class instead.

This class is very scalable, and if you have enough memory it will handle efficiently hundreds of millions of terms: in particular, the offline constructor can build a map without loading the terms in memory.

To do its job, the class must store three vectors of weights that are used to compute certain hash functions. By default, the vectors are long as the longest term, but if your collection is sorted you can ask (passing WEIGHT_UNKNOWN_SORTED_TERMS to a constructor) for the shortest possible vector length. This optimisation does not change the memory footprint, but it can improve performance.

As a commodity, this class provides a main method that reads from standard input a (possibly gzip'd) sequence of newline-separated terms, and writes a serialised minimal perfect hash table for the given list. You can specify on the command line the kind of table (e.g., HashCodeSignedMinimalPerfectHash) and have it fetched by reflection. As a commodity, all signed classes in MG4J have a main method invoking the parameterised main method of this class, which accept the default class to be built. In this way, running the main method of any signed class provides the same features.

For efficiency, there are also method that access a minimal perfect hash using byte arrays interpreted as ISO-8859-1 characters.

Implementation Details

An object of this class uses about 1.26n integers between 0 and n-1 inclusive to hash n terms. This is asymptotically optimal, but for small collections using an integer wastes a large number of bits in each entry. At construction time, however, about 15n integers (i.e., 60n bytes) are necessary.

The technique used here is suggested (but not fully described) in the second edition of Managing Gigabytes. It is due to Havas, Majewski, Wormald and Czech, and it is fully described in the monograph by Czech, Havas and Majewski on perfect hashing (“Perfect Hashing”, Theoret. Comput. Sci. 182:1−143, 1997).

First, a triple of intermediate hash functions (generated via universal hashing) define for each term a 3-hyperedge in a hypergraph with 1.26n vertices. Each intermediate hash function uses a vector of random weights; the length of the vector is by default the length of the longest term, but if the collection is sorted it is possible to compute the minimum length of a prefix that will distinguish any pair of term.

Then, by successively stripping hyperedges with one vertex of degree one, we create an ordering on the hyperedges. Finally, we assign (in reverse order) to each vertex of a hyperedge a number in the range 0 and n-1 in such a way that the sum of the three numbers modulo n is exactly the original index of the term corresponding to the hyperedge. This is possible with high probability, so we try until we succeed.

Note that the mathematical results guaranteeing that it is possible to find a function in expected constant time are asymptotic. Thus, when the number of terms is less than TERM_THRESHOLD, an instance of this class just stores the terms in a vector and scans them linearly. This behaviour, however, is completely transparent to the user.

Rounds and Logging

Building a minimal perfect hash map may take hours. As it happens with all probabilistic algorithms, one can just give estimates of the expected time.

There are two probabilistic sources of problems: degenerate hyperedges and non-acyclic hypergraphs. The ratio between the number of vertices and the number of terms guarantee that acyclicity is true with high probability. However, when generating the hypergraph we use three hash functions, and it must never happen that the value of two of those functions coincide on a given term. Because of the birthday paradox, the probability of getting a nondegerate edge is just

(m-1)(m-2)/m2,
where m=1.26n. Since this must happen for n times, we must raise this probability to n, and as n grows the value quickly (and monotonically) reaches e-3/1.26. So the expected number of trials to generate a random 3-hypergraph would be bounded by e3/1.26, which is about 11. Note that this bound does not depend on n.

However, starting from MG4J 1.2 this class will patch deterministically the hash functions so that degenerate edges are extremely unlikely (in fact, so unlikely they never happen). One round of generation should be sufficient for generating a valid hypergraph. The fix is performed by adding NODE_OVERHEAD nodes to the graph, and using them to remap the random hash functions in case of clashes. The fix must be repeated, of course, each time getNumber(MutableString) is called, but in the vaste majority of cases it reduces to two checks for equality with negative result.

Once the hypergraph has been generated, the stripping procedure may fail. However, the expected number of trials tends to 1 as n approaches infinity (Czech, Havas and Majewski, for instance, report that on a set of 50,000 terms they obtained consistently one trial for more than 5000 experiments).

To help diagnosing problem with the generation process class, this class will log at INFO level what's happening. In particular, it will signal whenever a degenerate hyperedge is generated, and the various construction phases.

Note that if during the generation process the log warns more than once about duplicate hyperedges, you should suspect that there are duplicates in the term list, as duplicate hyperedges are extremely unlikely.

Since:
0.1
Author:
Sebastiano Vigna, Paolo Boldi
See Also:
Serialized Form

Field Summary
static float ENLARGEMENT_FACTOR
          Deprecated. The number of nodes the graph will actually have.
protected  int[] g
          Deprecated. The final magick—the function turning the intermediate hash values into the final bucket.
protected  int[] init
          Deprecated. Initialisation values for the intermediate hash functions.
protected  int m
          Deprecated. The number of vertices of the intermediate hypergraph, excluding the overhead nodes.
protected  int n
          Deprecated. The number of buckets.
protected  long n4
          Deprecated. Four times the number of buckets.
static int NODE_OVERHEAD
          Deprecated. The overhead in term of graphs nodes that is used to solve deterministically node clashes.
protected  int rightShift
          Deprecated. The maximum amount of right shift to perform on a 32-bit number so to obtain something greater than or equal to m.
static long serialVersionUID
          Deprecated.  
protected  CharSequence[] t
          Deprecated. If n is smaller than TERM_THRESHOLD, a vector containing the terms.
static int TERM_THRESHOLD
          Deprecated. The minimum number of terms that will trigger the construction of a minimal perfect hash; overwise, terms are simply stored in a vector.
static int WEIGHT_UNKNOWN
          Deprecated. A special value denoting that the weight length is unknown, and should be computed using the maximum length of a term.
static int WEIGHT_UNKNOWN_SORTED_TERMS
          Deprecated. A special value denoting that the weight length is unknown, and should be computed assuming that the terms appear in lexicographic order.
protected  int[] weight0
          Deprecated. Vector of weights to compute the first intermediate hash function.
protected  int[] weight1
          Deprecated. Vector of weights to compute the second intermediate hash function.
protected  int[] weight2
          Deprecated. Vector of weights to compute the third intermediate hash function.
protected  int weightLength
          Deprecated. The length of the components of the weight vectors (it's faster than asking the length of the vectors).
 
Constructor Summary
  MinimalPerfectHash(Iterable<? extends CharSequence> terms)
          Deprecated. Creates a new order-preserving minimal perfect hash table for the given terms, using as many weights as the longest term in the collection.
  MinimalPerfectHash(Iterable<? extends CharSequence> terms, int weightLength)
          Deprecated. Creates a new order-preserving minimal perfect hash table for the given terms using the given number of weights.
protected MinimalPerfectHash(MinimalPerfectHash mph)
          Deprecated. Creates a new minimal perfect hash by copying a given one; non-transient fields are (shallow) copied.
  MinimalPerfectHash(String termFile, String encoding)
          Deprecated. Creates a new order-preserving minimal perfect hash table for the given file of terms.
  MinimalPerfectHash(String termFile, String encoding, boolean zipped)
          Deprecated. Creates a new order-preserving minimal perfect hash table for the (possibly gzip'd) given file of terms.
  MinimalPerfectHash(String termFile, String encoding, int weightLength)
          Deprecated. Creates a new order-preserving minimal perfect hash table for the given file of terms using the given number of weights.
  MinimalPerfectHash(String termFile, String encoding, int weightLength, boolean zipped)
          Deprecated. Creates a new order-preserving minimal perfect hash table for the (possibly gzip'd) given file of terms using the given number of weights.
 
Method Summary
protected  int getFromT(CharSequence term)
          Deprecated. Gets a term out of the stored array t.
 int getNumber(byte[] a)
          Deprecated. Hashes a term given as a byte array interpreted in the ISO-8859-1 charset encoding.
 int getNumber(byte[] a, int off, int len)
          Deprecated. Hashes a term given as a byte-array fragment interpreted in the ISO-8859-1 charset encoding.
 int getNumber(CharSequence term)
          Deprecated. Hashes a given term.
 int getNumber(MutableString term)
          Deprecated. Hashes a given term.
protected  void hash(CharSequence term, int[] h)
          Deprecated. Hashes a given term using the intermediate hash functions.
 boolean hasTerms()
          Deprecated. Returns true if this prefix map supports term retrieval.
protected static void main(Class<? extends MinimalPerfectHash> klass, String[] arg)
          Deprecated. A main method for minimal perfect hash construction that accepts a default class.
static void main(String[] arg)
          Deprecated.  
 int size()
          Deprecated. Returns the number of terms hashed.
 int weightLength()
          Deprecated. Returns the length of the weight vectors.
 
Methods inherited from class it.unimi.dsi.mg4j.index.AbstractTermMap
getIndex, getTerm, getTerm
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ENLARGEMENT_FACTOR

public static final float ENLARGEMENT_FACTOR
Deprecated. 
The number of nodes the graph will actually have. This value guarantees that the graph will be acyclic with positive probability.

See Also:
Constant Field Values

NODE_OVERHEAD

public static final int NODE_OVERHEAD
Deprecated. 
The overhead in term of graphs nodes that is used to solve deterministically node clashes.

See Also:
Constant Field Values

TERM_THRESHOLD

public static final int TERM_THRESHOLD
Deprecated. 
The minimum number of terms that will trigger the construction of a minimal perfect hash; overwise, terms are simply stored in a vector.

See Also:
Constant Field Values

WEIGHT_UNKNOWN

public static final int WEIGHT_UNKNOWN
Deprecated. 
A special value denoting that the weight length is unknown, and should be computed using the maximum length of a term.

See Also:
Constant Field Values

WEIGHT_UNKNOWN_SORTED_TERMS

public static final int WEIGHT_UNKNOWN_SORTED_TERMS
Deprecated. 
A special value denoting that the weight length is unknown, and should be computed assuming that the terms appear in lexicographic order.

See Also:
Constant Field Values

n

protected final int n
Deprecated. 
The number of buckets.


m

protected final int m
Deprecated. 
The number of vertices of the intermediate hypergraph, excluding the overhead nodes.


rightShift

protected final int rightShift
Deprecated. 
The maximum amount of right shift to perform on a 32-bit number so to obtain something greater than or equal to m.


init

protected final int[] init
Deprecated. 
Initialisation values for the intermediate hash functions.


weight0

protected final int[] weight0
Deprecated. 
Vector of weights to compute the first intermediate hash function.


weight1

protected final int[] weight1
Deprecated. 
Vector of weights to compute the second intermediate hash function.


weight2

protected final int[] weight2
Deprecated. 
Vector of weights to compute the third intermediate hash function.


weightLength

protected final int weightLength
Deprecated. 
The length of the components of the weight vectors (it's faster than asking the length of the vectors).


g

protected final int[] g
Deprecated. 
The final magick—the function turning the intermediate hash values into the final bucket.


n4

protected transient long n4
Deprecated. 
Four times the number of buckets.


t

protected transient CharSequence[] t
Deprecated. 
If n is smaller than TERM_THRESHOLD, a vector containing the terms.


serialVersionUID

public static final long serialVersionUID
Deprecated. 
See Also:
Constant Field Values
Constructor Detail

MinimalPerfectHash

public MinimalPerfectHash(Iterable<? extends CharSequence> terms)
Deprecated. 
Creates a new order-preserving minimal perfect hash table for the given terms, using as many weights as the longest term in the collection.

Caution: if the collection contains twice the same term, this constructor will never return.

Parameters:
terms - some terms to hash; it is assumed that there are no duplicates.

MinimalPerfectHash

public MinimalPerfectHash(Iterable<? extends CharSequence> terms,
                          int weightLength)
Deprecated. 
Creates a new order-preserving minimal perfect hash table for the given terms using the given number of weights.

This constructor can be used to force the use of a reduced number of weights if you are sure that the first weightLength characters of each term in terms are distinct.

If you do not know your weight length in advance, you can pass two special values as weightLength: WEIGHT_UNKNOWN will force the computation of weightLength as the maximum length of a term, whereas WEIGHT_UNKNOWN_SORTED_TERMS forces the assumption that terms are sorted: in this case, we search for the minimum prefix that will disambiguate all terms in the collection (a shorter prefix yields faster lookups).

Caution: if two terms in the collection have a common prefix of length weightLength this constructor will never return.

Parameters:
terms - some terms to hash; if weightLength is specified explicitly, it is assumed that there are no terms with a common prefix of weightLength characters.
weightLength - the number of weights used generating the intermediate hash functions, WEIGHT_UNKNOWN or WEIGHT_UNKNOWN_SORTED_TERMS.
See Also:
MinimalPerfectHash(Iterable)

MinimalPerfectHash

public MinimalPerfectHash(String termFile,
                          String encoding,
                          int weightLength,
                          boolean zipped)
Deprecated. 
Creates a new order-preserving minimal perfect hash table for the (possibly gzip'd) given file of terms using the given number of weights.

Parameters:
termFile - an file containing one term on each line; it is assumed that it does not contain terms with a common prefix of weightLength characters.
encoding - the encoding of termFile; if null, it is assumed to be the platform default encoding.
weightLength - the number of weights used generating the intermediate hash functions, WEIGHT_UNKNOWN or WEIGHT_UNKNOWN_SORTED_TERMS.
zipped - if true, the provided file is zipped and will be opened using a GZIPInputStream.
See Also:
MinimalPerfectHash(Iterable, int)

MinimalPerfectHash

public MinimalPerfectHash(String termFile,
                          String encoding,
                          boolean zipped)
Deprecated. 
Creates a new order-preserving minimal perfect hash table for the (possibly gzip'd) given file of terms.

Parameters:
termFile - an file containing one term on each line; it is assumed that it does not contain terms with a common prefix of weightLength characters.
encoding - the encoding of termFile; if null, it is assumed to be the platform default encoding.
zipped - if true, the provided file is zipped and will be opened using a GZIPInputStream.
See Also:
MinimalPerfectHash(Iterable, int)

MinimalPerfectHash

public MinimalPerfectHash(String termFile,
                          String encoding,
                          int weightLength)
Deprecated. 
Creates a new order-preserving minimal perfect hash table for the given file of terms using the given number of weights.

Parameters:
termFile - an file containing one term on each line; it is assumed that it does not contain terms with a common prefix of weightLength characters.
encoding - the encoding of termFile; if null, it is assumed to be the platform default encoding.
weightLength - the number of weights used generating the intermediate hash functions, WEIGHT_UNKNOWN or WEIGHT_UNKNOWN_SORTED_TERMS.
See Also:
MinimalPerfectHash(Iterable, int)

MinimalPerfectHash

public MinimalPerfectHash(String termFile,
                          String encoding)
Deprecated. 
Creates a new order-preserving minimal perfect hash table for the given file of terms.

Parameters:
termFile - an file containing one term on each line; it is assumed that it does not contain terms with a common prefix of weightLength characters.
encoding - the encoding of termFile; if null, it is assumed to be the platform default encoding.
See Also:
MinimalPerfectHash(Iterable, int)

MinimalPerfectHash

protected MinimalPerfectHash(MinimalPerfectHash mph)
Deprecated. 
Creates a new minimal perfect hash by copying a given one; non-transient fields are (shallow) copied.

Parameters:
mph - the perfect hash to be copied.
Method Detail

hash

protected void hash(CharSequence term,
                    int[] h)
Deprecated. 
Hashes a given term using the intermediate hash functions.

Parameters:
term - a term to hash.
h - a three-element vector that will be filled with the three intermediate hash values.

getNumber

public int getNumber(CharSequence term)
Deprecated. 
Hashes a given term.

Specified by:
getNumber in interface TermMap
Parameters:
term - a term to be hashed.
Returns:
the position of the given term in the generating collection, starting from 0. If the term was not in the original collection, the result is a random position.

getNumber

public int getNumber(MutableString term)
Deprecated. 
Hashes a given term.

Parameters:
term - a term to be hashed.
Returns:
the position of the given term in the generating collection, starting from 0. If the term was not in the original collection, the result is a random position.

getNumber

public int getNumber(byte[] a,
                     int off,
                     int len)
Deprecated. 
Hashes a term given as a byte-array fragment interpreted in the ISO-8859-1 charset encoding.

Parameters:
a - a byte array.
off - the first valid byte in a.
len - the number of bytes composing the term, starting at off.
Returns:
the position of term defined by len bytes starting at off (interpreted as ISO-8859-1 characters) in the generating collection, starting from 0. If the term was not in the original collection, the result is a random position.

getNumber

public int getNumber(byte[] a)
Deprecated. 
Hashes a term given as a byte array interpreted in the ISO-8859-1 charset encoding.

Parameters:
a - a byte array.
Returns:
the position of term defined by the bytes in a a (interpreted as ISO-8859-1 characters) in the generating collection, starting from 0. If the term was not in the original collection, the result is a random position.

getFromT

protected int getFromT(CharSequence term)
Deprecated. 
Gets a term out of the stored array t.

Note: This method does not check for t being non-null.

Parameters:
term - a term.
Returns:
the position of the given term in the generating collection, starting from 0. If the term was not in the original collection, the result is 0.

weightLength

public int weightLength()
Deprecated. 
Returns the length of the weight vectors.

Returns:
the length of weight vectors used in this table.

size

public int size()
Deprecated. 
Returns the number of terms hashed.

Specified by:
size in interface TermMap
Returns:
the number of terms hashed.

hasTerms

public boolean hasTerms()
Deprecated. 
Description copied from interface: TermMap
Returns true if this prefix map supports term retrieval.

Specified by:
hasTerms in interface TermMap
Returns:
true if this prefix map supports term retrieval.

main

protected static void main(Class<? extends MinimalPerfectHash> klass,
                           String[] arg)
                    throws InstantiationException,
                           IllegalAccessException,
                           InvocationTargetException,
                           NoSuchMethodException,
                           IOException,
                           com.martiansoftware.jsap.JSAPException,
                           ClassNotFoundException
Deprecated. 
A main method for minimal perfect hash construction that accepts a default class.

Parameters:
klass - the default class to be built.
arg - the usual argument array.
Throws:
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
IOException
com.martiansoftware.jsap.JSAPException
ClassNotFoundException

main

public static void main(String[] arg)
                 throws InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        IOException,
                        com.martiansoftware.jsap.JSAPException,
                        ClassNotFoundException
Deprecated. 
Throws:
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
IOException
com.martiansoftware.jsap.JSAPException
ClassNotFoundException