it.unimi.dsi.mg4j.util
Class SignedMinimalPerfectHash

java.lang.Object
  extended byit.unimi.dsi.mg4j.index.AbstractTermMap
      extended byit.unimi.dsi.mg4j.util.MinimalPerfectHash
          extended byit.unimi.dsi.mg4j.util.SignedMinimalPerfectHash
All Implemented Interfaces:
Serializable, TermMap
Direct Known Subclasses:
CRC32SignedMinimalPerfectHash, HashCodeSignedMinimalPerfectHash, LiterallySignedMinimalPerfectHash

public abstract class SignedMinimalPerfectHash
extends MinimalPerfectHash
implements Serializable

Signed order-preserving minimal perfect hash tables.

IMPORTANT: In MG4J 1.0, the serialisation format is incompatible with previous versions.

Minimal perfect hash tables will always return a result, even for terms that were not present in the collection indexed by the table. Sometimes you may prefer to single out terms that were not present in the collection.

To this purpose, MG4J provides signed minimal perfect tables. In a signed table, every term in the collection gets a signature that is used to tell false positives. Signature may go from the simple, CRC based signatures provided by the sample CRC32SignedMinimalPerfectHash and HashCodeSignedMinimalPerfectHash classes, to sophisticated cryptographic signatures, to (at the other extreme) a class that actually stores the terms (and thus completely avoids false positives) such as LiterallySignedMinimalPerfectHash.

A signed table extends this class, and provides two methods: a initSignatures(Collection) method that sets up the necessary data structures, and a checkSignature(CharSequence,int) method that checks a given character sequence against the signature stored for a term having given index.

It is good practise, of course, to replicate the constructors of this class in all implementing subclasses (by simply invoking super with the same arguments). Moreover, to be useful classes implementing this class must be serialisable.

Since:
0.4
Author:
Sebastiano Vigna, Marco Olivo
See Also:
Serialized Form

Field Summary
static long serialVersionUID
           
 
Fields inherited from class it.unimi.dsi.mg4j.util.MinimalPerfectHash
g, m, n, n4, rightShift, t, TERM_THRESHOLD, terms, VERBOSE, WEIGHT_UNKNOWN, WEIGHT_UNKNOWN_SORTED_TERMS, weight0, weight1, weight2, weightLength
 
Constructor Summary
SignedMinimalPerfectHash(Collection terms)
          Creates a new signed order-preserving minimal perfect hash table for the given set of terms, using as many weights as the longest term in the collection.
SignedMinimalPerfectHash(Collection terms, int weightLength)
          Creates a new signed order-preserving minimal perfect hash table for the given set of terms using the given number of weights.
SignedMinimalPerfectHash(String termFile)
          Creates a new CRC32-signed order-preserving minimal perfect hash table for the given file of terms, using as many weights as the longest term in the file.
SignedMinimalPerfectHash(String termFile, String encoding, int weightLength)
          Creates a new CRC32-signed order-preserving minimal perfect hash table for the given file of terms using the given number of weights.
 
Method Summary
protected abstract  boolean checkSignature(CharSequence term, int index)
          Checks a signature.
 int get(CharSequence term)
          Hashes a given term.
protected abstract  void initSignatures(Collection terms)
          Sets up the signature system from a collection.
 
Methods inherited from class it.unimi.dsi.mg4j.util.MinimalPerfectHash
get, getFromT, getWeightLength, hash, hash, hash, main, size, weightLength
 
Methods inherited from class it.unimi.dsi.mg4j.index.AbstractTermMap
get, get
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serialVersionUID

public static final long serialVersionUID
See Also:
Constant Field Values
Constructor Detail

SignedMinimalPerfectHash

public SignedMinimalPerfectHash(Collection terms)
Creates a new signed order-preserving minimal perfect hash table for the given set of terms, using as many weights as the longest term in the collection.

After calling the corresponding constructor of MinimalPerfectHash, this constructor will invoke initSignatures(Collection).

Parameters:
terms - some terms to hash; it is assumed that this Collection does not contain duplicates.
See Also:
MinimalPerfectHash.MinimalPerfectHash(Collection)

SignedMinimalPerfectHash

public SignedMinimalPerfectHash(Collection terms,
                                int weightLength)
Creates a new signed order-preserving minimal perfect hash table for the given set of terms using the given number of weights.

After calling the corresponding constructor of MinimalPerfectHash, this constructor will invoke initSignatures(Collection).

Parameters:
terms - some terms to hash; it is assumed that this Collection does not contain terms with a common prefix of weightLength characters.
weightLength - the number of weights used generating the intermediate hash functions.
See Also:
MinimalPerfectHash.MinimalPerfectHash(Collection,int)

SignedMinimalPerfectHash

public SignedMinimalPerfectHash(String termFile,
                                String encoding,
                                int weightLength)
Creates a new CRC32-signed order-preserving minimal perfect hash table for the given file of terms using the given number of weights.

After calling the corresponding constructor of MinimalPerfectHash, this constructor will invoke initSignatures(Collection).

Parameters:
termFile - an UTF-8 file containing one term on each line; it is assumed that it does not contain terms with a common prefix of weightLength characters.
encoding - the encoding of termFile; if null, it is assumed to be the platform default encoding.
weightLength - the number of weights used generating the intermediate hash functions.
See Also:
MinimalPerfectHash.MinimalPerfectHash(String,String,int)

SignedMinimalPerfectHash

public SignedMinimalPerfectHash(String termFile)
Creates a new CRC32-signed order-preserving minimal perfect hash table for the given file of terms, using as many weights as the longest term in the file.

After calling the corresponding constructor of MinimalPerfectHash, this constructor will invoke initSignatures(Collection).

Parameters:
termFile - a file in the platform default encoding containing one term on each line; it is assumed that the file does not contain twice the same term.
See Also:
MinimalPerfectHash.MinimalPerfectHash(String)
Method Detail

get

public int get(CharSequence term)
Hashes a given term.

Specified by:
get in interface TermMap
Overrides:
get in class MinimalPerfectHash
Parameters:
term - a term to hash.
Returns:
the position of the given term in the generating collection, starting from 0, if the term was in the original collection; otherwise, -1.

initSignatures

protected abstract void initSignatures(Collection terms)
Sets up the signature system from a collection.

This abstract method must be overriden by implementing subclasses. It must set up all data structures that are necessary to handle signatures; in particular, it will usually compute signatures for all terms in the given collection.

Parameters:
terms - the collection of terms given to the constructor of this class.
See Also:
CRC32SignedMinimalPerfectHash.initSignatures(Collection), LiterallySignedMinimalPerfectHash.initSignatures(Collection)

checkSignature

protected abstract boolean checkSignature(CharSequence term,
                                          int index)
Checks a signature.

This abstract method must be overriden by implementing subclasses. It must check whether the signature of the given character sequence matches the one stored for the index-th term.

Parameters:
term - a character sequence.
index - an integer denoting a term in the indexed collection.
Returns:
true iff the signature of the given character sequence matches the one stored for the index-th term.
See Also:
CRC32SignedMinimalPerfectHash.checkSignature(CharSequence,int), LiterallySignedMinimalPerfectHash.checkSignature(CharSequence,int)