pitt.search.semanticvectors
Class LuceneUtils

java.lang.Object
  extended by pitt.search.semanticvectors.LuceneUtils

public class LuceneUtils
extends java.lang.Object

Class to support reading extra information from Lucene indexes, including term frequency and document frequency.


Constructor Summary
LuceneUtils(java.lang.String path)
           Creates a LuceneUtils object that reads from the Lucene index at the given path.
 
Method Summary
 float getEntropy(org.apache.lucene.index.Term term)
          Gets the 1 - entropy (i.e. 1 + p log p) of a term, a function that favors terms that are focally distributed.
 int getGlobalTermFreq(org.apache.lucene.index.Term term)
          Gets the global term frequency of a term, i.e. how many times it occurs in the whole corpus.
 float getGlobalTermWeight(org.apache.lucene.index.Term term)
          Gets the global term weight for a term, used in query weighting.
 float getGlobalTermWeightFromString(java.lang.String termString)
          This is a hacky wrapper to get an approximate term weight for a string.
 int getNumDocs()
          Gets the number of documents in the index.
protected  boolean termFilter(org.apache.lucene.index.Term term, java.lang.String[] desiredFields, int nonAlphabet, int minFreq)
          Filters out non-alphabetic terms and those of low frequency.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LuceneUtils

public LuceneUtils(java.lang.String path)
            throws java.io.IOException
Parameters:
path - path to the Lucene index
Throws:
java.io.IOException
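A minimal usage sketch (the index path "lucene_index" is hypothetical, and exception handling is omitted):

    LuceneUtils lUtils = new LuceneUtils("lucene_index");
    System.out.println("Documents in index: " + lUtils.getNumDocs());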
Method Detail

getGlobalTermFreq

public int getGlobalTermFreq(org.apache.lucene.index.Term term)
Gets the global term frequency of a term, i.e. how many times it occurs in the whole corpus.

Parameters:
term - the term whose frequency you want
Returns:
Global term frequency of term, or 1 if unavailable.
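For example (a sketch only; the field name "contents" and the index path are assumptions about how the index was built, and exception handling is omitted):

    LuceneUtils lUtils = new LuceneUtils("lucene_index");
    org.apache.lucene.index.Term term =
        new org.apache.lucene.index.Term("contents", "semantic");
    int freq = lUtils.getGlobalTermFreq(term);   // 1 if the term is not found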

getGlobalTermWeightFromString

public float getGlobalTermWeightFromString(java.lang.String termString)
This is a hacky wrapper to get an approximate term weight for a string.


getGlobalTermWeight

public float getGlobalTermWeight(org.apache.lucene.index.Term term)
Gets the global term weight for a term, used in query weighting. Currently returns some power of inverse document frequency - you can experiment.

Parameters:
term - the term whose weight you want
Returns:
Global term weight, or 1 if unavailable.
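As an illustration of this kind of weighting (a sketch only, not the class's actual code; the exponent and the fallback value are assumptions):

    // Inverse-document-frequency-style weight: (numDocs / docFreq)^power
    static float idfWeight(int numDocs, int docFreq, double power) {
      if (docFreq == 0) return 1;   // term unavailable: fall back to a weight of 1
      return (float) Math.pow((double) numDocs / docFreq, power);
    }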

getNumDocs

public int getNumDocs()
Gets the number of documents in the index.


getEntropy

public float getEntropy(org.apache.lucene.index.Term term)
Gets the 1 - entropy (i.e. 1 + p log p) of a term, a function that favors terms that are focally distributed. We use the definition of log-entropy weighting provided in Martin and Berry (2007):

    Entropy = 1 + sum_j ( (P_ij * log2(P_ij)) / log2(n) )

where P_ij = (frequency of term i in doc j) / (global frequency of term i), and n = number of documents in the collection.

Parameters:
term - the term whose entropy you want

Thanks to Vidya Vasuki for adding the hash table to eliminate redundant calculation.
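A sketch of this calculation in plain Java (illustrative only; the per-document frequencies would in practice come from the Lucene index, and the method name and signature are hypothetical):

    static float logEntropy(int[] docFreqs, int globalFreq, int numDocs) {
      double entropy = 1.0;
      double log2n = Math.log(numDocs) / Math.log(2);
      for (int f : docFreqs) {                    // f = frequency of term i in doc j
        if (f == 0) continue;
        double pij = (double) f / globalFreq;     // P_ij
        entropy += (pij * (Math.log(pij) / Math.log(2))) / log2n;
      }
      return (float) entropy;
    }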

termFilter

protected boolean termFilter(org.apache.lucene.index.Term term,
                             java.lang.String[] desiredFields,
                             int nonAlphabet,
                             int minFreq)
                      throws java.io.IOException
Filters out non-alphabetic terms and those of low frequency.

Parameters:
term - Term to be filtered.
desiredFields - only terms in these fields pass the filter.
nonAlphabet - number of non-alphabetic characters allowed in the term; -1 if no character filtering is wanted.
minFreq - minimum global frequency allowed.

Thanks to Vidya Vasuki for refactoring and bug repair.
Throws:
java.io.IOException
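A sketch of the kind of check this filter performs (illustrative only; the actual class may count non-alphabetic characters or look up frequencies differently):

    static boolean passesFilter(String termText, int termGlobalFreq,
                                int nonAlphabet, int minFreq) {
      if (termGlobalFreq < minFreq) return false;   // too rare
      if (nonAlphabet >= 0) {                       // -1 disables character filtering
        int nonAlpha = 0;
        for (char c : termText.toCharArray()) {
          if (!Character.isLetter(c)) nonAlpha++;
        }
        if (nonAlpha > nonAlphabet) return false;   // too many non-alphabetic characters
      }
      return true;
    }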