Package pitt.search.semanticvectors

Semantic Vector indexes, created by applying a Random Projection algorithm to term-document matrices created using Apache Lucene.

Interface Summary
CloseableVectorStore Some vector stores (e.g., those that read from the filesystem) claim resources that aren't automatically garbage collected or released.
VectorStore Classes implementing this interface are used to represent a collection of object vectors, including i.
 

Class Summary
BuildBilingualIndex Command line utility for creating bilingual semantic vector indexes.
BuildIndex Command line utility for creating semantic vector indexes.
BuildPositionalIndex Command line utility for creating semantic vector indexes using the sliding context window approach (see work on HAL, and by Schütze).
ClusterResults  
ClusterVectorStore This class is used for performing kMeans clustering on an entire vector store.
CompareTerms Command line term vector comparison utility.
CompareTermsBatch Command line term vector comparison utility designed to be run in batch mode.
CompoundVectorBuilder This class contains methods for manipulating queries, e.g., taking a list of query terms and producing a (possibly weighted) aggregate query vector.
DocVectors Implementation of vector store that collects doc vectors by iterating through all the terms in a term vector store and incrementing document vectors for each of the documents containing that term.
Flags Class for representing and parsing global command line flags.
IncrementalDocVectors generates document vectors incrementally requires a
LuceneUtils Class to support reading extra information from Lucene indexes, such as term frequency and document frequency.
ObjectVector This class provides a basic object (e.g., term or document id) and corresponding vector.
Search Command line term vector search utility.
SearchResult Class to represent search results.
TermTermVectorsFromLucene Implementation of vector store that creates term-by-term co-occurrence vectors by iterating through all the documents in a Lucene index.
TermVectorsFromLucene Implementation of vector store that creates term vectors by iterating through all the terms in a Lucene index.
VectorSearcher Class for searching vector stores using different scoring functions.
VectorSearcher.BalancedVectorSearcherPerm Class for searching a permuted vector store using cosine similarity.
VectorSearcher.VectorSearcherConvolutionSim Class for searching a vector store using convolution similarity.
VectorSearcher.VectorSearcherCosine Class for searching a vector store using cosine similarity.
VectorSearcher.VectorSearcherCosineSparse Class for searching a vector store using sparse cosine similarity.
VectorSearcher.VectorSearcherMaxSim Class for searching a vector store using minimum distance similarity.
VectorSearcher.VectorSearcherPerm Class for searching a permuted vector store using cosine similarity.
VectorSearcher.VectorSearcherSubspaceSim Class for searching a vector store using quantum disjunction similarity.
VectorSearcher.VectorSearcherTensorSim Class for searching a vector store using tensor product similarity.
VectorStoreRAM This class provides methods for reading a VectorStore into memory as an optimization if batching many searches.
VectorStoreReader Wrapper class used to get access to underlying VectorStore implementations.
VectorStoreReaderLucene This class provides methods for reading a VectorStore from disk.
VectorStoreReaderText This class provides methods for reading a VectorStore from a textfile.
VectorStoreSparseRAM This class provides methods for reading a VectorStore into memory as an optimization if batching many searches.
VectorStoreTranslater Class providing a command-line interface for transforming vector stores between the optimized Lucene format and plain text.
VectorStoreWriter This class provides methods for serializing a VectorStore to disk.
VectorUtils This class provides standard vector methods, e.g., cosine measure, normalization, tensor utils.
 

Exception Summary
ZeroVectorException  
 

Package pitt.search.semanticvectors Description

This Semantic Vectors package implements a Random Projection algorithm, a form of automatic semantic analysis similar to Latent Semantic Analysis (LSA) and variants such as Probabilistic Latent Semantic Analysis (PLSA). Unlike these methods, however, Random Projection does not rely on computationally intensive matrix decomposition algorithms such as Singular Value Decomposition (SVD). This makes Random Projection a much more scalable technique in practice.
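
The idea described above can be illustrated with a minimal, self-contained sketch. It is not this package's implementation: the class name RandomProjectionSketch, its method names, and the toy count matrix are hypothetical and chosen only for illustration. Each document is assigned a sparse random vector of +1 and -1 entries, and a term vector is the frequency-weighted sum of the random vectors of the documents that contain the term. No matrix decomposition is involved.

```java
import java.util.Random;

/** Illustrative sketch only; names here are hypothetical, not the package API. */
public class RandomProjectionSketch {

    /** Sparse ternary vector: `seeds` entries are +1 or -1 (alternating), the rest 0. */
    static float[] randomIndexVector(int dim, int seeds, Random rng) {
        if (seeds > dim) throw new IllegalArgumentException("seeds must be <= dim");
        float[] v = new float[dim];
        int placed = 0;
        while (placed < seeds) {
            int i = rng.nextInt(dim);
            if (v[i] == 0f) {                      // skip positions already used
                v[i] = (placed % 2 == 0) ? 1f : -1f;
                placed++;
            }
        }
        return v;
    }

    /** Projects each row (term) of a term-document count matrix into `dim` dimensions. */
    static float[][] project(int[][] termDoc, int dim, int seeds, long randomSeed) {
        int numDocs = termDoc[0].length;
        Random rng = new Random(randomSeed);
        float[][] docIndex = new float[numDocs][];
        for (int d = 0; d < numDocs; d++) {
            docIndex[d] = randomIndexVector(dim, seeds, rng);
        }
        float[][] termVectors = new float[termDoc.length][dim];
        for (int t = 0; t < termDoc.length; t++) {
            for (int d = 0; d < numDocs; d++) {
                if (termDoc[t][d] == 0) continue;  // term does not occur in this document
                for (int i = 0; i < dim; i++) {
                    termVectors[t][i] += termDoc[t][d] * docIndex[d][i];
                }
            }
        }
        return termVectors;
    }

    /** Cosine similarity; returns 0 if either vector is all zeros. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 0.0;
        return dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // Toy counts: rows are terms, columns are documents.
        int[][] termDoc = {
            {2, 1, 0, 0},   // term 0 occurs in documents 0 and 1
            {1, 2, 0, 0},   // term 1 occurs in the same documents as term 0
            {0, 0, 3, 1},   // term 2 occurs in different documents
        };
        float[][] vecs = project(termDoc, 1000, 10, 42L);
        System.out.println("sim(term0, term1) = " + cosine(vecs[0], vecs[1]));
        System.out.println("sim(term0, term2) = " + cosine(vecs[0], vecs[2]));
    }
}
```

Because sparse random vectors in a high-dimensional space are nearly orthogonal to one another, terms that share documents should end up with similar vectors, while terms with no documents in common should score close to zero. Adding a new document only requires generating one more random vector and adding it to the affected term vectors.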

Our application of Random Projection to Natural Language Processing (NLP) is descended from Pentti Kanerva's work on Sparse Distributed Memory. In semantic analysis and text mining, this method has also been called Random Indexing. A growing number of researchers have applied Random Projection to NLP tasks, demonstrating:

  1. Performance comparable with other forms of Latent Semantic Analysis.
  2. Significant computational advantages in creating and incrementally maintaining models.

The current package was created as part of a project by the University of Pittsburgh Office of Technology Management to explore the potential for automatically matching related concepts in the technology management domain, e.g., mapping new technologies to potentially interested licensors. This project can be found at http://real.hsls.pitt.edu.

The package requires Apache Ant and Apache Lucene to be installed, and the Lucene classes must be available in your CLASSPATH.

Further documentation and links to articles on Random Projection and related techniques can be found at the package download site, http://code.google.com/p/semanticvectors.

Author:
Dominic Widdows, in collaboration with Kathleen Ferraro and the University of Pittsburgh.