Class ibis::keywords defines a boolean term-document matrix.
#include <ikeywords.h>
Classes | |
class | tokenizer |
A simple tokenizer used to parse the keywords. | |
Public Member Functions | |
virtual long | append (const char *dt, const char *df, uint32_t nnew) |
Extend the index. | |
virtual void | binBoundaries (std::vector< double > &b) const |
The functions binBoundaries and binWeights return the bin boundaries and the count of each bin, respectively. | |
virtual void | binWeights (std::vector< uint32_t > &b) const |
virtual void | estimate (const ibis::qContinuousRange &expr, ibis::bitvector &lower, ibis::bitvector &upper) const |
Computes an approximation of hits as a pair of lower and upper bounds. | |
virtual uint32_t | estimate (const ibis::qContinuousRange &expr) const |
Returns an upper bound on the number of hits. | |
virtual double | estimateCost (const ibis::qContinuousRange &expr) const |
Estimate the cost of evaluating a range condition. | |
virtual double | estimateCost (const ibis::qDiscreteRange &expr) const |
Estimate the cost of evaluating a range condition. | |
virtual long | evaluate (const ibis::qContinuousRange &expr, ibis::bitvector &hits) const |
To evaluate the exact hits. | |
virtual double | getMax () const |
The maximum value recorded in the index. | |
virtual double | getMin () const |
The minimum value recorded in the index. | |
virtual double | getSum () const |
Compute the approximate sum of all the values indexed. | |
keywords (const ibis::column *c, const char *f=0) | |
Constructor. | |
keywords (const ibis::column *c, ibis::text::tokenizer &tkn, const char *f=0) | |
Constructor. | |
keywords (const ibis::column *c, ibis::fileManager::storage *st) | |
Constructor. Reconstruct a keyword index from an existing file. | |
virtual const char * | name () const |
Returns the name of the index, similar to the function type, but returns a string instead. | |
virtual void | print (std::ostream &out) const |
Prints human readable information. | |
virtual int | read (const char *idxfile) |
Reconstructs an index from the named file. | |
virtual int | read (ibis::fileManager::storage *st) |
Reconstructs an index from an array of bytes. | |
long | search (const char *kw, ibis::bitvector &hits) const |
Match a particular keyword. | |
long | search (const char *kw) const |
Estimate the number of matches. | |
virtual long | select (const ibis::qContinuousRange &, void *) const |
Evaluate the range condition and select values. | |
virtual long | select (const ibis::qContinuousRange &, void *, ibis::bitvector &) const |
Evaluate the range condition, select values, and record the positions. | |
virtual INDEX_TYPE | type () const |
Returns an index type identifier. | |
virtual float | undecidable (const ibis::qContinuousRange &, ibis::bitvector &iffy) const |
This class and its derived classes should produce exact answers, therefore no undecidable rows. | |
virtual int | write (const char *dt) const |
Write the boolean term-document matrix as two files, xx.terms for the terms and xx.idx for the bitmaps that mark the positions. | |
Protected Member Functions | |
void | clear () |
Clear the current content. | |
virtual size_t | getSerialSize () const throw () |
Estimate the size of the .idx file. | |
int | parseTextFile (ibis::text::tokenizer &tkn, const char *f) |
Parse the text file to build a keyword index. | |
int | readTDLine (std::istream &in, std::string &key, std::vector< uint32_t > &idlist, char *buf, uint32_t nbuf) const |
Read one line from the term-document file. | |
char | readTerm (const char *&buf, std::string &key) const |
Extract the term from a line of input term-document file. | |
int | readTermDocFile (const ibis::column *idcol, const char *f) |
Reads a term-document list from an external file. | |
uint32_t | readUInt (const char *&buf) const |
Extract the next integer in an input line. | |
void | setBits (std::vector< uint32_t > &pos, ibis::bitvector &bvec) const |
Turn on the specified positions in a bitvector. |
Class ibis::keywords defines a boolean term-document matrix.
The terms are stored in an ibis::dictionary and the bitmaps are stored in a series of bitvectors.
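The layout described above can be illustrated with a minimal, self-contained sketch. It is not the FastBit implementation: the class name TermDocMatrix is hypothetical, std::map stands in for ibis::dictionary, and std::vector<bool> stands in for the compressed ibis::bitvector.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a keyword index: one bitmap per term,
// with bit r set when row (document) r contains the term.
class TermDocMatrix {
public:
    explicit TermDocMatrix(std::size_t nrows) : nrows_(nrows) {}

    // Record that `term` occurs in row `row`.
    void add(const std::string& term, std::size_t row) {
        assert(row < nrows_);
        auto it = dict_.find(term);
        if (it == dict_.end()) {
            // New term: assign it the next bitmap slot.
            it = dict_.emplace(term, bitmaps_.size()).first;
            bitmaps_.emplace_back(nrows_, false);
        }
        bitmaps_[it->second][row] = true;
    }

    // Copy the bitmap of `term` into `hits` and return the number of
    // matching rows; an unknown term yields an all-zero bitmap.
    long search(const std::string& term, std::vector<bool>& hits) const {
        hits.assign(nrows_, false);
        auto it = dict_.find(term);
        if (it == dict_.end()) return 0;
        hits = bitmaps_[it->second];
        long count = 0;
        for (bool b : hits) count += b ? 1 : 0;
        return count;
    }

private:
    std::size_t nrows_;
    std::map<std::string, std::size_t> dict_;  // term -> bitmap index
    std::vector<std::vector<bool>> bitmaps_;   // one bitmap per term
};
```

Looking up a keyword is then a dictionary lookup followed by a bitmap copy, which mirrors the shape of search(const char *kw, ibis::bitvector &hits) listed above.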
The current implementation can either read a term-document list or parse the binary string values with a list of delimiters for determining tokens. It first checks for the presence of a term-document list which can be explicitly or implicitly specified. Here are the options.
<table-name>.<column-name>.tdlist=filename
Note that the filename given above can be either a fully qualified name or a name in the same directory as the data file.
If a term-document list is provided, the document id used in the list may be specified explicitly through docIdName either in the index specification or in a configuration file. An example of index specification is as follows
In a configuration file, the syntax for specifying a docIdName is as follows.
<table-name>.<column-name>.docIDName=<id-column-name>
For example,
enrondata.subject.docIDName=mid
enrondata.body.docIDName=mid
If an ID column is not specified, the integer IDs in the .tdlist file are assumed to be the row numbers.
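The role of the ID column can be sketched as follows. This is an illustration only, under the assumption that the values in the ID column are unique; the function idsToRows and its parameters are hypothetical and are not part of the FastBit API.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Translate the document IDs found in a term-document list into row
// numbers. `idColumn` holds the ID-column value of each row (hypothetical
// input); when it is empty, the IDs are taken to be row numbers already.
// IDs that match no row are dropped.
std::vector<std::uint32_t> idsToRows(const std::vector<std::uint32_t>& ids,
                                     const std::vector<std::uint32_t>& idColumn) {
    if (idColumn.empty()) return ids;

    std::unordered_map<std::uint32_t, std::uint32_t> rowOf;
    for (std::uint32_t row = 0; row < idColumn.size(); ++row)
        rowOf.emplace(idColumn[row], row);  // first row wins on duplicates

    std::vector<std::uint32_t> rows;
    for (std::uint32_t id : ids) {
        auto it = rowOf.find(id);
        if (it != rowOf.end()) rows.push_back(it->second);
    }
    return rows;
}
```

With an ID column such as mid in the example above, each ID in the list is first resolved to the row holding that ID; without one, the list is used unchanged.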
If the term-document list is not explicitly specified, one may specify a list of delimiters for the tokenizer to use when parsing the text values. The list of delimiters can be specified either in the index option or through a configuration file. Here is an example using the index option:
index=keywords delimiters=" \t,;"
The following is an example line in a configuration file (say, ibis.rc)
<table-name>.<column-name>.delimiters=" \t,;"
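To make the effect of a delimiter list concrete, here is a minimal sketch of delimiter-based tokenizing. The function tokenize is hypothetical; the tokenizer built into FastBit may treat tokens differently (for example in how it handles case or repeated terms).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split `text` into tokens at any character listed in `delimiters`.
// Runs of delimiters produce no empty tokens.
std::vector<std::string> tokenize(const std::string& text,
                                  const std::string& delimiters = " \t,;") {
    std::vector<std::string> tokens;
    std::string::size_type start = text.find_first_not_of(delimiters);
    while (start != std::string::npos) {
        // `end` is npos for the last token; substr then runs to the end.
        std::string::size_type end = text.find_first_of(delimiters, start);
        tokens.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(delimiters, end);
    }
    return tokens;
}
```

With the delimiters " \t,;" from the examples above, a text value such as "gas, power;\tmeeting  notes" yields the four tokens gas, power, meeting and notes.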
There are two different ways of building a keyword index, and each can be specified explicitly or implicitly. The precedence is as follows: an explicitly specified option takes precedence over an implicitly specified option, and the term-document list takes precedence over the built-in parser.
ibis::keywords::keywords | ( | const ibis::column * | c, |
const char * | f = 0 |
||
) | [explicit] |
Constructor.
It first tries to read the terms (.terms) and the tdmat (.idx) files if they both exist. If that fails, it will attempt to build an index using the externally provided term-document list or by parsing the text with a specified list of delimiters.
References ibis::index::bits, ibis::CATEGORY, clear(), ibis::index::col, ibis::index::dataFileName(), ibis::util::getFileSize(), ibis::gVerbose, ibis::column::indexSpec(), ibis::INT, ibis::index::optionalUnpack(), parseTextFile(), print(), read(), readTermDocFile(), ibis::TEXT, ibis::column::type(), and ibis::UINT.
ibis::keywords::keywords | ( | const ibis::column * | c, |
ibis::text::tokenizer & | tkn, | ||
const char * | f = 0 |
||
) |
Constructor.
Construct a new keyword index using the user-provided tokenizer.
References ibis::index::bits, ibis::index::col, ibis::gVerbose, ibis::column::indexSpec(), ibis::index::optionalUnpack(), parseTextFile(), and print().
virtual void ibis::keywords::binBoundaries | ( | std::vector< double > & | ) | const [inline, virtual] |
The functions binBoundaries and binWeights return the bin boundaries and the count of each bin, respectively.
Reimplemented from ibis::index.
void ibis::keywords::estimate | ( | const ibis::qContinuousRange & | , |
ibis::bitvector & | lower, | ||
ibis::bitvector & | upper | ||
) | const [virtual] |
Computes an approximation of hits as a pair of lower and upper bounds.
expr | the query expression to be evaluated. |
lower | a bitvector marking a subset of the hits. All rows marked with one (1) are definitely hits. |
upper | a bitvector marking a superset of the hits. All hits are marked with one, but some of the rows marked one may not be hits. If the variable upper is empty, the variable lower is assumed to contain the exact answer. |
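The contract between lower and upper can be sketched with plain std::vector<bool> in place of ibis::bitvector. The helper undecided is hypothetical; it computes the rows a caller would still have to check against the raw data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rows marked in `upper` but not in `lower` are candidates that still
// need checking. An empty `upper` means `lower` is already exact.
std::vector<bool> undecided(const std::vector<bool>& lower,
                            const std::vector<bool>& upper) {
    std::vector<bool> iffy(lower.size(), false);
    if (upper.empty()) return iffy;  // exact answer, nothing to check
    assert(upper.size() == lower.size());
    for (std::size_t i = 0; i < lower.size(); ++i)
        iffy[i] = upper[i] && !lower[i];
    return iffy;
}
```

Because this class is documented as producing exact answers, the expected case for a keyword index is the one where no rows remain undecided.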
Reimplemented from ibis::index.
long ibis::keywords::evaluate | ( | const ibis::qContinuousRange & | expr, |
ibis::bitvector & | hits | ||
) | const [virtual] |
To evaluate the exact hits.
Returns the number of hits on success; otherwise, returns a negative value.
Implements ibis::index.
References ibis::gVerbose.
size_t ibis::keywords::getSerialSize | ( | ) | const throw () [protected, virtual] |
Estimate the size of the .idx file.
The .idx file contains only the bitmaps without the actual terms. The bitmap offsets are assumed to be 8 bytes long.
virtual double ibis::keywords::getSum | ( | ) | const [inline, virtual] |
Compute the approximate sum of all the values indexed.
If it decides that computing the sum directly from the vertical partition is more efficient, it will return NaN immediately.
Reimplemented from ibis::index.
virtual const char* ibis::keywords::name | ( | ) | const [inline, virtual] |
Returns the name of the index, similar to the function type, but returns a string instead.
Implements ibis::index.
int ibis::keywords::parseTextFile | ( | ibis::text::tokenizer & | tkn, |
const char * | dir | ||
) | [protected] |
Parse the text file to build a keyword index.
This function is called by the constructor of the class to build a new keyword index.
References ibis::fileManager::buffer< T >::address(), ibis::util::clear(), ibis::gVerbose, ibis::fileManager::buffer< T >::resize(), ibis::bitvector::setBit(), ibis::fileManager::buffer< T >::size(), and UnixOpen.
Referenced by keywords().
void ibis::keywords::print | ( | std::ostream & | out | ) | const [virtual] |
Prints human readable information.
Outputs information about the index as text to the specified output stream.
Implements ibis::index.
References ibis::util::compactValue(), and ibis::gVerbose.
Referenced by keywords().
int ibis::keywords::read | ( | const char * | name | ) | [virtual] |
Reconstructs an index from the named file.
The name can be the directory containing an index file. In this case, the name of the index file must be the name of the column followed by ".idx" suffix.
Implements ibis::index.
References ibis::util::clear(), ibis::gVerbose, ibis::fileManager::instance(), ibis::index::KEYWORDS, ibis::fileManager::recordPages(), ibis::util::strnewdup(), and UnixOpen.
Referenced by keywords().
int ibis::keywords::read | ( | ibis::fileManager::storage * | st | ) | [virtual] |
Reconstructs an index from an array of bytes.
Intended for internal use only!
Implements ibis::index.
References ibis::fileManager::storage::begin(), ibis::util::clear(), and ibis::index::KEYWORDS.
int ibis::keywords::readTDLine | ( | std::istream & | in, |
std::string & | key, | ||
std::vector< uint32_t > & | idlist, | ||
char * | linebuf, | ||
uint32_t | nbuf | ||
) | const [protected] |
Read one line from the term-document file.
The caller must have opened the file already. This function reads one line from the input stream and extracts the keyword and the list of ids.
References ibis::gVerbose, and ibis::util::readUInt().
char ibis::keywords::readTerm | ( | const char *& | buf, |
std::string & | keyword | ||
) | const [inline, protected] |
Extract the term from a line of input term-document file.
A keyword is any number of printable characters. Returns the first non-space character following the keyword, which should be the delimiter ':'. Consecutive spaces in the keyword are replaced with a single plain space character.
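The parsing rules above can be sketched for a whole line of a term-document list. This is an illustration under assumptions, not the FastBit code: the function parseTDLine is hypothetical, the line is assumed to have the form "term : id id ...", the first ':' is taken as the delimiter, any non-digit characters between ids are treated as separators, and overflow of very large ids is not checked.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

// Parse one line of the assumed form "term : id id id ...".
// Returns true and fills `key` and `ids` on success.
bool parseTDLine(const std::string& line, std::string& key,
                 std::vector<std::uint32_t>& ids) {
    key.clear();
    ids.clear();
    std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) return false;

    // Collapse runs of whitespace inside the term and trim both ends.
    bool pendingSpace = false;
    for (std::string::size_type i = 0; i < colon; ++i) {
        unsigned char c = static_cast<unsigned char>(line[i]);
        if (std::isspace(c)) {
            pendingSpace = !key.empty();
        } else {
            if (pendingSpace) key += ' ';
            pendingSpace = false;
            key += line[i];
        }
    }
    if (key.empty()) return false;

    // Collect every run of digits after the colon as one id.
    for (std::string::size_type i = colon + 1; i < line.size();) {
        if (std::isdigit(static_cast<unsigned char>(line[i]))) {
            std::uint32_t v = 0;
            while (i < line.size() &&
                   std::isdigit(static_cast<unsigned char>(line[i])))
                v = v * 10 + static_cast<std::uint32_t>(line[i++] - '0');
            ids.push_back(v);
        } else {
            ++i;  // skip separators such as spaces or commas
        }
    }
    return true;
}
```

For an input such as "  natural   gas : 3, 17 42", the sketch produces the key "natural gas" and the ids 3, 17 and 42.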
int ibis::keywords::readTermDocFile | ( | const ibis::column * | idcol, |
const char * | f | ||
) | [protected] |
Reads a term-document list from an external file.
Returns the number of terms found if successful, otherwise returns a negative number to indicate error.
References ibis::fileManager::buffer< T >::address(), ibis::bitvector::adjustSize(), ibis::bitvector::cnt(), FASTBIT_DIRSEP, ibis::gVerbose, ibis::roster::locate(), ibis::column::name(), ibis::bitvector::set(), ibis::fileManager::buffer< T >::size(), and ibis::util::strnewdup().
Referenced by keywords().
virtual float ibis::keywords::undecidable | ( | const ibis::qContinuousRange & | , |
ibis::bitvector & | iffy | ||
) | const [inline, virtual] |
This class and its derived classes should produce exact answers, therefore no undecidable rows.
Reimplemented from ibis::index.
References ibis::bitvector::clear().
int ibis::keywords::write | ( | const char * | dt | ) | const [virtual] |
Write the boolean term-document matrix as two files, xx.terms for the terms and xx.idx for the bitmaps that mark the positions.
Implements ibis::index.
References ibis::fileManager::flushFile(), ibis::gVerbose, ibis::fileManager::instance(), ibis::index::KEYWORDS, and UnixOpen.