repoze.catalog is an indexing and search system for Python. It is inspired by (and uses much code from) Zope’s zope.app.catalog, and uses other Zope libraries to do much of its work. It manages its own persistence: it stores catalog information into a ZODB database.
To make use of repoze.catalog, your application must create objects that can be indexed. It is responsible for giving each of these objects a unique integer identifier, and for maintaining the association between the object and its identifier for the lifetime of your application. An object to be indexed must either have a particular attribute which is guaranteed to have a value, or you must provide a callback that inspects the content for a value.
The result of searching a catalog is a sequence of integers that represent all the document ids that match the query. Your application is responsible for resolving these integers back into content objects.
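One way to maintain the association between content objects and integer ids is a small registry owned by the application. The sketch below is an illustration only: DocIdRegistry and its methods are hypothetical names, not part of repoze.catalog, and it keeps everything in memory where a real application would persist the mapping.

```python
import itertools


class DocIdRegistry:
    """Hypothetical in-memory registry mapping integer docids to content."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._by_id = {}

    def add(self, obj):
        """Assign the next unused integer id to obj and return it."""
        docid = next(self._next_id)
        self._by_id[docid] = obj
        return docid

    def resolve(self, docids):
        """Turn an iterable of docids (a search result) back into content."""
        return [self._by_id[docid] for docid in docids]


registry = DocIdRegistry()
peach_id = registry.add({"flavor": "peach"})
pistachio_id = registry.add({"flavor": "pistachio"})

# A catalog search would hand back docids; here we pretend it returned one.
matches = registry.resolve([pistachio_id])
print(matches)
```

The same docid returned by add() is what the application would later pass to the catalog when indexing the object.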
Here’s a simple example of indexing data within your application. This example sets up two indexes.
The first index for flavor is a field index. The second index, text, is a text index.
from repoze.catalog.catalog import FileStorageCatalogFactory
from repoze.catalog.catalog import ConnectionManager
from repoze.catalog.indexes.field import CatalogFieldIndex
from repoze.catalog.indexes.text import CatalogTextIndex

factory = FileStorageCatalogFactory('catalog.db', 'mycatalog')

_initialized = False

def initialize_catalog():
    global _initialized
    if not _initialized:
        # create a catalog
        manager = ConnectionManager()
        catalog = factory(manager)
        # set up indexes
        catalog['flavors'] = CatalogFieldIndex('flavor')
        catalog['texts'] = CatalogTextIndex('text')
        # commit the indexes
        manager.commit()
        manager.close()
        _initialized = True

class Content(object):
    def __init__(self, flavor, text):
        self.flavor = flavor
        self.text = text

if __name__ == '__main__':
    initialize_catalog()
    manager = ConnectionManager()
    catalog = factory(manager)
    content = {
        1: Content('peach', 'i am so very very peachy'),
        2: Content('pistachio', 'i am nutty'),
    }
    for docid, doc in content.items():
        catalog.index_doc(docid, doc)
    manager.commit()
Here’s a more complicated example. We use the same types of indexes as the previous example, but instead of naming an attribute to inspect, we give each index a callback that adapts the content object to the value to be indexed. This is useful when your content objects don’t have attributes that match exactly what you want to index:
from repoze.catalog.catalog import FileStorageCatalogFactory
from repoze.catalog.catalog import ConnectionManager
from repoze.catalog.indexes.field import CatalogFieldIndex
from repoze.catalog.indexes.text import CatalogTextIndex

factory = FileStorageCatalogFactory('catalog.db', 'mycatalog')

class Content(object):
    def __init__(self, flavor, text):
        self.theflavor = flavor
        self.thetext = text

# callbacks receive the content object and a default to return
# when the object has no value to index
def get_flavor(obj, default):
    return getattr(obj, 'theflavor', default)

def get_text(obj, default):
    return getattr(obj, 'thetext', default)

_initialized = False

def initialize_catalog():
    global _initialized
    if not _initialized:
        # create a catalog
        manager = ConnectionManager()
        catalog = factory(manager)
        # set up indexes (same index names as the previous example)
        catalog['flavors'] = CatalogFieldIndex(get_flavor)
        catalog['texts'] = CatalogTextIndex(get_text)
        # commit the indexes
        manager.commit()
        manager.close()
        _initialized = True

if __name__ == '__main__':
    initialize_catalog()
    manager = ConnectionManager()
    catalog = factory(manager)
    content = {
        1: Content('peach', 'i am so very very peachy'),
        2: Content('pistachio', 'i am nutty'),
    }
    for docid, doc in content.items():
        catalog.index_doc(docid, doc)
    manager.commit()
Searching for values from a previously indexed corpus of content is significantly easier than indexing. There are a number of ways to perform searches.
The suggested way to perform searches is to use the repoze.catalog.Catalog.query() method. Its main argument is the query itself (queryobject); optional arguments used later in this section include sort_index, reverse and names.
For example:
from repoze.catalog.catalog import FileStorageCatalogFactory
from repoze.catalog.catalog import ConnectionManager
from repoze.catalog.query import Eq
factory = FileStorageCatalogFactory('catalog.db', 'mycatalog')
manager = ConnectionManager()
catalog = factory(manager)
numdocs, results = catalog.query(Eq('flavors', 'peach'))
print (numdocs, [ x for x in results ])
The above search looks through the corpus for documents whose value in the flavors index matches peach.
The repoze.catalog.Catalog.query() method will return a two-tuple, with the first element in the sequence being the length of the result set, and the second element being the result set itself. Our above example will print:
(1, [1])
The first element in the tuple is the length of the result set (the integer 1, in this case).
The second element in the tuple is the result set. It has one item. This item is the document id for the content we indexed. Your application is responsible for resolving this document identifier back to its constituent content.
Warning
The result set is only guaranteed to be an iterable. It will not always be of a particular type, and it is not always sliceable; for example, it may be a generator.
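Because the result set may be a generator, code that needs slicing or repeated iteration has to materialize it first. A minimal sketch using only the standard library, with fake_results standing in (hypothetically) for the second element returned by query():

```python
from itertools import islice


def fake_results():
    """Stand-in for a result set that is only iterable (a generator)."""
    yield from (7, 3, 9, 4)


# Take the first two docids without assuming the result set is sliceable.
first_two = list(islice(fake_results(), 2))

# Materialize the whole result set when it must be indexed or reused.
all_docids = list(fake_results())
print(first_two, all_docids[-1])
```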
You can also combine query objects, using boolean operations, to search multiple indexes:
from repoze.catalog.catalog import FileStorageCatalogFactory
from repoze.catalog.catalog import ConnectionManager
from repoze.catalog.query import Eq

factory = FileStorageCatalogFactory('catalog.db', 'mycatalog')
manager = ConnectionManager()
catalog = factory(manager)
numdocs, results = catalog.query(
    Eq('flavors', 'peach') & Eq('texts', 'nutty'))
print (numdocs, [ x for x in results ])
The results of the above search will return the following:
(0, [])
This is because no document in our index has both a flavor of peach and text which contains the word nutty.
You can sort the result set using sort_index. The value of sort_index should be the name of an index which supports being used as a sort index:
from repoze.catalog.query import InRange
numdocs, results = catalog.query(
    InRange('flavors', 'peach', 'pistachio'),
    sort_index='flavors')
print (numdocs, [ x for x in results ])
This would result in:
(2, [1, 2])
The default sort order is ascending. You can reverse the sort using reverse:
from repoze.catalog.query import InRange
numdocs, results = catalog.query(
    InRange('flavors', 'peach', 'pistachio'),
    sort_index='flavors',
    reverse=True)
print (numdocs, [ x for x in results ])
This would result in:
(2, [2, 1])
The value passed as the queryobject argument to repoze.catalog.Catalog.query() may be one of two distinct types: a query object, or a string containing a CQE.
For example, you can construct a raw query object using Python, and pass it as queryobject to the repoze.catalog.Catalog.query() method:
from repoze.catalog.query import Eq
numdocs, results = catalog.query(Eq('index_name', 'value'))
Or you can allow repoze.catalog to construct a query object on your behalf by passing a string as queryobject.
catalog.query('index_name == "value"')
The above string is a CQE. A “CQE” is a string representing a Python expression which uses index names and values. It is parsed by the catalog to create a query object.
Warning
CQE strings are not supported on Python versions < 2.6.
Whether a query object is used directly or query objects are generated as the result of a CQE, an individual query object will be one of two types: a comparator or a boolean operator. A comparator performs a single query on a single index. A boolean operator allows results from individual queries to be combined using boolean operations. For example:
from repoze.catalog.query import And, Eq, Contains
query = And(Eq('author', 'crossi'), Contains('body', 'biscuits'))
In the above example, And is a boolean operator, and both Eq and Contains are comparators. The resulting query will search two indexes, author and body. Because the individual comparators are passed as arguments to the And boolean operator, the result is all documents which satisfy both comparators.
All query objects overload the bitwise and (&) and or (|) operators and can be combined using these. The above query could also have been written as follows:
query = Eq('author', 'crossi') & Contains('body', 'biscuits')
Note
Although it would be more intuitive to use the boolean operators "and" and "or" for this rather than the bitwise operators, Python does not allow the boolean operators to be overloaded.
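To make the operator overloading concrete, here is a toy model of how comparators can be combined with & and |. It is a sketch of the general technique, not repoze.catalog's implementation: the classes below are simplified stand-ins that reuse the names Eq, And and Or, and they evaluate against a plain dict mapping each index name to {docid: value}. The topic index and its data are made up for the example.

```python
class Query:
    """Base class: & and | build boolean operators from two queries."""

    def __and__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)


class Eq(Query):
    def __init__(self, index_name, value):
        self.index_name = index_name
        self.value = value

    def apply(self, indexes):
        index = indexes[self.index_name]
        return {docid for docid, value in index.items() if value == self.value}


class And(Query):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def apply(self, indexes):
        # Documents must satisfy both sub-queries: set intersection.
        return self.left.apply(indexes) & self.right.apply(indexes)


class Or(Query):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def apply(self, indexes):
        # Documents may satisfy either sub-query: set union.
        return self.left.apply(indexes) | self.right.apply(indexes)


indexes = {
    "author": {1: "crossi", 2: "crossi", 3: "jdoe"},
    "topic": {1: "biscuits", 2: "tea", 3: "biscuits"},
}

both = (Eq("author", "crossi") & Eq("topic", "biscuits")).apply(indexes)
either = (Eq("author", "crossi") | Eq("topic", "biscuits")).apply(indexes)
print(sorted(both), sorted(either))
```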
Query objects may also be created by parsing a CQE string. The query parser uses Python’s internal code parser to parse CQE strings, so the syntax is just like Python:
mycatalog.query("author == 'crossi' and 'biscuits' in body")
The query parser allows name substitution in expressions. Names are resolved using a dict passed into repoze.catalog.Catalog.query():
author = request.params.get("author")
word = request.params.get("search_term")
query = mycatalog.query("author == author and word in body",
                        names=locals())
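As a rough illustration of parsing such a string with Python's own parser, the sketch below uses the standard ast module to turn an expression into nested tuples, substituting values from a names dict. It is not the repoze.catalog parser: parse_expression, the tuple output format, and the decision to resolve names at parse time are all assumptions made for the example, and only a few comparators are handled.

```python
import ast

# Map Python comparison nodes to comparator names used in this document.
_COMPARATORS = {
    ast.Eq: "Eq", ast.NotEq: "NotEq",
    ast.Gt: "Gt", ast.Lt: "Lt",
    ast.GtE: "Ge", ast.LtE: "Le",
}


def _value(node, names):
    """Resolve a literal, or look a bare name up in the names dict."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return names[node.id]
    raise ValueError("unsupported value in expression")


def _convert(node, names):
    if isinstance(node, ast.BoolOp):
        op = "And" if isinstance(node.op, ast.And) else "Or"
        return (op,) + tuple(_convert(value, names) for value in node.values)
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        left, op, right = node.left, node.ops[0], node.comparators[0]
        if isinstance(op, ast.In) and isinstance(right, ast.Name):
            # "value in index_name": the index name is on the right.
            return ("Contains", right.id, _value(left, names))
        if type(op) in _COMPARATORS and isinstance(left, ast.Name):
            # "index_name == value": the index name is on the left.
            return (_COMPARATORS[type(op)], left.id, _value(right, names))
    raise ValueError("unsupported expression")


def parse_expression(expression, names=None):
    """Parse a CQE-like string into nested tuples (hypothetical format)."""
    tree = ast.parse(expression, mode="eval")
    return _convert(tree.body, names or {})


literal = parse_expression("author == 'crossi' and 'biscuits' in body")
substituted = parse_expression(
    "author == author and word in body",
    names={"author": "crossi", "word": "biscuits"},
)
print(literal)
print(substituted == literal)
```

In this sketch, an expression with the index name on the wrong side of a comparator, such as "'crossi' == author", is rejected with a ValueError.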
Unlike true Python expressions, ordering of the terms in a CQE expression is important for comparators. For most comparators the index_name must be written on the left. The following, for example, would raise an exception:
query = mycatalog.query("'crossi' == author")
Note that not all index types support all comparators. An attempt to perform a query using a comparator that is not supported by the index being queried will result in a NotImplementedError being raised when the query is performed.
The supported comparator operators are as follows:
Python:
from repoze.catalog.query import Eq
Eq(index_name, value)
CQE:
index_name == value
Python:
from repoze.catalog.query import NotEq
NotEq(index_name, value)
CQE:
index_name != value
Python:
from repoze.catalog.query import Gt
Gt(index_name, value)
CQE:
index_name > value
Python:
from repoze.catalog.query import Lt
Lt(index_name, value)
CQE:
index_name < value
Python:
from repoze.catalog.query import Ge
Ge(index_name, value)
CQE:
index_name >= value
Python:
from repoze.catalog.query import Le
Le(index_name, value)
CQE:
index_name <= value
Python:
from repoze.catalog.query import Contains
Contains(index_name, value)
CQE:
value in index_name
Python:
from repoze.catalog.query import DoesNotContain
DoesNotContain(index_name, value)
CQE:
value not in index_name
Python:
from repoze.catalog.query import Any
Any(index_name, [value1, value2, ...])
CQE:
index_name == value1 or index_name == value2 or etc...
index_name in any([value1, value2, ...])
index_name in any(values)
Python:
from repoze.catalog.query import NotAny
NotAny(index_name, [value1, value2, ...])
CQE:
index_name != value1 and index_name != value2 and etc...
index_name not in any([value1, value2, ...])
index_name not in any(values)
Python:
from repoze.catalog.query import All
All(index_name, [value1, value2, ...])
CQE:
index_name == value1 and index_name == value2 and etc...
index_name in all([value1, value2, ...])
index_name in all(values)
Python:
from repoze.catalog.query import NotAll
NotAll(index_name, [value1, value2, ...])
CQE:
index_name != value1 or index_name != value2 or etc...
index_name not in all([value1, value2, ...])
index_name not in all(values)
Python:
from repoze.catalog.query import InRange
InRange(index_name, start, end,
        start_exclusive=False, end_exclusive=False)
CQE:
index_name >= start and index_name <= end
start < index_name < end
Python:
from repoze.catalog.query import NotInRange
NotInRange(index_name, start, end,
           start_exclusive=False, end_exclusive=False)
CQE:
index_name <= start or index_name >= end
not(start < index_name < end)
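The range comparators are the least obvious of the group, so here is a small stand-alone sketch of the semantics their signatures suggest. It assumes that start_exclusive and end_exclusive simply switch each bound between inclusive and exclusive, and that the "not" form is the complement within the index; in_range, not_in_range and the ratings data are hypothetical and do not use repoze.catalog.

```python
def in_range(index, start, end, start_exclusive=False, end_exclusive=False):
    """Return the sorted docids whose value lies between start and end."""

    def matches(value):
        above = value > start if start_exclusive else value >= start
        below = value < end if end_exclusive else value <= end
        return above and below

    return sorted(docid for docid, value in index.items() if matches(value))


def not_in_range(index, start, end, **flags):
    """Return the sorted docids in the index that in_range would not match."""
    inside = set(in_range(index, start, end, **flags))
    return sorted(docid for docid in index if docid not in inside)


ratings = {1: 10, 2: 20, 3: 30}  # docid -> indexed value

inclusive = in_range(ratings, 10, 20)
exclusive = in_range(ratings, 10, 20, start_exclusive=True)
outside = not_in_range(ratings, 10, 20)
print(inclusive, exclusive, outside)
```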
The following set operators are allowed in queries:
Python (explicit):
from repoze.catalog.query import And
And(query1, query2)
Python (implicit):
query1 & query2
CQE:
query1 and query2
query1 & query2
Python (explicit):
from repoze.catalog.query import Or
Or(query1, query2)
Python (implicit):
query1 | query2
CQE:
query1 or query2
query1 | query2
Warning
The repoze.catalog.Catalog.search() method is deprecated as of repoze.catalog 0.8. Use repoze.catalog.Catalog.query() instead.
We can pass a query to our catalog’s search method as keyword arguments, each composed of the name of an index and the value we’d like to find documents for.
from repoze.catalog.catalog import FileStorageCatalogFactory
from repoze.catalog.catalog import ConnectionManager

factory = FileStorageCatalogFactory('catalog.db', 'mycatalog')
manager = ConnectionManager()
catalog = factory(manager)
numdocs, results = catalog.search(flavors=('peach', 'peach'))
print (numdocs, [ x for x in results ])
The above search looks through the corpus for documents whose value in the flavors index matches peach. Since the index is a “field” index, its query argument is a “range” search: you can read ('peach', 'peach') as “from peach to peach”. You could say ('peach', 'pistachio') to find all documents that are in the “range” from peach to pistachio.
The repoze.catalog.Catalog.search() method will return a two-tuple, with the first element in the sequence being the length of the result set, and the second element being the result set itself. Our above example will print:
(1, [1])
The first element in the tuple is the length of the result set (the integer 1, in this case).
The second element in the tuple is the result set. It has one item. This item is the document id for the content we indexed. Your application is responsible for resolving this document identifier back to its constituent content.
You can also pass compound search parameters for multiple indexes. The results are intersected to provide a result:
from repoze.catalog.catalog import FileStorageCatalogFactory
from repoze.catalog.catalog import ConnectionManager

factory = FileStorageCatalogFactory('catalog.db', 'mycatalog')
manager = ConnectionManager()
catalog = factory(manager)
numdocs, results = catalog.search(flavors=('peach', 'peach'), texts='nutty')
print (numdocs, [ x for x in results ])
The results of the above search will return the following:
(0, [])
This is because no document in our index has both a flavor of peach and text which contains the word nutty.
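The intersection behaviour can be modelled in a few lines. The sketch below is a simplified stand-in, not the catalog's code: each index is a plain dict of docid to value, and each search criterion is a hypothetical predicate function rather than the real per-index query format.

```python
def search(indexes, **criteria):
    """Intersect the docids matched by every index named in criteria."""
    result = None
    for index_name, wanted in criteria.items():
        index = indexes[index_name]
        matched = {docid for docid, value in index.items() if wanted(value)}
        result = matched if result is None else result & matched
    result = result or set()
    return len(result), sorted(result)


indexes = {
    "flavors": {1: "peach", 2: "pistachio"},
    "texts": {1: "i am so very very peachy", 2: "i am nutty"},
}


def has_nutty(text):
    return "nutty" in text.split()


# Peach-flavored AND nutty text: the two per-index results do not overlap.
no_match = search(
    indexes,
    flavors=lambda v: "peach" <= v <= "peach",
    texts=has_nutty,
)

# Widening the flavor range to include pistachio yields document 2.
one_match = search(
    indexes,
    flavors=lambda v: "peach" <= v <= "pistachio",
    texts=has_nutty,
)
print(no_match, one_match)
```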
See the zope.index documentation and implementation for more information about what specific index types expect for query parameters.
You can also use a field index as a sort_index, which sorts the document ids based on the values for that docid present in that index:
numdocs, results = catalog.search(flavors=('peach', 'pistachio'),
                                  sort_index='flavors')
print (numdocs, [ x for x in results ])
(2, [1, 2])
The default sort order is ascending. You can reverse the sort using reverse:
numdocs, results = catalog.search(flavors=('peach', 'pistachio'),
                                  sort_index='flavors',
                                  reverse=True)
print (numdocs, [ x for x in results ])
(2, [2, 1])
If you use a sort index, you may choose to limit the number of results returned. Do this by passing limit with an integer value of the number of results you want. Note that this parameter has no effect if you do not supply a sort_index:
numdocs, results = catalog.search(flavors=('peach', 'pistachio'),
                                  sort_index='flavors',
                                  limit=1)
print (numdocs, [ x for x in results ])
(1, [1])
You may combine reverse and limit as necessary.
If a sort_index is used, and the sort index you’re using does not contain all the documents returned by the search, the numdocs value returned by search may be incorrect. There will be fewer results than those indicated by numdocs in this circumstance.
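The interaction between sorting, reverse, limit and documents missing from the sort index can be sketched without the catalog. This is a simplified model under stated assumptions: sort_docids is a hypothetical helper, the sort index is a plain dict of docid to value, and documents absent from it are dropped, which is what makes the result shorter than numdocs.

```python
def sort_docids(docids, sort_index, reverse=False, limit=None):
    """Order docids by their value in sort_index (a docid -> value dict).

    Docids with no entry in sort_index are dropped, so the sorted result
    can be shorter than the number of matching documents.
    """
    present = [docid for docid in docids if docid in sort_index]
    ordered = sorted(present, key=sort_index.__getitem__, reverse=reverse)
    return ordered if limit is None else ordered[:limit]


flavors = {1: "peach", 2: "pistachio"}  # docid 3 has no flavor indexed
matched = [3, 2, 1]                     # pretend the search matched three docs
numdocs = len(matched)

ascending = sort_docids(matched, flavors)
descending = sort_docids(matched, flavors, reverse=True)
top_one = sort_docids(matched, flavors, limit=1)
print(numdocs, ascending, descending, top_one)
```

Here numdocs is 3, yet only two docids survive the sort, mirroring the mismatch described above.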
When querying a text index, to sort the results by relevance, specify the name of the text index as the sort index. The most relevant results will be provided first, unless you specify reverse=True, in which case the least relevant will be provided first.
An implementation of a “document map” suitable for ZODB applications exists within the repoze.catalog.document.DocumentMap class. A document map allows you to map document ids to “addresses” (e.g. paths or unique identifiers). See repoze.catalog.document in the API documentation chapter for more information.
Values indexed by a repoze.catalog catalog cannot subclass from the ZODB Persistent class. This is a safeguard to ensure that unresolvable cross-database references aren’t put into the catalog’s (separate) database.
When the ConnectionManager’s commit method is called, it will commit a transaction for all databases participating in Zope transaction management. Don’t use this method if you already have transaction management enabled in another way.