

Database Interface

A database within pyrpm is a set of rpms. Basic operations supported by databases are:


Database Classes

Most features are implemented in separate classes. Those features are brought together either by inheritance or by using instances of other classes.

RpmDatabase - abstract super class
  RpmDB - The on disk rpm db(4)
    RpmDiskShadowDB - allows virtual removals from the db that are not
                      written to disk but instead are just filtered from
                      all results
  RpmMemoryDB - in memory db that builds hashes for searching, works with all
                kinds of rpms
    RpmRepoDB - Yum repository, reads data into memory
      SqliteRepoDB - uses the yum sqlite db
        RhnChannelRepoDB - deals with RHN channels which are very similar
                           to Yum repositories
    RpmExternalSearchDB - uses another db (sqlite) for searching while
                          maintaining its own list of rpms. All rpms must be
                          contained in the external db!
  JointDB - treat several dbs as one
    RhnRepoDB - RHN Repository. Work is done by RhnChannelRepoDB instances
    RpmShadowDB - current state during resolving - see RpmYum.pydb
                  use case below

Use Cases

Although databases are used in more or less every script, there are two use cases within pyrpmyum that cover all database classes.

"->" means holding a pointer to an/several instance(s) of another class

RpmYum.repos

Database containing all rpms that are used to resolve dependencies. After creation this database is read only.

JointDB
 -> SqliteRepoDB - one per repository
 -> RhnRepoDB - optional
  -> RhnChannelRepoDB - one per channel
 -> RpmMemoryDB - containing rpms given at the command line (optional)

RpmYum.pydb

Database used for resolving. Rpms are added to and removed from this db, and the searches for resolving dependencies are performed on it. All modifications are kept in memory. It uses RpmYum.repos and the RpmDB for searching and filters the results down to the rpms that have not yet been removed or that have been added. That way neither a linear search nor building additional hashes is needed.

RpmShadowDB
 -> RpmExternalSearchDB - keeps track of rpms installed from the repos
  -> RpmYum.repos - used for searches. See above for details
 -> RpmDiskShadowDB - keeps track of the rpms deleted in the RpmDB
  -> RpmDB - used for searches
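
The filtering idea behind the shadow databases can be sketched in a few lines of Python. This is only an illustration of the principle, not the pyrpm implementation: the class ShadowSketch, its methods and the package names are hypothetical, and the sketch is written for Python 3.

```python
class ShadowSketch:
    """Hypothetical illustration of a shadow db; not the pyrpm API."""

    def __init__(self, backend_search):
        # backend_search: callable mapping a capability name to a list of
        # package names, standing in for the real (read only) database.
        self.backend_search = backend_search
        self.removed = set()

    def remove(self, pkg):
        # The removal is only recorded in memory; the backend is untouched.
        self.removed.add(pkg)

    def search(self, name):
        # Delegate the search to the backend, then hide removed packages.
        return [p for p in self.backend_search(name) if p not in self.removed]


# Made-up data standing in for an installed package database.
installed = {"libfoo.so.1": ["foo-1.0-1", "foo-compat-0.9-3"]}
db = ShadowSketch(lambda name: installed.get(name, []))
db.remove("foo-compat-0.9-3")
result = db.search("libfoo.so.1")
```

Because the backend already provides fast lookups, the shadow layer only needs a set of removed packages and a filter over the results.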

What does a binary rpm look like?

For RPM there are nowadays several "formats" in which you can find information about rpm packages. The most typical one is of course the binary rpm header, which is part of every binary rpm package. A typical binary rpm package looks like this:

+------+-----------+--------+--------------+
| Lead | Signature | Header | Gzipped CPIO |
+------+-----------+--------+--------------+

The lead has a fixed size of 96 bytes and contains some very basic information about the binary rpm. It can also generally be used to determine whether a file is a binary rpm or not (e.g. using file), as it contains some very specific magic values that make such files easy to identify.
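
A minimal sketch of such a check in Python 3 follows. The function name looks_like_rpm is an assumption made for this example; the four magic bytes at the start of the lead are ed ab ee db.

```python
RPM_LEAD_SIZE = 96
RPM_LEAD_MAGIC = b"\xed\xab\xee\xdb"


def looks_like_rpm(data):
    """Return True if data starts with a complete rpm lead (hypothetical helper)."""
    return len(data) >= RPM_LEAD_SIZE and data[:4] == RPM_LEAD_MAGIC


# A fake lead built for illustration: the magic followed by zero bytes.
fake_lead = RPM_LEAD_MAGIC + bytes(RPM_LEAD_SIZE - 4)
is_rpm = looks_like_rpm(fake_lead)
```

For a real file it is enough to read the first 96 bytes in binary mode and pass them to the helper.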

The signature and the header are stored as rpm header structures. Rpm header structures look like this:

+-------+---------+-----------+-----------+-----------+
| Magic | IndexNr | StoreSize | Indexdata | Storedata |
+-------+---------+-----------+-----------+-----------+

The Magic is a hardcoded value, IndexNr the number of index entries and StoreSize the size in bytes of the store data.
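
As a rough sketch, the fixed part at the start of a header structure can be read with the struct module (Python 3). The assumptions here: the magic is treated as 8 bytes (the bytes 8e ad e8, a version byte 01 and four reserved bytes), both counters are 4 byte big-endian integers, and the function name parse_header_intro is made up for this example.

```python
import struct

HEADER_MAGIC_PREFIX = b"\x8e\xad\xe8\x01"  # magic bytes plus version byte


def parse_header_intro(data):
    """Return (IndexNr, StoreSize) from the first 16 bytes of a header."""
    magic, index_nr, store_size = struct.unpack(">8sII", data[:16])
    # Only the first four bytes are compared; the rest is reserved.
    if magic[:4] != HEADER_MAGIC_PREFIX:
        raise ValueError("not an rpm header structure")
    return index_nr, store_size


# Synthetic intro announcing 3 index entries and 40 bytes of store data.
intro = HEADER_MAGIC_PREFIX + bytes(4) + struct.pack(">II", 3, 40)
index_nr, store_size = parse_header_intro(intro)
```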

Indexdata consists of IndexNr index entries each of which is 16 bytes. Each index entry looks like this:

+-----+------+--------+-------+
| Tag | Type | Offset | Count |
+-----+------+--------+-------+

Tag specifies which tag this entry is about. Type specifies the type of the tag. Offset specifies the offset in the Storedata at which the data for this tag begins. The meaning of Count depends on the type.
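
The index entries can be unpacked as four unsigned 32 bit big-endian integers each. The following Python 3 sketch assumes exactly that layout; parse_index is a hypothetical helper, and the sample values (tag 1000 with type 6, tag 1001 with type 6) are synthetic data built just for the example.

```python
import struct

ENTRY = struct.Struct(">IIII")  # Tag, Type, Offset, Count = 16 bytes


def parse_index(indexdata, index_nr):
    """Return a list of (tag, type, offset, count) tuples."""
    return [ENTRY.unpack_from(indexdata, i * ENTRY.size)
            for i in range(index_nr)]


# Two synthetic entries packed back to back.
indexdata = ENTRY.pack(1000, 6, 0, 1) + ENTRY.pack(1001, 6, 5, 1)
entries = parse_index(indexdata, 2)
```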

Storedata finally contains the real tag information. As mentioned in the previous paragraph, by using an index entry from the Indexdata you can find and parse all data relevant to a specific tag. The format depends of course on the type of the tag.
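
To illustrate how the type drives the decoding, here is a small Python 3 sketch that handles only two types: NUL terminated strings (type 6) and 32 bit big-endian integers (type 4). The helper name read_value is an assumption, other types are deliberately left out, and the store contents are synthetic.

```python
import struct

RPM_INT32_TYPE = 4
RPM_STRING_TYPE = 6


def read_value(store, type_, offset, count):
    """Decode one tag value from the store data (two types only)."""
    if type_ == RPM_STRING_TYPE:
        # A single string, terminated by a NUL byte.
        end = store.index(b"\x00", offset)
        return store[offset:end].decode("utf-8", "replace")
    if type_ == RPM_INT32_TYPE:
        # Count gives the number of integers stored for this tag.
        return list(struct.unpack_from(">%dI" % count, store, offset))
    raise NotImplementedError("type %d not handled in this sketch" % type_)


# Synthetic store: a string padded to 8 bytes, then one integer.
store = b"bash\x00\x00\x00\x00" + struct.pack(">I", 12345)
name = read_value(store, RPM_STRING_TYPE, 0, 1)
size = read_value(store, RPM_INT32_TYPE, 8, 1)
```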

More detailed information about the binary rpm format can be found here: http://www.rpm.org/max-rpm/s1-rpm-file-format-rpm-file-format.html

The rpm binary format can be partially found in the rpmdb as well. The file /var/lib/rpm/Packages contains the complete headers of the original binary rpms in rpm header structure format, without the 8 byte magic and with some additional installation-relevant index entries appended.

Another format that is common nowadays for reduced rpm header data is the repo metadata format used by yum. It is a split up and reduced version of the original rpm header information using XML. It is mainly useful to determine and resolve dependencies of rpm packages. More information about the metadata can be found here:

http://linux.duke.edu/projects/metadata/

Other less common storage formats include databases like SQLite or MySQL, which e.g. yum uses to convert the repodata format to a more usable form locally.

Apart from that, rpm itself extracts quite a bit of the information from rpm binary headers and writes it to various db4 files in /var/lib/rpm.


RPM database internals

This section describes the structure of the various files in /var/lib/rpm. All files are db4 files, either hash or btree based. With the exception of Packages, all files use the corresponding rpmtag based value as key. The data consists of integer pairs, each containing the package id and the index at which this entry can be found within that tag in the package's rpm header. The values are 4 byte integers in host byte order. For some tags the index doesn't make any sense. In those cases the index value will always be set to 0.
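
Decoding such a value is straightforward once the raw bytes are available. The Python 3 sketch below only covers the decoding step and assumes the value has already been read from the db4 file by some other means; decode_pairs is a hypothetical name.

```python
import struct


def decode_pairs(value):
    """Split a raw db value into (package id, index) tuples.

    "=" selects host byte order with standard 4 byte integers.
    """
    return list(struct.iter_unpack("=II", value))


# Synthetic value holding two pairs: (5, 0) and (8, 1).
raw = struct.pack("=IIII", 5, 0, 8, 1)
pairs = decode_pairs(raw)
```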

Filelist

Basenames (hash)

Conflictname (hash)

Dirnames (btree)

Filemd5s (hash)

Group (hash)

Installtid (btree)

Name (hash)

Packages (hash)

Providename (hash)

Provideversion (btree)

Pubkeys (hash)

Requirename (hash)

Requireversion (btree)

Sha1header (hash)

Sigmd5 (hash)

Triggername (hash)

Example

The following example shows how the package headers stored in /var/lib/rpm/Packages relate to the rest of the files:

/var/lib/rpm/Packages:

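+------------+-------------+-------+
| Package id | Requirename | Index |
+------------+-------------+-------+
| 5          | a           | 0     |
|            | b           | 1     |
| 8          | c           | 0     |
|            | a           | 1     |
|            | b           | 2     |
+------------+-------------+-------+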

/var/lib/rpm/Requirename:

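+-------------+------------+-------+
| Requirename | Package id | Index |
+-------------+------------+-------+
| a           | 5          | 0     |
|             | 8          | 1     |
| b           | 5          | 1     |
|             | 8          | 2     |
| c           | 8          | 0     |
+-------------+------------+-------+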

That means all other files in /var/lib/rpm can be cross checked against /var/lib/rpm/Packages and can be regenerated from that file as well.
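
The regeneration amounts to inverting the per package lists. A minimal Python 3 sketch using the example data from the tables above (the dictionary layout and the function name build_index are assumptions made for illustration, not how rpm stores the data):

```python
from collections import defaultdict

# Package id -> list of Requirename entries, in header order.
packages = {5: ["a", "b"], 8: ["c", "a", "b"]}


def build_index(packages):
    """Return {requirename: [(package id, index), ...]}."""
    index = defaultdict(list)
    for pkg_id in sorted(packages):
        for i, name in enumerate(packages[pkg_id]):
            index[name].append((pkg_id, i))
    return dict(index)


requirename = build_index(packages)
```

The result contains the same pairs as the Requirename table: "a" maps to (5, 0) and (8, 1), "b" to (5, 1) and (8, 2), and "c" to (8, 0).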

An exception is Installtid. The keys of this db file are TIDs. A TID is a unique time in seconds since 1970 that identifies a complete transaction. Every header in Packages contains that TID as its "installtid" tag. The values of the Installtid db file are again pairs of integers, with a package id as the first value and the second value always 0. Here is a small example:

/var/lib/rpm/Packages:

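+------------+-------------+
| Package id | Install Tid |
+------------+-------------+
| 5          | 1000000     |
| 8          | 1000000     |
| 6          | 1234567     |
| 9          | 1234567     |
| 7          | 2345678     |
+------------+-------------+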

/var/lib/rpm/Installtid:

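+-------------+------------+-------+
| Install Tid | Package id | Index |
+-------------+------------+-------+
| 1000000     | 5          | 0     |
|             | 8          | 0     |
| 1234567     | 6          | 0     |
|             | 9          | 0     |
| 2345678     | 7          | 0     |
+-------------+------------+-------+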

As you can see, it can happen that package IDs get reused, in our example 6. This can happen if a package gets deleted and its ID is "dropped". So there is unfortunately no autoincrementing ID for the packages.
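
A cross check of Installtid against Packages could look like the following Python 3 sketch. It uses the example data from above; the dictionary layout and the name check_installtid are hypothetical, and the order of pairs within one TID is ignored since this document does not specify it.

```python
def check_installtid(package_tids, installtid):
    """Compare an Installtid mapping with the TIDs found in Packages."""
    expected = {}
    for pkg_id, tid in package_tids.items():
        # The index is always 0 for Installtid entries.
        expected.setdefault(tid, []).append((pkg_id, 0))
    if set(expected) != set(installtid):
        return False
    return all(sorted(expected[tid]) == sorted(installtid[tid])
               for tid in expected)


package_tids = {5: 1000000, 8: 1000000, 6: 1234567, 9: 1234567, 7: 2345678}
installtid = {
    1000000: [(5, 0), (8, 0)],
    1234567: [(6, 0), (9, 0)],
    2345678: [(7, 0)],
}
consistent = check_installtid(package_tids, installtid)
```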


Notes about the Repo-Metadata

The following things should be noted about the repo metadata. yum uses the repodata only within the resolver part to determine a set of rpms that should be updated and/or installed. Then the complete rpm headers are downloaded and another dependency check from librpm is run, in addition to determining the ordering of rpm packages.

Here are a few limitations you should be aware of if you want to work with the repodata for more than the resolver, or if you want to understand the limits of the resolver:


Huge Dependency Data

The data eating up RAM in rpm headers are descriptions, changelogs and filelists.

The dependency data we operate with is huge. In addition to the Provides: data, which contains shared libs, rpm versions and provides explicitly listed in .spec files, dependency data can also use file requires such as Requires: /usr/bin/foo to reference any file in any other rpm package. That means we potentially have to look at the filelists of all rpm packages. That data is very large, as the current Fedora Core development tree contains more than 350000 files.

As the dependency data is worked with on each client to update the machine, it must be a goal to reduce this data to a smaller subset.

The current repo metadata has a fixed file regex of (.*bin/.*|/etc/.*|/usr/lib/sendmail)$ and a directory regex of (.*bin/.*|/etc/.*)$. That regex specifies the data given in the repodata/primary.xml.gz file, and you have to fall back to the complete filelists available in repodata/filelists.xml.gz if any dependency request is made outside of that data. (The regex gives a deterministic way to know when to load the full filelist.) The regex used to be pretty complete for Fedora Core in the past, but additional file requires are present in newer Fedora Core and Fedora Extra rpm packages which require a reload of the complete lists.

In addition to the completeness problems above, it was also noted that the regex lists contain 100 times more data than is actually used in current repositories. Conary is thus maintaining explicit lists of possible file requires. Maybe new ways to add autogenerated, small filelists can be worked out that would work for the most common use cases, again with the fallback to the complete lists that yum / createrepo implement right now.
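
The decision whether a file require can be answered from primary.xml.gz alone can be sketched as follows in Python 3. The file regex is the one quoted above; the function name needs_full_filelist and the assumption that the regex is matched from the start of the path are made for this example only.

```python
import re

# File regex quoted above; matched from the start of the path.
FILE_RE = re.compile(r"(.*bin/.*|/etc/.*|/usr/lib/sendmail)$")


def needs_full_filelist(file_require):
    """True if the path lies outside the data covered by the regex."""
    return FILE_RE.match(file_require) is None


checks = {path: needs_full_filelist(path) for path in (
    "/usr/bin/foo",         # covered by .*bin/.*
    "/etc/passwd",          # covered by /etc/.*
    "/usr/lib/sendmail",    # listed explicitly
    "/usr/share/foo/data",  # not covered, needs filelists.xml.gz
)}
```

Only the last path in this example would trigger loading the complete filelists.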


Storing Complete Dep Graphs

It would also be possible to store dependency graphs that contain data for the resolver to select the right rpm packages plus data for the orderer to specify the right sequence in which to install them. But many machines have further packages installed outside of that package set, so this would then mostly be used for new installs. Optimizing the general update path for running machines should be more important than improving the install path for new installs, so this is currently not a goal, but it would very well be possible to do.


Last updated 25-Apr-2007 17:57:12 CEST