The V6 Web Engine
Release 1

Bernard Lang
François Rouaix
INRIA Rocquencourt

June 1996

Overview

V6 is to the Web what pipes are to Unix systems: a compositional device for combining document processing steps. To be easily integrated into the Web architecture, V6 is available as a personal proxy. Relying on a common skeleton architecture and Web-related libraries, V6 can easily be configured to support various sets of filters while remaining portable and browser independent. The filters may act on the requests emitted by the browser (or other Web client), on the documents returned by a server, or on both. In the current release, the available filters include proxy authentication, redirection, HTML filtering, GIF deinterlacing, a global history, a cache, and an indexing memory (see chapter 3). V6 can be used to support many other navigation aids and Web-related tools in a uniform, browser-independent way. In addition, V6 can be used as a traditional HTTP server: this is particularly useful to serve private files without needing access to the site-wide HTTP server, or to interface to local, private applications (mail, ...) through the CGI interface.

Chapter 1  Motivations and rationale

The design of V6 is explained in the position paper presented at the workshop Programming the Web - in search for APIs, which took place during the 5th World Wide Web conference (May 1996). The paper is available on-line at
http://pauillac.inria.fr/~lang/Papers/v6/

Chapter 2  Implementation design

2.1  Overview

Given the organising principles presented in the previous chapter, we postulated the following requirements for an implementation of a web engine:
client independence:
the engine should not depend on vendor-specific browser extensions, be they HTML or HTTP.
modularity:
the engine should be highly modular, so that servers, service components, and filters can be separately implemented and added to an engine (possibly already running).
dynamic configuration:
the engine components should contain as little hard-coded information (e.g. compile-time decisions) as possible. Components should be configurable through generic HTML/HTTP communication with the client: in particular, this means that there should not be any GUI-based component configuration except through a Web browser. Even if a component has persistent file-based configuration, this information should be editable from a client browser.
performance:
the performance of the engine is not critical. The only requirement is to avoid, as far as possible, locking the whole engine while it responds to a request.
language independent components:
the engine should support, whenever feasible, components written in arbitrary languages.
Our implementation choices were the following:
concurrency:
the engine exists as a unique process, using threads.
dynamic linking:
the various components are dynamically loaded into the engine, which provides only the skeleton architecture.
(almost) nothing hard-coded:
the engine by itself is "empty", in the sense that it implements only the library functions.
The V6 engine is written in Objective Caml, which provides concurrency through the threads library, as well as safe dynamic linking. Some components may be given as arbitrary binaries, either with a CGI-type interface, or with classical stdio (pipe-combinable Unix programs). However, as of version 1, some components still have to be written in Objective Caml, mainly for performance reasons.

2.2  Architecture of the engine

The design of V6 essentially follows the figure in chapter 1. The only important difference is the notion of scheduler and the associated job queue: instead of having a thread created for each incoming request, dealing with everything from request reading and filtering to serving, there are two separate worlds in V6. The first world contains servers, which receive requests from clients and respond to them from an abstract feed object; the second world contains services, which produce feeds. The way the feeds are created depends on the nature of the request, be it a proxy request or a request for the local services of V6. The scheduler and the jobs act as an isolator between the different components of the proxy, and should simplify extensions of V6 to support other protocols, notably the future HTTP-NG.

2.2.1  Servers

When a server component receives a request from a client (usually a Web browser), it obtains a job structure from the scheduler, and queues this job. The job contains the future of the document. The server waits for this future (a feed) to be available, and then responds to the client by sending the data obtained from the feed.
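The job and future mechanism described above can be sketched as follows. This is only an illustration in Python, not the Objective Caml code of V6: the names job_queue, service_worker and serve are hypothetical, and the feed is modelled as a plain iterator over chunks.

```python
import queue
import threading
from concurrent.futures import Future

# Hypothetical names throughout: none of them come from the V6 sources.
job_queue = queue.Queue()

def service_worker():
    """Service world: take jobs off the queue and fulfil their futures."""
    while True:
        job = job_queue.get()
        if job is None:              # sentinel used to stop the worker
            break
        request, future = job
        # The "feed" is modelled as an iterator over chunks of the document.
        feed = iter([b"response to ", request.encode(), b"\n"])
        future.set_result(feed)

def serve(request):
    """Server world: queue a job, wait for its feed, then send the data."""
    future = Future()
    job_queue.put((request, future))
    feed = future.result(timeout=5)  # blocks until a service produced the feed
    return b"".join(feed)

worker = threading.Thread(target=service_worker, daemon=True)
worker.start()
reply = serve("GET /v6")
job_queue.put(None)
worker.join()
print(reply)
```

The server side never needs to know how the feed was produced, which is the isolation property that the scheduler is meant to provide.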

Filters

V6 offers a general mechanism for writing and combining filters that act either on an incoming request, or on responses and document bodies, or on both. The filters can be written either as Caml filters, or as external programs (for filters acting on document bodies). Several examples of filters are given in the current V6 distribution, and are described in chapter 3.

A server component has an associated set of filters. Each incoming request is passed through the set of request filters. Each of these filters rewrites the HTTP request message, and returns an optional return filter. The return filter is itself decomposed: the filter first rewrites the HTTP response message, and returns an optional filter to be applied to the document body.

This decomposition, although apparently complex, has several advantages. The combination of filters is completely transparent to each filter. Filters acting on document bodies obey the traditional pipe model of Unix operating systems: in the case of external program filters, they receive the document body on the standard input, and must return the filtered body on the standard output. In the case of Caml filters, the filter function is run on a separate thread, and receives as argument a stdio-like record of functions (one for reading and one for writing).
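The three-stage decomposition (request filter, then response filter, then body filter) can be illustrated with the sketch below. It is written in Python for readability and does not reproduce the Caml interface of V6: the function names, the dictionary representation of messages and the X-Via header are all assumptions. The body filter is run synchronously here, whereas V6 runs it on a separate thread.

```python
import io

def tag_request(request):
    """Request filter: rewrite the request, return an optional response filter."""
    rewritten = dict(request, headers={**request["headers"], "X-Via": "v6-sketch"})

    def on_response(headers):
        """Response filter: rewrite the headers, return an optional body filter."""
        if headers.get("Content-Type") != "text/plain":
            return headers, None             # leave other bodies untouched

        def upper_body(read, write):
            """Body filter: pipe style, reads chunks and writes filtered chunks."""
            while True:
                chunk = read(4096)
                if not chunk:
                    break
                write(chunk.upper())
        return headers, upper_body

    return rewritten, on_response

def run_chain(request_filters, request, fetch):
    """Apply request filters in order, then response filters in reverse order."""
    response_filters = []
    for request_filter in request_filters:
        request, response_filter = request_filter(request)
        if response_filter is not None:
            response_filters.append(response_filter)
    headers, body = fetch(request)
    for response_filter in reversed(response_filters):
        headers, body_filter = response_filter(headers)
        if body_filter is not None:
            source, sink = io.BytesIO(body), io.BytesIO()
            body_filter(source.read, sink.write)   # stdio-like pair of functions
            body = sink.getvalue()
    return headers, body

def fake_fetch(request):
    """Stand-in for the proxy or service that produces the document."""
    return {"Content-Type": "text/plain"}, b"hello from " + request["uri"].encode()

headers, body = run_chain([tag_request], {"uri": "/doc", "headers": {}}, fake_fetch)
print(body)
```

Each filter only sees the message handed to it and returns the next stage, so filters can be combined without knowing about each other.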

2.2.2  Proxy components

Since the main role of V6 is to act as a proxy, there are naturally proxy components. A proxy component handles requests that are not addressed to local services (such as files or CGIs). Currently, V6 only implements HTTP 1.0 proxying (that is, handling requests for URIs of scheme http:) and proxy proxying (transmitting requests of other protocol schemes, such as ftp: or gopher:, to a further proxy).

2.2.3  Service components

A service component handles URIs that are addressed to the V6 engine itself. Service components are registered for some path prefix. When an incoming request is identified as local, V6 chooses the component that registered the longest path prefix matching the request, and calls it with the request.

In this architecture, one can have all requests starting with /foo served by one component, except /foo/bar which is served by another component.
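A minimal sketch of longest-prefix dispatch, assuming plain string prefixes (whether V6 compares whole path segments is not stated here). The registry contents and the function name pick_component are hypothetical.

```python
def pick_component(registry, path):
    """Return the component registered for the longest prefix matching path."""
    best = None
    for prefix, component in registry.items():
        if path.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, component)
    return best[1] if best else None

# Hypothetical registrations mirroring the /foo and /foo/bar example.
registry = {"/foo": "foo_component", "/foo/bar": "bar_component"}
print(pick_component(registry, "/foo/index.html"))
print(pick_component(registry, "/foo/bar/x"))
print(pick_component(registry, "/other"))
```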

Chapter 3  User's Manual

There are several levels of use of V6: the first level is the installation and configuration of the standard V6 distribution; it is described in this chapter. The second level is the design and implementation of new components; it is described in chapter 4. The third level is the addition of more library functions into the V6 core, for use by new components; it is not yet documented.

3.1  Installation and configuration

Assuming that V6 has been installed on your system (if not, see the INSTALL file in the distribution), you have to install some V6 configuration files in your $HOME directory; see the USER INSTALLATION section of the INSTALL file.

Then, you have to choose the components that will be loaded by your own instance of V6, and configure each of them.

Here are specific choices that depend on the kind of system you are going to run V6 on:
multi-user machine, connected to the outside world:
in this configuration, the main points are the choice of the server ports (select a port that nobody else will use, say your uid), and protection. Since the machine is routed, you must take care that all accesses to V6 are subject to user identification: be sure that pauth.cmo is in modules.conf, and that all filter sets defined in filters.conf contain "Proxy authentication". Then check pauth.conf, and edit it to add a user/password pair (use the utility v6pass to get the encoded password).

workstation dedicated to single user, inside a firewall:
in this configuration, V6 can be used only as a primary proxy. Do not include http_proxy.cmo in modules.conf. Instead, use proxies.cmo, and edit proxy.conf so that it contains the proper host names and port numbers of the regular proxy that your employer is bound to have installed.

3.1.1  Loaded modules (modules.conf)

The modules.conf file contains the list of modules that will be loaded by V6 during startup. These modules should reside in $HOME/.v6/modules. The list of available modules is described in the next section. The order in which components are specified in modules.conf is the order of loading. It is irrelevant except for the servers.cmo module, which should always be placed last.

3.1.2  Filter sets (filters.conf)

The configuration of filters in V6 is decomposed in two steps. The first step is the definition of filter sets, in the file filters.conf. Each filter set is given by a name and a list of regular expressions. The set is computed by selecting all registered filters whose names match the given regular expressions.

The order in which the regular expressions are given is also the order in which the filters will be applied to an incoming request (and, consequently, the reverse of the order in which response filters are applied).

The second step is the configuration of each server component (currently only the http server) to use some filter set.
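The selection of a filter set from regular expressions can be sketched as below. This is an illustration under several assumptions: Python's re syntax stands in for the regular expressions used by V6, a filter matched by several expressions is kept at its first position, and "Cache" and "Redirection" are hypothetical filter names (only "Proxy authentication" and "Indexing Memory" are named in this manual).

```python
import re

def build_filter_set(patterns, registered):
    """Select registered filter names, in the order the patterns are given."""
    selected = []
    for pattern in patterns:
        regexp = re.compile(pattern)
        for name in registered:
            if regexp.search(name) and name not in selected:
                selected.append(name)
    return selected

# Registered filter names; "Cache" and "Redirection" are made up for the example.
registered = ["Cache", "Indexing Memory", "Proxy authentication", "Redirection"]

request_order = build_filter_set(["^Proxy", "Redirect", "Index"], registered)
response_order = list(reversed(request_order))   # responses go through in reverse
print(request_order)
print(response_order)
```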

3.2  Components Library

3.2.1  Servers

HTTP 1.0 server (hserver.cmo)

The configuration file (servers.conf) allows the definition of one or several HTTP ports on which the server is active. Each port is defined by the hostname (in case your machine has several names on several networks), the port number, and the name of the filter set applied to requests received on that port (see section 3.1.2 for filter set definitions).

It may be useful to have several ports with different filters, in the case where some filters are specific to given browsers or tools (such as on-the-fly conversion of documents, caching, etc.).

Be sure to check your browser configuration so that it points to the ports you specified!

3.2.2  Proxy components

HTTP proxy (http_proxy.cmo)

No configuration required. Will work only if your machine has default routing.

Forwarding proxy (proxies.cmo)

For each protocol scheme that should be forwarded, specify the scheme (ftp, http, etc.), the fully qualified host name of the proxy host, and the port number.

Since V6 does not support proxying for anything other than http at this time, you will probably need this component for the other common protocol schemes. Remember, the main interest of V6 is to have all documents retrieved from the Web go through a series of filters. Thus, even if V6 only acts as a forwarding proxy for certain protocols, it may still be useful to configure your browser so that all requests go through V6.
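The choice between direct HTTP proxying and forwarding to another proxy can be sketched as a dispatch on the URI scheme. The function route, the host name proxy.example.com and the rule that a configured forwarder takes precedence over direct proxying are assumptions made for this example, not documented V6 behaviour.

```python
from urllib.parse import urlsplit

def route(uri, forwarders, direct_schemes=("http",)):
    """Decide how a proxied request is handled, based on its URI scheme."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme in forwarders:                 # assumed to take precedence
        host, port = forwarders[scheme]
        return ("forward", host, port)
    if scheme in direct_schemes:
        return ("direct",)
    return ("unsupported",)

# Hypothetical forwarding table: one (host, port) pair per scheme.
forwarders = {
    "ftp": ("proxy.example.com", 8080),
    "gopher": ("proxy.example.com", 8080),
}
print(route("ftp://ftp.inria.fr/pub/", forwarders))
print(route("http://pauillac.inria.fr/", forwarders))
print(route("wais://example.org/db", forwarders))
```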

3.2.3  Service Components

File system mapping (fs.cmo)

In contrast to most other http servers, this component allows several different file system hierarchies to be served under different path prefixes, without having to make symbolic links all over the place. The configuration is quite simple: for each hierarchy, specify the local URL prefix to be used on the server and the corresponding absolute path on the filesystem. For example,
doc/foobar      /net/software/foobar/documentation
will tell V6 to respond to a request for /doc/foobar/some/path with the file at /net/software/foobar/documentation/some/path.
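The prefix-to-directory mapping can be sketched as follows. The function map_path is hypothetical, and the refusal of paths that escape the mapped directory through ".." is a precaution added for this example; how V6 itself treats such paths is not documented here.

```python
import posixpath

def map_path(mappings, url_path):
    """Translate a request path into a file path using (prefix, root) pairs."""
    for prefix, root in mappings:
        base = "/" + prefix.strip("/")
        if url_path == base or url_path.startswith(base + "/"):
            rest = url_path[len(base):].lstrip("/")
            target = posixpath.normpath(posixpath.join(root, rest))
            # Assumption of this sketch: never serve anything outside root.
            if target == root or target.startswith(root.rstrip("/") + "/"):
                return target
            return None
    return None

mappings = [("doc/foobar", "/net/software/foobar/documentation")]
print(map_path(mappings, "/doc/foobar/some/path"))
```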

CGI frontend (cgi_ctl.cmo)

CGI support is actually in the core V6 engine. This module only provides an interface to access this feature and to read the configuration file. The configuration file (cgi.conf) allows for specifying either directories of CGI programs, or single CGI programs. The first form
dir bin                 ~/bin/cgi
maps requests for /bin/foo to the CGI program ~/bin/cgi/foo. However, /bin/ga/bu is mapped to ~/bin/cgi/ga, and not to ~/bin/cgi/ga/bu even if it exists. The second form
file search             ~/bin/ffwsearch
maps requests for /search to the CGI program ~/bin/ffwsearch.
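The two forms can be modelled with the sketch below, where each rule is a (kind, name, target) triple. The function resolve_cgi is hypothetical; returning the remaining path segments alongside the program, and restricting rule names to a single path segment, are assumptions of the example.

```python
def resolve_cgi(rules, url_path):
    """Return (program, extra_path) for url_path, or None if no rule applies."""
    segments = [s for s in url_path.split("/") if s]
    if not segments:
        return None
    for kind, name, target in rules:
        if segments[0] != name:
            continue
        if kind == "file":
            return target, "/".join(segments[1:])
        if kind == "dir" and len(segments) >= 2:
            # Only the first segment after the directory names the program.
            return target + "/" + segments[1], "/".join(segments[2:])
    return None

rules = [("dir", "bin", "~/bin/cgi"), ("file", "search", "~/bin/ffwsearch")]
print(resolve_cgi(rules, "/bin/foo"))
print(resolve_cgi(rules, "/bin/ga/bu"))
print(resolve_cgi(rules, "/search"))
```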

3.2.4  Filters

Proxy authentication (pauth.cmo)

When this filter is inserted, all accesses to the proxy must be authenticated (header Proxy-Authorization). The accepted users are defined in the configuration file pauth.conf. User names are arbitrary tokens (no space). Passwords are encoded with MD5. To get the encoding of a password, use the v6pass program included in the distribution.
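A check of the Proxy-Authorization header against MD5-encoded passwords might look like the sketch below. Several points are assumptions and not taken from V6: the Basic authentication scheme, the hexadecimal form of the stored digest, and the dictionary standing in for pauth.conf (whose actual syntax, like the exact output of v6pass, is not shown in this manual).

```python
import base64
import hashlib
import hmac

def md5_hex(password):
    """Assumed encoding: hexadecimal MD5 digest of the UTF-8 password."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()

def check_proxy_auth(header_value, users):
    """Validate a Proxy-Authorization header value (Basic scheme assumed)."""
    if not header_value:
        return False
    scheme, _, payload = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(payload.strip(), validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    user, separator, password = decoded.partition(":")
    if not separator or user not in users:
        return False
    # Constant-time comparison of the two hexadecimal digests.
    return hmac.compare_digest(md5_hex(password), users[user])

# Hypothetical in-memory stand-in for the contents of pauth.conf.
users = {"alice": md5_hex("secret")}
header = "Basic " + base64.b64encode(b"alice:secret").decode("ascii")
print(check_proxy_auth(header, users))
```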

Redirection (redirect.cmo)

The configuration file (redirect.conf) contains redirection specifications as regular expressions (from the Str library in the Objective Caml distribution). Each redirection is composed of a regular expression matching the request URI, and a substitution expression for producing the redirected URI. The example
Redirect "http://v6\(:80\)?\(.*\)" "\2"
makes v6 a virtual host name for the local services of the V6 engine, so that you can add things like http://v6/services in your bookmarks independently of where the engine is actually running.
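To see what this rule does, the sketch below replays it in Python. In the Str syntax, \( and \) delimit groups, whereas Python writes them as plain parentheses, so the pattern is translated first. The helper str_to_python handles only grouping and alternation, and the assumption that a rule must match from the start of the URI and replaces it entirely is ours, not a documented property of redirect.cmo.

```python
import re

def str_to_python(pattern):
    """Translate the grouping/alternation operators of Str syntax to Python's."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( \) \| are operators in Str; other escapes are kept as they are.
            out.append(nxt if nxt in "()|" else c + nxt)
            i += 2
        elif c in "()|":
            out.append("\\" + c)             # bare ( ) | are literals in Str
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)

def redirect(uri, rules):
    """Apply the first matching (pattern, template) rule, else return uri."""
    for pattern, template in rules:
        match = re.match(str_to_python(pattern), uri)
        if match:
            return match.expand(template)    # \1 .. \9 refer to groups
    return uri

rules = [(r"http://v6\(:80\)?\(.*\)", r"\2")]
print(redirect("http://v6/services", rules))
print(redirect("http://v6:80/cache", rules))
print(redirect("http://pauillac.inria.fr/", rules))
```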

HTML Filtering (a.k.a NoShit) (noshit.cmo)

GIF Deinterlacing (deinterlace.cmo)

Global History (history.cmo)

Cache (cache*.cmo)

The configuration file cache.conf contains directives for specifying
codes:
the list of HTTP codes that we want to cache. A good default is positive answers (200) and permanent redirections (301):
codes 200 301
privacy:
the private directive says that your cache is supposed to be strictly private. Even documents that were protected by HTTP Authentication will be cached.
nocache:
the nocache directive specifies regular expressions of URLs that should not be cached (e.g. local documents, local services, servers on the same site, responses from search engines, ...).
cookie:
an arbitrary string to protect cache modifications from being triggered by malicious third-party pages.

We hope to provide reasonable expiration support in the next releases. The documentation for the cache interface is on-line (/cache).
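The effect of the codes, private and nocache directives on a caching decision can be sketched as follows. The class CachePolicy is hypothetical, Python's re syntax stands in for the regular expressions of cache.conf, and the rule that authenticated documents are skipped unless the cache is private is inferred from the description of the private directive.

```python
import re
from dataclasses import dataclass, field

@dataclass
class CachePolicy:
    codes: set = field(default_factory=lambda: {200, 301})
    private: bool = False
    nocache: list = field(default_factory=list)

    def should_cache(self, url, status, authenticated=False):
        """Decide whether a response may be stored in the cache."""
        if status not in self.codes:
            return False
        if authenticated and not self.private:
            return False                     # inferred from the private directive
        return not any(re.search(p, url) for p in self.nocache)

policy = CachePolicy(nocache=[r"^http://localhost[:/]"])
print(policy.should_cache("http://pauillac.inria.fr/", 200))
print(policy.should_cache("http://pauillac.inria.fr/", 404))
print(policy.should_cache("http://localhost/private", 200))
```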

Indexing Memory (indexing.cmo)

This filter is an experiment in using NLP tools such as incremental full-text indexing to provide navigation aid. As an alternative to bookmarks, we suggest that all incoming HTML or text documents should be indexed on the fly by some full-text indexer. Then, finding a place on the Web that the user has visited in the past, and that discussed some subject, is just a matter of querying the database with some keywords.

The requirements for this experiment were:

A first Web scan revealed surprisingly little freely available software meeting these requirements, and we chose FFW for the experiment. Unfortunately, we had to make some patches to the official distribution, and the FFW license forbids redistribution of modified versions.

Thus, the V6 distribution contains only our diffs, and the user has to get the original FFW distribution from
http://www.nta.no/produkter/ffw/ffw.html
then has to apply the patch and compile the software (C++ required!).

The indexing component requires ffwindex and ffwmerge to be in the PATH. It builds the database $HOME/.v6/ffwdb/cache (make sure that the directory $HOME/.v6/ffwdb/ exists). Then configure and install the CGI script ffwsearch in one of your binary directories. To offer index querying, check cgi.conf, and add something like
file search  ~/bin/ffwsearch
so that you can issue queries with http://v6/search/cache.

The filter defined by this component is named "Indexing Memory".

3.3  Running V6

V6 is started by executing the v6 command. The allowed options are
-engines <n>
specifies the number of engines (threads) for processing requests (default is 10). The relation between observable speed and number of engines is not straightforward.
-modules <file>
specifies an alternate set of modules to be loaded (default is modules.conf).
-dir <directory>
specifies an alternate directory for all V6 configuration files. Also affects components that compute the path names of their persistent storage files from the V6 root directory.
-debug
makes V6 quite verbose.
foo.cmo
specifies additional modules to be loaded during startup.
V6 logs its transactions on (buffered) standard output. Debugging messages go to standard error.

To check that V6 started properly, try the URL http://v6/v6 (assuming you kept the redirection rule shown in section 3.2.4). From there, you can access the builtin services (see section 3.7).

3.4  Configuring your browser

In your browser, go to the preferences panel for the network (or proxies). Set the proper host name and port number for each protocol for which you want V6 as a filtering proxy.

3.5  Dynamic configuration of V6

In this release, V6 cannot be easily re-configured while running. For the moment, one has to kill V6, edit the configuration file, and restart the engine.

3.6  Killing V6

V6 can be killed safely at (almost) any time (using SIGINT or SIGTERM). Most components will checkpoint properly when the program exits.

3.7  Builtin services

3.7.1  Services /services

The list of registered V6 service components can be accessed at the /services URL. For each component, a short description is given and, when available, a pointer to the documentation (and further configuration).

3.7.2  Filters /filters

The list of registered V6 filters is available at /filters. A simple dynamic configuration form is also available, but beware that the configuration changes are valid only for the current session (they are not saved to disk).

3.7.3  Engines /engines

The engines interface provides a simple ps and kill interface to V6 engines. Working engines may be killed if necessary (e.g. when stuck on outbound connections).

3.8  Troubleshooting

Chapter 4  Programmer's Manual

NOT YET AVAILABLE
