Cheetah Core Interface

Steve Karmesin
$Date: 2000/04/17 21:29:18 $
$Revision: 1.1 $

The cheetah_core interface is the core set of C routines that handle moving bits between contexts and invoking appropriate operations there.  The fundamental operations are setup/teardown, the context queries, put, get, register/ainvoke, poll, wait, and barrier.  These are described more fully in the following sections.
This document describes a particular implementation of the Cheetah core that aims for maximum performance under the constraint of being highly portable.  In particular, it is designed to be implementable on top of MPI without threads.  Within that constraint it works hard to be as efficient as possible.
The goal of this library layer is to provide a low level interface on top of which other more sophisticated interfaces can be built.  This is a purely C interface, not C++.

Setup/Teardown

There are many possible ways to initiate parallelism.  This library's goal is to be able to work on top of MPI, so it has an initialization mechanism compatible with that: an initialization function that takes the command line arguments and a finalize function that takes no arguments.  The guarantees on what parallelism exists before the setup call are the same as for MPI, which is to say system dependent.  The system is free to have only a single context before setup is called and to call fork inside of setup, or it can run multiple executables from the beginning and hook them together in the setup call.
The setup function is:
void cheetah_core_setup(int *argc, char ***argv)
The user passes a pointer to the arguments to the function main, and the library parses the arguments to pull out the information it needs.  The only guaranteed command line argument for this to parse is -np #, for the number of contexts.
The teardown function is:
void cheetah_core_finalize()
This does a barrier to wait for all contexts to reach this point, then it gets rid of the parallelism.  After this function is called, no other cheetah_core functions can be used.  Whether all of the contexts but one have actually been destroyed at this point or not is implementation dependent.

Queries

Once the parallelism has been initiated, there are two primary queries the user can make of the system: how many contexts are there and which one am I.
int cheetah_core_ncontexts()
Return the total number of contexts that have been started up.
int cheetah_core_mycontext()
Return a unique identifier for this context in the range [0..ncontexts()-1].
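As an illustration, the setup, query, and teardown calls might be combined as in the sketch below.  The four cheetah_core definitions here are hypothetical single-context stand-ins, included only so that the sketch compiles without the library, and run_program is an invented name.  The sketch assumes setup receives a pointer to main's argv, that is, a char ***.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical single-context stand-ins so the sketch compiles on its own.
   Remove them when linking against a real cheetah_core library. */
void cheetah_core_setup(int *argc, char ***argv) { (void)argc; (void)argv; }
void cheetah_core_finalize(void) { }
int  cheetah_core_ncontexts(void) { return 1; }
int  cheetah_core_mycontext(void) { return 0; }

/* Body of a typical program; a real main() would simply forward to it.
   Returns the number of contexts that were running. */
int run_program(int argc, char **argv)
{
    int n, me;

    /* Pass pointers so the library can parse its own options (e.g. -np #). */
    cheetah_core_setup(&argc, &argv);

    n  = cheetah_core_ncontexts();
    me = cheetah_core_mycontext();
    assert(me >= 0 && me < n);      /* identifiers lie in [0..ncontexts()-1] */
    printf("context %d of %d\n", me, n);

    /* No cheetah_core call is allowed after this point. */
    cheetah_core_finalize();
    return n;
}
```

A real program would call run_program(argc, argv) from main; with the stand-ins above it always reports a single context.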

Doorbells

The requirement of being able to be efficient without relying on threads has interface implications compared to the conventional definition of cheetah.  In particular, the user has to be able to specify non-blocking operations, while the normal cheetah interface is entirely blocking on the assumption that there is always another thread available to run should this one block.  Nonblocking operations are implemented here by using 'doorbells' which are counters that increment whenever a certain asynchronous event takes place.
In the current implementation, doorbells are just integers.  A thread safe implementation would substitute a doorbell that can be incremented safely from multiple threads simultaneously.
The intended use of a doorbell is that a series of asynchronous operations are begun with a given doorbell.  Each increments ("rings") the doorbell at an appropriate time.  The rest of the code can then wait until the doorbell reaches a given value to be sure that all of the asynchronous operations have completed.
When passing a doorbell to one of the cheetah_core routines, you pass a pointer to it so that the routine can modify it at will.
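The counting idea can be sketched in a few lines.  This is only an illustration of the pattern: the doorbell typedef, ring, and fake_async_op are hypothetical names, and the "asynchronous" operation here completes immediately.

```c
#include <assert.h>

typedef int doorbell;   /* doorbells are plain integers in this design */

/* Hypothetical helper: what "ringing" amounts to in a nonthreaded build. */
static void ring(doorbell *bell) { ++*bell; }

/* Hypothetical stand-in for an asynchronous operation.  It completes at
   once; a real operation would ring the bell at some later time. */
static void fake_async_op(doorbell *bell) { ring(bell); }

/* Start nops operations on one bell and return the bell's final value. */
int count_completions(int nops)
{
    doorbell bell = 0;
    int i;

    for (i = 0; i < nops; ++i)
        fake_async_op(&bell);   /* pass a pointer so the callee can ring it */

    assert(bell == nops);       /* every operation has been accounted for */
    return bell;
}
```

Because one bell counts all of the operations, the caller needs only a single comparison against the expected total rather than one flag per operation.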

Put

Given a buffer in the local context and a pointer to a buffer in a remote context, send the buffer to the remote context without intervention in the remote context.  No checks are made to be sure that the remote buffer is available, or to synchronize with the remote context before writing the data.  There is a doorbell on the sending side to indicate when the local buffer may be reused, and a doorbell on the remote side to indicate that the data has arrived.
void cheetah_core_put(int context, void *remote, void *local, int length,
                    int *local_bell, int *remote_bell)
context: The integer (as from cheetah_core_mycontext()) identifier for the remote context.
remote: A pointer valid in the remote context for where to put the buffer.
local: A pointer valid in this context for where to get the buffer.
length: The length in bytes of the buffer.
local_bell: The local bell.  This bell is rung (in the current context) when the buffer on this side can be reused.
remote_bell: The remote bell.  This bell is rung (in the remote context) when that context can use the destination buffer.
It is important to keep in mind that no synchronization is guaranteed by this function before writing the remote data. You can think of this as an interface to a DMA engine, and all synchronization has to be performed outside of this call.
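The calling pattern for put might look like the following sketch.  The cheetah_core_put defined here is a hypothetical loopback stand-in (one context, so "remote" memory is local and both bells ring at once); it exists only to make the sketch self-contained and says nothing about how the real transfer is implemented.  put_roundtrip is likewise an invented name.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical loopback stand-in: with a single context the transfer is a
   memcpy and both bells ring immediately.  Remove when linking against a
   real cheetah_core library. */
void cheetah_core_put(int context, void *remote, void *local, int length,
                      int *local_bell, int *remote_bell)
{
    (void)context;
    memcpy(remote, local, (size_t)length);
    ++*local_bell;    /* the source buffer may be reused */
    ++*remote_bell;   /* the destination buffer holds the data */
}

/* Put a block of four ints and report whether it arrived intact. */
int put_roundtrip(void)
{
    int src[4] = { 1, 2, 3, 4 };
    int dst[4] = { 0, 0, 0, 0 };
    int sent = 0, arrived = 0;

    cheetah_core_put(0, dst, src, (int)sizeof src, &sent, &arrived);

    /* In a real run the sender would wait for `sent` to reach 1 before
       touching src, and the remote context would wait for `arrived`
       before reading dst. */
    assert(sent == 1 && arrived == 1);
    return memcmp(src, dst, sizeof src) == 0;
}
```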

Get

Given a buffer in the local context and a pointer to a buffer in a remote context, copy the remote buffer to the local buffer without intervention in the remote context.  No checks are made to be sure that the remote buffer is consistent or to synchronize with the remote context before getting the data.  There is a doorbell on the remote side which is rung when the data has been copied out, and a doorbell in the local context that will be rung when the data has arrived.
void cheetah_core_get(int context, void *remote, void *local, int length,
                    int *local_bell, int *remote_bell)
context: The integer (as from cheetah_core_mycontext()) identifier for the remote context.
remote: A pointer valid in the remote context for where to get the buffer.
local: A pointer valid in this context for where to put the buffer.
length: The length in bytes of the buffer.
local_bell: The local bell.  This bell is rung (in the current context) when the buffer on this side can be used.
remote_bell: The remote bell.  This bell is rung (in the remote context) when that context can reuse the source buffer.
It is important to keep in mind that no synchronization is guaranteed by this function before reading the remote data. You can think of this as an interface to a DMA engine, and all synchronization has to be performed outside of this call.
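Several gets can share one local bell, so that a single count tells the caller when every piece has arrived.  The sketch below illustrates this with a hypothetical loopback stand-in for cheetah_core_get (single context, immediate completion); get_two_halves is an invented name.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical loopback stand-in: with a single context the transfer is a
   memcpy and both bells ring immediately.  Remove when linking against a
   real cheetah_core library. */
void cheetah_core_get(int context, void *remote, void *local, int length,
                      int *local_bell, int *remote_bell)
{
    (void)context;
    memcpy(local, remote, (size_t)length);
    ++*local_bell;    /* the local buffer now holds the data */
    ++*remote_bell;   /* the remote source buffer may be reused */
}

/* Fetch an 8-byte buffer as two 4-byte gets that share one local bell. */
int get_two_halves(void)
{
    char remote_buf[8] = "abcdefg";   /* 7 characters plus the NUL */
    char local_buf[8];
    int have = 0, released = 0;

    memset(local_buf, 0, sizeof local_buf);
    cheetah_core_get(0, remote_buf,     local_buf,     4, &have, &released);
    cheetah_core_get(0, remote_buf + 4, local_buf + 4, 4, &have, &released);

    /* In a real run the caller would wait for `have` to reach 2 here. */
    assert(have == 2);
    return strcmp(local_buf, "abcdefg") == 0;
}
```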

Register/Ainvoke

The ainvoke function is available to allow one context to tell another context to call a given function with a given buffer of data.  Because function pointers aren't sufficiently portable, we register each function with a specific integer tag, and use that tag when telling the remote context what function we want called.
void cheetah_core_register(int tag, 
     void (*handler)(int who, int tag, void* buffer, int length) )
tag: The tag the user will use to indicate this function.
handler: The function pointer to associate with this tag.  The same function can be registered with multiple tags. The arguments to this function are:
who: The context that issued the ainvoke
tag: The tag that was used to call this function.
buffer: The buffer of data the user supplied in the ainvoke.
length: The length in bytes of the buffer.
void cheetah_core_ainvoke(int context, int tag, void *buffer, int length,
                        int *local_bell)
context: The context in which the function should be executed.
tag: The function to be invoked.
buffer: The buffer of data to pass to the handler in the remote context.
length: The length in bytes of the buffer.
local_bell: The bell to ring when the local buffer can be reused.
The handler function could perform an arbitrary computation (including calling other cheetah functions), and it is the user's responsibility to ensure that the handler returns sufficiently rapidly that the system does not hang.

No synchronization is done before calling the handler in the remote context, so if the user's code has to do any synchronization, it must either ensure it calls the ainvoke at a legal time, or the function being called must itself check for any synchronization conditions.

The handler function should not free the buffer pointer it is given, and neither may it capture a pointer to the buffer. The system is allowed to free the buffer upon the return of the handler.
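One way a tag table and the buffer rules could fit together is sketched below.  This is not the library's implementation: sketch_register, sketch_deliver, MAX_TAGS, and add_handler are hypothetical names, and delivery happens in-process rather than across contexts.  The point is that the handler copies out what it needs, because the system frees the buffer as soon as the handler returns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TAGS 16   /* assumed table size, chosen for the sketch */

typedef void (*handler_fn)(int who, int tag, void *buffer, int length);
static handler_fn handlers[MAX_TAGS];

/* Associate a handler with an integer tag.  Returns 0 on success. */
static int sketch_register(int tag, handler_fn h)
{
    if (tag < 0 || tag >= MAX_TAGS) return -1;
    handlers[tag] = h;
    return 0;
}

/* What the receiving side might do when an ainvoke message arrives: look
   up the tag, hand the handler a system-owned copy of the data, then free
   that copy once the handler returns. */
static int sketch_deliver(int who, int tag, const void *data, int length)
{
    void *copy;
    if (length < 0 || tag < 0 || tag >= MAX_TAGS || handlers[tag] == NULL)
        return -1;
    copy = malloc(length > 0 ? (size_t)length : 1);
    if (copy == NULL) return -1;
    memcpy(copy, data, (size_t)length);
    handlers[tag](who, tag, copy, length);
    free(copy);
    return 0;
}

static int total;   /* state owned by the handler, not by the buffer */

/* Example handler: copies the value out instead of keeping the pointer. */
static void add_handler(int who, int tag, void *buffer, int length)
{
    int v;
    (void)who; (void)tag;
    assert(length == (int)sizeof v);
    memcpy(&v, buffer, sizeof v);
    total += v;
}

/* Register one handler and deliver two invocations to it. */
int ainvoke_demo(void)
{
    int a = 5, b = 7;
    total = 0;
    if (sketch_register(3, add_handler) != 0) return -1;
    if (sketch_deliver(0, 3, &a, (int)sizeof a) != 0) return -1;
    if (sketch_deliver(0, 3, &b, (int)sizeof b) != 0) return -1;
    return total;
}
```

Using an integer index into a per-context table means the two contexts never exchange function addresses, which is what makes the scheme portable across separately loaded executables.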

Poll

In a nonthreaded environment, we have to be able to tell the system to check for incoming get requests, puts, ainvokes, etc., because there is no mechanism for them to happen completely behind the scenes.  The poll function should be called often enough to keep things flowing.  It is nonblocking in that if there are no messages to handle, it returns immediately.  In a nonthreaded environment all handler operations are of course done in the master thread, and they are done inside of the poll call.
void cheetah_core_poll()
The poll call takes no arguments.  It potentially has side effects in that it could perform a get (which could ring a local bell), a put (which could deposit memory locally and possibly ring a local bell), or an ainvoke (which could do an arbitrary computation).
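A long computation can keep messages flowing by polling at a fixed stride, as in the sketch below.  The cheetah_core_poll defined here is a hypothetical stand-in that drains a simulated message counter, and pending, handled, work_with_polling, and messages_handled are invented names; the simulated arrival of a message every tenth step is purely for illustration.

```c
#include <assert.h>

static int pending = 0;   /* hypothetical: messages queued by the transport */
static int handled = 0;   /* hypothetical: messages processed so far */

/* Hypothetical stand-in for the poll call: handle whatever is pending and
   return immediately when there is nothing to do. */
void cheetah_core_poll(void)
{
    while (pending > 0) {
        --pending;
        ++handled;
    }
}

/* Do `steps` units of work, polling once every `stride` steps. */
long work_with_polling(int steps, int stride)
{
    long sum = 0;
    int i;

    assert(stride > 0);
    for (i = 0; i < steps; ++i) {
        sum += i;                              /* the "real" work */
        if (i % 10 == 0) ++pending;            /* simulate a message arriving */
        if ((i + 1) % stride == 0) cheetah_core_poll();
    }
    cheetah_core_poll();                       /* drain anything left over */
    return sum;
}

/* Number of simulated messages processed so far. */
int messages_handled(void) { return handled; }
```

The stride is a tuning choice: a small stride keeps latency low for other contexts at the cost of more calls that find nothing to do.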

Wait

When the user wants to wait until a doorbell reaches a certain value, it is important that they not simply spin waiting for the bell to get there: poll must be called to allow the asynchronous events to take place that will increment the bell.  To make this easy, we provide the wait function which blocks until the bell reaches a given value while processing asynchronous events:
void cheetah_core_wait(int *bell, int value)
bell: The bell we're waiting for.
value: The value the bell should reach.  The function returns when *bell >= value.
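The behaviour described above amounts to polling in a loop until the bell reaches the target.  The sketch below shows that shape; it is an assumption about how such a wait could be written, not the library's code.  sketch_wait, pending_bell, polls_left, and wait_demo are hypothetical names, and the cheetah_core_poll here is a stand-in that completes one simulated operation after a fixed number of polls.

```c
#include <assert.h>

static int *pending_bell = 0;  /* hypothetical: bell of an operation in flight */
static int polls_left = 0;     /* hypothetical: polls until it completes */
static int poll_calls = 0;     /* how many times poll has been called */

/* Hypothetical stand-in for the poll call: the simulated operation
   completes, and rings its bell, after polls_left calls. */
void cheetah_core_poll(void)
{
    ++poll_calls;
    if (pending_bell != 0 && --polls_left <= 0) {
        ++*pending_bell;
        pending_bell = 0;
    }
}

/* Block until *bell >= value, processing asynchronous events meanwhile. */
void sketch_wait(int *bell, int value)
{
    while (*bell < value)
        cheetah_core_poll();   /* never spin without polling */
}

/* Start one simulated operation, wait for it, and return the poll count. */
int wait_demo(void)
{
    int bell = 0;

    pending_bell = &bell;
    polls_left = 3;
    poll_calls = 0;

    sketch_wait(&bell, 1);
    assert(bell >= 1);
    return poll_calls;
}
```

A plain `while (*bell < value) ;` loop in place of sketch_wait would never terminate in a nonthreaded build, since nothing else would run to ring the bell.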

Barrier

A simple barrier that ensures that all of the contexts have gotten to the same point.  It is necessary to have this be an intrinsic operation in order to be able to initialize safely.  It takes no arguments.
void cheetah_core_barrier()

Last modified: Tue Dec 21 16:01:30 MST 1999