The goal of one-sided messaging is to decouple communication from synchronization. Conventional message-passing libraries like MPI implicitly synchronize because a message must be actively sent and actively received. This is called cooperative communication, and while it works well for some applications, it greatly complicates others.
There are several general ideas and mechanisms that are necessary parts of implementing one-sided messaging.
On some levels these are implementation details, but the interface and functionality have been chosen with them in mind. They all have to do with setting up a one-sided messaging system that does not necessarily use threads and instead allows asynchronous communication.
An important design choice is whether communication is synchronous or asynchronous. With synchronous communication you block on various I/O operations and wait until they complete. For this method to be efficient, the underlying system must be doing other useful work while that I/O is proceeding.
That is not the method we choose here. We use asynchronous communication because we are not relying on an underlying threaded or multitasking system; instead the user is involved to some extent in overlapping communication and computation.
Whenever something is supposed to be happening in the background, you need some way of checking on that operation. In this system we use "doorbells", or just "bells", to inform the user that an I/O operation has completed. Exactly which operation has completed depends on the usage, but in this release "ringing a bell" means incrementing an integer. What that integer represents is entirely up to the application.
Most forms of interprocess communication use some form of special memory to effect the communication. Within an SMP, communication is usually through shared memory. Over a network, the NIC usually has a region of pinned memory that it interacts with. This special memory is generally separate from the user process's memory and sometimes is not directly accessible without special permissions.
Message data often has to be copied across the memory bus three times on each of the send and receive sides. The first copy is made when the user assembles the message into a contiguous buffer in user space, the second when the system copies it to special memory, and the third when the NIC copies it out of special memory. On the receive side the steps are reversed. This many trips across the memory bus can be a serious impediment to overall messaging speed.
If the user can directly access the special memory, one of these copies can be eliminated: the user no longer allocates a contiguous buffer in user space to pack the message, but instead packs it directly into the special buffer.
Some NICs can work with virtual addresses, and can therefore copy the message directly out of user space, and under some circumstances that can eliminate another copy. If the user's data starts out contiguous then it can be copied directly from its original location to its destination.
At present Cheetah does not support an interface that allows direct use of special memory. Libraries like MPI don't have a special-memory interface, and since that is the primary communication mechanism in this first release, there is currently no support for eliminating these copies.
This generation of Cheetah is designed to be usable in a thread-free environment because that is the lowest common denominator. One implication is the need for something like the bells described above. Another is that the user has to do something to trigger message processing rather than assuming it will happen in the background. In Cheetah the 'poll' function must be called periodically to move messages into and out of the network. The user can call 'poll' directly, and the other Cheetah functions also call 'poll' internally.
There are several basic operations needed for doing one-sided communication, each described below:
The description here is in terms of the general functionality and will treat each of these items as global functions. The actual functions are member functions of the controller class.
Take a user buffer (specified by a pointer and a byte count) and send it to a specified location in a remote context (specified by a pointer valid in that context). A local bell is rung when the local buffer can be reused and a remote bell is rung when the data has been copied into the remote buffer.
Fetch a user buffer in a remote context (specified by a pointer, a byte count and a remote context) and move it to a local buffer (specified by a pointer). A remote bell is rung when the data has been copied out of the remote buffer, and a local bell is rung when the data has been copied into the local buffer.
Call a function (specified by a function ID produced by register_function below) in a remote context (specified by a context ID), and pass to that function a buffer from the local context.
Rather than pass raw function pointers across the wire, we pass function IDs, which are just integers that all of the contexts agree on. The function takes an integer for the calling context, an integer for the function ID, a pointer to the buffer, and a byte count for the buffer. The function pointer is passed to 'register_function', which returns an integer ID. For this to work properly, all of the contexts have to call 'register_function' for the same functions in the same order, since new function IDs are generated by simply incrementing an integer.
This is a classic barrier: all of the contexts call barrier, and none of them leaves the barrier until all have entered it. If one context forgets to call barrier, everyone else waits forever.
Since we're not counting on a polling thread in the background, the user has to kick the system periodically by calling poll. If poll is never called, no messages are guaranteed to be delivered. That means that code which is the moral equivalent of:
while (*bell < 1) ;
will never exit because poll is never called. Instead you have to write
while (*bell < 1) poll();
That loop is written for you in the 'wait' function.
The processes are started and stopped in a communication-system-dependent way. For shared memory, initializing the system performs the necessary forks. For MPI, the processes exist before main starts. The multiprocess behavior is not well defined before the Cheetah system is initialized, so Cheetah should be initialized as soon as possible, and code that executes before then should not depend on the parallel behavior.
At teardown, all processes may still exist after the finalize call, but their behavior is undefined. The program should exit shortly after calling finalize, and its behavior should not depend on whether all of the parallel processes still exist at that point.