There are many kinds of hardware that a messaging system will have to interact with. Our goal here is to lay out some of the complexities that have to be dealt with and a software design for dealing with them. This is a somewhat long-range design, and not necessary if you know, for example, that you will only be using one communication mechanism.
Memory is a precious resource on modern computers in various ways. While the total amount of memory is often large, it is also generally slow, so every touch of it needs to be controllable. There are also various kinds of special memory that messaging systems use, often with tight constraints on total size because they may be part of device interfaces. Finally, various kinds of buffers must be allocated and freed, and the responsibility for their management must be clear between the library and the user's code. For these reasons, we define interfaces for dealing with memory carefully.
The interfaces for put and get are defined to make it easy to handle memory efficiently. They copy from one user-provided buffer to another. By calling put or get, the user guarantees to the system that it is allowed to touch whatever memory is involved. The interface is simple enough that a DMA engine should be able to handle it efficiently.
There are complications, though. Most relevant computers today use virtual memory, which means that the OS is free to move pages around, including moving them off to disk, as long as it keeps the TLB (the translation between the virtual addresses that the user sees and physical addresses) up to date. Most NICs can only deal with pinned memory, and limited quantities of it, so generic pointers obtained from malloc can't be given to the NIC, and therefore can't be given to a DMA engine.
The implication is that you have to do one of two things when the user wants to DMA a patch of memory: (1) pin it down and tell the NIC about it before doing the DMA or (2) copy it to a preallocated pinned region and DMA it from there.
Each of these has advantages and disadvantages, and the balance between them will depend on the hardware in use. If pinning memory is cheap, then you can effectively treat all of memory as "pinnable" and DMA from wherever you want. Generally the remote side also has to be pinned though, and you then also have to send a control message to pin the destination before doing the DMA. Then the memory should at some point be unpinned to avoid wedging the OS or the NIC. If all of these operations are efficient, then this strategy can work, and is called 'zero-copy' messaging.
There exist NICs that are sufficiently smart to keep a copy of the TLB, and in that case you can give virtual addresses to the NIC and it will handle the rest. This requires that the OS coordinate TLB operations with the NIC, which increases the complexity of both the OS and the NIC. If a system does this, though, it can be very powerful.
Very often, though, one or more of the stages of pinning and unpinning local and remote memory is slow, so data has to be copied into a preallocated pinned region, DMA'ed to a previously pinned remote region, and then copied from there into the user's memory. This is usually called 'one-copy' messaging because there is one copy on each side.
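The one-copy pattern can be sketched as follows. This is an illustrative simulation, not the library's implementation: the pinned regions are ordinary static arrays, memcpy stands in for the DMA engine, and names like stage_for_send are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Fixed preallocated buffers stand in for the pinned send region on
// the local side and the pinned receive region on the remote side.
const std::size_t kPinnedSize = 4096;
static unsigned char pinned_send[kPinnedSize];  // local pinned staging buffer
static unsigned char pinned_recv[kPinnedSize];  // remote pinned staging buffer

// Copy user data into the pinned region (the one copy on the sending side).
void stage_for_send(const void* user_buf, std::size_t len) {
  assert(len <= kPinnedSize);
  std::memcpy(pinned_send, user_buf, len);
}

// Stand-in for the DMA between the two pre-pinned regions.
void dma_pinned_to_pinned(std::size_t len) {
  std::memcpy(pinned_recv, pinned_send, len);
}

// Copy from the pinned region into the user's memory (the one copy on
// the receiving side).
void unstage_after_recv(void* user_buf, std::size_t len) {
  std::memcpy(user_buf, pinned_recv, len);
}

// One-copy put: one copy on each side, plus the DMA in the middle.
void one_copy_put(const void* src, void* dst, std::size_t len) {
  stage_for_send(src, len);
  dma_pinned_to_pinned(len);
  unstage_after_recv(dst, len);
}
```

In a real system the DMA step is asynchronous and completion is signaled through the bells; the sketch collapses that into a synchronous call to keep the copy structure visible.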
Very often a data structure cannot simply be DMA'ed as it exists in memory, and must be serialized first. Serialization generally involves copying all or part of the data structure into a contiguous buffer (possibly including information so that things like floating point formats can be interpreted correctly on the far side, which may use a different format). That is one copy, and if that copy has to land in a user-space buffer then there is a second copy into the pinned buffer, for a two-copy messaging system.
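For concreteness, here is what serializing a non-contiguous structure into one contiguous buffer might look like. The structure, the length header, and the function names are all hypothetical; the sketch also assumes both sides agree on endianness and floating point format, which is exactly the information a real system might add to the header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Serialize a vector whose length is only known at run time into one
// contiguous byte buffer. A length header is prepended so the far
// side can interpret the payload.
std::vector<unsigned char> serialize(const std::vector<double>& v) {
  std::size_t n = v.size();
  std::vector<unsigned char> buf(sizeof(n) + n * sizeof(double));
  std::memcpy(buf.data(), &n, sizeof(n));                             // header
  std::memcpy(buf.data() + sizeof(n), v.data(), n * sizeof(double));  // payload
  return buf;
}

// Reverse the process on the receiving side.
std::vector<double> deserialize(const std::vector<unsigned char>& buf) {
  std::size_t n = 0;
  std::memcpy(&n, buf.data(), sizeof(n));
  std::vector<double> v(n);
  std::memcpy(v.data(), buf.data() + sizeof(n), n * sizeof(double));
  return v;
}
```

The memcpy into `buf` is the first copy the text describes; if `buf` lives in ordinary user-space memory, moving it into the pinned region is the second.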
Libraries can also impose copies, sometimes necessarily, other times not. For example, if the system wants to return immediately from a send operation, but there is no space in pinned memory, and the semantics of the function guarantee that the user may modify the buffer again once the function returns, then the library has to make another copy.
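A minimal sketch of that semantics-forced copy, assuming a hypothetical SendQueue class: when no pinned space is free, the library copies the message into storage it owns and queues it, so the caller can overwrite its buffer the moment send returns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <deque>
#include <vector>

// Hypothetical staging queue for sends that must return immediately
// even when pinned memory is full.
class SendQueue {
public:
  // Returns immediately; after this call the user may overwrite buf,
  // because the library now holds its own copy.
  void send(const void* buf, std::size_t len) {
    std::vector<unsigned char> copy(len);
    std::memcpy(copy.data(), buf, len);  // the extra copy the semantics force
    pending_.push_back(copy);
  }

  std::size_t pending() const { return pending_.size(); }

  // Later, when pinned space frees up, drain one queued message.
  std::vector<unsigned char> pop() {
    std::vector<unsigned char> m = pending_.front();
    pending_.pop_front();
    return m;
  }

private:
  std::deque<std::vector<unsigned char>> pending_;
};
```

An interface that instead lets the send complete asynchronously (signaling a bell when the buffer is reusable) avoids this copy, which is the "other times not" in the text.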
It is a fact of life that different pieces of hardware require different patterns of interaction. When telling a NIC that has a copy of the TLB to do a put or get, all you fundamentally have to say is:
DMA(source, destination, length)

If the NIC can only deal with pinned memory, but you happen to know that the source and destination are already pinned, you can say the same thing. If they're not in pinned memory but pinning and unpinning is fast, then the pattern would look like:
PIN(source, length)
PIN(destination, length)
DMA(source, destination, length)
UNPIN(source, length)
UNPIN(destination, length)

If pin and unpin are slow, then you need to include all of the logic for copying in and out of pinned memory. Each device will have slightly different requirements, there will be systems with various combinations of hardware, and decisions will have to be made about how to balance between them.
These sorts of complexities are well suited for the kinds of polymorphism that C++ can express.
Here are a set of design elements that satisfy the above requirements and principles.
This turns out to be surprisingly simple: a classic C++ design.
The abstract base class has the same interface as the cheetah_core global functions, except that they are pure virtual functions.
class Controller {
public:
  virtual ~Controller();
  virtual int ncontexts() const = 0;
  virtual int mycontext() const = 0;
  typedef void (*Handler_t)(int who, int tag, void* buf, int len);
  // 'register' is a reserved word in C++, so the registration
  // function needs another name.
  virtual void register_handler(int tag, Handler_t handler) = 0;
  virtual void ainvoke(int context, int tag, void* buffer, int len,
                       int* local_bell) = 0;
  virtual void put(int context, void* remote, void* local, int len,
                   int* local_bell, int* remote_bell) = 0;
  virtual void get(int context, void* remote, void* local, int len,
                   int* local_bell, int* remote_bell) = 0;
  virtual void poll() = 0;
  virtual void wait(volatile int* bell, int value) = 0;
  virtual void barrier() = 0;
};
Each of these subclasses would be abstract because they define extensions to the interface of Controller without defining any new implementation.
class DynamicController : public Controller { ... };
class OneCopyController : public Controller { ... };
class OneCopyDynamicController : public Controller { ... };

The class DynamicController would define extensions to the interface of Controller for adding and deleting contexts, and OneCopyController would define extensions for the memory management associated with managing pinned or shared memory. Then you would implement specific interfaces like
class HiPPIController : public OneCopyController { ... };
class MPIController : public Controller { ... };

which would implement those interfaces.
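The payoff of this design is that user code written against the abstract base runs unchanged on any backend. The sketch below uses a trimmed-down two-function version of the Controller interface and a hypothetical single-process LoopbackController stub; a HiPPIController or MPIController would slot into the same call site.

```cpp
#include <cassert>
#include <cstring>

// Trimmed-down Controller: just put and wait, enough to show the
// polymorphic dispatch. The full interface is given above.
class Controller {
public:
  virtual ~Controller() {}
  virtual void put(int context, void* remote, void* local, int len,
                   int* local_bell, int* remote_bell) = 0;
  virtual void wait(volatile int* bell, int value) = 0;
};

// Hypothetical single-process stub backend.
class LoopbackController : public Controller {
public:
  void put(int /*context*/, void* remote, void* local, int len,
           int* local_bell, int* remote_bell) {
    std::memcpy(remote, local, len);  // stand-in for the real transport
    if (local_bell) *local_bell = 1;  // ring both bells on completion
    if (remote_bell) *remote_bell = 1;
  }
  void wait(volatile int* bell, int value) {
    while (*bell != value) {}  // trivially satisfied after a loopback put
  }
};

// User code sees only the abstract interface.
void exchange(Controller& c, void* dst, void* src, int len) {
  int bell = 0;
  c.put(0, dst, src, len, &bell, 0);
  c.wait(&bell, 1);
}
```

Swapping in a different concrete controller requires no change to exchange; the strategy decisions (zero-copy vs. one-copy, pinning policy) live entirely inside each subclass.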
Steve Karmesin Last modified: Wed Mar 29 13:35:29 MST 2000