Cheetah Polymorphism Considerations

Steve Karmesin
$Date: 2000/04/17 21:29:18 $
$Revision: 1.1 $

There are many kinds of hardware that a messaging system will have to interact with. Our goal here is to lay out some of the complexities that have to be dealt with and a software design for dealing with them. This is a somewhat long range design, and not necessary if you know, for example, that you will only be using one communication mechanism.

Zero Copy, One Copy, Two Copy

Memory is a precious resource on modern computers in several ways. While the total amount of memory is often large, it is also generally slow, so every touch of it needs to be under our control. There are also various kinds of special memory that messaging systems use, and these often have tight constraints on total size because they may be part of device interfaces. Finally, various kinds of buffers must be allocated and freed, and the responsibility for their management must be clear between the library and the user's code. For these reasons, we define interfaces for dealing with memory carefully.

The interfaces for put and get are defined to make it easy to handle memory efficiently. They copy from one provided buffer to another. By calling put or get, the user guarantees to the system that it is allowed to touch whatever memory is involved. The interface is simple enough that a DMA engine should be able to handle it efficiently.

There are complications, though. Most relevant computers today use virtual memory, which means the OS is free to move pages around, including moving them off to disk, as long as it keeps the TLB (the translation between the virtual addresses that the user sees and physical addresses) up to date. Most NICs can only deal with pinned memory, and limited quantities of it, so a generic pointer you get from malloc can't be handed to the NIC, and therefore can't be given to a DMA engine.

The implication is that you have to do one of two things when the user wants to DMA a patch of memory: (1) pin it down and tell the NIC about it before doing the DMA or (2) copy it to a preallocated pinned region and DMA it from there.
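
A rough sketch of the two options follows. The pin, unpin and dma functions are stand-in stubs for whatever the real device driver provides, and the staging buffer and its size are arbitrary; this illustrates the trade-off, not a real NIC interface.

#include <cstring>

// Stand-ins for the real device operations; any real NIC API will differ.
void pin(void*, int)   { /* make the pages resident and tell the NIC */ }
void unpin(void*, int) { /* release them again */ }
void dma(void* src, void* dst, int len) { std::memcpy(dst, src, len); }

char pinned_stage[65536];  // assume this region was pinned once at startup

// Option (1): pin the user's buffer and DMA directly from it.
void dma_by_pinning(void* src, void* remote_dst, int len)
{
  pin(src, len);                        // remote_dst assumed already pinned
  dma(src, remote_dst, len);
  unpin(src, len);
}

// Option (2): copy into a preallocated pinned region and DMA from there.
void dma_by_copying(void* src, void* remote_dst, int len)
{
  std::memcpy(pinned_stage, src, len);  // the extra copy
  dma(pinned_stage, remote_dst, len);
}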

Each of these has advantages and disadvantages, and the balance between them will depend on the hardware in use. If pinning memory is cheap, then you can effectively treat all of memory as "pinnable" and DMA from wherever you want. Generally the remote side has to be pinned as well, though, so you also have to send a control message to pin the destination before doing the DMA. At some point the memory should be unpinned again to avoid wedging the OS or the NIC. If all of these operations are efficient, this strategy works well, and is called 'zero-copy' messaging.

There exist NICs that are smart enough to keep a copy of the TLB, and in that case you can hand virtual addresses to the NIC and it will handle the translation from there. This requires that the OS coordinate TLB operations with the NIC, which increases the complexity of both the OS and the NIC. If a system does this, though, it can be very powerful.

Very often, though, one or more of the stages of pinning and unpinning local and remote memory is slow, so data has to be copied into some preallocated pinned region, DMA'ed to a remote previously pinned region, and then copied from there to the user's memory. This is usually called 'one-copy' messaging because there is one copy on each side.
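
A sketch of that path, showing both sides: the staging buffers, their sizes, and the dma stand-in are all assumptions made for illustration.

#include <cstring>

// Pinned staging regions, one per side, assumed to have been set up in advance.
char send_stage[65536];  // pinned on the sending node
char recv_stage[65536];  // pinned on the receiving node

void dma(void* src, void* dst, int len) { std::memcpy(dst, src, len); }  // stand-in

// Sending side: one copy into pinned memory, then the DMA.
void one_copy_send(const void* user_src, int len)
{
  std::memcpy(send_stage, user_src, len);  // the copy on the sending side
  dma(send_stage, recv_stage, len);        // pinned-to-pinned transfer
}

// Receiving side: one copy back out of pinned memory into user memory.
void one_copy_receive(void* user_dst, int len)
{
  std::memcpy(user_dst, recv_stage, len);  // the copy on the receiving side
}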

Very often a data structure cannot simply be DMA'ed as it sits in memory and must be serialized first. Serialization generally involves copying all or part of the data structure into a contiguous buffer (possibly along with information so that things like floating-point formats can be interpreted correctly on the far side, which may use a different representation). That is one copy, and if that buffer has to live in user space then there is a second copy into the pinned buffer, giving a two-copy messaging system.
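
Here is a sketch of that first, serialization copy for a strided array. The small count header stands in for the format information mentioned above; the layout is purely illustrative.

#include <cstring>
#include <vector>

// Gather a strided (non-contiguous) array of doubles into one contiguous
// buffer, prefixed by a small header.  This is the serialization copy; the
// result may still need a second copy into a pinned buffer before the DMA.
std::vector<char> serialize(const double* data, int count, int stride)
{
  std::vector<char> buf(sizeof(int) + count * sizeof(double));
  std::memcpy(&buf[0], &count, sizeof(int));             // illustrative header
  char* out = &buf[0] + sizeof(int);
  for (int i = 0; i < count; ++i)
  {
    std::memcpy(out, data + i * stride, sizeof(double)); // gather one element
    out += sizeof(double);
  }
  return buf;
}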

Libraries can also impose copies, sometimes out of necessity, sometimes not. For example, if a send operation wants to return immediately, but there is no space in pinned memory and the function's semantics guarantee that the user may modify the buffer again as soon as the call returns, the library has to make another copy.
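
A sketch of that fallback copy. Here pinned_space_available, stage_and_dma and the deferred queue are placeholders for the library's real bookkeeping, stubbed out so the example stands on its own.

#include <cstring>
#include <deque>
#include <vector>

std::deque< std::vector<char> > deferred;           // copies waiting for pinned space
bool pinned_space_available(int) { return false; }  // placeholder query
void stage_and_dma(const void*, int) { }            // copy into pinned memory, start DMA

// The call returns immediately and the user may then reuse 'buffer', so if
// no pinned space is free the library must first take a copy of its own.
void send(const void* buffer, int len)
{
  if (pinned_space_available(len))
  {
    stage_and_dma(buffer, len);          // the usual staging copy
  }
  else
  {
    std::vector<char> copy(len);
    std::memcpy(&copy[0], buffer, len);  // the extra, library-imposed copy
    deferred.push_back(copy);            // finished later, e.g. from poll()
  }
}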

Handling Different Kinds of Hardware

It is a fact of life that different pieces of hardware require different patterns of interaction. When telling a NIC that has a copy of the TLB to do a put or get, all you fundamentally have to say is:

 DMA(source,destination,length)
If the NIC can only deal with pinned memory, but you happen to know that the source and destination are already pinned, you can say the same thing. If they're not in pinned memory but pinning and unpinning is fast, then the pattern would look like:
 PIN(source,length)
 PIN(destination,length)
 DMA(source,destination,length)
 UNPIN(source,length)
 UNPIN(destination,length)
If pin and unpin are slow, then you need to include all of the logic for copying in and out of pinned memory. Each device will have slightly different requirements, there will be systems with various combinations of hardware, and decisions will have to be made about how to balance among them.

These sorts of complexities are well suited for the kinds of polymorphism that C++ can express.

Design Requirements and Principles

  1. Provide a uniform core interface. The user must be able to use the core interface (put, get and ainvoke, etc.) in the same way for each of the implementations.
  2. Allow specific extensions. Each of the specific implementations must be able to define some extended interface and implementation that would be used by code which knows it is dealing with that specific case.
  3. Allow implementations to minimize copies. Because copying memory is the slowest thing we can do, we want to minimize the number of times we have to do it. Because the techniques for minimizing copies vary from one kind of NIC to another, the specifics of this will appear in the specialized implementations.
  4. Communication is slow, so some overhead in the software is acceptable, within reason at least. One virtual function call in a communication chain is fine; one hundred will not be.
  5. Allow multiple communication mechanisms simultaneously. While the interface to any particular piece of hardware may need or want to go through a particular Singleton object, there can be multiple communication channels available in a given computer, and the interfaces to them should be unified and simultaneously available.
  6. A C++ interface is OK. If we're going to do polymorphism without killing ourselves, that is how it has to be. We should have a defined strategy for wrapping any given implementation in a C interface though.
  7. Each implementation should do what it is good at and delegate the rest. For example, if there are multiple interfaces on a particular computer and a decision must be made about which one to use, build the implementation for that compound system by putting together implementations for the individual interfaces. Then the code for the compound interface just has the logic associated with handling multiple NICs.
  8. You should be able to choose at run time which communication layer to use. This is very useful in practice, because you can compile in multiple layers and choose later which one to use in a particular run; a sketch of this follows the list.
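
Here is a minimal sketch of that run-time selection. The empty structs stand in for the real Controller hierarchy described below, and makeController and the layer names are illustrative, not part of any existing interface.

#include <string>

struct Controller            { virtual ~Controller() {} };  // stands in for the real base class
struct MPIController   : Controller { };                     // stands in for one concrete layer
struct HiPPIController : Controller { };                     // stands in for another

// Both layers are compiled in; the caller picks one by name at run time.
Controller* makeController(const std::string& layer)
{
  if (layer == "mpi")   return new MPIController;
  if (layer == "hippi") return new HiPPIController;
  return 0;  // unknown layer name
}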

Major Design Elements

Here are a set of design elements that satisfy the above requirements and principles.

  1. Communication is done through member functions on objects, rather than through global functions. This provides a way to specify which communication layer to use, give each controller state, fire them up, tear them down, and so on.
  2. The various communication controllers inherit from an abstract base class. This enforces the requirement that the basic interface be specified, and imposes one virtual function call per communication, or one per layer of delegation. This also helps satisfy the requirement that the communication layer be chosen at run time. Specific subclasses can extend the base class interface, and in contexts where you know which subclass you are dealing with, you can use those extensions.

This turns out to be surprisingly simple: a classic C++ design.

The abstract base class Controller

The abstract base class has the same interface as the cheetah_core global functions, except that the members are pure virtual functions.

class Controller
{
 public:

  virtual ~Controller();

  virtual int ncontexts() const = 0;

  virtual int mycontext() const = 0;

  typedef void (*Handler_t)(int who, int tag, void* buf, int len);

  // Note: 'register' is a reserved word in C++, so the member needs some
  // other name; registerHandler is used here as a stand-in.
  virtual void registerHandler(int tag, Handler_t handler) = 0;

  virtual void ainvoke(int context, int tag, void *buffer, int len,
               int *local_bell) = 0;

  virtual void put(int context, void *remote, void *local, int len,
               int *local_bell, int *remote_bell) = 0;

  virtual void get(int context, void *remote, void *local, int len,
               int *local_bell, int *remote_bell) = 0;

  virtual void poll() = 0;

  virtual void wait(volatile int *bell, int value) = 0;

  virtual void barrier() = 0;

};
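
A minimal sketch of using this interface, whatever the concrete layer turns out to be. The destination context and the bell convention (zero before the transfer, one when it completes) are assumptions made only for this illustration; this note does not specify them.

// Hypothetical use of whatever Controller we were handed.
void send_block(Controller& c, void* remote, void* local, int len)
{
  int local_bell  = 0;
  int remote_bell = 0;
  c.put(1, remote, local, len, &local_bell, &remote_bell);  // start the put to context 1
  c.wait(&local_bell, 1);   // poll until our side of the transfer has completed
  c.barrier();              // all contexts synchronize before moving on
}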

Possible abstract subclasses of Controller

Each of these subclasses would be abstract because they define extensions to the interface of Controller without defining any new implementation.

class DynamicController : public Controller { ... };

class OneCopyController : public Controller { ... };

class OneCopyDynamicController : public Controller { ... };
The class DynamicController would define extensions to the interface of Controller for adding and deleting contexts, and OneCopyController would define extensions for managing pinned or shared memory. Then you would implement specific interfaces like
class HiPPIController
  : public OneCopyController { ... };

class MPIController
  : public Controller { ... };
which would implement those interfaces.
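
Following principle 7 above, a compound system with two NICs might then be handled by a controller that owns one Controller per device and adds only the routing logic. This is a hypothetical class; only put is shown, and the routing rule is made up.

// Hypothetical compound controller for a machine with two NICs.  It owns
// one Controller per device and only adds the logic for choosing between
// them; everything else is delegated the same way.
class MultiController : public Controller
{
 public:
  MultiController(Controller* a, Controller* b) : nicA(a), nicB(b) {}

  virtual void put(int context, void* remote, void* local, int len,
                   int* local_bell, int* remote_bell)
  {
    choose(context)->put(context, remote, local, len, local_bell, remote_bell);
  }

  // ... ncontexts, get, ainvoke, poll, wait, barrier delegate similarly ...

 private:
  // Made-up routing rule: low contexts live on the first NIC.
  Controller* choose(int context) { return context < nicA->ncontexts() ? nicA : nicB; }

  Controller* nicA;
  Controller* nicB;
};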


Steve Karmesin
Last modified: Wed Mar 29 13:35:29 MST 2000