There is no widely accepted best practice for programming highly concurrent complex control systems. UML 1.4 can perhaps be viewed as the most ambitious current attempt to provide a standard model for concurrency programming.
In Chapter 2, the precise semantics of UML 1.4 are described. These involve:
This model is the one most commonly used when programming concurrency in, for example, C++. It is also quite similar to that of OHaskell, a fairly recent attempt to offer concurrency support in the popular functional programming language Haskell. Ericsson programmers may also find it rather similar to the semantics of the Ericsson-proprietary language PLEX.
Looking at the developments in telecoms, the programming languages Chill, EriPascal, HLPLEX and Erlang (the most recent and most widely used of these languages), all use a slightly different approach:
These languages evolved out of experiences from building complex telecom software. This document attempts to illustrate the qualitative differences between the "UML" approach and the "Erlang" approach, when applied to a typical telecom problem.
I decided to use a POTS example similar to that found in the book "Concurrent Programming in Erlang" (Armstrong et al, ISBN 0-13-508301-X, page 240.) This example clearly illustrates the problems of writing control systems for telecoms. For various reasons, the program used here and the one in the book are not the same. In the book, a "middleware" layer, tele_os, is provided that interfaces to the hardware. The corresponding middleware layer in this example is called 'lim', for Line Interface Module.
The original implementation uses the classical Erlang programming model of blocking, selective 'receive': in each state, the process waits for exactly the messages that are relevant in that state, implicitly leaving all other messages in the message queue.
In the following table, I have collected a few code metrics. Briefly explained, "Lines of code" is the size of the module, "Branches" counts the alternative execution paths in the code, "Functions" counts the function definitions, and "Timing Dependencies" counts the places where the behaviour depends on the order in which unrelated messages happen to arrive.
                    | Original |
Lines of code       | 161      |
Branches            | 50       |
Functions           | 16       |
Timing Dependencies | 0        |
In order to implement an event-based control module in Erlang, I wrote a small event loop:
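A minimal version of such a loop might look as follows (the module and callback names here are my own, not necessarily those used in the actual program):

```erlang
-module(event_loop).
-export([start/2]).

%% Spawn a process that dispatches every incoming message to a
%% state-handling callback. Nothing is ever left in the mailbox:
%% each event is handled immediately, in arrival order.
start(Mod, InitState) ->
    spawn(fun() -> loop(Mod, InitState) end).

loop(Mod, State) ->
    receive
        Event ->
            NewState = Mod:handle_event(Event, State),
            loop(Mod, NewState)
    end.
```

The essential property is that the 'receive' matches any message, so the callback must be prepared to handle every event in every state.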
The actual program was quite easy to write in Erlang (and in fact, the commonly used "generic server" framework in Erlang is based on this "event-driven" style of programming). The resulting module is significantly shorter than the original. A subjective opinion is that it is more difficult to follow, given this particular problem.
It should be noted that this module "cheats", in the sense that it still uses a hardware API with blocking semantics. This API is implemented using synchronous wrappers, each sending an asynchronous message to the hardware and then blocking until the answer arrives (implicitly buffering all other messages in the meantime). This assumes the ability to program a "blocking RPC", something that you can readily do in Erlang, but not in OHaskell or UML 1.4. It should be noted that a blocking RPC method is introduced in UML 1.5, even though the UML semantics do not allow the programmer to easily describe such functionality in UML itself.
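A synchronous wrapper of this kind might be sketched as follows (the function name and message formats are assumptions on my part, not the actual 'lim' API):

```erlang
%% Hypothetical blocking RPC towards the hardware: send the request
%% asynchronously, then block in a selective receive until the
%% matching reply arrives. All other messages remain buffered in the
%% mailbox until then.
start_tone(Tone) ->
    lim ! {request, self(), {start_tone, Tone}},
    receive
        {lim, {start_tone_reply, Res}} ->
            Res
    end.
```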
Another thing to note is that the code relies rather heavily on function-head pattern matching, which is a rather unique Erlang feature. One could have written the code using case statements instead. This would have increased the number of branches in the code by roughly the number of functions. The same goes for the later examples, where this factor would become more dominant. It is unclear whether this would make the comparison more or less fair, so I have strived to write the code the way I thought best.
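To illustrate the difference (with invented state and helper names), here is a state written with function-head pattern matching, and the same state written with a 'case' statement:

```erlang
%% Function-head pattern matching: one clause per event.
idle({lim, offhook}, State) ->
    {getting_number, State};
idle({Pid, request_connection}, State) ->
    Pid ! {self(), accept},
    {ringing, State}.

%% The same logic with a 'case' statement: a single function, but
%% the alternatives reappear as an extra level of branches.
idle_case(Event, State) ->
    case Event of
        {lim, offhook} ->
            {getting_number, State};
        {Pid, request_connection} ->
            Pid ! {self(), accept},
            {ringing, State}
    end.
```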
                    | Original | asynch_0 |
Lines of code       | 161      | 104      |
Branches            | 50       | 34       |
Functions           | 16       | 18       |
Timing Dependencies | 0        | 0        |
Are the two implementations equivalent? Actually, no. Both, as far as I can tell, solve the problem and adhere to the specification (though note that the specification is quite primitive).
Let's consider a specific message sequence to illustrate the differences:
{lim,offhook} - {lim,onhook} - {Pid,request_connection}

If the A:onhook and Pid:request_connection messages arrive at roughly the same time, internal scheduling details may determine which message appears first in the message queue of subscriber A. In the original implementation, this doesn't matter; the onhook message will be serviced first, as it appears first in the list of patterns in the 'receive' clause. The request_connection message will then be serviced in the idle state and will be accepted. Note that in this programming model, we can actually control the behaviour in this situation, by prioritizing the match patterns.

In the asynch_0 implementation, the message that arrives first will be serviced first, so if the request_connection message arrives before the onhook, the connection request will be rejected. We have very little control over this aspect in this programming model.

The behaviours in this case are both valid. From a user's point of view, it is perhaps more desirable that the call be serviced when possible. On the other hand, even in the original implementation, the "optimized" behaviour is only possible if the onhook message is actually in the message queue. This is a delicate timing matter, and in this particular case, we may well be prepared to ignore the subtle difference. However, we will see other variations on the same theme below, with more dire consequences.
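One way the blocking model can express this kind of priority is a nested 'receive' with a zero timeout. The following is a sketch with illustrative state names, not the original code:

```erlang
%% Poll for onhook first; only if no onhook is queued, take the next
%% message in arrival order. This gives onhook priority even if a
%% request_connection happened to arrive slightly earlier.
getting_number(State) ->
    receive
        {lim, onhook} ->
            idle(State)
    after 0 ->
        receive
            {lim, onhook} ->
                idle(State);
            {Pid, request_connection} ->
                Pid ! {self(), reject},
                getting_number(State)
        end
    end.

%% Minimal stub so the sketch is complete; the real idle state would
%% itself be a 'receive' loop that accepts connection requests.
idle(State) ->
    {idle, State}.
```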
This version is based on the previous example (asynch_0), but with the hardware API changed to non-blocking semantics. This complicates the state machine, and introduces a number of timing dependencies. This is intended to simulate the style of programming required to solve the problem in e.g. OHaskell or UML 1.4.
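The kind of intermediate state this forces can be sketched as follows (state and message names are invented for illustration):

```erlang
%% With a non-blocking hardware API we cannot wait for the reply in
%% place; an explicit intermediate state must absorb everything that
%% may arrive before the reply does - including an onhook.
offhook({lim, offhook}, State) ->
    lim ! {request, self(), {start_tone, dial}},
    {wait_tone_start, State}.

wait_tone_start({lim, {start_tone_reply, ok}}, State) ->
    {getting_number, State};
wait_tone_start({lim, onhook}, State) ->
    %% The user hung up before the tone request completed; the pending
    %% reply must still be consumed, and the tone then stopped, in yet
    %% another state.
    {wait_tone_start_then_stop, State}.
```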
I have tested the basic functionality, but not the timing-dependent code. This type of code is generally difficult to test, as it often requires tricky fault injection techniques. One may run a profiler on the code to make sure that all the code has been covered, but there may still be timing dependencies that the programmer hasn't thought of that are not represented in the code. I encountered a number of such bugs when writing the code.
                    | Original | asynch_0 | asynch_1 |
Lines of code       | 161      | 104      | 175      |
Branches            | 50       | 34       | 64       |
Functions           | 16       | 18       | 37       |
Timing Dependencies | 0        | 0        | 6        |
This version goes one step further and also makes the number analysis API non-blocking. Even though the number analysis API consists of only one function, and it is only used within a limited segment of the state machine, implementation complexity grows substantially. The reason is that the existing timing dependencies interact with the new ones, multiplying the effect.
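The interaction can be illustrated roughly like this (again with invented names): once a number analysis reply is also outstanding, every event that can interleave with it needs its own transition, and each combination of outstanding replies tends to need its own state:

```erlang
%% Awaiting an asynchronous number analysis reply while replies from
%% earlier hardware requests may still be in flight.
wait_analysis({analysis, {ok, RoutingInfo}}, State) ->
    {make_call, RoutingInfo, State};
wait_analysis({lim, onhook}, State) ->
    %% Hung up while the analysis is pending; the late reply must
    %% still be consumed and discarded in a further state.
    {wait_analysis_then_idle, State};
wait_analysis({lim, {stop_tone_reply, ok}}, State) ->
    %% A reply from an earlier tone request may also interleave here.
    {wait_analysis, State}.
```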
                    | Original | asynch_0 | asynch_1 | asynch_2 |
Lines of code       | 161      | 104      | 175      | 203      |
Branches            | 50       | 34       | 64       | 70       |
Functions           | 16       | 18       | 37       | 37       |
Timing Dependencies | 0        | 0        | 6        | 13       |
Some work remains on finding truly language-neutral comparison techniques, but there seems to be a clear indication that the non-blocking event-based programming style leads to increasingly complex code as more asynchronous APIs are introduced.
An important observation is that the timing dependencies introduced in the implementations do not correspond to actual timing dependencies in the real world. There is, for example, no actual correlation between the pressing of a digit on the telephone and the sending of a tone request reply from our switch hardware. The dependencies exist in our implementation solely due to the programming model.
The original program is free from such arbitrary timing dependencies. This makes it stable in the sense that it is not affected if parts of the code are distributed, if hardware is upgraded, changing the capacity and thereby the timing aspects of the system, or if a supporting component is made more (or less) complex.
It should be noted that the 'defer' semantics in UML make it possible to avoid some of the complexity described above. However, deferral does not eliminate the timing dependencies, since messages to be deferred must be declared explicitly. This means that one also needs to clearly identify all vital messages and carefully verify that they cannot be accidentally discarded in any state where they might arrive. This must of course be repeated each time the model is altered or extended, as new states or new messages are introduced. An alternative would be to enforce the convention that all unknown messages are always deferred. An obvious problem with this is that deferred messages must also be explicitly "recalled"; thus one must first know that a message has been deferred in order to recall it. I believe that there is a "recall_all" function, but this should probably only be used to flush unknown messages.
Looking more closely at what makes the first program stable, we can identify the following core requirements: