5 Fault-Tolerant Examples

This chapter shows how to use the failure model to build robust distributed applications. We first present basic fault-tolerant versions of common language operations. Then we present fault-tolerant versions of the server examples. We conclude with a bigger example: reliable objects with recovery.

5.1 A fault-tolerant hello server

Let's take a fresh look at the hello server. How can we make it resistant to distribution faults? First we specify the client and server behavior. The server should continue working even though there is a problem with a particular client. The client should be informed in finite time of a server problem by means of a new exception, serverError.

We show how to rewrite this example with the basic failure model. In this model, the system raises exceptions when it tries to do operations on entities that have problems related to distribution. All these exceptions are of the form system(dp(conditions:FS ......) where FS is the list of actual fault states as defined before. By default, the system will raise exceptions only on the fault states tempFail and permFail.

Assume that we have two new abstractions:

We first show how to use these abstractions before defining them in the basic model. With these abstractions, we can write the client and the server almost exactly in the same way as in the non-fault-tolerant case. Let's first write the server:

declare Str Prt Srv in  
{NewPort Str Prt}
thread   
   {ForAll Str
    proc {$ S}
       try 
          S="Hello world" 
       catch system(dp(......then skip end 
    end}
end 
 
proc {Srv X}
   {SafeSend Prt X}
end 
                                         
{Pickle.save {Connection_many Srv} "http://www.sics.se/~pvr/hw"}

This server does one distributed operation, namely the binding S="Hello world". We wrap this binding to catch any distributed exception that occurs. This allows the server to ignore clients with problems and to continue working.

Here's the client:

declare Srv
 
try X in 
   try 
      Srv={Connection.take {Pickle.load "http://www.sics.se/~pvr/hw"}}
   catch _ then raise serverError end 
   end 
    
   {Srv X}
   {SafeWait X infinity}
   {Browse X}
catch serverError then 
   {Browse 'Server down'}
end

This client does two distributed operations, namely a send (inside Srv), which is replaced by SafeSend, and a wait, which is replaced by SafeWait. If there is a problem sending the message or receiving the reply, then the exception serverError is raised. This example also raises an exception if there is any problem during the startup phase, that is during Connection.take and Pickle.load.

5.1.1 Definition of SafeSend and SafeWait

We define SafeSend and SafeWait in the basic model. To make things easier to read, we use the two utility functions FOneOf and FSomeOf, which are defined just afterwards. SafeSend is defined as follows:

declare 
proc {SafeSend Prt X}
   try 
      {Send Prt X}
   catch system(dp(conditions:FS ......then 
      if {FOneOf permFail FS} then 
         raise serverError end 
      elseif {FOneOf tempFail FS} then 
         {Delay 100} {SafeSend Prt X}
      else skip end 
   end 
end

This raises a serverError if there is a permanent server failure and retries indefinitely each 100 ms if there is a temporary failure.

SafeWait is defined as follows:

declare 
local 
   proc {InnerSafeWait X Time}
      try 
         cond {Wait X} then skip 
         [] {Wait Time} then raise serverError end 
         end 
      catch system(dp(conditions:FS ......then 
         if {FSomeOf [permFail remoteProblem(permSome)] FS} then 
            raise serverError end 
         if {FSomeOf [tempFail remoteProblem(tempSome)] FS} then 
            {Delay 100} {InnerSafeWait X Time}
         else skip end 
      end 
   end 
in 
   proc {SafeWait X TimeOut}
   Time in 
      if TimeOut\=infinity then 
         thread {Delay TimeOut} Time=done end 
      end 
      {Fault.enable X 'thread'(this)
         [permFail remoteProblem(permSome) tempFail remoteProblem(tempSome)] _}
      {InnerSafeWait X Time}
   end 
end

This raises a serverError if there is a permanent server failure and retries each 100 ms if there is a temporary failure. The client and the server are the only two sites on which X exists. Therefore remoteProblem(permFail:_ ...) means that the server has crashed.

To keep the client from blocking indefinitely, it must time out. We need a time-out since otherwise a client will be stuck when the server drops it like a hot potato. The duration of the time-out is an argument to SafeWait.

5.1.2 Definition of FOneOf and FSomeOf

In the above example and later on in this chapter (e.g., in Section 5.2.3), we use the utility functions FOneOf and FSomeOf to simplify checking for fault states. We specify these functions as follows.

The call {FOneOf permFail AFS} is true if the fault state permFail occurs in the set of actual fault states AFS. Extra information in AFS is not taken into account in the membership check. The function FOneOf is defined as follows:

declare 
fun {FOneOf F AFS}
   case AFS of nil then false 
   [] AF2|AFS2 then 
      case F#AF2
      of permFail#permFail(...then true 
      [] tempFail#tempFail(...then true 
      [] remoteProblem(I)#remoteProblem(I ...then true 
      else {FOneOf F AFS2}
      end 
   end 
end

The call {FSomeOf [permFail remoteProblem(permSome)] AFS} is true if either permFail or remoteProblem(permSome) (or both) occurs in the set AFS. Just like for FOneOf, extra information in AFS is not taken into account in the membership check. The function FSomeOf is defined as follows:

declare 
fun {FSomeOf FS AFS}
   case FS of nil then false 
   [] F2|FS2 then 
      {FOneOf F2 AFS} orelse {FSomeOf FS2 AFS}
   end 
end

5.2 Fault-tolerant stationary objects

To be useful in practice, stationary objects must have well-defined behavior when there are faults. We propose the following specification for the stationary object (the "server") and a caller (the "client"):

We present two quite different ways of implementing this specification, one based on guards (Section 5.2.2) and the other based on exceptions (Section 5.2.3). The guard-based technique is the shortest and simplest to understand. The exception-based technique is similar to what one would do in standard languages such as Java.

But first let's see how easy it is to create and use a remote stationary object.

5.2.1 Using fault-tolerant stationary objects

We show how to use Remote and NewSafeStat to create a remote stationary object. First, we need a class--let's define a simple class Counter that implements a counter.

declare 
class Counter 
   attr i
   meth init i<-end 
   meth get(X) X=@end 
   meth inc i<-@i+end 
end

Then we define a functor that creates an instance of Counter with NewSafeStat. Note that the object is not created yet. It will be created later, when the functor is applied.

declare 
F=functor 
  import Fault
  export statObj:StatObj
  define 
     {Fault.defaultEnable nil _}
     StatObj={NewSafeStat Counter init}
  end

Do not forget the "import Fault" clause! If it's left out, the system will try to use the local Fault on the remote site. This raises an exception since Fault is sited (technically, it is a resource, see Section 2.1.5). The import Fault clause ensures that installing the functor uses the Fault of the installation site.

It may seem overkill to use a functor just to create a single object. But the idea of functors goes much beyond this. With import, functors can specify which resources to use on the remote site. This makes functors a basic building block for mobile computations (and mobile agents).

Now let's create a remote site and make an instance of Counter called StatObj. The class Remote.manager gives several ways to create a remote site; this example uses the option fork:sh, which just creates another process on the same machine. The process is accessible through the module manager MM, which allows to install functors on the remote site (with the method "apply").

declare 
MM={New Remote.manager init(fork:sh)}
StatObj={MM apply(F $)}.statObj

Finally, let's call the object. We've put the object calls inside a try just to demonstrate the fault-tolerance. The simplest way to see it work is to kill the remote process and to call the object again. It also works if the remote process is killed during an object call, of course.

try 
   {StatObj inc}
   {StatObj inc}
   {Show {StatObj get($)}}
catch X then 
   {Show X}
end

5.2.2 Guard-based fault tolerance

The simplest way to implement fault-tolerant stationary objects is to use a guard. A guard watches over a computation, and if there is a distribution fault, then it gracefully terminates the computation. To be precise, we introduce the procedure Guard with the following specification:

  • {Guard E FS S1 S2} guards entity E for fault states FS during statement S1, replacing S1 by S2 if a fault is detected during S1. That is, it first executes S1. If there is no fault, then S1 completes normally. If there is a fault on E in FS, then it interrupts S1 as soon as a faulty operation is attempted on any entity. It then executes statement S2. S1 must not raise any distribution exceptions. The application is responsible for cleaning up from the partial work done in S1. Guards are defined in Section ``Definition of Guard''.

With the procedure Guard, we define NewSafeStat as follows. Note that this definition is almost identical to the definition of NewStat in Section 3.2.3. The only difference is that all distributed operations are put in guards.

<Guard-based stationary object>=
proc {MakeStat PO ?StatP}
   S P={NewPort S}
   N={NewName}
in 
   % Client interface to server:
   
<Client side> 
   % Server implementation:
   
<Server side> 
end 
 
proc {NewSafeStat Class Init Object}
   Object={MakeStat {New Class Init}}
end 

The client raises an exception if there is a problem with the server:

<Client side>=
proc {StatP M}
in 
   {Fault.enable R 'thread'(this) nil _}
   {Guard P [permFail]
    proc {$}  
       {Send P M#R}  
       if R==then skip else raise R end end 
    end 
    proc {$raise remoteObjectError end end}
end

The server terminates the client request gracefully if there is a problem with a client:

<Server side>=
thread 
   {ForAll S
    proc{$ M#R}
       thread RL in 
          try {PO M} RL=N catch X then RL=X end 
          {Guard R [permFail remoteProblem(permSome)]
           proc {$} R=RL end 
           proc {$skip end}
       end 
   end}
end

There is a minor point related to the default enabled exceptions. This example calls Fault.enable before Guard to guarantee that no exceptions are raised on R. This can be changed by using Fault.defaultEnable at startup time for each site.

Definition of Guard

Guards allow to replace a statement S1 by another statement S2 if there is a fault. See Section 5.2.2 for a precise specification. The procedure {Guard E FS S1 S2} first disables all exception raising on E. Then it executes S1 with a local watcher W (see Section ``Definition of LocalWatcher''). If the watcher is invoked during S1, then S1 is interrupted and the exception N is raised. This causes S2 to be executed. The unforgeable and unique name N occurs nowhere else in the system.

declare 
proc {Guard E FS S1 S2}
   N={NewName}
   T={Thread.this}
   proc {W E FS} {Thread.injectException T N} end 
in 
   {Fault.enable E 'thread'(T) nil _}
   try 
      {LocalWatcher E FS W S1}
   catch X then 
      if X==then 
         {S2}
      else 
         raise X end 
      end 
   end 
end

Definition of LocalWatcher

A local watcher is a watcher that is installed only during the execution of a statement. When the statement finishes or raises an exception, then the watcher is removed. The procedure LocalWatcher defines a local watcher according to the following specification:

  • {LocalWatcher E FS W S} watches entity E for fault states FS with watcher W during the execution of S. That is, it installs the watcher, then executes S, and then removes the watcher when execution leaves S.

declare 
proc {LocalWatcher E FS W S}
   {Fault.installWatcher E FS W _}
   try 
      {S}
   finally 
      {Fault.deInstallWatcher E W _}
   end 
end

5.2.3 Exception-based fault tolerance

We show how to implement NewSafeStat by means of exceptions only, i.e., using the basic failure model. First New makes an instance of the object and then MakeStat makes it stationary. In MakeStat, we distinguish four parts. The first two implement the client interface to the server.

<Exception-based stationary object>=
declare 
proc {MakeStat PO ?StatP}
   S P={NewPort S}
   N={NewName}
   EndLoop TryToBind
in 
   % Client interface to server:
   
<Client call to the server> 
   
<Client synchronizes with the server> 
   % Server implementation:
   
<Main server loop> 
   
<Server synchronizes with the client> 
end

proc {NewSafeStat Class Init ?Object} Object={MakeStat {New Class Init}} end >>>

First the client sends its message to the server together with a synchronizing variable. This variable is used to signal to the client that the server has finished the object call. The variable passes an exception back to the client if there was one. If there is a permanent failure of the send, then raise remoteObjectError. If there is a temporary failure of the send, then wait 100 ms and try again.

<Client call to the server>=
proc {StatP M}
      R in 
      try 
         {Send P M#R}
      catch system(dp(conditions:FS ......then 
         if {FOneOf permFail FS} then 
            raise remoteObjectError end 
         elseif {FOneOf tempFail FS} then 
            {Delay 100}
            {StatP M}
         else skip end 
      end 
      {EndLoop R}
   end

Then the client waits for the server to bind the synchronizing variable. If there is a permanent failure, then raise the exception. If there is a temporary failure, then wait 100 ms and try again.

<Client synchronizes with the server>=
proc {EndLoop R}
      {Fault.enable R 'thread'(this)  
         [permFail remoteProblem(permSome) tempFail remoteProblem(tempSome)] _}
      try 
         if R==then skip else raise R end end 
      catch system(dp(conditions:FS ......then 
         if {FSomeOf [permFail remoteProblem(permSome)] FS} then 
            raise remoteObjectError end 
         elseif {FSomeOf [tempFail remoteProblem(tempSome)] FS} then 
            {Delay 100} {EndLoop R}
         else skip end 
      end 
   end

The following two parts implement the server. The server runs in its own thread and creates a new thread for each client call. The server is less tenacious on temporary failures than the client: it tries once every 2000 ms and gives up after 10 tries.

<Main server loop>=
thread 
      {ForAll S
       proc {$ M#R}
          thread 
             try 
                {PO M}
                {TryToBind 10 R N}
             catch X then 
                try 
                   {TryToBind 10 R X}
                catch Y then skip end 
             end 
          end 
       end}
   end

<Server synchronizes with the client>=
proc {TryToBind Count R N}
      if Count==then skip 
      else 
         try 
            R=N
         catch system(dp(conditions:FS ......then 
            if {FOneOf tempFail FS} then 
               {Delay 2000}
               {TryToBind Count-1 R N}
            else skip end 
         end 
      end 
   end


Peter Van Roy, Seif Haridi and Per Brand
Version 1.0.1 (19990218)