Self-Stabilizing Systems
 9780773591141

Table of contents :
Cover
Title
Copyright
Table of Contents
Preface
Compositional Design of Multitolerant Repetitive Byzantine Agreement
Development of Self-Stabilizing Distributed Algorithms using Transformation: Case Studies
The Linear Alternator
Self-Stabilizing ℓ-Exclusion
On FTSS-Solvable Distributed Problems
Compositional Proofs of Self-stabilizing Protocols
Delay-Insensitive Stabilization
A Latency-Optimal Superstabilizing Mutual Exclusion Protocol
Memory-Efficient Self-stabilizing Algorithm to Construct BFS Spanning Trees
Self-stabilizing Universal Algorithms
Tradeoffs in Fault-Containing Self-stabilization
Self-stabilizing Multiple-Sender/Single-Receiver Protocol
Propagated Timestamps: A Scheme for the Stabilization of Maximum Flow Routing Protocols
Deductive Verification of Stabilizing Systems


INTERNATIONAL INFORMATICS SERIES 7

Editors in Chief
EVANGELOS KRANAKIS, Carleton University, School of Computer Science, Ottawa, ON, Canada K1S 5B6
NICOLA SANTORO, Carleton University, School of Computer Science, Ottawa, ON, Canada K1S 5B6

Consulting Editors
FRANK DEHNE, Carleton University, School of Computer Science, Ottawa, ON, Canada K1S 5B6
DANNY KRIZANC, Carleton University, School of Computer Science, Ottawa, ON, Canada K1S 5B6
JÖRG-RÜDIGER SACK, Carleton University, School of Computer Science, Ottawa, ON, Canada K1S 5B6
JORGE URRUTIA, Carleton University, School of Computer Science, Ottawa, ON, Canada K1S 5B6

Series Editor
JOHN FLOOD, Carleton University Press, Carleton University, Ottawa, ON, Canada K1S 5B6

International Informatics Series 7 Sukumar Ghosh and Ted Herman (Eds.)

SELF-STABILIZING SYSTEMS 3rd Workshop, WSS '97 Santa Barbara, California, August 1997 Proceedings

CARLETON UNIVERSITY PRESS

Copyright © Carleton University Press, Inc. 1997. Published by Carleton University Press. The publisher would like to thank the Vice-President (Academic), the Associate Vice-President (Research), the Dean of Science, and the School of Computer Science at Carleton University for their contribution to the development of the Carleton Informatics Series. Carleton University Press would also like to thank the Canada Council, the Ontario Arts Council, the Government of Canada through the Department of Canadian Heritage, and the Government of Ontario through the Ministry of Culture, Tourism and Recreation, and the Ontario Arts Council. Printed and bound in Canada.

Canadian Cataloguing in Publication Data
WSS'97 (3rd : Santa Barbara, Calif.)
Self-stabilizing systems : 3rd Workshop, WSS'97, Santa Barbara, California, August 1997 : proceedings
(International informatics series ; 7)
Includes bibliographical references.
ISBN 0-88629-333-2
1. Electronic data processing--Distributed processing--Congresses. I. Ghosh, Sukumar, 1946- . II. Herman, Ted, 1952- . III. Title. IV. Series.

Table of Contents

Sandeep Kulkarni, Anish Arora: Compositional Design of Multitolerant Repetitive Byzantine Agreement
Hirotsugu Kakugawa, Masaaki Mizuno, Mikhail Nesterenko: Development of Self-Stabilizing Distributed Algorithms using Transformation: Case Studies
Mohamed G. Gouda, Furman Haddix: The Linear Alternator
Uri Abraham, Shlomi Dolev, Ted Herman, Irit Koll: Self-Stabilizing ℓ-Exclusion
Joffroy Beauquier, Synnöve Kekkonen-Moneta: On FTSS-Solvable Distributed Problems
George Varghese: Compositional Proofs of Self-stabilizing Protocols
Anish Arora, Mohamed G. Gouda: Delay-Insensitive Stabilization
Eiichiro Ueda, Yoshiaki Katayama, Toshimitsu Masuzawa, Hideo Fujiwara: A Latency-Optimal Superstabilizing Mutual Exclusion Protocol
Colette Johnen: Memory-Efficient Self-stabilizing Algorithm to Construct BFS Spanning Trees
Paolo Boldi, Sebastiano Vigna: Self-stabilizing Universal Algorithms
Sukumar Ghosh, Sriram V. Pemmaraju: Tradeoffs in Fault-Containing Self-stabilization
Karlo Berket, Ruppert Koch: Self-stabilizing Multiple-Sender/Single-Receiver Protocol
Jorge A. Cobb, Mohamed Waris: Propagated Timestamps: A Scheme for the Stabilization of Maximum Flow Routing Protocols
Y. Lakhnech, M. Siegel: Deductive Verification of Stabilizing Systems

Preface

Self-governing control is a defining characteristic of autonomous computing machinery. Autonomy implies some degree of independence, and when a system's ability to achieve its mission is independent of how it is initialized, the system is self-stabilizing. Application of self-stabilization to system and network components is motivated by core concerns of fault-tolerance in distributed systems. The purpose of the workshop is to bring together researchers in the field of self-stabilization as well as specialists in networking, formal methods, and application areas that benefit from or contribute to the theory and practice of self-stabilization.

The 3rd Workshop on Self-Stabilizing Systems (WSS'97) was held at the University of California at Santa Barbara in August, 1997. WSS'97 follows the workshop of 1989 (held in Austin, Texas) and the workshop of 1995 (held in Las Vegas, Nevada). The year 1997 marks the 25th year of research on self-stabilization in distributed computing. Edsger W. Dijkstra's initial manuscript on the subject, EWD391, was written in 1973 and the results were subsequently published as "Self-Stabilization in Spite of Distributed Control" in the November 1974 issue of Communications of the ACM.

Topics explored in workshop presentations include compositional design, compositional proofs, model sensitivity, transformation from one model to another, and stabilization of various network protocols. Research in self-stabilization explores many of the classic themes of distributed computing (distributed graph algorithms, mutual exclusion, distributed agreement), which are represented by papers in these proceedings.

We would like to thank the other members of the Program Committee: Anish Arora, Jim Burns, Joffroy Beauquier, Ajoy K. Datta, Shlomi Dolev, Mohamed Gouda, Shay Kutten, George Varghese. Also deserving thanks are the following colleagues for careful reading and evaluation of the submissions: Farokh Bastani, Sylvie Delaet, Jerry L. Derby, Arobinda Gupta, Shing-Tsaan Huang, Leslie Lamport, William Leal, Jeff Line, Toshimitsu Masuzawa, Masaaki Mizuno, Sriram V. Pemmaraju, Sandeep Shukla, Shmuel Zaks. We thank Ambuj Singh for the workshop's local arrangements.

This year's workshop has been sponsored by The University of Iowa and The University of California, Santa Barbara. A generous grant by Rockwell International Corporation supports publication of these proceedings.

Further electronic information, including a report of workshop events, discussions, and a list of participants, is available via the following Web addresses.

http://www.cs.uiowa.edu/ftp/selfstab/wss97
http://www.cs.uiowa.edu/ftp/selfstab/bibliography
http://mm.cs.uchicago.edu/publications/cjtcs/working-papers/contents.html

Sukumar Ghosh and Ted Herman
August 1997

Compositional Design of Multitolerant Repetitive Byzantine Agreement Sandeep S. Kulkarni

Anish Arora

Department of Computer and Information Science The Ohio State University Columbus, OH 43210 USA

Abstract We illustrate in this paper a compositional and stepwise approach for designing programs that offer a potentially unique tolerance to each of their fault-classes. More specifically, our illustration is a design of a repetitive agreement program that offers two tolerances: (a) it masks the effects of Byzantine failures and (b) it is stabilizing in the presence of transient and Byzantine failures.

1

Introduction

The motivation for designing programs to be "multitolerant" follows from the limitations of designing them to be "unitolerant". Given the set of tolerances that are possible for each fault-class when considered in isolation, designing the highest of these tolerances for all of the fault-classes can be impossible. And, designing the lowest of these tolerances for all of the fault-classes can yield unacceptable functionality or performance for some of the fault-classes. By way of example, consider a network protocol and the fault-classes of (i) channel message loss and reordering, (ii) node fail-stops and repairs, and (iii) both channel and node faults, i.e., the union of (i) and (ii). If only channel faults are considered, the protocol may be designed to mask their effect or may be designed to be stabilizing. If only node faults are considered, the same two design possibilities may exist. However, if both channel and node faults are considered, the protocol may not be designed to mask their combined effect, since it may be driven into arbitrary states [1]. Consequently, designing masking tolerance to the fault-classes (i), (ii), and (iii) may be impossible. Moreover, designing stabilizing tolerance to (i), (ii), and (iii) may be unacceptable, especially for (i) in isolation. A more desirable solution would be to make the protocol mask the effects of (i) and of (ii) and to be stabilizing in the presence of (iii).

Email: {kulkarni,anish}@cis.ohio-state.edu; Web: http://www.cis.ohio-state.edu/{~kulkarni, ~anish}. Research supported in part by NSF Grant CCR-9308640, NSA Grant MDA904-96-1-1011 and OSU Grant 221506.

In recent years, several research efforts in self-stabilization have studied special cases of multitolerance. For instance, Dolev and Herman [2] introduced the class of superstabilizing programs and illustrated their transformational design; these programs offer, in addition to stabilization, a fail-safe tolerance in the presence of faults that change the network topology. Dolev and Welch [3] presented randomized clock synchronization protocols that masked isolated Byzantine failure and were stabilizing in the presence of both transient and Byzantine failures. Ghosh et al. [4, 5] designed elegant stabilizing protocols that also offered fault-containment with respect to a specified fault-class. Gouda and Schneider [6] presented a stabilizing maximum flow routing protocol that also offered unbroken routes to the destination nodes in the presence of "faults" that only changed edge capacity. Masuzawa [7] presented stabilizing topology and reset protocols that also tolerated up to k crash failures of nodes in a (k+1)-connected asynchronous network. Yen and Bastani [8] presented a clever stabilizing token ring protocol that used a cryptographic technique to also offer fail-safe tolerance in the presence of transient faults. Earlier, the second author [9] had discussed example programs that were both stabilizing and masking tolerant, and others that were both nonmasking and masking tolerant.

A lack of systematic, general methods for the design of multitolerance in the stabilization field in particular and in the fault-tolerance field at large has led us to investigate such methods [10, 11]. Explained briefly, our approach is based on the use of components: for each fault-class, some component is added to the program so that the program tolerates that fault-class in a desired manner. To simplify the complexity of adding multiple components to the program, the principle of stepwise refinement is observed: starting with an intolerant program, in each step, a component is added to the program resulting from the previous step, to offer a desired tolerance to a hitherto unconsidered fault-class. In using this approach, the design decisions in each step are focused on the following issues:

- How to design and add a component that will offer the desired tolerance to the program in the presence of faults in the fault-class being considered?

- Since the component may share state with the program to which it will be added, how to ensure that execution of the component and the program will not interfere with each other in the absence of faults in all fault-classes being considered?

- How to ensure that execution of the component will not interfere with the tolerance of the program corresponding to a previously considered fault-class in the presence of faults in that previously considered fault-class?

These three issues will be illustrated in this paper in the context of a compositional and stepwise design of a repetitive agreement program that offers two tolerances: (a) it masks the effects of Byzantine failures and (b) it is stabilizing in the presence of transient and Byzantine failures.

The rest of the paper is organized as follows. In Section 2, we recall the problem of repetitive agreement. In Section 3, we design a fault-intolerant program for the problem. In Section 4, we add a component to the program so that it masks the effect of Byzantine failure. In Section 5, we discuss the stabilization of the augmented program in the presence of transient as well as Byzantine failure. In Section 6, we extend the program to the case of more than one Byzantine process. Finally, we discuss alternative designs and comment on the overall design approach in Section 7.

2

Problem Statement: Repetitive Agreement

A system consists of a set of processes, including a "general" process, g. Each computation of the system consists of an infinite sequence of rounds; in each round, the general chooses a binary decision value d.g and, depending upon this value, all other processes output a binary decision value of their own. The system is subject to two fault-classes: The first one permanently and undetectably corrupts some processes to be Byzantine, in the following sense: each Byzantine process follows the program skeleton of its non-Byzantine version, i.e., it sends messages and performs output of the appropriate type whenever required by its non-Byzantine version, but the data sent in the messages and the output may be arbitrary. The second one transiently and undetectably corrupts the state of the processes in an arbitrary manner and possibly also permanently corrupts some processes to be Byzantine. (Note that, if need be, the model of a Byzantine process can be readily weakened to handle the case when the Byzantine process does not send its messages or perform its output, by detecting their absence and generating arbitrary messages or output in response.)

The problem. In the absence of faults, repetitive agreement requires that each round in the system computation satisfies Validity and Agreement, defined below.

- Validity: If g is non-Byzantine, the decision value output by every non-Byzantine process is identical to d.g.
- Agreement: Even if g is Byzantine, the decision values output by all non-Byzantine processes are identical.

Masking tolerance. In the presence of the faults in the first fault-class, i.e., Byzantine faults, repetitive agreement requires that each round in the system computation satisfies Validity and Agreement.

Stabilizing tolerance. In the presence of the faults in the second fault-class, i.e., transient and Byzantine faults, repetitive agreement requires that eventually each round in the system computation satisfies Validity and Agreement. In other words, upon starting from an arbitrary state (which may be reached if transient and Byzantine failures occur), eventually a state must be reached in the system computation from which every future round satisfies Validity and Agreement.

Before proceeding to compositionally design a masking as well as stabilizing tolerant repetitive agreement program, let us recall the well-known fact that for repetitive agreement to be masking tolerant it is both necessary and sufficient for the system to have at least 3t+1 processes, where t is the total number of Byzantine processes [12]. Therefore, for ease of exposition, we will initially restrict our attention, in Sections 3-5, to the special case where the total number of processes in the system (including g) is 4 and, hence, t is 1. In other words, the Byzantine failure fault-class may corrupt at most one of the four processes. Later, in Section 6, we will extend our multitolerant program for the case where t may exceed 1.

Programming notation. Each system process will be represented by a set of "variables" and a finite set of "actions". Each variable ranges over a predefined nonempty domain. Each action has a unique name and is of the form:

    <name> :: <guard> → <statement>

The guard of each action is a boolean expression over the variables of that process and possibly other processes. The execution of the statement of each action atomically and instantaneously updates the value of zero or more of the variables of that process, possibly based on the values of the variables of that and other processes. For convenience in specifying an action as a restriction of another action, we will use the notation <name'> :: <guard'> ∧ <name> to define an action <name'> whose guard is obtained by restricting the guard of action <name> with <guard'>, and whose statement is identical to the statement of action <name>. Operationally speaking, <name'> is executed only if the guard of <name> and the guard <guard'> are both true.

Let S be a system. A "state" of S is defined by a value for each variable in the processes of S, chosen from the domain of the variable. A state predicate of S is a boolean expression over the variables in the processes of S. An action of S is "enabled" in a state iff its guard (a state predicate) evaluates to true in that state. Each computation of S is assumed to be a fair sequence of steps: in every step, an action in a process of S that is enabled in the current state is chosen and the statement of the action is executed atomically. Fairness of the sequence means that each action in a process of S that is continuously enabled along the states in the sequence is eventually chosen for execution.
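To make this execution model concrete, the following Python sketch is our own illustration (not code from the paper): an Action pairs a guard with a statement, and a weakly fair serial scheduler repeatedly picks an enabled action, so any action that remains continuously enabled is eventually executed. The toy token-passing actions at the end are likewise hypothetical.

    import itertools

    class Action:
        """A guarded action: when the guard holds in the current state,
        the statement may be executed atomically."""
        def __init__(self, name, guard, statement):
            self.name, self.guard, self.statement = name, guard, statement

    def run_serial(state, actions, steps=20):
        """Serial, weakly fair execution: scan the actions round-robin and
        execute the first enabled one found, so an action that stays
        continuously enabled is executed within one full scan."""
        order = itertools.cycle(actions)
        for _ in range(steps):
            for _ in range(len(actions)):
                a = next(order)
                if a.guard(state):
                    a.statement(state)          # one atomic step
                    print(a.name, state)
                    break
            else:
                return                          # no action enabled anywhere

    # Toy example: two processes alternately pass a "turn" token.
    state = {'turn': 0}
    run_serial(state, [
        Action('p0', lambda s: s['turn'] == 0, lambda s: s.update(turn=1)),
        Action('p1', lambda s: s['turn'] == 1, lambda s: s.update(turn=0)),
    ])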

3

Designing an Intolerant Program

The following simple program suffices in the absence of Byzantine failure: In each round, the general sends its new d.g value to all other processes. When a process receives this d.g value, it outputs that value and sends an acknowledgment to the general. After the general receives acknowledgments from all the other processes, it starts the next round, which repeats the same procedure. We let each process j maintain a variable d.j, denoting the decision of j, that is set to ⊥ when j has not yet copied the decision of the general. Also, we let j maintain a sequence number sn.j, sn.j ∈ {0, 1}, to distinguish between successive rounds.

The general process. The general executes only one action: when the sequence numbers of all processes become identical, the general starts a new round by choosing a new value for d.g and incrementing its sequence number, sn.g. Thus, letting ⊕ denote addition modulo 2, the action for the general is:

    RG1 :: (∀k :: sn.k = sn.g) → d.g, sn.g := newdecision(), sn.g ⊕ 1

The non-general processes. Each other process j executes two actions. The first action, RO1, is executed after the general has started a new round, in which case j copies the decision of the general. It then executes its second action, RO2, which outputs its decision, increments its sequence number to denote that it is ready to participate in the next round, and resets its decision to ⊥ to denote that it has not yet copied the decision of the general in that round. Thus, the two actions of j are:

    RO1 :: d.j = ⊥ ∧ (sn.j ⊕ 1 = sn.g) → d.j := d.g
    RO2 :: d.j ≠ ⊥ → { output d.j }; d.j, sn.j := ⊥, sn.j ⊕ 1
Proof of correctness (sketch). We show that the problem specification is satisfied in program computations from any start state where the sequence numbers of all processes are identical and the decisions of all other processes are equal to ⊥. In any start state, only the general can execute, thus starting a new round by executing RG1. In the resulting state, each other process can only copy the decision of the general by executing RO1 and then output this decision by executing action RO2. Thus, Validity and Agreement are satisfied. Also, after each process executes action RO2, the resulting state is again a start state. Therefore, Validity and Agreement are satisfied in each successive round.

Remark. The atomicity of these actions can be easily refined to read-only or write-only by the standard method of introducing local copies of the state of g in all other processes and the state of each other process j in the general process, a read action to update these local copies from the non-local state, and replacing non-local variables with their local copies in actions RG1, RO1, and RO2. (For examples of such refinements see [13].)
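For illustration only (this is our Python rendering, not the authors' code), the fault-free program above can be simulated directly; BOTTOM stands for ⊥, process 0 plays the general g, and new_decision() is a stand-in for the general's choice of d.g.

    import random

    N, GENERAL = 4, 0          # processes 0..3; process 0 is the general g
    BOTTOM = None              # represents the undefined decision value
    d  = [BOTTOM] * N          # d.j : decision of process j
    sn = [0] * N               # sn.j: sequence number of process j, in {0, 1}

    def new_decision():
        return random.randint(0, 1)

    def RG1():                 # general: start a new round once all sn values agree
        if all(sn[k] == sn[GENERAL] for k in range(N)):
            d[GENERAL], sn[GENERAL] = new_decision(), (sn[GENERAL] + 1) % 2

    def RO1(j):                # process j copies the general's decision
        if d[j] is BOTTOM and (sn[j] + 1) % 2 == sn[GENERAL]:
            d[j] = d[GENERAL]

    def RO2(j):                # process j outputs, advances its round, resets d.j
        if d[j] is not BOTTOM:
            print("process", j, "outputs", d[j])
            d[j], sn[j] = BOTTOM, (sn[j] + 1) % 2

    for _ in range(3):         # three rounds; every non-general output equals d.g,
        RG1()                  # so Validity and Agreement hold in each round
        for j in range(1, N):
            RO1(j)
            RO2(j)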

4

Adding Masking Tolerance to Byzantine Failure

Program R is neither masking tolerant nor stabilizing tolerant to Byzantine failure. In particular, R may violate Agreement if the general becomes Byzantine and sends different values to the other processes. Note, however, that since these values are binary, at least two of them are identical. Therefore, for R to mask the Byzantine failure of any one process, it suffices to add a "masking component" to R that restricts action RO2 in such a way that each non-general process only outputs a decision that is the majority of the values received by the non-general processes. For the masking component to compute the majority, it suffices that each non-general process obtain the values received by the other non-general processes. Based on these values, each process can correct its decision value to that of the majority.

We associate with each process j an auxiliary boolean variable b.j that is true iff j is Byzantine. For each process k (including j itself), we let j maintain a local copy of d.k in D.j.k. Hence, the decision value of the majority can be computed over the set of D.j.k values for all k. To determine whether a value D.j.k is from the current round or from the previous round, j also maintains a local copy of the sequence number of k in SN.j.k, which is updated whenever D.j.k is.

The general process. To capture the effect of Byzantine failure, one action MRG2 is added to the original action RG1 (which we rename as MRG1): MRG2 lets g change its decision value arbitrarily and is executed only if g is Byzantine. Thus, the general has two actions: MRG1, which is identical to RG1, and MRG2, which is enabled only when b.g holds and lets g assign d.g an arbitrary value.

The non-general processes. We add the masking component "between" the actions RO1 and RO2 at j to get the five actions MRO1-MRO5. MRO1 is identical to RO1. MRO2 is executed after j receives a decision value from g, to set D.j.j to d.j, provided that all other processes had obtained a copy of D.j.j in the previous round. MRO3 is executed after another process k has obtained a decision value for the new round, to set D.j.k to d.k. MRO4 is executed if j needs to correct its decision value to the majority of the decision values of its neighbors in the current round. MRO5 is a restricted version of action RO2 that allows j to perform its output iff its decision value is that of the majority. Together, the actions MRO2-4 and the restriction to action RO2 in MRO5 define the masking component (cf. the dashed box below).

To model Byzantine execution of j, we introduce action MRO6 that is executed only if b.j is true: MRO6 lets j arbitrarily change any D.j.k. Changing D.j.j affects the values read by process k when k executes MRO3. And, changing other D values lets j change d.j using MRO4. Thus, the six actions of MRO are as follows:

    MRO1 :: RO1
    --------------------------------------------------------------------------
    MRO2 :: d.j ≠ ⊥ ∧ SN.j.j = sn.j ∧ compl.j → D.j.j, SN.j.j := d.j, SN.j.j ⊕ 1
    MRO3 :: SN.j.k ⊕ 1 = SN.k.k → D.j.k, SN.j.k := D.k.k, SN.k.k
    MRO4 :: d.j ≠ ⊥ ∧ majdefined.j ∧ d.j ≠ maj.j → d.j := maj.j
    MRO5 :: d.j ≠ ⊥ ∧ majdefined.j ∧ d.j = maj.j → { output d.j }; d.j, sn.j := ⊥, sn.j ⊕ 1
    --------------------------------------------------------------------------
    MRO6 :: b.j → D.j.k := an arbitrary value, for any k

where

    compl.j      ≡ (∀k :: SN.j.j = SN.k.j)
    majdefined.j ≡ compl.j ∧ (∀k :: SN.j.j = SN.j.k) ∧ (sn.j ≠ SN.j.j)
    maj.j        = (majority k :: D.j.k)
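The predicates of the masking component are easy to express concretely. The Python helpers below are our own sketch with a hypothetical layout of the local copies: D[j][k] holds j's copy of d.k and SN[j][k] the matching sequence number; maj(j) is meaningful only when majdefined(j) holds.

    from collections import Counter

    N = 4                       # four processes; at most t = 1 is Byzantine

    def compl(j, SN):
        """compl.j: every process k has read j's current value of SN.j.j."""
        return all(SN[j][j] == SN[k][j] for k in range(N))

    def majdefined(j, SN, sn):
        """majdefined.j: j holds this round's value from every process k."""
        return (compl(j, SN)
                and all(SN[j][j] == SN[j][k] for k in range(N))
                and sn[j] != SN[j][j])

    def maj(j, D):
        """maj.j: majority of the decision values collected by j."""
        return Counter(D[j][k] for k in range(N)).most_common(1)[0][0]

    # Example: one corrupted copy cannot change the majority value.
    D  = [[1, 1, 0, 1] for _ in range(N)]   # hypothetical collected values
    SN = [[0, 0, 0, 0] for _ in range(N)]
    sn = [1, 1, 1, 1]
    if majdefined(1, SN, sn):
        print("process 1 outputs", maj(1, D))   # prints 1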

Fault Actions. If the number of Byzantine processes is less than 1, the fault actions make some process Byzantine. Thus, letting l and m range over all processes, the fault actions are:

Fig. 2. Relationship between simulation time and the number of messages per CS entry

Even though the algorithms coded from scratch exhibit better performance in terms of message complexity, our experience suggests that they were much more difficult to develop, debug, and verify, compared with Alg SSLR. We believe that the simplicity and reliability of the resulting code make the design of SS distributed algorithms using the transformation a preferable alternative.

3.3 Alg SSM

 1  Synchronization Thread Si of Node pi
 2  const
 3      Ri      /* pi's request set */
 4      Ai      /* pi's acquired set: Ai = { pj | pi ∈ Rj } */
 5  var
 6      A_request : bool
 7      mode : {idle, waiting, locked}
 8      ts : timestamp
 9      who : node_id      /* a node that obtained pi's lock */
10  *[
11      A_request ∧ (mode = idle) ∧ (∀pj ∈ Ri [pj.who ≠ pi]) →
12          ts := new_timestamp()
13          mode := waiting
14   [] mode = waiting ∧ (∀pj ∈ Ri [pj.who = pi]) →
15          mode := locked
16   [] ¬A_request ∧ (mode ≠ idle) →
17          mode := idle
18   [] (who = ⊥ ∨ (who = pj ∧ pj.mode ≠ locked)) ∧
        (∃pk ∈ Ai such that who ≠ pk and pk.ts = min{pl.ts | pl ∈ Ai : pl.mode ≠ idle}) →
19          who := pk
20   [] who = pj ∧ pj.mode = idle →
21          who := ⊥
22  ]

Fig. 3. Alg SSM

In Alg SSM, in addition to mode and timestamp ts, each synchronization thread Si maintains variable who that stores the identifier of the node that has locked Si, or ⊥ if Si is not locked. The first three guards are similar to those of Alg SSLR. Instead of checking whether the node has the smallest timestamp among all the requesting nodes, Si checks whether all the nodes in its request set are locked by pi (∀pj ∈ Ri [pj.who = pi]).

The fourth guard handles lock requests from other nodes. Condition (who = pj ∧ pj.mode ≠ locked) ∧ (∃pk ∈ Ai such that who ≠ pk and pk.ts = min{pl.ts | pl ∈ Ai : pl.mode ≠ idle}) is for deadlock avoidance. Even though the node itself is locked by a node pj, if pj has not obtained all the necessary locks yet (pj.mode ≠ locked), Si may change who to the node with the smallest timestamped request.

Condition (∀pj ∈ Ri : pj.who ≠ pi) in the first guard (line 11) and condition (who = pj ∧ pj.mode = idle) in the fifth guard (line 20) are to prevent a situation described below: Assume that the condition in line 11 is removed and that right after a node pi leaves its CS, it requests to enter the CS again. If none of the nodes in Ri have noticed the change of pi.mode from locked to idle and pi executes the first guarded command to set pi.mode to waiting again, pi can immediately execute the second guarded command (lines 14-15) and enter the CS. The conditions in the first and the fifth guards prevent such a situation.

Finally, implementation of new_timestamp() deserves special attention. Unlike SSLR, each node in SSM communicates only with other nodes in its request and acquired sets (ref: Fig. 3) and does not communicate with the rest of the nodes. To guarantee propagation of timestamp values among nodes, new_timestamp() may be implemented as follows: every node pi maintains a variable max_ts that holds the maximum value of the timestamps found in the nodes in its acquired set Ai. (Update of max_ts may be done in the third guarded command (lines 18-19) when pi checks the minimum ts value in its Ai.) When a node pi requests its CS entry, new_timestamp() returns a value larger than any max_ts values found in the nodes in its request set Ri.

Alg SSM is very simple compared with the original non-SS Alg M. This is particularly true for the deadlock avoidance part.
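The propagation idea can be sketched as follows (a Python illustration of ours, not the authors' code; Node, request_set and acquired_set are hypothetical stand-ins for pi, Ri and Ai): each node caches the largest timestamp it has seen among its acquired-set nodes, and new_timestamp() returns a value strictly larger than every cached maximum visible in its request set.

    class Node:
        def __init__(self, name):
            self.name = name
            self.ts = 0             # timestamp of the current request
            self.max_ts = 0         # largest timestamp seen in the acquired set
            self.request_set = []   # Ri: nodes whose locks pi must acquire
            self.acquired_set = []  # Ai: nodes that acquire pi's lock

        def refresh_max_ts(self):
            # Done while scanning the acquired set (for instance when looking
            # for the minimum ts in Ai), so large timestamps keep spreading.
            for p in self.acquired_set:
                self.max_ts = max(self.max_ts, p.ts, p.max_ts)

        def new_timestamp(self):
            # Strictly larger than every max_ts visible in the request set.
            self.ts = 1 + max([p.max_ts for p in self.request_set] + [self.max_ts])
            return self.ts

    # Usage example: two nodes that appear in each other's sets.
    a, b = Node("a"), Node("b")
    a.request_set, a.acquired_set = [b], [b]
    b.request_set, b.acquired_set = [a], [a]
    b.ts = 7                        # pretend b issued an earlier request
    a.refresh_max_ts()              # a learns about b's timestamp
    print(a.new_timestamp())        # 8: newer than anything a has seen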

4

Another Example: SS Leader Election

As another case study, this section presents an SS leader election algorithm (Alg SSCR). The algorithm is based on Chang and Roberts' leader election algorithm [2] for unidirectional ring networks. It is assumed that each node maintains a unique node identifier (ID) and an integer N in its read-only incorruptible memory. N is chosen to be larger than the number of nodes in the system and is used to stabilize the system. In addition, each node maintains variable max to keep the largest node ID that it has seen. The largest max value is propagated along the ring. If node pi finds that its neighbor holds pi's ID in its max (i.e., pi's ID has propagated through all other nodes), pi has the maximum ID and becomes the leader.

In our self-stabilizing version, each node also maintains a variable dist which stores the distance to itself from the node with the maximum ID. The variable is used to eliminate a wrong max value from the system. If a node, say pi, holds in its max a value Wrong_ID which is larger than any ID, Wrong_ID is propagated to other nodes. However, eventually the dist of each node will become larger than N and Wrong_ID will be discarded.
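A Python sketch of the rule just described (our own illustration, independent of the figure below): a node adopts a larger max from its left neighbour together with an incremented dist, discards any value whose dist has reached N (this is how a spurious Wrong_ID is flushed), and declares itself leader when its own ID comes back around the ring.

    def step(node, left, N):
        """One update of node from its left neighbour on the unidirectional ring.
        node and left are dicts with keys 'ID', 'max', 'dist', 'leader'."""
        if left['max'] > node['ID'] and left['dist'] < N:
            # adopt the larger identifier, one hop further from its origin
            node['max'], node['dist'], node['leader'] = left['max'], left['dist'] + 1, False
        else:
            # either node's own ID dominates, or the propagated value has
            # travelled at least N hops and must be a corrupted Wrong_ID
            node['max'], node['dist'] = node['ID'], 0
            node['leader'] = (left['max'] == node['ID'])   # own ID went all the way around

    ids, N = [3, 7, 5], 10                    # N exceeds the number of nodes
    ring = [{'ID': i, 'max': 999, 'dist': 0, 'leader': False} for i in ids]  # 999 = Wrong_ID

    for _ in range(5 * len(ring)):            # enough sweeps to flush 999 and elect 7
        for k, node in enumerate(ring):
            step(node, ring[k - 1], N)

    print([(n['ID'], n['leader']) for n in ring])   # only the node with ID 7 is leader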

 1  Node pi
 2  const
 3      ID : node_id      /* node ID of pi */
 4  var
 5      max : node_id
 6      dist : integer
 7      leader : bool
        /* pi-1 denotes the left neighbor of node pi */
 8  *[
        (pi-1.max > ID) ∧ (pi-1.dist ...

Each action of a process q[i] is of the form <guard> --> <statement>, where <guard> is a boolean expression over the variables of q[i] and those of the neighbors of q[i], and <statement> is a sequence of assignment statements that update the variables of q[i]. The processes in the process array q[i : 0..n-1] can be defined as follows.

process q[i : 0..n-1]
var <var 0 : type 0>, ..., <var k-1 : type k-1>
begin
     <guard 0>   --> <statement 0>
  [] ...
  [] <guard m-1> --> <statement m-1>
end

A state of the process array q[i : 0..n-1] is an assignment of a value to every variable in every process in the array. The value assigned to each variable is from the domain of values of that variable. An action in the process array q[i : 0..n-1] is enabled at a state s iff the guard of that action is true at state s. We assume that the process array q[i : 0..n-1] satisfies the following two conditions.

i. Determinacy: At each state s, each process q[i] has at most one action that is enabled at s.

ii. Enabling: At each state s, each process q[i] has at least one action that is enabled at s.

(The assumption that each process q[i] satisfies these two conditions is not a severe restriction. For example, if a process q[i] has two actions whose guards are "<guard 1>" and "<guard 2>" and these two actions can be enabled at the same state, the two actions can be replaced by actions whose guards are mutually exclusive. Similarly, to satisfy the enabling condition, an action of the form

    ¬<guard 1> ∧ ... ∧ ¬<guard m> --> skip

can be added, where <guard 1>, ..., <guard m> are the guards of all actions in q[i]. This added action is enabled at any state where none of the other actions is enabled.)

A serial transition of the process array q[i : 0..n-1] is a pair (s, s') of states such that starting the process array at state s and then executing the statement of one action that is enabled at s yields the process array in state s'. A serial computation of the process array q[i : 0..n-1] is an infinite sequence s.0, s.1, ... of states such that each pair (s.i, s.(i+1)) of consecutive states in the sequence is a serial transition.

A set S of the states of the process array q[i : 0..n-1] is closed iff for every serial transition (s, s') of the process array, if s is in S, then s' is in S. Let S be a closed set of the states of the process array q[i : 0..n-1]. The process array q[i : 0..n-1] is serially stabilizing to S iff every serial computation of the process array has an infinite suffix where each state is in S.

In this section, we show that if the process array q[i : 0..n-1] is serially stabilizing to S, then another process array q'[i : 0..n-1] is concurrently stabilizing to S', where q'[i : 0..n-1] and S' are strongly related to q[i : 0..n-1] and S, respectively. We start by showing how to construct the process array q'[i : 0..n-1] from the process array q[i : 0..n-1] and the linear alternator p[i : 0..n-1] in Section 2. Each process q'[i] can be constructed from process q[i] and process p[i] in the linear alternator, as follows. First, a copy of the boolean variable b[i] in process p[i] is added to process q[i]. Second, guard G.i of the action of process p[i] is added as a conjunct to the guard of every action in process q[i]. Third, statement S.i of the action of process p[i] is added to the statement of every action in process q[i]. The resulting process array q'[i : 0..n-1] is defined as follows.

process q'[i : 0..n-1]
var <var 0 : type 0>, ..., <var k-1 : type k-1>, b[i] : boolean
begin
     <guard 0> ∧ G.i   --> <statement 0>; S.i
  [] ...
  [] <guard m-1> ∧ G.i --> <statement m-1>; S.i
end
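The construction just described is a purely syntactic transformation of each action. The Python sketch below is our own illustration (the alternator of Section 2 is not reproduced in this excerpt, so G_i and S_i are passed in as parameters, and the one-action example process is hypothetical): every guard of q[i] is strengthened with G.i and every statement is followed by S.i.

    def compose_process(q_actions, G_i, S_i):
        """q_actions: list of (guard, statement) pairs of process q[i].
        G_i, S_i: guard and statement of the alternator process p[i].
        Returns the action list of the composed process q'[i]."""
        composed = []
        for guard, statement in q_actions:
            def new_guard(state, guard=guard):
                return guard(state) and G_i(state)        # <guard j> and G.i
            def new_statement(state, statement=statement):
                statement(state)                          # <statement j> ;
                S_i(state)                                # S.i
            composed.append((new_guard, new_statement))
        return composed

    # Hypothetical one-action process over a shared dict, and stand-ins for
    # the alternator's guard G.0 and statement S.0 over the bit array b.
    q0 = [(lambda s: s['x'] < 3, lambda s: s.update(x=s['x'] + 1))]
    G0 = lambda s: s['b'][0] == s['b'][1]
    S0 = lambda s: s['b'].__setitem__(0, not s['b'][0])

    state = {'x': 0, 'b': [False, False]}
    for guard, statement in compose_process(q0, G0, S0) * 5:
        if guard(state):
            statement(state)
    print(state)        # q'[0] took a step only while G.0 held, then flipped b[0]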

A state of the process array q'[i : 0..n-1] is an assignment of a value to every variable in every process in the array. The value assigned to each variable is from the domain of values of that variable. An action in the process array q'[i : 0..n-1] is enabled at a state s iff the guard of that action is true at state s.

Lemma 4: For each state s, at least one action in a process in the process array q'[i : 0..n-1] is enabled at s. []

Lemma 5: For each state s, at most one action in each process in the process array q'[i : 0..n-1] is enabled at s. []

The definitions of a concurrent or maximal transition, and of a concurrent and maximal computation, that were given in Section 2 for the linear alternator p[i : 0..n-1], can also be given for the process array q'[i : 0..n-1]. An alternating state of the process array q'[i : 0..n-1] is a state s where either exactly one action in every even process (namely processes q'[0], q'[2], ...) is enabled at s, or exactly one action in every odd process (namely processes q'[1], q'[3], ...) is enabled at s. It is straightforward to show that the process array q'[i : 0..n-1] satisfies the following three properties.

i. Non-interference: At each state of the process array q'[i : 0..n-1], if a process has an enabled action, then no neighbor of that process has an enabled action.

ii. Progress: Along each concurrent computation of the process array q'[i : 0..n-1], an action of each process is executed infinitely often.

iii. Stabilization: Each maximal computation of the process array q'[i : 0..n-1] has an infinite suffix where each state is alternating.

Note that these properties of the process array q'[i : 0..n-1] are similar to the properties of the linear alternator discussed in Section 2. Actually, the fact that q'[i : 0..n-1] satisfies these three properties follows from the fact that p[i : 0..n-1] satisfies similar properties.

Let S be a closed set of states of the process array q[i : 0..n-1]. The extension of S is the set S' of all states of the process array q'[i : 0..n-1] such that for every state s' in set S', there is a state s in set S where every variable in q[i : 0..n-1] has the same value in s and s'. Let S be a closed set of states of the process array q[i : 0..n-1], and let S' be the extension of S. The process array q'[i : 0..n-1] is concurrently stabilizing to S' iff every concurrent computation of the process array has an infinite suffix where each state is in S'.

Theorem 1: If the process array q[i : 0..n-1] is serially stabilizing to S, then the process array q'[i : 0..n-1] is concurrently stabilizing to S', where S' is the extension of S. []

5 Concluding Remarks

An alternator is a system that can be used in transforming any system that is stabilizing under the assumption that actions are executed serially into one that is stabilizing under the assumption that actions are executed concurrently. In this paper, we presented a linear alternator, and discussed how to use this alternator in transforming any linear system that is stabilizing assuming serial execution into one that is stabilizing assuming concurrent execution.

Currently, we are developing alternators with more general topologies (rather than mere linear topology). These results will be reported in [3].

References

[1] Anish Arora, Paul Attie, Mike Evangelist, and Mohamed G. Gouda, "Convergence of Iteration Systems", Distributed Computing, Volume 7, pp. 43-53, 1993.
[2] James E. Burns, Mohamed G. Gouda, and Raymond E. Miller, "On Relaxing Interleaving Assumptions", Proceedings of the MCC Workshop on Self-Stabilization, Austin, Texas, 1989.
[3] Furman Haddix, "Alternating Parallelism and the Stabilization of Cellular Systems", Ph.D. Dissertation in progress, Department of Computer Sciences, The University of Texas at Austin, Austin, Texas, 1997.
[4] Masaaki Mizuno and Hirotsugu Kakugawa, "A Timestamp Based Transformation of Self-Stabilizing Programs for Distributed Computing Environments", Proceedings of the International Workshop on Distributed Algorithms (WDAG). Also published in Lecture Notes in Computer Science, Volume 1151, pp. 304-321.

Self-Stabilizing ℓ-Exclusion (Preliminary Version)

Uri Abraham, Shlomi Dolev, Ted Herman, Irit Koll

Department of Mathematics and Computer Science, Ben-Gurion University, Beer-Sheva, 84105, Israel. Email: abraham@cs.bgu.ac.il. Partially supported by the Israeli ministry of science and arts grant #6756196. Email: dolev@cs.bgu.ac.il. Department of Computer Science, University of Iowa. Elta, Israel Ltd. Email: iritk@iselta.co.il.

Abstract. This work presents a self-stabilizing algorithm for the problem of ℓ-exclusion in the common shared memory model. The algorithm is a combination of mechanisms that are responsible for safety, liveness and fairness.

1

Introduction

Mutual exclusion is one of the fundamental problems in distributed computing [3, 6]. The problem is defined for a system of n asynchronous processes that communicate only by reading from and writing to shared memory. There is a single resource in the system. Every process can access this resource, however only a single process can access the resource at a time. The mutual exclusion problem was generalized to the ℓ-exclusion problem by Fischer, Lynch, Burns and Borodin in [5]. In the ℓ-exclusion problem there is a resource that can be shared by at most ℓ processes at any time, for some specified ℓ ≥ 1. The program of every process has a piece of code called the critical section in which the process has shared access to the resource. A solution of the ℓ-exclusion problem guarantees that at most ℓ processes are executing the critical section at any time.

While the common shared memory model is the one considered in most works on mutual exclusion (e.g. [11, 8, 9, 10]), very few works consider self-stabilizing mutual-exclusion algorithms for this model. One important exception is the seminal work of Lamport in [7]. Another self-stabilizing solution is mentioned in [8] and appears in [12]. Dijkstra's presentation of self-stabilizing mutual exclusion [2] has a system in which state variables are shared, but communication is constrained to a ring shape (program counters are abstracted out of this model). It is natural to generalize mutual exclusion to ℓ-exclusion. A self-stabilizing ℓ-exclusion algorithm presented in [4] is a variant of [2] designed for the ring shaped system. The algorithm in [4] does not satisfy all the requirements one would hope for in the common shared memory model: processes in a ring depend on each other for privilege circulation, whereas it should be possible for processes to enjoy greater independence under the common shared memory model. A variant of the ℓ-exclusion problem called bounded first-in, first-enabled is presented in [1]. The following quotation, addressing an open problem, is taken from [1]: "Self-stabilizing algorithms have the interesting property of converging to correct global state, no matter how they are initialized [2]. It would also be interesting to investigate self-stabilizing solutions to the ℓ-exclusion problem ..."

In this work we present the first self-stabilizing algorithm for the ℓ-exclusion problem in the common shared memory model. In our algorithm processes may execute the critical section, the remainder section or the trying section. Processes that execute the remainder section do not take an active part in the algorithm (i.e., do not try to obtain the shared resource). The algorithm is a combination of three mechanisms that are responsible for safety, liveness and fairness.

The remainder of the paper is organized as follows. In the next section we formalize the assumptions and requirements for self-stabilizing ℓ-exclusion. Section 3 presents a self-stabilizing ℓ-exclusion algorithm for which the safety and liveness requirements hold but not the fairness requirement. A new mechanism to ensure fairness is presented in Section 4. Concluding remarks are given in Section 5. Due to space limitations, a number of proofs are omitted from this preliminary version.

2

The Self-Stabilizing ℓ-Exclusion Problem

2.1 The System

A distributed system consists of n processes (1, ..., n). Each process is a state machine. A state of a process is defined by its program counter and the value of its local variables. Processes communicate with each other only by using shared communication registers. The system state is the vector of states of all processes and the contents of all registers. Each communication register is atomic (serializable) with respect to read and write operations. The smallest execution unit of a process is an atomic step. One atomic step of a process consists of an internal computation followed by either a read or a write operation, but not both. An execution sequence is an alternating sequence (finite or infinite) of system states and atomic steps E = c1, s1, c2, s2, ..., such that c(i+1) is reached from ci by the execution of the step si, for i = 1, 2, .... A computation is an execution sequence in which every process executes an atomic step infinitely often.

The program of each process includes a trying section, critical section and remainder section. The value of the program counter of a process defines whether a process is executing the trying, critical or remainder section. A process that executes the critical section has shared access to critical resource(s). When a process is executing the remainder section it is not interested in a resource.

2.2 Refresh Operations

The model defined above can be seen informally as a collection of processes, each with local variables, a program and program counter, and a shared random access memory (the communication registers). This model relates to the traditional model used for mutual exclusion in [11, 8, 9, 10]. A consequence of this model for self-stabilization is that steps must be taken so that non-shared variables (local variables and the program counters) agree with the communication registers. Consider, for instance, an initial state in which some process p perpetually executes steps in the remainder section and never exits the remainder section. In an illegitimate initial state, certain communication registers could indicate that process p is in the critical section, whereas in fact p never executes any step in the critical section. Two possible resolutions for such a situation are (i) process p could periodically "refresh" values in communication registers by writing to them, or (ii) we could propose some form of detection mechanism be added to the model whereby processes other than p could detect that p is in the remainder section* (this would allow the other processes to ignore the faulty information in certain communication registers). We adopt strategy (i) to deal with processes that either perpetually execute the remainder section, or perpetually execute the critical section. We assume that a process perpetually executing either the remainder or critical section will periodically invoke refresh operations, which consist of steps that write appropriate values to communication registers. Due to space limitations in this preliminary version, a detailed description of the refresh operations is omitted from the specification of our algorithms. Note that for strategy (i), fair execution sequences are essential for self-stabilization: if the execution sequence is unfair starting from an illegitimate initial state, a process could be in the critical section but undetectably so if it does not write to its communication registers.

2.3 Requirements

For a self-stabilizing ℓ-exclusion algorithm, every computation has a suffix that fulfills the safety, liveness and fairness requirements defined in this section. We do not present code for the critical section or for the remainder section in this paper because these two sections represent "user activity" outside the scope of the ℓ-exclusion algorithm. We say that a process is settled in the critical section if it continues forever to execute the critical section (and the implied refresh operations). The definition of settling in the critical section is useful for specifying and proving certain properties of our algorithms. For instance, we use it to show that certain processes make progress while other processes do not give up the resource they hold. The first requirement is safe ℓ-exclusion for self-stabilizing algorithms. Recall that the original ℓ-exclusion requirement (see [5]) does not allow any system state with more than ℓ processes in the critical section.

Requirement 2.1 Safe ℓ-exclusion: If no more than ℓ processes are settled in the critical section (from the initial system state of the computation) then eventually at most ℓ processes are concurrently in the critical section.

* We thank Jim Burns for suggesting the possibility of (ii).


Clearly, at most ℓ processes can be settled in the critical section if we are to prove self-stabilization of safety: if in some initial state, all n processes are in the critical section and do not exit the critical section, then it will be impossible to ensure that eventually at most ℓ < n processes are concurrently in the critical section. However, if ℓ or fewer processes are settled in the critical section, the safety property specifies that eventually no more than ℓ processes concurrently execute the critical section at any time.

Requirement 2.2 Live ℓ-exclusion: If k < ℓ processes are settled in the critical section and some other process tries to enter the critical section, then eventually k + 1 processes are in the critical section.

Requirement 2.3 Fair ℓ-exclusion: In a computation in which no process is settled in the critical section, each process not in the remainder section eventually executes the critical section.

For the algorithm presented in Section 4, there are computations in which a single process settled in the critical section forever prevents the execution of the critical section of some other processes. For that algorithm, we assume that no process is settled in the critical section in order to prove fair ℓ-exclusion.

Requirement 2.4 Self-stabilizing ℓ-exclusion: An ℓ-exclusion algorithm is self-stabilizing if every computation that starts with any arbitrary initial state has a suffix in which the above safety, liveness and fairness requirements of ℓ-exclusion hold.

The construction of a self-stabilizing ℓ-exclusion algorithm is simplified if the shared memory architecture supports the following atomic, read-modify-write operation. The CFAA operation (conditional fetch and add) is: examine a shared counter variable; if the counter is less than ℓ, then add one to the counter and return a true condition; otherwise leave the counter unchanged and return a false condition. Using the CFAA operation, an algorithm can use a busy-wait strategy to attempt entry to the critical section (some additional mechanism is still needed to provide fairness). The system model in this paper is a limited architecture, providing only atomic read and write of a shared register; we do not have a CFAA operation in the model. But the idea of the CFAA operation invites the following strategy for constructing an ℓ-exclusion algorithm. First, devise a protocol to simulate the CFAA operation; then use this protocol to build the ℓ-exclusion algorithm. For instance, CFAA could be simulated by a self-stabilizing mutual exclusion algorithm (a self-stabilizing 1-exclusion algorithm) to guarantee exclusive access to the shared counter. Although such a strategy would technically satisfy our definition of self-stabilizing ℓ-exclusion, we intuitively expect more of an ℓ-exclusion algorithm than mere application of one instance of mutual exclusion. We propose the following property to capture the spirit of ℓ-exclusion.

Requirement 2.5 x-wait-freedom: For some polynomial P(n), every computation has a suffix satisfying: for every set S of (n - x) processes, if the processes of S are in the remainder section in every state, each process p ∉ S outside the remainder section is either critical or becomes critical within P(n) of its own steps.

It follows from the definition of x-wait-freedom that no ℓ-exclusion algorithm can be x-wait-free for x > ℓ (if ℓ processes occupy the critical section, then any other trying process can execute an unbounded number of steps trying). This implies that the strategy of simulating CFAA using mutual exclusion, described above, does not satisfy x-wait-freedom for x > 1. By contrast, our live and safe algorithm is ℓ-wait-free, while our fair algorithm is (ℓ - 1)-wait-free. Intuitively the x-wait-free property ensures a "wide door" to the critical section when there are no conflicting requests. If an algorithm is x-wait-free, then no process in the trying section endures a large interference from other processes, so long as at most x processes are competing for the critical section. Although x-wait-freedom is specified with respect to a fixed set of processes forever in the remainder section, we intend broader application. For instance, any computation segment in which (n - x) processes are in the remainder section "long enough" is x-wait-free for the other processes.
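By way of contrast with the read/write registers assumed in this paper, a CFAA operation is straightforward on an architecture with stronger primitives. The Python sketch below is our own illustration (not part of the paper's model): a lock makes the examine-and-add a single atomic step.

    import threading

    class SharedCounter:
        """Conditional fetch-and-add: atomically add 1 if the counter is
        still below the bound ell, and report whether the add happened."""
        def __init__(self, ell):
            self.ell = ell
            self.value = 0
            self._lock = threading.Lock()

        def cfaa(self):
            with self._lock:             # examine-and-add as one atomic step
                if self.value < self.ell:
                    self.value += 1
                    return True
                return False

        def release(self):
            with self._lock:
                self.value -= 1

    # Busy-wait entry to the critical section via CFAA (fairness not addressed).
    counter = SharedCounter(ell=2)
    print(counter.cfaa(), counter.cfaa(), counter.cfaa())   # True True False
    counter.release()
    print(counter.cfaa())                                    # True again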

2.4 Conventions

In the presentations of our algorithms, we abbreviate some iterative constructs. These iterative constructs are not atomically executed: there is some lower-level, straightforward implementation using steps. For instance, the statement

    if #{ j | tryj } > ℓ then ...

could be implemented by a loop and a local variable, such as

    s = 0;
    for j = 1 to n do if tryj then s = s + 1 od
    if s > ℓ then ...

3

The Base Algorithm - Live and Safe Algorithm

This algorithm, the base algorithm, is self-stabilizing and satisfies safety and liveness, but not fairness. The base algorithm is the building block for Section 4, which adds fairness to the ℓ-exclusion. The base algorithm consists of three sections. Each process i has two boolean communication registers, tryi and csi. Intuitively, if tryi = true, then process i is either trying to get into the critical section or is executing the critical section. If csi = true, then process i is executing the critical section. The mechanism that ensures safety appears essentially in lines B04 and B09. Roughly speaking, if there are more than ℓ processes in the critical section then the last process that executed line B04 among them must find that the condition in line B09 holds and thus does not enter the critical section. The liveness property is achieved by a mechanism that selects the processes with the largest identifiers among the processes that try to enter the critical section and for which there is an available resource. In case the set of processes that try to enter the critical section is larger than the number of available resources (line B09), the processes with the largest identifiers do not assign try = false (line B07), i.e. they do not give up. The processes with smaller identifiers assign try = false (line B07) and return to the beginning of the try section (line B01). Thus, the processes with small identifiers let the processes with the large identifiers enter the critical section.

/* Part One: waiting for large-id processes */
door1:
B01   while #{ j | tryj ∧ ¬csj ∧ j > i } ≥ (ℓ - #{ j | csj }) do
B02       tryi = false
B03       csi = false

/* Part Two: waiting for small enough group */
door2:
B04   tryi = true
B05   csi = false
B06   if #{ j | tryj ∧ ¬csj ∧ j > i } ≥ (ℓ - #{ j | csj }) then
B07       tryi = false
B08       goto door1
B09   if #{ j | tryj } > ℓ then
B10       goto door2

/* Part Three: critical and remainder sections */
B11   csi = true
B12   critical section
B13   csi = false
B14   tryi = false
B15   remainder section
B16   goto door1

Fig. 1. Base algorithm for process i, 1 ≤ i ≤ n
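The two tests of Part Two can be stated concretely. The Python helpers below are our own sketch (the lists try_ and cs are hypothetical stand-ins for the registers tryi and csi); they express the conditions of lines B06 and B09 that process i evaluates before entering the critical section.

    def must_back_off(i, try_, cs, ell):
        """Condition of line B06: enough larger-id contenders (not yet critical)
        exist to use up the remaining resources, so process i gives up
        and returns to door1."""
        contenders = sum(1 for j in range(len(try_)) if try_[j] and not cs[j] and j > i)
        return contenders >= ell - sum(cs)

    def must_retry(try_, ell):
        """Condition of line B09: more than ell processes currently have try set,
        so process i loops back to door2 instead of becoming critical."""
        return sum(try_) > ell

    # Example with n = 4 and ell = 2: processes 0, 2, 3 are trying; 1 is critical.
    try_ = [True, True, True, True]
    cs   = [False, True, False, False]
    print(must_back_off(0, try_, cs, ell=2))   # True: ids 2 and 3 outrank 0
    print(must_back_off(3, try_, cs, ell=2))   # False: no larger id is waiting
    print(must_retry(try_, ell=2))             # True: four try flags exceed ell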

The code in Figure 1 mentions two variables for process i, the tryi and csi variables. There are additional internal variables necessary for the implementation of the counting constructs of lines B01, B06 and B09. Observe that the counting construct of line B09 is not atomically executed, but is computed by low-level atomic steps that first read tryj variables and then add to some internal accumulation variable. It is therefore possible for the count obtained to be inaccurate, either because some tryj was read as false but subsequently became true, or because tryj was read as true but subsequently became false before line B09 finished execution. We say that internal variables for a counting construct are consistent with respect to a state in a computation if the values of these internal variables are entirely due to previous steps in the computation. In particular, the internal variable values should be justified by previously read try and cs values, which were (at least momentarily) accurate when the atomic read step executed. Consistency is undefined for initial states.

We make some additional definitions to prove properties of the algorithm. Let pci denote the value of the program counter for process i. Process i is said to be critical if pci is in the range B11 to B14. The definition of a legitimate state is essentially a relation between variables and program counters; we define first a form of "pre-legitimate" state. An algorithm state a is coherent if, for every process i and program counter value pci = k at state a, the values of tryi and csi at state a are identical to the values at a state obtained by some computation segment starting from an initial state satisfying pcj = B01 ∧ csj = false ∧ tryj = false for every process j. A computation suffix is coherent if every state in the computation is coherent and every internal variable is consistent at each state in the suffix. The coherence property specifies, for example, that csi = true iff the program counter is B12 or B13.

In order to prove self-stabilization, some assumptions are made about lines B12 and B15. Line B12, the critical section, corresponds to some procedure requested by a user invocation of the system ℓ-exclusion service. Statement B15, the remainder section, represents the activity of a user between calls to the ℓ-exclusion service. We assume that some refresh procedure is repeatedly executed by a process while it is executing the critical and remainder sections. The refresh procedure ensures that, if the critical section does not terminate for process i, both tryi and csi are assigned true. Similarly, tryi and csi are eventually assigned false if i forever executes the remainder section.

Lemma 1. Every computation of the base algorithm contains a coherent suffix.

The proof (which is omitted) relies on the execution of the refresh procedure. For convenience, we define a computation to be feasible if at most ℓ processes are settled in the critical section.

Lemma 2. Every coherent and feasible computation suffix in which at most (s + 1) processes are critical at each state, s ≥ ℓ, contains a suffix where at most s processes are critical at each state.

Proof. The proof uses contradiction, based on a computation suffix with a particular property we now describe. Consider some coherent, feasible computation suffix A and let k ≤ ℓ of the processes be settled in the critical section in computation A; additional processes enter and leave the critical section during computation A. Define B to be a suffix of A such that every process entering and leaving

the critical section in B does so infinitely often. Let T be the set of processes that infinitely often enter a critical section in suffix B. By B's construction, we may identify a suffix C with the following property: any process i entering the critical section assigns try; = true (line B04) before entering the critical section, and in all states that follows the last assigment of tryi = true before i enters the critical section, it holds that tryi = true. (Of course it is also possible that a process could be unsuccessful in the second part of the code, jump back to the first part, and retry numerous times before it successfully completes the second part of the algorithm and becomes critical; however, before it becomes critical, it must first successful1y pass through the second part of the protocol.) The antecedent of the lemma specifies that at most ( s 1) processes are critical at each state of suffix C. By feasibility, C contains at least one state where at most s processes are critical. Heading for a contradiction, consider a step in computation C , resulting in a state a , such that the number of critical processes changes from s to s 1. Let S be the intersection of set T (the processes that enter the critical section infinitely often) and the set of (s + 1) critical processes at state a (the state that follows an atomic step in which the s l'th process joined the critical section). Notice that the s 1 process that execute the critical section in a are the k settled processes and the process in S. For each i E S, there is a segment Xi of C starting from a state obtained by the step tryi = true (line B04) and terminating at state a , such that tryi is true at each state. Among all the segments Xi, for i E S, there exists a segment Xp of minimum length: this segment corresponds to the last process p that assigned tryp = true before entering the set of processes critical at state a. Because the evaluation of the counting statement B09 occurs entirely within Xp, the count obtained includes all the processes i E S as well as processes settled in the critical section. Therefore the count obtained is at least JSI k = ( s 1), which implies process p returns to door2 instead of becoming critical - and this is a contradiction on p's entry to the critical section. The contradiction proves that w the number i f critical processes does not increase from s to (s 1).


Lemma 3. Every coherent and feasible computation suffix A contains some suffix B such that at most ℓ processes are critical at each state of B. Thus, the base algorithm fulfills the safety requirement, Requirement 2.1.

Proof. The proof is by induction, using Lemma 2 at each step of the induction.

For the remainder of this section, we base all arguments on the assumption of ℓ-coherent suffixes, that is, coherent suffixes in which at most ℓ processes are critical at each state.

Lemma 4. Consider an ℓ-coherent computation in which k < ℓ processes are settled in the critical section and there is some other process p not executing the remainder section. Then eventually, k + 1 processes are critical in the computation.


Proof. Suppose there is some computation where k processes are settled in the critical section and there are never k + 1 processes in the critical section. We build the proof by a number of claims based on this computation, which establish the existence of suffix computations having desired properties.

Claim 1. At an infinite number of states in the computation, #{ j | tryj } > k holds. The claim can be shown by contradiction. We know that at least k of the try variables are eventually true at every state, and there exists some non-empty set of processes S permanently outside of the remainder section. If none of the processes in set S assigns its try variable to true, all these processes eventually execute lines only in Part One of the algorithm. But the process with maximum identifier in set S must find the if condition of line B01 false, since by supposition #{ j | tryj } = #{ j | csj } and k < ℓ. This is a contradiction, proving Claim 1.

A corollary of Claim 1 is that some process executes lines in Part Two of the algorithm at an infinite number of states. We can therefore define B to be the suffix computation in which every process that executes a line in Part Two does so at infinitely many steps. Call R the set of processes that execute some line in Part Two within computation B. Let T be the subset of R containing the ℓ − k processes of largest id if |R| ≥ ℓ − k, and if |R| < ℓ − k let T = R.

Claim 2. Within computation B, any process in set T executes only lines in Part Two of the algorithm. This claim follows from the construction of T, which makes the if condition of line B06 evaluate to false for any member of T. Let C be the suffix of B such that each process in T has its try variable true at every state (C is well defined because the first line of Part Two assigns the try variable to true).

Claim 3. C has a suffix D in which only processes in T execute lines of Part Two. This claim holds trivially if |R| ≤ ℓ − k. We prove the remaining case by contradiction. Suppose |T| = ℓ − k and there exists some p ∉ T executing lines of Part Two infinitely often in computation C. Since p has a smaller id than any process of T, process p eventually evaluates the if condition of line B06 to true, and subsequently will execute the while construct in Part One. Because the ℓ − k processes in T have larger ids and their try variables are true, process p will not advance beyond the while construct in Part One. This contradicts the assumption about computation C, that some process p ∉ T infinitely often executes lines of Part Two of the algorithm.

The following claim completes the lemma's proof, since it implies a contradiction to the assumption that only k processes are critical throughout the computation we consider. Claim 4. Some process executing Part Two in computation D becomes critical. This is just a corollary of Claim 3 and the assumption that only k processes are critical (and these k are settled). Since we consider only coherent states, csj ⇒ tryj; therefore k of the try variables are true, corresponding to the settled processes, and ℓ − k (or fewer) processes have try variables true because they are elements of T. Within D, all remaining processes have false values for the try variables. Consequently, the if condition of line B09 evaluates to false for any process within computation D, and Claim 4 is implied.


Corollary 5. The base algorithm fulfills Requirement 2.2.

Theorem 6. The base algorithm is ℓ-wait-free (Requirement 2.5 with x = ℓ).

Proof. For any ℓ-coherent computation in which no more than ℓ processes are outside the remainder section, it is easy to see that every process checks the condition of the while loop of line B01 once, and jumps neither to door1 nor to door2.

4 Fair ℓ-Exclusion Algorithm

The base algorithm presented in the previous section does not guarantee fair access to the critical section. Processes with large ids have priority over those with smaller ids, so there are computations in which a small-id process can starve. This section describes a method combining two copies of the base algorithm to obtain a fair ℓ-exclusion algorithm.

The two copies of the base algorithm are executed "concurrently". One copy is the base algorithm with ℓ′ = 1 and the other is the base algorithm with ℓ′ = ℓ − 1. We call the first copy 1-exclusion and the second (ℓ − 1)-exclusion. The idea is that processes may concurrently participate in both copies, up to the point where they enter the critical section (we do not want a process to simultaneously participate in the critical sections of both copies). It is clear that processes with large identifiers (or fast step rates) could repeatedly enter the (ℓ − 1)-exclusion copy while others do not succeed. To counter this bias toward large identifiers, we arrange that each process i stop participating in the 1-exclusion copy upon observing that some other process has been trying to enter the critical section since the previous time at which i tried. The idea is to schedule participation in the two copies so that only the "oldest" requests for entry into the critical section are allowed to participate in the 1-exclusion copy. Intuitively, we say that a process with a new request to enter the critical section "owes a turn" to processes with older requests. Thus, it is eventually the case that no fast process is executing the 1-exclusion copy, and the slow processes eventually enter the critical section via the 1-exclusion copy.

To implement the concurrent execution we introduce new variables that assist in the management of the 1-exclusion and (ℓ − 1)-exclusion copies. Process i sets the boolean variable xi to true to signal that it is trying to enter the critical section (xi, unlike the tryi variable, remains true until i enters the critical section). The xi variable serves as a request indicator to other processes, signaling that i wants to enter the critical section. In addition, owei and yi are vectors of n booleans used to record an image of all the x variables. Unlike all other variables, to which only one process can write, the value of each element yj[i] can be written by two processes, i and j (i assigns false to yj[i] before it executes the critical section). Each process i participates in both the 1-exclusion and the (ℓ − 1)-exclusion copies only if it does not owe a turn to any other process. To avoid naming conflicts, the try and cs variables of the 1-exclusion copy are renamed try1 and cs1, and those of the (ℓ − 1)-exclusion copy are renamed try(ℓ−1) and cs(ℓ−1).


clean:
C01 xi = false
C02 for j = 1 to n do
C03     yj[i] = false
C04 return

1-clean:
C05 try1_i = false
C06 cs1_i = false
C07 pc1_i = B15
C08 return

(ℓ−1)-clean:
C09 try(ℓ−1)_i = false
C10 cs(ℓ−1)_i = false
C11 pc(ℓ−1)_i = B15
C12 return

owe-clean:
C13 Ei = ∅
C14 for j = 1 to n do
C15     for k = 1 to n do
C16         if owej[k] then
C17             Ei = Ei ∪ {(j, k)}
C18 Gi = ({1, ..., n}, Ei)    /* graph Gi represents the "owes" relation */
C19 for j = 1 to n do
C20     if (Gi has a cycle containing edge (i, j)) ∨ ¬yi[j] then
C21         owei[j] = false
C22 return

Fig. 2. Fair algorithm for process i, 1 ≤ i ≤ n, cleanup routines.
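The cycle test in owe-clean (lines C13–C22) is ordinary graph manipulation; the following is a minimal Python sketch of that routine only, not the paper's code (the helper names owes_graph and edge_in_cycle are invented, and indices are 0-based here).

```python
def owes_graph(owe):
    """Edges (j, k) such that owe[j][k] is true (the 'owes' relation)."""
    n = len(owe)
    return {(j, k) for j in range(n) for k in range(n) if owe[j][k]}

def edge_in_cycle(edges, i, j):
    """True iff edge (i, j) lies on a directed cycle, i.e. (i, j) exists and j can reach i."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    stack, seen = [j], set()
    while stack:
        v = stack.pop()
        if v == i:
            return (i, j) in edges
        if v in seen:
            continue
        seen.add(v)
        stack.extend(adj.get(v, []))
    return False

def owe_clean(i, owe, y):
    """Sketch of owe-clean for process i: clear owe[i][j] when the edge (i, j)
    lies on a cycle of the 'owes' graph or when y[i][j] is false (C19-C21)."""
    edges = owes_graph(owe)
    for j in range(len(owe)):
        if edge_in_cycle(edges, i, j) or not y[i][j]:
            owe[i][j] = False
```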

Our fair algorithm is presented in Figures 2 and 3. Figure 2 contains "cleaning" procedures that reset variables when a process is not simulating one or both copies of the base algorithm. Figure 3 contains the main part of the algorithm. During the period of concurrent simulation of both copies of the base algorithm, a process invokes the 1-clean procedure when it stops the simulation of the 1-exclusion copy of the base algorithm, and it invokes the (ℓ − 1)-clean procedure when it stops the simulation of the (ℓ − 1)-exclusion copy. The variables assigned by the clean procedures have an important role in implementing fairness. For instance, while some process k is not simulating the 1-exclusion copy of the base algorithm, we require that the try1 and cs1 variables of k are false, so that other processes simulating 1-exclusion are not blocked by process k.

Some of the lines in Figure 3 need explanation. The condition "(next step is not critical)" refers to a step of one copy of the base algorithm, either the 1-exclusion or the (ℓ − 1)-exclusion copy; the identity of the copy should be clear from context. We suppose that the simulation keeps track of program counters and allows examination of the next step to be executed. Two cases are of interest: if the next step is in the remainder, then the simulated copy's program counter corresponds to line B15 of the base algorithm, the remainder section; if the next step corresponds to line B11 of the base algorithm, where a process assigns csi = true, then this next step is "critical" (we also regard lines B11–B14 of the base algorithm as critical).
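The dispatch described above can be pictured in code; the following is a minimal Python sketch, not the paper's algorithm: the names sim, clean_other and clean_xy are invented, and sim is assumed to expose the simulated copy's program counter and a step() method.

```python
CRITICAL = {"B11", "B12", "B13", "B14"}   # lines regarded as critical (see text)

def next_step_is_critical(sim):
    """sim: one simulated copy of the base algorithm, exposing its program counter."""
    return sim.pc in CRITICAL

def visit_copy(sim, clean_other, clean_xy):
    """One visit to a simulated copy, mirroring the structure described above:
    take one non-critical step and return to the main loop, or, if the next step
    is critical, clean the other copy and the x/y variables, run through the
    critical section of this copy, and return to the start (remainder) section."""
    if not next_step_is_critical(sim):
        sim.step()
        return "loop"
    clean_other()   # 1-clean or (l-1)-clean, depending on the copy being visited
    clean_xy()      # the 'clean' routine: resets x_i and the y_j[i] entries
    while next_step_is_critical(sim):
        sim.step()
    return "start"
```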

For the fair algorithm, we define coherence to be coherence of the base algorithms together with additional constraints on the x and y variables, the program counter of the fair algorithm, and the simulated program counters. Due to space limitations we omit the precise specification of coherence; the details are straightforwardly derived by considering executions that start from a legitimate state. A coherent suffix is coherent at each state, and internal variable values are consistent with previous states in the suffix. We assume that a refresh procedure is repeatedly executed by every process; the refresh procedure is repeatedly executed while a process is executing its trying section, critical section and remainder section. In addition to the properties that the refresh procedures of the 1-exclusion and the (ℓ − 1)-exclusion copies establish, the refresh procedure of the fair algorithm (of process i) ensures the coherence properties for the x and y variables.

Lemma 7. Every computation of the fair algorithm contains a coherent suffix.

We define an ℓ-coherent suffix for the fair algorithm to be a suffix in which at most one process is executing the critical section of the 1-exclusion copy and no more than ℓ − 1 processes execute the critical section of the (ℓ − 1)-exclusion copy. A computation is non-settled if no process is executing the critical section forever.

Lemma 8. Every non-settled computation has an ℓ-coherent suffix.

Lemma 9. At most ℓ processes are simultaneously critical in any non-settled, ℓ-coherent computation suffix.

Proof. It follows from Lemma 3 and the definition of ℓ-coherence that at most ℓ − 1 processes are critical in the (ℓ − 1)-exclusion copy and at most one process is critical in the 1-exclusion copy.


Theorem 10. In any non-settled, ℓ-coherent computation suffix of the fair algorithm, any process not in the remainder section is eventually critical. Thus, the fair algorithm fulfills the fairness requirement, Requirement 2.3.

Proof. Assume towards a contradiction that there is a set T of processes such that every i ∈ T is never in the remainder section nor critical in some non-settled, ℓ-coherent computation. We first demonstrate the existence of a computation suffix with certain properties. Let S be the set of processes that are critical infinitely often during this computation. The theorem assumes a non-settled computation, so each process in S infinitely often exits and enters a critical section. Note that S is a set containing at most n − 1 processes. Let R be the remaining processes, the processes that execute the remainder section forever. Let owek|T denote, for any process k, the variables owek[j] for j ∈ T. Similarly, let yk|T denote the variables yk[j] for j ∈ T. Let owe|T denote the part of the owe relation restricted to processes in T, that is, owe|T is the subgraph induced by considering only processes in T from the graph of owe variables (the subgraph contains no vertices from S ∪ R). The non-settled computation has a suffix A so that, for i ∈ T, the variable xi is true at every state (after i first executes line F05 or invokes the refresh procedure for coherence) and the program counter for each process i ∈ T is permanently in the range F05–F29.

Therefore the values of the yi variables do not change by any assignment of process i in suffix A, for i ∈ T (changes to yi could still be possible due to clean-procedure execution by a process in S). Computation A, in turn, has a suffix B such that yj[i] is true at every state for i ∈ T and j ∈ S, because processes in S infinitely often execute line F04. Because each process in S infinitely many times executes the clean procedure, B has a suffix C so that yi[j] is false for i ∈ T and j ∈ S. The computation C thus has the property that no variable yi changes for i ∈ T, and no variable yj|T changes for j ∈ S, at any step. Because each process in T executes lines C20 and C21 infinitely often, and ¬yi[j] holds for i ∈ T and j ∈ S, C has a suffix in which every owei[j], for i ∈ T and j ∈ S, is false throughout. This implies that any subsequent call to owe-clean by any process j ∈ S constructs a graph Gj in which there is no cycle including an edge (j, i) for j ∈ S and i ∈ T; therefore, if owej[i] is true prior to a call to owe-clean, then owej[i] is unchanged by the call to owe-clean. Moreover, owej[i] is true throughout some suffix of C, because F02 copies owej from yj, and we established above that yj[i] is true throughout suffix C. Hence, C has a suffix D such that each variable owej[i] is true, for i ∈ T and j ∈ S, at each state.

The owei variables for i ∈ T change only by execution of the owe-clean procedure in computation D, and any such change is monotonic: owei[k] can only change from true to false for i ∈ T during computation D. Therefore the number of edges in owe|T does not increase during computation D. Furthermore, since the number of edges in owe|T cannot decrease below zero, D has a suffix E such that owe|T does not change at any step in E. We also observe that each process in T calls owe-clean infinitely often, which implies that the graph owe|T is acyclic throughout computation E. Every process r ∈ R repeatedly executes the refresh procedure, so E has a suffix F where xr is false at each state. We now claim that at each state of F, the condition

    #{ j | owei[j] ∧ yi[j] ∧ xj } = 0    (1)

does not hold if i ∈ S. The claim follows for F because T is non-empty, xj is continuously true for any j ∈ T, and similar properties hold of owei[j] and yi[j]. The counting construct of F18, although executed non-atomically, therefore obtains a non-zero count each time it executes in computation F. Therefore F has a suffix G in which no process in S can execute any of the lines F20–F29 (some process in S initially executing the critical section of the 1-exclusion copy eventually exits, by the assumption of a non-settled computation). We have shown of G that no process in S executes the 1-exclusion copy. It remains to show that some process in T eventually executes the 1-exclusion copy.

We claim that equation (1) holds for some i ∈ T throughout computation G. The claim is shown by contradiction. Suppose, for every i ∈ T, that equation (1) does not hold. Observe that yi[j] is false at every state of G, for every j ∈ S (by execution of the clean procedure) and every j ∈ R (by periodic refresh).

This observation reduces the evaluation of (1) to

    #{ j ∈ T | owei[j] ∧ yi[j] ∧ xj } = 0    (2)

If equation (2) fails to hold for all i ∈ T, it can only be that for each i ∈ T there exists some k ∈ T so that owei[k] is true. This implies that each vertex in the subgraph owe|T has outdegree at least one. But a directed graph in which every vertex has positive outdegree must contain a cycle, and this contradicts the fact, already established, that owe|T is acyclic in computation G. This contradiction proves that equation (2) holds for some i ∈ T throughout computation G. Since the relevant variables of (2) do not change value in suffix G, the test at F18 lets some process p ∈ T proceed; therefore p executes steps of the 1-exclusion protocol infinitely often during computation G. We conclude that eventually there is a non-empty set of processes in T that execute the 1-exclusion copy and no other process outside of T executes the 1-exclusion copy. The proof is completed by the non-settled property of the computation and Corollary 5. By the non-settled property the critical section of the 1-exclusion copy is eventually free, and by Corollary 5 a process in T enters the critical section of the 1-exclusion copy. But the definition of T precludes entry to the critical section, so we have contradicted the existence of a non-empty T.

Theorem 11. The fair algorithm is (ℓ − 1)-wait-free (the fair algorithm fulfills Requirement 2.5 with x = ℓ − 1).

Proof. For each simulated step in either copy of the base algorithm, a process executes at most O(n²) steps in the fair algorithm. All trying processes execute steps of the (ℓ − 1)-exclusion copy, which is (ℓ − 1)-wait-free by Theorem 6. Thus if at most ℓ − 1 processes are outside the remainder section, any trying process will become critical within a polynomial number of its own steps.

5 Concluding Remarks

This paper presented the first self-stabilizing algorithm for the ℓ-exclusion problem in the classic shared memory model. Our presentation and construction of the algorithm follow the tried-and-true methodology of separating concerns: safety, liveness, and fairness of the final algorithm are developed separately, each property layered upon the previous.

The identifier of a process can influence its priority in entering the critical section: the larger the identifier, the better the chances are to enter the critical section. A slight modification of our algorithm can almost nullify this "unfair" property: every process is assigned a set of identifiers, say process i is assigned {i, i + n, i + 2n, ..., i + kn}. Whenever a process leaves the remainder section, it chooses one identifier from its set at random and uses it in the algorithm (while the try and cs variables for the other identifiers in the set are set to false).

Our system model of shared memory relies on atomic read and write operations to registers for interprocess communication. In fact, we use a more restricted form of communication than required by our model: all the algorithms use only single-bit registers. We speculate that with slight modifications (e.g., avoiding unnecessary writes) a weaker model of shared communication could be used, namely read and write operations to safe (or regular) rather than atomic registers.
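The identifier-randomization idea sketched above is easy to state in code; the following is a small illustrative Python sketch (the function names are ours, not part of the algorithm), assuming process i owns the pool {i, i + n, ..., i + kn}.

```python
import random

def id_pool(i, n, k):
    """Identifiers reserved for process i: {i, i + n, i + 2n, ..., i + kn}."""
    return [i + m * n for m in range(k + 1)]

def pick_identifier(i, n, k):
    """Called when process i leaves the remainder section: one identifier from the
    pool is chosen uniformly at random; the try and cs variables associated with
    the unused identifiers are kept false."""
    return random.choice(id_pool(i, n, k))
```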



start:  /* invoke cleanup and remainder section */
F01 remainder section
F02 for j = 1 to n do owei[j] = yi[j]
    /* record all xj in the yi vector */
F03 for j = 1 to n do
F04     yi[j] = xj

loop:   /* simulate one or both copies */
F05 xi = true; yi[i] = false
F06 bi = ¬bi    /* switch copy to simulate */
F07 if bi then
        /* bi → simulate (ℓ−1)-exclusion */
F08     if (next (ℓ−1)-exclusion step is not critical) then
F09         execute next step of (ℓ−1)-exclusion
F10         goto loop
F11     else    /* critical in (ℓ−1)-exclusion */
F12         call 1-clean
F13         call clean
F14         while (next (ℓ−1)-exclusion step is critical) do
F15             execute next step of (ℓ−1)-exclusion
F16         goto start
    else
        /* ¬bi → possibly simulate 1-exclusion */
F17     call owe-clean
F18     if #{ j | owei[j] ∧ yi[j] ∧ xj } ≠ 0 then
F19         call 1-clean    /* skip 1-exclusion */
F20         goto loop
F21     if (next 1-exclusion step is not critical) then
F22         execute next step of 1-exclusion
F23         goto loop
F24     else    /* critical in 1-exclusion */
F25         call (ℓ−1)-clean
F26         call clean
F27         while (next 1-exclusion step is critical) do
F28             execute next step of 1-exclusion
F29         goto start

Fig. 3. Fair algorithm for process i, 1 ≤ i ≤ n, simulations.

On FTSS-Solvable Distributed Problems

Joffroy Beauquier, Synnove Kekkonen-Moneta

Université de Paris-Sud, LRI CNRS, Bât. 490, 91405 Orsay cedex, FRANCE.
Email: {jb,kekkonen}@lri.lri.fr
Research supported by a fellowship from the Academy of Finland.

Abstract. We investigate which distributed problems can be solved with so-called k-ftss protocols. K-ftss protocols combine two types of failure resilience: k-fault-tolerance (k-ft), i.e., resilience to up to k process failures, and self-stabilization (ss), i.e., resilience to arbitrary memory and message corruption. We show that if a problem is k-fault-sensitive on a (j, k)-restrictable process network, then there exists no k-ftss protocol for solving the problem on that network. When either condition does not hold, there can be a k-ftss solution. We present several 1-ftss protocols for rings. We first propose a randomized solution to 2-COL on anonymous rings of even size. We then present a generic deterministic 1-ftss protocol that can be used to produce 1-ftss solutions to 2-COL, orientation, and non-trivial eventual consensus on rings where processes have identifiers.

1 Introduction

In the framework of [2], any failure, whatever its cause, is represented as an event that corrupts the global system state, and a protocol is failure resilient if it brings the system back to consistent functioning once failures stop occurring. Furthermore, if the protocol does not allow the system to exhibit incorrect behavior while failures occur or while the system recovers from failures, then the protocol is masking; otherwise it is non-masking. So-called fault-tolerant (ft) protocols are usually masking failure resilient, and self-stabilizing (ss) protocols are non-masking failure resilient. Traditionally, these two types of failure resilience have been developed separately. In fault-tolerance, one assumes that up to k processes may be faulty, i.e., they do not follow their correct program code. The resulting incorrect behavior can vary from relatively benign crash faults, where a faulty process prematurely stops all activity, to arbitrary (Byzantine) behavior. Fault-tolerant protocols suspect all incoming information and guarantee legal behavior of correct processes by preceding each step with a sufficient number of checks. In self-stabilization, one assumes that all processes follow a correct program code but messages and working memories (i.e., the configuration) can be corrupted arbitrarily. Self-stabilizing protocols allow processes to behave inconsistently but guarantee an automatic return to legal behavior from any arbitrary configuration after all corruption ceases.

Self-stabilization was introduced by Dijkstra in [7]: he proposed a distributed mutual exclusion protocol for rings such that, given any number of processes initially and simultaneously in their critical section, the protocol guarantees that in a finite time exactly one process at a time enters its critical section. [12] presents a formalization of the theory of system stabilization and gives references to stabilizing solutions for various distributed problems.

Few protocols are simultaneously fault-tolerant and self-stabilizing (ftss). Gopal and Perry [11] demonstrate how a round-based protocol resilient to crash faults can be transformed into an ftss protocol in a synchronous environment. Dolev and Welch [10] propose ftss solutions, under different synchrony hypotheses, to clock synchronization on networks where up to one third of the processes are Byzantine. Buskens and Bianchini [6] design an ftss mutual exclusion protocol conditional on several hypotheses on the behavior of the faulty (Byzantine) processes and communications. Also [4] assumes a form of synchrony on communications and then develops a 1-ftss size protocol for rings subject to one crash fault and where the processes have access to self-stabilizing failure detectors. Anagnostou and Hadzilacos [1] propose a 1-ftss unique naming protocol for assigning identifiers to processes on asynchronous rings subject to one crash fault. Masuzawa [13] sets hypotheses on processes' knowledge of the network and proposes a k-ftss protocol for finding the topology of (k+1)-connected asynchronous networks subject to k crash faults. [3] focuses on asynchronous rings subject to one crash fault, and proposes two 1-ftss orientation protocols for assigning system-wide consistent labels left and right to the communication links.

We investigate which distributed problems can be solved with k-ftss protocols on fully asynchronous networks subject to up to k crash faults. To our knowledge, only three protocols have been proposed for such networks: [1], [3], and [13]. Anagnostou and Hadzilacos [1] showed that failure-sensitive problems do not allow 1-ftss solutions and that the size problem, while not failure-sensitive, cannot be solved with a 1-ftss protocol on asynchronous rings subject to one crash fault. Developing from [1], we identify a sufficient condition for concluding that a given problem cannot be ftss-solved on a given process network. Our impossibility result states that if a problem is k-fault-sensitive on an asynchronous (j, k)-restrictable process network then it cannot be k-ftss solved on that network. In a (j, k)-restrictable network of size n, there are j ≥ k processes such that if they are extremely slow, then the n − j processes cannot know whether the slow section consists of j slow processes or k crashes. A problem is k-fault-sensitive on a network of size n if the following conditions apply to its solutions. First, when the problem is solved, if j processes appear as k crashed processes (j ≥ k), then the problem is no longer solved from the point of view of the n − j processes. Second, all solutions reached by the n − j processes, while assuming that the j slow processes correspond to k crashes, are not correct. The impossibility result follows from observing that a k-ftss protocol solving a problem that is k-fault-sensitive on a (j, k)-restrictable asynchronous network can make the processes oscillate endlessly between correct and incorrect solutions.


If a problem is not k-fault-sensitive, or if a problem is k-fault-sensitive but the network is not (j, k)-restrictable, then there can be a k-ftss solution. For example, size is k-fault-sensitive on (j>k, k)-restrictable networks, but not on networks that are only (j=k, k)-restrictable (e.g., cliques). Unique naming, c-coloring (i.e., given c different colors, each process has a color different from its neighbors), orientation, and non-trivial eventual consensus (i.e., correct processes eventually hold a common value in some determined variable, and there are two different runs giving consensus on different values) are not fault-sensitive problems.

In the second part of the paper, we propose several 1-ftss protocols for asynchronous rings. We first consider anonymous rings and design a randomized 1-ftss protocol for 2-coloring (2-COL). If processes have identifiers then 2-COL, orientation, and non-trivial eventual consensus can all be solved with a deterministic protocol derived from our generic 1-ftss protocol. The generic protocol is based on the following heuristic: a correct process p fixes its solution to the problem at hand and the others compute solutions that are consistent w.r.t. (with respect to) the solution of p.

In the next section we give definitions and notations. In Section 3 we develop our impossibility result. In Section 4 we present the randomized 2-COL protocol. The generic deterministic 1-ftss protocol, as well as three protocols derived from the generic protocol, are presented in Section 5. Section 6 concludes the paper.

2 Definitions and notations

A distributed system S = (N, A) consists of a process network N and a protocol A. The process network is presented by an undirected graph N = (P, L) where the vertex set P is the set of processes and the edge set L is the set of bidirectional communication links connecting the processes. The network N belongs to a family of networks N whose instances share the same properties concerning: (i) anonymity of processes, (ii) implementation of communication links, (iii) the topology of the communication graph, and (iv) mode of computation. The networks of N are anonymous if processes do not have identifiers. Communication links are implemented either with two-way, FIFO communication channels or with shared memories. In the first case, processes communicate by message passing. In the second case, neighboring processes p and p' communicate via two link registers: p writes in its register and p' reads from it, and p' writes in its register and p reads from it. The topology of the networks N = (P, L) in N describes the structure of the communication graph induced by P and L. In ring topology, each process has two neighbors. If computation on N is asynchronous then no assumptions can be made on relative process speeds or communication delays; in the synchronous mode such assumptions can be made.

The protocol A is a collection of algorithms, one for each process. A protocol is uniform if all processes have the same algorithm. An algorithm is presented as a set of rules (Condition): Action. If Condition evaluates to true in a process p, then p can be activated to complete the instructions of Action, i.e., p can complete an

atomic algorithm step. If some instructions in Action are subject to probabilistic choices, then the protocol is randomized, otherwise it is deterministic.

A configuration C of a system S is a vector of process states and states of communication links. A process state is an assignment of values to the variables used by the algorithm. The state of a communication link is an assignment of values to the link registers or to the messages transiting in the link. The (possibly infinite) set C contains all configurations of a system S. An execution of a system S is an infinite sequence of configurations and algorithm steps by processes. An execution segment is a finite subsequence of configurations and steps.

The scheduling of algorithm steps in an asynchronous computational environment is modeled with a distributed demon ([5], [8]). Given the processes for which one or more Conditions evaluate to true, the demon chooses which processes and algorithm steps to activate. The activated processes complete their steps simultaneously and without interruption. Our protocols use composite atomicity [8], where a process can receive and transmit a message in the same atomic step. A fair demon guarantees that if some process can execute an algorithm step infinitely often, then eventually that process and step will be activated. A bounded demon guarantees that if two processes can repeatedly take an algorithm step, there is an upper bound, noted D, on the number of consecutive activations of one without activating the other. Unless explicitly stated, we assume that the demon is distributed and fair but not bounded.

The family of systems S = (N, A) can be subject to process failures and systemic failures. A systemic failure corrupts the states of one or more processes and/or communication links, while a process failure occurs when a process deviates from its algorithm. As process failures we consider only crash faults, where a process follows its algorithm correctly up to a point and then prematurely stops all action in a crash step. The letter k denotes the upper bound on the number of crash faults that can occur in an execution. We underline that in the asynchronous mode of computation it is impossible, in a bounded delay, for the processes to distinguish between a process that is very slow and a crashed process. We associate with each process a variable that indicates whether the process is crashed or not. This variable is not accessible to the process and cannot be used in any protocol. However, it allows us to identify the set Pf of faulty processes in a configuration C. C|Pf denotes that in configuration C the processes Pf are faulty.

The protocol A is designed to solve a distributed problem Π on a family of networks N. Π specifies the legal behavior of processes in an execution. In general, a problem is specified as a predicate Π on executions, and it includes a specification for each case where a subset of up to k processes is faulty. Executions where Π is satisfied are called legal, others are called illegal. We focus on problems where the correct behavior of processes can also be defined as a predicate on configurations and sets of faulty processes. Configurations where Π is satisfied are legitimate, others are illegitimate. The legitimacy of a configuration C|Pf is evaluated with respect to correct processes' variables and the communication links accessible to correct processes.
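As a purely illustrative rendering of this computational model (none of the names below come from the paper), a distributed demon can be pictured as a loop that activates subsets of enabled guarded rules; fairness and boundedness are constraints on the resulting infinite schedule rather than on a single step.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    condition: Callable[[], bool]   # the (Condition) part of a rule
    action: Callable[[], None]      # the Action part of a rule

def demon_step(rules_by_process: Dict[str, List[Rule]]) -> bool:
    """One activation by a distributed demon: among the processes with an enabled
    rule, choose a non-empty subset and let each chosen process complete one
    enabled rule as an atomic step. Returns False when nothing is enabled."""
    enabled = {p: [r for r in rules if r.condition()]
               for p, rules in rules_by_process.items()}
    enabled = {p: rs for p, rs in enabled.items() if rs}
    if not enabled:
        return False
    chosen = random.sample(list(enabled), random.randint(1, len(enabled)))
    for p in chosen:
        random.choice(enabled[p]).action()
    return True
```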

Definition 1. Let N be a family of networks and Π a problem specification. The family of systems S = (N, A), subject to systemic failures and up to k crash faults, is simultaneously k-fault-tolerant and self-stabilizing (k-ftss) for Π if, for every system S = (N, A) ∈ S,
1. every execution of S starting from any configuration C|Pf (Pf ⊆ P, |Pf| ≤ k) leads to a legitimate configuration C'|P'f (Pf ⊆ P'f ⊆ P, |P'f| ≤ k) (convergence), and
2. every execution of S starting from any legitimate configuration C'|P'f guarantees that all following configurations C''|P''f (P'f ⊆ P''f ⊆ P, |P''f| ≤ k) are legitimate (closure).

If the protocol is randomized, convergence is achieved with probability 1 and closure is deterministic.

3 Impossibility result

In order to develop our impossibility result, we define several properties of the process networks.

Definition 2. Given a network N = (P, L) and its subnetwork N' = (P', L') (N' ⊂ N), the multiset of border processes of N w.r.t. N' contains those processes of P \ P' that have a neighbor in P'.

The multiset of border processes guarantees that the border processes of the networks N and N' of the following definition have the same number of communication links:

Definition 3. A (j, k)-restriction of a network N = (P, L) is a network N' = (P', L') obtained from N by replacing a subnetwork Nj ⊂ N (Nj = (Pj, Lj), |Pj| ≥ k) with a network Nk = (Pk, Lk), for a minimal k such that |Pk| = k, so that the set of border processes is the same in N w.r.t. Nj and in N' w.r.t. Nk.

The intuition of the above definition is that a part of N can be replaced with another network in such a way that the non-replaced processes cannot know whether the new network is N or N' only by looking at their communication links. The (j, k)-restrictability of an asynchronous process network N = (P, L) has the following implication. If the Pj processes are extremely slow, then the processes P \ Pj cannot know, based on their communication links, whether the slow section consists of Pj slow processes or Pk crashed processes.

Definition 4. A family of networks N forms a (j, k)-restrictable family of networks iff there is a network N in N having a (j, k)-restriction N' in N.

Definition 5. A problem Π is k-fault-sensitive on a family of networks N iff there is a network N ∈ N such that whenever Π is solved on N there is a (j, k)-restriction N' ∈ N of N for which the following conditions hold:

1. if the processes of N' \ Nk behave as the processes of N \ Nj, and if the processes Pk are faulty on N', then Π is not solved on N', and
2. when Π is solved on N', where the Pk processes are faulty, if the processes of N \ Nj behave as the processes of N' \ Nk, and if the behavior of the processes of Nj is unchanged, then Π is not solved on N.

The above definition has two implications for any protocol A that tries to solve a k-fault-sensitive problem Π on a (j, k)-restrictable network N. First, given any legitimate configuration C|∅ of the system S = (N, A), there is a set of processes Pj such that if those processes appear as Pk crashed processes, then C is illegitimate. Second, all configurations C' that appear legitimate on S, if the Pj processes correspond to Pk crashed processes, are in fact illegitimate.

Theorem 6. If a distributed problem Π is k-fault-sensitive on a family of asynchronous, (j, k)-restrictable networks N, then there is no k-ftss protocol for solving Π on N (k > 0).

Proof. Assume that a k-ftss protocol A exists. Thus, there is a family of systems S = (N, A) where N is a (j, k)-restrictable family of networks and the computation on networks N ∈ N is asynchronous. We select a system S = (N, A) ∈ S such that the definition of a k-fault-sensitive problem applies to N, and construct an execution of S which keeps the system from stabilizing due to (a) the (j, k)-restrictability of N, and (b) asynchrony.

Thus, given S = (N, A) and a legitimate configuration Ch|∅ of S, there is a system S' = (N', A) ∈ S whose process network N' = (P', L') is a (j, k)-restriction of the process network N = (P, L) of S. The configuration C'h|Pk on S', where the processes P' \ Pk and the links L' \ Lk are in the same states as the processes P \ Pj and the links L \ Lj in Ch|∅, is illegitimate by part (1) of the definition of a k-fault-sensitive problem. The execution of the system S' = (N', A), starting from the illegitimate configuration C'h|Pk, brings the system to a legitimate configuration C'i|Pk.

Consider then an execution of the system S = (N, A) where all processes are correct. Once S has reached a legitimate configuration Ch|∅, the demon completely slows down the execution in the subnetwork Nj = (Pj, Lj). The processes P \ Pj and the links L \ Lj are in exactly the same states as the processes P' \ Pk and the links L' \ Lk in C'h|Pk. Since N is (j, k)-restrictable, the processes in P \ Pj cannot distinguish whether they are executing the protocol A on network N or on network N'. In other words, as far as the processes in P \ Pj are concerned, the current configuration can be either (the illegitimate) C'h|Pk or (the legitimate) Ch|∅. The protocol A, as reasoned above, brings the processes P \ Pj and links L \ Lj to states where they are as in C'i|Pk. Meanwhile, the slowed-down processes Pj and the links Lj remain as in Ch|∅. This new configuration Ci|∅ on S is illegitimate by part (2) of the definition of a k-fault-sensitive problem. Since the demon can repeatedly enable and disable the execution in selected sections of N, the system S will never stabilize but will oscillate between legitimate and illegitimate configurations.

Note that the above result holds even if the protocol is randomized and if processes have identifiers, and it is insensitive to how communication links are implemented.

3.1 Fault-sensitive and insensitive problems

Size is ft-solvable and ss-solvable. Anagnostou and Hadzilacos [1] showed that there is no 1-ftss protocol for solving ring size. By the following theorem, size is not k-ftss solvable on any (j>k, k)-restrictable family of networks:

Theorem 7. Size is k-fault-sensitive on any (j>k, k)-restrictable family of networks.

Proof. Consider a system S = (N, A) ∈ S = (N, A) where N is a (j>k, k)-restrictable family of networks. We select N = (P, L) so that it has a (j>k, k)-restriction N' = (P', L') obtained from N by replacing Pj processes, |Pj| > k, with Pk processes, |Pk| = k. In any legitimate configuration Ch|∅ on S, all processes are in states where size is |P| = m. Consider then the system S' = (N', A). Given a legitimate configuration Ch|∅ on S, the configuration C'h|Pk on S', where the processes P' \ Pk are in states where size equals m, is illegitimate because the size of S' is smaller than m, say |P'| = n. On the other hand, given a legitimate configuration C'i|Pk on S', the configuration Ci|∅ on S, where the processes P \ Pj have "size equals n" while the processes Pj have "size equals m", is illegitimate.

In the c-colorability problem processes decide whether c-coloring is possible given c different colors. Ring 2-colorability is ft-solvable and ss-solvable but not ftss-solvable because it is fault-sensitive:

Theorem 8. 2-colorability is 1-fault-sensitive on the family of rings.

Proof. Without loss of generality, consider an odd size ring in a legitimate configuration (i.e., 2-COL impossible). The (j, 1)-restriction of the ring, where j consecutive processes (j = |Pj| even) are replaced with one faulty process, is in an illegitimate configuration if correct processes are in states indicating that 2-COL is impossible. On the other hand, given a legitimate configuration on the (j, 1)-restriction (i.e., 2-COL possible), the configuration on the original ring where the processes P \ Pj are in states indicating that the ring is 2-colorable while the processes Pj are in opposite states, is illegitimate.

Unique naming, orientation, and non-trivial eventual consensus are fault-insensitive problems on all networks. Any configuration where either unique naming or orientation or non-trivial eventual consensus holds remains legitimate given any number of faulty processes. C-coloring is fault-insensitive on any network that is always c-colorable. For example, 2-COL is not 1-fault-sensitive on rings of even size: the crash of one or more processes does not make a legitimate configuration (i.e., any 2-coloring) illegitimate.

4 Randomized 1-ftss 2-COL

In this section we consider the 2-COL problem on anonymous, asynchronous, even size rings. We first conclude that if the ring is anonymous then a 2-COL protocol cannot be deterministic. We then propose a randomized 2-COL protocol. (In Section 5 we present a deterministic protocol for even size rings with identifiers.)

The impossibility of deterministic 2-COL of anonymous even size rings follows directly from the scheduling strategy of the distributed demon where all processes take steps in lockstep. Consider an illegitimate configuration where all correct processes are in identical states (e.g., all processes have color 0). Since the protocol is deterministic, each correct process executes the same step when activated. Consequently, the processes are still in identical states in the resulting illegitimate configuration (they all have color 1). The impossibility of deterministic 1-ftss ring 2-COL results from the incapacity of any such protocol to break the symmetry of the ring, or of the line of processes if one process has crashed. A randomized protocol, instead, has this capacity, since the same algorithm steps by different processes result, with positive probability, in non-identical new process states due to the probabilistic choices made by the processes.

In our uniform randomized 2-COL protocol, processes decide to change their coloring based on a coin toss; the algorithm that each process executes is presented in Figure 1. The instructions for 2-COL are presented together with the link register protocol [1], which guarantees that the registers of a crashed process, and thus any permanently corrupted information, are accessed at most once.


Clrp      p's color, in {0, 1}
Regi      p writes Regi.S, Regi.R and Regi.Clr, i ∈ {0, 1}
Reg'i     p reads Reg'i.S, Reg'i.R and Reg'i.Clr, i ∈ {0, 1}

(Regi.S = I ∧ Reg'i.R = I):  Regi.S := S; Regi.Clr := Clrp
(Regi.S = S ∧ Reg'i.R = R):  Regi.S := I
(Regi.R = I ∧ Reg'i.S = S):  Regi.R := R;
                             IF (Reg'i.Clr = Clrp ∧ Random({false, true})) THEN Clrp := (Clrp + 1) mod 2
(Regi.R = R ∧ Reg'i.S = I):  Regi.R := I

Fig. 1. Randomized 1-ftss algorithm for ring 2-COL.
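Read operationally, each rule of Figure 1 is a guarded action over one pair of link registers. The sketch below is a rough Python rendering of one activation for one link, not the paper's code: registers are modelled as plain dictionaries with fields 'S', 'R' and 'Clr', and the colour lives in a one-entry state dictionary.

```python
import random

IDLE, SEND, RECV = "I", "S", "R"

def step(my_reg, their_reg, state):
    """One activation of the Figure 1 rules on one link.
    my_reg is the register this process writes; their_reg is the neighbour's
    register that it reads; state['Clr'] holds this process's colour."""
    if my_reg["S"] == IDLE and their_reg["R"] == IDLE:
        my_reg["S"] = SEND
        my_reg["Clr"] = state["Clr"]                 # offer the current colour
    elif my_reg["S"] == SEND and their_reg["R"] == RECV:
        my_reg["S"] = IDLE                           # sender closes the handshake
    elif my_reg["R"] == IDLE and their_reg["S"] == SEND:
        my_reg["R"] = RECV                           # acknowledge the reception
        if their_reg["Clr"] == state["Clr"] and random.choice([False, True]):
            state["Clr"] = (state["Clr"] + 1) % 2    # flip with probability 1/2
    elif my_reg["R"] == RECV and their_reg["S"] == IDLE:
        my_reg["R"] = IDLE                           # receiver returns to idle
```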

The link register protocol organizes the communications as follows. Neighboring processes p and p' share two registers Reg and Reg' that have two bits for the protocol: the bit S for sending, with values "idle" (I) and "sending" (S), and the bit R for receiving, with values "idle" (I) and "received" (R). p can communicate to p' when Reg.S = I and Reg'.R = I. p writes its message to the register Reg and sets Reg.S to S. p' reads register Reg and acknowledges by setting Reg'.R to R. p terminates the communication by setting Reg.S to I, after which p' returns to the idle state by setting Reg'.R to I.

The instructions for 2-COL are as follows. Each process p has a variable Clrp that takes values in {0, 1}. Each correct process repeatedly tells Clrp to both neighbors. If p detects that Clrp is also the color of a neighbor, then p either changes or keeps Clrp, each with probability 1/2.

The proof of the protocol is based on the assumption that the demon is bounded. For the proof, let S = (N, A) denote a family of systems where (i) N is the family of anonymous, asynchronous, even size rings where processes communicate via link registers, (ii) A is the protocol obtained by making each process execute the algorithm of Figure 1, and (iii) systems S ∈ S are subject to systemic failures and one crash fault.

Lemma 9. S satisfies closure for 2-COL.

Proof. We demonstrate closure by showing that once a system S ∈ S is in a configuration where correct processes can no longer change their colors, the system has stabilized to a 2-coloring. Consider a ring S in a configuration C such that, in all possible following configurations C', the correct processes are colored as in C. The registers that a correct process p can read have a color different from Clrp. On the other hand, p's registers directed toward other correct processes hold Clrp (otherwise p's writing Clrp to a register toward p' would cause p' to change Clrp'). Thus, correct processes' registers directed toward other correct processes hold the color of the owner process and, furthermore, all colors that a correct process can read are different from its own color. Thus, the ring is 2-colored.

In the next lemma, we use the notion of a communication round (or simply, a round), which is a minimal execution segment in which each correct process has received a communication at least once from each correct neighbor.

Lemma 10. S satisfies convergence for 2-COL in expected O(n2^{(D+1)n}) rounds, where n is the number of processes and D is the bound of the demon.

Proof. To establish the upper bound for convergence, we use the technique of scheduler-luck games [9]. An execution is a game between the demon (called scheduler in [9]) and luck. The demon tries to keep the system from converging, while luck tries to help the system converge; whenever a process tosses a coin, luck can fix the outcome of the toss. Dolev, Israeli, and Moran show that if luck has a winning strategy for the game in an expected number of at most r rounds and with f interventions, then the protocol converges within r2^f expected rounds. Thus, we need to find an (f, r)-strategy for luck.

The strategy for luck is as follows. When the first coin toss occurs, luck fixes the final 2-COL. Call the first tosser p0; for p0, luck sets Random({false, true}) to false so that p0 keeps its color. In the worst case the ring p0, p1, ..., pn−2, pn−1 is in the following initial configuration (worst in the sense that it requires the

maximal number of interventions by luck in the maximal number of rounds): processes p0 and p1 have the same color, and the rest of the ring is consistently colored w.r.t. p1, except for the process pn−1, which has crashed and has arbitrary colors in its registers.

In the round that starts from p0's toss, p0 may read D times the color of p1 (and toss a coin) before p1 reads the color of p0, tosses a coin, and changes its color under luck's intervention. The limit D exists because the demon is bounded. Note that in each of the D times when p0 reads Clrp1, luck intervenes. Luck may intervene an additional time if the register of pn−1 is accessible to p0 and causes p0 to toss. Thus, in the first round luck intervenes at most D + 1 + 1 + 1 times in order to color p0 and p1. In the round that starts from p1's toss, p1 may read D times the color of p2 (and toss a coin) before p2 reads the color of p1, tosses a coin, and changes its color under luck's intervention. Thus, luck intervenes D + 1 times in the second round. The game continues this way until only pn−2 is badly colored. In the round n − 2 that starts from the toss of pn−3, luck may intervene D times: pn−3 reads the color of pn−2 (and tosses a coin) D times before pn−2 reads the color of pn−3, tosses a coin, and changes its color under luck's intervention. In the round n − 1 that starts from the toss of pn−2, luck may need to intervene once more if the register of pn−1 is accessible to pn−2 and causes pn−2 to toss. Thus, the strategy requires at most n − 1 rounds and (n − 2)(D + 1) + 3 interventions by luck. Consequently, the convergence of the protocol is bounded from above by O(n2^{(D+1)n}) rounds.


By closure (Lemma 9) and convergence (Lemma 10), we have:

Theorem 11. S is 1-ftss for 2-COL in the presence of a bounded demon.
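As a quick sanity check on the bound in Lemma 10, the strategy in the proof uses r = n − 1 rounds and f = (n − 2)(D + 1) + 3 interventions, and the scheduler-luck theorem gives r·2^f expected rounds; the tiny sketch below just evaluates that expression (the function name is ours).

```python
def expected_round_bound(n, D):
    """Expected-round bound following the (f, r)-strategy counted in the proof of
    Lemma 10: r = n - 1, f = (n - 2)(D + 1) + 3, bound = r * 2**f."""
    r = n - 1
    f = (n - 2) * (D + 1) + 3
    return r * 2 ** f

# For example, a 4-process ring with demon bound D = 1 gives 3 * 2**7 = 384 rounds.
```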

5 A generic deterministic 1-ftss protocol

Ring 2-COL is not 1-fault-sensitive on even size rings, and orientation and non-trivial eventual consensus are not 1-fault-sensitive on rings in general. Furthermore, these problems can all be solved with the following heuristic: if one correct process locally, independently of the states of other processes, fixes its solution to the problem, and if the other correct processes converge to solutions that are consistent w.r.t. that process, then the problem is solved on the ring.

Let Π denote the specification of any problem that is not 1-fault-sensitive on rings and that can be solved with the above heuristic. The problem Π can be solved with a protocol derived from our generic deterministic 1-ftss protocol, which is designed for the family of asynchronous rings where processes have unique identifiers and where processes communicate by message passing. (If communication is via link registers then the link register protocol should be modified so that processes can both receive and forward a message in an atomic step.) The algorithm that the processes execute is presented in Figure 2. A process p can locally distinguish between its communication links, and knows its identifier idp.

p has variable(s) that hold p's solution to Π and other variable(s), including
List    list of entries (Id, Prm1, ..., Prmi)
Link    link through which List arrived (messages are forwarded on the other link)

(true):
    compose [(idp, Prmp1, ..., Prmpi)] and send it to both neighbors

(List arrives via Link):
    IF idp ∉ {Id1, ..., Idj} THEN
        find (Id, Prm1, ..., Prmi) with biggest Id
        IF Id > idp THEN
            solve Π consistently w.r.t. process Id
            compose (idp, Prmp1, ..., Prmpi)
            send (idp, Prmp1, ..., Prmpi) || List to the other link

Fig. 2. Generic deterministic 1-ftss algorithm for rings.
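The receive rule of Figure 2 translates almost directly into code. The following is a minimal Python sketch, not the paper's program: solve_consistently and forward are parameters standing in for the problem-specific computation and for delivery on the other link, and p is assumed to carry an id and a solution tuple.

```python
def on_list_arrival(p, List, arrival_link, solve_consistently, forward):
    """Sketch of the receive rule of Fig. 2 for process p.
    List is a sequence of entries (Id, Prm1, ..., Prmi), newest entry first."""
    ids = [entry[0] for entry in List]
    if p.id in ids:
        return                               # message has circulated the ring: discard
    best = max(List, key=lambda e: e[0])     # entry with the biggest Id
    if best[0] <= p.id:
        return                               # no larger identifier: discard
    p.solution = solve_consistently(best, List, arrival_link)   # problem-specific step
    own_entry = (p.id, *p.solution)
    forward([own_entry] + list(List))        # concatenate (||) and send on the other link
```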

Each correct process repeatedly sends a message [(idp, Prmp1, ..., Prmpi)] to both neighbors. The parameters Prmp1, ..., Prmpi are specific to the problem Π and hold p's solution to Π. A message tries to circulate the ring, and lists the ids of the processes that forward it. When p receives a message containing a List [(Id1, Prm11, ..., Prm1i), (Id2, Prm21, ..., Prm2i), ..., (Idj, Prmj1, ..., Prmji)], it checks whether idp appears among the identifiers Id1, ..., Idj. If so, p discards the message because it has circulated the ring. Otherwise, p finds the biggest Id of the message. If Id ≤ idp then, again, p discards the message. If instead Id > idp, then p proceeds to the problem-specific computation, where it can use any information of List and of the channel through which the message arrived. More specifically, p: (1) resolves the problem Π so that the new solution is consistent w.r.t. the solution of process Id, and (2) concatenates (||) the item (idp, Prmp1, ..., Prmpi) to the head of List (without otherwise modifying List) and forwards the message. Note that the correct process with the biggest id either does not change the solution that it had in the initial configuration or changes its solution a finite number of times, because it receives messages that (a) are corrupted and hold (non-existing) big ids or (b) were sent by a bigger-id process before it crashed.

For the proof of the protocol, let S = (N, A) denote the family of systems where (i) N is the family of asynchronous rings where processes have identifiers and communicate by message passing, (ii) A is the protocol obtained by making each process execute the algorithm of Figure 2, and (iii) systems S ∈ S are subject to systemic failures and one crash fault.

Lemma 12. S satisfies closure for Π.

Proof. Assume that the lemma does not hold. That is, on a system S ∈ S that is in a legitimate configuration Ch|Pf, there is a correct process p that changes its solution to Π.

Process p changes its solution to Π only if it receives a List [(Id1, Prm11, ..., Prm1i), (Id2, Prm21, ..., Prm2i), ..., (Idj, Prmj1, ..., Prmji)] with an identifier Idq ∈ {Id1, ..., Idj} such that Idq > idp, and if p's new solution to Π, based on the parameters (Idq, Prmq1, ..., Prmqi), is different from its old solution. However, the existence of such a List contradicts the assumption that the configuration Ch|Pf was legitimate.

Convergence is proved through a three-step convergence stair. The following lemma is repeatedly used in the convergence proof, and it follows directly from the fairness of the demon and the fact that communication channels are FIFO.

Lemma 13. Given a configuration Ch of a system S ∈ S with a message m transiting toward a correct process p in a communication link connecting processes p and q, the process p will eventually receive m.

Lemma 14. A system S ∈ S converges to a configuration where each List contains only identifiers of correct processes.

Proof. This convergence results from how old messages are eliminated. When a process p sees a message where all Ids are smaller than idp, p absorbs it. Thus, there is a point when only messages with big ids circulate in the ring. When p forwards a message, p adds idp to List. If p sees a message for a second time, i.e., idp appears in List, then p absorbs the message. If no other process eliminates a message and it does not disappear into a communication link toward a crashed process, then it returns to p by (i) Lemma 13 and (ii) the fact that processes forward all messages containing an id bigger than theirs. In sum, old messages either traverse the ring (and get absorbed) or disappear into a communication link toward a crashed process. Since only correct processes generate messages, there is a point when all old messages and messages sent by a crashed process have been eliminated, and only new messages circulate in the ring. These messages contain the id of the sender and the ids of the processes that forwarded the message.

Lemma 15. Once Lemma 14 holds on a system S ∈ S, the correct process with the biggest identifier no longer changes its solution to Π.

Proof. By Lemma 14, only messages with ids of correct processes transit in the ring or in the line of processes. Thus, when the process with the biggest id, say p, receives a message, the Ids of List are either smaller than or equal to idp. Consequently, p discards the message without modifying its solution to Π.

Lemma 16. Once Lemmas 14 and 15 hold, a system S ∈ S converges to Π, that is, S satisfies convergence for Π.

PROOF. This lemma follows from the observation that the correct processes stabilize one by one, in descending order of ids, to the solution that is consistent w.r.t. the correct process with the biggest id.

Let p_0 be the correct process with the biggest id. When Lemmas 14 and 15 hold, there can still be messages containing different solutions to Π by p_0. These messages will eventually disappear. By (i) Lemma 13 and (ii) the fact that correct processes forward each message with an id bigger than theirs, the messages either return to p_0 or are forwarded toward a faulty process. Thus, there is a point when, in each List where id_{p_0} is present, the solutions to Π by p_0 are identical. Once each List with id_{p_0} holds the same solution, the correct process with the second biggest id, say process p_1, stabilizes when it receives a message with id_{p_0} (p_1 resolves Π each time a message contains id_{p_0}, but the solution by p_1 is always the same since the protocol is deterministic). At this point, there can still be Lists containing different solutions to Π by p_1 and where id_{p_1} is the biggest id. These messages will eventually disappear. By (i) and (ii), they either return to p_1, are absorbed by the process p_0, or are forwarded toward a faulty process. Then, the correct process with the third biggest id solves Π consistently w.r.t. p_0 and p_1. Then the correct process with the fourth biggest id solves Π, then the fifth, then the sixth, etc., until all correct processes have a solution that is consistent w.r.t. all other correct processes and no List present in the ring leads to a different solution. □

By closure (Lemma 12) and convergence (Lemmas 14, 15, and 16), we have:

Theorem 17. S is 1-ftss for Π.
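To make the message-handling step of the generic protocol concrete, the following small Python sketch (our own illustration, not the paper's Figure 2; the function name, the resolve hook, and the list-of-pairs message format are assumptions) shows how a correct process treats an arriving List: it discards the message if its own id already appears or if the biggest id does not exceed its own, and otherwise adopts a solution consistent with the biggest-id entry and prepends its own entry before forwarding.

# Sketch of one receive step of the generic 1-ftss ring protocol (illustration only).
# A message is a list of (Id, params) entries; the head is the most recent forwarder.
def receive(id_p, my_params, message, resolve):
    """Return (new_params, message_to_forward); the second item is None when the
    message is discarded.  resolve(biggest_entry, message) stands for the
    problem-specific computation that makes p's solution consistent with the
    biggest-id process (an assumed hook)."""
    ids = [entry[0] for entry in message]
    if id_p in ids:                        # the message has circulated the ring
        return my_params, None
    biggest = max(message, key=lambda entry: entry[0])
    if biggest[0] <= id_p:                 # p's own id dominates: discard
        return my_params, None
    new_params = resolve(biggest, message)               # step (1)
    return new_params, [(id_p, new_params)] + message    # step (2): prepend and forward

# Tiny usage example where the "problem" is simply to copy the leader's value.
copy_leader = lambda biggest, msg: biggest[1]
print(receive(3, 0, [(7, 42), (5, 42)], copy_leader))
# -> (42, [(3, 42), (7, 42), (5, 42)])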

Next we use the generic deterministic 1-ftss protocol for deriving solutions to 2-COL on even-size rings, and to orientation and non-trivial eventual consensus on rings of any size.

Figure 3 presents the algorithm that processes execute in the 2-COL protocol for even-size rings. A process p has a variable Clr_p that takes values in {0,1}. Each correct process repeatedly sends [(id_p, Clr_p)] to both neighbors. If p receives a List of entries (Id, Clr) and id_p is not among the Ids, then p finds the (Id, Clr) with the biggest Id and Ind, the index of (Id, Clr) from the head of List. Ind tells the distance, in links, of the process Id from p. In general, p should be colored as the processes at distance Ind = 2, 4, 6, ... and differently from the processes at distance Ind = 1, 3, 5, .... If Id > id_p then p: (1) computes Clr_p: if Ind is even, then Clr_p := Clr, else Clr_p := ¬Clr (where ¬Clr = (Clr + 1) mod 2), and (2) adds (id_p, Clr_p) to the head of List and forwards the message. By the closure and convergence of the generic protocol, we have:

Theorem 18. Let S = (N, A) be the family of systems where N is the family of asynchronous even-size rings with identifiers and A is the protocol obtained from Fig. 3. S is 1-ftss for 2-COL.

Clr_p    in {0,1}, p's color; ¬Clr_p = (Clr_p + 1) mod 2
List     list of entries (Id, Clr)
Ind      in {1, 2, ...}, position in List
Link     link through which List arrived; ¬Link is the other link

(true):
    send [(id_p, Clr_p)] to both neighbors

(List arrives via Link):
    IF id_p ∉ {Id_1, ..., Id_j} THEN
        find (Id, Clr) with biggest Id from List
        IF Id > id_p THEN
            count Ind of (Id, Clr) from head of List
            IF Ind is odd THEN Clr_p := ¬Clr ELSE Clr_p := Clr
            send (id_p, Clr_p) || List to ¬Link

Fig. 3. Deterministic 1-ftss algorithm for ring 2-coloring.
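Read as plain Python, the decision rule of Fig. 3 amounts to the following (our own sketch; the function name and message layout are assumptions): the index Ind of the biggest-id entry, counted from the head of List, is the link distance to that process, so p copies its colour at even distance and takes the opposite colour at odd distance.

# Sketch of the ring 2-colouring rule of Fig. 3 (illustration only).
def on_list_2col(id_p, clr_p, lst):
    """lst is a list of (Id, Clr) entries, head = nearest forwarder.
    Returns (new_colour, list_to_forward), with None when the message is discarded."""
    if id_p in (entry[0] for entry in lst):
        return clr_p, None
    ind, (big_id, big_clr) = max(enumerate(lst, start=1), key=lambda item: item[1][0])
    if big_id <= id_p:
        return clr_p, None
    new_clr = big_clr if ind % 2 == 0 else (big_clr + 1) % 2   # even distance: same colour
    return new_clr, [(id_p, new_clr)] + lst

print(on_list_2col(2, 0, [(9, 1), (4, 0)]))   # id 9 is at distance 1, so p takes colour 0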

5.2 Ring orientation

[3] shows that orientation cannot be solved by a deterministic protocol on anonymous rings. Figure 4 presents the algorithm that processes execute in our deterministic orientation protocol for rings with identifiers. A process p holds its orientation in variables L_p and R_p, each pointing to one of the two communication links. Each correct process repeatedly tells how, in its opinion, both neighbors should have named their communication links toward it. If p receives a List of entries (Id, Lbl) and id_p is not among the Ids, then p finds the (Id, Lbl) with the biggest Id. If Id > id_p then p: (1) updates its orientation: if Lbl = left (resp. Lbl = right) then L_p (resp. R_p) points to the communication link through which the message arrived, and R_p (resp. L_p) points to the other link, and (2) adds its orientation to the head of List and forwards the message. By the closure and convergence of the generic protocol, we have:

Theorem 19. Let S = (N, A) be the family of systems where N is the family of asynchronous rings with identifiers and A is the protocol obtained from Fig. 4. S is 1-ftss for ring orientation.

5.3 Non-trivial eventual consensus

Figure 5 presents the algorithm that processes execute in the non-trivial eventual consensus protocol for rings with identifiers. A process p holds the value for consensus in the variable Val_p. Each correct process repeatedly tells Val_p to both neighbors. If p receives a List of entries (Id, Val) and id_p is not among the Ids, then p finds the (Id, Val) with the biggest Id. If Id > id_p then p: (1) assigns Val_p := Val, and (2) adds (id_p, Val_p) to the head of List and forwards the message. By the closure and convergence of the generic protocol, we have:

R_p, L_p   pointers to p's communication links
List       list of entries (Id, Lbl)
Link       link through which List arrived; ¬Link is the other link

(true):
    send [(id_p, right)] to L_p, [(id_p, left)] to R_p

(List arrives via Link):
    IF id_p ∉ {Id_1, ..., Id_j} THEN
        find (Id, Lbl) with biggest Id from List
        IF Id > id_p THEN
            IF Lbl = left THEN
                point L_p to Link, R_p to ¬Link
                send (id_p, left) || List to R_p
            ELSE
                point R_p to Link, L_p to ¬Link
                send (id_p, right) || List to L_p

Fig. 4. Deterministic 1-ftss algorithm for ring orientation.

Val_p   p's value in consensus
List    list of entries (Id, Val)
Link    link through which List arrived; ¬Link is the other link

(true):
    send [(id_p, Val_p)] to both neighbors

(List arrives via Link):
    IF id_p ∉ {Id_1, ..., Id_j} THEN
        find (Id, Val) with biggest Id from List
        IF Id > id_p THEN
            Val_p := Val
            send (id_p, Val_p) || List to ¬Link

Fig. 5. Deterministic 1-ftss algorithm for non-trivial eventual consensus.

Theorem 20. Let S = (N, A) be the family of systems where N is the family of asynchronous rings with identifiers and A is the protocol obtained from Fig. 5. S is 1-ftss for non-trivial eventual consensus.

6 Conclusions

We defined the property of k-fault-sensitivity, and showed that if a problem is k-fault-sensitive on an asynchronous (j,k)-restrictable process network, then there is no k-ftss solution to the problem on that network. We then proposed deterministic and randomized 1-ftss solutions to several problems that are not 1-fault-sensitive on rings. We also developed a generic deterministic 1-ftss protocol

on rings with identifiers. Our deterministic 2-COL, orientation, and non-trivial eventual consensus protocols are derived from the generic solution.

References

1. E. Anagnostou and V. Hadzilacos, "Tolerating transient and permanent failures", WDAG'93, LNCS Vol. 725, pp. 174-188.
2. A. Arora and M. Gouda, "Closure and convergence: a foundation of fault-tolerant computing", IEEE Transactions on Software Engineering 19(11):1015-1027, 1993.
3. J. Beauquier, O. Debas, and S. Kekkonen, "Fault-tolerant and self-stabilizing ring orientation", in Proc. of 3rd Intl. Colloquium on Structural Information and Communication Complexity, 1996, pp. 59-72. (Available also as LRI Res. Rep. 1030, Université de Paris-Sud, Orsay.)
4. J. Beauquier and S. Kekkonen, "Fault-tolerance and self-stabilization: Impossibility results and solutions using self-stabilizing failure detectors", Intl. Journal of Systems Science, Special Issue on Distributed Systems, 1997 (to appear). (Extended abstract "Making ftss is hard", in Proc. of 11th Intl. Conf. on Systems Engineering, 1996, pp. 91-96.)
5. G. M. Brown, M. Gouda, and C. L. Wu, "A self-stabilizing token system", in Proc. of 20th Annual Hawaii Intl. Conf. on Systems Sciences, 1987, pp. 218-223.
6. R. Buskens and R. Bianchini, Jr., "Self-stabilizing mutual exclusion in the presence of faulty nodes", in Digest of papers of 25th Intl. Symp. on Fault-Tolerant Computing, 1995, pp. 144-153.
7. E. W. Dijkstra, "Self-stabilizing systems in spite of distributed control", Communications of the ACM 17(11):643-644, 1974.
8. S. Dolev, A. Israeli, and S. Moran, "Self-stabilization of dynamic systems assuming only read/write atomicity", Distributed Computing 7(1):3-16, 1993.
9. S. Dolev, A. Israeli, and S. Moran, "Analyzing expected time by scheduler-luck games", IEEE Transactions on Software Engineering 21(5):429-438, 1995.
10. S. Dolev and J. Welch, "Self-stabilizing clock synchronization with byzantine faults", PODC'95, p. 256. (Also in Proc. of 2nd Workshop on Self-stabilizing Systems, 1995, paper #9.)
11. A. Gopal and K. Perry, "Unifying self-stabilization and fault-tolerance", PODC'93, pp. 195-206.
12. M. Gouda, "The triumph and tribulation of system stabilization", WDAG'95, LNCS Vol. 972, pp. 1-18.
13. T. Masuzawa, "A fault-tolerant and self-stabilizing protocol for the topology problem", in Proc. of 2nd Workshop on Self-stabilizing Systems, 1995, paper #1.


This article was processed using the LaTeX macro package with SIROCCO style


Compositional Proofs of Self-stabilizing Protocols

George Varghese¹

¹ Washington University in St. Louis; work done partially while at Lab. for Computer Science, MIT. Email: varghese@askew.wustl.edu. Research supported in part by an ONR Young Investigator Award and NSF grant NCR-950544.

Abstract. We describe a modularity theorem for self-stabilizing protocols. The theorem shows that self-stabilizing component automata (that possess a property called suffix closure) can be composed to form a self-stabilizing system. Our theorem is more general than prior compositional theorems, and is described using the timed I/O automaton model to facilitate the proof of stabilization time bounds. We also use a simple theorem that facilitates hierarchical proofs for stabilizing protocols. Taken together, the theorems for compositional and hierarchical proofs allow the verification of complex self-stabilizing systems. We describe example applications.

1 Introduction

A protocol is self-stabilizing if, when started from an arbitrary global state, it exhibits "correct" behavior after finite time. Typical protocols are designed to cope with a specified set of failure modes like packet loss and link failures. A self-stabilizing protocol copes with a set of failures that subsumes most previous categories, and is robust against transient errors. Transient errors include memory corruption, as well as malfunctioning devices that send out incorrect packets. Transient errors do occur in real networks and cause systems to fail unpredictably. Thus stabilizing protocols are attractive because they offer increased robustness as well as potential simplicity. Self-stabilizing algorithms can be simpler because they use uniform mechanisms to deal with different kinds of failures. Self-stabilizing protocols were introduced by Dijkstra [Dij74]. They have been studied by various researchers (e.g., [BP89, GM90, DIM93, IJ90a, IJ90b, AKY90]). Much research in self-stabilization has focused on proving particular protocols self-stabilizing. Our focus in this paper, however, is on methods to construct (and especially verify the correctness of) complex, asynchronous self-stabilizing protocols.

Verification of a small self-stabilizing program P is straightforward. The proof obligation naturally divides into two parts*: a liveness part which shows that P reaches a legitimate state no matter what state it starts in; and a safety part which shows that P remains in legitimate states. The liveness part can be carried out using temporal logic or other formalisms; the safety part is easily accomplished using invariant assertions or other inductive methods. Some results that describe such a verification methodology specialized for self-stabilizing programs can be found in [AG92]. There have also been a number of papers that describe general techniques to construct self-stabilizing programs [KP90, AG90, AKY90, APV91, Var93, Var94] by compiling non-stabilizing distributed programs that accomplish the same task. These methods directly synthesize self-stabilizing programs whose correctness follows directly from the proof of correctness of the compiler used.

While general transformational methods and standard verification methods are important tools for designing and verifying stabilizing programs, neither is perhaps sufficient for the design and verification of complex stabilizing systems. The problem is that a large system may be constructed by composing several constituent protocols, each of which consists of a number of subprotocols, and so on. We need a verification methodology that can support such a building-block approach.†

Consider an example we will return to later. The literature describes an elegant stabilizing Data Link protocol [AB89] that can be built over primitive physical links. The literature also describes stabilizing spanning tree protocols that work over reliable Data Links (e.g., [DIM93]). Finally, there are stabilizing token passing protocols that work in a spanning tree topology. Suppose we wish to build a stabilizing token passing protocol that works over an arbitrary topology of primitive physical links. A natural approach would be to "compose" the aforementioned protocols. It seems intuitive that (in the resulting composition) after the Data Links stabilize, the Spanning Tree protocol will stabilize, followed by the stabilization of the token passing protocol. However, such intuitions can be misleading. Without a formal basis for composing stabilizing protocols, we have no real basis for talking about the correctness of the larger system. In fact, our composition theorem (described later) requires a subtle condition known as suffix closure which may not always hold.

The situation is similar to the verification of ordinary (i.e., non-stabilizing) protocols. While verification can be carried out on the complete protocol modeled as a monolithic automaton, this is often tedious and error prone. The monolithic approach has the usual software engineering disadvantages in terms of the difficulty of writing, modifying, and reusing parts of the proof. Thus many models for describing protocols offer facilities for compositional and hierarchical proof techniques. Compositional proofs allow the behavior of a system to be inferred from the behavior of its components; this allows the proof to be modularized by system components. Hierarchical proofs (e.g., [LV96]) allow a system to be designed at finer and finer levels of detail: the proof method can be used to show that lower-level designs "implement" more abstract designs. This allows the proof to be modularized by abstraction levels. Complex protocols can be verified by a combination of compositional and hierarchical proofs. As an example, the I/O Automaton (IOA) [LT89] and timed IOA models [MMT91] offer compositional lemmas to "paste" the behavior of a system from the behaviors of its components. The IOA models also allow the use of simulation techniques to show that the behaviors of automaton A are a subset of the behaviors of another automaton B (which shows that A implements B).

Ideally, we wish to possess the same level of verification methodology for verifying complex stabilizing protocols. Rather than start from scratch, we leverage off existing work. Thus, in this paper we start with the IOA model and extend the lemmas and proof techniques already developed to provide the required methodology for stabilizing systems. Thus we use existing IOA definitions (e.g., fairness, composition) and existing IOA lemmas (e.g., composition lemmas).

* called closure and convergence in [AG92]
† The ideal solution to the design of complex systems is a methodology for stepwise refinement. Our paper does not deal with this but only with the problem of composing self-stabilizing components. While composition is an important part of a general methodology, it is only a small part of what is needed.

2 Comparison with Existing Work

Two prior results on "compositional" theorems for stabilization can be found in [DIM93] and [Her91]. The result in [DIM93] describes a fair composition between a slave protocol and a master protocol. The master protocol is a function of its own state together with some shared state. The shared state is read but not written by the master protocol, and both read and written by the slave protocol. The authors define the fair composition of these protocols (in which each processor alternates between actions of the two protocols) and show that the fair composition is self-stabilizing if the master and slave protocols are self-stabilizing. This is used to verify the composition of a token passing protocol and a spanning tree protocol. Herman's thesis [Her91] uses a UNITY-like shared memory model and defines composition as (roughly) the union of the statements in the two programs. However, composition is not symmetric. One of the two programs, say A, must control the other program. Roughly speaking, A controls B if A can write to the variables of B but not vice versa. This theorem is used several times [Her91] in verifying an elegant biconnected components protocol. A difficulty with these previous theorems is a lack of generality. Both theorems rely on asymmetric composition between a controlling component (e.g., graph specification in [DIM93]) and a controlled component. When we compose a reset protocol with a spanning tree protocol (see example below) both protocols produce output actions that are input actions of the other protocol. In other words, each component protocol affects the state of the other component, which seems hard to model using asymmetric composition. Second, in [Her91] it appears to be necessary for the controlling component to converge to a fixed point instead of an arbitrary closed predicate. This is sufficient if the controlling component is computing values that are input to the controlled component - e.g., a spanning tree. However, this is insufficient if

the controlling component provides a more dynamic service - e.g., a resource allocation protocol that runs over a token passing protocol. A stabilizing token passing protocol does not converge to a fixed point but instead to a closed predicate (exactly one token in each state). Similarly, a stabilizing Data Link protocol provides reliable in-order delivery, which can be expressed by a closed predicate and not by a fixed point. In summary, the lack of generality in existing work is caused by asymmetry [Her91, DIM93] in composition, and the requirement ([Her91]) for the controlling automaton to converge to a fixed point. By contrast, the theorem we introduce in this paper allows symmetric composition and allows the constituent automata to converge to a closed predicate instead of just a fixed point. We also describe our theorem in the timed IOA message passing model; this allows our theorem to apply to message passing protocols and to be used for calculating system stabilization times from the stabilization times of components. On the negative side, our theorem requires that the constituent automata stabilize to what we call suffix-closed behaviors/executions. This additional condition is not very restrictive for two reasons. First, many useful stabilizing system behaviors are suffix closed; we have used our theorem on several complex examples [Var93]. Suffix-closed behaviors are common because any closed automaton (i.e., one whose reachable states are also possible initial states) is suffix closed; most general techniques for stabilization (e.g., [KP90, APV91]) explicitly construct closed automata. A second negative aspect of our theorem is that it only caters to the composition of automata that unconditionally stabilize; it does not provide for the composition of two automata that conditionally stabilize. The rest of the paper is organized as follows. We first describe a summary of the timed IOA model that we use in Section 3, followed by definitions of stabilization in Section 4. We present our main modularity theorem, which is the basis of our compositional proof technique, in Section 5. We briefly describe an example of a complex proof carried out using this theorem in Section 6 and conclude in Section 7. [Var93] describes our hierarchical proof technique, which complements our compositional proof technique. We do not describe it here because our hierarchical proof techniques are similar to standard simulation techniques (e.g., [LV96]).

3 Modeling

We use the timed IOA model [MMT91] in which each action is associated with a fairness class c. To model asynchronous protocols, we assume that any continuously enabled action in class c will occur in time t_c. We do not use lower bounds, and we assume that the protocols must work regardless of the value of t_c used. The use of parameterized upper bounds allows us to avoid the double effort of first proving correctness and then proving time bounds. Liveness arguments are replaced by showing time bounds on event occurrences. We use the usual model of timed IOA with a few small variations (see formal summary in appendix). An automaton is a state machine with transition

labels called actions. An untimed execution of an IOA is an alternating series of states and actions (s_0, a_1, s_1, ...) such that s_0 is a start state and (s_i, a_i, s_{i+1}) is a valid transition of the IOA. An execution α of automaton A is an untimed execution with additional time components for each state and action: (s_0, t_0), (a_1, t_1), (s_1, t_1), (a_2, t_2), (s_2, t_2), ... such that if any action in any class c is enabled in any state s_j of α, then within time s_j.time + t_c either some action in c occurs or some state occurs in which every action in c is disabled. In addition, the time assigned to any event a_i in α (i.e., a_i.time) is equal to the time assigned to the next state (i.e., s_i.time).‡

Let α be an execution. Let γ be the subsequence of α consisting of timed external actions, and let t_0 be the start time of α. The behavior β corresponding to α is the sequence β = (t_0, γ). The behaviors of automaton A are the behaviors corresponding to the executions of A. We will use X.start to denote the starting time of either a behavior or an execution X. Notice that start times are allowed to be arbitrary non-negative real numbers. This allows a clean statement of a lemma about stabilizing automata. There is a notion of composition [MMT91] that produces a composite automaton A = Π_i A_i out of a set of compatible constituent automata A_i. In a composition, input and output actions of the same name are performed simultaneously; the state of the composite automaton is the composition of the constituent automata states. We use β|A_i to represent the projection of a behavior β of A onto some constituent automaton A_i. We assume that β|A_i inherits the times of β in the natural way. The following lemmas are translated from [MMT91] and show how we can "cut" behaviors of a composition and paste together behaviors of the constituent automata.

Lemma 1. Cut Lemma. Let {A_i, i ∈ I} be a compatible collection of automata and let A = Π_{i∈I} A_i. Let β be any behavior of A. Then β|A_i is a behavior of A_i for every i ∈ I.

Lemma 2. Paste Lemma. Let {A_i, i ∈ I} be a compatible collection of automata and let A = Π_{i∈I} A_i. Let β be a behavior sequence such that each action in β is an external action of A. If β|A_i is a behavior of A_i for every i ∈ I, then β is a behavior of A.
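As a toy illustration (ours, not the model's formal definition), if a behavior is stored as a start time plus a list of timed actions, then the projection β|A_i used in both lemmas is simply a filter that keeps the actions belonging to component A_i while preserving their times:

# Illustration of the Cut Lemma on a behaviour stored as (start, [(action, time), ...]).
def cut(beta, actions_of_component):
    """Project a behaviour onto one component by keeping only its actions."""
    start, gamma = beta
    return start, [(a, t) for (a, t) in gamma if a in actions_of_component]

beta = (0.0, [("send_ab", 1.0), ("deliver_ab", 2.0), ("ack_ba", 3.0)])
link_actions = {"send_ab", "deliver_ab"}     # hypothetical action names of one component
print(cut(beta, link_actions))               # -> (0.0, [('send_ab', 1.0), ('deliver_ab', 2.0)])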

4 Stabilization: Definitions and Properties

We begin with a standard state-based definition of stabilization and then describe a definition of stabilization in terms of external behaviors.

4.1 Definitions of Stabilization based on Executions

Intuitively, if automaton A stabilizes to the executions in set C in time t, then within time t, all executions of A begin to "look like" an execution in set C. We formalize this by defining a t-suffix of an execution α. Intuitively, this is a suffix of α whose first element occurs no more than time t after the start of α. We say that α' is a t-suffix of execution α if: α'.start - α.start ≤ t and α' is a suffix of α. We can now define execution stabilization. Let C be a set of sequences of timed elements. We say that automaton A stabilizes to the executions in C in time t if for every execution α of A there is some t-suffix of execution α that is in C. We also say that automaton A stabilizes to the executions of automaton B in time t if for every execution α of A there is some t-suffix of execution α that is an execution of B. Execution stabilization is transitive; this allows us to prove execution stabilization in several stages.

Lemma 3. If automaton A stabilizes to the executions of automaton B in time t_1 and B stabilizes to the executions of automaton C in time t_2, then A stabilizes to the executions of C in time t_1 + t_2.

‡ We also rule out the possibility of "Zeno executions" in which the execution is infinite but time stays within some bound. See appendix.

4.2 Definitions of Stabilization based on External Behavior

A major theme of the I/O Automaton model [LT89] is the focus on external behaviors for specifying correctness. Thus it is natural to look for a definition of stabilization in terms of external behaviors. Typically, the correctness of an IOA is specified by a set of legal behaviors P. An IOA A is said to solve P if the behaviors of A are contained in P. For stabilization, however, we weaken this definition and ask only that an automaton exhibit correct behavior after some finite time. As in the case of execution stabilization, we begin with the definition of a t-suffix of a behavior β. Intuitively, this is a portion of β that starts at time no more than t after the start of β. However, this is not as easy as defining a t-suffix of an execution because a behavior β = (t_0, γ) consists of two components: a start time t_0 and a sequence of timed actions γ. We cannot simply define a t-suffix of β to be a suffix of β; the t-suffix must have a start time as well as a sequence of timed actions. For simplicity, assume that these definitions apply only to infinite behaviors in which time grows without bound. Formally, consider any two behavior sequences β = (t_0, γ) and β' = (t_0', γ'). We say that β' is a t-suffix of behavior β if:

- β'.start - β.start ≤ t, and
- γ' is a suffix of γ containing all actions in β that occur at times strictly greater than β'.start.

Using this, we can define behavior stabilization analogous to execution stabilization (our definition is adapted from a definition suggested by Nancy Lynch). Let P be a set of behavior sequences. An IOA A stabilizes to the behaviors in P in time t if for every behavior β of A there is a t-suffix of behavior β that is in P. An automaton A is said to stabilize to the behaviors of another automaton B in time t if for every behavior β of A there is a t-suffix of behavior β that is a behavior of B.
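For concreteness, a behavior can be represented as a start time together with a list of timed external actions; the following Python helper (our own illustration, not from the paper) extracts the t-suffix that begins at a chosen new start time, mirroring the conditions above.

# Sketch: a behaviour as (start_time, [(action, time), ...]); illustration only.
def t_suffix(behavior, t, new_start):
    """Return the t-suffix of behavior beginning at new_start, or None if the
    choice of new_start violates the t-suffix conditions."""
    start, gamma = behavior
    if not (start <= new_start <= start + t):
        return None
    # keep exactly the actions of the behaviour that occur strictly after new_start
    return new_start, [(a, tm) for (a, tm) in gamma if tm > new_start]

beta = (0.0, [("send", 1.0), ("recv", 2.5), ("send", 4.0)])
print(t_suffix(beta, 2.0, 1.5))   # -> (1.5, [('recv', 2.5), ('send', 4.0)])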

The following lemmas follow from the definitions. The transitivity lemma allows us to prove behavior stabilization results in stages.

Lemma 4. If automaton A stabilizes to the behaviors of automaton B in time t_1 and B stabilizes to the behaviors of automaton C in time t_2, then A stabilizes to the behaviors of C in time t_1 + t_2.

Lemma 5. If automaton A stabilizes to the behaviors of an automaton B in time t and t' ≥ t, then A stabilizes to the behaviors of B in time t'.

The previous lemma motivates a natural complexity measure called the stabilization time from A to B. Intuitively, this is the smallest time after which we are guaranteed that A will behave like B. Formally, the stabilization time from A to B is the infimum of all t such that A stabilizes to the behaviors of B in time t. The next lemma shows that execution stabilization implies behavior stabilization. Thus to prove a behavior stabilization result, we prove a corresponding execution stabilization result. Behavior stabilization is typically used for specification and execution stabilization is used for proofs.

Lemma 6. If automaton A stabilizes to the executions of automaton B in time

t, then automaton A stabilizes to the behaviors of B in time t.

5 Modularity Theorem

We mostly deal with stabilization properties of a special class of automata called unrestricted automata (UIOA). Intuitively, a UIOA models a system that can start in an arbitrary state. Formally, a UIOA A is an automaton such that start(A) = states(A). We often work with a second special kind of automaton called a Closed I/O Automaton (CIOA). Define the reachable states of an automaton A to be the states that can occur in executions of A. A CIOA is an automaton such that every reachable state is also a start state. It is easy to see that every UIOA is a CIOA. The following lemma is convenient and is used often below without explicit reference. It is the reason we allow executions and behaviors to start with arbitrary values of time. It depends crucially on the fact that there are no lower bounds on the time between actions.

Lemma 7. Consider any execution α of a CIOA A. Any suffix of α that starts with a timed state is also an execution of A.

Suppose we begin to view an automaton after it has "run for a while" and the resulting behavior is indistinguishable from an ordinary behavior of the automaton. Then, intuitively, we say that the automaton is suffix-closed. More formally: we say that an automaton A is suffix-closed if for every behavior β of A and every time t ≥ 0, every t-suffix of behavior β is a behavior of A.

A remarkable number of interesting automata we have studied are suffix-closed. This fact is explained by the following lemma:

Lemma 8. Any CIOA A is suffix-closed.

Proof. We will only sketch the main idea of the proof. Consider any behavior β of A and any t ≥ 0. Let β' be any t-suffix of behavior β. Consider any execution α of A such that the behavior of α is β. The proof consists of using α to construct another execution α' of A such that the behavior of α' is β'; α' is essentially a suffix of α whose start time is adjusted to match the start time of β'.

(Figure: a behavior drawn above its corresponding execution, with a vertical line marking the start of the suffix; the suffix of the behavior includes all actions after this point in time.)

Fig. 1. Obtaining the suffix of an execution corresponding to a t-suffix of the behavior of the execution.

The behavior β and corresponding execution α are sketched in Figure 1. By definition, for every action in β there is a corresponding external action in α which occurs at the same time. This is sketched by drawing the action in the behavior directly above the corresponding action in the execution. (However, since the execution will, in general, have internal actions not included in the behavior, the indices of the actions will not necessarily match. Thus in the figure a_1 in β corresponds to a_m in α.) The t-suffix β' can be sketched using a line: β' contains all actions in β occurring to the right of the line (see Figure 1). The line is drawn between two actions in β because the start time of β' may occur in between the times of two actions in β. We need a suffix α' of α whose behavior is equal to β'. Thus we look for a state s_x in α corresponding to the vertical time line drawn in Figure 1. But we may not have a state in α whose time is equal to the start time of β'. So (intuitively) we choose s_x to be the first state that occurs to the "left" of the vertical time line. Then we choose α' to be the suffix of α starting with s_x and with the time of s_x adjusted to be equal to β'.start. This works for two reasons. First, by definition of a CIOA, s_x is a start state of A. Second, we have no lower bounds on the time between actions in A. Thus increasing the time of the initial state of an execution (such that the resulting time is no greater than the time of the first action) still leaves us with a legal execution.

The suffix-closed property is not just an interesting curiosity. It also provides the basis for the following important Modularity Theorem that we discuss next. Our Modularity Theorem about the stabilization of composed automata may seem "obvious". We would expect that if each piece A_i of a composed system

stabilizes to the behaviors of say Bi, then the composition of the Ai should stabilize to the composition of the Bi. Sadly, this is not quite true. There is a counterexample described in Section 5.1 which shows that if we allow some of the Bi to be arbitrary automata, then this statement is false. The main problem is that for a given behavior of the system A, the component automata Ai may stabilize at different times. But if each of the Ai begin to "look like" the corresponding Bi at different times, then it may not be possible to paste the resulting behavior into a behavior that "looks like" a behavior of B. However, this problem does not arise if each of the Bi is suffix-closed. Thus we have the following result. Theorem 9. Modularity: Let I be a finite index set. Let AI = {Ai, i E I) be a compatible set of automata and BI = {Bi, i E I) be a second set of compatible, sufix-closed automata. Suppose also that for all i E I , Ai stabilizes to the behaviors of Bi in time t. Let A = lIiEIAI and B = IIiEIBI. Then A stabilizes to the behaviors of B in time t.

Proof. The proof relies on the Cut Lemma (Lemma 1), which allows us to dissect a behavior of a system into its component behaviors, and the Paste Lemma (Lemma 2), which allows us to paste component behaviors into a system behavior. Consider any behavior β of A. Consider the β' that is a t-suffix of behavior β and such that:

- Any actions in β' occur at times strictly greater than t.

We can verify that such a β' exists from the definition of a t-suffix of a behavior. Intuitively, β' is chosen so that all component behaviors are guaranteed to have stabilized in β'. Now consider any i ∈ I. By the Cut Lemma (Lemma 1), β|A_i is a behavior of A_i. But because A_i stabilizes to the behaviors of B_i in time t, there must be some t' ≤ t and some β_i such that:

- β_i is a t'-suffix of β|A_i and is also a behavior of B_i.

Next consider β'|A_i. It can be verified that β'|A_i is a t''-suffix of β_i for some t''. Thus by the fact that B_i is suffix-closed, β'|A_i is a behavior of B_i. Thus β'|A_i is a behavior of B_i for all i. Hence by the Paste Lemma (Lemma 2), β' is a behavior of B. The theorem follows since β' is a t-suffix of behavior β.

5.1 The Importance of Suffix-Closure

In the hypothesis of the modularity theorem, we assumed that each of the B_i was suffix-closed. We give a counterexample to show that if the B_i are allowed to be arbitrary automata, then the theorem is false. Consider automaton B_i shown

in Figure 2. Let A_i be a UIOA identical to B_i except that the start states of A_i are unrestricted (i.e., the initial value of count_i in A_i can be any value in the range {0, ..., 2}).

The state of B_i consists of an integer variable count_i ∈ {0, ..., 2}
The initial value of count_i is 0    (* i.e., B_i is not a UIOA *)

INCREMENT_i(k)    (* output action, outputs counter value using parameter k *)
    Precondition: k = count_i
    Effect: count_i := (count_i + 1) mod 3

Any INCREMENT_i action is in a separate class with upper bound t.

Fig. 2. Specification for Automaton B_i

It is easy to see that A_i stabilizes to the behaviors of B_i in time 3t because within that time the value of count_i must reach 0. After such a state, any behavior of A_i is a behavior of B_i. Now consider an index set I = {1, 2}. Consider A which is the composition of A_1 and A_2, and B which is the composition of B_1 and B_2. We claim that A does not stabilize to B in time 3t (or in fact in any finite time). To see this, we start with the following observation. In any behavior of B in which the actions of B_1 and B_2 strictly alternate, the counter values output in such a behavior will be of the form 0, 0, 1, 1, 2, 2, 0, 0, .... Now consider the behavior corresponding to an execution of A in which count_1 = 0 initially and count_2 = 2 initially and the actions of A_1 and A_2 strictly alternate starting with A_1. Then the counter values output in such an execution will be of the form 0, 2, 1, 0, 2, 1, 0, 2, .... From the earlier observation, it follows that there is no suffix of this behavior of A that is a behavior of B. Unfortunately, this example is somewhat artificial because it can be made to fail if we slightly modify the definition of stabilization. Thus it still remains to find a better example (to justify the need for suffix closure) based on "some interaction phenomenon between the automata".§
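The failure of composition is easy to reproduce mechanically. The small Python script below (our own illustration) drives the two mod-3 counters in strict alternation, once from the legal start count_1 = count_2 = 0 and once from count_1 = 0, count_2 = 2; the second run cycles through 0, 2, 1, 0, 2, 1, ... and so never settles into the 0, 0, 1, 1, 2, 2, ... pattern that strict alternation produces in B.

# Illustration: two mod-3 counters (automata 1 and 2) driven in strict alternation.
def run(count1, count2, steps=12):
    counts, outputs = [count1, count2], []
    for step in range(steps):
        i = step % 2                        # A1, A2, A1, A2, ...
        outputs.append(counts[i])           # INCREMENT_i(k) outputs the current value
        counts[i] = (counts[i] + 1) % 3      # ... and then increments mod 3
    return outputs

print(run(0, 0))   # behaviour of B:             [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
print(run(0, 2))   # behaviour of A (bad start): [0, 2, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1]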

6 Example Applications

As an example of how Theorem 9 can be used, consider a stabilizing spanning tree protocol that runs over a stabilizing reset protocol that itself runs over a stabilizing Data Link implementation. Figure 3 shows the structure of such a protocol using (for simplicity) two nodes a and b. S_a and R_a are the spanning tree and reset automata at a; S_b and R_b are the corresponding automata at b. The reset automata communicate over a Data Link implementation as shown in the leftmost frame (A). The double arrows between two connected automata show that each automaton can produce output actions that are input actions of the other automaton. For example, the Data Link reports when it is free to the reset protocol (input action for reset protocol), and the reset protocol uses an output action to send packets to the Data Link. Thus it does not appear to be possible to use earlier results based on asymmetric composition.

§ I am grateful to one of the anonymous referees for this observation.

(Figure 3 panel label: Working Reset Protocol)

Fig. 3. Stabilization of a spanning tree protocol that works over a reset protocol that works over a stabilizing data link implementation.

We first show that the Data Link implementation stabilizes in time t_d to a Data Link automaton UDL, using the hierarchical proof technique described in [Var94]. We then use the Modularity Theorem to show that the composed automaton in the leftmost frame A stabilizes to the composed automaton shown in the middle frame B in time t_d. We can do this since the reset and spanning tree automata are UIOAs (no start states specified) and UDL is a CIOA (hence all the automata in the middle frame are suffix-closed). We then show that the composition of the reset automata and the UDL stabilizes to a correct reset protocol in time t_r, using the hierarchical proof techniques. Since the reset protocol is made stabilizing using local checking and correction [APV91], it stabilizes to an automaton that satisfies all local predicates and is hence a CIOA. Since the spanning tree automata are UIOAs, we can use the Modularity Theorem again to show that the composition of the automata in the middle frame B stabilizes to the composition of the automata in the rightmost frame C in time t_r. We can now use the Transitivity Lemma (Lemma 4) to show that the composition of automata in A stabilizes to the composition of automata in C in time t_d + t_r. This process can be continued to show stabilization of the spanning tree protocol itself and even that of a token passing protocol run on top of the spanning tree protocol. Details of the protocols can be found in [APV91, Var93]. Notice how both hierarchical and compositional proof techniques are used and how the stabilization times of each "layer" add up. Note, too, that both the UDL and reset automata stabilize to closed predicates, and how the automata have actions that affect each other. It does not appear possible to use earlier compositional theorems [DIM93, Her91] to obtain similar results.

7 Conclusions

The main contribution of this paper is the modularity theorem (Theorem 9). The Modularity Theorem is simple but useful. It helps us to prove facts about the stabilization of a big system by proving facts about the stabilization of each of its parts, as long as each part is suffix-closed. The modularity theorem gives us a formal basis for a building-block approach. The requirement that each of the parts be suffix-closed is not very restrictive. Essentially, this is because our components are either primitive automata that are UIOAs (recall that stabilizing automata are built from automata whose start states are unspecified!) or intermediate automata that are CIOAs. Both CIOAs and UIOAs have suffix-closed behaviors. The reason that intermediate automata (e.g., the working reset protocol in Figure 3) are often CIOAs is because most general techniques for self-stabilization (e.g., [KP90, APV91, Var93, Var94]) construct automata that stabilize to a specification automaton of the form A|L (i.e., the automaton that is identical to A except that its set of start states is equal to L). If L is a closed predicate of A - i.e., no transition of A can falsify L - then A|L is a CIOA. As we build up a complex stabilizing protocol in several layers, the stabilization time of the system can be calculated by applying the Modularity Theorem and the Transitivity Lemma for behaviors (see Section 6). The results are intuitive: the stabilization times of each "layer" add up due to the Transitivity Lemma. The definitions of stabilization in terms of external behaviors allow us to define that automaton A stabilizes to another automaton B, even though A and B have different state sets. This is useful when A is a low-level model (e.g., an implementation) and B is a high-level model (e.g., a specification). We also have a standard definition of stabilization in terms of executions. The execution definition is used for proofs while the behavior definition is used for specification. The definitions give us nice properties: transitivity for both behavior and execution stabilization, the fact that execution stabilization implies behavior stabilization, and the Modularity Theorem.

References

[AB89] Yehuda Afek and Geoffrey Brown. Self-stabilization of the alternating bit protocol. In Proceedings of the 8th IEEE Symposium on Reliable Distributed Systems, pages 80-83, 1989.
[AG90] Anish Arora and Mohamed G. Gouda. Distributed reset. In Proc. 10th Conf. on Foundations of Software Technology and Theoretical Computer Science, pages 316-331. Springer-Verlag (LNCS 472), 1990.
[AG92] Anish Arora and Mohamed G. Gouda. Closure and convergence: A foundation of fault-tolerant computing. Unpublished manuscript, February 1992.
[AKY90] Yehuda Afek, Shay Kutten, and Moti Yung. Memory-efficient self-stabilization on general networks. In Proc. 4th Workshop on Distributed Algorithms, pages 15-28, Italy, September 1990. Springer-Verlag (LNCS 486).
[APV91] Baruch Awerbuch, Boaz Patt-Shamir, and George Varghese. Self-stabilization by local checking and correction. In Proc. 32nd IEEE Symp. on Foundations of Computer Science, October 1991.
[BP89] J. E. Burns and J. Pachl. Uniform self-stabilizing rings. ACM Transactions on Programming Languages and Systems, 11(2):330-344, 1989.
[Dij74] Edsger W. Dijkstra. Self stabilization in spite of distributed control. Comm. of the ACM, 17:643-644, 1974.
[DIM93] Shlomi Dolev, Amos Israeli, and Shlomo Moran. Self-stabilization of dynamic systems assuming only read/write atomicity. Distributed Computing, vol. 7, 1993.
[GM90] Mohamed G. Gouda and Nicholas J. Multari. Stabilizing communication protocols. Technical Report TR-90-20, Dept. of Computer Science, University of Texas at Austin, June 1990.
[Her91] Ted Herman. Adaptivity through Distributed Convergence. PhD thesis, Dept. of Comp. Science, University of Texas, Austin, 1991.
[IJ90a] Amos Israeli and Marc Jalfon. Token management schemes and random walks yield self-stabilizing mutual exclusion. In Proc. 10th ACM Symp. on Principles of Distributed Computing, Quebec City, Canada, August 1990.
[IJ90b] A. Israeli and M. Jalfon. Self-stabilizing ring orientation. In Proc. 4th Workshop on Distributed Algorithms, Italy, September 1990.
[KP90] Shmuel Katz and Kenneth Perry. Self-stabilizing extensions for message-passing systems. In Proc. 10th ACM Symp. on Principles of Distributed Computing, Quebec City, Canada, August 1990.
[Lam83] L. Lamport. Specifying concurrent program modules. ACM TOPLAS, 5(2):190-222, April 1983.
[LT89] Nancy A. Lynch and Mark R. Tuttle. An introduction to input/output automata. CWI Quarterly, 2(3):219-246, 1989.
[LV96] Nancy Lynch and Frits Vaandrager. Forward and backward simulations, II: Timing-based systems. Information and Computation, 128(1):1-25, 10 July 1996.
[MMT91] M. Merritt, F. Modugno, and M. R. Tuttle. Time constrained automata. In CONCUR 91, pages 408-423, 1991.
[MP91] Zohar Manna and Amir Pnueli. Completing the temporal picture. Theoretical Computer Science, 83, 1991.
[OL82] S. Owicki and L. Lamport. Proving liveness properties of concurrent programs. ACM Trans. on Programming Lang. and Syst., 4(3):455-495, 1982.
[Var93] George Varghese. Self-stabilization by local checking and correction. Ph.D. Thesis MIT/LCS/TR-583, Massachusetts Institute of Technology, 1993.
[Var94] George Varghese. Self-stabilization by counter flushing. In Proceedings of the 13th PODC, Los Angeles, California, August 1994.

A Formal Summary of the I/O Automaton Model

Our model is a special case of the timed I/O automaton model in [MMT91]. However, our terminology is slightly different from that of [MMT91]. An automaton A consists of five components:

- a finite set of actions actions(A) that is partitioned into three sets called the set of input, output, and internal actions. The union of the set of input actions and the set of output actions is called the set of external actions. The union of the set of output and internal actions is called the set of locally controlled actions.

- A finite set of states called states(A).

- A nonempty set start(A) ⊆ states(A) of start states.

- A transition relation R(A) ⊆ states(A) × actions(A) × states(A) with the property that for every state s and input action a there is a transition (s, a, s') ∈ R(A).

- An equivalence relation part(A) partitioning the set of locally controlled actions into equivalence classes, such that for each class c in part(A) we have a positive real upper bound t_c. (Intuitively, t_c denotes an upper bound on the time to perform some action in class c.)

An action a is said to be enabled in state s of automaton A if there exists some s' ∈ states(A) such that (s, a, s') ∈ R(A). An action a is disabled in state s of automaton A if it is not enabled in that state. Since one action may occur multiple times in a sequence, we often use the word event to denote a particular occurrence of an action in a sequence. To model the passage of time we use a time sequence. A time sequence t_0, t_1, t_2, ... is a non-decreasing sequence of non-negative real numbers; also the numbers grow without bound if the sequence is infinite. A timed element is a tuple (x, t) where t is a non-negative real and x is an element drawn from an arbitrary domain. A timed state for automaton A is a timed element (s, t) where s is a state of A. A timed action for automaton A is a timed element (a, t) where a is an action of A. Let X = (x_0, t_0), (x_1, t_1), ... be a sequence of timed elements. We will use x_j.time (which is read as the time associated with element x_j) to denote t_j. We say that element x_j occurs within time t of element x_i if j > i and x_j.time ≤ x_i.time + t. We will use X.start (which is read as the start time of X) to denote t_0. An execution α of automaton A is an alternating sequence of timed states and actions of A of the form (s_0, t_0), (a_1, t_1), (s_1, t_1), (a_2, t_2), (s_2, t_2), ... such that the following conditions hold:

1. s_0 ∈ start(A) and (s_i, a_{i+1}, s_{i+1}) ∈ R(A) for all i ≥ 0.

2. The sequence can either be finite or infinite, but if finite it must end with a timed state.

3. The sequence t_0, t_1, t_2, ... is a time sequence.

4. If any action in any class c is enabled in any state s_i of α, then within time s_i.time + t_c either some action in c occurs or some state s_j occurs in which every action in c is disabled.

Notice that the time assigned to any event a_i in α (i.e., a_i.time) is equal to the time assigned to the next state (i.e., s_i.time). Notice also that we have ruled out the possibility of so-called "Zeno executions" in which the execution is infinite but time stays within some bound. Our definitions of composition are identical to those in [MMT91].

This article was processed using the LaTeX 2e macro package with SIROCCO class


Delay-Insensitive Stabilization

Anish Arora¹ and Mohamed G. Gouda²

¹ Department of Computer and Information Sciences, The Ohio State University, Columbus, OH 43210
² Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712

Abstract. We consider a class of systems, each of which satisfies the following property of system stabilization: starting from any state, the system will reach a fixed point. We show how to add asynchronous delays to systems in this class, and present several results concerning the effect of adding or removing delays on the property of system stabilization. First, we show that adding or removing delays does change the set of fixed points of a system. Second, we show that removing a delay preserves system stabilization, but adding a delay may not. Third, we identify three types of delays: non-cyclic, short, and long. We show that adding non-cyclic or short delays preserves system stabilization. We also show that under some lax conditions, adding long delays preserves system stabilization.

1 Introduction

A computing system is stabilizing iff starting from an arbitrary, possibly illegitimate state, the system is guaranteed to reach a legitimate state in a finite number of steps. Stabilization is a fundamental property of computing systems. For instance, the reset of a system, the ability of a system to tolerate faults, and the ability of a system to adapt to changes in its environment can all be viewed as special forms of stabilization. See for example [2], [3], [4], and [5].

Unfortunately, stabilization is a "fragile" property. Many seemingly tame transformations of a stabilizing system can disrupt the stabilization of the system [6]. Thus, transformations of stabilizing systems should be allowed only after showing that these transformations do not disrupt system stabilization. In this paper, we discuss an important class of transformations of stabilizing systems, namely the adding or removing of "asynchronous delays". These transformations are important because any implementation of a system is bound to add such delays to the system or to remove such delays from the system. Although these transformations clearly preserve most properties of the original system, it is not obvious that they do preserve the property of stabilization. The main result of this paper is that removing delays preserves system stabilization but adding delays may not, except in some special cases. We start our presentation in the next section by identifying a rich class of systems and showing how to add delays to each system in this class.

2 Systems with Delays

A system S consists of some variables x, y, ..., and z, and an equal number of assignment statements of the form

    x := F(x, y, ..., z)
    y := G(x, y, ..., z)
    ...
    z := H(x, y, ..., z)

where F, G, ..., and H are total functions over the variables in S.

A state of a system S is a mapping that maps each variable in S to a

value from the domain of that variable.

A transition of a system S is a triple (p, s, q), where p and q are states of S, s is an assignment statement in S, and executing statement s when S is in state p yields S in state q. For any transition (p, s, q), p is called the tail state of the transition, and q is called the head state of the transition. A computation of a system S is an infinite sequence of transitions of S such that the following two conditions hold.

i. Order: In the sequence, the head state of each transition is the same as the tail state of the next transition.

ii. Fairness: Each assignment statement in S appears infinitely many times in the sequence.

The tail state of the first transition in a computation is called the initial state of the computation. Moreover, if a transition in a computation has a state p (as the tail or head state of that transition), then the computation is said to reach state p. A state p of a system S is called a fixed point iff each transition (p,

s, q) of S is such that p = q. A system S is stabilizing iff each computation of S reaches a fixed

point. Let x be a variable in a system S. A delay can be added to variable x by modifying system S as follows.

i. Add a new variable dx whose domain of values is the same as that of variable x.

ii. Add an assignment statement of the form dx := x.

iii. Replace each occurrence of x by dx in the assignment statement of every variable, other than x and dx.

In the resulting system, variable x is referred to as a delayed variable, and variable dx is referred to as a delay variable. In the next section, we discuss an example where delays are added to some stabilizing system in order to facilitate the implementation of this system by a network of communicating processes. In this case, it is important that the resulting system after adding the delays has all the interesting properties of the original system, including the property of system stabilization.

3 Example of a System with Delays

Let S be a system that consists of variables x and y (whose values range over the positive integers), and the two assignment statements

    x := if x > y then x - y else x
    y := if x < y then y - x else y
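The two statements compute a common divisor by repeated subtraction; a state is a fixed point exactly when x = y. The Python sketch below (our own experiment, not part of the paper) applies the delay-adding rules i-iii of Section 2 to variable x and runs the resulting three-statement system under a round-robin, hence fair, schedule, so one can observe whether a fixed point is reached from an arbitrary start state, including an arbitrary initial value of the delay variable dx.

# Sketch: the example system S with a delay added to variable x (rules i-iii of Sec. 2).
# New variable dx, new statement dx := x, and x replaced by dx in the statement of y.
def step(name, s):
    x, y, dx = s["x"], s["y"], s["dx"]
    if name == "x":
        s["x"] = x - y if x > y else x
    elif name == "y":
        s["y"] = y - dx if dx < y else y      # occurrence of x replaced by dx
    elif name == "dx":
        s["dx"] = x

def run(x0, y0, dx0, max_rounds=100):
    s = {"x": x0, "y": y0, "dx": dx0}
    for r in range(max_rounds):
        before = dict(s)
        for name in ("x", "y", "dx"):          # round-robin schedule: fair
            step(name, s)
        if s == before:                        # no statement changes the state
            return r, s                        # a fixed point has been reached
    return None, s

print(run(12, 18, 7))   # observe whether (and when) a fixed point with x = y = dx appears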

(a) Linear execution E    (b) Linear execution E'

Fig. 1. Linear executions E and E' in Lemma 1

where the states of p_j and p_{j-1} are s_a and s_b respectively. Let t be the state transition sequence of p_j. The state of p_j at c_1 is s_a, since R is a round of a linear execution. Let d_1 be the configuration obtained from c_1 by changing the state of p_{j-1} to s_b. Notice that d_1 is a 1-faulty configuration and p_{j-1} is the faulty process of d_1. Process p_i has a privilege at d_1, and the states of p_j and p_{j-1} are s_a and s_b respectively. Now we consider another linear execution E' starting from d_1 (Fig. 1). Since the state transition of p_j depends only on the states of p_j and p_{j-1}, the state transition sequence t of p_j can be applied to d_1. In execution E', the state transition sequence t of p_j is applied to d_1. At the configuration that results from the application of t to d_1, both p_i and p_j have privileges. Thus, there exists an execution from a 1-faulty configuration such that it reaches a configuration where two non-faulty processes have privileges. This contradicts the fact that the protocol A is a superstabilizing mutual exclusion protocol. □

It follows from Lemma 6 that the latency of any superstabilizing mutual exclusion protocol is ⌈n/2⌉ or more. For the case that n is even, we can improve the lower bound by one. The detailed proof of Lemma 7 is presented in the Appendix.

Lemma 7. If n is even, there exists no n/2-latent superstabilizing mutual exclusion protocol on unidirectional rings. □

From Lemma 6 and Lemma 7, we can prove the following theorem.

Theorem 8. There exists no ⌈n/2⌉-latent superstabilizing mutual exclusion protocol on unidirectional rings.

4 Superstabilizing mutual exclusion protocol

In this section, we propose a (⌈n/2⌉ + 1)-latent superstabilizing mutual exclusion protocol on unidirectional rings.

4.1 Herman's superstabilizing mutual exclusion protocol

Herman [6] adopts the process-register model, and presents two superstabilizing mutual exclusion protocols on unidirectional rings: one is an n-latent protocol using n registers and the other is a 1-latent protocol using 2n registers. The n-register protocol can be directly applied to the shared variable model. We briefly describe the protocol. Herman's protocol uses one major token and n minor tokens. All processes share the major token and every process has its own minor token. All of the tokens are implemented by Dijkstra's self-stabilizing mutual exclusion protocol [1]. The major token represents a privilege; however, having the major token is not a sufficient condition to acquire a privilege. When process p_i receives the major token, it sends its own minor token to p_{i+1}. The minor token circulates the ring. When the minor token returns to p_i, p_i acquires a privilege. On releasing the privilege, p_i passes the major token to p_{i+1}. Process p_{i+1} acquires a privilege after its minor token circulates the ring.
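Both token variables are maintained with Dijkstra's self-stabilizing mutual exclusion protocol [1]. As a reminder of that building block (a Python sketch of the classic K-state algorithm with K larger than the ring size, not the paper's Figure 2), the distinguished process p_0 moves when its value equals its predecessor's, every other process copies its predecessor when the two values differ, and after stabilization exactly one process is privileged at a time.

# Sketch of Dijkstra's K-state token ring, the building block cited as [1] (illustration).
import random

def privileged(x, i, K):
    return x[i] == x[i - 1] if i == 0 else x[i] != x[i - 1]   # x[-1] is p_0's predecessor

def move(x, i, K):
    if i == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[i] = x[i - 1]

def run(x, K, steps=200):
    for _ in range(steps):
        holders = [i for i in range(len(x)) if privileged(x, i, K)]
        move(x, random.choice(holders), K)      # a central demon picks one privileged process
    return [i for i in range(len(x)) if privileged(x, i, K)]

random.seed(0)
print(run([3, 1, 4, 1, 5], K=7))   # from an arbitrary start, eventually a single holder remains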

3 This estimation follows from Herman [6]. But in our formulation, it should be estimated as n + 1. For example, consider a legitimate configuration immediately after p_i's release of a privilege. It requires n + 1 rounds until p_i acquires a privilege again. First, every process must make a move in the order p_{i+1}, p_{i+2}, ..., p_i to circulate p_{i+1}'s minor token. Then p_{i+1} makes moves to acquire and release a privilege. Next, every process must make a move in the order p_{i+2}, p_{i+3}, ..., p_{i+1} to circulate p_{i+2}'s minor token, and so on. Thus, every process makes moves n + 1 times repeatedly in the circular order p_{i+1}, p_{i+2}, ..., p_i.

4.2 (⌈n/2⌉ + 1)-latent superstabilizing mutual exclusion protocol

We present a (⌈n/2⌉ + 1)-latent superstabilizing protocol P with space complexity (per process) of O(log n). Compared with Herman's protocol (for the n-register model), it improves both the latency and the space complexity. The protocol P is based on Herman's protocol, and mainly two modifications are made: one for improving the space complexity and the other for improving the latency. The first modification is that all processes share one minor token instead of using n minor tokens. The role of the major and the minor tokens in the protocol P is the same as in Herman's protocol. The major token represents a privilege, and advances only after the minor token circulates once around the ring. However, the major token advances two positions in P, while it advances only one position in Herman's protocol. Thus, two neighboring processes consecutively acquire privileges in P after the minor token circulates once around the ring. This is the second modification, for improving the latency. Notice that three or more processes cannot consecutively acquire privileges, since Lemma 6 shows that at most two processes can acquire privileges in one round. The above idea is implemented in the protocol P as follows. For each i (0 ≤ i ≤ n/2 − 1), processes p_{2i} and p_{2i+1} form a pair, and consecutively acquire privileges after the minor token circulation. To make the pairs, a simple self-stabilizing coloring protocol is used. When p_{2i} gets the major token, it waits for arrival of the minor token. In any execution starting from any legitimate configuration, the minor token reaches p_{2i} together with the major token. On receiving the minor token, p_{2i} enters the waiting state and passes only the minor token to p_{2i+1} (p_{2i} keeps the major token). On receiving the minor token, p_{2i+1} enters the waiting state and passes the minor token to p_{2i+2}. The minor token passes through all processes and returns to p_{2i}. No process other than p_{2i} and p_{2i+1} enters the waiting state during the minor token circulation. When the minor token returns to p_{2i}, p_{2i} acquires a privilege. On releasing the privilege, p_{2i} leaves the waiting state and passes both the major and the minor tokens to p_{2i+1}. On receiving the tokens, p_{2i+1} acquires a privilege. On releasing the privilege, p_{2i+1} leaves the waiting state and passes both the major and the minor tokens to p_{2i+2}. Figure 2 shows the protocol P. For simplicity, the action of each process is presented as a program instead of a state transition function. To clarify the part of the program corresponding to a single state transition, the program contains the comments "start of a state transition" and "end of a state transition". Each process p_i (0 ≤ i ≤ n − 1) has the following variables.

- major_i, minor_i: variables for implementing the major and the minor tokens, respectively. Each variable stores an integer ranging from 0 to n and is maintained using Dijkstra's self-stabilizing mutual exclusion protocol [1]. In the rest of this paper, (major_i + 1) mod (n + 1) is simply denoted by major_i + 1. Similar notation is used for the variable minor_i. Process p_i has the major token if the following holds. In case of i = 0: major_0 = major_{n-1}. In case of 1 ≤ i ≤ n − 1: major_i + 1 = major_{i-1}. The condition for having the minor token is similarly defined.
- col_i: a variable for the process color that is utilized to form process pairs. In any legitimate configuration, it should be col_i = i mod 2.
- wait_i: a boolean variable denoting that p_i is in the waiting state. Process p_i is in the waiting state if and only if wait_i = true.
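The token-holding conditions above can be made concrete with a small Python sketch (ours, not from the paper); the function names and the plain-integer representation of the counters are assumptions for illustration only.

def has_major_token(i, major, n):
    # process 0 holds the major token when its counter equals its left
    # neighbour's; any other process holds it when adding one to its own
    # counter (mod n + 1) yields the left neighbour's value, as stated above
    if i == 0:
        return major[0] == major[n - 1]
    return (major[i] + 1) % (n + 1) == major[i - 1]

def has_minor_token(i, minor, n):
    # the condition for the minor token is defined in the same way
    return has_major_token(i, minor, n)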

4.3 Correctness

The legitimate configurations of the protocol P are defined as follows.

Definition 9. A configuration c of the protocol P is a legitimate configuration if and only if the following four conditions are satisfied.
1. For each i (0 ≤ i ≤ n − 1), col_i = i mod 2 holds.
2. There exists i (0 ≤ i ≤ n − 1) such that major_j = major_0 holds for each j (0 ≤ j ≤ i) and major_k + 1 = major_0 holds for each k (i + 1 ≤ k ≤ n − 1).
3. There exists i (0 ≤ i ≤ n − 1) such that minor_j = minor_0 holds for each j (0 ≤ j ≤ i) and minor_k + 1 = minor_0 holds for each k (i + 1 ≤ k ≤ n − 1).
4. Let p_i and p_j be the processes that have the major token and the minor token respectively.
   (a) In case of i = j:
       i. In case of col_i = 0: wait_i = wait_{i+1} (if i + 1 ≤ n − 1), and wait_k = false for any k (k ≠ i, i + 1).
       ii. In case of col_i = 1: wait_i = true, and wait_k = false for any k (k ≠ i).
   (b) In case of i ≠ j: col_i = 0 and wait_i = true hold, and the following holds.
       i. In case of j = i + 1: wait_k = false for any k (k ≠ i).
       ii. In case of j ≠ i + 1: wait_{i+1} = true (if i + 1 ≤ n − 1), and wait_k = false for any k (k ≠ i, i + 1).


2) is at least nΔ for an infinite number of networks.


9 A finite state synchronous algorithm

The algorithm we discussed uses an unbounded number of states. This cannot be avoided in general, since it is possible to build predicates to which no finite-state algorithm can self-stabilize. Nonetheless, we can still characterize such predicates, and give universal algorithms also for the finite state case; this can be done with a multiplicative quiescence time loss of α(m), where α grows very slowly (in fact, more slowly than log*). More formally, if f(x) is the inverse of x^x, then

    α(x) = 0 if x ≤ 2, and α(x) = α(f(x)) + 1 otherwise.

A predicate P is uniform (with respect to C) iff P_B is nonempty for all B. Clearly, a uniform predicate is anonymously computable on every finite subclass of C. Note that in this case the oracle can be made to depend only on B.

Theorem 15. Let C be a class of synchronous networks, and P a predicate on C computable in any finite subclass of C. Then, a finite state program which stabilizes to P exists if P is uniform. In this case, there is a finite state program which self-stabilizes to P in nΔ(α(m) + 1) steps on every network of C with n nodes, m arcs and diameter Δ.

Proof. (Sketch of the second part) Each processor keeps a guess m_i > 1 such that 2m_i − 1 levels of the universal bundle are sufficient in order to build the minimum base. Each processor never builds trees taller than 2m_i − 1. Moreover, at each step we update the guess by setting m_i ← sup_{j→i, j≠i} m_j. In the worst case, after Δ steps every processor will have a guess not smaller than the maximum guess M in the initial state, and after M steps all processors will possess M correct levels. Now at least one processor can detect locally that M is not a correct guess, and thus will update its guess by m_i ← m_i^{m_i}. Again, after Δ steps every processor will possess the new guess. The number of required rounds is α(m), after which m_i is no longer increased. Clearly the algorithm is finite-state on any given network (due to the conditions on P, the call to the oracle depends only on B). □

Note that, unless a precise space bound is required, the loss can be reduced arbitrarily; for instance, by updating our guess using Ackermann's function, we would have a much smaller loss; thus, the gap with our lower bound can be made arbitrarily small. We however conjecture that there is no universal finite state self-stabilizing algorithm with O(nΔ) quiescence time.

Remark. In the infinite state case all processors end up with a "clock" (the height of the tree T) which is synchronized. This is the feature that allows us to generalize our results to predicates in temporal logic (or, more generally, to any behaviour specified as a sequence of tuples of states). In the case of Theorem 15, this is not true. However, we can still give a finite state algorithm which provides all processors with a synchronized clock with at least K values, where K is a given constant. This is sufficient in order to self-stabilize to any finite-state behaviour to which the network can self-stabilize, since it must be ultimately cyclic. A description of the algorithm will be given in the full paper; the main idea is to exploit the differences in the values of the clocks in order to estimate the size of the network, until stabilization (the clocks are updated with a standard catch-down technique). We execute the algorithm described in Theorem 15, but if we obtain a candidate minimum base in which the local identifiers induced by the clocks are not all equal, we increase the guess m and consequently the number of clock values. If we stabilize to a minimum base in which all local identifiers induced by the clocks are the same, it is certainly the minimum base of the network (since the clocks play no rôle); moreover, the clocks must be synchronized.
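The guess-maintenance step in the sketch above can be illustrated as follows (a minimal sketch, assuming our reading of the garbled update rules; the names are ours).

def update_guess(m, in_neighbours, i):
    # each step: adopt the largest guess among the in-neighbours
    return max(m[j] for j in in_neighbours[i])

def escalate(m_i):
    # escalation when the current guess is detected to be too small; since the
    # paper's alpha is the inverse of x**x, the map x -> x**x bounds the number
    # of escalations by alpha(m)
    return m_i ** m_i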

10 Self-stabilization: the asynchronous case

In this section we briefly sketch the ideas behind our results in the asynchronous case. At every step, any set of enabled processors can now be activated at the same time. Thus, a computation is given by an initial state and by a sequence A_0, A_1, ... of sets of activated processors (which of course must be enabled). By convention, the state of the network at time t is the state just before the processors in A_t are activated. Thus, the state at time 0 is the initial state of the network. We denote with #i(t) the number of times processor i has been activated by time t, i.e., #i(t) = |{t' | i ∈ A_{t'}, t' < t}|.

Theorem 16. Let C be a class of asynchronous networks, and P a uniform predicate on C. Then there is a program which self-stabilizes to P in O(Δn²) steps on every network of C with n nodes and diameter Δ.

We associate a synchronizing catch-up clock C_i to each processor, and we stipulate that a processor i is not enabled unless for all in-neighbours j we have C_j ≥ C_i. After an activation, we set C_i ← max_{j→i} C_j + 1. The only property of the clock we shall use is the following one:
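As an illustration (not from the paper), the enabling rule and the clock update can be written as follows; the dictionary representation and the name in_neighbours are our assumptions.

def enabled(i, C, in_neighbours):
    # a processor is enabled only when no in-neighbour's clock lags behind its own
    return all(C[j] >= C[i] for j in in_neighbours[i])

def activate(i, C, in_neighbours):
    # after an activation, C_i jumps just past the largest in-neighbour clock
    C[i] = max(C[j] for j in in_neighbours[i]) + 1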

Lemma 17. At every time t, #j(t) ≥ #i(t) − 1 for all in-neighbours j of i. Moreover, if i is enabled at time t then #j(t) ≥ #i(t).

Proof. Note that between any two consecutive activations of the same processor, all its in-neighbours must be activated at least once. This implies the first part of the statement. For the second part, consider an activation with A_t = {i}: since every in-neighbour j of i has been activated since i's previous activation, #j(t) ≥ #i(t). □

As in the synchronous case, the correctness level of each processor increases with time; however, the exact number of correct levels now depends on the number of activations: letting c_i(t) denote the number of correct levels of processor i at time t, and c(t) = min_i c_i(t), we have

Lemma 18. c_i(t) ≥ c(0) + #i(t).

Proof. By induction on t. The base case is trivial. If i ∉ A_t, the claim is true by induction; otherwise, using Lemma 17, we have

    c_i(t+1) ≥ min_{j→i, j≠i} c_j(t) + 1 ≥ min_{j→i, j≠i} (c(0) + #j(t)) + 1 ≥ c(0) + #i(t) + 1 = c(0) + #i(t+1).

This proves that the number of correct levels ultimately increases. In order to prove the convergence of our algorithm, we introduce the net correctness level

    c̄_i(t) = c_i(t) − #i(t).

Correspondingly, we have the minimized version c̄(t) = min_i c̄_i(t). Note that c(0) = c̄(0). Finally, we say that a processor i is perfect at time t iff

Lemma 19. The following properties hold:
1. c̄(t) is a nondecreasing function.
2. If for all in-neighbours j of i we have c̄_i(t) ≤ c̄_j(t), and i is enabled at time t, then also c_i(t) ≤ c_j(t).
3. If c̄_i(t) = c̄(t) and i ∉ A_t then c̄_i(t+1) = c̄(t+1).
4. c̄(t) is a constant function.
5. If c̄_i(t) = c̄(t) and i ∈ A_t then i is perfect at time t + 1.

Proof. (1). If i ∉ A_t then c̄_i(t+1) = c̄_i(t) ≥ c̄(t). If instead i ∈ A_t, using Lemma 17 we have

    c̄_i(t+1) ≥ min_{j→i, j≠i} (c_j(t) + 1) − #i(t) − 1
             = min_{j→i, j≠i} (c̄_j(t) + #j(t)) − #i(t)
             ≥ min_{j→i, j≠i} (c̄_j(t) + #i(t)) − #i(t)
             = min_{j→i, j≠i} c̄_j(t) ≥ c̄(t).

Thus, for all i we obtain c̄_i(t+1) ≥ c̄(t).
(2). In this case, Lemma 17 implies

    c_i(t) = c̄_i(t) + #i(t) ≤ c̄_j(t) + #j(t) = c_j(t).

(3). If i ∉ A_t, then
(4). By (3), c̄(t) can increase only if a processor minimizing c̄_i(t) is activated. However, in this case by (2) we have
(5). Just note that by (2)

We are now in a position to prove that

Lemma 20. A perfect processor has a correct tree; moreover, perfection is stable.

Proof: By Lemma 18,

Finally, note that in the equation

whenever i is activated both the left side (by Lemma 18) and the right side grow by one. If i is not activated, both sides maintain their value. □
The correctness and convergence proof now follows by noting that a processor minimizing c̄_i(t) retains this property until activated, and then becomes perfect. Moreover, during any scheduling the following statement must hold:

Lemma 21. After (k + Δ)n steps, all processors have been activated at least k times.

Thus, in (Δ + 1)n steps all processors have been activated, and so in (Δ + 1)n² steps all processors have been activated in every possible order. This implies that they are all perfect, and since P is uniform, the knowledge of the minimum base is sufficient for stabilization to P (as in Section 9). We conjecture that the algorithm really requires Θ(n²) steps for quiescence. In the full paper an Ω(n²) lower bound will be proved by considering predicates in the future fragment of temporal logic.

11 Conclusions

We have exhibited a series of (preliminary) results about a theory of universal self-stabilizing algorithms. Our aim is to "factor out" of a self-stabilization problem the coordination part, showing that it can always be reduced to a single algorithm (much like a universal Turing machine "factors out" all the computational power of recursive functions). The results are not complete, and we would like to highlight the most important open problems, and to sketch some additional results which we have omitted for lack of space. The asynchronous algorithm we described in Section 10 can self-stabilize to uniform predicates, but it is easy to see that there are more predicates to which an asynchronous network can self-stabilize. We conjecture that our algorithm is universal, but we should first characterize the asynchronously computable predicates. The algorithm can of course be made finite-state with the same techniques of Section 9, and it becomes universal for predicates (but the characterization problem arises again when behaviours depending on time are considered). We remark that the whole theory can be applied to interleaved networks, in which a central daemon chooses a single processor to be activated among the enabled ones. Essentially, one just considers only discrete fibrations in the characterization theorems (a fibration is discrete iff its fibres have singletons as strongly connected components). The self-stabilizing algorithms are then extended following the ideas of [4].

References
1. Yehuda Afek, Shay Kutten, and Moti Yung. Memory-efficient self-stabilizing protocols for general networks. In Proc. of the International Workshop on Distributed Algorithms, number 486 in LNCS, pages 15-28. Springer-Verlag, 1991.
2. Dana Angluin. Global and local properties in networks of processors. In Proc. 12th Symposium on the Theory of Computing, pages 82-93, 1980.
3. Baruch Awerbuch, Boaz Patt-Shamir, and George Varghese. Self-stabilization by local checking and correction. In Proc. 32nd Symposium on Foundations of Computer Science, pages 268-277, 1991.
4. Paolo Boldi, Bruno Codenotti, Peter Gemmell, Shella Shammah, Janos Simon, and Sebastiano Vigna. Symmetry breaking in anonymous networks: Characterizations. In Proc. 4th Israeli Symposium on Theory of Computing and Systems. IEEE Press, 1996.
5. Paolo Boldi and Sebastiano Vigna. Graph fibrations. Preprint, 1996.
6. Paolo Boldi and Sebastiano Vigna. Computing vector functions on anonymous networks. In Structure, Information and Communication Complexity. Proc. 4th Colloquium SIROCCO '97, International Informatics Series. Carleton University Press, 1997. To appear. An abstract appeared also as a Brief Announcement in Proc. PODC '97.
7. James E. Burns, Mohamed G. Gouda, and Raymond E. Miller. Stabilization and pseudo-stabilization. Distributed Computing, 7:35-42, 1993.
8. E.W. Dijkstra. Self-stabilizing systems in spite of distributed control. CACM, 17(11):643-644, 1974.
9. Shmuel Katz and Kenneth J. Perry. Self-stabilizing extensions for message-passing systems. Distributed Computing, 7:17-26, 1993.
10. Nancy Norris. Universal covers of graphs: Isomorphism to depth n − 1 implies isomorphism to all depths. Discrete Applied Mathematics, 56:61-74, 1995.
11. Masafumi Yamashita and Tiko Kameda. Computing on anonymous networks. In Proc. of the 4th PODC, pages 13-22, 1985.
12. Masafumi Yamashita and Tiko Kameda. Computing functions on asynchronous anonymous networks. Math. Systems Theory, 29:331-356, 1996.


Trade-offs in Fault-Containing Self-Stabilization

Sukumar Ghosh    Sriram V. Pemmaraju

Department of Computer Science, University of Iowa, Iowa City, IA 52242, USA. Email: {ghosh, sriram}@cs.uiowa.edu. This author's research was supported in part by the National Science Foundation under grant CCR-9402050.

Abstract. This paper demonstrates the feasibility of constructing fault-containing, self-stabilizing protocols that allow the user to fine-tune the performance of the protocols via the choice of values for certain program parameters. Based on the fault-history of the protocol, the user can choose appropriate values for program parameters and select desirable performance guarantees for various classes of faults. As an example, parameterized versions of Dijkstra's K-state mutual exclusion protocol are presented; these allow the user to trade off between performance measures such as stabilization time, k-fault-containment time, and token size.

Introduction

Informally, a self-stabilizing protocol is called fault-containing if, in addition to ensuring eventual convergence to a legitimate state from an arbitrary state, the protocol provides extra guarantees during convergence from states with "limited" faults. For example, an extra guarantee that a fault-containing self-stabilizing protocol may provide is convergence in O(1) time after a single process fault. Alternatively, such a protocol might guarantee that only processes at O(1) distance from the faulty process make state changes during recovery from a single process fault. The kind of "limited" faults for which fault-containing self-stabilizing protocols provide extra guarantees also depends on the context. For example, in one context limited faults might mean a change in the states of a small number of processes; in another context limited faults might mean a certain kind of change in the network topology. For a precise definition of fault-containment in the context of self-stabilization see [4]. For this paper, the informal definition of fault-containment given above will suffice. Self-stabilizing protocols provide a type of fault-tolerance that is said to be non-masking. This is because the users of a self-stabilizing system can observe disrupted behavior while the system recovers to a legitimate state. Given a non-masking fault-tolerant system, one would hope that the level of disruption observable by users is proportional to the severity of the fault causing the disruption. Unfortunately, many self-stabilizing systems do not have this property: in some cases even a single bit corruption can lead to an observable state change in all processes, and the system may take a large amount of time to recover to a legitimate state. Given that in practice, limited faults are much

more common than faults that corrupt large portions of a distributed system, the above observations point out a major limitation of self-stabilizing systems. One way to alleviate this problem is to add the property of fault-containment to self-stabilizing protocols. Fault-containment is especially important in view of rapidly growing network sizes. Recently fault-containing self-stabilizing protocols have been constructed in [3, 5, 6] for the following problems: leader election, spanning tree construction, and construction of a breadth-first spanning tree. These examples demonstrate the feasibility of adding fault-containment to self-stabilizing protocols, as well as the difficulties involved in constructing such protocols. All three examples focus on single process faults and present protocols that, in addition to being self-stabilizing, limit the effects of a single process fault to within the neighborhood of the faulty process. A general technique for adding the property of fault-containment to non-reactive self-stabilizing protocols is presented in [4]. This research reveals a potential conflict between fault-containment and self-stabilization; fault-containment cannot be added free of cost to self-stabilizing protocols. In particular, it is shown that there exist self-stabilizing protocols such that adding fault-containment to these protocols necessarily increases the stabilization time of the protocol [10]. In [9], Herman adds fault-containment to a reactive protocol; he devises protocols based on Dijkstra's K-state mutual exclusion protocol [1, 2] that contain the spread of spurious tokens generated by a single process fault. In related work, Gouda and Schneider [8] present a self-stabilizing protocol that constructs a maximum flow tree, and provides the additional guarantee that the protocol will continue to maintain a flow tree while recovering from any change in the capacity of edges. Kutten and Peleg [12] present a class of protocols in a synchronous model for which the recovery time is proportional to the number of transient faults. These protocols are not self-stabilizing and have an unacceptably high space complexity, but the attempt by the authors to link recovery time to the severity of the fault is an important contribution. Kutten and Patt-Shamir [11] present a transformer that takes as input a non-reactive, possibly non-stabilizing protocol and produces as output an equivalent self-stabilizing protocol that recovers in time proportional to the number of faults from any state in which at most half the processes are corrupted by transient faults. Again, the resulting protocol is not very practical due to its huge space requirement. Research on fault-containment, within the context of self-stabilization, has shown the feasibility of constructing fault-containing self-stabilizing protocols. We view adding fault-containment to a self-stabilizing system as a way of "fine-tuning" the system. A user's option to add fault-containment to a self-stabilizing system should ideally be based on the fault-history of the system. As the nature of the faults occurring in the system changes over time, the user would like to respond by changing the nature of fault-containment provided. This implies that a user would like a menu of fault-containment options to choose from, and the fault-containing self-stabilizing protocols we design should be able to provide that menu. For example, if a user who is running a self-stabilizing protocol

that contains single process faults "tightly" observes over a period of time that single process faults are becoming rare in the system, but other types of faults persist as before, then this user may want to relax fault-containment in the hope of improving some other performance measure such as stabilization time or stabilization space. This is because there is typically some overhead involved in tightly containing faults, and the user may not want to pay for this overhead unnecessarily. In this paper we describe protocols that allow a user to do precisely this. We use Dijkstra's K-state mutual exclusion protocol (MUTEX) as the example with which to illustrate the above mentioned ideas. In our main result, we present a mutual exclusion protocol derived from MUTEX that exhibits tradeoffs among three performance measures: (a) worst case time to reach a legitimate state, (b) worst case time to recover from a state with at most k faults, and (c) the "size" of the token circulating around the unidirectional ring. In a token ring, the token size determines the latency of the ring; once a legitimate state has been reached, the token size is an important measure of how quickly the token circulates in the ring. See Varghese [13] for a remark on the importance of circulating a small number of bits in actual token protocols. The performance of the protocol we present can be varied along the three performance dimensions mentioned above by varying certain program parameters. What we present therefore is not a self-stabilizing mutual exclusion protocol that offers specific fault-containment; what we present is a menu of self-stabilizing mutual exclusion protocols that offers the user a variety of fault-containment options.

2 Definitions and Results

We start by making our model of computation precise. All protocols we present run on a network of processes communicating by locally shared memory. In other words, each process has a set of local variables that it can read from and write into, and in addition each process can read from the local variables of neighbors. The local state of a process is the collection of values of its local variables; the state of the protocol is simply the collection of the local states of the processes. All our protocols are presented in the language of guarded commands. We assume that each process repeatedly evaluates its guards and checks if any guard is true and, if so, non-deterministically picks an enabled guard and executes the corresponding action. We assume that each process evaluates a guard and executes the corresponding action in a single atomic step. Such an atomic step by a process is called a move. A move can be represented as a state transition (s, s') where s is the state of the protocol immediately before the move and s' is the state of the protocol immediately after the move. In this model of computation, the execution of a protocol can be viewed as a sequence of moves, in which moves by different processes are interleaved. More precisely, an execution sequence of a protocol P is a maximal sequence of states such that each pair of consecutive states in the sequence is a move. Given a move (s, s') by a process i, the communication size of the move is the number of bits of
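To make the model concrete, here is a minimal sketch (ours, not from the paper) of one interleaved execution step; the representation of processes as lists of (guard, action) pairs is an assumption for illustration.

import random

def one_move(state, processes):
    # each element of `processes` is a list of (guard, action) pairs for one process;
    # a move atomically evaluates one enabled guard and executes its action
    enabled = [(pid, action) for pid, rules in enumerate(processes)
               for guard, action in rules if guard(state, pid)]
    if not enabled:
        return False
    pid, action = random.choice(enabled)   # non-deterministic choice of a move
    action(state, pid)
    return True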

neighbors' local states read by process i during the move. Given a state s of a self-stabilizing protocol P, a stabilizing sequence from s is any minimal execution sequence of P starting in s and ending in a legitimate state of P. The stabilization time from s is the length of the longest stabilizing sequence from s. The stabilization time of P is the largest stabilization time from any state s of P. Define a k-faulty state of P as a state obtained from a legitimate state of P by arbitrarily changing the local state of at most k processes. Define the k-fault-containment time of P to be the largest stabilization time of any k-faulty state. Since, in this paper, we focus on protocols for mutual exclusion, we now provide some definitions specific to mutual exclusion. Suppose that P is a self-stabilizing protocol for mutual exclusion. Consider an execution sequence x of P. Since x is infinite and since P has finitely many distinct states, x has a suffix of the form ωωω... Note that since P is self-stabilizing, every state in ω is legitimate. In other words, within a finite number of state transitions after reaching a legitimate state, P starts cycling through a certain sequence of states (denoted ω in the above). A token advance is a minimal contiguous subsequence T = s_1, s_2, ..., s_k of ω such that there exists a pair of distinct processes i and j such that i can enter its critical section in state s_1 and j can enter its critical section in state s_k. In other words, T = s_1, s_2, ..., s_k is a minimal execution sequence in which the token has advanced from one process (i, in this case) to another (j, in this case). The token size of the token advance T is defined as the sum of the communication sizes of all moves (s_i, s_{i+1}), 1 ≤ i < k, in the token advance T. The token size of the protocol P is the sum of the token sizes of all the token advances in ω divided by the number of token advances in ω. Intuitively, the token size is the average number of bits communicated between neighbors per token advance, and as mentioned earlier, this is a measure of the latency of mutual exclusion protocols. We are now ready to present an overview of the main results in this paper. It is known that MUTEX has a worst case stabilization time of Θ(N²); assuming that K, the number of states each process can take on, is equal to N, it is easy to verify that the token size of MUTEX is log N. We begin Section 3 by showing that MUTEX has a worst case k-fault-containment time of Θ(kN) for any k, 0 < k ≤ N. We then present a new mutual exclusion protocol (which we call NUTEX for "new MUTEX") derived from MUTEX in Section 3.1. In Section 3.2 we show that NUTEX has the following performance measures: worst case stabilization time of Θ(N²(M + 1)), worst case k-fault-containment time of Θ(kN), and token size of (1 − M/(2M + 4N)) log N for any k, 0 < k < N^{1/2} − 2. Here M is a positive integer parameter that can be chosen by the user. These performance measures reveal a trade-off between the stabilization time and token size. As M increases, the worst case stabilization time of NUTEX increases, while the token size of the protocol decreases (and approaches (log N)/2 in the limit). The important feature of NUTEX is that while the stabilization time depends on M, the k-fault-containment time is completely independent of M and is, within a constant factor, the same as the fault-containment time of MUTEX. The reader should also
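The averaging in the token-size definition can be stated as a one-line sketch (ours, for illustration); the list-of-lists representation of the cycle ω is an assumption.

def protocol_token_size(advances):
    # `advances`: one list per token advance in the cycle, each holding the
    # communication size (bits read from neighbors) of every move in that advance
    return sum(sum(a) for a in advances) / len(advances)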

note another cost incurred by NUTEX: unlike MUTEX, NUTEX provides a Θ(kN) worst case recovery time guarantee from a k-faulty state only when 0 < k < N^{1/2} − 2. We subsequently sketch in Section 4 a generalization of NUTEX (which we shall call GNUTEX for "generalized NUTEX") which provides the user with a second program parameter, a positive integer C, 2 ≤ C ≤ (log N)/2, whose value can be appropriately chosen so as to tune the performance of the protocol more finely. When C = 2, GNUTEX is identical to NUTEX. The protocol GNUTEX offers the following performance measures: worst case stabilization time of Θ(N²(M + 1)), worst case k-fault-containment time of Θ(kCN), and token size of (1 − (1 − 1/C)·M/(M + 2N)) log N for any k, 0 < k < N^{1/C} − 2. Here C and M are positive integer parameters that can be chosen by the user. Note that as C increases, the k-fault-containment time of GNUTEX increases, and the token size decreases. Thus in GNUTEX the user has the ability to tune independently both the stabilization time and the k-fault-containment time, differently affecting the token size in the process. Note that in GNUTEX, the token size can be made arbitrarily small by choosing C and M large enough. However, the reader should note that as C grows, the value of k, the number of faults that can be "contained" by the protocol in Θ(kN) time, decreases, since k is strictly bounded above by N^{1/C} − 2.

3 Tradeoff Between Stabilization Time and Token Size

In this section we present our first mutual exclusion protocol NUTEX derived from MUTEX, Dijkstra's K-state mutual exclusion protocol [1] that runs on any unidirectional ring of N processes, where N ≤ K. The processes in the unidirectional ring are labeled 0, 1, ..., N − 1 and for each i, 0 ≤ i < N, process (i − 1) mod N is the left-neighbor of process i. Recall that in MUTEX, each process has a variable x that can take on any value in the range [0..K). (The notation [i..j] denotes the set of integers {k | i ≤ k ≤ j}. As is common when talking about open intervals, we use parentheses ( or ) in denoting sets that do not contain their boundary points. For example, [i..j) denotes [i..j] − {j}.) A process i, 0 < i < N, executes the protocol: do x ≠ x_L → x := x_L od, while process 0 executes the program: do x = x_L → x := x++ od. In the above protocol and in the subsequent protocols that we present, the subscript L following a variable (x_L, for example) denotes that variable belonging to the left neighbor of the process in question, and the operator ++ is used to denote the increment operator with the appropriate modulus (K in this case). It is known that if we take an enabled guard to be a token, then independent of the initial state, the above protocol reaches a state after which a single token keeps circulating around the ring. For the rest of the paper, for simplicity of exposition, we assume that K = N. Given this, it is easy to see that the token size of MUTEX is log N. Below, we show that MUTEX has a worst case k-fault-containment time of Θ(kN). An immediate corollary is that the worst case stabilization time of MUTEX is Θ(N²). Before we establish the bound on the k-fault-containment time of MUTEX we introduce some notation and define the set of legitimate states of MUTEX. We use x_i occasionally to refer to the value of x at process i. We will use identical_x[i, j], where i ≤ j, as a predicate that asserts that the value of x is the same for all processes i, i + 1, ..., j. The legitimate states of MUTEX can then be defined by the following predicate:


Theorem 1. The worst case k-fault-containment time of MUTEX is Θ(kN).


Proof. Define a segment as a maximal sequence i, i+1, ..., j of processes, where j ≥ i, such that x = x_L for each process i+1, i+2, ..., j. It is easy to verify that a k-faulty state has at most 2k + 2 segments. Furthermore, it is also easy to verify that in a k-faulty state, the set {x_i | 0 ≤ i < N} has size at most k + 2. Start with an arbitrary k-faulty state and label the segments in this state s_1, s_2, ..., s_ℓ, where 0 < ℓ ≤ 2k + 2. Let L(i) (respectively, R(i)) denote the smallest (respectively, largest) process in segment s_i. Suppose that the segments are labeled "left-to-right", that is, R(i) + 1 = L(i+1) for all i, 1 ≤ i < ℓ. Consider the ℓ-tuple D = (d_1, d_2, ..., d_ℓ), where d_i = N − 1 − R(i). Think of the N processes in the ring being placed on a horizontal line in the order 0, 1, ..., N − 1. Then d_i can be thought of as the distance between the "right-end" of a segment s_i and process N − 1. Note that d_ℓ = 0 and hence we can write D as (d_1, d_2, ..., d_{ℓ−1}, 0). Clearly, N > d_i > d_{i+1} for all i, 1 ≤ i < ℓ. Each move by MUTEX, depending on whether the move is by process 0 or by some other process, affects D as described below:

1. If a process i, 0 < i < N, makes a move, then i = L(j) for some segment s_j, 1 < j ≤ ℓ. As a result of this move, i leaves segment s_j and joins segment s_{j−1}, and as a result d_{j−1} decreases by 1. If d_{j−1} becomes 0 (that is, s_{j−1} disappears) then d_{j−1} is removed from D.

2. If process 0 makes a move, then there are two possibilities depending on the length of the segment s_1. If the length of s_1 is greater than 1, then a move by process 0 splits s_1 into two segments. The first of these two segments has length 1 and contains process 0 only. The creation of this segment results in the introduction of N − 1 as the first element of D. If the length of s_1 equals 1, then a move by process 0 either leaves s_1 as it is (and as a result D remains unchanged) or causes s_1 to join s_2. This latter event results in the removal of the first element of D.

Now define φ = Σ_{i=1}^{ℓ} d_i. Clearly, in a k-faulty state φ ≤ (2k+2)(N − 1). Each move by MUTEX either increases φ by N − 1, decreases φ by N − 1, decreases φ by 1, or leaves φ unchanged. However, φ can be increased or left unchanged by at most k + 3 moves, because process 0 can move at most k + 3 times before D becomes (0). This is because, as stated earlier, in a k-faulty state the set {x_i | 0 ≤ i < N} has size at most k + 2 and, by the pigeon-hole principle, process 0 makes at most k + 3 moves before incrementing x_0 to a value that is distinct from x_i for all i, 0 < i < N. Once such a "new" value is generated by process 0, the segment s_1 cannot shrink, it can only grow. This implies that φ ≤ (3k+5)(N − 1)

always. This in turn implies that in O(kN) moves, MUTEX reaches a state in which D = (0). In other words, MUTEX reaches a state in which there is exactly one segment in O(kN) time. In such a state the value of x at all processes is identical and hence this state is in L_M. Therefore, the k-fault-containment time of MUTEX is O(kN). Note that once a legitimate state of MUTEX has been reached, D cycles through the sequence of values: (0), (N−1, 0), (N−2, 0), ..., (1, 0).
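For concreteness, the following Python sketch (ours, not part of the paper) simulates the K-state protocol with K = N and counts segments as defined in the proof above; it runs an arbitrary starting state until all values agree.

import random

def enabled(x, i):
    N = len(x)
    return x[i] == x[i - 1] if i == 0 else x[i] != x[i - 1]   # x[-1] is process N-1

def move(x, i):
    N = len(x)
    x[i] = (x[i] + 1) % N if i == 0 else x[i - 1]

def num_segments(x):
    # segments are maximal runs agreeing with the left neighbor (non-circular)
    return 1 + sum(1 for i in range(1, len(x)) if x[i] != x[i - 1])

N = 8
x = [random.randrange(N) for _ in range(N)]     # an arbitrary (possibly faulty) state
moves = 0
while num_segments(x) > 1:                      # run until all values agree
    i = random.choice([i for i in range(N) if enabled(x, i)])
    move(x, i)
    moves += 1
print(moves, x)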

It is easy to construct an example of a k-faulty state for which reaching a legitimate state takes Ω(kN) time. Hence the worst case k-fault-containment time of MUTEX is Θ(kN).

3.1 The protocol NUTEX

To derive NUTEX from MUTEX, we add to each process a variable called mode, that can take on any value in the set {ALL, UP, DOWN}. In addition, we add to process 0 a variable called count, that can take on any value in the range [0..2N + M), where M is a positive integer that can be chosen by the user. Process 0 executes a protocol (shown in Figure 1) that is distinct from the protocol executed by the rest of the processes (shown in Figure 2). In these protocols, we use the notation MSH(y) (respectively, LSH(y)) to denote the most (respectively, least) significant half of an integer y. Before we explain the protocols in Figures 1 and 2 in more detail, we provide some intuition. In NUTEX, processes can operate in one of three modes: ALL, UP, or DOWN. In mode ALL, the system executes MUTEX. This implies that in this mode, all bits of the variable x are used. In mode UP (respectively, DOWN) the system executes a version of MUTEX that only considers the most (respectively, least) significant half of x. NUTEX keeps cycling through these three modes and it is designed so that as long as MUTEX is periodically executed for a long enough duration, stabilization is guaranteed. Once legitimacy is reached, to an observer, MUTEX and NUTEX appear essentially the same. Since NUTEX spends part of its time in modes UP and DOWN, the token size of NUTEX is smaller than the token size of MUTEX. However, precisely because NUTEX spends part of its time in modes UP and DOWN, the stabilization time of NUTEX is greater than the stabilization time of MUTEX. In particular, the longer the time NUTEX spends in modes UP and DOWN, the greater the stabilization time and the smaller the token size. Note that the way in which NUTEX cycles through the three modes ensures that the k-fault-containment time is not affected, as long as 0 < k < N^{1/2} − 2. Respecting this constraint is what makes the design of NUTEX interesting.

Protocol executed by process 0: We now focus on the protocol executed by process 0. The variable count determines the mode in which process 0 should operate. In particular, the range [0..2N + M) of count is divided into two subranges: [0..2N), which corresponds to mode ALL, and [2N..2N + M), which corresponds to modes UP and DOWN. The range [2N..2N + M) is divided into "blocks" of size k + 3, each alternately corresponding to the modes UP and DOWN. In other words, the range

The protocol executed by process 0:

(S1) (¬NeedsUpdate_0) ∧ (IsSame_0) → Increment_0; count := count++
(S2) (NeedsUpdate_0) ∧ (IsSame_0) → Update_0; Increment_0; count := count++

The predicates NeedsUpdate_0 and IsSame_0 are defined as follows:

NeedsUpdate_0 ≡ ((count ∈ [0..2N)) ∧ (mode ≠ ALL)) ∨
                ((count ≥ 2N) ∧ (⌊(count − 2N)/(k + 3)⌋ is even) ∧ (mode ≠ UP)) ∨
                ((count ≥ 2N) ∧ (⌊(count − 2N)/(k + 3)⌋ is odd) ∧ (mode ≠ DOWN)).

IsSame_0 ≡ ((mode = ALL) ∧ (x = x_L)) ∨
           ((mode = UP) ∧ (MSH(x) = MSH(x_L))) ∨
           ((mode = DOWN) ∧ (LSH(x) = LSH(x_L))).

The procedure Increment_0 is defined as:

  (mode = ALL)  → x := x++
  (mode = UP)   → MSH(x) := MSH(x)++
  (mode = DOWN) → LSH(x) := LSH(x)++

The procedure Update_0 is defined as:

  (count ∈ [0..2N))                                 → mode := ALL
  (count ≥ 2N) ∧ (⌊(count − 2N)/(k + 3)⌋ is even)   → mode := UP
  (count ≥ 2N) ∧ (⌊(count − 2N)/(k + 3)⌋ is odd)    → mode := DOWN

Fig. 1. The protocol executed by Process 0.

[2N + i(k + 3)..2N + (i + 1)(k + 3)) corresponds to the UP mode when i is even and to the DOWN mode when i is odd. Equivalently, assuming that count ≥ 2N, the mode at process 0 should be UP (respectively, DOWN) if ⌊(count − 2N)/(k + 3)⌋ is even (respectively, odd). The protocol executed by process 0 contains two guarded statements S1 and S2, and we use G1 and G2 respectively to denote the guards in these statements. G1 is TRUE if process 0 is in the correct mode (a condition denoted by the predicate ¬NeedsUpdate_0) and if the value of x is equal to the value of x_L (a condition denoted by the predicate IsSame_0). Note that depending on its mode, process 0 either considers all bits of x and x_L or only the most (or least) significant halves. If G1 is TRUE, then process 0 increments the appropriate portion of x (see procedure Increment_0) and then increments the value of count. The ++ operator used in the procedure Increment_0 to increment LSH(x) and MSH(x) is modulo N/2. The guard G2 is TRUE if the mode of process 0 is not correct with respect to count, but, depending on the current mode of process 0, the appropriate portions of x and x_L are equal. If G2 is TRUE, then process 0 first updates the mode variable, then increments the appropriate portion of x, and finally increments count. The procedure Update_0 used to update the value of mode is shown in Figure 1. Note that every execution of a guarded statement by process 0 is accompanied by incrementing of count.

Program executed by Process i > 0: Process i, for each i > 0, executes the protocol shown in Figure 2. The guarded statement in the protocol is labeled S3 and we use G3 to denote the corresponding guard. If guard G3 is TRUE then process i differs from its left neighbor either in the value of mode or in the value of the appropriate portion of x. As the name suggests, the predicate IsDistinct_i is responsible for comparing the appropriate portions of x and x_L to check if they are distinct. If G3 is TRUE, then process i first copies the value of mode_L into mode and then copies the value of x_L into x. The procedure Copy_i is responsible for copying the appropriate portions of x_L into x.
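A compact Python transcription of Figures 1 and 2 (ours, for illustration only) follows; the bit-level representation of the two halves of x, the wrap-around used when incrementing a half, and the parameter names are our assumptions.

def msh(x, half_bits):
    return x >> half_bits                       # most significant half of x

def lsh(x, half_bits):
    return x & ((1 << half_bits) - 1)           # least significant half of x

def correct_mode(count, N, k):
    # the mode process 0 should be in, as dictated by count
    if count < 2 * N:
        return 'ALL'
    return 'UP' if ((count - 2 * N) // (k + 3)) % 2 == 0 else 'DOWN'

def is_same(mode, x, x_left, half_bits):
    if mode == 'ALL':
        return x == x_left
    if mode == 'UP':
        return msh(x, half_bits) == msh(x_left, half_bits)
    return lsh(x, half_bits) == lsh(x_left, half_bits)

def step_process0(st, x_left, N, M, k, half_bits):
    # one atomic move of process 0 (statements S1/S2 of Figure 1)
    if not is_same(st['mode'], st['x'], x_left, half_bits):
        return False
    st['mode'] = correct_mode(st['count'], N, k)     # Update_0 (a no-op under S1)
    hi, lo, mod = msh(st['x'], half_bits), lsh(st['x'], half_bits), 1 << half_bits
    if st['mode'] == 'ALL':
        st['x'] = (st['x'] + 1) % N                  # Increment_0 in mode ALL
    elif st['mode'] == 'UP':
        st['x'] = (((hi + 1) % mod) << half_bits) | lo
    else:
        st['x'] = (hi << half_bits) | ((lo + 1) % mod)
    st['count'] = (st['count'] + 1) % (2 * N + M)
    return True

def step_other(st, mode_left, x_left, half_bits):
    # one atomic move of a process i > 0 (statement S3 of Figure 2)
    distinct = not is_same(st['mode'], st['x'], x_left, half_bits)
    if st['mode'] != mode_left or distinct:
        st['mode'] = mode_left                       # mode := mode_L, then Copy_i
        if st['mode'] == 'ALL':
            st['x'] = x_left
        elif st['mode'] == 'UP':
            st['x'] = (msh(x_left, half_bits) << half_bits) | lsh(st['x'], half_bits)
        else:
            st['x'] = (msh(st['x'], half_bits) << half_bits) | lsh(x_left, half_bits)
        return True
    return False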

3.2 Proving properties of NUTEX

The set of legitimate states of NUTEX is defined by a predicate L_N. We express L_N as the disjunction L_N ≡ S_N ∨ T_N, where S_N represents single-mode legitimate states of L_N, that is, legitimate states in which all processes have the same mode, and T_N represents legitimate states in which there is a transition from one mode to another. We define the predicates S_N and T_N below. The reader should note that in both definitions, the variable k is quantified over the range [0..N).

S_N ≡ ∃k : (… ∧ identical_x[k, N − 1])

The protocol executed by each process i > 0 is:

(S3) (mode ≠ mode_L) ∨ (IsDistinct_i) → mode := mode_L; Copy_i

The predicate IsDistinct_i is defined as:

IsDistinct_i ≡ ((mode = ALL) ∧ (x ≠ x_L)) ∨
               ((mode = UP) ∧ (MSH(x) ≠ MSH(x_L))) ∨
               ((mode = DOWN) ∧ (LSH(x) ≠ LSH(x_L))).

The procedure Copy_i is defined as:

  (mode = ALL)  → x := x_L
  (mode = UP)   → MSH(x) := MSH(x_L)
  (mode = DOWN) → LSH(x) := LSH(x_L)

Fig. 2. The protocol executed by process i > 0.

T_N ≡ ∃k : identical_x[0, k − 1] ∧ identical_mode[0, k − 1] ∧ identical_x[k, N − 1] ∧ identical_mode[k, N − 1].

In the above we use the notation identical_mode[i, j] for the predicate that asserts that the value of mode is the same for all processes i, i+1, ..., j. The predicate identical_mode[i, j] is taken to be TRUE if i > j. In a state satisfying S_N, all processes have the same mode, and in a state satisfying T_N, the system is making a transition from the "old" mode that processes k through N − 1 are in, to the "new" mode that processes 0 through k − 1 are in. It is easy to verify that in a state satisfying L_N exactly one process has an enabled guard. Suppose that process i, 0 ≤ i < N, has an enabled guard in a state satisfying L_N. Then a move by i in such a state leads to another state satisfying L_N in which the process (i + 1) mod N has an enabled guard. From the protocols shown in Figures 1 and 2 it can be seen that if a process has an enabled guard then it has exactly one guard enabled. These comments imply that NUTEX provides to a user the same mutual exclusion service as does MUTEX. We now present theorems that establish the performance measures of NUTEX. Due to lack of space, we merely provide brief proof sketches.

Theorem 2. The worst case stabilization time of NUTEX is Θ((M + 1)N²).

Proof Sketch: From an arbitrary state NUTEX reaches a state in which count = 0 in O((M + 1)N²) moves. From such a state, NUTEX reaches a state in which mode_i = ALL for all i, 0 ≤ i < N, in O(N²) moves. NUTEX is identical to MUTEX when mode_i = ALL for all i, and hence from such a state NUTEX reaches a legitimate state in O(N²) moves.

Theorem 3. The worst case k-fault-containment time of NUTEX is Θ(kN).

Proof Sketch: The proof of this theorem is similar to the proof of Theorem 1. The extra work in proving this theorem is in showing that, even though only half the bits circulate, progress towards a legitimate state is made if the protocol is started in a k-faulty state. This progress is ensured by two features of NUTEX: (a) the number of distinct values that LSH(x) or MSH(x) can take is large enough, and (b) the number of token circulations before process 0 changes its mode is large enough. In particular, the condition 2^{⌈log N⌉/2} ≥ k + 3 ensures that the number of distinct values that LSH(x) or MSH(x) can take is at least k + 3. This translates to k < N^{1/2} − 2. In addition, at least k + 3 token circulations take place before process 0 initiates a change in mode.

Theorem 4. The token size of NUTEX is (1 − M/(2M + 4N)) log N.

Proof Sketch: The claim follows immediately from the fact that NUTEX eventually cycles through a phase of M token circulations with token size (log N)/2 followed by a phase of 2N token circulations with token size log N.
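To make the trade-off concrete, here is a small illustrative calculation of the Theorem 4 expression (the example numbers are ours, not from the paper).

import math

def nutex_token_size(N, M):
    # Theorem 4: average bits per token advance once NUTEX has stabilized
    return (1 - M / (2 * M + 4 * N)) * math.log2(N)

# For example, N = 1024 and M = 2048 give (1 - 1/4) * 10 = 7.5 bits per token
# advance, versus log N = 10 bits for MUTEX; as M grows the size approaches 5.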

4 Tradeoff Between Fault-Containment Time and Token Size

In this section we briefly sketch the protocol GNUTEX that is a generalization of the protocol NUTEX. In addition to the tradeoff between stabilization time and token size, GNUTEX provides a trade-off between fault-containment time and token size as well. The user is provided with a second program parameter C, whose value can be appropriately chosen so as to tune the performance of the protocol more finely. As motivation for the construction of GNUTEX, suppose that the user of NUTEX realizes, by observing the system over a period of time, that single process faults are becoming rather rare. The user, based on this new observation, is ready to give up fault-containment to a certain extent to further improve other performance measures. In the current setting, the user can give up tight fault-containment to improve the token size of the protocol. To achieve this, the user picks a large value for C; this increases the k-fault-containment time of GNUTEX, and reduces the token size. The protocol GNUTEX offers, for any k, 0 < k < N^{1/C} − 2, the following performance measures: stabilization time of Θ(N²(M + 1)), k-fault-containment time of Θ(kCN), and token size of (1 − (1 − 1/C)·M/(M + 2N)) log N. GNUTEX is identical to NUTEX when C = 2. The main idea in the construction of GNUTEX is as follows. Instead of partitioning the bits of x into two blocks as was done in NUTEX, the bits of x are partitioned into C blocks labeled 0 through C − 1. Assuming that x has B = ⌈log N⌉ bits, each block is a contiguous sequence of B/C bits. The system operates in C + 1 modes; each process has a variable called mode that can

+

take on any value in the range [O..q corresponding to the C 1 possible modes. In mode C, the protocol MUTEX is executed. In any other mode i, 0 5 i < C, a version of MUTEX is executed that only considers block i of x. Just as in NUTEX process 0 has an additional variable called count that can take on any value in the range [0..2N+ M). Process 0 uses this variable to determine which mode the system should be in. The subrange [0..2N) corresponds to mode C; the subrange [2N..2N M) is divided into segments of size k 3, each corresponding to a mode i, 0 5 i < C. The rest of the details are similar to those in NUTEX. Note that in NUTEX, 2 bits were sufficient to represent (and transmit) the mode information, but in GNUTEX, [log(C 1)1 bits are needed for this task. Since this is O(log(1og N)), the additional space requirement is trivial when compared to the savings in token size.

+

+

+
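The block partition and mode schedule just described can be sketched as follows (ours, illustrative only); in particular, we assume the segments of [2N..2N + M) cycle through the C block modes in order, and the helper names are ours.

import math

def block_bounds(N, C, i):
    B = math.ceil(math.log2(N))          # x has B bits, split into C blocks
    width = B // C                        # each block is a contiguous run of B/C bits
    lo = i * width
    return lo, lo + width                 # bit positions [lo, lo + width) form block i

def block_value(x, N, C, i):
    lo, hi = block_bounds(N, C, i)
    return (x >> lo) & ((1 << (hi - lo)) - 1)

def gnutex_mode(count, N, M, C, k):
    # mode C (full MUTEX) while count is in [0..2N); otherwise a block mode,
    # with segments of length k + 3 cycling through blocks 0..C-1
    if count < 2 * N:
        return C
    return ((count - 2 * N) // (k + 3)) % C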

5 Conclusions

Fault-containment is an extremely useful property to add to self-stabilizing protocols. However, it may not be useful to offer the user a protocol that provides a specific fault-containment property. Ideally, a user would like to fine-tune a self-stabilizing protocol based on the fault history of the system. Thus a designer of fault-containing self-stabilizing protocols ought to provide a user with a menu of protocols, offering different performance guarantees for different classes of faults. In this paper we have demonstrated the feasibility of constructing a protocol that offers the user, via the choice of program parameters, a wide variety of performance guarantees. Based on the fault-history of the system, the user can tune the performance of the protocol finely, trading off various performance measures against each other. Besides the practical motivation of providing a menu of fault-containment options, our work can be thought of as an example, similar to the one presented in [7], that exhibits tradeoffs among different complexity measures. Whether these tradeoffs are inherent to the problem is an interesting question that we do not consider here. Our example also begs the more general question: just as we did for MUTEX, can we transform any self-stabilizing protocol into a family of equivalent self-stabilizing protocols whose stabilization time and k-fault-containment time can be varied independently by varying the "amount" of communication between processes? This is a topic for future research. Our protocols reveal independent tradeoffs between token size and stabilization time, and between token size and k-fault-containment time. The protocols do not reveal a direct tradeoff between stabilization time and k-fault-containment time. In previous work [4, 10], we have shown that stabilization time and k-fault-containment time are in conflict with each other, in the sense that for certain problems reducing the k-fault-containment time necessarily means increasing stabilization time. Whether this is true in particular for the problem of mutual exclusion, and in general for problems whose solutions involve reactive protocols, is not yet known.

References
1. EW Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the Association for Computing Machinery, 17:643-644, 1974.
2. EW Dijkstra. A belated proof of self-stabilization. Distributed Computing, 1:5-6, 1986.
3. S Ghosh and A Gupta. Fault-containing leader election. Information Processing Letters, 5(59):281-288, 1996.
4. S Ghosh, A Gupta, T Herman, and SV Pemmaraju. Fault-containing self-stabilizing algorithms. In PODC96 Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, pages 45-54, 1996.
5. S Ghosh, A Gupta, and SV Pemmaraju. A fault-containing self-stabilizing algorithm for spanning trees. Journal of Computing and Information, 2:322-338, 1996.
6. S Ghosh, A Gupta, and SV Pemmaraju. Fault-containing network protocols. In Twelfth ACM Symposium on Applied Computing, 1997.

7. MG Gouda and M Evangelist. Convergence/response tradeoffs in concurrent systems. In Proceedings of the 2nd IEEE Symposium on Parallel and Distributed Processing, pages 288-292, 1990.
8. MG Gouda and M Schneider. Maximum flow routing. In Proceedings of the Second Workshop on Self-Stabilizing Systems, pages 2.1-2.13, 1995.
9. T Herman. Superstabilizing mutual exclusion. Preliminary report available at http://www.cs.uiowa.edu/ftp/selfstab/main.html, 1996.
10. T Herman and SV Pemmaraju. Impossibility results in fault-containing self-stabilization. In preparation, 1997.
11. S Kutten and B Patt-Shamir. Time-adaptive self-stabilization. In PODC97 Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing, 1997.
12. S Kutten and D Peleg. Universal fault-local mending. In PODC95 Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, pages 20-27, 1995.
13. G Varghese. Self-stabilization by counter flushing. In PODC94 Proceedings of the Thirteenth Annual ACM Symposium on Principles of Distributed Computing, pages 244-253, 1994.


Self-Stabilizing Multiple-Sender/Single-Receiver Protocol

Karlo Berket, Ruppert Koch
Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106, karlo@alpha.ece.ucsb.edu, ruppert@alpha.ece.ucsb.edu

Abstract. We present a new self-stabilizing protocol for many-to-one multicasting of messages. It is based on the window washing protocol of Costello and Varghese which uses positive acknowledgments for received messages. The assumed model uses a single queue at the receiver's side taking all the messages sent by N senders. The protocol provides flow control independently for every sender by dividing the queue into N logical queues. To assure good performance for bursty traffic, the share of the queue a sender is holding is adapted dynamically to the amount of traffic emitted by the sender. A proof of correctness and an upper bound for stabilization are given.

1 Introduction

In a time of increasing software complexity, protocols must be designed in a robust manner. They should be able to cope with unforeseen erroneous situations. Since Dijkstra [3, 6] introduced self-stabilization, much work has been done on designing protocols that use this property to guarantee return to a correct state in case of transient errors [1, 5]. For low-level communication protocols, self-stabilization turns out to be particularly useful. Unreliable communication causes errors that must be handled by the protocol. Gouda and Multari [4] and Spinelli [7] developed self-stabilizing two-node sliding-window protocols. In [2] Costello and Varghese introduced a self-stabilizing sliding-window protocol called window washing. Window washing deals with one-to-one and one-to-many communication. Sliding-window protocols [8] for one-to-one links are straightforward. The main problem, protecting the receiver's queue from overflowing with messages, can be solved by controlling the number of messages the sender is allowed to transmit. An implementation of many-to-one communication must address the problems that arise when several sources are feeding the receiver queue. Splitting up the receiver buffer into N buffers (where N equals the number of senders) cannot usually be done since buffer management is a part of the network interface itself. Another approach is to divide the buffer logically into N parts by assigning the window size of the senders so that the sum of all of the window sizes is less than or equal to the size of the receiver buffer. Using fixed window sizes is easy to implement but turns out to be wasteful if the traffic is bursty. A way to

achieve better performance is to assign flexible window sizes. Senders emitting more messages than others can use a wider window. The sum of all of the window sizes must remain constant. The protocol proposed in this paper achieves this by permanently checking the amount of data each sender is sending. The assigned window sizes are piggybacked in the acknowledgments. It can be seen that, by applying strict assignment rules and periodically resending some of the data, the protocol becomes self-stabilizing. Loss of messages, loss of acknowledgments, arbitrarily assigned window sizes, and inconsistent views of the system can be overcome without any additional control within a well-defined period of time. This makes our protocol both flexible and robust. The paper is organized as follows. Section 2 briefly describes window washing and explains the mechanisms by which self-stabilization is achieved. Section 3 introduces the window size adjustment protocol for a one-to-one protocol. Section 4 expands the protocol for many-to-one communication. A worst-case analysis and a proof of correctness are given. Section 5 provides simulation results. Section 6 states our conclusions.

2

Window Washing

The window washing protocol proposed by Costello and Varghese [2] is a sliding-window protocol that can be used to impose flow control in one-to-one and one-to-many setups. Here we consider the one-to-one setup as it is given in Figure 1. Every message is tagged with a sequence number seq. Each message is retransmitted periodically until the sender receives an acknowledgment for it. If the sender window size is w, the sequence number of messages sent is in the range [L + 1, L + w]. L denotes the lower window edge and is attached to every message.


Fig. 1. One-to-one setup. The sender sends messages to the receiver, and the receiver sends acknowledgments back. The window washing protocol provides flow control and reliable message delivery. The channels are FIFO channels.

If the receiver receives a message with sequence number seq, it checks whether seq = R + 1, where R is the sequence number of the last message it has received. If this is the case, it increments R. If this is not the case, it discards the message. However, if R is not in the range [L, L + w], the receiver accepts the message and copies the value seq from the message header into R. After a valid message is received, R is sent as an acknowledgment. It acknowledges the message with sequence number R and all previous messages. Like all unacknowledged messages, the last ack is retransmitted periodically to protect the system from data loss. The sender accepts an ack if R is in the range [L + 1, L + w] and sets its lower window edge L equal to R. Figure 2 shows the protocol.


SendData( L, seq, message )                      /* Sender emits message */
  Precondition: seq ∈ [L + 1, L + w]

ReceiveData( L, seq, message )                   /* Receiver absorbs message */
  if (R ∉ [L, L + w] or seq = R + 1) then
    R = seq
    deliver message
  endif

SendAck( ack )                                   /* Receiver emits ack */
  Precondition: ack = R

ReceiveAck( ack )                                /* Sender absorbs ack */
  if (ack ∈ [L + 1, L + w]) then
    L = ack
  endif

Fig. 2. Window washing protocol. Variables used are window size w, lower window edge L, sequence number seq of the current message, number R of the last valid message received by the receiver, and acknowledgment number ack.

Costello and Varghese show that the protocol is self-stabilizing if seq ∈ [0, M] with M not smaller than w + c_max, where c_max is the maximum number of messages that can be in the system at any point in time.
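To make these receive and acknowledgment rules concrete, the following is a minimal Python sketch of the logic in Figure 2; the class names and the fixed window constant are our own illustrative choices, and retransmission timers, channels, and message payloads are omitted.

# Minimal sketch of the window washing rules of Figure 2 (our illustration;
# retransmission timers and the actual channels are omitted).

W = 4  # window size w

class Sender:
    def __init__(self):
        self.L = 0                          # lower window edge

    def may_send(self, seq):
        # SendData precondition: seq in [L+1, L+w]
        return self.L + 1 <= seq <= self.L + W

    def receive_ack(self, ack):
        # accept the ack only if it falls inside the current window
        if self.L + 1 <= ack <= self.L + W:
            self.L = ack

class Receiver:
    def __init__(self):
        self.R = 0                          # last valid sequence number received

    def receive_data(self, L, seq):
        # accept if R is outside [L, L+w] (resynchronization) or seq = R+1
        if not (L <= self.R <= L + W) or seq == self.R + 1:
            self.R = seq                    # the message is delivered here
        return self.R                       # R is sent back as the acknowledgment

# one error-free exchange
s, r = Sender(), Receiver()
seq = s.L + 1
assert s.may_send(seq)
s.receive_ack(r.receive_data(s.L, seq))
assert s.L == 1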

3

Window Size Adjustment

Whereas the window washing protocol works fine for one-to-one and one-to-many communication, many-to-one communication is a different kettle of fish. In this

form of communication, many senders are writing to the same receiver queue. The queue is divided into N logical sub-queues by assigning window sizes that can be adjusted to the amount of traffic a sender is creating. FIFO channels are assumed. We introduce our protocol in two steps. First, we describe a one-to-one protocol based on window washing that includes the feature of window size adjustment. In a second step, the new one-to-one protocol is expanded to a many-to-one protocol. It is proven that both protocols are self-stabilizing, and a worst-case analysis is given.

3.1

Description of the Protocol

The protocol consists of two parts: window washing is responsible for flow control and reliable transfer of messages and acknowledgments, and the window size assignment adjusts the window size used by the sender. On the sender's side only minor changes need to be made. In addition to the lower window edge L and a sequence number seq, a message is tagged with the sender's current window size ws. The acknowledgment now contains the sequence number and the new window size. Every time an acknowledgment arrives - valid or not - the sender overwrites its old value of the window size with the new one. Apart from these modifications all senders run the window washing protocol. Attaching the new window sizes to the acknowledgments has the advantage that no additional messages are needed. This reduces the traffic and, much more importantly, allows distribution of the new window sizes without worrying about loss of these messages. However, there is a price to pay. The assignment is less flexible: the sender is notified of a change in its window size only if it receives an acknowledgment. The window size can be increased in arbitrary steps without informing the corresponding sender. This is not true for decreasing the window size. We can reduce the window size of the sender only if the sender receives an acknowledgment. The reduction must not be larger than the number of messages that are acknowledged with the ack. Imagine that the receiver attaches a new window size to an ack that is equal to the old size diminished by more than one. At the time the sender receives the ack, it has already sent all messages it was allowed to send according to its old window size. Buffer overflow can occur. Decrementing the window sizes in steps of one protects the receiver buffer from overflow even if acknowledgments get delayed or lost. Since the sender cannot move its lower window edge L, no more than the allowed number of messages can be generated. Pseudo-code for the protocol is given in Figure 3.

3.2

Proof of Correctness

The main idea of the proof of correctness is to split the protocol into two independent parts: the window washing protocol and the window size assignment protocol.

SendData( L, ws, seq, message )                  /* Sender emits message */
  Preconditions: seq ∈ [L + 1, L + w]

ReceiveData( L, ws, seq, message )               /* Receiver absorbs message */
  if (R ∉ [L, L + ws] or seq = R + 1) then
    R = seq
    deliver message
    AdjustWindowSize()
  endif

SendAck( w, ack )                                /* Receiver emits ack */
  Preconditions: ack = R
  CheckWindowSize()

ReceiveAck( w, ack )                             /* Sender absorbs ack */
  ws = w
  if (ack ∈ [L + 1, L + w]) then
    L = ack
  endif

CheckWindowSize()                                /* check for correct window size */
  if (w ∉ [w_min, w_max]) then
    w = w_min
  endif

AdjustWindowSize( newsize )                      /* adjusts window size */
  if (w_min ≤ newsize ≤ w_max) then              /* check if new size is in valid range */
    if (w - 1 ≤ newsize) then                    /* check if window size is not decreased too much */
      w = newsize
    endif
  endif

Fig. 3. Send, receive, check, and adjust routines for the one-to-one protocol. Send and receive are similar to the window washing protocol.
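The key safety point of Figure 3 is that a window may grow by any amount but may shrink by at most one per acknowledgment. A minimal Python rendering of the check and adjust rules follows; the function names and bounds are ours, chosen only for illustration.

# Illustrative sketch of the window size rules of Figure 3 (names are ours).

W_MIN, W_MAX = 1, 8

def check_window_size(w):
    # receiver, before sending an ack: fall back to W_MIN if w is corrupted
    return w if W_MIN <= w <= W_MAX else W_MIN

def adjust_window_size(w, newsize):
    # receiver, after delivering a message: adopt newsize only if it is in
    # the valid range and shrinks the current window by at most one
    if W_MIN <= newsize <= W_MAX and w - 1 <= newsize:
        return newsize
    return w

assert adjust_window_size(5, 7) == 7    # growing by several units is allowed
assert adjust_window_size(5, 3) == 5    # shrinking by two is refused
assert adjust_window_size(5, 4) == 4    # shrinking by one is allowed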

It can be shown that both parts fulfill the requirements of self-stabilization. Then the two parts can be merged, and it can be proven that the protocol works correctly. Figure 4 illustrates the separation into two independent parts. Although the window-assignment protocol logically rests on top of the window washing protocol, the order is reversed in this proof. It is easy to see that there is no round-trip flow of information in the window-assignment part. The


Fig. 4. Logical structure of the protocol. The upper sender-receiver pair represents the window washing protocol, the lower pair the window size assignment protocol. Q denotes the message queue size, q the acknowledgment queue size, c and L the capacity and maximum latency of the message channel, and c_ack and L_ack the capacity and maximum latency of the acknowledgment channel.

receiver of the system is the source, and the sender the sink of information. It is obvious that the whole system stabilizes after the receiver reaches a correct state since the sender overwrites its value of the window size every time it receives an acknowledgment. Even if acks are dropped at the sender's queue, the protocol works since every ack contains the correct window size. After the correct window size is established, no messages or acknowledgments are lost. The window washing protocol begins to stabilize and eventually the whole system becomes stable. We will see that the entire protocol reaches a valid state approximately after 3ΔT, where ΔT denotes the timeout period for resending messages and acknowledgments. Its value must be greater than the longest possible time between sending a message and receiving the corresponding acknowledgment in the error-free case. It is assumed that a sender regularly sends messages to the receiver. A state in which one of the senders no longer sends messages is regarded as an erroneous state. This is required by both the window washing protocol and the window size assignment protocol. Let us assume the system starts from a random state. The receiver starts to send out acknowledgments to the sender. Before doing so, the receiver checks its window size settings. If the value does not conform to the requirements (w_min ≤ w ≤ w_max), it will be set to w_min. The receiver side of the window size assignment protocol is then in a valid state.

After reading the acknowledgment, the sender overwrites its own window size value with the new value carried in the ack. Now the window size assignment protocol of the sender is in a valid state, too. The size of the acknowledgment queue of the sender is denoted as q, the maximum channel latency as L_ack, and the minimum service rate of the acknowledgment queue as λ. At the latest, after ΔT time units the receiver sends an acknowledgment which piggybacks the correct window size. It is guaranteed that every ack carries a valid window size because w is checked before sending the ack. The ack reaches the sender queue at ΔT + L_ack. However, it may happen that the receiver sends the ack earlier and the ack is dropped. By time L_ack + 1/λ the sender frees a space in its queue. Adding both times results in ΔT + 1/λ + 2L_ack for the latest time when an ack reaches the sender. After another 1/λ (the arriving ack sees an empty queue because q/λ < ΔT) the ack is taken out and the correct window size is established. The worst-case adjustment time adds up to ΔT + 2/λ + 2L_ack. After the window size is adjusted, no messages are lost due to buffer overflow. Window washing is guaranteed to stabilize within two round-trip delays (proof in [2]), which is less than 2ΔT. The total time for self-stabilization is smaller than T_stabilize = 3ΔT + 2/λ + 2L_ack.
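As a quick sanity check of this bound, the following plugs in hypothetical parameter values; the numbers are chosen by us purely for illustration and are not taken from the paper's simulations.

# Hypothetical parameters for illustration only.
delta_T = 100      # retransmission timeout period
lam = 1.0          # minimum service rate of the sender's ack queue
L_ack = 10         # maximum latency of the acknowledgment channel

t_adjust = delta_T + 2 / lam + 2 * L_ack          # window size settles: 122
t_stabilize = 3 * delta_T + 2 / lam + 2 * L_ack   # whole protocol: 322
print(t_adjust, t_stabilize)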


4


Many-to-One Protocol

Having shown that our protocol stabilizes within T_stabilize, we now expand the model as depicted in Figure 5. Many senders feed messages to a single receiver. All messages share the same queue of size Q. Every sender incorporates an acknowledgment queue of size q.

4.1

Description of the Protocol

On the sender's side only minor changes need to be made. In addition to the lower window edge Lj, the window size wj, and the sequence number seqj, messages are tagged with the sender number. Apart from this modification all senders run the one-to-one protocol. To achieve self-stabilization, several variables are needed on the receiver's side. The internal structure of the receiver is given in Figure 6. The window-assignment protocol uses three arrays: sender, actual, and w. The first data structure is a shift register that stores the sender ids of the messages received. The capacity of the register is Q, which is also the number of messages the receive queue can hold. Every time a new message is removed from the receiver queue, the sender id is stripped off and put in the shift register. The protocol then counts the number of identifiers a sender i has in the register and stores the result in actual[i]. The actual array contains the latest statistics about the traffic distribution. Based on this information, the new window sizes are calculated, stored in the window size array w, and sent to the senders by piggybacking them onto the acknowledgments.


Fig. 5. Many-to-one setup. N senders send messages to a single receiver. All messages are queued in a queue of size Q.


Fig. 6. Internal structure of the receiver. It consists of a message queue and the servicing node. The node uses the internal arrays sender, which stores the sender identifiers of the last Q messages that were taken out of the queue, actual, which stores the distribution of senders for the last Q messages, and w , which holds the assigned window sizes for each of the senders.

It is easy to see that there is no conflict between the senders if the window size assignment follows several rules. As in the one-to-one protocol, the window size can be increased in arbitrary steps without informing the corresponding sender. However, the receiver can only reduce the window size of a sender that is going to receive an acknowledgment, and the reduction must not be larger than the number of messages that are acknowledged with the ack. The reasoning is identical to that used for the one-to-one protocol. Assigning new window sizes happens every time a message is taken out of the receiver queue. First, it is determined whether or not the sender j whose message was taken out can be deprived of one unit of its window size. Therefore, the receiver checks if j's margin, which is defined as w[j] - actual[j], is larger than a threshold value DecThreshold. If this is the case, the receiver picks the sender k that has the smallest window of the senders with the smallest margin. If its margin is smaller than IncThreshold, j's window is reduced by one and k's window increased by one. The protocol takes a while to adjust the window sizes correctly in case one sender sends heavily and suddenly another sender sends a burst of messages. To increase the dynamics of the system, a less stringent version was developed that allows the receiver to deprive j of one unit of its window even if its margin is less than or equal to DecThreshold, as long as it is greater than margin[k] + 1. Pseudo-code for the protocol is given in Figures 7 and 8. Figure 7 describes the send, receive, and check routines, whereas the window size assignment protocol is given in Figure 8. The latter also contains the main functions for both sender and receiver.


4.2

Proof of Self-stabilization

In the protocol described here several senders write to the same receiver. This does not create an entirely new situation different from that described for the one-to-one protocol. The receiver can be thought of as being split up into N receivers, each of them forming a sender-receiver pair with its corresponding sender. That all of the messages go to the same receiver input queue does not affect the window washing protocol so long as there is no buffer overflow. Therefore, the sum of all of the assigned window sizes must not be greater than the queue size Q. The window size assignment protocol works exactly the same way as in the one-to-one protocol. Thus, after ΔT + 2/λ + 2L_ack all senders are supplied with the correct window size, and the system stabilizes two round-trip delays later. Again, we have a worst-case stabilization time of T_stabilize = 3ΔT + 2/λ + 2L_ack. With many senders, the check for a correct window size assignment at the receiver differs slightly from the one-to-one protocol (see Figure 7).


SendData( L, w, sender, seq, message )           /* Sender emits message */
  Preconditions: seq ∈ [L + 1, L + w]

ReceiveData( L, w, sender, seq, message )        /* Receiver absorbs message */
  if (sender ∈ [1, N]) then
    if (R[sender] ∉ [L, L + w] or seq = R[sender] + 1) then
      R[sender] = seq
      deliver message
    endif
  endif

SendAck( j, wj, ackj )                           /* Receiver emits ack to sender j */
  Preconditions: j ∈ [1, N]
  CheckWindowArray()

ReceiveAck( wj, ack )                            /* Sender absorbs ack */
  w = wj
  if (ack ∈ [L + 1, L + w]) then
    L = ack
  endif

BuildActualArray()                               /* Calculate values of actual[] */
  for i in 1 to N do actual[i] = 0               /* set actual[] to 0 */
  for i in 1 to Q do actual[sender[i]]++         /* count sender ids */

CheckWindowArray()                               /* Check values in w[] */
  for i in 1 to N do                             /* first, check if all values are in the valid range */
    if (w[i] ∉ [w_min, w_max]) then
      w[i] = w_min
    endif
  enddo
  if (Σ w[i] > Q) then                           /* check if the sum of the window */
    for i in 1 to N do w[i] = w_min              /* sizes is smaller than Q */
  endif

Fig. 7. Send, receive, and check routines for the many-to-one protocol. Send and receive are similar to the one-to-one protocol.

CalculateNewWindow( sender )                     /* Calculates new window size for senders */
  for i in 1 to N do                             /* calculate new margins */
    margin[i] = w[i] - actual[i]
  enddo
  if (∃ k : margin[k] < IncThreshold) then       /* find sender whose window will be increased */
    k = sender with smallest w of the senders with minimum margin
  endif
  if (Σ w[i] < Q) then                           /* free buffer space available */
    if (k exists) then w[k]++
  else                                           /* no free buffer space available */
    if (k exists AND margin[sender] > DecThreshold) then
      w[k]++
      w[sender]--
    endif
  endif

Sender:
  if timeout then w = w_min endif
  if not (w_min ≤ w ≤ w_max) then w = w_min endif
  SendData(L, w, sender, seq, message)
  ReceiveAck(w, ack)

Receiver:
  ReceiveData(L, w, sender, seq, message)
  EnqueueSender(sender)
  BuildActualArray()
  CalculateNewWindow(sender)
  SendAck(sender, w[sender], R[sender])

Fig. 8. Many-to-one protocol.
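The following Python sketch mirrors the receiver-side bookkeeping of Figures 7 and 8: a shift register of recent sender ids, the actual array derived from it, and the margin-based reassignment rule. The data-structure choices, constants, and names are ours and serve only as an illustration of the rule, not as the paper's implementation.

from collections import deque

N, Q = 3, 12                          # number of senders, receiver queue size
INC_THRESHOLD, DEC_THRESHOLD = 1, 1

w = [0] + [Q // N] * N                # window size per sender (index 1..N)
sender_reg = deque(maxlen=Q)          # shift register of the last Q sender ids

def build_actual():
    actual = [0] * (N + 1)
    for sid in sender_reg:            # count how often each sender appears
        actual[sid] += 1
    return actual

def calculate_new_window(j):
    """Called when a message of sender j is removed from the queue."""
    actual = build_actual()
    margin = [w[i] - actual[i] for i in range(N + 1)]     # unused window share
    # candidate k: smallest window among the senders with minimum margin
    candidates = [i for i in range(1, N + 1) if margin[i] < INC_THRESHOLD]
    k = min(candidates, key=lambda i: (margin[i], w[i]), default=None)
    if sum(w[1:]) < Q:                # free buffer space: just widen k's window
        if k is not None:
            w[k] += 1
    elif k is not None and margin[j] > DEC_THRESHOLD:
        w[k] += 1                     # shift one unit of window from j to k
        w[j] -= 1

# example: sender 1 dominated the recent traffic, so it gains window share
for _ in range(Q - 1):
    sender_reg.append(1)
sender_reg.append(2)
calculate_new_window(2)
print(w)                              # e.g. [0, 5, 3, 4]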

5

Simulation

The many-to-one protocol was evaluated using a discrete event simulation. The simulation model was designed to match the system model given in Section 4. The channels were designed as FIFO queues of length one.

5.1

Error-Free Simulation

For the error-free case, the simulation was started from a quasi-random state. Every sender started creating messages either in a deterministic manner or randomly with a negative exponential distribution. The rate was set to λ_s = 0.04. The receiver removed messages from the queue at a constant rate λ_r = 1. A newly created message was sent if the flow control allowed it; otherwise, it was delayed. The measurements were taken after reaching a state of equilibrium. To measure the ability to adapt to changes in traffic, bursts were injected for senders one and two. The burst rate was λ_b = 0.4. The results are shown in Figure 9. From the figures it can be seen that the first window assignment method is superior to the less stringent method in reducing fluctuations of the window size during periods of no change in the sender rates. However, the second method proves to be quicker in redistributing the resources after sudden shifts in the sender rates. This produces a shorter latency for messages and makes the protocol better suited to a changing environment. For these reasons the less stringent window allocation protocol was chosen as the basis for further tests. In all three diagrams the black sender sends bursts from time 2000 to 3000. The moment the black sender stops, the grey one starts bursting for 1500 units. It takes over the black sender's share of the receiver buffer. At time 4000 the black sender again starts bursting, and the shares average out. At 5000 no sender is bursting, and the buffer is equally shared between all five senders.

5.2

Error Recovery

To test the error-recovery times of the protocol, three different types of errors were introduced into the system: loss of messages and acknowledgments, corruption of sequence numbers of messages and acknowledgments, and berserk senders. The statistics used to measure the error-recovery time in these cases were the retransmission count of the sender and the received count of the receiver. When the protocol is in the stable state, both of these measures should be equal to one for every message sent. The simulation results show that a loss of a single message does not affect the protocol beyond the loss of that message. If the missing message were to be retransmitted, only a small penalty would be added to the recovery process. The same behavior occurs for the second type of error. The receiver accepts the incorrect message and sends an acknowledgment to the sender. Since the sequence number on the ack is invalid, the sender disregards it and continues as if the error had not occurred. These two errors are thus indistinguishable to the senders.


Fig. 9. Simulation of the error-free case: the figures show two senders competing for buffer space. They represent the change in the senders' window sizes over time. The top one depicts the results achieved with the more stringent version (DecThreshold = 1); the middle and lower ones use the less stringent version with DecThreshold = margin[k] + 1. The IncThreshold equals one in all cases. The upper two graphs illustrate the behavior for deterministic traffic, the lower one deals with a negative exponential distribution of traffic.


Fig. 10. Simulation of a berserk sender scenario. In a system of five senders, one starts to emit messages at a high rate, ignoring flow control for a certain time interval. In this case the interval is set from 2200 to 2300 time units. The figure shows the number of times the same message is received versus the time it is generated. After 180 time units the protocol stabilizes and the system operates correctly.

The third error shows the resilience of the protocol. A sender emits messages at a high rate, ignoring flow control for a time period of 100 time units. From Figure 10 it can be seen how the protocol stabilizes once there are no more new errors. The error-recovery time is shown to be on the order of 200 units.

5.3

Stabilization from a Random State

To test the self-stabilization bound of the protocol, the system was started in an uninitialized state. The variables were not initialized, and random messages were placed in the channels and queues. The system consisted of 10 senders and a queue size of 50, and 4000 simulation runs were executed.


Fig. 11. Simulation of stabilizing time from a random state. In a system of ten senders, the system is started from an uninitialized state. The simulation is run 4000 times. The figure shows the frequency with which a stabilizing time occurs.

The statistics used to measure the stabilizing time were the same as in the error-recovery cases. The theoretical upper bound on the stabilization time was calculated to be 300 time units. The simulation results are shown in Figure 11.

6

Conclusion

In this paper we introduced a new sliding-window protocol that provides reliable communication and flow control for many-to-one setups. The two main features of the protocol are a flexible way of adjusting the size of the windows to different traffic patterns and self-stabilization. A proof of correctness and an upper bound on the stabilization time are given. How well the window size adjustment performs depends heavily on the statistical properties of the traffic. Constant-bit-rate traffic and long bursts should not challenge the protocol, whereas short bursts and long pauses are more difficult to handle. As further work, it would be interesting to see how the protocol behaves when channel latencies are introduced, especially when they vary between senders. This would provide results closer to the real-world scenarios we face. Another interesting line of research would be to examine how this protocol could be merged with a self-stabilizing one-sender/multiple-receiver protocol. The properties of the resulting multiple-sender/multiple-receiver protocol would be of great interest to the distributed computing community.

References
1. G. M. Brown, M. G. Gouda, C. Wu, "Token systems that self-stabilize," IEEE Transactions on Computers, vol. 38, no. 6 (June 1989), pp. 845-852.
2. A. M. Costello, G. Varghese, "Self-stabilization by window washing," Proceedings of the 15th ACM Symposium on Principles of Distributed Computing, Philadelphia, PA, USA (May 1996), pp. 35-44.
3. E. W. Dijkstra, "Self stabilization in spite of distributed control," Communications of the ACM, vol. 17 (1974), pp. 643-644.
4. M. G. Gouda, N. J. Multari, "Stabilizing communication protocols," Technical Report TR-90-20, Dept. of Computer Science, Univ. of Texas, Austin (June 1990).
5. S. Katz, K. Perry, "Self-stabilizing extensions for message-passing systems," Distributed Computing, vol. 7, no. 1 (August 1990), pp. 17-26.
6. M. Schneider, "Self-stabilization," ACM Computing Surveys, vol. 25, no. 1 (March 1993), pp. 45-67.
7. J. M. Spinelli, "Self-stabilizing ARQ protocols on channels with bounded memory or bounded delay," Proceedings of the 12th Conference of the IEEE Computer and Communications Societies, San Francisco, CA, USA (March 1993), pp. 1014-1022.
8. A. S. Tanenbaum, Computer Networks, third edition, Prentice-Hall, Englewood Cliffs, N.J. (1996).

This article was processed using the LaTeX macro package with SIROCCO style.

Propagated Timestamps: A Scheme for the Stabilization of Maximum Flow Routing Protocols

Jorge A. Cobb
Mohamed Waris
Department of Computer Science
University of Houston
Houston, TX 77204-3475

Abstract We present a distributed protocol for maintaining a maximum flow spanning tree in a network, with a designated node as the root of the tree. This maximum flow spanning tree can be used to route the allocation of new virtual circuits whose destination is the designated node. As virtual circuits are allocated and removed, the available capacity of the channels in the network changes, causing the chosen spanning tree to lose its maximum flow property. Thus, the protocol periodically changes the structure of the spanning tree to preserve its maximum flow property. The protocol is self-stabilizing, and hence it tolerates transient faults. Furthermore, it has the nice property that, while the structure of the tree is being updated, no loops are introduced, and all nodes remain connected. That is, the tree always remains a spanning tree whose root is the designated node.

1. Introduction

Computer networks can be represented as connected, directed graphs, where nodes represent computers and edges represent channels between computers. Each edge is assigned a positive capacity. The flow of a network path is the minimum of the capacities of the edges in the path. Identifying the path with the maximum flow between two nodes is useful for establishing virtual circuits in computer networks. To establish a virtual circuit with some required capacity between two nodes in a network, a maximum flow path between the two nodes is first identified. If the flow of the identified path is greater than or equal to the required capacity of the circuit, then the circuit is established along the identified path. Otherwise, the circuit is rejected. In this paper, we describe a distributed protocol for maintaining a maximum flow spanning tree whose root is a designated node. Each new virtual circuit whose destination is the designated node is allocated along this tree. However, as virtual circuits are allocated along the tree, the tree may lose its property of being a maximum flow tree, and needs to be updated. We describe a protocol that periodically updates the structure of the spanning tree to preserve its maximum flow property. The protocol is self-stabilizing [6] [10], and has the nice property that, while the tree is being updated, it always remains a spanning tree. Self-stabilizing protocols for maintaining a maximum flow spanning tree have also been presented in [7] and [8]. However, these protocols have the disadvantage that they either introduce temporary routing loops, or they require a termination detection algorithm. In addition, the network diameter must be known. In this paper, we present

a protocol that overcomes these disadvantages by using a novel technique based on propagating timestamps. We present the protocol in three steps. First, we introduce a basic protocol which is vulnerable to routing loops. Then, we refine the protocol to prevent loops from occurring during normal execution; this version, however, is not self-stabilizing. We then present the self-stabilizing version of the protocol. To simplify our network model, we assume each computer can read the variables of its neighboring computers. We will relax this assumption to a message passing model in the full version of the paper. Our protocols have been proven correct. Due to space limitations, proof sketches are provided in the appendix. The detailed proofs are deferred to the full paper.

2. Protocol Notation

Our network model consists of a set of processes which read but do not write the variables of other processes. Two processes are neighbors iff they can read the variables of each other. The processes may be viewed as a network graph, where each node corresponds to a process, and each edge corresponds to a pair of neighboring processes. The network graph is assumed to be connected. Each process is defined by a set of local constants, a set of local inputs, a set of local variables, and a set of actions. Assume each process has a local variable named f. We denote with f.u the local variable f of process u. For simplicity, we omit the suffix if the process is clear from the context. The actions of a process are separated by the symbol [] using the following syntax:

begin action [] action ... [] action end

Each action is of the form guard → command. A guard is a boolean expression involving the local constants, inputs, and variables of the process, and the variables of neighboring processes. A command is constructed from sequencing (;) and conditional (if fi) constructs that group together skip and assignment statements. Similar notations for defining network protocols are discussed in [4] and [5]. An action in a process is enabled if its guard evaluates to true in the current state of the network. An execution step of a protocol consists of choosing any enabled action from any process, and executing the action's command. Executions are maximal, i.e., either they consist of an infinite number of execution steps, or they terminate in a state in which no action is enabled. Executions are assumed to be fair, that is, each action that remains continuously enabled is eventually executed. If multiple actions in a process differ by a single value, we abbreviate them into a single action by introducing parameters. For example, let j be a parameter whose type is the range 0..1. The action

x[j] < 0 → y[j] := false

is a shorthand notation for the following two actions:

x[0] < 0 → y[0] := false
x[1] < 0 → y[1] := false


We use quantifications of the form (∀ x : R(x) : T(x)) where R is the range of the quantification, and T is the body of the quantification. The above denotes the conjunction of all T(x) such that x is a value satisfying predicate R(x). If R(x) is omitted, then all the possible values of the type of variable x are used.
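As an aside, the execution model just described can be rendered in a few lines of Python: a step picks any enabled action and runs its command. This is only our illustration of the semantics (random choice stands in for a fair scheduler); the guarded actions shown are the x/y example above.

import random

def run(actions, state, steps):
    # actions is a list of (guard, command) pairs over a shared state dict
    for _ in range(steps):
        enabled = [(g, c) for g, c in actions if g(state)]
        if not enabled:
            break                        # terminated: no action is enabled
        _, command = random.choice(enabled)
        command(state)
    return state

state = {"x": [-1, 3], "y": [True, True]}
actions = [(lambda s, j=j: s["x"][j] < 0,
            lambda s, j=j: s["y"].__setitem__(j, False))
           for j in (0, 1)]
print(run(actions, state, 10))           # y[0] becomes False, y[1] stays True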

3. The Basic Protocol

We assume there exists a designated process in the network, called root. The purpose of the protocol is to build a maximum flow spanning tree (defined below) of the network graph with process root as the root of the tree. Associated with each directed edge is a capacity, i.e., the available bandwidth of the edge. Edge capacities are assumed to be updated by an external agent which keeps track of the available bandwidth as virtual circuits are added to or removed from the edge. The flow of a directed path is the minimum of the capacities of all the edges in the path. A directed path P beginning at process u and ending in process v is said to be a maximum flow path if there is no other path from u to v with a flow greater than the flow of P. A spanning tree T is called a maximum flow tree iff for every process u, the path contained by T from u to the root is a maximum flow path. For routing efficiency, if for a process u there are two maximum flow paths to the root, then the protocol should include in the tree the path with the smallest number of edges. To represent the edges in the tree, each process has a variable pr indicating its parent in the tree. Also, each process has a local constant N indicating its set of neighbors in the network graph. To build a maximum flow spanning tree, each process maintains two variables, f and d. Variable f indicates the flow of the path to the root along the spanning tree, and variable d indicates the distance (i.e., number of edges) to the root along this path. For a pair of neighbors u and v, we define relation better as follows.

(f.u, d.u) better (f.v, d.v) ≡ f.u > f.v ∨ (f.u = f.v ∧ d.u < d.v)

That is, (f.u, d.u) better (f.v, d.v) is true iff the path from u to the root has a greater flow than the path from v to the root, or if these flows are the same but the distance to the root from u is smaller than the distance from v. Relation worse is the complement of relation better. To build the spanning tree, we use an algorithm similar to the Bellman-Ford algorithm, but tailored towards maximum flow paths rather than minimum cost paths. The basic procedure consists of two steps. One step is for each process u to periodically compare its (f.u, d.u) pair with that of its parent, and update (f.u, d.u) if necessary. This is because, at any time, it is possible that the flow of the path to the root via the parent changes, due to changes in the capacity of some edges along the path. The second step consists of choosing a new parent, if possible. For each neighbor v of u, u reads the current value of (f.v, d.v), and chooses v as its new parent iff the flow of the path to the root via v is better than the flow of the path to the root via its current parent. That is, u chooses v as its new parent iff

(min(f.v, cap(u,v)), d.v+1) better (f.u, d.u)

where cap(u,v) is the capacity of edge (u,v). The basic protocol for a non-root process is given below. References to local variables have no suffix, while references to a neighbor's variables have the neighbor's identifier as a suffix.

process u
constants
  N : { w | w is a neighbor of u }
inputs
  c : array [N] of integer              /* c[v] = cap(u,v) */
variables
  pr : element of N                     /* parent of u */
  f  : integer                          /* flow to the root */
  d  : integer                          /* distance to the root */
parameters
  v : element of N                      /* v ranges over all neighbors of u */
begin
  (f, d) ≠ (min(f.pr, c[pr]), d.pr+1) →
      f, d := min(f.pr, c[pr]), d.pr+1
[]  (min(f.v, c[v]), d.v+1) better (f, d) →
      f, d, pr := min(f.v, c[v]), d.v+1, v
end

The process contains two actions. In the first action, variables f and d are updated to reflect the current flow and distance to the root via the parent process pr. In the second action, a neighbor v is checked to see if it offers a path to the root with a flow greater than the flow of the current path to the root. If this is the case, v is chosen as the parent, and f and d are updated accordingly. The root process may be specified as follows.

process root
constants
  F : integer                           /* maximum flow possible in the tree */
variables
  pr : process-id                       /* parent of root process */
  f  : integer                          /* flow of the root */
  d  : integer                          /* distance of the root */
begin
  d ≠ 0 ∨ f ≠ F ∨ pr ≠ root →
      d, f, pr := 0, F, root
end

Figure 1: Routing loop. (a) Before edge capacity changes; (b) after edge capacity changes. (Legend: tree edge of capacity x; network edge of capacity x.)

The code contains a single action. Since the root has no parent, its parent variable should be set to itself. Also, the distance to itself should be set to zero, and the flow to itself should be set to the maximum value possible.
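For reference, the tree the protocol is aiming at can be computed centrally. The sketch below (ours, for illustration only) runs a Dijkstra-style widest-path search in Python that maximizes bottleneck capacity and breaks ties by hop count, i.e., the better relation of Section 3; the protocol itself computes the same tree distributedly.

import heapq

def max_flow_tree(cap, root):
    """cap[u][v] = capacity of edge (u, v); returns (flow, dist, parent)."""
    INF = float("inf")
    flow = {u: 0 for u in cap}
    dist = {u: INF for u in cap}
    parent = {u: None for u in cap}
    flow[root], dist[root], parent[root] = INF, 0, root
    heap = [(-INF, 0, root)]            # widest flow first, then fewest hops
    done = set()
    while heap:
        f, d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, c in cap[u].items():
            nf, nd = min(-f, c), d + 1
            if (nf, -nd) > (flow[v], -dist[v]):      # the "better" relation
                flow[v], dist[v], parent[v] = nf, nd, u
                heapq.heappush(heap, (-nf, nd, v))
    return flow, dist, parent

# toy network: the widest path from b to the root r goes through a
cap = {"r": {"a": 5, "b": 2}, "a": {"r": 5, "b": 4}, "b": {"r": 2, "a": 4}}
print(max_flow_tree(cap, "r")[2])       # parent: {'r': 'r', 'a': 'r', 'b': 'a'}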

4. The Propagated Flow Protocol

The protocol presented in Section 3 allows processes to adapt to changes in the capacities of network edges, and to choose as a parent the neighbor that offers the best path to the root. However, there is a flaw in the protocol, which we remedy in this section. As the capacities of edges change, a routing loop could occur, as shown in Figure 1. In the figure, the capacity of the edge between u and its parent decreases, and u chooses v as its new parent, since v appears to offer a path to the root with a flow greater than the flow of the path to the root via the current parent. However, v is actually a descendant of u, and thus a loop is formed. There are two approaches to solve this problem. The first is to assume an upper bound D on the diameter of the network. If a process u is in a loop, then its distance d, and also the distance of all processes in the loop, will grow beyond bound. Thus, process u changes parents when d is greater than D [7], which breaks the loop. This technique has the disadvantage of requiring the protocol to have knowledge of the size of the network; furthermore, although permanent loops do not occur, temporary loops occur while the protocol adapts the spanning tree to changes in the capacity of some edges. The second approach, which we adopt in this paper, is to prevent loops from forming altogether. To do so, we use a technique similar to the one originally proposed in [9]. Variations of this technique have been used to prevent loops in leader election protocols [0] and routing protocols for minimum cost paths [3] [11]. We would like to ensure that for each process u and each descendant v of u in the spanning tree,

(f.u, d.u) better (f.v, d.v)    (1)

That is, v does not have a path to the root with a better flow than the current path from u to the root. Therefore, u will not choose v as its parent, avoiding the formation of a loop. Unfortunately, it is not always possible to ensure the above. This is because, as edge capacities change, the value of (f.v, d.v) may be temporarily inaccurate and overestimate the flow of the path from v to the root. This may cause u to incorrectly choose v as its new parent. To resolve this, each process maintains a boolean variable, fclean, with the following meaning.

fclean.u ⇒ (∀ v : v is a descendant of u : (f.u, d.u) better (f.v, d.v))    (2)

We say that u is flow clean if fclean.u is true; otherwise, we say that u is flow dirty. To ensure Relation (2) always holds, process u becomes flow dirty whenever it updates (f.u, d.u) and the new value is worse than the previous one. Furthermore, process u becomes flow clean when all its children in the spanning tree are flow clean, and the flow of u is better than the flow of each child. Therefore, if u is flow clean, it implies that the latest decrease in (f.u, d.u) has propagated to all its descendants. Given Relation (2), process u is free to change parents if it is flow clean. If u is flow clean, and all its descendants have a flow worse than that of u, then u will not choose any of them as its new parent. Thus, loops are avoided. We are now ready to present the definition of each non-root process u. The root process remains the same as in the previous section.

process u
constants
  N : { w | w is a neighbor of u }
inputs
  c : array [N] of integer              /* c[v] = cap(u,v) */
variables
  pr     : element of N                 /* parent of u */
  f      : integer                      /* flow to the root */
  d      : integer                      /* distance to the root */
  fclean : boolean                      /* flow clean bit */
parameters
  v : element of N                      /* v ranges over all neighbors of u */
begin
  (f, d) ≠ (min(f.pr, c[pr]), d.pr+1) →
      if (min(f.pr, c[pr]), d.pr+1) worse (f, d) → fclean := false
      [] (min(f.pr, c[pr]), d.pr+1) better (f, d) → skip
      fi;
      f, d := min(f.pr, c[pr]), d.pr+1
[]  fclean ∧ (min(f.v, c[v]), d.v+1) better (f, d) →
      f, d, pr := min(f.v, c[v]), d.v+1, v
[]  (∀ w : w ∈ N ∧ pr.w = u : fclean.w ∧ (f, d) better (f.w, d.w)) →
      fclean := true
end

The process has three actions. The first action is similar to the first action in the previous section. However, if the values of (f, d) become worse, then the process becomes flow dirty. In the second action, a new parent is chosen, but only if the process is currently flow clean. In the last action, the process becomes flow clean if all its children are flow clean and the process has a flow that is better than the flow of its children.

5. Correctness of the Propagated Flow Protocol

In this section, we characterize the behavior of the protocol of the previous section using closure and convergence properties [4]. We first define the terms closure and convergence, and then present the specific closure and convergence properties of the protocol. An execution sequence of a network protocol P is a sequence (state.0, action.0; state.1, action.1; state.2, action.2; ...) where each state.i is a state of P, each action.i is an action of some process in P, and state.(i+1) is obtained from state.i by executing action.i. A state predicate S of a network protocol P is a function that yields a boolean value at each state of P. A state of P is an S-state iff the value of state predicate S is true at that state. State predicate S is a closure in P iff at least one state of P is an S-state, and every execution sequence that starts in an S-state is infinite and all its states are S-states. Let S and S' be closures in P. We say that S converges to S' iff every execution sequence whose initial state is an S-state contains an S'-state. From the above definition, if S converges to S' in P, and if the system is in an S-state, then eventually the execution sequence should reach an S'-state, and, because S' is a closure, the execution sequence continues to encounter only S'-states indefinitely. The converges to relation is transitive. Also, if S converges to S' and T converges to T', then S ∧ T converges to S' ∧ T'. We next present the properties of the propagated flow protocol. We begin with a couple of definitions. Let desc(u) denote the set of descendants of u. That is, v ∈ desc(u) iff there is a path (w0, w1, w2, ..., wn) such that w0 = v, wn = u, and for every i, 0 ≤ i < n, pr.wi = w(i+1). Predicate FC (Flow Clean) below relates fclean.u and the flow of u's descendants.

FC ≡ (∀ u, v : v ∈ desc(u) : fclean.u ⇒ (f.u, d.u) better (f.v, d.v))

Predicate ST (Spanning Tree) below is true iff the parent variables form a spanning tree.

ST ≡ (∀ u :: u ∈ desc(root)) ∧ pr.root = root

Figure 2: Routing loop. (Legend: tree edge of capacity x; network edge of capacity x.)

Lemma 1: FC ∧ ST is a closure.


Thus, if the protocol begins in a state in which FC holds and the parent variables define a spanning tree, then this continues to hold forever. Let pr-path(u) be the path obtained by following the parent pointers from u to the root. Predicate MFST below is true iff a maximum flow spanning tree is obtained, and variables f and d of each process are correct with respect to the parent.

MFST ≡ ST ∧ (f.root, d.root) = (F, 0)
       ∧ (∀ u : u ≠ root : (f.u, d.u) = (min(f.(pr.u), cap(u, pr.u)), d.(pr.u)+1))
       ∧ (∀ u : u ≠ root : pr-path(u) is a maximum flow path)

For the next theorem, we assume that edge capacities remain constant. Otherwise, the maximum flow spanning tree is a moving target that may never be reached.

Theorem 1: FC ∧ ST converges to FC ∧ MFST


Thus, if the protocol begins in a state in which FC and ST hold, eventually a maximum flow spanning tree is found. If the edge capacities change, then the protocol will adapt and obtain a new maximum flow tree, while continuously maintaining a spanning tree.

6. The Propagated Timestamp Protocol

The propagated flow protocol presented above has a weakness: it is very sensitive to the initial state of the system. For example, consider Figure 2. The arrows in the figure indicate the parent relationships, e.g., u has chosen x as its parent, and x has chosen w as its parent. The numbers indicate the capacity of each edge. If flow variable f of each process in the loop is currently 10, then the loop will not be broken, because all edges not included in the loop have a lower capacity than the loop edges, and all processes in the loop will not change parents. The protocol presented in [7] is insensitive to the initial state, in the sense that it will converge to a maximum flow spanning tree irrespective of the initial state. However, it introduces temporary loops when the edge capacities change. In [8], an improved version of the protocol is presented that is insensitive to the initial state and does not introduce temporary loops. However, it also requires knowledge of the diameter of the network, and, more importantly, it requires an underlying self-stabilizing termination detection protocol.

Figure 3: Breaking loops with propagated timestamps. (a) ts.u = 8, ts.v = 10; (b) ts.u = 10, ts.v = 10; (c); (d). (Legend: parent pointer; network edge not on tree.)

In this section, we present a protocol that is insensitive to the initial state and does not introduce temporary loops. Furthermore, it requires no underlying termination detection protocol and no knowledge of the network diameter. It is based on propagating timestamps through the network. Using timestamps to either break loops or avoid the formation of loops in network protocols has been done in the past [1] [2]. However, we present a novel method of propagating timestamps, which allows our maximum flow protocol to be self-stabilizing. The basic strategy is as follows. The root process has a timestamp variable, called ts.root. Periodically, the root increases its timestamp. When a child u of the root notices that the timestamp of the root, ts.root, is greater than its own timestamp, ts.u, u assigns ts.root to ts.u. Similarly, a child v of u will assign ts.u to ts.v provided ts.u > ts.v, and so on. Thus, the timestamp of the root will propagate to all its descendants. We would like to place a bound on the difference between the timestamp of the root and that of its descendants, namely, they should differ by at most one. To accomplish this, each process u maintains a boolean variable, tclean.u, with the following meaning.

tclean.u ⇒ (∀ v : v ∈ desc(u) : ts.v ≥ ts.u)    (3)

We say that u is timestamp clean iff tclean.u equals true; otherwise u is timestamp dirty. If u is timestamp clean, it implies that all its descendants have a timestamp that is at least the timestamp of u. To guarantee Relation (3), process u makes itself timestamp dirty whenever it increases its timestamp by copying its parent's timestamp into its own. Process u makes itself timestamp clean whenever each child of u is timestamp clean and has a timestamp at least that of u. If the root process does not increase its timestamp until it is timestamp clean, then it is guaranteed that its timestamp is always at most one greater than that of its descendants. Therefore, timestamps propagate in "waves". When all processes have a timestamp equal to k, and the root is timestamp clean, the root increases its timestamp to k+1 and becomes timestamp dirty. Then, timestamp k+1 propagates down

the tree until it reaches the leaves of the tree. The leaves then become timestamp clean (since they have no children), which in turn allows their parents to become timestamp clean, and so on, until the root becomes timestamp clean once more.

The fact that the timestamp of the root is at most one greater than all the timestamps of its descendants can be used by a process to detect that it is involved in a loop. Assume process u notices that it has a neighbor v, where ts.v ≥ ts.u + 2. This implies that u is disconnected from the root, and is most likely involved in a loop. That is, u never received timestamp ts.u + 1 from its parent, and hence its parent does not lead to the root. If the above is the case for process u, then u must choose a different parent. The steps to do so are illustrated in Figure 3. Initially (Figure 3(a)), ts.v ≥ ts.u + 2. Thus, u must change parents. Before doing so, u sets ts.u = ts.v, because the timestamp of the root is at least ts.v. Next (Figure 3(b)), u sets pr.u = u, indicating to all its neighbors that it does not have a parent. Since u no longer has a parent, each child of u, e.g. w, also sets pr.w = w (Figure 3(c)). Thus, eventually u will have no children. When this is the case, u is free to choose any parent whose timestamp is at least ts.u (Figure 3(d)) and rejoin the tree. We are now ready to present the definition of a non-root process u.

process u
constants
  N : { w | w is a neighbor of u }      /* neighbor set */
inputs
  c : array [N] of integer              /* c[v] = cap(u,v) */
variables
  pr     : element of N                 /* parent of u */
  f      : integer                      /* flow to the root */
  d      : integer                      /* distance to the root */
  fclean : boolean                      /* flow clean bit */
  ts     : integer                      /* timestamp */
  tclean : boolean                      /* timestamp clean bit */
parameters
  v : element of N                      /* v ranges over all neighbors of u */
begin
  (f, d) ≠ (min(f.pr, c[pr]), d.pr+1) →
      if (min(f.pr, c[pr]), d.pr+1) worse (f, d) → fclean := false
      [] (min(f.pr, c[pr]), d.pr+1) better (f, d) → skip
      fi;
      f, d := min(f.pr, c[pr]), d.pr+1
[]  tclean ∧ fclean ∧ ts = ts.v ∧ (pr.v ≠ v ∨ v = root) ∧
    (min(f.v, c[v]), d.v+1) better (f, d) →
      f, d, pr := min(f.v, c[v]), d.v+1, v
[]  (∀ w : w ∈ N ∧ pr.w = u : fclean.w ∧ (f, d) better (f.w, d.w)) →
      fclean := true
[]  ts.pr > ts →
      tclean, ts := false, ts.pr
[]  (∀ w : w ∈ N ∧ pr.w = u : tclean.w ∧ ts.w ≥ ts) →
      tclean := true
[]  (ts.v ≥ ts + 2 ∨ pr.pr = pr) ∧ pr ≠ u →
      pr, ts, tclean := u, max(ts, ts.v), false
[]  (∀ w : w ∈ N : pr.w ≠ u) ∧ pr = u ∧ (pr.v ≠ v ∨ v = root) ∧ ts.v ≥ ts →
      fclean, tclean, ts := true, true, ts.v;
      pr, f, d := v, min(f.v, c[v]), d.v+1
end

Process u contains seven actions. As before, the first action updates the pair (f, d) to match those of its parent, and becomes flow dirty if (f, d) became worse. The second action changes parents. We have strengthened the guard to ensure that u is timestamp clean, that the new parent has a timestamp equal to u's, and that the new parent also has a parent of its own. The third action is the same as in the propagated flow protocol. The fourth action updates the timestamp to that of the parent, and makes u timestamp dirty. The fifth action makes u timestamp clean. The sixth action detects that a neighbor v has ts.v ≥ ts + 2, or that the parent of u has no parent. In this case, u could be in a loop, so it sets pr to u. The timestamp of u is increased to at least the timestamp of the neighbor. In the last action, u rejoins the tree. If u has no parent and no children, and it finds a neighbor who does have a parent and whose timestamp is at least u's timestamp, then u chooses that neighbor as its new parent, it updates its timestamp and flow to match those of its new parent, and becomes flow and timestamp clean. The specification of the root process is similar to that of previous sections, except that we require one additional action, which is given below.

(∀ w : w ∈ N ∧ pr.w = root : tclean.w ∧ ts.w ≥ ts) → ts := ts + 1

In this action, the root increases its timestamp if all its children are timestamp clean and have a timestamp greater than or equal to the root's timestamp.
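To visualize the wave mechanism of the timestamp actions (the fourth and fifth actions above plus the root action), here is a toy sequential Python simulation on a fixed tree; it is our illustration of the mechanism, not the protocol itself, which runs these steps as concurrent guarded actions.

parent = {"root": "root", "a": "root", "b": "a", "c": "a"}
ts = {p: 0 for p in parent}
tclean = {p: True for p in parent}

def children(u):
    return [v for v in parent if parent[v] == u and v != u]

def step_once():
    # root action: advance the timestamp once all children are clean and caught up
    if all(tclean[w] and ts[w] >= ts["root"] for w in children("root")):
        ts["root"] += 1
        tclean["root"] = False
    for u in parent:
        if u == "root":
            continue
        if ts[parent[u]] > ts[u]:
            # fourth action: copy the parent's larger timestamp, become dirty
            ts[u], tclean[u] = ts[parent[u]], False
        elif all(tclean[w] and ts[w] >= ts[u] for w in children(u)):
            # fifth action: become clean once the children are clean and caught up
            tclean[u] = True

for _ in range(6):
    step_once()
print(ts)      # timestamps differ from ts["root"] by at most one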

7. Correctness of the Propagated Timestamp Protocol

We next show that the propagated timestamp protocol is self-stabilizing, that is, regardless of the initial state of the system, the protocol will converge to a maximum

flow spanning tree. Furthermore, during any fault-free execution of the protocol, the integrity of the spanning tree is maintained, even while the tree is adapting to changes in edge capacities. We show the correctness in several steps. We first present some closure properties of the protocol, and then prove that it converges to a maximum flow spanning tree. We begin by noting that the flow clean predicate is restored automatically regardless of the initial state of the system.

Lemma 2:
a) FC is a closure
b) true converges to FC


We next consider the timestamp clean bit. Predicate TC (Timestamp Clean) below relates tclean.u and the timestamps of the descendants of u.

TC ≡ (∀ u, v : v ∈ desc(u) : tclean.u ⇒ ts.v ≥ ts.u)

Lemma 3:
a) TC is a closure
b) true converges to TC


Since both FC and TC are closures, and they both are restored automatically, TC ∧ FC is also restored automatically (true converges to TC ∧ FC). However, this does not ensure that loops are broken, since FC and TC can hold even in the presence of loops. To show that loops are broken, we need the following theorem.

Theorem 2: TC ∧ FC converges to TC ∧ FC ∧ ST


Thus, by transitivity of the converges relation, true converges to FC ∧ TC ∧ ST, i.e., the spanning tree is restored automatically. For the next theorem, we assume that edge capacities remain constant. Otherwise, the maximum flow spanning tree is a moving target that may never be attained.

Theorem 3: TC ∧ FC ∧ ST converges to TC ∧ FC ∧ MFST


By transitivity, true converges to TC ∧ FC ∧ MFST, and the protocol is self-stabilizing.

8. Further Refinements

Several further refinements are possible to the propagated timestamp protocol. The most important of these are to relax the model to a message passing system, and to obtain a version of the protocol that uses only a timestamp with a finite number of values rather than an unbounded timestamp. It can be shown that the convergence time of the protocol is O(D²), where D is the diameter of the network. Convergence time may be reduced to O(D) with the following simple refinement. Each process maintains an additional variable, maxts, which is set to the maximum of both the ts and maxts variables of itself and of its neighbors. This will cause the value of maxts of the root to quickly obtain the value of the maximum timestamp in the system. The root would then set its timestamp to

a value at least maxts, which quickly ensures that no node has a timestamp greater than the root. The above and other practical refinements will be presented in the full version of the paper.

References
[0]  A. Arora and A. Singhai, "Fault-Tolerant Reconfiguration of Trees and Rings in Networks", IEEE International Conference on Network Protocols, 1994, page 221.
[1]  A. Arora, M. G. Gouda, and T. Herman, "Composite Routing Protocols", Proc. of the Second IEEE Symposium on Parallel and Distributed Processing, 1990.
[2]  J. Cobb, M. Gouda, "The Request-Reply Family of Group Routing Protocols", to appear in ACM Transactions on Computers, 1998.
[3]  J. J. Garcia-Luna-Aceves, "Loop-Free Routing Using Diffusing Computations", IEEE/ACM Transactions on Networking, Vol. 1, No. 1, Feb. 1993, page 130.
[4]  M. Gouda, "Protocol Verification Made Simple", Computer Networks and ISDN Systems, Vol. 25, 1993, pp. 969-980.
[5]  M. Gouda, The Elements of Network Protocols, textbook in preparation.
[6]  M. Gouda, "The Triumph and Tribulation of System Stabilization", International Workshop on Distributed Algorithms, 1995.
[7]  M. Gouda and M. Schneider, "Stabilization of Maximum Flow Trees", Invited Talk, Proceedings of the Third Annual Joint Conference on Information Sciences, 1994, pp. 178-181. A full version was submitted to the journal of Information Sciences.
[8]  M. Gouda and M. Schneider, "Maximum Flow Routing", Second Workshop on Self-Stabilizing Systems, 1996.
[9]  P. M. Merlin and A. Segall, "A Failsafe Distributed Routing Protocol", IEEE Transactions on Communications, Vol. COM-27, No. 9, pp. 1280-1288, 1979.
[10] M. Schneider, "Self-Stabilization", ACM Computing Surveys, Vol. 25, No. 1, March 1993.
[11] A. Segall, "Distributed Network Protocols", IEEE Transactions on Information Theory, Vol. IT-29, No. 1, pp. 23-35, Jan. 1983.

Appendix

A.1 Properties of the Propagated Flow Protocol

Proof Sketch of Lemma 1: One can show that FC by itself is a closure. The proof is very similar to the proof of part a) of Lemma 2 in the propagated timestamp protocol. To show that FC ∧ ST is a closure, note that the only time ST can be invalidated is when a process u changes parents. Since the new parent has a flow better than that of u, and u is flow clean, then, from FC, the new parent cannot be a descendant of u. Thus, a loop is not formed and ST is maintained.


For the next theorem, we assume that edge capacities remain constant.

Proof Sketch of Theorem 1:

Since FC ∧ ST is a closure, the parent variables define a spanning tree, and continue to do so forever. We have to show that eventually this spanning tree is a maximum flow spanning tree. Define T to be the following tree of processes. Initially, T contains only the root. Let u be the neighbor of the root whose edge capacity to the root is the highest. Let this capacity be C. We first show that all processes eventually have a flow at most C. We begin with the children of the root. If a child updates its flow to that of the edge to the root, then its flow becomes at most C. Furthermore, any child that the root gains will have a flow equal to the capacity of the edge to the root, and hence at most C. Therefore, all children of the root will eventually have a flow at most C, and this will continue to hold forever. We must then prove that if all descendants of the root down to level L of the tree have a flow at most C, then this continues to hold forever. This can be shown by executing each action and checking if the new state satisfies the above. We omit the details. We then must show that if all descendants of the root down to level L of the tree have a flow at most C, then so will processes at level L+1, and this will continue to hold forever. Again, we omit the details. By induction, all processes eventually have a flow at most C. Since u is a neighbor (at distance 1) of the root, it will choose the root as its parent. Let T now be the root plus process u. We repeat the same argument, finding the neighbor v of T whose edge into T has the greatest capacity C', and argue that all nodes outside of T eventually have a flow at most C', and that v chooses its neighbor in T as its parent. By construction, a maximum flow tree is the result.

□
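To make the construction used in this proof concrete, the following sketch builds the tree T greedily in a centralized way: it repeatedly attaches the process outside T whose edge into T has the largest capacity. This is only an illustration of the tree the proof reasons about, not the distributed protocol itself; the function and variable names are ours, not the paper's.

# Illustrative sketch of the tree T from the proof of Theorem 1 (centralized, not the
# distributed protocol): grow T by repeatedly attaching the outside process whose edge
# into T has the largest capacity.
def max_flow_spanning_tree(processes, root, capacity):
    """capacity[(u, v)] = capacity[(v, u)] = capacity of edge {u, v} (absent if no edge)."""
    in_tree = {root}
    parent = {root: root}
    while len(in_tree) < len(processes):
        best = None  # (capacity, outside process, neighbor inside T)
        for u in processes:
            if u in in_tree:
                continue
            for v in in_tree:
                c = capacity.get((u, v))
                if c is not None and (best is None or c > best[0]):
                    best = (c, u, v)
        if best is None:
            break  # graph not connected
        _, u, v = best
        parent[u] = v
        in_tree.add(u)
    return parent

# Example: the root's highest-capacity neighbor u joins first, exactly as in the proof.
edges = {("u", "root"): 10, ("v", "root"): 3, ("v", "u"): 7}
capacity = {**edges, **{(b, a): c for (a, b), c in edges.items()}}
print(max_flow_spanning_tree({"root", "u", "v"}, "root", capacity))
# {'root': 'root', 'u': 'root', 'v': 'u'}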

A.2 Properties of the Propagated Timestamp Protocol
Proof Sketch of Lemma 2: Part a) Let prlen(v,u), where v is a descendant of u, be the number of edges in the path from v to u following the parent variables. We define FC(i) as follows:

    FC(i): (∀ u, v : v ∈ desc(u) ∧ prlen(v,u) ≤ i : u is flow clean ⇒ (f.u, d.u) better (f.v, d.v))

We must show that FC(i) is a closure for all i, and hence FC is also a closure. To show that FC(i) is a closure, execute each action in the protocol under the assumption that FC(i) holds before the action and show that FC(i) holds after the execution.

Part b) We show by induction that FC(i) eventually holds for all i, regardless of the initial state. Base case: i = 1. Consider any process v. Process v will eventually execute its action to update its flow to that of its parent u. Then, its flow is worse than that of its parent, and hence FC(1) holds between v and u. It is easy to show that FC(1) continues to hold forever between v and its parent, even if v changes parents.

Figure 4: Restoring FC(i+1) between u and v.

Induction case: assume FC(i) holds; show that eventually FC(i+1) holds. Consider any process v, as illustrated in Figure 4, and assume i = 2. Since we assume FC(i) holds, the relationship between u and each of x and y satisfies FC(i). Descendant v will eventually execute its action to update its flow to that of its parent. When this occurs, if u is flow clean, then the flow of y is worse than that of u, and hence the new flow of v is worse than u's, and FC(i+1) holds between u and v. Assume next that v was not a descendant of u, and either v or an ancestor (say y) chose to join the subtree of u. In this case, since the process changing parents (y in this case) must be flow clean, from FC(i), v's flow is worse than y's, which in turn is worse than x's. Thus, FC(i+1) holds between u and v. Hence, for any descendant v of u at a distance i+1 from u, either FC(i+1) holds between v and u when v joins the subtree of u, or v is already in the subtree and FC(i+1) holds between v and u eventually. We need to show next that FC(i+1) continues to hold between v and u. This can be shown in a manner similar to that of part a), by going through all the actions in the protocol assuming FC(i+1) holds between v and u and showing that it continues to hold after executing the action. Therefore, FC(i+1) eventually holds between u and any descendant v at a depth of i+1, which implies that eventually FC(i+1) holds between any pair of processes and continues to hold forever.

□
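As an illustration of how FC(i) can be checked on a concrete state, the sketch below walks the parent pointers of a toy forest. The state layout, the "better" ordering (larger flow first, smaller distance second) and the flow-clean flags are assumptions made only for this example; the protocol's actual definitions appear in the body of the paper.

# Toy checker for FC(i) on a parent-pointer forest. The 'better' ordering and the
# flow_clean flags are assumed for illustration only.
def ancestors(v, parent):
    path, u = [], v
    while parent[u] != u:
        u = parent[u]
        path.append(u)
    return path  # ancestors of v, nearest first

def better(a, b):            # assumed ordering on (flow, distance) pairs
    return (a[0], -a[1]) > (b[0], -b[1])

def fc(i, parent, flow, dist, flow_clean):
    """FC(i): for every u and descendant v within i parent-edges,
    if u is flow clean then (f.u, d.u) is better than (f.v, d.v)."""
    for v in parent:
        for depth, u in enumerate(ancestors(v, parent), start=1):
            if depth > i:
                break
            if flow_clean[u] and not better((flow[u], dist[u]), (flow[v], dist[v])):
                return False
    return True

parent = {"r": "r", "a": "r", "b": "a"}
print(fc(2, parent, {"r": 9, "a": 7, "b": 5}, {"r": 0, "a": 1, "b": 2},
         {"r": True, "a": True, "b": True}))   # True in this toy state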

Lemma 3: The proof is very similar to that of Lemma 2 and is omitted.

Proof Sketch of Theorem 2: We prove the theorem in two parts. For the first part, we show that the timestamp of the root always increases. For the second part, we prove that eventually a spanning tree is formed. Part one. To show that the timestamp of the root always increases, we must show that the guard of the action that increases it has to become true. The guard checks two things: first, that all children have a timestamp at least that of the root; second, that all children are timestamp clean. The first part will become true because, by fairness, all children must eventually execute the action that sets their timestamp to that of their parent. To prove that all children eventually become timestamp clean, we show the following. Define, for all i,

clean-ts(i) = (∀ u : u ∈ desc(root) ∧ prlen(u, root) ≤ i : all processes in pr-path(u) from the root down to the first timestamp clean process have non-decreasing timestamps)

It can be shown by induction that each clean-ts(i) will hold and continues to hold until the root increases its timestamp. Thus, timestamps will be non-decreasing from the root up to the first timestamp clean process in every rooted path. A simple induction argument shows that all processes will become timestamp clean, and thus the root is free to increase its timestamp. Part two: Knowing that the timestamp of the root always increases, we need to show that a spanning tree is obtained. From Lemmas 2 and 3, we know that eventually FC and TC hold and continue to hold. We assume that FC and TC already hold. Let the largest timestamp of any process be T at this time. By part one, eventually the root increases its timestamp to T+1, and no other process has a timestamp greater than T. What we would like to show is that eventually all processes have a timestamp greater than T+1, and that at all times all processes with timestamps greater than T form a tree. When the root increases its timestamp to T+1, we have a tree with a single node. When a process changes parents, the new parent must have the same timestamp as the process, and thus a process with timestamp at least T+1 will never choose as a parent a process with timestamp less than T+1. Furthermore, since FC and TC hold, no process ever chooses a descendant as its parent. Hence, all processes with timestamp at least T+1 always form a tree. What remains to be shown is that all processes eventually have a timestamp of at least T+1. From TC, when the root is timestamp clean, all processes have a timestamp of at least ts.root. However, it is easy to argue that the root timestamp is always at least that of any other process. Thus, if a process has a timestamp of at least T+1, then its timestamp is either equal to the root's (when the root is timestamp clean) or one less than the root's (when the root is timestamp dirty). Since the timestamp of the root always increases, eventually it becomes greater than or equal to T+3. In this case, for each process, either the process has a timestamp of at least T+2 (and thus is connected to the root) or has a timestamp of at most T (and thus is not connected to the root). Any process v with timestamp T or less that has a neighbor whose timestamp is T+2 or greater will execute action six, setting pr.v = v. Eventually, all its children do the same, and v ends with no parent and no children. Thus, v is now free to execute action seven and join the tree, setting its timestamp to at least T+2 (i.e., the timestamp of its new parent). Since all processes eventually have a timestamp of at least T+1 (in fact, at least T+2), and all these processes form a tree, and continue to form a tree, the theorem holds.

□

Theorem 3: The proof is very similar to that of Theorem 1 and is omitted.

Deductive Verification of Stabilizing Systems

Y. Lakhnech¹ and M. Siegel²

¹ Christian-Albrechts-Universität zu Kiel, Institut für Informatik und Praktische Mathematik II, Preusserstrasse 1-9, 24105 Kiel, Germany. E-mail: [email protected]
² Weizmann Institute of Science, Dept. of Applied Mathematics and Computer Science, Rehovot 76100. E-mail: [email protected]

Abstract. This paper links two formerly disjoint research areas: proof rules for temporal logic and stabilization. Temporal logic is a widely acknowledged language for the specification and verification of concurrent systems. Stabilization is an increasingly important paradigm in fault-tolerant distributed computing. In this paper we give a brief introduction to stabilizing systems and present fair transition systems for their formal description. Then we give a formal definition of stabilization in linear temporal logic and provide a set of temporal proof rules specifically tailored towards the verification of stabilizing systems. By exploiting the semantical characteristics of stabilizing systems, the presented proof rules are considerably simpler than the general temporal logic proof rules for program validity; yet we prove their completeness for the class of stabilizing systems. These proof rules constitute the basis for machine-supported deductive verification of an important class of distributed algorithms.

1 Introduction

This paper links two formerly disjoint research areas: proof rules for temporal logic and stabilization. Temporal logic is a widely acknowledged language for the specification and verification of concurrent systems [18, 19]. Stabilization is a concept used in various fields of computer science such as databases and artificial intelligence [14, 22] but has gained its importance as a paradigm in fault-tolerant distributed computing [22]. In this paper we give a brief introduction to stabilizing systems and present fair transition systems [18] for their formal description. Then we give a formal definition of stabilization by means of linear temporal logic (LTL for short) and provide a set of temporal proof rules specifically tailored towards the verification of stabilizing systems. By exploiting the semantical characteristics of stabilizing systems the presented proof rules are considerably simpler than the general temporal logic proof rules for program validity, yet we prove their completeness for the class of stabilizing systems. These proof rules constitute the basis for

machine-supported deductive verification of an important class of distributed algorithms. The notion of stabilization has been introduced to the field of distributed computing by Edsger W. Dijkstra [12]. In 1983, Lamport noted the importance of stabilization for fault-tolerant distributed computing [16]. He observed that stabilizing systems show the remarkable property of being capable of automatically recovering from transient errors, i.e. errors which do not continue to occur during the period of recovery [22], and of remaining correct thereafter. Unfortunately, stabilization comes at a high price. Stabilizing algorithms are amongst the most complicated objects studied in the field of distributed computing. Their intricacy stems from the high degree of parallelism which is inherent to stabilizing systems (see [24] for a detailed discussion), combined with the standard problem in distributed computing that processes, cooperating in a network, have to perform actions based on local information in order to accomplish a global objective. Since fault-tolerance is an increasingly important branch of distributed computing, there is a well-defined need for methods which support the design and verification of fault-tolerant systems based on stabilization. The state of the art so far consists of handwritten proofs based on an intuitive understanding of the system, commonly described in pseudo-code, and the desired properties. Undoubtedly, handwritten proofs add confidence in the correctness of the algorithms. However, there is common consent that correctness proofs of complicated distributed systems are just as error-prone as the systems themselves [20]. This insight resulted in an increasing interest in automated verification, which comprises algorithmic techniques, known as model checking, e.g. [3, 9], and deductive techniques, commonly referred to as theorem proving, e.g. [8, 21]. The complexity of stabilizing systems makes tool support for their verification desirable if not indispensable. The prerequisite for tool support is a formal basis comprising the following:

1. a formal model to describe stabilizing systems,
2. a formal specification language to describe properties of stabilizing systems, and
3. a formal framework to establish that a certain system obeys some specific properties.

This article provides such a formal basis. We use fair transition systems to describe stabilizing systems and advocate that LTL is an adequate specification language for this class of systems. The combination of temporal logic specifications and transition systems allows for algorithmic verification in the finite state case. To cover those cases where algorithmic verification is not applicable we investigate a temporal proof system for stabilizing systems. These proof rules are considerably simpler than the general proof rules for linear temporal logic [17, 18, 19]. Nevertheless, we prove that the simple rules are just as complete as the general rules when considering stabilizing systems.

This article is organized as follows: In Section 2 we present fair transition systems and linear temporal logic. A brief introduction to stabilization as well as a formal definition by means of LTL is given in Section 3. The framework for formal reasoning about stabilization, consisting of a set of proof rules, is presented in Section 4. Some conclusions and prospects are given in Section 5. Remark: Due to space limitations we only sketch some of the proofs. All proofs and technical details can be found in [24].

2 Preliminaries

2.1 Fair Transition Systems

For the description of stabilizing systems, formalisms with all kinds of communication mechanisms (shared variables, synchronous and asynchronous communication) are used in the literature, and the execution models vary from pure interleaving to true parallelism. Therefore we use a generic abstract model, namely fair transition systems [18], to formally capture stabilizing systems and their semantical characteristics. This choice guarantees broad applicability of our formalization as well as tool support, since most existing analysis and verification tools are based on transition systems. For the following presentation we assume a countable set Var of typed variables in which each variable is associated with a domain describing the possible values of that variable. A state s is a partial function s : Var → Val assigning type-consistent values to a subset of variables in Var. By V_s we refer to the domain of s, i.e. the set of variables evaluated by s. The set of all states interpreting variables in V ⊆ Var is denoted by Σ_V.

Definition 1. A fair transition system A = (V, Θ, T, WF, SF) consists of a finite set V ⊆ Var of state variables, an assertion Θ characterizing initial states, a finite set T = {t_1, ..., t_n} of transitions, as well as weak and strong fairness constraints expressed by sets WF ⊆ T resp. SF ⊆ T.

To refer to the components of a fair transition system (fts for short) A we use A as an index; the state space of A is denoted by Σ_A and defined by Σ_A ≝ Σ_{V_A}. The set of all fts's A with V_A ⊆ Var is denoted by S. Each transition t ∈ T is represented by a first order assertion ρ_t(V, V'), called a transition predicate [18]. A transition predicate relates the values of the variables V_s in a state s to their values in a successor state s' obtained by applying the transition to s. It does so by using two disjoint copies of the set V_s. The occurrence of a variable v ∈ V_s refers to its value in state s, while an occurrence of v' refers to the value of v in s'. Assuming that ρ_t(V_s, V_s') is the transition predicate of transition t, state s' is called a t-successor of s iff (s, s') ⊨ ρ_t(V_s, V_s'). A transition t is enabled in state s if there exists s' with (s, s') ⊨ ρ_t(V, V'); otherwise it is disabled. Enabledness of a transition can be expressed by en(t) ≝ ∃V'. ρ_t(V, V'). We require that T contains the idling transition t_idle whose transition relation is ρ_{t_idle} : (V = V'). We use a standard linear semantics for fair transition systems [17].

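For readers who want to experiment, Definition 1 admits a direct finite-state rendering in which states are listed explicitly and each transition is given as a successor relation instead of a first-order transition predicate. The sketch below makes that simplifying assumption and is ours; the names Fts, succ and en are not from the paper.

# A minimal finite-state rendering of Definition 1, for illustration only.
from dataclasses import dataclass, field

@dataclass
class Fts:
    states: frozenset                              # the state space Sigma_A
    theta: callable                                # initial-state assertion: state -> bool
    trans: dict = field(default_factory=dict)      # name -> set of (s, s') pairs
    wf: frozenset = frozenset()                    # weakly fair transition names
    sf: frozenset = frozenset()                    # strongly fair transition names

    def succ(self, t, s):
        return {s2 for (s1, s2) in self.trans[t] if s1 == s}

    def en(self, t, s):                            # enabledness en(t) in state s
        return bool(self.succ(t, s))

# A two-state toy system with an idling transition, as required of every fts.
S = Fts(states=frozenset({0, 1}),
        theta=lambda s: True,                      # non-initializing: every state is initial
        trans={"idle": {(0, 0), (1, 1)}, "fix": {(0, 1)}},
        wf=frozenset({"fix"}))
print(S.en("fix", 0), S.en("fix", 1))              # True False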

Definition 2. A computation of an fts A = (V, Θ, T, WF, SF) is an infinite sequence σ = (s_0, s_1, s_2, ...) of states s_i ∈ Σ_A such that

1. s_0 ⊨ Θ,
2. for all i ∈ ℕ, state s_{i+1} is a t-successor of s_i for some transition t ∈ T,
3. if transition t ∈ WF is continuously enabled from some point onwards in σ, there are infinitely many i ∈ ℕ with (s_i, s_{i+1}) ⊨ ρ_t(V, V'),
4. if transition t ∈ SF is infinitely often enabled in σ, there are infinitely many i ∈ ℕ with (s_i, s_{i+1}) ⊨ ρ_t(V, V').

The set of computations generated by an fts A is denoted by [A]. A finite sequence of states that satisfies conditions 1. and 2. above is called a computation-prefix. Note that each fts A is machine closed [1], that is, each computation-prefix of A is a prefix of a computation of A. For a given computation σ = (s_0, s_1, s_2, ...) and i ∈ ℕ we define the prefix σ↓i of σ up to index i by σ↓i ≝ (s_0, s_1, ..., s_i). The suffix of σ starting at index i is the infinite sequence σ↑i ≝ (s_i, s_{i+1}, ...). The set of computation prefixes of A is denoted by Pref(A), the set of computation suffixes by Suff(A). By σ_i we refer to the (i+1)-st state of computation σ. We use {p} t {q} as an abbreviation of (p ∧ ρ_t(V, V')) → q', where q' results from q by replacing all variables of q by their primed versions. For finite sets T = {t_1, ..., t_n} of transitions we define {p} T {q} ≝ ⋀_{i=1}^n {p} t_i {q}. As notation for sequences we use ∘ to denote concatenation of sequences and last(seq) to return the last element of a (finite) sequence seq.
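The premises of the proof rules presented later are assertional validities of the form {p} T {q}. In a finite-state rendering such as the one sketched above they can be discharged by plain enumeration; the helper below is an illustration under that assumption, with names of our own choosing.

# Check {p} T' {q} by enumeration: every t-successor of a p-state satisfies q,
# for every transition t in the given collection of names.
def hoare(trans, p, names, q):
    return all(q(s2) for t in names for (s1, s2) in trans[t] if p(s1))

# Toy transition relations (same shape as in the previous sketch).
trans = {"idle": {(0, 0), (1, 1)}, "fix": {(0, 1)}}
print(hoare(trans, lambda s: True, trans.keys(), lambda s: s in (0, 1)))      # True
print(hoare(trans, lambda s: s == 0, ["fix"], lambda s: s == 1))              # True: {s=0} fix {s=1}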


2.2 Linear Temporal Logic

As specification language we use a future fragment of Linear Temporal Logic [18] without the next-operator, referred to as LTL⁻. Formulas are constructed from state formulas of some first order language L, the temporal until operator U, and the boolean operators ∧, ¬ applied to formulas. Temporal formulas are denoted by φ, ψ, ... and state formulas by p, q, .... Temporal formulas are interpreted over infinite sequences of states [18].

Definition 3. Given an infinite sequence σ = (s_0, s_1, s_2, ...) of states and φ ∈ LTL⁻. We define that σ satisfies φ, denoted by σ ⊨ φ, as:

    σ ⊨ p        iff  s_0 ⊨ p,
    σ ⊨ φ ∧ ψ    iff  σ ⊨ φ and σ ⊨ ψ,
    σ ⊨ ¬φ       iff  σ ⊭ φ,
    σ ⊨ φ U ψ    iff  ∃i ≥ 0. σ↑i ⊨ ψ and ∀j < i. σ↑j ⊨ φ.

We use the standard abbreviations ◇φ ≝ true U φ (eventually), □φ ≝ ¬◇¬φ (always) and φ ⇒ ψ ≝ □(φ → ψ) (entails). An fts A satisfies a formula φ ∈ LTL⁻, denoted by A ⊨ φ, if all its computations satisfy φ.
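For intuition, the clauses of Definition 3 can be evaluated directly on an ultimately periodic computation, i.e. a finite prefix followed by a loop repeated forever, since only finitely many distinct suffixes occur. The following sketch assumes this lasso representation and is meant purely as an illustration of the semantics of ◇, □ and U; the encoding is ours.

# Evaluate eventually/always/until on a lasso computation sigma = prefix . loop^omega.
def holds_eventually(p, prefix, loop):          # sigma |= <>p
    return any(p(s) for s in prefix + loop)

def holds_always(p, prefix, loop):              # sigma |= []p
    return all(p(s) for s in prefix + loop)

def holds_until(p, q, prefix, loop):            # sigma |= p U q
    for s in prefix + loop:
        if q(s):
            return True
        if not p(s):
            return False
    return False                                # q never occurs

# A computation of a stabilizing system: arbitrary start, then forever legal.
prefix, loop = [3, 2, 1], [0]                   # 0 encodes a legal state
legal = lambda s: s == 0
print(holds_eventually(legal, prefix, loop))                 # True:  <>le
print(holds_always(lambda s: s >= 0, prefix, loop))          # True
print(holds_until(lambda s: s > 0, legal, prefix, loop))     # True:  (s>0) U le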

3 A Brief Introduction to Stabilization

Stabilization is, besides masking and detection/recovery mechanisms, the major paradigm in fault-tolerant computing [4]. Stabilization is studied as an approach to cope with the effect of arbitrary faults, as long as some indispensable prerequisites for recovery, characterized by a so-called fault-span, are not violated. Such prerequisites may require that the actual code is not affected, or that the communication topology of the distributed system is not divided into unconnected parts. Fault-tolerance by stabilization ensures the continued availability of systems by correctly restoring the system state whenever the system exhibits incorrect behavior due to the occurrence of faults. Obviously, such a self-organizing behavior requires some assumptions about the occurrence of faults. When talking about stabilization there is always an environment that acts as an adversary producing transient faults, thus affecting the consistency of variables upon which the actual stabilization depends. In most publications on stabilization, see e.g. [2, 7, 13], the adversary is not explicitly modeled. The influence of faults is indirectly taken into account by considering arbitrary initial states in stabilizing systems. This models the situation that a fault-action just occurred and now time is given to the stabilizing system to reestablish a legal state. From a more application oriented point of view, stabilizing systems are constructed to be non-initializing because:

- (re-)initialization of distributed systems is a very difficult task, cf. [6], unless one constructs systems which can start in arbitrary states, and

- the main structuring principle to manage the complexity involved in the design and verification of these systems is to build complex stabilizing systems as a particular composition of simpler stabilizing systems, see e.g. [6, 13, 14, 15]. In order for this composition to yield a stabilizing overall system, the simpler systems have to be non-initializing (see [24] for details).

So, from now on we restrict our attention to the set NI of non-initializing systems, formally defined as NI ≝ {S ∈ S | Θ_S ↔ true}.


Formal Definition of Stabilization. Arora and Gouda advocate in [4, 5] a general and uniform definition of fault-tolerance based on the terms convergence and closure. The observation which led to such a uniform definition is that there are two distinct kinds of behavior that a fault-tolerant system displays. As long as no fault occurs it performs the task that it is designed for. As soon as a fault occurs, a concerted effort of the processes causes the whole system to re-establish a so-called legal state [12], from which point on it displays the original fault-free behavior. So, in [4, 5] it is assumed that for a fault-tolerant system there exists a predicate le characterizing the set of legal system states which is invariant throughout fault-free system execution.

Furthermore, Arora and Gouda state that faults can uniquely be represented as actions that upon execution perturb the system state [11]. Instead of anticipating various possible faults, it is only assumed that there exists a predicate fs weaker than le, called the fault span of the system [4], which defines the extent to which fault actions may perturb the legal states during system execution. Based on these observations, a system S ∈ NI is defined to be stabilizing w.r.t. fs and le iff it has the following two properties:

- convergence: Starting in an arbitrary state where fs holds, eventually a state is reached where le holds.

- closure: Predicates fs and le are invariant under execution of actions from S.

This description of stabilization can be formally captured by means of LTL⁻ as follows:


Definition 4. Given fts S ∈ NI, fault span fs and predicate le, with le → fs. System S is stabilizing w.r.t. fs and le iff S satisfies the following properties:

    convergence:  S ⊨ □(fs → ◇le)
    closure:      S ⊨ □(fs → □fs)
                  S ⊨ □(le → □le)
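On a small finite example the two requirements of Definition 4 can be checked exhaustively. The sketch below makes the simplifying assumption that the system has a single corrective step besides idling, so that weak fairness reduces to eventually taking that step; it is an illustration of the definition and of our own making, not a general verification procedure.

# Definition 4 on a tiny finite example (deterministic apart from idling).
def closed(pred, step, states):
    """{pred} T {pred}: the non-idling step never leaves the predicate."""
    return all(pred(step(s)) for s in states if pred(s))

def converges(fs, le, step, states):
    """From every fs-state, following the corrective step reaches an le-state."""
    for s in states:
        if not fs(s):
            continue
        cur = s
        for _ in range(len(states)):
            if le(cur):
                break
            cur = step(cur)
        else:
            return False
    return True

# Toy system: states 0..5, the step decrements towards the legal state 0.
states = range(6)
step = lambda s: max(s - 1, 0)
fs = lambda s: True          # fault span: any state
le = lambda s: s == 0        # legal states
print(closed(fs, step, states), closed(le, step, states), converges(fs, le, step, states))
# True True True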

4 Proof Rules for Convergence and Closure

In this section we present a collection of temporal proof rules which exploit the fact that stabilizing systems are non-initializing and that their fault-span is not left by system transitions, i.e. the fault-span is closed under system execution. We give proof rules for program validity of formulas □(p → □p) and □(p → ◇q) for state predicates p, q. Repeated application of these rules gives a complete reduction of the program validity of these formulas into a set of assertional validities [18]. We prove relative completeness of our rules, i.e. we assume that there exists an oracle that provides proofs or otherwise verifies all generally valid assertions [17]. There exist complete proof rules for closure and convergence for arbitrary fts's [19]. As explained, rather than simply recalling these very powerful rules, we adapt the general rules to deal with systems from the set NI. In the following presentation we always state the general proof rule first, followed by the adapted rule for set NI.

4.1 Proof Rules for Closure Properties

We start with a complete rule for proving S ⊨ □(p → □p) for S ∈ S and some state predicate p. The rule for general invariance in [17, 19] refers to past

modalities; in order to avoid the introduction of past-modalities we state a rule which only refers to future modalities.

GClos:
    S ⊨ p ⇒ φ
    ⊨ φ → p
    ⊨ {φ} T {φ}
    ------------------
    S ⊨ □(p → □p)

Concerning notation: in the following rules we assume that a fixed fts is given and refer to its components where necessary (e.g. to its set of transitions T in the rule above), cf. [17]. Recall that the operator ⇒ denotes entailment, φ ⇒ ψ ≝ □(φ → ψ), whereas → denotes the usual implication between assertions or formulas. Soundness and completeness of the above rule means that assertion p is closed in system S iff there exists a state formula φ such that 1. p implies φ in every reachable state of S, 2. φ implies p in every state, and

3. φ is invariant under the transitions of S. The first two premises imply that φ and p are equivalent on the set of reachable states of S. Typically, the additional state formula φ is used to characterize the set of p-states, i.e. states where predicate p holds, that are reachable in the system under investigation. Finding the appropriate assertion φ, i.e. coding enough reachability information into such an auxiliary predicate, is the intricate part in the application of rule GClos.

Proposition 5. Rule GClos is sound and complete for proving S ⊨ □(p → □p) for S ∈ S.

Proof. As explained, we consider completeness of our proof rules relative to assertional validity. Hence it is sufficient to show that validity of the premises follows from validity of the conclusion. So, given a system S and predicate p such that S ⊨ □(p → □p), we have to define a predicate φ such that the premises of rule GClos become valid. Assuming that the data domain of the assertion language is expressive enough to encode records of data and lists of records, one can define a state assertion χ that holds in a state s iff s is reachable in S, i.e. appears in some computation of S (for technical details of this definition see [17]). Now, we consider as auxiliary predicate χ_p ≝ χ ∧ p and prove validity of the premises for predicate χ_p.

- S ⊨ p ⇒ χ_p: Given a computation σ ∈ [S] and position i such that σ_i ⊨ p. Obviously σ_i is reachable and satisfies p, so σ_i ⊨ χ_p.

- ⊨ χ_p → p: By the definition of χ_p we have χ_p → p, because p is a conjunct of χ_p.

- ⊨ {χ_p} T {χ_p}: We prove that whenever a state s satisfies predicate χ_p, every t-successor s' of s, for t ∈ T, also satisfies χ_p.

      s ⊨ χ_p  iff  s ⊨ χ ∧ p                                  by definition of χ_p
               iff  ∃κ ∈ Pref(S). (last(κ) = s ∧ s ⊨ p)        by definition of χ

  Let κ be a computation prefix leading to s, i.e. last(κ) = s. We have that κ∘(s') ∈ Pref(S), since s' is by assumption a t-successor of s for some t ∈ T and since S is machine closed. Since s ⊨ p and S ⊨ □(p → □p), we have s' ⊨ p. We conclude s' ⊨ χ ∧ p, i.e. s' ⊨ χ_p.

The adapted complete rule for proving closure for systems S ∈ NI does not use any auxiliary predicate:

Clos-NI:
    ⊨ {p} T {p}
    ------------------
    S ⊨ □(p → □p)

The remaining premise {p} T {p} is local in the sense that it does not require any more temporal reasoning; it is purely assertional. Note that this rule is not complete if we consider arbitrary fts's. Instead of proving soundness and completeness of rule Clos-NI for systems from NI directly, we prove that S ⊨ □(p → □p) (for S ∈ NI) is derivable by rule GClos iff it is derivable by rule Clos-NI. This reveals how we obtained the premise of rule Clos-NI, namely by an explicit characterization of predicate φ in rule GClos and subsequent simplification of its premises.


Proposition 6. Given system S ∈ NI. S ⊨ □(p → □p) is derivable by rule GClos iff it is derivable by rule Clos-NI.

Proof. Given system S ∈ NI and predicate p. We just prove the more interesting direction: validity of the premises of rule GClos implies validity of the premise of rule Clos-NI. So assume that there exists a predicate φ such that S ⊨ p ⇒ φ, ⊨ φ → p and ⊨ {φ} T {φ} hold. We have to prove that ⊨ {p} T {p} holds. We exploit the soundness and completeness result of rule GClos. From the assumptions we have, by soundness of rule GClos, that S ⊨ □(p → □p). From the completeness proof for rule GClos we know that ⊨ {χ_p} T {χ_p} holds. Now it suffices to note that χ_p ↔ p for systems S ∈ NI, since:

    s ⊨ χ_p  iff  s ⊨ χ ∧ p                                 by definition of χ_p
             iff  ∃κ ∈ Pref(S). last(κ) = s ∧ s ⊨ p         by definition of χ
             iff  s ⊨ p                                     since S ∈ NI

The last step is the reason why we can do without the auxiliary predicate φ in rule Clos-NI: every p-state of S is reachable anyhow since S ∈ NI, so the predicates χ_p and p are equivalent. From Propositions 5 and 6 we get:

Corollary 7. Rule Clos-NI is sound and complete for proving S ⊨ □(p → □p) for S ∈ NI.

Before we present proof rules for convergence we state a proposition which is important for the simplification of the forthcoming proof rules.

Proposition 8. For all systems S ∈ NI and all temporal formulas φ ∈ LTL⁻ the following holds: S ⊨ φ iff S ⊨ □φ.

The proof of this proposition is based on the observation that the set of computations of all systems S ∈ NI is suffix closed.

4.2 Proof Rules for Convergence

In the presentation of the proof rules for convergence we also consider subsets of NI consisting of all those systems which preserve a predefined state predicate p, i.e. sets of the form NI_p ≝ {S ∈ NI | S ⊨ □(p → □p)}. The relevance of NI_p stems from the fact that we are interested in establishing the stabilization of a given system S ∈ NI w.r.t. some state predicates p, q. So we have to prove S ⊨ □(p → ◇q) ∧ □(p → □p) ∧ □(q → □q). Proving S ⊨ □(p → □p) first (by rule Clos-NI), we have shown that S belongs to the set NI_p and thus we can apply the rules for NI_p in order to establish the convergence of S. The existing rules for proving convergence properties are usually partitioned into single-step convergence rules and extended convergence rules. We keep this pattern and mainly follow the presentation in [17]. Similar to the previous section we first state the general rule, followed by an adapted rule for set NI and finally the rule for NI_p.

Single-step Convergence Rule under Weak Fairness. Single-step convergence rules are applicable in case there exists at least one transition that accomplishes the desired convergence within one step. We obtain two slightly different rules, depending on whether this so-called helpful transition is executed weakly or strongly fair. The general proof rule as stated in [17] is:

WConv:
    S ⊨ p ⇒ (q ∨ φ)
    ⊨ {φ} T {q ∨ φ}
    ⊨ {φ} t {q}
    S ⊨ φ ⇒ (q ∨ en(t))
    ------------------------
    S ⊨ □(p → ◇q)

In this rule t identifies the helpful transition, contained in the set of weakly fair executed transitions, and en(t) denotes the enabledness of t. The rule states that we have to find a state predicate φ such that p entails q ∨ φ. Predicate φ has to be preserved by every transition in T unless q is established. Since the helpful transition t is enabled in φ-states as long as q is not yet established, we conclude that either q is established by a T-transition or, by weak fairness, finally t is executed, which also establishes q. When restricting our attention to systems in the set NI we get, due to Proposition 8, the following simpler proof rule, in which the two entailments are weakened to implications:

WConv-NI:
    ⊨ p → (q ∨ φ)
    ⊨ {φ} T {q ∨ φ}
    ⊨ {φ} t {q}
    ⊨ φ → (q ∨ en(t))
    ------------------------
    S ⊨ □(p → ◇q)

What looks like a minor change, replacing entailment by implication twice, constitutes in fact a considerable saving. Whereas the entailment premises of rule WConv in general require further invariants for their proof, in the case of stabilizing systems we are done with proving ordinary implications!

Proposition 9. Given system S ∈ NI. S ⊨ □(p → ◇q) is derivable by rule WConv iff it is derivable by rule WConv-NI.

In case that the system under consideration belongs to the set NI_p we get a further simplification.

This simplification causes no loss of generality when considering systems in NI_p.

Proposition 10. Given system S ∈ NI_p. S ⊨ □(p → ◇q) is derivable by rule WConv iff it is derivable by rule WConv-NI_p.

Proof. The proof of this proposition is based on the following two lemmata, which are proved in [24]. These lemmata refer to a predicate χ_p^q which is a variant of the predicate χ_p used in the completeness proof of Proposition 5. Predicate χ_p^q is defined such that s ⊨ χ_p^q holds in a state s ∈ Σ_S iff there exists a reachable p-state s' and a computation segment κ leading from s' to s such that κ does not contain any q-states.

Lemma 11. Given system S ∈ S and predicates p, q, φ. If S ⊨ p ⇒ (q ∨ φ) and ⊨ {φ} T {q ∨ φ} hold, then ⊨ χ_p^q → (q ∨ φ) and ⊨ {χ_p^q} T {q ∨ χ_p^q}.

The second lemma gives a characterization of predicate χ_p^q for systems S ∈ NI_p.

Lemma 12. For systems S ∈ NI_p and predicates p, q we have ⊨ χ_p^q ↔ (p ∧ ¬q).

Using these lemmata it is not difficult to complete the proof for both directions of Proposition 10.

Single-step Convergence Rule under Strong Fairness. The second single-step convergence rule relies on the existence of a helpful transition t in the set of strongly fair executed transitions. The general rule is:

SConv:
    S ⊨ p ⇒ (q ∨ φ)
    ⊨ {φ} T {q ∨ φ}
    ⊨ {φ} t {q}
    S ⊨ φ ⇒ ◇(q ∨ en(t))
    ------------------------
    S ⊨ □(p → ◇q)

As in rule WConv, t denotes the helpful transition, but now contained in the set of strongly fair executed transitions. Only the fourth premise is changed: we have to prove that φ entails that eventually q ∨ en(t) holds. With the same justification as in the case of weak fairness we obtain two simplifications of this rule, SConv-NI and SConv-NI_p.

We have the corresponding relative completeness result for rule SConv-NI w.r.t. rule SConv.

Proposition 13. Given system S ∈ NI. S ⊨ □(p → ◇q) is derivable by rule SConv iff it is derivable by rule SConv-NI.

In case the system under consideration belongs to the set NI_p, the expected result for strong fairness is:

Proposition 14. Given system S ∈ NI_p. S ⊨ □(p → ◇q) is derivable by rule SConv iff it is derivable by rule SConv-NI_p.

As can be observed in most convergence proofs for stabilizing systems, a concerted effort of several transitions is generally necessary to establish q after a p-state has been encountered. So more powerful rules for so-called extended convergence are needed.

Extended Convergence Rules. The general proof rule reduces extended convergence properties to a set of single-step convergence properties. These single-step convergence properties commonly serve to establish a well-founded induction argument. We follow the presentation in [17].

A binary relation ⪯ over a set A is a pre-order if it is reflexive and transitive. If a ⪯ b but not b ⪯ a we say that a precedes b, denoted by a ≺ b. The irreflexive, asymmetric, and transitive ordering (A, ≺) induced by (A, ⪯) is well-founded if there does not exist an infinite sequence (a_0, a_1, a_2, ...) where a_i ∈ A and a_{i+1} ≺ a_i for all i ≥ 0. A pre-order ⪯ is called well-founded if its induced ordering ≺ is well-founded. Since we do not deal with past-operators we can use ranking functions which map states, rather than computation prefixes, to a well-founded domain. So in the following rule from [17], δ : Σ → A is a ranking function, where Σ is the state space of the system under consideration and A the domain of a well-founded pre-order (A, ⪯). Then, the following rule can be used to prove convergence:

EConv-1:
    S ⊨ p ⇒ (q ∨ φ)
    S ⊨ (φ ∧ δ = α) ⇒ ◇(q ∨ (φ ∧ δ ≺ α))
    -----------------------------------------
    S ⊨ □(p → ◇q)

The first premise ensures that q ∨ φ holds at a position in a computation if p holds at that position. The second premise ensures that each reachable state where φ holds is eventually followed by a φ-state with a lower rank or by a state where q holds.

Proposition 15 (Manna & Pnueli). The rule set {EConv-1, WConv, SConv} is sound and complete for proving S ⊨ □(p → ◇q) for S ∈ S and predicates p, q.
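To illustrate the role of the ranking function, the sketch below checks a finite-state approximation of the EConv-1 premises for a system with one helpful, weakly fair transition: every non-idling step taken from a (φ ∧ ¬q)-state must lower δ or establish q. The encoding and names are ours, not the paper's, and the first premise is assumed to be discharged separately for the p of interest.

# Finite-state approximation of the second EConv-1 premise (illustration only).
def econv1_rank_premise(states, succ, phi, q, delta):
    return all(q(s2) or (phi(s2) and delta(s2) < delta(s))
               for s in states if phi(s) and not q(s)
               for s2 in succ(s) if s2 != s)

# Toy example: the number of "extra" units delta shrinks until q holds.
states = range(5)
succ = lambda s: {max(s - 1, 0)}      # one corrective transition plus implicit idling
phi = lambda s: True
q = lambda s: s == 0
delta = lambda s: s
print(econv1_rank_premise(states, succ, phi, q, delta))   # True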

Simplifications of rule EConv-1 are obtained along the same lines as in the case of the single-step convergence rules.

Proposition 16. Given system S ∈ NI. S ⊨ □(p → ◇q) is derivable by rule EConv-1 iff it is derivable by rule EConv-1-NI.

As a corollary of Propositions 15 and 16 we get:

Corollary 17. The rule set {EConv-1-NI, WConv-NI, SConv-NI} is sound and complete for proving S ⊨ □(p → ◇q) for S ∈ NI and predicates p, q.

For the set NI_p we get the correspondingly adapted rule EConv-1-NI_p.

Proposition 18. Given system S ∈ NI_p. S ⊨ □(p → ◇q) is derivable by rule EConv-1 iff it is derivable by rule EConv-1-NI_p.

From Propositions 15 and 18 we get the following soundness and completeness result for proving convergence properties of systems from NI_p.

Corollary 19. The rule set {EConv-1-NI_p, WConv-NI_p, SConv-NI_p} is sound and complete for proving S ⊨ □(p → ◇q) for S ∈ NI_p and predicates p, q.

Extended Convergence with Helpful Directions. In [17] a strategy is proposed to replace the premise concerning convergence in rule EConv-1 by applying a combination of single-step convergence rules. The resulting rule, stated below, uses the set WF ∪ SF containing all weakly or strongly fair executed transitions. Without loss of generality WF ∩ SF = ∅ holds. Furthermore the rule assumes the existence of a set of state formulas {φ_1, ..., φ_n}, each φ_i corresponding to one t_i ∈ WF ∪ SF, a well-founded pre-order (A, ⪯), and a ranking function δ with range A. Let φ ≝ ⋁_{i=1}^n φ_i.

EConv-2:
    S ⊨ p ⇒ (q ∨ φ)
    ⊨ {φ_i ∧ δ = α} T {q ∨ (φ ∧ δ ≺ α) ∨ (φ_i ∧ δ ⪯ α)}
    ⊨ {φ_i ∧ δ = α} t_i {q ∨ (φ ∧ δ ≺ α)}
    S ⊨ φ_i ⇒ (q ∨ en(t_i))          for t_i ∈ WF
    S ⊨ φ_i ⇒ ◇(q ∨ en(t_i))         for t_i ∈ SF
    ------------------------------------------------
    S ⊨ □(p → ◇q)

In φ_i-states transition t_i is the helpful transition contained in WF ∪ SF. The third premise guarantees that t_i, executed in a state where φ_i holds, either decreases the rank while preserving φ or establishes q. The other transitions in T either have to preserve φ_i or have to lower the rank or establish q. The fourth and fifth premises guarantee that the t_i transitions are eventually executed when a φ_i-state has been encountered.

Proposition 20 (Manna & Pnueli). Rule EConv-2 is sound and complete for proving S ⊨ □(p → ◇q) for S ∈ S and state predicates p, q.

In case we restrict our attention to systems in the set NI we obtain the following simplified rule:

EConv-2-NI:
    ⊨ p → (q ∨ φ)
    ⊨ {φ_i ∧ δ = α} T {q ∨ (φ ∧ δ ≺ α) ∨ (φ_i ∧ δ ⪯ α)}
    ⊨ {φ_i ∧ δ = α} t_i {q ∨ (φ ∧ δ ≺ α)}
    ⊨ φ_i → (q ∨ en(t_i))            for t_i ∈ WF
    S ⊨ φ_i → ◇(q ∨ en(t_i))         for t_i ∈ SF
    ------------------------------------------------
    S ⊨ □(p → ◇q)

The corresponding proposition states:

Proposition 21. Given system S ∈ NI. S ⊨ □(p → ◇q) is derivable by rule EConv-2 iff it is derivable by rule EConv-2-NI.

From Propositions 20 and 21 we obtain:

Corollary 22. Rule EConv-2-NI is sound and complete for proving S ⊨ □(p → ◇q) for S ∈ NI and state predicates p, q.

Finally, we get the following rule for extended convergence in the case of NI_p, where the φ_i are selected such that ⊨ (p ∧ ¬q) → ⋁_{i=1}^n φ_i holds.

EConv-2-NI_p:
    ⊨ {φ_i ∧ δ = α} T {q ∨ δ ≺ α ∨ (φ_i ∧ δ ⪯ α)}
    ⊨ {φ_i ∧ δ = α} t_i {q ∨ δ ≺ α}
    ⊨ φ_i → (q ∨ en(t_i))            for t_i ∈ WF
    S ⊨ φ_i → ◇(q ∨ en(t_i))         for t_i ∈ SF
    ------------------------------------------------
    S ⊨ □(p → ◇q)

Proposition 23. Given system S ∈ NI_p. S ⊨ □(p → ◇q) is derivable by rule EConv-2 iff it is derivable by rule EConv-2-NI_p.

As in the previous case we get from Propositions 20 and 23:

Corollary 24. Rule EConv-2-NI_p is sound and complete for proving S ⊨ □(p → ◇q) for S ∈ NI_p and state predicates p, q.

We have obtained rule EConv-2-NI_p by replacing φ in rule EConv-2 by p ∧ ¬q and performing subsequent simplifications exploiting the closure of p. However, it is not possible to eliminate the auxiliary predicates φ_i in rule EConv-2-NI_p; the construction of the φ_i's in the completeness proof of rule EConv-2 [17] reveals that there does not exist a general characterization of the φ_i by means of the predicates p and q.

5 Conclusion

In this paper, we presented a formal framework for the deductive verification of stabilizing systems. Temporal logic has been used to define the central notions of closure and convergence. Then, we used the temporal approach to program verification to derive a set of proof rules for stabilizing systems. Using the semantical features of these systems, such as the fact that they are non-initializing, we obtained a set of proof rules which are simpler than the general temporal proof rules. These proof rules provide the basis for computer-aided deductive verification of stabilizing systems. In [24] we have completed the list of proof rules for stabilizing systems by giving rules for pseudo-stabilization [10], and also some useful temporal tautologies which can be used in the verification of actual systems. The temporal proof rules are extended in [24] towards phased reasoning [25, 23] about stabilizing systems, based on the concept of convergence stairs [14].

Acknowledgments: We thank Amir Pnueli for proof reading and comments.

References
1. M. Abadi and L. Lamport. The existence of refinement mappings. Theoretical Computer Science, 82(2), 1991.
2. Y. Afek and G.M. Brown. Self-stabilization over unreliable communication media. Distributed Computing, 7:27-34, 1993.
3. R. Alur, T. Henzinger, and P. Ho. Automatic symbolic model checking of embedded systems. In IEEE Real-Time Systems Symposium, 1993.
4. A. Arora. A Foundation of Fault Tolerant Computing. PhD thesis, The University of Texas at Austin, 1992.
5. A. Arora and M.G. Gouda. Closure and convergence: a foundation of fault-tolerant computing. IEEE Transactions on Software Engineering, 19:1015-1027, 1993.
6. A. Arora and M.G. Gouda. Distributed reset. IEEE Transactions on Computers, 43:1026-1038, 1994.
7. J. Beauquier and S. Delaet. Probabilistic self-stabilizing mutual exclusion in uniform rings. In PODC94, Proceedings of the Thirteenth Annual ACM Symposium on Principles of Distributed Computing, page 378, 1994.
8. R. S. Boyer and J. S. Moore. Integrating decision procedures into heuristic theorem provers. Machine Intelligence, 11, 1986.
9. J. Burch, E. Clarke, K. McMillan, D. Dill, and L. Hwang. Symbolic model checking: 10^20 states and beyond. In Logic in Computer Science, 1990.
10. J.E. Burns, M.G. Gouda, and R.E. Miller. Stabilization and pseudo-stabilization. Distributed Computing, 7:35-42, 1993.
11. F. Cristian. A rigorous approach to fault-tolerant programming. IEEE Transactions on Software Engineering, 11(1), 1985.
12. E.W. Dijkstra. Self-stabilizing systems in spite of distributed control. Communications of the ACM, 17(11), 1974.
13. S. Dolev, A. Israeli, and S. Moran. Self-stabilization of dynamic systems assuming only read/write atomicity. Distributed Computing, 7:3-16, 1993.
14. M.G. Gouda and N. Multari. Stabilizing communication protocols. IEEE Transactions on Computers, 40:448-458, 1991.
15. S. Katz and K.J. Perry. Self-stabilizing extensions for message-passing systems. Distributed Computing, 7:17-26, 1993.
16. L. Lamport. Solved problems, unsolved problems, and non-problems in concurrency. In Proceedings of the 3rd Annual ACM Symposium on Principles of Distributed Computing, 1984.
17. Z. Manna and A. Pnueli. Completing the temporal picture. Theoretical Computer Science, 83(1), 1991.
18. Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems. Springer Verlag, 1991.
19. Z. Manna and A. Pnueli. Temporal Verification of Reactive Systems. Springer Verlag, 1995.
20. S. Owre, J. Rushby, N. Shankar, and F. von Henke. Formal verification for fault-tolerant architectures: Some lessons learned. In FME '93: Industrial-Strength Formal Methods, number 670 in LNCS. Springer Verlag, 1993.
21. S. Owre, J.M. Rushby, and N. Shankar. PVS: a prototype verification system. In 11th Int. Conf. on Automated Deduction (CADE), volume 607 of LNCS. Springer Verlag, 1992.
22. M. Schneider. Self-stabilization. ACM Computing Surveys, 25:45-67, 1993.
23. M. Siegel. A refinement theory that supports both 'decrease of non-determinism' and 'increase of parallelism'. In S. Smolka, editor, CONCUR '95, volume 962 of LNCS, 1995.
24. M. Siegel. Phased Design and Verification of Stabilizing Systems. PhD thesis, University of I