
MATHEMATICAL RELIABILITY: AN EXPOSITORY PERSPECTIVE

INTERNATIONAL SERIES IN OPERATIONS RESEARCH & MANAGEMENT SCIENCE Frederick S. Hillier, Series Editor

Stanford University

Hobbs, B. et al. / THE NEXT GENERATION OF ELECTRIC POWER UNIT COMMITMENT MODELS
Vanderbei, R.J. / LINEAR PROGRAMMING: Foundations and Extensions, 2nd Ed.
Kimms, A. / MATHEMATICAL PROGRAMMING AND FINANCIAL OBJECTIVES FOR SCHEDULING PROJECTS
Baptiste, P., Le Pape, C. & Nuijten, W. / CONSTRAINT-BASED SCHEDULING
Feinberg, E. & Shwartz, A. / HANDBOOK OF MARKOV DECISION PROCESSES: Methods and Applications
Ramík, J. & Vlach, M. / GENERALIZED CONCAVITY IN FUZZY OPTIMIZATION AND DECISION ANALYSIS
Song, J. & Yao, D. / SUPPLY CHAIN STRUCTURES: Coordination, Information and Optimization
Kozan, E. & Ohuchi, A. / OPERATIONS RESEARCH/MANAGEMENT SCIENCE AT WORK
Bouyssou et al. / AIDING DECISIONS WITH MULTIPLE CRITERIA: Essays in Honor of Bernard Roy
Cox, Louis Anthony, Jr. / RISK ANALYSIS: Foundations, Models and Methods
Dror, M., L'Ecuyer, P. & Szidarovszky, F. / MODELING UNCERTAINTY: An Examination of Stochastic Theory, Methods, and Applications
Dokuchaev, N. / DYNAMIC PORTFOLIO STRATEGIES: Quantitative Methods and Empirical Rules for Incomplete Information
Sarker, R., Mohammadian, M. & Yao, X. / EVOLUTIONARY OPTIMIZATION
Demeulemeester, E. & Herroelen, W. / PROJECT SCHEDULING: A Research Handbook
Gazis, D.C. / TRAFFIC THEORY
Zhu, J. / QUANTITATIVE MODELS FOR PERFORMANCE EVALUATION AND BENCHMARKING
Ehrgott, M. & Gandibleux, X. / MULTIPLE CRITERIA OPTIMIZATION: State of the Art Annotated Bibliographical Surveys
Bienstock, D. / Potential Function Methods for Approx. Solving Linear Programming Problems
Matsatsinis, N.F. & Siskos, Y. / INTELLIGENT SUPPORT SYSTEMS FOR MARKETING DECISIONS
Alpern, S. & Gal, S. / THE THEORY OF SEARCH GAMES AND RENDEZVOUS
Hall, R.W. / HANDBOOK OF TRANSPORTATION SCIENCE - 2nd Ed.
Glover, F. & Kochenberger, G.A. / HANDBOOK OF METAHEURISTICS
Graves, S.B. & Ringuest, J.L. / MODELS AND METHODS FOR PROJECT SELECTION: Concepts from Management Science, Finance and Information Technology
Hassin, R. & Haviv, M. / TO QUEUE OR NOT TO QUEUE: Equilibrium Behavior in Queueing Systems
Gershwin, S.B. et al. / ANALYSIS & MODELING OF MANUFACTURING SYSTEMS
Maros, I. / COMPUTATIONAL TECHNIQUES OF THE SIMPLEX METHOD
Harrison, Lee & Neale / THE PRACTICE OF SUPPLY CHAIN MANAGEMENT: Where Theory And Application Converge
Shanthikumar, Yao & Zijm / STOCHASTIC MODELING AND OPTIMIZATION OF MANUFACTURING SYSTEMS AND SUPPLY CHAINS
Nabrzyski, J., Schopf, J.M. & Węglarz, J. / GRID RESOURCE MANAGEMENT: State of the Art and Future Trends
Thissen, W.A.H. & Herder, P.M. / CRITICAL INFRASTRUCTURES: State of the Art in Research and Application
Carlsson, C., Fedrizzi, M. & Fullér, R. / FUZZY LOGIC IN MANAGEMENT

* A list of the early publications in the series is at the end of the book

MATHEMATICAL RELIABILITY: AN EXPOSITORY PERSPECTIVE

Edited by

REFIK SOYER

Department of Management Science The George Washington University Washington, DC 20052

THOMAS A. MAZZUCHI

Department of Engineering Management and Systems Engineering The George Washington University Washington, DC 20052

NOZER D. SINGPURWALLA

Department of Statistics The George Washington University Washington, DC 20052


Springer Science+Business Media, LLC

Library of Congress Cataloging-in-Publication Data

Soyer, Refik / Mazzuchi, Thomas A. / Singpurwalla, Nozer D.
Mathematical Reliability: An Expository Perspective
ISBN 978-1-4613-4760-6    ISBN 978-1-4419-9021-1 (eBook)
DOI 10.1007/978-1-4419-9021-1

Copyright © 2004 by Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers in 2004. Softcover reprint of the hardcover 1st edition 2004. All rights reserved.

Chapter 1 THE SIGNATURE OF A COHERENT SYSTEM

The bridge structure is thus seen to be a system which enjoys the IFR property when components are i.i.d. from an IFR distribution. The parallel-series system shown in Figure 1.1 is an example of a coherent system without this closure property.

As will be clear in the sequel, the primary use of system signatures is in the comparison of two or more complex systems. It is often possible to rate one system as better than another by a quick glance at their respective signatures. A 2-out-of-4 system has signature $s^t_{2|4} = (0, 1, 0, 0)$, while a 3-out-of-4 system has signature $s^t_{3|4} = (0, 0, 1, 0)$. While the superiority of the latter system is obvious from first principles, it is apparent that signatures also succeed in quantifying the fact that the 3-out-of-4 system will tend to last longer than the 2-out-of-4 system.

There are a finite number of coherent systems of a given size. For example, there are exactly five different coherent systems in 3 components, and there are 20 coherent systems of order 4. Obtaining a closed-form expression for the number of coherent systems of order n is an interesting open problem in Reliability Theory. Like many of the most intriguing unsolved problems in mathematics, the problem is easy to state but very hard to solve. Regardless of how many systems of order n there may be, we will find it useful to expand the size of the collection of order-n systems to infinity. Doing so will provide substantial conceptual and analytical flexibility to the study and comparison of systems. The expansion we have in mind is accomplished by adding mixtures of coherent systems to the collection. Specifically, a mixed system is a convex combination of coherent systems of a given order. Given systems $T_1, T_2, \ldots, T_k$ and a probability vector $p^t = (p_1, p_2, \ldots, p_k)$, one may speak of the p-mixture of these systems as the system represented by

\[ T_p = \sum_{i=1}^{k} p_i T_i. \qquad (1.4) \]


A mixed system is more than a mathematical artifact. It is physically realizable as the outcome of a randomization process in which one of k coherent systems "in stock" is chosen for use at random according to the distribution p. The resulting system is "coherent" only in a stochastic sense; the system selected by such a process is coherent with probability one. The system $T_p$ itself is technically incoherent because a given component (say the i-th component of the j-th system) is relevant only if the j-th system is selected for use, that is, only with probability $p_j$. With the introduction of mixtures of coherent systems, the space of systems of interest becomes continuous. Instead of being restricted to a fixed finite collection of signatures of a given order, one can consider any probability vector of order n as the signature of a mixed system. For example, for any vector $s \in [0, 1]^n$ for which $\sum s_i = 1$, the s-mixture of k-out-of-n systems, that is, the system

\[ T_s = \sum_{k=1}^{n} s_k T_{k|n}, \qquad (1.5) \]

has the signature s. Since the representation in (1.5) need not be unique, there are generally a variety of ways to construct a system with a given signature. The bridge structure of Figure 1.2 can, for example, be represented, alternatively, as a mixture of 2-out-of-5, 3-out-of-5 and 4-out-of-5 systems.
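The signature of any small coherent system can be computed directly from its minimal path sets by brute-force enumeration of failure orderings. The sketch below is an illustrative Python script, not part of the original text; it assumes the usual combinatorial definition of the signature ($s_i$ is the proportion of the n! equally likely failure orderings under which the i-th component failure coincides with system failure) and reproduces the bridge signature (0, 1/5, 3/5, 1/5, 0) quoted later in the chapter from the minimal path sets {1,4}, {2,5}, {1,3,5}, {2,3,4} given in section 2.

```python
from itertools import permutations

# Minimal path sets of the 5-component bridge system of Figure 1.2 (from the text).
BRIDGE_PATHS = [{1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4}]

def works(up, path_sets):
    """System functions iff all components of some minimal path set are up."""
    return any(ps <= up for ps in path_sets)

def signature(n, path_sets):
    """s_i = proportion of the n! failure orderings in which the i-th
    component failure coincides with system failure."""
    counts = [0] * n
    for order in permutations(range(1, n + 1)):   # order in which components fail
        up = set(range(1, n + 1))
        for i, c in enumerate(order, start=1):
            up.remove(c)
            if not works(up, path_sets):          # system just failed
                counts[i - 1] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]

print(signature(5, BRIDGE_PATHS))   # -> [0.0, 0.2, 0.6, 0.2, 0.0]
```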

2. Related Concepts

Perhaps the most basic idea associated with the reliability of a system at a given point in time is the reliability polynomial. Under the assumptions that each of the components of a system of order n operates independently and that the i-th component has reliability $p_i$ at the time of interest, the reliability polynomial is simply the function of the vector p that relates p to the reliability of the system. We'll denote the reliability polynomial as $h(\mathbf{p})$. When all of the components have the same reliability p (as happens under the assumption of i.i.d. component lifetimes), the reliability polynomial is simply written as $h(p)$. In this latter case, it may be represented as

\[ h(p) = \sum_{r=1}^{n} d_r\, p^r. \qquad (1.6) \]

The coefficients $d_r$ in (1.6) are not easy to identify for a complex system, and do not appear to have any particular intuitive meaning. While certain simple properties of these coefficients are easy to obtain (for example, since for any coherent system $h(0) = 0$ and $h(1) = 1$, we know that the reliability polynomial in (1.6) has no constant term, that is, $d_0 = 0$, and that $\sum d_r = 1$), other characteristics of the $d_r$ are considerably more difficult to come by. An important advance in this regard was achieved by Satyanarayana and his co-workers


([19], [20], [21], [22], [23]), who introduced the concept of "domination", a tool that has been central in algorithmic calculations of the $d_r$ over the past two decades. We turn now to a description of the notion of signed dominations, followed by a discussion of their relation to system signatures. The latter relationship provides a new vehicle for studying and interpreting dominations. A more detailed treatment of this relationship may be found in Boland, Samaniego and Vestrup [8]. The "inclusion-exclusion" formula (see for example Feller [14]) is a standard and widely used tool in the calculation of the probability of a union of a collection of overlapping sets or "events". The formula is a generalization of the well-known addition rule which stipulates that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. The general inclusion-exclusion rule applies to the union of any m sets, and may be written as

\[ P\!\left(\bigcup_{i=1}^{m} A_i\right) = \sum_{1} P(A_i) - \sum_{2} P(A_i \cap A_j) + \cdots + (-1)^{k+1} \sum_{k} P\!\left(\bigcap A_{i_l}\right) - \cdots + (-1)^{m+1} P(A_1 \cap \cdots \cap A_m), \qquad (1.7) \]

where $\sum_{i}$ represents a sum over all i-fold intersections. This rule may be

applied directly to the calculation of a system's reliability by noting that a system works if and only if all of the components in at least one minimal path set are working. If $A_i$ in (1.7) represents the event that all components in the i-th minimal path set are working, and there are m minimal path sets in all, then the formula in (1.7) provides the probability that at least one such event obtains. That probability is, of course, the system's reliability. One may recognize the result as the reliability polynomial by noting that the probability is $p^r$ that a particular collection of r components functions simultaneously, and that each of the intersections appearing in the inclusion-exclusion formula is precisely the collection of components appearing in one or more sets of the intersection in question. In spite of the fact that the inclusion-exclusion formula can provide an explicit representation of the reliability polynomial, it falls short of the desired solution because of its inherent computational complexity. Besides the fact that the generation of the m minimal path sets of a given system requires exponential time, that is, requires a number of steps that is exponential in the size n of the system, there are $2^m - 1$ intersections in (1.7), resulting in a doubly exponential algorithm for system reliability. This unpleasant fact motivated research that ultimately led to the simplifying notion of "dominations". Let us suppose that we have a list of minimal path sets of a given system in n i.i.d. components. A "formation" is defined as a union of minimal path sets. There is some room for confusion in the use of terminology here, so let's make


special note of the fact that the "intersection" of two or more events in (1.7) - each event representing a particular set of working components - can equivalently be thought of as the set consisting of the union of all these working components, that is, a union of all the components in these path sets. A formation is thus the union of the components in a fixed collection of minimal path sets. An "i-formation" is a union of the components in a set of i minimal path sets. For instance, the set {1, 2, 3, 4} formed from the minimal path sets {1, 2}, {2, 3} and {3, 4} would be an example of both a 2-formation and a 3-formation. We will refer to a particular set as an "even" formation if it is the union of an even number of minimal path sets, and as "odd" if it is the union of an odd number of minimal path sets. Clearly it can be both simultaneously. Keeping track of formations is simply an accounting mechanism that helps one to see what is happening in the inclusion-exclusion formula. An even formation occurs with each k-fold intersection in (1.7) when k is even, and an odd formation results from every intersection when k is odd. Thus, the total number of formations for a system having m minimal path sets is precisely $2^m - 1$. We illustrate below the cataloguing of formations for the 5-component bridge system shown in Figure 1.2. The minimal path sets of this system are {1, 4}, {2, 5}, {1, 3, 5} and {2, 3, 4}. The "signed domination" of a given union of minimal path sets is simply the difference between the number of odd formations and the number of even formations for that union.

Table 1.2. Signed dominations for the 5-component bridge system.

  Min Path Set Union   # Odd Formations   # Even Formations   Signed Domination
  {1,4}                       1                  0                    1
  {2,5}                       1                  0                    1
  {1,3,5}                     1                  0                    1
  {2,3,4}                     1                  0                    1
  {1,2,3,4}                   0                  1                   -1
  {1,2,3,5}                   0                  1                   -1
  {1,2,4,5}                   0                  1                   -1
  {1,3,4,5}                   0                  1                   -1
  {2,3,4,5}                   0                  1                   -1
  {1,2,3,4,5}                 4                  2                    2

Denoting the minimal path sets in the bridge system above as $A_1$, $A_2$, $A_3$ and $A_4$, the formations associated with the set {1, 2, 3, 4, 5} are the four odd ones $A_1 \cup A_2 \cup A_3$, $A_1 \cup A_2 \cup A_4$, $A_1 \cup A_3 \cup A_4$ and $A_2 \cup A_3 \cup A_4$, and the two even ones $A_3 \cup A_4$ and $A_1 \cup A_2 \cup A_3 \cup A_4$.
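The accounting above is easy to automate. The following sketch is an illustrative Python script, not part of the original text: it enumerates every union of minimal path sets of the bridge system, classifies it as odd or even according to the number of path sets in the union, and accumulates the signed dominations by the size of the resulting component set, reproducing the domination vector d = (0, 2, 2, -5, 2) identified below.

```python
from itertools import combinations

# Minimal path sets of the bridge system (from the text).
PATHS = [frozenset({1, 4}), frozenset({2, 5}),
         frozenset({1, 3, 5}), frozenset({2, 3, 4})]

def dominations(path_sets, n):
    """d_r = sum of signed dominations (# odd - # even formations)
    over all unions of minimal path sets containing exactly r components."""
    d = [0] * (n + 1)
    for k in range(1, len(path_sets) + 1):          # k-fold unions of path sets
        for combo in combinations(path_sets, k):
            union = frozenset().union(*combo)
            d[len(union)] += 1 if k % 2 == 1 else -1
    return d[1:]

print(dominations(PATHS, 5))    # -> [0, 2, 2, -5, 2], i.e. h(p) = 2p^2 + 2p^3 - 5p^4 + 2p^5
```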


If A is a union of minimal path sets consisting of exactly r components, then the marginal probability that this set "works" is $p^r$. The signed domination of the set A is equal to the sum of the coefficients (1's and -1's) in the terms of the inclusion-exclusion formula corresponding to the occurrences of the probability P(A) in the expansion (1.7). The total contribution of this term is the signed domination of the set A times the probability element $p^r$. Given this type of accounting, the signed dominations of sets of size r can be summed and then multiplied by $p^r$. Summing over all possible values of r yields the reliability polynomial. For the bridge system above, this process yields the reliability polynomial h(p) given by

\[ h(p) = 2p^2 + 2p^3 - 5p^4 + 2p^5. \qquad (1.8) \]

In the notation of (1.6), we have identified the coefficient vector $d_{bridge} = (d_1, d_2, d_3, d_4, d_5)$ of the reliability polynomial as $d_{bridge} = (0, 2, 2, -5, 2)$ for the bridge system in Figure 1.2. We will henceforth refer to d as the vector of (signed) dominations, subsuming mention of the fact that each element of this vector is actually the sum of signed dominations for terms of the same order. It is worth noting that the general process described above applies equally well to the computation of the reliability function for systems with independent but non-identical components. This generalization will not, however, be required in the developments here. To explain the connection between the domination and signature vectors, it is useful to write the reliability polynomial in an alternative form. We will refer to the polynomial in (1.6) as being in "standard form". This polynomial can also be written in "pq form", that is, the reliability function of a coherent system in n i.i.d. components may be written as

\[ h(p) = \sum_{j=1}^{n} m_j\, p^j q^{n-j}, \qquad (1.9) \]

where q = 1 - p. If we focus on the reliability of the system at the fixed time t, then the probability that any given component is working at that time is $p = 1 - F(t)$, the probability of that component surviving beyond time t. A simple manipulation of the reliability function in (1.2) results in the re-expression of the reliability polynomial in terms of the system's signature, that is, as

\[ h(p) = \sum_{j=1}^{n} \left( \sum_{i=n-j+1}^{n} s_i \right) \binom{n}{j} p^j q^{n-j}, \qquad (1.10) \]

where $s^t = (s_1, s_2, \ldots, s_n)$ is the system signature. The "tail probabilities" of the signature vector s have an interpretation through the concept of a path set (a set of components whose functioning ensures that the system is functioning).


This connection was noted by Boland [10] and exploited in the study of indirect majority systems, a special application of signatures that will be discussed in some detail in section 4. As is apparent from (1.10), the coefficient of $p^j q^{n-j}$ in the reliability polynomial in pq-form can be interpreted as the number of path sets of order j, as it is precisely those sets, among the collection of $\binom{n}{j}$ sets with exactly j working components, that contribute positively to the reliability polynomial. If we let $a_j$ represent the proportion of path sets among the $\binom{n}{j}$ sets with j working components (and the complementary components non-working), then we see that the reliability polynomial can be written as

\[ h(p) = \sum_{j=1}^{n} a_j \binom{n}{j} p^j q^{n-j}. \qquad (1.11) \]

It follows that the vector a, which is fundamentally related to path sets, and the vector s, which is fundamentally related to cut sets, are related to each other through the system of equations

\[ a_j = \sum_{i=n-j+1}^{n} s_i \quad \text{for } j = 1, \ldots, n, \qquad (1.12) \]

or equivalently (using the convention that $a_0 = 0$),

\[ s_j = a_{n-j+1} - a_{n-j} \quad \text{for } j = 1, \ldots, n. \qquad (1.13) \]

It is thus clear that the vectors s and a are in one-to-one correspondence and are, in fact, linearly related. An interesting observation resulting from the functional expression in (1.12) is that $a_i \uparrow i$, that is, the proportion of subsets of size i which are path sets is, for any system, an increasing function of i. In linking the vector of dominations d to the notion of signature, we will utilize the vector a of tail probabilities of the signature s, though the connection between d and s will also be specified. Let us consider the reliability polynomials in (1.6) and (1.11). Writing $q^{n-j}$ in the latter expression as $(1-p)^{n-j}$ and expanding the term as a binomial, one may rewrite the polynomial in (1.11) as

\[ h(p) = \sum_{j=1}^{n} a_j \binom{n}{j} p^j \left\{ \sum_{i=0}^{n-j} \binom{n-j}{i} (-1)^i p^i \right\} = \sum_{r=1}^{n} \left\{ \sum_{j=1}^{r} a_j \binom{n}{j} \binom{n-j}{r-j} (-1)^{r-j} \right\} p^r. \qquad (1.14) \]

Examining the expressions in (1.6) and (1.14), we see that the vectors a and d are related via the equations

\[ d_r = \sum_{j=1}^{r} a_j \binom{n}{j} \binom{n-j}{r-j} (-1)^{r-j} \quad \text{for } r = 1, \ldots, n, \qquad (1.15) \]


or, alternatively, the components of the domination and signature vectors satisfy the relationships

\[ d_r = \sum_{j=1}^{r} \left( \sum_{i=n-j+1}^{n} s_i \right) \binom{n}{j} \binom{n-j}{r-j} (-1)^{r-j} \quad \text{for } r = 1, \ldots, n. \qquad (1.16) \]

Obtaining the set of signed dominations for a given system has proven to be a popular computational approach for deriving the reliability polynomial. It provides both conceptual and algorithmic advantages over other approaches. If the approach, and its ultimate product - the reliability polynomial in standard form - have any disadvantages, they would be that the vector d has no particular intuitive interpretation and provides little help in comparing systems (short of the brute-force examination of the difference of the respective polynomials). The signature vector s is, on the other hand, rich with meaning and, as we will see in the two sections that follow, has a direct and easily quantified relation to system performance. In particular, it is highly useful in comparing one system to another. Thus, the fact that the vector d is related to the vectors s and a in a straightforward way provides the link facilitating interpretations of dominations that have heretofore been lacking. We end this section with an example. Earlier, we computed the signature of the bridge system as $s^t_{bridge} = (0, 1/5, 3/5, 1/5, 0)$. This corresponds to the vector $a^t_{bridge} = (0, 1/5, 4/5, 1, 1)$. The linear relationship displayed in (1.15) may be written as

\[ \begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \end{pmatrix} = \begin{pmatrix} 5 & 0 & 0 & 0 & 0 \\ -20 & 10 & 0 & 0 & 0 \\ 30 & -30 & 10 & 0 & 0 \\ -20 & 30 & -20 & 5 & 0 \\ 5 & -10 & 10 & -5 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \end{pmatrix} = \begin{pmatrix} 0 \\ 2 \\ 2 \\ -5 \\ 2 \end{pmatrix}. \qquad (1.17) \]

Since one would ordinarily have the dominations in hand, it is the inverse of the relationship above that would be of interest. In the present example, we would obtain a from d as $a = M^{-1}d$, where M is the 5 × 5 matrix in (1.17). The inverse of M has the particularly simple form displayed below:

\[ M^{-1} = \begin{pmatrix} 1/5 & 0 & 0 & 0 & 0 \\ 2/5 & 1/10 & 0 & 0 & 0 \\ 3/5 & 3/10 & 1/10 & 0 & 0 \\ 4/5 & 3/5 & 2/5 & 1/5 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix}. \]


Since the vectors s and a are also linearly related, the relationship above can be written alternatively as $d = MPs$, or $s = P^{-1}M^{-1}d$, where M is displayed in (1.17) and P and $P^{-1}$ are given by

\[ P = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix} \quad \text{and} \quad P^{-1} = \begin{pmatrix} 0 & 0 & 0 & -1 & 1 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}. \]
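The matrices M and P above are determined entirely by n, through (1.15) and (1.12). The following sketch is an illustrative Python/NumPy script, not part of the original text; it builds them for n = 5 and verifies the passage from the bridge signature to a and d and back.

```python
import numpy as np
from math import comb

n = 5
# M maps a to d, with entries from (1.15); P maps s to a, with entries from (1.12).
M = np.array([[comb(n, j) * comb(n - j, r - j) * (-1) ** (r - j) if j <= r else 0
               for j in range(1, n + 1)] for r in range(1, n + 1)])
P = np.array([[1 if i >= n - j + 1 else 0 for i in range(1, n + 1)]
              for j in range(1, n + 1)])

s_bridge = np.array([0, 1/5, 3/5, 1/5, 0])
a_bridge = P @ s_bridge                      # (0, 0.2, 0.8, 1, 1)
d_bridge = M @ a_bridge                      # (0, 2, 2, -5, 2)

# Recovering the signature from the dominations: s = P^{-1} M^{-1} d.
s_back = np.linalg.solve(P, np.linalg.solve(M, d_bridge))
print(d_bridge)
print(s_back)                                # (0, 0.2, 0.6, 0.2, 0)
```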

3. Comparison of Systems - Theory

We have already alluded to the usefulness of signatures in the comparison of complex systems. In this section we will show how various stochastic comparisons between systems may be established by comparing the signatures of two competing systems. The concept of stochastic order is a useful tool in comparing system lifetimes. There are many types of stochastic relationship which are commonly used, although here we shall restrict our considerations to the usual stochastic order, the hazard rate order and the likelihood ratio order. Shaked and Shanthikumar [25] (see also Boland et al. [9]) provide a comprehensive treatment of stochastic relationships. If $T_1$ and $T_2$ are random lifetimes (or more generally any two random variables) then we say that $T_2$ is greater than $T_1$ in the (usual) stochastic order ($T_1 \leq_{st} T_2$) if $\bar{F}_{T_1}(t) \leq \bar{F}_{T_2}(t)$ for all t. Simply put, for any t, $T_2$ is more likely to exceed (survive) t than $T_1$ is. The importance of the failure rate (or hazard rate) function in reliability theory and survival analysis has led to the use of the hazard rate ordering in comparing two lifetimes. We say that $T_2$ is greater than $T_1$ in the hazard rate (or failure rate) ordering† (and write $T_1 \leq_{hr} T_2$) if $\bar{F}_{T_2}(t)/\bar{F}_{T_1}(t) \uparrow t$ (or equivalently, in the case where the lifetimes $T_1$ and $T_2$ have densities, $r_{T_1}(t) \geq r_{T_2}(t)$ for all t). This relationship is stronger than the usual stochastic order, for it is equivalent to saying that for any t, conditional on surviving until time t, $T_2$ is stochastically larger than $T_1$ ($(T_1 \mid T_1 > t) \leq_{st} (T_2 \mid T_2 > t)$ for all $t > 0$). An even stronger stochastic relationship is the likelihood ratio order, and one says that $T_2$ is greater than $T_1$ in the likelihood ratio ordering ($T_1 \leq_{lr} T_2$) if $f_{T_2}(t)/f_{T_1}(t) \uparrow t$. In summary one has the following implications between these stochastic orders: $T_1 \leq_{lr} T_2 \Rightarrow T_1 \leq_{hr} T_2 \Rightarrow T_1 \leq_{st} T_2$.

†We say that a function g(t) is increasing (and write $g(t) \uparrow t$) if it is a non-decreasing function.

Although the signature vector $s_\tau$ of a coherent system $\tau$ is a probability vector which is a function of the system design (and in particular independent of


component lifetimes), we can in a similar manner introduce stochastic orders on signatures of systems. If $s_1$ and $s_2$ are the signatures of two systems of order n, then we say that $s_1 \leq_{st} s_2$ if

\[ \sum_{j=i}^{n} s_{1j} \leq \sum_{j=i}^{n} s_{2j} \quad \text{for } i = 1, 2, \ldots, n. \qquad (1.18) \]

Note that the condition (1.18) is equivalent to $a_{1i} \leq a_{2i}$ for $i = 1, 2, \ldots, n$. If the structure functions of two systems have the property that $\phi_1(x) \leq \phi_2(x)$ for all component state vectors x, then clearly the corresponding signatures satisfy $s_1 \leq_{st} s_2$ (although the converse is not true). The following results (see Kochar, Mukerjee and Samaniego [16]) establish some interesting connections between stochastic relationships between signatures and corresponding relationships between system lifetimes. The importance of these results lies in the fact that they give simple sufficient conditions on the structure (in terms of the signature) of two systems which guarantee stochastic orderings on the corresponding system lifetimes (but which are independent of the actual (common) component lifetimes in the systems).

Theorem 1. Let $s_1$ and $s_2$ be the signatures of two systems of order n, and let $T_1$ and $T_2$ be their respective lifetimes. If $s_1 \leq_{st} s_2$, then $T_1 \leq_{st} T_2$.

The proof of Theorem 1 follows by rewriting the expression (1.2) for the survival function of a system with lifetime T in the following form:

\[ P(T > t) = \sum_{j=0}^{n-1} \left( \sum_{i=j+1}^{n} s_i \right) \binom{n}{j} (F(t))^{j} (\bar{F}(t))^{n-j}. \qquad (1.19) \]

Theorem 1 often enables us to stochastically order system lifetimes by comparing their signatures. For example, consider the five coherent systems of order 3. These are the k-out-of-3 systems $\tau_{1|3}$ (series), $\tau_{2|3}$ (2-out-of-3) and $\tau_{3|3}$ (parallel), as well as the parallel-series system $\tau_{ps}$ illustrated in Figure 1.1 and the consecutive 2-out-of-3 system $\tau_{c:2|3}$. A (linear) consecutive k-out-of-n system $\tau_{c:k|n}$ (with signature $s_{c:k|n}$) is a system of n linearly ordered components which fails if any k consecutively ordered components fail. We have seen that $s_{ps} = (0, 2/3, 1/3)$ and it is easy to establish that $s_{c:2|3} = (0, 2/3, 1/3)$. Therefore it follows from the above theorem that

\[ T_{1|3} \leq_{st} T_{2|3} \leq_{st} T_{ps} =_{st} T_{c:2|3} \leq_{st} T_{3|3}. \qquad (1.20) \]

One may define in a natural way a hazard rate ordering for two signature vectors of order n whereby $s_1 \leq_{hr} s_2$ if $\sum_{j=i}^{n} s_{2j} / \sum_{j=i}^{n} s_{1j}$ is increasing in $i = 1, \ldots, n$. Likewise we say that the signature vector $s_1$ is less than


the signature vector $s_2$ in the likelihood ratio order ($s_1 \leq_{lr} s_2$) if $s_{2i}/s_{1i}$ is increasing in $i = 1, \ldots, n$. It is easy to see that all five of the different systems of order 3 given above are also (totally) ordered as in equation (1.20) for both the hazard rate and likelihood ratio orders. For systems of higher order, things become considerably more complicated. As we have previously noted, there are 20 different coherent systems of order 4, and some of these systems (and their corresponding signature vectors) are not comparable according to the stochastic orderings we have discussed. Kochar, Mukerjee and Samaniego [16] give an example of two systems $\tau_1$ and $\tau_2$ of order 4 where $s_1 \leq_{st} s_2$ but $s_1 \not\leq_{hr} s_2$, and another example of two systems $\tau_1^*$ and $\tau_2^*$ of order 4 where $s_1^* \leq_{hr} s_2^*$ but $s_1^* \not\leq_{lr} s_2^*$. The following result gives sufficient conditions on signatures which imply hazard rate (respectively likelihood ratio) ordering on the corresponding system lifetimes.

Theorem 2. Let $s_1$ and $s_2$ be the signatures of two systems of order n, and let $T_1$ and $T_2$ be their respective lifetimes. If $s_1 \leq_{hr} (\leq_{lr})\ s_2$, then $T_1 \leq_{hr} (\leq_{lr})\ T_2$.
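The three signature orderings just defined are easy to check mechanically for any pair of signature vectors. The sketch below is an illustrative Python script, not part of the original text; it compares tail sums for the st ordering and uses cross-multiplication for the hr and lr ratio conditions (a convenient way to avoid dividing by zero, though zero entries at the ends of the support may need extra care). It confirms, for example, that the 2-out-of-3 signature (0, 1, 0) precedes the parallel-series signature (0, 2/3, 1/3) in all three orderings, and likewise that the consecutive 2-out-of-5 signature precedes the bridge signature, as discussed in section 5.

```python
from fractions import Fraction as F

def tails(s):
    """Tail sums: sum of s_j over j >= i, for i = 1, ..., n."""
    out, acc = [], F(0)
    for x in reversed(s):
        acc += x
        out.append(acc)
    return out[::-1]

def st_leq(s1, s2):
    return all(t1 <= t2 for t1, t2 in zip(tails(s1), tails(s2)))

def hr_leq(s1, s2):
    t1, t2 = tails(s1), tails(s2)   # require t2[i]/t1[i] increasing in i
    return all(t2[i] * t1[i + 1] <= t2[i + 1] * t1[i] for i in range(len(s1) - 1))

def lr_leq(s1, s2):
    # require s2[i]/s1[i] increasing in i
    return all(s2[i] * s1[i + 1] <= s2[i + 1] * s1[i] for i in range(len(s1) - 1))

s_2of3, s_ps = [F(0), F(1), F(0)], [F(0), F(2, 3), F(1, 3)]
s_c25 = [F(0), F(2, 5), F(1, 2), F(1, 10), F(0)]
s_bridge = [F(0), F(1, 5), F(3, 5), F(1, 5), F(0)]
print(st_leq(s_2of3, s_ps), hr_leq(s_2of3, s_ps), lr_leq(s_2of3, s_ps))
print(st_leq(s_c25, s_bridge), hr_leq(s_c25, s_bridge), lr_leq(s_c25, s_bridge))
```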

4. Comparison of Systems - Applications

In this section, we give several examples of situations where the concept of the signature of a system is a useful tool in applications. In the first case we investigate the signatures of consecutive k-out-of-n systems, showing, for example, that there is a natural ordering among consecutive 2-out-of-n systems as n grows. We then demonstrate how signatures may be used to show that redundancy at the component level is stochastically superior in the likelihood ratio order to redundancy at the system level for k-out-of-n systems of i.i.d. components. A final example illustrates how the signature can be used to show that the expected lifetime of an indirect majority system exceeds that of a simple majority system of i.i.d. exponential components.

4.1. Consecutive k-out-of-n Systems

The linear consecutive k-out-of-n systems are those where the components are linearly ordered and which fail if any k consecutive components fail. Such systems are often used in modeling telecommunication and oil pipeline systems, as well as in the design of integrated circuitry. They were introduced by Chiang and Niu [11], and have been extensively studied since (see for example Derman, Lieberman and Ross [12] and Koutras and Papastavridis [17]). The simplest case is the consecutive 2-out-of-n system $\tau_{c:2|n}$. By noting that $\binom{i+1}{n-i}$ is the number of ways one can have i successes and n - i failures in a linear arrangement of n components whereby between every 2 failures there is at least 1 success, one has that $a_i \binom{n}{i} = \binom{i+1}{n-i}$ for $\frac{n-1}{2} \leq i \leq n$. It therefore


follows that the reliability function for a consecutive 2-out-of-n system is given by the expression

\[ h_{\tau_{c:2|n}}(p) = \sum_{i=\lceil (n-1)/2 \rceil}^{n} \binom{i+1}{n-i} p^i q^{n-i} = \sum_{i=0}^{\lfloor (n+1)/2 \rfloor} \binom{n-i+1}{i} p^{n-i} q^i. \qquad (1.21) \]

Recursive formulae are available for obtaining the reliability function of consecutive k-out-of-n systems when k > 2. Table 1.3 gives the signature vectors for consecutive 2-out-of-n systems for $2 \leq n \leq 8$.

Table 1.3. Signatures for consecutive 2-out-of-n systems.

  i   s_{c:2|2}   s_{c:2|3}   s_{c:2|4}   s_{c:2|5}   s_{c:2|6}   s_{c:2|7}   s_{c:2|8}
  1       0          0           0           0           0           0           0
  2       1         2/3         1/2        4/10        5/15       10/35        7/28
  3                 1/3         1/2        5/10        7/15       15/35       11/28
  4                              0         1/10        3/15        9/35        8/28
  5                                          0           0         1/35        2/28
  6                                                      0           0           0
  7                                                                  0           0
  8                                                                              0
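The entries of Table 1.3 can be regenerated from the combinatorial fact used above: the probability that j failed components out of n contain no two adjacent ones is $\binom{n-j+1}{j}/\binom{n}{j}$, and successive differences of these survival probabilities give the signature. The sketch below is an illustrative Python script, not part of the original text; it reproduces the columns of the table under that assumption.

```python
from fractions import Fraction as F
from math import comb

def sig_consec2(n):
    """Signature of the (linear) consecutive 2-out-of-n system, using
    q_j = P(no two of j failed components are adjacent) = C(n-j+1, j)/C(n, j)."""
    q = [F(comb(n - j + 1, j), comb(n, j)) for j in range(n + 1)]
    return [q[i - 1] - q[i] for i in range(1, n + 1)]

for n in range(2, 9):
    print(n, sig_consec2(n))    # reproduces the columns of Table 1.3
```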

In order to investigate the relationship between consecutive 2-out-of-n systems for different values of n, we will make use of the following simple lemma.

Lemma 1. Let $h_n(p)$ be the reliability polynomial of a coherent system of order n; as noted in section 2, this polynomial can be written as

\[ h_n(p) = \sum_{j=1}^{n} a_{j,n} \binom{n}{j} p^j q^{n-j}, \qquad (1.22) \]

where the elements $a_{j,n}$, $j = 1, 2, \ldots, n$, are as in (1.12), with the subscript n representing the order of the system. If

\[ a_{j,n} \binom{n}{j} + a_{j+1,n} \binom{n}{j+1} \geq a_{j+1,n+1} \binom{n+1}{j+1}, \qquad (1.23) \]

then $h_n(p) \geq h_{n+1}(p)$ for all $p \in [0, 1]$.

The proof of this lemma follows immediately from the fact that the LHS of (1.23) is the coefficient of $p^{j+1} q^{n-j}$ in the polynomial $(p+q)h_n(p)$, while the RHS of (1.23) is the coefficient of $p^{j+1} q^{n-j}$ in $h_{n+1}(p)$. Now the vectors $a_n$ and $a_{n+1}$ corresponding to the systems $\tau_{c:2|n}$ and $\tau_{c:2|n+1}$ can be identified


from (1.21). One can show that the former system is uniformly superior to the latter simply by verifying that these vectors satisfy condition (1.23) of Lemma 1. This is easily shown to be implied by the combinatorial inequality

\[ \binom{r}{k} + \binom{r+1}{k-1} \geq \binom{r+1}{k} \quad \text{for } k = 0, 1, \ldots, r+1, \qquad (1.24) \]

where $\binom{A}{B} = 0$, by convention, if A < B or if B < 0. But since $\binom{r+1}{k-1} \geq \binom{r}{k-1}$ for all r and k, the inequality (1.24) follows immediately from the well-known identity

\[ \binom{r}{k} + \binom{r}{k-1} = \binom{r+1}{k}. \qquad (1.25) \]

From these considerations, the following result may be inferred.

Theorem 3. Let $T_{c:2|n}$ be the lifetime of a consecutive 2-out-of-n system with i.i.d. component lifetimes. Then $T_{c:2|n} \geq_{st} T_{c:2|n+1}$.

Theorem 3 generalizes to consecutive k-out-of-n systems. This result is established in Boland and Samaniego [7], using a somewhat different line of argumentation.
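Theorem 3 can also be checked numerically from the closed form (1.21). The sketch below is an illustrative Python script, not part of the original text; it evaluates the reliability polynomial of the consecutive 2-out-of-n system on a grid of p and confirms that it decreases as n grows.

```python
from math import comb

def h_consec2(n, p):
    """Reliability polynomial of the consecutive 2-out-of-n system, equation (1.21);
    math.comb returns 0 whenever n - i exceeds i + 1, so the full range can be summed."""
    q = 1.0 - p
    return sum(comb(i + 1, n - i) * p**i * q**(n - i) for i in range(n + 1))

# Numerical check of Theorem 3: h_{c:2|n}(p) >= h_{c:2|n+1}(p) on a grid of p.
for n in range(2, 8):
    assert all(h_consec2(n, k / 20) >= h_consec2(n + 1, k / 20) for k in range(21))
print("ordering verified on the grid")
```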

4.2. Redundancy Enhancements to a System

A common way of improving a system is to build redundancy into it. One method is to add spare components in parallel with some of the components or modules already in the system. Such additions (termed active or parallel redundancies) clearly improve the life of the system, but to what extent? A commonly encountered and challenging problem is to allocate (subject to some budgetary constraints) available spare parts or components within the system in such a way as to optimize system performance (say, for example, to optimize in some stochastic sense the life of the system). A classical theoretical result well known to engineers is that redundancy at the component level is superior (in the usual stochastic order) to redundancy at the system level. Barlow and Proschan [2] state this domination in terms of the structure functions of the two resulting systems. Assuming i.i.d. lifetimes of the components, this structural domination then implies domination in terms of signatures, and consequently in terms of system lifetimes. If $X = (X_1, \ldots, X_n)$ are the independent lifetimes of a system with life length T(X) and $X^* = (X_1^*, \ldots, X_n^*)$ are the lifetimes of n independent spares, then $T(X) \vee T(X^*)$ is the lifetime of a new system formed by using the spare components $X^*$ to form a replicate of the original system and linking it in parallel to the original one (parallel redundancy at the system level). In a similar manner we let $T(X \vee X^*)$ be the lifetime of the original system after the i-th spare $X_i^*$ is connected in


parallel to the i-th component $X_i$ (parallel redundancy at the component level). The above-mentioned engineering principle states that

\[ T(X) \vee T(X^*) \leq_{st} T(X \vee X^*), \qquad (1.26) \]

and it is natural to ask in what way this extends to other stochastic orderings. Boland and El-Neweihi [4] showed by means of an example (a 2-component series system with i.i.d. exponentially distributed components) that if one drops the condition that the i.i.d. spares have the same distribution as the original components, then the result does not extend to the hazard rate ordering. Boland and El-Neweihi [4] also conjectured that this principle of component redundancy being stochastically superior to system redundancy as in (1.26) does extend to the hazard rate ordering for k-out-of-n systems when the components and spares are independent and identically distributed, that is,

\[ T_{k|n}(X) \vee T_{k|n}(X^*) \leq_{hr} T_{k|n}(X \vee X^*). \]

In fact this engineering principle was proved for the stronger likelihood ratio order by Singh and Singh [24]. Their interesting proof uses a technical lemma establishing several delicate inequalities, and requires the tacit assumption that the underlying component lifetime distribution F is absolutely continuous. Using the concept of the signature of a system, Kochar, Mukerjee and Samaniego [16] gave a simpler and slightly more general proof. Using elegant combinatorial arguments, they obtained explicit expressions for the signature of a k-out-of-n system with system-wise redundancy and for the signature of the corresponding system with component-wise redundancy, and showed that the latter dominates the former in the likelihood ratio order.

Condorcet's Jury Theorem states that if p > 1/2 and $n \geq 3$, then $h_{\tau_{[(n+1)/2]|n}}(p) \geq p$ and $h_{\tau_{[(n+1)/2]|n}}(p) \uparrow 1$ as $n \uparrow \infty$. Of course Condorcet's theorem makes an assumption of homogeneity (common p) and independence of individuals - assumptions which are quite idealistic. In the engineering and reliability context, many safety systems are designed which use a majority logic structure (say by using 2-out-of-3 systems as building blocks or modules).


In many situations, a more intricate design (than that of a simple or direct majority system) may be appropriate in setting up a majority system. Suppose for example we have 15 components and, instead of employing a simple majority design, we break the components into 5 groups or subsystems of 3 components each (in decision making this might be done with a group of people in order to lessen the influence of any one individual on the rest of the group). Each of the subsystems is deemed to be working if a majority (in this case 2 or more) of the components are working, and the system as a whole is working if a majority of the subsystems (in this case 3 or more) are working. Such a system is called a 5 × 3 indirect majority system, and this system could be working when as few as 6 components are working - but it might also be in a failed state when as many as 9 components are working! More formally, for any odd integers r and s, we define an r × s indirect majority system to be one where the $n = r \cdot s$ components are broken into r groups of size s each, and majority logic is applied within each subsystem as well as amongst the subsystems themselves. Hence the system is functioning if at least (s+1)/2 components in at least (r+1)/2 of the subsystems are working. Of course there is no necessity to confine oneself to indirect majority systems where the subgroups are all of the same size. For example, the election of a US president through the electoral college procedure is essentially an indirect majority decision (with the 50 states and the District of Columbia acting as subgroups of different sizes). A natural question to consider in any case is whether an indirect majority system of a given size or order n has an advantage over a simple majority system of the same size. Let $T_{r\times s}$ denote the lifetime of an indirect majority system of size $n = r \times s$. Using the powerful technique of total positivity, Boland, Proschan and Tong [6] showed that $h_{\tau_{[(n+1)/2]|n}}(p) \geq h_{\tau_{r\times s}}(p)$ if and only if $p \geq 1/2$. That is to say, when components are independent, homogeneous and reasonably reliable ($p \geq 1/2$), a simple majority system is more reliable than any other indirect majority system of the same size. This suggests a certain sense of superiority of simple majority systems over indirect majority systems; however, using the concept of signature one may show that the mean life of an indirect majority system always exceeds that of the simple majority system of the same size when the i.i.d. components have (DFR) decreasing failure rate distributions. Indirect majority systems possess interesting properties of symmetry, and Boland [10] showed that the signature of the $r \times s = n$ indirect majority system $\tau_{r\times s}$ is symmetric around (n+1)/2 (that is, $s_i = P(T_{r\times s} = X_{(i)}) = P(T_{r\times s} = X_{(n-i+1)}) = s_{n-i+1}$ for any $i = 1, \ldots, n$), and moreover that the vector $a_{\tau_{r\times s}}$ has the property that $1 - a_{n-i} = a_i$ for any $i = 0, 1, \ldots, n$. Table 1.4 gives the signatures (as well as the vectors $a^t$ of proportions of path sets


of a given size) for both the 5 × 3 and 3 × 5 indirect majority systems. Note the symmetry of the signatures about 8.

Table 1.4. Signatures for 5 × 3 and 3 × 5 indirect majority systems.

              T_{5×3}                              T_{3×5}
   i      s_i       a_i                        s_i       a_i
  ≤5       0          0                          0          0
   6     .054      270/C(15,6)  = .054         .060      300/C(15,6)  = .060
   7     .240      1890/C(15,7) = .294         .220      1800/C(15,7) = .280
   8     .412      4545/C(15,8) = .706         .440      4635/C(15,8) = .720
   9     .240      4735/C(15,9) = .946         .220      4705/C(15,9) = .940
  10     .054         1                        .060         1
  ≥11      0          1                          0          1
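The path-set counts quoted in Table 1.4 can be verified by direct enumeration. The sketch below is an illustrative Python script, not part of the original text; it labels the components 0-14 and groups them into five consecutive triples (one possible labelling), then counts the working sets of each size that satisfy the 5 × 3 indirect majority rule.

```python
from itertools import combinations
from math import comb

# Counting path sets of an r x s indirect majority system (here 5 x 3, n = 15),
# to reproduce the a_i column of Table 1.4.
r, s, n = 5, 3, 15
groups = [set(range(g * s, (g + 1) * s)) for g in range(r)]

def is_path_set(working):
    ok_groups = sum(1 for g in groups if len(g & working) >= (s + 1) // 2)
    return ok_groups >= (r + 1) // 2

for i in (6, 7, 8, 9):
    count = sum(1 for c in combinations(range(n), i) if is_path_set(set(c)))
    print(i, count, round(count / comb(n, i), 3))   # 270, 1890, 4545, 4735
```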

From equation (1.2) it follows that the lifetime $T = T(X_1, \ldots, X_n)$ of a coherent system $\tau$ satisfies

\[ P(T > t) = \sum_{i=1}^{n} s_i\, P(X_{(i)} > t), \]

and hence can be viewed as a mixture distribution of the order statistics $(X_{(1)}, \ldots, X_{(n)})$ with mixing probabilities given by the signature $s^t = (s_1, \ldots, s_n)$. In particular it follows that the expected lifetime of the system can be written as

\[ E(T) = \sum_{i=1}^{n} s_i\, E(X_{(i)}). \qquad (1.28) \]

Suppose now that the function $g(i) = E(X_{(i)})$ of the order statistics of the i.i.d. components $X = (X_1, \ldots, X_n)$ is a convex function of $i = 1, \ldots, n$. Then if the system in question is the $r \times s = n$ indirect majority system $\tau_{r\times s}$ (with lifetime $T_{r\times s}$) formed with these components, it follows from the convexity of g and the symmetry of the signature of $\tau_{r\times s}$ that

\[ E(T_{r\times s}) = \sum_{i=1}^{n} s_i\, g(i) \geq g\!\left( \sum_{i=1}^{n} s_i\, i \right) = E(X_{([(n+1)/2])}). \]

It is known (see Kirmani and Kochar [15]) that if $X = (X_1, \ldots, X_n)$ is an i.i.d. sample from a DFR distribution, then $E(X_{(i)})$ is a convex function of $i = 1, \ldots, n$; consequently one has the following result:

Theorem 4. Let $T_{[(n+1)/2]|n}$ (respectively $T_{r\times s}$) be the lifetime of the simple majority (respectively $r \times s$ indirect majority) system with independent identically distributed


DFR component lifetimes given by $X = (X_1, \ldots, X_n)$, where r and s are both odd integers and $n = r \cdot s$. Then

\[ E(T_{[(n+1)/2]|n}) \leq E(T_{r\times s}). \]

For a sample of n i.i.d. exponential random variables with mean parameter $\lambda$, it is clear that $E(X_{(i)}) = \sum_{j=1}^{i} \frac{\lambda}{n-j+1}$ is a convex function of $i = 1, \ldots, n$, and therefore the above theorem holds in this case. Consider for example the situation where $n = 15 = 5 \cdot 3$ and the components are exponentially distributed with $\lambda = 1$. Then $E(T_{5\times 3}) = 0.7336 \geq 0.7254 = E(T_{8|15})$. Therefore, although $T_{r\times s}$ has greater mean than $T_{[(n+1)/2]|n}$, it may be only slightly greater.
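The two expected lifetimes in this example follow directly from (1.28) once the expected order statistics of unit exponentials are in hand. The sketch below is an illustrative Python script, not part of the original text; it uses the rounded 5 × 3 signature from Table 1.4 and reproduces the values 0.7336 and 0.7254 to the accuracy of that rounding.

```python
n = 15
# Expected order statistics of n i.i.d. unit exponentials: E(X_(i)) = sum_{j=1}^{i} 1/(n-j+1).
EX = [sum(1.0 / (n - j + 1) for j in range(1, i + 1)) for i in range(1, n + 1)]

# Rounded signature of the 5 x 3 indirect majority system, from Table 1.4.
s_5x3 = {6: .054, 7: .240, 8: .412, 9: .240, 10: .054}

E_indirect = sum(p * EX[i - 1] for i, p in s_5x3.items())   # approximately 0.7336
E_simple = EX[8 - 1]                                        # E(X_(8)) = 0.7254...
print(round(E_indirect, 4), round(E_simple, 4))
```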

5. Discussion and Conclusions

In the preceding sections, we have defined and discussed the notion of the signature of a coherent system. Although the notion admits generalizations beyond the i.i.d. setting to which we have restricted attention, we believe that its simple form, its interpretability and its demonstrated properties justify its use as a primary vehicle for characterizing system designs and for comparing one system with another. In section 2, we clarified the connection between a system's signature and the basic building blocks of coherent systems - their path sets and cut sets. We also displayed the linear relationship that exists between signatures and signed dominations. Because the latter, described by Agrawal and Barlow [1] as a major "theoretical breakthrough", serves as a major vehicle for computational and algorithmic work on system (and network) reliability, it is important to provide links between dominations and ideas and methods that are useful in system comparisons. In section 3, we reviewed results of Kochar, Mukerjee and Samaniego [16] and Boland [10] which clarify the connection, in terms of increasingly stringent conditions on the signatures of two competing systems, between properties of signatures and properties of the system lifetime. Specifically, we stated and discussed several results stating that certain types of stochastic relationships between signatures translate into similar stochastic relationships between system lifetimes. Section 4 discusses a variety of examples in which the theoretical results of section 3 have been fruitfully applied. Other applications exist. The question of when the expected lifetime of one system of order n (with component lifetimes i.i.d. from F) is necessarily smaller than the expected lifetime of a second system of order n (with the same conditions on components) is a natural one to consider. We have already discussed in subsection 4.3 an example where one system (a simple majority system) has expected lifetime less than that of another (an indirect majority


system of the same order n). The condition

\[ E(T_1) \leq E(T_2), \qquad (1.29) \]

where $T_1$ and $T_2$ are the lifetimes of the two systems of interest, is a fairly weak stochastic relationship (implied, for example, by $T_1 \leq_{st} T_2$), and it is likely to arise more often in applications than the more stringent relationships discussed in section 3. Thus, conditions which guarantee that equation (1.29) holds would be very much of interest. Although we have observed in (1.28) that the expected lifetime of a coherent system can be expressed as a (signature) weighted average of its order statistics, the fact that the comparison of two system lifetimes is a subtle matter is brought home by the following example.

Example 1. Using the terminology in the Introduction, consider two "mixed systems" of order three with lifetimes $T_1$ and $T_2$ and signatures $s_1^t = (.2, .5, .3)$ and $s_2^t = (.5, 0, .5)$ respectively. If three i.i.d. components are distributed according to an exponential distribution with unit failure rate, then the expected values of the order statistics are 1/3, 5/6 and 11/6, and we have that

\[ E(T_1) = 62/60 < 65/60 = E(T_2). \]

On the other hand, if the components are, instead, i.i.d. from the uniform distribution on the interval [0, 1], then the expected values of the order statistics are 1/4, 1/2, and 3/4, and we have

\[ E(T_1) = 21/40 > 20/40 = E(T_2). \]
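Example 1 is easy to confirm by simulation. The sketch below is an illustrative Python script, not part of the original text; it draws three i.i.d. component lifetimes, picks the order statistic indicated by the signature, and averages, so that with enough replications the exponential case gives $E(T_1) < E(T_2)$ while the uniform case gives the reverse.

```python
import random

def mean_lifetime(sig, sampler, rng, reps=200_000):
    """Monte Carlo estimate of E(T) for a mixed system of order 3 with signature sig."""
    total = 0.0
    for _ in range(reps):
        x = sorted(sampler(rng) for _ in range(3))   # order statistics of the components
        i = rng.choices(range(3), weights=sig)[0]    # T equals X_(i+1) with probability sig[i]
        total += x[i]
    return total / reps

rng = random.Random(1)
for name, draw in [("exponential", lambda r: r.expovariate(1.0)),
                   ("uniform", lambda r: r.random())]:
    e1 = mean_lifetime([.2, .5, .3], draw, rng)
    e2 = mean_lifetime([.5, .0, .5], draw, rng)
    print(name, round(e1, 3), round(e2, 3))          # the ordering flips between the two cases
```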

It is thus apparent that the ordering of expected lifetimes for two fixed systems can depend on the underlying distribution. Can anything further be said? It is clear, in general, that $s_1^t \leq_{st} s_2^t$ is a sufficient condition for $E(T_1) \leq E(T_2)$ for all underlying distributions F. Could the condition be necessary? While the general answer is not known at the moment, the necessity of this condition can be demonstrated for systems of order 3. This follows from the fact that the condition (1.29) can be rewritten as

\[ u_2(a_2^1 - a_2^2) + u_3(a_1^1 - a_1^2) \leq 0, \qquad (1.30) \]

where $u_j = E(X_{(j)}) - E(X_{(j-1)})$ and $a_j$ is given in (1.12). Now

\[ s_1 \leq_{st} s_2 \qquad (1.31) \]

occurs if and only if every term of the vector $a^2$ is greater than or equal to the corresponding term of the vector $a^1$. Now suppose that the two signatures do not satisfy the stochastic ordering relationship (1.31). Then the two differences in the left hand side of (1.30) must have opposite signs. If the ratio of the expected spacings $u_2$ and $u_3$ in (1.30) can be made arbitrarily large or arbitrarily small


by different choices of the underlying distribution F, then the inequality (1.29) cannot hold for all F. This precise circumstance can be achieved with the order statistics from three i.i.d. Bernoulli variables. If $X_1, X_2, X_3$ are i.i.d. from B(1, p), then it is easy to verify that $u_3/u_2 = q/p$. It follows that (1.30) will hold for all distributions F if and only if the two differences in the left hand side of (1.30) are negative, that is, if and only if $s_1 \leq_{st} s_2$. We thus have:

Theorem 5. Given two (mixed) systems of order 3 having signatures $s_1$ and $s_2$,

$E(T_1) \leq E(T_2)$ for all F if and only if $s_1 \leq_{st} s_2$.

It is not known at present whether or not this result extends to systems of arbitrary order. What we do know is that if $E(T_1) \leq E(T_2)$ for all F, then the vector of differences $s_1 - s_2$ has an even number of sign changes, a conclusion that follows from Descartes's Rule of Signs. We argued in section 2 that the comparison of systems via their vectors of dominations had little intuitive content and, for complex systems, was extremely difficult. The reason for this is that the difference of two polynomials in standard form is another polynomial in standard form. For two complex systems, this difference polynomial will typically be of quite high degree. Thus, determining whether one reliability polynomial is uniformly larger than another for all $p \in (0, 1)$ is a task equivalent to finding the roots of a high-degree polynomial. But that algebraic problem is a quite famous one, a problem that was dramatically resolved by Évariste Galois. Finding roots of polynomials of order greater than 4 is not a problem that is "solvable by radicals", that is, there are no closed-form expressions which solve the problem. Transforming this problem into the world of signatures changes things substantially. To see this more graphically, let us consider the comparison between two systems of order 5, the bridge system shown in Figure 1.2 and the consecutive 2-out-of-5 system (as discussed in Section 4). The domination vectors for these two systems are $d_{bridge} = (0, 2, 2, -5, 2)$ and $d_{c:2|5} = (0, 1, 3, -4, 1)$ respectively.


orderings. Noting the increasing ratios of non-zero terms of the two signatures (i.e., .5, 1.2, 2) demonstrates both of these claims. We can thus rightly say that the bridge system is not only better that the consecutive 2-out-of- 5 system, it's actually a whole lot better. Since most of the properties and potential of the notion of system signature have come to light only recently, it is reasonable to expect that the collection of successful applications of the idea will grow over the coming years. If this presentation serves to encourage others to pursue such applications and to investigate actively the potential to which we have alluded, our purposes will have been very well served.

Acknowledgments

The authors wish to thank the Departments of Statistics at both the National University of Ireland - Dublin (NUI-D) and Trinity College Dublin (TCD) for their support during periods of time when the authors were on sabbaticals visiting these institutions.

References

[1] Agrawal, A. and Barlow, R. E. (1984). A Survey of Network Reliability and Domination Theory. Operations Research, 32: 478-492.
[2] Barlow, R. E. and Proschan, F. (1981). Statistical Theory of Reliability and Life Testing: Probability Models. Silver Spring, Maryland, USA: To Begin With.
[3] Birnbaum, Z. W., Esary, J. A. and Marshall, A. (1966). Stochastic Characterization of Wearout for Components and Systems. Annals of Mathematical Statistics, 37: 816-825.
[4] Boland, P. J. and El-Neweihi, E. (1995). Component Redundancy versus System Redundancy in the Hazard Rate Ordering. IEEE Transactions on Reliability, 44: 614-619.
[5] Boland, P. J. (1989). Majority Systems and the Condorcet Jury Theorem. The Statistician, 38: 181-189.
[6] Boland, P. J., Proschan, F. and Tong, Y. L. (1989). Modelling Dependence in Simple and Indirect Majority Systems. Journal of Applied Probability, 26: 81-88.
[7] Boland, P. J. and Samaniego, F. J. (2003). Stochastic Ordering Results for Consecutive k-out-of-n Systems. To appear in IEEE Transactions on Reliability.
[8] Boland, P. J., Samaniego, F. J., and Vestrup, E. M. (2001). Linking Dominations and Signatures in Network Reliability Theory. Technical Report #372, Department of Statistics, University of California, Davis.


[9] Boland, P. J., Shaked, M. and Shanthikumar, J. G. (1998). Stochastic Ordering of Order Statistics. In Handbook of Statistics, Academic Press, San Diego, CA, 16: 89-103.
[10] Boland, P. J. (2001). Signatures of Indirect Majority Systems. Journal of Applied Probability, 38: 597-603.
[11] Chiang, D. and Niu, S. C. (1981). Reliability of Consecutive k-out-of-n. IEEE Transactions on Reliability, R-30: 87-89.
[12] Derman, C., Lieberman, G. J., and Ross, S. (1982). On the Consecutive k-out-of-n: F System. IEEE Transactions on Reliability, R-31: 57-63.
[13] Esary, J. and Proschan, F. (1963). Coherent Structures of Non-identical Components. Technometrics, 5: 191-209.
[14] Feller, W. (1968). Introduction to Probability Theory and Applications, Third Edition. New York: Wiley and Sons.
[15] Kirmani, S. and Kochar, S. (1995). Some New Results on Spacings from Restricted Families of Distributions. J. Stat. Plan. Inf., 46: 47-57.
[16] Kochar, S., Mukerjee, H. and Samaniego, F. J. (1999). The Signature of a Coherent System and Its Application to Comparisons Among Systems. Naval Research Logistics, 46: 507-523.
[17] Koutras, M. and Papastavridis, S. G. (1993). Consecutive k-out-of-n: F Systems and their Generalizations. In K. B. Misra, editor, New Trends in System Reliability Evaluation. Elsevier, Amsterdam, pp. 228-248.
[18] Samaniego, F. J. (1985). On Closure of the IFR Class under Formation of Coherent Systems. IEEE Transactions on Reliability, R-34: 69-72.
[19] Satyanarayana, A. and Prabhakar, A. (1978). New Topological Formula and Rapid Algorithm for Reliability Analysis of Complex Networks. IEEE Transactions on Reliability, R-27: 82-100.
[20] Satyanarayana, A. and Hagstrom, J. N. (1981a). A New Algorithm for the Reliability Analysis of Multi-terminal Networks. IEEE Transactions on Reliability, R-30: 325-334.
[21] Satyanarayana, A. and Hagstrom, J. N. (1981b). Combinatorial Properties of Directed Graphs Useful in Computing Network Reliability. Networks, 11: 357-366.
[22] Satyanarayana, A. (1982). A Unified Formula for the Analysis of Some Network Reliability Problems. IEEE Transactions on Reliability, R-31: 23-32.
[23] Satyanarayana, A. and Chang, M. K. (1983). Network Reliability and the Factoring Theorem. Networks, 13: 107-120.
[24] Singh, H. and Singh, R. S. (1997). On Allocation of Spares at Component Level Versus System Level. Journal of Applied Probability, 34: 283-287.


[25] Shaked, M. and Shanthikumar, J. G. (1994). Stochastic Orders and their Applications. San Diego, CA. Academic Press.

Chapter 2 SYSTEM RELIABILITY OPTIMIZATION: AN OVERVIEW*

Way Kuo
Texas A&M University, College Station, TX, USA
[email protected]

Rajendra Prasad
KBSI, College Station, TX, USA
[email protected]

Abstract

This chapter provides an overview of the methods that have been developed since 1977 for solving various reliability optimization problems. Also discussed are applications of these methods to various types of design problems. The chapter addresses heuristics, metaheuristic algorithms, exact methods, reliability-redundancy allocation, multiobjective optimization and assignment of interchangeable components in reliability systems. As in other types of applications, exact solutions for reliability optimization problems are not necessarily desirable, because exact solutions are difficult to obtain and, even when they are available, their utility is marginal. A majority of the recent work in the area is devoted to the development of heuristic and metaheuristic algorithms for solving optimal redundancy allocation problems.

Keywords:

Reliability optimization, redundancy allocation, reliability-redundancy allocation, metaheuristic algorithms, heuristics, optimal assembly of systems

*Based largely on "Annotated Overview of System Reliability Optimization" by W. Kuo and R. Prasad, which appeared in IEEE Transactions on Reliability, 49(2), pp. 176-187, June 2000. © 2000 IEEE

1. Introduction

Reliability optimization has attracted many researchers in the last 4 decades due to reliability's critical importance in various kinds of systems. To maximize system reliability, the following options can be considered:

1. enhancement of component reliability,
2. provision of redundant components in parallel,
3. a combination of 1 and 2, and
4. reassignment of interchangeable components.

However, any effort for improvement usually requires resources. In this chapter, we mainly review papers published after 1977 addressing options 2-4. Design for reliability allocation in a system belongs to option 1, which has been well developed in the 1960s and 70s, as is documented in Tillman et al. [66]. To understand enhancement of component reliability for modern semiconductor products, see Kuo et al. [32], Kuo and Kim [34] and Kuo and Kuo [35]. The diversity of system structures, resource constraints and options for reliability improvement has led to the construction and analysis of several optimization models. Note that, according to Chern [6], quite often it is hard, with multiple constraints, to find a feasible solution for redundancy allocation problems. For a good literature survey of the early work, refer to Tillman et al. [65], Tillman et al. [66] and Kuo et al. [37]. The major focus of recent work is on the development of heuristic methods and metaheuristic algorithms for redundancy allocation problems. Little work is directed towards exact solution methodology for such problems. To the best of our knowledge, all of the reliability systems considered in this area belong to the class of coherent systems. The literature on reliability optimization methods since 1977 can be classified into the following seven categories:

1. heuristics for redundancy allocation, special techniques developed for reliability problems,
2. metaheuristic algorithms for redundancy allocation, perhaps the most attractive development in the last 10 years,
3. exact algorithms for redundancy allocation or reliability allocation, most of which are based on mathematical programming techniques, e.g. the reduced gradient methods presented in Tillman et al. [66],
4. heuristics for reliability-redundancy allocation, a difficult but realistic situation in reliability optimization,


5. multi-objective optimization in system reliability, an important but not widely studied problem in reliability optimization,
6. optimal assignment of interchangeable components in reliability systems, a unique scheme that often takes no effort, and
7. others, including decomposition, fuzzy apportionment, and effort function minimization.

Notation:

$R_s$ - system reliability
n - number of stages in the system
m - number of resources
$x_j$ - number of components at stage j
$l_j$ - lower limit on $x_j$
$u_j$ - upper limit on $x_j$
$r_j$ - component reliability at stage j
x - $(x_1, \ldots, x_n)$
r - $(r_1, \ldots, r_n)$
$g_{ij}(x_j)$ - amount of the i-th resource required to allocate $x_j$ components at stage j
$g_i(x)$ - total amount of the i-th resource required for allocation x
C - total system cost
W - total system weight
$c_j(x_j)$ - cost of $x_j$ components at stage j
$w_j(x_j)$ - weight of $x_j$ components at stage j
$P_j$ - j-th stage (subsystem)
$R_0$ - minimum system reliability required
$w_1, \ldots, w_m$ - surrogate multipliers
S(w) - single-constraint surrogate problem
$p_{ij}$ - reliability of component j in position i


2. Redundancy Allocation

The redundancy allocation problem, PI, in a general reliability system consisting of n stages, is perhaps the most common problem in design for reliability. It can be written as to maximize (2.1) subject to

δ > 0, is the critical time (or the threshold time), and the ensuing model is a description of a cascading failure. The choice of what value to use for δ is subjective, though a possible strategy is to let δ be the time it takes to restore the failed component to operational status. This is what we define to be a Cascading Model, that is, a model in which the failure of one component causes a temporary change (usually an increase) in the failure rates of the surviving component(s). Motivation for this comes from thinking about the domino effect. That is, the falling of a domino causes its neighbor to fall, but only if the neighbor is within striking distance of the falling domino. If the dominos are too far apart, a falling domino will not have any effect on its neighbors. Thus with cascading failures, the failure of one component is followed by that of its neighbor, but within a specified time, which we shall call the critical time. If the failure of a component causes its neighbor to fail, but after the critical time has elapsed, then such failures are causal, not cascading. See Swift [12] for generalizations of the cascading model to non-exponential distributions and for networks of more than two components.
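As a concrete illustration of this idea, the following is a minimal simulation sketch of a two-component cascading model; the rates lam1, lam2, the elevated rate lam2_up and the critical time delta are hypothetical values chosen only for illustration, and only the effect of component 1's failure on component 2 is simulated.

import numpy as np

rng = np.random.default_rng(0)

def simulate_pair(lam1=1.0, lam2=1.0, lam2_up=5.0, delta=0.5):
    """One draw of (T1, T2): component 1 fails at rate lam1; component 2 fails
    at rate lam2, except on [T1, T1 + delta) where its rate is lam2_up."""
    t1 = rng.exponential(1.0 / lam1)
    t2 = rng.exponential(1.0 / lam2)           # life of component 2 at the baseline rate
    if t2 >= t1:                               # component 2 is still up when component 1 fails
        extra = rng.exponential(1.0 / lam2_up) # memoryless residual life at the elevated rate
        if extra < delta:
            t2 = t1 + extra                    # failure within the critical time: cascading
        else:
            t2 = t1 + delta + rng.exponential(1.0 / lam2)  # rate reverts after delta
    return t1, t2

delta = 0.5
draws = np.array([simulate_pair(delta=delta) for _ in range(20000)])
cascading = np.mean((draws[:, 1] > draws[:, 0]) & (draws[:, 1] - draws[:, 0] <= delta))
print("estimated probability of a cascading failure:", cascading)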

Figure 3.4. Failure Rate of the Second Component to Fail in a Model for Cascading Failures (failure rate plotted against time).

Acknowledgments This work is supported by the Army Research Office under a MURI Grant DAAD 19-01-1-0502.

References
[1] Chen, J. and Singpurwalla, N. D. (1996). The Notion of Composite Reliability and its Hierarchical Bayes Estimation. Journal of the American Statistical Association, 91: 1474-1484.
[2] DeFinetti, B. (1972). Probability, Induction and Statistics, John Wiley, New York.
[3] DeFinetti, B. (1975). Theory of Probability, John Wiley, New York.
[4] DeGroot, M. H. (1989). Probability and Statistics, 2nd edn, Addison-Wesley, Reading, MA.
[5] Freund, J. E. (1961). A Bivariate Extension of the Exponential Distribution. Journal of the American Statistical Association, 56: 971-977.
[6] Kong, C. W. and Singpurwalla, N. D. (2001). Neural Nets for Network Reliability. To appear.
[7] Lindley, D. V. and Singpurwalla, N. D. (2002). On Exchangeable, Causal and Cascading Failures. Statistical Science, 17: 209-219.


[8] Lynn, N., Singpurwalla, N. D. and Smith, A. F. M. (1998). Bayesian Assessment of Network Reliability. SIAM Review, 40: 202-227.
[9] McCulloch, W. S. and Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5: 115-133.
[10] Roberts, G. O. and Smith, A. F. M. (1993). Bayesian Methods via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods. J. Roy. Statist. Soc. B, 55: 3-23.
[11] Singpurwalla, N. D. and Swift, A. W. (2001). Network Reliability and Borel's Paradox. The American Statistician, 55: 213-218.
[12] Swift, A. W. (2001). Stochastic Models of Cascading Failures. PhD thesis, The George Washington University.

II

RECURRENT EVENTS

Chapter 4
MODELLING HETEROGENEITY IN REPEATED FAILURE TIME DATA: A HIERARCHICAL BAYESIAN APPROACH

Elja Arjas
Rolf Nevanlinna Institute, University of Helsinki, Finland
[email protected]

Madhuchhanda Bhattacharjee
Rolf Nevanlinna Institute, University of Helsinki, Finland
[email protected]

Abstract

Heterogeneity in data has been frequently encountered and incorporated in modelling and inference problems in many areas. As far as reliability data are concerned, this issue has been addressed only to a very limited extent. This paper develops inference procedures which can take into account different kinds of heterogeneity in repeated failure time data and enable inference not only on population parameters relating to the survival process, but also on the underlying processes that lead to such observed heterogeneity between individuals. We illustrate the proposed model by analyzing real life data sets wherein, although the observations consist of times to repeated occurrences of failures/events from similar systems/subjects, several factors can cause them to behave differently. We propose hierarchical Bayes models for these data. The gains from the proposed models can be observed from the predictive distributions. Inference on model parameters and predictions were carried out using an MCMC sampler.

Keywords:

Heterogeneity, repairable systems, repeated events data, hierarchical Bayesian modelling, latent variable structures, predictive distributions, nuclear power plant valve failure data, relapses of clinical condition.

1. Introduction

In the analysis of lifetime data it is commonly assumed that the considered lifetimes are i.i.d., that is, independent and identically distributed. In the context of reliability problems, the independence assumption is usually justified by referring to an argument of the physical independence between units and a consequent absence of causal mechanisms that would connect their operation to each other. The identical distribution assumption, on the other hand, is commonly expressed by saying that the units are "drawn from the same population". The key idea there is that, as far as the statistical modelling is concerned, the units are treated as identically distributed if there do not, a priori, seem to be useful ways based on observable characteristics that would justify considering their reliabilities individually. Therefore, being "identically distributed" does not mean that the units would be identical, but rather that we as analysts are unable, or sometimes unwilling, to find characteristic differences between them that would provide a basis for considering them individually or in groups. But there are some instances in which it becomes almost mandatory to somehow acknowledge, and take into account explicitly, that there is an element of unobserved heterogeneity between the units belonging to the considered population. This is so although there may not be a direct way of describing what such differences between units are, let alone quantifying them. The following well known examples illustrate this point. In a burn-in design all units that are to be used later will first be put on test under testing conditions which are possibly even more strenuous than normal use. The idea behind this is that some units may be of "inferior quality" and it would be important to select them away, but it is impossible to distinguish them in advance from the "good quality" units (e.g. [4], [10], [11]). Therefore all units are put on test, and it is anticipated that those of inferior quality will fail sooner. After some time has elapsed it is likely that most of them have been selected away in this manner, leaving mainly "good quality" units for future use. Note that, whether this is acknowledged explicitly in the statistical modelling or not, such burn-in testing makes sense only if it is actually supposed that the units can be usefully divided into different quality categories, as otherwise the testing could not result in any selection. If there are no differences in the quality, the net result from such testing is only negative: nothing is gained; instead, a number of units are lost and the useful lifetimes of those finally selected are reduced because of the testing. Another context in which it may be important to take into account the possibility of unobserved heterogeneity in the population is in the study of effects of the covariates on lifetimes, that is, hazard regression. In the famous proportional hazards model of Cox [6] such effects are modelled in terms of a relative risk function which modulates in a multiplicative manner a common baseline


hazard function of time. On the other hand, it seems reasonable to think that some part of the variation between lifetimes should be attributed to individual characteristics which have remained unobserved in the data. For this purpose, in order to control for unobserved heterogeneity in the population, it has become common to introduce into the model unit-specific, so-called frailty parameters. In a technical sense, frailty parameters are positive random variables which are sampled independently, and also independently of the corresponding observed covariates, from some distribution (which is often assumed to be from the Gamma family). The multiplicative structure is preserved so that the hazard rate of a unit is now modelled as the product of three terms: the baseline hazard, the relative risk function representing "observed heterogeneity", and the frailty parameter describing unobserved heterogeneity. There is presently a substantial amount of literature describing such models, for example, the monograph by Hougaard [8]. In the case where a single duration/survival has been measured on each individual there are as many unknown frailty parameters in the model as there are durations in the data. Point estimation of the individual frailties, along with the estimation of the other unknown model parameters, is therefore not possible. On the other hand, there are interesting theoretical results concerning the identifiability of frailty models due to Heckman and co-workers (e.g. [7]), essentially saying that under the above multiplicativity and independence assumptions, one can capture the precise form of the distribution from which the frailty parameters are sampled in the (hypothetical) situation in which the covariate specific survival distributions have been estimated without error. In this paper we consider situations in which the units can experience several episodes of use. Each episode ends with a failure, and may then be followed by a repair that puts the unit back into operation; this sequence of events is then repeated a number of times. Under such circumstances we can often see that across units the repeated episodes follow some more or less distinct systematic looking patterns. Such a pattern in the data suggests that the corresponding unit may have some underlying property or characteristic which is causing the observed systematic looking behavior. For example, some units appear to be systematically associated with short lifetimes, and some others with long lifetimes. Different patterns can then be associated with a corresponding notion of heterogeneity in the population. This suggests introduction of an unobserved frailty parameter (or random effect) for each unit. A somewhat more complicated pattern is present if there is a discernible increasing or decreasing (long term) trend in the successive inter-failure times of each unit, but the speed at which this happens may be different across units. In the currently available methods for dealing with frailty models, the common practice is to integrate out the frailties even when there are repeated observations from the same unit/subject, leaving no hope for appropriate individual


specific inference. Our objective here is, apart from inference about the population level parameters and/or parameters common to subjects, to exploit the availability of more individual level information, as in this case, to make individual specific inferences as well. This is particularly important in situations in which one wants to make individual predictions based on earlier monitored failure data. For earlier discussions, see [1], [2]. Of the earlier literature which we have encountered, by far the closest to our work is the paper of Chen and Singpurwalla [5] which applies hierarchical Bayesian modelling to the reliabilities of units belonging to a heterogeneous population. Also the considered context is close to our first illustration below. The main difference is that we are concerned with a longitudinal analysis of failure data, modelling sequences of inter-failure times and considering aspects of dependence including prediction and the possibility of trend-like behavior within such sequences, whereas in [5] the main focus is on modelling potentially dependent binary outcomes from a large number of start-test trials, and in estimating the consequent system level "composite" reliability. We illustrate our proposed method by two examples based on real life data, followed by a discussion.
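To make the multiplicative frailty structure described earlier in this section concrete, the hazard rate of unit i with observed covariate vector x_i and unobserved frailty Z_i is typically written as in the following sketch (generic notation, not the notation used later in this chapter):

\[ \lambda_i(t \mid x_i, Z_i) = Z_i\, \lambda_0(t)\, \exp(\beta' x_i), \qquad Z_i \ \text{i.i.d.} \sim \mathrm{Gamma}(\nu, \nu), \]

where λ_0 is the common baseline hazard, exp(β'x_i) is the relative risk function representing the observed heterogeneity, and the unit-specific frailty Z_i (often given a Gamma distribution with mean one, as indicated here) describes the unobserved heterogeneity.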

2. Illustration 1: Valve Failures in Nuclear Power Plants

2.1. Data

The data here pertain to 104 motor operated closing valves in different safety systems at two nuclear power plants in Finland, with 52 valves in each of the plants. Our data are based on 9 years operating experiences of these valves, beginning 1981. A motor operated valve, including the actuator, is an electromechanical system and consists of hundreds of parts, and hence can be quite complicated in structure. The valves have different design capacities, vary in diameter and actuator size and are manufactured by different manufacturers. The valves can experience different types of malfunctioning either when in operation or when pre-scheduled tests are carried out on them as a part of preventive maintenance ([9]). Here we consider the failures of the type "External Leakage". An external leakage is recorded from a valve usually when there is a leakage from one of its components, such as a "Bonnet" or "Packing". Valves are continuously monitored for such failures and are rectified/repaired without delay. Here we assume that repair times are negligible. The observed failure processes from these valves varied distinctively across valves. As could be expected, a large majority of systems, 88 out of the 104 considered, did not fail during the follow up, as these are parts of safety systems and are built to perform successfully for long time periods. From the remaining


16 valves, 37 external leakages were recorded, ranging from a minimum of 1 failure to a maximum of 8 events recorded from a single valve.

2.2. Preliminary Models

In a preliminary analysis we decided to ignore the observed heterogeneity across valves and started with a simplistic model, assuming that the failures are governed by Poisson processes with common hazard/intensity parameter values for all 104 valves. We assumed a hierarchical model structure for the parameters. But this model predicted a median inter-failure time of 17 years for all valves. This, for an obvious reason, is not acceptable for valves which had experienced approximately 1 failure every year, e.g. for Valve 1, which had 8 failures in 9 years, or Valve 2, which had 7 failures in 9 years. As a next step we relaxed the above model structure by assuming that the Poisson processes are not necessarily identical in distribution and assumed an appropriate hierarchical structure. This simple extension immediately captured the heterogeneous behavior of the valves, and valve specific predicted inter-failure times now had median values ranging approximately from 2 years to 26 years. This is consistent with the observed behavior of the respective valves, as some of them had failed often whereas most others had, so far, never failed. We further tried an extension of this model by dropping the Poisson assumption. Instead, we considered a point process model where the failure intensity for each valve is piecewise constant and changes randomly at its own failure epochs, the intensities being drawn independently from a Gamma distribution. This was done after observing that the inter-event times from any specific valve also varied widely, whereas under the homogeneous Poisson process assumption they are assumed to come from the same exponential distribution. Note that, by integrating out the piecewise constant parameters, this model reduces to modelling the valves by identical renewal processes with Pareto distributed inter-failure times. But even though this was a simple and intuitive extension, it lost the valve specific prediction ability: due to the implicit identical-distribution assumption, the model did not contain any valve specific features. The large number of long (censored or uncensored) inter-failure times dominated the prediction, and the predicted median inter-arrival time was estimated to be approximately 14 years for all valves.


These three preliminary models can be summarized as follows:

1) Inter-failure times: for $i = 1, \ldots, 104$ and $j \ge 1$,
   i.i.d. homogeneous Poisson process: $X_{i,j} \mid \theta \sim \mathrm{Exp}(\theta)$
   Non-i.i.d. homogeneous Poisson process: $X_{i,j} \mid \theta_i \sim \mathrm{Exp}(\theta_i)$
   Point process with piecewise constant intensity: $X_{i,j} \mid \theta_{ij} \sim \mathrm{Exp}(\theta_{ij})$

2) Hazard rate(s): for $i = 1, \ldots, 104$ and $j \ge 1$,
   i.i.d. homogeneous Poisson process: $\theta \mid \beta \sim \mathrm{Gamma}(1, \beta)$
   Non-i.i.d. homogeneous Poisson process: $\theta_i \mid \beta \sim \mathrm{Gamma}(1, \beta)$
   Point process with piecewise constant intensity: $\theta_{ij} \mid \beta \sim \mathrm{Gamma}(1, \beta)$

3) Hyper-parameter (under all three models): $\beta \sim \mathrm{Gamma}(5, 0.003)$
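As an illustration of how such a hierarchy can be fitted, the following is a minimal Gibbs-sampler sketch for the second (non-i.i.d. homogeneous Poisson process) model, exploiting the Exponential-Gamma conjugacy; the simulated data, the treatment of the censored final gap through the total exposure time, and the Gamma rate parameterization are assumptions made only for this sketch and are not taken from the chapter.

import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: for each valve i, n_fail[i] observed failures and
# expo[i] = total observation time (years), so the likelihood of theta_i is
# theta_i**n_fail[i] * exp(-theta_i * expo[i]) (failures plus a censored last gap).
n_valves = 104
true_theta = np.where(rng.random(n_valves) < 0.1, 0.8, 0.02)
expo = np.full(n_valves, 9.0)
n_fail = rng.poisson(true_theta * expo)

# Model: X_ij | theta_i ~ Exp(theta_i), theta_i | beta ~ Gamma(1, beta) (rate),
#        beta ~ Gamma(5, 0.003) (rate).
a0, b0 = 5.0, 0.003
n_iter = 5000
beta = 1.0                                   # arbitrary starting value
beta_draws = np.empty(n_iter)
theta_draws = np.empty((n_iter, n_valves))

for it in range(n_iter):
    # theta_i | beta, data ~ Gamma(1 + n_fail[i], rate = beta + expo[i])
    theta = rng.gamma(shape=1.0 + n_fail, scale=1.0 / (beta + expo))
    # beta | theta ~ Gamma(5 + n_valves, rate = 0.003 + sum_i theta_i)
    beta = rng.gamma(shape=a0 + n_valves, scale=1.0 / (b0 + theta.sum()))
    beta_draws[it] = beta
    theta_draws[it] = theta

burn = n_iter // 2
print("posterior mean of beta:", beta_draws[burn:].mean())
print("posterior median inter-failure time, valve 0 (years):",
      np.median(np.log(2) / theta_draws[burn:, 0]))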

A partial summary of the preliminary analysis is given below.

Table 4.1. Preliminary analysis of valve failure data.

                                    Predicted median failure time (approx.) in years
Valve no.   Observed no.            i.i.d. HPP   Non-i.i.d. HPP   PP with piecewise
            of failures                                           const. intensity
1           8                       17           2                14
2           7                       17           2.3              14
10          1                       17           10.5             14
17          0                       17           25.5             14

From these preliminary analyses we could see that, even in a situation in which there is considerable heterogeneity in the inter-failure times both between valves and within each valve, merely adding more latent variables may not be a good strategy for modelling. Here, in the third model, it resulted in a loss of predictive performance.

2.3. Proposed Model and Analysis

As above, we describe the behavior of these units by a hierarchical Bayesian model, where the top layer of this model consists of observed inter-failure times. Due to heterogeneity in the valve data, the failure process of each valve needs to be modelled separately. Therefore, and in the absence of any evident time trend, we model each individual failure process with a piecewise constant intensity process changing at the failure epochs. Unlike the last of our preliminary models wherein we had assumed such a piecewise constant intensity, here we additionally assume that the constants


forming the intensity are drawn from a valve specific distribution, thus preserving the valve specific characteristics. Next, we observe that there are at least two different patterns in the behavior of the valves, with most not failing during the entire follow up and a few failing even several times. We interpret this in a simplistic manner as follows. We assume that each valve comes either from an a priori unknown group of "good" valves with possibly long failure times, or from a (smaller) group of "bad" valves with typically much shorter inter-failure times. These group specific distributions were then chosen to belong to the Gamma family and, in order to characterize the "bad" and "good" groups, the scale parameters of these two Gamma distributions were ordered. The group membership of each valve is identified by a latent indicator variable with a Bernoulli distribution. Due to lack of further knowledge about the population proportions of the good and bad valves we assigned the Uniform(0, 1) prior to the parameter of the Bernoulli distribution. This hierarchical model structure can be summarized as follows:

1) Inter-failure times: $(X_{i,j} \mid \theta_{ij}) \sim \mathrm{Exp}(\theta_{ij})$, where $i = 1, \ldots, 104$ and $j \ge 1$,

2) Hazard rates: $(\theta_{ij} \mid \beta_0, \beta_1, C_i) \sim \mathrm{Gamma}(1, \beta_{C_i})$, where $i = 1, \ldots, 104$ and $j \ge 1$,

3) Group indicators: $(C_i \mid p) \sim \mathrm{Bernoulli}(p)$, $i = 1, \ldots, 104$,

4) Group specific parameters: $\beta_0 \sim \mathrm{Gamma}(5, 0.005)$ and $\beta_1 = \beta_0 + \eta_1$, where $\eta_1 \sim \mathrm{Gamma}(5, 0.001)$,

5) Mixing probability: $p \sim \mathrm{Uniform}(0, 1)$.
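To make the layered structure concrete, the following minimal sketch simulates valve failure histories forward from this hierarchy (a prior predictive simulation, not the authors' posterior computation); the rate parameterization of the Gamma distributions, the day time unit and the nine-year observation window are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(2)

n_valves, follow_up = 104, 9.0 * 365.0   # follow-up in days (assumed time unit)

# 5) mixing probability and 4) group-specific parameters
p = rng.uniform(0.0, 1.0)
beta0 = rng.gamma(shape=5.0, scale=1.0 / 0.005)          # "bad" group
beta1 = beta0 + rng.gamma(shape=5.0, scale=1.0 / 0.001)  # "good" group (ordered)

records = []
for i in range(n_valves):
    c_i = rng.binomial(1, p)                 # 3) group indicator: 1 = "good"
    beta_c = beta1 if c_i else beta0
    t, n_events = 0.0, 0
    while True:
        theta_ij = rng.gamma(shape=1.0, scale=1.0 / beta_c)  # 2) hazard for this gap
        gap = rng.exponential(1.0 / theta_ij)                 # 1) inter-failure time
        if t + gap > follow_up:
            break
        t += gap
        n_events += 1
    records.append((c_i, n_events))

counts = np.array([r[1] for r in records])
print("simulated proportion with no failure:", np.mean(counts == 0))
print("max number of failures for a single valve:", counts.max())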

Numerical estimation of the model parameters was done by using MCMC sampling from the posterior. After carrying out a careful sensitivity analysis, the fixed parameters/hyper-parameters were chosen such that they gave rise to reasonably vague prior distributions for the non-constant parameters. Note again that, given the values of the group indicator and the appropriate β parameter, if the parameters θ are integrated out from the model, the failure process for each valve becomes a renewal process with Pareto distributed inter-event times. The two groups will still be identified by the two β parameters, which will now be the shape parameters of the Pareto distributions. The proportion of the good and bad valves in the population was estimated from the posterior distribution of the mixing hyper-parameter p. Apart from obtaining the posteriors of the parameters involved, we generated predictive distributions of several quantities. The predicted failure process of a generic valve of an unspecified quality class was also obtained. Additional information on the valve quality can help in predicting the future failure behavior of the valve. This was achieved by fixing the value of the latent group membership variable, and then generating the corresponding failure process. As anticipated, the realized sample paths across the two groups were significantly different.

Table 4.2. Results of analysis of valve failure data: Population level characteristics.

Characteristic                        Posterior expectation
Proportion of "Bad valves"            0.07
Proportion of "Good valves"           0.93
β parameter for "Bad valves"          494
β parameter for "Good valves"         13650

Table 4.2 (continued). Results of analysis of valve failure data: Generic-valve characteristics.

Characteristic                                                       Posterior estimate
Predictive probability for a generic valve to be a "Good valve"      0.93
Predictive median time to first failure for a generic valve          31.67 years
Predictive median time to first failure for a "Bad valve"            1.27 years
Predictive median time to first failure for a "Good valve"           37 years
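The group-specific predictive medians above can be understood from the Pareto (Lomax) marginal mentioned earlier. Assuming the Gamma(1, β) specification is parameterized by its rate and that time is recorded in days (both assumptions of this note), integrating out the piecewise constant hazard gives, for a new inter-failure time X within a group with parameter β,

\[ \Pr(X > x \mid \beta) = \int_0^\infty e^{-\theta x}\, \beta e^{-\beta \theta}\, d\theta = \frac{\beta}{\beta + x}, \]

so the within-group predictive median is simply β. With the posterior expectations of Table 4.2 this gives roughly 494 days, about 1.4 years, for a "bad" valve and 13650 days, about 37 years, for a "good" valve, in line with the reported predictive medians; the small differences reflect averaging over the posterior of β rather than plugging in its mean.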

Additional quantities that could be of interest from a maintenance point of view are the predicted numbers of failures in some forthcoming time intervals. Some examples of such predictions are given below. Unlike the prevalent methods for dealing with situations of heterogeneity in data, our method also provides individual specific inferences. This can be done by obtaining posterior estimates of the individual specific parameters in the model. For example, the posterior distributions of the Ci's describe the chances that the corresponding valve comes from the good or the bad group, which may be useful for many purposes, like reliability assessment and maintenance. This information, in conjunction with the population parameter estimates, will then provide valve specific predictive distributions.

Table 4.2 (continued). Results of analysis of valve failure data: Maintenance related characteristics.

Characteristic                                               Posterior expectation
No. of failures from all 104 valves during the 10th year     4.00
No. of valves to fail at least once during the 10th year     3.47
No. of valves to fail ≥ 2 times during the 10th year         0.41
No. of valves to fail ≥ 3 times during the 10th year         0.09

Table 4.3. Results of analysis of valve failure data: Unit level characteristics.

                                                                Estimates for valve no.
Characteristic                                                  1       2       10      17
Posterior probability of being a "Good valve"                   0.00    0.00    0.96    0.99
Predictive median future inter-failure time (in years)          1.26    1.29    33.6    35.8
Expected no. of failures during the 10th year                   0.31    0.29    0.03    0.02
Expected time to next failure at the end of 9th year (years)    3.42    3.67    38.82   45.86

The heterogeneity in the valves becomes quite apparent from these valve specific predictions, which are quite consistent with the observed data. As expected, the higher the posterior probability that a considered valve is "good", the longer is the predicted time to the next failure, and the fewer failures are expected to happen within a considered future time interval. A more detailed discussion on this data can be found in [3].

3. Illustration 2: Repeated Occurrence of Relapses

3.1. Data

In this example the data consist of times at which a recurrent problem (relapse) appeared in patients undergoing a certain treatment. Altogether 48 patients were followed from the beginning of the treatment, with the follow up time ranging from 44 days to 720 days. Even irrespective of the length of observation, the number of relapses experienced varied very much. For example, one patient had no relapse during the entire 720 day observation period, whereas another patient had 46 relapses within a time span of 530 days. Also the inter-event times for the same patient were sometimes irregular, with the maximum being several hundred days and the minimum only a few days.

3.2. Model and Analysis

Thus far, these data are similar to our previous example, where they exhibited considerable variation both within and across subjects/units. But here there is additionally an embedded trend element, which makes the situation different from the previous one. This is so, as the treatment has long term benefits and patients are expected to experience gradually less frequent relapses. However, the speed of improvement would certainly differ from patient to patient. The treatment is time consuming and is known to have side effects which may result in bad compliance and inconsistent adherence to the treatment. Another factor that contributes to the observed heterogeneity is that the treatments are initiated at different ages for different patients. Also, information on several other factors, like past history of durations and doses of similar treatments undergone by these patients, could help understand the patients' physio-psychological ability to respond to the treatment. Note that even though we have been able to identify several sources of heterogeneity, we still can do little or nothing to control them. Since the patients are not in a controlled environment, it is not possible to directly improve compliance. It is also not possible to restrict the analysis to any age group, as patients join the study/treatment at their wish and according to their own initiative. Complete information on their relevant medical history is also not available. We conclude that these shortcomings will be true for most similar studies and hence the observed outcomes will always be a result of the true randomness of the treatment, as well as heterogeneity present in the study population. Unfortunately we may never be able to identify analytically these two sources of heterogeneity, let alone being able to quantify them. We propose to model the data as a joint outcome of these two sources of randomness. We assume that there are k possible states that a patient can be in, at any given time. When a patient enters the study and starts to undergo treatment, then the patient is assumed to be in one of these k states. A current state may or may not change when an event occurs. For every patient these transitions occur independently of the other patients, depending solely on the patient's own physio-psychological state, adherence to the treatment, etc. We further assume that the k states are ordered, with state 1 being the best state any patient can be in, wherein the chance of a relapse is least or almost none. Continuing in this manner, state k is the worst, with the highest possibility of a relapse. We assume a simple Markov process for each patient with k = 4 states. The states described above are quantified as the states of a Markov process. Once the underlying Markov process for a patient is in a certain state, the patient experiences an inter-event time governed by a hazard specific to that state. This


way the states can directly be related to the event process, and being in a better or worse state means having a lower or higher hazard of relapse, respectively. The model may be briefly described as follows.

1) Inter-failure times: $(X_{i,j} \mid \beta_{C_{i,j}}) \sim \mathrm{Exp}(\beta_{C_{i,j}})$, where $i = 1, \ldots, 48$ and $j \ge 1$,

2) Latent variables: $(C_{i,1} \mid p_0) \sim \mathrm{Multinomial}(1, p_0)$ and $(C_{i,j} \mid p(C_{i,j-1})) \sim \mathrm{Multinomial}(1, p(C_{i,j-1}))$, where $i = 1, \ldots, 48$ and $j \ge 2$,

3) State specific hazards: $\beta_1 \sim \mathrm{Gamma}(0.1, 5)$ and $\beta_l = \beta_{l-1} + \eta_l$, where $\eta_l \sim \mathrm{Gamma}(0.5, 5)$, $l = 2, 3, 4$,

4) Transition probabilities: $p(l) \sim \mathrm{Dirichlet}(\mathbf{1})$, $l = 1, \ldots, 4$,

5) Initial distribution: $p_0 \sim \mathrm{Dirichlet}(\mathbf{1})$, where $\mathbf{1} = (1, 1, 1, 1)$.
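The following minimal sketch simulates a single patient's relapse history forward from this structure; the Dirichlet and Gamma draws follow the specification above, while the rate parameterization of the hazards, the day time unit and the 720-day follow-up are assumptions of the sketch only.

import numpy as np

rng = np.random.default_rng(3)
k, follow_up = 4, 720.0

# 5) initial distribution and 4) transition probabilities (rows of P)
p0 = rng.dirichlet(np.ones(k))
P = np.vstack([rng.dirichlet(np.ones(k)) for _ in range(k)])

# 3) ordered state-specific hazards: beta_1 < beta_2 < beta_3 < beta_4
beta = np.cumsum(np.concatenate([rng.gamma(0.1, 1.0 / 5.0, size=1),
                                 rng.gamma(0.5, 1.0 / 5.0, size=3)]))

def simulate_patient():
    """Relapse times for one patient over the follow-up period."""
    state = rng.choice(k, p=p0)                    # 2) latent state at entry
    t, relapses = 0.0, []
    while True:
        gap = rng.exponential(1.0 / beta[state])   # 1) inter-relapse time for this state
        if t + gap > follow_up:
            return relapses
        t += gap
        relapses.append(t)
        state = rng.choice(k, p=P[state])          # state may or may not change at an event

print("number of relapses:", len(simulate_patient()))
print("hazards by state (per day):", np.round(beta, 4))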

The fixed parameters in the model were again chosen so as to give rise to reasonably vague priors. Computations were carried out by drawing a large MCMC sample from the posterior distributions of the parameters. In the following we present a part of these results, namely the posterior expected values of the initial distribution and of the transition probability matrix.

Table 4.4. Results from the analysis of relapse data.

Estimated initial distribution:
State          1      2      3      4
Probability    0.04   0.11   0.38   0.47

Estimated transition probability matrix (rows: from state; columns: to state):
           1      2      3      4
1          0.27   0.25   0.24   0.24
2          0.29   0.66   0.03   0.02
3          0.01   0.12   0.85   0.02
4          0.01   0.05   0.11   0.83
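If one (hypothetically) treats the estimated matrix as the transition kernel of a time-homogeneous chain, its stationary distribution gives a rough summary of where patients would eventually concentrate; this is an illustration added here, not a quantity reported by the authors. The matrix values below are those reconstructed in Table 4.4.

import numpy as np

P = np.array([[0.27, 0.25, 0.24, 0.24],
              [0.29, 0.66, 0.03, 0.02],
              [0.01, 0.12, 0.85, 0.02],
              [0.01, 0.05, 0.11, 0.83]])

# The stationary distribution pi solves pi P = pi: take the left eigenvector of P
# (equivalently, the eigenvector of P.T) associated with eigenvalue 1 and normalize.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()
print("stationary distribution over states 1-4:", np.round(pi, 3))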

The estimated initial state distribution clearly depicts the heterogeneous nature of the subjects at the beginning of the treatment. We had noted from the data that on several occasions a patient who had experienced a long relapse-free period then had relapses that were not very far apart. A possible explanation for these somewhat unexpected events may be as follows. Being in a better state means possibly a few hundred days without relapse, which might lead to negligence of treatment continuation, resulting


in worsening of condition followed by relatively more frequent relapses. But unfortunately, as the study was not carried out in a controlled environment, we can only hypothesize such an effect of adherence/compliance on the outcome, and it cannot be verified from the present data set. Recall that even though patients are expected to improve over time, preliminary examination of the data had not shown any readily recognizable trend. Therefore, due to lack of any evident trend over time, we did not impose any monotonicity condition on the transition matrix. Note, however, from the estimated transition matrix that for any row, say l (l = 1, 2, 3, 4), which describes the probability of transition from state l to any of the four states, there is a definite trend of moving towards the better states. Also these probabilities show a trend of gradual decrease to the left, from nearer states to a state further away. Apart from the model described earlier, some other choices of constructing ordered states were also considered and the estimated transition matrices were found to be remarkably similar. Predictions for a generic individual showed clearly the transient nature of the process. Posterior predictive distributions of the residual time to the next relapse at pre-determined time points were estimated. Denoting by R_i such a residual at time y_i, we considered y_i = 50 × i, with i = 1, ..., 8. From the plots of simulated predictive distributions of the R_i's (Figure 4.1) it is apparent that the R_i's are stochastically ordered. In other words, the more time has elapsed from the beginning of the treatment, the longer, in the sense of stochastic ordering, it takes to the next relapse.

Figure 4.1. Empirical distributions of expected residual lives (R_i) at times y_i, where y_i = 50, 100, ..., 400 (horizontal axis: residual life in days).

4.

Concluding Remarks

For both of these data sets we had suspected heterogeneity amongst the units/subjects, which was then confirmed later by the estimates of the relevant parameters. We were also able to estimate the proportions of units in different categories. For the valve data set these proportions remained stable over time, but in the relapse data set they changed over time. It was possible to carry out population level as well as unit level predictions of future behavior. Additional information on unit quality or class evidently helped in predicting the future failure behavior. Heterogeneity of failure behavior of safety related components, such as valves in our case study, may have important implications for reliability analysis of safety systems. If such heterogeneity is not identified and taken into account in the statistical modelling and analysis, the decisions made to maintain or enhance safety can be far from optimal. This non-optimality is more serious if the safety related decisions are made purely on the basis of failure histories


of the components. This work demonstrates how even rather simplistic models can describe the heterogeneous behavior successfully. Another characteristic feature of failure data from safety systems is the small number of observed failures. Often many of the components are totally failure free, and failures are observed only in some components. This makes the use of more complex statistical models highly speculative. The simplistic models discussed here could give meaningful results even for such a sparse database. For the relapse data, we could successfully model analytically the individual event patterns, which we suspect to be governed not only by the treatment but also by individual habits/compliance. Further, we were able to combine individual-specific models to arrive at a population level model which gave us quite meaningful results, as is apparent from the above table of estimates. This also gave us some insight into the dynamics of the event process, even though we did not incorporate significant prior knowledge into the model and had started with vague priors. By considering these illustrations our aim was to establish that, given repeated observations from units/subjects, it is possible to model heterogeneous data meaningfully, not only being able to infer about the population characteristics but also at the individual level. This we achieved by introducing one or more layers of latent variables. These latent variables can certainly be different for


different individuals, and they can also vary over time, as was illustrated in the second example. Having done this, it is a natural question to ask in what sense latent variables/parameters appearing in hazard models and describing unobserved heterogeneity between individuals actually exist. In the context of reliability a natural goal would be that they could be thought of as adequate descriptions of a physical state of the considered unit. But since latent variables are, by definition, not observed, their role in the statistical modelling is only indirect, that is, we cannot link their values directly with data. Likewise, rather than thinking in terms of parameter identification or point estimation, we have to deal with the latent variables by considering their distributions. This is automatically done in Bayesian inference, but it is also the way to handle "random effects" if the frequentist paradigm is followed. For example, in ML-estimation such random effects variables are integrated out before maximizing the likelihood expression with respect to the common population or structural parameters. Similarly in Bayesian inference, in order to arrive at probabilistic statements concerning potentially observed variables, the latent variables have to be integrated out with respect to the corresponding distributions. Although we must give up the idea that latent variables in statistical models would be comparable to physical state variables, it may nevertheless be useful, in a given concrete reliability context, to try to think of some such properties if they seem relevant. For example, physical wear and fatigue are monotone processes in time which in different units will generally advance at different speeds, and when they progress, the corresponding units will become more failure prone. With this in mind, we look for model structures that are capable of describing such systematic patterns in behavior, and use the framework of hierarchical models and latent random variables as a learning device by which such patterns can be picked up from data. The result is that there do not seem to be firm general principles, let alone a canonical theory, according to which such latent variables could be defined and introduced into the models. What should be done will depend on the problem in hand. One may therefore wonder whether there is even any sensible general criterion according to which different candidates for latent variable models could be compared and assessed. To answer this question we only need to think about the original motivation why such latent variables are introduced into statistical models. In the context of reliability problems the answer seems obvious: the individual latent variables are introduced because they give us a way to improve predictions, also on the level of individual units. In the case of our first illustration above it turned out to be useful to group the valves into latent "good" and "bad" categories. If this had not been done, the prediction of future failures of a generic valve would have corresponded to averaging over these two categories, without a possibility


of learning from the observed follow-up data of that particular valve whether it was more likely to be a "good" or a "bad" one. As a consequence, the predictions would have been systematically biased, either up or down, depending on the case. Likewise, in the second illustration, if it had not been initially thought that there could be a general trend towards longer inter-failure times, we might not have devised a model with ordered categories of hazard rates, which would be capable of accounting for such a trend. Note that the accuracy of the prediction usually improves as the follow-up time of the considered unit increases, because then the model has had a better opportunity to learn about the latent variables that are specific to the considered unit. This is also why we have here considered situations in which there is a possibility of the same unit undergoing several episodes of survival, each ending with a failure and a subsequent repair. Latent variable models give us a possibility of systematically "transferring" information from the earlier episodes of use of the same unit into later ones. In contrast, if there is a single duration for each unit and if one introduces unit specific latent variables into the model, one may naturally in each case speculate about its value, but there won't be a real opportunity of using such information.

Acknowledgments We are thankful to Prof. Urho Pulkkinen, for providing us with the nuclear power plant valve failure data and for his help in analyzing it. Thanks are also due to Prof. Juhani Tuominen for helping us obtain the second data set.

References
[1] Arjas, E. and Bhattacharjee, M. (2001). Modelling heterogeneity/selection and causal influence in reliability. IISA-JSM 2000-2001 Conference, New Delhi, India.
[2] Arjas, E. (2002). Predictive inference and discontinuities. Nonparametric Statistics, 14: 31-42.
[3] Bhattacharjee, M., Arjas, E. and Pulkkinen, U. (2002). Modelling heterogeneity in nuclear power plant valve failure data. Third International Conference on Mathematical Methods in Reliability: Methodology and Practice, June 17-20, Trondheim, Norway.
[4] Block, H. W. and Savits, T. H. (1997). Burn-in. Statistical Science, 12: 1-13.
[5] Chen, J. and Singpurwalla, N. D. (1996). The notion of "Composite Reliability" and its hierarchical Bayes estimation. J. of Amer. Stat. Assoc., 91: 1474-1484.


[6] Cox, D. R. (1972). Regression models and life tables (with discussion). J. R. Statist. Soc. B, 34: 187-220.
[7] Heckman, J. J. and Singer, B. (1984). Econometric duration analysis. J. Econometrics, 24: 63-132.
[8] Hougaard, P. (2000). Analysis of Multivariate Survival Data. Springer.
[9] Simola, K. and Laakso, K. (1992). Analysis of Failure and Maintenance Experiences of Motor Operated Valves in a Finnish Nuclear Power Plant. VTT Research Notes 1322.
[10] Lynn, N. J. and Singpurwalla, N. D. (1997). Comment: "Burn-in" makes us feel good. Statistical Science, 12: 13-19.
[11] Spizzichino, F. (2001). Subjective Probability Models for Lifetimes. Chapman and Hall/CRC.

Chapter 5
ON SPATIAL NEUTRAL TO THE RIGHT PROCESSES AND THEIR POSTERIOR DISTRIBUTIONS

Kjell A. Doksum
Department of Statistics, University of Wisconsin, Madison, Madison, Wisconsin 53706, USA
[email protected]

Lancelot F. James
Department of Information Systems and Management, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
[email protected]

Abstract

In this article we consider classes of nonparametric priors on spaces of distribution functions and cumulative hazard measures that are based on extensions of the neutral to the right (NTR) concept. In particular, recent results on spatial neutral to the right processes that extend the NTR concept from priors on the class of distributions on the real line to classes of distributions on general spaces are discussed. Representations of the posterior distribution of the spatial NTR processes are discussed. A new type of calculus is provided which offers a streamlined proof of these results for NTR models and its spatial extension. As an example of an application of the NTR model we describe how one may approximate the posterior distribution of Barlow's total time on test transform.

Keywords:

Bayesian nonparametrics, inhomogeneous Poisson process, Levy measure, neutral to the right processes, reliability, survival analysis.

1. Introduction

In the late sixties, nonparametrics was a flourishing field. David Blackwell observed all this activity and asked how one would do nonparametric Bayesian analysis. Responding to this, Ferguson [16, 17] developed a Bayesian nonparametric analysis based on the Dirichlet prior. For this prior, if P is a probability on X, and B_1, ..., B_k is a measurable partition of X, then (P(B_1), ..., P(B_k)) has a Dirichlet distribution. Moreover, the posterior distribution of P given a sample {X_1, ..., X_n} is also a Dirichlet distribution. Also responding to David Blackwell, Doksum [10, 12] considered a nonparametric Bayesian analysis based on the neutral to the right priors. For these priors, if P is a distribution on the real line, then for each partition B_1, ..., B_k with B_j = (s_{j-1}, s_j], j = 1, ..., k, s_0 = -∞, s_k = ∞, s_i < s_j for i < j, (P(B_1), ..., P(B_k)) is such that each P(B_i) has the same distribution as V_i ∏_{j=1}^{i-1} (1 - V_j), where V_1, V_2, ... is a collection of independent non-negative random variables. Each choice of distributions for the V_i gives a new NTR prior. If V_i is chosen to be a beta variable with parameters (α_i, β_i) and β_i = Σ_{j=i+1}^{k} α_j, then this gives the Dirichlet process as described in Doksum [11, 12]. The NTR concept allows for flexibility in the choice of prior. Doksum [12] shows that if P is an NTR distribution, then the posterior distribution of P given a sample X_1, ..., X_n is also NTR. It has been difficult to get explicit expressions for the NTR posterior distributions and expected values (e.g. Doksum [12], Example 4.1). However, recently James [22] develops a different calculus for these models which allows for more explicit representations. In addition, with new computer technology and computational algorithms it has become possible to generate samples from the posterior distributions, which in turn makes it possible to generate posterior distributions of interesting functionals, e.g. functionals based on Barlow's total time on test transform (Barlow, Bartholomew, Bremner and Brunk [3], Chapters 5 and 6, Barlow and Campo [4] and Barlow and Doksum [2]). The notions of the Dirichlet and NTR priors are related to work of Freedman [19], Dubins and Freedman [13] and Fabius [15]. For a detailed history with detailed comparisons of the various notions, see Doksum [11, 12].
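A hedged sketch of this construction: on a fixed grid of partition points one can draw a random distribution function from an NTR prior simply by drawing the independent V_i and accumulating the products above. The Beta parameters used below are arbitrary illustrative choices; with the particular choice described above they would yield a Dirichlet process.

import numpy as np

rng = np.random.default_rng(4)

# Partition points s_1 < ... < s_{k-1} (with s_0 = -inf, s_k = +inf) and one
# independent V_i per interval; here V_i ~ Beta(a_i, b_i) with illustrative parameters.
s = np.linspace(0.5, 10.0, 20)
a = np.full(s.shape, 1.0)
b = np.full(s.shape, 3.0)

V = rng.beta(a, b)
# P((s_{i-1}, s_i]) = V_i * prod_{j<i} (1 - V_j), so the random c.d.f. at s_i is
# F(s_i) = 1 - prod_{j<=i} (1 - V_j).
F = 1.0 - np.cumprod(1.0 - V)
print(np.round(F, 3))   # one draw of a random distribution function from the NTR prior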

2. NTR, Levy and Other Processes

Doksum ([12], Theorem 3.1) gives an alternate characterization of NTR processes via positive Levy processes, Z, on R+ as follows:

\[ 1 - F(t) = S(t) = e^{-Z(t)}, \qquad (5.1) \]

where S denotes the survival distribution associated with F, and Z is an increasing independent-increment process satisfying Z(0) = 0 and lim_{t→∞} Z(t) = ∞, a.s. In other words, an NTR survival process is essentially synonymous with the


idea of an exponential functional of a subordinator Z. Such objects have been recently studied by probabilists with applications, for instance, to finance. See in particular Bertoin and Yor [5] and Carmona, Petit and Yor [7]. Noting some of these connections, Epifani, Lijoi and Pruenster [14] apply techniques from those manuscripts to obtain expressions for the moments of mean functionals of NTR models. NTR models also arise in coalescent theory as seen for example in Pitman ([30], Proposition 26). The analysis here will mainly consider processes Z without a drift component. In that case, Z may be represented as

\[ Z(ds) = \int_0^\infty u\, N(du, ds), \qquad (5.2) \]

where N(du, ds) is a Poisson random measure on (0, ∞) × (0, ∞), with mean measure E[N(du, ds)] = τ(du|s)η(ds) chosen such that E[S(t)] = S_0(t), where S_0 = 1 - F_0 represents a prior specification of the survival distribution. The Dirichlet process with shape parameter θF_0 is specified by choosing

\[ \tau(du|s)\,\eta(ds) = \frac{1}{1 - e^{-u}}\, e^{-u\,\theta F_0([s,\infty))}\, du\, \theta F_0(ds). \qquad (5.3) \]

While, as we shall see, the characterization of NTR in terms of Z is useful from the point of view of calculations, it is difficult to interpret statistically. In particular, since Z and F are discrete, Z cannot be interpreted as the cumulative hazard of F, say A. Instead of modelling F directly, Hjort [20] suggested to model the associated cumulative hazard, A, by a Levy process with jumps restricted to (0,1]. That is, A may be represented as

\[ A(ds) = \int_0^1 u\, N(du, ds), \qquad (5.4) \]

where N(du, ds) is a Poisson random measure on (0, 1] × (0, ∞), with mean measure

\[ E[N(du, ds)] = \rho(du|s)\,\eta(ds). \qquad (5.5) \]

Note that the Poisson random measures N associated with Z and A are different. The measures ρ and η are chosen such that

\[ E[A(ds)] = \int_0^1 u\, \rho(du|s)\,\eta(ds) = A_0(ds), \]


where

\[ A_0(ds) = \frac{F_0(ds)}{S_0(s-)} \]

represents a prior specification for the cumulative hazard. The choice of

\[ \rho(du|s)\,\eta(ds) = u^{-1}(1 - u)^{c(s)-1}\, du\, c(s)\, A_0(ds) \]

for c(s) a positive decreasing function corresponds to a beta process with parameters c and A_0. Using this framework, a neutral to the right process can be characterised in terms of the product integral representation of a survival function as follows:

\[ 1 - F(t) = S(t) = \prod_{u \le t} \bigl(1 - A(du)\bigr). \qquad (5.6) \]
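Before turning to (5.7), it is worth noting the elementary identity behind the link between (5.1) and (5.6); for a purely discrete A with jumps u_j at times s_j (a sketch of the discrete case only),

\[ \prod_{s_j \le t} \bigl(1 - u_j\bigr) = \exp\Bigl(-\sum_{s_j \le t} \bigl[-\log(1 - u_j)\bigr]\Bigr), \]

so taking Z to have jumps -log(1 - u_j) at the same locations turns the product integral representation (5.6) into the exponential representation (5.1); this is exactly the correspondence made precise below.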

Importantly, using the interpretation of a cumulative hazard, it is much more natural to write

\[ F(dt) = S(t-)\, A(dt). \qquad (5.7) \]

Hjort [20] shows that when A is a beta process with c(s) = θF_0([s, ∞)) = θS_0(s-), then the F defined by (5.6) is a Dirichlet process with shape θF_0. The case of more general beta processes generates a class of generalized Dirichlet processes, see Hjort ([20], section 7A), which is essentially equivalent in distribution to the beta-neutral processes in Lo [26] and the beta-Stacy process discussed in Walker and Muliere [32]. The expressions (5.1) and (5.6) suggest an important functional 1-1 relationship between a particular Z and A. Dey [8] and Dey et al. [9] establish that the Levy measure τ(du|s)η(ds) for Z is the image of the Levy measure ρ(du|s)η(ds) of A, via the map (u, s) to (-log(1 - u), s). This type of correspondence is actually noted, albeit less explicitly, in Hjort [20]. An important consequence of this result, noted and exploited by James [22], is the fact that one can write

\[ Z(ds) = \int_0^1 [-\log(1 - u)]\, N(du, ds), \]

where N is the Poisson random measure associated with A with mean measure specified by (5.5). Hence, A and Z are related in distribution via a common Poisson random measure. Specifically,

\[ S(t-) = e^{-Z(t-)} = \exp\Bigl(-\int_0^\infty \int_0^1 I\{s < t\}\, [-\log(1 - u)]\, N(du, ds)\Bigr). \]

(1 - u)^{I\{T > s\}}\, ρ(du|s)\, η(ds, dx), and conditionally independent of N, J is a random variable with distribution

\[ \Pr\{J \in du \mid T\} \propto u\, \rho(du \mid T). \qquad (5.18) \]

Proof: The task is to identify the posterior distribution, denoted as π(dN | T, X). This is equivalent to identifying the conditional Laplace functional of N given {T = t, X = x}, say L_N(h | t, x), which must satisfy

\[ \int g(t, x)\, e^{-N(h)}\, S(t-)\, A(dt, dx)\, P(dN \mid \rho, \eta) = \int g(t, x)\, L_N(h \mid t, x)\, E\bigl[S(t-)\, A(dt, dx)\bigr] \qquad (5.19) \]

for some arbitrary integrable function g and bounded positive measurable h. By Proposition 1,

\[ S(t-)\, P(dN \mid \rho, \eta) = P(dN \mid e^{-f_{t-}}\rho, \eta)\, E\bigl[S(t-) \mid \rho\bigr], \]


with f_{t-}(u, s, x) = -I{s < t} log(1 - u). Note that this operation identifies P(dN | e^{-f_{t-}}ρ, η) as the posterior distribution of N | {T ≥ t}. Another application of Proposition 1 gives

\[ e^{-N(h)}\, P(dN \mid e^{-f_{t-}}\rho, \eta) := P(dN \mid e^{-h - f_{t-}}\rho, \eta)\, L_N(h \mid e^{-f_{t-}}\rho, \eta). \]

The expectation of A(dt, dx) with respect to the law P(dN | e^{-h-f_{t-}}ρ, η) is given by

\[ \Bigl[\int_0^1 e^{-h(u,t,x)}\, (1 - u)^{I\{t > t\}}\, u\, \rho(du|t)\Bigr]\, \eta(dt, dx). \qquad (5.20) \]

Recalling that E[A(dt, dx) | ρ] = [∫_0^1 u ρ(du|t)] η(dt, dx), the statements above imply that the expressions in (5.19) are equal to

\[ \int g(t, x)\, L_N(h \mid e^{-f_{t-}}\rho, \eta)\, \Bigl[\int_0^1 e^{-h(u,t,x)}\, \Pr\{J \in du \mid t\}\Bigr] \times E\bigl[S(t-) \mid \rho\bigr]\, E\bigl[A(dt, dx) \mid \rho\bigr], \]

which implies the desired result,

\[ L_N(h \mid t, x) = L_N(h \mid e^{-f_{t-}}\rho, \eta)\, \Bigl[\int_0^1 e^{-h(u,t,x)}\, \Pr\{J \in du \mid t\}\Bigr], \]

almost everywhere.

Remark 3.4. Note that the representation in (5.20) shows clearly why, in the homogeneous case, as remarked prior to Theorem 5 in Ferguson and Phadia [18], the jump J is independent of the jump time T. In general, we see that the distribution of J can be described as

\[ \Pr\{J \in du \mid T\} \propto u\, (1 - u)^{I\{T > T\}}\, \rho(du \mid T), \]

where of course I{T > T} = 0.

It is obvious from the representation of A combined with Theorem 1 that the posterior distribution of A is equivalent to the law of the random measure

A + J δ_{T,X}, where A is a random hazard measure with Levy measure (1 - u)^{I{T > s}} ρ(du|s) η(ds, dx) and J has the distribution described in (5.18). Similarly it follows that the posterior distribution of F is still a spatial neutral to the right process. The posterior distribution given n observations is described in James [22], and follows from an inductive argument or directly from the methods discussed in


James [22]. In general one can easily obtain the posterior distribution relative to a model which takes the form

\[ e^{-N(f)}\, \prod_{i=1}^{n} A(dt_i, dx_i). \qquad (5.21) \]

Note for instance,

\[ \prod_{s} \bigl(1 - A(ds)\bigr)^{Y(s)} = e^{-N(f)}, \]

with f(u, s, x) = -Y(s) log(1 - u) for Y(s) some predictable function. The complete data case follows when

\[ Y(s) := Y_n(s) = \sum_{i=1}^{n} I\{T_i > s\}. \]

If there are n + m possible observations for T and m are right censored by independent censoring times C_1, ..., C_m (in this case the corresponding X are not observed), then the choice

\[ Y(s) = \sum_{i=1}^{m} I\{C_i \ge s\} + Y_n(s) \]

corresponds to a univariate right censoring model as seen for instance in the medical cost example described earlier. In general assume that

\[ Y(s) = Y_c(s) + Y_n(s), \]

where Y_c(s) represents some predictable function indicating right censoring, left truncation or filtering. From James [21, 22], the posterior distribution of F and A under a model (5.21) with f(u, s, x) = -Y(s) log(1 - u) is described by the fact that the posterior distribution of A is equivalent to the law of the random measure

\[ A^*_{n(p)}(ds, dx) = A(ds, dx) + \sum_{j=1}^{n(p)} J_{j,n}\, \delta_{T_j^*, X_j^*}(ds, dx), \qquad (5.22) \]

where {T_j^*, j = 1, ..., n(p)} denotes the unique observed values of T and the X_j^* are the respective concomitants, A is a random hazard measure with intensity (1 - u)^{Y(s)} ρ(du|s) η(ds, dx), and the {J_{j,n}} are conditionally independent with distributions


where e_{j,n} is equal to the cardinality of the set {i : T_i = T_j^*} of observed values of T. Thus in the case of a beta process with parameters c(s) and A_0(ds, dx), the posterior distribution is again a beta process, where the A given in (5.22) is a beta process with intensity

\[ u^{-1}(1 - u)^{c(s) + Y(s) - 1}\, du\, c(s)\, A_0(ds, dx), \]

and with {J_{j,n}} independent beta random variables with density proportional to

\[ u^{e_{j,n} - 1}\, (1 - u)^{Y(T_j^*) + c(T_j^*)}\, du. \]

When, in the complete data setting, Y(s) := Y_n(s) and c(s) := θS_0(s-), the corresponding posterior spatial NTR process is a Dirichlet process with shape θF_0(ds, dx) + Σ_{i=1}^{n} δ_{T_i, X_i}. More generally, when c(s) = α([s, ∞)) + β([s, ∞)), the posterior distribution is a Beta-Neutral process, F_{α*, β*}, where

\[ \alpha^*(ds, dx) = \alpha(ds, dx) + \sum_{i=1}^{n} \delta_{T_i, X_i}(ds, dx) \]

and β*([s, ∞)) := β([s, ∞)) + Y_c(s). Letting A depend on i as A_i(dt_i, dx_i) = ∫_0^1 h_i(u, θ) N(du, dt_i, dx_i) allows for an extension to semiparametric models. Further details are given in James [21].

4. Total Time on Test Transform

In 1970, before he developed his Bayesian interests, Richard Barlow proposed the total time on test (TTT) transform for model diagnostics and classical statistical inference. Now technology has advanced to the point that it is possible to use the TTT transform in the Bayesian context. Let X denote a positive random variable with distribution F and finite mean E[X]. The total time on test (TTT) transform developed by Barlow et al. [2, 3, 4] takes the form

\[ T_F(u) = \int_0^{F^{-1}(u)} \bigl(1 - F(t)\bigr)\, dt, \]

with T_F(1) = E[X]. The TTT transform is concave if and only if F is an increasing failure rate (IFR) distribution. Thus the IFR property can be checked without having to estimate a density. Moreover, because t_F(u) := T_F(u)/T_F(1) equals u when F is exponential, any distance measure between t_F(u) and u will measure closeness of F to the exponential distribution. A Bayesian version of this model is induced by choosing F to be an NTR. Bayesian inference can be conducted by evaluating the posterior distribution of T_F given


the X_1, ..., X_n. The simplest case is when F is chosen to be a Dirichlet process with shape α = θF_0. Although the posterior distribution of F is equivalent to the distribution of the Dirichlet process with shape θF_0 + Σ_{i=1}^{n} δ_{X_i}, the posterior distribution of T_F appears to be somewhat complex. Lo [28] obtained an explicit expression for the posterior mean of T_F with respect to the posterior Dirichlet distribution. However, it is interesting to investigate the posterior distribution of T_F, and t_F, when we allow θ → 0. In this case the deflated posterior distribution of F corresponds to the distribution of a Bayesian Bootstrap (BB) (see Rubin [31], Lo [27]). Let Z_1, ..., Z_n denote independent exponential(1) random variables with

\[ S_n = \sum_{i=1}^{n} Z_i. \]

The BB empirical distribution function takes the form

\[ F_{BB}(t) = \sum_{i=1}^{n} \frac{Z_i}{S_n}\, I\{X_i \le t\}, \]

and the BB TTT transform is given by

\[ T_{F_{BB}}(u) = \int_0^{F_{BB}^{-1}(u)} \bigl(1 - F_{BB}(t)\bigr)\, dt. \]

It follows that T_{F_{BB}} takes the form

\[ T_{F_{BB}}\Bigl(\sum_{i=1}^{l} Z_i/S_n\Bigr) = \sum_{i=1}^{l} \frac{Z_i + \cdots + Z_n}{S_n}\, \bigl(X_{(i)} - X_{(i-1)}\bigr) \quad \text{for } l = 1, \ldots, n - 1, \]

where X_{(1)}, ..., X_{(n)} are the order statistics and X_{(0)} = 0. Note also that

\[ T_{F_{BB}}(1) = \sum_{i=1}^{n} \frac{Z_i X_i}{S_n}, \]

which is the BB mean functional. In addition, when F_n is the empirical distribution function,

\[ T_{F_n}(l/n) = E\Bigl[T_{F_{BB}}\Bigl(\sum_{i=1}^{l} Z_i/S_n\Bigr) \,\Big|\, X\Bigr] = \frac{1}{n} \sum_{j=1}^{l} (n - j + 1)\, \bigl(X_{(j)} - X_{(j-1)}\bigr), \]

which is equivalent to the empirical version of the TTT transform used by Barlow et al. [2, 3, 4]. A test statistic used by Barlow and Doksum [2] to measure the distance from F to a hypothesized distribution G is of the form (5.23).


It follows that one may use the distribution of (5.24) to approximate the distribution of (5.23). Additionally, the distribution of (5.24) should serve as a good approximation to the posterior distribution of the quantity

\[ \sqrt{n} \int_0^1 \Bigl[\frac{T_F(u)}{T_F(1)} - \frac{T_G(u)}{T_G(1)}\Bigr]\, du, \]

when F is specified to be a Dirichlet process.
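As a small illustration of these quantities, the following sketch computes the scaled empirical TTT transform and one Bayesian-bootstrap replicate of it from a simulated exponential sample; the simulated data and the simple grid-based comparison are assumptions of the sketch, not part of the chapter.

import numpy as np

rng = np.random.default_rng(5)
n = 50
x = np.sort(rng.exponential(1.0, size=n))     # simulated sample (exponential, so t_F(u) is close to u)
gaps = np.diff(np.concatenate([[0.0], x]))    # X_(j) - X_(j-1)

# Empirical TTT: T_Fn(l/n) = (1/n) * sum_{j<=l} (n - j + 1) (X_(j) - X_(j-1))
T_emp = np.cumsum((n - np.arange(1, n + 1) + 1) * gaps) / n
t_emp = T_emp / T_emp[-1]                     # scaled transform t_F(l/n)

# Bayesian-bootstrap replicate: exponential(1) weights Z_i attached to X_(i)
z = rng.exponential(1.0, size=n)
sn = z.sum()
tail = np.cumsum(z[::-1])[::-1]               # Z_i + ... + Z_n
T_bb = np.cumsum(tail / sn * gaps)
t_bb = T_bb / T_bb[-1]

u_emp = np.arange(1, n + 1) / n
u_bb = np.cumsum(z) / sn                      # BB grid points sum_{i<=l} Z_i / S_n
print("max |t_emp(l/n) - l/n|:", np.max(np.abs(t_emp - u_emp)))
print("one BB draw of the scaled transform at its last few grid points:")
print(np.round(u_bb[-3:], 3), np.round(t_bb[-3:], 3))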

Acknowledgments This work was supported in part by NSF grant DMS-9971301.

References
[1] Bang, H. and Tsiatis, A. A. (2000). Estimating medical costs with censored data. Biometrika, 87: 329-343.
[2] Barlow, R. E. and Doksum, K. A. (1972). Isotonic tests for convex orderings. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. I. Univ. of Calif. Press, Berkeley, CA, pp. 293-323.
[3] Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference under Order Restrictions. John Wiley and Sons, Inc.
[4] Barlow, R. E. and Campo, R. (1975). Total time on test processes and applications to failure time data analysis. In Reliability and Fault Tree Analysis, Barlow, R. E., Fussell, J. B. and Singpurwalla, N. D., eds. SIAM, pp. 451-481.
[5] Bertoin, J. and Yor, M. (2001). On subordinators, self-similar Markov processes and some factorizations of the exponential variable. Electronic Communications in Probability, 6: 95-106.
[6] Bickel, P. J. and Doksum, K. A. (2001). Mathematical Statistics: Basic Ideas and Selected Topics. Vol. I, 2nd Edition. Prentice Hall, Princeton, New Jersey.
[7] Carmona, P., Petit, F. and Yor, M. (1997). On the distribution and asymptotic results for exponential functionals of Levy processes. In Exponential Functionals and Principal Values Related to Brownian Motion, M. Yor, ed., pp. 73-121. Biblioteca de la Revista Matematica Iberoamericana.
[8] Dey, J. (1999). Some Properties and Characterizations of Neutral-to-the-Right Priors and Beta Processes. Ph.D. Thesis, Michigan State University.


[9] Dey, J, Erickson, R.V. and Ramamoorthi, R.Y. (2000). Neutral to right priors- A review. Preprint. [10] Doksum, K. A (1971). Tailfree and neutral processes and their posterior distributions. aRC report 71-72, Univ. Of Calif., Berkeley. [11] Doksum, K. A (1972). Decision theory for some nonparametric models. In Proc. Sixth Berkeley Symp. Math. Statist. Prob. 1. Univ. of California, Berkeley, pp. 331-343. [12] Doksum, K. A (1974). Tailfree and neutral random probabilities and their posterior distributions. Ann. Probab., 2: 183-201. [13] Dubins, L. E. and Freedman, D. A (1966). Random distribution functions. In Proc. Fifth Berkeley Symp. Math. Statist. Prob. 2. Univ. of California, Berkeley, pp. 183-214. [14] Epifani, 1., Lijoi, A, and Pruenster, 1. (2002). Exponential functionals and means of neutral to the right priors. Technical report Universita di Pavia no. 144. [15] Fabius, J. (1964). Asymptotic behavior of Bayes estimates. Ann. Math. Statist., 35: 846-856. [16] Ferguson, T. S. (1970). A Bayesian analysis of some nonparametric problems. Technical report, UCLA [17] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems.Ann. Statist., 1: 209-230. [18] Ferguson, T. S. and Phadia, E. (1979). Bayesian nonparametric estimation based on censored data. Ann. Statist., 7: 163-186. [19] Freedman, D. A (1963). On the asymptotic behavior of Bayes estimates in the discrete case Ann. Math. Statist., 34: 1386-1403. [20] Hjort, N. L. (1990). Nonparametric Bayes estimators based on Beta processes in models for life history data. Ann. Statist., 18: 1259-1294. [21] James, L. F. (2003). Calculus for spatial neutral to the right processes. Manuscript. [22] James, L. F. (2002). Poisson process partition calculus with applications to exchangeable models and Bayesian nonparametrics. Available at http://arxiv.org/abs/math.PRl0205093. [23] James, L. F. and Kwon, S. (2000). A Bayesian nonparametric approach for the joint distribution of survival time and mark variables under univariate censoring. Technical report, Department of Mathematical Sciences, Johns Hopkins University. [24] Kim, y. (1999). Nonparametric Bayesian estimators for counting processes. Ann. Statist., 27: 562-588.


[25] Last, G. and Brandt, A. (1995). Marked Point Processes on the Real Line: The Dynamic Approach. Springer-Verlag, New York. [26] Lo, A. Y. (1993). A Bayesian bootstrap for censored data. Ann. Statist., 21: 100-123. [27] Lo, A. Y. (1987). A large sample study of the Bayesian bootstrap. Ann. Statist., 15: 360-375. [28] Lo, A. Y. (1978). Bayesian nonparametric density methods. Technical Report, Department of Statistics and Operations Reserach Center, Univ. of Calif., Berkeley. [29] Lo, A. Y. and Weng, C. S. (1989). On a class of Bayesian nonparametric estimates: II. Hazard rates estimates. Ann. Inst. Stat. Math., 41: 227-245. [30] Pitman, J. (1999). Coalescents with multiple collisions. Ann. Probab., 27: 1870-1902. [31] Rubin, D. B. (1981). The Bayesian bootstrap. Ann. Statist., 9: 130-134. [32] Walker, S. and Muliere, P. (1997). Beta-Stacy processes and a generalization of the P6lya-urn scheme. Ann. Statist., 25: 1762-1780.

Chapter 6 MODELS FOR RECURRENT EVENTS IN RELIABILITY AND SURVIVAL ANALYSIS Edsel A. Pefia* University of South Carolina Columbia, SC 29208 USA [email protected]

Myles Hollandert Florida State University Tallahassee, FL 32306 USA [email protected]

Abstract

Existing models for recurrent phenomena occurring in public health, biomedicine, reliability, engineering, economics, and sociology are reviewed. A new and general class of models for recurrent events is proposed. This class simultaneously takes into account intervention effects, effects of accumulating event occurrences, and effects of concomitant variables. It subsumes as special cases existing models for recurrent phenomena. The statistical identifiability issue for the proposed class of models is addressed.

Keywords:

Counting process, model identifiability, renewal process, repair models.

1.

Introduction

In many studies in public health, biomedicine, reliability, engineering, economics, and sociology, the event of primary interest is recurrent and thus could

·E. Pena is Professor in the Department of Statistics, University of South Carolina, Columbia, SC 29208. He acknowledges NIHINIGMS Grant I ROI 56182 which partially supports his research. tM. Hollander is Robert O. Lawton Distinguished Professor and Chairman of the Department of Statistics, Florida State University, Tallahassee, FL 32306. He acknowledges NIHINIGMS Grant 1 ROI GM56183 which partially supports his research.


occur several times during the study period for a study unit or subject. Examples of recurrent events in public health are drug or alcohol abuse of adolescents, outbreak of diseases (e.g., encephalitis), and repeated hospitalization of end stage renal disease patients. In the medical area the recurrent event may be the occurrence of a tumor (cf., Byar [10]; Gail, Santner and Brown [17]; Klein, Keiding and Kamby [28]; Wei, Lin and Weissfeld [45]), headaches (Leviton, Schulman, Kammerman, Porter, Slack and Graham [32], cyclic movements in the small bowel during fasting state (Aalen and Husebye [2]), depression, seizures of epileptic patients, nausea in patients taking drugs for the dissolution of cholesterol gallstones, and angina pectoris in patients with coronary disease (cf., Lawless [30]; Thall and Lachin [43]). In the reliability and engineering settings, the breakdown of electro-mechanical systems (e.g., motor vehicles, subsystems in space stations, computers), encountering a software bug in software development, and nuclear power plant meltdowns are examples of recurrent events; while in the economic setting they could be the advent of economic recession, stock market crashes, and labor strikes. By virtue of the time-sequential fashion in which recurrent events occur, there is an added aspect to these studies which hitherto has not been considered in existing models and which is usually not present in studies where only one endpoint event per subject is observed. This aspect is the ability to perform interventions on the subject upon event occurrence. For example, when a subject abuses alcohol, intervention in the form of psychological methods (e.g., confinement or enforced hospitalization; correction of faulty home environment), physiological methods (e.g., conditional reflex therapy; elevation of blood sugar level; convulsive therapy; serotherapy and hemotherapy), or through family-based or institutional-based methods (e.g., closer supervision by family members; Alcoholics Anonymous) is performed. Similarly, a heart attack patient would for instance be advised to alter existing lifestyle (e.g., eating habits; reduce stress level); while the reoccurrence of a tumor might lead to its removal and some prophylactic treatment (e.g., continuation of retinoid prophylaxis (cf., Byar [10]). In reliability and engineering, the breakdown of the system will entail repair or replacement of the failed subsystem or component, and the replacement will usually be an improved version of the old subsystem or component. In an economic setting, when the stock market crashes or a recession occurs, the government intervenes by instituting new guidelines. The primary purpose of such interventions is to prolong, if not eliminate, the reoccurrence of the event. Hence such interventions can be viewed as improving the subject or unit. Improvement is usually possible since one could use the information that has accumulated on or before the event occurrence and the knowledge that has been discovered or acquired between event occurrences to assist in the formulation of new intervention strategies. It should be realized, however, especially in the medical and public health settings which usually deal


with human subjects, that though the intervention may bring about improvements, other factors such as the weakening effect on the subject of accumulating event occurrences and the adverse effects of aging and other time-dependent covariates may outweigh the intervention improvement. Thus, when all these effects are considered, the time to the next occurrence of the event may still be smaller, in a stochastic sense, relative to the preceding interoccurrence time. It is therefore imperative that any modeling scheme should attempt to take into consideration the effects of the interventions simultaneously with the effects of accumulating event occurrences and relevant concomitant variables.

2.

Mathematical Framework

Let us formalize the description of mathematical models for recurrent phenomena by considering an experimental unit (for example, a patient in a clinical trial) experiencing successive occurrences of a recurrent event. Let X = (X_1, \ldots, X_q)^t, where t denotes vector/matrix transpose, be a vector of covariates (e.g., age, race, sex) for this unit, which may be time-dependent. Denote by T_1, T_2, T_3, \ldots the successive interoccurrence times of the event, and by S_1, S_2, S_3, \ldots the successive calendar times of event occurrences, so

S_0 \equiv 0, \quad S_1 = T_1, \quad S_2 = T_1 + T_2, \quad S_3 = T_1 + T_2 + T_3, \ldots.   (6.1)

Let F = \{F_s : s \ge 0\} represent an increasing, right-continuous collection of \sigma-fields for this unit, that is, a filtration (cf., Fleming and Harrington [16]; Andersen, Borgan, Gill and Keiding [4]), so in particular F_s contains information about the number of times that the recurrent event has occurred in the time interval [0, s], the covariate information, and information concerning the types of interventions performed upon event occurrences. A probabilistic model for the successive occurrences of the recurrent event is a specification of the collection of joint distribution functions of \{S_1, S_2, S_3, \ldots\}. Because of the dynamic nature or time-sequential feature of the setting and as a consequence of the interventions that are performed, such specifications are facilitated by restating the model in terms of hazard functions or failure intensities. Let \{N(s) : s \ge 0\} be the process which represents the number of occurrences of the recurrent event during the period [0, s], and \{Y(s) : s \ge 0\} denote the risk process, so Y(s) = 1 if the unit is still under observation at time s, and Y(s) = 0 if the unit is not under observation at time s. Let \{A(s) : s \ge 0\} be a predictable nondecreasing process such that the process \{M(s) : s \ge 0\}, where

M(s) = N(s) - A(s), \quad s \ge 0,

is a square-integrable local martingale. We assume that A(s) = \int_0^s dA(w), where \{A(s) : s \ge 0\} is a predictable nondecreasing process satisfying

dA(w) = Y(w)\,\alpha(w)\, dw,   (6.2)


where \{\alpha(w) : w \ge 0\} is a predictable nonnegative process (see Aalen [1]). It has the intuitive and practical interpretation that, for h > 0 and sufficiently small, the quantity Y(s)\alpha(s)h represents the approximate conditional probability, given F_{s-}, of the recurrent event occurring in the time interval [s, s + h). The probabilistic model for the recurrent phenomena is then completely determined by specifying the failure intensity rate process \{\alpha(s) : s \ge 0\} (cf., Jacod [24]; Aalen [1]; Bremaud [8]; Arjas [5]; Andersen, Borgan, Gill and Keiding [4]). As implied by its measurability with respect to F_{s-}, the intensity process \alpha(s) may depend on the covariate of the unit and the number of event occurrences during the period [0, s].
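As a rough illustration of this interpretation, the following sketch simulates a counting process by thinning and checks numerically that M(s) = N(s) - \int_0^s \alpha(w) dw averages to approximately zero; the intensity function, horizon and number of replications are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = lambda s: 0.5 + 0.3 * np.sin(s)      # hypothetical intensity, bounded by 0.8
alpha_max, horizon, reps = 0.8, 10.0, 2000

def compensator(s, grid=4000):
    # simple midpoint rule for the cumulative intensity A(s) = int_0^s alpha(w) dw
    dw = s / grid
    w = (np.arange(grid) + 0.5) * dw
    return float(np.sum(alpha(w)) * dw)

M_end = []
for _ in range(reps):
    # thinning: candidate points from a homogeneous Poisson(alpha_max) process
    t, events = 0.0, 0
    while True:
        t += rng.exponential(1.0 / alpha_max)
        if t > horizon:
            break
        if rng.uniform() < alpha(t) / alpha_max:
            events += 1
    M_end.append(events - compensator(horizon))

print("mean of M(10) over replications:", np.mean(M_end))   # approximately 0
```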

3.

Existing Models

By specifying the \{\alpha(s) : s \ge 0\} process, a variety of classes of models for the recurrent event are generated. We review in this subsection some of the models considered in the literature. Let \psi : \mathbb{R} \to \mathbb{R}_+ be a known nonnegative (link) function. In most cases this is taken to be the exponential function given by

\psi(u) = \exp\{u\}.   (6.3)

Let \alpha_0(\cdot) be some unknown hazard rate function. The simplest model considered is obtained by taking

\alpha(s) = \alpha_0(s - S_{N(s-)})\,\psi(\beta^t X),   (6.4)

where \beta = (\beta_1, \beta_2, \ldots, \beta_q)^t is a regression coefficient vector. Since s - S_{N(s-)} represents the elapsed time since the last event occurrence, this model assumes that the interoccurrence times are identically distributed. Borrowing from the parlance of reliability, one says that upon an event occurrence, the intervention leads to a 'perfect repair' of the unit. With \psi given in (6.3), this model is the extension of the Cox [13] proportional hazards model, or the Aalen [1] and Andersen and Gill [3] multiplicative intensity model, to the recurrent event situation. This model has been considered in Prentice, Williams and Peterson [40], Lawless [30], and Aalen and Husebye [2], and under this model the resulting data could be analyzed using partial likelihood methods for making inference about \beta, and through the use of the Nelson-Aalen estimator for making inference about the cumulative baseline hazard function \Lambda_0(\cdot) = \int_0^{\cdot} \alpha_0(s)\, ds (cf., Kalbfleisch and Prentice [26]; Cox and Oakes [14]; Fleming and Harrington [16]; Andersen, Borgan, Gill and Keiding [4]). This model has the disadvantage of ignoring the (non-zero) correlations among the interoccurrence times (cf., Prentice and Kalbfleisch [39]; Aalen and Husebye [2]), which could have serious implications especially in the presence of intervention. Two possible approaches could be used to partially alleviate this deficiency. The first approach is to introduce a time-dependent covariate X_{q+1}(t), possibly defined to


be X_{q+1}(t) = \log\{1 + N(t-)\}, which is augmented to the covariate vector to obtain X^* = (X^t, X_{q+1})^t, while the regression vector is also augmented to become \beta^* = (\beta^t, \beta_{q+1})^t (note that \beta_{q+1} is not time-dependent). In model (6.4), the linear form \beta^{*t}X^* is then used in the link function. Such an approach, however, suffers from the defect that it could not allow non-proportional intervention effects. The second approach is to incorporate an unobservable random frailty Z in model (6.4) via

\alpha(s \mid X, Z) = Z\,\alpha_0(s - S_{N(s-)})\,\psi(\beta^t X),   (6.5)

where Z has some parametric distribution which is usually taken to be a gamma distribution with shape and scale parameters (\eta, \eta). Through such a model, dependencies among the interoccurrence times are incorporated, although whether they are the appropriate dependencies is not clear. In models where only one event per subject is observed, incorporating frailties is very useful in generating models which account for subject heterogeneities, aside from being able to model positive dependence among the failure times of the subjects. An excellent exposition on the stochastic process approach to modeling frailties together with the appropriate methods of analyses could be found in Andersen, Borgan, Gill and Keiding ([4], Ch. IX). Other very useful references are those by Clayton [11], Clayton and Cuzick [12], Hougaard ([20], [21], [22], [23]), Oakes ([35], [36], [37], [38]), Nielsen, Gill, Andersen and Sorensen [34], and Vaupel [44]. Another model for recurrent data considered in the literature is obtained by taking

\alpha(s) = \alpha_0(s)\,\psi(\beta^t X).   (6.6)

This has been considered in Prentice, Williams and Peterson [40], Brown and Proschan [9] and Lawless [30]. It amounts to assuming that the intensity of event occurrence for a subject or unit that just experienced an event occurrence is identical to the intensity just prior to the event occurrence. In reliability terminology, the subject or unit is said to have been 'minimally repaired' through the intervention. Partial likelihood methods are also applicable in making inference about \beta under this model. A limitation of this model is the restrictive way in which intervention effects can be modeled, since the model basically states that there is no improvement on the subject or unit relative to its state just prior to the event occurrence even after the intervention. As in (6.4), one may be able to alleviate the just-mentioned limitation in (6.6) through the incorporation of a time-dependent covariate which enters in the link function, or through the introduction of an unobservable random frailty. A generalization of the Markovian model of Gail, Santner and Brown [17], derived via theories of carcinogenesis, postulates that

\alpha(s) = (m - N(s-) + 1)\,\alpha_0(s - S_{N(s-)})\,\psi(\beta^t X),   (6.7)


where m is some unknown positive integer parameter with the interpretation of being the original number of tumor sites, so that N(s) :S m. This model could also be viewed as another extension of (6.4) through the use of the timedependent covariate Xq+l(s) = log(m - N(s-) + 1) with !3q+I = 1. This model takes into account the effect of the number of event occurrences through the multiplicative term m - N(s-) + 1, and since this is a decreasing function of N(s-), this is a model where the effect of an increasing number of event occurrences on the subject or unit leads to its improvement, which may not be the case in many biomedical-type settings. At this stage we point out the fruitful interplay between models that arise in biomedical settings and reliability by observing that the model in (6.7) can also be viewed as the Jelinski and Moranda [25] software reliability model with covariates, where m will have the interpretation as being the original number of bugs in the software. The basic limitation of the model in (6.7) is again the restrictive way in which intervention effects can be incorporated. Another model usable for recurrent data, but which was primarily developed for modeling the tumor occurrences at multiple sites after breast cancer, is that of Klein, Keiding and Kamby [28] which utilizes the generalized multivariate Marshall-Olkin distribution. For the bivariate case, the joint survivor function of (TI' T2) is assumed to be

where, for j \in \{1, 2, 12\}, we have

It is not clear, however, how this model could be restated in the stochastic

process framework that we have adopted, and this modeling scheme seems difficult to implement in the situation where event occurrences and interventions are happening in a time-sequential fashion. Furthermore, the appeal of this model in the tumor occurrence setting is it allows the modeling of simultaneous occurrences, but this is not the case in the recurrent model we are considering since we are assuming that for a given subject or unit the events are occurring one at a time. The class of marginal models developed for multivariate failure time data could also be used in the recurrent data setting. Such models specify the marginal distributions or hazard functions of the interoccurrence times Tk'S. Among these models is the one examined by Wei, Lin and Weissfeld [45] which postulates that the hazard rate function of Tk is


or the log-linear model of Lin and Wei [33] given by

\log(T_k) = \beta^t X_k + \epsilon_k, \quad k = 1, 2, \ldots,   (6.9)

where the X_k are (possibly time-dependent in (6.8)) covariates which are relevant for the kth event, and the \epsilon_k's are random error terms. Aalen and Husebye's [2] variance component model, which specifies that

\log(T_k) = \mu + U + \epsilon_k, \quad k = 1, 2, \ldots,   (6.10)

where U and the \epsilon_k's are independent, U is zero-mean normal with variance \sigma_U^2, and the \epsilon_k's are iid zero-mean normal random variables with variance \sigma_\epsilon^2, also belongs to this marginal class of models. The appeal of these models is the ease of analyses, since existing methods for the Cox model and the accelerated failure time model are immediately applicable. The disadvantage of these models when dealing with recurrent data is that the dependencies among the T_k's are not explicitly taken into account, and one would be hard-pressed to model the intervention effects and the possibly weakening effects of accumulating event occurrences. Furthermore, the time-dynamic aspect of the model is ignored. Finally, another modeling approach utilized primarily in the reliability area, but which could be adopted to recurrent models in biomedical settings, is that of specifying the form of the cumulative mean function (CMF) of N(s), without specifying the full probabilistic specification of the process. This approach is exemplified in Lawless and Nadeau [31], where they presented simple and robust methods for the estimation of the CMF. The robust estimators are related to estimators developed under the Poisson process model. The specific model considered in that paper, which also incorporates covariates, is given by

m(t) = m_0(t)\,P(t)\,\psi(\beta^t X(t)), \quad t \ge 0,   (6.11)

where m_0(\cdot) is a baseline mean function, and P(\cdot) is some known function. The cumulative mean function is (in the continuous-time case) defined to be

M(t) = \int_0^t m(u)\, du .

Some of the robust inference procedures for this model were developed using estimating equation theory. The model in (6.11) is restrictive in that the intervention effects it could model can only be contained in the P(·) function, and the link function. These are the different varieties of models that have been utilized in dealing with recurrent data in the biomedical, reliability, engineering, economics, and sociological settings. Though many of these models are quite general, none of them satisfy the three requirements enunciated in Section 1, which is to have a model that takes into account the three effects simultaneously. One may


argue that all of these effects could be incorporated through the use of time-dependent covariates in the Cox model, but one should realize that the type of effects that could be modeled through such an approach is limited to be of the proportional type. In the next section we advocate a different scheme of modeling the intervention effects, which is through a change in the time origin of the baseline hazard function just after intervention. It will also be demonstrated that most of the models mentioned above are subsumed in the proposed class of models.
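Before turning to the new class, the following small simulation sketch contrasts the 'perfect repair' model (6.4) and the 'minimal repair' model (6.6) discussed above; the Weibull baseline hazard, the horizon, and the absence of a covariate effect are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, horizon = 1.0, 2.0, 5.0            # increasing Weibull baseline hazard a*b*t**(b-1)

def simulate_perfect_repair():
    # model (6.4): each event fully renews the unit, so the gaps are iid Weibull
    s, times = 0.0, []
    while True:
        s += (rng.exponential(1.0) / a) ** (1.0 / b)   # inverse of cumulative hazard a*t**b
        if s > horizon:
            return times
        times.append(s)

def simulate_minimal_repair():
    # model (6.6): intensity depends on calendar time only, i.e. a nonhomogeneous
    # Poisson process with cumulative hazard a*s**b
    n = rng.poisson(a * horizon ** b)
    return np.sort(horizon * rng.uniform(size=n) ** (1.0 / b))

perfect = [len(simulate_perfect_repair()) for _ in range(2000)]
minimal = [len(simulate_minimal_repair()) for _ in range(2000)]
print("mean #events, perfect repair:", np.mean(perfect))
print("mean #events, minimal repair:", np.mean(minimal))   # larger for b > 1
```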

4.

A New Class of Models

To describe our proposed class of models, we assume the existence of a complete probability space (\Omega, F, P) with an associated filtration F = \{F_s : s \in [0, \tau]\}, where 0 < \tau \le \infty is the upper endpoint of the study period. All relevant random entities are defined on (\Omega, F). In particular, the interoccurrence times T_1, T_2, \ldots, the calendar times S_1, S_2, \ldots of event occurrences, and the observable processes \{N(s) : s \in [0, \tau]\} and \{Y(s) : s \in [0, \tau]\} are defined on (\Omega, F).

Our proposed class of models postulates that the intensity rate process \{\alpha(s \mid X) : s \in [0, \tau]\} for a subject or unit with covariate vector X = (X_1, \ldots, X_q)^t, which may be time-dependent, is of the form

\alpha(s \mid X) = \alpha_0[\mathcal{E}(s)]\,\rho[N(s-)]\,\psi(\beta^t X).   (6.12)

In (6.12), \alpha_0(\cdot) is an unknown baseline hazard rate function; \rho(\cdot) is a nondecreasing or nonincreasing function from \mathcal{N} = \{0, 1, 2, \ldots\} into \mathbb{R}_+ = [0, \infty) which may depend on unknown parameters, with the norming condition \rho(0) = 1; \psi(\cdot) is a nonnegative link function from \mathbb{R} = (-\infty, \infty) into \mathbb{R}_+ which is of known form (usually taken to be the exponential function); \beta = (\beta_1, \beta_2, \ldots, \beta_q)^t is an unknown regression coefficient vector; and \{\mathcal{E}(s) : s \in [0, \tau]\} is an observable predictable process satisfying the conditions:

(I) \mathcal{E}(0) = e_0 almost surely (a.s.), where e_0 is a nonnegative real number;

(II) \mathcal{E}(s) \ge 0 a.s.;

(III) for s \in [S_{k-1}, S_k), \mathcal{E}(s) is a.s. monotone and differentiable with \mathcal{E}'(s) \in (0, 1].

This predictable observable process, called the effective age of the unit, is where the improvement effects accruing from the performed intervention are modeled. Note that condition (III) implies \mathcal{E}(S_k-) \le \mathcal{E}(S_{k-1}) + T_k, k =


1, 2, \ldots, which means that the unit's effective age just before the kth event occurrence, which is represented by \mathcal{E}(S_k-), is at most the unit's effective age just after the (k-1)th event occurrence, which is \mathcal{E}(S_{k-1}), plus the time between the (k-1)th and the kth event occurrences, which is T_k. Thus, in the context of the effective age of the unit, the effect of intervention is to make the unit age at a slower rate relative to the elapsed calendar time. We point out that a different interpretation is needed or other conditions need to be imposed if the baseline hazard rate function \alpha_0(\cdot) is a decreasing hazard rate function, as is the case for instance when dealing with infants having ear infection, since infants will usually exhibit a decreasing hazard rate. In these types of situations, the improvement effects might be modeled by changing the sign of the derivative and allowing \mathcal{E}(0) to be non-zero. Our modeling of the intervention effects as a change in the unit's effective age differs from models which have been considered in the literature, since most of the models in the literature incorporate such effects in the regression component. The initial motivation of our modeling approach came from reliability repair models where, when a system or component fails, either the system or component is restored to its state just prior to its failure, or is replaced by a new component, so the process of repair leads to a change in the unit's effective age. In biomedical settings, such a model is also plausible since interventions that can be considered as good are meant to slow or decelerate the reoccurrence of the event. In model (6.12), the function \rho(\cdot) represents the effect of accumulating event occurrences. In many biomedical situations, this will usually be assumed to be a nondecreasing function of N(s-), since it is natural to assume that the event occurrences have a weakening effect on the unit or subject. In some situations, however, such as the Markovian model by Gail, Santner and Brown [17] in (6.7), where this function will be \rho(k) = m - k + 1, the occurrences of events lead to improvements on the unit. This nonincreasing feature is also prevalent in reliability models where at each event occurrence, faults or defects in the system or component are eliminated, which leads to improvements. Thus, generally, we will simply require that this function be monotonic, either nondecreasing or nonincreasing depending on the context or situation at hand. The link function in the model clearly serves the purpose of containing the effects of the concomitant variables. In this model (6.12), the intervention effects, the effects of accumulating event occurrences, and the effects of the covariates are therefore taken into account simultaneously. Furthermore, there is an interplay among these effects to the extent that just after intervention, in an overall sense, the unit need not always be better relative to its state just before the event occurrence, because the improvement effects might be outweighed by the other two effects. We now illustrate the generality of the proposed class of models by considering specific forms of \mathcal{E}(\cdot) and \rho(\cdot).


Example 4.1: By letting \rho(k) \equiv 1, k \in \mathcal{Z} \equiv \{0, 1, 2, \ldots\}, and \mathcal{E}(s) = s - S_{N(s-)}, then \alpha(s \mid X) = \alpha_0(s - S_{N(s-)})\psi(\beta^t X), which is the extended Cox model in (6.4) considered by Prentice, Williams and Peterson [40], Lawless [30], and Aalen and Husebye [2]. ∎

Example 4.2: By letting \rho(k) \equiv 1, k \in \mathcal{Z}, and \mathcal{E}(s) = s, s \in [0, \tau], we obtain \alpha(s \mid X) = \alpha_0(s)\psi(\beta^t X), which is the model in (6.6), a model examined by Prentice, Williams and Peterson [40], Brown and Proschan [9] and Lawless [30]. ∎

Example 4.3: Gail, Santner and Brown's [17] Markovian model becomes a special case of model (6.12) by taking \mathcal{E}(s) = s - S_{N(s-)} and \rho(k) = m - k + 1, and as mentioned in the preceding section, this model coincides with the Jelinski and Moranda [25] software reliability model with the additional feature that a covariate has been incorporated. ∎
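A small computational sketch of how Examples 4.1 and 4.2 fall out of (6.12) is given below; the baseline hazard, covariate, regression coefficient and event times are arbitrary values chosen only for illustration.

```python
import numpy as np

def alpha_general(s, event_times, x, beta, alpha0, rho, effective_age):
    """alpha(s|X) = alpha0[E(s)] * rho[N(s-)] * exp(beta'x), as in (6.12)."""
    n_minus = np.searchsorted(event_times, s, side="left")     # N(s-)
    return alpha0(effective_age(s, event_times)) * rho(n_minus) * np.exp(beta @ x)

alpha0 = lambda t: 0.2 + 0.1 * t                 # hypothetical baseline hazard
beta, x = np.array([0.5]), np.array([1.0])
S = np.array([1.0, 2.5, 4.0])                    # observed event (calendar) times

def age_renewal(s, ev):
    # Example 4.1 (perfect repair): E(s) = s - S_{N(s-)}
    k = np.searchsorted(ev, s, side="left")
    return s - (ev[k - 1] if k > 0 else 0.0)

def age_calendar(s, ev):
    # Example 4.2 (minimal repair): E(s) = s
    return s

for s in (0.5, 3.0):
    a1 = alpha_general(s, S, x, beta, alpha0, lambda k: 1.0, age_renewal)
    a2 = alpha_general(s, S, x, beta, alpha0, lambda k: 1.0, age_calendar)
    print(f"s={s}: model (6.4) intensity={a1:.3f}, model (6.6) intensity={a2:.3f}")
```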

Example 4.4: Let I_1, I_2, I_3, \ldots be a sequence of iid Bernoulli random variables with success probability p. [For technical reasons, it is assumed that the I_i's are measurable with respect to \mathcal{F}_0.] Define the process \{\eta(s) : s \in [0, \tau]\} via

\eta(s) = \sum_{i=1}^{N(s)} I_i .

Also, let

The discrimination information between any two Weibull distributions, f_j(x \mid \alpha_j, \lambda_j), j = 1, 2, with \eta = \lambda_2/\lambda_1, is given by:

K(f_1 : f_2 \mid \alpha_j, \eta) = \log \alpha_1 + \eta \cdots

H(Y) > H(Y \mid x = 0), H(Y) < H(Y \mid x = 1), and H(Y) = H(Y \mid x = 2). That is, x = 0 reduces uncertainty (is


Table 7.2. Information Measures for Three Bivariate Distributions.

a) f(x, y): general bivariate distribution

   Joint and marginal distributions          Entropies of conditional distributions
             x=0    x=1    x=2   h(y)                  x=0    x=1    x=2
   y=0        0     .25    .15    .40        f(0|x)     0     .50    .60
   y=1       .25    .25    .10    .60        f(1|x)     1     .50    .40
   h(x)      .25    .50    .25               H(Y|x)     0     .69    .67

b) f(x, y) when X and Y are independent

   Joint and marginal distributions          Entropies of conditional distributions
             x=0    x=1    x=2   h(y)                  x=0    x=1    x=2
   y=0       .10    .20    .10    .40        f(0|x)    .40    .40    .40
   y=1       .15    .30    .15    .60        f(1|x)    .60    .60    .60
   h(x)      .25    .50    .25               H(Y|x)    .67    .67    .67

c) f(x, y) when X and Y are related functionally

   Joint and marginal distributions          Entropies of conditional distributions
             x=0    x=1    x=2   h(y)                  x=0    x=1    x=2
   y=0       .20     0     .20    .40        f(0|x)     1      0      1
   y=1        0     .60     0     .60        f(1|x)     0      1      0
   h(x)      .20    .60    .20               H(Y|x)     0      0      0

   Information measure              (a) General   (b) Independent   (c) Related Functionally
   Marginal Entropy H(Y)                .67             .67                   .67
   Conditional Entropy H(Y|X)           .52             .67                    0
   Mutual Information M(X, Y)           .15              0                    .67
   Information Index I(X, Y)            23%              0%                   100%

informative) about the outcome of Y, x = 1 increases uncertainty about the outcome of Y, and x = 2 leaves the uncertainty unchanged. However, H(Y) > H(Y|X), indicating that, on average, knowledge of X reduces uncertainty about outcomes of Y. The mutual information M(X, Y) = .15 quantifies this average uncertainty reduction. The fraction of uncertainty reduction, computed using the normalized index (7.35) with H_M = H(Y), is I(X, Y) = 23%. In case (b), the bivariate distribution has the independent structure. The entropies of the conditional distributions (shown in the right panel of the Table) are all equal to the marginal entropy H(Y) = .67. Thus, the amount of average uncertainty is H(Y|X) = .67, there is no uncertainty reduction, and I(X, Y) = 0%.


In case (c), the variables are related functionally, P(Y = 2X - X^2) = 1. The entropies of the conditional distributions (shown in the right panel of the Table) are all zero, hence no uncertainty about the outcomes remains when an X = x is available for the prediction of Y. Therefore, the mutual information is equal to the marginal entropy, M(X, Y) = H(Y) = .67, and the normalized index is I(X, Y) = 100%. However, we note that the correlation coefficient \rho(X, Y) = 0, due to the fact that the functional relationship between Y and X is nonlinear.
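The entries of Table 7.2 for case (a) can be reproduced directly; the short sketch below (using natural logarithms, as in the table) is included only as an illustration.

```python
import numpy as np

p = np.array([[0.00, 0.25, 0.15],   # rows: y = 0, 1;  columns: x = 0, 1, 2
              [0.25, 0.25, 0.10]])

def entropy(q):
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

h_x, h_y = p.sum(axis=0), p.sum(axis=1)
H_Y = entropy(h_y)
H_Y_given_X = sum(h_x[j] * entropy(p[:, j] / h_x[j]) for j in range(p.shape[1]))
M = H_Y - H_Y_given_X                  # mutual information M(X, Y)
print(f"H(Y)={H_Y:.2f}  H(Y|X)={H_Y_given_X:.2f}  M(X,Y)={M:.2f}  I(X,Y)={M / H_Y:.0%}")
```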

Example 4.2: To survive or to fail? Abel and Singpurwalla [2] considered the lifetime of an item X with the exponential distribution with mean E(X|\theta) = \theta and failure rate \lambda, and posed the following question: Which of the two outcomes, survival (S) or failure (F), in a small interval (t_0, t_0 + \Delta t_0) is more informative about \theta and \lambda? They provided an answer using a gamma prior for \lambda with density p(\lambda|a, \beta) \propto \lambda^{a-1}\exp(-\beta\lambda). The implied prior for the mean \theta = 1/\lambda is inverse gamma. The posterior distributions for \lambda \in \Lambda and \theta \in \Theta are gamma and inverse gamma with shape parameter a (respectively a + 1) and scale parameter (t_0 + \beta). The information provided by the data (S or F) about \lambda and \theta are quantified by the differences between entropies of the respective prior and posterior distributions:

I_\lambda(S) \equiv H(\Lambda) - H(\Lambda|S) = \log\frac{t_0 + \beta}{\beta}

I_\lambda(F) \equiv H(\Lambda) - H(\Lambda|F) = \log\frac{t_0 + \beta}{\beta} + (H_a - H_{a+1})

I_\Theta(S) \equiv H(\Theta) - H(\Theta|S) = \log\frac{\beta}{t_0 + \beta}

I_\Theta(F) \equiv H(\Theta) - H(\Theta|F) = \log\frac{\beta}{t_0 + \beta} + \Big(H_a - H_{a+1} + \frac{2}{a}\Big),

where H_a is the entropy of the gamma distribution with shape parameter a and scale equal to one, shown in Table 7.1. As shown in Table 7.1, H_a is increasing in a, so H_a - H_{a+1} < 0, which implies that I_\lambda(S) > I_\lambda(F). Abel [1] has shown that for a \ge 1, the quantity H_a - H_{a+1} + 2/a > 0, which implies that I_\Theta(F) > I_\Theta(S). Thus, a survival is more informative about the failure rate \lambda and a failure is more informative about the mean \theta. Also note that I_\Theta(S) = -I_\lambda(S) < 0. Furthermore, I_\lambda(S) is increasing in t_0 and I_\Theta(S) is decreasing in t_0. For any t_0 > 0, I_\lambda(S) > 0 and I_\Theta(S) < 0. That is, a survival is always informative about the failure rate, but always increases uncertainty about the mean. The case is not so clear for I_\lambda(F) and I_\Theta(F).


The expected information about the parameters is given by the mutual information:

M(\Lambda, Y) = I_\lambda(S)P(S) + I_\lambda(F)P(F) > 0,
M(\Theta, Y) = I_\Theta(S)P(S) + I_\Theta(F)P(F) > 0,

where Y is a binary random variable that indicates survival and failure in the interval (t_0, t_0 + \Delta t_0).

Finally, let \hat{\lambda} and \hat{\theta} be the Maximum Likelihood Estimates (MLE) of \lambda and \theta. Then, Fisher information (7.11) gives the following results: F_\lambda(S) = t_0^2 > 0 and is increasing in t_0; F_\Theta(S) > 0 and is decreasing in t_0; but F_\lambda(F) = F_\Theta(F) = 0. Thus, Fisher information (7.11) does not provide a meaningful answer to the question of interest in this problem when a failure occurs in (t_0, t_0 + \Delta t_0).
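Using the expressions displayed above (as reconstructed here), the qualitative conclusions of this example can be checked numerically; the values of a, \beta and t_0 below are hypothetical, and the gamma entropy H_a is taken from scipy.

```python
import numpy as np
from scipy.stats import gamma

a, beta, t0 = 2.0, 1.0, 0.5
H = lambda shape: gamma(shape).entropy()          # H_a: entropy of gamma(shape, scale=1)

I_lam_S = np.log((t0 + beta) / beta)
I_lam_F = I_lam_S + (H(a) - H(a + 1))
I_th_S  = np.log(beta / (t0 + beta))
I_th_F  = I_th_S + (H(a) - H(a + 1) + 2.0 / a)

print("I_lambda(S) > I_lambda(F):", I_lam_S > I_lam_F)    # survival informs the rate
print("I_theta(F)  > I_theta(S): ", I_th_F > I_th_S)      # failure informs the mean
print("I_theta(S) = -I_lambda(S):", np.isclose(I_th_S, -I_lam_S))
```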

Example 4.3: Multivariate normal. Consider (Y, X_1, \ldots, X_p) with a (p + 1)-variate normal distribution.

(a) Let X = (X_1, \ldots, X_p)'. Then X has a p-variate normal distribution with covariance \Sigma_X = [\sigma_{ij}].

(i) The entropy of the conditional distribution f(X_i|X_j) is

H(X_i \mid X_j = x_j) = \tfrac{1}{2}\log\big[2\pi e\, \sigma_{ii}(1 - \rho_{ij}^2)\big],

where \rho_{ij} is the correlation coefficient. Thus, H(X_i|X_j = x_j) is a function of the correlation and variance and is independent of the value x_j.

(ii) The mutual information between any pair (X_i, X_j), i \ne j, is just the uncertainty difference,

M(X_i, X_j) = -\tfrac{1}{2}\log(1 - \rho_{ij}^2).

The mutual information index is I(X_i, X_j) = \rho_{ij}^2. This reflects the fact that for the multivariate normal variables, stochastic dependency and linear dependency among X_1, \ldots, X_p are equivalent.


(iii) The mutual information between the vector X and its components is given by

M[X, (X_1, \ldots, X_p)] = -\tfrac{1}{2}\log\frac{|\Sigma_X|}{\sigma_{11}\cdots\sigma_{pp}},

where |\Sigma_X| denotes the determinant, and \lambda_\ell is the \ell-th eigenvalue of \Sigma_X. The normalized index of dependency (7.34) is

I[X, (X_1, \ldots, X_p)] = 1 - \frac{\lambda_1 \cdots \lambda_p}{\sigma_{11}\cdots\sigma_{pp}}.

Linear dependency is indicated by some \lambda_\ell = 0, which leads to I[X, (X_1, \ldots, X_p)] = 1. At the other extreme, \lambda_j = \sigma_{jj}, j = 1, \ldots, p, if and only if X_1, \ldots, X_p are mutually uncorrelated (independent), for which I[X, (X_1, \ldots, X_p)] = 0.

(b) It can be shown that for any set of given x = (x_1, \ldots, x_p), the entropy of the conditional distribution, H(Y|x), is a function of the variances and covariances, and is functionally independent of x.

(i) The entropy difference is

\Delta[H(Y), H(Y|x)] = -\tfrac{1}{2}\log[1 - \rho^2(Y; X)] \ge 0,   (7.38)

where \rho^2(Y; X) is the square of the multiple correlation between Y and X_1, \ldots, X_p. Therefore, by (7.38), any set of multivariate normal data x is informative about Y.

(ii) The mutual information is given by the mean uncertainty difference:

M(Y, X) = E_x\{\Delta[H(Y), H(Y|x)]\} = -\tfrac{1}{2}\log[1 - \rho^2(Y; X)] = -\tfrac{1}{2}\sum_{i=1}^{p}\log[1 - \rho^2(Y; X_i \mid X_1, \ldots, X_{i-1})],

where \rho^2(Y; X_i \mid X_1, \ldots, X_{i-1}) is the square of the partial correlation between Y and X_i, given (X_1, \ldots, X_{i-1}). The normalized index of dependency (7.34) is given by the square of the multiple correlation,

I(Y, X) = \rho^2(Y; X).
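The normal-theory indices of this example are easily computed from a covariance matrix; the following sketch uses an arbitrary 3 x 3 covariance of (Y, X1, X2) purely for illustration.

```python
import numpy as np

Sigma = np.array([[1.0, 0.6, 0.3],     # covariance of (Y, X1, X2)
                  [0.6, 2.0, 0.5],
                  [0.3, 0.5, 1.5]])

# (a)(ii) pairwise index between X1 and X2
Sx = Sigma[1:, 1:]
rho12 = Sx[0, 1] / np.sqrt(Sx[0, 0] * Sx[1, 1])
M12 = -0.5 * np.log(1.0 - rho12 ** 2)
print("I(X1,X2) = rho^2 =", rho12 ** 2, " check 1 - exp(-2M):", 1 - np.exp(-2 * M12))

# (a)(iii) index of overall dependency within X
lam = np.linalg.eigvalsh(Sx)
print("I[X,(X1,...,Xp)] =", 1.0 - np.prod(lam) / np.prod(np.diag(Sx)))

# (b) multiple correlation of Y on X: rho^2(Y;X) = Sigma_yx Sx^{-1} Sigma_xy / sigma_yy
s_yx = Sigma[0, 1:]
rho2 = s_yx @ np.linalg.solve(Sx, s_yx) / Sigma[0, 0]
print("I(Y,X) = rho^2(Y;X) =", rho2, "  M(Y,X) =", -0.5 * np.log(1 - rho2))
```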


5.

Information in Residual Lifetime

Frequently, in reliability one has information about the current age of the system under consideration. In such cases, the age must be taken into account when measuring information. Ebrahimi and Kirmani ([16], [17]) considered the situations when age t must be taken into account. If we think of a random variable X as the lifetime of a system, then X is a non-negative random variable. In this case, the set of interest is the residual life, E_t = \{x : x > t\}.

5.1.

Residual Discrimination Information

Ebrahimi and Kirmani ([16], [17]) proposed using the discrimination information function between the two residual life distributions F_t(x) = P(X - t \le x \mid X > t) and G_t(x) = P(X - t \le x \mid X > t) implied by two lifetime distributions F(x) and G(x). The discrimination information between the two residual life distributions is given by:

K(f : g; t) \equiv K(f_t : g_t) = \int_t^{\infty} f_t(x)\log\frac{f_t(x)}{g_t(x)}\, dx = K(f : g; E_t) - \log\frac{\bar{F}(t)}{\bar{G}(t)},   (7.39)

where f_t(x) = f(x)/\bar{F}(t) and g_t(x) = g(x)/\bar{G}(t) denote the conditional densities, \bar{F}(t) = P_f(E_t) = 1 - F(t) and \bar{G}(t) = P_g(E_t) = 1 - G(t) are the survival functions, and K(f : g; E_t) is defined in (7.2). It is clear that for t_0 = \inf\{x : \bar{F}(x) = 1\}, K(f : g; t_0) = K(f, g). By (7.39), the discrimination information between two residual distributions is equal to the mean information for discrimination in favor of F against G, given E_t, minus the logarithm of the likelihood ratio of the survival of the system beyond t under the two lifetime distributions F and G. By (7.4), for each t, t \ge 0, K(f : g; t) possesses all the properties of the discrimination information function (7.3). If we consider t as an index ranging over E_t, then K(f : g; t) provides a dynamic discrimination information function indexed by t for measuring the discrepancy between the residual life distributions F_t(x) and G_t(x). It can be shown that K(f : g; t) is free of t if and only if the hazard functions are proportional, i.e., \bar{G}(t) = \bar{F}^{\beta}(t), \beta > 0. For more details see Ebrahimi and Kirmani ([16], [17]). The following example demonstrates the computation and usefulness of K(f, g; t).

Example 5.1: Systems of components Consider the systems of n components discussed in Example 2.4.


(a) Series components. For Z_1 = \min(X_1, X_2, \ldots, X_n) and Z_2 = \min(Y_1, \ldots, Y_n), the dynamic discrimination information is given by

K(f_{Z_1} : f_{Z_2}; t) = E_{f_{Z_1}|Z_1>t}\Big[\log\frac{f(Z)}{g(Z)}\Big] + (n - 1)\, E_{f_{Z_1}|Z_1>t}\Big[\log\frac{\bar{F}(Z)}{\bar{G}(Z)}\Big] + n\log\frac{\bar{G}(t)}{\bar{F}(t)}.

(b) Parallel components. For Z_1 = \max(X_1, X_2, \ldots, X_n) and Z_2 = \max(Y_1, \ldots, Y_n), the dynamic discrimination information is given by

K(f_{Z_1} : f_{Z_2}; t) = E_{f_{Z_1}|Z_1>t}\Big[\log\frac{f(Z)}{g(Z)}\Big] + (n - 1)\, E_{f_{Z_1}|Z_1>t}\Big[\log\frac{F(Z)}{G(Z)}\Big] + \log\frac{1 - [G(t)]^n}{1 - [F(t)]^n}.

In both cases K(f_{Z_1} : f_{Z_2}; 0) = K(f_{Z_1} : f_{Z_2}); see Example 2.4.
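The constancy of K(f : g; t) under proportional hazards noted above can be checked numerically; the sketch below integrates the residual discrimination information (7.39) directly, with Weibull distributions chosen only for illustration (a common shape gives proportional hazards, different shapes do not).

```python
import numpy as np
from scipy.integrate import quad

def K_residual(f, F, g, G, t, upper=8.0):
    Fb, Gb = 1.0 - F(t), 1.0 - G(t)
    def integrand(x):
        fx = f(x) / Fb
        if fx <= 0.0:
            return 0.0
        return fx * np.log(fx / (g(x) / Gb))
    return quad(integrand, t, upper, limit=200)[0]

def weibull(shape, scale):
    f = lambda x: (shape / scale) * (x / scale) ** (shape - 1) * np.exp(-(x / scale) ** shape)
    F = lambda x: 1.0 - np.exp(-(x / scale) ** shape)
    return f, F

f1, F1 = weibull(2.0, 1.0)
g1, G1 = weibull(2.0, 1.5)      # same shape: proportional hazards, K free of t
g2, G2 = weibull(3.0, 1.5)      # different shape: K varies with t

for t in (0.0, 0.5, 1.0):
    print(t, round(K_residual(f1, F1, g1, G1, t), 4), round(K_residual(f1, F1, g2, G2, t), 4))
```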

5.2.

Residual Entropy

The entropy of the residual life distribution is defined similarly as

H(X; t) \equiv H(f; t) = -\int_t^{\infty} \frac{f(x)}{\bar{F}(t)}\log\frac{f(x)}{\bar{F}(t)}\, dx.   (7.40)

As in (7.22), H(f; t) may be computed using the hazard rate,

H(f; t) = 1 - \frac{1}{\bar{F}(t)}\int_t^{\infty} f(x)\log\lambda_F(x)\, dx.

It is clear that for t_0 = \inf\{x : \bar{F}(x) = 1\}, H(f; t_0) = H(f). Like the entropy (7.21), the residual entropy (7.40) is a discrimination information function as in (7.23). As before, let U(x) denote the uniform distribution with support S = \{x : a < x < b\}. Then the conditional distribution of X, given X > t, is also uniform, i.e., U_t(x) is uniform over (t, b), and

K(f : U; t) = H(U; t) - H(f; t).

Therefore H(f; t) measures the uncertainty (or lack of predictability) of the remaining lifetime of a system of age t. It can be shown that the dynamic entropy H(f; t), like the failure rate and mean residual life, uniquely determines the distribution function F. On the basis of the measure H(f; t), we can define some non-parametric classes of life distributions that are closely related to other


classes of life distributions, such as increasing failure rate (IFR) and decreasing failure rate (DFR). A survival function \bar{F} (\bar{F}(0) = 1) is said to have decreasing (increasing) uncertainty of residual life (DURL (IURL)) if H(f; t) is decreasing (increasing) in t. One can easily show that for the exponential distribution H(f; t) remains constant. That is, uncertainty about lifetime does not change as the system ages. In fact, the exponential distribution is both DURL and IURL. For further properties and implications of H(f; t) and the DURL and IURL classes see Ebrahimi [14] and Ebrahimi and Kirmani [15]. In reliability, there are many situations in which the hazard rate function \lambda_F(t) must satisfy certain constraints. In fact, we argue that a state of no knowledge at all about the physical characteristics of a system is hardly, if ever, realistic, and we would typically at least have some idea concerning the physical behavior of a system. Ebrahimi, Hamadani, and Soofi [19] studied developing lifetime distributions through maximizing the entropy H(f) subject to monotonicity constraints on the failure rate \lambda_F(t). However, to produce a model for the data generating distribution f under these constraints, the direct use of H(f; t) is more appropriate, because, given X > t, we are interested in modeling the distribution of X_t, the remaining lifetime of a system of age t \ge 0. When partial information is available about \lambda_F(t), we can develop a model for X_t by maximizing H(f; t) instead of H(f) in the ME problem; see Ebrahimi [12].
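A numerical sketch of these monotonicity properties, using Weibull lifetimes as an arbitrary test case, is given below; it integrates (7.40) directly.

```python
import numpy as np
from scipy.integrate import quad

def residual_entropy(shape, scale, t, upper=80.0):
    f = lambda x: (shape / scale) * (x / scale) ** (shape - 1) * np.exp(-(x / scale) ** shape)
    Fbar_t = np.exp(-(t / scale) ** shape)
    def integrand(x):
        ft = f(x) / Fbar_t
        return -ft * np.log(ft) if ft > 0 else 0.0
    return quad(integrand, t, upper, limit=200)[0]

# shape < 1: DFR, residual entropy increases (IURL); shape > 1: IFR, it decreases (DURL);
# shape = 1: exponential, residual entropy is constant in t
for shape in (0.7, 1.0, 2.0):
    print(shape, [round(residual_entropy(shape, 1.0, t), 3) for t in (0.1, 0.5, 1.0, 2.0)])
```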

5.3.

Residual Mutual Information

We may define a dynamic version of mutual information as

M(X, Y; t_1, t_2) = K[\,f_{t_1,t_2}(x, y) : f_{t_1,t_2,X}(x)\, f_{t_1,t_2,Y}(y)\,],   (7.41)

where f_{t_1,t_2}(x, y) = f(x, y)/\bar{F}(t_1, t_2) is the residual joint density and

f_{t_1,t_2,X}(x) = \int_{t_2}^{\infty} \frac{f(x, y)}{\bar{F}(t_1, t_2)}\, dy

is the marginal of ftlh(x, y). The marginal density ftlhY(Y) is defined similarly. It is clear that if M(X, Y; tl, t2) = 0, then the residual lifetimes are independent. Thus the dynamic mutual information measures the extent of functional dependency between the remaining lifetimes of two components that are already survived to times tl and t2. Note that in (7.41) the dynamic mutual information M(X, Y;tl, t2) is defined in terms of (7.33). The expression analogous to (7.32) is:


M(X, Y; t_1, t_2) = H(X; t_1, t_2) + H(Y; t_1, t_2) - H(X, Y; t_1, t_2),

where H(X, Y; t_1, t_2) = H[f_{t_1,t_2}(x, y)] is the joint dynamic entropy, and H(X; t_1, t_2) = H[f_{t_1,t_2,X}(x)] and H(Y; t_1, t_2) = H[f_{t_1,t_2,Y}(y)] are the component entropies. Since f_{t_1,t_2,X}(x) is the distribution conditional on the bivariate event \{X > t_1, Y > t_2\}, in general the dynamic component entropy H(X; t_1, t_2) is not equal to the dynamic univariate entropy H(X; t_1) = H[f_{t_1}(x)]. However, it is clear that for t_{j0} = \inf\{x : \bar{F}_j(x) = 1\}, j = 1, 2, M(X, Y; t_{10}, t_{20}) = M(X, Y).

6.

Information Statistics

In this section, we outline some statistical applications of the information functions discussed above. This is a rich line of research which provides numerous theoretical and applied problems for the future.

6.1.

Order Statistics

The system reliability problems presented in Examples 2.4 and 5.1 involved discrimination information between two minimum (maximum) lifetimes. More generally, order statistics are of paramount importance in reliability analysis and have wide ranges of applications in many other fields. However, only a few papers have discussed information properties of order statistics, (Wong and Chen [57], Park [42], [41]). More recently, some results on the properties of the entropy of order statistics, the discrimination information and the mutual information functions that involve order statistics are developed by Ebrahimi, Soofi, and Zahedi [22].

6.2.

Covariate Information Index

Covariate information indices are measures that quantify the impact of a set of variables X = (Xl, ... ,Xp ) on the distribution of another variable Y. The most well-known measure of covariate information is the R 2 of regression. The R 2 is meaningfully interpretable for the Gaussian case. This is shown to be the case in various information theoretic formulations. We have seen in Example 4.3 that when the variables are jointly normal, the conditional entropy H(Ylxl,'" ,xp ) is functionally independent of Xl,'" ,xp • In this case, the sample counterpart of the normalized mutual information I(Y, X) is the R 2 of regression. A formulation that is particularly useful for reliability analysis is the generalization of R 2 in the context of the exponential family regression. Consider the regression problem E(YIX,,6) = X/3, where Y = (YI,'" ,Yn ) is the random vector of responses, X = [Xij] is an n x p matrix of given covariate values Xij, and /3 is the p x 1 vector of regression parameters. Suppose that the


distribution of Y has a density in the exponential family

f_\eta(y) = h(y)\exp\{\eta' y - \psi(\eta)\},

where \eta \in \mathcal{H} is the vector of natural or canonical parameters, \psi(\eta) is the normalizing function, and h(y) is a parameter-free function. When the covariance matrix is positive definite, the relation between \eta \in \mathcal{H} and E(Y) = \mu \in \mathcal{M} is one-to-one. For any member of the exponential family, the discrimination information has the sequential additive properties for nested linear subspaces of the natural parameter space \mathcal{H} and the expectation parameter space \mathcal{M} (Kullback [30], Simon [48]). The covariate information index for exponential family regression is derived based on the additivity of information in the natural parameter space. Let \eta_r = X_r\beta_r, where X_r is an n x r full-rank matrix, and \eta_s = X_s\beta_s, where X_s is an n x s full-rank submatrix of X, r \le s \le n. Then

(7.42)

where \eta^*, \eta_r^*, and \eta_s^* are the MDI estimates (Simon [48] used the MLE). Hastie [23] formulated the exponential family regression estimation in terms of (7.42) with

\eta^*|y = \eta^*(y), \quad \eta_r^*|y = \eta_r^*(y), \quad \text{and} \quad \eta_s^*|y = \hat{\mu} = \eta_s^*(X\hat{\beta}).

Covariate information for exponential family regression is

For the normal regression problem, Iy(X) = R 2 • More relevant distributions for reliability applications are exponential and gamma. Cameron and Windmeijer [7J tabulated the covariate index Iy(X) for several distributions in the exponential family. For example, the index for the Gamma family is

I_Y(X) = 1 - \frac{\sum_i \big[(y_i/\hat{\mu}_i) - \log(y_i/\hat{\mu}_i) - 1\big]}{\sum_i \big[(y_i/\bar{y}) - \log(y_i/\bar{y}) - 1\big]}.

6.3.

Distributional Fit Diagnostics

The discrimination information function and entropy have been instrumental in the development of indices of fit of parametric models to the data. Given data Xl, ... , X n from a distribution F, it is important to assess whether the unknown F (x) can be satisfactorily approximated by a parametric model F* (x I0). The loss of approximating F(x) by a parametric model F*(xIO) is measured by


K (J : j* 10). In order to assess the loss of approximating the unknown datagenerating distribution F(x) by a model F*(xIO), the discrimination information K(J : j*10) must be estimated. In general, estimation of (7.3) directly is formidable. Akaike considered approximating the unknown data-generating distribution f (x) by a family of models j* (x I0 J) and estimating the model parameter 0 J, including its dimension J, J = 1,' .. ,L. Akaike [4] showed that "choice of the information theoretic loss function is a very natural and reasonable one to develop a unified asymptotic theory of estimation." The approximation loss is measured by the information discrepancy K[f(x) : j*(xIO)]. The MDI or minimum relative entropy loss estimate of 0 is defined by

\hat{\theta}_{MDI} = \arg\min_{\theta} K[f(x) : f^*(x|\theta)].

The entropy loss has been used with frequentist and Bayesian risk functions in various parametric estimation problems and for model selection; see Soofi [50] and references therein. Akaike [3] observed that decomposing the log-ratio in (7.3) gives

K(f : f^*|\theta) = -E_f[\log f^*(X|\theta)] - H[f(x)],   (7.43)

where H[f(x)] is the entropy of f. Since the entropy of the data-generating distribution is free of the parameters, the second term in (7.43) is ignored in the derivation of the AIC for model selection. The minimization of the information discrepancy between the unknown data-generating distribution and the model is operationalized by maximizing the average log-likelihood function in (7.43). Since the AIC type measures are derived by minimizing the first term in (7.43) and the second term is ignored, the AIC type measures provide criteria for model comparison purposes only, and do not provide information diagnostic about the model fit. An alternative approach for estimating K(J : j*10) when f is an unknown distribution is proposed by Soofi et al. [54]. They considered the moment class (7.14) and showed that if f E nO and j* is the ME model in nO, then the first term in (7.43) becomes the entropy of j* and

K(f : f^*|\theta) = H[f^*(x|\theta)] - H[f(x|\theta)].   (7.44)

This equality defines the information distinguishability (ill) between distributions in nO' The first term is the entropy of the parametric model ME (7.26) and the second term is the entropy of a distribution which is unknown other than the density is a member of a general moment class (7.14). Like the Akaike's decomposition (7.43), this decomposition proves to be quite useful for developing model selection criteria. Ebrahimi [10] has given other conditions where equivalence of two entropies implies that two distributions are equal.


ID statistics are obtained by estimating (7.3) via (7.44). The normalized ID index for the continuous case is computed as:

ID(f_n : f^*|\theta_n) = 1 - \exp[-K(f_n : f^*|\theta_n)] = 1 - \exp\{H[f_n(x|\theta_n)] - H[f^*(x|\theta_n)]\},   (7.45)

where f_n is a nonparametric estimate with entropy H[f_n(x|\theta_n)] and moments \theta_n = (\theta_{1,n}, \ldots, \theta_{A,n})', and f^* is the ME model in \Omega_{\theta_n}. An ID(f_n : f^*|\theta_n) = 0 indicates the perfect fit; i.e., f^* is a perfect parameterization of f_n. A lower bound for (7.45) in terms of variation distance is given by (7.12), ID(f_n : f^*|\theta_n) \ge \tfrac{1}{2}V^2(f_n : f^*|\theta_n). Implementation of ID indices of fit includes two steps. First, a parametric model f^*(x|\theta) is selected based on the maximum entropy characterization of the densities of the parametric families. Many commonly known parametric families are shown to admit ME characterization. On the other hand, for a parametric model, one may easily identify the moment class \Omega_\theta by writing the density in the exponential form (7.26). The entropy expressions (7.27) for the well known parametric families are tabulated; see, e.g., Soofi et al. [54]. The second step for implementation of ID indices is the nonparametric estimation of H[f_n(x|\theta_n)]. Various nonparametric entropy estimates for continuous distributions are developed in the literature which can be used as H(f_n) in (7.45). However, maintaining the non-negativity of the estimate of (7.45) is an important issue. For this purpose, the parameters of the maximum entropy H[f^*(x|\theta_n)] in (7.45) must be estimated by the moments of the density f_n whose entropy is H[f_n(x|\theta_n)]; for references and more details, see Soofi and Retzer [53]. For example, many current results in life testing are based on the assumption that the life of a system is described by an exponential distribution. Of course, in many situations this assumption is suspect. When F is a distribution with support on the positive real line, then the exponential distribution is the ME model f^*(x|\theta) in the moment class,

\Omega_\theta = \{f(x|\theta) : E(X) \le \theta\}.

We can estimate \theta using the mean \theta_n of a nonparametric density estimate f_n with entropy H(f_n) and compute the ID statistic ID(f_n : f^*|\theta_n). Then for large (small) values of ID(f_n : f^*|\theta_n) we reject (accept) the exponential model for the data. Developing ME fit indices is a very promising line of research. Recent developments include Ebrahimi ([13], [11]), who uses the dynamic discrimination information function for developing tests of exponentiality and uniformity of the residual lifetime, and Mazzuchi et al. ([38], [37]), who develop Bayesian estimation and inference about entropy and the model fit.
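One possible implementation of such an exponentiality check is sketched below; the Vasicek-type spacing estimator of entropy and the window size are choices made here for illustration and are not prescribed by the text.

```python
import numpy as np

def vasicek_entropy(x, m):
    # spacing-based nonparametric entropy estimate
    x = np.sort(x)
    n = len(x)
    lo = np.clip(np.arange(n) - m, 0, n - 1)
    hi = np.clip(np.arange(n) + m, 0, n - 1)
    return float(np.mean(np.log(n / (2.0 * m) * (x[hi] - x[lo]))))

def id_exponential(x, m=3):
    h_np = vasicek_entropy(x, m)
    h_me = 1.0 + np.log(np.mean(x))      # entropy of the ME (exponential) model with mean xbar
    return 1.0 - np.exp(h_np - h_me)      # small => exponential model fits well

rng = np.random.default_rng(3)
print("exponential sample:", id_exponential(rng.exponential(2.0, size=200)))
print("uniform sample:    ", id_exponential(rng.uniform(0.0, 1.0, size=200)))
```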


6.4.


Statistical Process Control

Alwan, Ebrahimi and Soofi [5], using information functions (7.3) and (7.21), proposed information theoretic process control (ITPC) as a framework which formalizes some current SPC practices and broadens the scope of SPC so that various types of process parameters can be monitored in a unified manner. In the ITPC framework, they developed signal charting procedures for monitoring various types of moments without a need for making distributional assumptions. The most important feature of ITPC is that various monitoring problems are handled in a unified manner based upon the criterion function (7.3). The ITPC procedure for monitoring of moments consists of a three-step algorithm. In the first step, the in-control moment values \theta_0 = (\theta_{10}, \ldots, \theta_{J0}) are the only available information. The in-control moment values are used as inputs to the ME procedure which produces a model f_0(x|\theta_0) for the unknown distribution of the process variable X. The second step is for estimating the distribution of the process variable X at the monitoring state. At each stage t = 1, 2, \ldots the information at hand is the ME model for the in-control distribution and the data moments m_t = (m_{1t}, \ldots, m_{Jt}). The MDI algorithm uses the moments m_{jt} (new information) and the initial ME model f_0(x|\theta_0) as the inputs, minimizes K(f_t, f_0) with respect to f_t, and produces a new model f_t(x|m_t) for the distribution of X at the monitoring state. The third step is for detecting a change in the distribution of the process variable between the monitoring state and the in-control state. The process is monitored based on the MDI function K(f_t, f_0). The final step is the most important feature of the ITPC algorithm for monitoring moments because it solves the traditional problem of constructing charts based on problem-specific statistical criterion functions deemed suitable for the problem at hand. Alwan et al. [5] derived various MDI functions for ITPC charts, developed examples of MDI control charts for the multivariate case and process attributes, and discussed the possibility of developing control charts for detecting distributional change by application of (7.45). As an example, Alwan et al. [5] examined the performance of the Information Chart for monitoring the mean and variance of the process variable. For monitoring of mean and variance, the conventional SPC assumption of normality is not needed. When the in-control parameters \theta = (\mu_0, \sigma_0^2) are given, the model f^*(x|\mu_0, \sigma_0^2) = N(\mu_0, \sigma_0^2) is found as the ME solution. At the monitoring stage, using the sample mean and variance m_t = (\bar{x}_t, s_t^2), the MDI control function for detecting mean and/or variance shifts is


IMV_t = IM_t + IV_t = \frac{n}{2}\frac{(\bar{x}_t - \mu_0)^2}{\sigma_0^2} + \frac{n}{2}\Big[\frac{s_t^2}{\sigma_0^2} - \log\frac{s_t^2}{\sigma_0^2} - 1\Big].

The first term, IM_t, measures the information discrepancy due to the process mean and the second term, IV_t, measures the information discrepancy due to the process variation. The IM-chart constructed by plotting IM_t is equivalent to the Shewhart mean chart. Using \mu_0 for the mean in the monitoring state instead of \bar{x}_t gives IM_t = 0, and we obtain the MDI control function for the process variance, IV_t, shown in IMV_t. Note that the term n s_t^2/\sigma_0^2 in IV_t is the chi-square statistic associated with the S^2 control chart that detects shifts in process dispersion. Thus, IMV_t embraces two control charting procedures traditionally used for mean and variance as its special cases.
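Under the normality working model discussed above, one way to compute such a monitoring statistic is sketched below. The form of IMV_t used here is n times the Kullback-Leibler divergence between N(\bar{x}_t, s_t^2) and the in-control N(\mu_0, \sigma_0^2); it is an illustrative reconstruction under that assumption, not the authors' published derivation, and control limits are not derived.

```python
import numpy as np

def imv(sample, mu0, sigma0_sq):
    n = len(sample)
    xbar, s2 = np.mean(sample), np.var(sample)          # np.var uses the MLE (ddof=0)
    im = n * (xbar - mu0) ** 2 / (2.0 * sigma0_sq)       # mean term IM_t
    iv = n / 2.0 * (s2 / sigma0_sq - np.log(s2 / sigma0_sq) - 1.0)   # variance term IV_t
    return im, iv, im + iv

rng = np.random.default_rng(4)
in_control = rng.normal(0.0, 1.0, size=25)
shifted    = rng.normal(0.8, 1.4, size=25)
for name, data in [("in control", in_control), ("shifted", shifted)]:
    im, iv, imv_t = imv(data, mu0=0.0, sigma0_sq=1.0)
    print(f"{name:11s}  IM_t={im:6.2f}  IV_t={iv:6.2f}  IMV_t={imv_t:6.2f}")
```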

6.5.

Prediction Problems

The entropy-moment equality (7.28) is shown to play a key role in the prediction problem (Shepp, Slepian, and Wyner [46]). Let \hat{Y} denote a predictor of Y. Then (7.28) provides a sharp lower bound for the prediction error variance in terms of the entropy of X = Y - \hat{Y}:

E(Y - \hat{Y})^2 \ge \frac{e^{2H(Y - \hat{Y})}}{2\pi e},   (7.46)

with equality holding when Y - \hat{Y} is Gaussian. Using (7.46), Pourahmadi and Soofi [43] developed a sharp lower bound for the prediction error variance of non-Gaussian ARMA processes. Their lower bound is for the variance of any unbiased predictor. As an example, consider the ARMA(1, 1) model

Y_t - \phi Y_{t-1} = Z_t + \theta Z_{t-1}, \quad \phi + \theta \ne 0, \; |\phi| < 1, \; |\theta| < 1,

where \{Z_t\} is a sequence of i.i.d. random variables with mean zero and variance \sigma^2, called the innovation process. The model is stationary. For prediction of Y_0 based on the past Y_t, t = -1, -2, \ldots, a result of Pourahmadi and Soofi [43] gives

E|Y_0 - \hat{Y}_0|^2 \ge \frac{e^{2H(Z_0)}}{2\pi e}\,\frac{\log|\theta|}{\log|\phi|},   (7.47)

with equality holding for Gaussian processes. Conceptually, the role of entropy in (7.47) is the role of the inverse of Fisher information which provides a lower bound for the variance of an unbiased estimator via the Cramer-Rao inequality.


When the distribution of the innovation in the ARMA process is known, H(Z_0) can be estimated parametrically. However, the more realistic and interesting case is when the innovation distribution is unknown. Then a nonparametric estimate of the entropy can be used as a yardstick against which to gauge the fits of various competing parametric models assessed through their one-step-ahead prediction error variances. For Gaussian data, details of this idea have been worked out in Mohanty and Pourahmadi [40] and references therein. For the non-Gaussian case, a nonparametric estimate of the innovation process entropy is needed.
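The entropy bound in (7.46) is easy to examine for specific innovation distributions; the sketch below uses a Laplace innovation (an arbitrary non-Gaussian choice, for which the entropy is 1 + log 2b and the variance 2b^2) and a Gaussian check where the bound is attained.

```python
import numpy as np

b = 1.3
H_laplace = 1.0 + np.log(2.0 * b)
bound = np.exp(2.0 * H_laplace) / (2.0 * np.pi * np.e)
variance = 2.0 * b ** 2
print("entropy bound:", bound, " Laplace variance:", variance, " gap:", variance - bound)

sigma2 = 1.7
H_norm = 0.5 * np.log(2.0 * np.pi * np.e * sigma2)
print("Gaussian check (bound equals variance):",
      np.exp(2.0 * H_norm) / (2.0 * np.pi * np.e), "=", sigma2)
```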

References [1] Abel, P. S. (1991). Information and the Design of Life Tests. Ph.D. dissertation, George Washington University, Department of Operation Research. [2] Abel, P. S., and Singpurwalla, N. D. (1994). To Survive or to Fail: That is the Question, The American Statistician, 48: 18-21. [3] Akaike, H. (1974). A New Look at the Statistical Model Identification, IEEE Trans. Automat. Contr., AC-19: 716-723. [4] Akaike, H. (1973). Information Theory and an Extension ofthe Maximum Likelihood Principle, 2nd International Symposium on Information Theory, 267-281. [5] Alwan, L. C., Ebrahimi, N., and Soofi, E. S. (1998). Information theoretic framework for process control, European Journal ofOperational Research, 111: 526-542. [6] Bernardo, J. M. (1979). Expected Information as Expected Utility, Annals of Statistics, 7: 686-690. [7] Cameron, A. C. and Windmeijer, F. A. G. (1997). An R-squared Measure of Goodness of Fit for Some Common Nonlinear Regression Models, Journal of Econometrics, 77: 329-342. [8] Carota, c., G. Parmigiani, and Polson, N. G. (1996). Diagnostic Measures for Model Criticism, Journal ofthe American Statistical Association, 91: 753-762. [9] Csiszar, 1. (1991). Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference in Linear Inverse Problems, Ann. Statist., 19: 2032-66. [10] Ebrahimi, N. (2001 a). Families of Distributions Characterized by Entropy, IEEE Trans. on Information Theory, 97: 2042-2045. [11] Ebrahimi, N. (2001b). Testing for Uniformity of the Residual Lifetime Based on Dynamic Kullback-Leibler Information, Annals of the Inst. of Stat. Math., 53: 325-337.


[12] Ebrahimi, N. (2000). The Maximum Entropy Method for Lifetime Distributions, Sankhya A, 62: 236-243.
[13] Ebrahimi, N. (1998). Testing Exponentiality of the Residual Life, Based on Dynamic Kullback-Leibler Information, IEEE Trans. on Reliability, 47: 197-201.
[14] Ebrahimi, N. (1996). How to Measure Uncertainty in the Residual Lifetime Distributions, Sankhya A, 58: 48-57.
[15] Ebrahimi, N. and Kirmani, S. (1996a). Some Results on Ordering of Survival Functions Through Uncertainty, Stat. and Probab. Letters, 29: 167-176.
[16] Ebrahimi, N. and Kirmani, S. (1996b). A Characterization of the Proportional Hazards Model Through a Measure of Discrimination Between Two Residual Life Distributions, Biometrika, 83: 233-235.
[17] Ebrahimi, N. and Kirmani, S. (1996c). A Measure of Discrimination Between Two Residual Lifetime Distributions and Its Applications, Ann. Inst. Statist. Math., 48: 257-265.
[18] Ebrahimi, N. and Soofi, E. S. (1998). Recent Developments in Information Theory and Reliability Analysis. In A. P. Basu, S. K. Basu and S. Mukhopadhyay, editors, Frontiers in Reliability, pp. 125-132. World Scientific, New Jersey.
[19] Ebrahimi, N., Hamadani, G. G., and Soofi, E. S. (1991). Maximum Entropy Modeling with Partial Information on Failure Rate, School of Business Administration, University of Wisconsin-Milwaukee.
[20] Ebrahimi, N., Maasoumi, E., and Soofi, E. S. (1999a). Ordering Univariate Distributions by Entropy and Variance, Journal of Econometrics, 90: 317-336.
[21] Ebrahimi, N., Maasoumi, E., and Soofi, E. S. (1999b). Measuring Informativeness of Data by Entropy and Variance. In D. Slottje, editor, Advances in Econometrics: Income Distribution and Methodology of Science, Essays in Honor of Camilo Dagum, pp. 61-77, New York: Physica-Verlag.
[22] Ebrahimi, N., Soofi, E. S., and Zahedi, H. (2002). Information Properties of Order Statistics and Spacings, School of Business Administration, University of Wisconsin-Milwaukee.
[23] Hastie, T. (1987). A Closer Look at the Deviance, The American Statistician, 41: 16-20.
[24] Hoeffding, W. and Wolfowitz, J. (1958). Distinguishability of Sets of Distributions, Annals of Mathematical Statistics, 29: 700-718.
[25] Jaynes, E. T. (1968). On the Rationale of Maximum-Entropy Methods, Proceedings of IEEE, 70: 939-952.


[26] Jaynes, E. T. (1957). Information Theory and Statistical Mechanics, Physics Review, 106: 620-630.
[27] Jeffreys, H. (1946). An Invariant Form for the Prior Probability in Estimation Problems, Proceedings of the Royal Statistical Society (London), A, 186: 453-461.
[28] Joe, H. (1989). Relative Entropy Measures of Multivariate Dependence, Journal of the American Statistical Association, 84: 157-164.
[29] Kullback, S. (1987). The Kullback-Leibler Distance, The American Statistician, 41: 340.
[30] Kullback, S. (1971). Marginal Homogeneity of Multidimensional Contingency Tables, The Annals of Mathematical Statistics, 42: 594-606.
[31] Kullback, S. (1967). A Lower Bound for Discrimination Information in Terms of Variation, IEEE Transactions on Information Theory, IT-13.
[32] Kullback, S. (1959). Information Theory and Statistics, N.Y.: Wiley (reprinted in 1968 by Dover).
[33] Kullback, S. and Leibler, R. A. (1951). On Information and Sufficiency, The Annals of Math. Stat., 22: 79-86.
[34] Lehmann, E. L. (1983). Theory of Point Estimation, N.Y.: Wiley.
[35] Lindley, D. V. (1961). The Use of Prior Probability Distributions in Statistical Inference and Decision, Proceedings of the Fourth Berkeley Symposium, 1: 436-468, Berkeley: UC Press.
[36] Lindley, D. V. (1956). On a Measure of Information Provided by an Experiment, The Annals of Math. Stat., 27: 986-1005.
[37] Mazzuchi, T. A., Soofi, E. S. and Soyer, R. (2002). Bayes Estimate and Inference for Entropy and Information Index of Fit, submitted for publication.
[38] Mazzuchi, T. A., Soofi, E. S. and Soyer, R. (2000). Computations of Maximum Entropy Dirichlet for Modeling Lifetime Data, Computational Statistics and Data Analysis, 32: 361-378.
[39] McCulloch, R. E. (1989). Local Model Influence, Journal of the American Statistical Association, 84: 473-478.
[40] Mohanty, R. and Pourahmadi, M. (1996). Estimation of the Generalized Prediction Error Variance of a Multiple Time Series, Journal of the American Statistical Association, 91: 294-299.
[41] Park, S. (1996). Fisher Information on Order Statistics, Journal of the American Statistical Association, 91: 385-390.
[42] Park, S. (1995). The Entropy of Consecutive Order Statistics, IEEE Trans. on Information Theory, IT-41: 2003-2007.


[43] Pourahmadi, M. and Soofi, E. S. (2000). Predictive Variance and Information Worth of Observations in Time Series, Journal of Time Series Analysis, 21: 413-434.
[44] Savage, L. J. (1954). The Foundations of Statistics, New York: John Wiley.
[45] Shannon, C. E. (1948). A Mathematical Theory of Communication, Bell System Technical Journal, 27: 379-423.
[46] Shepp, L. A., Slepian, D. and Wyner, A. D. (1980). On Prediction of Moving-Average Processes, The Bell System Technical Journal.
[47] Shore, J. E. and Johnson, R. W. (1980). Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy, IEEE Transac. on Info. Theory, IT-26: 26-37.
[48] Simon, G. (1973). Additivity of Information in Exponential Family Probability Laws, Journal of the American Statistical Association, 68: 478-482.
[49] Soofi, E. S. (2000). Principal Information Theoretic Approaches, Journal of the American Statistical Association, 95: 1349-1353.
[50] Soofi, E. S. (1997). Information Theoretic Regression Methods, in T. B. Fomby and R. C. Hill, editors, Advances in Econometrics: Applying Maximum Entropy to Econometric Problems, pp. 25-83, Greenwich, CT: JAI Press.
[51] Soofi, E. S. (1994). Capturing the Intangible Concept of Information, Journal of the American Statistical Association, 89: 1243-1254.
[52] Soofi, E. S. (1992). A Generalizable Formulation of Conditional Logit with Diagnostics, Journal of the American Statistical Association, 87: 812-816.
[53] Soofi, E. S., and Retzer, J. J. (2002). Information Indices: Unification and Applications, Journal of Econometrics, 107: 17-40.
[54] Soofi, E. S., Ebrahimi, N., and Habibullah, M. (1995). Information Distinguishability with Application to Analysis of Failure Data, Journal of the American Statistical Association, 90: 657-668.
[55] Theil, H. (1971). Principles of Econometrics, New York: John Wiley.
[56] Teitler, S., Rajagopal, A. K., and Ngai, K. L. (1986). Maximum Entropy and Reliability Distributions, IEEE Transactions on Reliability, R-35: 391-395.
[57] Wong, K. M. and Chen, S. (1990). The Entropy of Ordered Sequences and Order Statistics, IEEE Trans. on Information Theory, IT-36: 276-284.
[58] Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics, New York: Wiley (reprinted in 1996 by Wiley).
[59] Zellner, A. (1997). Bayesian Analysis in Econometrics and Statistics: The Zellner View and Papers, Cheltenham, UK: Edward Elgar.

Chapter 8

DESIGN AND ANALYSIS OF EXPERIMENTS FOR RELIABILITY ASSESSMENT AND IMPROVEMENT

Vijay N. Nair
University of Michigan
Ann Arbor, MI
[email protected]

Luis A. Escobar
Louisiana State University
Baton Rouge, LA
[email protected]

Michael S. Hamada
Los Alamos National Laboratory
Los Alamos, NM
[email protected]

Abstract

This chapter deals with design and analysis of experiments for accelerated testing and reliability improvement. Accelerated testing is a commonly used approach for timely assessment of reliability during product design and development. The first part of the chapter describes models, inference, and design for accelerated test experiments with time-to-failure data. We discuss applications of fractional factorial designs for reliability improvement and the complications introduced by the presence of censored data. Robust design studies for variation reduction are also briefly reviewed. The availability of degradation data mitigates many of the problems encountered with censored time-to-failure data. We review some models and issues on the design and analysis of experiments with degradation data.

Keywords:

Accelerated failure time, degradation data, factorial experiments.

1. Introduction

Global competition and increasing customer expectations dictate that products and processes must be highly reliable. This is especially true in high-technology and safety-critical applications. There is also a great deal of emphasis on reducing costs and product development cycle time and getting the product to market as quickly as possible. In this environment, we need efficient and economical methods for assessing and improving product reliability. This chapter reviews the design and analysis of experiments for reliability assessment and reliability improvement based on both lifetime, or time-to-failure (TTF), data and degradation data. Reliability estimation and assessment is a critical part of the design and development process for new products and processes. High reliability is good from a customer perspective, but it implies few failures at normal use conditions. This makes it difficult to assess reliability in a timely manner by running experiments at normal use conditions. These considerations have led to the development of accelerated testing (AT) as an important tool in design for reliability of new products and components. We review acceleration models, methods of inference, and accelerated test planning based on TTF data in Section 2. The emphasis on quality and reliability improvement in recent years has led to a renewed interest in the use of experimental design methods in industry. Much of this new-found popularity is due to the work of Taguchi [49] and his pioneering contributions on robust parameter design for variation reduction. This has led many engineers and industrial practitioners to re-discover fractional factorial designs, repackaged under the guise of the so-called "Taguchi methods." While statistically designed experiments are used extensively in manufacturing industries, there have been relatively few applications for reliability improvement. We review the use of fractional factorial designs with TTF data in Section 3. The primary feature that distinguishes reliability improvement experiments is the presence of censored data, which can be an especially serious problem in high-reliability situations. The availability of degradation data mitigates many of the problems that arise with TTF data. Fortunately, recent advances in sensing and measurement technologies are making it feasible to collect extensive amounts of degradation and related performance data. These are often much more informative than TTF data for estimating reliability. Degradation data can also be used to predict failure at an individual device level, while TTF data provide information only about population-level characteristics. However, degradation and related performance measurements are useful only if they are good predictors of failure. In Section 4, we review some models for degradation data, design of accelerated tests, and use of factorial designs for reliability improvement.

2. Accelerated Testing for Reliability Assessment

The purpose of an accelerated test (AT) is to "shorten" the life of a product (or component) by testing at conditions that accelerate degradation and time to failure. ATs are used for a variety of purposes, such as detecting failure modes or design flaws, assessing material properties, improving reliability, comparing suppliers, and monitoring processes. ATs can be used to assess reliability of materials, components, or systems. Our goal here is to review the use of ATs for reliability estimation. We focus on situations in which there is a single, or predominant, failure mode. For extensions to more general situations, see McLean [27] and Chapter 7 of Nelson [40]. There are many ways to accelerate degradation and induce failures. One approach relies on increasing the levels of stress factors such as temperature, humidity, or voltage. Another is to reduce "idle operation time" and increase usage rates such as mileage, cycles of operation, etc. Most of the published AT experiments use only a single accelerating variable, but it is common in industry to combine these accelerating methods in AT experiments with multiple accelerating variables (see Meeker and Escobar [31], Chapters 18 and 20, for examples and more details). For some ATs, the stress is varied during the experiment (step-stress, progressive-stress), but the most common situation is to keep the stress at a constant level. Nelson ([40], Chapter 11) discusses models and test plans and provides extensive references (see also Viertl [52] for additional references).

2.1. Acceleration Models

Reliability estimation based on ATs involves extrapolation across both time and acceleration variables. For this reason, it is important to have models that can be justified based on the underlying physics-of-failure or through extensive empirical evidence. Let $s = (s_1, \ldots, s_k)$ be a set of $k$ acceleration variables used in an AT, with $s_0$ denoting their values at the normal or use condition. Let $T(s)$ denote the TTF at acceleration level $s$ and $F(t;s)$ its distribution. Further, let $F_0(t) = F(t;s_0)$ denote the distribution of the TTF at the use condition. If one defines $\tau(t;s)$ through the relationship $F(t;s) = F_0[\tau(t;s)]$, then $\tau(t;s)$ is the acceleration transform, which characterizes how the acceleration variables change the time scale and induce early failures. AT experiments typically have a limited number of stress levels and test units at each level. Thus, we have to rely on parametric models for both the acceleration transform $\tau(t;s)$ and the TTF distribution $F_0(t)$. The most common acceleration transform in the literature is the scale accelerated failure time (SAFT) model with $\tau(t;s) = \tau(s)\,t$. In this case, the effect of the acceleration variables is captured through a simple linear transformation with scaling constant $\tau(s)$. This leads to a location-scale model in terms of the log-TTF data. Specifically, if $Y(s) = \log[T(s)]$, we can write $Y(s) = \mu(s) + \sigma\epsilon$, where $\mu(s) = -\log[\tau(s)]$ is the location parameter and $\sigma$ as well as the distribution of $\epsilon$ do not depend on the acceleration factors. We can write the distribution of $T(s)$ as

$$F(t;s) = \Phi\!\left[\frac{\log(t) - \mu(s)}{\sigma}\right] \qquad (8.1)$$

where $\Phi(z)$ is the CDF of $\epsilon$. The two most common choices for $\Phi(\cdot)$ are the normal and SEV distributions, which lead to lognormal and Weibull models for TTF data. Figure 8.1 illustrates a SAFT model in which $\mu(s)$ is a quadratic function of $s = \log(\text{Stress})$ and $\epsilon$ has an SEV distribution, i.e., $T(s) \sim \text{Weibull}[\mu(s), \sigma]$. The use condition corresponds to Stress = 70. If the test is truncated at 100 thousand cycles of operation, we can see from the figure that more than 90% of the units will be censored at the use condition. Thus, one has to conduct the ALT at higher stress values and use the parametric models for the acceleration transform and TTF distribution to extrapolate the results. In most applications, the location parameter $\mu(s)$ is taken to be a linear function of the form $\mu(s) = a_0 + a_1 g_1(s) + \cdots + a_m g_m(s)$, where $g_1(\cdot), \ldots, g_m(\cdot)$ are known functions of the $k$ stress variables. This formulation includes cases with a single acceleration variable and with multiple acceleration variables, with or without interactions. Well-known examples in the single-variable case include the Arrhenius model for characterizing the effect of temperature and the power law model for the effect of voltage (see Meeker and Escobar [31], Chapter 18, and Nelson [40], Chapter 2). For the Arrhenius model, if $s$ = temperature in °C, we get $\mu(s) = a_0 + a_1 g(s)$, where $g(s) = \gamma/(s + 273.15)$ and $\gamma$ is a known constant. Similarly, for the power-law model (or inverse power rule), if $s$ = voltage (or voltage-stress), we get $\mu(s) = a_0 + a_1 g(s)$, where $g(s) = \log(s)$. The Eyring relationship (Meeker and Escobar [31], Chapter 18) allows incorporation of temperature and another accelerating variable such as relative humidity. Another useful model is an extension of the Coffin-Manson relationship, which relates lifetime to frequency and temperature cycling (see Tobias and Trindade [50], Chapter 7). The SAFT models are widely used because of their simplicity and empirical/physical justification. There have been recent applications with more general acceleration transforms based on the underlying physics-of-failure (Meeker and LuValle [30]). There are also empirical and theoretical ALT models in which the scale parameter $\sigma$ also varies with stress (Pascual and Meeker [43] and Nelson [40], page 272). Nevertheless, there is a major need for additional research in this area.

Figure 8.1. SAFT Model Example (10%, 50%, and 90% quantiles of time to failure plotted against Stress).

The use of accelerated testing involves a considerable amount of extrapolation. Thus, one should incorporate as much subject matter knowledge and past evidence as possible in developing and using the acceleration models. Meeker, Escobar, Doganaksoy, and Hahn [34] provide some guidelines along these directions. Meeker and Escobar ([31], [32]) discuss several potential pitfalls when making inferences from accelerated life test (ALT) data due to multiple failure modes, masked failure modes, faulty comparisons, and so on.
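As a small illustration of the SAFT/Arrhenius relationship described above, the sketch below computes the acceleration factor between a use temperature and a test temperature. The activation energy, the temperatures, and the use of 11605 (approximately the reciprocal of Boltzmann's constant in eV) for the "known constant" $\gamma$ are assumptions made for this example, not values from the chapter.

```python
import numpy as np

GAMMA = 11605.0  # assumed "known constant" in g(s) = gamma / (s + 273.15)

def arrhenius_af(temp_use_C, temp_acc_C, Ea_eV):
    """Acceleration factor implied by the Arrhenius/SAFT model: the log lifetime
    shifts by Ea * gamma * (1/T_use - 1/T_acc) with temperatures in Kelvin."""
    g_use = GAMMA / (temp_use_C + 273.15)
    g_acc = GAMMA / (temp_acc_C + 273.15)
    return np.exp(Ea_eV * (g_use - g_acc))

# hypothetical example: 0.7 eV activation energy, use at 40 C, tests at 85 C and 125 C
for t_acc in (85.0, 125.0):
    print(t_acc, arrhenius_af(40.0, t_acc, 0.7))
```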

2.2. Reliability Estimation from ALT Data

Suppose the ALT is conducted at $J$ levels of the acceleration variables $s_1, \ldots, s_J$ with $n_i$ test units at each level. The goal is to estimate the parameters or some required percentiles/quantiles of $F(t;s_0)$, the TTF distribution at the use condition. This is done by fitting a regression model to the data at accelerated conditions and extrapolating the results to the normal or use condition. In most ALTs, the number of stress levels as well as the number of test units are small. Further, even with accelerated testing, most tests have to be terminated before all units fail. It is also possible that the units are inspected periodically, in which case we also have interval censoring. Given the limited amount of data, one has to rely on parametric models, both for the acceleration transform and the TTF distribution, to make inference and extrapolate the results. There is an extensive literature on the analysis of censored data based on parametric regression models. In particular, log-linear regression with lognormal or Weibull SAFT models can be done using many of the available software packages. Estimation of the model parameters as well as functions of the parameters, such as percentiles or quantiles, can be done using likelihood-based


methods. There is also a growing literature on the use of Bayesian methods with censored data (see Hamada and Wu [20] and references therein). It is critical, however, that the underlying parametric model assumptions are carefully checked before extrapolating the results and making inference at use conditions. Meeker and Escobar ([31], Chapter 19) recommend a systematic strategy for model selection and analysis with SAFT models. This involves extensive use of graphical methods (probability plots, scatter plots, residual analysis, and other diagnostics) to check the adequacy of the parametric time-to-failure distribution as well as the proposed relationship between lifetime and the accelerating variable. This strategy can be easily implemented using SPLIDA (S-PLUS Life Data Analysis), a set of S-PLUS functions with a graphical user interface (GUI) designed for the analysis of reliability data. The most up-to-date version of SPLIDA can be downloaded from www.public.iastate.edu/~splida. Other available software packages include JMP (www.jmpdiscovery.com), SAS (www.sas.com/statistics), MINITAB, and ReliaSoft's ALTA 6 (www.ReliaSoft.com).
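The following sketch illustrates the kind of likelihood-based fit described above for a lognormal SAFT regression with right-censored data, using generic scientific-Python routines rather than the packages named in the text. The data, covariate values, starting values, and use-condition value are all invented for illustration.

```python
import numpy as np
from scipy import stats, optimize

# hypothetical ALT data: covariate g = g(s), observed time, and failure indicator
g = np.array([4.0, 4.0, 4.0, 4.4, 4.4, 4.4, 4.8, 4.8, 4.8, 4.8])
time = np.array([900., 1500., 2000., 400., 650., 2000., 120., 260., 300., 410.])
failed = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 1])   # 0 = right censored at `time`

def negloglik(theta):
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)                         # keep sigma positive
    z = (np.log(time) - (b0 + b1 * g)) / sigma
    ll_fail = stats.norm.logpdf(z) - np.log(sigma) - np.log(time)   # lognormal density
    ll_cens = stats.norm.logsf(z)                                   # survival for censored
    return -np.sum(np.where(failed == 1, ll_fail, ll_cens))

fit = optimize.minimize(negloglik, x0=np.array([14.0, -2.0, 0.0]), method="Nelder-Mead")
b0, b1, sigma = fit.x[0], fit.x[1], np.exp(fit.x[2])
g_use = 3.5                                            # hypothetical use-condition value of g(s)
t10_use = np.exp(b0 + b1 * g_use + sigma * stats.norm.ppf(0.10))   # 0.10 quantile at use
print(fit.x, t10_use)
```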

2.3. Planning Accelerated Life Tests

Careful planning of ALTs is very important because of the high cost and the great time constraints under which these experiments are usually run. The goal of the study must be clearly identified up front, as test plans that are efficient and economical for a given purpose can perform poorly for another. One must also identify the time, cost, and equipment constraints, as these are often the major limiting factors. Other considerations include specification of the experimental region, scale and range of the accelerating variables, levels for the non-accelerating variables, maximum degree of extrapolation that is allowed, potential interactions among experimental variables, and so on. Meeker and Hahn [28], Nelson ([40], Chapter 6), and Meeker and Escobar ([31], Chapter 20) describe methods for planning statistically efficient ALTs that also meet practical constraints. Meeker and Escobar [29] review some developments on ALT planning, including Bayesian plans. Most of the ALT planning in the literature assumes that the goal is to estimate either a particular quantile, $t_p$, in the lower tail of the TTF distribution at design (use) conditions or the underlying parameters of the TTF distribution. In the former case, the test plans are developed so as to minimize the (asymptotic) variance of $\hat{t}_p$, the ML estimator of the target quantile. In the latter case with several parameters, one typically uses the D-optimality criterion of maximizing the determinant of the Fisher information matrix of the estimated parameters. There are also results in the literature on test plans for efficiently estimating a specified percentile or hazard rate of the TTF distribution (see Nelson [40], Chapter 6 and references therein).

Most of these results are based on a SAFT model with one or two accelerating variables: $F(t;s) = \Phi\{[\log(t) - \mu]/\sigma\}$, where $\mu = \mu(s) = a_0 + a_1 g(s)$ for a one-variable ALT and $\mu = \mu(s) = a_0 + a_1 g_1(s) + a_2 g_2(s)$ for the two-variable case. For the single-variable ALT, the optimum test plans select only two levels of the accelerating variable and allocate the test units appropriately among these levels. However, these plans are very non-robust to the model assumptions. Moreover, there is no way to check the model adequacy based on just two levels of the acceleration factor. This has led to the development of more practical test plans that provide a good compromise between efficiency and robustness. Nelson ([40], Chapter 6) describes and reviews statistically optimum and compromise plans for the one-factor ALT planning problem. Escobar and Meeker [13] discuss a systematic method for obtaining statistically optimum and more practical compromise test plans for two-factor ALTs with censored data. Meeter and Meeker [35] develop methods for planning one-factor ALTs when both the scale ($\sigma$) and the location ($\mu$) depend on the accelerating variables. Chaloner and Larntz [5] present a Bayesian approach to obtain ALT plans for Weibull and lognormal time-to-failure distributions. Nelson [41] provides an extensive list of references on accelerated test plans. Meeker and Escobar ([31], Chapter 20) discuss planning criteria, statistically optimum plans and compromise test plans for one and two variables, and simulation techniques to anticipate the information to be obtained from an ALT plan. They also provide extensive references for planning and guidelines for the use of simulation for the evaluation of ALTs. It is important to assess the sensitivity of ALT plans to changes in optimization criteria, model assumptions, and the parameter input values used to develop the plans. The optimum test plans for a given criterion can be used as a starting point in finding appropriate practical plans. It is important to minimize extrapolation as much as possible in the stress/acceleration variable space. While increasing stress can reduce the variance of the reliability estimates, undue acceleration can induce new failure modes or introduce other biases, and thereby invalidate the conclusions of the study. The efficiency calculations in deriving the test plans are based on the asymptotic variance of the MLEs. It is important to calibrate these results for finite samples by simulating the test plans. Simulation is an effective tool to assess a given ALT plan, anticipate the results to be obtained from the experiment, examine the precision of estimates, assess sensitivity to model assumptions, and uncover any potential estimability problems.

Software for test planning: Until recently, there was no readily available and easy-to-use software for ALT planning. Due to faster computing, friendlier graphical user interfaces (GUIs), availability of modern graphical capabilities, and high demand for reliability software, we are beginning to see new developments in this area. SPLIDA has easy-to-use capabilities that include generation of ALT plans, computation of sample sizes, evaluation and simulation of an ALT plan, and computation of acceleration factors. See also Meeker and Escobar [29] for a brief description of other software for ALT planning.
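As a minimal illustration of using simulation to evaluate a candidate ALT plan, the sketch below repeatedly generates data from an assumed lognormal SAFT model under a two-level plan, refits the model by ML, and summarizes the sampling variability of the estimated 0.10 quantile at the use condition. The plan, censoring time, and "true" parameter values are hypothetical, and the sketch is not tied to any particular planning software.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(42)

# hypothetical two-point ALT plan: levels of g(s), unit allocation, censoring time
plan_g = np.array([4.3, 4.8])
plan_n = np.array([30, 20])
t_cens = 1000.0
true = dict(b0=14.0, b1=-2.0, sigma=0.8)      # assumed "true" lognormal SAFT model
g_use = 3.5                                    # use condition

def simulate_once():
    g = np.repeat(plan_g, plan_n)
    mu = true["b0"] + true["b1"] * g
    t = np.exp(mu + true["sigma"] * rng.standard_normal(g.size))
    failed = t <= t_cens
    t = np.minimum(t, t_cens)

    def negloglik(theta):
        b0, b1, log_s = theta
        s = np.exp(log_s)
        z = (np.log(t) - (b0 + b1 * g)) / s
        ll = np.where(failed, stats.norm.logpdf(z) - np.log(s) - np.log(t),
                      stats.norm.logsf(z))
        return -ll.sum()

    fit = optimize.minimize(negloglik, x0=np.array([14.0, -2.0, np.log(0.8)]),
                            method="Nelder-Mead")
    b0, b1, s = fit.x[0], fit.x[1], np.exp(fit.x[2])
    return np.exp(b0 + b1 * g_use + s * stats.norm.ppf(0.10))   # estimated t_0.10 at use

est = np.array([simulate_once() for _ in range(200)])
true_t10 = np.exp(true["b0"] + true["b1"] * g_use + true["sigma"] * stats.norm.ppf(0.10))
print("true t_0.10:", true_t10, "sim mean:", est.mean(), "sim sd:", est.std())
```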

3. Reliability Improvement

While statistically designed experiments have been used extensively in industry, applications to reliability improvement are more limited. Condra [9], Hamada ([19], [21]), Phadke ([45], Chapter 11), Taguchi [49], and Wu and Hamada ([55], Chapter 12) describe several case studies and applications of reliability improvement. Most of these applications are based on the usual factorial and fractional factorial designs with TTF data as the response.

3.1. Analysis of Lifetime Data from Designed Experiments

The most important feature that distinguishes reliability improvement experiments from other studies is the presence of incomplete data (a combination of right, left, or interval censoring). This is a serious problem in highly fractionated industrial experiments with few or no replications. The issues are especially problematic in high-reliability applications where there will be few failures. Consider the following experiment to improve router-bit lifetimes discussed in Phadke [44]. Router bits are used to cut printed wiring boards. As the bits get dull, they cause excessive dust which builds up on the edge of the boards and causes friction and abrasion during circuit pack insertion. The cleaning operation is costly, and changing router bits is also expensive. A 32-run fractional factorial experiment was conducted to study the effect of nine factors on router-bit lifetimes. The design factors were: A - Suction, B - X-Y feed, C - In-feed, D - Type of bit, E - Spindle position, F - Suction foot, G - Stacking height, H - Depth of slot, and I - Speed. Factors D and E were at four levels while the remaining seven were at two levels. The experimental design and data are given in Phadke [44] and Hamada and Wu [20]. Lifetimes were measured in (hundreds of) inches of cut in the x-y plane. The router bits were inspected every hundred inches, so all of the data are interval-censored. Even if we ignored this and used the mid-points of the intervals as exact failure times, we would still have to deal with right censoring, as the experiment was terminated after 17 hundred inches. Eight runs did not fail at termination, so these are all right censored. The most commonly used model for analyzing TTF data is the log-linear model
$$\log(T_i) = \mathbf{x}_i'\boldsymbol{\beta} + \sigma\epsilon_i, \qquad (8.2)$$


where $T_i$ is the lifetime of the $i$th unit, $\mathbf{x}_i$ is the corresponding design-factor combination ($i$th row of the design matrix), the $\beta_j$, for $j \ge 2$, are the design-factor effects, and the error term $\epsilon_i$ has location parameter equal to 0 and scale parameter equal to 1. Common parametric models for $\epsilon_i$ are normal and SEV, which translate to lognormal and Weibull models for TTF data. The analysis is straightforward when there is no censoring. For example, under lognormal models, we can take the logarithm of the TTF data and use ordinary least squares for estimating the design effects. For other parametric situations such as the Weibull (SEV) regression model, we can use MLEs and likelihood-ratio based inference (see, for example, Meeker and Escobar [31], Chapters 17 and 19). Inference in the presence of censoring is more involved. However, most software packages include capabilities for analyzing censored data for common parametric models such as lognormal and Weibull (see Section 2.2). Many of the simple and attractive diagnostics and model-selection methods associated with balanced factorial designs are no longer valid with censored data. The estimates will be correlated, so tools such as half-normal plots cannot be used. One has to rely on likelihood-based methods for model selection (Meeker and Escobar [31], Chapters 17 and 19). There is considerable variation in the availability of diagnostics for model selection in standard software packages. The biggest problem with censoring is that some effects may be non-estimable. For example, if we have only one factor at two levels and all the observations at one level are censored, then it is easy to see that the location parameter at that level, and hence the design effect, is not estimable. One can get lower bounds in certain situations, but the MLE itself does not exist. This is in fact the case for the router-bit experiment. As discussed in Hamada and Wu [20], there were 23 effects of interest in this study: the intercept term, seven two-level main effects, three two-level effects associated with the four-level factor D, and 12 two-factor interactions. The MLEs do not exist for any model that includes these 23 terms. A naive (and wrong) approach is to treat the right-censored observations as failures and apply the usual methods of analysis (Phadke [45], Chapter 11). This can lead to serious biases. Schmee and Hahn [47], Hahn, Morgan, and Schmee [17], and Hamada and Wu [18] propose methods based on iterative strategies that impute failure times from the censored observations. We describe here a Bayesian approach that is more appealing and has other advantages related to model selection and inference from designs with complex aliasing (Hamada and Wu [20]). The approach and algorithm for the lognormal case with conjugate priors is described in detail in Hamada and Wu [20]. The idea is as follows. If one uses a normal prior for $\boldsymbol{\beta}$ and an inverse gamma prior for $\sigma^2$, the posteriors with

Table 8.1. Posterior Quantiles for Router Bit Lifetime Experiment

Effect   0.025   Median   0.975
Int       1.14    1.59     2.24
I         0.51    0.77     1.21
A        -0.38    0.08     0.55
B        -1.29   -0.74    -0.40
C        -0.36    0.09     0.55
D1        0.40    0.74     1.29
D2       -0.06    0.29     0.82
D3       -1.76   -1.21    -0.86
AH       -0.35    0.11     0.58
BF       -0.83   -0.38     0.08
CG        0.41    0.76     1.30
F        -1.34   -0.73    -0.20
G        -1.56   -0.98    -0.47
H        -0.41    0.08     0.63
BG       -0.58   -0.10     0.36
BI       -0.08    0.10     0.28
CI       -0.10    0.09     0.27
GI       -1.23   -0.78    -0.49
AF       -1.23   -0.70    -0.36
CH       -0.39    0.08     0.55
AI       -0.35   -0.17     0.02
FI       -0.91   -0.47    -0.22
HI       -0.92   -0.47    -0.21
sigma     0.30    0.38     0.50

(uncensored) log-TTF data also have the same form: a normal distribution for $\boldsymbol{\beta}$ given $\sigma^2$ and an inverse gamma for $\sigma^2$. With censored data, the posteriors are more complicated, but this can be gotten around by using data augmentation to impute the "missing" values (censored observations) given the parameter values and then use the imputed data as complete (uncensored) data. These ideas can be extended to other models without conjugate priors, such as the Weibull regression model, using Markov chain Monte Carlo techniques. Table 8.1 shows the results of the Bayesian analysis for the router bit experiment using WinBUGS (www.mrc-bsu.cam.ac.uk/bugs/). In this analysis, we treated the mid-points of the interval-censored data as exact failure times, so there were only right-censored observations. The priors were chosen as follows. The 0.01 and 0.99 quantiles of the prior for $\sigma$ are 0.2666 and 1.1829. The 0.01 and 0.99 quantiles of the prior for each $\beta_j$ are approximately -4.5 and 4.5. The 0.025 and 0.975 columns give the central 95% posterior intervals for the various main effects and selected interactions. We can use these to detect the important effects. For example, the main effects of factors I and B are active while those for factors A and C are insignificant. These results can be used to identify the optimal factor settings and improve reliability. Phadke [45] reports a substantial improvement in router-bit mean lifetime (an increase from about 900 to 4,150 hundred inches) from this experiment. The Bayesian analysis above used a relatively diffuse proper prior to resolve the estimability problem caused by censoring. Nevertheless, when the MLEs do not exist, the posterior modes for non-informative (improper) priors will also not exist. So the Bayesian approach relies on prior information to resolve the non-estimability problem. The prior information is equivalent to additional


observations or experimental runs. This is fine in situations where the prior information is reliable, but we need to be cautious with using Bayesian methods in other situations. One can also incorporate prior information, such as past data on similar products, directly in likelihood-ratio based methods of inference.
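A minimal sketch of the data-augmentation idea is given below: censored log-lifetimes are imputed from their truncated conditional distribution, and the regression and variance parameters are then updated from the completed data. It uses a semi-conjugate prior (independent normal on the effects, inverse gamma on $\sigma^2$) rather than the exact conjugate formulation of Hamada and Wu [20], and the small design matrix and data are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# hypothetical design matrix X (intercept + two effects), log-times, censor indicator
X = np.column_stack([np.ones(8), [-1, -1, -1, -1, 1, 1, 1, 1], [-1, -1, 1, 1, -1, -1, 1, 1]])
y_obs = np.log(np.array([300., 410., 150., 90., 1700., 1700., 520., 260.]))
censored = np.array([0, 0, 0, 0, 1, 1, 0, 0], dtype=bool)   # 1 = right censored at y_obs

p = X.shape[1]
prior_V = 4.0 * np.eye(p)          # fairly diffuse N(0, prior_V) prior on beta
a0, b0 = 2.0, 1.0                  # inverse-gamma prior on sigma^2

beta, sig2, y = np.zeros(p), 1.0, y_obs.copy()
draws = []
for it in range(3000):
    # 1) impute censored log-lifetimes from the normal truncated above the censor point
    mu, sd = X @ beta, np.sqrt(sig2)
    a = (y_obs[censored] - mu[censored]) / sd
    y[censored] = mu[censored] + sd * stats.truncnorm.rvs(a, np.inf, random_state=rng)
    # 2) draw beta | sigma^2, y  (conditional conjugate normal update)
    Vn = np.linalg.inv(np.linalg.inv(prior_V) + X.T @ X / sig2)
    beta = rng.multivariate_normal(Vn @ (X.T @ y / sig2), Vn)
    # 3) draw sigma^2 | beta, y  (inverse gamma update)
    resid = y - X @ beta
    sig2 = 1.0 / rng.gamma(a0 + len(y) / 2.0, 1.0 / (b0 + 0.5 * resid @ resid))
    if it >= 1000:
        draws.append(np.concatenate([beta, [np.sqrt(sig2)]]))

print(np.percentile(np.array(draws), [2.5, 50, 97.5], axis=0))  # cf. the format of Table 8.1
```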

3.2. Design of Reliability Improvement Experiments: Some Issues

Fractional factorial designs (FFDs) are most commonly used in reliability improvement studies. The usual rationale for FFDs is that one can study many factors in few runs economically. The recommended strategy is as follows (Wu and Hamada [55]). Highly fractionated screening experiments with two levels are first used to identify the important factors, and then response surface designs are used to determine optimum conditions. The FFDs are balanced, so the estimated design effects are uncorrelated under the usual iid error structure. This makes inference and model selection especially straightforward and has led to the development of many simple methods such as Daniel's half-normal plots for variable selection.

Most reliability studies are conducted under some time or failure constraints, leading to right censoring. There may also be more complicated censoring patterns depending on the study in question and the data collection scheme. In the presence of censoring, (fractional) factorial designs lose some of their attractive properties. The estimators will now be correlated due to the unequal variances at the different experimental conditions induced by the different degrees of censoring. If there are replications, it would be optimal to allocate more observations to design combinations with higher degrees of censoring. Inference and model selection issues are also complicated. Perhaps the biggest concern is the estimability problem, where one or more of the design effects are not estimable if all the observations at a given level are right censored.

In usual experimental design applications, conventional wisdom suggests that the factor levels should be chosen at the extremes of the operating regions to minimize the variance of the estimated design effects. In reliability problems, however, moving the design points too far out can lead to an increase in the probability of censoring, thus offsetting the gain. This calls into question conventional wisdom on how to choose the design points. When one has historical data or information from studies on related products, such information should be used as much as possible in both the design and analysis stages. For example, if the current use condition is in the middle of the design region and we have past data at this design point, this information can be used to resolve the estimability problems. Even if we do not have such information, it is useful to run the factorial design with a few replications added at the center point.
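To make the balance property concrete, the following sketch constructs a small two-level fractional factorial design from generators and verifies that the main-effect columns are mutually orthogonal. The particular $2^{5-2}$ generators (D = AB, E = AC) are an arbitrary choice for illustration, not a design from the chapter.

```python
import itertools
import numpy as np

def two_level_ffd(k_base, generators):
    """Build a 2^(k-p) fractional factorial: full factorial in the base factors,
    plus added columns defined by generator index tuples (e.g., (0, 1) -> D = AB)."""
    base = np.array(list(itertools.product([-1, 1], repeat=k_base)))
    added = [np.prod(base[:, g], axis=1) for g in generators]
    return np.column_stack([base] + added)

# hypothetical 2^(5-2) design: base factors A, B, C with D = AB and E = AC
design = two_level_ffd(3, [(0, 1), (0, 2)])
print(design)                 # 8 runs x 5 columns, each column balanced in -1/+1
print(design.T @ design)      # diagonal matrix: main-effect columns are orthogonal
```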


Another appealing idea is to combine accelerated testing with reliability improvement experiments. Consider the log-linear model in (8.2) with the intercept term as a function of the stress factor: $\beta_0 = \beta_0(s_0)$, where $s_0$ is the stress factor at the use condition. Suppose the model in (8.2) holds at accelerated settings of the stress variable $s$ and the design effects do not vary with stress $s$. Then it is more efficient to induce failures by increasing the stress level (subject to the usual caveats about over-stressing), conduct the experiment at the accelerated stress level, estimate the design effects, and choose the appropriate settings. If we also want to do reliability assessment [estimate $\beta_0(s_0)$] from the same study, this forces a compromise. A more interesting situation is when both the design effects (or at least some of them) and the intercept $\beta_0$ vary with stress. This problem is currently being studied and the results will be reported elsewhere.

Two-level screening experiments are ideally suited to initial studies for new product development. In reliability improvement studies where the current reliability level is already quite high, different approaches, such as sequential methods, may be more useful. However, the additional time needed for sequential studies leads to other difficulties. The use of degradation data discussed in Section 4 mitigates many of these problems.

3.3. Robust Design Studies

The goal of robust parameter design is to make the product/process robust or "insensitive" to variations in uncontrollable "noise" factors. This is done through designed experiments where the effect of control (design) factors on both location and dispersion is studied. Much of this work has been stimulated by Taguchi's contributions to parameter design and quality improvement. There has been very limited work on robust design studies with reliability data (Phadke [44]; Montmarquet [36]; Hamada [19], [21]). Consider the following experiment on lifetimes of drill bits (Montmarquet [36]). These bits are used to drill holes in printed circuit boards. The goal of the study was to improve the lifetimes of the drill bits (number of holes drilled). A 16-run experiment with 12 design factors, one at four levels and 11 at two levels, was conducted to study the effect of the control factors. At each design combination, a $2^{5-2}$ fractional factorial design was used to systematically vary the settings of five noise variables, each at two levels. The experiment was terminated after 3000 holes were drilled, so several observations were right-censored. The experimental design and the data can be found in Montmarquet [36] (see also Wu and Hamada [55]). Several approaches have been proposed in the literature for analyzing data from robust design experiments (Nair [39]; Wu and Hamada [55]). One class of methods involves treating the noise factors as random and modeling the mean and


variance of the (possibly transformed) responses (Box [4]; Nair and Pregibon [38]). In the reliability context, this involves taking the log-TTF data and replacing the model in (8.2) with
$$\log(T_i) = \mu(\mathbf{x}_i) + \sigma(\mathbf{x}_i)\,\epsilon_i, \qquad (8.3)$$

where $\mu(\mathbf{x}_i) = \mathbf{x}_i'\boldsymbol{\beta}$ and $g[\sigma^2(\mathbf{x}_i)] = \mathbf{x}_i'\boldsymbol{\phi}$, and the elements $\beta_j$ and $\phi_j$ for $j \ge 2$ measure the location and dispersion effects, respectively. The most commonly used link function for the variance is $g[\sigma^2(\mathbf{x})] = \log[\sigma^2(\mathbf{x})]$, the log-linear model (Box [4]; Nair and Pregibon [38]). The observations from the noise array are treated as "replications," and the sample mean and sample variance of the data from these replications are then analyzed to identify the important location and dispersion effects. Alternatively, one can directly fit the model in (8.3) using likelihood-based methods. Most software packages do not have standard routines for joint modeling of location and dispersion parameters with censored data. In many robust design studies, the noise variables are controlled and systematically varied in off-line studies. In these situations, it is simpler and more efficient to use a "fixed-effects" analysis where the noise variables are treated as fixed. The dispersion effects can be determined from the control-by-noise interactions (Shoemaker, Wu, and Tsui [48]). Instead of (8.3), we use
$$\log(T_{ij}) = \mathbf{x}_i'\boldsymbol{\beta} + \mathbf{z}_j'\boldsymbol{\gamma} + \mathbf{x}_i'\Lambda\,\mathbf{z}_j + \sigma\epsilon_{ij}, \qquad (8.4)$$

where $\mathbf{z}_j$ are the settings of the noise variables, the components of $\boldsymbol{\beta}$ are the main effects of the control factors, the components of $\boldsymbol{\gamma}$ are the main effects of the noise factors, and the components of $\Lambda$ are the control-by-noise interactions. Analysis of data using this model is relatively easy since this is a standard regression problem with censored data. Hamada [19] used this approach to analyze the data from the drill bit experiment. The goal is to make reliability (as measured by some characteristic of the TTF distribution) as large as possible. In Taguchi's taxonomy of robust design studies, this is called a "larger-the-better" problem. Taguchi has proposed a "signal-to-noise" ratio for analyzing data from such studies. We prefer an analysis based on the parameters of the log-TTF distribution, using the results to make the final conclusions. In some cases, a direct examination of the location and dispersion effects may lead to conflicting recommendations on factor settings. In such cases, one can use an appropriate criterion, such as the required reliability at a given time $R(t_0)$ or the design life $t_p$, and find the design settings that maximize the estimated value of this criterion.
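The sketch below illustrates the mean-and-log-variance analysis mentioned above for a crossed control-by-noise layout: log-lifetimes are summarized by their sample mean (location) and log sample variance (dispersion) over the noise settings, and simple contrasts give the location and dispersion effects. The tiny design and responses are invented, and censoring is ignored to keep the example short.

```python
import numpy as np

# hypothetical inner (control) array: 4 runs of 2 two-level factors; outer array: 4 noise runs
control = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
# made-up log-lifetimes: one row per control run, one column per noise setting
log_ttf = np.array([[5.8, 6.1, 5.2, 5.6],
                    [6.4, 6.3, 6.5, 6.2],
                    [5.1, 6.0, 4.3, 5.5],
                    [6.8, 6.9, 6.6, 6.7]])

loc = log_ttf.mean(axis=1)                  # location response per control run
disp = np.log(log_ttf.var(axis=1, ddof=1))  # log sample variance (log-linear dispersion model)

# effect of each control factor = average at +1 level minus average at -1 level
for j, name in enumerate(["A", "B"]):
    loc_eff = loc[control[:, j] == 1].mean() - loc[control[:, j] == -1].mean()
    disp_eff = disp[control[:, j] == 1].mean() - disp[control[:, j] == -1].mean()
    print(name, "location effect:", round(loc_eff, 3), "dispersion effect:", round(disp_eff, 3))
```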

4. Reliability Assessment and Improvement with Degradation Data

Degradation data can be a very rich source of reliability information and offer many advantages over TTF data. This is especially true in high-reliability applications where few failures occur. With degradation data, one usually defines the TTF as the first time that a certain degradation threshold is exceeded (Meeker and Escobar [31], Chapter 14). In this section, we first discuss some models and methods of analysis for degradation data. The next two subsections describe accelerated test planning and the analysis of designed experiments with degradation data.

4.1. Models and Methods for Degradation Data

In any given application, the problem context and appropriate subject matter knowledge must be used to characterize the underlying degradation mechanisms and develop degradation models (LuValle, Welsher, and Svoboda [26] and Meeker and LuValle [30] are good examples). Nevertheless, the following class of random-effects degradation models is quite flexible and has been found quite useful in reliability applications. (See also the extensive literature on nonlinear random effects models and growth curves, e.g., Longford [25]; Lindsey [24]; Davidian and Giltinan [10].) Let $X(t)$ denote the (possibly transformed) degradation data observed with measurement error. Consider the nonlinear random effects model
$$X_j(t) = \Lambda(t; \boldsymbol{\theta}; A_j) + \epsilon_j(t), \qquad (8.5)$$

where $\Lambda(t)$ is a smooth monotonic function of $t$, $\boldsymbol{\theta}$ are population-level parameters characterizing degradation, $A_j$ is the random effect associated with device $j$, and $\epsilon_j(t)$ is measurement error. Simple examples include: (a) a linear random-effects model $X_j(t) = A_j t + \epsilon_j(t)$ with $A_j$ log-normally distributed; and (b) $Y_j(t) = A_j t \exp(\epsilon_j(t))$, which, by applying a log-transformation, becomes a linear (random intercept) model with $X_j(t) = \log A_j + \log(t) + \epsilon_j(t)$. In general, the degradation path $\Lambda(t; \boldsymbol{\theta}; A)$ will be nonlinear, and different parametric shapes can be chosen to characterize different degradation behavior. The parameters of the underlying TTF distribution can be estimated from those of the degradation process. This can be done analytically in simple models. Consider, for example, data from the log-transformed linear model
$$X_j(t_i) = \log A_j + \log(t_i) + \epsilon_j(t_i).$$

Suppose we can assume the $A_j$'s are i.i.d. $LN(\mu, \sigma_A^2)$ and the $\epsilon_j(t_i)$'s are i.i.d. $N(0, \sigma_\epsilon^2)$. Let failure be defined as the first-passage time of the true degradation path $A_j t$ to the degradation threshold $D$ (or, equivalently, $\log(A_j) + \log(t)$ crossing $\log(D)$).


Then the TTF $T_j = D/A_j$ is distributed as $LN(\log(D) - \mu, \sigma_A^2)$. We can estimate $\mu$ by $\bar{Z}$, where $Z_{ij} = X_j(t_i) - \log(t_i)$. The parameter $\sigma_A^2$ can be estimated from the appropriate difference between $\sum_{j=1}^m \sum_{i=1}^n (Z_{ij} - \bar{Z}_j)^2/[m(n-1)]$, an unbiased estimator of $\sigma_\epsilon^2$, and $\sum_{j=1}^m (\bar{Z}_j - \bar{Z})^2/(m-1)$, an unbiased estimator of $\sigma_A^2 + \sigma_\epsilon^2/n$.

To do a simple comparison of degradation versus TTF data, let us consider the efficiency of the estimators of $\mu$, the mean of the log-TTF distribution. The variance of $\hat{\mu}$ from the degradation data is $(\sigma_A^2 + \sigma_\epsilon^2/n)/m$. As $n$ gets large, the contribution from the measurement error, $\sigma_\epsilon^2/n$, gets small, so this will tend to $\sigma_A^2/m$. This is the same as the variance of $\hat{\mu}$ from $m$ exact failures $T_1, \ldots, T_m$ (TTF data with no censoring). Thus, the comparison between degradation and TTF data depends on the magnitude of the measurement error (predictive ability of the degradation data) versus the degree of censoring. Of course, one would also have to take into account the costs involved in collecting degradation versus TTF data.

Inference under the random effects model (8.5) has been discussed in the literature; see, for example, Meeker, Escobar, and Lu [33]. One approach is to maximize the likelihood directly. When the likelihood does not have a closed form, various approximations have been proposed (Pinheiro and Bates [46]) and are implemented in the S-PLUS functions lme and nlme. An alternative is to use a two-stage approach (see, for example, Davidian and Giltinan [10]) where one first estimates the random effects $A_j$ for each unit and then uses these estimated effects in the second-stage analysis. Chiao and Hamada [8] consider more complicated random effects models in which both the mean and covariance matrix of the random effects distribution depend on the experimental factors. In simple models, the TTF distribution can be estimated easily from the degradation data. This was the case for the linear degradation model and lognormal TTF distribution discussed above. For the general nonlinear random effects model in (8.5), one will have to use numerical or simulation methods to estimate the TTF distribution. Meeker and Escobar ([31], Chapter 21) describe a simulation-based approach and the use of bootstrap methods for estimation and inference. There are several other approaches to modeling degradation data in the literature. These include random shock processes (Gertsbakh [16]; Esary, Marshall, and Proschan [12]), cumulative damage models (Bogdanoff and Kozin [2]), and multistage models (Hougaard [22]). Doksum and Hoyland [11], Whitmore [53], and Whitmore and Schenkelberg [54] discuss the use of Wiener processes for modeling degradation data.
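The following simulation sketch implements the simple estimators just described for the log-transformed linear degradation model: $\mu$ from the grand mean of the $Z_{ij}$, $\sigma_\epsilon^2$ from the within-device mean square, and $\sigma_A^2$ from the between-device mean square corrected by $\sigma_\epsilon^2/n$. All numerical settings (number of devices, inspection times, threshold, true parameters) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# simulate the log-transformed linear degradation model: X_j(t_i) = log A_j + log t_i + eps
m, n = 25, 10                          # m devices, n inspection times (assumed values)
t = np.linspace(50.0, 500.0, n)
mu_true, sA, sE, D = 2.0, 0.4, 0.15, 50.0
logA = rng.normal(mu_true, sA, size=m)
X = logA[:, None] + np.log(t)[None, :] + rng.normal(0.0, sE, size=(m, n))

Z = X - np.log(t)[None, :]             # Z_ij = log A_j + eps_ij
Zbar_j = Z.mean(axis=1)
mu_hat = Z.mean()
sE2_hat = ((Z - Zbar_j[:, None]) ** 2).sum() / (m * (n - 1))   # within-device mean square
between = ((Zbar_j - mu_hat) ** 2).sum() / (m - 1)             # estimates sA^2 + sE^2 / n
sA2_hat = between - sE2_hat / n

# implied lognormal TTF at threshold D: T_j = D / A_j ~ LN(log D - mu, sA^2)
median_ttf = np.exp(np.log(D) - mu_hat)
print(mu_hat, sE2_hat, sA2_hat, median_ttf)
```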

4.2. Accelerated Degradation Test Planning

Many of the practical considerations in planning ADTs are similar to those for ALTs discussed in Section 2.3. Boulanger and Escobar [3] studied the design of ADTs under the nonlinear random effects model (8.7), where $X_{ij}(t)$ is the degradation measurement at time $t$ for the $j$th device under stress condition $i$. Further, $\epsilon_{ij}(t) \sim N(0, \ldots)$ and $F(t, \ldots$

$\ldots\ \lambda_j^0(t) \ge \lambda_j^1(t)$ for all $t \ge 0$. Here the failure rate $\lambda_j^0$ corresponds to a distribution $H_j$ with density $h_j$, and $\lambda_j^1$ corresponds to a lifetime distribution $F_j$ with density $f_j$:

$$\lambda_j^0(t) = \frac{h_j(t)}{1 - H_j(t)}, \qquad \lambda_j^1(t) = \frac{f_j(t)}{1 - F_j(t)}, \qquad H_j(0) = F_j(0) = 0.$$
As described in Aven and Jensen [1], the SSM representation of the indicator process $I_{\{T \le t\}}$ of a stopping time $T$ can be determined by means of the failure rate process $\lambda$ corresponding to $T$:

Here, $M \in \mathcal{M}$ is bounded in $L^2$, i.e., $\sup_{t \in \mathbb{R}_+} \mathbb{E}[M^2(t)] < \infty$. In our case, the failure rate of the $j$th item changes from $\lambda_j^0$ to $\lambda_j^1$ after releasing the item. Therefore we assume that for the lifetime $T_j$ of the $j$th item, the following representation holds true: (11.2)

where $Y_t = I_{\{T_j \le t\}}$. There is a reward of $c > 0$ per unit operating time of released items. In addition, there are costs for failures, $c_B > 0$ for a failure during burn-in and $c_F > 0$ for a failure after the burn-in time $\tau$, where

213

Burn-in - Sequential Stop and Go Strategies Cp > CB. If we fix the bum-in time for a moment to T is given by

m

m

Z(t) = C 2)Tj - t)+ j=l

CB

= t then the net reward

m

LI{Tj'S.t} j=l

Cp

LI{Tj>t}, t E lR+. (11.3) j=l

As usual a+ denotes the positive part of a, Le. a+ = max(a, 0). At this point we emphasize that this cost and reward structure is just a simple example. Other criteria are possible and can be treated in a similar way as is shown below. Now it is assumed that it is possible to observe the failure time of any item during the bum-in phase. To model this information in an appropriate way we introduce the observation filtration which is generated by the lifetimes of the items:

IF

= (Ft ), t E lR+, F t = a(I{Tj:Ss}, 0 :::; s :::; t, j = 1, ... , m).

Here, a(I{Tj:Ss},O :::; s :::; t,j = 1, ... ,m) denotes the completion of the smallest a-algebra with respect to which all random variables I{Tj:Sa}' 0 :::; s :::; t, j = 1, ... , m, are measurable. In order to determine the optimal bum-in time, the following optimal stopping problem has to be solved: Find an IF -stopping time ( E elF satisfying

E[Z(()]

=

sup{E[Z(T)] : T E elF}.

In other words, at any time t the observer has to decide whether to stop or to continue with bum-in with respect to the available information up to time t not anticipating the future. Now, Z is not adapted to IF, Le., Z(t) cannot be observed directly. Therefore we consider the conditional expectation m

Z(t) = E[(Z(t)IFt )] =c LI{Tj>t}E[(Tj - t)+ITj > t] j=l

+ (cp -

mcp

m

CB)

LI{Tj'S.t} j=l

(11.4)

which is obviously adapted to IF. As an abbreviation we use

1

00

I-Lj(t)

=

E[(Tj - t)+ITj > t]

=

--=-:-( 1 ) FJ t

t

Fj(x)dx, t E lR+,

for the mean residual lifetime. The derivative with respect to t is given by I-Lj (t) = -1 + AJ (t) I-Lj (t). We are now in a position to formulate conditions under which an optimal stopping time can be determined.

214

MATHEMATICAL RELIABILITY

Theorem 2 Suppose that I-tj (t) is bounded for j = 1, ... , m and that the functions

satisfy the following condition for all :7 ~ {1, ... , m}:

L 9j(t) :S 0 implies 9j(S) :S 0 Vj E :7, Vs ~ t.

(11.5)

jEJ

Then m

( = inf{t E lR+ : L I{Tj>t}9j(t) :S O} j=l

is an optimal bum-in time:

lE[Z(()] = sup{lE[Z(T)] : T

E

ell'}

Proof. In order to obtain a semimartingale representation for Z in (11.4) one has to derive such a representation for I{Tj>t}fLj(t). The integration by parts formula for Stieitjes integrals (pathwise) gives

After some rearrangements this leads to

fLj(t)I{To>t} = I-tj(O) J

+ Jort[-l-tj(s)I{To>s}>.J(s) + I{T>s}l-tj(s)]ds J J

+ it fLj (s)dA1j (s) =

I-tj(O)

r I{To>s} [-1 -l-tj(s)(>.J(s) - >.}(s))] ds

+ Jo

+ Mj(t),

J

Burn-in - Sequential Stop and Go Strategies

215

where $M_j$ is a martingale which is bounded in $L^2$. This yields the following semimartingale representation for $\tilde{Z}$:

with a uniformly integrable martingale $N$. Since for all $\omega \in \Omega$ and all $t \in \mathbb{R}_+$ there exists some $\mathcal{J} \subseteq \{1, \ldots, m\}$ such that $\sum_{j=1}^m I_{\{T_j > t\}}\,g_j(t) = \sum_{j \in \mathcal{J}} g_j(t)$, condition (11.5) in the theorem ensures that the monotone case (11.1) holds true. Therefore we get the desired result by Theorem 1. For more details of the proof see [12]. $\Box$

t

Remark The structure of the optimal stopping time shows that high rewards per unit operating time lead to short burn-in times, whereas great differences $c_F - c_B$ between the costs for failures in the different phases lead to long testing times.

The theorem shows in a formal way how the information about failure times during burn-in can be used. The first easy consequence is: stop burn-in at the latest when the last item has failed, which is a random time. In realistic cases the event that all items fail during burn-in should have small probability. The following special cases illustrate some more consequences of the theorem.

1 Burn-in forever. If $g_j(t) > 0$ for all $t \in \mathbb{R}_+$, $j = 1, \ldots, m$, then $\zeta = \max(T_1, \ldots, T_m)$, i.e., burn-in until all items have failed.

2 No burn-in. If $g_j(0) \le 0$, $j = 1, \ldots, m$, then $\zeta = 0$ and no burn-in takes place. This case occurs for instance if the costs for failures during and after burn-in are the same: $c_B = c_F$.

3 Identical items. If all failure rates coincide, i.e. $\lambda_1^0(t) = \cdots = \lambda_m^0(t)$ and $\lambda_1^1(t) = \cdots = \lambda_m^1(t)$ for all $t \ge 0$, then $g_j(t) = g_1(t)$ for all $j \in \{1, \ldots, m\}$ and condition (11.5) reduces to

= ... = A~(t) = gl(t) for all

216

MATHEMATICAL RELIABILITY

If this condition is satisfied, the optimal stopping time is of the form ( = t1 !\ max(T1, ... , T m ), Le. stop burn-in as soon as 91(8) ~ 0 or all items have failed, whatever occurs first. 4 The exponential case. If all failure rates are constant, equal to >-.J and >-.], respectively, then f-lj and therefore 9j is constant, too, and (( w) E {O, T 1 (w), ... , Tm (w)} if condition (11.5) is satisfied. If furthermore the items are "identical", then ( = 0 or ( = max(T1 , ... , Tm ). 5 No random information. In some situations the lifetimes of the items cannot be observed continuously. In this case one has to maximize the expectation function

E[Z(t)] = E[Z(t)] = -mcp + c

m

L Hj (t)f-lj (t) + (cp j=1

m

CB)

L Hj(t) j=1

in order to obtain the (deterministic) optimal burn-in time. This can be done using elementary calculus. In the case of independent component lifetimes the question arises: When does it really pay to observe the failure times during burn-in and to follow a sequential procedure not leading (in principle) to a deterministic burn-in time. As can be seen from the special case identical items above the i.i.d. case leads in principle to a deterministic stopping rule. But if the items are independent and have different lifetime distributions as in cases when small systems are burned in comprising different types of components, then we are in another situation: Let us assume that we have two types, one for which the failure rates >-. 0 (t) and >-. 1 (t) during and after burn-in coincide and another type for which there is a big difference between these rates. This means that the rate function 9 of the latter type of items takes negative values with the interpretation that this type is very sensitive against extra stress conditions during burn-in. If the monotone case is fulfilled, the optimal stopping rule is of the following type: If these sensitive items survive for a certain time and some of the non-sensitive items with positive rate function 9 fail first, then we stop burn-in earlier to release the sensitive items. On the other hand ifthe sensitive items fail early during burn-in then the non-sensitive items are dominating and the burn-in takes longer. The optimal burn-in time can be determined for other cost functions in the same way as in Theorem 2. This shall be demonstrated with a second example: We consider the following cost structure proposed by Costantini and Spizzichino [8], who, however, aimed at the more complex case of conditionally independent lifetimes (see also Section 3): Consider a cost CB to be paid for each component that fails during burn-in and a decreasing cost C( r), r E [0,(0), for each component that is put into operation and has an operative lifetime r, which

217

Burn-in - Sequential Stop and Go Strategies

seems a little more realistic than a constant failure cost. For a fixed bum-in time

t, the gain is given by

Z(t) = -CB

m

m

j=l

j=l

L I{Tj$t} - L I{Tj>t}C(Tj -

t), t E lR+ l

and m

IE[Z(t)IFt] = -CB

L I{Tj$t} j=l

m

L I{Tj>t}IE[C(Tj - t)ITj > t]. j=l

Denote Cj(t) = IE[C(Tj -t)ITj > t] = Fj(t) !too C(x - t)!J(x)dx and assume that this function is absolutely continuous with representation

Then using an integration by parts formula as in the proof of Theorem 2 (see also Proposition 109 on p. 240 in [1]) we get the SSM representation

Together with the SSM representation for I{Tj$t} we finally have IE[Z(t)IFtJ as

-f

j=l

= -

Cj(O)

f

j=l

+ it 0

Cj(O)

+

f

j=l

I{Tj>s} {(Cj(S) - CB»,J(S) - cj(s)}ds + M(t)

it f 0

j=l

I{Tj>s}9j(S) ds

+ M(t).

So if the functions 9j fulfill condition (11.5) then we can apply Theorem 2 and the stopping time m

(= inf{t E lR+ : L1{Tj>t}9j(t):::; O},

j=l

can be seen to be optimal. This stopping rule is just of the same type as that one in Theorem 2. Since this type of stopping strategy occurs quite frequently we introduce the term Stop and Go strategy because of the following interpretation: At time 0 determine the first zero of 2:;:1 9j (t); then go to this first predetermined stopping point in time or the first failure, say of component k,

218

MATHEMATICAL RELIABILITY

whatever comes first; then stop (finally) or determine a new stopping time (the first zero of Σ_{j≠k} g_j(t)) and go again, etc. These Stop and Go strategies even apply in cases of dependent lifetimes, to be investigated in more detail in Section 3. To demonstrate this, let us come back to the first cost/reward structure, where we considered the stochastic process

Z(t) = C Σ_{j=1}^{m} (T_j − t)^+ − C_B Σ_{j=1}^{m} 1{T_j ≤ t} − C_F Σ_{j=1}^{m} 1{T_j > t}.

If the lifetimes are not independent, the "only" difference is that we have to take the history up to time t into account when forming conditional expectations. The history H_t^h at time t, when h components have failed before t, is

H_t^h = (t_1, ..., t_h; c_(1), ..., c_(h))

and consists of the failure times up to time t and the indices of the failed items. Forming conditional expectations then gives E[(T_j − t)^+ | F_t] = μ_j(t, H_t^h), depending on the history and leading to an SSM representation of the same form as above with rate functions g_j(t, H_t^h). This shows that again a Stop and Go strategy is optimal, even in the dependent case, if the functions g_j(t, H_t^h) fulfill the conditions (11.5) for all histories H_t^h. Of course, in general it will not be easy to check these conditions.
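To make the Stop and Go interpretation concrete, the following minimal sketch simulates the stopping rule for independent components. The rate functions g_j and the component lifetimes below are illustrative assumptions, not quantities taken from this chapter.

```python
import numpy as np

# Hypothetical rate functions g_j(t): negative values mark "sensitive" items whose
# burn-in hazard exceeds their field hazard (illustrative choices only).
def g(j, t):
    return (1.0 - 0.4 * t) if j % 2 == 0 else (-0.5 + 0.1 * t)

def stop_and_go(lifetimes, t_max=10.0, dt=1e-3):
    """Stop at the first t where the sum of g_j(t) over surviving items is <= 0."""
    alive = list(range(len(lifetimes)))
    t = 0.0
    while t < t_max and alive:
        alive = [j for j in alive if lifetimes[j] > t]   # drop items that failed by t
        if sum(g(j, t) for j in alive) <= 0.0:
            return t                                      # stop burn-in, release survivors
        t += dt
    return t

rng = np.random.default_rng(0)
T = rng.exponential(2.0, size=4)      # hypothetical component lifetimes
print("burn-in stopped at", stop_and_go(T))
```

Each failure changes the set over which the g_j are summed, which is exactly the "recompute and go again" behaviour described above.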

3.

Sequential Burn-in for Similar Components with Dependent Lifetimes

In this section we present a discussion of the sequential character of procedures for optimally stopping the burn-in period for m components, in the case when they are similar and the corresponding lifetimes T_1, ..., T_m are interdependent. In particular, we illustrate in some more detail the role and the meaning of Stop and Go strategies.

3.1.

Preliminaries

Assume now that we are in a position to record all failures which occur during burn-in and the identity of the failing components. Generally, for interdependent components one should expect that the optimal procedure to stop the burn-in must be sequential. From an intuitive standpoint, we can indeed reason as follows: at time 0, we determine the optimal burn-in time, suitably taking into account the joint survival function F̄_{T_1,...,T_m}; denote that by ..

∂ℒ/∂λ = S Σ_{i=1}^{n} τ_i − L = 0.   (12.4)

Consequently, we have τ_1 = τ_2 = ... = τ_n from (12.3). Also, from (12.4), we can show that τ_1 = τ_2 = ... = τ_n = L/(Sn).


To find the overall optimal coordination policy, we also need to consider the schedule constraint specified in (12.1). For each n, we first set S to 1 and continue to increment S by 1 until the schedule constraint is met. To choose the optimal number of cycles n*, we compare different n and choose the one that yields the lowest objective value. If the optimal number of developers is S*, the length of the development phase in each cycle will be L(S*n*)^{-1}.
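A minimal sketch of this search is given below. The cost function ecc and the schedule bound tau_max are hypothetical placeholders used only to show the structure of the enumeration; they are not the chapter's expected-completion-cost model.

```python
# Hypothetical expected completion cost and schedule bound, for illustration only.
def ecc(n, S, L=100.0, k_a=1.0, k_b=0.05):
    tau = L / (S * n)                 # equal-length development phases
    return k_a * S * n + k_b * S * n * tau**2

def best_policy(L=100.0, tau_max=5.0, n_max=20, S_max=50):
    best = None
    for n in range(1, n_max + 1):
        S = 1
        while L / (S * n) > tau_max and S < S_max:   # increment S until schedule is met
            S += 1
        cost = ecc(n, S, L)
        if best is None or cost < best[0]:
            best = (cost, n, S)
    _, n_star, S_star = best
    return n_star, S_star, L / (S_star * n_star)     # cycles, developers, phase length

print(best_policy())
```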

A.2. Solution Procedure for the Model that Considers Size Effect

Similar to Section A.1, the Lagrangian when considering only the effort constraint is

ℒ = ECC − λ [ S Σ_{i=1}^{n} τ_i − L ],

where ECC is the expected completion cost of the size-effect model, which involves the constants k_a and k_b and exponential terms of the form e^{kSτ_i} − 1 introduced earlier in the chapter.

The first-order conditions are (12.5) and

∂ℒ/∂λ = S Σ_{i=1}^{n} τ_i − L = 0.   (12.6)

From (12.5), we can represent τ_i as a function of λ in (12.7). Substituting (12.7) into (12.6) and simplifying, we obtain (12.8). Once λ is known, using (12.7) again we calculate the value of τ_i for i from 1 to n.

The procedure of incorporating the schedule constraint is the same as in Section A.1.



Chapter 13

RELIABILITY MODELING AND ANALYSIS IN RANDOM ENVIRONMENTS

Süleyman Özekici
Department of Industrial Engineering
Koç University
80910 Sarıyer-Istanbul, Turkey
[email protected]

Refik Soyer
Department of Management Science
The George Washington University
Washington, DC 20052, USA
[email protected]

Abstract

We consider a number of models where the main emphasis is on the effects of random environmental changes on system reliability. They include complex hardware and software systems which operate under some set of environmental states that affect the failure structure of all components. Our discussion will be of an expository nature and we will review mostly existing and ongoing research of the authors. In so doing, we will present an overview of continuous and discrete-time models and their statistical analyses in order to provide directions for future research.

Keywords:

Reliability models, random environment, Markov modulation, operational profile, Bayesian analysis.

1.

Introduction and Overview

In this expository paper, we consider complex reliability models that operate in a randomly changing environment which affects the model parameters. Here, complexity is due not only to the variety in the number of components of the model, but also to the fact that these components are interrelated through


their common environmental process. For example, a complex device like an airplane consists of a large number of components where the failure structure of each component depends very much on the set of environmental conditions that it is subjected to during flight. The levels of vibration, atmospheric pressure, temperature, etc. obviously change during take-off, cruising and landing. Component lifetimes and reliabilities depend on these random environmental variations. Moreover, the components have dependent lifetimes since they operate in the same environment. A similar observation holds in software systems. For example, an airline reservation system consists of many modules where failures may be experienced due to the faults or bugs that are still present. In this case, the way that the user operates the system, or the so-called operational profile, plays a key role in software reliability assessment. Failure probabilities of the modules and system reliability all depend on the random sequence of operations it performs. The operational profile in this setting provides the random environment for the software system. The term "environment" is used in a generic sense in this paper so that it represents any set of conditions that affect the stochastic structure of the model investigated. The concept of an "environmental" process, in one form or another, has been used in the literature for various purposes. Neveu [25] provides an early reference to paired stochastic processes where the first component is a Markov process while the second one has conditionally independent increments given the first. Ezhov and Skorohod [8] refer to this as a Markov process with homogeneous second component. In a more modern setting, Çınlar ([3], [4]) introduced Markov additive processes and provided a detailed description of the structure of the additive component. The environment is modelled as a Markov process in all these cases and the additive process represents the stochastic evolution of a quantity of interest. The use of an environmental process to modulate the deterministic and stochastic parameters of operations research models is not limited to reliability applications only. Özekici [27] discusses other applications in inventory and queueing. In inventory models, the stochastic structure is depicted by the demand and the lead-time processes. Song and Zipkin [41] argue that the demand for the product may be affected by a randomly changing "state-of-the-world", which we choose to call the "environment" in our exposition. A periodic review model in a random environment with uncertain supply is analyzed in Özekici and Parlar [31]. Queueing models also involve stochastic and deterministic parameters that are subject to variations depending on some environmental factors. The customer arrival rate as well as the service rate are not necessarily constants that remain intact throughout the entire operation of the queueing system. A queueing model where the arrival and service rates depend on a randomly changing two-state environment was first introduced by Eisen and Tainiter [7]. This


line of modelling is later extended by other authors such as Neuts ([23], [24]) and Purdue [36]. A comprehensive discussion on Markov modulated queueing systems can be found in Prabhu and Zhu [35]. Although the literature cited above clearly illustrate the use of random environments in inventory and queueing models, the concept is much more applicable in reliability and maintenance models. It is generally assumed that a device always works in a given fixed environment. The probability law of the deterioration and failure process thus remains intact throughout its useful life. The life distribution and the corresponding failure rate function is taken to be the one obtained through statistical life testing procedures that are usually conducted under ideal laboratory conditions by the manufacturer of the device. Data on lifetimes may also be collected while the device is in operation to estimate the life distribution. In any case, the basic assumption is that the prevailing environmental conditions either do not change in time or, in case they do, they have no effect on the deterioration and failure of the device. Therefore, statistical procedures in estimating the life distribution parameters and decisions related with replacement and repair are based on the calendar age of the item. There has been growing interest in recent years in reliability and maintenance models where the main emphasis is placed on the so-called intrinsic age of a device rather than its real age. This is necessitated by the fact that devices often work in varying environments during which they are subject to varying environmental conditions with significant effects on performance. The deterioration and failure process therefore depends on the environment, and it no longer makes much sense to measure the age in real time without taking into consideration the different environments that the device has operated in. There are many examples where this important factor can not be neglected or overlooked. Consider, for example, the jet engine of an airplane which is subject to varying atmospheric conditions like pressure, temperature, humidity, and mechanical vibrations during take-off, cruising, and landing. The changes in these conditions cause the engine to deteriorate, or age, according to a set of rules which may well deviate substantially from the usual one that measures the age in real time irrespective of the environment. As a matter of fact, the intrinsic age concept is being used routinely in practice in one form or another. In aviation, the calendar age of an airplane since the time it was actually manufactured is not of primary importance in determining maintenance policies. Rather, the number of take-offs and landings, total time spent cruising in fair conditions or turbulence, or total miles flown since manufacturing or since the last major overhaul are more important factors. Another example is a machine or a workstation in a manufacturing system which may be subject to varying loading patterns depending on the production schedule. In this case, the atmospheric conditions do not necessarily change too much in time, and the environment is now represented by varying loading


patterns so that, for example, the workstation ages faster when it is overloaded, slower when it is underloaded, and not at all when it is not loaded or kept idle. Therefore, the term "environment" is used in a loose sense here so that it represents any set of conditions that affect the deterioration and aging of the device. In what follows, we assume that the system operates in a randomly changing environment depicted by Y = {Y_t; t ∈ T}, where Y_t is the state of the environment at time t. The environmental process Y is a stochastic process with time-parameter set T and some state space E which is assumed to be discrete to simplify the notation. In Section 2, we consider continuous-time models applicable to hardware systems. This will focus mainly on the intrinsic aging concept. Section 3 is on continuous-time software reliability models where the operational profile plays the key role in testing as well as reliability assessment. Discrete-time periodic models are considered in Section 4 where we first discuss Markov modulated Bernoulli processes in the context of reliability applications and extend this discussion later to networks.

2.

Continuous Time Models with Intrinsic Aging

An interesting model of stochastic component dependence was introduced by

Çınlar and Özekici [5] where stochastic dependence is introduced by a randomly

changing common environment that all components of the system are subjected to. This model is based on the simple observation that the aging or deterioration process of any component depends very much on the environment that the component is operating in. They propose to construct an intrinsic clock which ticks differently in different environments to measure the intrinsic age of the device. The environment is modelled by a semi-Markov jump process and the intrinsic age is represented by the cumulative hazard accumulated in time during the operation of the device in the randomly varying environment. This is a rather stylish choice which envisions that the intrinsic lifetime of any device has an exponential distribution with parameter 1. There are, of course, other methods of constructing an intrinsic clock to measure the intrinsic age. Also, the random environment model can be used to study reliability and maintenance models involving complex devices with many interacting components. The lifetimes of the components of such complex devices are stochastically dependent due to the common environment they are all subject to.

2.1.

Intrinsic Aging in a Fixed Environment

The concept of random hazard functions is also used in Gaver [10] and Arjas [1]. The intrinsic aging model of Çınlar and Özekici [5] is studied further in Çınlar et al. [6] to determine the conditions that lead to associated component


lifetimes, as well as multivariate increasing failure rate (IFR) and new better than used (NBU) life distribution characterizations. It was also extended in Shaked and Shanthikumar [38] by discussions on several different models with multicomponent replacement policies. Lindley and Singpurwalla [17] discuss the effects of the random environment on the reliability of a system consisting of components which share the same environment. Although the initial state of the environment is random, they assume that it remains constant in time and components have exponential life distributions in each possible environment. This model is also studied by Lefèvre and Malice [15] to determine partial orderings on the number of functioning components and the reliability of k-out-of-n systems, for different partial orderings of the probability distribution on the environmental state. The association of the lifetimes of components subjected to a randomly varying environment is discussed in Lefèvre and Milhaud [16]. Singpurwalla and Youngren [40] also discuss multivariate distributions that arise in models where a dynamic environment affects the failure rates of the components.

The duration of the nth operation is exponentially distributed with rate μ(i) if this operation is i (see (13.15)). The probabilistic structure of the operational process is given by the generator A(i, j) = μ(i)(P(i, j) − I(i, j)), where I is the identity matrix. An overview of software failure models is presented in Singpurwalla and Soyer [39]. Perhaps the most important aspect of these models is related to the stochastic structure of the underlying failure process.


between-failures" model which assumes that the times between successive failures follow a specific distribution whose parameters depend on the number of faults remaining in the program after the most recent failure. One of the most celebrated failure models in this group is that of Jelinski and Moranda [13] where the basic assumption is that there are a fixed number of initial faults in the software and each fault causes failures according to a Poisson process with the same failure rate. After each failure, the fault causing the failure is detected and removed with certainty so that the total number of faults in the software is decreased by one. In the present setting, the time to failure distribution for each fault in the software is exponentially distributed with parameter )..( k) during operation k and this results in an extension of the Jelinski-Moranda model. In dealing with software reliability, one is interested in the number of faults Nt remaining in the software at time t. Then, No is the initial number of faults and the process N = {Nt; t 2: O} depicts the stochastic evolution of the number of faults. If there is perfect debugging, then N decreases as time goes on, eventually to diminish to zero. Defining the bivariate process Zt = (yt, Nt), it follows that Z = (Y, N) is a Markov process with discrete state space F = E x {O, 1,2, ... }. This follows by noting that Y is a Markov process and N is a process that decreases by 1 after an exponential amount of time with a rate that depends only on the state of Y. In particular, if the current state of Z is (i, n) for any n > 0, then the next state is either (j, n) with rate J1(i)P( i, j) or (i, n - 1) with rate n)..(i). If n = 0, then the next state is (j, 0) with rate J1(j). Note that 0 is an absorbing state for N. This implies that the sojourn in state (i, n) is exponentially distributed with rate (13.16) (3(i, n) = J1(i) + n)..(i) and the generator Q of Z is

Q((i, n), (j, m)) = { −(μ(i) + nλ(i)),   j = i, m = n
                      μ(i)P(i, j),       j ≠ i, m = n
                      nλ(i),             j = i, m = n − 1.   (13.17)
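The following sketch assembles the generator (13.17) as a matrix over the truncated state space E × {0, 1, ..., n0}. The two-operation profile and the numerical rates are illustrative assumptions; P is taken to have zero diagonal, as for an embedded jump chain.

```python
import numpy as np

def build_generator(P, mu, lam, n0):
    """Generator Q of Z = (Y, N) on states (i, n), i in E, n = 0..n0, per (13.17)."""
    E = len(mu)
    idx = lambda i, n: i * (n0 + 1) + n
    Q = np.zeros((E * (n0 + 1), E * (n0 + 1)))
    for i in range(E):
        for n in range(n0 + 1):
            Q[idx(i, n), idx(i, n)] = -(mu[i] + n * lam[i])
            for j in range(E):
                if j != i:
                    Q[idx(i, n), idx(j, n)] = mu[i] * P[i, j]   # operation change
            if n > 0:
                Q[idx(i, n), idx(i, n - 1)] = n * lam[i]        # a fault is removed
    return Q

P = np.array([[0.0, 1.0], [1.0, 0.0]])    # illustrative operation-switching matrix
Q = build_generator(P, mu=[2.0, 1.0], lam=[0.3, 0.1], n0=3)
print(Q.shape, np.allclose(Q.sum(axis=1), 0.0))    # rows of a generator sum to zero
```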

Reliability is defined as the probability of failure free operation for a specified time. We will denote this by the function

R(i, n, t) = P[L > t | Y_0 = i, N_0 = n] = P[N_t = n | Y_0 = i, N_0 = n],   (13.18)

defined for all (i, n) ∈ F and t ≥ 0. Note that this is equal to the probability that there will be no arrivals until time t in a Markov modulated Poisson process with intensity function ξ(t) = nλ(Y_t). Thus, using the matrix generating function (22) in Fischer and Meier-Hellstern [9] with z = 0, we obtain the explicit formula

R(i, n, t) = Σ_{j∈E} [ e^{(A − nΛ)t} ]_{ij}.   (13.19)


where

e^{(A − nΛ)t} = Σ_{k=0}^{+∞} (t^k / k!) (A − nΛ)^k   (13.20)

is the exponential matrix and Λ(i, j) = λ(i) I(i, j).
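A small numerical sketch of (13.19)-(13.20), assuming a hypothetical two-operation profile; the rates used are illustrative, not estimates from the text.

```python
import numpy as np
from scipy.linalg import expm

def reliability(A, lam, n, t):
    """R(i, n, t) = sum_j [exp((A - n*Lambda) t)]_{ij}, cf. (13.19)."""
    Lam = np.diag(lam)                  # Lambda(i, j) = lambda(i) I(i, j)
    M = expm((A - n * Lam) * t)
    return M.sum(axis=1)                # one value for each initial operation i

mu = np.array([2.0, 1.0])               # operation completion rates (illustrative)
P = np.array([[0.0, 1.0], [1.0, 0.0]])  # operation switching matrix (illustrative)
A = mu[:, None] * (P - np.eye(2))       # generator A(i,j) = mu(i)(P(i,j) - I(i,j))
lam = np.array([0.3, 0.1])              # per-fault failure rates by operation
print(reliability(A, lam, n=5, t=1.0))  # R(i, 5, 1) for i = 0, 1
```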

3.3. Bayesian Analysis of Software Reliability Models

In the software reliability model of Section 3.2, it is assumed that the parameters are given. A Bayesian analysis of this model can be developed as in Özekici and Soyer [32] by specifying prior probability distributions to describe uncertainty about the unknown parameters. In Özekici and Soyer [32] independent gamma priors are assumed on each λ(i) with shape parameter a(i) and scale parameter b(i), denoted as λ(i) ~ Gamma(a(i), b(i)) for all i ∈ E. Similarly, independent gamma priors are assumed for the components of μ as μ(i) ~ Gamma(c(i), d(i)) for all i ∈ E. A Poisson distribution with parameter γ, denoted as N_0 ~ Poisson(γ), is assumed as the prior for the initial number of faults N_0. For the components of the transition matrix, the ith row P_i = {P(i, j); j ∈ E} has a Dirichlet prior

p(P_i) ∝ ∏_{j∈E} P(i, j)^{α_j^i − 1},   (13.21)

denoted as Dirichlet{α_j^i; j ∈ E}, and the P_i's are independent for all i ∈ E. Furthermore, it is assumed that a priori λ, μ, P and N_0 are independent. We denote the joint prior distribution of the parameters by p(θ) where θ = (λ, μ, P, N_0). During the usage phase, as debugging is performed, the failure times during each operation as well as the operation types and their durations are observed. Assuming that during a usage phase of T units of time K operations are performed and K − 1 of those are completed, the observed data is given by D = {(X_k, S_k), (U_1^k, U_2^k, ..., U_{M_k}^k); k = 1, ..., K}, where X_k is the kth operation performed, S_k is the time at which the kth operation starts, and U_j^k is the time (since the start of the kth operation) of the jth failure during the kth operation. Defining N_k = N_{S_k} to denote the total number of faults remaining in the software just before the kth operation, M_k as the number of failures observed during the kth operation, and assuming that the initial operation is X_1 = i for some operation i starting at S_1 = 0, the likelihood function ℒ(θ|D) of θ is obtained as

ℒ(θ|D) ∝ ∏_{k=1}^{K−1} { P(X_k, X_{k+1}) μ(X_k) e^{−μ(X_k)(S_{k+1} − S_k)} (N_k!/(N_k − M_k)!) λ(X_k)^{M_k} e^{−λ(X_k)[Σ_{j=1}^{M_k} U_j^k + (N_k − M_k)(S_{k+1} − S_k)]} } × ℒ_K,   (13.22)


where ℒ_K is the contribution of the Kth operation to the likelihood, given by

ℒ_K = e^{−μ(X_K)(T − S_K)} (N_K!/(N_K − M_K)!) λ(X_K)^{M_K} e^{−λ(X_K)[Σ_{j=1}^{M_K} U_j^K + (N_K − M_K)(T − S_K)]}.   (13.23)

Given the independent priors, the posterior distribution of Pi'S can be obtained as independent Dirichlets given by

(P_i | D) ~ Dirichlet{ α_j^i + Σ_{k=1}^{K−1} 1(X_k = i, X_{k+1} = j); j ∈ E },   (13.24)

where 1(·) is the indicator function. Similarly, the posterior distributions of the μ(i)'s are obtained as independent gamma densities given by

(μ(i) | D) ~ Gamma( c(i) + Σ_{k=1}^{K−1} 1(X_k = i),  d(i) + Σ_{k=1}^{K} (S_{k+1} − S_k) 1(X_k = i) ),   (13.25)

where S_{K+1} = T. We note that a posteriori μ and P are independent of λ and N_0 as well as of each other. A tractable Bayesian analysis for λ and N_0 is not possible due to the infinite sums involved in the posterior terms, but the Bayesian analysis can be carried out by using a Gibbs sampler [see Gelfand and Smith [12]]. The implementation of the Gibbs sampler requires the full posterior conditionals p(N_0 | λ, D) and p(λ(i) | λ(−i), N_0, D) for all i ∈ E, where λ(−i) = {λ(j); j ≠ i, j ∈ E}. Using the fact that N_1 = N_0, it can be shown that

(N_0 − M | λ, D) ~ Poisson( γ e^{−Σ_{k=1}^{K} λ(X_k)(S_{k+1} − S_k)} ),   (13.26)

where M = Σ_{k=1}^{K} M_k. The full conditionals p(λ(i) | λ(−i), N_0, D), i ∈ E, are obtained as

(λ(i) | λ(−i), N_0, D) ~ Gamma( a(i) + Σ_{k=1}^{K} M_k 1(X_k = i),  b(i) + Σ_{k=1}^{K} W_k 1(X_k = i) ),   (13.27)

where W_k = Σ_{j=1}^{M_k} U_j^k + (N_k − M_k)(S_{k+1} − S_k). Thus all of the posterior distributions can be evaluated by recursively simulating from the full conditionals in a straightforward manner. It is important to note that, using the independent priors, given N_0 the λ(i)'s are a posteriori independent.
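A compact sketch of the Gibbs steps (13.26)-(13.27) follows. The operation labels, durations, failure counts and within-operation failure times below are illustrative placeholders, as are the prior settings; they are not data from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data summaries for K = 4 observed operations (placeholders).
X  = np.array([0, 1, 0, 1])                  # operation performed in stage k
dt = np.array([1.0, 0.5, 2.0, 1.5])          # S_{k+1} - S_k (last uses T - S_K)
M  = np.array([2, 1, 1, 0])                  # failures observed in stage k
U  = [np.array([0.2, 0.7]), np.array([0.1]), np.array([1.5]), np.array([])]

a, b = np.array([1.0, 1.0]), np.array([1.0, 1.0])   # Gamma priors for lambda(i)
gamma0 = 20.0                                        # Poisson prior mean for N_0

lam, N0 = np.array([0.5, 0.5]), 10
for it in range(2000):
    # (13.26): N_0 - M | lambda, D ~ Poisson(gamma * exp(-sum_k lambda(X_k) dt_k))
    N0 = M.sum() + rng.poisson(gamma0 * np.exp(-(lam[X] * dt).sum()))
    # faults remaining just before stage k, and exposures W_k
    Nk = N0 - np.concatenate(([0], np.cumsum(M)[:-1]))
    W = np.array([U[k].sum() + (Nk[k] - M[k]) * dt[k] for k in range(len(X))])
    # (13.27): lambda(i) ~ Gamma(a(i)+sum M_k 1(X_k=i), b(i)+sum W_k 1(X_k=i))
    for i in range(2):
        sel = (X == i)
        lam[i] = rng.gamma(a[i] + M[sel].sum(), 1.0 / (b[i] + W[sel].sum()))
print(N0, lam)
```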


We note that in the controlled testing setup of Özekici and Soyer [33], presented in Section 3.2, the operations and their durations are deterministic. Thus, the Bayesian inference in the controlled testing setup can be obtained as a special case by using (13.26) and (13.27) above with t_k = (S_{k+1} − S_k) and β_k = b(k), and the expected cost term (13.13) can be evaluated and optimized sequentially after each testing stage. Once uncertainty about θ is revised to p(θ|D), it is of interest to make posterior reliability predictions as P[L > t | D]. Note that both A and Λ are functions of θ. Conditional on θ, using the Markov property of the Z process and (13.19), we obtain

P[L > t | θ, D] = Σ_{j∈E} [ e^{(A(θ) − (N_0 − M)Λ(θ))t} ]_{X_K, j}.   (13.28)

Conditional on θ, e^{(A(θ) − (N_0 − M)Λ(θ))t} can be computed from the matrix exponential form using one of the available methods, for example, in Moler and van Loan [19]. Then the posterior reliability prediction can be approximated as a Monte Carlo integral

P[L > t | D] ≈ (1/G) Σ_{g=1}^{G} P[L > t | θ^{(g)}, D],   (13.29)

using G realizations from the posterior distribution p(θ|D). Similarly, prior to observing any data, reliability predictions can be made by replacing (N_0 − M) with N_0 and the index (X_K, j) with (i, j) in (13.28) and using (13.29) with realizations from the prior distribution p(θ).

4.

Discrete Time Models

We now consider discrete-time models for hardware systems where a device is observed periodically at discrete time points. The device survives each period with a probability that depends on the state of the prevailing environment in that period. Since each period ends with a failure or survival, one can model this system as a Bernoulli process where the success probability is modulated by the environmental process. Using this setup with a Markovian environmental process, Özekici [28] focuses on probabilistic modeling and provides a complete transient and ergodic analysis. We suppose throughout the following discussion that the sequence of environmental states Y = {Y_t; t = 1, 2, ...} is a Markov chain with some transition matrix P on a discrete state space E.

4.1.

Markov Modulated Bernoulli Process

Consider a system observed periodically at times t = 1, 2, ... where the state of the system at time t is described by a Bernoulli random variable

X_t = { 1,   if the system is not functioning at time t
        0,   if the system is functioning at time t.

Given that the environment is in some state i at time t, the probability of failure in the period is

P[X_t = 1 | Y_t = i] = π(i)   (13.30)

for some 0 ≤ π(i) ≤ 1. The states of the system at different points in time constitute a Bernoulli process X = {X_t; t = 1, 2, ...} where the success probability is a function of the environmental process Y. Given the environmental process Y, the random quantities X_1, X_2, ... represent a conditionally independent sequence, that is,

P[X_1 = x_1, X_2 = x_2, ..., X_n = x_n | Y] = ∏_{k=1}^{n} P[X_k = x_k | Y].   (13.31)

In the above setup, the reliability of the system is modulated by the environmental process Y, which is assumed to be a Markov process, and thus the model is referred to as the Markov Modulated Bernoulli Process (MMBP). If the system fails in a period, then it is replaced immediately by an identical one at the beginning of the next period. It may be possible to think of the environmental process Y as a random mission process such that Y_t is the tth mission to be performed. The success and failure probabilities depend on the mission itself. If the device fails during a mission, then the next mission will be performed by a new and identical device. If we denote the lifetime of the system by L, then the conditional life distribution is

P[L = m | Y] = { π(Y_1),   if m = 1
                 (1 − π(Y_1)) ··· (1 − π(Y_{m−1})) π(Y_m),   if m ≥ 2.   (13.32)

Note that if π(i) = π for all i ∈ E, that is, the system reliability is independent of the environment, then (13.32) is simply the geometric distribution P[L = m | Y] = π(1 − π)^{m−1}. We can also write

P[L > m | Y] = (1 − π(Y_1))(1 − π(Y_2)) ··· (1 − π(Y_m))   (13.33)

for m ≥ 1. We represent the initial state of the Markov chain by Y_1, rather than Y_0 as is customarily done in the literature, so that it represents the first environment that the system operates in. Thus, most of our analysis and results will be conditional on the initial state Y_1 of the Markov chain. Therefore, for any event A and random variable Z we set P_i[A] = P[A | Y_1 = i] and E_i[Z] = E[Z | Y_1 = i] to express the conditioning on the initial state.


The life distribution satisfies the recursive expression

P_i[L > m + 1] = (1 − π(i)) Σ_{j∈E} P(i, j) P_j[L > m],   (13.34)

with the obvious boundary condition P_i[L > 0] = 1. The survival probabilities can be explicitly computed via

P_i[L > m] = Σ_{j∈E} Q_0^m(i, j),   (13.35)

where Q_0(i, j) = (1 − π(i)) P(i, j). Using (13.35), the conditional expected lifetime can be obtained as

E_i[L] = Σ_{m=0}^{+∞} Σ_{j∈E} Q_0^m(i, j) = Σ_{j∈E} R_0(i, j),   (13.36)

where R_0(i, j) = Σ_{m=0}^{+∞} Q_0^m(i, j) = (I − Q_0)^{−1}(i, j) is the potential matrix corresponding to Q_0.
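A small numerical sketch of (13.34)-(13.36) for a hypothetical two-state, weather-like environment; the transition matrix and failure probabilities are assumptions for illustration.

```python
import numpy as np

P  = np.array([[0.9, 0.1], [0.3, 0.7]])   # environment transition matrix (illustrative)
pi = np.array([0.02, 0.15])               # failure probability per period in each state

Q0 = (1.0 - pi)[:, None] * P              # Q_0(i, j) = (1 - pi(i)) P(i, j)

m = 12
survival = np.linalg.matrix_power(Q0, m).sum(axis=1)        # P_i[L > m], cf. (13.35)
mttf = np.linalg.solve(np.eye(2) - Q0, np.ones(2))          # E_i[L], cf. (13.36)
print(survival, mttf)
```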

4.2. Network Reliability Assessment

Özekici and Soyer [34] consider networks that consist of components operating under a randomly changing common environment in discrete time. Their work is motivated by power system networks that are subject to fluctuating weather conditions over time that affect the performance of the network. The effect of environmental conditions on the reliability of power networks has been recognized in earlier papers by Gaver, Montmeat and Patton [11] and Billinton and Bollinger [2], where the authors pointed out that power system networks are exposed to fluctuating weather conditions and that the failure rates of equipment and lines increase during severe environmental conditions. Consider a network with K components with an arbitrary structure function φ and reliability function h. The components of the network are observed periodically at times t = 1, 2, ... and the kth component survives a period in environment i with probability π_k(i). It follows that the life distribution of component k is characterized by

P[L_k > n | Y] = ∏_{t=1}^{n} π_k(Y_t)   (13.37)

since the component must survive the first n time periods. Moreover, we assume that, given the environment, the component lifetimes are conditionally independent so that

P[L_1 > n, L_2 > n, ..., L_K > n | Y] = ∏_{k=1}^{K} ∏_{t=1}^{n} π_k(Y_t).   (13.38)


We will denote the set of all components that are functioning prior to period t by Z_t, such that Z_1 is the set of all functioning components at the outset and

Z_{t+1} = {k = 1, 2, ..., K : X_t(k) = 1}   (13.39)

is the set of components that survive period t, for all t ≥ 1. The state space of the stochastic process Z = {Z_t; t = 1, 2, ...} is the set of all subsets of the component set K = {1, 2, ..., K}. Although it is not required in the following analysis, it is reasonable to assume that Z_1 = K. Moreover, it follows from the stochastic structure explained above that

P[Z_{t+1} = M | Z_t = S, Y_t = i] ≡ Q_i(S, M) = ∏_{k∈M} π_k(i) ∏_{k∈(S∩M^c)} (1 − π_k(i))   (13.40)

for any subsets M, S of K with M ⊆ S. In words, Q_i(S, M) is the probability that the set of functioning components after one period will be M, given that the environment is i and the set of functioning components is S. This function will play a crucial role in our analysis of the network. The stochastic structure of our network reliability model is made more precise by noting that, in fact, the bivariate process (Y, Z) is a Markov chain with transition matrix

P[Z_{t+1} = M, Y_{t+1} = j | Z_t = S, Y_t = i] = P(i, j) Q_i(S, M)   (13.41)

for any i, j ∈ E and subsets M, S of K with M ⊆ S. In many cases, it is best to analyze network reliability and other related issues using the Markov property of the chain (Y, Z). Denote the set structure function Ψ by

Ψ(M) = φ(m)   (13.42)

where m = (m_1, m_2, ..., m_K) is the binary vector with m_k = 1 if and only if k ∈ M. Then,

q_i(S) = Σ_{M⊆S, Ψ(M)=1} Q_i(S, M)   (13.43)

is the conditional probability that the network will survive one period in environment i given that the set of functioning components is S. The characterization in (13.43) can also be written in terms of the path-sets of the network. Let P denote the set of all combinations of components that make the network functional. In other words,

P = {M ⊆ K : Ψ(M) = 1};   (13.44)

then (13.43) becomes

q_i(S) = Σ_{M⊆S, M∈P} Q_i(S, M)   (13.45)


and

q(i) = q_i(K) = Σ_{M∈P} Q_i(K, M) = Σ_{M∈P} ∏_{k∈M} π_k(i)   (13.46)

is the probability that the network, with all components functioning, will survive one period in environment i. In the assessment of network reliability, we are interested in failure-free operation of the network for n time periods. More specifically, we want to evaluate P[L > n] for any time n ≥ 0. Note that we can trivially write

P[L > n] = Σ_{i∈E} P[L > n | Y_1 = i, Z_1 = S] P[Y_1 = i, Z_1 = S],   (13.47)

which requires computation of the conditional probability P[L > n | Y_1 = i, Z_1 = S] for any initial state i and S. We will denote the conditional network survival probability by

f(i, S, n) = P[L > n | Y_1 = i, Z_1 = S],   (13.48)

which is simply the probability that the network will survive n time periods given the set S of initially functioning components and the initial state i of the environment. Similarly, we define the conditional mean time to failure (MTTF) as

g(i, S) = E[L | Y_1 = i, Z_1 = S] = Σ_{n=0}^{+∞} P[L > n | Y_1 = i, Z_1 = S].   (13.49)

We will now exploit the Markov property of the process (Z, Y) to obtain computational results for f and g. Once they are computed, it is clear that we obtain the desired results as f(i, K, n) and g(i, K) since it is reasonable to assume that Z_1 = K initially.

Minimal and Maximal Repair Models

If we assume that there is minimal repair and all failed components are replaced only if the whole system fails, then the Markov property of (Z, Y) at the first transition yields the recursive formula

f(i, S, n + 1) = Σ_{j∈E} Σ_{M⊆S, Ψ(M)=1} P(i, j) Q_i(S, M) f(j, M, n).   (13.50)

The recursive system (13.50) can be solved for any (i, S) starting with n = 1 and the boundary condition f(i, S, 0) = 1. A further simplification of (13.50) is obtained by noting that we only need to compute f(i, S, n) for S ∈ P. The


definition of P in (13.44) implies that we can rewrite (13.50) as

f(i, S, n + 1) = Σ_{j∈E} Σ_{M⊆S, M∈P} P(i, j) Q_i(S, M) f(j, M, n)   (13.51)

for S ∈ P, since f(j, M, n) = 0 whenever M ∉ P. Similarly, using (13.49) or the Markov property directly we obtain the system of linear equations

g(i, S) = q_i(S) + Σ_{j∈E} Σ_{M⊆S, Ψ(M)=1} P(i, j) Q_i(S, M) g(j, M),   (13.52)

which can be solved easily since both E and K are finite. Once again, the dimension of the system of linear equations in (13.52) can be reduced by noting that g(j, M) = 0 whenever M ∉ P, so we only need to compute g(i, S) for S ∈ P. The reader should bear in mind that this computational simplification applies in all expressions with Ψ(M) = 1, since this is true if and only if M ∈ P. If we assume that there is maximal repair and all failed components are replaced at the beginning of each period, then all components are functioning at the beginning of a period and we can take Z_1 = S = K. Now (13.50) can be written as

f(i, K, n + 1) = Σ_{j∈E} Σ_{M⊆K, Ψ(M)=1} P(i, j) Q_i(K, M) f(j, K, n)   (13.53)

with the same boundary condition f(i, K, 0) = 1. Note that (13.53) is dimensionally simpler than (13.50) since it can be rewritten as

f(i, n + 1) = [ Σ_{M⊆K, Ψ(M)=1} Q_i(K, M) ] Σ_{j∈E} P(i, j) f(j, n)   (13.54)
            = q(i) Σ_{j∈E} P(i, j) f(j, n)   (13.55)

after suppressing K in f. A similar analysis for the MTTF yields the system of linear equations

g(i) = q(i) + q(i) Σ_{j∈E} P(i, j) g(j).   (13.56)

Defining the matrix R(i, j) = q(i)P(i, j), (13.56) can be written in compact form as g = q + Rg, with the explicit solution

g = (I − R)^{−1} q.   (13.57)
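A sketch of the maximal-repair recursions (13.55)-(13.57) for a hypothetical two-component series network in a two-state environment; all numerical values are illustrative assumptions.

```python
import numpy as np

E = 2
P   = np.array([[0.9, 0.1], [0.3, 0.7]])          # environment transition matrix
pik = np.array([[0.98, 0.90],                     # pi_k(i): survival probability of
                [0.95, 0.80]])                    # component k in environment i

# For a series system the only path set is {1, ..., K}, so q(i) = prod_k pi_k(i),
# cf. (13.46).
q = pik.prod(axis=1)

# Maximal repair: f(i, n+1) = q(i) * sum_j P(i, j) f(j, n), cf. (13.55)
f = np.ones(E)
for _ in range(10):                               # survival probability for 10 periods
    f = q * (P @ f)

# MTTF: g = (I - R)^{-1} q with R(i, j) = q(i) P(i, j), cf. (13.56)-(13.57)
R = q[:, None] * P
g = np.linalg.solve(np.eye(E) - R, q)
print(f, g)
```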

4.3. Bayesian Analysis of Discrete Time Models

The results presented for the MMBP and for network reliability assessment are all conditional on the specified parameters. In what follows, we consider the case where the parameters are treated as unknown and present a Bayesian analysis. In so doing, we present the Bayesian inference for the network reliability model and show that results for the MMBP can be obtained as a special case. Under the network reliability setup of Section 4.2, we describe our uncertainty about the elements of the transition matrix P and the elements of the vector π(i) = (π_1(i), ..., π_K(i)). Thus, in terms of our previous notation we have θ = (P, π(i), i ∈ E). As in Section 3.3, for the ith row of P we assume the Dirichlet prior given by (13.21), with the P_i's independent for i ∈ E. For a given environment, we assume that π(i) has independent components with beta densities denoted as π_k(i) ~ Beta(a_k(i), b_k(i)). Also, we assume that the π(i)'s are independent of each other for all i ∈ E and that they are independent of the components of P. If the network is observed for n time periods, then the observed data consists of D = {X_t; t = 1, ..., n}, where X_t = (X_t(1), X_t(2), ..., X_t(K)). The failure data also provides the values Z^n = {Z_t; t = 1, ..., n + 1}, since Z_{t+1} = {k = 1, 2, ..., K : X_t(k) = 1}. It is assumed that the environmental process is unobservable. In this case the Bayesian analysis of the network reliability presents a structure similar to the hidden Markov models which were considered by Robert, Celeux and Diebolt [37]. In the minimal repair model, we can write the likelihood function as

ℒ(θ, y^n; D) ∝ ∏_{t=1}^{n} P(y_{t−1}, y_t) { ∏_{k∈Z_{t+1}} π_k(y_t) ∏_{k∈(Z_t∩Z_{t+1}^c)} [1 − π_k(y_t)] },   (13.58)

where Z_1 ⊇ ··· ⊇ Z_{n+1} with Z_1 = K and y^n = (y_1, ..., y_n). In the maximal repair model, the likelihood function is given by

IT t=l

P(yt-1, yt)

{II [1fk(yt)]Xt(k) [1 - 1fk(yt)F-Xt(k)} . kEJ(

(13.59) Note that in (13.58) and (13.59), we set P(Yo, Y1) = 1 when t = 1 and we observe only n - 1 transitions of Y. As pointed out in Ozekici and Soyer [34], when the history of Y process is not observable, there is no analytically tractable posterior analysis. Thus, as in Section 3.3 the posterior analysis can be developed using the Gibbs sampler.


The full conditional distributions of the P_i's are obtained as independent Dirichlet densities

(P_i | D, y^n) ~ Dirichlet{ α_j^i + Σ_{t=1}^{n−1} 1(y_t = i, y_{t+1} = j); j ∈ E }.   (13.60)

The full conditionals of π_k(i), i ∈ E, k = 1, ..., K, are independent beta densities given by (π_k(i) | D, y^n) ~ Beta(a_k*(i), b_k*(i)) with

a_k*(i) = a_k(i) + Σ_{t=1}^{n} 1(y_t = i) 1(k ∈ Z_{t+1}),   (13.61)
b_k*(i) = b_k(i) + Σ_{t=1}^{n} 1(y_t = i) 1(k ∈ (Z_t ∩ Z_{t+1}^c))   (13.62)

for the minimal repair model, and with

a_k*(i) = a_k(i) + Σ_{t=1}^{n} 1(y_t = i) X_t(k),   (13.63)
b_k*(i) = b_k(i) + Σ_{t=1}^{n} 1(y_t = i) (1 − X_t(k))   (13.64)

for the maximal repair model. We note that, a posteriori, the elements of the π(i)'s and the P_i's are independent of each other for all i ∈ E. The full conditional distribution of the environmental process, p(y_t | D, y^{(−t)}, π(y_t), P), where y^{(−t)} = {y_τ; τ ≠ t}, is obtained for the minimal repair model as

p(y_t | D, y^{(−t)}, π(y_t), P) ∝ P(y_{t−1}, y_t) { ∏_{k∈Z_{t+1}} π_k(y_t) ∏_{k∈(Z_t∩Z_{t+1}^c)} [1 − π_k(y_t)] } P(y_t, y_{t+1})   (13.65)

and for the maximal repair case as

p(y_t | D, y^{(−t)}, π(y_t), P) ∝ P(y_{t−1}, y_t) { ∏_{k∈K} [π_k(y_t)]^{X_t(k)} [1 − π_k(y_t)]^{1−X_t(k)} } P(y_t, y_{t+1}).   (13.66)


Thus, for both repair scenarios, a posterior sample from p(θ, y^n | D) can be easily obtained by iteratively drawing from the given full posterior conditionals. Once the posterior distribution is obtained, posterior reliability predictions can be made by evaluating P[L̃ > m | D], where L̃ = L − n is the time remaining to network failure. For the minimal repair case, using (13.51) and the Markov property of the chain (Y, Z), by generating G realizations from the posterior distribution p(θ, y^n | D) we can approximate the posterior network reliability as a Monte Carlo integral

P[L̃ > m | D] ≈ (1/G) Σ_{g=1}^{G} Σ_{j∈E} P(y_n^{(g)}, j) f(j, Z_{n+1}, m − 1 | θ^{(g)}),   (13.67)

where f(j, Z_{n+1}, m | θ) is obtained as the solution of (13.51). In the maximal repair model, similar results can be obtained by using (13.55) to compute f for each realization g. For the MMBP we can obtain the Bayesian inference by considering the special case K = 1 in the maximal repair model. By setting a_k*(i) = a*(i), b_k*(i) = b*(i), π_k(i) = π(i), and X_t(k) = X_t in (13.63), (13.64) and (13.66), we obtain the posterior analysis for the MMBP.

References

[1] Arjas, E. (1981). The Failure and Hazard Process in Multivariate Reliability Systems, Mathematics of Operations Research, 6: 551-562.

[2] Billinton, R. and Bollinger, K. E. (1968). Transmission System Reliability Using Markov Processes, IEEE Transactions on Power Apparatus and Systems, 87: 538-547.

0 and constant variance σ². Then dX(s) = D(s)ds, s > 0, gives the increment of damage to the system at load s. Also assume that the drift parameter φ ≡ φ(L) can depend on the size or environmental level, L, of the system. For example, larger material specimens usually will sustain more damage, or fail more easily, at a specific tensile stress s, than smaller specimens. Then that relationship between φ and L can be described by using some parametric function of L, such as a power law, φ(L) = ηL^γ, or a linear law, φ(L) = η + γL, where the parameters can be positive or negative, yielding an increasing or decreasing relationship as appropriate for the type of system under study. Now the cumulative damage sustained by the system under test at load t > 0 can be represented as

X(t) − X(0) = ∫_0^t dX(s) = ∫_0^t D(s) ds.   (14.3)

It is well known (see Hoel, Port and Stone [13], for example) that (3) is also a Gaussian process since D(s) is. The mean function of X(t) is μ_t = φt ≡ φ(L)t and the variance function is σ_t² = σ²t. The amount of initial damage in the system at zero load, t = 0, due to inherent flaws or otherwise, is given by X(0) ≡ x_0. Actually, this initial damage can also depend on the size L, i.e. x_0 ≡ x_0(L), if appropriate for the system under study. If the system fails at an unknown critical damage level, A ≡ A(L), that also might depend on the system size, then the breaking stress, or strength, of the system of size L is a random variable, S, the


first passage of the cumulative damage process, X(t), to the critical threshold A(L). It is known that S has an inverse Gaussian distribution (see Chhikara and Folks [4], for example) with pdf depending on L,

f(s; μ_L, λ_L) = √(λ_L/(2πs³)) exp( −λ_L(s − μ_L)² / (2μ_L²s) ),   s > 0,   (14.4)

where μ_L = [A(L) − x_0(L)]/φ(L) and λ_L = {[A(L) − x_0(L)]/σ}². The pdf (4) is the usual form of the well-known inverse Gaussian pdf, and the distribution of S is denoted as S = strength of system of size L ~ IG(μ_L, λ_L). The dependency of the parameters [A(L) − x_0(L)] and/or A(L) on the variable L can be quantified by using parametric acceleration models such as the power law or linear law mentioned earlier, or a host of possible others that might be appropriate for the situation under study. The acceleration models are chosen to represent the type of relationship that might exist between one or both of the inverse Gaussian parameters and the acceleration variable L. Possible choices include the inverse linear law of Bhattacharyya and Fries [3], power laws which give the "Gauss-Weibull Additive Model" of Durham and Padgett [9], exponential laws, and others (see Table I in Onar and Padgett [19] for examples of several such acceleration models). Therefore, this continuous cumulative damage approach yields an entire family of possible inverse Gaussian accelerated test models with the acceleration variable L being the system size or another environmental variable under which such systems at several values of L are being tested. In light of the failure rate properties of strength distributions of load-sharing systems discussed in the previous section, it is curious that the inverse Gaussian distribution (4) arises as the distribution of the system strength. It is known (e.g. see Chhikara and Folks [4]) that the failure rate of the usual two-parameter inverse Gaussian distribution (equation (4) with parameters not depending on L) is in general first increasing and then decreasing to a finite positive asymptote. Graphs of the average failure rate function indicate the same type of behavior, although it has not been investigated thoroughly. So, how can the system strength have such a distribution as (4), which is in general neither IFR nor IFRA? The inverse Gaussian distribution function is a linear combination of standard normal distribution functions (Chhikara and Folks [4]),

F(s) = Φ( √(λ/s)(s/μ − 1) ) + exp(2λ/μ) Φ( −√(λ/s)(1 + s/μ) ),   s > 0.   (14.5)

When λ is much larger than μ, Bhattacharyya and Fries [2] showed that the second term of (5) is negligible, and hence the cdf behaves like a normal distribution, which is known to have an increasing failure rate. Durham and Padgett [9] showed that for their inverse Gaussian models fitted to actual strength data


for carbon fibers and small composite specimens, the estimate of the scale parameter, λ, was for all data sets much larger than the estimate of the mean, μ, denoted without the dependency on L here. Thus, the estimated system strength distribution has the increasing failure rate property as discussed in Section 2. The maximum likelihood estimates of the model parameters can be found, although not in closed form, and the Fisher information matrix for any of the inverse Gaussian accelerated test models in the family obtained above can be calculated for asymptotic inference purposes. At each of k different sizes or values of L, L_i, i = 1, ..., k, tests are performed on n_i systems, observing the system strengths denoted by S_ij, j = 1, ..., n_i. Then, for estimating the unknown model parameters over all of the k sizes or values of L, the likelihood function

L(η) = ∏_{i=1}^{k} ∏_{j=1}^{n_i} f(S_ij; μ_{L_i}, λ_{L_i}),

or its logarithm, is maximized with respect to the vector of model parameters, η.
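A sketch of maximizing this log-likelihood numerically for one member of the family, taking (as in the example below) the exponential-law mean model μ_L = exp(θL)/ζ with a constant scale parameter λ. The simulated data and starting values are illustrative assumptions, not the data of Section 4.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, lengths, strengths):
    zeta, lam, theta = params
    if zeta <= 0 or lam <= 0:
        return np.inf
    ll = 0.0
    for L, s in zip(lengths, strengths):
        mu = np.exp(theta * L) / zeta                   # exponential-law acceleration
        ll += 0.5 * np.log(lam / (2 * np.pi * s**3)) \
              - lam * (s - mu)**2 / (2 * mu**2 * s)     # inverse Gaussian log-pdf (14.4)
    return -ll

# Illustrative strengths at two gauge lengths (placeholders only).
lengths   = np.array([20, 20, 20, 80, 80, 80])
strengths = np.array([2.9, 2.6, 2.8, 2.4, 2.3, 2.5])

res = minimize(neg_loglik, x0=[0.4, 50.0, -0.002],
               args=(lengths, strengths), method="Nelder-Mead")
print(res.x)          # estimates of (zeta, lambda, theta)
```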

An example will be given in the next section for which a good-fitting inverse Gaussian accelerated test model (4) is given by the mean being an exponential law acceleration model, μ_L = exp(θL)/ζ, and the scale parameter not depending on L, λ_L = λ. For this model, the parameters are η = (ζ, λ, θ), and the three-dimensional Fisher information matrix is

I(ζ, λ, θ) = ( b_1   0    b_2
               0     b_3  0
               b_2   0    b_4 ),   (14.6)

where

b_1 = (λ/ζ) Σ_{i=1}^{k} n_i e^{−θL_i},   b_2 = −λ Σ_{i=1}^{k} n_i L_i e^{−θL_i},

b_3 = m/(2λ²),   with m = Σ_{i=1}^{k} n_i,   and   b_4 = λζ Σ_{i=1}^{k} n_i L_i² e^{−θL_i}.

It is easy to obtain the inverse of I in closed form, and then the estimated Fisher information can be readily calculated from the maximum likelihood estimates of the parameters. The Fisher information for other models can be found similarly. Also, an expression for the p100th percentile of the strength distribution (4) can be found, as in Durham and Padgett [9], as

s_p(L) = (μ_L² / (4λ_L)) ( z_p + √(z_p² + 4λ_L/μ_L) )²,

where z_p is the p100th percentile of the standard normal distribution. An approximate (conservative) lower confidence bound for the percentile can be calculated via Bonferroni arguments. Specific examples will be given next.

4.

Examples: Tensile Strength of Materials

To illustrate the models and inference procedures described in section 3, two data sets of tensile strength of materials will be considered here. In engineering design considerations, small percentiles of a material's strength distribution are important. In particular, lower confidence bounds on small percentiles are of interest and will be presented for the two data sets. For example, a 90% lower confidence bound on the first percentile of the strength distribution at size L gives a conservative estimate of the value which has 99% of the strengths larger than it for engineering design purposes. The estimated Fisher information matrices will be used to calculate such approximate lower confidence bounds for the percentiles for large sample sizes.

4.1.

Carbon Fibers

Observed tensile strengths (in GPa, giga-pascals) of single PAN carbon fibers were obtained by Nunes, Edie, Bernardo and Pouzada [18]. Data adapted from these observations are given in Table II of Onar and Padgett [19], consisting of observed tensile strength values at five different fiber lengths, or "gauge lengths," L_1 = 20 mm, L_2 = 30 mm, L_3 = 40 mm, L_4 = 60 mm, and L_5 = 80 mm. There were n_1 = 18, n_2 = 22, n_3 = 16, n_4 = 24, and n_5 = 14, for a total of m = 94 observed strengths. Onar and Padgett [19] found that the best fitting inverse Gaussian accelerated test model among those considered was the exponential law for the mean parameter, μ_L = exp(θL)/ζ, and the scale parameter not depending on L, λ_L = λ, as discussed in Section 3. The maximum likelihood estimates (MLEs) for this model were found to be ζ̂ = 0.3442995, λ̂ = 61.09641, and θ̂ = −0.002181136. The Fisher information matrix (6) can then be estimated from these MLEs, and its inverse is

Î^{−1} = ( 0.00032321   0.00000000   0.00001705
           0.00000000   79.4206800   0.00000000
           0.00001705   0.00000000   0.00000108 ).

Note that for this model, μ̂_20 = 2.7805, which is much less than λ̂, as are the estimates of μ_L for the other gauge lengths. Hence, as discussed in Section 3, the fitted inverse Gaussian-type model here has an increasing failure rate. For the p100th percentile of the strength distribution at "gauge length" L, s_p(L), the MLE, ŝ_p(L), can be calculated from the parameter estimates. Also, using Bonferroni's inequality, conservative asymptotic lower confidence bounds can be obtained from appropriate lower or upper confidence bounds on the three parameters ζ, λ, and θ. Table 14.1 gives the MLEs of the percentiles and corresponding approximate 90% lower confidence bounds, ŝ_{p,LB}(L), on the first, tenth, and fiftieth percentiles at each of the five fiber gauge lengths.

Table 14.1. MLEs and 90% Lower Confidence Bounds on Fiber Strength Percentiles

Gauge Length   ŝ.01(L)   ŝ.01,LB(L)   ŝ.10(L)   ŝ.10,LB(L)   ŝ.50(L)   ŝ.50,LB(L)
20 mm          1.70114   1.42765      2.11715   1.81258      2.78047   2.44254
30 mm          1.67316   1.38510      2.07757   1.75043      2.72048   2.34471
40 mm          1.64555   1.34355      2.03866   1.69022      2.66179   2.25079
60 mm          1.59144   1.26339      1.96284   1.57539      2.54817   2.07410
80 mm          1.53879   1.18712      1.88961   1.46770      2.43940   1.91127
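The MLE columns of Table 14.1 can be checked directly from the reported estimates ζ̂, λ̂, θ̂ and the percentile expression given above; the short sketch below does this (the computation, not the data, is new here).

```python
import numpy as np
from scipy.stats import norm

zeta, lam, theta = 0.3442995, 61.09641, -0.002181136   # MLEs reported above

def s_p(p, L):
    mu = np.exp(theta * L) / zeta                       # mu_L under the exponential law
    z = norm.ppf(p)
    return mu**2 / (4 * lam) * (z + np.sqrt(z**2 + 4 * lam / mu))**2

for L in (20, 30, 40, 60, 80):
    print(L, s_p(0.01, L), s_p(0.10, L), s_p(0.50, L))
# The printed values agree with the MLE columns of Table 14.1 to about four decimals.
```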

4.2. Micro-Composite Specimens

Next, the tensile strength measurements of Bader and Priest [1] on 1000-carbon-fiber impregnated tows (micro-composites) are considered. This data appears explicitly in Smith (1991), and consists of 28 observed strengths at L_1 = 20 mm gauge length, 30 values at L_2 = 50 mm gauge length, 32 observations at L_3 = 150 mm, and 29 at L_4 = 300 mm. These 1000-carbon-fiber bundles embedded in resin ("impregnated") are small load-sharing systems in which the carbon fibers are the interacting components. Onar and Padgett [19] found that the best fitting inverse Gaussian accelerated test model among those considered there was an exponential law acceleration function, where μ_L = exp(θL)/ζ and λ_L = exp(2θL)/γ². Hence, both inverse Gaussian parameters are "accelerated" here. The MLEs of ζ, γ and θ are found to be ζ̂ = 0.3488066, γ̂ = 0.04181022, and θ̂ = −0.00463525. Again, note that μ̂_20 = 2.61309, which is much less than λ̂_20 = 475.23953, and μ̂_300 = 0.71367 and λ̂_300 = 35.44843, almost fifty times as large as μ̂_300. Thus, the fitted strength distributions have IFR as before.

For this model, the three-dimensional Fisher information matrix I(ζ, γ, θ) can again be obtained in closed form (14.7), with entries involving the sums Σ_{i=1}^{k} n_i L_i e^{θL_i} and Σ_{i=1}^{k} n_i L_i together with m = Σ_{i=1}^{k} n_i and the parameters γ, ζ and θ.


From the parameter MLEs, the estimated inverse of the Fisher information matrix is

Î^{−1} = ( 1.295817 × 10^{−5}   9.395819 × 10^{−7}   1.718658 × 10^{−7}
           9.395819 × 10^{−7}   7.462319 × 10^{−6}   2.147137 × 10^{−8}
           1.718658 × 10^{−7}   2.147137 × 10^{−8}   3.927486 × 10^{−9} ).

Similar to the estimation of the strength distribution percentiles in the first example, the MLEs of the percentiles and their corresponding approximate lower confidence bounds are given in Table 14.2.

Table 14.2. MLEs and 90% Lower Confidence Bounds on Micro-Composite Strength Percentiles

Gauge Length   ŝ.01(L)   ŝ.01,LB(L)   ŝ.10(L)   ŝ.10,LB(L)   ŝ.50(L)   ŝ.50,LB(L)
20 mm          2.40777   2.40755      2.59310   2.56863      2.84046   2.78130
50 mm          2.37179   2.36316      2.55566   2.52269      2.80124   2.73345
150 mm         2.25556   2.22085      2.43467   2.37531      2.67436   2.57982
300 mm         2.09146   2.02282      2.26361   2.16994      2.49729   2.36541

In conclusion, the approach presented here is based on physical considerations of cumulative damage, and results in a relatively rich family of useful inverse Gaussian-type accelerated test models. Although a computer is needed to find the parameter estimates from the likelihood function, this is not a drawback, and it is relatively easy to compute the estimates and bounds from the strength data with these models, as the two examples show.

Acknowledgments

This work was partially supported by the National Science Foundation under grant number DMS-9877107. The authors are grateful to Mr. Brandon Julio for computing the parameter estimates, the Fisher information, and the percentile bounds for the illustrative examples.

References [1] Bader, M.G. and Priest, A.M. (1982). Statistical Aspects of Fibre and Bundle Strength in Hybrid Composites, in T. Hayashi, K. Kawata and S. Umekawa, editors, Progress in Science and Engineering of Composites, pp. 1129-1136, ICCM-IV, Tokyo. [2] Bhattacharyya, G. K. and Fries, A. (1982a). Fatigue Failure Models Birnbaum-Saunders vs. Inverse Gaussian, IEEE Transactions on Reliability, R-31: 439-441. [3] Bhattacharyya, G. K. and Fries, A. (1982b). Inverse Gaussian Regression and Accelerated Life Tests, in Proceedings ofthe Special Topics Meeting on


Survival Analysis, 138th Meeting of the Institute of Mathematical Statistics, Columbus, OH, October 1981, pp. 101-117. [4] Chhikara, R. S. and Folks, J. L. (1989). The Inverse Gaussian Distribution, Marcel Dekker, New York. [5] Durham, S. D., Lynch, J. and Padgett, W. J. (1988). Inference for the Strength Distributions of Brittle Fibers under Increasing Failure Rate, Journal of Composite Materials, 22: 1131-1140. [6] Durham, S. D., Lynch, J. and Padgett, W. J. (1989). A Theoretical Justification for an Increasing Failure Rate Average Distribution in Fibrous Composites, Naval Research Logistics, 36: 655-661. [7] Durham, S. D. , Lynch, J. and Padgett, W. J. (1990). TP2-orderings and IFRness with Applications, Probability in the Engineering and Informational Sciences, 4: 73-88. [8] Durham, S. D., Lynch, J. D., Padgett, W. J., Horan, T. J., Owen, W. J. and Surles, J. (1997). Localized Load-Sharing Rules and Markov-Weibull Fibers: A Comparison of Microcomposite Failure Data with Monte Carlo Simulations, Journal of Composite Materials, 31: 1856-1882. [9] Durham, S. D. and Padgett, W. J. (l997b). A Cumulative Damage Model for System Failure with Application to Carbon Fibers and Composites, Technometrics, 39: 34-44. [10] Gurland, J. and Sethuraman, J. (1994). Reversal of Increasing Failure Rates When Pooling Failure Data, Journal of the American Statistical Association, 90: 1416-1423. [11] Gurland, J. and Sethuraman, J. (1995). How Pooling Failure Data May Reverse Increasing Failure Rates, Technometrics, 36: 416-418. [12] Harlow, D. G., Smith, R. L. and Taylor, H. M. (1983). Lower Tail Analysis of the Distribution of the Strength of Load-Sharing Systems, Journal of Applied Probability, 20: 358-367. [13] Hoel, P., Port, S. and Stone, C. (1972). Introduction to Stochastic Processes, Houghton Mifflin, Boston. [14] Karlin, S. (1968). Total Positivity, Stanford University Press, Stanford, CA. [15] Lee, S., Durham, S. D., and Lynch, J. (1995). On the Calculation of the Reliability of General Load-Sharing Systems, Journal ofApplied Probability, 32: 777-792. [16] Lee, S. and Lynch, J. (1997). Total Positivity of Markov Chains and the Failure Rate Character of Some First Passage Times, Advances in Applied Probability, 32: 713-732.

REFERENCES

289

[17] Lynch, J. D. (1999). On Conditions For Mixtures Of Increasing Failure Rate Distributions To Have An Increasing Failure Rate, Probability in the Engineering and Informational Sciences, 13: 31-36. [18] Nunes, J. P., Edie, D. D., Bernardo, C. A. and Pouzada, A. S. (1997). Formation and Characterization of Po1ycarbonate Towpregs and Composites, Technical Report, Department of Polymer Engineering, Universidade do Minho, Portugal. [19] Onar, A. and Padgett, W. J. (2000). Inverse Gaussian Accelerated Test Models Based on Cumulative Damage, Journal ofStatistical Computation and Simulation, 64: 233-247. [20] Owen, W. J. and Padgett, W. J. (1998). Bimbaum-Saunders-Type Models for System Strength Assuming Multiplicative Damage, in A. P. Basu, et. aI., editors, Frontiers in Reliability, pp. 283-294, World Scientific Publishers, River Edge, NJ. [21] Wolstenholme, L. C. (1995). A Nonparametric Test of the Weakest-Link Principle, Technometrics, 37: 169-175. [22] Wolstenholme, L. C. (1999). Reliability Modeling, A Statistical Approach, Chapman & Hall/CRC, Boca Raton, FL.

Chapter 15 MODELING THE RELIABILITY OF HIP REPLACEMENTS Simon P. Wilson Department of Statistics, Trinity College Dublin 2, Ireland [email protected]

Abstract

In this paper we develop a hierarchical model for fatigue crack initiation and growth in hip replacements. The model is fitted to data on crack growth in 5 such replacements using the Bayesian approach; the fitted model is then used to make predictions of replacement lifelength based on knowledge of the failure modes.

Keywords:

Bayesian inference, crack growth, damage accumulation, hierarchical model, hip replacement.

1.  Introduction

Several hundred thousand hip replacement operations are carried out each year. The procedure consists of inserting a metal prosthesis into the femur, which is then secured to the bone by a polymer cement. Typical lifetimes of a replacement are of the order of 10-20 years, the failure mechanism being the growth and propagation of fatigue cracks in the polymer, which cause implant loosening. These cracks are induced by stresses of several times bodyweight that occur when weight is placed on the leg. Failure by crack growth occurs in two ways: the growth of one large crack, or 'damage accumulation', the cumulative effect of many small cracks. In this paper a hierarchical model for crack growth will be described and fitted. The fitted model will then be used to make reliability predictions. These predictions use the Bayesian approach, implemented by Markov chain Monte Carlo (MCMC). The method is applied to data on crack growth in five specimen replacements.

2.  Data and a Model

Five replacements were subjected to cyclic stresses of constant frequency, typical of normal use. Details of the experiment can be found in McCormack et al. [2]. Some cracks were already in existence at the start of the experiment, and are called preload cracks. In addition to preload cracks, others initiate during the experiment. Data on the number and length of all cracks were recorded at five times: 0, 0.5, 1, 2.5 and 5 million stress cycles. Over the five specimens, there were a total of 116 preload cracks, between 0 and 58 per specimen. After 5 million cycles, a total of 1315 cracks had been observed, with between 55 and 544 per specimen. Not all the cracks could be observed; only two-thirds of the polymer was visible through a series of windows. As an example of the data collected, Figure 15.1 shows the crack length data from one of the specimens (cracks only observed at 5 million cycles are not plotted). The five observation times (in millions of cycles) are $n_1 = 0, n_2 = 0.5, \dots, n_5 = 5$. Then $N_i(n_k)$ is defined to be the number of cracks observed in specimen i at $n_k$ million cycles, $L_{ij}(n_k)$ is the length of crack j in specimen i at $n_k$ million cycles, and $l_{ij}(n_k) = \log(L_{ij}(n_k))$. Let $I_{ij}$ be the initiation time of crack j in specimen i (also in millions of cycles). We observe the number of cracks at different times, $\{N_i(n_k) \mid i = 1, \dots, 5;\ k = 1, \dots, 5\}$; the initiation times, $\{I_{ij} \mid i = 1, \dots, 5;\ j = 1, \dots, N_i(5)\}$, interval censored to within successive observation times; and the lengths, $\{L_{ij}(n_k) \mid i = 1, \dots, 5;\ j = 1, \dots, N_i(5);\ n_k \ge I_{ij}\}$.

Initiation has been investigated in McCormack et al. [3], where a homogeneous Poisson process model was proposed for the initiation of load-initiated cracks, with a rate that is linear in the number of preload cracks in the specimen. To simplify the growth process model, we discretize it to units of 0.5 million cycles, this being the smallest length of time between observations. The initiation time is 0 for a preload crack, and is drawn from the Poisson process for a load-initiated crack. We assume that there is an initially observed length, from which cracks grow exponentially (hence linearly on the log scale). In addition, we see from Figure 15.1 that cracks may suddenly grow at a much quicker rate; this corresponds to sudden jumps in crack length that can occur when a crack meets a local weakness in the material. We assume that such jumps occur as a Markov process; this is a common approach in fatigue crack models, see Sobczyk and Spencer [4]. Since jumps are rather rare, we allow only one jump in each successive 0.5 million cycles. The jumps are multiplicative (hence additive on the log scale). We then assume the standard hierarchical model; parameters for each crack are assumed exchangeable within a specimen, and an additional level is defined across specimens. Putting all of this together, the following specifies the model for both initiation and growth:

$$N_i(n) \sim N_i(0) + \text{Poisson process with rate } \phi_{\text{load}} + \xi_i N_i(0); \qquad (15.1)$$

$$I_{ij} \sim \begin{cases} 0, & \text{if preload,} \\ \text{Poisson process event time from } N_i(n), & \text{else;} \end{cases} \qquad (15.2)$$

$$l_{ij}(n) \sim N\!\left(\mu_{ij}(n), \sigma^2_{ij}\right), \quad n \ge I_{ij}; \qquad (15.3)$$

$$\mu_{ij}(n) = \beta_{ij} + \alpha_i (n - I_{ij}) + \sum_{I_{ij} < 0.5m \le n} \gamma_{ijm} J_{ijm}; \qquad (15.4)$$

$$\log(\beta_{ij}) \sim \begin{cases} N\!\left(B_{\text{prel}}, S^2_{\text{prel}}\right), & \text{if preload } (I_{ij} = 0), \\ N\!\left(B_{\text{load}}, S^2_{\text{load}}\right), & \text{if } I_{ij} > 0; \end{cases} \qquad (15.5)$$

$$\log(\gamma_{ijm}) \sim N\!\left(\Gamma_i, S^2_{\gamma,i}\right), \quad \text{if } 0.5m > I_{ij}; \qquad (15.6)$$

$$J_{ijm} \sim \text{Bernoulli}\!\left(1 - \exp(-0.5\lambda_i)\right), \quad \text{if } 0.5m > I_{ij}; \qquad (15.7)$$

$$\sigma^2_{ij} \sim IG(\psi, \eta) \ \text{(inverse gamma)}; \qquad (15.8)$$

with, at the specimen level, $\alpha_i \sim N(A, s^2_\alpha)$, $\log(\Gamma_i) \sim N(C, s^2_\Gamma)$, $\log(\lambda_i) \sim N(\Lambda, s^2_\lambda)$, $S^2_{\gamma,i} \sim IG(\psi_\gamma, \eta_\gamma)$ and $\log(\xi_i) \sim N(\Xi, s^2_\xi)$. The following hyperpriors were assumed: Gaussian priors on $A$, $B_{\text{prel}}$, $B_{\text{load}}$, $C$, $\Lambda$, $\log(\phi_{\text{load}})$ and $\Xi$ with zero mean and large variance; $IG(0.5, 0.5)$ priors on $S^2_{\text{prel}}$, $S^2_{\text{load}}$, $s^2_\alpha$, $s^2_\Gamma$, $s^2_\lambda$ and $s^2_\xi$; and exponential priors with large mean on $\psi$ (the scale parameter), $\eta$, $\psi_\gamma$ and $\eta_\gamma$.
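To make the growth part of the model concrete, the sketch below simulates a single crack's log-length trajectory on the 0.5-million-cycle grid under the structure of (15.3)-(15.7). It is a minimal illustration, not the authors' fitting code, and the function name and all parameter values are hypothetical.

```python
import numpy as np

def simulate_log_length(beta, alpha, Gamma, S2_gamma, lam, sigma2,
                        I_init=0.0, n_end=5.0, step=0.5, rng=None):
    """Simulate log crack length l(n) on a grid of `step` million cycles:
    linear drift in log length plus rare additive jumps (multiplicative
    on the raw length scale), as in (15.3)-(15.7)."""
    rng = np.random.default_rng() if rng is None else rng
    p_jump = 1.0 - np.exp(-step * lam)            # Bernoulli probability, eq. (15.7)
    grid = np.arange(I_init + step, n_end + 1e-9, step)
    jump_total, traj = 0.0, []
    for n in grid:
        if rng.random() < p_jump:                  # at most one jump per interval
            log_gamma = rng.normal(Gamma, np.sqrt(S2_gamma))   # eq. (15.6)
            jump_total += np.exp(log_gamma)
        mu = beta + alpha * (n - I_init) + jump_total          # eq. (15.4)
        traj.append((n, rng.normal(mu, np.sqrt(sigma2))))      # eq. (15.3)
    return traj

# hypothetical parameter values, for illustration only
print(simulate_log_length(beta=3.0, alpha=0.2, Gamma=-0.5,
                          S2_gamma=0.1, lam=0.1, sigma2=0.05)[:3])
```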

3.  Model Fitting and Checking

The posterior distribution of all model parameters given the data is calculated by a Gibbs sampler, which draws samples from the posterior distribution. Some full conditionals require adaptive rejection or adaptive rejection Metropolis sampling, following Gilks et al. [1]. The Poisson process model parameters for initiation can be analysed separately, see McCormack et al. [3]. In what follows, we assume that there are $X$ samples of all parameters available from the posterior distribution, with the $x$th sample of a parameter denoted by a superscript $(x)$, thus $A^{(x)}$, $\gamma_{ijm}^{(x)}$, etc. Model assessment is done by comparing the data with the posterior predictive distribution. For the growth model we compare $l_{ij}(n_k)$ with $p(l_{ij}(n_k) \mid \text{data})$, calculated with the usual kernel density approximation

$$p(l_{ij}(n_k) \mid \text{data}) \approx \frac{1}{X} \sum_{x=1}^{X} p\!\left(l_{ij}(n_k) \mid \mu_{ij}(n_k)^{(x)}, \sigma^{2\,(x)}_{ij}\right). \qquad (15.9)$$


4.  Reliability Prediction

Reliability predictions are made conditional on no failure at 5 million cycles. Our aim is therefore to compute

$$R(n \mid \text{data}, n > 5) = R(n \mid \text{data}) / R(5 \mid \text{data}),$$

so it is sufficient to show how to compute $R(n \mid \text{data})$ for any $n \ge 5$. There are two failure modes: the appearance of a dominant crack, or microcrack accumulation. Before we develop an expression for the reliability, three issues must be addressed: crack censoring, new crack initiation, and predicting future crack lengths.

4.1.  Censored Cracks and New Crack Initiation

Recall that one third of the specimen volume was not observed, and most likely contains many unobserved cracks. A crude model for the number of missing preload cracks is that it is Poisson distributed with mean equal to one half of the number observed. Letting the number of censored preload cracks in specimen i be $N^C_i(0)$, we therefore assume that $N^C_i(0) \sim \text{Poisson}(0.5\,N_i(0))$. We let the number of censored load-initiated cracks up to the end of the experiment be denoted by $N^C_i(5) - N^C_i(0)$, which according to the model is $\text{Poisson}\!\left(0.5 \times 5 \times (\phi_{\text{load}} + \xi_i N_i(0))\right)$. For $n > 5$ million cycles, the number of new cracks forming over the entire volume of the specimen is then Poisson with rate

$$1.5\,\phi_{\text{load}} + \xi_i\left(N_i(0) + N^C_i(0)\right) \qquad (15.10)$$

per million cycles, where $N^C_i(0) \sim \text{Poisson}(0.5\,N_i(0))$.

4.2.  Predicting Future Crack Length

Log length at a time $n > 5$ million cycles is Gaussian with mean $\mu_{ij}(n)$ and variance $\sigma^2_{ij}$, where:

$$\mu_{ij}(n) = \mu_{ij}(5) + \alpha_i (n - 5) + \sum_{5 < 0.5m \le n} \gamma_{ijm} J_{ijm}. \qquad (15.11)$$

There is a distinction between the parameters in $\mu_{ij}(5)$, for which MCMC samples exist, and the $J_{ijm}$ and $\gamma_{ijm}$ for times after 5 million cycles, which have not been fitted and are therefore distributed as a Bernoulli distribution with success probability $1 - \exp(-0.5\lambda_i)$ and a normal distribution (for $\log(\gamma_{ijm})$) with mean $\Gamma_i$ and variance $S^2_{\gamma,i}$, respectively. For an unobserved crack, either censored or initiating after 5 million cycles, all parameters will be drawn from these specimen-level distributions.

4.3.  An Expression for the Reliability

Using the MCMC samples, we can simulate the damage in a replacement at any number of cycles and hence make a statement about the reliability. Failure occurs when either the largest crack exceeds a length $L_M$ or the total damage (being the sum of crack lengths) exceeds a threshold $L_S$. So the reliability function for specimen i at n million cycles is:

$$R_i(n \mid \text{data}) = P\!\left( \max_j L_{ij}(n) \le L_M \ \text{and} \ \sum_j L_{ij}(n) \le L_S \ \Big|\ \text{data} \right). \qquad (15.12)$$

We estimate this by simulating the number and length of cracks at n million cycles for each MCMC sample and taking the proportion of samples for which the conditions $\max_j L_{ij}(n) \le L_M$ and $\sum_j L_{ij}(n) \le L_S$ hold, thus:

$$R_i(n \mid \text{data}) \approx \frac{1}{X} \sum_{x=1}^{X} I\!\left( M_i(n)^{(x)} \le L_M \ \text{and} \ S_i(n)^{(x)} \le L_S \right), \qquad (15.13)$$

where $I$ is the indicator function,

$$M_i(n)^{(x)} = \max_{j = 1, \dots, N_i(5) + K_i(n)^{(x)}} L_{ij}(n)^{(x)}, \qquad S_i(n)^{(x)} = \sum_{j=1}^{N_i(5) + K_i(n)^{(x)}} L_{ij}(n)^{(x)},$$

and $K_i(n)^{(x)}$, the number of unobserved cracks (censored or initiating after 5 million cycles), is sampled according to Equation 15.10 using the $x$th MCMC sample of $\phi_{\text{load}}$, $\xi_i$ and $N^C_i(0)$. We then compute $R(n \mid \text{data}, n > 5) = R(n \mid \text{data}) / R(5 \mid \text{data})$ for each specimen. From the dimensions of the prosthesis, reasonable values for the thresholds were thought to be 4,000 µm for $L_M$ and 60,000 µm for $L_S$. Note that we have only weak information from the 5 surviving specimens on $L_M$ and $L_S$: the maximum crack lengths (929, 263, 481, 556, 701) and the sums of crack lengths (7735, 5936, 44201, 11734, 33203) over the observed portions of the replacements, respectively, are consistent with these thresholds.

Figure 15.3.  The posterior reliability for the 5 specimens conditional on no failure at 5 million cycles (reliability against number of cycles, in millions).

Figure 15.3 shows the reliability predictions for the 5 specimens. Specimens 3 and 5 are the most likely to fail early and are those with the largest sums of crack lengths at 5 million cycles. If the probability of failure due to the dominant-crack mode alone is calculated, we see that there is a non-zero chance of failure at any time, because each crack has a small chance of exceeding $L_M$ under the fitted model, and this chance is multiplied over all the cracks in the specimen. However, this probability changes little with time. This reflects the fact that most cracks do not grow quickly, and indicates that the dominant mode of failure is accumulation, driven by new crack initiation.
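A minimal sketch of the simulation behind (15.13) is given below. It assumes a `sample_crack_lengths` helper (hypothetical, not part of the paper) that returns the simulated crack lengths for one posterior draw, and simply counts the draws in which both thresholds are respected.

```python
import numpy as np

def reliability(n, posterior_draws, sample_crack_lengths,
                L_M=4000.0, L_S=60000.0):
    """Monte Carlo estimate of R_i(n | data) as in (15.13).

    posterior_draws: list of MCMC samples of the model parameters.
    sample_crack_lengths(draw, n): hypothetical helper returning an array of
        simulated crack lengths (micrometres) at n million cycles, including
        censored and newly initiating cracks for that draw.
    """
    hits = 0
    for draw in posterior_draws:
        lengths = np.asarray(sample_crack_lengths(draw, n))
        ok = lengths.max(initial=0.0) <= L_M and lengths.sum() <= L_S
        hits += int(ok)
    return hits / len(posterior_draws)

# conditional reliability given survival to 5 million cycles:
# R(n | data, n > 5) = reliability(n, ...) / reliability(5, ...)
```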

5.  Conclusions

The exchangeable model for crack initiation and growth that we have proposed allows us to make reliability predictions that take account of both new crack initiation and the censoring of cracks outside the observation window. Model fit shows some problems with the model for crack initiation lengths, but we have argued that these are not important to the reliability predictions. There are other properties of the replacement whose relationship with initiation and growth might be investigated, such as stress and loading. Further experiments are being conducted that will allow these to be measured. Of particular importance are prosthesis properties that can be measured before insertion; if relationships between such properties and reliability can be established, then there exist clear opportunities to select prostheses that have improved chances of long lifetimes.

Acknowledgments Thanks to Brendan McCormack and Paddy Prendergast for all their assistance with the data.

References
[1] Gilks, W. R., Best, N. G., and Tan, K. K. C. (1995). Adaptive rejection Metropolis sampling within Gibbs sampling. Appl. Statist., 44: 455-472.
[2] McCormack, B. A. O., Prendergast, P. J., and Gallagher, D. G. (1996). An experimental study of damage accumulation in cemented hip prostheses. Clinical Biomechanics, 11: 214-219.
[3] McCormack, B. A. O., Walsh, C. D., Wilson, S. P., and Prendergast, P. J. (1998). A statistical analysis of microcrack accumulation in PMMA under fatigue loading: applications to orthopaedic implant fixation. Int. J. Fatigue, 20: 581-593.
[4] Sobczyk, K. and Spencer, B. F. (1992). Random Fatigue: From Data to Theory. Academic Press, San Diego.

VII

RELIABILITY IN FINANCE AND FORENSICS

Chapter 16 THE PRICE OF FAILURE Nicholas James Lynn Canadian Imperial Bank of Commerce London, UK

[email protected]

Abstract

On the surface, the theories of financial derivatives and reliability have little in common. However, the new financial discipline of "credit derivatives," concerned with payments that are contingent on the "survival" of debtors, promises to bring the two theories together. This paper attempts to cast these contingent payments in a reliability setting and thereby demonstrates how fields as diverse as network reliability and burn-in have application in finance. Key differences, such as the need for "risk-neutral" pricing, are highlighted.

Keywords:

Burn-in, credit derivatives, martingales, network reliability, reliability theory.

1.  Introduction

The fields of reliability and financial derivatives have developed in isolation, from wildly differing starting points. While reliability theory has its roots in the physical world and the pioneering work of Barlow and Proschan [1], the theory of financial derivatives has developed out of the seminal paper by Black and Scholes in 1973 [3]. For most of the intervening years between then and now there was little to link the two fields. However, the last few years have seen an explosion in the use of so-called "credit derivatives," which involve payments that are contingent on the "survival" of debtors. While the associated pricing theory is still in its infancy, it is clear that much of the work published under the umbrella of "reliability" transfers naturally to the field of credit derivatives. This paper introduces the fundamental concepts of credit derivatives, before applying the language of reliability theory to the field. It will hopefully provoke interest in translating reliability theory to what is an exciting and developing discipline.

2.  Financial Fundamentals

The fundamental financial concept is that of the (risk-free) "zero-coupon bond." A zero-coupon bond of maturity T, denoted by $Z_T$, pays (with certainty) $1 at time T. It is assumed that a continuum of such instruments exists. Define the $-value at time t of receiving $1 at time T to be P(t, T). Clearly, P(t, t) = 1. Less obviously, P(t, T) is a decreasing function of T. [If not, assume that $P(t, T_1) < P(t, T_2)$ for some $t < T_1 < T_2$ and consider the strategy of buying the $T_1$ bond and selling the $T_2$ bond, with the understanding that the $1 received at $T_1$ would be held and paid at $T_2$. This is a risk-free, zero-cost strategy yielding a guaranteed profit of $P(t, T_2) - P(t, T_1)$ and as such is impossible.] It is useful to consider the "continuous rate of interest" at time t for borrowing at time $u \ge t$, denoted by f(t, u) and defined via

$$P(t, T) = \exp\left(-\int_t^T f(t, u)\, du\right). \qquad (16.1)$$

Differentiating gives

$$\frac{dP(t, T)}{P(t, T)} = f(t, t)\, dt \equiv r(t)\, dt,$$

so r(t) is seen to be the "rate of return". Note from earlier comments that $f(t, u) \ge 0$.

2.1.  Risk-Free Bonds

Zero-coupon bonds are rarely traded; rather, the standard debt instrument is the "coupon bond", which enables the seller (buyer) of the instrument to borrow (lend) money over a fixed term. Although the precise details vary, the archetypal bond obliges the bond issuer (the seller) to make pre-determined regular cash payments ("coupons") to the bondholder (the buyer). At the maturity of the bond, the "nominal" cash amount of the bond is also paid. Typically, bonds are structured to resemble fixed-term loans, the nominal equating to the $-size of the loan, and the coupon (stated as a percentage of the nominal) equating to the interest rate. Unlike loans, bonds are tradable securities, so that the initial purchaser (in the so-called "primary market") may later sell the bond (and the associated right to payment) to another party (in the so-called "secondary market"). In practice, the precise size and timing of the cash flows are determined by a number of market conventions (the so-called "settlement delay" between buying and receiving, day-count conventions, and the concepts of clean and dirty prices). However, these only serve to complicate the analysis and are essentially irrelevant to the underlying pricing theory - we ignore them here (for a thorough overview, see Stigum [18]).

Theoretical Value  Assume therefore that the bond (of nominal size N) makes regular cash payments of size Nc at times $t_1, t_2, \dots, t_M$ and in addition pays N at time $t_M$. Assume initially that these payments are "risk-free" in the sense that there is no possibility that the payments will not be honored. Then the value of the bond at time t, B(t), is the total value of all agreed future payments. Assuming $t \le t_M$,

$$B(t) = N P(t, t_M) + \sum_{i = \min\{j\,:\, t_j \ge t\}}^{M} N c\, P(t, t_i). \qquad (16.2)$$

Thus, given the prices P(t, T), the bond price B(t) may be exactly determined. [Note that in practice the reverse problem is more typical, i.e. solving for the P(t, T) that match a set of Q bond prices, $B_1(t), \dots, B_Q(t)$. This is often referred to as "bootstrapping" and is an important field in its own right; see, for example, Miron and Swannell [16].] The "risk-free" assumption is strong, although it is justified when considering governments issuing bonds denominated in their own currency, since governments can always "print" currency and thus can always honor payments. So, despite occasional high-profile counter-examples, the "risk-free" assumption is usually made. However, for non-government bond issuers there is a non-negligible probability that payments will not be honored. This is a key point, which we will now consider in detail before returning to the question of bond pricing in general.
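As an illustration of (16.2), the sketch below prices a coupon bond from a given set of discount factors P(0, t); the flat discount curve used here is a made-up example, not market data.

```python
import math

def bond_price(nominal, coupon_rate, payment_times, discount):
    """Price a risk-free coupon bond as in (16.2).

    nominal       : N, the face value paid at the final time t_M
    coupon_rate   : c, so each coupon payment is N * c
    payment_times : [t_1, ..., t_M], remaining coupon dates (in years)
    discount      : function t -> P(0, t), the zero-coupon bond price
    """
    t_M = payment_times[-1]
    coupons = sum(nominal * coupon_rate * discount(t) for t in payment_times)
    return nominal * discount(t_M) + coupons

# made-up flat continuously compounded rate of 5% per year
P = lambda t: math.exp(-0.05 * t)
print(bond_price(100.0, 0.06, [1, 2, 3, 4, 5], P))   # about 103.8
```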

2.2.  Risky Bonds

The Concept of Default  A bond issuer is said to "default" (alternatively a "credit event" is said to have occurred) if it fails to honor a payment obligation or indicates that it will fail to make a future payment. A tight legal definition is provided by ISDA (the International Swaps and Derivatives Association); see [6]. In principle, a default may occur at any time (rather than at a discrete set of dates such as bond coupon dates, although these may be more likely). Furthermore, an issuer cannot "un-default." Define the process $\{D(t)\}_{t \ge 0}$ indicating the occurrence (or otherwise) of default. Assume that D(0) = 0, indicating a state of non-default, and further assume that $\{D(t)\}_{t \ge 0}$ is a Poisson process with (possibly random) intensity $\{\lambda(t, T)\}_{t \ge 0}$. The "credit event" is assumed to be triggered by the first arrival time, $\tau$, of the process $\{D(t)\}_{t \ge 0}$. Note that due to the tight definition of the term "credit event," $\tau$ is observable.

The probability of a credit event occurring during a small interval of time $(T, T + \delta T)$, given survival to T, is

$$P(\tau \in (T, T + \delta T) \mid \tau > T) = \lambda(t, T)\, \delta T. \qquad (16.3)$$

From (16.3) it follows that the conditional probability of non-default over a finite interval of time (T, U) is

$$P(\tau > U \mid \tau > T) = \exp\left(-\int_T^U \lambda(t, u)\, du\right). \qquad (16.4)$$
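The sketch below evaluates (16.4) for a piecewise-constant intensity curve (the form produced by the fitting algorithm mentioned later in connection with Figure 16.1); the breakpoints and rate levels are illustrative assumptions.

```python
import math

def survival_probability(T, U, breakpoints, levels):
    """P(tau > U | tau > T) = exp(-integral_T^U lambda(u) du) for a
    piecewise-constant intensity: lambda(u) = levels[k] on
    [breakpoints[k], breakpoints[k+1])."""
    integral = 0.0
    for k, level in enumerate(levels):
        a = max(T, breakpoints[k])
        b = min(U, breakpoints[k + 1])
        if b > a:
            integral += level * (b - a)
    return math.exp(-integral)

# illustrative piecewise-flat hazard: 40%/yr for 2 years, then 15%/yr until year 8
print(survival_probability(0.0, 5.0, [0.0, 2.0, 8.0], [0.40, 0.15]))  # ~0.29
```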

Modeling Assumptions  Returning to the coupon bonds discussed earlier, we now drop the assumption that they are "risk-free" and admit the possibility of default. In the real world, in the event of default, a number of eventualities may result. Typically a liquidator will announce (after a reasonable time-lag) that all bondholders will be paid a proportion $\alpha(\tau)$ of the nominal amount of the bonds held. This "recovery rate" process $\{\alpha(t)\}_{t \ge 0}$ is itself random, taking values on [0, 1]. Mathematical models of default fall into two broad classifications: par-recovery models and fractional-recovery models. The par-recovery model assumes that the recovery value is proportional to the nominal, while the fractional-recovery model assumes that the recovery value is proportional to the pre-default value of the bond. Thus the post-default value of a bond under the par-recovery model is $B(\tau) = N\alpha(\tau)$, while under the fractional-recovery model it is $B(\tau) = \alpha(\tau) B(\tau^-)$. For the purposes of this paper, we shall assume that the recovery rate is identically zero, so that $\alpha(t) = 0$ for all $t \ge 0$, and the par- and fractional-recovery models coincide.

3.  Parameter Estimation

Pricing of the bonds is governed by the principle of maximization of utility. Hence, one requires both a specification of utility and the default probabilities, $P(\tau > T \mid \tau > t)$ for all T. Equivalently, one requires $\lambda(t, T)$, defined as the instantaneous rate (at time t) of default at time T, conditional on $\tau > T$. Consider estimating $\lambda(t, T)$. Adopting a Bayesian perspective, one would proceed by first postulating a prior distribution (or process) before updating with available data. But in this case the only directly relevant data is the one piece of survival data, namely $\tau > t$. This indicates that one must think hard when specifying the prior. Fortunately there are organizations ("credit-rating agencies") that specialize in doing precisely this, through analysis of financial fundamentals and history. And although these companies do not specify a full $\lambda(t, T)$, they do at least publish a "credit-rating", indicating the "reliability" of the borrower via a discrete classification scheme. Moody's, for instance, assign ratings ranging from Aaa (small probability of default) to C (high probability of default).

3.1.  Transition Matrices

Define the "rating process", $\{R(t)\}_{t \ge 0}$, to be a continuous-time process taking discrete values on the range $\{1, 2, \dots, Q\}$, denoting the various rating states (Aaa, Aa1, ..., C). For a fixed time-horizon, say 1 year, consider the transition matrix A governing movement between states,

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1Q} & a_{1(Q+1)} \\ a_{21} & a_{22} & \cdots & a_{2Q} & a_{2(Q+1)} \\ \vdots & & & & \vdots \\ a_{Q1} & a_{Q2} & \cdots & a_{QQ} & a_{Q(Q+1)} \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix},$$

and note the presence of an absorbing state, Q + 1, representing default. To estimate $a_{ij}$, one would start by postulating a prior distribution for this matrix (for example a Dirichlet distribution; see Mazzuchi and Soyer [14] for properties) and then update this with observed transition data. This data is readily available. The continuous process $\lambda(t, T)$ may then be specified so that it coincides with the associated discrete process R(t). One approach is to associate each discrete state i ($i = 1, 2, \dots, Q$) with an intensity $\lambda_i(u, u)$.
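A minimal sketch of the Bayesian update just mentioned: each row of the transition matrix receives an independent Dirichlet prior, which observed one-year transition counts update conjugately. The prior and count values below are invented for illustration.

```python
import numpy as np

def update_transition_row(prior_alpha, observed_counts):
    """Conjugate Dirichlet update for one row of the rating transition matrix.

    prior_alpha     : Dirichlet parameters for P(next state | current state)
    observed_counts : observed one-year transition counts to each state
    Returns the posterior Dirichlet parameters and the posterior mean row.
    """
    alpha_post = np.asarray(prior_alpha, float) + np.asarray(observed_counts, float)
    return alpha_post, alpha_post / alpha_post.sum()

# illustration with Q = 3 rating states plus the absorbing default state
prior = [8.0, 1.0, 0.5, 0.1]          # weakly favors staying in the same state
counts = [120, 15, 4, 1]              # invented transition counts out of state 1
alpha_post, row_mean = update_transition_row(prior, counts)
print(row_mean)                       # posterior mean of (a_11, a_12, a_13, a_14)
```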

4.  Derivatives

The financial understanding of "derivative" is "a financial instrument whose value depends on the values of other, more basic underlying variables." (cf. Hull [5]) The key point is that these underlying variables are observable, and payments are made based on the realized values of these variables. For instance, Black and Scholes [3] studied contracts that involve payments linked to the observed price of an equity (at a specified date and time), but in general any variable may be used - for instance, the (government published) consumer price index or official weather records.

4.1.  Credit Derivatives

The discipline of "credit derivatives" is concerned with a specific class of contract - namely those that make payments contingent upon credit events. It is a rapidly expanding field and new varieties of derivative contract are constantly being invented. For the purposes of this paper, we concentrate on two of the more commonly traded contracts.

Credit Default Swap  The "credit default swap" (CDS) is the most commonly traded credit derivative. Consider two market counterparties, A and B, who enter into such a swap. Party A will typically own a (risky) coupon bond of nominal size N issued by party C. If A does not wish to be exposed to the risk of party C defaulting, then rather than selling the bond, he may instead enter into a credit default swap (a good overview of the financial advantages that come from trading credit default swaps, including the positive effects of risk diversification, is given in [15]). In this case, party A will buy "protection" from party B, thereby becoming the seller of a CDS that obliges party A to pay party B an annual premium in return for party B agreeing to purchase the bond from party A at the nominal price N in the event of party C defaulting. The structure of the CDS is designed to match that of the coupon bond. The archetypal CDS obligates the seller (party A here) to pay the CDS premium (denoted by s) on each of the bond coupon dates $t_1, t_2, \dots, t_M$, provided that a credit event has not occurred. In the event of a credit event, these payments cease and the buyer of the CDS (party B here) purchases the underlying bond from the CDS seller for the nominal amount N rather than the market price of the bond, $B(\tau)$. Thus the payments made under the terms of the CDS (from the seller's perspective) are: at $t_i$ ($i = 1, 2, \dots, M$), $N s\, I_{\tau > t_i}$ (where $I_{\tau > u}$ denotes the indicator function that takes the value 1 when $\tau > u$ and 0 otherwise), and at $\tau$, $(B(\tau) - N)(1 - I_{\tau > t_M})$.

First-to-Default Contract  The first-to-default derivative is one example of a class of contracts that depend on the behavior of multiple default processes. Other examples include n'th-to-default and "collateralized bond obligations" (CBOs). Consider V underlying issuers and, extending the previous notation in the natural manner, define $\tau^1, \dots, \tau^V$ to be the occurrence times of the credit events. Define k such that $\tau^k < \tau^i$ for all $i \ne k$. Assuming that the payment profile mirrors that of a CDS, the payments are as follows: at $t_i$ ($i = 1, 2, \dots, M$), $N s\, I_{\tau^k > t_i}$, and at $\tau^k$, $(B_k(\tau^k) - N)(1 - I_{\tau^k > t_M})$, where $B_k$ denotes the price of the bond issued by the first issuer to default.

4.2.  Pricing Credit Derivatives

Consider the credit default swap defined above, and denote the $-value of the CDS at time t by $\{C(t)\}_{t \ge 0}$. Replacing the word "default" by "failure" and the term "bond-issuer" by "component" suggests that we are in the world of reliability theory. If so, then a plausible price would be the expected value of the CDS:

$$E\left[(B(\tau) - N)\, Z_\tau\, I_{\tau < t_M} + \sum_{i=1}^{M} N s\, Z_{t_i}\, I_{\tau > t_i}\right]. \qquad (16.5)$$

Under the assumption that the zero-coupon bond prices, $Z_T$, are deterministic and that $\alpha(\tau) = 0$ (so $B(\tau) = 0$ under either the par- or fractional-recovery model), (16.5) becomes

$$-N\, P(t, t_M)\, P(\tau < t_M) + N s \sum_{i=1}^{M} P(t, t_i)\, P(\tau > t_i). \qquad (16.6)$$

[Note that in practice these assumptions may be relaxed. Variations would include: assuming independence of $Z_t$ and $\tau$; assuming that $\alpha(t)$ is known and constant; or assuming that $\alpha(t)$ is random but constant and independent of $\tau$.]
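A small sketch of (16.6) under the stated assumptions (deterministic discounting, zero recovery); the discount curve, hazard rate, and premium inputs are invented illustrations.

```python
import math

def cds_value(nominal, premium, coupon_times, discount, survival):
    """Expected value of the CDS from the seller's perspective, as in (16.6):
    -N * P(t, t_M) * P(tau < t_M) + N*s * sum_i P(t, t_i) * P(tau > t_i).

    discount(t) : deterministic zero-coupon price P(0, t)
    survival(t) : P(tau > t)
    """
    t_M = coupon_times[-1]
    protection_leg = -nominal * discount(t_M) * (1.0 - survival(t_M))
    premium_leg = nominal * premium * sum(discount(t) * survival(t)
                                          for t in coupon_times)
    return protection_leg + premium_leg

# invented inputs: 5% flat rate, 2% flat hazard, 1% annual CDS premium
P = lambda t: math.exp(-0.05 * t)
S = lambda t: math.exp(-0.02 * t)
print(cds_value(100.0, 0.01, [1, 2, 3, 4, 5], P, S))
```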

Instantaneous Hedging  Unfortunately (or perhaps fortunately) the result in (16.6) is incorrect (not least because it makes no mention of utility); however, it can be salvaged through a change of probability measure. This is the essence of derivatives pricing theory, which is covered in detail by Baxter and Rennie in [2] and applied to credit derivatives by, for example, Jarrow and Turnbull [7]. We will attempt to provide some intuition via the idea of "instantaneous hedging". For some time t consider a small interval of time $(t, t + \delta t)$ during which a "credit event" may occur. For small $\delta t$, the probability of default may be approximated as $\lambda(t, t)\, \delta t$. Consider a credit derivative and denote by C(t) the price at time t, conditional on default not having occurred by time t (i.e. $\tau > t$). Similarly denote by $\bar{C}(t)$ the price at t conditional on $\tau \le t$. Analogously, define B(t) and $\bar{B}(t)$ to represent the conditional prices of the coupon bond that is subject to the same default process $\{D(t)\}_{t \ge 0}$. Now consider an investor holding (at time t) a portfolio consisting of 1 unit of the credit derivative and $-\beta$ units of the bond. Then with probability $\lambda(t, t)\, \delta t$ a credit event will occur and the change in portfolio value will be

$$\left[\bar{C}(t + \delta t) - C(t)\right] - \beta \left[\bar{B}(t + \delta t) - B(t)\right]. \qquad (16.7)$$

With probability $1 - \lambda(t, t)\, \delta t$ no credit event will occur and the portfolio value will change by

$$\left[C(t + \delta t) - C(t)\right] - \beta \left[B(t + \delta t) - B(t)\right]. \qquad (16.8)$$

Now consider choosing $\beta$ such that

$$\beta = \frac{C(t + \delta t) - \bar{C}(t + \delta t)}{B(t + \delta t) - \bar{B}(t + \delta t)}.$$

Then both (16.7) and (16.8) become

$$\bar{C}(t + \delta t)\left[1 - \frac{B(t) - \bar{B}(t + \delta t)}{B(t + \delta t) - \bar{B}(t + \delta t)}\right] + C(t + \delta t)\left[\frac{B(t) - \bar{B}(t + \delta t)}{B(t + \delta t) - \bar{B}(t + \delta t)}\right] - C(t). \qquad (16.9)$$

This portfolio is thus (instantaneously) "risk-free" since its value is unaffected by the occurrence of a credit event. If so, it must have the same return over $(t, t + \delta t)$ as any other "risk-free" asset (such as the zero-coupon bond), namely $r(t) = f(t, t)$. Without loss of generality, assume $r(t) = 0$. Thus (16.9) vanishes and may be re-written as

$$C(t) = (1 - \phi)\, \bar{C}(t + \delta t) + \phi\, C(t + \delta t), \qquad (16.10)$$

with

$$\phi = \frac{B(t) - \bar{B}(t + \delta t)}{B(t + \delta t) - \bar{B}(t + \delta t)}. \qquad (16.11)$$

Since $0 < \bar{B}(t + \delta t) < B(t) < B(t + \delta t)$ (for arbitrage reasons related to the impossibility of risk-free profits), $\phi \in (0, 1)$. Thus $\phi$ may be treated as a probability and (16.10) may be treated as an expectation. Hence

$$C(t) = E_Q\left[C(t + \delta t)\right],$$

where the Q-suffix denotes expectation under the "risk-neutral" probability measure. This measure only coincides with the real-world measure in a world where investors are risk-neutral (i.e. neither risk-averse nor risk-prone). Thus utility has manifested itself as measure.

Equivalent Martingale Measures  In particular, choose the credit-derivative payoff to mimic that of the bond. Then $C(t) = B(t)$ and $\bar{C}(t) = \bar{B}(t)$ for all t, and

$$E_Q\left[B(t + \delta t) \mid B(t)\right] = B(t).$$

Thus $\{B(t)\}_{t \ge 0}$ is a martingale process w.r.t. Q, and the probability measure Q is an "equivalent martingale measure". A detailed proof of this is beyond the scope of this article; see [2] for a thorough analysis. The key point to note for the reliability theorist is that the martingale property must be observed. In practice this means that any model postulated for the real-world evolution of $\lambda(t, T)$ must be adapted to the Q-measure, or that the Q-measure evolution of $\lambda(t, T)$ must be directly specified.

5.  Application of Reliability Models

5.1.  Parameter "Implication"

While the pricing of bonds requires estimation of $\lambda(t, T)$ and consideration of utility, the pricing of credit derivatives depends solely upon the prices (and evolution) of the underlying bonds. Thus for the purposes of pricing credit derivatives, $\lambda(t, T)$ is not estimated from data; rather, it is implied from the prices of bonds. (And the future evolution of $\lambda(t, T)$ is constrained by the martingale property.) Recall that the price of a risky zero-coupon bond is less than the price of the equivalent risk-free bond and may therefore be written as $\theta P(t, T)$, where $0 < \theta < 1$. It is customary to think of $\theta$ as the "default probability" and to define the implied intensity $\tilde{\lambda}(t, u)$ via $\theta = \exp\left(-\int_t^T \tilde{\lambda}(t, u)\, du\right)$. Note that $\theta$ is only the true "default probability" in a risk-neutral world where $\tau$ and $Z_T$ are independent. Indeed, in practice the price is determined via considerations of utility and $\theta$ is only a "pseudo-probability." It is nevertheless instructive to consider the various $\tilde{\lambda}(t, u)$ that are observed in the market. Figure 16.1 shows two typical curves of $\tilde{\lambda}(t, u)$ which were observed in the market in February 2002. [The discontinuous nature of the curves is due to the fitting algorithm applied, which in this case fits piecewise flat sections between observed bond maturities.]

Explaining the Intensity Curve  An intensity curve $\tilde{\lambda}(t, T)$ such as the one in Figure 16.1 relating to the bad-quality credit is recognized as the familiar "bath-tub" curve of reliability. In reliability theory, this is usually explained as being due to components improving during an initial period before deteriorating later; see [8]. A more plausible explanation is that of mixing poor- and good-quality items: those that are poor are quickly weeded out, improving the average quality of those that remain. However, in finance, what is more plausible still is "psychological mixing" (cf. Lynn and Singpurwalla [11]). Even when the "model failure rate" (the failure rate conditional on the value of uncertain parameters) is flat or increasing, the "predictive failure rate," generated by averaging over these parameters, may take the form of a bathtub curve. Thus key to explaining Figure 16.1 is the fact that $\tilde{\lambda}(t, T)$ represents the failure rate at time T, conditional on survival to T. And in the presence of any uncertainty about $\lambda(t, T)$, conditioning on $\tau > T$ will always reduce the expected value of $\lambda(t, T)$.

Figure 16.1.  Implied Intensity Curves (failure rate against time in months, for a bad-quality and a good-quality credit).

To explain Figure 16.1, consider $P(\tau > T \mid \tau > t)$ and suppose that $\lambda(t, u) = \lambda$ for all t, u. Then

$$P(\tau > T \mid \tau > t) = E\left[\exp(-\lambda (T - t))\right].$$

Consider the case where uncertainty about $\lambda$ may be expressed via a gamma distribution with parameters z and $\phi$. Then

$$\exp\left(-\int_t^T \tilde{\lambda}(t, u)\, du\right) = E\left[\exp(-\lambda (T - t))\right] = \int_0^\infty e^{-\lambda (T - t)}\, \frac{\phi^z \lambda^{z-1} e^{-\lambda \phi}}{\Gamma(z)}\, d\lambda = \left(\frac{\phi}{T - t + \phi}\right)^z.$$

Differentiating gives $\tilde{\lambda}(t, T) = z / (\phi + (T - t))$, which is a decreasing function of $(T - t)$. It is interesting to note that prices in the market (which in some sense reflect the "average" belief of market participants) are consistent with the theoretical results obtained by applying a Bayesian analysis. Indeed, a best-fit analysis of the bad-quality credit of Figure 16.1 would suggest that z = 1.34 and $\phi$ = 2.73 (see Figure 16.2).
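As a small numerical companion to this derivation, the sketch below computes the predictive (implied) intensity $z/(\phi + (T - t))$ and the corresponding survival probability under a gamma prior on $\lambda$, using the best-fit values z = 1.34 and $\phi$ = 2.73 quoted above; the horizons are arbitrary and carry whatever time unit $\phi$ carries.

```python
def implied_intensity(s, z=1.34, phi=2.73):
    """Predictive failure rate at horizon s = T - t when lambda ~ Gamma(z, phi)."""
    return z / (phi + s)

def survival(s, z=1.34, phi=2.73):
    """P(tau > T | tau > t) = E[exp(-lambda*s)] = (phi / (phi + s))**z."""
    return (phi / (phi + s)) ** z

for s in (1, 2, 4, 8):                 # horizon T - t, in the same units as phi
    print(s, round(implied_intensity(s), 3), round(survival(s), 3))
```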

Figure 16.2.  Implied vs. Theoretical Intensity (failure rate against time in months, for the bad-quality credit).

Note that a similar effect is not observed for the high quality credit of Figure 16.1. Here the decreasing failure rate due to learning is more than compensated for by the effects of risk-aversion.

5.2.  Dependency Models

Payoffs such as first-to-default contracts depend on multiple credit-event processes, $\{D_1(t)\}_{t \ge 0}, \{D_2(t)\}_{t \ge 0}, \dots, \{D_V(t)\}_{t \ge 0}$. Although one might assume independence of these processes, in practice high levels of dependence are observed. Multivariate reliability models thus have wide application here. In general, as in reliability theory, one can classify models in three broad categories: causal, exchangeable, and cascading failures (cf. Lindley and Singpurwalla [10]).

Causal Models  Causal models, such as the Marshall and Olkin common shock model [13], are relevant to modeling credit events since common environmental shocks may be experienced, resulting in (approximately) simultaneous failure. Recall that the bivariate exponential distribution may be interpreted as resulting from three independent shock processes, one per component, plus a common-shock process affecting both components. For two debtors, define $\{D_{12}(t)\}_{t \ge 0}$, $\{D_1(t)\}_{t \ge 0}$ and $\{D_2(t)\}_{t \ge 0}$ to be three independent Poisson processes with intensities $\lambda_{12}(t, T)$, $\lambda_1(t, T)$ and $\lambda_2(t, T)$. Further define, for i = 1, 2, $D^i(t) = D_i(t) + D_{12}(t)$. Then it is clear that $\lambda^i(t, T) = \lambda_i(t, T) + \lambda_{12}(t, T)$. This naturally extends to a multitude of components. This model is of particular relevance since Moody's employ a variant of it when assessing credit ratings (cf. [17]). For V underlyings, it is assumed that there are $R \le V$ independent Poisson processes driving failure; R is termed the "diversity score." Process i is assumed to have intensity $\lambda_i(t, T) = \lambda_i$ and to cause $T_i$ simultaneous defaults. Clearly, $\sum_{i=1}^{R} T_i = V$. Thus we have a simplified common-shock model that leads to simple results via binomial expansions. Other reliability models have potential application here. For instance, Singpurwalla and Youngren [19] employ a gamma process to model $\int_t \lambda_{12}(t, u)\, du$. This gives rise to common shocks and in special cases collapses to the BVE of Marshall and Olkin.
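The sketch below simulates correlated default times for two debtors under the common-shock construction just described, with constant intensities; the three rate values are illustrative assumptions only.

```python
import numpy as np

def common_shock_default_times(lam1, lam2, lam12, n_sims=100000, rng=None):
    """Simulate (tau_1, tau_2) where debtor i defaults at the first event of
    its idiosyncratic Poisson process or of the common-shock process."""
    rng = np.random.default_rng(0) if rng is None else rng
    t1 = rng.exponential(1.0 / lam1, n_sims)     # idiosyncratic shock, debtor 1
    t2 = rng.exponential(1.0 / lam2, n_sims)     # idiosyncratic shock, debtor 2
    t12 = rng.exponential(1.0 / lam12, n_sims)   # common shock hitting both
    return np.minimum(t1, t12), np.minimum(t2, t12)

tau1, tau2 = common_shock_default_times(0.03, 0.05, 0.02)   # illustrative rates
print("P(both default within 5 years):", np.mean((tau1 < 5) & (tau2 < 5)))
print("P(simultaneous default):       ", np.mean(tau1 == tau2))
```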

Exchangeability  Models based on exchangeability typically correlate the underlying processes $\lambda_i(t, T)$ directly, rather than the default times $\tau_i$. Indeed, conditional on the $\lambda_i(t, T)$, the $\tau_i$ are independent. A classic example is due to Lindley and Singpurwalla [9], where each $\lambda_i(t, T)$ is scaled by a common random factor $\gamma$; averaging over $\gamma$ introduces dependence through mixture.

Cascading Failures  Models of cascading failures may be particularly relevant to the financial field since the direct result of default is to penalize the lenders. The simple bivariate model of Freund [4] assumes that the failure rate of the surviving component (lender) doubles when the first fails; extensions to this model, such as those described by Swift [20], may be more relevant.

Network Reliability Models We conclude with a brief mention of network reliability models. As the range of credit derivative structures expands, so it becomes likely that payment structures will be invented that represent general networks rather than simple series or parallel systems. Lynn, Smith and Singpurwalla [12] provide a good overview of the key issues.

References
[1] Barlow, R. E. and Proschan, F. (1975). Statistical Theory of Reliability and Life Testing. Holt, Rinehart and Winston, New York.
[2] Baxter, M. and Rennie, A. (1996). Financial Calculus. Cambridge University Press, Cambridge.
[3] Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. J. Pol. Econ., 81: 637-659.
[4] Freund, J. E. (1961). A Bivariate Extension of the Exponential Distribution. Journal of the American Statistical Association, 56: 971-977.
[5] Hull, J. C. (1997). Options, Futures and Other Derivatives. 3rd Edition, Prentice-Hall.
[6] ISDA (1999). ISDA Credit Derivatives Definitions.
[7] Jarrow, R. A. and Turnbull, S. M. (1996). Derivative Securities. South-Western College Publishing.
[8] Jensen, F. and Petersen, N. E. (1982). Burn-in. Wiley, New York.
[9] Lindley, D. V. and Singpurwalla, N. D. (1986). Multivariate Distributions for the Life Lengths of Components of a System Sharing a Common Environment. Journal of Applied Probability, 23: 418-431.
[10] Lindley, D. V. and Singpurwalla, N. D. (2002). On Exchangeable, Causal, and Cascading Failures. Statistical Science, 17: 209-219.
[11] Lynn, N. J. and Singpurwalla, N. D. (1997). "Burn-in" Makes Us Feel Good. Statistical Science, 12: 13-19.
[12] Lynn, N., Singpurwalla, N. and Smith, A. F. M. (1998). Bayesian Assessment of Network Reliability. SIAM Review, 40: 202-227.
[13] Marshall, A. W. and Olkin, I. (1967). A Multivariate Exponential Distribution. Journal of the American Statistical Association, 62: 30-44.
[14] Mazzuchi, T. A. and Soyer, R. (1993). A Bayes Methodology for Assessing Product Reliability During Development Testing. IEEE Transactions on Reliability, 42: 503-510.
[15] Merrill Lynch & Co, Global Securities Research & Economics Group (2002). Credit Default Swap Handbook. Internal Report.
[16] Miron, P. and Swannell, P. (1992). Pricing and Hedging Swaps. Euromoney Publications.
[17] Moody's Investors Service (1999). The Binomial Expansion Method Applied to CBO/CLO Analysis.
[18] Stigum, M. (1996). Money Market and Bond Calculations. McGraw-Hill Professional Publishing.
[19] Singpurwalla, N. D. and Youngren, M. A. (1993). Multivariate Distributions Induced by Dynamic Environments. Scandinavian Journal of Statistics, 20: 251-261.
[20] Swift, A. (2001). Stochastic Models of Cascading Failures. PhD Thesis, The George Washington University.

Chapter 17 WARRANTY: A SURROGATE OF RELIABILITY Nozer D. Singpurwalla Department of Statistics The George Washington University Washington, DC 20052, USA

[email protected]

Abstract

Manufacturers of goods and services often endorse the superior quality of their products by offering the consumers a warranty. Often the warranty is for a specified period of time; sometimes it is for a specified time and usage, whichever is the smaller. The automobile is an example of this latter scenario. In either case, the issue of reliability is germane. Quantified measures of reliability are necessary for designing a warranty and for forecasting financial reserves against future warranty claims; they are also very useful in litigation involving violations of the terms of the warranty. This article is an overview. Its aim is to highlight some of the probabilistic, statistical, and econometric issues that the topic of warranties has spawned. Collectively, these issues should constitute the core of any mathematical theory of reliability; thus this entry is an expository perspective on reliability.

Keywords:

Failure models, forecasting, game theory, probability, reliability, statistics.

1.  Introduction: Terminology

The topic of warranties is at the interface of engineering, law and mathematics; see, for example, Blischke and Murthy [2]. Its origins are in the moral and legal sciences. Its impact is on the behavioral sciences. Its mathematical content lies in the philosophical underpinnings of probability. Warranties have become a critical segment of the industrial environment. Though initially rooted in the legal context of product liability (an issue that has arisen from a shift in philosophy from caveat emptor to caveat venditor), they have now become a key means for communicating information about product quality. Shafer [16] has stated that the original theory of probability was not about probability, but about fair prices. Thus the historical underpinnings of warranties may be traced as far back as 1654, to Pascal and to Fermat. However, to the best of our knowledge, a formal development for determining optimal warranties may be attributed to Kenneth Arrow [1], in the Appendix to his classic paper.

A warranty is a contractual agreement that requires the manufacturer to rectify a failed item, either through repair or replacement, should failure occur during a period specified by the warranty. Whereas some warranties do not restrict the amount of usage, other warranties, such as those for automobiles, impose restrictions on the usage. For example, it is common for automobiles to carry a warranty of 5 years or 50,000 miles, whichever comes first. The former are one-dimensional warranties; the latter two-dimensional. Under most warranties, the rectification of a failed item is undertaken at no cost to the consumer. However, under limited warranties, the cost of rectification is shared with the consumer. Under extended warranties, the cost of rectification is borne by an insurer who sells the warranty for the payment of a premium. Extended warranties often come into play after the expiration of the initial warranty, known as the basic warranty. The cost of the basic warranty is built into the selling price of the item, and a problem faced by the manufacturers is a forecast of the amount of money that should be put into reserve to meet warranty claims. In the automobile industry, such amounts run to millions of dollars. The warranty problem is multidisciplinary, involving topics as diverse as economics, game theory, law, probability and statistics. We illustrate this interdisciplinary nature of warranties by considering three scenarios: the first highlights the interface between probability and the law, vis-a-vis "just" warranties; the second illustrates how the design of a time and use warranty region has prompted new developments in probabilistic failure modeling; the third describes some special caveats in time series analysis that arise when one addresses the problem of forecasting reserve funds against future warranty claims. Our overview concludes with some open problems involving game theoretic issues that remain to be addressed.

2.  Equilibrium Probabilities and "Just" Warranties

In this section we focus attention on warranties associated with the exchange of items that are selected from a large collection of similar items, between a manufacturer (seller) and a consumer (buyer). This is a commonly occurring situation in industrial contracting. Warranty contracts pertaining to item quality are generally written by lawyers representing the manufacturer. The specifications of the contract involve observable quantities such as the number of failures within a prescribed time. These specifications may or may not be based on the underlying propensity of failure. The propensity of failure (or chance) is the reliability of an item [cf. Lindley and Singpurwalla [11]]. The warranty contracts are ratified (i.e. accepted) by lawyers representing the buyer or consumer. If, after the contract is ratified, actual experience reveals an excessive number of failures, then the contract is likely to be canceled, and the lawyers proceed to litigate. During the course of litigation, one of the two adversaries may seek help from statisticians who evaluate warranty contracts in the light of probability, and so a statistician's testimony will invariably include the term "probability". Since the word probability is nowhere in the contract, opposing counsel may move to strike out the statistical testimony. The above sequence of events therefore poses a dilemma. On the one hand, probability is a necessary ingredient for analyzing the experiences of an adversary. On the other hand, warranty contracts can be interpreted by lay persons only in terms of observables, and so probability is not part of the terminology of a contract. The purpose of this section is twofold. The first is to describe an actual occurrence wherein the above conflict occurred. The second is to point out that underlying any meaningful warranty contract, no matter how specified, there exist probabilities that can be interpreted as being objective, subjective, or logical. Of particular relevance are what we call "equilibrium probabilities"; these are like the logical probabilities of Carnap [cf. Weatherford [22], p. 85]. Deviations of the propensity of failure (which is an objective probability) from the equilibrium probabilities make a contract unfair. The ratification of the contract by a buyer is based on the buyer's personal probability of the propensity. Thus the scenario of warranties brings into play the three interpretations of probability.

2.1.  A Real-Life Scenario

The following real-life scenario, abstracted from Singpurwalla [18], pertains to a buyer B who is interested in purchasing n supposedly identical items. Each item is required to last for T = 1 units of time. Suppose that B is willing to pay $x per item, and is prepared to tolerate at most z failures in [0, T]. For each failure in excess of z, B needs to be compensated at the rate of $y per item. The quantity T can be viewed as the duration of the warranty. A seller L has an inventory of N such items, where N is large. L is agreeable to delivering the n items needed by B and is also prepared to abide by B's requirement of price and the terms of the warranty. The specific values for x, z, y, n, and T are decided upon by B, or by B and L together, and it is not our aim to discuss how these choices are arrived at. A warranty contract of the type described above was ratified by both B and L. It then happened that B experienced an excessive number of failures within the time period [0, T]. This development caused a loss of goodwill and reputation from B's customers. Since L had not delivered all the n items stipulated in the contract, B seized this opportunity to cancel the contract and to sue L for monies in excess of the terms of the contract. L claimed that the reliability of the units is acceptable and comparable to industry standards. However, B's lawyers moved to strike out this claim on the grounds that the word "reliability" was mentioned nowhere in the terms of the contract. Given below is a synthesis of the probabilistic issues that underlie warranty contracts of the type described above. The gist of this synthesis is that there often exist unique and computable probabilities, called "equilibrium probabilities", that are specific to a warranty contract of the type described above, and that the existence of such probabilities should be acknowledged in all such contracts. Were this to be done, then the equilibrium probabilities could serve as a fulcrum point around which litigation may rationally proceed.

2.2.  L's Normative Behavior

By normative behavior we mean actions that do not lead to a "dutch book"; see Lindley [10], p. 50. Suppose that it costs L $c to produce a single unit of the item. Then, if B experiences z or fewer failures in [0, T], L's profit will be n(x - c). However, if B experiences i failures in [0, T], with i > z, then L's liability will be (i - z)y. When the contract is ratified by L, i is of course unknown. Suppose then that P(i) is the propensity of i failures in [0, T]. Then, L's expected liability is

$$\sum_{i=z+1}^{n} (i - z)\, y\, P(i).$$

Since L is selecting n items (at random) from an inventory of N items, it makes sense to think in terms of the propensity of a single item failing in [0, T]; let p denote this propensity. If the N items are judged to be similar, it follows that, were we to know p,

$$P(i) = \binom{n}{i} p^i (1 - p)^{n - i}, \quad i = 0, \dots, n.$$

Thus, conditional on p, L's expected liability takes the form

$$\sum_{i=z+1}^{n} (i - z)\, y \binom{n}{i} p^i (1 - p)^{n - i}.$$

The warranty contract is said to be fair to L if, conditional on p, L's expected liability equals L's receipts. Thus,

$$n(x - c) = \sum_{i=z+1}^{n} (i - z)\, y \binom{n}{i} p^i (1 - p)^{n - i}. \qquad (17.1)$$

Since n, x, y, T and z are either known or specified by the contract, and c is known to L, the above can be solved by L for p, to yield a value $p_L$, should such a value exist. We call $p_L$ an equilibrium probability for L. If the true propensity of failure is greater than $p_L$, L is liable to experience losses; otherwise, L is likely to profit.
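A numerical sketch of solving (17.1) for the seller's equilibrium probability $p_L$ by simple bisection; the contract constants used are invented for illustration.

```python
from math import comb

def expected_liability(p, n, z, y):
    """Right-hand side of (17.1): the seller's expected payout given propensity p."""
    return sum((i - z) * y * comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(z + 1, n + 1))

def equilibrium_p_L(n, x, c, z, y, tol=1e-10):
    """Solve n(x - c) = expected_liability(p) for p in (0, 1) by bisection.
    The liability is increasing in p, so a root (if any exists) is unique."""
    target, lo, hi = n * (x - c), 0.0, 1.0
    if expected_liability(hi, n, z, y) < target:
        return None                      # no equilibrium probability exists
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_liability(mid, n, z, y) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# invented contract constants: 100 items at $50 each, $30 unit cost,
# up to 5 tolerated failures, $200 compensation per excess failure
print(equilibrium_p_L(n=100, x=50.0, c=30.0, z=5, y=200.0))
```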

2.3.  B's Normative Behavior

B's position is more complicated. B is prepared to pay nx to buy n items, which are supposed to last for T units of time. B will incur an unknown loss due to the items failing prior to time T, and is going to receive an unknown amount in reimbursements from L. Let w denote the worth (in dollars) to B if a unit functions for T units of time. Then for the contract to be fair to B, we must have

$$nx = nw - E(\text{Loss}) + E(\text{Reimbursement}),$$

where E denotes expectation. Thus E(Loss) denotes B's expected loss due to a premature failure of units; similarly for E(Reimbursement).

Reimbursements to B  If at the end of T units of time i units have failed, then B will receive as reimbursement $(i - z)y, if $i \ge z + 1$, and nothing otherwise. Clearly, E(Reimbursement) is given by

$$E(\text{Reimbursement}) = \sum_{i=z+1}^{n} (i - z)\, y \binom{n}{i} p^i (1 - p)^{n - i} = y\left\{ np - \sum_{i=0}^{z} i \binom{n}{i} p^i (1 - p)^{n - i} - z \sum_{i=z+1}^{n} \binom{n}{i} p^i (1 - p)^{n - i} \right\}.$$

Note that the above expression is conditional on p.

Loss Incurred by B  B is willing to pay $x for an item that can last for T units of time. If the item fails at inception, then B suffers a loss of $x; if the item serves until T, then B incurs no loss. Suppose that B's loss is linear in time, so that should the item fail at time $t \le T$, the loss is $x - (x/T)t$. Suppose that the items in question do not age, so that the propensity of the failure times of the N items in L's lot is an exponential distribution with a parameter $\lambda$; $\lambda$ is, of course, unknown. Thus, if f(t) denotes the true (or an objective) density function of the time to failure of an item at time t, then $f(t) = \lambda e^{-\lambda t}$, $t \ge 0$, $\lambda > 0$. Thus, conditional on f(t) known, B's expected loss is

$$E(\text{Loss}) = nx - \frac{nx}{T\lambda}\left(1 - e^{-T\lambda}\right).$$

For the contract to be fair to B, the above expected loss, plus nx, the price paid by B, should be equated to B's expected reimbursement, with p replaced by $1 - e^{-T\lambda}$, plus nw, the worth to B of the n items. That is, we must have

$$nx + \left[\, nx - \frac{nx}{T\lambda}\left(1 - e^{-T\lambda}\right) \right] = nw + y \sum_{i=z+1}^{n} (i - z) \binom{n}{i} \left(1 - e^{-T\lambda}\right)^{i} \left(e^{-T\lambda}\right)^{n - i}.$$

Since n, x, y, z, and T are specified in the warranty contract, and since w is known to B, the above can be solved by B for $\lambda$, and thence for p. We denote this value of p, should it exist, by $p_B$, and call it an equilibrium probability for B. If the propensity of failure is different from $p_B$, B could incur a loss, depending on the values of the known constants.
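A companion sketch for the buyer's side: it solves the fairness equation above numerically for $\lambda$ (through the substitution $p = 1 - e^{-T\lambda}$), using bisection in p. All contract constants are invented and were chosen only so that an equilibrium exists.

```python
from math import comb, log

def fairness_gap(p, n, x, w, z, y, T=1.0):
    """Left minus right side of B's fairness equation, in terms of
    p = 1 - exp(-T*lambda): nx + E(Loss) - nw - E(Reimbursement)."""
    lam = -log(1.0 - p) / T
    e_loss = n * x - (n * x / (T * lam)) * p          # nx - (nx/(T*lam))(1 - e^{-T*lam})
    e_reimb = sum((i - z) * y * comb(n, i) * p**i * (1 - p)**(n - i)
                  for i in range(z + 1, n + 1))
    return n * x + e_loss - n * w - e_reimb

def equilibrium_p_B(n, x, w, z, y, T=1.0, tol=1e-10):
    """Bisection for a root of fairness_gap in p; returns None when the signs
    at the ends of (0, 1) agree, i.e. no equilibrium is bracketed."""
    lo, hi = 1e-9, 1.0 - 1e-9
    if fairness_gap(lo, n, x, w, z, y, T) * fairness_gap(hi, n, x, w, z, y, T) > 0:
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fairness_gap(lo, n, x, w, z, y, T) * fairness_gap(mid, n, x, w, z, y, T) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# invented constants: 100 items, price $50, worth $45 each if it survives,
# 5 tolerated failures, $200 per excess failure, warranty length T = 1
p_B = equilibrium_p_B(n=100, x=50.0, w=45.0, z=5, y=200.0)
print(p_B, -log(1.0 - p_B))            # equilibrium p_B and the implied lambda
```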

2.4.  "Just" Warranties

We have argued that, given a warranty contract, there may exist values of the propensity p that make the contract fair to each adversary. We say that a contract is fair to a particular party if that party neither suffers an expected loss nor enjoys an expected gain. We have termed these values of the propensity "equilibrium probabilities", and have denoted them by $p_B$ and $p_L$. A warranty contract is said to be in equilibrium if $p_B$ equals $p_L$, assuming, of course, that both $p_B$ and $p_L$ exist. Furthermore, a warranty contract that is in equilibrium is said to be just if the equilibrium probabilities coincide with the propensity of failure. The term "just" reflects the fact that if the propensity of failure is greater than the (identical) equilibrium probabilities, then both the adversaries will experience an expected loss; and vice versa otherwise. In the latter case, it is a third party, the party that is responsible for providing B with a worth w per unit, that is getting penalized. In actuality, it is unlikely that $p_B$ equals $p_L$, unless the constants of the contract are so chosen. Thus contracts in equilibrium do not happen without forethought. Furthermore, when the contract is ratified, the propensity p is not known with certainty. Thus "just" warranties can only be retrospectively recognized. Finally, it is often the case that the third party referred to before is the public at large. Thus "just" contracts are beneficial to the general public.

2.5.  Ratification of a Warranty Contract

In actuality, "just" contracts are an idealization for two reasons. The first is that the propensity P is unknown. The second is that the buyer and seller tend to take actions that maximize their expected profits against each other, and also against the third party, which tends to be a passive player. It is often the case

Warranty: A Surrogate ofReliability

323

that B and L collaborate in their design of $x$ and $y$ in order to play a "cooperative game" against a third party, so that $w$ tends to be inflated. It is not our intention here to discuss the game-theoretic issues of warranty design. Rather, our aim is to describe how B and L should act in the presence of a specified warranty, no matter how it was arrived at, but under uncertainty about $p$.

The ratification of a warranty contract is usually preceded by negotiations between B and L about the constants of the contract. To see how these may proceed, let $\pi_L(p)$ [$\pi_B(p)$] denote the seller's [buyer's] subjective probability for the unknown $p$. Personal probabilities of the propensity are the appropriate probabilities to consider in making decisions. It is unlikely, though possible, that $\pi_L(p)$ and $\pi_B(p)$ will be identical. More important, since B and L tend to be adversaries, $\pi_L(p)$ and $\pi_B(p)$ will be known only to L and to B, respectively; that is, L will not reveal $\pi_L(p)$ to B, and vice versa.

Clearly, if $\pi_L(p)$ is degenerate, say at $p^* > 0$, and if $p^* = p_L$, then L will ratify the contract. If $p^* > p_L$, then L will not ratify the contract unless B is willing to lower $y$ and/or to increase $z$. If $p^* < p_L$, then L may either ratify the contract or may try to convince B that $x$ needs to be increased and/or $y$ needs to be lowered before the contract is ratified by L. B's behavior will be analogous to L's if $\pi_B(p)$ is degenerate, say at $p_*$. In particular, if $p_* > p_B$, B will insist that $x$ and $z$ be lowered and/or $y$ be increased in order for B to ratify the contract. If $p_* \le p_B$, then B will immediately ratify the contract as is; under the latter circumstance, B may try to convince the third party that $w$ be increased.

When $\pi_L(p)$ is not degenerate, which is what we would expect, then L will ratify the contract if $n(x - c)$ is greater than or equal to the expected value, with respect to $\pi_L(p)$, of the right-hand side of Equation (17.1); this expectation is what L believes L's liability to be. Prior to ratifying the contract, L may still choose to negotiate with B along the lines discussed when $p^* < p_L$. When $\pi_B(p)$ is not degenerate, B's behavior will be similar to that of L, mutatis mutandis.

The point of this section is to show that the topic of warranties brings into play the three commonly discussed interpretations of probability: the propensity (or objective) interpretation, to which reliability pertains; the subjective interpretation, which enters when ratifying a contract under uncertainty about $p$; and the role of probabilistic arguments on matters of fairness and justice.
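When $\pi_L(p)$ is not degenerate, the ratification rule above amounts to comparing $n(x - c)$ with an expectation taken over $\pi_L(p)$. The sketch below assumes a Beta density as a stand-in for $\pi_L(p)$ and a hypothetical liability function (a rebate $y$ on each of the $np$ expected failures) in place of the right-hand side of Equation (17.1), which is not reproduced in this excerpt; all constants are illustrative.

```python
# Sketch of the seller's ratification check under a non-degenerate pi_L(p).
# The liability function is a hypothetical stand-in for the right-hand side of
# Equation (17.1), which is not reproduced in this excerpt.
from scipy import integrate
from scipy.stats import beta

n, x, c, y = 100, 10.0, 6.0, 8.0          # hypothetical contract constants
pi_L = beta(a=2, b=8)                     # seller's subjective density for p (assumed)

def liability(p):
    """Hypothetical liability given propensity p: rebate y on n*p expected failures."""
    return n * y * p

# Expected liability with respect to pi_L(p).
expected_liability, _ = integrate.quad(lambda p: liability(p) * pi_L.pdf(p), 0.0, 1.0)

print(f"revenue margin n(x - c) = {n * (x - c):.1f}")
print(f"expected liability      = {expected_liability:.1f}")
print("ratify" if n * (x - c) >= expected_liability else "negotiate further")
```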

3. Failure Models Indexed by Time and Usage

In this section we illustrate how the scenario of specifying an optimal warranty involving the placing of limits on time and usage has generated new research on developing probabilistic failure models. We refer to such warranties as two-dimensional warranties, one dimension representing time (since acquisition or sale) and the other representing the amount of usage. The discussion of Section 2 pertained to a one-dimensional warranty, namely failure during the time interval [0, T], where for convenience T was set to one. The failure model used by us was an exponential. Instead of indexing on time, we could have considered other indices such as mileage, number of cycles, cumulative exposure, etc.; in all cases the index would be a single one. With a two-dimensional warranty, we need to consider two indices. The most common two-dimensional warranty region is the rectangular one offered with automobiles, usually of the type "5 years or 50,000 miles, whichever occurs first". A virtue of such warranties is that they are easy to implement. Their disadvantage is that they tend to encourage above-average use during the initial period after purchase, and therefore pose a risk to the manufacturer. On the other hand, such warranties protect the consumer from failures attributed to "infant mortality", which tend to occur during the initial period of use and can therefore be discovered early on. Alternatives to rectangular regions are circular, triangular and parabolic regions; details on these can be found in Singpurwalla and Wilson [20].
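As a small aid to the discussion of rectangular regions, the helper below encodes the "5 years or 50,000 miles, whichever occurs first" rule and estimates the probability of a covered failure from sampled $(T, U)$ failure pairs. The function names and the sample pairs are illustrative, not from the text.

```python
# A small helper for the rectangular two-dimensional warranty region
# "5 years or 50,000 miles, whichever occurs first".  A failure is covered
# only if, at the failure epoch, neither limit has yet been exceeded.

def in_warranty(t_years: float, u_miles: float,
                t_limit: float = 5.0, u_limit: float = 50_000.0) -> bool:
    """True if a failure at age t_years with cumulative usage u_miles is covered."""
    return t_years <= t_limit and u_miles <= u_limit

def claim_probability(failures) -> float:
    """Estimate P(failure inside the region) from sampled (T, U) pairs, one per item."""
    covered = sum(in_warranty(t, u) for t, u in failures)
    return covered / len(failures)

# Example with three hypothetical failure pairs; only the first is covered.
print(claim_probability([(3.0, 42_000.0), (4.5, 61_000.0), (6.2, 30_000.0)]))
```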

3.1. Bivariate Distributions with Multiple Scales

With any two-dimensional warranty of the type described above, we need to know the probability of failure in the region of warranty. For this, a probability model indexed by two scales is needed, and the purpose of this sub-section is to discuss a development strategy for one such family of models. There is a limited amount of work in failure modeling with multiple scales, as distinct from multivariate modeling with a single scale. The first published paper on the subject is by Mercer [12], who describes the failure of a conveyor belt as a function of time and wear. The papers by Nelson [13] and Oakes [14] address the issue of multiple scales, but their approach reduces to combining the scales so that the analysis itself proceeds on a single scale. The work of Jewell and Kalbfleisch [8] comes close in spirit to that presented here, which has been abstracted from Singpurwalla and Wilson [21]. A more recent reference is Duchesne and Lawless [5].

Model Development. Let $T$ be the time to failure of an item, $M(t)$ its cumulative use at $t \ge 0$, and $U \stackrel{\text{def}}{=} M(T)$ the cumulative use at failure. Our goal is to construct a probability model $f_{T,U}(t,u)$, with $T$ and $U$ dependent. For this, we first observe the following elementary, but useful, decomposition:
$$
\begin{aligned}
f_{T,U}(t,u) &= f_T(t)\, f_{M(T)\mid T}(u \mid t), && \text{since } U = M(T),\\
&= f_T(t)\, f_{M(t)\mid T}(u \mid t), && \text{since we condition at } T = t,\\
&= f_{T\mid M(t)}(t \mid u)\, f_{M(t)}(u), && \text{by the multiplicative law.}
\end{aligned}
$$

It is interesting to observe that in $f_{T\mid M(t)}(t\mid u)$, $t$ appears twice: as an argument of the variable $T\mid M(t)$, and as a parameter of the variable $M(t)$. Indeed, as a function of $t$, $f_{T\mid M(t)}(t\mid u)$ is not a probability density; its behavior is analogous to that of a likelihood for $T$ with $M(t)$ fixed. Meaningful forms for $f_{M(t)}(u)$ are: the Poisson process, the compound Poisson process, the gamma process, the Markov additive process, and any other continuously non-decreasing stochastic process.

To describe meaningful forms for $f_{T\mid M(t)}(t\mid u)$, we need to prescribe a model that captures the effect of usage (which is a covariate) on the time to failure. For this we propose an additive hazards model. The model is guided by the notion that usage is detrimental to life; here the failure rate at $t$ depends on the value of the usage at $t$. The model assumes a baseline rate $r_0(t)$ which accounts for deterioration due to natural causes. The usage modifies $r_0(t)$; we assume that this modification is additive, so that $r(t)$, the failure rate of $T$, is
$$
r(t) = r_0(t) + \eta\, M(t),
$$
where $\eta > 0$ is a known constant. The additive model suggests that each unit of use increases the failure rate by the same amount; in actuality, this may not be true. For example, with failure due to fatigue, large cracks grow faster than small ones, so that initial usage may contribute less to an increase in the failure rate than subsequent use.

Suppose that for all $t \ge 0$,
$$
\mathcal{M}(t) = \int_0^t M(s)\, ds \quad \text{and} \quad R_0(t) = \int_0^t r_0(u)\, du;
$$
then $\{\mathcal{M}(t);\ t \ge 0\}$ is the integrated usage process, and $R_0(t)$ the integrated baseline failure rate. Since usage is not an internal covariate, the exponentiation formula can be invoked, and so, conditional on $M(t) = u$ and $\mathcal{M}(t) = \mathcal{U}$,
$$
f_{T\mid (M(t),\,\mathcal{M}(t))}\bigl(t \mid M(t) = u,\ \mathcal{M}(t) = \mathcal{U}\bigr) = \bigl(r_0(t) + \eta u\bigr)\exp\bigl(-(R_0(t) + \eta\,\mathcal{U})\bigr).
$$

Averaging out with respect to $f_{\mathcal{M}(t)\mid M(t)}(\mathcal{U}\mid u)$ gives the conditional density $f_{T\mid M(t)}(t\mid M(t)=u)$ as
$$
\bigl(r_0(t) + \eta u\bigr)\exp\bigl(-R_0(t)\bigr)\int e^{-\eta\,\mathcal{U}}\, f_{\mathcal{M}(t)\mid M(t)}(\mathcal{U}\mid u)\, d\mathcal{U}.
$$


If $E\bigl[e^{-\eta\,\mathcal{M}(t)} \mid M(t) = u\bigr]$ denotes the moment generating function of $\mathcal{M}(t)$ given $M(t) = u$, then
$$
f_{T\mid M(t)}(t\mid M(t)=u) = \bigl(r_0(t) + \eta u\bigr)\, e^{-R_0(t)}\, E\bigl[e^{-\eta\,\mathcal{M}(t)} \mid M(t) = u\bigr].
$$

The ease with which the above can be dealt with depends on the moment generating function. In many cases the moment generating function has to be evaluated numerically. However, if $M(t)$ is a Poisson process with cumulative intensity $\Lambda(t)$, then
$$
E\bigl[e^{-\eta\,\mathcal{M}(t)} \mid M(t) = u\bigr] = \left(\int_0^t e^{-\eta(t-s)}\, \frac{\lambda(s)}{\Lambda(t)}\, ds\right)^{\!u},
$$
so that
$$
f_{T\mid M(t)}(t\mid M(t)=u) = \bigl(r_0(t) + \eta u\bigr)\, e^{-R_0(t)} \left(\int_0^t e^{-\eta(t-s)}\, \frac{\lambda(s)}{\Lambda(t)}\, ds\right)^{\!u},
$$
where $\lambda(t) = d\Lambda(t)/dt$. With $f_{M(t)}(u)$ and $f_{T\mid M(t)}(t\mid u)$ at hand we are able to specify $f_{T,U}(t,u)$, and thence to assess the probability of failure within any region of warranty. In what follows, we illustrate the workings of the above via a special case.
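The conditional density just derived is easy to evaluate numerically once the usage intensity and the baseline rate are specified. The sketch below does this for an assumed non-homogeneous Poisson usage intensity and an assumed baseline rate; all functional forms and parameter values are illustrative.

```python
# Numerical evaluation of f_{T|M(t)}(t | M(t) = u) for a Poisson usage process,
# following the formula above.  The intensity, baseline rate, and parameter
# values are illustrative assumptions.
import numpy as np
from scipy.integrate import quad

eta = 0.05                                   # usage coefficient eta (assumed)
lam = lambda s: 2.0 + 0.1 * s                # Poisson intensity lambda(s) (assumed)
Lam = lambda t: quad(lam, 0.0, t)[0]         # cumulative intensity Lambda(t)
r0 = lambda t: 0.3 * np.sqrt(t)              # baseline failure rate r0(t) (assumed)
R0 = lambda t: quad(r0, 0.0, t)[0]           # integrated baseline rate R0(t)

def cond_density(t: float, u: int) -> float:
    """f_{T|M(t)}(t | M(t) = u) for the additive-hazard, Poisson-usage model."""
    Lt = Lam(t)
    inner, _ = quad(lambda s: np.exp(-eta * (t - s)) * lam(s) / Lt, 0.0, t)
    return (r0(t) + eta * u) * np.exp(-R0(t)) * inner ** u

print(cond_density(t=2.0, u=3))
```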

Consideration of a Special Case. For usage that can be characterized by counts, such as the number of times an item is turned on and off, the basic Poisson process serves as a suitable model for $M(t)$. With $\Lambda(t) = \lambda t$ and $r_0(t) = a$, it can be seen [cf. Singpurwalla and Wilson [19]] that for any $t \ge 0$ and $u = 0, 1, 2, \ldots$, the joint density $f_{T,U}(t, u)$ takes a closed form.

Its marginal distributions are
$$
P(U = u) = \frac{\lambda^{u}\,(a + \eta u)}{\prod_{i=0}^{u}(a + \lambda + \eta i)}
\quad \text{and} \quad
P(T \ge t) = \exp\!\left(-(a + \lambda)t + \frac{\lambda}{\eta}\bigl(1 - e^{-\eta t}\bigr)\right),
$$
and the joint distribution function of $T$ and $U$ also has a closed form.


This closed form is useful for calculating the probability of failure in any rectangular warranty region. It can also be used to calculate the probability of failure in other, non-rectangular regions; details are in Singpurwalla and Wilson [20]. For usage processes other than the Poisson, say the compound Poisson, the computations become very cumbersome; see Singpurwalla and Wilson [21], who also describe an application of such models to a problem involving warranties for traction motors.

The purpose of this section has been to illustrate how designing a two-dimensional warranty region, be it rectangular or otherwise, raises new issues in probabilistic failure modeling. A distinguishing feature of these models is that they are indexed by different scales and that the scales bear a relationship to each other.
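As an informal numerical check of the special case, the sketch below simulates the additive-hazard model with a homogeneous Poisson usage process (between usage events the hazard is constant, so failure and the next usage event compete as exponentials), compares the empirical distribution of $U$ with the closed-form $P(U = u)$ above, and estimates the probability of failure inside a rectangular warranty region. The parameter values are illustrative.

```python
# Monte Carlo sketch of the special case: homogeneous Poisson usage (rate lam),
# constant baseline rate a, additive usage effect eta.  Between usage events
# the hazard a + eta*u is constant, so "failure" and "next usage event"
# compete as exponentials.  Parameter values are illustrative.
import math
import random

random.seed(1)
a, lam, eta = 0.2, 1.5, 0.3      # assumed model constants
t_w, u_w = 2.0, 3                # a rectangular warranty region: T <= 2, U <= 3

def simulate_failure():
    """Return (T, U): failure time and usage count at failure."""
    t, u = 0.0, 0
    while True:
        rate = a + eta * u + lam
        t += random.expovariate(rate)
        if random.random() < (a + eta * u) / rate:   # failure before next usage event
            return t, u
        u += 1                                       # otherwise a usage event occurs

samples = [simulate_failure() for _ in range(200_000)]

# Empirical vs. closed-form marginal P(U = u).
for u in range(4):
    emp = sum(su == u for _, su in samples) / len(samples)
    den = math.prod(a + lam + eta * i for i in range(u + 1))
    exact = lam ** u * (a + eta * u) / den
    print(f"P(U={u}): simulated {emp:.4f}, formula {exact:.4f}")

# Probability of a failure inside the rectangular warranty region.
p_claim = sum(t <= t_w and u <= u_w for t, u in samples) / len(samples)
print(f"P(T <= {t_w}, U <= {u_w}) is approximately {p_claim:.4f}")
```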

4. Forecasting Financial Reserves for Warranty Claims

At the outset of a warranty - be it one-dimensional or two-dimensional - a manufacturer is required by law to set aside reserve funds to honor future claims against the warranty. This can run into large sums, so the question of optimally forecasting the needed reserve is an important one. The matter has been addressed from two angles: one based on a probabilistic analysis involving renewal theory, and the other an empirical approach using a time series analysis of claims data. A review of the former is in Eliashberg, Singpurwalla and Wilson [6], who also propose a strategy for two-dimensional warranties based on a new failure model indexed by time and usage. The probabilistic approach, though aesthetically appealing, suffers from the drawback of not being realistic. The main difficulty is that most claims against a warranty are satisfied by a minimal repair, so that the use of standard renewal theory - which assumes replacement - cannot be justified. Other difficulties stem from computational issues. The empirical approach, favored by industry, simply tracks actual costs over a period of years and thus tends to be more realistic. It is driven by the premise that there have been no drastic changes in the product's design or in the character of the warranty. Its disadvantage, of course, is that one needs a substantial amount of historical data to implement a meaningful time-series-based approach. Notwithstanding this requirement, the warranty scenario raises an issue that is not typical of time series data, namely the caveat of maturing data. This is explained below in the context of data on warranty claims for a brand of automobiles over different model years. But first, a few words about warranty claims data and the nature of such data.

4.1. Warranty Claims Data

In many organizations, the analysis of warranty claims data is undertaken to serve the needs of two departments, financial and engineering. Financial


departments have an interest in tracking the amounts spent on warranties; their aim is to forecast warranty claims over the service life of the product, so that funds can be escrowed to meet such claims. Engineering departments are more concerned with the frequency of repairs than with costs; their aim is to improve product design and to compare the observed data with engineering tests. In what follows, we focus on the kind of data that is used by financial departments for forecasting the financial reserves needed to meet warranty claims.

Robinson and McDonald [15] present such data for the average cumulative warranty cost per unit as a function of time, over different model years, for a particular brand of automobiles, say the Ford Escort. The time scale used is the number of months the automobiles in question have been in service. A nuance of these data is that a snapshot of the series changes with the time at which the snapshot is taken. This is because the number of units that go into computing the average cost increases with time. For example, for the 1982 model year, the average cost per unit after 3 months in service, computed say in January 1985, is different from that computed in February 1985, because by February 1985 more units have entered the population of units having 3 months in service. We refer to this phenomenon as data maturation. Such a phenomenon is not typical of the kind of time series analyses considered in the literature. With data maturation, the observation on a series at any time index, say 3 months, can go above or below its previous value. Any method of time series analysis involving maturing data should take account of it; graphical methods used in industry do not. The model proposed in [3] and [4] is based on a second-order dynamic linear model with leading indicators; it is overviewed below.
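The maturation effect can be illustrated with a few lines of code: the average cumulative cost per unit at a fixed months-in-service index, computed at two different snapshot dates, differs because the later snapshot includes more units that have reached that service age. The unit records below are hypothetical and serve only to show the mechanism.

```python
# Illustration of "data maturation": the average cumulative warranty cost per
# unit at a given months-in-service index changes with the snapshot date,
# because later snapshots include more units that have reached that age.
# The unit records below are hypothetical, purely to show the mechanism.
from datetime import date

# (in-service date, cumulative cost at 3 months in service) for a few units
units = [
    (date(1984, 9, 1), 40.0),
    (date(1984, 10, 1), 55.0),
    (date(1984, 11, 1), 35.0),   # reaches 3 months in service only in Feb 1985
]

def avg_cost_at_3_months(snapshot: date) -> float:
    """Average cost/unit at 3 months in service, among units mature by `snapshot`."""
    mature = [cost for start, cost in units
              if (snapshot.year - start.year) * 12 + (snapshot.month - start.month) >= 3]
    return sum(mature) / len(mature)

print(avg_cost_at_3_months(date(1985, 1, 31)))   # 47.5, based on two mature units
print(avg_cost_at_3_months(date(1985, 2, 28)))   # about 43.3, a third unit has matured
```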

4.2. Markov Mesh Model for Filtering and Forecasting

Suppose that a snapshot of the available data is taken at some time $T$. Let $Y_{1t}$ denote the warranty claims for units manufactured in model year 1 and having an exposure to use of duration $t$, $t = 1, 2, \ldots, T_1$. Similarly, let $Y_{it}$, $t = 1, 2, \ldots, T_i$, denote the claims for model year $i$, $i = 2, \ldots, p$. We assume that model year 1 is the furthest back in time from $T$, and that model year $p$ is the latest; thus $T_1 > T_2 > \cdots > T_p$. Assume that the item is warranted for a maximum exposure of $T$, so that $T_i \le T$ for all $i$. Recall that, due to the maturing of the claims data, all the values $Y_{it}$ for which $T_i < T$ will change when a snapshot of the series is taken at any point subsequent to $T$. Given the $Y_{it}$'s, our aim is to predict $Y_{i,T_i+1}, Y_{i,T_i+2}, \ldots, Y_{iT}$ for each $i$ for which $T_i < T$.

Since our analysis of the data for model year 1 does not depend on data from any other model year, we start our forecasting scheme by looking at the time series for model year 1. We propose second-order dynamic linear models as suitable candidates.


Specifically, we let
$$
\begin{aligned}
Y_{1t} &= \mu_{1t} + u_{1t}, && \text{with } u_{1t} \sim N(0, U_{1t}),\\
\mu_{1t} &= \mu_{1,t-1} + \beta_{1,t-1} + v_{1t}, && \text{with } v_{1t} \sim N(0, V_{1t}),\\
\beta_{1t} &= K_t\,\beta_{1,t-1} + w_{1t}, && \text{with } w_{1t} \sim N(0, W_{1t}),
\end{aligned}
\tag{17.2}
$$
where $\mu_{1t}$ is the level of the series at time $t$, $\beta_{1t}$ is the change in level at time $t$, and $K_t$ is chosen to describe various shapes of the underlying growth. When $K_t = 1$ for all $t$, the proposed model is referred to as a linear growth model. For claims data that show an S-shaped tendency, a suitable choice for $K_t$, for some $t_1 < t_2$, would be of the form $K_t = c_1 > 1$ for $t \le t_1$, $K_t = 1$ for $t_1 < t \le t_2$, and $K_t = c_2 < 1$ for $t > t_2$. The resulting model will be referred to as an S-shaped growth model.

Suppose, for now, that we ignore the issue of data maturation and set $U_{1t} = U_1$ for all $t$; we then assume that $U_1^{-1}$ has a gamma distribution with specified scale and shape parameters, and let $V_{1t} = bU_1$ and $W_{1t} = cU_1$, with $b$ and $c$ specified. Suppose also that the $u_{1t}$'s, $v_{1t}$'s, and $w_{1t}$'s are serially and contemporaneously uncorrelated. Then, given $D_{1T_1} = (Y_{11}, \ldots, Y_{1T_1})$, standard Gaussian theory (see West and Harrison [23]) enables us to obtain the means and the covariances of the posterior and the filtered distributions of $\theta^{(1)} = (\mu_{10}, \beta_{10}, \ldots, \mu_{1T_1}, \beta_{1T_1})'$, and also the predictive distributions of $Y_{1t}$, $t = T_1 + j$, $j = 1, \ldots, (T - T_1)$. A way to account for the effects of data maturation would be to set $U_{1t} = U_1 f(t)$, where $f(t)$ is some increasing function of $t$ that describes the modeler's beliefs about the rate at which the variance of the innovation sequence $\{u_{1t}\}$ grows with $t$.
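A minimal sketch of the forward (Kalman) filtering recursions for model (17.2) in its linear growth form ($K_t = 1$) is given below, assuming, for simplicity, known constant variances rather than the gamma prior on $U_1^{-1}$. The variances, initial moments, and synthetic series are illustrative assumptions; the full treatment follows West and Harrison [23].

```python
# Kalman filtering sketch for the linear growth form of model (17.2)
# (K_t = 1, known constant variances).  Values below are illustrative.
import numpy as np

U1, V1, W1 = 1.0, 0.2, 0.05                 # assumed observation/evolution variances
G = np.array([[1.0, 1.0], [0.0, 1.0]])      # state evolution: mu_t = mu_{t-1} + beta_{t-1}
F = np.array([1.0, 0.0])                    # observation picks out the level mu_t
Wm = np.diag([V1, W1])                      # evolution covariance

def filter_series(y, m0=np.zeros(2), C0=np.eye(2) * 100.0):
    """Return filtered means of (mu_t, beta_t) and one-step forecasts for y."""
    m, C = m0, C0
    means, forecasts = [], []
    for yt in y:
        a, R = G @ m, G @ C @ G.T + Wm              # prior moments at time t
        f, Q = F @ a, F @ R @ F + U1                # one-step forecast and its variance
        A = R @ F / Q                               # Kalman gain vector
        m, C = a + A * (yt - f), R - np.outer(A, A) * Q   # posterior moments at time t
        means.append(m.copy())
        forecasts.append(f)
    return np.array(means), np.array(forecasts)

# A synthetic cumulative-cost-per-unit series, purely for illustration.
y = np.cumsum(np.full(12, 2.0) + np.random.default_rng(0).normal(0, 1, 12))
means, forecasts = filter_series(y)
print(means[-1])        # filtered level and growth at the last observed month
```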

For model years 2 and later, the dynamic linear model that we propose has the form
$$
\begin{aligned}
Y_{pt} &= \mu_{pt} + u_{pt}, && \text{with } u_{pt} \sim N(0, U_{pt}),\\
\mu_{pt} &= \mu_{p,t-1} + \beta_{p,t-1} + v_{pt}, && \text{with } v_{pt} \sim N(0, V_{pt}),\\
\beta_{pt} &= \gamma K_t\,\beta_{p,t-1} + (1 - \gamma)\,\beta_{p-1,t} + w_{pt}, && \text{with } w_{pt} \sim N(0, W_{pt}),
\end{aligned}
\tag{17.3}
$$
where $\gamma \in [0, 1]$ is a weight reflecting the effect of the previous model year's data on model year $p$, $p \ge 2$, and $K_t$ is the same as that used to describe $\beta_{1t}$ in (17.2). Assuming that the $u_{pt}$'s, $v_{pt}$'s and $w_{pt}$'s are serially and contemporaneously uncorrelated, and that $U_{pt}$, $V_{pt}$, and $W_{pt}$ are specified, we can, using standard Gaussian theory, obtain the means and the covariances of the posterior and


the filtered distributions of the $\mu_{pt}$'s and the $\beta_{pt}$'s, for any specified value of $\gamma$. Then we can also find the needed predictive distributions. The algebraic details underlying these calculations are given in Chen, Lynn and Singpurwalla [3]. The key notion here is that the filtered means of the $\mu_{pt}$'s and the $\beta_{pt}$'s serve as the priors for the $\mu_{p+1,t}$'s, giving us a mechanism for incorporating information from one model year to the next, and in so doing providing a degree of smoothing.

Regarding the specification of $U_{pt}$, one strategy is to set $U_{pt} = \widehat{U}_{1t}$, where $\widehat{U}_{1t}$, $t = 1, 2, \ldots, T_1$, are the estimated values of $U_1$ given $Y_{11}, Y_{12}, \ldots, Y_{1T_1}$. Note that the $\widehat{U}_{1t}$'s are the residuals obtained by referring to the filtered estimates of $\theta^{(1)}$. This scheme of using the $\widehat{U}_{1t}$'s to specify $U_{pt}$ is another way in which information from the previous model years is used for making predictions in the current year. To specify $V_{pt}$ and $W_{pt}$, we let $V_{pt} = b\widehat{U}_{1t}$ and $W_{pt} = c\widehat{U}_{1t}$.

The discussion thus far assumes that the weight $\gamma$ is specified. In essence, we empirically choose that $\gamma$ which minimizes the mean square error of the predictions. This approach suffers from two disadvantages, the first being that it is cumbersome to implement, and the second that it has no formal justification. A defensible approach is to specify a prior distribution for $\gamma$, the prior reflecting the modeler's judgment regarding the extent of the relevance of the previous model year's data for predicting the current series, and then to produce the filtered estimates using the usual Bayesian updating schemes. However, the introduction of a prior for $\gamma$ makes the system of equations (17.3) nonlinear, and an exact inferential mechanism is then not available. The modern strategy for overcoming such difficulties, which we have successfully implemented, is to use the technique of Gibbs sampling (see Gelfand and Smith [7]). As is to be expected, it may happen that the inclusion of data from previous model years results in predictions inferior to those given by the current year's data alone.
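The empirical choice of $\gamma$ mentioned above can be sketched as a grid search: for each candidate $\gamma$, run the filter for model year $p$ with the trend evolution $\gamma K_t \beta_{p,t-1} + (1-\gamma)\beta_{p-1,t}$ (treating the previous year's filtered trend as a known input) and retain the $\gamma$ with the smallest mean squared one-step-ahead forecast error. Everything below (variances, data, $K_t = 1$) is an illustrative stand-in, not the authors' implementation.

```python
# Grid-search sketch for the weight gamma in (17.3): the trend evolves as
# gamma * beta_{p,t-1} + (1 - gamma) * beta_{p-1,t} (K_t = 1 here), where the
# previous year's filtered trend beta_prev is treated as a known input.
# Variances and data below are illustrative; this is not the authors' code.
import numpy as np

U, V, W = 1.0, 0.2, 0.05

def one_step_mse(y, beta_prev, gamma):
    """Mean squared one-step forecast error for a given gamma."""
    m, C = np.zeros(2), np.eye(2) * 100.0
    F = np.array([1.0, 0.0])
    Wm = np.diag([V, W])
    errs = []
    for t, yt in enumerate(y):
        G = np.array([[1.0, 1.0], [0.0, gamma]])
        shift = np.array([0.0, (1.0 - gamma) * beta_prev[t]])   # known input to beta
        a, R = G @ m + shift, G @ C @ G.T + Wm
        f, Q = F @ a, F @ R @ F + U
        errs.append((yt - f) ** 2)
        A = R @ F / Q
        m, C = a + A * (yt - f), R - np.outer(A, A) * Q
    return float(np.mean(errs))

rng = np.random.default_rng(1)
beta_prev = np.full(12, 2.0)                            # previous year's filtered trend
y = np.cumsum(beta_prev + rng.normal(0, 1, 12))         # synthetic current-year series
gammas = np.linspace(0.0, 1.0, 21)
best = min(gammas, key=lambda g: one_step_mse(y, beta_prev, g))
print(f"gamma minimizing one-step MSE: {best:.2f}")
```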

5. Summary and Conclusions

In this overview we have presented three scenarios spawned by warranty problems, each of which bears a relationship to reliability theory in its own way. The first brings into play the three interpretations of probability: the objective (which is a propensity or chance), the subjective, and the logical. Reliability pertains to the first and is germane for designing a warranty; the second comes into play in ratifying a warranty, and the third in assessing the fairness of a warranty. The second scenario pertains to the need to develop a special class of probabilistic failure models for setting meaningful warranty regions, and in a way comes closest to the spirit of research in reliability. The third scenario draws on the second, but not in a meaningful way; it has a stronger impact on statistical analysis vis-a-vis the modeling of a time series. Problems


of statistical inference based on warranty data have not been mentioned here. These constitute a large body of literature and pose some challenging issues for writing out the likelihood under informative stopping rules. A feel for these is given in Singpurwalla [17]; a more recent reference is Lawless, Hu and Cao [9].

At a broader level, one can view reliability as an aid to making decisions. After all, one is not interested in estimating the mean time to failure via a probabilistic model purely for the sake of doing so. Rather, an estimate of the mean time to failure becomes an aid to making decisions such as the acceptance or rejection of an item, the addition of redundancies, or the arrival at a suitable warranty region. However, the setting of an optimum warranty region calls for more than a meaningful probabilistic model. Warranties come into play because manufacturers also want to dominate the market by eliminating their competitors; warranties are a way of luring customers. But poorly designed warranties, such as those that do not limit time or usage, can be detrimental to a manufacturer's profits, and thus there is a line to be drawn. A consequence is that manufacturers engage in a game. Warranties therefore entail game-theoretic considerations as well, and this is perhaps the most important issue that we need to come to terms with. A problem for future research, therefore, is the use of reliability theory in a game-theoretic setting for arriving at warranties that are both economically and technically sound.

Acknowledgments

Research was supported by The Office of Naval Research under Grant ONR N00014-99-1-0875, and by Grants DAAD 19-01-1-0502 and DAAD 19-02-10195, The Army Research Office, with The George Washington University.

References

[1] Arrow, K. J. (1963), Uncertainty and the Welfare Economics of Medical Care. American Economic Review, 53: 941-973.
[2] Blishke, W. R. and Murthy, D. N. P. (1996), Eds., Product Warranty Handbook, Marcel Dekker, Inc., New York.
[3] Chen, J., Lynn, N. J. and Singpurwalla, N. D. (1995), Markov Mesh Models for Filtering and Forecasting with Leading Indicators. In Analysis of Censored Data, IMS Lecture Notes - Monograph Series, Vol. 43, pp. 39-54.
[4] Chen, J., Lynn, N. J. and Singpurwalla, N. D. (1996), Forecasting Warranty Claims. In Product Warranty Handbook, W. R. Blishke and D. N. P. Murthy eds., Marcel Dekker, Inc., New York, pp. 803-817.
[5] Duchesne, T. and Lawless, J. (2000), Alternative Time Scales and Failure Time Models. Lifetime Data Analysis, 6: 157-179.
[6] Eliashberg, J., Singpurwalla, N. D. and Wilson, S. P. (1997), Calculating the Reserve for a Time and Usage Indexed Warranty. Management Science, 43: 967-975.
[7] Gelfand, A. E. and Smith, A. F. M. (1990), Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 85: 398-409.
[8] Jewell, N. P. and Kalbfleisch, J. D. (1992), Markov Models in Survival Analysis and Applications to Issues Associated with AIDS. In AIDS Epidemiology: Methodological Issues, N. P. Jewell, K. Dietz and V. T. Farewell eds., Birkhauser, Boston, pp. 211-230.
[9] Lawless, J. F., Hu, J. and Cao, J. (1995), Methods for the Estimation of Failure Distributions and Rates from Automobile Warranty Data. Lifetime Data Analysis, 1: 227-240.
[10] Lindley, D. V. (1985), Making Decisions, John Wiley and Sons, New York.
[11] Lindley, D. V. and Singpurwalla, N. D. (2002), On Exchangeable, Causal and Cascading Failures. Statistical Science, 17: 209-219.
[12] Mercer, A. (1961), Some Simple Wear-Dependent Renewal Processes. Journal of the Royal Statistical Society, Series B, 23: 368-376.
[13] Nelson, W. (1995), Analysis of Failure Data with Two Measures of Usage. In Recent Advances in Life Testing and Reliability, N. Balakrishnan ed., CRC Press, Boca Raton, FL, pp. 51-58.
[14] Oakes, D. (1995), Multiple Time Scales in Survival Analysis. Lifetime Data Analysis, 1: 7-18.
[15] Robinson, J. A. and McDonald, G. C. (1991), Issues Related to Reliability and Warranty Data. In Data Quality Control: Theory and Pragmatics, Marcel Dekker, New York.
[16] Shafer, G. (1990), The Unity and Diversity of Probability. Statistical Science, 5: 435-462.
[17] Singpurwalla, N. D. (1989), Inference Under Planned Maintenance, Warranties, and Other Retrospective Data. Journal of Statistical Planning and Inference, 29: 171-185.
[18] Singpurwalla, N. D. (2001), Contract Warranties and Equilibrium Probabilities. In Statistical Science in the Courtroom, J. L. Gastwirth ed., Springer-Verlag, New York.
[19] Singpurwalla, N. D. and Wilson, S. P. (1992), Warranties. In Bayesian Statistics 4, J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith eds., Oxford University Press, Oxford, pp. 435-446.
[20] Singpurwalla, N. D. and Wilson, S. P. (1993), The Warranty Problem: Its Statistical and Game Theoretic Aspects. SIAM Review, 35: 17-42.
[21] Singpurwalla, N. D. and Wilson, S. P. (1998), Failure Models Indexed by Two Scales. Advances in Applied Probability, 30: 1058-1072.
[22] Weatherford, W. R. (1982), Philosophical Foundations of Probability, Routledge and Kegan Paul, Ltd., London.
[23] West, M. and Harrison, J. (1989), Bayesian Forecasting and Dynamic Models, Springer-Verlag, New York.

INTERNATIONAL SERIES IN OPERATIONS RESEARCH & MANAGEMENT SCIENCE
Frederick S. Hillier, Series Editor, Stanford University

Saigal/ A MODERN APPROACH TO LINEAR PROGRAMMING
Nagurney/ PROJECTED DYNAMICAL SYSTEMS & VARIATIONAL INEQUALITIES WITH APPLICATIONS
Padberg & Rijal/ LOCATION, SCHEDULING, DESIGN AND INTEGER PROGRAMMING
Vanderbei/ LINEAR PROGRAMMING
Jaiswal/ MILITARY OPERATIONS RESEARCH
Gal & Greenberg/ ADVANCES IN SENSITIVITY ANALYSIS & PARAMETRIC PROGRAMMING
Prabhu/ FOUNDATIONS OF QUEUEING THEORY
Fang, Rajasekera & Tsao/ ENTROPY OPTIMIZATION & MATHEMATICAL PROGRAMMING
Yu/ OR IN THE AIRLINE INDUSTRY
Ho & Tang/ PRODUCT VARIETY MANAGEMENT
El-Taha & Stidham/ SAMPLE-PATH ANALYSIS OF QUEUEING SYSTEMS
Miettinen/ NONLINEAR MULTIOBJECTIVE OPTIMIZATION
Chao & Huntington/ DESIGNING COMPETITIVE ELECTRICITY MARKETS
Weglarz/ PROJECT SCHEDULING: RECENT TRENDS & RESULTS
Sahin & Polatoglu/ QUALITY, WARRANTY AND PREVENTIVE MAINTENANCE
Tavares/ ADVANCES MODELS FOR PROJECT MANAGEMENT
Tayur, Ganeshan & Magazine/ QUANTITATIVE MODELS FOR SUPPLY CHAIN MANAGEMENT
Weyant, J./ ENERGY AND ENVIRONMENTAL POLICY MODELING
Shanthikumar, J.G. & Sumita, U./ APPLIED PROBABILITY AND STOCHASTIC PROCESSES
Liu, B. & Esogbue, A.O./ DECISION CRITERIA AND OPTIMAL INVENTORY PROCESSES
Gal, T., Stewart, T.J., Hanne, T./ MULTICRITERIA DECISION MAKING: Advances in MCDM Models, Algorithms, Theory, and Applications
Fox, B.L./ STRATEGIES FOR QUASI-MONTE CARLO
Hall, R.W./ HANDBOOK OF TRANSPORTATION SCIENCE
Grassman, W.K./ COMPUTATIONAL PROBABILITY
Pomerol, J-C. & Barba-Romero, S./ MULTICRITERION DECISION IN MANAGEMENT
Axsäter, S./ INVENTORY CONTROL
Wolkowicz, H., Saigal, R., & Vandenberghe, L./ HANDBOOK OF SEMI-DEFINITE PROGRAMMING: Theory, Algorithms, and Applications
Hobbs, B.F. & Meier, P./ ENERGY DECISIONS AND THE ENVIRONMENT: A Guide to the Use of Multicriteria Methods
Dar-El, E./ HUMAN LEARNING: From Learning Curves to Learning Organizations
Armstrong, J.S./ PRINCIPLES OF FORECASTING: A Handbook for Researchers and Practitioners
Balsamo, S., Persone, V., & Onvural, R./ ANALYSIS OF QUEUEING NETWORKS WITH BLOCKING
Bouyssou, D. et al./ EVALUATION AND DECISION MODELS: A Critical Perspective
Hanne, T./ INTELLIGENT STRATEGIES FOR META MULTIPLE CRITERIA DECISION MAKING
Saaty, T. & Vargas, L./ MODELS, METHODS, CONCEPTS and APPLICATIONS OF THE ANALYTIC HIERARCHY PROCESS
Chatterjee, K. & Samuelson, W./ GAME THEORY AND BUSINESS APPLICATIONS