Practical system reliability 9780470408605, 047040860X

Learn how to model, predict, and manage system reliability/availability throughout the development life cycle Written by

926 57 3MB

English Pages 288 Year 2009

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Practical system reliability
 9780470408605, 047040860X

Citation preview

ffirs.qxd

3/3/2009

6:01 PM

Page i

PRACTICAL SYSTEM RELIABILITY

ffirs.qxd

3/3/2009

6:01 PM

Page ii

IEEE Press 445 Hoes Lane Piscataway, NJ 08855 IEEE Press Editorial Board Lajos Hanzo, Editor in Chief R. Abari J. Anderson S. Basu A. Chatterjee

T. Chen T. G. Croda M. El-Hawary S. Farshchi

B. M. Hammerli O. Malik S. Nahavandi W. Reeve

Kenneth Moore, Director of IEEE Book and Information Services (BIS) Jeanne Audino, Project Editor Technical Reviewers Robert Hanmer, Alcatel-Lucent Kime Tracy, Northeastern Illinois University Paul Franklin, 2nd Avenue Subway Project Simon Wilson, Trinity College, Ireland

ffirs.qxd

3/3/2009

6:01 PM

Page iii

PRACTICAL SYSTEM RELIABILITY

Eric Bauer Xuemei Zhang Douglas A. Kimber

IEEE Press

A JOHN WILEY & SONS, INC., PUBLICATION

ffirs.qxd

3/3/2009

6:01 PM

Page iv

Copyright © 2009 by the Institute of Electrical and Electronics Engineers, Inc. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved. Published simultaneously in Canada. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representation or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data is available. ISBN 978-0470-40860-5 Printed in the United States of America. 10 9 8 7 6 5 4 3 2 1

ffirs.qxd

3/3/2009

6:01 PM

Page v

For our families, who have supported us in the writing of this book, and in all our endeavors

ftoc.qxd

3/3/2009

6:04 PM

Page vii

CONTENTS

Preface Acknowledgments 1 Introduction

xi xiii 1

2 System Availability 2.1 Availability, Service and Elements 2.2 Classical View 2.3 Customers’ View 2.4 Standards View

5 6 8 9 10

3 Conceptual Model of Reliability and Availability 3.1 Concept of Highly Available Systems 3.2 Conceptual Model of System Availability 3.3 Failures 3.4 Outage Resolution 3.5 Downtime Budgets

15 15 17 19 23 26

4 Why Availability Varies Between Customers 4.1 Causes of Variation in Outage Event Reporting 4.2 Causes of Variation in Outage Duration

31 31 33

5 Modeling Availability 5.1 Overview of Modeling Techniques 5.2 Modeling Definitions 5.3 Practical Modeling 5.4 Widget Example 5.5 Alignment with Industry Standards

37 38 58 69 78 89

6 Estimating Parameters and Availability from Field Data 6.1 Self-Maintaining Customers 6.2 Analyzing Field Outage Data 6.3 Analyzing Performance and Alarm Data

95 96 96 106 vii

ftoc.qxd

3/3/2009

viii

6:04 PM

Page viii

CONTENTS

6.4 6.5 6.6

Coverage Factor and Failure Rate Uncovered Failure Recovery Time Covered Failure Detection and Recovery Time

107 108 109

7 Estimating Input Parameters from Lab Data 7.1 Hardware Failure Rate 7.2 Software Failure Rate 7.3 Coverage Factors 7.4 Timing Parameters 7.5 System-Level Parameters

111 111 114 129 130 132

8 Estimating Input Parameters in the Architecture/Design Stage 8.1 Hardware Parameters 8.2 System-Level Parameters 8.3 Sensitivity Analysis

137 138 146 149

9 Prediction Accuracy 9.1 How Much Field Data Is Enough? 9.2 How Does One Measure Sampling and Prediction Errors? 9.3 What Causes Prediction Errors?

167 168 172 173

10 Connecting the Dots 10.1 Set Availability Requirements 10.2 Incorporate Architectural and Design Techniques 10.3 Modeling to Verify Feasibility 10.4 Testing 10.5 Update Availability Prediction 10.6 Periodic Field Validation and Model Update 10.7 Building an Availability Roadmap 10.8 Reliability Report

177 179 179 206 208 208 208 209 210

11 Summary

213

Appendix A System Reliability Report outline 1 Executive Summary 2 Reliability Requirements 3 Unplanned Downtime Model and Results Annex A Reliability Definitions Annex B References Annex C Markov Model State-Transition Diagrams

216 215 217 217 219 219 220

Appendix B Reliability and Availability Theory 1 Reliability and Availability Definitions 2 Probability Distributions in Reliability Evaluation 3 Estimation of Confidence Intervals

221 221 228 237

ftoc.qxd

3/3/2009

6:04 PM

Page ix

CONTENTS

ix

Appendix C Software Reliability Growth Models 1 Software Characteristic Models 2 Nonhomogeneous Poisson Process Models

245 245 246

Appendix D Acronyms and Abbreviations

263

Appendix E Bibliography

265

Index

279

About the Authors

285

fpref.qxd

3/3/2009

6:06 PM

Page xi

PREFACE

T

HE RISE OF THE INTERNET,

sophisticated computing and communications technologies, and globalization have raised customers’ expectations of powerful “always on” services. A crucial characteristic of these “always on” services is that they are highly available; if the customer cannot get a search result, or order a product or service, or complete a transaction instantly, then another service provider is often just one click away. As a result, highly available (HA) services are essential to many modern businesses, such as telecommunications and cable service providers, Web-based businesses, information technology (IT) operations, and so on. Poor service availability or reliability often represents real operating expenses to service providers via costs associated with: 앫 Loss of brand reputation and customer good will. Verizon Wireless proudly claims to be “America’s most reliable wireless network” (based on low ineffective attempt and cutoff transaction rates), whereas Cingular proudly claims “Fewest dropped calls of any network.” Poor service availability can lead to subscriber churn, a tarnished brand reputation, and loss of customer good will. 앫 Direct loss of customers and business. Failure of an online provisioning system or order entry system can cause customers to be turned away because their purchase or order cannot be completed. For instance, if a retail website is unavailable or malfunctioning, many customers will simply go to a competitor’s website rather than bothering to postpone their purchase and retrying later. 앫 Higher maintenance-related operating expenses. Lower reliability systems often require more maintenance actions and raise xi

fpref.qxd

3/3/2009

xii

6:06 PM

Page xii

PREFACE

more alarms. More frequent failures often mean more maintenance staff must be available to address the higher volume of maintenance events and alarms. Repairs to equipment in unstaffed locations (e.g., outdoor base stations) require additional time and mileage expenses to get technicians and spare parts to those locations. 앫 Financial penalties or liquidated damages due to subscribers/ customers for failing to meet service availability or “uptime” contractual requirements or service level agreements (SLAs). This practical guide explains what system availability (including both hardware and software downtime) and software reliability are for modern server, information technology or telecommunications systems, and how to understand, model, predict and manage system availability throughout the development cycle. This book focuses on unplanned downtime, which is caused by product-attributable failures, rather than planned downtime caused by scheduled maintenance actions such as software upgrades and preventive maintenance. It should be noted that this book focuses on reliability of mission-critical systems; human-lifecritical systems such as medical electronics, nuclear power operations, and avionics demand much higher levels of reliability and availability, and additional techniques beyond what is presented in this book may be appropriate. This book provides valuable insight into system availability for anyone working on a system that needs to provide high availability. Product managers, system engineers, system architects, developers, and system testers will all see how the work they perform contributes to the ultimate availability of the systems they build. ERIC BAUER XUEMEI ZHANG DOUGLAS A. KIMBER Freehold, New Jersey Morganville, New Jersey Batavia, Illinois February 2009

flast.qxd

3/4/2009

8:54 AM

Page xiii

ACKNOWLEDGMENTS

We thank Abhaya Asthana, James Clark, Randee Adams, Paul Franklin, Bob Hanmer, Jack Olivieri, Meena Sharma, Frank Gruber, and Marc Benowitz for their support in developing, organizing and documenting the software reliability and system availability material included in this book. We also thank Russ Harwood, Ben Benison, and Steve Nicholls for the valuable insights they provided from their practical experience with system availability. E.B. X.Z. D.K.

xiii

c01.qxd

2/8/2009

5:19 PM

CHAPTER

Page 1

1

INTRODUCTION

Meeting customers’ availability expectations for an individual product is best achieved through a process of continuous improvement, as shown in Figure 1.1. The heart of the process is an architecture-based, mathematical availability model that captures the complex relationships between hardware and software failures and the system’s failure detection, isolation, and recovery mechanisms to predict unplanned, product-attributable downtime (covered in Chapter 5). In the architecture or high-level design phase of a product release, parameters for the model are roughly estimated based on planned features, producing an initial availability estimate to assess the feasibility of meeting the release’s availability requirements (covered in Chapter 8). In the system test phase, updated modeling parameters (such as hardware failure rate calculations, software failure rate estimations from lab data, and measured system parameters) can be used in the model to produce a revised availability estimate for the product (covered in Chapter 7). After the product is deployed in commercial service, outage data can be analyzed to calculate actual rate of outage-inducing software and hardware failures, outage durations, and so on; these actual values can be used to better calibrate modeling parameters and the model itself (covered in Chapter 6). If there is a gap between the actual field availability and the product’s requirements, then a roadmap of availabilityimproving features can be constructed, and the availability prediction for the next release is produced by revising modeling parameters (and the model itself, if significant architectural changes are made) to verify feasibility of meeting the next release’s availability requirements with the planned feature set, thus closing the loop (covered in Chapter 10). Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.

1

c01.qxd

2/8/2009

2

5:19 PM

Page 2

INTRODUCTION

Construct Road map of Availability— Improving Features to Close Any Gap

Estimate Availability from Lab Data and Analysis

Figure 1.1. Managing system availability.

The body of this book is organized as follows: 앫 Chapter 2, System Availability, explains the classical, service providers’ and TL 9000 views of availability. 앫 Chapter 3, Conceptual Availability Model, explains the relationship between service-impacting failure rates, outage durations and availability. 앫 Chapter 4, Why Availability Varies between Customers, explains why the same product with identical failure rates can be perceived to have different availability by different customers. 앫 Chapter 5, Modeling Availability, explains how mathematical models of system availability are constructed; an example is given. 앫 Chapter 6, Estimating Parameters from Field Data, explains how system availability and reliability parameters can be estimated from field outage data. 앫 Chapter 7, Estimating Input Parameters from Lab Data, explains how modeling input parameters can be estimated from

c01.qxd

2/8/2009

5:19 PM

Page 3

INTRODUCTION



앫 앫

앫 앫

앫 앫 앫 앫 앫

3

lab data to support an improved availability estimate before a product is deployed to customers (or before field outage data is available). Chapter 8, Estimating Input Parameters in Architecture/Design Stage, explains how modeling input parameters can be estimated early in a product’s development, before software is written or hardware schematics are complete. Good modeling at this stage enables one to verify the feasibility of meeting availability requirements with a particular architecture and high-level design. Chapter 9, Prediction Accuracy, discusses how much field data is enough to validate predictions and how accurate availability predictions should be. Chapter 10, Connecting the Dots, discusses how to integrate practical software reliability and system availability modeling into a product’s development lifecycle to meet the market’s availability expectations. Chapter 11, Summary, summarizes the key concepts presented in this book, and the practical ways those concepts may be leveraged in the design and analysis of real systems. Appendix A, Sample Reliability Report Outline, gives an outline for a typical written reliability report. This explains the information that should be included in a reliability report and provides examples. Appendix B, Reliability and Availability Theory Appendix C, Software Reliability Growth Models Appendix D, Abbreviations References Index

c02.qxd

2/8/2009

5:20 PM

CHAPTER

Page 5

2

SYSTEM AVAILABILITY

There is a long history of so-called “Five-9’s” systems. Five-9’s is shorthand for 99.999% service availability which translates to 5.26 down-minutes per system per year. Telecommunications was one of the first areas to achieve Five-9’s availability, but this Five9’s expectation is now common for telecommunications, missioncritical servers, and computing equipment; in some cases, customers expect some individual elements to exceed 99.999%. Many telecommunications Web servers, and other information technology systems routinely exceed 99.999% service availability in actual production. The telecommunications industry, both service providers and equipment manufacturers, tailored the ISO 9000 quality standard to create the TL 9000 standard. More specifically, TL 9000 was created by the Quality Excellence for Suppliers of Telecommunications (QuEST) Forum. The QuEST Forum is a consortium of telecommunications service providers, suppliers, and liaisons* that is dedicated to advancing “the quality, reliability, and performance of telecom products and services around the world.” TL 9000 gives clear and formal rules for measuring the reliability and availability of servers and equipment that supports the Internet Protocol (IP) and a wide variety of telecommunications and computing center equipment. TL 9000 defines a number of metrics and the associated math and counting rules that enable tracking of very specific quality, reliability, and performance aspects of a wide variety of products. The metric names consist of a few letters that define the area being measured, along with a number to distinguish between similar metrics within that area. For example, *At the time of this writing, the QuEST forum membership included more than 25 service providers, more than 80 suppliers, and over 40 liaisons. Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.

5

c02.qxd

2/8/2009

6

5:20 PM

Page 6

SYSTEM AVAILABILITY

there is a metric called “SO4,” which is the fourth type of metric that deals with “system outages.” Because both equipment manufacturers and users (i.e., telecommunications service providers) defined TL 9000, it offers a rigorous and balanced scheme for defining and measuring reliability and availability. TL 9000 is explicitly applicable to many product categories used with all IP-based solutions and services, including: 앫 IP-based Multimedia Services (Category 1.2.7), including video over IP, instant messaging, voice features, multimedia communications server, media gateway 앫 Enhanced Services Platforms and Intelligent Peripherals (Category 6.1) like unified/universal messaging 앫 Network Management Systems, both online critical (Category 4.2.1), such as network traffic management systems, and online noncritical (Category 4.2.2) such as provisioning, dispatch, and maintenance 앫 Business Support Systems (Category 4.2.3), such as the inventory, billing records, and service creation platforms 앫 General Purpose Computers (Category 5.2) such as terminals, PCs, workstations, and mini-, mid-, and mainframes 앫 All networking equipment such as network security devices (Category 6.6), routers (Categories 1.2.9 and 6.2.7), PBX’s (Category 6.4), and virtually all equipment used by communications service providers TL 9000 reliability measurements are broadly applicable to most IP- and Web-based services and applications, and, thus, this book will use TL 9000 principles as appropriate. There are two parts to the TL 9000 Standard: a Requirements Handbook and a Measurements Handbook. The Measurements Handbook is the standard that is most applicable to the topics covered in this book because it defines the metrics for how to measure system availability and software reliability. Applicable TL 9000 principles will be explained, so no previous knowledge of TL 9000 is necessary.

2.1

AVAILABILITY, SERVICE, AND ELEMENTS

TL 9000’s Quality Measurement Systems Handbook, v4.0, defines availability as “the ability of a unit to be in a state ready to per-

c02.qxd

2/8/2009

5:20 PM

Page 7

2.1

AVAILABILITY, SERVICE, AND ELEMENTS

7

form a required function at a given instant in time or for any period within a given time interval, assuming that the external resources, if required, are provided.” Thus, availability is a probability, the probability that a unit will be ready to perform its function, and, like all probabilities, it is dimensionless. Practically, availability considers two factors: how often does the system fail, and how quickly can the system be restored to service following a failure. Operationally, all hardware elements eventually fail because of manufacturing defects, wear out, or other phenomena; software elements fail because of residual defects or unforeseen operational scenarios (including human errors). Reliability is typically defined as the ability to perform a specified or required function under specific conditions for a stated period of time. Like availability, reliability may be expressed as a percentage. It is the probability that a unit will be able to perform its specified function for a stated period of time. Availability and reliability are often confused, partly because the term reliability tends to be used when availability is what was really intended. One classic example that helps distinguish reliability from availability is that of an airplane. If you want to fly from Chicago to Los Angeles, then you want to get on a very reliable plane, one that has an extremely high probability of being able to fly for the 4 to 5 hours the trip will take. That same plane could have a very low availability. If the plane requires 4 hours worth of maintenance prior to each 4 hour flight, then the plane’s availability would only be 50%. High-availability systems are designed to automatically detect, isolate, alarm, and recover from inevitable failures (often by rapidly switching or failing over to redundant elements) to maintain high service availability. A typical design principle of socalled “high availability” systems is that no single failure should cause a loss of service. This design principle is often referred to as “no single point of failure.” As complex systems may be comprised of multiple similar or identical elements, it is often useful to distinguish between service availability and element availability. Service is generally the primary functionality of the system or set of elements. Some services are delivered by a single, stand-alone element; other services are delivered by a set of elements. A simple example of these complementary definitions is a modern commercial airliner with two jet engines. If a single jet engine fails (an element failure), then

c02.qxd

2/8/2009

8

5:20 PM

Page 8

SYSTEM AVAILABILITY

propulsion service remains available, albeit possibly with a capacity loss, so the event is not catastrophic; nevertheless, this element failure is certainly an important event to manage. If the second jet engine fails before the first jet engine has been repaired, then a catastrophic loss of propulsion service occurs. Conceptually, element availability is the percentage of time that a particular element (e.g., the jet engine) is operational; service availability is the percentage of time that the service offered by one or more elements (e.g., propulsion) is operational. As clustered or redundant architectures are very common in high availability systems and services, clearly differentiating service availability from element availability is very useful. TL 9000 explicitly differentiates these two concepts as network element outages (e.g., product-attributable network element downtime tracked by the TL 9000 NEO4 metric) versus product-attributable service downtime (tracked by the SO4 metric). Unless otherwise stated, this book focuses on service availability.

2.2

CLASSICAL VIEW

Traditionally, systems were viewed as having two distinct states: up and down. This simplifying assumption enabled the following simple mathematical definition of availability: Uptime MTTF Availability = ᎏᎏᎏ = ᎏᎏ Uptime + Downtime MTTF + MTTR

(2.1)

Mean time to failure (MTTF) often considered only hardware failures and was calculated using well-known hardware prediction methods like those described in the military standard MIL-HDBK-STD-217F or the Telcordia telecommunications standard BR-SR-332 (also known as Reliability Prediction Procedure, or RPP). Section 1 in Appendix B illustrates the definition of MTTF in mathematical format, and shows its relationship with the reliability function. Mean time to repair (MTTR) was often assumed to be 4 hours. Although this calculation did not purport to accurately model actual system availability, it did represent a useful comparison value, much like Environmental Protection Agency (EPA) standard gas mileage in the United States. An added benefit is that this definition is very generic and can easily

c02.qxd

2/8/2009

5:20 PM

Page 9

2.3

CUSTOMER’S VIEW

9

be applied across many product categories, from military/aerospace to commercial/industrial and other fields. This classical view has the following limitations: 앫 Hardware redundancy and rapid software recovery mechanisms are not considered yet are designed into many modern high-availability systems so that many or most failures are recovered so rapidly that noticeable outages do not occur. 앫 Service repair times vary dramatically for different failures. For instance, automatic switchovers are much faster than manual repairs, and recovering catastrophic backplane failures often takes significantly longer than recovering from circuit pack failures. 앫 Many complex systems degrade partially, rather than having simple 100% up and completely down states. For instance, in a digital subscriber line (DSL) access element, a single port on a multi-line DSL card can fail (affecting perhaps < 1% of capacity), or one of several multiline DSL cards can completely fail (affecting perhaps 10% of capacity), or the aggregation/backhaul capability can completely fail (affecting perhaps 100% of capacity). Clearly, loss of an entire (multiline) DSL access element is much more severe than the loss of a single access line. Thus, sophisticated customers generally take a more refined view of availability. 2.3

CUSTOMERS’ VIEW

Sophisticated customers often take a more pragmatic view of availability that explicitly considers actual capacity loss (or capacity affected) for all service disruptions. As sophisticated customers will typically generate trouble tickets that capture the percentage of users (or the actual number of users) that are impacted and the actual duration for service disruptions, they will often calculate availability via the following formula: Availability = In-service time – ⌺Outage events Capacity loss × Outage duration ᎏᎏᎏᎏᎏᎏᎏᎏ In-service time (2.2)

c02.qxd

2/8/2009

10

5:20 PM

Page 10

SYSTEM AVAILABILITY

In-service time is the amount of time the equipment was supposed to be providing service. It is often expressed in system minutes. Capacity loss is the percentage of provisioned users that are impacted by the outage (or, alternatively, the number of users). Outage duration is typically measured in seconds or minutes. Equation 2.2 prorates the duration of each outage by the percentage of capacity lost for that outage, and then adds all the outages together before converting the outage information to availability. As an example, consider a home location register (HLR) database system that stores wireless subscriber information on a pair of databases. The subscriber information is evenly allocated between the two servers for capacity reasons. If one of the database servers incurs a 10 minute outage, then half of the subscribers will be unable to originate a call during that 10 minute interval. If that was the only outage the HLR incurred during the year, then the annual availability of the HLR is: Availability = 1 year – (50% capacity loss × 10 min downtime) ᎏᎏᎏᎏᎏᎏ 1 year

(2.3)

525960 – 5 = ᎏᎏ = 99.999% 525960 This works out to be 99.999%. Notice that in this example the availability was calculated for an entire year.* Other periods could be used, but it is customary to use a full year. This is primarily because downtime, which is the inverse of the availability, is typically expressed in minutes per year.

2.4

STANDARDS VIEW

The QuEST Forum has standardized definitions and measurements for outages and related concepts in the TL 9000 Quality Management System Measurements Handbook. Key concepts from *This book uses 525,960 minutes per year because when leap years are considered, the average year has 365.25 days, and 365.25 days times 24 hours per day times 60 minutes per hour gives 525,960 minutes. It is acceptable to use 525,600 minutes per year, thus ignoring leap years. The important thing is to be consistent—always use the same number of minutes per year.

c02.qxd

2/8/2009

5:20 PM

Page 11

2.4

STANDARDS VIEW

11

the TL 9000 v4.0 Measurement Handbook relevant to software reliability and system availability are reviewed in this chapter. 2.4.1

Outage Attributability

TL 9000 explicitly differentiates product-attributable outages from customer-attributable or other outages. Product-attributable outage. An outage primarily triggered by a) The system design, hardware, software, components or other parts of the system b) Scheduled outage necessitated by the design of the system c) Support activities performed or prescribed by an organization, including documentation, training, engineering, ordering, installation, maintenance, technical assistance, software or hardware change actions, and so on d) Procedural error caused by the organization e) The system failing to provide the necessary information to conduct a conclusive root cause determination f) One or more of the above Customer-attributable outage. An outage that is primarily attributable to the customer’s equipment or support activities triggered by a) Customer procedural errors b) Office environment, for example power, grounding, temperature, humidity, or security problems c) One or more of the above d) Outages are also considered customer attributable if the customer refuses or neglects to provide access to the necessary information for the organization to conduct root cause determination.

As used above, the term “organization” refers to the supplier of the product and its support personnel (including subcontracted support personnel). This book focuses on product-attributable outages. 2.4.2

Outage Duration and Capacity Loss

TL 9000 explicitly combines outage duration and capacity loss into a single parameter:

c02.qxd

2/8/2009

12

5:20 PM

Page 12

SYSTEM AVAILABILITY

Outage Downtime—The sum, over a given period, of the weighted minutes a given population of a system, network element (NE), or service entity was unavailable, divided by the average in-service population of systems, network elements, or service entities.

Crucially, TL 9000 explicitly uses weighted minutes to prorate downtime by capacity lost. 2.4.3

Service Versus Element Outages

As many systems are deployed in redundant configurations to assure high availability, TL 9000 explicitly differentiates service-impacting outage from network-element-impacting outage: Service Impact Outage—A failure in which end-user service is directly impacted. End user service includes but is not limited to one or more of the following: fixed-line voice service, wireless voice service, wireless data service, high-speed fixed access (DSL, cable, fixed wireless), broadband access circuits (OC-3+), narrowband access circuits (T1/E1, T3/E3). Network Element Impact Outage—A failure in which a certain portion of a network element functionality/capability is lost, down, or out of service for a specified period of time.

The Service Impact Outage measurements are designed to assess the impact of outages on end-user service. As such, they look at the availability of the primary function (or service) of the product. The Network Element Impact Outage measurements are designed to assist the service provider in understanding the maintenance costs associated with a particular network element. They include outages that are visible to the end user as well as failure events such as loss of redundancy, which the end user will not see. 2.4.4

Outage Exclusion and Counting Rules

Outages often have variable durations and impact variable portions of system capacity. Thus, as a practical matter it becomes important to precisely agree on which events are significant enough to be counted as “outages” and which events are so transient or so small as to be excluded from consideration as “outages.” Naturally, equipment suppliers often prefer more generous outage exclu-

c02.qxd

2/8/2009

5:20 PM

Page 13

2.4

STANDARDS VIEW

13

sion rules to give a more flattering view of product-attributable service availability, whereas customers may prefer to take a more inclusive view and count “everything.” The TL 9000 measurements handbook 4.0 provides the following compromise for typical systems: All outages shall be counted that result in a complete loss of primary functionality for all or part of the system for a duration greater than 15 seconds during the operational window, regardless of whether the outage was unscheduled or scheduled.

Different services have different tolerances for short service disruptions. Thus, it is important for suppliers and customers to agree on how brief a service disruption is acceptable for automatic failure detection, isolation, and recovery. Often, this maximum acceptable service disruption duration is measured in seconds, but it could be hundreds of milliseconds or less. Service disruptions that are shorter than this maximum acceptable threshold can then be excluded from downtime calculations. Generally, capacity losses of less than 10% are excluded from availability calculations as well. TL 9000 sets specific counting rules by product category. Setting clear agreements on what outages will be counted in availability calculations and what events can be excluded is generally a good idea. 2.4.5

Normalization Factors

Another crucial factor in availability calculations is the so-called “normalization unit.” Whereas “system” and “network element” seem fairly straightforward in general, modern bladed and clustered architectures can be interpreted differently. For example, if a single chassis contains several pairs of blades, each hosting a different application, then should service availability be normalized against just the blades hosting a particular application or against the entire chassis? How should calculations change if a pair of chassis, either collocated or geographically redundant, is used? Since system availability modeling and predictions are often done assuming one or more “typical” configurations (rather than all possible, supported configurations), one should explicitly define this typical configuration(s) and consider what normalization factors are appropriate for modeling and predictions.

c02.qxd

2/8/2009

14

5:20 PM

Page 14

SYSTEM AVAILABILITY

2.4.6

Problem Severities

Different failures and problems generally have different severities. Often, problems are categorized into three severities: critical (sometimes called “severity 1”), major (sometimes called “severity 2”), and minor (sometimes called “severity 3”). TL 9000’s severity definitions are broadly consistent with those used by many, as follows. Critical Critical conditions are those that severely affect the primary functionality of the product and, because of the business impact to the customer, require nonstop immediate corrective action, regardless of time of day or day of the week, as viewed by a customer upon discussion with the organization. They include 1. Product inoperability (total or partial outage) 2. Reduction in capacity capability, that is, traffic/data handling capability, such that expected loads cannot be handled 3. Any loss of emergency capability (for example, emergency 911 calls) 4. Safety hazard or risk of security breach Major Major severity means that the product is usable, but a condition exists that seriously degrades the product operation, maintenance, administration, and so on, and requires attention during predefined standard hours to resolve the situation. The urgency is less than in critical situations because of a lesser immediate or impending effect on problem performance, customers, and the customer’s operation and revenue. Major problems include: 1. Reduction in the product’s capacity (but the product is still able to handle the expected load) 2. Any loss of administrative or maintenance visibility of the product and/or diagnostic capability 3. Repeated degradation of an essential component or function 4. Degradation of the product’s ability to provide any required notification of malfunction Minor Minor problems are other problems of a lesser severity than “critical” or “major,” such as conditions that result in little or no impairment of the function of the system.

c03.qxd

2/8/2009

5:21 PM

CHAPTER

Page 15

3

CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY

3.1

CONCEPT OF HIGHLY AVAILABLE SYSTEMS

All systems eventually experience failures because no large software product is ever “bug-free” and all hardware fails eventually. Whereas normal systems may crash or experience degraded service in response to these inevitable failures, highly available systems are designed so that no single failure should cause a loss of service. At the most basic level, this means that all critical hardware is redundant so that there are no single points of failure. Figure 3.1 presents the design principle of highly available systems. The infinite number of potential failures is logically represented on the left side as triggers or inputs to the highly available system. Highly available systems include a suite of failure detectors, typically both hardware mechanisms (e.g., parity detectors and hardware checksums) and software mechanisms (e.g., timers). When a failure detector triggers, then system logic must isolate the failure to a specific software module or hardware mechanism and activate an appropriate recovery scheme. Well-designed highavailability systems will feature several layers of failure detection and recovery so that if the initial recovery was unsuccessful, perhaps because the failure diagnosis was wrong, then the system will automatically escalate to a more effective recovery mechanism. For instance, if restarting a single process does not resolve an apparent software failure, then the system may automatically restart the processor hosting the failed process and, perhaps, evenPractical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.

15

c03.qxd

2/8/2009

16

5:21 PM

Page 16

CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY

Figure 3.1. Model of high-availability systems.

tually restart the software on the entire system. Undoubtedly, a human operator is ultimately responsible for any system, and if the system is not automatically recovering successfully or fast enough, the human operator will intervene and initiate manual recovery actions. Figure 3.2 illustrates the logical progression from failure to recovery of a highly available system. The smiley face on the left side of the figure represents normal system operation. The lightning bolt represents the inevitable occurrence of a stability-impacting failure. These major failures often fall into two broad categories: 1. Subacute failures that do not instantaneously impact system performance (shown as “Service Impaired”), such as memory or resource leaks, or “hanging” of required software processes or tasks. Obviously, a resource leak or hung/stuck process will eventually escalate to impact service if it is not corrected. 2. Acute failures that “instantaneously” and profoundly impact service (shown as “Service Impacted”), such as the catastrophic failure of a crucial hardware element like a processor or networking component. An acute failure will impact delivery of at least some primary functionality until the system recovers from

c03.qxd

2/8/2009

5:21 PM

Page 17

3.2

CONCEPTUAL MODEL OF SYSTEM AVAILABILITY

17

Some failures do not immediately impact service, like resource exhaustion (e.g., memory leaks) System Impaired

Normal Operation

Failure

Some failures immediately impact service, like hardware failures of crucial components Some failures cascade or eventually lead to service impact, like process failures when requested required resources are not available (e.g., uncorrected memory leaks eventually cause service impact)

Normal Operation

Service Impacted If service (or “primary functionality”) is impacted for longer than 15 seconds, then event is technically a TL 9000 Service Outage, and thus counts against SO metrics

Figure 3.2. Generic availability-state transition diagram.

the failure (often by switching to a redundant hardware unit or recovering failed software). Highly available systems should detect both acute and subacute failures as quickly as possible and automatically trigger proper recovery actions so that the duration and extent of any service impact is so short and small as to be imperceptible to most or all system users. Different applications with different customers may have different quantitative expectations as to how fast service must be restored following an acute failure for the interruption to be considered acceptable rather than a service outage. Obviously, systems should be architected and designed to automatically recover from failures in less than the customers’ maximum acceptable target time.

3.2

CONCEPTUAL MODEL OF SYSTEM AVAILABILITY

System availability is concerned with failures that produce system outages. Generally speaking, outages follow the high-level flow

c03.qxd

2/8/2009

18

5:21 PM

Page 18

CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY

shown in Figure 3.3. A service-impacting failure event occurs, and then the system automatically detects and correctly isolates the failure, raises a critical alarm, and successfully completes automatic recovery action(s). As most high availability systems feature some redundancy, a failure of a redundant component will generally be automatically recovered by switching the service load to a redundant element. Then service is restored, the alarm is cleared when the failed component is repaired or replaced, and the system returns to normal. However, the system could fail to automatically detect the failure fast enough, prompting manual failure detection; and/or the system could fail to indicate the failed unit, prompting manual diagnostics and fault isolation; and/or the system’s automatic recovery action could fail, prompting manual recovery actions. A failure of both a redundant element and automatic failure detection, isolation, and recovery mechanisms so that service is not automatically restored is sometimes called a “double failure.” Outages have three fundamental characteristics: 1. Attributable Cause—The primary failure that causes the outage. Flaws in diagnostics, procedures, customer’s actions, or other causes may slow outage resolution, but prolonging factors are not the attributable cause of the outage itself. 2. Outage Extent—Some percentage of the system is deemed to be unavailable. Operationally, outage extent is generally quan-

Figure 3.3. Typical outage flow.

c03.qxd

2/8/2009

5:21 PM

Page 19

3.3

FAILURES

19

tized as a single user (e.g., a line card on an access element) at a field-replaceable unit (FRU) level (e.g., “10 %”), or the entire system (e.g., “100 %”). Other capacity loss levels are certainly possible, depending on the system’s architecture, design, and deployed configuration. 3. Outage Duration—After the primary failure occurs, the event must be detected, isolated, and recovered before service is restored. In addition to the activities shown in Figure 3.3, logistical delays (such as delays acquiring a replacement part or delays scheduling an appropriately trained engineer to diagnose and/or repair a system) can add significant latency to outage durations. Chapter 4 reviews why outage durations may vary from customer to customer. The following sections provide additional details for the different pieces of the conceptual model.

3.3

FAILURES

Failures generally produce one or more “critical” (by TL 9000 definition) alarms. Because most systems have multiple layers of failure detection, a single failure can eventually be detected by multiple mechanisms, often leading to multiple alarms. On highavailability systems, many of these critical failures will be rapidly and automatically recovered. Service disruptions caused by many alarmed failures may be so brief or affect so little capacity that they may not even be recorded by the customer as an outage event. By analogy, if the lights in your home flicker but do not go out during a thunderstorm, most would agree there was an electricity service disruption, but very few would call that a power outage. Thus, it is useful to differentiate outage-inducing failures (which cause recorded outages) from other failures (which often raise alarms, but may not lead to recorded outages). Failures that produce system outages can generally be categorized by root cause into one of the following: 앫 (Product-Attributable) Hardware—for events resolved by replacing or repairing hardware 앫 (Product-Attributable) Software (Includes Firmware)—for software/firmware outages that are cleared by module, processor, board, or system reset/restart, power cycling, and so on

c03.qxd

2/8/2009

20

5:21 PM

Page 20

CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY

앫 Nonproduct-Attributable—for procedural mistakes (e.g., work not authorized, configuration/work errors), natural causes (e.g., lightning, fires, floods, power failures), and man-made causes (e.g., fiber cuts, security attacks) Readers with some familiarity with the subject may be wondering why “procedural error” is not listed as a source of outageinducing failures. It is true that procedural errors, both during the initial installation and during routine system operation, may result in outages. Many factors may contribute to these procedural outages, such as poor training, flawed documentation, limited technician experience, technician workload, customer policies, etc. Additionally, because procedural outages are a result of human interaction, and humans learn and teach others, the procedural outages for a given system typically decrease over time, often dramatically. For these reasons, installation, documentation, and training failures that result in downtime are beyond the scope of this document and will not be addressed. It is often insightful to add second-tier outage classifications identifying the functionality impacted by the outage, such as: 앫 Loss of Service—Primary end-user functionality is unavailable. This is, by definition, a “service outage.” 앫 Loss of Connectivity—Many systems require real-time communications with other elements to access required information such as user credentials or system, network, user or other configuration data. Inability to access this information may prevent service from being delivered to authorized users. Thus, an element could be capable of providing primary functionality, but be unable to authorize or configure that service because of connectivity problems with other elements. 앫 Loss of Redundancy—Many high-availability systems rely on redundancy either within the system or across a cluster of elements to enable high service availability. A failure of a standby or redundant element may not affect service, but it may create a simplex exposure situation, meaning that a second failure could result in service loss. 앫 Loss of Management Visibility—Some customers consider alarm visibility and management controllability of an element to be primary functionality, and, thus, consider loss of visibility to be an outage, albeit not a service outage. After all, if one loses alarm visibility to a network element, then one does not really know if it is providing service.

c03.qxd

2/8/2009

5:21 PM

Page 21

3.3

FAILURES

21

앫 Loss of Provisioning—A system may be fully operational but incapable of adding new subscribers or provisioning changes to existing subscribers. Beyond categorizing failures by root cause, one should also consider the extent of system capacity affected by the failure. On real, in-service elements the extent of failure is often resolved to the number of impacted users or other service units. 3.3.1

Hardware Failures

Hardware, such as components and electrical connections, fails for well-known physical reasons including wearing out and electrical or thermal overstress. Hardware failures are generally resolved by replacing the field-replaceable unit (FRU) containing the failed hardware component, or repairing a mechanical issue (e.g., tightening a loose mechanical connection). Firmware and software failures should be categorized separately because those failures can be resolved simply by restarting some or all of the processors on the affected element. Sometimes, outages are resolved by reseating a circuit pack; while it is possible that the reseating action clears a connector-related issue, the root cause of the failure is often software or firmware. Thus, product knowledge should be applied when classifying failures. Note that the failure mitigation, such as a rapid switchover to a redundant element, is different from the failure cause, such as hardware or software failure. For example, one cannot simply assume that just because service was switched from active element to standby element, that the hardware on the active element failed; a memory or resource leak could have triggered the switchover event and some software failures, like resource leaks, can be recovered by switching to a standby element. Hardware failure rate prediction is addressed by several standards. A detailed discussion of these standards is provided in Chapter 5, Section 5.5.1. Additional information on calculating hardware failure rates is provided in Chapter 7, Section 7.1. 3.3.2

Software Failures

Software and firmware failures are typically resolved by restarting a process, processor, circuit pack, or entire element. Also, sometimes running system diagnostics as part of troubleshooting may happen to clear a software/firmware failure because it may force a

c03.qxd

2/8/2009

22

5:21 PM

Page 22

CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY

device or software back to a known operational state. Note that operators typically “resolve” software or firmware failures via restart/reboot, rather than “fix” them by installing patched or upgraded software; eventually, software failures will be “fixed” by installing new software, changing system configuration or changing a procedure. Thus, whereas a residual defect may be in version X.01 of a product’s software (and fixed in version X.02), the duration of an outage resulting from this defect will last only from the failure event until the software is restarted (nominally minutes or seconds), not until version X.02 is installed (nominally weeks or months). Occasionally, software failures are addressed by correcting a system configuration error or changing system configuration or operational procedures to work around (avoid) a known failure mode. Software outages in the field are caused by residual software defects that often trigger one of multiple types of events: 1. Control Flow Error—The program does not take the correct flow of control path. Some examples are executing the “if” statement instead of the “else,” selecting the wrong case in a switch statement, or performing the wrong number of iterations in a loop. 2. Data Error—Due to a software fault, the data becomes corrupted (e.g., “wild write” into heap or stack). This type of error typically will not cause an outage until the data is used at some later point in time. 3. Interface/Interworking Error—Communications between two different components fail due to misalignment of inputs/outputs/behaviors across an interface. The interface in question could be between software objects or modules, software drivers and hardware, different network elements, different interpretations of protocols or data formats, and so on. 4. Configuration Error—The system becomes configured incorrectly due to a software fault or the system does not behave properly in a particular configuration. Examples of this type of error include incorrectly setting an IP address, specifying a resource index that is beyond the number of resources supported by the system, and so on. Predicting software failure rates is more difficult than estimating hardware failure rates because:

c03.qxd

2/8/2009

5:21 PM

Page 23

3.4

OUTAGE RESOLUTION

23

1. Impact of residual software defects varies. Some residual defects trigger failures with catastrophic results; others produce minor anomalies or are automatically recovered by the system. 2. Residual defects only trigger failures when they are executed. Since execution of software (binaries) is highly nonuniform (e.g., some parts get executed all the time, whereas some hardly ever get executed), there is wide variation in how often any particular defect might be executed. 3. Software failures are sometimes state-dependent. Modern protocols, hardware components, and applications often support a bewildering number of modes, states, variables, and commands, many of which interact in complex ways; some failures only occur when specific settings of modes, states, variables, and commands combine. For example, software failure rates for some systems may increase as the system runs longer between restarts; this phenomenon prompts many personal computer (PC) users to periodically perform prophylactic reboots.

3.4

OUTAGE RESOLUTION

At the highest level, outage recoveries can be classified into three categories, as follows. 3.4.1

Automatically Recovered

Many failures will be automatically recovered by the system by switching over to a redundant unit or restarting a software module. Automatically recovered outages often have duration of seconds or less, and generally have duration of less than 3 minutes. Although customers are likely to write a trouble ticket for automatically recovered hardware outages because the failed hardware element must be promptly replaced, customer policy may not require automatically recovered software outages to be recorded via trouble tickets. Thus, performance counters of automatic switchover events, module restarts, etc, may give more complete records of the frequency of automatically recovered software outages. Trouble tickets for automatically recovered outages are often recorded by customer staff as “recovered without intervention” or “recovered automatically.” Automatically recovered outages are said to be “covered” because the system successfully detected, iso-

c03.qxd

2/8/2009

24

5:21 PM

Page 24

CONCEPTUAL MODEL OF RELIABILITY AND AVAILABILITY

lated, and recovered the failure; in other words, the system was designed to “cover” that type of failure. 3.4.2

Manual, Emergency Recovered

Some failures will require manual recovery actions, such as to replace a nonredundant hardware unit. Alternately, manual recovery may have been required because: 앫 The system did not automatically detect the failure fast enough. 앫 The system did not correctly isolate the failure (e.g., the system indicted the wrong hardware unit, or misdiagnosed hardware failure as a software failure). 앫 The system did not support automatic recovery from this type of failure (e.g., processor, board, or system does not automatically restart following all software failures). 앫 The system’s automatic recovery action did not succeed (e.g., switchover failed). Although the maintenance staff should promptly diagnose any outage that is not automatically recovered, customer policy may direct that not all outages be fixed immediately on an “emergency” basis. Although large capacity-loss outages of core elements will typically be fixed immediately, recovery from smaller capacity-loss events may be postponed to a scheduled maintenance window. For example, if a single port on a multiport unit fails and repair will require nonaffected subscribers to be briefly out of service, a customer may opt to schedule the repair into an off-hours maintenance window to minimize overall subscriber impact. Manually recovered outages are generally trouble ticketed by the customer and are often recorded as “replaced,” “repaired,” “reseated circuit pack,” and so on. Outages manually recovered on an emergency basis are usually less than an hour for equipment in staffed locations or for software outages that can be resolved remotely. Generally, there will be minimal or no logistics delays in resolving emergency outages because spare hardware elements will be available on site and appropriately trained staff will be available on site or on call. 3.4.3

Manual, Nonemergency Recovered

Customers may opt to recover some outages during scheduled maintenance windows rather than on an emergency basis immedi-

c03.qxd

2/8/2009

5:21 PM

Page 25

3.4

OUTAGE RESOLUTION

25

Likewise, to minimize operating expense, customers may opt to postpone recovery from outages that occur in off hours until normal business hours to avoid overtime expenses. Also, logistical considerations will force some repairs to be scheduled; for example, failed equipment could be located in a private facility (e.g., a shopping mall or commercial building) that cannot be accessed for repair at all times, or certain spares may be stored off-site. Customers often mark trouble tickets that are addressed on a nonemergency basis as being "parked," "scheduled," or "planned." As a practical matter, equipment suppliers should not be accountable for excess downtime because outages were resolved on a nonemergency basis.

Interestingly, outage durations generally improve (i.e., get shorter) over time because:

• Maintenance staff becomes more efficient. As staff gains experience diagnosing, debugging, and resolving outages on equipment, they will get more efficient and, hence, outage durations will decrease.
• Automatic recovery mechanisms become more effective. As systems gain experience in the field, system software is generally improved to correctly detect, isolate, alarm, and automatically recover more and more types of failures, and, thus, some failures that would require manual recovery in early product releases will be automatically recovered in later releases. Likewise, some failure events that are initially detected by slower secondary or tertiary failure-detection mechanisms are likely to be detected by improved primary and secondary mechanisms, thus shortening failure detection for some events. Also, recovery procedures may be streamlined and improved, thus shortening outage durations. The overall effect is that in later releases, a larger portion of failure events is likely to be automatically recovered than in earlier releases, and the outage durations for at least some of those events are likely to be shorter.

The combined effect of improved automatic recovery mechanisms and customer experience is that outage durations generally shorten over time. Software failure rates of existing software also tend to decrease (i.e., improve) from release to release as more residual defects are found and fixed. The combined effect of these trends is a general growth in field availability as system software matures and is upgraded. This availability growth trend is shown in Figure 3.4.

Figure 3.4. Availability growth over releases. (The figure depicts three drivers: automatic recovery mechanisms become more effective, covering failures that previously required manual action; automatic recovery mechanisms shorten failure detection, isolation, and recovery times; and service provider learning and process/procedure improvements shorten manual outage detection and recovery latency.)

3.5 DOWNTIME BUDGETS

Just as financial budgets can be created and quantitatively managed, so can downtime budgets. The first challenge is to create a complete and correct set of downtime categories. TL 9000's Standard Outage Template System (http://tl9000.org/tl_sots.htm) offers an excellent starting point for a downtime budget. The italicized text in the outline below consists of direct quotes from the Standard Outage Template System documentation. TL 9000 begins by factoring downtime contributors into three broad categories based on the attributable party:

1. Customer Attributable—Outages attributable primarily to actions of the customer, including:
• Procedural—"Outages due to a procedural error or action by an employee of the customer or service provider." "Actions" include decisions by a customer not to accept available redundancy offered by the product supplier.


• Power Failure, Battery or Generator—"[power failures] from the building entry into the element."
• Internal Environment—"Outages due to internal environmental conditions that exceed the design limitations of the Vendor system's technical specifications."
• Traffic Overload—"Outages due to high traffic or processor load that exceeds the capacity of a properly designed and engineered system."
• Planned Event (customer-attributable)—"Planned events not covered by other categories, e.g. equipment moves but not corrective actions."

2. Product Attributable—Outages attributable primarily to design and implementation of the product itself, or actions of the supplier in support of installation, configuration, or operation of that product, including:
• Hardware Failure—"Outages due to a random hardware or component failure not related to design (MTBF)."
• Design, Hardware—"Outages due to a design deficiency or error in the system hardware."
• Design, Software—"Outages due to faulty or ineffective software design."
• Procedural—"Outages due to a procedural error or action by an employee or agent of the system or equipment supplier."
• Planned Event—"Scheduled event attributable to the supplier that does not fit into one of the other outage classifications."

3. Third-Party Attributable—Outages attributable primarily to actions of others, including:
• Facility Related—"Outages due to the loss of [communications] facilities that isolate a network node from the remainder of the communications network."
• Power Failure, Commercial—"Outages due to power failures external to the equipment, from the building entry out."
• External Environment—"Outages due to external environmental conditions that exceed the design limitations of the Vendor system's technical specifications. Includes natural disasters, vandalism, vehicular accidents, fire, and so on."

Focusing only on product-attributable downtime allows one to use a simple downtime budget with three major categories:


1. Hardware—Covers hardware-triggered downtime. In simplex systems, hardware failures are often a substantial contributor to downtime; in duplex systems, hardware-triggered downtime generally results from the latency for the system to switch over to redundant elements. Naturally, hardware-triggered downtime is highly dependent on the system's ability to automatically detect hardware failures and rapidly switch over to redundant elements; one hardware failure that requires manual intervention to detect, isolate, and/or recover will probably accrue more product-attributable downtime than many automatically recovered hardware failures.

2. Software—Covers failures triggered by poor software or system design, or by activation of residual software defects. As with hardware-triggered downtime, events that fail to be automatically detected, isolated, and successfully recovered by the system typically accrue much more downtime than automatically and successfully recovered events.

3. Planned and Procedural—Covers downtime associated with both successful and unsuccessful software upgrades, updates, retrofits, and patch installation, as well as hardware growth. Downtime attributed to poorly written, misleading, or wrong product documentation and maintenance procedures can be included in this category.

One can, of course, use different taxonomies for downtime, or resolve downtime into more, smaller categories. For example, a software downtime budget could be split into downtime for the software application and downtime for the software platform; hardware downtime could be budgeted across the major hardware elements.

Since "five 9s" or 99.999% service availability maps to 5.26 downtime minutes per system per year, a "five 9s downtime budget" must allocate that downtime across the selected downtime categories. The downtime allocation will vary based on the system's redundancy and recovery architecture, complexity of the hardware, maturity of the software, training and experience of the support organization, and other factors, including precise definitions and interpretations of the downtime budget categories themselves. Often a 20%:60%:20% downtime budget allocation across hardware/software/planned and procedural is a reasonable starting point for a mature system, or:


• Hardware—1 downtime minute per system per year
• Software—3.26 downtime minutes per system per year
• Planned and Procedural—1 downtime minute per system per year
• Total budget for product-attributable service downtime—5.26 downtime minutes per year, or 99.999% service availability

Having set a downtime budget, one can now estimate and predict the likely system performance compared to that budget. If the budget and prediction are misaligned, then one can adjust system architecture (e.g., add more redundancy, make failure detection faster and more effective, make automatic failure recovery faster and more reliable), improve software and hardware quality to reduce failure rates, increase robustness testing to assure fast and reliable operation of automatic failure detection and recovery mechanisms, and so on.
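As a quick arithmetic check on such a budget, the short Python sketch below converts an availability target into annual downtime minutes and splits it across the three categories using the 20%:60%:20% allocation suggested above; the function and variable names are ours, not part of any standard tool.

```python
MINUTES_PER_YEAR = 525_960  # 365.25 days per year

def downtime_budget(availability, split=(0.20, 0.60, 0.20)):
    """Convert an availability target into annual downtime minutes and
    allocate it across hardware, software, and planned/procedural."""
    total = (1.0 - availability) * MINUTES_PER_YEAR
    hardware, software, planned = (total * share for share in split)
    return total, hardware, software, planned

total, hardware, software, planned = downtime_budget(0.99999)
print(f"Total:                {total:.2f} minutes/system/year")  # ~5.26
print(f"  Hardware:           {hardware:.2f}")                   # ~1.05
print(f"  Software:           {software:.2f}")                   # ~3.16
print(f"  Planned/procedural: {planned:.2f}")                    # ~1.05
```

Note that the budget above rounds the hardware and planned/procedural allocations to a whole minute and assigns the remainder (3.26 minutes) to software; a strictly proportional 20%:60%:20% split yields roughly 1.05/3.16/1.05 minutes instead.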

CHAPTER 4

WHY AVAILABILITY VARIES BETWEEN CUSTOMERS

Question: Can a product with identical failures (rates and events) have different perceived or measured availability for different customers? Answer: Yes, because customers differ both on what events they record as outages and on how long it takes them to resolve those events. The factors that cause variations in what events are reported and how long those events take to be resolved are detailed in this chapter. These factors also contribute to why observed availability varies from predicted availability. Note that customers using the same product in different configurations, leveraging different product features, and/or using those features in different ways may observe different failure rates; those variations in operational profiles are not considered in this chapter.

4.1 CAUSES OF VARIATION IN OUTAGE EVENT REPORTING

There are several causes of the variation in how customers report outage events. These include:

• Definition of "primary functionality"
• How scheduled events are treated


• Customer's policies and procedures
• Compensation policies

The following sections elaborate on these different causes.

4.1.1 Definition of "Primary Functionality"

Only failures that cause loss of "primary functionality" are deemed by customers to be outages. Although total or profound loss of service is unquestionably an outage, failures causing other than total service loss may be viewed differently by different customers. For example, some customers consider alarm visibility and management controllability of a network element to be a primary functionality of the element, and thus deem any failure that causes them to lose management visibility to that element to be an outage event; other customers consider only disruptions of the end-user services offered by the element to be outages.

4.1.2 Treatment of Scheduled Events

Planned/scheduled events, such as applying software patches and upgrades, are typically more frequent than failure-caused outage events for high-reliability systems. Customers often have different processes in place to manage planned/scheduled events and unplanned/unscheduled events, which can include different databases to track, measure, and manage those events; they may even record different data for planned and unplanned outages. As customers may have different metrics and targets for planned/scheduled and unplanned/unscheduled events, the data may be categorized, analyzed, measured, and reported differently, thus causing differences in perceived availability.

4.1.3 Customer's Policies and Procedures

Customer staff, including staff in network operations centers (NOCs), are often very busy and, unless policy requires it, they may not record brief outage events. For instance, one customer may ticket all critical alarms, including those that automatically recover or are quickly resolved without replacing hardware, whereas another customer may only require tickets for alarms that are standing for more than a particular time (e.g., 15 minutes) or


require special action such as replacing hardware. Often, formal outage notification policies will be in place that dictate how quickly an executive must be notified after a high-value element has been out of service. Obviously, if executives are notified once or more of a product-attributed outage for any element, they are likely to be more cautious or to have a negative bias regarding the quality and reliability of that element.

4.1.4 Compensation Policies

Some customers tie aspects of the compensation for maintenance engineers to key performance indicators of quality and reliability. For instance, one sophisticated customer counts equipment "touches" by maintenance engineers that were not approved in advance (more "touches" is bad), on the hypothesis that well-maintained equipment should not have to be "touched" on an emergency (not approved in advance) basis. Some customers might tie some aspect(s) of trouble ticket resolution (e.g., resolution time) to compensation, or perhaps even tie service availability of selected high-value elements to compensation. As most metrics that are tied to compensation are both tracked carefully and actively managed by the affected staff, including availability-related metrics in compensation calculations is likely to impact the availability metrics themselves.

4.2 CAUSES OF VARIATION IN OUTAGE DURATION

Outage duration varies from customer to customer due to several factors:

• Efficiency of the customer staff in detecting and resolving outages
• How "parked" outages are treated
• Externally attributable causes

These factors are discussed in more detail in the following sections.

4.2.1 Outage Detection and Resolution Efficiency

Latency to detect, isolate, and resolve outage events is impacted by customer policies including:


• Training and Experience of Staff—Better trained, more experienced staff can diagnose failures and execute recovery actions more effectively and faster than inexperienced and poorly trained staff.
• Sparing Strategy (e.g., On-Site Spares)—Obviously, if a hardware element fails but the spare is not located on-site, then an additional logistics delay may be added to the outage duration.
• Operational Procedures (a.k.a. Method of Procedure, or MOPs) and Tools—Better operational procedures can both streamline execution times for activities such as debugging or recovering a failure and reduce the likelihood of errors that can prolong the outage. Likewise, better monitoring, management, and operational support tools can both accelerate and improve the accuracy of fault detection and isolation.
• Alarm Escalation and Clearance Policies (e.g., No Standing Alarms)—Some customers strive for no standing alarms (a.k.a. "clean boards"), whereas others tolerate standing alarms (a.k.a. "dirty boards"). Standing alarms may slow detection and isolation of major failure events, as maintenance engineers have to sift through stale alarms to identify the cause of the major failure.
• Support Contracts—If a customer has already purchased a support contract, then they may contact the supporting organization sooner for assistance in resolving a "hard" outage, thus shortening the outage duration. Without a support contract in place, the customer's staff may naturally spend more time trying to resolve the outage rather than working through the administrative process or approvals to engage an external support organization on the fly, thus potentially prolonging the outage.
• Management Metrics and Bonus Compensation Formulas—Many businesses use performance-based incentive bonuses to encourage desirable actions and behaviors of staff. For instance, if bonuses are tied to speed of outage resolution on selected types of network elements (e.g., high-value or high-impact elements), one would expect staff to preempt outage resolution of nonselected network elements to more rapidly restore the bonus-bearing elements to service. Likewise, if a customer has a policy that any outage affecting, say, 50 or more subscribers, and lasting for more than, say, 90 minutes must be reported to a customer executive, then one might expect staff to work a bit


faster on larger outages to avoid having to call an executive (perhaps in the middle of the night).
• Government Oversight (e.g., Mandatory FCC Outage Reporting Rules)—Governments have reporting rules for failures of some critical infrastructure elements, and affected customers will strive to avoid the expense and attention that comes from these filings. For example, the United States Federal Communications Commission (FCC) has established reporting rules for outage events impacting 900,000 user minutes and lasting 30 minutes or more. Naturally, customers strive to minimize the number of outage events they must report to the FCC and, thus, strive to resolve large events in less than 30 minutes.
• Sophistication/Expectations of Customer Base—Customers in different parts of the world have different expectations for service availability, and end users will respond to those local expectations. Thus, leading customers are likely to have more rigorous policies and procedures in place to resolve outages in markets where end users are more demanding than the more relaxed policies and procedures that might suffice in less-demanding markets.
• "Golden" Elements—Some network elements directly support critical services or infrastructure (e.g., E911 call centers, hospitals, airports), or critical end users; these elements are sometimes referred to as "golden." Given the increased importance of these golden elements, restoring service to any of these golden elements is likely to preempt other customer activities, including restoring service to nongolden elements. Thus, one would expect that outage durations on golden elements are likely to be somewhat shorter than those on ordinary (nongolden) network elements.

4.2.2 Treatment of "Parking" Duration

Manual recovery of minor outages is sometimes deferred to a maintenance window or some later time, rather than resolving the outage immediately. This is sometimes referred to as "parking" an outage. Although all customers will precisely track the outage start and outage resolution times, they may not record precisely when the decision was made to defer outage recovery and exactly when the deferred recovery actually began; thus, it is often hard to determine how much of the parked outage duration should be
attributed to the product versus how much should be attributed to the customer. Beyond simply deferring a well-defined action (e.g., a software restart) to a maintenance window, the delay could be necessitated by logistical or other real-world situations such as:

• Spare parts not being available locally
• Appropriately trained staff not being immediately available
• Recovery being preempted, postponed, or queued behind a higher-priority recovery action
• Delays in physically accessing equipment, perhaps because it is located on private property, in a secured facility, at a remote location, and so on

As minutes count in availability calculations of mission-critical and high-value systems, rounding parking times to 15-minute or 1-hour increments, or not explicitly tracking parking times, can significantly impact calculations of product-attributable downtime and availability. As a simplifying assumption, one might cap the maximum product-attributable downtime for outages to mitigate this uncertainty. Outage durations longer than the cap could then be allocated to the customer rather than the product.

4.2.3 Externally Attributable Outages and Factors

Occasionally, extraordinary events occur that can prolong outage resolution times, such as:

• Force majeure (e.g., hurricanes, fires, floods, malicious acts). TL 9000 classifies outages associated with these types of events as "Externally Attributable Outages."
• Unfortunate timing of failures (e.g., New Year's Eve, Christmas Day, national holidays, during software/system upgrades/retrofits)
• Worker strikes at the customer or logistics suppliers

CHAPTER 5

MODELING AVAILABILITY

"All models are wrong, some are useful."
—George Box, industrial statistician

An accurate, architecture-based model of system availability is useful to:

1. Assess the feasibility of meeting a particular availability target. One can predict the availability of a product from system test results, or even as early as the architecture phase, before a single circuit has been designed or line of code has been written. This is useful in selecting the hardware architecture (e.g., how much hardware redundancy is necessary in a system), determining the appropriate investment in reliability-improving features (e.g., how hard software has to work to rapidly detect, isolate, and recover from failures), setting hardware and software failure-rate targets, and, ultimately, setting availability expectations for a particular product release.

2. Understand where system downtime is likely to come from and how sensitive downtime is to various changes in system characteristics. Given an architecture-based availability model, it is easy to estimate the availability benefit of, say, reducing the hardware failure rate of a particular circuit pack; improving the effectiveness of automatic detection, isolation, and recovery from hardware or software failures; shortening system reboot time; and so on.

3. Respond to customer requests (e.g., requests for proposals, or RFPs) for system availability predictions, especially because modeling is recommended by standards (such as Telcordia's SR-TSY-001171).


In this chapter, we discuss the reasons for building availability models, cover reliability block diagrams (RBDs) and Markov models in detail, and then define the terms used in creating availability models. We then tie it all together with an example model for a hypothetical system called the “Widget System,” and then wrap it up with a set of pointers to additional information on how to create availability models.

5.1 OVERVIEW OF MODELING TECHNIQUES

There are many different kinds of reliability/availability models, including:

• Reliability block diagrams (RBDs) graphically depict simple redundancy strategies; these are the easiest models to construct.
• Markov models use state transition diagrams to model the time spent in each operational and nonoperational state, from which the probability of the system being operating and/or down can be calculated.
• Fault tree models construct a tree using individual fault types and identify the various combinations and sequences of faults that lead to failure.
• Minimal cut-set method. A cut set is a set of components which, when failed, cause failure of the system. This method identifies minimum cut sets for the network/system and evaluates the system reliability (or unreliability) by combining the minimum cut sets.
• A Petri net is a graphical modeling language that allows the user to represent the actual system functions. Blockages or failures can be studied while monitoring the performance and reliability levels. Models can be complicated and may be difficult to solve.
• Monte Carlo simulations generate pseudorandom input values and then calculate the system outputs, such as service availability. They are useful where no specific assumptions have to be made for input parameters, such as failure or repair rates.

Each of these techniques is reviewed below. Of these different model types, RBDs and Markov models are the two most


frequently used methods, and will be covered in the most detail. In practice, the two methods are often used together because they complement each other well. RBDs display the redundancy structure within the system. Markov models address more detailed reliability evaluations. The advantages and drawbacks of the other modeling techniques are discussed in their respective sections.

5.1.1 Reliability Block Diagrams

Reliability block diagrams (RBDs) are one of the most common forms of reliability models. They are easy to understand and easy to evaluate. They provide a clear graphical representation of the redundancy inherent within a system. RBDs can be used to describe both hardware and software components and topologies at a high level. Figure 5.1 shows several RBDs: one for a serial system, one for a parallel system, and one for a hybrid system. The basic idea behind an RBD is that the system is operational if there is at least one path from the input to the output (from the left to the right by convention). In the serial example, failure of any one of components A, B, or C will break the path, and, thus,

Figure 5.1. Generic reliability block diagrams.


the system will be unavailable. In the parallel system, both components D and E must fail for there to be no path from input to output. Finally, in the hybrid system, if either component A or C fails, the system is down, but both components D and E must fail before the system goes down. The Widget Example in Section 5.4 gives sample RBDs that represent a typical system. Despite their clarity and ease of use, RBDs do have some drawbacks. First, since each block can be in only one of two states (up or down), it is hard to represent some common configurations such as load sharing and redundant units that have to perform failovers before their mate can provide service. Markov models do not have these limitations. Additional information on RBDs is available in chapter 3 of reference [AT&T90]. Complicated systems can often be represented as a network in which system components are connected in series or parallel, are meshed, or a combination of these. Network reliability models address system reliability evaluation based on component reliability and the topologies through which the components are connected. For simple systems, the components can be connected in series, parallel, or a combination of both. System reliability can then be evaluated.

5.1.1.1 Series RBD Systems

A series system is one with all of its components connected in series; all must work for the system to be successful. If we assume that each component's reliability is given by Ri, then the system reliability is given in Equation 5.1, where the reliabilities R and Ri are expressed as percentages. Equation 5.1 also applies to availabilities.

R = \prod_i R_i    (5.1)

Because the reliability of a series system is the product of the individual component reliabilities, the system reliability is always worse than the reliability of the worst component; series systems are weaker than their weakest link.

5.1.1.2 Parallel RBD Systems

A parallel system is one with all of its components connected in parallel; only one needs to work for the system to be successful. If we assume that each component's reliability is given by Ri, then


the system reliability is given by Equation 5.2, where the reliabilities R and Ri are expressed as percentages. Equation 5.2 also applies to availabilities.

R = 1 - \prod_i (1 - R_i)    (5.2)

For parallel systems, the resultant system reliability is greater than the reliability of any individual component.

5.1.1.3 N-out-of-M RBD Systems

Another common type of system that may be analyzed using RBDs is the N-out-of-M system. In an N-out-of-M system, there are M components, of which N must be operational for the system to be operational. The block diagram itself looks like a parallel system diagram but, typically, there is some indication that the components are in an N-out-of-M configuration:

R_s = \sum_{i=0}^{m-n} \frac{m!}{i!(m-i)!} R^{m-i} (1-R)^i    (5.3)

Equation 5.3 models the system reliability based on the number of failed and working components, which can be analyzed mathematically by a binomial distribution (for details, see Section 2.1 in Appendix B—Reliability and Availability Theory). A classic example of an N-out-of-M system is a two-out-of-four set of power supplies. In this configuration two supplies are powered from one source and two more from a separate source, with all four outputs connected together to power the system. In this configuration, failure of either power source will not cause a system outage, and the failure of any two power supplies will still leave the system operational. Another example of an N-out-of-M system is a multicylinder internal combustion engine. After failure of some number of cylinders, the engine will no longer have enough power to provide service. For example, consider an eight-cylinder engine in a car or airplane. If the engine must have at least four cylinders running to continue to move the car or keep the airplane airborne, then ignition system failures such as spark plug failures, plug wire failures, and ignition coil failures (for engines with one coil per cylinder) could each be modeled as a four-out-of-eight system.
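To make Equations 5.1 through 5.3 concrete, here is a minimal Python sketch that evaluates series, parallel, and N-out-of-M configurations; the component reliability values passed in at the bottom are illustrative only and are not taken from the book's examples.

```python
from math import comb, prod

def series(reliabilities):
    """Equation 5.1: every component must work."""
    return prod(reliabilities)

def parallel(reliabilities):
    """Equation 5.2: at least one component must work."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

def n_out_of_m(n, m, r):
    """Equation 5.3: at least n of m identical components (each with
    reliability r) must work; i counts the failed components."""
    return sum(comb(m, i) * r ** (m - i) * (1.0 - r) ** i
               for i in range(m - n + 1))

print(series([0.999, 0.9995, 0.999]))   # worse than the weakest component
print(parallel([0.999, 0.999]))         # better than either component alone
print(n_out_of_m(2, 4, 0.99))           # two-out-of-four power supplies
```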


The methods derived from the basic series, parallel, and N-out-of-M models can be used to evaluate systems with a combination of the different types of configurations. More complex systems require more sophisticated methods to evaluate reliability of the entire system. Discussion of these methods is provided in subsequent sections.

RBDs help us to understand the system better and also enable us to decompose the system into pieces, which we can then analyze independently. For example, consider the three systems shown in Figure 5.1. In the RBD for the serial system, we see three separate components, A, B, and C. Because they are in series, we can model each independently (typically by using a Markov model, which will be discussed in a later section) and then add the downtimes for A, B, and C together to get the total downtime for the series system. This is essentially what Equation 5.1 says, but some people find it easier to relate to the summation of the individual downtimes than they do to the product of the availabilities given in Equation 5.1. The RBD for the parallel system of Figure 5.1 would typically be analyzed as a single entity, in this case an active/standby pair. That analysis would typically entail a Markov model. Finally, the hybrid system shown in Figure 5.1 would be done as three separate models, one for each of the series elements, in this case a model for component A, a model for the D and E pair of components, and a model for component C. The resultant downtimes from the three models would then be added to arrive at the downtime for the hybrid system, just as for the simple series system.

If availability were desired, then the resultant system downtime could easily be converted to availability. This would be done by subtracting the downtime from the amount of time in a year, and then dividing that result by a full year. For instance, if the model predicted 30 minutes of system downtime, we would subtract the 30 minutes from the 525,960 minutes in a year, leaving 525,930 minutes, and divide that by the number of minutes in the year (525,960), yielding a result of 99.9943%.
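The downtime-summation approach just described can be sketched in a few lines of Python. The per-element downtime figures below are hypothetical placeholders standing in for the outputs of the individual Markov models; they are chosen so the total reproduces the 30-minute example in the text.

```python
MINUTES_PER_YEAR = 525_960

# Hypothetical annual downtime (minutes) for each series element of the
# hybrid RBD: component A, the D/E redundant pair, and component C.
downtime_a = 12.0
downtime_de_pair = 3.0
downtime_c = 15.0

total_downtime = downtime_a + downtime_de_pair + downtime_c       # 30 minutes
availability = (MINUTES_PER_YEAR - total_downtime) / MINUTES_PER_YEAR
print(f"System availability: {availability:.4%}")                 # 99.9943%
```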

5.1.2 Markov Models

When evaluating the comprehensive system reliability/availability of a real system, the system structure, topology, and operating logic as well as the underlying probability distribution associated


with the components of the system need to be incorporated. RBDs and other simple network models are often insufficient to provide a comprehensive understanding of system availability and how to improve it. In particular, for repairable systems, they assume that the repair process is instantaneous or negligible compared with the operating time. This is an inherent restriction and additional techniques are required if this assumption is not valid. One very important technique that overcomes this problem and which has received considerable attention and use in the industry is known as the Markov approach or Markov modeling. Several texts [Feller68, Shooman68, Sandler63, Kemeny60, and Pukite98] are available on the subject of application of Markov chains to reliability analysis. Markov models model the random behavior of systems that vary discretely or continuously with respect to time and space. The Markov model approach can model the memoryless behavior of the system, that is, that the future random behaviors of the system are independent of all past states except the immediately preceding one. In addition, the process must be stationary, which means the behavior of the system must be the same at all points of time irrespective of the point in time being considered. Typically, system failure and the failure recovery process can be described by a probability distribution that is characterized by a constant failure or recovery rate, which implies that the probability of making a transition between two states remains constant at all points in time. This makes the Markov model approach straightforward for industrial practitioners to adapt when they model system failure and the failure recovery process. Appendix B documents widely used probability distributions in reliability engineering. Markov models are a bottom-up method that allows the analysis of complex systems and repair strategies. The method is based on the theory of Markov chains. This method represents system operations, failures, and repairs at specific points in time with state machines. The advantage of this method is that the system behavior can be analyzed thoroughly. It can incorporate details such as partial failures, capacity loss, and repair strategies. Sensitivity analysis of all the potential features to improve overall availability can be explored. [Trivedi02] provides a good introduction and summary to stochastic modeling techniques and tools that can be applied to computer and engineering applications. Markov models are relatively easy to solve, even for complex systems,


using commonly available tools such as a spreadsheet tool like Microsoft Excel. Markov models can be applied to model system, subsystem, and component availability. For complicated systems, a relatively standard procedure for evaluating the reliability/availability of a system is to decompose the system into its constituent components, and individually estimate the reliability/availability of each of these components. Finally, the component reliability/availability evaluation results can then be combined to estimate the reliability/availability of the complete system.

The foundation for Markov modeling is the state transition diagram. The state transition diagram (or state diagram, for short) describes the possible states of the system, the events that cause it to transition from one state to another, and the rates at which these transitions occur. The basic concepts of Markov modeling can be illustrated by the state diagram shown in Figure 5.2, which is the state transition diagram for a simplex system. The states in Figure 5.2 represent the operating mode that the system can be in, and the transitions between the states represent the rates of the transitions. State 1 represents the "active" state, in which the system is fully operational. State 2 represents the "down covered" state, in which a failure has been recognized by the system and recovery is initiated. State 3 represents the "down uncovered" state, in which the system has failed but the failure has not yet been recognized, so recovery actions have not yet been initiated.

Figure 5.2. Markov model for a simplex system. (State 1: active; State 2: down covered; State 3: down uncovered.)


By convention, failure rates are represented as λ, repair rates are represented as μ, and coverage factors are represented as C. This is a discrete Markov model since the system is stationary and the movement between states occurs in discrete steps. Consider the first time interval and assume that the system is initially in State 1, which is the state in which the system is operating normally. If the system fails and the failure is detected, the system moves into State 2 with rate Cλ; the system then transitions back to State 1 with rate μ_R after the repair is done. On the other hand, if the system fails and the failure is uncovered, then the system transitions from State 1 to State 3 with rate (1 - C)λ. After the failure is eventually detected, the system transitions from State 3 to State 2 with rate μ_SFD. After solving Equations 5.4 below, the steady-state probabilities of the system being in each state can be calculated. The downtime can be calculated by adding up the time spent in the down states (State 2 and State 3 in this example).

Now that we know how the system operates, how do we solve the model to determine the time spent in each state? Because we are interested in the long-term steady-state time in each state, we know that the input and output for each state must be equal. So, we can write an equation for each state that says the input minus the outputs is equal to zero. If we let Pi represent the probability of being in state i, then we get the three equations of Equation 5.4.

State 1 (active)—Normal operation:
\mu_R P_2 - C\lambda P_1 - (1 - C)\lambda P_1 = 0

State 2 (down covered)—Detected failure:
\mu_{SFD} P_3 + C\lambda P_1 - \mu_R P_2 = 0    (5.4)

State 3 (down uncovered)—Undetected failure:
(1 - C)\lambda P_1 - \mu_{SFD} P_3 = 0

When we try to solve these three equations for P1, P2, and P3, we discover we have only two independent equations (go ahead and convince yourself!) but three unknowns. The final equation


we require to solve this set of simultaneous equations is the one that shows that the sum of the probabilities must be 1:

P_1 + P_2 + P_3 = 1    (5.5)

With three independent equations in three unknowns, we may solve for the steady-state probability of being in each of the three states. That probability may then be multiplied by a time period to find out how much of that time period is spent in each state. For example, if the probability of being in state 1 is 2%, then in a year, the system will spend 2% of its time in state 1, or approximately 175 hours (8766 hours in a year × 2% ≅ 175 hours) in state 1. This method works for state transition diagrams of any size, although the larger the diagram, the more cumbersome the math becomes. Fortunately, it is easy to automate the math using computers, so system complexity is not an insurmountable barrier to good modeling. Matrix algebra and matrix techniques are typically used in solving the Markov models.

We will work through an example of the simplex model above using the parameter values given in Table 5.1. This says that the failure rate is 10,000 FITs (a FIT is a failure in 10^9 hours), which is a reasonable estimate for something like a server. It also says that 90% of the faults are automatically detected and alarmed, it takes 4 hours to repair the unit, and it takes an hour to detect that the unit has failed if the failure was uncovered (and, hence, unalarmed). Notice that all the times have been converted to rates and the rates are expressed as per hour.

TIP: We strongly recommend converting all rates to per hour before using them in a model. This will avoid erroneous results due to unit mismatch.

We will use the equations for states 2 and 3 above, along with the equation that says the probabilities of being in each state must sum to 1. We will rearrange the equations so that the coefficients

Table 5.1. Input parameters for modeling example

Parameter                              Symbol    Value      Units
Failure rate                           λ         1.00E-05   failures/hour
Coverage                               C         90.00%     %
Repair rate                            μ_R       0.25       per hour
Detection rate for uncovered faults    μ_SFD     1          per hour


are in order from P1 through P3. Doing that yields the following three equations:

C\lambda P_1 - \mu_R P_2 + \mu_{SFD} P_3 = 0
(1 - C)\lambda P_1 + 0 P_2 - \mu_{SFD} P_3 = 0    (5.6)
P_1 + P_2 + P_3 = 1

To solve this set of equations for each Pi we will define three matrices: A, P, and R. We define a matrix A that represents the coefficients from the equations such that, for the above example, we have the coefficient matrix shown in Equation 5.7.

A = \begin{bmatrix} C\lambda & -\mu_R & \mu_{SFD} \\ (1-C)\lambda & 0 & -\mu_{SFD} \\ 1 & 1 & 1 \end{bmatrix}    (5.7)

Using the equations for states 2 and 3 has an added advantage—it is very easy to fill in the values for the first two rows of matrix A. Table 5.2 shows how to easily fill in the values. Across the top we can write the "from" state number and along the rows we write the "to" state number, and then we fill in the rates from state to state. When we hit an entry where "from" and "to" are the same, we write everything that leaves that state, using a minus (-) sign to indicate that the rate is an outgoing rate. So, for example, in the second row of column 1, we enter the rate from state 1 to state 3. Referring back to the state transition diagram, we see that this is (1 - C)λ. Next, we define P as the vector of probabilities that we are in each state. Thus, in the above example, with three states, we have the probability vector,

P = \begin{bmatrix} P_1 \\ P_2 \\ P_3 \end{bmatrix}    (5.8)

Table 5.2. State transition information for example model

To state    From state 1    From state 2    From state 3
2           Cλ              -μ_R            μ_SFD
3           (1-C)λ          0               -μ_SFD


Finally, we define a vector R that represents the right-hand side of the above equations. All the equations have a right-hand side of zero, except the one that says the probabilities must sum to one. Thus, for our example, we get the results vector

R = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}    (5.9)

Then, using matrix notation, we can express the above set of equations as

A P = R    (5.10)

What we really want to know is the values for the elements of P. We already know the values for A and R, so solving for P we get

P = A^{-1} R    (5.11)

where A^{-1} is the inverse of the matrix A. Going back to our example, and filling in the actual numbers, we get

A = \begin{bmatrix} 0.000009 & -0.25 & 1 \\ 0.000001 & 0 & -1 \\ 1 & 1 & 1 \end{bmatrix}    (5.12)

Inverting A results in

A^{-1} = \begin{bmatrix} 3.99983601 & 4.999795 & 0.999959 \\ -3.99984 & -3.9998 & 4 \times 10^{-5} \\ 3.998 \times 10^{-6} & -1 & 1 \times 10^{-6} \end{bmatrix}    (5.13)

Multiplying R by A^{-1} yields our answer:

P = \begin{bmatrix} 0.999959 \\ 3.99984 \times 10^{-5} \\ 9.99959 \times 10^{-7} \end{bmatrix}    (5.14)

Thus, the probability of being in state 1 is 99.9959%, the probability of being in state 2 is 0.00399984%, and the probability


of being in state 3 is 0.0000999959%. At first glance, these appear to be very tiny probabilities for states 2 and 3. To get to something that is easier to relate to, we will convert these to the number of minutes per year spent in each state. This is easy to do; we simply multiply each probability by the number of minutes in a year. There are 525,600 minutes per nonleap year, or 525,960 if you consider each year as having 365.25 days to accommodate leap years. Using the 525,960 value, and multiplying by the probabilities shown above, we discover that the simplex system of our example spends the number of minutes/year in each of the three states shown in Table 5.3. States 2 and 3 are both down states. Thus, our example system is down for 21.57 minutes per year—the sum of the time spent in each of states 2 and 3. Therefore, the availability of this system is the probability of being in the active state (state 1), or 99.9959%. There are many available tools that can be used to solve for the probabilities, from spreadsheets to custom software designed specifically for availability analysis. By understanding the basics, each modeler is free to pick a tool that meets their budget and format needs. Virtually everyone in the business world has access to a spreadsheet, thus it is appropriate to mention how to perform the above calculations using a spreadsheet. The first step is to enter the array data A into a matrix in the spreadsheet. You can do this as mentioned above using the “from state” and “to state” as a guide. Then, at the bottom of the matrix add a row of all ones to cover the equation that states the sum of the probabilities must be 1. You will now have matrix A filled in. For our example it would look like Table 5.4. Next, we need to create the inverse of matrix A. To do this, we select a range of cells somewhere else in the spreadsheet that is the same size as matrix A (in this case it will be 3 by 3). Then select the matrix invert function and select the values from matrix A

Table 5.3. Example simplex system modeling results

State    Probability     Minutes/year
1        0.999959        525938.44
2        3.9998E-05      21.04
3        9.9996E-07      0.53


Table 5.4. Coefficient matrix in spreadsheet form

To state    From state 1    From state 2    From state 3
2           0.000009        -0.25           1
3           0.000001        0               -1
ΣPi = 1     1               1               1

as the input. If you are using Microsoft Excel, you can do this by highlighting a new 3 by 3 area, selecting the “MINVERSE” function from the list of functions shown under the “Insert|Function...” menu item, and clicking OK. You will now be presented with a dialog that asks you to identify the input matrix. You can enter the range of cells for matrix A directly into the dialog (such as G24:I26) or click the button on the right end of the edit box and then highlight matrix A. Once you have selected matrix A, do not click OK in the dialog. You really want the entire matrix inversion, so hit Ctrl-shift-enter (hold down the control and shift keys, then hit the enter key while holding them down). This will populate the entire inverse of matrix A. You should have something that looks like Table 5.5. Next, we need to enter the vector for the right-hand side of the equations. This must be a columnar vector to keep the values properly associated with the equations they match. All entries except the last row will be 0; the last entry will be a 1. For our example system, vector R should look like Table 5.6. The next step is to perform the matrix multiplication to solve for the probabilities. We do this in a manner similar to how we did the matrix inversion. First, we select the output area. This should be a single column with one row for each state (three rows in our example). Next, we select the matrix multiplication function (MMULT in Microsoft Excel) and multiply the inverse of matrix A by vector R. In Microsoft Excel, we do the matrix multiplication by

Table 5.5. Inverted coefficient matrix in spreadsheet form 3.999836007 –3.99984001 3.99984E-06

4.9998 –3.9998 –1

0.99996 4E-05 1E-06


Table 5.6. Results vector in spreadsheet form

0
0
1

highlighting the output matrix (a single column with three rows for our example) and selecting “Insert|Function...” from the menu. This will open a dialog box asking us to select the two matrices to be multiplied. Select the inverse of the matrix A for the first matrix and select vector R as the second matrix. You can select them either by entering their cell designations directly or by clicking the buttons on the right side of the edit boxes and highlighting the matrix in the spreadsheet. Once both matrices have been selected, you need to hit ctrl-shift-enter as in the matrix inverse case. This will populate the probability matrix. Typically, we look at downtime on an annual basis, so normally the probabilities would be multiplied by the number of minutes in a year. Table 5.7 shows the result of the matrix multiplication in the middle column, with the state number in the first column and the probabilities multiplied by 525,960 minutes/year in the third column. This table was generated using Microsoft Excel exactly as described here. It shows that the example system would be down for about 21.5 minutes/year (the sum of the downtime incurred in states 2 and 3), and in normal operation the rest of the year. So far, we have described the basic Markov chain models and the transition probabilities in these models. The Markov approach implies that the behavior of the system must be stationary or homogenous and memoryless. This means that the Markov approach is applicable to those systems whose behavior can be described by a probability distribution that is characterized by a constant failure and recovery rate.

Table 5.7. Probability solution and downtime in spreadsheet form

State    Probability     Minutes/Year
1        0.999959        525938.44
2        3.9998E-05      21.04
3        9.9996E-07      0.53
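For readers who prefer a scripting language to a spreadsheet, the following Python/NumPy sketch solves the same simplex model directly; it should reproduce the probabilities and annual minutes in Tables 5.3 and 5.7 to within rounding. The variable names are ours.

```python
import numpy as np

MINUTES_PER_YEAR = 525_960

# Input parameters from Table 5.1 (all rates per hour)
lam = 1.0e-5      # failure rate (10,000 FITs)
cov = 0.90        # coverage
mu_r = 0.25       # repair rate (4-hour repair)
mu_sfd = 1.0      # detection rate for uncovered (silent) failures

# Coefficient matrix for the balance equations of states 2 and 3 plus the
# constraint that the probabilities sum to 1 (Equations 5.6 and 5.7).
A = np.array([
    [cov * lam,         -mu_r,  mu_sfd],
    [(1.0 - cov) * lam,   0.0, -mu_sfd],
    [1.0,                 1.0,   1.0  ],
])
R = np.array([0.0, 0.0, 1.0])

P = np.linalg.solve(A, R)            # steady-state probabilities P1, P2, P3
minutes = P * MINUTES_PER_YEAR

for state, (p, m) in enumerate(zip(P, minutes), start=1):
    print(f"State {state}: probability {p:.6e}, {m:.2f} minutes/year")
print(f"Annual downtime: {minutes[1] + minutes[2]:.2f} minutes/year")
```

Using numpy.linalg.solve avoids explicitly forming the matrix inverse, but the result is the same as the MINVERSE/MMULT procedure described above.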


Only if the failure and recovery rate is constant does the probability of making a transition between two states remain constant at all points of time. If the probability is a function of time or of the number of discrete steps, then the process is nonstationary and designated as non-Markovian. Appendix B documents widely used probability functions, among which the Poisson and exponential distributions have a constant failure rate. Reference [Crowder94] documents more details on probability models and how the statistical analysis is developed from those models.

5.1.3 Fault Tree Models

Fault tree analysis was originally developed in the 1960s by Bell Telephone Laboratories for use with the Minuteman Missile system. Since then it has been used extensively in both aerospace and safety applications. Fault tree analysis, like reliability block diagrams, represents the system graphically. The graphical representation is called a "fault tree diagram." Rather than connecting "components" as in an RBD, a fault tree diagram connects "events." Events are connected using logical constructs, such as AND and OR.

Figure 5.3 shows the fault tree diagram for the hybrid system shown in Figure 5.1. In the system of Figure 5.3, there are two simplex components, A and C, and a pair of redundant elements, D and E. The events depicted in the fault tree diagram are the failures of each of the components A, C, D, and E. These events are shown in the labeled boxes along the bottom of the diagram. The diagram clearly shows that the top-level event, which is system failure, results if A or C fail, or if both D and E fail. Other gates besides the AND and OR gates shown in Figure 5.3 are also possible, including a Voting-OR gate (i.e., N of M things must have faults before the system fails), a Priority AND (events must happen in a specific sequence to propagate a failure), and a few others.

In a fault tree diagram, the path from an event to a fault condition is a "cut set." In more complex systems, a single event may be represented as an input multiple times. The shortest path from a given event to a system fault is called a "minimal cut set." Cut sets are discussed further in Section 5.1.4. One of the drawbacks to the use of fault tree analysis is the limited ability to incorporate the concepts of repair and maintenance, and the time-varying distributions associated with them.


Figure 5.3. Fault tree diagram for the hybrid system of Figure 5.1. (The top event, System Failure, is the logical OR of the failure of A, the failure of C, and the output of an AND gate over the failures of D and E.)

Markov models are capable of including repair and maintenance actions, and are, thus, preferred for redundant systems, and especially for systems that are repaired while still providing service. It is generally fairly straightforward to convert from a fault tree diagram to an RBD (with a few exceptions), but the converse is not always true. Fault trees work in the "failure space," or deal with the events that cause a system failure, whereas RBDs work in the "success space," or the situations in which the system is operating correctly. Additional information on fault tree analysis can be found in both the U.S. Nuclear Regulatory Commission Fault Tree Handbook [NUREG-0492] and NASA's Fault Tree Analysis with Aerospace Applications [NASA2002], which is an update to NUREG-0492 with an aerospace focus.

5.1.4 Minimal Cut-Set Method

A cut set is defined as a set of system components that, when failed, causes failure of the system. In other words, it is a set of components that, when cut from the system, results in a system failure. A minimal cut set is a set of system components that, when failed, causes failure of the system but when any one component of the set has not failed, does not cause system failure (in


this sense the components in each cut set are put in parallel). In this method, minimum cut sets are identified for the network/system and the system reliability (or unreliability) is evaluated by combining the minimum cut sets (the minimum cut sets are then drawn in series). However, the concept of series from Section 5.1.1.1 cannot be used. Assume C_i is the ith cut set. The unreliability of the system is then given by F_s = P(C_1 \cup C_2 \cup \ldots \cup C_n). The reliability of the system is complementary to the unreliability.

To clarify this concept, consider the fault tree diagram shown in Figure 5.3. The cut sets include {A}, {C}, {A, C}, {D, E}, {A, E}, {A, D}, {C, D}, {C, E}, {C, D, E}, {A, D, E}, and {A, C, D, E}, because all these combinations of failures in components A, C, D, and E will result in a system failure. However, the minimal cut sets are {A}, {C}, and {D, E}; in each of these minimal cut sets, removing any one event leaves a set that no longer causes system failure. The unreliability of this example is then

F_S = P(C_A \cup C_C \cup C_{DE})
    = P(C_A) + P(C_C) + P(C_{DE}) - P(C_A \cap C_C) - P(C_A \cap C_{DE}) - P(C_C \cap C_{DE}) + P(C_A \cap C_C \cap C_{DE})

where

P(C_A) = 1 - R_A
P(C_C) = 1 - R_C
P(C_{DE}) = (1 - R_D)(1 - R_E)
P(C_A \cap C_C) = P(C_A)P(C_C) = (1 - R_A)(1 - R_C)
P(C_A \cap C_{DE}) = P(C_A)P(C_{DE}) = (1 - R_A)(1 - R_D)(1 - R_E)
P(C_C \cap C_{DE}) = P(C_C)P(C_{DE}) = (1 - R_C)(1 - R_D)(1 - R_E)
P(C_A \cap C_C \cap C_{DE}) = P(C_A)P(C_C)P(C_{DE}) = (1 - R_A)(1 - R_C)(1 - R_D)(1 - R_E)

and R_i is the reliability of component i. Therefore,

F_S = (1 - R_A) + (1 - R_C) + (1 - R_D)(1 - R_E) - (1 - R_A)(1 - R_C) - (1 - R_A)(1 - R_D)(1 - R_E) - (1 - R_C)(1 - R_D)(1 - R_E) + (1 - R_A)(1 - R_C)(1 - R_D)(1 - R_E)
    = 1 - R_A R_C R_D - R_A R_C R_E + R_A R_C R_D R_E


From the RBD method, we can derive the unreliability function as follows:

F_S = 1 - R_S = 1 - R_A R_C [1 - (1 - R_D)(1 - R_E)] = 1 - R_A R_C R_D - R_A R_C R_E + R_A R_C R_D R_E

This result derived from the cut-set method agrees with the system unreliability calculated from the RBD method. Cut sets can be useful to identify the events that cause a system to become inoperative. For example, any event that occurs in every minimal cut set is clearly an event whose impact should be mitigated, whether through redundancy or other mechanisms. The mathematics behind cut sets may become cumbersome, and will be different for every system. Additionally, cut-set analysis does not easily lend itself to the inclusion of time-dependent distributions or the inclusion of maintenance or repairs.
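The inclusion–exclusion calculation above is easy to automate. The Python sketch below does so for the minimal cut sets {A}, {C}, and {D, E}, using illustrative component reliabilities (they are not values from the book), and cross-checks the answer against the RBD expression.

```python
from itertools import combinations
from math import prod

# Illustrative component reliabilities
R = {"A": 0.999, "C": 0.998, "D": 0.99, "E": 0.99}

# Minimal cut sets of the hybrid system in Figures 5.1 and 5.3
cut_sets = [frozenset("A"), frozenset("C"), frozenset("DE")]

def p_all_failed(components):
    """Probability that every component in the set has failed."""
    return prod(1.0 - R[c] for c in components)

# Inclusion-exclusion over the union of the cut-set events; the joint
# event "cut sets Ci and Cj both occur" means every component in the
# union of the two component sets has failed.
unreliability = 0.0
for k in range(1, len(cut_sets) + 1):
    for combo in combinations(cut_sets, k):
        union = frozenset().union(*combo)
        unreliability += (-1.0) ** (k + 1) * p_all_failed(union)

# RBD cross-check: Fs = 1 - Ra*Rc*[1 - (1 - Rd)(1 - Re)]
rbd = 1.0 - R["A"] * R["C"] * (1.0 - (1.0 - R["D"]) * (1.0 - R["E"]))
print(unreliability, rbd)   # the two values agree
```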

5.1.5 Petri-Net Models

Invented in 1962 by Carl A. Petri, the Petri-net method (also known as a place/transition net or P/T net) is a bottom-up method that utilizes a symbolic language. A Petri-net structure consists of places, transitions, and directed arcs. The Petri-net graph allows the user to represent the actual system functions and use markings to assign tokens to the net. Blockages or failures can be studied while monitoring the performance and reliability levels. A Petri net consists of four basic parts that allow construction of a Petri-net graph:

1. A set of places, represented by circles
2. A set of transitions, represented by vertical bars
3. One or more input functions
4. One or more output functions

Figure 5.4 shows a simple example of using a Petri net to describe the failure and recovery mechanism of a simplex system or component. Places and transitions are shown in Figure 5.4. The input functions in this example are the assumptions about the failure


Figure 5.4. A Petri-net example.

and recovery transitions (in this case, exponential) and the parameter values for these distributions. The output function is the availability result, in particular, availability as a function of the failure and recovery distributions. Places contain tokens, represented by dots. Tokens can be considered resources. The Petri-net marking is defined by the number of tokens at each place, which is also designated as the state of the Petri net. Petri nets are a promising tool for describing systems that are characterized as being concurrent, asynchronous, distributed, parallel, nondeterministic, and/or stochastic. There are several variations of Petri nets used in reliability analysis. A stochastic Petri net (SPN) is obtained with the association of a firing time with each transition. The firing of a transition causes a change of state. A generalized stochastic Petri net (GSPN) allows timed transitions with zero firing times and exponentially distributed firing times. The extended stochastic Petri net (ESPN) allows the firing times to belong to an arbitrary statistical distribution. In a timed Petri net, time values are associated with transitions. Tokens reside in places and control the execution of the transitions of the Petri net. Timed Petri nets are used in studying performance and reliability issues of complex systems, which include finding the expected delay in a complex set of actions, average throughput capacities of parallel computers, or the average failure rate for fault-tolerant designs.


Petri nets can be combined with Markov models to analyze the stochastic processes of failure and recovery, although, in general, they are more difficult to solve than Markov models.

5.1.6 Monte Carlo Simulation

All the modeling methods we have discussed so far are analytical methods. A calculation method is analytical when the results are computed by solving a formula or set of formulas. However, analytical methods are not always possible or practical. Numerical methods, which are an alternative to analytical methods, are sometimes used to evaluate the reliability and availability of complex systems. One commonly used numerical method is Monte Carlo simulation. Monte Carlo simulation uses random numbers to represent the various system inputs, such as failure rates, repair rates, and so on. One advantage to using the Monte Carlo method is that the inputs can utilize distributions that make analytical methods difficult or even impossible to solve.

In reliability engineering, Monte Carlo simulations repeatedly evaluate the system reliability/availability using a logical model of the system such as an RBD. The input parameter values are regenerated randomly prior to each analysis. While the parameters are regenerated each time, each parameter is constrained by the distribution function specified for that parameter. Monte Carlo analysis does not require the solution of a large or complex set of equations; all that is required is a logical model of the system and the specification of the input parameter distributions.

To help explain how Monte Carlo analysis works, we will assess the availability of the system shown in the RBD in Figure 5.5. This system is a very simple simplex system consisting of a DC motor.

Figure 5.5. A Monte Carlo example.

c05.qxd

2/8/2009

58

5:39 PM

Page 58

MODELING AVAILABILITY

acteristic life (␪) of 1000 hours and a shape factor (␤) of 2.0. Further, let us assume that the repair times are represented by a lognormal distribution with a shape parameter (␮) of 5.0 and a scale parameter (␴) of 2.0. Using the above RBD and the failure and repair distribution information, we can run a Monte Carlo analysis of this system. If we perform a Monte Carlo analysis with 1000 iterations, starting with a random number seed of 1, we obtain a system availability of 73.1%. What happens if we rerun the analysis, but start with a random number seed of 2? In that case, we will get a system availability of 72.6%, which is a difference of half a percent. If we continue to experiment and reanalyze using 10,000,000 iterations, we get a system availability of 73.5914% using 1 as a seed, and 73.6025% using 2 as the seed. Now the difference is 0.0111%, a much smaller value. As one would expect, increasing the number of iterations increases the accuracy of the results. The trade-off is that additional iterations require additional compute time.
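For readers who want to experiment with this approach, the following is a minimal Monte Carlo sketch (in Python with NumPy) of a single repairable unit with Weibull times to failure and lognormal repair times. It is not the tool used to produce the figures above, and the reading of μ and σ as the parameters of the underlying normal distribution is an assumption, so its results will not reproduce the 73.1% value exactly.

import numpy as np

def simulate_availability(theta, beta, mu, sigma, n_cycles, seed):
    """Estimate steady-state availability of a simplex repairable unit.

    theta, beta: Weibull characteristic life (hours) and shape factor.
    mu, sigma:   lognormal repair-time parameters (underlying normal mean/std).
    """
    rng = np.random.default_rng(seed)
    uptimes = theta * rng.weibull(beta, size=n_cycles)               # hours up per cycle
    downtimes = rng.lognormal(mean=mu, sigma=sigma, size=n_cycles)   # hours down per cycle
    return uptimes.sum() / (uptimes.sum() + downtimes.sum())

for seed in (1, 2):
    a = simulate_availability(theta=1000.0, beta=2.0, mu=5.0, sigma=2.0,
                              n_cycles=100_000, seed=seed)
    print(f"seed {seed}: availability = {a:.4%}")

As in the example above, increasing n_cycles narrows the spread between runs started from different seeds.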

5.2 MODELING DEFINITIONS

The following definitions apply to the models and systems described in this document. They are aligned with the customer's view of system availability (described in Chapter 3, Section 3.3) and the TL 9000 Measurements Handbook.

5.2.1 Downtime and Availability-Related Definitions

The following sections define the modeling outputs. In other words, these terms define the quantities whose values the models seek to determine. Typically, these terms are normalized to a single system and express the annual value the average single system will experience.

5.2.1.1 Primary Functionality

Downtime and availability are defined in terms of the primary functionality [TL 9000 Measurement Handbook] of the system. Typically, the primary functionality of a system is to process transactions (or, perhaps more accurately, the ability to generate revenue for the customer).

Sometimes, it is appropriate to redefine the primary functionality of a particular system to understand the downtime associated with different functionalities. For example, a customer may require 99.999% transaction-processing availability and 99.995% management visibility from a single system. In this case, two models could be constructed: one using transaction processing as the defined primary functionality, and the other using management visibility as the defined primary functionality.

5.2.1.2 Downtime

Downtime is the amount of time that the primary functionality of the system is unavailable. Downtime is typically expressed in minutes per year per system. Unless specifically stated otherwise, downtime includes both total and prorated partial loss of primary functionality. Occasionally, it is of interest to exclude partial outages (i.e., to consider total capacity loss events only). In these cases, it should be clearly stated that the downtime consists of total outages only.

5.2.1.3 Unavailability

Unavailability is the percentage of time the primary functionality of the system is unavailable. Mathematically, this is the downtime divided by the in-service time. For example, if a system experienced 5 minutes of downtime in a year, then unavailability is 5/525,960 = 0.00095%. (525,960 is the average number of minutes in a year, including leap years.)

5.2.1.4 Availability

Availability is the percentage of time the primary functionality of the system is available. This is 100% minus unavailability, or 100% minus the downtime divided by the in-service time. Because downtime includes prorated partial outages, this is the same as the customer's view of availability described in Chapter 2, Section 2.3.
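A small helper makes these downtime/availability conversions concrete; it is a generic sketch, not part of TL 9000 or of the models in this book.

MINUTES_PER_YEAR = 525_960  # average minutes per year, including leap years

def availability_pct(downtime_min_per_year):
    """Availability (%) implied by prorated annual downtime in minutes."""
    return 100.0 * (1.0 - downtime_min_per_year / MINUTES_PER_YEAR)

def downtime_budget(availability_target_pct):
    """Annual downtime budget (minutes) implied by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_target_pct / 100.0)

print(availability_pct(5))        # 99.99905%, matching the 0.00095% unavailability above
print(downtime_budget(99.999))    # about 5.26 minutes per year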

5.2.1.5 Product-Attributable Downtime

Product-attributable downtime is the downtime triggered by the system design, hardware, software, or other parts of the system. Earlier versions of the TL 9000 Measurement Handbook referred to product-attributable downtime as "supplier-attributable" downtime.

5.2.1.6 Customer-Attributable Downtime

Customer-attributable downtime is downtime that is primarily attributable to the customer's equipment, support activities, or policies. It is typically triggered by such things as procedural errors or the office environment (for example, power, grounding, temperature, humidity, or security problems). Historically, this has been referred to as "service-provider-attributable" downtime.

5.2.1.7 Total Versus Partial Outages and Partial Outage Prorating

The impact of an outage can be described by the system capacity lost due to the outage. Total outages typically refer to outages that cause 90% or more capacity loss of a given system, whereas outages that cause less than 90% capacity loss are typically referred to as partial outages. System downtime, hence, is calculated by weighting the duration of each outage by its capacity loss. For example, the system downtime due to a total outage of 10 minutes is 10 minutes, whereas the system downtime due to a partial outage that causes 50% of the system capacity to be lost for 10 minutes is counted as 5 minutes. The telecom standards [GR-929] and [GR-1929] document measurement definitions for total versus partial outages, and TL 9000 provides a detailed definition of total versus partial outages for all product categories.

Prorating partial outages has a side effect that may be counterintuitive to many readers: the downtime due to multiple sets of components is the same as the downtime for a single set. For example, consider a single pair of database servers in an active–standby configuration. Let us assume that this pair of servers has an annual downtime of 10 minutes per year (derived either from field performance or from modeling). What happens if we grow the system to include an additional pair of database servers so we can double the system capacity? If the second pair is identical to the first, then it too will be down 10 minutes per year. Now the system will see 20 minutes of outage each year, 10 minutes from the original database pair and 10 minutes from the pair we just added. But now the loss of a single database pair reduces system capacity by only 50%, so we discount the downtime for each by 50%, meaning we now have two separate outages that count as 5 minutes each, for a total annual downtime of 10 minutes. This is the same downtime we had with a single database pair! And no matter how many pairs we add, we will still have 10 minutes per year of downtime if we prorate the downtime based on capacity. This fact can be very handy when we have to build models of systems that can vary in size. The above example demonstrates that we do not need to build a model for each possible number of database server pairs; we build a model for the simplest case of a single pair and can reuse that downtime number for any system configuration that uses a different number of database pairs.
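The capacity-weighted bookkeeping can be written as a tiny helper (a hypothetical illustration, not a TL 9000 definition): each outage contributes its duration multiplied by the fraction of system capacity lost.

def prorated_downtime(outages):
    """Sum capacity-weighted downtime over (duration_minutes, capacity_loss_fraction) pairs."""
    return sum(duration * loss for duration, loss in outages)

# One database pair carrying all traffic: a 10-minute outage is a total outage.
print(prorated_downtime([(10, 1.0)]))             # 10.0 minutes per year

# Two pairs sharing the load: each pair's 10-minute outage loses only 50% of capacity.
print(prorated_downtime([(10, 0.5), (10, 0.5)]))  # still 10.0 minutes per year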

5.2.1.8 Counting and Exclusion Rules

Counting rules are the rules that determine which outages are included in the system downtime and which outages (if any) may be excluded. Typically, these rules come from the purchasers of the system (or possibly a group of purchasers), because they understand the financial implications of the different types of outages. For example, for telecom equipment, TL 9000 specifies the counting rules based on the specific category of equipment. In TL 9000, most equipment categories may exclude outages of less than 15 seconds and outages that affect less than 10% of the system capacity. The counting rules clearly specify which problems or outages are considered too small to count. Because there is a cost associated with counting outages (reports have to be created and tracked), there is a crossover point at which the cost of reporting an outage exceeds the actual revenue lost to the outage. The counting rules (or exclusion rules) define this point in a clear manner and make outage counting a much more tractable problem.

5.2.2 Unplanned Downtime Modeling Parameter Definitions

The definitions in this chapter apply to the input parameters used to model unplanned downtime. They are subdivided into four major categories:

1. Failure rates
2. Recovery times
3. Coverage
4. Failovers

The techniques for estimating each of these parameters from field, lab, and design/architecture data are described in Chapters 6, 7, and 8, respectively.

Additionally, the nominal ranges for these parameters are provided in Chapters 7 and 8. System availability is more sensitive to changes in some parameters than in others. We label the parameters to which availability is more sensitive influential parameters. The sections that follow specify which parameters are influential and which are less so.

5.2.2.1 Failure Rate Definitions

Each of the different components that make up the system can have its own failure rate. Within the models, these failure rates are usually expressed in failures per hour. The failure rates may be provided in many different forms, such as FIT rate (failures in 10^9 hours), mean time between failure (MTBF) in hours or years, and failures per year. Each of these different forms of failure rate must be converted to the same units (preferably failures per hour) prior to their use in the models; see the conversion sketch later in this subsection.

5.2.2.1.1 Hardware Failure Rate

Hardware failure rate is the steady-state rate of hardware incidents that require hardware maintenance actions (typically FRU replacement). Typically, the hardware failure rate includes both service-affecting and non-service-affecting hardware faults. This is typically an influential parameter.

Hardware failure rates vary over the hardware's lifetime. There are often some early-life failures when components with manufacturing defects or other weaknesses fail; this is often referred to as "infant mortality." This infant mortality period may last for several months. After these early-life failures have occurred, the hardware failure rate stabilizes to a (relatively) steady failure rate. Eventually, the hardware reaches the end of its designed service life and wear-out failures begin to increase the hardware failure rate. This canonical hardware failure rate as a function of time is often referred to as the "bathtub curve" and is shown in Figure 5.6.

Hardware failure rate predictions and calculations (for example, TL 9000's Yearly Return Rate or YRR) are for the so-called constant-failure-rate period, the bottom of the "bathtub." This rate is depicted as "FR" in Figure 5.6. As discussed earlier, the exponential distribution has a memoryless property and a constant failure (or hazard) rate (see Appendix B for the mathematical details). Hence, it has been widely used to describe failure phenomena, in particular hardware failure processes.
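As a convenience, the hypothetical helpers below normalize the common failure-rate forms mentioned above to failures per hour; the function names and example numbers are illustrative only.

HOURS_PER_YEAR = 8766  # 365.25 days

def rate_from_fit(fit):
    """FIT is failures per 10**9 device-hours."""
    return fit / 1e9

def rate_from_mtbf_hours(mtbf_hours):
    """Under a constant-rate assumption, the failure rate is the reciprocal of MTBF."""
    return 1.0 / mtbf_hours

def rate_from_failures_per_year(failures_per_year):
    return failures_per_year / HOURS_PER_YEAR

# The same hypothetical board quoted three ways (all about 2e-6 failures/hour):
print(rate_from_mtbf_hours(500_000))
print(rate_from_fit(2000))
print(rate_from_failures_per_year(0.0175))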

Figure 5.6. Bathtub curve.

After some period of time, the failure rate begins to increase due to wear-out mechanisms. This signals the end of service life for the component. There is no concrete definition of exactly when end of service life occurs; values from 125% to 200% of the steady-state failure rate are frequently used, but the exact definition may vary from component to component and manufacturer to manufacturer. End of service life can vary dramatically depending on the wear-out mechanisms of the individual component. Electromechanical components, such as fans or disk drives, tend to wear out more quickly than components like semiconductors with no moving parts.

Among the probability distributions (Appendix B), the Weibull distribution has a very important property: it has no single characteristic shape. In fact, if the values of the three parameters in the Weibull distribution are properly chosen, it can be shaped to represent any of the three phases of the bathtub curve. Hence, the Weibull distribution is the most widely used distribution for analyzing experimental failure data. This makes Weibull (along with a few other distributions such as the gamma and lognormal, which are discussed in Appendix B) a very important tool in experimental data analysis.
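To make the shape flexibility concrete, the short sketch below evaluates the two-parameter Weibull hazard rate h(t) = (β/θ)(t/θ)^(β−1) (the location parameter of the three-parameter form set to zero). A shape factor β < 1 gives the decreasing infant-mortality hazard, β = 1 the constant-rate bottom of the bathtub, and β > 1 the increasing wear-out hazard; the parameter values are illustrative.

def weibull_hazard(t, theta, beta):
    """Two-parameter Weibull hazard rate h(t) = (beta/theta) * (t/theta)**(beta - 1)."""
    return (beta / theta) * (t / theta) ** (beta - 1)

for beta, phase in [(0.5, "infant mortality"), (1.0, "constant rate"), (3.0, "wear-out")]:
    early, late = weibull_hazard(100.0, 1000.0, beta), weibull_hazard(2000.0, 1000.0, beta)
    trend = "decreasing" if early > late else "constant" if early == late else "increasing"
    print(f"beta = {beta}: {phase} phase, hazard is {trend} over time")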

5.2.2.1.2 Software Failure Rate

The software failure rate used in modeling is the rate of software incidents that require a module, process, or application restart, or a reboot, to recover. Module/process/application restart or reboot is analogous to "hardware maintenance action" in the hardware failure-rate definition. It should be noted that the software failure rate is not necessarily the same as the software defect rate. This is because there are many possible software defects that will not affect system availability. For example, software that generates a misspelled message or paints the screen the wrong color clearly has a defect, but these defects are not likely to result in a system outage. Software failure rate is typically an influential parameter. Chapter 7 and Chapter 8 discuss how to estimate software failure rates from test data and outage rates from field data, respectively.

5.2.2.2 Recovery Time Definitions

Note that whereas hardware repair time is fairly straightforward, there are many potential variables that factor into software recovery time, including the escalation strategy, the probability of success at each escalation level, and the time spent to execute each escalation level. The definitions here assume a three-tiered automatic software escalation strategy. The first tier detects failures within a single task or process and restarts the task or process. The recovery escalates to the second tier if restarting an individual task or process fails to restore primary functionality. At the second tier, the entire application is restarted. If this fails to restore primary functionality, escalation proceeds to the third tier, where a reboot occurs. If the reboot fails, then a manual recovery is required. Not all systems will map directly to the three-tiered approach described and defined here, but the concepts and principles will apply to most systems and can easily be modified to fit any specific system and software recovery strategy. The individual recovery time parameters are described in the following sections, and a sketch showing how they combine into an average recovery time follows those definitions. The detection times of covered failures are included in the models in this book.

5.2.2.2.1 Hardware FRU Repair Time

Hardware FRU repair time is the average amount of time required to repair a failed hardware FRU. This includes both the dispatch time and the actual repair time. This is an influential parameter for simplex systems, but is not very influential for redundant systems.

5.2.2.2.2 Covered Fault Detection Time

Detection time for alarmed and/or covered failures is the amount of time it takes to recognize that the system has failed (in a detected manner) before automatic recovery takes place. Although this time duration is typically very short, it is included in the models in this book.

5.2.2.2.3 Uncovered Fault Detection Time

Uncovered fault detection time is the amount of time it takes to recognize that the system has failed when the failure was not automatically detected. This often requires a technician to recognize that there is a problem (possibly via troubleshooting alarms on adjacent systems or because performance measures deviate from expectations). The value used for this parameter comes from analyzing field outage data. This parameter does not include the recovery time; it is just the time required for a person to detect that the system has failed. Uncovered fault detection time is typically an influential parameter.

5.2.2.2.4 Single-Process Restart Time

Single-process restart time is the amount of time required to automatically recognize that a process has failed and to restart it. This parameter applies to systems that monitor the individual processes required to provide primary functionality, and is the average time required to detect a failure and restart one of those processes. It also applies to systems that use software tasks instead of processes, if those tasks are monitored and restartable. This parameter can be somewhat influential.

5.2.2.2.5 Full-Application Restart Time

Full-application restart time is the amount of time required to fully initialize the application. A full application restart does not include a reboot or a restart of the operating system. Full-application restart time applies to systems in which a full application restart is one of the recovery levels, and it does not include restarting lower levels of software such as the operating system or platform software. This is typically not an influential parameter.

5.2.2.2.6 Reboot Time

Reboot time is the amount of time required to reboot and initialize an entire server, including the operating system and the application. This can be a somewhat influential parameter for simplex software systems, but is typically not very influential in redundant systems. There also tends to be a wide variation in this parameter, from fairly quick reboots for real-time operating systems to tens of minutes for non-real-time systems with large databases that need to be synchronized during reboot.

5.2.2.2.7 Single-Process Restart Success Probability

Single-process restart success probability is the probability that restarting a failed process will restore primary functionality on the server. This parameter affects the average software recovery time through the weighting it gives to the single-process restart time, but it is typically not an influential parameter.

5.2.2.2.8 Full-Application Restart Success Probability

Full-application restart success probability is the probability that restarting the full application will restore primary functionality on the server. This parameter affects the average software recovery time through the weighting it gives to the full-application restart time, but it is typically not an influential parameter.

5.2.2.2.9 Reboot Success Probability

Reboot success probability is the probability that rebooting the server will restore primary functionality on the server. This parameter affects the average software recovery time through the weighting it gives to the reboot time, and through the weighting it gives to a typically much slower manual software recovery via the unsuccessful percentage. However, reboot success probability is typically not an influential parameter because of the relatively low probability of needing a reboot (both the single-process restart and the full-application restart have to fail before a full reboot is necessary).
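As noted earlier, these restart times and success probabilities combine into an average software recovery time. The sketch below is one illustrative way to do the weighting under the three-tier escalation described above (each tier's time is spent whether or not it succeeds, and the next tier is reached only when the previous one fails); the formula and parameter values are our own illustration, not figures from the text.

def expected_recovery_minutes(t_proc, p_proc, t_app, p_app, t_reboot, p_reboot, t_manual):
    """Expected recovery time for process restart -> application restart -> reboot -> manual recovery."""
    return t_proc + (1 - p_proc) * (
        t_app + (1 - p_app) * (t_reboot + (1 - p_reboot) * t_manual)
    )

# Illustrative values in minutes and probabilities:
print(expected_recovery_minutes(t_proc=0.5, p_proc=0.90,
                                t_app=2.0, p_app=0.95,
                                t_reboot=10.0, p_reboot=0.99,
                                t_manual=60.0))   # about 0.75 minutes

The weighting also shows why the reboot and manual-recovery terms are rarely influential: they are multiplied by the small probability of every earlier tier failing.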

5.2.2.3 Coverage Definitions

Coverage is a probability; therefore, it is expressed as a percentage. The following sections describe the different types of coverage. These parameters are not necessarily correlated, although in actual systems there is probably some correlation between them. Part of this is because the mechanisms for improving coverage in one area frequently provide some level of coverage in another area. An example of this is a bus watchdog timer. The bus watchdog timer can detect accesses to invalid addresses that result from a hardware fault in the address decoder, but it can also detect an invalid address access due to an invalid pointer dereference in software. Coverage can be difficult to measure, and the correlation between different coverage types is even more difficult to measure. Because of this, the models use independent values for each of the different coverage types defined in the following sections. Any correlation that is known may be incorporated into the actual coverage values used in the model.

5.2.2.3.1 Hardware Fault Coverage

Hardware fault coverage is the probability that a hardware fault is detected by the system and an automatic recovery (such as a failover) is initiated. This parameter represents the percentage of hardware failures that are automatically detected by the system. The detection mechanism may be hardware, software, or a combination of both; the important aspect is that the fault is detected automatically by the system. As an example, consider a hardware fault that occurs in an I/O device such as a serial port or an Ethernet controller. This fault might be detected in hardware by using parity on the data bus, or it could be detected in software by using a CRC mechanism on the serial data. The important thing is that it can be detected automatically, not whether the detection mechanism was hardware or software. Hardware fault coverage is an influential parameter in the downtime calculation of redundant systems (such as active/standby or N+K), but it is not very influential for simplex systems.

5.2.2.3.2 Software Fault Coverage

Software fault coverage is the probability that a software fault is detected by the system and an automatic recovery (such as a failover) is initiated. This parameter represents the percentage of software failures that are automatically detected by the system. The detection mechanism may be hardware, software, or a combination of both; the important aspect is that the fault is detected automatically by the system. For example, consider a software fault that incorrectly populates the port number of an incoming port to a value greater than the number of ports in the system. This error could be detected by a software audit of the port data structures, or it could be detected by the hardware when an access to an out-of-range port is attempted. The important thing is that it can be detected automatically, not whether the detection mechanism was hardware or software. Software fault coverage is an influential parameter in the downtime calculation of redundant systems (such as active/standby or N+K), but it is not very influential for simplex systems. This is because the automatic recovery time for covered faults is usually much shorter than the detection time for uncovered faults.

5.2.2.3.3 Failed-Restart Detection Probability

Failed-restart detection probability is the probability that a failed reboot will be detected automatically by the system. Some systems will attempt to reboot multiple times if a reboot fails to restore primary functionality, whereas others will simply raise an alarm and give up. In either case, if the reboot has failed and the failure goes unnoticed, the failed unit will remain out of service until a technician notices that the unit has failed. If the failed reboot is detected automatically, then either an alarm will be raised informing the technician that he or she needs to take action, or the system will make another attempt at rebooting the failed unit. This is typically a noninfluential parameter, although it is more influential in simplex systems than in redundant systems.

5.2.2.4 Failover Definitions

The failover definitions all apply to redundant systems; a simplex system has nothing to fail over to.

5.2.2.4.1 Automatic Failover Time

Automatic failover time is the amount of time it takes to automatically fail primary functionality over to a redundant unit. This is a moderately influential parameter, but its effect is not necessarily continuous. TL 9000 allows an outage exclusion for most product categories; outages shorter than a certain threshold (15 seconds in TL 9000 Release 4) do not need to be counted. This means that most systems should strive to detect faults and fail over faulty units within the customer's maximum acceptable time (e.g., 15 seconds for most TL 9000 product categories).

5.2.2.4.2 Manual Failover Time (Detection and Failover)

Manual failover time is the amount of time it takes a technician to detect the need for a manual failover and to manually force a failover. Manual failovers are only required after an automatic failover has failed. Manual failover time is typically a noninfluential parameter.

5.2.2.4.3 Automatic Failover Success Probability

Automatic failover success probability is the probability that an automatic failover will be successful. This is typically a noninfluential parameter.

5.2.2.4.4 Manual Failover Success Probability

Manual failover success probability is the probability that a manual failover will be successful. An unsuccessful manual failover typically leads to a duplex failure, for which the recovery time can be much longer; in that case, both the active and the standby units/instances are repaired or rebooted. This is typically a noninfluential parameter.

5.3 PRACTICAL MODELING

Real systems and solutions are made up of various interlinked hardware and software modules that work together to provide service to customers or users. It is best to start by creating a reliability block diagram that clearly shows which major modules must be operational to provide service, and highlights the redundancy arrangements of those critical modules. Each redundancy group or simplex element that appears in series in the critical path through the reliability block diagram can be separately and individually modeled, thus further simplifying the modeling. For example, consider the sample in Figure 5.7, which shows components A and C as critical simplex elements, and component B is protected via a redundancy scheme. Thus, system downtime can be modeled by summing the downtime for simplex components A and C, and separately considering the downtime for the cluster of component Bs.

Figure 5.7. Simple reliability block diagram example.

The remainder of this section will present sample Markov availability models of common redundancy schemes. These sample models can be used or modified as building blocks for system- or solution-level availability models.

5.3.1 Simplex Component

Figure 5.8 models a simplex component with imperfect coverage in three states:

• State 1: The component is active and fully operational.
• State 2: The component is down (nonoperational), and the system and/or maintenance staff are aware of the failure, so recovery activities can be initiated.
• State 3: The component is down (nonoperational), but neither the system itself nor the maintenance staff is aware of the failure, so recovery activities cannot yet be initiated. State 3 is often referred to as a "silent failure" state.

This model is frequently used for things like backplanes when there is only a single backplane for the entire system. Typically, highly available systems employ redundancy schemes to increase availability, so this model should not see abundant use in a highly available system. Because backplanes in general are quite reliable, and have fully redundant connections within a single backplane, it is usually acceptable for them to be simplex. In the rare cases in which failure detection is instantaneous and perfect, this model degenerates into the simpler model shown in Figure 5.9.
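Before moving to the redundant models, here is a minimal numerical sketch of solving the three-state simplex model of Figure 5.8 for its steady-state probabilities. The generator matrix follows the transitions in the figure (covered failure Cλ, uncovered failure (1 − C)λ, silent-failure detection μSFD, and repair μR); the rate values themselves are illustrative assumptions.

import numpy as np

lam    = 0.0005   # component failure rate (per hour) -- illustrative
C      = 0.95     # coverage: probability a failure is detected automatically
mu_r   = 1 / 4.0  # repair rate (4-hour mean repair)
mu_sfd = 1 / 2.0  # silent-failure detection rate (2-hour mean detection)

# Generator matrix for states [1 Working, 2 Down Covered, 3 Down Uncovered].
Q = np.array([
    [-lam,    C * lam,  (1 - C) * lam],
    [ mu_r,  -mu_r,      0.0         ],
    [ 0.0,    mu_sfd,   -mu_sfd      ],
])

# Solve pi @ Q = 0 with sum(pi) = 1 by swapping one balance equation
# for the normalization constraint.
A = np.vstack([Q.T[:-1], np.ones(3)])
pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))

availability = pi[0]
print(f"availability = {availability:.6f}")
print(f"downtime = {(1 - availability) * 525_960:.1f} minutes/year")

The same solve pattern (build the generator, then replace one balance equation with the normalization constraint) carries over to the larger models in the rest of this section.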

5.3.2 Active–Active Model

The model shown in Figure 5.10 is used for duplex systems that split the load between a pair of units.

Figure 5.8. Simplex component Markov model. (States 2 and 3 are down states.)

Figure 5.9. Simplex, perfect coverage component Markov model. (The full model automatically changes to this model when the coverage is 100%.)

Figure 5.10. Full active–active component Markov model. (States 3, 4, and 5 are 50% down; state 6 is 100% down.)

When one unit goes down, there is a 50% capacity loss until the lost traffic can be reestablished on the other unit. A good example of this type of equipment is a redundant hub or router. The model degenerates to that shown in Figure 5.11 when the failovers are perfect, that is, every failover attempt works perfectly. For those systems in which failovers occur relatively instantaneously, the model further degenerates to the one shown in Figure 5.12. And, for those rare cases in which every failure is properly detected and a perfect, instantaneous failover is initiated, the model degenerates to the one shown in Figure 5.13.

5.3.3 Active–Standby Model

The active–standby model is used for duplex configurations that have one unit actively providing service and a second unit on standby just in case the first unit fails. A good example of this type of system is a redundant database server.

Figure 5.11. Active–active with 100% failover success. (The full model automatically goes to this model when F = 100%.)

Figure 5.12. Active–active with perfect, instantaneous failover. (The full model automatically goes to this model when the failover time is very short.)

One server actively answers queries while the other is in standby mode. If the first database fails, then queries are redirected to the standby database and service resumes. Figure 5.14 shows the state transition diagram for the full active–standby model. The full active–standby model degenerates to that shown in Figure 5.15 when the failovers are perfect, that is, every failover attempt works perfectly. For those systems in which failovers occur relatively instantaneously, the active–standby model further degenerates to the one shown in Figure 5.16. And, for those rare cases in which every failure is properly detected and a perfect, instantaneous failover is initiated, the active–standby model degenerates to the one shown in Figure 5.17.

5.3.4 N+K Redundancy

There are two different types of N+K redundancy. The first is true N+K, where there are N active units and K spare units.

Figure 5.13. Active–active with perfect coverage and instant failover. (The full model automatically goes to this model when C = 100%.)

In true N+K redundancy, the K spares do not perform any work until one of the N units fails, at which point traffic is routed to one of the K units and that unit assumes the work of the failed unit. The second type of N+K redundancy is called N+K Load Shared. In this configuration, the load is split across all N+K units, each handling roughly the same amount of load. The redundancy comes from the fact that only N units are required to support the needed capacity of the system.

In N+K Load Shared systems there is a partial outage whenever any unit fails, because it takes a finite amount of time to redistribute that unit's traffic to the remaining units. In true N+K systems, a partial outage occurs whenever one of the N units fails, but no outage occurs when one of the K units fails. Note that, for equal N and K, the portion of capacity lost in a partial outage of a true N+K system is greater than in an N+K Load Shared system, but such an outage is slightly less likely to occur. In the end, the choice becomes a matter of preference; both are very reasonable ways to construct a highly available system.

Figure 5.14. Full active–standby model. (States 4, 5, 6, and 7 are down states.)

An example of an N+K system is a cluster of servers, or server farm, used to create a website for access by a large number of users. Figure 5.18 shows the state transition diagram for an N+K Load Shared system.

5.3.5 N-out-of-M Redundancy

Section 5.1.1.3 discussed the RBD for an N-out-of-M system. The RBD method calculates the reliability of the system based on a binomial formula (a sketch of this calculation appears at the end of this subsection). Compared to the RBD method, the Markov model discussed here is richer, since it allows detailed modeling of failure detection probabilities and failure recovery hierarchies. N-out-of-M redundancy is similar to N+K, but it is used when failovers are essentially instantaneous. Practically speaking, it is typically used with power supplies and cooling fans.

Figure 5.15. Active–standby with 100% failover success. (The full model automatically goes to this model when F = 100%.)

Power supplies typically have their outputs wired together, so the failure of any single supply is instantly covered by the other supplies supplying more current. A similar thing occurs with fans: when a single fan fails, the remaining fans in the system can go to a higher speed and keep the system cool. Figure 5.19 shows the Markov model for the N-out-of-M redundancy scheme.
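For the RBD treatment mentioned at the start of this subsection, the binomial formula can be evaluated directly; a minimal sketch, assuming M independent, identical units each with availability A and a system that needs at least N of them:

from math import comb

def n_out_of_m_availability(n, m, unit_availability):
    """Probability that at least n of m independent units are up."""
    a = unit_availability
    return sum(comb(m, k) * a**k * (1 - a)**(m - k) for k in range(n, m + 1))

# Example: three power supplies of which any two are sufficient, each 99.9% available.
print(n_out_of_m_availability(2, 3, 0.999))   # about 0.999997

Unlike the Markov model of Figure 5.19, this formula has no notion of detection probability or recovery hierarchy, which is exactly the limitation noted above.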

5.3.6 Practical Modeling Assumptions

In modeling the availability of a real system that consists of various hardware and software components, the first item that needs to be addressed is decomposing the system into hardware and software subsystems. One simplification is to separate the hardware and the software and model them with two separate sets of models. To do this, we first have to demonstrate that two separate models (one for hardware and one for software) do not yield significantly different downtime results compared to an integrated hardware and software model. Figure 5.20 shows an integrated hardware and software model for an active–warm standby design.

Figure 5.16. Active–standby with perfect, instantaneous failover. (The full model automatically goes to this model when the failover time is very short.)

Table 5.8 lists the state definitions of the model in Figure 5.20. The downtime predicted by the integrated model and the downtime predicted by two separate hardware and software active–standby models are within 5% of each other. This suggests that the separate models can be used to simplify the downtime modeling. The reasons we recommend using separate hardware and software models for practical applications are:

1. The separate models produce downtime predictions that are within the acceptable range of precision.
2. Simpler models are easier to manage, which prevents unnecessary human errors.
3. Simpler models require fewer input parameters to be estimated. The uncertainties in the parameter estimates, in turn, might dwarf the precision offered by the more complicated models.


Figure 5.17. Active–standby with perfect coverage and instant failover. (The full model automatically goes to this model when coverage = 100%.)

Caution does need to be taken, though, in developing these models. Most importantly, the interactions between the separated components need to be considered in the separate models, and failure states and capacity losses need to be correctly reflected.

5.4 WIDGET EXAMPLE

This section ties the information from the preceding chapters together with an example for a hypothetical product called the "Widget System." The hypothetical Widget System supports transaction-processing functionality in a redundant architecture that can be tailored to a variety of different application needs. The Widget System is built on a scalable blade server chassis supporting application-specific boards. Internally, the Widget System will be able to steer traffic to the appropriate application blade.

Figure 5.18. N+K Load Shared redundancy model.

Figure 5.19. N-out-of-M redundancy model.


Figure 5.20. Integrated active–warm standby hardware and software Markov model.


Table 5.8. State definitions for the integrated model (each state lists the condition of the ACTIVE pack and of the STANDBY pack)

State 0:  Working | Working
State 1:  Detected s1 failure. Restart process | Working
State 2:  Detected s2 failure or failed restart of s1 failure. Initiate failover | Working
State 3:  Working | Reboot & copy data
State 4:  Failover to standby fails or manual detection of silent SW failure. Reboot & rebuild | Reboot & rebuild
State 5:  Silent SW failure | Working
State 6:  Working | Silent SW failure
State 7:  Detected s2 failure or failed restart of s1 failure. Initiate failover | Silent SW failure
State 8:  Detected s1 failure. Restart process | Silent SW failure
State 9:  Detected HW failure. Initiate failover | Silent SW failure
State 10: Silent HW failure | Silent HW failure
State 11: Failover fails after detected HW failure or manual detection of silent HW failure | Reboot & rebuild
State 12: Working | Silent HW failure
State 13: Detected s1 failure. Restart process | Silent HW failure
State 14: Detected s2 failure or failed restart of s1 failure. Initiate failover | Silent HW failure
State 15: Silent SW failure | Silent HW failure
State 16: Detection of silent SW failure. Reboot and rebuild | Attempt reboot; detect HW failure
State 17: Detected HW failure. Initiate failover | Silent HW failure
State 18: Silent HW failure | Silent HW failure
State 19: Failover fails after detected HW failure or manual detection of silent HW failure | Attempt reboot; detect HW failure
State 20: HW failure. Replace pack | HW failure. Replace pack
State 21: Reboot & rebuild new pack | Reboot & rebuild new pack
State 22: Silent HW failure | Working
State 23: Detected HW failure. Initiate failover | Working
State 24: Working | Detected HW failure
State 25: Detection of silent HW failure or failover to standby fails | Reboot & rebuild
State 26: Silent HW failure | Detected HW failure
State 27: Detected s1 failure. Restart process | Detected HW failure
State 28: Detected s2 failure or failed restart of s1 failure or detection of silent SW failure. Reboot & rebuild | Detected HW failure
State 29: Silent SW failure | Detected HW failure
State 30: Working | Reboot & copy data


The Widget System is representative of a variety of different potential products. It could be a telecom system in which the interface cards connect to the subscribers' lines (in which case they would be called line cards), and the control boards manage the setup and teardown of the subscriber calls. It could also be used on a factory floor to control a robotic assembler. In this case, the interface cards would read the robot's current position and pass that information to the control boards. The control boards would then calculate the correct control signal based on the current position, the desired position, the velocity, and the acceleration. The control boards would send the control signal information back to the interface cards, which in turn would translate it into the actual signal that drives the robots' motors.

The Widget System is implemented on a small footprint, industry standard platform. The platform includes a single chassis, redundant power supplies, fan assemblies, Ethernet external interfaces, and blades providing bearer path services. The ability to include a combination of CPU and network processor equipped blades enables exceptional performance, rapid introduction of new features, and a smooth evolution path. Figure 5.21 shows the Widget System hardware architecture.

There is a redundant pair of control boards that provide the operations, administration, and maintenance (OAM) interfaces, centralized fault recovery control, initialization sequencing, and the monitoring and control for the chassis.

Figure 5.21. Widget System hardware architecture.


Additionally, they provide the communications interconnections between the remaining cards in the chassis.

A pair of interface cards is shown. These cards provide the interface to the ingress and egress communications links for the Widget System. They may optionally include a network processor to enable handling of very high speed links.

Redundant power converters are provided. In a typical installation, one converter is connected to each power bus, so that the system power remains operational in the event that one of the power buses or one of the converters fails.

Cooling is provided by a set of fan assemblies. Each fan assembly contains a fan controller and two fans. The fan controller controls the fan speed and provides status back to the control board. The fan controllers are designed so that the failure of a controller will result in both fans going to full speed. The system is designed so that it may operate indefinitely in the case of a single fan failure (the remaining fans will all go to full speed).

Figure 5.22 shows the Widget System hardware reliability block diagram. The backplane, power, fan assembly, and other serial elements, as well as the control board and interface cards, are included.

Figure 5.23 shows the software architecture that resides on each blade. Both the control boards and the interface cards contain this software architecture, with the exception that the OAM software task resides on the control boards only. The software on the interface cards has internal monitors, as well as having the control board software as a higher-level overseer. The connecting lines in the figure represent that the health of the software processes Task 1, Task 2, . . . , and OAM SW is monitored by the monitoring software, MonSW.

Figure 5.22. Widget System hardware reliability block diagram.


Figure 5.23. Widget software architecture.

A detailed software RBD for the control board is shown in Figure 5.24. The software RBD for the interface card is similar to the control board RBD. The interface card software is typically simpler than the control board software, but the same method can be used to model and estimate downtime. In this study, it is assumed that the interface cards operate in an active–standby mode. They can operate in other modes as well, such as simplex. A simplex interface card architecture results in a higher interface card downtime.

Figure 5.24. Widget System control board software reliability block diagram.


The system software is composed of numerous modules [such as the operating system (OS), database software, platform software, etc.]. All of these software modules have different impacts on transaction-processing downtime. For instance, failures of the OS and the transaction-processing modules directly cause transaction-processing downtime, whereas failures of the OAM software and the MonSW cause transaction-processing downtime only when a second failure (of the OS or transaction-processing software) occurs before the OAM and MonSW fully recover; hence, the OAM software and MonSW have only an indirect impact on transaction-processing downtime. The model considers these different failure impacts, and it also incorporates the multiple layers of the failure escalation and recovery hierarchy. The next section documents the model input parameters, the assumptions behind the parameter values, and the methods of estimating these values.

The Markov model in Figure 5.25 depicts the failure and failure-recovery scenarios for the active–standby control board (CB), and can be used to calculate the CB hardware downtime.

Figure 5.25. Active–standby Markov model for control board hardware. (States 4, 5, 6, and 7 are down states.)


Table 5.9 summarizes the parameters in the Markov model of Figure 5.25, while Table 5.10 shows the parameter values. The Markov model is solved for the steady-state percentage of time spent in each state. Then, the time spent in the states in which the system is unavailable is summed to get the total downtime for the control board hardware.

Table 5.9. Active–standby Markov model parameters

Failure rate (failures/hour), symbol λ: Failure rates of the unit/software.

Coverage factor for the active mode, symbol CA: Probability of detecting a failure on the active unit/software by system diagnostics. (1 − CA) denotes the probability that the system misses detecting a failure on the active unit/software; it is the undetected failure probability.

Coverage factor for the standby mode, symbol CS: Probability of detecting a failure on the standby unit/software by system diagnostics. (1 − CS) denotes the probability that the system misses detecting a failure on the standby unit. CS might equal CA.

Failover success probability, symbol F: Success probability of automatically promoting the standby instance to active.

Failover duration (hours), symbol 1/μFO: Automatic failover duration.

Manual failover success probability, symbol FM: Success probability for manually forcing traffic from a failed unit to an operational standby unit.

Manual failover time (hours), symbol 1/μFM: Mean time to reestablish a standby instance by manually forcing traffic from a failed unit to an operational standby unit.

Repair/reboot time (hours), symbol 1/μ: Mean time to repair a failed hardware unit (or reboot the software), which typically includes system initialization time.

Uncovered failure detection time on the active unit (hours), symbol 1/μSFA: Mean time to detect an uncovered failure on the active unit. This detection typically involves human interaction and, thus, is slower than automatic detection.

Uncovered failure detection time on the standby unit (hours), symbol 1/μSFS: Mean time to detect an uncovered failure on the standby unit. This detection typically involves human interaction and, thus, is slower than automatic detection, and may take longer than detecting uncovered failures on the active unit, since no service has been lost.


Table 5.10. Assumed parameter values for widget control board hardware

Parameter                                   Value
Failure rate (failures/hour)                0.0000042
Coverage factor for the active mode         90%
Coverage factor for the standby mode        90%
Failover success probability                99%
Failover duration (min)                     0.1666667
Standby recovery time (min)                 240
Manual failover success probability         99%
Manual failover time (min)                  30
Repair/reboot time (min)                    240
Uncovered failure detection time (min)      60
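To make the calculation concrete, the sketch below builds and solves a continuous-time Markov model with the seven-state structure we read from Figure 5.25, using the Table 5.10 values. The exact transition details (for example, where each second-failure arc leads) and the reading of "standby recovery time" as the standby uncovered-failure detection time are our assumptions, so treat this as an approximation; with these assumptions the total comes out close to the roughly 0.24 minutes per year shown for the control board hardware in Table 5.11.

import numpy as np

# Table 5.10 values converted to per-hour rates.
lam     = 4.2e-6          # failure rate (failures/hour)
CA = CS = 0.90            # coverage factors, active and standby
F, FM   = 0.99, 0.99      # automatic and manual failover success probabilities
mu_fo   = 60 / 0.1666667  # automatic failover rate (10-second failover)
mu_fom  = 60 / 30         # manual failover rate (30-minute manual failover)
mu      = 60 / 240        # repair/reboot rate (240-minute repair)
mu_sfa  = 60 / 60         # uncovered-failure detection rate on the active unit
mu_sfs  = 60 / 240        # assumed reading of "standby recovery time"

# States: 1 Duplex, 2 Simplex, 3 Standby Down Uncovered, 4 Active Down Uncovered,
#         5 Active Down Covered, 6 Duplex Failed, 7 Failover Failed.
Q = np.zeros((7, 7))

def rate(i, j, r):
    Q[i - 1, j - 1] = r   # transition rate from state i to state j (1-indexed)

rate(1, 5, CA * lam);    rate(1, 4, (1 - CA) * lam)   # failures on the active unit
rate(1, 2, CS * lam);    rate(1, 3, (1 - CS) * lam)   # failures on the standby unit
rate(2, 1, mu);          rate(2, 6, lam)              # repair / second failure
rate(3, 2, mu_sfs);      rate(3, 6, lam)
rate(4, 5, mu_sfa);      rate(4, 6, lam)
rate(5, 2, F * mu_fo);   rate(5, 7, (1 - F) * mu_fo)
rate(7, 2, FM * mu_fom); rate(7, 6, (1 - FM) * mu_fom + lam)
rate(6, 2, mu)
np.fill_diagonal(Q, -Q.sum(axis=1))

# Steady-state probabilities: pi @ Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T[:-1], np.ones(7)])
pi = np.linalg.solve(A, np.array([0, 0, 0, 0, 0, 0, 1.0]))

downtime = pi[[3, 4, 5, 6]].sum() * 525_960   # states 4, 5, 6, and 7 are down
print(f"control board hardware downtime ~ {downtime:.3f} minutes/year")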

A similar Markov model is built for the interface cards, the chassis, the power converters, the fans, and the software that resides on the control board and the interface cards. Each Markov model is then solved independently, and the results are added together to obtain the downtime of the entire system. Table 5.11 shows the CB hardware downtime, while Table 5.12 shows the resultant system downtime and availability. The predicted downtime of approximately 46 minutes is not all that great. In Chapter 8, Section 8.3, where we discuss sensitivity, we will show how to improve this system to reduce the downtime and make the system more reliable.

The Widget System model has demonstrated the application of RBDs and Markov models to a small system. The methods used in the example all apply to modeling larger, more complex systems. For additional information on how to model system availability, the references in Appendix E should be consulted.

Table 5.11. Control board hardware downtime per state

State             Downtime (min/yr)
1                 525936.7883
2                 17.6715
3                 5.3009
4                 0.2209
5                 0.0061
6                 0.0013
7                 0.0110
Total downtime    0.2393


Table 5.12. Widget system downtime and availability

Component                                               Downtime (min/yr)    Predicted availability
Hardware
  Infrastructure (Fans, Backplane, Power Entry, etc.)   0.57                 99.9999%
  Control board                                         0.24                 100.0000%
  Interface card                                        0.03                 100.0000%
  Power converter                                       0.03                 100.0000%
Software
  Control board                                         43.26                99.9918%
  Interface card                                        1.62                 99.9997%
Total                                                   45.74                99.9913%

5.5 ALIGNMENT WITH INDUSTRY STANDARDS

There are many industry standards that might apply to a particular system or piece of equipment, as evidenced by the long list of standards in the references in the Appendix. In fact, Telcordia alone has so many standards that refer to reliability that they have issued a "Roadmap to Reliability Documents" [Telcordia08], which lists three pages' worth of Telcordia reliability-related standards! GR-874-CORE, "An Introduction to the Reliability and Quality Generic Requirements," also provides a good overview of the various Telcordia standards that are reliability related.

The authors consider the TL 9000 Measurements Handbook to be the overarching standard that takes precedence over the other standards in cases where there is a conflict. There are multiple reasons to consider the TL 9000 Measurements Handbook first:

1. It was created by a large consortium of service and equipment providers, so it represents a well-balanced viewpoint.
2. It defines how to measure the actual reliability performance of a product in the field.
3. It measures reliability performance as it relates to revenue generation, which is the driving force for customers.
4. It is updated regularly, so it stays current with the equipment types and practices in actual use.

In addition to aligning with TL 9000, there are a number of other potentially applicable standards. Alignment with them is discussed in the following sections.


In the following sections, the shorthand reference for each standard is used to make the text easier to read. For the complete reference name and issue information, see the references in Appendix E.

5.5.1 Hardware Failure Rate Prediction Standard Alignment

There are two primary standards for predicting hardware failure rates: MIL-HDBK-217F and SR-332. The models described above can accept hardware failure predictions based on either of these standards. In addition, some equipment suppliers have proprietary methods of predicting hardware failure rates, which may also be used as input to the models. Each of these different prediction methods typically produces a different failure rate prediction. MIL-HDBK-217F tends to be quite pessimistic (i.e., it predicts a failure rate greater than what will actually be observed in the field), whereas Release 1 of SR-332 is less pessimistic than MIL-HDBK-217F, but still pessimistic when compared with observed values. Release 2 of SR-332 attempts to address this pessimism, although, due to the relative newness of Release 2, it is still too early to tell how successful the attempt has been.

To obtain accurate results from the models, it is often appropriate to scale the predictions with a scaling factor. This factor should be based on comparisons of actual failure rates with those of the prediction method used. Additionally, the scaling factor should consider the type of equipment and the environment in which it is operating, as well as the specific release of the prediction standard in use. For example, using SR-332 Release 1, the scaling factor for CPU boards in a controlled environment (such as an air-conditioned equipment room) might be one-third (meaning that the observed failure rate is one-third of the predicted failure rate), whereas the scaling factor for a power supply in a controlled environment might be one-half. If there is insufficient field data available to determine an appropriate scaling factor, 1 should be used as the scaling factor until enough field data becomes available. See Chapter 7, Section 7.1 for additional discussion of hardware failure rates.
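Applying such a scaling factor is simple arithmetic; in the sketch below the predicted rates are made-up example numbers, and the one-third and one-half factors are the illustrative values mentioned above rather than calibrated data.

def scaled_failure_rate(predicted_rate_per_hour, scaling_factor):
    """Scale a standards-based prediction toward field-observed behavior."""
    return predicted_rate_per_hour * scaling_factor

print(scaled_failure_rate(6.0e-6, 1 / 3))   # hypothetical CPU-board prediction, scaled
print(scaled_failure_rate(4.0e-6, 1 / 2))   # hypothetical power-supply prediction, scaled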

c05.qxd

2/8/2009

5:39 PM

Page 91

5.5

ALIGNMENT WITH INDUSTRY STANDARDS

91

of direct reference in appropriate military electronic parts established reliability (ER) specifications. [GR-357] also discusses hardware component reliability, but is primarily focused on device qualification and manufacturer qualification. GR-357 also defines the device quality levels used in SR-332, and because SR-332 is an acceptable (preferred) method for generating hardware failure rates, the modeling described above is aligned with GR-357 as well. [SR-TSY-000385] provides an overview of the terms and the mathematics involved in predicting hardware failure rates, as well discussing the typical failure modes for a variety of device types. SR-TSY-000385 also touches on system availability modeling, but this is covered primarily by using RBDs, rather than by using the preferred method of Markov modeling. Additional standards that relate to specific component types include: 앫 [GR-326-CORE]—Covers single-mode optical connectors and jumpers 앫 [GR-468-CORE]—Covers optoelectronic devices 앫 [GR-910-CORE]—Covers fiber optic attenuators 앫 [GR-1221-CORE]—Covers passive optical components 앫 [GR-1274-CORE]—Covers printed wiring assemblies exposed to hygroscopic dust 앫 [GR-1312-CORE]—Covers optical fiber amplifiers and dense wavelength-division multiplexed systems 앫 [GR-2853-CORE]—Covers AM/digital video laser transmitters and optical fiber amplifiers and receivers 앫 [GR-3020-CORE]—Covers nickel–cadmium batteries These standards should be consulted for a more in-depth understanding of the reliability of the individual component types, but typically do not play a significant role in generating a system level availability model. 5.5.2

Software Failure Rate Prediction Standard Alignment

Telcordia [GR-2813-CORE] gives "Generic Requirements for Software Reliability Prediction." Chapter 3.4 of the standard begins, "Since no single software reliability prediction model is accepted universally, these requirements describe the attributes that such a model should have for use in predicting the software reliability of telecommunications systems." The standard talks about using the prediction models to "determine the number of faults remaining in code" and to "determine the effort necessary to reach a reliability objective" in assessing whether the software is ready to be released. These are exactly the tasks we use software reliability growth models (SRGMs) to complete. We also use SRGMs to compare/calibrate the software failure rate estimated in the testing environment against the software failure rate of early releases observed in the field environment, in order to predict the field failure rate prior to general availability (GA). This method is discussed in GR-2813-CORE. Telcordia GR-2813-CORE also discusses how to "correlate software faults with the characteristics of the software," such as software code size and complexity. Another Telcordia standard, SR-1547, "The Analysis and Use of Software Reliability and Quality Data," describes methods for analyzing failure counts as a function of other explanatory factors such as complexity and code size. Our software-metrics approach to estimating failure rates is consistent with both GR-2813-CORE and [SR-1547].

5.5.3 Modeling Standards Alignment

The primary standard that relates to modeling is SR-TSY-001171, "Methods and Procedures for System Reliability Analysis." This standard covers Markov modeling and encourages the use of coverage factors when modeling. The modeling described above is based on Markov techniques, includes the use of coverage factors, and is strongly aligned with the methods described in SR-TSY-001171. There are multiple industry standards that suggest ranges or limits on the specific input values to be used but, unfortunately, these standards sometimes conflict with each other. In these cases, the most recent release of the TL 9000 Measurements Handbook should be used as the highest-priority standard. The National Electronics Systems Assistance Center (NESAC), a consortium of North American telecommunications service providers, also issues targets for many of the TL 9000 metrics. The NESAC guidelines and targets should be the second-highest priority. Both TL 9000 and the NESAC guidelines are kept current, reflect the views of a consortium of service providers, and view availability from the perspective of revenue generation for the service provider, which is their ultimate business goal. This implies that the ultimate consideration is availability of the primary functionality of the system, and that outages of the primary functionality for all product-attributable reasons (typically meaning both hardware and software causes) must be considered.

[MIL-STD-756B], "Reliability Modeling and Prediction," establishes uniform procedures and ground rules for generating mission reliability and basic reliability models and predictions for electronic, electrical, electromechanical, mechanical, and ordnance systems and equipment. [IEC 60300-3-1] provides a good overview of the different modeling methods (which it refers to as "dependability analysis methods"), including a description of each method along with its benefits, limitations, and an example.

5.5.4 Equipment-Specific Standards Alignment

There are a number of standards that relate to a specific category of telecommunications equipment. Among them are:

• [GR-63-CORE]—Covers spatial and environmental criteria for telecom network equipment
• [GR-82-CORE]—Covers signaling transfer points
• [GR-418-CORE]—Covers fiber optic transport systems
• [GR-449-CORE]—Covers fiber distribution frames
• [GR-508-CORE]—Covers automatic message accounting systems
• [GR-512-CORE]—Covers switching systems
• [GR-929-CORE]—Covers wireline systems
• [GR-1110-CORE]—Covers broadband switching systems
• [GR-1241-CORE]—Supplement for service control points
• [GR-1280-CORE]—Covers service control points
• [GR-1339-CORE]—Covers digital cross-connect systems
• [GR-1929-CORE]—Covers wireless systems
• [GR-2841-CORE]—Covers operations systems

These standards should be consulted to obtain equipment-specific information, such as downtime requirements and objectives, downtime budgets, and so on.

5.5.5 Other Reliability-Related Standards

There are a number of other reliability-related standards that do not apply directly to modeling. Many of these relate to the reliability and quality processes that help ensure a robust product. The appendices provide a comprehensive list of standards that may be consulted by the reader interested in obtaining a broader background in quality and reliability.

[MIL-HDBK-338], "Electronic Reliability Design Handbook," provides procuring activities and contractors with an understanding of the concepts, principles, and methodologies covering all aspects of electronic systems reliability engineering and cost analysis as they relate to the design, acquisition, and deployment of equipment or systems. [IEC 60300-3-6] (1997-11), "Dependability Management—Part 3: Application Guide—Section 6: Software Aspects of Dependability," describes dependability elements and tasks for systems or products containing software; this document was withdrawn in 2004 and replaced by IEC 60300-2. [IEC 61713] (2000-06), "Software Dependability Through the Software Life-Cycle Process—Application Guide," describes activities to achieve dependable software to support IEC 60300-3-6 (1997-11) (replaced by IEC 60300-2); the guide is useful to acquire.

CHAPTER 6

ESTIMATING PARAMETERS AND AVAILABILITY FROM FIELD DATA

"You can't manage what you don't measure."

Customers generally keep maintenance records, often called trouble tickets, for all manual emergency and nonemergency recoveries, and often for at least some automatic recoveries. Records often capture the following data:

• Date and time of the outage event
• Equipment identifier, such as model, location, and serial number
• Outage extent, such as number of impacted subscribers or percentage of capacity lost
• Outage duration, typically resolved to minutes or seconds
• Summary of the failure/impairment
• Actual fix, such as "replaced hardware," "reset software," or "recovered without intervention"
• Believed root cause, such as hardware, software, planned activity, or procedural error

These records may include details on other relevant items, including:

• Emergency versus nonemergency recovery, such as "PLANNED=y," "SCHEDULED=y," or a nonzero value for "PARKING_DURATION"
• Fault owner, such as equipment supplier, customer error, or power company


Given outage trouble tickets and the number of systems deployed by a service provider, it is straightforward for customers to calculate availability and failure rates for elements they have in service. Customers will often calculate availability and outage rates for high-value and/or broadly deployed elements on a monthly basis. These results may be shared with suppliers on a monthly, quarterly, or annual basis. This chapter explains how this customer outage data can be analyzed to estimate important input parameters to system-level availability models and to compute actual product-attributable availability. Given properly estimated input parameters and actual availability, one can calibrate the availability model, thus improving prediction accuracy for releases not yet deployed or developed.

6.1 SELF-MAINTAINING CUSTOMERS

Customers generally write outage trouble tickets for all manually recovered outages and, possibly, some automatically recovered outages. Customers generally only escalate an outage to the supplier (e.g., a supplier's Customer Technical Assistance Center) when they are no longer making acceptable progress addressing the outage themselves. For example, if the system raises a critical alarm that a hardware element has failed, the customer's staff will often replace the hardware element and return the failed pack to the supplier or a third-party repair house without engaging the equipment supplier in real time. Alternatively, the first time a hard-to-debug outage occurs, the customer may contact the supplier's Customer Technical Assistance Center for assistance, thus creating a formal assistance request. As leading customers often have efficient knowledge-management schemes in place, subsequent occurrences of that (initially) hard-to-debug outage are likely to be quickly debugged and promptly resolved by following the procedure used to resolve the first occurrence.

6.2 ANALYZING FIELD OUTAGE DATA

Customers' outage records can be analyzed to understand the actual reliability and availability of a product deployed by a particular customer, as well as to estimate modeling parameters and validate an availability model. The basic analysis steps are:

1. Select target customer(s)
2. Acquire the customer's outage data
3. Scrub the outage data
4. Categorize the outages
5. Normalize capacity loss
6. Calculate exposure time
7. Compute outage rates
8. Compute availability

Each of these steps is reviewed below.

6.2.1 Select Target Customer(s)

As detailed in Chapter 4, both reported outage rates and outage durations may vary significantly between customers operating identical equipment. Thus, it is generally better to solicit and analyze a homogeneous dataset from a single customer than to aggregate data from different customers with different policies and procedures. Although the perceived reliability and availability from that single customer will not be identical to the perception of all customers, one can better characterize the operational policies and factors that determined the reliability/availability of the selected customer and, thus, intelligently extrapolate those results to other customers that might have somewhat different operational policies and factors. The criteria for selecting the target customer(s) to acquire data from include those that:

1. Have significant deployments of the target element. Realize that deployment includes both the number of elements and the months in service, which together produce the overall element-years of service.
2. Are willing and able to provide the data. This depends on two core factors: the willingness of the customer to share this information with the supplier, and the ability of the customer's data systems to actually generate the report(s) necessary for an adequate analysis. Data systems at some customers may be regionally organized, or outage-related data may be segmented in such a way as to make it awkward or inconvenient to consolidate the data into a single dataset that can be efficiently analyzed.

It is often most convenient to arrange for an engineer who is fluent in the spoken language of the customer to participate in the analysis, to minimize practical issues associated with:

• Euphemisms. "Bumping," "bouncing," "sleeping," and "dreaming" are all euphemisms for specific element failure modes that some English-speaking customers use. Other failure-related euphemisms may be language-, country-, and perhaps even customer-specific.
• Abbreviations. "NTF" for "No Trouble Found" or "CWT" for "Cleared While Testing" may be common in English; however, non-English-speaking customers may use acronyms that are unfamiliar to nonnative speakers of those languages.
• Language subtleties. Customers in some countries may separately track outages attributed to "masculine other" (e.g., "otros" in Spanish) versus "feminine other" (e.g., "otras" in Spanish). Implications of these different classifications may not be obvious to engineers who are not intimately familiar with regional language usage patterns.

As a practical matter, in-country supplier technical support staff are typically well equipped to clarify any issues involved in translation/interpretation of a customer's outage trouble tickets.

6.2.2 Acquire Customer's Outage Data

Work with the targeted customer(s) to acquire:

1. The customer's outage tickets for the product of interest. Data is typically provided as a Microsoft Excel spreadsheet with one row per outage event and with columns containing various event-related parameters, including those enumerated at the beginning of this chapter. Twelve months of data is generally optimal for analysis because it gives visibility over an entire year, thus capturing any seasonal variations and annual maintenance/upgrade cycles. Some customers routinely offer this data to equipment suppliers as monthly, quarterly, or annual "vendor report cards," "vendor availability reports," and so on.
2. The number of network elements in service (by month) in the window of interest. The number of network elements in service is an important factor in determining the service exposure time. Chapter 9, Section 9.1 answers the question of "how much data is enough" in doing the field availability analysis.
3. The name of the equipment provider's customer team engineer who can answer general questions about the customer's deployment of the target product (e.g., what software version is running and when it was upgraded, and how the elements are configured).

The analysis produced from this data is intended strictly for the equipment supplier's internal use, and it is not suggested that the results be shared with the customer. The primary reasons for not generally offering to share the results of the analysis are:

1. Availability may be worse than the customer had realized.
2. The customer's operational definitions of availability may differ from the equipment supplier's definitions, potentially opening up an awkward subject (see Chapter 4, Section 4.1.1).
3. Analysis may reveal that the customer's policies, procedures, and other factors are better or worse than those of its competitors. This is obviously important for the equipment supplier team to understand, but generally inappropriate to reveal to customers (see Chapter 4, Section 4.2).

6.2.3 Scrub Outage Data

Upon receipt of the outage data, one should review and scrub the data to address any gaps or issues. Specifically, one should check that:

• No outages for elements other than the target element are included.
• No outages before or after the time window of interest are included.
• There are no duplicate records.
• Incomplete or corrupt records have been repaired (by adding nominal or "unknown" values). In the worst case, events are omitted from the analysis.
• Key data fields (e.g., outage duration, outage impact/capacity loss, actual fix) are nonblank; repair them with nominal values, if necessary.

Although TL 9000 metrics calculations may exclude brief and small-capacity-loss events, if records for those events are available, then they should be included in the reliability analysis to better understand the system's behavior for both "covered" failures and small-capacity-loss events.
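Where the outage tickets arrive as a spreadsheet, these scrubbing steps can be scripted. The sketch below is a minimal example using pandas; the file name and the column names (ELEMENT_TYPE, OUTAGE_START, DURATION_MIN, CAPACITY_LOSS, ACTUAL_FIX) are hypothetical and will differ for any real customer report, and the nominal repair values shown are simply one reasonable policy.

```python
import pandas as pd

# Hypothetical file and column names; real customer reports will differ.
df = pd.read_excel("customer_outage_tickets.xlsx", parse_dates=["OUTAGE_START"])

# Keep only the target element and the 12-month analysis window
df = df[df["ELEMENT_TYPE"] == "TargetElement"]
df = df[(df["OUTAGE_START"] >= "2008-01-01") & (df["OUTAGE_START"] < "2009-01-01")]

# Remove exact duplicate records
df = df.drop_duplicates()

# Repair key fields with nominal or "unknown" values rather than dropping rows
df["DURATION_MIN"] = df["DURATION_MIN"].fillna(df["DURATION_MIN"].median())
df["CAPACITY_LOSS"] = df["CAPACITY_LOSS"].fillna(1.0)   # conservatively assume total loss
df["ACTUAL_FIX"] = df["ACTUAL_FIX"].fillna("unknown")

print(f"{len(df)} scrubbed outage records retained")
```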

6.2.4 Categorize Outages

While many customers explicitly categorize the root cause and/or actual fix of each outage, one should map each outage into standard categories to enable consistent analysis. At the highest level, outages should be classified into appropriate orthogonal categories, such as:

• (Product-attributable) Software/firmware—for software/firmware outages that are cleared by reset, restart, power cycling, etc.
• (Product-attributable) Hardware—for events resolved by replacing or repairing hardware
• Customer Attributable—for customer mistakes (e.g., work not authorized, work/configuration errors)
• External Attributable—for events caused by natural disasters (e.g., fires, floods) or by third parties (e.g., commercial power failures, fiber cuts, security attacks)
• Planned/Procedural/Other Product Attributable—occasionally, product-attributable causes other than hardware or software/firmware will cause outages; these events can be separately categorized to avoid compromising the hardware and software/firmware categories without attributing a product-related failure to the customer.

Outage recoveries should be categorized, based on actual fix, duration, and so on, as:

• Automatic recoveries—for "unplanned" events, such as those listed as "recovered without intervention" or "recovered automatically." These outages are typically 3 minutes or less.
• Manual emergency recoveries—for "unplanned" events recovered by reseating or replacing circuit packs, manually restarting software, and so on. These outages are typically longer than 3 minutes.
• Planned/scheduled recoveries—when manual outage recovery is intentionally deferred to a maintenance window or off hours. This is typically flagged as "planned activity."

Optionally, second-tier classifications can also be added if desired, such as:

• Transaction processing—core service was lost
• Management visibility—alarms or other management visibility was lost
• Connectivity—connectivity to gateways and supporting elements was lost
• Provisioning, if reported by the customer
• Loss of redundancy—if the customer reports loss-of-redundancy or simplex-exposure events, then those should be explicitly considered.

6.2.5 Normalize Capacity Loss

Customers often precisely quantize the capacity loss of each outage as a discrete number of impacted subscribers, lines, ports, trunks, etc. For availability calculations, these capacity losses should be normalized into percentages, e.g., a 100% outage if (essentially) all lines, ports, trunks, subscribers, etc., on a particular element were impacted. Operationally, round the capacity loss to the percentage loss from failure of the most likely failed component (e.g., a line card [perhaps 10%], a shelf/side [perhaps 50%], or the entire system [100%]).
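A minimal sketch of this rounding step is shown below; it assumes a hypothetical configuration in which the standard capacity-loss fractions are 10% (line card), 50% (shelf/side), and 100% (full system), and simply snaps a raw impacted-line fraction to the nearest of those values.

```python
def normalize_capacity_loss(impacted_lines, total_lines):
    """Round a raw impacted-line count to the percentage loss of the most
    likely failed component (hypothetical 10%/50%/100% configuration)."""
    raw_fraction = impacted_lines / total_lines
    standard_losses = [0.10, 0.50, 1.00]   # line card, shelf/side, full system
    return min(standard_losses, key=lambda loss: abs(loss - raw_fraction))

# Example: 4,700 of 10,000 lines impacted rounds to a 50% (shelf/side) outage
print(normalize_capacity_loss(4_700, 10_000))   # -> 0.5
```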

6.2.6 Calculate Exposure Time

Exposure time of systems in service is measured in NE years. Operationally, one typically calculates this on a monthly basis by summing, across the months of interest, the number of elements in service in a particular month times the number of days in that month. Mathematically,

NE years of service = [ Σ_month (Number of elements in service × Days in month) ] / 365.25 days    (6.1)

6.2.7 Compute Outage Rates

Outage rates are not prorated by capacity loss, and are calculated via a simple formula like Equation (6.2), which calculates the hardware outage rate for a given network element:

Outage rate_hardware = Σ (Product-attributable hardware outages) / (NE years of service)    (6.2)
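As a minimal illustration of Equations (6.1) and (6.2), the Python sketch below computes NE years of service from a table of monthly in-service counts and then the hardware outage rate; the element counts and outage count are invented purely for illustration.

```python
# Sketch of Equations (6.1) and (6.2); the in-service counts and the
# outage count below are hypothetical examples, not field data.

# (month, number of elements in service, days in that month)
in_service = [
    ("2008-01", 120, 31),
    ("2008-02", 132, 29),
    ("2008-03", 140, 31),
]

# Equation (6.1): NE years of service
ne_years = sum(elements * days for _, elements, days in in_service) / 365.25

# Equation (6.2): hardware outage rate (outages per NE year),
# not prorated by capacity loss
hardware_outages = 4          # product-attributable hardware outages observed
outage_rate_hw = hardware_outages / ne_years

print(f"NE years of service: {ne_years:.2f}")
print(f"Hardware outage rate: {outage_rate_hw:.3f} outages per NE year")
```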


One should separately compute hardware and software outage rates; optionally, one can compute outage rates for secondary categories (e.g., transaction processing, management visibility, connectivity, and provisioning). Similar calculations can be completed for the software outage rate or for other outage classifications. It is often insightful to compare outage rates for hardware, software, and procedural causes: what percentage of outages is coming from each category?

It is straightforward to see that the reciprocal of the outage rate (say λ) measures the average time between outages, or the mean time to outage (MTTO). We discussed in Chapter 5 that the single-parameter exponential distribution is often used to model the random variable of time to failure. In this case, once the outage rate is estimated from the failure data, the probability density function of time to outage is determined. These estimators are known as parametric estimators in statistical estimation theory: parametric estimators are built from knowledge, or an assumption, of the probability density function for the data and for the quantities to be estimated. In the exponential case, since the true value of the parameter is unknown and the estimate is made from a set of noisy observations, it is desirable to evaluate the noise in the observations and/or the error associated with the estimation process. One way of getting an indication of the estimation confidence is to estimate confidence bounds or confidence intervals, say [λ_L, λ_U], for the outage rate, where λ_L is the lower bound and λ_U is the upper bound. A confidence interval associates the point estimate with an error or confidence level. For example, if an interval estimate is [λ_L, λ_U] with a given probability 1 − α, then λ_L and λ_U are called 100(1 − α)% confidence limits. This means that the true failure rate is between λ_L and λ_U with a probability of 100(1 − α)%. The chi-squared distribution can be used to derive the confidence limits for the failure rate estimate, which can be summarized as follows. For a sample with n failures during a total of T units of operation, the random interval between the two limits in Equation (6.3) will contain the true failure rate with a probability of 100(1 − α)%:

λ̂_L = χ²_(1−α/2, 2n) / (2T)   and   λ̂_U = χ²_(α/2, 2n) / (2T)    (6.3)


Section 3.1 in Appendix B documents the theoretical development of these limits. A numerical example follows. Assuming that after T = 50,000 hours of testing, n = 60 failures are observed, we obtain the point estimate of the failure rate:

λ̂ = 60 / 50,000 = 0.0012 outages/hour    (6.4)

For a confidence level of 90%, that is, α = 1 − 0.9 = 0.1, we calculate the confidence limits for the failure rate as:

λ̂_L = χ²_(1−α/2, 2n) / (2T) = χ²_(0.95, 120) / (2 × 50,000) = 95.703 / 100,000 = 0.000957 outages/hour    (6.5)

and

λ̂_U = χ²_(α/2, 2n) / (2T) = χ²_(0.05, 120) / (2 × 50,000) = 146.568 / 100,000 = 0.001465 outages/hour    (6.6)

In summary, the point estimate of the outage rate is 0.0012 outages/hour, and with a probability of 90% the true outage rate lies within the interval (0.000957, 0.001465) outages/hour. This method can be applied to failure rate estimation for both point and interval estimates when analyzing failure data.
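The numerical example above can be reproduced in a few lines of Python. The sketch below assumes SciPy is available; note that the subscripts in Equations (6.3), (6.5), and (6.6) use the upper-tail convention, so χ²_(1−α/2, 2n) and χ²_(α/2, 2n) correspond to the lower-tail quantiles chi2.ppf(α/2, 2n) and chi2.ppf(1 − α/2, 2n), respectively.

```python
from scipy.stats import chi2

n = 60          # observed failures
T = 50_000.0    # total unit-hours of operation
alpha = 0.10    # 1 - confidence level (90% confidence)

# Point estimate, Equation (6.4)
rate_hat = n / T

# Equation (6.3): chi-squared confidence limits on the failure rate.
# chi2.ppf is the lower-tail quantile, so the text's upper-tail
# percentage points are obtained by swapping the tail probabilities.
rate_lower = chi2.ppf(alpha / 2, 2 * n) / (2 * T)
rate_upper = chi2.ppf(1 - alpha / 2, 2 * n) / (2 * T)

print(f"Point estimate: {rate_hat:.6f} outages/hour")
print(f"90% confidence interval: ({rate_lower:.6f}, {rate_upper:.6f})")
# Expected: about 0.0012 with limits near (0.00096, 0.00147)
```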

6.2.8 Compute Availability

Annualized downtime can be derived by dividing prorated outage durations by the total in-service time; mathematically, this is

Annualized downtime = [ Σ_(Product-attributable events) (Capacity loss × Outage duration) ] / (NE years of service)    (6.7)

Annualized downtime is generally the easiest availability-related number to work with because it is easy to understand, budget, and consider "what-if" scenarios with. One should calculate annualized downtime for hardware and software separately, as well as overall product-attributable downtime.


Note that product requirements and/or TL 9000 outage measurement rules may support omitting some small-capacity-loss events from the annualized downtime calculation. For example, a total-capacity-loss-only calculation might exclude all events that impact less than 90% of capacity. Judgment and/or TL 9000 outage measurement rules may also support capping the maximum product-attributable outage duration to avoid factoring excess logistical or policy delays in outage resolution into the calculation. For example, it may be appropriate to cap product-attributable outage durations for failures in staffed offices at 1 hour, and at 2 or 4 hours in unstaffed locations. Although there may be a few longer-duration outages for events that are escalated through the customer's organization and/or to the equipment supplier in early deployments, customers will generally integrate this knowledge rapidly and have much shorter outage durations on subsequent failures. As customers (and equipment-supplier decision makers) often consider availability as a percentage compared to "99.999%," one should also compute availability as a percentage using the following formula, where annualized downtime (in minutes per year) is calculated from Equation (6.7) and 525,960 is the number of minutes in an average year:

Availability = (525,960 − Annualized downtime) / 525,960    (6.8)
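A short sketch of Equations (6.7) and (6.8) follows, using a hypothetical list of product-attributable outage events; each event carries a normalized capacity loss (as a fraction) and a duration in minutes, and the NE-years figure is likewise invented.

```python
# Hypothetical product-attributable outage events:
# (capacity loss as a fraction, outage duration in minutes)
events = [
    (1.00, 22.0),   # total outage
    (0.50, 15.0),   # half the system
    (0.10, 45.0),   # single line card
]

ne_years = 38.7     # NE years of service from Equation (6.1)

# Equation (6.7): annualized, prorated downtime in minutes per NE year
annualized_downtime = sum(loss * duration for loss, duration in events) / ne_years

# Equation (6.8): availability, with 525,960 minutes in an average year
MINUTES_PER_YEAR = 525_960
availability = (MINUTES_PER_YEAR - annualized_downtime) / MINUTES_PER_YEAR

print(f"Annualized downtime: {annualized_downtime:.2f} minutes/year")
print(f"Availability: {availability:.6%}")
```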

As we discussed in Chapter 2 and Chapter 5, Equation (6.8) is formulated based on the assumption that the components have two states, up and down, and that the uptimes and downtimes are exponentially distributed. The uptimes and downtimes are estimated from recorded data and, hence, a confidence level for the unavailability estimate can also be derived from the same set of data. In fact, Baldwin and coworkers introduced an approach to estimate the confidence limits of unavailability for power generating equipment in 1954 [Baldwin54]. The unavailability can be calculated from the availability equation in Equation (2.1):

Unavailability = 1 − A = 1 − MTTF / (MTTF + MTTR)   or   U = λ / (λ + μ)    (6.9)


where λ and μ are the failure and repair rates, respectively, and U is unavailability. Note that MTTF = 1/λ and MTTR = 1/μ. The average uptime duration m and the average downtime duration r can be estimated from the recorded data. Using these two values, a single point estimate of the unavailability can be evaluated from Equation (6.10):

Û = r / (r + m)    (6.10)

A confidence interval can also be constructed from the same set of recorded data. Based on [Baldwin54], the F-distribution can be used to derive the confidence limits for the unavailability. Section 3.2 in Appendix B describes the details; the end results are shown in Equation (6.11):

Upper limit: U_U = r / (r + φ″m)
Lower limit: U_L = r / (r + φ′m)    (6.11)

where φ′ and φ″ are constants depending upon the chosen confidence level, which can be found from the F-distribution table. The example below shows the estimation process. Let a = the number of consecutive or randomly chosen downtime durations and b = the number of consecutive or randomly chosen uptime durations. Consider a component that operates in the field, for which the following data is collected: a = b = 10, r = 5 hours, and m = 2000 days = 48,000 hours. Evaluate (1) the single-point estimate of unavailability and (2) the limits of unavailability that give 90% confidence of enclosing the true value.

The point estimate of unavailability is given by Û = 5/(48,000 + 5) = 0.000104. For a 90% confidence level, the tail probability on each side is (1 − 0.90)/2 = 0.05. Using the F-distribution tables with 2a = 2b = 20 degrees of freedom, Pr[F_20,20 ≥ φ′] = 0.05, hence φ′ = 2.12; and Pr[F_20,20 ≥ 1/φ″] = 0.05, hence φ″ = 1/φ′ = 0.471. Therefore, the upper and lower limits can be derived from Equation (6.11). Section 3.2 in Appendix B documents the theoretical development of these limits. The upper limit for U is U_U = 5/[5 + (0.471 × 48,000)] ≈ 0.000221, and the lower limit for U is U_L = 5/[5 + (2.12 × 48,000)] ≈ 0.0000491. From this example the following statements can be made:

• The single-point estimate of unavailability is 0.000104; that is, the availability is 99.9896%.
• There is a 90% probability that the true unavailability lies between 0.0000491 and 0.000221; that is, the availability is between 99.9779% and 99.99509%.
• There is a 95% probability that the true unavailability is less than 0.000221, or that the availability is greater than 99.9779%.
• There is a 95% probability that the true unavailability is greater than 0.0000491, or that the availability is less than 99.99509%.
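The worked example above can be reproduced with SciPy, which removes the need for F-distribution tables; the sketch below assumes SciPy is available and simply plugs the example's data into Equations (6.10) and (6.11).

```python
from scipy.stats import f

a = b = 10            # number of downtime and uptime durations observed
r = 5.0               # average downtime duration, hours
m = 48_000.0          # average uptime duration, hours (2000 days)
confidence = 0.90
tail = (1 - confidence) / 2   # 0.05 in each tail

# Point estimate, Equation (6.10)
u_hat = r / (r + m)

# phi' is the upper 5% point of F with (2a, 2b) degrees of freedom;
# phi'' is its reciprocal (Equation (6.11), after [Baldwin54]).
phi_prime = f.isf(tail, 2 * a, 2 * b)
phi_double_prime = 1.0 / phi_prime

u_upper = r / (r + phi_double_prime * m)
u_lower = r / (r + phi_prime * m)

print(f"Point estimate of unavailability: {u_hat:.6f}")
print(f"90% confidence limits: ({u_lower:.7f}, {u_upper:.6f})")
# Expected: about 0.000104 with limits near (0.0000491, 0.000221)
```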

6.3 ANALYZING PERFORMANCE AND ALARM DATA

Some systems directly record reliability-related parameters, such as software restarts or switchovers, as part of the system's performance counters. Depending on exactly how these metrics are defined and organized, one may be able to aggregate this data across multiple elements over a sufficiently long time to estimate the software failure rate and other parameters. Naturally, these techniques will be product specific, based on the precise performance and alarm counters that are available.

Customers may deploy service assurance or management products to archive critical (or all) alarms. It may be possible to extract failure rates directly from the critical alarm data, but one must be careful to discard redundant and second-order alarms before making any calculations. Likewise, one must also exclude alarms from:

• Intermediate or adjacent elements, such as critical alarms raised by a base station because the backhaul connection was disrupted, or service failures that occurred because a supporting system (e.g., an authentication server) could not be reached.
• Planned activities, such as applying software upgrades. Although service may not be disrupted by a software upgrade, a software restart is often required, and the software restart is likely to cause a brief loss of management visibility. Regardless of whether or not service was impacted, this event does not contribute to a failure rate because the alarm was triggered by a planned upgrade rather than by a hardware or software failure. Also be aware that some restarts may be triggered by the upgrade of adjacent elements; for example, base stations might have to be restarted to resynchronize with upgraded software on a radio network controller.

Extracting reliability/availability parameter estimates from alarm data requires a deep understanding of a system's alarms and general behavior.

6.4 COVERAGE FACTOR AND FAILURE RATE

While all service-impacting failures are likely to generate one or more critical events (e.g., process restarts or switchovers), only a fraction of those alarmed events are likely to result in outage trouble tickets. A lower bound on the coverage factor can be established by Equation (6.12) (given for hardware; a similar formula can be used for software):

Coverage factor_hardware ≥ Σ_hardware (Automatically recovered events) / Σ_hardware (All critical events)    (6.12)

Ideally, one would use correlated critical alarm data along with outage data, via Equation (6.13) (given for software; a similar formula can be used for hardware):

Coverage factor_software ≈ [ Σ_software (Unique critical alarms) − Σ_software (Manually recovered critical events) ] / Σ_software (Unique critical alarms)    (6.13)

Here, "unique critical alarms" represents unique software alarms, used as a proxy for software failures.


All manual emergency and scheduled recoveries are, by definition, nonautomatic and, hence, are uncovered. The uncovered outage rate should be equal to the failure rate times (1 − coverage):

Σ (Manual recoveries) / (NE years of service) ≈ Failure rate × (1 − Coverage)    (6.14)

The coverage factors for hardware and software should be estimated separately. After the coverage factor and the outage rate are estimated, the overall failure rate (as opposed to outage rate) can be derived from Equation (6.14).
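The sketch below ties Equations (6.12) and (6.14) together for hardware, using invented event counts: the coverage factor is bounded from the fraction of critical events that recovered automatically, and the overall failure rate is then backed out from the manually recovered (uncovered) outage rate.

```python
# Hypothetical hardware event counts over the analysis window
auto_recovered_events = 45     # automatically recovered critical events
all_critical_events = 50       # all critical hardware events
manual_recoveries = 5          # manual emergency/scheduled recoveries
ne_years = 38.7                # NE years of service

# Equation (6.12): lower bound on the hardware coverage factor
coverage_hw = auto_recovered_events / all_critical_events

# Equation (6.14): uncovered outage rate = failure rate * (1 - coverage),
# so the overall failure rate is the uncovered rate divided by (1 - coverage)
uncovered_outage_rate = manual_recoveries / ne_years
failure_rate_hw = uncovered_outage_rate / (1.0 - coverage_hw)

print(f"Estimated hardware coverage factor: {coverage_hw:.2f}")
print(f"Uncovered outage rate: {uncovered_outage_rate:.3f} per NE year")
print(f"Derived hardware failure rate: {failure_rate_hw:.3f} per NE year")
```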

6.5 UNCOVERED FAILURE RECOVERY TIME

All manually recovered outage events and scheduled recoveries are, by definition, nonautomatic and, hence, uncovered. Durations of uncovered outage events inherently include the uncovered failure detection time and the manual recovery/repair time. Manually recovered outage durations will typically range from perhaps 5 to 15 minutes for some software events up to hours for a few extraordinary cases. Rather than averaging the brief, typical events with the extraordinary cases (thus producing an excessively pessimistic view), the authors recommend using the more robust median value of the manually recovered outage durations. Mathematically, the median represents the midpoint of the distribution: half the values are above and half the values are below. In contrast, the mean value (mathematical average) of manually recovered outage durations is often rather pessimistic because some portion of the long-duration outages may have been deliberately parked and resolved on a nonemergency basis. Uncovered failure recovery times are generally different for hardware and software failures and, thus, should be estimated separately. Over months, quarters, and years, uncovered failure recovery times are likely to shrink somewhat as the customer's staff becomes more efficient at executing manual recovery procedures.
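The robustness point is easy to see numerically; the sketch below compares the mean and the median of a hypothetical set of manually recovered outage durations that includes one deliberately parked, long-duration event.

```python
from statistics import mean, median

# Hypothetical manually recovered outage durations, in minutes;
# the 480-minute event was parked and resolved on a nonemergency basis.
durations_min = [8, 10, 12, 12, 15, 18, 25, 480]

print(f"Mean:   {mean(durations_min):.1f} minutes")    # pulled up by the outlier
print(f"Median: {median(durations_min):.1f} minutes")  # robust estimate of recovery time
```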

6.6 COVERED FAILURE DETECTION AND RECOVERY TIME

The median outage duration of automatically recovered outages is generally a good estimate of covered failure recovery time. Covered hardware and software failure recovery times may be different. Depending on the architecture of the system and the failure detection and recovery architecture, it might be appropriate to characterize several distinct covered failure recovery times. A good example of this is a system with several different redundancy schemes, such as an N+K front end and a duplex back-end database. Recovery for the front end may be as quick as routing all new requests away from a failed unit, whereas recovery for the back-end database may require synchronizing outstanding transactions, something that is likely to take longer than the front-end rerouting.

CHAPTER 7

ESTIMATING INPUT PARAMETERS FROM LABORATORY DATA

Results from some system testing and verification activities can be used to refine reliability parameter estimates to predict system availability as a product is released to customers; this chapter details how this can be done.

7.1 HARDWARE FAILURE RATE

Hardware failure rates can be predicted using well-known methodologies such as those in Telcordia's SR-332 Issue 2 or MIL-HDBK-217F, which predict the failure rate for an entire field replaceable unit (FRU) by combining the predicted failure rates of the individual parts and other factors. Whereas MIL-HDBK-217F tends to be very conservative and overestimates the hardware failure rate, Telcordia SR-332 Issue 2 is anticipated to yield predictions closer to the actual observed hardware failure rates (Issue 2 is relatively new, so its accuracy is not yet fully determined). Commercial tools such as Relex are available to simplify producing these hardware failure rate predictions. There are also companies that provide failure rate estimation as a service.

Interestingly, the rate at which hardware field-replaceable units are returned to equipment suppliers is quite different from the "actual" or "confirmed" hardware failure rate. Hardware that is returned by customers to a repair center and retested with no failure discovered is generally referred to as "no trouble found" (NTF) or, sometimes, "no fault found" (NFF). Thus, the hardware return percentage is actually

Returns % ≈ [ Σ_(Time period) (Confirmed hardware failures + No trouble found) ] / (Number of installed packs)    (7.1)

"No trouble found" hardware occurs for a number of reasons, including:

• Poor diagnostic tools and troubleshooting procedures. If diagnostics, troubleshooting, training, or support are inadequate, then maintenance engineers may "shotgun" the repair by simply replacing packs until the problem is corrected. If the customer's policy prohibits packs that were removed during troubleshooting from being reinserted into production systems before being retested, then all of those packs will be returned. Obviously, a circuit pack that was replaced but did not resolve a problem is unlikely to have failed and, thus, is likely to be labeled "no trouble found" at the repair center. Note that some software failures may be misdiagnosed as hardware failures. For instance, a memory or resource leak that apparently causes a board to fail will appear to be "repaired" by replacing the board and, thus, may result in the original board being sent back for repair.
• Intermittent or transient problems. Some failures are intermittent or caused by power, signal, or other transient events. Often, the situations that trigger these intermittent or transient problems will be absent at the repair center and, thus, the hardware will be labeled "no trouble found."
• Stale firmware or obsolete patch version. Hardware design flaws, component quality issues, and firmware or device configuration bugs are occasionally discovered in production hardware. As repair centers will often apply all recommended hardware and firmware changes to a returned circuit pack before retesting it, it is possible that one of the recommended changes corrected the original failure and, thus, no failure is found when the pack is retested after completion of the recommended changes. Depending on warranty repair policy, customers may even be motivated to represent packs as having failed in order to get them updated to the latest hardware and firmware revisions if they would normally be charged for those updates.


No trouble found rates for complex hardware can run quite high; NTF packs can even outnumber confirmed hardware failures. Although NTFs clearly represent a quality problem, one should be careful about using actual hardware return rates in place of predicted or confirmed hardware failure rates. TL 9000 actually measures return rates based on the length of time individual packs have been in service:

• The early return indicator (ERI) captures the rate of hardware returns for the first six months after shipment of a hardware component.
• The yearly return rate (YRR) captures the rate of hardware returns for the year following the ERI period.
• The long-term return rate (LTR) captures the rate of hardware returns for all time after the YRR time period.

It should be noted that the hardware failure rate for a particular component may vary significantly by customer, by the locality in which it is deployed, and by various other factors. Variables such as temperature, humidity, electrical stress, and contaminants in the air can all affect hardware reliability. For example, a component deployed in a city with high levels of pollution and poor (or no) air conditioning is likely to have significantly higher failure rates than the same component in a rural area in an air-conditioned equipment room.

Ideally, the actual hardware failure rate would be used in all calculations. Sadly, the actual hardware failure rate is not as easy to calculate as it would seem. One of the more difficult inputs to determine is the exposure time; that is, for how many hours has the component been installed and powered in the field? Although this sounds simple on the surface, numerous factors conspire to make it difficult to determine. Most manufacturers know when a component was shipped, but do not know how long the interval from shipping to installation is. Additionally, many customers order spares, which sit in a spares cabinet until needed. These spare components, if included in the failure rate calculation, would make the failure rate look better than it really is. To accurately calculate hardware failure rates, these factors must be considered. The good news, if there is any, is that hardware failures typically contribute a relatively small portion of system downtime in the typical high-availability system, thus reducing the impact of errors in calculating the hardware failure rate.
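As a small illustration of these return-rate windows, the sketch below classifies hypothetical returned packs into ERI, YRR, and LTR buckets by their months in service at the time of return; the 0–6, 6–18, and beyond-18-month boundaries follow the window descriptions above, the counts are invented, and the actual TL 9000 normalization rules (which divide by shipments in specific windows) are more detailed than shown here.

```python
from collections import Counter

def return_window(months_in_service):
    """Map months in service at return time to a return-rate window."""
    if months_in_service <= 6:
        return "ERI"   # early return indicator: first six months
    if months_in_service <= 18:
        return "YRR"   # yearly return rate: the year following the ERI period
    return "LTR"       # long-term return rate: everything after the YRR period

# Hypothetical returns: months each pack had been in service when returned
returns = [2, 5, 7, 11, 14, 20, 26, 3, 30]
counts = Counter(return_window(m) for m in returns)

installed_packs = 1200   # hypothetical installed base, used only for illustration
for window in ("ERI", "YRR", "LTR"):
    print(f"{window}: {counts[window]} returns "
          f"({100.0 * counts[window] / installed_packs:.2f}% of installed packs)")
```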

7.2 SOFTWARE FAILURE RATE

As system testing is generally designed to mimic most or all aspects of typical customers' operational profiles, one would expect system test experience to represent a highly accelerated test period, vaguely analogous to accelerated hardware testing. Although hardware and software failures are fundamentally different, the rate of encountering new, stability-impacting defects in the system test environment should be correlated with the rate of encountering stability-impacting defects in field operation. Software reliability growth modeling (SRGM) is a technique that can be used to model the defect detection process. It analyzes defects found during the system test period to estimate the rate of encountering new, stability-impacting defects. By comparing actual software failure rates from field data (described in Chapter 6, Section 6.3) with the testing failure rate analyzed by SRGM for the corresponding releases, one can estimate the acceleration or calibration factor to convert the rate of new, stability-impacting defects per hour of system testing into software failures per year in the field.

Other techniques for software failure rate prediction use software metrics. We propose a mapping method for early software failure rate prediction (say, in the design phase). In this method, software metrics such as code size, complexity, and maturity (or reuse rate) are assessed with objective or subjective rankings. Software failure rates are then predicted by mapping a combination of the metric settings to software failure rate data. This method relies on historical data of software metrics and software failure rates; see Chapter 8, Section 8.1.5 for details.

Other metrics, such as defect density and function points, are also used in software size, effort, and failure rate prediction. Function points are a standard metric for the relative size and complexity of a software system, originally developed by Alan Albrecht of IBM in the late 1970s [FP08]. These are static methods that correlate the defect density level (say, the number of defects per thousand lines of code, KLOC) or the function point level to software failure rates. The problem with the density approach is that the relationship between defects and failures is unknown, and this kind of approach oversimplifies it. We should be very cautious in attempting to equate fault density to failure rate [Jones1991, Stalhane1992]. Another example expresses defect density in terms of function points for different development phases. Table 7.1 [Jones1991] reports a benchmarking study based on a large amount of data from commercial sources. This kind of benchmark helps perform defect density prediction at a high level, but caution should always be taken when applying it to a specific application or project. Another constraint is that the function points would have to be measured first, and function point measurements are not always available in every project.

Table 7.1. Sample defects per function point

Defect origin       Defects per function point
Requirement         1
Design              1.25
Coding              1.75
Documentation       0.6
Bad fixes           0.4
Total               5

The next section reviews the theory of SRGM, explains the steps needed to complete SRGM, and reviews how to convert the results of SRGM into parameters to be used in an availability model. References [Lyu96, Musa98, Pham2000] provide both detailed background information about SRGM and good summaries of the most widely used SRGMs.

7.2.1 Theory of Software Reliability Growth Modeling (SRGM)

The most widely used class of SRGMs assumes that the fault discovery process follows a nonhomogeneous Poisson process (NHPP). The software debugging process is modeled as a defect counting process, which can be modeled by a Poisson distribution. The word "nonhomogeneous" means that the mean of the Poisson distribution is not a constant, but rather a time-dependent function. In this book, we use the term fault to mean a bug in the software and failure to mean the realization of a fault by a user. The mean value function of the NHPP is denoted by m(t), which represents the expected number of defects detected by time t. Quite often, it is defined as a parametric function depending on two or more unknown parameters. The most common models use a continuous, nondecreasing, differentiable, bounded mean value function, which implies that the failure rate of the software, λ(t) = m′(t), monotonically goes to zero as t goes to infinity. The time index associated with NHPP SRGMs can represent cumulative test time (during the testing intervals) or cumulative exposure time among users (during the field operation phases). In the former case, the application of the model centers on being able to determine when the failure rate is sufficiently small that the software can be released to users in field environments. In the latter case, the application of the model centers on estimating the failure rate of the software throughout the early portion of its life cycle, and also on collecting valuable field statistics that can be folded back into test-environment applications for subsequent releases.

One of the earliest choices for a mean value function is m(t) = a(1 − e^(−bt)), proposed by Goel and Okumoto [Goel79a]. Here, a denotes the expected number of faults in the software at t = 0 and b represents the average failure rate of an individual fault. The Goel–Okumoto (GO) model remains popular today. Recent applications and discussions of the GO model can be found in [Wood96, Zhang02, Jeske05a, Jeske05b, and Zhang06]. The mean value function of the GO model is concave and, therefore, does not allow a "learning" effect to occur in test-environment applications. Learning refers to the experience level of system testers, which ramps up in the early stages of the test environment as the testers and test cases become more proficient at discovering faults. An alternative mean value function that has the potential to capture the learning phenomenon is the S-shaped mean value function [Yamada83, Ohba84, Pham97, Goel79b, Kremer83].

The most important application of SRGMs is to use them to predict the initial field failure rate of software and the number of residual defects as the software exits a test interval. Prediction of the initial field failure rate proceeds by collecting failure data (usually grouped) during the test interval and then, typically, using maximum likelihood estimation (MLE) to estimate the parameters of m(t). Using SRGMs to dictate how much additional test time is required to ensure that the failure rate is below a specified threshold at release is less common.


The basic steps of applying the SRGM method are:

1. Use SRGM (one or more models) to fit the testing data
2. Select the model that gives the best goodness of fit
3. Use statistical methods to estimate the parameters in the SRGMs to obtain the software failure rate during testing
4. Calibrate the software failure rate during testing to predict the software failure rate in the field

Appendix C documents the methods of calculating the maximum likelihood estimates of the SRGMs and the criteria for selecting the best-fit model(s). Figure 7.1 shows a typical example of applying SRGM to predict the software failure rate. The X-axis gives the cumulative testing effort, generally expressed in hours of test exposure. Ideally, this represents actual tester hours of test exposure, excluding time for administrative and nontesting tasks, as well as testing setup time, defect reporting time, and even tester vacations, holidays, and so on. The Y-axis gives the cumulative number of nonduplicate, stability-impacting defects. If the project does not explicitly identify stability-impacting defects, then nonduplicate, high-severity defects are often a good proxy. The smooth curve represents a curve fitted through the data that asymptotically approaches discovering the "last bug" in the system. The vertical dotted line shows the point in time when the data ended (i.e., how much cumulative testing had been completed).

Figure 7.1. Software reliability growth modeling example. (Axes: cumulative testing effort versus cumulative defects; annotation: estimated SW failure rate = number of residual defects × per-fault failure rate.)
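To make the procedure concrete, the sketch below fits the Goel–Okumoto mean value function m(t) = a(1 − e^(−bt)) to grouped weekly defect counts by maximum likelihood, then estimates the residual defect count and the rate of discovering new stability-impacting defects at the end of testing. The weekly test hours, defect counts, and calibration factor are invented for illustration, and a production analysis would apply the model-selection criteria of Appendix C rather than simply assume the GO model.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical grouped test data: cumulative test hours at the end of
# each week and defects (nonduplicate, stability-impacting) found that week.
cum_hours = np.array([40, 90, 150, 220, 300, 390, 470, 540, 600, 650], float)
defects   = np.array([ 9, 11, 10,  9,   7,   6,   4,   3,   2,   1], float)

def go_mean(t, a, b):
    """Goel-Okumoto mean value function m(t) = a(1 - exp(-b t))."""
    return a * (1.0 - np.exp(-b * t))

def neg_log_likelihood(params):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    bounds = np.concatenate(([0.0], cum_hours))
    expected = np.diff(go_mean(bounds, a, b))     # expected defects per week
    # Grouped NHPP log-likelihood (constant terms dropped)
    return -np.sum(defects * np.log(expected) - expected)

fit = minimize(neg_log_likelihood, x0=[defects.sum() * 1.5, 0.005],
               method="Nelder-Mead")
a_hat, b_hat = fit.x

residual_defects = a_hat - go_mean(cum_hours[-1], a_hat, b_hat)
test_failure_rate = b_hat * residual_defects      # new defects per test hour

# Hypothetical calibration factor, estimated by comparing a prior release's
# test-environment rate against its observed field failure rate.
calibration_factor = 0.8
predicted_field_rate = calibration_factor * test_failure_rate

print(f"Estimated total defects a: {a_hat:.1f}, per-fault rate b: {b_hat:.5f}/hour")
print(f"Residual stability-impacting defects: {residual_defects:.1f}")
print(f"Rate of new defects at end of test: {test_failure_rate:.4f}/test hour")
print(f"Calibrated field failure rate (illustrative): {predicted_field_rate:.4f}")
```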


The gap between the cumulative number of defects discovered at the vertical dotted line and the cumulative-defects asymptote estimates the number of residual stability-impacting defects. For example, if the GO model m(t) = a(1 − e^(−bt)) is used here, then the total number of defects is a and the average failure rate of a fault is b [Zhang02]. From the graph, the number of residual defects is the vertical distance from the asymptote to the cumulative number of defects by time T. By multiplying the per-fault failure rate by the estimated number of residual defects, one can estimate the rate of discovering new, stability-impacting defects at the end of system testing.

Typically, although testers try to mimic the user's environment, the test environment and the field environment do not match up completely. The reasons for the mismatch of the two environments are:

1. During the testing phase, testers intentionally try to break the software and find defects. In this sense, software testing is more aggressive and, hence, yields a much higher defect detection rate.
2. During field operation, the users are using the software for its designed purpose and, hence, field operation is much less aggressive.
3. Most of the software defects are detected and removed during the testing interval, and the remaining defects are significantly less likely to trigger failures in the field.

Hence, when we predict the software field failure rate from the software failure rate estimated in the testing environment, we need to adjust for the mismatch of the testing and field environments by using calibration factors. [Zhang02] and [Zhang06] document details of why and how to address this practical issue when using SRGMs. To adjust for the mismatch, the testing-environment rate should be correlated with the software failure rate observed in the field by using a calibration factor. This calibration factor is best estimated by comparing the lab failure rate of a previous release against the field failure rate for that same release. Assuming that testing strategies, operational profiles, and other general development processes remain relatively consistent, the calibration factor should be relatively constant. References [Zhang02] and [Jeske05a] discuss more details on calibrating the software failure rate estimated in the testing environment to predict the software failure rate in the field.

[Jeske05b] and [Zhang06] also discuss two other practical issues: noninstantaneous defect removal time and deferral of defect fixes. Most SRGMs focus on defect detection and assume that fault removal is instantaneous, so that software reliability growth is achieved as soon as defects are detected, and that the detected defects are fixed before the software is released. In practice, it takes a significant amount of time to remove defects, and fixes of some defects might be deferred to the next release for various reasons; for example, if a defect is part of a new feature. Other practical issues include imperfect debugging and imperfect fault removal. Fortunately, these behaviors are often fairly consistent from release to release and, thus, can be roughly addressed via proper calibration against field data.

In addition to quantitatively estimating software failure rates, SRGM offers a qualitative, easy-to-understand comparison of releases by overlaying multiple releases onto a single chart. For example, consider Figure 7.2, below. The triangles give the fitted SRGM curve for the first major release of a particular product; the diamonds give the fitted curve of the second major release.

Figure 7.2. Comparing releases with SRGM. (Y-axis: cumulative defects.)


Even though the second release was tested more than the first release, the testers clearly had to work harder to find stability-impacting defects; this suggests significant software reliability growth from the first release to the second release.

The key features of implementing the SRGM approach to estimate software reliability include:

• Normalize the test exposure against the "real" effort, rather than against calendar time. System testing effort is often nonuniform over time because blocking defects can be encountered at any time, thus slowing progress. Likewise, work schedules undoubtedly vary across the test interval, with holidays, vacations, meetings, administrative activities, and so on, as well as periods of very long hours (often toward the end of the test interval). Plotting defects against calendar time homogenizes all these factors, making it hard to understand what is really happening. The authors recommend normalizing to "hours of testing effort" to remove the impact of these factors.
• Focus on stability-impacting defects. It is not uncommon for the reported rates and trends for minor defects and enhancement requests to vary over the test cycle. Early in the cycle, testers may report more minor events, especially if they are blocked from their primary testing; later in the test cycle, testers may be too busy to file minor defect or enhancement reports. Some projects even report that severity definitions vary over the course of the development cycle, in that if a severe defect is discovered in the first half of testing it may be categorized as severity 2 (major), but if that same defect is discovered in the last third of the test cycle it might be categorized as severity 1 (critical). Limiting the scope to stability-impacting defects avoids these reporting variations. One can compare the characteristics of stability-impacting defects with those of other defect severities to check data validity, but stability-impacting defects will drive the outage-inducing software failure rate.

The best situation is when the data can be scrubbed carefully to identify service-impacting defects. If this cannot be achieved, SRGM can be applied to data sets of different severities; the trends for different groups of severity levels can then be analyzed and compared, as shown in Figure 7.3. The ideal situation is for the defect tracking system to provide a field for "stability impacting" and to have the system testers fill this in. This eliminates the need to scrub the data and avoids some of the other issues associated with using severity levels.

2/10/2009

9:49 AM

Page 121

7.2

SOFTWARE FAILURE RATE

121

Cumulative Defects

c07.qxd

Figure 7.3. SRGM example by severity.

SRGM makes the following assumptions: 1. System test cases mimic the customers’ operational profile. This assumption is consistent with typical system test practice. 2. System testers recognize the difference between a severe outage-inducing software failure (also known as a “crash”) and a minor problem. If system testers cannot recognize the difference between a critical, stability-impacting defect and a minor defect, then all system test data is suspect. 3. Severe defects discovered prior to general product availability are fixed. If, for some reason, the fix of some detected defects will be deferred, these defects should be counted as residual defects. 4. System test cases are executed in a random/uniform manner. “Hard” test cases are distributed across the entire test interval, rather than being clustered (e.g., pushed to the end of the test cycle). 7.2.2

Implementing SRGM

Below are the steps used to implement SRGM: 1. 2. 3. 4.

Select the test activities to monitor Identify stability-impacting defects Compute the test exposure time Combine and plot the data

c07.qxd

2/10/2009

122

9:49 AM

Page 122

ESTIMATING INPUT PARAMETERS FROM LABORATORY DATA

5. Analyze the data to estimate per-fault failure rate and the number of residual service-impacting defects 7.2.2.1 Select the Test Activities to Monitor Product defects are reported throughout the development cycle and out to trial and commercial deployment. Because SRGM relies on normalizing defect discovery against testing effort required to discover those defects, it is critical to understand what test activities will be monitored. Typically, one will focus on defects and exposure time for the following test activities: 앫 앫 앫 앫

System feature testing, including regression testing Stability testing Soak testing Network-level or cluster-level testing

Regression tests are important since regression tests ensure that the detected defects are removed and, hence, the software reliability growth really takes place. (Traditional SRGMs assume reliability growth using the detection times of the defects; that is, defects are removed as soon as they are detected. For most applications, this is reasonable since regression tests ensure that the defects are removed). Another reason is that for a given operational profile, the regression tests typically indicate that software stability is achieved. Defects generated from the following activities should be excluded from consideration when estimating the software failure rate in the testing environment: 앫 앫 앫 앫 앫

Developer unit testing and coding activities Unit/system integration testing Systems engineering/controlling document changes Design, document, and code reviews Trial/field problems

The reasons for not including these defects are (1) the goal is to estimate system-level software failure rates, and the defects from the early or later development phases are not representative of the system software failure rate; (2) normalizing exposure times during these phases is difficult. Stability-impacting defects discovered during nonincluded activities are not “lost”; rather they represent uncertainty on

c07.qxd

2/10/2009

9:49 AM

Page 123

7.2

SOFTWARE FAILURE RATE

123

where the “0 defects” horizontal axis on the SRGM plot should have been. Fortunately, the location of the “0 defects” X-axis is unimportant in SRGM because the gap between defects discovered and defect asymptote is what really matters regarding defects. 7.2.2.2 Identify Stability-Impacting Defects Having selected the test activities to focus on, one must then set a policy for determining which defects from that activity will “count.” The options, in descending order of preference, are: 1. Include an explicit “stability-impacting” flag in the defect tracking system and instruct system testers to assert this flag as appropriate. A product-specific definition of “stability-impacting” defect should assure consistent use of this flag. 2. Manually scrub severity 1 and severity 2 defects to identify those that really are stability-impacting, and only consider these identified/approved events. 3. Use severity 1 and severity 2 defects “raw,” without any scrubbing (beyond filtering out duplicates). Note. Duplicate defects must be removed from the dataset. For spawned/duplicate defects, only include the parent defects. Nondefects (typically includes no-change modification/change requests) or user errors and documentation defects should be removed from the dataset. Defects detected during the customerbased testing period are typically analyzed separately. We use them to proxy the software failure rates in the field environment, which can then be used to calibrate the test and the field software failure rates to improve future predictions. 7.2.2.3 Compute the Test Exposure Time Test exposure time represents the actual time that the system is exposed to testing, typically expressed in hours, measured on a weekly basis (or more frequently). On some projects, this information is directly recorded in a test management and tracking tool. If test hours are not directly recorded, then they can generally be estimated as tester half-days of test effort per week. The weakest (but still acceptable) measure of test exposure is normalizing against test cases attempted or completed per week. Test cases-per-week data are tracked on most projects, so this data should be readily available. Unfortunately, there can be a wide variation in time required to complete each individual test case,

c07.qxd

2/10/2009

124

9:49 AM

Page 124

ESTIMATING INPUT PARAMETERS FROM LABORATORY DATA

making it a weak proxy for exposure time. Nevertheless, it is superior to calendar time and should be used when tester hours or tester half-days-per-week data is not available. 7.2.2.4 Combine and Plot the Data Cumulative test effort and cumulative stability-impacting defects are recorded on at least a weekly basis. Data can be recorded in a simple table, as shown in Table 7.2, and plotted. This type of tabular data can often be easily imported to an SRGM spreadsheet tool for analysis. 7.2.2.5 Analyze the Data The most important characteristic to assess from the SRGM plot is whether the defect discovery rate is linear, or if the curve is bending over and asymptotically approaching the predicted number of stability-impacting defects for the particular operational profile. Several different curve-fitting model strategies can be used to predict the asymptote and slope, but a simple visual analysis will often show whether the defect discovery rate is slowing down. A curve gently approaching an asymptote suggests that new stability-impacting defects are becoming harder and harder for system testers to find, and, thus, that acceptable product quality/reliability is imminent or has been reached. A linear defect-discovery rate indicates that there are still lots of stability defects being discovered during testing and, thus, that the product is not sufficiently stable to ship to customers. Figure 7.4 shows the three stages of the software debugging process. A highly effective visualization technique is to simply overlay the SRGM curves for all known releases, as shown in Figure 7.2, making it very easy to see where the current release is at any point in time compared to previous releases. Typically, the cumulative defect data show either a concave or an S-shaped pattern. The concave curve depicts the natural de-

Table 7.2. Sample SRGM data table Week index Week 1 Week 2 Week 3 ...

Cumulative test effort (machine-hours)

Cumulative stability impacting defects

255 597 971 ...

8 17 28 ...

c07.qxd

2/10/2009

9:49 AM

Page 125

7.2

SOFTWARE FAILURE RATE

125

Figure 7.4. Software debugging stages.

fect debugging process; that is, the total number of detected defects increases as testing continues. The total number of detected defects grows at a slower slope and eventually approaches the asymptote as the software becomes stable. The S-shaped curve, on the other hand, indicates a slower defect detection rate at the beginning of the debugging process, which could be attributed either to a learning curve phenomenon [Ohba84, Yamada92], or to software process fluctuations [Rivers98], or a combination of both. Then the detected defects accumulate at a steeper slope as testing continues and, eventually, the cumulative defect curve approaches an asymptote. Accordingly, the SRGM models can be classified into two groups: concave and S-shaped models, as shown in Figure 7.5. There are more than 80 software reliability group models in the literature (see Appendix C for details). In practice, a few of the early models find wide applications in real projects. There are two reasons for this: 1. Most of the models published later were derived from the early models and they typically introduce more parameters to incorporate different assumptions about the debugging process, such

c07.qxd

2/10/2009

126

9:49 AM

Page 126

ESTIMATING INPUT PARAMETERS FROM LABORATORY DATA

Figure 7.5. Concave versus S-shaped SRGMs.

as debugging effort, fault introduction, imperfectness of the debugging process, and so on, but they are not fundamentally different from the early, simpler models. 2. Models with more parameters require larger datasets, which can be a realistic limitation. A few frequently used SRGM models are described below. We previously discussed the GO model, which is one of the earliest but most widely used models. Its mean value function of m(t) = a(1 – e–bt) has a concave shape, where a represents the total number of software defects and b represents the average failure rate of a fault. So the GO model assumes that there are a fixed number of total defects in the software and, on average, these defects cause a failure with a rate b. The mean value function m(t) represents the number of expected defects found by time t. Another concave SRGM is the Yamada exponential model [Yamada86]. Similar to the GO model, this model assumes a constant a for the total number of defects, but its fault detection function incorporates a time-dependent exponential testing-effort function. The mean value function for this model is m(t) = a(1 – e–r␣[1–e(–␤t)]). Figure 7.6 is an example demonstrating the close fit of both GO and Yamada exponential models to a given (concave) dataset.

c07.qxd

2/10/2009

9:49 AM

Page 127

7.2

SOFTWARE FAILURE RATE

127

Figure 7.6. Concave SRGM model examples.

The delayed S-shaped model [Yamada83] and inflexion Sshaped model [Hossain93] are two representative S-shaped models. The delayed S-shaped model is derived from the GO model; modifications were made to make it S-shaped. The inflexion Sshaped model is also extended from the GO model; the defect detection function is modified to make it S-shaped. The mean value function of the two models are m(t) = a[1 – (1 + bt)e–bt] and m(t) = a(1 – e–bt)/(1 + ␤e–bt), which reduces to the GO model if ␤ = 0. Some models [Pham99] can be either concave or S-shaped, that is, these models can have different shapes when fitted to different data. Typically, these models have more parameters in them. They often have greater goodness-of-fit power but, on the other hand, more parameters need to be estimated. Figure 7.7 shows these models fitted to an S-shaped dataset. When applying these models to the data, the first criterion is that the model pattern should match the data pattern. Then the parameters in the model need to be estimated (see Parameter Estimation in Appendix C for details) and the model that provides the best fit is selected (see Model Selection in Appendix C for details). For some data, several SRGMs might provide relatively close results, which is confirming. In this situation, one of these models (typically the one with the fewest parameters) can be used from a practical point of view. Sometimes, the curves might look close but it is good practice to check the goodness-of-fit readings (for details, see Section 2 in Appendix C). Commercial tools and self-developed

c07.qxd

2/10/2009

128

9:49 AM

Page 128

ESTIMATING INPUT PARAMETERS FROM LABORATORY DATA

Figure 7.7. S-shaped SRGM model examples.

programs are typically used to estimate the parameters. The number of residual defects and the average failure rate of a fault are obtained, from which the software failure rate can be predicted. Typically, the cumulative defects and cumulative testing time are input to these software tools and the tools (1) produce estimates of the parameters in the models and (2) compare the fitted curve with the raw data to show the goodness of fit. The software failure rate and the number of residual defects can then be calculated. In addition to the mathematical curve-fitting technique, if one has both validated the field software failure rate and completed SRGM plots for one or more previous releases, then one can “eyeball” the curves and may be able to visually estimate the overall software failure rate. 7.2.2.6 Factor the Software Failure Rate Typically, architecture-based system availability models will require finer grained software failure rate inputs than a single overall failure rate estimate. For example, failure rates might be estimated to the FRU or even the processor, to the application versus platform level, or even down to the module/subsystem level. Thus, it is often necessary to factor this overall software failure rate into constituent failure rates, which can be input to the availability model. Several strategies for factoring the overall failure rate are: 앫 By defect attribution. Defects are typically assigned against a specific software module (or perhaps FRU), thus making it very

c07.qxd

2/10/2009

9:49 AM

Page 129

7.3

COVERAGE FACTORS

129

easy to examine the distribution of stability-impacting defects by FRU. 앫 By software size. Many projects track or estimate lines of new/changed code by software module, thus making it easy to examine the distribution of new/changed code across a release. 앫 Other qualitative factors, like module complexity, reuse rate, frequency of execution, and so on, can also be considered, as can expert opinion of software architects and developers and system testers. One or more of these factors can be used to allocate the software failure rates to software modules, which are direct inputs for system availability models. Once the failure rates of software modules are determined, two objectives can be achieved: (1) rank the software modules according to their failure rates and identify the high-risk software modules, and (2) feed the software failure rates and other statistics analyzed from testing data, such as fault coverage and failure recovery times, back to the architecture-based models and update the availability prediction.

7.3

COVERAGE FACTORS

Fault-insertion testing provides one of the best mechanisms for estimating the coverage factor, and is recommended in standards such as GR-282-CORE, Software Reliability and Quality Acceptance Criteria (SRQAC), objective O3-11[7], which states: “It is desirable that the system be tested under abnormal conditions utilizing software fault insertion.” Fault-insertion testing is one of the best practices for creating abnormal conditions, so in many cases fault-insertion testing can be used to achieve multiple goals. The coverage factor can be estimated approximately from the results of fault-insertion testing by averaging the first and final test pass rates. Assuming that system testers choose appropriate (i.e., nonredundant) faults to insert into the system and set correct pass criteria, the probability of the system correctly addressing an arbitrary fault occurring in the field is expected to fall between the first-pass test-pass rate (p1) for fault insertion tests and the finalpass test-pass rate (p2). We use the first-pass test-pass rate to proxy the coverage for untested faults and the final-pass test-pass rate to proxy the coverage for tested faults. Assume that f represents the percentage of the entire fault population that is not covered by the

c07.qxd

2/10/2009

130

9:49 AM

Page 130

ESTIMATING INPUT PARAMETERS FROM LABORATORY DATA

selected/inserted faults. Mathematically, the coverage of the system can be estimated as Coverage factor ⬇ f × p1 + (1 – f) × p2

(7.2)

where f represents the fraction of the fault population that is not tested by the selected faults, p1 is first-pass test-pass rate, and p2 is final-pass test-pass rate. As a starting point, one can set f to 50% and, thus, the coverage factor estimation simplifies as follows: p1 + p2 Coverage factor ⬇ ᎏ 2

(7.3)

Coverage factors should be estimated separately for hardware and software by considering only software fault-insertion test cases and results when calculating the software coverage factor, and hardware cases and results when calculating hardware coverage factor. As best practice is to attempt several dozen hardware fault-insertion tests against complex boards, one can often estimate the hardware coverage factor for at least some of the major boards in a system; when possible, those board-specific hardware coverage factors should be used.

7.4 7.4.1

TIMING PARAMETERS Covered Failure Detection and Recovery Time

Recovery time for covered hardware failures is measured during appropriate fault-insertion testing. Measured time is from start of service impairment (often the same as fault-insertion time) to service restoration time. Note that this should include the failure detection duration, although the fault detection duration might be very short compared to the recovery duration. Best practice is to execute each test case several times and use the median value of the measured detection plus recovery latencies for modeling. The covered software failures are typically recovered by process restart, task/process/processor failover, or processor/ board reboot. One-second resolution is best for most systems, but 6 seconds (0.1 minute) or 15 seconds (0.25 minute) are also acceptable.

c07.qxd

2/10/2009

9:49 AM

Page 131

7.4

7.4.2

TIMING PARAMETERS

131

Uncovered Failure Detection and Recovery Time

Uncovered failure recovery time is often estimated from field performance of similar products deployed in similar situations and customers. Results of serviceability studies can reveal differences that might shorten or lengthen uncovered failure recovery time relative to similar products. Typical uncovered failure recovery times for equipment in staffed locations are: 앫 Uncovered failure detection time on an active unit. Thirty minutes is generally a reasonable estimate of the time to detect an uncovered fault on an active unit. Elements that are not closely monitored by the customer and/or elements that are not frequently used might use longer uncovered failure detection times; uncovered failures on some closely monitored systems might be detected even faster than 30 minutes. 앫 Uncovered failure detection time on a standby unit. Twenty-four hours is often assumed for uncovered failure detection time on standby units. For example, best practice is for customers to perform routine switchovers onto standby units periodically (e.g., every week or every day) to verify both that all standby hardware and configurations are correct, and that staff are well practiced on emergency recovery procedures. Elements with standby redundancy that are more closely monitored and execute routine switchovers may use shorter uncovered failure detection times on standby units; customers with less rigorous maintenance policies (e.g., not frequently exercising standby units) might have longer uncovered failure detection times on standby units. 7.4.3

Automatic Failover Time

Automatic failover times are measured in the laboratory during switchover tests. Best practice is to repeat the switchover tests several times and use the median value in availability modeling. 7.4.4

Manual Failover Time

If the manual failover time for a previous release or similar product is available, then use that value. That value could be refined based on the results of a serviceability assessment. Thirty minutes is a typical value for equipment in staffed offices.

c07.qxd

2/10/2009

132

7.5 7.5.1

9:49 AM

Page 132

ESTIMATING INPUT PARAMETERS FROM LABORATORY DATA

SYSTEM-LEVEL PARAMETERS Automatic Failover Success Rate

Failover success is estimated as the percentage of automatic switchover tests that succeed. Fortunately, the automatic failover success rate is fairly easy to estimate from switchover testing that is routinely performed on highly available systems. Binomial statistics can be used to calculate the number of tests that must be made to establish an automatic failover success rate with reasonable (60%) and high (90%) statistical confidence. Assume that we need N tests to demonstrate a failover success probability of p. The risk of having n failures (here n = 0, 1, 2, etc.) can be calculated by a binomial distribution: Pr(n) = 冱 n

N

冢 n 冣(1 – p) p

n (N–n)

N can be calculated by associating the risk with the confidence level. Table 7.3 and Table 7.4 summarize the number of test attempts N needed for different numbers of failures for 60% and 90% confidence levels. The left-most column shows the target success rate parameter; the remaining columns show how many tests must be completed to demonstrate that success rate with

Table 7.3. Test case iterations for 60% confidence

Failover success probability 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 99.5% 99.9%

Number of test interations to demonstrate 60% confidence, assuming 0 failures 11 13 14 17 19 23 29 40 60 120 240 1,203

1 failures 20 22 25 29 34 40 51 67 101 202 404 2,025

2 failures 31 34 38 44 51 62 78 103 155 310 621 3,107

c07.qxd

2/10/2009

9:49 AM

Page 133

7.5

SYSTEM-LEVEL PARAMETERS

133

Table 7.4. Test case iterations for 90% confidence Number of test interations to demonstrate 90% confidence, assuming

Failover success probability

0 failures

90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 99.5% 99.9%

22 24 28 32 37 45 56 76 114 229 459 2,301

1 failures 38 42 48 55 64 77 96 129 194 388 777 3,890

2 failures 52 58 65 75 88 105 132 176 265 531 1065 5,325

zero, one, or two failures. Ideally, the automatic failover success rate is specified in a product requirements document, and this drives the system test team to set the number of test iterations to demonstrate that success rate to the appropriate confidence level. On the other hand, for a given test plan and execution, failover success probability can be estimated from the test results. If N failover tests are attempted from which n tests pass, then the failover success probability can be estimated pˆ = n/N

(7.4)

A 100(1 – ␣)% confidence interval for p is given by ᎏ , pˆ + z ᎏ冣 冢 pˆ – z 冪莦 冪莦 N N p(1 ˆ – p) ˆ

␣/2

p(1 ˆ – p) ˆ

␣/2

where z␣/2 is the upper ␣/2 percentage point of standard normal distribution. It takes the values of 0.84 and 1.64 for 60% and 90% confidence levels, respectively. This method is based on normal approximation to the binomial distribution. To be reasonably conservative, this requires that n(1 – p) be greater than 5. If n is large and (1 – p) is small (57×HW-MTBF >35×HW-MTBF >12×HW-MTBF 57 >35 >12 100 >50 >20 ) summarizes the intention of each section and would be deleted from an actual reliability report.

Reliability Report for the Widget System Version 1.0, January, 2009 Contact: John Smith, Reliability Engineer, Widgets’r’Us (John. [email protected]; +1-212-555-1234).

1 1.1

EXECUTIVE SUMMARY Architectural Overview

Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.

215

bappa.qxd

2/8/2009

216

5:54 PM

Page 216

SYSTEM RELIABILITY REPORT OUTLINE

1.1.1 Hardware Platform

1.1.2 Software Architecture

1.2

Reliability and Availability Features

1.2.1 Reliability and Availability Features Provided in the System Hardware

1.2.2 Reliability and Availability Features Provided in the System Software

bappa.qxd

2/8/2009

5:54 PM

Page 217

SYSTEM RELIABILITY REPORT OUTLINE

2 2.1

217

RELIABILITY REQUIREMENTS Availability Objective

2.2

Downtime Requirements

2.3

Hardware Failure Rate Requirements

2.4

Service Life Requirements

3 3.1

UNPLANNED DOWNTIME MODEL AND RESULTS Unplanned Downtime Model Methodology

bappa.qxd

2/8/2009

5:54 PM

Page 218

218

SYSTEM RELIABILITY REPORT OUTLINE

3.2

Reliability Block Diagrams

3.3

Standard Modeling Assumptions

3.4 Product-Specific Assumptions, Predictions, and Field or Test Data

3.4.1 Hardware Failure Rates

3.4.2 Software Failure Rates

3.4.3 Failover and Software Recovery Times

3.4.4 Coverage Factors

3.5

System Stability Testing

3.6

Unplanned Downtime Model Results

ANNEX A—RELIABILITY DEFINITIONS

ANNEX B—REFERENCES .

bappa.qxd

2/8/2009

220

5:54 PM

Page 220

SYSTEM RELIABILITY REPORT OUTLINE

ANNEX C—MARKOV MODEL STATE-TRANSITION DIAGRAMS

bappb.qxd

2/8/2009

APPENDIX

5:55 PM

Page 221

B

RELIABILITY AND AVAILABILITY THEORY

Reliability and availability evaluation of a system (hardware and software) can help answer questions like “How reliable will the system be during its operating life?” and/or “What is the probability that the system will be operating as compared to out of service?” System failures occur in a random manner and failure phenomena can be described in probabilistic terms. Fundamental reliability and availability evaluations depend on probability theory. This chapter describes the fundamental concepts and definitions of reliability and availability of a system. B.1

RELIABILITY AND AVAILABILITY DEFINITIONS

Reliability is defined as “the probability of a device performing its purpose adequately for the period of time intended under the operating conditions encountered” [Bagowsky61]. The probability is the most significant index of reliability but there are many parameters used and calculated. The term reliability is frequently used as a generic term describing the other indices. These indices are related to each other and there is no single all-purpose reliability formula or technique to cover the evaluation. The following are examples of these other indices: 앫 The expected number of failures that will occur in a specific period of time 앫 The average time between failures 앫 The expected loss of service capacity due to failure 앫 The average outage duration or downtime of a system 앫 The steady-state availability of a system Practical System Reliability. By Eric Bauer, Xuemei Zhang, and Douglas A. Kimber Copyright © 2009 the Institute of Electrical and Electronics Engineers, Inc.

221

bappb.qxd

2/8/2009

222

5:55 PM

Page 222

RELIABILITY AND AVAILABILITY THEORY

The approaches taken and the resulting formula should always be connected with an understanding of the assumptions made in the area of reliability evaluation. Attention must be paid to the validation of the reliability analysis and prediction to avoid significant errors or omissions. B.1.1

Reliability

Mathematically, reliability, often denoted as R(t), is the probability that a system will be successfully operating during the mission time t: tⱖ0

R(t) = P(T > t),

(1)

where T is a random variable denoting the time to failure. In other words, reliability is the probability that the value of the random variable T is greater than the mission time t. Probability of failure, F(t), is defined as the probability that the system will fail by time t: F(t) = P(T ⱕ t),

tⱖ0

(2)

In other words, F(t) is the failure distribution function, which is often called the cumulative failure distribution function. The reliability function is also known as the survival function. Hence, R(t) = 1 – F(t)

(3)

The derivative of F(t), therefore, gives a function that is equivalent to the probability density function, and this is called the failure density function, f(t), where dF(t) dR(t) f(t) = ᎏ = – ᎏ dt dt

(4)

Or, if we integrate both sides of Equation (4),

冕 f(t)dt t

F(t) =

(5)

0

and

冕 f(t)dt = 冕



t

R(t) = 1 –

0

t

f(t)dt

(6)

bappb.qxd

2/8/2009

5:55 PM

Page 223

B.1

RELIABILITY AND AVAILABILITY DEFINITIONS

223

In the case of discrete random variables, the integrals in Equations (5) and (6) can be replaced by summations. A hypothetical failure density function is shown in Figure B.1, where the values of F(t) and R(t) are illustrated by the two appropriately shaded areas. F(t) and R(t) are the areas under their respective portions of the curve. Some readers may find it more intuitive to start with the failure density function shown in Figure B.1 and go from there. The “bell curve” of the normal distribution is a probability density function that most readers are probably familiar with, although it typically does not have time as the horizontal axis. The failure density function shows the probability of the system failing at any given point in time. Because the sum of all the probabilities must be 1 (or 100%), we know the area under the curve must be 1. The probability of failing by time t is thus the sum of the probabilities of failing from t = 0 until time t, which is the integral of f(t) evaluated between 0 and t. Reliability is the probability that the system did not fail by time t and is, thus, the remainder of the area under the curve, or the area from time t to infinity. This is the same as the integral of f(t) evaluated from t to infinity. B.1.2

System Mean Time to Failure (MTTF)

Mean time to failure (MTTF) is the expected (average) time that the system is likely to operate successfully before a failure occurs.

Figure B.1. Hypothetical failure density function. F(t) = probability of failure by time t, R(t) = probability of survival by time t.

bappb.qxd

2/8/2009

224

5:55 PM

Page 224

RELIABILITY AND AVAILABILITY THEORY

By definition, the mean or expected value of a random variable is the integral from negative infinity to infinity of the product of the random variable and its probability density function. Thus, to calculate mean time to failure we can use Equation (7), where f(t) is the probability density function and t is time. We can limit the integral to values of t that are zero or greater, since no failures can occur prior to starting the system at time t = 0.





MTTF =

tf(t)dt

(7)

0

Substituting for f(t) using Equation (4), dR(t) f(t) = – ᎏ dt Equation (7) then becomes





MTTF = –

tdR(t)dt

0







(8)



= –tR(t) 0 +

R(t)dt

0

The first term in Equation (8) equals zero at both limits. It is zero when t is zero precisely because t is zero, and it is zero when t is infinite because the probability of the component continuing to work (i.e., surviving) forever is zero. This leaves the MTTF function as





MTTF =

R(t)dt

(9)

0

B.1.3

Failure Rate Function (or Hazard Rate Function)

In terms of failure, the hazard rate is a measure of the rate at which failures occur. It is defined as the probability that a failure occurs in a time interval [t1, t2], given that no failure has occurred prior to t1, the beginning of the interval. The probability that a system fails in a given time interval [t1, t2] can be expressed in terms of the reliability function as

bappb.qxd

2/8/2009

5:55 PM

Page 225

B.1

P(t1 < T ⱕ t2) =



t2





f(t)dt =





f(t)dt –

t1

t1

225

RELIABILITY AND AVAILABILITY DEFINITIONS

t2

f(t)dt = R(t1) – R(t2)

where f(t) is again the failure density function. Thus, the failure rate can be derived as R(t1) – R(t2) ᎏᎏ (t2 – t1)R(t1)

(10)

If we redefine the interval as [t, t + ⌬t], Equation (10) becomes R(t) – R(t + ⌬t) ᎏᎏ ⌬tR(t) The hazard function is defined as the limit of the failure rate as the interval approaches zero. Thus, the hazard function h(t) is the instantaneous failure rate, and is defined by R(t) – R(t + ⌬t) h(t) = lim ᎏᎏ ⌬t씮0 ⌬tR(t)



1 –dR(t) =ᎏ ᎏ R(t) dt



(11)

–dR(t) =ᎏ R(t)dt Integrating both sides and noticing the right side is the definition of the natural logarithm, ln, of R(t) yields

冕 h(t) = –冕 t

t

0

0

dR(t) ᎏ R(t)dt

冕 h(t) = –ln[R(t)] t

(12)

0

冤 冕 h(t)dt冥 t

R(t) = exp –

0

For the special case where h(t) is a constant and independent of time, Equation (12) simplifies to R(t) = e–ht

(13)

bappb.qxd

2/8/2009

226

5:55 PM

Page 226

RELIABILITY AND AVAILABILITY THEORY

This special case is known as the exponential failure distribution. It is customary in this case to use ␭ to represent the constant failure rate, yielding the equation R(t) = e–␭t

(14)

Figure B.2 shows the hazard rate curve, also known as a bathtub curve (discussed in more detail in Chapter 5, Section 5.2.2.1.1), which characterizes many physical components. B.1.4

Availability

Reliability is a measure of successful system operation over a period of time or during a mission. During the mission time, no failure is allowed. Availability is a measure that allows for a system to be repaired when failures occur. Availability is defined as the probability that the system is in normal operation. Availability (A) is a measure of successful operation for repairable systems. Mathematically, System uptime A = ᎏᎏᎏᎏᎏ System uptime + System downtime Or, because the system is “up” between failures, MTTF A = ᎏᎏ MTTF + MTTR where MTTR stands for mean time to repair.

Figure B.2. Bathtub curve.

(15)

bappb.qxd

2/8/2009

5:55 PM

Page 227

B.1

RELIABILITY AND AVAILABILITY DEFINITIONS

227

Another frequently used term is mean time between failures (MTBF). Like MTTF and MTTR, MTBF is an expected value of the random variable time between failures. Mathematically, MTBF = MTTF + MTTR If MTTR can be reduced, availability will increase. A system in which failures are rapidly diagnosed and recovered is more desirable than a system that has a lower failure rate but the failures take a longer time to be detected, isolated, and recovered. Figure B.3 shows pictorially the relationship between MTBF, MTTR, and MTTF. From the figure it is easy to see that MTBF is the sum of MTTF and MTTR. B.1.5

Downtime

Downtime is an index associated with the service unavailability. Downtime is typically measured in minutes per service year: Downtime = 525,960 × (1 – Availability) min/yr

(16)

where 525,960 is number of minutes in a year. Other indices are also taken into consideration during reliability evaluations; the above are the major ones.

Figure B.3. MTTR, MTBF, and MTTF.

bappb.qxd

2/8/2009

5:55 PM

228

Page 228

RELIABILITY AND AVAILABILITY THEORY

B.2 PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION This section presents some of the common distribution functions and their related hazard functions that have applications in reliability evaluation. A number of standard distribution functions are widely used: binomial, Poisson, normal, lognormal, exponential, gamma, Weibull, and Rayleigh. Textbooks such as [Bagowsky61], [Shooman68], [Pukite98], and [Crowder 94] provide detailed documentation on probability models and how statistical analyses and inferences about those models are developed. B.2.1

Binomial Distribution

Consider an example of tossing a coin. There are two independent outcomes: heads or tails. The probability of getting one of the two outcomes at each time the coin is tossed is identical (we assume the coin is “fair”). Let us assume that the probability of getting a head is p, and the probability of getting a tail is q. Since there are only two outcomes, we have p + q = 1. For a given number of trials, say n, the probability of getting all of them as heads is pn, the probability of getting (n – 1) heads and one tail is npn–1q, the probability of getting (n – 2) heads and two tails is [n(n – 1)/2!]pn–2q2. This goes on, and the probability of getting all tails is qn. Hence, the general probability of getting all the possible outcomes can be summarized as n(n – 1) n(n – 1) . . . n(n – r + 1) pn + npn–1q + ᎏ pn–2q2 + ᎏᎏᎏ pn–rqr 2! r! + . . . + qn = (p + q)n In reliability evaluation, the outcomes can be modified to success or failure. Consider n trials with the outcome of r successes and (n – r) failures. The probability of this outcome can be evaluated as follows: Pr = C nr prq(n–r) n! = ᎏᎏ prqn–r r!(n – r)!

(17)

where C nr denotes the combination of r successes from total n trials.

bappb.qxd

2/8/2009

5:55 PM

Page 229

B.2

PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION

229

For all possible outcomes, we have n

(p + q)n = 冱 C nr prqn–r = 1 r=0

Example: For a given manufacturing process, it is known that the product defect rate is 1%. If an average customer purchases 100 of these products selected at random, what is the probability that he/she receives two or less defective products? In this example, n = 100, p = 0.01, q = 0.99, and r = 0,1,2, therefore: Pr(2 or less defects) = Pr(2 defects) + Pr(1 defect) + Pr(0 defects) 2

r (0.01)r(0.99)100–r = 冱 C100 r=0

= 0.1849 + 0.3697 + 0.3660 = 0.9206 It can be proven that the mean and the variance for binomial distribution are E(X) = np and V(X) = npq

B.2.2

Poisson Distribution

Like binomial distributions, Poisson distributions are used to describe discrete random events. The major difference between the two is that in a Poisson distribution, only the occurrence of an event is counted, and its nonoccurrence is not counted, whereas a binomial distribution counts both the occurrence and the nonoccurrence of events. Examples of a Poisson distribution are: 앫 The number of people coming to a bus stop 앫 The number of failures of a system 앫 The number of calls in a given period Assume that the average failure rate of a system is ␭ and the number of failures by time t is x. Then the probability of having x failures by time t is given by

bappb.qxd

2/8/2009

230

5:55 PM

Page 230

RELIABILITY AND AVAILABILITY THEORY

(␭t)xe–␭t Pr(X = x) = ᎏ x!

for x = 0, 1, 2 . . .

(18)

The mean and the variance of a Poisson distribution are given by E(X) = ␭t and V(X) = ␭t B.2.2.1 Relationship with the Binomial Distribution It can be shown that for large sample size n (n Ⰷ r) and small p (p ⱕ 0.05), the Poisson and binomial distributions are identical. That is, np = ␭t and r=x B.2.3

Exponential Distribution

The exponential distribution is one of the most widely used probability distributions in reliability engineering. The most important characteristic of the exponential distribution is that the hazard rate is constant, in which case it is defined as the failure rate. The failure density function is given by f(t) = ␭e–␭t

t > 0, f(t) 0 otherwise

(19)

and the reliability function is R(t) = e–␭t Figure B.4 shows the exponential reliability functions. It can be proven that the mean and variance of the exponential distribution are: E(T) = 1/␭

bappb.qxd

2/8/2009

5:55 PM

Page 231

B.2

PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION

231

Figure B.4. Exponential reliability functions.

and V(t) = 1/␭2 We can see that the mean time to failure (MTTF) for the exponential distribution is the reciprocal of the failure rate ␭. Another property of the exponential distribution is known as the memoryless property, that is, the conditional reliability function for a component that has survived to time s is identical to that of a new component. Mathematically, we have Pr(T ⱖ t) = Pr(T ⱖ t + s/T ⱖ s),

for t > 0, s > 0

The exponential distribution is used extensively in the analysis of repairable systems in which components cycle between upstates and downstates. For example, in Markov models, the memoryless property is the fundamental assumption that characterizes failure and recovery distributions. B.2.4

Weibull Distribution

The exponential distribution is limited in its application due to the memoryless property. The Weibull distribution, on the other hand, is a generalization of the exponential distribution. It has a very important property—the distribution has no specific characteristic shape. In fact, depending on what the values of the parameters are in its reliability function, it can be shaped to represent many different distributions and it can be shaped to fit to experimental data that cannot be characterized as a particular distribution. This makes the Weibull (and a few other distribution func-

bappb.qxd

2/8/2009

232

5:55 PM

Page 232

RELIABILITY AND AVAILABILITY THEORY

tions such as gamma and lognormal, which will be discussed later) a very important function in experimental data analysis. The three-parameter probability density function of the Weibull distribution is given by

␤(t)␤–1 –(t/␪)␤ f(t) = ᎏ e ␪␤

for t ⱖ 0

(20)

where ␪ is known as the scale parameter and ␤ is the shape parameter. The reliability function is ␤

for t > 0, ␤ > 0, ␪ > 0

R(t) = e–(t/␪)

The hazard function is

␤(t)␤–1 ␭(t) = ᎏ ␪␤

for t > 0, ␤ > 0, ␪ > 0

The mean and variance of the Weibull distribution are





1 E(X) = ␪⌫ ᎏ + 1 ␤ and

冤冢





2 1 V(X) = ␪2 ⌫ 1 + ᎏ – ⌫2 1 + ᎏ ␤ ␤

冣冥

where ⌫ represents the gamma function. There are two special cases of the Weibull distribution that deserve mention. When ␤ = 1, the failure density function and the hazard function reduce to 1 f(t) = ᎏ e–(t/␪) ␪ and 1 ␭(t) = ᎏ ␪ The failure density is identical to the exponential distribution.

bappb.qxd

2/8/2009

5:55 PM

Page 233

B.2

PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION

233

When ␤ = 2, the failure density function and the hazard function reduce to 2t –(t2/␪2 f(t) = ᎏ e ) ␪2 and 2t ␭(t) = ᎏ ␪2 The failure density function is identical to the Rayleigh distribution which is discussed next. It can be shown that

␤ < 1 represents a decreasing hazard rate of the burn-in period ␤ = 1 represents a constant hazard rate of the normal life period ␤ > 1 represents an increasing hazard rate of the wear-out period So, in this sense, the hazard rate function of the Weibull distribution can be connected to the bathtub curve we discussed in the main text. B.2.5

Rayleigh Distribution

The Rayleigh distribution is a special case of the Weibull distribution and has only one parameter. Besides its use in reliability engineering, the Rayleigh distribution is used to analyze noise problems associated with communications systems. It has also been used in some software reliability growth models. The failure density function is 2t –(t2/␪2) f(t) = ᎏ e ␪2

(21)

A more general form of the Raleigh density function is f(t) = kte–(kt

2/2)

where k is the only parameter. When k = 2/␪2, the Rayleigh distribution is equivalent to the Weibull distribution for ␤ = 2. The Rayleigh distribution is a singleparameter function and k is both the scale and shape parameter.

bappb.qxd

2/8/2009

234

5:55 PM

Page 234

RELIABILITY AND AVAILABILITY THEORY

The Rayleigh reliability function is 2/2)

R(t) = e–(kt and the hazard function is

␭(t) = kt which is a linearly increasing hazard rate with time. This characteristic gives the Rayleigh distribution its importance in reliability evaluation. B.2.6

The Gamma Distribution

Similar to the Weibull distribution, the gamma distribution has a shape parameter (␤ is conventionally used for the gamma distribution shape parameter) and a scale parameter (␣ for the gamma distribution). By varying these parameters, the gamma distribution can be used to fit a wide range of experimental data. The failure density function is given by t␤–1 –(t/␣) f(t) = ᎏ e ␣␤⌫(␤)

for t ⱖ 0, ␣ > 0, ␤ > 0

(22)

and ⌫(␤) is defined as ⌫(␤) =





t␥–1e–t dt

0

Note that for integer values of ␥, ⌫(␤) reduces to ⌫(␤) = (␥ – 1)! The reliability function is given by





R(t) =

t

t␤–1 –(t/␣) ᎏ e dt ␣␤⌫(␤)

There are two special case of gamma distribution. They are when ␤ = 1 and when ␤ is an integer. When ␤ = 1 the failure density reduces to 1 f(t) = ᎏ e–(t/␣) ␣ Again, this is identical to the exponential distribution.

bappb.qxd

2/8/2009

5:55 PM

Page 235

B.2

PROBABILITY DISTRIBUTIONS IN RELIABILITY EVALUATION

235

When ␤ is an integer, the failure density reduces to t␤–1 f(t) = ᎏᎏ e–(t/␣) ␣␤(␤ – 1)! This density function is known as the special Erlangian distribution, which can be shown as ␤–1 t R(t) = e–(t/␣) 冱 ᎏ i=0 ␤

冢 冣

i

1 ᎏ j!

It can be shown that the mean and variance for the gamma distribution function is: E(t) = ␣␤ and V(t) = ␣2␤ B.2.7

The Normal Distribution

The normal distribution, also known as the Gaussian distribution, is probably the most important and widely used distribution in statistics and probability. It is of less significance in the reliability field. In the reliability field, it has applications in measurements of product susceptibility and external stress. Taking the wellknown bell shape, the normal distribution is perfectly symmetrical about its mean and the spread is measured by its variance. The failure density function is given by 1 2 f(t) = ᎏ e–1/2[(t–␮)/␴] ␴兹2 苶␲ 苶

(23)

where ␮ is the mean value and ␴ is the standard deviation. The larger ␴ is, the flatter the distribution. The reliability function is



t

R(t) = 1 –

–⬁

1 2 ᎏ e–1/2[(s–␮)/␴] ds ␴兹2 苶␲ 苶

The mean value and the variance of the normal distribution are given by

bappb.qxd

2/8/2009

236

5:55 PM

Page 236

RELIABILITY AND AVAILABILITY THEORY

E(t) = ␮ and V(t) = ␴2 B.2.8

The Lognormal Distribution

Like the normal distribution, the lognormal distribution also has two parameters. It has not been considered as an important distribution to model component lifetimes, but it can be a good fit to the distribution of component repair times in modeling repairable systems. The density function is given by 1 2 2 f(t) = ᎏ e–(ln t–␮) /2␴ t␴兹2 苶␲ 苶

(24)

where ␮ and ␴ are the parameters in the distribution function, but they are not the mean and variance of the distribution. Instead, they are the mean and variance of the natural logarithm of the random variable. ␴ is a shape parameter and ␮ is a scale parameter. The mean and variance of the distribution are 2/2)

E(t) = e␮+(␴ and 2

2

V(t) = e2␮+ ␴ [e␴ – 1] The cumulative distribution function for the lognormal distribution is 1 冕ᎏ e s␴兹2 苶␲ 苶 t

F(t) =

–(ln s–␮)2/2␴2

ds

(25)

0

and this can be related to the standard normal derivate Z by ln t – ␮ F(t) = P[T ⱕ t] = P[ln T ⱕln t] = P Z ⱕ ᎏ ␴





bappb.qxd

2/8/2009

5:55 PM

Page 237

B.3

ESTIMATION OF CONFIDENCE INTERVALS

237

Therefore, the reliability function is given by ln t – ␮ R(t) = p Z > ᎏ ␴





(26)

and the hazard function is ln t – ␮ ␾ ᎏᎏ f(t) ␴ ␭(t) = ᎏ = ᎏᎏ R(t) ␴tR(t)



B.2.9



(27)

Summary and Conclusions

This appendix has presented the most important probability distributions that are likely to be encountered in reliability evaluations. Some readers might be familiar with the concepts and distributions described here. In this case, this material can be used as a reference. For those who have not been previously exposed to this area, it is intended to provide some basic understanding of the fundamental distributions.

B.3

ESTIMATION OF CONFIDENCE INTERVALS

In estimation theory, it is assumed that the desired information is embedded in a noisy signal. Noise adds uncertainty and if there were no uncertainty then there would be no need for estimation. An estimator attempts to approximate the unknown parameters using the measurements. This is known as parametric estimation. On the other hand, methods that analyze the observed data and arrive at an estimate without assuming any underlying parametric function are called nonparametric estimation. In reliability engineering, the parametric approach is widely used since the physical failure processes can be well captured by the parametric distributions. Throughout this book, we discussed obtaining the point estimators for the reliability parameters. However, getting a point estimator is typically not sufficient for understanding the uncertainties or confidence levels of the estimation. Confidence interval estimation addresses this problem by associating the probability of covering the true value with upper and lower bounds. In this

bappb.qxd

2/8/2009

238

5:55 PM

Page 238

RELIABILITY AND AVAILABILITY THEORY

section, we will focus on the development of confidence intervals for two important reliability metrics: failure rate and unavailability. B.3.1

Confidence Intervals for Failure Rates

We have discussed using probability distributions to model the random variable of time to failure. Once some failure data are recorded after a certain time of operation, they can be used to obtain the estimates of the parameters in the distributions. In this section, we use exponential distribution as an example to explain how to obtain the upper and lower bounds of the failure rate parameter (failure rate is ␭ in the exponential case) for a given confidence level. Assume that we arrived at an estimate for ␭, say ␭ˆ , which is used as an estimated value for the true failure rate. Then we need to calculate confidence bounds or a confidence interval, say [␭L, ␭U] for the failure rate, where ␭L is the lower bound and ␭U is the upper bound. The confidence intervals associate the point estimator with the error or confidence level. For example, having interval estimators [␭L, ␭U] with a given probability 1 – ␣ means that with a 100(1 – ␣)% probability, the true failure rate lies in between ␭L and ␭U. Here, ␭L and ␭U will be called 100(1 – ␣)% confidence limits. Let us derive the confidence intervals for the failure rate using the exponential distribution as an example. We begin by obtaining a point estimate—the maximum likelihood estimator (MLE) of the failure rate ␭. Assume that we observed n failures and xi denotes the time when the ith failure occurred. Let X1, X2, . . . , Xn be a random sample from the exponential distribution with pdf f(x; ␭) = ␭e–␭x

x > 0, ␭ > 0

The joint pdf of X1, X2, . . . , Xn is given by n

L(X, ␭) = ␭ne–␭⌺i=1xi

(28)

Function L(X, ␭) is called the likelihood function, which is the function of the unknown parameter ␭ and the real data, xi and n in this case. The parameter value that maximizes the likelihood func-

bappb.qxd

2/8/2009

5:55 PM

Page 239

B.3

ESTIMATION OF CONFIDENCE INTERVALS

239

tion is called the maximum likelihood estimator. The MLE can be interpreted as the parameter value that is most likely to explain the dataset. The logarithm of the likelihood function is called the log-likelihood function. The parameter value that maximizes the log-likelihood function will maximize the likelihood function. The log-likelihood function is n

ln L(X, ␭) = n ln ␭ – ␭冱 xi

(29)

i=1

The function ln(L) can be maximized by setting the first derivative of ln L with respect to ␭, equal to zero, and solving the resulting equation for ␭. Therefore, n ⭸ ln L n ᎏ = ᎏ – 冱 xi = 0 ⭸␭ ␭ i=1

This implies that n ␭ˆ = ᎏ n 冱 xi

(30)

i=1

The observed value of ␭ˆ is the maximum likelihood estimator of ␭, that is, the total number of failures divided by the total operating time. It can be proven that 2n(␭/␭ˆ ) = 2␭T follows a chisquared (␹2) distribution. T is the total accrued time on all units. Knowing the distribution of 2␭T allows us to obtain the confidence limits on the parameters as follows: 2 2 P[␹ 1–( ␣/2),2n < 2␭T < ␹ (␣/2),2n] = 1 – ␣

(31)

or, equivalently, that 2 ␹ 1–( ␹ 2(␣/2),2n ␣/2),2n P ᎏᎏ < 2␭T < ᎏ =1–␣ 2T 2T





This means that in (1 – ␣)% of samples with a given size n, the 2 2 ˆ random interval between ␭ˆ L = (␹ 1–( ␣/2),2n/2T) and ␭U = (␹ (␣/2),2n/2T) will contain the true failure rate.

bappb.qxd

2/8/2009

240

5:55 PM

Page 240

RELIABILITY AND AVAILABILITY THEORY

For the example shown in Chapter 6, Section 6.2.7, if after testing for T = 50,000 hours, n = 60 failures are observed, the point estimate of the failure rate is 60 ␭ˆ = ᎏ = 0.0012 failures/hour 50,000 For a confidence level of 90%, that is, ␣ = 1 – 0.9 = 0.1, we calculate the confidence intervals for the failure rate as 2 2 95.703 ␹ 1–( ␹ 0.95,120 ␣/2),2n ␭ˆ L = ᎏᎏ = ᎏᎏ = ᎏ = 0.000957 failures/hour 2 × 50,000 100,000 2T

and 2 ␹ 2(␣/2),2n ␹ 0.05,120 146.568 ␭ˆ U = ᎏ = ᎏᎏ = ᎏ = 0.001465 failures/hour 2 × 50,000 100,000 2T

Let us discuss an example of associating confidence level with the failure rate bounds based on the recorded data—estimating the failure rate bounds for a given confidence level (say 95%) if zero failures have occurred in time t. Assume that a Poisson distribution is used to model the failure process. The probability of x failures or less in a total time t is x (t/m)ke–t/m Px = 冱 ᎏᎏ k! k=0

where m is the mean time to failure or the reciprocal of failure rate, that is, m = 1/␭; k is the index of the number of observed failures. Now let us investigate the probability of zero failures, that is, k = 0: 0 (t/m)0e–t/m Px=0 = 冱 ᎏᎏ = e–t/m 0! k=0

Next we estimate the one-sided confidence limit for ␭, given that zero failures occurred by time t. Assume a value of ␭, say ␭⬘, that satisfies ␭⬘ > ␭, and the probability of actually getting zero failures is 1 – ␣ = 5%, where ␣ = 95% is the confidence level. Then,

bappb.qxd

2/8/2009

5:55 PM

Page 241

B.3

ESTIMATION OF CONFIDENCE INTERVALS

241

1 – 0.95 = e–␭t

␭⬘t = 3.0 3.0 ␭⬘ = ᎏ t or m⬘ = 0.33t. This implies that if zero failures have occurred in time t, then there is a 95% confidence that the failure rate is less than 3/t and that the MTTF is greater than 0.33t. B.3.2

Confidence Intervals for Unavailability

The unavailability can be calculated from the availability (A) equation in Equation (2.1): MTTF Unavailability = 1 – A = 1 – ᎏᎏ MTTF + MTTR

or

␭ U = ᎏ (32) ␭+␮

where ␭ and ␮ are failure and repair rates, respectively, and U is unavailability. Note that MTTF = 1/␭ and MTTR = 1/␮. The average uptime duration m and the average downtime duration r estimated can be evaluated from the recorded data. Using these two values, a single-point estimate of the unavailability can be evaluated from Equation (22): r Uˆ = ᎏ r+m

(33)

The confidence level can also be made from the same set of recorded data. It was shown [Baldwin54] that r r ␭ Pr[␾⬘⬘ a,b ⱕ F2a,2b ⱕ ␾⬘ a,b] = Pr ᎏ ⱕ ᎏ ⱕ ᎏᎏ r + ␾⬘m ␭+␮ r + ␾⬘⬘m





(34)

where

␾⬘a,b and ␾⬘⬘ a,b are constants depending upon the chosen confidence level F2a,2b = F-statistic with 2a degrees of freedom in the numerator and 2b in the denominator

bappb.qxd

2/8/2009

242

5:55 PM

Page 242

RELIABILITY AND AVAILABILITY THEORY

a = number of consecutive or randomly chosen downtime durations b = number of consecutive or randomly chosen uptime durations The values of ␾⬘a,b and ␾⬘⬘ a,b are determined for a specific probability ␣. ␾⬘a,b is obtained from 1–␣ Pr[F2a,2b ⱖ ␾⬘] = ᎏ 2

(35)

1–␣ Pr[F2a,2b ⱕ ␾⬘⬘] = ᎏ 2

(36)

and ␾⬘⬘ a,b from

Since the upper tails of the F-distribution are usually tabulated [Odeh77], it is more convenient to express the equation above as 1 1–␣ 1 Pr ᎏ ⱖ ᎏ = ᎏ ␾⬘ 2 F2a,2b





or 1 1–␣ Pr F2b,2a ⱖ ᎏ = ᎏ ␾⬘⬘ 2





(37)

Once the values of ␾⬘ and ␾⬘⬘ are evaluated from the F-distribution with the chosen confidence level, ␣, they can be used to derive the following limits enclosing the true values of U: r Upper limit, UU = ᎏᎏ r + ␾⬘⬘m (38) r Lower limit, UL = ᎏ r + ␾⬘m B.3.3

Confidence Intervals for Large Samples

We have discussed taking multiple samples of the same size and by the same method to verify if a random sample is representative in Chapter 9, Section 9.2.1. Suppose we collect n samples. As the

bappb.qxd

2/8/2009

5:55 PM

Page 243

B.3

ESTIMATION OF CONFIDENCE INTERVALS

243

sample size (n) becomes larger, the sampling distribution of means becomes approximately normal, regardless of the shape of the variable in the population according to the central limit theorem (CLT). Assume that we estimated the sample means from all of them, say X 苶i. Then the mean of all these sample means, say X 苶i, is 苶 = ⌺ni=1X the best estimate of the population true mean, say ␮. According to the CLT, the sampling distribution will be centered around the population mean ␮, that is, X 苶 ⬵ ␮. The standard deviation of the sampling distribution (␴X), which is called its standard error, will approach the standard deviation of the population (␴) divided by 苶. (n1/2), that is, ␴X = ␴/兹n The table for the normal distribution indicates 95% of the area under the curve lies between a Z score of ±1.96. Therefore, we are 95% confident that the population mean ␮ lies between X 苶 ± 1.96␴X. Similarly, the table for the normal distribution indicates 99% of the area under the curve lies between a Z score of ±2.58. Therefore, we are 99% confident that the population mean ␮ lies between X 苶 ± 2.58␴X. When the sample size n is less than 30, the t-distribution is used to calculate the sample mean and standard error. The mathematics of the t-distribution were developed by W. C. Gossett and were published in 1908 [Gossett1908]. Reference [Chaudhuri2005] documents more discussion on estimating sampling errors.

APPENDIX C

SOFTWARE RELIABILITY GROWTH MODELS

C.1 SOFTWARE CHARACTERISTIC MODELS

Research in software reliability engineering has been conducted over the past 35 years, and many models have been proposed for estimating software reliability. Several classification systems for software reliability models exist; for example, the classification scheme based on the nature of the debugging strategy presented by Bastani and Ramamoorthy [Bastani86]. In addition, Goel [Goel85], Musa [Musa84], and Mellor [Mellor87] presented their own classification systems. In general, model classifications are helpful for identifying similarities between different models and for providing guidance when selecting an appropriate model.

One of the most widely used classification methods divides software reliability models into two types: deterministic and probabilistic [Pham1999]. The deterministic models are used to study (1) the elements of a program, by counting the number of operators, operands, and instructions; (2) the control flow of a program, by counting the branches and tracing the execution paths; and (3) the data flow of a program, by studying the data sharing and passing. In general, these models estimate and predict software performance using regression of performance measures on program complexity or other metrics. Halstead's software metric and McCabe's cyclomatic complexity metric are two well-known models of this type [Halstead77, McCabe76]. These models analyze program attributes and produce software performance measures without involving any random event. Software complexity models have also been studied in [Ottenstein81] and [Schneidewind81], who presented empirical models based on complexity metrics. Lipow [Lipow82] presented models to estimate the number of faults per line of code, which is an important complexity metric. A common feature of these models is that they estimate software reliability or the number of remaining faults by regression analysis; in other words, they determine the performance of the software from its complexity or other metrics.

The probabilistic models represent failure occurrences and fault removals as probabilistic events. This type of software reliability model [Xie91, Lyu96, Musa87, Pham2000] can be further classified into the following categories: fault seeding models, failure rate models, curve fitting models, program structure models, input domain models, Bayesian models, Markov models, reliability growth models, and nonhomogeneous Poisson process (NHPP) models. Among these, NHPP models are straightforward to apply in real-world settings, and this family of models has received the most attention from both the research and industry communities. In the next section, NHPP theory and widely used NHPP models are discussed.

C.2 NONHOMOGENEOUS POISSON PROCESS MODELS

C.2.1 Summary of SRGMs

C.2.1.1 Basic NHPP Models

A large number of NHPP software reliability growth models (SRGMs) have been proposed to assess the reliability of software [Goel79a, 79b; Hossain93; Lyu96; Miller86; Musa83; Musa87; Ohba84a, 84b, 84c; Ohtera90a; Pham93, 96; Yamada83; Yamada92; Wood96]. One of the first NHPP models was suggested by Schneidewind [Schneidewind75]. The best-known early NHPP model is that of Goel and Okumoto [Goel79a], which has been further generalized in [Goel85] and modified by many researchers; it essentially agrees with the model that John Musa [Musa84, 98] proposed. Moranda [Moranda81] described a variant of the Jelinski–Moranda model in which the variable (growing) size of a developing program is accommodated, so that the quality of a program can be estimated by analyzing an initial segment of the written code. Two parameters, the mean time to failure (MTTF) and the fault content of a program, are estimated.


C.2.1.2 S-Shaped Models

Ohba and coworkers [Ohba82] presented an NHPP model with an S-shaped mean value function. Some interesting results using an NHPP model were also presented by Yamada and coworkers [Yamada83]. Ohba [Ohba84] discussed several methods to improve some traditional software reliability analysis models and addressed the selection of appropriate models. Ohba and coworkers [Ohba82] suggested the so-called S-shaped models. Based on experience, it is observed that the curve of the cumulative number of faults is often S-shaped with respect to the mean value function, reflecting the fact that faults are neither independent nor of the same size. At the beginning, some faults are hidden, so removing a fault has a small effect on the reduction of the failure intensity rate. Yamada [Yamada84] proposed another reason that the mean value function shows an S-shaped curve: software testing usually involves a learning process by which people become familiar with the software and the testing tools and, therefore, can do a better job after a certain period of time. Yamada and Osaki [Yamada85] presented a general description of a discrete software reliability growth model that adopted the number of test runs or the number of executed test cases as the unit of the fault detection period. Yamada and coworkers [Yamada86] proposed a testing-effort-dependent reliability growth model in which the software fault detection process is modeled by an NHPP model; they used exponential and Rayleigh distributions to model the testing expenditure functions. Since 1990, research activity in software reliability modeling has increased. Yamada and Ohtera [Yamada90] incorporated testing-effort expenditures into software reliability growth models. They conducted research on NHPP models and provided many modifications reflecting such issues as testing effort, delayed S-shaped behavior, and learning processes.

C.2.1.3 Imperfect Debugging

Singpurwalla [Singpurwalla91] described an approach for determining the optimal time interval for testing and debugging under uncertainty. He suggested two plausible forms for the utility function, one based on cost alone and the other involving the realized reliability of the software. Yamada [Yamada91a] described a software fault detection process during the testing phase of software development. Yamada et al. [Yamada91b] proposed two software reliability assessment models with imperfect debugging, assuming that new faults are sometimes introduced when faults originally latent in a software system are corrected and removed during the testing phase. Pham [Pham91] proposed software reliability models for critical applications.

C.2.1.4 Fault Detection and Correction Processes

Xie and Zhao [Xie92] investigated the Schneidewind NHPP model [Schneidewind75] and suggested that several NHPP models can be derived from it to model the fault detection process and the fault correction process. Pham [Pham93] studied imperfect debugging and multiple failure types in software development. Hossain and coworkers [Hossain93] suggested a modification of the Goel–Okumoto model; they also presented a necessary and sufficient condition for the likelihood estimates to be finite, positive, and unique. Pham and Zhang [Pham97] summarized the existing NHPP models and presented a new model incorporating the time-dependent behavior of the fault detection function and the fault content function. Littlewood [Littlewood2000] studied different fault-finding procedures and showed that the effects these procedures have on reliability are not statistically independent. Wu [Wu2007] investigated fault detection and fault correction processes and proposed an approach to incorporate time delays due to fault detection and correction into software reliability models. Huang [Huang2004] proposed methods to incorporate fault dependency and a time-dependent delay function into software reliability growth modeling. Gokhale [Gokhale98] proposed a method to incorporate debugging activities using rate-based simulation techniques; various debugging policies are presented and the effects of these policies on the number of residual defects are analyzed.

C.2.1.5 Testing Coverage

Research has been conducted on testing coverage and its relationship with software reliability. Levendel [Levendel89] introduced the concepts of time-varying test coverage and time-varying defect repair density and related them to software reliability evaluation. Malaiya [Malaiya94, Malaiya2002] proposed a logarithmic model that relates the testing effort to test coverage and defect coverage; testing coverage is measured in terms of the blocks, branches, computation uses, predicate uses, and so on that are covered. Lyu [Lyu2003] documented an empirical study on testing and fault tolerance for software reliability engineering. Faults were inserted into the software, and the nature, manifestation, detection, and correlation of these faults were carefully studied. This study shows that coverage testing is an effective means of detecting software faults, but the effectiveness of testing coverage is not equivalent to that of mutation coverage, which is a more truthful indicator of testing quality. Gokhale [Gokhale2004] proposed a relationship between test costs and benefits, specifically between the quantity of testing and test coverage, based on the lognormal failure rate model. Cai [Cai2007] incorporated both testing time and testing coverage in software reliability prediction; experiments were carried out using the model on multiversion, fault-tolerant software.

C.2.1.6 Other Considerations

Other software reliability growth models have also been proposed in the literature. Yoshihiro Tohma and coworkers [Tohma91] worked on a hypergeometric model and its application to software reliability growth. Zhao and Xie [Zhao92] presented a log-power NHPP model that possesses properties such as simplicity and good graphical interpretation. Some papers also addressed the application of software reliability models. Schneidewind [Schneidewind92] reported software reliability studies applied to the U.S. space shuttle; he used experimental approaches to evaluate many existing reliability models and validated them using real software failure data. Schneidewind [Schneidewind93] claimed that it is not necessary to use all the failure data to estimate software reliability, since some of the failure data collected in the earlier testing phases is unstable; his research showed that improved reliability prediction can be achieved by using a subset of the failure data. Wood [Wood96] reported his experiments on software reliability models at Tandem Computer. He compared some existing software reliability models by applying them to data collected from four releases of software products and observed that the number of defects predicted by the Goel–Okumoto model is close to the number reported in the field data. Research on software size estimation has also been conducted. Hakuta and coworkers [Hakuta96] proposed a model for estimating software size based on the program design and other documents, then evaluated the model by looking at some application examples; their model assumed a stepwise evaluation of software size at different levels of program design documents.

Software reliability models based on NHPP have indeed been used successfully to evaluate software reliability [Musa83; Miller86; Musa87; Pham99]. Musa [Musa83, Musa87] promoted the use of NHPP models in software reliability growth modeling, and Miller [Miller86] provided a strong theoretical justification for using NHPP. An important advantage of NHPP models is that they are closed under superposition and time transformation. This characteristic is useful for describing different types of failures, even system failures that include both hardware and software failures. Pham, Nordmann, and Zhang [Pham99] developed a general NHPP model from which new models can be derived and existing models can be unified.

C.2.1.7 Failure Rate Prediction

SRGMs are used to evaluate and predict the software failure rate. References [Jeske05a, Zhang02, Zhang06] document details on how to use SRGMs to evaluate testing data and predict software field failure rates. Typically, the software failure rate estimated in the testing environment needs to be calibrated when predicting the software field failure rate, to adjust for the mismatch between the testing and the field environments. To adjust for the mismatch, this rate should be correlated with the software failure rate observed in the field by some calibration factor. This calibration factor is best estimated by comparing the lab failure rate of a previous release against the field failure rate for that same release; assuming that testing strategies, operational profiles, and other general development processes remain relatively consistent, the calibration factor should also remain relatively consistent. References [Zhang02] and [Jeske05a] discuss more details on calibrating software failure rates estimated in the testing environment to predict software failure rates in the field. They also discuss two other practical issues: noninstantaneous defect removal time and deferral of defect fixes. Most SRGMs focus on defect detection and assume that fault removal is instantaneous and that all detected defects will be fixed before the software is released, so that software reliability growth is achieved as soon as defects are detected. In practice, it takes a significant amount of time to remove defects, and fixes of some defects might be deferred to the next release for various reasons, for example, because a fix is part of a new feature. References [Jeske05b] and [Zhang06] explain how to address these issues in real-world applications.

C.2.2 Theory of SRGM

To use an SRGM to describe the fault detection process, let N(t) denote the cumulative number of software failures by time t. The counting process {N(t), t ≥ 0} is said to be a nonhomogeneous Poisson process with intensity function λ(t), t ≥ 0, if N(t) follows a Poisson distribution with mean value function m(t):

\Pr\{N(t) = k\} = \frac{[m(t)]^k}{k!} e^{-m(t)}, \quad k = 0, 1, 2, \ldots    (1)

where m(t) = E[N(t)] is the expected number of cumulative failures, which is also known as the mean value function. The failure intensity function (or hazard function) is given by

\lambda(t) = \lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{\Delta t\, R(t)} = \frac{f(t)}{R(t)}

Given λ(t), the mean value function m(t) satisfies

m(t) = \int_0^t \lambda(s)\, ds    (2)

Inversely, knowing m(t), the fault detection rate function at time t can be obtained as

\lambda(t) = \frac{dm(t)}{dt}    (3)

Software reliability R(x/t) is defined as the probability that a software failure does not occur in (t, t + x), given that the last failure occurred at testing time t (t ≥ 0, x > 0). That is,

R(x/t) = e^{-[m(t+x) - m(t)]}    (4)

As special cases, when t = 0, R(x/0) = e^{-m(x)}, and when t = ∞, R(x/∞) = 1.
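To make Equations (1)–(4) concrete, the short Python sketch below (not from the book) evaluates m(t), λ(t), and R(x/t) for the simplest mean value function, the Goel–Okumoto model of Table C.1; the parameter values a = 100 and b = 0.05 are placeholders chosen purely for illustration:

    # Minimal sketch: NHPP quantities for the Goel-Okumoto model m(t) = a(1 - e^(-bt)).
    import numpy as np

    def m(t, a, b):                        # expected cumulative failures by time t
        return a * (1.0 - np.exp(-b * t))

    def intensity(t, a, b):                # lambda(t) = dm(t)/dt, Equation (3)
        return a * b * np.exp(-b * t)

    def reliability(x, t, a, b):           # R(x/t) = exp(-[m(t+x) - m(t)]), Equation (4)
        return np.exp(-(m(t + x, a, b) - m(t, a, b)))

    a, b = 100.0, 0.05                     # placeholder parameters
    print(m(10.0, a, b), intensity(10.0, a, b), reliability(5.0, 10.0, a, b))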


Most of the NHPP software reliability growth models in the literature are based on the same underlying theory, with different forms of the mean value function. NHPP models assume that the failure intensity is proportional to the residual fault content. A general class of NHPP SRGMs can be obtained by solving the following differential equation:

\frac{dm(t)}{dt} = b(t)[a(t) - m(t)]    (5)

where a(t) stands for the total number of defects in the software and b(t) is known as the defect detection function. The second term, a(t) - m(t), represents the number of (undetected) residual defects. The model in Equation (5) is a general model that can summarize most of the NHPP models. Depending on how elaborate a model one wishes to obtain, one can use a(t) and b(t) to yield more or less complex analytical solutions for the function m(t). Different a(t) and b(t) functions also reflect different assumptions about the software testing process. The G-O model cited in Chapter 7 is the simplest NHPP model, with a(t) = a and b(t) = b. A constant a(t) indicates that no new faults are introduced during the debugging process and therefore corresponds to a perfect debugging assumption. A constant b(t) implies that the failure intensity function λ(t) is proportional to the number of remaining faults. The general solution of the differential Equation (5) is given by [Pham97, Pham99]:

m(t) = e^{-B(t)} \left[ m_0 + \int_{t_0}^{t} a(\tau)\, b(\tau)\, e^{B(\tau)}\, d\tau \right]    (6)

where B(t) = \int_{t_0}^{t} b(\tau)\, d\tau, and m(t_0) = m_0 is the marginal condition of Equation (5), with t_0 representing the starting time of the debugging process. Many existing NHPP models can be considered special cases of the general model in Equation (6). An increasing a(t) function implies an increasing total number of faults (note that this includes those already detected and removed as well as those inserted during the debugging process) and reflects imperfect debugging. A time-dependent b(t) implies an increasing fault detection rate, which can be attributed to a learning curve phenomenon [Ohba84, Yamada92], to software process fluctuations [Rivers98], or to a combination of both. This group of models with time-dependent fault detection functions is also referred to as "S-shaped" models, since the fault detection function captures the delay at the beginning due to learning. Table C.1 summarizes the widely used NHPP software reliability growth models and their mean value functions (MVFs).

C.2.2.1 Parameter Estimation

Once the analytical expression for the mean value function m(t) is developed, the parameters of the mean value function need to be estimated, which is usually carried out using maximum likelihood estimation (MLE) [Schneidewind75]. There are two widely used formats for recording software failure data in practice. The first type records the cumulative number of failures for each separate time interval. The second records the exact failure occurrence time of each fault. Since there are two types of data, two parameter estimation methodologies are derived accordingly [Pham96].

Case 1. Let t_1, t_2, . . . , t_n be the time intervals of n software failures, and y_1, y_2, . . . , y_n be the cumulative number of failures for each time interval. If data are given on the cumulative number of failures at discrete times [y_i = y(t_i) for i = 1, 2, . . . , n], then the log of the likelihood function (LLF) can be expressed as

LLF = \sum_{i=1}^{n} (y_i - y_{i-1}) \log[m(t_i) - m(t_{i-1})] - m(t_n)    (7)

Thus, the maximum of the LLF is determined by the following:

\sum_{i=1}^{n} (y_i - y_{i-1}) \frac{\frac{\partial}{\partial x} m(t_i) - \frac{\partial}{\partial x} m(t_{i-1})}{m(t_i) - m(t_{i-1})} - \frac{\partial}{\partial x} m(t_n) = 0    (8)

where x represents the unknown parameters in the mean value function m(t) that need to be estimated.


Table C.1. Summary of the NHPP software reliability models

Delayed S-shaped model [Yamada83] (S-shaped)
  MVF: m(t) = a[1 - (1 + bt)e^{-bt}]
  Comment: Modification of the G-O model to make it S-shaped.

Goel–Okumoto (G-O) model [Goel79] (concave)
  MVF: m(t) = a(1 - e^{-bt}), with a(t) = a, b(t) = b
  Comment: Also called the exponential model.

Inflection S-shaped model [Hossain93] (S-shaped)
  MVF: m(t) = a(1 - e^{-bt}) / (1 + βe^{-bt}), with a(t) = a, b(t) = b/(1 + βe^{-bt})
  Comment: Solves a technical condition with the G-O model; becomes the same as G-O if β = 0.

Pham–Nordmann–Zhang (PNZ) model (S-shaped and concave)
  MVF: m(t) = [a(1 - e^{-bt})(1 - α/b) + αat] / (1 + βe^{-bt}), with a(t) = a(1 + αt), b(t) = b/(1 + βe^{-bt})
  Comment: Assumes the fault introduction rate is a linear function of the testing time, and the fault detection rate is nondecreasing with an inflection S-shaped form.

Pham–Zhang (PZ) model [Pham99] (S-shaped and concave)
  MVF: m(t) = [1/(1 + βe^{-bt})] [(c + a)(1 - e^{-bt}) - (ab/(b - α))(e^{-αt} - e^{-bt})], with a(t) = c + a(1 - e^{-αt}), b(t) = b/(1 + βe^{-bt})
  Comment: Assumes the fault introduction rate is an exponential function of the testing time, and the fault detection rate is nondecreasing with an inflection S-shaped form.

Yamada exponential model [Yamada86] (concave)
  MVF: m(t) = a(1 - e^{-rα(1 - e^{-βt})}), with a(t) = a, b(t) = rαβe^{-βt}
  Comment: Incorporates an exponential testing-effort function.

Yamada Rayleigh model [Yamada86] (S-shaped)
  MVF: m(t) = a(1 - e^{-rα(1 - e^{-βt²/2})}), with a(t) = a, b(t) = rαβte^{-βt²/2}
  Comment: Incorporates a Rayleigh testing-effort function.

Yamada imperfect debugging model (1) [Yamada92] (S-shaped)
  MVF: m(t) = [ab/(α + b)](e^{αt} - e^{-bt}), with a(t) = ae^{αt}, b(t) = b
  Comment: Assumes an exponential fault content function and a constant fault detection rate.

Yamada imperfect debugging model (2) [Yamada92] (S-shaped)
  MVF: m(t) = a(1 - e^{-bt})(1 - α/b) + αat, with a(t) = a(1 + αt), b(t) = b
  Comment: Assumes a constant fault introduction rate α and a constant fault detection rate b.

Case 2. The second method records the failure occurrence time of each failure. Let S_j be the occurrence time of failure j (j = 1, 2, . . . , n); then the log of the likelihood function takes the following form:

LLF = \sum_{i=1}^{n} \log[\lambda(S_i)] - m(S_n)    (9)

where λ(S_i) is the failure intensity function at time S_i. Thus, the maximum of the LLF is determined by the following:

\sum_{i=1}^{n} \frac{\frac{\partial}{\partial x} \lambda(S_i)}{\lambda(S_i)} - \frac{\partial}{\partial x} m(S_n) = 0    (10)

where x represents the unknown parameters in the mean value function m(t) that need to be estimated.

Typically, software tools can help users fit a specific model to a given data set. The tool runs algorithms to maximize the likelihood function, which yields the values of the model parameters for that data set.
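As one possible illustration of Case 1 (not taken from the book), the Python sketch below maximizes the grouped-data log likelihood of Equation (7) numerically for the Goel–Okumoto mean value function rather than solving Equation (8) in closed form; the cumulative failure counts are hypothetical, and any other m(t) from Table C.1 could be substituted:

    # Minimal sketch: numerical MLE for the Goel-Okumoto model on grouped data.
    import numpy as np
    from scipy.optimize import minimize

    def go_mvf(t, a, b):                                   # m(t) = a(1 - e^(-bt))
        return a * (1.0 - np.exp(-b * t))

    def neg_llf(params, t, y):                             # negative of Equation (7)
        a, b = params
        if a <= 0.0 or b <= 0.0:
            return np.inf
        m_vals = go_mvf(np.asarray(t, float), a, b)
        dm = np.diff(np.concatenate(([0.0], m_vals)))      # m(t_i) - m(t_{i-1})
        dy = np.diff(np.concatenate(([0.0], np.asarray(y, float))))
        if np.any(dm <= 0.0):
            return np.inf
        return -(np.sum(dy * np.log(dm)) - m_vals[-1])

    # Hypothetical grouped data: cumulative failures y_i observed at times t_i.
    t_obs = np.arange(1, 21)
    y_obs = [5, 11, 18, 24, 29, 35, 39, 44, 47, 51, 54, 56, 59, 61, 62, 64, 65, 66, 67, 68]
    fit = minimize(neg_llf, x0=[100.0, 0.1], args=(t_obs, y_obs), method="Nelder-Mead")
    print(fit.x)                                           # estimated (a, b)

The same routine can be pointed at the data of Table C.2 later in this appendix by supplying days 1–111 as t and the cumulative-fault column as y.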

C.2.2.2 SRGM Model Selection Criteria

Descriptive Power. An NHPP SRGM proposes a mean value function m(t) that can be used to estimate the number of expected failures by time t. Once a mean value function is fit to the actual debugging data (typically in the format of the number of cumulative failures by cumulative test time), the goodness of fit can be measured and compared using the following three criteria: the mean squared error (MSE), Akaike's information criterion (AIC) [Akaike74], and the predictive-ratio risk (PRR) proposed by Pham and Deng [Pham2003]. The closeness of fit between a given model and the actual data set provides the descriptive power of the model. These three metrics compare the SRGMs on how well each of them fits the underlying data; in other words, they compare the SRGMs based on their descriptive power.

Criterion 1. The MSE measures the distance of a model estimate from the actual data, taking into account the number of observations and the number of parameters in the model. It is calculated as follows:

MSE = \frac{\sum_i [m(t_i) - y_i]^2}{n - N}    (11)

where n is the number of observations, N stands for the number of parameters in the model, y_i is the cumulative number of failures observed by time t_i, m(t_i) is the mean value function of the NHPP model, and i is the index for the reported defects.

Criterion 2. AIC measures the ability of a model to maximize the likelihood function, which is directly related to the degrees of freedom during fitting. The AIC criterion assigns a larger penalty to a model with more parameters:

AIC = -2 \times \log(\text{maximum value of the likelihood function}) + 2 \times N    (12)

where N stands for the number of parameters in the model.

Criterion 3. The third criterion, the predictive-ratio risk (PRR), is defined as

PRR = \sum_i \left[ \frac{m(t_i) - y_i}{m(t_i)} \right]^2    (13)

where y_i is the number of failures observed by time t_i and m(t_i) is the mean value function of an NHPP model. PRR assigns a larger penalty to a model that has underestimated the cumulative number of failures at any given time. For all three criteria, a lower value indicates a better fit to the data.

Predictive Power. The predictive power is defined as the ratio of the difference between the predicted number of residual faults and the number of faults observed in the post-system-test interval to the number of faults observed in the post-system-test interval, that is,

P = \frac{\hat{N}_s(T) - N_{post}}{N_{post}}    (14)

where N̂_s is the estimated number of remaining faults at the end of the system test and N_post is the number of faults detected during the post-system-test phase. A negative value indicates that the model has underestimated the number of remaining faults. A lower absolute value indicates better predictive power.

For projects that have defect data from both the test and field/trial intervals, the model that provides the best descriptive power and predictive power should be selected. If only test data is available and the test data set is relatively large, the data can be divided into two subsets: the first subset can be used for the descriptive power comparison, and the second subset can be used for the predictive power comparison. If the test data set is small, then only descriptive power can be used for model selection.

C.2.3 SRGM Example: Evaluation of the Predictive Power Using Data from a Real-Time Control System

In this section, the predictive power of the models is evaluated using a data set collected from testing a program for monitoring and real-time control systems. The data are published in [Tohma91]. The software consists of about 200 modules, and each module has, on average, 1000 lines of a high-level language like FORTRAN. Table C.2 records the software failures detected during a 111-day testing period. This data set is concave overall, with two clusters of significantly increasing detected faults.
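Before applying the models to these data, it may help to see how the descriptive-power criteria of Section C.2.2.2 can be computed once a model has been fitted. The Python sketch below is illustrative only (it is not from the book); the observed counts y and fitted values m_fit would come from a data set such as Table C.2 and a fitted model such as those in Table C.3, and max_llf and n_params are assumed to come from the fitting step:

    # Minimal sketch: descriptive-power criteria for a fitted mean value function.
    # y: observed cumulative failures; m_fit: m(t_i) evaluated at the same times.
    import numpy as np

    def mse(y, m_fit, n_params):                     # Equation (11)
        y, m_fit = np.asarray(y, float), np.asarray(m_fit, float)
        return np.sum((m_fit - y) ** 2) / (y.size - n_params)

    def aic(max_llf, n_params):                      # Equation (12)
        return -2.0 * max_llf + 2.0 * n_params

    def prr(y, m_fit):                               # Equation (13)
        y, m_fit = np.asarray(y, float), np.asarray(m_fit, float)
        return np.sum(((m_fit - y) / m_fit) ** 2)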


Table C.2. Failures per day and cumulative failures

Days 1–45
  Faults per day:    5* 5* 5* 5* 6* 8 2 7 4 2 31 4 24 49 14 12 8 9 4 7 6 9 4 4 2 4 3 9 2 5 4 1 4 3 6 13 19 15 7 15 21 8 6 20 10
  Cumulative faults: 5* 10* 15* 20* 26* 34 36 43 47 49 80 84 108 157 171 183 191 200 204 211 217 226 230 234 236 240 243 252 254 259 263 264 268 271 277 293 309 324 331 346 367 375 381 401 411

Days 46–90
  Faults per day:    3 3 8 5 1 2 2 2 7 1 0 2 3 2 7 3 0 1 0 1 0 0 1 1 0 0 1 1 0 0 0 1 2 0 1 0 0 0 0 0 0 2 0 0 0
  Cumulative faults: 414 417 420 430 431 433 435 437 444 446 446 448 451 453 460 463 463 464 464 465 465 465 466 467 467 467 468 469 469 469 469 470 472 472 473 473 473 473 473 473 473 475 475 475 475

Days 91–101
  Faults per day:    0 0 0 0 0 1 0 0 0 1 0
  Cumulative faults: 475 475 475 475 475 476 476 476 476 477 477

Days 102–111
  Faults per day:    0 1 0 9 1 0 0 1 0 1
  Cumulative faults: 477 478 478 478 479 479 479 480 480 481

*Interpolated data.

Table C.3. MLEs of model parameters—control system data

Delayed S-shaped model, m(t) = a[1 - (1 + bt)e^{-bt}], a(t) = a, b(t) = b²t/(1 + bt)
  MLEs (61 data points):  a = 522.49, b = 0.06108
  MLEs (111 data points): a = 483.039, b = 0.06866

Goel–Okumoto (G-O) model, m(t) = a(1 - e^{-bt}), a(t) = a, b(t) = b
  MLEs (61 data points):  a = 852.97, b = 0.01283
  MLEs (111 data points): a = 497.282, b = 0.0308

Inflection S-shaped model, m(t) = a(1 - e^{-bt})/(1 + βe^{-bt}), a(t) = a, b(t) = b/(1 + βe^{-bt})
  MLEs (61 data points):  a = 852.45, b = 0.01285, β = 0.001
  MLEs (111 data points): a = 482.017, b = 0.07025, β = 4.15218

Pham–Nordmann–Zhang (PNZ) model, m(t) = [a(1 - e^{-bt})(1 - α/b) + αat]/(1 + βe^{-bt}), a(t) = a(1 + αt), b(t) = b/(1 + βe^{-bt})
  MLEs (61 data points):  a = 470.759, b = 0.07497, α = 0.00024, β = 4.69321
  MLEs (111 data points): a = 470.759, b = 0.07497, α = 0.00024, β = 4.69321

Pham–Zhang (PZ) model, m(t) = [1/(1 + βe^{-bt})][(c + a)(1 - e^{-bt}) - (ab/(b - α))(e^{-αt} - e^{-bt})], a(t) = c + a(1 - e^{-αt}), b(t) = b/(1 + βe^{-bt})
  MLEs (61 data points):  a = 0.920318, b = 0.0579, α = 2.76 × 10^{-5}, β = 3.152, c = 520.784
  MLEs (111 data points): a = 0.46685, b = 0.07025, α = 1.4 × 10^{-5}, β = 4.15213, c = 482.016

Yamada exponential model, m(t) = a(1 - e^{-rα(1 - e^{-βt})}), a(t) = a, b(t) = rαβe^{-βt}
  MLEs (61 data points):  a = 9219.7, α = 0.09995, β = 0.01187
  MLEs (111 data points): a = 67958.8, α = 0.00732, β = 0.03072

Yamada Rayleigh model, m(t) = a(1 - e^{-rα(1 - e^{-βt²/2})}), a(t) = a, b(t) = rαβte^{-βt²/2}
  MLEs (61 data points):  a = 611.70, α = 1.637, β = 0.00107
  MLEs (111 data points): a = 500.146, α = 3.31944, β = 0.00066

Yamada imperfect debugging model (1), m(t) = [ab/(α + b)](e^{αt} - e^{-bt}), a(t) = ae^{αt}, b(t) = b
  MLEs (61 data points):  a = 1795.7, b = 0.00614, α = 0.002
  MLEs (111 data points): a = 654.963, b = 0.02059, α = 0.0027

Yamada imperfect debugging model (2), m(t) = a(1 - e^{-bt})(1 - α/b) + αat, a(t) = a(1 + αt), b(t) = b
  MLEs (61 data points):  a = 16307, b = 0.0068, α = 0.009817
  MLEs (111 data points): a = 591.804, b = 0.02423, α = 0.0019

It appears that the software in this example becomes stable after 61 days of testing. We therefore compare the descriptive power of the models using the first 61 data points, and compare their predictive power by treating the last 50 days of data as the "actual" data observed after the prediction was made. The results of parameter estimation using the first 61 days of data are summarized in Table C.3, and the result of the predictive power comparison is shown in Table C.4. From Table C.4, we can see that the PZ model shows the best predictive power (with the lowest prediction MSE), followed by the Yamada Rayleigh, inflection S-shaped, and delayed S-shaped models. From Table C.3, the number of total defects can be estimated.


Table C.4. Model comparison—control system data

Model name                               MSE (prediction)
Delayed S-shaped model                   935.88
Goel–Okumoto (G-O) model                 11611.42
Inflection S-shaped model                590.38
Pham–Nordmann–Zhang (PNZ) model          2480.7
Pham–Zhang (PZ) model                    102.66
Yamada exponential model                 12228.25
Yamada Rayleigh model                    187.57
Yamada imperfect debugging model (1)     8950.54
Yamada imperfect debugging model (2)     2752.83
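A prediction MSE of this kind can be approximated with a simple holdout comparison. The sketch below is not from the book (and the exact normalization used for Table C.4 is not stated here); it scores an already-fitted m(t), for example the Goel–Okumoto fit from the earlier sketch trained on days 1–61, against the actual cumulative faults of the held-out days 62–111:

    # Minimal sketch: holdout prediction error for a fitted mean value function.
    import numpy as np

    def prediction_mse(fitted_mvf, holdout_days, holdout_cumulative):
        days = np.asarray(holdout_days, float)
        actual = np.asarray(holdout_cumulative, float)
        return np.mean((fitted_mvf(days) - actual) ** 2)

    # Usage (hypothetical): fit a model on days 1-61 of Table C.2, then call
    #   prediction_mse(lambda t: go_mvf(t, a_hat, b_hat), range(62, 112), cum_faults[61:])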

For all the models except the PZ model, the estimated number of total defects is â. The number of total defects for the PZ model is estimated as ĉ + â. For example, using all 111 days of data, the total number of defects estimated from the PZ model is 482.47 and that estimated from the delayed S-shaped model is 483.03. Hence, for this data set, we can conclude that the PZ model is the best model to estimate the software reliability parameters, that is, the number of residual defects and the per-fault failure rate. Some of the other models provide close estimates or predictive power; they can be used to confirm the prediction.

Let us use the PZ model to further estimate the software reliability parameters. Based on all 111 data points, the number of total defects is ĉ + â = 482.47; hence, the number of residual defects is N̂(t) = (ĉ + â) - n = 482.47 - 481 = 1.47. The per-fault failure rate is then b̂ = 0.07025 failures/day/fault. So the initial failure rate based on the testing data is λ̂(T = 111 days) = N̂(t) × b̂ = 1.47 × 0.07025 = 0.1053 failures/day, or 38.5 failures/year. As discussed in Section 7.2, if we estimate a calibration factor of K = 10 from other releases or similar products, the adjusted initial field failure rate can be estimated as

λ̂(t_field = 0) = N̂(t) × b̂/K = 1.47 × 0.07025/10 = 0.01053 failures/day

or 3.85 failures/year. This software failure rate can be used in the architecture-based reliability model to update the downtime and availability prediction.

Let us now discuss how to use the software failure rate estimated from the test data to predict the downtime at a high level. Assume that the software coverage factor is C = 95%, that is, 95% of software failures can be automatically detected and recovered, and that automatic detection and recovery takes 10 seconds on average. The remaining 5% of failures, which escape the system fault detection mechanisms, are detected and recovered through human intervention; assume that manual detection and recovery takes 30 minutes on average. We can then quickly estimate the annual software downtime as

DT = 3.85 × [0.95 × (10/60) + 0.05 × 30] = 6.38 minutes/year

This corresponds to an availability of 99.9988%. With this example, we have shown how to use an SRGM to estimate the software failure rate, which can be fed back into the architecture-based model to produce an updated downtime prediction (as compared to an early architectural-phase prediction).
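The downtime arithmetic above is easy to script; the short Python sketch below (not part of the original text) simply reproduces the calculation with the stated coverage and recovery-time assumptions:

    # Minimal sketch: annual downtime and availability from the calibrated failure rate.
    failures_per_year = 3.85            # calibrated field failure rate
    coverage = 0.95                     # fraction of failures detected/recovered automatically
    auto_recovery_min = 10.0 / 60.0     # 10 seconds, expressed in minutes
    manual_recovery_min = 30.0          # manual detection and recovery time

    downtime_min_per_year = failures_per_year * (
        coverage * auto_recovery_min + (1.0 - coverage) * manual_recovery_min)
    availability = 1.0 - downtime_min_per_year / (365.25 * 24.0 * 60.0)
    print(round(downtime_min_per_year, 2), round(100.0 * availability, 4))   # 6.38 99.9988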

APPENDIX D

ACRONYMS AND ABBREVIATIONS

CB      Control Board
CLT     Central Limit Theorem
COTS    Commercial Off-The-Shelf
CPLD    Complex Programmable Logic Device
CPU     Central Processing Unit
CRC     Cyclic Redundancy Check
CWT     Cleared While Testing
DC      Direct Current
DSL     Digital Subscriber Line
ECC     Error Checking and Correction
EPROM   Electrically Programmable Read-Only Memory
EPA     Environmental Protection Agency
ERI     Early Return Indicator
FCC     Federal Communications Commission
FIT     Failures in 1 Billion Hours
FPGA    Field-Programmable Gate Array
FRU     Field-Replaceable Unit
GA      General Availability
GO      Goel–Okumoto
GSPN    Generalized Stochastic Petri Net
HA      High Availability
HLR     Home Location Register
HW      Hardware
IC      Interface Card
IP      Internet Protocol
IT      Information Technology
KLOC    Kilolines of Code
LED     Light-Emitting Diode
LTR     Long-Term Return Rate
MLE     Maximum Likelihood Estimation
MOP     Method of Procedures
MR      Modification Request; a defect recorded in a defect tracking system
MTBF    Mean Time Between Failures
MTTF    Mean Time to Failure
MTTO    Mean Time to Outage
MTTR    Mean Time to Repair
NE      Network Element
NEBS    Network Equipment Building Standards
NEO4    Network Element Outage, Category 4
NESAC   National Electronics Systems Assistance Center
NHPP    Nonhomogeneous Poisson Process
NOC     Network Operations Center
NFF     No Fault Found, same as No Trouble Found
NTF     No Trouble Found, same as No Fault Found
OAM     Operations, Administration, and Maintenance
OEM     Original Equipment Manufacturer
OS      Operating System
PBX     Private Branch Exchange
PC      Personal Computer
PCI     Peripheral Component Interconnect
PDF     Probability Density Function
PE      Prediction Error
PMC     PCI Mezzanine Card
QuEST   Quality Excellence for Suppliers of Telecommunications
RBD     Reliability Block Diagram
RF      Radio Frequency
RFP     Request for Proposal
RPP     Reliability Prediction Procedure
SLA     Service-Level Agreement
SNMP    Simple Network Management Protocol
SO4     Product-Attributable Service Outage Downtime
SPN     Stochastic Petri Net
SRGM    Software Reliability Growth Model
SRQAC   Software Reliability and Quality Acceptance Criteria
SW      Software
YRR     Yearly Return Rate

APPENDIX E

BIBLIOGRAPHY

References are grouped first as a complete list, then by topic in the following sections, to make it easier for readers to find additional information on a specific subject.

[Akaike74] Akaike, H., A New Look at Statistical Model Identification, IEEE Transactions on Automatic Control, 19, 716–723 (1974).
[ANSI91] ANSI/IEEE, Standard Glossary of Software Engineering Terminology, STD-729-1991, ANSI/IEEE (1991).
[AT&T90] Klinger, D. J., Nakada, Y., and Menendez, M. A. (Eds.), AT&T Reliability Manual, Springer (1990).
[Bagowsky61] Bagowsky, I., Reliability Theory and Practice, Prentice-Hall (1961).
[Baldwin54] Baldwin, C. J., et al., Mathematical Models for Use in the Simulation of Power Generation Outage, II, Power System Forced Outage Distributions, AIEE Transactions, 78, TP 59–849 (1954).
[Bastani82] Bastani, F. B., and Ramamoorthy, C. V., Software Reliability Status and Perspectives, IEEE Trans. Software Eng., SE-11, 1411–1423 (1985).
[Cai2007] Cai, X., and Lyu, M. R., Software Reliability Modeling with Test Coverage: Experimentation and Measurement With a Fault-Tolerant Software Project, in The 18th IEEE International Symposium on Software Reliability, 17–26 (2007).
[Chaudhuri2005] Chaudhuri, A., and Stenger, H., Survey Sampling: Theory and Methods, Second Edition, Chapman & Hall/CRC Press (2005).
[Crowder94] Crowder, M. J., Kimber, A., Sweeting, T., and Smith, R., Statistical Analysis of Reliability Data, Chapman & Hall/CRC Press (1994).
[Feller68] Feller, W., An Introduction to Probability Theory and Its Applications, Vol. 1, Wiley (1968).
[FP08] Website of Function Points, http://www.ifpug.org/.


[Goel79a] Goel, A. L. and Okumoto, K., Time-Dependent Fault Detection Rate Model for Software and Other Performance Measures, IEEE Transactions on Reliability, 28, 206–211 (1979).
[Goel79b] Goel, A. L. and Okumoto, K., A Markovian Model for Reliability and Other Performance Measures of Software Systems, in Proceedings of the National Computer Conference, pp. 769–774 (1979).
[Goel85] Goel, A. L., Software Reliability Models: Assumptions, Limitations, and Applicability, IEEE Trans. Software Eng., SE-11, 1411–1423 (1985).
[Gokhale1998] Gokhale, S. S., Lyu, M. R., and Trivedi, K. S., Software Reliability Analysis Incorporating Fault Detection and Debugging Activities, in Proceedings of the Ninth International Symposium on Software Reliability Engineering, 4–7 November, 202–211 (1998).
[Gokhale2004] Gokhale, S. S., and Mullen, R. E., From Test Count to Code Coverage Using the Lognormal Failure Rate, in 15th International Symposium on Software Reliability Engineering, 295–305 (2004).
[Gossett08] Gossett, W. S., The Probable Error of a Mean, Biometrika, 6(1), 1–25 (1908).
[Hakuta97] Hakuta, M., Tone, F., and Ohminami, M., A Software Estimation Model and Its Evaluation, J. Systems Software, 37, 253–263 (1997).
[Halstead77] Halstead, M. H., Elements of Software Science, Elsevier North-Holland (1977).
[Hossain93] Hossain, S. A. and Dahiya, R. C., Estimating the Parameters of a Non-Homogeneous Poisson Process Model for Software Reliability, IEEE Trans. on Reliability, 42(4), 604–612 (1993).
[Huang2004] Huang, C.-Y., Lin, C.-T., Lyu, M. R., and Sue, C.-C., Software Reliability Growth Models Incorporating Fault Dependency With Various Debugging Time Lags, in Proceedings of the 28th Annual International Computer Software and Applications Conference, COMPSAC 2004, 1, 186–191 (2004).
[IEEE95] IEEE, Charter and Organization of the Software Reliability Engineering Committee (1995).
[Jeske05a] Jeske, D. R., and Zhang, X., Some Successful Approaches to Software Reliability Modeling in Industry, Journal of Systems and Software, 74, 85–99 (2005).
[Jeske05b] Jeske, D. R., Zhang, X., and Pham, L., Adjusting Software Failure Rates that are Estimated from Test Data, IEEE Transactions on Reliability, 54(1), 107–114 (2005).
[Jones1991] Jones, C., Applied Software Measurement, McGraw-Hill (1991).
[Keene94] Keene, S. J., Comparing Hardware and Software Reliability, Reliability Review, 14(4), 5–7, 21 (1994).
[Keiller91] Keiller, P. A. and Miller, D. R., On the Use and the Performance of Software Reliability Growth Models, Software Reliability and Safety, 32, 95–117 (1991).
[Kemeny60] Kemeny, J. G., and Snell, J. L., Finite Markov Chains, Van Nostrand (1960).
[Kremer83] Kremer, W., Birth-Death and Bug Counting, IEEE Transactions on Reliability, R-32(1), 37–47 (1983).
[Levendel1989] Levendel, Y., Software Quality and Reliability Prediction: A Time-Dependent Model with Controllable Testing Coverage and Repair Intensity, in Proceedings of the Fourth Israel Conference on Computer Systems and Software Engineering, 175–181 (1989).
[Lipow82] Lipow, M., Number of Faults per Line of Code, IEEE Trans. on Software Eng., 8(4), 437–439 (1982).
[Littlewood2000] Littlewood, B., Popov, P. T., Strigini, L., and Shryane, N., Modeling the Effects of Combining Diverse Software Fault Detection Techniques, IEEE Transactions on Software Engineering, 26(12), 1157–1167 (2000).
[Lyu95] Lyu, M. R., Software Fault Tolerance, Wiley (1995).
[Lyu96] Lyu, M. R. (Ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press (1996).
[Lyu2003] Lyu, M. R., Huang, Z., Sze, S. K. S., and Cai, X., An Empirical Study on Testing and Fault Tolerance for Software Reliability Engineering, in 14th International Symposium on Software Reliability Engineering, 119–130 (2003).
[Malaiya1994] Malaiya, Y. K., Li, N., Bieman, J., Karcich, R., and Skibbe, B., The Relationship Between Test Coverage and Reliability, in Proceedings of 5th International Symposium on Software Reliability Engineering, 186–195 (1994).
[Malaiya2002] Malaiya, Y. K., Li, M. N., Bieman, J. M., and Karcich, R., Software Reliability Growth With Test Coverage, in Transactions of Reliability Engineering, 420–426 (2002).
[McCabe76] McCabe, T. J., A Complexity Measure, IEEE Transactions on Software Engineering, SE-2(4), 308–320 (1976).
[Mellor87] Mellor, P., Software Reliability Modelling: the State of the Art, Information and Software Technology, 29(2), 81–88 (1987).
[Miller86] Miller, D. R., Exponential Order Statistic Models of Software Reliability Growth, IEEE Transactions on Software Engineering, SE-12(1), 12–24 (1986).
[Musa83] Musa, J. and Okumoto, K., Software Reliability Models: Concepts, Classification, Comparison, and Practice, in Electronic Systems Effectiveness and Life Cycle Costing, J. K. Skwirzynski (Ed.), NATO ASI Series, F3, Springer-Verlag, 395–424 (1983).
[Musa84a] Musa, J. D., Software Reliability, in Handbook of Software Engineering, C. R. Vick and C. V. Ramamoorthy (Eds.), 392–412 (1984).
[Musa84b] Musa, J. D. and Okumoto, K., A Logarithmic Poisson Execution Time Model for Software Reliability Measurement, in International Conference on Software Engineering, Orlando, Florida, 230–238 (1984).
[Musa87] Musa, J., Iannino, A., and Okumoto, K., Software Reliability, McGraw-Hill (1987).
[Musa98] Musa, J. D., Software Reliability Engineering, McGraw-Hill (1998).
[NASA2002] National Aeronautics and Space Administration (NASA), Fault Tree Handbook with Aerospace Applications, Version 1.1 (2002).
[NUREG-0492] U.S. Nuclear Regulatory Commission, Fault Tree Handbook, NUREG-0492 (1981).
[Odeh77] Odeh, et al., Pocket Book of Statistical Tables, Marcel Dekker (1977).
[Ohba82] Ohba, M., et al., S-shaped Software Reliability Growth Curve: How Good Is It?, COMPSAC'82, 38–44 (1982).
[Ohba84a] Ohba, M., Software Reliability Analysis Models, IBM Journal of Research and Development, 28, 428–443 (1984).
[Ohba84b] Ohba, M., Inflexion S-shaped Software Reliability Growth Models, in Stochastic Models in Reliability Theory, Osaki, S. and Hatoyama, Y. (Eds.), Springer, 144–162 (1984).
[Ohba84c] Ohba, M. and Yamada, S., S-shaped Software Reliability Growth Models, in Proc. 4th Int. Conf. Reliability and Maintainability, 430–436 (1984).
[Ohtera90a] Ohtera, H. and Yamada, S., Optimal Allocation and Control Problems for Software-Testing Resources, IEEE Trans. Reliab., R-39, 171–176 (1990).
[Ohtera90b] Ohtera, H., and Yamada, S., Optimal Software-Release Time Considering an Error-Detection Phenomenon during Operation, IEEE Transactions on Reliability, 39(5), 596–599 (1990).
[Ottenstein81] Ottenstein, L., Predicting Numbers of Errors Using Software Science, in Proceedings of the 1981 ACM Workshop/Symposium on Measurement and Evaluation of Software Quality, 157–167 (1981).
[Pham91] Pham, H., and Pham, M., Software Reliability Models for Critical Applications, Idaho National Engineering Laboratory, EG&G-2663 (1991).
[Pham93] Pham, H., Software Reliability Assessment: Imperfect Debugging and Multiple Failure Types in Software Development, Report EG&G-RAAM-10737, Idaho National Engineering Laboratory (1993).
[Pham96] Pham, H., A Software Cost Model with Imperfect Debugging, Random Life Cycle and Penalty Cost, International Journal of Systems Science, 27(5), 455–463 (1996).
[Pham97] Pham, H. and Zhang, X., An NHPP Software Reliability Model and Its Comparison, International Journal of Reliability, Quality and Safety Engineering, 4, 269–282 (1997).
[Pham99] Pham, H., Nordmann, L., and Zhang, X., A General Imperfect Software Debugging Model with S-shaped Fault Detection Rate, IEEE Transactions on Reliability, 48(2), 169–175 (1999).
[Pham2000] Pham, H., Software Reliability, Springer (2000).
[Pham2002] Pham, L., Zhang, X., and Jeske, D. R., Scaling System Test Software Failure Rate Estimates For Use in Field Environments, in Proceedings of the Annual Conference of the American Statistical Association, pp. 2692–2696 (2002).
[Pham2003] Pham, H., and Deng, C., Predictive-Ratio Risk Criterion for Selecting Software Reliability Models, in Proceedings of the Ninth ISSAT International Conference on Reliability and Quality in Design, Honolulu, Hawaii, pp. 17–21 (2003).
[Pukite98] Pukite, J., and Pukite, P., Modeling for Reliability Analysis, IEEE Press (1998).
[Rivers98] Rivers, A. T., and Vouk, M. A., Resource-Constrained Non-Operational Testing of Software, in Proceedings of the 9th International Symposium on Software Reliability Engineering, Paderborn, Germany, November, pp. 154–163, IEEE Computer Society Press (1998).
[Sandler63] Sandler, G. H., System Reliability Engineering, Prentice-Hall (1963).
[Schneidewind75] Schneidewind, N. F., Analysis of Error Processes in Computer Software, Sigplan Notices, 10, 337–346 (1975).
[Schneidewind79] Schneidewind, N. and Hoffmann, H., An Experiment in Software Error Data Collection and Analysis, IEEE Trans. Software Eng., 5(3), 276–286 (1979).
[Schneidewind92] Schneidewind, N. F., Applying Reliability Models to the Space Shuttle, IEEE Software, 28–33 (1992).
[Schneidewind93] Schneidewind, N. F., Software Reliability Model with Optimal Selection of Failure Data, IEEE Trans. on Software Engineering, 19(11), 997–1007 (1993).
[Shooman68] Shooman, M. L., Probabilistic Reliability: An Engineering Approach, McGraw-Hill (1968).
[Singpurwalla91] Singpurwalla, N. D., Determining an Optimal Time Interval for Testing and Debugging Software, IEEE Transactions on Software Engineering, 17(4), 313–319 (1991).
[Stalhane92] Stalhane, T., Practical Experience with Safety Assessment of a System for Automatic Train Control, in Proceedings of SAFECOMP'92, Zurich, Switzerland, Pergamon Press (1992).
[Telcordia08] Telcordia, Telcordia Roadmap to Reliability Documents, Issue 4, August 2008, Telcordia.
[Tohma91] Tohma, Y., Yamano, H., Ohba, M., and Jacoby, R., The Estimation of Parameters of the Hypergeometric Distribution and its Application to the Software Reliability Growth Model, IEEE Transactions on Software Engineering, 17(5), 483–489 (1991).
[Trivedi02] Trivedi, K., Probability and Statistics with Reliability, Queueing, and Computer Science Applications, 2nd Edition, Wiley (2002).
[Wood96] Wood, A., Predicting Software Reliability, IEEE Computer Magazine, November, 69–77 (1996).
[Wu2007] Wu, Y. P., Hu, Q. P., Xie, M., and Ng, S. H., Modeling and Analysis of Software Fault Detection and Correction Process by Considering Time Dependency, IEEE Transactions on Reliability, 56(4), 629–642 (2007).
[Xie91] Xie, M., Software Reliability Engineering, World Scientific (1991).
[Xie92] Xie, M., and Zhao, M., The Schneidewind Software Reliability Model Revisited, in Proceedings of the Third International Symposium on Software Reliability Engineering, 184–193 (1992).
[Yamada83] Yamada, S., Ohba, M., and Osaki, S., S-shaped Reliability Growth Modeling for Software Error Detection, IEEE Transactions on Reliability, 12, 475–484 (1983).
[Yamada84] Yamada, S., et al., Software Reliability Analysis Based on a Nonhomogeneous Error Detection Rate Model, Microelectronics and Reliability, 24, 915–920 (1984).
[Yamada85] Yamada, S., and Osaki, S., Discrete Software Reliability Growth Models, Applied Stochastic Models and Data Analysis, 1, 65–77 (1985).
[Yamada86] Yamada, S., Ohtera, H., and Narihisa, H., Software Reliability Growth Models with Testing Effort, IEEE Trans. on Reliability, 35(1), 19–23 (1986).
[Yamada90] Yamada, S. and Ohtera, H., Software Reliability Growth Models for Testing-Effort Control, European J. Operational Research, 46, 343–349 (1990).
[Yamada91a] Yamada, S., Software Quality/Reliability Measurement and Assessment: Software Reliability Growth Models and Data Analysis, Journal of Information Processing, 14(3), 254–266 (1991).
[Yamada91b] Yamada, S., Tokuno, K., and Osaki, S., Imperfect Debugging Models with Fault Introduction Rate for Software Reliability Assessment, International Journal of Systems Science, 23(12), 2253–2264 (1991).
[Yamada90b] Yamada, S. and Ohtera, H., Software Reliability Growth Model for Testing Effort Control, European J. Operational Research, 46, 343–349 (1990).
[Yamada92] Yamada, S., Tokuno, K., and Osaki, S., Imperfect Debugging Models with Fault Introduction Rate for Software Reliability Assessment, International Journal of Systems Science, 23(12), 2241–2252 (1992).
[Zhang02] Zhang, X., Jeske, D. R., and Pham, H., Calibrating Software Reliability Models When the Test Environment Does Not Match the User Environment, Applied Stochastic Models in Business and Industry, 18, 87–99 (2002).
[Zhang06] Zhang, X., and Pham, H., Field Failure Rate Prediction Before Deployment, Journal of Systems and Software, 79(3) (2006).
[Zhao92] Zhao, M. and Xie, M., On the Log-Power NHPP Software Reliability Model, in Proceedings of the Third International Symposium on Software Reliability Engineering, 14–22 (1992).

The following textbooks document reliability modeling techniques and statistics background.

Bagowsky, I., Reliability Theory and Practice, Prentice-Hall (1961).
Crowder, M. J., Kimber, A., Sweeting, T., and Smith, R., Statistical Analysis of Reliability Data, Chapman & Hall/CRC Press (1994).
Feller, W., An Introduction to Probability Theory and Its Applications, Vol. 1, Wiley (1968).
Kemeny, J. G., and Snell, J. L., Finite Markov Chains, Van Nostrand (1960).
Lyu, M. R., Software Fault Tolerance, Wiley (1995).
Pukite, J., and Pukite, P., Modeling for Reliability Analysis, IEEE Press (1998).
Sandler, G. H., System Reliability Engineering, Prentice-Hall (1963).
Shooman, M. L., Probabilistic Reliability: An Engineering Approach, McGraw-Hill (1968).
Trivedi, K., Probability and Statistics with Reliability, Queueing, and Computer Science Applications, 2nd Edition, Wiley (2002).

The following are references on reliability terminology, specific reliability topics, statistical tables, and so on.

Akaike, H., A New Look at Statistical Model Identification, IEEE Transactions on Automatic Control, 19, 716–723 (1974).
ANSI/IEEE, Standard Glossary of Software Engineering Terminology, STD-729-1991, ANSI/IEEE (1991).
Baldwin, C. J., et al., Mathematical Models for Use in the Simulation of Power Generation Outage, II, Power System Forced Outage Distributions, AIEE Transactions, 78, TP 59–849 (1954).
Chaudhuri, A., and Stenger, H., Survey Sampling: Theory and Methods, Second Edition, Chapman & Hall/CRC Press (2005).
Gossett, W. S., The Probable Error of a Mean, Biometrika, 6(1), 1–25 (1908).
IEEE, Charter and Organization of the Software Reliability Engineering Committee (1995).
Keene, S. J., Comparing Hardware and Software Reliability, Reliability Review, 14(4), 5–7, 21 (1994).
National Aeronautics and Space Administration (NASA), Fault Tree Handbook with Aerospace Applications, Version 1.1 (2002).
Odeh, et al., Pocket Book of Statistical Tables, Marcel Dekker (1977).
Stalhane, T., Practical Experience with Safety Assessment of a System for Automatic Train Control, in Proceedings of SAFECOMP'92, Zurich, Switzerland, Pergamon Press (1992).
U.S. Nuclear Regulatory Commission, Fault Tree Handbook, NUREG-0492 (1981).

The following references document software reliability modeling techniques and applications. These are SRGM textbooks that readers who are new to software reliability modeling will find very useful.

Jones, C., Applied Software Measurement, McGraw-Hill (1991).
Lyu, M. R. (Ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press (1996).
Musa, J. D., Software Reliability Engineering, McGraw-Hill (1998).
Pham, H., Software Reliability, Springer (2000).
Xie, M., Software Reliability Engineering, World Scientific (1991).

Useful papers on software system data analysis and field failure rate prediction:

Jeske, D. R., and Zhang, X., Some Successful Approaches to Software Reliability Modeling in Industry, Journal of Systems and Software, 74, 85–99 (2005).
Jeske, D. R., Zhang, X., and Pham, L., Adjusting Software Failure Rates that are Estimated from Test Data, IEEE Transactions on Reliability, 54(1), 107–114 (2005).
Zhang, X., Jeske, D. R., and Pham, H., Calibrating Software Reliability Models When the Test Environment Does Not Match the User Environment, Applied Stochastic Models in Business and Industry, 18, 87–99 (2002).
Zhang, X., and Pham, H., Field Failure Rate Prediction Before Deployment, Journal of Systems and Software, 79(3) (2006).

Useful papers on widely used software reliability growth models:

Goel, A. L. and Okumoto, K., Time-Dependent Fault Detection Rate Model for Software and Other Performance Measures, IEEE Transactions on Reliability, 28, 206–211 (1979).
Goel, A. L. and Okumoto, K., A Markovian Model for Reliability and Other Performance Measures of Software Systems, in Proceedings of the National Computer Conference, pp. 769–774 (1979).
Keiller, P. A. and Miller, D. R., On the Use and the Performance of Software Reliability Growth Models, Software Reliability and Safety, 32, 95–117 (1991).
Kremer, W., Birth-Death and Bug Counting, IEEE Transactions on Reliability, R-32(1), 37–47 (1983).
Ohba, M., Inflexion S-shaped Software Reliability Growth Models, in Stochastic Models in Reliability Theory, Osaki, S. and Hatoyama, Y. (Eds.), Springer, 144–162 (1984).
Pham, H. and Zhang, X., An NHPP Software Reliability Model and Its Comparison, International Journal of Reliability, Quality and Safety Engineering, 4, 269–282 (1997).
Pham, H., Nordmann, L., and Zhang, X., A General Imperfect Software Debugging Model with S-shaped Fault Detection Rate, IEEE Transactions on Reliability, 48(2), 169–175 (1999).
Pham, L., Zhang, X., and Jeske, D. R., Scaling System Test Software Failure Rate Estimates For Use in Field Environments, in Proceedings of the Annual Conference of the American Statistical Association, pp. 2692–2696 (2002).
Pham, H., and Deng, C., Predictive-Ratio Risk Criterion for Selecting Software Reliability Models, in Proceedings of the Ninth ISSAT International Conference on Reliability and Quality in Design, Honolulu, Hawaii, pp. 17–21 (2003).
Rivers, A. T., and Vouk, M. A., Resource-Constrained Non-Operational Testing of Software, in Proceedings of the 9th International Symposium on Software Reliability Engineering, Paderborn, Germany, November, pp. 154–163, IEEE Computer Society Press (1998).
Tohma, Y., Yamano, H., Ohba, M., and Jacoby, R., The Estimation of Parameters of the Hypergeometric Distribution and its Application to the Software Reliability Growth Model, IEEE Transactions on Software Engineering, 17(5), 483–489 (1991).
Wood, A., Predicting Software Reliability, IEEE Computer Magazine, November, 69–77 (1996).
Yamada, S., Ohba, M., and Osaki, S., S-shaped Reliability Growth Modeling for Software Error Detection, IEEE Transactions on Reliability, 12, 475–484 (1983).
Yamada, S., Tokuno, K., and Osaki, S., Imperfect Debugging Models with Fault Introduction Rate for Software Reliability Assessment, International Journal of Systems Science, 23(12), 2241–2252 (1992).

Software reliability growth models incorporating fault detection and removal processes are discussed in the following.

Gokhale, S. S., Lyu, M. R., and Trivedi, K. S., Software Reliability Analysis Incorporating Fault Detection and Debugging Activities, in Proceedings of the Ninth International Symposium on Software Reliability Engineering, 4–7 November, 202–211 (1998).

Huang, C.-Y., Lin, C.-T., Lyu, M. R., and Sue, C.-C., Software Reliability Growth Models Incorporating Fault Dependency with Various Debugging Time Lags, in Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC 2004), 1, 186–191 (2004).
Littlewood, B., Popov, P. T., Strigini, L., and Shryane, N., Modeling the Effects of Combining Diverse Software Fault Detection Techniques, IEEE Transactions on Software Engineering, 26(12), 1157–1167 (2000).
Wu, Y. P., Hu, Q. P., Xie, M., and Ng, S. H., Modeling and Analysis of Software Fault Detection and Correction Process by Considering Time Dependency, IEEE Transactions on Reliability, 56(4), 629–642 (2007).

Software reliability growth models incorporating testing coverage are covered in the following references.

Cai, X., and Lyu, M. R., Software Reliability Modeling with Test Coverage: Experimentation and Measurement with a Fault-Tolerant Software Project, in The 18th IEEE International Symposium on Software Reliability, 17–26 (2007).
Gokhale, S. S., and Mullen, R. E., From Test Count to Code Coverage Using the Lognormal Failure Rate, in 15th International Symposium on Software Reliability Engineering, 295–305 (2004).
Levendel, Y., Software Quality and Reliability Prediction: A Time-Dependent Model with Controllable Testing Coverage and Repair Intensity, in Proceedings of the Fourth Israel Conference on Computer Systems and Software Engineering, 175–181 (1989).
Lyu, M. R., Huang, Z., Sze, S. K. S., and Cai, X., An Empirical Study on Testing and Fault Tolerance for Software Reliability Engineering, in 14th International Symposium on Software Reliability Engineering, 119–130 (2003).
Malaiya, Y. K., Li, N., Bieman, J., Karcich, R., and Skibbe, B., The Relationship Between Test Coverage and Reliability, in Proceedings of the 5th International Symposium on Software Reliability Engineering, 186–195 (1994).
Malaiya, Y. K., Li, M. N., Bieman, J. M., and Karcich, R., Software Reliability Growth with Test Coverage, IEEE Transactions on Reliability, 420–426 (2002).

The following materials cover software metrics, such as complexity.

Function Points website, http://www.ifpug.org/.

Halstead, M. H., Elements of Software Science, Elsevier North-Holland (1977).
McCabe, T. J., A Complexity Measure, IEEE Transactions on Software Engineering, SE-2(4), 308–320 (1976).

The following standards documents describe the methodology used to determine the downtime, availability, and failure rate estimates in system reliability analysis.

Telcordia standards documents:

Special Report SR-TSY-001171, Methods and Procedures for System Reliability Analysis, Issue 2, November 2007, Telcordia.
Special Report SR-332, Reliability Prediction Procedure for Electronic Equipment, Issue 2, January 2006, Telcordia Technologies.
GR-63-CORE, NEBS Requirements: Physical Protection, Issue 2, April 2002, Telcordia.
GR-357-CORE, Generic Requirements for Assuring the Reliability of Components Used in Telecommunications Equipment, Issue 1, March 2001, Telcordia.
SR-TSY-000385, Bell Communications Research Reliability Manual, Issue 1, June 1986, Bell Communications Research.
GR-418-CORE, Generic Reliability Assurance Requirements for Fiber Optic Transport Systems, Issue 2, December 1999, Telcordia.
GR-512-CORE, LSSGR: Reliability, Chapter 12, Issue 2, January 1998, Telcordia.
GR-874-CORE, An Introduction to the Reliability and Quality Generic Requirements (RQGR), Issue 3, April 1997, Bellcore.
GR-929, Reliability and Quality Measurements for Telecommunications Systems (RQMS-Wireline), Issue 8, December 2002, Telcordia.
GR-1339-CORE, Generic Reliability Requirements for Digital Cross-Connect Systems, Issue 1, March 1997, Bellcore.
GR-1929, Reliability and Quality Measurements for Telecommunications Systems (RQMS-Wireless), Issue 1, December 1999, Telcordia.
GR-2813, Generic Requirements for Software Reliability Prediction, Issue 1, December 1993, Bellcore.
GR-2841-CORE, Generic Requirements for Operations Systems Platform Reliability, Issue 1, June 1994, Telcordia.

Other documents related to reliability standards:

IEC 60300-3-1, Dependability Management—Part 3-1: Application Guide—Analysis Techniques for Dependability—Guide on Methodology, International Electrotechnical Commission, second edition (2003).

IEC 61713, Software Dependability through the Software Life-Cycle Process Application Guide (2000).
NESAC Recommendations, http://questforum.asq.org/public/nesac/index.shtml.
NRIC Best Practices, http://www.nric.org.
TL 9000 Quality Measurement System, Measurements Handbook, Release 4.0, QuEST Forum, December 31, 2006.

More Telcordia reliability and quality documents from the "Telcordia Roadmap to Reliability Documents" (there may be some overlap with the documents listed previously):

GR-282-CORE, Software Reliability and Quality Acceptance Criteria (SRQAC), Issue 4, July 2006, Telcordia.
GR-326-CORE, Generic Requirements for Singlemode Optical Connectors and Jumper Assemblies, Issue 3, September 1999, Telcordia.
GR-357-CORE, Generic Requirements for Assuring the Reliability of Components Used in Telecommunications Equipment, Issue 1, March 2001, Telcordia.
GR-418-CORE, Generic Reliability Assurance Requirements for Fiber Optic Transport Systems, Issue 2, December 1999, Telcordia.
GR-449-CORE, Generic Requirements and Design Considerations for Fiber Distributing Frames, Issue 2, July 2003, Telcordia.
GR-468-CORE, Generic Reliability Assurance Requirements for Optoelectronic Devices Used in Telecommunications Equipment, Issue 2, September 2004, Telcordia.
GR-508-CORE, Automatic Message Accounting (AMA), Issue 4, September 2003, Telcordia.
GR-910-CORE, Generic Requirements for Fiber Optic Attenuators, Issue 2, December 2000, Telcordia.
GR-929-CORE, Reliability and Quality Measurements for Telecommunications Systems (RQMS-Wireline), Issue 8, December 2002, Telcordia.
GR-1110-CORE, Broadband Switching System (BSS) Generic Requirements, Issue 4, December 2000, Telcordia.
GR-1221-CORE, Generic Reliability Assurance Requirements for Passive Optical Components, Issue 2, January 1999, Telcordia.
GR-1241-CORE, Supplemental Service Control Point (SCP) Generic Requirements, Issue 7, December 2006, Telcordia.
GR-1274-CORE, Generic Requirements for Reliability Qualification Testing of Printed Wiring Assemblies Exposed to Airborne Hygroscopic Dust, Issue 1, July 1994, Telcordia.
GR-1280-CORE, Advanced Intelligent Network (AIN) Service Control Point (SCP) Generic Requirements, Issue 1, November 1993, Telcordia.

GR-1312-CORE, Generic Requirements for Optical Fiber Amplifiers and Proprietary Dense Wavelength-Division Multiplexed Systems, Issue 3, April 1999, Telcordia.
GR-2813-CORE, Generic Requirements for Software Reliability Prediction, Issue 1, December 1993, Bellcore.
GR-2853-CORE, Generic Requirements for AM/Digital Video Laser Transmitters, Optical Fiber Amplifiers and Receivers, Issue 3, December 1996.
GR-3020-CORE, Nickel Cadmium Batteries in the Outside Plant, Issue 1, April 2000, Telcordia.
SR-TSY-000385, Bell Communications Research Reliability Manual, Issue 1, July 1986, Telcordia.
SR-TSY-001171, Methods and Procedures for System Reliability Analysis, Issue 2, November 2007, Telcordia.
SR-1547, The Analysis and Use of Software Reliability and Quality Data, Issue 2, December 2000, Telcordia.

INDEX

Active-active model, 70–72
Active-standby model, 72–73
Activity Recorder, 190
Application restart time, 65
Application restart success probability, 66
Asserts, 195
Assumptions, modeling, 76–78
Attributability, 11, 18
   customer-attributable, 11, 26
   external causes, 36
   product-attributable, 11, 27
      hardware, 19
      software, 19
Audits, 197
Automate procedures, 203
Automatic failover time, 68, 131, 148
Automatic failover success probability, 68, 132, 149
Automatic fault detection, 205
Automatic recoveries, 23, 25
Automatic response procedure display, 205
Availability:
   classical view, 8
   conceptual model, 15
   customer's view, 9
   definition, 6, 59, 221–222, 226–227
   road map, 209–210
Backout, 206
Binomial distribution, 228–230

Budgets (downtime), 26–29
Bus parity, 188
Bus watchdog timers, 189
Calibration factor, 114
Camp-on diagnostics, 200
Capacity loss, 10–12
Checksums, 199
Circuit breakers, 183
Clock failure detectors, 190
Compensation policies, 33
Complexity, 142
Confidence intervals, 105–106, 237–243
Connectors, 183
Coverage, 107, 129–130
   hardware, 67, 139
   software, 67
Covered fault detection time, 64, 109, 139
Counting rules, 12, 61
CPU watchdog timers, 187
CRC:
   message, 197
   serial bus, 188
Critical failure-mode monitoring, 199
Critical severity problem, 14
Customer attributable (outages/downtime), 11, 26, 60
Customer policies and procedures, 32
Cut set, 53

Dataset characterization, 171
Diagnostics:
   camp-on, 200
   routine, 200
   runtime, 201
Defect:
   severity, 14
   software, 114–119
Displaying reminders, 205
Distribution:
   probability, 228–237
Downtime:
   customer attributable, 60
   definition, 59, 227
   product attributable, 59
   unplanned, 61
Duration:
   outage duration, 11, 19
   parking duration, 35
   uncovered failure detection, 65
Element availability, 12
Error checking and correcting on memory, 189
Exclusion rules, 12, 61
Exposure time, 123, 170
Exponential distribution, 230–231
Failed-restart detection probability, 68
Fail-safe fan controllers, 181
Failover time, 68, 131
Failure:
   hardware, 19, 21
   software, 19, 21
   covered, 66–67
   category, 16, 19, 20
Failure intensity function, 251
Failure rate:
   calibration, 208–209
   estimating, 108
   function, 224–226
   hardware, 62–63, 111–114, 138–139
   prediction, 250
   software, 63–64, 114–115
Fans, 181, 182
Fan alarms, 181
Fan controllers, 181, 182

Fan fusing, 184
Fault detection and correction process, 248
Fault tree, 52–53
Fault insertion:
   testing, 129
Fault tree models, 38, 52–53
Feasibility, 206–208
FIT, 46
Field data analysis:
   alarm, 106–108
   outage, 96–106
Field replaceable:
   circuit breakers, 183
   electronics, 184
   fans, 181
   power supplies, 181
Full-application restart time, 65
Full-application restart success probability, 66
Function point, 114
Gamma distribution, 234–235
Goel-Okumoto model, 116
Growth (reliability growth), 115, 209
Hardware failures, 19, 21
Hardware failure rate, 62–63, 90–91
Hardware fault coverage, 67
Hardware fault injection testing, 187
Hardware MTTR, 140
Hardware redundancy, 187
Heartbeats, 194
Helping the humans, 203
Hot swap, 185
Imperfect debugging, 247, 252
Independently replaceable fan controllers, 182
Input checking, 205
Insertion/injection of faults, 187
JTAG fault insertion, 188
Lab:
   data, 114–135
   test, 208
Lognormal distribution, 236–237

Major severity problem, 14
Manual failover time, 67, 131, 149
Manual failover success probability, 68, 135, 149
Manual recoveries, 24
Markov models, 38, 42–52
Maturity, 141–142
Maximum likelihood estimate (MLE), 116
Mean value function, 251
Memory:
   error checking and correcting, 189
   leak detection, 194
   protection, 193
Message:
   CRC/parity, 197
   validation, 197
Minimal cut set models, 38, 53–55
Minimize tasks/decisions per procedure, 203
Minor severity problem, 14
Modeling:
   assumptions, 76–78
   definitions, 58–69
   fault tree, 38, 52–53
   Markov, 38, 42–52
   minimal cut set, 38, 53–55
   Monte Carlo simulation, 38, 57–58
   parameters, 87, 152
   Petri net, 38, 55–57
   reliability block diagrams, 38, 39–42
   standards, 92–93
Models:
   active-active, 70–72
   active-standby, 72–73
   N+K redundancy, 73–75
   N-out-of-M redundancy, 75–76
   simplex, 69–70
MTTR, 8, 140
MTTF, 8, 223–224
MTTO, 102
N+1 fans, 181
N+K:
   protection, 199
   redundancy model, 73–75
N-out-of-M redundancy model, 75–76

Network element impact outage, 12
Nonhomogeneous Poisson process (NHPP), 115
Normal distribution, 235–236
Normalization factors, 13
Null pointer access detection, 193
Outage:
   classifications, 20, 100
   definition, 11
   downtime, 12
   duration, 11, 19
   exclusions, 12
   partial outage, 60
Overload detection and control, 193
Parameter validation, 196
Parity:
   bus, 188
   message, 197
Parking, 35
Partial outage, 60
Pass rate, 129–135
Petri net models, 38, 55–57
Planned downtime, 28
Poisson distribution, 229–230
Postmortem data collection, 199
Power:
   feeds, 180, 183
   supplies, 181
   supply monitoring, 188
   switch protection, 184
Power-on self-tests, 200
Prediction:
   accuracy, 167–171
   error, 172
   software failure rate prediction, 114–129, 140–145
Primary functionality, 32, 58
Problem severities, 14
Procedures:
   automating, 203
   clarity, 204
   documenting, 203, 205
   making intuitive, 204
   similarity, 204
   simplification, 204
   testing, 203
Procedural outages, 26, 27

Process monitoring, 194
Process restart time, 65
Process restart success probability, 66
Product attributable outage/downtime, 11, 27, 59
Progress indicator, 204
Pro-rating outages/downtime, 60
QuEST Forum, 5
Rayleigh distribution, 233–234
Reboot time, 65, 200
Reboot success probability, 65
Recovery:
   automatic recovery, 23, 24
   manual emergency recovery, 24
   planned/scheduled recovery, 24–26
   time, 64, 130, 145–146
Redundant power feeds, 180
Reliability, 7, 222–223
   block diagrams, 38, 39–42
   definition, 7
   report, 110–111, 215–220
   road map, 209–210
Requirements:
   availability requirements, 179
   fault coverage requirement, 15
   recovery time requirement, 145–146
Return code checking, 195
Repair time, 64, 140
Report system state, 206
Residual defects, 118
Reuse, 141
Road map, 209–210
Rolling updates, 198
Routine diagnostics, 200
Runtime consistency checking, 197
Runtime diagnostics, 201
Safe point identification, 205
Sampling error, 172
Scheduled events, 32
Sensitivity analysis, 149–166
Service impact outage, 12
Service life, 63
Service, 7

Service affecting, 120, 123
Severity, 120, 123
Simplex model, 69
Single process restart time, 65
Single process restart success probability, 66
Size (software), 141
Soft error detection, 189
Software failures, 21
Software failure rate, 63–64, 91–92, 140–145
Software fault coverage, 67, 145
Software metrics, 141
   complexity, 142, 245
   maturity, 141
   reuse, 141
   size, 141
Software reliability growth modeling, 115–129
   application, 249
   concave, 126
   fault detection and correction process, 248
   hypergeometric models, 249
   failure intensity function, 251
   mean value function, 251
   imperfect debugging, 247, 252
   parameter estimation, 253
   residual defects, 118
   SRGM model selection criteria, 255
   S-shaped, 124–127
   testing coverage, 248
Software updates, 198
S-shaped model, 124–127
Standards, 89–94
System activity recorder, 190
System state, 206
Task monitoring, 194
Techniques:
   hardware, 186–192
   physical design, 179–186
   procedural techniques, 202–206
   software, 192–202
Temperature monitoring, 183
Tight loop detection, 198
Timeouts, 197
TL 9000, 5

Unavailability, definition, 59
Uncovered failure recovery time, 108, 131, 146–147
Uncovered fault detection time, 65
Undo, 206
Unplanned downtime, 61
Validation:
   message, 197
   parameter, 196

Visual status indicators, 185
Watchdog timers:
   bus, 189
   CPU, 187
Weibull distribution, 231–233
Widget example, 78–89
Yamada exponential model, 126

ABOUT THE AUTHORS

ERIC BAUER is technical manager of Reliability Engineering in the Wireline Division of Alcatel-Lucent. He originally joined Bell Labs to design digital telephones and went on to develop multitasking operating systems on personal computers. Mr. Bauer then worked on network operating systems for sharing resources across heterogeneous operating systems, and developed an enhanced, high-performance UNIX file system to facilitate file sharing across Microsoft, Apple, and UNIX platforms, which led to work on an advanced Internet service platform at AT&T Labs. Mr. Bauer then joined Lucent Technologies to develop a new Java-based private branch exchange (PBX) telephone system that was a forerunner of today's IP Multimedia Subsystem (IMS) solutions, and later worked on a long-haul/ultra-long-haul optical transmission system. When Lucent centralized reliability engineering, Mr. Bauer joined the Lucent reliability team to lead a reliability group, and has since worked on reliability engineering for a variety of wireless and wireline products and solutions. He has been awarded 11 U.S. patents, holds a Bachelor of Science degree in Electrical Engineering from Cornell University, Ithaca, New York, and a Master of Science degree in Electrical Engineering from Purdue University, West Lafayette, Indiana. He lives in Freehold, New Jersey.

XUEMEI ZHANG received her Ph.D. in Industrial Engineering and her Master of Science degree in Statistics from Rutgers University, New Brunswick, New Jersey. Currently she is a principal member of technical staff in the Network Design and Performance Analysis Department at AT&T Labs. Prior to joining AT&T Labs, she worked in the Performance Analysis Department and the Reliability Department at Bell Labs in Lucent Technologies (later Alcatel-Lucent) in Holmdel, New Jersey. She has worked on reliability and performance analysis of wireline and wireless communications systems and networks. Her major work and research areas are system and architectural reliability and performance, product and solution reliability and performance modeling, and software reliability. She has published more than 30 journal and conference papers, and has six awarded and pending U.S. patent applications in the areas of system redundancy design, software reliability, radio network redundancy, and end-to-end solution key performance and reliability evaluation. She has served as a program committee member and conference session chair for international conferences and workshops, and was an invited committee member for Ph.D. and Master's theses at Rutgers University, Piscataway, New Jersey, and the New Jersey Institute of Technology, Newark, New Jersey. Dr. Zhang is the recipient of a number of awards and scholarships, including the Bell Labs President's Gold Awards in 2002 and 2004, the Bell Labs President's Silver Award in 2005, the Best Contribution Award for 3G WCDMA in 2000 and 2001, and fellowships and scholarships from Rutgers University.

DOUGLAS A. KIMBER earned his Bachelor of Science in Electrical Engineering degree from Purdue University, West Lafayette, Indiana, and a Master of Science in Electrical Engineering degree from the University of Michigan, Ann Arbor, Michigan. He began his career designing telecommunications circuit boards for an Integrated Services Digital Network (ISDN) packet switch at AT&T Bell Labs. He followed the corporate transition from AT&T to Lucent Technologies, and finally to Alcatel-Lucent. During this time he did software development in the System Integrity department, which was responsible for monitoring and maintaining service on the 5ESS digital switch; this is where he got his start in system reliability. He then moved on to develop circuitry and firmware for the Reliable Clustered Computing (RCC) Department. RCC created hardware and software that enhanced the reliability of commercial products, and allowed Mr. Kimber to work on all aspects of system reliability. After RCC, Mr. Kimber did systems engineering and architecture, and ultimately worked in the Reliability Department, where he was able to apply his experience to analyze and improve the reliability of a variety of products. He holds four U.S. patents with several more pending, and was awarded two Bell Laboratories President's Silver Awards in 2004. Mr. Kimber is currently retired and spends his time pursuing his hobbies, which include circuit design, software, woodworking, automobiles (especially racing), robotics, and gardening.