
Risk Assessment

WILEY SERIES IN STATISTICS IN PRACTICE

Advisory Editor, Marian Scott, University of Glasgow, Scotland, UK
Founding Editor, Vic Barnett, Nottingham Trent University, UK

Statistics in Practice is an important international series of texts which provide detailed coverage of statistical concepts, methods, and worked case studies in specific fields of investigation and study. With sound motivation and many worked practical examples, the books show in down-to-earth terms how to select and use an appropriate range of statistical techniques in a particular practical field within each title's special topic area. The books provide statistical support for professionals and research workers across a range of employment fields and research environments. Subject areas covered include medicine and pharmaceutics; industry, finance, and commerce; public services; the earth and environmental sciences; and so on. The books also provide support to students studying statistical courses applied to the aforementioned areas. The demand for graduates to be equipped for the work environment has led to such courses becoming increasingly prevalent at universities and colleges. It is our aim to present judiciously chosen and well-written workbooks to meet everyday practical needs. Feedback of views from readers will be most valuable to monitor the success of this aim. A complete list of titles in this series appears at the end of the volume.

Risk Assessment
Theory, Methods, and Applications

Second Edition

Marvin Rausand
Stein Haugen

This edition first published 2020
© 2020 John Wiley & Sons, Inc.

Edition History: John Wiley & Sons, Inc. (1e, 2011)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Marvin Rausand and Stein Haugen to be identified as the authors of this work has been asserted in accordance with law.

Registered Office: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office: 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com. Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data
Names: Rausand, Marvin, author. | Haugen, Stein, author.
Title: Risk assessment : theory, methods, and applications / Marvin Rausand, Stein Haugen.
Description: Second edition. | Hoboken, NJ : John Wiley & Sons, 2020. | Series: Wiley series in statistics in practice | Includes bibliographical references and index.
Identifiers: LCCN 2019041379 (print) | LCCN 2019041380 (ebook) | ISBN 9781119377238 (hardback) | ISBN 9781119377283 (adobe pdf) | ISBN 9781119377221 (epub)
Subjects: LCSH: Technology–Risk assessment. | Risk assessment.
Classification: LCC T174.5 .R37 2020 (print) | LCC T174.5 (ebook) | DDC 363.1/02–dc22
LC record available at https://lccn.loc.gov/2019041379
LC ebook record available at https://lccn.loc.gov/2019041380

Cover Design: Wiley
Cover Image: © Soloviova Liudmyla/Shutterstock

Set in 10/12pt WarnockPro by SPi Global, Chennai, India
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

To Hella, Guro and Idunn, Emil and Tiril

To Jorunn, Trine, Ingrid, Kristian and Brage, Nora and Alma

Contents

Preface xiii
Acknowledgments xvii
About the Companion Site xix

1 Introduction 1
1.1 Risk in Our Modern Society 1
1.2 Important Trends 2
1.3 Major Accidents 4
1.4 History of Risk Assessment 4
1.5 Applications of Risk Assessment 9
1.6 Objectives, Scope, and Delimitation 11
1.7 Problems 12
References 13

2 The Words of Risk Analysis 15
2.1 Introduction 15
2.2 Risk 16
2.3 What Can Go Wrong? 20
2.4 What is the Likelihood? 38
2.5 What are the Consequences? 44
2.6 Additional Terms 49
2.7 Problems 54
References 56

3 Main Elements of Risk Assessment 59
3.1 Introduction 59
3.2 Risk Assessment Process 60
3.3 Risk Assessment Report 76
3.4 Risk Assessment in Safety Legislation 81
3.5 Validity and Quality Aspects of a Risk Assessment 82
3.6 Problems 83
References 84

4 Study Object and Limitations 87
4.1 Introduction 87
4.2 Study Object 87
4.3 Operating Context 91
4.4 System Modeling and Analysis 92
4.5 Complexity 95
4.6 Problems 97
References 98

5 Risk Acceptance 99
5.1 Introduction 99
5.2 Risk Acceptance Criteria 99
5.3 Approaches to Establishing Risk Acceptance Criteria 106
5.4 Risk Acceptance Criteria for Other Assets than Humans 114
5.5 Closure 115
5.6 Problems 115
References 117

6 Measuring Risk 121
6.1 Introduction 121
6.2 Risk Metrics 121
6.3 Measuring Risk to People 123
6.4 Risk Matrices 148
6.5 Reduction in Life Expectancy 154
6.6 Choice and Use of Risk Metrics 156
6.7 Risk Metrics for Other Assets 158
6.8 Problems 159
References 163

7 Risk Management 167
7.1 Introduction 167
7.2 Scope, Context, and Criteria 170
7.3 Risk Assessment 170
7.4 Risk Treatment 171
7.5 Communication and Consultation 172
7.6 Monitoring and Review 173
7.7 Recording and Reporting 174
7.8 Stakeholders 175
7.9 Risk and Decision-Making 176
7.10 Safety Legislation 179
7.11 Problems 180
References 180

8 Accident Models 183
8.1 Introduction 183
8.2 Accident Classification 183
8.3 Accident Investigation 188
8.4 Accident Causation 188
8.5 Accident Models 190
8.6 Energy and Barrier Models 193
8.7 Sequential Accident Models 195
8.8 Epidemiological Accident Models 201
8.9 Event Causation and Sequencing Models 208
8.10 Systemic Accident Models 213
8.11 Combining Accident Models 228
8.12 Problems 229
References 230

9 Data for Risk Analysis 235
9.1 Types of Data 235
9.2 Quality and Applicability of Data 238
9.3 Data Sources 239
9.4 Expert Judgment 250
9.5 Data Dossier 254
9.6 Problems 254
References 257

10 Hazard Identification 259
10.1 Introduction 259
10.2 Checklist Methods 263
10.3 Preliminary Hazard Analysis 266
10.4 Job Safety Analysis 278
10.5 FMECA 287
10.6 HAZOP 295
10.7 STPA 306
10.8 SWIFT 316
10.9 Comparing Semiquantitative Methods 322
10.10 Master Logic Diagram 322
10.11 Change Analysis 324
10.12 Hazard Log 327
10.13 Problems 331
References 335

11 Causal and Frequency Analysis 339
11.1 Introduction 339
11.2 Cause and Effect Diagram Analysis 341
11.3 Fault Tree Analysis 344
11.4 Bayesian Networks 370
11.5 Markov Methods 384
11.6 Problems 396
References 400

12 Development of Accident Scenarios 401
12.1 Introduction 401
12.2 Event Tree Analysis 402
12.3 Event Sequence Diagrams 426
12.4 Cause–Consequence Analysis 426
12.5 Hybrid Causal Logic 428
12.6 Escalation Problems 429
12.7 Consequence Models 429
12.8 Problems 431
References 435

13 Dependent Failures and Events 437
13.1 Introduction 437
13.2 Dependent Failures and Events 437
13.3 Dependency in Accident Scenarios 439
13.4 Cascading Failures 441
13.5 Common-Cause Failures 442
13.6 β-Factor Model 452
13.7 Binomial Failure Rate Model 456
13.8 Multiple Greek Letter Model 457
13.9 α-Factor Model 459
13.10 Multiple β-Factor Model 461
13.11 Problems 461
References 462

14 Barriers and Barrier Analysis 465
14.1 Introduction 465
14.2 Barriers and Barrier Classification 466
14.3 Barrier Management 474
14.4 Barrier Properties 476
14.5 Safety-Instrumented Systems 477
14.6 Hazard–Barrier Matrices 487
14.7 Safety Barrier Diagrams 488
14.8 Bow-Tie Diagrams 490
14.9 Energy Flow/Barrier Analysis 490
14.10 Layer of Protection Analysis 493
14.11 Barrier and Operational Risk Analysis 502
14.12 Systematic Identification and Evaluation of Risk Reduction Measures 512
14.13 Problems 518
References 520

15 Human Reliability Analysis 525
15.1 Introduction 525
15.2 Task Analysis 536
15.3 Human Error Identification 543
15.4 HRA Methods 552
15.5 Problems 573
References 574

16 Risk Analysis and Management for Operation 579
16.1 Introduction 579
16.2 Decisions About Risk 581
16.3 Aspects of Risk to Consider 583
16.4 Risk Indicators 585
16.5 Risk Modeling 594
16.6 Operational Risk Analysis – Updating the QRA 596
16.7 MIRMAP 598
16.8 Problems 601
References 602

17 Security Assessment 605
17.1 Introduction 605
17.2 Main Elements of Security Assessment 608
17.3 Industrial Control and Safety Systems 615
17.4 Security Assessment 617
17.5 Security Assessment Methods 625
17.6 Application Areas 626
17.7 Problems 627
References 628

18 Life Cycle Use of Risk Analysis 631
18.1 Introduction 631
18.2 Phases in the Life Cycle 631
18.3 Comments Applicable to all Phases 634
18.4 Feasibility and Concept Selection 635
18.5 Preliminary Design 637
18.6 Detailed Design and Construction 639
18.7 Operation and Maintenance 641
18.8 Major Modifications 641
18.9 Decommissioning and Removal 643
18.10 Problems 643
References 643

19 Uncertainty and Sensitivity Analysis 645
19.1 Introduction 645
19.2 Uncertainty 647
19.3 Categories of Uncertainty 648
19.4 Contributors to Uncertainty 651
19.5 Uncertainty Propagation 656
19.6 Sensitivity Analysis 661
19.7 Problems 663
References 664

20 Development and Applications of Risk Assessment 667
20.1 Introduction 667
20.2 Defense and Defense Industry 668
20.3 Nuclear Power Industry 670
20.4 Process Industry 674
20.5 Offshore Oil and Gas Industry 678
20.6 Space Industry 681
20.7 Aviation 683
20.8 Railway Transport 685
20.9 Marine Transport 686
20.10 Machinery Systems 689
20.11 Food Safety 690
20.12 Other Application Areas 692
20.13 Closure 695
References 697

Appendix A Elements of Probability Theory 701
A.1 Introduction 701
A.2 Outcomes and Events 701
A.3 Probability 706
A.4 Random Variables 710
A.5 Some Specific Distributions 718
A.6 Point and Interval Estimation 728
A.7 Bayesian Approach 732
A.8 Probability of Frequency Approach 733
References 739

Appendix B Acronyms 741
Author Index 747
Subject Index 753


Preface

This book gives a comprehensive introduction to risk analysis and risk assessment, with a focus on the theory and the main methods for such analyses. The objects studied are technical or sociotechnical systems, and we delimit our attention to potential, sudden, and major accidents. Day-to-day safety problems and negative health effects due to long-term exposure are outside the scope of the book. Topics such as financial risk are also outside the scope. More detailed objectives and delimitations of the book are supplied at the end of Chapter 1.

What Has Changed From the First Edition?

This second edition is a major update of the first edition (Rausand 2011). Almost all sections have been reorganized and rewritten. The most significant changes include:

• Chapters 1 and 2 are totally rewritten and many definitions have been rephrased.
• Material related to the risk assessment process is merged into a new Chapter 3.
• Aspects related to the study object and its delimitation are presented in a new Chapter 4.
• The section on Petri nets is removed.
• The STAMP accident model and the STPA method are covered in a new section.
• Additional new chapters cover:
  – Risk analysis and management in operation
  – Security assessment
  – Life cycle use of risk analysis
• Exercise problems are provided at the end of all relevant chapters.
• The glossary of terms has been removed. Instead, definitions are highlighted in the subject index.
• An author index has been added.


Supplementary Information on the Internet

An immense amount of relevant information is today available on the Internet, and many of the aspects discussed in this book may be found in books, reports, notes, tutorials, or slides. The quality of this information varies from very high to very low, the terminology is often not consistent, and it may sometimes be a challenge to read some of these Internet resources. We hope that, after having read this book, you will find it easier to search for supplementary information, to understand it, and to judge its quality.

Intended Audience

The book is written primarily for engineers and engineering students, and most of the examples and applications are related to technology and technical systems. Still, we believe that other readers may also find the book useful. Two groups are our primary audience:

• The book was originally written as a textbook for university courses in risk analysis and risk assessment at NTNU. This second edition is based on experience gained from use of the first edition, at NTNU and other universities.
• The book is in addition intended to be a guide for practical risk assessments. The various methods are therefore described in sufficient detail that you should be able to use a method after having read its description. Each method is described according to the same structure. The method descriptions are, as far as possible, self-contained, and it should therefore not be necessary to read the entire book to apply the individual methods.

Readers should have a basic course in probability theory. A brief introduction to probability theory is provided in Appendix A. The reader should refer to this appendix to get an understanding of what knowledge is expected.

Selection of Methods

A wide range of theories and methods have been developed for risk analysis. All of these cannot be covered in an introductory text. The objective of the book is not to show how knowledgeable the authors are, but to present theory, methods, and knowledge that will be useful for you as a risk analyst. When selecting material to cover, we focus on methods that:

• Are commonly used in industry or in other application areas
• Give the analyst insight and increase her understanding of the system (such that system weaknesses can be identified at an early stage of the analysis)
• Provide the analyst with genuine insight into system behavior
• Can be used for hand-calculation (at least for small systems)
• Can be explained rather easily to, and understood by, non-risk specialists and managers.

Both authors have been engaged in applications related to the offshore oil and gas industry, and many examples therefore come from this industry. The methods described and many of the examples are equally suitable for other industries and application areas.

Use of Software Programs

Programs are required for practical analyses, but we refrain from promoting any particular programs. A listing of suppliers of relevant programs may be found on the book companion site.

Organization

This book is divided into 20 chapters and two appendices. Chapters 1–9 focus on the basic concepts and the theory behind risk analysis and risk assessment. Chapter 1 outlines why this is an important topic and briefly reviews the history of risk assessment. In Chapter 2, basic concepts are presented and discussed, followed by Chapter 3, where the main elements of risk assessment are described step by step. In Chapter 4, we elaborate on the study object in risk assessment. In Chapter 5, we move into the important topic of risk acceptance and review some of the approaches that are used to determine whether risk can be accepted or not. Chapter 6 deals with how risk can be measured, mainly in quantitative terms. The main focus is on measuring risk to people. Moving to Chapter 7, we discuss the wider process of risk management, and specifically the role of risk assessment in risk management. A risk assessment is always influenced by the study team's perception of the potential accidents and accident causation, and accident models are therefore presented and discussed in Chapter 8. Chapter 9 lists and describes the input data that are required for a risk assessment.

Chapters 10–19 cover the most relevant analytical methods. In this book, we define risk as the answer to three questions: (i) What can go wrong? (ii) What is the likelihood of that happening? and (iii) What are the consequences? Chapters 10–12 describe methods that can be used to answer these three questions, respectively. This is followed by a set of chapters dealing with specific problem areas that we often face in risk assessment. Dependency between failures and events is often a critical factor in risk analysis, and Chapter 13 deals with methods for analysis of this issue. Chapter 14 looks at barriers and barrier analysis, and in Chapter 15, methods for analysis of human errors and human reliability are described. In Chapter 16, risk analysis and management for the operation of a system is discussed, followed by a brief review of methods for security assessment in Chapter 17. Risk analysis is used in various ways throughout the life cycle of systems, and Chapter 18 provides a description of its use from the conceptual stage through to decommissioning and removal. The uncertainties related to the results from a risk analysis are often of concern, and this is treated in Chapter 19. Chapter 20 briefly reviews applications of risk assessment.

The various analytical methods are, as far as possible, presented according to a common structure. The description of each method is designed to be self-contained, such that you should be able to carry out the analysis without having to read the entire book or search other sources. A consequence of this strategy is that the same information may be found in the description of several methods.

Appendix A introduces key elements of probability theory. An introduction to probability theory is given together with some elements from system reliability and Bayesian methods. If you are not familiar with probability theory, you may find it useful to read this appendix in parallel with the chapters that use probability arguments. Appendix B lists acronyms.

Online Information

Additional material, solutions to many of the end-of-chapter exercises, and any errors found after the book goes to press are posted on the book's associated website. The address of this website is provided under the heading "Extra" in Wiley's web presentation of the book.

Trondheim, Norway
1 July 2019

M. Rausand & S. Haugen

Reference

Rausand, M. (2011). Risk Assessment: Theory, Methods, and Applications. Hoboken, NJ: Wiley.


Acknowledgments

In the preface of the book "The Importance of Living" (William Morrow, New York, 1937), Lin Yutang writes: "I must therefore conclude by saying as usual that the merits of this book, if any, are largely due to the helpful suggestions of my collaborators, while for the inaccuracies, deficiencies and immaturities of judgment, I alone am responsible." If we add colleagues and references to the word collaborators, this statement applies equally well to the current book. Rather than mentioning anyone specifically, with the obvious risk that someone we would have liked to mention is forgotten, we acknowledge all the input we have received over the years that we have worked in this field.

The second author would like to thank Alma Mater Studiorum Università di Bologna for giving him the opportunity to spend eight months at the university while working on this book. A special thanks to Professor Valerio Cozzani for organizing the visit. We also acknowledge the editorial and production staff at John Wiley & Sons for their careful, effective, and professional work. In particular, we would like to mention our main contacts in the final stages of preparing the book: Kathleen Santoloci, Benjamin Elisha, and Viniprammia Premkumar.

Several definitions used in this book are from the International Electrotechnical Vocabulary (IEV), http://www.electropedia.org. We appreciate the initiative of the International Electrotechnical Commission (IEC) to make this vocabulary freely available. References to the vocabulary are given in the text as IEV xxx-yy-zz, where xxx-yy-zz is the number of the definition in the IEV.

Definitions 3.1, 3.2, and 3.3, as well as a modified version of Figure 4 from ISO 31000:2009, definition 3.6.1.3 from ISO Guide 73, and definition 3.5 from NS 5814, have all been reproduced under license from Standard Online AS, June 2019. © All rights reserved. Standard Online makes no guarantees or warranties as to the correctness of the reproduction.


Several references are given to publications by the UK Health and Safety Executive (HSE). This is public sector information published by the HSE and licensed under the Open Government License v.1.0.

During the writing of the book, we have read many books, scientific articles, standards, technical reports, guidelines, and notes related to risk assessment. We have tried to process, combine, and reformulate the information obtained, and we have tried to give proper references. If we have unconsciously copied sentences without giving proper references, it was not our intention, and we apologize if this has happened.


About the Companion Site

Risk Assessment: Theory, Methods, and Applications is accompanied by a companion website:

www.wiley.com/go/riskassessment2e

The website includes the following materials for students and other readers:

• A supplementary report (in PDF format) covering:
  – Listings of relevant scientific journals, conferences, societies, organizations issuing standards, software providers, and universities providing education programs in risk assessment.
  – Listing of important major accidents that have occurred after the book was published.
  – Suggestions for further reading (mainly with URLs) for each chapter.
  – Comments and extensions to the material provided in the various chapters.
  – Other material.
• Slides for each chapter of the book in PDF format.
• Errata – lists misprints (when they are revealed) and possible errors in the book.


The website includes the following materials for instructors:

• Solutions to the end-of-chapter problems in the book.
• Additional problems with solutions.
• Guidance on planning a course in risk assessment: lecture plans, suggested problems, etc.

The companion site will be updated from time to time, so please check the version numbers.


1 Introduction

1.1 Risk in Our Modern Society

In the Middle Ages, some of the leading engineers and architects were employed as church builders. In this period, churches changed from being built in the Romanesque style to the Gothic style. This transition implied a move from fairly massive stone structures with thick walls, limited height, and relatively few and small windows to a style with much more slender structures, rising higher, and with more and larger openings in the walls for windows. This technological development had a price, with frequent collapses of the new churches. A prominent example is the collapse of the Cathédrale Saint-Pierre de Beauvais in 1284 and then again in 1573 (Murray 1989). This is a good example of how technology traditionally has evolved, through trying and failing. The church builders of the Middle Ages moved beyond what had been done earlier, and this sometimes led to catastrophic failures. In the Middle Ages, accidents were seen as acts of God, punishing man for attempting to construct such huge buildings. Today, we have a different view of why accidents occur, and society is not willing to accept failure to the same degree as in the Middle Ages. Accidents result in loss of life or serious environmental damage and are often very expensive. Over the last few decades, concepts and techniques have been developed to help us understand and prevent failures and accidents before they happen, rather than just trying to learn from failures that occur. Application of these techniques is what we normally call risk analysis or risk assessment. Risk assessments are systematic studies of what can go wrong in the future, describing it, and evaluating whether we need to do something to reduce risk. Had these methods been available in 1284, they might have predicted and prevented the collapse of the Beauvais Cathedral (Figure 1.1) and its consequences. This book is mainly about methods for performing risk analyses and the theoretical basis for these.



Figure 1.1 The Beauvais Cathedral (Source: Photo by David Iliff. License: CC-BY-SA 3.0).

Using a word from everyday language, we may say that risk assessment is a method for systematization of foresight. The Merriam-Webster online dictionary defines foresight as "an act of looking forward," and this is exactly what we are trying to do when we analyze risk. We have now started using terms such as risk, risk analysis, and risk assessment without really explaining what they mean. For the purpose of this introductory chapter, a layman's understanding is sufficient, but proper definitions and discussions are given in Chapters 2 and 3.

1.2 Important Trends

Many trends in society have led to an increased focus on risk and risk assessment. This is partly due to increased attention to risk and reduced willingness to accept it, partly due to increased risk, and partly due to new and different risk sources being introduced or emerging. Increased attention and reduced willingness to accept risk often go hand in hand. When accidents occur, in particular serious accidents, the media attention is very high and the interest among the general public is correspondingly high. More rapid and comprehensive access to news about accidents, through the Internet, has further increased our attention to (and fear of) accidents.


The increasing focus and our reduced preparedness to accept accidents may be seen as a result of our increasing wealth. In the rich part of the world, many of the dangers that we were exposed to earlier, such as life-threatening diseases, hunger, and war, are far less prominent in our lives. Our basic needs are generally well attended to, and our attention has therefore turned to other causes of death and losses. This can explain why there are large differences in legislation, regulations, and general attention to risk between rich and poor countries. From this point of view, the expectation that accidents should be avoided can be seen as a result of increasing standards of living.

Many new trends and developments either increase risk, change existing risk, or introduce new sources of risk. Some examples of different character are given in the following.

(a) Higher speed. In recent decades, high-speed trains have become increasingly common. Higher speed implies more severe consequences if an accident occurs.

(b) Increasingly connected computer networks. More and more devices are linked through the Internet. This applies not just to computers but to many other devices, such as cars, kitchen appliances, power systems, electrical meters, heating systems in homes, and mobile phones. This introduces possibilities for accessing and hacking devices from anywhere in the world. The increased number of connected devices increases the possible consequences and their magnitude. With the rapid expansion of the Internet of things, this problem grows day by day.

(c) Increased competition and production pressure have several aspects that influence risk. Processes move faster, with less time for preparation and planning, and increasing pressure to be efficient leaves less time to take care to avoid accidents. Cost cutting may also increase risk.

(d) Autonomous systems are a new technology that changes risk. Fewer people are involved, meaning that fewer are exposed if an accident should occur. On the other hand, people not directly involved may be more exposed (e.g. pedestrians being hit by autonomous cars). Machines may be more reliable for routine tasks than operators, reducing the probability of errors, but operators are usually better at adapting to unexpected or unusual circumstances. Autonomous systems are complex, and we may not be able to predict all the ways they can fail.

(e) Terrorism has existed for a long time, but mainly locally. It is only in the last couple of decades that it has become a global phenomenon.

(f) Climate change is a global problem that changes risk in many ways. Risk related to natural hazards changes, with more violent storms, more frequent flooding, and also droughts. The world can be affected in different ways, among others through reduced food production and lack of drinking water. This can in turn lead to hunger and more refugees.


To manage the effects of all these changes, we need to understand them, and this requires systematic methods to identify and analyze them.

1.3 Major Accidents

When used in risk research, a major accident is an accident with large and even catastrophic consequences. During the previous decades, a number of major accidents have made the public increasingly aware of the risk posed by certain technical systems and activities. A common denominator of these is that they not only have immediate effects in terms of loss of life, environmental damage, or economic effects but also long-term effects, by changing the public's and the authorities' attitudes toward the systems that have been involved in the accidents. A result of this is that changes in regulations are often made after major accidents. For the companies involved, such accidents not only incur enormous costs but may even force a company out of business and seriously damage the image of an entire industry. Examples of some past accidents with far-reaching effects are listed in Table 1.1. These accidents are representative of a large number of accidents that have served to remind us that safety can never be taken for granted. Macza (2008) discusses several of these accidents and society's response to each accident with respect to legislation changes and other actions. Many books give overviews of major accidents (Kletz 2001; Mannan 2012), and investigation reports are often published. In some cases, scientific books are also written about major accidents (e.g. see Hopkins 2000; Vaughan 1996).

1.4 History of Risk Assessment

The development of risk assessment is closely related to the development of reliability assessment. The two subjects have many concepts and methods in common, and it is therefore difficult to say what belongs to risk and what belongs to reliability. The origin of the word "risk" and its early usage is thoroughly outlined by Bernstein (1998). A thorough account of the more recent history of risk assessment is given by Zackmann (2014). Here, we give only a few highlights. We realize that our presentation is biased because its main focus is delimited to developments in Europe and the United States.

Probabilistic risk assessment as we know it today had its roots in the insurance (actuarial) discipline at the end of the nineteenth century. The Swedish actuary Filip Lundberg is considered to be the founder of mathematical risk theory. His first mathematical model for nonlife insurance was presented as early as 1909, but was largely ignored until 1930, when the Swedish professor Harald Cramér developed his insurance risk theory based on Lundberg's approach. In the following years, Harald Cramér made a series of important contributions to risk and reliability theory.


Table 1.1 Some past major accidents.

Location of accident | Year | Consequences
North Atlantic | 1912 | Titanic colliding with an iceberg and sinking, 1500 killed.
Flixborough, UK | 1974 | Explosion and fire, 28 killed, more than 100 injured.
Seveso, Italy | 1976 | Dioxin release, 2000 poisoned, contamination of environment, mass evacuation.
North Sea, Norway | 1977 | Oil/gas blowout on Bravo platform, pollution of sea.
Three Mile Island, USA | 1979 | Nuclear accident. Limited actual consequences, but had potential for a major release of radiation.
Bhopal, India | 1984 | Release of toxic gas (MIC), 3800 killed, 500 000 exposed to gas.
Mexico City, Mexico | 1984 | Explosion and fire at LPG storage and distribution depot at San Juan Ixhautepec. Around 500 killed.
USA | 1986 | Explosion of Challenger space shuttle, 7 killed.
Chernobyl, Ukraine | 1986 | Explosion and atomic fallout at nuclear power station.
Basel, Switzerland | 1986 | Fire at Sandoz warehouse. Rhine River contaminated, severe environmental damage.
Zeebrügge, Belgium | 1987 | The car and passenger ferry Herald of Free Enterprise capsized, 193 killed.
North Sea, UK | 1988 | Explosion and fire on the Piper Alpha platform. Platform lost, 167 killed.
Alaska, USA | 1989 | Oil spill from tanker Exxon Valdez. Severe environmental damage.
Amsterdam, The Netherlands | 1992 | Boeing 747 cargo plane crashed near Schiphol Airport, 43 killed.
Baltic Sea | 1994 | The car and passenger ferry Estonia capsized, claiming 852 lives.
Eschede, Germany | 1998 | High-speed train derailed, 101 killed, 88 injured.
Longford, Australia | 1998 | Explosion and fire, 2 killed, Melbourne without gas for 19 days.
Bretagne, France | 1999 | Loss of tanker Erika. Major oil spill.
Enschede, The Netherlands | 2000 | Explosion in fireworks plant. 22 killed, 1000 injured, more than 300 homes destroyed.
Toulouse, France | 2001 | Explosion and fire in fertilizer plant, 30 killed, 2000 injured, 600 homes destroyed.
Galicia, Spain | 2002 | Loss of tanker Prestige, major oil spill.
Texas City, USA | 2005 | Explosion and fire, 15 killed, 180 injured.
Hertfordshire, UK | 2005 | Explosion and fire at Buncefield Depot.
Gulf of Mexico | 2010 | Blowout and explosion on the drilling rig Deepwater Horizon, 11 killed, 17 injured, rig lost, major oil spill.
Fukushima Daiichi, Japan | 2011 | Release of radioactive material with widespread contamination.
Giglio, Italy | 2012 | Cruise ship Costa Concordia capsized, 32 killed.
Indonesia/Ethiopia | 2018/2019 | Two crashes with Boeing 737 MAX, with 189 and 157 fatalities, respectively.

To become a separate discipline, risk assessment had to wait well into the twentieth century. The book "Risk, Uncertainty, and Profit" (Knight 1921) was an impressive landmark. In this book, Knight defined risk as "measurable uncertainty." Another seminal book, Industrial Accident Prevention: A Scientific Approach (Heinrich 1931), appeared 10 years later. During World War II, the German mathematicians Robert Lusser and Eric Pieruschka made important contributions to the quantification of reliability. Their most well-known result was the formula for calculating the reliability of a series system. The first draft of a standard for risk and reliability emerged in 1949, through the guideline on failure modes and effects analysis (FMEA) that was published by the US military as MIL-P-1629. This guideline was later converted to the military standard MIL-STD-1629A. Another important method, fault tree analysis, was introduced in 1962 by Bell Telephone Laboratories during a reliability study of the launch control system of the intercontinental Minuteman missile. The military standard MIL-STD-1574A "System safety program for space and missile systems" appeared in 1979 and was transformed to MIL-STD-882 "System safety" in 1987. Human error was recognized early as an important cause of accidents, and the technique for human error rate prediction (THERP) was introduced in 1962, mainly by Alan Swain. THERP was primarily directed toward identification and prevention of human errors in nuclear power plants.
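The series-system formula is not written out in this chapter; as a brief illustration of the standard result being referred to, for a system of n independent components that must all function, with component reliabilities R_1, ..., R_n, the system reliability is the product of the component reliabilities:

```latex
% Series-system reliability (the result attributed to Lusser and Pieruschka),
% assuming independent components:
R_S \;=\; \prod_{i=1}^{n} R_i \;=\; R_1 \cdot R_2 \cdots R_n
```

For example, three components with reliabilities 0.99, 0.95, and 0.90 give a system reliability of about 0.99 x 0.95 x 0.90 ≈ 0.85, which illustrates why long series systems drove the early interest in quantified reliability.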


Until 1970, risk assessments were mainly qualitative. Quantitative aspects entered the scene in parallel with the developments of reliability theory that started in the early 1960s. An impressive early work was the book "Reliability Theory and Practice" (Bazovsky 1961). Several new books on reliability theory appeared during the 1960s and set the scene for the introduction of quantitative risk assessments from approximately 1970. The first attempts to use a HAZOP-like approach to identify deviations and hazards in a chemical plant were made by ICI in 1963, but HAZOP, as we know it today, was not developed until around 1974. Preliminary hazard analysis was introduced in 1966 as a tool to fulfill the US Department of Defense's requirement for safety studies in all stages of system development.

Perhaps the most important achievement in the 1970s was the "Reactor Safety Study" (NUREG-75/014 1975). A wide range of new methods and approaches were developed, either as part of, or inspired by, this study. Important methods include the "kinetic tree theory" (KITT) by William Vesely and models for treatment of common-cause failures (Fleming 1975). The Reactor Safety Study was heavily criticized, but this criticism does not diminish its importance. The risk of nuclear energy was discussed in most Western countries, and new education programs in risk and reliability emerged in several countries.

The US Nuclear Regulatory Commission (NRC) has played a very important role in the development of risk assessment. Two major landmarks are the publication of the "Fault Tree Handbook" (NUREG-0492) in 1981 and the "PRA Procedures Guide: A Guide to the Performance of Probabilistic Risk Assessment for Nuclear Power Plants" (NUREG/CR-2300). Another US report that led to many risk assessments in many countries was "Critical Foundations: Protecting America's Infrastructures," published by the President's Commission on Critical Infrastructure Protection in 1997. The infrastructures are exposed to natural hazards, technical failures, as well as deliberate hostile actions. The concepts of vulnerability, hazard and threat, and security suddenly became common ingredients in most discussions among risk analysts. In many countries, it became mandatory for all municipalities to carry out "risk and vulnerability analyses" of infrastructure and services.

Many of the developments of risk assessment have been made as a response to major accidents (see Section 1.3). In Europe, two major accidents occurred close to the publication of the Reactor Safety Study. The first of these, the Flixborough accident, occurred in 1974 in North Lincolnshire, UK. It killed 28 people and seriously injured 36 out of a total of 72 people on-site at the time. The casualty figures could have been much higher if the explosion had occurred on a weekday, when the main office area would have been occupied. The other important accident occurred in 1976 in Seveso, approximately 20 km north of Milan in Italy, where an explosion led to the release of a significant amount of cancer-causing dioxin.


Together with the Flixborough accident, the Seveso accident triggered the development of the new EU directive on "the major-accident hazards of certain activities," which is known as the Seveso directive and was approved in 1982.

In the 1970s and 1980s, a range of laws and regulations on safety and risk emerged in many countries. Two well-known laws are the US Consumer Product Safety Act from 1972 and the UK Health and Safety at Work Act from 1974. Many new organizations were established to prevent accidents. The United Kingdom Atomic Energy Authority (UKAEA) was formed as early as 1954. In 1971, UKAEA formed its Safety and Reliability Directorate (SRD). The UKAEA SRD was an active organization and published a range of high-quality reports. One of the central persons in SRD was Frank Reginald Farmer, who became famous for the Farmer curve (FN-curve), which was used to illustrate the acceptability of risk. Farmer was also the first editor of the international journal Reliability Engineering, the forerunner of the journal Reliability Engineering and System Safety (RESS). Another early organization was the IEEE Reliability Society, established as early as 1951. This society is responsible for the journal IEEE Transactions on Reliability. A forerunner to this journal appeared in 1952 under a different name; it changed its name three times and got its current name in 1962. The first scientific society dedicated to risk analysis, the Society for Risk Analysis (SRA), was established in 1980, and its associated journal, Risk Analysis: An International Journal, appeared in 1981.
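To make the idea of the Farmer curve (FN-curve) mentioned above more concrete, the following minimal sketch shows how the points of such a curve can be tabulated; the scenario frequencies and fatality numbers are invented for illustration only and are not taken from any real study:

```python
# Hypothetical accident scenarios: (annual frequency, number of fatalities).
# All values are illustrative only.
scenarios = [
    (1e-3, 1),
    (5e-4, 10),
    (1e-5, 100),
]

def fn_curve(scenarios):
    """Return (N, F(N)) pairs, where F(N) is the total annual frequency of
    scenarios leading to N or more fatalities."""
    ns = sorted({n for _, n in scenarios})
    return [(n, sum(f for f, m in scenarios if m >= n)) for n in ns]

for n, f in fn_curve(scenarios):
    print(f"N >= {n:>3}: F(N) = {f:.1e} per year")
```

Plotting F(N) against N on logarithmic axes gives the familiar FN-curve used to discuss whether a given level of societal risk is acceptable; risk acceptance criteria are treated in Chapter 5.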

1.4.1 Norway

In Norway, developments of risk assessment have been made in parallel with the offshore oil and gas activities. The first major oil and gas accident, the Bravo blowout on the Ekofisk field in the North Sea, occurred in 1977. There were no fatalities, but a significant release of oil to the sea. First and foremost, this accident was an eye-opener for the authorities and the oil companies, who suddenly realized that the oil and gas activities were associated with a very high risk. As a consequence of this accident, the Norwegian Research Council initiated a large research program called Safety Offshore, and the authorities demanded that the oil companies support Norwegian research projects and universities. This requirement was strengthened after the second major accident, the capsizing of the semi-submersible accommodation platform Alexander L. Kielland in 1980, with 123 fatalities. The support from the Safety Offshore research program and the oil companies produced a number of new academic positions and a comprehensive education program at the Norwegian University of Science and Technology (NTNU) in Trondheim. Both authors of the current book participated in this development at NTNU. The knowledge gained through this period is an important part of the basis for the book.

1.5 Applications of Risk Assessment

The use of risk assessment has increased vastly over the years. A steadily increasing number of laws, regulations, and standards require or mention risk assessment, and methods are continually being developed. The increase that we have seen therefore seems set to continue into the future. The prime objective of any risk assessment is to provide decision support. Whenever we make a decision that affects risk, a risk assessment helps us understand what the sources of risk are. To illustrate this, some typical process industry decisions that can be supported by information from risk assessment are listed below.

(a) Location of a process plant. Chemical process plants often handle toxic, flammable, and/or explosive materials (commonly called hazardous materials). Release of these materials may affect people living or working outside the plant. Understanding the risk these people are exposed to is important before making a decision about where to locate a plant.

(b) Layout of a process plant. Release of flammable material may cause fire, and this may spread to other equipment, leading to a far more severe event than the initial fire. Understanding the sources of risk may help us locate the equipment at safe distances from each other.

(c) Need for and design of safety systems. All process plants are protected by a range of safety systems, to reduce pressure in tanks and vessels in emergencies, to isolate equipment that is leaking, to detect fires and gas releases, to extinguish fires, and so on. Risk assessment can help us understand what capabilities and capacities these systems need to have to protect against accidents.

(d) Performing maintenance operations. There is a need for continuous maintenance of equipment in a process plant. Some of the work may represent a risk to the maintenance personnel and to others. Risk assessment can help us plan the work so it can be performed in a safe manner and inform the personnel about the risk involved.

(e) Deciding about repairs and modifications. Equipment that is important for safety may fail during the operation of a plant, and we normally want to repair this as quickly as possible. Sometimes, doing the repair may represent a risk, and we need to weigh the risk associated with doing the repair against the risk associated with postponing it, for example, until the next major shutdown of the plant. Risk assessment can be used to compare options.


(f) Reliable work operations. Sometimes, work operations may be particularly critical to perform correctly, because errors in the performance may have large consequences. Risk assessment can be used to systematically evaluate such work operations and to identify whether changes are required.

(g) Reductions in manning. A common situation is that cost cutting leads to reductions in manning in process plants. This can have unwanted effects, such as less time to perform work or postponement of work that may be critical to maintaining a safe plant. Risk assessment can also be used in situations like these, to determine what the effects on risk are.

These are just some examples of decisions where risk assessment may provide input to the decision-making process. The examples illustrate the wide range of problems that may be addressed, from wider issues such as the location of a plant to technical details of how an individual system should be designed, and from purely technical issues to issues involving human and organizational factors. The range of industries and applications where risk assessment is being used is widening constantly. Some examples are listed in Table 1.2. The table gives some examples and does not pretend to provide a complete picture.

Table 1.2 Risk arenas that may be subject to risk analysis.

Risk arena | Application or problem area
Hazardous substances | Chemical/process industry, petroleum industry (incl. pipelines), explosives industry, nuclear industry.
Transport | Air traffic (airplanes, helicopters, drones), railways, marine transport, road transport.
Space industry | Space equipment and projects.
Product safety | Technical products, such as machinery, cars, robots, autonomous systems.
Critical infrastructures | Drinking water supply, sewage systems, power grids, communication systems, hospitals and health-care, banking and financial systems.
Medical sector | Medical equipment, robotic surgery, bacteria/viruses.
Work, activity | Industry, agriculture, forestry, sport.
Environmental protection | Pesticides, CO2, temperature increases, ocean level increases.
Food safety | Contamination, infection.
Health safety | Cancer, tobacco, alcohol, radiation.
Project risk | Time and cost of large projects (e.g. construction, software development).
Economic/financial | Insurance, investment, financial, enterprise, and project risk.
Security | Sabotage, theft, cyberattacks, espionage, terrorism.


The underlying principles and methods described in this book can be applied to all of the risk arenas in Table 1.2, but there are differences in terminology and methods that may be confusing. This applies, for example, if we compare a risk assessment of hazardous materials with a security risk assessment. Definitions and methods are described in guiding documents, standards, and legislation for different applications. In this book, we try to describe risk assessment in a generic manner and Chapter 20 provides examples from a variety of application areas. In the following section, we specify more precisely what the focus of this book is, and the type of applications we are primarily aiming at.

1.6 Objectives, Scope, and Delimitation

This book is written for students, engineers, and analysts engaged in risk assessments, both in the design phase and in the operational phase of systems. The main objective of the book is to give a thorough introduction to risk assessment and to present the essential theory and the main methods that can be used to perform a risk assessment. More specific objectives are:

(a) To present and discuss the terminology used in risk assessment. Optimistically, we hope that this may contribute to a more harmonized terminology in risk assessment.
(b) To define and discuss how risk can be quantified and how these metrics may be used to evaluate the tolerability of risk.
(c) To present the main methods for risk analysis and discuss the applicability, advantages, and limitations of each method.
(d) To present and discuss some specific problem areas related to risk assessment (e.g. human errors, dependent failures).
(e) To describe how a risk analysis may be carried out in practice and illustrate some important application areas.

The book is concerned with risk related to (i) a technical or sociotechnical system, in which (ii) events may occur in the future, that have (iii) unwanted consequences (iv) to assets that we want to protect. The systems considered may be any type of engineered system, ranging from small machines up to complex process plants or transportation networks (system aspects are discussed in Chapter 4).

The book does not cover all aspects of risk, but is limited to accidents where a sudden event harms one or more tangible assets. Adverse effects caused by continuous and long-term exposure to a hazardous environment or hazardous substances (e.g. asbestos) are thus not covered unless the exposure is caused by a specific event (e.g. an explosion).


When people or the environment is exposed to hazardous chemicals, the risk is traditionally analyzed by dose–response models, also called exposure–response relationships. This topic is not covered in the book. The book is concerned with the consequences of accidents, but does not describe how we can calculate or otherwise determine the physical effects of accidents. Examples include the impact energy involved in a collision between two cars, the size and intensity of a fire, or the overpressure generated by an explosion. Instead, the methods focus on the probabilistic aspects of the analysis.

In the financial world, investments involving a risk of losing money are often made. The outcome may be either positive or negative, and risk is then a statement about the uncertainty regarding the outcome of the investment. This interpretation of the word risk is not relevant for this book, which is concerned exclusively with adverse outcomes. In general, events that harm only intangible assets (e.g. finances, reputation, and goodwill) are not covered in the book, unless this (intangible) harm is associated with an event harming a tangible asset.

The main focus of the book is risk assessment per se, not how the results from the assessment may be used or misused, but some issues related to risk management are discussed briefly in Chapter 7. The book is mainly focused on the study of major accident risk. Many of the methods described may be used to analyze and prevent minor accidents, such as occupational accidents, but this is not the main focus of the book. Risk related to deliberate actions, such as sabotage and cyberattacks, is not a main focus of the book, but an introduction to this increasing problem is given in Chapter 17. Environmental risk and resilience are likewise treated only very briefly.

1.7 Problems

1.1 Section 1.2 describes some trends that change risk. Take some time to reflect on other technological trends that can increase or reduce risk.

1.2 An important basis for all risk analyses is a good understanding of failures and accidents that have occurred earlier. Have there been any recent examples of major accidents? Read about the accidents and identify points that we can learn from to avoid future accidents.

1.3 Look for accident investigation reports from one of the accidents listed in Table 1.1. In many cases, reports can be found on the Internet. Review the report and see what causes are identified and whether they are related to technical failures, human errors, or organizational aspects.

1.4 On the Internet, numerous examples of risk analyses and risk assessments can be found. Search for examples and look at what the content is and what the scope of the analysis is.

1.5 Look for examples of legislation, guidelines, or standards that require risk assessment in your own country.

References

Bazovsky, I. (1961). Reliability Theory and Practice. Englewood Cliffs, NJ: Prentice-Hall.
Bernstein, P.L. (1998). Against the Gods: The Remarkable Story of Risk. Hoboken, NJ: Wiley.
Fleming, K.N. (1975). A Reliability Model for Common Mode Failures in Redundant Safety Systems. Tech. Rep. GA-A13284. San Diego, CA: General Atomic Company.
Heinrich, H.W. (1931). Industrial Accident Prevention: A Scientific Approach. New York: McGraw-Hill.
Hopkins, A. (2000). Lessons from Longford: The Esso Gas Plant Explosion. Sydney: CCH Australia.
Kletz, T. (2001). Learning from Accidents, 3e. Abington: Routledge.
Knight, F.H. (1921). Risk, Uncertainty and Profit. New York: Houghton Mifflin.
Macza, M. (2008). A Canadian perspective of the history of process safety management legislation. 8th International Symposium on Programmable Electronic Systems in Safety-Related Applications, Cologne, Germany.
Mannan, S. (ed.) (2012). Lee's Loss Prevention in the Process Industries: Hazard Identification, Assessment and Control, 4e. Waltham, MA: Butterworth-Heinemann / Elsevier.
MIL-STD-1629A (1980). Procedures for performing a failure mode, effects, and criticality analysis. Military standard. Washington, DC: U.S. Department of Defense.
Murray, S. (1989). Beauvais Cathedral: Architecture of Transcendence. Princeton, NJ: Princeton University Press.
NUREG-0492 (1981). Fault tree handbook. Washington, DC: U.S. Nuclear Regulatory Commission, Office of Nuclear Regulatory Research.
NUREG-75/014 (1975). Reactor safety: an assessment of accident risk in U.S. commercial nuclear power plants. Technical report NUREG-75/014. Washington, DC: U.S. Nuclear Regulatory Commission.
Vaughan, D. (1996). The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. Chicago: University of Chicago Press.
Zackmann, K. (2014). Risk in historical perspective: concepts, contexts, and conjunctions. In: Risk – A Multidisciplinary Introduction, Chapter 1 (ed. C. Klüppelberg, D. Straub, and I.M. Welpe), 3–35. Heidelberg: Springer-Verlag.


2 The Words of Risk Analysis

2.1 Introduction

In 1996, the prominent risk researcher Stan Kaplan received the Distinguished Award from the Society for Risk Analysis. To express his gratitude, Kaplan gave a talk to the plenary session at the society's annual meeting. In the introduction to this talk, he said:

The words of risk analysis have been, and continue to be a problem. Many of you remember that when our Society for Risk Analysis was brand new, one of the first things it did was to establish a committee to define the word "risk." This committee labored for 4 years and then gave up, saying in its final report, that maybe it's better not to define risk. Let each author define it in his own way, only please each should explain clearly what way that is (Kaplan 1997, p. 407).

This quotation neatly summarizes one of the problems with this discipline: there are no common definitions of many of the words that we are using. One reason may be that risk is a concept that many disciplines are concerned about, bringing different starting points, perspectives, and terminologies into the discussion. In addition, risk is a term that is commonly used in everyday language, often without a precise meaning attached to it. This chapter attempts to define the key concepts and words from the point of view of the main objectives of this book, which is risk assessment of technical and sociotechnical systems. This means that other users of risk analysis and risk assessment may choose other definitions, and we briefly mention some of these in this section. We have tried to develop a coherent and useful set of definitions that fits our purpose. A word of caution when reading about this topic is, therefore, to make sure that you understand what the authors mean when they use words such as risk, hazard, and accident.


hazard, and accident. Unfortunately, not all authors writing about risk have heeded Stan Kaplan's advice to explain clearly how they define the terms. The meaning may therefore be unclear and different from how we define the terms in this book. When discussing risk and other terms in this chapter, we sometimes use terms before they are properly defined. If this causes confusion or something is unclear, please be patient and read on; things will hopefully become clearer as you progress through the chapter.

2.2 Risk

Bernstein (1998) claims that the word risk entered the English language in the 1660s from the Italian word risicare, which means "to dare." This tells us something about what risk may mean, but if you ask ten people about the word risk, you will most likely get ten different answers. The same inconsistency prevails in newspapers and other media. A search for the word risk in some Internet newspapers gave the results in Table 2.1. In some of the statements, the word "risk" may be replaced with chance, likelihood, or possibility. In other cases, it may be synonymous with hazard, threat, or danger. The situation is not much better in the scientific community, where the interpretation is almost as varied as among the general public. A brief search in risk assessment textbooks, journal articles, standards, and guidelines easily proves that this also applies to the specialists in risk assessment.

Table 2.1 The word risk as used in some Internet newspapers (in 2018).

– Ford recalls electric car power cables due to fire risk.
– Is financial turmoil in Turkey and other emerging economies at risk of spreading?
– Are there any other legal risks?
– Investors are willing to take on a high risk.
– Bridge designer warned of risk of corrosion.
– Saturday features more widespread rain risk.
– Multigene test may find risk for heart disease.
– We could put at risk our food and water supplies.
– This political risk was described in an intriguing analysis.
– Because of the risk of theft.
– Reindeer at risk of starvation after summer drought.
– Coalition at risk as talks on refugee policy falter.
– Seven ways to minimize the risk of having a stroke.
– Company to close 42 stores, putting 1500 jobs at risk.
– Death is a risk the drivers willingly take and their loved ones accept.
– You are putting lives at risk over Brexit.
– This carries an accident risk of "Chernobyl proportions."
– £ 80 billion investment plan at risk.


2.2.1 Three Main Questions

Risk (as used in this book) is always related to what can happen in the future. In contrast to our ancestors, who believed that the future was determined solely by the acts of God (e.g. see Bernstein 1998), or by destiny, we have the conviction that we can influence the future by analyzing and managing risk in a rational way. Our tool is risk assessment, and our goal is to inform decision-making concerning the future. The possibility that events with unwanted effects may happen is an inherent part of life. Such events can be caused by natural forces, such as flooding, earthquake, or lightning; technical failures; or human actions. Some events can be foreseen and readily addressed, whereas others come unexpectedly because they appear unforeseeable or have only a very remote likelihood of occurrence. In many systems, various safeguards are put in place to prevent harmful events or to mitigate the consequences should such events occur. Risk assessment is used to identify what harmful events can occur, the causes of these events, to determine the possible consequences of harmful events, to identify and prioritize means to reduce risk, and to form a basis for deciding whether or not the risk related to a system is tolerable.1 For the purpose of this book, we follow Kaplan and Garrick (1981) and define risk as: Definition 2.1 (Risk) The combined answer to the three questions: (1) What can go wrong? (2) What is the likelihood of that happening? and (3) What are the consequences? ◽ The three questions may be explained briefly as follows2 : (1) What can go wrong? To answer this question, we must identify the possible accident scenarios that may harm some assets that we want to keep and protect. An accident scenario is a sequence of events, starting with an initiating event and ending with an end state that affects and causes harm to the assets. The assets may include people, animals, the environment, buildings, technical installations, infrastructure, cultural heritage, reputation, information, data, and many more. (2) What is the likelihood of that happening? The answer to this question can be given as a qualitative statement or quantitatively as probabilities or frequencies. We consider each accident scenario that was identified in Question 1, one-by-one. To determine the likelihood, it is often necessary to carry out a causal analysis to identify the basic causes (hazards and threats) that may give rise to the accident scenario. 1 The tolerability of risk is discussed in Chapter 5. 2 The main terms are defined and discussed later in this chapter.


(3) What are the consequences? For each accident scenario, we must identify the potential harm or adverse consequences to the assets mentioned in Question 1. Most systems have safeguards or barriers that may prevent or mitigate harm. The harm to the assets depends on whether or not these barriers function as required when the accident scenario occurs.

A risk analysis is carried out to provide answers to the three questions in the definition of risk. Risk analysis and risk assessment are further defined and discussed in Chapter 3. The first question "what can go wrong?" clearly shows that we focus on scenarios that give "negative" consequences – even if risk refers to both gains and losses in economic theory.

Remark 2.1 (Positive and negative consequences) Observe that classifying a consequence as positive or negative represents social judgments and cannot be derived from the nature of the accident scenario (Klinke and Renn 2002). This implies that consequences can be regarded as positive by some people and negative by others. Examples include terrorist attacks or other cases where someone wants to cause harm. The terrorists aim to cause as much harm as possible to get attention. For them, it is therefore a positive consequence. For most others, it is seen as negative. ◽

Remark 2.2 (Danger) The word danger is used in our daily language, both as a noun and as the associated adjective dangerous. Standards for risk assessment very seldom use the noun danger, but the adjective dangerous is commonly used in expressions such as dangerous chemicals, dangerous behavior, and dangerous activity. We follow the standards and refrain from using the noun danger in this book. ◽

2.2.1.1 Expressing Risk

The answer to the first question in Definition 2.1 gives a set of accident scenarios {s1 , s2 , … , sn }. The answer to the second question gives the likelihood (i.e. usually the frequency) of accident scenario si for each i = 1, 2, … , n. If accident scenario si occurs, it may harm several assets in different ways and with different probabilities. The answer to the third question is therefore a set of possible harms with associated probabilities. This set is called the consequence spectrum ci associated with accident scenario si , for i = 1, 2, … , n. The consequence spectrum ci may be expressed as a multidimensional vector that includes various types of harm or damage to all relevant assets (e.g. people, property, and the environment). ci may sometimes be time dependent if the magnitude of damage varies with time. This means that the answers to the three questions in Definition 2.1 can be answered by the triplet ⟨si , fi , ci ⟩ for each i = 1, 2, … , n, where


si provides a name, a precise definition, and a description of potential accident scenario i
fi is an estimate of the likelihood (e.g. frequency) of accident scenario si
ci is a multidimensional vector of the potential types of harm/damage to all relevant assets caused by accident scenario si, with associated probabilities, that is, the consequence spectrum for si

The risk R related to the study object can now, according to Kaplan and Garrick (1981), be expressed by the set of triplets

R = {⟨si, fi, ci⟩ ; i = 1, 2, … , n}

If all relevant accident scenarios si are included, the set of triplets is considered to be complete and hence to represent the risk. The risk – defined by answers to the three questions – may be presented as in Table 2.2, where the first column lists the accident scenarios si that may give harm. The second column lists the frequency of si, and the last column lists the consequence spectrum associated with si. The frequency fi and the consequence spectrum ci depend on the capability and reliability of the barriers that are available in the study object. A more thorough discussion of accident scenarios can be found in Section 2.3.
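To make the triplet representation concrete, the following is a minimal sketch of how a set of accident scenarios ⟨si, fi, ci⟩ could be held as a data structure. The scenario names, frequencies, and consequence entries are illustrative assumptions only (loosely echoing Table 2.2), not results from any real analysis.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One risk triplet <s_i, f_i, c_i>."""
    name: str            # s_i: name/description of the accident scenario
    frequency: float     # f_i: estimated frequency (per year), an assumed value here
    consequence: dict    # c_i: consequence spectrum, potential harm per asset type

# Hypothetical study object with two accident scenarios (illustrative values)
risk = [
    Scenario("Gas leak in area 1", 0.01,
             {"people": "1-2 injuries", "environment": "minor release"}),
    Scenario("Falling load from crane 2", 0.03,
             {"people": "possible fatality", "property": "damaged equipment"}),
]

for s in risk:
    print(f"{s.name}: f = {s.frequency}/yr, c = {s.consequence}")
```

A row-by-row printout of such a structure corresponds directly to a table of the form shown in Table 2.2.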

2.2.2 Alternative Definitions of Risk

Many alternative definitions of risk have been suggested in the literature. Among these are the following:

(a) "Effect of uncertainty on objectives" (ISO 31000). This definition is different from the one used in this book. Events are not mentioned in the definition and it also encompasses both positive and negative consequences.
(b) "The possibility that human actions or events lead to consequences that harm aspects of things that human beings value" (Klinke and Renn 2002).
(c) "Situation or event where something of human value (including humans themselves) has been put at stake and where the outcome is uncertain" (Rosa 1998).

Table 2.2 Risk related to a system (example).

i | Accident scenario (si) | Frequency (fi) | Consequence (ci)
1 | Gas leak in area 1 | 0.01 | Consequence spectrum 1
2 | Falling load from crane 2 | 0.03 | Consequence spectrum 2
⋮ | ⋮ | ⋮ | ⋮


(d) "Uncertainty about and severity of the consequences (or outcomes) of an activity with respect to something that humans value" (Aven and Renn 2009).
(e) "The probability that a particular adverse event occurs during a stated period of time, or results from a particular challenge" (Royal Society 1992, p. 22).
(f) "Risk refers to the uncertainty that surrounds future events and outcomes. It is the expression of the likelihood and impact of an event with the potential to influence the achievement of an organization's objectives" (Treasury Board 2001).

One aspect that distinguishes several of these definitions from our definition is the use of uncertainty instead of likelihood or probability. Thorough discussions of the various definitions and aspects of risk are given, for example, by Lupton (1999) and Johansen (2010). We do not go into further details on this here, just repeat our cautionary comment that risk is not always defined as in this book.

Remark 2.3 (Risk: singular or plural?) Many standards, books, and articles use the word "risk" in both singular and plural. The plural form – risks – is most often used when assets are exposed to several sources of risk. In this book, we refrain from using the plural form, except when quoting other authors. Instead, we use the term "sources of risk" when it is important to point out that there are several "sources" that may give rise to harm. ◽

2.3 What Can Go Wrong?

To be able to answer the first question in the definition of risk, we need to specify what we mean by "What can go wrong?" So far, the term accident scenario has been used to describe this, but we now elaborate more on this question.

2.3.1 Accident Scenario

An accident can usually be described as a sequence of events that harms one or more assets. The term accident scenario is used to describe a possible, future accident and is defined as follows: Definition 2.2 (Accident scenario) A potential sequence of events from an initiating event to an undesired end state that will harm one or more assets. ◽ Accident scenarios may vary significantly, both with respect to the number of events and the time interval from the initiating event to the end event or state.


The "path" of an accident scenario is diverted by various conditions and when barriers are activated. In cases where no barriers are available, the sequence may be reduced to a single event. The concept of accident scenario is discussed further by Khan and Abbasi (2002) and is a central element in the ARAMIS methodology (ARAMIS 2004).

Example 2.1 (Accident scenario in a process plant) A possible accident scenario starting with a gas leak in a process plant may proceed as follows:

(1) A gas leak from flange A occurs (i.e. the initiating event).
(2) The gas is detected.
(3) An alarm is triggered.
(4) The process shutdown system fails to shut off the gas flow to the flange.
(5) The gas is ignited and a fire occurs.
(6) The firefighting system is activated.
(7) The fire is extinguished within approximately one hour.
(8) One person is injured by the fire. ◽

Accident scenarios are identified and described as part of a risk analysis, but this does not mean that they will indeed occur. For events that have actually occurred, the term accident or accident course is more appropriate.

2.3.1.1 Categories of Accident Scenarios

In most risk analyses, it requires too much time and too many resources to study all the possible accident scenarios. A set of representative scenarios is therefore selected for detailed analysis. These are often called reference scenarios. Definition 2.3 (Reference accident scenario) An accident scenario that is considered to be representative of a set of accident scenarios that are relevant to include in a risk analysis. ◽ In some applications, it may be relevant to consider the worst possible scenarios: Definition 2.4 (Worst-case accident scenario) The accident scenario with the highest consequence that is physically possible regardless of likelihood (Kim et al. 2006). ◽ A worst-case release scenario may, for example, involve the release of the maximum quantity of some hazardous material during “worst-case” weather conditions. Worst-case scenarios are often used in establishing emergency plans, but should not be used in, for example, land use planning (see Chapter 5).


Because a worst-case accident scenario often has a remote probability of occurrence, a more credible accident scenario may be more relevant.

Definition 2.5 (Worst credible accident scenario) The highest-consequence accident scenario identified that is considered plausible or reasonably believable (Kim et al. 2006). ◽

The terms "plausible or reasonably believable" are not defined, but Khan and Abbasi (2002) suggest that credible accidents are those that have a probability of occurring greater than 10⁻⁶ per year.
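As a small illustration of how such a frequency cut-off might be applied in practice, the sketch below screens a list of candidate scenarios against the 10⁻⁶ per year criterion and then picks the highest-consequence scenario among those that remain. The scenario names, frequencies, and the single-number consequence scores are invented for the example; in a real analysis the consequence would be a full consequence spectrum rather than one score.

```python
CREDIBILITY_LIMIT = 1e-6  # assumed screening threshold (events per year)

# Hypothetical candidates: (name, estimated frequency per year, consequence score)
candidates = [
    ("Tank rupture after overfilling", 2e-4, 8),
    ("Simultaneous failure of all barriers", 4e-8, 10),
    ("Gas leak ignited by hot work", 5e-5, 6),
]

# Keep only scenarios considered credible (frequency above the limit)
credible = [c for c in candidates if c[1] > CREDIBILITY_LIMIT]

# Worst credible accident scenario: highest consequence among the credible ones
worst_credible = max(credible, key=lambda c: c[2])
print("Worst credible accident scenario:", worst_credible[0])
```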

2.3.2 Hazard

Hazard is a commonly used term in connection with risk analysis.

Definition 2.6 (Hazard) A source or condition that alone or in combination with other factors can cause harm. ◽

We are always surrounded by hazards, but a hazard is not critical until it comes out of control or is triggered one way or another. The event where a hazard comes out of control is an initiating event in the way that we define it below. Hazards can therefore be seen as a primary cause of an initiating event. A hazard can be related to a property of a system or a material (e.g. flammability or toxicity), it can be a state (e.g. gas under pressure, potential energy in an object that has been lifted), or it can be a situation (e.g. a train traveling at high speed). Accident scenarios may involve several hazards, creating domino effects or escalating situations, as illustrated in Example 2.2.

Example 2.2 (Accident scenario involving more than one hazard) A possible accident scenario may comprise the following events:

(a) A car driver loses control over her car on a road.
(b) The car moves over into the opposite lane.
(c) A truck coming in the opposite direction tries to avoid collision and swerves off the road.
(d) The truck topples over.
(e) The truck carries petrol that is spilled.
(f) The petrol ignites, causing a large fire.

In this case, several hazards can be identified: the speed (kinetic energy) of the first car, the speed of the truck, and the petrol in the truck. ◽

The term "hazard" is widely used in many different ways in connection with risk assessment. The systematic methods that we use to answer the question


Table 2.3 Examples of hazards. Hazard

Comment

A car

A stationary car can be a hazard if it is located on top of a hill (in a specific state) and can start rolling (releasing potential energy). It is also a hazard because it contains fuel that represents thermal energy (a property of the material) that can ignite or explode. There is also electrical energy stored in the battery. A moving car represents an additional hazard because it has kinetic energy.

Propane gas under pressure

In this case, both the state (under pressure) and a property (flammable) represent hazards associated with the propane.

Water in a hydroelectric power dam

It is the water that is stored that is the main hazard in this case. The water contains massive amounts of potential energy and if the water is released this may cause severe damage. The dam itself can also be regarded as a hazard, but then just locally, because it can collapse and kill people nearby or damage equipment.

Ammunition for a gun

According to the definition, it is the gunpowder that is the hazard. It is the thermal energy in the gunpowder that makes it expand rapidly and fire the bullet. Pulling the trigger on a gun that is loaded is the initiating event that releases the energy.

A large crowd in a confined space

A full football stadium can be seen as a hazard. Some unexpected event can lead to panic and trigger rapid evacuation. This is a situation that can cause injuries and even fatalities.

Tension between tectonic plates

It may seem strange to define this and not earthquake as a hazard, but the earthquake is the event that follows from release of the energy that is otherwise under control (at least temporarily).

Pressure differences in the atmosphere

This is also an unusual definition and in most cases storm would be seen as the hazard. Similarly to earthquake, storm is the event that results from the pressure difference.

Tension in an offshore structure

The initiating event in this case could be “structural collapse,” caused by the tension in the structure exceeding the load-bearing capacity.

“what can go wrong?” are often called “hazard identification methods,” even if it is initiating events or hazardous events that are identified. Many hazards are related to energy defined in a very wide sense. Examples are mechanical, thermal, or electrical energy. Table 2.3 lists some hazards together with brief comments, to further illustrate the definition of hazard.


Some of these hazards may seem counterintuitive and needlessly formal. In practice, the term "hazard" is used in many different ways and often without any clear definition. We come back to this issue in later chapters.

2.3.3 Initiating Event and Hazardous Event

Before defining initiating event, we give a general definition of the term "event":

Definition 2.7 (Event) Incident or situation that occurs in a particular place during a particular interval of time. ◽

In risk analysis, the term "event" refers to a future occurrence. The duration of the event may range from very short (e.g. an instantaneous shock) to a rather long period. An initiating event is defined as follows:

Definition 2.8 (Initiating event) An identified event that represents the beginning of an accident scenario. (Adapted from IAEA 2002.) ◽

This definition states that the initiating event represents the beginning of an accident scenario, but it is not clearly specified where and when the beginning is. In Example 2.1, we said that "gas leak from flange A" was the beginning of the accident scenario, that is the initiating event, but we do not say anything about why the leak occurred. If this was because of an impact against the flange, we could have said that "impact against flange A" was the initiating event. In practice, it is up to the risk analyst to decide what she wants to define as the beginning. This depends on what the focus of the analysis is, the limitations of the analysis, limitations in our knowledge about the system or the accident scenarios, and so on. Among the causes of initiating events are hazards.

2.3.3.1 Hazardous Event

Another commonly used term is hazardous event, sometimes used more or less synonymously with initiating event. We have chosen to distinguish these two, although it is not easy to give a precise definition of hazardous event. A possible definition is:

Definition 2.9 (Hazardous event) An event that has the potential to cause harm. ◽

From this definition, any event that is part of an accident scenario, including the initiating event, may be classified as a hazardous event. The practical use of this term is discussed under methods for hazard identification in Chapter 10. Observe that both initiating event and hazardous event are terms that are difficult to give a precise physical meaning. Instead, both can be regarded


as analytical terms that are used to identify starting points for the analysis. Because the starting point of our analysis may be chosen more or less freely, it is not possible to pinpoint specific events in a sequence that are initiating and hazardous events, respectively. As a guideline, it may be useful to specify the hazardous event as the first event in the sequence when the situation moves from a normal situation to an abnormal situation. This is not necessarily very precise, but may still be a help in performing risk analyses in practice. The hazardous event is a central concept in the bow-tie model of risk analysis, which is described in Section 2.3.4 (i.e. after the examples). Example 2.3 (Hazardous events) A hazardous event was defined as an event that may cause harm, and it was suggested to use the first abnormal event as the hazardous event. The following examples illustrate this: (1) An object dropping from a crane. The process of lifting itself is completely normal and is not considered a hazardous event. If, on the other hand, the object that is lifted starts falling, it is definitely an abnormal situation. (2) A car driver losing control of the car. Driving is a very common and normal activity, but if the driver loses control of the car while driving, the situation may develop into a serious accident with severe consequences. (3) An aircraft engine stopping during flight. It should be fairly obvious that this is an abnormal situation. Most commercial planes have two engines and are able to land without problems with one engine not working. It is still reasonable to classify this as a hazardous event in a risk analysis. (4) A person slipping when climbing a ladder. Climbing a ladder is normal, but if the climbing person slips, she may lose her balance and fall off the ladder, with potentially serious consequences. On the other hand, she may be able to regain her hold and balance again, avoiding an accident. A hazardous event will therefore not necessarily always lead to an accident. ◽ In all these examples, some prerequisites need to be in place for the event to occur. In the case of the car, control can only be lost if the car is driving, dropped objects are only possible if something has been lifted, the aircraft engine stopping is critical only during flight, not on the ground. This is an indication that hazardous events on their own not necessarily are critical but need to occur in a context where a hazard is present. Example 2.4 (Crane operation) A crane is used to lift heavy elements on a construction site. The lifting operation has several hazards. One of these is the potential energy of a lifted element. This hazard is an intrinsic hazard of the lifting operation because it is not possible to lift anything without creating potential energy. The potential energy can be released, for example, if the



Figure 2.1 Hazard (a), event (b), and consequence (c) (Example 2.4).

hazardous event "chain breaks" occurs. This event leads to "uncontrolled fall of the element." The consequence of the fall depends on where the element falls down – if there are people or important equipment (i.e. assets) in the area. The concepts used in this example are shown in Figure 2.1. ◽

2.3.4 The Bow-tie Model

The bow-tie model for risk analysis is shown in Figure 2.2. An identified hazardous event is placed in the middle of the figure, with the causes shown on the left side and the consequences on the right side. The figure indicates that various hazards and/or threats may lead to the hazardous event and that the hazardous event may in turn lead to many different consequences. Various barriers are usually available between the hazards/threats and the hazardous event, and between the hazardous event and the consequences. The model in Figure 2.2 is called a bow-tie model because it resembles the bow tie that men sometimes use in place of a necktie with a formal suit. The bow-tie model is a useful illustration of both conception and analysis of risk.


Figure 2.2 An accident scenario and related concepts illustrated in a bow-tie diagram.


Many of the concepts that are introduced in this section can be illustrated in the bow-tie diagram. An accident scenario, involving the identified hazardous event, is shown as a path from left to right. The sequence of events is started by an initiating event and is terminated by a specific end event that causes an end state. The end state will lead to consequences for one or more assets. The path of the accident scenario is diverted (i.e. changes state) by the action of barriers and hence, the steps of the path indicate the presence of barriers. All the possible accident scenarios involving the specified hazardous event can, in principle, be represented in the same bow-tie diagram. If we are able to identify all relevant hazardous events, we can, in principle, identify and describe all relevant accident scenarios by using bow-tie diagrams. This makes the bow-tie model a practical and efficient tool for risk analysis.
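One informal way to see why the bow-tie lends itself to systematic analysis is to note that it can be written down as a simple data structure: one hazardous event, a set of threat paths on the left, and a set of consequence paths on the right, each interrupted by barriers. The sketch below only illustrates this idea; the hazardous event, causes, barriers, and end states are invented examples and not taken from any analysis in this book.

```python
# A minimal bow-tie record for one hazardous event (illustrative content only)
bow_tie = {
    "hazardous_event": "Gas leak from flange A",
    "threats": [
        {"cause": "Corrosion of flange", "barriers": ["Inspection program", "Material selection"]},
        {"cause": "Impact against flange", "barriers": ["Mechanical protection"]},
    ],
    "consequences": [
        {"end_state": "Ignited jet fire", "barriers": ["Gas detection", "Shutdown system", "Firefighting"]},
        {"end_state": "Unignited release", "barriers": ["Gas detection", "Ventilation"]},
    ],
}

# Each left-to-right combination corresponds to one accident scenario through the hazardous event
scenarios = [
    (t["cause"], bow_tie["hazardous_event"], c["end_state"])
    for t in bow_tie["threats"]
    for c in bow_tie["consequences"]
]
print(len(scenarios), "accident scenarios represented in this bow-tie")
```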

2.3.5 End Event and End State

In the same way that we define initiating events as the start of the accident scenarios, we often use the term "end event" or "end state" to signify the end of the accident scenario. Defining this is not any easier than defining the initiating event.

Definition 2.10 (End event) An identified event that represents the end of a defined accident scenario. ◽

This definition does not specify what the end of the scenario is. In the same way as with initiating events, it is up to the analyst to decide where and when the scenario ends. When the end event occurs, the system enters a state that we call the end state. This end state is used as a basis for establishing the consequence spectrum for the accident scenario.

2.3.6 A Caveat

The initiating event, the hazardous event, and the end event are all problematic concerning where to locate them in an accident scenario. The initiating event may be the least problematic, but as seen from the comment after Definition 2.8, it is not always obvious how and where to start an accident scenario. The end event and the following end state may be the most important part of an accident scenario and is often used as a name for the scenario, such as “fire in process area 1” or “collision with train on same track.” As illustration, consider a potential fire in the process plant. Should we define the end state of the accident scenario as (i) a fire is ignited, (ii) a significant fire is ongoing, or (iii) the fire has been extinguished? In many risk analyses, the end state is set as soon as a state is initiated that inevitably – if not stopped – will cause harm to some assets. For the scenario above, this means that the end state is defined as “fire is


Figure 2.3 Two different choices of hazardous events, (a) early in the event sequence and (b) late in the event sequence.

ignited." When this end state is present, a number of safeguards, or barriers, are usually activated to stop the development of the end state and/or to protect the assets. The consequences of the accident scenario are then determined by the capability and reliability of these barriers. Where to set the end state depends on the study object and on what is deemed by the risk analyst to be practical. The hazardous event is purely an analytical concept and may be defined by the analyst as any event in an accident scenario between – and including – the initiating event and the end event. There are no clear recommendations for the choice of hazardous events, but some choices may lead to an efficient risk analysis and some may not. For the bow-tie diagram in Figure 2.2, observe that if the hazardous event is moved further to the left in the diagram (see Figure 2.3a), the number of possible paths for the event sequence will typically increase. On the other hand, if we move the hazardous event to the right in the bow-tie diagram (see Figure 2.3b), the number of causes of the hazardous event increases and fewer possible paths follow from the hazardous event. The approach in Figure 2.3a gives a simpler causal analysis and a more complicated consequence analysis, whereas the approach in Figure 2.3b gives a more complicated causal analysis and a simpler consequence analysis. Which of these approaches gives the best and most complete result depends on the system and the problem at hand.

2.3.7 Enabling Events and Conditions

Hazards are primary causes of initiating events, but specific events or conditions often need to be in place in addition for an initiating event to occur. The same applies for events later in the accident scenario. These events and conditions are called enabling events and conditions. Definition 2.11 (Enabling events and conditions) An event or a condition that on its own or in combination with other events or conditions can trigger an initiating event or enable an accident scenario to develop further toward an accident. ◽ Enabling events and conditions are events and conditions that contribute to instigate the initiating event and to drive the accident scenario forward toward


Table 2.4 Hazards, enabling events and conditions, and initiating events.

Hazard | Enabling event/condition | Initiating event
A car on top of a hill | Handbrake is not on | Car starts rolling
Propane gas under pressure | Corrosion in tank | Gas is released
Water in a hydroelectric power dam | Extreme rain | Water flows over top of dam
A large crowd in a confined space | Excitement in crowd | Panic breaks out
Tension between tectonic plates | Build up over long period | Earthquake
Pressure differences in the atmosphere | Increasing pressure difference | Storm
Tension in an offshore structure | Crack growth in structure due to fatigue | Failure of a structural member

harm of an asset. Sometimes, it may be difficult to distinguish clearly between events that are in the accident scenario sequence and enabling events, but as a general rule, all events that are not on the "main path" toward the accident scenario end event are enabling events. It may seem unnecessary to distinguish between these two, but for the purpose of managing risk it may be quite important. If an initiating event or another event occurs that is defined as being part of the accident scenario, this means that the situation has moved one step closer to becoming an accident. Enabling events (and conditions) only change the probability that an event in the sequence occurs. In an earlier example, "Gas leak from flange A" was used as an initiating event. An enabling event could be "impact on flange" and an enabling condition could be "corrosion" because both increase the probability of failure of the flange. Table 2.4 lists some hazards, enabling events and conditions, and initiating events to help clarify the concepts and illustrate the differences between them.

2.3.7.1 Active Failures and Latent Conditions

Reason (1997) distinguishes between active failures and latent conditions. Active failures are events that trigger unwanted events. Examples of active failures are errors and violations by field operators, pilots, and control room operators. (Human errors are discussed further in Chapter 15.) These are the people in the operation – what Reason calls the sharp end of the system. Latent conditions do not trigger an accident immediately, but they lie dormant in the system and may contribute to a future accident. Examples of latent conditions are poor design, maintenance failures, poor


and impossible procedures, and so on. Latent conditions can increase the probability of active failures. There are clear similarities in the way that Reason uses these terms and our way of using enabling events and conditions.

2.3.8 Technical Failures and Faults

Failures and malfunctions of technical items may be relevant as both hazards and enabling events. A failure is defined as follows:

Definition 2.12 (Failure of an item) The termination of the ability of an item to perform as required. ◽

A failure is always linked to an item function and occurs when the item is no longer able to perform the function according to the specified performance criteria. Failure is an event that takes place at a certain time t. Item failures can be recorded, and we can estimate the frequency of failures in a certain population of similar items. This frequency is called the failure rate 𝜆 of the item. The occurrence of some failures can be observed immediately when they occur, and these failures are called evident failures. For other failures, it is not possible to observe the failure without testing the item. These failures are called hidden failures. Hidden failures are a particular problem for many safety systems, such as fire or gas detection systems, and airbag systems in cars. After a failure, the item enters a failed state or a fault and remains in this state for a shorter or longer time. Many failures require a repair action to be brought back to a functioning state. Some items – especially software items – may spend a negligible time in failed state. A fault of a technical item is defined as follows: Definition 2.13 (Fault of an item) A state of an item, where the item is not able to perform as required. ◽ Many faults are caused by a preceding failure, but there is also another important category of faults – systematic faults. A systematic fault is caused by a human error or a misjudgment made in an earlier stage of the item’s life cycle, such as specification, design, manufacture, installation, or maintenance. A systematic fault remains in – or is related to – an item until the fault is detected as part of an inspection or test, or when the systematic fault generates an item failure. Systematic faults are important causes of safety system failures and include faults, such as, software bugs, calibration errors of detectors, erroneously installed detectors, too low capacity of fire-fighting systems, and so forth.
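As a small numerical sketch of the failure rate λ mentioned above, the code below estimates λ from recorded failures in a population of similar items. The operating records are invented for the illustration, and the simple estimate (total failures divided by total time in service) assumes an approximately constant failure rate; it is not presented here as the book's estimation method.

```python
# Hypothetical operating records: (item id, observed hours in service, number of failures)
records = [
    ("pump-01", 8_760, 2),
    ("pump-02", 6_200, 1),
    ("pump-03", 8_760, 0),
]

total_hours = sum(hours for _, hours, _ in records)
total_failures = sum(failures for _, _, failures in records)

# Point estimate of lambda under an assumed constant failure rate
failure_rate = total_failures / total_hours
print(f"Estimated failure rate: {failure_rate:.2e} failures per hour")
```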

Figure 2.4 Failure and fault of a degrading item (performance plotted against time, showing the target value, the acceptable deviation, the point of failure (event), and the subsequent fault (state)).

Remark 2.4 (Analogy to death and being dead) If we compare a human being and a technical item, the terms "death" and "failure" are similar terms. In most cases, we can record the time of death of a person, and we can calculate the frequency of deaths in a certain population. When a person dies, she enters the state of being dead, and remains in this state. As for technical items, it is not possible to calculate any frequency of being dead. The main difference between the terms is that technical components often can be repaired and continue to function, whereas a dead person cannot. ◽

Example 2.5 (Pump failure) Consider a pump that is installed to supply water to a process. To function as required, the pump must supply water between 60 and 65 l/min. If the output from the pump deviates from this interval, the required function is terminated and a failure occurs. The failure will often occur due to a gradual degradation, as shown in Figure 2.4. ◽

A failure may occur in many different ways, and these are referred to as failure modes.

Definition 2.14 (Failure mode) The manner in which a failure occurs, independent of the cause of the failure. ◽

Example 2.6 (Pump failure modes) Reconsider the pump in Example 2.5. The following failure modes may occur:

• No output (the pump does not supply any water)
• Too low output (i.e. the output is less than 60 l/min)
• Too high output (i.e. the output is more than 65 l/min)
• Pump does not start when required
• Pump does not stop when required


• Pump starts when not required …more failure modes depending on other functional requirements: for example, related to power consumption or noise. ◽ Failure mode is a very important concept in risk and reliability analyses and is further discussed in Section 10.5. Technical failures do not occur without a failure cause, defined as: Definition 2.15 (Failure cause) (IEV 192-03-11).

Set of circumstances that leads to failure ◽

A failure cause may originate during specification, design, manufacturing, installation, operation, or maintenance of an item. Some of the possible failure causes are classified as failure mechanisms and are defined as follows: Definition 2.16 (Failure mechanism) that leads to failure.

Physical, chemical, or other process ◽

The pump in Example 2.5 may, for example, fail due to the failure mechanisms corrosion, erosion, and/or fatigue. Failure may also occur due to causes that are not failure mechanisms. Among such causes are operational errors, inadequate maintenance, overloading, and so on. 2.3.8.1

Failure Classification

Failures of an item can be classified in several ways. Here, we suffice by mentioning one classification. The classification is related to a specified function of the item and not the hardware as such. To illustrate the different types of failure, we may consider the function “wash clothes” of a washing machine. Primary failure. These failures occur in the normal operating context of the item and are typically hardware failures caused by some deterioration, such as wear. Primary failures are random failures where the probability distribution is determined by the properties of the item. Primary failures are in some applications called random hardware failures. Secondary failure. These failures are also called overload failures. A secondary failure of a washing machine may, for example, be caused by a lightning strike or a far too heavy load. Secondary failures are often of a random nature, but the probability distribution has little to do with the properties of the item. Systematic failure. These failures occur because of a dormant systematic fault of the item (e.g. software bug, maintenance error, and installation error). The systematic failure occurs when a specific demand for the item occurs. The demands may be of a random or nonrandom nature. The first author of this


book has experienced persistent software bugs in his washing machine, causing the washing program to abort. Input/output failures. These failures occur because the required inputs or outputs to the item function are missing or wrong. The inputs to a washing machine consist of electricity, water, detergent, and mobile phone signals (on brand new machines). Output is dirty water to the sewage. The function of the machine is failed when one of these inputs/outputs are missing or deviating from required values. The input/output failures may be random or nonrandom. Deliberate failures. These failures are nonrandom and occur when a threat actor (also called attacker) uses a physical or cyber threat to harm the item. For some systems, cyber threats may lead to physical harm to assets. A physical threat action is also called a sabotage. Example 2.7 (Cruise ship near accident) The cruise ship Viking Sky with 1373 passengers and crew aboard narrowly escaped a major accident on 23 March 2019, when her engines failed during a severe storm. The ship drifted rapidly toward the coast of mid-Norway in very rough waters, but was finally saved by the anchors less than 100m from land. All engines tripped almost at the same time because of a low-level signal from the level transmitters in the lubrication oil tanks. This system is installed to protect the engines from being destroyed if the lubrication is lost. The level of oil was not critically low, but the heavy seas probably caused movements in the tanks that fooled the level transmitters. The (preventive) shutdown of the engines was therefore a typical systematic fault, caused by a specification or design error of the lubrication oil tanks and/or the placement of the level transmitters. If not modified, the same engine shutdown will reoccur the next time the ship meets the same weather conditions. ◽ For more details about failures and failure classification, see Rausand et al. (2020). 2.3.9

Terminology Comments

This section has defined a number of commonly used terms in risk assessment. The purpose is to establish a terminology that helps to describe different elements of the problem being addressed in a risk analysis. Unfortunately, as stated already, terminology is a problem within this field. Therefore, we once more warn the reader about the use of these terms in other documents, reports, standards, and scientific publications. All the terms defined in this section are used in different ways by different authors. In particular the terms “hazard,” “initiating event,” and “hazardous event” are used in many different ways compared to how it has been defined


here. Hazard is often used to encompass both hazardous events and enabling events and conditions. Hazard then becomes a term that covers more or less anything that either are events in accident scenarios or conditions that can influence the development of those scenarios. This may be sufficient in some cases, but we see that it can cause confusion and result in an unstructured process to identify what can go wrong. Our opinion is therefore that it is important to have precise definitions. To illustrate the above, an example of what a checklist for hazard identification can look like is shown in Table 2.5. If this list Table 2.5 Generic hazard list (not exhaustive). Mechanical hazard

Noise hazards

– Kinetic energy – Acceleration or retardation – Sharp edges or points – Potential energy – High pressure – Vacuum – Moving parts – Rotating equipment – Reciprocating equipment – Stability/toppling problems – Degradation of materials (corrosion, wear, fatigue, etc.)

– External – From internal machines Hazards generated by neglecting ergonomic principles

Hazardous materials

– Flooding – Landslide – Earthquake – Lightning – Storm – Fog

– Explosive – Oxidizing – Flammable – Toxic – Corrosive – Carcinogenic Electrical hazards – Electromagnetic hazard – Electrostatic hazard – Short circuit – Overload – Thermal radiation Thermic hazards – Flame – Explosion – Surfaces with high or low temperature – Heat radiation Radiation hazards – Ionizing – Nonionizing

– Unhealthy postures or excessive effort – Inadequate local lightning – Mental overload or underload, stress – Human error, human behavior – Inadequate design or location of visual display units Environmental hazards

Organizational hazards – Inadequate safety culture – Inadequate maintenance – Inadequate competence – Inadequate crowd control Sabotage/terrorism – Cyber threat – Arson – Theft – Sabotage – Terrorism Interaction hazards – Material incompatibilities – Electromagnetic interference and incompatibility – Hardware and software controls


is compared to the definitions, it contains both hazards, enabling events, and enabling conditions. To add to the confusion, several other terms are used that overlap our terms, but often without a clear definition. Examples include accident initiator, accident initiating event, accidental event, critical event, undesired event, unwanted event, process deviation, and potential major incident (accident).

2.3.10 Accident

An accident may be defined as: Definition 2.17 (Accident) A sudden, unwanted, and unplanned event or event sequence that has led to harm to people, the environment, or other tangible assets. ◽ By this definition, we have moved from talking about the future to considering the past. An accident is an event that actually has caused harm to one or more assets. The definition further implies that an accident is not predictable with respect to whether and when it occurs. The definition emphasizes that an accident is a distinct event or event sequence and not a long-term exposure to some hazardous material or energy. Suchman (1961) argues that an event can be classified as an accident only if it is unexpected, unavoidable, and unintended. Accidents can be classified in many different ways, such as according to types of accidents, causes of accidents, and severity of accidents. Some terms that are used to describe accidents are, for example, major accident, process safety accident, personal accident, occupational accident, and disaster. In many cases, the accident types are not clearly defined or the definitions may vary from case to case. In the process industry, it is common to distinguish between process safety accidents and personal accidents. Process safety accidents are related to the process plant as such, the processes going on and the materials being used in the plant. Common causes of these accidents are that the process comes out of control or that hazardous substances are released. The potential consequences can be very large, both for people, the environment, and other assets. Personal accidents or occupational accidents, usually involve one or few people. Typical examples are falls, cuts, crushing, and contact with electricity. In this case, the categorization is done mainly with respect to the types of accidents (and thereby also causes). In practice, the categorization is also according to the degree of possible consequences that may occur. Of particular concern are accidents with very large consequences. These accidents – called major accidents – receive a lot of attention, with thorough investigation of causes and sometimes with wide impact on regulations,


technology, operations, and public perception of risk. As mentioned previously, process safety accident is the equivalent term used in the process industry. There is no generally accepted definition of what this is, but sometimes the term high-impact, low-probability event neatly summarizes the main features: they are events of low probability that have a high impact both directly (in terms of direct consequence) and indirectly (e.g. in terms of regulatory and political implications). Accidents and accident models are discussed in more detail in Chapter 8.

Example 2.8 (Helicopter accidents) In the SINTEF helicopter studies (Herrera et al. 2010), helicopter accidents are classified into eight categories:

(1) Accident during takeoff or landing on a heliport
(2) Accident during takeoff or landing on a helideck (i.e. on an offshore platform)
(3) Accident caused by critical aircraft failure during flight
(4) Midair collision with another aircraft
(5) Collision with terrain, sea, or a building structure
(6) Personnel accident inside a helicopter (e.g. caused by toxic gases due to fire or cargo)
(7) Personnel accident outside a helicopter (e.g. hit by tail rotor)
(8) Other accidents.

The studies are limited to accidents involving helicopter crew and passengers. Accidents involving other persons and other assets are not included. ◽

2.3.11 Incident

The term “incident” may be defined as: Definition 2.18 (Incident) A sudden, unwanted, and unplanned event or event sequence that could reasonably have been expected to result in harm to one or more assets, but actually did not. ◽ This definition is identical to the definition of accident, with the important distinction that incidents do not cause any significant harm. As for accident, the term incident is mainly used about events that have occurred in the past. From an initiating event, the event sequence can develop until it ends either with an accident or with an incident. Other terms that are used more or less with the same meaning as incident are near accident, mishap, and near miss. Observe that some authors use the term incident to include both incidents and accidents as they are defined in this book.


2.3.12 Precursors

Phimister et al. (2004) define precursors as:

Definition 2.19 (Precursors) Conditions, events, and sequences that precede and lead up to accidents. ◽

A precursor is therefore something that happens that may alert us that an accident is imminent. If precursors can be identified, this offers a great potential for avoiding accidents. Many organizations have developed systems to identify accident precursors and made procedures on how to intervene before any accident occurs. It is sometimes easy to spot precursors after an accident has occurred. It is more difficult to identify precursors before an accident happens. Precursors are often technical failures, human errors, or operating conditions that, individually or in combination with other precursors, may lead to an accident. Often, precursors can be identified when they lead to incidents and near accidents that were stopped by functioning safety controls. An incident (with no significant consequences) is not a precursor, but may help to reveal precursors (e.g. see U.S. DOE 1996).

2.3.13 Special Types of Accidents

Accident causation and accident models are discussed in more detail in Chapter 8. The development of accident theory has been strongly influenced by the views of Charles Perrow and James Reason, who have introduced new notions for major accidents. Reason (1997) introduces the concept of organizational accident, defined as follows: Definition 2.20 (Organizational accident) A comparatively rare, but often catastrophic, event that occurs within complex modern technologies (e.g. nuclear power plants, commercial aviation, the petrochemical industry, chemical process plants, marine and rail transport, banks, and stadiums) and has multiple causes involving many people operating at different levels of their respective companies. Organizational accidents often have devastating effects on uninvolved populations, assets, and the environment. ◽ Reason (1997) calls accidents that cannot be classified as organizational accidents, individual accidents: Definition 2.21 (Individual accident) An accident in which a specific person or group is often both the agent and the victim of the accident. The consequences to the people concerned may be great, but their spread is limited. ◽


In the book “Normal Accidents: Living with High-Risk Technologies,” Perrow (1984) introduces the concepts of system accident and normal accident. Normal accident theory is discussed in Chapter 8, so here we just define his concept of system accident: Definition 2.22 (System accident) An accident that arises in the interactions among components (electromechanical, digital, and human) rather than in the failure of individual components (Perrow 1984). ◽ In analogy with the terminology of Reason (1997), an accident that cannot be classified as a system accident is sometimes called a component failure accident: Definition 2.23 (Component failure accident) An accident arising from component failures, including the possibility of multiple and cascading failures (e.g. see Leveson 2004). ◽

2.4 What is the Likelihood?

To answer the second question in the triplet definition of risk, "What is the likelihood of that happening?" we need to use concepts from probability theory. A brief introduction to probability theory is given in Appendix A. Essentially, the probability of an event E is a number between 0 and 1 (i.e. between 0% and 100%) that expresses the likelihood that the event occurs in a specific situation, and is written as Pr(E). If Pr(E) = 1, we know with certainty that event E occurs, whereas for Pr(E) = 0, we are certain that event E will not occur.

2.4.1 Probability

Probability is a complex concept about whose meaning many books and scientific articles have been written. There are three main approaches to probability: (i) the classical approach, (ii) the frequentist approach, and (iii) the Bayesian or subjective approach.

People have argued about the meaning of the word "probability" for at least hundreds of years, maybe thousands. So bitter, and fervent, have the battles been between the contending schools of thought, that they've often been likened to religious wars. And this situation continues to the present time (Kaplan 1997, p. 407).

2.4.1.1 Classical Approach

The classical approach to probability is applicable in only a limited set of situations, where we consider experiments with a finite number n of possible


outcomes, and where each outcome has the same likelihood of occurring. This is appropriate for many simple games of chance, such as tossing coins, rolling dice, dealing cards, and spinning a roulette wheel. We use the following terminology: An outcome is the result of a single experiment, and a sample space S is the set of all the possible outcomes. An event E is a set of (one or more) outcomes in S that have some common properties. When an outcome that is a member of E occurs, we say that the event E occurs. These and many other terms are defined in Appendix A. Because all n possible outcomes have the same likelihood of occurring, we can find the likelihood that event E will occur as the number nE of outcomes that belong to E divided by the number n of possible outcomes. The outcomes that belong to E are sometimes called the favorable outcomes for E. The likelihood of getting an outcome from the experiment that belongs to E is called the probability of E:

Pr(E) = (No. of favorable outcomes) / (Total no. of possible outcomes) = nE / n    (2.1)

The event E can also be a single outcome. The likelihood of getting a particular outcome is then called the probability of the outcome and is given by 1/n. When – as in this case – all the outcomes in S have the same probability of occurrence, we say that we have a uniform model.
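A small sketch of Eq. (2.1): for an ordinary six-sided die (an assumed example, not from the text), the event E = "an even number of eyes" has three favorable outcomes out of six possible outcomes.

```python
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]                    # outcomes of rolling a fair die
event_E = [x for x in sample_space if x % 2 == 0]    # favorable outcomes for "even number"

pr_E = Fraction(len(event_E), len(sample_space))     # Eq. (2.1): n_E / n
print(pr_E)                                          # prints 1/2
```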

2.4.1.2 Frequentist Approach

The frequentist approach restricts our attention to phenomena that are inherently repeatable under essentially the same conditions. We call each repetition an experiment and assume that each experiment may or may not give the event E. The experiment is repeated n times and we count the number nE of the n experiments that end up in the event E. The relative frequency of E is defined as

fn(E) = nE / n

Because the conditions are the same for all experiments, the relative frequency approaches a limit when n → ∞. This limit is called the probability of E and is denoted by Pr(E):

Pr(E) = lim(n→∞) nE / n    (2.2)

If we do a single experiment, we say that the probability of getting the outcome E is Pr(E) and consider this probability a property of the experiment.
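The limiting behavior in Eq. (2.2) can be illustrated by simulation. The sketch below flips a (simulated) fair coin n times and prints the relative frequency of heads, which settles toward 0.5 as n grows; the choice of coin and of the values of n is purely for illustration.

```python
import random

def relative_frequency(n, p=0.5):
    """Relative frequency f_n(E) of the event E = 'heads' in n simulated coin flips."""
    n_E = sum(1 for _ in range(n) if random.random() < p)
    return n_E / n

for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```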

2.4.1.3 Bayesian Approach

In a risk analysis, we almost never have a finite sample space of outcomes that occur with the same probability. The classical approach to probability is therefore not appropriate. Furthermore, to apply the frequentist approach, we must


at least be able to imagine that experiments can be repeated a large number of times under nearly identical conditions. Because this is rarely possible, we are left with a final option, the Bayesian approach. In this approach, the probability is considered to be subjective and is defined as:

Definition 2.24 (Subjective probability) A numerical value in the interval [0, 1] representing an individual's degree of belief about whether or not an event will occur. ◽

In the Bayesian approach, it is not necessary to delimit probability to outcomes of experiments that are repeatable under the same conditions. It is fully acceptable to give the probability of an event that can only happen once. It is also acceptable to talk about the probability of events that are not the outcomes of experiments, but rather are statements or propositions. This can be a statement about the value of a nonobservable parameter, often referred to as a state of nature. To avoid a too-complicated terminology, we also use the word event for statements, saying that an event occurs when a statement is true. The degree of belief about an event E is not arbitrary but is the analyst's best guess based on her available knowledge 𝒦 about the event. The analyst's (subjective) probability of the event E, given that her knowledge is 𝒦, should therefore be expressed as

Pr(E ∣ 𝒦)    (2.3)

The knowledge  may come from knowledge about the physical properties of the event, earlier experience with the same type of event, expert judgment, and many other information sources. For simplicity, we often suppress  and simply write Pr(E), but we should not forget that this is a conditional probability depending on . In a risk analysis, the word subjective may have a negative connotation. For this reason, some analysts prefer to use the word personal probability, because the probability is a personal judgment of an event that is based on the analyst’s best knowledge and all the information she has available. The word judgmental probability is also sometimes used. To stress that the probability in the Bayesian approach is subjective (or personal or judgmental), we refer to the analyst’s or her/his/your/my probability instead of the probability. Example 2.9 (Your subjective probability) Assume that you are going to do a job tomorrow at 10 o’clock and that it is very important that it is not raining when you do this job. You want to find your (subjective) probability of the event E: “rain tomorrow between 10:00 and 10:15.” This has no meaning in the frequentist (or classical) approach, because the “experiment” cannot be repeated. In the Bayesian approach, your probability Pr(E) is a measure of your belief about the weather between 10:00 and 10:15. When you quantify this belief

and, for example, say that Pr(E) = 0.08, this is a measure of your belief about E. To come up with this probability, you may have studied historical weather reports for this area, checked the weather forecasts, looked at the sky, and so on. Based on all the information you can get hold of, you believe that there is an 8% chance that event E occurs and that it will be raining between 10:00 and 10:15 tomorrow. ◽
The Bayesian approach can also be used when we have repeatable experiments. If we flip a coin, and we know that the coin is symmetric, we believe that the probability of getting a head is 1∕2. In this case, the frequentist and the Bayesian approaches give the same result.
An attractive feature of the Bayesian approach is the ability to update the subjective probability when more evidence becomes available. Assume that an analyst considers an event E and that her initial or prior belief about this event is given by her prior probability Pr(E):
Definition 2.25 (Prior probability) An individual's belief in the occurrence of an event E prior to any additional collection of evidence related to E. ◽
Later, the analyst gets access to the data D1, which contains information about event E. She can now use Bayes formula to state her updated belief, in light of the evidence D1, expressed by the conditional probability
Pr(E ∣ D1) = Pr(E) · Pr(D1 ∣ E) ∕ Pr(D1)    (2.4)
which is a simple consequence of the multiplication rule for probabilities
Pr(E ∩ D1) = Pr(E ∣ D1) Pr(D1) = Pr(D1 ∣ E) Pr(E)
The analyst's updated belief about E, after she has access to the evidence D1, is called the posterior probability Pr(E ∣ D1).

Thomas Bayes
Thomas Bayes (1702–1761) was a British Presbyterian minister who has become famous for formulating the formula that bears his name – Bayes' formula (often written as Bayes formula). His derivation was published (posthumously) in 1763 in the paper "An essay towards solving a problem in the doctrine of chances" (Bayes 1763). The general version of the formula was developed in 1774 by the French mathematician Pierre-Simon Laplace (1749–1825).


Definition 2.26 (Posterior probability) An individual's belief in the occurrence of the event E based on her prior belief and some additional evidence D1. ◽
Initially, the analyst's belief about the event E is given by her prior probability Pr(E). After having obtained the evidence D1, her probability of E is, from (2.4), seen to change by a factor of Pr(D1 ∣ E)∕Pr(D1). Bayes formula (2.4) can be used repetitively. Having obtained the evidence D1 and her posterior probability Pr(E ∣ D1), the analyst may consider this as her current prior probability. When additional evidence D2 becomes available, she may update her current belief in the same way as previously and obtain her new posterior probability:
Pr(E ∣ D1 ∩ D2) = Pr(E) · [Pr(D1 ∣ E) Pr(D2 ∣ E)] ∕ [Pr(D1) Pr(D2)]    (2.5)

Further updating of her belief about E can be done sequentially as she obtains more and more evidence.
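To show how (2.4) and the sequential updating behind (2.5) work in practice, here is a small Python sketch. It is an added illustration: the prior and the likelihood values are invented, and the second update treats the new evidence as conditionally independent of the first, in line with the assumption behind (2.5).

```python
# Illustrative sketch: updating a subjective probability with Bayes formula (2.4),
# applied sequentially as in (2.5). All numbers are made up for the example.

def bayes_update(prior, p_d_given_e, p_d_given_not_e):
    """Return the posterior Pr(E | D) from the prior Pr(E) and the two likelihoods."""
    p_d = p_d_given_e * prior + p_d_given_not_e * (1.0 - prior)  # Pr(D) by total probability
    return p_d_given_e * prior / p_d

# Prior belief about the event E (e.g. "rain between 10:00 and 10:15 tomorrow").
prior = 0.08

# Evidence D1: a weather forecast that predicts rain.
posterior_1 = bayes_update(prior, p_d_given_e=0.90, p_d_given_not_e=0.20)

# Evidence D2: dark clouds in the morning (the current prior is now posterior_1).
posterior_2 = bayes_update(posterior_1, p_d_given_e=0.70, p_d_given_not_e=0.30)

print(round(posterior_1, 3), round(posterior_2, 3))  # 0.281 0.477
```

Each new piece of evidence multiplies the current prior by a likelihood ratio, which is exactly the repetitive use of Bayes formula described above.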

2.4.1.4 Likelihood

By the posterior probability Pr(E ∣ D1) in (2.4), the analyst expresses her belief about the unknown state of nature E when the evidence D1 is given and known. The interpretation of Pr(D1 ∣ E) in (2.4) may therefore be a bit confusing because D1 is known. Instead, we should interpret Pr(D1 ∣ E) as the likelihood that the (unknown) state of nature is E, when we know that we have got the evidence D1. In our daily language, likelihood is often used with the same meaning as probability, even though there is a clear distinction between the two concepts in statistical usage. In statistics, likelihood is used, for example, when we estimate parameters (maximum likelihood principle) and for testing hypotheses (likelihood ratio test).
Remark 2.5 (The term likelihood) In the first part of this chapter, we have used the word likelihood as a synonym for probability, as we often do in our daily parlance. The reason for this rather imprecise use of the word "likelihood" is that we wanted to avoid using the word "probability" until it was properly introduced – and because we wanted to present the main definitions of risk concepts with the same wording that is used in many standards and guidelines. ◽

2.4.2 Controversy

In risk analysis, it is not realistic to assume that the events are repeatable under essentially the same conditions. We cannot, for example, have the same explosion over and over again under the same conditions. This means that we need to use the Bayesian approach. Although most risk analysts agree on this view, there is an ongoing controversy about the interpretation of the subjective probability. There are two main schools:
(1) The first school claims that the subjective probability is subjective in a strict sense. Two individuals generally come up with two different numerical values of the subjective probability of an event, even if they have exactly the same knowledge. This view is, for example, advocated by Lindley (2007), who claims that individuals have different preferences and hence judge information in different ways.
(2) The second school claims that the subjective probability is dependent only on knowledge. Two individuals with exactly the same knowledge 𝒦 always give the same numerical value of the subjective probability of an event. This view is, for example, advocated by Jaynes (2003), who states: A probability assignment is "subjective" in the sense that it describes a state of knowledge rather than any property of the "real" world but is "objective" in the sense that it is independent of the personality of the user. Two rational human beings faced with the same total background of knowledge must assign the same probabilities [also quoted and supported by Garrick (2008)].
The quotation from Jaynes (2003) touches on another controversy: that is, whether the probability of an event E is a property of the event E, the experiment producing E, or a subjective probability that exists only in the individual's mind. The mathematical rules for manipulating probabilities are well understood and are not controversial. A nice feature of probability theory is that we can use the same symbols and formulas whether we choose the frequentist or the Bayesian approach, and whether or not we consider probability as a property of the situation. The difference is that the interpretation of the results is different.
Remark 2.6 (Objective or subjective?) Some researchers claim that the frequentist approach is objective and therefore the only feasible approach in many important areas, for example, when testing the effects of new drugs. According to their view, such a test cannot be based on subjective beliefs. This view is probably flawed because the frequentist approach also applies models that are based on a range of assumptions, most of which are subjective. ◽

2.4.3 Frequency

When an event E occurs more or less frequently, we often talk about the frequency of E rather than the probability of E. Rather than asking "What is the probability of event E?" we may ask, for example, "How frequently does event E occur?" Fatal traffic accidents occur many times per year, and we may record the number nE(t) of such accidents during a period of length t. A fatal traffic accident is understood as an accident where one or more persons are killed. The frequency of fatal traffic accidents in the time interval (0, t) is given by
ft(E) = nE(t)∕t    (2.6)
The "time" t may be given as calendar time, accumulated operational time (e.g. the accumulated number of hours that cars are on the road), accumulated number of kilometers driven, and so on. In some cases, we may assume that the situation is kept unchanged and that the frequency approaches a constant limit when t → ∞. We call this limit the rate of the event E and denote it by 𝜆E:
𝜆E = lim_{t→∞} nE(t)∕t    (2.7)
In the frequentist interpretation of probability, parameters like 𝜆E have a true, albeit unknown value. The parameters are estimated based on observed values, and confidence intervals are used to quantify the variability in the parameter estimators. Models and formulas for the analysis may be found in Appendix A.
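A short Python sketch (an added illustration with invented accident counts and exposure figures) shows how the frequency in (2.6) can be computed from recorded data for different choices of the "time" measure t.

```python
# Illustrative sketch: event frequency f_t(E) = n_E(t) / t, cf. Eqs. (2.6) and (2.7).
# The accident counts and the exposure below are invented for the example.

fatal_accidents_per_year = [105, 98, 110, 95, 102]   # recorded counts, one entry per calendar year
years = len(fatal_accidents_per_year)

# Frequency per calendar year over the whole observation period.
n_E = sum(fatal_accidents_per_year)
frequency_per_year = n_E / years
print(frequency_per_year)  # 102.0 fatal accidents per year

# The same count related to another exposure measure, e.g. vehicle-kilometers driven.
vehicle_km = 35.0e9  # accumulated kilometers driven in the period (assumed value)
frequency_per_km = n_E / vehicle_km
print(frequency_per_km)  # approx. 1.46e-08 fatal accidents per vehicle-kilometer
```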

2.5 What are the Consequences?

The third question in our definition of risk introduces the term consequences. A consequence involves specific damage to one or more assets and is also called adverse effect, negative effect, impact, impairment, or loss. ISO Guide 73 defines consequence as "outcome of an event affecting objectives," in line with their definition of risk. The term harm is used in several important standards, including ISO 12100. There, harm is defined as "physical injury or damage to health," limiting the definition to cover people only. This is in line with the objectives of the standard. In this book, no distinction is made between harm and consequence, and we define it more generally as follows:
Definition 2.27 (Harm/consequence) Injury or damage to the health of people, or damage to the environment or other assets. ◽

Table 2.6 Some types of assets.
– Humans (first, second, third, and fourth parties)
– Community
– The environment (animals, birds, fish, air, water, soil, landscape, natural preserve areas, and built environment)
– Performance (e.g. availability of a production system, punctuality of a railway service)
– Material assets (e.g. buildings, equipment, and infrastructure)
– Historical monuments, national heritage objects
– Financial assets
– Intangibles (e.g. reputation, goodwill, and quality of life)
– Timing or schedule of activities (project or mission risk)
– Organizational behavior

2.5.1 Assets

To answer the third question in the triplet definition of risk "what are the consequences?" we first have to identify who – or what – might be harmed. In this book, these "objects" are called assets.
Definition 2.28 (Asset) Something we value and want to preserve. ◽
Assets are also called targets, vulnerable targets, victims, recipients, receptors, and risk-absorbing items. Examples of assets are listed in Table 2.6. Observe that the sequence of the assets in Table 2.6 does not imply any priority or ranking.

2.5.2 Categories of Human Victims

In most risk assessments, humans are considered to be the most important asset. The possible victims of an accident are sometimes classified according to their proximity to and influence on the hazard (Perrow 1984): (1) First-party victims. These are people directly involved in the operation of the system. Employees in a company where accidents may occur are typical examples of first-party victims. (2) Second-party victims. These are people who are associated with the system as suppliers or users, but exert no influence over it. Even though such exposure may not be entirely voluntary, these people are not innocent bystanders, because they are aware of (or could be informed about) their exposure. Passengers on airplanes, ships, and railways, for example, are considered to be second-party victims.

(3) Third-party victims. These are innocent bystanders who have no involvement in the system, for example, people living in the neighborhood of a plant. (4) Fourth-party victims. These are victims of yet-unborn generations. The category includes fetuses that are carried while their parents are exposed to radiation or toxic materials, and all those people who will be contaminated in the future by residual substances, including substances that become concentrated as they move up the food chain.

Example 2.10 (Victims of Railway Accidents) The railway industry sometimes classifies human assets in five categories:
(a) Passengers
(b) Employees
(c) People on the road or footpath crossings of the line
(d) Trespassers (who are close to the line without permission)
(e) Other persons
◽

2.5.3 Consequence Categories

In addition to distinguishing between different assets, the adverse effects may also be classified into several categories related to the assets. Some examples are given in Table 2.7. For harm to people, it is common to distinguish between:
– Temporary harm/injury. In this case, the person is injured but will be totally restored and able to work within a period after the accident.
– Permanent disability. In this case, the person gets a permanent illness or disability. The degree of disability is sometimes given as a percentage.
– Fatality. The person dies, either immediately or because of complications. The fatality may sometimes occur a long time after the accident, for example, due to cancer caused by radiation after a nuclear accident.

Table 2.7 Some types of harm to different assets.
– Loss of human life
– Personal injury
– Reduction in life expectancy
– Damage to the environment (fauna, flora, soil, water, air, climate, and landscape)
– Damage to material assets
– Investigation and cleanup costs
– Business-interruption losses
– Loss of staff productivity
– Loss of information
– Loss of reputation (public relations)
– Insurance deductible costs
– Fines and citations
– Legal action and damage claims
– Business-sustainability consequences
– Societal disturbances
– Reduction of human well-being
– Loss of freedom

2.5.4 Consequence Spectrum

A hazardous event (or end state) i may lead to a number of potential consequences ci,1, ci,2, …, ci,m. The probability pi,j that consequence ci,j will occur depends on the physical situation and whether or not the barriers are functioning. The possible consequences and the associated probabilities resulting from the hazardous event are shown in Figure 2.5.

Figure 2.5 Consequence spectrum for a hazardous event: hazardous event i may lead to consequences ci,1, ci,2, ci,3, …, ci,m with associated probabilities pi,1, pi,2, pi,3, …, pi,m.

The diagram in Figure 2.5 is called a consequence spectrum, a risk picture, or a risk profile related to hazardous event i. The consequence spectrum may also be written as a vector:
ci = [(ci,1, pi,1), (ci,2, pi,2), …, (ci,m, pi,m)]    (2.8)

In Figure 2.5 and in the vector (2.8), we have tacitly assumed that the consequences can be classified into a finite number (m) of discrete consequences. This is often a simplification that we make in risk analysis.
A study object may lead to several potential hazardous events. It may therefore be relevant to establish the consequence spectrum for the study object rather than for a single hazardous event. Each hazardous event then has a consequence spectrum, as shown in Figure 2.5. Combining the consequence spectra for all the relevant hazardous events yields the consequence spectrum for the study object. This consequence spectrum has the same form as for a hazardous event. The consequence spectrum may also be presented in a table, as shown in Table 2.8.

Table 2.8 Consequence spectrum for a study object (example).
i    Consequence (ci)                    Probability (pi)
1    Operator is killed                  0.001
2    Operator is permanently disabled    0.004
3    Operator is temporarily injured     0.008
…    …                                   …
m    Minor material damage               0.450

The probability p associated with each consequence lies between 0 and 1, where p = 0 means that the consequence is impossible and p = 1 signals that it will always occur. Both extremities correspond to a fatalistic world view in which the future is conceived of as independent of human activities. According to Rosa (1998), the term risk would be of no use in such a world of predetermined outcomes. At the heart of the concept of risk is thus the idea that the consequences admit to some degree of uncertainty.
In some cases, it may be possible to measure the consequences of a hazardous event i to the different assets in a common unit (e.g. in US dollars). Let 𝓁(ci,j) denote the loss in dollars if consequence ci,j occurs, for j = 1, 2, …, m. The loss spectrum for the hazardous event can then be pictured as in Figure 2.6.

Figure 2.6 Loss spectrum for a hazardous event i: each consequence ci,j is shown with its loss 𝓁(ci,j) and probability pi,j, for j = 1, 2, …, m.

In this case, it may be meaningful to talk about the mean consequence or mean loss if the hazardous event should occur:
E[𝓁(ci)] = ∑_{j=1}^{m} 𝓁(ci,j) pi,j    (2.9)
Observe that (2.9) is the conditional mean loss, given that the specified hazardous event has occurred. The minimum and maximum loss and the standard deviation may easily be provided. In cases where the consequences cannot be easily measured in a common unit, it is considered much more meaningful to present the entire consequence spectrum to the decision-maker, primarily for the whole study object but also for the most critical hazardous events (or end states).
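As an added illustration of (2.9), the following Python sketch computes the conditional mean loss for a consequence spectrum of the same form as Table 2.8. The loss values are invented, and the listed consequences cover only part of the spectrum, so the probabilities do not sum to one and the result is a partial sum over the listed consequences.

```python
# Illustrative sketch: a consequence spectrum like Table 2.8 and the conditional
# mean loss in Eq. (2.9). Probabilities and losses are invented for the example.

consequence_spectrum = [
    # (consequence c_ij, probability p_ij, loss l(c_ij) in US dollars)
    ("Operator is killed",               0.001, 5_000_000),
    ("Operator is permanently disabled", 0.004, 1_500_000),
    ("Operator is temporarily injured",  0.008,    50_000),
    ("Minor material damage",            0.450,     5_000),
]

# Conditional mean loss over the listed consequences, given that the hazardous event occurs.
mean_loss = sum(p * loss for _, p, loss in consequence_spectrum)
print(round(mean_loss))  # 13650

# Minimum and maximum loss among the listed consequences.
losses = [loss for _, _, loss in consequence_spectrum]
print(min(losses), max(losses))
```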

2.5.5 Time of Recording Consequences

Some of the consequences of an accident may occur immediately, whereas others may not materialize until years after the accident. People are, for example, still (claimed to be) dying of cancer in 2019 as a consequence of the Chernobyl accident in 1986. A large quantity of nuclear fallout was released and spread as far as northern Norway. During the accident, only a few persons were harmed physically, but several years after the accident, a number of people developed cancer and died from the fallout. The same applies for other accidents involving hazardous materials, and notably for the Bhopal accident that took place 23 December 1984, in Bhopal, India. When we assess the consequences of an accident, it is therefore important not only to consider the immediate consequences but also to consider the delayed effects.

2.5.6 Severity

In some cases, it is useful to define a limited set of possible consequence classes or categories and use these rather than a continuous spectrum of consequences. The term severity is sometimes used to describe these classes: Definition 2.29 (Severity) Seriousness of the consequences of an event expressed either as a financial value or as a category. ◽ The categories may be, for example, catastrophic, severe loss, major damage, damage, or minor damage. Each category has to be described to ensure the categories are understood by all relevant stakeholders. This is discussed further in Chapter 6.

2.6 Additional Terms

This section defines a number of terms that are associated with risk and that are treated in more detail in later chapters of the book.

2.6.1 Barriers

Most well-designed systems have barriers that can prevent or reduce the probability of hazardous events, or stop or mitigate their consequences.
Definition 2.30 (Barrier) Physical or engineered system or human action (based on specific procedures or administrative controls) that is implemented to prevent, control, or impede energy released from reaching the assets and causing harm. ◽
Barriers are also called safeguards, protection layers, defenses, controls, or countermeasures. Barriers are discussed in more detail in Chapter 14. Some categories of barriers are listed in Table 2.9.

Table 2.9 Categories of barriers.
Physical barriers:
– Equipment and engineering design
– Personal protective equipment (e.g. clothes, hard hats, and glasses)
– Fire walls, shields
– Safety devices (e.g. relief valves, emergency shutdown systems, and fire extinguishers)
– Warning devices (e.g. fire and gas alarms)
Organizational barriers:
– Hazard identification and analyses
– Line management oversight
– Supervision
– Inspection and testing
– Work planning
– Work procedures
– Training
– Knowledge and skills
– Rules and regulations

2.6.2 Safety

Safety is a problematic concept that is used with many different meanings. Many standards and guidelines related to risk assessment use the word safety but avoid defining the concept. An exception is MIL-STD-882E (2012), where safety is defined as "freedom from those conditions that can cause death, injury, occupational illness, damage to or loss of equipment or property, or damage to the environment." According to this definition, safety implies that all hazards are removed and that no assets will be harmed. This implies that risk is zero. For most practical systems, safety is therefore not attainable, and may be considered a Utopia.
Many risk analysts feel that the definition of safety in MIL-STD-882E is not of any practical use and that we need a definition such that safety is an attainable state. The following definition is therefore proposed:
Definition 2.31 (Safety) A state where the risk has been reduced to a level that is as low as reasonably practicable (ALARP) and where the remaining risk is generally accepted. ◽
This definition implies that a system or an activity is safe if the risk related to the system/activity is considered to be acceptable. Safety is therefore a relative condition that is based on a judgment of the acceptability of risk. The meaning of acceptable risk and ALARP is discussed further in Chapter 5. From Definition 2.31, safety is closely dependent on risk because it is the risk level that determines whether a system is safe or not. An important distinction between risk and safety, as defined above, is that safety is a state that either is reached or not, whereas risk is measured on a continuous scale and can be high, medium, or low, or measured or expressed in other ways. This means that even if a system is safe, there will still be risk.

2.6.3 Safety Performance

In this book, we use the word risk to describe our uncertainty about adverse events that may occur in the future. Sometimes, decision-makers may be wondering “whether the estimated risk in the coming period (e.g. five years) is higher or lower than the risk was in the past period.” With our definition of risk, speaking of risk in the past has no meaning. This is because when a period is over, there is no uncertainty related to what happened in that period. We therefore need another term that can be used to describe what happened in a past period – and we use the term safety performance.

Definition 2.32 (Safety performance) An account of all accidents that occurred in a specified (past) time period, together with frequencies and consequences observed for each type of accident. ◽
In this way, the estimated risk in the coming period can be compared to the safety performance in the past period.
Remark 2.7 (Was the risk analysis wrong?) Observe that the occurrence of events and accidents is – at least partly – a random process. If the risk in the coming period is estimated to be rather high, and by the end of that period, we find that the safety performance in the period showed no accidents, this does not necessarily mean that the risk analysis was wrong. The same argument can also be used the other way around. In particular for major accident risk, it can be claimed that risk analyses are hardly ever wrong (although they may not always be right)! ◽

2.6.4 Security

In risk analysis, it is important to identify all the relevant hazardous events. The hazardous events may be (i) random, such as technical failures and natural events (e.g. lightning, flooding), (ii) systematic, such as software bugs or erroneous installation, or (iii) due to deliberate actions, such as computer hacking and arson. The term safety is often used when we talk about random events, whereas security is used in relation to deliberate actions. The term total safety is sometimes used to cover both safety and security. Security assessment is discussed in Chapter 17. Definition 2.33 (Security) Freedom from, or resilience against, harm committed by hostile threat actors. ◽ Security is, as safety, a relative concept that is closely related to risk acceptability. The principal difference between safety and security is intentionality; security is characterized by adversary intent to do harm. Assessing security risk therefore changes the first question of Kaplan and Garrick (1981) into how someone can make something happen. This complicates risk assessment, as the range of possible events is restricted only by the assessor’s imagination and ability to put herself in the situation of a potential enemy or criminal. Central to an understanding of the concept of security are the terms threat, threat actor, and vulnerability: Definition 2.34 (Threat) A generic category of an action or event that has the potential to cause damage to an asset. ◽

The deliberate hostile action can be a physical attack, such as arson, sabotage, and theft, or a cyberattack. The generic categories of attacks are called threats, and the entity using a threat is called a threat actor or a threat agent. Arson is therefore a threat, and an arsonist is a threat actor. The threat actor may be a disgruntled employee, a single criminal, a competitor, a group, or even a country. When a threat actor attacks, she seeks to exploit some weaknesses of the item. Such a weakness is called a vulnerability of the item. Weak passwords and heaps of combustible materials close to the item are examples of vulnerabilities. There are two categories of threats: (i) physical threats and (ii) cyber threats. Cyber threats include hacking, worms, viruses, malware, trojan horses, password cracking, and many more. With our increasing dependency on computers and communication networks, our fear of cyber threats is steadily increasing.
Remark 2.8 (Natural threat) The word "threat" is also used for potential natural events, such as avalanche, earthquake, flooding, hurricane, landslide, lightning, pandemic, tsunami, and wildfire, to name a few. We may, for example, say that earthquake is a threat to our system. No threat actor is involved for this type of threat. ◽
The term threat actor is used to indicate an individual or a group that can manifest a threat. When analyzing security risk, it is fundamental to identify who could want to exploit the vulnerabilities of a system, and how they might use them against the system.
Definition 2.35 (Threat actor) An individual, a group or a thing that acts, or has the power to act, to cause, carry, transmit, or support a threat. ◽
A threat actor is sometimes called a threat agent. An example of a threat agent is a hacker who breaks into computers, usually by gaining access to administrative controls. To cause harm, a threat agent must have the intention, capacity, and opportunity to cause harm. Intention means the determination or desire to achieve an objective. Capacity refers to the ability to accomplish the objective, including the availability of tools and techniques as well as the ability to use these correctly. Opportunity to cause harm implies that the asset must be vulnerable to attack. Vulnerability may be defined as follows:
Definition 2.36 (Vulnerability) A weakness of an asset or control that can be exploited by one or more threat actors. ◽

Figure 2.7 The concepts of threat, threat actor, and vulnerability: a threat actor uses a threat to launch an attack that exploits a vulnerability, leading to a hazardous event.

A vulnerability is a characteristic or state of the asset that allows a threat actor to carry out a successful attack. The weakness may have been introduced during design, installation, operation, or maintenance. Vulnerability refers to the security flaws in a system that allow an attack to be successful. These weaknesses may be categorized as physical, technical, operational, and organizational. A vulnerability in security terms can be, for example, an unlocked door, allowing unauthorized people to access a computer that is not protected by a password. We can see that a vulnerability in many respects can be compared to what we would call "lack of" or "weak" barriers when we are talking about risk. Vulnerability is also used in relation to safety, but then more as an opposite to resilience (see next section). Security and security assessment are discussed in more detail in Chapter 17.

2.6.4.1 An Illustration

Figure 2.7 shows that a threat agent may use a threat to launch an attack intended to exploit a vulnerability in the system. If the vulnerability is "penetrated," a hazardous event will occur. The threat actor may in some cases get information about vulnerabilities in the system and choose the type of threat that most likely will make the attack successful (from her point of view).

2.6.5 Resilience

Resilience is in many respects the opposite of vulnerability. Foster (1993) defines resilience as:
Definition 2.37 (Resilience) The ability to accommodate change without catastrophic failure, or the capacity to absorb shocks gracefully. ◽
The word resilience conveys the ability to recover and to be brought back into shape or position after being stressed. Resilience is a broader term than robustness, which is a static concept that is basically synonymous with damage tolerance. In addition to the ability to withstand damage, resilience also has a dynamic component, that is, adaptation to a new situation. Resilience is therefore a pervasive property that influences a system's response to a wide range of stressors and threats (e.g. see Rosness et al. 2004; Hollnagel et al. 2006).

2.7 Problems

2.1 Describe the main difference between the concepts of hazard and threat.

2.2 What is the difference between a probability and a frequency?

2.3 In Table 2.1, various uses of the word "risk" from media are shown. Look at the statements and see if risk should be replaced with another term if we were to apply the definitions used in this book.

2.4 Search for the term "hazard" on the Internet and see if it is used in accordance with our definition.

2.5 List the possible failure modes of the driver's door on a modern car.

2.6 Start with the following situation: You are cycling down a steep road at high speed and approach a major crossing road. Describe a few possible accident scenarios that can develop from this situation. What are the hazards, initiating events, and enabling events and conditions in the scenarios that you have described?

2.7 Consider the following events related to a ship:
• The ship hits an obstruction.
• The crew abandons ship.
• The captain of a ship is planning a voyage and fails to identify an obstruction in the planning process.
• The ship sets sail from port.
• The ship starts sinking.
• All crew drowns.
• During the voyage, the person on the bridge of the ship falls asleep.
(a) Order these events into a logical accident scenario.
(b) Use the definitions of hazardous event and initiating event and identify the steps in the sequence that could be classified as hazardous events and initiating events. Different answers may be relevant, but provide arguments for why you choose as you do.

2.8 In this chapter, reference accident scenario, worst-case accident scenario, and worst credible accident scenario are defined.
(a) What are the differences between these three?
(b) Do you see any challenges in defining these scenarios in a practical case?

2.9 What are the differences between the two concepts robustness and vulnerability?

2.10 There are numerous definitions of the word "risk" and Section 2.2 provides a definition and lists some alternatives. Compare the alternative definitions of risk provided and see how they differ from the definition used in the book.

2.11 Search the Internet for the word "risk" to see how it is used in different contexts, e.g. in media, and how the everyday use of the word compares to the formal definition. Some examples to look for are situations where risk is used synonymously with hazard, safety performance, and frequency.

2.12 Compare the terms incident and hazardous event and discuss the similarities and differences between these terms. Use practical examples and discuss the terms based on these rather than discussing purely from a theoretical viewpoint.

2.13 Assume that a bicycle has a brake system with a handle on the handlebars, a wire running from the handle to the brake pads, and finally brake pads that make contact with the wheel when the handle is pulled. Identify relevant failures, failure modes, and failure mechanisms for the brake system. Classify the failures according to the cause of the failure and the degree of the failure.

2.14 Consider the hazardous event "Car hits back of car in front while driving" and describe this event in a bow-tie. Identify relevant barriers.

2.15 Consider the hazardous event "Fire in student flat" and describe this event in a bow-tie. Identify relevant barriers.

2.16 Compare Definitions 2.31 and 2.33 for safety and security, respectively. Discuss the difference between these two definitions and suggest an alternative definition for security.


References

ARAMIS (2004). Accidental Risk Assessment Methodology for Industries in the Context of the Seveso II Directive. Technical report EVSG1-CT-2001-00036. Fifth Framework Programme of the European Community, Energy, Environment and Sustainable Development.
Aven, T. and Renn, O. (2009). On risk defined as an event where the outcome is uncertain. Journal of Risk Research 12 (1): 1–11.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London 53: 370–418.
Bernstein, P.L. (1998). Against the Gods: The Remarkable Story of Risk. Hoboken, NJ: Wiley.
Foster, H.D. (1993). Resilience theory and system evaluation. In: Verification and Validation of Complex Systems: Human Factors Issues (ed. J.A. Wise, V.D. Hopkin, and P. Stager), 35–60. Berlin: Springer.
Garrick, B.J. (2008). Quantifying and Controlling Catastrophic Risks. San Diego, CA: Academic Press.
Herrera, I.A., Håbrekke, S., Kråkenes, T. et al. (2010). Helicopter Safety Study (HSS-3). Research report SINTEF A15753. Trondheim, Norway: SINTEF.
Hollnagel, E., Woods, D.D., and Leveson, N. (2006). Resilience Engineering: Concepts and Precepts. Aldershot: Ashgate.
IAEA (2002). Procedures for Conducting Probabilistic Safety Assessment for Non-Reactor Nuclear Facilities. Technical report IAEA-TECDOC-1267. Vienna, Austria: International Atomic Energy Agency.
ISO 12100 (2010). Safety of machinery – general principles for design: risk assessment and risk reduction, International standard ISO 12100. Geneva: International Organization for Standardization.
ISO 31000 (2018). Risk management – guidelines, International standard. Geneva: International Organization for Standardization.
ISO Guide 73 (2009). Risk management – vocabulary, Guide. Geneva: International Organization for Standardization.
Jaynes, E.T. (2003). Probability Theory: The Logic of Science. Cambridge: Cambridge University Press.
Johansen, I.L. (2010). Foundations of Risk Assessment. ROSS report 201002. Trondheim, Norway: Norwegian University of Science and Technology.
Kaplan, S. (1997). The words of risk analysis. Risk Analysis 17: 407–417.
Kaplan, S. and Garrick, B.J. (1981). On the quantitative definition of risk. Risk Analysis 1 (1): 11–27.
Khan, F.I. and Abbasi, S.A. (2002). A criterion for developing credible accident scenarios for risk assessment. Journal of Loss Prevention in the Process Industries 15 (6): 467–475.
Kim, D., Kim, J., and Moon, I. (2006). Integration of accident scenario generation and multiobjective optimization for safety-cost decision making in chemical processes. Journal of Loss Prevention in the Process Industries 19 (6): 705–713.
Klinke, A. and Renn, O. (2002). A new approach to risk evaluation and management: risk-based, precaution-based, and discourse-based strategies. Risk Analysis 22 (6): 1071–1094.
Leveson, N. (2004). A new accident model for engineering safer systems. Safety Science 42 (4): 237–270.
Lindley, D.V. (2007). Understanding Uncertainty. Hoboken, NJ: Wiley.
Lupton, D. (1999). Risk. London: Routledge.
MIL-STD-882E (2012). Standard practice for system safety. Washington, DC: U.S. Department of Defense.
Perrow, C. (1984). Normal Accidents: Living with High-Risk Technologies. New York: Basic Books.
Phimister, J.R., Bier, V.M., and Kunreuther, H.C. (eds.) (2004). Accident Precursor Analysis and Management: Reducing Technological Risk Through Diligence. Washington, DC: National Academies Press.
Rausand, M., Høyland, A., and Barros, A. (2020). System Reliability Theory: Models, Statistical Methods, and Applications, 3e. Hoboken, NJ: Wiley.
Reason, J. (1997). Managing the Risks of Organizational Accidents. Aldershot: Ashgate.
Rosa, E.A. (1998). Metatheoretical foundations for post-normal risk. Journal of Risk Research 1 (1): 15–44.
Rosness, R., Guttormsen, G., Steiro, T. et al. (2004). Organizational Accidents and Resilient Organizations: Five Perspectives. STF38 A04403. Trondheim, Norway: SINTEF.
Royal Society (1992). Risk: Analysis, Perception and Management. London: Report of a Royal Society study group, Royal Society.
Suchman, E.A. (1961). A conceptual analysis of the accident problem. Social Problems 8 (3): 241–246.
Treasury Board (2001). Integrated Risk Management Framework. Catalogue Number BT22-78/2001. Ottawa, Canada: Treasury Board of Canada. ISBN: 0-622-65673-3.
U.S. DOE (1996). Process Safety Management for Highly Hazardous Chemicals. DOE-HDBK-1101-86. Washington, DC: U.S. Department of Energy.


3 Main Elements of Risk Assessment

3.1 Introduction

The terms "risk analysis" and "risk assessment" are used several times in Chapters 1 and 2 without any proper definition. We define the terms as follows:
Definition 3.1 (Risk analysis) A systematic study to identify and describe what can go wrong and what the causes, the likelihoods, and the consequences might be. ◽
A risk analysis is, according to Definition 3.1, aimed at providing answers to the three questions used to define risk.
Definition 3.2 (Risk assessment) The process of planning, preparing, performing, and reporting a risk analysis, and evaluating the results against risk acceptance criteria. ◽
In addition to planning, preparing, and reporting, risk assessment consists of two main analytical parts: risk analysis and risk evaluation, as shown in Figure 3.1. First, a risk analysis is carried out to identify and describe relevant accident scenarios and likelihoods, which together define the risk. The second part, risk evaluation, compares the risk determined by the risk analysis with risk acceptance criteria, as discussed in Chapter 5.

3.1.1 The Role of the Risk Analyst

The main role of the risk analyst is to answer the three questions in the definition of risk in Section 2.2.1, and to carry out the risk analysis as correctly and accurately as possible. The analyses and the evaluations should be as objective as the available data permit, and they should be neutral, impartial, and dispassionate.


Figure 3.1 Risk assessment as a combination of risk analysis and risk evaluation: risk assessment = risk analysis + risk evaluation.

During the risk assessment process, the risk analysts have to make many judgments and interpretations that are subjective, and hence, different analysts seldom reach exactly the same answers to the three questions. Even so, the analysts should make an effort to do the job as objectively as possible. Remark 3.1 (Terminology) Some authors do not distinguish between a risk analysis and a risk assessment, but use the two terms as synonyms. ◽

3.2 Risk Assessment Process

The risk assessment process consists of a sequence of steps and substeps. The steps in the process are shown in Figure 3.2. We divide the process into six steps, and the steps are described in separate sections.

Figure 3.2 The six steps of a risk assessment: Step 1: Plan the risk assessment; Step 2: Define the study; Step 3: Identify hazards and initiating events; Step 4: Develop accident scenarios and describe consequences; Step 5: Determine and assess the risk; Step 6: Risk presentation.

The risk evaluation is described only briefly in this chapter, and, in general, this book does not go into much detail on this process. Chapter 5 explains the principles of risk acceptance criteria, but does not go into the process of deciding whether or not the risk should be accepted, because this usually involves wider considerations than just risk.
The activities are not always carried out in the same sequence as shown in Figure 3.2. Several activities may be performed in parallel, and we may also jump backwards and forwards in the process. The choice of approach may, for example, imply that the competence in the study team must be supplemented with an additional specialist, and delimitation of the study object may require that the project plan be revised. Even if the main steps are always performed, not all the substeps defined in the following sections are relevant for all types of risk assessment.

3.2.1 Step 1: Plan the Risk Assessment

For a risk assessment to provide the required results, the process must be carefully planned and prepared. One can easily be impatient in this part of the study, aching to get on with the “real work” as soon as possible. In most cases, it is beneficial to allocate sufficient time and resources to prepare the study. Experience has shown that this is perhaps the most critical part of the whole risk assessment process. Without a good understanding of what we want to achieve, it is hard to do a good job. As described later in the book, there are many ways of doing a risk analysis. A reason why so many different methods have been developed is that risk assessment is applied to a wide range of problems and study objects. To be able to select the best methods, we need to have a good understanding of the problem that we are going to analyze. Further, risk assessments are normally performed to provide input to decisions about risk – whether we need to do something with the risk and if so, what should be done. Understanding the decision alternatives is also important before starting the work. The structure of step 1 is shown in Figure 3.3.

Figure 3.3 Step 1: Plan the risk assessment. Inputs: overall objectives, decision and criteria, rules and regulations, standards, personnel/competence. Substeps: 1.1 Clarify decision and decision criteria; 1.2 Define outputs from risk assessment; 1.3 Define objectives and scope; 1.4 Establish study team and organize work; 1.5 Establish project plan; 1.6 Identify and provide background data. Outputs (to step 2): agreed objectives, risk acceptance criteria, study team, time schedule, stakeholder list.

3.2.1.1 Step 1.1: Clarify Decision and Decision Criteria

The reasons for performing risk assessments may vary, but common to all is that they are done to provide input to decisions. It is therefore important that the study team understands the decision alternatives and gives clear and specific answers that can be used in the decision-making process. If the objective of the risk assessment is not clear from the beginning, it is not likely that the assessment will answer the questions that need to be answered to support the actual decision-making. The study team needs to know who the users of the results are and their knowledge about risk assessment. The users may influence both how the risk assessment should be carried out, and particularly, how the results should be presented. If the users are risk assessment experts, the presentation of results and the use of terminology may be different compared to if the users have no or little experience with risk assessment. If risk acceptance criteria for risk evaluation have been established, these must be known to the study team such that the results from the risk assessment are given in a format that enables comparison with the criteria (see Chapters 5 and 6). A good understanding of these criteria and how they should be interpreted is essential. Example 3.1 (Decision criteria) A railway company has established a risk acceptance criterion that states “The frequency of fatal accidents per year in our operation shall be less than 0.1 per year.” In this case, it is necessary to clarify if this applies only to employees or also to passengers and/or people crossing the railway line. The answer to this question affects the scope of the risk assessment. ◽ If several types of assets are considered in the risk assessment, separate decision criteria need to be established for each type of asset. This may apply even if people are the only asset considered. Different criteria may, for example, be applied for employees and neighbors living close to a hazardous plant. Criteria for when the assets are considered to be harmed may also need to be established. Many risk assessments are concerned only with people, and the criteria for harm are then usually whether people are injured or not. The risk assessment should demonstrate that all the significant accident scenarios related to each decision alternative have been considered. If only one decision alternative is considered, the risk assessment should cover the risk if the decision alternative is chosen and also the risk if the decision alternative is not chosen. The role of the risk assessment in the decision-making process is illustrated in Figure 3.4. From the decisions, the required input can be defined, and this determines the objectives and scope of the risk assessment. When the assessment is completed and results are ready, this feeds back to the decisions

again, together with other relevant information that is taken into account (e.g. feasibility and cost).

Figure 3.4 Risk assessment in the decision-making process: the decision alternatives considered determine the required inputs to decision-making, which define the objectives and scope for the risk assessment; the risk assessment and proposals for risk reduction, together with an uncertainty assessment of the results and other inputs to decisions (e.g. feasibility, cost, and requirements), feed back into the decision.

3.2.1.2 Step 1.2: Define Outputs from Risk Assessment

The required outputs from the assessment should be clearly defined. The decisions and decision criteria determine, to a large extent, which outputs are required, but several aspects need to be considered:
• Should the outputs be qualitative or quantitative? The decision criteria significantly influence this decision.
• In what format should the results be presented? Only in a technical report or also in other formats (summary report, presentation, brochure)?
• What level of detail should be used when presenting the results? This ties in with the decision and also the users.
• Are there several stakeholders with different backgrounds that require different types of information?
Most important is always to keep the decision in mind and to ask ourselves whether what we are planning to present will help the decision-maker to make the best decision with regard to risk.

3.2.1.3 Step 1.3: Define Objectives and Scope

The decision alternatives and the decision criteria determine the scope and objectives of the study. The objectives can often be derived directly from the decisions that the analysis is supporting, and this in turn also determines the scope of the study. The objective may be to
• verify that the risk is acceptable,
• choose between two alternative designs,
• identify potential improvements in an existing design,
• a combination of these,
• …and several more.

When the objectives are defined, the scope may be set up. The scope describes the work to be done and specific tasks that need to be performed. At this stage, it is possible to specify more clearly what resources and competences are required to perform the risk assessment.

3.2.1.4 Step 1.4: Establish the Study Team and Organize the Work

The risk assessment is usually carried out by a study team that is led by a team leader. The team members should together meet the following requirements:
• Have competence relevant to the study object. Including team members from sectors with similar problems may also be useful.
• Have necessary knowledge of the study object, how it is operated, and how safety is managed and maintained.
• Have competence in risk assessment in general and risk analysis methods in particular.
• Come from different levels of the organization. The purpose of this is to better reflect the priorities, attitudes, and knowledge of different organizational levels in the analysis.
The last item may seem unnecessary because as long as the analysis is done in an objective manner, based on the available information, should we not arrive at the same results regardless of who is involved? This is an important issue for any risk assessment and should always be kept in mind. A risk analysis predicts what may happen in the future. Because we are never able to know exactly what will happen, the analysis is always based both on facts and judgments. Different people may judge the situation differently and may thus arrive at different conclusions. A risk analysis should therefore never be seen as a completely objective study, but as a reflection of the available data and the knowledge of the participants in the study, including their values, attitudes, and priorities.
The number of persons taking part in the study may vary depending on the scope of the risk assessment and how complicated the study object is. In some cases, it may be relevant to contract a consulting company to carry out the risk assessment. If the risk assessment is done by external consultants, it is important that in-house personnel carefully follow the assessment process to ensure that the company accepts ownership of the results. The competence and experience of each member of the study team should be documented, along with their respective roles in the team. In cases where external stakeholders are exposed to risk, it should be considered if, and to what degree, these should be involved in the risk assessment.

3.2.1.5 Step 1.5: Establish Project Plan

For the results from the risk assessment to be used as the basis for a decision, it is of paramount importance that the assessment process be planned such that the results are available in a timely manner. The study team should, in cooperation with the decision-makers, decide on a time schedule and estimate the resources that are required to do the risk assessment. The extent of the assessment depends on how complicated the study object is, the risk level, the competence of the study team, how important the decision is, the time available for the study, access to data, and so on. The level of detail in the study should be agreed upon as part of the planning.

Figure 3.5 Step 2: Define the study (input from step 1, output to step 3). Supporting information: experience data, reliability data, analytical methods, physical models, consequence models, exposed assets. Substeps: 2.1 Define and delimit the study object; 2.2 Provide documentation and drawings; 2.3 Familiarization; 2.4 Select method; 2.5 Select data; 2.6 Identify relevant assets. Outputs: system delimitation, assets to be protected, data dossier.

3.2.1.6 Step 1.6: Identify and Provide Background Information

Most study objects have to comply with a number of laws and regulations. Many of these give requirements related to health and safety, and some of them require that risk assessments be performed. It is important that the study team is familiar with these laws and regulations, such that the requirements are taken into account in the risk assessment. Risk assessment standards and/or guidelines have been developed for many types of study objects and application areas (see Chapter 20). Internal requirements and guidelines given by the organization that performs the risk assessment may also need to be adhered to. The study team must be familiar with these standards and guidelines.
The number of documents required to support a risk assessment may be substantial. A document control system should therefore be established to manage the various documents and other information sources. This system must control the updating, revision, issue, or removal of reports in accordance with the quality assurance program to ensure that the information remains up to date.

3.2.2 Step 2: Define the Study

The structure of step 2 is shown in Figure 3.5.

3.2.2.1 Step 2.1: Define and Delimit the Study Object

The study object must be defined precisely. When the risk assessment is initiated at an early stage of a system development project, we have to make do with a preliminary system definition and delimitation on a high level, leaving a more detailed description to a later stage. Aspects of the study object that need to be considered include the following:
• The boundaries and interfaces with related systems, both physical and functional.
• Interactions and constraints with respect to factors outside the boundary of the study object.
• Technical, human, and organizational aspects that are relevant.
• The environmental conditions.
• The energy, materials, and information flowing across boundaries (input to and output from the study object).
• The functions that are performed by the study object.
• The operating conditions to be covered by the risk assessment and any relevant limitations.
In many risk assessments, it is difficult to delimit the study object and to decide which assumptions and conditions apply. What should be covered in the risk assessment, and what can be disregarded? In the first steps of a risk assessment, the objective should be to establish a picture of the most important risk issues related to the study object. Later on, the risk assessment may be extended to cover specific parts of the study object under special conditions.
In most cases, the study object must be divided into reasonable parts for analysis. Depending on how complicated the study object is, these parts may be subsystems, assemblies, subassemblies, and components. A numerical coding system corresponding to the system breakdown should be established, such that each part is given a unique number. In the offshore oil and gas industry, this system is usually called the tag number system. Several methods are available for system breakdown. It is most common to use some sort of hierarchical structure. In some cases, it is most relevant to focus on functions or processes, whereas in others, the focus is on the physical elements of the system. System breakdown methods are discussed further in Chapter 11 and onwards when different methods are discussed. The study object is studied further in Chapter 4.

3.2.2.2 Step 2.2: Provide Documentation and Drawings

A lot of information about the study object is required, in particular for detailed analyses. Information sources of interest may include (e.g. see IAEA 2002):
• System layout drawings, including the relation to other systems and assets.
• System flow, logic, and control diagrams.
• Descriptions of normal and possible abnormal operations of the study object.
• Inventories of hazardous materials.
• Operation procedures and operator training material.
• Testing and maintenance procedures.
• Emergency procedures.
• Previous risk assessments of the same or similar systems.
• Descriptions of engineered safety systems (barriers) and safety support systems, including reliability assessments.
• Description of previous hazardous events and accidents in the study object.
• Feedback from experience with similar systems.
• Environmental impact assessments (if relevant).

The document control system fills an important role in keeping track of all the documentation that is used as input to the risk assessment. In system development projects, the design is developing continuously, and it is important to know which versions of documents have been used as basis for the analysis. 3.2.2.3

Step 2.3: Familiarization

When the study team has been established, it is important that the team members have access to all relevant information and documentation so they can become familiar with the study object and its operating context. As part of the familiarization, it may be necessary to revisit the previous substep:

• More information may be required because the delimitations of the study object are extended or because the information is incomplete.
• Details may be insufficient and have to be supplemented.
• There may be discrepancies in the documentation.
• Part of the information may be unclear and open to interpretation and needs to be discussed with designers/operators.

3.2.2.4 Step 2.4: Select Method

A number of analytical methods have been developed for risk analysis. Many factors influence the choice of method, including the following:

• In general, we need to choose a method that gives the answers required for the decisions to be made. This means that we need to understand the problem and the decisions in order to choose a suitable method.
• If several alternative methods are available, we will usually choose the method that requires the least work.
• The acceptance criteria may determine which methods can be used. If quantitative criteria are given, quantitative methods must be used. If we do not have quantitative criteria, qualitative methods usually suffice.
• Methods have been developed for special types of systems and for special types of problems. We therefore need to consider the system and problem type before choosing which method to apply.
• If limited information about the study object is available, it may be more relevant to choose a coarse method than a detailed method. In early project phases, coarse methods are therefore often used, switching to more detailed methods later in the project.
• Consider the availability of data before choosing a method. If no or little quantitative input data are available, performing a quantitative analysis may not be possible.
• Usually, there are time constraints on when the results need to be ready. This may place constraints on which method to choose.
• The size of the study object, and how complicated it is, will influence the choice of method.
• There may be authority requirements, and/or relevant guidelines and standards, that impose requirements and constraints on how the risk assessment should be performed.

An overview of the most relevant methods is given in Table 3.1, together with an indication of the phase(s) of a system's life in which they are suitable.

Table 3.1 Applicability of analysis methods in the various phases of a system's life.

Method (chapter)                    Early design   Design   Operation   Modification
Checklists (10)                          G            G         G            G
Preliminary hazard analysis (10)         G            B         M            M
HAZOP (10)                               M            G         M            G
SWIFT (10)                               G            M         M            G
FMECA (10)                               B            G         M            G
Fault tree analysis (11)                 B            G         G            G
Bayesian networks (11)                   M            G         G            G
Event tree analysis (12)                 B            G         G            G
Human reliability analysis (15)          B            G         G            G
Safety audits (7)                        B            B         G            B

G = good/suitable, M = medium/could be used, B = bad/not suitable.

3.2.2.5 Step 2.5: Select Data

A number of data sources are required for a risk assessment. We seldom have many alternative data sources to choose from, and often we struggle to find data at all, in particular data that are directly relevant. Types of data and data sources are discussed in Chapter 9. Factors such as quality, age, completeness, and relevance of the data should be considered when making the choice.

3.2.2.6 Step 2.6: Identify Relevant Assets

Before starting the risk analysis, we need to identify the assets that are relevant to consider. The assets to consider (e.g. people, environment, and reputation) are defined as part of the scope of the study, but we may need to define more precisely which assets may be harmed by accidents. An example may be that we have to specify whether we are looking only at employees at a plant or also at neighbors who may be affected.

3.2.3 Step 3: Identify Hazards and Initiating Events

The structure of step 3 is shown in Figure 3.6.

[Figure 3.6 Step 3: Identify hazards and initiating events. Inputs from step 2 (experience data, checklists, analytical methods, reliability data) support the substeps 3.1 identify and list generic hazards and events, 3.2 define specific and representative events, 3.3 identify causes of events, and 3.4 determine frequencies of events; the outputs (list of hazards, list of relevant/typical initiating events, causes and frequencies) are passed on to step 4.]

3.2.3.1 Step 3.1: Identify and List Generic Hazards and Events

Methods that may be used to perform this step are presented in Chapter 10. The aim of this step is to answer the question "What can go wrong?" Based on the terminology discussion in Chapter 2, we see that this question may be answered in different ways. We may identify:

• Sources of possible harm (i.e. hazards).
• Starting points for accident scenarios (i.e. initiating events).
• Events that are the "center" of the bow-tie (i.e. hazardous events).
• Events and conditions that trigger accidents (i.e. enabling events and conditions).

At an initial stage of a practical risk analysis, we often identify several or all of these when we try to answer the question. We may, for example, identify flammable materials, gas leaks, ignition, fire, and explosion as separate hazards/events to consider, but when we inspect these more closely, we can see that they can form different events in a sequence, that is, an accident scenario. The important point in this process is to identify as many events and conditions as possible. At this stage, we should not be too concerned about whether these are hazards, initiating events, enabling events, or enabling conditions.


Hazard and event identification may be supported by generic lists of hazards. Examples of such lists can be found in the standards related to machinery safety (ISO 12100 2010) and to major accident hazard management during design of offshore installations (ISO 17776 2016). When looking at these lists, we often find that they are also a mix of hazards, initiating events, and enabling events and conditions, as these terms are defined in this book. This may be confusing, but it should not stop us from using the lists for brainstorming purposes, because the structuring of the information can be done afterwards.

3.2.3.2 Step 3.2: Define Specific and Representative Events

In this step, we have to be more careful about what we include as events to be analyzed further. From the (often unorganized) list of identified hazards and events, we should specify a set of specific initiating or hazardous events to form the backbone of the risk analysis. Now, the definitions introduced in Chapter 2 help us in the screening process. We should not discard the hazards and events that are not included in the list of hazardous/initiating events, because they can be parts of accident scenarios or causes that we use in later steps of the risk assessment process.

From the generic lists, we may have identified "fire" as an event, but we now need to make this sufficiently specific, stating, for example, "fire in room X during daytime." The specificity of the events must be balanced with the resources required to perform the analysis. More events typically require more time and resources, so we try to define representative events that can cover a range of more or less similar events.

In most cases, a screening of the identified events may be performed as part of this step. Events that are considered to have a very low probability of occurrence, or that are expected to have no or negligible consequences, are usually screened out and not included for further analysis. This screening should be carefully documented.

3.2.3.3 Step 3.3: Identify Causes of Events

The purpose of the causal analysis is to identify the causes of the hazardous or initiating events that have been identified. How "deep" into the causes we should go depends on a number of factors, including:

• How detailed is the analysis? The level of detail should be determined in step 1. A more detailed analysis requires that we go into more detail on the causes.
• What causes can be influenced by the decision-makers? Causes outside what we can change are less relevant to study in detail, except to help us design a robust system that can withstand or compensate for these causes.
• Technical, human, and organizational causes should all be considered whenever relevant.

The causal analysis may form an important basis for the frequency analysis.


3.2.3.4 Step 3.4: Determine Frequencies of Events

This step is not part of all risk analyses, or it may sometimes be performed in a simplified manner. In some cases, the risk analysis is purely qualitative, and the description of causes from the causal analysis is then adequate as a description of risk, combined with a description of the consequences. In some cases, frequency or probability classes are used instead of assigning a numerical frequency to each event. Frequency classes may, for example, be specified as […]

If the cost of a risk reduction measure is more than d0 times larger than the benefits, it is considered grossly disproportionate and it should not be implemented. A disproportionality limit of, for example, d0 = 3 means that for a measure to be rejected, the costs should be more than three times larger than the benefits. There are no strict authoritative requirements on what limit to employ, but it is reasonable to use a higher value of d0 for high risk (i.e. close to the upper limit) than for lower risk (HSE 2001).

A challenge with the cost–benefit approach is that it raises the problem of expressing not only the costs but also the risk reduction benefits in monetary terms. This is a particularly sensitive issue when it comes to putting a value on human life, as was discussed earlier. To guide decision-making, some companies use internal criteria for the value of human life. An alternative to such explicit valuation of human life is simply to calculate the cost–benefit ratio for any risk reduction measure and to look out for any clearly unreasonable situations. If the value of life is not quantified, then resources, which also have value, cannot be allocated rationally to develop and implement countermeasures to protect life. For other types of consequences, such as environmental damage, costs also need to be calculated. Clean-up costs are sometimes used in connection with spills, but these costs do not reflect irreversible damage to species affected by spills.

A particular problem when a comparison of cost and benefit is done in this way is that the costs are deterministic, whereas the benefits are probabilistic. If we decide to implement a risk reduction measure, we know that there will be a cost associated with it. The benefit we gain is a reduction in the probability of an accident occurring, or a reduction in the consequences should an accident occur. Often, the probability of an accident is very low – regardless of whether the risk reduction measure is introduced or not – such that, most likely, the accident will not occur even if the measure is not introduced. The benefit is thus purely probabilistic.

Another aspect of cost–benefit assessment is the use of discounting of costs and benefits. In financial calculations, this is common practice and implies that future costs and benefits have a lower value than costs and benefits that we get today. In cost–benefit assessment, discounting is used sometimes, but not always.
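To make the comparison of deterministic costs and probabilistic, discounted benefits concrete, a minimal sketch is given below. All input values (risk reduction, value of a statistical life, discount rate, lifetime, and the disproportion factor d0) are assumptions chosen for illustration only and are not recommended values.

```python
def npv(amount_per_year, years, rate):
    """Net present value of a constant annual amount, discounted at `rate`."""
    return sum(amount_per_year / (1 + rate) ** t for t in range(1, years + 1))

# Assumed inputs (illustrative only)
delta_pll = 0.05 - 0.04      # assumed reduction in expected fatalities per year
vsl = 25e6                   # assumed value of a statistical life (monetary units)
investment = 1e6             # assumed one-off cost of the measure
operating_cost = 0.2e6       # assumed annual operating cost
lifetime = 20                # assumed number of years the measure is effective
rate = 0.04                  # assumed discount rate
d0 = 3                       # assumed gross-disproportion factor

benefit = npv(delta_pll * vsl, lifetime, rate)           # probabilistic benefit
cost = investment + npv(operating_cost, lifetime, rate)  # deterministic cost

print(f"Discounted benefit: {benefit:,.0f}")
print(f"Discounted cost:    {cost:,.0f}")
print("Grossly disproportionate (reject)?", cost > d0 * benefit)
```

With these assumed figures, the discounted benefit exceeds the discounted cost, so the measure would not be rejected as grossly disproportionate; changing the discount rate or d0 can change the conclusion, which is exactly the sensitivity discussed above.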


There are arguments both for and against using this approach, but perhaps the foremost argument can be tied to the occurrence of accidents now versus in the future. If we have a hypothetical situation where we know that an accident will occur, but we can choose whether it will occur one year from now or 10 years from now, everyone would undoubtedly choose the latter option. In this respect, we can say that future accidents have a lower "cost" than accidents today and that future risk therefore should be discounted.

To summarize, the ALARP principle states that money must be spent to reduce risk until it is reasonably low and must continue to be spent for as long as the cost of doing so is not "grossly disproportionate" and the risk is not negligible. If a "tolerable" level of risk can be reduced further at a reasonable cost and with little effort, it should be. At the same time, the ALARP principle recognizes that not all risk can be eliminated. Because it may not be practicable to take further action to reduce the risk or to identify the accidents that pose the risk, there will always be some residual risk of accidents.

Remark 5.1 (SFAIRP) In the United Kingdom Health and Safety at Work Act and in other regulations, the term SFAIRP is often used instead of ALARP. The two concepts are similar and can usually be interchanged. SFAIRP is an acronym for "so far as is reasonably practicable." ◽

5.3.2 The ALARA Principle

ALARA is an acronym for "as low as reasonably achievable," which is the risk acceptability framework in the Netherlands. The ALARA principle is conceptually similar to ALARP, but does not include any region of broad acceptability. Until 1993, the region of negligible risk was part of the Dutch policy. Subsequently, it has been abandoned on the grounds that all risk should be reduced as long as it is reasonable (Bottelberghs 2000).

ALARA has gained a somewhat different interpretation in practice. According to Ale (2005), it is common practice in the Netherlands to focus on complying with the upper limit rather than evaluating the reasonable practicality of further action. The unacceptable region in ALARA is, on the other hand, generally stricter than the one in ALARP, and the risk levels usually end up in the same range.

5.3.3 The GAMAB Principle

GAMAB is an acronym of the French expression globalement au moins aussi bon, which means "globally at least as good." The principle assumes that an acceptable solution already exists and that any new development should be at least as good as the existing solutions. The expression globalement (in total) is important here because it provides room for trade-offs. An individual aspect may therefore be worsened if it is overcompensated by an improvement elsewhere.

The GAMAB principle has been used in decision-making related to transportation systems in France, where new systems are required to offer a total risk level that is globally as low as that of any existing equivalent system. The principle is included in the railway RAMS standard (EN 50126 1999). A recent variant of GAMAB is GAME, which rephrases the requirement to "at least equivalent."

GAMAB is a technology-based criterion, which means that it uses existing technology as the point of reference. By applying this principle, the decision-maker is exempted from the task of formulating a risk acceptance criterion because it is already given by the present level of risk (e.g. see Johansen 2010).

5.3.4 The MEM Principle

Minimum endogenous mortality (MEM) is a German principle that uses the probability of dying of "technological facts" (e.g. sport, work, transport) as a reference level for risk acceptability. The principle requires that new or modified technological systems must not cause a significant increase of the individual risk for any person (Schäbe 2001). MEM is based on the fact that death rates vary with age and the assumption that a portion of each death rate is caused by technological systems (Nordland 2001).

Endogenous mortality means death due to internal or natural causes. In contrast, exogenous mortality is caused by the external influences of accidents. The endogenous mortality rate is the rate of deaths due to internal causes of a given population at a given time. Children between 5 and 15 years have the lowest endogenous mortality rate (the MEM), which in Western countries is about 2 × 10⁻⁴ per person per year on average (EN 50126 1999). This means that, on average, one in a group of 5000 children will die each year.

The MEM principle requires any technological system not to impose a significant increase in risk compared to this level of reference. According to the railway standard (EN 50126 1999), a "significant increase" is equal to 5% of MEM. This is deduced mathematically from the assumption that people are exposed to roughly 20 types of technological systems. Among these are technologies of transport, energy production, chemical industries, and leisure activities. Assuming that a total technological risk in the size of the MEM is acceptable, the contribution from each system is confined to

ΔAIR ≤ MEM ⋅ 5% = 10⁻⁵    (5.3)

A single technological system thus poses an unacceptable risk if it causes the individual risk to be increased by more than 5% of MEM. It should be emphasized that this criterion concerns the risk to any individual, not only the age group that provides the reference value. Unlike ALARP and GAMAB, MEM offers a universal quantitative risk acceptance criterion that is derived from the MEM rate. The MEM principle primarily relies on the technology principle as its basis, in that the criterion is related to existing risk levels.
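As a small illustration of how the criterion in (5.3) can be checked, the sketch below compares assumed individual risk contributions from some hypothetical systems with the 5% limit derived from MEM. The systems and numbers are invented for the example.

```python
MEM = 2e-4           # minimum endogenous mortality rate per year (EN 50126)
limit = 0.05 * MEM   # maximum acceptable increase in individual risk, 1e-5 per year

# Assumed (illustrative) increases in individual risk per year
systems = {"new tramway": 4e-6, "cable car": 8e-6, "proposed monorail": 2e-5}

for name, delta_air in systems.items():
    verdict = "acceptable" if delta_air <= limit else "not acceptable"
    print(f"{name}: increase {delta_air:.1e} per year -> {verdict} (limit {limit:.0e})")
```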

5.3.5 Societal Risk Criteria

In 2001, the UK HSE published the report "Reducing Risks, Protecting People" (HSE 2001), which includes a proposed societal risk criterion stating that, for any single industrial installation, "the risk of an accident causing the death of 50 or more people in a single event should be regarded as intolerable if the frequency is estimated to be more than one in five thousand per annum." This was the first time a widely consulted and published criterion of this type had existed.

5.3.6 The Precautionary Principle

The precautionary principle differs from the other approaches described in this chapter. Common to these approaches is that they are risk-based, which means that risk management is to rely on the numerical assessment of probabilities and potential harms (Klinke and Renn 2002). The precautionary principle is, in contrast, a precaution-based strategy for handling uncertain or highly vulnerable situations. The precaution-based approach does not provide any quantitative criterion to which the assessed risk level can be compared. Risk acceptability is instead a matter of proportionality between the severity of potential consequences and the measures that are taken in precaution. The original definition of the precautionary principle is given in principle 15 of the UN declaration from Rio in 1992 (UN 1992):

Definition 5.4 (Precautionary principle) Where there are threats of serious or irreversible damage, lack of full scientific certainty shall not be used as a reason for postponing cost-effective measures to prevent environmental degradation. ◽

The precautionary principle is invoked where:

• There is good reason to believe that harmful effects may occur to human, animal, or plant health, or to the environment.
• The level of scientific uncertainty about the consequences or frequencies is such that risk cannot be assessed with sufficient confidence to inform decision-making.

Invocation of the precautionary principle may be appropriate with respect to, for example, genetically modified plants, where there is good reason to believe that the modifications could lead to harmful effects on existing habitats, and there is a lack of knowledge about the relationship between the hazard and the consequences. A contrary example is that of the offshore industries, for which the hazards and consequences are generally well understood and conventional assessment techniques can be used to evaluate the risk by following a cautionary approach. Invocation of the precautionary principle is therefore unlikely to be appropriate offshore.

The European Commission has provided guidance on when and how to use the precautionary principle (EU 2000). Studies have shown that the practical implementation still varies significantly (Garnett and Parsons 2017). The decision on when to invoke the principle seems to be poorly defined, and there are indications that less evidence is required if the issue is related to harm to people compared to the environment.

Example 5.4 (Deliberate release into the environment of GMOs) EU Directive 2001/18/EC (EU 2001) is concerned with the deliberate release of genetically modified organisms (GMOs) into the environment. The directive states that risk related to release of GMOs needs to be controlled, both related to people and to ecosystems. It is therefore necessary to take preventive actions. The precautionary principle has been followed in the preparation of the directive, and it should also be taken into account in its implementation. Further, "No GMOs […] intended for deliberate release are to be considered for placing on the market without first having been subjected to satisfactory field testing…." There is no requirement for any indications that this may be hazardous to trigger field testing. All EU member states have to approve the field test, placing a high burden of proof on those who want to introduce GMOs. Further, monitoring is required after the GMO has been introduced, implying that if any of the EU members are not satisfied, the GMO may also be prohibited for use in the future. ◽

5.4 Risk Acceptance Criteria for Other Assets than Humans

The discussion above is primarily concerned with risk to humans. For other types of assets, other considerations may apply. Brief comments are given on the two most important: environmental and economic risk.

Environmental RACs are based on the principles discussed earlier in this chapter. The ALARP principle is commonly applied, and the precautionary principle was originally developed specifically for environmental applications. The underlying principles of the ALARP principle can be applied to environmental consequences, although the term "environment" covers an extremely wide range of vulnerable assets, from natural beauty, to individual species (be it animals, insects, fish, plants, etc.), to complete ecosystems. Comparing these and assessing the risk on a consistent level is clearly challenging.

Economic considerations are usually simple, and risk acceptance can in most cases be based solely on cost–benefit analysis. The option with the highest benefit is normally the preferred option. In some cases, consideration of worst-case consequences may be necessary. This is because the consequences in some cases may be catastrophic, in the sense that a company goes bankrupt if certain events occur. In such circumstances, this may override a criterion based purely on costs and benefits.

5.5 Closure

There has been an ongoing discussion in the scientific community regarding the adequacy of RAC for making decisions about risk. Among the critics is Terje Aven (e.g. see Aven 2007; Aven and Vinnem 2005), who questions the use of RAC from a decision theoretical and ethical perspective. Risk acceptability evaluations must be based on ALARP-considerations, he claims, and not static criteria that fail to address the relationships among risk, benefits, and improvement. In view of the fact that the benefits we gain from accepting a risk weigh so heavily, it may be reasonable to argue that we do not need a specific upper limit, but rely solely on weighing risk against benefits.

Johansen (2010) takes a more pragmatic perspective by suggesting that RAC should be evaluated according to their feasibility to the decision-maker and the extent to which they promote sound decisions. The first issue is primarily a matter of how we measure risk, and the ability of different measures to rank alternatives and provide precise recommendations with little uncertainty. The second is a matter of how the criteria are derived (e.g. MEM versus ALARP) and the extent to which they encourage improvement or preservation of status quo and allow for balancing risk with other considerations. Although useful for evaluating risk, it is recommended that practitioners interpret RAC as guiding benchmarks rather than rigid limits of acceptability.

5.6 Problems

5.1 Acceptable risk is dependent on context and values. Find examples where there are differences in what risk we accept. Why do you think there are differences?

5.2 In discussions about oil exploration offshore, there is often disagreement between environmentalists and oil companies about what the risk is and what can be tolerated. Identify reasons why they may have such different opinions.

5.3 What are the main differences between the GAMAB and the MEM principle?

5.4 Can you think of examples of hazardous activities that we still choose to do because they are useful to us? What is the risk and what are the benefits?

5.5 Three "pure" principles of risk acceptance are described in this chapter: equity, utility, and technology. List advantages and disadvantages of each of these principles if we apply just one of them for establishing RAC.

5.6 A risk analysis has been performed, and the risk has been calculated at 0.05 fatalities/yr. A risk reduction measure has been proposed, and if this is implemented, the risk will be reduced to 0.04 fatalities/yr (a 20% reduction). The cost of this risk reduction measure is 1 million in investment and 0.2 million per year in operating costs. The risk reduction measure will have an effect for 20 years. For cost–benefit purposes, VSL is set to 25 million. Perform a cost–benefit analysis and decide whether the risk reduction measure has a positive cost–benefit or not.

5.7 The results from a risk analysis will always be uncertain to a greater or smaller degree. For the case in the previous problem, the risk analyst has informed us that the initial risk estimate of 0.05 fatalities/yr is uncertain, and that the risk is likely to be somewhere in the range 0.01–0.08 fatalities/yr. How will this uncertainty affect the results of the cost–benefit analysis (assume that the reduction in risk is always 20%)? What does this tell us about decision-making based on results from risk analysis?

5.8 The precautionary principle is one possible approach to managing risk, particularly aimed at technology and developments that are new and where we do not necessarily know what the risk is. An example of a new technology is autonomous cars and other transport systems. What may the effect be on this type of new technology if we apply the precautionary principle in a strict sense?

5.9 Is GAMAB a useful principle for expressing what is acceptable risk if we introduce a completely new technology, e.g. autonomous transport systems? Are we willing to accept the same risk for a new system as an existing one? Find examples where this is not the case.

5.10 Consider a system development process and assume that you have identified a particular risk factor. Further, assume that you, based on an evaluation, decide to leave the risk factor as it is without implementing any risk reduction measure. Have you then, in fact, accepted the risk factor?

5.11 A new pesticide is proposed, based on a new and unproven mix of chemicals. Discuss the risk related to the introduction of this new pesticide in light of the precautionary principle. Formulate a set of questions to be answered before the pesticide is accepted for use.

5.12 List the main strengths and weaknesses of the GAMAB principle.

5.13 A company applies the following acceptance criterion for acute pollution events: If no traces or effects of the pollution can be observed five years after the pollution event, it is acceptable. Discuss this principle for acute pollution to the sea (e.g. an oil spill).

5.14 A hazardous fluid needs to be transported through a densely populated area on a regular basis. Identify factors that need to be considered to decide whether the risk related to this transport is acceptable.

References

Ale, B.J.M. (2005). Tolerable or acceptable: a comparison of risk regulation in the United Kingdom and in the Netherlands. Risk Analysis 25: 231–241.
Ashenfelter, O. (2005). Measuring Value of a Statistical Life: Problems and Prospects. Working paper 505. Princeton, NJ: Princeton University.
Aven, T. (2007). On the ethical justification for the use of risk acceptance criteria. Risk Analysis 27: 303–312.
Aven, T. and Vinnem, J.E. (2005). On the use of risk acceptance criteria in the offshore oil and gas industry. Reliability Engineering & System Safety 90: 15–24.
Bottelberghs, P.H. (2000). Risk analysis and safety policy developments in the Netherlands. Journal of Hazardous Materials 71: 59–84.
Cameron, R.F. and Willers, A. (2001). Use of risk assessment in the nuclear industry with specific reference to the Australian situation. Reliability Engineering & System Safety 74: 275–282.
CCPS (2009). Guidelines for Developing Quantitative Safety Risk Criteria. Hoboken, NJ: Wiley and Center for Chemical Process Safety, American Institute of Chemical Engineers.
CNCS (2009). Guidance on the Use of Deterministic and Probabilistic Criteria in Decision-Making for Class I Nuclear Facilities. Draft RD-152. Ottawa, Canada: Canadian Nuclear Safety Commission.
Court of Appeal (1949). Edwards vs. National Coal Board.
EN 50126 (1999). Railway Applications: The Specification and Demonstration of Reliability, Availability, Maintainability and Safety (RAMS). Brussels: European Norm.
EU (2000). Communication from the Commission on the Precautionary Principle. Brussels: Commission of the European Communities.
EU (2001). Directive 2001/18/EC of the European Parliament and of the Council of 12 March 2001 on the deliberate release into the environment of genetically modified organisms.
EU (2016). Directive (EU) 2016/798 of the European Parliament and of the Council of 11 May 2016 on railway safety.
Fischhoff, B., Lichtenstein, S., and Keeney, R.L. (1981). Acceptable Risk. Cambridge: Cambridge University Press.
Garnett, K. and Parsons, D. (2017). Multi-case review of the application of the precautionary principle in European Union law and case law. Risk Analysis 37 (3): 502–516.
HSE (1992). The Tolerability of Risk from Nuclear Power Stations. London: HMSO.
HSE (2001). Reducing Risks, Protecting People: HSE's Decision-Making Process. Norwich: HMSO.
Johansen, I.L. (2010). Foundations and Fallacies of Risk Acceptance Criteria. ROSS report 201001. Trondheim, Norway: Norwegian University of Science and Technology.
Klinke, A. and Renn, O. (2002). A new approach to risk evaluation and management: risk-based, precaution-based, and discourse-based strategies. Risk Analysis 22 (6): 1071–1094.
Nordland, O. (2001). When is risk acceptable? Presentation at the 19th International System Safety Conference, Huntsville, AL.
NS 5814 (2008). Requirements for Risk Assessment, Norwegian edn. Oslo, Norway: Standard Norge.
NSW (2011). Risk Criteria for Land Use Safety Planning: Hazardous Industry Planning Advisory Paper No. 4. Technical report. Sydney, Australia: New South Wales Department of Planning. ISBN 978-0-73475-872-9.
Pandey, M.D. and Nathwani, J.S. (2004). Life quality index for the estimation of societal willingness-to-pay for safety. Structural Safety 26: 181–199.
Royal Society (1992). Risk: Analysis, Perception and Management. Report of a Royal Society study group. London: Royal Society.
Schäbe, H. (2001). Different approaches for determination of tolerable hazard rates. In: Towards a Safer World (ESREL'01) (ed. E. Zio, M. Demichela, and N. Piccinini). Turin, Italy: Politecnico di Torino.
UN (1992). Report of the United Nations Conference on Environment and Development, Rio Declaration on Environment and Development. Tech. Rep. A/CONF.151/26/Rev. 1 (Vol. 1). New York: United Nations.

6 Measuring Risk

6.1 Introduction

Risk assessments are mainly performed to provide input to decision-making. The decision may, for example, concern modifications of equipment, allocation of risk reduction expenditures, or siting of a hazardous plant. Common to all is the need to specify what to measure and how to evaluate what has been measured. How we measure risk ultimately determines what information we can get from a risk analysis and the validity of our conclusions.

This chapter deals with how to quantify risk. The main focus is on measuring risk to humans, but some comments are also given on how to measure risk to other assets. Most of the chapter consists of presentation and discussion of various risk measures for expressing quantities of risk.

6.2 Risk Metrics

Recall from Chapter 2 that the term "risk" is used to express our uncertainty about what may happen in the future. Because the future is unknown, we express risk in probabilistic terms by using risk metrics. A risk metric has two parts:

(i) A clear definition and an explanation of a quantity that provides information about the risk level.
(ii) A measurement procedure and a formula that can be used to determine the numerical value of the quantity when data becomes available.

A risk metric is formally defined as follows:

Definition 6.1 (Risk metric) A quantity – and a measurement procedure and a method for determining the quantity – that provide information about the level of risk related to the study object in a specified future context. ◽


A risk metric is an estimator that can be used to quantify the risk level when data becomes available, as illustrated in Example 6.1.

Example 6.1 (Car accident fatalities) Our goal is to assess the (future) frequency 𝜆 of fatalities in car accidents on a new stretch of road. To define the risk metric, we must first clarify the two parts of the metric. Part (i) involves answering questions such as:

• Shall the frequency be measured per time unit, per car entering the road, or per kilometer driven?
• Shall all types of automobiles be considered, only family cars, or some other delimited group of cars?
• Shall the fatalities be delimited to people sitting inside the cars, or shall we also consider third persons on the road, such as pedestrians?
• Shall only immediate fatalities be considered, or shall also delayed fatalities caused by the accident be considered?
• Shall only ordinary traffic be considered, or shall we also consider extreme weather situations and terrorism?
• Shall frequency trends and variations over the day and over the year be considered?

Part (ii) involves setting up a procedure and a formula that can be used to quantify the frequency of fatalities when data becomes available. If the answers to the questions in part (i) conclude that an average frequency is sufficient, the frequency can be estimated as the observed number of fatalities divided by the accumulated exposure (e.g. time, kilometers). If we are to consider trends and variations, the procedure becomes more complicated. ◽

The result obtained by applying the risk metric to a dataset is called a risk measure. The procedure of using a risk metric to determine a risk measure is illustrated in Figure 6.1. A more thorough treatment of probabilistic metrics is provided by Eusgeld et al. (2008).

A risk assessment starts with the objectives of the study. These objectives must comply with the required inputs needed to make a specific decision. To meet the objectives, one or more risk metrics must be specified. To quantify the risk metric, input data must be provided through some sort of data collection exercise. In many cases, the availability of data influences which metric to use. The data collection gives a dataset, which combined with the risk metric gives a risk measure. The risk measure may be a single numerical value, a vector of numbers, or a quantified function. In the design phase of study objects, the data mainly come from other and similar systems, which may be more or less relevant for the study object. When determining the risk measure, foreseeable changes in the future operating context should also be taken into account.

The overall procedure for using a risk metric is shown in Figure 6.1, where the time axis and information sources are also indicated. The prediction of the risk measure in a future operating context is often an expert judgment based on the past performance (if available), generic data and knowledge, assumptions about the future operating context, and a risk analysis. Several risk metrics are defined in the following sections, and examples are provided that supplement the procedure in Figure 6.1.

Risk metrics are used to compare different options for the study object and to evaluate their (prospective) risk. Risk metrics and measures are further used to judge the acceptability of the study object (see Chapter 5). For this purpose, it is important that the specifications of the metric and the measurement procedure/method are clear and unambiguous.

The safety level observed in the past is called safety performance in Section 2.6.3, and a metric for this performance is called a safety performance metric.

Definition 6.2 (Safety performance metric) A quantity – and a measurement procedure and a method for determining the quantity – that when applied to recorded (i.e. past) performance data from a specific study object provides information about its past performance. ◽

A safety performance metric is a formula that is used to obtain a safety performance measure, which, in many cases, is a number. Safety performance metrics are used mainly for monitoring purposes, that is, to check whether the safety performance has been constant, increasing, or decreasing, and to compare the safety performance of one activity with that of other activities. Often, safety performance measures are computed annually and presented as line plots or histograms. As for risk metrics, it is important that the safety performance metric is clearly defined, to avoid comparing apples and pears.

A single metric does not give a complete picture of the risk. A comprehensive understanding of risk requires information about all the three questions in our definition of risk: what can happen, how likely is it, and what are the consequences. A risk metric is based on this information, but is only a very simple summary of all the information (MacKenzie 2014). This is important to understand. Further, it only provides an indication of one dimension of the risk (e.g. personnel risk). A single risk metric should therefore never be considered as giving a complete understanding of what the risk is, but it may still be useful for many purposes.
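As a minimal illustration of part (ii) of Example 6.1, the following sketch estimates an average fatality frequency as the observed number of fatalities divided by the accumulated exposure. The dataset is hypothetical and only meant to show the arithmetic.

```python
# Hypothetical records: observed fatalities and accumulated exposure per year
records = [
    {"year": 2016, "fatalities": 2, "million_vehicle_km": 310.0},
    {"year": 2017, "fatalities": 1, "million_vehicle_km": 325.0},
    {"year": 2018, "fatalities": 3, "million_vehicle_km": 298.0},
]

total_fatalities = sum(r["fatalities"] for r in records)
total_exposure = sum(r["million_vehicle_km"] for r in records)

# Average frequency estimate: observed fatalities divided by accumulated exposure
lambda_hat = total_fatalities / total_exposure
print(f"Estimated frequency: {lambda_hat:.4f} fatalities per million vehicle-km")
```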

[Figure 6.1 The application of a risk metric and a safety performance metric. Generic data and knowledge, past performance, and a risk analysis are combined through the risk metric – together with the assumed future operating context – to give a risk measure for the future, whereas the safety performance metric applied to past performance gives a safety performance measure; these elements are placed along a time axis running from past, through now, to future.]

6.3 Measuring Risk to People

Risk to people may be expressed in many ways, and the expressions may be divided into two distinct groups:

Individual risk expresses the risk that an individual person will be exposed to during a specific time period (usually, one year). Individual risk is most often addressed in terms of a hypothetical or statistical person, who is an individual with some defined, fixed relationship to a specified hazard. This may be the most exposed person, for example a train driver, or a person with some assumed pattern of life. The analyst usually has to define a number of hypothetical persons to ensure that the entire exposed population is addressed. Individual risk does not depend on the number of people who are exposed to the hazard.

Group risk expresses the risk that will be experienced by a group of people. When common citizens are exposed, the group risk is sometimes called societal risk. We prefer the term group risk, both when the group members are employees in a specific company and when they are common citizens. The group risk is found by combining individual risk levels and the number of people at risk, that is, the population being exposed.

For an individual, it does not matter much whether she is the only person exposed to risk, or there is a large group of people involved. From this point of view, individual risk is more important than group risk. For politicians having to prioritize resources for reducing risk in society, group risk may be more relevant. If 100 people are killed each year due to a particular type of hazard, it is natural that this is given priority over something that kills 10 people each year. Individual risk and group risk may therefore be more or less relevant for different types of decisions.

Several metrics for individual and group risk are presented in the following sections. The presentation is mainly delimited to fatalities.

6.3.1 Potential Loss of Life

The potential loss of life (PLL) is one of the most common metrics for group risk, and is defined as follows:


Definition 6.3 (Potential loss of life, PLL) The (statistically) expected number of fatalities within a specified population due to a specified activity or within a specified area per annum. ◽ To use this metric, we must first clearly define what we are looking at. This includes delimiting the population – who are members of the population and who are not. If the area is a plant, what is the population we should look at? Do we consider only full-time employees, or do we also include service people who come and go, and who are not employed by the plant? We further need to define what we mean by a fatality. Is it delimited to fatalities directly caused by an accident, or do we also include deaths caused by a substandard working environment? For some study objects, such as the railways, third person suicide is a recurrent problem. In some cases, it may be difficult to decide whether a fatality is caused by an accident or is a suicide. How should these events be treated in the risk metric? Observe that the number of persons in the population (the size of the population) is not part of the definition, neither is the time they are exposed to hazards. It is therefore not possible to use PLL to compare the risk in different populations. PLL is one of the most simple risk metrics for group risk. It does not distinguish between an accident that causes 100 deaths and 100 accidents each of which causes one death over the same period of time. As such, PLL fails to reflect the contrast between society’s strong reaction to rare but major accidents and its quiet tolerance of the many small accidents that occur frequently (Hirst 1998). In practice, the same applies to many of the risk metrics that express risk through just one number. PLL is also known as the annual fatality rate (AFR).

6.3.1.1

PLL for a Specified Population

A common way of using PLL is to consider a specific population, such as the employees in a company, all passengers traveling by train, or all people living in a country. PLL may also be calculated for a single or a limited set of hazards or events. An example of this can be PLL for people living in Norway in the age group 25–40 years who are killed in car accidents each year.

Consider a population of n persons that are exposed to a set A₁, A₂, …, Aₘ of potential initiating events that may cause fatalities in the population. The frequencies of these initiating events have been determined through risk analysis to be 𝜆₁, 𝜆₂, …, 𝜆ₘ per year, respectively. Assume that the n persons are exposed to hazards independent of each other. Further, assume that any single person in the population will be killed due to initiating event Aᵢ with probability pᵢ, for i = 1, 2, …, m. When n, m, pᵢ, and 𝜆ᵢ, for i = 1, 2, …, m are known, the PLL can be calculated as follows:

PLL = n ∑_{i=1}^{m} 𝜆ᵢ pᵢ    (6.1)

According to Appendix A, the number of initiating events Aᵢ can be modeled as a homogeneous Poisson process, Po(𝜆ᵢt). The mean number of initiating events Aᵢ in an interval (0, t) is hence 𝜆ᵢt. Because the persons in the population are exposed to hazards independent of each other, we have a binomial situation (see Appendix A), and the mean (expected) number of fatalities in (0, t) caused by initiating event Aᵢ is n𝜆ᵢtpᵢ. Because there are m different initiating events, we add the contributions from each of them – and because the time considered is one year and all quantities are given per year, t = 1, we may skip t in the expression to obtain (6.1).
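A minimal sketch of how (6.1) can be evaluated is shown below. The initiating events, their frequencies, and the fatality probabilities are invented for illustration only.

```python
# Assumed (illustrative) initiating events:
# (name, frequency per year, probability that a given exposed person is killed
#  given that the event occurs)
events = [
    ("process fire",  0.10,  1e-4),
    ("toxic release", 0.02,  5e-4),
    ("explosion",     0.005, 2e-3),
]

n_exposed = 120  # assumed number of persons in the exposed population

# PLL = n * sum(lambda_i * p_i), cf. (6.1)
pll = n_exposed * sum(freq * p_fat for _, freq, p_fat in events)
print(f"PLL = {pll:.2e} expected fatalities per year")
```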

6.3.1.2 PLL as Safety Performance Metric

The risk metric PLL is also used as a safety performance metric that is applied to a recorded dataset from a past period of time. The corresponding safety performance is an estimate of the risk metric and is calculated as follows:

PLL∗ = Number of observed fatalities in a specified population or area per annum    (6.2)

Observe that we add an asterisk (∗) to PLL to point out that we now have a safety performance metric and not a risk metric. PLL∗ may be plotted as a function of the year and used to check whether there is a trend in the safety level. Observe that PLL∗ is often an unstable estimate because it is strongly influenced by major accidents.

Example 6.2 (PLL for selected occupations in Norway) The Norwegian Labour Inspection Authority collects data related to all occupational accidents in Norway and classifies the data into different categories. The PLL∗ for the various occupations can then be calculated for each year. Table 6.1 provides the average PLL∗ for the period 2013–2017 for some selected occupations. These numbers show that the construction industry in Norway has recorded the highest PLL∗ over the period considered.

A property of PLL is that it does not take into account the size of the population being considered. In this case, the number of construction workers is more than three times the number of agricultural workers. Even if the number of fatalities is higher for the construction industry, the risk that each individual agricultural worker is exposed to is thus higher compared to construction workers. This shows that PLL∗ is not a suitable measure for all types of comparisons. ◽
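In the spirit of Example 6.2, the sketch below computes PLL∗ per year from recorded fatality counts, cf. (6.2), and the average over a period. The counts are hypothetical.

```python
# Hypothetical recorded fatality counts for one occupational group
fatalities_per_year = {2013: 5, 2014: 8, 2015: 6, 2016: 9, 2017: 7}

# PLL* for each year is simply the observed number of fatalities, cf. (6.2)
average_pll_star = sum(fatalities_per_year.values()) / len(fatalities_per_year)
print(f"Average PLL* over the period: {average_pll_star:.1f} fatalities per year")
```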


Table 6.1 PLL∗ for some selected types of occupations in Norway, based on the average number of fatalities in 2013–2017.

Type of occupation              PLL∗
Agriculture                     6.2
Transport and communication     6.6
Construction                    8.2
Health and social services      0.8

Source: The Norwegian Labour Inspection Authority (2018).

6.3.2 Average Individual Risk

The individual risk (IR) to a specified person is defined as follows:

Definition 6.4 (Individual risk, IR) The probability that an individual person is killed during one year. ◽

The individual risk is most often related to a specified set of hazards and determined as an average value for a specified population of individuals for a period of one year. The risk is therefore often presented as an average individual risk (AIR) per annum. The Center for Chemical Process Safety (CCPS) defines three different measures for the average individual risk (AIR):¹

AIRExposed is the average individual risk in an exposed population and is calculated as

AIRExposed = (The expected number of fatalities in the exposed population) / (The number of exposed persons)    (6.3)

Normally, only fatalities during working hours are included.

AIRTotal is the average individual risk in a total population, without regard to whether or not all people in that population are actually exposed to the risk. The AIRTotal is calculated as

AIRTotal = (The expected number of fatalities in the total population) / (The total number of persons in the population)    (6.4)

In most cases, only fatalities caused by the specified hazards are included, but the same approach may also be used to present the total average fatality rate in a population.

AIRHour is the average individual risk per hour of exposure to a specified set of hazards and may be calculated as

AIRHour = (The expected number of fatalities in the exposed population) / (The number of exposed person-hours)    (6.5)

¹ See the CCPS Process Safety Glossary – https://www.aiche.org/ccps/resources/glossary.

We may also use another time unit and calculate the AIR per work operation or for a specific time period (e.g. a workday). Observe that both AIRExposed and AIRTotal are determined relative to the number of persons that are exposed, without regard for how long each person is exposed to the specified hazards.

The three different formulations of AIR remind us that it is important to understand which definition of AIR is used when the numbers are calculated. For the same situation and the same risk, the three risk metrics will provide very different results. Unfortunately, it is not always easy to find out exactly how calculations have been done in all situations.

To determine the value of the risk measure, we have to predict the expected number of fatalities and the number of exposed persons (or person-hours) for the assumed future operating context (as shown in Figure 6.1). When data are available for a past period, the corresponding safety performance measures can be calculated by the formulas above by replacing "the expected number of fatalities" with "the observed number of fatalities." The safety performance metric AIR∗ᵢ gives an indication of the average individual safety level in the specified group of individuals in the given past time period (for i = Exposed, Total, Hour).

AIR is obviously related to PLL (see Definition 6.3). Because PLL depends on how the number of fatalities is calculated, we introduce PLLᵢ (for i = Exposed, Total, Hour). Hence,

AIRExposed = PLLExposed / (The number of exposed persons) = PLLExposed / nExposed    (6.6)

where PLLExposed is the number of fatalities during working hours in a specified exposed population of nExposed persons. This may also be written as PLLExposed = AIRExposed nExposed. In the same way,

AIRTotal = PLLTotal / (The total number of persons in the population) = PLLTotal / nTotal    (6.7)

where PLLTotal is the number of fatalities in the total population of nTotal persons. This may be written as PLLTotal = AIRTotal nTotal. Similarly,

AIRHour = PLLHour / (The number of exposed person-hours) = PLLHour / nHour    (6.8)

where PLLHour is the number of fatalities in the population exposed to the hazards. This may be written as PLLHour = AIRHour nHour. AIR is also called individual risk (IR) or individual risk per annum (IRPA).

Example 6.3 (Average individual risk in Norway – traffic accidents) In 2017, 107 persons were killed in traffic accidents in Norway out of a total population of about 5 270 000 people. The safety performance measure AIR∗Total of being killed in a traffic accident, averaged over the total population in 2017, was

AIR∗Total = PLL∗Total / nTotal = 107 / 5 270 000 ≈ 2 × 10⁻⁵

This means that in a group of 100 000 persons chosen at random, on average two were killed in a traffic accident in 2017. Observe that this estimate is based on all people in Norway, without regard to how much they are exposed to this hazard. A detailed analysis of the data shows that AIR∗Total is dependent on, among others, gender. A total of 76 of the fatalities were males and 31 were females. The male population was approximately 2 650 000 people and the female 2 620 000 people. This means that the AIR∗Total for men in 2017 was about 2.9 × 10⁻⁵, whereas the AIR∗Total for females was 1.2 × 10⁻⁵. Similar differences can also be observed if we look at different age groups. Using AIR∗Total can therefore "hide" large differences within the population that is being considered. ◽

Example 6.4 (Total average individual risk) Consider a person who is leading a very boring life and spends 20% of her time each year at work, 75% at home, and 5% traveling to and from work by car. Let us assume that we have calculated her AIR∗Total as if she were spending her entire time at work (AIR∗W), at home (AIR∗H), and driving (AIR∗D). The AIR∗Total can then be determined as follows:

AIR∗Total = 0.20 × AIR∗W + 0.75 × AIR∗H + 0.05 × AIR∗D ◽


Example 6.5 (Average individual risk for air travel) Consider a person who is traveling by air between two cities m times a year. Each flight takes t hours. The frequency of accidents on this stretch has been estimated to be 𝜆 per hour, and the probability of being killed if you are on board is pfat. The AIRHour for this person for this activity per exposed hour is then

AIRHour = m𝜆tpfat / (mt) = 𝜆pfat (per flight hour)    (6.9)

The arguments used to set up this formula are very similar to those used to obtain (6.1). ◽

Example 6.6 (Individual risk on cargo ships) Consider the individual risk for crew members on cargo ships. Some hazards are relevant only during ordinary working hours, whereas other hazards are relevant 24 hours a day. Assume that we have data from an accumulated number of 𝜏s = 29 500 cargo ship-years (one ship-year is one ship during one year). Each ship has an average number of crew members (persons on board, POB) = 25, and each crew member spends on average a = 50% of the year onboard. Further, assume that n = 490 crew members have been killed during this exposure time. This corresponds to PLL∗Exposed for this group of people. The safety performance during this period is hence

AIR∗Exposed = (PLL∗Exposed / (𝜏s ⋅ POB)) ⋅ a = (490 / (29 500 × 25)) × 0.5 ≈ 3.3 × 10⁻⁴

Observe that this safety performance measure does not distinguish between single fatalities and multiple fatalities in major accidents. This is because AIR∗Exposed is an individual measure that is not concerned with how many people are killed at the same time. Further, it expresses the safety level only in terms of a single number, limiting the information that is possible to present. ◽
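The sketch below reproduces the calculations in Examples 6.3 and 6.6 to show how the AIR formulas are applied to data. Only numbers given in the two examples are used.

```python
# Example 6.3: AIR*_Total from national statistics
fatalities, population = 107, 5_270_000
air_total = fatalities / population
print(f"AIR*_Total (traffic, 2017): {air_total:.1e} per year")

# Example 6.6: AIR*_Exposed for cargo-ship crew members
crew_fatalities = 490      # recorded fatalities in the period
ship_years = 29_500        # accumulated ship-years
pob = 25                   # average persons on board per ship
onboard_fraction = 0.5     # fraction of the year each crew member spends on board

air_exposed = crew_fatalities / (ship_years * pob) * onboard_fraction
print(f"AIR*_Exposed (cargo-ship crew): {air_exposed:.1e} per year")
```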

6.3.2.1 PLL Within an Area

Reconsider a hazardous installation to which several people are exposed to a set of hazards, as shown in Figure 6.2. Let AIRExposed(x, y) denote the individual risk per annum of an individual who is located at the coordinates (x, y) on a map, and let m(x, y) denote the population density as a function of the coordinates (x, y). Assume that the presence of the persons at the location and their vulnerability to the risk have been incorporated into the estimate of AIRExposed(x, y). The expected number of fatalities per year, the PLLExposed, within a specified area A can then be determined by

PLLExposed,A = ∫∫_A AIRExposed(x, y) m(x, y) dx dy    (6.10)
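In practice, (6.10) is usually evaluated numerically over a grid. The sketch below illustrates the idea with an assumed risk field that decays with distance from the installation and an assumed uniform population density; both functions are purely illustrative.

```python
import math

def air_exposed(x, y):
    """Assumed individual risk per year at location (x, y) in meters (illustrative)."""
    distance = math.hypot(x, y)
    return 1e-4 * math.exp(-distance / 200.0)

def population_density(x, y):
    """Assumed population density, persons per square meter (illustrative)."""
    return 2e-4

# Approximate the double integral in (6.10) with a Riemann sum over a square grid
step = 10.0  # grid cell size in meters
coords = [-1000.0 + i * step for i in range(int(2000 / step))]

pll_area = 0.0
for x in coords:
    for y in coords:
        pll_area += air_exposed(x, y) * population_density(x, y) * step * step

print(f"PLL within the area: {pll_area:.2e} expected fatalities per year")
```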

[Figure 6.2 Risk contour plot where several people are exposed to hazards. The contours show individual risk levels of 10⁻⁵, 10⁻⁶, and 10⁻⁷ per year at increasing distance from the hazardous installation.]

6.3.3 Deaths per Million

The number of deaths per million (DPM) people belonging to a specified group is sometimes used as a safety performance metric. This is an alternative to AIR and is directly proportional to AIR:

DPM∗ = AIR∗ᵢ × 10⁶

where i = Exposed or Total, depending on whether the population considered is delimited to those who are exposed, or we consider the whole population. Table 6.2 presents the annual number of deaths per million as a function of age for some age groups in the United Kingdom. The data cover both accident fatalities and fatalities due to illness.

Table 6.2 Annual deaths per million for various age groups in the United Kingdom, based on deaths in 1999.

Population group     Risk per annum   Deaths per million (DPM∗)
Entire population    1 in 97          10 309
Men aged 65–74       1 in 36          27 777
Women aged 65–74     1 in 51          19 607
Men aged 35–44       1 in 637          1 569
Women aged 35–44     1 in 988          1 012
Boys aged 5–14       1 in 6907           145
Girls aged 5–14      1 in 8696           115

Source: HSE (2001a).


If we assume that the risk level in the United Kingdom has not changed since 1999 and pick a person in the United Kingdom at random, without regard to age and gender, the figures in Table 6.2 indicate that she or he will die during the next year with a probability approximately equal to 10 309/10^6 ≈ 1.03%. Observe that we have to be very careful when using this type of historical data for predictions because the health situation in a country may change over time and the distribution between the various age groups will vary.

Individual safety performance statistics are published annually for most countries. These statistics are usually split into various categories, and can, for example, be used to estimate:

• The annual probability that an average woman between 50 and 60 years will die due to cancer.
• The annual probability that an average employee in the construction industry will be killed in an occupational accident.

Individual risk varies greatly from activity to activity and from industry to industry. Some overall figures from the UK are presented in Table 6.3. Observe that Table 6.3 does not distinguish between the deaths of young and old people. Further, deaths that come immediately after the accident and deaths that follow painful and debilitating disease are treated as equivalent.

Table 6.3 Annual risk of death from industrial accidents to employees for various industry sectors.

Industry sector                                                  DPM∗    AIR∗
Fatalities to employees                                            8      8 × 10^−6
Fatalities to self-employed                                       20     20 × 10^−6
Mining and quarrying of energy producing materials               109    109 × 10^−6
Construction                                                      59     59 × 10^−6
Extractive and utility supply industries                          50     50 × 10^−6
Agriculture, hunting, forestry, and fishing (not sea fishing)     58     58 × 10^−6
Manufacture of basic metals and fabricated metal products         29     29 × 10^−6
Manufacturing industry                                            13     13 × 10^−6
Manufacture of electrical and optical equipment                    2      2 × 10^−6
Service industry                                                   3      3 × 10^−6

Source: Data adapted from HSE (2001b).


When presenting risk figures, such as in Table 6.3, it should always be specified to whom or to what group of people the figures apply. It would, for example, be meaningless to say that the national average risk of being killed during hang-gliding is one in 20 million. What we are really interested in is the risk to the people who actually practice hang-gliding.

6.3.4 Location-Specific Individual Risk

Consider a plant that is storing and/or using hazardous materials. If these materials are released and cause, for example, a fire or an explosion, people who are present in the neighborhood of the plant may be harmed. Figure 6.2 indicates that the individual risk depends on factors such as the following: (i) the topography around the source of the hazards, (ii) the distance from the hazard source, (iii) the dominant wind directions, and thereby (iv) in which direction from the hazard source the exposed persons are located. This makes it of interest to consider the location-specific individual risk (LSIR), defined as follows:

Definition 6.5 (Location-specific individual risk, LSIR) The probability that a hypothetical person, who is unprotected and always present at a particular location, is killed in an accident during a specified year. ◽

LSIR is also called localized individual risk per annum (LIRA).

6.3.4.1 LSIR is a Property of the Location

In the definition of LSIR, it is assumed that a hypothetical person is always present at a given location. Because LSIR remains unchanged irrespective of whether a person is at the spot when an accident occurs, it can rightfully be claimed that LSIR is a geographic rather than an individual risk measure. Due to its location-specific properties, LSIR is mainly used for land-use planning related to hazardous installations (see, e.g. Laheij et al. 2000). Requirements for land-use planning are included as Article 13 in the EU Seveso III directive (EU 2012), and a guideline related to these requirements has been developed (EU-JRC 2006).

6.3.4.2 LSIR for Different Types of Scenarios

Consider a hazardous installation that may give rise to m independent and different initiating events E1 , E2 , … , Em and let 𝜆i denote the frequency of the occurrence of initiating event Ei , for i = 1, 2, … , m. Assume that a hypothetical unprotected person is permanently present at a location with coordinates (x, y) on a map. Based on an analysis of the stresses and doses of toxic gases to which the person will be exposed during an accident of type Ei , we may estimate the probability that she will be killed, Pr(Unprotected person located at (x, y) is killed|Event Ei has occurred)


Let us for brevity denote this probability by Pr(Fatality at (x, y) | Ei). The LSIR at location (x, y) due to event Ei is now

LSIRi(x, y) = 𝜆i Pr(Fatality at (x, y) | Ei)    for i = 1, 2, … , m

If we assume that the risks from the m events can be added, the total LSIR at location (x, y) due to the hazardous installation is

LSIR(x, y) = ∑_{i=1}^{m} 𝜆i Pr(Fatality at (x, y) | Ei)    (6.11)
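A minimal sketch of Eq. (6.11) for one map location is shown below. The three initiating events, their frequencies, and the conditional fatality probabilities are assumed numbers used only to illustrate the summation.

```python
# Sketch of Eq. (6.11): LSIR at a single location as the sum over
# initiating events of frequency times conditional fatality probability.
events = {
    "E1_jet_fire":      {"freq_per_year": 2.0e-4, "p_fatality": 0.30},
    "E2_explosion":     {"freq_per_year": 5.0e-5, "p_fatality": 0.60},
    "E3_toxic_release": {"freq_per_year": 1.0e-4, "p_fatality": 0.10},
}

lsir = sum(e["freq_per_year"] * e["p_fatality"] for e in events.values())
print(f"LSIR at this location: {lsir:.1e} per year")   # 1.0e-4 per year
```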

6.3.5 Individual-Specific Individual Risk

The definition of LSIR(x, y) may be modified to take into account the proportion a of time the individual is present at location (x, y). This new measure is called the individual-specific individual risk [ISIR(x, y)] and defined as follows:

Definition 6.6 (Individual-specific individual risk, ISIR) The probability that a hypothetical person, who is unprotected and who is working at a particular location (a specific number of hours per year), is killed in an accident during one specified year. ◽

This definition may be expressed as follows:

ISIR(x, y) = ∑_{i=1}^{m} 𝜆i Pr(Fatality at (x, y) | Ei) a    (6.12)

When it is expected that an individual is more or less likely to be killed (there may be differences between age groups for example), this may also be taken into account. Example 6.7 (ISIR with reduced exposure) Consider an office building located at the coordinates (x, y) near a hazardous installation. At this location, the location-specific individual risk LSIR(x, y) has been determined. An individual is present in the office building for approximately 1500 hours per year. The probability that she is present in the office building at a random point in time is hence 1500∕8760 ≈ 17%. The building serves as a protection layer, and the individual, therefore, has a lower risk while she is inside the building. If we assume that the accident probability remains constant over the day, the individual risk of the person due to the hazardous installation is approximately ISIR ≈ LSIR(x, y) × 0.17. If the probability of the accident varies over the day, and is higher during normal working hours, a more thorough analysis is required. ◽
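The scaling in Example 6.7 can be written as a few lines of Python. The LSIR value below is an assumed input, and the sketch deliberately ignores the additional protection offered by the building, as noted in the example.

```python
# Sketch of Example 6.7 / Eq. (6.12): scaling LSIR by the fraction of time
# the person is present. The LSIR value is an assumption; the protective
# effect of the building is not modeled here.
HOURS_PER_YEAR = 8760

lsir = 2.0e-5                          # assumed LSIR at the office location, per year
hours_present = 1500                   # hours per year spent in the building
presence_fraction = hours_present / HOURS_PER_YEAR   # ~0.17

isir = lsir * presence_fraction
print(f"ISIR ≈ {isir:.1e} per year")
```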


Figure 6.3 Risk contour plot example.

6.3.6 Risk Contour Plots

The geographical feature of LSIR may be used to illustrate the risk in the vicinity of a hazardous installation – by a risk contour plot, as shown in Figure 6.3. The contours show how LSIR varies in an area around a hazardous installation, and hence, the risk to which an unprotected individual would be exposed if she was present continuously at a given location. The risk contour is drawn on a map of the area around the hazardous installation.

The geographical area must usually be divided into smaller areas to make the calculation manageable. The LSIR must then be calculated for each area by adding up the hazards that may affect this area. The LSIR is usually split into levels, such as 10^−5, 10^−6, 10^−7, and so on. A 10^−5 iso-risk contour is then drawn around the areas with LSIR ≥ 10^−5. Thereafter, a similar 10^−6 contour is drawn around the areas with LSIR ≥ 10^−6, and so on. If an average unprotected individual is permanently present at a 10^−5 iso-risk contour, she will have a probability of being killed due to accidents in the hazardous installation that equals 10^−5 per year. The distance from the hazardous installation to an iso-risk contour depends on the type of hazards, the topography, the dominant wind direction, and so on. Because risk contours may be tedious to calculate by hand, a number of computer programs have been developed for this purpose.

Risk contours do not take into account any actions that people might take to escape from an event, or the actual time that people are present. Also observe that the number of people who are exposed to the risk is not considered. The risk contour is simply an indication of how hazardous the area is. It is typically used in layout and siting reviews of industrial plants (land-use planning). Risk contours may also be established for areas around airports and roads where hazardous goods are transported.
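As a rough numerical sketch of the contour idea, the snippet below classifies grid cells into the iso-risk bands used in the text. The LSIR field is an invented function of distance, not a result from any real analysis.

```python
import numpy as np

# Sketch: area within each iso-risk contour, based on an assumed LSIR field.
x = np.linspace(-1000, 1000, 201)               # meters from the installation
y = np.linspace(-1000, 1000, 201)
X, Y = np.meshgrid(x, y)
lsir = 1e-4 * np.exp(-np.hypot(X, Y) / 120.0)   # assumed decay with distance

cell_area = (x[1] - x[0]) * (y[1] - y[0])        # m^2 per grid cell
for level in (1e-5, 1e-6, 1e-7):
    area_km2 = np.count_nonzero(lsir >= level) * cell_area / 1e6
    print(f"Area inside the {level:.0e} contour: {area_km2:.2f} km^2")
```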


Table 6.4 Individual risk criteria for various installations.

Exposure type                                                                   Risk level
Hospitals, schools, child-care facilities, and nursing homes                   Less than 5 × 10^−7
Residential developments and places of continuous occupation (hotels/resorts)  Less than 1 × 10^−6
Commercial developments, including offices, retail centers, warehouses with
showrooms, restaurants, and entertainment centers                              Less than 5 × 10^−6
Sporting complexes and active open space areas                                  Less than 1 × 10^−5
Industrial sites                                                                Less than 5 × 10^−5

Source: Data from various Australian authorities.

Example 6.8 (Iso-risk contours in the Netherlands) In the Netherlands, no new dwellings or vulnerable installations, such as kindergartens or hospitals, are allowed within the 10^−6 (per annum) iso-risk contour. Less vulnerable installations, such as offices, are allowed in the zone between the 10^−5 and the 10^−6 (per annum) iso-risk contours (Laheij et al. 2000). ◽

Several countries have defined maximum LSIR values for different types of installations. The data in Table 6.4 are compiled from various Australian authorities and are used for land-use planning.

6.3.7 Fatal Accident Rate

The fatal accident rate (FAR) was introduced by ICI as a risk metric for occupational risk in the UK chemical industry. The FAR is the most common risk metric for occupational risk in Europe. FAR is defined as follows:

Definition 6.7 (Fatal accident rate, FAR) The expected number of fatalities in a defined population per 100 million hours of exposure. ◽

The FAR value is calculated as follows:

FAR = (Expected no. of fatalities / No. of hours exposed to risk) × 10^8    (6.13)

FAR may be given the following interpretation: For a working year of 2000 hours and a working life of 50 years, 10^8 work hours correspond to 1000 working lives. FAR is then the estimated number of these 1000 persons who will die in a fatal accident during their working lives. Observe that the expected number of fatalities corresponds to the PLL for the group that is being considered.
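A small sketch of Eq. (6.13) is shown below. The workforce size and the PLL value are assumed inputs used only to show the arithmetic.

```python
# Sketch of Eq. (6.13): FAR from expected fatalities and exposure hours.
def far(expected_fatalities: float, exposure_hours: float) -> float:
    """Fatal accident rate per 10^8 exposure hours."""
    return expected_fatalities / exposure_hours * 1e8

# Assumed example: 400 persons working 2000 hours per year,
# with an estimated PLL of 0.004 fatalities per year.
exposure = 400 * 2000                               # hours per year
print(f"FAR = {far(0.004, exposure):.1f} per 10^8 hours")   # 0.5
```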


Table 6.5 Experienced FAR∗ values for Norway for the period 2013–2017.

Industry                              FAR∗ (fatalities per 10^8 working hours)
Agriculture, forestry, and fishing    3.1
Raw material extraction               1.6
Industry, manufacturing               0.5
Building and construction             1.4
Transport and warehousing             1.7
Private and public services           0.7
Health services                       0.1
Total                                 2.0

Source: Data from the Norwegian Labour Inspection Authority and Statistics Norway, 2018.

FAR is useful for comparing the average risk in different occupations and activities, but it is not always easy to clearly define the exposure time, especially for part-time activities or subpopulations at risk. In industrial risk analyses, the number of exposure hours is typically defined as the number of working hours. For offshore oil and gas risk analyses, the FAR value is sometimes calculated based on the actual working hours, and sometimes on the total hours the personnel spend on the installation. The corresponding safety performance metric, FAR∗, is calculated as

FAR∗ = (Observed no. of fatalities / No. of hours exposed to risk) × 10^8
     = PLL∗ / (No. of hours exposed to risk) × 10^8    (6.14)

In Table 6.5, selected FAR∗ values for Norway are shown, based on official statistics. Similar data from the United Kingdom are presented in Table 6.6. Observe that the data in Tables 6.5 and 6.6 are difficult to compare because the industry or activity groups are not defined in the same way. Example 6.9 (Risk of rock climbing) In Table 6.6, FAR values for both work and specific activities are included. We observe that rock climbing is about 1000 times more dangerous than factory work. This illustrates that FAR is not necessarily well suited for expressing risk associated with activities that are relatively short term compared to normal work. According to HSE (2001b), the risk associated with rock climbing can also be expressed in terms of probability of fatality per climb. This value is 1 in 320 000 climbs. If we assume that an “average”


Table 6.6 Experienced FAR∗ values for the United Kingdom.

Activity/industry                    FAR∗ (fatalities per 10^8 hours of exposure)
Factory work (average)                  4
Construction (average)                  5
Construction, high-rise erectors       70
Manufacturing industry (all)            1
Oil and gas extraction                 15
Travel by car                          30
Travel by air (fixed wing)             40
Travel by helicopter                  500
Rock climbing while on rock face     4000

Source: Data from Hambly (1992).

climber performs this activity 10 times each year, we arrive at a fatality probability of 1 in 32 000 or an individual risk of approximately 3 × 10^−5 per year. If we apply the FAR values for factory work from Table 6.6 and assume an average work year of 1800 hours, we arrive at an individual risk of 7.2 × 10^−5 per year. This calculation shows that the annual risk associated with factory work is more than twice as high as that of rock climbing. This is a good illustration of how the different risk metrics give very different answers and therefore need to be used with care. ◽

FAR∗ is an unbiased, but rather nonrobust, estimate of FAR, due to its strong dependency on major accidents. If we calculate FAR∗ over an interval (0, t), where t is increasing, FAR∗ may have a low value until the first major accident occurs. Then the estimate takes a leap before it starts to decrease until the next major accident occurs. This is illustrated clearly in Table 6.7, which presents the estimated FAR∗ for offshore workers in the UK and Norwegian sectors of the North Sea for the period 1 January 1980 to 1 January 1994. Two major accidents occurred in this period: the Alexander L. Kielland platform capsized in the Norwegian sector of the North Sea on 27 March 1980, with the loss of 123 lives, and the Piper Alpha platform exploded and caught fire in the British sector of the North Sea on 6 July 1988, with the loss of 167 lives. It is clear that the total experienced FAR∗ is dominated by these two accidents.

FAR is generally considered to be a very useful overall risk measure, but may be a very coarse measure. This is because FAR (and FAR∗) applies for all members of a specified group, without considering that various members of the group may be exposed to significantly different levels of risk. In Table 6.7, for example, all offshore workers are considered to be members of the same


Table 6.7 Experienced FAR∗ for offshore workers in the United Kingdom and Norwegian sector of the North Sea for the period 1 January 1980–1 January 1994.

Area     Condition for calculation                       FAR∗ (fatalities per 10^8 working hours)
UK       Total FAR∗                                      36.5
         Excluding the Piper Alpha accident              14.2
Norway   Total FAR∗                                      47.3
         Excluding the Alexander L. Kielland accident     8.5

Source: Data from Holand (1996).

group. FAR thus represents an average for all offshore workers, even though it is obvious that the production crew, the drilling crew, and the catering crew are exposed to different levels of risk. They are exposed to some common hazards (such as collapse of the entire installation), but many significant hazards are clearly job-specific. Hence, although the FAR for all workers may be considered acceptable for a given situation, it could happen that, for example, the drilling crew had a very high FAR. FAR values are, therefore, sometimes split over groups, various activities, or sources of risk.

6.3.7.1 Accident Rates in Transport

The transport sector sometimes uses exposure measures other than hours. In aviation, it is common to use

• Number of flight hours
• Number of person flight hours
• Number of aircraft departures

The safety performance measure FAR∗a in aviation may be calculated as

FAR∗a = (Number of accident-related fatalities / Number of flight hours) × 10^5    (6.15)

Here, FAR∗a expresses the number of fatalities per 100 000 hours flown. An alternative to FAR∗a is related to the number of departures:

FAR∗d = (Number of accident-related fatalities / Number of aircraft departures) × 10^5    (6.16)

In the railway and road transport sectors, the following exposure metrics are sometimes used:

• Number of kilometers driven (vehicle kilometers)
• Number of person kilometers (vehicle kilometers times the average number of persons in the vehicle)


• Number of person travel hours (number of hours traveled times the average number of persons in the vehicle)

Relevant risk metrics are therefore

• Number of fatalities per 100 million person kilometers
• Number of fatalities per 100 million vehicle kilometers

6.3.8 Lost-Time Injuries

So far we have only discussed metrics of fatality risk, not taking into account injuries. These are all commonly used in risk analysis, but fatality metrics are in many cases not very good for measuring safety performance, since, fortunately, relatively few fatal accidents occur. To get a sufficient statistical basis, it is often necessary to measure risk for a whole industry or a whole country. Injuries occur far more often and are therefore commonly used to measure safety performance and in particular to study trends in performance.

Definition 6.8 (Lost-time injury, LTI) An injury that prevents an employee from returning to work for at least one full shift. ◽

The frequency of LTIs is used as a safety performance metric and is defined as

LTIF∗ = (No. of lost-time injuries (LTIs) / No. of hours worked) × 2 × 10^5    (6.17)

LTIF∗ is commonly used by companies and may be calculated per month or per year, depending on the size and type of company. As seen from the definition, LTIF∗ is a measure of the frequency of events with consequence above a certain lower limit.

Example 6.10 (Calculating LTIF) An average employee works around 2000 hours per year.² A total of 2 × 10^5 = 200 000 hours is therefore approximately 100 employee-years. If a company has an LTIF∗ = 10 LTIs per 200 000 hours of exposure, this means that on average one out of 10 employees will experience an LTI during one year. ◽

² In many countries, normal work-hours are less than 2000 per year.

Some companies/organizations use another time scale and, for example, define the LTIF∗ as the number of LTIs per million (10^6) hours worked. The following safety performance metrics are also used:

• The (average) time between LTIs in a specified population (expressed in days or worked hours)


• The time since the previous LTI in a specified population (expressed in days or worked hours)
• The frequency of injuries requiring medical treatment

Observe that the last measure is slightly different from those measuring LTI. The lower limit for consequence in this case is “medical treatment,” which is not the same as a lost-time injury. There may be LTIs that do not require medical treatment, and there may also be cases where the injured person returns to work immediately after getting medical treatment. This illustrates that when comparing values of LTI from different sources or different industries, we need to make sure that the definitions are identical.

6.3.8.1 Lost Workdays Frequency

The LTIF∗ does not weight the seriousness of the injury, so a fatal accident has the same effect on the LTIF∗ as an injury causing one day off work. The seriousness of an LTI may be measured by the number of workdays lost due to the LTI, and the lost workdays frequency, LWF∗, may alternatively be used as a safety performance measure. The LWF∗ is defined as

LWF∗ = (No. of lost workdays due to LTIs / No. of hours worked) × 2 × 10^5    (6.18)

• Some companies and organizations use another time scale and define the LWF∗ as the number of lost workdays due to LTIs per million (10^6) hours of exposure.
• The average number of workdays lost per LTI is found from LWF∗/LTIF∗.
• The LWF∗ is sometimes called the S-rate (severity rate).
• Fatalities and 100% permanent disability are sometimes counted as 7500 workdays (Kjellén 2000).

Example 6.11 (Calculating LWF∗) Consider a company with a total of 150 000 employee-hours per year, which corresponds to 75 employees working full time (2000 hours per year). Assume that the company has had eight LTIs during one year, which gives LTIF∗ = 10.7 LTIs per 200 000 hours of exposure. Assume that the company has lost 107 workdays due to LTIs. This corresponds to LWF∗ ≈ 143 lost workdays per 200 000 hours of exposure. Observe that the LWF∗ does not tell us anything about the seriousness of each LTI and whether the lost workdays are equally distributed, for example, if seven LTIs caused only one day lost, whereas the eighth caused 100 lost workdays. If we instead make the assumption that out of the eight LTIs, one was a fatality, we see the big difference between LTIF∗ and LWF∗. LTIF∗ remains unchanged because the number of LTIs remains the same, whereas LWF∗ increases from 143 to more than 10 000 if we use 7500 workdays for a fatality. ◽
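The calculation in Example 6.11 can be reproduced with a short sketch of Eqs. (6.17) and (6.18). The numbers are those given in the example; the choice of which LTI to treat as the fatality is an assumption (any choice gives a value above 10 000).

```python
# Sketch of Eqs. (6.17) and (6.18) with the numbers from Example 6.11.
def ltif(lti_count: int, hours_worked: float) -> float:
    """Lost-time injury frequency per 200 000 hours worked."""
    return lti_count / hours_worked * 2e5

def lwf(lost_workdays: float, hours_worked: float) -> float:
    """Lost workdays frequency per 200 000 hours worked."""
    return lost_workdays / hours_worked * 2e5

hours = 150_000                        # employee-hours per year
print(ltif(8, hours))                  # ~10.7
print(lwf(107, hours))                 # ~143
# Treat the 100-day injury as a fatality, counted as 7500 lost workdays:
print(lwf(107 - 100 + 7500, hours))    # ~10 000
```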


Figure 6.4 FN curve example.

6.3.9 FN Curves

The possible consequences of an accident may in many cases vary over a wide range. It has, therefore, been found useful to present the consequences versus their frequencies in a graph. To make the graph more “stable,” it has become common to plot the cumulative frequency f(c) of a consequence C ≥ c. Logarithmic scales are normally used both for frequency and consequence. This type of plotting is attributed to Farmer (1967), who plotted cumulative frequencies of various releases of I-131 from thermal nuclear reactors. Curves of this type are, therefore, sometimes called Farmer curves. When the relevant consequences are the number of fatalities, N, the curve is usually called an FN curve.

The FN curve is a descriptive risk metric that provides information on how a risk is distributed over small and large accidents. An example of an FN curve is shown in Figure 6.4, where the frequency F on the ordinate axis is the frequency of “exceedance,” meaning that F(n) denotes the frequency of accidents where the consequence is n or more fatalities.

Assume that fatal accidents in a specified system or within a specified area occur according to a homogeneous Poisson process with frequency 𝜆 (per annum). By fatal accident, we mean an accident with at least one fatality. Let N denote the number of fatalities of a future fatal accident. Because all accidents under consideration are fatal accidents, we know that Pr(N ≥ 1) = 1. The frequency of fatal accidents with n or more fatalities is therefore

F(n) = 𝜆_{N≥n} = 𝜆 Pr(N ≥ n)    (6.19)

and the FN curve is obtained by plotting F(n) as a function of n, for n = 1, 2, …, as shown in Figure 6.4. The frequency of accidents with exactly n fatalities is


therefore

f(n) = F(n) − F(n + 1)    (6.20)

Because f(n) ≥ 0 for all n, the FN curve must be either flat or falling. The probability that a future accident will have exactly n fatalities is

Pr(N = n) = Pr(N ≥ n) − Pr(N ≥ n + 1) = [F(n) − F(n + 1)] / 𝜆    (6.21)

The mean number of fatalities in a single, future accident is

E(N) = ∑_{n=1}^{∞} n Pr(N = n) = ∑_{n=1}^{∞} Pr(N ≥ n)    (6.22)

The mean number of fatal accidents per annum is 𝜆 (because we use years as the time unit), and the mean total number of fatalities per annum is therefore

E(NTot) = 𝜆E(N) = ∑_{n=1}^{∞} 𝜆 Pr(N ≥ n) = ∑_{n=1}^{∞} F(n)    (6.23)

which can be represented by the “area” under the FN curve in Figure 6.4. It may be noted that this corresponds to PLL. FN curves for some types of transport systems are shown in Figure 6.5.

The FN curve may be used for at least three purposes:

• To show the historical record of accidents (used as a safety performance metric)
• To depict the results of quantitative risk assessments (used as a risk metric)
• To display criteria for judging the tolerability or acceptability of outputs from quantitative risk assessments
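As a minimal numerical sketch of Eqs. (6.19)–(6.23), the snippet below builds an FN curve from a hypothetical accident record and recovers PLL as the sum of the F(n) values. The fatality counts and the observation period are invented.

```python
import numpy as np

# Sketch: FN curve from an assumed accident record (one entry per fatal accident).
fatalities = np.array([1, 1, 2, 1, 5, 1, 3, 1, 12, 2])
years_observed = 25.0

lam = len(fatalities) / years_observed          # frequency of fatal accidents, per year
n_values = np.arange(1, fatalities.max() + 1)
# F(n): frequency of accidents with n or more fatalities, Eq. (6.19)
F = np.array([lam * np.mean(fatalities >= n) for n in n_values])

# PLL = expected fatalities per year = sum of F(n), Eq. (6.23)
pll = F.sum()
print("n, F(n):", list(zip(n_values.tolist(), np.round(F, 3).tolist())))
print(f"PLL estimate: {pll:.2f} fatalities per year")
```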

6.3.9.1 FN Criterion Lines

By introducing criterion lines in the FN diagram as shown in Figure 6.6, the outputs from quantitative risk assessment can be judged against a predefined level of acceptable risk. An FN criterion line is determined by two parameters (Ball and Floyd 1998):

(1) An anchor point, (n, F(n)), which is a fixed pair of consequence and frequency.
(2) A risk aversion factor, 𝛼, which determines the slope of the criterion line.

Given an anchor point and a risk aversion factor 𝛼, the criterion line is constructed from the equation³

F(n) n^𝛼 = k1    (6.24)

³ This equation is often written as F × N^𝛼 = k1.

Figure 6.5 FN curve example, with curves for road transport (1969–2001), railway (1968–2001), and aviation (1967–2001). Source: Adapted from HSE (2003).

where k1 is a constant. By taking logarithms, the equation becomes

log F(n) + 𝛼 log n = k    (6.25)

where k = log k1. If plotted in a coordinate system with logarithmic scales, as shown in Figure 6.6, this function gives a straight line with slope −𝛼.

As shown in Figure 6.6, two different lines are usually drawn, thus splitting the area into three regions: an unacceptable region, a tolerable as low as reasonably practicable (ALARP) region, and a broadly acceptable region. The ALARP principle is described further in Chapter 5. The anchor points and the slopes have to be deduced from risk acceptance criteria. With the values chosen in Figure 6.6, the upper FN criterion line indicates that it is unacceptable to have fatal accidents with a higher frequency than 10^−2 per year. Similarly, the lower FN criterion line indicates that a situation where fatal accidents occur with a frequency of less than 10^−4 per year is broadly acceptable. The region between the two lines is called the ALARP region. In this region, the risk is considered to be tolerable if the ALARP principle is followed.

The slope in Figure 6.6 is based on 𝛼 = 1. A higher value of 𝛼 gives a steeper line, indicating risk aversion to accidents with a high number of fatalities. Risk aversion means being more than proportionally concerned with the number of fatalities per accident. A risk-averse person regards an accident that kills


Figure 6.6 FN criterion lines (example).

several persons as being less acceptable than several accidents that collectively take the same number of lives. Accounting for risk aversion in policymaking is a controversial issue, and there are therefore different opinions regarding what value of 𝛼 to employ. In the United Kingdom, the Health and Safety Executive (HSE) prescribes a so-called “risk neutral” factor of 𝛼 = 1, which means that the frequency of an accident that kills 100 people (or more) should be approximately 10 times lower than one that kills 10 people (or more). The Dutch government, on the other hand, promotes a risk aversion factor of 𝛼 = 2. An accident that kills 100 people (or more) is then required to have approximately 100 times lower frequency than an accident that kills 10 people (or more). For a thorough discussion on this issue, the reader is referred to Skjong et al. (2007) and Ball and Floyd (1998). Observe that both settle for an aversion factor of 𝛼 = 1 to avoid implicit judgments of risk acceptability.
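A small sketch of how a criterion line from Eq. (6.24) can be used to check an FN curve is given below. The anchor point, the aversion factor, and the FN curve values are assumptions chosen only to illustrate the comparison.

```python
import numpy as np

# Sketch: criterion line from an anchor point and aversion factor, Eq. (6.24),
# and a point-by-point check of an assumed FN curve against it.
def criterion_line(n, anchor_n=1, anchor_F=1e-2, alpha=1.0):
    """F_crit(n) = k1 / n^alpha, with k1 fixed by the anchor point."""
    k1 = anchor_F * anchor_n**alpha
    return k1 / np.asarray(n, dtype=float)**alpha

n = np.array([1, 2, 5, 10, 50, 100])
fn_curve = np.array([8e-3, 4e-3, 1.5e-3, 9e-4, 3e-4, 1.5e-4])   # assumed F(n)

exceeds = fn_curve > criterion_line(n)      # True where the criterion line is crossed
for ni, Fi, flag in zip(n, fn_curve, exceeds):
    print(f"n={ni:3d}  F(n)={Fi:.1e}  exceeds criterion: {flag}")
```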

6.3.9.2 Some Comments

To investigate the properties of the FN criterion line, we assume that accidents occur according to the assumptions of a specified line with aversion factor 𝛼 = 1 and formula F(n)n = k1 . For n = 1, we get F(1) = k1 . The constant k1 is therefore equal to the total frequency of fatal accidents. The frequency of fatal accidents with a single


fatality is f(1) = F(1) − F(2) = k1 − k1/2 = k1/2. The assumptions behind the criterion line, therefore, imply that exactly half of the fatal accidents have a single fatality. In general (when 𝛼 = 1), the frequency of an accident with exactly n fatalities is

f(n) = F(n) − F(n + 1) = k1 / [n(n + 1)]    (6.26)

which implies that all numbers n of fatalities occur with a certain regularity. If risk is calculated as the expected annual consequence, the risk contribution from accidents with n fatalities is ΔRn = n f(n) = k1/(n + 1) and decreases with the number of fatalities. The FN criterion line is sometimes claimed to be an iso-risk line, usually without any clear definition of what is meant by the term iso-risk. In light of the value of ΔRn, the iso-risk assertion might be discussed. With the same assumptions, the expected number of annual fatalities would be

E(NTot) = ∑_{n=1}^{∞} n f(n) = k1 ∑_{n=1}^{∞} 1/(n + 1) = ∞

In most cases, the number of persons exposed is limited to some value nmax and the expected number of annual fatalities is therefore

E(NTot) = k1 ∑_{n=1}^{nmax} 1/(n + 1)    (6.27)

FN criterion lines are widely used to evaluate the group risk of an activity or system, but their use is contested. Among the critics are Evans and Verlander (1997), who claim that FN criterion lines provide illogical recommendations if, for example, the line is exceeded in one area but is otherwise lower. Although FN criterion lines are usually considered to provide valuable guidance to decisions on risk acceptability, the user should therefore be aware of the limitations of this approach.

6.3.10 Potential Equivalent Fatality

Most risk measures described are an expression of fatality risk only. In many cases, the risk to people is not adequately described by the fatality risk, and injuries should also be taken into account. This is sometimes done by comparing injuries and disabilities with fatalities and calculating a potential equivalent fatality. Definition 6.9 (Potential equivalent fatality, PEF) A convention for aggregating harm to people by regarding major and minor injuries as being equivalent to a certain fraction of a fatality (RSSB, 2007). ◽


Example 6.12 (London Underground QRA) In a Quantitative Risk Assessment (QRA) of London Underground Limited, potential injuries were classified into minor and major injuries. Major injuries were given the weight 0.1, such that 10 major injuries were considered to be “equivalent” to one fatality (i.e. 10 × 0.1 = 1). Minor injuries were given the weight 0.01, such that a hundred minor injuries were considered “equivalent” to one fatality. ◽

6.3.11 Frequency of Loss of Main Safety Functions

An indirect way of measuring risk to people was introduced by the Norwegian Petroleum Safety Authority⁴ in 1981. In their Guidelines for Concept Safety Evaluation Studies, they stated risk acceptance criteria related to the frequency of loss of so-called main safety functions due to major accidents. These are still implemented, in a somewhat modified form, in the regulations for the oil and gas industry in Norway. This is not a direct expression of risk to people, expressed through injuries or fatalities, but the idea is that as long as these safety functions are intact, there is a good chance that people are safe. The principle can perhaps best be explained through an example.

Example 6.13 (Offshore oil/gas installation) Consider an offshore oil and gas installation and assume that a major fire is initiated on this installation. In this situation, the personnel onboard is exposed to the fire and may be killed. To avoid this, a set of key functions needs to be in place:

(1) People who are working in different areas around the platform should have safe escape routes available that they can follow to get away from the fire.
(2) They should be able to gather in a safe area until the situation is under control or a decision to evacuate is made.
(3) The structure of the installation needs to remain intact until the situation is under control or evacuation has been completed.
(4) If it is decided to evacuate, safe evacuation means must be available for the whole crew.

In this simple example, four main safety functions are identified: (1) escape routes, (2) safe area, (3) structure, and (4) evacuation means. ◽

If any of the main safety functions fail during a major accident, the probability that personnel in the installation will be killed is increased. If, for example, no evacuation routes are available, people may be trapped by the fire, and if the fire is not controlled, they may be killed. Similarly, if the structure of the installation collapses before evacuation has been completed, it is not surprising that the people on board will be killed.

⁴ At that time part of the Norwegian Petroleum Directorate.


Loss of the main safety functions is thus an indirect measure of risk to people. If the frequency of loss is high, we can expect that the risk to people is high. The Norwegian regulations for offshore oil and gas installations state that the frequency of loss of the main safety functions should be less than 10−4 per accident type and per year. The accident types are specified in the regulations. One advantage of using frequencies of loss of main safety functions as risk metrics is that they directly focus on aspects of design of the installation. The main safety function escape routes is for example directly related to the layout of the installation and the protection of the escape routes against impact from fire, explosions, and other events. If the acceptance criteria are not met, the layout needs to be improved.

6.4 Risk Matrices

A risk matrix is a tabular illustration of the likelihood and severity of hazardous events or accident scenarios based on categories of likelihoods and severity (see the example in Figure 6.7). The risk matrix may be used to rank hazardous events according to their significance, to screen out insignificant events, or to evaluate the need for risk reduction for each event (e.g. see HSE, 2001a). Risk matrices are commonly used in several types of risk analyses. The analysis is based on subjective judgments and may seem to be simple and straightforward, but there are many pitfalls that can make the results of limited value. The approach and its limitations are thoroughly discussed by, among others, Baybutt (2018), Duijm (2015), and Cox (2008).

Probability/consequence   1 Improbable   2 Remote   3 Possible   4 Occasional   5 Fairly normal
5 Catastrophic                 6             7           8             9              10
4 Severe loss                  5             6           7             8               9
3 Major damage                 4             5           6             7               8
2 Damage                       3             4           5             6               7
1 Minor damage                 2             3           4             5               6

Shading in the figure: Broadly acceptable; Acceptable – use the ALARP principle and consider further analysis; Not acceptable – risk reduction required.

Figure 6.7 Risk matrix.


There are no international standards that cover risk matrices. The size of the matrix, the labeling of the axes, and so on, must therefore be decided by the analyst. In most risk matrices, the likelihood and the severity are divided into three to six categories, with the likelihood on the horizontal axis and the severity on the vertical axis. Using more than six categories is considered to make it difficult to classify events, whereas very few categories make it difficult to distinguish between events. In the risk matrix shown in Figure 6.7, five categories are used for both the likelihood and the severity, but risk matrices can have different numbers of categories for likelihood and severity. Each cell in the matrix corresponds to a specific combination of likelihood and severity, which can be assigned a risk level number or some other risk descriptor. The categories can be expressed quantitatively or qualitatively and may include consequences to people, the environment, material assets, and/or other assets.

6.4.1 Classification of Likelihoods

Depending on the application, the likelihood can be expressed as a frequency or a probability. To simplify the notation in this section, we assume that the likelihood is presented as a frequency. In most applications, it is sufficient to group the frequencies (both when it is a risk measure and a safety performance measure) in rather broad categories, for example, by using Table 6.8. Each category is given a name (e.g. occasional), a frequency, and a description. Using terms such as “occasional” or “possible” without further explanation opens for misinterpretations, both by the risk analysts and the decision-makers using the results. It is therefore important that the categories are described as precisely as possible. Using frequency categories is the best option to ensure that the words are not interpreted in different ways.

Table 6.8 Frequency categories.

Category           Frequency (per year)   Description
5. Fairly normal   10–1                   Event that is expected to occur frequently
4. Occasional      1–0.1                  Event that happens now and then and will normally be experienced by the personnel
3. Possible        10^−1–10^−2            Rare event, but will possibly be experienced by the personnel
2. Remote          10^−2–10^−3            Very rare event that will not necessarily be experienced in any similar plant
1. Improbable      10^−4–0                Extremely rare event


It is important to state how the frequency is measured. Frequencies may, for example, be expressed per year, per flight, or per operation. We also need to clarify whether this should be for a specific plant, a company, a country, or other units. There may be a big difference between asking someone to classify how often an event will occur in a plant compared to a national level. Being precise in the definitions of categories is important.

Because of this, we also have to adjust the scale to fit the problem being considered. The scale should be such that the events that are identified spread out over the whole scale. If all events end up in one or two of the frequency categories, we are not able to distinguish the large and small contributors to risk, and it becomes difficult to prioritize which should be focused on to reduce the risk. It is common to define the “width” of the categories such that the likelihood/severity of the next higher category is 10 times as high as for the preceding category. This means that a logarithmic scale is used for both likelihood and severity. This is convenient when we use the risk level (see below) to rank hazardous events in the risk matrix because the logarithm of the categories then increases by one from one category to the next. This is not an absolute requirement.

6.4.2 Classification of Consequences

The consequences of an accident may be classified into different levels according to their severity. An example of such a classification is given in Table 6.9.

Table 6.9 Classification of consequences according to their severity.

5. Catastrophic – People: several fatalities. Environment: time for restitution of ecological resources ≥ 5 years. Property: total loss of system and major damage outside system area.
4. Severe loss – People: one fatality. Environment: time for restitution of ecological resources 2–5 years. Property: loss of main part of system; production interrupted for months.
3. Major damage – People: permanent disability, prolonged hospital treatment. Environment: time for restitution of ecological resources ≤ 2 years. Property: considerable system damage; production interrupted for weeks.
2. Damage – People: medical treatment and lost-time injury. Environment: local environmental damage of short duration (≤ 1 month). Property: minor system damage; minor production influence.
1. Minor damage – People: minor injury, annoyance, disturbance. Environment: minor environmental damage. Property: minor property damage.


Table 6.10 Severity classification in MIL-STD-882E (2012).

Category       Description
Catastrophic   Any failure that could result in deaths or injuries or prevent performance of the intended mission
Critical       Any failure that will degrade the system beyond acceptable limits and create a safety hazard (could cause death or injury if corrective action is not taken immediately)
Major          Any failure that will degrade the system beyond acceptable limits but can be counteracted or controlled adequately by alternative means
Minor          Any failure that does not degrade the overall performance beyond acceptable limits – one of the nuisance variety

Table 6.9 shows typical categories. For frequency, it is sufficient to use a single scale, but for consequences, a separate scale for each asset is required, such as for people, environment, and property. In practice, we therefore end up with separate risk matrices for each asset. When a risk assessment of a specific system is carried out, it is often beneficial to adapt the categories to the situation at hand, and much of the advice given for specifying frequency categories is also relevant here. The severity categories should be defined such that the severity of a category is approximately 10 times higher than the severity of the preceding category, but this is not always so easy. By this approach, the severity numbers are on a logarithmic scale. In addition, we need to describe the consequence categories as precisely as possible. Other commonly used consequence categories are given in Table 6.10.

6.4.3 Rough Presentation of Risk

Assume that a total of n mutually exclusive events E1, E2, … , En have been identified for a study object. The events may be initiating events, hazardous events, or accident scenarios. If we, for each event Ei, are able to quantify the consequence Ci by a cost or loss function 𝓁(Ci) and can quantify the probability or frequency pi of event Ei, for i = 1, 2, … , n, Eq. (2.9) suggests that the risk can be quantified as ∑_{i=1}^{n} 𝓁(Ci) pi, an expression that is used by many authors. In the current situation, we cannot quantify exactly the costs and the probabilities of the possible events. For each event Ei (i = 1, 2, … , n), we can only specify the level of the consequence by a severity number Si and the level of the frequency by a likelihood number Li. Still, many authors use the following


expression to roughly represent the risk related to the study object:

R = ∑_{i=1}^{n} Si Li    (6.28)

where Si Li is the product of the severity number Si and the likelihood number Li.

6.4.4 Risk Priority Number

To be able to prioritize risk reduction efforts, it is of interest to rank the initiating events, the hazardous events, or the accident scenarios according to their contribution to the “total risk.” For this purpose, a metric is helpful. The most commonly used metric is the risk priority number (RPN). The RPNi of event i is defined as

RPNi = Si Li    for i = 1, 2, … , n    (6.29)

which, in light of (6.28), is an obvious candidate. If we can determine the RPNs for all the initiating events (alternatively hazardous events, or accident scenarios), we can rank how much the various events contribute to the “total” risk. An event with a high RPN gives a higher risk contribution than an event with a low RPN. The RPN values cannot be given a clear interpretation and should only be used to rank the relative importance of the events.

The definition in (6.29) is sensible when the severity and the likelihood categories are defined as a near linear function of the consequences and the frequencies of the events. To illustrate this point, consider two events E1 and E2 that occur with the same likelihood L and where the consequence of event E1 is 10 times as high as the consequence of event E2 (e.g. 10 fatalities versus 1 fatality). To reflect the difference in risk, it would be natural to assume that RPN1 should be much higher (e.g. ≈10 times as high) than RPN2, at least when the severity is classified with a high number of categories.

Many standards and many authors suggest that categories similar to the ones shown in Tables 6.8 and 6.9 be used. These categories are defined by using a scale that is approximately logarithmic with base 10. This is clearly seen from Table 6.8, where the frequency in category i + 1 is approximately 10 times the frequency in category i, for i = 1, 2, 3, 4. A similar, but not so clear, logarithmic scale is used in Table 6.9. When using a logarithmic scale, the severity number is, therefore, the logarithm of the consequence, and the likelihood number is the logarithm of the frequency of event i:

Si = log(consequencei)
Li = log(frequencyi)

When calculating RPNi by (6.29), we really multiply

RPNi = Si Li = log(consequencei) × log(frequencyi)


To multiply two logarithms does not have any clear logical meaning, and it would be more natural to use

RPNi = log[(consequencei)(frequencyi)] = log(consequencei) + log(frequencyi) = Si + Li    (6.30)

We have, therefore, two different definitions of RPNi for event i:

(1) A multiplication rule: RPNi = Si Li
(2) An addition rule: RPNi = Si + Li

The difference between these two approaches is illustrated in Example 6.14.

Example 6.14 (Two approaches for determining RPN) Consider the two accident scenarios E1 and E2. Assume that we have found that E1 has severity category 5 and likelihood category 1 and that E2 has severity category 4 and likelihood category 2 (from Tables 6.8 and 6.9). The two approaches – addition versus multiplication – give:

Addition. The RPN1 for event E1 is RPN1 = 5 + 1 = 6 and the RPN2 for event E2 is RPN2 = 4 + 2 = 6, which means that E1 and E2 get the same RPN and are equally important.

Multiplication. The RPN obtained by multiplication for event E1 is RPN1 = 5 × 1 = 5 and for event E2, we get RPN2 = 4 × 2 = 8, and event E2 is more important than event E1. ◽

Example 6.14 shows that the two approaches give different results in some cases. According to our opinion, the multiplication rule is most correct when the severity and the likelihood are measured on a linear scale, whereas the addition rule is most correct when the severity and the likelihood are measured with a scale that is logarithmic or approximately logarithmic. Many standards suggest that severity and likelihood categories are defined on an approximate logarithmic scale, and for that reason, we use the addition approach in the rest of this book. In some applications, the RPN is called risk index.

Example 6.15 (Fire in a subway train) Consider an initiating event E: “fire with smoke in a subway train in a tunnel.”

• The consequence is assessed based on Table 6.9. We assume 3–10 fatalities and get consequence category 5.
• The probability is assessed based on Table 6.8. We assume that the initiating event E occurs once every 10–100 years and get frequency category 3.


The risk index for the initiating event (accident scenario) E is then determined by 5 + 3 = 8. The numbers in this example are not based on a thorough analysis and are included only as an illustration. ◽ In the risk matrix in Figure 6.7, the risk indices for the various combinations are calculated as shown in the matrix. Because this matrix is a 5 × 5 matrix, the risk indices range from 2 to 10. Following the approach described earlier, it is implied that events with the same risk level index have approximately the same risk. It is sometimes relevant to group initiating events that have risk indices in a certain range and treat them similarly. In Figure 6.7, three different ranges/areas are distinguished as defined by the UK Civil Aviation Authority (UK CAA, 2008, p. 12): (1) Acceptable. The consequence is unlikely or not severe enough to be of concern; the risk is tolerable. Consideration should be given to reducing the risk further to ALARP, to further minimize the risk of an accident or incident.5 (2) Review. The consequence and/or probability is of concern; measures to mitigate the risk to ALARP should be sought. Where the risk is still in the review category after this action, the risk may be accepted provided that the risk is understood and has the endorsement of the person who is ultimately accountable for safety in the organization. (3) Unacceptable. The probability and/or severity of the consequence is intolerable. Major mitigation is necessary to reduce the probability and/or severity of the consequences associated with the hazard. In the risk matrix in Figure 6.7, the acceptable region is in the lower left corner of the matrix and covers the events with risk level index 2–5. The unacceptable region is in the upper right corner and covers events with risk indices 8–10, whereas the review region is in the mid-region of the matrix, covering events with risk indices 6 and 7.
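The risk index matrix of Figure 6.7 and the grouping into regions can be sketched in a few lines of Python. The addition rule and the index ranges (2–5, 6–7, 8–10) are those described above; everything else is just bookkeeping.

```python
import numpy as np

# Sketch: 5 x 5 risk index matrix (addition rule RPN_i = S_i + L_i) and
# grouping of the indices into the three regions described in the text.
likelihood = np.arange(1, 6)       # 1 Improbable ... 5 Fairly normal
severity = np.arange(1, 6)         # 1 Minor damage ... 5 Catastrophic

risk_index = severity[:, None] + likelihood[None, :]   # rows: severity, cols: likelihood

def region(index: int) -> str:
    if index <= 5:
        return "acceptable"
    if index <= 7:
        return "review"
    return "unacceptable"

print(risk_index)
print(region(5 + 3))   # Example 6.15: severity 5, likelihood 3 -> index 8, 'unacceptable'
```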

6.5 Reduction in Life Expectancy

Death and disability due to a specific hazard need not occur during the course of the hazardous activity, but may sometimes be delayed for several years. An employee’s risk of immediate death due to an accident at work might, for example, be less than the risk of developing a fatal cancer due to an accidental release of a carcinogenic material. In the first case, she might die as a young woman, whereas the latter may allow her to work for 20–30 years before any cancer is developed.

5 The ALARP principle is discussed further in Section 5.3.1.


The risk metrics that have been introduced so far do not distinguish between fatalities of young and old people. To cater to the age of the victim, the reduction in life expectancy, RLE, has been suggested as a risk metric. If a person dies at age t due to an accident, the RLEt is defined as

RLEt = t0 − t

where t0 denotes the mean life length of a randomly chosen person of the same age as the person killed, who has survived up to age t. The RLEt is seen to be equal to the mean residual life of the individual who is killed at age t. This is, in other words, not a true risk metric, in the sense that it does not express a combination of probability and consequence. It is only the consequence that is expressed, as the number of “lost” years. It is, of course, possible to assign a probability of experiencing this consequence and thus calculate the risk, but this is not commonly used.

The RLE puts increased value on young lives because the reduction in life expectancy depends on the age at death. From an ethical point of view, this may be questionable, but it is not uncommon in health services to prioritize treatment of young people over old people.

Example 6.16 (Calculating RLE) Consider a local community that has to decide where to locate a hazardous installation. Close to one of the alternative locations is a kindergarten (K) and close to another location is a retirement home (R). Assume that the expected number of fatalities is the same in both cases and that the worst-case consequence is that 20 people are killed (either children or retirees). Assume that the average age of the children is five years, with a life expectancy of 80 years. The average age of the retirees is 75, with a life expectancy of 85 years. The RLE for the two cases can be expressed as follows:

RLEK = 20 persons × (80 − 5) = 1500 years
RLER = 20 persons × (85 − 75) = 200 years ◽

RLE is used to express risk associated not only with diseases but also with accidents. To calculate the average reduction in life expectancy RLEav of a specified group of persons exposed to a certain hazard, we have to compare the observed life expectancy with the estimated life expectancy of the same group in the absence of the hazard. The life expectancy of the group may sometimes differ significantly from that of the general population of the country. A table of the estimated average reduction in life expectancy RLE∗av due to some selected causes is presented by Fischhoff et al. (1981). A brief extract of their data is presented in Table 6.11.


Table 6.11 Estimated average reduction in life expectancy due to various causes.

Cause                      Days
Heart disease              2100
Cancer                      980
Stroke                      520
Motor vehicle accidents     207
Accidents in home            95
Average job, accidents       74
Drowning                     41
Accidents to pedestrians     37

Source: Fischhoff et al. (1981).

6.6 Choice and Use of Risk Metrics

In this chapter, several ways of measuring risk to people are introduced and discussed. Most of them are related to fatality risk, but some also include injuries. For a specific application, which risk metrics should we use? Are any of them better than the others? Later, some general advice and discussion of how to choose risk metrics is presented.

Because risk analysis is used to support decision-making, the foremost criterion for choosing how to measure risk is that it can express risk in a way that provides the answers required for the decision-making. An obvious factor is how the risk acceptance criteria are expressed. If the acceptance criterion is expressed as a FAR-value, the risk analysis needs to express the risk as a FAR-value, to allow comparison. A supplement to this is that the risk should be expressed in a way that the decision-maker is able to understand, enabling her to interpret the results correctly. Complex risk metrics that are difficult to explain and interpret should therefore be avoided. If not, good explanations and examples need to be provided.

Some general considerations when choosing among and using risk metrics are as follows:

• PLL is dependent on the size of the group that is being considered. The bigger the group, the larger PLL tends to be. Comparing different activities and comparing different groups may therefore be difficult when using PLL.
• For evaluating the effect of risk reduction measures, PLL may be useful. We are then comparing different risk levels for the same group and same activity, and the comparison is then valid. PLL further has the advantage that it is


convenient when performing cost-benefit analyses (see Chapter 5) for comparison of alternative risk reduction measures.
• Individual risk metrics such as AIR and FAR are often used to express acceptance criteria because they consider the individuals in a group rather than the group as a whole. Both can be used for the same purposes and are independent of the size of the group that is being considered and also independent of exposure time, because both express risk for a given exposure (either per year or per 10^8 hours).
• Group risk metrics are useful when we consider large systems that expose a large group of individuals. On the other hand, individual risk may be a more suitable metric when smaller groups are exposed to additional risk (that may be high) (Vanem, 2012).
• The FN curve provides more information about risk than the single-number risk measures, such as PLL, FAR, and AIR. With the FN curve, we get an impression of the severity of the accidents, for example, whether the risk is mainly associated with accidents with one or a few fatalities or accidents with many fatalities. Major accidents with many fatalities are usually regarded as less acceptable than (many) smaller accidents and should, therefore, be important for decision-making. A disadvantage with the FN curve is, on the other hand, that it can be difficult to understand.
• For risk to people, it is quite common to focus on fatalities only and not injuries. For this reason, most of the risk measures, therefore, also measure only fatality risk. The limitation that is introduced through this should be clearly communicated to the decision-maker, and if this is not an adequate measure of risk, LTI, LWF, or PEF may be used to include injuries in the description of risk.
• The way that all the risk metrics normally are used, it is common to average over relatively large groups of people. AIR and FAR, therefore, express risk for an average individual, and not for specific individuals. In some cases, averaging may “hide” large individual differences between the individuals in a group, and this needs to be communicated clearly.
• When using a risk matrix to express risk, we have full freedom to decide how the categories for both frequency and consequence are defined. Some companies and institutions have defined standard risk matrices that they use for all types of study objects. This is not necessarily a good idea because the scale of the problem and the type of events that are relevant may vary significantly from case to case. A standard risk matrix need not be suitable for all these situations because the events tend to be “lumped” together in just a small part of the risk matrix, giving little distinction between different events.
• Finally, using safety performance metrics based on fatalities can be problematic if major accidents have occurred because they tend to dominate the total number of fatalities and thus strongly influence the value of the safety performance measure. This is a general problem with measuring safety




If we look at risk associated with process plants, trends in major accidents are often measured on a global basis, because that is the only way to obtain a data set that is large enough to produce reasonably reliable trend figures. A general comment to the risk metrics that express risk as a single number is that they give very limited information about risk. Using several risk metrics, such as PLL and AIR, or PLL and an FN curve, may therefore be useful because a more nuanced picture of risk is provided. MacKenzie (2014) discusses how to construct risk metrics and presents some views and aspects to consider when deciding on what risk metrics to use.
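To make the relationships between the single-number metrics concrete, the short sketch below shows how PLL, AIR, and FAR could be computed for a small, hypothetical group of workers. The group size, exposure hours, and expected number of fatalities are made-up assumptions for illustration only; the formulas assume, as in this chapter, that PLL is the expected number of fatalities per year and that FAR is expressed per 10⁸ exposure hours.

    # Hypothetical example: a group of workers exposed to the same activity.
    n_persons = 120              # number of exposed individuals (assumed)
    hours_per_person = 1700.0    # annual exposure hours per individual (assumed)
    expected_fatalities = 0.006  # expected fatalities per year from the risk analysis (assumed)

    # PLL: expected number of fatalities in the group per year.
    pll = expected_fatalities

    # AIR: average annual fatality probability for one (average) individual.
    air = pll / n_persons

    # FAR: expected number of fatalities per 10**8 exposure hours.
    far = pll * 1e8 / (n_persons * hours_per_person)

    print(f"PLL = {pll:.3e} per year, AIR = {air:.3e} per year, FAR = {far:.2f}")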

6.7 Risk Metrics for Other Assets

So far, we have only looked at risk metrics for risk to persons and not other assets, such as environment, economy, and reputation. We do not give a comprehensive discussion of this, but provide some examples of how risk can be measured.

6.7.1 Measuring Environmental Risk

Perhaps the biggest issues related to environmental risk at the time of writing this book are long-term issues such as climate change, microplastics in the oceans, and extinction of pollinating bees. Long-term effects are often not expressed as risk, but indirectly through the levels of release of hazardous materials, the levels of toxic and carcinogenic materials present in food, exposure to chemicals, and so on. We do not go into this, but instead focus on accidental releases, mainly to water. Three approaches to measuring the consequences of such releases are common:
• Time from the release occurs until the environment has recovered. This approach is used, among others, by the SAFEDOR project (Skjong et al., 2007), which proposes four damage categories based on recovery time, ranging from minor damage up to the most severe category (more than 10 years recovery time).
• Size of the area affected by the release. A second approach considers the extent of the effects rather than their duration (Diamantidis, 2017). Damage is expressed in terms of the size of the area that is affected or the length of the coastline that is affected by a spill at sea. The Seveso directive (EU, 2012) uses size of area as a criterion for determining what is a major accident.
• Combination of the extent of damage and the time to recovery. A third approach is to combine the two approaches above into one measure. The extent of damage is calculated repeatedly over time until the damage is negligible. From this, a combined metric can be calculated by adding the calculated extent of damage for each time step and dividing by the number of time steps.
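A minimal sketch of the third (combined) approach is shown below. It assumes that the extent of damage, for example the affected area in square kilometers, has been estimated at equally spaced points in time after the release; the numerical values are purely illustrative.

    # Extent of damage (e.g. km^2 affected) estimated at equally spaced time steps
    # after an accidental release, until the damage is negligible (illustrative values).
    extent_per_time_step = [12.0, 9.5, 6.0, 3.0, 1.0, 0.2, 0.0]

    # Combined metric: sum the extent over all time steps and divide by the
    # number of time steps, as described above (a time-averaged extent of damage).
    combined_metric = sum(extent_per_time_step) / len(extent_per_time_step)
    print(f"Time-averaged extent of damage: {combined_metric:.2f}")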


6.7.2 Measuring Economic Risk

Risk to material assets is typically expressed very simply as the combination of the frequency of the loss and the loss expressed in monetary values, such as Euros. This gives a statistically expected loss (equivalent to PLL). Sometimes, other assets, such as reputation, are expressed in terms of monetary values. For companies listed on a stock exchange, this can be expressed as loss of share value.
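As a simple illustration, the statistically expected annual loss can be computed by multiplying the frequency of each loss event by its monetary consequence and summing the contributions. The event frequencies and loss values below are assumptions made up for the example.

    # Illustrative loss events: (frequency per year, loss in euros)
    loss_events = [
        (0.10, 50_000),       # minor damage
        (0.01, 2_000_000),    # major damage
        (0.001, 30_000_000),  # loss of the facility
    ]

    # Statistically expected loss per year (the economic analogue of PLL).
    expected_annual_loss = sum(freq * loss for freq, loss in loss_events)
    print(f"Expected annual loss: {expected_annual_loss:,.0f} euros")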

6.8 Problems

6.1 In 2009, 212 persons were killed in car accidents in Norway. The population in 2009 was approximately 4.85 million. Calculate AIR∗ and PLL∗ based on these numbers. Comparatively, in the United States in 2009, 33 808 persons were killed. The population was approximately 307 million. Calculate AIR∗ and PLL∗ based on this. Compare the numbers from the two countries and comment on what the results tell us. Where is driving safest?
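A minimal sketch of how these numbers could be combined is given below, taking the starred metrics to be values estimated directly from the observed statistics (PLL∗ as the observed number of fatalities per year, AIR∗ as fatalities divided by population). It only sets up the calculation and is not a complete answer to the problem.

    # Observed data for 2009 (from the problem text).
    countries = {
        "Norway":        {"fatalities": 212,    "population": 4_850_000},
        "United States": {"fatalities": 33_808, "population": 307_000_000},
    }

    for name, data in countries.items():
        pll_star = data["fatalities"]                       # observed fatalities per year
        air_star = data["fatalities"] / data["population"]  # fatalities per inhabitant per year
        print(f"{name}: PLL* = {pll_star}, AIR* = {air_star:.2e} per year")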

6.2 A total of 95 persons are working onboard an offshore oil and gas installation. The personnel can be split into three groups: admin, process, and maintenance. A risk analysis has been performed and it has been found that (i) ship collision, (ii) process releases, (iii) fires in the utility area, and (iv) occupational accidents contribute to risk. The calculated risk results for these accident types are shown below, in different formats. The platform is split into three areas (process, utility, and living quarter). A ship collision will affect all areas in the same way. Process releases and fires in the utility area will have a different effect on different areas. Occupational accidents are independent of area, but related to the type of work (personnel group). Data from the risk analysis:
• Ship collisions:
  – Frequency per year = 3.20 × 10⁻⁴ (all areas are exposed in the same way)
  – Pr(fatality|collision) = 0.08
• Process releases and fires (LSIR):

  LSIR                    Process        Utility        Living Q.
  Process releases        1.50 × 10⁻³    3.00 × 10⁻⁴    2.00 × 10⁻⁵
  Fires in utility area   2.00 × 10⁻⁴    8.00 × 10⁻⁴    1.00 × 10⁻⁵




• Occupational accidents (AIR):

  Personnel group         AIR
  Admin personnel         4.00 × 10⁻⁵
  Process personnel       1.10 × 10⁻⁴
  Maintenance personnel   9.00 × 10⁻⁵

• Manning level and proportion of time in each area:

  Personnel group         Number of persons   Process (%)   Utility (%)   Living Q. (%)
  Admin personnel         40                  5             5             90
  Process personnel       25                  35            10            55
  Maintenance personnel   30                  20            25            55

The number of hours that personnel spend on the installation = 2920 (1/3 of a year).
(a) Calculate AIR for all personnel onboard and AIR per group for the three personnel groups.
(b) Calculate PLL per year.
(c) Calculate average FAR-value for all personnel onboard.
(d) Assume that the total manning is reduced from 95 to 50 and recalculate PLL, AIR, and average FAR. Comment on changes.
(e) Assume that the working pattern is changed such that people spend 50% of the year offshore. Recalculate the PLL, the average AIR, and the average FAR. Comment on changes.
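One possible way to organize the calculation for part (a) is sketched below. It relies on modeling assumptions made for the sketch, not statements from the problem: the LSIR values are taken to refer to a person continuously present in the given area, the occupational AIR values are taken to be annual per-person values that can be added directly, and a person is assumed to be exposed to ship collisions only while onboard.

    # Data from the problem (frequencies per year, probabilities, proportions).
    frac_onboard = 2920 / 8760            # fraction of the year spent on the installation
    ship_collision_freq = 3.20e-4
    p_fat_given_collision = 0.08

    lsir = {  # location-specific individual risk per year, by accident type and area
        "process_release": {"process": 1.50e-3, "utility": 3.00e-4, "living": 2.00e-5},
        "utility_fire":    {"process": 2.00e-4, "utility": 8.00e-4, "living": 1.00e-5},
    }
    occupational_air = {"admin": 4.00e-5, "process": 1.10e-4, "maintenance": 9.00e-5}
    area_fraction = {  # proportion of onboard time spent in each area
        "admin":       {"process": 0.05, "utility": 0.05, "living": 0.90},
        "process":     {"process": 0.35, "utility": 0.10, "living": 0.55},
        "maintenance": {"process": 0.20, "utility": 0.25, "living": 0.55},
    }
    n_persons = {"admin": 40, "process": 25, "maintenance": 30}

    def group_air(group):
        """Annual fatality probability for an average member of the group (sketch)."""
        risk = ship_collision_freq * p_fat_given_collision * frac_onboard
        for areas in lsir.values():
            for area, value in areas.items():
                risk += value * frac_onboard * area_fraction[group][area]
        return risk + occupational_air[group]

    for g in n_persons:
        print(f"AIR({g}) = {group_air(g):.2e} per year")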

6.3 Frequency of loss of main safety functions.
(a) Explain what we mean by the risk metric frequency of loss of main safety functions.
(b) What does this metric tell us about risk (e.g. is it an expression of personnel risk, environmental risk, or risk to other assets)?
(c) Give examples of main safety functions.
(d) In what industry are main safety functions used as a risk metric, and in what phase of the life cycle is it particularly well suited?

6.4 From Norwegian statistics, the following numbers can be found:
• Average population in Norway 1991–2000: 4 400 000
• Split between males and females is very close to 50%.
• Average annual number of fatalities due to accidents (1991–2000): 1734
• Average annual number of fatalities due to accidents among males: 986


• Total average number of fatalities (all causes) in the period 1991–2000: 45 300
Calculate the following:
(a) Average annual PLL∗ due to accidents for the period 1991–2000
(b) Average FAR∗ value due to accidents for the period 1991–2000 (assume 24 hours exposure per person per day)
(c) Average individual risk (AIR∗) due to accidents of males and females, respectively, for the period 1991–2000
(d) Total AIR∗ due to all causes for the period 1991–2000
(e) Can you explain what this last AIR∗ value means in practical terms? Comment on how realistic it is to use an average value like this if you want to calculate the risk, for example, for students.
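The sketch below shows one way to turn these statistics into the starred metrics. It assumes continuous exposure of 8760 hours per person per year for FAR∗ (as stated in part (b)), a 50/50 male/female split, and that the figure of 45 300 is an annual average; it is meant as a starting point, not a full solution.

    population = 4_400_000
    fatalities_accidents = 1734       # average per year, 1991-2000
    fatalities_accidents_males = 986  # average per year, 1991-2000
    fatalities_all_causes = 45_300    # interpreted here as an annual average (assumption)

    pll_star = fatalities_accidents
    # FAR*: fatalities per 10**8 exposure hours, assuming 24 h exposure per day.
    far_star = pll_star * 1e8 / (population * 8760)
    # AIR* for males, assuming half the population is male.
    air_star_males = fatalities_accidents_males / (population / 2)
    # AIR* due to all causes.
    air_star_all_causes = fatalities_all_causes / population

    print(f"PLL* = {pll_star}, FAR* = {far_star:.2f}, "
          f"AIR*(males) = {air_star_males:.2e}, AIR*(all causes) = {air_star_all_causes:.2e}")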

6.5 What information is required to establish an FN curve?

6.6 Within a specific industry, an average of 24 000 man-years of work per year has been performed from 1990 to 1999. One man-year is assumed to consist of 2000 hours of work, and all employees are assumed to work full time. In this period, 26 fatalities have occurred.
(a) Calculate PLL∗, FAR∗, and AIR∗ for this period.
(b) Will any of the risk metrics change if the number of man-years is increased, but the number of deaths remains the same?

6.7 The total risk has been quantified for train passengers and the results may be summarized as follows:

  Consequence                  Frequency per year
  0 Fatalities, 5 injuries     3.2 × 10⁻³
  1 Fatality, 10 injuries      6.1 × 10⁻⁴
  5 Fatalities, 20 injuries    4.9 × 10⁻⁵
  10 Fatalities, 30 injuries   3.6 × 10⁻⁶
  20 Fatalities, 40 injuries   1.2 × 10⁻⁷
  40 Fatalities, 50 injuries   8.6 × 10⁻⁷

If you need to make further assumptions to solve the problems, describe these.
(a) Calculate average annual PLL for the train operation.
(b) Calculate AIR.
(c) Calculate average FAR-value for all passengers.




(d) Calculate Potential Equivalent Fatality. Injuries are given a weight of 0.1 fatalities per injury.
(e) Draw an FN curve based on this information.
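A compact sketch for parts (a), (d), and (e) is shown below. It simply combines the frequencies and consequence numbers from the table (with the 0.1 weighting per injury for PEF) and accumulates frequencies for the FN curve; treat it as an illustration of the bookkeeping, not as an authoritative solution.

    # (frequency per year, fatalities, injuries) from the table above.
    outcomes = [
        (3.2e-3, 0, 5), (6.1e-4, 1, 10), (4.9e-5, 5, 20),
        (3.6e-6, 10, 30), (1.2e-7, 20, 40), (8.6e-7, 40, 50),
    ]

    pll = sum(f * n for f, n, _ in outcomes)              # part (a)
    pef = sum(f * (n + 0.1 * i) for f, n, i in outcomes)  # part (d)

    # FN curve, part (e): frequency of accidents with N or more fatalities.
    for n_min in (1, 5, 10, 20, 40):
        f_cum = sum(f for f, n, _ in outcomes if n >= n_min)
        print(f"F(N >= {n_min:2d}) = {f_cum:.2e} per year")
    print(f"PLL = {pll:.2e}, PEF = {pef:.2e} per year")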

6.8 In a factory, there are 200 full-time employees who all are exposed to the same risk. A risk analysis for the factory has been performed and the following information has been established:
• The only type of accident that can cause fatalities at the factory is fire.
• The frequency of fire is 0.0028 per year.
• The average consequence if a fire occurs is 1.2 fatalities.
Calculate AIR for the employees in the factory. Assume that the operation at the factory is changed and that a toxic chemical is introduced, leading to another type of accident that may occur. If this chemical is released, employees may be killed. The annual frequency of release is 4.2 × 10⁻⁴, and the average consequence if a release occurs is 12 fatalities. Recalculate AIR for the employees in the factory taking the changes into account.
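For this problem, the individual risk can be sketched as the expected number of fatalities per year summed over all accident types, divided by the number of equally exposed employees. The short calculation below is only meant to show the structure.

    n_employees = 200
    # (frequency per year, average number of fatalities per event)
    accident_types = [(0.0028, 1.2)]                  # fires only
    air_before = sum(f * c for f, c in accident_types) / n_employees

    accident_types.append((4.2e-4, 12))               # add the toxic release
    air_after = sum(f * c for f, c in accident_types) / n_employees

    print(f"AIR before: {air_before:.2e} per year, after: {air_after:.2e} per year")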

6.9 A train is standing at a station, ready to leave. The line out of the station is a single-track line, and it is used by trains driving in both directions. Several measures are in place to prevent the train from leaving the station while another train is coming in the opposite direction. The probability of the train leaving the station without the line being clear is therefore quite low, at 3.0 × 10⁻⁶ per situation. The train driver and the train conductor will experience “Train waiting at station, ready to leave” on average 10 times every working day, and they are working 220 days each year. An average passenger will experience the same situation two times on each trip and takes 440 trips each year. On average, the train carries the train driver, the train conductor, and 30 passengers when it is waiting to leave. The probability of a collision, given that the “train drives out of the station and the line is not clear,” is 0.5, and the probability of being killed, given that a person is on a colliding train, is 0.25.
(a) What is AIR for the train driver and for an average passenger?
(b) What is DPM for the train driver and for an average passenger?
(c) What is PLL for train collision?
(d) What is the average FAR-value for the train driver? Use 1850 working hours per year as exposure time.

6.10 For an offshore platform, the following information is provided from a risk analysis:


• Blowout from a well is one of the main types of accidents that can occur.
• The frequency per year of a blowout is 0.0012.
• If a person is present on the platform, the average probability of being killed is 0.25, given that a blowout occurs.
• An average person spends 2920 hours per year (out of a total of 8760 hours per year) on the platform.
• On average, there are 150 persons on the platform at any time.
(a) Calculate LSIR due to blowout for the offshore installation.
(b) Calculate AIR due to blowout for an average person.
(c) Calculate FAR due to blowout for an average person. Assume that exposure time is equal to the time a person spends on the platform.
(d) Calculate PLL per year due to blowout for the offshore installation.
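A sketch of how the blowout numbers could be combined is given below, under the assumptions that LSIR here means the annual fatality probability for a person continuously present on the platform and that FAR uses only the hours actually spent onboard as exposure time (as stated in part (c)).

    blowout_freq = 0.0012        # per year
    p_fatality = 0.25            # given a blowout, for a person present
    hours_onboard = 2920
    hours_per_year = 8760
    persons_on_board = 150

    lsir = blowout_freq * p_fatality                    # (a) continuous presence
    air = lsir * hours_onboard / hours_per_year         # (b) average person
    far = air * 1e8 / hours_onboard                     # (c) per 10**8 exposure hours
    pll = blowout_freq * p_fatality * persons_on_board  # (d) expected fatalities per year

    print(f"LSIR = {lsir:.1e}, AIR = {air:.1e}, FAR = {far:.1f}, PLL = {pll:.3f}")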

6.11 Consider an accident where two private cars crash. One person is killed, one is injured, and both cars are severely damaged. Consider the consequence categories provided in Table 6.9 and identify the consequence categories that the adverse effects can be classified in.

6.12 In which way is the ALARP principle implemented in the risk matrix?

6.13 At a workplace, statistics of accidents have been calculated for a period of 10 years, and the results are shown in the following table. It is assumed that there are 500 employees working 2000 hours per year the whole period.

  Year             1    2    3    4    5    6    7    8
  Number of LTIs   12   6    9    7    10   5    4    2
  Number of LWDs   65   42   90   55   75   80   50   62

(a) Calculate LTIF∗ and LWF∗ for each year and plot the numbers in two curves.
(b) What do the curves tell us about the severity of the LTIs?
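The sketch below shows one way to compute the two series, assuming that LTIF∗ and LWF∗ are expressed per 10⁶ exposure hours (adjust the scaling constant if the definitions used earlier in the chapter have a different basis). With 500 employees working 2000 hours per year, the annual exposure is exactly 10⁶ hours.

    exposure_hours_per_year = 500 * 2000             # = 1e6 hours
    ltis = [12, 6, 9, 7, 10, 5, 4, 2]
    lwds = [65, 42, 90, 55, 75, 80, 50, 62]

    for year, (lti, lwd) in enumerate(zip(ltis, lwds), start=1):
        ltif = lti / exposure_hours_per_year * 1e6   # lost-time injuries per 10**6 hours
        lwf = lwd / exposure_hours_per_year * 1e6    # lost workdays per 10**6 hours
        print(f"Year {year}: LTIF* = {ltif:.1f}, LWF* = {lwf:.1f}")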

References

Ball, D.J. and Floyd, P.J. (1998). Societal Risks. Tech. Rep. London: Health and Safety Executive.
Baybutt, P. (2018). Guidelines for designing risk matrices. Process Safety Progress 37 (1): 49–55.
Cox, L.A.J. (2008). What’s wrong with risk matrices? Risk Analysis 28 (2): 497–512.




Diamantidis, D. (2017). A critical view on environmental and human risk acceptance criteria. International Journal of Environmental Science and Development 8 (1): 62–66.
Duijm, N.J. (2015). Recommendations on the use and design of risk matrices. Safety Science 76: 21–31.
EU (2012). Council Directive 2012/18/EU of 4 July 2012 on the control of major-accident hazards involving dangerous substances, Official Journal of the European Union L197/1.
EU-JRC (2006). Land use planning guidelines in the context of article 12 of the Seveso II directive 96/82/EC as amended by directive 105/2003/EC. Tech. Rep. Ispra, Italy: European Commission, Joint Research Centre.
Eusgeld, I., Freiling, F.C., and Reusser, R. (eds.) (2008). Dependability Metrics. Berlin: Springer-Verlag.
Evans, A.W. and Verlander, N.Q. (1997). What is wrong with criterion FN-lines for judging the tolerability of risk? Risk Analysis 17: 157–168.
Farmer, F. (1967). Siting criteria: a new approach. Atom 128: 152–170.
Fischhoff, B., Lichtenstein, S., and Keeney, R.L. (1981). Acceptable Risk. Cambridge: Cambridge University Press.
Hambly, E.C. (1992). Preventing Disasters. London: Royal Institution Discourse.
Hirst, I.L. (1998). Risk assessment: a note on F-N curves, expected numbers of fatalities, and weighted indicators of risk. Journal of Hazardous Materials 57: 169–175.
Holand, P. (1996). Offshore blowouts: causes and trends. PhD thesis. Trondheim, Norway: Department of Production and Quality Engineering, Norwegian Institute of Technology.
HSE (2001a). Marine Risk Assessment. London: HMSO.
HSE (2001b). Reducing Risks, Protecting People; HSE’s Decision-Making Process. Norwich: HMSO.
HSE (2003). Transport Fatal Accidents and FN-Curves (1967–2001). London: HMSO.
Kjellén, U. (2000). Prevention of Accidents Through Experience Feedback. London: Taylor & Francis.
Laheij, G.M.H., Post, J.G., and Ale, B.J.M. (2000). Standard methods for land-use planning to determine the effects on societal risk. Journal of Hazardous Materials 71: 269–282.
MacKenzie, C.A. (2014). Summarizing risk using measures and risk indices. Risk Analysis 34: 2143–2162.
MIL-STD-882E (2012). Standard practice for system safety. Washington, DC: U.S. Department of Defense.
RSSB (2007). Engineering Safety Management (The Yellow Book), vols 1 and 2. London: Rail Safety and Standards Board.
Skjong, R., Vanem, E., and Endresen, Ø. (2007). Risk Evaluation Criteria. Tech. Rep. SAFEDOR-D-4.5.2 DNV. EU Project, SAFEDOR Project.


UK CAA (2008). Safety Management Systems: Guidance to Organisations. Technical report. Gatwick Airport, UK: Civil Aviation Authority, Safety Regulation Group.
Vanem, E. (2012). Ethics and fundamental principles of risk acceptance criteria. Safety Science 50 (4): 958–967.



7 Risk Management

7.1 Introduction

The topic of this book is risk assessment, but identifying and describing risk cannot by itself reduce risk. We need to make decisions based on the results and make sure that these decisions are implemented and followed up before we can expect to see any effect. The systematic performance of these and other activities aimed at controlling risk is what we may call risk management. This chapter provides a brief introduction to this topic to give a better understanding of how and where risk assessment fits into the bigger picture of risk management. The international standard ISO 31000, “Risk management – Guidelines,” was first published in 2009 and was later updated in 2018 (ISO 31000 2018). This standard is widely recognized, but a vast body of other standards and guidelines provide advice on what risk management is and how it should be implemented. Examples include the following:
• ISO 45001. “Occupational health and safety management systems – Requirements with guidance for use”
• CCPS (2016). “Guidelines for implementing process safety management”
• ICAO (2018). “Safety Management Manual”
• U.S. Homeland Security (2011). “Risk management fundamentals”
• UK HSE (2013). “Managing for health and safety”
• ISO/IEC 27001. “Information technology – Security techniques – Information security management systems”
Most of these documents are either specific to certain types of risk (e.g. ISO 45001, which focuses on occupational risk) or to specific industries (e.g. the Center for Chemical Process Safety [CCPS] and the International Civil Aviation Organization [ICAO]). Our presentation of risk management is mainly based on ISO 31000 because this is the most generic standard. Even if the terminology and the descriptions vary between the sources, the main principles are in most cases the same or very similar.



ISO 31000 has a wide scope and is aimed at managing all types of risk that an organization may be exposed to. This includes not only risk associated with accidents but also other aspects, such as financial, political, legal, contractual, and regulatory risk. ISO 31000 uses the term “risk” with a much wider meaning than we apply in this book and also allows for positive outcomes. An organization may, for example, choose to accept a high financial risk because there is a possibility of making a lot of money if the risk is not realized into a negative outcome. In this situation, risk management is about optimization of the level of risk, not minimization or reducing risk to as low as reasonably practicable (ALARP). ISO 31000 emphasizes that risk management has to be anchored in the top level management of the organization and also that risk management needs to be integrated closely with normal operation and management of the organization. ISO 31000 (2018) defines risk management as follows: Definition 7.1 (Risk management) Coordinated activities to direct and control an organization with regard to risks. ◽ This definition applies to organizations, but this is because the standard is delimited to risk management in organizations. The definition covers all the activities that organizations plan and perform to deal with risk, including what is done to identify and describe risk. Risk assessment is thus a part of risk management. The US Department of Homeland Security, on the other hand, defines risk management as follows: Definition 7.2 (Risk management) Risk management is the process for identifying, analyzing, and communicating risk and accepting, avoiding, or controlling it to an acceptable level considering associated costs and benefits of any actions taken (Homeland Security 2010). ◽ According to ISO 31000, risk management should follow these principles: • Integrated – should be a part of the overall management, not an activity separated from strategic and operational decisions • Structured and comprehensive – systematic and covering everything that is relevant and significant • Customized – adapted to and suitable for the organization in question • Inclusive – ensure involvement of relevant stakeholders • Dynamic – accepts that risk changes and that risk management needs to adapt to this • Best available information should be used

[Figure 7.1 shows the risk management process as a loop consisting of scope, context, and criteria; risk assessment (comprising risk analysis and risk evaluation); and risk treatment, with communication and consultation, monitoring and review, and recording and reporting as surrounding, continuous activities.]

Figure 7.1 Risk management process. Source: Adapted from ISO 31000 (2018).

• Human and cultural factors need to be taken into account • Continual improvement The term safety management is often used with the same meaning as risk management, but in this chapter we prefer the term “risk management.” This is not only in accordance with ISO 31000 but also in accordance with how we use, for example, risk metrics and safety performance metrics, in relation to the future and the past, respectively. Another term that is used in relation to management of risk is risk governance (IRGC 2008). In practice, there are a lot of similarities between how risk management is described in ISO 31000 and how International Risk Governance Council (IRGC) describes risk governance. The main steps of the risk management process are illustrated in Figure 7.1, which is a modified version of the figure in ISO 31000, because the standard uses a terminology that is slightly different from the one used in this book. For all practical purposes, the differences are not important. The risk management process in Figure 7.1 is shown as a loop, with two activities on the sides, but in practice it is an iterative process. Briefly, the main steps are as follows: (1) Scope, context, and criteria. Defines the framework for the risk management process (2) Risk assessment. Covers risk analysis and risk evaluation (3) Risk treatment. Decides what to do (if anything) and implements the decision (4) Communication and consultation. Involves all relevant stakeholders in the process




(5) Monitoring and review. Continuous follow-up of performance, to see if the performance meets the objectives of risk management. (6) Recording and reporting. Documentation and feedback. The six steps are discussed briefly in the following.

7.2 Scope, Context, and Criteria

Step 1 defines the scope of the risk management process and the external and internal context, and establishes the risk criteria. This is basically the same process we go through before starting a risk assessment (Chapter 3), except that the scope and context are wider. It is recommended that the scope of the process is comprehensive, covering all types of risk, although some may still choose to limit it only to specific hazards and events. The internal and external context within which risk management takes place is important. The internal context relates to the organization itself, whereas the external context is everything outside the organization that can be influenced by the organization or that the organization is influenced by. The internal context would include, for example, the values of the organization, together with its vision and mission. Other factors are strategies, policies, culture, available resources, knowledge and data, organizational structure, and internal relationships. External factors cover a wide variety of aspects, including social, cultural, political, legal, regulatory, financial, environmental, technological, and economic factors. It may seem that such a wide range of factors falls far outside what may affect the management of accident risk, but consider the political and social pressure that is placed on a company that has experienced a severe accident. Another example is financial factors, where a company struggling to survive financially due to a market depression also has to reconsider spending on risk management. Chapter 5 deals with risk acceptance criteria, but primarily with how these can be established and expressed in principle, not in practice. Establishing risk acceptance criteria is a complicated process, where the internal and external context plays a big part in deciding how to formulate the criteria and what level of risk should be tolerated. Many of the abovementioned factors therefore provide relevant inputs also to this step of the process.

7.3 Risk Assessment

Risk assessment (step 2) is the main topic of this book and is not treated in detail in this section. The reader may refer to Chapter 3 for more details.


7.4 Risk Treatment

Risk treatment (step 3) involves two substeps: (i) decide what to do and (ii) implement the decision. In practice, there are several options available:
Do nothing. If risk is tolerable and ALARP (for acceptance criteria that are formulated according to this principle), the conclusion may be that nothing needs to be done.
Maintain existing measures to control risk. In most cases, existing risk controls are already in place and have been taken into account in the risk analysis. For the results from the risk analysis to remain valid, these controls need to be maintained.
Consider further risk reduction measures. Even if risk is tolerable, we can seldom conclude that it is ALARP unless we have tried to identify and evaluate further risk reduction measures. An evaluation of their effectiveness and a comparison of alternatives should be conducted before a decision is made.
Perform further analysis. It is not uncommon that this is the outcome, in particular if the risk is high and there are no obvious, cost-effective risk reduction measures available. It may also be that the uncertainty is high or that the risk is not well understood.
Reevaluate risk acceptance criteria. In special situations, this may be the outcome. If the risk is high, not meeting the criteria, but the benefits from performing an activity or making a decision are very high, it may be concluded to deviate from the risk acceptance criteria and accept a higher risk than stipulated by the criteria.
In most cases, it is relevant to consider reduction of risk, and as a basis for this, systematic identification of risk reduction measures is important. For this, a good understanding of accident scenarios is needed. If we understand how relevant scenarios can evolve, we are also better positioned to suggest measures to avoid them or to modify the development into a scenario with less severe consequences. Systematic evaluation of risk reduction measures is also important. Use of cost–benefit analysis is one aspect of this, but we also need to consider whether measures may have negative effects, whether they will be specific to just one accident scenario or cover more broadly, how effective they will be, and other aspects. Systematic identification and evaluation of risk reduction measures is usually an integrated part of the risk analysis, and these topics are covered in more detail in Chapter 13.




and other factors. An example of this is the fear of flying. Two people flying on the same plane may have widely differing perceptions of what the risk is. Some may have an extreme fear of flying and do everything possible to avoid it, whereas others have no or few concerns. This can largely be attributed to differences in perception. ◽

7.5 Communication and Consultation

Step 4 of the risk management process consists of two key parts that fit closely together:
• Communication is about how we present the results from risk analysis to various stakeholders, often with a highly varying background and understanding of what risk and risk analysis are.
• Consultation is about getting feedback from the stakeholders on the risk assessment and the decisions that have been proposed or made.
Unless communication is good, consultation is likely to fail because the stakeholders do not understand the risk analysis and the results. Risk communication is difficult and is a research area in its own right. The concept of risk is defined in many different ways and is easily misunderstood (as we have seen in Chapter 2). Complicating matters even more is the fact that there may be many stakeholders with varying backgrounds and needs with regard to communication. The risk measures that we use are not necessarily well suited for informing nonexperts. A risk analyst will (hopefully) understand what an FN curve is, but will a decision-maker with no or limited formal training in risk assessment understand its meaning? One of the important researchers in this area is Baruch Fischhoff, who has published several books and many papers on risk communication and risk perception. Fischhoff (1995) nicely summarizes twenty years of development in risk communication in the eight “stages” listed in Table 7.1. Unfortunately, it is too often the case that only the first two to three stages are followed. As risk analysts, we are not necessarily involved in all of these stages, but it is still important to understand that risk communication is much more than just doing the analysis in the best possible way and presenting the results. More information on risk communication may be found in, for example, Fischhoff and Scheufele (2013), Árvai and Rivers (2014), Morgan (2009), and Kasperson (2014). If communication is good, consultation is more likely to be successful. Consultation is essentially about getting feedback from relevant stakeholders. A good process for this requires that the arenas for presenting and discussing the results are good and that the timing enables input to be taken into account in the risk management process. How this is done in detail depends on the


Table 7.1 Stages of development in risk communication.
1. All we have to do is get the numbers right
2. All we have to do is tell them the numbers
3. All we have to do is explain what we mean by the numbers
4. All we have to do is show them that they’ve accepted similar risks
5. All we have to do is show them that it’s a good deal for them
6. All we have to do is treat them nice
7. All we have to do is make them partners
8. All of the above

Source: Fischhoff (1995). © John Wiley & Sons, Inc. Reproduced with permission.

situation. When developing a new plant, introducing new legislation, or making other changes that will affect risk, it is important that stakeholders have a real opportunity of influencing the situation and timing is then very important. In day-to-day operation of a plant, consultation with workers about risk may, on the other hand, take place on a regular basis and be part of the overall operation of the plant. Because it is often difficult to present all information that may be relevant for stakeholders in writing, it may be necessary to involve the risk analysts who have performed the analysis in direct communication with stakeholders. Otherwise, communication problems and erroneous inferences may prevail.

7.6 Monitoring and Review

Equally important as selecting and implementing risk reduction measures is the day-to-day follow-up of risk. Monitoring and review of risk (step 5) cover a variety of activities with a number of objectives:
• Making sure that the measures that have been adopted give the result that was intended. Sometimes, we may experience that a risk reduction measure does not have the expected effect (e.g. procedures that were developed are not fit for purpose, gas detectors have not been located optimally and do not detect flammable gas, and training of personnel did not lead to improved performance).
• Monitoring that risk reduction measures continue to work as intended. Technical systems need to be regularly inspected, tested, and maintained to make sure that they continue working. Personnel that have been trained in emergency response will need regular exercises and retraining because they seldom need to use this knowledge.
• Monitoring trends in safety performance and risk levels. If there are negative trends in the risk level, we want to detect this as early as possible and take




corrective actions. Monitoring is often done through safety performance indicators (e.g. the lost-time injury [LTI] rate) or through measuring the performance of key safety systems. The Petroleum Safety Authority Norway (PSAN) started a monitoring program for major accident risk in the Norwegian oil and gas industry in 2001 and publishes annual reports providing a status on the risk level (PSA 2018). Among the indicators used are the number of serious incidents, the number of injuries, failures in tests of safety critical equipment, and also the status of human and organizational factors that may influence risk.
• Monitoring also needs to consider changes in the internal and external context, which may require modifications to the risk management system itself or to the measures we have implemented to reduce and control risk. This may be caused by changes in operations, changes in regulations, modifications to equipment, and so on. All are examples of changes where an updated risk assessment may be required. Experience has shown that changes often are a major factor in accidents, because the effect of the changes on risk has not been properly evaluated. Management of change is therefore often highlighted as a particularly important part of risk management.
• As part of this step of the process, one should regularly evaluate whether the management system as such is functioning as intended. Audits are often used to verify performance, and modifications may have to be made based on the results.

7.6.1 Safety Audits

A particular type of monitoring activity is the audit. Audits are used in a wide range of contexts, and the purpose is typically to verify that a function, a procedure, a management system, or another activity is working as intended. A specific type of audit is the safety audit, and these may have several purposes. The main purpose is usually to verify that the risk management system that has been put in place is functioning as intended. Safety audits may serve to identify weaknesses in the system, identify hazards that have been overlooked, or verify compliance with relevant laws, regulations, and standards. The activities involved in an audit may vary significantly, depending on the scope. Examples of activities are review of procedures, checking documentation, and performing interviews.

7.7 Recording and Reporting

The final step of the risk management process is recording and reporting information (step 6). The whole process should be documented properly, including the results from the process. The documentation is an important tool for communication both internally and externally and enables tracking how the


process was performed and what has been achieved. This is an important basis for improvement. According to ISO 31000, key aspects to take into account when deciding on what and how to report are the following:
• Different stakeholders have different needs when it comes to reporting.
• Reporting can be done in different ways, and not just paper-based.
• Reporting should be targeted to help the organization meet its objectives.
• Reports need to be available on time, and without excessive use of resources.

7.8 Stakeholders

As described in Section 7.5, relevant stakeholders need to be involved in the risk management process. ISO 31000 (2018) defines a stakeholder as follows: Definition 7.3 (Stakeholder) Person or organization that can affect, be affected by, or perceive themselves to be affected by a decision or activity. ◽ A stakeholder is also called an interested party, which is the preferred term in ISO 45001. Definition 7.3 covers many categories of stakeholders, such as people who are affected by the consequences of possible accidents but who do not have the ability or power to influence the decision. Observe that even people who are not affected, but perceive themselves to be affected, are stakeholders and are potentially entitled to be informed and consulted about risk.

7.8.1 Categories of Stakeholders

Stakeholders may be classified in different ways. One such classification is based on the stakeholders’ (i) power, (ii) urgency, and (iii) legitimacy. Stakeholders may alternatively be classified as (Yosie and Herbst 1998):
(1) People who are affected directly by a decision to act on any issue or project.
(2) People who are interested in a project or an activity, want to become involved in the process, and seek an opportunity to provide input.
(3) People who are more generally interested in the process and may seek information.
(4) People who are affected by the outcome of a decision but are unaware of it or do not participate in the stakeholder process.
Some stakeholders may have several roles in relation to the study object. The consequences of an accident will be different for various stakeholders, depending on their relation to the assets that may be harmed. If a worker is killed in an accident, her husband and children will, for example, experience consequences other than those that would affect her employer.




7.9 Risk and Decision-Making

The decision about what to do with risk was described in Section 7.4. In the following, we elaborate briefly on the decision-making process.

7.9.1 Model for Decision-Making

It is important to remember that risk is always only one dimension of a decision problem. Operational, economic, social, political, and environmental considerations may also be important decision criteria. Even if a decision influences risk, it is never made in a vacuum. There are always constraints, such as laws and regulations, time and cost limits, and so on, that need to be adhered to, and there are usually a set of stakeholders who have interests in the decision and who will seek to influence the decision-making in different ways. A simple model for decision-making involving risk is shown in Figure 7.2, which is an expanded version of a similar figure in Aven (2003, p. 98). The results from a risk assessment can be used as follows:
• Direct input to decisions (see Figure 7.2)
• Indirect input to decisions, for example, by influencing stakeholders
The actual decision must be taken by the management and is not part of the risk assessment process. In the following, we briefly describe three “modes” of decision-making that rely on results from risk analysis to a varying degree.

[Figure 7.2 shows the decision problem and its decision alternatives being analyzed through risk analysis, other analyses, and decision analysis, feeding into a managerial review and judgment that leads to the decision, within constraints (laws and regulations, cost, time) and subject to stakeholders’ preferences, objectives, and criteria.]

Figure 7.2 Decision framework. Source: Adapted from Aven (2003).



7.9.1.1 Deterministic Decision-making

Deterministic decision-making means that decisions are made without any consideration of the likelihood of the possible outcomes. Risk analysis is not used as input to deterministic decision-making. Scenarios are predicted based on a deterministic view of the future, assuming that a bounding set of fault conditions will lead to one undesired end event. To prevent this end event from occurring, the decision-maker relies on traditional engineering principles, such as redundancy, diversity, and safety margins.

7.9.1.2 Risk-Based Decision-making

Risk-based decision-making (RBDM) is based almost solely on the results of a risk assessment. The U.S. Department of Energy defines RBDM as: Definition 7.4 (Risk-based decision-making, RBDM) A process that uses quantification of risks, costs, and benefits to evaluate and compare decision options competing for limited resources (adapted from U.S. DOE 1998). ◽ The US Coast Guard gives a detailed description of the RBDM process in the four volumes of USCG (2002). The process can be split into four steps:
(1) Establish the decision structure (identify the possible decision options and the factors influencing these).
(2) Perform the risk assessment (e.g. as described in Chapter 3).
(3) Apply the results to risk management decision-making (i.e. assess the possible risk management options and use the information from step 2 in the decision-making).
(4) Monitor effectiveness through impact assessment (track the effectiveness of the actions taken to manage the risk and verify that the organization is getting the results expected from the risk management decisions).
USCG (2002) may be consulted for details about the various steps.

7.9.1.3 Risk-Informed Decision-making

The RBDM approach has been criticized for putting too much focus on probabilistic risk estimates and paying too little attention to deterministic requirements and design principles. To compensate for this weakness, the risk-informed decision-making (RIDM) approach has emerged. NUREG-1855 (2009) defines RIDM as: Definition 7.5 (Risk-informed decision-making, RIDM) An approach to decision-making representing a philosophy whereby risk insights are considered together with other factors to establish requirements that better focus the attention on design and operational issues commensurate with their importance to health and safety. ◽




[Figure 7.3 shows risk-informed decision-making as drawing on laws, regulations, codes, and standards; deterministic risk assessment; probabilistic risk assessment; operating experience; stakeholder requirements; and potential consequences.]

Figure 7.3 Elements of RIDM. Source: Based on a presentation by Gerry Frappier, director general of the Canadian Nuclear Safety Commission.

A slightly different definition is given in NASA (2007). The RIDM process can, according to NUREG-1855 (2009), be carried out in five steps:
(1) Define the decision under consideration (including context and boundary conditions).
(2) Identify and assess the applicable requirements (laws, regulations, requirements, and accepted design principles).
(3) Perform risk-informed analysis, consisting of
(a) Deterministic analysis (based on engineering principles and experience and prior knowledge).
(b) Probabilistic analysis (i.e. a risk assessment including an uncertainty assessment).
(4) Define the implementation and monitoring program. An important part of the decision-making process is to understand the implications of a decision and to guard against any unanticipated adverse effects.
(5) Integrated decision. Here, the results of steps 1 through 4 are integrated, and the decision is made. This requires that the insights obtained from all the other steps of the RIDM process be weighed and combined to reach a conclusion. An essential aspect of the integration is consideration of uncertainties.
The main difference between RBDM and RIDM is that with RBDM, the decisions are based almost solely on the results of the probabilistic risk assessment, whereas following RIDM, the decisions are made on the basis of information from the probabilistic risk assessment as well as from deterministic analyses and technical considerations. RIDM and RBDM are compared and discussed by Apostolakis (2004). The main elements of RIDM are shown in Figure 7.3.


7.10 Safety Legislation

The foremost requirement of risk management is, of course, that applicable laws and regulations are met. This is, therefore, an important context for the process. Safety legislation has a long history. As early as 1780 BCE, Hammurabi’s code of laws in ancient Mesopotamia contained punishments based on a peculiar “harm analogy.” Law 229 of this code states:
  If a builder builds a house for someone, and does not construct it properly, and the house that he built falls in and kills its owner, then that builder shall be put to death.
Many laws and regulations related to safety have been introduced after major accidents. This may be regarded as a reactive approach, where legislation is based on experience rather than risk assessment. In addition, safety legislation has traditionally been based on a prescriptive regulating regime, in which detailed requirements for the design and operation of a system are specified by the authorities. An example may be that the number of seats in lifeboats should be at least the same as the maximum number of crew and passengers on board a ship. The trend in many countries is now a move away from such prescriptive requirements toward a performance-based regime, which holds the management responsible for ensuring that appropriate safety systems are in place. A performance-based requirement for lifeboats could, for example, be that there should be sufficient evacuation means to ensure safe evacuation of all persons on the ship. Hammurabi’s law may be the earliest example of a performance-based standard (Macza 2008). Goal orientation and risk characterization are two major components of modern performance-based regimes, which have been endorsed enthusiastically by international organizations and various industries (Aven and Renn 2009). Risk assessment is often an important means of documenting that performance-based requirements are met.

7.10.1 Safety Case

Several countries have introduced a safety case regime. A safety case is a risk management regime that requires the operator of a facility to produce a document which: (1) identifies the hazards and potential hazardous events; (2) describes how the hazardous events are controlled; (3) describes the safety management system in place to ensure that the controls are effective and applied consistently.




The detailed content and the application of a safety case vary from country to country, but the following elements are usually important (source: http://www.nopsa.gov.au):
• The safety case must identify the safety critical aspects of the facility, both technical and managerial.
• Appropriate performance standards must be defined for the operation of the safety critical aspects.
• The workforce must be involved.
• The safety case is produced in the knowledge that a competent and independent regulator will scrutinize it.

7.11 Problems

7.1 What is risk management and what are the main activities that are part of risk management?

7.2 What are the main options for risk treatment?

7.3 A plant for producing liquefied natural gas (LNG) is planned close to a city and a risk assessment of the plant is to be carried out. List relevant stakeholders for this risk assessment.

7.4 Classify the stakeholders identified in the previous problem by using the categories in Section 7.8.1.

7.5 Describe the main differences between deterministic, risk-based, and risk-informed decision-making.

7.6 What is a safety case?

References

Apostolakis, G.E. (2004). How useful is quantitative risk assessment? Risk Analysis 24 (3): 515–520.
Árvai, J. and Rivers, L.I. (eds.) (2014). Effective Risk Communication. Abington: Routledge / Earthscan, Milton Park.
Aven, T. (2003). Foundations of Risk Analysis. Chichester: Wiley.
Aven, T. and Renn, O. (2009). The role of quantitative risk assessments for characterizing risk and uncertainty and delineating appropriate risk


management options, with special emphasis on terrorism risk. Risk Analysis 29 (4): 587–600.
CCPS (2016). Guidelines for Implementing Process Safety Management, Wiley and Center for Chemical Process Safety, 2e. Hoboken, NJ: American Institute for Chemical Engineers.
Fischhoff, B. (1995). Risk perception and communication unplugged: twenty years of process. Risk Analysis 15 (2): 137–145.
Fischhoff, B. and Scheufele, D.A. (2013). The sciences of science communication. Proceedings of the National Academy of Sciences 110: 14 033–14 039.
Homeland Security (2010). DHS risk lexicon. Terminology Guide DHS 2010 Edition. Washington, DC: U.S. Department of Homeland Security.
Homeland Security (2011). Risk Management Fundamentals. Report. Washington, DC: U.S. Department of Homeland Security.
HSE (2013). Managing for Health and Safety, Report, 3rd ed. HSG65. London: Health and Safety Executive.
ICAO (2018). Safety Management Manual (SMM). Technical report 9859. Montreal, Canada: International Civil Aviation Organization.
IRGC (2008). An Introduction to the IRGC Risk Governance Framework. Tech. Rep. Lausanne, Switzerland: International Risk Governance Council, EFPL.
ISO 31000 (2018). Risk management – guidelines, International standard. Geneva: International Organization for Standardization.
ISO 45001 (2018). Occupational health and safety management systems – requirements with guidance for use, International standard. Geneva: International Organization for Standardization.
ISO/IEC 27001 (2013). Information security management – security techniques – information security management systems – requirements, International standard. Geneva: International Organization for Standardization.
Kasperson, R.E. (2014). Four questions for risk communication. Journal of Risk Research 17 (10): 1233–1239.
Macza, M. (2008). A Canadian perspective of the history of process safety management legislation. 8th International Symposium on Programmable Electronic Systems in Safety-Related Applications, Cologne, Germany.
Morgan, M.G. (2009). Best Practice Approaches for Characterizing, Communicating, and Incorporating Scientific Uncertainty in Climate Decision Making. Tech. Rep. SAP 5.2. Washington, DC: U.S. Climate Change Science Program.
NASA (2007). NASA Systems Engineering Handbook. Tech. Rep. NASA/SP-2007-6105. Washington, DC: U.S. National Aeronautics and Space Administration.
NUREG-1855 (2009). Guidance on the Treatment of Uncertainties Associated with PRAs in Risk-Informed Decision Making. Washington, DC: U.S. Nuclear Regulatory Commission, Office of Nuclear Regulatory Research.




PSA (2018). Trends in Risk Level in the Petroleum Activity - Summary Report 2018. Tech. Rep. Stavanger, Norway: Petroleum Safety Authority.
U.S. DOE (1998). Guidelines for Risk-Based Prioritization of DOE Activities. Technical report DOE-DP-STD-3023-98. Washington, DC: U.S. Department of Energy.
USCG (2002). Risk Based Decision Making. Guidelines PB2002-108236. Washington, DC: U.S. Coast Guard.
Yosie, T.F. and Herbst, T.D. (1998). Using Stakeholder Processes in Environmental Decisionmaking. Tech. Rep. The Global Development Research Center. https://pdfs.semanticscholar.org/7d1e/059a89dd035371e0fe4a987984909db33f12.pdf (accessed 03 October 2019).


8 Accident Models

8.1 Introduction

To understand the mechanisms of accidents and to develop accident prevention and control strategies, it is essential to know about and learn from past accidents (Khan and Abbasi 1999). Accident data are therefore collected and stored in various databases (see Chapter 9), and accident models have been developed to support accident investigations. Although accident investigation is outside the scope of this book, it forms the basis for much of the material presented. This is because accident models influence our perception of why accidents occur and thus our approach to risk assessment. The first accident models were very simple and attributed accidents primarily to single technical failures. Some years later, human factors and human errors were included in the models. Current accident researchers realize that systems also consist of societal, organizational, and environmental elements in addition to technology and individuals, which should all be integrated into the accident model (e.g. see Leveson 2004; Qureshi 2008). Many accident models are focused primarily on occupational accidents and are, as such, outside the scope of this book. Brief descriptions of some of these models are included for historical reasons, mainly because we can learn from them and because they have formed the basis for more complex models of major accidents.

8.2 Accident Classification

Accidents can be classified in several ways. In this section, we describe some common classifications.



8.2.1 Jens Rasmussen’s Categories

Rasmussen (1997) classifies accidents into three main categories, as shown in Figure 8.1. The three categories are helpful in distinguishing where risk analysis is particularly useful.
Accidents of Category 1. Some accidents, such as road traffic accidents and minor occupational accidents, occur so often and so “regularly” that we may predict the number of similar accidents in the near future based on past observations. For risk management purposes, accident statistics are therefore often used more than risk assessment. Accidents of this category are characterized by a relatively high frequency and a correspondingly low consequence.
Accidents of Category 2. Accidents of category 2 in Figure 8.1 occur rather seldom and have more severe consequences than the accidents in category 1. Examples of such accidents are major industrial accidents, air accidents, railway accidents, and maritime accidents. In the aftermath of such accidents, detailed accident investigations are usually carried out to identify the possible causes of the accident and what could have been done to prevent it from occurring. To determine the risk related to such accidents, it is not sufficient to base the assessment on the number of accidents observed in the past. Rather, we should carry out a detailed risk analysis to identify all the possible hazards and accident scenarios that have yet to occur. Each part of the system is then analyzed and the total risk is determined based on the risk related to the various parts.

[Figure 8.1 plots frequency (log scale) against consequence (log scale), from high risk to low risk: category 1 (occupational and traffic accidents), category 2 (airplane and railway accidents), and category 3 (nuclear power accidents).]

Figure 8.1 Three main categories of accidents. Source: Adapted from Rasmussen (1997).


Accidents of Category 3. Accidents of category 3 in Figure 8.1 occur very seldom, but when they do, they have catastrophic and wide-ranging consequences. An example of an accident in this category is the Chernobyl nuclear power accident in 1986. For such accidents, it is meaningless to try to determine the risk based on historical accident frequencies because there are, fortunately, so few accidents. It is therefore necessary to carry out detailed risk analyses of the various parts of the system. The risk analyses that are covered in this book are mainly relevant for systems that can produce accidents in categories 2 and 3, although not exclusively so. We use the term major accidents to broadly cover categories 2 and 3 and come back to the definition of this term later in this section.

8.2.2 Other Categorizations of Accidents

The prominent risk researcher James Reason classifies accidents into two main types. The first type is individual accidents, which are accidents suffered by single individuals (Reason 1997), see Definition 2.21. Individual accidents are quite common, as can be illustrated by the relatively high number of road traffic accidents and of falls by elderly people in their homes. This type therefore corresponds closely to Rasmussen’s category 1. Reason (1997) refers to the second type as organizational accidents: Definition 8.1 (Organizational accident) A comparatively rare, but often catastrophic, event that occurs within complex modern technologies (e.g. nuclear power plants, commercial aviation, the petrochemical industry, chemical process plants, marine and rail transport, banks, and stadiums) and has multiple causes involving many people operating at different levels of their respective companies. Organizational accidents often have devastating effects on uninvolved populations, assets, and the environment. ◽ These accidents are fortunately quite rare, but may have a large-scale impact on populations. Organizational accidents are generally characterized by multiple causes and numerous interactions between different system elements. It is this type of accident that is of main interest in this book. Another classification was proposed by Klinke and Renn (2002). This is based not just on probability and consequence but also on uncertainty and public perception. They used characters from Greek mythology to describe the accident types, and these are shown in Table 8.1. The purpose of this classification was to use it as a basis for developing diverse strategies for how to manage risk. The book Normal Accidents: Living with High-Risk Technologies (Perrow 1984) introduces the concepts of system accident (see Definition 2.22) and normal accident. We return to the normal accident theory later in this chapter.




Table 8.1 Accident types.

  Name                Characteristics                                                     Example
  Sword of Damocles   Low probability, high consequence                                   Nuclear accidents
  Cyclops             Uncertain probability, high consequence with low uncertainty        Earthquakes
  Pythia              Both probability and consequence are uncertain                      Genetic engineering
  Pandora’s box       Probability, consequence, and causal relationships are uncertain    Persistent organic pollutants
  Cassandra           Probability and consequence are high, but effects are delayed       Manmade climate change
  Medusa              Low analyzed risk but perceived as unacceptable                     Electromagnetic fields

Source: Based on data from Klinke and Renn (2002)

In the terminology of Reason (1997), an accident that cannot be classified as a system accident may be called a component failure accident:

Definition 8.2 (Component failure accident) An accident arising from component failures, including the possibility of multiple and cascading failures (e.g. see Leveson 2004). ◽

If we go into more detail, accidents may be classified in different ways. One is related to the context in which the accident occurs. We may therefore distinguish between
• Nuclear accidents
• Process plant accidents
• Aviation accidents
• Railway accidents
• Ship accidents
• Road traffic accidents
• Occupational accidents
• Home accidents
… and so on

Each category can be divided further into several subcategories, as shown in Example 8.1 with respect to aviation accidents.

Example 8.1 (Aviation accidents) Aviation accidents can be classified as follows:
• Accident during takeoff or landing
• Technical failure during flight
• Controlled flight into terrain, power line, etc.
• Midair collision
• Harm to people in or outside the aircraft
• Other accidents (e.g. lightning strike, icing, and extreme turbulence)



Accidents may also be classified according to the asset that is harmed, for example, personnel accident or environmental accident. Another way of classifying accidents is according to severity, for example, as a minor accident, fatal accident, major accident, and catastrophe/disaster.

8.2.3 Major Accidents

We use the term "major accident" to describe the accidents that are the focus of this book. There are many different definitions of this term. Some examples are the following:

(1) Major aviation accidents are defined by the US National Transportation Safety Board as accidents in which any of three conditions is met:
    (a) The airplane was destroyed; or
    (b) There were multiple fatalities; or
    (c) There was one fatality, and the airplane was damaged substantially.
(2) Major accident as defined by the offshore safety directive of the EU (2013/30/EU):
    (a) an incident involving an explosion, fire, loss of well control, or release of oil, gas, or dangerous substances involving, or with a significant potential to cause, fatalities or serious personal injury;
    (b) an incident leading to serious damage to the installation or connected infrastructure involving, or with a significant potential to cause, fatalities or serious personal injury;
    (c) any other incident leading to fatalities or serious injury to five or more persons who are on the offshore installation where the source of danger occurs or who are engaged in an offshore oil and gas operation in connection with the installation or connected infrastructure; or
    (d) any major environmental incident resulting from incidents referred to in points (a), (b), and (c).
(3) The EU directive on railway safety (2004/49/EC) uses the term "serious accident" and defines this as "any train collision or derailment of trains, resulting in the death of at least one person or serious injuries to five or more persons or extensive damage to rolling stock, the infrastructure or the environment, and any other similar accident with an obvious impact on railway safety regulation or the management of safety; 'extensive damage' means damage that can immediately be assessed by the investigating body to cost at least EUR 2 million in total."

In all these cases, it may be observed that the definitions specify what types of accidents (e.g. airplane damaged, fire, derailment) can be classified as major accidents. Obviously, this differs from one industry to another.


Okoh and Haugen (2013) discuss several definitions used in the process and oil and gas industry and conclude that characteristics such as "sudden, acute, adverse, and unplanned" are used, that the consequences include both immediate and delayed effects, and that the consequences are "major" (although the precise definition of major varies). Some definitions include not just events with actual consequences but also events where the consequences potentially could have been major. For risk assessment, this is an important addition. For our purpose, we use the following definition:

Definition 8.3 (Major accident) An unexpected event that causes or has the potential to cause serious immediate or delayed consequences such as several casualties, extensive environmental damage, and/or extensive damage to other tangible assets. ◽

James Reason splits accidents into organizational accidents and individual accidents. Similarly, the Norwegian oil and gas industry distinguishes between major accidents and occupational accidents, whereas the process and chemical industries distinguish between process accidents and personal accidents. These terms correspond with Reason's terminology.

8.3 Accident Investigation

There are two main objectives of an accident investigation: (i) to assign blame for the accident, and (ii) to understand why the accident happened so that similar future accidents may be prevented. The second objective is of most interest for this book (e.g. see Leveson 2004). An accident investigation is always based on some underlying model of accident causation. If, for example, investigators consider an accident to be a sequence of events, they start to describe the accident as such a sequence and perhaps forget to consider other possibilities. As for all types of investigations, we have the general problem that "what-you-look-for-is-what-you-find" (e.g. see Lundberg et al. 2009). A survey of methods for accident investigations is given by Sklet (2004), and a more thorough introduction to accident investigation is given in the DOE Workbook Conducting Accident Investigations (U.S. DOE 1999).

8.4 Accident Causation

Effective prevention of accidents requires proper understanding of their causes. Several approaches to accident causation have been used throughout history, some of which are mentioned in this chapter.


8.4.1 Acts of God

Accidents that are outside human control are sometimes claimed to be acts of God. Examples include natural accidents caused by earthquakes, hurricanes, flooding, lightning, and so on. In earlier times, many accidents were considered acts of God, meaning that nobody could be held responsible for the accident and that there was no possibility of preventing the accident. Another, similar, explanation has been to blame accidents on "destiny." In practice, this has the same implications as saying that the accident was an act of God. This view has few supporters today. Many of the accidents that were earlier considered acts of God can now be prevented, or at least forecast and explained, and their consequences can be mitigated by many different means.

Remark 8.1 (Act of God) The term "act of God" is in common use as a legal term in English. This term usually refers to events that are outside human control, typically natural accidents, such as extreme storms and landslides. Acts of God may be given as acceptable reasons for not being able to fulfill a contract. ◽

8.4.2 Accident Proneness

Studies in the 1920s suggested that accidents were caused by individuals who were more disposed than others to being injured. It was claimed that these individuals had inherent characteristics that predisposed them to a higher probability of being involved in accidents. This theory, called the accident proneness theory, is very controversial, but is still influential in, for example, accident investigations by police. Some accident studies support the theory by showing that injuries are not distributed randomly. Other studies have shown that there is no scientific basis for the accident proneness theory.

Today's researchers tend to view accident proneness as associated with the propensity of individuals to take risk or to take chances. This provides a more positive view of safety, as behavior can be changed even if the propensity to take risk cannot.

8.4.3 Classification of Accident Causes

Several taxonomies of factors contributing to accidents have been developed. The taxonomies use such classifications as natural or man-made, active or passive, obvious or hidden, and initiating or permitting. On a general level, the causes and contributing factors are often classified as follows:

Direct causes. These are the causes that lead immediately to accident effects. Direct causes are also called immediate causes or proximate causes. They usually result from other, lower-level causes.


Root causes. These are the most basic causes of an accident. The process used to identify and evaluate root causes is called root cause analysis.

Risk-influencing factors (RIFs). These are background factors that influence the causes and/or the development of an accident.

The term "cause" is in itself a somewhat troublesome term, because it can be difficult to determine what is a cause and what is not. Usually, there is no discussion about direct causes, but when it comes to root causes and RIFs, it is often more difficult to say whether they should be classified as causes or not, because the causal relationship usually is not so clear. A useful way of distinguishing may be to divide causes in a different way, into two categories:

Deterministic causes. These are events and conditions that, if they are present, always lead to the next event in the scenario.

Probabilistic causes. These are events and conditions that, if they are present, increase the probability that the next event in the scenario will occur.

In reality, the large majority of what we call causes are probabilistic rather than deterministic causes. In this book, cause is used to describe both deterministic and probabilistic causes. Later, we shall see that different methods are used for modeling deterministic and probabilistic causes in risk analysis (Chapters 11 and 12).

Example 8.2 (Car brake failure) Assume that we are driving a car and are unable to stop because the brakes have failed. In this case, there is a direct causal relationship between "brakes fail" and "car not stopping." This can therefore be regarded as a deterministic cause. On the other hand, we could have had a situation where the road was icy. If the car is driving fast or the tires are worn out, the car may be unable to stop, but that is not necessarily always the case. We may then say that "icy road" is a probabilistic cause. ◽

Example 8.3 (Direct causes of airplane accidents) A study by the aircraft manufacturer Boeing determined the direct causes of airplane accidents in the period 1992–2001, as listed in Table 8.2. ◽

Example 8.4 (Car accident causes) The majority of car accidents can be attributed to one or more of the causal factors listed in Table 8.3 (more details may be found at http://www.smartmotorist.com). ◽
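To make the deterministic/probabilistic distinction in Example 8.2 concrete, the sketch below assigns illustrative conditional probabilities to the event "car does not stop." All numbers are invented for illustration only; the point is simply that a deterministic cause gives the next event with probability one, whereas a probabilistic cause only raises its probability.

```python
# Illustrative (invented) conditional probabilities for "car does not stop".

# Deterministic cause: if the brakes fail, the car does not stop.
p_not_stop_given_brake_failure = 1.0

# Probabilistic cause: an icy road increases the probability of not stopping,
# but does not make it certain.
p_not_stop_baseline = 0.001        # dry road, brakes working
p_not_stop_given_icy_road = 0.05   # icy road, brakes working

print("Deterministic cause (brake failure): P(not stopping) =",
      p_not_stop_given_brake_failure)
print("Probabilistic cause (icy road): P(not stopping) rises from",
      p_not_stop_baseline, "to", p_not_stop_given_icy_road)
```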

8.5 Accident Models

Accident models are simplified representations of accidents that have already occurred or might occur in real life. Each accident model has its own characteristics based on what types of causal factors it highlights (Kjellén 2000).


Table 8.2 Causes of airplane accidents 1992–2001.

Percent   Cause
66        Flight crew error
14        Aircraft error (mechanical, electrical, and electronic)
10        Weather
5         Air traffic control
3         Maintenance
3         Other (e.g. bombs, hijackings, and shoot-downs)

Source: Statistical summary of commercial jet airplane accidents. Worldwide operations 1959–2001, Boeing, airplane safety.

Table 8.3 Causes of car accidents (Example 8.4).

• Equipment failure
  – Brakes
  – Tires
  – Steering and suspension
• Weather conditions
  – Snow/ice
  – Rain
  – Wind
  – Fog
• Road design
  – Visibility
  – Surface
  – Traffic control devices
  – Behavioral control devices
  – Traffic flow
• Driver behavior
  – Speeding
  – Violation of rules
  – … and many more
• Road maintenance
  – Surface condition (e.g. potholes)
  – Salting and sanding
  – Maintenance activities
  – Construction activities

8.5.1 Objectives of Accident Models

Accident models may be used for many different purposes. Among these are

(1) Accident investigation
    (a) To assign blame for an accident.
    (b) To understand why the accident happened so that similar accidents may be prevented in the future.
(2) Prediction and prevention
    (a) To identify potential deviations and failures that may lead to accidents in new or existing technical and sociotechnical systems.
    (b) To propose changes to existing or new technical and sociotechnical systems to prevent deviations and failures that may lead to accidents.


(3) Quantification
    (a) To estimate probabilities of deviations and failures that may lead to accidents.
    (b) To provide input to quantitative risk assessments.

8.5.2 Classification of Accident Models and Analysis Methods

The available accident models may be classified in many ways. Any attempt at classification is, however, problematic, because several methods may either fall outside the classification or fit into several categories. A simple classification of the models and methods that are presented in this book is outlined below. Some models are described in other chapters of the book, for example, as modules of more comprehensive approaches. The respective chapters are then indicated in the classification scheme.

(1) Energy and barrier models. These models are based primarily on the simple hazard-barrier-asset model of Gibson (1961), shown in Figure 8.2.
    (a) Barrier analysis (Chapter 14)
    (b) Energy flow and barrier analysis (Section 14.9)
(2) Event sequence models. These models explain an accident as a sequence of discrete events that occur in a particular temporal order. Event sequence models are attractive because they are easy to understand and can be represented graphically, but may in many cases be too simple.
    (a) Heinrich's domino model
    (b) Loss causation model
    (c) Event tree analysis (Section 12.2)
    (d) Layer of protection analysis (LOPA) (Section 14.10)
(3) Event causation and sequencing models.
    (a) Root cause analysis
    (b) Fault tree analysis (Section 10.3)
    (c) Man-technology-organization (MTO)
    (d) Management oversight and risk tree (MORT)
    (e) Tripod-Delta
(4) Epidemiological accident models. These models regard the onset of accidents as analogous to the spreading of a disease, that is, as the outcome of a combination of factors, some manifest and some latent, which happen to exist together in space and time.
    (a) Reason's Swiss cheese model
(5) Systemic accident models. These models attempt to describe the characteristic performance on the level of the system as a whole, rather than that of specific cause-effect "mechanisms" or even epidemiological factors (Hollnagel 2004).
    (a) Man-made disasters (MMD)
    (b) Hierarchical sociotechnical framework
    (c) Systems-theoretic accident model and processes (STAMP)
    (d) Normal accident theory
    (e) High reliability organization (HRO)
(6) Accident reconstruction methods. These methods are used primarily in accident investigations to describe what really happened.
    (a) Sequentially timed events plotting (STEP)
    (b) MTO
    (c) Tripod-Beta
    (d) AcciMap

Observe that some accident models appear in more than one category. For a more thorough classification of accident models, see, for example, Qureshi (2008) and Hollnagel (2004). A comprehensive review of occupational accident models is given by Attwood et al. (2006).

Figure 8.2 The hazard-barrier-asset model: a hazard, a barrier, and an asset, where the barrier separates the hazard from the asset.

8.6 Energy and Barrier Models

Energy and barrier models are based on the idea that accidents can be understood and prevented by focusing on dangerous energies and the means by which such energies can reliably be separated from vulnerable targets (Gibson 1961; Haddon 1970, 1980). These models have had a great impact on practical safety management. The energy and barrier model has four basic elements.

Energy source. Most systems have a range of energy sources. The energy sources correspond to hazards, as shown in Figure 8.2. The energy–barrier model is the reason why we say that hazards often can be associated with energy in one form or another. Hazards were discussed in Chapter 2.

Barriers. Barrier was defined in Chapter 2 as "Physical or engineered system or human action (based on specific procedures or administrative controls) that is implemented to prevent, control, or impede energy released from reaching the assets and causing harm." Put simply, barriers are all the measures that we have put in place to reduce or control risk. The term barrier originally had a primarily physical meaning, but has later been extended to cover all sorts of risk reduction measures. Barriers are discussed more in Chapter 14.


Energy pathways. These are the pathways from the energy source to the vulnerable assets, which may go through air, pipe, wire, and so on.

Assets. The assets that are exposed to the energy may be people, property, the environment, and so on. Assets were discussed in Chapter 2.

The bow-tie model was introduced in Chapter 2 as a means of visualizing accident scenarios. The bow-tie model is based on energy and barrier model principles. It is not quite clear when the bow-tie model was first devised, but it is known to have been applied in courses on hazard analysis developed by the chemical company ICI in 1979 (Alizadeh and Moshashaei 2015). A review of the literature on bow-ties is provided by de Ruijter and Guldenmund (2016). The first major user of bow-ties was the international oil company Shell. Today, the bow-tie model is widely used and has been implemented in several software packages.

8.6.1 Haddon's Models

William Haddon was both a physician and an engineer who was engaged in designing safer roads in the United States in the late 1950s. He developed a framework for analyzing injuries based on three attributes:

Human. The person at risk of injury.

Equipment. The energy (e.g. mechanical, thermal, and electrical) that is transmitted to the human through an object or a path (another person, animal).

Environment. The characteristics of the environment in which the accident takes place (e.g. road, building, and sports arena) and the social and legal norms and practices in the culture and society at the time (e.g. norms about discipline, alcohol consumption, and drugs).

Haddon further considers the three attributes in the following three phases:

Preinjury phase. This is about stopping the injury event from occurring by acting on its causes (e.g. pool fences, divided highways, and good road or house design).

Injury phase. This is to prevent an injury or reduce the seriousness of an injury when an event occurs by designing and implementing protective mechanisms (e.g. wearing a mouth guard, seat belt, or helmet).

Postinjury phase. This is to reduce the seriousness of an injury or disability immediately after an event has occurred by providing adequate care (e.g. the application of immediate medical treatment such as first aid), as well as, in the longer term, working to stabilize, repair, and restore the highest level of physical and mental function possible for the injured person.

Haddon's accident prevention approach has three main components:
(1) A causal sequence of events
(2) Haddon's matrix
(3) Haddon's ten countermeasure strategies (discussed in Section 14.12)


Figure 8.3 Haddon's matrix (example for a traffic accident):

Phase        Human                                 Equipment                                   Environment
Preinjury    Training, alertness                   Maintenance, ESP system                     Road quality, weather
Injury       Reaction, robustness                  Airbag, headrest                            Midroad barrier
Postinjury   Trained medical personnel available   Equipment to rescue people trapped in car   Short distance to hospital

8.6.1.1 Causal Sequence of Events

To find which factors to include in the Haddon matrix, thinking in terms of a "causal sequence of events" that leads to injuries is recommended. See Chapters 11 and 12 for more about sequence-of-events models.

8.6.1.2 Haddon's Matrix

The Haddon matrix (Haddon 1980) is used to identify injury prevention measures according to the three phases (preinjury, injury, and postinjury) and the three attributes (human, equipment, and environment), as shown in Figure 8.3. The environment is sometimes divided into two subattributes: the physical and the social environments.

8.6.1.3 Haddon's 4Es

In Haddon's matrix, the three attributes human, equipment, and environment are used. Haddon later rephrased and extended this into four factors for injury prevention (i.e. risk reduction), which are often referred to as Haddon's 4Es.

E1: Engineering. Controlling the hazards through design changes, process changes, and maintenance.
E2: Environment. Making the physical and social environment safer.
E3: Education. Training workers and operators related to all facets of safety, providing the required information to individuals, and convincing management that attention to safety pays off.
E4: Enforcement. Ensuring that workers as well as management follow internal and external rules, regulations, and standard operating procedures.

8.7 Sequential Accident Models

Sequential accident models explain accidents as the result of a sequence of discrete events that occur in a particular order.


8.7.1 Heinrich's Domino Model

Heinrich's domino model is one of the earliest sequential accident models (Heinrich 1931). The model identifies five causal factors and events that are present in most accidents:

(1) Social environment and ancestry, related to where and how a person was raised, and undesirable character traits that may be inherited, for example, recklessness and stubbornness.
(2) Fault of the person, or carelessness, that is created by the social environment or acquired by ancestry.
(3) Unsafe act or condition that is caused by careless persons or poorly designed or improperly maintained equipment.
(4) Accident that is caused by an unsafe act or an unsafe condition in the workplace.
(5) Injury from the accident.

Heinrich (1931) arranges the five factors in a domino fashion, such that the fall of the first domino results in the fall of the entire row. The corresponding model is therefore called the domino model and is shown in Figure 8.4. Strictly speaking, the domino model is not an accident causation model, but rather a conceptual representation of how injuries occur. Heinrich focuses on people, not on technical equipment, as the main contributor to accidents. The sequence of events illustrates that there is not a single cause, but many causes of an accident. The domino model is sometimes called the sequence of multiple events model because it implies a linear one-to-one progression of events leading to an accident. It is deterministic, in the sense that the outcome is seen as a necessary consequence of one specific event. Just as in a game of dominoes, the removal of one factor will stop the sequence and prevent the injury. The domino model therefore implies that accidents can be prevented through control, improvement, or removal of unsafe acts or physical hazards. When the domino model was proposed, it represented a redirection in the search to understand an accident as being the result of a sequence of discrete events.

Figure 8.4 The domino model: social environment and ancestry – fault of person – unsafe act/condition – accident – injury.


The domino theory was later criticized because it describes accidents in far too simple a manner. It has also been criticized because it models only single sequences of events, and because it does not try to explain why unsafe acts are taken, or why hazards arise.

8.7.2 Loss Causation Model

The loss causation model, a modified version of the original domino model, was developed by Bird and Germain (1986) on behalf of the International Loss Control Institute. The model, which is often referred to as the ILCI model, considers an accident as a sequence of events involving:

(1) Lack of management control
(2) Basic causes (personal factors or job factors)
(3) Immediate causes (substandard acts and conditions)
(4) Incident (contact with energy, substance, and/or people)
(5) Loss (people, property, environment, and material)

The main elements of the loss causation model are shown in Figure 8.5. The three last elements of the loss causation model are similar to those of the domino model, except that the term injury has been replaced by the more general term loss, and the term accident has been replaced by incident. The change from accident to incident indicates that the model is also suitable for analyzing undesired events that do not necessarily lead to significant loss. The first two elements represent the root causes, focusing on management factors such as inadequate program, inadequate standards, job factors, and personal factors. Observe that these factors are similar to those found in quality assurance programs. The five main elements of the loss causation model may be described briefly as follows (e.g. see Sklet 2002):

Figure 8.5 The main elements of the loss causation model: lack of management control (inadequate program, inadequate program standards, inadequate compliance to standards) → basic causes (personal factors, job factors) → immediate causes (substandard acts and conditions) → incident (contact with energy or substance) → loss (people, property, environment, quality).


Lack of management control may indicate a lack of internal standards concerning elimination or reduction of risk, which may be related to
• Hazard identification and risk reduction
• Performance appraisal
• Communication between employees and management
It may also be that internal standards are in place but are outdated or inadequate. Another and often interrelated cause may be that management and/or employees do not follow the established internal standards.

Basic causes are divided into two main groups, personal factors and job factors. The personal factors may be related to
• Physical capacity/stress
• Mental capacity/stress
• Knowledge
• Skill
• Motivation
The job factors may be related to inadequate:
• Supervision
• Engineering
• Purchasing
• Maintenance
• Tools and equipment
• Work standards
• Quality/reliability of tools and equipment

Immediate causes are acts and conditions that may be related to
• Inadequate barriers (including personal protective equipment)
• Inadequate alarms/warning systems
• Poor housekeeping
• Influence of alcohol/drugs
… and so on

Incident describes the contact between the energy source and one or more assets (e.g. people, property, and the environment) and is the event that precedes the loss. The immediate causes of an incident are the circumstances that immediately precede the contact. These circumstances can usually be sensed and are often called unsafe acts or unsafe conditions. In the loss causation model, the terms substandard acts (or practices) and substandard conditions are used.

Loss is the potential result of an incident in relation to one or more assets.

Checklists play a major role when investigating accidents based on this model. A special chart has been developed for accident investigation, which acts as a checklist and reference to ensure that the investigation covers all facets of an accident.


Figure 8.6 Rasmussen and Svedung's accident model: root cause → causal chain → hazardous event → event chains → persons, assets.

The International Safety Rating System (ISRS, a registered trademark of DNV GL, Høvik, Norway), which is operated by DNV GL, is based on the loss causation model.

8.7.3 Rasmussen and Svedung's Model

Rasmussen and Svedung (2000) describe an accident as a linear sequence of causes and events, as shown in Figure 8.6. The model has a structure similar to that of an accident scenario in a bow-tie diagram, but its interpretation is different. The main elements in Rasmussen and Svedung's model are

(1) Root cause. According to the terminology used in this book, this represents a hazard or a threat, or a combination of several hazards/threats. Actions to prevent accidents of similar types are to eliminate or reduce these hazards and threats.
(2) Causal sequences. Proactive barriers are installed in most systems to prevent hazardous events. For a hazardous event to occur, the causal sequences must have found a loophole in these barriers, or the barriers must have been too weak to withstand the loads from the hazards. Actions to prevent the accident are then to install new barriers or to improve existing ones.
(3) Hazardous event. A hazardous event inevitably leads to harm to people, the environment, or other assets if the event is not "contained." As was discussed in Chapter 2, it may be difficult in some cases to decide which event in an event sequence should be defined as the hazardous event.
(4) Event sequences. As for the causal sequences, most well-designed systems have barriers intended for stopping or reducing the development of consequence sequences following a hazardous event. These barriers are often called reactive barriers. To create harm, there must be "holes" in these barriers. Actions to prevent the accident are then to install new barriers or to improve existing ones.
(5) Persons, assets. These are the assets that are harmed during an accident. To reduce the harm in similar future accidents, it is necessary to reduce the exposure, strengthen the defenses, establish better first-aid systems, and/or implement other mitigating actions.

Rasmussen and Svedung's model as presented in Figure 8.6 is a small part of a much more comprehensive accident model framework; see Section 8.10.2 and Rasmussen and Svedung (2000) for more details.


8.7.4 STEP

The STEP method was developed by Hendrick and Benner (1987). STEP is principally an accident investigation tool whose main purpose is to reconstruct an accident by plotting the sequence of contributing events and actions into a STEP diagram. The accident is viewed as a process that starts with an undesired change in the system and ends in an event where some assets are harmed. The structure of a STEP diagram is shown in Figure 8.7. The main elements of a STEP diagram are

(1) The start state describes the normal state of the system.
(2) The initial event is the event that disturbed the system and initiated the accident process. The initial event is an unplanned change done by an actor.
(3) The actors that changed the system or intervened to control the system. An actor does not need to be a person. Technical equipment and substances can also be actors. The actors are listed in the column "Actors" in the STEP diagram.
(4) The elementary events. An event is an action committed by a single actor. Events are used to develop the accident process and are drawn as rectangles in the STEP diagram. A brief description of each event is given in the rectangles.
(5) The events are assumed to flow logically in the accident process. Arrows are used to illustrate the flow.
(6) A timeline is used as the horizontal axis in the STEP diagram for recording when the events started and ended. The timeline does not need to be drawn on a linear scale, as the main point is to keep the events in order, that is, to define their temporal relation.
(7) The end event of the STEP diagram is the point where an asset is harmed and the point that defines the end of the diagram.

Figure 8.7 STEP diagram (main elements): actors (A, B, C, D) listed vertically, with their events plotted along a horizontal timeline running from the start state to the end event.


The actors can be involved in two types of events: they can introduce a change (deviation), or they can correct a deviation. The action can be physical and observable, or mental if the actor is a person (e.g. see Sklet 2002). The analysis starts with definitions of the initial event and the end event of the accident sequence. Then the analysts identify the main actors and the events (actions) that contributed to the accident, and position the events in the STEP diagram. The following information should be recorded:
• The time at which the event started
• The duration of the event
• The actor that caused the event
• The description of the event
• A reference to where/how the information was provided

All events should have incoming and outgoing arrows to indicate “preceded by” and “following after” relationships between events. Special attention should be given to the interactions between the actors. Observe that the focus should be on actions rather than their reasons. When developing the STEP diagram, the analysts should repeatedly ask: “Which actors must do what to produce the next event?” If an earlier event is necessary for a later event to occur, an arrow should be drawn from the preceding event to the resulting event. For each event in the diagram, the analysts should ask: “Are the preceding events sufficient to initiate this event or were other events necessary?” The accuracy of the event representation may be checked by using the back-STEP technique, which reasons backwards in order to determine how each event could be made to occur. Reasoning backwards helps the analysts to identify other ways in which the accident process could have occurred and measures that could have prevented the accident. In this way, STEP can also be used to identify safety problems and to make recommendations for safety improvements (Kontogiannis et al. 2000).
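In essence, a STEP worksheet is a set of timed events, each belonging to one actor and linked by "preceded by" arrows. The sketch below shows one hypothetical way such a worksheet could be represented and checked in code; the actor names, events, and times are invented, and this data structure is an illustration only, not part of the STEP method itself.

```python
from dataclasses import dataclass, field

@dataclass
class StepEvent:
    actor: str            # who (or what) performed the action
    start: float          # start time (any consistent unit)
    duration: float       # duration of the action
    description: str      # brief description of the action
    source: str = ""      # reference to where the information came from
    preceded_by: list = field(default_factory=list)  # earlier, necessary events

# A tiny, invented accident process with three actors.
e1 = StepEvent("Operator", 0.0, 1.0, "Opens bypass valve")
e2 = StepEvent("Process", 1.0, 5.0, "Pressure rises in vessel", preceded_by=[e1])
e3 = StepEvent("Alarm system", 3.0, 0.1, "High-pressure alarm fails to sound",
               preceded_by=[e2])
e4 = StepEvent("Process", 6.0, 0.5, "Vessel ruptures (end event)",
               preceded_by=[e2, e3])

def check_order(events):
    """Warn if an event starts before one of its preceding events."""
    for ev in events:
        for pre in ev.preceded_by:
            if pre.start > ev.start:
                print(f"Ordering problem: '{pre.description}' "
                      f"should precede '{ev.description}'")

check_order([e1, e2, e3, e4])
```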

8.8 Epidemiological Accident Models

Epidemiological accident models consider the events leading to an accident as analogous to the spreading of a disease. An accident is in this view conceived as the outcome of a combination of factors, some manifest and some latent, that happen to coexist in time and space. A good account of this work has been provided by Reason (1990, 1997, 2016).

8.8.1 Reason's Swiss Cheese Model

The Swiss cheese model was developed by James Reason, who uses slices of Swiss cheese as an analogy to barriers.


Figure 8.8 Reason's Swiss cheese model: latent conditions created at the levels of decision-makers and line managers, and active failures at the level of operators, pass through holes in successive safety barriers along the causal sequence and end in an accident.

Figure 8.8 shows the principal idea behind the model, namely that barriers may, like cheese slices, have holes of different sizes and at different places. The holes are referred to as active failures or latent conditions. The Swiss cheese model shows the development of an accident from these latent conditions to active failures, which both penetrate a series of safety barriers and eventually lead to accidents. This is shown by an arrow passing through the holes of several slices of cheese in Figure 8.8. The main elements of the Swiss cheese model are

(1) Decision-makers. This typically includes corporate management and regulatory authorities, who are responsible for the strategic management of available resources to achieve and balance two distinct goals: the goal of safety and the goal of time and cost. This balancing act often results in fallible decisions.
(2) Line managers. These are the people who are responsible for implementing the decisions made by the decision-makers. They adopt and implement the everyday activities of the operations.
(3) Operators. These are the people who operate and maintain the system and perform the actions required to implement the decisions.
(4) Latent conditions. The decisions made by the decision-makers and the actions implemented by line management create many of the preconditions in which the workforce attempts to carry out their responsibilities safely and effectively.
(5) Active failures. These are failures of technical systems and components and unsafe acts carried out by the workers/operators.


(6) Safety barriers. These are the defenses or safeguards that are in place to prevent injury, damage, or costly interruptions in the event of an active failure.

Several versions of the Swiss cheese model have been presented, and the model has been improved and changed over the years.

Example 8.5 (Herald of Free Enterprise) The roll-on roll-off car and passenger ferry Herald of Free Enterprise capsized on 6 March 1987, just after having left the Belgian port of Zeebrugge. A total of 193 passengers and crew were killed in the accident, which has been analyzed in detail using many different methods. In relation to the Swiss cheese model, the main contributing factors to the accident are seen to be

Decision-makers:
• Inherently unsafe "top-heavy" ferry design
• Failure to install bow door indicator
Line managers:
• Negative reporting culture
• Poor rostering
Latent conditions:
• Fatigue
• Choppy sea
• Pressure to depart early
• "Not my job" culture
Active failures:
• Assistant bosuns fail to shut bow doors
• Captain leaves port with bow doors open
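A common, highly simplified way to quantify the Swiss cheese picture is to treat each barrier as failing (having a hole in the right place) independently with some probability, so that the probability of an accident trajectory passing all barriers is the product of these probabilities. The sketch below uses invented numbers and barrier names, and it ignores the dependencies and common latent conditions that the model itself emphasizes; it is an illustration of the barrier idea, not Reason's own quantification.

```python
# Invented probabilities that each barrier fails to stop the scenario.
barrier_failure_probs = {
    "Design review":        0.05,
    "Operating procedure":  0.10,
    "Alarm and shutdown":   0.02,
    "Physical containment": 0.01,
}

# Assuming independent barriers, the hazard penetrates all of them with the
# product of the individual failure probabilities.
p_all_fail = 1.0
for name, p in barrier_failure_probs.items():
    p_all_fail *= p

print(f"P(all {len(barrier_failure_probs)} barriers fail) = {p_all_fail:.2e}")
# Latent conditions and common causes typically make barriers dependent,
# so the real probability can be much higher than this product.
```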



8.8.2 Tripod

Tripod is a safety management approach that was developed in a joint project by the University of Leiden (The Netherlands) and the University of Manchester (UK) for use in the oil and gas industry. The project was initiated in 1988 on commission from the oil company Shell (Reason 1997). The Tripod safety management system is now called Tripod-Delta to distinguish it from its close relative, Tripod-Beta:

Tripod-Delta is a safety management system and a proactive method for accident prevention.

Tripod-Beta is a method for accident investigation and analysis. As such, it is a reactive approach that is used mainly after an accident has taken place to prevent recurrence.

To prevent the recurrence of an accident, it is not sufficient to understand what happened; it is even more important to understand why it happened. Accident investigation is therefore a necessary part of any safety management system.


Figure 8.9 The three feet of Tripod-Delta: basic risk factors, hazards and unsafe acts, and accidents, incidents, and losses.

8.8.2.1 Tripod-Delta

Tripod-Delta got its name from the three-part structure in Figure 8.9, consisting of
(1) Basic risk factors
(2) Hazards and unsafe acts
(3) Accidents, incidents, and losses

Tripod-Delta focuses on all levels of the organization, not only on the immediate causes of an accident. Accidents occur because protective barriers fail, and these may fail when people make mistakes or commit active failures. Active failures are more likely under certain preconditions, which are made possible by latent failures, which again are caused by some basic risk factors. The basic risk factors are in turn caused by fallible management decisions. This sequence of causes constitutes the Tripod accident causation model, which is illustrated in Figure 8.10. Wagenaar et al. (1990) call this a general accident scenario, as it shows the general pattern of factors leading to accidents. Accidents are the end-results of long chains of events that start with decisions at management level (Wagenaar et al. 1990).

8.8.2.2 Basic Risk Factors

Tripod-Delta emphasizes that the immediate causes of an accident (e.g. technical failures, unsafe acts, and human errors) do not occur in isolation but are influenced by one or more basic risk factors (earlier called general failure types).


The basic risk factors are latent and hidden in the organization, but may contribute to accidents indirectly. They have been compared to diseases – you cannot see them directly, only through their symptoms.

Definition 8.4 (Basic risk factors) "…those features of an operation that are wrong and have been so for a long time, but remain hidden because their influences do not surface without a local trigger" (Wagenaar et al. 1990). ◽

Tripod-Delta defines 11 basic risk factors that cover human, organizational, and technical problems, as listed in Table 8.4 (see Tripod Solutions 2007). The basic risk factors have been identified based on brainstorming, accident analyses, studies of audit reports, and theoretical studies (Groeneweg 2002). Ten of the basic risk factors lead up to hazardous events and are therefore sometimes called preventive factors, to reflect that it is possible to prevent accidents by improving these factors. The eleventh basic risk factor is aimed at controlling the consequences after an accident has occurred and is sometimes called the mitigating factor. Of the 10 preventive basic risk factors, five are generic factors (6–10 in Table 8.4) and five are specific factors (1–5 in Table 8.4). Most of the basic risk factors are determined by decisions or actions taken by planners, designers, or managers who are far away from the scene of a potential accident. The basic risk factors will, by their nature, have a broad impact. Identifying and controlling these factors therefore has a wider benefit than influencing the immediate cause of a specific accident scenario.

The preconditions in Figure 8.10 are the environmental, situational, or psychological system states, or even states of mind, that promote, or directly cause, active failures. Preconditions form the link between active and latent failures and may be viewed as the sources of human error (see Chapter 15). The original accident causation model in Figure 8.10 has been updated to better fit the current version of Tripod-Delta. The updated model is shown in Figure 8.11.

Figure 8.10 The Tripod accident causation model: fallible management decisions lead to latent failures (the basic risk factors), which create preconditions for active failures; together with failed controls or defenses, these result in an accident. Source: Adapted from Wagenaar et al. (1990).


Table 8.4 Basic risk factors.

1. Hardware – Poor quality, condition, suitability, or availability of materials, tools, equipment, or components
2. Design – Ergonomically poor design of tools or equipment (not user friendly)
3. Maintenance management – No or inadequate performance of maintenance tasks and repairs
4. Housekeeping – No or insufficient attention given to keeping the work floor clean or tidied up
5. Error-enforcing conditions – Unsuitable physical performance of maintenance tasks and repairs
6. Procedures – Insufficient quality or availability of procedures, guidelines, instructions, and manuals (specifications, "paperwork," use in practice)
7. Incompatible goals – The situation in which employees must choose between optimal working methods according to the established rules on the one hand, and the pursuit of production, financial, political, social, or individual goals on the other
8. Communication – No or ineffective communication between the various sites, departments, or employees of a company or with the official bodies
9. Organization – Shortcomings in the organization's structure, the organization's philosophy, organizational processes, or management strategies, resulting in inadequate or ineffective management of the company
10. Training – No or insufficient competence or experience among employees (not sufficiently suited/inadequately trained)
11. Defenses – No or insufficient protection of people, material, and environment against the consequences of the operational disturbances

Source: Adapted from Tripod Solutions (2007).

8.8.2.3 Tripod-Beta

Tripod-Beta is a method used to perform accident analysis in parallel with accident investigation. A comprehensive description can be found in Energy Institute (2017). Feedback from the analyses gives investigators the opportunity to validate findings, confirm the relevance of risk management measures, and identify new investigation possibilities. As for Tripod-Delta, Tripod-Beta can be used for both accidents and operational disturbances.

Figure 8.11 Updated Tripod-Delta accident causation model: fallible management decisions and latent failures (the 10 preventive BRFs) give rise to psychological precursors and substandard acts; through breached barriers these lead to an operational disturbance, and, if further barriers are breached, to an accident and its consequences.

Tripod-Beta merges two different models, the hazard and effects management process (HEMP) model and the Tripod-Delta theory of accident causation. The HEMP model is used to illustrate the actors and the barriers that contributed to a particular accident, as shown in Figure 8.12. Two main elements are required for an accident to occur: (i) a hazard and (ii) a target (or asset). (In recent versions of Tripod-Beta, agent or agent of change has replaced the term hazard, and object has replaced the term target.) The hazard is usually controlled by one or more barriers, and the target is normally protected by defenses or mitigating barriers. When the hazard controls and the defenses fail at the same time, as shown in Figure 8.12, an accident occurs.

Figure 8.12 The basic HEMP model as part of Tripod-Beta: a hazard (e.g. hot pipework) behind a failed control and a target (e.g. an operator) behind a failed defense meet in an event (e.g. the operator gets burned).

Figure 8.11 presents the accident causation model that is used in Tripod-Beta for identifying the causes of the breached controls and defenses illustrated in the HEMP model. Tripod-Beta is different from conventional approaches to accident investigation because no research is done to identify all the contributing substandard acts or clusters of substandard acts. Rather, the aim of the investigation is to find out whether any of the basic risk factors are acting.


When the basic risk factors have been identified, their impact can be reduced or even eliminated. Thus, the source of the problems is tackled instead of the symptoms.

8.9 Event Causation and Sequencing Models

This section presents two different approaches to modeling the causes of an accident and the sequence of events leading up to it.

8.9.1 MTO-Analysis

The basis for the MTO method is that humans, technology, and the organization are equally important when analyzing accidents (Sklet 2002).

M: (Man). This category comprises the workers at the sharp end. Attributes to be considered can be classified as follows:
• Generic
• Specific (gender, race, age, and education)
• Personality (values, beliefs, and trust)

T: (Technology). This category includes equipment, hardware, software, and design. The attributes may be classified as follows:
• Function level
• System level
• Human-machine interaction (interface details)
• Automation level
• Transparency

O: (Organization). This category comprises management, owners, and authorities. The attributes may be classified as (e.g. see Rasmussen 1997):
• National/government
• Regional
• Company (culture and environment)
• Risk management

MTO-analysis of an accident is based on three main elements (e.g. see Sklet 2002):

(1) A structured analysis by use of an event and cause diagram as shown in Figure 8.13, to describe the accident event sequence. Possible technical and human causes of each event are identified and positioned vertically in relation to the events in the diagram.
(2) A change analysis describing how the conditions and events when the accident occurred deviated from what is considered normal or common practice. Both the normal situation and the deviation (the situation when the accident occurred) are illustrated in the diagram.

Figure 8.13 MTO diagram (main elements): three horizontal bands. Deviations from normal practice are noted in the top band; the middle band contains the event and causal analysis, with the event sequence (Event 1, Event 2, …, Event n, each with a date/time) drawn horizontally and the causes/factors of each event positioned below it; failed or missing barriers (with descriptions) are shown in the bottom band.

(3) A barrier analysis identifying technical, human, and administrative barriers that have failed or are missing. Missing or failed barriers are represented below the events in the diagram. Barrier analysis is discussed further in Chapter 14.

The main elements of the resulting diagram, called an MTO diagram, are shown in Figure 8.13. The basic questions in the analysis are the following:
(1) What could have prevented continuation of the accident sequence?
(2) What could the organization have done in the past to prevent the accident?

The last important step in the MTO analysis is to identify and present recommendations. The recommendations might be technical, human, and/or organizational and should be as realistic and specific as possible. A checklist for identification of failure causes is also part of the MTO analysis (Sklet 2002). The checklist contains the following factors:
(1) Organization
(2) Work organization
(3) Work practice
(4) Management of work
(5) Change procedures
(6) Ergonomic/deficiencies in the technology
(7) Communication


(8) Instructions/procedures
(9) Education/competence
(10) Work environment

For each of these failure cause factors, there is a detailed checklist of basic or fundamental causes. Examples of basic causes for the failure cause "work practice" are the following:
• Deviation from work instruction
• Poor planning or preparation
• Lack of self-inspection
• Use of wrong equipment
• Wrong use of equipment

MTO analyses have been widely used for accident investigation in the Swedish nuclear industry and in the Norwegian offshore oil and gas industry (IFE 2009). The MTO approach has emerged in several variants, which differ according to their main focus. Acronyms such as OMT and TMO are also seen (see also Niwa 2009). The US Department of Energy has published a workbook on how to conduct accident investigations that in practice describes the three main elements in the MTO method, even if this name is not used (U.S. DOE 1999).

MORT

MORT approach was developed by William G. Johnson for the US nuclear industry in the early 1970s. MORT is described thoroughly in several user’s manuals and textbooks (e.g. see U.S. AEC 1973; Johnson 1980; Vincoli 2006; NRI 2009). MORT is based on energy and barrier concepts and is used primarily for three main purposes: (1) Accident investigation (2) Support for safety audits (3) Evaluation of safety programs MORT is a deductive technique that applies a predesigned generic tree diagram, called a MORT chart, which has gate symbols similar to those of a fault tree (see Section 11.3). The MORT chart contains approximately 100 problem areas and 1500 possible accident causes, which are derived from historic case studies and various research projects. The top structure of the MORT chart is shown in Figure 8.14.5 The tree structure is quite complex and is intended to be used as both a reference and a checklist. To perform a MORT analysis, you need to have the complete MORT chart available. This can be downloaded from http://www.nri.eu.com. 5 This top structure has been slightly reorganized to fit into one page.


Figure 8.14 The top structure of a generic MORT chart. The top event (injury, damage, performance lost or degraded; future undesired events?) is connected through an OR-gate to the assumed risks (R1–R5) and to an AND-gate combining oversights and omissions (S-branch: potentially harmful energy flow or condition, vulnerable people or objects, controls and barriers LTA, specific control factors LTA, stabilization and restoration LTA) with management system factors LTA (M-branch: "policy" LTA, implementation LTA, risk management system LTA). The legend distinguishes events resulting from the combination of more basic events acting through logic gates, basic events (independent of other events), events that are normally expected to occur, AND-gates (coexistence of all inputs is required to produce output), and OR-gates (output will exist if at least one input is present).


The top event of the MORT chart represents the accident (experienced or potential) to be analyzed. Once the extent of the accident has been established, the analyst arrives at the first logic gate, which is an OR-gate. The inputs to this OR-gate are the two main branches, denoted SM and R in Figure 8.14. The SM-branch is again split into two main branches, denoted S and M, which are connected through an AND-gate to indicate that both have to be considered in combination. The three main branches address different topics that each influence the top event:

S-branch. This branch contains factors representing specific oversights and omissions associated with the accident.

M-branch. This branch presents general characteristics of the management system that contributed to the accident.

R-branch. This branch contains assumed risk – risk aspects that are known, but for some reasons are not controlled.
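The gate logic works as in a fault tree: an OR-gate produces output if at least one input is present, and an AND-gate only if all inputs are present. The small sketch below evaluates a toy tree with the same top structure (S and M combined through an AND-gate, OR-ed with the assumed risks). The structure and truth values are invented for illustration and are in no way a substitute for the full MORT chart.

```python
def or_gate(*inputs):
    # Output exists if at least one input is present.
    return any(inputs)

def and_gate(*inputs):
    # Coexistence of all inputs is required to produce output.
    return all(inputs)

# Invented findings from a hypothetical analysis:
s_branch = True                  # specific oversights and omissions found (LTA)
m_branch = True                  # management system factors found LTA
assumed_risks = [False, False]   # R1, R2: no accepted, assumed risks involved

sm_branch = and_gate(s_branch, m_branch)
top_event = or_gate(sm_branch, *assumed_risks)

print("Top event (accident explained by the chart):", top_event)
```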

8.9.2.1 S-Branch

The S-branch focuses on the events and conditions of the accident, the potential harmful energy flows (hazards) or environmental conditions, the people or objects of value (i.e. the assets) that are vulnerable to an unwanted energy flow, and the controls and barriers that should protect these assets from harmful consequences. Haddon's 10 strategies (Haddon 1980) for accident prevention are key elements in this branch. Three basic types of barriers are considered in MORT:
(1) Barriers that surround and confine the energy source (i.e. the hazard)
(2) Barriers that protect the asset
(3) Barriers that separate the hazard and the asset physically in time or space

The time is not explicitly included, but the MORT chart is designed such that the time develops from left to right, and the sequence of causes develops from bottom to top. Factors that relate to the different life-cycle phases of the system are recognized at the next level of the S-branch (not shown in Figure 8.14). The phases are the project phase (design and plan), startup (operational readiness), and operation (supervision and maintenance). The idea here is to link barrier failures to their first occurrence in the life cycle.

8.9.2.2 M-Branch

The M-branch is used to evaluate why the inadequacies revealed in the S-branch have been allowed to occur. Events and conditions of the S-branch have their counterparts in the M-branch. At the M-branch, the analyst’s thinking is expanded to the total management system. Thus, any recommendations affect many other possible accident scenarios as well. The most important safety management functions are represented in the M-branch:


• Policy, goals, requirements, and so on
• Implementation
• Follow-up (including risk assessment)

These are the same basic elements that we find in the quality assurance principles of the ISO quality management standards (ISO 9000 2015).

8.9.2.3 R-Branch

The R-branch consists of assumed risk, that is, events and conditions that are known to the management and have been evaluated and accepted at the proper management level prior to the MORT analysis. Other events and conditions that are revealed through the evaluations following the S- and M-branches are denoted “less than adequate” (LTA). The MORT user’s manual contains a collection of questions related to whether specific events and conditions are “adequate” or LTA. Comments on best practices in the safety literature and criteria to assist the analyst in this judgment can also be found in the user’s manual. Although the judgments made by the analyst are partly subjective, Johnson (1980) claims that MORT considerably enhances the capability of the analyst to identify underlying causes in the accident analysis. The MORT approach has received international recognition, and it has been applied to a wide range of projects, from investigation of occupational accidents to hazard identification. The interest in MORT has increased thanks to the activities of the Noordwijk Risk Initiative Foundation, which has issued a new and updated MORT manual (NRI 2009) and made additional resources available on its Internet page, http://www.nri.eu.com.

8.10 Systemic Accident Models

The systemic accident models posit that an accident occurs when several causal factors (e.g. human, technical, and environmental) exist coincidentally in a specific time and space (Hollnagel 2004).

8.10.1 Man-Made Disasters Theory

The first accident model that specifically focused on major accidents was the MMD theory proposed by Turner (1978) and Turner and Pidgeon (1997). Until then, major accidents were regarded as acts of God or completely random occurrences. Turner was the first to consider major accidents as a phenomenon that could be systematically analyzed. He reviewed 84 accidents and developed his theory based on the knowledge he gained from studying these accidents. Turner distinguishes four stages of an accident.


Nominally normal situation. The starting point is a normal situation, with no major problems in the system being studied.

Incubation period. From the normal situation, the system state slowly moves toward a state where an accident may occur. This movement may be started by minor events that lead to changes in the status of barriers or other factors influencing risk. These changes are not perceived as critical, partly because information about what is happening does not reach the right persons. An example is when plant operators notice that a pressure indicator is not working properly, but because they know that other indicators can provide the information they need, they do not consider this an immediate problem and do not flag it as such. The problem hence goes unnoticed. Another example is when the work done deviates from procedures, but is accepted by everyone involved, such that it effectively becomes the normal way of working. In the incubation period, a number of such issues develop.

The accident occurs. Turner divides this into three substeps: (i) precipitating event, (ii) onset, and (iii) rescue. In practice, these represent the development of the event, the occurrence of loss, and the immediate corrective actions taken after the accident.

Cultural readjustment. This stage concerns what happens in society in the aftermath of the accident. Major accidents are usually investigated, and this often leads to the realization that risk had not been managed properly and that improvements are necessary. This process is what Turner refers to as "cultural readjustment."

The MMD theory suggests that accidents could be avoided if only the right persons had access to, and acted upon, information that was already available within the organization. The underlying assumption is then of course that it is possible to say in advance what information actually tells us whether an accident is imminent or not. This is a difficult assumption to verify because when we are analyzing accidents that have occurred, we always have the "benefit of hindsight," meaning that we know exactly what has happened and can claim that it was a mistake to overlook this information or to make that decision. Still, even if we cannot expect to be able to predict everything that can go wrong, a more systematic overview of the status of relevant factors that can influence risk is highly likely to be beneficial for risk management. Turner's theory is not so often used anymore, but it has had an important impact on later work on major accidents.

8.10.2 Rasmussen's Sociotechnical Framework

Chapter 1 indicated some causes for the increasing risk in modern sociotechnical systems. The main causes have been studied by Rasmussen (1997), who claims that the accident models that have been described so far in this chapter


are inadequate for studying accidents in such modern sociotechnical systems. Rasmussen advocates a system-oriented approach based on control theory concepts, proposing a framework for modeling the organizational, management, and operational structures that create the preconditions for accidents. This section is based on Rasmussen (1997), Rasmussen and Svedung (2000), and the comparative studies by Qureshi (2008).

8.10.2.1 Structural Hierarchy

Rasmussen (1997) views risk management as a control problem in the sociotechnical system, where unwanted consequences are seen to occur due to loss of control of physical processes. Safety depends on our ability to control these processes and thereby to avoid accidental side effects that cause harm to assets. The sociotechnical system involved in risk management is described by Rasmussen in several hierarchical levels, ranging from legislators to organization and operation management, to system operators. Figure 8.15 shows a hierarchy with six levels, but the number of levels and their labels may vary across industries (Qureshi 2008). The six levels in Figure 8.15 are (e.g. see Rasmussen 1997; Qureshi 2008):
(1) Government. This level describes the activities of the government, which controls safety in the society through policy, legislation, and budgeting.
(2) Regulators and associations. This level describes the activities of regulators, industrial associations, and unions, which are responsible for implementing the legislation in their respective sectors.
(3) Company. This level concerns the activities of the particular company.
(4) Management. This level concerns management in the particular company and their policies and activities to manage and control the work of their staff.
(5) Staff. This level describes the activities of the individual staff members who are interacting directly with the technology and/or the processes being controlled, such as control room operators, machine operators, maintenance personnel, and so on.
(6) Work. This level describes the application of the engineering disciplines involved in the design of potentially hazardous equipment and operating procedures for process control in, for example, nuclear power generation and aviation.
The types of knowledge required to evaluate the various levels in Figure 8.15 are listed to the left of each level in the figure, and the environmental stressors that are influencing the levels are listed on the right-hand side. Traditionally, each level is studied separately by different academic disciplines without detailed consideration of the processes at the lower levels.

Figure 8.15 Hierarchical model of sociotechnical system involved in risk management. Source: Reproduced from Rasmussen and Svedung (2000) with permission from the Swedish Civil Contingencies Agency.

The framework in Figure 8.15 points to a critical factor that is overlooked by the horizontal

research efforts: the additional need for "vertical" alignment across the levels. The organizational and management decisions made at higher levels should propagate down the hierarchy, whereas information about processes at lower levels should propagate up the hierarchy. This vertical flow of information forms a closed-loop feedback system, which plays an essential role in the safety of the overall sociotechnical system. Accidents are hence caused by decisions and actions by decision-makers at all levels, not just by workers at the process control level (see Qureshi 2008).


8.10.2.2 System Dynamics

It is not possible to establish procedures for every possible condition in complex and dynamic sociotechnical systems. In particular, this concerns emergencies, high-risk situations, and unanticipated situations (Rasmussen 1997). Decision-making and human activities are required to remain within the bounds of the workspace, which are defined by administrative, functional, and safety constraints. Rasmussen (1997) argues that in order to analyze the safety in a work domain, it is important to identify the boundaries of safe operations and the dynamic forces that may cause the sociotechnical system to migrate towards or cross these boundaries. Figure 8.16 shows the dynamic forces that can cause a complex sociotechnical system to modify its structure and behavior over time. The space of safe performance in which actors can navigate freely is confined within three boundaries, which relate to
(1) Individual unacceptable workload
(2) Economic failure
(3) Functionally acceptable performance (e.g. safety regulations and procedures)
If the boundary of functionally acceptable performance is exceeded, accidents occur. Due to the combined effect of management pressure for increased efficiency and a trend toward least effort, Rasmussen (1997) argues that behavior is likely to migrate toward the boundary of functionally acceptable performance. The exact boundary between acceptable and unacceptable risk is not always obvious to the actors, especially in complex systems where different actors attempt to optimize their own performance without complete knowledge as to how their decisions may interact with decisions made by other actors. At each level in the sociotechnical hierarchy, people are working hard to respond to pressures of cost-effectiveness, but they do not see how their decisions interact with those made by other actors at different levels in the system. Rasmussen claims that these uncoordinated attempts at adaptation are slowly but surely "preparing the stage for an accident." He therefore argues that efforts to improve safety-critical decision-making should focus on making the boundaries toward unacceptable risk visible and known, such that the actors are given the opportunity to control their behavior at the boundaries. Traditional strategies for ensuring safe handling of conflicting goals rarely meet these objectives (Qureshi 2008; Størseth et al. 2010).

8.10.3 AcciMap

AcciMap is an accident analysis diagram proposed by Rasmussen (1997).

Figure 8.16 Boundaries of safe operation. Source: Reproduced from Rasmussen (1997) with permission from Elsevier.

AcciMap shows the interrelationships between causal factors at all six levels in

Figure 8.15, thereby highlighting the problem areas that should be addressed to prevent similar accidents from occurring in the future. The analysis is carried out by asking why an accident happened, that is, by identifying the factors that caused it or failed to prevent its occurrence. This is repeated for each of the causal factors, so as to gain an understanding of the context in which the sequence of events took place. An AcciMap should not only include events and acts in the direct flow of events but may also serve to identify decisions made at higher levels in the sociotechnical system that, through normal work activities, have influenced the conditions leading to the accident, and how conditions and events interacted with one another to produce the accident (Svedung and Rasmussen 2002). The development of an AcciMap is therefore useful to highlight the organizational and systemic inadequacies that contributed to the accident, thereby ensuring that attention is not directed solely toward the events, technical failures, and human errors that led directly to the accident. The main structure and the symbols used in an AcciMap are shown in Figure 8.17. The stepwise construction of an AcciMap is described in detail by Rasmussen and Svedung (2000), who provide several examples. An illustrative example is also provided by Qureshi (2008).
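To make the linking of causal factors across system levels more concrete, the short sketch below represents an AcciMap-like structure as a directed graph in Python. The levels follow Figure 8.15, but the factor names, the helper function, and the simple upward "why" walk are invented for illustration only and are not part of the AcciMap method.

```python
# Minimal sketch of an AcciMap-like structure: causal factors are nodes
# tagged with the system level they belong to (cf. Figure 8.15), and
# "contributed to" relations are directed edges from cause to consequence.
# All factor names below are invented for illustration only.

factors = {
    "Budget cuts for inspections": "Government",
    "Infrequent audits": "Regulators and associations",
    "Production prioritized over maintenance": "Company",
    "Backlog of safety-critical work orders": "Management",
    "Degraded pressure relief valve not replaced": "Staff",
    "Vessel overpressure and release": "Work",
}

# Directed edges (cause -> consequence).
edges = [
    ("Budget cuts for inspections", "Infrequent audits"),
    ("Infrequent audits", "Production prioritized over maintenance"),
    ("Production prioritized over maintenance", "Backlog of safety-critical work orders"),
    ("Backlog of safety-critical work orders", "Degraded pressure relief valve not replaced"),
    ("Degraded pressure relief valve not replaced", "Vessel overpressure and release"),
]

def causes_of(factor: str) -> list[str]:
    """Answer the repeated 'why did this happen?' question for one factor."""
    return [cause for cause, effect in edges if effect == factor]

# Walk upwards from the critical event and print one causal chain per level.
current = "Vessel overpressure and release"
while current:
    print(f"[{factors[current]}] {current}")
    parents = causes_of(current)
    current = parents[0] if parents else None
```

The walk starts at the loss event and repeatedly asks "why," which mirrors how an AcciMap is built upward from the physical events toward decisions at higher system levels.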

Figure 8.17 AcciMap structure and symbols. Source: Reproduced from Rasmussen and Svedung (2000) with permission from the Swedish Civil Contingencies Agency.


The objective of AcciMap is to identify potential improvements, not to allocate responsibility for an accident. The diagram is therefore not established as a representation of facts, but rather as an identification of factors that should be improved to avoid future accidents. AcciMap is for this reason not solely an accident investigation tool but also a methodology for proactive risk management in a dynamic society (Rasmussen 1997). Svedung and Rasmussen (2002) designed a generic AcciMap to show the factors, especially the decisions, which may generate a typical scenario (i.e. a scenario that is representative of several accidents) in a particular domain of application. They have also constructed an ActorMap, which lists the actors involved in the generic AcciMap, from the company management level to the highest level (e.g. government). Further, an InfoMap has been introduced to deal with the interaction between actors in terms of the form and content of communications between the various decision-makers.

8.10.4 Normal Accidents

The normal accident theory was developed by Charles Perrow in 1979 when he was advising a Presidential commission investigating the nuclear accident at Three Mile Island, Harrisburg, PA. Perrow (1984) claims that some sociotechnical systems have properties that naturally lead to accidents. He identifies two important system characteristics that make complex sociotechnical systems especially prone to major accidents: interactive complexity and tight coupling. His claim is that accidents such as the Three Mile Island accident must be considered as "normal" consequences of interactive complexity and tight coupling in sociotechnical systems. His theory was therefore called "normal accident theory."

Definition 8.5 (Normal accident) Multiple failure accident in which there are unforeseen interactions that make it very difficult or impossible (with our current understanding of the system) to diagnose. ◽

Normal accidents are also called system accidents. Even if the normal accident theory asserts that the occurrence of accidents is inevitable, it does not mean that we should not, or cannot, do anything about them. In fact, the normal accident theory proposes a shift of focus within accident prevention. Accident analysis should "focus on the properties of systems themselves, rather than on the errors that owners, designers, and operators make in running them" (Perrow 1984). Perrow's conclusion is that in accident analysis, "what is needed is an explanation based upon system characteristics" (Huang 2007). Perrow introduces interactive complexity and tight coupling to characterize systems.


8.10.4.1 Interactive Complexity

Perrow (1984) claims that some sociotechnical systems, such as major nuclear power plants, are characterized by high interactive complexity, a concept that is slightly different from the complexity concept that was introduced in Chapter 4. Systems with high interactive complexity are difficult to control not only because they consist of many components but also because the interactions among components are difficult and sometimes impossible to comprehend. Perrow uses linear systems as a contrast to complex systems (see Figure 8.18). A system is linear when we can "understand" the system and can predict what will happen to the output of the system when we change the input. The terms input and output are interpreted here in a general way. Input changes may, for example, be a component failure, a human error, or a wrong set-point for a pressure switch. Linear interactions between the system components lead to predictable and comprehensible event sequences. Interactions are said to be nonlinear when we are not able to predict the effect on the system output when the input is changed. Nonlinear interactions therefore lead to unexpected event sequences. Nonlinear interactions are often related to feedback loops, which means that a change in one component may escalate due to a positive feedback or be suppressed by a negative feedback. Feedback loops are sometimes introduced to increase efficiency in the work process. However, interactive complexity makes abnormal states difficult to diagnose, because the conditions that cause them may be hidden by feedback controls that are introduced to keep the system stable under normal operation. Moreover, the effects of possible control actions are difficult to predict, because positive or negative feedback loops may propagate, attenuate, or even reverse the effect in an unforeseeable manner (e.g. see Rosness et al. 2004). Interactive complexity may be defined as:

Definition 8.6 (Interactive complexity) Failures of two or more components interact in an unexpected way – due to a multitude of connections and interrelationships. ◽

Some attributes of interactive complexity are listed in Table 8.5. The complexity can be technological or organizational – and sometimes a bit of both.

8.10.4.2 Tight Coupling

Another system characteristic that makes control difficult is tight coupling. Tightly coupled systems are characterized by the absence of “natural” buffers and have little or no slack. They respond to and propagate disturbances rapidly, such that operators do not have the time or ability to determine what is wrong. As a result, human intervention is both unlikely and improper (Sammarco 2005). Tight couplings are sometimes accepted as the price to be paid for increased efficiency. An example is just-in-time production, which is a production


Table 8.5 Attributes of interactive complexity.

Complex system attributes | Comments
Proximity | Close proximity of physical components or process steps, very little underutilized space
Common cause connections | Many common cause connections
Interconnected subsystems | Many interconnections – failures can "jump" across subsystem boundaries
Substitutions | Limited possibilities for substitution of people, hardware, or software. Strict requirements for each element
Feedback loops | Unfamiliar and unintended feedback loops
Control parameters | Multiple and interacting control parameters
Information quality | Indirect, inferential, or incomplete information
Understanding of system structure and behavior | Limited, incomplete, or incorrect understanding of the system and its structure

Source: Adapted from Sammarco (2003).

philosophy that allows companies to cut inventory costs but makes them more vulnerable if a link in the production chain breaks down. In other cases, tight couplings may be the consequence of restrictions on space and weight. On an offshore oil and gas platform, for example, the technical systems have to be packed tightly, which makes it more challenging to keep fires and explosions from propagating or escalating (Rosness et al. 2004). Tight coupling may be defined as follows:

Definition 8.7 (Tight coupling) Processes that are part of a system happen quickly and cannot be turned off or isolated – due to direct and immediate connections and interactions between components. ◽

The relations between interactions and coupling are indicated by some example systems in Figure 8.18. Perrow's concerns about tight coupling are expressed in the following quotation (Perrow 1984):

The subcomponents of a tightly coupled system have prompt and major impacts on each other. If what happens in one part has little impact on another part, or if everything happens slowly (in particular, slowly on the scale of human thinking times), the system is not described as tightly coupled. Tight coupling also raises the odds that operator interventions make things worse because the true nature of the problem may not be understood correctly.


Figure 8.18 Interactions and coupling.

Table 8.6 Attributes of tight coupling.

Tight coupling attributes | Comments
Time dependency | Delays in processing are not tolerated
Sequences | Processes are rigidly ordered and cannot be changed (A must follow B, etc.)
Flexibility | There is only one path to a successful outcome or to implement a function
Slack | Little or no slack is permissible in supplies, equipment, personnel, and in system structure or behavior – precise quantities of specific resources are required for a successful outcome
Substitutions | Substitutions of supplies, equipment, and personnel may be available, but are limited and designed-in

Source: Adapted from Sammarco (2003).

Some attributes of tight coupling are listed in Table 8.6, whereas some differences between tight and loose coupling are highlighted in Table 8.7.

Remark 8.2 (Criticism of the normal accident theory) The normal accident theory has provoked a lot of controversy, mainly because Perrow concludes that some technologies should be abandoned in their current form because they cannot be adequately controlled by any conceivable organization. Some analysts have criticized the normal accident theory because it does


Table 8.7 Characteristics of tight and loose coupling systems.

Tight coupling | Loose coupling
Delays in processing not possible | Processing delays are possible
Order of sequences cannot be changed | Order of sequence can be changed
Only one method is applicable to achieve the goal | Alternative methods are available
Little slack possible in supplies, equipment, personnel | Slack in resources possible
Buffers and redundancies may be available, but are deliberately designed-in, and there is no flexibility | Buffers and redundancies fortuitously available
Substitutions of supplies, equipment, and personnel may be available, but are limited and designed-in | Substitutions are fortuitously available

Source: Adapted from Perrow (1984)

not include any criteria for measuring complexity and coupling. Complexity measures are discussed further by Sammarco (2003, 2005). ◽
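To make the measurement question more tangible, the sketch below screens a system against the attributes in Tables 8.5 and 8.6 and places it in one of the quadrants of Perrow's chart (Figure 8.18). This is a rough illustration only; the yes/no questions and the 0.5 cut-off are arbitrary assumptions, not a validated measure of complexity or coupling.

```python
# Rough screening sketch: score a system against the interactive-complexity
# attributes (Table 8.5) and the tight-coupling attributes (Table 8.6), then
# suggest a quadrant in Perrow's chart (Figure 8.18). The answers and the
# 0.5 cut-off used below are illustrative assumptions only.

COMPLEXITY_ATTRIBUTES = [
    "Close proximity of components or process steps",
    "Many common cause connections",
    "Many interconnected subsystems",
    "Limited possibilities for substitution",
    "Unfamiliar or unintended feedback loops",
    "Multiple, interacting control parameters",
    "Indirect or incomplete information",
    "Limited understanding of system structure",
]

COUPLING_ATTRIBUTES = [
    "Processing delays are not tolerated",
    "Sequences are rigidly ordered",
    "Only one path to a successful outcome",
    "Little or no slack in resources",
    "Substitutions are limited and designed-in",
]

def quadrant(complexity_answers: list[bool], coupling_answers: list[bool]) -> str:
    """Return the Perrow quadrant suggested by the fraction of 'yes' answers."""
    complexity = sum(complexity_answers) / len(COMPLEXITY_ATTRIBUTES)
    coupling = sum(coupling_answers) / len(COUPLING_ATTRIBUTES)
    interaction = "complex" if complexity >= 0.5 else "linear"
    tightness = "tight" if coupling >= 0.5 else "loose"
    return f"{interaction} interactions, {tightness} coupling"

# Hypothetical screening of a chemical plant.
print(quadrant([True] * 6 + [False] * 2, [True, True, False, True, True]))
```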

8.10.5 High-Reliability Organizations

The theory of high-reliability organizations (HROs) was developed partly as a reaction to the challenges posed by the normal accident theory (La Porte and Consolini 1991; Weick and Sutcliffe 2007). HRO researchers observed that a number of complex, high-risk organizations (e.g. aircraft carriers, nuclear submarines, and air traffic control systems) had been able to operate for decades without any major accidents. This seemed to be in contrast to the normal accident theory and showed that it should be possible to prevent serious accidents by properly managed organizational processes and practices. The HRO perspective focuses on being proactive and on predicting and preventing potential dangers as early as possible. As such it is actually not an accident model, but rather a recipe for how to avoid accidents. In the same way as Haddon realized, this may of course also be turned around, to describe weaknesses. A central risk reduction strategy is to build organizational redundancy. This strategy requires that a sufficient number of competent personnel be available so that overlap in competence, responsibilities, and possibilities for observation can be achieved. Workplace design should allow, and even encourage, counsel seeking, observation of other people’s work, and intervention in case of erroneous actions. Moreover, it is necessary to build a culture that encourages questioning and intervention. Another strategy is to build organizations with a capacity for spontaneous, adaptive reconfiguration (Rosness et al. 2004).


According to Weick and Sutcliffe (2007), the HRO theory is based on five principles.
(1) Preoccupation with failure. This principle implies that HROs take all failures, small and large, seriously and regard them as prewarnings that accidents may happen at any time. This is related to detecting as many failures as early as possible but also to anticipating what failures may occur that can have serious consequences.
(2) Reluctance to simplify. This is related to how we interpret a situation that we encounter. In reality, we very seldom analyze a situation that we face in all details before we interpret a situation or an event. Instead, we usually identify a few characteristics and try to match the observations that we make to something that we have experienced earlier or to categories we have developed for classifying situations and events. This helps us to make sense of the situation more quickly, but we also risk losing critical information. If we simplify too much, we may overlook signals and symptoms telling us that this is a different situation that we need to handle in a different way.
(3) Sensitivity to operations. In many cases, there is a difference between what we believe about operations and how operations actually take place. This can lead to situations where wrong decisions are made by management because they believe that work is being done in a specific way (e.g. always following procedures), whereas reality is different. Understanding how operations actually are performed and acting in accordance with this is therefore the third principle of HRO.
(4) Commitment to resilience. Resilience is a concept that is defined in many different ways. We have earlier defined this as "the ability to accommodate change without catastrophic failure, or the capacity to absorb shocks gracefully." Weick and Sutcliffe (2007) describe three main elements of resilience: (i) to be able to absorb the effects of the event and continue working; (ii) to be able to recover from negative effects; and (iii) to learn from what has been experienced earlier. The point is that organizations should not just be able to avoid accidents, by making sure that they work according to the first three principles, but also need to prepare for situations where something has gone wrong. Resilience is an important element in this, being able to handle accidents that happen without catastrophic loss. Organizational redundancy is one way of building resilience.
(5) Deference to expertise. The final principle is about how an organization is flexible in who makes decisions, depending on the situation. Expertise is everywhere in an organization and does not follow the hierarchy. The true experts on operations are usually close to operations and not in the top management. HROs have the ability to let the best expertise make decisions in all situations, rather than following a fixed hierarchical structure.


Figure 8.19 Standard control loop. Source: Adapted from Leveson (2011).

HRO theory has been criticized because the examples used are organizations that are operating in highly unusual circumstances that do not apply to most organizations. A common HRO example is naval aircraft carriers, which do not operate in a market and do not have to make a profit. Safety can therefore always be a high priority for them. The same applies to hospital emergency rooms, which can be seen as being in a constant state of emergency, with unique operating rules. Sagan (1995) presents an interesting comparison of normal accident theory and HRO theory in his book The Limits of Safety: Organizations, Accidents and Nuclear Weapons, where he discusses accidents with nuclear weapons from the perspectives of both theories.

8.10.6 STAMP

STAMP (Systems-Theoretic Accident Model and Processes) is an accident causation model (Leveson 2011) that applies control theory. STAMP views an accident as caused by inappropriate or inadequate control or enforcement of safety constraints in the development, design, and operation of a system. The standard control loop of STAMP is shown in Figure 8.19. Control is also imposed by the organization's management functions and by the social and political system within which the organization exists. In accident analysis, the role of all these factors must be considered. Key elements of STAMP are the following:
• Safety constraints. Leveson does not explicitly define what a safety constraint is, but indirectly it is explained by saying that accidents occur because safety constraints were not enforced. A safety constraint can, for example, be (i)


the minimum separation distance between aircraft flying in a controlled airspace or (ii) the maximum pressure that is allowed in a pressure vessel.
• Hierarchical safety control structures. This concept is inspired by Rasmussen's model of sociotechnical systems in Figure 8.15.
• Process models. The final main element is process models, which form the basis for controlling the process. Four conditions are required for a process model:
– Goal condition. First, one or more goals need to be defined, such as to maintain pressure at a certain level or to ensure that all personnel have adequate training for doing their job.
– Action condition. Second, the controller must be able to control the process, through the actuator. Maintaining the pressure at a given level can be achieved, for example, by increasing or reducing the pressure of fluids flowing into a pressure vessel. Ensuring adequate training can be achieved by arranging courses at regular intervals.
– Model condition. A model of the system must be available, which describes what the effect of taking actions will be on the process. The controller must understand what effect changing the pressure of the flow has on the pressure in the vessel, to know how much change to make in the inflow. Similarly, she needs to understand how much training is required to reach the goal of adequate training for all personnel.
– Observability condition. Finally, the controller needs to be able to see what the effects of the actions are. For the pressure vessel example, this can be achieved through pressure sensors. For the training example, tests can be used to check whether the training has had the desired effect.
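The four process model conditions can be illustrated with a toy control loop for the pressure vessel example. The sketch below only mirrors the general structure of Figure 8.19; the class names, the proportional control rule, and all numerical values are assumptions made for illustration, not part of STAMP itself.

```python
# Toy control loop illustrating the four process model conditions:
# a goal (set-point), an action (adjust inflow), a model (assumed relation
# between inflow and pressure), and observability (a sensor reading).
# All numbers and the simple proportional rule are illustrative only.

class PressureVessel:
    """Very crude 'controlled process': pressure follows net inflow."""
    def __init__(self, pressure: float):
        self.pressure = pressure

    def step(self, inflow: float, outflow: float = 1.0) -> None:
        self.pressure += 0.1 * (inflow - outflow)

class Controller:
    def __init__(self, goal_pressure: float, gain: float = 2.0):
        self.goal = goal_pressure      # goal condition
        self.gain = gain               # part of the controller's process model

    def decide_inflow(self, measured_pressure: float) -> float:
        # Model condition: the controller assumes higher inflow -> higher pressure,
        # so it reduces inflow when the measured pressure is above the goal.
        error = self.goal - measured_pressure
        return max(0.0, 1.0 + self.gain * error)   # action condition (actuator command)

vessel = PressureVessel(pressure=6.0)
controller = Controller(goal_pressure=5.0)

for _ in range(20):
    measurement = vessel.pressure                    # observability condition (sensor)
    vessel.step(inflow=controller.decide_inflow(measurement))

print(f"Pressure after control: {vessel.pressure:.2f} (goal {controller.goal})")
```

If the sensor reading (observability) or the assumed relation between inflow and pressure (the model) is missing or wrong, the loop can no longer enforce the safety constraint, which is exactly the kind of control flaw STAMP directs attention to.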

8.10.6.1 CAST

CAST (Causal analysis based on STAMP) is an accident analysis method based on STAMP (Leveson 2011). CAST encompasses the following nine steps:
(1) Describe the system involved and the hazard causing the loss.
(2) Identify relevant safety constraints and requirements.
(3) Describe the hierarchical control structure that was in place.
(4) Determine the event chain (accident scenario) causing the loss.
(5) Analyze the accident at the lowest level in the hierarchy, the physical system level.
(6) Successively move upwards in the hierarchy and identify failures in control from one level to the level below.
(7) Analyze communication and coordination in and between all levels to identify contributors to the accident.
(8) Identify any weaknesses in the control structure that had developed over time and contributed to the accident.
(9) Propose recommendations.
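As a purely illustrative aid, the findings from steps (5)-(8) can be organized per level of the control structure before recommendations are proposed in step (9). The field names and example entries in the sketch below are assumptions for illustration and are not prescribed by CAST.

```python
# Illustrative record of CAST findings per control level (steps 5-8),
# used as input to the recommendations in step 9. Field names and the
# example content are invented for illustration only.
from dataclasses import dataclass

@dataclass
class ControlFlaw:
    level: str              # where in the hierarchical control structure
    constraint: str         # safety constraint that was not enforced
    flaw: str               # why control was inadequate at this level
    recommendation: str     # proposed improvement (step 9)

findings = [
    ControlFlaw("Physical system", "Pressure must stay below design limit",
                "Relief valve failed to open", "Introduce periodic valve testing"),
    ControlFlaw("Operations management", "Safety-critical maintenance must not be deferred",
                "Work orders postponed under production pressure",
                "Track and escalate overdue safety-critical work orders"),
]

for f in findings:
    print(f"{f.level}: {f.flaw} -> {f.recommendation}")
```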


Table 8.8 A brief comparison of accident models.

Model | How and why accidents occur | Implications for risk management
Energy–barrier | Accidents are the result of hazards (energy) out of control. This is caused by inadequate or failed barriers that protect vulnerable assets | Add more barriers or improve existing barriers
Man-made disasters | Accidents occur because information about weaknesses in the system does not flow and is not coordinated such that it can be acted upon | Improve information flow
Normal accidents | Accidents occur because systems are complex and tightly coupled | Reduce complexity and coupling in systems
High reliability organizations | This theory is focused on five properties of organizations that contribute to avoiding accidents rather than why accidents occur | Ensure that the organization functions in accordance with the five HRO principles
Rasmussen | Accidents are caused by inadequate control. The control structure is affected by conflicting objectives that can affect the balance | Ensure adequate controls are in place and make sure that safety objectives have sufficient priority

CAST has been used to analyze several major accidents, such as a major railway accident in China (Ouyang et al. 2010) and a ferry accident in Korea (Kim et al. 2016). A UK train accident has further been analyzed using the ATSB (Australian Transport Safety Bureau) accident analysis model, AcciMap and CAST, allowing comparison of the methods (Underwood and Waterson 2014).

8.11 Combining Accident Models

The accident models presented in this chapter reflect different views on what an accident is, why it happens, and how accidents can be avoided and controlled. In some respects, they may be regarded as "competing models," where we have to decide which one we believe in and do our analysis based on that. In our opinion, it is important to have a good understanding of all these models because applying different perspectives and models often can be useful and provide more insight into the problem. Table 8.8 summarizes briefly some of the most important accident models, in terms of how they describe accidents and what the implications are of the


models for risk management. This clearly shows that different views provide different ways of managing risk. It is not possible to say that one or the other is “right” or “wrong”; rather they tell us that we need to consider many different aspects of the system to manage risk in a robust manner.

8.12 Problems

8.1 Compare James Reason's accident categories (Section 8.2.2) with Jens Rasmussen's categories (Section 8.2.1) and try to classify Rasmussen's categories according to Reason's classification.

8.2 How would you define a major accident?

8.3 At the website of US National Transportation Safety Board (https://www.ntsb.gov/investigations/AccidentReports/Pages/AccidentReports.aspx), it is possible to find a large number of accident reports from transportation accidents in the United States. The reports normally end up with a conclusion on what is the probable cause, but often more factors that influenced the accident can be found when reading the complete reports. Read through one of the reports and identify deterministic and probabilistic causes that are mentioned in the report.

8.4 Why do we need accident models and what can we use them for?

8.5 What are the main types of accident models?

8.6 As described in this chapter, accidents can be classified in many different ways. Suggest some relevant subgroups of accidents for ships and mention some specific examples within each subgroup.

8.7 How can the insight you get by using Haddon's phase model on specific accidents be used in a risk analysis?

8.8 Describe the main differences between an energy barrier model and a sequential model.

8.9 Read about a traffic accident in a newspaper and fill in the information into a Haddon matrix. Discuss the extra insight you get by using this method.

8.10 Use Haddon's 10 countermeasures for the traffic accident that you looked at and propose specific risk reduction measures/strategies.


8.11 What are the main differences between a risk analysis and an accident analysis?

8.12 Five main types of accident models are described in this chapter: Energy Barrier, Sequential, Epidemiological, Event causation and sequencing, and Systemic models. What are the main features and differences between these types?

8.13 Apply the energy barrier model and Rasmussen's sociotechnical framework model to a traffic accident. Do you identify different causes of the accident when you apply different models?

8.14 Where would you place the railway station of a large city in Perrow's rectangle in Figure 8.18? Justify your choice.

References Alizadeh, S.S. and Moshashaei, P. (2015). The bowtie method in safety management system: a literature review. Scientific Journal of Review 4: 133–138. Attwood, D., Khan, F.I., and Veitch, B. (2006). Occupational accident models: where have we been and where are we going? Journal of Loss Prevention in the Process Industries 19 (6): 664–682. Bird, F.E. and Germain, G.L. (1986). Practical Loss Control Leadership. Loganville, GA: International Loss Control Institute. Energy Institute (2017). Guidance on Using Tripod Beta in the Investigation and Analysis of Incidents, Accidents, and Business Losses. Technical Report 5.1. London: Energy Institute. Gibson, J.J. (1961). The contribution of experimental psychology to the formulation of the problem of safety. In: Behavioral Approaches to Accident Research, 296–303. New York: Association for the Aid of Crippled Children. Groeneweg, J. (2002). Controlling the Controllable: Preventing Business Upsets, 5e. Leiden, The Netherlands: Global Safety Group Publications. Haddon, W. (1970). On the escape of tigers: an ecologic note. American Journal of Public Health and the Nation’s Health 8 (12): 2229–2234. Haddon, W. (1980). Advances in the epidemiology of injuries as a basis for public policy. Landmarks in American Epidemiology 95 (5): 411–421. Heinrich, H.W. (1931). Industrial Accident Prevention: A Scientific Approach. New York: McGraw-Hill. Hendrick, K. and Benner, L. (1987). Investigating Accidents with STEP. New York: Marcel Dekker. Hollnagel, E. (2004). Barriers and Accident Prevention. Aldershot: Ashgate.


Huang, Y.H. (2007). Having a new pair of glasses: applying systematic accident models on road safety. PhD thesis. Linköping, Sweden: Linköping University. IFE (2009). Assessing Organizational Factors and Measures in Accident Investigation. Technical Report IFE/HR/F-2009/1406. Kjeller, Norway: Institutt for Energiforskning (IFE) (in Norwegian). ISO 9000 (2015). Quality Management Systems: Fundamentals and Vocabulary. Tech. Rep. Geneva: International Organization for Standardization. Johnson, W.G. (1980). MORT Safety Assurance System. New York: Marcel Dekker. Khan, F.I. and Abbasi, S.A. (1999). Major accidents in process industries and an analysis of causes and consequences. Journal of Loss Prevention in the Process Industries 12: 361–378. Kim, T-e., Nazir, S., and Øvergård, K.I. (2016). A STAMP-based causal analysis of the Korean Sewol ferry accident. Safety Science 83: 93–101. Kjellén, U. (2000). Prevention of Accidents Through Experience Feedback. London: Taylor & Francis. Klinke, A. and Renn, O. (2002). A new approach to risk evaluation and management: risk-based, precaution-based, and discourse-based strategies. Risk Analysis 22 (6): 1071–1094. Kontogiannis, T., Leopoulos, V., and Marmaras, N. (2000). A comparison of accident analysis techniques for safety-critical manmachine systems. Industrial Ergonomics 25: 327–347. La Porte, T.R. and Consolini, P.M. (1991). Working in practice but not in theory: theoretical challenges of “high-reliability organizations”. Journal of Public Administration Research and Theory 1: 19–47. Leveson, N. (2004). A new accident model for engineering safer systems. Safety Science 42 (4): 237–270. Leveson, N. (2011). Engineering a Safer World. Cambridge, MA: MIT Press. Lundberg, J., Rollenhagen, C., and Hollnagel, E. (2009). What-you-look-for-iswhat-you-find: the consequences of underlying accident models in eight accident investigation manuals. Safety Science 47 (10): 1297–1311. Niwa, Y. (2009). A proposal for a new accident analysis method and its application to a catastrophic railway accident in Japan. Cognition, Technology & Work 11: 187–204. NRI (2009). NRI MORT user’s manual. Technical report NRI-1. The Noordwijk Risk Initiative Foundation. Okoh, P. and Haugen, S. (2013). Maintenance-related major accidents: classification of causes and case study. Journal of Loss Prevention in the Process Industries 26 (6): 1060–1070. Ouyang, M., Hong, L., Yu, M.H., and Fei, Q. (2010). STAMP-based analysis on the railway accident and accident spreading: taking the China Jiaoji railway accident for example. Safety Science 48: 544–555. Perrow, C. (1984). Normal Accidents: Living with High-Risk Technologies. New York: Basic Books.


Qureshi, Z.H. (2008). A Review of Accident Modelling Approaches for Complex Critical Sociotechnical Systems. Technical report DSTO-TR-2094. Edinburgh, Australia: Defence Science and Technology Organization. Rasmussen, J. (1997). Risk management in a dynamic society: a modelling problem. Safety Science 27: 183–213. Rasmussen, J. and Svedung, I. (2000). Proactive Risk Management in a Dynamic Society. Karlstad, Sweden: Swedish Rescue Services Agency (Currently: The Swedish Civil Contingencies Agency). Reason, J. (1990). Human Error. Cambridge: Cambridge University Press. Reason, J. (1997). Managing the Risks of Organizational Accidents. Aldershot: Ashgate. Reason, J. (2016). Organizational Accidents Revisited. Boca Raton, FL: CRC Press. Rosness, R., Guttormsen, G., Steiro, T. et al. (2004). Organizational Accidents and Resilient Organizations: Five Perspectives. STF38 A04403. Trondheim, Norway: SINTEF. de Ruijter, A. and Guldenmund, F. (2016). The bowtie method: a review. Safety Science 88: 211–218. Sagan, S.D. (1995). The Limits of Safety: Organizations, Accidents and Nuclear Weapons. Princeton, NJ: Princeton University Press. Sammarco, J.J. (2003). A normal accident theory-based complexity assessment methodology for safety-related embedded computer systems. PhD thesis. Morgantown, WV: College of Engineering and Mineral Resources, West Virginia University. Sammarco, J.J. (2005). Operationalizing normal accident theory for safety-related computer systems. Safety Science 43: 697–714. Sklet, S. (2002). Methods for Accident Investigation. ROSS report 200208. Trondheim, Norway: Norwegian University of Science and Technology. Sklet, S. (2004). Comparison of some selected methods for accident investigation. Journal of Hazardous Materials 111: 29–37. Størseth, F., Rosness, R., and Guttormsen, G. (2010). Exploring safety critical decision-making. In: Reliability, Risk, and Safety: Theory and Applications (ed. R. Bris, C.G. Soares, and S. Martorell), 1311–1317. London: Taylor & Francis. Svedung, I. and Rasmussen, J. (2002). Graphic representation of accident scenarios: mapping system structure and the causation of accidents. Safety Science 40: 397–417. Tripod Solutions (2007). Tripod beta, User guide. Tripod Solutions, Den Helder, The Netherlands. Turner, B.A. (1978). Man-Made Disasters. London: Wykeham Publications. Turner, B.A. and Pidgeon, N.F. (1997). Man-Made Disasters, 2e. Butterworth-Heinemann / Elsevier. Underwood, P. and Waterson, P. (2014). Systems thinking, the Swiss Cheese Model and accident analysis: a comparative systemic analysis of the Grayrigg


train derailment using the ATSB, AcciMap and STAMP models. Accident Analysis and Prevention 68: 75–94. U.S. AEC (1973). MORT – The Management Oversight and Risk Tree. Tech. Rep. AT(04-3)-821.Washington, DC: U.S. Atomic Energy Commission, Division of Operational Safety. U.S. DOE (1999). Conducting Accident Investigations. Tech. Rep. Washington, DC: U.S. Department of Energy. Vincoli, J.W. (2006). Basic Guide to System Safety, 2e. Hoboken, NJ: Wiley. Wagenaar, W.A., Hudson, P.T.W., and Reason, J.T. (1990). Cognitive failures and accidents. Applied Cognitive Psychology 4: 273–294. Weick, K.E. and Sutcliffe, K.M. (2007). Managing the Unexpected: Resilient Performance in an Age of Uncertainty, 2e. San Francisco, CA: Jossey-Bass.


9 Data for Risk Analysis

9.1 Types of Data

A wide range of information is necessary as input to any risk assessment, and in particular the data needs for quantitative risk assessments will be extensive. The required data may be grouped into two broad categories:

Descriptive data. These types of data describe the study object and the context in which it is placed, covering the technical systems, the organization, the operation, inputs and outputs, and the environmental factors influencing the study object. The descriptive data can be regarded as facts related to the present situation and can normally be established with limited uncertainty, even if the effort to collect the information may delimit how much detail we go into. Before applying the descriptive data in the analysis, we need to make assumptions about the future state of the study object and its operating context and modify the data based on these assumptions. Obviously, this introduces uncertainty in these data when used in risk assessment. Operational data as they are today may well change in the future due to increased or reduced production, and this will have to be reflected in the data sets that are used in the analysis. This information largely has to be provided as part of the operating context, but assumptions about the future state may also have to be made by the risk analyst.

Probabilistic data. These types of data cause most problems for the risk analyst. They are related to how likely it is that various negative events will take place in the future. Examples are how often hazardous events will occur, how often components and systems will fail, and how often operational errors will be made.

In the following, we give a more detailed overview of the main types of data that belong in these two categories. When discussing specific data sources and treatment of data, we mainly focus on the last group, probabilistic data.



9.1.1 Descriptive Data

Technical data. A variety of technical data are needed to understand all the functions of the study object and to establish system models, such as fault trees and event trees (see Chapters 11 and 12). For a process plant, the study team should, for example, have access to piping and instrumentation diagrams (P&IDs); which safety devices are, or will be, installed; which hazardous substances are present in the system; and where these substances are stored and used. Technical data are usually obtained from the system owner, equipment manufacturers, technical manuals, and so on.

Operational data. To understand how components and subsystems are operated and to establish flow and system models, many types of operational data are required. Procedures for normal operation, startup, and shutdown of the system are examples of this type of data. Procedures for handling abnormal situations are another category.

Production data. Production data cover all sorts of data describing the output from the study object. This can be the number of passengers per year, the number of cars produced each year in a car factory, and so on.

Maintenance data. These are data that tell us how the technical components and subsystems will be – or are planned to be – maintained, and how long the repair or downtime will be. Safety equipment, such as fire and gas detection systems and emergency shutdown systems, is often part of passive systems that must be tested to determine whether it is functioning. In such cases, it is necessary to know how the systems are tested, the time between tests, and the type and proportion of failures that can be revealed by a test.

Meteorological data. Weather conditions may affect both the probability of hazardous events and the consequences of such events. The dominant wind direction can, for example, determine in which direction a cloud of hazardous gas is likely to move.

Exposure data. To determine the consequences of a hazardous event, it is necessary to have information about where people reside and how often and how long they stay at the various locations. For workers in the system, it will also be relevant to know what personal protective equipment is used.

Environmental data. The environmental consequences of a potential accident depend on how fragile the environment is, what plants and animal species are exposed to impact, and so on.

External safety functions. In many cases, external safety functions, such as fire engines, ambulances, and hospitals, are important to limit the consequences of an accident. Data relating to the capacity and availability of such systems are then important input data to the risk analysis.

Stakeholder data. Usually, there are several stakeholders to a risk assessment. The stakeholders may, among other things, affect how the analysis should be


carried out and how it should be reported. It is important to recognize who are, or who may be, important stakeholders and to know their requirements regarding the risk analysis process and the reporting of the analysis.

9.1.2 Probabilistic Data

Accident data. The study team should have knowledge of previous accidents and near accidents in the same type of, or in similar, systems. Many databases containing descriptions of past accidents have been established (see Section 9.3 and Chapter 20).

Data on natural events. For some systems, natural events such as floods, landslides, storms, earthquakes, and lightning strikes are important causes of accidents. In such cases, it is important to have access to estimates of the magnitude and frequency of such events.

Hazard data. There are two main types of hazard data: (i) checklists of relevant hazards and (ii) information (e.g. fact sheets) about dangerous substances and dosages that will harm human beings and the environment. Lists of relevant hazards have been developed for several application areas. An example of such a list for machinery systems may be found in ISO 12100 (2010). Several organizations maintain databases with information about dangerous substances and can sometimes supply relevant and updated information to risk analysts. Previous experience related to dangerous situations and dangerous substances in the plant being analyzed is also important information. More knowledge is needed regarding the dangers related to mixtures of substances.

Reliability data. This is information related to how and how often the components and subsystems in a system may fail. Several reliability databases have been established: both generic and company-specific databases. A generic database provides average data for a rather wide application area, whereas a company-specific database is based on reported failures and other events from the actual application of the equipment (see Section 9.3.3). Some application areas have also established joint databases for several companies, such as OREDA (2015) for the offshore and onshore oil and gas industry. Some reliability databases provide failure rates for each relevant failure mode of the equipment, whereas other databases give only the total failure rates. In some cases, multiple components could fail because of common-cause failures (CCFs) (see Chapter 13), and it is hence necessary to obtain estimates of how often such failures will occur. Such estimates are found in very few databases, and in most cases, the estimates must be determined by using checklists (for details see Chapter 13).

Human error data. In sociotechnical systems, it is also necessary to estimate the probability of human errors. Human error data are typically given for "generic" activities and tasks, either on a very low level of detail or as more general


types of tasks. Methods for estimating human error probabilities (HEPs) are discussed in Chapter 15.

Consequence data. In addition to failure data, many types of data related to consequences are required, such as how often physical phenomena occur (ignition probability), how likely it is that certain consequence thresholds are exceeded (probability of exceedance of explosion overpressure of 1 bar), and other types of data. These data are normally calculated based on models of the physical processes, such as models for calculating ignition probabilities (e.g. see Cox et al. 1990).

Even this wide range of data categories is far from sufficient for all risk assessments. Discussing all the types of data that might be needed would be too extensive for this chapter. The data required for a specific analysis depend on the study object and the risk modeling. The data needs are usually evident when the modeling is completed. A listing of data typically included in risk analysis may be found in the Norwegian risk assessment standard (NS 5814 2008).

9.2 Quality and Applicability of Data

Probabilistic data for use in risk assessment are nearly always a problem to find. For several decades, significant effort has been devoted to the collection and processing of reliability and accident data. Despite this, the quality of the available data is still not good enough – and may never become "good enough" – because what we are trying to do is predict the future based on history. Because the technology and systems we are studying in most cases keep on changing, we have no guarantee that history can tell us the whole truth about the future. When we have found a source that contains data relevant for our risk assessment, there are a number of issues that we need to consider before applying the data.

• Industry and application. The first issue to consider is whether the data come from the same type of industry and the same application as we are planning to use them for. There are quite a few sources available for, for example, failure of pumps, but if we are going to analyze an offshore installation and the data are from the nuclear industry, is that sufficiently similar? In order to evaluate this, we need to look at where the pumps have been installed (enclosed area or area exposed to weather), what service they have been used for (water or oil), and how they have been maintained (is the maintenance policy to operate until breakdown or maintain regularly to prevent stops?). Other factors may also be relevant. Often, it is difficult to get precise answers to these questions from the data sources we are using.

• Age of data. Data sources often provide data that are quite old, maybe going back 20–30 years or even more. In most cases, both technology and operations have developed significantly in such a long period, implying that


a pump manufactured 30 years ago is likely to be quite different from a pump manufactured today. The older the data, the more critical we need to be when evaluating their potential use.

• Present and future technology trends. Similar to the previous item is technology trends that we see today or that we know will be coming. Electrical cars are increasingly common, and can we expect that, for example, data about fires in petrol cars are applicable for electrical cars? Again, we have to consider the effects of such changes on our expectations for the future compared to the past.

• Completeness and accuracy of data. A major problem with databases and other data sources is often that the data are not complete. Underreporting has been found to be a significant problem in several cases, casting doubts on whether the reported failure rates are realistic or whether they are too low (e.g. see Hassel et al. 2011). Errors in reporting are a common problem, potentially leading to wrong classification of reported events. Both issues are difficult to evaluate. Even if we go into the details of how reporting and recording is being done, it is still difficult to know whether underreporting is a major issue.

• Extent of data. Finally, it is important to consider the extent of data that the accident and failure rates are based on. There are cases where published data sources of accident rates have been based on as little as one single accident. The statistical uncertainty in the estimates is then obviously high.

Later in this chapter, further discussion is provided specifically on quality of reliability databases.
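To illustrate how the extent of data affects the statistical uncertainty, the sketch below estimates a constant failure rate from a hypothetical number of failures and aggregated time in service, together with a classical two-sided confidence interval based on the chi-square distribution under a homogeneous Poisson process assumption. The numbers are invented, and the interval is only one of several ways to express this uncertainty.

```python
# Hypothetical example: n failures observed during an aggregated time in
# service t give the point estimate lambda_hat = n / t. Under a homogeneous
# Poisson process, a two-sided 90% confidence interval for the failure rate
# can be based on the chi-square distribution.
from scipy.stats import chi2

n = 3            # observed failures (hypothetical)
t = 8.76e5       # aggregated time in service, in hours (hypothetical)

lambda_hat = n / t
lower = chi2.ppf(0.05, 2 * n) / (2 * t)
upper = chi2.ppf(0.95, 2 * (n + 1)) / (2 * t)

print(f"Point estimate : {lambda_hat:.2e} failures per hour")
print(f"90% interval   : [{lower:.2e}, {upper:.2e}] failures per hour")
```

With only three recorded failures, the interval spans roughly an order of magnitude, which illustrates why rates published on the basis of very few events should be used with caution.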

9.3 Data Sources

In this section, information about some data sources that may be useful in risk assessment is provided. More information about specific application areas can also be found in Chapter 20. At the time of writing, the sources mentioned in this chapter are considered to be among the best sources of data for risk assessment. This may obviously change in the future, although it takes time to establish good sources of data. Because several of the sources mentioned are available online only, some might be discontinued.

Data Collection Mandated Through Regulations

In some industries, it is mandatory to collect, analyze, and store data relating to accidents and incidents. A few examples are given to illustrate the requirements: • Nuclear power industry. In this industry, the data collection is rooted in the international convention on nuclear safety. According to this convention, each contracting party commits to taking the appropriate steps to ensure that
[…] incidents significant to safety are reported in a timely manner by the holder of the relevant license to the regulatory body; [and that] programs to collect and analyze operating experience are established, the results obtained and the conclusions drawn are acted upon and that existing mechanisms are used to share important experience with international bodies and with other operating organizations and regulatory bodies (IAEA 1994). • Aviation. According to EU directive 2003/42/EC on “Occurrence reporting in civil aviation” (EU 2003), data related to all civil aviation incidents and accidents must be collected, reported, and analyzed. The organization ECCAIRS has been established to “assist national and European transport entities in collecting, sharing and analyzing their safety information in order to improve public transport safety.” • Industry covered by the EU Seveso III directive. Companies in Europe that have to comply with the Seveso directive (EU 2012) are obliged to collect and report data in a specified format to the national authorities and to the eMARS database.1 • Offshore oil and gas. Several requirements to collect offshore oil and gas data are in operation around the world. In the United States, Bureau of Safety and Environmental Enforcement (BSEE), which was established after the Macondo blowout on Deepwater Horizon, has introduced requirements for reporting of all sorts of accidents and incidents, including fatalities, injuries, fires, explosions, well losses, structural damage, and so on. In the United Kingdom, Reporting Injuries, Diseases and Dangerous Occurrences Regulations (RIDDOR) cover a similar scope, extending also outside the oil and gas industry. The European Safety, Reliability, and Data Association (ESReDA)2 was established in 1992 as a forum for exchange of information, data, and results from research in safety and reliability. ESReDA has published several handbooks related to safety and reliability data. In this chapter, the term database is used to denote any type of data source, from a brief data handbook to a comprehensive computerized database. 9.3.2

Accident Data

Accident databases may be a useful source of information, but as input to quantitative risk assessment they may not be sufficient, mainly because they list and describe the accidents but with no information about the exposed population (i.e. the systems that did not have any accidents). Such databases can, therefore, 1 See http://emars.jrc.ec.europa.eu/. 2 See http://www.esreda.org.
not help us to estimate accident rates. A few accident databases are combined with exposure data, but the majority are not, and exposure data have to be found from other sources. This introduces the problem that it may be difficult to know exactly what population the accident data are based on and whether the exposure data that we acquire match this. Accident databases are most often developed to provide lessons learned from actual accidents, as a basis for improving technical systems, operations, management systems, and organizations. Accident databases may be useful sources of ideas for identifying what can go wrong when we are doing a risk analysis. These data are useful, but perhaps less so for risk analysis specifically than for risk management in general. Data from accidents may, according to Kvaløy and Aven (2005), be used to
• Monitor the risk and safety level
• Give input to risk analyses
• Identify hazards
• Analyze accident causes
• Evaluate the effect of risk reduction measures
• Compare alternative areas of efforts and measures

Many databases with information about accidents and incidents have been established. Some of these are official databases that are established by the authorities and have good quality assurance. Others are established by consulting companies, interest groups, or even individuals. The quality of these databases varies greatly. Some databases are very detailed, whereas others provide only a brief description of the accident/incident and provide no information about the causes of the accident/incident. Some databases cover only major accidents, whereas others focus on occupational accidents or accidents in which, generally, only one person is affected in each accident.

9.3.2.1 Some Accident and Incident Databases

Some databases providing accident and incident data are listed below. The list is far from exhaustive.
Major accident reporting system (eMARS). It was established to support the EU Seveso directive (EU 2012) and is operated by the Major Accident Hazards Bureau (MAHB) of the Joint Research Centre in Ispra, Italy, on behalf of the EU. Seveso III plants in Europe have to report all accidents and incidents to the eMARS database using a rather detailed format. The online database covers all industrial accidents and near accidents involving hazardous materials within the EU. Until the end of 2018, information about more than 900 events since 1979 had been reported. The information contained in eMARS is available to the public, except for restricted data, which are available only to the EU Member States.


The data in eMARS comprise, among several other items:
• Type of accident/incident
• Industry where accident/incident occurred
• Activity being carried out
• Components directly involved
• Causative factors (immediate and underlying)
• Ecological systems affected
• Emergency measures taken
Several EU countries have established accident and incident databases that complement and extend the eMARS database.
Process safety incident database (PSID). It provides information about accidents and incidents involving hazardous materials. The database is operated by the Center for Chemical Process Safety (CCPS) of the American Institute of Chemical Engineers (AIChE). At the time of writing this chapter (June 2019), the webpage states that the database contains more than 700 records.
Incident reporting system (IRS). It is a database of accidents and incidents in nuclear power plants. The database is operated by IAEA, Department of Nuclear Safety, Vienna, Austria. Almost all countries with a nuclear power program participate in IRS.
Aviation accident database. This is one of several databases providing data for aviation accidents and incidents. The Aviation Accident Database is operated by the US National Transportation Safety Board, Office of Aviation Safety. In Europe, ECCAIRS, a center for coordination of accident and incident reporting systems, has been established. This is directly related to two European regulations: No 996/2010 on investigation and prevention of accidents and incidents in civil aviation, and No 376/2014 on reporting, analysis, and follow-up of occurrences in civil aviation. ECCAIRS is hosted by JRC, and they operate two databases, one on incidents and one on safety recommendations. Unfortunately, none of them are publicly available. The International Civil Aviation Organization (ICAO) also operates a database and publishes accident statistics on its website http://www.icao.int. See also ATSB (2006).
International road traffic and accident database (IRTAD). It was established in 1988 as part of the OECD Road Transport Research Program. Data are supplied from many countries in a common format and analyzed by the IRTAD group. All IRTAD members have full access to the database.
World offshore accident database (WOAD). It provides information about more than 6000 accidents in the offshore industry since 1970. The database is operated by DNV GL, and access to the database is subject to paying a fee. WOAD is based largely on publicly available data that are processed by DNV GL.
SINTEF offshore blowout database. It contains data on offshore blowouts and is operated by SINTEF, the Norwegian research organization. The data are based largely on public sources. Access to data requires membership in the project.


HSE RIDDOR. It contains information about reported injuries and fatalities and also dangerous occurrences for workplaces in the United Kingdom. Statistics are published regularly and can be downloaded from http://www.hse.gov.uk/riddor.

9.3.2.2 Accident Investigation Reports

Accident reports are useful sources of information, in particular for learning what can go wrong and how accidents develop. This is useful for identification of what can go wrong in our study object, but individual accident reports cannot provide data that can be used directly in quantification of risk. An independent party usually investigates major accidents, and the investigation reports are often made public. Investigation reports of US chemical accidents may, for example, be found on the homepage of the US Chemical Safety Board (https://www.csb.gov). Many of these reports are very detailed, and some are accompanied by videos. The US National Transportation Safety Board also publishes reports on its website (https://www.ntsb.gov). This covers all key modes of transportation, including on roads, rail, at sea, and in the air.
An example of a far-reaching accident investigation is Lord Cullen's inquiry into the Piper Alpha accident in the North Sea in 1988. His report (Cullen 1990) led to new legislation, new operational procedures, and new ways of conducting risk assessments. Many risk analysts might learn a lot from reading the report. Another example is the investigations after the Macondo/Deepwater Horizon blowout. A number of investigations were undertaken, including by the US Coast Guard, by a specially appointed National Commission, by the National Academy of Engineering, and by BP as the operator of the well. All the reports are publicly available on the Internet. The accident had a major impact on regulations both in the United States and in Europe.
In Norway, the government commissions accident investigations of major accidents, and the reports are published as official NOU-reports. These reports are often a basis for proposed changes in laws and regulations that are presented to the parliament.

9.3.3 Component Reliability Data

There are two main types of component reliability data: (i) descriptions of single failure events and (ii) estimates of failure frequencies/rates.

9.3.3.1 Component Failure Event Data

Many companies maintain a component failure event database as part of their computerized maintenance recording system. Failures and maintenance actions are recorded related to the various components. The data are used in maintenance planning and as a basis for system modifications. In some cases,
companies exchange information recorded in their component failure report databases. An example is the Government Industry Data Exchange Program (GIDEP), which is a cooperation between the government and the industry in the United States and in Canada; see http://www.gidep.org. Some industries have implemented a failure reporting, analysis, and corrective action system (FRACAS), as described in MIL-STD-2155 (1985). By using FRACAS or similar approaches, failures are formally analyzed and classified before the reports are stored in the failure report database.

9.3.3.2 Component Failure Rates

A wide range of component failure rate databases is available. A component failure rate database provides estimates of (constant) failure rates for single components. Some databases may also give failure mode distributions and repair times. Databases with information about manufacturer/make of the various components are usually confidential to people outside a specific company or group of companies. The failure rate estimates may be based on:
(1) Recorded failure events
(2) Expert judgment
(3) Laboratory testing
or a combination of these.
It is usually assumed that the components have constant failure rate λ and hence that failures occur as a homogeneous Poisson process (HPP) with intensity λ. Let N(t) denote the number of failures during an accumulated time in service t. The HPP assumption implies that

    Pr(N(t) = n) = (λt)^n e^(−λt) / n!   for n = 0, 1, 2, …   (9.1)

The mean number of failures during an accumulated time in service t is then

    E[N(t)] = λt   (9.2)

The parameter λ can hence be written

    λ = E[N(t)] / t   (9.3)

such that λ is the mean number of failures per unit time in service. An obvious estimator for λ is then

    λ̂ = N(t) / t = (observed number of failures) / (accumulated time in service)   (9.4)

This estimator is seen to be unbiased, and a 90% confidence interval for λ when n failures are observed during the time in service t is found to be

    ( z_{0.95, 2n} / (2t) , z_{0.05, 2(n+1)} / (2t) )   (9.5)
where z_{α,m} is the upper α percentile of a chi-square distribution with m degrees of freedom, and is defined such that Pr(Z > z_{α,m}) = α. Chi-square percentiles may be found in spreadsheet programs and in most programs for statistical computing, such as R.³ Tables listing the percentiles for selected values of m and α are also available on the Internet. Observe that some of these tables and programs present the lower α percentile, defined as Pr(Z ≤ z^(L)_{α,m}) = α. Before using percentile values, you should always check whether your table or computer program presents upper or lower percentile values.
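
As an illustration, the estimator (9.4) and the confidence interval (9.5) can be computed with a few lines of code. The sketch below is our own minimal example in Python; it assumes SciPy is available and uses the numbers from Problem 9.8 (five failures during 978 850 hours in service). Note that scipy.stats.chi2.ppf returns lower percentiles, so the upper percentile z_{α,m} is obtained as the lower (1 − α) percentile.

```python
from scipy.stats import chi2

n = 5            # observed number of failures
t = 978_850.0    # accumulated time in service (hours)

lam_hat = n / t  # point estimate, Eq. (9.4)

# 90% confidence interval, Eq. (9.5).
# z_{alpha,m} is the UPPER alpha percentile, Pr(Z > z) = alpha, which equals
# the LOWER (1 - alpha) percentile returned by chi2.ppf.
lower = chi2.ppf(0.05, 2 * n) / (2 * t)        # z_{0.95, 2n} / (2t)
upper = chi2.ppf(0.95, 2 * (n + 1)) / (2 * t)  # z_{0.05, 2(n+1)} / (2t)

print(f"lambda_hat = {lam_hat:.3e} per hour")
print(f"90% confidence interval: ({lower:.3e}, {upper:.3e}) per hour")
```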

9.3.3.3 Generic Reliability Databases

Most reliability data sources maintain their own web pages, presenting information about their data sources and how they can be accessed. This section lists some commonly used reliability data sources. Process equipment reliability database (PERD). It is operated by the Center for Chemical Process Safety of AIChE, and contains reliability data for process equipment. Data are only available to members of the PERD project. Electronic parts reliability data (EPRD). It is available from Quanterion Solutions.4 This handbook of more than 2000 pages contains failure rate estimates for integrated circuits, discrete semiconductors (diodes, transistors, and optoelectronic devices), resistors, capacitors, and inductors/ transformers – obtained from field usage of electronic components. Nonelectronic parts reliability data (NPRD). It is also available from Quanterion Solutions. This handbook of about 1000 pages provides failure rates for a wide variety of component types, including mechanical, electromechanical, and discrete electronic parts and assemblies. MIL-HDBK-217F, reliability prediction of electronic equipment, contains failure rate estimates for the various part types used in electronic systems, such as integrated circuits, transistors, diodes, resistors, capacitors, relays, switches, and connectors. The estimates are based mainly on laboratory testing with controlled environmental stresses, and the failure rates in MIL-HDBK-217F (1991) are, therefore, related only to component-specific (primary) failures. The basic failure rate of a component found for normal stress levels in the laboratory is denoted 𝜆B . The effects of influencing factors (𝜋), such as quality level, temperature, and humidity are given in tables in MIL-HDBK-217F that are used to find the failure rate 𝜆P for the relevant application and environment, as 𝜆P = 𝜆 B ⋅ 𝜋 Q ⋅ 𝜋 E ⋅ 𝜋 A · · ·

(9.6)

Failures due to external stresses and CCFs are not included. The data are not related to specific failure modes. 3 https://www.r-project.org/. 4 https://www.quanterion.com.


MIL-HDBK-217F remains an active US Department of Defense (DoD) handbook, but is no longer being actively maintained or updated. The French FIDES initiative is a promising extension of the MIL-HDBK-217F approach.5 Offshore and onshore reliability data (OREDA). It provides reliability data for components and systems used in offshore oil and gas installations, collected from installations in several geographic areas. The computerized database is available only to OREDA participants, but several OREDA handbooks have been published presenting generic data. The data are classified under the following main headings (i) machinery, (ii) electric equipment, (iii) mechanical equipment, (iv) control and safety equipment, and (v) subsea equipment. NSWC. It is a reliability prediction method for mechanical equipment, developed by the US Naval Surface Warfare Center (NSWC 2011). Reliability data for control and safety systems. It is a data handbook that is developed to support reliability assessments of safety-instrumented systems that should comply with IEC 61508 (2010). The handbook is based partly on data from OREDA and is developed by SINTEF, the Norwegian research organization. Safety equipment reliability handbook (SERH) is a handbook in three volumes covering reliability data for safety-instrumented systems. The handbook has been developed by exida.com and the three volumes cover (i) sensors, (ii) logic solvers and interface modules, and (iii) final elements. IEEE Std. 500. It is a handbook (IEEE Std. 500 1984) providing failure rate estimates for various electrical, electronic, sensing, and mechanical components. A Delphi method (see Section 9.4.2) combined with field data is used to produce component failure rate estimates. The data come from nuclear power plants, but similar applications are also considered as part of the Delphi method process. The data in the handbook are now rather old. International common cause data exchange (ICDE). It is a database operated by the Nuclear Energy Agency (NEA) on behalf of nuclear industry authorities in several countries. Various summary reports are available on the Internet for nonmembers. Commoncause failure data base (CCFDB). It is a data collection and analysis system operated by the US Nuclear Regulatory Commission (NRC). CCFDB includes a method for identifying CCF events, coding, and classifying those events for use in CCF studies, and a computer system for storing and analyzing the data. CCFDB is described thoroughly in NUREG/CR-6268 (2007). European industry reliability data (EIReDA). It gives failure rate estimates for components in nuclear power plants operated by EDF in France. EIReDA PC is a computer version of the EIReDA data bank. Data relate to the electrical, mechanical, and electromechanical equipment of nuclear plants. 5 http://www.fides-reliability.org/.


Reliability and availability data system (RADS). It was developed by the US NRC to provide reliability and availability data needed to perform generic and plant-specific assessments and to support PRA and risk-informed regulatory applications. Data are available for the major components in the most risk-important systems in both boiling water reactors and pressurized water reactors.

9.3.4 Data Analysis

The quality of the data presented in the databases obviously depends on the way the data are collected and analyzed. Several guidelines and standards have been issued to obtain high quality in data collection and analysis. Among these are
(1) “Handbook on quality of reliability data” (ESReDA 1999)
(2) “Guidelines for Improving Plant Reliability Through Data Collection and Analysis” (CCPS 1998)
(3) ISO 14224: “Petroleum, Petrochemical, and Natural Gas Industries: Collection and Exchange of Reliability and Maintenance Data for Equipment” (ISO 14224 2016)
(4) “Handbook of Parameter Estimation for Probabilistic Risk Assessment” (NUREG/CR-6823 2003)

9.3.5 Data Quality

There are several quality requirements for a good reliability data source. Among these are Accessibility. The database must be easily accessible such that resources are not wasted on searching for the data. User-friendliness. The database must be user-friendly, with sufficient help for the user such that she does not misinterpret the information in the database. System/component boundaries. The physical and operational boundaries for the systems and components in the database must be specified such that the user can know which failures are covered by the estimate. Traceable sources. The sources of the raw data must be specified such that the user can check whether her application of the systems and components is compatible with the application for which the data were collected. Failure rate function. Almost all commercially available reliability databases provide only constant failure rates, even for mechanical equipment that degrades due to such mechanisms as erosion, corrosion, and fatigue. When the failure rate function is assumed to be constant, this should, as far as possible, be justified in the database: for example, by trend testing.


Homogeneity. If the raw data come from different samples (e.g. different installations), the database should, as far as possible, verify that the samples are homogeneous.
Updating. The database must be updated regularly, such that the failure rate estimates, as far as possible, are applicable for the current technology.

9.3.5.1 Failure Modes and Mechanisms Distributions

Some reliability databases include failure rate estimates for each failure mode, but most of the databases present only an overall failure rate. To use such data in a specific risk analysis, we may need to estimate the percentage of the failure rate that applies to a specific failure mode. A specific handbook has been developed for this purpose:
Failure mode/mechanism distributions. It is a handbook (FMD 2013) providing relative probabilities of occurrence of the various failure modes and failure mechanisms of a wide range of electrical, electronic, mechanical, and electromechanical parts and assemblies.

9.3.5.2 Plant-Specific Reliability Data

The MIL-HDBK-217F approach in (9.6) is a simple example of a proportional hazards model where the actual failure rate 𝜆P for a specific operational and environmental context is determined by multiplying the basic failure rate 𝜆B by a number of influencing coefficients. These coefficients are also called covariates and concomitant variables. The MIL-HDBK-217F approach may seem simple, but a lot of research has been carried out to determine the various coefficients. If the temperature in the given context, for example, is 90 ∘ C, the influencing coefficient given in MIL-HDBK-217F covers both the effect of the increased temperature and the importance of the temperature as an influencing factor. Similar influencing coefficients have not been developed for mechanical, electro-mechanical, and more complex equipment. The general expression of the proportional hazards model when it is assumed that the basic failure rate is constant (i.e. with no wear-out effects) is 𝜆P = 𝜆B h(𝜋1 , 𝜋2 , … , 𝜋m )

(9.7)

where π1, π2, …, πm are the influencing coefficients. In some cases, one or more of these coefficients may vary with time.

9.3.6 Human Error Data

Human errors and human reliability are discussed in Chapter 15. Several databases providing data on human errors are available. As for technical failures, the data can be divided into two types: (i) descriptions of human errors and (ii) probabilities of typical human errors in a specified context. The second type is often presented as HEPs.


9.3.6.1 Human Error Databases

A human error database describes human errors that have occurred in a particular system, along with their associated causal factors and consequences. Most safety-critical systems have an error database of some type. The computerized operator reliability and error database (CORE-DATA) was established by the University of Birmingham in England (HSE 1999). CORE-DATA covers human/operator errors in the nuclear, chemical, and offshore oil domains. CORE-DATA uses the following data sources as its input:
• Incident and accident report data
• Simulator data from training and experimental simulations
• Experimental data
• Expert judgment data

The information in CORE-DATA is analyzed to present the following information elements (adapted from Basra and Kirwan 1998).
(1) Task description, where a general description is given of the task being performed and the operating conditions.
(2) Human error mode. This is similar to a component failure mode and is a description of the observable manifestation of the error, for example, action too late, action too early, incorrect sequence, and so on.
(3) Psychological error mechanisms. This element provides a description of the operator's internal failure modes, such as attention failure, cognitive overload, misdiagnosis, and so on.
(4) Performance shaping factors, where the performance shaping factors that contributed to the error mode are described, for example, ergonomic design, task complexity, and so on.
(5) Error opportunities that quantify how many times a task was completed and the number of times the operator failed to achieve the desired outcome, for example, 1 error in 50.
(6) Nominal HEP. This is the mean HEP of a given task, calculated as the number of errors observed divided by the number of opportunities for error.
… and several more.
The US NRC collects data on human errors in nuclear power plants. The data are not publicly available, but stakeholders may get access to the database (see https://www.nrc.gov/reactors/operating/ops-experience/human-factors.html).
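
Element (6) simply defines the nominal HEP as the number of errors observed divided by the number of opportunities for error (element 5). The minimal sketch below applies that definition to a couple of task records; the task names and counts are invented for illustration only and are not taken from CORE-DATA.

```python
# Nominal HEP = errors observed / opportunities for error (element 6).
# The records below are hypothetical, for illustration only.
task_records = {
    "isolate valve before maintenance": (1, 50),    # (errors, opportunities)
    "select correct pump for start-up": (2, 430),
}

for task, (errors, opportunities) in task_records.items():
    hep = errors / opportunities
    print(f"{task}: nominal HEP = {hep:.3g}")
```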

9.3.6.2 Human Error Probabilities

Several data sources for human errors are available. Among these are:
Human performance evaluation system (HPES), established by the Institute for Nuclear Power Operations in Atlanta, Georgia, requires a membership fee
for access to data, but annual summary reports are published. HPES provides data for human errors in the nuclear power industry and gives HEP estimates and information on root causes of the errors.
Handbook of human reliability analysis with emphasis on nuclear power plant applications (Swain and Guttmann 1983) presents 27 tables of HEPs, together with performance shaping factors (PSFs) that can be used to adapt the HEPs to a specific situation/application. The data are most relevant for nuclear industry applications.
Human error assessment and reduction technique (HEART) contains a table with generic HEPs, together with PSFs and a procedure for adapting the HEPs to specific applications. The HEART calculation procedure is simpler than the one proposed by Swain and Guttmann (1983) (see above).
A guide to practical human reliability assessment (Kirwan 1994) has an appendix (II) with HEPs from generic sources, operational plants, ergonomic experiments, and simulators.
Human reliability & safety analysis data handbook (Gertman and Blackman 1994) gives a thorough discussion of challenges and problems related to human reliability data and also presents some data.
CORE-DATA, described in Section 9.3.6, provides quantitative data, and thus is also a data source for HEP estimates.
There are many more data sources for human error. Interested readers may, for example, search for these on the internet. A good starting point is http://en.wikipedia.org/wiki/Human_error.

9.4 Expert Judgment

The use of expert judgment is essential to the whole process of risk assessment and includes judgment applied both in modeling and data. We focus here on how expert judgment plays a role in establishing data for use in risk assessment. Broadly, we can talk about two main ways of using expert judgment:
(1) When data are available, but they are not fully relevant, and experts are used to modify the data to be more fit for purpose.
(2) When no data are available, and experts are used to provide their opinion on parameter values.
In the first case, the experts will be asked to provide adjustment factors that attempt to modify existing data to better reflect the future. In the second case, the experts will have to come up with completely new values, based on their experience and knowledge.
A question that is sometimes raised about expert judgment in general is whether data based on expert judgment are “valid” data and “allowed” for use
in risk assessment. Use of expert judgment is a necessity in most predictions we are making about complex phenomena in society. Decisions are made every day based on predictions about how the economy will develop, what we think our income will be in the future, how the population in a country or a city will develop, how pollution will affect climate, and so on. These decisions are also made based on a mix of models, collected data, and expert judgment, in exactly the same way as risk assessment. Even if these predictions are not always correct, they help us to make better decisions and prioritize better in the long run. This is also possible to achieve with good risk assessments, based on the best available knowledge. That will in practice always also include expert judgment.

9.4.1 Adjusting Existing Data

In most quantitative risk assessments, it is common to adjust existing data, because they do not fully represent the state as we expect it to be in the future. This is most often done by the risk analysts themselves, thus representing the “expertise.” In many cases, these adjustments are made in a very simple manner. A systematic procedure for adjusting data, that also can be documented and audited afterward, is described in the following. (1) Select data that are “promising” from a relevant database. Assume that this is a failure rate 𝜆. (2) Identify the differences between the population that we have data from and the study object. This should be as specific as possible (not just “age” or “new technology” but, for example, “different materials being used” or “improved operating procedures implemented recently”). The differences are designated D1 , D2 , D3 , and so on. Only aspects that are important for risk (failure rate) should be identified and described. The number of differences should be limited to maximum 3–5. A problem is often that it may be difficult to identify precisely what the population that the data are gathered from consists of, thereby making it difficult to determine if there are differences and precisely what they are. (3) Identify possible dependencies between the identified differences. If we introduce a less corrosive material in some of the components in the system and at the same time place the system in a less corrosive environment, the reduction in failures due to corrosion cannot automatically be counted twice. (4) Assess the quantitative effect of each of the differences Di individually, by assigning adjustment factors ki that we have identified. These factors can take on any positive or zero value, with a value of one implying that there is no adjustment and zero implying that we believe the change makes the failure irrelevant.
(5) Based on the identification of dependencies, determine if any of the adjustment factors need to be modified, to take these dependencies into account. The modified factors are designated ki*.
(6) Multiply the failure rate obtained from the database with all the adjustment factors identified in the previous step to calculate an updated failure rate λ*:

    λ* = λ ∏i ki*   (9.8)

(7) Finally, do a comparison of the calculated value with the original value and evaluate if this is reasonable, based on the available evidence. In particular, if the original data are given as probabilities, make sure that the end result is not greater than one! If necessary, make final adjustments.
(8) Document the process to ensure that others can understand and revisit the assumptions later.
Observe that this approach is similar to (9.6) used in MIL-HDBK-217F. It may be noted that expert judgment in various forms is applied at all stages in this process, except items 1, 6, and 8. An example may illustrate this procedure.

Example 9.1 (Adjusting failure data for valves in a process system)
A novel process system is planned for handling of light hydrocarbons at high temperature and high pressure. The system will be equipped with valves for controlling flow through the system. The valves are designed in a new material that is highly corrosion-resistant. A risk analysis needs to be performed, and we have found a database containing failure data for valves in traditional oil process systems. The annual frequency of external leak found from this database is λ = 3 × 10⁻⁴ per year.
In this case, we are considering that the key differences are that the material is highly corrosion-resistant (D1), that light hydrocarbons are handled as opposed to oil (D2), and that the temperature and pressure are high (D3). There may be dependencies between all three of these because all may influence the effect of corrosion.
Based on expert judgment, we conclude that handling light hydrocarbons rather than oil has a positive effect, and an adjustment factor of k1 = 0.7 is used (i.e. lower failure rate compared to the data given). Increased temperature and pressure may increase the probability of failure, and an adjustment factor of k2 = 1.5 is therefore applied. Finally, the effect of corrosion-resistant material should ideally be based on data about how large a proportion of all failures are due to corrosion. If we know, for example, that 50% of failures are due to corrosion, and we assume that the new material reduces corrosion by 90%, we can apply an adjustment factor of k3 = 0.55 (i.e. 45% reduction in failure rate). Because the effects on corrosion are partly positive and partly negative, we can probably keep the adjustment factors as they are, without modifying to take into account dependencies, ki* = ki.
The updated failure rate can then be calculated as follows:

    λ* = λ · k1 · k2 · k3 = 3 × 10⁻⁴ · 0.7 · 1.5 · 0.55 ≈ 1.7 × 10⁻⁴

Based on the information, it is reasonable to expect a reduction in failure rate compared to what the historical data show, and the number is therefore applied in the risk analysis. ◽
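
The adjustment procedure in Section 9.4.1 is easy to mechanize, so that the factors and their justification can be documented and audited afterward (step 8). The sketch below reproduces the calculation in Example 9.1; the function and the dictionary of labeled differences are our own illustration, not a standard implementation.

```python
def adjust_rate(base_rate, adjustment_factors, is_probability=False):
    """Multiply a base rate by dependency-modified adjustment factors k_i*
    (Eq. (9.8)) and apply a simple sanity check (step 7)."""
    adjusted = base_rate
    for difference, factor in adjustment_factors.items():
        adjusted *= factor
    if is_probability and adjusted > 1.0:
        raise ValueError("Adjusted probability exceeds 1 - revisit the factors")
    return adjusted

# Example 9.1: external leakage from valves in a novel process system
factors = {
    "D2: light hydrocarbons instead of oil": 0.7,
    "D3: high temperature and pressure": 1.5,
    "D1: corrosion-resistant material": 0.55,
}
lam_star = adjust_rate(3e-4, factors)
print(f"Adjusted failure rate: {lam_star:.1e} per year")  # about 1.7e-4
```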

9.4.2 Providing New Data When No Data Exists

In situations where we have no data that can be used as a starting point for the analysis, we rely on experts to provide base values that can be used in the risk assessment directly. In situations like this, especially if the potential consequences are high, we often gather a team of experts that together develop values that can be used. This process essentially consists of three steps:
• Selection of experts
• Elicitation of information from the experts
• Analysis of the data from the experts
Expert judgment elicitation is a process for obtaining data directly from experts in response to a specified problem. This may be related to the structure of the models and to the parameters and variables used in the models. The expert judgment process may be formal or informal, and may involve only one expert or a group of individuals with various types of expert knowledge. Several structured processes for eliciting information from experts have been suggested in the literature and have been proven useful in practical risk analyses. It is not an objective of this chapter to give a thorough introduction to expert judgment elicitation. Interested readers may consult, for example, Cooke (1991) or Ayyub (2001).
As mentioned in Section 9.3.3, the failure rate estimates in IEEE Std. 500 (1984) were produced by a Delphi method. The Delphi method is a special procedure for expert judgment elicitation where the individual experts answer questionnaires in two or more rounds. After each round, a facilitator provides an anonymous summary of the experts' forecasts from the previous round as well as the reasons they provided for their judgments. In this way, the experts are encouraged to revise their earlier answers in light of the replies of other members of their panel. During this process, it is believed that the answers will converge toward the “correct” answer. The process is terminated when a predefined stop criterion is met (e.g. number of rounds, stability of results) and the mean or median scores of the final rounds determine the results.7
7 For more information about the Delphi method, see http://en.wikipedia.org/wiki/Delphi_method.
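
The Delphi method itself is an elicitation process, not an algorithm, but the bookkeeping between rounds can be supported by a few lines of code. The sketch below, our own illustration with invented estimates, collects the judgments from each round, computes the median that would be fed back to the panel, and applies one possible stop criterion (median stability).

```python
import statistics

# Hypothetical expert estimates of a failure rate (per year), round by round
rounds = [
    [1e-4, 5e-4, 2e-3, 8e-5, 3e-4],   # round 1
    [2e-4, 4e-4, 9e-4, 1e-4, 3e-4],   # round 2
    [2e-4, 3e-4, 5e-4, 2e-4, 3e-4],   # round 3
]

previous_median = None
for i, estimates in enumerate(rounds, start=1):
    median = statistics.median(estimates)
    spread = (min(estimates), max(estimates))
    print(f"Round {i}: median = {median:.1e}, spread = {spread[0]:.1e} to {spread[1]:.1e}")
    # Example stop criterion: the median changes by less than 20% between rounds
    if previous_median is not None and abs(median - previous_median) / previous_median < 0.2:
        print("Stop criterion met: use the final-round median")
        break
    previous_median = median
```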


Another method that has become increasingly popular for risk assessment applications is the analytic hierarchy process (AHP), which was originally developed as a method for supporting decision-making in general (Saaty 1980). The AHP method is based on experts doing pairwise comparisons and could be used, for example, to support the process of determining adjustment factors in the process we described above. It is also useful to obtain data for use in Bayesian networks (Chapter 11).
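
The core computation in AHP is to turn a pairwise comparison matrix into a priority (weight) vector, usually taken as the normalized principal eigenvector (Saaty 1980). The sketch below shows this for three factors; the comparison values are invented for illustration, and the random index used for the consistency check is the standard value for a 3 × 3 matrix.

```python
import numpy as np

# Pairwise comparison matrix for three factors (Saaty's 1-9 scale).
# A[i, j] expresses how much more important factor i is than factor j.
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

eigenvalues, eigenvectors = np.linalg.eig(A)
k = np.argmax(eigenvalues.real)              # principal eigenvalue
weights = np.abs(eigenvectors[:, k].real)
weights /= weights.sum()                     # normalized priority vector

# Consistency index CI = (lambda_max - n) / (n - 1); compare with Saaty's
# random index (RI = 0.58 for n = 3) to judge whether the judgments are consistent.
n = A.shape[0]
ci = (eigenvalues.real[k] - n) / (n - 1)
print("weights:", np.round(weights, 3), " CI:", round(ci, 3))
```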

9.5 Data Dossier

When the report from a risk analysis is presented, what data the risk analysis is based on is sometimes questioned. It is important that the choice of input data is thoroughly documented, especially the choice of reliability data. It is therefore recommended that a data dossier be set up that presents and justifies the choice of data for each component or input event in the risk analysis. An example of such a data dossier is shown in Figure 9.1. In many applications, a simpler data dossier, as shown in Table 9.1, may be used.
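
If the risk analysis is supported by software, the data dossier can also be kept as a structured record so that each choice of input data remains traceable. The sketch below is one possible representation, loosely mirroring the fields of Table 9.1; the field names are our own choice, not a standard format.

```python
from dataclasses import dataclass, field

@dataclass
class DataDossierEntry:
    data: str                 # what the data are used for, e.g. "Leak frequencies"
    source: str               # database or document the values are taken from
    comments: str = ""        # adjustments, assumptions, and justification
    adjustment_factors: dict = field(default_factory=dict)

dossier = [
    DataDossierEntry(
        data="Leak frequencies",
        source="HSE UK Database 2000-2017",
        comments="Adjusted values to take into account local leak data",
    ),
]
```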

9.6 Problems

9.1

In Section 9.1.1, a large number of descriptive data are listed. Where can we get hold of all these types of data?

9.2

In Section 9.3.2, different uses of accident data are listed. Discuss how the different purposes will have an impact on what information needs to be included in a database containing these data.

9.3

Look at the information contained in the eMARS database (https://emars.jrc.ec.europa.eu/en/emars/content) and discuss if the information is sufficient to meet all the uses of databases listed in Section 9.3.2.

9.4

You are going to do a risk analysis of a building crane. List examples of technical data, operational data, reliability data, meteorological data, and exposure data that would be required to perform the analysis.

9.5

What are the quality criteria for reliability databases?

9.6

What do we need to evaluate when we have a set of data that we plan to use for risk assessment?


Data dossier Component. Hydraulically operated gate valve

System. Pipeline into pressure vessel A1

Description. The valve is a 5-in. gate valve with a hydraulic “fail safe” actuator. The fail safe-function is achieved by a steel spring that is compressed by hydraulic pressure. The valve is normally in the open position and is only activated when the pressure in the vessel exceeds 150 bar. The valve is function-tested once a year. After a function test, the valve is considered to be “as good as new.” The valve is located in a sheltered area and is not exposed to frost/icing.

Failure mode:                                     Failure rate (per hour):   Source:
– Does not close on command                       3.3 × 10⁻⁶                 Source A
                                                  1.2 × 10⁻⁶                 Source B
– Leakage through the valve in closed position    2.7 × 10⁻⁶                 Source A
– External leakage from valve                     4.2 × 10⁻⁷                 Source A
– Closes spuriously                               3.8 × 10⁻⁶                 Source A
                                                  7.8 × 10⁻⁶                 Source B
– Cannot be opened after closure                  1/300 (per demand)         Expert judgment

Assessment: The failure rates are based on sources A and B. The failure rate for the failure mode “cannot be opened after closure” is based on the judgments from three persons with extensive experience from using the same type of valves and is estimated to one such failure per 300 valve openings. Source B is considered to be more relevant than source A, but source B gives data for only two failure modes. Source B is therefore used for the failure modes “does not close on command” and “closes spuriously,” while source A is used for the remaining failure modes

Testing and maintenance: The valve is function-tested after installation and thereafter once per year. The function test is assumed to be a realistic test, and possible failures detected during the test are repaired immediately such that the valve can be considered “as good as new” after the test. There are no options for diagnostic testing of the valve

Comments: The valve is a standard gate valve that has been used in comparable systems for a long time. The data used therefore have good validity and are relevant for the specified application

Figure 9.1 Example of a reliability data dossier.


Table 9.1 A simple data dossier for a risk analysis.

Data: Leak frequencies
Database: HSE UK Database 2000–2017
Comments: Adjusted values to take into account local leak data

Data: Probability of safety system failure
Database: System requirement document XXX-25-18430-2015
Comments: Applied values directly except for gas detectors where 50% reduction is assumed

Data: Blowout frequencies
Database: SINTEF blowout database – report 2016
Comments: –

9.7

You are going to do a risk analysis of a proposed new ship concept with a new type of hybrid machinery that is a combination of a traditional diesel-driven engine and an electrical engine powered by batteries. The electrical engine will be used when there is sufficient power in the batteries, but for the rest of the time, the diesel engine will be used. The electrical engine is a novel type that has not been used for marine applications earlier, whereas the diesel engine is a standard marine engine. You want to determine the probability that power will be lost and have found two data sources that provide data. One data source covers traditional marine diesel engines and the other data source covers electrical engines in general, for use in land-based applications (not ships). The first data source contains data for a total period of 20 years, up to now, whereas the second data source is much more limited, covering only 5 years. Evaluate the data sources that have been found and identify and describe differences that may require the data to be adjusted before applying them in the risk analysis. The differences should only be described qualitatively.

9.8

Failure data has been collected from a number of identical components. Assume that five failures have been observed during an accumulated time in service of 978 850 hours. (a) Find an estimate of the failure rate 𝜆 of the components. (b) What does it mean that the estimator used is unbiased? Provide a careful explanation. (c) What is the difference between the upper and the lower percentile of the chi-square distribution? (d) Express the relationship between the upper and the lower percentile (as a formula).


(e) Use a table or a computer program to find the relevant percentiles of the chi-square distribution and determine a 90% confidence interval for the failure rate 𝜆 (use Eq. (9.5)). (f ) Carefully explain what is meant by a confidence interval in this situation.

References

ATSB (2006). International Fatality Rates: A Comparison of Australian Civil Aviation Fatality Rates with International Data. B2006/0002. Canberra, Australia: Australian Transport Safety Bureau.
Ayyub, B.M. (2001). Elicitation of Expert Opinions for Uncertainty and Risks. Boca Raton, FL: CRC Press.
Basra, G. and Kirwan, B. (1998). Collection of offshore human error probability data. Reliability Engineering & System Safety 61: 77–93.
CCPS (1998). Guidelines for Improving Plant Reliability Through Data Collection and Analysis. New York: Center for Chemical Process Safety, American Institute of Chemical Engineers.
Cooke, R.M. (1991). Experts in Uncertainty: Opinion and Subjective Probability in Science. New York: Oxford University Press.
Cox, A.W., Lees, F.P., and Ang, M.L. (1990). Classification of Hazardous Locations. Report. Rugby, UK: IIGCHL, IChemE.
Cullen, W.D. (1990). The Public Inquiry into the Piper Alpha Disaster on 6 July 1988. London: HM Stationery Office.
ESReDA (1999). Handbook on quality of reliability data. Working Group Report. DNV GL, Høvik, Norway: European Reliability Data Association.
EU (2003). Council Directive 2003/42/EC of 13 June 2003 on occurrence reporting in civil aviation. Official Journal of the European Communities L167/23.
EU (2012). Council Directive 2012/18/EU of 4 July 2012 on the control of major-accident hazards involving dangerous substances. Official Journal of the European Union L197/1.
FMD (2013). Failure Mode/Mechanism Distribution. Utica, NY: Quanterion Solutions.
Gertman, D.I. and Blackman, H.S. (1994). Human Reliability & Safety Analysis Data Handbook. New York: Wiley.
Hassel, M., Asbjørnslett, B.E., and Hole, L.P. (2011). Underreporting of maritime accidents to vessel accident databases. Accident Analysis and Prevention 43 (6): 2053–2063.
HSE (1999). The Implementation of CORE-DATA, A Computerized Human Error Probability Database. Research report 245/1999. London: Health and Safety Executive.


IAEA (1994). Convention on Nuclear Safety. INFCIRC/449. Vienna, Austria: International Atomic Energy Agency.
IEC 61508 (2010). Functional safety of electrical/electronic/programmable electronic safety-related systems, Parts 1-7. Geneva: International Electrotechnical Commission.
IEEE Std. 500 (1984). IEEE guide for the collection and presentation of electrical, electronic, sensing component, and mechanical equipment reliability data for nuclear power generating stations. New York: IEEE and Wiley.
ISO 12100 (2010). Safety of machinery – general principles for design: risk assessment and risk reduction. International standard ISO 12100. Geneva: International Organization for Standardization.
ISO 14224 (2016). Petroleum, Petrochemical, and Natural Gas Industries: Collection and Exchange of Reliability and Maintenance Data for Equipment. Tech. Rep. Geneva: International Organization for Standardization.
Kirwan, B. (1994). A Guide to Practical Human Reliability Assessment. London: Taylor & Francis.
Kvaløy, J.T. and Aven, T. (2005). An alternative approach to trend analysis in accident data. Reliability Engineering & System Safety 90: 75–82.
MIL-HDBK-217F (1991). Reliability prediction of electronic equipment. Washington, DC: U.S. Department of Defense.
MIL-STD-2155 (1985). Failure reporting, analysis and corrective action system. Washington, DC: U.S. Department of Defense.
NS 5814 (2008). Requirements for Risk Assessment, Norwegian edn. Oslo, Norway: Standard Norge.
NSWC (2011). Handbook of reliability prediction procedures for mechanical equipment. Handbook NSWC-11. West Bethesda, MD: Naval Surface Warfare Center, Carderock Division.
NUREG/CR-6268 (2007). Common-cause failure database and analysis system: event data collection, classification, and coding. Washington, DC: U.S. Nuclear Regulatory Commission, Office of Nuclear Regulatory Research.
NUREG/CR-6823 (2003). Handbook of parameter estimation for probabilistic risk assessment. Washington, DC: U.S. Nuclear Regulatory Commission, Office of Nuclear Regulatory Research.
OREDA (2015). Offshore and Onshore Reliability Data, 6e. Høvik, Norway: OREDA Participants / DNV GL.
Saaty, T.L. (1980). The Analytic Hierarchy Process. New York: McGraw-Hill.
Swain, A.D. and Guttmann, H. (1983). Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications. Technical report NUREG/CR-1278. Washington, DC: Nuclear Regulatory Commission.


10 Hazard Identification

10.1 Introduction

The first question in the triplet definition of risk is: What can go wrong? Answering this question implies identifying the hazards and threats and the initiating and/or hazardous events that have the potential to cause harm to one or more assets. Several methods have been developed for this purpose. These methods are called hazard identification methods.

Definition 10.1 (Hazard identification) The process of identifying and describing all the significant hazards, threats, and hazardous events associated with a system (DEF-STAN 00-56 2007). ◽

Several hazard identification methods are not limited to identification of hazards but also cover the two other questions in the definition of risk. They can therefore be regarded as “complete” risk analysis methods. Comprehensive descriptions and reviews of hazard identification methods are given in HSL (2005) and ISO 31010 (2009).

10.1.1 Objectives of Hazard Identification

The objectives of the hazard identification process are to
(1) identify all the hazards and hazardous events that are relevant during all intended use and foreseeable misuse of the study object, and during all interactions with the study object;
(2) describe the characteristics, and the form and quantity, of each hazard;
(3) describe when and where in the study object the hazard is present;
(4) identify possible enabling events and conditions related to each hazard;
(5) identify under what conditions the hazard could lead to an initiating/hazardous event and which pathways the hazard may follow;
(6) identify potential initiating/hazardous events that could be caused by the hazard (or in combination with other hazards);
(7) make operators and study object owners aware of hazards and potential hazardous events.

10.1.2 Classification of Hazards

Hazards may be classified in different ways. No universally accepted classification system has emerged, but it may be argued that different classification systems can be useful for different applications. A few classifications are listed in the following: Based on the main contributor to an accident scenario: (a) Technological hazards (e.g. related to equipment, software, structures, and transport) (b) Natural (or environmental) hazards (e.g. flooding, earthquake, lightning, storm, and high/low temperatures) (c) Organizational hazards (e.g. long working hours, inadequate competence, inadequate procedures, inadequate maintenance, and inadequate safety culture) (d) Behavioral hazards (e.g. drugs/alcohol, lack of concentration) (e) Social hazards (e.g. cyberattacks, theft, arson, sabotage, terrorism, and war) Based on the origin of a technological hazard (energy source): (a) Mechanical hazards (b) Electrical hazards (c) Radiation hazards … and so on Based on the nature of the potential harm: (a) Cancer hazards (b) Suffocation hazards (c) Electrocution hazards (d) Pollution hazards … and so on Based on the boundaries of the study object: (a) Endogenous hazards (i.e. hazards that are internal in the study object) (b) Exogenous hazards (i.e. hazards that are external to the study object) Table 2.5 provides a generic list of hazards. Example 10.1 (Hazards for a ship) Some typical hazards for a ship, classified as exogenous and endogenous hazards, are listed in Table 10.1. The list is not complete. ◽


Table 10.1 Hazards for a ship (Example 10.1).

Exogenous hazards – hazards external to a ship are, for example:
• Storms, lightning, extreme waves
• Poor visibility
• Submerged objects, other ships
• War, sabotage
• … and many more

Endogenous hazards – hazards onboard a ship:
• In accommodation areas. Combustible furnishings, cleaning material in stores, oil/fat in galley equipment, etc.
• In deck areas. Cargo, crane operations, slippery deck, electrical connections, etc.
• In machinery spaces. Cabling, fuel and diesel oil, fuel oil piping and valves, refrigerants, etc.
• Sources of ignition. Naked flame, electrical appliances, hot surface, sparks from hot work, deck, and engine room machinery.
• Operational hazards to personnel. Long working hours, working on deck at sea, cargo operation, tank surveys, on-board repairs, etc.

10.1.3 Hazard Identification Methods

The hazard identification methods that are described in this chapter are the following: Checklist and brainstorming. In many cases, it is useful to start with a list of generic hazards and/or generic hazardous events and to decide if, where, and how these events may occur in relation to the study object. Such a list of generic events is, for example, given in HSE (2001) for risk analysis of offshore oil and gas installations. A small part of the list is presented in Table 10.2 for illustration. Teamwork and brainstorming sessions may be used to come up with details about the events. Preliminary hazard analysis (PHA). PHA1 is a rather simple method and is commonly used to identify hazards in the design phase of a study object. The analysis is called “preliminary” because its results are often updated as more thorough risk analyses are carried out. PHA may also be used in later phases of the system’s life cycle, and can, for relatively simple systems, be a complete and sufficient risk analysis. A simplified PHA is sometimes called a hazard identification (HAZID). Job safety analysis (JSA). JSA is a simple method – similar to PHA – aimed at analyzing work operations and procedures. JSA is most often used just before a work operation is to be performed, to prepare and raise the safety awareness of those who are involved. 1 Observe that PHA is also used as an abbreviation for process hazard analysis, especially in the United States.




Table 10.2 List of generic hazardous events on offshore oil and gas installations.

Blowouts:
• Blowout in drilling
• Blowout in completion
• Blowout in production
• Blowout in workover
• etc.

Process leaks – leaks of gas or oil from:
• Wellhead equipment
• Separators and other process equipment
• Compressors and other gas treatment equipment
• Process pipes, flanges, valves, pumps
• etc.

Nonprocess fires:
• Fuel gas fires
• Electrical fires
• Accommodation fires
• Methanol/diesel/aviation fuel fires
• etc.

Nonprocess spills:
• Chemical spills
• Methanol/diesel/aviation fuel spills
• Bottled gas spills
• Radioactive material releases
• etc.

Marine collisions:
• Supply vessels
• Standby vessels
• Passing merchant vessels
• Fishing vessels
• Drilling rigs
• etc.

And several more categories

Source: Extract adapted from HSE (2001).

Failure modes, effects, and criticality analysis (FMECA). FMECA (or FMEA) was one of the first methods for system reliability analysis, and the first guideline was issued as early as 1949. The objective of FMECA of a technical system is to identify all the potential failure modes of the system components, identify the causes of these failure modes, and assess the effects that each failure mode may have on the entire system. Hazard and operability (HAZOP) study. The HAZOP approach was developed to identify deviations and dangerous situations in process plants. The method is based on teamwork and brainstorming that is structured based on guidewords. HAZOP has been used with great success and is today a standard method for risk assessment in the design of process plants. HAZOP is also used in later phases of a system’s life cycle, especially related to modifications of the system. A variant of HAZOP can be used to identify hazards in complicated work procedures. Systems-theoretic process analysis (STPA). STPA was developed to address some of the limitations in existing methods, especially related to complicated systems with software. STPA is based on control theory and requires the study object be described as a hierarchical control system. STPA may be used in early design phases, but is also suitable for detailed analyses. Structured what-if technique (SWIFT). SWIFT is carried out by a group of experts in a brainstorming session where a set of what-if questions are asked – and answered. The work is structured using a dedicated checklist.


The method was earlier called a “what-if/checklist” method. SWIFT can be used as a simplified HAZOP and can be applied to the same type of systems. Master logic diagram (MLD). MLD may be used to identify hazards in complicated systems that are exposed to a wide range of hazards and failure modes. The method resembles fault tree analysis (see Section 11), but is clearly distinguishable from fault tree with several specific features. MLD is mentioned only briefly in this chapter, and interested readers are advised to consult the literature (e.g. see Modarres 2006). Change analysis. Change analysis is used to identify hazards and threats related to planned modifications of a system. The analysis is carried out by comparing the properties of the modified system with a basic (known) system. Change analysis may also be used to evaluate modifications to operating procedures. Finally, there is also a description of hazard log. This is not a hazard identification method but a useful tool for recording information about hazards and hazardous events, and for keeping this information updated. Several methods containing modules for hazard identification are discussed elsewhere in the book, for example in Chapter 13 related to barrier analysis and in Chapter 15 related to human errors. No method can guarantee that we identify all the hazardous events that can potentially occur in a system, and it is always possible that unidentified hazardous events will occur. A hazardous event that is not identified will not be controlled and will therefore always lead to a higher risk than assessed. The effectiveness of any hazard identification analysis depends entirely on the experience and creative imagination of the study team. The methods applied only impose a disciplined structure on the work. Remark 10.1 (Brainstorming versus functional methods) Hazard identification methods are sometimes classified as brainstorming methods or functional methods (e.g. see de Jong 2007). The brainstorming methods are mainly used by a group of experts and carried out in specific meetings. Examples of brainstorming methods are PHA and SWIFT. The functional methods are based on a detailed analysis of system structures and functions. Examples of functional methods are FMECA and STPA. Most of the methods have elements of both types and we therefore do not use this categorization in this chapter. ◽

10.2 Checklist Methods

A hazard checklist is a written list of hazards or hazardous events that have been derived from past experience. The entries of the list can be examples of hazards and events, or they may be formulated as questions that are intended to help the study team consider all aspects of safety related to a study object. A checklist analysis for hazard identification is also called a process review. Checklists may be based on past experience and previous hazard logs and should be made specifically for a process or an operation. Checklists should be regarded as living documents that need to be reviewed and updated regularly.

10.2.1 Objectives and Applications

The objectives of a checklist analysis are to

(1) Identify all the hazards that are relevant during all intended use and foreseeable misuse of the study object, and during all interactions with the study object.
(2) Identify required controls and safeguards.

Checklist approaches are used in a wide range of application areas and for many different purposes. The main focus has been on early design phases and on the establishment of work procedures (e.g. see HSE 2001). Checklists have further been used to ensure that organizations are complying with standard practices. Hazard checklists may also be useful as part of other and more detailed hazard identification methods.

Many checklists developed for specific purposes exist. For the offshore oil and gas industry, ISO 17776 (2016) provides a comprehensive checklist. Another generic checklist for major accident hazards on an offshore installation is given in HSE (2001). For machinery safety, ISO 12100 (2010) contains a corresponding checklist. IMO (2015) provides a hazard list for ships, and Maragakis et al. (2009) provide a similar list for air traffic.

Most checklists contain not only hazards as defined in this chapter but also initiating events, hazardous events, and enabling conditions and events. This is not necessarily a problem, because the main purpose of the analysis is to identify as many potential problems as possible. A longer list may help to trigger more ideas and make the hazard identification more complete. After the hazard identification is completed, the list of identified hazards and events must be sorted and structured. Because most major accidents develop through sequences of events (i.e. accident scenarios), we should carefully check whether the identified events can be structured into accident scenarios.

Example 10.2 (Structuring information from a hazard identification) Assume that we – by using a checklist – find that (i) flammable material, (ii) fire, (iii) ignition, and (iv) inadequate maintenance are identified as hazards. In this case, we may easily envisage an accident scenario that develops as follows:

(1) On a site, flammable fluid is stored in a tank.
(2) Due to inadequate maintenance, the tank has corroded and a leak develops.
(3) The leak is ignited.
(4) A fire occurs.

In many cases, structuring the information is useful and necessary to help understand how hazards can get out of control and eventually develop into a loss. ◽

10.2.2 Analysis Procedure

A checklist analysis does not normally follow any strict procedure. An important initial task is to prepare a suitable checklist. Checklists can be lists of questions related to potential hazardous event categories, or they can be lists of hazards and hazardous events that act as a generic starting point for identifying specific events related to the study object. Checklists are developed based on system analysis, operating history, and experience from past accidents and near misses. In some cases, process reviews are carried out without using a written checklist, and the study team must employ a "mental checklist." This approach is, of course, much more liable to omission of potential hazardous events. Table 10.3 is an example of a portion of a checklist.
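As a simple illustration, a written checklist may be kept as structured data so that the answers from the review meeting are recorded next to each entry. The following Python sketch is only an illustration; the questions, answers, and comments are invented:

```python
# Hypothetical checklist entries phrased as questions (compare Table 10.3).
# The "answer" and "comment" fields are filled in during the review meeting.
checklist = [
    {"question": "Are there flammable materials in the area?", "answer": None, "comment": ""},
    {"question": "Can a leak be detected quickly?", "answer": None, "comment": ""},
    {"question": "Is lighting a problem?", "answer": None, "comment": ""},
]

# Record an answer from the study team and keep entries that point to a hazard.
checklist[0].update(answer="yes", comment="Flammable fluid stored on the site")
hazards_found = [entry for entry in checklist if entry["answer"] == "yes"]
print(len(hazards_found))  # 1
```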

10.2.3 Resources and Skills Required

No specific skills are required except knowledge of the study object, but the study team must maintain strict attention to detail and perseverance in obtaining information. The information required varies depending on the checklist chosen.

10.2.4 Advantages and Limitations

The main advantages and limitations of checklist methods include:

Advantages: The checklist method
• can easily be used by people with no background in risk analysis;
• makes use of experience from previous risk assessments;
• ensures that common and more obvious problems are not overlooked;
• is valuable in the design process for revealing hazards otherwise overlooked;
• requires minimal information about the installation, and so is suitable for concept design.


Table 10.3 Process/system checklist for the design phase.

Materials. Review the characteristics of all process materials: raw materials, catalysts, intermediate products, and final products. Obtain detailed data on these materials, such as

Flammability:
• What is the autoignition temperature?
• What is the flash point?
• How can a fire be extinguished?

Explosivity:
• What are the upper and lower explosive limits?
• Does the material decompose explosively?

Toxicity:
• What are the breathing exposure limits (e.g. threshold limit values, immediately dangerous to life and health)?
• What personal protective equipment is needed?

Corrosivity and compatibility:
• Is the material strongly acidic or basic?
• Are special materials required to contain it?
• What personal protective equipment is needed?

Waste disposal:
• Can gases be released directly to the atmosphere?
• Can liquids be released directly into water?
• Is a supply of inert gas available for purging equipment?
• How would a leak be detected?

Storage:
• Will any spill be contained?
• Is this material stable in storage?

Static electricity:
• Is bonding or grounding of equipment needed?
• What is the conductivity of the materials, and how likely are they to accumulate static?

Reactivity:
• What is the critical temperature for autoreaction?
• What is the reactivity with other components, including intermediates?
• What is the effect of impurities?

Source: Adapted from CCPS (2008).

Limitations: The checklist method
• is limited to previous experience, and thus may not anticipate hazards in novel designs or novel accidents from existing designs;
• can miss hazards that have not been seen previously;
• does not encourage intuitive/brainstorming thinking, and gives limited insight into the nature of the hazards related to the study object.

Overall, a generic hazard checklist is useful for most risk assessments, but it should not be the only hazard identification method, except for standard installations whose hazards have been studied in more detail elsewhere.

10.3 Preliminary Hazard Analysis

PHA is used to identify hazards and potential accidents in the early stages of system design and is basically a review of where energy or hazardous materials can be released in an uncontrolled manner. PHA is used not only for hazard identification but also for ranking the hazards with respect to probability and consequence. The PHA technique was developed by the US Army (MIL-STD-882E) and has been used with success for safety analysis in the defense sector, for safety analysis of machinery, in process plants, and for a wide range of other applications. A PHA is called "preliminary" because it is usually refined through additional and more thorough studies. Many variants of PHA have been developed, and they appear under different names, such as HAZID and rapid risk ranking (RRR). The abbreviation PHA is also used to mean process hazard analysis, which is a requirement in the United States under the Occupational Safety and Health Administration (OSHA) regulations.

10.3.1 Objectives and Applications

The overall objective of a PHA is to reveal potential hazards, threats, and hazardous events early in the system development process, such that they can be removed, reduced, or controlled in the further development of the project. More specific objectives of a PHA are to

(1) identify the assets that need to be protected;
(2) identify the hazardous events that can potentially occur;
(3) determine the main causes of each hazardous event;
(4) determine how often each hazardous event may occur;
(5) determine the severity of each hazardous event;
(6) identify relevant safeguards for each hazardous event;
(7) assess the risk related to each hazardous event;
(8) determine the most important contributors to the risk (and rank the contributors).

PHA is best applied in the early design phases of a study object, but can also be used in the later phases. PHA can be a stand-alone analysis or a part of a more detailed analysis. When the PHA is part of a more comprehensive risk assessment, the results of the analysis are used to screen events for further study.

10.3.2 Analysis Procedure

The PHA can be carried out in seven steps. Step 1 is described in Chapter 3 and the details of this step are therefore not repeated here. We first list the seven steps and then give a more thorough description of steps 2–7.

(1) Plan and prepare.
(2) Identify hazards and hazardous events.
(3) Determine the frequency of hazardous events.
(4) Determine the consequences of hazardous events.
(5) Suggest risk reduction measures.
(6) Assess the risk.
(7) Report the analysis.

The analysis procedure is shown in Figure 10.1.

10.3.2.1 Step 2: Identify Hazards and Select Hazardous Events

The aim of this step is to establish a list of hazardous events that can be further analyzed in the following steps. The hazard identification in a PHA is often done during a meeting and based on either a generic checklist or a specific list developed for the purpose. This is combined with experience from the study object (or similar systems) and the knowledge and expertise of the participants. The meeting will typically be a structured brainstorming session, where the checklist provides the structure. Personnel from all parts of the study object should participate in identifying hazardous events.

The outcome of this process is typically a combined list of hazards, enabling events and conditions, and hazardous events. Initially, it is an advantage if this list is long, to increase the chance of catching all relevant hazardous events. The participants should, therefore, be encouraged to include as much as possible on this initial list. Structuring and filtering then take place before a final list of hazardous events is established. This serves several purposes:

• Sort the list to obtain a structured set of hazardous events, without overlap. Observe that a specific hazard and enabling condition may be the cause of several hazardous events (and vice versa).
• Filter out overlapping hazardous events, to avoid counting risk twice in the analysis.
• Filter out all hazardous events that obviously have a low risk associated with them because of negligible probability and consequence.

It may be necessary to reduce the number of hazardous events to save time and resources in the analysis. Events that are removed from the list should be documented by a brief note describing why they have been left out. Each of the remaining events should be clearly defined, especially related to what is happening, when it is happening, and where it is happening.

In many cases, it may be appropriate to identify events in specific categories, for example (i) random events and (ii) deliberate actions. These categories may be split into subcategories and thus make it easier to identify relevant events. As new events are identified, new subcategories are often identified – and this process goes in a loop until the study team considers the list of hazardous events to be sufficient in relation to the objectives of the study.

During this session, the study team tries to identify what may happen in the future. In this work, it may be helpful to use not just checklists but also other sources such as

• Reports from previous accidents and incidents
• Accident statistics
• Expert judgments
• Operational data
• Existing emergency plans

Figure 10.1 Analysis workflow for PHA (a flowchart of steps 1–7, with the input and output of each step).

To identify deliberate actions, it may be useful to start by identifying potential threat actors. Grouping these into categories may help to identify potential actions. Possible categories are cyber criminals, ordinary criminals, visitors, external staff, competitors, drug addicts, and employees. The description of actions should include a description of the mode of action. In addition to using checklists and other information, it may help the process to ask more generic questions, such as

• Are there any hardware hazards?
• Are there any software hazards?
• Are there any human-induced hazards?
• Are there any procedure-related hazards?
• Are there any obvious interface hazards between software, hardware, and humans?


To get an overview of all the hazardous events, it may be appropriate to enter them into a matrix such as the one shown in Figure 10.2.

Figure 10.2 Hazards and threats in various locations of the study object. The rows of the matrix are the locations: laboratory, chemical storage, workshop, garage, data server, and administration building. The columns are the hazards/threats: fire, theft, falling load, explosion, crushing, and falling person.


When a hazardous event is considered to be relevant for a specific location, a cross is entered into the matrix to indicate that this event must be treated in the further analysis. To aid the further analysis, a PHA worksheet such as the one in Figure 10.3 is used. It is usually most efficient to complete the evaluation of one event (both frequency and consequence) before studying the next event.
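A minimal Python sketch of how such a matrix may be recorded; the locations and hazards follow Figure 10.2, whereas the crosses are invented for the illustration:

```python
# Hypothetical screening matrix in the spirit of Figure 10.2: locations as
# rows, hazards/threats as columns, and a cross where the event is relevant.
locations = ["Laboratory", "Chemical storage", "Workshop", "Garage"]
hazards = ["Fire", "Theft", "Explosion", "Falling load"]

relevant = {
    ("Laboratory", "Fire"),
    ("Chemical storage", "Fire"),
    ("Chemical storage", "Explosion"),
    ("Workshop", "Falling load"),
}

for loc in locations:
    marks = ["x" if (loc, haz) in relevant else "-" for haz in hazards]
    print(f"{loc:18s} " + " ".join(marks))
```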

10.3.2.2 Step 3: Determine the Frequency of Hazardous Events

In this step, the study team identifies and discusses the causes and estimates the frequency of each of the hazardous events that were identified in step 2. The causal analysis is usually rather brief and coarse, and usually only the obvious and direct causes of each event are recorded. The frequencies are mainly estimated as frequency classes, for example, as proposed in Table 6.8. The frequency estimation may be based on historical data (have similar events occurred earlier?), expert judgment, and assumptions about the future. The historical data may comprise statistics and reports on near accidents and special events from the company (or plant), from other industries, from organizations, and from authorities. In addition to the causes, the safeguards that have already been implemented to prevent the hazardous event should be assessed and taken into account when the frequency is determined. For more details about safeguards (barriers), see Chapter 14.

10.3.2.3 Step 4: Determine the Consequences of Hazardous Events

In this step, the potential consequences following each of the hazardous events in step 2 are identified and assessed. The assessment should consider both immediate consequences and consequences that will emerge after some time. As for the frequency, categories are usually also used for consequences, such as those in Table 6.9. Several approaches are used. Among these are to assess:

(1) The most probable case
(2) The worst conceivable case
(3) The worst credible case (i.e. the worst case that may reasonably occur)

Which approach to select depends on the objectives of the PHA, and on whether the PHA is a first step in a more comprehensive risk analysis. An important consideration is that to get a realistic picture of the risk, there has to be a link between the estimation of frequency and consequence. The reason for this is that the hazardous event represents an event where there may be a loss, whereas the consequences represent actual loss. In many cases, the probability of a serious loss may be small, even if the hazardous event occurs. If we apply the worst conceivable consequence category, we may end up overestimating the risk by several orders of magnitude in extreme cases. This is illustrated in Example 10.3.

Figure 10.3 Sample PHA worksheet, filled in for the LNG tank truck transport described in Example 10.5. The worksheet has columns for system element or activity, hazard/threat, hazardous event (what, where, when), cause (triggering event), consequence (harm to what?), risk (frequency, consequence, and RPN), risk reduction measure, responsible, and comment.

Example 10.3 (Falling down stairs) Assume that we are doing a PHA on a passenger ferry and have identified "person falling down stairs" as a hazardous event. On a ferry that is moving, this is likely to be a relatively frequent event, say at least once per year. If we then turn to the consequence, and apply the worst conceivable case, this is clearly a fatality. A person may be unlucky and break her neck when falling. On the other hand, if we apply the most probable case, it is likely that the person is not injured at all (except maybe injured pride!) or perhaps has only minor injuries. In the first case, we thus conclude that the risk is at least one fatality per year; in the second case, we say that the risk is minor injuries at least once per year. It is highly likely that risk will be managed differently depending on which approach is chosen. ◽

In many cases, the scope of the risk analysis may require that we assess the consequences for different assets, such as related to people (employees, third party), the environment, facilities, equipment, and reputation. Assets were discussed in Chapter 2. In such situations, we need separate consequence categories for each type of asset, and a separate estimation of consequences is performed for each asset type. When assessing the consequences, the safeguards that have been introduced to mitigate the consequences should also be taken into account.

Remark 10.2 (Benefit of consequence ranking) Hammer (1993) claims that the benefit from this consequence ranking is rather limited. He recommends using the available time to eliminate or reduce the consequences instead of ranking them. On the other hand, MIL-STD-882E (2012) requires that such a column for consequence ranking be part of the PHA worksheet. ◽
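A small numeric illustration of the two approaches in Example 10.3; the frequency and the conditional fatality probability are invented for the illustration:

```python
# Hypothetical numbers for the "falling down stairs" example.
falls_per_year = 1.0           # frequency of the hazardous event
p_fatality_given_fall = 1e-4   # assumed probability that a fall is fatal

worst_conceivable = falls_per_year * 1.0            # 1 "fatality" per year
expected = falls_per_year * p_fatality_given_fall   # 0.0001 fatalities per year

print(worst_conceivable / expected)  # 10000.0, four orders of magnitude apart
```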

10.3.2.4 Step 5: Suggest Risk Reduction Measures

When the study team considers the safeguards already in place, suggestions for new or improved safeguards and other risk reduction measures will often pop up. All these suggestions should be recorded when they emerge. The focus on identifying new or improved safeguards depends on the objectives of the analysis. The study team should focus primarily on identifying hazards, threats, and hazardous events and on describing the level of risk related to the study object. To develop a comprehensive list of new risk reduction measures is not always an important objective of the PHA, although the process of identifying and analyzing hazardous events usually gives an insight that is useful also for identifying risk reduction measures.

A challenge with PHA is that it may be difficult to show the effect of risk reduction measures in the risk matrix and therefore also to rank the measures. This is because the categories that we use to rank frequency and consequence are usually quite broad, implying that a large risk reduction is often necessary to change category, as shown in Example 10.4.


Example 10.4 (Changing of frequency category) Assume that we have frequency categories defined as follows:

(1) Less than 0.001 events per year
(2) 0.001–0.01 events per year
(3) 0.01–0.1 events per year
(4) 0.1–1 events per year
(5) More than 1 event per year

Further, assume that a hazardous event has been categorized with frequency 3 in the risk analysis. If we identify a risk reduction measure that reduces the frequency of this event, the effect must be to reduce the frequency by a factor of about 10 (an order of magnitude) to reclassify the event from frequency category 3 to 2. Few risk reduction measures have a sufficient effect to achieve this. ◽

The list of proposed risk reduction measures should be screened and systematized to reveal if any of the proposals may have an effect on more than one hazardous event. Finally, the study team may also do a simple cost/benefit assessment of each proposal. Safeguards (barriers) are discussed further in Chapter 14.
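A minimal Python sketch of the categories in Example 10.4, with invented frequencies, showing why a risk reduction measure must change the frequency by roughly an order of magnitude before it becomes visible in the risk matrix:

```python
def frequency_category(events_per_year: float) -> int:
    """Map an event frequency to the categories in Example 10.4."""
    if events_per_year < 0.001:
        return 1
    if events_per_year < 0.01:
        return 2
    if events_per_year < 0.1:
        return 3
    if events_per_year < 1:
        return 4
    return 5

print(frequency_category(0.05))   # 3, the event as first categorized
print(frequency_category(0.02))   # 3, a 60% reduction does not show in the matrix
print(frequency_category(0.004))  # 2, only an order-of-magnitude reduction does
```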

10.3.2.5 Step 6: Assess the Risk

In this step, the risk related to the study object is described as a listing of all the potential hazardous events, together with their associated frequencies and consequences. A risk priority number (RPN) is sometimes calculated for each hazardous event (see Section 6.4.4). It is also common to enter the hazardous events into a risk matrix to illustrate the risk. This may be helpful when improvements are to be evaluated and/or ranked. Risk matrices are discussed in Chapter 6.
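As an illustration, the hazardous events may be compiled into a risk matrix and given an RPN. In the sketch below, the events and class values are simplified from the sample worksheet in Figure 10.3, and the RPN is taken as the sum of the frequency and consequence classes, as in that worksheet:

```python
# Hypothetical hazardous events with frequency and consequence classes (1-5).
events = [
    {"event": "LNG leakage, not ignited",    "freq": 3, "cons": 2},
    {"event": "LNG leakage, fire/explosion", "freq": 1, "cons": 5},
]

# Additive risk priority number, as used in the worksheet in Figure 10.3.
for e in events:
    e["rpn"] = e["freq"] + e["cons"]

# 5x5 risk matrix: count the events that fall in each cell.
matrix = [[0] * 5 for _ in range(5)]
for e in events:
    matrix[e["freq"] - 1][e["cons"] - 1] += 1

for row in reversed(matrix):  # print with the highest frequency class on top
    print(" ".join(str(n) for n in row))
```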

10.3.2.6 Step 7: Report the Analysis

The results and the lessons learned from the PHA must be reported to the management, safety personnel, and other stakeholders, such that the results from the analysis can be used for safety management. The results from a PHA are usually presented in a specific PHA worksheet, as shown in Figure 10.3, often supplemented with a risk matrix. PHA worksheets are not standardized and are very often tailored to the specific application, by omitting some of the columns or adding others.

Example 10.5 (LNG transport system) A tank truck is used to transport liquefied natural gas (LNG) from an LNG terminal to a customer. The study object comprises uploading of LNG to the tank truck, transport to the customer, unloading of the LNG, and transport (empty) back to the LNG terminal. Uploading and unloading of the LNG are carried out by the truck driver through a flexible pipe equipped with a quick coupling. The truck is driving on a public road with normal traffic. The potential hazardous events can be classified into two main groups:

(1) Events that do not result in release of LNG, comprising
• Normal traffic accidents
• Harm to the driver during uploading or unloading of LNG
(2) Events that result in release of LNG, comprising
• Rupture of the flexible pipe
• Cracks or holes in the LNG tank
• Release from the safety valve
• Release of LNG during coupling or uncoupling of the flexible pipe

Some hazards related to the LNG transport system are shown in the PHA worksheet in Figure 10.3. The worksheet is only an illustration and not the result of a thorough study of the LNG transport system. ◽

Remark 10.3 (Extended PHA worksheet) Some guidelines for PHA recommend using a worksheet with two separate columns for risk (including frequency, consequence, and RPN), one for the current design and one for when the proposed safeguards have been implemented. In this way, it is possible to study the effect of the safeguards proposed. This is not done in Figure 10.3, but such a column is easy to add. ◽

10.3.2.7 HAZID

In some applications, it may be relevant to use a simplified worksheet, as shown in Figure 10.4. The analysis based on such a simplified worksheet is sometimes called a simplified PHA, or a HAZID. The application of the HAZID worksheet in Figure 10.4 is illustrated by some hazards that are relevant for Example 10.5. The objective of a HAZID is to reveal the hazardous events that should be subject to further study in a more detailed risk analysis. As seen, it is often problematic to estimate the frequencies and the consequences. The same hazardous event may give many different consequences. A possible approach is to describe the most relevant consequences, classify these into consequence classes and then estimate the related frequencies. This is done in the PHA worksheet in Figure 10.3. In the HAZID worksheet in Figure 10.4, only an average consequence is presented. This will not give sufficient information about the risk, but because the analysis is performed to reveal events that should be subject to further analysis, this approach may be sufficient. Another approach that is sometimes used is to split a hazardous event into two events, one with limited consequences and one with serious consequences. Normally, the frequency of the event with limited consequences will then be classified as higher than the one with serious consequences.

Figure 10.4 Sample HAZID worksheet, filled in for the LNG tank truck transport in Example 10.5. The worksheet has columns for hazardous event (what, where, when), comments on and estimate of the frequency class, comments on and estimate of the consequence class, and RPN with a color code; the entries cover a collision between the tank truck and another vehicle, and a leakage of gas through the safety valve.

10.3.3 Resources and Skills Required

The PHA requires experience and understanding of the study object, and it is therefore common to put together a team comprising people with different backgrounds, experience, and expertise. The size of the group can vary a lot, but 4–10 people is quite common, although larger groups may be established. The analysis may also be carried out by just one or two experienced engineers, preferably with a background as safety engineers.

If the PHA is carried out in an early stage of a project, a limited amount of information about the study object will be available. For a process plant, the process concept has to be settled before the analysis is initiated. At that point in time, the most important chemicals and reactions are known, together with the main elements of the process equipment (e.g. vessels, pumps). Due to the versatility of PHA, it is also used in later project phases or in operations, when more detailed information is available. The objective may then be to establish a first overview of the risk, as a basis for doing more detailed analysis with other methods.

The PHA must be based on all the safety-related information about the study object, such as design criteria, equipment specifications, specifications of materials and chemicals, previous accidents, and previous hazard studies of similar systems that are available at the time when the analysis is performed (e.g. see MIL-STD-882E 2012). Computerized tools and a variety of hazard checklists are available and may assist the study team in performing the PHA. Frequently, spreadsheets or word processors are also used to record the information.

10.3.4 Advantages and Limitations

Advantages: The main advantages are that PHA
• is simple to use and requires limited training;
• is a useful first step in most risk analyses and has been used extensively in defense and process applications;
• identifies and provides a log of hazards and their corresponding risk;
• can be used in early project phases, that is, early enough to allow design changes;
• is a versatile method that can cover a range of problems.

Limitations: The main limitations are that PHA
• is difficult to use to represent events with widely varying consequences;
• fails to assess risk of combined hazards or coexisting system failure modes;
• may be difficult to use to illustrate the effect of additional safeguards and to provide a basis for prioritizing safeguards.


10.4 Job Safety Analysis

A JSA is a simple risk assessment method that is applied to review job procedures and practices to identify potential hazards and determine risk reduction measures. Each job is broken down into specific tasks, for which observation, experience, and checklists are used to identify hazards and associated controls and safeguards. The JSA is carried out by a team, and most of the work is done in a JSA meeting. The results from the analysis are documented in JSA worksheets, as shown in Figure 10.6. Because of its in-depth and detailed nature, the JSA can identify potential hazards that may go undetected during routine management observations or audits. JSA has been used for many years and in many industries and has been shown to be an effective tool for identifying hazardous conditions and unsafe acts. Other names for JSA include safe job analysis (SJA), job hazard analysis (JHA), and task hazard analysis (THA).

10.4.1 Objectives and Applications

JSA is used for three main purposes:

1. Nonroutine jobs. JSA is carried out for nonroutine and one-off jobs that are considered to have a high risk. The objectives are to
(a) Make the operators aware of inherent or potential hazards that may be encountered when executing the job;
(b) Ensure that responsibilities are clear and that work is coordinated between different operators involved;
(c) Provide prejob safety instructions;
(d) Give operators guidance on how to deal with hazardous events if such events should happen;
(e) Teach operators and supervisors how to perform the job correctly and in the safest possible way.

2. Dangerous routine jobs. JSA is performed to scrutinize and make improvements to jobs that have led to several incidents or accidents. More detailed objectives are to
(a) Reveal hazardous motions, postures, activities, or work practices of individual employees;
(b) Help determine how hazards should be managed in the work environment;
(c) Teach supervisors and employees how to perform operations correctly;
(d) Enhance communication between management and employees regarding safety concerns;
(e) Increase employee involvement in the safety process;
(f) Provide new personnel with "on-the-job" safety awareness training;
(g) Create a basis for training and introducing new employees into the work environment.


3. New work procedures. JSA may be used as a basis for establishing work instructions for new routine jobs. In this case, the main objective is to ensure that hazards that personnel may be exposed to are identified in advance and that proper controls are included in the new procedures.

10.4.2 Analysis Procedure

The analysis procedure is slightly different for the three categories of applications of JSA. For all categories, the JSA typically includes the following seven steps:

(1) Plan and prepare.
(2) Become familiar with the job.
(3) Break down the job.
(4) Identify the hazards.
(5) Categorize frequencies and consequences.
(6) Identify risk reduction measures.
(7) Report the analysis.

Each step must be completed before starting the next step. The various steps are explained in more detail later, and the analysis workflow for JSA is shown in Figure 10.5.

10.4.2.1 Step 1: Plan and Prepare

This task was discussed in Chapter 3. To decide which jobs need a JSA, the following criteria may be used:

Nonroutine jobs:
• System hazard analyses have indicated that the job is critical or dangerous.
• The job has in the past produced a fatality or disabling injury in the same or a similar system.
• The job involves hazardous materials or hazardous energy sources.
• The job is new.

Dangerous routine jobs:
• The job has produced a high frequency of incidents or accidents.
• The job has led to one or more severe accidents or to incidents with potentially high severity.
• The job involves several actors and requires communication and coordination.
• The job involves hazardous materials.
• The job involves hazardous energy sources.
• The job is complicated.

New work procedures:
• All jobs that can potentially cause harm.

Figure 10.5 Analysis workflow for JSA (a flowchart of steps 1–7, with the input and output of each step).

JSA Team
The analysis is performed by a team comprising:

• A JSA leader (preferably the supervisor of the job to be analyzed).
• The line manager who is responsible for the job.
• A health, safety, and environment (HSE) representative.
• The workers who are going to do the job. The number of workers who should be involved depends on the type of job. If only a few workers are going to carry out the job, all of them should be involved in the analysis. If the objective of the JSA is to establish safe work procedures that will apply for a large number of workers, it may be relevant to involve two or three workers who are familiar with the job.

If the job is complicated, it may be beneficial to appoint a secretary, who can record the results in the JSA worksheets. It is important that at least one of the team members has knowledge of and experience with JSA. If none of the team members fulfill this requirement, an extra person with such knowledge should be invited to the team.

JSA Meeting
Most of the analysis is done in a JSA meeting where all team members must be present. The JSA meeting for nonroutine jobs should take place as close in time to the job execution as possible.

Background Information
Before the JSA meeting, the JSA leader should, when relevant:
• Provide written work procedures, manuals, drawings, and other information relevant to the job. It is important that the information be up-to-date and reflect the current situation.
• Develop a preliminary breakdown of the job sequence into distinct tasks and list the tasks in the order in which they are to be performed.

JSA Worksheet
The JSA leader must prepare a suitable JSA worksheet that can be used to record the findings during the JSA meeting. Two alternative worksheets are shown in Figures 10.6 and 10.7.

10.4.2.2 Step 2: Become Familiar with the Job

The JSA leader presents the job and its boundary conditions to the team members. The team discusses various aspects of the job and ensures that all team members understand the job and the various aspects related to it. If in doubt, more information has to be provided, and it may also be necessary to visit the workplace and make observations. A JSA is not suitable for jobs that are defined too broadly or vaguely, for example “perform maintenance work,” or too narrowly, for example “opening the valve.”

Figure 10.6 JSA worksheet (example), filled in for the container lifting job in Example 10.6. The worksheet has columns for task, hazard/cause, potential consequences, risk (frequency, consequence, and RPN), risk reduction measure, and responsible.

Figure 10.7 Alternative JSA worksheet, with fields for job sequence number, task number, JSA team members, task description, hazards, potential consequences, risk reduction measures, required safety equipment, required personal protection equipment, and date and signature.

10.4.2.3 Step 3: Break Down the Job

Most jobs can be broken down into a sequence of tasks. The starting point of this step is the preliminary job breakdown that has been made by the JSA leader prior to the meeting. It is important that the tasks be described briefly and be action-oriented. For each task, what is to be done should be described, for example, by using words such as lift, place, remove, position, install, and open – not how this is done. The level of detail required is determined by the level of risk. Tasks that do not have the potential to cause significant accidents do not need to be broken down into lower levels. When the job breakdown is done, the sequence of the tasks should, if relevant, be verified by the worker(s) to ensure completeness and accuracy. As a rule of thumb, if the number of tasks grows too large, say more than 20, the job is most likely too complicated to be efficiently analyzed by JSA. The following aspects should be considered:

(a) The task as it is planned to be done
(b) The preparation before the task, and the closing of the task
(c) Communication and coordination during the task
(d) Special activities such as tool provision, cleaning, and so on
(e) Correction of deviations that may occur

It may also be relevant to consider

(f) Maintenance and inspection/testing of equipment and tools
(g) Corrective maintenance tasks

This step of the analysis is often more time-consuming than the subsequent hazard identification in step 4. It is important that the task descriptions are short and that all tasks be covered. Before starting to identify hazards in step 4, the JSA team should ensure that the list of tasks is as complete as possible.

10.4.2.4 Step 4: Identify the Hazards

Once the job has been broken down into tasks, each task is reviewed to identify any actual or potential hazards. To accomplish this, the team may wish to observe a similar job being carried out if possible, and consult accident reports, employees, management/supervisors, industrial or manufacturing organizations, or other companies with similar operations. Hazard checklists are useful at this stage. One such checklist is, for example, included in ISO 12100 (2010). Checklists are also supplied in OSHA (2002, App. 2) and in NOG-090 (2017, Annex C). Questions such as those listed in Table 10.4 may be useful when identifying hazards.

Table 10.4 Questions used to identify hazards (partly based on CCOHS (2009)).

• Is there danger of striking against, being struck by, or otherwise making harmful contact with an object?
• Can any body part get caught in, by, or between objects?
• Do tools, machines, or equipment present any hazards?
• Can the worker make harmful contact with moving objects?
• Can the worker slip or trip?
• Can the worker fall from one level to another or even fall on the same level?
• Can the worker suffer strain from lifting, pushing, bending, or pulling?
• Is the worker exposed to extreme heat or cold?
• Is excessive noise or vibration a problem?
• Is there a danger from falling objects?
• Is lighting a problem?
• Can weather conditions affect safety?
• Are there flammable, explosive, or electrical hazards?
• Is harmful radiation a possibility?
• Can contact be made with hot, toxic, or caustic substances?
• Are there dust, fumes, mists, or vapors in the air?

10.4.2.5 Step 5: Categorize Frequencies and Consequences

To be able to prioritize between different risk reduction actions, each hazard needs to be evaluated with respect to frequency and potential consequences. The frequency and consequences may be classified according to the categories in Tables 6.8 and 6.9, but other and simpler categories may also be used, such as "high," "medium," and "low." In many JSAs, the frequency and consequence categorization is skipped and not documented in the JSA worksheet. An alternative JSA worksheet without risk ranking is shown in Figure 10.7.
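A minimal sketch of how worksheet rows with a simple high/medium/low ranking may be recorded and screened; the rows are simplified from Figure 10.6, and the category values are invented:

```python
# Hypothetical JSA worksheet rows with simple qualitative categories.
rows = [
    {"task": "Attach hooks to container", "hazard": "Heavy hooks",
     "consequence": "Back problems", "freq": "medium", "cons": "low"},
    {"task": "Lift the container", "hazard": "Container swings toward worker",
     "consequence": "Serious crushing harm", "freq": "low", "cons": "high"},
]

# Simple screening: flag rows where either category is "high".
for row in rows:
    flag = "FOLLOW UP" if "high" in (row["freq"], row["cons"]) else "ok"
    print(f'{row["task"]:28s} {row["hazard"]:32s} {flag}')
```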

10.4.2.6 Step 6: Identify Risk Reduction Measures

Once the hazards have been identified and assessed, the JSA team proposes one or more of the following measures to manage or eliminate the associated risk.

• Engineering controls:
– Eliminate or minimize the hazard through design changes. Change the physical conditions that create the hazards. Substitute with a less hazardous substance. Modify or change equipment and tools.
– Enclose the hazard or the personnel.
– Isolate the hazard with guards, interlocks, or barriers.
• Administrative controls:
– Use written procedures, work permits, and safe practices.
– Introduce exposure limitations.
– Improve training.
– Increase monitoring and supervision during the job.
– Reduce the frequency of the job or of a task.
– Find a safer way to do the job.
• Personal protective equipment:
– Hard hats
– Safety glasses
– Protective clothing
– Gloves
– Hearing protection
… and so on.


The type of measures that are proposed depends on the objective of the analysis. If the analysis is done as a preparation for performing a job, it is mainly measures that can be implemented immediately, before the work is started, that are relevant. If the objective is to improve safety performance in the longer term, more long-term risk reduction measures may also be relevant. For further details, see, for example, OSHA (2002), CCOHS (2009), and NOG-090 (2017).

10.4.2.7 Step 7: Report the Analysis

The results from the JSA are usually documented in a worksheet, for example as shown in Figures 10.6 and 10.7. More details may be found in Chapter 3.

Example 10.6 (Lifting heavy containers) A number of heavy containers are to be lifted from a ship to a quay. The containers are lifted by a large crane. A worker attaches four hooks to each container. To attach the hooks, he has to climb a portable ladder, fetch a hook, and attach it to the container. The hook is rather heavy, and the work position is awkward. It is therefore easy to get back problems. It is further possible to crush fingers and hands between the hook and the container. When the hooks are attached, the worker signals the crane operator to start lifting the container. During this operation, the container may swing and hit the worker. Other hazards are related to falls on the same level and from the ladder. Figure 10.6 shows a few hazards related to this job. The figure is meant as an illustration and is not based on a thorough analysis. ◽

10.4.3 Resources and Skills Required

JSA is a simple method and does not require any formal education or deep analytical skills. The analysis must be carried out by personnel who are familiar with the job and who know the system in which it is executed. The number of team members that are required depends on the complexity of the job and may vary from 2 to 12. At least one of the team members should be familiar with the analysis method and have some experience from earlier JSAs. The time needed to carry out a JSA depends on how complicated the job is and on the experience of the team members. A single task may not need more than a few minutes. JSAs seldom last more than two to three hours.

A JSA requires a fairly extensive collection of information. This includes information that is necessary both to understand the tasks and to identify hazards. For systems that have been in operation for some time, a lot of experience is often available. First and foremost, this is found among those who do the work and with supervisors. The information may be collected through

• Interviews
• Written job instructions (may be inaccurate and often incomplete)
• Manuals for machinery
• Work studies, if they exist
• Direct observation of tasks
• Audiovisual aids (photography or video recording)
• Reports of accidents and near misses

A good understanding of the job is a necessity, and unless the participants have experience from the job themselves, observations and interviews are necessary for the analysis to be done properly. It is also important to establish a good and trustful contact with the persons who carry out the work to be analyzed.

10.4.4 Advantages and Limitations

Advantages. The main advantages are that a JSA
• gives the workers training in safe and efficient work procedures;
• increases the workers' awareness of safety problems;
• introduces new employees to the job and safe work procedures;
• provides prejob instructions for nonroutine jobs;
• identifies safeguards that need to be in place;
• enhances employee participation in workplace safety;
• promotes positive attitudes about safety.

Limitations. The main limitations of a JSA are that
• for complicated jobs, it will be too time-consuming;
• if extensive coordination is required, it is not suited for uncovering potential problems;
• it is not a very structured method and may, therefore, be too superficial in some instances.

10.5 FMECA

Failure modes and effects analysis (FMEA) was one of the first systematic techniques for failure analysis of technical systems. The technique was developed by reliability analysts in the late 1940s to identify problems in military systems. The traditional FMEA identifies and describes the possible failure modes, failure causes, and failure effects. When we, in addition, describe or rank the severity of the various failure modes, the technique is called FMECA. The borderline between FMEA and FMECA is vague, and there is no good reason to distinguish between them. In the following, we use the term "FMECA."

FMECA is a simple technique and does not build on any particular algorithm. The analysis is carried out by reviewing as many components, assemblies, and subsystems as possible to identify failure modes, causes, and effects of such failures. For each component, the failure modes and their resulting effects on the rest of the system are entered into a specific FMECA worksheet. Technical failures and failure modes are introduced in Chapter 2.

FMECA is mainly an effective technique for reliability engineering, but it is also often used in risk analyses. There are several types of FMECAs. In the context of risk analysis, the most relevant type is the product FMECA, which is also called a bottom-up FMECA, and this section is restricted to this type. Because FMECA was developed as a reliability technique, it will also cover failure modes that have little or no relevance for the risk related to the study object. When the objective of the FMECA is to provide input to a risk analysis, these failure modes may be omitted from the FMECA.

When performing an FMECA, it is important to keep the definition of a failure mode in mind. As explained in Chapter 2, a failure mode may be regarded as a deviation from the performance criteria for the component/item.

10.5.1 Objectives and Applications

The objectives of an FMECA are to

(1) identify how each of the system components can conceivably fail (i.e. what are the failure modes?);
(2) determine the causes of these failure modes;
(3) identify the effects that each failure mode can have on the rest of the study object;
(4) describe how the failure modes can be detected;
(5) determine how often each failure mode will occur;
(6) determine how serious the various failure modes are;
(7) assess the risk related to each failure mode;
(8) identify risk reduction actions/measures that may be relevant.

FMECA is used mainly in the design phase of a technical system to identify and analyze potential failures. The analysis is qualitative, but may have some quantitative elements, including specifying the failure rate of the failure modes and a ranking of the severity of the failure effects. FMECA can also be used in later phases of a system's life cycle. The objective is then to identify parts of the system that should be improved to meet certain requirements regarding safety or reliability, or as input to maintenance planning.

Many industries require that an FMECA be integrated in the design process of technical systems, and that FMECA worksheets be part of the system documentation. This is, for example, a common practice for suppliers to the defense, aerospace, and automobile industries. The same requirements are becoming common in the offshore oil and gas industry.

10.5 FMECA

10.5.2

Analysis Procedure

The FMECA may be carried out in seven steps: (1) (2) (3) (4) (5) (6) (7)

Plan and prepare. Carry out system breakdown and functional analyses. Identify failure modes and causes. Determine the consequences of the failure modes. Assess the risk. Suggest improvements. Report the analysis.

Steps 1 and 7 are described thoroughly in Chapter 3, so are not repeated here, but a few comments on step 7 are given. The analysis workflow is shown in Figure 10.8. A dedicated FMECA worksheet is used when performing the analysis. A typical FMECA worksheet is shown in Figure 10.10. Steps 3–6 are described by referring to the relevant columns in this worksheet. 10.5.2.1

Step 2: Carry Out System Breakdown and Functional Analyses

The main tasks of this step are to

(1) define the main functions (missions) of the system and specify the function performance criteria;
(2) describe the operational modes of the system;
(3) break down the system into subsystems that can be handled effectively. This can be done, for example, by establishing a system breakdown structure, as shown in Figure 4.1;
(4) review the system functional diagrams and drawings to determine interrelationships between the various subsystems. These interrelations may be illustrated by drawing functional block diagrams where each block corresponds to a subsystem;
(5) prepare a complete component list for each subsystem. Each component should be given a unique identification number. These numbers are sometimes referred to as tag numbers;
(6) describe the operational and environmental stresses that may affect the system and its operation. These should be reviewed to determine the adverse effects that they could generate on the system and its components.

Items at the lowest level of the system breakdown structure are called components here. The structure in Figure 4.1 has only three levels. The number of levels that should be used depends on how complicated the study object is. All subsystems do not have to be broken down to the same number of levels. The functions and their performance requirements for each component should be discussed and understood by the study team. The same applies to all subsystems, subsubsystems, and so on.
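As an illustration, a system breakdown structure with tag numbers may be recorded as a nested data structure. The sketch below is in Python, and all names and tag numbers are invented:

```python
# Hypothetical three-level system breakdown structure with tag numbers,
# in the spirit of Figure 4.1.
system = {
    "name": "Process system",
    "subsystems": [
        {
            "name": "Gas shutdown subsystem",
            "components": [
                {"tag": "4.1", "name": "Shutdown valve", "function": "Close gas flow"},
                {"tag": "4.2", "name": "Valve actuator", "function": "Operate the valve"},
            ],
        },
    ],
}

# Produce the component list for the FMECA worksheet, one line per component.
for subsystem in system["subsystems"]:
    for component in subsystem["components"]:
        print(f'{component["tag"]}: {component["name"]} - {component["function"]}')
```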


Figure 10.8 Analysis workflow for FMECA (a flowchart of steps 1–7, with the input and output of each step).

The FMECA is most often applied to the components on the lowest level in the system hierarchy, but may also be applied to other levels, for example the subsubsystem level. The term “components” is used when describing the worksheet.


The results of step 2 are entered into columns 1–3 in Figure 10.9.

Reference (column 1). A unique reference to the component is given in this column. The reference can be to a drawing or some other documentation.

Function (column 2). The function(s) of the component is (are) described in this column.

Operational mode (column 3). The component may have various operational modes, for example, running or standby. Operational modes for an airplane may include, for example, taxi, takeoff, climb, cruise, descent, approach, flare-out, and roll.

10.5.2.2 Step 3: Identify Failure Modes and Causes

For each component, the relevant failure modes and failure causes are identified. Experience data and generic failure mode checklists may provide useful help. The results of step 3 are entered into columns 4–6 in Figure 10.9.

Failure mode (column 4). For each function and operational mode, the related failure modes are identified and recorded, one by one. When identifying failure modes, it is important to relate these to the functions and the performance criteria for the component.

Failure cause (column 5). For each failure mode in column 4, the possible failure causes and failure mechanisms are recorded. Relevant failure causes are corrosion, erosion, fatigue, overstress, maintenance errors, operator errors, and so on.

Detection of failure (column 6). The possible ways of detecting the identified failure modes are then recorded. These may involve condition monitoring, diagnostic testing, functional testing, human perception, and so on. One type of failure is the evident failure. Evident failures are detected instantly when they occur. The failure mode "spurious stop" of a pump with the operational mode "running" is an example of an evident failure. Another type of failure is the hidden failure. A hidden failure is typically detected only during testing of the component. The failure mode "fail to start" of a fire pump with the operational mode "standby" is an example of a hidden failure. When FMECA is used in the design phase, this column will record the designer's recommendations for condition monitoring, functional testing, and so on.

10.5.2.3 Step 4: Determine the Consequences of the Failure Modes

For each failure mode, the credible consequences are entered into columns 7 and 8 in the FMECA worksheet in Figure 10.9.
Local effects of failure (column 7). Here, the consequences that the failure mode will have on the next higher level in the hierarchy are recorded.
System effects of failure (column 8). All the main effects of the failure mode on the primary function of the system are now recorded. The resulting operational status of the system after the failure may also be recorded, that is, whether the system is functioning or has to be switched over to another operational mode.


[Figure 10.9 Example of an FMECA worksheet. Header fields: Study object: Process-system east; Reference: Process diagram 14.3-2019; Date: 2019-12-20; Name: Marvin Rausand. The worksheet has 15 columns: (1) Ref. no., (2) Function, (3) Operational mode, (4) Failure mode, (5) Failure cause, (6) Detection of failure, (7) Effect on the subsystem, (8) Effect on the system function, (9) Frequency, (10) Severity, (11) Detectability, (12) RPN, (13) Risk reduction measure, (14) Responsible, and (15) Comment. Example rows for a shutdown valve include: (4.1) function "Close gas," operational mode "Normal operation," failure mode "Valve fails to close on demand," with causes such as broken spring and hydrates in the valve, detected by periodic function test, subsystem effect that the shutdown function fails, system effect "Production must be stopped," frequency 2, severity 4, detectability 4, RPN 10, and risk reduction measures "Periodic control of spring" and "Periodic operation of valve"; (4.2) function "Open gas," operational mode "Closed," failure mode "Valve cannot be opened on command," with causes such as erosion in the valve seat, sand between valve seat and gate, leakage in the hydraulic system, and too high friction in the actuator, detected by periodic function test, subsystem effect that the shutdown function is degraded, system effect "System cannot produce," frequency 2, severity 3, detectability 5, RPN 10, and risk reduction measure "Improved startup control to prevent sand production."]


10.5.2.4 Step 5: Assess the Risk

In this step, the frequency and severity of the consequences of each failure mode are estimated and recorded in classes. In some cases, it is also relevant to record the detectability of the failure mode. An RPN is sometimes also calculated. The results are entered into columns 9–12 in the FMECA worksheet in Figure 10.9.
Failure rate (column 9). Failure rates for each failure mode are then recorded. In many cases, it is more suitable to classify the failure rate in rather broad classes (see Table 6.8). Observe that the failure rate with respect to a failure mode may be different for the various operational modes. The failure mode "Leakage to the environment" for a valve may, for example, be more likely when the valve is closed and pressurized than when the valve is open.
Severity (column 10). The severity of a failure mode is usually interpreted as the worst credible consequence of the failure, determined by the degree of injury, environmental damage, or system damage that could ultimately occur (see Table 6.9).
Detectability (column 11). The consequences of a failure mode are sometimes dependent on how fast the failure mode can be detected. We classify the detectability in five groups, where group 1 means that the failure is detected immediately and group 5 means that the failure mode will usually not be detected. In the example in Figure 10.9, the failure mode "Valve fails to close on demand" is assessed as group 4, because the failure mode cannot be revealed until a functional test is performed, for example, once per six months.
RPN (column 12). The RPN was defined in Section 6.4.4 but is here a combination of the three preceding columns. The RPN of a failure mode is computed by summing the class numbers for frequency, severity, and detectability for that failure mode.
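As a minimal illustration of how the class numbers in columns 9–12 combine, the sketch below computes the RPN as the sum of the frequency, severity, and detectability classes, as described above. The two entries correspond to the valve failure modes in the example worksheet in Figure 10.9.

```python
def rpn(frequency: int, severity: int, detectability: int) -> int:
    """RPN as used in this worksheet: the sum of the three class numbers."""
    return frequency + severity + detectability

# Class numbers taken from the example rows in Figure 10.9
failure_modes = {
    "Valve fails to close on demand": (2, 4, 4),
    "Valve cannot be opened on command": (2, 3, 5),
}

for mode, (freq, sev, det) in failure_modes.items():
    print(f"{mode}: RPN = {rpn(freq, sev, det)}")
# Both failure modes get RPN = 10, matching column 12 of the worksheet.
```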

10.5.2.5 Step 6: Suggest Improvements

Risk reduction measures (column 13). Possible actions to correct the failure and restore the function or prevent serious consequences are then recorded. Actions that are likely to reduce the frequency of the failure modes should also be recorded.
Responsible (column 14). Here, the name of the person who should be responsible for the follow-up of the failure mode and/or the risk reduction measures that have been identified is recorded.
Comments (column 15). This column may be used to record pertinent information not included in the other columns.


10.5.2.6 Step 7: Report the Analysis

A wide range of results are likely to be produced from the FMECA process, and it is important to summarize both the process and the results in an FMECA report. When analyzing a large or complicated system, several FMECA processes may have been running at the same time. The FMECA report is a place to bring together the results of these analyses. The hazards identified by the FMECA process should be entered in the hazard log and be maintained as part of this log. As mentioned earlier in this section, there are several variants of the FMECA worksheet. The main columns are, however, covered in the worksheet in Figure 10.9. In the same way as for the PHA, the various failure modes may be entered into a risk matrix (see Section 6.4).

10.5.3 Resources and Skills Required

An FMECA can be carried out by a single person or by a study team, depending on the complexity of the system. FMECA does not require any deep analytical skills, but it requires a thorough understanding of the study object, its application, and its operating conditions. Although the FMECA process itself is simple, it can be time-consuming, and the quantity of data assessed and recorded can make it appear complicated. A structured and disciplined approach is essential if the full benefit of the FMECA is to be delivered. Several computer programs have been developed for FMECA. A suitable program can significantly reduce the workload of an FMECA and make it easier to update the analysis.

Remark 10.4 (A slightly different approach) An FMECA of a complicated system may be a tedious and boring job, and it is therefore sometimes left to junior personnel. Some companies have realized that this practice often results in mediocre quality of the FMECA and have therefore started using another approach that is similar to HAZOP (see Section 10.6). In this approach, a group of experts carries out the FMECA as a top-down analysis. When the significant failures have been identified and prioritized, junior personnel may fill in the necessary gaps. ◽

10.5.4 Advantages and Limitations

Advantages. The main advantages are that FMECA:
• is widely used and easy to understand and interpret;
• provides a comprehensive hardware review;
• is suitable for complicated systems;
• is flexible, such that the level of detail can be adapted to the objectives of the analysis;
• is systematic and comprehensive, and should be able to identify all failure modes with an electrical or mechanical basis;
• is supported by efficient computer software tools.

Limitations. The main limitations of FMECA are that:
• its benefits depend on the experience of the analyst(s);
• it requires a hierarchical system drawing as the basis for the analysis, which the analysts usually have to develop before the analysis can start;
• it considers hazards arising from single-point failures and will typically fail to identify hazards caused by combinations of failures;
• it can be time-consuming and expensive.

Another drawback is that all component failures are examined and documented, including those that do not have any significant consequences. For large systems, especially systems with a high degree of redundancy, the amount of unnecessary documentation work is a major disadvantage.

10.6 HAZOP

A HAZOP study is a systematic hazard identification process that is carried out by a group of experts (a HAZOP team) to explore how a system or a plant may deviate from the design intent and create hazards and operability problems. The analysis is done in a series of meetings as guided brainstorming based on a set of guidewords and process parameters. The system or plant is divided into a number of study nodes that are examined one by one. For each study node, the design intent and the normal state are defined. Then guidewords and process parameters are used in brainstorming sessions to give rise to proposals for possible deviations in the system. The HAZOP approach was developed by ICI Ltd in 1963 for the chemical industry (Kletz 1999). The main international standard for HAZOP is IEC 61882 (2016).

10.6.1 Guidewords

The guidewords and process parameters are supposed to stimulate individual thought and induce group discussions. Some typical guidewords are listed in Table 10.5. Several slightly different lists of guidewords may be found in the literature. The guidewords and process parameters should be combined in such a way that they lead to meaningful process deviations. Not all guidewords can be applied to all process parameters (e.g. "reverse" and "temperature").


Table 10.5 Generic HAZOP guidewords (Guideword: Deviation).

no/none: No part of the design intention is achieved (e.g. no flow or no pressure when there should be)
more of: An increase above the design intention is present, more of a physical property than there should be (e.g. higher flow, higher pressure, and higher temperature)
less of: A decrease below the design intention is present, less of a relevant physical property than there should be (e.g. lower flow, lower pressure, and lower temperature)
as well as: The design intent is achieved, but something else is present
part of: Only some of the design intention is achieved, wrong composition of the process fluid. A component may be missing, or the ratio may be too low/high
reverse: The design intention is the opposite of what happens
other than: The design intention is substituted by something different
early: Something happens earlier in time than expected
late: Something happens later in time than expected
before: Relating to a sequence or order, something happens before it is expected
after: Relating to a sequence or order, something happens after it is expected

10.6.2 Process Parameters

Typical process parameters for a chemical process are:
• Flow
• Pressure
• Temperature
• Level
• Composition
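Because deviations are generated by combining guidewords with process parameters, the pairing can be sketched programmatically. The short Python fragment below only illustrates the combination logic; the small exclusion list reflects the remark above that some combinations (such as "reverse" with "temperature") are not meaningful, and it is an assumption for the example rather than a complete rule set.

```python
guidewords = ["no/none", "more of", "less of", "as well as", "part of",
              "reverse", "other than", "early", "late", "before", "after"]
parameters = ["flow", "pressure", "temperature", "level", "composition"]

# Illustrative (incomplete) list of guideword/parameter pairs that give no
# meaningful deviation, e.g. "reverse temperature" has no physical meaning.
not_meaningful = {("reverse", "temperature"), ("reverse", "level"),
                  ("reverse", "composition")}

deviations = [
    f"{guideword} {parameter}"
    for parameter in parameters
    for guideword in guidewords
    if (guideword, parameter) not in not_meaningful
]

print(len(deviations), "candidate deviations, for example:")
print(deviations[:3])   # ['no/none flow', 'more of flow', 'less of flow']
```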

Example 10.7 (HAZOP questions) During the sessions, the HAZOP leader (i.e. the chairperson of the HAZOP team) will usually stimulate the discussion by asking questions such as the following:
(1) Could there be "no flow"?
(2) If so, how could it arise?
(3) What are the consequences of "no flow"?
(4) Are the consequences hazardous, or do they prevent efficient operation?
(5) Can "no flow" be prevented by changing the design or operational procedures?
(6) Can the consequences of "no flow" be prevented by changing the design or operational procedures?
(7) Does the severity of the hazard or problem justify extra expense? ◽

10.6.3 Objectives and Applications

The objectives of a HAZOP study are to:
(1) Identify all deviations from the way the system is intended to function, their causes, and all the hazards and operability problems associated with these deviations.
(2) Decide whether actions are required to control the hazards and/or the operability problems, and if so, identify the ways in which the problems can be solved.
(3) Identify cases where a decision cannot be made immediately, and decide what information or actions are required.
(4) Ensure that the actions decided on are followed up.
(5) Make operators aware of hazards and operability problems.

HAZOP studies have been used with great success in the chemical and petroleum industries for reviewing the process design to obtain safer, more efficient, and more reliable plants. HAZOP has become a standard activity in the design of the process systems on offshore oil and gas platforms in the North Sea. Today, HAZOP is used for hazard identification in many different application areas. The HAZOP approach was initially developed to be used during the design phase, but it can also be applied to systems in operation.

Several variants of the original HAZOP approach have been developed. Among the available approaches are:
Process HAZOP. This is the original HAZOP approach, developed to assess process plants and systems. This approach is described in the rest of this section.
Human HAZOP. This is a "family" of more specialized HAZOPs focusing on human errors rather than on technical failures (see Chapter 15).
Procedure HAZOP. This HAZOP approach is used to review procedures or operational sequences (sometimes denoted SAFOP, safe operation study). A procedure HAZOP may also be an extension of a JSA.
Software HAZOP. This variant of HAZOP is used to identify possible errors in the development of software.
Computer hazard and operability (CHAZOP). This variant of HAZOP is used to analyze control systems, including computers. It has primarily been used to find possible causes of process upsets due to control system failures.
Cyber HAZOP. This variant is used to identify and assess cyber threats and is discussed further in Chapter 17.


10.6.4 Analysis Procedure

The most common HAZOP study is carried out during the detailed engineering phase and involves eight steps:
(1) Plan and prepare.
(2) Identify possible deviations.
(3) Determine causes of deviations.
(4) Determine consequences of deviations.
(5) Identify existing barriers/safeguards.
(6) Assess risk.
(7) Propose improvements.
(8) Report the analysis.

Steps 1 and 8 are described in Chapter 3, and only some elements of these steps are discussed here. The HAZOP workflow is shown in Figure 10.10. A slightly different description of the HAZOP procedure is given by UK CAA (2006).

10.6.4.1 HAZOP Worksheet

The results from the HAZOP study are usually documented in a specific worksheet. An example of a HAZOP worksheet is shown in Figure 10.11. The columns in this worksheet are numbered from 1 to 13, and these are referred to when describing the HAZOP procedure. There is no standard worksheet, and several variants are used. Some worksheets do not include columns for risk ranking.

10.6.4.2 Step 1: Plan and Prepare

The main elements of this step are discussed in Chapter 3, but a few issues need additional comments.

Establish the HAZOP Team. The composition and the knowledge of the HAZOP team are very important for the success of the analysis. The HAZOP team should be a multidisciplinary team of experts, typically five to eight persons, who have extensive knowledge of the design, operation, and maintenance of the plant, and thus should be able to evaluate all the likely effects of deviations from the design intent. A HAZOP team assigned to consider a new chemical plant may, for example, comprise the following (e.g. see NSW 2008):

• HAZOP leader. The HAZOP leader must be familiar with the HAZOP technique. The HAZOP leader has the responsibility for ensuring that all the tasks involved in planning, running, recording, and implementing the study are carried out. Her main task during the meetings is to ensure that the team works together to a common goal. The HAZOP leader should be independent of the project but be familiar with the design representations (e.g. P&IDs, block diagrams) and the technical and operational aspects of the system. It is essential that the HAZOP leader is experienced.
• Design engineer. A project design engineer who has been involved in the design and who is concerned with the project costs.
• Process engineer. This is usually a chemical engineer who was responsible for the process flow diagram and the development of the P&IDs.
• Electrical engineer. This is usually an engineer who was responsible for the design of the electrical systems in the plant.
• Instrument engineer. The instrument engineer who designed and selected the control systems for the plant.
• Operations manager. This is preferably the person who will be in charge of the plant when it moves to the commissioning and operating stages.
• HAZOP secretary. The HAZOP secretary should take notes during meetings and assist the HAZOP leader with the administration of the HAZOP study.

[Figure 10.10 Analysis workflow for HAZOP. The flowchart shows steps 1–8 with their inputs and outputs: step 1 (objectives and limitations, establish the HAZOP team, describe the plant, provide background information and data; see also Chapter 3) produces the defined objectives, study team, and project plan; for each study node, step 2 uses guidewords to reveal possible deviations that may lead to harm to people, the environment, or material assets, or give operational problems, supported by experience data and checklists, and produces a list of possible deviations; for each deviation, step 3 identifies possible causes, step 4 determines possible consequences, step 5 identifies existing barriers or safeguards, step 6 estimates the probability and severity and calculates the RPN (giving the risk related to each deviation), and step 7 proposes improvements and appoints a responsible person (giving a list of relevant improvements for each deviation); the loop is repeated for all deviations and all nodes, and step 8 prepares the HAZOP report.]

At least one member of the HAZOP team must have sufficient authority to make decisions affecting the design or operation of the system.

Provide Required Information. Before the first HAZOP meeting, the required information should be provided. This may include process flowsheets, piping and instrumentation diagrams (P&IDs), equipment, piping, and instrumentation specifications, control system logic diagrams, layout drawings, operation and intervention procedures, emergency procedures, codes of practice, etc. For systems in operation, one has to check that the system is identical to the as-built drawings (which is not always the case).

Divide the System into Sections and Study Nodes. The system and/or activity should be divided into major elements for analysis, and the design intent and normal operating conditions for the section should be established. The analysis of a process system is typically based on process elements, such as vessels, pumps, compressors, and the like. Process streams leading into, or out of, an element are analyzed one by one. These process streams are called study nodes.

10.6.4.3 Step 2: Identify Possible Deviations

The HAZOP team starts the examination of a study node by agreeing on the purpose and the normal state of the node. The HAZOP leader then suggests combinations of guidewords and process parameters to guide the team into identifying process deviations and the causes of the deviations. The results of step 2 are entered into columns 1–4 in Figure 10.11.
Number (column 1). A unique reference to the deviation is given in this column. The default reference is a number.

[Figure 10.11 HAZOP worksheet for "filling a bucket with water." Header fields: Study object: Filling a bucket with water; Date: 2019-08-20; Name: Stein Haugen. The worksheet has 13 columns: (1) No., (2) Study node, (3) Guideword, (4) Deviation, (5) Possible causes, (6) Possible consequences, (7) Existing barriers, (8) Frequency, (9) Severity, (10) RPN, (11) Proposed improvements, (12) Responsible, and (13) Comments. Seven rows are shown for the study nodes "Faucet" and "Faucet (temperature)," covering the guideword "no" (faucet is closed, no water in the bucket), "more" (faucet is opened too much or too fast, with a risk of splashing around the bucket), "less" (faucet is opened too slowly), "part of" (failure in the faucet such that the water in the bucket is either too cold or too hot), "other than" (air in the water produces pressure shocks and splashing around the bucket), "more temperature" (faucet is adjusted to too high a temperature, giving too hot water in the bucket and a burning risk), and "less temperature" (faucet is adjusted to too low a temperature, giving too cold water in the bucket). For each deviation, existing barriers (e.g. visual inspection, periodic control and maintenance), frequency and severity classes, RPNs, and proposed improvements (e.g. more attention, careful checking of the water) are recorded.]


Study node (column 2). The name (or identification) of the study node, possibly together with a process parameter, is entered into column 2. If necessary, the study node may be accompanied by a reference to a drawing (e.g. a P&ID).
Guideword (column 3). The guideword used is listed in column 3.
Deviation (column 4). The deviation generated by applying the guideword and the process parameter to the study node is described briefly in column 4. A more detailed description may be supplied in a separate file.

10.6.4.4 Step 3: Identify Causes of Deviations

To identify the possible causes of a deviation is an important part of the HAZOP study. The identified causes are entered into column 5.
Possible causes (column 5). For each deviation in column 4, the possible causes are recorded.

10.6.4.5 Step 4: Determine Consequences of Deviation

For each deviation, the credible consequences are entered into column 6 in the HAZOP worksheet in Figure 10.11.
Possible consequences (column 6). All the main consequences of the identified deviation are now recorded. Both safety-related consequences and possible operability problems are recorded.

10.6.4.6 Step 5: Identify Existing Barriers (Safeguards)

To be able to come up with relevant proposals for improvement, the HAZOP team must be familiar with the existing barriers (safeguards) that have already been incorporated in the system.
Existing barriers (column 7). The existing barriers related to the deviation are recorded.

10.6.4.7 Step 6: Assess Risk

In this step, the risk related to each deviation is evaluated. The step is not part of all HAZOP studies.
Frequency (column 8). The frequency of occurrence of each deviation is briefly estimated and recorded in broad frequency classes (see Table 6.8).
Severity (column 9). The severity of a deviation is usually taken to be the worst credible consequence of the deviation, determined by the degree of injury, environmental damage, material damage, or system/production disturbance that can ultimately occur (see Table 6.9).
RPN (column 10). The RPN of a deviation is computed by summing the class numbers for the frequency and severity of the deviation.


Risk Matrix. The frequency (column 8) and the severity (column 9) of each deviation can be entered into a risk matrix as outlined in Section 6.4 and hence can be used to compare the risk of a deviation with some acceptance criteria, if this is relevant. The risk matrix can also be used to evaluate recommendations for improvements (i.e. risk reduction).
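Since the frequency and severity classes in columns 8 and 9 can be plotted in a risk matrix, the placement can be sketched as a simple lookup. The sketch below is illustrative only: the 5 x 5 class scheme and the thresholds separating "low," "medium," and "high" risk are assumptions made for the example, not values prescribed by the method or by Tables 6.8 and 6.9.

```python
def risk_matrix_cell(frequency: int, severity: int) -> str:
    """Place a deviation in an assumed 5 x 5 risk matrix (classes 1..5)."""
    if not (1 <= frequency <= 5 and 1 <= severity <= 5):
        raise ValueError("frequency and severity must be classes 1-5")
    score = frequency + severity          # same summation as the RPN in column 10
    if score >= 8:
        return "high"
    if score >= 5:
        return "medium"
    return "low"

# Frequency and severity classes for deviations 1-3 in the bucket-filling example
for number, (freq, sev) in {1: (1, 1), 2: (2, 2), 3: (2, 1)}.items():
    print(f"Deviation {number}: RPN = {freq + sev}, risk = {risk_matrix_cell(freq, sev)}")
```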

10.6.4.8 Step 7: Propose Improvements

Proposed improvements are recorded in column 11 of the HAZOP worksheet in Figure 10.11.
Proposed improvements (column 11). Possible actions to prevent the deviation or to mitigate the consequences are recorded.
Responsible (column 12). The name of the person who should be responsible for the follow-up of the deviation and/or the proposed improvement is recorded.
Comments (column 13). Additional comments not recorded in the first 12 columns are entered into column 13.

10.6.4.9 Step 8: Report the Analysis

The HAZOP study may be very time-consuming, and to avoid excessive repetition, the report often covers only the potential problems that were identified. This may, however, later cause people to wonder whether a deviation has been missed or has been dismissed as insignificant. It is often recommended that a table summarizing the responses to the deviations be prepared, highlighting those deviations that are considered hazardous and credible. In addition, a comment should be added on how to detect and/or prevent the deviation. The remaining hazards should be entered into the hazard log, if such a log is maintained for the system in question.

Example 10.8 (Filling a bucket) Consider the "process" of filling a bucket with water through a bathroom sink faucet (tap). The water in the bucket should have a temperature of approximately 50 °C. A simple HAZOP analysis of this process is (partly) documented in the HAZOP worksheet in Figure 10.11. ◽

10.6.5 Computer HAZOP

CHAZOP is a variant of the traditional HAZOP technique and focuses on control systems involving software. CHAZOP is also called PES HAZOP, where PES is an abbreviation for programmable electronic systems. When HAZOP was first developed, this type of control was virtually nonexistent, and


the approach and guidewords had some shortcomings in relation to the new types of systems. In the United Kingdom, HSE published guidance for CHAZOP in 1991 (Andow 1991), and several later papers elaborated on the method and gave examples of use (e.g. see Kletz et al. 1995; Schubach 1997; Redmill et al. 1997; Chung et al. 1999). Most of the steps in a CHAZOP are identical to a traditional process HAZOP, but there are some differences that may be highlighted. The following is based on the procedure suggested by Schubach (1997).
(1) Similar to a traditional HAZOP, the process system is divided into nodes.
(2) For the first node, choose the first instrumentation or control loop.
(3) Describe the data flow to and from the components in the control loop.
(4) Establish the design intent and purpose of the control loop and its components.
(5) The discussion of each control loop consists of a general discussion of hazards and events, followed by applying guidewords for components, control, sequence, and operator.
(6) For each combination, the team should consider a set of standard questions: Is this possible? What are the causes? What are the effects? Can this propagate? Does it matter? What system knowledge (hardware, sequence, operator) is relevant? Can it be prevented, protected against, or mitigated?
(7) Record the discussion and go back to step 2 until all nodes and controls have been covered.

In this case, the traditional HAZOP breakdown of the system into nodes is applied, followed by consideration of control loops. Other approaches have also been proposed. Chung et al. (1999) suggest using a method they call the process control event diagram, which essentially is a systematic representation of the control logic. Different alternatives for guidewords have been proposed. Andow (1991) proposes a set of aspects and, for each aspect, a set of considerations. Others (e.g. Chung et al. 1999) suggest using the standard process HAZOP guidewords but interpreting them in a way that fits this type of system. An extract is shown in Table 10.6.

10.6.6 Resources and Skills Required

A HAZOP study is carried out as a number of brainstorming sessions by a team of five to eight experts working together under the guidance of a HAZOP leader with thorough experience in the HAZOP study technique. A HAZOP secretary is responsible for producing the record of the team's discussions and decisions. Each meeting should not last more than approximately three hours because most people's attention decreases after three hours at a stretch.


Table 10.6 Extract of interpretation of HAZOP guidewords for use in CHAZOP (Chung et al. 1999).

Attribute: Data/control flow
  No: No information flow
  More: More data is passed than expected
  Part of: Information passed is incomplete
  Reverse: Information flow in wrong direction
  ⋮
Attribute: Data rate
  More: Data rate too high
  Less: Data rate too low
Attribute: Timing of event or action
  No: Does not take place
  Early: Takes place before expected
  Late: Takes place later than expected



To give the team members time to attend to their other duties, there should be no more than 2–3 meetings per week. Most HAZOP studies can be completed in 5–10 meetings, but for large projects, it may take several months even with 2–3 teams working in parallel on different sections of the plant (RSC 2007). Several computer programs have been developed to support the HAZOP study. Some expert systems for HAZOP have also been developed to support the HAZOP leader and the study process.

10.6.7 Advantages and Limitations

Main advantages and limitations of HAZOP include the following:

Advantages: The HAZOP study
• is widely used and its advantages and limitations are well understood;
• uses the experience of operating personnel as part of the team;
• is systematic and comprehensive, and should identify all hazardous process deviations;
• is effective for both technical faults and human errors;
• recognizes existing safeguards and develops recommendations for additional ones;
• is suitable for systems requiring the interaction of several disciplines or organizations.

Limitations: The HAZOP study
• is strongly dependent on the facilitation of the leader and the knowledge of the team;
• is optimized for process hazards and needs modification to cover other types of hazards;
• requires development of procedural descriptions, which are often not available in appropriate detail (however, the existence of these documents benefits the operation);
• produces lengthy documentation (for complete recording).

HAZOP analyzes a system or process using a "section by section" approach. As such, it may not identify hazards related to interactions between different nodes.

10.7 STPA

The STPA was proposed by Leveson (2011) and is based on the systems-theoretic accident model and processes (STAMP), which was presented in Chapter 8. In addition to dealing with component failures, STPA also considers unsafe interactions between system components. These can occur even if no failures have occurred. The method can also consider the wider sociotechnical system that the technical systems are part of, enabling analysis with a wider perspective than many other methods. STPA is a rather recent method that has become increasingly popular.2

STPA is based on the assumption that accidents occur due to inadequate control of the study object. STPA identifies cases of inadequate control that may lead to accidents. Control of the system can be effectuated by a technical system, a person, or an organization. Once we have identified areas where control is inadequate, this insight can be used to formulate new requirements for the system, ensuring that control is maintained.

10.7.1 Objectives and Applications

The objectives of an STPA study are to
• identify hazardous events that are relevant during all intended use and foreseeable misuse of the study object, and during all interactions with the study object;
• identify potential inadequacies in the control of the system that may lead to accidents;
• establish new requirements that can be used as input to design or operation.

10.7.2 Analysis Procedure

A standardized analysis procedure for STPA is still not established. This has, so far, influenced the use of the method (Dakwat and Villani 2018). Leveson (2011) describes the method in just two steps. In Leveson and Thomas (2018), this is elaborated further and broken down into four steps. A number of papers have been published where the method has been applied, and one inspiration for the analysis procedure presented here has been Rokseth et al. (2017). The analysis procedure is shown in Figure 10.12. The main steps may be summarized as follows:
(1) Plan and prepare.
(2) Identify system-level accidents and hazardous events.
(3) Describe system-level constraints.
(4) Describe control system hierarchy.
(5) Define responsibilities.
(6) Define process model for each controller.
(7) Describe control actions for each responsibility.
(8) Identify unsafe control actions (UCAs).
(9) Identify the causes of UCAs.
(10) Describe safety constraints.
(11) Reporting.

[Footnote 2: Postdoc Børge Rokseth, NTNU, has provided valuable comments to the STPA section.]

Step 1 is similar to what has been described earlier in Chapter 3 and is not repeated here. Steps 2–11 are described in some detail in the following.

10.7.2.1 Step 2: Identify System-Level Accidents and Hazardous Events

This step can be broken down into two main elements:
• First, we need to identify what losses can occur (called accidents). This may be "fatalities," "environmental damage," "project delay," and so on. In practice, this corresponds to the type of assets that are considered in the risk assessment.
• Second, what are the hazardous events that can lead to these losses? A simple example could be "car does not stop when required."

Leveson (2011) uses the term "hazard" with a meaning similar to how we define hazardous event. To be consistent, we therefore use the term "hazardous event" also when describing STPA.

Hazardous events are typically high-level at this stage, but may be broken down into more detailed sub-events. This is an advantage with STPA because the process can be started early in a design process, with high-level hazardous events only. As more details of the system are developed, more detailed events can also be specified. In this way, the level of detail in the analysis can follow the level of detail in the design process closely. The STPA handbook (Leveson and Thomas 2018) does not provide any guidance on how the hazardous events should be identified, but checklists can be used also for this purpose.

[Figure 10.12 Analysis workflow for STPA. The flowchart shows steps 1–11 with their inputs and outputs: step 1 (organization and planning, objectives and limitations, system description, provision of background information; see Chapter 3) takes the system description as input and produces the specified objectives, study team, and project plan; step 2 identifies system-level accidents (assets that may be affected) and hazardous events; step 3 describes high-level safety constraints; step 4 describes the control system hierarchy; step 5 defines responsibilities; step 6 defines process models and process variables; step 7 describes control actions; step 8 identifies unsafe control actions; step 9 identifies causes of unsafe control actions; step 10 describes detailed safety constraints (suggested improvements); and step 11 prepares the STPA report.]


10.7.2.2 Step 3: Describe Constraints

Safety constraint is an important element in STAMP and is defined as follows (Leveson and Thomas 2018):

Definition 10.2 (Safety constraint) A system-level constraint specifies system conditions or behaviors that need to be satisfied to prevent hazardous events and/or losses. ◽

Based on this definition, a safety constraint can be regarded as a requirement for the system. At this stage of the analysis, only high-level constraints are identified, corresponding to the high-level hazardous events that were identified. Identifying the constraints is often as simple as converting the hazardous event into a requirement to avoid the event occurring. "Car does not stop when required" may be a possible hazardous event. This can be converted into the requirement that the "car must be able to stop when required." A particular constraint may be relevant for several hazardous events, and several constraints may be defined for a particular hazardous event.

Example 10.9 (Hazardous events and safety constraints for an aircraft) Hazardous events for an aircraft include (Leveson and Thomas 2018):
• H-1. Aircraft violates minimum separation standards in flight
• H-2. Aircraft airframe integrity is lost
• H-3. Aircraft leaves designated taxiway, runway, or apron on ground
• H-4. Aircraft comes too close to other objects on the ground

The associated safety constraints are
• SC-1. Aircraft must maintain minimum separation standards in flight
• SC-2. Aircraft airframe integrity must be maintained
• SC-3. Aircraft must not leave designated taxiway, runway, or apron on ground
• SC-4. Aircraft must not come too close to other objects on the ground

Here, all of the safety constraints represent requirements to the design or operation of the aircraft. ◽

10.7.2.3 Step 4: Describe Control System Hierarchy

A key element of STPA is that the study object is described as a hierarchy of control loops. The next step is therefore to describe this hierarchy. In general, a hierarchical control structure contains the following types of elements:
• Controllers. The controller is the entity that controls the behavior of the process. This can be a physical system, a person, a component, an organization, and so on. The speed of a car can be reduced (controlled) by pressing the brake pedal. The brake pedal increases the pressure in the brake fluid system, and this in turn causes the brake pads to apply friction to the wheel. The driver is the controller of the pressure in the brake fluid system, with the brake pedal as an actuator. Next, the brake pedal can be seen as the controller of the brake pads, with the brake fluid as actuator. In addition, modern cars have systems such as the anti-lock braking system (ABS) and the electronic stability program (ESP) that also control the brakes.
• Process models. The process model is the controller's understanding of how the controlled process works. The process model of the car driver can have different aspects; for example, when the driver approaches a red traffic light or a car that has stopped in front of him, he knows that the speed of his own car must be reduced.
• Control actions. The control actions are the concrete actions taken to regulate the controlled process. The control action taken by the car driver is to press the brake pedal.
• Actuators. The actuators are the entities that physically perform the control action. The braking system, including the brake pedal, would be an actuator in the car example.
• Feedback. To be able to control the process correctly, the controllers need feedback on the state of the process. The feedback when we press the brake pedal comes in at least two different forms: the physical feeling of the car decelerating and the speedometer showing that the speed is decreasing.
• Sensors. These are the elements that provide the feedback on the controlled process. The car has a sensor monitoring the speed of rotation of the wheels, and this is used to determine the speed.
• Other inputs to and outputs from components (neither control nor feedback). There may also be external influences that affect the controlled process. The friction of the road is an external factor that influences how quickly the speed decreases. If the driver (the controller) does not expect the road to be icy, his process model will assume that the car has a certain deceleration when the brake pedal is pressed. With low friction, the process model must be adjusted.
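The elements listed above can be captured in a simple data structure. The sketch below uses the car-braking discussion as an illustration; the class layout and the field values are assumptions made for this example and are not part of the STPA method itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ControlLoop:
    """One control loop in the hierarchical control structure."""
    controller: str                 # entity that controls the process
    controlled_process: str
    control_actions: List[str]      # concrete actions taken by the controller
    actuators: List[str]            # entities that physically perform the actions
    sensors: List[str]              # elements that provide feedback
    feedback: List[str]             # information returned to the controller
    process_model: str              # controller's understanding of the process
    sub_loops: List["ControlLoop"] = field(default_factory=list)

# Coarse sketch of the car-braking example
driver_loop = ControlLoop(
    controller="Driver",
    controlled_process="Car speed",
    control_actions=["Press brake pedal"],
    actuators=["Brake pedal / brake fluid system"],
    sensors=["Speedometer", "Physical sensation of deceleration"],
    feedback=["Speed is decreasing"],
    process_model="Pressing the brake pedal reduces the speed of the car",
    sub_loops=[
        ControlLoop(
            controller="Brake pedal",
            controlled_process="Brake pads",
            control_actions=["Increase brake fluid pressure"],
            actuators=["Brake fluid"],
            sensors=[],
            feedback=[],
            process_model="Higher fluid pressure gives more friction on the wheel",
        )
    ],
)
print(driver_loop.controller, "controls", driver_loop.controlled_process)
```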

When the control structure is being built, it is useful to think in terms of two dimensions:
• Level of authority. The control structure can be regarded in the same way as any organization, with more authority given to the higher organizational levels than to the lower levels. The control structure needs to be developed in this dimension. The hierarchical system proposed by Rasmussen and Svedung (2000), shown in Figure 8.15, is a good example of how such a hierarchy can be structured.
• Level of abstraction. For building the control structure, it is also useful to think in terms of the level of abstraction. It is useful to start at a high level of abstraction, in the same way as we start with high-level hazardous events and safety constraints. At the highest abstraction level, it may be helpful to introduce concepts that do not necessarily exist in real life. An example is that we start with something called a "car automation system" at the highest abstraction level. In the next step, we refine this into, for example, a "brake control system," a "traction control system," and so on.

Experience shows that it is complicated to build the hierarchical control structure. The hierarchy is very seldom a simple hierarchy with one control loop on top of the other, but usually consists of complicated interactions between levels and within levels. It may seem that a lot of detail is required for this process, but it is possible to start with very limited detail, adding more to the structure as more information becomes available. This means that it is possible to start an STPA early in a design process, but then on a coarse level.

10.7.2.4 Step 5: Define Responsibilities

The responsibilities are closely tied to the safety constraints. For each constraint, there must be at least one controller in the system that is responsible for enforcing the constraint. The responsibility may also be shared between several controllers, especially as long as we are looking at system-level constraints. Also, one controller may be responsible for enforcing many constraints. The responsibility of the car driver is, for example, to decide when and how hard to apply the brakes. The responsibility of the ABS is to prevent the brakes from locking. Another responsibility of the car driver is to check that the brakes are functioning before starting to drive. A number of other responsibilities can also be identified in relation to other hazardous events.

10.7.2.5 Step 6: Define Process Models and Process Variables

The process model is essentially a model of the "behavior" of the controlled process. A simple process model is that if we press the brakes of a car, the speed of the car will decrease. Experience with using the car will, over time, teach the driver how quickly the car responds to the brakes. Associated with a process model are also process variables. In this case, one variable is the speed of the car. The process models are essential because if they are wrong, we are not able to control the process properly. If we do not understand that the brake pedal decreases the speed of the car, we cannot control the speed properly.

Example 10.10 (Pressure vessel) In process systems, there are often a number of pressure vessels, containing different fluids and operating at different pressures. A common feature of all pressure vessels is that they have an upper limit for the operating pressure. To prevent rupture of the pressure vessel, this upper limit pressure must not be exceeded.
(1) A high-level hazardous event is "rupture of process vessel." This event may be broken down into sub-events, for example, "rupture of process vessel due to overpressure" and "rupture of process vessel due to corrosion."
(2) The safety constraint that can be associated with the first sub-event is "pressure in process vessel must be maintained below maximum acceptable pressure." The maximum acceptable pressure can be identified from the design requirements.
(3) Let us assume that the pressure is being manually controlled in this case. The controller is then the process operator operating the system.
(4) A simple process model would state that when the pressure rises, a control action needs to be taken to reduce the pressure in the vessel, for example, by decreasing the flow in. The only process variable in this case is the pressure. In a more complicated process model, the temperature, the flow rate through the vessel, and other parameters could also be relevant variables. ◽

The control hierarchy is closely related to the hierarchy of goals and responsibilities.

10.7.2.6 Step 7: Describe Control Actions

Next, control actions for each controller may be defined for each of the identified responsibilities. Based on the process model, we can decide what we have to do to control the process, and the actions that we decide on are called control actions. In many cases, several control actions may be available. We have already described examples of control actions for the car example, among others "pressing the brake pedal" to reduce speed.

10.7.2.7 Step 8: Identify Unsafe Control Actions

The responsibilities, safety constraints, and control actions represent how we expect the system control structure to control risk. A specific group of control actions are the UCAs:

Definition 10.3 (Unsafe control action, UCA) A control action that, in a particular context and worst-case environment, will lead to a hazardous event (adapted from Leveson (2011)). ◽

Observe that this definition specifies that hazardous events occur in a particular context. It is therefore important to define the context. Failure of the brakes of a car is not critical when the car is parked, so in the car example the context is that the car is moving.


Leveson (2011) defines four generic types of UCAs, where hazardous events can be caused by:
(1) A control action is not provided.
(2) A control action is provided.
(3) A control action is provided too early, too late, or in the wrong order. In practice, this constitutes three different UCAs with different effects.
(4) A control action is provided too long or too short, which constitutes two different UCAs.

Not all the generic types are relevant in all situations.

Example 10.11 (UCAs for braking of a car) If we apply the generic UCAs to the braking example, we may come to the following conclusions:
• Control action not provided. This means that we do not brake when required. This UCA can clearly lead to a hazardous event.
• Control action provided. Braking when required will in most cases not lead to a hazardous event. An exception may be pressing the brake pedal on a very slippery road, even if this is a correct control action in most other situations. This underlines the importance of context.
• Control action provided too early, too late, or in the wrong order. Braking too early may surprise other drivers because they do not expect the car to stop. Braking too late is similar to not braking and is highly likely to lead to a hazardous event. Wrong order is relevant only in very special situations in this case, although it may be possible to define contexts where several actions need to be taken and where braking may be done too late because other actions are prioritized.
• Control action provided too long or too short. Braking for too long will not lead to a hazardous event, but braking for too short a time will mean that the car does not come to a complete stop and a hazardous event may occur. ◽

It is recommended that UCAs are described in five parts (Leveson and Thomas 2018):
(1) Source. The controller that is the source of the UCA.
(2) Type. Which of the four generic UCA types this is.
(3) Control action. The control action that is unsafe.
(4) Context. The situation in which the UCA is relevant.
(5) Link to hazardous event. It is recommended that a link to the relevant hazardous event is also included, to maintain traceability through the analysis.

The description does not necessarily have to be in this order. The UCA should also be linked to the relevant safety constraint.
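Candidate UCAs can be generated mechanically by crossing each control action with the four generic UCA types before the team judges which combinations are credible in a given context. The short Python sketch below illustrates this bookkeeping for the braking example; the strings and the record layout are illustrative assumptions, not part of the method.

```python
from dataclasses import dataclass

@dataclass
class UCA:
    source: str           # controller that is the source of the UCA
    uca_type: str         # one of the four generic types
    control_action: str   # control action that is unsafe
    context: str          # situation in which the UCA is relevant
    hazardous_event: str  # link to the hazardous event, for traceability

GENERIC_TYPES = [
    "not provided",
    "provided",
    "provided too early, too late, or in the wrong order",
    "provided too long or too short",
]

def candidate_ucas(source, control_action, context, hazardous_event):
    """Generate one candidate UCA per generic type; the team then screens them."""
    return [UCA(source, t, control_action, context, hazardous_event)
            for t in GENERIC_TYPES]

for uca in candidate_ucas("Driver", "Press the brake pedal",
                          "Car approaches a red light",
                          "Car does not stop when required"):
    print(f"{uca.source}: control action '{uca.control_action}' {uca.uca_type} "
          f"when {uca.context} (-> {uca.hazardous_event})")
```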


Example 10.12 (Formulation of unsafe control actions) Two examples of unsafe control actions may be as follows:
• The driver does not press the brake pedal when the car approaches a red light (linked to the hazardous event "Car does not stop when required").
• The driver presses the brake pedal for too short a time when the car approaches a red light (linked to the hazardous event "Car does not stop when required"). ◽

10.7.2.8 Step 9: Identify the Causes of UCAs

After having identified how control actions can fail, the next step is to identify what the causes of the UCAs may be. To identify causes, it is recommended that all the elements in the control loop be examined systematically. Causes may include not only failures of the elements in the control loop themselves but also errors in the transfer of information between the elements, because the transfer fails completely, is too late, or is corrupted. Generic causes may be identified, and some guidance is provided by Leveson and Thomas (2018). Table 10.7 provides a generic list of causes, but further causes may be envisaged.

10.7.2.9 Step 10: Describe Detailed Safety Constraints

The final step in the analysis itself, before reporting, is to describe detailed safety constraints and corresponding requirements. Based on the analysis of which UCAs may occur and why they can occur, we can identify detailed measures that may be implemented to reduce the possibility that UCAs occur. This can be related to improving the reliability of the involved components, but it may also be related to improvements in process models, improvements in the flow of information within the control loops, and so on. These improvements are expressed as detailed safety constraints that can be converted into design or operational requirements for the system.

Table 10.7 Generic causes of unsafe control actions.
• Physical failure of the controller
• Process model not correctly designed
• Process model not correctly implemented
• Process model is no longer correct (due to other changes)
• Control action is not transmitted from controller to actuator
• Actuator does not receive control action
• Failure of or unexpected behavior of actuator
• Signal not transmitted from actuator to controlled process
• No feedback from controlled process
• Sensor failure
• Feedback not transmitted from sensor
• Wrong information transmitted from sensor
• Controller does not receive feedback
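The systematic examination described in step 9 can be mimicked by pairing a UCA with the generic causes in Table 10.7, grouped by the control-loop element they concern. The following sketch is purely illustrative; the grouping of causes under loop elements is an assumption made for the example and is not exhaustive.

```python
# Subset of the generic causes in Table 10.7, grouped by the control-loop
# element they concern (illustrative grouping, not exhaustive).
GENERIC_CAUSES = {
    "controller": ["Physical failure of the controller",
                   "Controller does not receive feedback"],
    "process model": ["Process model not correctly designed",
                      "Process model is no longer correct (due to other changes)"],
    "actuator": ["Actuator does not receive control action",
                 "Failure of or unexpected behavior of actuator"],
    "sensor": ["Sensor failure",
               "Wrong information transmitted from sensor"],
}

def candidate_causes(uca: str) -> list:
    """Walk the control-loop elements and pair the UCA with each generic cause."""
    return [(uca, element, cause)
            for element, causes in GENERIC_CAUSES.items()
            for cause in causes]

for uca, element, cause in candidate_causes("Driver does not press the brake pedal"):
    print(f"{uca} <- {element}: {cause}")
```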


[Figure 10.13 Tables for reporting the STPA. Table 1 covers losses, hazardous events, and safety constraints; Table 2 covers controllers, responsibilities, process models, process variables, and control actions; Table 3 covers unsafe control actions, causes of UCAs, and detailed safety constraints.]

10.7.2.10 Step 11: Reporting

In most of the published examples, STPA studies are reported in a number of tables covering different steps in the analysis. Different ways of doing this are shown in the literature, but one approach may be to use three different tables describing:
(1) Losses, hazardous events, and high-level safety constraints.
(2) Controllers, responsibilities, process models, process variables, and control actions.
(3) UCAs, the causes of the UCAs, and the detailed safety constraints.

Figure 10.13 shows how the three tables are linked together.

10.7.3 Resources and Skills Required

STPA is a complicated method and requires that the system be viewed in a somewhat abstract manner that most people are not used to. It is therefore not an easy method to use. A background in control theory may be useful. Further, the lack of method descriptions has made it difficult to apply, although the growing popularity of the method is expected to help. Many users find STPA time-consuming to apply.

10.7.4 Advantages and Limitations

A general comment found in several publications that compare STPA with other methods is that it is a useful supplement to other methods, such as FMECA, HAZOP, and bow-ties. It does not necessarily identify more potential problems, but additional problems may sometimes be identified. STPA takes a different perspective on the system, and this will usually also enable us to identify other strengths and weaknesses of a system. If a product designer looks at a new system, she may see that it is not very user-friendly, whereas an engineer may identify that it is not strong enough to survive the loads on it. In principle, STPA and FMECA often have similar effects.

Advantages. The main advantages are that the STPA study
• is useful for complicated systems, involving automation, software, humans, and technical systems;
• takes a different perspective on the system compared to other methods;
• can identify system weaknesses that other methods have difficulty finding.

Limitations. The main limitations are that an STPA study
• cannot rank hazardous events;
• has no standard approach that is simple to follow and use;
• is time-consuming;
• requires expertise and experience to develop the control structure hierarchy and perform the analysis.

10.8 SWIFT

SWIFT is a systematic brainstorming session where a group of experts with detailed knowledge about the study object raise what-if questions to identify possible hazardous events, their causes, consequences, and existing barriers, and then suggest alternatives for risk reduction. Estimation of the frequency and severity of the various hazardous events may, or may not, be part of the SWIFT analysis. What-if analyses have long been used in simple risk analyses (CCPS 2008). The main difference between a SWIFT analysis and a traditional what-if analysis is that the questions in SWIFT are structured based on a checklist. The SWIFT approach was earlier called a "what-if/checklist" analysis (CCPS 2008). The borderline between SWIFT and a traditional what-if analysis is rather vague.

SWIFT has several similarities to a HAZOP study. The main differences are that SWIFT considers larger modules and that checklists and what-if questions are used instead of guidewords and process parameters. A SWIFT analysis is therefore not as detailed and thorough as a HAZOP study, and is easier and faster to conduct.

A study team meeting typically starts by discussing in detail the system, function, or operation under consideration. Drawings and technical descriptions are used, and the team members may need to clarify to each other how the system functions in detail and how it may fail. The next phase of the meeting is a brainstorming session, where the team leader guides the discussion by asking questions starting with "What if?" The


questions are based on checklists and cover such topics as operation errors, measurement errors, equipment malfunction, maintenance, utility failure, loss of containment, emergency operation, and external stresses. When the ideas are exhausted, previous accident experience may be used to check for completeness.

Example 10.13 (Examples of what-if? questions) What-if …
• the wrong chemical is supplied?
• the water pump fails?
• the operator forgets to switch off the electric power?
• the valve cannot be opened?
• a fire occurs?
• the operator is not present when a specified event occurs?




10.8.1 Objectives and Applications

The objectives of a SWIFT analysis are similar to the objectives of a HAZOP study, but usually with less focus on operability problems:
(1) To identify all hazardous events, with their causes and consequences.
(2) To evaluate whether the safeguards that have been introduced are adequate.
(3) To decide whether actions are required to control the hazardous events and, if necessary, to propose risk reduction measures.

A SWIFT analysis is suitable for mainly the same applications as a HAZOP study. Whether a SWIFT analysis or a HAZOP study should be conducted depends primarily on how detailed the analysis must be. SWIFT may, like HAZOP, be applied to work procedures, and is then usually based on a task analysis (see Chapter 15). A SWIFT analysis is most often carried out after a PHA. It may also be relevant to use such questions as "How could …?" and "Is it possible that …?" In some cases, it may be appropriate to pose all the questions in a brainstorming manner before trying to answer them.

10.8.2 Analysis Procedure

A SWIFT analysis may be carried out in eight steps:
(1) Plan and prepare.
(2) Identify possible hazardous events.
(3) Determine causes of hazardous events.
(4) Determine the consequences of hazardous events.
(5) Identify existing barriers.
(6) Assess risk.
(7) Propose improvements.
(8) Report the analysis.

Steps 1 and 8 are described in Chapter 3 and are not discussed further here. The other steps are similar to the corresponding steps in a HAZOP study. The analysis process is shown in Figure 10.14.

[Figure 10.14 Analysis workflow for SWIFT. The flowchart shows steps 1–8 with their inputs and outputs: step 1 (organization and planning, objectives and limitations, system description, system familiarization; see Chapter 3) produces the study team and project plan; step 2 identifies hazardous events by structured what-if questions, supported by structured checklists and brainstorming, and produces a list of hazardous events; for each hazardous event, step 3 identifies possible causes, step 4 identifies possible consequences, step 5 identifies existing barriers related to the hazardous event and assesses their capabilities, step 6 estimates the frequency and assesses the severity of the hazardous event (giving a risk estimate and input to the risk matrix), and step 7 proposes risk reduction measures; the loop is repeated for all hazardous events, and step 8 prepares the SWIFT analysis report.]


10.8.2.1 SWIFT Worksheet

The results from the SWIFT analysis are usually documented in a specific worksheet. An example of a SWIFT worksheet is shown in Figure 10.15. The columns in this worksheet are numbered from 1 to 11, and these are referred to when describing the SWIFT procedure. There is no standard worksheet, and several variants are used. Some worksheets do not include columns for risk ranking.

10.8.2.2 Step 2: Identify Possible Hazardous Events

These problems may have been revealed earlier by a PHA. Develop a response to the initial what-if questions. Generate additional questions and respond to these. Number (column 1). A unique reference to the what-if question is given in this column. The default reference is a number. Study node (column 2). The what-if question is entered into column 2. 10.8.2.3

Step 3: Determine Causes of Hazardous Events

The identified causes of the hazardous event (i.e. the answer to the what-if question in column 2) are entered into column 3. Possible causes (column 3). For each what-if question in column 2, the possible causes are recorded. 10.8.2.4

Step 4: Determine the Consequences of Hazardous Events

For each what-if question, the credible consequences are entered into column 4 in the SWIFT worksheet in Figure 10.15. Possible consequences (column 4). All the main consequences of the hazardous event resulting from the answer to the what-if question are now recorded. 10.8.2.5

Step 5: Identify Existing Barriers

To be able to come up with relevant proposals for improvement, the study team must be familiar with the existing barriers (safeguards) that have already been incorporated in the system. Existing barriers (column 5). The existing barriers related to the hazardous event are recorded. 10.8.2.6

Step 6: Assess Risk

In this step, the risk related to each hazardous event is evaluated. The step is not part of all SWIFT analyses. Frequency (column 6). The frequency of occurrence of each hazardous event is estimated briefly as broad frequency classes (see Table 6.8).


Figure 10.15 Example of a SWIFT worksheet for Example 10.14.
Study object: LNG transport system. Name: Marvin Rausand. Date: 2019-12-20.
Columns: No. (1); What if? (2); Possible causes (3); Possible consequences (4); Existing barriers (5); Freq. (6); Sev. (7); RPN (8); Proposed improvements (9); Responsible (10); Comment (11).

1. What if the driver leaves without disconnecting the flexible hose? Possible causes: time pressure; driver distraction. Possible consequences: the hose breaks; gas is released; fire/explosion likely. Existing barriers: procedure. Freq.: 3. Sev.: 3. RPN: 6. Proposed improvements: install a barrier in front of the truck that can be opened only when the flexible pipe is disconnected.

2. What if the quick-release coupling of the hose is released during filling? Possible causes: not properly connected; technical failure in coupling. Possible consequences: driver is hit; gas is released; fire/explosion likely. Existing barriers: preventive maintenance of coupling. Freq.: 3. Sev.: 2. RPN: 5. Proposed improvements: new/better coupling; improved maintenance; improved driver training.

3. What if the tank truck runs off the road on "Main Street"? Possible causes: slippery (icy) road; heavy traffic; many children are crossing the road; technical failure; interaction with other vehicle. Possible consequences: puncture of tank (inner/outer); fire/explosion likely; high number of victims. Existing barriers: driver training; traffic control. Freq.: 2. Sev.: 4. RPN: 6. Proposed improvements: improved driver training; alternative route.


Severity (column 7). The severity of a hazardous event is usually taken to be the worst credible consequence of the event, determined by the degree of injury, environmental damage, material damage, or system/production disturbance that can ultimately occur (see Table 6.9).

RPN (column 8). The RPN of a hazardous event is computed by summing the class numbers for frequency and severity of the hazardous event.

Risk Matrix. The frequency (column 6) and severity (column 7) of each hazardous event can be entered into a risk matrix as outlined in Section 6.4 and, hence, can be used to compare the risk of a hazardous event with some acceptance criteria, if this is relevant. The risk matrix can also be used to evaluate recommendations for improvements (i.e. risk reduction).
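The RPN calculation can be illustrated with a few lines of Python, using the frequency and severity classes of the three what-if rows in Figure 10.15 as data; the function and row labels below are illustrative only, not a prescribed implementation.

def rpn(frequency_class, severity_class):
    """Risk priority number: the sum of the frequency and severity class numbers."""
    return frequency_class + severity_class

# Frequency and severity classes of the three what-if rows in Figure 10.15
worksheet_rows = [
    ("Driver leaves without disconnecting the flexible hose", 3, 3),
    ("Quick-release coupling released during filling", 3, 2),
    ("Tank truck runs off the road", 2, 4),
]

for what_if, freq, sev in worksheet_rows:
    print(f"{what_if}: freq={freq}, sev={sev}, RPN={rpn(freq, sev)}")
# The (freq, sev) pairs can also be placed in a risk matrix (Section 6.4)
# and compared with the risk acceptance criteria, if relevant.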

10.8.2.7 Step 7: Propose Improvements

Proposed improvements are recorded in column 9 of the SWIFT worksheet in Figure 10.15. Proposed improvements (column 9). Possible actions to prevent the hazardous event or to mitigate the consequences are recorded. Responsible (column 10). The name of the person who should be responsible for the follow-up of the hazardous event and/or the proposed improvement is recorded. Comments. Additional comments not recorded in the first 10 columns are

entered into column 11. Example 10.14 (LNG transport by tank truck) Reconsider the LNG transport system discussed in Example 10.5. Some few what-if questions related to this system are analyzed in the SWIFT worksheet in Figure 10.15. The results are not based on a thorough analysis and are included only as an illustration. ◽ 10.8.3

Resources and Skills Required

The what-if analysis relies on a team of experts brainstorming to generate a comprehensive review. The relevance and completeness of the analysis are therefore dependent on the competence and experience of the team members. At least one of the team members should be familiar with the analysis process and should be able to come up with a list of initial what-if questions. The number of team members required will depend on the complexity of the system or process. For rather simple systems/processes, three to five team members will be sufficient.


10.8.4 Advantages and Limitations

The what-if analysis is suitable primarily for relatively simple systems. The analysis will usually not reveal problems with multiple failures or synergistic effects. The what-if analysis is applicable to almost any type of application, especially those dominated by relatively simple failure scenarios.

Advantages. The main advantages of using SWIFT are that it (HSE 2001)
• is very flexible, and applicable to any type of installation, operation, or process, at any stage of the life cycle;
• creates a detailed record of the hazard identification process;
• uses experience of operating personnel as part of the team;
• is quick, because it avoids repetitive considerations of deviations;
• is less time-consuming than other systematic techniques, such as HAZOP.

Limitations. The main limitations of SWIFT are that it
• is not inherently thorough and foolproof;
• works at the system level, such that lower-level hazards may be omitted;
• is difficult to audit;
• is highly dependent on checklists prepared in advance;
• is heavily dependent on the experience of the leader and the knowledge of the team.

10.9 Comparing Semiquantitative Methods

The methods that have been presented so far have many similarities, but they are developed for different purposes and approach the problem of identifying what can go wrong in different ways. In Figure 10.16, six of the methods are compared with respect to the following properties:
• Application. The types of problems that the method is suitable for.
• System breakdown. How the system is viewed and described in the method.
• How is identification performed? Which structured approaches are applied to identify hazards and hazardous events.
• What is identified? The terms used to describe what is found in the hazard identification process.
• How is risk ranking performed? Whether and how risk ranking is performed within the method.

Figure 10.16 Comparison of hazard identification methods.
- PHA. Application: all types of systems. System breakdown: either object or functional breakdown. How identification is performed: checklists, brainstorming. What is identified: hazardous events. Risk ranking: categories and risk matrix.
- JSA. Application: simple work operations and procedures. System breakdown: breakdown of work operation into detailed tasks. How identification is performed: checklists, brainstorming. What is identified: hazards and hazardous events. Risk ranking: sometimes categories and risk matrix, often no ranking.
- FMECA. Application: technical systems, in particular safety systems. System breakdown: breakdown into subsystems and components. How identification is performed: checklists of failure modes, brainstorming. What is identified: failure modes. Risk ranking: sometimes categories and risk matrix, sometimes no ranking.
- HAZOP. Application: process systems. System breakdown: functional breakdown based on flow through the system. How identification is performed: guidewords and process parameters. What is identified: deviations. Risk ranking: sometimes categories and risk matrix, often no ranking.
- STPA. Application: all types of systems. System breakdown: hierarchical structure of control loops. How identification is performed: no approach specified. What is identified: unsafe control actions. Risk ranking: no ranking.
- SWIFT. Application: all types of systems. System breakdown: no specific approach. How identification is performed: asking what-if questions. What is identified: hazards and hazardous events. Risk ranking: no ranking.

10.10 Master Logic Diagram

MLD is a graphical technique that can be used to identify hazards and hazard pathways that can lead to a specified TOP event (i.e. an accident) in the

system. The hazards and pathways are traced down to a level of detail at which all important safety functions and barriers are taken into account. When this is accomplished, the causal events that can threaten a safety barrier or function can be listed. An MLD resembles a fault tree (see Chapter 11), but differs in that the initiators defined in MLD are not necessarily failures or basic events. MLDs are not pursued further in this chapter. Interested readers may consult Modarres (2006) and Papazoglou and Aneziris (2003). A case study illustrating how MLD can be used to identify failure modes of an intelligent detector is presented by Brissaud et al. (2011).

10.11 Change Analysis Change analysis is used to determine the potential effects of some proposed modifications to a system or a process. The analysis is carried out by comparing the new (changed) system with a basic (known) system or process. A change is often the source of deviation in the system operation and may lead to process disturbances and accidents. It is therefore important that the possible effects of changes be identified and that necessary precautions be taken. In the following, the term key difference is used to denote a difference between the new and the basic system that can lead to a hazardous event or can influence the risk related to the system. The system can be a sociotechnical system, a process, or a procedure. 10.11.1

Objectives and Applications

The main objectives of a change analysis are to (1) identify the key differences between the new (changed) system and a basic (known) system; (2) determine the effects of each of these differences; (3) identify the main system vulnerabilities caused by each difference; (4) determine the risk impact of each difference; (5) identify which new safeguards and/or other precautions are necessary to control the risk impacts. Change analysis can be applied to all types of systems, ranging from simple to complicated. This includes situations in which system configurations are altered, operating practices or policies are changed, new or different activities will be performed, and so on. 10.11.2

Analysis Procedure

A change analysis can be carried out in six steps. Steps 1 and 6 are described in Chapter 3 and are therefore not treated further here.


(1) Plan and prepare. (2) Identify the key differences (between the new system and the basic system). (3) Evaluate the possible effects of the differences (positive and/or negative related to the risk). (4) Determine the risk impact of the differences. (5) Examine important issues in more detail. (6) Report the analysis. The analysis workflow is shown in Figure 10.17.

Figure 10.17 Analysis workflow for change analysis.
- Step 1: Organization and planning; objectives and limitation; system description of both the basic (known) system and the new (changed) system (see Chapter 3). Output: study team and project plan.
- Step 2: Identify the key differences between the new and the known system. Input: system data, brainstorming.
- Step 3: Identify hazards/events related to the key differences. Input: experience data; checklists of hazards, threats, and threat agents. Output: list of hazards and hazardous events; recommended design and control changes; input to hazard log.
- Step 4: Identify possible consequence and frequency; assess/rank the associated risk; propose required safeguards. Input: data sources, expert judgments. Output: description/ranking of risk impacts (risk matrix); list of required safeguards.
- Step 5: Examine important issues in more detail. Input: experience data, other methods.
- Step 6: Prepare the report from the analysis. Output: change analysis report.


10.11.2.1 Step 2: Identify the Key Differences

This step is based on a detailed description of both the basic (known) system and the new (changed) system. Differences between the two systems are identified by comparison and brainstorming. Various checklists may also be useful. At this point, all differences, regardless of how subtle, should be identified and listed.
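As an illustration of this comparison step, the Python sketch below lists the attributes that differ between two hypothetical system descriptions; each difference found in this way is only a candidate key difference that the study team must still evaluate in step 3.

def key_difference_candidates(basic_system, new_system):
    """Return the attributes that differ between the basic (known) system
    and the new (changed) system description."""
    differences = []
    for attribute in sorted(set(basic_system) | set(new_system)):
        old_value = basic_system.get(attribute, "<not present>")
        new_value = new_system.get(attribute, "<not present>")
        if old_value != new_value:
            differences.append((attribute, old_value, new_value))
    return differences

# Hypothetical system descriptions used only to illustrate the comparison
basic = {"pump type": "centrifugal", "operating pressure": "5 bar", "operators per shift": "2"}
changed = {"pump type": "centrifugal", "operating pressure": "8 bar",
           "operators per shift": "1", "remote monitoring": "yes"}

for attribute, old_value, new_value in key_difference_candidates(basic, changed):
    print(f"{attribute}: {old_value} -> {new_value}")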

10.11.2.2 Step 3: Evaluate the Possible Effects of the Differences

In this step, the various differences identified in step 2 are evaluated, one by one. For each difference, the study team decides whether or not it can lead to harm to any assets. Both positive and negative effects on the risk should be recorded. The differences that can lead to, or influence harm, are listed as key differences, ordered into similar groups, and given unique reference numbers. This process often generates recommendations to design changes and better control in relation to the key differences. 10.11.2.3

Step 4: Determine the Risk Impacts of the Differences

Here, the study team evaluates the risk impact of each key difference. A risk evaluation approach such as a risk matrix may be used to indicate how the differences affect the risk to the various assets. As part of this process, additional safeguards and possible changes to existing safeguards are proposed when required. 10.11.2.4

Step 5: Examine Important Issues in More Detail

During the change analysis process, important issues may be revealed that will need further analysis. The study team describes these issues and gives recommendations for further analysis by other risk assessment tools. In some cases, such analyses may be carried out by the study team as part of the change analysis. 10.11.3

Resources and Skills Required

A change analysis may be carried out by two or more experienced engineers. The analysis requires a thorough knowledge of the system and the risk issues of the basic system. At least one of the study team members should have a background as a safety engineer. 10.11.4

Advantages and Limitations

Change analysis can be applied meaningfully only to a system for which baseline risk has been established by experience or from prior risk analyses.


Advantages The main advantages are that the change analysis • is efficient and does not require extensive training; • systematically explores all the differences that may introduce significant risk or may have contributed to an actual incident; • is effective for proactive risk assessment in changing situations and environments. Limitations The main limitations are that the change analysis • relies on a comparison of two systems. A thorough knowledge of the risk issues related to the basic system is therefore crucial; • does not quantify risk levels (but the results of a change analysis can be used with other risk assessment methods to produce quantitative risk estimates); • is strongly dependent on the expertise of the study team. Remark 10.5 Other use of the term change analysis: The basic premise of change analysis is that if a system performs to a given standard for a period of time and then suddenly fails, the failure will be due to a change or changes in the system. By identifying these changes, it should then be possible to discover the factors that led to the failure arising. ◽

10.12 Hazard Log It is often beneficial to enter the results of the hazard identification process into a hazard log. The hazard log is also called a hazard register or a risk register. A hazard log is a log of hazards of all kinds that threaten a system’s success in achieving its safety objectives (see also CASU 2002). It is a dynamic and living document, which is populated through the organization’s risk assessment process. The log provides a structure for collating information about risk that can be used in risk analyses and in risk management of the system. The hazard log should be established early in the design phase of a system or at the beginning of a project and be kept up to date as a living document throughout the lifecycle of the system or project. The hazard log should be updated when new hazards are discovered, when there are changes to identified hazards, or when new accident data become available. The hazard log is usually established as a computerized database, but can also be a document. The format of the hazard log varies a lot depending on the objectives of the log and the complexity and risk level of the system, and may range from a simple table, listing the main hazards that are related to the system,


to an extensive database with several sub-databases. Elements often included in a (comprehensive) hazard log are the following: (1) Hazards: (a) A unique reference to the hazard (number or name) (b) Description of the hazard (e.g. high pressure) (c) Where is the hazard present? (e.g. in the laboratory building) (d) Where can more information about the hazard be found? (e.g. toxicity data in book A) (e) What is the quantity/amount of the hazard? (e.g. 200 m3 of diesel oil, 500 psi pressure) (f ) When is the hazard present? (e.g. while hoisting a craneload) (g) Which triggering events can release the hazard? (e.g. operator error) (h) Which risk reduction measures can be implemented related to the hazard? (e.g. replace a fluid with a less toxic fluid) (2) Hazardous events: (a) A unique reference to the hazardous event (number or name) (b) Description of the hazardous event (e.g. gas leakage from pipeline A at location B) (c) Which hazards and triggering events can lead to the hazardous event? (e.g. a craneload falls on a pressurized gas pipeline). A link should be made to the relevant hazards in the hazard sub-log (d) In which operational phases can the hazardous event occur? (e.g. during maintenance) (e) How often will the hazardous event occur? (e.g. frequency class 2) (f ) What is the worst credible consequence of the hazardous event? (e.g. a major fire) (g) How serious is the worst credible consequence of the hazardous event? (e.g. consequence class 4) (h) Which proactive safeguards can be implemented to reduce the frequency of the hazardous event? (e.g. improved inspection program) (i) Which reactive or mitigating safeguards can be implemented? (e.g. improved firefighting system) (j) How much would the proposed safeguards reduce the risk? (e.g. RPN reduced from 6 to 4) All hazardous events that can conceivably happen should be included, not only those that have already been experienced. A log of experienced or potential incidents (or accident scenarios) may also be included in the hazard log. The contents of this log may be (3) Incidents (or accident scenarios): (a) A unique reference to the incident (number or name) (b) Description of the incident (event sequence or accident scenario)


(c) Has the incident occurred in this system or in any similar systems? If "yes," provide reference to the incident investigation report, if such a report exists. (d) How often will the incident occur? (e) What is/was the consequence of the incident? (Use the worst credible consequence for accident scenarios) (f) Which reactive or mitigating safeguards can be implemented? (g) Refer to the treatment of the accident scenario in the quantitative risk assessment if such an assessment has been carried out. Remark 10.6 A number of databases contain information about previous accidents and near accidents (see Chapter 9). These provide valuable information on how accidents can actually arise. The relevant information in these sources should be reflected in the hazard log, in addition to the information from the operator's own site or company. Historical data alone cannot be relied on because accidents that have already occurred may not represent the entire range of possible accidents, particularly when dealing with major accidents (see also NSW 2003). ◽ The hazard log can sometimes include a sublog of deliberate and hostile actions, for example (4) Threats and vulnerabilities: (a) A unique reference to the threat (number or name) (b) Description of the threat (e.g. arson, vandalism, and computer hacking) (c) Where is the threat relevant? (e.g. computer network) (d) What are the main vulnerabilities? (e.g. no entrance control)

Figure 10.18 Simple hazard log (example).
System: Process plant X. Name: Marvin Rausand. Date created: 2019-08-20.
Columns: Hazard/threat; Where?; Amount; Safeguard; Comments.
- Trichloroethylene; Storage 2; 1 barrel; Locked room.
- Pressurized gas; Pressure vessel 3; 10 m3 (5 bar); Fenced.
- Gasoline; Beneath pump; 3000 L; Underground.

Figure 10.19 Hazard log (example).
System: Process plant X. Name: Marvin Rausand. Date created: 2019-05-17. Date modified: 2019-08-20.
Columns: No.; Hazard description; Hazard presence (where, when); Hazard quantity/amount; Possible hazardous event; Consequence (harm to what?); Risk (Freq., Sev., RPN); Risk reduction measures; Residual risk; Planned date; Responsible.
No. 1: Sulfuric acid tank A1. Where: Production hall 2. When: Always. Quantity/amount: 10 m3.
- Hazardous event: Tank rupture (due to falling load). Consequences: direct impact, skin burns (five operators); sulfur fumes, eyes and respiratory tract (approx. 25 operators); production stop (more than two days). Freq.: 1. Sev.: 4. RPN: 5. Risk reduction measures: procedures for crane operations; restricted area.
- Hazardous event: Outlet pipe rupture. Consequences: direct impact, skin burns (two operators); sulfur fumes, eyes and respiratory tract (five operators); production stop (one day). Freq.: 2. Sev.: 2. RPN: 4. Risk reduction measures: automatic shutdown valve close to tank.


(e) Who are the relevant threat agents? (e.g. visitors) (f ) What unwanted events may take place? (e.g. loss of confidential information) (g) How often will this event occur? (h) How serious is the event? (Use the worst credible consequence) (i) Which safeguards can be implemented? It may further be beneficial to maintain a journal as a historical record of the hazard log. This journal may, for example, contain (5) Journal: (a) The date the hazard log was started (b) References to relevant laws, regulations, and company objectives related to the risk of the system (c) For each entry entered into the log: date and cause (d) For each entry that has been modified: date and cause (e) References between the hazard log and more detailed risk analyses (f ) References to safety reviews and project decisions More information may be added to the hazard log as desired. A hazard log is especially valuable in the design phase, but accidents and near accidents that occur in the operational phase should also be compared with the hazard log, and the log should be updated accordingly. An example of a very simple hazard log is shown in Figure 10.18, and a slightly more complicated hazard log structure is shown in Figure 10.19. These logs are not based on a thorough analysis and are included only as an illustration. For a more detailed hazard log structure with explanations of the various entries, see UK CAA (2006, App. F).
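Because the hazard log is usually implemented as a computerized database, the Python sketch below indicates one possible way of representing hazard and hazardous event records, using a subset of the elements listed above; all field names are illustrative and should be adapted to the objectives of the log.

from dataclasses import dataclass, field
from typing import List

@dataclass
class HazardEntry:
    """One hazard record in a computerized hazard log (field names are illustrative)."""
    reference: str                      # unique number or name
    description: str                    # e.g. "high pressure"
    location: str                       # where the hazard is present
    quantity: str                       # e.g. "200 m3 of diesel oil"
    present_when: str                   # e.g. "while hoisting a craneload"
    triggering_events: List[str] = field(default_factory=list)
    risk_reduction_measures: List[str] = field(default_factory=list)

@dataclass
class HazardousEventEntry:
    """One hazardous event record, linked to entries in the hazard sub-log."""
    reference: str
    description: str
    related_hazards: List[str] = field(default_factory=list)   # references to HazardEntry
    frequency_class: int = 0
    severity_class: int = 0
    proactive_safeguards: List[str] = field(default_factory=list)
    reactive_safeguards: List[str] = field(default_factory=list)

# Example record based on the hazard log in Figure 10.19
acid_tank = HazardEntry(
    reference="1",
    description="Sulfuric acid tank A1",
    location="Production hall 2",
    quantity="10 m3",
    present_when="Always",
)
print(acid_tank)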

10.13 Problems 10.1

A generic list of hazards is provided in Table 2.5. Compare the items in this list with the definition of a hazard in Chapter 2 and discuss whether they meet the definition. Consider in particular the “organizational hazards.”

10.2

Hazards may be classified according to the main contributor to an accident scenario: (i) Technological hazards, (ii) Natural (or environmental) hazards, (iii) Organizational hazards, (iv) Behavioral hazards and (v) Social hazards. Reclassify the hazards in Table 2.5 according to this classification scheme.

10.3

List the possible hazards you are exposed to when riding a bicycle.


10.4

Consider Figure 10.20 and try to identify relevant hazards. Use your imagination. Table 2.5 may also be helpful. Make assumptions about what you see as necessary. Define some accident scenarios (see Chapter 2) and put the accident scenarios in a bow-tie-diagram.

10.5

Consider Figure 10.20 and carry out a SWIFT-analysis. Use the worksheet in Figure 10.15 to report the results. Compare the results with what you found in the previous problems and identify differences and similarities.

10.6

Consider the list of hazards that you identified for riding a bicycle in Problem 10.3. Identify examples of active failures that can trigger unwanted events and latent conditions that may lie dormant and contribute to future accidents.

10.7

Assume that you are driving your car on a wet and slippery country road. A front wheel punctures and you have to change the wheel. You have a spare wheel and the original jack in the boot. (a) Break down the job you have to do into a sequence of tasks. List the tasks in the sequence you have to do them. (b) Carry out a JSA and record the results in a suitable JSA worksheet. Observe the assumptions you have to make to carry out the analysis.

10.8

Reconsider the lifting operations in Example 10.6, but consider the whole operation including the worker onboard the ship, the crane operator, and the worker on the quay who positions the containers and removes the hooks. Note the assumptions you have to make to carry out the analysis. (a) Break down the job into a sequence of tasks and list the tasks in the sequence they have to be done. (b) Carry out a JSA and record the results in a suitable JSA worksheet.

10.9

Consider a hot water kettle that is used to make hot water for tea. Choose a model you are familiar with and carry out an FMECA analysis of the kettle.

10.10

Perform a PHA of a building crane and the operation of the crane. You may delimit the system to the crane itself (but also consider effects on other objects and persons) and consider only normal use of the crane (i.e. not erecting or dismantling the crane, no maintenance operations, etc.). Make other delimitations if required. Consider the need for a breakdown of the system – this should be done if required. For the


Figure 10.20 Building site (example). Source: Photo by fancycrave.com from Pexels.


hazard identification, it is recommended to use the checklist provided in Table 10.1. In addition, it is also recommended to visit a building site to observe how the operation is done. For the frequency and consequence classification, you may use the classification in Tables 6.8 and 6.9, or you may define your own classification. You may use the PHA worksheet in Figure 10.3. The identified hazardous events should be plotted in a risk matrix. 10.11

PHA, HAZOP, FMEA, and JSA are four methods used for hazard identification purposes. For each of the following systems or operations, comment on whether these methods are suitable or not (or if you think that other methods are better suited): • The water cooling system in a car • A car engine • A race track for car racing • The electrical system of the car • The operation of replacing the gear box of the car.

10.12

Figure 10.21 shows a simple "process system." The system consists of a water tank, with a water supply on the right-hand side at the top and a water outlet at the bottom. The tank is normally full with both inlet and outlet closed (i.e. the valves are normally closed, NC). Water is released manually, by pushing the pushbutton. This opens the valve on the outlet line and water flows out. When the water level reaches the low-level controller (LC2), the outlet valve is closed. The inlet valve is also opened when the water level reaches the low-level controller (LC2). This valve remains open until the water level again reaches the high level (at LC1), when the valve closes.
(a) Carry out a HAZOP analysis of this system. The following columns should be included in the HAZOP worksheet: node, guideword and process parameter (deviation), possible causes of deviation, possible consequences of deviation. Consequences should be classified in the following categories only: spill of water, system does not function, system functions partially, and no consequence.
(b) Carry out an FMECA of the same system. The following columns should be used in the FMECA worksheet: component, operational mode, failure mode, failure mechanism, local effect, global effect, detection.
(c) Carry out an STPA analysis of the same system.
(d) Compare the results from the three studies. What are the similarities and differences?

Figure 10.21 Water tank (with valves V1 and V2, level controllers LC1 and LC2, and a pushbutton).

References

Andow, P. (1991). Guidance on HAZOP Procedures for Computer-Controlled Plants. Report 26. London: Health and Safety Executive.
Brissaud, F., Barros, A., Bérenguer, C., and Charpentier, D. (2011). Reliability analysis for new technology-based transmitters. Reliability Engineering & System Safety 96 (2): 299–313.
CASU (2002). Making it Happen: A Guide to Risk Managers on How to Populate a Risk Register. Tech. Rep. Staffordshire, UK: The Controls Assurance Support Unit, University of Keele. ISBN: 1-904276-02-4.
CCOHS (2009). Job Safety Analysis Made Simple. Tech. Rep. Canadian Centre for Occupational Health and Safety. http://www.ccohs.ca/oshanswers (accessed 03 October 2019).
CCPS (2008). Guidelines for Hazard Evaluation Procedures, 3e. Hoboken, NJ: Wiley and Center for Chemical Process Safety, American Institute of Chemical Engineers.
Chung, P.W.H., Yang, S.H., and Edwards, D.W. (1999). Hazard identification in batch and continuous computer-controlled plants. Industrial and Engineering Chemistry Research 38 (11): 4359–4371.
Dakwat, A.L. and Villani, E. (2018). System safety assessment based on STPA and model checking. Safety Science 109: 130–143.
DEF-STAN 00-56 (2007). Safety Management Requirements for Defence Systems. Standard. London, UK: Ministry of Defence.
Hammer, W. (1993). Product Safety Management and Engineering, 2e. Des Plaines, IL: American Society for Safety Engineers.
HSE (2001). Marine Risk Assessment. London: HMSO.
HSL (2005). Review of Hazard Identification Techniques. Report HSL/2005/58. Sheffield, UK: Health and Safety Laboratory.
IEC 61882 (2016). Hazard and Operability Studies (HAZOP Studies) – Application Guide. International standard. Geneva: International Electrotechnical Commission.
IMO (2015). Revised Guidelines for Formal Safety Assessment (FSA) for Use in the IMO Rule-Making Process. Guideline. London, UK: International Maritime Organization.
ISO 12100 (2010). Safety of Machinery – General Principles for Design: Risk Assessment and Risk Reduction. International standard. Geneva: International Organization for Standardization.
ISO 17776 (2016). Petroleum and Natural Gas Industries – Offshore Production Installations – Major Accident Hazard Management During the Design of New Installations. International standard. Geneva: International Organization for Standardization.
ISO 31010 (2009). Risk Management – Risk Assessment Techniques. International standard. Geneva: International Organization for Standardization.
de Jong, H.H. (2007). Guidelines for the Identification of Hazards: How to Make Unimaginable Hazards Imaginable? NLR-CR-2004-094. Brussels: Eurocontrol.
Kletz, T. (1999). Hazop and Hazan, 4e. London: Taylor & Francis.
Kletz, T., Chung, P.W.H., Broomfield, E., and Shen-Orr, C. (1995). Computer Control and Human Error. Houston, TX: Gulf Publishing Company.
Leveson, N. (2011). Engineering a Safer World. Cambridge, MA: MIT Press.
Leveson, N. and Thomas, J.P. (2018). STPA Handbook. Technical Report. Cambridge, MA: MIT.
Maragakis, I., Clark, S., Piers, M. et al. (2009). Guidance on Hazard Identification. Report. European Civil Aviation Safety Team (ECAST).
MIL-STD-882E (2012). Standard Practice for System Safety. Washington, DC: U.S. Department of Defense.
Modarres, M. (2006). Risk Analysis in Engineering: Techniques, Tools, and Trends. Boca Raton, FL: Taylor & Francis.
NOG-090 (2017). Norwegian Oil and Gas Recommended Guidelines on a Common Model for Safe Job Analysis (SJA). Guideline. Stavanger, Norway: The Norwegian Oil and Gas.
NSW (2003). Hazard Identification, Risk Assessment, and Risk Control No. 3. Technical report. Sydney, Australia: New South Wales, Department of Urban and Transport Planning.
NSW (2008). HAZOP Guidelines: Hazardous Industry Planning Advisory Paper No. 8. Technical report. Sydney, Australia: New South Wales, Department of Planning.
OSHA (2002). Job Hazard Analysis. Technical report OSHA 3071. Washington, DC: Occupational Safety and Health Administration.
Papazoglou, I.A. and Aneziris, O.N. (2003). Master logic diagram: method for hazard and initiating event identification in process plants. Journal of Hazardous Materials 97: 11–30.
Rasmussen, J. and Svedung, I. (2000). Proactive Risk Management in a Dynamic Society. Karlstad, Sweden: Swedish Rescue Services Agency (currently: The Swedish Civil Contingencies Agency).
Redmill, F.J., Chudleigh, M.F., and Catmur, J.R. (1997). Principles underlying a guideline for applying HAZOP to programmable electronic systems. Reliability Engineering & System Safety 55 (3): 283–293.
Rokseth, B., Utne, I.B., and Vinnem, J.E. (2017). A systems approach to risk analysis in maritime operations. Journal of Risk and Reliability 231 (1): 53–68.
RSC (2007). Note On: Hazard and Operability Studies (HAZOP). Technical report. London: Royal Society of Chemistry, Environmental Health and Safety Committee.
Schubach, S. (1997). A modified computer hazard and operability study procedure. Journal of Loss Prevention in the Process Industries 10 (5–6): 303–307.
UK CAA (2006). Guidance on the Conduct of Hazard Identification, Risk Assessment and the Production of Safety Cases – for Aerodrome Operators and Air Traffic Service Providers. Technical report CAP 760. Gatwick Airport, UK: Civil Aviation Authority.


11 Causal and Frequency Analysis 11.1 Introduction This chapter deals with the second question in the triplet definition of risk “What is the likelihood of that happening?” Our starting point is the bow-tie model and a set of hazardous events identified by the methods described in Chapter 10. In this setting, the second question may be reformulated as “How often will each hazardous event occur?” In most cases, this question is answered by an estimated frequency of the hazardous event. To determine the frequency, we often have to start with a causal analysis for each hazardous event. The causal analysis for a specific hazardous event seeks to identify and understand the cause or the causes that may lead to the hazardous event. A causal analysis is performed as other analyses by breaking down the problem to its constituent parts – layer by layer (see Chapter 4). We start with the direct causes – also called the proximate causes – followed by breaking each direct cause down to its own direct causes, and so on. Causality is a complicated philosophical subject. Interested readers may find a lot more information by searching the internet. A comprehensive discussion may be found in Pearl (2009). 11.1.1

Objectives of the Causal and Frequency Analysis

The objectives of the causal and frequency analysis are to: (1) Determine the causes of the defined hazardous event. How far the causal sequences should be pursued depends on the objective of the analysis and the data available. (2) Establish the relationship between the hazardous event and the basic causes.


(3) Determine the frequency of the hazardous event based on a careful examination of the basic causes and the causal sequences. (4) Determine how important each cause is in relation to the frequency of the hazardous event. (5) Identify existing and potential proactive barriers and evaluate the effectiveness of each barrier and the barriers in combination (more about barriers in Chapter 14).

11.1.2 Methods for Causal and Frequency Analysis

Four different methods of causal and frequency analysis are described in this chapter: Cause and effect diagrams. Cause and effect diagrams have their origin in quality engineering and can be used to identify causes of a hazardous event. The method is easy to use and does not require extensive training. It can be used only for causal analysis and does not provide quantitative answers. Fault tree analysis. Fault tree analysis (FTA) is the most commonly used method for causal analysis of hazardous events in risk analyses. The method is well documented and has been used in a wide range of application areas. FTA is suitable for both qualitative and quantitative analysis of complicated systems but is not well suited to handle dynamic systems and systems with complicated maintenance. The method is also sometimes too rigid in its requirements regarding binary states and Boolean logic. Bayesian networks. Bayesian networks are getting increasingly popular and are in many cases a good alternative to FTA. A Bayesian network can replace any fault tree and is more flexible. A main drawback of Bayesian networks is related to its complicated and time-consuming quantification. Markov methods. Markov methods are used mainly to analyze small but complicated systems with dynamic effects. As such, Markov methods can be combined with, and compensate for some of the weaknesses of FTA. Markov methods are well documented and can give analysts deep insight into system properties and operation. Markov methods are not suitable as an initial method for identification of the causes of the hazardous event. For causal analysis, one of the first three methods should be selected. Of these three, FTA or an analysis based on Bayesian networks is usually the most suitable. Which of these to choose depends on the problem to be analyzed, the knowledge and experience of the study team, the availability of data, and the availability of efficient computer programs. To determine the frequency of the hazardous event, all the last three methods can be used. Which of these to choose, depends on the system and the complexity of the causal sequences. In most cases, it is sufficient to use FTA


or Bayesian networks, but if the system is complicated with dynamic features, Markov methods may be a good supplement. The last two methods are general methods that can be used for many different purposes, and it is outside the scope of this book to present all the features of the methods. A brief introduction to each method is given, and its application in the causal analysis of risk analysis is highlighted. Readers who are interested in a more thorough treatment are advised to consult the references mentioned in each section. The causal analysis can be applied to any event in the system, but in this chapter, we use the term hazardous event to signify the event that is the starting point for the analysis.

11.2 Cause and Effect Diagram Analysis Cause and effect diagram (also called Ishikawa diagram1 or fishbone diagram) may be used to identify, sort, and describe the causes of a specified event. A cause and effect diagram analysis does not have any extensive theoretical basis and is merely a graphical representation and structuring of the knowledge and ideas generated by the study team during brainstorming sessions. Causes are arranged according to their level of importance or detail, resulting in a tree structure that resembles the skeleton of a fish with the main causal categories drawn as bones attached to the spine of the fish. Therefore, the diagram is also called a fishbone diagram. 11.2.1

Objectives and Applications

The main objectives of a cause and effect diagram analysis are: (1) To identify the causes of a defined hazardous event in a system. (2) To classify the causes into groups. (3) To acquire and structure the relevant knowledge and experience of the study team. The cause and effect diagram analysis is done by a study team as a brainstorming session. Cause and effect diagrams are commonly used in product design but can also be used for simple causal analyses of hazardous events as part of a risk analysis of rather simple systems. For complicated systems, FTA would be a better method. A cause and effect diagram has some similarities with a fault tree but is purely qualitative, less structured, and cannot be used as a basis for quantitative analysis. 1 Named after Japanese professor Kaoro Ishikawa (1915–1989), who developed the diagrams.


No international standard for cause and effect diagrams has been published, but detailed guidelines may be found in several textbooks on quality engineering and management, see for example, Ishikawa (1986) and Bergman and Klefsjö (1994). 11.2.2

Analysis Procedure

A cause and effect diagram analysis is normally carried out in four steps:
(1) Plan and prepare.
(2) Construct the cause and effect diagram.
(3) Analyze the diagram qualitatively.
(4) Report the analysis.

Steps 1 and 4 are discussed in Chapter 3 and not treated further here. 11.2.2.1

Step 2: Construct the Cause and Effect Diagram

The main elements of a cause and effect diagram are shown in Figure 11.1. To construct a diagram, the study team starts with a specified hazardous event. The hazardous event is briefly described in a box at the right end of the diagram, which constitutes the "head of the fish." The central spine from the left is drawn as a thick line pointing to this box, and the major categories of potential causes (see Table 11.1) are drawn as bones to the spine, as shown in Figure 11.1. When analyzing technical systems, the following six (6M) categories are frequently used:
(1) Man (i.e. people)
(2) Methods (e.g. work procedures, rules, regulations)
(3) Materials (e.g. raw materials, parts)
(4) Machinery (e.g. technical equipment, computers)
(5) Milieu (e.g. internal/external environment, location, time, safety culture)
(6) Maintenance

The categories should be selected to fit the actual application. It is usually recommended not to use more than seven major categories. Brainstorming is used to identify the factors (or issues) that may affect the hazardous event within each M-category. The team may, for example, ask: “What are the machinery issues affecting/causing…?” This is repeated for each M-category, and factors/issues are identified and entered into the diagram as arrows pointing to the relevant M-category. Checklists, such as the one in Table 11.1 may be used to structure the process. Each factor is then analyzed in the same way to produce subfactors that are represented as arrows pointing to the relevant factor. Asking why this is happening for each factor/issue carries the analysis forward. Additional levels may

Figure 11.1 The main elements of a cause and effect diagram.

Table 11.1 Some basic causes of hazardous events.
Man:
- Operator error
- Lack of knowledge
- Lack of skill
- Stress
- Inadequate capability
- Improper motivation
Methods:
- Lack of procedures
- Inadequate procedures
- Practices are not the same as written procedures
- Poor communication
Materials:
- Lack of raw material
- Low-quality material
- Wrong type for job
Machinery:
- Poor design
- Poor equipment or tool placement
- Defective equipment or tool
- Incorrect tool selection
Milieu (i.e. environment):
- Untidy workplace
- Inadequate job design or layout of work
- Surfaces poorly maintained
- Too high physical demands of the task
- Forces of nature
Maintenance:
- Poor maintenance program
- Poor maintainability
- Poor maintenance performance
- Lack of maintenance procedure

be included under each subfactor, if required. The analysis is continued until we no longer get useful information from asking: "Why can this happen?" A main value of cause and effect diagram analysis lies in the very process of producing the diagrams. This process often leads to ideas and insights that you might not otherwise have come up with.
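For documentation purposes, the diagram may be held as a simple nested structure in a computer program. The Python sketch below uses the six M-categories with a few illustrative factors and subfactors (taken from Table 11.1) and prints the structure as an indented outline; the entries and the layout are examples only.

# The diagram as a nested mapping: M-category -> factor -> list of subfactors.
# The entries below are examples only.
diagram = {
    "Man": {"Operator error": ["Lack of skill", "Stress"]},
    "Methods": {"Inadequate procedures": ["Practices differ from written procedures"]},
    "Machinery": {"Defective equipment or tool": []},
    "Milieu": {"Untidy workplace": []},
    "Materials": {"Low-quality material": []},
    "Maintenance": {"Poor maintenance program": []},
}

def print_diagram(hazardous_event, diagram):
    """Print the cause and effect structure as an indented outline."""
    print(f"Hazardous event: {hazardous_event}")
    for category, factors in diagram.items():
        print(f"  {category}")
        for factor, subfactors in factors.items():
            print(f"    - {factor}")
            for subfactor in subfactors:
                print(f"        * {subfactor}")

print_diagram("Gas leakage from pipeline A", diagram)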

11.2.2.2 Step 3: Analyze the Diagram Qualitatively

When the team members agree that an adequate amount of detail has been provided under each major category, the diagram is analyzed by first grouping the causes. One should look especially for causes that are identical or similar


and that appear in more than one category. Second, the causes should be ranked and listed according to what is considered to be the “most likely causes.” 11.2.3

Resources and Skills Required

Performing a cause and effect diagram analysis does not require any specific training. The team members should therefore be able to carry out the analysis after a brief introduction. The number of team members will vary according to the complexity of the system and the criticality of the hazardous event. The analysis can be carried out using pen and paper, or with a whiteboard and Post-It markers. Many common drawing programs have templates and other aids for drawing cause and effect diagrams that can be useful.

11.2.4 Advantages and Limitations

Advantages. The main advantages are that the cause and effect diagram technique • is easy to learn and does not require any extensive training; • helps determine causes of deviations; • encourages group participation; • increases process knowledge; • helps organize and relate causal factors; • provides a structure for brainstorming; • involves all participants. Limitations. The main limitations are that the cause and effect diagram technique • may become very complicated; • requires patience from the participants; • does not rank the causes in an “if–then” manner; • cannot be used for quantitative analysis.

11.3 Fault Tree Analysis A fault tree is a top-down logic diagram that displays the interrelationships between a potential hazardous event in a system and the causes of this event. The causes at the lowest level are called basic events and may be component failures, environmental conditions, human errors, and normal events (i.e. events that are expected to occur during the life span of the system). FTA was introduced in 1962 at Bell Telephone Laboratories, in connection with a safety evaluation of the Minuteman intercontinental ballistic missile launch control system. FTA is one of the most commonly used methods in risk and reliability studies.


The main international standard for FTA is IEC 61025 (2006). Other important sources of information on FTA include (NASA 2002; NUREG-0492 1981; CCPS 2008). 11.3.1

Objectives and Applications

FTA may be qualitative, quantitative, or both, depending on the scope of the analysis. The main objectives of an FTA are: (1) To identify all possible combinations of basic events that may result in a hazardous event in the system. (2) To find the probability that the hazardous event will occur during a specified time interval or at a specified time t, or the frequency of the hazardous event. (3) To identify aspects (e.g. components, barriers, structure) of the system that need to be improved to reduce the probability of the hazardous event. FTA is especially suitable for analyzing large and complicated systems with an ample degree of redundancy. In particular, FTA has been used successfully in risk analyses within the nuclear (e.g. see NUREG-75/014 1975), chemical (e.g. see CCPS 2000), and aerospace industries (e.g. NASA 2002). FTA has traditionally been applied to mechanical and electromechanical systems, but there is no fundamental reason why the methodology cannot be applied to any type of system. 11.3.2

Method Description

FTA is a deductive method, which means that we reason backward in the causal sequence of a specific event. We start with a specified potential hazardous event in the system, called the TOP event of the fault tree. The immediate causal events E1 , E2 , … that, either alone or in combination, will lead to the TOP event are identified and connected to the TOP event through a logic gate (see Table 11.2). Next, all the potential causal events Ei,1 , Ei,2 , … that may lead to event Ei for i = 1, 2, … are identified and connected to event Ei through a logic gate. The procedure is continued deductively until a suitable level of detail is reached. The events at this level make the basic events of the fault tree.2 Table 11.2 shows the most commonly used fault tree symbols together with a brief description of their interpretation. A number of more advanced fault tree symbols are available but are not covered in this book. A thorough description may be found, for example, in NASA (2002). FTA is a binary analysis. All events, from the TOP event down to the basic events, are assumed to be binary events that either occur or do not occur. 2 The basic events are also called root nodes of the tree.


Table 11.2 Fault tree symbols.
Logic gates:
- OR-gate: the or-gate indicates that the output event A occurs if at least one of the input events Ei occurs.
- AND-gate: the and-gate indicates that the output event A occurs only when all the input events Ei occur at the same time.
Input events:
- Basic event: the basic event represents an event (typically a basic equipment failure) that requires no further development of failure causes.
- Undeveloped event: the undeveloped event represents an event that is not examined further because information is not available or because its consequence is insignificant.
Description:
- Comment rectangle: the comment rectangle is for supplementary information.
Transfer symbols:
- Transfer-out and transfer-in: the transfer-out symbol indicates that the fault tree is developed further at the occurrence of the corresponding transfer-in symbol.


Figure 11.2 Oil and gas separator in Example 11.1.

No intermediate states (e.g. “the brake pads are 80% worn”) are allowed in the fault tree. The fault tree diagram is a deterministic model. This means that when the fault tree is constructed and we know the states of all the basic events, the state of the TOP event and of all intermediate events are known. A fault tree is single event-oriented, meaning that a separate fault tree must be constructed for each potential TOP event that we want to analyze. Example 11.1 (Separator vessel) Consider the oil and gas separator in Figure 11.2. A mixture of high-pressure oil, gas, and water is fed into a separator vessel. If a blockage occurs in the gas outlet, the pressure in the separator will increase rapidly. To prevent overpressure, two high-pressure switches, PS1 and PS2 , are installed in the vessel. Upon high pressure, the pressure switches should send signals to a programmable logic controller (PLC). If a signal from at least one pressure switch is received by the PLC, a closure signal will be sent to the process shutdown valves, PSD1 and PSD2 . The shutdown function will fail if both pressure switches fail to send a signal, or the PLC fails to handle the signals and send a closure signal to the valves, or valves PSD1 and PSD2 both


Figure 11.3 Fault tree for the shutdown system in Example 11.1. The TOP event "Flow into separator fails to be shut down when high pressure occurs" is connected through an or-gate to three causes: no signal about high pressure from the pressure switches (an and-gate over the basic events PS1 and PS2), the PLC does not transmit the signal about high pressure (basic event PLC), and the shutdown valves fail to close on demand (basic event SDV).

fail to close on demand. The causes of the TOP event “Flow into separator fails to be shut down when high pressure occurs” are shown in the fault tree in Figure 11.3. The lowest level in the fault tree in Figure 11.3 is a component failure. In some cases, it may also be relevant to identify the potential causes of a component failure: for example, potential primary failures, secondary failures, and command faults, as shown for a pressure switch failure in Figure 11.4. This level may be pursued further, for example, by identifying potential causes for “wrong calibration of pressure switch.” The level at which the analysis is stopped is determined by the objectives of the FTA. ◽ 11.3.2.1

Common-Cause Failures

A common-cause failure is a failure of two or more items due to a single specific event or cause and within a specified time interval. In some cases, it is possible to identify this common cause explicitly and include it in the fault tree. This is shown in the fault tree diagram in Figure 11.5. A parallel system of two pressure switches can fail in two different ways, either as simultaneous individual failures or due to a common cause – in this case that the common tap to the pressure switches is blocked by solids. Common-cause failures and modeling of such failures are treated in detail in Chapter 13.


Figure 11.4 Primary failure, secondary failure, and command fault for a pressure switch in Example 11.1. The event "No signal from pressure switch 1" is developed into an inherent failure in the switch (primary failure, PF), external stress outside the design specification (secondary failure, SF), and wrong calibration of the pressure switch (command fault, CF).

Figure 11.5 Explicit modeling of a common-cause failure in a system of two pressure switches (CCF: common-cause failure). The event "No signal about high pressure from the pressure switches" occurs either through independent switch failures (an and-gate over PS1 and PS2) or because the common tap is blocked with solids (basic event CCF).


Remark 11.1 (FTA and system analysis) A fault tree diagram neatly illustrates the definition of a system analysis in Section 4.4.3. A system fault is decomposed into subsystem faults and sub-subsystem faults, and so on until the basic events are reached. When the fault tree structure and the basic event probabilities are known, synthesis can be used to determine the system fault properties, such as TOP event probability and importance ranking of the various subsystems and basic events. ◽

Figure 11.6 Relationship between some simple fault tree diagrams and reliability block diagrams.


11.3.2.2 Reliability Block Diagrams

A fault tree diagram (with only and- and or-gates) can always be converted to a reliability block diagram, and vice versa. This is shown in Figure 11.6. A reliability block diagram shows the logical connections of functioning items that are needed to fulfill a specified system function. Each function is represented as a functional block and is drawn as a square (see Figure 11.6). If we can proceed through a functional block from one endpoint to the other, we say that the item is functioning. A brief introduction to reliability block diagrams is given in Appendix A. For a more thorough treatment, e.g. see Rausand et al. (2020). The reliability block diagram in Figure 11.6a represents a series structure that will fail if item 1 fails, or item 2 fails, or item 3 fails. A series structure always corresponds to an or-gate in the fault tree when the basic events represent item failure. The reliability block diagram in Figure 11.6b is a parallel structure that will fail only when item 1 fails, and item 2 fails, and item 3 fails. It is therefore clear that the parallel structure corresponds to an and-gate. Observe that to save space, we have omitted the rectangles describing the basic events in the fault trees in Figure 11.6. In practical applications, we should always give proper descriptions of the events in the fault tree. 11.3.2.3

Minimal Cut Sets

A fault tree provides valuable information about possible combinations of basic events that can result in the TOP event. Such a combination of basic events is called a cut set, and is defined as: Definition 11.1 (Cut set) A cut set in a fault tree is a set of basic events whose (simultaneous) occurrence ensures that the TOP event occurs. ◽ The most interesting cut sets are those that are minimal: Definition 11.2 (Minimal cut set) A cut set is said to be minimal if the set cannot be reduced without losing its status as a cut set. ◽ Let C1 , C2 , … , Ck denote the k minimal cut sets of a fault tree. The number of distinct basic events in a minimal cut set is called the order of the cut set. A minimal cut set is said to fail when all the basic events of this cut set are occurring at the same time.3 A minimal cut set can therefore be represented as 3 The term occurring here may be somewhat misleading. The term does not imply that a basic event occurs exactly at time t; it means that the state of the basic event is present at time t (e.g. a component is in a failed state at time t).


Figure 11.7 The TOP event will occur if at least one of the k minimal cut sets fails.

A minimal cut set can therefore be represented as a fault tree with a single and-gate, as shown in Figure 11.6b. In a reliability block diagram, a minimal cut set can be represented as a single parallel structure with r items, where r is the order of the minimal cut set. All the r items in this parallel structure have to fail for the minimal cut set to fail.

Let Cj(t) be the event that minimal cut set Cj is failed at time t, for j = 1, 2, … , k. The TOP event occurs at time t when at least one of the minimal cut sets fails at t, and can therefore be expressed as

TOP(t) = C1(t) ∪ C2(t) ∪ · · · ∪ Ck(t)    (11.1)

The fault tree can therefore be represented by an alternative top structure: the minimal cut set fault trees connected through a single or-gate, as shown in Figure 11.7. To save space, the rectangles describing the basic events are omitted in Figure 11.7, and each minimal cut set is drawn with three basic events. The basic events in minimal cut set j are denoted j.1, j.2, and j.3, for j = 1, 2, … , k. In a real fault tree, the minimal cut sets will be of different orders, and the same basic event may be a member of several minimal cut sets.

For small and simple fault trees, it is feasible to identify the minimal cut sets by inspection, without a formal procedure or algorithm. For large or complicated fault trees, an efficient algorithm is needed.

11.3.2.4 Identification of Minimal Cut Sets by MOCUS

MOCUS (method for obtaining cut sets) is a simple algorithm that can be used to find the minimal cut sets in a fault tree. The algorithm is best explained by an example.


Figure 11.8 Example of a fault tree.


Consider the fault tree in Figure 11.8, where the gates are called TOP and G1 to G4. The algorithm starts at the TOP event. If this is an or-gate, each input to the gate is written in a separate row. Similarly, if the TOP gate is an and-gate, the inputs to the gate are written in separate columns (i.e. in the same row). In our example, the TOP gate is an or-gate, and we start with the rows

1
G1
2

Because each of the three inputs, 1, G1, and 2, will cause the TOP event to occur, each of them will constitute a cut set. The idea is to successively replace each gate with its inputs (basic events and new gates) until one has gone through the entire fault tree and is left with only basic events. When this procedure is completed, the rows in the established matrix represent the cut sets of the fault tree.


Because G1 is an or-gate, it is replaced by its inputs in separate rows:

1
G2
G3
2

Because G2 is an and-gate, it is replaced by its inputs in the same row:

1
3, G4
G3
2

Because G3 is an or-gate:

1
3, G4
4
5
2

Because G4 is an or-gate:

1
3, 6
3, 7
4
5
2

We are then left with the following six cut sets: {1}, {2}, {4}, {5}, {3, 6}, {3, 7}.

Observe that an or-gate increases the number of minimal cut sets in the system, whereas an and-gate increases the order of the cut sets (i.e. increases the number of basic events in the cut sets). If the same basic event is represented in two or more places in the fault tree, MOCUS will generally not provide the minimal cut sets. It is therefore necessary to check that the cut sets identified are indeed minimal. Such a routine is included when MOCUS is implemented in computer programs for FTA. In the example above, all the basic events are unique, and the algorithm provides the minimal cut sets.

Example 11.2 (Nonminimal cut sets) Assume that we have applied the MOCUS algorithm to a fault tree and have arrived at the following cut sets: {B1, B2, B5}, {B3, B5, B6}, {B1, B2}, and {B1, B4}.



Figure 11.9 Reliability block diagram corresponding to the fault tree in Figure 11.8.

In this case, we see that the first cut set contains three basic events, but two of these basic events also form a cut set in their own right, {B1, B2}. The third basic event in the first cut set, B5, can therefore be removed without the set losing its status as a cut set. The first cut set is therefore not a minimal cut set. The minimal cut sets are {B3, B5, B6}, {B1, B2}, and {B1, B4}. ◽

In Figure 11.9, the fault tree in Figure 11.8 is converted to a reliability block diagram, from which the minimal cut sets can easily be seen. This approach is not feasible for large fault trees, and an efficient algorithm is therefore needed. Several more efficient algorithms for identification of minimal cut sets have been developed and implemented in computer programs for FTA.
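To make the MOCUS procedure concrete, the following Python sketch encodes the fault tree of Figure 11.8 as a simple dictionary of gates, expands the gates top-down, and then removes nonminimal cut sets with a subset check. The dictionary encoding and the helper names mocus and minimize are our own illustrative choices; they are not taken from any particular FTA program.

```python
# Minimal MOCUS sketch for the fault tree of Figure 11.8 (illustrative encoding).

# Each gate maps to (gate type, list of inputs); inputs are gate names or basic events.
fault_tree = {
    "TOP": ("or", ["1", "G1", "2"]),
    "G1": ("or", ["G2", "G3"]),
    "G2": ("and", ["3", "G4"]),
    "G3": ("or", ["4", "5"]),
    "G4": ("or", ["6", "7"]),
}

def mocus(tree, top="TOP"):
    """Expand gates top-down: or-gates add new rows, and-gates extend the row."""
    rows = [[top]]
    while any(item in tree for row in rows for item in row):
        new_rows = []
        for row in rows:
            gate = next((item for item in row if item in tree), None)
            if gate is None:
                new_rows.append(row)
                continue
            gtype, inputs = tree[gate]
            rest = [item for item in row if item != gate]
            if gtype == "and":                       # and-gate: inputs join the same row
                new_rows.append(rest + inputs)
            else:                                    # or-gate: one new row per input
                new_rows.extend(rest + [inp] for inp in inputs)
        rows = new_rows
    return [set(row) for row in rows]

def minimize(cut_sets):
    """Remove any cut set that strictly contains another cut set."""
    return [c for c in cut_sets if not any(other < c for other in cut_sets)]

print(minimize(mocus(fault_tree)))
# Expected (in some order): {'1'}, {'2'}, {'4'}, {'5'}, {'3','6'}, {'3','7'}
```

Running the sketch reproduces the six minimal cut sets found by hand above.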

11.3.2.5 Fault Tree with a Single AND-gate

Consider a fault tree with a single and-gate, as shown in Figure 11.10. Let Ei(t) denote that the event Ei is occurring at time t, for i = 1, 2, … , n. Because the TOP event occurs if and only if all the basic events occur, the Boolean representation of the fault tree is

TOP(t) = E1(t) ∩ E2(t) ∩ · · · ∩ En(t)    (11.2)

The probability that event Ei is occurring at time t is denoted

qi(t) = Pr[Ei(t)]

Figure 11.10 Fault tree with a single AND-gate.


Figure 11.11 Fault tree with a single OR-gate.

If event Ei is a component failure, then qi(t) is the unreliability or unavailability of the component. We assume that the events E1(t), E2(t), … , En(t) are independent. The probability of the TOP event at time t, Qs(t), is then

Qs(t) = Pr[E1(t) ∩ E2(t) ∩ · · · ∩ En(t)] = Pr[E1(t)] Pr[E2(t)] · · · Pr[En(t)]
      = q1(t) q2(t) · · · qn(t) = ∏_{i=1}^{n} qi(t)    (11.3)

11.3.2.6 Fault Tree with a Single OR-gate

Consider a fault tree with a single or-gate, as shown in Figure 11.11. In this case, any of the basic events will cause the TOP event to occur, and the Boolean representation is

TOP(t) = E1(t) ∪ E2(t) ∪ · · · ∪ En(t)    (11.4)

When all the events E1(t), E2(t), … , En(t) are independent, the probability of the TOP event at time t is

Qs(t) = Pr[E1(t) ∪ E2(t) ∪ · · · ∪ En(t)] = 1 − Pr[E1*(t) ∩ E2*(t) ∩ · · · ∩ En*(t)]
      = 1 − Pr[E1*(t)] Pr[E2*(t)] · · · Pr[En*(t)] = 1 − ∏_{i=1}^{n} [1 − qi(t)]    (11.5)

11.3.3 TOP Event Probability

As indicated in (11.1) and Figure 11.7, any fault tree diagram can be represented as an alternative fault tree diagram with a single or-gate that has all the minimal cut set failures as input events. From (11.1), the probability Q0(t) of the TOP event at time t can be written as

Q0(t) = Pr[TOP(t)] = Pr[C1(t) ∪ C2(t) ∪ · · · ∪ Ck(t)]    (11.6)

where Cj(t) denotes the event that minimal cut set j is failed at time t, for j = 1, 2, … , k. Minimal cut set j fails at time t when all the basic events Ej,i in Cj occur at time t. The minimal cut set failure Cj(t) can therefore be represented as a fault tree with a single and-gate. The probability that minimal cut set Cj is failed at time t is denoted Q̌j(t). If all the basic events in minimal cut set Cj are independent, we get from (11.3)

Q̌j(t) = Pr[Ej,1(t) ∩ Ej,2(t) ∩ · · · ∩ Ej,nj(t)] = ∏_{i∈Cj} qi(t)    (11.7)

where nj is the number of basic events in minimal cut set Cj, for j = 1, 2, … , k. If all the minimal cut sets were independent, we could use (11.5) to determine the probability of the TOP event at time t as

Q0(t) = 1 − ∏_{j=1}^{k} [1 − Q̌j(t)]

The same basic event may be a member of several minimal cut sets, and the minimal cut sets are therefore generally not independent. This type of dependency gives a positive association (e.g. see Barlow and Proschan 1975) between the minimal cut sets, and we can deduce the following approximation:

Q0(t) ⪅ 1 − ∏_{j=1}^{k} [1 − Q̌j(t)]    (11.8)

This formula is called the upper bound approximation formula and is used by most FTA programs. The right-hand side of (11.8) generally gives an adequate approximation. The approximation is conservative, meaning that the true TOP event probability Q0(t) is slightly less than the calculated value.
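As a small numerical illustration of (11.7) and the upper bound approximation (11.8), the sketch below assumes the six minimal cut sets found for Figure 11.8 and purely illustrative basic event probabilities qi(t); none of the numbers come from the book.

```python
# Upper bound approximation (11.8) for the TOP event probability (illustrative values).

from math import prod

minimal_cut_sets = [{"1"}, {"2"}, {"4"}, {"5"}, {"3", "6"}, {"3", "7"}]

# Illustrative basic event probabilities at a fixed time t
q = {"1": 0.001, "2": 0.002, "3": 0.01, "4": 0.003, "5": 0.001, "6": 0.05, "7": 0.04}

# Eq. (11.7): probability that each minimal cut set is failed (and-gate product)
Q_check = [prod(q[e] for e in cs) for cs in minimal_cut_sets]

# Eq. (11.8): upper bound approximation (or-gate over the minimal cut set failures)
Q0_upper = 1 - prod(1 - Qj for Qj in Q_check)
print(f"Upper bound approximation for Q0(t): {Q0_upper:.3e}")
```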

11.3.3.1 Inclusion–Exclusion Method

The inclusion–exclusion method is an alternative to the upper bound approximation formula and will give a more accurate value for the TOP event probability. It will, at the same time, require more computing resources. By using the addition rule for probabilities on (11.6), we get

Q0(t) = Σ_{j=1}^{k} Pr[Cj(t)] − Σ_{i<j} Pr[Ci(t) ∩ Cj(t)] + ⋯
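The inclusion–exclusion computation can be sketched as follows, assuming that all basic events are independent, so that the probability of an intersection of minimal cut sets is the product of qi(t) over the union of their basic events. The cut sets and probabilities are the same illustrative values as in the previous sketch.

```python
# Exact Q0(t) by inclusion-exclusion over minimal cut sets (illustrative values,
# independent basic events assumed).

from itertools import combinations
from math import prod

minimal_cut_sets = [{"1"}, {"2"}, {"4"}, {"5"}, {"3", "6"}, {"3", "7"}]
q = {"1": 0.001, "2": 0.002, "3": 0.01, "4": 0.003, "5": 0.001, "6": 0.05, "7": 0.04}

def top_event_probability(cut_sets, q):
    """Alternating inclusion-exclusion sum over all nonempty combinations of cut sets."""
    Q0 = 0.0
    for r in range(1, len(cut_sets) + 1):
        sign = (-1) ** (r + 1)
        for combo in combinations(cut_sets, r):
            events = set().union(*combo)          # basic events in the intersection
            Q0 += sign * prod(q[e] for e in events)
    return Q0

print(f"Exact Q0(t): {top_event_probability(minimal_cut_sets, q):.3e}")
```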

R(t) = Pr(T > t)    for t > 0    (A.29)

Hence, R(t) is the probability that the item does not fail in the time interval (0, t] or, in other words, the probability that the item survives the time interval (0, t] and is still functioning at time t. The survivor function is shown in Figure A.13.

A.4.2.5 Failure Rate Function

The probability that an item will fail in the time interval (t, t + Δt], when we know that the item is functioning at time t, is given by the conditional probability

Pr(t < T ≤ t + Δt ∣ T > t) = Pr(t < T ≤ t + Δt) / Pr(T > t) = [F(t + Δt) − F(t)] / R(t)

Figure A.13 The survivor function R(t).


By dividing this probability by the length of the time interval, Δt, and letting Δt → 0, we get the failure rate function z(t) of the item

z(t) = lim_{Δt→0} Pr(t < T ≤ t + Δt ∣ T > t) / Δt = lim_{Δt→0} [F(t + Δt) − F(t)] / [Δt R(t)] = f(t) / R(t)    (A.30)

This implies that when Δt is small,

Pr(t < T ≤ t + Δt ∣ T > t) ≈ z(t)Δt    (A.31)

Observe the difference between the probability density function f(t) and the failure rate function z(t). Assume that we start out with a new item at time t = 0 and at time t = 0 ask "What is the probability that this item will fail in the interval (t, t + Δt]?" According to (A.27), this probability is approximately equal to the probability density function f(t) at time t multiplied by the length of the interval, Δt. Next, consider an item that has survived until time t, and assume that we at time t ask "What is the probability that this item will fail in the next interval (t, t + Δt]?" This (conditional) probability is, according to (A.31), approximately equal to the failure rate function z(t) at time t multiplied by the length of the interval, Δt.

A.4.2.6 Mean Value

Let T be the time to failure of an item. The mean time to failure (MTTF) or expected value of T is

MTTF = E(T) = ∫0∞ t f(t) dt    (A.32)

Because f(t) = −R′(t),

MTTF = −∫0∞ t R′(t) dt

By partial integration,

MTTF = −[t R(t)]0∞ + ∫0∞ R(t) dt

If MTTF < ∞, it can be shown that [t R(t)]0∞ = 0. In that case,

MTTF = ∫0∞ R(t) dt    (A.33)

It is often easier to determine MTTF by (A.33) than by (A.32).
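A quick numerical check of (A.32) and (A.33) can be made with plain Riemann sums. The sketch below assumes an exponential time to failure with rate 0.4 per year (so that MTTF = 2.5 years); the step size and integration horizon are arbitrary illustrative choices.

```python
# Numerical comparison of MTTF by (A.32) and (A.33) for an exponential distribution.

from math import exp

lam, dt, horizon = 0.4, 0.001, 50.0
ts = [i * dt for i in range(int(horizon / dt))]

f = lambda t: lam * exp(-lam * t)      # probability density function
R = lambda t: exp(-lam * t)            # survivor function

mttf_a32 = sum(t * f(t) * dt for t in ts)   # Eq. (A.32)
mttf_a33 = sum(R(t) * dt for t in ts)       # Eq. (A.33)
print(mttf_a32, mttf_a33)                   # both close to 2.5
```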


A.4.2.7 Median Life

The median life tm is defined by

R(tm) = 0.50    (A.34)

The median divides the distribution in two halves. The item will fail before time tm with 50% probability and will fail after tm with 50% probability.

A.4.2.8 Variance

The variance of T is

Var(T) = σ² = E[(T − μ)²] = ∫0∞ (t − μ)² f(t) dt    (A.35)

A.4.2.9 Marginal and Conditional Distributions

Let T1 and T2 be two continuous random variables with probability density functions f1(t) and f2(t), respectively. The joint probability density function of T1 and T2 is written as f(t1, t2). The marginal probability density function of T1 is

f1(t1) = ∫0∞ f(t1, t2) dt2    (A.36)

The conditional probability density function of T2 when we have observed that T1 = t1 is

f(t2 ∣ t1) = f(t1, t2) / f1(t1)    for f1(t1) > 0    (A.37)

This result may, for example, be used to find

Pr(a < T2 < b ∣ T1 = t1) = ∫a^b f(t2 ∣ t1) dt2    (A.38)

and the conditional mean value as

E(T2 ∣ t1) = ∫0∞ t2 f(t2 ∣ t1) dt2    (A.39)

A.4.2.10 Independent Variables

The two continuous random variables T1 and T2 are independent if f(t2 ∣ t1) = f2(t2) and f(t1 ∣ t2) = f1(t1), which is the same as saying that T1 and T2 are independent if

f(t1, t2) = f1(t1) f2(t2)    for all t1 and t2

The definition is easily extended to more than two variables.


A.4.2.11 Convolution

Let T1 and T2 be two independent variables with probability density functions f1 and f2, respectively. It is sometimes important to be able to find the distribution of T1 + T2:

F1,2(t) = Pr(T1 + T2 ≤ t) = ∬_{x+y≤t} f1(x) f2(y) dx dy
        = ∫0∞ ( ∫0^(t−y) f1(x) dx ) f2(y) dy = ∫0∞ F1(t − y) f2(y) dy    (A.40)

The distribution function F1,2 is called the convolution of the distributions F1 and F2 (the distribution functions of T1 and T2, respectively). By differentiating F1,2(t) with respect to t, the probability density function f1,2(t) of T1 + T2 is obtained as

f1,2(t) = ∫0∞ f1(t − y) f2(y) dy    (A.41)

A.5 Some Specific Distributions

A.5.1 Discrete Distributions

A.5.1.1 The Binomial Distribution

The binomial distribution is used in the following situation:

(1) We have n independent trials.
(2) Each trial has two possible outcomes E and E*.
(3) The probability Pr(E) = p is the same in all trials.

This situation is called a binomial situation, and the trials are sometimes referred to as Bernoulli trials. The "bi" in binomial indicates two possible outcomes. Let X denote the number of the n trials that have outcome E. Then X is a discrete random variable with probability mass function

Pr(X = x) = C(n, x) p^x (1 − p)^(n−x)    for x = 0, 1, … , n    (A.42)

where C(n, x) is the binomial coefficient

C(n, x) = n! / [x!(n − x)!]    (A.43)

The distribution (A.42) is called the binomial distribution (n, p), and we sometimes write X ∼ bin(n, p). Here n, the number of trials, is usually a known


constant, whereas p is a parameter of the distribution. The parameter p is usually an unknown constant and is not directly observable. It is not possible to "measure" a parameter. The parameter can only be estimated as a relative frequency based on observations of a high number of trials. The mean value and the variance of X are

E(X) = np    (A.44)
Var(X) = np(1 − p)    (A.45)

Example A.7 A fire pump is tested regularly. During the tests, we attempt to start the pump and let it run for a short while. We observe a "fail to start" (event E) if the fire pump cannot be started within a specified interval. Assume that we have performed n = 150 tests and that we can consider these tests to be independent. In total, X = 2 events E have been recorded. From (A.44), we know that p = E(X)/n, and it is therefore natural to estimate the fail-to-start probability by

p̂ = x/n = 2/150 ≈ 0.0133 = 1.33% ◽

Example A.8 A 2-out-of-3 structure is a system that is functioning when at least 2 of its 3 components are functioning. We assume that the components are independent, and let X be the number of components that are functioning. Let p be the probability that a specific component is functioning. This situation may be considered as a binomial situation with n = 3 trials, and X therefore has a binomial distribution. The probability pS that the 2-out-of-3 structure is functioning is

pS = Pr(X ≥ 2) = Pr(X = 2) + Pr(X = 3)
   = C(3, 2) p² (1 − p)^(3−2) + C(3, 3) p³ (1 − p)^(3−3)
   = 3p²(1 − p) + p³ = 3p² − 2p³ ◽
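The closed-form result of Example A.8 is easy to verify numerically. The sketch below assumes an illustrative component probability p = 0.95 and compares the direct binomial calculation of Pr(X ≥ 2) with the expression 3p² − 2p³.

```python
# Numerical check of the 2-out-of-3 structure probability (Example A.8).

from math import comb

p = 0.95
binom = lambda x: comb(3, x) * p**x * (1 - p) ** (3 - x)   # bin(3, p) pmf

p_system_direct = binom(2) + binom(3)
p_system_formula = 3 * p**2 - 2 * p**3
print(p_system_direct, p_system_formula)   # both 0.99275
```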

A.5.1.2 The Geometric Distribution

Assume again that we have a binomial situation, and let Z be the number of trials until the first trial with outcome E. If Z = z, this means that the first (z − 1) trials result in E*, and that the first E occurs in trial z. The probability mass function of Z is then

Pr(Z = z) = (1 − p)^(z−1) p    for z = 1, 2, …    (A.46)

The distribution (A.46) is called the geometric distribution. We have that

Pr(Z > z) = (1 − p)^z


The mean value and the variance of Z are

E(Z) = 1/p    (A.47)
Var(Z) = (1 − p)/p²    (A.48)

A.5.1.3 The Poisson Distribution and the Poisson Process

In risk analysis, it is often assumed that events occur according to a homogeneous Poisson process (HPP). An HPP is a stochastic process where we count the number of occurrences of an event E. To be an HPP, the following conditions must be fulfilled:

(1) The number of occurrences of E in one time interval is independent of the number that occurs in any other disjoint time interval. This implies that the HPP has no memory.
(2) The probability that event E will occur during a very short time interval is proportional to the length of the time interval and does not depend on the number of events occurring outside this time interval.
(3) The probability that more than one event will occur in a very short time interval is negligible.

Without loss of generality, we let t = 0 be the starting point of the process. Let NE(t) be the number of times event E occurs during the time interval (0, t]. The discrete random variable NE(t) is called a Poisson random variable, and its probability distribution is called a Poisson distribution. The probability mass function of NE(t) is

Pr(NE(t) = n) = [(λE t)^n / n!] e^(−λE t)    for n = 0, 1, …    (A.49)

where λE > 0 is a parameter and e = 2.718 28 … The mean number of occurrences of E in the time interval (0, t] is

E[NE(t)] = Σ_{n=0}^{∞} n Pr(NE(t) = n) = λE t    (A.50)

and

λE = E[NE(t)] / t

The parameter λE is therefore the mean number of occurrences of E per time unit and is called the rate of the Poisson process or the rate of occurrence of events E. The variance of NE(t) is

Var[NE(t)] = λE t    (A.51)
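As a small illustration of (A.49), the sketch below assumes an illustrative rate of 0.2 events per year and a period of 10 years and evaluates a few Poisson probabilities.

```python
# Poisson probabilities from (A.49) with illustrative rate and time period.

from math import exp, factorial

lam, t = 0.2, 10.0
poisson_pmf = lambda n: (lam * t) ** n / factorial(n) * exp(-lam * t)

print(poisson_pmf(0))                           # probability of no events in (0, t]
print(sum(poisson_pmf(n) for n in range(3)))    # probability of at most 2 events
```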


A.5.2 Continuous Distributions

A.5.2.1 The Exponential Distribution

Let us assume that the time to failure T of an item is exponentially distributed with parameter λ. The probability density function of T is then given by

f(t) = λe^(−λt) for t > 0, λ > 0, and f(t) = 0 otherwise    (A.52)

The distribution function is, for t > 0,

F(t) = Pr(T ≤ t) = 1 − e^(−λt)    (A.53)

The probability density function and the distribution function of the exponential distribution are shown in Figure A.14. The reliability (survivor) function becomes

R(t) = Pr(T > t) = ∫t∞ f(u) du = e^(−λt)    for t > 0    (A.54)

The MTTF is

MTTF = ∫0∞ R(t) dt = ∫0∞ e^(−λt) dt = 1/λ    (A.55)

The variance of T is

Var(T) = 1/λ²    (A.56)

and the failure rate function is

z(t) = f(t)/R(t) = λe^(−λt)/e^(−λt) = λ    (A.57)

Accordingly, the failure rate function of an item with exponential life distribution is constant (i.e. independent of time). The results (A.55) and (A.57) compare well with the use of the concepts in everyday language. If an item has on the average λ = 4 failures/year, the MTTF of the item is 1/4 year.

Figure A.14 Exponential distribution (λ = 0.4).


Now suppose that an item has exponential time to failure T. For such an item,

Pr(T > t + x ∣ T > t) = Pr(T > t + x) / Pr(T > t) = e^(−λ(t+x)) / e^(−λt) = e^(−λx) = Pr(T > x)

This implies that the probability that an item will be functioning at time t + x, given that it is functioning at time t, is equal to the probability that a new item has a time to failure longer than x. Hence, the remaining lifetime of an item that is functioning at time t is independent of t. This means that the exponential distribution has no "memory." An assumption of exponentially distributed lifetime therefore implies that:

• A used item is stochastically as good as new. Thus, there is no reason to replace a functioning item.
• For the estimation of the survivor function, the MTTF, and so on, it is sufficient to collect data on the number of hours of observed time in operation and the number of failures. The age of the items is of no interest in this connection.

The exponential distribution is the most commonly used life distribution in applied risk and reliability analyses. The reason for this is its mathematical simplicity and that it leads to realistic lifetime models for certain types of items.

A.5.2.2 The Exponential Distribution and the Poisson Process

Assume that failures occur according to a Poisson process with rate λ, and let N(t) be the number of failures in the time interval (0, t]. The probability mass function of N(t) is

Pr(N(t) = n) = [(λt)^n / n!] e^(−λt)    for n = 0, 1, …

Let T1 be the time from t = 0 until the first failure, T2 be the time between the first and second failures, and so on. The variables T1, T2, … can now be shown to be independent and exponentially distributed with parameter λ (e.g. see Rausand et al. 2020).

A.5.2.3 The Weibull Distribution

The Weibull distribution is one of the most widely used life distributions in reliability analysis. The distribution is named after the Swedish professor Waloddi Weibull (1887–1979), who developed the distribution for modeling the strength of materials. The Weibull distribution is very flexible and can, through an appropriate choice of parameters, model many types of failure rate behaviors. The time to failure T of an item is said to be Weibull distributed with parameters α and λ if the distribution function is given by

F(t) = 1 − e^(−(λt)^α) for t > 0, λ > 0, α > 0, and F(t) = 0 otherwise    (A.58)


The corresponding probability density function is

f(t) = dF(t)/dt = αλ^α t^(α−1) e^(−(λt)^α) for t > 0, and f(t) = 0 otherwise    (A.59)

where λ is a scale parameter and α is referred to as a shape parameter. Observe that when α = 1, the Weibull distribution is equal to the exponential distribution. The probability density function f(t) is shown in Figure A.15 for selected values of α. The survivor function is

R(t) = Pr(T > t) = e^(−(λt)^α)    for t > 0    (A.60)

and the failure rate function is

z(t) = f(t)/R(t) = αλ^α t^(α−1)    for t > 0    (A.61)

The failure rate function z(t) of the Weibull distribution is shown in Figure A.16 for selected values of α. Because of its flexibility, the Weibull distribution may be used to model life distributions where the failure rate function is decreasing, constant, and increasing. The MTTF of the Weibull distribution is

MTTF = ∫0∞ R(t) dt = (1/λ) Γ(1/α + 1)    (A.62)

where Γ(⋅) is the gamma function defined by¹

Γ(x) = ∫0∞ t^(x−1) e^(−t) dt

Figure A.15 The probability density function of the Weibull distribution for selected values of the shape parameter α (λ = 1).

¹ The gamma function is a standard function in computer programs, such as MATLAB and GNU Octave.

Figure A.16 The failure rate function of the Weibull distribution for selected values of the shape parameter α (λ = 1).

In particular, we have

Γ(n + 1) = n!    for n = 0, 1, 2, …

The variance of T is

Var(T) = (1/λ²) [Γ(2/α + 1) − Γ²(1/α + 1)]
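The Weibull formulas (A.60)–(A.62) are easy to evaluate numerically. The sketch below assumes the parameterization used here, F(t) = 1 − exp[−(λt)^α], with illustrative values λ = 0.01 per hour and α = 2.

```python
# Evaluating the Weibull survivor function, failure rate, and MTTF (illustrative values).

from math import exp, gamma

lam, alpha = 0.01, 2.0

R = lambda t: exp(-(lam * t) ** alpha)                 # survivor function (A.60)
z = lambda t: alpha * lam**alpha * t ** (alpha - 1)    # failure rate function (A.61)
mttf = gamma(1 / alpha + 1) / lam                      # MTTF (A.62)

print(R(50.0), z(50.0), mttf)   # MTTF is approximately 88.6 hours for these values
```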

A.5.2.4 The Normal (Gaussian) Distribution

The most commonly used distribution in statistics is the normal (Gaussian) distribution. A random variable T is said to be normally distributed with mean ν and variance τ² when the probability density function of T is

f(t) = [1 / (√(2π) τ)] e^(−(t−ν)²/2τ²)    for −∞ < t < ∞    (A.63)

To simplify the notation, we sometimes write T ∼ 𝒩(ν, τ²). The probability density function f(t) is symmetric about a vertical axis through t = ν and has the t-axis as a horizontal asymptote. The curve has its points of inflection at t = ν ± τ. The total area under the curve and above the horizontal axis is equal to 1. In the special case when ν = 0 and τ² = 1, the distribution is called the standard normal distribution, and we denote it by 𝒩(0, 1). If the random variable X ∼ 𝒩(ν, τ²) and τ² > 0, then U = (X − ν)/τ ∼ 𝒩(0, 1). The random variable U is said to be standardized.


The distribution function of a random variable U with standard normal distribution is usually denoted by Φ(u) = Pr(U ≤ u). The corresponding probability density function is

φ(u) = [1/√(2π)] e^(−u²/2)    (A.64)

The distribution function of the general normal distribution T ∼ 𝒩(ν, τ²) may be written as

F(t) = Pr(T ≤ t) = Pr((T − ν)/τ ≤ (t − ν)/τ) = Pr(U ≤ (t − ν)/τ) = Φ((t − ν)/τ)    (A.65)

Example A.9 Let T denote the time to failure of a technical item and assume that T has a normal distribution with mean ν = 20 000 hours and variance τ², where τ = 5000 hours. The probability that the item will fail in the time interval from t1 = 17 000 hours to t2 = 21 000 hours is

Pr(t1 < T ≤ t2) = Pr((t1 − ν)/τ < (T − ν)/τ ≤ (t2 − ν)/τ)
               = Φ((t2 − ν)/τ) − Φ((t1 − ν)/τ)
               = Φ(0.200) − Φ(−0.600) ≈ 0.305 ◽

Let T1, T2, … , Tn be independent and identically distributed 𝒩(ν, τ²). It can be shown that

Σ_{i=1}^{n} Ti ∼ 𝒩(nν, nτ²)

The sum of independent normally distributed random variables is hence also normally distributed, and

(Σ_{i=1}^{n} Ti − nν) / (√n τ) = [((1/n) Σ_{i=1}^{n} Ti − ν)/τ] √n ∼ 𝒩(0, 1)

By using the notation T̄ = (1/n) Σ_{i=1}^{n} Ti, the last expression can be written as

(T̄ − E(T̄)) / √Var(T̄) ∼ 𝒩(0, 1)    (A.66)
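Example A.9 can be checked numerically with the standard normal distribution function. The sketch below assumes Python 3.8+, where statistics.NormalDist provides the cdf Φ.

```python
# Numerical check of Example A.9 using statistics.NormalDist.

from statistics import NormalDist

nu, tau = 20_000.0, 5_000.0          # mean and standard deviation (hours)
T = NormalDist(mu=nu, sigma=tau)

p = T.cdf(21_000) - T.cdf(17_000)    # Pr(17000 < T <= 21000)
print(round(p, 3))                   # approximately 0.305
```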


A.5.2.5 The Gamma Distribution

Let T1, T2, … , Tn be independent and exponentially distributed with parameter λ. The sum V = T1 + T2 + · · · + Tn is then gamma distributed with parameters λ and n. The probability density function of the random variable V is

f(v) = [λ / Γ(n)] (λv)^(n−1) e^(−λv)    for v > 0    (A.67)

The parameter n in (A.67) is not restricted to positive integers, but can take any positive value. The mean and variance of V are

E(V) = n/λ    (A.68)
Var(V) = n/λ²    (A.69)

Example A.10 Consider a component that is exposed to a series of shocks that occur according to a Poisson process with rate λ. The time intervals T1, T2, … between consecutive shocks are then independent and exponentially distributed with parameter λ. Assume that the component fails exactly at shock n, and not earlier. The time to failure Σ_{i=1}^{n} Ti is then gamma distributed with parameters λ and n. ◽

A.5.2.6 The Beta Distribution

A random variable T is said to have a beta distribution with parameters r and s if the probability density function of T is given by

f(t) = [Γ(r + s) / (Γ(r)Γ(s))] t^(r−1) (1 − t)^(s−1)    for 0 ≤ t ≤ 1    (A.70)

The mean and variance of T are

E(T) = r / (r + s)    (A.71)
Var(T) = rs / [(r + s)² (r + s + 1)]    (A.72)

A.5.2.7 The Uniform Distribution

Let T have a beta distribution with parameters r = 1 and s = 1. The probability density function is then

f(t) = 1 for 0 ≤ t ≤ 1, and f(t) = 0 otherwise

This distribution is called the uniform or rectangular distribution. The uniform distribution is thus a special case of the beta distribution with r = 1 and s = 1.


In general, we say that a random variable T has a uniform distribution over the interval [a, b] when the probability density function is

f(t) = 1/(b − a) for a ≤ t ≤ b, and f(t) = 0 otherwise    (A.73)

The mean and variance of T are

E(T) = (a + b)/2    (A.74)
Var(T) = (b − a)²/12    (A.75)

A.5.2.8 The Strong Law of Large Numbers

Let X1, X2, … be a sequence of independent random variables with a common distribution (discrete or continuous) with mean value E(Xi) = μ. Then, with probability 1,

lim_{n→∞} (1/n) Σ_{i=1}^{n} Xi = μ    (A.76)

This important result is called the strong law of large numbers, stating that the average of a sequence of independent random variables having the same distribution will, with probability 1, converge to the mean of that distribution.

A.5.2.9 The Central Limit Theorem

Let X1, X2, … be a sequence of independent, identically distributed random variables, each with mean ν and variance τ². Let X̄ be the empirical mean of the first n variables in this sequence, such that

X̄ = (1/n) Σ_{i=1}^{n} Xi

Then, the distribution of

(X̄ − E(X̄)) / √Var(X̄) = [(X̄ − ν)/τ] √n

tends to the standard normal distribution as n → ∞. That is,

Pr([(X̄ − ν)/τ] √n ≤ y) → [1/√(2π)] ∫_{−∞}^{y} e^(−t²/2) dt    (A.77)

as n → ∞. This result is called the central limit theorem and is valid for any distribution of the Xi's.

Example A.11 Let X have a binomial distribution with parameters n and p. Then X can be considered as a sum of n independent random variables that are


bin(1, p). We can therefore use the central limit theorem to conclude that the distribution of

(X − E(X)) / √Var(X) = (X − np) / √(np(1 − p))

approaches the standard normal distribution as n → ∞. The normal approximation is considered to be good for values of n and p such that np(1 − p) ≥ 10. ◽
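A small simulation illustrates the central limit theorem for Example A.11. The sketch assumes illustrative values n = 100 and p = 0.3; the standardized binomial variable should then be approximately standard normal.

```python
# Simulation check of the normal approximation to the binomial (Example A.11).

import random
from statistics import NormalDist

random.seed(1)
n, p = 100, 0.3
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5

def standardized_binomial():
    x = sum(random.random() < p for _ in range(n))   # one bin(n, p) outcome
    return (x - mu) / sigma

samples = [standardized_binomial() for _ in range(10_000)]
frac_below_1 = sum(s <= 1.0 for s in samples) / len(samples)
print(frac_below_1, NormalDist().cdf(1.0))   # both close to 0.841
```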

A.6 Point and Interval Estimation

A random variable is described by its probability distribution. This distribution is usually dependent on one or more parameters. A variable with an exponential life distribution has, for example, one parameter – the failure rate λ. Because the parameters characterize the distribution, we are usually interested in estimating the values of the parameters. A parameter is an unknown quantity that is part of a probability distribution. We can observe numerical values for random variables, but never for parameters. Parameters are, per definition, not observable. In the pure classical approach, we assume that parameters are real quantities, but unknown to us. A gas detector is, for example, assumed to have a failure rate λ, which is a property of the gas detector.

An estimator of an unknown parameter is simply a statistic (i.e. a function of one or more random variables) that "corresponds" to that parameter. A particular numerical value of an estimator that is computed from observed data is called an estimate. We may distinguish between a point estimator and an interval estimator. A point estimator is a procedure leading to a single numerical value for the estimate of the unknown parameter. An interval estimator is a random interval in which the true value of the parameter lies with some probability. Such a random interval is usually called a confidence interval.

A.6.1 Point Estimation

Let X be a random variable with distribution function that depends on a parameter 𝜃. The distribution may be denoted by F(x ∣ 𝜃) and can be continuous or discrete. Let X1 , X2 , … , Xn be a random sample of n observations of the variable X. Our problem is to find a statistic Y = g(X1 , X2 , … , Xn ) such that if x1 , x2 , … , xn are the observed numerical values of X1 , X2 , … , Xn , then the number y = g(x1 , x2 , … , xn ) will be a good point estimate of 𝜃.


There are a number of properties required of a good point estimator. Among these are:

• The point estimator should be unbiased. That is, the long-run average, or mean value, of the point estimator should be equal to the parameter that is estimated, E(Y) = θ.
• The point estimator should have minimum variance. Because the point estimator is a statistic, it is a random variable. This property states that the minimum variance point estimator has a variance that is smaller than that of any other estimator of the same parameter.

The estimator of θ is often denoted θ̂. This symbol is sometimes used for both the estimator and the estimate, which may be confusing.

Example A.12 Consider a binomial situation where we have n independent trials. Each trial can result in either the event E or not. The probability p = Pr(E) is the same in all trials. Let X be the number of occurrences of the event E. A natural estimator for the parameter p is then

p̂ = X/n    (A.78)

The estimator p̂ is seen to be unbiased because

E(p̂) = E(X/n) = E(X)/n = np/n = p

The variance of p̂ is

Var(p̂) = p(1 − p)/n

It can be shown that p̂ is the minimum variance estimator for p. ◽

A.6.1.1 Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a general approach to point estimation of parameters. Let X1, X2, … , Xn be n independent and identically distributed random variables with probability density function f(x ∣ θ). The parameter θ may be a single parameter or a vector of parameters. Here, we assume that θ is a single parameter. Assume that we have observed the data set x = x1, x2, … , xn. The MLE approach says that we should find the parameter value θ̂ (if it exists) with the highest chance of giving this particular data set, that is, such that

f(x1, x2, … , xn ∣ θ̂) ≥ f(x1, x2, … , xn ∣ θ)

for any other value of θ.


We introduced and discussed the likelihood function L(θ ∣ x) in Chapter 2. In this case, the likelihood function is given by

L(θ ∣ x) = ∏_{i=1}^{n} f(xi ∣ θ)    (A.79)

where x is a vector of known values and L(θ ∣ x) is a function of θ. Observe that L(θ ∣ x) is not a probability distribution. The maximum likelihood estimate θ̂ is found by maximizing L(θ ∣ x) with respect to θ. In practice, it is often easier to maximize the log likelihood function ln[L(θ ∣ x)], which is valid because the logarithmic function is monotonic. The log likelihood function is given by

ln[L(θ ∣ x)] = Σ_{i=1}^{n} ln f(xi ∣ θ)    (A.80)

The maximum likelihood estimate may now be found by solving

∂ ln[L(θ ∣ x)] / ∂θ = 0    (A.81)

and by showing that the solution in fact gives a maximum.
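As an illustration of maximum likelihood estimation, the sketch below assumes an exponential life distribution, for which the closed-form estimate is λ̂ = n / Σti, and cross-checks it with a crude grid search over the log likelihood (A.80). The failure times are illustrative.

```python
# MLE for an exponential life distribution (illustrative data).

from math import log

times = [31.0, 112.0, 45.0, 290.0, 88.0]           # observed times to failure (hours)
lam_hat = len(times) / sum(times)                   # analytical MLE

def log_likelihood(lam):
    return sum(log(lam) - lam * t for t in times)   # sum of ln f(t_i | lam)

grid = [i * 1e-5 for i in range(1, 5000)]           # crude numerical cross-check
lam_grid = max(grid, key=log_likelihood)
print(lam_hat, lam_grid)                            # both close to 0.0088 per hour
```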

A.6.2 Interval Estimation

An interval estimator of the parameter θ is an interval between two statistics that includes the true value of θ with some probability. To obtain an interval estimator of θ, we need to find two statistics θL and θU such that

Pr(θL ≤ θ ≤ θU) = 1 − ε    (A.82)

The interval 𝜃L ≤ 𝜃 ≤ 𝜃 U is called a 100(1 − 𝜀) percent confidence interval for the parameter 𝜃. The interpretation of the interval is that if we carry out repeated experiments and construct intervals as previously mentioned, then 100(1 − 𝜀) percent of them will contain the true value of 𝜃. The statistics 𝜃L and 𝜃U are called the lower and upper confidence limits, respectively, and (1 − 𝜀) is called the confidence coefficient. Example A.13 Let X be a normally distributed random variable with unknown mean 𝜈 and known variance 𝜏 2 . Let X1 , X2 , … , Xn be a random


sample of n independent observations of X. For the normal distribution, the average value

X̄ = (1/n) Σ_{i=1}^{n} Xi ∼ 𝒩(ν, τ²/n)

such that

[(X̄ − ν)/τ] √n ∼ 𝒩(0, 1)    (A.83)

A natural estimator for the unknown mean ν is

ν̂ = X̄ = (1/n) Σ_{i=1}^{n} Xi    (A.84)

This estimator is unbiased because E(ν̂) = ν, and the variance is Var(ν̂) = τ²/n. From (A.83) we get

Pr(−z_{ε/2} ≤ [(X̄ − ν)/τ] √n ≤ z_{ε/2}) = 1 − ε    (A.85)

where z_{ε/2} is the upper-tail percentile of the standard normal distribution such that the probability to the right of z_{ε/2} is ε/2. Equation (A.85) can be rewritten as

Pr(X̄ − z_{ε/2} τ/√n ≤ ν ≤ X̄ + z_{ε/2} τ/√n) = 1 − ε    (A.86)

We have thus found a 100(1 − ε) percent confidence interval for the mean ν with

Lower confidence limit: νL = X̄ − z_{ε/2} τ/√n
Upper confidence limit: νU = X̄ + z_{ε/2} τ/√n

The lower and upper confidence limits are random variables (statistics). When we have observed the numerical values of X1, X2, … , Xn, we can calculate the estimates of the confidence limits. ◽

Remark A.2 When an interval is calculated, it will either include the true value of the parameter θ, or it will not. If the experiment were repeated many times, the interval would cover the true value of θ in 100(1 − ε) percent of the cases. Thus, we would have a strong confidence that θ is covered by the interval, but it is wrong to say that there is a 100(1 − ε) percent probability that θ is included in the interval. The parameter θ is not stochastic. It has a true but unknown value. ◽
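The confidence interval of Example A.13 can be computed as follows, assuming illustrative observations, a known standard deviation τ = 5.0, and a 95% confidence level (ε = 0.05).

```python
# Confidence interval for the mean with known variance (Example A.13, illustrative data).

from statistics import NormalDist, mean

x = [102.1, 98.4, 105.0, 99.2, 101.7, 97.8, 103.3, 100.9]
tau, eps = 5.0, 0.05

x_bar = mean(x)
z = NormalDist().inv_cdf(1 - eps / 2)          # upper eps/2 percentile, about 1.96
half_width = z * tau / len(x) ** 0.5

print(x_bar - half_width, x_bar + half_width)  # estimated lower and upper confidence limits
```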


A.7 Bayesian Approach

Bayesian or subjective probability was introduced and discussed briefly in Chapter 2. We now assume that unwanted events E occur according to an HPP with rate λE. In the Bayesian approach, the analyst is assumed to have a prior belief about the value of λE. This prior belief is formulated as a probability density function π(λE) of a random variable ΛE. If the analyst has a clear and strong opinion about the value of ΛE, she will choose a peaked or "narrow" distribution. If she has only a vague opinion, she will choose a "spread-out" distribution. Which distribution to choose is not very important as such, but it might lead to very complicated mathematical formulas. It is therefore strongly recommended to choose a distribution class for the prior that is conjugate to the distribution of the evidence.

Definition A.1 (Conjugate distributions) Two distributions (a) and (b) are conjugate if they have the following property: If the prior distribution is (a) and the evidence distribution is (b), then the posterior distribution (given the evidence) is also (a) (although with different parameter values from the prior distribution). ◽

In this case, the evidence is the number NE = n of occurrences of event E that is observed during an accumulated time period of length t. The evidence has a Poisson distribution given by

Pr(NE = n ∣ λE) = [(λE t)^n / n!] e^(−λE t)    for n = 0, 1, 2, …    (A.87)

It can be shown that the conjugate to this evidence distribution is the gamma distribution. The analyst should therefore choose her prior as a gamma distribution

π(λE) = [β / Γ(α)] (βλE)^(α−1) e^(−βλE)    (A.88)

This is the probability density function of the gamma distribution with parameters α and β. The symbol Γ(α) is the gamma function of α (e.g. see Rausand et al. 2020). The gamma distribution is very flexible, and most relevant prior beliefs can be modeled by choosing the values of the parameters α and β appropriately. The (prior) mean value of ΛE is

E(ΛE) = ∫0∞ λE π(λE) dλE = α/β    (A.89)

and the standard deviation is

SD(ΛE) = √α / β    (A.90)


These two formulas can be used to determine the parameters α and β that best fit the analyst's belief about the value of λE. The prior mean value is sometimes used as a prior estimate λ̂E for λE.

The analyst's posterior probability density function when the evidence (n, t) is given can now be found by using Bayes formula:

π(λE ∣ n, t) ∝ π(λE) L(λE ∣ NE(t) = n)    (A.91)

With the "proportionality" constant k, this is written

π(λE ∣ n, t) = (1/k) π(λE) L(λE ∣ NE(t) = n)    (A.92)

For π(λE ∣ n, t) to be a true probability density function, its integral must be equal to 1:

∫0∞ π(λE ∣ n, t) dλE = (1/k) ∫0∞ π(λE) L(λE ∣ NE(t) = n) dλE = 1

The constant k is therefore

k = ∫0∞ π(λE) L(λE ∣ NE(t) = n) dλE
  = ∫0∞ [β / Γ(α)] (βλE)^(α−1) e^(−βλE) · [(λE t)^n / n!] e^(−λE t) dλE
  = [β^α t^n / (Γ(α) n!)] ∫0∞ λE^(α+n−1) e^(−(β+t)λE) dλE
  = [β^α t^n / (Γ(α) n!)] · Γ(n + α) / (β + t)^(n+α)    (A.93)

By combining (A.87), (A.88), (A.92), and (A.93), we obtain the posterior density

π(λE ∣ n, t) = [(β + t)^(α+n) / Γ(α + n)] λE^(α+n−1) e^(−(β+t)λE)    (A.94)

which is recognized as the gamma distribution with parameters (α + n) and (β + t). We have thus shown that the gamma distribution and the Poisson distribution are conjugate. A thorough discussion of Bayesian probability theory is given, for example, in Lindley (2007) and Dezfuli et al. (2009).
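The gamma–Poisson updating in (A.87)–(A.94) reduces to simple parameter arithmetic. The sketch below assumes an illustrative prior with α = 2 and β = 10 000 hours and evidence of n = 3 events observed during t = 50 000 accumulated hours.

```python
# Gamma-Poisson (conjugate) Bayesian updating of an event rate (illustrative values).

alpha, beta = 2.0, 10_000.0      # prior gamma parameters
n, t = 3, 50_000.0               # evidence: n events in accumulated time t

alpha_post = alpha + n           # posterior gamma parameters, cf. (A.94)
beta_post = beta + t

prior_mean = alpha / beta                 # prior estimate, cf. (A.89)
posterior_mean = alpha_post / beta_post
print(prior_mean, posterior_mean)         # 2.0e-4 versus about 8.3e-5 per hour
```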

A.8 Probability of Frequency Approach

The approach outlined in the following is called the probability of frequency approach by some authors (e.g. see Kaplan and Garrick 1981; Garrick 2008). Observe that the approach reconciles two interpretations of probability: (i) the occurrence of events is modeled as a classical stochastic process where the rate of the process λ is an unknown parameter, and (ii) the analyst's uncertainty about the value of λ is modeled by a subjective probability distribution.


A.8.1 Prior Distribution

Assume that we study the reliability of a new type of gas detector and that we believe that the time to failure T of the gas detector is exponentially distributed with failure rate λ. The parameter λ is unknown, but the new gas detector is similar to previous types of detectors of which we have some experience. From this experience, combined with careful examination of the new gas detector, we feel that we have some prior information about the "unknown" failure rate λ. By using subjective probabilities, this prior information can be expressed as a prior distribution of Λ, where Λ is the failure rate considered as a random variable. The prior distribution can be expressed, for example, by the prior density π(λ) as shown in Figure A.17. This approach is very flexible and can express both very detailed prior information (as a narrow, peaked density) and vague information (as a spread-out density). We may choose the form of the prior distribution such that it describes our prior knowledge. A commonly used distribution is the gamma distribution

π(λ) = [β / Γ(α)] (βλ)^(α−1) e^(−βλ)    for λ > 0    (A.95)

Observe that here we use a parameterization of the gamma distribution other than the one used in (A.67). The parameter λ has been replaced by β, and n by α. The reason for choosing the gamma distribution is that it is easy to use and at the same time very flexible. We may describe a wide variety of prior knowledge by selecting the values of the parameters α and β appropriately.

A.8.1.1 Prior Estimate

The prior distribution, as shown in Figure A.17, expresses our total knowledge about the unknown parameter λ. In some cases, it is necessary to present a single value (an estimate of λ), and this estimate is most often chosen to be the mean value of the prior distribution. When the gamma distribution (A.95) is used as prior distribution, we get

λ̂ = ∫0∞ λ π(λ) dλ = α/β    (A.96)

Figure A.17 Prior density for the failure rate.


The median of the prior distribution is sometimes used as an alternative to the mean.

A.8.2 Likelihood

The likelihood of E given the evidence D1 is written as L(E ∣ D1). Mathematically, L(E ∣ D1) is equal to Pr(D1 ∣ E), but the interpretation of the two concepts is truly different. Whereas the first expresses the likelihood that the state of nature is E when the evidence D1 is given, the second expresses the probability that the evidence is equal to D1 when the state of nature is E, and hence is given. A better way of expressing Bayes formula in (A.14) is therefore

Pr(E ∣ D1) = [1 / Pr(D1)] Pr(E) L(E ∣ D1)    (A.97)

Because Pr(D1) denotes the marginal probability of the evidence D1, it is independent of E and can therefore be considered as a "normalizing constant" in (A.97). We can therefore write

Pr(E ∣ D1) = Pr(E) L(E ∣ D1) · (normalizing constant)

which can be written as

Pr(E ∣ D1) ∝ Pr(E) L(E ∣ D1)    (A.98)

where the symbol ∝ means "proportional to." In the following, we use k to denote Pr(D1) in the "normalizing constant."

In some cases, the analyst's prior belief about the state of nature can be expressed by a continuous random variable Θ with some probability density function π(θ) for θ ≥ 0. When the analyst has a strong belief about the value of Θ, she may choose a "narrow" or "peaked" probability density function, and when she has a vague belief, she may choose a "spread-out" density to express her belief. Her posterior belief, after having studied the evidence D1, can now be expressed by her posterior probability density function:

π(θ ∣ D1) ∝ π(θ) L(θ ∣ D1)    (A.99)

where L(θ ∣ D1) is the likelihood of θ when the evidence D1 is given. By introducing a constant k, (A.99) can be written as

π(θ ∣ D1) = (1/k) π(θ) L(θ ∣ D1)

For π(θ ∣ D1) to be a probability density function, its integral must be equal to 1, such that

∫0∞ π(θ ∣ D1) dθ = (1/k) ∫0∞ π(θ) L(θ ∣ D1) dθ = 1


The constant k must therefore be

k = ∫0∞ π(θ) L(θ ∣ D1) dθ    (A.100)

The posterior density can therefore be written as (Bayes formula)

π(θ ∣ D1) = π(θ) L(θ ∣ D1) / ∫0∞ π(θ) L(θ ∣ D1) dθ    (A.101)

Example A.14 Consider a binomial situation where:

(1) We carry out n independent trials.
(2) Each trial has two possible outcomes A and A*.
(3) The probability of outcome A is Pr(A) = θ in all the n trials.

Let X be the number of trials that result in A. The probability distribution of X is then given by the binomial distribution:

Pr(X = x ∣ θ) = C(n, x) θ^x (1 − θ)^(n−x)    for x = 0, 1, … , n

Assume now that we know that the trials can be performed in two different ways (1) and (2), but which of these options is used is unknown. If option (1) is chosen, the probability of A is Pr(A) = θ1, and if option (2) is chosen, then Pr(A) = θ2. We do not know which option was used, but we know that all the n trials were performed in the same way. We may formulate this by saying that the state of nature is expressed by the random variable Θ, which has two possible values θ1 and θ2. An analyst is interested in finding the probability that option (1) was used, such that Θ = θ1. Based on previous experience, she believes that, on average, a fraction α of the trials are performed according to option (1). Her initial or prior belief about the event Θ = θ1 is therefore given by her prior probability

Pr(Θ = θ1) = α

The analyst carries out n trials and gets the evidence D1 = {X = x}. The probability of getting this result is obviously dependent on the unknown state of nature Θ, that is, whether option (1) or option (2) was used. The likelihood of Θ = θ1 when the evidence D1 is given is

L(Θ = θ1 ∣ D1) = C(n, x) θ1^x (1 − θ1)^(n−x)

Her posterior probability of Θ = θ1 is given by

Pr(Θ = θ1 ∣ D1) ∝ Pr(Θ = θ1) L(Θ = θ1 ∣ D1) ∝ α C(n, x) θ1^x (1 − θ1)^(n−x)    (A.102)


To be a probability distribution, we must have that

Pr(Θ = θ1 ∣ D1) + Pr(Θ = θ2 ∣ D1) = 1

When we introduce the proportionality constant k, it must fulfill

(1/k) [α C(n, x) θ1^x (1 − θ1)^(n−x) + (1 − α) C(n, x) θ2^x (1 − θ2)^(n−x)] = 1

The analyst's posterior belief about Θ = θ1 when the evidence D1 is given is then

Pr(Θ = θ1 ∣ D1) = α C(n, x) θ1^x (1 − θ1)^(n−x) / [α C(n, x) θ1^x (1 − θ1)^(n−x) + (1 − α) C(n, x) θ2^x (1 − θ2)^(n−x)]    (A.103)

The proportionality constant k, or the denominator of (A.103), is seen to be the marginal probability of D1. The analyst can now make a new set of n2 trials in the same way, observe the evidence D2, use formula (A.103), and get a further updated degree of her belief about Θ = θ1. ◽

The importance of realizing that a likelihood is not a probability is illustrated by Example A.15.

Example A.15 Consider a component with time to failure T that is assumed to be exponentially distributed with constant failure rate λ. The failure rate is not observable, but λ describes the "state of nature." The probability density function of T when λ is given is

f(t ∣ λ) = λe^(−λt)

The likelihood function of λ when the time to failure is observed to be t is

L(λ ∣ t) = λe^(−λt)

If L(λ ∣ t) were a probability density, its integral should be equal to 1, but here

∫0∞ L(λ ∣ t) dλ = ∫0∞ λe^(−λt) dλ = 1/t²

Because the integral is not always equal to 1, we conclude that L(𝜆 ∣ t) is not a probability density function. ◽ A.8.3

Posterior Analysis

We again consider the gas detectors introduced previously.


A.8.3.1 Life Model

The probability density function of the time to failure, T, of a gas detector when we know the failure rate λ may be written as

f(t ∣ λ) = λe^(−λt)    (A.104)

Observe that we now include the parameter λ in f(t ∣ λ) to make it clear that the probability density function is also a function of λ. To obtain information about λ, we carry out a number of experiments and observe the times t1, t2, … , tn1 to failure of n1 gas detectors. We assume that the corresponding n1 times to failure are independent variables. The joint probability density function of these variables is then

f(t1, t2, … , tn1 ∣ λ) = ∏_{i=1}^{n1} λe^(−λti) = λ^(n1) e^(−λ Σ_{i=1}^{n1} ti)    (A.105)

The random variables T1, T2, … , Tn1 are all observable, meaning that we assign numerical values to each of them when the experiment is carried out. The parameter λ is not observable.

A.8.3.2 Posterior Distribution

We can now update our prior knowledge by using the information gained from the data d1 = {t1, t2, … , tn1}. The prior distribution (A.95) and the distribution of the data d1 (A.105) can be combined by using Bayes formula (A.101) to give the posterior distribution

π(λ ∣ d1) = f(d1 ∣ λ) π(λ) / f(d1)    (A.106)

By using the prior distribution (A.95) and the distribution for the data (A.105), we get, after some calculation,

π(λ ∣ d1) = [(β + Σ_{i=1}^{n1} ti)^(α+n1) / Γ(α + n1)] λ^(α+n1−1) e^(−(β + Σ_{i=1}^{n1} ti)λ)    (A.107)

This distribution is seen to be a gamma distribution with parameters

α1 = α + n1
β1 = β + Σ_{i=1}^{n1} ti

The posterior distribution 𝜋(𝜆 ∣ d1 ) expresses all our current knowledge about the parameter 𝜆, based on our prior knowledge and the knowledge gained through observing the data d1 . By choosing the exponential distribution for the data and the gamma distribution for the parameter 𝜆, we observe that the posterior distribution belongs


to the same class of distributions as the prior. Exponential and gamma distributions are therefore said to be conjugate distributions, as introduced in Chapter 2.

A.8.3.3 Posterior Estimate

In the same way as for the prior situation, we can now find a posterior estimate for λ:

λ̂1 = α1/β1 = (α + n1) / (β + Σ_{i=1}^{n1} ti)    (A.108)
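The posterior estimate (A.108) can be illustrated numerically as follows, assuming an illustrative gamma prior (α = 2, β = 10 000 hours) and five illustrative observed times to failure; none of the numbers are from the book.

```python
# Posterior estimate (A.108) for the gas-detector failure rate (illustrative values).

alpha, beta = 2.0, 10_000.0                               # prior gamma parameters
times = [3_100.0, 12_000.0, 4_500.0, 8_200.0, 6_700.0]    # observed failure times (hours)

alpha_1 = alpha + len(times)
beta_1 = beta + sum(times)

prior_estimate = alpha / beta
posterior_estimate = alpha_1 / beta_1
print(prior_estimate, posterior_estimate)
```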

If we are content with this estimate, we may terminate the analysis and use this estimate. If not, we may observe a new set of n2 items and get the data d2. We use the posterior distribution based on the first data set (n1, d1) as our new prior distribution and update this prior with the new data set (n2, d2) to get a new posterior distribution. The new posterior distribution is again a gamma distribution, and we can find a new posterior estimate by the same procedure as mentioned previously. This approach is often referred to as Bayesian updating.

Observe that this estimate can be used even if no events E (i.e. n = 0) have been observed in the time interval of length t. Also observe that as the amount of evidence increases (i.e. n and t increase), the importance of the values of α and β is reduced. As a limit, the estimate (A.108) is equal to the estimate obtained by the frequentist approach in Section A.6.

A.8.3.4 Credibility Intervals

A credibility interval is the Bayesian analogue to a confidence interval. A credibility interval for λ at level (1 − ε) is an interval (a(d), b(d)) such that the conditional probability, given the data d, satisfies

Pr(a(d) < λ < b(d) ∣ d) = ∫_{a(d)}^{b(d)} f_{Λ∣d}(λ ∣ d) dλ = 1 − ε    (A.109)

Then, the interval (a(d), b(d)) is an interval estimate of 𝜆, in the sense that the conditional probability of 𝜆 belonging to the interval, given the data d, is equal to 1 − 𝜀.

References

Dezfuli, H., Kelly, D., Smith, C. et al. (2009). Bayesian Inference for NASA Probabilistic Risk and Reliability Analysis. Tech. Rep. NASA/SP-2009-569. Washington, DC: U.S. National Aeronautics and Space Administration.
Garrick, B.J. (2008). Quantifying and Controlling Catastrophic Risks. San Diego, CA: Academic Press.


Kaplan, S. and Garrick, B.J. (1981). On the quantitative definition of risk. Risk Analysis 1: 11–27.
Lindley, D.V. (2007). Understanding Uncertainty. Hoboken, NJ: Wiley.
Rausand, M., Høyland, A., and Barros, A. (2020). System Reliability Theory: Models, Statistical Methods, and Applications, 3e. Hoboken, NJ: Wiley.
Ross, S.M. (2004). Introduction to Probability and Statistics for Engineers and Scientists. Amsterdam: Elsevier.
Ross, S.M. (2007). Introduction to Probability Models. Amsterdam: Elsevier.


Appendix B Acronyms

ABS ACR AEMA AHP AIChE AIR ALARA ALARP API APR BDD BFR BORA BPCS BSEE CAST CCCG CCF CCFDB CCP CCPS CDF CHAZOP COTS CPC CPS CPT CREAM CSM CST

anti-lock braking system activity consequence risk action error mode analysis analytical hierarchy process American Institute of Chemical Engineers average individual risk as low as reasonably achievable as low as reasonably practicable American Petroleum Institute activity performance risk binary decision diagram binomial failure rate (model) barrier and operational risk analysis basic process control system Bureau of Safety and Environmental Enforcement (U.S.) causal analysis based on STAMP common-cause component group common-cause failure common-cause failure data base critical control point Center for Chemical Process Safety core damage frequency computer hazard and operability commercial off the shelf common performance condition cyber-physical system conditional probability table cognitive reliability and error analysis method common safety methods common safety target



CVSS DoD DoE DPM DHS ECCAIRS EFBA EPC ERA ESD ESP ESReDA ETA ETBA EU EUC FAR FEED FMEA FMECA FMVEA FRACAS FSA FTA GAMAB GMO HACCP HAZID HAZOP HCL HEART HEMP HEP HFACS HRA HRO HSE HTA IAEA ICAF ICAO ICS

common vulnerability scoring system U.S. Department of Defense U.S. Department of Energy deaths per million U.S. Department of Homeland Security European Co-ordination Centre for Accident and Incident Reporting System energy flow/barrier analysis error-producing condition European Railway Agency emergency shutdown electronic stability program European Safety, Reliability and Data Association event tree analysis energy trace/barrier analysis European Union equipment under control fatal accident rate front-end engineering and design failure modes and effects analysis failure modes, effects, and criticality analysis failure modes, vulnerabilities, and effects analysis failure reporting, analysis, and corrective action system formal safety assessment fault tree analysis globalement au moins aussi bon (i.e. globally at least as good) genetically modified organism hazard analysis and critical control point hazard identification hazard and operability hybrid causal logic human error assessment and reduction technique hazard and effects management process human error probability human factor analysis and classification system human reliability analysis high reliability organization Health and Safety Executive (UK) hierarchical task analysis International Atomic Energy Agency implied cost of averting a fatality International Civil Aviation Organization industrial control system


IEC IEV IEEE IMO IPL IR IRPA IRGC ISIR ISO ISRS JRC JSA LOPA LRF LSIR LTA LTI LTIF LWF MBF MEM MGL MLD MMD MOCUS MORT MSF MTO MTTF MTTR NASA NCAF NEA NIST NRC NS NTNU OECD OREDA OSHA PEF

International Electrotechnical Commission International Electrotechnical Vocabulary Institute of Electrical and Electronic Engineers International Maritime Organization independent protection layer individual risk individual risk per annum International Risk Governance Council individual-specific individual risk International Organization for Standardization International Safety Rating System Joint Research Centre (EU) job safety analysis layer of protection analysis large release frequency location-specific individual risk less than adequate lost-time injury lost-time injury frequency lost workdays frequency multiple 𝛽-factor (model) minimum endogenous mortality multiple Greek letter (model) master logic diagram man-made disaster method for obtaining cut sets management oversight and risk tree main safety function man, technology, and organization mean time to failure mean time to repair National Aeronautics and Space Administration net cost of averting a fatality Nuclear Energy Agency National Institute of Standards and Technology (U.S. Department of Commerce) Nuclear Regulatory Commission (U.S.) Norwegian standard Norwegian University of Science and Technology Organisation for Economic Co-operation and Development offshore and onshore reliability data U.S. Occupational Safety and Health Administration potential equivalent fatality


PFD PFH PHA PHA P&ID PIF PLC PLL PM PRA PSA PSAN PSF QRA RAC RAMS RAW RBDM RIDDOR RIDM RIF RLE RPN RRW RSSB SFAIRP SHE SHERPA SIF SIL SIS SJA SLIM SRF SRS STAMP STEP STPA SWIFT THERP TTA UCA

probability of failure on demand probability of dangerous failure per hour preliminary hazard analysis process hazard analysis piping and instrumentation diagrams performance-influencing factor programmable logic controller potential loss of life preventive maintenance probabilistic risk assessment probabilistic safety assessment Petroleum Safety Authority Norway performance-shaping factor quantitative risk assessment risk acceptance criteria reliability, availability, maintainability, and safety risk achievement worth risk-based decision-making Reporting of Injuries, Diseases and Dangerous Occurrences Regulations risk-informed decision-making risk-influencing factor reduction in life expectancy risk priority number risk reduction worth Rail Safety and Standards Board (UK) so far as is reasonably practicable safety, health, and environment systematic human error reduction and prediction approach safety instrumented function safety integrity level safety instrumented system safe job analysis success likelihood index methodology small release frequency safety requirement specification system-theoretic accident model and processes sequentially timed events plotting system-theoretic process analysis structured what–if technique technique for human error rate prediction tabular task analysis unsafe control action

UK United Kingdom
UKAEA United Kingdom Atomic Energy Authority
UN United Nations
US United States
UPM unified partial method
VSL value of a statistical life
WHO World Health Organization

Author Index

Authors who are explicitly referred to are listed. Additional authors may be “hidden” in technical reports issued by organizations. We have tried to provide full names, but for many authors, full names have not been traceable.

a

Abbasi, S.A. 21, 183 Abrahamsson, Marcus 648, 657, 660 Ainsworth, Les K. 537, 538, 545 Aitken, Robert J. 650 Albrechtsen, Eirik 96 Ale, Ben J.M. 133, 136 Alizadeh, S.S. 194 Amari, Suprasad V. 369 Amendola, Aniello 648 Andersen, Henrik Reif 369 Anderson, Elizabeth L. 650 Andow, Peter 304 Andrews, John D. 413, 428 Aneziris, Olga N. 324 Ang, M.L. 238 Antonsen, Stian 535, 536 Apostolakis, George E. 178 Arvai, Joseph 172 Asbjørnslett, Bjørn Egil 239 Ashenfelter, Orley 104 Attwood, Daryl 193 Aven, Terje 20, 115, 176, 241, 428, 502, 503, 506–508, 511, 600, 662 Ayyub, Bilal M. 253

b

Baber, Chris 541, 544, 547, 550, 568, 569 Ball, David J. 143, 145 Barlow, Richard E. 357 Barros, Anne 324, 351, 363, 385, 395, 411, 423, 449, 486, 662, 704, 705, 722, 732 Bartell, S.M. 655 Başar, Ersan 532 Basra, Gurpreet 249 Baybutt, Paul 148, 611, 677 Bayes, Thomas 41 Bazovsky, Igor 7 Bedford, Tim 456 Bérenguer, Christophe 324 Bergman, Bo 342 Bernstein, Peter L. 4, 16, 17 Bertalanffy, Ludwig von 88 Bier, Vicki 37, 592 Bird, Frank E. 197 Birnbaum, Zygmunt Wilhelm 362 Blackman, Harold S. 250 Borysiewicz, M.A. 430, 431, 676 Borysiewicz, M.J. 430, 431, 676

Bot, Pierre Le 526 Breinholt, Christian 688 Brissaud, Florent 324 Broomfield, Eamon 304 Burmistrov, Dmitriy 655 Bye, Roar 600

c

Cameron, R.F. 101 Campbell, David J. 448 Carlson, Lennart 449 Catmur, James R. 304 Charniak, Eugene 370 Charpentier, Dominique 324 Childs, Joseph A. 449 Chudleigh, Morris F. 304 Chung, Paul W.H. 304 Clemens, Pat L. 491 Colbert, Edward 615 Consolini, Paula M. 224 Contini, Serafino 648 Cooke, Roger M. 253 Corneliussen, Kjell 480, 481 Cox, A.W. 238 Cox, L. Anthony Jr. 148 Crawley, Frank 677 Cross, Robert 679 Cullen, W. Douglas Lord 243

d

Dakwat, Alheri Longji 306 Debray, Bruno 678 Deremer, R. Kenneth 457 Dezfuli, Homayoon 655, 682, 733 Diamantidis, Dimitris 158 Ditlevsen, Ove 650 Duijm, Nijs Jan 148, 488, 490

e

Edwards, D.W. 304, 305 Edwin, Nathaniel John 580, 598, 600 Ehrke, Karl-Christian 688 Embrey, David E. 548 Endresen, Øyvind 145, 158 Ericson, Clifton A. 426 Eusgeld, Irene 122 Evans, Andrew W. 146 Evans, M.G.K. 455, 456 Everett, Christopher 655, 682

f

Farmer, Frank Reginald 142 Fischhoff, Baruch 100, 101, 155, 172 Fitch, Scott C. 621 Fleming, Karl N. 7, 452, 457 Floyd, Peter 143, 145 Forseth, Ulla 36, 684 Foster, Harold D. 53 Frappier, Gerry 178 Frederickson, Anton A. 497 Freiling, Felix C. 122 Fussell, Jerry B. 362

g

Galyean, William 733 Garanty, I. 430, 431, 676 Garnett, K. 114 Garrick, B. John 17, 19, 43, 82, 733 Germain, George L. 197 Gertman, David I. 250 Gibson, J.J. 192, 193 Gran, Bjørn Axel 600 Griffor, Edward 615 Groeneweg, Jop 205 Grøtan, Tor Olav 96 Guldenmund, Frank 194 Guttmann, H. 250, 552–554, 559, 671 Guttormsen, Geir 54, 217, 221, 224, 516

h

Haddon, William 193–195, 212, 515 Hale, Andrew 587, 696 Hambly, E.C. 138 Hammer, Willie 273 Hammonds, J. S. 655

Hassel, Martin 239 Hattis, Dale 650 Hauge, Stein 449, 461 Haugen, Stein 188, 502, 503, 506, 507, 511, 512, 514, 580, 581, 584, 587, 598, 600, 696 Heinrich, Herbert William 6, 196, 200 Heming, B.H.J. 696 Hendershot Dennis C. 675 Herbst, Timothy D. 175 Herrera, Ivonne A. 36, 54, 221, 224, 516, 585, 684 Hirst, I. L. 125 Hoem, Å. Snilstvedt 461 Hoffman, F.O. 655 Hokstad, Per R. 36, 461, 480, 481, 684 Holand, Per 139 Hole, Lars Petter 239 Hollnagel, Erik 54, 188, 192, 193, 213, 471, 476, 477, 525, 526, 567, 569, 570 Holmstrøm, Sture 449 Hopkins, Andrew 4, 591 Hou, Yunhe 442 Huang, Yu-Hsing 220 Hudson, Patrick T.W. 204 Humphreys, R.A. 456 Håbrekke, Solfrid 36, 461, 684 Høyland, Arnljot 351, 363, 385, 395, 411, 423, 449, 486, 662, 704, 705, 722, 732

i

Ishikawa, Kaoru 342

j

Jaynes, E.T. 43 Jenkins, Daniel P. 541, 544, 547, 550, 568, 569 Jin, Hui 479 Johansen, Inger Lise 20, 112, 115 Johnson, William G. 210, 213, 490 Johnston, B.D. 450 Jung, Seungho 655 Jung, Wondea 536

k

Kaplan, Stan 15, 17, 19, 38, 51, 733 Kasperson, Roger E. 172 Kaufer, Barry 449 Kazemi, Reza 696 Keeney, Ralph L. 100, 101, 155 Kelly, Dana 733 Kelly, Terrence K. 437 Khan, Faisal I. 21, 183, 193, 596, 598 Kim, Dongwoon 21 Kim, Hyungju 514 Kim, Jae W. 536 Kim, Jiyong 21 Kirwan, Barry 249, 250, 537, 538, 545, 548, 554, 561, 566 Kiureghian, Armen Der 650 Kjærulff, Uffe B. 370 Kjellén, Urban 190 Klefsjö, Bengt 342 Klein, Gary 582 Kletz, Trevor 4, 295, 304, 677 Klinke, Andreas 18, 19, 113, 185 Knight, Frank Hyneman 6 Kongsvik, Trond 512, 600 Kontogiannis, Tom 201 Kott, Alexander 615 Kozubal, A. 430, 431, 676 Kråkenes, Tony 36, 684 Kunreuther, Howard C. 37, 592 Kuzma, Dave 613 Kvaløy, Jan Terje 241

l

La Porte, Todd R. 224 Laheij, Gerald M.H. 133, 136 Lees, Frank P. 238 Leeuwen, N.D. van 696 Leopoulos, Vrassidas 201

Leveson, Nancy 38, 54, 96, 183, 186, 188, 226, 227, 306, 307, 309, 312–314, 314, 696 Lichtenstein, Sarah 100, 101, 155 Lindley, Dennis V. 43, 733 Linkov, Igor 655 Littlewood, Bev 442 Liu, Yiliu 479 Lundberg, Jonas 188 Lundteigen, Mary Ann 443, 449, 461, 479 Lupton, Deborah 20

m Ma, Zhendong 626 Macdonald, Dave 690 MacKenzie, Cameron A. 123 Macza, Murray 4 Madsen, Anders L. 370 Maggio, Caspare 655, 682 Mahoney, Kelly 607 Mannan, Sam 4, 73, 431 March, James G. 582 Markert, Frank 488, 490 Marmaras, Nikos 201 Maynard, Andrew D. 650 Miguel, Angela Ruth 552, 556 Miller, A.G. 449 Modarres, Mohammad 263, 324, 363, 426, 660 Mohaghegh, Zahra 696 Moon, Il 21 Morgan, M. Granger 172 Moshashaei, P. 194 Mosleh, Ali 428, 449, 457, 459, 600, 696 Muckin, Michael 621 Murray, Stephen 1

n

Nathwani, J.S. 105 Nielsen, Dan S. 426 Niwa, Yuji 210 Nordland, Odd 112 Nyheim, Ole Magnus 587, 600 Nøkland, Thor Erik 662

o Okoh, Peter 188 Okstad, Eivind H. 512, 600 Onshus, Tor 449

p Paltrinieri, Nicola 580, 596, 598 Pandey, M.D. 105 Papanikolaou, Apostolos 688, 689 Papazoglou, Iannis A. 324 Parry, Gareth W. 448, 455, 456 Parsons, D.J. 114 Pasman, Hans J. 655 Paula, Henrique M. 448 Pearce, Dick 690 Pearl, Judea 339 Peerenboom, James P. 437 Perrow, Charles 38, 45, 185, 220, 222 Pesme, Helene 526 Peters, Barbara J. 536 Peters, George A. 536 Phimister, James R. 37, 592 Pidgeon, Nick F. 213 Post, Jos G. 133, 136 Prem, Kathrine P. 655 Preston, Malcolm 677 Proschan, Frank 357 Pukite, Jan 385 Pukite, Paul 385

q Qi, Junjian 442 Qureshi, Zahid H. 183, 193, 215–218

r Raney, Glenn 450 Rasmuson, Dale M. 448, 450

Rasmussen, Jens 184, 199, 208, 214, 215, 217, 218, 220, 310, 530 Rausand, Marvin xxxi, 33, 351, 363, 385, 395, 411, 423, 443, 449, 461, 479, 481, 486, 487, 616, 662, 704, 705, 722, 732 Reason, James 29, 37, 185, 201, 203, 204, 526, 530, 563, 593 Redmill, Felix J. 304 Renn, Ortwin 18, 19, 113, 185 Reusser, Ralf 122 Ridley, John 690 Ridley, L.M. 413, 428 Rinaldi, Steven M. 437 Ringstad, Arne Jarl 512 Rivers, Louie 172 Rodenburg, F.G. Th. 696 Rogers, William J. 655 Rokseth, Børge 307 Rollenhagen, Carl 188 Rosa, Eugene A. 19, 47 Rosness, Ragnar 54, 217, 221, 224, 516, 537 Ross, Sheldon M. 384, 385, 389, 390, 701 Rouvroye, Jan 481 Ruijter, Alex de 194 Røed, Willy 426, 428, 600

s

Saaty, Thomas L. 254 Sagan, Scott D. 226 Salmon, Paul M. 538, 541, 544, 547, 550, 566, 568, 569 Salvi, Olivier 678 Sames, Pierre C. 688 Sammarco, John J. 221–224 Schäbe, Hendrik 112 Schmittner, Christoph 626 Schneier, Bruce 626 Schönbeck, Martin 481 Schubach, Simon 304

Seljelid, Jorunn 502, 503, 506, 507, 511, 512, 587, 600 Seong, Poong Hyun 559 Shappell, Scott A. 532 Shen-Orr, Chaim 304 Shepherd, C.H. 598 Shorrock, Stephen 548 Siu, Nathan O. 459, 696 Skjong, Rolf 145, 158, 688 Sklet, Snorre 188, 197, 201, 208, 209, 502–508, 510–512 Skogdalen, Jon Espen 512 Smit, Klaas 696 Smith, Curtis 733 Smith, Ed 548 Smith, Paul 626 Sørli, F. 512 Sperber, William H. 690 Spurgin, Anthony J. 526 Stack, R.J. 477 Stamatelatos, Michael 655, 682 Stanton, Neville A. 538, 541, 544, 547, 566, 568, 569 Steen, Sunniva Anette 512 Steiro, Trygve 54, 221, 224, 516 Stier, Richard F. 690 Størseth, Fred 96, 217 Strang, Tom 688 Suchman, Edward A. 35 Summers, Angela E. 450, 497, 500 Sun, Kai 442 Sun, Wei 442 Sutcliffe, Kathleen M. 225 Svedung, Inge 199, 215, 218, 220, 310 Swain, Alan D. 250, 552–554, 559, 671

t Thomas, John P. 306, 309, 313, 314 Tinmannsvik, Ranveig K. 54, 221, 224, 516 Tronstad, Lars 512

Turner, Barry A. 213 Tyler, Brian 677

u Uğurlu, Özkan 532 Utne, Ingrid Bouwer 307, 514, 585

v Vanem, Erik 145, 157, 158 Vassalos, Dracos 688 Vatn, Jørn 600 Vaughan, Diane 4 Vedros, Kurt 733 Veitch, Brian 193 Verlander, Neville Q. 146 Vesely, William E. 362, 456 Villani, Emilia 307 Vincoli, Jeffrey W. 210 Vinnem, Jan Erik 115, 307, 426, 428, 502, 503, 506–508, 511, 600

w Wagenaar, Willem A. 204 Wagnild, B.R. 512 Wahlstrøm, Bjørn 673 Walker, Guy H. 538, 541, 544, 547, 550, 566, 568, 569 Wang, Chengdong 428 Weick, Karl E. 225 Weiss, Joseph 620 Whalley, Susan 547

Wiegmann, Douglas A. 532 Willers, A. 101 Williams, J.C. 563, 566 Winkler, Robert L. 650 Wittola, Thomas 688 Woods, David D. 54 Wreathall, John 455, 456 Wårø, Irene 512

x Xing, Liudong 369

y Yang, Soon Ha 304, 305 Yang, Xiaole 655 Yang, Xue 580, 581, 584, 696 Yildirim, Umut 532 Yosie, Terry F. 175 Youngblood, Robert 655, 679, 682

z Zackmann, Karin 4 Zhan, Qingjian 532 Zhao, Bobo 532 Zheng, Wei 532 Ziomas, Ioannis C. 648 Zitrou, Athena 456

ø Øien, Knut 449, 585

Subject Index

A number of terms are explicitly defined in the book. These terms are indicated by a boldfaced page number.

a Acceptable risk 100, 103 Acceptable risk level 103 Accident 35 organizational 37, 185 prevention 183, 194, 203, 212 proneness 189 Accidental risk assessment methodology for industries (ARAMIS) 21, 471, 678 Accident precursor 592 Accident scenario 20, 26, 204, 401, 402 reference 21 worst-case 21, 22 worst credible 22 AcciMap 217 Action error mode analysis (AEMA) 544, 550 worksheet 546 Active failure 30 Acts of God 189 AIR 127, 157

ALARA 111 ALARP 50, 103, 106–108, 111, 112, 114, 154, 168, 634 Alexander L. Kielland accident 8, 138, 678 α-factor model 459 American Petroleum Institute (API) 513, 679 Analysis 94 Annual fatality rate (AFR) 125 As good as new 722 As low as reasonably achievable see ALARA As low as reasonably practicable see ALARP Asset 45, 46 A technique for human error analysis (ATHEANA) 571 Attack 612 cyber 612 physical 612 Attack path 612, 622 Attack-threat matrix 621 Attack tree 626

Automatic train stop (ATS) 478 Average individual risk see AIR Avoidable risk 100

b Barrier 49, 403 active 469 passive 469 proactive 468, 478 reactive 468, 478 Barrier activation 472 Barrier and operational risk analysis see BORA Barrier block diagram 504 Barrier diagram 490 Barrier function 466 Barrier management 474 Barrier system 467 Basic process control system (BPCS) 484 Basic risk factor 205 Bayes formula 41, 710, 733 Bayesian approach 39, 41, 43 Bayesian network 370, 429, 507 Bayesian updating 739 Bayes, Thomas 41 BDD 369, 413 Best available technology (BAT) 106 β-factor model 452 Bhopal accident 48, 81, 674 Binary decision diagram see BDD Binomial failure rate model (BFR) 456 Bohr, Niels 650 Boolean algebra 359 BORA 429, 502 Bow-tie model 26, 194 Bravo blowout accident 8, 678 Broadly acceptable risk 103

c Canvey Island study 677 Cascading failure 186, 441 Cause and effect diagram 341

Cause–consequence analysis 426 CCF 348, 413, 442, 671 Center for Chemical Process Safety (CCPS) 99, 127, 167, 242, 316, 345, 364, 418, 515, 674, 676 Central limit theorem 727 Challenger accident 681 Change analysis 208, 324–327 Chapman–Kolmogorov equation 387 Chernobyl accident 48, 535 Clapham Junction accident 685 Cognitive reliability and error analysis method see CREAM Commercial off the shelf (COTS) 616, 618, 620 Common-cause component group (CCCG) 443 Common-cause failure see CCF Common safety target (CST) 99 Common vulnerability scoring system see CVSS Complexity 96 Complex system 95, 96, 221 Complicated system 96 Component failure accident 38, 186 Computer hazard and operability (CHAZOP) 303–304 Conditional independency 374 Conditional probability table see CPT Confidence interval 728, 730, 731 Conjugate distributions 732 Consequence 44 Consequence analysis 415 Consequence spectrum 46 Correlation coefficient 712 Cost–benefit analysis 110 Coupling factor 449 CPT 375, 383 CREAM 553, 567 Credibility interval 739 Critical infrastructure 627, 692 CVSS 612, 624 calculator 612

Subject Index

Cyberattack 606 Cyber FMECA 623, 626 Cyber HAZOP 623, 626 Cyber PHA 626 Cyber-physical system (CPSs) 626 Cyberspace 605 Cyber system 605 Cyber threat 609 hacking 609 identity theft 609 malware 609 phishing 609 ransomware 609 spoofing 609 Stuxnet 609 virus 610 worm 610

607,

d Data dossier 254 Deaths per million (DPM) 131 Decision-making 176 Deepwater Horizon accident see Macondo accident Defense-in-depth 670, 670 Degree of belief 40 Department of Homeland Security (DHS) 167, 168, 607, 617, 623, 693 Dependency 413, 437 Deterministic cause 190 Deterministic decision-making 177 Deterministic system 88 Diagnostic coverage 479 Diagnostic testing 479 Distribution beta 726 binomial 718 exponential 721 gamma 726, 732, 734 Gaussian 724 geometric 719

joint 717 lognormal 555 marginal 717 normal 724 Poisson 720, 732 posterior 733, 735, 738 prior 734 uniform 726 Weibull 722 Distribution function 711 DNV GL 199, 242, 637 Dose–response 12 Dynamic risk analysis 580

e EFBA 490 Einstein, Albert 650 Electric Power Research Institute (EPRI) 671 eMARS 254 Emergent property 97 EN 50126 1999 112 Enabling events and conditions 28 End event 27, 408 Endogenous mortality 112 Energy and barrier model 193 Energy flow/barrier analysis see EFBA ENISA 610, 611 Environmental risk 158, 679, 692 Equipment under control (EUC) 478 Error-forcing context 571 Error-producing condition (EPC) 565 Error recovery 559 Escalation 429 Eschede accident 685 ETA 402, 403, 410 workflow 419 European Maritime Safety Agency (EMSA) 687 Event 24, 702 complementary 702, 707 sequence diagram 426 Event tree analysis see ETA

755

756

Subject Index

Exida 246 Explicit modeling 449

f Fail-safe 482 Failure 30 active 29 hidden 482 random hardware 480 systematic 481 Failure cause 32 Failure classification 32, 480 Failure mechanism 32 Failure mode 31, 478, 482 Failure modes, effects, and criticality analysis see FMECA Failure modes, vulnerabilities, and effects analysis (FMVEA) 623 Failure rate 244 Failure reporting, analysis, and corrective action system (FRACAS) 244 Fallible decision 202 FAR 136, 137, 157, 684 Farmer curve 142 Fatal accident rate see FAR Fatality 46 Fault 30 Fault tree 406, 414 Fault tree analysis see FTA Fishbone diagram 341 Flixborough accident 7, 674, 677 FMECA 287–295 FN curve 142, 157 Formal safety assessment (FSA) 687, 688 Frequency 44 Front-end engineering and design (FEED) 632, 637 FTA 344, 444, 450, 683 Fukushima accident 462 Functional breakdown structure 89

g Genetically modified organism (GMO) 114 Globalement au moins aussi bon (GAMAB) 111, 112 Government industry data exchange program (GIDEP) 244 Group risk 124

h Haddon’s 4Es 195 Haddon’s matrix 195 Haddon’s model 194 Haddon’s strategies 515 Hammurabi’s law 179 Harm 44 temporary 46 Hazard 17, 22 behavioral 260 classification 260 endogenous 260 exogenous 260 generic list 34, 261 natural 260 organizational 260 social 260 technological 260 Hazard–barrier matrix 487 Hazard identification see HAZID Hazard log 303, 327–331 Hazardous event 24, 47, 199, 402 HAZID 259, 267, 275–276, 596, 639 HAZOP 295–306, 494, 676 HEART 552, 563 Health and Safety Executive (UK) see HSE UK Heinrich’s domino model 196 HEP 529 assessed 565 basic 556 nominal 554, 565 Herald of Free Enterprise accident 203

Subject Index

Hierarchical task analysis (HTA) 538 High-reliability organization see HRO Homogeneous Poisson process (HPP) 244, 732 HRA 526 HRO 224–226, 514 HSE UK 103, 108, 113, 144, 148, 167, 476, 531, 535, 552, 645, 688 Human error 526 classification 530 error of commission 532 error of omission 532 mode 529 Human error assessment and reduction technique see HEART Human error identification (HEI) 543 Human error probability see HEP Human error reduction and prediction approach see SHERPA Human HAZOP 545 Human reliability 526 Human reliability analysis see HRA Hybrid causal logic (HCL) 402, 428

i IEC 17776 512, 514 IEC 60601 484 IEC 61025 345 IEC 61165 385 IEC 61508 246, 451, 454, 456, 483, 484, 616 IEC 61509 480 IEC 61511 484, 497, 502 IEC 61513 484 IEC 61882 295 IEC 62061 456, 484 IEC 62278 686 IEC 62443 608, 626 ILCI model 197 Implicit modeling 449 Implied cost of averting a fatality (ICAF) 105 Importance metric 362

Incident 36 Independence 717 Independent protection layer see IPL Indicator 585 coverage 590 lagging 587 leading 586 Individual accident 37 Individual risk 123, 127 Individual-specific individual risk see ISIR Industrial control system (ICS) 615 Industrial safety system 616 Infrastructure 7 Inherently safer design (ISD) 514 Initiating event 24 Integrated control and safety system (ICSS) 616 Interactive complexity 221 Interdependency 439 International Atomic Energy Agency (IAEA) 66, 80, 240, 672 International Civil Aviation Organization (ICAO) 167, 242, 535 International Electrotechnical Commission see IEC International Maritime Organization (IMO) 105, 264, 526, 687 International Organization for Standardization see ISO Internet of things 627 IPL 494 Ishikawa diagram 341 ISIR 134 ISO 9000 213 ISO 12100 44, 70, 81, 91, 237, 264, 285, 513, 689 ISO 14224 247 ISO 17776 70, 264, 632, 679 ISO 22300 608 ISO 26262 484 ISO 27005 608

757

758

Subject Index

ISO 31000 19, 167 ISO 31010 259 ISO 45001 167, 175 ISO Guide 73 44

j Job hazard analysis (JHA) 278 Job safety analysis (JSA) 278–287, 585

l Ladbroke Grove accident 685 Land-use planning 133, 135, 136 Lapse 531 Latent condition 29, 30, 202 Law of large numbers 727 Layer of protection analysis see LOPA Likelihood 17, 38, 42, 735 Linear system 221 Living PRA 674 Location-specific individual risk see LSIR LOPA 493 worksheet 498 Loss 48 mean 48 spectrum 47 Loss causation model 197 Lost-time injury see LTI Lost workdays frequency (LWF) 141 LSIR 133 LTI 140

m Macondo accident 240, 243, 678 Main safety function (MSF) 102, 147 Man, technology, and organization see MTO Major accident 4, 35, 185, 188 Major accident event (MAE) 680 Major accident reporting system see eMARS Management of change 174

Management oversight and risk tree see MORT Man-made disaster (MMD) 213 Markov diagram 384 Markov method 384 Master logic diagram (MLD) 322, 324 Maximum likelihood 729 Mean time to failure see MTTF Mean value 711 Median 717 MEM 112 Methode d’evaluation de la realisations des missions operateur pour la surete (MERMOS) 572 Method for obtaining cut sets (MOCUS) 352 MIL-HDBK-217F 245, 248, 252, 454, 556 MIL-STD-882 267, 273, 277, 634, 668 MIL-STD-1629 668 MIL-STD-2155 244 Minimal cut set 351 Minimum endogenous mortality see MEM Mistake 531 Model 92 black box 92 uniform 39 Modeling instantaneous risk for major accident prevention (MIRMAP) 598 Monte Carlo simulation 361, 395, 659 MORT 210 chart 211 MTO-analysis 208 MTO-diagram 209 MTTF 360, 716 Multiple β-factor model 461 Multiple Greek letter (MGL) model 457 Multiplicity of failure 443 MV Sewol accident 514

Subject Index

n National aeronautic and space administration (NASA) 178, 443, 459, 461, 592, 632, 634, 681, 733 National Institute of Standards and Technology see NIST Newtonian–Cartesian paradigm 93–95 NIST SP 800-12 607 NIST SP 800-30 608, 618 NIST SP 800-82 608, 615 Normal accident 38, 185, 220 Normal accident theory 220 NORSOK D-010 490 NORSOK S-001 476, 513 NORSOK Z-013 634, 680 Norwegian standard see NS NS 5814 100, 238 Nuclear Energy Agency (NEA) 526, 672 Nuclear Regulatory Commission (NRC) 246, 249, 671, 672 NUREG 7, 177, 178, 246, 247, 345, 439, 443, 454, 455, 526, 570–572, 671, 672

o Occupational Safety and Health Administration (OSHA) 267, 674, 676 Offshore and onshore reliability data (OREDA) 237, 246, 454, 507 Operating context 91 Operational risk analysis 580 Organizational accident 37, 185 Organization, man, technology see Risk OMT

p Parallel structure 705 PEF 146

Performance-based regime 179 Performance-influencing factor see PIF Performance-shaping factor see PSF PIF 535 Piper Alpha accident 81, 138, 243, 514, 674, 678, 688 Pivotal event 404, 406 PLL 125, 143, 156 Posterior probability 42 Potential equivalent fatality see PEF Potential loss of life see PLL PRA 672, 681, 682 living 674 Precautionary principle 113, 114 Precursor 37 Preliminary hazard analysis (PHA) 266–267 Prescriptive regime 179 Prior probability 41 Proactive barrier 468 Probabilistic cause 190 Probabilistic risk assessment see PRA Probabilistic safety assessment see PSA Probability 38, 39, 706 Bayesian 40 classical 39 frequentist 39 judgmental 40 steady-state 390 subjective 40, 43 Probability density function 713 Probability of failure on demand (PFD) 486 Problem reporting and corrective action (PRACA) 593 Process hazard analysis (PHA) 677 Process review 264 Project execution model (PEM) 631 Proof test 479 Protection layer 473, 616 PSA 673 PSF 553, 556

759

760

Subject Index

q Quantitative risk assessment (QRA) 147, 676, 677

r RAC 79, 100, 102, 106, 114 Rapid risk ranking (RRR) 267 Rasmussen and Svedung’s model 199 Reactive barrier 468 Reactor Safety Study 402, 671, 677 Reduction in life expectancy see RLE Reliability block diagram (RBD) 351, 414, 704 Resilience 53, 225 RIDM 177 RIF 190, 373, 378, 502, 503, 536, 586, 684 Risk 17, 47 Risk acceptance 99 Risk acceptance criteria see RAC Risk analysis 18, 59 Risk assessment 59 validity 82 Risk aversion 144 Risk-based decision-making (RBDM) 177 Risk communication 172 Risk contour plot 135 Risk evaluation 75 Risk governance 169 Risk index 153 Risk indicator 586 Risk-influencing factor see RIF Risk-informed decision-making see RIDM Risk management 168 Risk matrix 148, 303 Risk measure 122 Risk metric 121 Risk modeling 594 Risk OMT 512 Risk perception 171

Risk picture 47 Risk priority number see RPN Risk profile 47 Risk reduction measure 171, 173, 512, 516 Risk register 327 Risk treatment 171 RLE 155 Robustness 53 Root cause 199 Root cause analysis (RCA) 190, 593 RPN 152, 293

s SAFEDOR 158, 688 Safeguard 403, 613 Safe job analysis (SJA) 278 Safety 50, 51 Safety audit 174 Safety barrier 203, 466 Safety barrier diagram 489 Safety case 179, 680 Safety constraint 309 Safety culture 535 Safety indicator 586 Safety-instrumented function see SIF Safety-instrumented system see SIS Safety integrity 485 Safety life-cycle 484 Safety management 169 Safety performance 51 Safety performance metric 123 Safety risk 670 Sample space 39, 702 Sandoz fire accident 674 Secure-guard 613, 624 Security 51, 605 Security assessment 617 Sensitivity analysis 361, 661 Sequentially timed events plotting (STEP) 200 Series structure 704 Severity 49

Subject Index

Seveso accident 7 Seveso directive 8, 77, 81, 133, 240, 471, 674, 675 SFAIRP 82, 111 Shared cause 448 SIF 478 SINTEF 36, 242, 246, 256, 449, 461, 684 SIS 477, 479, 494 Slip 530 Smart city 627 Smart grid 627 Social willingness to pay 105 Societal risk 124 Sociotechnical system 88 So far as is reasonably practicable see SFAIRP Stakeholder 175 STAMP 226, 227 Statistical life 104 Steady-state probability 390 STPA-Sec 623, 626 Structured what–if technique (SWIFT) 316–322 Structure function 704 Subjective probability 40 Success likelihood index methodology (SLIM) 570 Survivor function 715 Swiss cheese model 202, 532 Synthesis 95 System 87 boundary 89, 90 breakdown structure 89 closed 91 deterministic 88 dynamics 217 open 91 sociotechnical 88, 97 technical 88 System accident 38, 185 System analysis 94 Systematic failure 481

Systematic human error reduction and prediction approach (SHERPA) 548 worksheet 550 System-theoretic accident model and processes see STAMP System-theoretic process analysis (STPA) 306–316

t Tabular task analysis (TTA) 540 Tactical risk 669 Target 45 Task 528 Task analysis 537 Task hazard analysis (THA) 278 Technique for human error rate prediction (THERP) 553 event tree 554 Threat 17, 51, 608 Threat actor 52, 610, 622 motives 611 register 611 Threat model 621 Threat register 610 Threat–vulnerability–asset (TVA) worksheet 626 Three Mile Island accident 220 Tight coupling 222 Time to failure 713 Tolerable risk 17, 99, 103 Total safety 51, 606 Transition rate matrix 387 Tripod 203

u Uncertainty 20, 648 aleatory 649 completeness 654 epistemic 649 model 651 parameter 653 reducible 649

Uniform model 39 United Kingdom Atomic Energy Authority (UKAEA) 8 Unsafe control action 307, 312 Upper bound approximation 357

v

Value of life 104 Value of statistical life Venn diagram 703 Victim 45 Violation 531 Voting logic 483 Vulnerability 52, 612

w

Worst-case 21

WILEY SERIES IN STATISTICS IN PRACTICE

Advisory Editor, Marian Scott, University of Glasgow, Scotland, UK
Founding Editor, Vic Barnett, Nottingham Trent University, UK

Human and Biological Sciences
Brown and Prescott ⋅ Applied Mixed Models in Medicine
Ellenberg, Fleming and DeMets ⋅ Data Monitoring Committees in Clinical Trials: A Practical Perspective
Lawson, Browne and Vidal Rodeiro ⋅ Disease Mapping With WinBUGS and MLwiN
Lui ⋅ Statistical Estimation of Epidemiological Risk
Marubini and Valsecchi ⋅ Analysing Survival Data from Clinical Trials and Observation Studies
Parmigiani ⋅ Modeling in Medical Decision Making: A Bayesian Approach
Senn ⋅ Cross-over Trials in Clinical Research, Second Edition
Senn ⋅ Statistical Issues in Drug Development
Spiegelhalter, Abrams and Myles ⋅ Bayesian Approaches to Clinical Trials and Health-Care Evaluation
Turner ⋅ New Drug Development: Design, Methodology, and Analysis
Whitehead ⋅ Design and Analysis of Sequential Clinical Trials, Revised Second Edition
Whitehead ⋅ Meta-Analysis of Controlled Clinical Trials
Zhou, Zhou, Liu and Ding ⋅ Applied Missing Data Analysis in the Health Sciences

Earth and Environmental Sciences
Buck, Cavanagh and Litton ⋅ Bayesian Approach to Interpreting Archaeological Data
Cooke ⋅ Uncertainty Modeling in Dose Response: Bench Testing Environmental Toxicity
Gibbons, Bhaumik, and Aryal ⋅ Statistical Methods for Groundwater Monitoring, Second Edition
Glasbey and Horgan ⋅ Image Analysis in the Biological Sciences

Helsel ⋅ Nondetects and Data Analysis: Statistics for Censored Environmental Data

Helsel ⋅ Statistics for Censored Environmental Data Using Minitab® and R, Second Edition
McBride ⋅ Using Statistical Methods for Water Quality Management: Issues, Problems and Solutions
Ofungwu ⋅ Statistical Applications for Environmental Analysis and Risk Assessment
Webster and Oliver ⋅ Geostatistics for Environmental Scientists

Industry, Commerce and Finance
Aitken and Taroni ⋅ Statistics and the Evaluation of Evidence for Forensic Scientists, Second Edition
Brandimarte ⋅ Numerical Methods in Finance and Economics: A MATLAB-Based Introduction, Second Edition
Brandimarte and Zotteri ⋅ Introduction to Distribution Logistics
Chan and Wong ⋅ Simulation Techniques in Financial Risk Management, Second Edition
Jank ⋅ Statistical Methods in eCommerce Research
Jank and Shmueli ⋅ Modeling Online Auctions
Lehtonen and Pahkinen ⋅ Practical Methods for Design and Analysis of Complex Surveys, Second Edition
Lloyd ⋅ Data Driven Business Decisions
Ohser and Mücklich ⋅ Statistical Analysis of Microstructures in Materials Science
Rausand and Haugen ⋅ Risk Assessment: Theory, Methods, and Applications, Second Edition