
Bits and Bugs

The SIAM series on Software, Environments, and Tools focuses on the practical implementation of computational methods and the high performance aspects of scientific computation by emphasizing in-demand software, computing environments, and tools for computing. Software technology development issues such as current status, applications and algorithms, mathematical software, software tools, languages and compilers, computing environments, and visualization are presented.

Editor-in-Chief Jack J. Dongarra University of Tennessee and Oak Ridge National Laboratory Series Editors Timothy A. Davis Texas A & M University

Laura Grigori INRIA

Padma Raghavan Pennsylvania State University

James W. Demmel University of California, Berkeley

Michael A. Heroux Sandia National Laboratories

Yves Robert ENS Lyon

Software, Environments, and Tools Thomas Huckle and Tobias Neckel, Bits and Bugs: A Scientific and Historical Review of Software Failures in Computational Science Thomas F. Coleman and Wei Xu, Automatic Differentiation in MATLAB Using ADMAT with Applications Walter Gautschi, Orthogonal Polynomials in MATLAB: Exercises and Solutions Daniel J. Bates, Jonathan D. Hauenstein, Andrew J. Sommese, and Charles W. Wampler, Numerically Solving Polynomial Systems with Bertini Uwe Naumann, The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation C. T. Kelley, Implicit Filtering Jeremy Kepner and John Gilbert, editors, Graph Algorithms in the Language of Linear Algebra Jeremy Kepner, Parallel MATLAB for Multicore and Multinode Computers Michael A. Heroux, Padma Raghavan, and Horst D. Simon, editors, Parallel Processing for Scientific Computing Gérard Meurant, The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations Bo Einarsson, editor, Accuracy and Reliability in Scientific Computing Michael W. Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, Second Edition ´ Craig C. Douglas, Gundolf Haase, and Ulrich Langer, A Tutorial on Elliptic PDE Solvers and Their Parallelization Louis Komzsik, The Lanczos Method: Evolution and Application Bard Ermentrout, Simulating, Analyzing, and Animating Dynamical Systems: A Guide to XPPAUT for Researchers and Students V. A. Barker, L. S. Blackford, J. Dongarra, J. Du Croz, S. Hammarling, M. Marinova, J. Wasniewski, and P. Yalamov, LAPACK95 Users’ Guide Stefan Goedecker and Adolfy Hoisie, Performance Optimization of Numerically Intensive Codes Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide Lloyd N. Trefethen, Spectral Methods in MATLAB E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users’ Guide, Third Edition Michael W. Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Numerical Linear Algebra for High-Performance Computers R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK Users’ Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods Randolph E. Bank, PLTMG: A Software Package for Solving Elliptic Partial Differential Equations, Users’ Guide 8.0 L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users’ Guide Greg Astfalk, editor, Applications on Advanced Architecture Computers Roger W. Hockney, The Science of Computer Benchmarking Françoise Chaitin-Chatelin and Valérie Frayssé, Lectures on Finite Precision Computations

Bits and Bugs A Scientific and Historical Review of Software Failures in Computational Science

Thomas Huckle

Technical University of Munich Munich, Germany

Tobias Neckel

Technical University of Munich Munich, Germany

Society for Industrial and Applied Mathematics Philadelphia

Copyright © 2019 by the Society for Industrial and Applied Mathematics
10 9 8 7 6 5 4 3 2 1
All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 Market Street, 6th Floor, Philadelphia, PA 19104-2688 USA.

Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended. AIRBUS is a trademark of Airbus S.A.S. Boeing is a trademark of The Boeing Company. Camry is a trademark of the Toyota Motor Corporation. EyeQ3 is a registered trademark of Mobileye Vision Technologies, Ltd. Helixorter is a trademark of Vanderlande Industries B. V. LLC. IBM is a registered trademark of IBM, Inc. www.ibm.com Intel, the Intel logo, the Intel Inside logo, Pentium, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Mac is a trademark of Apple Computer, Inc., registered in the United States and other countries. Bits and Bugs: A Scientific and Historical Review of Software Failures in Computational Science is an independent publication and has not been authorized, sponsored, or otherwise approved by Apple Computer, Inc. MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please contact The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA, 508-647-7000, Fax: 508-647-7001, [email protected], www.mathworks.com. Model X and Model S are trademarks of Tesla, Inc. Motorola is a trademark of Motorola Trademark Holdings, LLC. MS-DOS is a registered trademark of Microsoft Corporation in the United States and/or other countries. Prius is a trademark of Prius Motors LLC. Therac is a trademark of Atomic Energy of Canada Limited. Titan V GPU is a registered trademark of NVIDIA Corporation in the U.S. and other countries. Virgin Galactic is a registered trademark of Virgin Enterprises Limited. Volvo is a trademark of Volvo Trademark Holding AB. VW Golf is a trademark of the Volkswagen AG Corporation, Fed Rep Germany. VXworks is a registered trademark of Wind River Systems, Inc. Windows, Windows XP, and Excel are registered trademarks of Microsoft Corporation in the United States and/or other countries. WorkCenter and ColorCube are registered trademarks of Xerox Corporation.

Publications Director: Kivmars H. Bowling
Acquisitions Editor: Elizabeth Greenspan
Developmental Editor: Gina Rinelli Harris
Managing Editor: Kelly Thomas
Production Editor: Louis R. Primus
Copy Editor: Claudine Dugan
Production Manager: Donna Witzleben
Production Coordinator: Cally A. Shrader
Compositor: Cheryl Hufnagle
Graphic Designer: Doug Smock

Library of Congress Cataloging-in-Publication Data Names: Huckle, Thomas, author. | Neckel, Tobias, author. Title: Bits and bugs : a scientific and historical review of software failures in computational science / Thomas Huckle, Technical University of Munich, Munich, Germany, Tobias Neckel, Technical University of Munich, Munich, Germany. Description: Philadelphia, PA : Society for Industrial and Applied Mathematics, [2019] | Series: Software, environments, and tools ; 29 | Includes bibliographical references and index. Identifiers: LCCN 2018045526 (print) | LCCN 2018051491 (ebook) | ISBN 9781611975567 | ISBN 9781611975550 Subjects: LCSH: Software failures--History. | Debugging in computer science--History. Classification: LCC QA76.76.F34 (ebook) | LCC QA76.76.F34 H83 2019 (print) | DDC 004.2/409--dc23 LC record available at https://lccn.loc.gov/2018045526 Partial royalties from the sale of this book are placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM, and qualified individuals are encouraged to write directly to SIAM for guidelines.


To Svenja, Tilman, Nikolai, and Antonia To Anja, Julia, and Philipp

Contents

List of Excursions  ix

Preface  xi
0.1 Background of This Book  xi
0.2 Acknowledgments  xi

1 Introduction  1
1.1 Focus of This Book  1
1.2 Structure of This Book  3
1.3 Classification of Bugs  3

2 Machine Numbers, Precision, and Rounding Errors  9
2.1 The Failure of the Ariane 5  10
2.2 Y2K and Data Formats  26
2.3 Vancouver Stock Exchange  30
2.4 The Patriot Missile Defense Incident  40

3 Mathematical Modeling and Discretization  51
3.1 Loss of the Gas Rig Sleipner A  52
3.2 London Millennium Bridge  75
3.3 Weather Prediction  79
3.4 Mathematical Finance  93

4 Design of Control Systems  103
4.1 Fly-by-Wire  103
4.2 Automotive Bugs  114

5 Synchronization and Scheduling  125
5.1 Space Shuttle  125
5.2 Mars Pathfinder  134

6 Software-Hardware Interplay  145
6.1 The Pentium Division Bug  145
6.2 Mars Rover Spirit Flash Memory Problem  164

7 Complexity  171
7.1 Therac-25  171
7.2 Denver Airport  192
7.3 Strategic Defense Initiative  213

8 Appendix  219
8.1 Urban Legends and Other Stories  219
8.2 MATLAB Example Programs  227
8.3 Pentium Bug: Original Email of T. Nicely to Several Individuals and Groups  240

Index  247

List of Excursions

Excursion 1: Machine Numbers for Integers  12
Excursion 2: Machine Numbers for Real Numbers: Floating-Point Format  13
Excursion 3: Transforming Machine Number Formats: Casts  14
Excursion 4: Exception Handling and Protection of Operations  14
Excursion 5: Rounding to Machine Numbers  32
Excursion 6: Complexity  33
Excursion 7: Modeling and Simulation  55
Excursion 8: Finite Element Method  57
Excursion 9: Interpolation  58
Excursion 10: Extrapolation  60
Excursion 11: Conditioning and Condition Number  80
Excursion 12: Conditioning and Chaos  82
Excursion 13: Data Assimilation  85
Excursion 14: Priority and Semaphores in Concurrent Computing  137
Excursion 15: Carry-Save Adder  146
Excursion 16: Replacing Subtraction by Addition: One's Complement and Two's Complement  147
Excursion 17: Restoring Division for Integers  148
Excursion 18: Nonrestoring Division for Integers  150
Excursion 19: Radix 4 SRT Division  152
Excursion 20: Radix 4 SRT with Carry-Save Adder on Pentium  153
Excursion 21: Basic Aspects of Radiation Therapy  173
Excursion 22: Concurrent Computing and Race Conditions  180
Excursion 23: Complexity—Part II  198
Excursion 24: The Traveling Salesman Problem  200

Preface

0.1 Background of This Book

The interest in numerical bugs originated from a post by Pete Stewart in the October 1999 edition of NA Digest1 (see [1]). There, Pete Stewart mentioned interesting numerical bugs and challenged readers to provide further examples. The first list of examples comprised the Patriot failure, the explosion of Ariane 5, index computation of the Vancouver Stock Exchange, and the Green Party rounding error in parliamentary elections in Germany. Later on, Stewart's collection grew (and was summarized on different websites), adding the sinking of the Sleipner gas rig, the Denver Airport baggage-handling disaster, the Pentium division bug, the Y2K problem, and the loss of the Mars Climate Orbiter. Over the years, the number of numerical bugs kept growing and attracted a lot of interest. In particular, one of the authors closely followed this interesting development. In about 2000, he started to collect corresponding material on a website2 and worked on the topics in seminars and presentations. It turned out that the cases become more complex and sometimes also inconsistent the deeper one dives into the topics and related literature. Typically, a lot of material exists in the literature on all of the cases. Sometimes, surveys are nicely compact but simplify the real reasons (considerably). Frequently, only certain aspects were the focus of the corresponding authors. Therefore, it pays off to bring together all the relevant material of the selected bugs in one place in a thorough scientific manner.

0.2 Acknowledgments

Many people supported us with respect to proofreading specific sections. In alphabetical order, we thank Long Chen, Alan Edelman, Michael Grad, Svenja Huckle, Florian Jarre, Julius Knorr, Anja Neckel, Alois Pichler, Alexander Pöppl, Roland Pothast, Karl-Heinz Reineck, Christoph Riesinger, Florian Rupp, Alexander Sauter, Ulrich Schättler, Stephen Seckler, Robert Skeel, Carsten Uphoff, Konrad Waldherr, and Roland Wittmann. Concerning valuable hints to certain interesting software bugs, special thanks go to Andreas Blüml, Bernd Brügge, Oliver Ernst, William Kahan, Daniel Kressner, Helmut Seidl, and Pete Stewart. For useful discussion and information on certain bugs, we thank Kevlin Henney (Ariane), Berit Alstad, Norb DeLatte, Kevin Parfitt, Karl-Heinz Reineck, Keith Thompson, Ger Wackers (Sleipner), Long Chen (Millennium Bridge), Andreas Fink, Detlev Majewski, Joaquim Pinto, Florian Prill, Ulrich Schättler, Uwe Ulbrich (weather forecast), William Kahan, Christoph Lauter, Tobias Nipkow, Christoph Riesinger, Florian Schnagl (Pentium), Leo Eichhorn (Mars Rover flash memory), Gunter Artl, Florian Jarre (Denver and Heathrow Airports), Manfred Broy, and Michael Grad (automotive). Furthermore, we thank SIAM and, in particular, Elizabeth Greenspan and Gina Rinelli Harris for all the constructive feedback and support. And last but not least, we gratefully acknowledge the comments and propositions of all reviewers, which considerably contributed to improving the quality of this book.

1 "The NA Digest is a collection of articles on topics related to numerical analysis and those who practice it" (see http://www.netlib.org/na-digest-html/). It also offers a weekly email newsletter.
2 https://www5.in.tum.de/persons/huckle/bugse.html

Bibliography

[1] P. G. Stewart. Rounding error. date accessed: 2018-07-31, 1999. http://www.netlib.org/na-digest-html/99/v99n43.html#1

Chapter 1

Introduction

1.1 Focus of This Book

In this book, we describe and analyze software failures mainly related to problems in the field of scientific computing. This field, also called computational science, is very interdisciplinary and uses advanced computing capabilities to understand and solve complex problems frequently stemming from physics or engineering science. The major fields within scientific computing that have been selected for this book are

• mathematical modeling of phenomena;
• classical numerical analysis (number representation, rounding, conditioning, numerics of differential equations, etc.);
• mathematical aspects and complexity of algorithms, systems, or software;
• concurrent computing (parallelization, scheduling and synchronization, race conditions, etc.);
• dealing with numerical data (e.g., data input, data interpretation, and design of control logic).

The goal is to collect all available material and to determine and discuss the major causes of the failures in necessary detail. We aim to analyze all aspects of the selected bugs in breadth (looking into all reported failures related to a specific issue) as well as in depth (getting the complete and scientifically sound picture of a problem). To make this book as beneficial as possible for the interested layperson, we have added excursions which define and explain important concepts that are crucial for understanding special aspects of bugs. These excursions are included where necessary, and they build upon or refer to each other. More experienced readers may skip the excursions and directly continue reading at the indicated pages.

In contrast to other books or collections of software failures, we do not explicitly focus on general aspects of software engineering such as developing, maintaining, or validating reliable software and corresponding techniques to prevent bugs. Frequently, software bugs serve as a motivation for computer scientists to learn, understand, and apply specific software engineering techniques (see, e.g., [2, 11]). Certain literature uses failures to motivate approaches not only for the development of software but also for its application and the interpretation of related results (cf. [9, 21]). Contributions such as [3] focus on the field of scientific computing and aim to improve not only the software development but also the whole approach of scientific computing, including the analysis of modeling errors, aspects of verification and validation, etc. And finally, literature on specific aspects of mathematical problems uses a subset of bugs in compact format as motivation for respective theory (see, e.g., [6]). A completely different approach is taken in [13] and The RISKS Digest:3 Every single computer-related problem is listed independent of its nature (hardware- or software-related).

The bibliography for each failure in our book is frequently quite involved. Because we strive for a picture that is as comprehensive as possible, we aim to collect and incorporate all the available and relevant literature. This includes original papers, secondary literature, web pages, videos, presentations, and also private communication with experts involved in the examinations of the failures. We show the full picture of the incidents, taking into account different opinions of authors who often concentrate on a single root cause, ignoring other aspects, and therefore sometimes contradict each other. This way, we are able to complete the picture and widen the horizon for a number of cases. Some bugs, or aspects of them, can even be categorized as urban legends. In certain cases, parts of the sources are not accessible anymore (in particular older websites) or have never been (completely) public, which hinders the analysis. We try to use reliable scientific sources only, but sometimes websites such as Wikipedia are also used since they are publicly available and directly accessible. For some bugs, MATLAB4 programs are collected in the appendix to complete the picture and facilitate understanding.

The content of this book is useful in different contexts.

• It can serve as a self-contained collection of topics for seminars on software bugs targeting a scientific and comprehensive presentation.
• It can deliver illustrative examples of numerical principles such as machine numbers, rounding errors, condition numbers, or complexity. Therefore, it is very well suited as supplemental material for courses on numerical methods or scientific computing, as it treats some topics in an entertaining way that supports the understanding of major problems.
• It can serve as a motivating introduction to certain important aspects of numerical methods for nonexperts. By means of the excursions, all the necessary background is provided in an intelligible way.
• It can be used as a source for public lectures (at open house days, for example) for a general audience to demonstrate certain interesting aspects of scientific work in a refreshing way.
• It can simply entertain the interested reader (at least for certain bugs and failures).

Therefore, the book will be helpful and interesting for students, teachers, and researchers in the field of scientific computing as well as for a broader, interested audience with minimal background in mathematics and computer science.

3 The RISKS Digest: Forum on Risks to the Public in Computers and Related Systems; see https://catless.ncl.ac.uk/Risks/
4 MATLAB is a commonly used commercial programming platform and language with relatively easy syntax. See http://www.mathworks.com/.


1.2 Structure of This Book

The book is subdivided into chapters that are devoted to certain classes of bugs and can be read independently of each other. Only the excursions partially refer to each other. In the remainder of this introduction, we discuss possible classifications of different aspects of software problems (definition of error vs. fault vs. failure and their relations, for example). Chapter 2 is dedicated to number representation and rounding errors. In Chapter 3, we discuss bugs caused by faulty assumptions which might occur in initial values or input data, in interpretation or output of data, and in faulty assumptions concerning the mathematical modeling or discretization of systems. In Chapter 4, problems with the digital control of airplanes and cars are presented. Chapter 5 contains software bugs involving issues of synchronization. Chapter 6 presents bugs related to software-hardware interplay, and Chapter 7 discusses bugs related to large projects with high complexity. In the appendix, we present a collection of urban legends and other interesting stories. Furthermore, it contains a selection of code examples (mainly in MATLAB) and other interesting background material for specific bugs and their corresponding explanations.

The order of the considered main examples follows the character of the bug and not the respective field of application. Therefore, the applications are mixed and sometimes appear in different chapters. In some cases, a unique classification is difficult because different bugs or errors contributed to a single failure. The explosion of the Ariane 5, e.g., can be traced back to an inappropriate number format, to a faulty exception handling, to unchecked use of old code, and to insufficient testing.

A section for a bug begins with a short overview characterizing the failure. Then an introduction provides background material on the application area. Subsequently, a timeline of the considered accident is typically given for the reader to become familiar with the events, followed by a detailed description of the error and a discussion of its root causes. When applicable, two further subsections follow: one on important but minor bugs in the same field of application, and one containing smaller and/or less prominent bugs of similar character. A comprehensive bibliography is given at the end of each section.

1.3 Classification of Bugs

For classifying different aspects of software problems, a lot of different terminology can be found in the literature. A comprehensive and intuitive approach, from our point of view, is the classification presented in [2] and [8], which is visualized in Fig. 1.1. Neither type of classification directly addresses an error or mistake of a human being (typically one or several software developers), but both indirectly assume its existence for the further steps.

Variant 1:

The first variant of classification according to [2] (Fig. 1.1(a)) considers an existing fault in the software system which leads, at runtime of the software, to an erroneous state of the system. This erroneous state may or may not be hidden, but its consequence is an observable failure of the system. A failure is defined as the deviation of the observed behavior of a (part of the) system from the specified behavior.

Variant 2: In the second variant of classification according to [8] (Fig. 1.1(b)), faults also lead to failures. In addition to faults, defects are also specified. A defect is a supertype of fault; i.e., every fault is a defect, but not every defect is a fault since there is a difference in the type of detection: "A defect is a fault if it is encountered during software execution (thus causing a failure). A defect is not a fault if it is detected by inspection or static analysis and removed prior to executing the software" ([8], p. 4).

[Figure 1.1 shows the two classification chains as box diagrams: (a) error → fault → erroneous state → failure; (b) error → defect → fault → failure.]
Figure 1.1. Classification of software problems: (a) according to [2] and (b) according to [8]. Arrows pointing from box A to box B are interpreted as "A leads to B."

The severity of failures (and faults) is frequently indicated via five different levels of criticality affecting 1) comfort, 2) discretionary funds, 3) essential funds, 4) a single life, and 5) many lives (see, e.g., [1, 7]^5). Note that the term bug is not contained in either of these classifications but may be considered to be mostly equivalent to fault. We are going to use bug in this sense throughout this book. The term "bug" in the context of computers is often attributed to G. Hopper, who reported in the late 1940s on a dead moth causing a problem inside one of Harvard's early computers (the Harvard Mark II; see [14, 24]). Fig. 1.2 contains a picture of the original dead moth attached to the computer log.

A different idea consists in categorizing software faults and bugs according to the type of the underlying errors (cf., e.g., [8, 12, 15, 18, 24]). Examples are typographical errors, interface errors, computational errors, errors related to design or requirements, testing errors, and performance errors. Furthermore, software bugs can also be grouped according to their type of occurrence:

• A so-called bohrbug is a deterministic error that can always be reproduced directly (see [5]). Therefore, such bugs are easier to find.
• A mandelbug is a bug which is so complex that it occurs in a chaotic or nondeterministic way. Therefore, a mandelbug is hard to detect (cf., e.g., [5]). Sometimes mandelbug also denotes a bug with fractal behavior by revealing more and more bugs (analyzing the code to fix a bug always uncovers more bugs).
• A schrödingbug presents itself in running software after the programmer detects that the code is faulty and should never have worked (see, e.g., [16, 17, 22]).

^5 Note that, depending on the field of application, additional or different categories are used. Software in the context of aerospace, e.g., has to be classified via five design assurance levels (DALs); see https://en.wikipedia.org/wiki/DO-178C.


Figure 1.2. The first computer bug: Dead moth causing problems in the Harvard Mark II. Image courtesy of the Naval Surface Warfare Center, Dahlgren, VA, 1988 [24].

• A heisenbug is a bug where the observation changes the behavior. Looking for the bug changes the results of the program (see, e.g., [4, 23]).^6
• A hindenbug is a bug with catastrophic consequences (cf., e.g., [19]).
• A lance-armstrong-bug is a bug where the software always passes all tests but obviously does not behave as it should (not 100% serious; see, e.g., [20]).

An example for a heisenbug is the MATLAB 80-bit precision problem due to [4] described in Sec. 8.2.2.

In general, classical analog control of systems is usually related to continuous functions where slight errors lead to nearby values (depending on the conditioning of the problem) but typically not to inevitable failures. In contrast, digital control is based on discrete and discontinuous functions where one wrong bit can lead to total failure. This aggravates the importance of software errors compared to mechanical or electrical systems. Furthermore, recent approaches such as artificial intelligence or machine learning do not help in this context because there is no guarantee of confident solutions without any exceptions.

^6 The Heisenberg effect is sometimes called the observer effect, the panopticon effect, or the Hawthorne effect and describes the effect that the act of observing changes the behavior of the analyzed system; see [10] for a brief explanation.

Bibliography

[1] B. Boehm and R. Turner. Balancing Agility and Discipline: A Guide for the Perplexed. Addison-Wesley, 2004
[2] B. Bruegge and A. H. Dutoit. Object-oriented Software Engineering: Using UML, Patterns, and Java. Prentice Hall, 3rd edition, 2010
[3] B. Einarsson (editor). Accuracy and Reliability in Scientific Computing. SIAM, 2005
[4] W. Gander. Heisenberg effects in computer-arithmetic. date accessed: 2018-01-29, 2005. https://www.inf.ethz.ch/personal/gander/Heisenberg/paper
[5] M. Grottke and K. Trivedi. Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate. Computer, 40(2):107–109, Feb. 2007
[6] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002
[7] IEEE Computer Society. IEEE Standard for Software and System Test Documentation. IEEE Std 829-2008, 2008
[8] IEEE Computer Society. IEEE Standard Classification for Software Anomalies. IEEE Std 1044-2009, 2010
[9] N. G. Leveson. Safeware: System Safety and Computers. Computer Science and Electrical Engineering Series. Addison-Wesley, 1995
[10] B. W. Liffick and L. K. Yohe. Using Surveillance Software as an HCI Tool. In Proceedings of the Information Systems Education Conference, ISECON, 2001. http://proc.edsig.org/2001/00a/
[11] G. J. Myers. Software Reliability: Principles and Practices. A Wiley Interscience publication. Wiley, 1976
[12] T. Nakashima, M. Oyama, H. Hisada, and N. Ishii. Analysis of Software Bug Causes and Its Prevention. Information and Software Technology, 41(15):1059–1068, 1999
[13] P. G. Neumann. Computer-Related Risks. Pearson Education, 1995
[14] G. Pearson. Google honors Grace Hopper. . . and a "bug." date accessed: 2018-07-06, 2013. https://www.wired.com/2013/12/googles-doodle-honors-gracehopper-and-entomology/
[15] D. Pradhan. Top 10 reasons why there are bugs/defects in software! Software testing tricks. date accessed: 2018-01-29, 2008. http://www.softwaretestingtricks.com/2008/12/why-are-bugsdefects-in-software.html
[16] E. Raymond. The Jargon File 4.4.7 – Schroedingbug. date accessed: 2018-02-04. http://catb.org/jargon/html/S/schroedinbug.html
[17] E. Raymond. The New Hacker's Dictionary. MIT Press, 3rd edition, 1996
[18] SoftwareTestingStuff. Classification of defects / bugs. date accessed: 2018-01-29, 2008. http://www.softwaretestingstuff.com/2008/05/classificationof-defects-bugs.html
[19] Techopedia. Hindenbug. date accessed: 2018-02-03, 2018. https://www.techopedia.com/definition/31877/hindenbug
[20] T. Ts'o and M. Pool. Lance Armstrong bug. date accessed: 2018-02-01, 2012. https://plus.google.com/+TheodoreTso/posts/7CK29YNwKT6
[21] L. R. Wiener. Digital Woes: Why We Should Not Depend on Software. Addison-Wesley Longman Publishing Co., 1993
[22] Wikipedia. Unusual software bug. date accessed: 2018-02-04, 2012. http://veriloud.com/wp-content/uploads/UnusualSoftwareBugs.html
[23] Wikipedia. Heisenbug. date accessed: 2018-02-01, 2017. https://en.wikipedia.org/wiki/Heisenbug
[24] Wikipedia. Software bug. date accessed: 2018-02-01, 2018. https://en.wikipedia.org/wiki/Software_bug

Chapter 2

Machine Numbers, Precision, and Rounding Errors

In this chapter, we investigate bugs related to working with numbers, as well as their representation. The finite representation of numbers in a computer is a notorious source of errors. The set of integers or real numbers contains infinitely many elements. But every computer is finite, and thus every integer and real number has to be stored in a data format using only a finite number of bits. The current year can be stored in decimal form with four decimal digits as 2018. Storing a number in an inappropriate data format represents an error and typically results in unwanted behavior or even failure of the corresponding computer program. A chosen number format with a certain number of bits cannot store larger numbers that would require more than the available bits. Storing a longer number in an inappropriate smaller data format was the (basic) cause for the Ariane 5 failure in 1996. The Y2K or Year 2000 bug was related to using only two digits for storing the current year, which is fine as long as we stay in a single, unique century, but not at the turn of a century. Any format for real numbers on computers necessarily leads to rounding errors, too. After each elementary operation such as addition or multiplication, the result has to be stored and, thus, is forced again into the prescribed format based on the given number of bits. Therefore, rounding errors are a direct consequence of this finite representation. Furthermore, the arising errors can propagate in the course of a sequence of computations in an extreme way, spoiling all the delivered numerical results. In this case, the executed computations are wrong and misleading. This accumulation of errors can result in wrong outcomes by the sheer number of small errors, by careless computations, or for extremely awkward problems. Such errors occurred in the computation of the Vancouver Stock Exchange index in 1982, and in controlling antiballistic Patriot missile launches during the Iraq War in 1991.
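As a small foretaste of the effects treated in this chapter, the following MATLAB lines (our own illustration, not one of the book's appendix programs) repeatedly add the value 0.1, which has no exact binary representation, and print the accumulated error; a closely related accumulation of rounding errors plays a role in the Patriot incident discussed in Section 2.4.

    s = 0;
    for k = 1:36000             % 36,000 additions of 0.1 (one hour in tenths of a second)
        s = s + 0.1;            % each 0.1 is only a rounded binary approximation
    end
    fprintf('computed sum:      %.12f\n', s);            % close to, but not exactly, 3600
    fprintf('accumulated error: %.3e\n', abs(s - 3600));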


2.1 The Failure of the Ariane 5

In any collection of data, the figure most obviously correct, beyond all need of checking, is the mistake.
—Finagle's Third Law, Arthur Bloch

2.1.1 Overview

• Field of application: near-Earth spaceflight
• What happened: self-destruction in view of unstable flight behavior
• Loss: more than $500 million for satellites
• Type of failure: integer overflow

2.1.2 Introduction

The European Space Agency (ESA) was founded in 1975 as a successor of the European Space Research (ESR) and the European Launcher Development Organization (ELDO). Established in 1962, the forerunner ELDO had to learn the space business the hard way, suffering 11 failures of the launch vehicles Europa-1 and -2, while Europa-3 existed on paper only. The merger of ESR and ELDO to create the ESA facilitated the successful Ariane^7 launcher program based on the Europa-3 project. Nearly all European countries are members of the ESA, including Switzerland and Norway, and associate members such as Canada are also allowed. In typical European style, the allocation procedure for ESA-related projects stipulates "that each Member State is to receive a fair financial and technological return on its investment" [14]. Certainly, the resulting large number of involved parties represents a potential source of problems. A successful ESA mission was the Giotto spacecraft visiting the comets Halley in 1986 and Grigg–Skjellerup in 1992, in collaboration with NASA.

In 1980, the company Arianespace was founded with the goal of developing a European launch vehicle in competition with NASA and the Russian space program. The first launch of Ariane 1 from Kourou in French Guiana took place on December 24, 1979. Follow-up models were Ariane 2 and 3, launching on May 31, 1986, and August 4, 1984, respectively.^8 Three out of 17 flights were failures in the Ariane 1 and 2 program. The Ariane 4 was developed in different versions from 1982 to 1988 with payloads ranging from 2,290 kg to 4,950 kg for geostationary transfer orbit (GTO) and from 4,600 kg to 10,200 kg for low earth orbit (LEO). In order to allow a higher payload, the plan for Ariane 5 in 1987 stipulated payloads of 6,100 kg GTO and 18,000 kg LEO. In comparison, the mighty Saturn V allows for 47,000 kg TLI (translunar injection) and 140,000 kg LEO.

^7 The name Ariane stems from Ariane, French for Ariadne, the Greek goddess of vegetation and daughter of King Minos of Crete who helped Theseus to kill the Minotaur (with the head of a bull and the body of a man) and find his way out of the labyrinth near Knossos.
^8 For a survey on launch systems, see, e.g., https://en.wikipedia.org/wiki/Comparison_of_orbital_launch_systems.

2.1.3 Timeline

The maiden flight of the Ariane 5 launcher (Ariane 501) was scheduled for June 4, 1996, with four Cluster satellites on board; see Fig. 2.1(a) for a sketch of the launcher.


Figure 2.1. The Ariane 501: (a) Sketch of the Ariane 5 launcher with Cluster satellites. Courtesy of Phrd/Stahlkocher, via Wikimedia. (b) QR code of a YouTube video of Ariane 501's launch and explosion. (c) QR code of the website [52] containing a descriptive series of images covering the accident from second 38 to 44 after liftoff. (d) Picture of the launcher's explosion. Reprinted with permission from ESA.

The goal of the so-called Cluster satellite program, a collaboration with NASA, was to study the magnetosphere of the earth based on four identical spacecraft in a tetrahedral formation.^9

First, we will briefly describe the chain of events (mainly relying on [28, 41]) before going into the details of the underlying facts. After a normal countdown and liftoff, a failure of the SRI (French abbreviation of Inertial Reference System, English IRS) occurred after 36.7 seconds of flight time. The SRI measures position and movement of the launcher and transmits this data to the on-board computer (OBC). A failure in the SRI fed incorrect data to the OBC, which caused a twisting of the rocket and, hence, a flight instability that finally led to self-destruction. The loss of the rocket and the four satellites was valued at $500 million. Also at stake were the development costs for the Ariane 5 of around $7 billion. An inquiry board was installed to analyze the failure based on the recovered debris such as the two SRIs, videos, photos, and other evidence. The French mathematician Jacques-Louis Lions was appointed as chairman of the inquiry board. The only fortunate circumstance of the crash was that many parts of the destroyed rocket were actually found on the ground in the surroundings of the launcher area at Kourou because of the short flight time. Hence, together with telemetry data received at the ground station, trajectory data from radar stations, and optical observations from infrared and video cameras, the inquiry board had a lot of data and was able to uncover the full story, which was summarized in the final report [28].

^9 After the failure of the Ariane Cluster mission, the Cluster satellite program finally succeeded in 2000 with the help of two Soyuz-Fregat rockets taking off from Baikonur.

2.1.4 Detailed Description of the Bug

To understand the findings of the inquiry board, we need some elementary background knowledge on computing, in particular on machine numbers and exception handling during program execution. Therefore, Excursions 1, 2, 3, and 4 explain these aspects. Readers who already have a basic knowledge of these concepts may continue directly to page 15.

Excursion 1: Machine Numbers for Integers

Computers are electronic devices based on electrical switches that can be in two states or take only two values. Therefore, every computer is based on the binary system, allowing only 0 and 1 as digits. One digit, which can have the value 0 or 1, is called a bit, and eight bits are called one byte. Hence, every character, word, and number has to be represented by a finite sequence of bits: 0's and 1's. For an integer k, in particular, a binary representation is derived by writing k as a unique sum of powers of 2 where the allowed factors are only the digits 0 or 1. For example, the decimal number 11 has the following binary form (note that the order of the bits from right to left corresponds to increasing powers of 2):

11 = 8 + 2 + 1 = 1 · 2^3 + 0 · 2^2 + 1 · 2^1 + 1 · 2^0 = (1011)_2.

Conversely, a binary number such as (11)_2 can be easily identified as k = 1 · 2^1 + 1 · 2^0 = 3 in our common decimal system. In the same way, every character can be represented in the ASCII code (American Standard Code for Information Interchange) as a sequence of 0's and 1's, e.g., "A" = (1000001)_2, "$" = (0100100)_2, "1" = (0110001)_2, carriage return = (0001101)_2.

As integer numbers may grow arbitrarily large, more and more bits need to be stored on a computer. Therefore, we have to decide how many binary digits we use for our numbers. This number of digits limits the set of integers that we can represent in our code. If we choose eight bits, 2^8 = 256 different numbers can be encoded. Since frequently positive and negative numbers are relevant, some binary representation has to be chosen to map the 256 bit combinations to the corresponding interval of numbers. One idea would be a dedicated bit defining the sign of an integer, but this would result in a nonunique binary representation of the value 0. Therefore, binary representations such as the binary offset or the two's complement (see also Excursion 16) use a different mapping, resulting in a unique binary representation of 0 but also in an unsymmetric range of numbers: For eight bits, typically the integers from −128 to 127 are available. Similarly, a 16-bit signed integer number can range from −32,768 = −2^15 through 32,767 = 2^15 − 1.

Obviously, we run into problems if in the course of our computations we leave the range of allowed numbers. Adding 1 to the largest positive number, i.e., 32,767, should—from a mathematics point of view—of course result in 32,768, a number that would take more bits than the allowed 16. In this case, strange things may happen with the computer-encoded binary numbers. Typically, the result will be −32,768 (this is also called integer overflow^a). Depending on the compiler, it may also happen that the exact result with more than 16 bits is stored. This results in an overwriting of neighboring bits in memory and unpredictable behavior of the program. In this case, another variable or word is overwritten, and this may prevent a correct code execution. Properties of typical integer data types are displayed in the following table:

data type   # bytes   smallest int                largest int
short       2         −32,768 = −2^15             32,767 = 2^15 − 1
int         4         −2,147,483,648 = −2^31      2,147,483,647 = 2^31 − 1
uint        4         0                           4,294,967,295
long        8         −2^63                       2^63 − 1

the case of Therac-25 in Sec. 7.1.

Excursion 2: Machine Numbers for Real Numbers: Floating-Point Format Up to now, we have only considered integer numbers. Noninteger numbers such as 2.5 or 3.141592 can be written as so-called fixed-point numbers, which are nothing other than integers in disguise: We consider numbers of the form x = r1 . . . rp .s1 . . . sq with p digits in front of the decimal point and q behind. These numbers are two integers of length p and q, respectively, where s1 . . . sq is shifted q positions to the right relative to the decimal point. As with integers, the set of representable fixed-point numbers also covers only a narrow range of the infinite possibilities of real numbers. Think of real numbers stemming from periodic fractions, roots, logarithms, etc., written in decimal form. Therefore, the fixed-point format is rarely used. For scientific applications, it is necessary to represent a wide range of real numbers. But here the difficulties are much worse than for integers: One real number may already have an infinite number of digits, and it is impossible to store such a number in the memory of any computer. Hence, we have to introduce a cut and a restriction to a limited number of digits for representing an approximation x ˜ for any given real number x. To this end, we use one bit for the sign of the real number, s = sign(x), +1 or −1, and we represent x in a mixed form as the product of a fixed-point number m = 0.m1 m2 . . . mt , the so-called mantissa, and a power of 2: x ˜ = s · m · 2e , where the exponent e is an integer with a certain allowed length. This represents the so-called floating-point format. Furthermore, we can choose mi = 0 or 1 for 1 < i ≤ t, and especially m1 = 1. This normalization m1 = 1 is always possible (by adapting the exponent e), with the exception of x ≈ 0. The normalization is necessary to derive the uniqueness of the representation because otherwise every number could be written in different ways by shifting the mantissa and adjusting the exponent accordingly, e.g., 1.5 = (1.1)2 · 20 = (0.11)2 · 21 = (11)2 · 2−1 . The main properties of floating-point data types are described in the following table: data type # bits total # bits mantissa # bits exponent range of exponent float 32 23+1 8 [−126, 127] double 64 52+1 11 [−1022, 1023] In this way, we can approximately represent very large and very small numbers. Obviously, we hereby introduce an error, which is kept small by rounding the given real number x to the nearest so-called machine number x ˜ = s · m · 2e . This defines the floating-point representation of x. In addition, to facilitate the representation of the number x = 0 and allow a more precise representation of numbers close to zero, the normalization m1 = 1 is not enforced in the case of minimum (negative) exponent e.

13

14

Chapter 2. Machine Numbers, Precision, and Rounding Errors

Since computations with nonnormalized numbers are very slow on certain CPUs, this effect can slow down numerical algorithms dramatically. For algorithms and numerical examples that use a lot of nonnormalized numbers, the runtime can grow up to a factor of 900 on a Pentium 4 and up to a factor of 370 on Sandy Bridge processors (see [9, 12]). Example for floating-point format: The number −1.5 in the float format is described as −1.5 = −(110000000000000000000000)2 · 20 , • representing the sign via one bit, • storing the mantissa as (10000000000000000000000)2 , because the first digit 1 is already given by the normalization and does not have to be stored, • and the exponent is stored as a signed integer (cf. Excursion 1). The reader is referred to [21, 24, 31, 36, 44] for further details on machine numbers and the relation to (numerical) algorithms.

Excursion 3: Transforming Machine Number Formats: Casts Typically, programming languages offer machine numbers based on different bit lengths to make programming more flexible. For integers, the data types int (32 bit), long int (64 bit), and short int (16 bit) are used. But one should be aware that these formats are not uniquely defined. For floating-point numbers—according to the IEEE-754 standard—the single-precision float (32 bit: 1 bit for the sign, 8 bits for the exponent, 23 bits for the mantissa) and double-precision double (64 bit: 1 bit for the sign, 11 bits for the exponent, 53 bits for the mantissa) are used. Sometimes it is useful to alter the representation of a number from one format to another one in a specific part of the program. Such an operation is called a cast. Casts often happen automatically, by adding float x and int n to x + n, e.g., the number n is automatically transformed or cast into the float format in order to perform the addition for float numbers. Furthermore, programmers sometimes want or need to save storage by transforming a long int into a short int, etc., but this leads to correct results only in a case where the considered integer fits in the short int format regarding its actual value. We could define a variable n=5 of type long int and k of type short int, and apply a cast via the command k := n without any loss of information; in contrast, n=61,000 would result in a wrongly cast value stored in k. Especially in the beginning of the computer era, storage was very restricted, and programmers tried to save memory by allocating only the absolutely necessary bit length for variables.a a Compare the bugs related to Y2K in Sec. 2.2, Vancouver Stock Exchange in Sec. 2.3, and Therac-25 in Sec. 7.1.

Excursion 4: Exception Handling and Protection of Operations During the execution of a code, abnormal or exceptional conditions can occur such as division by zero.a In order to deal with these cases in a clever way, an exception handling (see, e.g., [6]) can be started by calling special subroutines. Thus, the normal flow of the program is abandoned or safeguarded. Such exception mechanisms can be implemented via programming language constructs or via hardware mechanisms. The reason for introducing exception handling is that usually there is no possibility of anticipating all the possible events for a program. By using exceptions, one can save the code from crashing in the event of anomalous

2.1. The Failure of the Ariane 5

15

behavior or usage of a program. Consider a web server receiving a misspelled website name from a client user. It would be catastrophic to shut down the web server with some cryptic error and stop all services. Instead, an exception can come into effect sending a corresponding 404 error message, for example, to the single problematic user, but the server keeps on running smoothly for all other users. Another typical example is a floating-point division by zero, which can cause the program to fail if no exception handling is active for this case. In a divide-by-zero situation, the exception could be replacing the related number by the virtual number infinity (or NaN, not-a-number). Then the exception handler can store important variables and data and possibly also return to the main program. Many programming languages such as Ada allow various types of exceptions, including a protection of variables and subroutines checking whether certain operations lead to an overflow and avoiding the erroneous operation and effect. a Compare

the example of the USS Yorktown on page 88.

With the notation of machine numbers and exception handling, we can now describe the series of events that led to the loss of Ariane 501 in full detail. The SRI computer code used an internal alignment function which had to compute the horizontal velocity bias of the rocket. Inside the alignment function, a cast took place to convert the horizontal velocity bias denoted as T_ALG.E_BH from 64-bit floating point to 16-bit signed integer. Unfortunately, the number stored in T_ALG.E_BH was too large to fit into the smaller 16-bit format (see Fig. 2.2 for an excerpt of the original code). This operation was not protected, resulting in an Operand Error on the underlying Motorola hardware platform (MC68020 processor and MC68882 coprocessor; see, e.g., [29, 30, 39]). This hardware exception—covering overflow problems in particular and furthermore other errors that were not frequent or important enough to merit a specific condition—was indicated by the OPERR bit set in the exception status byte of the Motorola assembly language and correctly caught as an exception by the Ada runtime environment of the SRI system. Unfortunately, the designed standard exception handling of the SRI code was a shutdown of the active SRI system 2 (designated for hardware failures only). The backup SRI system 1 (running in the background and identical to the SRI 2 in software and hardware) had ceased to work correctly due to the same problem 72 milliseconds before, so it was not possible to switch over to SRI 1. The restarting SRI 2 provided some diagnostic bit patterns as output when booting, which was interpreted as correct flight data by the OBC. This meaningless flight data caused the OBC to decide to change the course of the launcher abruptly by maneuvering the nozzles of the two solid boosters and the Vulcain engine into extreme positions. The resulting rupture of the links between the solid boosters and the core stage due to aerodynamic forces then led to the initialization of self-destruction intended for such a problematic flight situation (compare Fig. 2.1(d)).

2.1.5 Discussion A couple of additional avoidable details make this incident even more unfortunate: 1. In the specific code segment of the SRI software where the error occurred, casts appear for seven critical variables, but only four of them have been protected. A detailed manual analysis of these code segments when developing the SRI code for Ariane 4 showed that only nonphysical data could lead to exceeding values for the remaining three casts. This resulted in the decision to keep the three

16

Chapter 2. Machine Numbers, Precision, and Rounding Errors

Figure 2.2. Ariane 501 critical code segment (taken from [27]): While the code lines concerning the vertical velocity bias using L_M_BV_32 and T_ALG.E_BV (lower half of the image) show checks applying suitable values for limit cases via if-statements, the critical unchecked code segment for the horizontal velocity bias T_ALG.E_BH (last line) directly applies a cast statement without any if-clause. Reprinted with permission from Jean-Jacques Lévy.

corresponding casts unchecked to save computational resources (cf. [28, 29]), including the horizontal velocity bias shown in Fig. 2.2. 2. The developers of the Ariane program assumed and addressed only hardware failures in OBCs (cf. [28]). Their logical consequence was to design backup systems identical in the hardware and in the software. With knowledge of the events, it does not seem very safe to assume software is always working correctly until some hardware fault appears. 3. The design of the Ariane program expected only random hardware faults causing nonphysical data, in particular also in the SRI systems. The corresponding default exception handling to restart the SRI system even during flight time, when valid attitude data can no longer be computed (cf. [28]), was at least very unfortunate. 4. The existing Ariane 4 SRI code was directly included in Ariane 5 without actually testing the range of the variables. What had been physically safe for Ariane 4 (a certain limit on the horizontal velocity increments) did not hold for Ariane 5, which encounters an up to five times more rapid build-up of horizontal speed [28]. No simulations and tests were undertaken to produce or check realistic, Ariane 5–specific values for the horizontal velocity together with the SRI system, and this approach had been approved by all involved groups (cf. [28]). 5. Of course, there had been system integration tests for the Ariane 5 setup. But to restrict complexity in the test setup, the SRI system had been simulated instead of fully integrated. This approach is frequently used, but it led to not discovering the problematic SRI behavior for the new flight data of Ariane 5.


6. Finally, not without irony, the problematic code part of the SRI was not useful at all for the Ariane 5. The SRI-related code had two main operating modes: In the alignment mode, the SRI inertial reference coordinate system was defined, and the subsequent flight mode delivered launcher velocity and attitude in this coordinate system. For the Ariane 4, the corresponding alignment function had been designed to allow for a faster resumption of the countdown in the unlikely case of an interruption. It was specified to be active only up to second 40 (remember, the failure happened around second 37 of the flight). For the Ariane 5, this functionality was totally useless since a different preparation sequence was used there (cf. [28]); it was only kept to avoid changing a working, previously tested code block.

Based on these aspects, a variety of different interpretations and views exist in the community on the type of error or problem of the Ariane 5 accident (see [23, 35] for a partial survey):

• Is it just a programming error, as it appears at first sight, due to the missing manual handling of too large values in the code?

• Is it a systematic software design error, as pointed out by the report of the inquiry board [28] based on argument 3?

• Is it a testing error (as listed in [35])? See argument 5.

• Is it a requirements error (some reasoning on this aspect is available in [23]), based on arguments 1 and 3?

• Is it a reuse specification problem due to argument 4, as denoted by [22] (or a software reuse error as presented in [10])?

• Is it a problem of missing precondition checks (cf. [1]) during development? In this case, a lack of attention to strict preconditions on input values can be seen as the cause of the accident.

• Is it a process management problem? Some arguments for this can be found in [35], but the authors of [22] and to some extent also of [23] rather deny it.

• Is it a failure of risk management, as pointed out in [35], since the risky decision to not protect the particular cast statement had not been properly reviewed?

• Is it a wrong choice of programming language (as mentioned in [22])? This aspect is controversially disputed in the community. Opposite views can be found in [18], for example. The authors in [23] argue that it is partially a language fault. In [29], interestingly before the actual event in June 1996, details are listed concerning the decisions on Ada in the development phase as well as on the testing strategies. This is supported by arguments 1 and 3.

• Is it a systems engineering10 failure, as advocated by [11, 40] and in particular by [25, 26]?

10 Systems engineering in the context of computer-based systems comprises all engineering disciplines involved in the life cycle of such a system (see, e.g., [25, 26]).


It is well beyond the scope of this chapter to decide on the actual type of the error or even to discuss these aspects in full detail. However, the Ariane 5 accident can serve as a meaningful example of how complex modern software systems are, how many different persons and stakeholders are involved, each with a different view on the same facts, and how extremely cautious one has to be to make complex missions work out well.

2.1.6 Other Bugs in This Field of Application

• The Cryosat mission was supposed to explore the polar ice sheet and sea ice thickness. It failed on October 8, 2005. The Cryosat satellite was riding on top of a modified Russian Rockot launch vehicle (an SS-19 derivative with an additional third stage). A problem occurred in the software flight-control system regarding the added third stage: The third stage failed to generate the command to shut down the second-stage engine. Therefore, the second and third stages did not separate, the satellite did not get the final boost to reach its orbit, and it fell into the sea (cf. [5, 13, 16]). A similar problem led to the crash of NASA's CO2 hunter, the Orbiting Carbon Observatory (OCO). The lack of proper separation of the satellite from the Taurus XL launcher made it crash into the ocean near Antarctica on February 24, 2009 (see [34]).

• On August 22, 2014, two Galileo satellites did not reach their final intended circular orbit because of software errors in the upper stage of the Russian Fregat-MT (see [17, 42]). The flight was conducted by the Russian space agency Roscosmos.

• The Virgin Galactic SpaceShipTwo crashed on October 31, 2014, during its landing approach. The accident was provoked by one of the pilots activating the feathering system too early (cf. [49]). This system consists of a 90° twisting of the tail unit, which slows down the spaceship and can be effective only at lower altitude. Maybe the software should have been designed to prevent this early activation.

2.1.7 Similar Bugs

• On August 15, 2010, an integer overflow in the source code of the cryptocurrency Bitcoin allowed a transaction to generate over 184 billion Bitcoins (BTC) out of nothing. At that time, only about 1.5 million Bitcoins were in circulation! Bitcoin, one of the first cryptocurrencies, still plays a major role in this field. The basic idea is to create a currency that is not controlled by banks or countries but is completely transparent and globally linked. This is realized by providing the code in an open-source manner, allowing everybody to download it and to contribute to checking the system (see [33] for a number of technical details). A sophisticated mechanism using encryption and keys is installed to ensure privacy of certain data such as who actually performs which transaction. Bitcoin relies on the blockchain, a method to specify all transactions ever made in a chain of blocks of transactions.11 Like the source code, this chain of blocks is publicly available12 and verified by many persons contributing to the system via the compute power they provide to the Bitcoin software. These persons are called "miners" since they have to "dig" in a brute-force manner through many options to verify a transaction or block (see [46] for a nice flow chart on "How a Bitcoin transaction works"). Technically, their computers have to calculate cryptographic hash functions to verify a block. There is no way that educated guessing can identify the hash of a given input. Hence, finding the fitting hash is time-consuming. Thus, miners actually invest electric energy or its corresponding costs.13 The first miner to achieve the verification gets a reward of typically 50 BTC and a potential transaction fee. These rewards are the only way of actually generating new Bitcoins. The transaction fee is an incentive for miners, similar to a donation from the client who wants a transaction to be verified quickly. The remaining miners who find the fitting hash after the first one do not obtain any compensation, but serve as verification of the blockchain's state. As of today, Bitcoin mining only pays off via joining a mining pool. Note that the overall maximal amount of possible Bitcoins is designed to be 21 million BTC (see [7]). After that limit, mining will not result in additional 50 BTC rewards, but only in the transaction fees.

The transactions related to the overflow bug were contained in the block 74638.14 This block consisted of different sources of incoming and outgoing money in the form of Bitcoin addresses. The simplified picture of the scenario is (see also [7]):

– from my address which has 50.50 BTC:
  – send a huge number of BTC to an address A,
  – send the same huge number of BTC to an address B,
  – send 50.51 BTC to an address C.

The overflow of the two huge numbers actually resulted in −0.01 BTC (equivalent to −1,000,000 so-called Satoshis15), a negative outgoing amount of money interpreted as an additional positive income in the overall sum. The actual scenario is a little more complicated in its details. Fig. 2.3 shows one format of the Bitcoin blockchain dump for the problematic block 74638. The incoming amount of Bitcoins is visible from lines 3 and 6. It sums up to 50.50 BTC, where 50 BTC are the mining reward and 0.50 BTC are the investment of the person committing the transaction. The outgoing amount of money consisted of three different transactions: the two very large numbers (lines 7 and 8) and the amount of 50.51 BTC for the miner (50 BTC for mining, 0.51 as a transaction fee; line 4). Note that the large number showing up in lines 7 and 8 stems from a cast from the original number "7ffffffffff85ee0" in hexadecimal representation into a 64-bit integer.16

11 A transaction is the act of sending a certain amount of Bitcoins from one address to another, similar to classical transfers for regular accounts.
12 For specific transaction or block IDs, tools such as https://www.blockchain.com/ allow for direct insight into the public data.
13 Note that mining gets more expensive over time since the whole chain of blocks always has to be calculated: The historical hashes are part of the future tasks. This has the additional advantage that transactions or blocks get more and more secure over time: If somebody wants to modify an older transaction to commit a fraud, all subsequent blocks in the chain would have to be recomputed by this same person or computer, which is extremely costly and unlikely after a few blocks.
14 The literature is not clear on whether the transactions using the overflow bug were meant to be some testing or a real attack to actually generate money.
15 This name stems from one of the main developers of Bitcoin, Satoshi Nakamoto. Note that this name is a pseudonym; the actual person behind it is still unknown (see [46]).
16 Note that this is the correct number. In other sources such as [4, 7, 45], wrong numbers are indicated. Some of those seem to stem from an inaccurate cast from the original number “7ffffffffff85ee0” in hexadecimal representation into a floating-point format.


1 CBlock(hash=0000000000790ab3, ver=1, hashPrevBlock=0000000000606865, hashMerkleRoot=618eba, nTime=1281891957, nBits=1c00800e, nNonce=28192719, vtx=2)
2 CTransaction(hash=012cd8, ver=1, vin.size=1, vout.size=1, nLockTime=0)
3 CTxIn(COutPoint(000000, -1), coinbase 040e80001c028f00)
4 CTxOut(nValue=50.51000000, scriptPubKey=0x4F4BA55D1580F8C3A8A2C7)
5 CTransaction(hash=1d5e51, ver=1, vin.size=1, vout.size=2, nLockTime=0)
6 CTxIn(COutPoint(237fe8, 0), scriptSig=0xA87C02384E1F184B79C6AC)
7 CTxOut(nValue=92233720368.54275808, scriptPubKey=OP_DUP OP_HASH160 0xB7A7)
8 CTxOut(nValue=92233720368.54275808, scriptPubKey=OP_DUP OP_HASH160 0x1512)
9 vMerkleTree: 012cd8 1d5e51 618eba

Figure 2.3. A human-readable dump of the problematic block 74638. The data of [2] have been used (skipping its last two lines on lengthy hash numbers) and reformatted. Note that the details for this block are no longer available in Bitcoin's actual blockchain due to the correction of the bug discussed below.

1  --------------------------------------------------
2  check specific values (unsigned/signed long long)
3  --------------------------------------------------
4  ULLONG_MAX                 : 18446744073709551615
5  2^63                       : 9223372036854775808
6  out1 (literal)             : 9223372036854275808
7  out2 (conversion from hex) : 9223372036854275808
8  cast (out1+out2) to signed : -1000000
9  --------------------------------------------------
10 imitate original bitcoin formula
11 --------------------------------------------------
12 nValueOut = 0              : 0
13 nValueOut += out1          : 9223372036854275808
14 nValueOut += out2          : -1000000
15 nValueOut += 5051000000    : 5050000000
16 nValueIn                   : 0
17 mineReward                 : 5000000000
18 investment                 : 50000000
19 nValueIn                   : 5050000000
20 nValueIn - nValueOut       : 0

Figure 2.4. Output of source code examples contained in Sec. 8.2.6: The different variables and results of operations show Bitcoin’s overflow problem in generating a negative outgoing amount of −1,000,000 Satoshis equivalent to −0.01 BTC.
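Before walking through the output in Fig. 2.4, the essence of the overflow can be reproduced in a few lines of C++. The following sketch is not the original Bitcoin code (see Sec. 8.2.6 and [19] for that); the names out1, out2, and nValueOut mirror Fig. 2.4, and the hexadecimal constant is the one discussed in the text.

    #include <cstdint>
    #include <iostream>

    int main() {
        // The two huge outgoing amounts of block 74638, in Satoshis
        // (0x7ffffffffff85ee0 = 9223372036854275808 = 2^63 - 500,000).
        uint64_t out1 = 0x7ffffffffff85ee0ULL;
        uint64_t out2 = 0x7ffffffffff85ee0ULL;

        // Their sum wraps around modulo 2^64 ...
        uint64_t sum = out1 + out2;

        // ... and reinterpreted as a signed 64-bit integer (two's complement,
        // as on common platforms) it becomes -1,000,000 Satoshis.
        int64_t as_signed = static_cast<int64_t>(sum);

        std::cout << "out1 + out2 (unsigned): " << sum << '\n';
        std::cout << "out1 + out2 (signed)  : " << as_signed << '\n';  // -1000000

        // Adding the 50.51 BTC output (5,051,000,000 Satoshis) yields a total
        // outgoing value of 5,050,000,000 Satoshis, exactly the incoming
        // 50.50 BTC, so the block looked balanced despite its gigantic outputs.
        int64_t nValueOut = as_signed + 5051000000LL;
        std::cout << "total outgoing        : " << nValueOut << '\n';  // 5050000000
        return 0;
    }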

In Fig. 2.4, the output of a code example imitating some of the features is shown. The corresponding code example is available in the programming languages MATLAB, C++, and Python in Sec. 8.2.6 on page 235. The largest integer number that can be represented by 64 bits is contained in the C++ limit variable ULLONG_MAX in line 4. Line 5 holds the value of 2^63 in the decimal system; this value corresponds to the maximal representable value of a signed 64-bit integer plus 1, i.e., LLONG_MAX = 2^63 − 1. Lines 6 and 7 show the actually used values for the two immense outgoing transactions translated from the hexadecimal into the decimal system (note the slight difference in the sixth digit from the right: 7 instead of 2). The sum of those two values results in the overflow to generate −1,000,000. Lines 12–20 imitate the original Bitcoin formulas used to compute the sum of all incoming and outgoing transaction values (see [19] for details on Bitcoin's actual C++ code before and after the bug-fix). As visualized by line 20, balancing all incoming and outgoing transactions resulted in a value of zero, which is important for the system to have a clean block of transactions. Since the Bitcoin source code just checked all individual transactions for negative values and not for a reasonable absolute size, the overflow bug or potential attack was possible.

Once detected, the problem in the source code was fixed after a couple of hours (cf. [3, 19, 32]) by inserting additional checks for the individual transaction values. The fix of the problematic block needed some more attention. Obviously, it was impossible to do nothing about that block of transactions since it created many more BTC than possible by design. In the end, the decision was to also fork the blockchain itself to create a "good" new one that "time travel[ed]" [7] to the block before the critical one. The old blockchain still existed, but more and more users switched to the new one. 53 blocks after the overflow transaction block, i.e., at the block height of 74691, all computing nodes accepted the new blockchain as the real Bitcoin history (see [3, 9]).

• In 1997, the Pentium II and Pentium Pro processors failed to set an error flag in the event of converting a large negative floating-point number to a too-small integer format (see [8, 15]). This behavior is not compatible with the IEEE floating-point standard and is similar to the overflow problem of Ariane 5, but realized in hardware instead of software.

• The Mars Climate Orbiter (MCO) was launched December 11, 1998, and reached Mars on September 23, 1999 (cf. [43]). The mission included the search for water, atmospheric analysis, mapping the surface, and relay communications with the Mars Polar Lander upon an expected landing on December 3, 1999. The spacecraft encountered Mars on a trajectory that brought it too close to the planet, 57 km instead of the minimum 80 km, causing it to pass through the upper atmosphere and disintegrate. The summary of the findings on the loss of the Mars Climate Orbiter stated ([51], p. 3): "Spacecraft operating data needed for navigation were provided to the JPL navigation team by prime contractor Lockheed Martin in English units rather than the specified metric units." The wrong units were used in connection with the angular momentum desaturation (AMD) procedure that was supposed to slow down the spacecraft flywheels once in a while by firing the thrusters. To this end, relevant data were sent to the ground and stored in the AMD file. The data from this file were then used to model the resulting forces on the spacecraft. After firing of the thrusters based on an impulse bit, the velocity change was computed. The impulse bit modeled the thruster performance. These calculations were done on board the spacecraft and also on the ground, where the erroneous computations occurred. The Software Interface Specification that defined the format of the AMD file specifies the units for the impulse bit to be Newton seconds, in other words, in metric units. For the ground software, the transferred value of the impulse bit was given in English (or imperial) units as pound seconds. Therefore, the software underestimated the effect of the thruster firing by a factor of 4.45, representing the factor between pound and Newton (see [43]).

• A similar mixup of imperial vs. metric units occurred in July 1983. Air Canada flight 143, a Boeing 767, ran out of fuel halfway between Montreal and Edmonton (see, e.g., [48]). At the time of the incident, Canada was converting to the metric system. Unfortunately, the electronic gauging system did not work properly in the specific aircraft. Therefore, manual measurements had to be performed using floatsticks. Wrong conversion factors were used to convert the manual measurements into total fuel weight: The actual fuel weight was calculated in pounds (imperial system) but assumed to be in kilograms (metric system),17 differing by a factor of about 2.2 (cf. [50]). Due to some lucky circumstances, the plane (tauntingly called the Gimli Glider) was steered to glide to the nearby Gimli airport for an emergency landing.

17 Before the incident, Air Canada had recently changed from the imperial to the metric system, at least for the aircraft under consideration.

• Bank account numbers in Germany can have different lengths depending on the bank. However, the maximum length is 10 digits. When converting shorter numbers to 10 digits, the software can write the given shorter number right-aligned and fill up the missing leading digits with zeros. But sometimes, due to erroneous code, the numbers are stored left-aligned so that the right side is filled up by zeros and the bank account number is wrong (a short sketch of the two variants follows after this list). This happened in January 2005, when the German government wanted to distribute unemployment benefits (cf. [47]).

• The rendezvous between the shuttle Endeavour and the Intelsat 603 spacecraft on May 14, 1992, nearly failed. The problem could be traced back to a mismatch in precision (see [20, 37, 38]): The software routine responsible for calculating rendezvous firings did not converge to a solution since the difference of state-vector variables (holding the shuttle's current position and velocity) and of limits for the calculation was small but crucial. The software used a different number of digits for the actual values and the specified bounds that should be reached as exact values. Because the actual values and the bounds never precisely matched, the calculations did not stop. The teams at NASA were able to come up with a workaround to complete the rendezvous and later fixed the software problem for future flights.
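The bank-account mixup above is a pure formatting problem. The following minimal C++ sketch (with an invented seven-digit account number, purely for illustration) contrasts the correct right-aligned padding with the faulty left-aligned variant.

    #include <iostream>
    #include <string>

    int main() {
        std::string account = "1234567";  // invented 7-digit account number

        // Correct: right-align and pad with leading zeros -> "0001234567"
        std::string right_aligned = std::string(10 - account.size(), '0') + account;

        // Faulty: left-align and pad with trailing zeros -> "1234567000",
        // which is a different (and usually invalid) account number.
        std::string left_aligned = account + std::string(10 - account.size(), '0');

        std::cout << "right-aligned: " << right_aligned << '\n';
        std::cout << "left-aligned : " << left_aligned  << '\n';
        return 0;
    }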

Bibliography

[1] R. L. Baber. The Ariane 5 explosion as seen by a software engineer. date accessed: 2017-10-19, 2002. http://www.cas.mcmaster.ca/~baber/TechnicalReports/Ariane5/Ariane5.htm

[2] Bitcoin. Common vulnerabilities and exposures. date accessed: 2018-08-16. https://en.bitcoin.it/wiki/Common_Vulnerabilities_and_Exposures#CVE-2010-5139


[3] Bitcoin Wiki. Value overflow incident. date accessed: 2018-07-31. https://en.bitcoin.it/wiki/Value_overflow_incident

[4] Bitcointalk. Strange block 74638. date accessed: 2018-08-18, 2010. https://bitcointalk.org/index.php?topic=822.0

[5] H. Briggs. Cryosat rocket fault laid bare. date accessed: 2018-07-22, 2005. http://news.bbc.co.uk/2/hi/science/nature/4381840.stm

[6] B. Bruegge and A. H. Dutoit. Object-Oriented Software Engineering: Using UML, Patterns, and Java. Prentice Hall, 3rd edition, 2010

[7] Bruno. The curious case of 184 billion Bitcoin. date accessed: 2018-07-31, 2018. https://bitfalls.com/2018/01/14/curious-case-184-billion-bitcoin/

[8] R. R. Collins. Inside the Pentium II Math Bug - Dan-0411 Rocks the Industry. Dr. Dobb's Journal, 22(8), 1997

[9] B. Dawson. That's not normal—the performance of odd floats. date accessed: 2018-07-25, 2012. https://randomascii.wordpress.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/

[10] V. De Florio. Application-Layer Fault-Tolerance Protocols. IGI Global, 2009

[11] M. Dowson. The Ariane 5 software failure. ACM SIGSOFT Software Engineering Notes, 22(2):84, 1997

[12] Z. Drmač and K. Veselić. New Fast and Accurate Jacobi SVD Algorithm. II. SIAM Journal on Matrix Analysis and Applications, 29(4):1343–1362, 2008

[13] ESA. CryoSat Mission lost due to launch failure. date accessed: 2018-07-22, 2005. http://www.esa.int/Our_Activities/Observing_the_Earth/CryoSat/CryoSat_Mission_lost_due_to_launch_failure

[14] ESA Media Relations Division. ESA turns 30! A successful track record for Europe in space. date accessed: 2017-10-19. http://www.esa.int/For_Media/Press_Releases/ESA_turns_30!_A_successful_track_record_for_Europe_in_space

[15] L. M. Fisher. Flaw Reported in New Intel Chip. The New York Times online, May 6, 1997

[16] R. Francis. ESA's ICE MISSION CryoSat: More important than ever. In ESA bulletin. Bulletin ASE., pages 10–18. ESA, 2010. https://earth.esa.int/c/document_library/get_file?folderId=13019&name=DLFE-752.pdf

[17] Galileo GNSS. Galileo FOC satellites launch failure conclusions. date accessed: 2018-07-22, 2018. http://galileognss.eu/galileo-foc-satellites-launch-failure-conclusions/

[18] K. Garlington. Critique of "Put it in the contract: The lessons of Ariane." date accessed: 2017-10-19. http://ansymore.uantwerpen.be/system/files/uploads/courses/SE3BAC/p06_2_ariane.html


[19] Github. Bitcoin—fix for block 74638 overflow output transaction. date accessed: 2018-08-16, 2010. https://github.com/bitcoin/bitcoin/commit/d4c6b90ca3f9b47adb1b2724a0c3514f80635c84#diff-118fcbaaba162ba17933c7893247df3aR1013

[20] T. L. Hardy. Software and System Safety. AuthorHouse, 2012

[21] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002

[22] J.-M. Jézéquel and B. Meyer. Design by Contract: The lessons of Ariane. IEEE Computer, 30(1):129–130, 1997

[23] P. B. Ladkin. The Ariane 5 Accident: A Programming Problem? Essays, 1998. http://www.rvs.uni-bielefeld.de/publications/Reports/ariane.html

[24] V. Lefevre and J.-M. Muller. Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. In Proceedings 15th IEEE Symposium on Computer Arithmetic. ARITH-15 2001, pages 111–118. IEEE Comput. Soc., 2001

[25] G. Le Lann. The Ariane 5 Flight 501 Failure: A Case Study in System Engineering for Computing Systems. Technical report, INRIA, 1996

[26] G. Le Lann. An Analysis of the Ariane 5 Flight 501 Failure: A System Engineering Perspective. IEEE Engineering of Computer-Based Systems, pages 339–346, 1997

[27] J.-J. Lévy. Un petit bogue, un grand boum!, 2010

[28] J.-L. Lions. Ariane 5 Flight 501 Failure. Ariane 501 Inquiry Board Report, 1996. http://esamultimedia.esa.int/docs/esa-x-1819eng.pdf

[29] J. N. Monfort and V. Q. Ribal. Ariane 5: Development of the On-Board Software. In M. Toussaint (editor), Ada in Europe, pages 111–123. Springer, 1996

[30] Motorola. MC68881/MC68882 Floating-Point Coprocessor User's Manual, 1987

[31] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres. Handbook of Floating-Point Arithmetic. Birkhäuser Boston, 2010

[32] S. Nakamoto. Re: overflow bug SERIOUS. date accessed: 2018-08-07, 2010. http://satoshinakamoto.me/2010/08/15/re-overflow-bug-serious/

[33] S. Nakamoto. Bitcoin: A peer-to-peer electronic cash system. www.Bitcoin.org, 2008

[34] NASA. Overview of the Orbiting Carbon Observatory (OCO) mishap investigation results. date accessed: 2018-07-22, 2009. https://www.nasa.gov/pdf/369037main_OCOexecutivesummary_71609.pdf

[35] B. Nuseibeh. Ariane 5: Who Dunnit? IEEE Software, 14(3):15–16, 1997

[36] M. L. Overton. Numerical Computing with IEEE Floating Point Arithmetic. SIAM, 2001


[37] J. Paul. Endeavour rendezvous software fix. date accessed: 2018-07-22, 1992. http://catless.ncl.ac.uk/Risks/13.56.html#subj5

[38] I. Peterson. Fatal Defect: Chasing Killer Computer Bugs. Vintage Books, reprint edition, 1996

[39] C. Poivey, O. Notebaert, P. Garnier, T. Carriere, J. Beaucour, B. Steiger, J. Salle, F. Bezerra, R. Ecoffet, B. Cadot, M.-C. Calvet, and P. Simon. Proton SEU Test of MC68020, MC68882, TMS320C25 on the ARIANE 5 Launcher On Board Computer (OBC) and Inertial Reference System (SRI). In Proceedings of the Third European Conference on Radiation and Its Effects on Components and Systems, pages 289–295. IEEE, 1995

[40] G. Rozenberg and F. W. Vaandrager. Lectures on Embedded Systems. Volume 1494 of Lecture Notes in Computer Science. Springer, 1998

[41] SIAM. Inquiry Board Traces Ariane 5 Failure to Overflow Error. SIAM News, 29(8), 1996

[42] Sputnik. Galileo satellites incident likely result of software errors. date accessed: 2018-07-22, 2014. https://sptnkne.ws/je9z

[43] A. G. Stephenson, D. R. Mulville, F. H. Bauer, G. A. Dukeman, P. Norvig, L. S. LaPiana, and R. Sackheim. Mars Climate Orbiter Mishap Investigation Board: Phase I Report. Technical report, NASA, 1999. https://llis.nasa.gov/llis_lib/pdf/1009464main1_0641-mr.pdf

[44] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer, 3rd edition, 2002

[45] The Shayan. Once there were 184 billion Bitcoins. date accessed: 2018-08-18, 2017. http://blog.theshayan.com/2017/12/01/once-there-were-184-billion-bitcoins/

[46] A. Wawro. 7 things you need to know about bitcoin. date accessed: 2018-08-16, 2013. https://www.pcworld.com/article/2033715/7-things-you-need-to-know-about-bitcoin.html

[47] D. Weber-Wulff. Two German projects: Toll and Dole. date accessed: 2018-07-22, 2005. https://catless.ncl.ac.uk/Risks/23.65.html

[48] Wikipedia. Gimli Glider. date accessed: 2018-07-22. https://en.wikipedia.org/wiki/Gimli_Glider

[49] Wikipedia. VSS Enterprise crash, Virgin Galactic. date accessed: 2018-07-22. https://en.wikipedia.org/wiki/VSS_Enterprise_crash

[50] R. Witkin. Jet's fuel ran out after metric conversion error. The New York Times, July 30, 1983. https://www.nytimes.com/1983/07/30/us/jet-s-fuel-ran-out-after-metric-conversion-errors.html

[51] T. Young, J. Arnold, T. Brackey, M. Carr, D. Dwoyer, R. Fogleman, R. Jacobson, H. Kottler, P. Lyman, and J. Maguire. Mars Program Independent Assessment Team: Summary Report. Technical report, NASA, 2000. https://www.nasa.gov/news/reports/MP_Previous_Reports.html

[52] T. Zehrer. Software engineering Ariane 5. date accessed: 2017-10-19. http://www.thomas-zehrer.de/?p=49


2.2 Y2K and Data Formats

"Why're you snapping your fingers?"
"It keeps the elephants away."
"But there aren't any elephants."
"See? It works."
—Popular joke

2.2.1 Overview

• Field of application: various
• What happened: fear of widespread breakdown of computer systems on January 1, 2000
• Loss: billions of dollars
• Type of failure: too-short data format in legacy software

2.2.2 Introduction

The Y2K (Year 2000) Bug was an error caused by coding the year only by its last two digits: 99 versus 1999. It is also called the Millennium Bug. The cause of this bug was the need to save storage, given the memory shortages at the beginning of the computer era. This led to using only two digits for storing the actual year in computer languages such as COBOL. The full date was abbreviated in the form MMDDYY for month, day, and year. In the 1950s and 1960s, the turn of the millennium seemed very far away, and nobody believed that the code written at that time would last so long (see, e.g., [2, 6, 13]). Additionally, the year 2000 was a leap year. In general, deciding whether a year is a leap year or not is also a common source of possible errors (cf. [13]).

2.2.3 Description of the Y2K Problem

First we describe a typical example which actually occurred in 2002: The British pensioner Joseph Dickinson, age 103, was called in from his local hospital for an eye test and was told to bring his parents, too. The computer had stored 1899—his actual year of birth—only as 99 and was therefore assuming 1999 as his year of birth (see [3]).

At the end of the 20th century, many and much more severe incidents were expected, leading to the breakdown of elevators, production facilities, account management systems, and other information management systems, but also affecting any kind of technical equipment based on microcomputers (embedded systems) such as cars, trains, airplanes, medical equipment, etc. Governments and companies anticipated that a combination of different breakdowns could lead to chaotic situations in developed countries. Therefore, many countries started projects to check whether their computer systems would fail at the turn of the millennium. This could be done by simulating the event: Manually change the date and time of the system to December 31, 1999, 23:59, and monitor the system's behavior during the change of date. After detecting a problem, possible workarounds could be applied (cf. [13]):

• Changing the affected program, file, or database so that it used four digits for storing the year. That was the best but also the most involved solution.

• Changing the six digits MMDDYY to use three digits for the year and three digits for the day as the nth day in the year, ignoring the month totally. Therefore, "1999" would be represented as "099" (abbreviating the leading 19 in 1999 by 0), and "2001" by "101" (indicating 20 by 1). This shifts the rollover problem to the year 2899 (represented as 999). "February 1, 2020," e.g., would be stored as "032120"—the 32nd day in the year 120, which represents the year 2020.

• Windowing: Programs kept the two-digit format for years and obtained small additional code snippets to decide on the actual number when called for specific functions. For information on pensions, e.g., "2000" is of course an inadequate number, and "00" has to be "1900." (A minimal sketch of this idea is given at the end of this subsection.)

The first remedy needs to rewrite the date-related code. The other two cures insert a few lines of code between reading the date and using the date. The inserted code then does the transformation to the unambiguous eight-digit form used by other software (parts) outside the problematic code.

Different countries made efforts prior to the turn of the millennium in order to detect and avoid possible breakdowns. In October 1998, the United States government signed the Year 2000 Information and Readiness Disclosure Act in order to help companies deal with the problem. In December 1998, the UN established the International Y2K Cooperation Center in order to gather information and assist in crisis management. Billions of dollars were spent by different countries to safeguard the digital systems against the Millennium Bug (see, e.g., [5, 13]).

Surprisingly, at the turn of the year there were no major incidents, neither in countries that invested a lot of effort in preventing breakdowns nor in countries that ignored the problem, such as South Korea or Italy. Schools and small companies that did nothing for prevention were also not affected by major problems. In hindsight, there are opposing opinions on the seriousness of the Y2K threat (cf. [4, 7, 8, 11]). Some people argue that in critical environments such as financial or medical services there had to be a safety check and a possible update in order to be safe rather than sorry. Others state that the risk had been extremely overrated, wasting valuable time and money.

The Y2K problem is an extreme example of legacy software problems that are considered a serious threat to many computer systems. Obvious drawbacks of old software are as follows:

• The costs for maintaining old hardware and software might be much higher than replacing both by recent equipment.

• Legacy systems are hard to maintain, improve, expand, or integrate. Frequently, legacy software is not well documented, and the original programmers have already retired or died.

• Older software is much more vulnerable to hacker attacks.

Therefore, the use of legacy software should be avoided (see [14]) if possible. Note that replacing large legacy code frequently is very cumbersome and, thus, expensive. In particular for certified software, as used in the aerospace or automotive industry, recertifying a new version may be out of scope. As a consequence of the Y2K problem, the IT industry has learned some lessons in developing software (see, e.g., [7]), such as better disaster planning, better documentation of systems, and better preparation for the future.
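As an illustration of the windowing workaround, here is a minimal C++ sketch; the pivot value 30 is an arbitrary choice for this example, not a value taken from any particular system.

    #include <iostream>

    // Windowing: map a two-digit year to a four-digit year using a pivot.
    // Two-digit years below the pivot are interpreted as 20xx, the rest as 19xx.
    // The pivot 30 is an assumption for illustration only.
    int expand_year(int two_digit_year, int pivot = 30) {
        return (two_digit_year < pivot) ? 2000 + two_digit_year
                                        : 1900 + two_digit_year;
    }

    int main() {
        std::cout << expand_year(99) << '\n';  // 1999
        std::cout << expand_year(0)  << '\n';  // 2000
        std::cout << expand_year(27) << '\n';  // 2027
        return 0;
    }

The drawback of windowing is visible in the sketch itself: the ambiguity is only shifted, not removed, since any fixed pivot eventually misinterprets some years.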


2.2.4 Y2K-Related Software Bugs

The following (incomplete) list contains documented, actual incidents ascribed to the Y2K Bug:

• In Sheffield, UK, two abortions were carried out based on incorrect risk assessments for Down syndrome that resulted from a miscalculation of the mothers' ages. Four babies with Down syndrome were born to mothers who had been categorized in the low-risk group (see [13]).

• Some institutions displayed the year 2000 as 19100 or even 0100 on their websites, e.g., the US Naval Observatory and the French national weather forecast Meteo France (cf. [13]).

• Caused by a faulty patch that should have fixed a Y2K bug, US spy satellites transmitted unreadable data for three days. This is an example of a "bug in a bugfix" (see [4, 8, 9]).

• Air transportation on the US East Coast was struck by delays resulting from a buggy Y2K fix (see [8]).

• There were a few difficulties with money transfers, ATM cards, registration of automobiles, and driver's licenses (cf. [4, 8]).

• Minor problems arose in nuclear power plants in the US and Spain, none being a safety hazard (see, e.g., [8, 9]).

• Due to a faulty calculation of their age, employees of the Deutsche Oper Berlin were temporarily refused subsidies for their families (cf. [8, 9]).

• In Britain, thousands of babies were initially registered as 100 years old (see, e.g., [4, 9]).

• In Malaysia, medical equipment such as defibrillators and heart monitors failed (cf. [8]).

• Credit card holders received bills twice (see, e.g., [4, 9]).

2.2.5 Similar Bugs

The Y2K Bug has little nephews or nieces that suffer from similar date format restrictions or irregular date behavior:

• NASA Space Shuttle Year End Rollover problem: NASA avoided missions lasting over New Year's Eve because they were not sure whether the control computers would run into trouble at the transition from day 365 to day 0 (see [1]).

• Leap second bug: Sometimes a leap second is introduced in the official coordinated universal time UTC in order to adjust the days to the exact rotation of the Earth. This has already happened more than 10 times since 1972. In some code, this can lead to a breakdown. At the end of June 2012, websites from LinkedIn to Reddit, and websites with Java server applications in Hadoop and Mozilla, crashed because of the inserted leap second (see [6, 12, 13]).

• Leap year bug: In some programs, the decision whether a year has to be treated as a leap year is not implemented correctly. This might concern past years, but can also lead to current and future failures on February 28 or 29. The correct rules for leap years are nested in the sense that a year is a leap year if it is divisible by 4, and not divisible by 100 except if divisible by 400 (cf. [6, 12, 13]); i.e., the year 1600 was a leap year, but 1900 was not. (A compact version of the rule is sketched after this list.)

• UNIX time problem or year 2038 problem: In the operating system UNIX, the time and date are stored in time_t as a signed long integer counting the number of seconds since January 1, 1970. On 32-bit systems, the largest representable integer number is 2^31 − 1, allowing only dates up to the year 2038. This will not be a problem on 64-bit systems, supported by many companies, because of the longer data format (see [12]).

• Similar data format problems were related to January 4, 1975 (caused by data overflow in a 12-bit field), the year 2010 (confusion between hexadecimal and binary coding, also called the "Y2K+10" problem), and September 9, 1999 (written as 9999, in possible conflict with using 9999 for representing an unknown date); cf. [12].

• Older GPS devices express the date as a week number and a day of the week, with 10 bits for the week number. So every 1,024 weeks there is a rollover and the counting starts again. The week count started on January 6, 1980, and the first rollover occurred on August 22, 1999 (see [12]).

• There is a whole list of similar problems depending on the operating system or the device. Older Samsung mobile phones have a year 2015 problem, Apple Macintosh computers suffer from a year 2040 problem, and S/370 IBM mainframe computers are affected by a year 2042 problem (cf. [12]).

• Some F-22 Raptor fighter jets also encountered a weird bug related to crossing the date line from west to east. On the way from Hawaii to Okinawa on February 11, 2007, four Stealth Raptors had to cross the International Date Line, which caused the on-board computers to crash. The software was designed and tested for the Western Hemisphere only. A software patch was necessary to fix the bug (see [9, 10]).
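For reference, the nested leap-year rule quoted in the list above can be written compactly; the following C++ sketch is only an illustration.

    #include <iostream>

    // A year is a leap year if it is divisible by 4,
    // except for years divisible by 100, unless they are also divisible by 400.
    bool is_leap_year(int year) {
        return (year % 4 == 0) && (year % 100 != 0 || year % 400 == 0);
    }

    int main() {
        std::cout << is_leap_year(1600) << '\n';  // 1 (leap year)
        std::cout << is_leap_year(1900) << '\n';  // 0 (not a leap year)
        std::cout << is_leap_year(2000) << '\n';  // 1 (leap year)
        return 0;
    }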

Bibliography

[1] C. Bergin. NASA solves YERO problem for shuttle. date accessed: 2018-01-19, 2007. https://www.nasaspaceflight.com/2007/02/nasa-solves-yero-problem-for-shuttle/

[2] Encyclopaedia Britannica. Y2K bug. date accessed: 2018-01-22. https://www.britannica.com/technology/Y2K-bug

[3] A. Goldstein. 103-year-old man told to bring parents for eye test. date accessed: 2018-03-07, 2002. https://catless.ncl.ac.uk/Risks/22.20.html#subj3

[4] T. Gottfried. Lessons of Y2K. date accessed: 2018-03-07, 2000. https://catless.ncl.ac.uk/Risks/20/77#subj6


[5] J. Hodas. What are the main problems with the Y2K computer crisis and how are people trying to solve them? Scientific American. date accessed: 2018-03-07. https://www.scientificamerican.com/article/what-are-the-main-problem/

[6] H. Holtkamp and P. B. Ladkin. The Year 2000 problem. date accessed: 2018-03-07, 2002. http://www.rvs.uni-bielefeld.de/research/Y2K/year2000.html

[7] P. Krill. Y2k: 10 years later. date accessed: 2018-03-07, 2010. https://www.infoworld.com/article/2683640/risk-management/y2k10-years-later.html

[8] E. Roberts. Did the Y2K bug actually cause any problems? date accessed: 2018-03-07, 2000. https://www.cs.swarthmore.edu/~eroberts/cs91/projects/y2k/Y2K_Errors.html

[9] J. Saturen. Y2K—partial list of prominent computer glitches since 1-1-00. date accessed: 2018-03-07, 2000. http://rense.com/ufo6/glitches.htm

[10] Slashdot. Software bug halts F-22 flight. date accessed: 2018-03-07, 2007. https://it.slashdot.org/story/07/02/25/2038217/software-bug-halts-f-22-flight

[11] M. Thomas. Myths of the Year 2000. Computer Journal, 41(2), 1998. http://www.rvs.uni-bielefeld.de/research/Y2K/mthomas/BCSY2Karticle.html

[12] Wikipedia. Time formatting and storage bugs. date accessed: 2018-03-07. https://en.wikipedia.org/wiki/Time_formatting_and_storage_bugs

[13] Wikipedia. Year 2000 problem. date accessed: 2018-03-07. https://en.wikipedia.org/wiki/Year_2000_problem

[14] Wikipedia. Legacy system. date accessed: 2018-03-13, 2017. https://en.wikipedia.org/wiki/Legacy_system

2.3 Vancouver Stock Exchange

Many mickles make a muckle.
—George Washington, Writings

2.3.1 Overview

• Field of application: financial market
• What happened: wrong computation of a stock market index
• Loss: —
• Type of failure: accumulated rounding error


2.3.2 Introduction

Stock market indexes such as the Dow Jones Industrial Average, NASDAQ, or DAX are a weighted mean of a collection of company shares at a stock market. They were introduced in order to visualize the overall development of the value of important shares. These indexes indicate whether the overall stock market moves in a positive or negative direction. Nowadays, such indexes are also used for investments via derivatives that are assorted exactly following the stock contained in the index. This offers a relatively safe investment strategy which can easily be followed. Hence, an index can have a strong influence on the value of a stock, and including a company's stock in an index usually increases the value of the specific stock. The so-called Laspeyres index (see [17]), often used in specifying an index, is defined by the formula

    I_t = \frac{\sum_j sp_{j,t} \, q_j}{\sum_j sp_{j,0} \, q_j} \cdot B,

where B is the base value with which the index is initialized or normalized, t is the instant of time, and q_j are possible weights for share sp_{j,0}.

The Vancouver Stock Exchange (VSE) was founded in 1906 in Canada following the more important Toronto Stock Exchange (TSX) and the Montreal Stock Exchange (MSE). The VSE had a dubious reputation, and many of the traded stocks were suspected to be frauds or failures. In 1999, the VSE merged with the Alberta Stock Exchange and the MSE to form the Canadian Venture Exchange (CDNX).18

In January 1982, an exchange index (in the following denoted by VSE index I) was introduced at the Vancouver Stock Exchange. This index was computed as the weighted mean value of all VSE stock prices sp_j, j = 1, ..., N. It was initialized to the value 1,000.000 by applying a weight w that was the same for all shares:

    I = \sum_{j=1}^{N} (sp_j \cdot w)          (2.1)
      = w \cdot \sum_{j=1}^{N} sp_j .          (2.2)

This formula follows the Laspeyres definition with B = 1,000 and all weights q_j equal to 1. At that time, software was already used to compute the indexes and other economic data at stock exchanges. According to the two main newspaper sources, The Wall Street Journal [15] and the Toronto Star [11], the VSE index contained about N = 1,400 to 1,500 companies. After 22 months, the index had dropped from its initial value of 1,000.000 to a value of 524.811 on Nov. 25, 1983 (see [11]). This was quite surprising because most of the traded stocks showed a growing price sp_j that should have been matched by a growing index I due to Eq. (2.1). Some consultants from Toronto and California were brought in to analyze the underlying software and question the significance of the index.

18 https://en.wikipedia.org/wiki/Vancouver_Stock_Exchange

2.3.3 Detailed Description of the Bug

To be able to understand the underlying details of the problem, the concepts of rounding and cost analysis of a computer program are relevant. Therefore, details on these concepts are presented in Excursions 5 and 6. Readers already familiar with these concepts may continue reading directly on page 33.

Excursion 5: Rounding to Machine Numbers
Restricting all numbers to the fixed-point format (see Excursion 2 on page 12 for details on this format), any given real value has to be converted into this format in a first step. Furthermore, each operation such as multiplication, division, or addition may generate results that have more than the allowed number of digits. In this case, there are different ways to force the number into the prescribed format. Formally, this can be expressed as forcing x = r_1 ... r_p . s_1 ... s_q s_{q+1} ... into the form x̃ = r̃_1 ... r̃_p . s̃_1 ... s̃_q. This process is called rounding, and there are different possibilities of doing this. Simply ignoring and deleting s_{q+1} s_{q+2} ... and setting s̃_i = s_i for i = 1, ..., q is called truncating. Other possibilities are rounding to the nearest allowed number in fixed-point format either in the direction of 0, or −∞, or +∞. This will introduce a rounding error of the size 10^{-q}, depending on the last decimal. The best way of rounding is rounding to the nearest neighbor: For s_{q+1} < 5 we set s̃_q = s_q, and for s_{q+1} > 5 we can define x̃ = r_1 ... r_p . s_1 ... s_q + 10^{-q}. In this case, the rounding error is bounded by 0.5 · 10^{-q}. Hence, for q = 0 and rounding to an integer, the rounding process will introduce an error of the size 1.0 for truncation and 0.5 for rounding to the nearest integer.

For the optimal rounding to the nearest integer an ambiguity occurs. Rounding x = 0.5 to an integer, for example, allows the choice x̃ = 0 or x̃ = 1. To introduce a deterministic criterion and to avoid a systematic error (i.e., always rounding upwards or always rounding downwards), a reasonable rule in this case is to always round to the next even neighbor, so in our example above to zero, because then we can expect that rounding downwards and rounding upwards will occur with the same probability.

The same rounding process is always applied for operations with binary floating-point numbers with mantissa 0.m_1 ... m_t m_{t+1} .... In this case, the m_i have only 0 or 1 as allowed values, and the rounding of the mantissa to the nearest allowed neighbor can result in 0.m_1 ... m_t or 0.m_1 ... m_t + 2^{-t} for rounding downwards or upwards, respectively. The following table shows the correct rounding and the corresponding relative error for single precision:

    exact mantissa (binary format)          rounded mantissa (binary format)   rel. error
    1.00000000000000000000001|10...         1.00000000000000000000010|         −6.0 · 10^{-8}
    1.00000000000000000000001|01000...      1.00000000000000000000001|          3.0 · 10^{-8}
    1.1111111111111111111111|0111...        1.1111111111111111111111|           3.0 · 10^{-8}
    1.11111111111111111111111|10000...      10.0000000000000000000000|         −3.0 · 10^{-8}

The round-off error generated in one rounding step from a number x to a machine number x̃ can be measured as the absolute error ε_absolute = |x − x̃| or as the relative rounding error ε_relative = ε_absolute / |x| = |x − x̃| / |x|. Generally, the absolute error does not contain useful information because an absolute error of the size 1 for a number x = 1 is significant, but for x = 100,000,000 it is negligible. On the other hand, for x = 0 the relative error is not well-defined. Hence, for problems where the result will be a number close to zero, the relative error in the computed result can be very large. Even if the round-off error of a single numerical operation is relatively small, potentially severe accumulative effects may appear, for example, for a substantial sequence of basic operations. For further details on rounding, see [2, 6, 19].
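As a small, self-contained illustration of the difference between truncating and rounding to the nearest neighbor (not taken from the book's own code examples), the following C++ sketch forces a value to q = 3 decimals in both ways.

    #include <cmath>
    #include <iostream>

    int main() {
        double x = 2.7186;      // arbitrary example value
        double scale = 1000.0;  // q = 3 decimal digits

        double truncated = std::trunc(x * scale) / scale;  // 2.718, error bounded by 10^-3
        double nearest   = std::round(x * scale) / scale;  // 2.719, error bounded by 0.5 * 10^-3

        std::cout << "truncated: " << truncated << '\n';
        std::cout << "nearest  : " << nearest   << '\n';
        return 0;
    }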


Excursion 6: Complexity
As we have seen above, it is important to check the accuracy of the results of computer programs that could be affected by rounding effects or other programming errors. But it is also very important to make sure that the results are delivered by the program in reasonable time. One way to analyze the estimated runtime is to determine the complexity of the algorithm behind the implementation. Complexity is usually measured as a function of n, where n represents the size or number of the input data (frequently a vector or array of length n). A fast, comparably cheap algorithm has low complexity, e.g., a linear complexity of n. An example of such an algorithm is the search for the maximum entry in a vector a of length n, as visualized in the following MATLAB example:

    n = length(a)
    max_value = a(1)
    for i = 2:n
        if a(i) > max_value
            max_value = a(i)
        end
    end

The linear complexity in this example is due to the loop which visits each entry of the input vector a once; obviously, looking for the maximum entry cannot be done without checking each entry. If we generalize this example to a function that sorts the entries of a (a being still of size n) according to their size, a program similar to the following one will need two nested loops:

    n = length(a)
    for i = 1:n-1
        max_value = a(i)
        k = i
        for j = i+1:n
            if a(j) > max_value
                max_value = a(j)
                k = j
            end
        end
        copy = a(i)
        a(i) = a(k)
        a(k) = copy
    end

The complexity for this sorting method is n^2 (quadratic), because in each of the loops, the program needs to run over the entries of a. Hence, on the order of n^2 operations—comparisons and assignments, in this case—have to be performed. Usually, we do not distinguish between basic operations such as addition, multiplication, or division since we expect roughly the same runtime for all elementary operations on modern processors.

An analysis of the VSE index problem showed that the software used to compute the index avoided recalculating the complete index every time a change occurred (around 2,800 or 3,000 times per day). A direct computation of the new value for I according to formula (2.1) is of linear complexity (i.e., it takes about N operations with a constant factor—here a factor of 2) per changed stock price. In order to improve the runtime, the software developers decided to modify the computation to incorporate only the actual, current changes to the previous values of individual stock prices sp_j. Therefore, the update formula for changes in a certain stock value sp_j had the following form:

    I_new := I_old + (sp_{j,new} − sp_{j,old}) \cdot w .          (2.3)

This operation is obviously of constant complexity: Only three operations instead of N are needed to compute the new index value. Furthermore, the index used four decimals for the calculations but truncated the resulting updated value to three decimals; i.e., the last decimal was simply deleted instead of using a proper rounding as discussed in Excursion 5. A rough calculation shows that truncating the last decimal on average leads to an error of ε = 0.00045 in each step: Assuming a uniform distribution of possible digits for the last (fourth) decimal, i.e., each of the digits 0 to 9 is equiprobable to appear in the decimal s_4, we build the mean value of all cases. To this end, we sum up all possible contributions (i · 10^{-4}) of all digits i and divide by 10 (the number of available digits):

    ε = \frac{\sum_{i=0}^{9} i \cdot 10^{-4}}{10} = 10^{-5} \cdot \sum_{i=0}^{9} i = 10^{-5} \cdot \frac{9 \cdot 10}{2} = 0.00045 .

Only if the fourth decimal is always equal to zero does no perturbation appear by cutting off this decimal. In general, an average error of about ε = 0.00045 arises in each call of the update formula (2.3). Performing this action 2,800 times per day and for 22 months, i.e., about 440 trading days, adds up to an expected total error in the form of an underestimate of Δ_1 := 2,800 · 440 · ε = 554.4, and assuming 3,000 changes per day even to Δ_2 := 3,000 · 440 · ε = 594.0. Hence, we expect the (erroneous) index value Ĩ on November 25, 1983, to be close to the difference of the exact value (1098.892) and Δ_1 and Δ_2, respectively:

    Ĩ_1 = 1098.892 − Δ_1 = 544.492 ,
    Ĩ_2 = 1098.892 − Δ_2 = 504.892 .

Including the accumulated truncation error into the true value of 1098.892 should result in a rough estimate close to 544 or 505—compared to the actual output 524.811 of the erroneous program.
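The rough estimate above can be reproduced with a few lines of C++; the numbers are taken from the text, and the snippet is only a sanity check, not the simulation discussed in the next section.

    #include <iostream>

    int main() {
        double eps  = 0.00045;  // average truncation error per update
        double days = 440.0;    // roughly 22 months of trading days

        double delta1 = 2800.0 * days * eps;  // 554.4
        double delta2 = 3000.0 * days * eps;  // 594.0

        std::cout << "Delta_1 = " << delta1
                  << ", estimated index: " << 1098.892 - delta1 << '\n';  // 544.492
        std::cout << "Delta_2 = " << delta2
                  << ", estimated index: " << 1098.892 - delta2 << '\n';  // 504.892
        return 0;
    }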

2.3.4 Discussion

Considering the sources relevant for the VSE case, we see possible problems in analyzing software bugs. Often, the sources only give incomplete and inconsistent information. For instance, some sources trace the error back to the truncation without mentioning the updates, others see the update as the direct cause, some sources mention "1,400 or so" stocks [15], others 1,500 [11]. No source reveals anything about the actual code or the programming language. The only additional information is the more or less obvious statement that the truncation error summed up to around 1 point per day, or 20 per month.

To allow for a better understanding of what really happened, we implemented different calculation methods and compared the results. The corresponding MATLAB program can be found in Appendix 8.2.1. As a model, we use a total of 1,500 stocks and 3,000 updates per day. We define the initial stock prices sp_j uniformly distributed between 50 and 60. In every update step, we randomly select the stock that has to be changed. The amount of a stock price change is randomly chosen with a maximum percentage r% (i.e., a stock price may change by [−r, r] percent of its current value each time it is changed). The computations of the index I are implemented in six different variants (four variants using the update formula (2.3) and two variants using the full summations from Eqs. (2.1) and (2.2)):

I_f : updating and truncation to three decimals (i.e., as in the original VSE software).

I_c : updating and always rounding upwards to the next neighbor (in three decimals).

I_r : updating with rounding to the nearest neighbor (in three decimals).

I_e : updating in full double-precision arithmetic, which results in keeping about 12 decimals instead of only 3 (see Excursion 2).

I_w : direct computation in the form \sum_{j=1}^{N} (sp_j \cdot w) as in Eq. (2.1), so first multiplying each summand with w before summing up.

I : direct computation in the form w \cdot \sum_{j=1}^{N} sp_j as in Eq. (2.2), multiplying with w only once.

The last option I should be the method of choice in terms of accuracy and efficiency. The following table shows the result of the VSE index I after 1 day (assuming 3,000 updates of the index due to random stocks with relative random changes in the interval of [−r, r] percent for different values of r), computed in the six different variants and represented with three decimals:

    stock upd. r   I_f       I_c        I_r       I_e       I_w       I
    0.200%         992.139   995.139    993.643   993.636   993.636   993.636
    0.100%         995.432   998.432    996.914   996.922   996.922   996.922
    0.010%         998.210   1001.210   999.696   999.701   999.701   999.701
    0.001%         998.461   1001.461   999.992   999.970   999.970   999.970

(The columns I_f to I_e use the update formula (2.3); I_w and I use the direct computation.)

A computation of the error of the different variants in comparison with the exact result after 22 months (using the MATLAB code given in Appendix 8.2.1) shows the following numbers:

    stock upd. r   I − I_f   I_c − I   I_r − I   I_e − I          I_w − I
    0.2000%        660       660       0.021     3.1 · 10^{-11}   2.5 · 10^{-12}
    0.0001%        660       660       0.017     3.4 · 10^{-12}   1.4 · 10^{-12}

The table shows that the program gives nearly the same errors independent of the stock update rate r. The errors are rather independent of the stock price and the magnitude of the updates, depending mainly on the number of updates. Furthermore, the main source of the VSE index error is the combination of updating and truncating (I_f). Updating and rounding to the nearest neighbor (I_r) reduces the error by a factor of 10^4. Exact updating in double precision (I_e) leads to a small error in the order of 10^{-11}, and the (academic) direct computation I_w with N times multiplying the weight w results in about 10^{-12}. Hence, updating alone in sufficient precision increases the error, but not extremely. Comparing the costs, updating saves operations (only three arithmetic operations compared to N − 1 additions and 1 division in the case of direct computation of I). Nevertheless, the direct computation of I would have been the optimal choice—at least at the end of each trading day—to avoid the accumulation of the errors over time.

To conclude, the VSE index problem was caused by a frequent, systematic rounding error that was introduced due to the decision to save costs. The mistake could have been avoided by performing the updates more accurately or by directly recomputing the index at least once every day. In contrast to the dubious method of computing the VSE index, the German DAX is newly computed every second by the explicit sum. In 1988, when the index started, the computation was done every minute. The actual index formula is more complex, including different correctional weight factors to take into account differences in the free-floating capital, the size of the freely available stock, bonuses, and changes in the composition of the index.

Note that this software problem can be found on Pete Stewart's original list in the NADigest [18]. Furthermore, it serves as a classic example in literature on numerical methods in general and rounding errors in particular (see, e.g., [4, 6, 10, 13]).

2.3.5 Other Bugs in This Field of Application

In November 1992, a stock reached a value of more than $10,000 on the New York Stock Exchange for the first time. But the computer system used only four digits for representing stock values (cf. [20]). A selection of additional software bugs in the field of finance can be found in Sec. 8.1.2 on page 219. These are not fully within the focus of this book and are less documented.

2.3.6 Similar Bugs

• The voting count for the Green Party failed during elections for the state parliament in Schleswig-Holstein, Germany, on April 5, 1992. The parliament has a 5% clause, so parties with less than 5% will not be represented in parliament. The software, which had been used to evaluate elections for years, displayed the results rounded to one decimal, so that the Green Party's actual 4.97% was displayed as 5.0% [16, 22]. The error was not discovered until after midnight, and the final result had to be corrected, leading to the Green Party losing their seats. You can simulate such behavior using Microsoft's Excel software: If you enter the value 4.97 in a cell and choose a custom number format with two decimals, the value 4.97 will be displayed correctly. If you change the format to one decimal, the value (still contained in the cell) will be displayed as 5.0 due to rounding.

• Another display effect appears in Microsoft's Excel software: For certain evaluations of formulas, the displayed result is corrected to a cosmetic value that is more intuitive for the user to read. Compare the following two statements proposed in [8]:

    (4/3 − 1) · 3 − 1       results in   0.00000000000000000E+00,
    ((4/3 − 1) · 3 − 1)     results in  −2.22044604925031000E-16.


The statements are completely equivalent since they only differ in the outer parentheses. The second result of about −2.2E-16 is correct with respect to the IEEE standard for floating-point arithmetic (see [7]). Presumably, Excel assumes further computations if the outer parentheses exist (at least in this context) and hence uses the exact value for further calculations; without brackets, the result is interpreted to be final and rounded "cosmetically" ([8], p. 7) to zero to better fit the expectations of a (nonscientific) user.19 Expectations of the user to get "nice" results are generally dangerous, since multiple or more complex computations or data will always result in machine numbers involving rounding errors in their representations, and since the decimal representation shown to the user differs from the binary representation on the computer.
19 Note that MATLAB gives the correct result for both computations according to the IEEE standard.
• There are many discussions on different wrong or weird computations of calculators (see, e.g., [5, 14]). The IEEE standard [7] states that a scientifically correct implementation of an elementary mathematical function should always round correctly, i.e., give the nearest machine number (for a given machine precision) to the exact result (cf. [9]). In particular, if the exact result is a machine number itself, there should be no error. In the case of the Microsoft Windows calculator, the results of different computations involving √4 (taken from [5]) are given in Tab. 2.1. If the Windows calculator computed √4 as an accurate 2.0 (machine number) according to the IEEE standard, the second statement √4 − 2 would result in an exact zero. However, the values of −1.07E-19 and −8.16E-39 for the standard and scientific mode, respectively, indicate that this is not the case and that the represented result "2" for √4 also has a cosmetically rounded form. Note that Excel and MATLAB give the correct result of a binary exactly represented 2.0.
Table 2.1. Results of square root computations in Excel, MATLAB, and the Windows calculator (Windows 7 Professional, version 6.1). The Windows calculator does not represent the result correctly (as an exact 2.0), but cosmetically rounds it to this value.

statement   Excel (scientific repr.)   MATLAB   Windows calculator (standard mode)   Windows calculator (scientific mode)
√4          2.00E+00                   2        2                                    2
√4 − 2      0.00E+00                   0        −1.07E-19                            −8.16E-39

• Excel 2007 showed a bug for certain operations. The result of the multiplication 850 · 77.1 would be displayed as 100,000 instead of the exact result 65,535. The computations were done properly; only the display showed the incorrect value. 65,535 is the highest number that can be represented by an unsigned 16-bit binary number. In total, six of the 9.214 · 10^18 different floating-point numbers in Excel 2007 were flawed (see [3]).
• An example of an observer effect was presented in 2005 by Walter Gander (see [1]) and is related to determining the machine precision in MATLAB. In former versions of the MATLAB software (6.5 to 7), a command line output of intermediate results led to a reduction in precision of the final result. This effect was caused by two different floating-point formats used internally in MATLAB:


For internal computations, the longer 80-bit format was used, whereas displaying numbers resulted in a conversion to the regular 64-bit double precision. Hence, the code delivered different results depending on the "observer." The MATLAB code to generate the behavior is contained in Sec. 8.2.2.
• What is the temperature of a healthy body in Celsius or in Fahrenheit? Usually 37° Celsius is considered the standard temperature. But nature would likely not select an integer value for body temperature, especially not in man-made units. Thus, 37 is chosen more for psychological reasons because we prefer integers and they are easier to memorize. Nowadays, 36.8° Celsius is seen as a better estimate, with an upper limit of 37.7° Celsius. On the other hand, in comparison with 37° Celsius, 98.6° Fahrenheit (i.e., exactly 37° Celsius) uses one decimal more and therefore looks more exact. In general, when displaying values we have to be careful to show only significant digits. A bad example would be displaying 37.10111° Celsius as the result of a measurement, because the measuring device does not usually have this high an accuracy, and furthermore the last decimals do not give us any additional meaningful information. For a discussion on the normal temperature range, see [12].
• Examples of currency conversion and rounding (see [21]): Converting currencies introduces errors. There are three possible arithmetic errors: conversion errors, reconversion errors, and totalizing errors. As an example, 1 Belgian franc (BEF) is equivalent to €0.025302043 and will be rounded to €0.03 with an error of 16%. A value of €0.03 would be retransformed into BEF 1.185675 and rounded to BEF 1, again representing an error of 16%. Changing BEF 100 by cutting the exchange into 100 separate steps of BEF 1 would result in €3; changing the 3 euros back gives BEF 119, a fictitious win of BEF 19 (a small sketch of this effect follows below).
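The following minimal MATLAB sketch reproduces the round-trip numbers quoted above; the only inputs are the quoted conversion rate and the assumption that every intermediate result is rounded to the nearest euro cent.

% Minimal sketch of the totalizing error described above: exchanging BEF 100
% at once versus in 100 separate steps of BEF 1, rounding every intermediate
% result to two euro cents (rate as quoted in the example above).
rate      = 0.025302043;                  % EUR per BEF, as quoted above
eur_once  = round(100*rate, 2);           % one conversion:  EUR 2.53
eur_steps = 100 * round(1*rate, 2);       % 100 conversions: 100 * EUR 0.03 = EUR 3.00
bef_back  = round(eur_steps / rate);      % converting back: about BEF 119
fprintf('at once: EUR %.2f, stepwise: EUR %.2f, back: BEF %d\n', ...
        eur_once, eur_steps, bef_back);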

Bibliography [1] W. Gander. Heisenberg effects in computer-arithmetic. date accessed: 2018-0129, 2005. https://www.inf.ethz.ch/personal/gander/Heisenberg/paper [2] D. Goldberg. What every computer scientist should know about floating-point arithmetic. Computing Surveys, 23(1):5–48, 1991 [3] D. Goodin. What’s 77.1 x 850? Don’t ask Excel 2007. date accessed: 2018-07-22, 2007. https://www.theregister.co.uk/2007/09/26/excel_2007_bug/ [4] A. Greenbaum and T. P. Chartier. Numerical Methods: Design, Analysis, and Computer Implementation of Algorithms. Princeton University Press, 2012 [5] V. Gupta. Microsoft Windows calculator bug, Sqrt(4) – 2 != 0. ASKVG, date accessed: 2018-01-12. https://www.askvg.com/microsoft-windows-calculatorbug/

[6] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002 [7] IEEE Computer Society. IEEE Std 754TM-2008 (Revision of IEEE Std 754-1985), IEEE Standard for Floating-Point Arithmetic. Technical report, IEEE, 2008


[8] W. Kahan. Floating-point arithmetic besieged by “business decisions.” date accessed: 2018-01-26, 2005. https://people.eecs.berkeley.edu/~wkahan/ ARITH_17.pdf

[9] V. Lefevre and J.-M. Muller. Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. In Proceedings 15th IEEE Symposium on Computer Arithmetic. ARITH-15 2001, pages 111–118. IEEE Comput. Soc., 2001 [10] J. Leitch. Adventures in .NET rounding, Part 3: Rounding in action. date accessed: 2016-07-25, 2010. http://www.jackleitch.net/2010/06/adventures-in-netrounding-part-3-rounding-in-action/

[11] W. Lilley. Vancouver Stock Index Has Right Number Last. Toronto Star, page C11, Nov. 29, 1983 [12] P. A. Mackowiak, S. S. Wasserman, and M. M. Levine. A Critical Appraisal of 98.6◦ F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich. JAMA: The Journal of the American Medical Association, 268(12):1578–1580, 1992 [13] B. D. McCullough and H. Vinod. The Numerical Reliability of Economic Software. Journal of Economic Literature, 37(2):633–665, 1999 [14] I. Peterson. Fatal Defect: Chasing Killer Computer Bugs. Vintage Books, reprint edition, 1996 [15] K. Quinn. Ever Had Problems Rounding Off Figures? This Stock Exchange Has. The Wall Street Journal, Nov. 8, 1983 [16] J. Raschke and G. Heinrich. Die Grünen: Wie sie wurden, was sie sind. BundVerlag, 1993 [17] D. Stankov. Konzeption des Deutschen Aktienindex (DAX 30), Chapter 3. TU Berlin, 2008 [18] P. G. Stewart. Rounding error. date accessed: 2018-07-31. http://www.netlib. org/na-digest-html/99/v99n43.html

[19] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer, 3rd edition, 2002 [20] Times Wire Services. Stock’s high close is low blow to the technicians at NYSE. The LA Times. date accessed: 2018-07-31, 1992. http://articles.latimes. com/1992-11-17/business/fi-621_1_stock-market

[21] K. Vuik. Some disasters caused by numerical errors. date accessed: 2018-08-08, 1992. http://ta.twi.tudelft.nl/users/vuik/wi211/disasters.html [22] D. Weber-Wulff. Rounding error changes Parliament makeup. date accessed: 2018-01-26, 1992. http://catless.ncl.ac.uk/Risks/13.37.html#subj4.1


2.4 The Patriot Missile Defense Incident

But space is broad, teeming with possibilities, positions, intersections, passages, detours, U-turns, dead-ends, one-way streets. Too many possibilities, indeed.
—Susan Sontag, Under the Sign of Saturn: Essays

2.4.1 Overview
• Field of application: military defense
• What happened: failure to intercept incoming missile
• Loss: 28 dead soldiers and more than 90 injured
• Type of failure: rounding error

2.4.2 Introduction
Missile defense is an important task in military systems and also in computer science. It can provide a tool for protecting one's own people or country against attacks from different intermediate- and long-range missiles with nuclear or conventional warheads. The attacking weapons are usually ballistic missiles that follow a ballistic flight trajectory after having burned their fuel. From the computer science perspective, missile defense is a challenging software task. This is due to the high speed of the missiles, which makes fast computations and man-machine decisions necessary, and to the complexity of the problem, involving reliable missile identification and annihilation without collateral damage.
US missile defense originated from air raid defenses and is also related to Ronald Reagan's Strategic Defense Initiative (SDI) initiated in March 1983 (see Sec. 7.3). While SDI during the Cold War aimed at destroying incoming intercontinental nuclear missiles launched from the Soviet Union or a rogue state, attacking medium-range missiles were supposed to be engaged by the land-based Patriot system or the Navy's Aegis system.
The MIM-104 Patriot system began development in 1969 as a surface-to-air missile system [45] originally targeting attacking airplanes. Production by the responsible company Raytheon started in 1976. The name Patriot is frequently considered an acronym for "Phased Array Tracking Radar to Intercept On Target" (see, e.g., [23]). Initially it was designed as an anti-aircraft system, but in 1988 it was enhanced by a software update to PAC-1 (Patriot Advanced Capability-1; cf. [42, 45]) in order to also protect against tactical ballistic missiles.
The mobile system consists of a diesel electric power plant, a launching station, an antenna mast group to connect all the different involved units, the radar set, and the engagement control station.20 The weapons control computer (WCC) is controlled by two-man stations. It comprises a 24-bit processor running at 6 MHz that has to calculate the missile intercept routines. The computer had limited processing power, especially compared to today's devices.
20 A sketch of Patriot's different components is available at https://science.howstuffworks.com/patriot-missile.htm.
In order to initialize the system, one has to choose a target with certain characteristics such as aircraft, cruise missile, or short-range ballistic missile. The radar scans


for missiles. If there is a match, the system extracts information on the size, position, speed, heading, and possibly the identification of the moving object. Based on these data and the characteristics of the target, the range-gate algorithm running on the WCC computes a new area where the object should be expected in the next time step. If the object is actually detected in this range-gate area by the radar, it matches the characteristics and is considered dangerous (see [4]). In this case, an engagement with Patriot missiles is triggered.
In 1990, a major update put the MIM-104C PAC-2 into service with improved effectiveness against ballistic missiles, followed by the further improved PAC-2/GEM and, since 2001, the PAC-3 version. The first assignment of Patriot (PAC-2) systems took place during the Gulf War Operation Desert Storm in 1991 to protect Israel and US forces against incoming Iraqi Scud (also known as Al Hussein) missiles; the system had a range of about 43 miles, or 70 km (see [35, 45]). The ballistic missile Scud is a Russian descendant of the German V2 (see [46]) that has been in service in different versions since 1957. The Al Hussein missiles were developed during the Iran-Iraq War by Iraq as a modification of the Scud.21 The speed of an Al Hussein missile is about Mach 5 (i.e., ca. 3,750 mph or 1.7 km/s).
Figure 2.5. (a) Sketch of the modified Scud (Al Hussein) missile components (taken from [44]). (b) The potential range of attack of Al Hussein missiles.
21 Both countries possessed Scud missiles. Iran had obtained Scud-B missiles from Libya which were capable of hitting larger Iraqi cities about 185 miles away. However, because there are no important cities in Iran close to the border, Iraq started to modify their USSR Scud missiles by reducing the warhead and adding fuel to increase the range to about 500 miles. The Al Hussein missiles were nothing more than modified Scud missiles with less blasting effect but the ability to reach more distant targets. Furthermore, with the applied changes the missiles became rather unstable and tended to break apart during flight.

2.4.3 Timeline
During the Gulf War, attacks using the modified Scuds against Israel and Saudi Arabia began on January 18, 1991. In Israel, the Patriot systems were not yet operational, so Tel Aviv and Haifa were hit. In Saudi Arabia, Patriot missiles were fired at Dhahran the same day. But this alleged first successful interception of an enemy ballistic missile (as


stated in [40] for example) did not engage any Al Hussein missiles because no missile was launched at Saudi Arabia that day (see, e.g., [7, 17, 31, 34]). In fact, in the first week of the war, several Patriot missiles were accidentally launched at phantom targets due to “software flaws and a design that left the back of the radar unit open to stray signals” [31] originating from various systems of the coalition forces. On February 25, an Al Hussein missile from Iraq was fired at Dhahran, Saudi Arabia, where US forces were located. Two of the six Patriot systems of the Dhahran Battalion, the Alpha and Bravo batteries, were responsible for the Dhahran Airbase. Because Bravo was out of commission due to radar problems (cf. [33]), the Alpha battery was the only one in service; it had already been operating for 100 hours (i.e., about four days). Unfortunately, the Patriot system failed to track and intercept the missile. The modified Scud broke up and the warhead hit a warehouse serving as US barracks in the Aujan compound close to the city of Dhahran, killing 28 American soldiers and causing 97 to be injured (see [19, 42]). This event initiated investigations into the reasons for the incident and discussions about the general reliability of the Patriot system.

2.4.4 Detailed Description of the Bug
At first, the failure was traced back to a numerical rounding problem (see Excursion 5 on page 32), but closer inspection revealed a combination of several different flaws.
The numerical problem concerns the internal representation of clock time as a 24-bit integer value in units of tenths of a second (see [4, 36]). The code needed the time in seconds, so the integer representing the time (as the number of tenths of a second) had to be divided by 10 to give the actual time in seconds. Since the computations are done in binary, a division by 10 is more complex for a computer than for a human. Therefore, 1/10 was also stored as a 24-bit number, and the exact division by 10 was approximated by a multiplication with the 24-bit estimate of 1/10. Unfortunately, 1/10 in the binary system is a number with infinitely many digits:

1/10 = 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + ... = (0.000110011001100110011001100...)_2 .

Allowing only 24 bits for the machine number representation, resulting in 23 bits for the binary digits after the binary point (i.e., t = 23 in the mantissa described in Excursion 5 on page 32), one loses all negative powers of 2 from 2^-24 on, introducing an error in the stored value of 1/10 of size

1/2^24 + 1/2^25 + 1/2^28 + 1/2^29 + ... ≈ 9.5 · 10^-8 .

At the start of the system, the clock time is set to zero. After 100 hours of runtime of the system and the corresponding software, the exact clock time would be 100 · 60 · 60 s = 360,000 s, but the computed time is (cf. [13]22)

100 · 60 · 60 · 10 · (1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + ... + 1/2^20 + 1/2^21) s ≈ 3.599996567 · 10^5 s ,

leading to an inaccuracy in the computed absolute time of about 0.343 seconds.
22 Note that [13] lists a corrected version of the calculated time after 100 hours compared to the original official GAO report [4].
The tracking of the hostile missiles was based on the computation of differences of absolute time, where this effect would have canceled out. Thus, this inaccuracy alone


did not lead to the failure. Unfortunately, there was one in a series of software updates that was supposed to fix the inaccuracy by inserting code that used a pair of 24-bit registers to allow more accurate time calculations via a transformation to 48-bit floating-point numbers (see [36]). In the software update, about six (but not all) appearances of time computations had been replaced with the more accurate calculations (see [22]), leading to inconsistent code. Therefore, in the computation of differences of times between one radar pulse and another, the round-off error did not cancel out, since one time was represented via the more accurate 48-bit machine number and the second time value was less accurate due to the old 24-bit format. Because of the high speed of the modified Scud (about Mach 5, i.e., ≈ 3,750 mph or 1,700 m/s), the missile could not be found in the computed range gate—in 0.343 s, a modified Scud travels a distance of about 500 meters (cf. [2, 4]). Hence, the Patriot system assumed the signal to be spurious and removed the target from the system. (A small numerical sketch reproducing these orders of magnitude is given at the end of this subsection.)
Other circumstances also contributed to the bug:
• The code was written in assembly language 15 to 20 years (according to [22] and [36]) before the actual use of the system and thus was difficult to understand, maintain, and extend.
• The primary targets were aircraft traveling at lower speed (Mach 2 instead of Mach 5; see [4]). The Gulf War in 1991 was the first serious test of the system against ballistic missiles. Therefore, a series of software updates was released to improve the system and adjust it to the high speed of ballistic missiles.
• The Israeli military used tracking software to analyze the behavior of the Patriot systems installed in Israel during engagement, allowing them to discover that the system was losing accuracy over longer operation periods: A targeting inaccuracy of 20% would already occur after 8 hours of operation, leading to an inaccuracy of about 50% after 20 hours (see [4, 19]). On February 11, 1991, they informed the US Army and the Patriot Project Office, which was maintaining the software. As an immediate measure, the Israelis suggested rebooting the system regularly.
• A software update repairing the bug was sent out on February 16, 1991, but because of transportation difficulties in wartime, it could not reach all batteries immediately. In fact, the update did not reach Dhahran until February 26, one day after the incident (see [4]).
• On February 21, 1991, a memo reached all batteries stating that a software update was on the way and that the system was not to be run for a "very long time," without specifying the meaning of "very long time" (cf. [4, 22]).
• A reboot of the system would take 60 to 90 seconds (see [4, 19]). During this period the battery would have been inoperative. Since the Bravo unit was out of order, the airbase in Dhahran would have been unprotected if the Alpha unit had not run continuously for 100 hours.
• Army officials assumed that the Patriot systems would never be running more than 8 hours at a time (cf. [4]).
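The orders of magnitude quoted above can be reproduced with a few lines of MATLAB. The sketch below is not the original assembly code; it merely truncates 1/10 after 23 fractional bits, as described above, and assumes an Al Hussein speed of 1,700 m/s.

% Minimal sketch (not the original assembly code): clock drift caused by the
% truncated 24-bit fixed-point representation of 1/10 and the corresponding
% displacement of a missile flying at about 1,700 m/s.
tenth_exact = 1/10;
tenth_24bit = floor(tenth_exact * 2^23) / 2^23;  % keep 23 bits after the binary point
err_stored  = tenth_exact - tenth_24bit;         % ~9.5e-8, error of the stored 1/10

hours  = 100;
ticks  = hours*60*60*10;                 % clock counts (tenths of a second)
drift  = ticks * err_stored;             % ~0.34 s error in the computed time

v_scud = 1700;                           % assumed Al Hussein speed in m/s
fprintf('error of stored 1/10: %.3g\n', err_stored);
fprintf('clock drift after %d hours: %.3f s\n', hours, drift);
fprintf('missile displacement in that time: about %.0f m\n', drift*v_scud);

With an inconsistent mix of corrected and uncorrected time values, this drift no longer cancels in the time differences, which is exactly the situation described above.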

2.4.5 Discussion
The number of actually successful engagements of Patriot systems, as well as the definition of "successful engagement" itself, is controversial.


Originally, the US Army reported very high success rates. Due to serious doubts, a congressional hearing23 about the Patriot's performance in the Gulf War was held on April 7, 1992. The different testimonies, reports, and prepared statements24 are available in [1] (which contains, in particular, the references [5, 6, 8, 14, 27, 29, 48] used in the following). A nice summary of the different positions and arguments is contained in [35]. In this context, more fundamental issues of the Patriot system at the time of the events25 were discussed:
• The Patriot system tried to destroy an attacking missile by coming close to it—five to ten meters ideally—and then having an interceptor detonate in front of the target, spraying about 1,000 pellets forward. This is an appropriate way of fighting against an aircraft and might lead to a change of course of the attacking missile, but it cannot make sure that the missile warhead is actually destroyed (cf. [29, 30]). Furthermore, this mechanism of destruction was only successful for a relative speed of up to 2,100 m/s between the attacking and intercepting missiles (see [27]). At higher relative speeds, it would explode too late.
• It was difficult to decide whether an engagement by the Patriot system had been successful. Specific data had only been recorded in exceptional cases. Different available videos—including many broadcast by TV stations—were analyzed in [29, 30]. The findings were contested by other statements (see, e.g., [8, 48]) but later on supported by [37], which comprises a more general discussion on different methods to measure or evaluate a successful engagement. In addition, the notion of "successful" is also context-sensitive: In the Saudi Arabian desert, changing the course of a modified Scud can be celebrated as a success, whereas in Israel, only a destruction of the warhead can be considered a full success (cf. [27, 45]).
• The attacking Al Hussein missiles were rather unstable during flight, often falling apart during descent (cf. [5, 14, 27, 31]). This made them more difficult targets to destroy: It was more likely that some other part of the missile would be hit instead of the warhead (see, e.g., [27, 30]). Furthermore, the Patriot system was not able to distinguish between warhead and fragments in the detection phase and frequently ended up firing several Patriot missiles26 aimed at different parts of a single incoming Al Hussein missile.
• If the warhead is not destroyed, the amount of debris is increased by Patriot parts, causing additional damage, particularly in densely populated areas (cf. [10, 27]).
Originally, the US Army claimed a success rate of 80% in Saudi Arabia and 50% in Israel. After discussions, the Army reduced these numbers to only 70% and 40%, respectively, and these estimates also had a reliability of only 40% (see [10, 35, 42]). This already low success rate was further reduced by the Army just before the hearing (see [6, 15]) and strongly contested by other investigations27 exploiting various available

data such as videos, reports, and debris. According to [7, 11, 27, 29, 30, 41], the actual number of destroyed warheads was close to zero.
The Israelis developed their own way of applying the Patriot system by operating it manually and only engaging it under certain conditions. In the long run, the Israeli Army modified the Patriot system according to their own perception (see [11, 47]). In 2014, this improved system effectively destroyed two Hamas drones and one Syrian drone (cf. [11]).
Despite the Patriot system not functioning as desired in the first Gulf War, its psychological effect was very helpful in giving the unprotected population a sense of security (see also [32, 42]). Politically, a feeling of safety was important in order to keep, in particular, Israel from actively participating in the war (cf. [42]). This is an example of how economic and political interests of different kinds may influence the technical or logical arguments and discussions (see, e.g., [32]).
In the literature, the Patriot failure at Dhahran is a prominent example of rounding errors and of techniques to avoid bugs [12, 13, 26, 28, 43]. Note that frequently only partial aspects of all the different causes of the Dhahran incident are presented.
23 A video of the hearing is available at https://www.c-span.org/video/?25443-1/patriot-missile-performance.
24 Many prepared statements are digitally accessible at https://web.archive.org/web/20121112172544/http://www.fas.org/spp/starwars/congress/1992_h/index.html
25 See also [32] for very early discussions on those aspects.
26 On average, about three and four Patriot missiles were fired on each single incoming Al Hussein that was engaged in Saudi Arabia and Israel, respectively (see [48]).
27 A YouTube video summarizing the Patriot system, the Dhahran events, and some of T. Postol's arguments is available under https://www.youtube.com/watch?v=9OEjnBfuNSc

2.4.6 Other Bugs in This Field of Application
• During armed conflicts, many losses are caused by one's own armed forces and are frequently called friendly fire casualties. Therefore, identifying hostile and friendly weapons is very important. Every fighter aircraft, for example, sends out an identifying signal that should be recognized by friendly anti-aircraft systems to avoid being attacked. Sometimes these systems fail. For example, during the Iraq War in 2003, Patriot missiles shot down a British Tornado and attacked a US Hornet (see [16, 21, 39]).
• During the Falklands War between Argentina and the United Kingdom, the HMS Sheffield was hit by an Exocet missile on May 4, 1982. The attacking Exocet did not explode but caused a fire that forced the crew to abandon ship, and the Sheffield sank a few days later. After the sinking, newspapers reported that the Sheffield had not defended itself against the attacking Exocet because the computers had not been reprogrammed to register Exocet as hostile (see [24], cited also in [18]). The Exocet missiles were sold to Argentina by France and were mainly used in NATO for defending against attacking missiles, presumably from Russia in the event of a conflict between NATO and the Warsaw Pact during the Cold War. The HMS Sheffield was equipped with the Abbey Hill Equipment (AHE) responsible for defending against hostile attacks. Most British ships possessed Exocet missiles, too, and therefore confusion could result in distinguishing between friendly and hostile missiles. But later on, the UK Minister of Defense contradicted this version (cf. [25, 38]). Nevertheless, the Navy claimed that it installed some modifications to ensure that the AHE rapidly alerted crews to further Exocet attacks. In the official report, the fact that no possible counteraction was initiated is explained by 1) deficient attention of the crew of the Sheffield, and 2) concentration on possible attacks by submarines rather than by missiles. Unfortunately, the official report of the British Navy [38] contains many sections that are redacted and cannot be read. A similar summary can be found in [20]: The author attributes the incident to the British naval forces simply not expecting an attack with Exocet missiles. In 2000, a newspaper article in The Guardian blamed officers of the HMS Invincible who had spotted the attacking Exocets for not alerting other


British warships and for dismissing the sightings as "spurious" (see [3, 9]). Thus, in view of the conflicting statements and the official denial, the story that the Sheffield failed to defend itself because the Exocet had been registered as "friendly" should be considered unconfirmed.

Bibliography [1] Performance of the Patriot missile in the Gulf War: Hearing before the Legislation and National Security Subcommittee of the Committee on Government Operations, House of Representatives, One Hundred Second Congress, second session, April 7, 1992. U.S. Government Printing Office, 1993. https://babel.hathitrust. org/cgi/pt?id=pur1.32754076883812

[2] D. N. Arnold. The Patriot missile failure. date accessed: 2017-10-19. http: //www-users.math.umn.edu/~arnold/disasters/patriot.html

[3] BBC News. Navy accused of Falklands “cover-up.” date accessed: 2018-03-23, 2001. http://news.bbc.co.uk/2/hi/uk_news/1414411.stm [4] M. Blair, S. Obenski, and P. Bridickas. Patriot Missile Defense Software Problems led to Systems Failure at Dhahran, Saudi Arabia. General Accounting Office (GAO), Report GAO/IMTEC-92-26, date accessed: 2018-10-15, 1992. http: //www.gao.gov/assets/220/215614.pdf

[5] J. W. Carter. Testimony before the Legislation and National Security Subcommittee of the House Government Operations Committee. Prepared statement for the hearing before the Legislation and National Security Subcommittee of the Committee on Government Operations, House of Representatives, date accessed: 2018-10-12, 1992. https://web.archive.org/web/20021022011054/ http://www.fas.org:80/spp/starwars/congress/1992_h/h920407v.htm

[6] J. Conyers Jr. Opening Statement of the Hearing “Performance of the Patriot Missile in the Gulf War” before the Legislation and National Security Subcommittee of the Committee on Government Operations, House of Representatives. date accessed: 2018-10-15, 1991. https://web.archive.org/web/20030210161951/ http://www.fas.org:80/spp/starwars/congress/1992_h/h920407c.htm

[7] J. Conyers Jr. The Patriot Myth: Caveat Emptor. Arms Control Today, Nov. 1992 [8] R. Davis. Operation Desert Storm. Project Manager’s Assessment of Patriot Missile’s Overall Performance Is Not Supported. Prepared statement for the hearing before the Legislation and National Security Subcommittee of the Committee on Government Operations, House of Representatives, date accessed: 2018-10-15, 1992. https://web.archive.org/web/20030210164519/http://www.fas.org: 80/spp/starwars/gao/nsi92027.htm

[9] J. Ezard. Officers dismissed radar warning of Exocet attack on HMS Sheffield. The Guardian, Sept. 26, 2000. https://www.theguardian.com/uk/2000/sep/26/ falklands.world

[10] S. Fetter, G. N. Lewis, and L. Gronlund. Why Were Scud Casualties So Low? Nature, 361(6410):293–296, 1993


[11] M. Ginsburg. They Couldn’t Stop Saddam’s Scuds, but Now It’s Finally Patriots’ Day. The Times of Israel, Jan. 22, 2015 [12] M. Grottke and K. Trivedi. Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate. Computer, 40(2):107–109, 2007 [13] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002 [14] S. A. Hildreth. Evaluation of U.S. Army Assessment of Patriot Antitactical Missile Effectiveness in the War Against Iraq. Prepared statement for the hearing before the Legislation and National Security Subcommittee of the Committee on Government Operations, House of Representatives, date accessed: 2018-10-15, 1992. https://web.archive.org/web/20030201094403/http://www.fas.org: 80/spp/starwars/congress/1992_h/h920407h.htm

[15] H. L. Hilton Jr. Operation Desert Storm: Data Does Not Exist to Conclusively Say How Well Patriot Performed. General Accounting Office (GAO), Report GAO/NSIAD-92-340, date accessed: 2018-10-15, 1992. https://www.gao.gov/ assets/220/216867.pdf

[16] P. B. Ladkin. Re: Patriots and friendly fire. date accessed: 2018-03-09, 2003. https://catless.ncl.ac.uk/Risks/22.72.html#subj10

[17] G. N. Lewis, S. Fetter, and L. Gronlund. Casualties and Damage from Scud Attacks in the 1991 Gulf War. DACS Working Paper, 1993
[18] H. Lin. The Development of Software for Ballistic-Missile Defense. Technical report, Center for International Studies, MIT, 1985
[19] A. Lum. Patriot missile software problem: Verification-software bug report. date accessed: 2011-03-15. http://sydney.edu.au/engineering/it/~alum/patriot_bug.html

[20] R. MacLachlan. “Friendly” missiles and computer error—more on the Exocet. date accessed: 2018-07-24, 1986. http://catless.ncl.ac.uk/risks/3. 67.html#subj6

[21] R. McCarthy and O. Burkemann. Patriot in new “friendly fire” incident. The Guardian, Apr. 4, 2003. https://www.theguardian.com/world/2003/apr/04/ iraq.rorymccarthy4

[22] T. Morgan and J. Roberts. An analysis of the Patriot missile system. date accessed: 2017-10-11. http://seeri.etsu.edu/SECodeCases/ethicsC/PatriotMissile. htm

[23] NATO. Patriot fact sheet. date accessed: 2018-10-12, 2012. https://www. nato.int/nato_static_fl2014/assets/pdf/pdf_2012_12/20121204_121204factsheet-patriot-en.pdf

[24] New Scientist. HMS Sheffield Thought Exocet Was Friendly. New Scientist, Feb. 10, 1983 [25] New Scientist. That Sinking Feeling. New Scientist, Feb. 24, 1983


[26] S. Oliveira and D. E. Stewart. Writing Scientific Software. Cambridge University Press, Cambridge, 2006 [27] R. Pedatzur. The Israeli Experience Operating Patriot in the Gulf War. Prepared statement for the hearing before the Legislation and National Security Subcommittee of the Committee on Government Operations, House of Representatives, date accessed: 2018-10-15, 1992. https://web.archive.org/web/20121024002855/ http://www.fas.org/spp/starwars/congress/1992_h/h920407r.htm

[28] I. Peterson. Fatal Defect: Chasing Killer Computer Bugs. Vintage Books, reprint edition, 1996 [29] T. A. Postol. Optical Evidence Indicating Patriot High Miss Rates During the Gulf War. Prepared statement for the hearing before the Legislation and National Security Subcommittee of the Committee on Government Operations, House of Representatives, date accessed: 2018-10-15, 1992. https://web.archive.org/web/20090415122817/http://www.fas.org/ spp/starwars/congress/1992_h/h920407p.htm

[30] T. A. Postol and G. N. Lewis. An Evaluation of the Army Report “Analysis of Video Tapes to Assess Patriot Effectiveness.” Technical report, M.I.T. Defense and Arms Control Studies Program, 1992. https://web.archive.org/web/ 20150402044331/http://www.fas.org/spp/starwars/docops/pl920908.htm

[31] B. Rostker. Information Paper: Iraq’s Scud Ballistic Missiles. Technical report, US Department of Defense, Military Health System, 2000. http://www. gulflink.osd.mil/scud_info/

[32] W. Safire. The Great Scud-Patriot Mystery. The New York Times, Mar. 7, 1991 [33] E. Schmitt. U.S. Details Flaw in Patriot Missile. The New York Times, Jun. 6, 1991 [34] B. Sherwood. The Blip Seen ’Round the World. The Washington Post, Sept. 20, 1992. https://www.washingtonpost.com/archive/opinions/1992/09/20/theblip-seen-round-the-world/cce688c7-dae0-4b1f-913b-6d21c3841504

[35] A. Simon. The Patriot missile. Performance in the Gulf War reviewed. date accessed: 2017-10-19. http://www.gulflink.osd.mil/scud_info/scud_info_ refs/n41en141/Patriot.html

[36] R. Skeel. Roundoff Error and the Patriot Missile. SIAM News, 25(4):11, 1992 [37] J. D. Sullivan, D. Fenstermacher, D. Fisher, R. Howes, O. Judd, and R. Speed. Technical Debate over Patriot Performance in the Gulf War. Science & Global Security, 8(1):41–98, 1999 [38] The Royal Navy. Loss of HMS Sheffield - Board of inquiry. date accessed: 201803-23, 1982. http://www.mod.uk/NR/rdonlyres/9D8947AC-D8DC-4BE7-8DCCC9C623539BCF/0/boi_hms_sheffield.pdf

[39] N. Tweedie. US fighter shot down by Patriot missile. The Telegraph, Apr. 4, 2003. https://www.telegraph.co.uk/news/worldnews/middleeast/iraq/ 1426631/US-fighter-shot-down-by-Patriot-missile.html


[40] U.S. Army Aviation and Missile Life Cycle Management Command (AMCOM). Patriot. date accessed: 2017-01-11. https://history.redstone.army.mil/ miss-patriot.html

[41] T. Weiner. Patriot Missile’s Success a Myth, Israeli Aides Say. The New York Times, 1993. https://www.nytimes.com/1993/11/21/world/patriot-missiles-success-a-myth-israeli-aides-say.html

[42] K. P. Werrell. Hitting a Bullet with a Bullet. A History of Ballistic Missile Defense. Technical report, College of Aerospace Doctrine, Research and Education, Air University, 2000 [43] L. R. Wiener. Digital Woes: Why We Should Not Depend on Software. AddisonWesley Longman Publishing Co., 1993 [44] Wikipedia. Al Hussein (missile). date accessed: 2017-10-19. https://en. wikipedia.org/wiki/Al_Hussein_(missile)

[45] Wikipedia. MIM-104 Patriot. date accessed: 2017-10-19. https://en.wikipedia.org/wiki/MIM-104_Patriot

[46] Wikipedia. Scud. date accessed: 2017-10-19. https://en.wikipedia.org/wiki/ Scud

[47] Worldtribune.com. Israel completes upgrade of PAC missile defense. date accessed: 2017-10-18. http://www.worldtribune.com/worldtribune/WTARC/ 2010/me_israel0405_05_12.asp

[48] P. D. Zimmerman. Testimony of Peter D. Zimmerman. Prepared statement for the hearing before the Legislation and National Security Subcommittee of the Committee on Government Operations, House of Representatives, date accessed: 2018-10-12, 1992. https://web.archive.org/web/20030210163206/ http://www.fas.org:80/spp/starwars/congress/1992_h/h920407z.htm

Chapter 3

Mathematical Modeling and Discretization

Solving physical problems on the computer first requires a mathematical description of the phenomenon (frequently called modeling) that has to be restated in a form suitable for computing. This is achieved by a discretization of the physical domain and the underlying functions. Often, this discretization of the physical domain is based on complex meshes consisting of elementary geometrical shapes such as triangles or rectangles in two dimensions. It requires a lot of experience and precaution to generate meshes for the discretization of engineering structures. Resting the numerical computations upon inappropriate shapes can spoil the result and deliver wrong information on the stability of the structure. This happened in the sinking of the Sleipner gas rig in 1991. For simulating the behavior of a bridge, one has to include assumptions on the actual forces acting on boundaries of the geometry, e.g., those induced by wind, pedestrians, or cars. Wrong assumptions in the mathematical modeling will lead to a wrong simulation of the behavior of the bridge, as in the case of the London Millennium Bridge in 2000. For time-dependent problems, initial data or guesses about the actual state of the system (such as temperature or velocity) are needed as input. With this input data specified, the discretized equations can be solved, and the output then describes the behavior of the system approximately. In this way, one can simulate a physical system in order to derive predictions according to the underlying physical laws. Wrong input data or coarse discretization can deteriorate this prediction and lead to wrong forecasts, such as underestimating thunderstorms in the case of winter storm Lothar in 1999. Mathematical equations modeling real-world phenomena depend on parameters that have to be set by the user in the program. In physical applications these parameters are constant, such as the speed of light or the gravitational constant. Financial mathematics tries to describe the future development of stock prices or options, e.g., by mathematical models similar to physical systems. But the corresponding parameters are nearly constant only as long as the economic circumstances are stable. In the event of a market crash, all these predictions are more or less worthless. This aspect is described in the section on mathematical finance.


3.1 Loss of the Gas Rig Sleipner A

Life can only be understood backwards; but it must be lived forwards.
—Søren Kierkegaard

3.1.1 Overview
• Field of application: civil engineering
• What happened: complete collapse of the concrete structure prepared for the gas rig Sleipner A in 1991
• Loss: $180–$250 million concerning the concrete structure and an estimated $700 million to $1 billion of total costs including costs for the delayed start of operation
• Type of failure: erroneous usage of simulation software

3.1.2 Introduction
For decades, offshore oil and gas platforms in ocean areas have been a major fixture of the oil and gas industry. Different types of platforms exist,28 depending on local necessities such as water depth, temperature, roughness of the sea, and weather conditions, which typically vary highly with respect to the geographical location as well as over time. A careful design of such platforms is of utmost importance since they are expensive to build and, even more crucial, can threaten human lives or the environmental conditions of large areas in the event of problems. Famous examples of major accidents concerning offshore platforms are the Piper Alpha platform in the UK sector of the North Sea, which exploded in 1988 after a gas leak, killing 167 people (cf., e.g., [5]), or the Deepwater Horizon platform 52 miles offshore of Venice, Louisiana, which exploded in April 2010, causing 11 deaths and a huge oil slick in the Gulf of Mexico (see, e.g., [36]).
The North Sea represents a particularly rough environment for offshore platforms. This led to the development of a special type of gravity-based structure (GBS) in the 1960s. These oil and gas rigs are anchored directly onto the seabed, standing in water up to 300 m deep. While the first generation of GBS relied on steel, Norway in particular tried to push the limits and lower the costs by building concrete structures starting from the 1970s (cf. [11, 13]). The so-called Condeep concept (abbreviation of concrete deep water structure29) developed by O. Olsen was very successful (see [7, 24]). A sketch of different Condeep variants is given in Fig. 3.1: These concrete structures consist of cylindrical and spherical shapes (also called cells in the following), which are favorable for the hydrostatic pressure conditions underwater close to the sea bed. Some of the cells are extended to the water surface as shafts on which the platform deck is built and which allow for production and maintenance.
28 See https://en.wikipedia.org/wiki/Oil_platform, e.g., for a survey.
29 See https://de.wikipedia.org/wiki/Condeep.

Figure 3.1. Different Condeep structures (Draugen, Sleipner A, Troll) designed for the North Sea. Image taken from [40] (offline) or [19]. Reprinted courtesy of Kvaerner.

(a) Geographical survey of Sleipner

(b) Sleipner under construction

Figure 3.2. (a) Geographical situation showing both the Sleipner gas field (i.e., the destination of the Sleipner A platform) and the location of the accident in the Gandsfjorden close to Stavanger. Modified figure used with permission from Keith Thompson [40]. (b) Sleipner A structure under construction. Reprinted courtesy of Kvaerner.

The gas rig Sleipner A30 was developed in Norway for the enormous Troll oil and gas field, which includes the Sleipner region; see Fig. 3.2(a). The Norwegian oil and gas company Statoil ordered Sleipner A in 1989 and transferred the design and construction to Norwegian Contractors (NC),31 which had considerable experience in Condeep constructions, having already built 11 platforms. Sleipner A was intended for a water depth of 82 m, containing 24 cells with four cells extended to shafts supporting the platform deck.
30 The name Sleipner is related to Norse mythology, in which Sleipnir is an eight-legged horse ridden by the god Odin.
31 At that time owned by Aker Oil & Gas Technology Inc. and dismantled as a company in 1995; see [47].


It had a total height of 110 m, the base covered about 16,000 m², and the top deck alone had a weight of approximately 55,000 tons (cf. [10, 28]).
To understand the Sleipner incident, it is important to know the typical procedure for constructing Condeep platforms (see, e.g., [4, 32, 44] and [35] for images visualizing the procedure). The construction starts in a dry dock where the lower domes and parts of the cylindrical walls of the cells are cast. Then the dock is flooded, the preliminary structure is floated out, and it is anchored at a site in a sheltered Norwegian fjord. There the construction is developed further upwards (see Fig. 3.2(b)) by actually lowering the existing bottom parts step by step, deeper underwater, to keep the construction of new parts close to the water surface and, thus, to ease the actual construction process. The buoyancy of the structure can be controlled by the amount of ballast (i.e., water) in the cells. After completion of the structure, the base is lowered several times for controlled ballast tests to fix possibly existing minor leaks. Finally, it is lowered to about 20 m more than its final operation position to float the heavy top deck over it and mate it with the shafts. Afterwards, ballast is pumped out to raise the platform considerably again, and the complete platform is towed to the offshore site and lowered to its final destination on the sea floor.
The design of Condeep models is nontrivial since there are various different scenarios for loads and other conditions during the construction and operation process. Furthermore, the thickness of the cell walls is crucial since it is related to the stability of the cells and to the weight of the structure. On the one hand, if the walls are too thin, they may collapse due to the very high hydrostatic pressures during deck mating. On the other hand, it is not possible to design the walls with the considerable safety factors typically used for land-based structures, since too-thick walls will inhibit floating or will result in hydrostatically unstable behavior during the maneuvering of the platform to its final position (see, e.g., [4]). Hence, the engineers need to be very careful when optimizing the structure with respect to the amount of material (mainly steel reinforcements and concrete) and also with respect to cost, because every centimeter of wall thickness counts.

3.1.3 Timeline
The construction of Sleipner A started in July 1989. After the casting in the dry dock and the towing to the deep water site at Gandsfjord near Stavanger (see Fig. 3.2(a)), the upper domes of the cells and the shafts were also completed (see Fig. 3.2(b)). Before the deck mating, a controlled ballast test was scheduled for Friday, August 23, 1991. During the test, the structure experienced critical hydrostatic pressures. At a total depth of 99 m, about 5 m before reaching the deck-mating depth, a failure in one of the shaft cells occurred, which allowed water to rush into the shaft much faster than the installed pumps could compensate. The platform went underwater in about 18 minutes, quickly sank at a final speed of approximately 5 m/s, and caused a seismic event of 3.0 on the Richter scale when it hit the ground of the fjord at about 200 m (see [4, 10, 12, 17, 31, 40]).32 Luckily, all 14 people on board were rescued. Hence, "only" about $180–$250 million was lost as a direct consequence of the destruction of the Condeep platform33 (according to [4] and [27]), and the total costs, including costs for the delayed start of operation, have been estimated at $700 million to $1 billion (cf. [1, 37, 46]).
32 Most probably, at least parts of the platform imploded during the process of sinking; see [8, 10, 12, 17].
33 Note that all indicated cost values are not adjusted for inflation.


3.1.4 Causes of the Failure
The remaining pile of debris of the destroyed Sleipner A platform observable during underwater inspections did not allow for any physical evidence (cf. [15, 17]). However, investigations involving observations of eye witnesses as well as various considerations highlighted major cracking in two of the Condeep cells as the probable cause for the incident (see [10, 12]).
In order to understand the underlying causes for the failure, we need to be familiar with the general concept of modeling and simulation (Excursion 7) in the field of computational science and engineering (CSE), with the finite element (FE) method (Excursion 8), and with the concepts of interpolation (Excursion 9) and extrapolation (Excursion 10). Readers already acquainted with these concepts may continue directly to page 61.
Excursion 7: Modeling and Simulation
Since real-scale experiments for the design of huge facilities are not feasible or far too expensive and time-consuming to test the design a priori, computational analysis is frequently used in such situations. Besides theory and experiments, in recent decades computational methods have become a reliable third pillar for scientific exploration, also supported by the considerable increase in computational power of modern computing architectures. Before actually developing or using software for the successful simulation of a real-world problem (i.e., for the computational, reliable approximation of the underlying physical phenomena), two major ingredients are necessary: modeling and discretization.
The modeling of phenomena consists of creating a (simplified) description of the real-world scenario; typically, such models consist of mathematical equations derived from physical laws or conservation principles involving balances of quantities such as mass, momentum, or energy. A prominent example is Newton's second law, stating that the sum of all forces F applied to an object is equal to the product of the object's mass m and acceleration a: F = m · a.
Relative changes of a function s(x) are measured via difference quotients (s(a) − s(b))/(a − b). For a constant function, the changes will always be zero, and the changes will be measured as constant 1 for the identity function s(x) = x. Local changes at a point x = a are therefore defined as the limit of (s(a + h) − s(a))/h for h → 0. The rate of change of a function is called the derivative and is typically written as s'(x), ds/dx, or s_x. The functions often depend on the location x and the time t. Therefore, derivatives relative to x and t will also occur. Typical modeling equations rely on such derivatives (or differentials) and, therefore, are called differential equations; for example, they can describe the following:
• The growth of a population such as bacteria. The growth of a population (i.e., the change rate or derivative over time t) is, in a simplified model, proportional to the population P itself: P'(t) = a · P(t), leading to the analytical solution P(t) = c · exp(at) = c · e^(at), describing the exponential growth of a population without any limiting factors such as restricted nutrition in the environment. The derivative of a function geometrically represents the slope of the tangent to the function at a given point.


The graph below visualizes the relation between the exponential function and its derivative. The black line represents the exponential function f(x) = exp(x) itself. At each integer value −1, 0, 1, and 2 on the x axis, a slope triangle is sketched. The slope of a straight line can be computed via a slope triangle as the fraction of the length of the triangle's vertical edge (here: always the value of the function f) over the length of the horizontal edge (here: always identical to 1). Therefore, the slope (i.e., the derivative) is equal to the function value for the exponential function, i.e., exp(x) satisfies f'(x) = f(x). The MATLAB code for this visualization is contained in Appendix 8.2.3.
[Plot of f(x) = exp(x) with slope triangles at x = −1, 0, 1, 2; legend: exp(x), slope 1, slope 2, slope 3, slope 4.]

The growth of the exponential function depends on the function value itself, leading to extremely fast growth.
• Diffusion processes, where the flow of particles n(x, t) is proportional to the change in the concentration C: dn/dt = j(x, t) = −c · dC/dx. A change in the concentration of material induces a transport of particles from the dense region to the sparsely populated area.
• Modeling of option prices via the Black–Scholes partial differential equation (PDE) describing the price of an option over time (see also Sec. 3.4).
• The change of the state in a quantum physics system via Schrödinger's equation ih · dΨ/dt = HΨ.
• The motion of a harmonic oscillator such as a frictionless pendulum with position u and mass m: m · u''(t) = −k · u(t).
• The steady-state temperature distribution u in a material via Poisson's equation u_xx + u_yy + u_zz = f.
• Weather prediction as a coupled system of various PDEs (see also Sec. 3.3).
Often, these laws are intended to be used for prediction (of the development of weather conditions, populations, financial markets, etc.). For constructions in civil engineering, PDEs describe the stability of the structure under external loads such as wind or hydrostatic water pressure effects (see [2]). Since analytic solutions for such types of differential equations typically are available only for special, simple—and, hence, rather uninteresting—scenarios, numerical simulation involving discretization is used to compute approximate solutions. One powerful and popular method of discretization, in particular in the field of civil engineering, is the finite element (FE) method described in Excursion 8. We have to skip many interesting aspects of modeling and numerical simulation, so we refer the interested reader to literature such as [3, 34, 45].
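As a minimal illustration of what "discretization" means for such differential equations, the following MATLAB sketch solves the simple growth model P'(t) = a · P(t) from the list above with the explicit Euler method and compares the result with the analytical solution c · e^(at). The parameter values are arbitrary and only serve as an example.

% Minimal sketch: explicit Euler discretization of the growth model
% P'(t) = a*P(t) compared with the analytical solution c*exp(a*t).
% All parameter values below are illustrative.
a  = 0.5;  c = 1.0;                  % growth rate and initial population
T  = 4;    dt = 0.1;                 % final time and time step
t  = 0:dt:T;
P  = zeros(size(t));  P(1) = c;
for k = 1:numel(t)-1
    P(k+1) = P(k) + dt * a * P(k);   % replace the derivative by a difference quotient
end
err = abs(P(end) - c*exp(a*T));      % discretization error at t = T
fprintf('Euler: %.4f   exact: %.4f   error: %.4f\n', P(end), c*exp(a*T), err);

Halving the time step dt roughly halves the error, which is the typical first-order behavior of the explicit Euler method.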


Excursion 8: Finite Element Method
For solving PDEs such as those related to structural engineering problems, the continuous scenario has to be discretized in order to derive a description feasible for numerical and computational treatment. In the context of finite elements, this discretization involves several necessary steps.
The first step consists of reformulating the underlying equations in a so-called weak form. To this aim, we interpret the problem as a minimization problem for an abstract functional J(u), formulated as a rather complex integral. This allows us to look for a (still continuous) function ū out of a set of suitable candidate functions (in the so-called variational space H), which minimizes J. In this way, solving the PDE is replaced by minimizing J(u).
In the second step, the whole problem is discretized: The space H is restricted to a smaller finite-dimensional trial space H_h of functions u_h that can be represented by comparably simple basis functions such as piecewise polynomials of a certain maximum degree, and the overall domain Ω is discretized by an approximation of the domain boundary and a mesh (frequently also called grid) of finitely many cells therein, denoted together by Ω_h. Therefore, solving the PDE is replaced by approximately minimizing J(u) for the simple basis functions. A discretization with triangles (Ω_h) of a blue domain (Ω) is sketched in black in the following figure. Here, h is an upper bound for the size of the cells.
Of course, different types of meshes can be used, consisting not of triangles but of quadrangles in 2D or of hexahedra, prisms, tetrahedra, or pyramids in 3D.a Specific locations such as the grid nodes contain the so-called degrees of freedom, i.e., the values of the unknown function u_h from the basis functions (e.g., piecewise polynomials) at those locations. The remaining task is to find the values of these degrees of freedom such that the discrete counterpart J_h of the functional is minimized. This is realized in the third step, where the discrete counterpart of the weak form of the underlying equations results in a linear or nonlinear system of equations that has to be solved numerically—a classical and well-understood problem of numerical linear algebra. The approximate values for quantities such as stresses and strains at positions in between the nodes of the grid are calculated via interpolation of u_h in a postprocessing step inside the FE code.
The quality of the approximation u_h compared to the analytic, real solution u depends on the granularity of the FE grid, on the shape of the grid cells, and on the type of basis functions (i.e., typically on the degree of the polynomials). For quadrilateral or hexahedral grid cells, which will be important for the Sleipner accident, and rather simple basis functions, the resolution of the grid should not be too coarse, and the elements should not be too irregular, i.e., the skew angle (the angle α representing the complementary angle of the grid cell compared to 90°; see the figure below) should be rather low, and very long and narrow elements resulting in a high aspect ratio L/h should be avoided.b

(Figure: skewed quadrilateral grid cell with edge lengths L and h and skew angle α.)


The most accurate quadrilateral or hexahedral elements are squares or cubes where the skew angle is 0 and the aspect ratio is 1. However, this may result in considerably more grid cells and, hence, more computational costs, compared to adapted or stretched cells. There are no fixed rules on which grids/cells are “allowed” and which are “forbidden”; a lot of experience concerning the FE method and the underlying field of application is necessary in this context. As a rule of thumb, according to different sources, the skew angle should not exceed 20◦ or 45◦ (depending on the type of element and corresponding basis functions used in combination with the grid cells; see [20, 30], respectively). For larger skew angles, the approximations deteriorate rapidly. As a rough guideline concerning the aspect ratio L/h, “elements with aspect ratios exceeding 3 should be viewed with caution and those exceeding 10 with alarm” ([9], p. 7-4). This excursion only covers very basic aspects of FEs in a nutshell, avoiding details in particular on the underlying mathematics; this information can be found in respective literature such as [2, 35]. a See https://en.wikipedia.org/wiki/Types_of_mesh for a brief survey; also mixed or hybrid meshes are applied in practice. b The notion of irregular is not consistently defined in the literature. In accordance with [14], we use irregular as a superset notion comprising skewed elements and elements with an aspect ratio larger than 1 in the following.
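The three steps sketched in this excursion can be made concrete with a deliberately simple example. The following MATLAB sketch (our own illustration, unrelated to the Sleipner computations) assembles and solves the discrete system for the one-dimensional model problem −u''(x) = 1 on (0, 1) with u(0) = u(1) = 0, using piecewise linear basis functions on a uniform mesh; the exact solution u(x) = x(1 − x)/2 is available for comparison.

% Minimal 1D finite element sketch for -u''(x) = 1 on (0,1), u(0) = u(1) = 0,
% with piecewise linear basis functions on a uniform mesh (illustrative only).
N = 10;                        % number of mesh cells
h = 1 / N;                     % uniform cell size
x = linspace(0, 1, N + 1)';    % mesh nodes; the degrees of freedom live here
Ke = (1 / h) * [1, -1; -1, 1]; % element stiffness matrix of a linear element
K = zeros(N + 1);              % global stiffness matrix
b = zeros(N + 1, 1);           % global load vector
for e = 1:N                    % loop over the cells and assemble
    idx = [e, e + 1];
    K(idx, idx) = K(idx, idx) + Ke;
    b(idx) = b(idx) + h / 2;   % right-hand side f = 1 integrated per element
end
inner = 2:N;                   % impose u(0) = u(1) = 0 by eliminating rows
u = zeros(N + 1, 1);
u(inner) = K(inner, inner) \ b(inner);
plot(x, u, 'o-', x, x .* (1 - x) / 2, '--');
legend('FE solution u_h', 'exact solution');

Refining the mesh (increasing N) drives u_h toward the exact solution; in 2D and 3D the same assembly idea applies, but the accuracy then also depends on the shape of the cells (skewness, aspect ratio), which is exactly the aspect that mattered for Sleipner A.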

Excursion 9: Interpolation
Data such as velocity, temperature, pressure, etc., are mathematically described via functions depending on space and/or time. Often, these functions are not known explicitly, but only a discrete set of given data Y := {y0, y1, . . . , yN} is available (e.g., from observation, measurements, or computation) on a finite set D := {d0, d1, . . . , dN} of positions in space or time. In order to still be able to work with mathematical functions for further calculations, one has to describe and approximate these unknown functions based on the given data points in a way that allows for fast computations as well as sufficient accuracy. If we need information on the unknown function f at a position x in space or time that is not contained in D, we have to approximate the value f(x) using the known values f(D). To estimate f(x) as accurately as possible, we approximate the unknown function either
• globally for all data points by a simple function p(x), e.g., a single polynomial of high degree; or
• locally around x by simple functions, e.g., several polynomials of low degree, and combine these piecewise functions to one global, simple function p(x). To remain in the given function space, this reassembling of the local functions has to fulfill additional continuity constraints. This is a very popular approach.
Hence, interpolation is the task of defining a simple function p(x) that coincides with the given f(x) on the set of known points (D, Y) (see [43]). To define a piecewise local function, we choose a small number of direct neighbors of x in D, say d0, . . . , dk with values y0 = f(d0), . . . , yk = f(dk). Based on p it is possible to estimate any other function value f(x) for x ∉ D via p(x) and to draw an approximate graph of the function f. Historically, polynomials have often been chosen as simple functions for local approximations. To this end, we have to choose the degree k of the polynomial (constant for k = 0, linear for k = 1, quadratic for k = 2, cubic for k = 3, etc.) depending on the number k + 1 of locally used function values to create one polynomial. If we allow only k = 0, then the interpolation would deliver a constant polynomial p(x) = f(d0),


leading to the interpolated function value f(x) ≈ p(x) = f(d0) for positions x. Using two interpolation points d0 and d1, we can construct the linear polynomial (straight line) uniquely defined by these two values of the form p(x) = f(d0) + (x − d0) · (f(d1) − f(d0))/(d1 − d0). Obviously, it holds that p(d0) = f(d0) and p(d1) = f(d1), i.e., the polynomial p really interpolates f. In general, there is a systematic way to get a unique description of a one-dimensional interpolating polynomial of degree k via the so-called Lagrange polynomials Lj(x), j = 0, 1, . . . , k, depending on the interpolation points d0, . . . , dk. The k + 1 Lagrange polynomials are defined by

Lj(x) = [(x − d0) · · · (x − dj−1)(x − dj+1) · · · (x − dk)] / [(dj − d0) · · · (dj − dj−1)(dj − dj+1) · · · (dj − dk)].

Obviously, each Lj(x) is a polynomial of degree k as specified. Furthermore, Lj(dj) = 1, and Lj(dm) = 0 for m = 0, . . . , j − 1, j + 1, . . . , k. In the following plot, we display the three Lagrange polynomials related to the interpolation points d0 = −1, d1 = 0, and d2 = 1 for the exponential function f(x) = exp(x), together with the graph of the corresponding interpolating polynomial p.


The MATLAB source code for this visualization can be found in Appendix 8.2.4. The Lagrange polynomials allow us to write the solution of the interpolation problem without further computations in the form p(x) = y0 · L0(x) + y1 · L1(x) + · · · + yk · Lk(x). From the mathematical point of view, these Lagrange polynomials describe the solution sufficiently, but for fast and accurate numerical computations, often other algorithms (such as Newton's divided differences or Neville's algorithm) are used that do not build the polynomials explicitly but only calculate the function values of sequences of interpolating polynomials. The description of the interpolating functions should allow for fast and efficient computations. We are not interested in closed-form representations, in contrast to pure mathematics, which prefers explicit representations and (trigonometric) polynomials or power series. The following two figures show example cases of polynomial interpolation. On the left, a quadratic (k = 2) interpolation polynomial p that interpolates three red points (at positions D = {d0, d1, d2} with values Y = {y0, y1, y2}) of a given function f(x) is visualized; obviously, the approximation of f by p is not always very accurate. One approach to overcome this issue is to use a sequence of several interpolation polynomials glued together: In the right plot, linear interpolation polynomials are combined to approximate the sine function.


(Left plot: a quadratic interpolation polynomial p(x) through three given points (D, Y) of a function f(x). Right plot: piecewise linear interpolation p(x) of sin(x) through the points 0, π/4, π/2, . . . , 7π/4.)



Interpolation plays a vital role in different application scenarios. In digital image processing, a picture is mathematically represented as a table or matrix of gray values (or a triple of color values such as red/green/blue, for example). The image is given by values at discrete positions. To estimate the gray or color value in between given pixels or areas (e.g., in order to reconstruct missing data), one may interpolate data based on neighboring pixel values. Furthermore, interpolation polynomials are frequently used as an (approximate) simple alternative for very complex or expensive functions. Moreover, a considerable number of algorithms for various tasks in numerical analysis (such as the above-mentioned FE method for the discretization of PDEs) also rely on interpolation polynomials. Details on various aspects of interpolation can be found in [33]. More sophisticated interpolation methods such as splines or NURBS,a frequently also deployed in higher dimensions, are widely used in various fields of application, in particular in computer-aided geometric design. The main concept is similar, but different local basis functions are used.
a Based on simple local basis functions.
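As a small illustration of the Lagrange form p(x) = y0 · L0(x) + · · · + yk · Lk(x) discussed in this excursion, the following MATLAB sketch (our own illustration) evaluates the interpolating polynomial for the three points d0 = −1, d1 = 0, d2 = 1 of the exponential function directly from the definition of the Lj.

% Minimal sketch: evaluate the Lagrange interpolation polynomial through the
% points (d_j, y_j) at positions x (data as in the example of this excursion).
d = [-1, 0, 1];                % interpolation points d_0, ..., d_k
y = exp(d);                    % corresponding values y_j = f(d_j), here f = exp
x = linspace(-1, 1, 101);      % positions at which p(x) is evaluated
p = zeros(size(x));
k = numel(d) - 1;
for j = 1:k + 1
    L = ones(size(x));         % build L_j(x) as a product of linear factors
    for m = [1:j - 1, j + 1:k + 1]
        L = L .* (x - d(m)) / (d(j) - d(m));
    end
    p = p + y(j) * L;          % accumulate y_j * L_j(x)
end
plot(x, exp(x), x, p, '--', d, y, 'o');
legend('exp(x)', 'interpolating polynomial p(x)', 'interpolation points');

For many points or repeated evaluations, the schemes mentioned above (Newton's divided differences, Neville's algorithm) are preferable; the direct form is mainly useful for understanding the construction.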

Excursion 10: Extrapolation
Another task similar to interpolation is extrapolation. In extrapolation, we are interested in predicting values for a position x outside (extra in Latin) the range of given data positions d0, . . . , dk, whereas interpolation allows us to obtain data for positions within or between (inter) the positions in D. The following figure sketches the extrapolation to x = 0, with the corresponding value y of the extrapolation polynomial marked with a blue square, for a sequence of four data points:


Extrapolation has to be used carefully, since it can only give meaningful results if the desired positions x are not too far from the known positions d0 , . . . , dk . A striking


example for this problem is the extrapolation applied to the 100-meter sprint record times for men and women at the Olympics published in [38]. Based on the winning times at the Olympics from 1900 to 1996, one can set up a statistical extrapolation using the best-fit linear regression method to predict the records in the future. The outcome of this extrapolation is that around 2150, women will run faster than men at the Olympics. Obviously, here the prediction based on a short time period is stretched to a value far in the future and far away from the known area, ignoring many different influences and physical factors.a a The dangers of excessive extrapolation are also visible in another constructed example by Mark Twain in https://en.wikiquote.org/wiki/Life_on_the_Mississippi, regarding extrapolation of the length of the Mississippi.
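How quickly extrapolation can go astray is easy to demonstrate. The following MATLAB sketch (our own illustration; the four data points are invented and have nothing to do with the sprint data of [38]) fits both a linear and a parabolic polynomial to the same points and evaluates them far outside the data range, where the two fits diverge strongly even though both describe the given points reasonably well.

% Minimal sketch: linear vs. parabolic fit extrapolated beyond the data range.
% The data points are invented for illustration only.
d = [1, 2, 3, 4];                  % known positions
y = [0.90, 0.76, 0.68, 0.59];      % known values
p1 = polyfit(d, y, 1);             % degree-1 (linear) fit
p2 = polyfit(d, y, 2);             % degree-2 (parabolic) fit
xq = 0:0.1:10;                     % evaluation range far beyond the data
plot(d, y, 'o', xq, polyval(p1, xq), '-', xq, polyval(p2, xq), '--');
legend('data', 'linear fit', 'parabolic fit');
% At x = 10, the two extrapolated values differ drastically:
disp([polyval(p1, 10), polyval(p2, 10)]);

The same mechanism reappears in the Sleipner postprocessing discussed below, where a parabolic extrapolation of section forces to the wall boundary turned out to behave much worse than a simple linear one.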

Immediately after the Sleipner A incident, investigations were initiated by NC internally and by Statoil34 involving the Norwegian research organization SINTEF.35 Due to the mapping of the observations to numerical and experimental tests, the investigation concentrated on particular parts of the Sleipner A structure, the so-called tricells. These tricells filled the space between regular cylindrical cells and shafts (see Fig. 3.3) and were actually filled with water since they contained holes on the top. This resulted in hydrostatic water pressure equivalent to an effective water head of 67 m at the time of the incident (see, e.g., [10, 15, 31, 40]; 67 m due to 99 m of absolute submerging depth minus the bottom 32 m of ballast water in the cells and shafts). From one of these


Figure 3.3. Top left: water levels at the time of failure and assumed location of the failure; bottom left: failed tricell in the overall layout; the cracking tricell T23 is the upper left tricell neighboring D3; right: sketch of the tricell layout. Used with permission from Keith Thompson [40].

34 Actually, both Statoil and NC were eager to fix the problems and to redesign the structure. They constructed a second Sleipner platform in about 15 months, which started operation on October 1, 1993 (see [17]).
35 http://www.sintef.no/en/


tricells (number T23, cf. [15, 17, 27]), cracks in the walls towards the shaft D3 arose,36 which quickly let in water, displacing the air contained in D3 that was originally necessary to assure the buoyancy of the whole structure during the floating test. The assumed failure, as well as details on the reinforcement of different sections of the tricells, are visualized in Fig. 3.7.
36 In [10, 15], data on the water ingress capacity are mentioned, which suggests that as many as two of these tricells around D3 failed shortly one after the other, which is in accordance with two bang-like sounds and other observations by personnel on board the platform at the time of the incident. Two failing tricell walls are also indicated in [12, 42].
To find the causes for the failure of the tricell walls, the investigations focused on the FE analysis and postprocessing performed to design Sleipner A. According to [11, 12, 14, 18], the overall workflow of the analysis and design steps had the following form, which was state of the art at that time:
1. Preliminary design: An initial version of the overall geometry of the GBS structure has to be created according to the needs with respect to hydrostatic or other loads, to weight, and to wall thickness. This preliminary design is then optimized and detailed in the next step.
2. Global FE analysis: A simplified FE model based on linear elasticity is computed for the whole structure in 3D. The more realistic, nonlinear behavior of certain regions is explicitly neglected in this step due to computational costs. This is compensated for by a separate, local processing of selected sections (see step 3). Since the number of grid cells in the global FE analysis determines not only the accuracy of the approximated numerical solution but also the computational costs to solve the problem, the engineers were interested in a rather low number of cells to keep computations feasible for the moderate computing power available at the time. Hence, the 3D grid for the whole structure typically had been coarsened before calculations were carried out. The global FE analysis results in stress values at the degrees of freedom of the elements (typically the nodes), which are used in the next steps.
3. Local analysis of isolated parts: In order to incorporate nonlinear effects and other material- and load-specific aspects, specific regions of the structure may be analyzed locally in more detail. To this end, the global values obtained in the previous step are used at the newly generated boundaries when "cutting out" a part of the overall geometry. The local analysis is done with postprocessing tools.
4. Layout of reinforcements: Since the concrete structures are made of reinforced concrete to withstand tension and shear, a specific, detailed plan of where to put which type (geometry and thickness) of steel reinforcements has to be created. A lot of expert knowledge is contained in corresponding codes and standards as well as postprocessing tools. Based on the FE mesh, a design section mesh is generated, and the FE stresses are transformed in order to be available as stress resultants at particular points, which are then typically extrapolated to boundary locations.


In the end, manually generated printouts with specific indications on which reinforcement types have to be used at which locations are handed over to the construction engineers to realize the layout.
For the design of Sleipner A, the above steps are visualized in Fig. 3.6. The following detailed additional aspects of the corresponding steps are important for the failure.
• Regarding step 1: Due to the successful construction of 11 other Condeep platforms by NC before Sleipner A, the preliminary design was rather generic but covered the scenario-specific aspects of the Sleipner location (see the Concept Report of Statoil and NC mentioned in [47]). However, the particular design of the tricells had been modified at the joints of the wall sections (cf. [15, 28, 47]).
• Regarding step 2: The code used for the global FE computations of the whole structure was MSC/NASTRAN37 Version 65C (cf. [11]). The 3D FE grid for the Sleipner Condeep structure is sketched in Fig. 3.4; only one-quarter of the overall domain had to be discretized due to biaxial symmetry (see [47]). Most parts of the walls of the cells and tricells had been discretized by one layer of brick-shaped grid cells, on which the so-called CHEXA-8 elements in the NASTRAN solver were operating. Certain specific geometry parts were covered by prism-shaped grid cells with corresponding six-node elements. The overall grid had been generated by the preprocessor PATRAN [18], which produced a significant number of irregular elements (see Excursion 8 concerning the definition).

Figure 3.4. Sleipner 3D mesh used for the global FE analysis; only a quarter of the overall 24 cells actually had to be used due to the biaxial symmetry of the structure. Reprinted with permission from John Wiley and Sons, Inc. [31]
37 Abbreviation for NASA Structural Analysis System, a solver developed by NASA and still in use today. Due to its long history, it is considered a de facto standard for FE computations in air- and spacecraft design. The most popular NASTRAN version today is MSC.Nastran developed by MSC Software; see http://www.mscsoftware.com/product/msc-nastran.


(a) original tricell FE mesh

(b) detailed layout of tricell FE grid

Figure 3.5. FE meshes used for the design of Sleipner A. (a) horizontal cut in tricell region. (b) zoom of the gray-shaded region in (a) used for local analysis after the incident (onesixth of the whole tricell mesh is sufficient due to symmetry). Reprinted with permission from SINTEF [14].

A horizontal cut of the original38 FE mesh in the tricell region is visualized in Fig. 3.5. Part (b) shows a zoom of the gray-shaded region of part (a). Especially at the tricell walls, only two CHEXA-8 elements have been used (in the zoomed-in region), resulting in a rather coarse resolution, and the skew angle of those elements is about 13° and 26° (as also indicated in [47]). The coarse grid in combination with the skewed elements led to inaccurate results in the FE analysis. Subsequent computations by SINTEF showed that much more accurate results could have been obtained by finer or regular cells (see [12, 14, 40]).
• Regarding step 3: There is no Sleipner A–specific information on the local analysis of isolated parts regarding possible failure causes.
• Regarding step 4: The postprocessing tool POST was used in the context of the Sleipner A design. The relevant steps are sketched in Fig. 3.6. The stress values at the degrees of freedom of the global FE analysis are averaged in the middle sections of the design mesh cells. The section forces are integrated over these sections and stored in the center points (see the second and third sketches in Fig. 3.6) before being extrapolated longitudinally to the boundary (see the fourth sketch). In POST, a straightforward usage of parabolic extrapolation had been implemented. The authors of [40] and [14] show nicely how this extrapolation aggravated the underestimation of the shear forces to about 59% of the value obtained from basic beam theory; in contrast, a simple linear regression gives much more accurate results of approximately 81% in this context, even with the original inaccurate FE values. Although this computational deviation is huge given the small design window for the engineers to balance the strength vs. the weight of the structure, it alone would most probably not have led to the failure
38 Note that different variants of the tricell FE mesh can be found in the literature. To the best of our knowledge, [14, 17, 27, 28, 31] provide the actual original mesh used for the design of Sleipner A in the tricell regions. Other visualizations such as the ones given in [10, 13, 40] typically do not differ much but are not identical to the original mesh, in particular in the locally zoomed part similar to Fig. 3.5(b).


Figure 3.6. Workflow for the FE design of Sleipner. Design work and drawings from Dr.Techn. Olav Olsen (abbreviated by OO). Reprinted with permission from SINTEF [18].

(cf. [11]). Only combined with an inadequate reinforcement layout (see below) was the cracking of the tricell walls possible. Based on the (underestimated) section forces, POST calculated the reinforcements for all parts of the structure. Fig. 3.7 shows the resulting reinforcements for the tricell joint regions. A grid of horizontal and vertical bars close to the boundaries of the walls was used. Furthermore, stirrups (i.e., wide, inverse-u-shaped bars from one side of a wall to the other; see Fig. 3.8) were used with a vertical spacing of 17 cm in the bottom third of the height of the cell walls and with a vertical spacing of 34 cm in the middle third, stopping just below the assumed



Figure 3.7. Sketch of assumed failure mode and reinforcement details in (a) horizontal and (b) vertical cut of the tricell wall. These pictures are inspired by [44] and are not exact representations of the actual reinforcement geometry but qualitatively visualize the scenario. In (a), no stirrups are contained since the upper third of the tricell wall, where the failure occurred, possessed no horizontal stirrups (see Fig. 3.8 or [46] for a horizontal cut including stirrups). In (b), the bold arrows illustrate the water flow through the inlet at the top of the tricell continuing through the crack into the shaft D3.

failure (cf. [4, 46]; see also Fig. 3.7(b)). Additional T-headed bars were used as reinforcement across the tricell joints with a length of about 1 m (cf. Fig. 3.7(a)). These reinforcements turned out to be insufficient; in particular, the T-headed bars were too short (i.e., improperly anchored; cf. [28]). Various tests showed that the failure could have been prevented by either longer T-headed bars (see, e.g., [4, 10, 12, 26]) or stirrups for the whole height of the wall (cf. [4]). To summarize, the irregular FE elements (skewed and with a too-high aspect ratio) and the extrapolation in the postprocessing led to an estimate of about 59% of the boundary shear forces compared to the correct solution. Missing stirrups and too-short T-headed bars in the reinforcement layout contributed as a third, compounding cause to the complete loss of the Sleipner A platform. The Sleipner accident, however, also had a positive impact thanks to various “lessons learned” (mentioned, e.g., in [4, 11, 13, 23, 27, 28, 31, 40, 41, 47, 49]), in particular with respect to a more thorough analysis and application of FE software and to updates in norms and codes for the design of concrete buildings. A number of papers used the accident and the information gathered as a test scenario for FE codes (e.g., [4, 25, 46]), and Sleipner A still represents a decent albeit typically short motivation in FE courses or books such as [6, 22, 23, 29, 37]. The above-mentioned three causes for failure (irregular elements, parabolic extrapolation, and too-sparse reinforcements) can be found in most of the papers and reports on Sleipner A. Again, we observe not a single failure but a combination of several causes. In the remainder of this section, we will go into details concerning two aspects that are not so frequently examined in the context of the Sleipner A incident: a discussion on the potential lack of know-how during the design of the platform and a discussion on the question of whether the design engineers actually committed errors with respect to the state-of-the-art design criteria and specifications.


3.1.5 Discussion on Expertise and Experience In the relevant literature, different arguments can be found concerning the level of knowhow and experience in the design of the Sleipner A platform. Arguments indicating a considerable level of know-how and experience (“pro”):

• NC had already successfully built 11 platforms before Sleipner A that “did not deviate significantly from earlier platforms” ([15], p. 1; similar in [16]). • Problems with tricells had been encountered and fixed in every Condeep project by NC (cf. [47], p. 42: “In all projects prior to Sleipner A water pressure in the tricells was an issue”; similar in [1]). • In the design of Sleipner A, top experts in the field of Condeep construction both internally at NC and externally worked together, accumulating massive personal experience and know-how (cf. [27, 31]); in particular concerning the in-house global FE analysis at NC, people knew each other well (see [47], p. 48). • FE analysis had been used in the design of all other Condeep structures ([8, 40, 47]), albeit with other tools and by other companies; see below. Arguments indicating a lack of know-how and experience (“con”):

• Details in the design of Sleipner A tricells were considerably different with respect to geometry (cf. [15, 28]), to optimizing the volume of concrete (see [28]), and to the functionality (closed tricells fitted with piping systems; see [47], p. 42). • “Loss of corporate memory” ([1], p. 94; mentioned without additional sources): No person involved in former Condeep projects participated in the design of Sleipner A. • The SINTEF main report [12] states, without details on the reasons: “A main cause of the accident is inadequate first-line competence in numerical analysis, and lack of experience in overall design in all phases” (p. 18). • The FE global analysis for Sleipner A was, for the first time, completely covered in-house at NC. • Software change concerning the FE calculations: NASTRAN was used for the first time in the context of Condeep structures instead of the previously applied software SESAM (see [18, 47]). • Concerning the placement of the reinforcements, in particular the T-headed bars, the authors in [8, 31] argue that experienced design engineers would have extended the T-headed bar considerably into the pressure zone on both sides of the tricell Y-joints. • In [27] and [31], the authors state that the Sleipner case suffers from an overemphasis of and an overreliance on computational FE methods (a similar argument is given in [39]). They suggest to perform a local design of the tricell with strut-and-tie models based on a global analysis. Strut-and-tie models are a classical tool for designing reinforced and prestressed concrete members and struc-


tures, applying equilibrium models that rely on struts as compression elements and on ties for the reinforcements. This approach is especially appropriate for designing and detailing discontinuity regions and would immediately indicate problems in the connections of the tricell walls.

3.1.6 Discussion on Failure Causes vs. Errors The discussion in this section is mainly based on [47], which provides very detailed descriptions of the background of Condeep constructions and the design of Sleipner A, including results from internal reports and personal interviews (but lacking visualizations; a comparably large number of images are contained in [40], [10], or [19]). The amount and level of details is actually the reason why we placed this discussion at the end of this chapter. Furthermore, in this contribution, G. Wackers provides a perspective profiting from a temporal distance (written in 2004, more than 12 years after the accident) and a personal distance (i.e., considerable objectivity39 ). The conclusion drawn explicitly by Wackers is that there have been failure causes, but that no actual errors have been committed in the decisions. It is only clear in retrospect what went wrong, but at the time of decision-making, the stakeholders were conforming with the relevant norms, codes, rules, specifications, etc., i.e., they were “doing the job as good as possible” (cf. [47], p. 11). In contrast to this conclusion, postaccident investigations typically assume a fundamental error or failure that “simply” has to be localized and eventually be fixed. In [48], Wackers refers to this phenomenon as retrospective fallacy.40 This represents an interesting aspect that is implicitly supported as well as contradicted by other contributions and discussed in the remainder of this section. Concerning the historical background of the Sleipner A development, note that there had occurred a considerable change in the second half of the 1980s: NC was now an Aker company, resulting in a more business-oriented management style than the previous “entrepreneurial styled leadership dominated by engineers” ([47], p. 29). Furthermore, a considerable drop in oil and gas prices to about 50% and more successful competitors of NC such as Peconor pushed NC to reduce costs for the development of oil and gas rigs for the North Sea. In the following, we list different decisions in the design of Sleipner A that contributed to the incident but represented no direct errors (page numbers in the headings refer to the related sections in [47]): • Decision on tricell design (p. 41ff.): NC and Statoil fixed limits for various design choices in the so-called Concept Report before the start of the detailed design layout of Sleipner A. Experiences from previous projects were included, in particular concerning the tricells. All projects prior to Sleipner A encountered problems with water pressure in tricells. In the Statfjord A project in the 1970s, this even led to serious cracks in the tricell corners. The design answer—also used in Condeep projects after Statfjord A—was a modified tricell geometry and the usage of bicycle handlebar–shaped reinforcement bars across the corner. Actually, both of these best-practice solu39 Compared to other reports. Note that the authors of [15, 16, 17, 28, 49] were affiliated either with NC or Statoil and, hence, most probably experienced subjective influences when writing reports or papers shortly after the accident, in particular when liability issues had not yet been settled. 40 Related to hindsight bias in psychology (see https://en.wikipedia.org/wiki/Hindsight_bias) or to historian’s fallacy in historical science (see https://en.wikipedia.org/wiki/Historian’s_fallacy). A similar phenomenon is the black swan effect described in Sec. 3.4.


tions were modified throughout the different design steps in the Sleipner case (see below), leaving Sleipner A with a different tricell geometry (see also [15] for a visual comparison) and with different (T-headed) reinforcement bars. Furthermore, all projects prior to Sleipner A had closed top domes for the tricells. The water level inside the tricells could be controlled via actuators. In the event of wall ruptures, this mechanism served as a backup to limit the amount of water flowing from tricells into other cells. In the design of Sleipner A, the top dome of the tricells contained a hole, resulting in the complete tricell being filled with water without any control of the amount of water inflow or hydrostatic pressure. According to Wackers, no primary reason for the holes was given in any source. From an engineering point of view, the top dome with a hole and the closed top dome with pipes and valves were equal (assuming no failure elsewhere); hence, the decision favored the cheaper version (hole).
• Decision on in-house FE analysis with NASTRAN (pp. 29, 45ff.): The Sleipner A project was the first Condeep project where NC offered to perform the full global FE analysis on its own. Before, other companies such as Computas and Veritec were in charge of the global FE computations. In previous projects, NC had gained considerable experience in this field; hence, offering the global FE computations as an in-house solution seemed the natural next step. Furthermore, NC engineers trusted more in NASTRAN, for which they had run 10 large studies, than in the somewhat unavoidable update from SESAM-69 to SESAM-80 forced by the vendor Veritec. This also seemed like a reasonable choice, with no indication of an error here either. Note that Veritec and SESAM-80 were not completely out of the game: In the context of quality assurance, Statoil contracted Veritec to review not all but only Statoil-selected documentation of NC.
• Decision on type of elements in NASTRAN (pp. 53–55): FE does not represent a strict, deterministic rule-based approach. A lot of engineering judgment is required for the choice of the elements with respect to the size and shape of the grid cells as well as to the piecewise trial basis functions (concerning the order of the polynomials, etc., resulting in different numbers of degrees of freedom per mesh cell). Every FE computation represents an approximation by definition, and the effect of choices on accuracy can only be directly measured for benchmark scenarios in comparison to available correct solutions. For real-world scenarios such as the Sleipner A design, correct solutions are almost always unknown. One way to obtain information on the accuracy of FE solutions with different resolution or order is to compare a sequence of computations, i.e., solving the same problem many times, which clearly was infeasible in the context of costs and computational resources at that time. A second option to estimate the accuracy of the solutions is the use of a posteriori error estimators (see, e.g., [23]) that are included in modern finite element method (FEM) tools but were not standard at the time of the events and typically increase the computational costs associated with the solutions. The choice of the relatively coarse cells and a comparably low number of 8 and 6 degrees of freedom per CHEXA and CPENTA element, respectively, was driven by considerations of computational resources and feasibility (under the given project restrictions).
Concerning the irregularity of elements used in FE, it was known that irregular elements with huge aspect ratios or large skew angles can result in considerably


inaccurate results. But no strict rule or limit on the maximum allowed aspect ratio or skewness of elements for specific accuracy demands was (and actually still is41 ) available. The NASTRAN primer [30], in particular, contains no hints or warnings in the context of the CHEXA type of elements.42 Since Dr.Techn. Olav Olsen, who was in charge of the engineering layout of the computations as additional contractor (see Fig. 3.6), and Veritec (in the context of verification reports) put this on the agenda in the context of regular meetings, NC engineers discussed the usage of skewed and high-aspect-ratio elements explicitly. Concerning the skewness of elements, their use was not new; some of the engineers had experience in designing steel structures with skewed elements. But for both steel and concrete, no standard on the allowed or useful degree of skewness was available. Furthermore, a verification review realized by Grosch and Brekke (consultants from Norwegian Offshore Contractors, dealing with the verification concerns of Veritec mentioned above) also considered all skewed elements both in the previous Gullfaks C Condeep project and in Sleipner A. For specific parts (top-domes of the cells), recommendations to reduce skewness were followed; the skewed elements in the tricell walls were considered to be neutral (i.e., not harmful) by all involved parties and hence remained in the design. Concerning the usage of higher aspect ratio of certain grid cells, the verification reports by Veritec contained concerns, but these concerns had no consequences. Hence, only via additional computations after the failure (i.e., multiple solutions of the same problem), it became clear that the choices on the finite elements led to considerably underestimated shear forces at the tricell walls. In [10, 13], Holand disagrees with the no-error hypothesis concerning the choice of elements: “The utilization of irregular elements in the critical section in the Sleipner case was not in accordance with established practice” ([10], p. 34). Holand does not discuss in which context this practice had been established: at his group at SINTEF, in particular academic research groups, in the whole academic community, at NC in particular, or at engineering companies in general. Similarly, the SINTEF report [14], stemming from colleagues of Holand, states, “It is well known that the accuracy of Finite Element calculations deteriorates when elements are distorted from regular geometrical shape, e.g. rectangular shape” (p. 5.1). • As an additional flaw, the software used was putting warning flags on sections that should be further analyzed by the user, but it did not show a flag for the critical section in the tricell. So the developing team relied upon the program without questioning the results. • Decision on additional triangular parts in tricell corners (p. 51f.): Due to the optimization (i.e., the slimming) of the tricell walls, an additional problem arose: The thinner walls were too long to effectively withstand the hydrostatic forces. Hence, the engineers modified the overall tricell geometry— 41 Even today, only rough guidelines and rules of thumb are available in the literature (see, e.g., [9]), exactly because the consequences are complex and demand a lot of mathematical and engineering judgment including detailed knowledge on the scenario to be solved. 42 Some mixed experience with irregular elements of different types of NASTRAN elements was, however, available in the literature (cf. [21], for example).


both for the FEM analysis and for the actual concrete structure—by including triangular-shaped43 grid cells in the corners of the tricells (see Fig. 3.5(a): 4 instead of 6 hexahedral cells along the tricell wall plus 2 new prism-type cells). Hence, not only the local free wall length but also the overall approximate geometry Ωh had been changed! This seemed an improvement but actually created problems that had originally been avoided in the very first step (see "Decision on tricell design" above) by including historical know-how of the Statfjord A project.
43 "Triangular" refers, in this context, to the 2D horizontal cuts to visualize the tricell geometry and the elements that actually had the shape of prisms in full 3D.

Figure 3.8. Sketch of bicycle handlebar–shaped bars in the reinforcement layout. This layout does not represent the original design by O. Olsen but was used in a detailed analysis by SINTEF after the incident and gives an idea of the difference compared to the T-headed bar from Fig. 3.7(a). It also shows the horizontal layout of the stirrups in the lower two-thirds of the tricell walls (Fig. 3.7(b)). Reprinted with permission from SINTEF [25].

• Decision on T-headed reinforcement bars (p. 57ff.): The second part of the best-practice solution from the Statfjord A project, the bicycle handlebar-shaped bars in the reinforcement, had been changed rather tacitly in the engineering process after the design: While Olav Olsen originally had bicycle handlebar-shaped bars in the first version of reinforcement drawings (see Fig. 3.8 for an idea of such a shape), a request by NC came in to replace them with straight T-headed bars. Those T-headed bars, stemming from a continuous process of product innovation, seemed to be an improvement and had successfully been used in the Gullfaks C Condeep project (but in a slightly different context). NC had a certain amount of leftover T-headed bars, so why not use them and improve both the engineering and the cost factor of the structure? The usage, the layout, and the arrangement of the T-headed bars were in accordance with the internal and external design reviews and verification procedures. Only after the accident did the inadequate reinforcement become obvious, as explained above.


To summarize this discussion, note that several other contributions support the “no-error” hypothesis at least indirectly by mentioning the need or the successful installation of revised design guidelines and criteria (cf. [1, 27, 28, 31], for example).

3.1.7 Key Points • Major causes: A variety of different decisions contributed to the failure. The most prominent ones are an insufficient layout of the FE grid cells (too skewed and with a too-high aspect ratio) resulting in an underestimated shear stress in tricell walls, the choice of an extrapolation scheme aggravating this underestimation, and an insufficient reinforcement layout of the tricell construction. Note that Sec. 3.1.6 discusses in detail various aspects concerning the hypothesis that no errors were committed, but unlucky decisions in accordance with all existing rules and best practices were made. • Conclusion: The loss of the gas rig Sleipner A represents an example of a failure related to software that is complex and shows a variety of causes and aspects. This incident led to a discussion on and installation of improved specifications and regulations. Besides, people working in the field of scientific computing learned a few lessons on how to use or not use computational tools. The authors of [40] summarize this effect as follows: Probably the biggest lesson from this case study is the need never to treat computer analysis as a black box process. Computer analysis is only as good as the user who inputs the model and interprets the results. Proper modeling and interpretation requires a strong understanding of the theoretical and practical workings of the programs and a thorough understanding of what the results mean. Rational methods of checking results should always be employed and quality assurance processes should allow the time for proper attention to such details.

Bibliography [1] R. G. Bea. The Role of Human Error in Design, Construction, and Reliability of Marine Structures. Technical report, Ship Structure Committee, 1994 [2] D. Braess. Finite Elements: Theory, Fast Solvers, and Applications in Solid Mechanics. Cambridge University Press, 2001 [3] H.-J. Bungartz, S. Zimmer, M. Buchholz, and D. Pflüger. Modeling and Simulation. Springer Undergraduate Texts in Mathematics and Technology. Springer, 2014 [4] M. P. Collins, F. J. Vecchio, R. G. Selby, and P. R. Gupta. The Failure of an Offshore Platform. Concrete International, 19(8):28–35, 1997 [5] W. D. Cullen. The Public Inquiry Into the Piper Alpha Disaster. v. 1. Stationery Office, 1990 [6] B. Einarsson (editor). Accuracy and Reliability in Scientific Computing. SIAM, 2005


[7] O. A. Engen. The Development of the Norwegian Petroleum Innovation System: A Historical Overview. In J. Fagerberg, D. C. Mowery, and B. Verspagen (editors), Innovation, Path Dependency and Policy - The Norwegian Case, Chapter 7, pages 179–207. Oxford University Press, 2009 [8] F. Faröyvik. Sleipner: Fra Ingeniorkatedral til Betonskrab. Teknisk Ukeblad, 138(38 24-10):8–9, 1991 [9] C. A. Felippa. Introduction to Finite Element Methods. Lecture Notes (ASEN 5007), 2016. http://kis.tu.kielce.pl/mo/COLORADO_FEM/colorado/IFEM.Ch07.pdf

[10] I. Holand. The Loss of the Sleipner Condeep Platform. In G. M. A. Kusters and M. A. N. Hendriks (editors), DIANA Computational Mechanics ‘94: Proceedings of the First International Diana Conference on Computational Mechanics, pages 25–36. Springer, 1994 [11] I. Holand. Structural Analysis of Offshore Concrete Structures. In IABSE congress report, volume 15, pages 875–880, 1996 [12] I. Holand. Sleipner A GBS Loss. Report 17. Main Report. Technical report, STF22 A97861, SINTEF, 1997 [13] I. Holand and R. Lenschow. Research behind the Success of the Concrete Platforms in the North Sea. In Mete A. Sozen Symposium, pages 235–272. American Concrete Institute, ACI SP-162 edition, 1996 [14] K. Holthe and L. Hanssen. Sleipner A GBS Loss. Report 5. Global Structural Response. Technical report, STF22 A97725, SINTEF, 1997 [15] B. Jakobsen. The Loss of the Sleipner A Platform. In Proceedings of the 2nd International Offshore and Polar Engineering Conference, pages 1–8, 1992 [16] B. Jakobsen. The Sleipner Accident and Its Causes. Engineering Failure Analysis, 1(3):193–199, 1994 [17] B. Jakobsen and F. Rosendahl. The Sleipner Platform Accident. Structural Engineering International, 3:190–193, 1994 [18] E. Jersin, T. H. Soereide, and A. R. Reinertsen. Sleipner A GBS Loss. Report 16. Quality Assurance. Technical report, STF38 A97428, 1997 [19] K. Kurojjanawong. The sinking of offshore concrete gravity base - Sleipner Alpha. date accessed: 2017-10-19. http://blog.gooshared.com/view/166 [20] R. H. MacNeal. A Simple Quadrilateral Shell Element. Computers & Structures, 8:175–183, 1978 [21] O. G. McGee. Finite Element Analysis of Flexible, Rotating Blades. Technical report, NASA, 1987 [22] A. Morris. A Practical Guide to Reliable Finite Element Modelling. Wiley, 2008 [23] S. Oliveira and D. E. Stewart. Writing Scientific Software: A Guide to Good Style. Cambridge University Press, 2006


[24] O. Olsen. CONDEEP-PLATTFORMER. Technical report, 2011 [25] E. Opheim. Sleipner A GBS Loss. Report 7. FEM Analysis of Tri-Cell Joint. Technical report, STF22 A97857, SINTEF, 1997 [26] E. Opheim. Sleipner A GBS Loss. FEM Analyses of Y-tests. Technical report, STF22 A98733, SINTEF, 1998 [27] K.-H. Reineck. Der Schadensfall Sleipner und die Folgerungen für den computerunterstützten Entwurf von Tragwerken aus Konstruktionsbeton. In FEM 95, pages 63–72. E. Ramm (editor), Stuttgart, Germany, 1995 [28] W. K. Rettedal, O. T. Gudmestad, and T. Aarum. Design of Concrete Platforms after Sleipner A-1 Sinking. In Proceedings of the 12th Int. Conf. on Offshore Mechanics and Arctic Engineering, Vol. I, pages 309–319. American Society of Mechanical Engineers, 1993 [29] G. A. Rombach. Finite Element Design of Concrete Structures: Practical Problems and Their Solution. Thomas Telford Limited, 2004 [30] H. G. Schaeffer. MSC/NASTRAN Primer: Static and Normal Modes Analysis. Schaeffer Analysis, 3rd edition, 1982 [31] J. Schlaich and K.-H. Reineck. Die Ursache für den Totalverlust der Betonplattform Sleipner A. Beton- und Stahlbetonbau, 88(1), 1993 [32] B. L. Stead, O. T. Gudmestad, and S. Dunseth. Importance of Fabrication Engineering in the Early Phases of the Sleipner A Development. In Proceedings of the 11th Int. Conf. on Offshore Mechanics and Arctic Engineering, Vol. I-B, pages 543–548. The American Society of Mechanical Engineers, 1992 [33] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer, 3rd edition, 2002 [34] G. Strang. Computational Science and Engineering. Wellesley-Cambridge Press, 2007 [35] G. Strang and G. Fix. An Analysis of the Finite Element Method. WellesleyCambridge Press, 2008 [36] B. Stupak. Inquiry into the Deepwater Horizon Gulf Coast Oil Spill. Opening Statement of the U.S. Committee on Energy and Commerce, 2010 [37] B. Szabó and I. Babuška. Introduction to Finite Element Analysis. John Wiley & Sons, 2011 [38] A. J. Tatem, C. A. Guerra, P. M. Atkinson, and S. I. Hay. Athletics: Momentous sprint at the 2156 Olympics? Nature, 431(7008):525–525, 2004 [39] A. Tedesko. Experience. An Important Component in Creating Structures. Concrete International, 15(2):70–74, 1993 [40] K. Thompson. The Sleipner A Failure - August 23, 1991. Technical report, 2011 [41] K. Thompson, J. O. Jirsa, and J. E. Breen. Behavior and Capacity of Headed Reinforcement. ACI Structural Journal, 103(4):522–530, 2006


[42] E. Thorenfeldt and H. Bra. Sleipner A GBS Loss. Report 6. Strength of Critical Areas. Technical report, STF22 A97854, SINTEF, 1997 [43] L. N. Trefethen. Approximation Theory and Approximation Practice. Other Titles in Applied Mathematics. SIAM, 2013 [44] K. Tretiakova. Sleipner A - North sea oil platform collapse. date accessed: 2017-10-19. https://failures.wikispaces.com/Sleipner+A+-+North+Sea+Oil+Platform+Collapse

[45] A. Tveito, H. P. Langtangen, B. F. Nielsen, and X. Cai. Elements of Scientific Computing, volume 7 of Texts in Computational Science and Engineering. Springer, 2010 [46] F. J. Vecchio. Contribution of Nonlinear Finite-Element Analysis to Evaluation of Two Structural Concrete Failures. Journal of Performance of Constructed Facilities, 16(3):110–115, 2002 [47] G. Wackers. Resonating Cultures : Engineering Optimization in the Design and (1991) Loss of the Sleipner A GBS. Technical Report 32, Center of Technology, Innovation and Culture, University of Oslo, 2004 [48] G. Wackers. Entrainment, Imagination, and Vulnerability. Lessons from LargeScale Accidents in the Offshore Industry. In Vulnerability in Technological Cultures, pages 179–198. MIT Press, 2014 [49] H. Ynnesdal and F. Berger. The Sleipner Accident. In SPE Health, Safety and Environment in Oil and Gas Exploration and Production Conference 1994, pages 715–716. Society of Petroleum Engineers, 1994

3.2 London Millennium Bridge London Bridge is falling down. —Nursery rhyme

3.2.1 Overview • Field of application: civil engineering • What happened: wobbling bridge • Loss: £5 million • Type of failure: insufficient input assumptions in FE analyses

3.2.2 Introduction Bridges are very impressive and useful structures, but if constructed improperly they may collapse and cause fatal casualties. The bridge over the Firth of Tay in Scotland, e.g., collapsed on December 28, 1879, when a train from St. Andrews to Dundee was crossing during a thunderstorm. The whole train as well as parts of the bridge sank into the river, due to different reasons concerning design and maintenance, resulting in an estimated 74 victims (see [10]). The Quebec Bridge in Montreal tumbled down during


construction on August 29, 1907, killing nearly the whole male part of the tribe of Kahnawake Indians working on the bridge (see [1, 2, 3] and the YouTube videos accessible via the QR codes in Figs. 3.9(a) and 3.9(b)). A very prominent example is the collapse of the Tacoma Narrows Bridge, which occurred on November 7, 1940, due to wind-induced transverse vibrations (cf. [9]). This collapse of the so-called Galloping Gertie is impressively recorded in videos available on YouTube (see the QR code in Fig. 3.9(c)). Also in a military context, it is well known that soldiers crossing a bridge marching in lockstep can induce oscillations and a collapse. Therefore, certain low-frequency oscillations and their amplification have to be taken into account when designing and mathematically and numerically analyzing a projected bridge.

5

5

10

10

10

15

15

15

20

20

20

25

25

25

30

30

30

35

35

35

40

40

(a) 10Quebec Bridge (short) 15 20 25 30 35

5

40

(b) 10Quebec Bridge (long) 15 20 25 30 35

5

40 40

(c) Tacoma Narrows Bridge 5 10 15 20 25 30 35 40

Figure 3.9. QR codes for videos on YouTube regarding (a) a brief 1-minute survey of the collapse of the Quebec Bridge in Montreal, (b) a detailed 20-minute documentary by BDHQ on the collapse of the Quebec Bridge in Montreal, and (c) original video material with comments regarding the collapse of the Tacoma Narrows Bridge.

3.2.3 Timeline In 1996, the planning of a new footbridge in London began. The 320-meter long pedestrian bridge was designed to connect St. Paul’s Cathedral and the Tate Modern Art Gallery. To allow a clear view of St. Paul’s, its height was restricted, leading to the elegant design of a shallow suspension bridge (see Fig. 3.10). The project team included Sir Norman Foster and Partners and Sir Antony Caro as architects, and Ove Arup and Partners as structural engineers. On June 19, 2000, the London Millennium Bridge opened, attracting thousands of people. Unfortunately, the bridge started a considerable wobbling caused by the walking pedestrians. The vibrations were so intense that many pedestrians latched onto the banister. Therefore, three days after the opening, the bridge had to be closed again. After a series of tests and calculations, additional dampers were installed and the bridge was able to reopen on February 22, 2002 (see [7, 8]).

3.2.4 Detailed Description of the Bug The main source for the following information is [4]. The original construction was supported by parametric analysis based on classical numerical tools relying on an FE analysis (see Excursion 8). The influence of wind was experimentally tested by a 1:16 model of the bridge. The influence of pedestrians on the stability of the bridge was examined by a modified BS 540044 “bad man” approach. The analysis of the excitation showed no serious problems, and tests with a few people confirmed the calculated results. 44 BS 5400 is the British standard code for the design and construction for steel, concrete, and composite bridges.


(a) View from St. Paul’s Cathedral


(b) Pedestrians’ perspective

Figure 3.10. London Millennium Bridge: (a) The bridge seen from St. Paul’s Cathedral. Courtesy of Ibex73, via Wikimedia. (b) Pedestrians’ perspective of the bridge onto St. Paul’s Cathedral. Courtesy of Jordiferrer, via Wikimedia.

On opening, about 2,000 people at a time were walking on the bridge. This caused excessive lateral vibrations on the three different spans of the bridge for the first lateral modes of about
• 0.8 Hz on the southern span,
• 0.5 Hz on the central part, and
• 1.0 Hz on the northern part.
Following the closure, the engineers started a thorough analysis of the characteristics of the bridge. Crowd tests were carried out on the bridge in July 2000 with 100 people and in December with up to 275 people. The number of persons was slowly increased in order to model different loading. As a result, it turned out that the vibrations were caused by the movement of the pedestrians, which led to a substantial lateral loading that had not been anticipated by the numerical simulations. If only a few people were walking simultaneously on the bridge, no problem occurred. But for a larger number of walking pedestrians, a psychological crowd-related effect took place that resulted in pedestrians walking in unison. The vibrations were intensified by the fact that the walkers lapsed into a sailor's gait induced by the slight movement of the bridge: they adjusted their steps to the vibration in order to keep their balance. Therefore, in a positive feedback effect, the small pedestrian-induced vibrations that had been predicted caused a linked swaying movement of the people, which intensified the vibrations. This effect is called synchronous lateral excitation; the authors of [4] state that (p. 12)
Although some previous descriptions of this phenomenon were found, none of them gave any quantification of the force due to the pedestrians, or any relationship between the force exerted and the movement of the deck surface.

According to [4], the initial analytic model of the bridge structure was valid. The movements were caused by an unpredicted external force, not by a miscalculated bridge response.

Figure 3.11. London Millennium Bridge: QR codes for videos on YouTube regarding (a) the wobbling of the bridge during the opening on June 19, 2000, and (b) a test run before the reopening in 2002. The picture (c) shows additionally installed viscous dampers. Courtesy of Dave Farrance, via Wikimedia.

In the prototype analysis of the bridge, an iterative process is applied, starting with the setup of an FE model of the actual prototype, followed by the subsequent steps:

• Apply load cases via boundary conditions according to the standards and the previous analysis.
• Check the results as to whether all requirements are fulfilled.
• If this is not the case: improve the prototype, update the FE model, and go back to the first step. Otherwise: stop.

For the Millennium Bridge, the effect of synchronous lateral excitation by pedestrians was not included in this iterative analysis. Therefore, the external forces applied in the design process of the bridge via boundary conditions in the FE tools had been underestimated, leading to an insufficient layout. The bug was not in the software itself, but in the wrong application of the software in the FE analysis of the bridge. It is important to note that the problems with the vibrations were not linked to the technical innovations: the same effect could also occur on other bridges with a lateral frequency of less than about 1.3 Hz loaded by a sufficient number of pedestrians [4]. After the analysis, additional dampers were built in (see [5, 6]): 37 viscous dampers were installed, mainly damping lateral motions, but also vertical and torsional movements (cf. Fig. 3.11 for an example). Furthermore, 26 pairs of vertically acting tuned-mass dampers were integrated in order to absorb the oscillations. Finally, four pairs of laterally acting tuned-mass vibration absorbers were used in the center span. The overall additional cost amounted to about £5 million (cf. [5]), increasing the total cost of the bridge to £18.2 million. After some successful tests, the bridge reopened on February 22, 2002. Even today, small movements of the bridge can be noticed, but there is no longer a resonance problem. A minimal numerical sketch of such a check-and-improve iteration is given below.
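To make the check-and-improve loop concrete, the following Python sketch runs it on a toy stand-in for the FE model: a single damped lateral mode treated as a harmonic oscillator whose damping is increased until the steady-state amplitude under a harmonic pedestrian load is acceptable. All numbers (modal mass, stiffness, forcing, amplitude limit) are invented for illustration and are not the bridge’s actual parameters; the real design process used a full FE model rather than this one-degree-of-freedom surrogate.

```python
import math

# Toy version of the iterative design loop: model one lateral mode as a damped
# oscillator m*x'' + c*x' + k*x = F0*sin(w*t) and add damping ("improve the
# prototype") until the steady-state amplitude ("check the results") is small
# enough. All parameter values are assumed, purely for illustration.

m = 1.3e5                      # assumed modal mass [kg]
k = 3.3e6                      # assumed modal stiffness [N/m] (about 0.8 Hz)
w = math.sqrt(k / m)           # excitation at the resonance frequency [rad/s]
F0 = 1.0e3                     # assumed lateral forcing amplitude [N]
limit = 0.01                   # assumed admissible amplitude [m]

c = 2.0e3                      # initial damping coefficient [N s/m]
for iteration in range(50):
    # steady-state amplitude of the harmonically forced, damped oscillator
    amplitude = F0 / math.sqrt((k - m * w**2) ** 2 + (c * w) ** 2)
    if amplitude <= limit:     # check step: requirement fulfilled?
        break
    c *= 1.5                   # improve step: install additional (viscous) damping

print(f"{iteration + 1} iterations, damping {c:.3e} N s/m, amplitude {amplitude:.4f} m")
```

In the Millennium Bridge case, the loop itself was sound; the problem was that the load cases fed into the check step did not contain the synchronous lateral excitation, so the original design passed the check.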


Bibliography

[1] CBC News. Kahnawake Mohawks mark 1907 Quebec bridge disaster. date accessed: 2018-03-09, 2006. http://www.cbc.ca/news/canada/kahnawakemohawks-mark-1907-quebec-bridge-disaster-1.623180

[2] V. Chazanovskij and P. M. Karcev. Warum irrten die Experten? Ungluecksfaelle und Katastrophen aus der Sicht technischer Zuverlaessigkeit. Verlag Technik, 3rd edition, 1990

[3] D. Cuthand. Cuthand: Bridge disaster of 1907 changes Mohawk community. date accessed: 2018-03-09, 2007. https://indiancountrymedianetwork.com/news/cuthand-bridge-disaster-of-1907-changes-mohawk-community/

[4] A. J. Fitzpatrick, P. Dallard, S. Le Bourva, A. Low, R. M. Ridsdill Smith, and M. Willford. Linking London: The Millennium Bridge. The Royal Academy of Engineering, 2001

[5] Foster and Partners. Millennium Bridge reopens. date accessed: 2018-03-09, 2002. https://www.fosterandpartners.com/news/archive/2002/02/millennium-bridge-reopens/

[6] D. E. Newland. Vibration of the London Millennium footbridge: Part 2 - Cure. date accessed: 2018-03-09. http://www2.eng.cam.ac.uk/~den/ICSV9_04.htm

[7] Structurae. Millennium Bridge. date accessed: 2018-03-09. https://structurae.net/structures/millennium-bridge

[8] Wikipedia. Millennium Bridge, London. date accessed: 2018-02-28. https://en.wikipedia.org/wiki/Millennium_Bridge,_London

[9] Wikipedia. Tacoma Narrows Bridge (1940). date accessed: 2018-03-09. https://en.wikipedia.org/wiki/Tacoma_Narrows_Bridge_(1940)

[10] Wikipedia. Tay Bridge disaster. date accessed: 2018-02-28. https://en.wikipedia.org/wiki/Tay_Bridge_disaster

3.3 Weather Prediction

Prediction is very difficult, especially about the future.
—Niels Bohr

3.3.1 Overview

• Field of application: weather forecast
• What happened: underestimating the storm Lothar in Europe in 1999
• Loss: €5 billion and up to 60 deaths from the storm Lothar in Europe⁴⁵
• Type of failure: wrong input and coarse time window in the context of data assimilation

⁴⁵ Note that these losses are of course not all related to a false forecast. However, a warning by the weather service might have mitigated those losses considerably, in particular with respect to the casualties.


3.3.2 Introduction

Despite decades of experience in numerical weather prediction and ongoing improvements in models, algorithms, and implementations, forecasts covering more than three days and/or incorporating extreme events such as thunderstorms have proven to be very challenging. The winter storm Lothar hit Western Europe on December 26, 1999. It came with wind speeds of more than 150 km/h at ground level, causing the death of up to 60 persons and damage of approximately €5 billion. More than 30 million solid cubic meters of wood were felled. Houses were destroyed, and highways and railway lines had to be closed for a couple of days. After the storm, harsh criticism was raised against the German Weather Service (Deutscher Wetterdienst, DWD) because the DWD, in contrast to other weather services, failed to deliver a warning in time. The effects of Lothar were poorly predicted by most weather services and very poorly predicted by the DWD (see, e.g., [1, 9, 18]). One of the challenges in numerical weather prediction is the sometimes chaotic behavior of the underlying differential equations that have to be solved, which—in the terminology of numerical analysis—can be ill-conditioned. The following excursions explain the concept of conditioning and condition numbers. Readers familiar with these aspects may skip the following two excursions.

Excursion 11: Conditioning and Condition Number

In Excursion 5 on rounding to machine numbers, we have discussed the possibly dangerous influence of rounding errors in scientific computations. In the corresponding examples of the Vancouver Stock Exchange and the Patriot system, the problematic features were related to the particular formulation of the algorithm used to solve a problem and not to the problem itself. By modifying the algorithms, the cumulative effect of the round-off errors could have been avoided in these cases. Some problems exist, however, that are always extremely sensitive to the influence of rounding errors, independent of the choice of the algorithm used to actually tackle them. Such problems are called ill-conditioned (see, e.g., [14, 41]). As an example, let us consider the logistic map

l(x) = r · x(1 − x)   for 0 < r ≤ 4 and 0 ≤ x ≤ 1.

The function l(x)—also called a logistic parabola—describes a simplified demographic model of the growth of a population under limited resources. Starting from time tᵢ, the relative population at time tᵢ₊₁ can be computed as x(tᵢ₊₁) = r·x(tᵢ)(1 − x(tᵢ)), or in simplified notation xᵢ₊₁ = r·xᵢ(1 − xᵢ): as long as the population is small, the population size at tᵢ₊₁ is enlarged by a factor r compared to xᵢ; if the population is large (close to 1), the population xᵢ₊₁ is shrinking by a factor r(1 − xᵢ) compared to xᵢ—in view of the limited resources. Note that for the allowed values of r and initial population 0 ≤ x₀ ≤ 1, the resulting values xᵢ are always contained in the interval [0, 1]. Especially for r = 4, x̂ = 1 − 10⁻¹⁵ = 0.999999999999999, and x̃ = 1 − 2·10⁻¹⁵ = 0.999999999999998, we can evaluate l(x̂) and l(x̃). Note that the difference and the relative error between x̂ and x̃ are on the order of 10⁻¹⁵. The evaluation of the function in MATLAB results in l(x̂) = 3.996802888650560·10⁻¹⁵ and l(x̃) = 7.993605777301111·10⁻¹⁵ with relative error (l(x̂) − l(x̃))/l(x̂) ≈ −1:

            x = x̂                      x = x̃                      relative error
x:          0.999999999999999          0.999999999999998          ε ≈ 9.99 · 10⁻¹⁶
l(x):       3.996802... · 10⁻¹⁵        7.993605... · 10⁻¹⁵        ≈ −1

This loss of accuracy in the computation of 4x(1 − x) is caused by the difference 1 − x of two numbers that are nearly identical—all the digits 9 in 1 = 0.999999999999999999... and x̃ = 0.999999999999998 coincide, and the difference appears only in the 15th digit after the decimal point. The “good,” reliable digits disappear in the difference operation, and only the spoiled digits remain. This effect of losing reliable digits and accuracy by computing differences of only slightly different numbers is called cancellation. In such a case, small relative perturbations in the input values x deliver large relative errors in the output values l(x). Therefore, the computations are very sensitive to (unavoidable) small changes in the input data, which can originate from measuring errors or previous rounding errors, for example. Such extreme behavior is analyzed by considering the condition number of a given problem.

Let us assume that we have to evaluate a real-valued function f(x) at the value x₀ related to a given computational problem P. We are interested only in the influence of small errors δ in the input x₀, assuming that all computations can be done exactly during the evaluation of f. In all practical ways of solving P, we usually have to take into account uncertainties in the input that can spoil our final result even in the case of otherwise perfect and flawless computations. To analyze this effect, we can linearize the function f in the form

f(x₀ + δₓ) = f(x₀) + δₓ f′(x₀) + ··· .    (3.1)

Ignoring small terms in δₓ² and higher powers of δₓ, for the relative error at x₀ we get for small perturbations δₓ the formula

(f(x₀ + δₓ) − f(x₀)) / f(x₀) ≈ (x₀ f′(x₀) / f(x₀)) · (δₓ / x₀) .    (3.2)

Therefore, we call

cond(f) := cond(P) := | x₀ f′(x₀) / f(x₀) |    (3.3)

the relative condition number of problem P (i.e., of function f at position x₀). The relative condition number of a function f describes how small relative input errors δ can affect the output result, magnifying the relative error by the factor cond(f). A problem is ill-conditioned for input x₀ if small relative perturbations in x₀ can lead to large relative errors in the result f(x₀). For ill-conditioned problems, there is practically no numerically satisfying way to obtain good results. The only possible way to improve the output result is to reduce the input error, which is often impossible due to physical and technical restrictions. Note that if the final result of a computation should be zero, this problem is typically ill-conditioned because numerical computations will never exactly result in zero and, therefore, the relative error of the output will be close to infinity due to f(x₀) ≈ 0 in the denominator of (3.3). For the logistic parabola, the condition number is given by cond(l(x)) = |x·r(1 − 2x) / (r·x(1 − x))| = |(1 − 2x)/(1 − x)|. For x close to 1, this condition number gets very large, and—as seen in the above example—small relative errors in x can generate huge relative errors in l(x).
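The cancellation effect and the estimated condition number can be reproduced in a few lines. The book’s numbers were obtained with MATLAB; the following Python sketch performs the same double-precision experiment, so the printed digits may differ slightly in the last places.

```python
# Cancellation in l(x) = r*x*(1-x) for r = 4 and x close to 1, plus the
# condition number cond(l) = |(1-2x)/(1-x)| derived above.

def l(x, r=4.0):
    return r * x * (1.0 - x)

x_hat = 1.0 - 1e-15
x_til = 1.0 - 2e-15

print(l(x_hat), l(x_til))                         # roughly 4.0e-15 and 8.0e-15
print((x_hat - x_til) / x_hat)                    # input perturbation ~ 1e-15
print((l(x_hat) - l(x_til)) / l(x_hat))           # output error ~ -1, i.e. 100 %

cond = abs((1.0 - 2.0 * x_hat) / (1.0 - x_hat))   # relative condition number at x_hat
print(cond)                                       # about 1e15: ill-conditioned
```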

Excursion 12: Conditioning and Chaos

Ill-conditioned numerical problems are closely related to chaotic behavior in dynamical systems. A demonstrative example is the iterative process related to the logistic parabola (compare Excursion 11) l(x) = r · x · (1 − x), r ≤ 4. Starting with some initial value x₀ ∈ [0, 1], we generate the sequence xᵢ₊₁ := l(xᵢ), i = 0, 1, ..., describing the growth of a population under limited resources. For r close to 4, the resulting iterates react dramatically to small changes in the starting value. In contrast to the example in Excursion 11, here we consider a smaller constant r < 4 and a whole sequence of evaluations. Let us assume that we are interested in x₅₀, using r = 3.999. The following table displays x₅₀—computed again with MATLAB—for three different, nearby starting values x₀:

x₀:    0.1                  0.1 + 10⁻¹⁶           0.1 + 10⁻¹⁵
x₅₀:   0.018139698766...    0.001387951918...     0.997641582032...

Note that all iterates have to be located in the interval [0, 1]. Therefore, the three possible values for x₅₀ are completely different, in spite of starting with values for x₀ that are similar up to machine precision. Again, slight changes in the input value lead to extremely different results. For the iteration based on the logistic parabola, the evaluation of l(x) in [0, 1] is ill-conditioned only for x ≈ 1, as we have seen in Excursion 11. Furthermore, for r close to 4, there can appear iterates close to 1 that will introduce a large rounding error by cancellation. This cancellation will appear in one of the iteration steps if an iterate is close to 1, spoiling all following numbers. To demonstrate this behavior, let us apply the function l(x) iteratively on the starting value x₀ = 0.9999 and a nearby x̃₀ = 0.9999 + 10⁻⁶ with r = 3.999. This results in the following sequence of iterates x, x̃, and relative errors |x − x̃|/x:

 i    xᵢ              x̃ᵢ              |xᵢ − x̃ᵢ|/xᵢ
 1    0.0003998...    0.0003958...    0.009999...
 2    0.0015984...    0.0015824...    0.009995...
 3    0.0063817...    0.0063181...    0.009979...
 4    0.0253579...    0.0251064...    0.009915...
 5    0.0988347...    0.0978800...    0.009660...
 6    0.3561768...    0.3531097...    0.008611...
 7    0.9170302...    0.9134646...    0.003888...
 8    0.3042668...    0.3161089...    0.038920...
 9    0.8465424...    0.8645200...    0.021236...
10    0.5195034...    0.4683832...    0.098401...
11    0.9982288...    0.9957525...    0.002480...
12    0.0070703...    0.0169134...    1.392184...

Already after 12 iterations, the results are completely different, and we can also recognize the jumps in the error of the following step whenever an iterate gets close to 1 (step 7 and step 11). We can also understand this behavior by considering the function x₁₂ = l¹²(x) := l(l(...l(x)...)), which represents the process of applying the logistic parabola 12 times to an initial value x₀. The following four figures contain the graphs of l(x), l²(x) = l(l(x)), l⁴(x) = l(l(l(l(x)))), and l⁸(x).


In particular, the function l⁸ is strongly oscillatory. Therefore, obviously small changes in x yield dramatically different values x₈ = l⁸(x), i.e., “chaotic” behavior. Hence, the graph of l⁸(x) illustrates the ill-conditioning. Typical examples of such ill-conditioned problems are weather forecasts over a longer period and predictions of economic development (see Sec. 3.4). In the context of weather forecasting, this chaotic aspect was expressed by E. Lorenz in the striking title “Predictability: Does the Flap of a Butterfly’s Wings in Brazil Set Off a Tornado in Texas?” of [17], and is called the butterfly effect in chaos theory.
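The sensitivity of the iterated logistic parabola is easy to verify directly. The table of x₅₀ values above was computed with MATLAB; the following Python sketch runs the same double-precision iteration, so the exact digits may differ slightly, but the qualitative result (three completely different values of x₅₀) is the same.

```python
# Iterate x_{i+1} = r*x_i*(1-x_i) with r = 3.999 from three nearby starting
# values and compare x_50: perturbations on the order of machine precision
# lead to completely different results.

r = 3.999

def x50(x0, n=50):
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
    return x

for start in (0.1, 0.1 + 1e-16, 0.1 + 1e-15):
    print(f"x0 = {start!r}: x50 = {x50(start):.12f}")
```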

3.3.3 Detailed Description of the Erroneous Forecast

As outlined in Excursion 7 on modeling and simulation, realizing simulations to predict certain quantities or to find optimal solutions to problems frequently relies on partial differential equations to model the relevant physical phenomena. The discretization of the underlying PDE consists of a mesh to obtain a finite representation of the underlying geometrical domain and of a discrete counterpart of the mathematical operators in the PDE (see Excursion 8 on the finite element method as one prominent example of a discretization scheme). In weather prediction, the complex PDE system involves variables such as wind speed, temperature, pressure, and humidity as degrees of freedom to be computed. For a realistic forecast, the mesh size has to be sufficiently small to achieve a satisfying accuracy. This directly translates to considerable computational costs since data on every grid cell and/or node have to be computed, frequently in a coupled manner in terms of linear or nonlinear systems of equations. For weather forecasts, computational results typically have to be provided within a limited wall-clock time window (typically 30 minutes, see, e.g., [37, 40]). Hence, specifying the resolution of the mesh always has to balance the accuracy or quality of the results and the costs of a simulation under given resources. The forecast of the DWD in 1999 was based on two models. The global model (the so-called GME) used a mesh resolution of 60 km and 31 vertical levels to discretize the atmosphere of the Earth (cf. [40]). The local model (LM) comprised a smaller geographical region (Europe) but provided a finer resolution of 7 km and 35 vertical levels.⁴⁶


Figure 3.12. Recent sketch of a mesh for DWD’s global and local model. Since the year 1999, the forecast skill of the simulations has made significant progress, based on new physical parameterizations, improvements in the data assimilation process, and a significantly finer mesh size. Reprinted with permission from Florian Prill, Deutscher Wetterdienst.

A sketch of different meshes and regions of interest is shown in Fig. 3.12. The system of PDEs for the two models involves degrees of freedom for physical quantities such as wind velocities, temperature, humidity, and pressure, and takes into account the conservation of momentum and mass in the atmosphere. Measured data are crucial for the numerical solution of such systems: Both the available amount of data and its quality are important when it comes to actual predictions. In the context of weather prediction, boundary conditions and accurate initial values have to be provided on many grid points at the start of the simulation. Almost always, the resolution of the simulation mesh is much finer than the resolution of the available data points stemming from different types of measurements obtained by ships, buoys, radiosondes,⁴⁷ airplanes, and satellites. Therefore, initial data for coordinates not coinciding with measuring locations have to be assimilated by (sophisticated) methods involving aspects of interpolation, for example (see [19, 20, 37, 47] for the DWD approach used in 1999 and [4, 31, 33] in a more general context). Furthermore, measured data for a given point in time are enhanced by previous solutions of simulations to further improve the assimilated data. In the context of Lothar, a corresponding data assimilation was applied on the GME with a time interval of 6 hours.

⁴⁶ Meanwhile, the LM at DWD is called the COSMO model, and the GME has been replaced by ICON, the ICOsahedral Nonhydrostatic global model.
⁴⁷ Radiosondes are battery-powered devices to measure atmospheric data such as altitude, pressure, temperature, wind speed and direction, and relative humidity. A sonde is typically attached to a weather balloon. These devices are an important source of data in the context of weather prediction for the military and civil sector. Every day, hundreds of such sondes are launched. For more details, see https://en.wikipedia.org/wiki/Radiosonde


The derived values were transferred to the local model as boundary conditions for the LM forecast, together with observations delivering output data hourly.⁴⁸ Excursion 13 briefly explains basic concepts and approaches in data assimilation. Readers familiar with these topics may skip this excursion.

⁴⁸ Many details on the actual configuration of DWD’s weather forecast as of 1999 can be found in [40].

Excursion 13: Data Assimilation

Integrating large sets of observed data into numerical models is challenging but may enhance the outcome of simulations considerably. In the case of a time-dependent model and a time-ordered data set, combining models and data is called data assimilation (DA). The main goal of DA is to estimate the state of a system for further analysis, prediction, or quantification of uncertainty. Both of DA’s building blocks, the observed data and the numerical models, have issues. On the one hand, observations frequently suffer from a lack of accuracy in the measured data, a lack of data points (both concerning the number of measurements and their locations), and missing knowledge on the relation between observations and the variables in the mathematical model. On the other hand, models are typically not perfect: Models always represent an approximation of the full reality, and there are numerical errors of different types in actual computations based on the model. To obtain more accurate estimates for the states of the model (such as initial conditions) and thus to improve the quality of computational predictions (in particular of ill-conditioned systems as described in Excursion 11), one can incorporate more observations or improve the data assimilation process. Bayesian data assimilation does the latter by combining models and observations via Bayes’ theorem from statistics. The physical variables are typically interpreted as random variables with associated probability density functions. The basic idea of coupling data and models in a probabilistic framework relies on conditional probabilities. For probabilities p(u) and p(v) of two random variables u and v, respectively, the conditional probability p(u|v) indicates the probability of u given the events in v, i.e., it involves a dependency of one random variable on another.ᵃ For conditional probabilities, it holds that

p(u|v) = p(u ∧ v) / p(v),    (3.4)
p(v|u) = p(v ∧ u) / p(u).    (3.5)

Hence, the probability of u under the assumption that v holds is given by the probability that u and v hold divided by the probability of v, and vice versa for p(v|u). Combining both equations (3.4) and (3.5) results in Bayes’ theorem:

p(u|v) = p(v|u) · p(u) / p(v).    (3.6)

A probability distribution of a continuous random variable is described by a probability density function. A prominent example is the Gaussian distribution (with the characteristic “bell curve” shape)

p(u) = 1/√(2πσ²) · exp( −(u − µ)² / (2σ²) ),

where µ is the expected value of the random variable u and σ² represents its variance. In the context of data assimilation, v may represent the observations or data while the state variables are contained in u.


Bayes’ theorem (3.6) indicates that the probability of the states u under observations v is equal to the probability of observations v under the states u times a fraction of the separate probabilities. In order to find a probability of the state variables that optimally meets the observations and the model analysis, different approaches exist to exploit formula (3.6) (see [30]):

• variational methods determining the minimum of a cost function (measuring the difference between values derived by observations and by the model) by iterative optimization techniques;
• Kalman filters for estimating the expected value and covariance of the state variables;
• particle filters for finding a representation of the state variables.

For details on data assimilation, see [38] and, in particular, [30].

ᵃ The difference in conditional vs. nonconditional probabilities is indirectly well known from everyday life: The probability of the event “I jog every Wednesday” may of course be different from the probability of the event “I jog every Wednesday if the weather is fine.”
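For scalar Gaussian densities, Bayes’ theorem (3.6) can be evaluated in closed form: combining a Gaussian prior (e.g., the model forecast) with a Gaussian observation yields a Gaussian posterior whose mean is a precision-weighted average. The following Python sketch illustrates this simplest building block of variational and Kalman-filter methods; the numerical values are invented for illustration and have nothing to do with any actual DWD configuration.

```python
# Scalar Gaussian Bayes update: prior N(mu_b, var_b) from the model forecast,
# observation y with error variance var_o. The posterior ("analysis") is again
# Gaussian; this is the elementary step behind Kalman filtering.

def gaussian_update(mu_b, var_b, y, var_o):
    gain = var_b / (var_b + var_o)       # how much weight the observation gets
    mu_a = mu_b + gain * (y - mu_b)      # analysis mean
    var_a = (1.0 - gain) * var_b         # analysis variance (never larger than var_b)
    return mu_a, var_a

# Invented example: forecast surface pressure 995 hPa (std 4 hPa),
# observation 988 hPa (std 1 hPa) -> analysis close to the accurate observation.
mu_a, var_a = gaussian_update(995.0, 4.0**2, 988.0, 1.0**2)
print(round(mu_a, 2), round(var_a ** 0.5, 2))    # about 988.41 hPa, std about 0.97 hPa
```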

Two aspects in the context of data assimilation were responsible for the wrong forecast of Lothar (see, e.g., [1, 12, 18, 47]):

• The radiosonde “Sable Island” (southeast of Halifax, Nova Scotia) failed to reach the necessary height of 40 km on December 24, 1999, at 12 UTC because the hydrogen-filled balloon exploded. Therefore, the weather balloon was started again 114 minutes later and successfully reached its intended height. Due to low transmission rates, radiosondes were started only four times per day at the time of the events. The respective time stamps of the transmitted data were fixed to four corresponding options: 00, 06, 12, and 18 UTC⁴⁹ (personal communication with U. Schättler of DWD). According to the standard procedures defined by the World Meteorological Organization, the Canadian Weather Services distributed the restarted radiosonde data with the same time stamp of 12 UTC as the previous one. Therefore, these data were related to a different atmospheric situation. Hence, previous forecasts not using these spoiled data delivered better results than the subsequent runs. Weather services other than the DWD did not suffer from this effect because they did not use the delayed data for the 12 UTC forecast (personal communication with U. Schättler of DWD).

• A specific feature in DWD’s modeling increased the error introduced by the radiosonde’s faulty time stamp. A posteriori analysis of the simulations revealed that reducing the time window for observed data⁵⁰ in the global model from six to three hours gave much better boundary conditions for, and thus better results of, the local model LM (see [1, 10]).

Since weather prediction is ill-conditioned, these two aspects generated slight changes in the initial values of the local model, resulting in completely different predictions in this exceptional case. This is illustrated by ensemble forecasts for Lothar in Fig. 3.13 (taken from [27]; the ensemble has been generated via initial perturbations and stochastic physics).

⁴⁹ Meanwhile, the exact time stamps of the actual time of recording can be used.
⁵⁰ Note that for a target time for a simulation such as 00 or 12 UTC, data in a certain time window before and after this point in time are used synoptically, i.e., agglomerating all of these data at exactly that point in time.


Figure 3.13. Ensemble simulations of the Lothar scenario computed in the aftermath of the event. Fifty slightly different scenarios with respect to input data have been computed. Only 10 of the 50 runs contained indications of a severe thunderstorm. The pictures show surface pressure for each of the 50 scenarios after 42 hours of simulated wall clock time. Reprinted with permission from World Scientific [27].

Only ten of the fifty 42-hour forecasts show a severe thunderstorm over Central Europe. The example of Lothar illustrates that in certain cases, short-range weather prediction can also be very hard. In the aftermath of Lothar, the DWD changed the time window in the global model from six to three hours and adapted its internal structure and organization to reflect the priority of high-quality warnings (see, e.g., [12]). A particular issue for every meteorological service is the decision of whether to issue a warning. A warning should only be published in case of a high probability of a severe storm in order to avoid unnecessary concern and false alarms (cf. personal communication of D. Majewski).
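The idea behind such an ensemble can be illustrated with the Lorenz-63 system, the toy model at the origin of the butterfly effect mentioned in Excursion 12. The following Python sketch has nothing to do with the DWD’s actual models: it simply integrates ten copies of the system from minimally perturbed initial states and shows how strongly the ensemble members disagree after a while, which is exactly why only some members of the Lothar ensemble contained the storm.

```python
import random

# Toy ensemble forecast with the Lorenz-63 system (standard chaotic parameters).
# Each member starts from a slightly perturbed initial state; after some
# simulated time the members have drifted far apart.

def lorenz_step(x, y, z, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return x + dt * dx, y + dt * dy, z + dt * dz   # one explicit Euler step

def forecast(x0, y0, z0, steps=2000):              # 2000 steps = 20 time units
    x, y, z = x0, y0, z0
    for _ in range(steps):
        x, y, z = lorenz_step(x, y, z)
    return x                                       # report only the x component

random.seed(0)
members = [forecast(1.0 + 1e-6 * random.gauss(0.0, 1.0), 1.0, 1.0) for _ in range(10)]
print([round(m, 2) for m in members])              # widely scattered final states
```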

3.3.4 Other Bugs in this Field of Application

• In 1983, the Colorado River flooded, killing 7 persons and leading to $12 million in property damage (see [24, 36]). In that year, El Niño was responsible for a global and sudden change in the weather conditions. Forecasts from the National Weather Service and the Bureau of Reclamation initially predicted a runoff of about 117% of normal, while it actually turned out to be 210% (cf. [6]). Because of this “monumental mistake” [46] in federal computer predictions of snowmelt flow, too much water had been kept in Lake Powell via the Glen Canyon Dam during the spring period.

• In 1985, the US National Weather Service failed to predict a storm because a weather buoy had been left unrepaired for two and a half months. This missing input to


the simulation software influenced the quality of the results significantly. Three lobstermen unfortunately died in the storm, and their families were awarded compensation of $1.25 million at first instance (see [24, 44]).

• In October 2007, a tornado hit the Mediterranean island of Mallorca. The weather prediction correctly described heavy rainfall but failed to forecast the strong winds of up to at least 130 km/h (see, e.g., [8, 29, 35]). According to different sources, about 20 persons were injured, and one person was allegedly even killed by debris.

3.3.5 Similar Bugs

Software-related problems involving incorrect input data comprise the following events.

• In 1975, researchers had concerns about the impact of chlorofluorocarbons (CFCs) on the atmosphere, in particular that they reduce the thickness of the protective ozone layer. CFCs were used for a variety of industrial products such as refrigerants, sprays, etc. Therefore, NASA equipped the Nimbus-7 satellite with two sensors for measuring the total amount of ozone in the atmosphere. From 1978 to 1995, NASA used Nimbus-7 to collect data about the ozone concentration. These data gave no reason for concern because they showed no dramatic changes in the concentration. In October 1985, a British team measured the ozone layer over Halley Bay, Antarctica, using a ground-based ozone spectrometer. They found an alarming reduction of the ozone layer by 40% compared to the previous year. NASA had to scrutinize why this effect had not been detected before. Unfortunately, the software used for the data collected from Nimbus-7 was programmed to delete all outliers—measured values that differed greatly from the expected values (see [5, 22, 23]). Hence, the NASA team failed to detect the ozone hole because it was already more severe than expected. As a result of the detected ozone hole, 43 nations signed the Montreal Protocol in order to reduce the use of CFCs.

• In September 1998, the US Navy cruiser USS Yorktown was on a mission to test the Navy’s Smart Ship system when its computer system crashed. Investigations revealed a buggy input causing a division by zero. The source of the problem on the Yorktown was that a crew member had entered a zero into a database field. When the software attempted to divide by this zero, a buffer overflow occurred that crashed the entire network and made the ship lose control of its propulsion system. For about 2.5 hours the ship was “dead in the water” (see [7, 48]) before it was towed back to Norfolk Naval Station.

• On April 30, 1999, a Titan IV B rocket equipped with a Titan Centaur upper stage developed by Lockheed Martin suffered from instabilities shortly after launch. The reason for the failure was an incorrectly entered value of the roll-rate filter constant [16, 28]. The roll-rate filter was an algorithm in the inertial navigation unit comprising five constants. The correct value of the constant I1(25) should have been −0.1992476, but instead −1.992476 was entered. This led to the loss of the rocket and its payload, a Milstar communications satellite.

• On March 20, 2002, complicated fire control software and operator error contributed to the deaths of two soldiers at Fort Drum. The so-called Advanced Field Artillery Tactical Data System was used to calibrate a howitzer. If an operator


forgets to enter the target altitude in this system, the computer enters a default of zero. This led to high-explosive shells landing about 1.5 miles short of their target (cf. [3, 45]).

• In December 2001, three soldiers from the US Special Forces were killed and several injured by a “friendly” bomb in Afghanistan. The person in charge of air control had entered the location of a Taliban position into a GPS device as the target for an air strike. Unfortunately, the battery of the GPS device died and had to be changed. After a restart of the device, the default location of the GPS tool was its current position, overwriting the previous input. Therefore, the bomb erroneously hit the location of the Special Forces team [21].

• On July 3, 1988, Iran Air Flight 655, an Airbus A300 bound for Dubai, was shot down by the missile cruiser USS Vincennes. After combat with Iranian gunboats, the Aegis system of the Vincennes, which is responsible for engaging attacking aircraft or missiles, gave a warning about an approaching aircraft: “Unknown–Assumed Enemy.” Unfortunately, the Vincennes’ IFF (Identification Friend or Foe) first received the signature of a commercial airplane, but later on it showed a military signature, belonging to another Iranian aircraft on the ground. The Airbus did not react to the warnings sent out on civil and military channels. Then, in the chaotic circumstances of the battle, the Vincennes crew misinterpreted the output of the Aegis system regarding the size and altitude of the Airbus, assuming the climbing Airbus to be an attacking military F-14 Tomcat aircraft. For a report on and discussion of the accident, see [11, 25, 42].

• In 2013, Xerox office devices (in particular of the WorkCentre and ColorQube series) were affected by a software bug that surprisingly changed numbers in scanned documents. The origin of this weird behavior was the JBIG2 compression method, which uses pattern matching to allow stronger compression. This pattern matching could lead to the substitution of characters in scanned documents [15, 49] in the case of specific input data.⁵¹ This problem was particularly cumbersome for companies that relied on scans and had deleted the original paper documents, as they could no longer detect whether numbers had been changed somewhere.

⁵¹ A nice YouTube video (in German) of a presentation of D. Kriesel on the Xerox problem at the 2014 congress of the German Chaos Computer Club can be found at https://youtu.be/7FeqF1-Z1g0.

• During the 2012 Summer Olympics in London, a software error caused problems in recording the throw distance of Betty Heidler in the hammer throw competition. When entering the measured value of 77.13 m, the computer refused to accept this value. It turned out that the previous throw had amounted to exactly the same distance and, therefore, the software deduced that the officials were trying to enter the old value again [39].

• On November 27, 2017, a Soyuz-2-1b rocket lifted off at Vostochny Cosmodrome, Russia’s newly built spaceport.⁵² After the successful separation of the third stage, a major problem occurred that led to the loss of the complete payload. Early reports on preliminary findings already mention a problem in the


Fregat-M upper stage—which is the fourth and last stage of the rocket, responsible for transporting the payload satellites to their designated initial positions—as the reason for the failure (cf. [13, 43]). The improper orientation of the booster caused the Fregat-M module to enter the atmosphere and fall into the Atlantic Ocean [43]. The launcher was intended to deliver Russia’s Meteor-M No.2-1, a weather satellite worth about 2.6 billion rubles or $45 million, together with 18 smaller satellites from seven countries (cf. [13, 26]; see also [50] for a detailed survey on the mission and the incident). On December 27, 2017, Russia’s Deputy Prime Minister responsible (among other things) for defense-related research, Dmitry Rogozin, told the Rossiya 24 state TV channel that wrong initial coordinates had been programmed into the flight control system of the Fregat-M:⁵³ the takeoff location was set to Baikonur instead of Vostochny (see, e.g., [2, 26, 32, 34]). At that time, Rogozin should have had at hand the report of the completed investigation initiated after the failure (see, e.g., [50]). Interestingly, Russia’s state space agency Roscosmos, which operates the Meteor program, published a statement claiming “The flight task was tested exclusively for the Vostochny Cosmodrome, which was checked by specialists in accordance with existing methods . . . . The reason for the accident is a combination of several factors at the Vostochny Cosmodrome . . . [that are] impossible to detect by any existing mathematical models” (see [34]). The statement is rather cryptic and may represent more a political move than a contribution to understanding the technical reasons for the failure; it might also refer to the relatively complex chain of events caused by wrongly set azimuth coordinates (see [50]).

⁵² While many successful launches have been carried out for decades from Baikonur, which Russia has been leasing from Kazakhstan since the collapse of the Soviet Union in 1991, this space mission was the second one to actually start from the Vostochny Cosmodrome, which was completed on Russian territory in 2016.
⁵³ The Soyuz rocket and the Fregat-M module had different flight control systems developed and initiated by different entities.

Bibliography

[1] G. Adrian. Lothar workshop. date accessed: 2018-01-03, 2000. https://www5.in.tum.de/persons/huckle/lothar

[2] BBC News. Failed satellite programmed with “wrong co-ordinates.” date accessed: 2018-01-08, 2017. http://www.bbc.com/news/technology-42502571

[3] S. Bellovin. Software problem kills soldiers in training accident. date accessed: 2018-07-31, 2002. https://catless.ncl.ac.uk/Risks/22.13.html

[4] E. Blayo, E. Cosme, and A. Vidard. Introduction to data assimilation. date accessed: 2018-08-20, 2011. http://ljk.imag.fr/membres/Maelle.Nodet/documents/MN_DA.pdf

[5] B. Christensen and S. Christensen. Achtung: Statistik. Springer, 2015

[6] Colorado River Storage Project. Glen Canyon Dam Construction History, 2008

[7] M. D. Crawford. USS Yorktown dead in water after divide by zero. date accessed: 2018-07-31, 1998. https://catless.ncl.ac.uk/Risks/19.88.html

[8] dab/dpa. Tornado Mallorca. date accessed: 2018-03-28, 2007. http://www.spiegel.de/reise/aktuell/unwetter-meteorologen-raetseln-ueber-mallorcatornado-a-509850.html


[9] DWD. Bewertung der Orkanwetterlage am 26.12.1999 aus klimatologischer Sicht. Technical report, Deutscher Wetterdienst (DWD), 1999. http://www.wetterextrem.de/stuerme/lothar/orkan_lothar.pdf

[10] DWD. Operationelles NWV-System - hier: Änderung der Datenassimilation für GME. date accessed: 2018-03-28, 2000. https://rcc.dwd.de/DE/fachnutzer/forschung_lehre/numerische_wettervorhersage/nwv_aenderungen/_functions/DownloadBox_modellaenderungen/gme/pdf_1999_2001/pdf_gme_03_05_2000.pdf?__blob=publicationFile&v=4

[11] D. Evans. Vincennes: A case study. date accessed: 2018-05-11, 1993. https://www.usni.org/magazines/proceedings/1993-08/vincennes-case-study

[12] J. G. Goldammer. Erstes Forum Katastrophenvorsorge “Extreme Naturereignisse und Vulnerabilität.” Deutsches Komitee für Katastrophenvorsorge, 2001. http://www.dkkv.org/fileadmin/user_upload/Veroeffentlichungen/Publikationen/DKKV_1._Forum_Extreme_Naturereignisse_und_Vulnerabilitaet.pdf

[13] W. Graham. Soyuz 2-1B launch with Meteor-M ends in apparent Fregat-M failure. date accessed: 2018-01-08, 2017. https://www.nasaspaceflight.com/2017/11/ soyuz-2-1b-launch-meteor-m/

[14] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edition, 2002 [15] D. Kriesel. XEROX-Scankopierer verändern geschriebene Zahlen. date accessed: 2018-07-31, 2013. http://www.dkriesel.com/blog/2013/0802_xeroxworkcentres_are_switching_written_numbers_when_scanning

[16] N. G. Leveson. Role of Software in Spacecraft Accidents. Journal of Spacecraft and Rockets, 41(4):564–575, 2004 [17] E. N. Lorenz. Predictability: Does the Flap of a Butterfly’s Wings in Brazil Set Off a Tornado in Texas? In American Association for the Advancement of Science, 1972 [18] D. Majewski. Orkantief Lothar, 2007. Handout for talk, obtained from U. Schättler [19] D. Majewski. Numerische Wettervorhersage als mathematisch-physikalisches Problem. Presentation slides, obtained from U. Schättler, 2008 [20] D. Majewski. Numerische Wettervorhersage (NWV) beim Deutschen Wetterdinest (DWD). date accessed: 2018-03-29, 2017. https://www.dwd.de/ DE/service/lexikon/begriffe/N/Numerische_Wettervorhersage_pdf.pdf?__ blob=publicationFile&v=6

[21] J. K. Mccarthy. Friendly fire deaths traced to dead battery. date accessed: 2018-07-31, 2002. https://catless.ncl.ac.uk/Risks/21/98#subj1

[22] B. Mcgarry. Ozone hole undetected for years due to programming error. date accessed: 2018-07-31, 1986. https://catless.ncl.ac.uk/Risks/3/29#subj1

[23] NASA Earth Observatory. Serendipity and stratospheric ozone. date accessed: 2018-03-12, 2018. https://earthobservatory.nasa.gov/Features/RemoteSensingAtmosphere/remote_sensing5.php


[24] P. G. Neumann. Latest version of the computer-related trouble list. date accessed: 2018-03-28, 1986. https://catless.ncl.ac.uk/Risks/4/01

[25] P. G. Neumann. Aegis, Vincennes, and the Iranian Airbus (report on a Matt Jaffe talk). date accessed: 2018-03-28, 1989. http://catless.ncl.ac.uk/Risks/8.74.html#subj1

[26] A. Osborn and A. Roche. Russia says satellite launch failure due to programming error. date accessed: 2018-01-08, 2017. https://www.reuters.com/article/usspace-launch-russia-mistake/russia-says-satellite-launch-failuredue-to-programming-error-idUSKBN1EL1G2

[27] T. N. Palmer. Predictability of Weather and Climate: From Theory to Practice - From Days to Decades. In Realizing Teracomputing, Vol. 1, pages 1–18. World Scientific, 2003

[28] G. J. Pavlovich. Formal Report of Investigation. Technical report. http://sunnyday.mit.edu/accidents/

[29] C. Ramis, R. Romero, and V. Homar. The Severe Thunderstorm of 4 October 2007 in Mallorca: An Observational Study. Natural Hazards and Earth System Science, 9(4):1237–1245, 2009 [30] S. Reich and C. Cotter. Probabilistic Forecasting and Bayesian Data Assimilation. Cambridge University Press, 2015 [31] S. Reich and A. Stuart. Data Assimilation in Numerical Weather Prediction. SIAM News, Oct. 2015 [32] Reuters. Russian satellite lost after being set to launch from wrong spaceport. The Guardian, Dec. 28, 2017. https://www.theguardian.com/world/2017/dec/28/ russian-satellite-lost-wrong-spaceport-meteor-m

[33] Rotronic UK. Meteorology numerical weather prediction. date accessed: 2018-08-19, 2017. https://blog.rotronic.co.uk/2017/07/19/meteorologynumerical-weather-prediction/

[34] RT. Roscosmos rejects claim “wrong spaceport settings” caused November satellite loss. date accessed: 2018-01-09, 2017. https://on.rt.com/8vte

[35] J. Schipper. EUMeTrain. The GEO Quarterly, 28:36, 2010

[36] W. E. Schmid. Floods Along Colorado River Set Off a Debate Over Blame. The New York Times, July 17, 1983

[37] C. Schraff and R. Hess. Datenassimilation für das LM. Promet, 27(3/4):156–164, 2002

[38] R. C. Smith. Uncertainty Quantification: Theory, Implementation, and Applications. SIAM, 2014

[39] Spiegel Online. Olympia 2012: Software-Fehler Schuld an Heidler-Drama im Hammerwurf. date accessed: 2018-07-31, 2012. http://www.spiegel.de/sport/sonst/olympia-2012-software-fehler-schuld-an-heidler-drama-imhammerwurf-a-849533.html


[40] J. Steppeler, G. Doms, and G. Adrian. Das Lokal-Modell LM. Promet, 128(3):123–128, 2002

[41] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer, 3rd edition, 2002

[42] L. Swartz. Overwhelmed by Technology: How did user interface failures on board the USS Vincennes lead to 290 dead? date accessed: 2018-05-11, 2001. http://xenon.stanford.edu/~lswartz/vincennes.pdf

[43] TASS. Booster Fregat may have been lost due to GLONASS equipment failure - source. date accessed: 2018-01-08, 2017. http://tass.com/science/978015

[44] The Associated Press. U.S. weather agency held liable in storm deaths. date accessed: 2018-03-28, 1984. https://nyti.ms/29xmYtK

[45] The Associated Press. Officer found negligent in deaths of 2 at Fort Drum. date accessed: 2018-07-31, 2003. https://www.nytimes.com/2003/07/02/nyregion/officer-found-negligent-in-deaths-of-2-at-fort-drum.html

[46] The New York Times. Nevada Governor Says Errors Led to Flood, July 5, 1983 [47] W. Wergen and M. Buchhold. Datenassimilation für das Globalmodell GME. Promet, 27(3/4):150–155, 2002 [48] Wired. Sunk by Windows NT. date accessed: 2018-07-31, 1998. https://www. wired.com/1998/07/sunk-by-windows-nt/

[49] Xerox. Xerox scanning update: What you need to know. date accessed: 2018-07-31, 2013. https://www.xerox.com/assets/pdf/ScanningQAincludingAppendixA.pdf

[50] A. Zak. Soyuz fails to deliver 19 satellites from Vostochny. date accessed: 201801-08, 2017. http://www.russianspaceweb.com/meteor-m2-1.html

3.4 Mathematical Finance

If you look at a big enough population long enough, then almost any damn thing will happen.
—Persi Diaconis

3.4.1 Overview

• Field of application: financial markets
• What happened: contribution to the global financial crisis; missing stopping criterion for stock orders
• Loss: billions of dollars
• Type of failure: careless (greedy) use of mathematical models and computers


3.4.2 Introduction

In the financial sector, mathematical tools and computers are essential for performing basic transactions such as money transfers, payments by credit or debit card, and selling or buying stocks, currencies, or options. Trading itself is mainly operated by computers. Therefore, algorithms as well as connection speed can be crucial to achieve slight advantages over competitors. Furthermore, a variety of mathematical and statistical tools in risk management help to optimize an investment with respect to profit. These instruments rely on certain assumptions to allow a meaningful prediction. Careless use of these tools can do severe damage, and it contributed to the financial crisis of 2008, for example. In the following sections, we first consider algorithmic and technical aspects of the financial markets. Second, we present and discuss the mathematical tools in risk management, revealing some of the weaknesses and hazards of these methods.

3.4.3 Algotrading and High-Frequency Trading

On Wall Street, a large amount of stock market trading is realized automatically via algorithms and computers.⁵⁴ In algotrading (also called algorithmic trading), the orders are placed automatically and fast, e.g., in the form of stop-loss orders, with the goal of reacting very quickly to even tiny fluctuations in currencies, exchange rates, or stock exchange prices. On the one hand, it is advantageous to use computers and to have a very fast network connection to the stock market computers. Hence, big companies try to have computational facilities in proximity to the stock market IT infrastructure with high-speed Internet connections that allow trading in milliseconds. On the other hand, individual stock market customers cannot afford such infrastructure and, thus, are not able to keep up with big companies. Furthermore, fast access allows manipulation of the market by strategies such as sending numerous buying orders to stimulate the market and withdrawing them immediately. According to [11], high-speed connections were also used to outrun traders with slow connectivity. These slower traders placed electronic orders in different electronic stock exchanges. The time for these orders to be placed depended on the distance and connectivity—some arrived fast, and others took more time. Competitive traders with faster connections were able to spot the faster orders. This knowledge enabled them to react to the slower placements of orders at the more distant exchanges even before they had actually been placed. In the end, the outrun traders had to pay higher prices for the originally cheaper order. Algotrading may also contribute to serious and global problems. On October 19, 1987, also known as Black Monday, the major indexes of Wall Street dropped by 30% and more. This represented the greatest loss on Wall Street on a single day. Besides other factors, the heavy usage of algotrading contributed to this event:

In searching for the cause of the crash, many analysts blame the use of computer trading (also known as program trading) by large institutional investing companies. In program trading, computers were programmed to automatically order large stock trades when certain market trends prevailed. [cf. [9], p. 92; also similar in [6]]

For similar events, nowadays the term flash crash is used, which is defined as a very rapid, deep, and volatile fall in security prices within an extremely short time period.

⁵⁴ In [3], more than 50% had already been indicated as of 2011, with an estimated increase to 70% by 2015. For 2018, [8] states about 50–60% on typical days and up to 90% for periods when the markets are very volatile.


A flash crash appears in algorithmic and high-frequency trading, where the speed and interconnectedness of computers and algorithms can contribute to the loss and recovery of billions of dollars in minutes and seconds. Examples of flash crashes happened on

• May 6, 2010, at the New York Stock Exchange (NYSE);
• April 17, 2013, regarding the German DAX; and
• October 6, 2016, regarding the British pound exchange rate.

An example of the risks of computerized trading and potential problems in related software is the case of Knight Capital. On August 1, 2012, the Knight Capital Group—the largest trader in US equities—updated their computer code SMARS in order to add new features for its customers. SMARS is a high-speed algorithmic router that sends orders out into the market. The core functionality of SMARS is receiving orders from the trading platform (“parent orders”) and breaking them into multiple smaller “child” orders. For this purpose, the company had originally used the so-called Power Peg code. At the time of the events, the legacy Power Peg code had been replaced a couple of years earlier, but it was still present and callable in the new code and could be activated by a flag. Furthermore, the feature for counting the number of shares of parent orders that had been executed by child orders and for stopping the routing of children had been moved to an earlier point of execution in the SMARS code in 2005. The new SMARS code was deployed starting from July 27, 2012, to be ready for service on August 1. Unfortunately, the flag for activating a new, customer-friendly feature was the same as the flag for starting the Power Peg functionality in the old SMARS version. On August 1, the new code was supposed to run on eight servers, but a technician copied the new code to only seven of them. On the eighth server, the old SMARS that included Power Peg was still installed. When the system started, the eighth server began creating child orders without checking against the total number of the parent orders, and therefore never stopped. To halt this infinite ordering, the new code on the seven servers was removed, but the problematic flag was left switched on. Now, all eight servers started infinite ordering. During the 45 minutes before the team prevented the software from sending further orders, 12 small retail orders had led to the routing of millions of orders in total (see [4, 13, 23] for a description of the events). As a result of these events, the Knight Capital Group lost $460 million, i.e., more money than it actually owned. They had to borrow the money from other investors and finally lost their autonomy. Today, they are part of Virtu Financial, one of the largest high-frequency trading firms.
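The mechanism of the Knight Capital failure, a reused flag bit plus one server still running the old code, can be sketched abstractly. The following Python fragment is a deliberately simplified, hypothetical illustration; the real SMARS code is not public, and the routine names and numbers here are invented.

```python
# Hypothetical sketch of the failure mode: a flag bit that used to activate the
# legacy "Power Peg" routing is reused for a new feature, and one of eight
# servers still runs the old router without the parent-order fill check.

FLAG_BIT = 0x01                              # same bit, two meanings across versions

def route_old(parent_qty, flags, cap=1_000_000):
    """Legacy router: the fill check was moved out long ago, so the loop only
    stops at an artificial cap added here to keep the demo finite."""
    sent = 0
    if flags & FLAG_BIT:                     # old meaning: activate Power Peg
        while sent < cap:                    # in the incident: effectively unbounded
            sent += 100                      # emit another 100-share child order
    return sent

def route_new(parent_qty, flags):
    """New router: stop sending child orders once the parent order is filled."""
    if flags & FLAG_BIT:                     # new meaning: enable a customer feature
        pass                                 # feature logic omitted in this sketch
    sent = 0
    while sent < parent_qty:
        sent += 100
    return sent

servers = [route_new] * 7 + [route_old]      # the eighth server missed the deployment
print([srv(500, FLAG_BIT) for srv in servers])
# seven servers send 500 shares each; the eighth keeps sending until the cap
```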

3.4.4 Mathematics and the Financial Crisis

Predictions and evaluations play a major role in the financial markets. Accurate forecasts of the future prices of certain stocks, goods, and currencies, or of the development of economies, would allow profitable investments, but this is hard or even impossible. In 2008 and 2009, the financial crisis originated in the US and had global effects. Besides other aspects, the usage of mathematical models and a blind or short-sighted trust in their results contributed to certain developments throughout the crisis. The Financial Crisis Inquiry Report of the US government states that “Financial institutions and credit rating agencies embraced mathematical models as reliable predictors of risks, replacing judgment in too many instances. Too often, risk management became


risk justification” ([7], p. xix). Wall Street executives had hired so-called quants—analysts adept in advanced mathematical theory and computers—to develop models of how markets or securities might change. S. Patterson⁵⁵ reported to the Financial Crisis Inquiry Commission that the use of mathematical models considerably changed finance: “Wall Street is essentially floating on a sea of mathematics and computer power” ([7], p. 44). But quants and dealers are often more interested in short-term wins and their own bonuses than in stable investments (see [24]). Therefore, mathematical models can be misused to justify dubious financial investments. In the following, we list some important mathematical tools used by brokers and discuss the assumptions that have to be satisfied to ensure the applicability of these tools. If these assumptions are not met, the corresponding solutions become less accurate or even meaningless.

• An important model was developed by F. Black and M. Scholes (see [2], based on results in [12]). This is an option-pricing model that indicates how to hedge the option with the stock to eliminate risk. The model depends on only one single parameter (the volatility) to describe the stock. Thus, it is independent of the specific rising or falling of the stock price. The Black–Scholes equation is a part of this model and represents a PDE for estimating the future price of a European-style option (see Excursion 7 on modeling via differential equations on page 55; the Black–Scholes equation is mathematically similar to heat conduction problems). The underlying idea is to model fluctuations of the price by Brownian motion, a stochastic process describing the erratic movement of particles in fluids or gases (see [18] for details on stochastic processes). In each time step, the price can randomly increase or decrease with a certain probability. The outputs of interest are the resulting mean values or expectation values at a given time, and the standard deviations. The mean value represents a guess for the price that changes over time, and the deviation gives the average amount by which the actual price might differ from the calculated guess. This difference, called the volatility, depends on the waywardness of the market value changes. The future mean value one is interested in is not given explicitly but emerges indirectly as the solution of the Black–Scholes PDE (cf. [1, 19, 24]). The Black–Scholes PDE is based upon the description of the Brownian motion of a random variable W. Eliminating the drift of the Brownian motion by a mathematical construct called risk-neutral probability leads to the simplified PDE

∂V/∂t + (1/2) σ² S² ∂²V/∂S² + r S ∂V/∂S − r V = 0,

where S is the price of the underlying stock. In the basic form, several necessary assumptions have to be fulfilled (see, e.g., [5, 21, 22, 24]):

  – The risk-free interest rate r and the volatility σ are constants.
  – The value (price) V = V(S, t) of the option or derivative⁵⁶ is a function of S and the time t only.
  – All terms in the equation are sufficiently smooth.

(A small numerical illustration of the resulting volatility sensitivity follows below.)

⁵⁵ S. Patterson is also the author of the book The Quants [16].
⁵⁶ A derivative is a financial instrument or contract that is settled on a future day.


• In modern portfolio theory (MPT), diversification is used to reduce the risk of an investment. To this aim, the expected return for a possible investment is maximized for a given level of risk based on the standard deviation and a set time horizon (see [24]). MPT is based on correlations between different securities that are estimated by time series analysis. This delivers a single value for the risk of an investment. MPT is usually based on a few assumptions (A) and has certain properties (P) [24]:
  – (A) in the original MPT designed by Markowitz, the variables are assumed to follow a normal distribution;
  – (A) the markets are arbitrage-free (i.e., all market prices are a direct and immediate input for the model; in particular, no differences in prices for one and the same item exist);
  – (A) the participants at the market behave in a rational, independent, and risk-averse manner (preferring lower risk given the same profit);
  – (P) the analysis depends on the chosen time interval;
  – (P) the calculations are based on past performance, but there is no guarantee of future performance (compare Excursion 10 on extrapolation);
  – (P) there is a huge number of correlation coefficients involved that have to be estimated; often, these coefficients are spurious or oscillating.
• The Value at Risk (VaR) model [15] is based on MPT, measuring the boundaries of risk in a portfolio over short durations. It uses two values: a degree of confidence (e.g., 90%) and a time horizon (e.g., 1 day). For the example of a given degree of confidence of 90%, a time horizon of 1 day, and a related VaR of $2 million, the VaR model states that for 90 out of 100 potential cases (i.e., days) the possible loss is expected to be less than $2 million. In this way, the resulting VaR number expresses the risk of a loss in a simple form. A brief numerical illustration of such a VaR figure is given below. Again, some assumptions (A) and properties (P) hold [15, 24]:
  – (A) the variables are assumed to follow a normal distribution;
  – (A) the VaR is based on a short history, expecting that “tomorrow will be more or less like today” (cf. [15]);
  – (P) like MPT, VaR is based on unstable parameters. Correlations between assets might change dramatically: In the event of a big crash, e.g., they will be highly correlated.
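The VaR number from the example above can be reproduced in a few lines; the sketch below is our own illustration (not taken from [15] or [24]) and simply reads off the 90th percentile of a simulated loss distribution under the normality assumption. Portfolio value, mean, and volatility of the daily returns are invented.

# Illustrative sketch of a one-day Value at Risk (VaR) figure at 90% confidence,
# assuming normally distributed daily returns (the very assumption criticized
# in the text). Portfolio value, mean, and volatility are invented.
import numpy as np

portfolio_value = 100e6      # USD, made up
mu, sigma = 0.0004, 0.012    # mean and std. dev. of daily returns, made up

rng = np.random.default_rng(1)
returns = rng.normal(mu, sigma, size=1_000_000)   # simulated daily returns
losses = -portfolio_value * returns               # positive numbers = losses

var_90 = np.percentile(losses, 90)   # loss not exceeded on 90% of days
print(f"1-day 90% VaR: ${var_90:,.0f}")

If the returns were instead drawn from a fat-tailed (power-law-like) distribution, the same confidence level would correspond to a considerably larger loss, which is one of the criticisms discussed further below.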


• Credit default swaps (CDSs) are often used for designing investments (see [10, 24]). A CDS can be seen as insurance for the case of “credit events.” It is a bilateral contract of protection: The protection buyer pays a premium based on the risk at regular intervals until the maturity of the contract or the date of occurrence of the default.57 The protection seller, on the other hand, agrees to provide compensation for the eventual loss in case the borrower cannot pay off the debt. As in other tools, assumptions are necessary to estimate a CDS in a meaningful way (cf. [10]):
  – The probabilities of default, interest rates, and recovery rates are independent.
  – There is no opportunity of arbitrage (taking advantage of price differences).
  – There is a risk-neutral probability (i.e., a probability that a risky bond will default during its lifetime).
• Collateralized debt obligations (CDOs; cf. [10, 24]) represent a bundle of many CDSs in a single product. In this way, subprime loans can be combined and sold together with other securities like premium loans or bonds. The resulting individual credit risk probabilities are tied together using a joint probability for different investments, typically via a Gaussian copula. Thus, the events are correlated, and the correlation coefficients can be used to estimate the joint risk (see [17]). Frequently used assumptions are
  – a normal distribution of all probabilities involved; and
  – that these correlation coefficients can be massively simplified, reducing everything to one single correlation number.
  In particular, the massive simplification of the correlation coefficients may result in problematic behavior under certain circumstances. As a demonstrative example, the use of the Gaussian copula method failed completely in the financial crisis of 2008 (see [17]). A schematic sampling sketch of such a one-parameter copula is given further below.
• Often, so-called technical analysis (see [24]) is used to derive predictions of prices. Here, certain patterns in the price charts are determined, and it is expected that such patterns will occur repeatedly in the future. Hence, as soon as a part of a pattern appears in the data, one could draw the future curve. Because of the erratic development of prices, this practice might not be very reliable (compare Excursion 10 on extrapolation).

In the following, we present a condensed list of critical and arguable assumptions and simplifications in financial models such as MPT, VaR, CDS, and CDO:
• A fundamental assumption in financial modeling is that the markets are arbitrage-free and that the participants in the market are behaving rationally (cf. [24]). But as a matter of fact, people are influenced by additional factors such as
  – following a trend;
  – loss aversion;
  – preference for the status quo;
  – believing in illusory connections.
• Often, the dependency between different data is simplified drastically. Correlation coefficients are simplified to all be equal.
• For all of the above models, we have to assume that the observed values are stable and continuous in order to allow an extrapolation of the data into the future. But frequently, the parameters are unstable, and the corresponding problems to solve are ill-conditioned (compare Excursion 11 on conditioning).
• Random variables are often assumed to follow a normal distribution. However, the values of interest would frequently be better described by a power-law distribution instead of the normal distribution (see [24]). The power-law distribution is better suited for more extreme events that are unpredictable,58 also called “the edge of chaos” [24].
• Often, models rely on assumptions about the properties of important parameters. On the one hand, one has to obtain good or satisfying values for the parameters (frequently called model fitting). On the other hand, one has to model the mathematically qualitative behavior of the parameters (is a parameter just a constant, or does it depend on other data, etc.). To obtain good values, the parameters of the models are often fitted by data stemming from a short time period when the market behaved continuously, and when the model delivered very precise predictions. In addition, the amount and quality of the data used for fitting may be insufficient. Hence, these models are of limited usefulness when the conditions have changed considerably.

Models such as the Black–Scholes model (related to the Black–Scholes PDE) are derived and applied similarly to classical physical problems modeled by PDEs such as flow or diffusion problems and weather prediction. In these physical PDEs, constants appear such as the gravitation constant, the speed of light, the Planck constant, or constants that depend on the type of the material such as the Reynolds number or the viscosity. These physical constants do not depend on any parameters. The Black–Scholes equation also incorporates quasi-constant parameters: the drift and the volatility. This is necessary to simplify the model, but these assumptions are not valid in real markets.59 Therefore, the solutions of the Black–Scholes PDE are only reasonable as long as the market situation and the related constants do not change in an extreme way. But in the event of a crisis, such extreme changes unfortunately happen, making the approach meaningless. This is summarized by the statement, “It’s not like trying to shoot a rocket to the moon where you know the law of gravity” (cf. [7], p. 44). If not made transparent, such restrictions in mathematical models may lead to a false feeling of safety for stakeholders relying on such models for certain decisions. The main issues are not only the models, but the way people use them: “But a computer does not do risk modeling. People do it. And people got overzealous and they stopped being careful” (statement of G. Berman in [15]).

But predictions suffer from a more severe and fundamental problem, the so-called black swan effect (see below). One can only try to predict the future by data from the past. Obviously the last year’s growth of earnings of a company or the increase of a price do not really allow us to predict further growth (see [24]). Furthermore, such a prediction or extrapolation will necessarily fail if new unforeseen developments take place. Examples of such unforeseen events are booms in new technologies, as in the case of laptops or smartphones, the stock crash of October 1987, the fall of the Berlin Wall and the Iron Curtain, and the 9/11 events. The phenomenon of not being able to predict certain developments or events was called the black swan effect by N. N. Taleb in [20]. Observing only white swans leads to the conclusion that swans will always be white. But in some other parts of the world or in the near future, there could emerge a substantial number of unforeseen black swans.

57 In this context, the default describes a kind of nonpayment of a credit.
58 The power-law distribution can have the form of the function 1/x^k for some integer k (see, e.g., [14]). In such a distribution, a cluster of values at one end dominates the other values. Examples are the distribution of income, word frequencies, or trading volumes. On a logarithmic scale, the distribution function is represented by a straight line.
59 Note that more sophisticated approaches exist, again coming with some assumptions.
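The copula criticism can be made tangible with a minimal sketch of a one-factor Gaussian copula (our own illustration, not the model of [17]): every obligor defaults when a latent standard normal variable falls below a fixed threshold, and a single correlation rho between these latent variables controls how strongly defaults cluster. Default probability, rho, and portfolio size are invented.

# Minimal sketch of a one-factor Gaussian copula for joint defaults:
# obligor i defaults when its latent normal variable Z_i falls below the
# threshold Phi^{-1}(p). A single correlation rho governs the clustering.
# Default probability p and rho are invented for the illustration.
import numpy as np
from scipy.stats import norm

p, rho, n_obligors, n_scenarios = 0.05, 0.3, 100, 200_000
threshold = norm.ppf(p)                       # default threshold on the latent scale

rng = np.random.default_rng(2)
M = rng.standard_normal(n_scenarios)          # common market factor
eps = rng.standard_normal((n_scenarios, n_obligors))
Z = np.sqrt(rho) * M[:, None] + np.sqrt(1 - rho) * eps   # correlated latent variables

defaults = (Z < threshold).sum(axis=1)        # number of defaults per scenario
print("mean defaults:", defaults.mean())                      # about n_obligors * p
print("P(more than 20 defaults):", (defaults > 20).mean())    # tail risk from clustering

Setting rho = 0 makes large clusters of defaults practically impossible; a single rho fitted to calm years therefore makes the kind of clustered defaults seen in 2008 look far less likely than they turned out to be.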


The fact that a model works well for some time does not allow the conclusion that it will work well in the future, too. This is not a malfunction of the model but a failing of the user, who should be aware of the limits of the model. Inductive reasoning does not strictly hold in real-world applications (cf. [15, 19]). A black swan is defined by Taleb as an event that is
• unpredictable;
• of large impact; and
• obvious in retrospect.
Hence, models can be very useful in stable periods when the market behaves continuously. But they will fail completely once the prerequisites for their application are violated, as in the case of black swan events. D. Einhorn in [15] compares these models to “an air bag that works all the time, except when you have a car accident.”

Bibliography

[1] R. Ball. The Global Financial Crisis and the Efficient Market Hypothesis: What Have We Learned? Journal of Applied Corporate Finance, 21(4):8–16, 2009
[2] F. Black and M. Scholes. The Pricing of Options and Corporate Liabilities. Journal of Political Economy, 81(3):637–654, 1973
[3] D. Cliff, D. Brown, and P. Treleaven. Technology Trends in the Financial Markets: A 2020 Vision. Technical report, UK Government Office for Science, 2011. https://research-information.bristol.ac.uk/en/publications/technology-trends-in-the-financial-markets-a-2020-vision(be46a5b587ee-49ff-a0c1-0d7cab37b2bd).html
[4] D7. A DevOps Cautionary Tale. Dougseven.com, date accessed: 2018-07-31, 2014. https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
[5] S. Dunbar. Stochastic processes and advanced mathematical finance. date accessed: 2018-07-31, 2016. https://pdfs.semanticscholar.org/f27e/fe3c1a7209a65565165ccb58d5d81ce69537.pdf
[6] R. Dutton. Financial Meltdown. Infodial Limited, 2009
[7] Financial Crisis Inquiry Commission. The Financial Crisis Inquiry Report, Authorized Edition: Final Report of the National Commission on the Causes of the Financial and Economic Crisis in the United States. PublicAffairs, 2011
[8] C. Isidore. Machines are driving Wall Street’s wild ride, not humans. CNN Money, 2018. http://money.cnn.com/2018/02/06/investing/wall-street-computers-program-trading/index.html
[9] J. Itskevich. What caused the stock market crash of 1987? date accessed: 2018-03-27, 2002. https://historynewsnetwork.org/article/895
[10] A. Latifa. Financial Stochastic Modeling and the Subprime Crisis. International Journal of Economics, Finance and Management Sciences, 4(2):67–77, 2016
[11] M. Lewis. Flash Boys. Norton & Company, 2014
[12] R. C. Merton. Theory of Rational Option Pricing. The Bell Journal of Economics and Management Science, 4(1):141, 1973
[13] E. M. Murphy. In the matter of Knight Capital Americas LLC respondent. date accessed: 2018-08-16, 2013. https://www.sec.gov/litigation/admin/2013/34-70694.pdf
[14] T. Neckel and F. Rupp. Random Differential Equations in Scientific Computing. Versita, 2013
[15] J. Nocera. Risk mismanagement. The New York Times, Jan. 2, 2009. https://www.nytimes.com/2009/01/04/magazine/04risk-t.html
[16] S. Patterson. The Quants: How a New Breed of Math Whizzes Conquered Wall Street and Nearly Destroyed It. Crown Business, 2010
[17] F. Salmon. Recipe for disaster: The formula that killed Wall Street. date accessed: 2018-03-27, 2009. https://www.wired.com/2009/02/wp-quant/
[18] R. C. Smith. Uncertainty Quantification: Theory, Implementation, and Applications. SIAM, 2014
[19] I. Stewart. The mathematical equation that caused the banks to crash. The Guardian, Feb. 12, 2012. https://www.theguardian.com/science/2012/feb/12/black-scholes-equation-credit-crunch
[20] N. N. Taleb. The Black Swan: The Impact of the Highly Improbable. Random House, 2007
[21] Wikipedia. Black–Scholes model. date accessed: 2018-08-16. https://en.wikipedia.org/wiki/Black-Scholes_model
[22] Wikipedia. Black–Scholes equation. date accessed: 2018-07-31. https://en.wikipedia.org/wiki/Black-Scholes_equation
[23] Wikipedia. Knight Capital Group. date accessed: 2018-07-31. https://en.wikipedia.org/wiki/Knight_Capital_Group
[24] P. Wilmott and D. Orrell. The Money Formula: Dodgy Finance, Pseudo Science, and How Mathematicians Took Over the Markets. Wiley, 2017

Chapter 4

Design of Control Systems

Numerical aspects are relevant to designing control logic and larger control systems such as real-time programs for the control of aircraft or cars. The controlling computer restricts the possible actions of an airplane depending on the mode the plane is in. This mode can switch from flying mode to landing mode, e.g., only if certain numerical conditions are fulfilled such as sufficient weight on the wheels. A careless or inaccurate implementation of these conditions can cause severe accidents, as described in the section on fly-by-wire problems. Autonomously driving cars also depend on numerical input gathered by cameras or radar, e.g., to generate a virtual representation of the environment that is necessary to control the vehicle to avoid accidents. Mistakes in designing software for determining obstacles can lead to fatal accidents. Similarly, errors in the control structures of code can have fatal consequences in certain applications, such as activation of airbags.

4.1 Fly-by-Wire

Every solution breeds new problems.
—Arthur Bloch

4.1.1 Overview

• Field of application: aviation
• What happened: crash during landing in Warsaw in 1993
• Loss: 2 dead and 56 injured
• Type of failure: unreasonable conditions for allowing braking during landing

4.1.2 Introduction

In the early days of aircraft, the pilot controlled the pitch elevator, rudder, wings, retractable undercarriage, and other elements directly via cranks, cables, pulleys, and hydraulic pipes. In modern airplanes, manual flight controls are replaced by an electronic interface and a computer system. Such fly-by-wire systems use digitized sensor data and pilot commands as well as software to decide on the necessary actions to operate the elevator, rudder, etc.60 (see, e.g., [25, 31, 42]). Advantages of fly-by-wire systems are (cf. [9]), in particular in a military context,
• reducing aircraft weight (i.e., increased range due to less consumption of kerosene) while maintaining a specified performance;
• improving performance for a given power unit;
• improving aerodynamic, structural, or operational efficiency;
• improving handling qualities;
• reducing pilot work load; and
• removing flight restrictions.

A main advantage of the fly-by-wire system in civil aviation is that the aircraft is able to fly autonomously, and that under normal conditions the computer can avoid dangerous situations where the flight could become unstable due to certain commands of the pilot. But there are also disadvantages of fly-by-wire systems. The pilots remain in charge mainly for takeoff and landing and lose direct contact with the aircraft. Furthermore, in extreme situations, the software may prohibit actions that are dangerous but necessary to avoid a crash. There may also be effects of delayed or prohibited actions that the pilots are not aware of. One example is the automatic shifting between flight modes in the control system that can lead to situations where the pilot is not able to predict how the plane will react to his commands. Furthermore, in the event of a breakdown of the fly-by-wire system, control over the aircraft can be lost immediately.

The two major manufacturers of civil aircraft, Airbus and Boeing, follow different strategies with respect to the design of fly-by-wire systems (see [41]). The computer in an Airbus always has the ultimate flight control and restricts the actions of the pilot such that the flight stays inside allowed performance limits. Only in the case of multiple failures of redundant computer systems can a mechanical (for the Airbus A320) or electrical back-up system for rudder control be used. Boeing uses fly-by-wire control systems for its models Boeing 777, 787 Dreamliner, and 747-8. In contrast to Airbus, Boeing’s policy is that the pilot obtains proposals and warnings from the software but may completely override them to permit flight situations beyond the usual flight envelope.

Accidents caused by fly-by-wire systems are related mainly to the control logic that is responsible for limiting or triggering certain actions of the airplane. In the following sections, we first describe a preliminary incident that was not related to software but led to a certain type of control logic, which then contributed to an actual software-based problem (the Warsaw accident).

4.1.3 Preliminary Event: The Lauda Air Crash

Shortly before midnight on May 26, 1991, the Lauda Air Boeing 767 flight 004 from Bangkok to Vienna crashed into the jungle from a height of approximately 25,000 feet.

60 Note that “wire” in this context is used synonymously to electronically connected devices, including some control software. Of course, classical wires in the sense of steel cables have been used in airplanes since the early days of aviation to steer a plane in a purely analog way.

4.1.3 Preliminary Event: The Lauda Air Crash Shortly before midnight on May 26, 1991, the Lauda Air Boeing 767 flight 004 from Bangkok to Vienna crashed into the jungle from a height of approximately 25,000 feet. 60 Note that “wire” in this context is used synonymously to electronically connected devices, including some control software. Of course, classical wires in the sense of steel cables have been used in airplanes since the early days of aviation to steer a plane in a purely analog way.


An analysis of the accident revealed that the thrust reverser hydraulic isolation valve had opened during the ascent of the airplane. The thrust reverser restow system is usually activated during landing in order to slow down the plane. This is achieved in combination with brakes on the main landing gear and aerodynamic braking by wing top spoilers being directed upward. For thrust reverse, the aircraft engine’s thrust is diverted so that it is directed forward rather than backward. The flight voice recorder was found and analyzed, while the flight data recorder was damaged and unreadable. First, a discussion between the crew was recorded related to an alert associated with the thrust reverser isolation valve. But no actions were recommended, and no actions were taken. Furthermore, activation of the reverse thrust during flight was considered harmless based on tests done by Boeing. The flight crew, as well as the operational guidelines, were not aware of the—as it later turned out—critical nature of the warning message. Later, the pilot’s voice was recorded: “Ah, reverser’s deployed.” Twenty-nine seconds later, the recording ended. The examination of the wreck showed that the left engine thrust reverser was deployed during flight without a command from the control system or pilot (cf. [17, 43]). Tests afterwards indicated that an electrical short circuit occurring simultaneously with an auto-stow action could result in an unintentional opening of the directional control valve for up to 1 second, which would allow an unauthorized activation of the reverse thrust. As a consequence of the crash, both the Aircraft Accident Investigation Committee of Thailand’s Ministry of Transportation and Communications and the US National Transportation Safety Board suggested providing appropriate safeguards to prevent in-flight deployment of thrust reversers (see [19, 28]). Therefore, Boeing modified the thrust reverser system and included sync-locks allowing deployment only in the case where the main landing gear truck tilt angle was at the ground position. Further tests by Boeing revealed that activation of reverse thrust leads to an uncontrollable flight state and a nearly unavoidable crash if the aircraft is moving fast and flying at higher altitudes (cf. [28]). Tests prior to the accident had been performed at lower speed and lower altitude (see [19]).

4.1.4 The Lufthansa Warsaw Accident

On September 14, 1993, Lufthansa flight 2904 from Frankfurt to Okęcie International Airport near Warsaw, Poland, overshot the runway and hit an embankment and an instrumental landing system localizer with its left wing (cf. [44]). This caused a fire on the left wing of the Airbus A320-211 that spread to the main cabin. Two passengers died in the accident, one from the impact and one from the fire, and many were injured (see [11, 22]).

The weather conditions on this day were difficult due to wind, rain, and water on the runway. In the event of high wind, aircraft have to approach the runway with higher speed. For a crosswind landing, different techniques are recommended such as crab (approaching the runway with the plane twisted at an angle to the runway) or sideslip (approaching the runway with a hanging wing) (see [40, 46]). Furthermore, different runway directions or even different runways with different orientation may be chosen (if available) depending on the direction of the wind.

Figure 4.1. Sketch of different angles in the context of the wind during the landing of LH 2904 at Warsaw in 1993. The aircraft headed in direction 113◦, the information obtained from the tower said front-to-side wind from 160◦ with 25 km/h, but the actual measurements showed tailwind from 280◦ with 33 km/h. All angles are indicated mathematically negatively (i.e., clockwise) with zero being the magnetic northern direction at the Warsaw airport (see [26] for more detailed information on angles, etc.).

In the investigation of the accident, two main reasons for the problem have been identified:
• The wind data at Warsaw airport was delivered orally every three minutes. Therefore, the data transferred to the pilot were outdated and wrong (cf. [11, 21, 22]). Instead of the indicated wind speed of 13.5 kn, equivalent to 25 km/h, with direction 160◦,61 actual values of 18 kn ≈ 33 km/h and a direction of 280◦ were measured by the digital flight data recorder during landing (see Fig. 4.1 for a visualization of the different directions; a decomposition of these winds into headwind and crosswind components is sketched at the end of this subsection). In particular, the different wind direction (tailwind instead of side wind) made a huge difference, since the configuration of the aircraft during approach is adjusted accordingly.
• Airbus software includes locks for preventing an erroneous activation of braking devices including reverse thrust to avoid problems similar to the Lauda Air accident described above. The reverse thrust could only be active in cases where more than 6.3 tons of weight acted on each of the two main landing gear struts. Furthermore, the spoilers could be activated only if this weight condition was satisfied or if the wheels were turning faster than 133 km/h. The wheel brakes could be activated if the wheel speed of both landing gears was greater than a certain threshold. Table 4.1 visualizes the control logic for the different devices in a simplified form; the actual control interdependencies are more complicated and partly inconsistent (see [20]). This logic and the related numerical limits seemed to be reasonable for indicating a landing process but turned out to be critical in hindsight.

The sequence of the major events during the Warsaw accident is listed in Tab. 4.2. At time point t1, the aircraft was already a bit too fast and high due to the wrongly indicated wind situation. Only the right landing gear touched the ground, but no braking could be activated since the control software expected weight on both landing gear struts (see condition A in Tab. 4.1). This situation did not change when the nose landing gear touched the ground. Only at t3, i.e., about nine seconds after t1, did the left landing gear touch the ground with the corresponding wheels. Soon afterwards, the spoilers were activated due to condition B (wheel speed).

61 The speed of an aircraft is measured either in knots (kn) or in kilometers per hour (km/h), where 1 kn is equivalent to 1.852 km/h. The angle of the incoming wind and of the aircraft’s heading direction are indicated in zero to 360 degrees as on a compass, with zero being the local magnetic northern direction; see [26].


Table 4.1. Conditions to be fulfilled to activate spoilers, thrust reversers, and wheel brakes. Fabs,l and Fabs,r represent the force acting on the shock absorber of the left and right landing struts, respectively. Vw,l and Vw,r are the velocity of the wheels at the left and right landing struts, respectively. TH is an internally computed threshold (see [24]).

             condition A: shock absorbers        log. comb.   condition B: wheel speeds (in km/h)
spoilers     Fabs,l > 62 kN ∧ Fabs,r > 62 kN         ∨        Vw,l > 133 ∧ Vw,r > 133
reversers    Fabs,l > 62 kN ∧ Fabs,r > 62 kN         –        –
brakes       –                                       –        Vw,l > TH ∧ Vw,r > TH
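The interlock of Table 4.1 can be written down in a few lines of code. The sketch below is our own schematic rendering, not the actual Airbus flight-control implementation; it shows how, with weight on only one strut and aquaplaning wheels, none of the three braking devices is allowed to engage. The numerical value used for the internally computed threshold TH is made up.

# Schematic sketch of the activation logic of Table 4.1 (our own rendering,
# not the actual Airbus flight-control code). Forces in kN, wheel speeds in km/h.
def condition_a(f_abs_left, f_abs_right):
    """Both shock absorbers compressed by more than 62 kN (about 6.3 t each)."""
    return f_abs_left > 62.0 and f_abs_right > 62.0

def condition_b(v_left, v_right, threshold=133.0):
    """Both wheel speeds above the given threshold."""
    return v_left > threshold and v_right > threshold

def braking_devices(f_l, f_r, v_l, v_r, brake_threshold):
    # brake_threshold stands in for the internally computed TH; value made up
    return {
        "spoilers":  condition_a(f_l, f_r) or condition_b(v_l, v_r),
        "reversers": condition_a(f_l, f_r),
        "brakes":    condition_b(v_l, v_r, brake_threshold),
    }

# Warsaw-like situation: only the right gear carries weight, wheels aquaplaning
print(braking_devices(f_l=10, f_r=80, v_l=50, v_r=50, brake_threshold=100))
# -> {'spoilers': False, 'reversers': False, 'brakes': False}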

Table 4.2. Timeline of the important events during the Warsaw accident. RLG, NLG, and LLG stand for the right, nose, and left landing gears, respectively. The conditions A and B in the last two columns are the ones of Tab. 4.1. The question marks at t3a, t3, and t4 indicate that the full or necessary weight on the struts had most probably not yet been reached since (fully) deployed spoilers are necessary to reduce the lift force on the aircraft; see [12, 20].

time              distance on runway in m   event                     condition A   condition B
t1  = 15:33:48     770                      RLG on ground
t2  = 15:33:51    1028                      NLG on ground
t3a = 15:33:52    1112                      reversers activated            ?             ?
t3  = 15:33:57    1525                      LLG on ground                  ?             ?
t4  = 15:33:59    1680                      spoilers fully deployed        ?             X
t5  = 15:34:01    1830                      reversers start moving         X             X
t6  = 15:34:02    1900                      full reverser action           X             X
t7  = 15:34:19    2783                      end of runway reached          X             X
t8  = 15:34:22    2885                      impact                         X             X

This reduced the lift force on the aircraft up to a point where condition A (weight of at least 6.3 tons) was fulfilled and the reversers started to move at t5.62 From the formal activation of the reversers at t3a to their full action at t6, it took at least 9 seconds. When the major braking devices were all active at t6, the remainder of the runway was too short to bring the aircraft to a full stop under the given conditions (aggravated by aquaplaning on the runway), and a go-around was no longer an option. After this incident, Airbus changed the landing logic (by reducing the weight in condition A to 2 instead of 6.3 tons; see [33, 44]) in the A320 such that no lengthy delay in the activation of the braking systems should occur.

62 A different reasoning concerning the fulfillment of conditions A and B is described in [24]; but this variant seems to be inconsistent because of the additional delay of the thrust reversers’ action; see [12, 20].
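As referenced in the discussion of the wind data above, the reported and the measured wind of Fig. 4.1 can be decomposed into headwind and crosswind components relative to the runway heading of 113°. The following short computation is our own arithmetic check, not part of the accident report.

# Decompose a wind given by its "from" direction and speed into headwind and
# crosswind components relative to a runway heading (all angles in degrees,
# compass convention). Our own arithmetic check of the values in Fig. 4.1.
import math

def wind_components(heading_deg, wind_from_deg, speed_kmh):
    # angle between the aircraft heading and the direction the wind comes from
    rel = math.radians(wind_from_deg - heading_deg)
    headwind = speed_kmh * math.cos(rel)    # > 0: headwind, < 0: tailwind
    crosswind = speed_kmh * math.sin(rel)   # side component
    return headwind, abs(crosswind)

print(wind_components(113, 160, 25))   # reported: ~17 km/h headwind, ~18 km/h crosswind
print(wind_components(113, 280, 33))   # measured: ~ -32 km/h (tailwind), ~7 km/h crosswind

The reported wind corresponds to a moderate headwind with a noticeable crosswind, whereas the measured wind is almost a pure tailwind of roughly 32 km/h, consistent with the aircraft arriving too fast and too high.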

4.1.5 Other Bugs in This Field of Application

In this section, we focus on civil aviation and do not report on any military incidents since the amount, availability, and content of literature is very different for these two categories. Some of the incidents are related to bugs or nonintuitive setups in the human-computer interaction.



Figure 4.2. Near-crash of LH 044 at Hamburg in 2008. (a) Picture of the left wing tip touching the ground. Credit: Lars Tretau / BILD. (b) QR code for a YouTube video of the landing approach and near accident.

• On March 1, 2008, flight LH 044 attempted to land in Hamburg, Germany, under bad weather conditions. Facing a strong side wind, the Airbus A320 nearly crashed during the approach for landing. One wing even touched the ground before the pilot successfully executed a go around (see [31, 32] and the YouTube video referred to in Fig. 4.2(b)). As the root cause of the near crash, an investigation revealed a problem in the flight control software (cf. [7]). As soon as one wheel touches the ground, the flight computer switches to ground mode and restricts the power of the ailerons and the pilot’s ability to move them (see [41]).63 For landing, the engineers expected that rolling movement would no longer be necessary, so the software limited the possible aileron movement in ground mode. During the approach of flight LH 044, the pilot was therefore unable to countersteer in order to withstand the strong crosswind. By accelerating and taking off, the aircraft switched to flight mode again and the dangerous situation was resolved. Because this behavior of the flight computer was not clearly documented, the pilots were not able to foresee the reduced steering ability (see [3, 31]). • On May 9, 2015, a predelivery test flight of an Airbus A400M Atlas ended up in a crash at La Rinconada close to Seville Airport in Spain. Three of the four engines failed in-flight, causing an uncontrolled descent and final crash killing four crew members (see [18, 38]). Initially, it was suspected that the three engines failed because of insufficient fuel supply. This could have been caused by a special piece of software that was supposed to support military flight maneuvers by pumping fuel from one wing into the other and thus changing the trim by relocating the balance point. This pumping could have resulted in a strong banking, causing the lack of fuel (cf. [13, 14, 35]). Further investigation revealed the actual cause of the accident. Prior to the plane’s first flight, process files known as “torque calibration parameters” had been deleted 63 Ailerons are the small movable devices on the wings that can be moved up and down in order to roll the aircraft around the longitudinal axis.


due to a software update (cf. [18, 38]). All the engines are controlled by separate computers that are responsible for executing the pilots’ commands in an optimal way. The deleted files are necessary for interpreting sensor data on the force generated by each engine to control the propellers. Without these files, the computer cannot function properly. Additionally, the pilots did not receive a warning about the problem on the ground since the design specified a corresponding warning to be issued only at a minimum altitude of 120 m (see [38]).
• Two aircraft incidents have been related to a device called the Air Data Inertial Reference Unit (ADIRU). This system collects and provides information on the flight, in particular values of speed and angle of attack (cf., e.g., [41]). If the ADIRU delivers wrong data, this can lead to erroneous warnings, a loss of the altitude information on the flight display, and a wrong indication of the aircraft’s angle of attack.
  In the case of Qantas flight 72 on October 7, 2008, problems in one of the ADIRU devices led to invalid data for the angle of attack, to which the receiving flight computer reacted with sudden, accidental pitch-down maneuvers before the pilots were able to regain control of the plane (see [31, 45]). About 119 passengers and crew members were injured due to the sudden movements. Two different aspects are relevant in the context of this incident (cf. [1]). First, the hardware of one of the three ADIRU devices delivered erroneous data,64 generating an unusual number of data spikes, i.e., of spurious values for measurements that have to be treated as outliers by the interpreting software in the flight control computer. Second, the flight control computer’s algorithm to filter the data spikes was suboptimal. While it was able to filter at least some of the outliers before the incident, it continued to have problems with certain input of the erroneous ADIRU device (see [1]).
  In the case of Air France flight 447 on June 1, 2009, problems related to the ADIRU even caused the loss of the Airbus A330-203 and the death of all 228 crew members and passengers (see [31, 36]). AF 447 flew from Rio de Janeiro to Paris at high altitude above the Atlantic Ocean. The autopilot was engaged and stabilized the aircraft in turbulence. Presumably due to an ice cover on all three Pitot tubes, which are used for measuring speed on board the aircraft, no valid input data with respect to speed was obtained by the ADIRU (cf. [8]). The flight control computer did not manage to provide useful data for the autopilot settings. Therefore, the autopilot suddenly disengaged, which surprised the pilots. The usually rapid diagnosis and counteraction in the case of iced-up Pitot tubes was not realized by the pilots. Without detailed information on speed and in the turbulent flight phase, this led to major problems, the confusion of the pilots, and finally to the crash of the aircraft.
• On July 1, 2002, two planes collided midair close to Überlingen, Germany. The so-called Traffic Collision Avoiding System (TCAS) is included in the computer systems of up-to-date aircraft. In a case where planes approach each other on a potential collision course, both TCAS systems involved make contact, decide how to circumvent each other in order to avoid a collision, and pass on the resulting commands to the two pilots. This type of system is necessary due to the high density of air traffic and the high accuracy for the cruising altitude and for the heading available by modern GPS data [5].

64 It is unclear which part of the computer hardware was actually not fully working.


Besides the TCAS, air traffic is still always managed by human air traffic controllers (ATCs) on the ground. On July 1, 2002, Bashkirian Airlines Flight 2937, a Tupolew Tu-154, was on its way from Moscow to Barcelona. On board were 60 passengers, mainly children from Ufa, Bashkortostan, for a UNESCO-organized stay at the Costa Dorada in Spain. At the same time, the cargo flight DHL 611 was traveling from Bergamo to Brussels. Only one single ATC in Zurich was on duty for the Swiss company Skyguide (see [34]). When the aircraft were approaching each other, the TCAS instructed the DHL pilot to descend and the Russian pilot to climb. But the ATC in Zurich also commanded the Russian aircraft to descend because he was not aware of the descent of the DHL. Hence, both planes descended and collided, killing 71 persons (cf. [6]). The insufficient reaction of the air traffic controller was induced by a series of unfortunate circumstances (see [6, 23]):
  – The ATC was alone and had to track two monitors. On short notice, he simultaneously had to deal with another flight, which was delayed and heading for the airport at Friedrichshafen.
  – The short-term conflict alert system to support the ATC was not functioning.
  – Controllers from Karlsruhe, Germany, noticed the approaching planes, but were not able to contact Zurich due to a broken telephone link.
  As a consequence of the tragic event, the inquiry board suggested that in cases of conflicting orders, the pilots should always follow the TCAS. The tragedy of the accident also caused the death of the Zurich air traffic controller, who was murdered one and a half years later by a Russian citizen who had lost his family in the accident (cf. [34]).
• On July 26, 2017, a German Forces helicopter “Tiger” crashed in Mali near Gao, leading to the death of both pilots. Investigations revealed that the autopilot responsible for controlling the pitch elevator caused the accident. The autopilot is programmed to balance the effect of turbulence. In this case, the turbulence was too strong, and the autopilot unsuccessfully tried to balance the helicopter within limited margins specified in the software, pushing the helicopter into a steep dive. Then the autopilot switched off, leaving the pilots no time to react (see [29]).
• On January 20, 1992, Air Inter Flight 148 on an Airbus A320 crashed in the mountains near Strasbourg Airport in France, killing 87 of the 96 persons on board (see [37]). The investigation showed that there was no single cause of the accident, but multiple factors contributed to the disaster. In particular, air traffic control provided instructions to the pilots rather late (cf. [4]). Furthermore, the pilots entered a descent rate by typing “−3.3” into the flight control unit, which resulted in a very high descent rate of 3,300 feet per minute (cf. [2]). The probable reason65 for this high descent rate is that the pilots inadvertently had the flight control unit in vertical speed mode instead of the necessary track/flight path angle mode. In the latter, “−3.3” would have been interpreted as an angle of descent of −3.3◦, which would have been equivalent to a descent rate of about 800 feet per minute (see [2, 4];66 in [27], a similar but virtual scenario is described). A short arithmetic check of the two interpretations is sketched below.

65 An alternative explanation is that the pilots entered an incorrect value (see [2, 37]).
66 These papers especially consider problems with “mode awareness.”
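The factor-of-four difference between the two interpretations of the entry “−3.3” can be checked with a small back-of-the-envelope computation (our own, not taken from the accident report). The ground speed of about 140 kn is an assumed, typical approach value that is not stated in the sources above.

# Back-of-the-envelope check of the two interpretations of the "-3.3" entry
# on Air Inter flight 148 (our own arithmetic; the 140 kn ground speed is an
# assumed, typical approach value and not taken from the accident report).
import math

ground_speed_kn = 140.0
ground_speed_ftmin = ground_speed_kn * 6076.12 / 60.0   # 1 kn = 6076.12 ft per hour

# Vertical speed mode: "-3.3" means -3,300 ft/min directly.
vs_mode = 3300.0

# Flight path angle mode: "-3.3" means a 3.3 degree descent angle.
fpa_mode = ground_speed_ftmin * math.tan(math.radians(3.3))

print(f"vertical speed mode:     {vs_mode:.0f} ft/min")
print(f"flight path angle mode:  {fpa_mode:.0f} ft/min")   # roughly 800 ft/min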


• On April 26, 1994, China Airlines Flight 140 flew from Taipei (Taiwan) to Nagoya (Japan). During the approach for landing, the copilot inadvertently activated the autopilot, which entered the GO AROUND mode (see [11, 22, 30, 39]). When trying to manually counteract the erroneous go-around maneuver of the autopilot, the pilots did not disconnect the autopilot, which in turn pushed the go-around even harder. Since a landing was no longer possible, the pilots decided on a go-around independently, but the sequence of their commands led the aircraft to stall. This caused the Airbus A300B4-622R to crash and resulted in the death of 264 passengers and crew members. The fact that the autopilot could be switched on inadvertently and lead to the wrong mode at least contributed to the accident, together with an unclear description of the automatic flight system (cf. [30]).
• During the design and construction of the Airbus A380, the German and French subgroups used different variants of a computer-aided design (CAD) program.67 For the final components to be assembled during construction, this resulted in many wires being too short (see [10, 16]) and delayed delivery of A380 planes considerably.
• In May 2015, it became public that the Boeing Dreamliner suffered from a software glitch. Simultaneously powering up the four main generator control units (GCUs) and letting them run for 248 days resulted in all four GCUs going into failsafe mode at the same time, losing all AC electric power regardless of the flight phase. This would lead to a shutdown of the aircraft computer, also during flight mode. In the meantime, a software update has removed the bug (cf. [15]). The arithmetic behind the 248-day figure is sketched below.

67 Different details can be found in the literature: The author of [16] indicates a problem of different versions of the commercial software CATIA: The German developers in Airbus’s multinational setup kept using version 4, whereas the French groups had already migrated to version 5 for the different components to be designed on each side. In [10], it is stated that the German groups relied on another software developed by the US company Computervision. In any case, design changes on either side could not be merged consistently between the developing groups.
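The 248-day figure in the last item is consistent with a signed 32-bit counter of hundredths of a second overflowing. The cited sources do not spell out the mechanism in this detail, so the following sketch is an illustrative assumption rather than Boeing’s documented root cause.

# Illustrative assumption (not Boeing's documented root cause): a signed 32-bit
# counter of hundredths of a second overflows after roughly 248 days of
# continuous operation, which matches the figure quoted for the 787 GCUs.
ticks_per_second = 100            # counter increments every 10 ms
max_signed_32bit = 2**31 - 1      # largest value a signed 32-bit integer can hold

seconds_until_overflow = max_signed_32bit / ticks_per_second
days_until_overflow = seconds_until_overflow / 86400
print(f"overflow after about {days_until_overflow:.1f} days")   # ~248.6 days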

Bibliography

[1] Australian Transport Safety Bureau. ATSB TRANSPORT SAFETY REPORT Aviation Occurrence Investigation – AO-2008-070 Final. Technical report, ATSB, 2011. https://www.atsb.gov.au/media/3532398/ao2008070.pdf
[2] Aviation Safety Network. Accident description: ASN aircraft accident Airbus A320-111 F-GGED Strasbourg-Enzheim Airport. date accessed: 2018-08-10, 2018. http://aviation-safety.net/database/record.php?id=19920120-0
[3] Bild. Beinahe-Unglück in Hamburg – war es ein Computer-Problem? date accessed: 2018-07-31, 2009. https://www.bild.de/news/2009/an-beinahe-unglueck-bei-airbus-flug-9149694.bild.html

[4] C. E. Billings. Human-Centered Aviation Automation: Principles and Guidelines. Technical Report February 1996, NASA, 1996. https://ntrs.nasa.gov/ archive/nasa/casi.ntrs.nasa.gov/19960016374.pdf

[5] J. Brooks. Air collision RISK from increased accuracy. date accessed: 2018-07-31, 1997. https://catless.ncl.ac.uk/Risks/19/07#subj12


[6] Bundesstelle für Flugunfalluntersuchung (BFU). Untersuchungsbericht Überlingen/Bodensee. Technical report, Bundesstelle für Flugunfalluntersuchung, 2004. https://www.bfu-web.de/DE/Publikationen/Untersuchungsberichte/ 2002/Bericht_02_AX001-1-2.pdf

[7] Bundesstelle für Flugunfalluntersuchung (BFU). Investigation Report - A320 Crosswind Landing. Bundesstelle für Flugunfalluntersuchung, (March):85, 2010. https://www.bfu-web.de/EN/Publications/Investigation%20Report/2008/Report_08_5X003_A320_Hamburg-Crosswindlanding.pdf

[8] Bureau d’Enquêtes et d’Analyses pour la sécurité de l’aviation Civile. Final Report on the Accident on 1st June 2009 to the Airbus A330-203. Technical report, BEA, 2012. https://www.bea.aero/docspa/2009/f-cp090601.en/pdf/fcp090601.en.pdf

[9] B. Burns. Control-configured combat aircraft. date accessed: 2018-07-31, 1978. https://www.sto.nato.int/publications/AGARD/AGARD-AG-234/AGARDAG-234.pdf

[10] N. Clark. The Airbus saga: Crossed wires and a multibillion-euro delay. The New York Times, Dec. 11, 2006. http://www.nytimes.com/2006/12/11/business/ worldbusiness/11iht-airbus.3860198.html

[11] Der Spiegel. “800 Tote sind zuviel.” Der Spiegel, 33, 1994. http://www.spiegel. de/spiegel/print/d-13684448.html

[12] Actuation delay was crucial at Warsaw. Flight International, 144(4391):10, 1993 [13] J. Flottau. Das Problem ist die Software. Sueddeutsche Zeitung, 2015 [14] J. Flottau and T. Osborne. Software cut off fuel supply in stricken A400M. date accessed: 2018-02-22, 2015. http://aviationweek.com/defense/software-cutfuel-supply-stricken-a400m

[15] S. Gibbs. US aviation authority: Boeing 787 bug could cause “loss of control.” The Guardian, May 1, 2015. https://www.theguardian.com/business/2015/ may/01/us-aviation-authority-boeing-787-dreamliner-bug-could-causeloss-of-control

[16] R. Grabowski. A380 delayed by CATIA 4/5 Mix. date accessed: 2018-02-22, 2006. http://worldcadaccess.typepad.com/blog/2006/09/a380_delayed_by. html

[17] J. G. Hermann Kopetz, R. Shapiro, J. Morris, and S. Philpson. Lauda Air. date accessed: 2018-02-22, 1991. http://catless.ncl.ac.uk/Risks/11.82.html# subj3

[18] L. Kelion. Fatal A400M crash linked to data-wipe mistake. date accessed: 201807-31, 2015. www.bbc.co.uk/news/technology-33078767 [19] J. L. Kolstad et al. Safety Recommendation A-91-45 through -48. Technical report, NTSB, 1991. https://www.ntsb.gov/safety/safety-recs/RecLetters/ A91_45_48.pdf


[20] P. B. Ladkin. Analysis of a Technical Description of the Airbus A320 Braking System. Technical report, CRIN-CNRS & INRIA, 1995. http://www.rvs.unibielefeld.de/publications/Papers/HIS-PBL-A320speccrit.pdf

[21] P. B. Ladkin. Lufthansa in Warsaw. date accessed: 2018-02-22. http://catless. ncl.ac.uk/Risks/15.13.html#subj7

[22] P. B. Ladkin. Summary of Der Speigel interview with Airbus’ Bernard Ziegler, Airbus Ind. date accessed: 2018-02-22. http://catless.ncl.ac.uk/risks/16. 35.html#subj2

[23] P. B. Ladkin. Tu154M and B757, midair collision, near Überlingen, Lake Constance, Germany, July 2002. date accessed: 2018-02-22, 2002. http://www.rvs.uni-bielefeld.de/publications/compendium/incidents_and_accidents/Ueberlingen2002.html

[24] Main Commission Aircraft Accident Investigation Warsaw. Report on the Accident to Airbus A320-211 Aircraft in Warsaw. Technical report, 1994. http://www.rvs.uni-bielefeld.de/publications/Incidents/DOCS/ ComAndRep/Warsaw/warsaw-report.html

[25] National Academies. More effort needed to avoid problems associated with new flight control systems. date accessed: 2018-07-31, 1997. http://www8. nationalacademies.org/onpinews/newsitem.aspx?RecordID=5469

[26] M. S. Nolan. Fundamentals of Air Traffic Control. Cengage Learning, 5th edition, 2010
[27] N. B. Sarter and D. D. Woods. How in the World Did We Ever Get into That Mode? Mode Error and Awareness in Supervisory Control. Human Factors: The Journal of the Human Factors and Ergonomics Society, 37(1):5–19, 1995
[28] H. Sogame. Lauda Air B767 Accident Report. Technical report. http://www.rvs.uni-bielefeld.de/publications/Incidents/DOCS/ComAndRep/LaudaAir/LaudaRPT.html

[29] Spiegel Online. “Tiger”-Hubschrauber: Falsch eingestellter Autopilot leitete Sturzflug ein. date accessed: 2018-07-31, 2017. http://www.spiegel.de/politik/deutschland/tiger-hubschrauber-falsch-eingestellter-auto-pilot-leitete-sturzflug-ein-a-1195721.html

[30] K. Takeuchi. Aircraft Accident Investigation Report 96-5. Technical report, Aircraft Accident Investigation Commission, 1996. http://www.mlit.go.jp/jtsb/ eng-air_report/B1816.pdf

[31] G. Traufetter. Captain Computer. Spiegel, 31:107–118, 2009. http://magazin. spiegel.de/EpubDelivery/spiegel/pdf/66208581

[32] G. Traufetter. Near-crash in Hamburg: Investigators criticize Airbus for inadequate pilot manuals. date accessed: 2018-03-22, 2010. http://www.spiegel.de/international/germany/near-crash-in-hamburginvestigators-criticize-airbus-for-inadequate-pilot-manuals-a681796.html

[33] U. Voges. Lufthansa Airbus Warshaw. date accessed: 2018-02-22, 1993. http: //catless.ncl.ac.uk/Risks/15.31.html#subj8


[34] Wikipedia. 2002 Ueberlingen mid-air collision. date accessed: 2017-06-20. https://en.wikipedia.org/wiki/2002_%C3%9Cberlingen_mid-air_collision

[35] Wikipedia. 2015 Seville Airbus A400M Atlas crash. date accessed: 2018-02-22. https://en.wikipedia.org/wiki/2015_Seville_Airbus_A400M_Atlas_crash

[36] Wikipedia. Air France Flight 447. date accessed: 2018-01-26. https://en. wikipedia.org/wiki/Air_France_Flight_447

[37] Wikipedia. Air Inter Flight 148. date accessed: 2017-08-07. https://en. wikipedia.org/wiki/Air_Inter_Flight_148

[38] Wikipedia. Airbus A400M Atlas. date accessed: 2018-07-31. https://en. wikipedia.org/wiki/Airbus_A400M_Atlas

[39] Wikipedia. China Airlines Flight 140. date accessed: 2018-08-13. https://en. wikipedia.org/wiki/China_Airlines_Flight_140

[40] Wikipedia. Crosswind landing. date accessed: 2018-07-31. https://en.wikipedia.org/w/index.php?title=Crosswind_landing

[41] Wikipedia. Flight control modes. date accessed: 2018-08-08. https://en. wikipedia.org/wiki/Flight_control_modes

[42] Wikipedia. Fly-by-wire. date accessed: 2017-12-31. https://en.wikipedia. org/wiki/Fly-by-wire

[43] Wikipedia. Lauda Air Flight 004. date accessed: 2018-02-22. https://en. wikipedia.org/wiki/Lauda_Air_Flight_004

[44] Wikipedia. Lufthansa Flight 2904. date accessed: 2018-02-19. https://en. wikipedia.org/wiki/Lufthansa_Flight_2904

[45] Wikipedia. Qantas Flight 72. date accessed: 2018-02-21. https://en.wikipedia. org/wiki/Qantas_Flight_72

[46] Wikipedia. Seitenwindlandung. date accessed: 2018-07-31. https://de.wikipedia.org/wiki/Seitenwindlandung

4.2 Automotive Bugs

Always focus on the front windshield and not the rearview mirror.
—Colin Powell

4.2.1 Overview

• Field of application: automotive industry
• What happened: deployment of deactivated airbag; autonomous driving accidents
• Loss: several deaths
• Type of failure: faulty extension of legacy software; incorrect identification of obstacles


4.2.2 Introduction

Like aircraft, cars are increasingly packed with electronics and digital systems for (see [13, 20, 43])
• controlling events in the engine such as the injection of fuel;
• regulating the vehicle’s handling performance via the electronic stability control (ESC) and anti-lock braking system (ABS);
• automatic emergency braking or the forward-collision warning system;
• activating airbags;
• displaying warning signals in cases such as flat tires, lack of cooling fluid, or unfastened seatbelts;
• advanced driver-assistance systems such as cruise control, navigation, automatic parking, driver drowsiness detection, lane departure warning system, etc.

The amount and complexity of software hidden in a modern car is considerable (see [42]) and can be compared to the code running on a smartphone, containing 100 million lines of code or more (cf. [13]). Assuming a rate of 5 to 50 bugs per 1,000 lines—as generally estimated for commercial software68 (see [27])—would indicate more than 500,000 possible bugs in 100 million lines of code. At least for fatal errors, their number depends logarithmically on the number of lines according to other sources (see, e.g., [39]), resulting in the order of 10 severe errors for a million lines of code. Furthermore, electronic components account for up to 40% of the value of a new car (cf. [43]). Besides other technical innovations such as optimizing mileage, endurance, safety, and pollutant emission, driver convenience is being improved considerably, even including features of autonomous driving. Similar to fly-by-wire systems, computerization can lead to reducing or even eliminating the direct influence of the driver over the car. Software bugs are common and possibly perilous in this context. Furthermore, such safety devices might lead to a deceptive sense of security. Additionally, the example of Volkswagen’s (VW) “Clean Diesel” strategy and similar concepts of other car manufacturers uncovered since 2015 shows that digitization also allows for easy manipulation of automotive software to cheat and to bypass prescribed legal limits (see [13]).

In the following, we focus on two different problems of automotive software that are safety-critical: faulty deactivation of an airbag, and bugs in the context of autonomous driving.

68 For noncommercial or high-accuracy demands, a considerably smaller amount of bugs per lines of code is achieved via considerable effort (see Sec. 7.3.3 for a brief discussion).

4.2.3 Faulty Airbag Deactivation

In case of a collision, airbags absorb the impact of drivers and passengers with other parts of the car. Due to the high pressure and fast inflation of the airbag in the event of activation, keeping a certain distance from the dashboard is necessary to avoid injuries caused by the airbag itself. Especially in the case of a baby seat fixed on the passenger seat, the airbag has to be deactivated in order to avoid endangering the infant. Different methods are common for deactivating an airbag, either via software or via hardware.

On December 25, 1999, a VW Golf collided with an oncoming car near Erding, Germany (see [47]). The 3-month-old baby on the passenger seat was killed by the airbag even though the airbag had been deactivated by the dealer’s garage. The parents survived with slight injuries. The deactivation was properly programmed into the airbag control unit. The case was submitted to court, which had to identify the cause and responsibilities.

At first, faulty explanations of this bug appeared in German newspapers and also in The RISKS Digest (cf. [25, 47]). They claimed that the electronic system on board a car has to function under hard conditions caused by temperature, humidity, and jolting, which can interfere with the car’s electronic circuits. This could lead to malfunctions. Therefore the software of the airbag control unit continuously conducts automatic checking in the form of a parallel balance. If the checksum indicates a malfunction, the unit switches to an emergency mode based on reading the basic data permanently stored in the control plate ROM (read-only memory). But in the case of the above accident, these backup data allegedly ignored the deactivation of the passenger airbag. In this way, while driving, the state of the airbag could return to the standard setting, ignoring the deactivation.

In court, M. Broy from the Technical University of Munich acted as an expert witness. Analysis of the code revealed the actual explanation of the fault (see [42] and personal communication with M. Broy). The integration of the side airbags into the already existing software for front airbags was done by adding some lines to the existing code. The activation of the front and side airbags was executed in two rounds. The bit for deactivation of the passenger airbag was only set for the first (front airbag) round, but not for the second round. Due to some obscure technical facts, the front airbag was activated in the second round, which was meant for the side airbags. But the VW Golf under consideration was not even equipped with side airbags. A schematic sketch of this kind of flaw is given below.
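The pattern described above, a deactivation flag that is honored in the first firing round but missing in the second round added later, can be sketched schematically. The code below is our own simplified illustration of that kind of legacy-extension flaw; it is not the actual control-unit implementation, and all names are invented.

# Schematic sketch of the kind of legacy-extension flaw described in the text:
# a passenger-airbag deactivation flag is checked in the first firing round
# (front airbags) but was never added to the second round (side airbags).
# Our own simplified illustration, not the actual control-unit code.
def fire_airbags(crash_detected, passenger_airbag_disabled, has_side_airbags):
    fired = []
    if not crash_detected:
        return fired

    # Round 1: front airbags -- the deactivation flag is honored here.
    fired.append("driver_front")
    if not passenger_airbag_disabled:
        fired.append("passenger_front")

    # Round 2: added later for the side airbags -- the flag check was forgotten,
    # and under certain conditions the front airbag is triggered here as well.
    if has_side_airbags:
        fired.append("side")
    else:
        fired.append("passenger_front")   # the bug: fires despite deactivation

    return fired

print(fire_airbags(True, passenger_airbag_disabled=True, has_side_airbags=False))
# -> ['driver_front', 'passenger_front']  (the deactivated airbag deploys anyway)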

4.2.4 Autonomous Driving

Many companies are developing autonomous driving to relieve the human driver and to ensure road safety. This led to a series of accidents with autonomous cars between 2016 and 2018.
• The first accident happened on February 14, 2016, in Mountain View, California. On a multiple-lane road, the right lane was narrowed by sandbags positioned around a storm drain beside the road close to an intersection. Therefore, Google’s Lexus autonomous vehicle (AV) had to switch from this lane to the next on the left. After a few cars had passed, a public bus was approaching from behind on the target lane. The Google AV started to switch lanes “believing” that the bus driver would slow down to allow the maneuver. But approximately 3 seconds later, the Google AV touched the side of the bus, leading to body damage on the left front fender of the Google AV. The QR code in Fig. 4.3(a) leads to a YouTube video of the incident. Obviously, the software used to model the human behavior of other drivers in the Google car was not able to predict the actual behavior of the bus driver properly.69 This also reveals the problem that in traffic, autonomous cars have to deal with “unreasonable” human behavior. Autonomous driving might be much easier if only other AVs were involved (cf. [9, 46]). Prior to this accident, Google had already reported 272 failures and 13 near misses70 in California alone between September 2014 and November 2015 (see [19]).

69 Note that the passenger in the Google car also expected that the bus would stop or slow down (cf. [46]).
70 In these 13 cases, human drivers had to fix the behavior of the autonomous control to avoid crashes.



Figure 4.3. QR codes for videos on YouTube regarding: (a) the Google AV touching a bus in Mountain View, California, on February 14, 2016; (b) an animation of the circumstances leading to the death of J. Brown in Florida in 2016 during an accident with his Tesla Model S in Autopilot mode; and (c) an animation of the incident involving a Tesla Model S hitting a stationary fire truck in California in January 2018.

• On May 7, 2016, a Tesla Model S was driving in Autopilot71 mode in Williston, Florida. According to the failure report [29], the Tesla car collided with a tractor-semitrailer truck turning left on a highway at an uncontrolled intersection. The oncoming truck was making a left turn, but the Tesla continued straight on, crashed with and passed underneath the trailer, bumped into a power pole and left its driver, the autonomous-driving enthusiast J. Brown, dying on site (cf. Fig. 4.3(b) for the QR code referring to an animation of the circumstances). Obviously, the software was not able to recognize the truck. This could have been caused by different factors. Many newspaper reports claimed that the white truck was not recognizable against the bright sky (see, e.g., [26, 28]). Furthermore, some sources explain the accident by a blind spot of the Autopilot system above the car’s hood (cf. [45]). In the user manual, Tesla states that obstacles that are narrow, lower than the dashboard, or hanging from the ceiling might not be detected. Tesla also claims that, “Neither Autopilot nor driver noticed the white side of the tractor trailer against a brightly lit sky, so the brake was not applied” and that, “The high ride height of the trailer combined with its positioning across the road and the extremely rare circumstances of the impact caused the Model S to pass under the trailer” (see [28, 45]). The failure report from the National Highway Traffic Safety Administration (NHTSA) did not identify any defects in the design or performance of the Autopilot system. The Tesla user manual clearly advises that driving in the Autopilot mode requires the continual and full attention of the driver to monitor the traffic environment and be prepared to take action to avoid crashes. But the incident seems to have a more fundamental software design background. The Autopilot sensors consist of a radar emitter located on the front of the car and a camera mounted at the top of the windshield (see [45]). According to The New York Times [44], the crash-avoidance system is designed to engage only when radar and computer vision systems agree that there is an obstacle.

71 Tesla’s advanced driver-assistance system feature has been named Tesla Autopilot and, after an update in its hardware, has been marketed as Enhanced Autopilot.


But Mobileye—the Israeli company that developed EyeQ3, the software used in the Tesla model—claimed that their computer vision system was only trained for detecting the rear of a vehicle. Hence, the software was not able to react to the trailer coming from the side (see, e.g., [15, 16, 23, 35]). The widely used EyeQ system-on-a-chip is based on deep-learning neural networks.72 Furthermore, it is also very hard for radar systems to detect an object standing or moving sideways (see [6]). Hence, both systems are designed to avoid rear-end collisions by detecting the rear of vehicles ahead [29], and both systems failed in the incident. A schematic sketch of such AND-fusion of two sensors is given below.
  In the literature, there have been rumors that the Tesla’s driver was watching a Harry Potter movie at the time of the crash (see, e.g., [26]). Detailed investigations showed no sign of such behavior (cf. [5, 29]). Tesla’s invasive promotion of the self-driving capability seduces drivers into using the autonomous mode in a dangerous way. J. Brown and other Tesla drivers were publishing YouTube videos driving in autonomous mode with their hands off the steering wheel. Similar videos exist of Tesla CEO E. Musk’s ex-wife and her friend. Furthermore, Musk told reporters that Autopilot was “almost twice as good as a person” even in its first version (see [34]). Even the name Autopilot promises more than it can currently deliver. At the time of the events, only drivers were blamed for failures of the system because of their full responsibility (cf. [34, 41]). Meanwhile, the US National Transportation Safety Board (NTSB) regards Tesla as partly responsible for the accident because overreliance on the semiautonomous system contributed to the accident (see [30]).
• On January 22, 2018, a Tesla Model S crashed into a stationary fire truck on Interstate 405 in Culver City, California. According to the driver, the car was operating in autonomous mode. The car was traveling behind a pickup truck at 65 mph. The pickup suddenly changed lanes because of the parked fire truck, but the autonomous system did not stop the car and let it crash into the truck (for details, see [1, 40]; the QR code in Figure 4.3(c) leads to an animation of the incident). This type of accident involving a self-driving Tesla and a stationary vehicle has already occurred several times and is obviously a known problem for cars with autonomous features [37].
• On March 18, 2018, a woman was killed by an autonomously driving Uber Volvo XC90 SUV while walking her bicycle across the street in Tempe, Arizona. The literature is not fully consistent with respect to the actual details and causes, and the final report was still missing at the time of writing. The preliminary report of the NTSB [31] states that the software classified the pedestrian first as an unknown object, then as a vehicle, and finally as a bicycle.73 The autonomous system determined that emergency braking would be necessary, but emergency braking was not enabled in autonomous mode to avoid erratic vehicle behavior. Independent information from May 2018 claims that the sensors detected the obstacle, but the software classified the pedestrian with the bicycle as a spurious signal (see [11]). The code has to decide whether the obstacle is something to ignore, such as a plastic bag, or whether it is an actual threat. In this case the pedestrian was wrongly estimated as a “false positive.” Volvo claims that their inbuilt security system would have activated an emergency braking (see [18, 33, 38]).

72 Compare the discussion on neural networks in Sec. 8.1.3 on page 220 and, in particular, the difficulties of neural networks in classifying slightly modified stop signs [12].
73 Assuming this bicycle was being ridden, not walked.


(a) Tesla X hitting a barrier in 2018

(b) Demonstration of Tesla's Autopilot

Figure 4.4. QR codes for videos on YouTube regarding (a) reconstruction of Tesla’s Autopilot behavior at the very site of the Tesla X accident in California in March 2018; and (b) a demonstration video of Tesla’s Autopilot behavior as of 2016 showing the cockpit as well as different camera views augmented by bounding boxes of detected objects.

• On March 23, 2018, a Tesla Model X crashed into a concrete barrier at a construction site near the junction of Highways 101 and 85 in Mountain View, California (see, e.g., [4]). The driver died in the hospital shortly after the crash. The car was running in autonomous mode and gave warning signals, but the driver had his hands off the steering wheel and did not react to the warnings. Later investigations showed that the Autopilot was confused by the lane markings and therefore did not follow the intended route but crashed into the barrier. It is still unclear why the Autopilot did not take any action to avoid the collision with the barrier and even accelerated just before the impact (see [2, 3, 32]). Note that the crash attenuator in front of the concrete barrier had been damaged in a previous accident and could not absorb the kinetic energy as usual (cf. [32]). The QR code of Fig. 4.4(a) refers to a YouTube video where another Tesla driver reproduces the behavior of the Autopilot at the very same location.
To summarize, at the time of writing, the so-called autonomous driving systems cannot truly be considered autonomous; despite the misleading name, they are driver-assistance systems only. The above-mentioned fatal crashes also raise legal questions about liability for accidents with self-driving cars. Who is responsible in case of an accident: the car company, the software developer, or the driver? Furthermore, ethical questions also exist. In the case of an inevitable fatal accident, the software has to decide who is to live and who is to die. Should the system save the driver and passengers or pedestrians, young or old or pregnant persons? One way out of this dilemma could be to include a sort of randomness in such an event, imitating unpredictable human behavior.
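The agreement-based braking design reported in [44] can be illustrated with a minimal sketch in Python. The function, the scenario names, and the truth values below are purely illustrative assumptions; this is not Tesla's or Mobileye's actual code or data.

def brake_decision(radar: bool, camera: bool) -> bool:
    # AND-style sensor fusion: act only when both independent subsystems agree.
    # This suppresses false alarms, but it also suppresses true obstacles that
    # only one subsystem (or neither) has been trained or designed to detect.
    return radar and camera

# Hypothetical scenarios: (radar detects obstacle, camera detects obstacle).
scenarios = {
    "car ahead, rear visible": (True, True),
    "overhead road sign": (True, False),
    "crossing white trailer": (False, False),
}

for name, (radar, camera) in scenarios.items():
    print(f"{name:25s} -> emergency braking: {brake_decision(radar, camera)}")

In the last scenario the vision system misses the side of the trailer (it was trained on vehicle rears) and the radar filters out the crossing target, so the AND gate never fires and no braking is triggered.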

4.2.5 Other Bugs in This Field of Application

There are a variety of other software bugs related to cars. The following incomplete list is in reverse chronological order.
• The digitization of cars also leads to security vulnerabilities. In 2016, Tesla had to fix a software bug that allowed hackers to control the car remotely. Similar Big Data–related privacy problems will arise when companies forge ahead with car-to-car communication to spread information about traffic jams or collect data on performance and vehicle handling (cf. [13, 17]).


• In gas-electric hybrid vehicles, the gasoline engine is mainly active at high speed, while it shuts down at low speed. Toyota's Prius hybrid has suffered from different software-related problems over the years. The first glitch is related to the models from 2004 to 2005: due to a "programming error," the hybrid cars stalled or shut down (see [7, 36]). A second problem occurred in 2015 (cf. [21]): a software glitch in the Prius hybrid could cause higher temperatures, overheating, and damage to the transistors controlling the hybrid engine. Under certain circumstances, this damage could force the car to slow down or even shut down the hybrid system. A third problem, reported in 2010, was related to the braking system. Brakes in hybrid vehicles are more complex than in regular cars since they involve recharging the batteries by using the electric motors as generators. A delay of less than 1 second occurred in the antilock brake system (ABS). For a Prius traveling at 60 mph, this would result in another 90 feet of traveled distance before the brakes start to be effective (see [8]; a short calculation after this list illustrates the figure).
• Toyota was confronted with many cases of unintended acceleration of cars. These problems led to at least one fatal accident involving a Toyota Camry in Oklahoma in 2007. In the investigations, the problems were traced back to the engine control module firmware (firmware is software providing low-level control for a device's specific hardware). The expert M. Barr testified in court that the source code for the electronic throttle control system contained bugs that could cause unintended acceleration and was of "unreasonable quality" (see [10]). He described the code as "spaghetti" (cf. [10, 13]). The jury awarded $3 million to the two victims (see [13, 22]). In the settlement of a separate class action suit related to the unintended acceleration, Toyota agreed to pay $1.1 billion (cf. [22]).
• In 2002, all the stability systems in BMW series 5 cars were disabled after a hard braking maneuver exceeding a certain amount of pressure on the pedal. The disabled systems comprised the dynamic stability control and ABS. The only way to recover was to keep the ignition switched off for 5 seconds (see [24]).
• Around 1985, black cars left the General Motors assembly hall in Michigan without windshields. The industrial robots were not able to recognize the black cars (see [14]) but had no problems with other colors.
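The 90-feet figure quoted for the delayed Prius ABS can be checked with a few lines of elementary kinematics. This is only a rough sketch: the one-second delay and the 60 mph are the values from the text, and the car is assumed not to decelerate at all during the delay.

# Extra distance traveled while the antilock brake system does not yet respond.
MPH_TO_FT_PER_S = 5280 / 3600   # 1 mile = 5280 ft, 1 hour = 3600 s

speed_mph = 60.0
delay_s = 1.0                    # "a delay of less than 1 second"

speed_ft_per_s = speed_mph * MPH_TO_FT_PER_S
extra_distance_ft = speed_ft_per_s * delay_s

print(f"{speed_mph:.0f} mph = {speed_ft_per_s:.0f} ft/s")
print(f"extra distance during a {delay_s:.1f} s delay: {extra_distance_ft:.0f} ft")
# about 88 ft, consistent with the roughly 90 feet cited in [8]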

Bibliography

[1] S. Alvarez. Tesla Model S firetruck crash in California: What we know so far. date accessed: 2018-02-23, 2018. https://www.teslarati.com/tesla-models-firetruck-crash-details/

[2] Autoweek. Tesla blames driver Walter Huang in fatal March 2018 crash of Model X on Autopilot in San Francisco Bay Area. date accessed: 2018-07-31, 2018. http://autoweek.com/article/autonomous-cars/tesla-blamesdriver-fatal-crash-involving-autopilot

[3] Autoweek. Tesla Driver traces route of fatal model X crash, almost crashes in the same spot. date accessed: 2018-07-31, 2018. http://autoweek.com/ article/autonomous-cars/tesla-driver-almost-crashes-replicating-routefatal-crash-autopilot



[4] N. E. Boudette. Fatal Tesla Crash Raises New Questions about Autopilot System. The New York Times, Mar. 31, 2018. https://www.nytimes.com/2018/03/31/ business/tesla-crash-autopilot-musk.html

[5] K. Burke. Tesla driver in fatal crash wasn’t watching video, witness tells investigators. date accessed: 2018-08-08, 2017. http://www.autonews.com/article/ 20170619/MOBILITY/170619733/

[6] CBC News. Tesla Autopilot limitations played big role in fatal crash. date accessed: 2018-03-23, 2018. http://www.cbc.ca/news/technology/teslaautopilot-crash-ntsb-1.4287026

[7] CNN. Prius hybrids dogged by software. date accessed: 2018-02-24, 2005. http: //money.cnn.com/2005/05/16/Autos/prius_computer/

[8] CNN. Toyota blames software for Prius braking woes. date accessed: 2018-07-31, 2010. http://money.cnn.com/2010/02/04/autos/Toyota_Prius_complaints/

[9] A. Davies. Google's Self-Driving Car Caused Its First Crash. Wired, pages 1–2, 2016

[10] M. Dunn. Toyota's killer firmware: Bad design and its consequences. date accessed: 2017-07-06, 2013. http://www.edn.com/design/automotive/4423428/ Toyota-s-killer-firmware--Bad-design-and-its-consequences

[11] A. Efrati. Uber finds deadly accident likely caused by software set to ignore objects on road. date accessed: 2018-07-31, 2018. https://www.theinformation. com/articles/uber-finds-deadly-accident-likely-caused-by-softwareset-to-ignore-objects-on-road

[12] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song. Robust Physical-World Attacks on Deep Learning Models. arXiv, 2018

[13] D. Gelles, H. Tabuchi, and M. Dolan. Complex Car Software becomes the Weak Spot under the Hood. The New York Times, date accessed: 2018-03-07, 2015. https://www.nytimes.com/2015/09/27/business/complex-carsoftware-becomes-the-weak-spot-under-the-hood.html

[14] I. Giese. Software reliability. date accessed: 2018-02-26, 2002. http://webdocs.gsi.de/~giese/swr/allehtml.html

[15] J. M. Gitlin. Tesla and Mobileye call it quits; will the car company build its own sensors? date accessed: 2018-02-24, 2016. https://arstechnica.com/ cars/2016/07/tesla-and-mobileye-call-it-quits-will-the-car-companybuild-its-own-sensors/

[16] J. Golson. Tesla and Mobileye disagree on lack of emergency braking in deadly Autopilot crash. date accessed: 2017-07-11, 2016. https://www.theverge. com/2016/7/1/12085218/tesla-autopilot-crash-investigation-radarautomatic-emergency-braking

[17] A. Greenberg. Tesla Responds to Chinese Hack with a Major Security Upgrade. Wired, pages 2016–2019, 2016


[18] T. Griggs and D. Wakabayashi. How a self-driving Uber killed a pedestrian in Arizona. The New York Times, Mar. 20, 2018. https://www.nytimes.com/ interactive/2018/03/20/us/self-driving-uber-pedestrian-killed.html

[19] M. Harris. Google reports self-driving car mistakes: 272 failures and 13 near misses. The Guardian, Jan. 12, 2016. https://www.theguardian.com/ technology/2016/jan/12/google-self-driving-cars-mistakes-data-reports

[20] L. L. M. Höfer. Mensch denkt, Chip lenkt. c't, May 2000

[21] J. Hsu. Toyota recalls 1.9 million Prius hybrids over software flaw. date accessed: 2017-07-12, 2014. https://spectrum.ieee.org/riskfactor/computing/it/ toyota-recalls-19-million-prius-hybrids-over-software-flaw

[22] C. Isidore. Toyota settles unintended acceleration case after $3 million jury verdict. date accessed: 2018-08-10, 2013. https://money.cnn.com/2013/10/25/ news/companies/toyota-crash-verdict/index.html

[23] F. Lambert. Tesla Autopilot partner Mobileye comments on fatal crash, says tech isn’t meant to avoid this type of accident. date accessed: 2018-01-04, 2016. https://electrek.co/2016/07/01/tesla-autopilot-mobileye-fatalcrash-comment/

[24] S. Lesser. BMW series 5 disables Dynamic Stability Control and ABS. date accessed: 2018-03-07, 2004. https://catless.ncl.ac.uk/Risks/23/60#subj3

[25] S. Leue. Baby death due to software-controlled air bag deactivation? date accessed: 2018-03-07, 1999. https://catless.ncl.ac.uk/Risks/20/28#subj8.1

[26] S. Levin and N. Woolf. Tesla driver killed while using autopilot was watching Harry Potter, witness says. The Guardian, July 1, 2016. https://www. theguardian.com/technology/2016/jul/01/tesla-driver-killed-autopilotself-driving-car-harry-potter

[27] S. McConnell. Code Complete. DV-Professional Series. Microsoft Press, 2004

[28] D. Millward. Driver of Tesla killed in first fatal crash while in autopilot mode. The Telegraph UK (July):11–12, 2016. http://www.telegraph.co.uk/news/ 2016/07/01/driver-of-tesla-killed-when-electric-car-crashes-whilein-autopi/

[29] NTSB. Collision between a Car Operating with Automated Vehicle Control Systems and a Tractor-Semitrailer Truck. Technical report, NTSB, 2017. https: //www.ntsb.gov/investigations/AccidentReports/Reports/HAR1702.pdf

[30] NTSB. Driver Errors, Overreliance on Automation, Lack of Safeguards, Led to Fatal Tesla Crash. Press release, NTSB, 2017. https://www.ntsb.gov/news/ press-releases/Pages/PR20170912.aspx

[31] NTSB. Preliminary report - Highway HWY18MH010. date accessed: 2018-07-31, 2018. https://www.ntsb.gov/investigations/AccidentReports/Pages/ HWY18MH010-prelim.aspx

[32] NTSB. Preliminary report: Crash and post-crash fire of electric-powered passenger vehicle. date accessed: 2018-08-10, 2018. https://www.ntsb.gov/ investigations/AccidentReports/Pages/HWY18FH011-preliminary.aspx


[33] A. Ohnsman. LiDAR maker Velodyne “baffled” by self-driving Uber’s failure to avoid pedestrian. date accessed: 2018-07-31, 2018. https://www.forbes.com/ sites/alanohnsman/2018/03/23/lidar-maker-velodyne-baffled-by-selfdriving-ubers-failure-to-avoid-pedestrian/

[34] Reuters. Tesla’s Elon Musk sent mixed messages about Autopilot’s capabilities. date accessed: 2017-07-06, 2016. http://www.cbc.ca/news/technology/teslaautopilot-1.3664841

[35] B. Rubel. Mobileye - Trennt sich der Zulieferer aus Imagegründen von Tesla? date accessed: 2017-07-11, 2016. https://www.mobilegeeks.de/artikel/teslamobileye-trennung/

[36] E. Slonim. Prius cars shutdown at speed. date accessed: 2018-03-07, 2005. http: //catless.ncl.ac.uk/m/risks/23/87#subj1.1

[37] O. Solon. Tesla that crashed into police car was in “autopilot” mode, California official says. The Guardian, May 30, 2018. https://www.theguardian.com/ technology/2018/may/29/tesla-crash-autopilot-california-police-car

[38] Spiegel Online. Uber: Standardsoftware des Volvo hätte Fußgängerin entdeckt. date accessed: 2018-07-31, 2018. http://www.spiegel.de/auto/aktuell/uberstandardsoftware-des-volvo-haette-fussgaengerin-entdeckt-a1200349.html

[39] D. Stevenson. A Critical Look at Quality in Large-Scale Simulations. Computing in Science & Engineering, 1(3):53–63, 1999

[40] J. Stewart. Why Tesla's Autopilot can't see a stopped firetruck. date accessed: 2018-02-22, 2018. https://www.wired.com/story/tesla-autopilotwhy-crash-radar/

[41] J. Stilgoe. Tesla crash report blames human error—this is a missed opportunity. The Guardian, Jan. 21, 2017. https://www.theguardian.com/science/ political-science/2017/jan/21/tesla-crash-report-blames-human-errorthis-is-a-missed-opportunity

[42] SWR 2. Alles ist Code—und Code ist nichts. date accessed: 2018-03-07, 2007. https://www.swr.de/swr2/programm/sendungen/wissen/swr2-wissen-allesist-code-und-code-ist-nichts/-/id=660374/did=1998718/nid=660374/ 4m8qw5/index.html

[43] The Economist. Tech.View: Cars and software bugs. date accessed: 2018-03-07, may 2010. https://www.economist.com/blogs/babbage/2010/05/techview_ cars_and_software_bugs

[44] The New York Times. Inside the self-driving Tesla fatal accident, July 1, 2016. https://www.nytimes.com/interactive/2016/07/01/business/insidetesla-accident.html

[45] J. Torchinsky. Does Tesla’s Autopilot suffer from a dangerous blind spot? date accessed: 2017-07-10, 2016. https://jalopnik.com/does-teslas-semiautonomous-driving-system-suffer-from-1782935594


[46] C. Urmson. Report of traffic accident involving an autonomous vehicle. date accessed: 2018-03-07, 2016. https://www.dmv.ca.gov/portal/wcm/connect/ 3946fbb8-e04e-4d52-8f80-b33948df34b2/Google_021416.pdf?MOD=AJPERES& CVID=

[47] C. von Wüst. Vom Retter erschlagen. Der Spiegel, 12, 1999. http://www. spiegel.de/spiegel/print/d-10246307.html

Chapter 5

Synchronization and Scheduling

In order to solve specific tasks in a distributed manner, communication between jobs and timing of jobs are necessary. In particular, time-dependent coordination is important when connecting different computers; otherwise, the interconnected computers cannot work together properly (similar to individual workers in a company if no meetings or consistent communication take place). For safety reasons, the Space Shuttle used five computers: four identical in software and hardware and one different in software. The four identical computers were running the same processes in parallel and allowed an immediate replacement of a failing computer. The fifth computer served as the backup that had to be able to take over control in the event of a consistent failure of the other systems. The mutual connection of these different computers depended upon very precise and timely ordering and coordination; small errors might prevent this synchronization. Such a failure occurred during the first launch of the Space Shuttle in 1981. Similarly, all jobs or tasks concurrently running on a computer have to be put in some meaningful order to guarantee the smooth succession of computations. Usually, there is a hierarchy of jobs and a waiting list that schedules the following tasks. Situations should be avoided where subroutines of the running code block other important jobs. A typical bad example in this context is the so-called priority inversion, where unimportant jobs are allowed to block major jobs, as in the case of the Mars Pathfinder rover in 1997.

5.1 Space Shuttle

Lost time is never found again. —Benjamin Franklin

5.1.1 Overview

• Field of application: space flight
• What happened: delay of launch
• Loss: —
• Type of failure: synchronization failure


(a) Test flight of Enterprise

(b) Launch of Columbia STS-1

Figure 5.1. (a) Test flight of the shuttle Enterprise on a Boeing 747 piggyback airplane. (b) Launch of Columbia (STS-1) on April 12, 1981, at NASA’s Kennedy Space Center in Florida. Images courtesy of NASA.

5.1.2 Introduction

Plans for developing the Space Shuttle began in the mid-1960s as a follow-up to the successful Apollo program (see [11, 20] for detailed facts on the shuttle program). The goal of the new spacecraft, officially named the Space Transportation System (STS), was to reduce costs by incorporating reusable components. This led to a three-component unit with an aircraft, two reusable solid rocket boosters, and a nonreusable external tank filled with liquid oxygen and hydrogen. This design was finally established by NASA in March 1972. The shuttle could transport up to 8 persons and a payload of up to 24,500 kg, depending on the target orbit between 185 and 643 km. The maximal total weight was limited to 109,000 kg, the length was 37.24 m, and the wingspan was 23.79 m. Six shuttle spacecraft (also called orbiter vehicles) were constructed, which were named after important ships in the context of science and exploration (see [11]):
• Enterprise, which never flew in space but was used for testing of landing and preparation tasks.
• Columbia, with 28 missions in 1981–2003.
• Challenger, with 10 missions in 1983–1986.
• Discovery, with 39 missions in 1984–2011.
• Atlantis, with 33 missions in 1985–2011.
• Endeavour, with 25 missions in 1992–2011.
In total, there were 135 missions transporting satellites, carrying science labs such as Spacelab and Spacehab, supporting the International Space Station (ISS), and conducting many scientific tests. The first flight took place on August 12, 1977, with the shuttle Enterprise on a modified Boeing 747 piggyback airplane. The shuttle Columbia realized the first successful liftoff on April 12, 1981, as flight STS-1. Two major accidents afflicted the program: On January 28, 1986, the Challenger (STS-51-L) exploded due to O-ring gaskets becoming less resilient at low temperatures on the ground.


(a) Landing of Columbia STS-1

(b) Landing of Atlantis STS-117

Figure 5.2. (a) Space Shuttle Columbia (STS-1) landing at Edwards Air Force Base, April 14, 1981. (b) Space Shuttle Atlantis (STS-117) landing with a chute to slow down the spacecraft. Images courtesy of NASA.

On February 1, 2003, the Columbia shuttle broke apart during reentry on its 28th mission (STS-107) because its thermal protection had been damaged at liftoff. The shuttle program ended in July 2011 with the last flight of Atlantis, mainly in view of costs that turned out much higher than originally intended: the total costs of the shuttle program amount to about $133.7 billion according to [11]; the costs for the shuttles' software were originally estimated at $20 million but had reached $324 million by 1991, and maintaining the on-board software alone was estimated at $100 million per year in 1992 (see [8], which is a preprint of Chapter 6 in [6]). Successor programs are the Dragon spacecraft developed by Elon Musk's company SpaceX, Dream Chaser by the Sierra Nevada Corporation, and CST-100 by Boeing (cf. [21]).

5.1.3 Timeline of the Delay of the First Shuttle Launch

The first flight of the shuttle Columbia was scheduled for April 10, 1981, as STS-1. This launch was very important for NASA, not only in view of the money invested but also because this was the first major project after more than ten years and the beginning of a new era of astronautics. The event was broadcast all over the world from the John F. Kennedy Space Center. Unfortunately, 18 minutes before liftoff the countdown had to be stopped (see [7]), and the start had to be delayed indefinitely. The delay was caused by the failure to initialize the software system running on the five redundant computers. NASA experts were able to trace the error to the initialization of the computers and were therefore sure that the error could not occur during the flight. The launch was rescheduled for April 12, only two days later, without knowing the actual underlying cause of the problem. The following liftoff was successful and smooth, and the mission was accomplished two days later on April 14, 1981. In total, the software bug did not lead to an immediate loss but did have a negative effect on NASA's reputation.

5.1.4 Detailed Description of the Bug

This section is mainly based on [3, 17]. NASA's paramount goals in developing the control system for the Space Shuttle were safety and reliability. Therefore, all technical equipment such as sensors, computers, buses, and energy supplies was provided


in quadruplicate following the slogan "Fail Operational–Fail Safe." Errors in service should avoid failure by relying on backup systems that were always able to take over in case of an erroneous device. Therefore, four identical general purpose computers (GPCs) were used with identical software during critical stages of the mission. The number four had been chosen because in case of one failing unit there were still three operational ones: Normal behavior with a clear majority voting76 resulting in a unique output was then still possible ("fail operational"). In the event of a second failure taking place afterwards, the existing redundancy allowed for a safe return of the vessel ("fail safe"). The four computers were controlled by the Primary Avionics Software System (PASS) developed by IBM (see [4] for technical details). PASS was an asynchronous and priority-driven system where tasks were scheduled according to their priority (see also Excursion 14 on page 137). Problems related to erroneous software instead of hardware, however, could not be avoided by this setup—similar to the case of Ariane, as discussed in Sec. 2.1. Therefore, in 1976 NASA decided to add a fifth computer77 identical in hardware but running an independently developed operating system and application software. The so-called Backup Flight Control System (BFS) developed by Rockwell International had to run in the background and could take over from PASS in the event of a failure (without the possibility to actually switch back). To keep track of all activities and to be able to take over control from the potentially failing four PASS computers, the BFS had to listen to all input and some output data traffic from the four-computer system (cf. [14]) as long as this data seemed to be correct and not polluted by any failure. Rockwell and IBM heavily discussed the general design of the two systems but ended up using two different variants: PASS kept its asynchronous design, while the BFS was implemented as a fully synchronous system relying on fixed predefined cycles (see [8]). In order to connect the two different systems, the PASS had to at least look like a synchronous system regarding the data exchange. To accomplish this without major changes, a cycle counter process (CCP) with high priority was introduced in PASS, and the scheduling of all processes had to be adapted and arranged relative to the CCP. The failure of STS-1 was related to a wrong absolute starting time of the CCP. At the initialization of the first of the four GPCs to be switched on, denoted by GPC1 in the following, this computer obtained the current absolute time t0GPC = t2cc by sampling the central system clock (denoted by cc in Fig. 5.3). It then obtained timing data from the already activated telemetry and computed a future starting time t1GPC for the PASS cycles with length δGPC = 40 ms, which was in sync with the larger telemetry cycles δt.78 This was advantageous for reading out postprocessing data after the mission and was achieved by obtaining timing information directly from the telemetry that initially sampled the same central system clock at t0t. In this original design, the timelines of the telemetry and GPC1 were fully consistent since both relied on the same central clock. Even the BFS was synchronized with this starting time and the corresponding CCP and thus knew when (i.e., in which cycle) to expect what behavior while listening to the PASS.

76 In the event of failure, it is easy to detect a discrepancy in computed results, but it is more complex to decide which output is the correct one. Voting solves this problem to a certain extent.
77 Note that this computer is also denoted by GPC in [3, 17]. In the following text, however, we will only denote the four computers running the PASS software as GPCs.
78 Note that the literature—in particular [3] and [17]—suggests a telemetry cycle length of δt = 1 s but never explicitly mentions this value.
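As a minimal, generic illustration of the majority voting used among the redundant GPCs (a sketch only, not the actual PASS redundancy-management code):

from collections import Counter

def vote(outputs):
    # Return the strict-majority value among redundant computer outputs,
    # or None if no strict majority exists.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

print(vote([42, 42, 42, 42]))   # 42: all four GPCs agree
print(vote([42, 42, 42, 17]))   # 42: one faulty unit is outvoted ("fail operational")
print(vote([42, 42, 17, 17]))   # None: after a second failure no strict majority may be left

After one failure a clear majority remains and operation can continue; after a second, independent failure the vote can become ambiguous, which is why at that point only a safe return ("fail safe") is promised.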
Due to the redundancy in the GPCs, an additional problem had to be solved: The four GPCs had to be exactly in sync to ensure correct behavior for voting.


(a) originally intended, bug-free situation

(b) buggy setup leading to inconsistent absolute times

Figure 5.3. Survey of timelines for (a) the originally intended situation and (b) the buggy scenario. The top-most timeline is the one of the central clock (cc), which was sampled by the telemetry processes (t, middle timeline) and consistently by the first started GPC (timeline at the bottom) in situation (a). In situation (b), the GPC wrongly obtained its absolute time (t¯0GPC instead of t0GPC) from the first entry in the OS timer queue, which consisted of the time stamp for the delay processing of the bus initialization process. The telemetry and GPC processes were scheduled in cycles with periods of δt and δGPC, respectively. The starting time t1GPC for the GPC process cycles was computed using telemetry data to be phase-synchronous with telemetry for postprocessing purposes.

Sampling the central clock was no option since the sampling would have taken place at slightly different points in time, leading to small but still harmful differences in the absolute times of the four GPCs. IBM developers therefore decided to use a mechanism involving the operating system timer queue.79

79 Detailed information on synchronization in the context of the Space Shuttle program can be found in [15, 18, 19].


Figure 5.4. Problematic setup: Under specific conditions in the initialization of the first GPC, the erroneous absolute time value t¯0GPC (obtained from the bus initialization process scheduled in the OS timer queue; see Fig. 5.3(b)) was larger than the computed starting time t1GPC. Hence, the OS considered t1GPC to be buggy and shifted the scheduled time stamp of the first execution of the CCP by one cycle to the right. Compared to the actual absolute time of the telemetry, all processes using this info (via the CCP) were one cycle late.

Since after the initialization phase many different processes had to be scheduled each cycle, the time stamp of the next process in the OS timer queue was always only slightly in the future and thus a good estimate of the actual current time. Therefore, the four GPCs used this data structure to be bitwisely exactly in sync. Since the queue was empty at the initialization of GPC1, showing a specific initial fill pattern, the GPCs used a check on this pattern as a test for "am I the first GPC being initialized." If this was the case, the GPC was GPC1 and sampled the clock (as indicated in Fig. 5.3(a)); afterwards, it calculated the starting time t1GPC and started the CCP. Otherwise, the GPC was the second to fourth at initialization and used the time stamp of the first process scheduled in the shared OS timer queue. Presumably, in both situations, the respective GPC computed its starting time (synchronized with telemetry) via information from the telemetry. The bug now arose due to two modifications of this original design. First, in 1979, the code for a bus initialization prior to computing the start time of the GPC had a delay time of 50 ms incorporated for the remainder of its operation, which interfered with the GPC's calculation of the start time. When GPC1 checked the fill pattern of the OS timer queue to determine whether or not it actually was the first GPC to get started, the OS timer queue was no longer empty but contained a time stamp for the remainder of the bus initialization subroutine. Therefore, the actual GPC1 erroneously assumed it was one of GPCs 2 to 4 and thus took its absolute timing information t¯0GPC from the bus initialization entry of the OS timer queue (see Fig. 5.3(b)). This modification of the PASS software was not consistent with respect to the absolute time values (t¯0GPC was a little too far in the future) but never showed any erroneous behavior since the computed start time t1GPC was still even further in the future and the correct first GPC cycle was hit for process execution (see the gray bar at the top of Fig. 5.4 for a visualization). A second change to the bus initialization, realized in 1980, finally led to the problem of STS-1. The developers changed the delay time in the bus initialization implementation from 50 ms to 80 ms since the bus initialization was used in other parts of PASS for reconfigurations where temporary CPU overload in certain flight conditions had to be avoided. Under certain conditions, however, the new time stamp t¯0GPC of the bus initialization delay execution, lying even further in the future and erroneously used as the actual time value of GPC1, was now actually located to the right of the computed start time t1GPC on the timeline at the point when computing the start time and starting up the


CCP (cf. Fig. 5.4). Therefore, GPC1's operating system assumed that the CCP process was meant to be started in the past (with time stamp t1GPC) because its current absolute time value was t¯0GPC > t1GPC. In such a situation, the operating system simply shifted the CCP to be scheduled one 40-ms cycle into the future (see Fig. 5.4). This resulted in many processes obtaining timing information via the OS timer queue and the CCP being scheduled one cycle late (see [3, 10]). Only a few processes related to telemetry (an example is the "polling" or "uplink" process mentioned in [3] and [17], respectively) used the correct absolute time information but consequently seemed to be one cycle too early. Data from those correctly timed processes were regarded as noise by the BFS since the BFS itself relied on the delayed cycle counter for determining the scheduled periods of data transfer. Therefore, the BFS never really started to "listen" to the PASS in the initialization phase. Note that the bug showed up only in very few situations. Concerning this point, the literature is not sufficient to fully analyze the situation; the following paragraph represents our interpretation based on available information and a couple of assumptions. In [17], it is stated that "After the second change, we had essentially a 1 in 67 chance of it happening each time we turned on the first GPC of the set. We had opened a 15-millisecond window within each second when there could be another process on the timer queue when the uplink process was being phase-scheduled."

We interpret this in the following way: To be in phase with the telemetry data, the synchronization had to compute a start time t1GPC with respect to cycle lengths of 1 second. Depending on the absolute starting time of the first GPC and the actual time at which the bus initialization process placed its delay time stamp in the OS timer queue, different (i.e., 67) uniformly distributed cases could arise, of which only one led to the problem of the assumed actual time lying to the right of (i.e., after) the calculated starting time of the CCP. One second divided into 67 equidistant time intervals indeed results in a window size of about 15 ms for the problematic case.
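The mechanism can be mimicked with a few lines of Python. This is only a sketch of the interpretation above, using hypothetical absolute times; it is not the actual flight software.

CYCLE = 0.040   # length of a PASS process cycle (40 ms)

def schedule_ccp(computed_start, assumed_now):
    # OS rule as described above: a start time that appears to lie in the past
    # is pushed one 40-ms cycle into the future, so the CCP ends up one cycle late.
    if computed_start < assumed_now:
        return computed_start + CYCLE, True
    return computed_start, False

# Hypothetical numbers: GPC1 computes a telemetry-synchronous start time of
# t1 = 100.000 s, and the bus initialization is queued at a true time of 99.945 s.
t1_gpc = 100.000
t_true = 99.945

for delay in (0.050, 0.080):      # bus-initialization delay before and after the 1980 change
    assumed_now = t_true + delay  # queue time stamp wrongly taken as the current time
    start, late = schedule_ccp(t1_gpc, assumed_now)
    print(f"delay {delay * 1000:.0f} ms: assumed now {assumed_now:.3f} s, "
          f"CCP start {start:.3f} s, one cycle late: {late}")

print(f"1 s / 67 cases = {1000 / 67:.1f} ms window per second")

With the original 50-ms delay the wrongly assumed current time still lies before the computed start time and nothing happens; with 80 ms it lies after it, the start time is pushed back by one cycle, and the whole PASS timing ends up one cycle late relative to telemetry.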

5.1.5 Discussion

What contributed significantly to the problem of STS-1? The bug remained in Columbia's software until the planned liftoff of STS-1 for several reasons:

• The bug occurred only with a small probability (1 in 67) in the initialization of the PASS software. In many test cases, the bug would simply not show any negative effects.
• It was not obvious that modifying the bus initialization would have an effect on the initial setup of cycles of the software processes.
• Testing could only be done in a rather short time window between finalizing the software and the launch since the changes to the PASS to be compatible with the BFS had to be realized in the final stage of software development (see, e.g., [3]).
• The design and process of the shuttle's software were highly complex for the time of the events (cf., e.g., [2, 3]).


NASA Strategy: "Faster, Better, Cheaper"

The paramount goal of NASA's shuttle program was to reduce costs for a high number of launches at the price of limiting capabilities with respect to possible missions (close to Earth vs. reaching out to Mars, etc.). This philosophy was a major driving force for NASA's "Faster, Better, Cheaper" (FBC) program introduced in 1992 (see [16]). This comprised

• smaller missions;
• advanced technology;
• reducing NASA HQ management staff and also including more private companies;
• aggressive planning of future missions; and
• allowing failures: "It's OK to fail," according to NASA official D. Goldin.
The allowance of failures was justified by implementing a database of "lessons learned" that was intended to help avoid mistakes in future projects. The database focused on failed missions, not on successful ones. The FBC philosophy was challenged from the beginning and was considered responsible for a series of failures (MCO, see page 21; MPL, see page 141; Space Shuttle Columbia; cf. [5, 9, 12, 13]). The actual failure rate of FBC missions was 44%, compared to 30% for traditional missions, but FBC missions turned out to be 57% more cost-effective (see [5]). Hence, the faster, better, cheaper strategy turned out to be reduced to faster and cheaper, but not better. Especially disputed was the statement "It's OK to fail." Failure was permitted as long as the failures were analyzed and led to future improvement via "lessons learned." But according to [16], failures due to mistakes were not tolerable. Collecting mistakes in a database should cover all missions (including successful ones) and be public. Furthermore, there should be a clear metric for measuring FBC in terms of success rate, cost reduction, public approval, and scientific papers. In addition, FBC led to the outsourcing of jobs and tasks to contractors, which caused a knowledge drain, a lack of experienced staff, and also a reduction in mission safety. In hindsight, FBC seemed to have
• been too radical and ambitious;
• reduced the quality and safety of the projects;
• not properly included the "lessons learned" database; and
• not comprised metrics for measuring success and failure.
The Columbia tragedy also showed that FBC may not be a reasonable strategy for missions jeopardizing human safety (cf. [13]).

NASA's Need for Old Hardware

A second problem arose towards the end of the Space Shuttle program. Since the shuttles were in service for 30 years, computer technology had changed significantly over this long period of time. Because the needed hardware was outdated, NASA had difficulty finding replacement parts (see [1]). Intel 8086 processors of 1981, necessary for safety checks of the twin boosters, were especially hard to acquire.



Bibliography

[1] W. J. Broad. For Parts, NASA Boldly Goes ... on eBay. New York Times, May 12, 2002

[2] B. Evans. ‘You don't go fly when you got debts': 35 years since STS-1 inaugurated the space shuttle program (Part 1). date accessed: 2018-06-06, 2016. http://www.americaspace.com/2016/04/09/you-dont-go-fly-when-you-gotdebts-35-years-since-sts-1-inaugurated-the-space-shuttle-programpart-1/

[3] J. R. Garman. The "BUG" Heard 'Round the World: Discussion of the Software Problem which Delayed the First Shuttle Orbital Flight. ACM SIGSOFT Software Engineering Notes, 6(5):3–10, 1981

[4] C. J. Hickey, J. B. Loveall, J. K. Orr, and A. L. Klausman. The Legacy of Space Shuttle Flight Software. In AIAA SPACE Conference and Exposition 2011, 2011

[5] International Federation of Professional and Technical Engineers. IFPTE report on the effectiveness of NASA's workforce & contractor policies. date accessed: 2018-03-21, 2003. http://www.spaceref.com/news/viewsr.html?pid=10275

[6] R. D. Launius, J. Krige, and J. I. Craig. Space Shuttle Legacy. American Institute of Aeronautics and Astronautics, Inc., 2013

[7] R. D. Legler and F. V. Bennett. Space Shuttle Missions Summary. Technical Report September, NASA, 2011. https://www.jsc.nasa.gov/history/reference/TM2011-216142.pdf

[8] N. G. Leveson. Software and the Challenge of Flight Control. In Space Shuttle Legacy: How We Did It and What We Learned. AIAA, 2013

[9] A. MacCormack. Management lessons from Mars. date accessed: 2018-03-21, 2004. https://hbr.org/2004/05/management-lessons-from-mars

[10] NASA. STS-1 Orbiter Final Mission Report. Technical report, NASA, 1981

[11] NASA. Space Shuttle Era Facts. Technical report, NASA, 2012. https://www. nasa.gov/pdf/566250main_2011.07.05SHUTTLEERAFACTS.pdf

[12] P. G. Neumann. Faster, cheaper *not* better. date accessed: 2018-03-22, 2000. https://catless.ncl.ac.uk/Risks/20.86.html#subj2

[13] L. J. Paxton. "Faster, Better, and Cheaper" at NASA: Lessons Learned in Managing and Accepting Risk. Acta Astronautica, 61(10):954–963, 2007

[14] K. M. Rusnack and J. R. Garman. NASA Johnson Space Center oral history project: Edited oral history transcript. date accessed: 2018-06-14, 2001. https://www.jsc.nasa.gov/history/oral_histories/GarmanJR/ GarmanJR_3-27-01.htm

[15] J. R. Sklaroff. Redundancy Management Technique for Space Shuttle Computers. IBM Journal of Research and Development, 20(1):20–28, 1976

[16] T. Spear. NASA Faster, Better, Cheaper Task Final Report. Technical report, NASA, 2000. https://mars.nasa.gov/msp98/misc/fbctask.pdf


[17] A. Spector and D. Gifford. The Space Shuttle Primary Computer System. Communications of the ACM, 27(9):872–900, 1984

[18] J. E. Tomayko. Computer Synchronization and Redundancy Management. In Computers in Spaceflight: The NASA Experience, NASA Contractor Report, 1988. https://archive.org/details/nasa_techdoc_19880069935

[19] J. E. Tomayko. Computers in Spaceflight: The NASA Experience. NASA, 1988

[20] Wikipedia. Space shuttle. date accessed: 2018-03-22. https://en.wikipedia. org/wiki/Space_Shuttle

[21] Wikipedia. Space shuttle retirement. date accessed: 2018-05-07. https://en. wikipedia.org/wiki/Space_Shuttle_retirement

5.2 Mars Pathfinder

Things which matter most must never be at the mercy of things which matter least. —Johann Wolfgang von Goethe

5.2.1 Overview

• Field of application: space flight
• What happened: unwanted rebooting of control computer
• Loss: For short periods of time, Pathfinder was not operational, delaying the collection of scientific data
• Type of failure: priority inversion

5.2.2 Introduction

Due to its proximity to Earth, Mars has historically attracted a lot of attention, in particular in the disciplines of astronomy and space flight (for a survey on Mars exploration, see [6, 7]). The first useful information on Mars had been obtained by telescopic observations, leading to assumptions of vegetation (because of color changes) and water or intelligent beings (because of features looking like canals) made by Schiaparelli in 1877. In the same year, the Martian moons Phobos and Deimos were discovered. The rotation period, the icy polar caps, and the huge Olympus Mons—the largest mountain in the solar system—had also been observed on the red planet by telescope. At the beginning of the space flight era, space probes were also sent to Mars, but failures were frequent. In the period of 1960–1962, the first Russian missions ended prematurely by launch or spacecraft failure. Following the failure of NASA's Mariner 3 mission, Mariner 4 was the first probe successfully passing Mars. The first successful orbiter was Mariner 9 in 1971, which worked properly for 516 days. The first landing on Mars was achieved by the Russian probe Mars 3 lander in 1971, but contact was lost 14.5 seconds after the start of transmissions. The American Viking 1 lander touched down successfully in 1976 and was operating for 2245 sols (one sol is one solar day on Mars and is equivalent to 24.6 hours of Earth time). NASA's Mars Pathfinder mission comprised a



stationary lander and a mobile rover, which were launched on December 4, 1996, and successfully landed on Mars on July 4, 1997, to continue operation for about 85 days. Meanwhile, realistic plans exist for a detailed exploration (cf. [14]) and colonization of Mars (in 2011, Elon Musk proposed the development of the "Mars Colonial Transporter" project, now called the "Interplanetary Transport System," which hopes to be able to send human beings to Mars in 10–20 years; see [22]).

5.2.3 Description of the Mars Pathfinder

The Mars Pathfinder project (see [15, 16, 28]) was managed by NASA's Jet Propulsion Laboratory (JPL) at the California Institute of Technology. The landing on July 4, 1997, involved a parachute, a rocket braking system, and an airbag system inflating airbags before impact and retracting them after landing. The lander released the first wheeled robotic rover on Mars, called Sojourner. The primary goal of the mission was to test key technologies and concepts for future missions to Mars. Furthermore, investigations of the planet's atmosphere, meteorology, and geology, and of the composition of Martian rocks and soil, were performed. Images of Mars' surface were also taken. Pathfinder's on-board computer, which controlled both the flight phase of the probe and all operations of the lander and Sojourner on Mars, was based on an IBM RS/6000 processor adapted to space requirements by Lockheed Martin and was running the operating system VxWorks RTOS of WindRiver (see [2]). This custom-tailored operating system is often used in small devices like embedded systems in aerospace, defense, and industrial robots, in particular when real-time and deterministic performance is required. The lander unit possessed two instruments for scientific investigation: a stereoscopic camera with spatial filters on an expandable pole (the Imager for Mars Pathfinder (IMP)) and the Atmospheric Structure Instrument/Meteorologic Package (ASI/MET) collecting data on pressure, temperature, and winds (cf. [16]). Additionally, the rover conducted different experiments with an alpha-proton X-ray spectrometer (APXS) and took photos with three cameras. The rover was the size of a baby stroller: it weighed 16 kg and was 65 cm long, 48 cm wide, and 30 cm tall. It had six wheels and could move autonomously at a speed of up to 0.6 meters per minute within a radius of 10 m around the landing site. The power was supplied by solar panels. The Pathfinder lander and Sojourner worked successfully for about three months. Instead of the originally planned 7 sols for the rover, both were functional for 83 sols, up until September 27, when contact with the base station was lost. The rover traveled a distance of over 100 m to analyze various rocks. Furthermore, Pathfinder was collecting meteorological data and sending pictures to Earth. In the first month of the mission, about 1.2 gigabits of data were sent to Earth, including more than 1,000 images and about 4 million measurements of temperature, pressure, and wind (see [29]).
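A quick consistency check of the durations quoted in this section and in the introduction, using the sol length of 24.6 hours given above (values taken from the text):

# Convert the operational time of lander and rover from sols to Earth days.
HOURS_PER_SOL = 24.6
HOURS_PER_DAY = 24.0

sols = 83
earth_days = sols * HOURS_PER_SOL / HOURS_PER_DAY
print(f"{sols} sols = {earth_days:.1f} Earth days")   # about 85 days, as stated above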

5.2.4 Timeline

In general, the Pathfinder/Sojourner mission is considered a great success due to the smooth landing, the long runtime of the mission, and the large amount of transmitted scientific data. Starting the first day after landing, however, Pathfinder's operating system experienced unexpected and repeated reboots (according to [8, 19], a few similar reboots had occurred in the preflight testing phase at JPL but were rare, hard to reproduce, and considered of much lower priority compared to other software issues).


Figure 5.5. Schematic picture of Pathfinder and its parts after landing. Courtesy of NASA.

Figure 5.6. Sketch of the Pathfinder Sojourner. Courtesy of NASA.

A sequence of press releases by JPL regarding the events can be found in [13]. The reboots occurred when collecting and transmitting meteorological data in the lander's ASI/MET module. As soon as the engineers at JPL recognized the problem, they started looking for the underlying cause by doing tests on Earth with an exact replica (in hardware and software) under the very same conditions. Fortunately, the software had several debug features, in particular a trace/log facility. This feature remained in the final version implemented on the Mars lander and rover because the engineers at JPL believed in the motto "test what you fly and fly what you test." In about 18 hours, they managed


to reproduce the problem and soon after identified it as a classical case of priority inversion (see [19]). Priority inversion describes the situation in which, in a job-scheduling scenario, a high-priority task is indirectly preempted by a lower-priority task, inverting the original hierarchy (see Excursion 14). By July 21, the problem was solved (cf. [13]). To correct the error, a patching mechanism was used (cf. [4, 19]): the differences in the source code between the fixed replica on Earth and the actual Pathfinder lander were sent to Mars, and the on-board copy of the software was modified by specific patching software to incorporate the changes, removing the priority inversion. Concerning the losses related to the bug, the mission team was lucky:
• The problem only occurred four times between July 5 and 14, 1997 (see [13]).
• Since no data that had already been collected were lost from the main memory, the unexpected reboot problem only delayed further scientific experiments to the next day (cf. [19]).
• Pathfinder operated much longer than originally expected and was able to carry out a lot more experiments (see [7]).
Therefore, the loss due to the bug was minimal and consisted of less than four days of lost operational time for the lander and Sojourner modules.

5.2.5 Detailed Description of the Bug

For this section, the main sources are [4, 8, 19, 29]. In particular, [19] represents a detailed summary of information by G. Reeves, the leader of the software team for the Mars Pathfinder spacecraft at JPL at the time of the events. To understand the details of the bug, basic knowledge of the concepts of priority and semaphores in the context of concurrent computing is helpful; these concepts are summarized in Excursion 14. Readers familiar with them may directly continue reading on page 138.

Excursion 14: Priority and Semaphores in Concurrent Computing

In concurrent computing, different tasks have to be executed on the same resource, such as a processor. Therefore, tasks are ordered by a specific assigned priority according to their importance. The idea of such a hierarchy of priorities is to enable a scheduling mechanism that ensures the execution of tasks with higher priority before lower-priority tasks are run or completed.a Furthermore, tasks can have shared access to different resources such as memory or I/O devices. Such shared access needs to be carefully controlled to avoid problems like two different tasks writing simultaneously into the same memory location. To prevent such problems (also called race conditions; see also Excursion 22 on page 180) in accessing a resource, a semaphore or mutex (mutual exclusion) can be used. The semaphore or mutex can lock certain common resources for exclusive use by a specific task and free them when this task has finished its use, granting access to the next task. Semaphores are special variables or data types to control access to shared resources. A mutex is a binary semaphore (i.e., with two states) where only the locking instance is allowed to unlock a resource again.b Frequently, an ordered waiting list is used together with a semaphore or mutex to keep track of all the tasks that want to use the common resource and to provide synchronization of the waiting tasks. To summarize, there are two different mechanisms to control the sequence of execution of processes: the hierarchical priority of different tasks influencing their


scheduling over time, and the locking mechanism of semaphores to allow exclusive use of certain resources until a task is finished. Priority and semaphores may contradict each other in certain situations in the context of dynamic scheduling. A running task with lower priority (task A) may be preempted to allow the execution of a higher-priority task (task B). If both tasks, however, happen to rely on the use of the same resource, the semaphore mechanism may result in a priority inversion preventing task B from running until task A has terminated (or at least has released the shared resource). To avoid priority inversion, priority inheritance can be used: If a low-priority task A blocks a common resource via a semaphore or mutex, forcing a task B of higher priority to stop and wait, task A inherits the higher priority of task B to allow for a fast execution of A with respect to other concurrent tasks. This ensures that the blocking of task B will not take too long. Further details on concurrent computing, priority, and related aspects can be found in [3, 24].
a The mechanism is similar to the treatment of patients in a hospital: in the event of an incoming patient with a life-threatening condition, the medical staff may interrupt or postpone the treatment of other patients in order to deal with the more urgent case. Later on, the staff will return to the temporarily deprioritized patients.
b In the context of railways, semaphores are represented by more complex railway signals that may show more information than just a "green" or "red" light. A mechanism similar to mutexes was used for safe access to single-track railroad sections used by trains going in both directions: tokens in the form of staffs or electric token blocks were used to guarantee that no more than one train is allowed to access the track at a time (see https://en.wikipedia.org/wiki/Token_%28railway_ signalling%29 for details).
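The interplay of priorities and a shared mutex described in this excursion can be made concrete with a small tick-based toy scheduler in Python. This is a deliberately simplified sketch with made-up task names and durations; it is neither VxWorks nor the Pathfinder software.

def simulate(priority_inheritance):
    # Three tasks share one CPU; larger "prio" means higher priority.
    # Phases of kind "lock" require the single shared mutex.
    tasks = {
        "low":    {"prio": 1, "release": 0, "phases": [("lock", 4)]},
        "medium": {"prio": 2, "release": 2, "phases": [("cpu", 5)]},
        "high":   {"prio": 3, "release": 1, "phases": [("lock", 2)]},
    }
    owner = None      # current holder of the mutex
    finish = {}

    for t in range(40):
        def eff_prio(name):
            # Effective priority: with inheritance, the mutex owner is boosted
            # to the priority of the highest released task waiting for the lock.
            p = tasks[name]["prio"]
            if priority_inheritance and name == owner:
                waiting = [tasks[n]["prio"] for n in tasks
                           if n != name and tasks[n]["release"] <= t
                           and tasks[n]["phases"]
                           and tasks[n]["phases"][0][0] == "lock"]
                p = max([p] + waiting)
            return p

        def runnable(name):
            task = tasks[name]
            if task["release"] > t or not task["phases"]:
                return False
            kind, _ = task["phases"][0]
            return kind != "lock" or owner in (None, name)

        ready = [n for n in tasks if runnable(n)]
        if not ready:
            continue
        run = max(ready, key=eff_prio)          # preemptive: highest priority wins
        kind, left = tasks[run]["phases"][0]
        if kind == "lock":
            owner = run
        tasks[run]["phases"][0] = (kind, left - 1)
        if left - 1 == 0:
            tasks[run]["phases"].pop(0)
            if kind == "lock":
                owner = None                    # mutex released
            if not tasks[run]["phases"]:
                finish[run] = t + 1
    return finish["high"]

for inherit in (False, True):
    print(f"priority inheritance {'on' if inherit else 'off'}: "
          f"high-priority task finishes at tick {simulate(inherit)}")

Without inheritance, the high-priority task is stuck behind the unrelated medium-priority task while the low-priority task holds the mutex (priority inversion); with inheritance, the low-priority mutex owner is boosted, releases the lock quickly, and the high-priority task finishes much earlier.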

The problem in Pathfinder’s software was related to scheduling of tasks on its hardware, which is schematically shown in Fig. 5.7. The different parts of Pathfinder’s hardware are connected via two buses: a VME bus and an MIL-STD-1553 bus. The 1553 bus was used, in particular, to transfer data from the ASI/MET module to the main memory, which was afterwards sent to Earth via radio devices. All activities on the 1553 bus were scheduled at an 8-Hz rate, i.e., in time windows of 0.125 seconds each. In order to control the usage of the bus for data transport, two tasks played a central role:

[Figure 5.7 labels: 1553 bus; cruise stage equipment (thrusters, valves, star scanner, sun sensor); IPC pipe; d-buffer; d-buffer; memory; CPU; radio; lander equipment (accelerometers, radar altimeter, ASI/MET); camera; VME bus.]

Figure 5.7. Schematic view of the architecture of the different hardware components of Pathfinder.


Figure 5.8. Qualitative sketch of major events in a regular 1553 bus cycle according to [19]: Start of the cycle at t1 (8-Hz boundary) with transactions that have been scheduled in the previous cycle. bc_dist becomes active at t2 and has completed all distribution of data at t3. At t4, bc_sched starts scheduling the next cycle transactions (possibly stopping other tasks due to higher priority, indicated by the red arrow) and terminates at t5.

the bus scheduler task called bc_sched controlling the setup of the transactions, and the bc_dist task handling the collection of the transaction results to enable bc_sched for the next cycle. A single bus cycle of 0.125 seconds under normal circumstances is illustrated in Fig. 5.8. First, in the time interval [t1, t2], the bus is active, executing the transactions set up by the previous bc_sched run. Then, at t2, the first part of the 1553 traffic is complete and the bc_dist task starts distributing the data up to time t3. In [t3, t4], further tasks might be carried out involving the bus until bc_sched prepares the next cycle in the period [t4, t5], potentially followed by further tasks for the bus. This sequence was repeated every 1/8 second. (Note that the times t2, . . . , t5 are sketched just qualitatively in Fig. 5.8 and that they were not fixed but could differ in actual values from one cycle to another.) The task bc_sched obviously had a very high priority, directly followed by bc_dist. All other functions relevant to controlling the spacecraft (here, spacecraft refers to both the Pathfinder lander and Sojourner) were assigned a medium priority. And finally, all scientific functions such as imaging, image compression, and ASI/MET were of the lowest category of priority. Most tasks related to the 1553 bus used a double-buffered shared memory mechanism where bc_dist placed the newest data. The ASI/MET task, however, was the only task related to real-time data that used an interprocess communication mechanism (IPC) based on a pipe layout. The access to the pipe was managed via a mutex. In the case of Pathfinder's reboot problem, it turned out that the high-priority bc_dist was blocked by the low-priority ASI/MET task via the mutex controlling the access to the list of file descriptors for the IPC mechanism. After taking the mutex, the ASI/MET task was preempted by several medium-priority tasks with higher priority than ASI/MET but not relying on the IPC mechanism (and, thus, not being blocked by the corresponding mutex). The bc_dist task tried to execute (with higher priority than those medium-priority tasks) and succeeded only up to the point where it had to send the newest ASI/MET data and needed to wait for the IPC-related mutex to be freed by ASI/MET after completion. This never happened in the corresponding cycle due to more medium-priority tasks being executed.



Figure 5.9. Sketch of the problematic 1553 bus cycle involving priority inversion related to the ASI/MET and bc_dist tasks. At time ta, the ASI/MET task gets preempted just after locking the mutex. At t2, bc_dist starts executing until it is blocked by the locked mutex at tb, waiting for the low-priority ASI/MET task to finish, which never happens due to further medium-priority tasks being executed until bc_sched starts at t4 and detects a problem, leading to the reboot failure.

When bc_sched was finally activated, it detected an incomplete bc_dist task and declared an error that triggered a reset of the whole system. Fig. 5.9 visualizes this sequence of events for a problematic cycle. Priority inversion and priority inheritance in general were well known at the time of the events (see [20] for a description). The solution to Pathfinder's specific priority inversion problem turned out to be technically easy to achieve: priority inheritance had to be switched on for the corresponding mutex via a change of value in the related global variable. The ASI/MET task was then assigned the high priority of the bc_dist task whenever the latter was waiting. Thus, no medium-priority tasks could preempt ASI/MET after this assignment, leading to a prompt termination of the ASI/MET task and a release of the mutex soon enough for bc_dist to finish in time. Originally, priority inheritance was turned off for the mutex under consideration in Pathfinder's software. The teams at JPL and WindRiver carefully discussed whether changing the related global variable to priority inheritance would result in possibly unwanted or even harmful side effects; they decided to go for this change. The patching of the software to include the priority inheritance was done by sending a new part of the code, and the on-board software of the spacecraft modified the code accordingly. Fortunately, everything worked out, no problematic effects occurred, and Pathfinder's software avoided the undesired reboot problems from that point on, leading to smooth operation until September 1997.

5.2.6 Discussion

Facts that contributed to the bug:

• Because the data bus tasks execute very frequently, the priority inheritance was initially switched off to avoid unfavorable runtime effects (cf. [19]).


• System resets also occurred in the preflight testing. But they were not reproducible and were seen as less important and mission-critical compared to other problems to be solved (see [19]). Furthermore, the project deadline prohibited a sound analysis of the bug.

• "Killed by success": The data rates from the surface and the amount of activities were higher than anticipated and aggravated the problem.

• The priority inversion only occurred when the relatively short ASI/MET task was preempted by a medium-priority job, i.e., only in rare cases (cf. [4, 8]).

The on-board debugging functionality, the replica on Earth, and the remote patching functionality turned out to be crucial for solving the problem with Pathfinder and for avoiding a potential premature end of the mission.

5.2.7 Other Bugs in This Field of Application

• The Mars Polar Lander (MPL) was supposed to examine the ground in the Mars polar region. The landing included a parachute followed by a thrust-powered descent. Unfortunately, the landing failed on December 3, 1999. According to the accident report, the most probable cause of the failure was that spurious signals were generated when the lander legs were deployed during descent. The spurious signals gave a false indication that the lander had already landed, resulting in a premature shutdown of the lander engines (see [9]).

• Beagle 2 was deployed to the surface of Mars on December 25, 2003, from the European Space Agency's Mars Express spacecraft. A parachute was meant to slow its descent and airbags should have protected it as it touched down, but when no signal was received back, the team assumed it had crashed. Different possibilities are discussed as the reasons for this crash, such as the failure of the parachute or of the cushioning airbags (cf. [5]).

• The European probe Schiaparelli crashed during its landing on Mars on October 19, 2016. The landing included a parachute and rocket-assisted descent. "The new images suggest that while the lander successfully separated from the parachute, its thrusters switched off prematurely, and it dropped from a height between 2 and 4 km at a speed greater than 300 km per hour" (cf. [25]).

• The Mars rover Curiosity also suffered from a series of bugs. First, in February 2013 control was swapped from its A (primary) computer to the B (backup) computer due to a memory issue (see [12, 26]). Second, in March 2013, the B computer went into standby mode because a command file failed the size check. Unfortunately, an unrelated file was appended to the transmitted file, which caused a size mismatch (cf. [23]). As of 2018, in spite of the described problems, the rover is still exploring the Mars surface successfully.

• The ISS is a collaboration of the United States, Russia, Europe, Japan, and Canada. This invites software and hardware problems arising from code and equipment of different countries. In 2001, the Russian supply module Progress M1-7 had problems with docking on the ISS (see [10]). In 2007, the Russian computers failed because of condensation, causing a short circuit that was interpreted as a power-off command (cf. [11, 17, 21]). Also in 2009, oscillations shook the ISS after Russian software control parameters were erroneously up-linked. These parameters led to a firing of the thrusters, intended to adjust the ISS, stimulating the natural frequency of the space station. This was caused by human error, sending improper values (see [1, 18]). In 2002, a truncation error in the GPS code on the ISS was detected that led to miscalculation of the attitudes (cf. [27]).

Bibliography

[1] D. Bitner. Software Glitch Shakes Space Station, No Damage Detected. NASA Countdown, 14(10), 2009

[2] Computerwoche. Pathfinder nutzt IBMs RS/6000. date accessed: 2018-07-31, 1997. https://www.computerwoche.de/a/pathfinder-nutzt-ibms-rs6000,1100002

[3] A. B. Downey. The Little Book of Semaphores. Green Tea Press, 2009

[4] T. Durkin. The Vx-Files: What the Media Couldn't Tell You about Mars Pathfinder. Robot Science & Technology, 1:1–3, 1998

[5] ESA. Beagle 2 ESA/UK Commission of Inquiry. Technical Report, April 2004

[6] S. Garber. A chronology of Mars exploration. date accessed: 2018-03-19, 2015. https://history.nasa.gov/marschro.htm

[7] E. Howell. A brief history of Mars missions. date accessed: 2018-03-19, 2015. https://www.space.com/13558-historic-mars-missions.html

[8] M. Jones. What really happened on Mars rover Pathfinder. date accessed: 2018-03-21, 1997. http://catless.ncl.ac.uk/Risks/19.49.html#subj1.1

[9] JPL Special Review Board. Report on the Loss of the Mars Polar Lander and Deep Space 2 Missions. Technical Report, March 2000

[10] W. Knight. Russian docking problem threatens shuttle launch. date accessed: 2018-03-21, 2001. https://www.newscientist.com/article/dn1624-russiandocking-problem-threatens-shuttle-launch/

[11] T. Malik. NASA Eyes Workarounds for Space Station Computer Glitch. date accessed: 2018-03-21, 2007. https://www.space.com/3956-nasa-eyesworkarounds-space-station-computer-glitch.html

[12] NASA. New Curiosity "safe mode" status expected to be brief. date accessed: 2017-08-18. https://www.jpl.nasa.gov/news/news.php?feature=3732

[13] NASA. Mars Pathfinder mission status. date accessed: 2018-04-25, 1997. https://mars.nasa.gov/MPF/newspio/mpf/status/mpf.html

[14] NASA. Mars exploration program. date accessed: 2018-03-19, 2018. https://mars.nasa.gov/

[15] NASA Jet Propulsion Laboratory. Mars Pathfinder fact sheet. date accessed: 2018-03-19, 1997. https://mars.nasa.gov/MPF/mpf/fact_sheet.html

[16] NASA Jet Propulsion Laboratory. Mars Pathfinder. Technical report, NASA Jet Propulsion Laboratory, 1999. https://www.jpl.nasa.gov/news/fact_sheets/mpf.pdf

[17] J. Oberg. Space station: Internal NASA reports explain origins of June computer crisis. date accessed: 2018-03-21, 2007. https://spectrum.ieee.org/aerospace/space-flight/space-station-internal-nasa-reports-explainorigins-of-june-computer-crisis

[18] J. Oberg. Shaking on space station rattles NASA. date accessed: 2018-03-21, 2009. http://www.nbcnews.com/id/28998876/ns/technology_and_sciencespace/t/shaking-space-station-rattles-nasa/

[19] G. Reeves. What happened on Mars. date accessed: 2018-03-21, 1998. http://catless.ncl.ac.uk/Risks/19.54.html#subj6.1

[20] L. Sha, R. Rajkumar, and J. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, 39(9):1175–1185, 1990

[21] L. Sherriff. Dodgy wet wiring caused ISS computer crash. The Register. date accessed: 2018-03-21, 2007. https://www.theregister.co.uk/2007/10/16/iss_computer_crash/

[22] SpaceX. Mission to Mars. date accessed: 2018-03-26. http://www.spacex.com/mars

[23] D. Spector. Software bug stalls the Curiosity Rover. date accessed: 2017-08-18, 2013. http://www.businessinsider.com/mars-curiosity-rover-secondcomputer-glitch-2013-3?IR=T

[24] A. S. Tanenbaum. Modern Operating Systems. Pearson, 3rd edition, 2009

[25] T. Tolker-Nielsen. EXOMARS 2016 - Schiaparelli anomaly inquiry. date accessed: 2018-07-31, 2017. http://exploration.esa.int/mars/59176-exomars2016-schiaparelli-anomaly-inquiry/

[26] L. Tung. NASA fixes bug that put Curiosity Mars rover on standby. date accessed: 2017-07-24, 2013. http://www.zdnet.com/article/nasa-fixes-bugthat-put-curiosity-mars-rover-on-standby/

[27] G. White. "Truncation error" found in GPS code on Int'l Space Station. date accessed: 2018-03-21, 2002. https://catless.ncl.ac.uk/Risks/22.11.html

[28] Wikipedia. Sojourner (rover). date accessed: 2018-03-19. https://en.wikipedia.org/wiki/Sojourner_(rover)

[29] J. J. Woehr. Really Remote Debugging: A Conversation with Glenn Reeves. Dr. Dobb's Journal, Interview, 1999. http://www.drdobbs.com/architectureand-design/really-remote-debugging-a-conversation-w/228700403

Chapter 6

Software-Hardware Interplay

Computer hardware is notoriously impacted by flaws due to its complicated design. Manufacturers such as Intel, AMD, and NVIDIA strive to create robust systems, but bugs continue to appear.87 Since the focus of this book is on specific software bugs, we will not discuss this type of error in detail. But in some cases, faulty numerical values lead to a bug in the hardware or in the use of the hardware. In this context, the famous floating-point division error of Intel's Pentium processor in 1994 can be seen as a software bug encoded in hardware.88 The fault was due to some missing numerical values that were necessary in the course of a sophisticated algorithm for efficiently implementing division of integer numbers in hardware. Another example related to the interplay of software and hardware occurred during the NASA mission of the Mars rover Spirit. For storing data on a flash memory, the bookkeeping of the stored data was growing in such a way that the memory could not be used at all, which nearly terminated the mission by repeatedly rebooting.

6.1 The Pentium Division Bug

In heaven all the interesting people are missing.
—Friedrich Nietzsche

6.1.1 Overview

• Field of application: chip design
• What happened: Intel's Pentium processor delivered inaccurate division results for certain input values
• Loss: $475 million
• Type of failure: missing data in lookup table

87 See R. L. Hummel. PC Magazine Programmer's Technical Reference: The Processor and Coprocessor. Ziff-Davis Publishing Co., 1992.
88 See T. R. Halfhill. The truth behind the Pentium Bug. date accessed: 2018-01-19, 1995, http://users.minet.uni-jena.de/~nez/rechnerarithmetik_5/fdiv_bug/byte_art1.htm.


6.1.2 Introduction

The rapid digitization of our lives is based on the extreme progress in the development of computational power in hardware. This started with the first computers using vacuum tubes, subsequently replaced by transistors for realizing logic gates, the basis of computing devices. The circuits were packed into single microchips via integrated circuit design containing the functionality of memory chips, flash memory, embedded microprocessors, or the central processing unit (CPU), the heart of every computer. The CPU has to manage and control the workflow such as reading the atomic commands, fetching the data from memory, executing operations, and storing results back in memory. An important part is arithmetic operations, which are heavily used for scientific applications but are also relevant for internal calculations in standard programs or operating systems (e.g., for computing memory addresses in program execution). Therefore, the efficient realization of basic arithmetic operations such as addition, multiplication, and division in hardware plays a crucial role in modern chip design. In particular, (floating-point) division is also silently used in other instructions such as the remainder or tangent function (see [9]). The efficient implementation of more complex mathematical functions such as powering, square root, sine, cosine, exponential function, and logarithm in hardware is also very important.

Predecessors of the Pentium system introduced an additional arithmetic coprocessor (the x87 coprocessor) that was able to accelerate more general arithmetic operations on the hardware level and could be built into the computer for users interested in extensive arithmetic computations. Over time, this arithmetic coprocessor was fully integrated into the standard CPU. Chip design companies use proprietary, highly optimized algorithms to accelerate such operations on the CPU.

To understand the Pentium bug, which is related to the division of floating-point numbers,89 a couple of arithmetic concepts are relevant and are explained in the following Excursions 15–18: the carry-save adder, one's and two's complements, and restoring and nonrestoring division. Readers familiar with these concepts may go directly to page 151. Furthermore, floating-point numbers are represented via mantissa and exponent (see Excursion 2). Hence, a division amounts to subtracting the exponents and to computing the bit field of the resulting divided mantissa; this bit-field computation can be interpreted as an integer division that constitutes the main part of the floating-point division.

89 Therefore, this bug is sometimes called the Pentium FDIV bug since FDIV is the x86 instruction for floating-point division.
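As a small illustration of this decomposition (our own sketch using Python's math.frexp, which splits a float into mantissa and exponent; the sample values are arbitrary):

import math

p, d = 1.75, 1.25
mp, ep = math.frexp(p)    # p = mp * 2**ep with mp in [0.5, 1)
md, ed = math.frexp(d)
quotient = (mp / md) * 2 ** (ep - ed)   # divide mantissas, subtract exponents
print(quotient, p / d)                  # both print 1.4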

Excursion 15: Carry-Save Adder
For integer representation in binary format, the least-significant bit (corresponding to the digit for 2^0) is going to be the right-most one in this and the following examples. As an example for addition, we consider 1007 + 146 = 1153 in decimal representation, which corresponds in binary format to

  summand 1       1 1 1 1 1 0 1 1 1 1
+ summand 2       0 0 1 0 0 1 0 0 1 0
  carry ahead   1 1 1 1 1 1 1 1 0 0
  -----------------------------------
  result        1 0 0 1 0 0 0 0 0 0 1

This is the classical way of adding two numbers in a written manner as taught at school, just for the binary representation. Each nonzero entry in the carry ahead is due to the carry of the previous bit position (i.e., the least-significant bit in the carry, the first starting from the right in our examples, is always equal to zero). Note that the classical carry ahead is never stored explicitly but automatically integrated into the summation. To make this work, one has to sequentially go through all bits of the two given numbers starting from the right, to add corresponding bits including a possible carry, and to move possible new carry information to the left neighbor.

A faster method, the carry-save adder, tries to avoid this sequential scheme by collecting and storing the carries arising at each bit independently in an additional bit sequence, the collected carry, and doing an additional final addition of this carry at the end. For the above-mentioned example, this reads

  summand 1              1 1 1 1 1 0 1 1 1 1
+ summand 2              0 0 1 0 0 1 0 0 1 0
  ------------------------------------------
  without carry          1 1 0 1 1 1 1 1 0 1
+ collected carry        0 1 0 0 0 0 0 1 0 0
  ------------------------------------------
  add for final result 1 0 0 1 0 0 0 0 0 0 1


This type of addition does not give the final result directly but displays the result as two values. Its power becomes evident if we have to add three or more N-bit numbers, as in the context of multiplication, where we have to add many intermediate integers but need the result only in the final step. In such situations, the carry-save adder can work many steps fully in parallel concerning each bit and needs only one usual addition (with serial carry ahead functionality) in the last step. In this way, for adding three integers a, b, and c, two new binaries s and t are created by defining the digits as s_i = (a_i + b_i + c_i) mod 2 and t_i = 1 if there is a carry from position i to i + 1 and t_i = 0 otherwise^a (i.e., t represents the collected carry). The information about a + b + c is stored in s and t. Now, a fourth integer d can be added by considering s + t + d. This sum can again be computed (independently in parallel for each digit) in the carry-save manner, resulting in a new s and t. We can proceed until all numbers are summed up to final values of s and t. Finally, s + t has to be computed in a serial manner using the carry ahead approach.

a The modulo operation mod 2 assures the correct realization of the four possible cases of digit combinations. In particular, 1 + 1 = 0 mod 2.
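As a small illustration of the excursion (our own sketch, not from the book), the following Python snippet reduces three summands to a sum word and a collected-carry word and defers the single carry-propagating addition to the end.

def carry_save_add(a, b, c):
    """Reduce three summands to two: bitwise sum without carries, plus collected carry."""
    s = a ^ b ^ c                              # s_i = (a_i + b_i + c_i) mod 2
    t = ((a & b) | (a & c) | (b & c)) << 1     # carry out of position i moves to position i+1
    return s, t

def carry_save_sum(numbers):
    s, t = 0, 0
    for n in numbers:
        s, t = carry_save_add(s, t, n)
    return s + t                               # one ordinary addition at the very end

# The example from the excursion: 1007 + 146 = 1153.
s, t = carry_save_add(1007, 146, 0)
print(bin(s), bin(t), s + t)                   # 0b1101111101 0b100000100 1153
print(carry_save_sum([1007, 146, 42, 7]))      # 1202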

Excursion 16: Replacing Subtraction by Addition: One's Complement and Two's Complement
Given a nonnegative integer number in binary format x = x_p x_{p-1} ... x_1 x_0 with all x_i equal to 0 or 1, the one's complement of x is defined as x̄ = x̄_p x̄_{p-1} ... x̄_1 x̄_0, where 0̄ = 1 and 1̄ = 0. Thus, the one's complement is nothing other than swapping all binary digits to the opposite value. It holds that x + x̄ = 11...1 = 2^{p+1} - 1. Hence, subtracting x from some number a can be represented by adding the one's complement of x:

a - x = a + x̄ - 2^{p+1} + 1.

As an example, for p = 2 and a = 101, x = 011 in binary format, we can compute the difference in the usual way:

  101
- 011
-----
  010

Alternatively, adding the one's complement x̄ = 100 to a results in

  101
+ 100
-----
 1001

The final result is then obtained by subtracting 2^{p+1} - 1 (i.e., 1000 - 0001 in binary format):

1001 - 1000 + 0001 = 0010

Furthermore, the two's complement y of a (p + 1)-bit number x = x_p x_{p-1} ... x_1 x_0 is defined as the complement with respect to 2^{p+1}: y = 2^{p+1} - x. This is equivalent to taking the one's complement and adding 1 since

y = 2^{p+1} - x = 2^{p+1} - (2^{p+1} - 1 - x̄) = x̄ + 1.

The two's complement, hence, is the negative of x plus the term 2^{p+1} (i.e., an additional leading bit corresponding to 2^{p+1}). Therefore, the two's complement is actually used for replacing subtraction by addition: a - x = a + y - 2^{p+1}. Usually, the highest bit representing 2^{p+1} is ignored because it cannot be represented in (p + 1)-bit numbers. Besides the computation of the two's complement of a number, a two's-complement representation of (integer) numbers exists^a. The bit fields of integer numbers (or mantissas of floating-point numbers) are interpreted in the way that numbers between zero and the maximum positive value 2^p - 1 are obtained by incrementing bits in the usual way, but the bit field of a negative number x is given as the two's complement y = 2^{p+1} - |x| of its positive counterpart. This results in a direct jump from 2^p - 1 to -2^p by incrementing the least-significant bit.

a https://en.wikipedia.org/wiki/Two's_complement
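A minimal sketch of the same idea in Python (our own illustration; the helper names are ours) that reproduces the 3-bit example 101 - 011 = 010 via the two's complement:

def ones_complement(x, bits):
    return x ^ ((1 << bits) - 1)                 # flip every bit within the given width

def twos_complement(x, bits):
    return (ones_complement(x, bits) + 1) % (1 << bits)

def subtract_via_twos_complement(a, x, bits):
    # a - x = a + (2**bits - x); the extra 2**bits bit is dropped by the modulo.
    return (a + twos_complement(x, bits)) % (1 << bits)

# The excursion's example with p = 2, i.e., 3-bit numbers: 101 - 011 = 010.
print(subtract_via_twos_complement(0b101, 0b011, bits=3))   # 2, i.e., 0b010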

Excursion 17: Restoring Division for Integers
The division of floating-point numbers is mainly based on the division of the mantissas, i.e., on the division of integer-represented numbers. To compute the division of two integers q := p/d, the restoring division algorithm may be written in the following form for decimal numbers with digits q_j ∈ {0, ..., 9}:

1  Start with an assumed number of digits j large enough
2  Initial guess for digit: set k := 1
3  Subtract p := p - 10^j · d
4  If p = 0: stop, set q_j := k and all other q_i := 0, i = 0, ..., j - 1
5  If p > 0: increment k := k + 1, go to 3
6  If p < 0: restore p := p + 10^j · d, set q_j := k - 1, j := j - 1, go to 2


For the example of 325/13 in decimal form, the steps of the restoring division correspond to subtracting 13 · 10^1 for j = 1 as often as possible (the count denoted by k) until a negative remainder is reached, then going back one step, setting the corresponding value as the first digit of the resulting division (q_1 = 2 in this case), and continuing with the corresponding remainder; go to the next digit (j = 0): subtract 13 · 10^0 as often as possible until a nonpositive remainder is reached (five times in this example, ending exactly at zero). In total, the result is then given by 2 · 10^1 + 5 · 10^0 = 25. This may be visualized as follows:

                        325 : 13 = 25
j=1: subtract 13*10 ->   195   pos. remainder
     subtract 13*20 ->    65   pos. remainder
     subtract 13*30 ->   -65   neg. remainder
                               => restore previous p = 65 and set q1 := 2
j=0: subtract 13*1  ->    52   pos. remainder
     subtract 13*2  ->    39   pos. remainder
     subtract 13*3  ->    26   pos. remainder
     subtract 13*4  ->    13   pos. remainder
     subtract 13*5  ->     0   remainder == 0 => set q0 := 5 and stop

Now, it is easy to understand the restoring division in binary form (with only minor changes in the formulation of the algorithm due to fewer options for digit values):

1  Start with j large enough
2  Subtract p := p - 2^j · d
3  If p = 0: stop, set q_j := 1, all other q_i := 0, i = 0, ..., j - 1
4  If p > 0: set q_j := 1, set j := j - 1, go to 2
5  If p < 0: restore p := p + 2^j · d, set q_j := 0, j := j - 1, go to 2

Let us also consider a binary example:

            1111 : 11 = 0101
j=3: 1111 - 11*1000 = 1111 - 11000 = -1001   neg. remainder
                                             => restore previous p and set q3 := 0
j=2: 1111 - 11*100  = 1111 - 1100  =  0011   pos. remainder
                                             => keep remainder p = 011 and set q2 := 1
j=1:  011 - 11*10   =  011 - 110   =  -011   neg. remainder
                                             => restore previous p and set q1 := 0
j=0:  011 - 11*1    =  011 - 11    =     0   remainder == 0 => set q0 := 1 and stop

A disadvantage of this restoring division is that the first guess for qj = 1 might be too large. In this case, qj is zero and one has to reduce j to j − 1 and try again with the original value of the remainder p. This clearly represents computational overhead. The nonrestoring division method, explained in Excursion 18, has been developed to overcome this issue.
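The binary restoring scheme translates almost literally into code; the following sketch (our own illustration, returning quotient and remainder) reproduces the examples above.

def restoring_division(p, d):
    assert d > 0 and p >= 0
    q = 0
    j = max(p.bit_length() - d.bit_length(), 0)   # "j large enough"
    while j >= 0:
        p -= d << j                    # subtract 2^j * d
        if p >= 0:
            q |= 1 << j                # digit q_j := 1
        else:
            p += d << j                # restore and set q_j := 0
        j -= 1
    return q, p                        # quotient and remainder

print(restoring_division(0b1111, 0b11))   # (5, 0): 1111 / 11 = 0101
print(restoring_division(325, 13))        # (25, 0)
print(restoring_division(327, 13))        # (25, 2)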


Excursion 18: Nonrestoring Division for Integers
To avoid the possible restart of the restoring division scheme, the nonrestoring division method proceeds with an overestimated value by computing a negative remainder and corresponding digit, which is then compensated for by sign-inverted calculations for the next digit. For the decimal example of 325/13 given in Excursion 17, we now subtract 130 as often as possible until we reach a negative remainder (three times in this case). Then we add 13 as often as possible until we reach a nonnegative remainder (i.e., five times). In total the result is then given by 3 · 10^1 - 5 · 10^0 = 25. Again, this may be visualized:

         325 : 13 = 3(-5) = 30 - 5 = 25
j=1:      325
         -130
         ----
          195
         -130
         ----
           65
         -130
         ----
          -65   neg. remainder: set q1 := 3 and j := 0
j=0:      +13   (now add instead of subtract)
         ----
          -52   neg. remainder: continue adding
          +13
         ----
          -39   neg. remainder: continue adding
          +13
         ----
          -26   neg. remainder: continue adding
          +13
         ----
          -13   neg. remainder: continue adding
          +13
         ----
            0   remainder == 0: set q0 := -5, compute the final 30 - 5, and stop

For the binary example mentioned in Excursion 17, this corresponds to

         1111 : 11 = 1(-1)1(-1) = 2^3 - 2^2 + 2^1 - 2^0 = 0101
j=3:     1111
       -11000   (= 11*1000)
       ------
        -1001   neg. remainder: set q3 := 1 and j := 2
j=2:    +1100   (= 11*100, now add instead of subtract)
       ------
          011   pos. remainder: set q2 := -1 and j := 1
j=1:     -110   (= 11*10, now subtract instead of add)
       ------
          -11   neg. remainder: set q1 := 1 and j := 0
j=0:      +11   (= 11*1, now add instead of subtract)
       ------
            0   remainder == 0: set q0 := -1, compute the final 2^3 - 2^2 + 2^1 - 2^0, and stop
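A corresponding sketch of binary nonrestoring division (again our own illustration): each step subtracts or adds 2^j · d depending on the sign of the running remainder, the digits are +1 or -1, and a final correction step fixes a possibly negative remainder.

def nonrestoring_division(p, d):
    assert d > 0 and p >= 0
    j = max(p.bit_length() - d.bit_length(), 0)
    q, r = 0, p
    while j >= 0:
        if r >= 0:
            r -= d << j
            q += 1 << j               # digit q_j = +1
        else:
            r += d << j
            q -= 1 << j               # digit q_j = -1
        j -= 1
    if r < 0:                         # final correction if the last digit overshot
        r += d
        q -= 1
    return q, r

print(nonrestoring_division(0b1111, 0b11))   # (5, 0)
print(nonrestoring_division(325, 13))        # (25, 0)
print(nonrestoring_division(327, 13))        # (25, 2)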


6.1.3 Timeline

At the end of 1993, Intel started an advertising campaign for introducing its new processor and the "intel inside" logo that was connected with a marketing strategy to strengthen the bond of customers to Intel. Because Intel failed in protecting the name "x86" legally, the company decided to call the successor of the i486 processor Pentium, and not i586. The Pentium processor had a lot of improvements compared to the i486, such as enhanced floating-point arithmetic. In particular, the division algorithm for computing p/d was newly implemented and up to five times faster compared to the i486.

Intel had been aware of the growing complexity of the chip design process, especially because of the coordination of the different groups working on different parts of the chip. Therefore, a special group had been established in order to control all the interfaces between different modules. Nevertheless, it was clear that a newly developed chip would not be flawless. Therefore, a procedure called "step-and-repeat" was installed to include corrections and improvements in every step of the fabrication process (cf. [17]). Hence, Intel was convinced that it would finally sell a chip that was fully functional.

Since March 1994, Prof. Thomas Nicely from Lynchburg College had been using Pentium-based systems to compute sums of reciprocals of prime numbers in the context of his research in the field of number theory. In June 1994, he stumbled over some inaccuracies in the results. After a detailed analysis of different possible causes, it turned out to be a hardware problem. In particular, in contrast to the Pentium processor, older i486 systems did not produce any unexpected inaccuracies. On October 24, 1994, Nicely contacted Intel tech support and explained his testing procedure and comparisons that led to the conclusion that the cause of the error was definitely the Pentium chip. The received answer gave no information or explanation. The service person told him that there were no problems with the Pentium, and that he should wait for a call back. Six days after his ineffective call, Prof. Nicely sent an email message to a few persons (see Appendix 8.3), which led to a posting in the newsgroup comp.sys.intel describing examples where the Pentium gave wrong results.90 The first example was the computation of 1/824633702441.0 (cf. [26]), which could be tested via (824633702441.0)·(1/824633702441.0) and should result in 1.0 up to a small rounding error but with 19 significant decimal digits. Instead, the difference between the correct result and the one using Pentium chips was 0.000000006274709702, i.e., the result was only exact up to 8 significant decimal digits. Other examples were discovered. In particular, A. Kaiser's list of 23 erroneous reciprocals intrigued T. Coe to look into the details of the problem (cf. [9]). After the publication of his findings, T. Nicely himself signed a nondisclosure agreement with Intel.91 But the coverage of the Pentium bug had already developed its own dynamics with the help of the Internet. Some Net activists informed local newspapers and TV stations, and by November 24, several global and US media reported on the Pentium bug (see [9, 21]).

In late November 1994, Intel offered to replace chips for those customers able to convince Intel staff that they heavily used floating-point arithmetic. At this point, Intel's white paper [33], which had already been available internally at Intel since the summer (cf. [9, 17]), was released (see [26]),92 a further attempt to calm down disappointed customers. On December 12, IBM announced the halting of shipment of Pentium-based systems due to their own estimations about the probability of encountering the bug (see Sec. 6.1.5 for a detailed discussion on these aspects). But the uproar in the community was exerting too much pressure on the company: Only 8 days later, on December 20, 1994, Intel announced a new policy and offered to replace the flawed processor with a corrected successor variant for any Pentium owner who requested it.

In the aftermath of these events, many people started to search for the actual cause of the Pentium bug. Notably, T. Coe, P. Tang, V. Pratt, and A. Edelman described the origin of the bug, the critical numbers, and a statistical analysis about the probability of wrong computations (see [8, 10, 12, 28]). Based on the independent analysis of the bug, W. Kahan developed an SRT93 division tester (see [18]), and C. Moler of MathWorks/MATLAB developed a workaround that identified the flawed cases and replaced the division by a safe version, avoiding the buggy entries in the lookup table (cf. [9, 22]). For an extensive collection of documents related to the Pentium division bug, see [23].

90 See [9] for a nice historic survey of the quite complex flow of information.
91 A detailed survey of different aspects from T. Nicely himself can be found in [26].
92 Probably not yet completely publicly, but "manually" outside Intel. Other sources such as [9] mention December 14, 1994, as the date of the public release on the Internet.
93 Described in the next excursion.

6.1.4 Detailed Description of the Bug

Compared to other basic arithmetic operations on processors, division is rather complex to realize efficiently (see [16]). Therefore, various methods exist, and different processors or processor families are based on different algorithms. Typically, division algorithms belong to one of two categories: multiplicative methods that are based on a fixed-point iteration such as Newton's method, and additive algorithms related to nonrestoring methods. The Pentium chip used the nonrestoring radix 4 SRT method that is described in detail in Excursions 19 and 20, which are based on [12].

Excursion 19: Radix 4 SRT Division
The Sweeney–Robertson–Tocher (SRT) division algorithm has been described independently by those three authors in [7], [31], and [35] (see [2] for a nice survey). It is a special case of a nonrestoring division to base 4 where the new digits computed in each step are chosen from the set {-2, -1, 0, 1, 2}. To compute the floating-point division p/d, we assume (without loss of generality) 1 ≤ p, d < 2 since the mantissas are crucial for the computation.

1  p_0 := p
2  for k = 0, 1, ...
3    select a digit q_k ∈ {-2, -1, 0, 1, 2} s.t. p_{k+1} := 4(p_k - q_k·d) with |p_{k+1}| ≤ (8/3)·d
4  end
5  p/d = Σ_{k=0} q_k/4^k

The new q_k can be determined via a lookup table^a in each step k. The full admissible interval I := [-(8/3)·d, +(8/3)·d] can be partitioned into five overlapping subintervals for each of the five possible digits (as visualized in Fig. 6.1):

I_{-2} := [-(8/3)·d, -(4/3)·d]   →  q_k = -2
I_{-1} := [-(5/3)·d, -(1/3)·d]   →  q_k = -1
I_0    := [-(2/3)·d, +(2/3)·d]   →  q_k = 0
I_1    := [+(1/3)·d, +(5/3)·d]   →  q_k = 1
I_2    := [+(4/3)·d, +(8/3)·d]   →  q_k = 2


If p_k is in one of the subintervals, then p_k - q_k·d is in I_0, and hence the next p_{k+1} := 4(p_k - q_k·d) is by construction again in I. The algorithm is well-defined since |p_{k+1}| ≤ (8/3)·d always holds. After K iterations, the result q = p/d is simply computed, up to the corresponding accuracy, by the digits q_k (for basis 4) via q_0·4^0 + q_1·4^{-1} + q_2·4^{-2} + ··· + q_{K-1}·4^{-K+1} and just has to be transformed to the standard binary form (for basis 2). The advantage of radix 4 is that in each step, one digit to base 4 is obtained, which corresponds to two digits in binary format compared to only one binary digit per SRT iteration step on older processors. This was one major aspect for improving the performance of division operations in Pentium processors compared to i486 architectures.

a The relevant aspects of how to design suitable lookup tables are described below.

Figure 6.1. Overlapping intervals I_{-2}, ..., I_2 for the choice of q_k (reproduced from [12]). Each tic mark on the axis represents the value d/3; the five subintervals are labeled, three from above and two from below, with the corresponding values for q_k. Note that the factor d in the interval boundaries is omitted in the visualization.
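The recurrence of Excursion 19 can be played through directly with exact rational arithmetic (our own sketch; it uses the exact overlapping intervals instead of a PD lookup table, and the rule of picking the admissible digit of largest magnitude is an arbitrary choice for illustration).

from fractions import Fraction as F

def srt_radix4(p, d, steps=12):
    assert F(1) <= p < 2 and F(1) <= d < 2
    digits, pk = [], F(p)
    for _ in range(steps):
        # all digits q whose interval I_q = [(3q-2)/3 * d, (3q+2)/3 * d] contains p_k
        candidates = [q for q in (-2, -1, 0, 1, 2)
                      if F(3*q - 2, 3) * d <= pk <= F(3*q + 2, 3) * d]
        q = max(candidates, key=abs)          # any admissible digit would keep the iteration valid
        digits.append(q)
        pk = 4 * (pk - q * d)
    return sum(F(q, 4**k) for k, q in enumerate(digits))

print(srt_radix4(F(7, 4), F(1)))              # 7/4, i.e., the 1.75/1.0 example, exact
print(float(srt_radix4(F(3, 2), F(17, 16))))  # approximately 1.411764..., close to 24/17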

When implementing the SRT algorithm on a system, additional aspects come into play.94 First, only inexact calculations can be done due to machine number representation involving rounding errors. Luckily, this can be compensated for by the degrees of freedom due to the overlaps in the subintervals. Second, due to the SRT iterations, many additions and subtractions occur. Since each of these operations has to be implemented efficiently, a carry-save variant of the SRT is actually implemented in hardware. Details on the SRT variant used for the Pentium processor are described in Excursion 20.

94 For a survey of parameter options in designing SRT hardware implementations and a study on optimality, see [15].

Excursion 20: Radix 4 SRT with Carry-Save Adder on Pentium
The major computational step in the SRT algorithm is computing p_k - q_k·d = p_k + (-q_k·d), i.e., adding and multiplying:

If q_k = 0   →  no addition necessary                                        (A)
If |q_k| = 2 →  q_k·d is obtained by binary shift                            (B)
If q_k > 0   →  represent subtraction as addition via one's complement       (C)
If q_k < 0   →  normal addition of two positive integers for p_k + (-q_k·d)  (D)

The sequence of resulting additions from the SRT steps is implemented via the carry-save adder (see Excursion 15 on page 146). Therefore, p_k is represented as c_k + s_k.


Up to now, we have assumed exact arithmetic. But in practice one has to consider rounded values due to machine number representation for the final result. Additionally, a small number of digits of p_k and d simplifies the table lookup inside the SRT iterations to determine the digits q_k. Luckily, the redundant interval representation via I_{-2}, ..., I_2 allows exactly this feature: Compute each quotient digit by using only estimates P_k of the partial remainder p_k and D of the divisor d. The Pentium SRT implementation rounded d down with granularity 1/16 to the form

D := ⌊d⌋_{1/16} = x.yyyy

since d ∈ [1, 2). For the carry-save representation p_k = c_k + s_k, a rounded binary number representation of the form xxxx.yyy with granularity 1/8 is used for both c_k and s_k, resulting in

P_k := ⌊c_k⌋_{1/8} + ⌊s_k⌋_{1/8}

(for an explanation of why rounding with granularity 1/8 and 1/16 is used, see the description of the PD table setup below). Therefore, with C_k := ⌊c_k⌋_{1/8} and S_k := ⌊s_k⌋_{1/8}, the following inequalities hold:

C_k ≤ c_k < C_k + 1/8,    S_k ≤ s_k < S_k + 1/8,    P_k = C_k + S_k ≤ p_k < P_k + 1/4.

The SRT radix 4 division algorithm based on the carry-save adder reads (see also [12]):

1   D := ⌊d⌋_{1/16}
2   S_0 := p_0, C_0 := 0
3   for k = 0, 1, ...
4     P_k := ⌊c_k⌋_{1/8} + ⌊s_k⌋_{1/8}
5     q_k := table_lookup(P_k, D)
6     compute (-q_k·d) via cases (A)–(D) by zeroing or shifting, and (if q_k > 0) one's complementing
7     compute c_{k+1} + s_{k+1} := 4 · (c_k + s_k + (-q_k·d)) via carry-save addition and shift
8     in case of one's complementing: set the least-significant bit in c_{k+1} before the shift
9   end
10  p/d = Σ_{k=0} q_k/4^k
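A tiny sketch of the estimate-building step (our own illustration; the numerical values of c_k and s_k are made up) showing the rounding of d to granularity 1/16 and of the carry-save parts to granularity 1/8, together with the inequality P_k ≤ p_k < P_k + 1/4:

from fractions import Fraction as F

def floor_to(x, granularity):
    return (x // granularity) * granularity   # round down to a multiple of the granularity

d = F(29, 16) + F(1, 100)       # some divisor in [1, 2)
ck, sk = F(3, 7), F(5, 9)       # made-up carry-save parts, just for illustration
D = floor_to(d, F(1, 16))
Pk = floor_to(ck, F(1, 8)) + floor_to(sk, F(1, 8))
print(D, Pk)                    # 29/16 7/8
assert Pk <= ck + sk < Pk + F(1, 4)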

To better understand the SRT algorithm, we compute the example p/d = 1.75/1.0. This example and its representation are similar to the one used in [12, 13]. The fact that both p and d are exact machine numbers with only a few nonzero digits simplifies the calculations and delivers an exact (and known) result, but the relevant steps of the algorithm for general cases are also directly visible. Note that the values for the digits q_k are selected according to the lookup table in Fig. 6.3 (which is explained below), where the bit field using the two's-complement representation (see Excursion 16) of the lookup index P_k can directly be used.


D := ⌊d⌋_{1/16} = 1.0000

1.75 =  0001.110 00000000000   s0    (s0 = p0 = 1.75 in binary form)
        0000.000 00000000000   c0    (c0 initially zero)
P0 := S0 + C0 = 0001.110              ⇒ q0 := table_lookup(P0, D) = 2
-q0·d = 1101.111 11111111111          ⇒ cases (B), (C)
        1100.001 11111111111   s~1   (intermediate save part of carry-save adder)
        0011.100 00000000001   c~1   (intermediate carry part & one's compl. bit)
        0000.111 11111111100   s1    (two-digit shift for factor 4, l. 8 in the algorithm)
        1110.000 00000000100   c1    (two-digit shift for factor 4)
S1 = .111, C1 = 1110.000 ⇒ P1 := 1110.111   ⇒ q1 := table_lookup(P1, D) = -1
-(-1)·d = 0001.000 00000000000        ⇒ case (D)
        1111.111 11111111000   s~2   (intermediate save part)
        0000.000 00000001000   c~2   (intermediate carry part)
        1111.111 11111100000   s2    (two-digit shift for factor 4)
        0000.000 00000100000   c2    (two-digit shift for factor 4)
P2 := 1111.111 + 0000.000 = 1111.111  ⇒ q2 := table_lookup(P2, D) = 0
-0·d = 0000.000 00000000000           ⇒ case (A)
        1111.111 11111000000   s~3   (intermediate save part)
        0000.000 00001000000   c~3   (intermediate carry part)
remainder p3 = 0 ⇒ q3 = q4 = ... = 0

The resulting quotient is finally computed via

p/d = Σ_{k=0} q_k/4^k = 2 · 4^0 - 1 · 4^{-1} + 0 · 4^{-2} + 0 = 1.75.
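Reassembling the digits of the worked example in a one-liner (our own check of the arithmetic above):

digits = [2, -1, 0, 0]                         # q0, q1, q2, q3 from the example
q = sum(d * 4**(-k) for k, d in enumerate(digits))
print(q)                                       # 1.75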

Note that, for about the last ten years, the SRT method with radix 16 has been used as an accelerated division method in some modern processors (see [24, 25]).

For the SRT radix 4 algorithm to work, a lookup table with unique entries has to be generated based on the five overlapping intervals I_{-2}, ..., I_2 (shown in Fig. 6.1). Due to the rounding of the divisor d and the remainder p_k in the SRT algorithm, the table gets a specific number of rows and columns according to the granularity of the rounding. This lookup table then allows us to map continuous or rounded pairs of numbers for the divisor d and the remainder p_k of an SRT iteration k to a digit q_k ∈ {-2, -1, 0, 1, 2}. Following [12], we visualized the lookup table in Fig. 6.2 with integer multiples of 1/16 on the x axis for the divisor in [1, 2) and integer multiples of 1/8 for the remainder p_k ∈ [-(8/3)D, (8/3)D] on the y axis.95 Hence, the reason that 16 columns and 87 rows are used is due to the rounding to "machine numbers" with the accuracy 1/16 for d and 1/8 for p_k, i.e., as mentioned in Excursion 20, D is represented as a binary number x.yyyy with four digits after the binary point, and P_k is represented as a binary number xxxx.yyy with three digits y for the granularity 1/8 and four digits x for the maximum range of the outermost intervals I_{-2} and I_2 touching ±2 · 8/3 = ±5.33… as maximum and minimum value, respectively. These accuracies have most probably been chosen by Intel.96 Note that the positive and negative binary numbers on the y axis are interpreted via the two's-complement representation (see Excursion 16).

95 The MATLAB code to generate the table is given in Appendix 8.2.5. For the detection of colored cells, it relies on the MATLAB example of Edelman given in [12].
96 Note that the decision on the granularity of rounding involves balancing two different aspects. On the one hand, a coarser granularity results in fewer table entries and is, thus, cheaper to realize in hardware. On the other hand, the granularity needs to be fine enough to allow for the staircase steps (see below) separating the different digits in the overlap of the intervals I_{-2}, ..., I_2. The choice of 1/8 for p_k and 1/16 for d represents a reasonable compromise.


Figure 6.2. PD lookup table for radix 4 SRT with granularity of rounding of 1/8 for pk (vertical axis) and 1/16 for d (horizontal axis) resulting in a seven-bit representation for pk (xxxx.yyy in the two’s-complement representation) and a five-bit format for d (x.yyyy). The five different colors of the cells (blue to brown from top to bottom) correspond to the different possible values for qk for a specified pair (D, Pk ). The five columns containing the buggy entries (cells in dark blue at the top) are shown in darker colors. White cells at the top and bottom of the table are never reached by the algorithm (the five gray cells at the bottom of the problematic columns are also actually never reached, see [12]). White cells inside the trapezoidal shape represent the overlap of the subintervals I−2 , . . . , I2 where both neighboring colors are possible. The straight gray lines indicate the boundaries of the subintervals (continuous with respect to d).

For different values of d ∈ [1, 2), the interval boundaries are scaled by d accordingly, resulting in straight gray lines in the figure, which denote the sharp limits of the intervals. The resulting digits -2, -1, 0, 1, 2 in the lookup table are color-coded (blue to brown from top to bottom).

Due to the overlap of the intervals I_{-2}, ..., I_2, a certain degree of freedom exists for creating a unique mapping at the borders of the regions for different digits (i.e., colors in the figure); this ambiguity results in white cells in Fig. 6.2 between different colors. The white cells above the blue ones on top and below the brown ones at the bottom correspond to entries that are never accessed in correctly working Pentiums since they are located, with and without rounding, completely outside of the allowed intervals [-(8/3)d, (8/3)d]. As mentioned in [12], the five gray cells at the bottom of the problematic columns are actually also never really accessed. As Pratt remarks in [28], the lookup table plot can be interpreted both as a discrete lookup table indexed by D and P_k (using rounded numbers in the given precision) and as a continuous mapping representation for actual (higher precision) d, p_k value pairs inside of cells. Rounding d to D corresponds to moving to the left edge of a cell, and rounding p_k results in moving onto the lower edge of a cell. Note that for all iterations of the SRT algorithm, the same column of the table is used since D does not change.

The actual bug is related to the five dark blue cells at the top of the darker-colored columns (corresponding to D ∈ {17/16, 20/16, 23/16, 26/16, 29/16}): Those entries were missing (i.e., interpreted as zero) in Intel's implementation of the table. Therefore, for certain cases of divisions, inaccurate computations for the result and the remainder occurred. Since Intel provided only partial information on the actual setup and implementation (cf. [33]), one had to rely on the reverse engineering work of T. Coe (used in [10]), which resulted in data for the five columns with the faulty entries marked with darker colors.

In a second visualization of the table in Fig. 6.3, the upper half of the table is shown with two modifications. First, the ambiguous white cells in the overlap regions of different digits are now uniquely colored. Due to the backward engineering of the bug, the actual coloring could be reconstructed for the five problem-related columns (darker colors). For the other columns, we followed the assumed rule proposed by [10] that digits with larger absolute value are preferred. The distinction between the different digits (or colors) can be interpreted as the definition of staircase steps ([34], or thresholds in [28]) on grid lines of the table lying in the allowed overlap regions. These thresholds have to respect the overlap regions of the intervals in the sense that the threshold lines must never cut the limiting straight lines. Second, some of the straight lines representing the boundary of the intervals have to be shifted by an additional value of -1/8 due to the second potential rounding error introduced by the carry-save adder in the SRT: Two roundings instead of one are performed by the command P_k := ⌊c_k⌋_{1/8} + ⌊s_k⌋_{1/8} (see line 4 of the algorithm in Excursion 20). It is important to note that only those straight lines have to be shifted additionally that limit the thresholds purely from above but not from below. In particular, the top-most straight line must not be shifted (to the red line) since the threshold from blue to white lies above this line. Note that the coloring of the table in the first version in Fig. 6.2 already used the additional shifts.

The final lookup table features have to be implemented in hardware to ensure efficient evaluations for radix 4 SRT divisions. According to Intel (see [33]), this has been realized via a programmable logic array (PLA) in Intel's Pentium processor.

6.1.5 How Serious Was the Bug?

In this section, we focus on some aspects of the Pentium bug concerning the probability of erroneous division and the loss of accuracy, which were heavily discussed in the community.

Loss of Accuracy: One has to distinguish between absolute and relative errors when talking about accuracies; see Excursion 5 on page 32.


Figure 6.3. Modified version of the PD lookup table. Only the nonnegative rows Pk are shown. The ambiguous white cells in the overlap of the subintervals are set according to the suggestion of [10] (larger absolute value of digits preferred). Light gray straight lines represent variants vertically shifted by −1/8 to account for the carry-save adder involving two roundings (see line 4 of the SRT algorithm in Excursion 20): Only those lines must be shifted that limit the thresholds purely from above but not from below; hence, the top-most straight line must not be shifted (to the red line).

Concerning the absolute error, Intel claimed in [33] (p. 3) that "the worst case inaccuracy occurs in the 12th bit position to the right of the binary point of the significand of the result, or in the 4th significant decimal digit." According to [10, 12], the worst-case absolute error is about 5 · 10^-5 for dividing numbers in [1, 2) (with smaller values for nonworst cases) since it can be proven that the Pentium always correctly computes the digits q_0, ..., q_7 before a buggy lookup table entry can be hit. A specific erroneous division was reported by V. Pratt (see [12]): For the fraction 14909255/11009918, an absolute error of 4.65 · 10^-6 occurs.97 Concerning the typically more crucial relative errors, the highest value reported by Pratt according to [12] is about 6 · 10^-5.

97 This fraction has been used as a test case to detect a buggy Pentium. The correct result should be 1.35416585.


[10] contains an error analysis resulting in an upper bound for the relative error of 6.7 · 10^-5 as well as the example case of 4195835/3145727, which should result in 1.33382044... but gives 1.33373906 on the Pentium; this corresponds to a relative error of 6.1 · 10^-5 (see [22]).
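The quoted relative error can be recomputed directly (our own check; 1.33373906 is the truncated flawed value as given in the text):

correct = 4195835 / 3145727          # 1.333820449...
flawed = 1.33373906                  # result reported for a flawed Pentium
print(abs(correct - flawed) / correct)   # about 6.1e-05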

Probability of Erroneous Divisions: Historically, a considerable amount of discussion took place in the community regarding how frequent or probable erroneous Pentium divisions are. In the following, we give a summary of certain main aspects and focus on double precision only.

Intel indicated a "probability that 1 in 9 billion randomly fed divide or remainder instructions will produce inaccurate results" (see [33], p. 3). This number relates to double precision and is based on the (unrealistic) assumption of a uniform distribution of the mantissas. In the conclusions of [33], Intel stated: "The flaw is of no significance in the commercial PC market where the vast majority of Intel processors are installed. Failure rates introduced by this flaw are swamped by rates due to existing hard and soft failure mechanisms in PC systems. The average PC user is likely to encounter a failure once in 27,000 years due to this flaw, indicating that it is practically impossible for such a user to encounter a problem in the useful lifetime of the product." As mentioned in [9], the 27,000 years have been calculated starting from the "1 in 9 billion" error rate, combining it with the claim that a typical Pentium user performs about 1,000 divisions per day, which corresponds to roughly 1/3 million divisions per year.

However, as Edelman states in the conclusions of [12], the divisors at risk are those with six consecutive ones in the fifth through tenth positions, i.e., a specific case of divisions as well as a certain pattern in previous computations of the SRT algorithm is necessary to hit a buggy PD table entry. Therefore, the probability of erroneous divisions is clearly nonuniform. As stated in [9], this "reduces the risk for certain applications, and increases it for certain others." For a more realistic distribution of operands, Pratt showed an error rate of 1 in 4 million (cf. [28]), 1,000 times greater than the rate computed with the uniform-distribution model. When stopping the shipment of Pentium-based systems in December 1994, IBM also questioned the uniform assumptions of Intel and claimed an error rate of 1 in 100 million (see [9]). In addition, IBM assumed a different user behavior, resulting in an overall error rate for a typical user of once every 24 days. It seems that there actually is a huge lack of knowledge on the probabilities of various events crucial for the Pentium bug to arise (see [21, 29] for corresponding quotations of W. Kahan), such that the whole discussion about probabilities simply misses the point (as mentioned in [9]): The buggy Pentium chips did not meet the specifications promised by Intel and expected by the customers.

Summary: How serious was the Pentium bug? The answer remains unclear for computations in application software, but it is quite clear for Intel: In total, Intel put the loss from the replacement of the hardware and from the damage to Intel's image at $475,000,000.

6.1.6 Discussion

Possible Causes of the Pentium Bug: Intel's white paper [33] is not very detailed with respect to different aspects of the implementation and causes of the bug. It is not clear whether the five missing entries in the PD table were ignored on purpose (because nobody thought they would ever be reached in the course of the division algorithm), or whether they were omitted by some other mistake. As for the reason for the bug, Intel stated in [33]: ". . . a script was written to download the entries into a hardware PLA.


An error was made in this script that resulted in a few lookup entries . . . being omitted from the PLA." (p. 8). As noted in [28], the location of the buggy table entries at the very top of the table gives room for the hypothesis that the additional shift by -1/8 has been applied erroneously also to the straight line (8/3)D (resulting in the red line in Fig. 6.3) that bounds the threshold (blue vs. white cells on top) from below and not from above. In [12], a pure technical accident is regarded as not very probable: ". . . the error was probably more than an elementary careless accident" (p. 66).

Another aspect in this context is that a symmetric PD table would be advantageous since it could be reduced to half its size (which saves space and costs on hardware). In [14], a more symmetric approach is described (referred to as folding), but this does not result in an actually sign-symmetric PD table (see [12, 15]). According to W. Kahan, the symmetry assumption was used at least in testing the implemented PLA (see [29]), which resulted in not finding the bug beforehand. The different and sophisticated aspects of the PD table in the context of the radix 4 SRT, together with the side conditions to realize every part of the algorithm and in particular also the table lookups highly efficiently, really have the potential to mislead people to assume that certain entries are simply not relevant.98 But due to the restricted information from Intel, the reasons remain unclear.

98 Note that the first edition of [16] from 1990 contained a similar mistake for nonreachable table entries, as described in [3], but was corrected in the second edition.

Testing vs. Verifying: The discussion on the reasons for the Pentium bug is closely related to the one of testing vs. verifying (see [28], partly cited also in [12]). Intel was generally aware of the considerably higher complexity in the development of the Pentium chip compared to its predecessors. Therefore, they installed a dedicated team for interfaces between the different groups involved in the design process (cf. [17]). This also involved, of course, extensive testing. But erroneous assumptions on the overall setup may have led to missing tests for parts of code or hardware implementations that were never considered to be relevant since they were assumed to be unreachable (see the aspect of symmetry discussed above). Such problematic situations can only be overcome by formal and automatic verification of correctness of code/implementation. Techniques of this branch of theoretical computer science are complex and costly but have a huge potential and have been considerably improved in recent years (cf. [1, 19]). In the context of SRT, different approaches of verification have been applied using theorem proving techniques (as in [6, 32] for an SRT circuit similar to Pentium but with fewer bits/entries in the PD table) and binary decision diagrams (see [3] for one iteration of the SRT algorithm). A crucial point in the verification is that the relevant hardware (i.e., the estimation circuit and the hardware variant of the lookup table) delivers the correct quotient digits q_k.

Microcode: Another aspect in the context of processor design and manufacturing is the so-called microcode (see, e.g., [36]). Microcode is an approach using an interpreter between the hardware and the architectural layout of a computer: The Complex Instruction Set Computer (CISC) instructions, which are standardized and which users of a system typically know about, are translated on the fly into Reduced Instruction Set Computer (RISC) instructions for a specific hardware. This microcode was originally implemented on read-only memory. However, particularly in the aftermath of the Pentium FDIV bug, CPU manufacturers started to apply writable patch memory to be able to update microcode in order to correct processor errata or to realize debugging functionality (see [5, 20]). As with every code, microcode and its possible updates also need to be secure and bug-free (to avoid bug-in-bug-fix scenarios, for example). To this end, Intel and AMD seem to look into formal verification for microcode too (see [5]).

Fewer Bits for the Divisor D in the PD Table? Note that in both [10] and [28] an additional hypothesis is formulated concerning the actual number of columns in the PD table: Since the observed erroneous table lookups were limited to 5 of the 16 columns, the possibility exists that those columns were representing a merged variant of groups of columns where the first and last 2 columns are grouped together, and the remaining 12 columns are grouped in packages of 3 each. Such a realization would, again, be advantageous because fewer table entries would need to be realized in hardware (only 6 instead of 16 columns, i.e., only about one-third of the table entries). From the point of view of granularity of the table entries, the thresholds with now coarser staircase steps could still be fitted between the straight lines (see the PD table plot in [28]). Due to missing information from Intel, it is unclear whether or not this hypothesis holds true, and it actually does not affect (directly) the possible reasons for the bug.

The Role of the Internet: The Pentium bug is one of the early examples where the Internet played a crucial role in the whole line of events. First, the email of T. Nicely and follow-up discussions in different newsgroups on the discovery of the problem made the Pentium bug public. Second, examples of faulty divisions were shared in newsgroups, etc., and hence fostered a detailed and relatively fast analysis of the bug. Third, the popularity, accessibility, and dynamics of discussions on the Internet exerted the relevant pressure on Intel to change its policy regarding the replacement of buggy Pentium processors. At the time of the events, a lot of jokes circulated (see [4, 11]) and, similar to memes nowadays, the logo "Intel inside" was cartooned as "error inside," "bug inside," or simply considered a warning label.

6.1.7 Other Bugs in This Field of Application

• In 1982, the IBM personal computer showed the following calculation: 0.1/10 delivered the result 0.001. This behavior was caused by an error in the conversion from binary to decimal form. The calculation itself was correct in binary form, but under certain conditions the displayed result in decimal form was wrong (see [27]). The bug was fixed after some discussions between clients and IBM by an update of the operating system.

• Note that another major bug exists related to the Pentium processor arithmetic, called DAN-0411: As described in more detail in Sec. 2.1.7, the Pentium II and the Pentium Pro failed to set an error flag when converting a large negative floating-point number to integer format.

• In March 2018, it became public that NVIDIA's Titan V GPU suffered from a specific problem. Running identical scientific simulations multiple times could deliver different numerical results (see [30]). But the different runs should deliver the same output values for each run; this property is an important and expected behavior for scientific calculations. Furthermore, this also suggests that some of the computed results are wrong. As of now, there exists no final explanation for this behavior. Experts presume that the problem could be caused by read errors from memory or a design bug.


Bibliography

[1] American Mathematical Society (AMS). Notices of the American Mathematical Society, volume 55(11). AMS, 2008. http://www.ams.org/journals/notices/200811/

[2] D. Atkins. Higher-Radix Division Using Estimates of the Divisor and Partial Remainders. IEEE Transactions on Computers, C-17(10):925–934, 1968
[3] R. Bryant. Bit-Level Analysis of an SRT Divider Circuit. In 33rd Design Automation Conference Proceedings, 1996, pages 661–665. ACM, 1996
[4] M. R. Chapman. In Search of Stupidity: Pentium. In In Search of Stupidity: Over Twenty Years of High Tech Marketing Disasters. Apress, 2nd edition, 2007
[5] D. D. Chen and G.-J. Ahn. Security Analysis of x86 Processor Microcode. Technical report, Arizona State University, 2014. http://lux.dmcs.pl/ak/2014_paper_microcode.pdf

[6] E. M. Clarke, S. M. German, and X. Zhao. Verifying the SRT Division Algorithm Using Theorem Proving Techniques. Formal Methods in System Design, 14(1):7–44, 1999
[7] J. Cocke and D. Sweeney. High Speed Arithmetic in a Parallel Device. Technical report, IBM, 1957
[8] T. Coe. Inside the Pentium FDIV Bug. Dr. Dobb’s Journal, 20(4):129–135, 1995
[9] T. Coe, T. Mathisen, C. Moler, and V. Pratt. Computational Aspects of the Pentium Affair. IEEE Computational Science and Engineering, 2(1):18–30, 1995
[10] T. Coe and P. T. P. Tang. It Takes Six Ones to Reach a Flaw. In Proceedings of the 12th Symposium on Computer Arithmetic, pages 140–146. IEEE Comput. Soc. Press, 1995
[11] D. Deley. The Pentium Division Flaw - Chapter 9. date accessed: 2018-07-31. http://daviddeley.com/pentbug/pentbug9.htm

[12] A. Edelman. The Mathematics of the Pentium Division Bug. SIAM Review, 39(1):54–67, 1997
[13] A. Edelman. The Intel Pentium division flaw. date accessed: 2018-01-19, 2004. math.mit.edu/~edelman/talks/old/pentium.ppt

[14] J. Fandrianto. Algorithm for High Speed Shared Radix 4 Division and Radix 4 Square-Root. In 1987 IEEE 8th Symposium on Computer Arithmetic (ARITH), pages 73–79. IEEE, 1987
[15] M. J. Flynn and S. S. Oberman. Minimizing the Complexity of SRT Tables. In Advanced Computer Arithmetic Design, Chapter 7. John Wiley & Sons, Inc., 2001
[16] D. Goldberg. Computer Arithmetic. In Computer Architecture: A Quantitative Approach, Appendix J, pages J1–J73. Morgan Kaufmann, 5th edition, 2011
[17] T. Jackson. Inside Intel: How Andy Grove Built the World’s Most Successful Chip Company. HarperCollins, 1997


[18] W. Kahan. A test for SRT division. date accessed: 2017-12-14, 1995. https://www.cs.indiana.edu/classes/p415-sjoh/readings/Pentium/Kahan-SRTest.txt

[19] M. Kaufmann, J. S. Moore, D. M. Russinoff, W. A. Hunt, S. Swords, J. Davis, and A. Slobodova. Design and Verification of Microprocessor Systems for High-Assurance Applications. Springer US, 2010
[20] P. Koppe, B. Kollenda, M. Fyrbiak, C. Kison, R. Gawlik, C. Paar, and T. Holz. Reverse Engineering x86 Processor Microcode. In 26th USENIX Security Symposium, USENIX Security 2017, Vancouver, BC, Canada, pages 1163–1180, 2017
[21] J. Markoff. Company News; Flaw Undermines Accuracy of Pentium Chips. The New York Times, Nov. 24, 1994
[22] C. Moler. A tale of two numbers. date accessed: 2017-12-19, 1995. https://de.mathworks.com/content/dam/mathworks/tag-team/Objects/a/72895_92024v00Cleve_Tale_Two_Numbers_Win_1995.pdf

[23] C. Moler. Pentium division bug documents. date accessed: 2018-06-25, 2005. https://de.mathworks.com/matlabcentral/fileexchange/1666-pentium-division-bug-documents

[24] S. K. Moore. Intel Makes a Big Jump in Computer Math. IEEE Spectrum, Feb. 2008
[25] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres. Handbook of Floating-Point Arithmetic. Birkhäuser Boston, 2010
[26] T. R. Nicely. Pentium FDIV flaw FAQ. date accessed: 2018-01-17, 2011. http://www.trnicely.net/pentbug/pentbug.html

[27] I. Peterson. Fatal Defect: Chasing Killer Computer Bugs. Vintage Books, reprint edition, 1996
[28] V. Pratt. Anatomy of the Pentium Bug. In P. D. Mosses, M. Nielsen, and M. I. Schwartzbach (editors), TAPSOFT’95: Theory and Practice of Software Development, Volume 915 of Lecture Notes in Computer Science, pages 97–107. Springer, 1995
[29] D. Price. Pentium FDIV Flaw - Lessons Learned. IEEE Micro, 15(2):88–86, 1995
[30] K. Quach. 2+2=4, er, 4.1, no, 4.3... Nvidia’s Titan V GPUs spit out “wrong answers” in scientific simulations. The Register. date accessed: 2018-07-26, 2018. http://www.theregister.co.uk/2018/03/21/nvidia_titan_v_reproducibility/

[31] J. E. Robertson. A New Class of Digital Division Methods. IEEE Trans. Electronic Computers, 7:218–222, 1958
[32] H. Ruess, N. Shankar, and M. K. Srivas. Modular Verification of SRT Division. Formal Methods in System Design, 14:45–73, 1999
[33] H. P. Sharangpani and M. L. Barton. Statistical Analysis of Floating Point Flaw in the Pentium Processor (1994). Technical report, Intel, 1994


[34] K. G. Tan. The Theory and Implementation of High-Radix Division. In Proceedings of the 4th IEEE Symposium on Computer Arithmetic, pages 154–163, 1978
[35] K. D. Tocher. Techniques of Multiplication and Division for Automatic Binary Computers. Quart. Journ. Mech. and Applied Math., 11(3):364–384, 1958
[36] Wikipedia. Microcode. date accessed: 2018-03-16. https://en.wikipedia.org/wiki/Microcode

6.2 Mars Rover Spirit Flash Memory Problem

Between the great things we cannot do and the small things we will not do, the danger is that we shall do nothing. —Adolph Monod

6.2.1 Overview

• Field of application: space flight
• What happened: unwanted reboots of the control computer
• Loss: time and, thus, data due to fewer days of operation
• Type of failure: overflow related to file system and storage management

6.2.2 Introduction

The Mars rover Spirit, or MER-A, landed on January 4, 2004, and worked successfully for more than 5 years, delivering images and analyzing rocks and soil (for a sketch of the rover, see Fig. 6.4). (In NASA’s Mars Exploration Rover (MER) mission, twin vehicles were landed within an interval of 3 weeks on opposite sides of Mars: MER-A (Spirit) and MER-B (Opportunity).) The rover got stuck in soft soil on May 1, 2009, and the mission was finally declared finished on May 25, 2011 (see [3, 16]). The goal of the MER mission was to analyze the surrounding minerals, rocks, and soil and to help reveal the geologic processes and atmospheric conditions on Mars. The rover was equipped with various cameras and spectrometers. Its CPU had access to 120 MB RAM and an additional 256 MB of flash memory as an extension (cf. [17]); flash memory technology is also the basis of USB sticks, SD cards, and solid state disks (SSDs). The rover operated on VxWorks 5.3.1, developed by WindRiver Systems (see [17, 19]). Communication from Spirit to Earth was possible indirectly via the orbiting vessels (Mars Global Surveyor or Odyssey) by using a low-gain antenna, and directly via a low-gain low-data-rate antenna and a steerable high-gain antenna. Each day, three communication windows were used for communication with the orbiters. Direct-to-Earth contact was scheduled for 09:00 local solar time for data download and for uploading the sequences for the daily activities (see [10]). The rover was ordered to stand by through the night on Mars with its electronics turned off but the mission clock and the battery charger board still active.

After landing, the rover ceased communication on January 21, 2004, and got stuck in a cycle of repetitive reboots. The flight software development team started testing and analyzing the rover, and on January 26, they identified the problem as related to the on-board flash memory.


Figure 6.4. Sketch of the Mars Exploration Rover vehicle. Courtesy of NASA.

The situation was critical since the energy supply suffered from the repeated rebooting, but fortunately the power system was still able to provide the necessary support. After 14 days, the team was able to resolve the problem, and the rover was made fully functional again. Hence, only an indirect loss occurred, consisting of missed days of scientific measurements during the mission.

6.2.3 Timeline

In the following, some date indications refer to days on Mars, indicated by sol (one sol is equivalent to 24 hours and 37 minutes), enumerated starting with the landing of the probe. Note that a very detailed timeline of the events and JPL’s measures is contained in [10].

On sol 18 (i.e., January 21, 2004), problems with Spirit arose for the first time. The signals from the rover became spotty and stopped earlier than expected. Poor weather at the ground station in Canberra, Australia, was suspected initially, but during that day and due to further communication problems, it became obvious that problems on the rover itself were the cause of the unexpected behavior. An anomaly team was formed to find the cause of the “very serious anomaly” (see [10]). The next day (sol 19), the rover did not respond to the communication tests initiated by the team but sent a single signal at a special up-link rate indicating a major fault.

On sol 20, Spirit’s behavior changed slightly. In the morning, it autonomously started to send a telemetry packet containing useless data, which was also sent in later attempts to obtain data from the rover. In the subsequently commanded communication window, telemetry data were obtained for the full length of the window duration. No recorded scientific data were available, which indicated a problem with the flash memory. In addition, the telemetry showed Spirit being caught in reset/reboot cycles. Luckily, the flight software was designed to delay subsequent reboots if a reset was necessary during an initialization of the software from a previous reboot.


Figure 6.5. Visualization of the delayed resets following [10]. After an error occurring early in the initialization of the flight software, the next reset was delayed by 15 or 60 minutes. This delayed reset functionality allowed for analysis and intervention by the control team on Earth.

This so-called delayed reset was scheduled with a pattern of two 15-minute delays followed by one 60-minute delay (see Fig. 6.5 for a visualization). Therefore, commands from ground control could be processed within the delay windows. Subsequent analysis suggested a low battery status and that the rover had possibly not gone to sleep overnight since sol 19.

On sol 21, a command was sent to put the rover into a “crippled mode” that would not use the flash file system. This way, the continuous resets and the danger of running out of power were stopped, and a shutdown for 24 hours was performed. The restart in crippled mode needed manual intervention and prohibited the use of the flash memory, which reduced the achievable mission goals, but at least control of Spirit was regained. After experimenting with crippled mode and the flash file system, the flash file system was formatted on sol 32, erasing all old data and wiping the flash memory clean. This solved the problem, led to normal behavior of the system, and allowed for successful completion of the mission (see [1, 10, 19]).

6.2.4 Detailed Description of the Bug

Data are stored as a sequence of bits on every type of memory. But the operating system needs a way to translate this sequence of bits into meaningful information. Therefore, a rule is needed to describe the beginning and end of contiguous bits representing a file in memory. In the original DOS file systems used by Microsoft, e.g., this is achieved by using a so-called File Allocation Table (FAT; cf. [14, 18]) and files containing the description of the directory and file structure (in the following, the ensemble of these entities will be called the FAT management structure). FAT systems such as FAT12, FAT16, and FAT32 exist, differing mainly in the underlying bit length of a table element. Nowadays, FAT systems have mostly been replaced by the superior New Technology File System (NTFS). In the context of the MER mission, the FAT system was used for storing data on the flash file system supported by VxWorks (see [15, 17]). A flash memory retains data even when power is off and is widely used in digital cameras and smartphones. The FAT system has the disadvantage that deleted files still appear as entries in the describing entities and are only marked as deleted. Therefore, the FAT management structure can never shrink and grows with every new file.


Figure 6.6. Visualization of the reboot cycle problem (following a sketch by our student L. Eichhorn): A task requesting RAM memory fails due to unavailable free system memory; the task is silently suspended, which is detected by a “health check” function initializing a (delayed) reboot. When starting freshly, the flash memory file image still consumes all free RAM memory leading to the next cycle iteration. The development team was able to analyze and solve the problem by forcing a reboot without mounting the flash memory via “crippled mode.”

When making the flash memory accessible on a running system, the management structure is typically imported into the RAM to represent the file system structure (directories, subdirectories, number of files in a directory, etc.). In the case of Spirit’s problems on sol 18, the imported FAT management structure of the flash memory simply did not fit into the relatively small amount of free system RAM anymore (cf. [1]). This caused a silent suspension of one of the file system tasks. A “software health” function that checked for suspended tasks then initiated a reset of the avionics electronics (including the flight computer) and the reinitialization of the flight software. Each reset tried to mount the flash file system anew, leading to a cycle of reboots as illustrated in Fig. 6.6.

The silent suspension of file system tasks was caused by two configuration errors in two different libraries used with the FAT system on Spirit. In the flight software of the MER mission, commercial operating systems were used that relied on many bundled libraries. In particular, a library referred to as the DOS library was very convenient because it supported using a DOS file system structure. The DOS library initially used a private memory area of 256 KB in RAM in the MER missions. A configuration parameter decided whether or not this private memory area could be incremented dynamically by using free system memory; this parameter was wrongly set to “yes.” A second configuration parameter was available to limit this incrementation to a maximum amount of memory. Unluckily, however, no such limit had been specified.

The second library involved in the problem was the Mem library, which provided general memory management services. The configuration of this library controlled, among other things, the reaction in case of missing free memory. The ideal way would have been to fail the specific file system activity task in progress (such as creating, opening, or writing files). The Mem library on Spirit, however, was configured to silently


suspend the corresponding file system activity task, which—via the “software health” functionality—led to the problematic reboot cycle. A schematic sketch of this cycle is given below.

The configuration of the above-mentioned parameters was intended to be temporary until more experience and data could be gathered to improve the setup. This investigation, however, was never completed (due to a compressed schedule and constant reprioritization of activities, according to [10]), and the initial configuration values simply remained in the software. After recovery by manually formatting Spirit’s flash memory, the team actually changed the configuration of the system to monitor and control the free memory space in order to avoid the FAT management structure becoming too big.

It is interesting that the problem was intensified, if not caused, by a large number of files stemming from the launch load of Spirit’s software, which was overwritten by an upload during the mission (see [17]). After the upload of the new software to Spirit, the team discovered that the old files still existed in flash memory and needed to be deleted, since the new ones were contained in newly generated directories (and had not just overwritten the old ones). The team developed a utility to identify and delete the obsolete directories and uploaded it to the rover on sol 15. Unluckily, only one of the two parts of this code was successfully uploaded. The second, missing part was scheduled for upload on sol 19—about one day too late.
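The following toy simulation (ours, in Python; all sizes, names, and the health-check logic are invented, and only the overall mechanism follows the description above and [1, 10]) illustrates why every reboot ran into the same suspended task again and why “crippled mode” broke the cycle:

    # Schematic sketch, not flight software: Spirit's reboot cycle in miniature.
    SYSTEM_RAM_FREE = 20        # free RAM available to the file system (arbitrary units)
    FAT_STRUCTURE_SIZE = 35     # FAT management structure has grown beyond the free RAM

    def mount_flash():
        """Import the FAT management structure into RAM; may exhaust free memory."""
        if FAT_STRUCTURE_SIZE > SYSTEM_RAM_FREE:
            return "task silently suspended"   # misconfigured reaction to missing memory
        return "ok"

    def boot(crippled_mode=False, max_cycles=5):
        for cycle in range(1, max_cycles + 1):
            if crippled_mode:
                print(f"cycle {cycle}: flash not mounted -> system stays up")
                return
            status = mount_flash()
            if status == "ok":
                print(f"cycle {cycle}: nominal boot")
                return
            # The "software health" check finds the suspended task and schedules
            # a delayed reset -- 15 or 60 minutes on the real rover.
            print(f"cycle {cycle}: {status} -> health check -> delayed reboot")

    boot()                      # endless (here: bounded) reset/reboot cycle
    boot(crippled_mode=True)    # manual intervention: skip the flash file system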

6.2.5 Similar Bugs

• In 2014, the Mars rover Opportunity (Spirit’s twin rover, which is still in operation at the time of writing) also suffered from computer resets caused by worn-out cells in the flash memory. Over time (see [6, 8]), many attempts were carried out to fix the problem. The basic idea was to rely on reformatting the flash memory (cf. [7]). NASA’s JPL team tried reformatting for the first time on September 4, 2014, after severe reset problems that prohibited all scientific activities. The flash memory was reformatted again twice, on December 6, 2014, and March 19, 2015. In each of the three cases, the rover seemed to work nicely but started to suffer from resets again after a couple of days. Continuous investigations have been run to try to restore the flash, but without success. The solution consists of running Opportunity by default with RAM memory only, completely ignoring the flash system.

• Railway Switch Tower, Hamburg Altona, March 1995: In Hamburg Altona, Germany, the DB (Deutsche Bundesbahn or German Railway) replaced the outdated electric railway switch tower with a fully electronic system delivered by Siemens. The so-called SIMIS 3216 setup featured, besides many other components, a 16-bit central computer system (BAR16) responsible for processing and sending control commands (see [9]). Contrary to analog control, safety was realized in this setup with a “2-out-of-3” design: Two identical computers always ran in parallel (with a third one being able to jump in as a backup), doing the same computations; actual control commands were issued only if both computational results were identical (cf. [4, 9]). The new switch tower replaced 49 jobs and controlled 160 switches, 250 signals, and 215 direct-current circuits (see [2, 13]). Hamburg Altona is the starting and destination station for many trains in Germany, in particular for the fast Intercity Express (ICE) trains. More than 30,000 travelers use the station per day. On the evening of Sunday, March 12, 1995, the new computerized control system was put into service. The next morning, the system shut down at 05:00, 07:00, and


09:00 for safety reasons (see [5]). Within 10 minutes, the system had restarted, but these delays caused major trouble. Therefore, DB decided to close down the whole station. Passengers and trains had to start from other stations, which caused train delays all over Germany. It took the Siemens experts until Wednesday to find the reason and get the system running stably again.

The main cause of the problem turned out to be inappropriate exception handling in the case of the relatively rare event of a stack overflow in memory (which, allegedly, was not expected by the developers to occur at all; cf. [2, 11]). Note that the literature is sparse and inconsistent with respect to the real details of the underlying causes; sources containing relevant information are [2, 4, 5, 11, 12, 13]. Our interpretation of all available data is the following. The BAR16 system shut down due to a dead-loop in the part of the code responsible for handling stack overflows. A stack data structure was used and initially designed with a certain size, which was assumed to be sufficient but turned out to be a bit too small at peak traveling times. The resulting stack overflow was handled by the corresponding routine, which acquired an additional 500 bytes of memory. Reports in the literature differ on the actual absolute numbers: While the authors of [2, 11, 12, 13] mention a demanded stack increase from 3,500 bytes to 4,000 bytes, an increase from 2,500 to 3,000 bytes is indicated in [5]. The increase by 500 bytes seems to be considerably more than what was actually needed (see [2]), which would still have been physically available; it is not clear why this happened: A safety margin is stated in [2], or maybe it is related to the “two faulty bits“ mentioned in [13] or to the two faulty (decimal?) digits indicated in [11]. In any case, the additional 500 bytes were not available, the stack could not be enlarged, and this finally led to the shutdown of the control program.
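Because the published details are inconsistent, the following sketch (ours, in Python; the stack sizes follow [2, 11], while the physical limit and the shutdown policy are assumptions) only illustrates the reported mechanism: an overflow handler that always demands a fixed 500-byte increase and triggers the fail-safe shutdown when that request cannot be satisfied.

    # Illustrative sketch only, not the BAR16 code.
    PHYSICAL_LIMIT = 3800            # assumed: bytes actually available for the stack

    class SafetyShutdown(Exception):
        pass

    class FixedStack:
        def __init__(self, size=3500):
            self.size = size         # initial design size, too small at peak times
            self.used = 0

        def handle_overflow(self):
            # Always request 500 additional bytes, regardless of how little is missing.
            requested = self.size + 500
            if requested > PHYSICAL_LIMIT:
                # Fail-safe reaction of the safety design: stop the control program.
                raise SafetyShutdown(f"stack cannot grow to {requested} bytes")
            self.size = requested

        def push(self, nbytes):
            if self.used + nbytes > self.size:
                self.handle_overflow()
            self.used += nbytes

    stack = FixedStack()
    try:
        for _ in range(40):          # peak traffic: more nested entries than foreseen
            stack.push(100)
    except SafetyShutdown as e:
        print("control system shut down:", e)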

Bibliography

[1] G. S. Berkowitz. Re: NASA Spirit nearly done in by DOS. date accessed: 2018-03-07, 2004. https://catless.ncl.ac.uk/Risks/23.52.html#subj12
[2] K. Brunnstein. About the “Altona Railway software glitch.” date accessed: 2018-03-07, 1995. https://catless.ncl.ac.uk/Risks/16.93.html#subj1
[3] T. Greicius. Mars rovers overview. date accessed: 2018-02-27, 2017. https://www.nasa.gov/mission_pages/mer/overview/index.html

[4] W. Mehl. Deutsche Bahn erwägt, Schadensersatz zu fordern. Siemens-Rechner legt Stellwerk in Hamburg fuer zwei Tage lahm. Computerwoche, Mar. 1995. https://www.computerwoche.de/a/deutsche-bahn-erwaegt-schadensersatz-zu-fordern-siemens-rechner-legt-stellwerk-in-hamburg-fuer-zwei-tage-lahm-cw-bericht-walter-mehl,1112982

[5] F. Möcke. Alle Räder stehen still. c’t, 5, 1995. https://shop.heise.de/katalog/alle-rader-stehen-still

[6] NASA Jet Propulsion Laboratory. Mars exploration rover mission: All opportunity updates 2014. date accessed: 2018-07-04, 2014. https://mars.nasa.gov/mer/mission/status_opportunityAll_2014.html


[7] NASA Jet Propulsion Laboratory. Memory reformat planned for opportunity Mars rover. date accessed: 2018-01-30, 2014. https://www.jpl.nasa.gov/news/news.php?feature=4275

[8] NASA Jet Propulsion Laboratory. Mars exploration rover mission: All opportunity updates 2015–2018. date accessed: 2018-07-04, 2015. http://mars.nasa.gov/mer/mission/status_opportunityAll.html#sol2600

[9] E. Preuß. Stellwerke deutscher Eisenbahnen seit 1870 (Typenkompass). transpress, 2012
[10] G. Reeves and T. Neilson. The Mars rover Spirit FLASH anomaly. IEEE Aerospace Conference Proceedings, 2005
[11] Spiegel. Learning by doing. Der Spiegel, 14, 1995. http://www.spiegel.de/spiegel/print/d-9180321.html

[12] D. Weber-Wulff. More on German train problems. date accessed: 2018-03-07, 1995. http://catless.ncl.ac.uk/risks/17.02.html#subj3
[13] Wikipedia. Bahnhof Hamburg-Altona. date accessed: 2018-03-07. https://de.wikipedia.org/wiki/Bahnhof_Hamburg-Altona

[14] Wikipedia. Design of the FAT file system. date accessed: 2018-07-04. https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system

[15] Wikipedia. Flash memory. date accessed: 2018-03-13. https://en.wikipedia.org/wiki/Flash_memory

[16] Wikipedia. Spirit (rover). date accessed: 2018-03-07. https://en.wikipedia.org/wiki/Spirit_(rover)

[17] R. Wilson. The trouble with rover is revealed. date accessed: 2018-01-04, 2004. https://www.eetimes.com/document.asp?doc_id=1148448

[18] WindRiver Systems. VxWorks programmer’s guide. date accessed: 2018-07-26, 1998. https://userweb.jlab.org/~brads/Manuals/VxWorks/vxworks_programmers_guide_5-3-1.pdf

[19] WiredStaff. No life on Mars, but many bugs. date accessed: 2018-01-30, 2004. https://www.wired.com/2004/01/no-life-on-mars-but-many-bugs/

Chapter 7

Complexity

Software is becoming more and more elaborate and voluminous. Considering operating systems (OS), Windows XP had about 45 million lines of code, Mac OS X 10.4 about 86 million lines, and the Linux kernel 3.6 about 15.9 million lines of code (see https://en.wikipedia.org/wiki/Source_lines_of_code). This increases the probability and the number of errors in a program and, thus, also the vulnerability of the OS. Especially in the beginning of the computer era, software was considered safe and reliable compared to hardware. This attitude led to accidents in which medical devices killed people due to wrong radiation settings, as in the case of the infamous Therac-25.

Furthermore, software has to deal with more and more complex problems, which aggravates the process of developing new software. The complexity can simply be related to the number of tasks and the size of the necessary code, or it can be related to the mathematical complexity of a subproblem that is nearly impossible to solve exactly in reasonable time. For the newly designed Denver International Airport, underestimating the complexity of the planned automatic baggage-handling system caused a 16-month delay of the opening in the 1990s.

Complexity also brings up the question of which types of problems can be solved by computer programs at all. This might depend on the complexity of the task, the limited time, the required level of reliability, the missing possibility for testing, political reasons, or whether human lives are at risk. Following such considerations, parts of the computer science community challenge the usefulness of ballistic missile defense systems such as SDI (Strategic Defense Initiative) and its successor systems.

7.1 Therac-25

Anything that can go wrong, will—at the worst possible moment. —Finagle’s Law of Dynamic Negatives

7.1.1 Overview

• Field of application: medical engineering
• What happened: malpractice by radiation overdose
• Loss: at least two persons killed and four badly injured due to overdose
• Type of failure: error-prone software design in the context of a complex system (race conditions, integer overflow)

7.1.2 Introduction

Computers have been used in medicine since the 1950s in different applications such as (see [24])

• administration of hospitals and the creation of patients’ digital medical records;
• establishing advanced imaging systems such as computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), ultrasound, etc.;
• treatment planning and support during surgery;
• expert systems or clinical decision support systems for assistance in diagnostics.

The first use of computers in medicine dates back to 1954 and the development of a computerized cytoanalyzer for examining mass cells for signs of cancer (see [3]). In 1974, the first computer-assisted dose planning program for the treatment of brain tumors was introduced (gamma knife), as well as the first clinical CT scanners. Improving health care through the use of IT is also accompanied by additional risks such as swapping or confusing patient data, or showing artifacts. According to D. Warren, anesthesiologist at Virginia Mason Medical Center Seattle, “there is a problem with relying on computers to reduce errors in the sense that computers still rely on human input. And the human input is the continued source of error, even in a ‘perfect’ computer system” [3].

Some of the most serious accidents caused by questionable and unreliable software are related to radiation therapy using the Therac-25 system (see [23, 26]). Treatment with the Therac-25 was achieved by accelerating electrons to generate a high-energy beam that was directed at a tumor located in superficial tissue. For tumors in deeper tissue, the electron beam was converted into an X-ray beam using a tungsten target in order to spare healthy tissue in outer layers. The Therac-25 had been developed based on the previous accelerators Therac-6 and Therac-20, generating beams of a maximum energy level of 6 and 20 mega-electronvolts (MeV), respectively. These older devices had been developed as a joint project of the Atomic Energy of Canada Limited (AECL) and the Parisian Compagnie Générale de Radiologie (CGR), based on CGR’s linear accelerators Neptune and Sagittaire. Therac-6 and Therac-20 were augmented with computer control, but the machines could also be handled without the computer, which just provided additional convenience. Therac-20 was capable of dual-mode electron or X-ray application, in contrast to the sole X-ray mode of Therac-6. The Therac-25 project had been promoted by AECL since the mid-1970s, based on a new “double pass” concept that allowed for a reduction in the size of the accelerator. The dual-mode concept comprised 25 MeV photons in X-ray mode or electron beams of 5–25 MeV, delivered by the same device.

On the one hand, the software of the Therac-25 played a considerably more crucial role in maintaining safety in all operations compared to its precursors:

Hardware interlocks and backups were omitted and replaced by software controls. On the other hand, software programs of Therac-6 and Therac-20 were also reused in Therac-25 (see [12]). A first prototype of the Therac-25 was produced in 1976, and the final version was available in 1982. Five machines were installed in the US, in particular at

• Kennestone Regional Oncology Center, Marietta, Georgia;
• Yakima Valley Memorial Hospital, Yakima, Washington;
• East Texas Cancer Center, Tyler, Texas;

and six in Canada; one of these six was at the Ontario Cancer Foundation in Hamilton, Ontario. Between 1985 and 1987, six documented incidents of massive overdoses in the treatment of patients occurred in the above-mentioned hospitals, causing at least two deaths and four serious injuries. In view of these accidents, the software, design, quality control, and manual of the Therac-25 had to be revisited and changed. Since these changes, no Therac-25 accidents have been reported.

The following sections are based mainly on the contributions by N. Leveson ([11, 12, 13]), who acted as an expert witness for the prosecution in some of the cases. Together with C. Turner, she spent more than three years collecting information and reconstructing the accidents (see [16]).

7.1.3 Description of the Therac-25

To understand the software problems of the Therac-25 system, physical aspects of radiation therapy are relevant; they are briefly discussed in Excursion 21 below. Readers familiar with these aspects may skip directly to the end of the excursion.

Excursion 21: Basic Aspects of Radiation Therapy

The main goal of medical treatment via radiation is to kill specific cells in a certain, confined area of tissue. Typically, this is used to fight cancer. The challenge is to make sure that all relevant cancer cells are killed while keeping the collateral damage to normal cells to a minimum. To this end, high-end quality medical imaging is important to identify the target area as accurately as possible. For the actual treatment, two major modes are typically used: electron beams with lower energy levels for superficial treatment, and X-ray radiation to target cells deeper in the corresponding body part. Machines for both types of treatment comprise a particle accelerator for electrons to create a concentrated beam of high-energy electrons, necessary for the ionizing effects that break chemical bonds in molecules, produce free radicals, and cause the more effective DNA double-strand breaks by direct interaction. In electron treatment, this bundled beam is scattered by magnets to treat the desired surface area. For X-ray treatment, the concentrated electron beam from the accelerator contains considerably more electrons and is led onto a metal target material to create high-energy photons.

A lot of expertise is crucial for setting up a correct or fitting treatment plan. Such specifications typically comprise the number of radiation treatments, their intervals, and their intensity. By splitting up a single treatment into several (the so-called fractions), healthy cells have the chance to regenerate, whereas tumor cells would need much longer pausing intervals.


The intensity of a radiation treatment is specified by both the energy of the electrons in the concentrated beam, which is typically measured in mega-electronvolts (MeV), and the beam current, which indicates the number of electrons crossing a specific cross-sectional area in a given time window. For treatment prescriptions, a measure for the radiation dose is necessary to describe the effects of the ionizing radiation in matter (see [4, 10] for details on units, etc.). In the context of the Therac-25 problems, the unit rad is heavily used in the literature. It indicates an absorbed radiation dose and has been replaced by the SI unit gray (Gy) but is still used to some extent in the United States. (SI, from the French term Système International d’Unités, denotes the International System of Units, the modern form of the metric system.) 100 rad is equivalent to 1 gray, which corresponds to 1 joule per kilogram. Concerning the order of magnitude when talking about radiation doses, it is important to note that single therapeutic doses are typically on the order of about 200 rad. If delivered to the whole body, about 1,000 rad can be lethal, and “500 rads is the accepted figure for whole-body radiation that will cause death in 50 percent of the cases” ([12], p. 10).
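To put the dose estimates reported for the accidents below into perspective, a quick conversion of our own, using only the definitions above, may help:

    100 rad = 1 Gy = 1 J/kg
    typical single therapeutic dose: about 200 rad = 2 Gy
    estimated overdose in the Tyler accidents: 16,500–25,000 rad = 165–250 Gy,
    i.e., roughly 80 to 125 times a typical therapeutic dose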

The Therac-25 system incorporated a separate room containing a treatment table, the radiation unit or gantry, a microphone, a video camera, and an intercom device (see Fig. 7.1 for a sketch of the layout). The control unit was located outside the treatment room, comprising the display terminal and the TV monitor for the operator.

Figure 7.1. Layout of a Therac-25 system: Sketch of the treatment room and operating devices. Reprinted by permission of Pearson Education, Inc., New York, New York [12, Fig. A.5, p. 549].

The layout of Therac’s accelerator device is sketched in Fig. 7.2 (see [21] or [1] for a similar sketch). It produced a high-energy electron beam of up to 25 MeV and was horizontally aligned in the upper “arm” of the Therac-25 machine (cf. Fig. 7.1). Bending magnets were used, among other things, to send the electron beam through a 270° turn, perpendicular to the arm, for the actual treatment of patients. To enable Therac’s dual-mode functionality (i.e., electron and X-ray therapy) and to adjust the positioning of patients, relevant equipment was rotated into the path of the highly concentrated beam after the 270° turn.


Figure 7.2. Layout of beam generation in the Therac-25: electrons are accelerated to very high velocities in a horizontally aligned device. The resulting electron beam is then bent by magnets through a 270-degree turn into the actual treatment direction. Reprinted with permission from IEEE [21].

This was realized by a turntable that was maneuvered by microswitches to move it into the three different relevant positions (see Fig. 7.3 for a sketch of the turntable). For electron mode, the scanning magnets were moved into the proper position by the computer. Additionally, operator-mounted electron trimmers could be used for beam shaping. In X-ray mode, a target and a beam flattener were placed into the electron beam in order to produce a uniform treatment field of photons. Because of these additional energy-reducing devices, the beam’s energy level always had to be set to the maximum value of 25 MeV and also to a considerably higher beam intensity in X-ray mode. The so-called field light position was the third choice for the turntable placement. In this position, light was sent through mirroring devices to imitate the intended beam and to facilitate the correct positioning of the patient.

At the beginning of each treatment, the operator manually positioned the patient on the treatment table, set the field size and gantry rotation, and attached additional accessories. Then she/he left the treatment room to continue operating from the display terminal, where additional data had to be specified, such as the patient identification, the treatment prescription (beam type, energy level, dose, dose rate, and time), the gantry rotation, and the field sizing (calibration). Afterwards, the system compared the manually entered values in the treatment room to those entered at the terminal. If the values matched, the treatment was permitted, which was indicated by a verified message. Otherwise, the treatment was blocked, and the mismatch had to be resolved. The screen layout of Therac’s display terminal is shown in Fig. 7.4.

A few specific aspects are important to understand what happened during the documented incidents:

• Sending a maximum-energy electron beam intended for X-ray mode in a wrong turntable position, with the beam flattener missing, would lead to a high overdose. Hence, controlling Therac’s dual-mode facility had to be done very carefully.

• Instead of traditional hardware interlocks, mainly software was in charge of cross-checking the proper settings.

• Entering the patient data again at the terminal turned out to be a time-consuming procedure in actual treatment. Following user requests, AECL soon modified the setup to allow copying of the treatment-site data to the terminal by simply using the carriage return key a couple of times. In this way, the data entry could be


Figure 7.3. Schematic representation of the turntable used in the Therac-25: Depending on the rotational position of the turntable, three different use cases could be applied: The scan magnet of the electron mode allowed the electron beam to be fanned out for treatment of external layers, the flattener enabled a suitable X-ray beam to be produced, and the mirror devices allowed regular light to imitate the treatment beam for positioning of patients. Reprinted by permission of Pearson Education, Inc., New York, New York [12, Fig. A.1, p. 518].

Figure 7.4. Screen layout of the Therac-25 display terminal. The COMMAND text field at the bottom right was used to actually start a treatment. Reprinted by permission of Pearson Education, Inc., New York, New York [12, Fig. A.2, p. 519].


performed much faster, but the operator still had to make sure that the entered data were correct.

• Error messages of the Therac-25 software were rather cryptic, often in the form MALFUNCTION XX, with XX being a number from 1 to 64. The manual listed the different malfunctions but typically without any explanation (some of the error messages were originally intended for internal software development only).

• When an error was detected, the Therac-25 was able to shut down in two ways. One was “treatment suspend,” which required a complete machine reset to restart. The other option was “treatment pause,” which restarted immediately after striking the P key to proceed. In this case, the treatment parameters remained valid, and the procedure could be resumed in a convenient way. This was possible five times before the treatment would be suspended.

7.1.4 Timeline

1. The first incident that came to light occurred on June 3, 1985, at the Kennestone Oncology Center in Marietta, Georgia (cf. [17]). A 61-year-old woman was treated with follow-up radiation after the removal of a malignant breast tumor. A 10-MeV electron treatment of the clavicle was intended (no information exists on the intended absorbed dose or other units of prescription for this incident), but a much stronger radiation was most likely applied erroneously. Experts later estimated the one or two doses actually absorbed to be on the order of 15,000–20,000 rad (see [12]). The reason for the malpractice was never fully clarified, but the later incidents with Therac-25 machines suggest a connection. The patient felt a “tremendous force of heat” and later developed a reddening on her back. Finally, due to radiation burns, the woman’s breast had to be removed, and her shoulder and arm were paralyzed. A lawsuit filed against the hospital and AECL was later settled out of court.

2. An undisputed application of an overdose during a Therac-25 treatment was reported to AECL in July 1985. A 40-year-old patient underwent a treatment for carcinoma of the cervix on July 26 at the Ontario Cancer Foundation in Hamilton (Ontario), Canada. Five seconds after the start of the machine, the Therac shut down, displaying the error message HTILT. The machine’s console display indicated that NO DOSE had been applied and that it had switched into TREATMENT PAUSE. Therefore, the operator started a second treatment, pressing the P key to proceed. But again the machine shut down in the same way. The operator repeated this three more times, always with the same behavior of the system. After the shutdown following the five seemingly unsuccessful treatments, the hospital technician was called to check the Therac but found nothing wrong with the system. In the days after the treatment, the patient complained of burning, hip pain, and swelling, and the Therac-25 machine was taken out of service. AECL suspected a hardware failure in a turntable microswitch, but the malfunction could not be reproduced. The hardware was also redesigned accordingly on the other existing machines. But it was never proven that microswitches had been the cause of the malfunction.


The patient’s hip had been heavily damaged, and she died on November 3, 1985, of the treated cancer. The erroneously applied radiation was estimated to have been 13,000–17,000 rad (see [12], p. 524).

3. In December 1985, a woman was under medical treatment at the Yakima Valley Memorial Hospital, Washington, and afterwards developed an excessive reddening of the skin on her right hip, appearing in a parallel striped pattern. She was treated until January 6 because nothing unusual was suspected. After other possible reasons for the striped pattern had been ruled out, the probable cause was identified as the open slots in the blocking trays of the Therac-25; but because the blocking trays had been discarded after the incident, the relation to the pattern could not be analyzed. On January 31, 1986, the staff sent a letter to AECL, which evoked the negative response that “the damage could not have been produced by any malfunction of the Therac-25 or by any operator error” ([12], p. 526). Therefore, the skin reaction was filed as “cause unknown.” The patient survived with minor disability.

4. On March 21, 1986, a patient at the East Texas Cancer Center had his ninth treatment for the removal of a tumor from his back, this time with an intended 22-MeV electron beam of 180 rad (see [2]). The details of each step during this treatment are much better documented than in the other cases. In the treatment room, the patient was placed face down on the table. The experienced operator left the room, closed the door, and started to enter the prescription data of the treatment at the terminal but typed x for X-ray mode instead of e for electron mode. To correct the typo, she used the ↑ key. She hit the return key several times to keep the other, correct entries of the setup unchanged. The input reached the COMMAND field at the bottom right of the terminal (see Fig. 7.4), and the message said BEAM READY. Then she hit the B command for “beam on” to start the treatment. The machine reacted with a shutdown and displayed the message MALFUNCTION 54, also indicating TREATMENT PAUSE. In Therac’s manual, MALFUNCTION 54 was explained as “dose input 2”; only experts at AECL actually knew that this signaled that a too-high or too-low dose had been delivered. Furthermore, the dose monitor display indicated a substantial underdose of 6 monitor units instead of the requested 202 monitor units. In her usual routine of dealing with the tics of the machine, the operator hit the P key to proceed with the treatment. But the same sequence of events reappeared: shutdown, MALFUNCTION 54, indication of a too-weak dose. Unfortunately, on this day there was no connection between the patient in the treatment room and the operator, because the video display was unplugged and the audio monitor was broken. After the first treatment, the patient felt a strong shock on his back and, from his experience with previous treatments, recognized that something had gone wrong. He began to get up from the table to call for help at exactly the moment when the operator hit P again. He felt the delivered beam hit his arm. He stood up and pounded against the door of the treatment room, and the operator immediately opened the door. The hospital physicist investigated the Therac system but found its calibration within specification. The incident was attributed to an electric shock, the patient was sent home, and the machine was used for further treatments for the rest of the day. Later investigations revealed, however, that the patient had received a massive overdose of about 16,500–25,000 rad in less than 1 second on an area of 1 cm². The patient died 5 months after the accident from the complications caused by the overdose. The Therac was shut down for testing the day after the problematic treatment.


But the investigations could not reproduce MALFUNCTION 54, and the machine was put back into service on April 7, 1986.

5. A second serious accident at the East Texas Cancer Center happened on April 11, 1986 (see [2, 19]). A patient had to receive an electron treatment for skin cancer on the side of his face. After preparing the treatment room, the operator entered the patient data with the wrong mode. Again, she used the ↑ key for correctional editing and changed the mode to electron mode. Then the operator pressed the return key several times to move the cursor to the bottom of the screen to BEAM READY in order to start the treatment. The machine shut down immediately with a loud noise. Again, MALFUNCTION 54 appeared, and via the intercom she could hear moaning. The patient removed the tape that had held his head in position and announced that something was wrong. Three weeks after this accident, the patient died from the overdose. An autopsy revealed a high-dose radiation injury to the temporal lobe and the brain stem. After this incident, the machine was taken out of service. Investigations at the hospital revealed that the speed of the corrections at the terminal was crucial for causing the wrong treatment, and the hospital physicist was able to repeat the procedure by editing at a fast pace; later on, AECL could also reproduce the malfunction.

6. On January 17, 1987, a second accident happened at the Yakima Valley Memorial Hospital. A patient with a carcinoma was scheduled to receive two X-ray film verification exposures of 4 and 3 rad, plus a 79-rad X-ray treatment. After the film exposures, the operator entered the room in order to prepare the treatment. Via the field-light position, he checked the position of the beam. Then he set the command to return the turntable into the treatment position. But he forgot to remove the film underneath the patient. When the display showed BEAM READY, he hit B for beam on. But no dose was indicated on the display, and the unit shut down with a pause. Hence, the operator pushed the P key to proceed with the treatment. Again, the machine paused, but this time the operator heard something from the patient over the intercom. The console showed only the applications of the two film exposures, but the patient claimed “he felt a burning sensation” ([12], p. 541) and later developed a reddening of the skin in the shape of a parallel striped pattern on the whole area of treatment. All tests performed by AECL in a follow-up investigation indicated no problems with the machine. The suspected problem of an activated electron beam with the turntable in field-light position could not be reproduced. The patient might have been exposed to a total of about 8,000–10,000 rad. AECL informed all users how to manually avoid this problem. The hospital’s physicist performed additional tests by directing the electron beam in X-ray mode onto a film with the turntable in field-light position. The optical effects on this film matched those on the film forgotten underneath the patient just before the accident. AECL seemed to focus on potential hardware failures, and its engineers needed more than a week to identify the software as the primary suspect (see [12], p. 542). They finally discovered a software bug, described in Sec. 7.1.7. The patient died in April 1987. Since he suffered from a terminal form of cancer, it is not clear whether his death was directly caused by the overdose of the treatment.


7.1.5 Concurrent Computing and Race Conditions

The problem of race conditions in concurrent computing is important for the understanding of Therac’s software issues. Race conditions are related to shared resources used by different concurrent tasks. The concept of controlling access to shared resources via priorities and semaphores discussed in Excursion 14 (see page 137) is helpful in this context but cannot resolve all related problems. Readers familiar with the subject of race conditions may skip the following excursion.

Excursion 22: Concurrent Computing and Race Conditions

Concurrent computing occurs if different routines run on one or multiple processors and are active at the same time, i.e., they are executed during overlapping time windows (cf. [20]). This often happens in parallel computing, but it also appears in sequential computing (in multitasking or multiuser applications, e.g., where one processor has to execute different jobs and switches between these jobs). One routine might call different subroutines, wait until these subroutines are finished, and call them again later on. In concurrent computing, data access has to be controlled very carefully because of the existence of shared variables, which are accessed—read or overwritten—by different concurrent tasks. The order of these accesses is very important for the outcome of the program. As a simplified example, let us consider two concurrent tasks (note that “concurrent” indicates that no order of the tasks exists a priori, even if we number them by (1) and (2) for the sake of referencing) that set two variables A and B to values that depend on the values already stored at the point of calculation:

• task (1): A := 2B
• task (2): B := 3A

The two tasks have to perform the following three instructions each:

    (1) READ B        (2) READ A
    (1) A := 2B       (2) B := 3A
    (1) WRITE A       (2) WRITE B

Depending on the actual order of all statements of all concurrent tasks, the calculated results will be different. The values read from the shared variables might be the initial ones or might already have been changed by a concurrent task. This behavior is a classical race condition and leads to random and unpredictable output values (see [20]). In our example, we now assume initial values A = 1 and B = −1, which leads to different possible results depending on the order of the performed statements. Three exemplary orders are (input: A = 1 and B = −1):

    (1) READ B        (2) READ A        (1) READ B
    (1) A := 2B       (2) B := 3A       (2) READ A
    (1) WRITE A       (2) WRITE B       (1) A := 2B
    (2) READ A        (1) READ B        (2) B := 3A
    (2) B := 3A       (1) A := 2B       (1) WRITE A
    (2) WRITE B       (1) WRITE A       (2) WRITE B

    Output:
    A = −2, B = −6    A = 6, B = 3      A = −2, B = 3


If the computer first runs task (1) and then task (2), the final results in the shared variables A and B will be A = −2 and B = −6. If the two jobs are executed in the order (2) before (1), then the result will be A = 6 and B = 3. If the two jobs run exactly in parallel (resulting in an interlaced execution of statements), the output will be A = −2 and B = 3.
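The same effect can be reproduced on any modern machine. The following small Python sketch (ours; the deliberate sleep only makes the interleaving more likely and is not part of the excursion’s example) runs the two tasks as threads on the shared variables and collects the different outcomes:

    import random
    import threading
    import time

    def run_once():
        shared = {"A": 1, "B": -1}

        def task1():                                  # task (1): A := 2B
            b = shared["B"]                           # READ B
            time.sleep(random.random() * 1e-4)        # encourage interleaving
            shared["A"] = 2 * b                       # WRITE A

        def task2():                                  # task (2): B := 3A
            a = shared["A"]                           # READ A
            time.sleep(random.random() * 1e-4)
            shared["B"] = 3 * a                       # WRITE B

        t1 = threading.Thread(target=task1)
        t2 = threading.Thread(target=task2)
        t1.start(); t2.start(); t1.join(); t2.join()
        return shared["A"], shared["B"]

    outcomes = {run_once() for _ in range(1000)}
    print(outcomes)   # typically {(-2, -6), (6, 3), (-2, 3)} -- a race condition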

7.1.6 Therac’s Software Problems—Part A

Therac’s source code has never been publicly available. But due to the massive documentation produced in the aftermath of the incidents, a “rough picture” ([12], p. 532) is available. The first category of Therac’s software problems was related to three different causes:

1. Problem of race condition (“composite variable MEOS in two tasks”): Two different parts of one variable were used by two different tasks at two different points in time. Under certain conditions, this caused an inconsistency of input data for the treatment of patients.

2. Problem of missing check for modified input data (“untimely clearing of bending magnet flag”): No check for modified input data was realized when the new input data were edited in a fast manner (“dead time” effect).

3. Problem of insufficient mechanism to detect editing (“DataEntryComplete insufficient”): In the event of modified input data before the actual start of a treatment, the mechanism to detect whether or not the data entry had actually been completed by the operator was insufficient.

To understand Therac’s software problems, one has to understand the basic interplay of the different routines and variables responsible for data input, visualized in Figure 7.5. We use bold text to indicate tasks, typewriter format to refer to variables, and italic text for emphasizing specific aspects or instances. The task Treat (for treatment monitor) takes care of the processing of the actual radiation treatment and comprises eight subroutines. The control variable TPhase decides which of these subroutines is going to be executed next. After a subroutine has finished, Treat is called again with a new value of TPhase. The task Keyboard Handler runs concurrently with Treat and communicates with it via a shared variable DataEntryComplete to determine whether or not the input data, such as beam type and dose, have been entered. The Keyboard Handler parses the corresponding input specified by the operator and writes the information into another shared variable, MEOS (for Mode/Energy Offset). MEOS actually consists of two separately interpreted one-byte fields (see also Fig. 7.5): The low-order byte is interpreted by another concurrent task, Hand, to move the turntable into the correct position for the specified mode and energy level. The high-order byte is interpreted by the subroutine DATENT (for “data entry”) in Treat for setting different parameters for the general configuration of the Therac (see the pseudocode in Alg. 7.1, lines 3–8; it is explained in the next paragraphs in more detail).

182

Chapter 7. Complexity

Figure 7.5. Relevant tasks and subroutines for the first category of Therac’s software problems. The task Treat, in particular, controlled eight different subroutines. Note that the composite variable MEOS holding data of the intended treatment (mode, energy level, etc.) was accessed by different tasks at different points in time. Reprinted by permission of Pearson Education, Inc., New York, New York [12, Fig. A.3, p. 535].

of the Therac (see pseudocode in Alg. 7.1106 (explained in the next paragraph in more detail), in lines 3–8). We briefly explain the main aspects of the pseudocode in Alg. 7.1 that are relevant to understand the bug. In the remainder of this section, we refer to the algorithm frequently just by line numbers to simplify reading. Upon entering DATENT, the subroutine first checks whether the mode and energy data has been set by the Keyboard Handler in MEOS, whose high-order byte is then used to access and set corresponding system parameters via indexing. Afterwards, DA TENT calls the subroutine M AGNET to set the bending magnets (in particular also the 270◦ bending magnet). It is important to note that these magnets are responsible for configuring the general setup of the electron beam (see Fig. 7.2) for both X-ray and electron mode before the beam passes the turntable; in particular, these are different magnets than the scanning magnets placed on the turntable for electron mode only (see [16]). Since setting the bending magnets on hardware needed about 8 seconds in total due to hysteresis effects in the magnets, the software had to wait accordingly. This is realized via another subroutine, P TIME (for “pausing time”), pausing the computations. In particular, P TIME is also in charge to clear a bending magnet flag set upon entering the subroutine M AGNET before (cf. line 23). Furthermore, a shared variable is set by the Keyboard Handler task in case any editing of data on the terminal is taking place; P TIME (and actually only P TIME) checks this shared variable (if the bending magnet flag is still set, see lines 36 and 35) and—in case of editing—exits the waiting loop 106 Note that this pseudocode is almost identical to the one given in [12] and contains only minor modifications for readability. It has been constructed indirectly by N. Leveson and coworkers in the course of the events. Since the original source code has never been published by AECL, this “is the best that can be done with the information available” (see [12], p. 534).

7.1. Therac-25 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: 40: 41: 42: 43: 44: 45:

183

function DATENT if mode/energy specified then calculate table index repeat fetch parameter output parameter point to next parameter until all parameters are set call M AGNET if mode/energy has changed then return end if end if if data entry is complete then set TPhase to 3 end if if data entry is not complete then if reset command entered then set TPhase to 0 end if end if return end function function M AGNET set bending magnet flag repeat set next magnet call P TIME if mode/energy has changed then break end if until all magnets are set return end function function P TIME repeat if bending magnet flag is set then if editing taking place then if mode/energy has changed then break end if end if end if until hysteresis delay has expired clear bending magnet flag return end function

Algorithm 7.1. Section of the Therac-25 software relevant for the problem category A. Note that this is reconstructed pseudocode (taken from [12] and slightly reformatted) and does not represent the actual source code, which is not publicly available.


In such a case, DATENT will then be called again by the task Treat with an unchanged value of 1 for the variable TPhase to rerun the parameter setup phase.

We now describe the three different problems 1–3 in more detail before relating them to the actual incidents at the end of this section.

Problem of Race Condition (“composite variable MEOS in two tasks”)

The two different bytes of the variable MEOS are interpreted by two different concurrent tasks (Hand and Treat). A race condition may arise because the two parts of MEOS are read at two different points in time. If the task Keyboard Handler has set the flag DataEntryComplete before additional changes are specified by the operator, the high-order byte of MEOS interpreted by the subroutine DATENT for setting parameters (see lines 3–8) will not be read again since DATENT has already successfully exited. The low-order byte, however, is used in the new setting by the different task Hand, which is scheduled independently of Treat/DATENT to position the turntable accordingly.
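To make the timing dependence concrete, the following Python sketch models the situation in a strongly simplified way; it is our illustration under the assumptions described above, not reconstructed Therac code, and all names and numeric codes are hypothetical.

# Simplified model of the MEOS race condition (illustration only, not the original code).
# MEOS is one 16-bit composite value: high byte = treatment parameters (read by DATENT),
# low byte = turntable position (read by Hand).

def pack_meos(high_param, low_turntable):
    return (high_param << 8) | low_turntable

X_RAY_PARAM, X_RAY_TABLE = 0x25, 0x01          # hypothetical codes for X-ray mode
ELECTRON_PARAM, ELECTRON_TABLE = 0x10, 0x02    # hypothetical codes for electron mode

# 1) The operator enters an X-ray treatment; the Keyboard Handler stores it in MEOS.
meos = pack_meos(X_RAY_PARAM, X_RAY_TABLE)

# 2) DATENT runs now and reads only the high-order byte; DataEntryComplete is set,
#    so these parameters are never read again.
datent_params = (meos >> 8) & 0xFF

# 3) The operator quickly edits the entry to electron mode; MEOS is overwritten.
meos = pack_meos(ELECTRON_PARAM, ELECTRON_TABLE)

# 4) Hand is scheduled later and reads the low-order byte of the *new* value.
hand_turntable = meos & 0xFF

print(hex(datent_params), hex(hand_turntable))  # 0x25 and 0x2: X-ray beam parameters, electron turntable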

Problem of Missing Check for Modified Input Data (“untimely clearing of bending magnet flag”)

Remember the collaboration of the subroutines DATENT, MAGNET, and PTIME with respect to possible editing changes during the setup phase of the bending magnets (see Alg. 7.1). Several of these magnets have to be set up. The problem now is that PTIME clears the bending magnet flag in line 43 at the end of the first of its several calls. Thus, edits that happen after the clearing of the flag in the first PTIME run will not be recognized in subsequent runs. This problem actually requires an additional assumption on the implementation: The check whether the mode and/or energy has been changed (see lines 37, 27, and 10) needs the check “editing taking place” (see line 36) for the first check to work correctly also afterwards. It is unclear how this was actually realized, but without this interdependency, the problems cannot be explained: The pure statements in lines 27 and 10 would, otherwise, also do the job of detecting changes in the values even with a buggy clearing of the bending magnet flag. A change of the mode or energy will appear on the display of the terminal but will actually not be processed with respect to the parameter setup (due to problem 1 described above). Hence, during this dead time (and only in this time window of eight seconds in total), information may get lost and the treatment might be started with inconsistent parameters because of fast corrections of input data by the operator. After the accidents, the clearing of the bending magnet flag was moved from PTIME (line 43) to the end of the MAGNET subroutine (after line 30).
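The effect of the untimely clearing can again be illustrated with a small, purely hypothetical Python model (the real code set several magnets in sequence, with PTIME waiting for each one); the names and the edit timing are invented for the sketch.

# Illustration of the untimely clearing of the bending magnet flag (not the original code).
edits_during_wait = {5: "changed to electron mode"}   # hypothetical edit in the 5th waiting period
bending_magnet_flag = False
detected_edit = None

def ptime(step):
    """Wait for one magnet; editing is only checked while the bending magnet flag is set."""
    global bending_magnet_flag, detected_edit
    if bending_magnet_flag and step in edits_during_wait:
        detected_edit = edits_during_wait[step]
    bending_magnet_flag = False        # bug: the flag is already cleared after the first call

def magnet(number_of_magnets=8):
    global bending_magnet_flag
    bending_magnet_flag = True         # set upon entering MAGNET
    for step in range(1, number_of_magnets + 1):
        ptime(step)
    # fix after the accidents: clear the flag here instead of inside ptime()

magnet()
print(detected_edit)                   # None: the edit during step 5 went unnoticed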

Problem of Insufficient Mechanism to Detect Editing (“DataEntryComplete insufficient”)

An additional problem is related to setting the correct value for the variable TPhase after the call to MAGNET in line 9. DATENT checks in line 14 whether or not the flag DataEntryComplete has already been set. If the flag indicates that the data input is completed, TPhase is set to 3 (see line 14) and, after the return in line 20, the task Treat continues with SETUPTEST and does not enter DATENT again.


If DataEntryComplete has not been set, DATENT sets TPhase to 0 if the condition in line 17 holds or keeps the value 1 for TPhase otherwise, which would result in Treat calling DATENT again to actually obtain the correct, new configuration data. The problem now is that the flag DataEntryComplete just stores whether or not the cursor on the terminal screen has at some point in time been down in the COMMAND field (bottom right in Fig. 7.4), but not that it is still there! This creates a potential race condition: Moving the cursor the first time to the COMMAND field writes a preliminary value into DataEntryComplete that is going to be used erroneously afterwards despite additional edits still taking place (with the cursor actually being away from the COMMAND field). After AECL had identified these inconsistencies, a workaround was implemented using an additional flag to indicate that the cursor is not in the command line and editing is still ongoing.

Relation of Accidents to Software Problem Part A

For Therac’s software problem of the type “part A” to happen, all three subproblems 1–3 described above need to exist simultaneously. This shows the complexity of the overall, suboptimal design of the software, which is already challenging to explain or understand, not to mention to test or maintain. We now order the six different incidents 1–6 of Sec. 7.1.4 with respect to their potential relation to the software problem part A.

Incident 4, the first East Texas case: The type of events during this treatment fits the software problem: The operator had entered an X-ray treatment erroneously but quickly fixed that input error in the terminal display, using the ↑ button to access the field where the type of treatment was specified, modified it from X-ray to electron mode, and used the return key to quickly skip further changes, go back to the COMMAND field, and proceed with the treatment. This change in input data was performed within the time window of eight seconds that the system took to set the bending magnets for the first input, but after the very first call to the subroutine PTIME, i.e., after the bending magnet flag had been cleared. Therefore, no change was detected for the parameter setup of the bending magnets in DATENT, leading to a very strong beam current (in [12], a “factor 100” is mentioned) erroneously intended for X-ray mode. The different type of treatment, however, was correctly identified by the task Hand, which ordered the turntable to be positioned in electron mode (without the X-ray flattener, of course), leading to a massive overdose. This effect was aggravated by executing the erroneous treatment twice.

Incident 5, the second East Texas case: The history of the events shows an analogous pattern of operation compared to the first East Texas case. The only difference is that the execution of the faulty setup happened once instead of twice. Due to the careful investigations of the hospital’s physicist F. Hager, this problem was well documented, and the error could be reproduced and related to the speed of editing input data.

Incident 1, Kennestone: It is unclear whether this incident is related to software problem part A due to missing information (see [12], p. 542).



The overdose of the patient was not lethal, but the radiation treatment targeted a completely different part of the body. The estimated amount of radiation of about 15,000–20,000 rad is, however, similar to the East Texas cases (incidents 4 and 5).

Incident 2, Ontario: The documented course of events and the error messages during this treatment do not fit Therac’s software problem part A. It is unclear whether a completely different problem or race condition generated this accident (cf. [12], p. 542).

Incident 3, the first Yakima case: This is very similar to the second Yakima case (incident 6 described below). Hence, it is assumed that the flow of events as well as the reason for the accident are related to software problem part B.

Incident 6, the second Yakima case: This incident has no relation to Therac’s software problem part A, as it has clearly been traced back to part B, which is described in the next section.

7.1.7 Therac’s Software Problems—Part B

The second category of software problems in Therac-25 is completely independent from the first one. In Therac’s software, the subroutine DATENT sets the variable TPhase to 3 under certain conditions (see line 14 in Alg. 7.1). This causes the task Treat to be rescheduled to enter the next subroutine SETUPTEST. The relevant design aspects for Therac’s software problem part B are visualized in Fig. 7.6.

A shared variable Class3 of one byte is used to check that the upper collimator (part of the device on the turntable for the X-ray mode, together with the target and the flattener) is positioned consistently with the parameters: A consistent setting requires Class3 to be equal to zero. Every inconsistent setting during a call of SETUPTEST increments the variable Class3 by 1, and the nonzero value indicates that treatment should not proceed. The subroutine SETUPTEST sets Class3 and then checks another shared variable F$mal for a zero value to verify that no malfunctions are occurring. In the event of a nonzero value of F$mal, SETUPTEST is rescheduled. Otherwise, SETUPTEST sets the variable TPhase to 2 and the treatment is allowed to continue by other subroutines in task Treat. The variable F$mal is actually set by another task Hkeper (for “Housekeeper”) in its subroutine CHKCOL (for “check collimator”), which is responsible for consistency checks of the upper collimator. As indicated in Fig. 7.6, the code deciding whether or not to enter CHKCOL is contained in another subroutine LMTCHK (for “limit check”), which calls CHKCOL in case Class3 has a nonzero value. For Class3 equal to zero, this check is skipped.

Note that the subroutine SETUPTEST is typically executed several hundred times during a setup phase before a treatment because it reschedules itself while waiting for other events or input specified by the operator. Each execution in a configuration that is not yet consistent will lead to incrementing Class3 by 1. Since this variable uses one byte of memory, it can store 256 different integer values. A classical integer overflow (see Excursion 1 on page 12) can easily occur when SETUPTEST is executed for the 256th time, resulting in a value of zero for Class3 despite 256 detected inconsistencies, including the most recent one.


Figure 7.6. Background information on Therac’s second software problem: Interplay of the subroutine S ET U P T EST of the task Treat and the task Hkeper via the shared 8-bit variable Class3. Reprinted by permission of Pearson Education, Inc., New York, New York [12, Fig. A.4, p. 543].

When the operator hits the “set” button in exactly the time window after such an overflow and before the next incrementation by 1, the treatment will start regardless of the upper collimator position, which could still be in the field-light position, combined with a 25 MeV electron beam without the X-ray target in place. A mixture of an overflow problem and a suboptimal design using shared variables allowing for a race condition thus generated the second category of Therac’s software problems. AECL fixed this bug by simply replacing the incrementing of Class3 by 1 for each case of inconsistency with setting Class3 to a fixed constant nonzero value each time.
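The overflow itself is easy to reproduce. The following sketch (illustrative Python, not the original PDP-11 assembly) counts inconsistent SETUPTEST passes in one byte and shows the wraparound; the commented-out line corresponds to the fix described above.

# Class3 modeled as one byte: 0 means "collimator position consistent", nonzero means "do not treat".
class3 = 0
for rerun in range(1, 300):            # SETUPTEST reschedules itself while the setup stays inconsistent
    class3 = (class3 + 1) & 0xFF       # increment in 8-bit arithmetic: 255 + 1 wraps around to 0
    # class3 = 1                       # fix: set a fixed nonzero constant instead of incrementing
    if class3 == 0:
        print("rerun", rerun, ": Class3 overflowed to 0 -- an ill-timed 'set' now starts the treatment")
# Output: rerun 256 : Class3 overflowed to 0 -- an ill-timed 'set' now starts the treatment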

Relation of Accidents to Software Problem Part B

This second Therac-25 problem was discovered during the investigation of the second Yakima incident (no. 6 on the list in Sec. 7.1.4). The film forgotten in the machine just before the incident made it possible to match it with the test film radiated by the hospital physicist in his independent investigation and to verify the problem of the electron beam coming on in X-ray mode but with the turntable in field-light position. The first Yakima incident (no. 3 on the list in Sec. 7.1.4) is less documented than the second one. But the type of injury and, in particular, the striped pattern of the radiation seem very similar to the second Yakima accident. Therefore, it is very likely but not proven that the first Yakima accident was also caused by Therac’s software problem of type B. For the remaining accidents, it is not clear whether they were caused by category A or B of Therac’s software problems or yet another problem (simply not discovered or related to those accidents).

Table 7.1. Summary of the six different incidents. The columns indicate the number of the incident as given in Sec. 7.1.4, the city of the hospital, the part of the body exposed to radiation treatment, the estimated actual dose in rad, and the consequence of the erroneous treatment. The last column contains the association to Therac’s software problem part A or B with a question mark indicating an unclear situation.

No.  hospital     body part    estimated dose in rad   outcome                bug categ.
1    Kennestone   breast       15,000–20,000           injured                A?
2    Ontario      cervix       13,000–17,000           death from cancer      ?
3    Yakima       hip          10,000                  injured                B?
4    Texas        upper back   16,500–25,000           death                  A
5    Texas        face         25,000                  death                  A
6    Yakima       chest        8,000–10,000            death (from cancer?)   B

Some sources assign the Hamilton accident (in Ontario, no. 2 on the list) to the same software problem part B as the second Yakima incident (see [12], p. 542). Tab. 7.1 summarizes the six incidents and their potential relation to the two different categories A and B of Therac’s software problems.

110 Note that after the investigations related to the second Yakima incident, a variety of changes had been made in the software, some of them eliminating additional errors detected during review (cf. [12], p. 548).

7.1.8 Discussion

The identified root causes for the accidents were
• a race condition when including corrected data with the keyboard;
• a dead time when adjusting the bending magnets during which input was ignored;
• a combination of an integer overflow in the variable Class3 (which was used both for indicating proper settings and for counting discarded settings) and a potential race condition.

Different aspects contributed to the faulty software (see [11] for details):
• Overconfidence in software: At the beginning of the computer era, the belief that software could not be wrong (compared to hardware) was widespread, leading to complacency in developing software.
• Confusing reliability with safety: A program that is 98% reliable is not 100% safe. This is especially crucial when dealing with human lives. This consideration is also fundamental in Sec. 7.3 on SDI, Sec. 4.1 on fly-by-wire, and Sec. 4.2 on autonomous driving.
• Lack of defensive design: In case of doubt, it is always a better strategy to vote for safety.
• Unrealistic risk assessments: Software developers should not ignore possible hazards or consider hazards as unlikely.


• Inadequate software engineering practices: This comprises sloppy and undocumented code, cryptic error messages, no quality assurance, and no systematic testing. It is worth mentioning that the Therac software was developed by one single programmer over several years (see [16], p. 38). When outsourcing programming tasks to small companies, in particular, care needs to be taken to make sure that not only a single or very few persons are responsible for the whole development.
• Software reuse: Code from Therac-20, which had worked under different conditions and only as a supporting tool, was directly included in the Therac-25.
• Safe vs. friendly user interface: Safety should not be sacrificed to achieve simple handling of interfaces.
• Failure to eliminate root causes and inadequate investigations: In the case of these accidents, there was no sufficient information for users about problems, and in the beginning, there was no thorough inquiry into the possible causes.

D. Parnas summarizes the software problems in the Therac-25 case to be caused mainly by two types of programming error (cf. [19]):
• The computer should always detect and discard unreasonable settings.
• Video screen and keyboard commands should be synchronized. Hence, the data input at the keyboard should only be displayed when it has been processed.

7.1.9 Other Bugs in This Field of Application

Several lists of accidents in radiation treatment are detailed in [5, 9, 14, 18, 25]. Examples of medical bugs include the following:
• In November 2000, at the National Cancer Institute, Panama City, the proper dosage of radiation was miscalculated. The therapy-planning software in use for treatment of cancer patients was a product of the US company Multimedia Systems International. The code allowed four metal shields or blocks to protect healthy tissue. The blocks were specified by drawing their placement on a computer screen. In order to also allow for five blocks, the doctors tricked the software by drawing five blocks as a single large block with a hole in the middle. Unfortunately, the code delivered different answers depending on how the hole had been drawn, in particular on the direction in which it had been drawn. In this way, the dosage could vary by a factor of 2 or more depending on this direction. This caused the death of at least eight patients, and 20 more patients suffered significant health problems after the treatment (for more information, see [5, 9, 14, 25]).
• Between 2004 and 2005, at the Jean Monnet Hospital in Epinal, France, more than 24 people received 20% more radiation than recommended due to a calibration error. The error was linked to the introduction of new machines in 2004 (cf. [5, 25]).
• In 2017 at Gisborne Hospital, New Zealand, patient data were displayed under someone else’s record. The bug appeared for more than two years in the software Healthview developed by the company iSoft, a major software supplier of England’s National Programme for IT (see [6]).


• According to the US Food and Drug Administration (FDA), problems with drug-infusion pumps led to nearly 20,000 serious injuries and over 700 deaths between 2005 and 2009. Mostly, software errors were deemed responsible for these problems. For example, if—by buggy code—a single keystroke is interpreted multiple times, this could result in delivering an overdose (cf. [22]). Furthermore, faulty software occurs in pacemakers, blood analysis devices, monitoring devices, or infusion pumps (see [7, 8, 15, 22]).

Bibliography [1] AECL. Commercial products. picture 11, The Society for the Preservation of Canada’s Nuclear Heritage, date accessed: 2018-08-01. http://www. nuclearheritage.ca/commercial-products/nggallery/image/111-1000-7/

[2] AP. Fatal Radiation Dose in Therapy Attributed to Computer Mistake. The New York Times, 1986 [3] M. Case, K. Clement, G. Orchard, and R. Zou. A history of computing in medicine, 2006. http://ininet.org/a-history-of-computing-in-medicine. html

[4] D. L. Chandler. Explained: rad, rem, sieverts, becquerels - A guide to terminology about radiation exposure. date accessed: 2018-08-20, 2011. http://news.mit. edu/2011/explained-radioactivity-0328

[5] K. Coeytaux, E. Bey, D. Christensen, E. S. Glassman, B. Murdock, and C. Doucet. Reported Radiation Overexposure Accidents Worldwide, 1980-2013: A Systematic Review. PLOS ONE, 10(3):e0118709, 2015 [6] T. Collins. “Rare” bug in iSoft software led to mix-up of patient data. date accessed: 2018-03-26, 2010. http://www.computerweekly.com/blog/PublicSector-IT/Rare-bug-in-iSoft-software-led-to-mix-up-of-patient-data

[7] I. Giese. Software-Fehler (6-10). date accessed: 2018-03-26, 2002. http://webdocs.gsi.de/~giese/swr/fehler06.html

[8] C. Holden. Regulating Software for Medical Devices. Science, 234(4772):20, 1986 [9] IAEA. Accident prevention. date accessed: 2018-03-26. https://www.iaea.org/ resources/rpop/health-professionals/radiotherapy/accident-prevention

[10] International Commission on Radiological Protection. The 2007 Recommendations of the International Commission on Radiological Protection. Annals of the ICRP, Publication 103. Elsevier, 1st edition, 2008. http://journals.sagepub. com/doi/pdf/10.1177/ANIB_37_2-4

[11] N. G. Leveson. Medical devices: The Therac-25. date accessed: 2018-03-20. http://sunnyday.mit.edu/papers/therac.pdf

[12] N. G. Leveson. Medical Devices: The Therac-25 Story. In SafeWare: System Safety and Computers, Computer Science and Electrical Engineering Series, Appendix A, pages 515–553. Addison-Wesley, 1995


[13] N. G. Leveson and C. Turner. An investigation of the Therac-25 accidents. Computer, 26(7):18–41, 1993 [14] Netspective Media. Two of history’s worst software bugs reported to be in medical software. date accessed: 2018-03-26, 2005. https://www.healthcareguy. com/2005/11/09/two-of-historys-worst-software-bugs-reported-to-be-inmedical-software/

[15] Paragraft. When bugs really do matter: 22 years after the Therac 25. date accessed: 2018-03-26, 2008. https://paragraft.wordpress.com/2008/07/03/ [16] I. Peterson. Silent Death. In Fatal Defects, Chapter 2. Vintage Books, reprint edition, 1996 [17] E. Richards. Software’s dangerous aspect. The Washington Post, Dec. 9, 1990. https://www.washingtonpost.com/archive/politics/1990/12/09/softwaresdangerous-aspect/9b2e9243-8deb-4ac7-9e8f-968de0806e5e/?utm_term= .c6fae01a2261

[18] R. C. Ricks, M. E. Berger, E. C. Holloway, and R. E. Goans. REAC/TS Radiation Accident Registry: Update of Accidents in the United States. In IRPA-10 Proceedings of the 10th International Congress of the International Radiation Protection Association on Harmonization of Radiation, Human Life and the Ecosystem. Japan Health Physics Society, 2000 [19] R. Saltos. Man killed by accident with medical radiation. The Boston Globe, June 20, 1986. Date accessed: 2018-07-31, ed. by N. McPhee: http://facultypages. morris.umn.edu/~mcphee/Courses/Readings/Therac_25_accidents.html

[20] A. S. Tanenbaum. Modern Operating Systems. Pearson, 3rd edition, 2009 [21] T. Taylor, G. VanDyk, L. Funk, R. Hutcheon, and S. Schriber. Therac 25: A New Medical Accelerator Concept. IEEE Transactions on Nuclear Science, 30(2):1768–1771, 1983 [22] The Economist. When code can kill or cure. date accessed: 2018-03-26, 2012. https://www.economist.com/node/21556098

[23] US General Accounting Office. Medical Devices Recalls - Examination of Selected Cases. Technical Report GAO/PEMD-90-6, US General Accounting Office, 1989. https://www.gao.gov/products/PEMD-90-6

[24] Wikipedia. Health informatics. date accessed: 2018-03-26. https://en.wikipedia.org/wiki/Health_informatics

[25] Wikipedia. List of civilian radiation accidents. date accessed: 2018-03-20. https: //en.wikipedia.org/wiki/List_of_civilian_radiation_accidents

[26] Wikipedia. Therac-25. date accessed: 2018-03-20. https://en.wikipedia.org/ wiki/Therac-25


7.2 Denver Airport

The scientific theory I like best is that the rings of Saturn are composed entirely of lost airline luggage.
—Mark Russell

7.2.1 Overview
• Field of application: airport design
• What happened: disastrous automated baggage-handling system
• Loss: delay in airport opening resulting in additional costs of about $500 million in total
• Type of failure: faulty control of baggage carts, underestimating complexity

7.2.2 Introduction To reduce costs in civil aviation, it is crucial to minimize the time an aircraft is grounded and waiting at the terminal gate. During the time that the plane stays on the ground, different tasks have to be accomplished such as cleaning the cabin, providing the catering, refueling, guiding the passengers to the gate and inside the aircraft, and transporting the luggage between check-in, gates, and baggage claim. Especially for transfer flights, the time of transfer for passengers and luggage should be short and of similar magnitude to avoid separation and delays. For airports serving as a hub for major airlines, the ability to deal with huge numbers of travelers and baggage is important to keeping the customers and airlines satisfied. This is especially difficult nowadays for wide-body aircraft with 600–800 passengers, where a large amount of baggage and people must be dealt with simultaneously. For constructing a new airport, the design and implementation of the baggagehandling system (BHS) is a crucial and time-consuming task (see [24]). Usually, testing of the luggage transport has to be scheduled long before the opening of the airport in order to guarantee smooth and reliable service. Baggage handling has three tasks: • transport baggage from check-in to the aircraft at the departure gate; • transport baggage from the aircraft at the arrival gate to the baggage claim area; • transport bags from the arrival gate to another departure gate during transfers. This last task makes things especially complicated and is usually limited to transfers of particular airlines to check through bags for connecting flights of the same airline only. There are two main options for transporting bags over longer distances (cf. [47]): • loose transport by conveyor belts or man-driven luggage carts, chutes, and rolls (or reels); • transport in containers or automatic carts (in addition to conveyor belts at checkin and baggage claim areas; see Fig. 7.15(a) for a QR code of a YouTube video on a modern system at London Heathrow). Loose transport will nearly always occur at the beginning (check-in) and the end (baggage-claim) of the conveyance chain. Trolleys and tugs are also employed. But


for automatic identification and sorting in loose transport from gate to gate, bar codes on the bags and reliable scanners are needed. For redirecting bags, pushers, diverters, and tilt tray sorters are used. For transport in containers or automatic carts, the software identifies corresponding bags and containers/carts by their identification numbers. Then the containers/carts are steered to the destination via bar codes or roll labels, which saves time and provides more reliable identification than using the labels of the bags. For the automatic transport, the sorting is done on the fly without the need for a central sorting station as in the case of loose transport. Furthermore, carts can obtain higher speeds. A disadvantage of this type of transportation is the need to handle the empty containers and carts that have to be distributed to the loading points just in time. Gates C28-C39

[Figure 7.7 shows DIA’s three concourses with gates C28–C50 (Concourse C), B15–B61 (Concourse B), and A24–A68 (Concourse A), connected to the Jeppesen Terminal by the AGTS train and a walkway bridge.]

Figure 7.7. DIA’s layout of concourses.

7.2.3 Timeline Compact surveys on the sequence of events regarding the new Denver International Airport (DIA) can be found in [14, 17, 21, 29, 60]. The planning for DIA started in 1988, when the city of Denver bought a 53-square-mile area east of Denver. In June 1989, the airport plan was approved by the voters, and in September site preparations and construction began. In August 1990, Denver decided to use a conventional tug and cart baggage system following the advice of hired consulting firms. In November 1990, United Airlines (UA), which wanted to use the new airport as a major hub with short aircraft turnaround times, insisted on an automated system for Concourse B (see Fig. 7.7 for a sketch of DIA’s layout) and hired BAE Automated Systems111 for the corresponding design; Continental hired Neidle Patrone Associates (BNP) for a different system in Concourse A112 (see [29]). In the course of the events, 111 For a brief survey of BAE’s history as a company, see [28]. The company was sold to G&T Conveyor Co. in 2002 (cf. [46]). 112 Individual or different BHSs for different concourses reserved for specific airlines are rather common.


there seemed to be not enough progress with the different BHSs of the different airlines for Denver officials. After negotiations with Continental and United Airlines in December 1991, Denver overthrew its previous decision and ruled for an automated baggage system for the entire airport (see [28]), leaving only about 2 years to complete this project until the planned opening date of October 31, 1993 (cf. [14, 60]). During 1992, the city issued additional changes in the design of the BHS to reduce the price. In March 1993, the opening was postponed from October 31, 1993, to December 19, 1993, and soon afterwards to March 9, 1994 (see [29]; note that slightly different dates for the announcements are mentioned in [14]). In March 1994, the opening was once more postponed to May 15, 1994. All delays stemmed from the insufficient BHS. In April 1994, airport authorities arranged a demonstration of the new luggage system for the media. It was a disaster; seemingly everything that could go wrong did go wrong: “. . . bags were literally chewed up and spit out while horrified officials and viewers saw clothing and other personal belongings flying through the air” [3]. In May 1994, the mayor of Denver finally announced an indefinite delay of the opening (cf. [21, 29]). That same month, the German company Logplan was engaged to evaluate the project (see [14, 28]). Logplan had already been involved in the luggage system at Frankfurt Airport. In August 1994, tests of Denver’s automated BHS still failed. On Logplan’s recommendation, the automatic system was radically trimmed back to reduce complexity (see [19]): • Only one concourse (B, for UA) instead of all three concourses, was served by the automated BHS via destination coded vehicles (DCVs), using an additional conventional tug and trolley system for Concourses A and C. • The capacity of bags on each track was reduced to 50%. • Only outbound baggage was transported with the automated system; inbound and transfer luggage was transported with the conventional system. Finally, the plan of a unified, automated, and fast baggage system for the entire airport had to be abandoned. On February 28, 1995, DIA was inaugurated with a grand opening. In addition to the originally projected costs for the installation of the automated BHS (about $186 million; cf. [33]), DIA’s operating company had to install the conventional backup system (about $50–75 million according to [19, 60]) and lost about $33.3 million per month due to the delayed start of operation (cf. [19]). Expenses due to the problems and the delay of the baggage-handling system totals well over $500 million. Furthermore, additional costs for the involved airlines are estimated at about $50 million (see [19]). Due to the massive problems of the baggage-handling system, some media reports called it the “baggage system from hell” (cf. the YouTube video referenced by the QR code in Fig. 7.8 and [21]). Furthermore, many jokes about the “real” meaning of the abbreviation DIA were invented, such as “Delayed International Airport” (see [34]).

7.2.4 Original Design of DIA’s Baggage-Handling System In its original design by BAE, the system should serve 88 gates in 3 concourses, incorporating over 27 km of tracks and about 9 km of conveyor belts (see [19, 60]; Figure 7.9 gives an impression of the complex layout). The main part consisted of track-mounted DCVs that were moved via linear induction motors mounted periodically on the tracks every 50 feet. 3,100 carts of standard size and 450 oversized carts (for skis, etc.) were

Figure 7.8. QR code for a YouTube video by the TV channel MSNBC on DIA’s problems with the automated baggage-handling system.

Part of DIA’s complex BHS layout: conveyor belts and DCVs on tracks

Figure 7.9. Impressions of DIA’s baggage-handling system. The complex layout involved conveyor belts and DCVs on tracks under many spatial constraints. Photo credit: Kevin Moloney/The New York Times/Redux.

ordered [14]. The system comprised 5,000 motors; 59 laser bar code reader arrays; more than 300 radio frequency readers (for the RF/ID system in the carts); 92 programmable logic controllers (PLCs) to control the motors and the track switches; 2,700 photo cells for keeping track of the carts; over 150 computers, workstations, and communication servers; and 14 million feet of wiring (see [14, 19, 60]). The bags’ journey started at check-in with conveyor belts while laser scanners read the bar code. The data from the bar code reader were processed to an empty DCV with radio frequency identification that was called to the loading area. To save time and energy, a cart did not stop but prepared its container in a special loading position; while still moving, it obtained a bag from a high-speed luggage bowling machine at a T-section. To realize this setup, carts slowed down from an up to 24 mph (38 km/h) traveling speed to, allegedly, 4.5 mph for loading and 8.5 mph for unloading (see [5]). In the event of missing empty carts at a loading zone, the corresponding conveyor belt had to stop moving all bags until an empty DCV had arrived to collect the bag. The computer system sent the bar code to the computer responsible for sorting, which computed


the route to the destination (gate or baggage claim) by using a lookup table to match the flight number and gate. The tracking computer guided the DCV to its destination via communicating with radio transponders mounted on the carts. The track switches during transport were controlled via PLCs by the tracking computer. The DCVs were traced by photoelectric sensors located every 150 to 200 feet. Concerning the DCVs, the computer system and corresponding software had to • control the PLCs; • handle the merging of carts on the tracks; • control the switches; • monitor the radio transponders on the carts; • manage the destinations for empty carts; • determine and control the routing (of full and empty carts); • track obstacles and failures; and • compute detours around obstacles or jams. All in all, the originally planned layout of DIA’s baggage-handling system was designed to transport luggage 10 times faster than existing systems with 14 times the capacity of the largest known system (at San Francisco) and serving 10 times more destinations (gates, etc.) than existing systems [19, 38, 45].

7.2.5 Detailed Description of the Problems During the two-years delay of the opening, a series of failed tests took place, in particular the demonstration for the media in April 1994. This resulted in destroyed bags, suitcases being flung out of carts, clothing flying through the air and frequently clogging the rails, DCVs jumping out of tracks and crashing into each other (especially at intersections), and bending rails. In total, it turned out that the system was compromised by various problems: • Too-sharp curves in the rails due to the imposed geometrical layout of the airport and the tunneling system as well as faulty lock bars at the carts resulted in bags falling out of DCVs. In addition, airflow flipped light suitcases out of carts (see [14, 33, 34, 43, 51]). • Photo cells for tracking DCVs and their speed were misplaced, dirty, painted over, knocked out of alignment, or of insufficient quality, resulting in DCVs crashing into each other and jamming the system (cf. [1, 34, 51]). • Laser scanners misread bar codes attached to the luggage due to normal behavior (not exactly 100% of all scans work correctly, not unlike grocery scanning in supermarkets nowadays), due to laser scanners knocked out by workers bringing in other parts of the construction, and due to poorly printed bar code tags (see [17, 19, 43, 45]). • Unreliable power generation resulted in voltage level oscillations that impacted the sensible induction motors running the DCVs (cf. [1, 28]).


• Computers were sometimes overwhelmed when tracking thousands of DCVs in real time (see [43]). • Insufficient timing of pushing bags from belts to DCVs (too early or too late), and vice versa (cf. [19, 43, 51, 60]). • Unsuitable timing of printing baggage tags (too quickly) overwhelming the system, which sent many bags to manual sorting (see [43]). • Due to previous jams and a restart of the system, the system lost track of which carts had already been loaded, resulting in loading bags in already occupied carts (see [51, 60]). • Testing and debugging of the system was difficult due to small tunnels and frequently bad or missing radio communication between testers in tunnels, concourses, and computer rooms (cf. [43]). • The software was complicated in view of different tasks, including not only the tracing and routing of bags and DCVs, but also connecting to the different reservation systems of all airlines (written in different programming languages) [43]. • The sheer complexity and size of the system (in total over 100 gates, check-in areas, and baggage claim areas) represented the major problem. In particular, the line balancing problem)113 together with empty-cart management was not solved in a useful manner, resulting in empty DCVs arriving too early or too late at the corresponding destinations and causing problems in other parts of the highly coupled system (see [1, 36, 51, 60]). The last point—the complexity of the system—is particularly important and will be explained in the next section.

7.2.6 Complexity The overall complexity of DIA’s BHS turned out to be the paramount reason for the failure. To get an idea of the complexity you have to think of the whole system as a “cascade of queues” (cf. [19]) that has to be fed and controlled in such a way that • empty carts are requested and provided; • routes are planned and possibly changed; • all the lines of flow have balanced service; and • acceptable delivery times are also guaranteed under heavy load. The overall, highly coupled problem is related to scheduling problems (see [48]) and consists of three major tasks: 1) the routing of loaded DCVs, 2) the line balancing (LB) problem (i.e., assuring a balanced service of all transporting units), and 3) empty-cart management (cf. [66, 67]). The LB problem is particularly sensitive; it shares many 113 The line balancing problem in the context of BHS is the problem of balancing all demands for empty DCV carts. What shall be the destination of a cart that has just delivered luggage: one of the many access points (nearby vs. further away) to obtain new luggage or one of the several storage areas for empty carts? Balanced control is important to avoid congestion and, in particular, ensure smooth continuation in all parts of the system without any party waiting for empty carts. This is especially difficult when storage areas for empty cars are missing.


aspects of classical assembly line balancing problems common in operations research (cf. [23, 49]). A good survey on different assembly line balancing problem types can be found in [11], and aspects of analytical approaches and early computational approaches to solve it are available in [35] and [30, 31], respectively. The following excursion contains important aspects of the complexity of algorithms relevant in this context.

Excursion 23: Complexity—Part II
As described in Excursion 6 on complexity (see page 33), every problem that can be solved on a computer has to be implemented by an algorithm that takes C(n) operations, where C denotes the cost function and n is the number of unknowns or variables for which the problem has to be solved. The examples in Excursion 6 (accessing vector components in a vector of length n and sorting n numbers) had linear and quadratic complexity; i.e., the function C(n) was a linear or quadratic polynomial. These problems belong to the class P, which contains all problems that are solvable (on a deterministic sequential machine) by an algorithm with cost function C bounded by a polynomial in n. Problems of this type typically behave “nicely” or “ok,” depending on the polynomial’s degree and the actual value of n.

Unfortunately, there are also problems with faster growing cost functions. Consider the classical wheat and chessboard problem: On the first square of a chessboard, put one kernel of wheat (or rice), on the second put 2, on the third 4, on the kth 2^(k−1), and so on. How many kernels are on the 64th square, and how many are on the entire chessboard? Obviously, the 64th square should contain 2^63 = 9,223,372,036,854,775,808 ≈ 9.22 · 10^18 kernels, and the entire board 18,446,744,073,709,551,615 = 2^64 − 1. So with each additional square, the number of kernels grows by a factor of 2. The entire chessboard would contain a heap of kernels larger than Mount Everest. To describe the growth of different elementary functions—from polynomial to exponential behavior—we use the following table comparing linear, quadratic, and cubic growth to exponential growth:

n       n^2        n^3              2^n                  exp(n)
1       1          1                2                    2.718...
10      100        1,000            1,024                22,026.465...
100     10,000     1,000,000        1.267... · 10^30     2.688... · 10^43
1,000   1,000,000  1,000,000,000    1.071... · 10^301    Inf

Cost functions showing such features are called exponentially growing, and there is no polynomial limiting the function. Other examples of exponentially growing entities are
• populations of living beings (bacteria, animals, humans, etc.) without limitation of resources;
• a bank account balance b with interest rate r, starting with s, after n years: b = s · (1 + r)^n;
• nuclear chain reactions.
Therefore, another class of problems has been defined as NP. This class contains all decision problems that can be verified by a deterministic algorithm in polynomial time given a suitable certificate (i.e., a proposed solution). The certificate is assumed to be generated by some nondeterministic process, and therefore NP stands for “nondeterministic polynomial.” Certain problems in NP allow—up to now—only deterministic algorithms with a cost function growing exponentially. A still open and very important question is whether it actually holds that P = NP, i.e., whether there is a smart way to solve any problem from NP in polynomial time even if no suitable certificate is given.


Other important sets of problems are the sets of NP-hard and of NP-complete problems. A problem (*) is called NP-hard if every NP problem can be reduced to the solution of (*) in polynomial time. It is called NP-complete if it is NP-hard and also in NP. NP-complete problems actually exist: The traveling salesman problem (cf. Excursion 24) is NP-complete (see [26] for a comprehensive survey). For problems with exponentially growing cost function, the only resort is to relax the optimality conditions and to find polynomial algorithms, e.g., via heuristics that generate a sufficiently accurate approximation to the optimal solution.

a Obviously, exponentially growing phenomena cannot last forever because fairly quickly they would exhaust all available finite resources and then die off.
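The entries of the growth table in Excursion 23 can be reproduced with a few lines of Python; the Inf entry is nothing exotic, it is simply the overflow of exp(1000) in double precision.

import math

for n in (1, 10, 100, 1000):
    try:
        e_n = math.exp(n)
    except OverflowError:                 # exp(1000) exceeds the double-precision range -> "Inf"
        e_n = float("inf")
    print(f"n={n:5d}  n^2={n**2:13,d}  n^3={n**3:17,d}  2^n={float(2**n):11.3e}  exp(n)={e_n:.3e}")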

The LB problem in automated BHS is of the most complex category of problems: Similar to the traveling salesman problem (see Excursion 24), it possesses an “extremely large number of feasible solutions” ([30], p. 308), and “even its simplest forms are NP-hard” ([23], p. 2; similar in [11]). Therefore, a brute-force comparison of all candidate solutions is not feasible for real-time systems. In the context of baggage-handling systems, the complexity of a design and the associated LB problem is created by different parameters. Besides just the amount of baggage items, the distance the luggage has to travel, and the time window for the luggage to arrive at the necessary destination, the number of sources and destinations (i.e., check-in counters, gates, and baggage claim areas) is a major driving force of complexity.

To give an impression of the underlying (complex) problems, we list a couple of simple example layouts represented by graphs with corresponding numbers in the following. Let NG be the total number of gates, NCI the total number of check-in areas, and Nclaim the total number of baggage claim areas. Assuming in a first approach (according to [4]) that all gates have direct connections to each other (i.e., they form a so-called clique in the graph) and that all check-in and baggage claim areas are directly connected to each gate but not to one another (there should be some flight in between), the total number of directed connections is calculated via

NE = 2 · (NG choose 2) + NCI · NG + Nclaim · NG = NG · (NG − 1) + NCI · NG + Nclaim · NG .   (7.1)

Fig. 7.10 shows five different variants of such a setup. The input parameters and the corresponding total number of directed edges NE according to Eq. (7.1) are given in the upper part of Tab. 7.2. The nonlinear growth of connections is clearly visible. Of course, such a layout—despite the advantage for routing and actual transfer of luggage—is unrealistic since airports cannot provide such a huge amount of tunnels, etc. A more realistic setup with junctions Si and a cyclic layout of gates is shown in Fig. 7.11, and an even more realistic one for a simple terminal-concourse structure in Fig. 7.12. In Tab. 7.2, the lower rows show the number of directed connections for these partial layouts, which are significantly smaller than and not as fast-growing as the variants involving cliques of gates. For routing the (full but also empty) DCVs, more connections or paths from a given start point to a specified end point are advantageous since congestion may be circumvented, but they are also problematic since they increase the number of possibilities for the underlying optimization problem. A very prominent example of an optimization problem on graphs is the traveling salesman problem described in Excursion 24. Readers familiar with this problem may continue directly to page 202.

[Figure 7.10 sketches five example layouts with fully connected gates: (a) NG = 1, NCI = 1, Nclaim = 1; (b) NG = 2, NCI = 1, Nclaim = 1; (c) NG = 2, NCI = 2, Nclaim = 1; (d) NG = 3, NCI = 2, Nclaim = 1; (e) NG = 4, NCI = 2, Nclaim = 2.]

Figure 7.10. Sketch of the growing number of connections between gates, check-in areas, and baggage claim areas according to Equation (7.1). Note that all gates are directly connected to each other; i.e., they form a clique in the graph. All check-in areas and baggage claim areas have direct connections with each gate.

Excursion 24: The Traveling Salesman Problem
As a mathematical example we consider the traveling salesman problem (see [40, 41]). A salesman has to visit a given number of cities, which are all connected in pairs via streets with different lengths. The order of the cities does not matter, and the salesman is of course interested in the shortest trip. He has to start and end in the same city and to visit every other city only once. The mathematical problem, thus, is the optimal alignment of his route through all cities with minimum length. The graph of the cities and their connections is given as a set of nodes ci, i = 1, ..., n, representing the cities and a set of edges ei,j = [ci, cj] representing streets connecting cities ci and cj. Each edge ei,j is associated with costs di,j defined by the distance between the two cities. We are looking


for the sequence of the cities ci1, ci2, ..., cin, ci1 such that all consecutive cities in this order are directly connected and that the overall cost of the described route is minimal. In a complete graph, there are n! permutations or different arrangements, so the number of possible solutions K grows exponentially with n as

K = (n − 1)! ≈ sqrt(2πn) · (n/e)^n

according to Stirling’s formula (cf. [50]). As a simple example, we consider the following graph sketching four cities (n = 4 nodes) with connections (edges with weights d1,2 = 9, d1,3 = 10, d1,4 = 7, d2,3 = 8, d2,4 = 5, d3,4 = 4). For a fixed starting point in node 1, we get six different possible routes:
• 1→2→3→4→1: 9+8+4+7 = 28
• 1→2→4→3→1: 9+5+4+10 = 28
• 1→3→2→4→1: 10+8+5+7 = 30
• 1→3→4→2→1: 10+4+5+9 = 28
• 1→4→2→3→1: 7+5+8+10 = 30
• 1→4→3→2→1: 7+4+8+9 = 28

In this special case, four optimal solutions exist with identical total costs of 28. This shows that all of the many candidates have to be inspected if one truly wants or needs to get the exact best solution. For larger scenarios, this soon gets computationally too expensive (due to the NP-completeness, see Excursion 23), and one has to rely on heuristics to solve the problem approximately.
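The exhaustive enumeration above is easily automated; the following Python snippet performs the brute-force check for the four-city example (feasible only because n is tiny).

from itertools import permutations

# Edge weights of the example graph (symmetric: d[i][j] = d[j][i]).
d = {(1, 2): 9, (1, 3): 10, (1, 4): 7, (2, 3): 8, (2, 4): 5, (3, 4): 4}
dist = lambda a, b: d[(min(a, b), max(a, b))]

best = None
for middle in permutations((2, 3, 4)):            # fixed start and end in node 1
    tour = (1,) + middle + (1,)
    length = sum(dist(tour[i], tour[i + 1]) for i in range(4))
    print("->".join(map(str, tour)), "costs", length)
    best = length if best is None else min(best, length)
print("optimal tour length:", best)               # 28, attained by four of the six tours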

Table 7.2. List of the resulting number of directed edges (NE) for different numbers of gates (NG), check-in areas (NCI), and baggage claim areas (Nclaim). The upper part of the table refers to layouts with fully connected gates (cliques) sketched in Fig. 7.10, whereas the lower part describes a more realistic setup of partially connected gates involving a certain number NS of junctions as visualized in Fig. 7.11 and Fig. 7.12.

Figure     NG   NCI   Nclaim   NS   NE := # directed edges
7.10(a)     1    1     1       –      2
7.10(b)     2    1     1       –      6
7.10(c)     2    2     1       –      8
7.10(d)     3    2     1       –     15
7.10(e)     4    2     2       –     28
–           8    2     2       –    144
7.11(a)     3    2     1       2     13
7.11(b)     3    2     2       2     14
7.11(c)     4    2     2       2     16
7.12        8    2     2       4     34


[Figure 7.11 sketches three partial, cyclic layouts with junctions S1 and S2: (a) NG = 3, NCI = 2, Nclaim = 1, NS = 2; (b) NG = 3, NCI = 2, Nclaim = 2, NS = 2; (c) NG = 4, NCI = 2, Nclaim = 2, NS = 2.]

Figure 7.11. Sketch of the growing number of connections between gates, check-in areas, and baggage claim areas for a partial, cyclic layout involving junctions Si : Not every gate is directly connected to every other gate, and the check-in and baggage claim areas are connected via junctions.

For complete graphs, such as the five example scenarios shown in Fig. 7.10 (remember that complete subgraphs or cliques only exist in the gates and their connections), the formula to compute the total number of paths to reach a specified end node from a given start node is given by

f(N) = 1 + (N − 2) · f(N − 1) .   (7.2)

For N = 2, ..., 6, the corresponding total number of available paths is shown in Tab. 7.3.

Table 7.3. Number of paths to reach a specified end node from a given start node in a complete graph according to formula (7.2) for different numbers of nodes N.

N := number of nodes        N = 2   N = 3   N = 4   N = 5   N = 6   N = 10
# paths (fixed start/end)      1       2       5      16      65   109,601
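Formula (7.2) translates into a two-line recursion; the following Python sketch reproduces the entries of Tab. 7.3.

def num_paths(n):
    """Number of simple paths between two fixed nodes of the complete graph with n nodes, Eq. (7.2)."""
    return 1 if n == 2 else 1 + (n - 2) * num_paths(n - 1)

print([num_paths(n) for n in (2, 3, 4, 5, 6, 10)])   # [1, 2, 5, 16, 65, 109601]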

More realistic representations of DCV-connection graphs as sketched in Figs. 7.11 and 7.12 contain fewer connections and, thus, fewer paths.


[Figure 7.12 sketches a terminal-concourse structure with 8 gates (G1–G8), 2 check-in areas (CI 1, CI 2), 2 baggage claim areas (Claim 1, Claim 2), and 4 junctions (S1–S4).]

Figure 7.12. Sketch of connections between gates, check-in areas, and baggage claim areas for a partial layout similar to the ones in Fig. 7.11 (not every gate is directly connected to every other gate, and the check-in and baggage claim areas are connected via junctions) but resembling a structure with a terminal and a concourse with 8 gates, 2 check-in areas, 2 baggage claim areas, and 4 junctions.

However, (many) more gates, additional junction points, shortcuts, connected conveyor belts, and additional different DCV tracks for other concourses make things much more complicated (see Fig. 7.13 for the originally designed layout in Denver). In particular, the LB problem was known in the community of operations research to be NP-hard to solve, which is highly problematic for real-time control systems. Unfortunately, BAE was not aware of those theoretical aspects before starting the project, and the complexity problem had not appeared for the smaller systems the company had provided before. Since finding the overall optimal solution of the underlying optimization problem is so costly, an effective approach needs to use intelligent heuristics to compute approximate but good solutions that can be determined in polynomial time. In this context, many simulations115 and practical tests are necessary to stabilize the system and ensure reliable behavior under all relevant circumstances.

In the meantime, many new methods and algorithms have been developed for routing and LB problems. In Tab. 7.4, different details are summarized concerning research in baggage-handling control at the Delft Center for Systems and Control between 2008 and 2015,116 where a different (more complex) model of the overall system is applied.117 Note the relatively low number of nodes (i.e., check-in and baggage claim areas as well as junctions) that were subject to research (not implementation of a real-time production system) compared to the original layout of DIA’s baggage-handling system (more than 100 nodes including 88 gates). This glance at the state of the art in BHS research, more than 10 years after DIA’s opening and mostly for one of the three actual underlying problems in baggage-handling control separately, gives an impression of the impact of DIA’s original complexity.

115 See

[52] for a noncommercial example of a very simple setup for the Denver showcase. sequence of publications is one of the relatively rare ones directly available in the literature. 117 For details on the approach used, the so-called model predictive control (MPC), see [25, 42]). 116 This



(a) Five planned DCV routes according to [4]

(b) Planned layout of tracks according to [60]

Figure 7.13. Planned layout of DCV tracks connecting DIA’s three concourses as well as the check-in and baggage claim areas, published in Oct. 1994. (a) Reprinted with permission of Modern Materials Handling, a Peerless Media LLC publication. (b) Reprinted courtesy of US Government Accountability Office [60].

In 2005, UA shut down the controversial automated baggage system at Concourse B of DIA (cf. [2, 33, 46]) and replaced it with a conventional system in order to save maintenance costs (of about $1 million per month; see [14, 16]). BAE is no longer in existence and was sold to G&T Conveyor Co. in 2002 (cf. [46]). DCV-based systems are, however, running efficiently at airports such as Frankfurt, Munich, and London Heathrow Terminal 5 (T5).


Table 7.4. Example of research results of the Delft Center for Systems and Control on different types of (sub)problems for BHS control. The individual columns show the reference; the year of publication; the applied number of bags, check-in areas, baggage claim areas, junctions S, different scenarios concerning timing of the incoming bags; and the type of the three different subproblems (routing, LB, empty-cart management) that is solved. Note that in the publications [66] and [67], not a fixed number of bags but a flow rate of 5 and 8 bags per second, respectively, is used as an input. In [67], not only a single of the three subproblems is tackled, but the overall control problem in an integrated approach; this corresponds mostly to a production-type system.

source   year   # bags   # CIs   # claims   # junctions   # scenarios   problem
[53]     2008      25      2        2            2             25       routing
[58]     2009     120      6        1           10             27       routing
[56]     2009     180      6        1           10             27       routing
[57]     2009     460      4        1            4             18       routing
[54]     2009    2500      4        2            9              6       routing
[55]     2010     120      4        2            9             18       routing
[66]     2012   8/sec      2        1            ?              1       LB
[67]     2015   5/sec      2        2          5–9              ?       all

In many publications, DIA’s complexity problem in the context of the BHS is listed as a (warning) example [12, 17, 20, 27, 28, 34, 38, 51, 63].

7.2.7 Discussion In hindsight, a considerable number of warnings existed with respect to the problematic complexity and extremely tight schedule of DIA’s BHS project: • People fully experienced in real-time material handling software, in particular those familiar with the LB problem, would have noticed the huge complexity and the related problems in the layout (see [19, 36]). BAE did not have that experience. • After the start of the constructions at DIA, the consulting company Breier Neidle Patrone Associates had been hired to evaluate the feasibility of the automated baggage handling system. The company judged the project to be too complex for the given schedule, but the city of Denver decided to stick to the original plans (see [15, 19, 29]). • The new airport in Munich, Germany, which opened in 1992 during DIA’s planning phase, has a similar yet smaller baggage-handling system. Its installation took two years with an additional six months spent in 24/7 testing before the launch (see [14, 15]). Representatives of the Munich airport clearly advised extending testing to avoid failure. The city of Denver decided not to adapt the schedule. • Only 3 of the 16 companies contacted responded to the BHS call, and none of these bids predicted that they would be able to finish the project within the given schedule (see [14, 15]). The city rejected the three bids and approached BAE to take the project without changing the size or schedule.


• At BAE, senior managers were worried about the complexity of the project and estimated that four years instead of two would be necessary to complete it (see [15]). Those concerns were ignored.

Another issue amplified the inhibiting effects of complexity for the problematic baggage-handling system: too many innovations. As mentioned in [19, 38], DIA's originally planned BHS was
• the first automated system to serve an entire airport;
• the first system where DCVs did not stop to collect or drop luggage;
• the first system for oversized bags;
• the first system with RFID components for the DCVs; and
• the first system controlled by a distributed network of computers instead of a single mainframe.
For the design of production systems, relying too heavily on untested technology represents a considerable danger. In [36], e.g., a rule of thumb of using no more than 10% unproven technology is advised.

7.2.8 Other Bugs in This Field of Application

• After 20 years of planning, construction of Heathrow's new Terminal 5 started in September 2002. It was a joint project between British Airways (BA) and the British Airport Authority (BAA) designed to bundle BA's activities at Heathrow and create room for other airlines in the other terminals. Besides £75 million for technology, BAA invested more than £175 million in the information systems. This included 180 IT suppliers and the installation of 163 IT systems (cf. [37]). The total budget was about £4.3 billion (see [9]). The baggage system at T5 was developed in cooperation with Vanderlande Industries, IBM, and Alstec. Vanderlande, in particular, had a lot of experience in setting up systems of this size (cf. [32]). The system was supposed to transport up to 12,000 bags per hour (cf. [65]) and incorporated 11 miles of conveyor belts. The whole system was tested virtually by computer programs on several levels (see [22]) as well as in place for six months by 15,000 volunteers in 66 different trials with over 400,000 pieces of luggage running through the system (cf. [6, 32]).
The opening on March 27, 2008, turned out to be a total disaster. Besides problems with the car park and the security screening, the baggage system in particular broke down. Passengers flew without their luggage or missed their flights. Fig. 7.14 contains QR codes to access a selection of TV news reports of the event available on YouTube. The overall costs for BA have been estimated to be at least £16 million (see [32, 39]). Here is a list of problems that occurred (more details can be found in [32]):
– The baggage handlers sometimes received the erroneous information that certain flights had already taken off and took suitcases back into the terminal instead of preparing them for final loading (cf. [37]).
– Computers did not recognize some baggage handlers' IDs. Hence, those handlers were not able to log on to the baggage-handling system to work (see [9, 44, 59]).


Figure 7.14. QR codes for videos on YouTube regarding problems at Heathrow’s opening of T5: (a) report of BBC News, aired March 29, 2008; (b) report of NTDTV, aired March 29, 2008; (c) report of ITV, aired March 30, 2008.

– The baggage-handling teams were confused about the layout of the terminal and were not able to make up for the delays (see [44]).
– There was a lack of baggage storage bins (which are loaded onto planes), and some baggage-loading carousels broke down (cf. [44]).
– On the day of the opening, 68 flights had to be canceled, and other flights left with no luggage aboard (see [32]). Over the first five days, about 500 flights were canceled (cf. [59]).
– Passengers walked out of baggage claim without their suitcases after waiting for more than 90 minutes. Others had to wait for two hours (see [9, 44, 65]).
– In the following days, abandoned luggage—up to 28,000 pieces—was transported to storage and then partially to other airports such as Milan for faster sorting and delivery to the unhappy owners all over the world (cf. [7, 18, 32, 62]).

The basic material handling was implemented by Vanderlande's HELIXORTER system [61], with a capacity of 6,000 trays per hour and a conveyance speed of 2 m/s. The system is able to identify a bag's position and can also handle irregular or smaller items that often get trapped in conventional systems. This was supposed to improve sorting accuracy and reduce the number of mis-sorts and jams. Furthermore, the Vanderlande system BAGTRAX is used to transport late bags via DCVs (see [22]), which can be moved at high speed directly to the head of stand of the waiting aircraft.

An analysis of the failure revealed a combination of different causes:
– Lack of sufficient training of staff (cf. [9, 32, 37]).
– A software filter had been used during testing to prevent messages from being delivered to the "live" systems in other Heathrow locations. This filter was still active at the opening (and for four more days; see [32, 59]), so a number of bags sent to BA from other airlines were not recognized. Those bags had to be sent to manual sorting.
– Wireless LAN access was not operable at some of the stands [32].
– Due to a wrongly set configuration, the feed of data from the baggage-handling system to the baggage reconciliation system was not operational [32].



Figure 7.15. QR codes for videos on YouTube regarding (a) a 10-minute description of Heathrow’s current baggage-handling system at Terminal 5 (status 2015) by Vanderlande published on May 12, 2015, and (b) a recording of a 45-minute lecture on “Heathrow Terminal 5 - Success or Failure?” given by Tim Brady at the University of Brighton published on June 6, 2012.

– The transmission of BA flight information data between BAA and its contractor SITA was erroneous; therefore, direct baggage messaging for flights in T1 and T4 did not work correctly [32].
– Due to a delay in the construction of the terminal building, BA and BAA decided to reduce the time for testing the whole system. In addition, the test scenarios did not cover a large enough setup or the overall process and resembled a mid-size (rather than daily worst-case) workload of the system (see [32]).
– The baggage-handling data that had to be sent to the baggage reconciliation system were blocked by an "incorrect configuration" (cf. [32, 59]).
– The number of messages generated by the baggage system was considerably higher than estimated and led to problems with the system's server capacity (see [32]).

In contrast to Denver International Airport, the problems at Heathrow were fixed within about one month after the disastrous opening (see [32]). For a detailed discussion of aspects of failure and success of Heathrow's T5, see the recording of a lecture at the University of Brighton available via the QR code in Fig. 7.15(b). A survey by Vanderlande of the different components of Heathrow's current T5 baggage-handling system can be found using the QR code given in Fig. 7.15(a).

• Other airports have also suffered from computer-related problems at their openings (cf. [8, 32]). In 1998, the Kuala Lumpur International Airport in Malaysia and Hong Kong's airport Chek Lap Kok started operation, and both experienced delayed flights and luggage chaos due to computer failures.
• Baggage problems not only arise at openings but also continue to appear at operating modern airports. In 2017, e.g., Heathrow experienced problems in Terminals 3 and 5 (see [10]).


Bibliography

[1] L. M. Applegate, R. Montealegre, C.-I. Knoop, and H. J. Nelson. BAE Automated Systems (A): Denver International Airport Baggage-Handling System Building. Harvard Business School Case, 9-396-311:1–15, 1996 [2] L. M. Applegate, H. J. Nelson, R. Montealegre, and C.-I. Knoop. BAE Automated Systems (B): Implementing the Denver International Airport Baggage-Handling System. Harvard Business School Case, 9-396-312, 2001 [3] Arizona Republic. Arizona Republic from Phoenix, Arizona - March 27, 1994 - Page 85. date accessed: 2018-03-20, 1994. https://www.newspapers.com/newspage/123656921/

[4] K. A. Auguston. The Denver Airport: A Lesson in Coping with Complexity. Modern Materials Handling, 10:40–45, 1994 [5] M. Batchelor. DIA automated baggage handling system. date accessed: 2018-02-28, 2012. cs.furman.edu/~pbatchelor/sa/Slides/ DIAAutomatedBaggageHandlingSystem.pptx

[6] BBC News. Volunteers test out Heathrow T5. date accessed: 2018-02-13, 2007. http://news.bbc.co.uk/2/hi/uk_news/england/london/7008412.stm

[7] BBC News. Now snow hits flights at new T5. date accessed: 2018-02-13, 2008. http://news.bbc.co.uk/2/hi/7332956.stm

[8] BBC News. Other airports’ rocky starts. date accessed: 2018-02-13, 2008. http: //news.bbc.co.uk/2/hi/uk_news/7318540.stm

[9] BBC News. What went wrong at Heathrow’s T5? date accessed: 2018-02-13, 2008. http://news.bbc.co.uk/2/hi/uk_news/7322453.stm [10] BBC News. Heathrow Airport baggage system suffers failure. date accessed: 2018-02-23, 2017. http://www.bbc.com/news/uk-england-london-40285170 [11] C. Becker and A. Scholl. A survey on problems and methods in generalized assembly line balancing. European Journal of Operational Research, 168(3):694–715, 2006 [12] B. Bruegge and A. H. Dutoit. Object-Oriented Software Engineering: Using UML, Patterns, and Java. Prentice Hall, 3rd edition, 2010 [13] Calleam Consulting. Denver Airport Baggage System Case Study. date accessed: 2018-03-20. http://calleam.com/WTPF/?page_id=2086 [14] Calleam Consulting. Case study - Denver International Airport baggage handling system. An illustration of ineffectual decision making. date accessed: 2018-02-28, 2008. http://calleam.com/WTPF/?page_id=2086 [15] A. Coolman. Lessons learned from project failure at Denver International Airport: Why checking bags is still a pain. date accessed: 2018-02-14, 2014. https://www.wrike.com/blog/lessons-learned-from-project-failure-atdenver-international-airport-why-checking-bags-is-still-a-pain/


[16] J. A. Crowder, J. N. Carbone, and R. Demijohn. Multidisciplinary Systems Engineering: Architecting the Design Process. Textbooks in Telecommunication Engineering. Springer International Publishing, 2015 [17] S. Dalal and R. S. Chhillar. Case Studies of Most Common and Severe Types of Software System Failure. International Journal of Advanced Research in Computer Science and Software Engineering, 2(8):341–347, 2012 [18] M. Davis. T5 suitcases sent to Milan for sorting. The Independent, Apr. 2, 2008. http://www.independent.co.uk/news/uk/home-news/t5-suitcasessent-to-milan-for-sorting-803639.html

[19] R. de Neufville. The Baggage System at Denver: Prospects and Lessons. Journal of Air Transport Management, 1(4):229–236, 1994 [20] R. de Neufville and A. R. Odoni. Airport Systems: Planning, Design and Management. McGraw-Hill Education, 2nd edition, 2013 [21] P. S. Dempsey, A. R. Goetz, and J. S. Szyliowicz. Denver International Airport: Lessons Learned. McGraw-Hill, 1997 [22] R. Derksen, H. van der Wouden, and P. Heath. Testing the Heathrow Terminal 5 baggage handling system - before it is built. date accessed: 2018-02-13, 2007. http://archive.newsweaver.com/qualtech/newsweaver.ie/qualtech/ e_article000776781afeb.html?x=bcVhM0T,b4V6Phvj,w

[23] E. Falkenauer. Line Balancing in the Real World. In Proceedings of the International Conference on Product Lifecycle Management PLM, Vol. 5, pages 360–370. IEEE Comput. Soc. Press, 2005 [24] M. M. Frey. Models and Methods for Optimizing Baggage Handling at Airports. Ph.D. thesis, Technical University of Munich, 2014 [25] C. E. García, D. M. Prett, and M. Morari. Model predictive control: Theory and practice. A survey. Automatica, 25(3):335–348, 1989 [26] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., 1990 [27] W. W. Gibbs. Software's Chronic Crisis. Scientific American, (September 1994):1–13, 2016

[28] R. L. Glass. Software Runaways. Prentice Hall, 1998 [29] M. Gryszkowiec. Denver International Airport: Statement of Michael Gryszkowiec. Technical report, US General Accounting Office, 1995 [30] A. L. Gutjahr and G. L. Nemhauser. An Algorithm for the Line Balancing Problem. Management Science, 11(2):308–315, 1964 [31] M. Held, R. M. Karp, and R. Shareshian. Assembly-Line Balancing: Dynamic Programming with Precedence Constraints. Operations Research, 11(3):442–459, 1963 [32] House of Commons, Transport Committee. The Opening of Heathrow Terminal 5. Technical report, House of Commons, HC 543, 2008


[33] K. Johnson. Denver Airport Saw the Future. It Didn’t Work. The New York Times, pages 26–28, Aug. 27, 2005 [34] H. Kerzner. Project Recovery: Case Studies and Techniques for Overcoming Project Failure. John Wiley & Sons, 2014 [35] M. D. Kilbridge and L. Wester. A Review of Analytical Systems of Line Balancing. Operations Research, 10(5):626–638, 1962 [36] B. Knill. Denver International Airport Has Happened Before. Material Handling Engineering, 49(8):7, 1994 [37] M. Krigsman. IT failure at Heathrow T5: What really happened. date accessed: 2018-02-13, 2008. http://www.zdnet.com/article/it-failure-at-heathrowt5-what-really-happened/

[38] D. Krob. CESAM: CESAMES Systems Architecting Method - A Pocket Guide, 2017 [39] A. Lagorce. T5 teething pains could last until the summer. date accessed: 2018-02-13, 2008. https://www.marketwatch.com/story/teething-woes-atheathrows-t5-could-last-until-summer

[40] E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Dover Books on Mathematics. Courier Corporation, 1976 [41] E. L. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan, and D. B. Shmoys. The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley & Sons Ltd., 1991 [42] J. M. Maciejowski. Predictive Control: With Constraints. Pearson Education. Prentice Hall, 2002 [43] P. McQuaid and P. A. McQuaid. Software disasters and lessons learned. date accessed: 2018-02-28, 2006. https://www.stickyminds.com/sites/default/ files/presentation/file/2013/06STRWR_W5.pdf

[44] D. Milmo. Passengers fume in the chaos of Terminal 5’s first day. The Guardian, Mar. 28, 2008. https://www.theguardian.com/environment/2008/ mar/28/travelandtransport.theairlineindustry

[45] A. R. Myerson. Automation Off Course in Denver. The New York Times, pages 18–21, March 18, 1994. http://www.nytimes.com/1994/03/18/business/ automation-off-course-in-denver.html

[46] NBCNews.com. United abandons Denver baggage system. date accessed: 2018-02-12, 2005. http://www.nbcnews.com/id/8135924/ns/us_news/t/ [47] K. Nice. How baggage handling works. date accessed: 2018-02-12. https://science.howstuffworks.com/transport/flight/modern/baggage-handling.htm/printable

[48] M. L. Pinedo. Scheduling. Springer International Publishing, Cham, 5th edition, 2016


[49] J. K. Shim and J. G. Siegel. Operations Management. Barron's business review series. Barron's Educational Series, 1999 [50] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer, 3rd edition, 2002 [51] A. J. Swartz. Airport 95: Automated Baggage System? Software Engineering Notes, 21(2):79–83, 1996 [52] J. Swartz. Simulating the Denver Airport Automated Baggage System. Dr. Dobb's Journal, 1, 1997

[53] A. N. Tarau, B. De Schutter, and H. Hellendoorn. Travel Time Control of Destination Coded Vehicles in Baggage Handling Systems. In Proceedings of the 17th IEEE International Conference on Control Applications, pages 293–298. IEEE Comput. Soc. Press, 2008 [54] A. N. Tarau, B. De Schutter, and H. Hellendoorn. Hierarchical Route Choice Control for Baggage Handling Systems. In 12th International IEEE Conference on Intelligent Transportation Systems (ITCS 2009), Vol. 19, pages 1–6. IEEE, 2009 [55] A. N. Tarau, B. De Schutter, and H. Hellendoorn. Model-Based Control for Route Choice in Automated Baggage Handling Systems. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(3):341–351, 2010 [56] A. N. Tarau, B. De Schutter, and J. Hellendoorn. Decentralized Route Choice Control of Automated Baggage Handling Systems. IFAC Proceedings Volumes, 42(15):70–75, 2009 [57] A. N. Tarau, B. De Schutter, and J. Hellendoorn. Predictive Route Choice Control of Destination Coded Vehicles with Mixed Integer Linear Programming Optimization. IFAC Proceedings Volumes, 42(15):64–69, 2009 [58] A. N. Tarau, B. De Schutter, and J. Hellendoorn. Route Choice Control of Automated Baggage Handling Systems. Transportation Research Record: Journal of the Transportation Research Board, 2106(15):76–82, 2009 [59] R. Thomson. British Airways reveals what went wrong with Terminal 5. date accessed: 2018-02-13, 2008. http://www.computerweekly.com/news/2240086013/ British-Airways-reveals-what-went-wrong-with-Terminal-5

[60] US General Accounting Office. New Denver Airport: Impact of the Delayed Baggage System. Technical Report GAO/RCED-95-35BR, US General Accounting Office, 1994 [61] Vanderlande. HELIXORTER. date accessed: 2018-02-13, 2018. https://www. vanderlande.com/airports/innovative-systems/sortation/helixorter

[62] A. Vrouwe. Heimweg via Mailand. Sueddeutsche Zeitung, Apr. 4, 2008 [63] N. Walkinshaw. Software Quality Assurance: Consistency in the Face of Complexity and Change. Undergraduate Topics in Computer Science. Springer International Publishing, 2017


[64] Wikiarquitectura. Denver International Airport. date accessed: 2018-02-16. https://en.wikiarquitectura.com/building/denver-international-airport/

[65] P. Woodman. Disastrous opening day for Terminal 5. The Independent, Mar. 27, 2008. http://www.independent.co.uk/news/uk/home-news/disastrousopening-day-for-terminal-5-801376.html

[66] Y. Zeinaly, B. De Schutter, and H. Hellendoorn. A Model Predictive Control Approach for the Line Balancing in Baggage Handling Systems. IFAC Proceedings Volumes, 45(24):215–220, 2012 [67] Y. Zeinaly, B. De Schutter, and H. Hellendoorn. An Integrated Model Predictive Scheme for Baggage-Handling Systems: Routing, Line Balancing, and Empty-Cart Management. IEEE Transactions on Control Systems Technology, 23(4):1536–1545, 2015

7.3 Strategic Defense Initiative

I call upon the scientific community, who gave us nuclear weapons, to turn their great talents to the cause of mankind and world peace; to give us the means of rendering these nuclear weapons impotent and obsolete.
—Ronald Reagan, March 1983

7.3.1 Overview

• Field of application: military defense
• What happened: public funding for developing unreliable systems
• Loss: billions of dollars
• Type of failure: underestimating the complexity

7.3.2 Introduction

In 1983, President Reagan originated the Strategic Defense Initiative (SDI), which also became known as the Star Wars program. In the beginning, it was an outcome of the Cold War and was intended to guard the US against nuclear attacks from the Eastern Bloc. The intention was to break the mutually assured destruction that prevented war between the superpowers but also forced people to live under a permanent threat of nuclear attack. SDI was part of the arms race, which, in the long run, contributed to the collapse of the Soviet Union and the Warsaw Pact. A compact historical survey on SDI and related programs is given in [9, 20].
SDI can be seen as a more general and challenging version of the Patriot system (see Sec. 2.4), aimed at nuclear intercontinental missiles, which move much faster than the short-range missiles of the Iraq wars and are far more dangerous. The problem can be likened to "hitting a bullet with a bullet" (see [23]). Following the boost phase of an attacking nuclear missile (the first 5 minutes), the engagement should take place during the midcourse phase (about 20 minutes), leaving in total about 25 minutes for all stages of engagement.


Different scenarios were proposed, such as X-ray or chemical laser weapons, particle beam weapons, and ground- and space-based missiles. Especially prominent were the Hypervelocity Railgun program, similar to a particle accelerator with a projectile on rails, and the Brilliant Pebbles program, which was intended to use satellite-based interceptors or kinetic warheads made of tungsten.
The original SDI program never led to a practical and useful defense system. Over the years, it was redesigned and renamed by later US administrations (see [9], in particular, for the following facts). Under George H. W. Bush in 1991, Global Protection Against Limited Strikes (GPALS) was intended to defend against attacks of up to 200 vehicles from third world powers and against accidental attacks from superpowers. The Clinton administration (1993–2001) developed a three-phase National Missile Defense (NMD) architecture with significantly smaller funding, intended to intercept about 5–20 ballistic missiles. Under George W. Bush (2001–2009), the program was renamed Ground-based Midcourse Defense (GMD) in 2002 and was meant to protect the specified area (not necessarily the US homeland) from an undefined number of ballistic missiles from "rogue states" (see [8, 9, 20]). The Obama administration scaled down the size of the intended system (and its costs) and focused again more on homeland regions. Intended system components beyond 2017 can be found in [21, 22]. They comprise
• sensor elements such as the Aegis system in its naval or onshore variant, the Cobra Dane radar, or X-band radar118; and
• interceptors such as the THAAD (Terminal High Altitude Area Defense) batteries.119
At the time of writing, US officials declare that these defense systems are targeted mainly against the aggressive missile program of North Korea and against supposed possible attacks by Iran, but the missile defense program also unsettles other countries such as Russia and China.
Concerning the total costs, different numbers can be found in the literature. According to [8], the accumulated costs of these programs from the mid-1980s until 2007 sum up to about $110 billion, with another $8.9 billion requested for fiscal year 2008. A sum of about $123 billion was spent from 2002 until 2017 (i.e., more than $7 billion per year on average), plus an intended $37 billion until 2021, as reported by [22].

118 Aegis onshore systems are stationed in Romania and planned in Poland, as are radar systems in Turkey, the UK, and Denmark.
119 THAAD interceptors are based in the United Arab Emirates; Patriot PAC-3 in Germany, the Netherlands, Qatar, Saudi Arabia, Japan, and South Korea.

7.3.3 SDI and the Limits of Software Technology

The main goals of missile defense are
• fast and reliable detection of an attack, its origin, its target, and its trajectory; and
• interception of missiles and warheads, including discrimination between decoys and warheads.
Similar to other complex software systems, the software necessary to enable the SDI functionality of the whole program would be very large, on the order of millions of lines of code. Assuming about 0.1 defects per 1,000 lines of code—achievable only for high-quality software with very high effort120—still leads to more than 500 expected errors in the overall code.

120 See [11]. For commercial software (which typically is also shorter), 5 to 50 bugs per 1,000 lines of code are typically assumed (see also Sec. 4.2.2).
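A short MATLAB sketch makes the magnitude of this estimate explicit; the assumed code size of 5 million lines is purely illustrative, since the text only speaks of millions of lines of code.

% Back-of-the-envelope estimate of latent defects in a large code base.
% The code size of 5 million lines is an illustrative assumption.
linesOfCode = 5e6;            % assumed size of the overall SDI software
rateHighQuality = 0.1;        % defects per 1,000 lines for very high-quality code (see [11])
rateCommercial  = [5 50];     % typical range for commercial software

defectsHighQuality = linesOfCode/1000 * rateHighQuality   % about 500 latent defects
defectsCommercial  = linesOfCode/1000 * rateCommercial    % 25,000 to 250,000 latent defects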

Furthermore, an SDI system would possess very high complexity due to the heterogeneity of the hardware; of the type of data to be created, transmitted, and analyzed; of the dependencies between different subsystems; and of the time and length scales for communication, etc. This complexity has a severe impact on reliability, as the case of the Denver International Airport has already shown on a much smaller scale (see Sec. 7.2).
There are a number of aspects that make software for SDI especially difficult and unique. These problems were emphasized by D. Parnas in challenging the defense against nuclear intercontinental ballistic missiles (see [12, 13]):
• sharp time limits because of the short time horizon of 25 minutes between launch and impact;
• testing under realistic conditions is nearly impossible; a final test will be possible only in the event of emergency;
• no failure is allowed, but 100% reliability is impossible (even with automated testing and today's approaches to verification);
• no definite specification of an attacking missile, since the enemy can use different types of decoys, additional acceleration, different speeds, etc.;
• ethical aspects concerned with making war more likely by stimulating a preventive first strike by the enemy or provoking a delusive sense of security.
Some of Parnas's arguments might get weaker over time due to advances in software engineering, verification, and testing. But full reliability can never be reached, and failures in nuclear missile defense are fatal.
Different aspects concerning the feasibility of ballistic missile defense are discussed in [2, 4], collecting the contributions of both sides. In these discussions, C. L. Seitz supports SDI and emphasizes the importance of the system architecture compared to software engineering techniques. Assuming a hierarchical control structure, he claims that it is possible to construct SDI software that will work correctly the first time it is needed. H. Lin claims that removing errors from the code will be difficult because it will be a "real-time" system operating with different configurations, and it might be hard to make errors recur for debugging (see [10]). K. Werrell emphasizes that ballistic weapons serve more as a psychological terror weapon than as a military weapon (cf. [23]). In [7], the political implications of missile defense are discussed.
P. W. Anderson states in [1] that the SDI supporters do not propose specific configurations that could be shot down, but instead argue in terms of emotional hopes and glossy presentations. He describes the problem in three stages:
• SDI probably wouldn't work, even under ideal conditions.
• It almost certainly wouldn't work under war conditions.
• Each defense system will certainly cost at least 10 times more than the offensive system it is supposed to shoot down. Hence, the attacking party has an enormous, inescapable advantage.
The only meaningful setup of SDI would be exploding H-bombs in space, with all the associated negative effects such as pollution; see [1].


Between 2000 and 2005, T. Postol also argued against SDI [14, 16, 17]. This led to a severe controversy between Postol and the Pentagon. J. Weizenbaum from MIT supported Parnas's arguments. He remarked in [4] that specifications of a defense system are not and cannot be defined.
In general, the SDI controversy cannot be considered in isolation from
• political orientation: far-right or conservative vs. liberal, socialist, or communist opinions;
• foreign affairs and external influence; and
• commercial interests.
Therefore, opponents frequently escalated discussions beyond the limits of purely scientific arguments.
To summarize, the SDI problem is an extreme application for software development: users should be aware of the limitations of computers, of the risk of relying on systems that can never be fully reliable for crucial and fatal applications, and of the question of whether such software is useful or hazardous.

7.3.4 SDI-Related Bugs

• The Homing Overlay Experiments (HOE) were the first tests of a hit-to-kill system by the US Army, based on a two-stage kinetic kill vehicle to destroy ballistic missiles.121 In December 1983, the third intercept test failed: A software error in the on-board computer prevented the conversion of optical homing data to steering commands (see [15]).
• In 1985, a laser experiment was conducted as part of the High Precision Tracking Experiment (HPTE). A laser beam released from the top of a mountain was to be reflected via a mirror on the Space Shuttle (see [3]). As part of the input data, the height of the mountain was given in feet. Unfortunately, the computer program interpreted this information as nautical miles (cf. [19]); a small numerical sketch of the resulting error follows this list. The experiment was successfully repeated two days later.
• The Terminal High Altitude Area Defense system (THAAD) was designed in 1987 by Lockheed Martin and has been in service since 2008 to intercept short-, medium-, and intermediate-range ballistic missiles. The fourth test, on December 13, 1995, failed: A software error caused an unneeded kill vehicle divert maneuver, which resulted in the kill vehicle running out of fuel before an intercept was possible (cf. [24]).
For a collection of failures in missile defense tests, see [20] and different contributions in The RISKS Digest, such as [18], [5], and [6].

121 Note that the first versions of the Patriot system are older but were intended to intercept aircraft instead of ballistic missiles.
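The following MATLAB sketch illustrates the order of magnitude of the HPTE unit mix-up; the mountain height of 10,000 ft is an assumed value for illustration only, as the text does not state the actual figure.

% Unit mix-up in the HPTE experiment: a height given in feet interpreted
% as nautical miles. The height of 10,000 ft is an illustrative assumption.
heightInput = 10000;        % value supplied as input (meant as feet)
ftToMeter = 0.3048;         % 1 foot in meters
nmToMeter = 1852;           % 1 nautical mile in meters

intendedHeight = heightInput * ftToMeter   % about 3,048 m above sea level
misreadHeight  = heightInput * nmToMeter   % 18,520,000 m, i.e., 18,520 km --
                                            % far beyond the Shuttle's orbit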

Bibliography

[1] P. W. Anderson. The Case against Star Wars. In More and Different: Notes from a Thoughtful Curmudgeon, Chapter VII, pages 300–306. World Scientific, 2011


[2] K. Bowyer. “Star Wars” Revisited—a Continuing Case Study in Ethics and SafetyCritical Software. In 31st Annual Frontiers in Education Conference. Impact on Engineering and Science Education. Conference Proceedings, Vol. 21, pages F1D7–F1D-12. IEEE, 2002 [3] W. J. Broad. Laser Test Fails to Strike Mirror in Space Shuttle. The New York Times, page A1, June 20, 1985 [4] D. Chiou. Computer Experts Meet to Debate SDI Feasibility. The Tech online Edition, 105(44), 1985. http://tech.mit.edu/V105/N44/sdibat.44n.html [5] J. Epstein. Honest, General, it was only a little glitch. date accessed: 2018-03-09, 2005. https://catless.ncl.ac.uk/Risks/23/66#subj11 [6] J. Epstein. Missile interceptor doesn’t even leave its silo—again. date accessed: 2018-03-28, 2005. https://catless.ncl.ac.uk/Risks/23/72#subj1 [7] C. L. Glaser and S. Fetter. National Missile Defense and the Future of U.S. Nuclear Weapons Policy. International Security, 26(1):40–92, 2001 [8] S. A. Hildreth. Ballistic Missile Defense: Historical Overview. Technical report, Congressional Research Service, The Library of Congress, 2007 [9] T. Karako, I. Williams, and W. Rumbaugh. The Evolution of Homeland Missile Defense. In Missile Defense 2020: Next Steps for Defending the Homeland, CSIS Reports, Chapter 2. Center for Strategic & International Studies, 2017 [10] H. Lin. The Development of Software for Ballistic-Missile Defense. Technical report, Center for International Studies, MIT, 1985 [11] S. McConnell. Code Complete. DV-Professional Series. Microsoft Press, 2004 [12] D. L. Parnas. Software Aspects of Strategic Defense Systems. ACM SIGSOFT Software Engineering Notes, 10(5):15–23, 1985 [13] D. L. Parnas. SDI: A violation of professional responsibility. date accessed: 2018-03-23, 2007. http://www.cse.msu.edu/~cse870/Lectures/ SS2007/ParnasPapers/parnas-SDI-ramirez-notes.pdf

[14] C. P. Pierce. Going Postol. The Boston Globe, October 23, 2005. http://archive. boston.com/news/globe/magazine/articles/2005/10/23/going_postol/

[15] J. Pike. HOE Homing Overlay Experiment. date accessed: 2018-03-23. https: //www.globalsecurity.org/space/systems/hoe.htm

[16] T. A. Postol. Letter from Ted Postol to John Podesta concerning visit of DSS investigators. date accessed: 2018-03-23, 2000. http://www.bits.de/NRANEU/ BMD/documents/Postol3.htm

[17] T. A. Postol and G. N. Lewis. The proposed US missile defense in Europe: Technological issues relevant to policy. date accessed: 2018-03-03, 2007. http: //russianforces.org/files/BriefOnEastEuropeMissileDefenseProposal_ August24,2007_FinalReduced.pdf

[18] V. Prevelakis. Missile interceptor doesn’t even leave its silo. date accessed: 201807-31, 2005. https://catless.ncl.ac.uk/Risks/23.65.html#subj5


[19] S. Rosenblum. Misguided Missiles: Canada, the Cruise and Star Wars. Canadian issues series. J. Lorimer, 1985 [20] Union of Concerned Scientists. US ballistic missile defense timeline: 1945-Today. date accessed: 2018-01-31, 2017. https://www.ucsusa.org/nuclear-weapons/ us-missile-defense/missile-defense-timeline#.WppcHXwiEkI

[21] US Department of Defense Missile Defense Agency. The ballistic missile defense system. date accessed: 2018-03-09, 2018. http://www.mda.mil/system/system. html

[22] US Government Accountability Office. Missile Defense - Some Progress Delivering Capabilities, but Challenges with Testing Transparency and Requirements Development Need to Be Addressed. Technical Report GAO-17-381, United States Government Accountability Office, 2017. https://www.gao.gov/assets/690/ 684963.pdf

[23] K. P. Werrell. Hitting a Bullet with a Bullet. A History of Ballistic Missile Defense. Technical report, College of Aerospace Doctrine, Research and Education, Air University, 2000. www.dtic.mil/dtic/tr/fulltext/u2/a381863.pdf [24] Wikipedia. Terminal High Altitude Area Defense. date accessed: 2018-03-09. https://en.wikipedia.org/wiki/Terminal_High_Altitude_Area_Defense

Chapter 8

Appendix

8.1 Urban Legends and Other Stories

some think reft and light cannot be confused. what an ellol.
—Ernst Jandl, translated from German

8.1.1 Introduction

Bugs also occurred before the computer era, and such errors often appear as illustrative examples for certain classes of software failures. The validity of such stories is frequently hard to judge, and sometimes they are purely urban legends. A very prominent but unverified and probably apocryphal example is related to the Napoleonic wars and the battles of Ulm and Austerlitz (cf. [27, 67]). Following the narrative, the Austrian and Russian armies wanted to combine their forces against Napoleon in 1805. Both nations agreed on the end of October for uniting their armies, but used different calendars (the Julian calendar in Russia and the Gregorian calendar in Austria) and, thus, actually different dates for the meeting. Therefore, Napoleon was able to deal with the two armies separately, beating the Austrian forces in a series of battles near Ulm, and the main Russian army—reinforced only by remaining weak Austrian troops—at Austerlitz.
The following issues are categorized by their respective field of application.

8.1.2 Finance

• In 1983, the Bank of America planned to switch to a new software system, "MasterNet," for account management. The system went live in 1987 and caused system crashes, wrong accounting, erroneous money transfers, and delays in the payment of interest and in account statements (see [24, 60]). In 1988, the system was abandoned. The related problems led to a loss of more than $23 million.


• In November 1985, the Bank of New York had to deal with a flood of transactions as 32,000 government security issues poured in. The software was designed to cope with 36,000 issues, but due to a software glitch, the computer corrupted the transactions: there were inadequate instructions regarding the method of storing data when the number of issues exceeded 32,000 (see [5, 62]).122 Because of the computer failure, the bank had a $32 billion overdraft on its cash account at the New York Federal Reserve Bank (see [20]). For this overdraft, the bank had to pay $5 million in interest.
• The infamous salami slicing technique can be utilized to take advantage of rounding errors occurring in financial transactions. Programmers can transfer tiny amounts of money ("salami slices")—stemming from rounding down to the nearest cent—to their own accounts, collecting huge amounts in this way (see [26, 30, 75]); a small numerical sketch follows below. In a slightly modified form, such a scheme was applied by accountant E. Royce in 1963 (cf. [45, 57]), and similar stories are collected in [26, 30].123

122 Some sources claim the use of a 16-bit variable instead of a 32-bit variable [46].
123 The salami slicing technique is also part of the plot of the movies Superman III, Hacker, and Office Space.
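The following MATLAB sketch illustrates the salami slicing mechanism with invented numbers (account balances, interest rate, and number of accounts are all arbitrary): each individual rounding remainder is below one cent, but over many accounts the skimmed fractions add up.

% Minimal sketch of the "salami slicing" idea: interest amounts are rounded
% down to whole cents, and the discarded fractions are collected elsewhere.
% All figures are made up for illustration.
rng(42);                                   % reproducible toy data
nAccounts = 1e6;
balances  = 100 + 9900*rand(nAccounts,1);  % balances in [100, 10000] dollars
rate      = 0.0031;                        % monthly interest rate (illustrative)

interestExact    = balances * rate;
interestCredited = floor(interestExact*100) / 100;   % rounded down to the cent
slices = interestExact - interestCredited;            % at most one cent each

sum(slices)   % roughly nAccounts * 0.005 dollars, i.e., about $5,000 per month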

8.1.3 Neural Networks

Neural networks (NNs) are black-box tools to derive results for problems such as classification. They represent one of several approaches in the field of machine learning and artificial intelligence. The variants of NNs can be used to create a black-box model for a function or behavior by observing the output data related to specified inputs. The original idea (see [29, 66]) was to imitate the human brain with layers of neurons that, in the event of an incoming signal, stimulate neighboring connected neurons with a certain intensity. In this way, a certain input at the input layer generates an output at the final layer. By adjusting the local intensities via given training data in demanding learning algorithms, the developers of an NN can design it for specific tasks such as classification or detection of certain features in the input data. The concept of NNs saw a certain boost of popularity both in science (due to considerable advancements in the setup and algorithms) and in the media (mainly due to AlphaGo, an artificial intelligence program developed by Google DeepMind that was able to beat world champions at playing Go). But NNs also come with restrictions and possible disadvantages:
• The user has no insight into the criteria on which the trained NN bases its classification results (see [17]).
• The training input data have to be chosen very carefully to avoid bias toward hidden, unintentional criteria.
In 1994, stories emerged about technically flimsy setups of NNs. According to [44] and [19], an NN was trained to detect tanks in images. In verification tests following the training period, the network failed completely at recognizing tanks. It turned out that the training images used contained important additional features: The images with tanks always showed clouds too, whereas all images without tanks had been taken on sunny days. Hence, the network was unknowingly trained to detect clouds, not tanks.
Recently, research papers have been published facilitating the detection of criminals ([1, 48, 51, 63]) or the classification of people according to their sexual orientation ([13, 63]) based on portrait pictures. Unfortunately, both results turned out to be very controversial—technically and ethically. When classifying images of criminals and noncriminals, the training images might contain additional features that the NN uses unintentionally as main criteria (e.g., the type of T-shirt that the prisoners are wearing or a certain style of camera position or angle preferred for selfies). Besides the doubtful scientific outcome, this type of classification is also questionable from an ethical point of view. What does it actually mean when a person is classified as 80% criminal or homosexual? Who will use this data and for what purposes?
Recent publications also challenge the use of deep NNs in safety-critical applications. In the context of image classification, changing only a few pixels in an image can lead to a completely different classification result (see [18, 58]). As an example, a stop sign somewhat modified by a few stickers is interpreted as a speed limit by the NN. Such behavior can be especially dangerous in the context of autonomous driving (compare the car accidents in Sec. 4.2).
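The basic mechanism described above (layers of simple units whose connection weights are adjusted from training data) can be illustrated with a minimal MATLAB sketch. The toy network below is trained on the XOR problem with plain gradient descent; network size, learning rate, and iteration count are arbitrary choices and not related to any of the systems discussed in this section.

% Minimal two-layer neural network trained on XOR with gradient descent.
% Purely illustrative; all design choices are arbitrary.
rng(0);
X = [0 0; 0 1; 1 0; 1 1]';          % inputs, one column per sample
T = [0 1 1 0];                      % target outputs (XOR)

nHidden = 4; eta = 0.5; nIter = 20000;
W1 = randn(nHidden,2); b1 = zeros(nHidden,1);   % input  -> hidden weights
W2 = randn(1,nHidden); b2 = 0;                  % hidden -> output weights
sigma = @(z) 1./(1+exp(-z));                    % logistic activation

for it = 1:nIter
    % forward pass
    H = sigma(W1*X + b1);           % hidden layer activations
    Y = sigma(W2*H + b2);           % network output
    % backward pass (squared-error loss)
    dY = (Y - T) .* Y .* (1-Y);
    dH = (W2' * dY) .* H .* (1-H);
    % gradient descent update of the connection weights
    W2 = W2 - eta * dY*H';   b2 = b2 - eta * sum(dY);
    W1 = W1 - eta * dH*X';   b1 = b1 - eta * sum(dH,2);
end
round(Y,3)   % typically close to the XOR targets 0 1 1 0; the outcome
             % depends on the random initialization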

8.1.4 Bridges and Railway Tracks

• In the course of German reunification, the originally double-tracked line between Eichenberg and Arenshausen (near Kassel and Erfurt, close to the inner-German border), which had been destroyed in Cold War times, had to be restored, beginning with only one track. Two building sites started from the two ends. Both crews started to build their respectively right track in the advancing direction—with disappointing results (see [14] and Fig. 8.1).
The inverse of the above story happened after World War II (cf. [14]): One track of the double-tracked line Rostock–Schwaan had to be pulled down and brought to Russia as reparation. On both sides, workers started to pull down the respectively right track.
The trustworthiness of the sources for these two stories is hard to judge because they are primarily based on information from newsgroups.

Figure 8.1. Visualization of problems with building or tearing down a double-track railway connection starting from both ends to meet in the middle.

• The following true story is related to bridging the river Rhine. Germany and Switzerland measure their altitudes relative to different reference levels: Germany refers to the level of Amsterdam in meters above normal zero, and Switzerland uses meters above sea level relative to the Repère Pierre du Niton (two unusual rocks in Lake Geneva). This fact was well known to the engineers responsible for planning the bridge over the Rhine connecting the Swiss and German parts of the city of Laufenburg. To take the resulting difference of 27 cm into account, a superelevation of the counter bearing on the Swiss side was planned. But due to an error in the leading sign, the counter bearing was lowered by 27 cm, resulting in a possible difference of 54 cm for the bridge. During construction, the mistake was detected and had to be resolved by correcting the counter bearing on the German side (cf. [70]). Similar events are said to have happened during the building of the first railway tracks from Germany to France and Belgium in the 19th century (see [14]).
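A tiny MATLAB sketch makes the doubling of the offset explicit; the numbers are the ones quoted above.

% Sign error in the height correction at Laufenburg: the 27 cm offset
% between the two national reference levels was applied with the wrong sign.
offset   = 0.27;          % difference of the reference levels in meters
intended = +offset;       % planned superelevation of the Swiss counter bearing
applied  = -offset;       % what was actually set due to the sign error

mismatch = intended - applied   % 0.54 m: twice the offset had to be corrected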


8.1.5 Ships and Trains

• The Zenobia was a roll-on/roll-off ferry that sank on her maiden voyage from Sweden to Syria in 1980. Close to Larnaca, Cyprus, more and more water was pumped into the side ballast tanks due to a computer malfunction. Slowly, the list of the ship got worse, and the captain and the crew abandoned the ship, which sank on June 7, 1980. Examinations revealed that the trimming software, which is necessary to balance the load distribution, was faulty and caused the accident (see [15, 50, 71]). Today, the wreck, lying 1.5 km off Larnaca, serves as a fancy scuba diving location.
• For the London Underground rail network, the train control had been reduced to only two buttons that the driver had to press: one to close the doors, and another one to start the train. An interlock prevents the train from running before the doors are indeed closed. In April 1990, a driver allegedly introduced a further optimization by taping the second button permanently down so that the train would start immediately as soon as the doors were closed. Unfortunately, one day, one of the doors was blocked. So the driver stepped out and unblocked the door—leaving himself outside the departing train. The train stopped automatically at Pimlico, the next station, but an inspector had to open the doors and let the passengers out (see [38, 41, 43]). A similar accident happened in 1993, when a driver intended to inspect a defective door at King's Cross (cf. [61]). In July 2011, a London train on the Victoria line left the Warren Street station with its doors still open (see [42]). The driver had to perform an emergency stop and reboot the computer system.
• The Bay Area Rapid Transit (BART) system connecting the San Francisco Bay Area started operation in 1976. Its control system was haunted by "ghost trains" which caused trouble (see [38, 69]): trains showed up on the computer system which did not exist but blocked tracks, and real trains sometimes disappeared from the system. Additional operators in booths on the platforms were necessary for several years.

8.1.6 Warplanes

• Following [25, 38], crossing the equator in a flight simulator caused a simulated US F-16 fighter to flip over, killing the pilot (in simulation). The bug resulted from a missing minus sign indicating south latitude.
• During another flight simulation, the F-16 flew upside down because the computer could not decide whether to do a left or a right roll to return to the upright position, and the software froze (see [38]).
• The Swedish fighter plane JAS 39 Gripen became famous for its accidents and bugs. Pilot-induced oscillations were responsible for crashes such as the ones on February 2, 1989, and August 8, 1993 (cf. [38, 40, 65]). The Gripen reveals the strong dependency of current military aircraft on computer support. To put it in an extreme way: These airplanes are only able to stay airborne thanks to intelligent and fast software. Hence, if the software is weak, severe problems and crashes are likely.


8.1.7 Submarines and Torpedoes

• During World War II, US submarines were haunted by faulty torpedoes. One problem was so-called hot runs, where a torpedo activates itself while still on board. A second issue was circular runs (cf. [59]), where the direction of a fired torpedo somehow got changed, letting it target the firing submarine itself; this resulted in at least two submarines sinking themselves: the USS Tullibee and the USS Tang (see [77, 78]). To avoid circular runs of torpedoes, anticircular mechanisms were designed. The following legend can be found in the literature (cf. [2, 8, 55]): As an anticircular mechanism, a gyroscope in the torpedo was supposed to measure the deviation and trigger a self-destruction in the event of a deviation of about 180◦. During a maneuver, the captain ordered the firing of a torpedo, but the torpedo got stuck in the tube. After short consideration, the captain commanded a turnaround to port—with fatal effect.
This legend may be traced back to the loss of the nuclear submarine USS Scorpion (see [38, 64, 76]), which sank mysteriously in May 1968 in the Atlantic. The boat lies on the seabed, about 740 km southwest of the Azores, more than 3,000 m deep. In the course of the investigation, the bathyscaphe Trieste II visited the submarine wreck and took pictures. The inquiry board discussed different possible scenarios, and the available data indicate that the submarine performed a 180◦ turn shortly before sinking. Since the circumstances of the sinking of the USS Scorpion are somewhat mysterious (no clear reason has been identified124), and since a 180◦ turn seems to actually have happened, this combination might have created the legend.
However, this legend is most likely untrue. First, it is much more prudent to simply deactivate a circular-run torpedo instead of letting it explode (see [54]). Second, there was a clear order to US submarine captains to perform a 180◦ turn with the vessel in case of an issue with a hot-run torpedo (cf. [54]); such an order would make no sense at all for an explosion mechanism. To summarize, the 180◦ legend that sometimes appears in the software engineering community might be an altered version of the tale of the hapless Scorpion.
• Another mishap affected the USS Squalus in 1939 (see [11, 35]). The board of red and green lights in the control room signaled that all hull openings were closed. But unfortunately the induction to the engine room was open, and the engine rooms were flooded. Twenty-six men died, and only 33 survived an undersea rescue. So in this true story, a nonfunctioning red light, which should have indicated the open induction to the engine room, led to the failure.

124 Other explanations for the mysterious loss of the USS Scorpion also circulated, such as an attack by a Russian boat (see [76]).

8.1.8 Nuclear Defense

• During the Cold War, incidents of false alarm were provoked by seemingly attacking nuclear missiles. The Ballistic Missile Early Warning System (BMEWS) site in Thule, Greenland, part of the NORAD defense system, actually mistook the moon for a hostile incoming missile in October 1960 (see [38, 52, 74]). Since Khrushchev, the Soviet leader, was visiting the United Nations in New York at that time, and because the signals indicated more than 1,000 missiles, no counterstrike was initiated.


• On November 9, 1979, a similar alarm occurred at NORAD headquarters inside Cheyenne Mountain. More and more missiles, launched from submarines and the Soviet Union, were indicated on the screen. The early warning sensors of satellites showed possible attacks up to four times a day, triggered by forest fires, volcanic eruptions, or other sources of heat; the officers on duty would then discuss the threat. But this time a massive attack was indicated, with more and more missiles filling the screen. Because other radar and ground stations did not detect any missiles, and no Soviet warheads arrived, no counterstrike was initiated. Later on, the reason for the false alarm was found: A technician had put a tape containing realistic details of a war game into one of the computers (cf. [52]). The authors of [36] report 50 false alerts from the NORAD defense system in 1979 alone.
• The Russian satellite Kosmos 1382 registered on September 26, 1983, that five successive American nuclear missiles—all launched from the same American missile base—would reach the USSR within minutes. Fortunately, the responsible officer, Lieutenant Colonel Petrov, did not react in the prescribed way by launching Russia's counterstrike missiles (see [53]) but ordered a check of the computer system. Assuming that a serious attack would not be launched from a single base with only a few missiles, Petrov decided to categorize the message as a false alarm. Thus, he saved the world from a nuclear war (but actually was disrespected for many years in Russia for disobeying orders). An investigation later determined that the likely cause of the false alarm was a strong reflection of sunbeams on a cloud layer that was interpreted as a missile launch.
• In December 1984, due to a 180◦ heading error, a Soviet test cruise missile headed for Finland. The same problem occurred when the US naval frigate George Philip opened fire 180◦ off target, in the direction of a Mexican ship, in July 1983 (see [38]).

8.1.9 Space Flight

• In 1958, Pioneer I was the first spacecraft launched by NASA; it was intended to reach the moon. Unfortunately, during the launch on October 11, 1958, the guidance system steered the rocket too high and too fast, so that the second stage shut off 10 seconds earlier than planned, and the third stage was bumped during separation. As a result, the probe only reached a height of around 114,000 km before falling back to Earth and burning up. Only a small amount of useful information was collected and sent in the time window of 43 hours (see [73]).
• Different stories can be found in the literature about the failure of the Venus probe Mariner 1. NASA launched Mariner 1 on July 22, 1962, to fly by Venus. The flight had to be aborted immediately after the launch because a faulty application of guidance commands made steering impossible. The most popular explanation for the accident mentions a missing "-" in the equations, the code, or the data. Following A. C. Clarke, Mariner 1 was "wrecked by the most expensive hyphen in history" (see [12]).
Other sources claim the error was caused by the lack of an overbar in the handwritten equations (see [9, 37]).


By omitting this overbar in the variable Ṙ, which described the smoothing of the derivative of the radius R, the related mean value of the velocity was replaced by the raw velocities. This amplified minor fluctuations into serious deviations, which had to be corrected. Giving the simple omission of a hyphen as the basic cause could also be a substitute for a more complicated explanation that would have been too complex for NASA officials to present to nonexperts. The official NASA website mentions an "improper operation of the Atlas airborne beacon equipment" and that "the omission of a hyphen in coded computer instructions in the data-editing program allowed transmission of incorrect guidance signals to the spacecraft" (cf. [34]).
Often, a faulty FORTRAN code also serves as an explanation (see [6, 38, 79]). An erroneous DO loop of the form "DO 5 K=1. 3" would be understood by the compiler as the assignment "DO5K = 1.3"; the correct code should be "DO 5 K=1, 3". So the replacement of the comma by a period transforms the DO loop into an assignment. But it is questionable whether NASA was using FORTRAN at the time. A similar bug seems to have appeared in the Mercury project in 1961 (cf. [7]) and might be the source of this version of the Mariner 1 story.
• The Voyager missions were a milestone in the exploration of the outer planets and interstellar space. Voyager 2 was launched on August 20, 1977—16 days earlier than Voyager 1—with the goal of visiting Jupiter, Saturn, Uranus, and Neptune, using each planet also to accelerate the probe toward the next one (see [31]). A certain constellation of the outer planets facilitated these two journeys of Voyager 1 and 2 in a small time window.125

125 These two probes continue to travel and have meanwhile left the solar system and reached interstellar space (see https://voyager.jpl.nasa.gov/ for details).

The Voyager 2 mission was plagued by a number of software problems that luckily could all be resolved during the flight:
– Following the launch, the computer of Voyager lost orientation and tried to switch to backup sensors. Fortunately, the Centaur rocket attitude system was in charge and corrected the disequilibrium of Voyager's control unit. But after separation, Voyager's control unit again ran wild in switching thrusters and actuating valves to stabilize its orientation (see [31], also summarized in [4, 47]). Because of the disorientation, its executive program shut down most communications with Earth, as scheduled in the event of a spacecraft attitude disorientation. It took 79 minutes until Voyager was able to find the sun, establish a known orientation, and renew contact with ground control. When writing the related code, the programmers had assumed that a loss of orientation could only be triggered by a hardware failure in deep space, and therefore should be answered by a shutdown of the communication and fixing of the failure itself. Nobody expected this behavior during or shortly after liftoff.
– A second error affected the images sent during the long trip (see [22, 28]). With growing distance, the data rate was shrinking while the amount of interesting data was growing. Therefore, software for compression of the image data was implemented. The image size used by Voyager's cameras was 800 × 800 pixels, so the related data size added up to 640,000 pixels, each resulting in a gray value between 0 and 255. In total, one image contained 5,120,000 bits. To reduce the amount of data, a compression method had to be invented (a small sketch of such a difference-based scheme follows at the end of this subsection).


To this aim, only one pixel value was stored explicitly; all the others were stored as the difference to a neighbor. Hence, the second spare computer was reprogrammed to install software for this image compression. On the other hand, this procedure increased sensitivity and error-proneness, because one wrong bit at a sensitive pixel position could change all the following values through the repeated differences. Unfortunately, six days before the Uranus encounter, such an error occurred. Large blocks of black and white lines overlaid all the images that were transmitted in compressed form. A faulty bit in a command had exactly the wrong value, 1 instead of 0. The error could have been caused by a cosmic particle (which would have allowed an immediate correction) or by a permanent flaw in the hardware. So the code was rewritten on the fly to avoid the possibly erroneous memory location. This correction allowed the transmission of accurate and compressed images of Uranus.
– Another problem was related to the navigational data and the dynamic model of the trajectory. When approaching Uranus, the predicted flight path did not coincide with the observed flight path. The reason was an erroneous estimate of Uranus's inertia. This value had to be corrected by 0.3% in order to remove the differences and to allow further accurate prediction of the trajectory (cf. [22, 28]).
– In April 2010, Voyager 2 started sending unintelligible scientific data. The error was caused by one bit in the memory that formats the data, which had flipped from 0 to 1. By resetting the system, Voyager was again able to send the data in the proper form (see [33, 56]).
• The Russian space probe Phobos 1, meant to explore Mars, was launched on July 7, 1988. The control unit was fed with erroneous commands: A hyphen was left out. The code should have been checked by a proofreading computer before transmission, but the proofreading computer was not working, and the technician in charge sent the commands anyway. The faulty code resulted in the end-of-mission command and in the shutdown of all systems (see [72]).
• The Clementine mission was launched by NASA on January 25, 1994. It served for lunar exploration and was intended to pass by the asteroid Geographos afterwards. But following the successful lunar mission, an unspecified malfunction in one of the on-board computers caused the thrusters to fire until the propellant was consumed. Therefore, Clementine was not able to continue and study the surface of the asteroid Geographos (cf. [32, 68]).
• The Mars Global Surveyor was doomed by incorrect software commands. Investigations revealed that when installing a patch for a software error, the engineers stored corrections at wrong locations in the memory. This resulted in the deactivation of two safety systems that were supposed to make sure that the solar panel would move only within a narrow angle, and that communication with Earth would be preserved in case of a malfunction. Later on, during the alignment of the solar panels, the panels were turned too far. After the emergency modes were activated, the probe came into full sunlight, which overheated one of the batteries. The software interpreted this as overcharging and cut off the power supply. Since communication had also been broken by the software update, the batteries were empty twelve hours later, and the probe was completely lost (see [10, 49]).


• In 1965, the Gemini 5 capsule with two astronauts on board landed 130 km short of the planned landing point in the sea. The rotation of the Earth is 360.98° per 24 hours, but a programmer had entered it as 360°. Fortunately, this caused no serious problems because several US ships were in the landing area anyway (see [3, 38, 55]).
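The differencing scheme mentioned in the Voyager 2 item above can be illustrated with a minimal MATLAB sketch (our own illustration with invented pixel values, not the actual Voyager flight code): storing only the first pixel explicitly and all further pixels as differences means that a single corrupted difference shifts every reconstructed pixel that follows it.

% difference-based compression of a row of gray values
pixels  = [100 102 101 105 110 108];      % invented example values
code    = [pixels(1) diff(pixels)];       % stored data: first value + differences
codeBad = code;
codeBad(3) = codeBad(3) + 16;             % one flipped bit (weight 16) in entry 3
rec    = cumsum(code);                    % correct reconstruction = original pixels
recBad = cumsum(codeBad);                 % every pixel from position 3 on is off by 16
disp([rec; recBad])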

8.2 MATLAB Example Programs

The following MATLAB examples have been developed and tested with the release version MATLAB R2017b.

8.2.1 MATLAB Code: Vancouver Stock Exchange

The MATLAB implementation of the example computations in Sec. 2.3 has the following form:

% Example computations for the VSE index problem
% 1500 stocks, updates for 22 months with 3000 per day
% initial index value = 1000.000
% 4 different update variants: floor (If), ceil (Ic),
% rounding (Ir), exact double precision (Ie)
n = 1500;
nUpdPerDay = 2*n;
nDays = 440;
r = 0.0001;             % stock update range (percent)
If = 1000; Ic = 1000; Ie = 1000; Ir = 1000;
sp = 50+10*(1:n)/n;     % stock prices unif. distr. [50,60]
w = sum(sp)/1000;       % weight w for all stock prices
rng('default');         % random number generator settings
% compute stock changes:
for j = 1:nUpdPerDay*nDays
    k = randi(n,1,1);          % update random stock price
    z = (1-2*rand(1)) * r;     % random change in price: [-r,r]
    z = sp(k)*z;               % absolute change in stock price
    sp(k) = sp(k) + z;
    z = z/w;                   % weighted change in stock price
    % changes in the index (4 variants)
    If = floor((If+z)*1000) / 1000;
    Ic = ceil( 1000*(Ic+z)) / 1000;
    Ir = round(1000*(Ir+z)) / 1000;
    Ie = Ie+z;
end
% compute direct index values (no updates)
I = sum(sp)/w;
Iw = sum(sp/w);
% output to command line: error compared to I
floor([If, Ic, Ir, Ie, Iw, I]*1000)
[I-If, Ic-I, Ir-I, Ie-I, Iw-I]

The code assumes 1,500 stocks in the VSE index and performs 3,000 random stock updates per day, i.e., on average two updates per stock. The updates continue for 22 months (440 trading days). The initial value of the VSE index is set to 1,000.000. The multiplications and divisions by 1000 inside the loop imitate an accuracy of three digits after the decimal point without switching to a different data type than MATLAB's default double.
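To see why the truncating variant drifts the most, consider a single invented index value with more than three decimals (this snippet is only an illustration and not part of the VSE computation above):

I = 524.8117;
[floor(I*1000)/1000, ceil(I*1000)/1000, round(I*1000)/1000]
% floor: 524.811   ceil: 524.812   round: 524.812
% floor always discards the surplus digits, so the truncated index
% can only lose value at each update and never gains it back.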


8.2.2 MATLAB Code: Heisenberg Effect in MATLAB

The following source code is due to [21] and represents an example of generating different results by switching output to the command line on and off, respectively, in former versions of MATLAB.

 1  % MATLAB function to compute the machine precision.
 2  % The semicolon in line 10 prevents the output to the
 3  % command line. The value of the computation was
 4  % different (MATLAB versions 6.5-7) compared to
 5  % the variant myrealmin_withOutput().
 6  function x = myrealmin()
 7  x = 1;
 8  temp = x;
 9  while eps * temp / 2 > 0
10      temp = (eps * temp / 2);
11      if (temp > 0)
12          x = temp;
13      end
14  end

 1  % MATLAB function to compute the machine precision.
 2  % The missing semicolon in line 10 generates the output
 3  % to the command line. The value of the computation was
 4  % different (MATLAB versions 6.5-7) compared to
 5  % the variant myrealmin().
 6  function x = myrealmin_withOutput()
 7  x = 1;
 8  temp = x;
 9  while eps * temp / 2 > 0
10      temp = (eps * temp / 2)
11      if (temp > 0)
12          x = temp;
13      end
14  end

Both functions try to compute the smallest positive floating-point number reachable by repeated multiplication with eps/2 (the correct result, 8.0948e-320, lies in the denormalized range), but they differ in line 10 regarding the output to the command line. In versions 6.5–7 of MATLAB, the computation of x in myrealmin() resulted in an exact zero, whereas the call of myrealmin_withOutput() returned 8.0948e-320 as the (correct) result. This bug is of the type "heisenbug" (see Sec. 1.3): The difference in the computations becomes observable only if we look at them by switching on the command line output. The reason for the different results is that different number formats were used for internal calculations (80-bit register computations) and for displaying a number (standard 64-bit double-precision format). Hence, in myrealmin_withOutput(), every single intermediate value was converted to 64-bit precision in order to be displayed, so the iteration effectively proceeded in 64-bit arithmetic and returned the last positive value in exactly this precision. In myrealmin(), however, all computations are done with the higher 80-bit precision, whose final value is too small for the 64-bit format and is therefore rounded/converted to zero when being stored as the return value. Note that this behavior has been changed in later versions of MATLAB.
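A hypothetical comparison run (the printed values refer to the behavior of the affected MATLAB versions 6.5–7 on hardware with 80-bit x87 registers; on current releases both calls return the same value):

xQuiet = myrealmin();            % was 0 on the affected versions
xNoisy = myrealmin_withOutput(); % was 8.0948e-320, printing each intermediate temp
[xQuiet, xNoisy]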


8.2.3 MATLAB Code: Exponential Function and Its Derivative

The following MATLAB code has been used to generate the visualization of a specific feature of the exponential function in Excursion 7 in Sec. 3.1: It solves the ordinary differential equation f'(x) = f(x); i.e., the derivative at each point is identical to the function value. This is observable via slope triangles plotted by the code. The code uses the package export_fig,126 which comes in handy for the typically cumbersome task of generating (nice) pdf files from MATLAB figures.

% script to compute and visualize the relation
% of the function f(x) = exp(x) and its derivative
% via slope triangles at positions -1,0,1, and 2.
deltaX = 0.01;
x = -4:deltaX:2;
y = exp(x);

x1 = -2:deltaX:-1;
x2 = -1:deltaX:0;
x3 =  0:deltaX:1;
x4 =  1:deltaX:2;

t1 = (x1+2)*exp(-1);
t2 = (x2+1);
t3 = x3*exp(1);
t4 = (x4-1)*exp(2);

% plot function and tangents
% use default color scheme for plot lines, but reorder
color1 = [0, 0.4470, 0.7410];
color2 = [0.8500, 0.3250, 0.0980];
color3 = [0.9290, 0.6940, 0.1250];
color4 = [0.4660, 0.6740, 0.1880];
set(gca,'DefaultLineLineWidth',2)
hold on
% plot exponential function
h0 = plot(x,y, 'Color', 'black');
plot(x1,t1, 'Color', color1); plot(x2,t2, 'Color', color2)
plot(x3,t3, 'Color', color3); plot(x4,t4, 'Color', color4)
% plot vertical lines
plot([-1 -1], [0 exp(-1)], 'Color', color1)
plot([ 0  0], [0 1],       'Color', color2)
plot([ 1  1], [0 exp(1)],  'Color', color3)
plot([ 2  2], [0 exp(2)],  'Color', color4)
% plot horizontal lines
h1 = plot([-2 -1], [0 0], 'Color', color1);
h2 = plot([-1  0], [0 0], 'Color', color2);
h3 = plot([ 0  1], [0 0], 'Color', color3);
h4 = plot([ 1  2], [0 0], 'Color', color4);
hold off
xlim([-4.1, 2.5]);
ylim([0, 7.5]);

legendStrings = {'exp(x)', 'slope \Delta 1', 'slope \Delta 2', ...
    'slope \Delta 3', 'slope \Delta 4'};
legend([h0 h1 h2 h3 h4], legendStrings, 'Location', 'northwest')
% plot figure as pdf w/o large margins using export_fig
addpath('altmany-export_fig-9d97e2c')
outputFileName = 'exponential_derivatives.pdf';
export_fig(outputFileName, '-q101');

126 See https://github.com/altmany/export_fig.
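As a side note (our own summary, not part of the original script description): the expressions t1, ..., t4 are the tangents of f(x) = exp(x) at x0 = -1, 0, 1, 2, namely t(x) = exp(x0) * (x - x0 + 1). Since f'(x0) = f(x0) = exp(x0), each tangent crosses the x axis at x0 - 1, so every slope triangle drawn by the vertical and horizontal helper lines has a horizontal leg of length 1 and a vertical leg of height exp(x0).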


8.2.4 MATLAB Code: Lagrange Polynomials

The following MATLAB implementation has been used to generate the visualization of the Lagrange polynomials and the interpolation polynomial in Excursion 9 in Sec. 3.1.

% script to compute and visualize Lagrange polynomials
% at interpolation points -1,0,1
% for the function f(x) = exp(x)
% vector of evaluation points in x direction for plotting
x = -1:0.01:1;
% Lagrange polynomials for three points -1,0,1
L0 = x.*(x-1)/2;
L1 = -(x+1).*(x-1);
L2 = x.*(x+1)/2;
% interpolation parabola p(x) for the given points
p = L0*exp(-1) + L1 + L2*exp(1);
hold on
h1 = plot(x,L0,'-', x,L1,'-', x,L2,'-', ...
    x,exp(x),'--', x,p,'-', 'LineWidth',2);
h2 = plot(x,0*x,'k-',[0 0],[-0.5 3],'k-');
hold off
legendStrings = {'L_0', 'L_1', 'L_2', 'exp(x)', 'p(x)'};
legend([h1], legendStrings, 'Location', 'northwest')
% plot figure as pdf w/o large margins using export_fig
addpath('altmany-export_fig-9d97e2c')
outputFileName = 'lagrange_polynomials_interpol.pdf';
export_fig(outputFileName, '-q101');
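As a quick sanity check (our own addition, using the same nodes -1, 0, 1 as the script above), the interpolation parabola reproduces exp exactly at the interpolation points:

xi = [-1 0 1];                                    % interpolation nodes
L0 = xi.*(xi-1)/2; L1 = -(xi+1).*(xi-1); L2 = xi.*(xi+1)/2;
p  = L0*exp(-1) + L1 + L2*exp(1);                 % parabola evaluated at the nodes
max(abs(p - exp(xi)))                             % 0 here; in general of the order of eps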


8.2.5 MATLAB Code: Pentium PD Table

For the visualization of the PD lookup table in Sec. 6.1, the MATLAB code reads as follows:

% Function to compute (zoomed) table visualisation for the
% Pentium table entries. The function edelman() is used for
% the actual table entry computations.
%
% The export_fig package is used to print (real) pdf files
% for the figures: https://github.com/altmany/export_fig
%
% The zoomed version has some minor modifications:
% *) It shows the positive part of y axis only.
% *) The potential zigzag (threshold) is visualised by
%    coloring ambiguous white cells (bright and dark color
%    depending on the column).
% *) Additional plotting of the straight lines shifted by
%    -1/8 (those which limit the zigzag threshold from
%    above but not from below).
%
% Input:  printZoomedVersion:
%         Flag (0 or 1) to decide whether or not
%         to print the zoomed version.
% Output:
%
function pentium_table(printZoomedVersion)
% check input parameter
if printZoomedVersion==0 || printZoomedVersion==1
    % ok
else
    error('pentium_table: input parameter not allowed!');
end
close all
% create table entries
qq = edelman;
% define colors; rgb definitions cf. e.g.
% http://www.rapidtables.com/web/color/RGB_Color.htm
color_m2      = [255 228 181]/255;   % moccasin
color_m2_dark = [222 184 134]/255;   % burly wood
color_m1      = [219 112 147]/255;   % pale violet red
color_m1_dark = [199  21 133]/255;   % medium violet red
color_0       = [ 50 205  50]/255;   % lime green
color_0_dark  = [ 34 139  34]/255;   % forest green
color_1       = [255  99  71]/255;   % tomato
color_1_dark  = [220  20  60]/255;   % crimson
color_2       = [135 206 235]/255;   % sky blue
color_2_dark  = [ 30 144 255]/255;   % dodger blue
% define units of single boxes in table plot
dx = 1/16;
dy = 2*dx;
% offset of lower left corner of first real neg. cell
offset_y = -5.5;
% unit cell coordinates for plotting of single cells:
uc_x = [0 dx dx 0 0];
uc_y = [0 0 dy dy 0];
hold on
% plot all matrix entries (except 1st column with dummy
% data and except rows on top/bottom with no colored entry)
for j=2:17
    for k=5:91
        if printZoomedVersion==1
            % adapt value of ambiguous cells for coloring:
            % use digit with larger absolute value
            if qq(k,j)==3
                qq(k,j) = 4;
            elseif qq(k,j)==1
                qq(k,j) = 2;
            elseif qq(k,j)==-1
                qq(k,j) = -2;
            elseif qq(k,j)==-3
                qq(k,j) = -4;
            end
        end
        % perform coloring
        switch qq(k,j)
            case 4
                co=color_2;
                if j==3 || j==6 || j==9 || j==12 || j==15
                    co=color_2_dark;
                end
            case 2
                co=color_1;
                if j==3 || j==6 || j==9 || j==12 || j==15
                    co=color_1_dark;
                end
            case 0
                co=color_0;
                if j==3 || j==6 || j==9 || j==12 || j==15
                    co=color_0_dark;
                end
            case -2
                co=color_m1;
                if j==3 || j==6 || j==9 || j==12 || j==15
                    co=color_m1_dark;
                end
            case -4
                co=color_m2;
                if j==3 || j==6 || j==9 || j==12 || j==15
                    co=color_m2_dark;
                end
            otherwise % +-10
                co='white';
        end
        % plot cell of matrix in selected color co:
        % shift unit cells defined by uc_x and uc_y
        % *) j shifts to the right by (j-2)*dx (j==2 is 1st
        %    col which should be visible), offset for [1,2).
        % *) k shifts vertically: lowest cell coord. of
        %    l.l. corner equal to offset_y + k-dependency
        fill(uc_x + (j-2)*dx + 1.0, uc_y + (k-5)*dy + offset_y, co);
    end
end
% plot 5 buggy cells on top in separate color
color_buggy = [0 0 205]/255;   % navy
fill(uc_x+dx+1,    uc_y+67*dy+offset_y, color_buggy);
fill(uc_x+4*dx+1,  uc_y+71*dy+offset_y, color_buggy);
fill(uc_x+7*dx+1,  uc_y+75*dy+offset_y, color_buggy);
fill(uc_x+10*dx+1, uc_y+79*dy+offset_y, color_buggy);
fill(uc_x+13*dx+1, uc_y+83*dy+offset_y, color_buggy);
% plot 5 unreached cells at bottom in dark grey

color_unreached = [0.5 0.5 0.5];
fill(uc_x+dx+1,    uc_y+19*dy+offset_y, color_unreached);
fill(uc_x+4*dx+1,  uc_y+15*dy+offset_y, color_unreached);
fill(uc_x+7*dx+1,  uc_y+11*dy+offset_y, color_unreached);
fill(uc_x+10*dx+1, uc_y+7*dy+offset_y,  color_unreached);
fill(uc_x+13*dx+1, uc_y+3*dy+offset_y,  color_unreached);

% plot horizontal separator lines in bold for integers
for x = linspace(-5,5,11)
    plot([1 2],[x x],'LineWidth',2,'Color','black');
end
% plot straight lines of boundaries for I_{-2},...,I_2
color_gr = [105 105 105]/255;
plot([1 2],[ 8  16]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[-8 -16]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[ 5  10]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[-5 -10]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[ 4   8]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[-4  -8]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[ 2   4]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[-2  -4]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[ 1   2]/3, 'LineWidth',2, 'Color',color_gr);
plot([1 2],[-1  -2]/3, 'LineWidth',2, 'Color',color_gr);
% for zoom-version: plot shifted lines by -1/8
if printZoomedVersion==1
    col_shift  = [200 200 200]/255;
    col_bShift = 'red';
    add_shift  = -1/8;
    plot([1 2],[ 8  16]/3 + add_shift, 'LineWidth',2, 'Color',col_bShift);
    plot([1 2],[-8 -16]/3 + add_shift, 'LineWidth',2, 'Color',col_shift);
    plot([1 2],[ 5  10]/3 + add_shift, 'LineWidth',2, 'Color',col_shift);
    plot([1 2],[-4  -8]/3 + add_shift, 'LineWidth',2, 'Color',col_shift);
    plot([1 2],[ 2   4]/3 + add_shift, 'LineWidth',2, 'Color',col_shift);
    plot([1 2],[-1  -2]/3 + add_shift, 'LineWidth',2, 'Color',col_shift);
end
% plot ticks in format of significant number of digits
set(gca,'XTick',1:dx:2-dx)
set(gca,'XTickLabel',...
    {'1.0000', '1.0001', '1.0010', '1.0011', '1.0100', '1.0101', ...
     '1.0110', '1.0111', '1.1000', '1.1001', '1.1010', '1.1011', ...
     '1.1100', '1.1101', '1.1110', '1.1111'})
set(gca,'XTickLabelRotation',90)
set(gca,'YTick',-5:5)
set(gca,'YTickLabel', ...
    {'-5.0 1011.000', '-4.0 1100.000', '-3.0 1101.000', ...
     '-2.0 1110.000', '-1.0 1111.000', ' 0.0 0000.000', ...
     ' 1.0 0001.000', ' 2.0 0010.000', ' 3.0 0011.000', ...
     ' 4.0 0100.000', ' 5.0 0101.000'})
% set general image properties
pbaspect([1 1.75 1])
hold off
if printZoomedVersion==0
    axis([1 2 -5.5 5.5])
else
    axis([1 2 0 5.5-1/8])   % show positive y axis part only
end
% plot figure as pdf w/o large margins using export_fig
addpath('altmany-export_fig-9d97e2c')
if printZoomedVersion==0
    outputFileName = 'pentium_table.pdf';
else
    outputFileName = 'pentium_table_zoom.pdf';
end
export_fig(outputFileName, '-q101');
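Typical calls (our usage note; it assumes that pentium_table.m, edelman.m, and the export_fig folder are on the MATLAB path):

pentium_table(0)   % full PD table, written to pentium_table.pdf
pentium_table(1)   % zoomed version (positive y axis only), written to pentium_table_zoom.pdf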

The above code again relies on the package export_fig for the generation of pdf files for the plots. Furthermore, it uses the following MATLAB implementation, originally proposed by Edelman in [16], with additional minor scalings for the sake of comfort in usage:

% Code of Edelman paper to generate table entries with
% coloring modified to use whole numbers (no values 0.5)
% and replacing Inf by values of 10 for commodity.
%
% INPUT:
% OUTPUT: qq: Matrix of value-coded table entries.
%
function [qq]=edelman
% original code of Edelman paper:
P=8*(-6:.125:6)';
q=0*P;
for d=[1:(1/16):2-(1/16)]
    D = floor(16*d);
    Dp = D+1;
    % respect 5 different digit intervals
    qq=(P >= floor(-8*Dp/6)-1) + (P >= ceil(-5*D/6)) + ...
       (P >= floor(-4*Dp/6)-1) + (P >= ceil(-2*D/6)) + ...
       (P >= floor(  -Dp/6)-1) + (P >= ceil(  Dp/6)) + ...
       (P >= floor( 2*D/6)-1)  + (P >= ceil( 4*Dp/6)) + ...
       (P >= floor( 5*D/6)-1)  + (P >= ceil( 8*Dp/6)) ;
    q = [q qq];
end
% shift q values to match digit values
q=(q-5)/2;
% set values outside range to Inf
q(abs(q)==2.5) = Inf*q(abs(q)==2.5);
% set 1st column (actually ignored in images)
q(:,1)=P/8;

% additional modifications:
% get rid of 0.5 values by multiplying by two
qq=floor(q*2);
% replace Inf with value 10, count all cells to be colored
[m,n]=size(q);
z=0;
for i=1:m
    for j=1:n
        if qq(i,j)==Inf
            qq(i,j)=10;
        end
        if qq(i,j)==-Inf
            qq(i,j)=-10;
        end
        if j>1 && abs(qq(i,j))