
Marilyn Wolf · Dimitrios Serpanos

Safe and Secure Cyber-Physical Systems and Internet-of-Things Systems


Marilyn Wolf School of Computer Science and Engineering University of Nebraska - Lincoln Lincoln, NE, USA

Dimitrios Serpanos Electrical and Computer Engineering University of Patras Patras, Greece

ISBN 978-3-030-25807-8    ISBN 978-3-030-25808-5 (eBook)
https://doi.org/10.1007/978-3-030-25808-5

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Marilyn Wolf: For Alec, As Always
Dimitrios Serpanos: For Loukia and Georgia

Preface

This book is motivated by the realization that cyber-physical and Internet-of-Things systems have pushed engineers and computer systems designers into new, uncharted territory. The combination of physical plants with complex computer systems opens up new concerns and threats that are more than the sum of traditional safety engineering and computer security. Safety and security must be treated as a unified problem with a coherent set of approaches to solve these new challenges.

This book provides our most recent thinking on safe and secure CPS and IoT systems. The chapters survey existing methods to identify challenges and propose new methodologies. While progress has been made in the past several years, we believe that much more work remains to be done.

That work must go beyond technical solutions. A new technical community for safe and secure systems must be created and assume a role in safety-critical and mission-critical system design. The threats posed by cyber-physical and IoT systems also require careful consideration of policy and regulation.

We would like to thank our colleagues for their insight and suggestions; all errors are ours alone. We would also like to thank our families for their patience as we typed away.

Lincoln, NE, USA    Marilyn Wolf
Patras, Greece    Dimitrios Serpanos


Contents

1 The Safety and Security Landscape
  1.1 Introduction
  1.2 Case Studies
    1.2.1 Cyber-Physical Systems Are Shockingly Easy to Attack
    1.2.2 Cyber-Physical Systems Can Kill People
    1.2.3 Cyber-Physical System Disruptions Require Extensive and Lengthy Repairs
    1.2.4 Patch and Pray Considered Harmful
    1.2.5 Folk Wisdom Is Untrustworthy
    1.2.6 The IT/OT Boundary Is Soft
    1.2.7 Design Processes Cannot Be Trusted
    1.2.8 The V Model Is Inadequate
    1.2.9 Privacy Is a Critical Requirement
  1.3 Chapters in This Book
  1.4 Summary
  References
2 Safety and Security Design Processes
  2.1 Introduction
  2.2 Risk Management
  2.3 Fault Models and Hazard Analysis
  2.4 Attack Models and Attack Analysis
  2.5 Standards and Certification
  2.6 Quality Management Systems
  2.7 Safety Design Processes
  2.8 Security Design Processes
  2.9 Comparison and Contrast of Safety and Security Design Processes
  References
3 Threats and Threat Analysis
  3.1 Introduction
  3.2 Vulnerabilities, Hazards, and Threats
  3.3 Compound Threats
  3.4 Threat Analysis Models
  3.5 Characteristics of Vulnerabilities and Threats
    3.5.1 Improper Authorization Threats
    3.5.2 Authorization Domains
    3.5.3 Software Safety Threats
  3.6 Iterative Threat Analysis Methodology
  3.7 Threat Mitigation
    3.7.1 Pre-deployment
    3.7.2 Post-deployment
  3.8 Summary
  References
4 Architectures
  4.1 Introduction
  4.2 Processor Security
    4.2.1 Root-of-Trust
    4.2.2 Side Channel Attacks
  4.3 Model-Based Design
  4.4 Architectural Threat Modeling
    4.4.1 Attack Model
    4.4.2 Example Attacks and Mitigations
  4.5 Service-Oriented Architectures
  4.6 Summary
  References
5 Security Testing and Run-Time Monitoring
  5.1 Introduction
  5.2 Security Testing
  5.3 Fuzz Testing for Security
  5.4 Fuzzing Industrial Control Network Systems
  5.5 A Modbus TCP Fuzzer
  5.6 Run-Time Monitoring
  5.7 The ARMET Approach
  References
6 False Data Injection Attacks
  6.1 Introduction
  6.2 Vulnerability Analysis
  6.3 Dynamic Monitoring
  References
Index

Chapter 1

The Safety and Security Landscape

1.1  Introduction

Safety and security are both important, established, and very distinct engineering disciplines. Each discipline has developed its own methodologies and tools based on its own set of goals. However, we can no longer treat these two disciplines as separate. The introduction of real-time embedded computing systems that control physical objects means that physical safety and computer security must be treated as a single discipline; the design of cyber-physical (CPS) and Internet-of-Things (IoT) systems must be based on this unitary goal of safe and secure systems. The traditional definitions of these fields can be briefly summarized:

• Physical safety is the result of the absence or minimization of hazards that may harm life or property.
• Howard's early analysis of Internet security [How97] defines computer security as "preventing attackers from achieving objectives through unauthorized access or unauthorized use of computers and networks."

In the modern world, these two goals cannot be cleanly separated. The impact of computer security on safety is easy to see: attackers gain unauthorized access to a cyber-physical system and command it to do bad things. However, safety engineering also has an important influence on computer security practices, which rely heavily on updates to fix newly found threats. Physical systems cannot be stopped arbitrarily; an airplane cannot be stopped mid-flight for a software update. Even a planned shutdown of a physical plant can take hours given the physical constraints on the system's operation.

Cyber-physical attacks differ from cyber attacks in that they directly threaten physical systems: infrastructure, civil structures, and people. Cyber-physical attacks can kill people and cause damage to physical plants that can take months to repair. Infrastructure equipment is built for small markets and in some cases is one of a kind. Large-scale damage to civil infrastructure (water heaters, refrigeration equipment, etc.) can overwhelm standard production capacity and result in lengthy delays for replacements and repairs.

This book elaborates several themes:

• As stated above, safety and security are inseparable in CPS and IoT systems.
• Neither the safety nor the security discipline offers all the answers.
• Safety and security vary in their use of short-term vs. long-term approaches and in their use of prevention vs. remediation. The new field of safe and secure systems should operate at all time scales, from the earliest stages of design through updates.
• System designers must accept that, because Internet threats evolve, there is no end to the design process. Systems must be designed to be adaptable so that they can counter evolving threats.
• Suites of standardized design templates help to reduce design risks.
• Modern systems must combine design-time analysis and architected safety and security features with run-time monitoring.
• Safety and security should be assessed in part by probabilistic assertions about the health of the system.

The next section reviews several case studies of safety and security problems in cyber-physical and IoT systems and the lessons we can draw from them. Section 1.3 surveys the remaining chapters in this book.

1.2  Case Studies

A few notes on terminology are useful. Threats may come from several sources: deliberate attack, design flaws, physical faults, and improper operation of the system. A few examples of accidents involving cyber-physical systems indicate a range of specific causes:

• An Airbus A400M crashed after takeoff at Sevilla Airport in Spain. Later analysis determined that the aircraft's engine control software had been incorrectly installed during its final assembly. That improper installation led to engine failure [Keo15].
• Analysis of the design of Toyota automobiles [Koo14] identified failures to apply well-known engineering techniques in several areas, including protection from cosmic ray-induced data errors and application of software engineering principles.
• Dieselgate [Dav15] was the result of a decision by Volkswagen management to design software in many of their diesel vehicles to provide inaccurate testing data that incorrectly gave the appearance of satisfying emissions regulations in several countries.

As will be discussed below, several attacks on cyber-physical systems have been reported. The NotPetya attack [Gre18] was strictly a cyber attack and did not target physical plants. However, the virulence of the attack, which wiped out vast amounts of data and system configurations, shows the level of chaos that a determined attacker can generate. Cyber attacks and physical attacks can also be used in tandem to create a cyber-physical attack.

Safety problems demonstrate the physical damage that cyber-physical systems can inflict, and in some cases they expose flaws that attackers could also have exploited. A commonplace of safety engineering is that accidents usually have multiple causes: a chain of events leads to the final accident, which keeps accident rates lower than they would otherwise be. However, an examination of designs and their behavior suggests that many systems have multiple flaws, both security vulnerabilities and safety hazards. These multiple sources of problems suggest that failures may be more frequent than we would like them to be.

The sections below discuss several observations about the safety and security of cyber-physical and IoT systems. Section 1.2.1 discusses ease of attack. Sections 1.2.2 and 1.2.3 discuss the serious implications of safety failures and attacks. Sections 1.2.4 through 1.2.8 describe various limitations of existing methodologies and architectures. Section 1.2.9 reviews the importance of privacy in cyber-physical and IoT systems.

1.2.1  Cyber-Physical Systems Are Shockingly Easy to Attack

Rouf et al. [Rou10] demonstrated vulnerabilities in the tire pressure monitoring systems (TPMS) that are legally required for many types of vehicles in several countries. Direct TPMS devices are mounted in the wheels and send data on tire pressure to the car's electronic control units (ECUs) using radio signals. Rouf et al. showed that the packets could be received at a distance of 40 m, that they could be spoofed to the ECU, and that the packets were not encrypted.

Checkoway et al. [Che11] identified a number of vulnerabilities in an example car, with each attack providing complete control over the car's systems. Vulnerabilities included the car's CD player, the OBD-II (on-board diagnostics) port required in the USA, telematics links such as those used for emergency services, and Bluetooth ports.

1.2.2  Cyber-Physical Systems Can Kill People

Leveson and Turner [Lev93] analyzed the causes of a series of accidents related to the Therac-25 medical radiation device. They identified several problems with the Therac-25 design, including its mechanical systems, user interface design, and software design. These devices administered several radiation overdoses, some of which were fatal. The multiple accidents appear to have resulted from several distinct root causes. Leveson and Turner identified several contributing factors: lack of procedures for reacting to reported incidents, overconfidence in software, poor software engineering, and unrealistic risk assessments.

HatMan [NCC18] attacks safety controllers from Triconex. Safety controllers are PLCs used for safety procedures such as disabling equipment or inhibiting operations. HatMan can both read/modify memory and execute arbitrary code on a controller; it is believed to be designed not only to reconnoiter industrial systems but also to implement physical attacks.

1.2.3  Cyber-Physical System Disruptions Require Extensive and Lengthy Repairs

Stuxnet is widely considered to be the first cyber-physical attack [Fal10]. It reprogrammed a particular family of programmable logic controllers (PLCs) used in industrial control systems, the Siemens Step 7 family. It was designed to attack PLCs at a specific industrial control system in Iran, located in that country's nuclear facilities, and to control their behavior, generally operating them in a way that would damage the related equipment.

Stuxnet was deployed in at least two phases. Stuxnet 0.5 [McD13] was in the wild by November 2007. It was designed to manipulate valves in uranium enrichment equipment at the Natanz, Iran facility in order to damage centrifuges and other equipment. It used a replay attack to hide its changes to valve settings during a physical attack. This version spread via Step 7 project archives. W32.Stuxnet [Fal11] conducted more extensive attacks. It propagated using vulnerabilities in the Windows Print Spooler and in removable drives. It used infected Windows computers to modify PLC code. Its physical attack caused centrifuges to run too fast, damaging them. The attacks are believed to have caused significant damage to Natanz's equipment and to have reduced its productivity.

The Dragonfly group [Sym14], also known as Energetic Bear, was identified as surveilling a large number of targets in multiple countries. The campaign's targets included energy companies, petroleum pipeline operators, and energy industry equipment providers as well as defense and aviation companies. Espionage and reconnaissance were believed to be the primary goals of the campaign. Phishing and watering hole attacks were used to obtain credentials from authorized users. Malware installed on target systems gathered a range of data. Symantec identified a Dragonfly 2.0 campaign active in the USA, Turkey, and Switzerland, starting as early as December 2015 [Sec17].

The Ukraine power grid was attacked in December 2015 [Goo16]. The attack disconnected electrical substations, causing hundreds of thousands of homes to lose power. The National Cybersecurity and Communications Integration Center (NCCIC) identified CrashOverride as malware used to attack critical infrastructure in Ukraine in 2016 [NCC17].


1.2.4  Patch and Pray Considered Harmful

Information security practice emphasizes the importance of applying updates to software at all levels of the stack, from operating system to application. Updates include recent fixes to known problems in software and may also include upgrades. Patching known problems reduces the exposure of a system to known attacks. Some updates are scheduled, while others may be released sporadically in reaction to known issues.

However, two problems present themselves when applying updates to industrial control and other cyber-physical systems. First, even if updates have been tested by the vendor, they may introduce problems due to interactions with the software specific to the cyber-physical system; proper testing before installation may take time that leaves the cyber-physical system vulnerable to the attacks that the update is intended to prevent. Second, the operating schedule of the physical plant cannot be dictated by system updates. Very few cyber-physical systems are designed for on-the-fly system upgrades, although the Western Electric 5ESS telephone switching system was designed to allow hot swapping of its operating system. To the extent that software upgrades require pausing or shutting down the associated physical plant, they may cause substantial delay: bringing up a large electrical generator can take several hours, while a chemical plant shutdown and restart may take days. The associated costs of physical plant shutdowns often cause operators to delay system updates, which leaves their computing systems with known vulnerabilities. However, requiring operators to immediately install all updates is unrealistic.

1.2.5  Folk Wisdom Is Untrustworthy

Computing is as vulnerable to folk wisdom as other disciplines. Two examples illustrate the challenges introduced by reliance on folk wisdom.

Folk wisdom long held that open-source software was more reliable than proprietary software. A key reason for this supposed reliability was that open-source software is examined by more people than proprietary software, providing more opportunities to identify problems. However, the Heartbleed bug [NCC14] (heartbleed.com) was a serious implementation problem in the OpenSSL cryptographic library widely used throughout the Internet. This bug was deployed for over 2 years before being discovered and was incorporated into a wide range of operating systems.

Another piece of folk wisdom that has not withstood the test of time holds that computer systems can be isolated from the Internet, a practice known as an air gap. Air gaps have been prescribed, for example, for nuclear power plants [NEI16]. However, a white paper by Chatham House [Bay15] identified several problems with air gaps in nuclear plants. First, it pointed out that the Stuxnet attacks on the Natanz nuclear facility were propagated in part by infected USB portable drives. Second, it noted that some nuclear facilities make use of virtual private networks. Third, it noted that some plants have undocumented or forgotten Internet connections that were installed as part of legitimate operations but were left in place longer than intended. Experience suggests that air gaps are difficult to maintain.

1.2.6  The IT/OT Boundary Is Soft

The literature generally holds that information technology (IT), used to manage information, is largely separate from operational technology (OT), used to operate physical systems. However, the complexity of modern organizations and systems often means that the boundary between IT and OT is hard to identify clearly.

For example, IT failures at three US airlines—United [Ort15], Delta [Hal16], and Southwest [Moy16]—caused flight delays and cancellations. One reason for the impact of IT failures on flight operations is that airlines are required to use dispatchers to keep track of flights and to work with flight crews during flight. To the extent that dispatchers rely on IT systems to conduct their duties, IT failures can result in operational problems.

Another example is given by the Ukraine power grid attack, which included a denial-of-service attack on utility call centers. This aspect of the attack was designed to prevent customers from reporting outages, possibly to cause frustration among customers [Lee16].

A CNN report [Col19] described 22 known attacks on the US public sector in the period January to May 2019 and a total of 53 such attacks in 2018. These attacks have restricted a range of government services.

1.2.7  Design Processes Cannot Be Trusted

We generally assume that design processes are conducted in a reasonable manner by trusted entities. However, the poor state of security of many IoT devices suggests that not all design organizations are capable of, or motivated toward, properly designing secure computing systems. An extreme example is given by the Dieselgate scandal [Dav15, Hru15], in which senior Volkswagen executives caused their organizations to create software modes that subverted emissions control and testing, causing many vehicles on the road to fail to comply with emissions standards.


1.2.8  The V Model Is Inadequate

The ISO 26262 standard (https://www.iso.org/standard/43464.html) defines a V model for automotive software design. As will be discussed in Chap. 3, the V methodology describes top-down design and bottom-up verification. The V model assumes that the requirements are known and unchanging over the design process. However, Internet-connected cyber-physical systems are exposed to a changing set of possible attacks that themselves form part of the system specification. The V model does not allow for changes to the requirements, and the resulting changes to the design, either during initial design or after deployment.

Alternative development methodologies have been used for some safety-critical system designs. For example, Boeing and Saab used an agile software development methodology that released software every 8 weeks. The companies estimate that this methodology reduced the number of lines of code by 50% compared to traditional approaches.

1.2.9  Privacy Is a Critical Requirement

Privacy is widely regarded as a critical requirement for computer system security. Cyber-physical systems may introduce new forms of privacy concerns and vulnerabilities: timing may be an important factor; the physical plant may be observed or modified to affect data privacy; and necessary information, such as that used to negotiate prices for energy services, may allow the inference of personal data.

Medical data is an important example of personal data privacy; many countries define and protect medical data. However, data related to wellness or care may not fall under the strict definition of medical data yet still create privacy concerns. Access to medical and wellness data is generally defined by a complex set of relationships that includes medical providers, nonmedical care providers, and family or guardians [Wol15].

Trade secrets are another form of private information held by companies. Trade secrets are generally considered lost once they are revealed. Many organizations maintain some form of trade secrets as part of their operations.

1.3  Chapters in This Book

Chapter 2: Safety and Security Design Processes compares the design methodologies and processes used in traditional safety and security methodologies. Safety and security methodologies emphasize different parts of the design process as well as different design goals.


Chapter 3: Threats and Threat Analysis develops models for a holistic analysis of security vulnerabilities and safety hazards in cyber-physical and IoT systems. An iterative threat management methodology can be applied to safety and security properties at multiple levels of abstraction.

Chapter 4: Architectures considers several architectural methodologies and models. Model-based design has become increasingly popular as a way to integrate semantically rich models with simulation and synthesis tools. Model-based design flows can incorporate analysis and synthesis algorithms for security properties. An architectural threat model takes into account the relationship between multiple levels of the design hierarchy to analyze and mitigate both transitive privilege escalation attacks and quality-of-service (QoS) attacks. QoS-aware service-oriented architectures can provide verifiable QoS properties for high-level services in cyber-physical systems.

Chapter 5: Security Testing and Run-time Monitoring considers two topics: testing at design time and run-time monitoring. Fuzzing has been used to test security properties of industrial systems. Run-time monitors can be used to test for both safety and security properties.

Chapter 6: False Data Injection Attacks examines attacks on data at the cyber-physical interface; such attacks can interfere with the safety and security properties of the cyber-physical system. Both vulnerability analysis and dynamic monitoring can be used to defend against these attacks.

1.4  Summary

Cyber-physical and IoT systems are vulnerable to traditional security and safety problems. However, they are also subject to concerns that arise neither in non-cyber-controlled physical systems nor in IT-oriented computer systems. Some of these problems are caused by Internet connectivity, while others are due to the complexity of software. The complexity of modern networked computing systems—information, operational, and combined—means that a few weak components can pose serious threats. Not only does information security affect safety, but safety methodologies also limit the applicability of traditional information security practices. Design methodologies, system designs, and operational plans must take these issues into account to maximize safety and security.

References

[Bay15] Caroline Baylon with Roger Brunt and David Livingstone, Cyber Security at Civil Nuclear Facilities: Understanding the Risks, Chatham House (Sept 2015)
[Che11] S. Checkoway, D. McCoy, B. Kantor, D. Anderson, H. Shacham, S. Savage, K. Koscher, A. Czeskis, F. Roesner, T. Kohno, Comprehensive experimental analyses of automotive attack surfaces, in Proceedings of the 20th USENIX Conference on Security (SEC'11), (USENIX Association, Berkeley, 2011), p. 6
[Col19] K. Collier, Crippling ransomware attacks targeting US cities on the rise, CNN (10 May 2019), https://www.cnn.com/2019/05/10/politics/ransomware-attacks-us-cities/index.html
[Dav15] C. Davenport, J. Ewing, VW is said to cheat on diesel emissions; U.S. to order big recall, New York Times (18 Sept 2015), http://www.nytimes.com/2015/09/19/business/volkswagen-is-ordered-to-recall-nearly-500000-vehicles-over-emissions-software.html?smid=tw-nytimes&smtyp=cur&_r=0
[Fal10] N. Falliere, Stuxnet introduces the first known rootkit for industrial control systems, Symantec Official Blog (6 Aug 2010), http://www.symantec.com/connect/blogs/stuxnet-introduces-first-known-rootkit-scada-devices
[Fal11] N. Falliere, L.O. Murchu, E. Chien, W32.Stuxnet Dossier, version 1.4 (Feb 2011), available at www.symantec.com
[Goo16] D. Goodin, First known hacker-caused power outage signals troubling escalation, Ars Technica (14 Jan 2016), http://arstechnica.com/security/2016/01/first-known-hacker-caused-power-outage-signals-troubling-escalation/
[Gre18] A. Greenberg, The untold story of NotPetya, the most devastating cyberattack in history, Wired (22 Aug 2018), https://www.wired.com/story/notpetya-cyberattack-ukraine-russia-code-crashed-the-world/
[Hal16] A. Halsey III, Delta computers crash, causing delays and cancellations. Experts say it shouldn't have happened, The Washington Post (8 Aug 2016), https://www.washingtonpost.com/local/trafficandcommuting/delta-airlines-computer-systems-crash-causing-flight-delays-and-cancellations/2016/08/08/7d5e8fa0-5d72-11e6-af8e-54aa2e849447_story.html?utm_term=.cb218d77d206
[How97] J.D. Howard, Analysis of Security Incidents on the Internet 1989-1995, Ph.D. dissertation, Carnegie Mellon University, 7 Apr 1997
[Hru15] J. Hruska, VW caught cheating on diesel emissions standards, ordered to recall 500,000 cars, ExtremeTech (21 Sept 2015), https://www.extremetech.com/extreme/214605-vw-caught-cheating-on-diesel-emmissions-standards-ordered-to-recall-500000-cars
[Keo15] L. Kelion, Fatal A400M crash linked to data-wipe mistake, BBC News (10 June 2015), https://www.bbc.com/news/technology-33078767
[Koo14] P. Koopman, A Case Study of Toyota Unintended Acceleration and Software Safety (18 Sept 2014), https://users.ece.cmu.edu/~koopman/pubs/koopman14_toyota_ua_slides.pdf
[Lee16] R.M. Lee, M.J. Assante, T. Conway, TLP: White Analysis of the Cyber Attack on the Ukrainian Power Grid, Defense Use Case (SANS ICS and E-ISAC, 18 Mar 2016)
[Lev93] N.G. Leveson, C.S. Turner, An investigation of the Therac-25 accidents. Computer 26(7), 18–41 (1993). https://doi.org/10.1109/MC.1993.274940
[McD13] G. McDonald, L.O. Murchu, S. Doherty, E. Chien, Stuxnet 0.5: The Missing Link, version 1.0 (26 Feb 2013), https://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/stuxnet_0_5_the_missing_link.pdf
[Moy16] J.W. Moyer, D. Hedgpeth, F. Siddiqui, Southwest Airlines computer glitch causes cancellations, delays for third day, The Washington Post (22 July 2016), https://www.washingtonpost.com/news/dr-gridlock/wp/2016/07/21/long-lines-for-southwest-airlines-passengers-at-area-airports/
[NCC14] National Cybersecurity and Communications Integration Center, 'Heartbleed' OpenSSL Vulnerability (10 Apr 2014)
[NCC17] National Cybersecurity and Communications Integration Center, CrashOverride Malware, Alert TA17-163A, 12 June 2017, revised 22 July 2017, https://www.us-cert.gov/ncas/alerts/TA17-163A
[NCC18] National Cybersecurity and Communications Integration Center, HatMan – Safety System Targeted Malware (Update A), MAR-17-352-01, 10 April 2018
[NEI16] Nuclear Energy Institute, Policy Brief: Cyber Security for Nuclear Power Plants (July 2016)
[Ort15] E. Ortiz, J. Shamlian, T. Costello, United Airlines flights no longer grounded, delays remain, NBC News (8 July 2015), http://www.nbcnews.com/business/travel/united-airlines-passengers-say-flights-grounded-nationwide-n388536
[Rou10] I. Rouf, R. Miller, H. Mustafa, T. Taylor, S. Oh, W. Xu, M. Gruteser, W. Trappe, I. Seskar, Security and privacy vulnerabilities of in-car wireless networks: a tire pressure monitoring system case study, in USENIX Security 2010, ed. by I. Goldberg, (USENIX Association, Berkeley, 2010), pp. 323–338
[Sec17] Security Response Attack Investigation Team, Dragonfly: western energy sector targeted by sophisticated attack group, Symantec (20 Oct 2017), https://www.symantec.com/blogs/threat-intelligence/dragonfly-energy-sector-cyber-attacks
[Sym14] Symantec Security Response, Dragonfly: western energy companies under sabotage threat, Symantec Official Blog (30 June 2014), https://www.symantec.com/connect/blogs/dragonfly-western-energy-companies-under-sabotage-threat
[Wol15] M. Wolf, M. van der Schaar, H. Kim, J. Xu, Caring analytics for adults with special needs. IEEE Design & Test 32(5), 35–44 (2015)

Chapter 2

Safety and Security Design Processes

2.1  Introduction

This chapter describes and compares design processes and methodologies targeted to safety and security. Chapter 3 builds upon this description to discuss a unified safety and security design methodology. The broad outlines of safety- and security-oriented design bear some similarity, but they differ significantly in emphasis: safety concentrates on requirements, while security emphasizes architecture and coding.

A few definitions are in order. Functional safety is used to describe the risks resulting from faults or design flaws. Reliability refers to the probability of a system being able to perform its intended function. Availability is the percentage of time over which the system is capable of performing that intended function. Certification is a legal or regulatory process under which a system is deemed to meet certain criteria, typically related to safety. Both designs and individual artifacts may be certified. For example, with the exception of experimental aircraft, both the design of an aircraft and the manufacture of each aircraft built under that design must be certified.

The next section discusses risk and risk management, foundational concepts for both safety and security methodologies. Sections 2.3 and 2.4 describe hazard analysis for safety and attack analysis for security, respectively. Section 2.5 discusses standards and certification. Section 2.6 describes quality management systems. Sections 2.7 and 2.8 describe safety and security design processes, respectively. Section 2.9 compares and contrasts these two types of methodologies.

2.2  Risk Management

Risk refers to a potential for loss or injury. Risk may be assessed relative to several different parties: individuals, organizations, or society. In most circumstances, risk cannot be avoided, only minimized. Management of risk often requires trade-offs between costs and risks or between different types of risk. Safety and security each have their own views on risk; the two views overlap, but they also differ in ways that go beyond the specific sources of risk. Approaches to risk management are generally ranked by their appropriateness and desirability:

• Design for minimum risk
• Incorporate safety devices
• Provide warning devices
• Develop procedures and training

Safety management processes perform a risk review early in the system development process [Alb99]. The risk management review includes three major activities:

• Risk planning provides an organized, managed process for the identification and mitigation of risks.
• Risk assessment identifies potential risks through system engineering documents and lessons learned.
• Risk analysis assesses the likelihood of risks and their potential consequences.

A risk model divides total risk into controlled or eliminated risk, which can be managed through the design process, and residual risk, which cannot. Residual risk comes from a combination of identified and unidentified sources (a formulation similar to known unknowns vs. unknown unknowns). Unacceptable risk is any risk that cannot be tolerated by the managing activity. The controlled/eliminated category includes both unacceptable risks and acceptable but mitigatable risks.

Risk assessment can take the form of categorical or quantitative analysis. Risk probabilities may be estimates, though in some cases statistically significant data on accident rates is available. A hazard is a precondition for a mishap. The goal of failure modes and effects analysis (FMEA) is to identify both the likelihood and severity of hazards associated with the system. FMEA can then be used to make judgments about the treatment of risks as controlled/eliminated or residual.

Many safety processes build their initial risk assessment in the form of a hazard risk index matrix, as shown in Fig. 2.1. This style of analysis identifies categories for both the probability of a hazard's occurrence and its severity if it does occur. Probability/severity combinations are then categorized by their risk level, in this case unacceptable, marginal, or minimum. The matrix can be used to prioritize hazard-related engineering tasks. We will see examples of risk matrices defined by standards in Section 3.cert; different fields often differ somewhat in how they categorize probability and severity.

An alternative approach to grading hazards is the Risk Priority Number, computed as the product of three factors:

• The severity of the risk
• The likelihood of occurrence of the risk
• The system's ability to detect the failure mode or its cause


Fig. 2.1  A hazard risk index matrix

The Risk Priority Number is often abbreviated RPN in the literature and is computed as RPN = S × O × D, where S is the severity, O is the likelihood of occurrence, and D is the detection score.
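To make these two grading schemes concrete, the following sketch classifies a hazard with a Fig. 2.1-style risk index matrix and computes a Risk Priority Number. The matrix entries, category names, and 1-10 scoring scales are invented for illustration; they are not taken from any particular standard.

    # Illustrative hazard grading; the matrix entries and the 1-10 scales
    # below are hypothetical examples, not values from any standard.
    RISK_MATRIX = {
        ("frequent",   "catastrophic"): "unacceptable",
        ("frequent",   "critical"):     "unacceptable",
        ("frequent",   "negligible"):   "marginal",
        ("remote",     "catastrophic"): "unacceptable",
        ("remote",     "critical"):     "marginal",
        ("remote",     "negligible"):   "minimum",
        ("improbable", "catastrophic"): "marginal",
        ("improbable", "critical"):     "minimum",
        ("improbable", "negligible"):   "minimum",
    }

    def risk_level(probability: str, severity: str) -> str:
        # Look up the risk level for a probability/severity combination.
        return RISK_MATRIX[(probability, severity)]

    def rpn(severity: int, occurrence: int, detection: int) -> int:
        # RPN = S x O x D; a higher detection score conventionally means
        # that the failure mode is harder to detect.
        return severity * occurrence * detection

    print(risk_level("remote", "catastrophic"))        # unacceptable
    print(rpn(severity=7, occurrence=4, detection=5))  # 140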

2.3  Fault Models and Hazard Analysis

Avizienis and Laprie [Avi86] define the dependability of a computer system as justified reliance on its ability to provide its intended service. A fault may cause a system failure; the fault is expressed as an error. Reliability measures the continuous delivery of proper service, while availability refers to the fraction of proper service in the total system time. Faults are classified as either physical or human-made; human-made faults may be due to design or to improper interaction with the system. A fault may be permanent or transient.

Safety in our design context refers to physical safety and the absence or minimization of hazards that may harm life or property. Leveson [Lev11] argues that reliability is a property of components, while safety is an emergent property of systems.

A variety of digital fault models are used for hardware design. While these fault models may have been suggested by particular physical phenomena, they are often applied independently of their physical cause. A stuck-at fault holds a component output at a constant value, 0 or 1. In transistor circuits, a faulty transistor may be modeled as open or short. Excessive delay may be modeled as a fault. Transient fault models include bit flips and glitches.

System failures often occur not because of a single condition but because of a cascade of conditions and events. Understanding hazards requires understanding sequences of events and the likelihood of those sequences. Three common methods for analyzing the effects of faults are functional hazard assessment (FHA), fault tree analysis (FTA), and failure mode effects analysis (FMEA).

Functional hazard assessment [Sch14] identifies the relationship between system functions and safety hazards. The FHA process is guided by a worksheet, as illustrated in Fig. 2.2. The methodology proceeds through several steps:

Fig. 2.2  A functional hazard analysis worksheet. Its columns include: Hazard ID (identifier); Life cycle phase (phase analyzed by risk assessment); Activity (actions performed within the life cycle phase); State/Mode (system state or mode for the hazard); Function (system function); Functional Failure (detailed description of the failure mode); Hazard Description (detailed description of failure conditions); System Item(s) (portion of the system); Causal Factor Description (causes of failure); Mishap (description of the failure); Effect(s) (effects on life, limb, or property); Existing Mitigations (existing means to mitigate failure); Software Control Category (degree of autonomy of the software function); Initial MRI (initial risk assessment); Software Criticality Index (criticality); Causal Factor Risk Level (potential for causal factors to occur); Target MRI (projected risk after mitigation); Recommended Mitigations (methods to reduce risk); Comments (relevant additional information); Follow-On Actions (further work to better understand risk)

• System architecture data is analyzed to create a functional hierarchy, block diagrams, and a function/item matrix.
• The impact of the failure of each system function is analyzed for hazards.
• Safety-significant subsystems and interfaces are identified.
• Existing and recommended mitigations are identified.
• Safety-significant functions are decomposed to components. Component failures are related to subsystem hazards.
• Risk levels and software criticality indexes are identified and assigned. Follow-on actions are specified.
• A final FHA report is prepared.

Fault tree analysis [Ves81] was originally developed at Bell Labs. A simple example is shown in Fig. 2.3. Fault tree analysis typically proceeds from the undesired event, shown at the top of the tree, and works backward to identify the ways in which that event can occur. Precondition events can be combined at each stage to create the conditions necessary for the successor event. Figure 2.4 shows additional symbols that can be used to build fault trees.

A related model is the event tree shown in Fig. 2.5. At each step, a given condition may or may not occur. The leaves of the tree describe the possible outcomes, some of which do not represent hazards, while others may represent hazards of varying degree. In this case, probabilities have been assigned to the events, but these probabilities may be omitted in early-stage analysis. A cutset of the tree identifies a set of events that are required for the fault to occur or, alternatively, events that must be avoided to prevent the fault's occurrence. A minimal-cost cutset is a cutset that contains the smallest number of events of any cutset. Once probabilities are assigned to the events, we can determine the probability of the minimal-cost cutset.


Fig. 2.3  A fault tree

Fig. 2.4  Some symbols used in fault trees
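To make the fault tree and cutset discussion concrete, the sketch below evaluates the top-event probability of a small fault tree of AND/OR gates. The tree and its event probabilities are invented for illustration, and the computation assumes statistically independent basic events.

    from dataclasses import dataclass
    from typing import List, Union

    # Minimal fault tree evaluator; assumes independent basic events.
    @dataclass
    class Event:
        name: str
        prob: float  # probability that the basic event occurs

    @dataclass
    class Gate:
        kind: str            # "AND" or "OR"
        inputs: List["Node"]

    Node = Union[Event, Gate]

    def probability(node: Node) -> float:
        if isinstance(node, Event):
            return node.prob
        probs = [probability(n) for n in node.inputs]
        if node.kind == "AND":   # all inputs must occur
            p = 1.0
            for q in probs:
                p *= q
            return p
        p = 1.0                  # OR: at least one input occurs
        for q in probs:
            p *= 1.0 - q
        return 1.0 - p

    # Hypothetical tree: the top event occurs if the pump fails OR if
    # both the primary and backup sensors fail.
    top = Gate("OR", [
        Event("pump_fails", 1e-3),
        Gate("AND", [Event("sensor_a_fails", 1e-2),
                     Event("sensor_b_fails", 1e-2)]),
    ])

    # {sensor_a_fails, sensor_b_fails} is a minimal cutset of this tree.
    print(probability(top))  # about 1.1e-3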

Failure mode effects analysis [ASQ18B] is also organized around a chart; an example is shown in Fig. 2.6. The FMEA process analyzes system functions and identifies potential failures. For each potential failure, it identifies the severity of the failure, the probability of that failure's occurrence, and the system's ability to detect the failure mode or its cause. The resulting Risk Priority Number is RPN = S × O × D. The criticality of the failure is also analyzed. Appropriate actions are identified and responsible parties assigned. Once the mitigation effort is completed, additional columns identify what was done, the final RPN, and the final criticality level.


Fig. 2.5  An event tree

Fig. 2.6  A failure mode effects analysis worksheet. Its columns include: Function (function being analyzed); Potential Failure Mode (detailed description of failure conditions); Potential Effect(s) of Failure (results of failure); S (severity); Potential Cause(s) of Failure (causes of failure); O (probability of failure due to occurrence of this cause); Current Process Controls (existing means to mitigate failure); D (ability to detect the cause or failure mode); RPN (Risk Priority Number = S × O × D); CRIT (initial criticality assessment); Recommended Action(s) (actions to take); Responsible Party and Target Completion Date (who is responsible and when the task should be finished); Action Taken (how the issue was resolved); and final S, O, D, RPN, and CRIT columns (final severity, occurrence probability, detection ability, Risk Priority Number, and risk assessment)


2.4  Attack Models and Attack Analysis

Security attacks often take advantage of design faults, perhaps in conjunction with physical faults. We often refer to the underlying mechanism leveraged by the attack as a vulnerability. The attack itself is analogous to an error in that it is the manifestation of a vulnerability.

Schneier introduced the attack tree model [Sch99]. A simple example is shown in Fig. 2.7. Branches are by default OR-combinations of events, but a set of branches can be marked as an AND condition by joining the branches with an arc labeled with the word AND. Nodes can be labeled to provide additional information: likelihood, special equipment required, cost of attack, etc.

Fig. 2.7  An attack tree
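As a small illustration of how an attack tree can be analyzed once its nodes are labeled, the sketch below represents a tree whose leaves carry attack costs and computes the cheapest attack, taking the minimum over OR branches and the sum over AND branches. The goal, the leaf names, and the costs are all invented for this example.

    # Illustrative attack tree; leaf costs are hypothetical numbers.
    # Nodes are ("OR", children), ("AND", children), or ("LEAF", name, cost).
    def cheapest_attack(node) -> int:
        kind = node[0]
        if kind == "LEAF":
            _, _, cost = node
            return cost
        _, children = node
        costs = [cheapest_attack(c) for c in children]
        # OR: pick the cheapest branch; AND: all branches are required.
        return min(costs) if kind == "OR" else sum(costs)

    open_safe = ("OR", [
        ("LEAF", "pick the lock", 30),
        ("LEAF", "learn the combination by bribery", 20),
        ("AND", [("LEAF", "steal the keypad log", 15),
                 ("LEAF", "replay the code at night", 10)]),
    ])

    print(cheapest_attack(open_safe))  # 20: bribery is the cheapest path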

Landwehr [Lan81] surveyed early work on formal models for computer security. He identified the access matrix model as being in use since the early 1970s. The matrix classifies sets of passive objects and active subjects and the rules by which subjects can operate on objects. Information flow models describe flow relations between security classes. The security classes correspond to different types of information contained in objects. Flow relations categorize whether a set of processes is allowed to perform a given flow from one security class to another.

Howard [How97] analyzed security incidents in the Internet's early period, from 1989 to 1995. As part of this analysis, Howard created the taxonomy of computer security attacks shown in Fig. 2.8. This process-oriented model has five major components: attackers, tools, access, results, and objectives.

Fig. 2.8  Howard's model of computer and network attacks

The Microsoft STRIDE model [Her06] was developed as part of Microsoft's Security Development Lifecycle. The STRIDE model analyzes a set of threats against the system components. The threats are spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of privilege.

The NIST Framework for Improving Critical Infrastructure Cybersecurity [NIS14A] is defined as a set of activities through which organizations can manage their cybersecurity goals and processes. The first activity concentrates on identifying cybersecurity risks to systems, assets, data, and capabilities. The risk assessment activity (ID.RA) identifies several subtopics and relates each subtopic to existing standards:

• Asset vulnerabilities are identified and documented.
• Threat and vulnerability information is collected from a variety of sources.
• Internal and external threats are identified and documented.
• Potential impacts on business and the likelihood of those events are identified.
• Risk is assessed based on threats, vulnerabilities, likelihoods, and impacts.
• Risk responses are formulated and prioritized.

The NIST Guide to Industrial Control Systems (ICS) Security [Sto15] proposes risk management that operates at the organization, business process, and IT/ICS levels. It identifies several special considerations for ICS security risk assessment, concentrating on the physical impacts of cyber incidents.

Hutchins et al. [HutXX] defined an intrusion kill chain as a model for computer network attacks. Their kill chain includes seven phases:

• Reconnaissance identifies targets.
• Weaponization makes use of a Trojan to deliver a payload.
• Delivery sends the weapon to the target.
• Exploitation triggers the intrusion code on the target.
• Installation provides a persistent presence of the adversary on the target.
• Command and control (C2) provides remote information on and control over the attack code.
• Actions on objectives perform the desired operations on the target.

They use a course-of-action matrix to analyze security tasks. The matrix's vertical axis covers the kill chain phases. The horizontal axis describes defensive actions: detect, deny, disrupt, degrade, deceive, and destroy. Consulting firm Gartner defined its own cyber kill chain [Law14].

Assante and Lee [Ass15] developed a modified cyber kill chain for industrial control systems. They divide an attack into two stages. In stage 1, the attacker prepares and executes a cyber intrusion; this stage follows the basic outline of the intrusion kill chain. In stage 2, the attacker develops and executes the attack on the industrial control system: attack development and tuning, validation, and the attack proper. They identify constituent actions for the attack phases: enabling includes triggering and delivering; initiating the attack includes modifying and injecting; supporting the attack includes hiding and amplifying.
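A course-of-action matrix of the kind Hutchins et al. describe can be kept as a simple table indexed by kill chain phase and defensive action; gaps in the table show where no control is planned. The three controls entered below are hypothetical examples.

    # Sketch of a course-of-action matrix: phase -> action -> control.
    # The controls entered here are hypothetical examples.
    PHASES = ["reconnaissance", "weaponization", "delivery", "exploitation",
              "installation", "command_and_control", "actions_on_objectives"]
    ACTIONS = ["detect", "deny", "disrupt", "degrade", "deceive", "destroy"]

    matrix = {phase: {action: None for action in ACTIONS} for phase in PHASES}
    matrix["delivery"]["detect"] = "mail gateway alerting"
    matrix["delivery"]["deny"] = "attachment filtering"
    matrix["command_and_control"]["disrupt"] = "egress firewall rules"

    # List the phases that still lack any planned detection.
    print([p for p in PHASES if matrix[p]["detect"] is None])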


Armando and Compagna [Arm04] formulated protocol insecurity checking as a model checking problem. They formulated protocol checks as planning problems: states represent the state of the honest principals, the intruder, and the messages; actions model both intended and unintended possible transitions between states. Security violations are represented by the reachability of bad states. The planning problem can be formulated as a satisfiability problem. The system is solved using an abstraction/refinement approach: abstract models provide sufficient but not necessary conditions for safety. When a possibly unsafe abstraction is found, it is refined to help determine whether the unsafe counterexample is spurious. Abstraction and refinement are applied iteratively until a final proof of safety or unsafety is found.
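The idea of representing security violations as reachability of bad states can be illustrated with an explicit-state search. The toy transition system below is our invention and stands in for the planning/satisfiability formulation that Armando and Compagna actually solve; the point is only that the protocol is judged insecure exactly when a bad state is reachable.

    from collections import deque

    # Toy explicit-state model: transitions include both intended protocol
    # steps and intruder-induced steps. The states are invented stand-ins
    # for a real protocol model.
    TRANSITIONS = {
        "init":               ["sent_nonce"],
        "sent_nonce":         ["key_agreed", "intruder_replay"],
        "intruder_replay":    ["intruder_knows_key"],  # attack step
        "key_agreed":         [],
        "intruder_knows_key": [],
    }
    BAD_STATES = {"intruder_knows_key"}

    def insecure(start: str) -> bool:
        # Breadth-first search: insecure iff some bad state is reachable.
        seen, queue = {start}, deque([start])
        while queue:
            state = queue.popleft()
            if state in BAD_STATES:
                return True
            for nxt in TRANSITIONS.get(state, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    print(insecure("init"))  # True: the replay path reaches a bad state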

2.5  Standards and Certification

Certification may in some cases be provided by companies, professional societies, or other organizations. For example, Underwriters Laboratories provides certification for a range of products. Government certification processes are defined by law and regulation.

Aircraft certification is an example of the processes used for the certification of complex safety-critical systems. International treaties harmonize certification processes and allow countries to recognize the certification results of other signatory countries. In the USA, the foundational regulations for certification are provided in Part 21 of the Federal Aviation Regulations, known as FAR Part 21 [FAA19]. Most aircraft are covered by a combination of a type certificate that certifies the design, a production certificate that governs production, and an airworthiness certificate that certifies the construction (and implicitly the maintenance) of the aircraft. Subpart B describes type certification; generally speaking, design information, inspection and maintenance plans, and flight tests are required for a type certificate. Subpart E describes supplemental type certificates (STCs) that can be used to modify aircraft after manufacture. Subpart G defines production certificates, while subpart H describes airworthiness certificates. SAE ARP4761 (https://www.sae.org/standards/content/arp4761/) provides guidelines for safety assessment during civil aircraft certification. FAR Part 43 governs maintenance, preventive maintenance, rebuilding, and alteration. Generally speaking, any item permanently attached to the aircraft must itself be certified, and maintenance work must be signed off in the aircraft log by a certified mechanic, who must either have been trained at an FAA-approved school or have 18 months of experience each for airframes or power plants, or 30 months for both (https://www.faa.gov/mechanics/become/basic/). Experimental aircraft, such as homebuilts, are covered under separate regulations that govern the issuance of an airworthiness certificate and impose certain operating restrictions on the aircraft.

In 2016, the FAA released a final rule [FAA09, Nam16] for certification of small aircraft that allows the use of industry consensus standards for certification. The rule also defined levels of performance and risk based on the aircraft's maximum seating capacity.


While aircraft certification is rooted in mechanical principles such as structural strength, it also takes into account systems analysis. FAR 25.1309, for example, requires that airplane systems and associated components be designed so that a failure that would prevent continued safe flight and landing is extremely improbable, and that the occurrence of any other failure condition that would reduce the capability of the airplane or the ability of the crew to cope with adverse operating conditions is improbable. Compliance must be shown by analysis and, where necessary, by ground, flight, or simulator tests. This principle is known as fail-safe design [FAA88], which makes use of several concepts:
• Designed integrity and quality, including life limits
• Redundancy or backup systems (see the voter sketch after this list)
• Isolation of systems, components, and elements
• Proven reliability, so that multiple, independent failures are unlikely to occur during the same flight
• Failure warning or indication
• Flight crew procedures for use after failure detection
• Checkability, the ability to check a component's condition
• Designed failure effect limits to limit the safety effects of a failure
• Designed failure paths to control and direct the effects of a failure so as to limit its safety impact
• Margins or factors of safety
• Error tolerance
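As a concrete illustration of the redundancy concept listed above, a two-out-of-three majority voter over replicated sensor channels masks a single faulty channel. The sketch below is ours, not from the advisory circular; in particular, the fallback policy for the no-majority case is a placeholder that a real design would treat as a detected failure.

    #include <stdio.h>

    /* Two-out-of-three majority voter over replicated sensor readings. */
    static int vote(int a, int b, int c) {
        if (a == b || a == c) return a;
        if (b == c) return b;
        return a;   /* no majority: placeholder fallback; flag as a failure in practice */
    }

    int main(void) {
        printf("%d\n", vote(42, 42, 41));   /* faulty third channel is outvoted: prints 42 */
        return 0;
    }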

DO-178C is used for avionics software certification in the USA, Canada, and Europe [Bro10, Jac12]; it is, however, designed to be application-neutral and applicable to other domains. The standard classifies failures using a Development Assurance Level (DAL) based on their effects:
• Level A is catastrophic, generally with airplane loss and possible deaths.
• Level B is hazardous, reducing system performance or the ability of the crew to operate the aircraft.
• Level C is major, significantly reducing the safety margin or increasing crew workload.
• Level D is minor, slightly reducing the safety margin or increasing crew workload.
• Level E is anomalous behavior that has no safety effect on the aircraft or pilot workload.
The number of objectives, and the number of those objectives that are independent, increases with safety level. DO-178C specifies that tools used to create verified software must themselves be verified. The standard identifies a modified V-process with an iterative element:
• Functional requirements, hazard and safety analysis, and functional allocations identify the functions allocated to software and their Development Assurance Level.
• Software is refined from planning, through requirements, design, coding, and integration.



• The products of software development at various stages are fed into a system safety assessment that may update the system functional and hazard/safety analysis.
• Software is verified, and then the system is verified, with the results of software verification feeding into the system safety assessment.

ASTM F3269-17 defines best practices for the certification of unmanned aerial vehicles that rely on complex functions for their flight operations. Complex functions such as autonomous autopilots or flight planning systems are difficult to verify directly using methods such as DO-178C. A run-time assurance architecture, including one or more recovery control functions, is used instead to identify and recover from unusual conditions (a minimal sketch appears below).

Software components of medical devices or software-as-a-medical-device must be validated [FDA02]; software used in device production or manufacturing quality systems must also be validated. The FDA found that 3140 medical devices were recalled between 1992 and 1998. Two hundred forty-two of these recalls were attributable to software; 192 were caused by software defects introduced by changes made after the initial release. This document states "Unlike hardware, software is not a physical entity and does not wear out" [FDA02 p. 8]. Unfortunately, this statement is incorrect: instruction and data values are represented by physical quantities in computing systems, and a variety of physical mechanisms can cause those values to change. The document identifies several typical tasks that can be shown to regulators to indicate that the software has been validated: a quality planning process; requirement documentation, including safety requirements derived from a technical risk management process; design methods such as specifications and design evaluations; coding guidelines and source code evaluations; testing plans and results; user site testing; and maintenance and software update procedures that take safety into account.
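The run-time assurance pattern described above for ASTM F3269-17 can be viewed as a monitor that forwards commands from the complex function while the plant stays inside a verified safety envelope and otherwise switches to a recovery controller. The controllers, the envelope bound, and all names in the sketch below are illustrative assumptions, not content of the standard.

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-ins for a complex (hard-to-verify) function and a simple verified one. */
    static double complex_autopilot(double state)   { return 1.5 * state; }
    static double recovery_controller(double state) { return state > 0.0 ? -1.0 : 1.0; }

    /* Safety envelope check; the bound is an assumed example value. */
    static bool within_envelope(double state) { return state > -10.0 && state < 10.0; }

    /* Run-time assurance step: reject commands that would leave the envelope. */
    static double rta_step(double state) {
        double cmd = complex_autopilot(state);
        if (!within_envelope(state + cmd))
            cmd = recovery_controller(state);   /* recovery function takes over */
        return cmd;
    }

    int main(void) {
        printf("command: %f\n", rta_step(8.0)); /* complex command rejected; recovery used */
        return 0;
    }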

2.6  Quality Management Systems

ISO 9000 [ASQ18A] is a series of quality management and assurance standards. ISO 9001 is one member of that family; its latest version at this writing dates to 2015. The standard is based on several quality management principles:
• Customer focus to understand the needs of customers and align organizational objectives accordingly
• Leadership to establish a vision and challenging goals
• Engagement of people to use their abilities and make them accountable
• A process approach to activities
• Continual improvement of organizational capabilities
• Evidence-based decision-making
• Relationship management with suppliers



ISO/IEC 15504 Automotive SPICE™ [VDA15] is a process assessment model for automotive embedded computing systems. A process reference model includes three categories: primary life cycle processes for the supplier/customer interface, organizational life cycle processes, and supporting life cycle processes. The primary life cycle processes include acquisition, supply, system engineering, and software engineering. A measurement framework, defined by ISO/IEC 33020, includes capability levels, process attributes, ratings, and a process capability level model. Safety task implementation includes tailoring safety-critical requirements to the specifics of software, hazard analysis, the development of safety-critical software requirements, and a preliminary software design. CMMI-DEV® [CMM10] provides best practices for the development of products and services that may span the enterprise.

2.7  Safety Design Processes

The Joint Services Computer Resources Management Group developed a Software System Safety Handbook [Alb99]. They identify several processes and methods for software safety within a systems safety context. They describe safety planning by the customer or procuring authority as an iterative process that alternates between requirements/safety policy and software safety plans. The safety engineering of software includes a preliminary hazard list, recasting safety-critical requirements for the software environment, preliminary hazard analysis, a set of system safety-critical software requirements, preliminary subsystem hazard analysis, and a detailed software subsystem hazard analysis. Safety testing of software is an iterative process alternating between test plans and test results. A safety assessment report is created to describe the safety analysis of the system.

ISO 26262 [ZVE12] addresses the functional safety of automotive electrical/electronic (E/E) systems. Functional safety is the absence of unreasonable risk to people due to E/E malfunctions. 26262 provides guidelines only; it is not a certification standard. The 26262 process operates on vehicle subsystems (transmission control, ABS, etc.). The subsystem's hazards are identified, and safety goals are specified. Each safety goal is classified: QM for goals that can be achieved using a standard quality management system, levels A–D otherwise. In contrast to DO-178C, 26262 assigns A as the least critical and D as the most critical level. A goal is evaluated for its severity, exposure, and controllability. A functional safety concept identifies an approach; a technical safety concept develops a detailed specification; the hardware and software safety requirements come from the technical safety concept.

ASTM F3153 describes a process for developing system-level tests of avionics systems to verify their compliance with safety objectives. The verification process includes planning, testing, test failure resolution, and regression testing.



A typical software safety methodology includes several steps [Tig14]:
• The software criticality index of each safety-significant function is determined.
• Software requirement hazard analysis derives the software requirements necessary to provide a safe implementation and mitigate hazards.
• Software architectural hazard analysis is conducted on architectural documents and requirements. It is performed before the preliminary design review.
• Software design hazard analysis expands the analysis to consider the planned implementation; it reviews each identified hazard and looks for new hazards. This step is performed before the critical design review.
• Code-level hazard analysis analyzes safety-significant variables, typing, code flow, and error processing.
• Operator documentation safety review reviews user documents for adequacy and identifies additional hazards introduced by the documents.
• Software safety testing verifies and validates all software safety requirements.
• Formal review gives evidence for the process and the resultant risk level of the design.

Tribble et al. describe the use of model checking for safety analysis of the requirements for a flight guidance system [Tri02]. They specified the system using RSML−e, a variant of Requirements State Machine Language that does not use explicit events; models in this language are synchronous. Model checking verifies that a state-oriented system satisfies a given property over the entirety of its state space. They used model checking to test a range of properties; examples include incorrect flight guidance steering values and incorrect mode indication. They found that model checkers were capable of verifying the required properties; they also found that restating the requirements for model checking helped them improve the requirements.

Leveson [Lev11] developed the STAMP methodology for the design of complex safety-critical systems. Safety considerations are cast as constraints on the design and operation of the system. STAMP takes into account human factors and organizational characteristics in the identification of hazards and safety constraints. Leveson states that STAMP analysis cannot be reduced to drawings.

Safety cases [Eur06] have emerged as an alternative methodology for safety analysis. Traditional safety methodologies concentrate on process. A safety case, in contrast, is goal-based or evidence-based. Safety cases are seen as part of a burden of proof on designers and managers to show that their systems provide acceptable levels of safety. Safety cases have been advocated for autonomous vehicles to adapt to the novel architectures of these vehicles [Koo18]. The safety case methodology has been criticized for concentrating on a single cause rather than a causality chain. A safety case includes:
• The aim of the safety case
• The audience for the safety case and why it is being written


• The scope of the document
• A description of the system and its environment
• If created for a modification of the system, a justification for the change
• A safety argument
• Supporting safety evidence
• Caveats, assumptions, and limitations
• Conclusions

The MISRA series of standards gives coding-level guidelines for automotive software. The MISRA Generic Modeling Design and Style Guidelines [MIR09] define best practices for topics such as directory and file names, units, and documentation. A few examples of the MISRA C guidelines [MIR13] indicate their scope (illustrated in the fragment after this list):
• No unreachable code.
• A typedef name should be unique.
• Macro and identifier names should be unique.
• All if…else if constructs are terminated with an else clause.
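The flavor of these rules can be seen in a small fragment. The function below is written to respect the guidelines just listed, with a unique typedef name, no unreachable code, and an else-terminated if…else if chain; it is an illustration of the style, not a claim of full MISRA C compliance.

    #include <stdint.h>

    /* Unique typedef name for an engineering unit. */
    typedef uint16_t rpm_t;

    rpm_t clamp_rpm(rpm_t r) {
        rpm_t out;
        if (r < 500u) {
            out = 500u;
        } else if (r > 6000u) {
            out = 6000u;
        } else {        /* required terminating else clause */
            out = r;
        }
        return out;     /* single exit; no unreachable code */
    }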

AADL [Fei13] is a language for model-based design of networked control and other embedded computing systems. It does not specifically address safety, although its intended application set consists largely of safety-critical systems. It provides mechanisms for the specification of computing platforms and the programs that run on them. AADL provides for the description of multithreaded software systems. It also provides a mechanism to describe a flow, a logical path through an architecture.

The V methodology has been used as an organizing framework in both MISRA and ISO 26262. As shown in Fig. 2.9, design proceeds top-down from requirements to coding. Verification proceeds bottom-up, with a matching verification step for each level of abstraction.

Meyer [Mey18] reviews agile programming methodologies and points out that user stories are insufficient to provide a consistent set of requirements. Agile methods consider each user story as separate, resulting in a set of independently derived requirements that may not be consistent with each other. Consistency of user requirements is a system property, as is safety; inconsistent requirements present the danger of unsafe system behaviors.

MIL-STD-882E from the US Department of Defense [DoD12] defines standard practice for safety in system engineering. It is designed to cover a wide range of hazards. The risk matrix defined by this standard uses four levels of severity (catastrophic, critical, marginal, negligible) and six levels of probability (frequent, probable, occasional, remote, improbable, eliminated); a sketch of such a matrix lookup appears below. It mandates that the Systems Engineering Management Plan for a project include both a System Safety Program Plan and a Hazard Management Plan. The System Safety Program Plan describes the process used for the analysis of hazards and for risk assessment and management. The Hazard Management Plan is used to analyze hazards, risks, and the management or elimination of those risks.
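A hazard tracking tool might encode such a risk matrix as a simple lookup. The sketch below uses the MIL-STD-882E severity and probability categories, but the risk-level assignments in the table are illustrative placeholders, not the standard's normative matrix.

    /* Categories follow MIL-STD-882E; the assignments below are ILLUSTRATIVE only. */
    typedef enum { CATASTROPHIC, CRITICAL, MARGINAL, NEGLIGIBLE } Severity;
    typedef enum { FREQUENT, PROBABLE, OCCASIONAL, REMOTE, IMPROBABLE, ELIMINATED } Probability;
    typedef enum { HIGH, SERIOUS, MEDIUM, LOW } RiskLevel;

    RiskLevel assess(Severity s, Probability p) {
        /* rows: severity; columns: probability */
        static const RiskLevel matrix[4][6] = {
            /* FREQ     PROB     OCC      REMOTE   IMPROB  ELIM */
            { HIGH,    HIGH,    HIGH,    SERIOUS, MEDIUM, LOW },  /* catastrophic */
            { HIGH,    HIGH,    SERIOUS, MEDIUM,  MEDIUM, LOW },  /* critical     */
            { SERIOUS, SERIOUS, MEDIUM,  MEDIUM,  LOW,    LOW },  /* marginal     */
            { MEDIUM,  MEDIUM,  LOW,     LOW,     LOW,    LOW },  /* negligible   */
        };
        return matrix[s][p];
    }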



Fig. 2.9  The V methodology for design and verification

2.8  Security Design Processes

The term attack surface is commonly used in information security to refer to the set of interfaces (known as attack vectors) or locations at which an attacker may be able to either extract useful data or inject data into the system.

Bell and LaPadula [Bel73] analyzed computer system security using set theory. The model included several sets: subjects (processes) S, data objects O, classifications or clearance levels C, need-to-know categories K, access attributes A, requests R, decisions D, and sequence indices T. The state of the system includes three components: a set that defines which subjects have access to which objects in that state, an access matrix that gives possible access conditions, and the clearance levels of all subjects. A state transition relation governs the system's state behavior. A security condition relative to a clearance function f holds for a subject/object pair if the subject's clearance is greater than or equal to the object's classification and the subject belongs to one of the object's need-to-know categories. They demonstrated that the security condition can be used to define a set of conditions under which the system is secure in any state (a small sketch of this check appears at the end of this section).

Saltzer [Sal74] identified several security design principles:
• Economy of mechanism reduces the chances of flaws and faults that could compromise a security mechanism.
• Fail-safe defaults require that permission be explicit and exclusion the default.
• Complete mediation ensures that every object access is checked to ensure that the access is allowable.
• Open design forbids reliance on what is now known as security-through-obscurity.



• Separation of privilege requires multiple keys for privilege.
• Least privilege causes programs and users to operate under the least set of privileges required to complete the task.
• Least common mechanism minimizes the commonality of mechanisms among users.
• Psychological acceptability promotes ease of use.
• Work factor identifies the amount of effort required for an attacker to subvert a mechanism.
• Compromise recording creates audit trails.

Biba [Bib77] analyzed several security models for purposes of national security, which requires the highest levels of integrity. Biba observed that traditional notions of integrity concentrated on access to information but that computer systems also require attention to information modification; some information may require broad access but very narrow modification rights. Security was modeled using set theory. A ring policy, which was implemented in hardware in the MULTICS system, concentrated on direct modification of information, not indirect modification. Each information subject (a module that acts upon information) and object (a repository of information) is given an integrity level that does not change during the subject's or object's lifetime. A subject may modify only an object whose integrity level is less than or equal to the subject's integrity level.

NCCIC [NCC18] describes a recommended process for antivirus updates for industrial control; this note concentrates on Windows-based SCADA. The document recommends placing antivirus, update, and patch servers in the DMZ for the control center network. The antivirus/update/patch servers communicate only indirectly with the vendor and organization antivirus networks.

The NCCIC Cyber Security Evaluation Tool (CSET)® (https://ics-cert.us-cert.gov/Downloading-and-Installing-CSET) is an application that can be used to analyze network security practices for IT and industrial control. After selecting an appropriate set of cybersecurity standards, the user determines the security assurance level, diagrams the network topology, and identifies the criticality of components. Based on these inputs, the user is asked a set of questions, and a report is generated to describe the assessment results.

The Common Vulnerability Scoring System (CVSS) [FIRXX] is a quantitative model for scoring IT vulnerabilities. Two versions are in use: v3.0 and v2.0. Calculators are available that implement the CVSS scoring system. CVSS v3.0 base metrics include exploitability metrics (attack vector, attack complexity, privileges required, user interaction), scope, and impact metrics (confidentiality, integrity, availability). Separate metrics quantify temporal and environmental characteristics.

The NIST Platform Firmware Resiliency Guidelines [Reg18] identify three levels of platform resiliency: protected devices must meet a set of protection guidelines but do not necessarily allow full recovery of firmware or critical data; a recoverable platform must meet those criteria and also provide recovery mechanisms; resilient devices meet a larger set of guidelines. Guidelines include:



• Security mechanisms are based on roots of trust or rooted chains of trust.
• Changeable firmware shall rely on a root of trust for update.
• Devices with intrusion detection shall make use of a root of trust for the detection services.
• Recovery shall rely on a root of trust.
• The update mechanism shall be the only mechanism for updating device firmware.
• Flash shall be protected so that it is unmodifiable outside of an authenticated/secure update mechanism.
• Protection mechanisms cannot be bypassed.
• Write protection of field non-upgradeable memory shall not be modifiable.
• Critical data shall be modifiable only through the device or defined interfaces.
• A successful attack on firmware shall not compromise the device's detection capability.
• The device shall perform integrity checks on critical data before use.
• Firmware recovery mechanisms shall resist attacks against critical data or the primary firmware image.
• Critical data recovery mechanisms shall resist attacks.

The NIST Guide to Industrial Control Systems (ICS) Security [Sto15] recommends a process that starts with risk assessment at three levels: organization, business process, and information system. It notes that safety is an important consideration in the design of industrial control systems. The recommended security program proceeds over several steps: develop the security business case, build and train a cross-functional team, identify the charter and scope of the team, define ICS policies and procedures, implement an ICS security risk management framework, and provide training for ICS staff. A security architecture for industrial control system security provides the maximum allowable separation between the ICS and the corporate network. To the extent that devices on the two networks must communicate (e.g., to transfer billing information), they should do so through a firewall and a DMZ.

The NIST Guidelines for Smart Grid Cybersecurity [NIS14B] concentrate on smart grids: distributed electric power control and management systems. Their approach contains a number of items specific to energy systems. The framework identifies seven domains for its logical reference model: markets for energy, operations, service providers, transmission facilities (long-range, bulk transfer), distribution facilities (transfer close to the customer), generation, and customers. The document identifies power system availability as a key goal for all power system design and operation, including cybersecurity. A defense-in-depth strategy places defenses in multiple places and in multiple layers. Logical interfaces in the logical reference model are categorized based on their security characteristics. Based on cybersecurity requirements for the smart grid system, a set of cybersecurity measures is identified, and each technical requirement is assigned a security impact level of low, moderate, or high.

NIST SP 800-53 [NIS13] provides US federal organizations with guidance on the management of information security risk. It is based on a multitier model including



organizations, mission/business processes, and information systems. Its process includes the selection of security control baselines, tailoring the baselines to the organization, documenting the process for selecting security controls, and applying the control selection process to both new and legacy systems.

NIST has documented considerations for managing the cybersecurity and privacy risks of IoT devices and systems [Boe18]. It identifies characteristics of IoT devices that may not be adequately coordinated with NIST SP 800-53: devices may not have asset management IDs, software updates may be difficult, enterprise user authentication may not be supported, etc. It recommends that organizations understand IoT device risk considerations and the resulting challenges, adjust policies and processes to mitigate IoT-related risk throughout the device lifecycle, and implement updated mitigation practices.

Seacord [Sea13] discusses secure coding of C and C++ programs. Potential exploits include string manipulation, arbitrary memory writes, memory allocation, integer-related issues, formatted output, concurrency, and file I/O. Methodological tools for secure coding include requirement analysis and misuse cases, analysis of threat models and attack surfaces, identifying and mitigating insecurities in existing code, identifying and enforcing trust boundaries, and making use of compiler and code analysis tools, code audits, and other reviews. The CERT® C Coding Standard [Sea14] provides a set of coding rules for the C language. Example rules include:
• Declare identifiers before use.
• Do not create or use out-of-bounds array indexes or pointers.
• Within signal handlers, call only functions that are safe to use in an asynchronous execution environment.
• Do not destroy a locked mutex.

NIST Special Publication 800-61 [Cic12] provides assistance to organizations in establishing incident response capabilities for computer security and procedures for handling incidents that do occur. The process used to create an incident response capability should include:
• Creating a policy and plan for incident response
• Developing procedures to handle and report incidents
• Developing guidelines for communicating about incidents with outside parties
• Creating a team structure and staffing model
• Establishing relationships with other parts of the organization
• Determining the services to be provided by the incident response team
• Staffing the response team and providing training

The report makes other recommendations:
• Ensure that computing networks and systems are secure.
• Document guidelines for interacting with other organizations: other incident response teams, law enforcement, etc.



• Be broadly prepared but focus on common attack vectors.
• Disseminate the importance of incident detection and analysis throughout the organization.
• Create written guidelines for incident prioritization.
• Perform lessons-learned analysis on incidents.
The incident response model includes four phases: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity.

A zero-day vulnerability [Par15] is one that is not known to those who are interested in mitigating that vulnerability. Once the vulnerability becomes widely known, an attack that uses it is referred to as a zero-day exploit. Zero-day attacks can be detected through several methods: statistical analysis of attack profiles, signatures of known exploits, and analysis of the exploit's behavior relative to the target. Security researchers may identify exploits independent of their discovery in the wild. In such cases, the exploits are described to the appropriate vendors or development organizations so that they can be mitigated before they are publicly discussed. Even after mitigations have been developed, they may not be universally applied, leaving vulnerabilities in place. In at least one case, publication of an exploit led to its use in an attack [Gre18]: an exploit known as Mimikatz, published in 2011, was able to grab Windows passwords that had been left in RAM. Mimikatz was later combined with EternalBlue to create NotPetya, a virulent and highly destructive virus. Microsoft had previously released a patch for EternalBlue; unpatched computers were attacked by NotPetya to obtain passwords to be used on other computers that had been patched against EternalBlue.
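The Bell and LaPadula security condition described at the start of this section can be stated compactly in code. The sketch below follows the formulation given above (clearance dominance plus membership in at least one need-to-know category); the types and field names are our own, with category sets represented as bit masks.

    #include <stdbool.h>

    typedef struct { int clearance;      unsigned categories; } Subject;
    typedef struct { int classification; unsigned categories; } Object;

    /* Security condition: the subject's clearance dominates the object's
       classification, and the subject holds at least one of the object's
       need-to-know categories. */
    bool may_read(const Subject *s, const Object *o) {
        bool dominates    = s->clearance >= o->classification;
        bool need_to_know = (s->categories & o->categories) != 0u;
        return dominates && need_to_know;
    }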

2.9  Comparison and Contrast of Safety and Security Design Processes

Safety and security analysis are concentrated at different parts of the design process:
• Safety risk analysis is driven by requirements and is concentrated in the requirements phase.
• Security vulnerability analysis is driven by the system structure and is concentrated in the architecture phase and, to some extent, at coding.
Hazard analysis and attack analysis both take into account the relationship between root causes and ultimate effects. Because safety has long been treated as a requirement, legacy engineered systems generally provide some basic level of safety. Information security as a requirement, however, is relatively new and still not consistently followed in the design of many consumer electronics and IoT devices. The large overhang of installed, insecure devices on the Internet is a continuing concern and a factor that must be considered during the design of systems—attacks may be mounted through legacy devices.



Traditional IT applications are often transaction-oriented. A transaction is of relatively short duration and typically has cleanly identifiable start and end points. Many cyber-physical and IoT applications, in contrast, require extended or continuous operation of a process. Many security analysis processes are oriented toward transactional systems and are not always adaptable to continuous processes. For example, ensuring the integrity of a signal from a sensor requires maintaining the data's integrity over an extended period without introducing excessive delay. Timing-oriented attacks have been studied in the context of cyber-physical systems but are not traditionally considered in information security analysis. Physical plants introduce new attack models that have not yet been fully integrated into security design processes:
• Timing attacks modify the timing of data, either delaying data to incur a missed deadline or changing the timing relationships between data elements.
• Replay attacks replay stored versions of signals to hide changes to the behavior of the physical plant. The Stuxnet attack [Fal10, Fal11] made use of replay attacks to hide changes to the behavior of centrifuges. (A replay-detection sketch appears at the end of this section.)

ASTM F3201-16, which provides standard practices for the dependability of software used in unmanned aerial vehicles, is one example of a document that emphasizes both safety and security aspects of software design.

Safety design processes often assume that system designers have a great deal of control over the characteristics of the components they use. Embedded software throws that assumption into question—it is complex, and its code is generally not visible to customers. Compounding the difficulties, embedded software may change over time, even well after system deployment. Update mechanisms may not be adequately documented, and some devices may not be updateable at all, leaving them vulnerable to security issues.

Incident reporting is treated somewhat differently by the safety and security communities. Safety incident reporting is widely accepted; governmental safety bodies often maintain go teams to respond quickly to safety incidents. Security breaches are not always promptly reported. Security problems may not be discovered for quite some time; even when they are known to have occurred, anecdotal evidence suggests that some organizations delay reporting. The US Department of Justice provides a list of suggested agencies to which computer-related crimes may be reported (https://www.justice.gov/criminal-ccips/reporting-computer-internet-related-or-intellectual-property-crime). The Internet Crime Complaint Center is a reporting mechanism for Internet-facilitated criminal activity that allows submission of information to the FBI. The Department of Homeland Security operates the National Infrastructure Coordinating Center to gather reports related to infrastructure security issues. As previously discussed, US-CERT collects and disseminates technical reports on information security issues.

The V methodology is inadequate for modern software systems because it does not take into account the shifting threat landscape presented by computer security. The V methodology assumes that risks can be identified in the early phases of the design process, with verification ensuring that the risks have been appropriately mitigated. Security vulnerabilities, in contrast, are often identified after a device has shipped. Side channel vulnerabilities, for example, present an entire category of problems that in many cases were not understood at the device's design time. Devices and systems may also be attacked indirectly, such as through maintenance computers.
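As an illustration of defending against the replay attacks discussed above, a controller can require each sensor sample to carry an authenticated freshness value. The sketch below uses a monotonic counter and omits the message authentication code that would protect it in a real design; all names are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    /* A sample carries a monotonically increasing counter (assumed to be
       covered by a MAC, omitted here) along with the sensor value. */
    typedef struct { uint32_t counter; double value; } Sample;

    static uint32_t last_seen = 0u;

    /* Reject samples whose counter does not advance: stale or replayed data. */
    bool accept(const Sample *s) {
        if (s->counter <= last_seen)
            return false;
        last_seen = s->counter;
        return true;
    }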

References

[Alb99] D. Alberico, J. Bozarth, M. Brown, J. Gill, S. Mattern, V.I. Arch McKinlay, Joint Software System Safety Committee: Software System Safety Handbook, A Technical & Managerial Team Approach (Joint Services Computer Resources Management Group, U.S. Navy, U.S. Army, and the U.S. Air Force, 1999)
[Avi86] A. Avizienis, J.C. Laprie, Dependable computing: from concepts to design diversity. Proc. IEEE 74(5), 629–638 (1986). https://doi.org/10.1109/PROC.1986.13527
[Arm04] A. Armando, L. Compagna, Abstraction-driven SAT-based analysis of security protocols, in Theory and Applications of Satisfiability Testing, SAT 2003, Lecture Notes in Computer Science, vol. 2919, ed. by E. Giunchiglia, A. Tacchella (Springer, Berlin/Heidelberg, 2004)
[ASQ18A] American Society for Quality, What is the ISO 9000 Standards Series? http://asq.org/learn-about-quality/iso-9000/overview/overview.html. Accessed 2 Sept 2018
[ASQ18B] American Society for Quality, Failure Mode Effects Analysis (FMEA), http://asq.org/learn-about-quality/process-analysis-tools/overview/fmea.html. Accessed 3 Sept 2018
[Ass15] M.J. Assante, R.M. Lee, The Industrial Control System Cyber Kill Chain (SANS Institute, 2015)
[Bel73] D. Elliott Bell, L.J. LaPadula, Secure Computer Systems: Mathematical Foundations, MITRE Technical Report 2547, Volume 1 (1 Mar 1973)
[Bib77] K.J. Biba, Integrity Considerations for Secure Computer Systems, ESD-TR-76-372, MTR-3153, Rev. 1, Deputy for Command and Management Systems, Electronic Systems, Air Force Systems Command (Apr 1977)
[Boe18] K. Boeckl, M. Fagan, W. Fisher, N. Lefkovitz, K.N. Megas, E. Nadeau, D.G. O'Rourke, B. Piccarreta, K. Scarfone, Considerations for Managing Internet of Things (IoT) Cybersecurity and Privacy Risks, Draft NISTIR 8228 (2018)
[Bro10] B. Brosgol, C. Comar, DO-178C: a new standard for software safety certification, presented at the 22nd Systems and Software Technology Conference, 26 Apr 2010, available at http://www.dtic.mil/dtic/tr/fulltext/u2/a558107.pdf
[Cic12] P. Cichonski, T. Millar, T. Grance, K. Scarfone, Computer Security Incident Handling Guide, Special Publication 800-61, Revision 2 (National Institute of Standards and Technology, Gaithersburg, 2012)
[CMM10] CMMI Product Team, CMMI® for Development, Version 1.3, CMU/SEI-2010-TR-033 (Nov 2010)
[DoD12] U.S. Department of Defense, Department of Defense Standard Practice: System Safety, MIL-STD-882E (11 May 2012)
[Eur06] European Organization for the Safety of Air Navigation, Safety Case Development Manual, DAP/SSH/091, edition 2.2 (13 Nov 2006)
[FAA09] Federal Aviation Administration, Part 23 – Small Airplane Certification Process Study: Recommendations for General Aviation for the Next 20 Years, OK-09-3468 (July 2009)
[FAA19] Federal Aviation Administration, Title 14, Chapter 1, available at www.ecfr.gov. Accessed 4 May 2019
[FAA88] Federal Aviation Administration, Advisory Circular: System Design and Analysis, AC 25.1309-1A (21 June 1988)



[Fal10] N. Falliere, Stuxnet introduces the first known rootkit for industrial control systems, Symantec Official Blog (6 Aug 2010). http://www.symantec.com/connect/blogs/stuxnet-introduces-first-known-rootkit-scada-devices
[Fal11] N. Falliere, L.O. Murchu, E. Chien, W32.Stuxnet Dossier, version 1.4 (Feb 2011), available at www.symantec.com
[FDA02] Food and Drug Administration, General Principles of Software Validation; Final Guidance for Industry and FDA Staff (11 Jan 2002)
[Fei13] P.H. Feiler, D.P. Gluch, Model-Based Engineering with AADL: An Introduction to the SAE Architecture Analysis & Design Language (Addison-Wesley, 2013)
[FIRXX] FIRST.Org, Inc., Common Vulnerability Scoring System v3.0: Specification Document, undated
[Gre18] A. Greenberg, The untold story of NotPetya, the most devastating cyberattack in history, Wired (22 Aug 2018), https://www.wired.com/story/notpetya-cyberattack-ukraine-russia-code-crashed-the-world/
[Her06] S. Hernan, S. Lambert, T. Ostwald, A. Shostack, Uncover security design flaws using the STRIDE approach, MSDN Magazine, November 2006, http://msdn.microsoft.com/msdnmag/issues/06/11/ThreatModeling/default.aspx. Retrieved from the Wayback Machine (17 Aug 2018)
[How97] J.D. Howard, Analysis of Security Incidents on the Internet 1989–1995, Ph.D. dissertation, Carnegie Mellon University (7 Apr 1997)
[HutXX] E.M. Hutchins, M.J. Cloppert, R.M. Amin, Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains, Lockheed Martin Corporation, undated, https://lockheedmartin.com/content/dam/lockheed-martin/rms/documents/cyber/LM-White-Paper-Intel-Driven-Defense.pdf
[Jac12] S.A. Jacklin, Certification of safety-critical software under DO-178C and DO-278A, in AIAA Infotech@Aerospace 2012, AIAA, June 2012, available at https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20120016835.pdf
[Koo18] P. Koopman, How to keep self-driving cars safe when no one is watching for the dashboard warning light, The Hill (30 June 2018), http://thehill.com/opinion/technology/394945-how-to-keep-self-driving-cars-safe-when-no-one-is-watching-for-dashboard
[Lan81] C.E. Landwehr, Formal models for computer security. ACM Comput. Surv. 13(3), 247–278 (1981). https://doi.org/10.1145/356850.356852
[Law14] C. Lawson, Addressing the Cyber Kill Chain, Gartner, ID G00263765 (15 Aug 2014)
[Lev11] N.G. Leveson, Engineering a Safer World: Systems Thinking Applied to Safety (MIT Press, 2011)
[Mey18] B. Meyer, Making sense of agile methods. IEEE Softw. 35(2), 91–94 (2018)
[MIR09] MIRA, MISRA AC GMG: Generic Modeling Design and Style Guidelines, version 1.0 (May 2009)
[MIR13] MIRA, MISRA C:2012, Guidelines for the Use of the C Language in Critical Systems (Mar 2013)
[Nam16] D. Namowitz, Part 23 reform: FAA releases final rule on small aircraft certification, AOPA (16 Dec 2016), https://www.aopa.org/news-and-media/all-news/2016/december/16/part-23-reform-faa-releases-final-rule-on-small-aircraft-certification
[NCC18] National Cybersecurity and Communications Integration Center, Recommended Practice: Updating Antivirus in an Industrial Control System, IR-18-214 (2 Aug 2018)
[NIS13] National Institute of Standards and Technology, Joint Task Force Transformation Initiative, Security and Privacy Controls for Federal Information Systems and Organizations, NIST Special Publication 800-53, Revision 4 (U.S. Dept. of Commerce, National Institute of Standards and Technology, Gaithersburg, 2013)
[NIS14A] National Institute of Standards and Technology, Framework for Improving Critical Infrastructure Cybersecurity, Version 1.0 (12 Feb 2014)
[NIS14B] National Institute of Standards and Technology, Guidelines for Smart Grid Cybersecurity: Volume 1 – Smart Grid Cybersecurity Strategy, Architecture, and High-Level Requirements, NISTIR 7628 Revision 1 (National Institute of Standards and Technology, Gaithersburg, 2014)



[Par15] R. Park, Guide to zero-day exploits, Symantec Official Blog (9 Nov 2015), http://www.symantec.com/connect/item-feeds/blog/98071/feed/all/all/all
[Reg18] A. Regenscheid, Platform Firmware Resiliency Guidelines, NIST Special Publication 800-193 (National Institute of Standards and Technology, Gaithersburg, 2018)
[Sal74] J.H. Saltzer, Protection and the control of information sharing in MULTICS. Commun. ACM 17(7), 388–402 (1974). https://doi.org/10.1145/361011.361067
[Sch14] A. Scharl, K. Stottlar, R. Kady, Functional Hazard Analysis (FHA) methodology tutorial, in International System Safety Training Symposium, Aug 2014, NSWCDD-MP-14-00380
[Sch99] B. Schneier, Attack trees, Dr. Dobb's Journal (December 1999). Available at https://www.schneier.com/academic/archives/1999/12/attack_trees.html
[Sea13] R.C. Seacord, Secure Coding in C and C++, SEI Series, 2nd edn. (Addison-Wesley, Boston, 2013)
[Sea14] R.C. Seacord, The CERT® C Coding Standard: 98 Rules for Developing Safe, Reliable, and Secure Systems, SEI Series, 2nd edn. (Addison-Wesley, Upper Saddle River, 2014)
[Sto15] K. Stouffer, V. Pillitteri, S. Lightman, M. Abrams, A. Hahn, Guide to Industrial Control Systems (ICS) Security, NIST Special Publication 800-82, Revision 2 (National Institute of Standards and Technology, Gaithersburg, 2015)
[Tig14] C. Tilghman, Mei Chu Li, M. Zemore, Software safety analysis procedures, in International System Safety Training Symposium, St. Louis, Aug 2014, NSWCDD-MP-14-00395, https://system-safety.org/issc2014/82_Software_Safety_Analysis_Procedures.pdf
[Tri02] A.C. Tribble, D.L. Lempia, S.P. Miller, Software safety analysis of a flight guidance system, in Proceedings, The Digital Avionics Systems Conference, vol. 2, 2002
[VDA15] VDA QMC Working Group 13/Automotive SIG, Automotive SPICE Process Assessment/Reference Model, version 3.0 (16 July 2015)
[Ves81] W.E. Vesely, F.F. Goldberg, N.H. Roberts, D.F. Haasl, Fault Tree Handbook (NUREG-0492, Systems and Reliability Research, Office of Nuclear Regulatory Research, U.S. Nuclear Regulatory Commission, Washington, D.C., 1981)
[ZVE12] ZVEI, Executive Summary, Funktionale Sicherheit ISO 26262, ZVEI UG2 ad hoc working group "Functional Safety in accordance with ISO 26262," ZVEI, German Electrical and Electronic Manufacturers' Association e.V. (2012)

Chapter 3

Threats and Threat Analysis

3.1  Introduction

This chapter develops models for the unified analysis of the security vulnerabilities and safety hazards that constitute various forms of threats to security and safety. A unified view of threats leads to the understanding that these threats can come from design flaws and faults at many levels of abstraction:
• Application-level threats can destroy equipment or produce defective and harmful products.
• Architecture-level threats can result in timing violations or excursions into unsafe operating modes.
• Device-level threats can compromise code, disable a device, or change critical data.

Safety analysis starts with risks, which are defined by the application; security analysis starts with threats, which are much more closely tied to the system architecture and module design. Our aim is an analysis method that can be applied at the start of the system design process and extended well into the design process as new information exposes possible new threats.

Security treats information loss as all-or-nothing: a compromise results in the exposure of a broad class of data, and once that data is out, it can be copied and spread widely. In contrast, safety emphasizes safe operation with reduced capability. When we consider threats to cyber-physical and IoT systems, some time series attacks may not be all-or-nothing but may instead jeopardize data in a certain window of time.

In this chapter, we propose new threat models that combine safety and security aspects: Section 3.2 discusses vulnerabilities, hazards, and threats; Section 3.3 describes compound threats; Section 3.4 reviews threat analysis models; and Section 3.5 describes characteristics of vulnerabilities. We propose in Section 3.6 an iterative threat management methodology that analyzes and mitigates threats at several levels of abstraction. Section 3.7 describes threat mitigation techniques both pre- and post-deployment.




3.2  Vulnerabilities, Hazards, and Threats

We use the term vulnerability in the context of safety to denote a design flaw or an improper use of a system, such as a poor safety-oriented procedure. The flaw may result in safety hazards, security threats, or both. As we saw in Chap. 2, the safety community uses the term hazard for any system characteristic or event that may threaten life or property. The security community typically talks about a hierarchy of concerns:
• A vulnerability is a system security weakness.
• A threat is a possible means to exploit a vulnerability.
• An exploit is software that can be used by an attacker to take advantage of a vulnerability.
• An attack is an implementation of a threat.
In the context of a unified view of safety and security, we will use the term threat more generally to refer to any combination of security vulnerabilities and safety risks. A threat may result from combinations of security and safety problems. These compound threats are perhaps the most troubling because they allow attackers to extend the effects of their attacks beyond cyberspace into the physical world.

3.3  Compound Threats

A truism in safety is that many accidents are the result of a chain of causes. The same is arguably true in security, in which a vulnerability's exploitation is extended by other system design weaknesses. A threat may come purely from either safety or security issues. Compound threats that combine safety and security aspects may allow attackers to cause physical damage as a result of their cyber activities.

Figure 3.1 shows an example of a threat tree resulting from the malicious exercise of an unstable physical mode. The figure extends the fault tree notation with filled-in corners: a top-left corner denotes a primarily safety threat, while a top-right corner denotes a primarily security threat. In the example, the system exhibits a safety vulnerability in the form of an unstable mode; it also contains a security vulnerability in the form of an authentication problem. When the authentication vulnerability is exploited, it allows the attacker to issue a malicious command that causes the system to enter an unsafe state. In this case, the root causes include both safety and security issues. Along the threat tree paths, safety and security issues may alternate.

Fig. 3.1  An example threat tree for malicious exercise of an unstable mode

Leveson's STAMP methodology does not take into account computer security. However, its approach is well suited to the analysis of security concerns. Computer security attacks are made possible not just by technical flaws in the design of a system but also by organizational and human behavior factors. Spear phishing to obtain user credentials is an example of a human factor vulnerability. The 2015 attack on the Ukrainian power grid used spear phishing to obtain user credentials for the target energy distributors [EIA16].
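A threat tree such as the one in Fig. 3.1 can be evaluated mechanically once truth values are assigned to its root causes. The sketch below is a minimal evaluator over AND/OR gates; the two-leaf tree it builds mirrors the figure's combination of an unstable mode and an authentication flaw but is otherwise hypothetical.

    #include <stdbool.h>

    typedef enum { LEAF, AND_GATE, OR_GATE } NodeKind;

    typedef struct Node {
        NodeKind kind;
        bool occurred;                    /* used by leaves (root-cause events) */
        const struct Node *left, *right;  /* used by gates */
    } Node;

    /* Recursively evaluate whether the threat at this node is realized. */
    bool threat_realized(const Node *n) {
        switch (n->kind) {
        case LEAF:     return n->occurred;
        case AND_GATE: return threat_realized(n->left) && threat_realized(n->right);
        default:       return threat_realized(n->left) || threat_realized(n->right);
        }
    }

    int main(void) {
        Node unstable = { LEAF, true, 0, 0 };   /* safety root cause   */
        Node auth     = { LEAF, true, 0, 0 };   /* security root cause */
        Node top      = { AND_GATE, false, &unstable, &auth };
        return threat_realized(&top) ? 1 : 0;   /* both causes present: threat realized */
    }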

3.4  Threat Analysis Models

The last section introduced the threat tree as an example of a traditional analysis method extended to include both safety and security information. We can generalize several other analysis methods from both domains to provide threat analysis tools.

Safety-oriented fault tree methodologies use cutsets to identify sufficient conditions for mitigating a fault; the fault must be separated from all its root causes. Threat trees allow us to identify cutsets of different types: a cutset may consist entirely of safety-oriented or of security-oriented nodes. We can also analyze paths through the threat trees (from root causes to failure) to study the relationships between safety and security, classifying paths by the proportion of safety-oriented and security-oriented nodes they contain.

We can generalize the Risk Priority Number from risks to threats. We use the Risk Priority Number (RPN) to characterize safety hazards, with Sf, Of, Df denoting the severity, occurrence, and detection factors for failures. These values take into account safety issues independent of the security issues. We can define a similar vulnerability priority number VPN for security vulnerabilities:

VPN = Sv × Ov × Dv    (3.1)



In this case, Sv, Ov, and Dv are the severity, occurrence, and detection factors for the security vulnerability. As with RPN, these component values do not take into account the safety issues. The threat priority number is defined as

TPN = RPN + VPN    (3.2)
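As a worked illustration (all factor values hypothetical, on the usual 1–10 scales), a failure mode rated Sf = 7, Of = 3, Df = 4 has RPN = 7 × 3 × 4 = 84; if the associated vulnerability is rated Sv = 8, Ov = 2, Dv = 5, then VPN = 80 and TPN = 84 + 80 = 164.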

Figure 3.2 shows a modified functional hazard analysis worksheet for threat analysis. Five additional columns are added:
• The security vulnerability
• Tools and methods used for the attack
• Access mode for the attack
• Results of the attack
• Relationship of the attack to safety

Fig. 3.2  A functional hazard analysis worksheet modified for threat analysis. Columns include threat ID; life cycle phase; activity; state/mode; function; threat description; system item(s); causal factor description; mishap; existing mitigations; software control category; software criticality index; causal factor risk level; initial TPN; vulnerability; attack tools; access mode; results of attack; relationship to safety; recommended mitigations; target TPN; follow-on actions; and comments

Figure 3.3 shows a failure mode effects analysis worksheet modified for threat analysis. As with the FHA worksheet, five additional columns are added:
• The security vulnerability
• Tools and methods used for the attack
• Access mode for the attack
• Results of the attack
• Relationship of the attack to safety

We can update the Hutchins et al. [HutXX] intrusion kill chain and the Assante and Lee ICS kill chain [Ass15] to take into account both safety and security. This modified cyber-physical kill chain includes seven phases:
• Reconnaissance identifies both physical and cyber targets. Attack development can take into account safety risks that could be exploited.
• Weaponization may in some cases make use of physical properties of the system to deliver an attack.

Fig. 3.3  A failure mode effects analysis worksheet modified for threat analysis. Columns include function; potential failure mode; potential effect(s) of failure; vulnerability; attack tools; access mode; results of attack; relationship to safety; the severity (Sf, Sv), occurrence (Of, Ov), and detection (Df, Dv) factors; TPN (threat priority, RPN + VPN); CRIT (criticality assessment); potential cause(s) of failure; current process controls; recommended action(s); responsible party and target completion date; action taken; and the final severity, occurrence, detection, TPN, and criticality values



• Delivery may or may not depend on physical access. In some cases, delivery may involve interfering with physical objects. An attempted theft of a BMW automobile proceeded over two days [Roo18]. On the first night, the attackers broke the car window to set off the alarm and cut a wire for an air pressure sensor that allowed broken windows to be detected. The thieves apparently planned to return on the second night to steal the car while the alarm was inoperative.
• Exploitation may make use of a combination of cyber and physical methods.
• Installation may provide a persistent presence or a presence over a limited time span. Installation may also include methods, such as replay, to hide the attack.
• Command and control may allow the attacker to remotely assess physical damage and update the direction of the attack.
• Actions related to safety and security include the security actions of detect, deny, disrupt, degrade, deceive, and destroy as well as the safety actions of detect and mitigate.

3.5  Characteristics of Vulnerabilities and Threats

Vulnerabilities may be indirect causes of threats. For example, the inability to update a unit's firmware is not itself a threat. On the one hand, the inability to update firmware may prolong the existence of flaws that can be exploited by attackers. On the other hand, an inability to update may protect the system from the introduction of new vulnerabilities in updated code. Vulnerabilities should be evaluated in the context of as many use cases as can be identified.

Safety engineering uses reports to identify causes of accidents; those causes can then be applied to other systems to identify and mitigate risks. Several US government databases provide information related to safety:
• The Department of Energy maintains several databases, including the Safety Basis Information System (SBIS). The department operates a process to identify suspect/counterfeit and defective items.
• The Federal Aviation Administration maintains an accident and incident database (https://www.faa.gov/data_research/accident_incident/).
• The National Transportation Safety Board (NTSB) provides a database of accident reports covering various transportation modes (https://www.ntsb.gov/investigations/AccidentReports/Pages/AccidentReports.aspx) and a database specific to aviation accidents (https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx).

A report on a collision between a freight train and a motorcoach at a railroad crossing [NTS18] provides an example of an NTSB report:
• An executive summary provides a short description.
• A factual information section describes the crash narrative, injuries, emergency response, motorcoach, highway and grade crossing, railroad operations, motor carrier operations, motorcoach driver, weather, and roadway conditions.



• An analysis section considers the motorcoach driver and train crew, the grade crossing, and emergency egress and extrication.
• A conclusions section describes findings and probable cause.
• A recommendations section provides new recommendations as well as recommendations reiterated and reclassified in the report.
• An appendix describes the investigators and parties to the investigation.
• A list of references is provided, as is a list of figures and tables as well as acronyms and abbreviations.

The 9/11 Commission Report [Nat04] on the September 11, 2001, terrorist attacks makes use of several techniques of safety reports in its analysis of a deliberate attack. The report describes both the attack itself and events leading up to the attack. The final three chapters are titled "Foresight—and Hindsight," "What To Do? A Global Strategy," and "How To Do It? A Different Way of Organizing the Government."

Several databases characterize software vulnerability threats:
• The NIST National Vulnerability Database (NVD) (https://nvd.nist.gov/) is the US government repository of vulnerability management data.
• Common Vulnerabilities and Exposures (CVE®) (https://cve.mitre.org) is a database of publicly known vulnerabilities. CVE data is used in NVD.
• The CERT Vulnerability Notes Database (https://www.kb.cert.org/vuls/) provides a set of vulnerability notes that include technical descriptions, remediation notes, and affected vendors.
• The Common Vulnerability Scoring System (CVSS) [FIRXX] defines metrics for IT-oriented vulnerabilities.

An example from the CVE database [NVD18] describes a cache-based side-channel attack on versions of ARM Mbed:
• The current description provides a summary.
• Impact describes severity and metrics for CVSS versions 3.0 and 2.0.
• References to advisories, solutions, and tools are provided.
• The vulnerability type is identified.
• Vulnerable software and versions are listed.

NVD defines vulnerability as “A weakness in the computational logic (e.g., code) found in software and hardware components that, when exploited, results in a negative impact to confidentiality, integrity, or availability.” This database emphasizes threats from architecture, algorithms, system procedures, and human factors.

3.5.1  Improper Authorization Threats

On the one hand, access problems result in safety risks: an improperly authorized user may be allowed by the system to perform harmful actions. On the other hand, the concept of a user may not always be appropriate for certain components of a cyber-physical or IoT system. While user authorization is a meaningful concept for supervisory control, a modified version of access privilege may be more appropriate for a sensor node. A user abstraction from an IT-oriented operating system carries with it concepts such as home directories that are not useful for sensing and actuation. Access privileges to specified data ports can be defined without requiring the device to implement a full user abstraction. In fact, some IoT-oriented wireless standards do not directly support a user abstraction.

3.5.2  Authorization Domains

An authorization domain allows system components to control access to data. Such a domain may be associated with a user but is distinct from a user. Identifying the number of authorization domains, the amount of data covered by each, and the expected access privileges for each is a key architectural decision. Large domains make for easier implementation by providing fewer barriers to data access; however, that lack of barriers allows for wormholes that enable attacks. Authorization domains should take into account both safety and security properties. We will discuss authorization domains in more detail in Chap. 5.

3.5.3  Software Safety Threats

Numerical errors may be caused by several mechanisms: poor numerical properties of algorithms, hardware flaws, or externally induced faults such as cosmic rays. Numerical errors are serious sources of vulnerabilities and threats for cyber-physical and IoT systems.
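A simple example of the algorithmic mechanism is the repeated accumulation of a rounded constant, the error source behind the widely reported Patriot missile timing failure. The fragment below makes the drift directly visible:

    #include <stdio.h>

    int main(void) {
        float t = 0.0f;
        for (long i = 0; i < 1000000L; i++)
            t += 0.1f;              /* 0.1 is not exactly representable in binary */
        printf("accumulated: %f (expected 100000.0)\n", t);
        return 0;
    }

In single precision, the accumulated value drifts visibly away from 100000.0; in a control loop, such drift can translate into a timing or state estimation hazard.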

3.6  Iterative Threat Analysis Methodology

Safety-oriented design processes traditionally operate over long time scales, starting with initial risk management. Security issues are driven in part by the discovery of new vulnerabilities that must be quickly and effectively mitigated. Managing the different time scales of safety and security mitigation is necessary for the creation of an effective threat analysis and mitigation methodology for safe and secure systems. Safe and secure system design requires iterative threat analysis. Threat analysis should be revisited at several levels of abstraction during the initial design. Once the system is deployed, threat analysis should be continued both to deal with emerging threats and to deal with latent threats. Iterative analysis is sometimes applied


to certified systems, which may require periodic recertification. Unscheduled reviews may also be performed, for example, after an accident. Requirements for safety/security analysis should:
• Define a post-deployment schedule for vulnerability analysis.
• Develop a plan to handle zero-day vulnerabilities.
• Develop criteria under which these plans are revisited both during later design phases and after deployment.

Threat analysis should take into account not only known security problems but also potential security vulnerabilities. An initial threat analysis plays a role similar to threat analysis in traditional safety design. Although the system architecture has not yet been chosen, some characteristics of the computing system amount to requirements. The existence of user accounts, for example, leads to possible authentication vulnerabilities.

Threat analysis is both used and updated during the architecture phase. At the start of architecture design, the initial threat analysis is factored into architectural design decisions. After a first draft of the architecture is defined, the initial threat analysis is expanded and revised. This phase has two goals. The threat analysis products (fault trees, FHA, or FMEA tables) should be updated to conform to the characteristics of the architecture. Additional threats may become apparent given the architecture; those threats should be added to the analysis. Newly identified vulnerabilities may fall into either of two categories:
• The new vulnerability may be a variation of one that was previously considered in the design process.
• The new vulnerability may present new threat cases.

In some cases, a vulnerability may exhibit some known characteristics but also present some variations that require additional threat case analysis. Detailed design of software should make use of the first two phases of threat analysis and apply software-specific security and safety measures. Design reviews [Fag76] are widely used in code design to visually inspect software and identify flaws. The addition of safety and security issues to the design review requires some additional mechanism but no basic changes to the approach. The staff should already be trained in safe and secure software design, so they should require little additional coaching to perform safe/secure inspections for the design review.

3.7  Threat Mitigation

Later chapters describe several specific threat mitigation methods. This section surveys the application of threat mitigation both before and after deployment of the system.


3.7.1  Pre-deployment

McGraw [McG04A] identifies static analysis, risk analysis, and penetration testing as important security practices. Static analysis tools perform syntactic and limited semantic analysis of code and can identify a range of coding problems. McGraw suggests risk-based security testing of software. He argues that penetration testing should take into account software architecture, since black-box penetration testing provides only limited effectiveness. McGraw [McG04B] also noted the limitations of risk analysis as applied to software, specifically that analysis does not identify specific software flaws or mitigations. Trust zones allow designers to decompose the system architecture in order to limit the scope of attacks.

3.7.2  Post-deployment

IT-oriented security puts great emphasis on updates and fixes for security flaws. Cyber-physical and IoT systems provide much less scope for arbitrarily timed software updates due to several factors: the need for continued physical plant operation and the need for assurance against unintended side effects of updates. Not only can a safety-critical system not be shut off at an arbitrary time, but many physical plants require extensive shutdown-and-restart procedures. A large electric power generator generally requires about 8 hours to bring from a standing start to full operation. A chemical plant may take more than a day to shut down and restart. Furthermore, shutting down one section of a manufacturing plant generally requires stopping other equipment that is not directly affected by the update. One result of this characteristic is that many industrial control systems exhibit many old, easy-to-exploit vulnerabilities. The solution to this problem is not to enforce IT-oriented practices on cyber-physical and IoT systems, but rather to architect resilient systems that operate robustly in the presence of flaws, errors, and attacks.

Even if a physical plant can be shut down for some amount of time, mission-critical and safety-critical systems require more thorough validation of a software update than do typical IT systems. Because the physical plant may require unusual combinations of drivers and applications, the user cannot rely on the software vendor to perform all required tests before the update is released. In the case of certified systems, updates may be required to undergo a mandated certification process before being deployed.


3.8  Summary

While safety and security methods have developed relatively independently, we can find useful similarities between them to help us develop a unified model and approach. Vulnerabilities may come from combinations of design flaws and attacks. Iterative threat analysis methodologies help us to mitigate threats. Threat mitigation can and should be applied both pre- and post-deployment.

References

[Ass15] M.J. Assante, R.M. Lee, The Industrial Control System Cyber Kill Chain (SANS Institute, 2015)
[EIA16] Electricity Information Sharing and Analysis Center, TLP: White, Analysis of the Cyber Attack on the Ukrainian Power Grid, Defense Use Case (18 Mar 2016)
[Fag76] M.E. Fagan, Design and code inspections to reduce errors in program development. IBM Syst. J. 15(3), 219–248 (1976)
[FIRXX] FIRST.Org, Inc., Common Vulnerability Scoring System v3.0: Specification Document, undated
[HutXX] E.M. Hutchins, M.J. Cloppert, R.M. Amin, Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains, Lockheed Martin Corporation, undated, https://lockheedmartin.com/content/dam/lockheed-martin/rms/documents/cyber/LM-White-Paper-Intel-Driven-Defense.pdf
[McG04A] G. McGraw, Software security. IEEE Secur. Priv. 2(2), 80–83 (2004). https://doi.org/10.1109/MSECP.2004.1281254
[McG04B] G. McGraw, Risk analysis in software design (31 May 2004), https://www.synopsys.com/blogs/software-security/software-risk-analysis/
[Nat04] National Commission on Terrorist Attacks Upon the United States, The 9/11 Report (22 July 2004), https://www.9-11commission.gov/report/
[NTS18] National Transportation Safety Board, Highway Accident Report: Collision Between Freight Train and Charter Motorcoach at High-Profile Highway-Railroad Grade Crossing, Biloxi, Mississippi, March 7, 2017, NTSB/HAR-18/01, PB2018-101328, Notation 57568, adopted 7 Aug 2018
[NVD18] National Vulnerability Database, CVE-2018-0498 (28 July 2018), https://nvd.nist.gov/vuln/detail/CVE-2018-0498#vulnCurrentDescriptionTitle
[Roo18] M. Rooding, A Dutch first: ingenious BMW theft attempt, Ramblings of a Dutch dev (7 Aug 2018), http://fwdnxt.com

Chapter 4

Architectures

4.1  Introduction

Architectural methods can be applied to both safety and security characteristics. Modern systems are vulnerable to a range of threats, including transitive attacks. Architectural modeling can identify vulnerabilities in a design, which can then be resynthesized to address the flaw. Model-based design can be used to provide both security and safety properties of systems. Service-oriented architectures can be adapted to provide quality-of-service guarantees that are useful in control and signal processing systems.

The next section touches upon a few issues in processor security. Section 4.3 discusses model-based design and its application to safety and security. Section 4.4 describes an analysis-and-resynthesis approach to computing platform threat mitigation. Section 4.5 describes approaches to introduce quality-of-service guarantees in service-oriented architectures.

4.2  Processor Security

Processors are at the heart of embedded computing systems. CPU protection mechanisms have been studied and improved since the early days of processor design. The next section describes root-of-trust mechanisms for trusted software execution. Section 4.2.2 considers side channel attacks on processors.



4.2.1  Root-of-Trust

A root-of-trust provides a trusted execution environment that can serve as the foundation for secure execution of applications. Root-of-trust is a hardware feature. The CPU provides a tamper-resistant mechanism for storing cryptographic keys. A secure boot system validates an executable module before executing it, typically by verifying a digital signature on the software. Software that has been accepted by the root-of-trust and runs in the secure perimeter can be used to start other validated processes. ARM TrustZone [ARM09] defines normal and secure modes; CPUs, busses, DMA controllers, and cache controllers can operate in either normal or secure mode.
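The signature check at the heart of secure boot can be sketched in a few lines of Python. This is only an illustration: a real root-of-trust performs the verification in tamper-resistant hardware against a burned-in public key, whereas here a key pair is generated on the fly and the firmware bytes are placeholders:

# Sketch of the secure boot validation step: verify a digital signature
# over an executable image before handing it to the loader. The key and
# image below are generated for illustration only.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

vendor_key = Ed25519PrivateKey.generate()   # stand-in for the vendor signing key
image = b"\x7fELF...firmware bytes..."      # stand-in for the executable module
signature = vendor_key.sign(image)          # produced at build/release time

def secure_boot(image: bytes, signature: bytes, public_key) -> bool:
    # Execute the image only if its signature verifies against the
    # trusted public key; refuse to boot otherwise.
    try:
        public_key.verify(signature, image)
    except InvalidSignature:
        return False
    return True

print(secure_boot(image, signature, vendor_key.public_key()))         # True
print(secure_boot(image + b"!", signature, vendor_key.public_key()))  # False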

4.2.2  Side Channel Attacks

Side channel attacks exploit side effects of computations to gain access to information or possibly to alter data. An early example of a side channel attack is differential power analysis [Koc99]. This attack identifies bits in a cryptographic key by observing the difference in power consumption between taken and not-taken branches. The techniques and equipment used for differential power attacks have been refined and promulgated to make differential power attacks a standard, well-understood attack. Computing systems exhibit a wide range of side channels. More recent side channel attacks have concentrated on the CPU itself. Two important types of attacks were announced in 2018: Meltdown [Lip18] takes advantage of side effects of out-of-order instruction execution; Spectre [Koc19] makes use of side effects of branch prediction and speculative execution. Microarchitectural Data Sampling (MDS) attacks [Int19] also exploit side effects of speculative execution.
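Software cannot remove microarchitectural leaks such as Meltdown or MDS, but data-dependent timing is one side channel that code can avoid. The sketch below, in Python, contrasts a naive comparison, whose early exit leaks how many leading bytes of a guess are correct, with the constant-time comparison from the standard library:

# A naive byte comparison returns as soon as a byte differs, so its
# running time depends on the secret data; hmac.compare_digest takes
# time independent of where the inputs differ.
import hmac

def naive_equal(a: bytes, b: bytes) -> bool:
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False       # early exit: a timing side channel
    return True

secret = b"k3y-material"
guess = b"k3y-wrong!!!"
naive_equal(secret, guess)           # leaks timing information
hmac.compare_digest(secret, guess)   # constant-time comparison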

4.3  Model-Based Design

Model-based design methodologies provide models for systems that are more abstract than descriptions in high-level programming languages. Model-based designs are generally synthesizable into implementations. Modeling environments may also include analysis tools and links to simulators.

Karsai et al. [Kar03] developed a model-integrated methodology for embedded software design known as model-integrated computing. They advocated the use of domain-specific modeling languages (DSMLs) tailored to the characteristics of particular application domains. A metamodeling language based on UML and the Object Constraint Language (OCL) is used to describe a DSML. DSMLs can be composed to create complex languages that allow a system to be described in multiple aspects. They advocate that DSMLs be designed to allow composition, including abstraction, modularization, interfaces and ported components, aspects,


and references (linkage across levels of the part-whole hierarchy). They refer to the process of allocating components to implement the model as model synthesis; model-based generators create executable implementations.

Eby et al. [Eby07] incorporated access control schemes related to those of the Bell-LaPadula and Biba models. They added a partition construct to their DSML to capture access rights. They modified analysis and code generation to take into account the presence of encryption algorithms. Their synthesis system could take into account the effect of the addition of security features such as encryption on system schedulability.

Tariq et al. [Tar12] developed a domain-specific modeling language for irrigation networks. Irrigation system design requires selecting a topology for the irrigation network, sizing reservoirs and channels, placing level meters, designing gate protocols, and evaluating the proper operation of the network under various usage and rainfall scenarios. A DSML was developed using the Generic Modeling Environment (GME) from Vanderbilt [Led01]. The metamodeling language includes atoms for LevelMeter, Gate, and Controller and higher-level models for PrimaryCanalPool, SecondaryCanalPool, and TertiaryWaterChannel. A simulation engine used MATLAB to solve the Saint-Venant equations for the irrigation network, taking into account water surface, flow, friction, and bed slope. The simulator could determine water level vs. time at various points in the irrigation network, thus allowing the validation of safety conditions. The modeling environment was used to model a portion of Muzaffargarh Canal and determine the behavior of water in the canal over a set of scenarios.

AADL [Fei12] is a model-based design language developed by the Society of Automotive Engineers. It originally targeted avionics but has been used in other domains. A system architecture includes both software and hardware components. Software specifications are based on threads, processes, and data. Timing constraints and execution times for threads and processes may be modeled, as well as other non-functional properties such as code image size. Interactions can be specified using either component and port or flow models. Software components can be bound to hardware modules.
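The timing information captured by such models directly supports analysis. A minimal sketch in Python: given AADL-style thread periods and worst-case execution times (the values here are illustrative), a one-processor rate-monotonic schedulability test using the Liu-Layland utilization bound:

# Each thread is modeled by its period and worst-case execution time
# (WCET); the Liu-Layland bound gives a sufficient (not necessary)
# schedulability test for rate-monotonic scheduling on one processor.
threads = [                     # (name, period_ms, wcet_ms), illustrative
    ("sensor_poll", 10.0, 2.0),
    ("control_law", 20.0, 5.0),
    ("telemetry",  100.0, 10.0),
]

utilization = sum(wcet / period for _, period, wcet in threads)
n = len(threads)
bound = n * (2 ** (1 / n) - 1)  # ~0.78 for three tasks

print(f"U = {utilization:.3f}, bound = {bound:.3f}")
if utilization <= bound:
    print("schedulable under rate-monotonic scheduling")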

4.4  Architectural Threat Modeling

Coding errors receive a great deal of focus in threat modeling and mitigation. However, sometimes the system architecture itself (its software and hardware structure) introduces locations for potential attacks. The structure of the system can also be used to identify and implement countermeasures. While we often view networked systems as fully connected graphs, the physical layer has its own structure that may force an attack to take certain paths. Those paths become locations for sensors and countermeasures.

Privilege escalation attacks are one form of attack enabled by the software structure. In this style of attack, an attack task without privileges to access a particular resource can instead ask another task that does have the required privileges to


Table 4.1  Example of system-level vulnerabilities and relevant attack goals

  System-level vulnerability                                Attack goals
  Improper data read                                        Privacy
  Improper data write                                       System malfunction
  Man-in-the-middle                                         Privacy, system malfunction
  Timing vulnerability: missed deadline                     System malfunction
  Timing vulnerability: network-on-chip quality-of-service  System malfunction

perform an operation for which the attacker does not have privileges. Privilege escalation attacks have been demonstrated on Android [Dav11]. This attack takes advantage of the lack of transitive checks on privilege; it poses significant risks and is hard to evaluate manually. Checks are further complicated in networked systems and heterogeneous operating systems.

Quality-of-service (QoS) attacks compromise system timing properties. Such an attack can be accomplished by causing subtle changes to system timing, exploiting resource utilization bottlenecks caused by the allocation of computation to resources. Functional allocation is not often treated in traditional security threat analysis.

Systems must be hardened against attacks both at the start of design and after system deployment. New specific vulnerabilities frequently emerge; entire new categories of attacks may be discovered. Safety-critical systems need to be updated to handle these threats but without destroying the system's non-functional properties: timing, power consumption, and thermal behavior.

Table 4.1 provides some examples of system-level vulnerabilities for multiprocessor systems-on-chip (MPSoCs) and the associated attack goals. Some of these vulnerabilities are functional, while others are timing-related. A functional vulnerability attacks a particular transaction, while a timing vulnerability attacks the initiation or repetition conditions of an aperiodic or periodic action. Mitigation may combine design-time and run-time techniques. Functional vulnerabilities can be mitigated during design by limiting the ability of tasks to perform certain operations at certain locations. These vulnerabilities can also be addressed by run-time checks at the network adapter. Timing vulnerabilities can be countered at design time by limiting capabilities or at run time by monitoring network activity.

4.4.1  Attack Model

We use the term attack model for a combination of a threat model and a model of the design under test (DUT). The architectural information for the DUT helps algorithms determine the feasibility of an attack and the potential damage it can cause. Architectural information also helps us identify mitigation points both at design and run time.


Figure 4.1 shows a multilayer attack model for system-level design algorithms. The model's components are described as graphs, sets, or functions. The hardware platform level of the model includes:
• An architecture graph [Wol96] {N, E}. Each node Ni represents a processing element; each edge Ei represents a logical link between two nodes. The available bandwidth on an edge is an attribute.
• A storage set S = {s1, ⋯, sp} of storage locations of interest. Storage locations may be distributed throughout the architecture; the location of each storage element is an attribute of that storage element. A storage element may represent a sequence of memory locations if the locations all share the same characteristics. Storage locations may include memory locations, CPU registers, CPU memory management systems, security zones, network adapter registers and buffers, and I/O device registers and buffers.

The middleware and operating system layer of the model describes the permissions/capabilities provided by both middleware and the operating systems on the nodes. It includes:
• An access capability function {K: τ, s → {T, F}}. This function determines whether a given task is allowed to access a given storage location. Access capabilities are enforced by the middleware.
• A set of zones L = {l1, ⋯, lq} that defines mutually accessible sets of code and storage. A set member li = {τa, ⋯, τz, sb, ⋯, sy} describes a set of tasks and storage locations. A zone may be trusted (derived from a root-of-trust) or untrusted. Zone boundaries are enforced by the operating system and hardware units such as the MMU.

Fig. 4.1  System model for attack analysis


The application layer of the model includes:
• A task set Τ = {τ1, ⋯, τt} of tasks executing on the platform. A task is defined by its processing element assignment and timing characteristics. A periodic task requires a task period and initiation interval. A sporadic task requires a maximum initiation interval.
• A transaction set M = {m1, ⋯, mm} of transactions performed by the tasks. Each transaction is characterized by its payload size, source and destination processing elements, source and destination storage location(s), sporadic/periodic behavior, and timing parameters.

We use the term job for a related set of tasks and transactions. The architecture graph is concerned with logical connections, which may include processing elements or network elements such as switches or network adapters. Privileges can be checked at several levels. An MMU can check memory accesses using addresses and access maps. A network-on-chip adapter can check access privileges using addresses and address maps. The OS may check zones, while middleware can check access capabilities. Tasks can perform their own application-dependent checks. Applying several checks at multiple levels of abstraction can enhance the security of a transaction. The structural information provided by this model allows certain types of analysis-and-resynthesis to be applied. In contrast, many security threat models, such as Howard's model [How97] and STRIDE [Her06], are operational.

A computation contains a set of initiating tasks I, a set of target tasks X, and a set of transactions T. A basic computation consists of a single transaction:

b = ⟨i, x, t⟩.  (4.1)

The full set of computations executed on the platform is B. A computing system can execute both valid computations {V1, ⋯, Vn} that are allowed by the design and attack computations {A1, ⋯, Ap} from attackers. A valid computation V must satisfy several functional security properties:
• The initiating task must have access capability for all required storage locations.
• Both the initiating and target tasks must have compatible access capabilities.

It must also satisfy timing properties:
• Every instance of a transaction must be completed by its deadline.
• The set of transactions that occupy a link cannot exceed the link's available bandwidth.

Two kinds of attacks are possible. A compromised computation makes use of a modified valid computation. Taking advantage of a code vulnerability, for example, allows the attacker to transform a valid computation into one with the required capabilities to perform the attack. In contrast, a crafted attack is an additional computation inserted into the system by the attacker.
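The model and the validity checks can be sketched compactly in Python. All task names, storage names, and numeric values below are illustrative; the functional checks follow the properties listed above, and a single link-utilization test stands in for the timing properties:

# Sketch: the capability function K and the checks on a basic
# computation b = <i, x, t>. Identifiers and values are illustrative.
from dataclasses import dataclass

CAPABILITIES = {                      # middleware/OS layer: K(task, storage)
    ("tau_ctrl", "s_setpoint"): True,
    ("tau_act",  "s_setpoint"): True,
    ("tau_log",  "s_setpoint"): False,
}

def K(task: str, storage: str) -> bool:
    return CAPABILITIES.get((task, storage), False)

@dataclass
class Transaction:                    # application layer: one transaction
    src_task: str
    dst_task: str
    storage: str
    bandwidth: float                  # link bandwidth the transaction needs

def is_valid(b: Transaction, link_capacity: float, link_load: float) -> bool:
    if not K(b.src_task, b.storage):              # initiator capability
        return False
    if not K(b.dst_task, b.storage):              # compatible target capability
        return False
    return link_load + b.bandwidth <= link_capacity   # bandwidth bound

t = Transaction("tau_ctrl", "tau_act", "s_setpoint", bandwidth=1.0)
print(is_valid(t, link_capacity=10.0, link_load=8.0))                             # True
print(is_valid(Transaction("tau_log", "tau_act", "s_setpoint", 1.0), 10.0, 8.0))  # False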


4.4.2  Example Attacks and Mitigations

As mentioned above, privilege escalation attacks leverage transitive properties of access models to allow an application to indirectly exceed its authorities. Davi et al. [Dav11] demonstrated transitive permission usage attacks on Android by staging such an attack using a set of standard Android applications, including the Android Scripting Environment (ASE). During the attack, one task trusted for the target data accesses the data; it then sends that data to another task without authorization from the original source of the data. Applying our model to this type of attack, the attack includes a valid user task τv and two attack tasks τa, τb. The location to be attacked is Sv. The two attack tasks have different capabilities, Ka, Kb. A sequence of transactions allows attacker a to ask b to access the location under attack. First M1(τa, τb) makes the request, and then M2(τb, τa) returns the result to the attacker.

One example of an architecture used in real-time embedded systems is the NXP S32S family of automotive safety processors [NXP18]. This system-on-chip includes four pairs of CPUs running in lockstep, FlexRay, and a pulse width modulation unit, as well as other features. The AUTOSAR software architecture [Urb17] is widely used for automotive systems. It supports multi-core operating systems and middleware for application management.

Figure 4.2 shows locations for possible privilege escalation attacks on this target architecture. Attacker Ta does not have direct privileges to send attack data to the target. The attacker sends the data to Tb, which does have the required privileges to transfer it to the target. Failure to transitively enforce access privileges is more

Fig. 4.2  Modes of inferred threat and potential mitigations


likely in multiprocessor and multi-OS architectures, particularly heterogeneous systems built using software components from multiple vendors.

Some types of mitigations that might be possible in a custom silicon design are not possible on this target. For example, the network adapters cannot be modified to check for transitive attacks. In addition, the FlexRay adapter cannot be modified to perform checks. Modifying the AUTOSAR-compliant operating system would be difficult due to the need to ensure compliance with the AUTOSAR standard. The CPUs do not support the ARM TrustZone root-of-trust. As a result, trusted zones cannot be implemented. However, the system can perform transitive privilege checks in the middleware in all three cases by inserting another level of middleware. In the case of the SRAM attack, the OS can be notified to limit access to certain locations, with the checks performed by the MMU. Transitive checks can be performed either in software or in hardware by storing recent transactions in a memory. The size of the memory is a design choice with cost implications. While the example of Davi et al. included two edges in the transitive attack graph, checkers can be designed to check longer paths up to some limit. In the case of a networked control system, the network connects multiple MPSoCs, typically of heterogeneous architecture and origin. In this case, the node that serves as the launchpad for the attack may have very different capabilities from the node under attack.

QoS attacks rely on interactions at the platform, OS, and middleware levels. QoS attacks must be detected by analyzing computation graphs at all three levels of the attack model. An attack can be mounted by either a single task or multiple tasks. An attack that uses multiple tasks on different processing elements may be more difficult to defeat. The attack tasks τa, τb, ⋯ each generate their own transactions Ma, Mb, ⋯ that are designed to disrupt timing. A QoS attack threat exists when two computation graphs mapped to the same resource exceed specified bounds for the utilization of that resource.
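A software form of the transitive check described above can be sketched by tainting data with the storage location it came from and propagating the taint as data is forwarded; the chain M1(τa, τb), M2(τb, τa) is then rejected because τa lacks the capability for the attacked location. All names here are illustrative:

# Sketch: a middleware-level transitive privilege check. Data read from
# a storage location carries a taint; forwarding tainted data to a task
# that lacks the direct capability for that location is refused, which
# covers chains of arbitrary length, not just two edges.
CAPABILITY = {("tau_b", "s_v"): True}   # tau_a deliberately has no access

class TransitiveChecker:
    def __init__(self):
        self.taint = {}                 # task -> storage its data came from

    def on_read(self, task: str, storage: str) -> bool:
        if not CAPABILITY.get((task, storage), False):
            return False
        self.taint[task] = storage
        return True

    def on_send(self, src: str, dst: str) -> bool:
        storage = self.taint.get(src)
        if storage is not None:
            if not CAPABILITY.get((dst, storage), False):
                return False            # transitive escalation blocked
            self.taint[dst] = storage   # propagate along longer chains
        return True

checker = TransitiveChecker()
checker.on_read("tau_b", "s_v")          # tau_b may read the location
print(checker.on_send("tau_b", "tau_a")) # False: tau_a lacks capability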

4.5  Service-Oriented Architectures

Tariq et al. [Tar16, Tar18] developed a service-oriented architecture for always-online cyber-physical systems. The service-oriented approach provides an abstraction for the streaming behavior of cyber-physical systems. The modularity of the abstraction allows services to be added, deleted, or modified while the system is operational, without affecting the operation of other deployed services.

A service is a design abstraction commonly used in Internet systems. A service is defined by an interface; the service may be provided by any of several possible service providers. The service provider is responsible for allocating the resources required to provide the service. A requestor may be matched with potential providers through a service broker.


The traditional definition of a service is transaction-oriented, with limited duration of the service and a well-defined result based on the requestor input. In contrast, cyber-physical and IoT systems operate on signals that have indefinite duration. The system must provide quality-of-service (QoS) guarantees to ensure that the maximum latency and jitter required by the requestor are honored as the signal flows through the platform. A QoS-aware service-oriented architecture provides the separation between service definition and service implementation, including the QoS characteristics required for CPS and IoT. A service description language captures the characteristics of the service. The service interface describes the messages at the service interface, while the service computation describes the computation to be performed. The service description includes:
• The messages at the service interface and the sensing/actuation commands
• The computation performed by the service
• Modes of the service, such as fault handling

The service description is captured formally as a Giotto model [Hen03]. A set of service modes describes the behavior of the system if the platform cannot meet all QoS requirements; in the case of wide-area systems, the platform may not have full control of all communication and computation resources. The implementation of the service is described using the E Machine target developed for Giotto [Hen07]. The mapping of the service to a computational platform must meet two types of requirements: it must correctly implement the service description, and it must not interfere with other services executing on the platform. The correctness of the mapping is proven using the temporal logic of Manna and Pnueli [Man92].

Figure 4.3 gives a flowchart for the design methodology. The cyber-physical system is specified using a traditional approach such as model-based design; the design is tuned to satisfy the application requirements using simulation and analysis. The design is then decomposed into a set of services using a service description language such as Giotto. A service compiler translates the service descriptions into an intermediate specification such as the E machine. That intermediate specification is then synthesized into software and possibly hardware implementations. Since the implementation may be composed of components of several different types, the final implementation phase may use multiple design synthesizers and compilers.

Tariq et al. demonstrated this methodology with smart grid applications. First, a demand response application was compiled onto a target networked smart grid architecture; simulations were used to show that the implementation met its functional and QoS requirements. Next, a distributed power agreement algorithm was compiled onto the same smart grid. Proofs showed both that the second application was properly implemented relative to its own requirements and that it did not interfere with the operation of the first application.
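The mode construct at the center of such a service description can be pictured with a small sketch: a mode pairs a period with a set of task invocations and a guarded switch to another mode, such as a fault-handling fallback. This is a simplified illustration in Python, not the authors' service description language or the full Giotto semantics:

# Simplified sketch of Giotto-style modes: each mode has a period, a
# set of periodic task invocations, and a guarded switch to a target
# mode. The "degraded" mode stands in for a fault-handling service mode.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Mode:
    name: str
    period_ms: int
    tasks: List[Callable[[], None]]              # invoked once per period
    switch: Callable[[], bool] = lambda: False   # mode-switch guard
    target: str = ""

def run(modes: dict, start: str, rounds: int) -> None:
    mode = modes[start]
    for _ in range(rounds):
        for task in mode.tasks:                  # logical execution time model
            task()
        if mode.switch():
            mode = modes[mode.target]            # guarded mode switch

qos_violated = False
normal = Mode("normal", 20, [lambda: print("control step")],
              switch=lambda: qos_violated, target="degraded")
degraded = Mode("degraded", 100, [lambda: print("safe fallback step")])
run({"normal": normal, "degraded": degraded}, "normal", rounds=3)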


Fig. 4.3  Design methodology for QoS-aware service-oriented architectures

4.6  Summary

Root-of-trust is a hardware mechanism for trusted execution of software. Model-based design provides a specify-analyze-synthesize approach to cyber-physical and IoT system design that is well-suited to the inclusion of safety and security methods. Security threats can be modeled at the architecture level and the design resynthesized to remediate attacks; analysis can be used to identify transitive attacks. Service-oriented architecture methodologies can be modified to capture the quality-of-service characteristics required for control and signal processing.

References

[ARM09] ARM Limited, ARM Security Technology: Building a Secure System Using TrustZone Technology, 2009. Available at www.arm.com
[Dav11] L. Davi, A. Dmitrienko, A.-R. Sadeghi, M. Winandy, Privilege escalation attacks on Android, in ISC 2010, ed. by M. Burmester, G. Tsudik, S. Magliveras, I. Ilić, LNCS 6531 (Springer, 2011), pp. 346–360
[Eby07] M. Eby, J. Werner, G. Karsai, A. Ledeczi, Integrating security modeling into embedded system design, in 14th Annual IEEE International Conference and Workshops on the Engineering of Computer-Based Systems (ECBS'07), Tucson, 2007, pp. 221–228. https://doi.org/10.1109/ECBS.2007.45
[Fei12] P.H. Feiler, D.P. Gluch, Model-Based Engineering with AADL: An Introduction to the SAE Architecture Analysis & Design Language, SEI Series in Software Engineering (Addison-Wesley, Upper Saddle River, 2012)
[Hen03] T.A. Henzinger, B. Horowitz, C.M. Kirsch, Giotto: a time-triggered language for embedded programming. Proc. IEEE 91(1), 84–99 (2003)
[Hen07] T.A. Henzinger, C.M. Kirsch, The embedded machine: predictable, portable real-time code. ACM Trans. Program. Lang. Syst. 29(6), 33 (2007)
[Her06] S. Hernan, S. Lambert, T. Ostwald, A. Shostack, Uncover security design flaws using the STRIDE approach, MSDN Magazine, November 2006, http://msdn.microsoft.com/msdnmag/issues/06/11/ThreatModeling/default.aspx, retrieved from the Wayback Machine, 17 Aug 2018
[How97] J.D. Howard, Analysis of Security Incidents on the Internet 1989–1995, Ph.D. dissertation, Carnegie Mellon University, 7 Apr 1997
[Int19] Intel, Side channel vulnerability microarchitectural data sampling, https://www.intel.com/content/www/us/en/architecture-and-technology/mds.html?wapkw=speculative+attack. Accessed 17 May 2019
[Kar03] G. Karsai, J. Sztipanovits, A. Ledeczi, T. Bapty, Model-integrated development of embedded software. Proc. IEEE 91(1), 145–164 (2003). https://doi.org/10.1109/JPROC.2002.805824
[Koc19] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, Y. Yarom, Spectre attacks: exploiting speculative execution, in 40th IEEE Symposium on Security and Privacy (S&P'19), 2019
[Koc99] P. Kocher, J. Jaffe, B. Jun, Differential power analysis, in Proceedings, 19th International Advances in Cryptography Conference: CRYPTO '99, 1999, pp. 388–397
[Led01] A. Ledeczi, M. Maroti, A. Bakay, G. Karsai, J. Garrett, C. Thomason, G. Nordstrom, J. Sprinkle, P. Volgyesi, The generic modeling environment, in Workshop on Intelligent Signal Processing, Budapest, Hungary, vol. 17, Citeseer, 2001
[Lip18] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, M. Hamburg, Meltdown: reading kernel memory from user space, in 27th USENIX Security Symposium (USENIX Security 18), 2018
[Man92] Z. Manna, A. Pnueli, The Temporal Logic of Reactive and Concurrent Systems: Specification (Springer, New York, 1992)
[NXP18] NXP Semiconductor, S32S Safety Microcontrollers and Microprocessors, S32S_FS Rev2 (2018)
[Tar12] M.U. Tariq, H.A. Nasir, A. Muhammad, M. Wolf, Model-driven performance analysis of large scale irrigation networks, in 2012 IEEE/ACM Third International Conference on Cyber-Physical Systems, Beijing, 2012, pp. 151–160. https://doi.org/10.1109/ICCPS.2012.23
[Tar16] M.U. Tariq, Service-Oriented Reference Model for Cyber-Physical Systems, Ph.D. dissertation, Georgia Institute of Technology, Apr 2016
[Tar18] M.U. Tariq, J. Florence, M. Wolf, Improving the safety and security of wide-area cyber-physical systems through a resource-aware, service-oriented development methodology. Proc. IEEE 106(1), 144–159 (2018). https://doi.org/10.1109/JPROC.2017.2744645
[Urb17] M. Urbina, R. Obermaisser, Efficient multi-core AUTOSAR-platform based on an input/output gateway core, in 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), St. Petersburg, 2017, pp. 157–166. https://doi.org/10.1109/PDP.2017.85
[Wol96] W. Wolf, Object-oriented co-synthesis of distributed embedded systems. ACM Trans. Des. Autom. Electron. Syst. 1(3), 301–314 (1996)

Chapter 5

Security Testing and Run-Time Monitoring

5.1  Introduction

Cyber-physical systems are core components of advanced systems and processes that influence modern life, from automobiles and home automation systems to manufacturing and health. They are fundamental components of critical infrastructures worldwide. Their safe and secure operation is critical in many of these applications, because malfunctions caused by failures or attacks can have catastrophic results.

Cyber-physical systems are designed and implemented with safety and security requirements and specifications. Safety properties that need to be met are typically specified at the application level, while security properties are defined at all levels, from application to hardware design. These properties for safety and security are an integral part of the system specification, and thus they are part of the requirements and specifications input to the design process when developing the system.

Importantly, designing a cyber-physical system with safety and security specifications is not sufficient for its expected robust operation in the field. It is often the case that specifications for security and safety are not complete against all potential attacks, since new attacks may be discovered and applied, and specifications are typically limited by concepts and assumptions based on prior experience and expectations. Furthermore, faults and errors in system operation are anticipated based on probabilistic models, and thus unanticipated failures may occur, even if with low probability. Finally, especially for cyber-physical systems in the field, the uncontrolled environment of system operation may allow for interventions in the system or in the interface between the digital and physical parts, e.g., attacks that insert false data into the system, which may lead to violations of safe operation.

Clearly, safe and secure cyber-physical systems need to be designed considering their safety and security requirements as part of their specification, but, in addition, they need to be monitored during operation for potential hazards and attacks that


have not been anticipated during system specification or are infeasible to avoid in the field, such as false data injection attacks. Corresponding to these two aspects, i.e., design for safety and security and monitoring of system operation in the field, this chapter addresses two fundamental technologies: testing and run-time monitoring.

Testing constitutes a fundamental step of the design process at several levels and is employed not only for system development but for system certification as well, which is required in several application domains of cyber-physical systems. Testing is employed to check satisfaction of several strong requirements on system operation, such as real-time constraints, safety properties, continuous operation, and interoperability. Interoperability is a challenging requirement to test: cyber-physical systems include communication subsystems that often implement standardized protocols, and standards often leave parameters undefined, leading to different implementations by different vendors. Run-time monitoring also constitutes a challenge, considering the uncertainties of the operating environment of cyber-physical systems and the potential ability of attackers to manipulate systems and data.

5.2  Security Testing

Safety systems are designed to meet specific safety and security requirements, typically expressed as properties, in addition to other functional and non-functional ones. Resulting systems must be evaluated for their conformance to requirements and the resulting specifications. There are several methods for this evaluation, including verification and validation as well as testing. Verification and validation present significant limitations, because their complexity grows exponentially with the size of the evaluated system and because, typically, the system development process includes long supply chains and results in heterogeneous component descriptions that do not enable a complete end-to-end verification process. Testing is a tractable method employed in most system development processes, because it enables evaluation of several system properties, such as correctness, security, and performance. Testing is used not only by system developers as part of their development process but also by system users and certification organizations, who employ testing to evaluate conformance to standards and directives as well as to provide appropriate certifications.

Testing systems for conformance evaluation is a method used not only for certification but for identification of vulnerabilities and failures as well. Attackers typically employ testing methods to identify vulnerabilities and exploit them. This is critical for cyber-physical and IoT systems, where exploitation of vulnerabilities may violate real-time constraints, continuous operation, and safety, leading to disastrous consequences.


5.3  Fuzz Testing for Security

Remediating vulnerabilities is typically a costly operation, while vulnerability exploitation can have significant consequences, considering that several vulnerabilities in network systems and applications are disseminated publicly [Nis, Sfo, Str]. As a result, the increasing interest in reducing system vulnerabilities has led to several methods for their identification and timely remediation. Software code analysis is a common, typical approach that employs methods for static or dynamic analysis. Static analysis of source code can be performed without executing the code but has significant limitations, because it presents high false-positive rates and, furthermore, does not identify vulnerabilities triggered by specific instruction sequence execution [Che07, Vie00]. Dynamic analysis, which executes the code, is significantly more effective but has high cost because it inserts additional code for the analysis; examples include StackGuard, which extends a C compiler [Cow98], and TaintCheck, which applies taint analysis [Cla07, New04].

Testing, in general, is a well-understood and promising method for vulnerability analysis. Fuzzing (fuzz testing), in particular, is a reliable and successful method that is increasingly used for security testing. As a typical testing method, fuzzing applies input test vectors to a system under test (SUT) and analyzes the corresponding outputs, as shown in Fig. 5.1.

Fig. 5.1  Fuzz testing configuration

The tester (fuzzer) identifies vulnerabilities by detecting faulty outputs to the applied vectors; faulty outputs range from wrong outputs (responses) to system crashes. An important advantage of fuzzing is that it can be applied to systems independently of knowledge of the system's internal structure and implementation technology. Clearly, knowledge of such information can influence the strategy for input vector generation, leading to improved results. Due to this, fuzzing falls into three categories, depending on the available information about the SUT [Tak08, Sut07]. White-box fuzzing is performed on systems where the specification or the source code is available, while black-box fuzzing is performed in cases where no internal information is available. Gray-box fuzzing is the category where partial system internal information is available.

Conventional white-box fuzzers employ symbolic execution or taint analysis to identify vulnerabilities and have been quite successful in testing Windows and Linux applications [God12, Cad06]. Symbolic execution inserts symbolic values in program flow to analyze execution sequences and has been exploited widely



[God05, God12, Cad06, Cad08, Avg11, Hua12]. Taint analysis enables the identification of possible attack points through tracing tainted values; then, these points are exercised through test vectors [Sch10, Gan09, Wan10]. Lately, AFL (American fuzzy lop) provided an approach that employs compile-time instrumentation and genetic algorithms to explore code execution paths and identify vulnerabilities automatically in a target program [ZarAFL]. The approach is increasingly adopted by newer tools, such as AFLFast [Boh16], VUzzer [Raw17], Skyfire [Wan17], and Steelix [Li17].

Black-box fuzzing is constrained by the lack of information about system structure and typically exploits the basic information known about the system, such as its operational and interface specification. Thus, black-box testing is quite popular for systems that implement standards, such as network and communication systems. As network interfaces are the first point of entry into systems, fuzzing network protocols is a basic first step to fuzzing systems overall. For fuzzing network protocols, the two main approaches to generating input vectors are data generation and data mutation [Nal12, Tak08, Sut07]. In general, data generation creates input packets as test vectors, considering the specification of the standardized protocol(s) implemented in the system. Values in the fields can be selected randomly or through a strategy that takes into account the special values that are defined in the standards. Alternatively, mutation testing takes legal packets and mutates (changes) specific data in them, such as specific fields. Although mutation requires the availability of legal packet traffic, it has strong advantages over data generation because it does not need to implement packet generation (it uses existing legal packets) and needs limited knowledge of the standard protocol specification. When limited information is available about the protocol specification, some mutation fuzzers construct protocol descriptions based on real traffic [Vda14, Gor10]. Mutations can be chosen with various strategies, including knowledge of available attack traffic [Ant12, Tsa12].

In both the data generation and data mutation fuzzing approaches, an important step is choosing effective values for packet fields. This is implemented through various strategies, such as random [Mil90, Mil95], block-based, which considers blocks of values identified in the protocol specification [Ait02, Ban06, Ami14, Pea14], grammar-based, which exploits grammars to describe legal inputs [PRO], and heuristic-based, which takes into account the effectiveness of previously applied tests [Spa07, Zha11].

Fuzzing provides several advantages over static and dynamic analysis, because it can be applied to programs whose source code is not available, it is independent of the system's internal complexity, and it can directly associate the identified faults and errors with the user input, thus enabling easier evaluation. However, fuzzing has limitations as well. The space of input vectors is huge, and thus its vulnerability coverage can be limited, unless the fuzzer employs a systematic approach to select input vectors that may originate from realistic attacks and common errors, detecting vulnerabilities that may be discovered by attackers.
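A minimal black-box mutation fuzzer in this spirit can be sketched in a few lines of Python: start from a captured legal packet and mutate it, mixing blind bit flips with boundary values at known field offsets. The packet bytes and offsets are illustrative, and delivery and response analysis are omitted:

# Sketch: black-box data mutation. A captured legal packet is mutated
# either blindly (random bit flip) or at a known field offset using
# classic boundary values. Sending the result and analyzing responses
# are left out of this illustration.
import random

BOUNDARY = [0x00, 0x01, 0x7F, 0x80, 0xFF]

def mutate(packet: bytes, field_offsets=()) -> bytes:
    buf = bytearray(packet)
    if field_offsets and random.random() < 0.5:
        off = random.choice(field_offsets)       # targeted field mutation
        buf[off] = random.choice(BOUNDARY)
    else:
        off = random.randrange(len(buf))         # blind byte mutation
        buf[off] ^= 1 << random.randrange(8)
    return bytes(buf)

legal = bytes.fromhex("000100000006010300000002")   # a captured request
for _ in range(5):
    print(mutate(legal, field_offsets=(7,)).hex())  # fuzz one field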


5.4  Fuzzing Industrial Control Network Systems

Security and safety are important properties in industrial control and critical infrastructures, where a wide range of industrial networks is employed. Fuzzing industrial network systems is increasingly popular, due to its effectiveness. Several tools have emerged or been extended to enable fuzzing of such networks, including Sulley [Dev07] for ICCP, Modbus, and DNP3, Profuzz [Koc] for Profinet, and the Achilles test platform [Ach17] for SCADA protocols like Modbus/TCP and DNP3. Research work includes black-box mutation fuzzing for SCADA networks without any knowledge about the networking protocol [Sha11] and mutation fuzzing for OPC SCADA [Wan13, Qi14], providing promising results in identifying known vulnerabilities from the National Vulnerability Database (NVD) [Nis].

Modbus is one of the most popular protocols for industrial networks and has become a standard published by Modbus IDA [Mod, ModS]. Fuzzing Modbus systems is an important active area of research and development [Byr06, Dev07], including frameworks for fuzzing Modbus for security [Kob07]. The Modbus specification defines several protocol stacks that enable direct communication over serial links and communication over TCP connections, as shown in Fig. 5.2. Based on the OSI reference model, Modbus is an application layer protocol defined to interface directly to physical (serial) and DLC (HDLC) protocols or to TCP through an adjusting sublayer, as shown in stack (c) of Fig. 5.2. The protocol implements a client/server (master/slave or request/response) communication model between a control center and field devices; e.g., a SCADA master unit (operating as client) may request a sensor value from a slave PLC (operating as server). Modbus packets are simple, with two fields (function code (FC) and data), where the function code defines the operation and the data field carries the associated operation data. A request

Fig. 5.2  Modbus protocol stacks


Fig. 5.3  Encapsulated Modbus application packets

defines the required operation and the related parameters, while a response includes the returned data in the data field and the operation to which it responds in the function code field; provision has been made to indicate with a corresponding exception code that a requested operation did not execute successfully. Importantly, the standard specifies public and reserved function codes and allows for user-defined ones. Due to the client/server communication model adopted by the standard, Modbus considers that devices store data in tables with a total of 64K entries and specifies four types of tables, two with 1-bit entries (one read-only and one read/write) and two with 16-bit entries (one read-only and one read/write). Modbus accommodates file access as well, where files are sequences of up to 10,000 records.

Modbus application packets (protocol data units or PDUs) over TCP are first extended with the MBAP (Modbus Application Protocol) header, as shown in Fig. 5.3(b), and then encapsulated by TCP and the lower layer protocols. Alternatively, when TCP is not employed, Modbus PDUs are encapsulated by the DLC and physical protocols, as defined in the protocol stacks (Fig. 5.2). In regard to secure communications, Modbus does not include any security mechanisms for authentication, confidentiality, or integrity. This results in high vulnerability to attacks, ranging from data leakage from captured packets to operation disruption through alteration of function codes. Importantly, such attacks can result in non-credible analysis and auditing.
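The MBAP encapsulation can be made concrete with a few lines of Python; the field layout follows the public Modbus specification [Mod, ModS], and the request shown, a hypothetical Read Holding Registers request, is for illustration:

# Sketch: framing a Modbus/TCP request. The MBAP header carries a
# transaction id, a protocol id (0 for Modbus), the byte count of the
# remaining fields, and a unit id; the PDU is a function code plus data.
import struct

def modbus_tcp_request(txn_id: int, unit: int, fc: int, data: bytes) -> bytes:
    pdu = struct.pack(">B", fc) + data
    mbap = struct.pack(">HHHB", txn_id, 0, len(pdu) + 1, unit)
    return mbap + pdu

# Read Holding Registers (function code 0x03): start address 0, quantity 2.
req = modbus_tcp_request(1, unit=1, fc=0x03, data=struct.pack(">HH", 0, 2))
print(req.hex())   # 000100000006010300000002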

5.5  A Modbus TCP Fuzzer

We describe MTF-Storm as an example of a Modbus fuzzer [Kat18]. MTF-Storm, an evolution of MTF [Voy15], is a fuzzer for Modbus/TCP connected systems. It is an automated tool that provides good coverage of input vectors and high fault coverage, while operating over a network, enabling remote fuzzing without physical access to the tested system. MTF-Storm conducts its testing in three phases, similarly to MTF: reconnaissance, fuzzing, and fault detection.


Reconnaissance, the first phase, identifies the memory organization of the tested system, based on the Modbus memory organization model with tables, as described above; in addition, it identifies the supported function codes of the Modbus/TCP packets. The results of reconnaissance enable MTF-Storm to proceed to the second phase, fuzzing. In this phase, MTF-Storm synthesizes its fuzzing test vector sequence systematically and effectively, with test vectors for single packet fields followed by test vectors for multiple combined fields; efficient sequences are synthesized using automated combinatorial software testing techniques and partitioning of field value ranges to reduce the test vector space. The results of applying the fuzzing sequence of input test vectors are collected and analyzed for fault and failure detection and identification of security problems in the third phase.

Reconnaissance is a necessary operation in automated black-box or gray-box fuzzers, because it detects the range of operations of the tested system. For Modbus, one needs to know the function codes and the memory model (data tables) employed by the system. MTF-Storm, similarly to its predecessor MTF, explores the function codes used by a system with various methods. One method is to request identification information, as defined by the standard, which may lead to off-line identification of supported function codes. Another method is to send valid requests and identify whether they are executed, as specified in the standard. Finally, it can monitor device traffic, deducing functional information. The memory model is derived by identifying the range of memory addresses for each data table. Memory bounds are identified either by observing traffic or by probing specific addresses with appropriate function codes.

When reconnaissance completes, the information about the function codes and the memory model is exploited to synthesize appropriate sequences of packets. Sequences are composed for each valid function code of the system and implement a potential attack on the system, such as packet field manipulation, packet removal, and packet injection. Packet field manipulations are mutations with field values that are selected systematically, among boundary values, random values, or illegal ones. Specifically, MTF-Storm supports fuzzing for every function code that is described in the standard, for TCP as well as serial communication. MTF-Storm is based on the FSM state diagrams for the processing of all function codes and employs generation-based testing, similarly to MTF, creating three suites of fuzzing sequences for slave devices, which test (i) values of the MBAP header fields; (ii) violations of the protocol format, such as inconsistent or out-of-range field values; and (iii) values of the PDU fields. The test sets enable fast testing within reasonable time bounds and can be applied independently or in sequence.

MTF-Storm records all responses (or their absence) to all test vectors that are applied to the system under test; for each response, a corresponding error is recorded if the response is invalid, incorrect (e.g., with incorrect values), delayed, or incomplete. The list of errors and the related packets leads to identification of security vulnerabilities or faults. MTF-Storm constitutes a representative modern automated fuzzer, which starts by probing a system to identify its operational range, proceeds to meaningful and targeted tests, and finally produces a list of vulnerabilities. This has been


demonstrated by evaluating several Modbus/TCP implementations and identifying safety, security, and dependability issues with all of them, ranging from out-of-spec responses to successful denial-of-service attacks and crashes [Kat18]. Specifically, MTF-Storm was used to evaluate nine Modbus/TCP implementations (open source or commercial) with attacks that include packet dropping, packet injection, illegal field values, altered function codes, and even flooding, leading to denial-of-service attacks. Thanks to its combinatorial test synthesis and partitioning of field value ranges, MTF-Storm succeeds in identifying failures much more efficiently than alternative tools, i.e., with a significantly smaller number of packets.
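The field-range partitioning idea can be sketched simply: rather than sweeping all 65,536 values of a 16-bit field, draw the boundary values plus a few random representatives from each sub-range. This Python illustration does not reproduce MTF-Storm's actual strategy:

# Sketch: partitioning a field's value range to shrink the test space.
# Boundary values are always included; each sub-range contributes a
# few random representatives.
import random

def partition_values(lo: int, hi: int, parts: int = 4, per_part: int = 2):
    values = {lo, lo + 1, hi - 1, hi}            # boundary values
    step = (hi - lo + 1) // parts
    for p in range(parts):
        a = lo + p * step
        b = min(hi, a + step - 1)
        values.update(random.randint(a, b) for _ in range(per_part))
    return sorted(values)

# Candidate test values for a 16-bit Modbus address field.
print(partition_values(0x0000, 0xFFFF))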

5.6  Run-Time Monitoring

Run-time monitors observe the real-time operation of a system and detect events of interest; monitors for security and safety detect security and safety violations. To detect events and violations, monitors require as a point of reference a description of system behavior that is characterized as either “good” behavior or “bad” behavior. Depending on the point of reference, violations are detected either as deviations from “good” behavior or as matches of “bad” behavior. Independently of the semantics of the behavior (good or bad), behaviors are described using two main methods, profile-based and model-based. Considering these two parameters, i.e., the method of describing behavior and the method of detecting events/violations, run-time monitors are classified into four categories.

Monitors with profile-based behavioral descriptions build profiles based on system operation, employing statistical and machine learning techniques. Profile-based monitors that detect matching of system behavior with some “bad” behavior build profiles of attacks and failures [Hod04, Val00]. In contrast, profile-based monitors that detect deviation from “good” behavior typically build their “good” behavior profiles based on statistical techniques [Kim04, Lak05]. Between these two categories of profile-based monitors, the ones that detect deviation from “good” behavior are more effective, because they detect all deviations from “good” behavior and raise alarms even for unknown attacks. Clearly, monitors that match “bad” behavior are limited to already known attacks. However, in general, profile-based monitors present high rates of false alarms, because they raise alarms not only in cases of attacks and failures but whenever legal behavior deviates from the statistically acceptable.

Model-based monitors employ models of the reference system behavior. Monitors that detect “bad” behavior exploit models of known attacks and are limited by their inability to detect unknown ones [Pax99, Roe99]; known examples include signature-based systems. Model-based monitors that detect deviation from “good” behavior [Wat07, Gol07] are the most effective, although they impose a high

5.7  The ARMET Approach

67

performance overhead due to the execution of the model. However, relatively to profile-based monitors, model-based monitors enable more precise diagnosis and more effective recovery, since the models provide diagnostic information.
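As a minimal illustration of the profile-based category that detects deviations from "good" behavior, the following sketch builds a statistical profile of a scalar observable and raises an alarm on large deviations; the feature and threshold are hypothetical, and a practical monitor would employ richer profiles and machine learning techniques.

```python
import statistics

class ProfileMonitor:
    """Profile-based run-time monitor: learns a statistical profile of
    "good" behavior and flags observations that deviate from it."""

    def __init__(self, threshold_sigmas=4.0):
        self.threshold = threshold_sigmas
        self.mean = self.stdev = None

    def train(self, good_samples):
        # Build the "good"-behavior profile from attack-free operation.
        self.mean = statistics.fmean(good_samples)
        self.stdev = statistics.stdev(good_samples)

    def check(self, observation):
        # Alarm on any large deviation from the profile, known attack or not.
        deviation = abs(observation - self.mean) / self.stdev
        return deviation > self.threshold  # True -> raise alarm

monitor = ProfileMonitor()
monitor.train([10.1, 9.8, 10.0, 10.3, 9.9, 10.2])
print(monitor.check(10.1))  # False: consistent with the profile
print(monitor.check(14.0))  # True: deviation -> possible attack or failure
```

Note that a legal but statistically unusual observation would also trigger the alarm, illustrating the false alarm problem noted above.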

5.7  The ARMET Approach

ARMET is an approach to designing safe and secure cyber-physical systems [Kha17]. It implements three basic steps: (i) designing secure-by-design systems; (ii) developing run-time monitors for the systems that detect attacks and failures; and (iii) recovering from detected attacks and failures.

Employing ARMET, the development of a cyber-physical system application starts with an executable specification that includes the safety properties for the application. Developing executable specifications and proving their properties are typical program verification processes and can be implemented with a variety of existing tools for automated or semiautomated proofs. ARMET [Kha17] exploits Fiat [Del15] and employs deductive synthesis to develop robust-by-design applications through interactive stepwise refinement of declarative specifications. Cyber and physical resources are included as first-class models, while non-functional properties, e.g., performance and security, are modeled in an integrated fashion with the functional properties. The executable code for the target system is produced automatically from the executable specification. This process leads to an application design that is provably safe and secure with respect to the desired safety and security properties.

ARMET includes a run-time monitor that observes the execution of the application. This run-time security monitor (RSM) is part of a middleware, which is shown in Fig. 5.4. The figure shows the overall structure of the ARMET middleware system, which contains six modules: (i) the run-time security monitor, (ii) the diagnosis module, (iii) the recovery module, (iv) the trust model, (v) the adaptive method selection module, and (vi) the backup module.

Fig. 5.4  The ARMET system (middleware modules: recovery, adaptive method selection, diagnosis, trust model, backup, and the run-time security monitor (RSM), operating on the application specification and the application code)

The middleware executes the executable application specification and the application code for the system and feeds the monitor (RSM), which observes the behavior of the application while it executes, comparing the observed behavior of the application code execution to the expected behavior based on its specification; the middleware executes the (executable) specification concurrently with the application execution and calculates predictions of the application's behavior. Thus, the RSM observes the behavior of the application execution and, concurrently, predicts the state of the application execution by executing its specification. In this fashion, specification execution defines the expected "good" behavior of the application and, if desired, can include known "bad" application behaviors, such as known attacks. The monitor detects deviations between predictions and observations; such deviations indicate application failures or attacks. The correctness of the approach rests on the assumption that the executable application specification is executed in a safe environment where attacks are not feasible, i.e., predictions are always correct. This is a reasonable assumption, because conventional trusted platforms that meet this requirement exist, such as Intel SGX [Cos16] and ARM TrustZone [ARM05], and make the assumption realistic.

Whenever a deviation between observations and predictions is detected, ARMET proceeds to diagnosis, identifying the failure or attack based on a trust model, and then to recovery based on the available diagnostic information. The recovery module selects an appropriate adaptive method for recovery, considering previous states stored by the backup module; if no exploitable information results from diagnosis, the system recovers by returning to a previous clean state. Importantly, the system detects all failures and attacks, even unknown ones, because all failures and attacks result in inconsistencies between observations and predictions. The ARMET run-time security monitor (RSM) successfully identifies all such inconsistencies because of its executable specification language [Kha15]. RSM is the first monitor to be formally proven sound and complete [Kha15], demonstrating that the monitor is free of false alarms (detections), an important and desirable property in practical systems.

ARMET's behavioral approach enables addressing security, safety, and dependability in a unified manner within the same framework [Ser08, Ser18, Ser18a]. Dependable systems are developed with methods that employ probabilistic models of faults and errors [Sie82], because the faults and errors are considered accidental. However, malicious attackers compromise safety and security by inserting faults; the models for these faults are fundamentally different from the accidental, probabilistic ones. A behavioral approach to security, like ARMET's, considers only attack models, e.g., computational or false data injection, detecting accidental faults and malicious attacks in the same way. Fault attribution can be performed after detection, based on the available information and the trust model in use.
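The prediction-versus-observation principle can be illustrated with a small sketch; this is a toy illustration of the RSM concept under simplifying assumptions (scalar state, synchronous lockstep execution), not ARMET's implementation.

```python
def run_monitor(spec_step, app_step, initial_state, inputs, tol=1e-6):
    """Execute the specification and the application concurrently and
    compare predicted state against observed state at every step."""
    predicted = observed = initial_state
    for k, u in enumerate(inputs):
        predicted = spec_step(predicted, u)  # prediction from the executable spec
        observed = app_step(observed, u)     # observation from the application code
        if abs(predicted - observed) > tol:
            return ("violation", k)          # failure or attack: hand off to diagnosis/recovery
    return ("ok", None)

# Toy example: the spec says the state doubles each step; the "application"
# misbehaves once the state reaches 8 (e.g., due to an attack).
spec = lambda x, u: 2 * x + u
app = lambda x, u: 2 * x + u if x < 8 else 2 * x + u + 5
print(run_monitor(spec, app, initial_state=1, inputs=[0, 0, 0, 0, 0]))
# -> ('violation', 3)
```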


References

[Ach17] Wurldtech – GE Digital, Achilles Test Platform (2017), https://www.ge.com/digital/sites/default/files/achilles_test_platform.pdf
[Ait02] D. Aitel, An introduction to SPIKE, the Fuzzer Creation Kit. Presented at the BlackHat USA conference, 2002, www.blackhat.com/presentations/bh-usa-02/bh-us-02-aitel-spike.ppt
[Ami14] P. Amini, Sulley: pure Python fully automated and unattended fuzzing framework (2014), https://github.com/OpenRCE/sulley
[Ant12] J. Antunes, N. Neves, Recycling test cases to detect security vulnerabilities, in Proceedings of the 23rd International Symposium on Software Reliability Engineering, Dallas, 27–30 Nov 2012, pp. 231–240
[ARM05] ARM Security Technology, Building a Secure System using TrustZone Technology. ARM white paper, Document PRD29-GENC-009492C (2005), http://infocenter.arm.com/help/topic/com.arm.doc.prd29-genc-009492c/PRD29-GENC-009492C_trustzone_security_whitepaper.pdf
[Avg11] T. Avgerinos, S.K. Cha, B.L.T. Hao, D. Brumley, AEG: automatic exploit generation, in Proceedings of the Network and Distributed System Security Symposium (NDSS'11), San Diego, 6–9 Feb 2011
[Ban06] G. Banks et al., SNOOZE: toward a Stateful NetwOrk prOtocol fuzZEr, in Proceedings of the 9th Information Security Conference (ISC'06), 2006, pp. 343–358
[Boh16] M. Böhme, V.-T. Pham, A. Roychoudhury, Coverage-based greybox fuzzing as Markov chain, in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS'16), Vienna, 24–28 Oct 2016, pp. 1032–1043
[Byr06] E.J. Byres, D. Hoffman, N. Kube, On shaky ground – a study of security vulnerabilities in control protocols, in Proceedings of the 5th International Topical Meeting on Nuclear Plant Instrumentation, Controls, and Human Machine Interface Technology, American Nuclear Society, Albuquerque, 12–16 Nov 2006
[Cad06] C. Cadar, V. Ganesh, P. Pawlowski, D. Dill, D. Engler, EXE: automatically generating inputs of death, in Proceedings of CCS'06, Oct–Nov 2006 (extended version appeared in ACM TISSEC 12:2, 2008)
[Cad08] C. Cadar, D. Dunbar, D. Engler, KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs, in Proceedings of OSDI'08, Dec 2008
[Che07] B. Chess, J. West, Secure Programming with Static Analysis (Addison-Wesley Professional, 2007)
[Cla07] J. Clause, W. Li, A. Orso, Dytan: a generic dynamic taint analysis framework, in Proceedings of the 2007 International Symposium on Software Testing and Analysis (ISSTA'07), London, UK, 9–12 July 2007, pp. 196–206
[Cos16] V. Costan, S. Devadas, Intel SGX explained. Cryptology ePrint archive: report 2016/086, IACR, 2016
[Cow98] C. Cowan et al., StackGuard: automatic adaptive detection and prevention of buffer-overflow attacks, in Proceedings of the 7th USENIX Security Symposium, San Antonio, 26–29 Jan 1998
[Del15] B. Delaware, C. Pit-Claudel, J. Gross, A. Chlipala, Fiat: deductive synthesis of abstract data types in a proof assistant, in Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'15), Mumbai, 15–17 Jan 2015, pp. 689–700
[Dev07] G. Devarajan, Unraveling SCADA protocols: using Sulley fuzzer. Presented at the DefCon'15 hacking conference, 2007
[Gan09] V. Ganesh, T. Leek, M. Rinard, Taint-based directed whitebox fuzzing, in Proceedings of the 31st International Conference on Software Engineering (ICSE'09), Vancouver, 16–24 May 2009, pp. 474–484
[God05] P. Godefroid, N. Klarlund, K. Sen, DART: directed automated random testing, in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, 12–15 June 2005, pp. 213–223
[God12] P. Godefroid, M.Y. Levin, D. Molnar, SAGE: whitebox fuzzing for security testing. ACM Queue 10(1), 20 (2012)
[Gol07] H.J. Goldsby, B.H.C. Cheng, J. Zhang, AMOEBA-RT: run-time verification of adaptive software, in Proceedings of Models in Software Engineering (MODELS 2007), Nashville, 30 Sept–5 Oct 2007, LNCS-5002, Springer, 2008, pp. 212–224
[Gor10] S. Gorbunov, A. Rosenbloom, AutoFuzz: automated network protocol fuzzing framework. IJCSNS 10(8), 239–245 (2010)
[Hod04] V. Hodge, J. Austin, A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2), 85–126 (2004)
[Hua12] S.K. Huang, M.H. Huang, P.Y. Huang, C.W. Lai, H.L. Lu, W.M. Leong, CRAX: software crash analysis for automatic exploit generation by modeling attacks as symbolic continuations, in IEEE 6th International Conference on Software Security and Reliability, 20–22 June 2012, pp. 78–87
[Kat18] K. Katsigiannis, D. Serpanos, MTF-Storm: a high performance fuzzer for Modbus/TCP, in Proceedings of the 2018 IEEE 23rd International Conference on Emerging Technologies and Factory Automation (ETFA), Turin, 2018, pp. 926–931
[Kha15] M.T. Khan, D. Serpanos, H. Shrobe, On the formal semantics of the cognitive middleware AWDRAT. Technical Report MIT-CSAIL-TR-2015-007, Computer Science and Artificial Intelligence Laboratory, MIT, USA, Mar 2015
[Kha17] M.T. Khan, D. Serpanos, H. Shrobe, ARMET: behavior-based secure and resilient industrial control systems. Proc. IEEE 106(1), 129–143 (2018)
[Kim04] S.S. Kim, A.L.N. Reddy, M. Vannucci, Detecting traffic anomalies through aggregate analysis of packet header data, in Proceedings of the 3rd International IFIP-TC6 Networking Conference (NETWORKING 2004), Athens, 9–14 May 2004, Springer LNCS-3042, pp. 1047–1059
[Kob07] T.H. Kobayashi, A.B. Batista, A.M. Brito, P.S. Motta Pires, Using a packet manipulation tool for security analysis of industrial network protocols, in Proceedings of the 2007 IEEE Conference on Emerging Technologies and Factory Automation, Patras, 2007, pp. 744–747
[Koc] R. Koch, ProFuzz, https://github.com/HSASec/ProFuzz
[Lak05] A. Lakhina, M. Crovella, C. Diot, Mining anomalies using traffic feature distributions, in Proceedings of the 2005 Conference on Applications, Technologies, Architectures and Protocols for Computer Communications (SIGCOMM 2005), Philadelphia, 22–26 Aug 2005, pp. 217–228
[Li17] Y. Li, B. Chen, M. Chandramohan, S.W. Lin, Y. Liu, A. Tiu, Steelix: program-state based binary fuzzing, in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, 4–8 Sept 2017, pp. 627–637
[Mil90] B.P. Miller, L. Fredriksen, B. So, An empirical study of the reliability of UNIX utilities. Commun. ACM 33(12), 32–44 (1990)
[Mil95] B.P. Miller et al., Fuzz revisited: a re-examination of the reliability of UNIX utilities and services. Technical report TR-1268, Department of Computer Sciences, University of Wisconsin-Madison, 1995
[Mod] Modbus Organization, Modbus application protocol specification, http://www.modbus.org/docs/ModbusApplication/ProtocolV11b.pdf
[ModS] Modbus serial line protocol and implementation guide V1.02 (Modbus_over_serial_line_V1_02.pdf)
[Nal12] R. McNally, K. Yiu, D. Grove, D. Gerhardy, Fuzzing: the state of the art. Technical note DSTO-TN-1043, Defence Science and Technology Organisation, Australia, Feb 2012
[New04] J. Newsome, D. Song, Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. Technical report CMU-CS-04-140, 2004 (revised 2005)
[Nis] National Vulnerability Database (NVD), http://nvd.nist.gov
[Pax99] V. Paxson, Bro: a system for detecting network intruders in real-time. Comput. Netw. 31(23–24), 2435–2463 (1999)
[Pea14] Peach Fuzzing platform, http://www.peach.tech/products/peach-fuzzer/, 2017
[PRO] PROTOS – security testing of protocol implementations, http://www.ee.oulu.fi/roles/ouspg/Protos/
[Qi14] X. Qi, P. Yong, Z. Dai, S. Yi, T. Wang, OPC-MFuzzer: a novel multi-layers vulnerability detection tool for OPC protocol based on fuzzing technology. Int. J. Comput. Commun. Eng. 3(4), 300–305 (2014)
[Raw17] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, H. Bos, VUzzer: application-aware evolutionary fuzzing, in Proceedings of the Network and Distributed System Security Symposium (NDSS), San Diego, 26 Feb–1 Mar 2017
[Roe99] M. Roesch, Snort – lightweight intrusion detection for networks, in Proceedings of the 13th USENIX Conference on System Administration (LISA'99), 1999, pp. 229–238
[Sch10] E.J. Schwartz, T. Avgerinos, D. Brumley, All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask), in 2010 IEEE Symposium on Security and Privacy, 2010
[Ser08] D. Serpanos, J. Henkel, Dependability and security will change embedded computing. Computer 41(1), 103–105 (2008)
[Ser18] D. Serpanos, Secure and resilient industrial control systems. IEEE Design Test 35(1), 90–94 (2018)
[Ser18a] D. Serpanos, M.T. Khan, H. Shrobe, Designing safe and secure industrial control systems: a tutorial review. IEEE Design Test 35(3), 73–88 (2018)
[Sfo] SecurityFocus, http://www.securityfocus.com
[Sha11] R. Shapiro, S. Bratus, E. Rogers, S. Smith, Identifying vulnerabilities in SCADA systems via fuzz-testing. Crit. Infrastruct. Prot. V, IFIP AICT 367, 57–72 (2011)
[Sie82] D. Siewiorek, R. Swarz, The Theory and Practice of Reliable System Design (Digital Press, Bedford, 1982)
[Spa07] S. Sparks, S. Embleton, R. Cunningham, C. Zou, Automated vulnerability analysis: leveraging control flow for evolutionary input crafting, in Proceedings of the 23rd Annual IEEE Computer Security Applications Conference (ACSAC 2007), pp. 477–486
[Str] SecurityTracker, http://www.securitytracker.com
[Sut07] M. Sutton, A. Greene, P. Amini, Fuzzing: Brute Force Vulnerability Discovery (Addison-Wesley Professional, Upper Saddle River, 2007)
[Tak08] A. Takanen, J. DeMott, C. Miller, Fuzzing for Software Security Testing and Quality Assurance (Artech House, Boston, 2008)
[Tsa12] P. Tsankov, M. Torabi Dashti, D. Basin, SECFUZZ: fuzz-testing security protocols, in Proceedings of the 7th International Workshop on Automation of Software Test (AST 2012), Zurich, 2–3 June 2012
[Val00] A. Valdes, K. Skinner, Adaptive, model-based monitoring for cyber attack detection, in Proceedings of the 3rd International Workshop on Recent Advances in Intrusion Detection (RAID 2000), Toulouse, 2–4 Oct 2000, Springer, pp. 80–93
[Vda14] J.D. DeMott, R.J. Enbody, W.F. Punch, Revolutionizing the field of grey-box attack surface testing with evolutionary fuzzing. VDA Labs, available at: https://www.vdalabs.com/tools/EFS.pdf
[Vie00] J. Viega et al., ITS4: a static vulnerability scanner for C and C++ code, in Proceedings of the 16th Annual IEEE Conference on Computer Security Applications (ACSAC'00), New Orleans, 2000, pp. 257–267
[Voy15] A.G. Voyiatzis, K. Katsigiannis, S. Koubias, A Modbus/TCP fuzzer for testing internetworked industrial systems, in Proceedings of the 20th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA 2015), Luxembourg, 8–11 Sept 2015, pp. 1–6
[Wan10] T. Wang, T. Wei, G. Gu, W. Zou, TaintScope: a checksum-aware directed fuzzing tool for automatic software vulnerability detection, in 2010 IEEE Symposium on Security and Privacy, Oakland, 2010, pp. 497–512
[Wan13] T. Wang et al., Design and implementation of fuzzing technology for OPC protocol, in Proceedings of the 9th International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Beijing, 2013, pp. 424–428
[Wan17] J. Wang, B. Chen, L. Wei, Y. Liu, Skyfire: data-driven seed generation for fuzzing, in Proceedings of the IEEE Symposium on Security and Privacy, 2017
[Wat07] C. Watterson, D. Heffernan, Runtime verification and monitoring of embedded systems. Software, IET 1(5), 172–179 (2007)
[ZarAFL] M. Zalewski, American fuzzy lop, available at: http://lcamtuf.coredump.cx/afl/
[Zha11] J. Zhao, Y. Wen, G. Zhao, H-Fuzzing: a new heuristic method for fuzzing data generation, in Proceedings of Network and Parallel Computing, LNCS, vol. 6985 (Springer, 2011), pp. 32–43

Chapter 6

False Data Injection Attacks

6.1  Introduction

False data injection attacks constitute a new, powerful class of attacks on the safety and security of cyber-physical systems. The goal of the attackers is to feed wrong input data into the system in order to lead it to wrong decisions, without attacking the system itself. They typically implement the attack by targeting the sensors that measure parameters of the physical plant of the cyber-physical system. Figure 6.1 shows a cyber-physical system model from the control point of view, i.e., as a control loop that controls and manages a plant. Although different types of computational and network attacks can be launched against all components and connections of the loop, false data injection attacks target the sensors, as indicated in the figure. Inserting wrong (false) data as measurements into the system can lead the system to a wrong decision and, thus, to a wrong action.

Fig. 6.1  Control view of a cyber-physical system

A simple example familiar to all is a fire control system with temperature sensors, where a malicious attacker lights a lighter next to a sensor and causes the fire management system to start pumping water into the area, although there is no fire. This simple example indicates that, if a malicious attacker inserts false data (relative to the intended data) into the system, the decisions and reactions of the cyber-physical system may be wrong and destructive, although the rest of the system (computing nodes and networks) has not been compromised in any way.

The power of false data injection attacks is considerable when one considers that sensors are typically deployed in the field, can fall under the control of attackers, and in many cases present an interface to the physical plant that can be manipulated with a high degree of precision, e.g., measurements of temperature, pressure, chemical concentrations, etc. Consider, as a simple example, a cyber-physical system whose sensors measure temperature at various points. An attacker can record the measurements over an arbitrary period of time and then insert the recorded values into the sensors iteratively, while the real temperatures may be changing dramatically due to some intervention. If the attack takes place on all the sensors of the system, the controller will effectively be operating in a virtual environment, assuming that the real environment has the replayed temperatures. In reality, the physical system can be completely out of control.

Such attacks are realistic and powerful. An attack of this class formed part of the well-known Stuxnet attack [Lan13, Stu11], where centrifuges for uranium enrichment were manipulated, increasing their rotation speed, while the operators had a view of normal operation, because the rotation speeds appeared to be within the permissible value range. The results of the attack are known: centrifuges were destroyed, causing significant problems to the operation of the plant for an extended period of time.
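The replay scenario above can be sketched in a few lines of Python; read_sensor and the controller interface are hypothetical placeholders.

```python
import itertools

def replay_attack(read_sensor, record_steps=100):
    """Record legitimate sensor values, then replay them indefinitely."""
    recorded = [read_sensor() for _ in range(record_steps)]  # observation phase
    return itertools.cycle(recorded)                         # replay phase

# The controller now consumes replayed values instead of live readings,
# so the real plant state can drift arbitrarily without being noticed:
# fake_feed = replay_attack(plant.read_temperature)
# controller.step(next(fake_feed))
```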

6.2  Vulnerability Analysis

False data injection attacks are not only powerful but, importantly, theoretically unavoidable, since a sufficiently powerful attacker can take control of all sensors and insert acceptable values in all of them for arbitrary time intervals. However, this theoretical attack is not feasible for a wide range of practical systems and processes, because of the large number of sensors and their spatial dispersion, as in the case of a power grid, for example. Thus, when one designs a cyber-physical system and considers attacker profiles, one should conduct vulnerability analysis for false data injection attacks, identify potential attacks, and then take measures and intervene in the design, in order to increase the difficulty of launching such attacks and to reduce and limit their effects. Clearly, the exact plant process, i.e., the cyber-physical system application, needs to be considered in the design, because system behavior constitutes a system parameter that can be exploited to detect false data injection attacks through inconsistencies in the plant process (application) data values.


We introduce the concept of vulnerability analysis for false data injection attacks in cyber-physical systems with an example. Consider an HVAC system that reads the environment temperature in a room and aims to keep it within a specified range, e.g., 70–74 °F. Typically, such HVAC systems have a sensor on the system itself that reads the temperature of the environment. When the temperature rises above 74 °F, the HVAC takes a cooling action, while it takes a heating action when the temperature falls below 70 °F. When designing the system and considering attack models for false data injection, the designer can anticipate that a straightforward attack would be to place a heat source next to the HVAC sensor and drive the system to cooling actions, regardless of the real temperature of the room. One method to defend against such a simplistic attack is to use redundancy at the sensor level: place multiple sensors in different parts of the room and decide the action (cooling or heating) taking into account the measurements of all the sensors rather than a single one. There are several data-filtering methods in control engineering with which the system can identify outliers in measurements, discard them as bad data, and make a correct decision on how to react. This design change is based on the assumption that an attacker cannot manipulate most or all of the sensors concurrently, while the system can detect an attack that manipulates one sensor and discard the influenced value. Thus, an attacker would need to change the attack model and manipulate several sensors concurrently in such a way that the system's data-filtering function does not detect the affected measurements.

The example demonstrates that analyzing cyber-physical systems for vulnerabilities to false data injection attacks requires the concept of a data monitor, i.e., a monitor that checks incoming measurements and decides whether they are good (acceptable) or bad (unacceptable). The monitor must be plant-specific, because different plants implement different processes and, thus, their measurements have different interrelations that need to be evaluated. So, data monitors are plant- and application-specific. Another important aspect of the vulnerability analysis is the consideration of constraints on the measured values. In many systems, operational parameter values must lie in specific, known ranges, so measurements outside these ranges indicate an attack (or a malfunction or a sensor failure), because they create immediately apparent inconsistencies with normal plant behavior.
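A minimal sketch of the redundancy-based data filtering described above follows; the tolerance value is a hypothetical design parameter.

```python
import statistics

def filtered_temperature(readings, tol=2.0):
    """Fuse redundant sensor readings, discarding outliers.

    A reading is treated as bad data if it deviates from the median of
    all readings by more than `tol` degrees (a hypothetical tolerance).
    """
    med = statistics.median(readings)
    good = [r for r in readings if abs(r - med) <= tol]
    return statistics.fmean(good)

# One sensor is heated by an attacker; the filter discards it.
print(filtered_temperature([71.8, 72.1, 71.9, 95.0]))  # ~71.9, not pulled toward 95
```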

Vulnerability analysis for false data injection attacks is a process that can be systematic and automated to a large degree, exploiting emerging approaches and tools. In the following, we present a vulnerability analysis method that exploits an SMT solver, dReal [Gao13], for false data injection attacks on the state estimator of a power grid system, as presented in [Gao15].

A power grid is a network of busses interconnected with transmission lines and can be modeled as a directed graph $(B, E)$, where the nodes $B$ are the busses and the edges $E$ are the transmission lines. For the state estimation of the power grid, one measures the voltage $v$ on a bus and the phase angle $\theta$. Furthermore, each bus and each transmission line has a corresponding measurement $z^p$ of active power injection (for a bus) or active power flow (for a transmission line), and a corresponding measurement $z^q$ of reactive power injection or flow, respectively. The $z$ measurements are necessary to evaluate the correctness of the measurements, since they must satisfy the well-known power flow and power injection equations when the power grid is at steady state. All these measurements are vectors, since they include all busses and transmission lines, with $z^q_{ii}$ indicating the corresponding measurement on bus $i$ and $z^q_{ij}$ the measurement on transmission line $(i, j)$. The power flow equations satisfied at steady state are:

$$Z^p_{ij} = h^p_{ij}(v_i, v_j, \theta_i, \theta_j)$$
$$Z^q_{ij} = h^q_{ij}(v_i, v_j, \theta_i, \theta_j)$$

where

$$h^p_{ij}(v_i, v_j, \theta_i, \theta_j) = v_i^2 g_{ij} - v_i v_j g_{ij} \cos(\theta_i - \theta_j) - v_i v_j b_{ij} \sin(\theta_i - \theta_j)$$
$$h^q_{ij}(v_i, v_j, \theta_i, \theta_j) = -v_i^2 b_{ij} - v_i v_j b_{ij} \cos(\theta_i - \theta_j) - v_i v_j g_{ij} \sin(\theta_i - \theta_j)$$

and the variables $g_{ij}$ and $b_{ij}$ are the conductance and susceptance constants of transmission line $(i, j)$. The power injection equations are

$$Z^p_{ii} = h^p_{ii}(v_i, \theta_i, v_{i_1}, \ldots, v_{i_k}, \theta_{i_1}, \ldots, \theta_{i_k})$$
$$Z^q_{ii} = h^q_{ii}(v_i, \theta_i, v_{i_1}, \ldots, v_{i_k}, \theta_{i_1}, \ldots, \theta_{i_k})$$

where

$$h^p_{ii}(v, \theta) = v_i \sum_{j \in N_i} v_j \left( -g_{ij} \cos(\theta_i - \theta_j) - b_{ij} \sin(\theta_i - \theta_j) \right)$$
$$h^q_{ii}(v, \theta) = v_i \sum_{j \in N_i} v_j \left( -g_{ij} \sin(\theta_i - \theta_j) - b_{ij} \cos(\theta_i - \theta_j) \right)$$

and $N_i = \{i_1, \ldots, i_k\}$ is the set of the $k$ neighboring busses of bus $i$.

The state estimator $est(Z, X)$, shown in Fig. 6.2, calculates the estimated values of the state variables $X$ at time $t$ based on their values at time $t - 1$ and the measurements $Z$. The estimator effectively calculates values for the state variables $X$ that satisfy the above power equations, taking into account that the measurements are imprecise; accordingly, the estimated values are imprecise as well, and the corresponding state calculation methods ensure that they lie within a predetermined error bound $\varepsilon$. An admissible model of the power grid is one that satisfies the power equations within the bounded error $\varepsilon$. In this context, a false data injection attack is one that presents an admissible model of the grid but violates safety requirements of the grid.


Fig. 6.2  Power grid vulnerability analysis (the monitor mon() and the estimator est() operate on the measurements of the grid's busses and transmission lines)

Considering that measurements are imprecise, i.e., there is a measurement error or a potential sensor failure, state estimation methods also estimate the measurement error through the calculation of the measurement residue, i.e., the difference $R = Z - h(X)$, where $Z$ are the measurements and $h(X)$ are the expected measurements as calculated from the state variables. The monitor $mon(Z, X)$ checks whether this residue $R$ is bounded by a predetermined bound $\tau$; values beyond this bound are, in general, discarded as bad data, since typical monitors take into account probabilistic failures of sensors and/or probabilistic measurement error models. Thus, false data injection attacks not only present an admissible model of the grid but also insert false measurements that the monitor accepts.

Based on the above, vulnerability analysis of the power grid state estimation process for false data injection attacks requires the identification of attacks, i.e., combinations of false input data, that present an admissible grid model which satisfies the power equations and whose input measurements the monitor accepts. To conduct such analysis and identify vulnerabilities to such attacks, the problem can be stated as an existential problem: does there exist a combination of input data that is admissible and accepted by the monitor although it is not correct? Such an existential problem can be encoded as a satisfiability problem over real functions, since the power equations are real functions, and appropriate tools exist today. We describe a method, presented in [Gao15], which employs dReal [Gao13], an SMT solver that supports real functions. The approach is effective and leads to practical results, as experiments with the IEEE benchmark power grid configurations have shown [IEEE-B]. The results obtained with the dReal SMT solver demonstrate that vulnerabilities are identified with very fast solver execution times for all sizes of grids (less than a few hundred seconds in all cases) [Gao15]. Importantly, considering the problem statement from the computational point of view, it becomes clear that augmenting the problem with additional constraints on measurement values makes the problem harder to solve while also reducing the attack surface for false data injections. This provides a direction for adding constraints to sensor measurements to alleviate vulnerabilities and lead to robust designs. Continuous experimentation with alternative designs, i.e., exploration of the design space, is practical, since dReal is efficient even for large numbers of constraints [Gao15].
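As an illustration of the encoding, the following sketch states a delta-satisfiability query for a single transmission line, assuming the Python bindings of the dReal solver (package dreal); the line constants, admissible ranges, monitor threshold, and safety bound are hypothetical, whereas the analysis in [Gao15] covers complete grid models.

```python
from dreal import Variable, And, CheckSatisfiability, cos, sin

# State variables of a single line (i, j) and the injected measurement z_p.
vi, vj = Variable("vi"), Variable("vj")
ti, tj = Variable("ti"), Variable("tj")
zp = Variable("zp")

g, b = 1.2, -4.0  # hypothetical line conductance and susceptance
hp = vi * vi * g - vi * vj * g * cos(ti - tj) - vi * vj * b * sin(ti - tj)

attack_exists = And(
    0.9 <= vi, vi <= 1.1, 0.9 <= vj, vj <= 1.1,  # admissible voltage range
    -0.5 <= ti, ti <= 0.5, -0.5 <= tj, tj <= 0.5,
    zp - hp <= 0.01, hp - zp <= 0.01,            # residue accepted by the monitor
    zp >= 2.0,                                   # ...yet violating a safety bound
)

result = CheckSatisfiability(attack_exists, 0.001)  # delta-satisfiability query
print(result if result else "no such false data injection found")
```

If the solver returns a satisfying box, it constitutes a concrete false data injection that the monitor would accept; tightening the measurement constraints shrinks this attack surface, at the cost of a harder satisfiability problem.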

6.3  Dynamic Monitoring

The deployment of cyber-physical systems in a wide range of application domains, and especially in critical infrastructures such as the power grid and water distribution networks, requires their robust operation in the field. The risks originate from malicious attacks and from component and process faults, due not only to design flaws and unanticipated operating conditions but to component aging as well. Cyber attacks, process automation bugs, component failures, and incipient faults constitute significant risks that need to be detected, diagnosed, defended against, recovered from, and fixed [Ser18, Ser18a, Ser18b, Rig16]. Despite the development of robust system designs, operation in the field brings faults and failures due to unanticipated conditions and emerging methods of attack. Clearly, it is necessary to monitor cyber-physical system operation in the field at run time, in order to ensure early detection of problems and quick recovery for continuous and safe operation. The monitors used need to detect all types of parametric changes and faults, small or large, in order to cover all types of attacks and failures. Especially for false data injection attacks, monitors need to observe incoming parameter measurements and detect deviations from normal operation. The wide range of parametric changes and failures due to false data injection attacks, sensor faults, and plant failures, e.g., due to aging, indicates the need to develop powerful computational tools and methods. Existing methods for condition monitoring, the prevalent technology for monitoring plants for such parametric changes, are limited and do not achieve early detection and diagnosis of all types of failures [Pas13]. A promising direction for developing run-time monitors for robust, i.e., safe and secure, cyber-physical systems is to extend conventional condition monitors [Shi17] to detect additional types of failures, such as cyber attacks (especially false data injection attacks) and incipient faults.

Computational tools and methods for condition monitoring of critical infrastructure can be classified primarily into two approaches: (i) model-based and (ii) model-free [Rig15, Rig16]. Model-based approaches exploit prior information about the dynamics of the monitored system in the fault-free condition. This information is encoded in a model, such as a state-space description or an equivalent representation. Model-free approaches process only raw data and possibly represent them in the form of non-parametric approximators, such as neural networks, out of which indices of the monitored system's operation are extracted and compared against failure-indicating norms [Rig18, Rig19]. Independently of the approach, model-based or model-free, detection of a parametric change that indicates a cyber attack or a failure requires a representation of the attack-/fault-free system operation and the detection of a divergence from this operation when there is a problem. This divergence is detected through a calculation that indicates deviation beyond a threshold. The definition of this threshold is critical in these methods, not only for effective early detection of problems but also for the avoidance of false alarms.

The credibility of methods to detect attacks and faults and to isolate them is determined by three main factors: (i) the accuracy of the model used to represent the attack-/fault-free operation of the monitored system, (ii) the accuracy of the estimation method for the outputs of the monitored system in the attack-/fault-free condition, and (iii) the accuracy of the statistical decision-making procedure and the fault threshold used to infer the existence of a fault [Lia17, Rig16, Ise06].

The accuracy of the model used by linear and nonlinear observers and filters is critical to the reliability of most model-based fault diagnosis approaches. Analogously, in model-free fault diagnosis methods, the accuracy of the models extracted from raw data (e.g., in neural, wavelet, and neurofuzzy approximators) dictates their reliability. This accuracy can be measured with model validation methods. Models are often invalidated, e.g., when the operating conditions of the monitored system change: the system may still be fault-free, but the values of its parameters may deviate from the ones used by the model. To determine whether a model is still valid, the outputs of the fault-free system in its new operating conditions are compared against the estimated outputs provided by the model. This is performed by calculating the sequence of residuals and analyzing them statistically. This analysis leads to an evaluation of the model's validity, because changes beyond set thresholds result in an update of the model [Rig18a, Rig19, Ise06, Din08].

The accuracy of the estimation method is critical to a reliable attack/fault diagnosis process. Estimators of the outputs of the attack-/fault-free system need to have minimum variance, so that the effects of measurement noise can be eliminated and the estimated system outputs approximate the real values of the attack-/fault-free system outputs. Most state observers or filters used in classical fault diagnosis do not provide assurance of optimality in estimation. In fact, only the Kalman filter provides minimum-variance estimates among all linear state observers and filters. Also, among nonlinear state observers and filters, optimality can be claimed only for those that perform estimation using the recursion of the standard (linear) Kalman filter after applying suitable state-space transformations. The recursion of the Kalman filter also exhibits better performance, in terms of computation speed, relative to other state observers and filters, achieving fast convergence and enabling real-time fault diagnosis in dynamic systems. Furthermore, the Kalman filter can be redesigned in a form that assures robustness to measurement noise and modeling errors. Thus, the Kalman filter is the preferred estimator: it is optimal, in terms of minimum-variance estimates, and it outperforms other robust observers and filters, which are suboptimal since their estimation error variance is not assured to converge to a minimum [Man14, Che99].

The definition of statistical criteria to detect attacks and faults, as well as the optimal selection of the related thresholds, is important for early attack and fault diagnosis, including incipient faults, and for the reduction of false alarms [Ben87, Gao15a]. Similar to model validation, the sequence of residuals is used to determine stochastic variables (statistical tests) that indicate the appearance of an attack or fault. Since the elements of the residuals sequence are independent and identically distributed and follow the zero-mean Gaussian distribution, one can prove that the sum of the squares of the residual vectors, after being multiplied by the inverse of the related covariance matrix, follows the χ2 distribution. The confidence intervals of this distribution can be used to detect deviations of the monitored system operation from the attack-/fault-free model operation.

Development of dynamic monitors for safety and security is a process that depends on the specific plant being monitored, as explained above. We consider a power system, similar to the vulnerability analysis of the previous section, employing a Kalman filter as an estimator of the plant operation. A simple and effective method that extends traditional Kalman filter-based condition monitors uses the Kalman filter as a virtual sensor to emulate the operation of the plant's sensors in fault-free mode and identifies deviations between the results of this virtual sensor and the actual sensor measurements; deviations beyond a threshold indicate an attack or a failure [Rig17]. This method, combined with statistical decision criteria, has been used to detect attacks against sensors of power grids. The Kalman filter is used as a virtual sensor to emulate grid sensor operation in the fault-free mode, and its output is compared against the output of the real sensors, producing a sequence (vector) of residuals, i.e., differences between estimated and actual sensor measurements, as shown in Fig. 6.3. The square of this residual vector, weighted by the inverse of the associated covariance matrix, constitutes a random variable that follows the χ2 distribution. Thus, this variable can serve as a statistical test for the deviation of sensor operation from normal; exploiting the properties of the χ2 distribution and using the confidence-intervals approach, one can define thresholds for this statistical test. When a value of the statistical test is beyond the threshold, an alarm is raised because sensor operation is abnormal (outside the acceptable range). Importantly, the statistical test can be applied to clusters of sensors, leading to the identification of power grid sections that have been exposed to the attack. In addition, one can isolate compromised sensors by applying the statistical test at each individual sensor.

Fig. 6.3  Residual sequence for power grid operation (the outputs z1, z2, ..., zn of the model of the power grid sensors, including attacks, are compared against the outputs y1, y2, ..., yn of the Kalman filter emulating the fault-free grid sensors, yielding the residuals e1, e2, ..., en)
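A sketch of the χ2 decision step follows, using numpy and scipy; the measurement values, covariance, and confidence level are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def chi_square_alarm(z, z_hat, cov, confidence=0.99):
    """Raise an alarm if the weighted squared residual exceeds the
    chi-square confidence bound for the residual dimension."""
    r = z - z_hat                       # residual: actual minus estimated measurements
    stat = r @ np.linalg.inv(cov) @ r   # r' * inv(Sigma) * r ~ chi2(dim) when fault-free
    threshold = chi2.ppf(confidence, df=len(r))
    return stat > threshold, stat, threshold

z     = np.array([1.02, 0.98, 1.40])   # actual sensor measurements
z_hat = np.array([1.00, 1.00, 1.00])   # virtual-sensor (Kalman filter) estimates
cov   = 0.01 * np.eye(3)               # residual covariance (hypothetical)
print(chi_square_alarm(z, z_hat, cov))  # the third sensor pushes the test over the bound
```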


The concept of this technique for false data injection detection is as follows. At every time $t_k$, the measurements are related to the system states by

$$z_k = H(x_k, k) + e_k$$

where the dependence of $H$ on $k$ reflects the possibility of system changes over time. The Kalman filter estimates the current system state from the state at the previous time step and the current measurements. Denoting by $\hat{x}_{k|m}$ the a posteriori state estimate at time $t_k$, given observations up to and including time $t_m$, and by $P_{k|m}$ the a posteriori error covariance matrix (a measure of the estimated accuracy of the state estimates), the Kalman filter used to evaluate the system states from the observed measurements is often implemented in two distinct phases, "predict" and "update"; these phases alternate, with the prediction advancing the state until the next scheduled observation(s) and the update incorporating the observation. The calculations of the predict and update phases (steps) of the Kalman filter are shown in Fig. 6.4.

Predict step:

$$\hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1} + B_k u_k \qquad \text{(a priori state estimate; } F_k, B_k \text{: state transition and control input models)}$$
$$P_{k|k-1} = F_k P_{k-1|k-1} F_k^{\top} + Q_k \qquad \text{(a priori error covariance; } Q_k \text{: process noise covariance)}$$

Update step:

$$\tilde{y}_k = z_k - H_k \hat{x}_{k|k-1} \qquad \text{(measurement residual; } z_k \text{: the observation)}$$
$$S_k = H_k P_{k|k-1} H_k^{\top} + R_k \qquad \text{(residual covariance; } R_k \text{: measurement noise covariance)}$$
$$K_k = P_{k|k-1} H_k^{\top} S_k^{-1} \qquad \text{(optimal Kalman gain)}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k \qquad \text{(state estimate update)}$$
$$P_{k|k} = (I - K_k H_k) P_{k|k-1} \qquad \text{(covariance estimate update)}$$
$$\tilde{y}_{k|k} = z_k - H_k \hat{x}_{k|k} \qquad \text{(post-fit measurement residual)}$$

Fig. 6.4  Predict and update calculations of the Kalman filter
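A compact numpy sketch of one predict/update cycle, assuming a linear time-invariant model (the matrices F, B, H, Q, R are placeholders to be instantiated for the monitored plant):

```python
import numpy as np

def kalman_step(x, P, u, z, F, B, H, Q, R):
    """One predict/update cycle of the (linear) Kalman filter."""
    # Predict: a priori state estimate and error covariance.
    x_pred = F @ x + B @ u
    P_pred = F @ P @ F.T + Q
    # Update: residual, residual covariance, optimal gain.
    y = z - H @ x_pred                         # measurement residual
    S = H @ P_pred @ H.T + R                   # residual covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # optimal Kalman gain
    x_new = x_pred + K @ y                     # a posteriori state estimate
    P_new = (np.eye(len(x)) - K @ H) @ P_pred  # a posteriori covariance
    return x_new, P_new, y                     # y feeds the chi-square residual test
```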


Detection and identification of bad data in measurements can be accomplished by processing the measurement residuals. Specifically, the χ2 test can be applied. Upon detection of bad data, two main methods can be used to identify the specific measurements that actually contain bad data: (i) the largest normalized residual test and (ii) the hypothesis testing identification method.

Malicious attack vectors can remain undetected by existing statistical tests for bad data detection if the measurement residuals remain unchanged. One such example is the false data injection attack, which is defined in [Liu09] as follows: the malicious attack vector $a = (a_1, a_2, \ldots, a_m)^{\top}$ is called a false data injection attack if and only if $a$ can be expressed as a linear combination of the columns of $H_k$, i.e., $a = H_k c$, where $c$ is an arbitrary vector. When a false data injection attack is applied to the power system, the collected measurements can be expressed as

$$z_k^a = z_k^0 + a = H_k(\theta + c) + e_k$$

When the state estimate uses the malicious measurements, the residuals become

$$z_k^a - H_k(\theta + c) = z_k^0 - H_k \theta$$

Thus, the measurement residuals are unaffected by the injected attack vector $a$, and the attacker successfully misleads the system into accepting that the true state is $\theta + c$ instead of the real $\theta$. A unified approach to constructing such attack vectors for this system is given in [Kim11]. These attacks can be detected by adding sparsity and low-rank constraints to the objective function of the state estimator, exploiting the intrinsic low dimensionality of temporal measurements of power grid states as well as the sparse nature of false data injection attacks. Despite the benefits offered by these dynamic modeling approaches, challenges remain that are strongly related to the strategy that generates the attacks as well as to the integrity of the historical measurements.
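The residual-invariance argument can be verified numerically. The following sketch, with an illustrative random measurement matrix and a least-squares estimator standing in for the state estimator, constructs a stealthy attack vector a = Hc and confirms that the residual is unchanged while the state estimate shifts by exactly c.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 3))       # measurement matrix (8 measurements, 3 states)
theta = rng.normal(size=3)        # true state
e = 0.01 * rng.normal(size=8)     # measurement noise
z0 = H @ theta + e                # honest measurements

c = rng.normal(size=3)            # attacker's chosen state shift
a = H @ c                         # stealthy attack vector: a = H c
za = z0 + a                       # tampered measurements

# Least-squares state estimates from honest and tampered measurements.
est = lambda z: np.linalg.lstsq(H, z, rcond=None)[0]
r0 = z0 - H @ est(z0)             # residual without attack
ra = za - H @ est(za)             # residual with attack
print(np.allclose(r0, ra))        # True: the monitor sees identical residuals
print(est(za) - est(z0))          # ~c: the estimator is misled by exactly c
```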

References

[Ben87] A. Benveniste, M. Basseville, G. Moustakides, The asymptotic local approach to change detection and model validation. IEEE Trans. Autom. Control 32(7), 583–592 (1987)
[Che99] J. Chen, R. Patton, Robust Fault Diagnosis for Dynamic Systems (Prentice-Hall, 1999)
[Din08] S.X. Ding, Model-based Fault Diagnosis Techniques: Design Schemes, Algorithms, and Tools (Springer, 2008)
[Gao13] S. Gao, S. Kong, E.M. Clarke, dReal: an SMT solver for nonlinear theories over the reals, in Automated Deduction – CADE-24 (Springer, 2013), pp. 208–214
[Gao15] S. Gao, L. Xie, A. Solar-Lezama, D. Serpanos, H. Shrobe, Automated vulnerability analysis of AC state estimation under constrained false data injection in electric power systems, in Proceedings of the 54th IEEE Conference on Decision and Control (CDC), Osaka, 2015, pp. 2613–2620
[Gao15a] Z. Gao, X. Ding, D. Gao, A survey of fault diagnosis and fault-tolerant techniques—part II: fault diagnosis with knowledge-based and hybrid/active approaches. IEEE Trans. Ind. Electron. 62(6), 3768–3774 (2015)
[IEEE-B] Power Systems Test Case Archive, https://www.ee.washington.edu/research/pstca/
[Ise06] R. Isermann, Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance (Springer, 2006)
[Kim11] T.T. Kim, H.V. Poor, Strategic protection against data injection attacks on power grids. IEEE Trans. Smart Grid 2(2), 326–333 (2011)
[Lan13] R. Langner, To kill a centrifuge. The Langner Group, 2013, available at: https://www.langner.com/wp-content/uploads/2017/03/to-kill-a-centrifuge.pdf
[Lia17] G. Liang, J. Zhao, F. Luo, S.R. Weller, Z.Y. Dong, A review of false data injection attacks against modern power systems. IEEE Trans. Smart Grid 8(4), 1630–1638 (2017)
[Liu09] Y. Liu, M.K. Reiter, P. Ning, False data injection attacks against state estimation in electric power grids, in Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS'09), Chicago, 9–13 Nov 2009, pp. 21–32
[Man14] K. Manandhar et al., Detection of faults and attacks including false data injection attack in smart grid using Kalman filter. IEEE Trans. Control Netw. Syst. 1(4), 370–379 (2014)
[Pas13] F. Pasqualetti, F. Dörfler, F. Bullo, Attack detection and identification in cyber-physical systems. IEEE Trans. Autom. Control 58(11), 2715–2729 (2013)
[Rig15] G. Rigatos, Nonlinear Control and Filtering Using Differential Flatness Approaches: Applications to Electromechanical Systems (Springer, 2015)
[Rig16] G. Rigatos, Intelligent Renewable Energy Systems: Modelling and Control (Springer, 2016)
[Rig17] G. Rigatos, D. Serpanos, N. Zervos, Detection of attacks against power grid sensors using Kalman filter and statistical decision making. IEEE Sens. J. 17(23), 7641–7648 (2017)
[Rig18] G. Rigatos, N. Zervos, D. Serpanos, V. Siadimas, P. Siano, M. Abbaszadeh, Condition monitoring of wind-power units using the derivative-free nonlinear Kalman filter, in IEEE INDIN 2018, Porto, Portugal, July 2018
[Rig18a] G. Rigatos, N. Zervos, D. Serpanos, V. Siadimas, P. Siano, M. Abbaszadeh, Condition monitoring of gas-turbine power units using the derivative-free nonlinear Kalman filter, in IEEE International Conference on Smart Energy Systems and Technologies, Sevilla, Sep 2018
[Rig19] G. Rigatos, D. Serpanos, V. Siadimas, P. Siano, M. Abbaszadeh, Neural networks and statistical decision making for fault diagnosis in energy conversion systems, in Artificial Intelligence Techniques for a Scalable Energy Transition: Advanced Methods, Digital Technologies, Decision Support Tools, and Applications, ed. by M. Sayed-Mouchaweh (Springer, 2019)
[Ser18] D. Serpanos, M. Wolf, Internet-of-Things (IoT) Systems: Architectures, Algorithms, Methodologies (Springer International Publishing, 2018)
[Ser18a] D. Serpanos, Secure and resilient industrial control systems. IEEE Des. Test 35(1), 90–94 (2018)
[Ser18b] D. Serpanos, M.T. Khan, H. Shrobe, Designing safe and secure industrial control systems: a tutorial review. IEEE Des. Test 35(3), 73–88 (2018)
[Shi17] D. Shi, R.J. Elliott, T. Chen, On finite-state stochastic modelling and secure estimation of cyber-physical systems. IEEE Trans. Autom. Control 62(1), 65–80 (2017)
[Stu11] N. Falliere, L.O. Murchu, E. Chien, W32.Stuxnet Dossier. Symantec, 2011, available at: https://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/w32_stuxnet_dossier.pdf

Index

A Abstraction/refinement approach, 19 Agile methods, 24 Air gaps, 5, 6 Airbus A400M, 2 Aircraft certification, 19, 20 Airworthiness certificates, 19 Android Scripting Environment (ASE), 53 Application-level threats, 35 Arbitrary time intervals, 74 Architectural threat modeling attack model (see Attack models) coding errors, 49 functional vulnerabilities, 50 mitigation, 50 MPSoCs, 50 privilege escalation attacks, 49 QoS attacks, 50 safety-critical systems, 50 system’s non-functional properties, 50 system-level vulnerabilities, 50 timing vulnerabilities, 50 Architecture-level threats, 35 Architectures model-based design, 48, 49 processor security, 47, 48 safety and security characteristics, 47 service-oriented, 54–56 ARM Mbed, 41 ARM TrustZone, 48, 68 ARMET, 67, 68 ASTM F3153, 22 ASTM F3201-16, 30 ASTM F3269-17, 21

Attack analysis advantage, design faults, 17 attack tree model, 17 computer security, 17 cyber kill chain, 18 flow relations, 17 Howard’s model, 17, 18 ICS, 18 information flow models, 17 kill chain, 18 Microsoft STRIDE model, 17 NIST Framework for Improving Critical Infrastructure Cybersecurity, 17 risk assessment activity, 17 security violations, 19 vulnerability, 17 Attack computations, 52 Attack models application layer, 52 architectural information, 50 architecture graph, 52 attack computations, 52 compromised computation, 52 computation, 52 crafted attack, 52 DUT, 50 hardware platform level, 51 Howard’s model, 52 middleware and operating system layer, 51 and mitigations, 53, 54 MMU, 52 multilayer, 51 network-on-chip adapter, 52 STRIDE, 52

© Springer Nature Switzerland AG 2020 M. Wolf, D. Serpanos, Safe and Secure Cyber-Physical Systems and Internet-of-­ Things Systems, https://doi.org/10.1007/978-3-030-25808-5

85

86 Attack models (cont.) structural information, 52 threat model, 50 timing properties, 52 valid computations, 52 Attack surface, 25 Attack tree model, 17 Attack vectors, 25 Attacks, 66, 67 Authorization domains, 42 Automobiles, 59 AUTOSAR software architecture, 53 Availability, 11, 13 B Bell-LaPadula and Biba models, 49 Bit flips, 13 Black-box fuzzing, 61, 62 C CERT® C Coding Standard, 28 Certification, 11 aircraft, 19, 20 airworthiness, 19 ASTM F3269-17, 21 DAL, 20 DO-178C, 20 fail-safe design, 20 FAR Part 43, 19 government, 19 instruction and data values, 21 production, 19 regulations, 19 software-as-a-medical-device, 21 STCs, 19 type, 19 Client/server communication model, 63, 64 CMMI-DEV ®, 22 Coding errors, 49 Common Vulnerabilities and Exposures (CVE), 41 Common Vulnerability Scoring System (CVSS), 26, 41 Compound threats, 36, 37 Compromised computation, 52 Computer security, 1, 17 Computer security attacks, 36 Controlled risk, 12 Crafted attack, 52 Cyber attacks, 3, 78 Cyber kill chain, 18 Cyber-physical attacks, 1

Index Cyber-physical kill chain, 39, 40 Cyber-physical system (CPS) application domains, 60 control view, 73, 74 cyber and physical attacks, 3 designing, 59 ease of attack, 3 extensive and lengthy repairs, 4 and industrial control, 5 Notpetya attack, 2 operating environment, 60 operating schedule, 5 safe and secure, 59 safety failures and attacks, 3, 4 safety properties, 59 specific causes, 2 uncontrolled environment, 59 D Data attacks powerful class, 73 sensors, 73 Data filtering, 75 Data generation, 62 Data monitor, 75 Data mutation, 62 Dependability, 13 Design processes, 6 safety AADL, 24 agile methods, 24 analysis, 22, 23 ASTM F3153, 22 case methodology, 23, 24 classification, 22 E/E systems, 22 engineering of software, 22 functional, 22 ISO 26262, 22 MIL-STD-882E, 24 MISRA, 24 model checking, 23 planning, 22 software safety methodology, 23 STAMP methodology, 23 testing of software, 22 verification process, 22 V methodology, 24, 25 security attack surface, 25 CERT® C Coding Standard, 28 CSET, 26 CVSS, 26

Index Mimikatz, 29 national, 26 NCCIC, 26 NIST Guide to ICS security, 27 NIST guide to smart grid cybersecurity, 27 NIST Platform Firmware Resiliency Guidelines, 26, 27 NIST SP 800-53, 27, 28 NIST Special Publication 800-61, 28 phases, 29 principles, 25, 26 recommendations, 28 researchers, 29 ring policy, 26 secure coding of C and C++ programs, 28 set theory, 25, 26 zero-day exploit, 29 zero-day vulnerability, 29 Design under test (DUT), 50 Development Assurance Level (DAL), 20 Device-level threats, 35 Dieselgate, 2 Digital fault models, 13 Distributed power agreement algorithm, 55 DO-178C, 20 Domain-specific modeling languages (DSMLs), 48, 49 dReal, 75, 77, 78 Dynamic analysis, 61 Dynamic monitoring accuracy of estimation method, 79 condition monitoring, 78 credibility, 79 cyber attacks, 78 detection and identification, 82 development, 80 factors, 79 FDI attacks, 78, 82 Kalman filter, 79–81 malicious attack vectors, 82 measurement data, 82 measurement residuals, 82 measurements, 81 model-based approaches, 78 model-free approaches, 78 model validation, 79 parametric changes and faults, 78 power grid and water distribution networks, 78 predict and update phases, 81 residual sequence, power grid operation, 80 robust system designs, 78

87 run-time monitors, 78 statistical criteria, 79 statistical test, 80 zero-mean Gaussian distribution, 80 E Early-stage analysis, 14 Electrical/electronic (E/E) systems, 22 Eliminated risk, 12 E machine, 55 Embedded software, 30 Energetic Bear, 4 Errors, 13, 59 EternalBlue, 29 Event tree, 16 Experimental aircraft, 19 F Fail-safe design, 20 Failure modes and effects analysis (FMEA), 12, 13, 15, 16 Failures account probabilistic, 77 and attacks, 78 components, 78 plant, 78 sensor, 75, 77 types, 78 unanticipated conditions, 78 False data injection (FDI) dynamic monitoring, 78–82 physical plant, 73 power, 73 vs. sensors, 73 Stuxnet attack, 74 virtual environment, 74 vulnerability analysis, 74–78 False data injection attacks, 60 FAR Part 21, 19 Fault-free condition, 78 Fault models definition, 13 digital, 13 FHA, 13, 14 FMEA, 13, 15, 16 FTA, 13 human-made, 13 permanent/transient, 13 physical, 13 stuck-at, 13 transient, 13 Fault tree analysis (FTA), 13–15

88 Faults, 59 Fire control system, 73 Folk wisdom, 5, 6 Function codes (FC), 63–65 Functional hazard analysis, 14 Functional hazard assessment (FHA), 13, 14 Functional safety, 11, 22 Functional vulnerabilities, 50 Fuzz testing advantages, 61, 62 black-box fuzzing, 62 categorization, 61 configuration, 61 data generation, 62 data mutation, 62 dynamic analysis, 61 input vectors, 62 network protocols, 62 software code analysis, 61 static analysis, 61 symbolic execution, 61 Taint analysis, 62 vulnerabilities, 61 white-box fuzzing, 61 Fuzzing industrial control network systems, 63, 64 G Generic Modeling Environment (GME), 49 Glitches, 13 Government certification processes, 19 Gray-box fuzzing, 61 H Hazard analysis absence/minimization, 13 availability, 13 dependability, 13 FHA, 13, 14 FTA, 14, 15 functional, 14 reliability, 13 Hazard risk index matrix, 12, 13 Hazards characteristic/event, 36 safety, 36 Howard’s model, 17, 18, 52 Human-made faults, 13 HVAC system, 75

Index I Improper authorization threats, 41, 42 Incipient faults, 78, 79 Industrial Control Systems (ICS) security, 18, 27 Information flow models, 17 Information technology (IT), 6 Infrastructure equipment, 1 Intel SGX, 68 Internet, 5 Internet-connected CPS, 7 Internet Crime Complaint Center, 30 Internet-facilitated criminal activity, 30 Internet security, 1 Internet systems, 54 ISO 26262, 22 ISO 26262 standard, 7 ISO/IEC 15504 Automotive SPICE ™, 22 Iterative threat analysis methodology, 42, 43 K Kalman filter, 79–81 Kill chain, 18 L Leveson’s STAMP methodology, 36 M Malicious attacker, 73 Malicious attack vectors, 82 Measurement error, 77 Medical data, 7 Memory model, 65 Microarchitectural Data Sampling (MDS) attacks, 48 Microsoft STRIDE model, 17 MIL-STD-882E, 24 Mimikatz, 29 Mitigation post-deployment, 44 pre-deployment, 44 Mitigations, 53, 54 MMU, 52 Modbus application packets, 64 client/server communication model, 63, 64 FC, 63, 64 industrial networks, 63 MTF, 64–66 OSI reference model, 63

Index packets, 63 PDUs, 64 protocol stacks, 63 research and development, 63 secure communications, 64 specifications, 63 and TCP, 63 Modbus Application Protocol (MBAP), 64 Modbus memory organization model, 65 Model-based approaches, 78 Model-based design, 48, 49, 55 Model-based monitoring, 66, 67 Model-free approaches, 78 Model-integrated computing, 48 MTF-Storm, 64–66 Multilayer attack model, 51 Multiprocessor systems-on-chip (MPSoCs), 50 N National Cybersecurity and Communications Integration Center (NCCIC), 4 National security, 26 National Transportation Safety Board (NTSB), 40 National Vulnerability Database (NVD), 41, 63 NCCIC, 26 NCCIC Cyber Security Evaluation Tool (CSET)®, 26 Networked control system, 54 Network-on-chip adapter, 52 9/11 Commission Report, 41 NIST ICS security, 27 smart grid cybersecurity, 27 SP 800-53, 27, 28 Special Publication 800-61, 28 NIST Framework for Improving Critical Infrastructure Cybersecurity, 17 NIST Platform Firmware Resiliency Guidelines, 26, 27 Non-parametric approximators, 78 Notpetya attack, 2 O Object constraint language (OCL), 48 Open-source software, 5 Operational technology (OT), 6 Organizational life cycle processes, 22 OSI reference model, 63

P
Packet field manipulations, 65
Patching, 5
Permanent faults, 13
Physical attacks, 3
Physical faults, 13
Physical plants, 30
Physical safety, 1, 13
Post-deployment, 44
Posteriori error covariance matrix, 81
Potential sensor failure, 77
Power grid system, 75, 77
Pre-deployment, 44
Primary life cycle processes, 22
Privacy, 7
Privilege escalation attacks, 49, 53
Process reference model, 22
Processing/network elements, 52
Processor security
  embedded computing systems, 47
  root-of-trust, 48
  side channel attacks, 48
Process-oriented model, 17
Production certificates, 19
Profile-based monitoring, 66, 67
Programmable logic controllers (PLC), 4
Protocol data units (PDUs), 64

Q
QoS-aware service-oriented architectures, 56
Quality management systems
  CMMI-DEV®, 22
  ISO 9000, 21
  ISO 9001, 21
  ISO/IEC 15504 Automotive SPICE™, 22
  principles, 21
  safety task implementation, 22
Quality planning process, 21
Quality-of-service (QoS) attacks, 50, 54, 55

R
Real-time embedded computing systems, 1
Real-time embedded systems, 53
Reconnaissance, 65
Reliability, 11, 13
Replay attacks, 30
Requirements, 55
Residual risk, 12
Ring policy, 26
Risk, 11
Risk analysis, 12
Risk assessment, 12
Risk assessment activity, 17
Risk management
  appropriateness, 12
  controlled/eliminated, 12
  and costs, 12
  desirability, 12
  FMEA, 12
  hazard risk index matrix, 12, 13
  residual risk, 12
  risk analysis, 12
  risk assessment, 12
  risk planning, 12
  risk probabilities, 12
  RPN, 12, 13
  safety management processes, 12
  unacceptable risk, 12
Risk matrix, 24
Risk planning, 12
Risk Priority Number (RPN), 12, 13, 37
Risk probabilities, 12
Risks
  identify and mitigate, 40
  management, 42
  RPN, 37
  safety analysis, 35
  security testing of software, 44
  security vulnerability, 36
Root-of-trust, 48, 56
Run-time monitoring, 66, 67
  and testing, 60
Run-time security monitor (RSM), 67, 68

S
Safety
  concentrates, 11
  controllers, 4
  cyber-physical attacks, 1
  design processes (see Design processes)
  engineering, 1
  functional, 11
  hazards, 3
  multiple causes, 3
  physical, 1, 13
  physical damage, 3
  risk analysis, 29
  and security (see Security)
  vs. security design processes, 29–31
Safety analysis, 23, 35
Safety Basis Information System (SBIS), 40
Safety-critical system designs, 7
Safety hazards, 36
Safety management processes, 12
Safety-oriented design processes, 42
Safety-oriented fault tree methodologies, 37
Safety planning, 22
Safety task implementation, 22
Security
  computer, 1
  design processes (see Design processes)
  emphasizes architecture and coding, 11
  internet, 1
  patching, 5
  and safety (see Safety)
  vs. safety design processes, 29–31
  vulnerabilities, 3, 29
Security analysis, 35
Security testing
  and fuzz, 61–62
  evaluation, 60
  properties, 60
  verification and validation, 60
Security violations, 19
Service description language, 55
Service-oriented architecture, 54–56
Set theory, 25, 26
Side channel attacks, 48
Side channel vulnerabilities, 31
Signature-based systems, 66
Smart grid applications, 55
Smart grid cybersecurity, 27
Software-as-a-medical-device, 21
Software code analysis, 61
Software safety methodology, 23
Software safety threats, 42
Specification execution, 67
SRAM attack, 54
StackGuard, 61
STAMP methodology, 23
Standards
  and certification (see Certification)
  modified V-process, 20
Static analysis, 61
Statistical test, 80
STRIDE, 52
Stuck-at fault, 13
Stuxnet, 4, 30, 74
Supplemental type certificates (STCs), 19
Supporting life cycle processes, 22
Switches/network adapters, 52
Symbolic execution, 61
System Safety Program Plan, 24
System under test (SUT), 61
System—computing nodes, 73
System-level design algorithms, 51
T
Taint analysis, 62
TaintCheck, 61
Therac-25 medical radiation device, 3
Threat analysis models
  cyber-physical kill chain, 39, 40
  failure mode effects analysis worksheet analysis, 38, 39
  functional hazard analysis worksheet analysis, 38
  RPN, 37
  safety-oriented fault tree methodologies, 37
  threat trees, 37
  TPN, 38
  VPN, 37
Threat priority number (TPN), 38
Threats, 2, 47
  application-level, 35
  architecture-level, 35
  authorization domains, 42
  compound, 36, 37
  design flaws, 35
  device-level, 35
  improper authorization, 41, 42
  iterative threat analysis methodology, 42, 43
  mitigation, 43, 44
  software safety, 42
  system design process, 35
Threat trees, 36, 37
Timing attacks, 30
Timing-oriented attacks, 30
Timing vulnerabilities, 50
Tire pressure monitoring system (TPMS), 3
Trade secrets, 7
Transaction-oriented systems, 30
Transient faults, 13
Triconex, 4
Trust zones, 44

U
Ukraine power grid attack, 4, 6
Unacceptable risk, 12

V
Valid computations, 52
V methodology, 24, 25, 30
V model, 7
Vulnerabilities, 3–5, 7, 8, 17
  ARM Mbed, 41
  CVE, 41
  CVSS, 41
  databases characterize software, 41
  firmware, 40
  9/11 Commission Report, 41
  NTSB, 40
  NVD, 41
  poor safety-oriented procedure, 36
  safety engineering uses reports, 40
Vulnerability analysis
  admissible model, power grid, 76
  arbitrary time intervals, 74
  CPS application, 74
  data filtering, 75
  data monitor, 75
  design space, 78
  dReal, 75
  existential problem, 77
  FDI attacks, 75
  HVAC system, 75
  measured values, 75
  measurement error, 77
  potential sensor failure, 77
  power flow equations, 76
  power grid configurations, 77
  power grid state estimation process, 77
  power grid system, 75, 77
  power injection equations, 76
  SMT solver, 75
  state estimator, 76
  value ranges, 75
Vulnerability priority number (VPN), 37

W
W32.Stuxnet, 4
Western Electric 5ESS telephone switching system, 5
White-box fuzzing, 61

Z
Zero-day exploit, 29
Zero-day vulnerability, 29
Zero-mean Gaussian distribution, 80