Reliability Engineering and Computational Intelligence (Studies in Computational Intelligence, 976) [1st ed. 2021] 3030745554, 9783030745554


Table of contents :
Preface
Contents
Mathematical and Computational Intelligence Methods in Reliability Engineering
System Reliability Analysis and Assessment by Means of Logic Differential Calculus and Hasse Diagram
1 Introduction
2 Structure Function and Logic Differential Calculus
2.1 Logic Differential Calculus
2.2 Minimal Path Vectors and Logic Differential Calculus
2.3 Minimal Cut Vectors and Logic Differential Calculus
2.4 Hand Calculation Example
3 Graph of the Structure Function
3.1 Hasse Diagram of an Ordered Set of Variables
3.2 State Diagram of the Structure Function of a System
3.3 Ordered and Ordered Weighted Graphs of the System
3.4 How to Build the Ordered Graph?
3.5 Reliability Computing Algorithm
4 Conclusion
References
The Survival Signature for Quantifying System Reliability: An Introductory Overview from Practical Perspective
1 Introduction
2 Survival Signature
3 Exchangeability of Components' Failure Times
4 Computation, Simulation and Inference
5 Recent Developments
6 Further Considerations
References
Application of Fuzzy Decision Tree for the Construction of Structure Function for System Reliability Analysis Based on Uncertain Data
1 Introduction
2 MSS Structure Function
3 Principal Steps of Method for Structure Function Construction
4 Data Transformation
5 Fuzzy Decision Tree Induction Based on Cumulative Mutual Information
6 Case Study: Reliability Analysis of the System Failure Prediction
7 Conclusion
References
Unavailability Optimization of a System Undergoing a Real Ageing Process Under Failure Based PM
1 Introduction
2 Notation
3 Renewal Process Model
3.1 Unavailability Analysis
4 Unavailability Optimization of a Selected System
4.1 Selected System with Latent Failures, Inspection Period τ = 120 Days
4.2 Optimization of FBM Under Unavailability Restriction
4.3 Optimization of Inspection Period Under Unavailability Restriction
4.4 Optimization of FBM at Fixed Inspection Period and Under Unavailability Restriction
5 Conclusions
References
Minimal Filtering Algorithms for Convolutional Neural Networks
1 Introduction
2 Preliminary Remarks
3 Minimal Filtering Algorithms
3.1 Algorithm 1, M = 3
3.2 Algorithm 2, M = 5
3.3 Algorithm 3, M = 7
3.4 Algorithm 4, M = 9
3.5 Algorithm 5, M = 11
4 Implementation Complexity
5 Conclusion
References
Digital Technologies in Reliability Engineering
New Challenges and Opportunities in Reliability Engineering of Complex Technical Systems
1 Introduction
2 The Probabilistic Risk Assessment Process
2.1 Current Process
2.2 Game Changers
2.3 Envisioned Process
2.4 Discussion
3 Model-Based Safety Assessment
3.1 The Promise of Model-Based Risk Assessment
3.2 The S2ML+X Paradigm
3.3 S2ML in a Nutshell
3.4 AltaRica 3.0
3.5 Textual Versus Graphical Representations
4 Challenges
4.1 Transforming Big Data into Smart Data
4.2 Handling the Increasing Complexity of Systems
4.3 Computational Complexity of Probabilistic Risk Assessment
4.4 Integrating Seamlessly Models and Data Sets into the Digital Twin
4.5 Managing the Change
5 Conclusion
References
Development of Structured Arguments for Assurance Case
1 Introduction
1.1 History and Concept of Assurance Case
1.2 Goal and Structure
2 State of the Art
2.1 Assurance Case for Attributes Assessment
2.2 Assurance Case Based Certification
2.3 Assurance Based Development
2.4 Assurance Case for Knowledge Management
2.5 Improvement of Argumentation
3 Development of the Structured Argumentation Method
3.1 Transformation of Typical Arguments in a Structured Argument Form
3.2 Argumentation Improvement: Hierarchy of Requirements and Templates of Structured Text
3.3 Algorithm for the Structured Argumentation Method
4 Case Study: Application of the Structured Argumentation Method
5 Conclusion
References
Making Reliability Engineering and Computational Intelligence Solutions SMARTER
1 Introduction
2 Basic Building Blocks of Effective RECI Solutions
2.1 RECI as Part of EA
2.2 Viable RECI Algorithms for EA
2.3 Ontology
2.4 Visualisation
3 SMARTER Projects for RECI
3.1 Reach
3.2 Exciting
3.3 Specific
3.4 Realistic
3.5 Measurable
3.6 Assignable
3.7 Time-Related
4 Conclusion
References
Method for Determining the Structural Reliability of a Network Based on a Hyperconverged Architecture
1 Introduction
1.1 Motivation
1.2 State of the Art
1.3 Goal and Tasks
2 Assessing the Structural Reliability of a Module of the Network Based on HCP
2.1 Modified Moore-Shannon Method for Assessing the Structural Reliability of a Network Based on HCP
2.2 Selecting the Network Reliability Indices on the HCP
2.3 The Algorithm of Structural Reliability Calculation
3 Results and Discussion
4 Conclusions
References
Database Approach for Increasing Reliability in Distributed Information Systems
1 Introduction
2 Related Work
3 Our Contribution
4 Experiments
5 Conclusion
References
Time Dependent Reliability Analysis of the Data Storage System Based on the Structure Function and Logic Differential Calculus
1 Introduction
2 Structure Function
3 Logic Differential Calculus
4 Importance Measures
4.1 Reliability Importance Measures
4.2 Lifetime Importance Measures
5 Case Study
6 Conclusion
References
Energy Efficiency for IoT
1 Motivation
1.1 Demarcation Between IoP, IoS and IoT
1.2 IoT Enabling Network Technologies
1.3 Case Study 1: Energy-Efficient IoT with LoRa WAN
1.4 To the Structure of this Work
2 Energy-Efficient Approaches and Solutions
2.1 Energy Efficiency for Combined Infrastructure Wireless Sensor Networks
2.2 Case Study 2: Multi-Layered Monitoring and Control for Infrastructure WSN
3 A Multi-Layered Approach and the Principles for Energy Efficiency in WSN and WPAN
4 Energy Efficiency in Contactless Communication Via RFID and NFC
4.1 Energy Efficiency Via RFID and NFC
4.2 Case Study 3: Energy-Efficient Monitoring and Management of Farm Animals Via RFID and Wi-Fi
5 Conclusions and Outlook
References
Image Analysis and Other Applications of Computational Intelligence in Reliability Engineering
Knowledge-Based Multispectral Remote Sensing Imagery Superresolution
1 Introduction
2 Spectral Band Translation
3 Image Superresolution
3.1 Fourier Transform
3.2 Image Shift Direction
3.3 Arbitrary Shift
3.4 Pixel Number Reduction
3.5 Linear Regression Model with a Priori Data
4 Convolutional Neural Networks Implement
5 Actual Resolution Evaluation
6 Results
7 Conclusions
References
Neural Network Training Acceleration by Weight Standardization in Segmentation of Electronic Commerce Images
1 Introduction
2 Training Set
3 Weight Standardization
4 Image Segmentation Using Yolact
5 Experiments and Results
6 Discussions and Conclusion
References
Waterproofing Membranes Reliability Analysis by Embedded and High-Throughput Deep-Learning Algorithm
1 Introduction
2 Problem Formulation
2.1 Water-Repellent Membranes: Some Comments
2.2 The Reliability Model of Membrane Surface
3 Research Methodology
3.1 Method Overview
3.2 Dataset
3.3 Image Recognition: Basic Technique
4 Results
5 Conclusions
References
Network of Autonomous Units for the Complex Technological Objects Reliable Monitoring
1 Introduction
2 The Complex Technological Objects Monitoring
3 IoT for Technological Objects Monitoring
4 Monitoring System Design
5 System Graphic Data Processing
5.1 Solving the Problem of Visual Navigation by Onboard Cameras of a Mobile Unit
5.2 Solving the Problem of Visual Navigation of a Mobile Unit by External Stationary Cameras
6 Conclusions
References
A Correlative Method to Rank Sensors with Information Reliability: Interval-Valued Numbers Case
1 Introduction
2 Theoretical Information
2.1 Interval-Valued Numbers and Sequences
2.2 Calculation of the Correlation Coefficient Between Sequences of Interval-Valued Numbers
3 Setting the Objectives
4 The Method
5 Numerical Example
6 Conclusion
References
COVID-19 Pandemic Risk Analytics: Data Mining with Reliability Engineering Methods for Analyzing Spreading Behavior and Comparison with Infectious Diseases
1 Introduction
2 Goal of Research Study
3 Methods
3.1 Weibull Distribution Model
3.2 Cox-Stuart Trend Test
4 DataBase
5 Data Quality and Impact on Uncertainty
6 Analyses of the COVID-19 Spreading Behavior
6.1 Infection (Confirmed Cases) Before Lockdown
6.2 Infection (Confirmed Cases) After Lockdown
6.3 Second Wave Detection
6.4 Infection (Confirmed Cases) Second Wave
7 Comparison of the COVID-19 Spreading Behavior with Other Infectious Diseases
7.1 Influenza and Measles
7.2 Comparison: COVID-19 Versus Influenza Versus Measles
8 Summary
References


Studies in Computational Intelligence 976

Coen van Gulijk Elena Zaitseva   Editors

Reliability Engineering and Computational Intelligence

Studies in Computational Intelligence Volume 976

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, selforganizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/7092

Coen van Gulijk · Elena Zaitseva Editors

Reliability Engineering and Computational Intelligence

Editors Coen van Gulijk Healthy Living, TNO Leiden, The Netherlands School of Computing and Engineering University of Huddersfield Huddersfield, UK

Elena Zaitseva Faculty of Management Science and Informatics University of Zilina Zilina, Slovakia

Faculty of Technology Policy and Management Delft University of Technology Delft, The Netherlands

ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-74555-4 ISBN 978-3-030-74556-1 (eBook) https://doi.org/10.1007/978-3-030-74556-1 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This volume addresses the fusion between two scientific domains: reliability engineering and computational intelligence. Reliability engineering is an established domain that has a very good practical and scientific background for the analysis of the reliability of systems. Computational intelligence is relatively new in reliability engineering. But it has been an equally well-established branch of research with many groups around the world attempting to develop useful computational intelligence tools in different fields. Today, the continuous drive for digitalization causes reliability engineering and computational intelligence to merge. Combining the fields paves the way to progress on big data analytics, uncertain information evaluation, reasoning, prediction, modeling, optimization, decision making, and of course: more reliable systems. The RECI workshop was held to discuss the merger of reliability engineering and computational intelligence and progress in the field. The workshop was an online scientific event held from October 27–29, 2020 (https://ki.fri.uniza.sk/RECI2020/). This volume combines several papers submitted to the workshop. Out of 68 paper submissions, 47 were selected for discussion and, after review by 2 or 3 peers, 18 were accepted for publication in this volume. Although there was some overlap in the papers, progress tended to cluster around three subjects: mathematical methods for RECI, digital techniques for RECI, and progress in RECI applications, of which most are based on image analysis. The works in this volume are organized according to these clusters. The first cluster focuses on mathematical techniques and is entitled “Mathematical and computational intelligence methods in reliability engineering.” This cluster contains papers explaining system reliability techniques based on logic differential calculus and Hasse diagrams, survival signatures, fuzzy random forests, syntax trees, failure-based prevention models, and filtering algorithms for convolutional neural networks. The second cluster, “Digital technologies in reliability engineering”, covers equally varied subjects for digital techniques: it starts from relatively abstract insights into how digital modeling technologies could replace mathematical modeling, building safety assurance cases for software, and a business approach to RECI, and moves into practicalities for hyperconverged systems, reliability of distributed processing and energy optimization of IoT systems.


Progress on image analysis is reported on: image superresolution, weight standardization in neural network techniques, deep learning, autonomous units for reliable monitoring, and complex monitoring systems (though the last two do not treat image analysis only). And last but not least, progress is reported on COVID-19. This last cluster is entitled: “Image analysis and other applications of computational intelligence in reliability engineering.” We believe that the work recorded in this volume is relevant for scientists interested in fusing problems of reliability engineering and computational intelligence. Whether they be researchers from an academic or business background, whether they be dealing with risk and reliability analysis or computational intelligence, we believe this work contributes to their understanding of the subject. We believe that the work is equally useful for IT companies traditionally working on computational intelligence, and developers of engineering systems traditionally working on reliability engineering. We thank the local organizers for their efforts in making the workshop a successful one. We thank all authors and reviewers for their excellent contributions, and we thank our funders through the “Advanced Centre for Ph.D. Students and Young Researchers in Informatics”—ACeSYRI (610166-EPP-1-2019-1-SK-EPPKA2-CBHE-JP) of the European Union’s Erasmus+ programme and “New Methods Development for Reliability Analysis of Complex System” (APVV-18-0027) of the Slovak Research and Development Agency.

Leiden, The Netherlands
Zilina, Slovakia
December 2020

Coen van Gulijk Elena Zaitseva

Contents

Mathematical and Computational Intelligence Methods in Reliability Engineering

System Reliability Analysis and Assessment by Means of Logic Differential Calculus and Hasse Diagram (Nicolae Brinzei, Jean-François Aubry, Miroslav Kvassay, and Elena Zaitseva) 3

The Survival Signature for Quantifying System Reliability: An Introductory Overview from Practical Perspective (Frank P. A. Coolen and Tahani Coolen-Maturi) 23

Application of Fuzzy Decision Tree for the Construction of Structure Function for System Reliability Analysis Based on Uncertain Data (Jan Rabcan, Peter Sedlacek, Igor Bolvashenkov, and Jörg Kammermann) 39

Unavailability Optimization of a System Undergoing a Real Ageing Process Under Failure Based PM (Radim Briš and Pavel Jahoda) 55

Minimal Filtering Algorithms for Convolutional Neural Networks (Aleksandr Cariow and Galina Cariowa) 73

Digital Technologies in Reliability Engineering

New Challenges and Opportunities in Reliability Engineering of Complex Technical Systems (Antoine Rauzy) 91

Development of Structured Arguments for Assurance Case (Vladimir Sklyar and Vyacheslav Kharchenko) 115

Making Reliability Engineering and Computational Intelligence Solutions SMARTER (Coen van Gulijk) 133

Method for Determining the Structural Reliability of a Network Based on a Hyperconverged Architecture (Igor Ruban, Heorhii Kuchuk, Andriy Kovalenko, Nataliia Lukova-Chuiko, and Vitalii Martovytsky) 147

Database Approach for Increasing Reliability in Distributed Information Systems (Roman Ceresnak and Karol Matiasko) 165

Time Dependent Reliability Analysis of the Data Storage System Based on the Structure Function and Logic Differential Calculus (Patrik Rusnak and Michal Mrena) 179

Energy Efficiency for IoT (Andriy Luntovskyy and Bohdan Shubyn) 199

Image Analysis and Other Applications of Computational Intelligence in Reliability Engineering

Knowledge-Based Multispectral Remote Sensing Imagery Superresolution (Sergey A. Stankevich, Iryna O. Piestova, Mykola S. Lubskyi, Sergiy V. Shklyar, Artur R. Lysenko, Oleg V. Maslenko, and Jan Rabcan) 219

Neural Network Training Acceleration by Weight Standardization in Segmentation of Electronic Commerce Images (V. Sorokina and S. Ablameyko) 237

Waterproofing Membranes Reliability Analysis by Embedded and High-Throughput Deep-Learning Algorithm (Darya Filatova and Charles El-Nouty) 245

Network of Autonomous Units for the Complex Technological Objects Reliable Monitoring (Oleksandr Chemerys, Oleksandr Bushma, Oksana Lytvyn, Alexei Belotserkovsky, and Pavel Lukashevich) 261

A Correlative Method to Rank Sensors with Information Reliability: Interval-Valued Numbers Case (Mykhailo O. Popov, Oleksandr V. Zaitsev, Ruslana G. Stambirska, Sofiia I. Alpert, and Oleksandr M. Kondratov) 275

COVID-19 Pandemic Risk Analytics: Data Mining with Reliability Engineering Methods for Analyzing Spreading Behavior and Comparison with Infectious Diseases (Alicia Puls and Stefan Bracke) 293

Mathematical and Computational Intelligence Methods in Reliability Engineering

System Reliability Analysis and Assessment by Means of Logic Differential Calculus and Hasse Diagram Nicolae Brinzei, Jean-François Aubry, Miroslav Kvassay, and Elena Zaitseva

Abstract The aim of this research work concerns the development of an integrated approach based on logic differential calculus and the Hasse diagram. Logic differential calculus is a powerful mathematical methodology that allows determination of the minimal path/cut sets of systems represented by a Boolean structure function. Minimal path/cut sets are important in reliability analysis because they describe the minimal component configurations in which the system operates or fails. The Hasse diagram is a mathematical diagram allowing the representation of the partial order between the system states. It has the advantage of unifying in the same kind of model (a graph describing the state diagram) the modeling of systems represented either by Boolean structure functions or by stochastic processes. Such a state diagram can be generated automatically from the minimal path sets obtained previously by logic differential calculus. Afterwards, the analytical expression of system reliability is computed from this state diagram.

Keywords System reliability · Hasse diagram · Direct partial Boolean derivative · Minimal cut set · Minimal cut vector · Minimal path set · Minimal path vector · Structure function

N. Brinzei · J.-F. Aubry Université de Lorraine, CNRS, CRAN (Research Center for Automatic Control), 54000 Nancy, France e-mail: [email protected] J.-F. Aubry e-mail: [email protected] M. Kvassay · E. Zaitseva (B) University of Zilina, Univerzitna 8215/1, 01026 Zilina, Slovakia e-mail: [email protected] M. Kvassay e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_1


1 Introduction Assessment of system reliability belongs to important tasks in many engineering fields. Such an assessment can be done using various mathematical methodologies depending on the mathematical representation of the system. Structure function is one of the most common representations [1–3]. It takes into account topology and structure of the system, which allows us to use it in structure analysis of the system. Based on the structure function, it is possible to compare systems with various topologies and identify one with the most reliable topology, or evaluate importance of individual components or groups of components on system operation [2, 4, 5]. Such an evaluation is done using special indices known as importance measures [6]. Calculation of these measures is usually based on one of two existing concepts. The first concept is based on the term of criticality and it evaluates importance of a specific system component considering situations in which the component is critical for system operation, i.e., in which a failure (repair) of the component results in system failure (repair) [5, 7]. Another concept is based on identification of Minimal Cut Sets (MCSs) and Minimal Path Sets (MPSs) [8–10]. MCSs belong to key terms of reliability analysis. They represent minimal sets of components whose simultaneous failure results in a system failure [8, 9]. Their dual form is known as MPSs. An MPS agrees with a minimal group of system components whose functioning ensures that the system is functioning. Knowledge of MCSs and MPSs can be used not only to analyze importance of system components [10], but also to estimate total reliability of the system or plan system maintenance [3]. Although MCSs and MPSs belong to well-known concepts of reliability analysis, sometimes it can be quite complicated to identify them and very time consuming to compute exact reliability of the system. To avoid these problems, various approaches for their identification and manipulation with them have been developed. Examples involve methods based on decision diagrams [11, 12], transformation rules of Boolean algebra [12, 13], orthogonalization of Boolean functions [14], or application of truncation limits [13, 15]. Works [16, 17] deal with one more alternative approach used for identification of MCSs and MPSs. This approach is based on application of logic differential calculus, more specifically on use of Direct Partial Boolean Derivatives (DPBDs). Logic differential calculus is a mathematical methodology developed for analysis of Boolean functions [18, 19]. In reliability analysis, it can be used to analyze structure function and identify situations in which the component is critical. If we know all these situations for all system components, then we can combine them together and find all MCSs or MPSs of the system [16]. However, reliability analysis does not end after identification of MCSs or MPSs. An important task is computation of basic system characteristics such as reliability or unreliability of the system. In particular, authors in paper [20] propose an approach to compute system reliability and unreliability based MPSs and MCSs, but this approach can be used for coherent system only and needs forming of orthogonal representation which is not well formalized for technical implementation. Partially the forming of orthogonal form based on MPSs and MCSs can be solved by application of Binary Decision


Diagrams (BDDs) which are graph-type model [18, 21, 22]. But effective analysis of BDD can be implemented for optimal BDD that is a separate complex problem [18]. In paper [23] another approach with the use of a graph-type model for system reliability analysis based on MPSs and MCSs is proposed. This graph-type model is Hasse diagram which allows representing ordering of elements in partially ordered sets. This property of Hasse diagram is used for the calculation of reliability or unreliability of coherent and non-coherent system. Unlike the BDD method, this approach uses the monotonicity to define a Disjunctive normal form which is orthogonal form [18, 19]. The main interest of Hasse diagram is that it allows us to directly find the system reliability and unreliability. But in paper [23] the definition of system MPSs and MCSs are not considered. In this paper we propose to join two approaches of the system reliability analysis: definition of the system MPSs and MCSs with the use of DPBDs [16, 17] and calculation of system reliability and unreliability based on application of Hasse diagrams [3, 23].

2 Structure Function and Logic Differential Calculus Almost each system is composed of elements, which are named as components of the system. If we assume that the system and each of its components can be in one of two possible states, which are failure represented as state 0 and functioning agreeing with state 1, then the dependency between states of the system components and state of the system can be defined using structure function. The structure function is a map defined as follows [3, 5]: φ(x) = φ(x1 , x2 , . . . , xn ) : {0, 1}n → {0, 1},

(1)

where n is a number of system components, xi , for i = 1, 2, . . . , n, is a variable that defines state of component i, and x = (x1 , x2 , . . . , xn ) is a vector of states of the system components (state vector). Depending on the properties of this function, two types of systems can be recognized. The first ones are coherent systems in which a failure of any system component cannot lead to improving in the system functioning. If such a situation can occur, then we get the second type of systems, which are known as non-coherent [3]. The specific of coherent system is monotonic structure function (1). In the rest of the paper, we consider only coherent systems. Knowledge of structure function is usually not sufficient to perform complex reliability analysis. For this purpose, we need also information about the probabilities of individual state vectors. If we assume that the system components are independent, then we need only information about the state probabilities of individual system components. The probability that component i is in state 1 is known as availability of the component and the probability that it is in state 0 as unavailability of the component. In the rest of the paper, we use the following natation regarding these


two basic probabilities: pi = Pr{xi = 1}, qi = Pr{xi = 0}, pi + qi = 1, for i = 1, 2, . . . , n.

(2)

Using these probabilities and structure function, we can define basic reliability measures that are system reliability R and system unreliability U: R( p) = Pr{φ(x) = 1}, U (q) = Pr{φ(x) = 0}, R( p) + U (q) = 1,

(3)

where p = ( p1 , p2 , . . . , pn ) is a vector of availabilities of the system components and q = (q1 , q2 , . . . , qn ) is a vector of their unavailabilities. In addition to measures of system reliability R and system unreliability U (3), the reliability analysis based on the system structure function includes other indices and measures, for example, importance measures, evaluation of critical states and other [5, 6, 16]. The calculation of such indices and measures need special mathematical methods and methodologies. One of such mathematical methodology is logic differential calculus.
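To make Eqs. (1)–(3) concrete, the following short Python sketch (an illustration added here, not part of the chapter; the structure function, component availabilities and function names are assumptions) evaluates a structure function over all state vectors and computes the system reliability R(p) and unreliability U(q) by exhaustive enumeration. The structure function chosen is the two-line parallel arrangement used later in the chapter's hand-calculation example.

```python
from itertools import product

def phi(x):
    """Illustrative structure function: two parallel lines of two components each,
    i.e. phi(x) = (x1 AND x2) OR (x3 AND x4)."""
    x1, x2, x3, x4 = x
    return int((x1 and x2) or (x3 and x4))

def system_reliability(phi, p):
    """R(p) = Pr{phi(x) = 1} for independent components with availabilities p[i]."""
    n = len(p)
    r = 0.0
    for x in product((0, 1), repeat=n):
        if phi(x) == 1:
            prob = 1.0
            for xi, pi in zip(x, p):
                prob *= pi if xi == 1 else (1.0 - pi)  # probability of this state vector
            r += prob
    return r

p = [0.9, 0.9, 0.9, 0.9]        # illustrative component availabilities p_i
R = system_reliability(phi, p)  # system reliability
U = 1.0 - R                     # system unreliability, R + U = 1
print(R, U)
```

Exhaustive enumeration grows as 2^n, which is one motivation for the structural methods discussed next.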

2.1 Logic Differential Calculus

Logic differential calculus allows analyzing the dynamic properties of Boolean functions [19]. Since the formal definition of the structure function (1) agrees with the definition of a Boolean function, this mathematical methodology can also be applied in the analysis of the structure function. For this purpose, DPBDs, which are part of it, are most suitable. This derivative allows investigating the influence of a specified change of a variable value on the change of the Boolean function value. For structure function φ(x), this derivative is defined as follows [16, 19]:

∂φ(j → j̄)/∂xi(s → s̄) = {φ(si, x) ↔ j} ∧ {φ(s̄i, x) ↔ j̄} = 1 if φ(si, x) = j and φ(s̄i, x) = j̄, and 0 otherwise,   (4)

where s ∈ {0, 1}, j ∈ {0, 1}, (si, x) = (x1, x2, …, xi−1, s, xi+1, …, xn), and the special symbol ↔ denotes the logical biconditional defined as follows:

a ↔ b = 1 if a = b, and 0 otherwise.   (5)

DPBD (4) has two different forms. The first one is DPBD ∂φ(1 → 0)/∂ xi (1 → 0), which allows us to find state vectors (1i , x) at which


a failure of component i causes the system to fail. These vectors are known as critical path vectors for component i [16]. The second one has the form ∂φ(0 → 1)/∂xi(0 → 1), and it can be used to identify state vectors (0i, x) at which a repair of the i-th system component results in a repair of the whole system. In this case, we speak of critical cut vectors for component i [16]. It is important to note that the DPBDs themselves do not depend on variable xi according to (4). Therefore, if we want to use them to find the considered state vectors, we have to combine them with logical biconditional (5) in the following way:

{xi ↔ s} ∂φ(s → s̄)/∂xi(s → s̄), for s ∈ {0, 1}.   (6)

Depending on value s, the points at which this formula takes a nonzero value correspond to critical path vectors (if s = 1) or critical cut vectors (if s = 0) for component i. It is worth noting that logical biconditional {xi ↔ 1} agrees with the positive literal xi and logical biconditional {xi ↔ 0} with the negative literal x̄i. Based on this, the previous formula obtains the following form:

xi ∂φ(1 → 0)/∂xi(1 → 0)   (7)

if we want to use it for identification of critical path vectors for component i, and the next one:

x̄i ∂φ(0 → 1)/∂xi(0 → 1)   (8)

if we are interested in finding critical cut vectors for the considered component [16]. Formulas (6), (7) and (8) imply that identification of specific state vectors of the system requires combining DPBDs computed for component i with variable xi defining state of the considered component. Furthermore, if we are interested in points at which a specific change of component i results in a specific change of system state, then we should compute DPBD ∂φ(1 → 0)/∂ xi (1 → 0) only at points (1i , x) of the structure function and DPBD ∂φ(0 → 1)/∂ xi (0 → 1) only at points (0i , x). This means that DPBD ∂φ(1 → 0)/∂ xi (1 → 0) should not be computed at points (0i , x) of the structure function because it analyzes consequences of the component failure, which requires the component should be working at the beginning, i.e., in state 1. In the similar way, DPBD ∂φ(0 → 1)/∂ xi (0 → 1) should not be computed at points (1i , x) if we are interested in situations in which a repair of component i results in a repair of the system. Taking this idea into account, we can define a special


type of DPBD, which is named the expanded DPBD. This derivative can be defined as follows [16, 17]:

∂eφ(s → s̄)/∂exi(s → s̄) = {xi ↔ s} ∂φ(s → s̄)/∂xi(s → s̄) ∨ ∗{xi ↔ s̄}
  = 1 if xi = s and (φ(si, x) = s and φ(s̄i, x) = s̄),
  = 0 if xi = s and (φ(si, x) = s̄ or φ(s̄i, x) = s),
  = ∗ if xi = s̄,
for s ∈ {0, 1}.   (9)

This derivative takes value 1 at points (si , x) at which a change of component i from state s to state s results in the same change of system state, value 0 at points (si , x) at which the considered change of the component state does not result in the same change of the system state, and value ∗ at points (s i , x), i.e., at points at which DPBD ∂φ(s → s)/∂ xi (s → s) should not be computed. Knowledge of critical path vectors and critical cut vectors for a specific component plays an important role in analysis of importance of individual components on system operation. This is done using various types of importance measures, such as structure, Birnbaum’s or criticality importance [6]. However, these specific state vectors do not allow us to evaluate how a component contributes to system reliability or unreliability, which is typically studied by one more importance measure known as Fussell-Vesely’s importance [8, 9]. To compute this measure, Minimal Path Vectors (MPVs) and Minimal Cut Vectors (MCVs) are needed [16]. These vectors are specific representations of MPSs, which agree with minimal sets of components whose simultaneous work ensures that the system will be working, and MCSs, which correspond to minimal sets of components whose simultaneous failure results in a failure of the whole system.
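As an illustration (a sketch added here, not code from the chapter; the structure function and names are assumptions), the expanded DPBD of Eq. (9) can be evaluated point by point from the structure function alone:

```python
def phi(x):
    # Illustrative structure function: phi(x) = (x1 AND x2) OR (x3 AND x4)
    x1, x2, x3, x4 = x
    return int((x1 and x2) or (x3 and x4))

def expanded_dpbd(phi, x, i, s=1):
    """Expanded DPBD  d_e phi(s -> s_bar) / d_e x_i(s -> s_bar)  at state vector x, see Eq. (9).
    Returns '*' if x[i] != s, 1 if changing component i from s to s_bar changes the
    system state from s to s_bar, and 0 otherwise."""
    s_bar = 1 - s
    if x[i] != s:
        return '*'
    x_to = x[:i] + (s_bar,) + x[i+1:]   # state vector (s_bar_i, x)
    return 1 if (phi(x) == s and phi(x_to) == s_bar) else 0

print(expanded_dpbd(phi, (1, 1, 0, 0), i=0, s=1))  # 1: failing component 1 here fails the system
print(expanded_dpbd(phi, (1, 1, 1, 1), i=0, s=1))  # 0: the second line keeps the system working
print(expanded_dpbd(phi, (0, 1, 0, 0), i=0, s=1))  # '*': component 1 is not in state 1 here
```

With s = 1 the call analyzes the consequences of a component failure, and with s = 0 the consequences of a component repair, matching the two uses of the expanded DPBD below.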

2.2 Minimal Path Vectors and Logic Differential Calculus

MPVs are a special type of state vectors, which define circumstances under which a failure of any working component results in a failure of the system [16]. Based on this definition, they can also be viewed as a special type of critical path vectors. More precisely, let us consider a state vector of the form (1i1, 1i2, …, 1inp, 0r1, 0r2, …, 0rn−np), where np is the number of system components that are working (are in state 1), n − np is the number of system components that failed (are in state 0), ik denotes components that are working, for k = 1, 2, …, np, and rk represents components that failed, for k = 1, 2, …, n − np. This state vector is an MPV if and only if it is a critical path vector for each of components i1, i2, …, inp.

Based on (7), this indicates that state vector (1i1, 1i2, …, 1inp, 0r1, 0r2, …, 0rn−np) is an MPV if and only if the following formula:

xi1 xi2 … xinp ∂φ(1 → 0)/∂xi1(1 → 0) ∂φ(1 → 0)/∂xi2(1 → 0) … ∂φ(1 → 0)/∂xinp(1 → 0)   (10)

takes a nonzero value at point (1i1, 1i2, …, 1inp, 0r1, 0r2, …, 0rn−np). This allows us to conclude that state vector (1i1, 1i2, …, 1inp, 0r1, 0r2, …, 0rn−np) is an MPV if and only if the next formula holds:

xi1 xi2 … xinp x̄r1 x̄r2 … x̄rn−np ∂φ(1 → 0)/∂xi1(1 → 0) ∂φ(1 → 0)/∂xi2(1 → 0) … ∂φ(1 → 0)/∂xinp(1 → 0) ↔ 1.   (11)

Now, let us consider a state vector x, and let us denote the set of components that are in state 1 according to this vector as N1(x) and the set of system components that are in state 0 as N0(x). Using this notation, formula (11) can be rewritten in the following more general form:

{ ∏k∈N1(x) xk } { ∏k∈N0(x) x̄k } { ∏k∈N1(x) ∂φ(1 → 0)/∂xk(1 → 0) } ↔ 1.   (12)

This formula allows us to decide whether a specific state vector is an MPV or not. If we compute its result for all possible state vectors x except vector (01, 02, …, 0n), for which the expression ∏k∈N1(x) ∂φ(1 → 0)/∂xk(1 → 0) is not defined, then we can find all MPVs of the system. Mathematically, the MPVs of a system with structure function φ(x) agree with the points at which the following formula takes a nonzero value:

∨ over x ∈ {0, 1}^n, x ≠ (01, 02, …, 0n) of ( { ∏i∈N1(x) xi } { ∏i∈N0(x) x̄i } { ∏i∈N1(x) ∂φ(1 → 0)/∂xi(1 → 0) } ).   (13)

n 

i=1

Clause

n  i=1

∂φ(1 → 0) xi ∨ xi ∂ xi (1 → 0)

  n xi .

(14)

i=1

xi in this formula ensures that the formula cannot take nonzero

value for state vector (01 , 02 , . . . , 0n ), which cannot be an MPV of a coherent system. To avoid a problem with this specific vector we can replace term xi ∨ n xi ∂φ(1 → 0)/∂ xi (1 → 0) by expanded DPBD (9) and Boolean conjunction ∧i=1 by special 3-valued conjunction named in [16] as -conjunction (definition of this operation is shown in Table 1). After this substitution, it is possible to remove clause

10

N. Brinzei et al.

Table 1 -conjunction of two expanded DPBDs ∂e φ(s→s) ∂e x j (s→s)

-conjunction ∂e φ(s→s) ∂e xi (s→s)

n 

*

0

1

*

*

0

1

0

0

0

0

1

1

0

1

xi , and MPVs of the system can be identified as points at which the following

i=1

formula holds: n i=1

∂e φ(1 → 0) ↔ 1, ∂e xi (1 → 0)

(15)

i.e., as points at which -conjunction of all expanded DPBDs analyzing consequences of component failure takes value 1. It is worth noting that this -conjunction takes value ∗ only in one point that is point (01 , 02 , . . . , 0n ) and value 0 in all points that are different from (01 , 02 , . . . , 0n ) and that do not correspond to MPVs [16].

2.3 Minimal Cut Vectors and Logic Differential Calculus As MPVs, MCVs represent a special type of state vectors. These vectors correspond to situations in which a repair of any failed component results in a repair of the system [16, 17]. As their name indicates, they can be viewed as a special type of   critical cut vectors of the form of 1r1 , 1r2 , . . . , 1rn−nc , 0i1 , 0i2 , . . . , 0inc where n c is a number of system components that failed (are in state 0), n − n c is a number of system components that are working (are in state 1), rk denotes components that are working, for k = 1, 2, . . . , n − n c , and i k represents components that failed, for k = 1, 2, . . . , n c . State vector having this form is an MCV if and only if it is a critical cut vector for each of components i 1 , i 2 ,…, i n c . As inthe case of MPVs, DPBDs can be used  to decide whether a state vector of the form of 1r1 , 1r2 , . . . , 1rn−nc , 0i1 , 0i2 , . . . , 0inc  is a MCV or not. However, in this case  DPBDs (8) have to be used. So, state vector 1r1 , 1r2 , . . . , 1rn−nc , 0i1 , 0i2 , . . . , 0inc is an MCV if and only if the next formula: xi1 xi2 . . . xinc

∂φ(0 → 1) ∂φ(0 → 1) ∂φ(0 → 1) ... ∂ xi1 (0 → 1) ∂ xi2 (0 → 1) ∂ xinc (0 → 1)

(16)

  takes nonzero value at point 1r1 , 1r2 , . . . , 1rn−nc , 0i1 , 0i2 , . . . , 0inc . Using the same ideas as those presented in the previous part, it can be shown that all MCVs of a system

System Reliability Analysis and Assessment ...

11

with structure function φ(x) can be identified by DPBDs ∂e φ(0 → 1)/∂e xi (0 → 1) as points of the structure function at which the following formula holds [16]: n i=1

∂e φ(0 → 1) ↔ 1. ∂e xi (0 → 1)

(17)

Unlike (15), -conjunction of all expanded DPBDs ∂e φ(0 → 1)/∂e xi (0 → 1) takes value ∗ only in point (11 , 12 , . . . , 1n ), which cannot be an MCV since no component in a situation defined by this state vector can be repaired.

2.4 Hand Calculation Example For illustration of the method of computation of MPVs and MCVs using logic differential calculus, let us consider a simple parallel system composed of 2 parallel lines from which each contains 2 components. The block diagram depicting layout of this system can be viewed in Fig. 1. Based on this diagram, the structure function of this system agrees with a Boolean function defined by Table 2. Firstly, let us compute MPVs of the considered system. For this purpose, we need to compute expanded DPBDs of the form of ∂e φ(1 → 0)/∂e xi (1 → 0) for each system component, i.e., for i = 1, 2, 3, 4. These derivatives can be seen in Table 3. Please note that each of these expanded DPDBs takes value ∗ at points (0i , x) of the structure function at which DPBD ∂φ(1 → 0)/∂ xi (1 → 0) should not be computed if we are interested in situations in which a failure of a component can occur. In the second step, we have to compute the -conjunction defined by Table 1 of all these four expanded DPBDs. This can be found in the last column in Table 3. Value 1 in this column identifies points of the structure function that agree with the MPVs of the system. Based on this, we can recognize that MPVs agree with vectors (0, 0, 1, 1) and (1, 1, 0, 0). Each of these 2 MPVs agrees with one MPS, whose elements agree with components that have state 1 in considered MPVs [16, 17]. So, the first MPV contains components 3 and 4 in state 1, therefore, it corresponds to MPS {3, 4}. The second MPV has also 2 components in state 1. In this case, these components are components 1 and 2. Because of that, MPV (1, 1, 0, 0) is a vector representation of MPS {1, 2}. If we look at Fig. 1, then we can see that this result agrees with our expectations.

Fig. 1 Simple parallel system composed of 4 components

12

N. Brinzei et al.

Table 2 Structure function of the system depicted in Fig. 1 x1

x2

x3

x4

φ(x)

0

0

0

0

0

0

0

0

1

0

0

0

1

0

0

0

0

1

1

1

0

1

0

0

0

0

1

0

1

0

0

1

1

0

0

0

1

1

1

1

1

0

0

0

0

1

0

0

1

0

1

0

1

0

0

1

0

1

1

1

1

1

0

0

1

1

1

0

1

1

1

1

1

0

1

1

1

1

1

1

Table 3 Expanded DPBDs of the form of ∂φ(1 → 0)/∂ xi (1 → 0) and their -conjunction computed based on the structure function defined by Table 1 x1

x2

x3

x4

φ(x)

∂e φ(1→0) ∂e x 1 (1→0)

∂e φ(1→0) ∂e x 2 (1→0)

∂e φ(1→0) ∂e x 3 (1→0)

∂e φ(1→0) ∂e x 4 (1→0)

-conjunction

0

0

0

0

0











0

0

0

1

0







0

0

0

0

1

0

0





0



0

0

0

1

1

1





1

1

1

0

1

0

0

0



0





0

0

1

0

1

0



0



0

0

0

1

1

0

0



0

0



0

0

1

1

1

1



0

1

1

0

1

0

0

0

0

0







0

1

0

0

1

0

0





0

0

1

0

1

0

0

0



0



0

1

0

1

1

1

0



1

1

0

1

1

0

0

1

1

1





1

1

1

0

1

1

1

1



0

0

1

1

1

0

1

1

1

0



0

1

1

1

1

1

0

0

0

0

0


Table 4 Expanded DPBDs of the form ∂eφ(0 → 1)/∂exi(0 → 1), for i = 1, 2, 3, 4, and their -conjunction, computed based on the structure function defined by Table 2 (column "i = k" gives the value of ∂eφ(0 → 1)/∂exk(0 → 1))

x1 x2 x3 x4 | φ(x) | i = 1 | i = 2 | i = 3 | i = 4 | -conjunction
0 0 0 0 | 0 | 0 | 0 | 0 | 0 | 0
0 0 0 1 | 0 | 0 | 0 | 1 | ∗ | 0
0 0 1 0 | 0 | 0 | 0 | ∗ | 1 | 0
0 0 1 1 | 1 | 0 | 0 | ∗ | ∗ | 0
0 1 0 0 | 0 | 1 | ∗ | 0 | 0 | 0
0 1 0 1 | 0 | 1 | ∗ | 1 | ∗ | 1
0 1 1 0 | 0 | 1 | ∗ | ∗ | 1 | 1
0 1 1 1 | 1 | 0 | ∗ | ∗ | ∗ | 0
1 0 0 0 | 0 | ∗ | 1 | 0 | 0 | 0
1 0 0 1 | 0 | ∗ | 1 | 1 | ∗ | 1
1 0 1 0 | 0 | ∗ | 1 | ∗ | 1 | 1
1 0 1 1 | 1 | ∗ | 0 | ∗ | ∗ | 0
1 1 0 0 | 1 | ∗ | ∗ | 0 | 0 | 0
1 1 0 1 | 1 | ∗ | ∗ | 0 | ∗ | 0
1 1 1 0 | 1 | ∗ | ∗ | ∗ | 0 | 0
1 1 1 1 | 1 | ∗ | ∗ | ∗ | ∗ | ∗



Secondly, let us identify the MCVs of the system with structure function defined by Table 2. In this case, we have to compute expanded DPBDs ∂e φ(0 → 1)/∂e xi (0 → 1) for each of the 4 system components. These expanded DPDBs can be found in the middle part of Table 4. If we compute the -conjunction of all these 4 expanded DPBDs using definition introduced by Table 1, then we are able to locate MCVs of the system as points at which this -conjunction takes value 1 (the last column in Table 4). Based on this, we can conclude that MCVs of the system are state vectors (0, 1, 0, 1), (0, 1, 1, 0), (1, 0, 0, 1), and (1, 0, 1, 0). MCSs corresponding to these MCVs can be obtained by selecting components that are in state 0 in a considered MCV [16]. So, these 4 state vectors correspond to MCSs {1, 3}, {1, 4}, {2, 3}, and {2, 4} respectively. If we again check this result using Fig. 1, we can recognize that the result is correct.
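The conversion from vectors to sets described above can be stated very compactly; the snippet below (an added illustration, not code from the chapter, with the example MPVs and MCVs taken from the results just derived) reproduces the MPSs and MCSs of the system in Fig. 1.

```python
def mps_from_mpv(mpv):
    """Minimal path set = components that are in state 1 in the MPV (1-based labels)."""
    return {i + 1 for i, s in enumerate(mpv) if s == 1}

def mcs_from_mcv(mcv):
    """Minimal cut set = components that are in state 0 in the MCV (1-based labels)."""
    return {i + 1 for i, s in enumerate(mcv) if s == 0}

print([mps_from_mpv(v) for v in [(0, 0, 1, 1), (1, 1, 0, 0)]])
# [{3, 4}, {1, 2}]
print([mcs_from_mcv(v) for v in [(0, 1, 0, 1), (0, 1, 1, 0), (1, 0, 0, 1), (1, 0, 1, 0)]])
# [{1, 3}, {1, 4}, {2, 3}, {2, 4}]
```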

3 Graph of the Structure Function

We saw in the previous paragraphs that we only consider coherent systems here. Mathematically, this means that the structure function of the system is monotone. To define the monotonicity of a function, we first have to define an order relation on its variables. The order relation can be defined based on one more mathematical approach of Boolean logic, which is the Hasse diagram [3, 23].


3.1 Hasse Diagram of an Ordered Set of Variables

Following Kaufmann et al. [20], let us now consider two different values A and B of the state vector x = (x1, x2, …, xn) ∈ {0, 1}^n: A = (a1, a2, …, an) and B = (b1, b2, …, bn); ai, bi ∈ {0, 1}. The values 0 and 1 are here integers, so that we have of course 0 < 1. Definitions:
(1) A = B if and only if ∀i ∈ {1, 2, …, n}: ai = bi;
(2) A upper bounds B, denoted by A ≻= B, if and only if ∀i ∈ {1, 2, …, n}: ai ≥ bi;
(3) A lower bounds B, denoted by A ≺= B, if and only if ∀i ∈ {1, 2, …, n}: ai ≤ bi.

We can verify that these relations on {0, 1}n are reflexive (A ≺= A), transitive (A ≺= B, B ≺= C, ⇒ A ≺= C) and antisymmetric (A ≺= B, B ≺= A, ⇒ A = B). They are therefore order relations. Let us give some examples: (1, 0, 0, 1, 0) ≺= (1, 1, 0, 1, 0) ≺= (1, 1, 0, 1, 1). But (1, 0, 1, 1, 0) is not in relation with (1, 1, 0, 1, 0). We can give a representation of this order relation by its Hasse diagram (Fig. 2 in the case of a system with 4 components) in which one arc between two values of

Fig. 2 Hasse diagram of a four-component state vector (nodes: all 16 state vectors, from 0000 at the bottom to 1111 at the top)


x indicates the existence of the relation. A Hasse diagram is a graph [24] of ordered sets in which the nodes are the values of the variable, the arcs materializing the order relation. It is a simplified version of the sagittal diagram in which all the loops corresponding to the reflexivity and all the arcs corresponding to the transitivity are removed. Hasse diagram defined initially into the ordered set theory and used also in reliability theory [25, 26] can be automatically determined from the number of system components. In the Hasse diagram, we can see that all the nodes are not in relation with all the others. This means that the order relation is only partial. For the clarity of the drawing, we place on the same horizontal line all the nodes of same order, that is to say the number of variables whose state is “1”. So the bottom node corresponds to {0}n and the top node to {1}n .
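The partial order and its Hasse diagram are easy to generate programmatically. In the sketch below (an added illustration, not from the chapter; names are assumptions), an arc of the Hasse diagram connects two state vectors that differ in exactly one component, which is exactly the covering relation of the componentwise order on {0, 1}^n.

```python
from itertools import product

def leq(a, b):
    """Componentwise order: a <= b iff a_i <= b_i for every i."""
    return all(ai <= bi for ai, bi in zip(a, b))

def hasse_edges(n):
    """Arcs of the Hasse diagram of ({0,1}^n, <=): covering pairs differ in exactly one bit."""
    nodes = list(product((0, 1), repeat=n))
    return [(a, b) for a in nodes for b in nodes
            if leq(a, b) and sum(ai != bi for ai, bi in zip(a, b)) == 1]

edges = hasse_edges(4)
print(len(edges))   # 32 arcs for n = 4, i.e. n * 2^(n-1)
```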

3.2 State Diagram of the Structure Function of a System If in the Hasse diagram we associate to each node the corresponding value of y = φ(x1 , x2 , . . . , xn ), we get the state diagram of the structure function. We agree to frame the nodes with full lines when φ(x) = 1 and with dashed lines when φ(x) = 0. Figure 3 shows the state diagram of the system presented in Fig. 1. We can see in this diagram that on any path starting from the bottom and arriving to the top, the

Fig. 3 State diagram of the parallel system presented in Fig. 1 (same nodes as in Fig. 2; nodes with φ(x) = 1 are framed with full lines, nodes with φ(x) = 0 with dashed lines)


value of φ(x) encounter a sole change. This is the illustration of the monotony of the structure function φ(x). Let us have some other considerations on this diagram. To each node of the diagram corresponds either a path set (or tie set) where φ(x) = 1 or a cut set where φ(x) = 0. The path set, respectively the cut set, is the set of components whose states are “1”, respectively “0”, in the component state vector associated to the node. A minimal path set, respectively a minimal cut set, is associated to a node such that it is only upper bounded by nodes associated to path-sets, respectively only lower bounded by nodes associated to cut-sets. Nodes associated to minimal path-sets and to minimal cut-sets are then placed on either sides of the border between the states where φ(x) = 1 and φ(x) = 0.

3.3 Ordered and Ordered Weighted Graphs of the System By definition, the reliability of a system is its probability to be in operating state. In terms of the state diagram, it is the probability for the system state to belong to the subset of nodes where φ(x) = 1, that is to say, states marked in full lines. We will then only be interested in the corresponding part of the graph. This reduced graph (for large systems, it is generally reduced regarding to the whole graph) is called Ordered Graph of the system and Fig. 4 shows it in the case of parallel system depicted in Fig. 1. This graph contains always a maximum node corresponding to the system state in which all components are operational and several minima corresponding to the minimal path sets. In our example, two minima corresponds to the path sets {x1 , x2 } and {x3 , x4 }. Let us consider the sub-graph composed of a minima and the set of all its upper bounds. The probability for the system state to belong to this state sub-graph is the probability for the components of the corresponding minimal path set to be operating (the belonging condition to this sub-graph is that the state of the path set components is “1” regardless of the state of the other components). This probability equals the product of the reliabilities of the components of this path set.

Fig. 4 Ordered graph of the system (nodes: 1111, 0111, 1011, 1101, 1110, 0011, 1100)


On the example, two minima are present and then we can observe two sub-graphs. Each of them may be associated to the product of the reliabilities of the components of the corresponding minimal path set: R1 · R2 and R3 · R4 . If we add these two terms, we get an overvalue of the reliability of the system because the two sub-graphs share the top node of the graph. To obtain the exact value of the system reliability, we have then to make a correction to this sum by subtracting the probability for the system to be in the top node state, namely R1 · R2 · R3 · R4 . The reliability of the system is then: R S = R1 · R2 + R3 · R4 − R1 · R2 · R3 · R4 . The generalization of this reasoning to any system consists then to sum the reliabilities of the minimal path-sets and to subtract the probabilities of the nodes belonging to many sub-graphs. Nevertheless, in most cases, it is not so simple as in our example. Some nodes can be counted several times and there must be an order to process to the correction. To conduct this process, we define the weighted ordered graph of the system from the ordered graph. In this new model, we associate a weight to each node of the ordered graph (Fig. 5). To do this, we firstly initiate all the weights to zero and then affect to each minima the weight “1” and increment all the weights of the nodes upper bounding this minima and so on, step by step, until the top of the sub graph. In the case of the example, each node receive the weight “1”, except the top node receiving the weight “2”. We get then the Ordered Weighted Graph (OWG) which was formally designed in [27]. In more complex cases, the number of minimal path sets may be important so that the weights in the upper lines of the OWG increase of several units so that an efficient algorithm must be developed.
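For this example the correction amounts to the inclusion–exclusion of the two minimal path sets. The quick numeric check below (an added illustration with arbitrarily chosen component reliabilities of 0.9) confirms that R1·R2 + R3·R4 − R1·R2·R3·R4 coincides with the exact value obtained by enumerating the operating states.

```python
from itertools import product

# Illustrative component reliabilities
R1 = R2 = R3 = R4 = 0.9

# Sum over the two sub-graphs, corrected for the shared top node (1111):
R_S = R1 * R2 + R3 * R4 - R1 * R2 * R3 * R4
print(R_S)   # approx. 0.9639

# The same value by direct enumeration of the 7 operating states of the ordered graph:
R_enum = sum(
    (R1 if x1 else 1 - R1) * (R2 if x2 else 1 - R2) *
    (R3 if x3 else 1 - R3) * (R4 if x4 else 1 - R4)
    for x1, x2, x3, x4 in product((0, 1), repeat=4)
    if (x1 and x2) or (x3 and x4)
)
print(R_enum)   # approx. 0.9639 as well
```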

Fig. 5 Ordered weighted graph of the parallel system (every node carries weight 1, except the top node 1111, which carries weight 2)


3.4 How to Build the Ordered Graph?

For small systems there is no difficulty in building the ordered graph by enumerating the state vector values, starting from the top of the Hasse diagram, and deducing the value of φ(x). For more complex systems, the dependability assessment starts with fault tree analysis with the purpose of elaborating a Boolean expression of the structure function. This expression has been, from the origin, the starting point of calculation methods for system reliability. The graphical method based on the ordered graph is radically different and does not need this expression. The ideal is the knowledge of the set of minimal path sets. We know that they are associated to the minimal nodes of the graph. The ordered graph is formed by these minima and all their upper bounds in accordance with the order relation on the x vector. If the previous phase leads to finding the cut sets (as in Sect. 2.3), the ideal is the determination of the critical cut vectors, which are associated to the immediate lower bounds of the minimal path sets. In our example, we have six critical cut vectors: the four minimal cut vectors {0101}, {1001}, {0110}, {1010} and the two values {0001} and {1000}. All the upper bounds of the four minimal cut vectors, and one upper bound of the two last values, belong to the ordered graph. If we only know the minimal cut vectors, by a symmetrical reasoning, we can build the part of the graph where φ(x) = 0 and complete the upper part where φ(x) = 1. For simple examples with a high level of redundancy, this can be acceptable. However, for large systems, the part of the state diagram where φ(x) = 1 is much smaller than the part where φ(x) = 0, so this solution is not advisable. We nevertheless propose to first build all the upper bounds of the minimal cut vectors, but this is not sufficient: some nodes may remain undetermined. In our example, it is the case of the values of φ(0011) and φ(1100). If we have the assurance that the list of minimal cut vectors is complete, it is possible to remove this indetermination. We know [27] that for a system with n components there are j = C(n, l) state vector values of order l in the state diagram (in our example with n = 4, the state diagram contains 6 state vector values of order 2 in the third horizontal line). If we suppose that p of these state vector values of order l are minimal cut vectors of maximal order, this means that the j − p other nodes are path vectors of order l; otherwise, there would be at least one other, non-minimal cut vector of order l and consequently at least one minimal cut vector of order ≥ (l + 1), contrary to the hypothesis that the maximal order of minimal cut vectors is l. This reasoning may then be applied to the order (l + 1) if there are some minimal cut vectors at this order, and so on. So in our example, the third horizontal line contains the 6 state vector values of order 2, and 4 of them are minimal cut vectors. This means that the 2 other state vector values are path vectors and we can assign to the corresponding nodes the value φ(x) = 1. If there are other minimal cut vectors of another order, the same reasoning can be applied to the corresponding horizontal lines. It is then possible to complete the ordered graph.


3.5 Reliability Computing Algorithm

The principle is simple enough. When we sum the products of the reliabilities of the minimal path set components, we get an overvalue of the system reliability: the contribution of some path vectors is counted several times, as we saw in the example. The ordered weighted graph indicates this precisely. The idea is to find the sub-graphs (associated with reliability monomials) to be removed in order to reduce the weight of each node to 1. During this process, the weight of some nodes may become null and perhaps negative. This means that new monomials may have to be added, perhaps several times. We therefore define two lists of monomials: those to be added and those to be subtracted. The detailed description of the algorithm is shown in Fig. 6.

Algorithm. The input data of the algorithm are:
- the number of components of the system,
- the list of the minimal path vectors (MPV).
The order of a path vector PV is its number of variables such that x_i = 1. Define two sub-sets of reliability monomials, M+ and M−.
Step 1. Build the ordered graph as follows:
- Define a list of couples {PV; weight}.
- Initialize the list with the couple {MPVi; 1} (any one of the MPV).
- Repeat until the top of the graph is reached: for one couple so defined, its order being l, generate the couples {PVl+1; 1} according to the order relation and put them in the list.
- For the other MPV: do the same generation as previously, but before pushing into the list check whether the couple already exists; in this case, only increase its weight by "1".
Step 2. Put in M+ the reliability monomials, i.e. the products of the reliabilities of the components of each minimal path set.
Step 3. The minimal nodes are removed from the graph (suppressed from the list).
Step 4. Consider the residual graph.
- If some of its minima are "1" weighted, they are also suppressed.
- Consider one of the minima of minimal order, {PV; m}. As a consequence of the previous steps, m is an integer.
- If m > 1: push m − 1 times in M− the monomial formed by the product of the reliabilities of the path-set components corresponding to PV, and subtract m − 1 from its weight and from the weights of all its upper bounds.
- If m < 1: push 1 − m times in M+ the monomial formed by the product of the reliabilities of the path-set components corresponding to PV, and add 1 − m to its weight and to the weights of all its upper bounds.
- Extend this process to the other minima of the graph until their weights equal "1". Consider the obtained new graph and return to Step 3, until the list of couples becomes empty.
Step 5. The reliability of the system is the sum of the monomials contained in M+, from which we must subtract the sum of the monomials contained in M−.
End.

Fig. 6 Algorithm for computation of system reliability
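A compact sketch of the procedure of Fig. 6 is given below. It is an illustrative Python implementation written for this text, not the authors' original code; minimal path vectors are assumed to be given as 0/1 tuples and the ordered graph is held as a dictionary of node weights.

```python
from itertools import product
from collections import Counter
from math import prod

def upper_bounds(v):
    """All state vectors >= v componentwise (including v itself)."""
    free = [i for i, b in enumerate(v) if b == 0]
    for bits in product((0, 1), repeat=len(free)):
        w = list(v)
        for i, b in zip(free, bits):
            w[i] = b
        yield tuple(w)

def reliability_from_mpv(mpvs, p):
    """System reliability from the minimal path vectors, following the
    weight-reduction idea of Fig. 6 (monomial lists M+ and M-)."""
    n = len(p)
    # Step 1: ordered graph = all upper bounds of the MPVs, each node
    # weighted by the number of minimal path vectors it dominates.
    weight = Counter()
    for v in mpvs:
        for w in upper_bounds(v):
            weight[w] += 1
    monomial = lambda v: [p[i] for i, b in enumerate(v) if b == 1]
    # Step 2: the monomials of the minimal path sets go into M+.
    m_plus = [monomial(v) for v in mpvs]
    m_minus = []
    # Step 3: the minimal nodes (the MPVs themselves) are removed.
    for v in mpvs:
        del weight[tuple(v)]
    # Step 4: correct the weight of each remaining minimum to 1,
    # processing the nodes from the lowest order upwards.
    while weight:
        v = min(weight, key=lambda w: (sum(w), w))
        m = weight.pop(v)
        if m > 1:                       # over-counted: subtract m-1 copies
            m_minus += [monomial(v)] * (m - 1)
        elif m < 1:                     # under-counted: add 1-m copies
            m_plus += [monomial(v)] * (1 - m)
        if m != 1:
            for w in upper_bounds(v):   # propagate the correction upwards
                if w in weight:
                    weight[w] += 1 - m
    # Step 5: reliability = sum(M+) - sum(M-).
    return sum(map(prod, m_plus)) - sum(map(prod, m_minus))

# Example: two components in parallel
# reliability_from_mpv([(1, 0), (0, 1)], [0.9, 0.8])  ->  0.98
```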


4 Conclusion

In this paper a comprehensive approach for the investigation of minimal path sets, minimal cut sets and the polynomial expression of system reliability was presented. The approach identifies minimal path sets and minimal cut sets by logic differential calculus. Logic differential calculus is a mathematical methodology for studying the Boolean structure function of binary-state systems. By analyzing the dynamic properties of the structure function (state changes of the system according to state changes of its components), logic differential calculus is an efficient way to determine the minimal path vectors/cut vectors describing the corresponding minimal path sets/cut sets. These minimal path sets/cut sets allow the state diagram of the system to be generated automatically, based on the Hasse diagram. The Hasse diagram is a graph representing the system states and the order relation between them; it is interesting because, in addition to representing the system state space, it can also represent it from the probability point of view. It directly gives the expression of the system reliability in its unique polynomial form and, for this, it requires neither an intermediate Boolean form and a translation, nor an optimization process (as BDD techniques do). Both the logic differential calculus and the system state diagram based on the Hasse diagram are not strictly limited to coherent systems, but can easily take non-coherent systems into account. Moreover, their extension to multi-state systems is natural, so that systems with finite degradation structures can also be considered. Non-coherent and multi-state systems will be addressed in future work.

Acknowledgements This research work was realized in the project "Mathematical Models based on Boolean and Multiple-Valued Logics in Risk and Safety Analysis" supported jointly by the Slovak Research and Development Agency (Agentúra na podporu výskumu a vývoja) under the contract no. SK-FR-2019-0003 and by the Delegation for European and International Affairs (Délégation aux affaires européennes et internationales) under the contract PHC (Hubert Curien Partnership) Stefanik no. 45127XL.

References 1. Rausand, M., Høyland, A.: System Reliability Theory. John Wiley & Sons Inc., Hoboken, NJ (2004) 2. Schneeweiss, W.G.: A short Boolean derivation of mean failure frequency for any (also noncoherent) system. Reliab. Eng. Syst. Saf. 94(8), 1363–1367 (2009) 3. Brînzei, N., Aubry, J.F.: Graphs models and algorithms for reliability assessment of coherent and non-coherent systems. Proc. Inst. Mech. Eng. Part O J. Risk Reliab. 232(2), 201–215 (2018) 4. Borgonovo, E., Peccati, L.: Sensitivity analysis in investment project evaluation. Int. J. Prod. Econ. 90(1), 17–25 (2004) 5. Zaitseva, E., Levashenko, V.: Investigation multi-state system reliability by structure function. In: Proceedings of International Conference on Dependability of Computer Systems (DepCoS—RELCOMEX 2007), pp. 81–90, 4272895 (2007).


6. Kuo, W., Zhu, X.: Importance Measures in Reliability, Risk, and Optimization: Principles and Applications. Wiley, Chichester, UK (2012) 7. Birnbaum, L.W.: On the importance of different components in a multicomponent system. Multivar. Anal. II, 581–592 (1969) 8. Vesely, W.E.: A time-dependent methodology for fault tree evaluation. Nucl. Eng. Des. 13(2), 337–360 (1970) 9. Fussell, J.B.: How to hand-calculate system reliability and safety characteristics. IEEE Trans. Reliab. R-24, 169–174 (1975) 10. Kvassay, M., Zaitseva, E., Levashenko, V., Kostolny, J.: Minimal cut vectors and logical differential calculus. In: Proceedings of The International Symposium on Multiple-Valued Logic, 2014, pp. 167–172 11. Rauzy, A.: New algorithms for fault trees analysis. Reliab. Eng. Syst. Saf. 40, 203–211 (1993) 12. Sinnamon, R.M., Andrews, J.D.: New approaches to evaluating fault trees. Reliab. Eng. Syst. Saf. 58, 89–96 (1997) 13. Brown, K.S.: Evaluating fault trees (AND and OR gates only) with repeated events. IEEE Trans. Reliab. 39, 226–235 (1990) 14. Zakrevskij, A., Cheremisinova, L., Pottosin, Y.: Combinatorial Algorithms of Discrete Mathematics. TUT Press, Minsk (2008) ˇ 15. Cepin, M.: Analysis of truncation limit in probabilistic safety assessment. Reliab. Eng. Syst. Saf. 87, 395–403 (2005) 16. Kvassay, M., Levashenko, V., Zaitseva, E.: Analysis of minimal cut and path sets based on direct partial Boolean derivatives. Proc. Inst. Mech.Eng. Part O: J. Risk Reliab. 230, 147–161 (2016) 17. Kvassay, M., Zaitseva, E., Levashenko, V.: Minimal cut and minimal path vectors in reliability analysis of binary- and multi-state systems. In: CEUR Workshop Proceedings, pp. 713–726 (2017) 18. Yanushkevich, S., Michael Miller, D., Shmerko, V., Stankovic, R.: Decision Diagram Techniques for Micro- and Nanoelectronic Design Handbook. CRC Press, Boca Raton, FL (2005) 19. Steinbach, B., Posthoff, C.: Boolean differential calculus. Synth. Lect. Dig. Circ. Syst. 12, 1–215 (2017) 20. Kaufmann, A., Grouchko G., Cruon R.: Mathematical Models for the Study of the Reliability of Systems. Academic Press Inc. (1977) 21. Akers, S.B.: Binary decision diagrams. IEEE Trans. Comput. 27(6), 509–516 (1978) 22. Rauzy, A.: New algorithms for fault tree analysis. Reliab. Eng. Syst. Saf. 40, 203–211 (1993) 23. Brînzei N., Aubry J.F.: An approach of reliability assessment of systems based on graphs models. In: European Safety and Reliability Conference ESREL 2015, in Safety and Reliability of Complex Engineered Systems, Podofillini, L., Sudret, B., Stojadinovic, B., Zio, E., Kröger, W. (eds.), Zurich (Switzerland), pp. 1485–1493. CRC Press/Balkema, Taylor & Francis Group (2015) 24. Matousek, J., Nesetril, J.: Invitation to Discrete Mathematics. Oxford University Press (1998) 25. Rauzy A., Liu, Y.: Finite degradation structures. J. Appl. Log. 6(7), 1447–1474 (2019) 26. Rocco, C.M., Hernandez-, E., Mun, J.: Introduction to formal concept analysis and its applications in reliability engineering. Reliab. Eng. Syst. Saf. 202, 107002 (2020) 27. Aubry, J.F., Brînzei, N.: Systems Dependability Assessment: Modeling with Graphs and Finite State Automata. Wiley-ISTE (2015)

The Survival Signature for Quantifying System Reliability: An Introductory Overview from Practical Perspective Frank P. A. Coolen and Tahani Coolen-Maturi

Abstract The structure function describes the functioning of a system dependent on the states of its components, and is central to the theory of system reliability. The survival signature is a summary of the structure function which is sufficient to derive the system's reliability function. Since its introduction in 2012, the survival signature has received much attention in the literature, with developments in theory, computation and generalizations. This paper presents an introductory overview of the survival signature, including some recent developments. We discuss challenges for the practical use of survival signatures for large systems.

1 Introduction

Reliability of systems is very important in everyday life and quantification of system reliability has been a topic of research over many decades. It has led to a huge literature, a large part of it with at best spurious links to real world systems and challenges. Methods for analysis are often presented for very small systems with quite straightforward structures, and important practical considerations, e.g. the conditions under which the system has to function, the actual tasks it has to perform and the required level to which it performs these, tend to be avoided in many research papers. In 2012, we introduced the concept of the survival signature [8], which is a summary of the system structure function that is sufficient to derive the system survival function, and hence several important reliability metrics. While this can easily be seen as another mathematical concept with little practical relevance, the opposite has always been the intention. This paper presents an introductory overview of the survival signature, with emphasis on practical use and the required additional research to enable this. There are many research challenges to bring the survival signature methodology to fruition for application to large scale real-world systems; this paper aims to discuss recent contributions in this direction and further challenges.

Section 2 of this paper provides a brief introduction to the survival signature. Section 3 discusses the assumption of exchangeability of component failure times, which sits at the heart of the survival signature method. Section 4 discusses some computational issues related to implementing the survival signature, and also aspects of simulation and statistical inference. Section 5 briefly presents recent developments, including resilience through the possibility of swapping components in a system, and new survival signatures for multi-phase systems, for multiple systems which share components, and for multi-state systems. Section 6 concludes the paper with further considerations, including an explanation of the practical need for generalizing the system structure function to be probabilistic and the challenges this brings.

2 Survival Signature

The survival signature was introduced by Coolen and Coolen-Maturi [8]. It is a summary of a system structure function which, together with the probability model for the components' failure times, is sufficient for computing the survival function (also known as reliability function) of the system failure time.

Consider a system with $K \geq 1$ types of components, with $n_k$ components of type $k \in \{1, 2, \ldots, K\}$ and $\sum_{k=1}^{K} n_k = n$. It is crucial to understand what is meant by 'types of components'; we discuss this in detail in Sect. 3. The essential assumption is that the random failure times of components of the same type are exchangeable [16]. The state vector $x \in \{0,1\}^n$ of the system describes the states of its components, with 1 representing functioning of a component and 0 that it does not function. The system structure function $\phi(x) \in \{0,1\}$ describes the functioning of the system given the component states $x$, where 1 represents that the system functions and 0 that it does not function. Due to the arbitrary ordering of the components in the state vector, components of the same type can be grouped together, leading to a state vector that can be written as $x = (x^1, x^2, \ldots, x^K)$, with $x^k = (x_1^k, x_2^k, \ldots, x_{n_k}^k)$ the sub-vector representing the states of the components of type $k$.

The survival signature, denoted by $\Phi(l_1, l_2, \ldots, l_K)$, with $l_k = 0, 1, \ldots, n_k$ for $k = 1, \ldots, K$, is defined as the probability that the system functions given that precisely $l_k$ of its $n_k$ components of type $k$ function, for each $k \in \{1, 2, \ldots, K\}$. There are $\binom{n_k}{l_k}$ state vectors $x^k$ with $\sum_{i=1}^{n_k} x_i^k = l_k$; let $S_{l_k}$ denote the set of these state vectors for components of type $k$ and let $S_{l_1, \ldots, l_K}$ denote the set of all state vectors for the whole system for which $\sum_{i=1}^{n_k} x_i^k = l_k$, $k = 1, 2, \ldots, K$. Due to the exchangeability assumption for the failure times of the $n_k$ components of type $k$, all the state vectors $x^k \in S_{l_k}$ are equally likely to occur, hence

$$\Phi(l_1, \ldots, l_K) = \left[ \prod_{k=1}^{K} \binom{n_k}{l_k} \right]^{-1} \times \sum_{x \in S_{l_1, \ldots, l_K}} \phi(x) \qquad (1)$$
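For small systems, the sum in Eq. (1) can be evaluated by brute-force enumeration of the structure function. The following Python sketch is an illustration added for this overview; the function name, the encoding of φ and the small example system are our assumptions, not taken from the text.

```python
from itertools import combinations, product
from math import comb

def survival_signature(phi, type_sizes):
    """Compute Phi(l_1, ..., l_K) by direct enumeration of Eq. (1).

    phi        : function taking a tuple of K sub-vectors (tuples of 0/1)
                 and returning 0 or 1 (the structure function)
    type_sizes : (n_1, ..., n_K)
    """
    signature = {}
    for ls in product(*[range(nk + 1) for nk in type_sizes]):
        # all sub-vectors of type k with exactly l_k working components
        per_type_states = []
        for nk, lk in zip(type_sizes, ls):
            states_k = [
                tuple(1 if i in working else 0 for i in range(nk))
                for working in combinations(range(nk), lk)
            ]
            per_type_states.append(states_k)
        total = sum(phi(x) for x in product(*per_type_states))
        norm = 1
        for nk, lk in zip(type_sizes, ls):
            norm *= comb(nk, lk)
        signature[ls] = total / norm
    return signature

# Hypothetical 3-component system: one type-1 component in series with
# two parallel type-2 components
phi = lambda x: x[0][0] * max(x[1])
sig = survival_signature(phi, (1, 2))
# sig[(1, 1)] == 1.0, sig[(1, 0)] == 0.0, sig[(0, 2)] == 0.0
```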


The survival signature requires specification at $\prod_{k=1}^{K} (n_k + 1)$ inputs while the structure function must be specified at $2^n$ different inputs; in particular for large values of $n$ and relatively small values of $K$, so large systems with few component types, the difference is enormous. We will comment on computational aspects in Sect. 4, but note that storage of the structure function may also be a problem for large systems, and this could be substantially easier for the survival signature if there are not many component types. If all components are of different types, so $K = n$, then the survival signature does not provide any advantages, in the sense of reduced representation, over the structure function. If all components are of the same type, so $K = 1$, then the survival signature is closely related to Samaniego's system signature [30, 31]. That signature has led to a substantial literature, for example considering properties like stochastic dominance relations between different system lay-outs, but its practical value was limited as most real-world systems consist of multiple types of components. Generalizing Samaniego's system signature to systems with multiple types of components was an open problem which was solved by the introduction of the survival signature [8], which was an important break-through with particular relevance to reliability quantification for real-world systems [32].

Before we present a basic example of the survival signature and discuss further important aspects, we explain why it is a convenient tool for quantification of system reliability. Let $C_k(t) \in \{0, 1, \ldots, n_k\}$ denote the number of components of type $k$ in the system which function at time $t > 0$. The probability for the event that the system functions at time $t > 0$, so for $T_S > t$ where $T_S$ is the random system failure time, can be derived by application of the theorem of total probability,

$$P(T_S > t) = \sum_{l_1=0}^{n_1} \cdots \sum_{l_K=0}^{n_K} P\!\left(T_S > t \,\Big|\, \bigcap_{k=1}^{K} \{C_k(t) = l_k\}\right) P\!\left(\bigcap_{k=1}^{K} \{C_k(t) = l_k\}\right)$$
$$= \sum_{l_1=0}^{n_1} \cdots \sum_{l_K=0}^{n_K} \Phi(l_1, \ldots, l_K)\, P\!\left(\bigcap_{k=1}^{K} \{C_k(t) = l_k\}\right) \qquad (2)$$

Equation (2) is the essential result at the centre of the survival signature theory. It shows that the system survival function can be computed with the required inputs, namely the information about the system structure and about the component failure times, being completely separated. Hence, the effect of changing a system’s structure on its survival function can easily be investigated. One can also compare different system structures in general, without assumptions for the random failure times, by comparing the systems’ survival signatures [32]. The system survival function is sufficient for important metrics such as the expected failure time of the system, or its remaining time till failure once it has been functioning for some time. It is important to emphasize that Eq. (2) only required the assumption that failure times of components of the same type are exchangeable. This allows dependencies between components’ failure times to be taken into account, which is discussed further in Sect. 3. If one assumes that the failure times of components of different types are independent, then Eq. (2) becomes


$$P(T_S > t) = \sum_{l_1=0}^{n_1} \cdots \sum_{l_K=0}^{n_K} \Phi(l_1, \ldots, l_K) \prod_{k=1}^{K} P(C_k(t) = l_k) \qquad (3)$$

If, in addition, one assumes that the failure times of components of the same type are independent and identically distributed (iid), with known cumulative distribution function (CDF) $F_k(t)$ for type $k$, then this leads to

$$P(T_S > t) = \sum_{l_1=0}^{n_1} \cdots \sum_{l_K=0}^{n_K} \Phi(l_1, \ldots, l_K) \prod_{k=1}^{K} \binom{n_k}{l_k} [F_k(t)]^{n_k - l_k} [1 - F_k(t)]^{l_k} \qquad (4)$$

In many reliability scenarios one may have a good idea about suitable parametric probability distributions for components’ failure times, and one may wish to use statistical inference methods for the unknown parameter. Using general notation Fk (t|θk ) for the CDF with parameter θk (which can be multi-dimensional) for the failure times of components of type k, and the assumption that the component failure times are conditionally independent and identically distributed (ciid), where the conditioning is with respect to the parameter value, the previous equation becomes

$$P(T_S > t \,|\, \theta_1, \ldots, \theta_K) = \sum_{l_1=0}^{n_1} \cdots \sum_{l_K=0}^{n_K} \Phi(l_1, \ldots, l_K) \prod_{k=1}^{K} \binom{n_k}{l_k} [F_k(t|\theta_k)]^{n_k - l_k} [1 - F_k(t|\theta_k)]^{l_k} \qquad (5)$$

This equation can be used in a Bayesian statistical approach to system reliability, where prior distributions for the θk are required, as illustrated by Aslett et al. [4]. The survival signature can be applied for any system if the components and the system itself all have two states, functioning or not. If the system is coherent, which means that φ(x) is not decreasing in any of the components of x, then the survival signature is an increasing function, which has substantial advantages as will be discussed in Sect. 4. While there has been quite some attention in the reliability theory literature to non-coherent systems, most practical systems are coherent. Typical examples of non-coherent systems in the literature are such that two component failures cancel each other out, but in practice such situations are likely to lead to a different overall state of the system compared to its state when the two components involved function properly, and this may require a more detailed system state description than simply functioning or not. As a basic example of the survival signature, consider the system in Fig. 1, for which the survival signature is given in Table 1. Verification of the survival signature is straightforward as the structure function can be easily derived for this small system.

The Survival Signature for Quantifying System Reliability …

27

[Figure omitted in this text extraction] Fig. 1 System with 2 types of components

Table 1 Survival signature of the system in Fig. 1

l1  l2  Φ(l1, l2)      l1  l2  Φ(l1, l2)
0   0   0              2   0   0
0   1   0              2   1   0
0   2   0              2   2   4/9
0   3   0              2   3   6/9
1   0   0              3   0   1
1   1   0              3   1   1
1   2   1/9            3   2   1
1   3   3/9            3   3   1
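The Table 1 values can be plugged directly into Eq. (4). The short sketch below is an illustration added to this overview; the exponential failure time distributions and their rates are assumed purely for the example.

```python
from math import comb, exp

# Survival signature of the system in Fig. 1, copied from Table 1
PHI = {
    (0, 0): 0, (0, 1): 0, (0, 2): 0,   (0, 3): 0,
    (1, 0): 0, (1, 1): 0, (1, 2): 1/9, (1, 3): 3/9,
    (2, 0): 0, (2, 1): 0, (2, 2): 4/9, (2, 3): 6/9,
    (3, 0): 1, (3, 1): 1, (3, 2): 1,   (3, 3): 1,
}

def system_survival(t, cdfs, type_sizes, signature):
    """P(T_S > t) via Eq. (4), assuming iid failure times within each type."""
    total = 0.0
    for ls, phi in signature.items():
        if phi == 0:
            continue
        term = phi
        for nk, lk, F in zip(type_sizes, ls, cdfs):
            Ft = F(t)
            term *= comb(nk, lk) * Ft ** (nk - lk) * (1 - Ft) ** lk
        total += term
    return total

# Hypothetical exponential lifetimes, rate 1.0 for type 1 and 0.5 for type 2
cdfs = (lambda t: 1 - exp(-1.0 * t), lambda t: 1 - exp(-0.5 * t))
print(system_survival(0.5, cdfs, (3, 3), PHI))
```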

3 Exchangeability of Components’ Failure Times As explained in Sect. 2, the key assumption underlying the survival signature is that the random failure times of components of the same type are exchangeable, so this defines what it means that components are of the same type. What does this mean? In De Finetti’s theory of probability [16], two random quantities, X and Y , are exchangeable if P(X = x, Y = y) = P(Y = x, X = y) for all possible x and y, and similarly generalized to more than two random quantities. So, X and Y have the same marginal distributions, but it is important to emphasize that they do not need to be independent. Exchangeability is an important concept in Bayesian statistics when one wishes to learn about one random quantity by observing another one [16]. In a system reliability setting, exchangeability of the failure times of the n k components of type k is perhaps easiest understood as follows: If you learn that one component of type k in the system has failed, you do not know which component it is, and you consider each of these n k components to have probability 1/n k to be the failed component. This should hold at any moment in time, and generalizes logically to any subset of these components having failed, which must have the same probability independent of which specific components of type k are in the subset. A crucial


consideration here is that this is likely to depend not only on the physical nature of the components, e.g. if they are all produced by the same manufacturer, but it also depends on the specific functioning in the system. For example, if one knows that of all components of type k in the system, one has a larger load and hence is more prone to failure, then one would doubt that the assumption of exchangeability of their failure times is appropriate. The assumption of exchangeability of components’ failure times raises a crucial issue for practical quantification of system reliability and related decision support, namely at which level of detail one should model the system. Despite the huge literature on system reliability, this topic has received very little attention. For a large practical system with many components, one may wish to consider the failure times of a group of components to be exchangeable, and hence judge these components to be of the same type, even though one could describe the components’ requirements and functioning in so much detail that one could distinguish between their probabilities of failing at specific times. In such a case, the exchangeability assumption would be motivated by a decision not to include more details of the components in the model, and it is important to realize that a model is not identical to the system in its real world functioning, but a reduced representation which should be of sufficient quality for its task, which is often support of a specific decision or trust in failure-free functioning of the system over a period of time. For example, if we wish to consider reliability of a large rail network, we may judge different stretches of rails, of the same length, to have exchangeable failure times even though environmental aspects could enable us to distinguish between them, and could make a failure more likely to occur at one stretch than another. So, the decision to consider different components to be of the same type, and therefore to have exchangeable failure times, is directly related to the choice of the level of detail in the reliability model. How to decide the appropriate level of detail? This is particularly important for large real world systems, and the answer depends on the task, so the reason of creating the reliability model in the first place, and the available information. But it also depends on time and budget available for the modelling, and the expected benefits which perhaps can also be expressed in terms of money, reduced risk or benefits which may be harder to measure and quantify. Research on this important topic is best done in direct relation to a real reliability study for a large system, as it requires meaningful inputs from management and details of the system. While we have not engaged in such research, a study with similarities was part of a long term collaboration the first author had with an industrial partner about two decades ago, where to support software testers in their complicated tasks a statistical approach based on Bayesian graphical models was developed [12, 35]. These models also required assumptions of exchangeability, which in that setting meant that possible software failures were deemed to be exchangeable, and a project viability approach was developed that would enable managers to decide, before start of such a study, the level of detail of the model in order to support testers whilst staying within budget and time constraints [11, 36]. 
Similar guidance on decisions about the level of modelling for large scale system reliability studies is much needed; the fact that the survival signature methodology explicitly requires exchangeability


of components’ failure times to be considered ensures that it fits with the natural questions one needs to answer when choosing a suitable level of model detail. A further challenging research topic is the practical need to zoom in on problem areas, once these become apparent during the system’s functioning. Indeed, there are many great research challenges in this topic area, several more are discussed later in this paper. A further modelling decision is needed with regard to dependence of components’ failure times. As emphasized, components of the same type must have exchangeable failure times for the survival signature approach, and these can be dependent. Furthermore, failure times of components of different types can also be dependent. As explained in Sect. 2, the general formula for the system survival function is Eq. (2), different assumptions on the components’ failure times can lead to simplifications of this equation. In practice, there can be many reasons for modelling components’ failure times as dependent, for example there may be common-cause failure modes, a risk of cascading failures, load sharing between components and so on. Initial studies into several of such possibilities have been published [9, 17, 18], but there are many related research topics left. The main conclusion is that the separation of the system structure and the random components’ failure times, in Eq. (2), enables all required dependencies between failure times to be included in the investigation, but the detailed modelling requires of course an extra effort compared to the simpler situation of independent failure times.

4 Computation, Simulation and Inference An immediate question for application of the survival signature is how to compute it. For very small systems, like the one in the example in Sect. 2, one can simply derive the system structure function and use Eq. (1). This approach can be applied to somewhat larger systems as well, supported by standard computational methods for the structure function, based on cut sets and path sets. This approach has been implemented in R [2], and can be used without problems for systems of about 20 components with relatively little computational effort, and for somewhat larger systems as well although the computational effort increases enormously. Reed [27] presented a substantial improvement on the required computation time by using binary decision diagrams, which however still requires the availability of the full structure function. While the survival signature provides advantages over the full structure function, mainly in terms of storage requirements but also when one wishes to simulate system failure times, as will be discussed later in this section, the main idea of introducing the survival signature was to enable inference on system reliability for large real world systems, for which one normally would not have the full structure function available. There are already some opportunities to make the computational demands somewhat less daunting than one may fear. Of course, brute computational force can be applied to compute the structure function, and from this the survival signature, for


larger systems, as computational powers are ever increasing and, crucially, a system’s survival signature only needs to be computed once. Coolen et al. [13] provide a simple combinatorial expression to compute the survival signature of a system consisting of two subsystems in either series or parallel configuration, if the survival signatures of those subsystems are available. By repeated application this implies that computation of large series-parallel systems can quite easily be implemented. They also addressed the issue of re-computing a survival signature if a component is replaced and the new component is to be considered as being of a new type. For very large systems, it may be sufficient to use either an approximation to the survival signature, or bounds for it. This is particularly feasible for coherent systems because their survival signatures are increasing functions. It will also be of interest to explore the use of modern simulation and emulation methods to find the part of the entire input space where the function actually increases from 0 to 1. It should be noted that many modern engineering systems, or systems in other application fields such as social-economic systems or computer networks, tend to have some but not very much redundancy, such knowledge can of course also help in computing the survival signature or suitable approximations or bounds for it. This is a substantial area for research with huge possible impact. The survival signature enables very efficient simulation to learn the system survival function, as presented by Patelli et al. [26] and extended by George-Williams et al. [20] for inclusion of dependent failures. The key idea is as follows: for a system with n k components of type k, for k = 1, . . . , K , one simulates n k component failure times for each type k. Instead of investigating which of these component failure times would actually be the system failure time, and using only that as the output of one simulation run, one orders all these observation times and builds up a full simulated stepwise survival function, which at each of these failure times takes on the value of the survival signature with the corresponding values of the lk , the number of still functioning components of each type k. This procedure turns out to be very efficient, with all simulated component failure times being used instead of the perhaps more intuitive method where each simulation run only leads to one simulated system failure time [26]. Therefore, this enables fast inference about the system reliability based on only the survival signature and components’ failure time distributions, where further details about the exact structure of the system is not needed. This points to another advantage of the survival signature approach that may prove very valuable in practice, namely that statistical analysis of the system reliability is possible without the need to know the full structure of the system, as long as the survival signature is available, or a good approximation to it. It should be emphasized that the full system structure cannot be deduced from the survival signature if there are many components but relatively few types, except of course in special cases such as systems without redundancy. 
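To illustrate the simulation idea just described, here is a minimal sketch of our own (not the implementation of [26]): each run draws all component failure times, counts the survivors of each type at time t and records the survival signature at those counts, so every run contributes a value in [0, 1] rather than a single 0/1 system indicator, which reduces the variance of the estimate.

```python
import random

def estimate_survival(signature, type_sizes, samplers, t, runs=10_000):
    """Monte Carlo estimate of P(T_S > t) using only the survival signature
    and the components' failure time distributions (no structure function)."""
    acc = 0.0
    for _ in range(runs):
        # number of components of each type still functioning at time t
        counts = tuple(
            sum(samplers[k]() > t for _ in range(nk))
            for k, nk in enumerate(type_sizes)
        )
        acc += signature[counts]
    return acc / runs

# Hypothetical use with the Table 1 signature (PHI from the earlier sketch)
# and exponential component lifetimes:
# samplers = (lambda: random.expovariate(1.0), lambda: random.expovariate(0.5))
# estimate_survival(PHI, (3, 3), samplers, t=0.5)
```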
Aslett [3] has taken this aspect further and developed a system which enables evaluation of a system’s reliability if information held by different parties, namely the manufacturer of the system and the manufacturers of different types of components, is not shared, and cannot even be deduced by the different parties. This work is an important first step towards practical inference about system reliability without the need for major commercial interests to be revealed, and


it is only possible through the use of the survival signature as a sufficient summary of the structure function. Statistical inference for system reliability is a topic of major interest, as learning from data, possibly in combination with the use of expert judgements, is crucial in many applications. If one has data available on the individual component types, then inference on the system’s failure time is quite straightforward. Nonparametric predictive inference [6], a frequentist approach using few modelling assumptions made possible by the use of imprecise probabilities [5], can be used to derive bounds for the system survival function [13]. The application of Bayesian methods has been presented as well [4], this is particularly useful if one has relatively little data on component failures and therefore wishes to include expert judgements. Walter et al. [34] generalized the Bayesian approach combined with the survival signature by using sets of priors, as typically done in theory of robust Bayesian methods. They showed that, by choosing the sets of priors in a specific way, one can enable detection of conflict between prior judgements and data, when data become available and are used to update the prior distributions. This can be of great practical importance, as it can point to prior judgements being too optimistic, hence the system reliability may be substantially lower than was originally thought. A major challenge is development of suitable statistical methodology to learn the survival signature from observations of the system, so from information which can consist of system failure times, component failure times, outcomes of inspections or condition monitoring and so on. Due to the inverse nature of such inferences, Bayesian methods are well placed to enable such learning, and it is likely that one can learn the survival signature far easier than the full structure function. Aslett [1] presented such inverse inference for systems with only a single type of component, using Samaniego’s system signature. While conceptually such inference does not pose many problems, it is extremely computationally expensive, so there are substantial research opportunities for useful contributions to the methodology.

5 Recent Developments In recent years, the use of the survival signature has been presented for a range of topics in reliability. Component important measures have become popular management tools for guidance on aspects of system reliability, and the survival signature can be used both to assess importance of specific components and importance of components of a specific type [19]. The former requires a bit more information than just the survival signature for the full system, namely two such signatures with conditioning on the specific component of interest functioning or not. The latter is an interesting difference to the usual idea of component importance measures, where the importance only relates to a specific type of component. For a number of practical decision problems this may be the most relevant information, for example if one needs to decide on immediate availability of spare components in case the system fails, then it may not be crucial to know which specific component is likely to fail


next but the type of component to fail next could be most important. Eryilmaz et al. [17] considered marginal and joint component importance for dependent components, while linking survival signatures to logic differential calculus has also been shown to provide useful tools to identify important components in system reliability analysis [28], and the survival signature also enables useful new approaches to sensitivity analysis for system reliability [21]. A challenging problem in system design is reliability-redundancy allocation, where under budget or other constraints a system designer must choose between increasing the quality of components or the level of redundancy in the system. Since it is natural to assume that any quality improvement will equally affect all components of a particular type, it is clear that the survival signature can be used to support such decisions. This was considered by Huang et al. [23], who present a fast heuristic algorithm that provides excellent solutions to the problem and that can be implemented for systems of substantial size, as long as the survival signature, or a good approximation of it, is available. All these developments are initial results, with many related research opportunities including computational challenges to enable upscaling to large real world systems. A further recent contribution resulted from the practical need to make systems resilient in case things go wrong. The idea was simple: if a system fails due to one or more failing components, it may be possible to swap a failing component with another component in the system which still functions. This approach brought an interesting question with regard to the definition of a component: namely is a component defined as the specific location (or better ‘role’) in the system, or the part that could actually be moved to another location. It turned out that the latter interpretation is by far easier, as it means that the component does not change its random remaining failure time when moved to another location. The survival signature approach enables quite straightforward investigation of the improvement of the system reliability if specific component swaps are possible [25]. This may be important in practice when such a component swap might provide sufficient time to prepare a substantial maintenance activity on the system. It is worth to emphasize that, although emphasis and terminology in this paper is mainly on engineering systems, the survival signature methodology can also be applied to systems in other fields, including socio-economic and health systems. Swapping of components could, for example, be relevant in an organisation where staff members can be regarded as components. It may be beneficial if some staff members can take over other roles in case colleagues become ill, and one may want to consider training people to enable such swaps of their roles. There are many related research questions, including the option to swap components of different types or even to swap components which have not yet failed, the latter could make sense if loads vary during different periods of system functioning. As with all topics discussed here, it will be ideal if practical issues for real world systems can be analysed to guide the further development of theory and methods. All the contributions to system reliability methodology discussed above use the survival signature in the basic form as given in Eq. (2). 
However, several important practical scenarios require different survival signatures, which can be seen as generalizations of the basic form, due to the increased complexity of the system or its


use. We briefly describe three such generalizations, while referring to the respective papers for more details. These new survival signatures are all starting points for substantial further research with regard to similar issues as discussed before in this paper, to ensure wide applicability to real-world systems. The first scenario for which a generalized survival signature has recently been presented is phased mission systems, which are common in practice as many systems have to perform different tasks sequentially. Huang et al. [22] present a new survival signature for this scenario. The main issue here is that not all components need to function in each phase, so one needs to keep a clear record of any component failures, where it is assumed that a failed component does not function anymore for all remaining phases. While the system’s functioning in each phase can be presented by a basic survival signature, the definition of ‘components of the same type’ needs care, because components that do not need to function in one phase are likely to have different failure behaviour after that phase, compared to similar components which did have to function in that phase. In the earlier literature on phased mission systems, this important aspect seems to have been overlooked, mostly components with exponential failure time distributions seem to have been considered for systems such that all components need to function in each phase. In practice these common assumptions are often unrealistic, the survival signature approach by Huang et al. [22] enables more realistic scenarios to be modelled. Building on that work, Coolen et al. [14] considered the opportunity to swap components within the system, either at the time of system failure or at phase transitions. Huang et al. [24] studied component importance for such phased mission systems using the new survival signature. A second scenario for which a generalized survival signature has recently been presented, is when multiple systems share some components, which can be of different types. One can think, for example, about multiple computers linked to a single server, or multiple academic departments at one university during an exams period with strict marking deadlines, which all depend on one central information technology support group which can be regarded as a component shared by the different departments. This scenario applies also to the important situation of one system which has to perform multiple functions, and can be further generalized to multiple systems performing multiple functions. Coolen-Maturi et al. [15] have recently presented the survival signature for such situations, which is a major step in the development of the survival signature methodology for large scale practical applications. Crucial in this work is that one may wish to consider the functioning of different systems at different moments in time, where the status of the shared components must be considered at the different time points. This enables inferences on, for example, the probability that one system still functions at a specific time, given that another system with which it shares some components functioned at an earlier time, or that it had failed at an earlier time, without further information about the status of the shared components. The third scenario for which a generalized survival signature has recently been presented, is multi-state systems with multi-state components. 
While the reliability literature has traditionally mainly considered binary state systems and components, many real world scenarios require modelling of multiple states, e.g. including an intermediate state between perfect and not functioning, during which maintenance or


replacement of components may be possible. Qin and Coolen [29] present the survival signature methodology for such systems, where the probability distribution of the system over its possible states is considered as function of the numbers of components of all types in the possible states. The computation of the survival signature for this scenario becomes rather complicated, but Qin and Coolen [29] present an efficient algorithm to combine the survival signatures of two subsystems if the state of the system depends only on the states of these subsystems. As is the case for the basic survival signature for binary states [13], repeated application of this algorithm may enable fast computation of the multi-state survival signature for some large systems.

6 Further Considerations There remains a large discrepancy between system reliability as presented in textbooks and many journal papers, and methods needed to assist analysts and managers in real world problems concerning reliability of large systems. These differences are not only the size of the systems typically presented, but also the actual problems studied, where in real life the system survival function is usually only the input to a more complicated decision problem which determines the required level of detail of the system model and accuracy of approximations to the survival function if it cannot be computed exactly. Perhaps the most important difference, however, is that several important aspects of applications of large real world systems tend not to be reflected in typical reliability models and methods, and they lead to additional uncertainties. It is often not clear what is meant by functioning of the system because the specific tasks, or number of tasks, may not be known, or the environment in which the system has to function may not be fully known or indeed be variable. Deciding the appropriate level of the system is also difficult in the real world, and it must be possible to study a system’s reliability as function of a subset of its components or subsystems. For example, one may want to model reliability of a car as function of its main components like engine, breaks and tyres, but not take into account every minor component that could by itself, or in combination with some other minor components, prevent the car from being used. We have advocated that, to enhance theory of system reliability and to make it far more flexible for real world use, the system structure function should not be deterministic but probabilistic, so φ(x) ∈ [0, 1] instead of φ(x) ∈ {0, 1}, which can also be generalized to imprecise probabilities [10], which have proven to be useful in many reliability problems [7, 33]. This will provide a tool to deal with the additional uncertainties mentioned above. For example, if one only models system reliability as function of some main components, and one wishes to apply statistical inference about the reliability from failure observations, it is quite possible that for the same states of the components included in the model, one has both observed the system to fail and not to fail. In the example of the car mentioned above, one may not have included the car’s heating system as a component in the system reliability model, but if the task at hand is to drive a long distance on a very cold winter day, its failure


may prevent the car from being used even though all main components function. This example also illustrates the problem of defining the system’s functioning and possible uncertainty about the tasks and environment in which it needs to operate. Moving from deterministic to probabilistic structure functions seems mathematically quite straightforward, and the good news from the perspective of this paper is that it would not provide any difficulties for the survival signature approach, as Eq. (1) can still be used if the structure function is a probability, and if imprecise probabilities are used then the generalization is also straightforward. The main challenges, however, are with probabilistic structure functions themselves. Clearly, the probabilities would need to be assessed, which may require creating models to do so, and computations to derive a structure function will require new theory to be developed because concepts like path sets and cut sets do not carry over to probabilistic structure functions. There are great opportunities for application of the survival signature methodology to networks, an initial example was presented by Aslett et al. [4]. These can be regarded as systems, and typically have at least two types of components, namely nodes and links between nodes. However, networks typically require many different routes through the network to be available, but there may be possibilities to use some alternative routes that would still be satisfactory. Due to the huge importance of reliable networks in modern life and the fact that they tend to be large but often have a limited number of component types, this promises to be an application area where the survival signature can make very substantial contributions, and which in turn can guide future research to extend the survival signature methodologically in meaningful directions. Acknowledgements This paper is closely related to a presentation at The International Workshop on Reliability Engineering and Computational Intelligence (October 2020). We thank the organisers, in particular Elena Zaitseva, for the invitation to present our work.

References 1. Aslett, L.J.M.: MCMC for inference on phase-type and masked system lifetime models. Ph.D. thesis, Trinity College Dublin (2012) 2. Aslett, L.J.M.: Reliability theory: tools for structural reliability analysis. R package (2012). www.louisaslett.com 3. Aslett, L.J.M.: Cryptographically secure multiparty evaluation of system reliability (2016). arXiv:1604.05180 [cs.CR] 4. Aslett, L.J.M., Coolen, F.P.A., Wilson, S.P.: Bayesian inference for reliability of systems and networks using the survival signature. Risk Anal. 35, 1640–1651 (2015) 5. Augustin, T., Coolen, F.P.A., de Cooman, G., Troffaes, M.C.M. (eds.): Introduction to Imprecise Probabilities. Wiley, Chichester (2014) 6. Coolen, F.P.A.: Nonparametric predictive inference. In: Lovric, M. (ed.) International Encyclopedia of Statistical Science, pp. 968–970. Springer (2011) 7. Coolen, F.P.A., Utkin, L.V.: Imprecise reliability. In: Lovric, M. (ed.) International Encyclopedia of Statistical Science, pp. 649–650. Springer (2011)


8. Coolen, F.P.A., Coolen-Maturi, T.: On generalizing the signature to systems with multiple types of components. In: Zamojski W., Mazurkiewicz J., Sugier J., Walkowiak T., Kacprzyk J. (eds.) Complex Systems and Dependability, pp. 115–130. Springer (2012) 9. Coolen, F.P.A., Coolen-Maturi, T.: Predictive inference for system reliability after commoncause component failures. Reliab. Eng. Syst. Saf. 13, 27–33 (2015) 10. Coolen, F.P.A., Coolen-Maturi, T.: The structure function for system reliability as predictive (imprecise) probability. Reliab. Eng. Syst. Saf. 154, 180–187 (2016) 11. Coolen, F.P.A., Goldstein, M., Wooff, D.A.: Project viability assessment for support of software testing via Bayesian graphical modelling. In: Bedford, T., van Gelder, P. H. A. J. M. (eds.) Safety and Reliability (Proceedings ESREL’03), pp. 417–422. Swets & Zeitlinger (2003) 12. Coolen, F.P.A., Goldstein, M., Wooff, D.A.: Using Bayesian statistics to support testing of software systems. J. Risk Reliab. 221, 85–93 (2007) 13. Coolen, F.P.A., Coolen-Maturi, T., Al-nefaiee, A.H.: Nonparametric predictive inference for system reliability using the survival signature. J. Risk Reliab. 228, 437–448 (2014) 14. Coolen, F.P.A., Huang, X., Najem, A.: Reliability analysis of phased mission systems when components can be swapped upon failure. ASCE-ASME J. Risk Uncertain. Eng. Syst. - Part B: Mech. Eng. 6(2), 020905 (2020) 15. Coolen-Maturi, T., Coolen, F.P.A., Balakrishnan, N.: The joint survival signature of coherent systems with shared components. In submission 16. De Finetti, B.: Theory of Probability (2 Vols). Wiley, Chichester (1974) 17. Eryilmaz, S., Coolen, F.P.A., Coolen-Maturi, T.: Marginal and joint reliability importance based on survival signature. Reliab. Eng. Syst. Saf. 172, 118–128 (2018) 18. Eryilmaz, S., Coolen, F.P.A., Coolen-Maturi, T.: Mean residual life of coherent systems consisting of multiple types of dependent components. Naval Res. Logist. 65, 86–97 (2018) 19. Feng, G., Patelli, E., Beer, M., Coolen, F.P.A.: Imprecise system reliability and component importance based on survival signature. Reliab. Eng. Syst. Saf. 150, 116–125 (2016) 20. George-Williams, H., Feng, G., Coolen, F.P.A., Beer, M., Patelli, E.: Extending the survival signature paradigm to complex systems with non-repairable dependent failures. J. Risk Reliab. 233, 505–519 (2019) 21. Huang, X., Coolen, F.P.A.: Reliability sensitivity analysis of coherent systems based on survival signature. J. Risk Reliab. 232, 627–634 (2018) 22. Huang, X., Aslett, L.J.M., Coolen, F.P.A.: Reliability analysis of general phased mission systems with a new survival signature. Reliab. Eng. Syst. Saf. 189, 416–422 (2019) 23. Huang, X., Coolen, F.P.A., Coolen-Maturi, T.: A heuristical survival signature based approach for reliability-redundancy allocation. Reliab. Eng. Syst. Saf. 185, 511–517 (2019) 24. Huang, X., Coolen, F.P.A., Coolen-Maturi, T., Zhang, Y.: A new study on reliability importance analysis of phased mission systems. IEEE Trans. Reliab. 69, 522–532 (2020) 25. Najem, A., Coolen, F.P.A.: System reliability and component importance when components can be swapped upon failure. Appl. Stoch. Model. Bus. Ind. 35, 399–413 (2019) 26. Patelli, E., Feng, G., Coolen, F.P.A., Coolen-Maturi, T.: Simulation methods for system reliability using the survival signature. Reliab. Eng. Syst. Saf. 167, 327–337 (2017) 27. Reed, S.: An efficient algorithm for exact computation of system and survival signatures using binary decision diagrams. Reliab. Eng. Syst. Saf. 
165, 257–267 (2017) 28. Rusnak, P., Zaitseva, E., Coolen, F.P.A., Kvassay, M., Levashenko, V.: Logic differential calculus for reliability analysis based on survival signature. In submission 29. Qin, J., Coolen, F.P.A.: Survival signature for reliability evaluation of a multi-state system with multi-state components. In submission 30. Samaniego, F.J.: On closure of the IFR class under formation of coherent systems. IEEE Trans. Reliab. 34, 69–72 (1985) 31. Samaniego, F.J.: System Signatures and Their Applications in Engineering Reliability. Springer, New York (2007) 32. Samaniego, F.J., Navarro, J.: On comparing coherent systems with heterogeneous components. Adv. Appl. Probab. 48, 88–111 (2016)


33. Utkin, L.V., Coolen, F.P.A.: Imprecise reliability: an introductory overview. In: Levitin, G. (ed.) Computational Intelligence in Reliability Engineering, Volume 2: New Metaheuristics, Neural and Fuzzy Techniques in Reliability, pp. 261–306. Springer (2007) 34. Walter, G., Aslett, L.J.M., Coolen, F.P.A.: Bayesian nonparametric system reliability using sets of priors. Int. J. Approx. Reason. 80, 67–88 (2017) 35. Wooff, D.A., Goldstein, M., Coolen, F.P.A.: Bayesian graphical models for software testing. IEEE Trans. Softw. Eng. 28, 510–525 (2002) 36. Wooff, D.A., Goldstein, M., Coolen, F.P.A.: Bayesian graphical models for high complexity testing: aspects of implementation. In: Kenett, R.S., Ruggeri, F., Faltin, F.W. (eds.) Analytic Methods in Systems and Software Testing, pp. 213–243. Wiley (2018)

Application of Fuzzy Decision Tree for the Construction of Structure Function for System Reliability Analysis Based on Uncertain Data Jan Rabcan, Peter Sedlacek, Igor Bolvashenkov, and Jörg Kammermann

Abstract The reliability analysis of any system is based on a mathematical representation (mathematical model) of the investigated system. Different types of mathematical models are used in reliability engineering. The use of a particular model is determined by the specifics of the system under study and by the requirements of the reliability analysis (which of the reliability indices should be calculated). One of the frequently used models, which allows the use of simple mathematical methods for reliability evaluation, is the structure function. The structure function determines the system performance depending on the states of its components. However, this mathematical model can be constructed only if all system states, depending on the components' states, can be indicated or measured. Therefore, the structure function cannot be used with incompletely specified and uncertain data about the system behavior. In this paper, a new method for structure function construction based on uncertain data is considered. This method uses the classification property of the structure function, which divides all system states into a set of classes corresponding to the system performance levels. A Fuzzy Decision Tree is used to implement the classification of system states based on uncertain data and to transform the result of this classification into the structure function.



1 Introduction

Reliability analysis, or the evaluation of system reliability, is an important step in the development and operation of any system. In terms of reliability analysis, the conception of a system can be very different, for example, nuclear power plants [1], transport systems [2], military systems [3], complex sociotechnical systems [4], or systems of critical infrastructure [5, 6]. The reliability evaluation allows the detection of problematic parts (components) of the system and reduces the likelihood or frequency of failures. One of the important steps in the reliability analysis of any system is the construction of its mathematical model or mathematical representation, which allows the calculation of reliability indices and measures [6]. According to the review in [6], the construction of a mathematical model for system reliability analysis includes four steps:

definition of a number of performance levels for system representation; choice of a mathematical representation of a system (system model); quantification (calculation of indices and measures for reliability evaluation); measurement and improvement of system reliability.

The first of these steps is the definition of the number of system performance levels and of the number of states for every component. Depending on the number of system performance levels, two types of system mathematical models can be used, known as the Binary-State System (BSS) and the Multi-State System (MSS) [6–8]. A BSS allows only one of two states to be indicated for each component and for the system performance, for example, perfect functioning and total failure. Such a mathematical model can be used for the analysis of system failure. It follows that the BSS works with the assumption that a system and its components can be only in two extreme states. Nevertheless, this assumption cannot be generalized to all real-world systems, because these systems can acquire performance levels between the extreme states of perfect functioning and total failure [8, 9]. The advantage of such a model is its simplicity, but the analysis of system degradation based on a BSS is complicated. On the other hand, an MSS allows each component and the system performance to be in one of many states, for example, perfect functioning, good functioning, and total failure [7].

The objective of the second step is to create the mathematical model of the system. Nowadays, different approaches have been described for creating mathematical models of a system. Typical mathematical approaches include, for example, Markov models, structure function based models (Reliability Block Diagram, Fault Tree, Binary Decision Diagram), Monte-Carlo simulation based models, and others [6]. Every mathematical model has some advantages and disadvantages. The specification of the mathematical model at the second step depends on the specifics of the system under study and on the requirements of the reliability analysis, such as time-dependent reliability characteristics, frequency indices, or other indices and measures. In this paper, we focus on the structure function. The advantages of this mathematical model are the possibility to construct the structure function for a system of any structural complexity and the simplicity of the methods for the analysis and evaluation of this mathematical model


[8–10]. In addition, the structure function can be used for both types of mathematical models, i.e., for the MSS as well as for the BSS.

The quantification of a system in the third step supposes the calculation of various system reliability indices and measures, for example, availability/unavailability, failure rate, importance measures, etc. [7, 11–13]. The methods and algorithms for the calculation of these indices depend on the mathematical representation of the investigated system and on the general method chosen in the second step. Methods and algorithms for the calculation of system reliability measures and indices based on the representation of the system by the structure function are considered in [8–10]. The indices and measures obtained by these methods at the third step are then used at the fourth step, where a strategy for system reliability improvement can be proposed based on their evaluation.

In this paper, we investigate two of the four steps of the mathematical model construction. We specify the MSS structure function for the system mathematical representation. This mathematical model is well investigated [9, 11, 14]. However, in these and other studies, the structure function of an MSS is considered for completely specified data, when all system states can be indicated from the system behavior. In real applications, data about system behavior is, as a rule, incompletely specified and uncertain [15, 16]. Therefore, the construction of the MSS structure function based on uncertain data is a relevant problem in reliability engineering. In this paper, a new method for MSS structure function construction based on a Fuzzy Decision Tree is proposed. This method uses the interpretation of the structure function as a classification structure: the structure function maps (classifies) all possible component states into several groups whose number corresponds to the number of system performance levels. According to the proposed method, the structure function is formed as a decision table based on an induced classifier. This correspondence between the decision table and the structure function has been used in the methodology for structure function construction based on uncertain data. The interpretation of the structure function as a decision table allows us to use methods for classifier induction based on uncertain data. In particular, in this paper we use a Fuzzy Decision Tree for classification. The use of a fuzzy classifier allows taking into account the uncertainty of the initial data [17]. The proposed method is a development of the approaches proposed in [18, 19] for the construction of the structure function based on uncertain and incompletely specified data.

This paper is structured as follows. The structure function from the point of view of classification is considered in Sect. 2. The principal steps of the proposed method are discussed in Sect. 3. Section 4 provides a description of the specifics of the data representation. The induction of the Fuzzy Decision Tree is introduced in Sect. 5. The case study focused on the structure function construction is presented in Sect. 6; an example of structure function construction for a dataset focused on failure prediction illustrates the proposed method.


2 MSS Structure Function

A structure function is a mathematical model of the system which captures the dependency between the system's performance and the states of its components. The performance of the system is determined by the states of its components. The structure function is defined for a stationary system (time is not considered), i.e., it must map every possible combination of component states to a system performance level. Assume the system consists of n components and is in a stationary state. This allows defining the structure function as a time-independent function [10, 11]. The state of the i-th component, i = 1, ..., n, is denoted by a variable x_i, whose value is equal to x_i = 0 if the i-th component has failed and x_i = 1, ..., m_i − 1 if the i-th component is functioning. The system components' states are defined by the state vector x = (x_1, x_2, ..., x_n). Denote by φ(x) the structure function; then:

φ(x_1, ..., x_n) = φ(x): {0, ..., m_1 − 1} × ··· × {0, ..., m_n − 1} → {0, ..., M − 1},    (1)

where φ(x) is the system state (performance level), from failure (φ(x) = 0) to perfect functioning (φ(x) = M − 1); x = (x_1, ..., x_n) is the state vector; and x_i is the i-th component state, which changes from failure (x_i = 0) to perfect functioning (x_i = m_i − 1). For example, according to (1), the structure function of a simple parallel system of two components (m_1 = 2, m_2 = 3 and M = 3) can be defined by a truth table (Table 1). Note that the structure function (1) can be interpreted as a classification structure (classifier) which divides all possible combinations of the system component states into several classes, and the number of these classes is the number of system performance levels (Fig. 1).

The methods for the reliability analysis of a system based on the structure function are well developed. The usage of these methods allows evaluating system availability/reliability [9, 14], importance analysis [10, 11], and reliability indices and measures [9, 12, 13]. These methods are developed with the assumption that the structure function is defined for all possible component states. In the case of uncertain data about the system behavior, these methods cannot be used without additional modification. At the same time, it should be noted that data about the behavior of most systems is very often vague, incompletely specified and uncertain.

Table 1 The structure function of a parallel system of two components (m_1 = 2, m_2 = 3 and M = 3)

x_1   | 0 | 0 | 0 | 1 | 1 | 1
x_2   | 0 | 1 | 2 | 0 | 1 | 2
φ(x)  | 0 | 1 | 1 | 1 | 2 | 2

Fig. 1 The classification of system states based on the structure function in Table 1: class 0 = {(00)}, class 1 = {(01), (02), (10)}, class 2 = {(11), (12)}
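To make the classification interpretation concrete, the following minimal Python sketch (an illustration, not code from this chapter) stores the structure function of Table 1 as a decision table and groups the state vectors by performance level, which reproduces the classes in Fig. 1.

# A minimal sketch: the structure function of Table 1 stored as a decision table.
from collections import defaultdict

phi = {  # state vector (x1, x2) -> system performance level
    (0, 0): 0, (0, 1): 1, (0, 2): 1,
    (1, 0): 1, (1, 1): 2, (1, 2): 2,
}

# The structure function acts as a classifier: it splits all state vectors
# into M classes, one per system performance level (cf. Fig. 1).
classes = defaultdict(list)
for state, level in phi.items():
    classes[level].append(state)

print(dict(classes))
# {0: [(0, 0)], 1: [(0, 1), (0, 2), (1, 0)], 2: [(1, 1), (1, 2)]}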

The uncertainty of data about system behavior is caused by many factors [15, 16]. It can be caused by the poor quality of data measurement or by unexpected data variations during measurement or monitoring [20]. Another source of uncertainty is preprocessing [21, 22]: in some cases, the form of the data about system behavior requires additional preprocessing and transformations that make the data acceptable for the analysis. For example, such an additional transformation may be necessary when the data is obtained by examination of components over a fixed time duration [23]. One more factor of data uncertainty is incomplete specification of the data [15]. Incomplete specification occurs when some states of a system component cannot be reached during the measurement of system behavior, which implies that the collected data about system behavior contains missing values. This happens when it is expensive to obtain all data about real system behavior, because not all situations can be reached during monitoring. Therefore, data about system behavior is often insufficient to create the structure function by the traditional approaches.

There are two ways to solve this problem. The first of them is the development of new mathematical models which enable the analysis of vague and uncertain data [17, 24]. The development of new models requires the development of new methods for their evaluation and for the calculation of reliability indices and measures. The second way assumes the usage of traditional models which can be adapted to analyze systems based on uncertain and vague data about their behavior. In this case, the system reliability evaluation can be implemented with traditional methods, but new methods for constructing the traditional mathematical models from uncertain data have to be proposed. In particular, the usage of the structure function for such systems requires a new method for its construction. Methods for the structure function construction based on classifiers induced from uncertain data have been considered in [11, 25, 26]; in these methods, the structure function is treated as a classifier (Fig. 1). In this paper, we consider the possibility of using the Fuzzy Decision Tree as such a classifier for the structure function construction.


3 Principal Steps of the Method for Structure Function Construction

Most real-world systems consist of a large number of components. The data collected about these components is often inhomogeneous and vague. Therefore, in many cases, reliability analysis needs a mathematical model of the system which is based on ambiguous and vague data. The ambiguity is present due to the vagueness of the collected data, which arises from the inaccurate measurement of component states. Uncertainty is also a result of the incomplete specification of the data: it arises when some component states or system performance levels cannot occur during the system monitoring time (for example, a nuclear power plant explosion). The design of the structure function as a classifier of system states should therefore take into account two aspects of the initial data uncertainty: vagueness and incomplete specification. The induction of classifiers from such data is a typical problem in Data Mining and, as a rule, is based on fuzzy classifiers. Therefore, a classifier based on fuzzy data is considered in this paper. Accordingly, the construction of the structure function based on uncertain data includes three principal steps:

• collecting of data — monitoring of the system performance according to the components' states;
• mathematical representation of the system in the form of a fuzzy classifier;
• construction of the structure function derived from the fuzzy classifier.

Preparing an accurate classifier very often requires preprocessing of the initial data. In some cases, this preprocessing can increase the vagueness of the data. Therefore, we perform a transformation that converts numerical data into fuzzy data; this transformation allows the vagueness to be taken into account in the subsequent analysis. The collection of the initial data about the system behavior and the transformation of the collected data into fuzzy data are implemented in the first step of the considered method. In the second step, the fuzzy classifier is induced from the collected data. The fuzzy classifier allows us to take into account the uncertainty of the initial data by using fuzzy values for the component states (the structure function's variables) and the system performance levels (the structure function's values). The fuzzy classifier can be induced from uncertain and incomplete data based on some (not all) instances. An instance consists of input and output attributes: the state vector x = (x_1, ..., x_n) is interpreted as the input attributes and the value of the structure function as the output attribute. The method for the classifier induction depends on the type of classifier. In this paper, we propose to use a classifier based on a decision tree [18, 26]. A decision tree is a formalism for expressing the mapping of input attributes (component states) to one or more output attributes (system performance levels). A decision tree consists of attribute nodes (input attributes) linked to two or more sub-trees, and of leaves or decision nodes labeled with the classes of the output attribute (in our case, a class agrees with a system performance level). In particular, the Fuzzy Decision Tree is considered in this paper. The induced Fuzzy Decision Tree permits classifying known as well as new samples, where samples correspond to component states for the structure function induction. Therefore,


the structure function can be constructed for all possible combinations of component states if they are supplied as instances to the Fuzzy Decision Tree. In terms of Data Mining, the structure function is interpreted as a decision table [18, 25]. The Fuzzy Decision Tree is used to obtain the decision table that is necessary to construct the structure function. This approach allows predicting the system performance for every possible profile of component states: the Fuzzy Decision Tree provides the mapping of all possible component states (input data) to the M performance levels.

4 Data Transformation

The classification algorithm used in this paper works with fuzzy data, which permits taking into account the possible ambiguity of the data. Therefore, a data preprocessing step which transforms numerical data into fuzzy data has been performed. Let a numerical attribute X_i be defined by a vector of real scalar values {x_1, x_2, ..., x_k, ..., x_K}. We transform each numerical attribute into a fuzzy attribute. In this case, each fuzzy attribute A_i is represented by m_i (m_i ≥ 2) linguistic terms, where the value of m_i is equal to the resulting number of clusters. The fuzzification has been performed by the algorithm described in [27], which is based on the computation of the fuzzy entropy of fuzzy sets. The algorithm divides the values x ∈ X_i into m_i intervals. The intervals are defined by points C_1, ..., C_{m_i}, which we find by the K-Means algorithm as in [27]. The number m_i of intervals is determined automatically and agrees with the number of linguistic terms of the fuzzy attribute A_i. The algorithm initially starts with two linguistic terms and then repeatedly adds a linguistic term to the attribute; adding of linguistic terms is repeated until the fuzzy entropy of the attribute no longer increases. The definition of the fuzzy entropy that the algorithm uses is as follows:

FE(A_{i,j}) = − Σ_{b=1}^{m_b} D^b_{A_{i,j}} log_2 D^b_{A_{i,j}},

where m_b is the number of classes defined by the output attribute B. The notation x ∈ b means that x belongs to class b. Then, D^b_{A_{i,j}} is defined as follows:

D^b_{A_{i,j}} = ( Σ_{x ∈ b} μ_{A_{i,j}}(x) ) / ( Σ_x μ_{A_{i,j}}(x) ).

For each attribute A_i the constraint Σ_{j=1}^{m_i} μ_{A_{i,j}}(x) = 1 must hold. Then, the fuzzy entropy of the attribute A_i is defined as:

FE(A_i) = Σ_{j=1}^{m_i} FE(A_{i,j}).

The fuzzification process assigns a membership degree μ_{A_{i,j}}(x) to each value x of the numeric attribute X_i. This membership degree is obtained by a triangular membership function. The membership function of the first linguistic term A_{i,1} of attribute A_i is defined as:

μ_{A_{i,1}}(x) =
  1                                for x ≤ C_1,
  (C_2 − x)/(C_2 − C_1)            for C_1 < x < C_2,
  0                                for x ≥ C_2.

Each linguistic term of A_i that is neither the first nor the last one has a membership function μ_{A_{i,q}} (q = 2, 3, ..., m_i − 1) defined as follows:

μ_{A_{i,q}}(x) =
  0                                for x ≤ C_{q−1},
  (x − C_{q−1})/(C_q − C_{q−1})    for C_{q−1} < x ≤ C_q,
  (C_{q+1} − x)/(C_{q+1} − C_q)    for C_q < x ≤ C_{q+1},
  0                                for x ≥ C_{q+1}.

Finally, the last term A_{i,m_i} of A_i has the membership function of the following form:

μ_{A_{i,m_i}}(x) =
  0                                          for x ≤ C_{m_i−1},
  (x − C_{m_i−1})/(C_{m_i} − C_{m_i−1})      for C_{m_i−1} < x ≤ C_{m_i},
  1                                          for x ≥ C_{m_i}.

After the fuzzification we obtain data which is used for the Fuzzy Decision Tree induction.
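The following Python sketch illustrates this fuzzification step (it is an illustration under stated assumptions, not code from the chapter). The cut points C are assumed to be already available, e.g., as cluster centres found by K-Means as in [27]; the helper names are ours.

# A minimal sketch of the fuzzification step described above.
import numpy as np

def triangular_memberships(x, C):
    # Membership degrees of the values x to the m_i linguistic terms defined by
    # the sorted cut points C = [C_1, ..., C_mi] (triangular functions above).
    x = np.asarray(x, dtype=float)
    C = np.sort(np.asarray(C, dtype=float))
    m = len(C)
    mu = np.zeros((len(x), m))
    mu[:, 0] = np.clip((C[1] - x) / (C[1] - C[0]), 0.0, 1.0)      # first term
    mu[:, -1] = np.clip((x - C[-2]) / (C[-1] - C[-2]), 0.0, 1.0)  # last term
    for q in range(1, m - 1):                                     # middle terms
        rising = (x - C[q - 1]) / (C[q] - C[q - 1])
        falling = (C[q + 1] - x) / (C[q + 1] - C[q])
        mu[:, q] = np.clip(np.minimum(rising, falling), 0.0, 1.0)
    return mu                                                     # rows sum to 1

def fuzzy_entropy(mu, y):
    # FE(A_i) = sum_j FE(A_{i,j}), with class labels y of the output attribute B.
    fe = 0.0
    for j in range(mu.shape[1]):
        total = mu[:, j].sum()
        for b in np.unique(y):
            d = mu[y == b, j].sum() / total if total > 0 else 0.0
            if d > 0:
                fe -= d * np.log2(d)
    return fe

A candidate number of terms m_i can then be accepted or rejected by comparing fuzzy_entropy for consecutive numbers of cut points, following the stopping rule described above.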

5 Fuzzy Decision Tree Induction Based on Cumulative Mutual Information

Decision trees are a well-known classification technique. These trees are composed of leaves and nodes. Each node has one associated (splitting) attribute, which determines the number of outgoing edges of the node; this number is the same as the number of all possible values of the attribute. The classification of a new instance starts from the tree root. In each node, the splitting attribute determines the outgoing edge for the classified instance, and the classification continues in the appropriate sub-tree. When a classified instance reaches a leaf, the leaf predicts a class. In the case of fuzzy decision trees, a classified instance can pass along multiple branches, and therefore the decision is based on a set of leaves. In this paper, we use the cumulative mutual information defined in [28] to select the splitting attributes.

The splitting attribute at step q of the tree induction is selected by maximizing the ratio of the cumulative mutual information (CMI) to the conditional cumulative entropy:

i_q = argmax_i { I(B; U_{q−1}, A_i) / H(A_i | U_{q−1}) },    (2)

where the function argmax returns the attribute index i_q with the maximal value of this criterion and U_{q−1} denotes the sequence of attribute terms selected in the previous steps. The CMI in the output attribute B about the attribute A_{i_q} and the sequence of values U_{q−1} has been introduced in [28] and is calculated as:

I(B; U_{q−1}, A_{i_q}) = Σ_{j_q=1}^{m_{i_q}} Σ_{j=1}^{m_b} M(B_j × U_{q−1} × A_{i_q, j_q}) × [ log_2 M(B_j × U_{q−1} × A_{i_q, j_q}) + log_2 M(U_{q−1}) − log_2 M(B_j × U_{q−1}) − log_2 M(U_{q−1} × A_{i_q, j_q}) ],

where M(·) denotes the cardinality of the corresponding fuzzy set. The conditional cumulative entropy between the fuzzy attribute A_{i_q} and the sequence of selected attribute terms U_{q−1} is defined as:

H(A_{i_q} | U_{q−1}) = Σ_{j=1}^{m_{i_q}} M(A_{i_q, j} × U_{q−1}) × [ log_2 M(U_{q−1}) − log_2 M(A_{i_q, j} × U_{q−1}) ].

We used unpruned fuzzy decision trees to prepare the structure function. The obtained Fuzzy Decision Tree is used to create the structure function in such a way that each possible combination of input attributes (or, in terms of reliability, of system component states) is classified, and the output of the classification gives the system performance. In this way we obtain a decision table. This table agrees with the truth table of the system, which is one of the possible representations of the structure function. The obtained structure function allows the computation of various reliability indices and measures, for example, structural importance.
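The short Python sketch below (an illustration, not code from the chapter) evaluates the splitting criterion (2) from membership-degree matrices. The fuzzy intersection inside M(·) is assumed here to be the algebraic product of membership degrees, zero-cardinality terms are skipped, and a guard against zero conditional entropy is omitted for brevity.

import numpy as np

def card(mu):
    # Fuzzy cardinality M(.) as the sum of membership degrees over all instances
    return float(np.sum(mu))

def cumulative_mutual_information(mu_B, mu_U, mu_A):
    # mu_B: (K, m_b) memberships to the output classes, mu_U: (K,) membership to the
    # already selected term sequence U_{q-1}, mu_A: (K, m_i) memberships to the terms
    # of a candidate attribute; K is the number of instances.
    total = 0.0
    for jq in range(mu_A.shape[1]):
        for j in range(mu_B.shape[1]):
            m_bua = card(mu_B[:, j] * mu_U * mu_A[:, jq])
            if m_bua <= 0.0:
                continue
            total += m_bua * (np.log2(m_bua) + np.log2(card(mu_U))
                              - np.log2(card(mu_B[:, j] * mu_U))
                              - np.log2(card(mu_U * mu_A[:, jq])))
    return total

def conditional_cumulative_entropy(mu_A, mu_U):
    h = 0.0
    for j in range(mu_A.shape[1]):
        m_au = card(mu_A[:, j] * mu_U)
        if m_au > 0.0:
            h += m_au * (np.log2(card(mu_U)) - np.log2(m_au))
    return h

def select_splitting_attribute(mu_B, mu_U, candidates):
    # candidates: dict attribute index -> membership matrix; criterion (2)
    scores = {i: cumulative_mutual_information(mu_B, mu_U, mu_A)
                 / conditional_cumulative_entropy(mu_A, mu_U)
              for i, mu_A in candidates.items()}
    return max(scores, key=scores.get)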

6 Case Study: Reliability Analysis of the System Failure Prediction

This section provides a demonstration of the structure function construction on the dataset [29] focused on failure prediction of a production process. According to the description in [29], the dataset was generated to simulate the activities of a company that uses many machines to build final products. The production of products is stopped each time a machine has a failure. The dataset contains the monitoring of these machines, which was focused on the levels of pressure, moisture and temperature inside the machines. According to the levels of these variables, a machine failure can be predicted. Therefore, the dataset contains four input attributes and one output attribute. The names and descriptions of these attributes are listed in Table 2.

Table 2 Description of attributes

Attribute | Name        | Description
A1        | Temperature | Temperature inside a machine
A2        | Moisture    | Moisture inside a machine
A3        | Pressure    | Pressure inside a machine
A4        | Life Time   | The number of weeks from the last machine failure
B         | Broken      | Determines if a machine is broken or working

This dataset contains 1000 measurements (instances) described by these attributes. The instances are divided into two classes by the output attribute "Broken": no or yes. The first class indicates that the machine is working correctly; the second one corresponds to the failure of a machine. It should be noted that if at least one machine fails, the production is interrupted until the machine is repaired; in terms of reliability, the failure of one machine causes the failure of the whole system. A short part of the used dataset is shown in Table 3.

Table 3 Example of dataset for failure prediction

Life Time | Pressure | Moisture | Temperature | Broken
30        | 87.56    | 115.63   | 89.75       | No
28        | 86.63    | 86.69    | 128.36      | No
19        | 92.14    | 106.56   | 123.11      | No
93        | 89.64    | 108.96   | 100.42      | No
58        | 101.65   | 92.65    | 100.45      | Yes
96        | 84.52    | 111.63   | 86.48       | Yes
60        | 127.84   | 183.69   | 175.83      | Yes
15        | 86.47    | 109.42   | 95.19       | No
15        | 95.45    | 107.99   | 94.55       | No
...       | ...      | ...      | ...         | ...
60        | 100.26   | 106.57   | 83.91       | No

Fig. 2 Principal steps for structural function generation

According to this dataset, an FDT can be induced. To generate the structure function, we used an unpruned FDT; the resulting FDT is shown in Fig. 2. This FDT consists of nodes, which are represented by rectangles. In the left part of each rectangle, the attribute associated with the node is shown. The right part of the rectangle has two lines: the first line shows the frequency of the input branch, and the second line contains two numbers.

The first number stands for the membership degree to the class yes (B2, the machine is broken), and the second number is the membership degree to the class no (B1, the machine is not broken). Non-leaf nodes have outgoing edges, which are associated with the values of the input attributes. For example, attribute A1 has two possible linguistic values, and therefore the node associated with this attribute has two outgoing edges: one edge is A1,1, which covers instances with a low level of temperature, and the second edge is A1,2, which covers instances with a high level of temperature. We use an FDT, which allows instances to pass down the tree along several branches. For example, if some instance has a temperature between the low and high levels, this instance will pass through both branches A1,1 and A1,2 with some membership degrees. Therefore, the classification ends in a set of leaves, and these leaves are then used to obtain the final classification result by some combination method.

According to the obtained FDT, the structure function can be constructed as follows. Each input attribute has a domain consisting of precisely two linguistic fuzzy terms (these linguistic terms have been obtained by the fuzzification). The i-th system component is represented by the i-th input attribute, and the linguistic terms of this attribute represent the states of the corresponding component. For example, the input attribute A1 (Temperature) represents the first component x_1, which has two states according to the number of possible values of the attribute.

Table 4 Structure function of the investigated data for failure prediction

x_1   | 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
x_2   | 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x_3   | 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x_4   | 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
φ(x)  | 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1

The number of possible values of the attribute agrees with the number of component states. So, if attribute A1 has two linguistic terms {A1,1, A1,2} which cover all its values, then component x_1 also has two possible states. The state equal to 1 corresponds to the correct (working) state of the component, and the value 0 corresponds to the failure of the component. The system performance level φ(x) depends on the states of its components. The structure function obtained from the data is shown in Table 4. This structure function has been constructed based on the FDT (i.e., the FDT is induced from the data, as in Fig. 2, and the decision table that agrees with the structure function is formed). We need to note that the data obtained after fuzzification contains missing state vectors. Missing state vectors occur when some state of a component or some system performance level cannot be observed during monitoring. When all possible state vectors are known, the decision table is complete: a complete decision table includes every possible combination of input attributes (system states). The number of all possible combinations increases exponentially with the number of input attributes; for example, if the decision table contains n binary input attributes, then there exist 2^n different combinations. In terms of reliability, the decision table gives the system performance for every possible profile of component states. To create the decision table, the FDT determines the output class (system performance) for each possible combination of input attributes (component states).

We thus obtained the representation of the system by the structure function based on the FDT; this structure function has the form of a truth table. Such a representation of the system allows calculating various system reliability indices and measures. For example, we can evaluate the importance of each component for the system performance according to importance measures [30]; another approach is focused on the calculation of the probabilities of the system performance levels [31]. In this paper, we show how to use the obtained function to calculate the structural importance (SI) of the components with respect to the system performance. The details of the algorithms used for the structural importance calculation are presented in [28]. The calculation of this importance measure is done for each system component. SI is considered one of the simplest importance measures. It expresses the contribution of a component to the system performance level from the point of view of the topological aspects of the system, i.e., of the layout of the components in the system, based on its structure function. The advantage of this index is that the reliability of the individual components is not necessary for its calculation, so it can be used already in the phase of the system design. According to [32], the SI is defined as the ratio of the number of system states in which decreasing of the i-th component state causes failure of the system, i.e.:

SI_i^{s,j} = p_i^{s,j} / p_s,

where p_i^{s,j} stands for the number of system states in which a change of the i-th component state from s to s − 1 causes a change of the performance level of the system from j to j − 1, and p_s is the number of system states for which φ(s_i, x) = j. For the structure function, SI can thus be considered as the ratio of the number of occurrences in which the failure of the i-th component causes the failure of the system to the number of system states in which the system and the i-th component work, for the analyzed performance level. The structural importance for the analyzed system is shown in Table 5.

Table 5 Structural importances of the components

Component | Structural importance
x_1       | 1.00
x_2       | 0.25
x_3       | 0.25
x_4       | 0.25

According to the data in Table 5, the first component has the highest importance for the considered system. This component is associated with the temperature of the machines, which means that the temperature has the maximal influence on the machines' operation. The other components have a lower impact on the machines' operating performance. Therefore, in system maintenance, the company should focus on keeping the temperature of the machines in the correct range.
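A small Python sketch (illustrative, not code from the chapter) computes the structural importance of each component directly from the truth table in Table 4 for the binary case (s = 1, j = 1); it reproduces the values in Table 5.

from itertools import product

# Structure function of Table 4: phi = x1 AND (x2 OR x3 OR x4), stored as a decision table.
phi = {x: int(x[0] == 1 and (x[1] or x[2] or x[3])) for x in product((0, 1), repeat=4)}

def structural_importance(phi, i):
    # p_s: states with x_i = 1 in which the system works (phi = 1)
    # p_i: among them, states where setting x_i = 0 drops the system to failure
    working = [x for x in phi if x[i] == 1 and phi[x] == 1]
    drops = [x for x in working if phi[x[:i] + (0,) + x[i + 1:]] == 0]
    return len(drops) / len(working)

for i in range(4):
    print(f"x{i + 1}: SI = {structural_importance(phi, i):.2f}")
# x1: SI = 1.00, x2: SI = 0.25, x3: SI = 0.25, x4: SI = 0.25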

7 Conclusion

We proposed a new method for structure function construction based on ambiguous and uncertain data. This method is a development of the approach proposed in [11]. The proposed method has three principal steps: collecting data about the system behavior, FDT induction, and structure function construction. The combination of these steps allows working with incompletely specified data, and the interpretation of the data as fuzzy data allows taking into account the ambiguity of the input data for the structure function construction.


The structure function cannot be constructed by traditional approaches for a system in which the system performance corresponding to some state vectors is not known. The main novelty of the method proposed in this paper is that it allows constructing the structure function under these conditions. Moreover, the proposed method deals with the uncertainty which remains after the preprocessing of the initial data obtained from system monitoring. Therefore, the proposed method extends the applicability of the structure function. The proposed method has been demonstrated on an example focused on the reliability evaluation of a production process. The dataset used in the case study is focused on failure prediction [29]. This production process uses five machines, and the failure of at least one of these machines causes a system failure. We can measure the temperature, moisture, life time and pressure inside the machines. According to our reliability analysis, we discovered that the temperature has the biggest impact on the machine operation state.

In further research, we can investigate other algorithms for fuzzification, which is an important part of the proposed method responsible for the processing of uncertainty. Another direction of further research can be focused on the development of new classifiers that can deal with uncertain data. Such classifiers can be used not only in reliability analysis but also in typical Data Mining tasks; therefore, the development of such algorithms is topical and has a wide range of applications.

Acknowledgements The presented investigation is supported by the grant of the Slovak Research and Development Agency "Mathematical Models based on Boolean and Multiple-Valued Logics in Risk and Safety Analysis" (SK-FR-2019-0003).

References

1. Watanabe, Y., Oikawa, T., Muramatsu, K.: Development of the DQFM method to consider the effect of correlation of component failures in seismic PSA of nuclear power plant. Reliab. Eng. Syst. Saf. 79, 265–279 (2003). https://doi.org/10.1016/S0951-8320(02)00053-4
2. Nyström, B., Austrin, L., Ankarbäck, N., Nilsson, E.: Fault tree analysis of an aircraft electric power supply system to electrical actuators. In: 2006 9th International Conference on Probabilistic Methods Applied to Power Systems, PMAPS (2006). https://doi.org/10.1109/PMAPS.2006.360325
3. Levitin, G., Hausken, K.: Influence of attacker's target recognition ability on defense strategy in homogeneous parallel systems. Reliab. Eng. Syst. Saf. 95, 565–572 (2010). https://doi.org/10.1016/j.ress.2010.01.007
4. Lisnianski, A., Frenkel, I., Ding, Y.: Multi-State System Reliability Analysis and Optimization for Engineers and Industrial Managers, pp. 1–393 (2010). https://doi.org/10.1007/978-1-84996-320-6
5. Praks, P., Kopustinskas, V., Masera, M.: Probabilistic modelling of security of supply in gas networks and evaluation of new infrastructure. Reliab. Eng. Syst. Saf. 144, 254–264 (2015). https://doi.org/10.1016/j.ress.2015.08.005
6. Zio, E.: Reliability engineering: old problems and new challenges. Reliab. Eng. Syst. Saf. 94, 125–141 (2009). https://doi.org/10.1016/j.ress.2008.06.002
7. Birolini, A.: Reliability Engineering: Theory and Practice. Springer (2007)


8. Zaitseva, E., Levashenko, V.: Investigation multi-state system reliability by structure function. In: Proceedings—International Conference on Dependability of Computer Systems, DepCoS-RELCOMEX 2007, pp. 81–90 (2007). https://doi.org/10.1109/DEPCOS-RELCOMEX.2007.28
9. Natvig, B.: Multistate Systems Reliability Theory with Applications, pp. 1–232 (2010). https://doi.org/10.1002/9780470977088
10. Zaitseva, E.N., Levashenko, V.G.: Importance analysis by logical differential calculus. Autom. Remote Control 74, 171–182 (2013). https://doi.org/10.1134/S000511791302001X
11. Zaitseva, E., Levashenko, V.: Construction of a reliability structure function based on uncertain data. IEEE Trans. Reliab. 65, 1710–1723 (2016). https://doi.org/10.1109/TR.2016.2578948
12. Barlow, R.E., Proschan, F.: Statistical theory of reliability and life testing: probability models. Technometrics 72, 304 (1975)
13. Aven, T.: On performance measures for multistate monotone systems. Reliab. Eng. Syst. Saf. 41, 259–266 (1993). https://doi.org/10.1016/0951-8320(93)90078-D
14. Cutello, V., Montero, J., Yáñez, J.: Structure functions with fuzzy states. Fuzzy Sets Syst. 83, 189–202 (1996). https://doi.org/10.1016/0165-0114(95)00390-8
15. Lisnianski, A., Levitin, G.: Multi-state system reliability: assessment, optimization and applications. Ser. Qual. Reliab. Eng. Stat. 207–237 (2003). https://doi.org/10.1016/j.rser.2008.02.009
16. Kvassay, M., Levashenko, V., Zaitseva, E.: Analysis of minimal cut and path sets based on direct partial Boolean derivatives. Proc. Inst. Mech. Eng. Part O J. Risk Reliab. 230, 147–161 (2015). https://doi.org/10.1177/1748006X15598722
17. Kolowrocki, K.: Reliability of Large Systems, pp. 1–328 (2004). https://doi.org/10.1016/B978-008-044429-1.X5000-4
18. Aven, T., Baraldi, P., Flage, R., Zio, E.: Uncertainty in Risk Assessment: The Representation and Treatment of Uncertainties by Probabilistic and Non-probabilistic Methods, pp. 1–186 (2014). https://doi.org/10.1002/9781118763032
19. Jensen, A., Aven, T.: A new definition of complexity in a risk analysis setting. Reliab. Eng. Syst. Saf. 171, 169–173 (2018). https://doi.org/10.1016/j.ress.2017.11.018
20. Cox, L.A., Jr.: Confronting deep uncertainties in risk analysis. Risk Anal. 32, 1607–1629 (2012)
21. Potapov, P.: On the loss of information in PCA of spectrum-images. Ultramicroscopy 182, 191–194 (2017). https://doi.org/10.1016/j.ultramic.2017.06.023
22. Geiger, B.C., Kubin, G.: Information Loss in Deterministic Signal Processing Systems. Springer (2018). https://doi.org/10.1007/978-3-319-59533-7
23. Khan, S., Yairi, T.: A review on the application of deep learning in system health management. Mech. Syst. Signal Process. 107, 241–265 (2018). https://doi.org/10.1016/j.ymssp.2017.11.024
24. Kansal, M.L., Agarwal, S.S.: Fuzzy based transformer failure analysis under uncertainty. In: 2014 International Conference on Reliability Optimization and Information Technology (ICROIT), pp. 1–5. IEEE (2014). https://doi.org/10.1109/ICROIT.2014.6798280
25. Zaitseva, E., Levashenko, V., Kvassay, M., Rabcan, J.: Application of ordered fuzzy decision trees in construction of structure function of multi-state system. In: Communications in Computer and Information Science, pp. 56–75 (2017). https://doi.org/10.1007/978-3-319-69965-3_4
26. Rabcan, J., Rusnak, P.: Generation of structure function based on ambiguous and incompletely specified data using the fuzzy decision trees. In: International Conference on Emerging eLearning Technologies and Applications (2017). https://doi.org/10.1080/09700168209427586
27. Lee, H.-M., Chen, C.-M., Chen, J.-M., Jou, Y.-L.: An efficient fuzzy classifier with feature selection based on fuzzy entropy. IEEE Trans. Syst. Man Cybern. Part B Cybern. 31, 426–432 (2001). https://doi.org/10.1109/3477.931536
28. Levashenko, V., Zaitseva, E.: Usage of new information estimations for induction of fuzzy decision trees. Lect. Notes Comput. Sci. 2412, 493–499 (2002)


29. Gyamfi, S.K.: UCI machine learning repository: ultrasonic flowmeter diagnostics data set. https://archive.ics.uci.edu/ml/datasets/Ultrasonic+flowmeter+diagnostics. Accessed 14 Mar 2018
30. Kvassay, M., Zaitseva, E., Levashenko, V.: Minimal cut sets and direct partial logic derivatives in reliability analysis. In: Safety and Reliability: Methodology and Applications—Proceedings of the European Safety and Reliability Conference, ESREL 2014, pp. 241–248 (2015). https://doi.org/10.1201/b17399-37
31. Levashenko, V., Zaitseva, E., Kvassay, M.: Application of structure function in system reliability analysis based on uncertain data. In: CEUR Workshop Proceedings, vol. 1614 (2016)
32. Armstrong, M.J.: Reliability-importance and dual failure-mode components. IEEE Trans. Reliab. 46, 212–221 (1997). https://doi.org/10.1109/24.589949

Unavailability Optimization of a System Undergoing a Real Ageing Process Under Failure Based PM Radim Briš and Pavel Jahoda

Abstract This chapter deals with the investigation of a system which acts on demand in the case of an emergency. To verify its functionality, the system must be regularly inspected with a given period and tested to find latent failures. A failure-based preventive maintenance is considered in the context of an imperfect corrective maintenance model. This means that preventive maintenance brings a full renewal, which is realized after a prescribed number of failures. If a failure occurs, it is detected during the first follow-up inspection and the restoration process starts. An imperfect corrective maintenance model is assumed in which each restoration deteriorates the system lifetime, whose probability distribution is gradually changed via a step-by-step rising failure rate. The reliability mathematics for the unavailability analysis is briefly outlined in this chapter. The new renewal process model, including the failure-based preventive maintenance, is designated here as a real ageing process. The imperfect corrective maintenance, as well as an increasing period between inspections, results in an undesired rise of unavailability, which can be corrected by a properly selected failure-based preventive maintenance. This optimization process is demonstrated on a selected system adopted from the references.

1 Introduction

The majority of important industrial systems are subjected to corrective and preventive maintenance actions which are supposed to extend their functioning life. Corrective maintenance (CM), sometimes called restoration or repair, is realized after a failure occurs and intends to put the system into a state in which it can perform its function again.


Preventive maintenance (PM) is realized while the system is operated and intends to slow down the ageing process and decrease the occurrence of system failures. Reliability growth can be reached by applying both corrective and preventive maintenance. If dormant systems are investigated, which are frequently used, for example, in nuclear power plants, preventive maintenance complemented by regular inspections is of particular interest, because the system mostly remains in a standby state in which a failure is not identified immediately after its occurrence.

We consider a dormant system, i.e., a system which acts on demand in the case of an emergency. To verify its functionality, the system must be regularly inspected with a given period and tested to find latent failures. We suppose a failure-based preventive maintenance in the context of an imperfect corrective maintenance model. This means that preventive maintenance brings a full renewal, which is realized after a prescribed number of failures. If a failure occurs, it is detected during the first follow-up inspection and the corrective maintenance starts. An imperfect corrective maintenance model is developed in which each restoration deteriorates the system lifetime, whose probability distribution is gradually changed via an increasing failure rate. Failure-based preventive maintenance (FBM) is considered, which means that the preventive maintenance is realized at the occurrence of every nth failure, where n is a given deterministic number. Each FBM operation terminates one renewal cycle. The new renewal process model, including the failure-based replacement, is referred to here as a real ageing process.

The idea of how the maintenance process should be realized, i.e., perfect versus imperfect maintenance, is not new; it has been developed in many past references. Depending on the degree of restoration, we can differentiate between several kinds of preventive and corrective actions, as mentioned in [1]: a perfect maintenance action that returns the system to an as-good-as-new condition, and an imperfect maintenance action that can have several restoration levels between perfect maintenance and minimal maintenance, which restores the system to the condition it was in immediately before the failure occurred, so that the system has the same failure rate as just before the minimal maintenance action. The gradual CM repair strategy is explored in detail in [2], where the repair action gradually restores the system to its initial level and the system returns to the operating state, i.e., perfect CM is assumed. However, a system in real practice is renewed somewhere between as-good-as-new and as-bad-as-old, which is referred to above as "imperfect maintenance". For reliability evaluation, it is important to determine the level of imperfection of any maintenance operation. Imperfect restoration of system components was considered by the authors in [3], who work on the presumption that the system age is influenced by the maintenance action. However, the system failure rate can also be deteriorated due to maintenance, as discussed in [4]. Hence, the articles [5, 6] assume a more realistic hybrid model, which supposes both age reduction and failure rate adaptation. In the references [4, 6], imperfect maintenance relates to some fixed maintenance operation applied to the system which improves its health anywhere between a minimal repair and a perfect repair, i.e., full renewal.


Additional imperfect repair models, in which the system does not always become like new after repair, were proposed in [7, 8]. In the proposed work, we suppose an imperfect CM process in which each CM intervention deteriorates the system lifetime, i.e., the lifetime probability distribution is gradually changed via an increasing failure rate. In this case, the failure rate is increased in two different ways, because apart from this gradual increase we must consider the increase of the failure rate due to the standard stochastic ageing process. Our FBM policy model is closely related to the imperfect maintenance processes discussed and recapitulated in [9]. For instance, a similar policy, in the version where a unit is replaced at the nth failure and the (n − 1) previous failures are restored by minimal repair, is considered in [10]. Stochastic models describing the failure type of repairable units subject to minimal maintenance are discussed in [11]. In [12], an extended replacement policy is presented in which a unit is replaced at time T or at the nth failure, whichever occurs first, where n is optimized with respect to the CM cost and the cost of the scheduled replacement. We intend here to develop an innovative reliability model including both the FBM policy and the imperfect maintenance process, which can serve for optimization purposes in the case of an unavailability restriction. The imperfect corrective maintenance, as well as an increasing period between inspections, results in an undesired rise of the unavailability function, which can be corrected by a properly selected failure-based PM. This optimization process is demonstrated on a selected system adopted from [13].

2 Notation

We use the following notation:

PM ... preventive maintenance (replacement).
CM ... corrective maintenance (repair).
FBM ... failure-based PM.
T ... length of a renewal cycle.
τ ... time between two inspections.
I_r = ((r − 1)τ, rτ] ... the r-th inspection interval.
k_i ... the i-th failure occurs in I_{k_i}.
j_i ... CM of the i-th failure is accomplished in I_{j_i}.


X_1 ... time from the beginning of the renewal cycle to the first failure.
Y_i ... duration of the i-th CM, i ∈ {1, ..., n − 1}.
X_i ... time from the inspection following the end of the (i − 1)-th CM to the i-th failure, i ∈ {2, ..., n}.
Z ... duration of PM (PM starts at k_nτ).
x ... value of a random variable X.
F_X(t) = P(X ≤ t) ... cumulative distribution function of a random variable X.
\bar{F}_X(t) = 1 − F_X(t) ... reliability function of a random variable X.
p_{X_1 k_1} = P(X_1 ∈ I_{k_1}) = P((k_1 − 1)τ < X_1 ≤ k_1τ) = F_{X_1}(k_1τ) − F_{X_1}((k_1 − 1)τ) ... the probability that the first system failure occurs within the k_1-th inspection interval.
p_{Y_i j_i k_i} = P((j_i − 1)τ < Y_i + k_iτ ≤ j_iτ) = F_{Y_i}(j_iτ − k_iτ) − F_{Y_i}((j_i − 1)τ − k_iτ) ... the probability that the repair of the i-th failure is accomplished within the j_i-th inspection interval (given that the i-th failure occurred within the k_i-th inspection interval).
p_{X_i k_i j_{i−1}} = P((k_i − 1)τ < X_i + j_{i−1}τ ≤ k_iτ) = F_{X_i}(k_iτ − j_{i−1}τ) − F_{X_i}((k_i − 1)τ − j_{i−1}τ) ... the probability that the i-th system failure occurs within the k_i-th inspection interval (given that the repair of the (i − 1)-th failure was accomplished within the j_{i−1}-th inspection interval).
p_{k_1, j_1, ..., k_i, j_i} = p_{X_1 k_1} p_{Y_1 j_1 k_1} ⋯ p_{X_i k_i j_{i−1}} p_{Y_i j_i k_i}.
p_{k_1, j_1, ..., k_i} = p_{k_1, j_1, ..., k_{i−1}, j_{i−1}} p_{X_i k_i j_{i−1}}.
[x] ... integral part of the real number x (i.e., f(x) = [x] is the floor function).
χ ... χ(A) = 1 if A is true, and χ(A) = 0 if A is false.
ϑ ... unavailability function.
h(t) ... failure rate.
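As an illustration of this notation (not code from the chapter), the following Python sketch evaluates the interval probabilities p_{X_1 k_1}, p_{Y_i j_i k_i} and p_{X_i k_i j_{i−1}} for given distribution functions. The concrete distributions below (a Weibull lifetime and a uniform CM duration, as used later in Sect. 4) are only an assumed example here.

from scipy.stats import weibull_min, uniform

tau = 120.0                                  # inspection period (days), assumed
F_X1 = weibull_min(c=2.0, scale=600.0).cdf   # CDF of the first lifetime X_1
F_Y = uniform(loc=12.0, scale=6.0).cdf       # CDF of a CM duration on [12, 18]

def p_X1(k1):
    # probability that the first failure occurs in the k1-th inspection interval
    return F_X1(k1 * tau) - F_X1((k1 - 1) * tau)

def p_Y(j, k, F_Yi=F_Y):
    # probability that a CM started at the inspection k*tau ends within I_j
    return F_Yi(j * tau - k * tau) - F_Yi((j - 1) * tau - k * tau)

def p_X(k, j_prev, F_Xi=F_X1):
    # probability that the next failure (clock restarted at j_prev*tau) occurs in I_k
    return F_Xi(k * tau - j_prev * tau) - F_Xi((k - 1) * tau - j_prev * tau)

print(p_X1(1), p_Y(2, 1), p_X(3, 2))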


Fig. 1 The first renewal cycle

3 Renewal Process Model

Let us consider the following renewal cycle of a system. The cycle starts at time t = 0. The system is periodically inspected, and the time between two inspections is equal to τ. If a failure occurs, it is detected during the following inspection and the CM starts. The system is relaunched during the next inspection following the end of the CM. This continues until the n-th failure occurs (n ≥ 2 is a given natural number). Preventive maintenance is performed after detecting the n-th failure. The renewal cycle ends by accomplishing the PM (see Fig. 1), and the next cycle starts immediately. We assume that the PM restores the system to a good-as-new state. We also assume that X_1, ..., X_n, Y_1, ..., Y_{n−1} and Z (see Sect. 2) are independent nonnegative continuous random variables and that F_{X_1}(0) = 0.

We denote the length of the renewal cycle by T and determine its expectation. If x_1 ∈ I_{k_1}, y_1 + k_1τ ∈ I_{j_1}, ..., x_{n−1} + j_{n−2}τ ∈ I_{k_{n−1}}, y_{n−1} + k_{n−1}τ ∈ I_{j_{n−1}}, x_n ∈ I_{k_n} (we put j_0 = 0) and Z = z (see Fig. 1), then the length of the first cycle is T = k_nτ + z. Let us denote the expectation of T under the assumption that Z = z by E_T(z). It holds:

E_T(z) = Σ_{k_1=1}^{∞} Σ_{j_1=k_1+1}^{∞} ⋯ Σ_{k_n=j_{n−1}+1}^{∞} p_{k_1, j_1, ..., k_n} (k_nτ + z).

We obtain the expectation of T as follows:

E_T = ∫_0^∞ E_T(z) dF_Z(z) = Σ_{k_1=1}^{∞} Σ_{j_1=k_1+1}^{∞} ⋯ Σ_{k_n=j_{n−1}+1}^{∞} p_{k_1, j_1, ..., k_n} (k_nτ + E_Z).    (1)

For the unavailability quantification, it is necessary to derive the cumulative distribution function of T. The value F_T(t) is equal to the probability that n failures occurred, n − 1 of them were corrected, and the PM, which started after the n-th failure, was finished before or at the time t. It is obvious that the length of a cycle must be greater than (2n − 1)τ; hence F_T(t) = 0 for t ≤ (2n − 1)τ. Let us consider t > (2n − 1)τ.


Using the independence of the random variables X_1, ..., X_n, Y_1, ..., Y_{n−1} and Z again, we obtain:

F_T(t) = Σ_{k_1=1}^{[t/τ]−2(n−1)} Σ_{j_1=k_1+1}^{[t/τ]−2(n−1)+1} ⋯ Σ_{k_n=j_{n−1}+1}^{[t/τ]} p_{k_1, j_1, ..., k_n} F_Z(t − k_nτ).    (2)

3.1 Unavailability Analysis

The mathematical methods used in this section were inspired and influenced by the work [13], which is mainly based on the results presented in the articles [14–18]. Let us define a function U(t) in the following way: for all t ∈ [0, ∞), U(t) = 1 if the system is unavailable at time t, and U(t) = 0 otherwise.

Let us denote the length of the k-th renewal cycle by T_k, S_0 = 0 and S_n = Σ_{k=1}^{n} T_k. Thus the n-th renewal cycle is the interval (S_{n−1}, S_n]. We assume that U(S_i) = 0 for i = 0, 1, ... and U(x_1) = U(j_iτ) = U(j_iτ + x_{i+1}) = 1 for i = 1, ..., n − 1. The set of time points t within the n-th renewal cycle where U(t) = 1 is denoted by C_n, i.e.:

C_n = {t ∈ (S_{n−1}, S_n] : U(t) = 1}.

We determine the probability of t ∈ C_1 for a given t ∈ [0, ∞). Let us consider t ∈ I_r. There are exactly three mutually exclusive ways in which the event t ∈ C_1 can occur. The first way, denoted as the event A^r, is that a failure occurred in the interval ((r − 1)τ, t]. The second way, denoted as the event B^r, is that the system is under CM anywhere within I_r (it is out of order till the end of I_r even if the CM is finished earlier); this is possible only for r ≥ 2. Finally, the third way, the event C^r, is that the system is under PM during ((r − 1)τ, t]; the event C^r can occur only if r ≥ 2n.

We determine P(A^r) first. We can see that the i-th failure can occur within I_r only if i ∈ {1, ..., [(r + 1)/2]} and i ≤ n. Hence:

P(A^r) = Σ_{i=1}^{min{n, [(r+1)/2]}} P(A_i^r),    (3)


Fig. 2 The event A_i^r

Fig. 3 The event B_i^r

where A_i^r means that the i-th failure occurred within I_r before t (see Fig. 2). It is obvious that:

P(A_1^r) = P((r − 1)τ < X_1 ≤ t) = F_{X_1}(t) − F_{X_1}((r − 1)τ).

Let us consider i ≥ 2. For A_i^r to occur, i − 1 failures must happen and their CM must be finished before I_r; moreover, it must hold that (r − 1)τ < j_{i−1}τ + X_i ≤ t. Thus, for i = 2, ..., [(r + 1)/2], it holds:

P(A_i^r) = Σ_{k_1=1}^{r−2(i−1)} Σ_{j_1=k_1+1}^{r−2(i−1)+1} ⋯ Σ_{k_{i−1}=j_{i−2}+1}^{r−2} Σ_{j_{i−1}=k_{i−1}+1}^{r−1} p_{k_1, j_1, ..., k_{i−1}, j_{i−1}} [F_{X_i}(t − j_{i−1}τ) − F_{X_i}((r − 1 − j_{i−1})τ)].

Let us now determine P(B^r), r ≥ 2. We assume that the system is under correction of the i-th failure. This is possible only for i ∈ {1, ..., [r/2]} and i ≤ n. Hence:

P(B^r) = Σ_{i=1}^{min{n, [r/2]}} P(B_i^r),    (4)

where B_i^r means that the CM of the i-th failure takes place anywhere within I_r (see Fig. 3).


Fig. 4 The event C^r

For i = 1 it holds:

P(B_1^r) = Σ_{k_1=1}^{r−1} p_{X_1 k_1} P(Y_1 > (r − 1)τ − k_1τ) = Σ_{k_1=1}^{r−1} p_{X_1 k_1} \bar{F}_{Y_1}((r − 1 − k_1)τ).

Let us consider i ≥ 2. For B_i^r to occur, i failures must happen before I_r and the CM of the last one must take place within I_r. Hence, for i = 2, ..., [r/2], it holds:

P(B_i^r) = Σ_{k_1=1}^{r−2i+1} Σ_{j_1=k_1+1}^{r−2i+2} ⋯ Σ_{k_i=j_{i−1}+1}^{r−1} p_{k_1, j_1, ..., k_i} \bar{F}_{Y_i}((r − 1 − k_i)τ).

We now determine P(C^r). Let us suppose that n failures occurred before I_r (we assume r ≥ 2n) and the PM is not finished before t (see Fig. 4). It follows that:

P(C^r) = Σ_{k_1=1}^{r−2n+1} Σ_{j_1=k_1+1}^{r−2n+2} ⋯ Σ_{k_n=j_{n−1}+1}^{r−1} p_{k_1, j_1, ..., k_n} \bar{F}_Z(t − k_nτ).    (5)

We can summarize the above results as follows. For t = 0, P(t ∈ C_1) = 0, and for t ∈ I_r = ((r − 1)τ, rτ], where r ∈ N, it holds:

P(t ∈ C_1) =
  P(A^r)                         for r = 1,
  P(A^r) + P(B^r)                for 2 ≤ r < 2n,
  P(A^r) + P(B^r) + P(C^r)       for r ≥ 2n.    (6)

The system is unavailable in time t in case of t ∈

63

∞ 

Cn . Hence:

n=1

U (t) =



χ (t ∈ Cn ).

(7)

n=1

(At most one term of this sum is equal to 1). Let us denote ϑ(t) = E [U (t) = 1], which we will further call as the unavailability function. We can see that: ϑ(t) = E [U (t) = 1] = P(U (t) = 1) = =

(8)

∞

P(t ∈ Cn ).

n=1

We use the assumption that PM restores the system to a good–as–new state to determine ϑ(t), i,e., we suppose that P(t ∈ Ci+1 ) = P(t − Ti ∈ Ci ) for i ∈ N. Hence: P(t ∈ C2 ) = P(t − T1 ∈ C1 ), P(t ∈ C3 ) = P(t − (T1 + T2 ) ∈ C1 ), .. .

(9)

P(t ∈ Cn ) = P(t − Sn−1 ∈ C1 ), where Ti is the length of i–th cycle and Sn =

n i=1

Ti . From Eq. (9) follows that

P(t ∈ Cn ) = P(t − Sn−1 ∈ C1 ) = =

t

(10) P(t − s ∈ C1 )dFSn−1 (s),

0

where FSn−1 is the CDF of Sn−1 . It means that FSn−1 is the n − 1–fold convolution of FT . Let us denote G(t) = P(t ∈ C1 ). From Eq. (8) we obtain: ϑ(t) = G(t) +

∞ t 

G(t − s)dFSn−1 (s) =

n=2 0

= G(t) +

∞ t 

G(t − s)dFSn (s) =

n=1 0

= G(t) +

t 0

Let us denote (s) =

∞ n=1

G(t − s)d

∞ 

 FSn (s) .

n=1

FSn (s). Using this notation we obtain:

(11)

64

R. Briš and P. Jahoda

t ϑ(t) = G(t) +

G(t − s)d(s).

(12)

0

The last task is to determine (s). Then we will be able, using similar way as in [13], to determine ϑ(t) as a solution of integral equation (12). From the definition of function (s) follows: (s) = = FS1 (s) + = FT (s) + = FT (s) + = FT (s) + = FT (s) + = FT (s) + = FT (s) + = FT (s) +

∞ n=2

∞ n=2

∞ n=2

∞ n=1

∞ n=1

FSn (s) = P(Sn ≤ s) = P(Sn−1 + T ≤ s) = P(Sn + T ≤ s) = P(Sn ≤ s − T ) =

 s ∞ 0

n=1

 s ∞ 0

s 0

(13)

n=1

P(Sn ≤ s − t)d FT (t) = FSn (s − t)d FT (t) =

(s − t)d FT (t).

We can see that  s (s) can be obtained as a solution of integral equation (s) = FT (s) + 0 (s − t)d FT (t). The function FT has been already determined in Eq. (2).

4 Unavailability Optimization of a Selected System Our renewal process model can be considered as a real (or true) ageing process where each failure degrades the system to some extent until a PM time is reached, which happens just at the time of occurrence of nth failure, so that n is a parameter of PM, here considered as FBM. FBM restores the system to as–good–as–new state. To demonstrate our methodology we consider a system adopted from [13], for which the first failure time (X 1 ) follows the Weibull distribution with shape parameter β = 2 and scale parameter α = 600 days. We suppose that duration of CM follows rectangular distribution on interval [12; 18] days, i.e. mean time of CM is 15 days. Further we consider short deterministic duration of PM (3 days) because it can be

Unavailability Optimization of a System Undergoing a Real …

65

well planned in advance, having at our disposal information about (n − 1)th failure. Ageing of the system can be realized in two ways: • Standard (theoretical) ageing process connected with the increasing Weibull failure rate (here we suppose linear increasing). • Real (true) ageing process is characterized by imperfect CM, which means that CM degrades the system to some extent and consequently its failure rate must be always worse in comparison with the system health before occurrence of the failure. We suppose that after the first system failure and follow up restoration, the failure rate will increase in the following way: it will be multiplied by an ageing quotient qa which is a number between 1 and 1.5 and this process will repeat again and again until the PM time, i.e. time of nth failure, is reached. If X 1 follows the Weibull distribution with the failure rate: h(t) =

β t, α2

(14)

then X 2 follows as well the Weibull distribution with deteriorated failure rate: h(t) =

β qa t, α2

(15)

and so on, …, X n−1 follows the failure rate: h(t) =

β n−2 q t. α2 a

(16)

The PM starts at the time k_nτ and implies a complete system renewal. In this realistic ageing process, the system gradually deteriorates and the probability distribution of its lifetime is changed after each CM until a PM time is reached, which is in good agreement with real practice: each failure followed by a restoration process exposes the system to shocks that accumulate, and the system deteriorates step by step.
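The following Monte Carlo sketch (an illustration under stated assumptions, not the authors' numerical procedure based on Eq. (12)) simulates this real ageing process: latent failures are found at the next inspection, the system is relaunched at the first inspection after the CM ends, the FBM (full renewal) follows the n-th failure, and the lifetime preceding the i-th failure of a cycle is drawn with the failure rate of Eqs. (14)–(16) multiplied by q_a^{i−1}. Inspection epochs are assumed to restart with each renewal cycle.

import numpy as np

rng = np.random.default_rng(1)

def simulate_unavailability(t_grid, n_fbm, tau=120.0, alpha=600.0, qa=1.25,
                            cm_low=12.0, cm_high=18.0, pm=3.0, n_runs=5000):
    # Monte Carlo estimate of the unavailability function theta(t) on t_grid.
    t_grid = np.asarray(t_grid, dtype=float)
    unavail = np.zeros_like(t_grid)
    horizon = float(t_grid[-1])
    for _ in range(n_runs):
        down = []                  # disjoint intervals where the system is unavailable
        cycle_start = 0.0
        while cycle_start < horizon:
            t_launch = cycle_start  # the system is (re)launched at an inspection epoch
            for i in range(n_fbm):
                # lifetime with linear failure rate (beta = 2) deteriorated by qa**i
                scale = alpha / np.sqrt(qa ** i)
                fail = t_launch + scale * np.sqrt(rng.exponential())
                detect = cycle_start + np.ceil((fail - cycle_start) / tau) * tau
                if i < n_fbm - 1:   # corrective maintenance, then wait for an inspection
                    repair_end = detect + rng.uniform(cm_low, cm_high)
                    relaunch = cycle_start + np.ceil((repair_end - cycle_start) / tau) * tau
                    down.append((fail, relaunch))
                    t_launch = relaunch
                else:               # n-th failure: FBM (full renewal), the cycle ends
                    pm_end = detect + pm
                    down.append((fail, pm_end))
                    cycle_start = pm_end
        for a, b in down:
            unavail += (t_grid >= a) & (t_grid < b)
    return unavail / n_runs

# Example: unavailability of the selected system with FBM after the 4th failure.
# t = np.linspace(0.0, 3000.0, 301)
# theta = simulate_unavailability(t, n_fbm=4)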

4.1 Selected System with Latent Failures, Inspection Period τ = 120 Days

Figure 5 shows the differences between the courses of the unavailability function ϑ(t), given by Eq. (12), of the dormant (periodically inspected) system without PM, which undergoes real versus theoretical ageing. The system is without PM, which means that the FBM parameter n = ∞. In the theoretical ageing unavailability evolution (dashed line) the system undergoes the standard ageing process. The unavailability function ϑ(t) follows a common saw-tooth shape. The unavailability curve has an increasing trend in time due to ageing and it peaks at 0.22 at about 600 days.



Fig. 5 Real (qa = 1.25) versus theoretical ageing of dormant system without PM

Thereafter, the unavailability function converges to an asymptotic limit, as follows from the Key Renewal Theorem, see [19], and as is discussed in [13]. In the real ageing unavailability evolution, the time-dependent unavailability function ϑ(t) of the above-mentioned system is depicted as the full line in Fig. 5. The system is periodically inspected, the time between two inspections is τ = 120 days (the same as in the theoretical ageing) and the ageing quotient is qa = 1.25, which means that each system failure is followed by CM and the follow-up lifetime has an accordingly deteriorated failure rate. We see that the unavailability course under the real ageing process has a similarly increasing trend within the first 600 days as in the case of the theoretical ageing process. After that, the unavailability function under the real ageing process continues its increasing trend, whereas the theoretical ageing process is characterized by the convergence of the unavailability function to the asymptotic limit. Figure 6 compares the courses of the unavailability function of the same system from Fig. 5 with real ageing, which is now assumed to undergo PM. One can compare the effect of PM on the real ageing process of the system, where PM is realized by two different policies: after detecting the 2nd (full line) and the 5th (dashed line) failure, i.e. for n = 2 and n = 5, respectively. In both FBM policies we can observe a positive effect of PM on the unavailability course, where the unavailability improves depending on the frequency of PM. We can conclude that both FBM policies significantly decrease the maximal unavailability from Fig. 5 (real ageing without PM), which is 0.268.



Fig. 6 The unavailability function of dormant system with PM, when n = 2, 5

4.2 Optimization of FBM Under Unavailability Restriction

We are trying to find an optimal value of the FBM parameter n which is necessary for the implementation of PM and which simultaneously guarantees that the maximal unavailability limit 0.23, given by the ageing process at about 600 days, will not be exceeded. Running calculations of the unavailability function ϑ(t) were executed for the parameter values n = 2, ..., 5 and the optimal value n_opt = 4 was found, as demonstrated in Fig. 7. For n = 5 the stated unavailability limit is already exceeded.
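A minimal sketch of this search is given below. It assumes a user-supplied function unavailability(n, tau, mission) that returns the discretized course of ϑ(t), for instance obtained from Eq. (12); this function name and signature are illustrative, not from the chapter.

def optimize_fbm(unavailability, tau, limit, candidates=range(2, 6), mission=3000.0):
    """Return the largest FBM parameter n whose maximal unavailability
    over the mission time does not exceed the given limit."""
    n_opt = None
    for n in candidates:
        if max(unavailability(n=n, tau=tau, mission=mission)) <= limit:
            n_opt = n
    return n_opt

# With the data of Sect. 4.2 (tau = 120 days, limit 0.23) this scan selects n = 4.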

4.3 Optimization of Inspection Period Under Unavailability Restriction

Let us set the optimal value of the FBM parameter n to n = 4. Apparently, the unavailability course will be strongly influenced by the period of inspections τ. Figure 8 compares the unavailability courses for the original inspection period τ = 120 days versus the double period τ = 240 days. Evidently, the upper limit of the unavailability function for the double inspection period is around 0.42, which is considered a high and unacceptable value for practical use. Therefore we pose the following optimization problem: find an optimal inspection period that still preserves an acceptable maximal unavailability limit ϑmax ≤ 0.3. Table 1 shows the result of the optimization process (in bold)



Fig. 7 The course of unavailability function for optimized FBM, optimal n opt = 4


Fig. 8 The unavailability course for FBM with optimal n = 4, with inspection periods 120 versus 240 days



Fig. 9 The unavailability course for FBM with optimal n = 4, and optimal inspection period τopt = 162 days

Table 1 Solution of the optimization problem satisfying unavailability limit ϑmax ≤ 0.3

Computing run | Inspection period τ (days) | Maximal unavailability ϑmax
1             | 156                        | 0.283
2             | 158                        | 0.293
3             | 162                        | 0.297
4             | 163                        | 0.301
5             | 164                        | 0.305

and Fig. 9 shows the solution of the problem: the course of the unavailability function during the mission time of 3000 days, with τopt = 162 days.
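The same kind of scan can be applied to the inspection period; again, unavailability(n, tau, mission) is an assumed evaluator of ϑ(t), not a function defined in the chapter.

def optimize_inspection_period(unavailability, n, limit, tau_grid, mission=3000.0):
    """Return the longest inspection period tau on the grid whose maximal
    unavailability over the mission time stays within the limit."""
    feasible = [tau for tau in tau_grid
                if max(unavailability(n=n, tau=tau, mission=mission)) <= limit]
    return max(feasible) if feasible else None

# With n = 4 and limit 0.3 (Sect. 4.3), scanning tau between 120 and 240 days
# reproduces the tabulated optimum of 162 days.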

4.4 Optimization of FBM at Fixed Inspection Period and Under Unavailability Restriction

When applying this method in practice, one more optimization problem may arise. In real conditions the inspection period may be fixed (mostly because of the limited possibilities of an operative maintenance team), for example τ = 180 days. Now we look for an optimal value n_opt of the parameter n which is necessary for the implementation of


Fig. 10 The unavailability course for FBM with optimal n opt = 3, inspection period is τ = 180 days

FBM and which satisfies the following restriction on the unavailability:

$$\vartheta_{max} \le \vartheta_0, \qquad (17)$$

where $\vartheta_0$ is a limit given by the ageing process at about 600 days, here $\vartheta_0 = 0.321$. Computing the unavailability evolutions for n = 2, ..., 5, we found that $n_{opt} = 3$. Figure 10 demonstrates the unavailability evolution (dashed line) for $n_{opt} = 3$ in comparison with the real ageing unavailability course without FBM (full line).

5 Conclusions

The fundamental reliability mathematics necessary for the unavailability quantification of a system which undergoes a real ageing process was briefly presented in Sect. 3. Each failure and follow-up repair degrades the system to some extent, i.e. we consider imperfect CM. When the nth failure occurs, the system is so degraded that the standard repair process is inefficient and the system is replaced by a new one. We denoted this PM policy as failure-based PM, here abbreviated as FBM. We showed that this real ageing process with imperfect CM can significantly increase the unavailability function in contrast with the theoretical ageing process. The FBM is characterized by the parameter n, which indicates the number of failures until the PM


is launched. The parameter significantly influences the unavailability course, particularly the maximal value of the unavailability function during a mission time. In this chapter, we clearly demonstrated that this parameter, as well as the time between two consecutive inspections, can be considered as parameters of optimization. Using our methodology we were able to find the optimal value of the parameter n under unavailability restrictions at a fixed inspection period, as well as the optimal period of inspections at a fixed value of the parameter n. In our future research work, this new methodology will be used for the optimization of real systems from practice, particularly oriented on electrical power networks.

Acknowledgements This work was partly supported by the European Regional Development Fund in the Research Platform focused on Industry 4.0 and Robotics in Ostrava, No. CZ.02.1.01/0.0/0.0/17049/0008425 within the Operational Programme Research, Development and Education, and partly by the VSB-Technical University of Ostrava project "Applied Statistics and Probability", No. SP2020/46.

References

1. Pham, H., Wang, H.: Imperfect maintenance. Eur. J. Oper. Res. 94(3), 425–438 (1996)
2. Finkelstein, M., Ludick, Z.: On some steady-state characteristics of systems with gradual repair. Reliab. Eng. Syst. Saf. 128, 17–23 (2014)
3. Liu, Y., Huang, H.Z.: Optimal selective maintenance strategy for MSS under imperfect maintenance. IEEE Trans. Reliab. 59(2), 356–367 (2010)
4. Nakagawa, T.: Sequential imperfect preventive maintenance policies. IEEE Trans. Reliab. 37(3), 295–298 (1988)
5. Pandey, M., Zuo, M.J., Moghadas, R., Tiwari, M.K.: Selective maintenance for binary systems under imperfect repair. Reliab. Eng. Syst. Saf. 113, 42–51 (2013)
6. Lin, D., Zuo, M.J., Yam, R.C.M.: Sequential imperfect preventive maintenance models with two categories of failure modes. Naval Res. Logist. 48(2), 172–183 (2001)
7. Brown, M., Proschan, F.: Imperfect repair. J. Appl. Probab. 20, 851–859 (1983)
8. Fontenot, R.A., Proschan, F.: Some imperfect maintenance models. In: Abdel-Hameed, M.S., Cinclar, E., Quinn, J. (eds.) Reliability Theory and Models, pp. 83–101. Academic (1984)
9. Castro, I.T.: Imperfect maintenance: a review. In: Andrews, J., Bérenguer, Ch., Jackson, L. (eds.) Maintenance Modelling and Applications (Chap. 3: System Reliability Modelling, Det Norske Veritas), pp. 237–262 (2011). ISBN 978-82-515-0316-7
10. Morimura, H.: On some preventive maintenance policies for IFR. J. Oper. Res. Soc. Jpn. 12, 94–124 (1970)
11. Pulcini, G.: Mechanical reliability and maintenance models. In: Pham, H. (ed.) Handbook of Reliability Engineering, pp. 317–348. Springer, London (2003)
12. Nakagawa, T.: Maintenance Theory of Reliability. Springer (2005)
13. Weide, J.A.M., Pandey, M.D.: A stochastic alternating renewal process model for unavailability analysis of standby safety equipment. Reliab. Eng. Syst. Saf. 139, 97–104 (2015)
14. Vaurio, J.K.: Availability and cost functions for periodically inspected preventively maintained units. Reliab. Eng. Syst. Saf. 63, 133–140 (1999)
15. Vaurio, J.K.: Unavailability of components with inspection and repair. Nucl. Eng. Des. 54, 309–324 (1979)
16. Caldarola, L.: Unavailability and failure intensity of components. Nucl. Eng. Des. 44, 147–162 (1977)


17. Cui, L., Xie, M.: Availability of a periodically inspected system with random repair or replacement times. J. Stat. Plan Inference 131, 89–100 (2005)
18. Vaurio, J.K.: On time dependent availability and maintenance optimization of standby units under various maintenance policies. Reliab. Eng. Syst. Saf. 56, 79–89 (1997)
19. Gallager, R.: Stochastic Processes: Theory and Applications. Cambridge University Press, UK (2013)

Minimal Filtering Algorithms for Convolutional Neural Networks Aleksandr Cariow and Galina Cariowa

Abstract In this paper, we present several resource-efficient algorithmic solutions regarding the fully parallel hardware implementation of the basic filtering operation performed in the convolutional layers of convolutional neural networks. In fact, these basic operations calculate two inner products of neighboring vectors formed by a sliding time window from the current data stream with an impulse response of the M-tap finite impulse response filter. We use an extension of Winograd's minimal filtering method to develop fully parallel hardware-oriented algorithms implementing the basic filtering operation for M = 3, 5, 7, 9, and 11. A fully parallel hardware implementation of the proposed algorithms in each case gives approximately 30% savings in the number of embedded multipliers compared to a fully parallel hardware implementation of the naive calculation methods. Keywords Convolutional neural networks · Winograd's minimal filtering algorithm · Fast hardware-oriented computations

1 Introduction

Today, artificial intelligence, deep learning and neural networks are powerful and incredibly effective machine learning methods used to solve many scientific and practical problems. Applications of deep neural networks for machine learning are diverse and rapidly developing, covering various areas of basic sciences, technologies and the real world [1–3]. Among the various types of deep neural networks, convolutional neural networks (CNNs) are the most widely used [4]. Although there are many optimizing methods to speed up CNN-based digital signal and image processing algorithms, it is still difficult to implement these algorithms in real-time low-power


systems. The main and most time-consuming operations in CNNs are two-dimensional convolution operations. To speed up convolution computation, various algorithmic methods have been proposed [4–13]. The most common approach for efficient convolution implementation is the Fast Fourier Transform (FFT) algorithm [5–7]. The FFT-based convolution method is traditionally used for large-length finite impulse response (FIR) filters, but modern CNNs use predominantly small-length FIR filters. In this situation one of the most effective algorithms for the computation of a small-length two-dimensional convolution is Winograd's minimal filtering algorithm, which has been used intensively in recent years [8–14]. The algorithm computes linear convolution over small tiles with minimal complexity, which makes it more effective with small filters and small batch sizes. In fact, this algorithm calculates two inner products of neighboring vectors formed by a sliding time window from the current data stream with an impulse response of the 3-tap FIR filter.

A CNN contains several kinds of layers. However, the name of the convolutional neural network itself suggests that the convolutional layers are dominant in this type of network. In CNNs, convolutional layers are the most computationally intensive, since in a typical implementation they occupy more than 90% of the CNN execution time [15]. In turn, convolution itself requires performing a large number of arithmetic operations. In many cases, convolution is performed on terabytes or petabytes of data, so even a small improvement can significantly reduce the computation time. That is why developers of such networks seek and design efficient ways of implementing convolution using the smallest possible number of arithmetic operations. In particular, algorithm developers try to minimize the number of multiplications, since this operation is more complex than addition.

Although the execution times of addition and multiplication in modern computers are supposedly comparable, multiplication requires more manipulations with operands, and therefore its implementation requires more time and effort than expected. As a result of the multiplication of two n-bit operands, a 2n-bit product is obtained. This is why in all fixed-point digital signal processing (DSP) units the product register and the accumulator are double the widths of all other registers. However, in such a case, two-time access to memory during both writing and reading is required. This increases the actual multiplication time. For example, 32-bit integer multiplication on a GPU takes 16 clock cycles. Floating-point multiplication operations require even more complicated housekeeping. Therefore, the statement that in modern processors the multiplication operation takes the same time as the addition is somewhat exaggerated.


Another way to solve this problem is to take advantage of the massive parallelism offered by graphics processing units (GPUs), application-specific integrated circuits (ASICs) and field programmable gate array (FPGA) devices to exploit the large amount of internal parallelism exhibited by CNN-based algorithms [15–31]. GPUs are the most popular and widely used accelerators for improving training and classification processes in CNNs [16–18]. This is due to their high performance when performing matrix operations [19]. However, GPU accelerators consume a large amount of energy, and therefore their use in CNN-based applications implemented in on-board battery-powered mobile devices is becoming a problem. ASICs and FPGAs are the preferred acceleration platforms for on-board CNNs due to their promising performance and high energy efficiency. They can also achieve high performance, but with significantly lower power consumption [20–31].

In addition, most modern high-performance FPGA targets contain a number of integrated hardware multipliers. Thus, instead of mapping a multiplier into several logic gates, dedicated multipliers provided on the FPGA fabric can be used. So, all multiplications involved in the implementation of a fully parallel algorithm can be efficiently implemented using these embedded multipliers. However, their number may simply be insufficient to meet the requirements of a fully parallel implementation of the algorithm. If multiplications are implemented using hardwired multipliers within the target FPGA, this dramatically limits the complexity of the CNN that can be implemented. For example, the second layer of the LeNet5 network requires 2400 multipliers [32]. This number largely exceeds the number of multipliers provided by many FPGAs and, especially, by embedded devices. The designer uses hardwired multipliers to implement multiplication operations until the implemented computing unit occupies all the embedded hardwired multipliers. If the FPGA target runs out of embedded multipliers, the designer uses generic logic gates instead, and the multiplication implementation becomes expensive in terms of FPGA resource usage. In some cases, therefore, available logic has to be exploited to implement multipliers, seriously restricting the maximum number of real multiplications that can be implemented in parallel on a target device. This leads to significant difficulties during the implementation of the computation unit.

Thus, the problem of minimizing the number of multiplications in the development of parallel hardware-oriented algorithms for convolutional neural networks, regardless of the platform on which they will be implemented, remains relevant. Next, we consider a number of algorithmic solutions that contribute to the solution of this problem.


2 Preliminary Remarks

The main operation of convolutional neural networks is an inner product of a vector, formed by a sliding time window from the current data stream, with the impulse response of the M-tap finite impulse response (FIR) filter. In the most general case, the procedure for calculating the convolution elements can be represented as follows:

$$y_j = \sum_{i=0}^{M-1} x_{i+j}\, w_i, \qquad j = 0, 1, \ldots, N - M + 1, \qquad (1)$$

where N is the length of the current data stream, $\{x_{i+j}\}$ are the elements of the current data stream, and $\{w_i\}$ are the coefficients of the impulse response of the FIR filter, which are constants. For example, a direct application of two consecutive steps of a 3-tap FIR filter with coefficients $\{w_0, w_1, w_2\}$ to a set of four elements $\{x_0, x_1, x_2, x_3\}$ requires 4 additions and 6 multiplications:

$$y_0 = x_0 w_0 + x_1 w_1 + x_2 w_2, \qquad y_1 = x_1 w_0 + x_2 w_1 + x_3 w_2. \qquad (2)$$

S. Winograd came up with a tricky way to reduce the number of multiplications when calculating expression (2):

$$\mu_1 = (x_0 - x_2)\, w_0, \qquad \mu_2 = (x_1 + x_2)\,\frac{w_0 + w_1 + w_2}{2}, \qquad \mu_3 = (x_2 - x_1)\,\frac{w_0 - w_1 + w_2}{2}, \qquad \mu_4 = (x_1 - x_3)\, w_2,$$
$$y_0 = \mu_1 + \mu_2 + \mu_3, \qquad y_1 = \mu_2 - \mu_3 - \mu_4.$$

This trick was called the minimal filtering algorithm [8]. The values $(w_0 + w_1 + w_2)/2$ and $(w_0 - w_1 + w_2)/2$ can be calculated in advance, so this method requires 4 multiplications and 8 additions, compared with 6 multiplications and 4 additions for the direct method. Since multiplication is a much more complicated operation than addition, Winograd's minimal filtering algorithm is more efficient than the direct method of computation. The above expressions exhaustively describe the entire set of mathematical operations necessary to perform the calculations. Strictly speaking, however, they are not an algorithm, because they do not reveal the sequence of calculations. In addition, convolutional neural networks use FIR filters with longer impulse responses, for which minimal filtering algorithms have not yet been developed. Considering the above, the goal of this article is to develop and describe minimal filtering algorithms for M = 3, 5, 7, 9, 11.
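As a quick illustration, the following short Python sketch (not from the paper) compares the direct computation of Eq. (2) with the minimal filtering trick above:

def direct_f23(x, w):
    # direct method: 6 multiplications, 4 additions
    y0 = x[0] * w[0] + x[1] * w[1] + x[2] * w[2]
    y1 = x[1] * w[0] + x[2] * w[1] + x[3] * w[2]
    return y0, y1

def winograd_f23(x, w):
    # minimal filtering: 4 multiplications (s1, s2 can be precomputed per filter)
    s1 = (w[0] + w[1] + w[2]) / 2
    s2 = (w[0] - w[1] + w[2]) / 2
    m1 = (x[0] - x[2]) * w[0]
    m2 = (x[1] + x[2]) * s1
    m3 = (x[2] - x[1]) * s2
    m4 = (x[1] - x[3]) * w[2]
    return m1 + m2 + m3, m2 - m3 - m4

x, w = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
assert direct_f23(x, w) == winograd_f23(x, w)   # both give (4.5, 6.0)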


Fig. 1 Illustration of the organization of calculations in accordance with the basic filtering operation

First, we define the basic operation of CNN filtering as an application of two consecutive steps of an M-tap FIR filter with coefficients $\{w_0, w_1, \ldots, w_{M-1}\}$ to a set of elements $\{x_0, x_1, \ldots, x_M\}$. Figure 1 clarifies the essence of what was said. More compactly, the introduced operation can be represented in the form of a matrix-vector product:

$$\mathbf{y}_2^{(M)} = \mathbf{F}_{2\times M}\, \mathbf{w}_M, \qquad (3)$$

where

$$\mathbf{F}_{2\times M} = \begin{bmatrix} x_0 & x_1 & \cdots & x_{M-1}\\ x_1 & x_2 & \cdots & x_M \end{bmatrix}, \qquad \mathbf{y}_2^{(M)} = [y_0^{(M)}, y_1^{(M)}]^T, \qquad \mathbf{w}_M = [w_0^{(M)}, w_1^{(M)}, \ldots, w_{M-1}^{(M)}]^T.$$

(Please note that hereinafter the superscript (M) denotes quantities related to the basic operation of minimal filtering with an M-tap filter.) Next, we present minimal filtering algorithms using Winograd's trick for the 3-tap FIR filter. The developed algorithms are distinguished by a reduced number of multiplications, which makes them suitable for fully parallel hardware implementation.


3 Minimal Filtering Algorithms

3.1 Algorithm 1, M = 3

Let $\mathbf{x}_4 = [x_0, x_1, x_2, x_3]^T$ be a vector that represents the input data set, $\mathbf{w}_3 = [w_0^{(3)}, w_1^{(3)}, w_2^{(3)}]^T$ be a vector that contains the coefficients of the impulse response of the 3-tap FIR filter, and $\mathbf{y}_2^{(3)} = [y_0^{(3)}, y_1^{(3)}]^T$ be a vector describing the results of using a 3-tap FIR filter. Then a fully parallel algorithm for the computation of $\mathbf{y}_2^{(3)}$ using Winograd's minimal filtering method can be written with the help of the following matrix-vector calculating procedure:

$$\mathbf{y}_2^{(3)} = \mathbf{A}_{2\times 4}^{(3)}\, \mathbf{D}_4^{(3)}\, \mathbf{A}_4^{(3)}\, \mathbf{x}_4, \qquad (4)$$

where

$$\mathbf{A}_4^{(3)} = \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 1 & 0\\ 0 & -1 & 1 & 0\\ 0 & 1 & 0 & -1 \end{bmatrix}, \qquad \mathbf{A}_{2\times 4}^{(3)} = \begin{bmatrix} 1 & 1 & 1 & 0\\ 0 & 1 & -1 & -1 \end{bmatrix},$$

and

$$\mathbf{D}_4^{(3)} = \operatorname{diag}(s_0^{(3)}, s_1^{(3)}, s_2^{(3)}, s_3^{(3)}),$$
$$s_0^{(3)} = w_0^{(3)}, \quad s_1^{(3)} = (w_0^{(3)} + w_1^{(3)} + w_2^{(3)})/2, \quad s_2^{(3)} = (w_0^{(3)} - w_1^{(3)} + w_2^{(3)})/2, \quad s_3^{(3)} = w_2^{(3)}.$$

Figure 2 shows a data flow diagram of the proposed algorithm for the implementation of the minimal filtering basic operation for the 3-tap FIR filter. In this paper, data flow diagrams are oriented from left to right, and straight lines in the figures denote data transfer operations. The circles in these figures show the operation of multiplication by the number inscribed inside a circle. The points where the lines converge indicate summation, and the dashed lines indicate data transfer operations with a simultaneous change of sign. We use plain lines without arrows on purpose, so as not to clutter the figures. For simplicity, we also removed the superscripts of the variables in all the figures, since it is obvious from the figures what vector sizes we are dealing with in each case.

Fig. 2 Illustration of the organization of calculations in accordance with the basic filtering operation, M = 3
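The factorized form of Eq. (4) can also be checked numerically; the following sketch (illustrative only) builds the matrices given above with NumPy and compares the result with the direct product of Eq. (3):

import numpy as np

A4 = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
A2x4 = np.array([[1, 1, 1, 0],
                 [0, 1, -1, -1]], dtype=float)

def minimal_filter_m3(x4, w3):
    w0, w1, w2 = w3
    D4 = np.diag([w0, (w0 + w1 + w2) / 2, (w0 - w1 + w2) / 2, w2])
    return A2x4 @ D4 @ A4 @ x4            # Eq. (4)

x4 = np.array([1.0, 2.0, 3.0, 4.0])
w3 = np.array([0.5, -1.0, 2.0])
F = np.array([x4[:3], x4[1:]])            # the data matrix F of Eq. (3)
assert np.allclose(minimal_filter_m3(x4, w3), F @ w3)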

3.2 Algorithm 2, M = 5

Let $\mathbf{x}_6 = [x_0, x_1, \ldots, x_5]^T$ be a vector that represents the input data set, $\mathbf{w}_5 = [w_0^{(5)}, w_1^{(5)}, \ldots, w_4^{(5)}]^T$ be a vector that contains the coefficients of the impulse response of the 5-tap FIR filter, and $\mathbf{y}_2^{(5)} = [y_0^{(5)}, y_1^{(5)}]^T$ be a vector describing the results of using a 5-tap FIR filter. Then a fully parallel minimal filtering algorithm for the computation of $\mathbf{y}_2^{(5)}$ can be written with the help of the following matrix-vector calculating procedure:

$$\mathbf{y}_2^{(5)} = \mathbf{A}_{2\times 7}^{(5)}\, \mathbf{D}_7^{(5)}\, \mathbf{A}_{7\times 6}^{(5)}\, \mathbf{x}_6, \qquad (5)$$

where

$$\mathbf{A}_{7\times 6}^{(5)} = \begin{bmatrix}
1 & 0 & -1 & 0 & 0 & 0\\
0 & 1 & 1 & 0 & 0 & 0\\
0 & -1 & 1 & 0 & 0 & 0\\
0 & 1 & 0 & -1 & 0 & 0\\
0 & 0 & 0 & 1 & -1 & 0\\
0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & -1 & 1
\end{bmatrix}, \qquad
\mathbf{A}_{2\times 7}^{(5)} = \begin{bmatrix}
1 & 1 & 1 & 0 & 1 & 1 & 0\\
0 & 1 & -1 & -1 & 0 & 1 & 1
\end{bmatrix},$$

$$\mathbf{D}_7^{(5)} = \operatorname{diag}(s_0^{(5)}, s_1^{(5)}, \ldots, s_6^{(5)}),$$
$$s_0^{(5)} = w_0^{(5)}, \quad s_1^{(5)} = (w_0^{(5)} + w_1^{(5)} + w_2^{(5)})/2, \quad s_2^{(5)} = (w_0^{(5)} - w_1^{(5)} + w_2^{(5)})/2, \quad s_3^{(5)} = w_2^{(5)},$$
$$s_4^{(5)} = w_3^{(5)}, \quad s_5^{(5)} = w_3^{(5)} + w_4^{(5)}, \quad s_6^{(5)} = w_4^{(5)}.$$

Figure 3 shows a data flow diagram of the proposed algorithm for the implementation of minimal filtering basic operation for 5-tap FIR filter.


Fig. 3 Illustration of the organization of calculations in accordance with the basic filtering operation, M = 5

3.3 Algorithm 3, M = 7

Let $\mathbf{x}_8 = [x_0, x_1, \ldots, x_7]^T$ be a vector that represents the input data set, $\mathbf{w}_7 = [w_0^{(7)}, w_1^{(7)}, \ldots, w_6^{(7)}]^T$ be a vector that contains the coefficients of the impulse response of the 7-tap FIR filter, and $\mathbf{y}_2^{(7)} = [y_0^{(7)}, y_1^{(7)}]^T$ be a vector describing the results of using a 7-tap FIR filter. Then a fully parallel minimal filtering algorithm for the computation of $\mathbf{y}_2^{(7)}$ can be written with the help of the following matrix-vector calculating procedure:

$$\mathbf{y}_2^{(7)} = \mathbf{A}_{2\times 6}^{(7)}\, \mathbf{A}_{6\times 10}^{(7)}\, \mathbf{D}_{10}^{(7)}\, \mathbf{A}_{10\times 8}^{(7)}\, \mathbf{x}_8, \qquad (6)$$

where

$$\mathbf{A}_{10\times 8}^{(7)} = \begin{bmatrix}
1 & 0 & -1 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & -1 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & -1 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & -1 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & -1
\end{bmatrix},$$

$$\mathbf{A}_{6\times 10}^{(7)} = \begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1
\end{bmatrix}, \qquad
\mathbf{A}_{2\times 6}^{(7)} = \begin{bmatrix}
1 & 0 & 1 & 0 & 1 & 0\\
0 & 1 & 0 & 1 & 0 & 1
\end{bmatrix},$$

$$\mathbf{D}_{10}^{(7)} = \operatorname{diag}(s_0^{(7)}, s_1^{(7)}, \ldots, s_9^{(7)}),$$
$$s_0^{(7)} = w_0^{(7)}, \quad s_1^{(7)} = (w_0^{(7)} + w_1^{(7)} + w_2^{(7)})/2, \quad s_2^{(7)} = (w_0^{(7)} - w_1^{(7)} + w_2^{(7)})/2, \quad s_3^{(7)} = w_2^{(7)},$$
$$s_4^{(7)} = s_5^{(7)} = w_3^{(7)}, \quad s_6^{(7)} = w_4^{(7)}, \quad s_7^{(7)} = (w_4^{(7)} + w_5^{(7)} + w_6^{(7)})/2, \quad s_8^{(7)} = (w_4^{(7)} - w_5^{(7)} + w_6^{(7)})/2, \quad s_9^{(7)} = w_6^{(7)}.$$

Figure 4 shows a data flow diagram of the proposed algorithm for the implementation of the minimal filtering basic operation for the 7-tap FIR filter.

Fig. 4 Illustration of the organization of calculations in accordance with the basic filtering operation, M = 7


3.4 Algorithm 4, M = 9

Let $\mathbf{x}_{10} = [x_0, x_1, \ldots, x_9]^T$ be a vector that represents the input data set, $\mathbf{w}_9 = [w_0^{(9)}, w_1^{(9)}, \ldots, w_8^{(9)}]^T$ be a vector that contains the coefficients of the impulse response of the 9-tap FIR filter, and $\mathbf{y}_2^{(9)} = [y_0^{(9)}, y_1^{(9)}]^T$ be a vector describing the results of using a 9-tap FIR filter. Then a fully parallel minimal filtering algorithm for the computation of $\mathbf{y}_2^{(9)}$ can be written with the help of the following matrix-vector calculating procedure:

$$\mathbf{y}_2^{(9)} = \mathbf{A}_{2\times 6}^{(9)}\, \mathbf{A}_{6\times 12}^{(9)}\, \mathbf{D}_{12}^{(9)}\, \mathbf{A}_{12\times 10}^{(9)}\, \mathbf{x}_{10}, \qquad (7)$$

where

$$\mathbf{A}_{12\times 10}^{(9)} = \begin{bmatrix}
1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1
\end{bmatrix},$$

$$\mathbf{A}_{6\times 12}^{(9)} = \begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1
\end{bmatrix}, \qquad
\mathbf{A}_{2\times 6}^{(9)} = \begin{bmatrix}
1 & 0 & 1 & 0 & 1 & 0\\
0 & 1 & 0 & 1 & 0 & 1
\end{bmatrix},$$

$$\mathbf{D}_{12}^{(9)} = \operatorname{diag}(s_0^{(9)}, s_1^{(9)}, \ldots, s_{11}^{(9)}),$$
$$s_0^{(9)} = w_0^{(9)}, \quad s_1^{(9)} = (w_0^{(9)} + w_1^{(9)} + w_2^{(9)})/2, \quad s_2^{(9)} = (w_0^{(9)} - w_1^{(9)} + w_2^{(9)})/2, \quad s_3^{(9)} = w_2^{(9)},$$
$$s_4^{(9)} = w_3^{(9)}, \quad s_5^{(9)} = (w_3^{(9)} + w_4^{(9)} + w_5^{(9)})/2, \quad s_6^{(9)} = (w_3^{(9)} - w_4^{(9)} + w_5^{(9)})/2, \quad s_7^{(9)} = w_5^{(9)},$$
$$s_8^{(9)} = w_6^{(9)}, \quad s_9^{(9)} = (w_6^{(9)} + w_7^{(9)} + w_8^{(9)})/2, \quad s_{10}^{(9)} = (w_6^{(9)} - w_7^{(9)} + w_8^{(9)})/2, \quad s_{11}^{(9)} = w_8^{(9)}.$$


Fig. 5 Illustration of the organization of calculations in accordance with the basic filtering operation, M = 9

Figure 5 shows a data flow diagram of the proposed algorithm for the implementation of minimal filtering basic operation for 9-tap FIR filter.

3.5 Algorithm 5, M = 11

Let $\mathbf{x}_{12} = [x_0, x_1, \ldots, x_{11}]^T$ be a vector that represents the input data set, $\mathbf{w}_{11} = [w_0^{(11)}, w_1^{(11)}, \ldots, w_{10}^{(11)}]^T$ be a vector that contains the coefficients of the impulse response of the 11-tap FIR filter, and $\mathbf{y}_2^{(11)} = [y_0^{(11)}, y_1^{(11)}]^T$ be a vector describing the results of using an 11-tap FIR filter. Then a fully parallel minimal filtering algorithm for the computation of $\mathbf{y}_2^{(11)}$ can be written with the help of the following matrix-vector calculating procedure:

$$\mathbf{y}_2^{(11)} = \mathbf{A}_{2\times 8}^{(11)}\, \mathbf{A}_{8\times 15}^{(11)}\, \mathbf{D}_{15}^{(11)}\, \mathbf{A}_{15\times 12}^{(11)}\, \mathbf{x}_{12}, \qquad (8)$$

where

$$\mathbf{D}_{15}^{(11)} = \operatorname{diag}(s_0^{(11)}, s_1^{(11)}, \ldots, s_{14}^{(11)}),$$
$$s_0^{(11)} = w_0^{(11)}, \quad s_1^{(11)} = (w_0^{(11)} + w_1^{(11)} + w_2^{(11)})/2, \quad s_2^{(11)} = (w_0^{(11)} - w_1^{(11)} + w_2^{(11)})/2, \quad s_3^{(11)} = w_2^{(11)},$$
$$s_4^{(11)} = w_3^{(11)}, \quad s_5^{(11)} = (w_3^{(11)} + w_4^{(11)} + w_5^{(11)})/2, \quad s_6^{(11)} = (w_3^{(11)} - w_4^{(11)} + w_5^{(11)})/2, \quad s_7^{(11)} = w_5^{(11)},$$
$$s_8^{(11)} = w_6^{(11)}, \quad s_9^{(11)} = (w_6^{(11)} + w_7^{(11)} + w_8^{(11)})/2, \quad s_{10}^{(11)} = (w_6^{(11)} - w_7^{(11)} + w_8^{(11)})/2, \quad s_{11}^{(11)} = w_8^{(11)},$$
$$s_{12}^{(11)} = w_9^{(11)}, \quad s_{13}^{(11)} = w_9^{(11)} + w_{10}^{(11)}, \quad s_{14}^{(11)} = w_{10}^{(11)},$$

and

$$\mathbf{A}_{15\times 12}^{(11)} = \begin{bmatrix}
1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1
\end{bmatrix},$$

$$\mathbf{A}_{8\times 15}^{(11)} = \begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}, \qquad
\mathbf{A}_{2\times 8}^{(11)} = \begin{bmatrix}
1 & 0 & 1 & 0 & 1 & 0 & 1 & 0\\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1
\end{bmatrix}.$$

Figure 6 shows a data flow diagram of the proposed algorithm for the implementation of minimal filtering basic operation for 11-tap FIR filter.


Fig. 6 Illustration of the organization of calculations in accordance with the basic filtering operation, M = 11

4 Implementation Complexity

Since the lengths of the input sequences are relatively small, and the data flow diagrams representing the organization of the computation process are fairly simple, it is easy to estimate the implementation complexity of the proposed solutions. Table 1 shows estimates of the number of arithmetic blocks for the fully parallel implementation of the short-length CNN minimal filtering algorithms. As can be seen, the implementation of the proposed algorithms requires fewer multipliers than an implementation based on the naive methods of performing the filtering operations. Reducing the number of multipliers is especially important in the design of specialized VLSI fully parallel processors, because minimizing the number of necessary multipliers also reduces the power dissipation and lowers the cost of implementation of the entire system. This is because a hardware multiplier is a more complex unit than an adder and occupies much more of the chip area than the adder.


Table 1 Implementation complexities of naive and proposed solutions (numbers of arithmetic blocks)

Size M | Naive method: multipliers | M-input adders | Proposed algorithm: multipliers | 2-input adders | 3-input adders | 4-input adders | 5-input adders
3      | 6                         | 2              | 4                               | 4              | 2              | –              | –
5      | 10                        | 2              | 7                               | 6              | –              | –              | 2
7      | 14                        | 2              | 10                              | 8              | 6              | –              | –
9      | 18                        | 2              | 12                              | 12             | 8              | –              | –
11     | 22                        | 2              | 15                              | 16             | 6              | 2              | –

It is proved that the implementation complexity of a hardwired multiplier grows quadratically with operand size, while the hardware complexity of a binary adder increases linearly with operand size [33]. Therefore, a reduction in the number of multipliers, even at the cost of a small increase in the number of adders, has a significant role in the hardware implementation of the algorithm.
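As a quick check of the multiplier savings reported in Table 1, the following illustrative snippet (not from the paper) compares the naive count of 2M multipliers with the proposed counts:

proposed_multipliers = {3: 4, 5: 7, 7: 10, 9: 12, 11: 15}   # from Table 1
for M, proposed in proposed_multipliers.items():
    naive = 2 * M
    saving = 100 * (1 - proposed / naive)
    print(f"M = {M:2d}: naive {naive:2d}, proposed {proposed:2d}, saving {saving:.0f}%")
# prints savings of roughly 29-33%, i.e. the ~30% reported above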

5 Conclusion

In this paper, we analyzed the possibilities of reducing the multiplicative complexity of calculating the basic filtering operations for small-length impulse responses of the M-tap FIR filters that are used in convolutional neural networks. We also synthesized new algorithms for implementing these operations for M = 3, 5, 7, 9, 11. Using these algorithms reduces the computational complexity of the basic filtering operation, thus reducing the complexity of its hardware implementation. In addition, as can be seen from the figures, the proposed algorithms have a pronounced parallel modular structure. This simplifies the mapping of the algorithms into an ASIC structure and unifies their implementation in FPGAs. Thus, the acceleration of computations during the implementation of these algorithms can also be achieved through the parallelization of the computation processes. The proposed algorithms can be effectively used to speed up computations in applications including computational intelligence, deep learning, and the use of convolutional neural networks for solving various problems related to the reliability and safety of the functioning of constructions and equipment [34–37].

References 1. Tadeusiewicz, R., Chaki, R., Chaki, N.: Exploring Neural Networks with C#, CRC Press. Taylor & Francis Group, Boca Raton (2014)


2. Aggarwal, C.C.: Neural Networks and Deep Learning : A Textbook. Springer International Publishing AG (2018) 3. Adhikari, S.P., Kim, H., Yang, C., Chua, L.O.: Building cellular neural network templates with a hardware friendly learning algorithm. Neurocomputing 312(27), 276–284 (2018) 4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Proceedings of NIPS’12, Lake Tahoe, Nev., USA, pp. 1097–1105 (2012) 5. Mathieu, M., Henaff, M., LeCun, Y.: Fast training of convolutional networks through ffts (2013). arXiv:1312.5851 6. Lin, S., Liu, N., Nazemi, M., Li, H., Ding, C., Wang, Y.: Pedram, M. FFT-Based Deep Learning Deployment in Embedded Systems (2017). arXiv:1712.04910v1(2017). 7. Abtahi, T., Shea, C., Kulkarni, A, Mohsenin, T.: accelerating convolutional neural network with FFT on embedded hardware. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(9), 1737–1749 (2018) 8. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: Proceedings of CVPR’16. Las Vegas, NV, US, pp. 4013–4021 (2016) 9. Wang, X., Wang., Zhou, X.: WinoNN: Optimising FPGA-based neural network accelerators using fast winograd algorithm. In: Proceedings of International Conference (CODES+ISSS)’18, Turin, Italy (2018) 10. Lu, L., Liang, Y.: SpWA: an efficient sparse winograd convolutional neural networks accelerator on FPGAs. In: Proceedings of International Conference on DAC’18, San Francisco, California, USA (2018) 11. Xygkis, A., Papadopoulos, L., Moloney, D., Soudris, D., Yous, S.: Efficient Winogradbased Convolution Kernel implementation on edge devices. In: Proceedings of International Conference on DAC’18, San Francisco, California, USA (2016) 12. Jia, Z., Zlateski, A., Durand, F., Li, K.: Optimizing N-dimensional, Winograd-based convolution for manycore CPUs. ACM SIGPLAN Notices—PPoPP ‘18, Vol. 53 no. 1, pp. 109–123 (2018) 13. Yu, J., Hu, Y., Ning, X., Qiu, J., Guo, K., Wang, Y., Yang, H.: instruction driven cross-layer CNN accelerator with winograd transformation on FPGA. ACM Trans. Rec. Tech. Syst. (TRETS) 11(3), pp. 227–230 (2018) 14. Zhao, Y., Wang, D., Wang, L., Liu, P.: A faster algorithm for reducing the computational complexity of convolutional neural networks. Algorithms 11, 159 (2018) 15. Lu, L., Liang, Y., Xiao, Q., Yan, S.: Evaluating fast algorithms for convolutional neural networks on FPGAs. In: Proceedings of FCCM’17, Napa, CA, USA, pp. 101–108 (2017) 16. Cengil, E., Çinar, A., Güler, Z.: A GPU-based convolutional neural network approach for image classification. In: 2017 International Artificial Intelligence and Data Processing Symposium (IDAP) (2017) 17. Strigl, D., Kofler, K., Podlipnig, S.: Performance and scalability of GPU-based convolutional neural networks. In: Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, PDP 2010, Pisa, Italy, February 17–19, 2010 18. Li, X., Zhang, G., Huang, H. H., Wang, Z., Zheng, W.: Performance analysis of gpu-based convolutional neural networks. In: Proceedings of 45th International Conference on Parallel Processing, pp. 67–76 (2016) 19. Shawahna, A., Sait, S.M., El-Maleh, A.: FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review (2019). arXiv:1901.00121v1 20. Guo, K., Zeng, S., Yu, J., Wang Y., Yang, H.: A Survey of FPGA-Based Neural Network Inference Accelerator (2018). arXiv:1712.08934v3. 21. 
Hoffmann, J., Navarro, O., Kästner, F., Janßen, B., Hübner, M.: A survey on CNN and RNN implementations. In: PESARO 2017: The Seventh International Conference on Performance, Safety and Robustness in Complex Systems and Applications, pp. 33–39 (2017) 22. Liu, Z., Chow, P., Xu, J., Jiang, J., Dou, Y., Zhou, J.: A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs. Electronics 8(65), 1–19 (2019) 23. Chen, Y.H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52(1), 127–138 (2017)


24. Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.-H., Srivastava, M., Gupta, R., Zhang, Z.: Accelerating binarized convolutional neural networks with software-programmable fpgas. In: Proceedings of FPGA’17, Monterey, CA, USA, pp. 15–24 (2017) 25. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J.: Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of FPGA’15, ACM, USA, Monterey, CA, USA, pp. 161–170 (2015) 26. Farabet, C., Poulet, C., Han, J. Y., LeCun, Y. CNP: an FPGA-based processor for convolutional networks. In: Proceedings of FPL 2009, IEEE, Prague, Czech Republic, pp. 32–37 (2009) 27. Ovtcharov, K., Ruwase, O., Kim, J.Y., Fowers, J., Strauss, K, Chung, E.S.: Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper, Microsoft Research, 2/22 (2015) 28. Li, Y., Liu, Z., Xu, K., Yu, H., Ren, F.: A 7.663-Tops 8.2-w Energy Efficient FPGA Accelerator for Binary Convolutional Neural Networks (2017). arXiv:1702.06392 29. Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S., Wang, Y., Yang, H.: Going deeper with embedded FPGA platform for convolutional neural network. In: Proceedings of FPGA’16, ACM, Monterey, CA, USA, pp. 26–35 (2016) 30. Li, H., Fan, X., Jiao, L.; Cao, W., Zhou, X., Wang, L.: A high performance FPGA-based accelerator for large-scale convolutional neural networks. In: Proceedings of FPL’16, Lausanne, Switzerland, pp. 1–9 (2016) 31. Hardieck, M., Kumm, M., Möller, K., Zipf, P.: Reconfigurable convolutional kernels for neural networks on FPGAs. In: Proceedings of International Conference on FPGA’19, Seaside, CA, USA, pp. 43–52 (2019) 32. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998) 33. Oudjida, A.K., Chaillet, N., Berrandjia, M.L., Liacha, A.: A new high radix-2r (r ≥ 8) multibit recoding algorithm for large operand size (N ≥ 32) multipliers. J. Low Power Electron. ASP 9, 50–62 (2013) 34. Marugán, A.P., Chacón, A.M.P., Márquez, F.P.G.: Reliability analysis of detecting false alarms that employ neural networks: a real case study on wind turbines. Reliab. Eng. Syst. Saf. 191(106574), 1–12 (2019) 35. Xu, Z., Saleh, J.H.: Machine Learning for Reliability Engineering and Safety Applications: Review of Current Status and Future Opportunities (2020). arXiv:2008.08221 36. Duchesne, L., Karangelos, E., Wehenkel, L.: Recent developments in machine learning for energy systems reliability management. Proc. IEEE 108(9), 1656–1676 (2020) 37. Chen, B., Liu, Y., Zhang, C., Wang, Z.: Time series data for equipment reliability analysis with deep learning. IEEE Access 8, 105484–105493 (2020)

Digital Technologies in Reliability Engineering

New Challenges and Opportunities in Reliability Engineering of Complex Technical Systems Antoine Rauzy

Abstract In this article, we discuss the impacts of the technological transformations currently at work on the reliability engineering of complex technical systems. We consider transformations both in systems and in the means to study them. We review challenges to meet in order to manage the current technological paradigm shift. We advocate the potential benefits of the so-called model-based approach in probabilistic risk assessment. We exemplify this approach by presenting the S2ML+X modeling technology.

1 Introduction

This article aims at discussing new challenges and opportunities brought to reliability engineering of complex technical systems by technological transformations currently at work. It echoes some reflections initiated a few years ago by Aven and Zio [1–4], and aims at contributing to the on-going debate about the future of our discipline. All complex technical systems (aircrafts, nuclear power plants, offshore platforms, civil and military drones…) present risks to themselves, their operators and the environment. Therefore, one must ensure that these risks are economically, ecologically, and socially acceptable. This is the role of reliability engineering. Reliability engineering encompasses processes as diverse as safety analyses, optimizations of maintenance policies, assessments of the expected production level of a plant over a given period, assessments of the resilience of a socio-technical infrastructure and so on. In a word, reliability engineering aims at assessing the operational performance of systems subject to random events such as mechanical failures, operator errors, sudden changes in environmental conditions… With that respect, it relies on models and more specifically on stochastic models, as its objective is to deal with aleatory uncertainties. Probabilistic risk analysis or equivalently probabilistic risk


assessment is the process by which these models are designed and used to calculate performance indicators. The WASH-1400 report [5], published a few years before the Three Mile Island nuclear accident, is usually considered as the historical starting point of the worldwide, cross-industry adoption of probabilistic risk analyses. As of today, they rely mostly on modeling technologies such as fault trees, reliability block diagrams and event trees [6, 7]. These technologies are well suited for mechanical systems and well mastered by practitioners. In this article, we address two questions:

1. Are these modeling technologies still suitable to assess risks in new generations of systems, which are software intensive and rely on ubiquitous control mechanisms?
2. Can the new capacities provided by artificial intelligence and information technologies change the probabilistic risk analysis process?

The first question is indeed of importance because most of the systems currently designed by industry fall into this category. Our answer to this question is essentially negative: more powerful modeling frameworks are needed. Our answer to the second question is positive, even though many challenges remain to be met. We advocate here that part of the answer relies on the so-called model-based approach in probabilistic risk assessment and, more generally, in systems engineering. This approach relies itself on a new generation of modeling techniques and tools that make it possible to represent more accurately the behavior of complex technical systems, to maintain models more easily through the life-cycle of systems, to integrate seamlessly probabilistic risk analyses with other model-based systems engineering processes, and last but not least, to take advantage of the fantastic opportunities provided by artificial intelligence and information technologies. We present here the main underlying ideas of this approach and exemplify them with the description of the S2ML+X family of domain specific modeling languages [8, 9]. Of course, even if the model-based approach can contribute to solving the problems at stake, it does not solve all of them. As software engineers use to say, "there is no silver bullet" [10]. In particular, risk analysts must face the combinatorial explosion of the number of scenarios to analyze, with inherently limited calculation resources, whatever technology is used to support the analysis [11]. We shall discuss this issue and review some other challenges to meet. The contribution of this article is thus threefold. First, it discusses the two above questions, from the point of view of a computer scientist. Second, it provides a brief introduction to the model-based approach in probabilistic risk assessment. Third, it discusses challenges to meet in order to manage the current paradigm shift in technologies. The remainder of this article is organized as follows. In Sect. 2, we shall recall the basic principles of probabilistic risk analyses, explain why the current process will probably change dramatically soon, and sketch the


forthcoming process. In Sect. 3, we shall present the model-based approach in probabilistic risk assessment and exemplify it with the S2ML+X modeling technology. In Sect. 4, we shall discuss challenges to meet. Finally, Sect. 5 concludes the article.

2 The Probabilistic Risk Assessment Process

2.1 Current Process

The current probabilistic risk assessment process is described in Fig. 1. The risk analyst uses two kinds of prior knowledge to perform probabilistic risk assessment: the specifications of the system under study, so as to understand how the system works and how it may fail, and reliability data for the components of the system, typically those recorded in the OREDA book [12] for the oil and gas industry. From this knowledge, the analyst designs a model, e.g. a fault tree. Then, he calculates indicators of operational performance such as the availability of the system, its reliability, its mean down time, its average production and so on. These indicators are eventually used to make decisions about the design of the system.

Fig. 1 The (current) probabilistic risk assessment process


2.2 Game Changers

The above process will necessarily change, probably sooner than most of us expect, for at least three reasons.

First, systems designed by industry are increasingly complex, due notably to the massive introduction of software and ubiquitous control mechanisms. We are gradually moving from mechanical systems to mechatronic systems, cyber-physical systems and even systems of systems. As we shall show in the fourth section, fault trees and related modeling formalisms are not suitable to represent accurately the dynamic aspects of the behavior of these systems.

Second, we are quickly moving from a situation where reliability data are scarce and difficult to access to a situation where data are over numerous and easy to access. This will induce considerable changes in the probabilistic risk assessment process, although it is admittedly still hard to see the premises of this (r)evolution. We ask here the reader to consider again the current situation, as it is represented in Fig. 1: reliability data are manually collected by operators, then aggregated by experts (with high skills in statistics) who try to fit them into parametric distributions such as the negative exponential distribution or the Weibull distribution. The parameters of these distributions are then recorded into books like the already mentioned OREDA [12]. Risk analysts eventually pick up data in these books to feed their models. This way of doing things was fine but now looks completely outdated. First, manual processing of data will no longer be possible when these data come from sensors continuously monitoring systems. Second, relying on books sounds weird at a time where billions and billions of digital data circulate on the internet every second. Therefore, modeling environments should soon be directly connected to data bases. Third, the main reason to use parametric distributions was that they provide a compact way to store the information. However, the smallest image posted on social networks contains more information than required to describe any empirical probability distribution. Therefore, why go on using parametric distributions and not directly the source data? Using source data directly would produce more accurate results. It could also make it possible to update data, if not in real time, at least much more often than currently. Moreover, different treatments could be performed on the data depending on the needs of the analysis. In a word, the reliability data that are used in probabilistic risk assessment are currently obtained via intermediations that will probably disappear in a near future.

The third game changer is also linked to the digital transformation of industrial processes: any complex system now comes with hundreds, if not thousands, of models and data sets. These models and data sets constitute what is sometimes called the "digital twin" of the system [13]. We have definitely entered the model-based systems engineering era. Models are used not only to design systems, but also to operate and even to decommission them. This has two consequences: first, one needs to update models much more frequently than before. Second, one needs to ensure the coherence of the various models


designed by the different engineering disciplines. In both respects, modeling formalisms like fault trees, block diagrams or event trees are not well suited. Validating and updating these models is an extremely hard task because of their cognitive distance to systems specifications. Concretely, it is nearly impossible to understand how the system works from the fault tree describing how it may fail. This calls for a new generation of modeling formalisms making it possible to reduce the gap between (model-based) systems specifications and risk assessment models.

2.3 Envisioned Process

The emerging risk assessment process, as we envision it, is pictured in Fig. 2. As the reader can see, it presents significant differences with the process described in Fig. 1. Systems specifications from which the analyst derives the risk assessment model will rely more and more on models, as opposed to documents. Means should thus be put in place to synchronize system architecture models with risk assessment models. We shall come back to that point in the fourth section. Manual recording of failures will be progressively replaced by automated health monitoring of systems, by means of sensors. Monitoring data will be stored in data bases. Data analysts will use artificial intelligence and machine learning techniques to extract from these data learned degradation indicators and probability distributions of failure of components. These indicators will be integrated directly into models via digital communications.

Fig. 2 Envisioned probabilistic risk assessment process


Risk assessment models will also evolve so as to be able to represent faithfully the behaviors of systems, which will be much more dynamic than those of purely mechanical systems. Finally, models will be used on-line to make decisions about systems operations and not only off-line in the design phase. In a word, models will be "in the loop". Bets on the evolution of technologies are usually lost. The future probabilistic risk assessment process will thus probably not look exactly like what we described above. Nevertheless, we are convinced that these are elements of the future process and, in any case, issues that are worth studying.

2.4 Discussion

The process described in the previous section relies heavily on models designed by the analyst. In an even more futuristic vision, one could imagine performing risk analyses straight from systems specifications and health monitoring data, by means of artificial intelligence techniques. The author believes that such a dream (for managers at least) has little chance to become reality, if any. There are at least two major reasons to support our disbelief. First, artificial intelligence techniques require large training sets. As Yann LeCun, chief scientist at Facebook and one of the gurus of deep learning, keeps repeating, the progress made in artificial intelligence in recent years comes for a good part from the availability of very large training sets [14]. But incidents and accidents are hopefully rare. Therefore, even though there are enough data to feed handmade models, there are not enough to get rid of these models. Second, artificial intelligence techniques, as any other computer tool, are efficient on well-defined problems. But precisely, the process of designing a model is the process by which the analyst makes the problem well-defined. We shall now clarify what we mean by the model-based approach in probabilistic risk assessment, as it is an essential ingredient of the envisioned process.

3 Model-Based Safety Assessment

3.1 The Promise of Model-Based Risk Assessment

In systems engineering, the model-based approach is defined as opposed to the document-based approach [15]. The situation is indeed different for probabilistic risk analysis, which relies in essence on models. Rather, the model-based approach in reliability engineering is characterized by the type of models that are used. The most widely used modeling formalisms for safety analyses lack either expressiveness, e.g., fault trees and event trees, or structure, e.g., Markov chains and


Fig. 3 A two-line separation system

stochastic Petri nets. Consequently, they are far from systems specifications. These deficiencies make the models hard to design, hard to share with stakeholders, and even more importantly, hard to maintain through the entire lifecycle of systems. As an illustration, consider for instance the small system pictured in Fig. 3, which we shall use throughout this section. This system is made of two lines (L1 and L2). Each line consists itself of a separator S and a compressor C. The system is working if at least one of the two lines is working. A line is working if both its separator and its compressor are working. Figure 4 shows a minimal fault tree describing the possible failures of this system. This model makes a number of implicit assumptions: the two lines are assumed to be in hot redundancy, the capacities of units are assumed to be either 100% or 0%, and so on. Perhaps more importantly, it is nearly impossible from such a fault tree to retrieve the actual architecture of the system. If failures of separators and compressors are further decomposed, the analyst must duplicate by hand the descriptions of these failure conditions, which is both tedious and error prone, not to speak about maintenance (of the model) issues. Modeling systems in a more structured way and with suitable mathematical frameworks can reduce the distance between systems specifications and models, without

Fig. 4 A minimal fault tree describing failures of the system pictured in Fig. 3


increasing the complexity of calculations. This is the promise of the so-called model-based risk assessment. This approach affords the ability to animate/simulate models, to ease their validation, and to share them with stakeholders. Moreover, it presents the following important benefits for risk analyses stricto sensu:

• A single model can address several safety goals, which eases versioning, configuration and change management;
• It can be assessed by several assessment tools, which increases versatility of assessments and quality-assurance of results (even if at a certain cost);
• It allows fine grain analyses, which limits over-pessimism resulting from coarse grain analyses as performed for instance with fault trees;
• Its maintenance is alleviated significantly, as it is closer to systems specifications;
• Similar formalisms can be used to design simple static models as well as dynamic models, hence facilitating the acquisition of competences and the industrial deployment of tools;
• The graphical animation of models makes it possible to share them with non-specialists;
• The same technology can be used not only for risk analyses but more generally to assess the operational performance of systems (in terms of costs, delays, production levels…).

Modeling formalisms that support this approach can be classified into three categories. The first category consists of specialized profiles of model-based systems engineering formalisms such as SysML, see e.g. [16]. The objective here is however more to introduce a safety facet into models of system architecture than to design actual safety models. The second category consists of extensions of fault trees or reliability block diagrams so as to enrich their expressive power. This category includes dynamic fault trees [17, 18], multistate systems [19–21], and some other proposals [22]. The third category, which aims at taking full advantage of the model-based approach, consists of modeling languages such as Figaro [23] or AltaRica [24]. We shall focus on the latter category, as we consider it the most promising. More exactly, we shall now present the S2ML+X family of modeling languages.

3.2 The S2ML+X Paradigm

Modeling languages of the S2ML+X family consist of two parts: a specific part, the X, which is a particular mathematical framework, e.g. guarded transition systems (GTS) [25, 26] in the case of AltaRica 3.0, and a general part, S2ML. S2ML stands for “system structure modeling language” [9]. S2ML gathers in a coherent way structuring constructs stemming from object-oriented programming [27] and prototype-oriented programming [28]. In other words, languages of the S2ML+X family obey the following equation, which echoes the title of Wirth’s famous book on the Pascal programming language (“Algorithms + Data Structures = Programs”) [29].


Behaviours + Architectures = Models

Our thesis is that it is possible to obtain full-fledged object-oriented modeling languages by putting S2ML on top of a core mathematical framework aiming at describing behaviors in a certain way. This applies not only to guarded transition systems (X = GTS), which gives AltaRica 3.0, but also to systems of stochastic Boolean equations (X = SBE), which are the underlying mathematical framework of fault trees and reliability block diagrams. This is actually what we have implemented in the new version of our tool XFTA, which is probably the most powerful and efficient calculation engine in its class [30]. Beyond that, it would be possible to apply the same principle to finite degradation structures [31] and even to mathematical frameworks used outside of reliability engineering, such as ordinary differential equations, obtaining in this way modeling languages similar to Matlab/Simulink [32] or Modelica [33].

3.3 S2ML in a Nutshell

At this point, it is probably time for us to provide the reader with more insights about S2ML. Surprisingly enough, S2ML relies on only ten concepts: those of ports, connections, prototypes, classes, composition, cloning, instantiation, inheritance, reference and aggregation. Ports are the basic objects of a model. For instance, in S2ML+SBE, parameters of probability distributions, basic, intermediate and house events, as well as common cause failure groups, are ports. Connections are relations, taken in a broad sense, that link ports. Connections capture the behavior of the system. For instance, in S2ML+SBE, equations defining parameters, basic, intermediate and house variables, as well as definitions of common cause failure groups, are connections. Ports and connections suffice to create a model. For instance, the fault tree pictured in Fig. 4 is just a graphical representation of the following system of Boolean equations.

  F-Loss = L1-Loss and L2-Loss
  L1-Loss = L1-S-Failed or L1-C-Failed
  L2-Loss = L2-S-Failed or L2-C-Failed

For the sake of simplicity, we leave aside here the description of the probability distributions associated with the four basic events L1-S-Failed, L1-C-Failed, L2-S-Failed and L2-C-Failed, but they can be encoded in a similar way. Writing such a set of equations, or equivalently drawing the fault tree, is easy when the system under study is small. However, a model made only of ports and connections reflects the architecture of the system under study only very indirectly, as discussed above.
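As a purely illustrative aside (not part of the original text), the following minimal Python sketch evaluates the top event probability of this small fault tree under the usual assumption of independent basic events; the numerical failure probabilities are hypothetical placeholders.

  # Hedged sketch: top-event probability of the fault tree of Fig. 4,
  # assuming independent basic events with hypothetical probabilities.
  p = {
      "L1-S-Failed": 0.01, "L1-C-Failed": 0.02,
      "L2-S-Failed": 0.01, "L2-C-Failed": 0.02,
  }

  def p_or(pa, pb):
      # P(A or B) for independent events A and B
      return 1.0 - (1.0 - pa) * (1.0 - pb)

  # L1-Loss = L1-S-Failed or L1-C-Failed
  p_l1 = p_or(p["L1-S-Failed"], p["L1-C-Failed"])
  # L2-Loss = L2-S-Failed or L2-C-Failed
  p_l2 = p_or(p["L2-S-Failed"], p["L2-C-Failed"])
  # F-Loss = L1-Loss and L2-Loss (the two lines share no basic event)
  p_f = p_l1 * p_l2
  print(f"P(F-Loss) = {p_f:.6f}")  # about 8.9e-4 with the values above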


To structure models, one needs containers for declarations of ports and connections (and other elements). The fundamental container is the prototype, i.e. a container with a unique occurrence in the model. In languages of the S2ML+X family, prototypes are called blocks. When a container (a block or any other type of container) contains an element, one says that the container composes this element. Composition is a fundamental relation between model elements, sometimes referred to as the is-part-of relation. With ports, connections and prototypes, it is already possible to design hierarchical models, i.e. to decompose the system under study into functional or physical subsystems, then these subsystems into sub-subsystems, and so on until the desired degree of granularity is reached. Such models would however lack two fundamental ingredients. First, a way to represent that two elements of the model describe similar parts of the system. This is of special interest in reliability engineering, where redundancy is a key means to ensure the required level of performance (as in our example). Second, a way to connect elements located in different places of the hierarchy. When two parts of the system under study are alike, e.g. the system pictured in Fig. 3 is made of two identical lines, the description of the second line is the same as the description of the first one, up to the naming of elements. Once the first line is described, the description of the second one can be obtained by a kind of copy-paste operation. This is however error prone and hides a fundamental piece of information: precisely the fact that line 1 and line 2 are identical. S2ML provides the concept of cloning to deal with such situations. Rather than copy-pasting the prototype describing the first line to get the prototype describing the second one, one says that the second prototype is a clone of the first one. The assessment tool, XFTA in our case, is then in charge of performing the duplication. For instance, the description of the system pictured in Fig. 3 could have the following structure.

  block System
    block Line1
      block Separator
        // description of the behavior of the separator
      end
      block Compressor
        // description of the behavior of the compressor
      end
    end
    clones Line1 as Line2;
  end

Note that the above structure is independent of the mathematical framework chosen to describe behaviors. Cloning makes it possible to duplicate modeling elements within a model, but not to reuse them from one model to another.


Moreover, when considering the description of basic components, e.g. pumps or valves, the choice of the initial model element (from which the other similar model elements are obtained by cloning) is rather arbitrary. The idea is therefore to create libraries of on-the-shelf modeling elements, outside any particular model, and to clone these modeling elements into the model when needed. In S2ML (and more generally in object-oriented programming), this is achieved by the concepts of classes and instances. A class is just a prototype declared outside the model. Instantiation is the operation by which a class is cloned into a model. The resulting prototype is called an instance of the class. For instance, we could define classes to describe the behavior of separators and compressors, then instantiate them into our model:

  class Separator
    // description of the behavior of the separator
  end
  class Compressor
    // description of the behavior of the compressor
  end

  block System
    block Line1
      Separator S;
      Compressor C;
    end
    clones Line1 as Line2;
  end
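To make these structuring constructs more tangible, here is a minimal Python sketch (an illustration only, not the actual S2ML or XFTA implementation) of how composition, cloning and instantiation can be expanded by a tool into plain nested blocks; all names reuse those of the example above.

  # Hedged sketch: toy representation of S2ML blocks, composition,
  # cloning and instantiation. Illustrative only.
  from __future__ import annotations
  import copy
  from dataclasses import dataclass, field

  @dataclass
  class Block:
      """A prototype: a container with a unique occurrence in the model."""
      name: str
      children: dict = field(default_factory=dict)

      def compose(self, child: "Block") -> "Block":
          # Composition: the 'is-part-of' relation between containers.
          self.children[child.name] = child
          return child

  # Classes are just prototypes declared outside the model.
  classes = {"Separator": Block("Separator"), "Compressor": Block("Compressor")}

  def instantiate(parent: Block, class_name: str, instance_name: str) -> Block:
      # Instantiation: clone a class into the model; the result is an instance.
      instance = copy.deepcopy(classes[class_name])
      instance.name = instance_name
      return parent.compose(instance)

  def clone(parent: Block, original: str, new_name: str) -> Block:
      # Cloning ('clones Line1 as Line2;'): the tool performs the duplication.
      twin = copy.deepcopy(parent.children[original])
      twin.name = new_name
      return parent.compose(twin)

  system = Block("System")
  line1 = system.compose(Block("Line1"))
  instantiate(line1, "Separator", "S")
  instantiate(line1, "Compressor", "C")
  clone(system, "Line1", "Line2")

  print(sorted(system.children))                    # ['Line1', 'Line2']
  print(sorted(system.children["Line2"].children))  # ['C', 'S']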

Note again that the above structure is independent of the mathematical framework chosen to describe behaviors. Now, it is sometimes the case that a component is a particular type of a more general category of components, e.g. a solenoid valve is a particular type of valve. Most of the properties of the particular component are actually common to all components of the category, while some are specific. To represent that, it would indeed be possible to create a class for generic components and then to instantiate this class into the class describing the specific ones. This would however lead to awkward models: a solenoid valve is not part of a generic valve. Rather, a solenoid valve is-a valve. In S2ML (and more generally in object-oriented programming), capturing is-a relations is achieved by means of inheritance. When a prototype or class inherits from another prototype or class, it means that all elements composed by the latter are composed by the former. It is then possible to modify the definitions of these elements or to add new ones to reflect the particular properties of the specific component. For example:


  class Valve
    // description of the behavior of a generic valve
  end
  class SolenoidValve extends Valve;
    // description of the specific features of solenoid valves
  end

The last ingredient we need to deploy fully object-oriented modeling is the possibility to refer to an element located somewhere in the hierarchy of prototypes from anywhere else in this hierarchy. The notion of reference is thus key. In S2ML, referring to ports is achieved by means of paths. Within a block, each element is uniquely identified by a name, called its identifier. Two elements cannot have the same name, even if they are of different types. To refer to an element located in another block, one uses paths built with the dot notation and the two primitives main and owner:

• B.E refers to the element E composed by the block B, itself composed by the current block. Applying this principle recursively makes it possible to refer to any element located in the hierarchy rooted at the current block.
• owner refers to the parent block of the current block. Therefore, owner.owner.B.E refers to the element E composed by the block B, itself composed by the grandparent block of the current block. The primitive owner makes it possible to create relative paths referring to any element in the current hierarchy.
• main refers to the outermost block of the current hierarchy, i.e. the model itself. Therefore, main.B.E refers to the element E composed by the block B declared at the top level. The primitive main makes it possible to create absolute paths referring to any element in the model.

For instance, assume that the class Separator declares a variable out, which is true if and only if the separator works properly, and that the class Compressor declares a variable in to reflect the flow upstream of the compressor. Then, at line level, we have to connect these two variables by means of an equation. This can be done as follows, using the dot notation.

  block Line1
    Separator S;
    Compressor C;
    flow C.in = S.out;
  end

There are cases where one needs to refer not only to an individual element, such as a parameter or a variable, but to a whole container, possibly itself composing sub-containers. In that case, using paths would be tedious and error prone. The solution consists in the last concept provided by S2ML, namely the aggregation of containers.


Let A and B be two containers located at different places in the same hierarchy, and let π.B be the path (relative or absolute) that goes from A to B in that hierarchy. To access an element E composed by B from A, one must normally use the path π.B.E. By aggregating in A the container B (actually the container π.B) under the name C, one makes it possible to access E from A by means of the path C.E. In some sense, this creates the alias C for the path π.B in A. Aggregation should however not be seen only as a technical solution to create references. More fundamentally, it represents a uses relation: A uses B, although B is not declared in the vicinity of A. Aggregation is a key tool to describe so-called functional chains [34] as well as to glue together, within the same model, descriptions of functional and physical architectures [35].
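As a side illustration (again not the actual S2ML machinery), the following Python sketch resolves dot paths with the primitives owner and main over a toy block hierarchy, and treats aggregation as an alias for a path; the hierarchy and names reuse the running example.

  # Hedged sketch: toy resolution of S2ML-style dot paths ('owner', 'main')
  # and of aggregation seen as aliasing a path. Illustrative only.
  from __future__ import annotations
  from dataclasses import dataclass, field

  @dataclass
  class Node:
      name: str
      parent: "Node | None" = None
      children: dict = field(default_factory=dict)
      aliases: dict = field(default_factory=dict)   # aggregation: alias -> path

      def add(self, child: "Node") -> "Node":
          child.parent = self
          self.children[child.name] = child
          return child

  def resolve(start: Node, path: str) -> Node:
      """Resolve a path such as 'owner.Line1.S' or 'main.Line1.C' from 'start'."""
      node = start
      for step in path.split("."):
          if step == "main":                 # outermost block of the hierarchy
              while node.parent is not None:
                  node = node.parent
          elif step == "owner":              # parent block of the current block
              node = node.parent
          elif step in node.aliases:         # aggregation: follow the alias
              node = resolve(node, node.aliases[step])
          else:                              # composition: go down to a child
              node = node.children[step]
      return node

  system = Node("System")
  for line_name in ("Line1", "Line2"):
      line = system.add(Node(line_name))
      line.add(Node("S"))
      line.add(Node("C"))

  line2 = system.children["Line2"]
  print(resolve(line2, "owner.Line1.S").name)   # relative path -> S
  print(resolve(line2, "main.Line1.C").name)    # absolute path -> C

  line2.aliases["Peer"] = "owner.Line1"         # aggregate Line1 in Line2 as 'Peer'
  print(resolve(line2, "Peer.S").name)          # -> S, through the alias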

3.4 AltaRica 3.0

So far, we have focused on systems of stochastic Boolean equations, which have the expressive power of fault trees and reliability block diagrams. With AltaRica 3.0 [24], we leave the category of combinatorial modeling formalisms to enter the category of state automata; see reference [11] for a discussion of these categories. Due to space limitations, it is not possible to present here all the features of the language. In the previous section, we gave a flavor of S2ML. We shall thus illustrate here the expressive power of guarded transition systems [25, 26] by means of an example. Assume that, in our case study, the second line is a backup for the first one, i.e. that its separator and its compressor are put in operation on demand. Systems of stochastic Boolean equations are not powerful enough to represent this behavior faithfully (and, more generally, to take time dependencies into account). Figure 5 shows the graphical representation of a guarded transition system representing a standby unit. Figure 6 shows the AltaRica code for this guarded transition system. Just as in systems of stochastic Boolean equations, guarded transition systems use two types of variables to represent the current state of the system under study: state variables, which represent the actual state of the system, and flow variables, which represent the flows of matter, energy or information circulating in the network of components. The guarded transition system pictured in Fig. 5 uses one state variable, state, and three flow variables, demand, in and out. In AltaRica 3.0, variables take their values in sets of constants called domains. The domain of the variable state is the set of three symbolic constants {STANDBY, WORKING, FAILED}. The three flow variables are Boolean. The value of flow variables is calculated from the value of state variables, which means that the former are recomputed each time the latter are modified.


Fig. 5 The guarded transition system representing a standby unit

Fig. 6 AltaRica code for the guarded transition system pictured in Fig. 5


The value of state variables changes under the occurrence of events. In AltaRica, these changes are described by means of guarded transitions. A guarded transition is a triple (event, guard, action). The guard of a transition is a Boolean condition telling when the transition is enabled. The action of a transition describes how the transition modifies the value of state variables when it is fired. In our example, there are five transitions, labeled respectively by the events start, stop, failureOnDemand, failure and repair, and represented by arrows. Events are associated with probability distributions. In our example, the transitions labeled with the events start, stop and failureOnDemand are deterministic and instantaneous (associated with Dirac distributions), while the transitions labeled with the events failure and repair are timed and stochastic. A minimal sketch of these execution semantics is given at the end of this subsection. The combination of GTS and S2ML results in a powerful, versatile language which makes optimal use of assessment algorithms. An integrated modeling environment for AltaRica 3.0 (AltaRica Wizard) has been developed as a joint effort of the Open-AltaRica team at IRT-SystemX (Paris, France) and the author at NTNU. Industrial partners (Airbus, Safran and Thalès) support this project. A versatile set of assessment tools is under development, which includes:

• A step-by-step simulator making it possible to play “what-if” scenarios and to validate models. This simulator implements abstract interpretation techniques so as to simulate stochastic and timed executions faithfully [36].
• A compiler of AltaRica models into fault trees. This compiler relies on advanced algorithmic techniques [37]. Fault trees are then assessed with XFTA [30], which is one of the most efficient available calculation engines.
• A compiler of AltaRica models into Markov chains. This compiler produces Markov chains that approximate the original model while remaining of reasonable size [38]. Markov chains are then assessed with Mark-XPR, a very efficient calculation engine [39].
• A generator of critical sequences.
• A stochastic simulator. Stochastic simulation is itself a versatile tool to assess complex models [40].

These tools make the AltaRica 3.0 technology extremely efficient. They make cross-verification possible. They prefigure the next generation of modeling environments for the assessment of the operational performance of complex technical systems.
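The sketch announced above follows. It is a hedged Python illustration of the guarded transition system of the standby unit (it is not the AltaRica 3.0 code of Fig. 6): one state variable over the domain {STANDBY, WORKING, FAILED}, Boolean flow variables recomputed from the state, and transitions (event, guard, action) where start, stop and failureOnDemand are immediate and failure and repair are timed; the rates and the on-demand failure probability are hypothetical placeholders.

  # Hedged sketch of the standby-unit guarded transition system described in
  # the text. Illustrative semantics only; all numerical values are hypothetical.
  import random

  STANDBY, WORKING, FAILED = "STANDBY", "WORKING", "FAILED"

  class StandbyUnit:
      def __init__(self, lam=1e-4, mu=1e-2, gamma=1e-3):
          self.state = STANDBY      # state variable, domain {STANDBY, WORKING, FAILED}
          self.demand = False       # Boolean flow variables
          self.in_flow = True
          self.lam, self.mu, self.gamma = lam, mu, gamma

      @property
      def out(self):
          # Flow variables are recomputed from the state variables.
          return self.state == WORKING and self.in_flow

      def transitions(self):
          # Guarded transitions: (event, guard, action, delay distribution).
          return [
              ("start",           self.state == STANDBY and self.demand,
               lambda: setattr(self, "state", WORKING), "Dirac(0)"),
              ("failureOnDemand", self.state == STANDBY and self.demand,
               lambda: setattr(self, "state", FAILED),  "Dirac(0)"),
              ("stop",            self.state == WORKING and not self.demand,
               lambda: setattr(self, "state", STANDBY), "Dirac(0)"),
              ("failure",         self.state == WORKING,
               lambda: setattr(self, "state", FAILED),  f"exponential({self.lam})"),
              ("repair",          self.state == FAILED,
               lambda: setattr(self, "state", STANDBY), f"exponential({self.mu})"),
          ]

      def enabled(self):
          # A transition is enabled when its guard holds in the current state.
          return [t for t in self.transitions() if t[1]]

  unit = StandbyUnit()
  unit.demand = True                # the first line fails: demand on the backup
  for event, guard, action, delay in unit.enabled():
      print(event, delay)           # start and failureOnDemand compete at time 0

  # Hypothetical choice rule among the competing immediate transitions:
  # failureOnDemand with probability gamma, start otherwise.
  chosen = "failureOnDemand" if random.random() < unit.gamma else "start"
  for event, guard, action, delay in unit.enabled():
      if event == chosen:
          action()
  print(unit.state, unit.out)       # most likely: WORKING True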

3.5 Textual Versus Graphical Representations

Like all modeling languages of the S2ML+X family, S2ML+SBE and AltaRica 3.0 are primarily textual, just like computer programs. Graphical representations can be used, but the ultimate reference is the text. Not only do we not consider this a drawback, we claim it is a necessity.


At first, this thesis may seem at best extremely provocative, as most of the models designed for both system architecture and risk analyses (as well as in other engineering disciplines) are authored via graphical modeling environments, and many practitioners simply refuse to write a single line of code. However, graphical modeling is mainly useful to describe the structural parts of models and systems; see e.g. reference [41] for an interesting discussion on the pragmatics of graphical modeling. It is hard to conceive how to author a differential equation or the probability distribution of a basic event of a fault tree graphically. Behavioral descriptions, such as Markov chains or Petri nets, can be represented graphically. However, as soon as models become large, which is the case for nearly any industrial-scale system, their graphical representations become more problematic than useful: as they cannot fit into any reasonable space (computer screen or printed paper), the analyst can only visualize them in parts. This means that she or he must anyway develop a global cognitive model to understand the local graphical representations. Moreover, many subtle differences in behaviors are simply impossible to represent graphically. In a word, models exist independently of their graphical representations. These graphical representations, even taken together, cannot fully describe the model, except in simple cases. It is often very convenient to have several partial graphical representations of the same information and to extract graphical representations dynamically according to one’s needs. The parallel with software engineering is fruitful here. It is useful to represent the architecture of software by diagrams such as those of UML [42]. However, the software exists independently of these representations, and the code is the ultimate reference. Moreover, below a certain level of abstraction, the code gives more compact, more precise, in a word more useful, information than any drawing. At the end of the day, humanity invented writing to overcome the lack of precision of drawings. It remains that achieving the adoption of textual models is one of the challenges we must meet. We shall now discuss these challenges.

4 Challenges

4.1 Transforming Big Data into Smart Data

Sensors already produce lots of data (big data) and will produce even more in the future. However, most of these data cannot be exploited for probabilistic risk assessment. Therefore, a key question is how to collect data that can be translated into (probabilistic) degradation models, possibly in complement to physical models. If we can do so, it will remain to introduce degradation models for components into risk assessment models for systems, i.e. to accommodate them into stochastic discrete event systems. This latter point does not seem to be a major technical or scientific issue. We can be reasonably optimistic on this question.


4.2 Handling the Increasing Complexity of Systems

The behavior of software-intensive systems, often called mechatronic systems, or cyber-physical systems if they are connected to the net, is highly dynamic. Control mechanisms change the configuration of these systems depending on the state of their components, on the environment or on the needs in terms of production. Condition-based maintenance policies, which are increasingly adopted for the sake of reducing the costs of maintenance interventions and the production down-times due to these interventions, fall into this category. The introduction of these control mechanisms creates dependencies among components as well as dynamically scheduled phases in the life cycle of systems. Maintenance interventions are not scheduled once and for all, on a calendar basis, but decided dynamically based on the monitoring of the condition of the system, which in turn depends on maintenance interventions. Static models, such as fault trees, event trees or reliability block diagrams, cannot faithfully represent these dependencies and dynamically scheduled phases of life cycles. To be able to do so, one needs at least the expressive power of (stochastic) discrete event systems, like AltaRica. Moving from static models to discrete event systems has however a triple cost: first, analysts must be trained in these new modeling technologies; second, the computational cost of the calculation of risk indicators increases significantly; third, as modeling formalisms are more powerful and the problems at stake are more complex, models are more difficult to design and to validate. We shall discuss the first two points later in this section. The third one, model design and validation for complex technical systems, is one of the major technical challenges we are facing. It is striking how, as of today, the reliability engineering literature is still silent on this issue. It is as if modeling were a subsidiary task, requiring no other competences and skills than a good mathematical background and a solid practical knowledge of the systems under study. Nothing is more illusory. Models must be recognized as first-class citizens of scientific research in our domain. We need to develop the science and the engineering of models (of engineering). In that respect, much can be learned from the historical development of computer science and software engineering. As explained in the previous section, AltaRica 3.0 already embeds the most advanced concepts for structuring models. These concepts stem from object-oriented programming and prototype-oriented programming [27, 28]. Relying on a powerful mathematical framework and versatile structuring mechanisms is mandatory to handle the problems at stake. It is however not sufficient. To make the modeling process efficient, both in terms of model design and of model validation, it is of primary importance to reuse modeling components as much as possible, within models and between models. In modeling languages such as Modelica [33], this goal is achieved via the design of libraries of on-the-shelf, ready-to-use modeling components. Reusing components is also possible in probabilistic risk analyses, but to a much lesser extent. The reason is that these analyses represent systems at a high level of abstraction. Modeling components, except for very basic ones, thus tend to be specific to each system.


Reuse is mostly achieved by the design of modeling patterns, i.e. examples of models representing remarkable features of the system under study. Once identified, patterns can be duplicated and adjusted for specific needs [24]. The notion of patterns is pervasive in systems engineering. For instance, it has been developed in the field of technical system architecture, see e.g. [43], as well as in software engineering [44]. Patterns are also an excellent means of communication: in order to document models (or programs), it is often sufficient to refer to the patterns that have been used to design them. The author strongly believes that one of the tasks of the reliability engineering community should be to perform a systematic exploration of the modeling patterns for probabilistic risk assessment of today’s technical systems. It is probably the only way to tame the complexity of these systems.

4.3 Computational Complexity of Probabilistic Risk Assessment

The risk analyst must face the combinatorial explosion of the number of scenarios to analyze. Whatever modeling technology is used, the calculation of probabilistic risk indicators is provably computationally hard, namely #P-hard, as demonstrated by Valiant [45] and further completed by Toda [46]. This was already true for mechanical systems; it is even more acute for mechatronic systems. During the last decades, tremendous progress has been made in the development of efficient algorithms and heuristics for probabilistic risk assessment. The power of computers has also dramatically increased. No doubt that more progress will be made in both directions in the future. Nevertheless, the above mathematical limits will continue to apply. The risk analyst will always have limited calculation capacities at hand. In practice, this means that probabilistic risk assessment models necessarily result from a trade-off between the accuracy of the description of the system under study and the ability to perform calculations on this description. In other words, the risk analyst faces the fundamental epistemic and aleatory uncertainties of risk assessment with a bounded calculation capacity. This bounded capacity over-determines both the design of models and the decisions that can be made from models; see reference [11] for an in-depth discussion of this topic. In that respect, he or she is like Simon’s economic agent, who must make decisions with bounded rationality [47]. The scientific and technological question at stake here is therefore to work on algorithms, heuristics and modeling methodologies that help to use the calculation resources at hand as efficiently as possible. At this point, we must say a few words about probabilistic risk analyses of systems of systems, which are increasingly present in industry and, more generally, in our lives [48]. These systems are very different from mechanical, mechatronic and even cyber-physical systems. We can characterize them as being:

• Opaque: their states can be observed only by indirect means;
• Reflective: they embody models of their own behavior and environment;


• Deformable: their architecture changes throughout their mission.

Clearly, even modeling technologies like AltaRica 3.0 are not suited to represent systems having these properties, as they assume that the architecture of the system under study is fixed [49]. To represent the behaviors of these systems of systems, another class of modeling frameworks is probably required, which we called stochastic process algebras in reference [11]. This class includes formalisms as diverse as (stochastic variants of) colored Petri nets (with an unbounded number of colors) [50], process algebras such as Milner’s pi-calculus [51], and agent-oriented modeling languages [52]. These formalisms are extremely powerful. They have, however, a major drawback: most of the questions we may ask are undecidable [53]. Consequently, we must forge new concepts to analyze these systems.

4.4 Integrating Seamlessly Models and Data Sets into the Digital Twin

To face the complexity of technical systems, the engineering disciplines contributing to the design and operation of these systems are designing models and collecting engineering data: as already said, any technical system now comes with hundreds, if not thousands, of models and data sets. These models are designed by different teams, in different modeling formalisms, at different levels of abstraction, for different purposes. Models also mature at different rates. The question is thus how to ensure that they describe the same system, i.e. how to synchronize them. There are at least four distinct aspects to this question. The first one concerns the management of models and data sets in the context of the extended enterprise. This is the realm of collaborative databases, product lifecycle and product data management environments [54]. The concept of “digital twin” is gaining popularity to designate systems in charge of the management of models (and engineering data) [55]. It impacts all engineering disciplines, including of course probabilistic risk assessment, as collaborative databases will provide the infrastructure for the system analysis and modeling processes. A second aspect is related to the seamless cooperation of models of different abstraction levels within a discipline. This is an important and difficult topic [56]. This aspect also concerns probabilistic risk analyses. The question here is how models designed by a client and its suppliers can cooperate. A mere integration cannot be the answer, for both intellectual property and computational complexity reasons. Mathematical concepts and algorithmic tools must be developed for this purpose. A third aspect regards the co-simulation of heterogeneous but compatible models, such as the experiments performed in the framework of the Ptolemy project [57]. For probabilistic risk analyses, it would mean for instance coupling risk assessment models with 3D physical simulation codes. This would be of interest, especially in terms of communication with the stakeholders. Regarding the calculation of risk indicators, however, this is probably quickly limited by computational complexity issues.


The fourth aspect regards the alignment of heterogeneous models representing the system at about the same level of abstraction. The alignment of system architecture models and probabilistic risk assessment models is a paradigmatic example of that. This alignment is an industrial necessity and is required by safety standards such as IEC 61508 [58] and IEC 61511 [59]. The heterogeneity of these models makes it impossible to compare them directly. To compare them, we must first abstract them into a common language, and then perform a comparison of their abstractions. Once the comparison has been made, it is possible to go back to the original models via a concretization mechanism. This principle is close to Cousot’s abstract interpretation of programs [60]. Significant results have been obtained in this direction that show the interest of this approach [61–63].

4.5 Managing the Change

The technological transformations we discussed in this article cannot be achieved without a conscious, organized and systematic management of change. To start with, it requires solving numerous intellectual property issues: who owns the data, who can access them, under which conditions, and so on. This problem indeed goes well beyond probabilistic risk analyses. It actually concerns the whole digital twin in the context of the extended enterprise. Training risk analysts in new modeling technologies is also a major issue. With now more than twenty years of experience in both academia and industry, the author knows perfectly well how hard it is to pull well-trained, experienced experts out of their comfort zone. Risk analysts are, so to say, conservative in essence: you must have very good reasons to change a solution that has worked so far. But reasons for a radical change are here. Here again, we can learn lessons from the historical development of computer science and software engineering: new programming paradigms have been progressively introduced in industry by new generations of engineers who learned them at university. Nowadays, students are not afraid to write computer code. On the contrary: to attract the best students, we should offer them state-of-the-art activities and competences. One of the author’s deepest convictions is that much more discrete mathematics (see e.g. reference [64] for an introduction) should be introduced in engineering curricula.

5 Conclusion

In this article, we discussed the impact of current technological transformations on probabilistic risk analyses. We advocated that two major changes in the probabilistic risk assessment process are foreseeable. First, in-book reliability data collected and organized by statisticians will be replaced by databases of degradation indicators obtained with machine learning techniques run by data scientists.


Second, classical modeling formalisms such as fault trees, event trees or reliability block diagrams will be replaced by modeling formalisms supporting the model-based approach, as exemplified by XFTA (S2ML+SBE) or AltaRica 3.0. The industrial deployment of such radical changes will take time and is by no means certain. However, there are solid scientific and technological arguments to support them. The author hopes that the present article will at least serve to open the discussion on the future of probabilistic risk analyses and will contribute to creating fruitful exchanges between academia and industry.

References 1. Zio, E.: Reliability engineering: old problems and new challenges. Reliab. Eng. Syst. Saf. 94, 125–141 (2009) 2. Zio, E., Aven, T. Industrial disasters: extreme events, extremely rare. some reflections on the treatment of uncertainties in the assessment of the associated risks. Process Safe. Environ. Prot. 91, 31–45 (2013). https://doi.org/10.1016/j.psep.2012.01.004 3. Aven, T., Baraldi, P., Flage, R. et al.: Uncertainty in Risk Assessment: The Representation and Treatment of Uncertainties by Probabilistic and Non-Probabilistic Methods. Chichester, West Sussex, United Kingdom: Wiley-Blackwell (2014). ISBN 978-1118489581 4. Aven, T.: The concept of antifragility and its implications for the practice of risk analysis. Risk Anal. 35(3), 476–483 (2015). https://doi.org/10.1111/risa.12279 5. Rasmussen, N.C.: Reactor Safety Study. An Assessment of Accident Risks in U.S. Commercial Nuclear Power Plants. U.S. Nuclear Regulatory Commission. Rockville, MD, USA. WASH 1400, NUREG-75/014 (1975) 6. Andrews, J.D., Moss, R.T.: Reliability and Risk Assessment (second edition). Materials Park, Ohio 44073-0002, USA: ASM International (2002). ISBN 978-0791801833 7. Kumamoto, H., Henley, E.J.: Probabilistic Risk Assessment and Management for Engineers and Scientists. Piscataway, N.J., USA: IEEE Press (1996). ISBN 978-0780360174 8. Rauzy, A., Haskins, C.: Foundations for model-based systems engineering and model-based safety assessment. J. Syst. Eng. (2018). Wiley Online Library. https://doi.org/10.1002/sys. 21469 9. Batteux, M., Prosvirnova, T., Rauzy, A.: From models of structures to structures of models. In: IEEE International Symposium on Systems Engineering (ISSE 2018). IEEE. Roma, Italy, October (2018). https://doi.org/10.1109/SysEng.2018.8544424 10. Brooks, F.: The Mythical Man-Month. Addison-Wesley, New York, NY, USA (1995). ISBN 0-201-83595-9 11. Rauzy, A.: Notes on computational uncertainties in probabilistic risk/safety assessment. Entropy (2018). MDPI. https://doi.org/10.3390/e20030162 12. Oreda Handbook—Offshore Reliability Data, Vols. 1 and 2, 6th edn. (2015) 13. Datta, S.: Emergence of Digital Twins. DSpace@MIT. https://dspace.mit.edu/handle/1721.1/ 104429 14. Lecun, Y.: L’apprentissage profond, Leçons inaugurales au Collège de France Fayard (2017. ISBN 978-2213701820 (in French) 15. Holt, J., Perry, S.: SysML for Systems Engineering: A Model-Based Approach. Institution of Engineering and Technology. Stevenage Herts, United Kingdom (2013). ISBN 978-1849196512 16. Yakymets, N., Munoz Julho, Y., Lanusse, A.: Sophia framework for model-based safety analysis. Actes du congrès Lambda-Mu 19 (actes électroniques). Institut pour la Maîtrise des Risques, Dijon, France (2014). ISBN 978-2-35147-037-4


17. Dugan, J.B., Bavuso, S.J., Boyd, M.A.: Dynamic fault-tree models for fault-tolerant computer systems. IEEE Trans. Reliab. 41(3), 363–377 (1992). https://doi.org/10.1109/24.159800 18. Bouissou, M., Bon, J.-L.: A new formalism that combines advantages of fault-trees and Markov models: boolean logic-driven Markov processes. Reliab. Eng. Syst. Safe. 82(2), 149–163 (2003). Elsevier. https://doi.org/10.1016/S0951-8320(03)00143-1 19. Lisnianski, A., Levitin, G.: Multi-State System Reliability. World Scientific. London, England (2003). ISBN 981-238-306-9 20. Papadopoulos, Y., Martin, M., Parker, D., Rüde, E., Hamann, R., Uhlig, A., Grätz, U., Liend, R.: An approach to optimization of fault tolerant architectures using HiP-HOPS. J. Eng. Fail. Anal. 18(2), 590–608 (2011). Elsevier Science. https://doi.org/10.1016/j.engfailanal.2010.09.025 21. Zaitseva, E., Levashenko, V.: Reliability analysis of multi-state system with application of multiple-valued logic. Int. J. Qual. Reliab. Manage. 34(6), 862–878 (2017). Emerald Publishing. https://doi.org/10.1108/IJQRM-06-2016-0081 22. Signoret, J.-P., Dutuit, Y., Cacheux, J.-P., Folleau, C., Collas, S., Thomas, P.: Make your Petri nets understandable: reliability block diagrams driven Petri nets. Reliab. Eng. Syst. Safe. 113, 61–75 (2013). Elsevier. doi:https://doi.org/10.1016/j.ress.2012.12.008 23. Bouissou, M., Bouhadana, H., Bannelier, M., Villatte, N.: Knowledge modeling and reliability processing: presentation of the FIGARO language and of associated tools. In: Proceedings of SAFECOMP’91, IFAC International Conference on Safety of Computer Control Systems, Lindeberg, J.F. (ed.). Pergamon Press, Trondheim, Norway, pp. 69–75 (1991). ISBN 0-08041697-7 24. Batteux, M., Prosvirnova, T., Rauzy, A.: AltaRica 3.0 in 10 modeling patterns. Int. J. Crit. Comput.-Based Syst. 9(1–2), 133–165 (2019). Inderscience Publishers. https://doi.org/10. 1504/IJCCBS.2019.098809 25. Rauzy, A.: Guarded transition systems: a new states/events formalism for reliability studies. J. Risk Reliab. 222(4), 495–505 (2008).Professional Engineering Publishing. https://doi.org/10. 1243/1748006XJRR177 26. Batteux, M., Prosvirnova, T., Rauzy, A.: AltaRica 3.0 assertions: the why and the wherefore. J. Risk Reliab. (2017). Professional Engineering Publishing. https://doi.org/10.1177/1748006X1 7728209 27. Abadi, M., Cardelli, L.: A Theory of Objects. Springer-Verlag, New-York (1998). ISBN 9780387947754 28. Noble, J., Taivalsaari, A., Moore, I.: Prototype-Based Programming: Concepts, Languages and Applications. Springer-Verlag, Berlin and Heidelberg (1999). ISBN 978-9814021258 29. Wirth, N.: Algorithms + Data Structures = Programs. Prentice-Hall, Upper Saddle River (1976). ISBN 978-0130224187 30. Rauzy, A.: Probabilistic Safety Analysis with XFTA. AltaRica Association, Les Essarts le Roi (2020). ISBN 978-82-692273-0-7 31. Rauzy, A., Yang, L.: Finite degradation structures. J. Appl. Log. IfCoLog J. Log. Appl. 6(7), 1471–1495 (2019). College Publications 32. Klee, H., Allen, R.: Simulation of Dynamic Systems with MATLAB and Simulink. CRC Press, Boca Raton (2011). ISBN 978-1439836736 33. Fritzson, P.: Principles of Object-Oriented Modeling and Simulation with Modelica 3.3: A Cyber-Physical Approach. Wiley-IEEE Press, Hoboken (2015). ISBN 978-1118859124 34. Voirin, J.-L.: Method and tools for constrained system architecting. In: Proceedings 18th Annual International Symposium of the International Council on Systems Engineering (INCOSE 2008). Curran Associates, Inc., pp. 
775–789, Utrecht, The Netherlands (2008). ISBN 978-1605604473 35. Batteux, M., Prosvirnova, T., Rauzy, A., Yang, L.: Reliability assessment of phased-mission systems with AltaRica 3.0. In: Proceedings of the 3rd International Conference on System Reliability and Safety (ICSRS), Barcelona, Spain, November 2018, pp. 400–407. IEEE. https:// doi.org/10.1109/ICSRS.2018.00072 36. Batteux, M., Prosvirnova, T., Rauzy, A.: Abstract Executions of Stochastic Discrete Event Systems (2020)


37. Prosvirnova, T., Rauzy, A.: Automated generation of minimal cutsets from AltaRica 3.0 models. Int. J. Crit. Comput. Based Syst. 6(1), 50–79 (2015). Inderscience Publishers. https://doi.org/ 10.1504/IJCCBS.2015.068852 38. Brameret, P.-A., Rauzy, A., Roussel, J.-M.: Automated generation of partial Markov chain from high level descriptions. Reliab. Eng. Syst. Safe. 139, 179–187 (2015). Elsevier. https:// doi.org/10.1016/j.ress.2015.02.009 39. Rauzy, A.: An experimental study on six algorithms to compute transient solutions of large Markov systems. Reliab. Eng. Syst. Safe. 86(1), 105–115 (2004). Elsevier 40. Zio, E.: The Monte Carlo Simulation Method for System Reliability and Risk Analysis. Springer, London (2013). ISBN 978-1-4471-4587-5 41. Fuhrmann, H.A.L.: On the Pragmatics of Graphical Modeling. Norderstedt, Germany (2011). ISBN 978-384480084 42. Rumbaugh, J., Jacobson, I., Booch, G.: The Unified Modeling Language Reference Manual. Addison Wesley, Boston (2005). ISBN 978-0321267979 43. Maier, M.W.: The Art of Systems Architecting. CRC Press, Boca Raton (2009) 44. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns—Elements of Reusable Object-Oriented Software. Addison-Wesley, Boston (1994). ISBN 978-0201633610 45. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM J. Comput. 8(3), 410–421 (1979) 46. Toda, S.: PP is as hard as the polynomial-time hierarchy. SIAM J. Comput. 20(5), 865–877 (1991) 47. Simon, H.: Models of Man: Social and Rational. Mathematical Essays on Rational Behavior in a Social Setting. Wiley, New York (1957) 48. Maier, M.W.: Architecting principles for systems-of-systems. Syst. Eng. Wiley Period. 1(4), 267–284 (1998). https://doi.org/10.1002/j.2334-5837.1996.tb02054.x 49. Kloul, L., Prosvirnova, T., Rauzy, A.: Modeling systems with mobile components: a comparison between AltaRica and PEPA nets. J. Risk Reliab.227(6), 599–613 (2013). Professional Engineering Publishing. https://doi.org/10.1177/1748006X13490497 50. Jensen, K.: Coloured Petri Nets. Springer-Verlag, Berlin (2014). ISBN ISBN-10: 364242581X. ISBN-13: 978-3642425813 51. Milner, R.: Communicating and Mobile Systems: The pi-Calculus. Cambridge University Press, Cambridge (1999). ISBN 978-0521658690 52. Railsback, S., Grimm, V.: Agent-Based and Individual-Based Modeling—A Practical Introduction. Princeton University Press, Princeton (2011). ISBN 978-0691136745 53. Esperza, J.: Decidability and Complexity of Petri Nets Problems—An introduction. Lectures on Petri Nets I: Basic Models, pp. 374–428. In: Reisig, W., Rozenberg, G. (eds.). Springer (1998). ISBN 3-540-65306-6 54. Stark, J.: Product Lifecycle Management: 21st Century Paradigm for Product Realisation, 2nd edn. Springer, London (2011). ISBN 978-0857295453 55. Datta, S.: Emergence of Digital Twins (2015). https://dspace.mit.edu/handle/1721.1/104429 56. Mainini, L., Maggiore, P.: Multidisciplinary integrated framework for the optimal design of a jet aircraft wing. Int. J. Aerosp. Eng. (2012). Hindawi Publishing Corporation. https://doi.org/ 10.1155/2012/750642 57. Ptolemaeus, C.: System Design, Modeling, and Simulation using Ptolemy II. Ptolemy.org (2014). ISBN 978-130442106. http://ptolemy.org/books/Systems 58. IEC: International IEC Standard IEC61508—Functional Safety of Electrical/Electronic/Programmable Safety-related Systems (E/E/PE, or E/E/PES). International Electrotechnical Commission, Geneva, Switzerland (2010). ISBN ISBN 978-2-88910-524-3 59. 
IEC: International IEC Standard IEC61511—Functional Safety—Safety Instrumented Systems for the Process Industry Sector. International Electrotechnical Commission, Geneva, Switzerland (2016). ISBN 978-2-8322-4752-5 60. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Conference Record of the Fourth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 238–252 (1977). ACM Press, New York, NY, USA


61. Legendre, A., Lanusse, A., Rauzy, A.: Directions towards supporting synergies between design and probabilistic safety assessment activities: illustration on a fire detection system embedded in a helicopter. In: Proceedings PSAM’13, IPSAM, Seoul, South-Korea (2016) 62. Batteux, M., Prosvirnova, T., Rauzy, A.: Model Synchronization: A Formal Framework for the Management of Heterogeneous Models. Model-Based Safety and Assessment. In: Papadopoulos, Y., Aslansefat, K., Katsaros, p., Bozzano, M. (eds.), pp. 157–172. Springer, Thessaloniki, Greece. ISBN 978-3-030-32871-9 63. Batteux, M., Choley, J.-Y., Mhenni, F., Prosvirnova, T., Rauzy, A.: Synchronization of system architecture and safety models: a proof of concept. In: Proceedings of the IEEE 2019 International Symposium on Systems Engineering (ISSE), IEEE, Edinburgh, Scotland (2019) 64. O’Regan, G.: Guide to Discrete Mathematics: An Accessible Introduction to the History, Theory, Logic and Applications. Springer, Cham, Switzerland (2016). ISBN ISBN 9783319445601

Development of Structured Arguments for Assurance Case

Vladimir Sklyar and Vyacheslav Kharchenko

Abstract The paper describes an approach to improve Assurance Case applicability through structured argumentation. We start from an approach based on a two-step argumentation, including a reasoning step and an evidential step, with structured text support. We then improve the existing method in the following respects: (1) a general algorithm for the development of the Assurance Case is proposed; (2) the relations between the argumentation graph and the templates of structured text are explicitly explained; (3) the structured text is supplied with clear templates. We implement a case study applying the obtained method to arguing functional safety compliance. A general conclusion is that this method makes the Assurance Case methodology more practical and understandable.

Keywords Assurance case · GSN · Safety case · Structured argument

1 Introduction

1.1 History and Concept of Assurance Case

For safety-critical and security-critical applications, we always need to argue or assert that some system is safe. Obviously, a number of criteria must be introduced for that. However, we need to determine how reliable our knowledge about the analyzed system is. Why can we trust this knowledge? What makes our arguments and reasoning credible? Having delved into such problems, one cannot do without philosophical disciplines such as ontology, epistemology and logic. The next step is to understand how we should justify or assess safety and security in a reasonable and logical way. Such an approach is based on the theory of argumentation.


The Assurance Case (AC) is a structured argument that some system has some properties we desire; that it is safe, or reliable, or secure against attack [1]. The British philosopher Stephen Toulmin gave a new impetus to the modern development of argumentation in the work entitled “The Uses of Argument”, published in 1958 [2]. Toulmin extended logical implicative inference with additional parameters and proposed to represent this operation in graphical form. Toulmin’s notation operates with the following entities: data (D), the initial data for analysis; claim (C), the purpose of the logical implication inference (If D So C); warrant (W), an additional argument; qualifier (Q), the degree of confidence in the result of the logical inference; and rebuttal (R), an additional counterargument. A small illustrative sketch of these entities is given at the end of this subsection. Argument maps were used to visualize reasoning before Toulmin, but it was he who most successfully generalized the structural model for the analysis and verification of arguments. Note that modern argument maps do not use Toulmin’s notation directly, preferring simpler forms. In the 1990s, researchers continued to seek new approaches to assessing safety. The idea seems obvious in hindsight: develop a special notation to justify that man-made objects and systems comply with requirements. Two British university teams took up the task: City, University of London, where the spin-off company Adelard was formed [3], and the University of York [4]. Today, Adelard and the University of York still occupy leading positions in the promotion of the AC. For the development of notations, the emphasis was placed on the logical reasoning that a property or component of the system meets the stated requirements. The works of Stephen Toulmin, which we have already considered, were chosen as the theoretical basis. As a humanities scholar, Toulmin hardly thought about technical systems; however, he went down in history, among other things, as the founder of the argumentation approach behind the AC. As a result, the University of York developed the Goal Structuring Notation (GSN) [5], while Adelard developed the Claim, Argument and Evidence (CAE) notation, as well as the software tool Adelard ASCE (Assurance and Safety Case Environment) [6]. Despite all its benefits and some successful applications, the AC is well known only in some restricted areas. Developing evidence to support compliance is a creative process that is highly human-driven. So, what is the most practical and realistic method for developing the AC? Some drawbacks are associated with the lack of argumentation techniques. One of the authors who has attempted to bridge this gap is John Rushby, who proposed a modified GSN approach to structured argument development [7]. In this paper, we adopt the structured argument approach as a basis and go further to make it more usable and practical.
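The sketch announced above is the following minimal Python record of Toulmin’s entities; it is an illustration only (not from the original paper), and the example content is hypothetical.

  # Hedged sketch: a toy record for a Toulmin-style argument (D, C, W, Q, R).
  from dataclasses import dataclass

  @dataclass
  class ToulminArgument:
      data: str        # D: initial data for the analysis
      claim: str       # C: the conclusion of the inference "If D So C"
      warrant: str     # W: an additional argument linking D to C
      qualifier: str   # Q: degree of confidence in the inference
      rebuttal: str    # R: an additional counterargument

  example = ToulminArgument(
      data="the software passed 1,000 hours of fault-injection testing",
      claim="the software is acceptably reliable for its intended use",
      warrant="the test profile is representative of operational use",
      qualifier="presumably",
      rebuttal="unless the operational environment differs from the test profile",
  )
  print(f"If '{example.data}', so, {example.qualifier}, '{example.claim}'.")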

1.2 Goal and Structure

The goal of this article is to improve the structured argument approach based on reasoning and evidential steps. To achieve this goal, the following sections are included in the chapter.


A literature review is performed in Sect. 2 of this paper. For this review, we consider such AC applications as attribute assessment, certification, assurance-based development, knowledge management and, finally, improvement of argumentation. After that, in Sect. 3, we discuss the main issues related to the development of the structured argument method. At the beginning, we consider an existing approach to the transformation of typical arguments into a structured argument form. This approach is based on a two-step argumentation including reasoning and evidential steps. We identify some points of improvement for this approach. Firstly, we build a hierarchy of requirements in relation to structured text, and develop typical templates for structured text supporting reasoning and evidential steps. After that, we propose a general algorithm for the structured argument method. Based on the theoretical results, we perform a case study in Sect. 4 of this paper. This study considers the implementation of the structured argument method for a part of the AC related to the requirement for documentation management from a functional safety management framework. General conclusions are discussed in Sect. 5. Future research directions in the structured argument domain should consider the formalization and automation of reasoning and evidential steps, as well as the introduction of Artificial Intelligence tools.

2 State of the Art

2.1 Assurance Case for Attributes Assessment

Some works in the 2000s broadened the concept of the AC to the higher level of system attributes. A research group from the Software Engineering Institute of Carnegie Mellon University (CMU/SEI) proposed this AC application. The report [8] discusses a Dependability Case for a communication system using GSN. This is only a terminological issue, because the approach is identical to the AC. The idea is that, if the only dependability or quality attribute of interest is safety, then the Dependability or Quality Case becomes a Safety Case. The same holds for the Security Case, which can be a particular case of a Dependability or Quality Case. This also entails a general concept of the AC, which can be an umbrella for different system attributes including dependability, quality, safety and security. Figure 1 shows the dependencies between levels of attributes and the associated cases. Thus, the Quality Case and the Dependability Case are particular cases of the Assurance Case, just as the Safety Case and the Security Case are particular cases of the Assurance Case, the Quality Case and the Dependability Case. It is worth noting that nowadays the terms “Quality Case” and “Dependability Case” are not widely used. There is no difference in methodology between the AC and any other kind of “case”. In addition, the Nimrod Review [9] recommended that Safety Cases should be renamed “Risk Cases”. However, the UK Ministry of Defence did not adopt this recommendation.


Fig. 1 Relations between cases of different attributes sets

The CMU/SEI also proposed the Survivability Analysis Framework (SAF), which is a structured view of people, process and technology, developed to help organizations characterize the complexity of multi-system and multi-organizational business processes [10]. By combining the SAF with a GSN-based AC, the strengths and gaps of the survivability of a business process can be described in a graphical and visually compelling form that management, architects, systems engineers, software engineers and users can share [11].

2.2 Assurance Case Based Certification

Certification activity is very close to licensing activity [12], so it is obvious that there are research efforts directed at the application of the AC for certification goals. Certification is a process whose purpose is to substantiate the compliance of critical software and systems with the applicable requirements. With the recommended processes that are intended to support certification, it is easy and clear for duty-holders to organize and plan activities and resources in the development lifecycle. The main idea is the integration of the AC regime with the existing regulation and practice in certification. For that, practical guidance will be required as to how to formulate arguments, appropriately select evidence and critically review the AC [13]. One of the first research works proposing to extract requirements from standards for AC building needs [14] was devoted to mapping the AC from three standards:

• ISO/IEC 15408 Information technology—Security techniques—Evaluation criteria for IT security (The Common Criteria);
• RTCA/DO-178 Software Considerations in Airborne Systems and Equipment Certification;
• ISO 14971 Medical devices—Application of risk management to medical devices.

The paper [5] is also devoted to providing argumentation on the basis of the Common Criteria (the standard ISO/IEC 15408).


The above provides the basis for other industry-specific research, for example for civil aircraft, which are not covered by AC requirements and methodology [15]. The paper [16] provides the results of the development of a so-called explicit “e78-1.6” Assurance Case, which is intended to capture what is required by the avionic standard RTCA DO-178. Therefore, the AC may help serve as a catalyst for prompting improved cooperation and mutual understanding between supporters of prescriptive standards and supporters of goal-based standards [17]. However, the decision to implement the AC for civil aircraft has not been made so far.

2.3 Assurance Based Development

The paper [18] presents Assurance Based Development (ABD), which is an approach to the simultaneous development of systems and their assurance argumentation, finally represented in the form of an AC. ABD ensures that the techniques and means selected to create a system support the correct evidence to justify the required confidence. ABD is based on two key concepts: firstly, engineering choices should be driven by the need to produce evidence for the assurance arguments, and, secondly, the argument should be used to document the rationale for believing that the system is fit for use (see Fig. 2). The safety contracts method is a modification of the ABD approach, since contracts are an approach to formalize the development of software [19]. The paper [20] proposes deriving contracts from fault trees. Such safety contracts guarantee to prevent or minimize the faulty state described by the node. Descriptions of specific safety contracts are implemented in the AC diagram as GSN components. Contract/evidence pairs are represented as C: ⟨A, G⟩; E, which can be read as follows: contract C, which under assumptions A offers guarantees G, is supported by evidence E. Another branch of ABD is the application of model-based development [21].

Fig. 2 A concept of assurance based development


The paper [22] is devoted to the development of software and the AC in parallel, following a model-based technique that combines formal modeling of the system, systematic code generation from the formal model, and measurement-based verification of timing behavior. The software is developed for an electronic medical device.

2.4 Assurance Case for Knowledge Management

Since the AC is a visualization method using a natural language, it is widely used to support knowledge management and other associated activities such as business management strategy, change and maintenance management, document management and even software test management. Some research efforts in Japan are directed at applying the AC methodology to business processes. Kobayashi et al. [23] proposed a method for confirming and evaluating the management strategy using the AC. The research [24] introduces a management vision model, a management strategy model, a business process model and an IT system model based on the AC, which respectively contribute to improving the feasibility of accomplishing the management vision and management strategy. The paper [25] considers the effectiveness and advantages of the AC as a framework for teaching information security. The AC has been used as one of the tools for students during educational project implementation to improve teaching efficiency. The paper [26] introduces the AC to improve the testing strategy for space mission-critical software. The key step that combines the methods is to extract from the AC the combinatorial test conditions needed to have confidence in the autonomous system, and to feed those conditions into the test suite. This provides an explicit and documented link between the AC and the test generator, which improves confidence and test efficiency. If the AC has to be changed during development, it is easier to update the parameters and re-generate the test cases, thereby reducing regression test costs. The Japanese Aerospace Exploration Agency (JAXA) implements the AC to manage testing activities, naming this framework the Independent Verification and Validation (IV&V) case [27]. JAXA identified the following range of IV&V needs: clear accountability for confidence in activities, guaranteeing the software quality as a whole, and showing traceability between software defects on orbit and operational risks. JAXA introduces GSN for the sharing and application of knowledge in the IV&V area. The obtained effects of the IV&V case include the improvement of the demand for and value of IV&V for stakeholders, as well as the maintenance of IV&V quality.

2.5 Improvement of Argumentation

A new wave of AC research appeared after critical notes made in the so-called Nimrod Report [28] published in 2009. It became clear that neither the philosophy literature nor other disciplines that use argument seem to offer a universal theory of knowledge that is applicable to safety arguments [29]. Normative models of informal argumentation do not offer clear guidance on when an argument should cite


evidence rather than appeal to a more detailed argument. Therefore, the improvement of argumentation stimulated a lot of papers devoted to this issue [30, 31], taking into account that there is no complete agreement on what kind of evidence could be sufficient [32]. The epistemology-based approach takes into account the study of the nature of knowledge, justification, and the rationality of belief ("What makes justified beliefs really justified?"). The paper [29] hypothesizes that recognition of a set of rules for what counts as sufficient evidence for a given kind of claim under given circumstances would provide developers, assessors, and regulators with a practical means to make justified decisions about how much detail an argument should have and whether an argument is sufficiently compelling. Eliminative induction was first suggested by Sir Francis Bacon for evaluating confidence in a claim. The idea is that confidence in a hypothesis (or claim) increases as reasons for doubting its truth are identified and eliminated (Baconian confidence). The paper [33] proposes to improve argumentation confidence by converting AC models between different notations. The method starts from argument-based cases (CAE or GSN), which are converted into a set of Toulmin model instances; then Hitchcock's evaluative criteria [34] for solo verbal reasoning are used to analyze and quantify the Toulmin model instances into a Bayesian Belief Network (BBN); by running the BBN, a quantified confidence for each claim of the AC is obtained. The paper [30] surveys how researchers have reasoned about uncertainty in assurance cases. The types of uncertainty are addressed, and qualitative and quantitative approaches are distinguished. The qualitative approach is covered by Baconian probability [32] and logical argumentation, as per [35]. The paper [35] introduces assured safety arguments. This structure explicitly separates the safety case argument into two components: a safety argument and an accompanying confidence argument. The safety argument is allowed to talk only in terms of the causal chain of risk reduction, and is not allowed to contain general 'confidence raising' arguments. Quantitative approaches use probability to define confidence. The paper [36] proposes that probability is the appropriate measure of uncertainty and, consequently, of confidence. The researchers explored how the confidence in judgments affects the overall judgment of a safety-related probability of failure on demand and illustrated this with an example of a Safety Integrity Level (SIL). In the paper [37] argument structures are presented for a formal probabilistic treatment of confidence with an implementation of the multi-legged approach. It answers questions such as "How much extra confidence about a system's safety will I have if I add a verification argument leg to an argument leg based upon statistical testing?" There is a simplified and idealized example of a safety system in which interest centers upon a claim about the probability of failure on demand. The approach is based on a BBN model of a two-legged argument, and manipulates this analytically via parameters that define its node probability tables.
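To make the quantitative idea concrete, the following is a minimal sketch of combining two argument legs into a single confidence value by plain Bayesian updating. It is not the BBN model of [36] or [37]; the prior, the per-leg probabilities and all names are invented for illustration, and the legs are assumed conditionally independent given the claim, which is exactly the kind of assumption a real BBN makes explicit in its node probability tables.

```python
# Toy illustration of combining two argument legs (e.g. statistical testing and
# formal verification) into a single confidence value via Bayes' rule.
# All probabilities below are invented for illustration only.

def posterior_confidence(prior_ok, legs):
    """Posterior probability that the claim holds, given that every leg 'passes'.

    prior_ok -- prior probability that the claim (e.g. 'pfd is acceptable') is true
    legs     -- list of (p_pass_if_ok, p_pass_if_not_ok) tuples, one per argument leg
    """
    like_ok, like_not_ok = 1.0, 1.0
    for p_pass_if_ok, p_pass_if_not_ok in legs:
        like_ok *= p_pass_if_ok          # probability the leg passes when the claim is true
        like_not_ok *= p_pass_if_not_ok  # probability the leg passes when the claim is false
    evidence = prior_ok * like_ok + (1.0 - prior_ok) * like_not_ok
    return prior_ok * like_ok / evidence

single_leg = posterior_confidence(0.5, [(0.95, 0.30)])
two_legs = posterior_confidence(0.5, [(0.95, 0.30), (0.90, 0.20)])
print(f"confidence with one leg:  {single_leg:.3f}")   # ~0.760
print(f"confidence with two legs: {two_legs:.3f}")     # ~0.934
```

Under these made-up numbers the second leg raises the confidence noticeably, which is the intuition behind the multi-legged approach discussed above.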


3 Development of the Structured Argumentation Method

3.1 Transformation of Typical Arguments in a Structured Argument Form

There are some shortcomings in the existing works, which are due to the lack of satisfactory practical argumentation techniques. Thus, in order to apply the AC methodology, it is necessary to select and improve appropriate mathematical and methodological approaches for structuring the argumentation. The argumentation in the AC corresponds to implication in logic, where the truth of the conclusion depends on the truth of the conditions. A logical rule involves a logical multiplication of the form SC1 AND SC2 AND … AND SCn IMPLIES C, where the SCi are subgoals, which can themselves be complex expressions. One of the few authors who have attempted to address this gap is John Rushby, who in his technical report [7] offers an approach to developing structured arguments based on a modified GSN [38, 39]. In this section we use and update this approach. The classical application of GSN (Fig. 3) is characterized by support of argumentation steps (AS) for any claim (C) with both subclaims (SC) and evidences (E). This approach has a drawback: it cannot always guarantee a regular and typical argument structure. We can observe that the same argumentation step is supported with both subclaims and evidences. This can entail mixing subclaims with evidences and breaking the argumentation workflow. A modification of argumentation steps is proposed in [22] to reduce them to a typical two-step structure (Fig. 3). The first step, called the reasoning step (RS), is an analysis of subclaims that are aimed at achieving the primary claim, but there is no recourse

Fig. 3 Transformation of a typical argument form to a structured argument form


to the evidence at that step. This reasoning has to elicit and extract all the subclaims from known sources. In the second step, called the evidential step (ES), the evidences supporting the subgoals formulated in the previous step are represented. These evidences have to show that all subgoals are met. Thus, the graph of the argumentation structure is transformed as shown in Fig. 3. This allows us to make a connection between the concept of safety and security (claim) and our knowledge of the physical world (evidence). To further formalize the RS and ES steps, it is suggested in [7] to use structured text. This approach is appropriate, but in our opinion it has a number of opportunities for improvement, such as the following: (a) there is no general algorithm for the development of the Assurance Case; (b) relations between the argumentation graph and the templates of structured text are not explicitly explained; (c) the structured text does not have clear templates.
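To see the two-step structure in executable form, here is a minimal sketch (our own illustration, not tooling from [7] or the GSN standard) of the rule SC1 AND SC2 AND … AND SCn IMPLIES C: a claim counts as supported only when every subclaim elicited in the reasoning step is itself backed by evidence attached in the evidential step. The class and field names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    text: str
    subclaims: List["Claim"] = field(default_factory=list)  # filled by the reasoning step
    evidence: List[str] = field(default_factory=list)       # filled by the evidential step

    def supported(self) -> bool:
        # Leaf subclaims must be backed by at least one piece of evidence;
        # a non-leaf claim holds only if ALL of its subclaims hold (logical AND).
        if self.subclaims:
            return all(sc.supported() for sc in self.subclaims)
        return bool(self.evidence)

# Reasoning step: decompose the claim into subclaims (no evidence used yet).
top = Claim("Documentation management complies with the identified requirements",
            subclaims=[Claim("Documents are available"),
                       Claim("Documents have version numbers")])

# Evidential step: attach evidence to the subclaims formulated above.
top.subclaims[0].evidence.append("Project repository structure")
top.subclaims[1].evidence.append("Documents version and change control")

print(top.supported())  # True once every subclaim has supporting evidence
```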

3.2 Argumentation Improvement: Hierarchy of Requirements and Templates of Structured Text

In addition, the development of the AC is in many ways a creative process, which greatly depends on the human factor. Below is an improvement of the approach described in [7], which, in our view, allows us to move further in structuring the arguments of the AC and to eliminate the above shortcomings. We demonstrate the possibility of explicitly combining the AC with structured text components. Let's present a hierarchy of requirements that creates the structure of the AC in the form of a pyramid. In most regulatory requirements for control systems, the structure of requirements includes three or four levels (Fig. 4).

[Figure: a pyramid with level 0 (meta-goal) at the top and levels 1-4 below it (goals, groups of requirements, composite requirements, separate requirements), with reasoning steps (RS), evidential steps (ES) and structured text (ST) attached to the transitions between levels.]

Fig. 4 Hierarchy of requirements to control systems and a relation of requirements with argumentation steps


The zero level is a meta-goal according to which the control system must meet all safety requirements. At the first level, global safety goals are achieved, for example, according to functional safety requirements:

• The safety and security management system shall achieve all safety objectives;
• A safety and security life cycle should be implemented during system development;
• A sufficient set of measures against random failures must be applied to the system;
• A sufficient set of measures against systematic and software failures, including cyberattack defense, must be applied to the system.

The requirements groups contain related requirements and support one or another of the global goals. For example, the requirements for safety and security management in IEC 61508 [40] include requirements for human resource management, configuration management, documentation management, and others. The structure of the links between the zero, first and second levels is a tree that is transparent enough and does not require detailed elaboration of the arguments, since these arguments are typical and well tested. However, structured arguments are required when moving from the second level to the lower levels. The requirements of the lower levels may be either composite (i.e. they include a number of separate requirements) or separate. If all requirements are separate, this level becomes the third one, and it is then directly related to the subgroups of requirements. Figure 4 combines the overall structure of the AC and the algorithm for constructing structured arguments. Such arguments should be developed for the second, third and fourth (if any) levels. An approach to the argument structure is introduced in Fig. 3. For the lowest level, besides the RS, the ES should also be applied. Since it is not appropriate to add detailed information about the content of the arguments to the graph structure, each node of the AC, starting with the second level, is marked with an argument description using so-called structured text (ST). Notice that the AC is not a strict tree, because the same evidence can support different arguments or subgoals. Let's develop a typical structured text configuration for the reasoning and evidential steps using the GSN components. The structured text has a template with a set of fields that are denoted by service words corresponding to the GSN components. We need to provide two templates, one for the RS and one for the ES (Figs. 5 and 6). In these templates, the names of the service words are given in bold, and italics provide a brief description of the content that should fill the template fields.

3.3 Algorithm for the Structured Argumentation Method

Based on the results obtained in the previous sections, we can draw a formalized algorithm for the structured argumentation method (Fig. 7). For that, we use the activity diagram notation of UML. The steps of the algorithm are related to the levels of the hierarchy of requirements represented in Fig. 4. The input data for the method


Reasoning Step
Context: Connection with the Assurance Case graph in relation to the higher and lower levels
Docs: Technical documents related to the arguments and evidences
Claim: Goal related to the argument
Subclaims: Subgoals demonstrating achievement of the goal (Claim)
Justification: Structure and content of the subgoals (Subclaims)
END Reasoning Step

Fig. 5 A template of structured text for a reasoning step

Evidential Step
Context: Connection with the Assurance Case graph in relation to the higher and lower levels
Docs: Technical documents related to the arguments and evidences
Claim: Subclaims from the Reasoning Step become the Claim
Evidence: Proofs which support achievement of the Claim
Justification: Structure and content of the Evidence
END Evidential Step

Fig. 6 A template of structured text for an evidential step
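A machine-readable rendering of these templates could look like the following sketch. The field names mirror the service words of Figs. 5 and 6, while the class name, the render format and the sample values are our own assumptions; the evidential step template would be analogous, with an Evidence field instead of Subclaims.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningStepST:
    context: str          # connection with the AC graph (higher and lower levels)
    docs: str             # technical documents related to arguments and evidences
    claim: str            # goal related to the argument
    subclaims: List[str]  # subgoals demonstrating achievement of the claim
    justification: str    # structure and content of the subclaims

    def render(self) -> str:
        lines = ["Reasoning Step",
                 f"Context: {self.context}",
                 f"Docs: {self.docs}",
                 f"Claim: {self.claim}"]
        lines += [f"Subclaim SC{i}: {sc}" for i, sc in enumerate(self.subclaims, start=1)]
        lines += [f"Justification: {self.justification}", "END Reasoning Step"]
        return "\n".join(lines)

rs = ReasoningStepST(
    context="AC Level 2 group 'Documentation Management' -> Level 3 requirements",
    docs="Documentation Management Plan",
    claim="Documentation Management complies with IEC 61508 requirements",
    subclaims=["Documents are available", "Documents have version numbers"],
    justification="Structure and content of the Documentation Management Plan")
print(rs.render())
```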

application include a database of standards applicable for the domain of the licensed system. The first step of the method application contains an analysis of the standards database. The expected result is a general set of requirements which supports the top level of global goals (GG) for safety and security. A typical set of GG for a safety-related application includes requirements for management, life cycle, protective measures and assessment. The GG can be represented in the form of a simple mind map. The next step is the decomposition of GG into groups of requirements (GR). It contains a top-down analysis of all requirements related to each specific GG. It is possible to use only one target standard as well as a set of standards specified in the requirements to the licensed system. The expected result has to contain sets of text fragments which cover the goals GG by GG. As a first step a separate GR can be represented in the form of a mind map. Later it can be transformed into GSN with the use of software tools. It is reasonable to draw the AC graph (GSN graph) for each separate group of requirements.


[Figure: a UML activity diagram with the following flow. Input: database of standards; analysis and choice of the applied requirements set, producing GG; decomposition of global goals into groups of requirements, GG → GR; decomposition of GR into composite and separate requirements, RS(GR → SG) with ST(GR → SG); decomposition of composite requirements into separate ones, RS(SGC → SGS) with ST(SGC → SGS); formulating the evidences supporting the subgoals, ES(SGS → E) with ST(SGS → E); output: the Assurance Case in the form of a GSN graph supported with ST.]

Fig. 7 An algorithm for application of the structured argumentation method

However, if any relations between subgoals or evidences of different GRs of one GG are discovered, then the AC graph should be built for the GG as a whole. The next step is the first RS, which decomposes a GR into SGs. For this step we use the ST template (see Fig. 4). An issue is that some SGs can be composite, so such SGs require further decomposition into separate SGs.
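The control flow of Fig. 7 can be sketched as a simple pipeline. The decomposition functions below are placeholders, since the real decomposition is an expert activity over the standards text, and all function names and the toy data are assumptions made for illustration only.

```python
# Skeleton of the structured argumentation pipeline of Fig. 7.
# Each step returns plain lists/dicts; a real implementation would attach
# structured-text (ST) records and build a GSN graph instead.

def choose_requirements(standards_db):
    """Analysis of the standards database -> global goals (GG)."""
    return ["management", "life cycle", "protective measures", "assessment"]

def decompose_gg_to_gr(gg):
    """GG -> groups of requirements (GR); placeholder decomposition."""
    return {"management": ["documentation management", "configuration management"]}.get(gg, [])

def reasoning_step(gr):
    """RS(GR -> SG): subgoals, each flagged as composite or separate."""
    return [("documents have sufficient quality", "composite"),
            ("documents are available", "separate")]

def decompose_composite(sg):
    """RS(SGC -> SGS): split a composite subgoal into separate ones."""
    return [f"{sg} / item {i}" for i in (1, 2)]

def evidential_step(sg):
    """ES(SGS -> E): name the evidence supporting a separate subgoal."""
    return f"evidence for '{sg}'"

assurance_case = {}
for gg in choose_requirements("standards database"):
    for gr in decompose_gg_to_gr(gg):
        for sg, kind in reasoning_step(gr):
            leaves = decompose_composite(sg) if kind == "composite" else [sg]
            for leaf in leaves:
                assurance_case.setdefault(gr, {})[leaf] = evidential_step(leaf)

print(assurance_case)
```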

4 Case Study: Application of the Structured Argumentation Method

Let's synchronize the AC with the hierarchy of requirements (Fig. 4). For this, we apply the obtained method (Fig. 7). The meta-goal (Level 0) is the compliance of some abstract system with all identified safety and security requirements. The goals of Level 1 correspond to the main parts of safety and security issues, such as concept and functions, standards and regulations, system architecture, etc. In this paper we consider the Level 1 goal related to safety and security management and assessment. The transition from the Level 1 goal to the Level 2 groups of requirements contains an analysis of existing requirements for safety and security management and assessment,


such as human resource management, configuration management, software tools selection and evaluation, etc. Let's consider documentation management at Level 2. The goal is that documentation management complies with all identified requirements. The transition from the Level 2 to the Level 3 requirements contains the RS, which is based on an analysis of the IEC 61508 requirements for documentation management. Such requirements are contained in IEC 61508, Part 1 "General requirements", Sect. 5 "Documentation" [40]. This RS transforms the text of IEC 61508 into a set of subclaims related to the Level 2 claim (documentation management complies with all identified requirements). Also, during the identification and analysis of the subclaims we shall identify composite requirements, for which we need one more level in order to obtain separate requirements from composite ones, so additional argumentation steps will be performed for the transition from Level 3 to Level 4. Figure 8 represents the RS for Level 2 and demonstrates that most of the subclaim requirements are separate, so the next step for them is the ES. The exceptions are SC6 and SC10, which are composite requirements; for them we need one more RS to transition from Level 3 to the separate requirements of Level 4 (Fig. 8).

Reasoning Step (Documentation Management)
Context: Connection between the group of Documentation Management requirements of the Assurance Case Level 2 and the composite and separate requirements of Level 3
Docs: Documentation Management Plan
Claim: Documentation Management complies with IEC 61508 requirements
Subclaim SC1 (IEC 61508-1, 5.2.1), SEPARATE: Documentation supports all phases of the safety life cycle
Subclaim SC2 (IEC 61508-1, 5.2.2), SEPARATE: Documentation supports functional safety management
Subclaim SC3 (IEC 61508-1, 5.2.3), SEPARATE: Documentation supports functional safety assessment
Subclaim SC4 (IEC 61508-1, 5.2.4), SEPARATE: Documentation complies with standards
Subclaim SC5 (IEC 61508-1, 5.2.5), SEPARATE: Documents are available
Subclaim SC6 (IEC 61508-1, 5.2.6a,…,d), COMPOSITE: Documents have sufficient quality
Subclaim SC7 (IEC 61508-1, 5.2.7), SEPARATE: Documents have title and content
Subclaim SC8 (IEC 61508-1, 5.2.8), SEPARATE: Documents comply with procedures and practices
Subclaim SC9 (IEC 61508-1, 5.2.9), SEPARATE: Documents have version numbers
Subclaim SC10 (IEC 61508-1, 5.2.10a,b), COMPOSITE: Documents have a structure that supports search; the last version of documents can be identified
Subclaim SC11 (IEC 61508-1, 5.2.11), SEPARATE: A document control system is implemented
Justification: Structure and content of the Documentation Management Plan
END Reasoning Step

Fig. 8 Structured text for the reasoning step of Level 2


Evidential Step ES1,…,ES11
Context: Connection with the subclaims of Level 3 and Level 4
Docs: Documentation Management Plan; Project Repository
Claim: SC1,…, SC11
Evidence E1: Strategy of documentation for functional safety
Evidence E2: Documents access rights
Evidence E3: Documents preparation, review and approval
Evidence E4: Documents list and responsibilities
Evidence E5: Documents format and templates
Evidence E6: Documents version and change control
Evidence E7: Project repository structure
Evidence E8: Document control system
Justification: Structure and content of E1,…,E8
END Evidential Step

Fig. 9 Structured text for the evidential step

Further analysis of point 5.2.6 of IEC 61508-1 shows that it contains a list of four additional requirements. All these requirements are related to the quality of documents, so they can be covered by the same ES. The same situation holds for point 5.2.10 of IEC 61508-1. This case does not affect the structured argument form. We propose an additional operation of convolution for the framework of structured argumentation. We can apply the convolution if the separate requirements related to one composite requirement are supported by the same evidential step. The convolution also entails a simplification of the Assurance Case graph in the transition between Level 3 and Level 4. At Level 4 we have six more separate SCs (four plus two), so no more decomposition is needed. The next step is the application of the ES as per the developed template. The results of the ES implementation are given in Fig. 9. Figure 10 represents the Assurance Case graph for the documentation management process. The claim is to organize documentation management in compliance with the standards' requirements. In the reasoning step (RS) we extracted all subclaims (SC) from the relevant IEC 61508 requirements, eleven subclaims in total. Two subclaims (SC6 and SC10) are composite and the remaining subclaims are separate. The composite subclaims are supported by the same evidences, which simplifies the graph structure, since we can hide the details of the composite subclaims without losing information. Some goals are supported by multiple evidences, and some evidences support multiple subclaims.
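The convolution operation, as we read it here, can be expressed as a small grouping function: separate subclaims derived from one composite subclaim are merged back into a single node whenever they are supported by exactly the same evidence. The sketch below is our own illustration; the function name and the toy data (which mirror SC6 only loosely) are assumptions.

```python
from collections import defaultdict

def convolve(subclaims):
    """Merge separate subclaims of the same composite parent that share identical evidence.

    subclaims -- list of (parent, name, evidence_set) tuples; parent is None for
                 subclaims that were separate from the start.
    Returns a list of (node_name, evidence_set) pairs for the simplified graph.
    """
    groups = defaultdict(list)
    for parent, name, evidence in subclaims:
        groups[(parent, frozenset(evidence))].append(name)

    merged = []
    for (parent, evidence), names in groups.items():
        if parent is not None and len(names) > 1:
            merged.append((parent, set(evidence)))       # collapse back into the composite node
        else:
            merged.extend((name, set(evidence)) for name in names)
    return merged

# SC6 (composite) was split into four separate requirements, all covered by the
# same evidence, so convolution collapses them back into a single SC6 node.
subclaims = [
    ("SC6", "SC6.a", {"E5"}), ("SC6", "SC6.b", {"E5"}),
    ("SC6", "SC6.c", {"E5"}), ("SC6", "SC6.d", {"E5"}),
    (None,  "SC5",   {"E2"}),
]
print(convolve(subclaims))  # [('SC6', {'E5'}), ('SC5', {'E2'})]
```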


[Figure: a GSN graph in which the claim C: DocMan is connected through a reasoning step RS to the subclaims SC1–SC11; each subclaim is linked to its evidential step ES1–ES11, and the evidential steps are supported by the evidences E1–E8, with some evidences supporting several subclaims.]

Fig. 10 GSN graph for the Assurance case based on structured argumentation

5 Conclusion

An analysis of existing approaches to the development of the Assurance Case has been conducted. Existing works have some drawbacks due to the lack of satisfactory practical argumentation techniques. One of the few authors who attempted to address this gap is John Rushby, who in his technical report [7] offers an approach to developing structured arguments based on a modified GSN and structured text. In this paper, we use and develop this approach. Thus, in order to apply the Assurance Case methodology, a mathematical and methodological apparatus for structuring the argumentation was selected and improved. We obtained a structured argumentation method including the following: the overall algorithm of Assurance Case development; the proposed structure of the Assurance Case graph, which is based on the typical structure of the arguments and is developed in connection with the structured text describing these arguments; and improved structured text templates for the description of arguments. The obtained method can be used as the basis of an appropriate argumentation framework supported by a set of formal operations performed on the Assurance Case graph and the supporting structured text. We applied the proposed structured argumentation method to the group of requirements related to documentation management. As a result, we obtained a template with the Assurance Case graph and structured text for the typical reasoning and evidential steps. The obtained practical and theoretical results may be used for different kinds of safety and security critical systems in different applications. For example, in our research we focus on Assurance Case applications for


Nuclear Power Plant Instrumentation and Control Systems [41], Unmanned Aircraft Vehicles [42], and accident monitoring systems [43].

References 1. Alexander, R., Hawkins, R., Kelly, T.: Security Assurance Cases: Motivation and the State of the Art. High Integrity Systems Engineering Department of Computer Science University of York (2011) 2. Toulmin, S.: The Uses of Argument. Cambridge University Press (1958) 3. Bishop, P., Bloomfield, R.: A methodology for safety case development. Safe. Reliab. 20(1), 34–42 (2000) 4. Kelly, T.: Arguing safety: a systematic approach to managing safety cases. PhD Thesis, University of York (1999) 5. Hawkins, R., Habli, I., Kelly, T., McDermid, J.: Assurance cases and prescriptive software safety certification: a comparative study. Safe. Sci. 59, 55–71 (2013) 6. Bloomfield, R., Bishop, P.: Safety and assurance cases: past, present and possible future—an Adelard perspective. In: Making Systems Safer, pp. 51–67. Springer (2010) 7. Rushby, J.: The interpretation and evaluation of assurance cases. Technical Report SRI-CSL15-01, SRI International (2015) 8. Weinstock, C., Goodenough, J., Hudak, J.: Dependability cases. CMU/SEI-2004-TN-016. Technical Report SEI/CMU (2004) 9. Haddon-Cave, C.: The Nimrod review. An independent review into the broader issues surrounding the loss of the RAF Nimrod MR2 Aircraft XV230 in Afghanistan in 2006, Crown Copyright (2009) 10. Ellison, R., Goodenough, J., Weinstock, C., Woody, C.: Survivability assurance for system of systems. Technical Report CMU/SEI-2008-TR-008, CMU/SEI (2008) 11. Sklyar, V., Kharchenko, V.: Green assurance case: applications for internet of things. In: Green IT Engineering: Social, Business and Industrial Applications, pp. 351–371. Springer (2019) 12. Evidence: Using safety cases in industry and healthcare. Health Foundation, London, UK (2012) 13. Sun, L., Zhang, W., Kelly, T.: Do safety cases have a role in aircraft certification? Proc. Eng. 17, 358–368 (2011) 14. Ankrum, T., Kromholz, A.: Structured assurance cases: three common standards. In: Proceedings of the 9th IEEE International Symposium on High-Assurance Systems Engineering.— Heidelberg, Germany, pp. 99–108, October 12–14, 2005 15. Graydon, P., Knight, J., Green, M.: Certification and safety cases. In: Proceedings of the 28th International Systems Safety Conference, Minneapolis, MN, USA, August 30–3 September 04, 2010 16. Holloway, M.: Explicate ‘78: uncovering the implicit assurance case in DO-178C. In: Proceedings of the 23rd Safety-Critical Systems Symposium, Bristol, UK, February 3–5, 2015 17. Sklyar, V., Kharchenko, V.: Assurance case driven design based on the harmonized framework of safety and security requirements. In: Proceedings of the 13th International Conference on ICT in Education, Research and Industrial Applications (ICTERI 2017), May 15–18, 2017 18. Graydon, P., Knight, J.: Assurance based development. Technical Report CS-2009-10, University of Virginia (2009) 19. Sljivo, I., Gallina, B., Carlson, J., Hansson, H.: Generation of safety case argument-fragments from safety contracts. In: Computer Safety, Reliability, and Security, pp. 170–185. Springer (2014) 20. Sljivo, I., Jaradat, O., Bate, I., Graydon, P.: Deriving safety contracts to support architecture design of safety critical systems. In: Proceedings of the 2015 IEEE 16th International Symposium on High Assurance Systems Engineering, Washington DC, USA, pp. 126–133, January 08–10, 2015


21. Wei, R., Kelly, T., Dai, X., Zhao, S., Hawkins, R.: Model based system assurance using the structured assurance case metamodel. J. Syst. Softw. 154, 211–233 (2019) 22. Jee, E., Lee, I., Sokolsky, O.: Assurance cases in model-driven development of the pacemaker software. In: Leveraging Applications of Formal Methods, Verification, and Validation, pp. 343–356. Springer (2010) 23. Kobayashi, N., Nakamoto, A., Kawase, N., Sussan, F., Shirasaka, S.: What model(s) of assurance cases will increase the feasibility of accomplishing both vision and strategy? Rev. Integr. Bus. Econ. Res. 7(2), 1–17 (2018) 24. Chowdhury, T., Wassyng, A., Paige, R., Lawford, M.: Systematic evaluation of (safety) assurance cases. In: Proceedings of International Conference on Computer Safety, Reliability, and Security, pp. 18–33. Springer (2020) 25. Gallo, R., Dahab, R.: Assurance cases as a didactic tool for information security. In: Information Security Education Across the Curriculum, pp. 15–26. Springer (2015) 26. Smith, B., Feather, M., Huntsberger, T.: A hybrid method of assurance cases and testing for improved confidence in autonomous space systems. In: Proceedings of the AIAA SciTech 2018 Forum, pp. 1566–1577, Kissimmee, FL, USA, January 8–12, 2018 27. Kakimoto, K., Sasaki, K., Umeda, H., Ueda, Y.: IV&V case: empirical study of software independent verification and validation based on safety case. In: Proceedings of the 2017 IEEE International Symposium on Software Reliability Engineering Workshops, pp. 32–35, Toulouse, France, October 23–26, 2017 28. Haddon-Cave, C.: The Nimrod review. An independent review into the broader issues surrounding the loss of the RAF Nimrod MR2 Aircraft XV230 in Afghanistan in 2006, London, UK, Crown Copyright, 585 p (2009) 29. Graydon, P., Holloway, C.: "Evidence" under a magnifying glass: thoughts on safety argument epistemology. In: Proceedings of the 10th IET System Safety and Cyber-Security Conference, Bristol, UK, October 21–22, 2015 30. Duan, L., Rayadurgam, S., Heimdahl, M., Ayoub, A., Sokolsky, O., Lee, I.: Reasoning about confidence and uncertainty in assurance cases: a survey. In: Huhn, M., Williams, L. (eds.) Software Engineering in Health Care, vol. 32, pp. 64–80. Springer (2017) 31. Sklyar, V., Kharchenko, V.: Structured argumentation for assurance case of monitoring system based on UAVs. In: Proceedings of the 11th IEEE International Conference on Dependable Systems, Services and Technologies (DESSERT 2020), pp. 40–46, Kyiv (2020) 32. Goodenough, J., Weinstock, C., Klein, A.: Eliminative argumentation: a basis for arguing confidence in system properties. Technical Report CMU/SEI-2015-TR-005, CMU/SEI, Pittsburgh, PA, USA, February 2015, 71 p 33. Zhao, X., Zhang, D., Lu, M., Zeng, F.: A new approach to assessment of confidence in assurance cases. In: Ortmeier, F., Daniel, P. (eds.) Computer Safety, Reliability and Security, pp. 79–91. Springer (2012) 34. Hitchcock, D.: Good reasoning on the Toulmin model. Argumentation 19(3), 373–391 (2005) 35. Hawkins, R., Kelly, T., Knight, J., Graydon, P.: A new approach to creating clear safety arguments. In: Proceedings of the 19th Safety Critical Systems Symposium, pp. 3–23, Southampton, UK, February 8–10, 2011 36. Bloomfield, R., Littlewood, B., Wright, D.: Confidence: its role in dependability cases for risk assessment. In: Proceedings of the International Conference on Dependable Systems and Networks, pp. 338–346, Edinburgh, UK, June 25–28, 2007 37. Littlewood, B., Wright, D.: The use of multi-legged arguments to increase confidence in safety claims for software-based systems: a study based on a BBN analysis of an idealized example. IEEE Trans. Softw. Eng. 33(5), 347–365 (2007) 38. GSN community standard, version 1. Goal Structuring Notation Working Group (2011) 39. Structured Assurance Case Metamodel, v2.0. Object Management Group (2016) 40. IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (in 7 parts). International Electrotechnical Commission (2010) 41. Sklyar, V., Kharchenko, V.: Assurance case for I&C systems safety and security assessment. In: Yastrebenetsky, M., Kharchenko, V. (eds.) Cyber Security and Safety of Nuclear Power Plant Instrumentation and Control Systems, pp. 302–322. IGI Global (2020)


42. Sklyar, V., Kharchenko, V.: Assurance case based licensing for nuclear power plant postaccident monitoring system based on unmanned aircraft vehicles. In: Proceedings of the 10th International Conference on Dependable Systems, Services and Technologies, Leeds, pp. 186– 192, UK, June 5–7, 2019 43. Sklyar, V., Kharchenko, V.: Structured argumentation for assurance case of monitoring system based on UAVs. In: Proceedings of the 2020 IEEE 11th International Conference on Dependable Systems Services and Technologies, pp. 40–46, Kyiv, Ukraine, May 14–17, 2020

Making Reliability Engineering and Computational Intelligence Solutions SMARTER

Coen van Gulijk

Abstract This work puts Computational Intelligence in Reliability Engineering in perspective within the larger framework of digitalization and business. The approach is to consider RECI solutions as business solutions in a wider enterprise architecture environment. Using that as a starting point, four key components are discussed for the successful implementation of RECI solutions in a wider digital business ecosystem. In addition, a simple but effective project-design approach, SMARTER, aims to aid researchers in designing RECI projects that are viable in a wider ecosystem.

Keywords Reliability engineering · Computational intelligence · Enterprise architecture · Software solutions · Project design · SMARTER

1 Introduction

This volume is dedicated to making progress in merging Reliability Engineering and Computational Intelligence into a single discipline: RECI. Previous work in this space, and indeed the papers in this volume, show that merging reliability engineering, fundamentally a mathematical discipline, and computational intelligence, fundamentally a computer science domain, leads to synergetic solutions. The synergy is mostly found in the implementation of advanced reliability mathematics in efficient computational environments. But make no mistake, the IT transformation of systems for reliability engineering is a formidable challenge. Both reliability engineering and computational intelligence have research histories going back at least five decades and have developed into


highly specialized scientific domains. This paper provides some sense-making on how computational intelligence in reliability engineering fits into the larger world of digitalization, what key domains are relevant for successful implementation and how to make RECI projects attractive. This paper can only scratch the surface of the intricacies of combining two disciplines, but the framework helps researchers understand their contribution within the larger efforts of digitalization of business and frame their work within the larger business scope. This work captures the key points from a discussion at the RECI workshop and the subsequent discussion about embedding RECI solutions in business environments. The work combines basic elements of ICT business and project design and links them with the enterprise architecture ecosystem. The work re-uses some elements of earlier papers and recasts them into the business framework for RECI.

2 Basic Building Blocks of Effective RECI Solutions

RECI solutions consist of many different methods and techniques that take advantage of data-linkage for technological and non-technological solutions to support business processes. Working with huge amounts of data seems to be one of the most important indicators, but dealing with a variety of data-sources quickly is equally important. For the purpose of this work it is easier to think of a RECI solution as a business support solution that contributes to the success of organizations operating complex technological systems, or that contributes to the supply chain for such a system. To understand how the linkage between scientific progress on RECI and business interests is made, this work takes a step back from the detailed computing and mathematical algorithms to break RECI down into basic building blocks, how they fit together and why they are important for the integration of RECI solutions in business. The key building blocks in the development of RECI solutions are adopted from our earlier work [1]: enterprise architecture, algorithms & data, ontology and visualization. Figure 1 visualizes them in a simple diagram. Algorithms and data are necessarily in the middle of the triangle: they are the RECI solutions themselves. RECI applications combine data with advanced mathematical models, so algorithms are necessarily the centre building block. Without value-adding algorithms there cannot be RECI delivery. Section 2.2 describes how such systems add value. Enterprise architecture represents the wider business ecosystem; it is a key building block for RECI solutions. As the enterprise architecture provides the framework and boundaries for functional RECI solutions, it is the pinnacle of the triangle in Fig. 1. Ontology and visualization offer conceptual (and possibly digital) linkage and options for human inspection, respectively. Though there is less attention for these subjects in this work, they will be relevant for implementation in business systems. This work concentrates on the embedding of RECI solutions in business environments; therefore, the attention is mainly on enterprise architecture and SMARTER project design.


Fig. 1 Building blocks for RECI (adapted from [1])

2.1 RECI as Part of EA

Enterprise architecture (EA) addresses RECI solutions as tools that add to the wider business environment in which they are embedded. EA comprises the joint development of an organization together with its IT backbone. It collates organizational tasks within one or more organizations that work toward a common set of goals or aims; architectures map out how the components fit together and where the interfaces between them are. Together they support a business capability or build a service to society [2]. Enterprise Architecture was introduced to structure the design process for such systems and to design the software to support them. The main feature is the concurrent design of an organization and its IT backbone. The Zachman enterprise architecture is the most enduring framework. It was developed by John Zachman at IBM as early as 1980. Zachman describes an enterprise in a two-dimensional matrix, based on the intersection of six communication questions (what, where, when, why, who and how) with rows according to viewpoints from different stakeholders (transformation planner, business owner, system designer, system builder, and subcontractor) (see Zachman International [3]). Where the Zachman approach leads to a clarification of objectives, TOGAF formalizes enterprise architecture as a process. Figure 2 shows the process (after [2]). It is beyond the scope of this work to discuss an enterprise architecture approach in depth; TOGAF offers an excellent description of how that works. Another useful source could be the standard ISO/IEC/IEEE 42010 [4]. RECI researchers would do well to read into the EA literature to get a clearer understanding of how and where their solutions add value in the larger scale of business. A product from an EA exercise that stands out is a high-level schematic that indicates different parts of the organization and how they are linked together. Such a


Fig. 2 Enterprise architecture process (adapted after [2])

schematic represents the design of the organizational model and so offers insight into what information is required from a business process and what information should be linked to the next business process. Figure 3 shows an example of an enterprise architecture map for the GB railways [5]. The graphical model shows clusters of business processes focusing on general management, planning, rolling stock (trains), infrastructure management, traffic management, revenue and safety. The whole representation captures over two hundred interlinked business processes that

Fig. 3 Railway functional architecture (as made available through [5])


come together to achieve a consolidated business goal: efficient rail transport in Great Britain. Figure 4 zooms in on part of railway infrastructure management: track management. Even if this is still a relatively high level of granularity, there are 14 business processes that are performed all over the GB railways. The individual business processes are significant efforts; track renewal, rail profile checking, and ballast condition monitoring are structural maintenance processes which are associated with huge investments and hundreds of tasks. It would not be difficult to imagine that each of these business processes depends on ICT support systems and that there are dedicated digital monitoring systems for track monitoring for which individual reliability engineering models were designed. In fact, a whole community of railway engineers is working on track maintenance monitoring and track monitoring, publishing scientific papers at least weekly.

Fig. 4 Track maintenance in architecture (made available through [5])


Summarizing, this section demonstrates how reliability engineering research is part of a much larger ICT transformation process. The work of reliability engineers is always part of a much larger business process where ICT systems work together. From that top-down perspective, advances in reliability engineering may be viewed as a small part of the digital transformation. Understanding the wider scope of the digital transformation and where RECI solutions fit makes it easier to design RECI solutions successfully.

2.2 Viable RECI Algorithms for EA

When we consider RECI solutions as parts of an EA, it makes sense to consider what they have to deliver. Rather than delving into the intricacies of IT or mathematics (this volume has 16 papers addressing that issue), it makes sense to consider what functions RECI solutions have to deliver to become viable in the EA ecosystem. From a business perspective, all RECI solutions deliver about the same: they connect advanced mathematical methods to data, with or without using intelligent analytics solutions, to add value to the system. Value, in this context, usually means cost savings by maintaining quality levels for less cost (e.g. fewer rail replacements), improving the service of the system (more trains per hour), or forecasting (predicting end-of-service-life for wheel-sets). Van Gulijk et al. [1] describe five characteristics of viable digital support systems, of which RECI systems are one kind, that tell whether such a system adds value to the enterprise as a whole. Business-viable RECI systems are enterprise systems that:

• extract information from one or more data sources;
• process the data quickly to infer and present relevant reliability information;
• use one or more software applications to collectively provide sensible interpretation; and
• use online interfaces to connect the right people at the right time

in order to:

• provide decision support for maintenance management.

Any RECI system that succeeds in addressing these points in line with a business demand from the enterprise architecture, AND does so in a cost-effective manner, can be considered a viable RECI system. Note that the level of digitalization in an EA system depends on networks of technical and non-technical enablers such as the Internet, clouds, 5G, ultrafast internet and cyber-security. These technologies provide the technical backbone for organizations. The more mature that system is, the less investment is required to install computational intelligence. So there is some dependency on the local IT infrastructure for the actual implementation of RECI systems.


2.3 Ontology

Enterprise Architectures are sprawling networks and numerous interactions exist between subsystems. As the EA model for the GB railway demonstrates, each subsystem can be managed simultaneously by different organizations (e.g. infrastructure managers, operators, manufacturers or maintainers). Each organization has its own structure (e.g. a financial area, a health and safety area, a resources area or a sales area) formed by people of different expertise, skills and competences that change over time. Thus, a rich tapestry of different organizational knowledge co-exists, and the organizations don't necessarily draw from the same jargon. As an example, a person on a train may be a passenger for a railway operating company, a ticket buyer from an economist's perspective and, equally, an individual at risk on a moving train for a safety expert (even if they didn't pay for their ticket). The same individual carries different labels, but when data has to be exchanged there may not be a mutual understanding between different players in the EA network. Knowledge Management (KM) is the discipline that deals with interoperability of knowledge in different organizational contexts [6]. KM is specifically oriented toward capturing and storing organizational knowledge into a format that is useful for humans and computers alike. Capturing the interactions between parts of the railways is one of the greatest challenges in data-collaboration. The objective is to understand decision-making processes for risk analysis based on relevant reliability engineering knowledge flowing from the RECI system into the EA system. Another, more popular term for knowledge management is ontology. Ontology building is a common technique used in Computer Science to represent a common framework of understanding. An ontology is the systematic classification of domain knowledge that supports the use of different databases in a meaningful way: it can be compared to a search engine which holds the right search keys to produce results that are relevant to the human operator. The search keys are based on a repository of concepts and words that represent the knowledge structure of a specific domain. The concepts are the ways in which the components within the domain combine and interact to create the emergent behaviour of the overall system. Depending on the type of knowledge to represent, different levels of complexity of ontology can be required. Figure 5 shows a spectrum of ontologies depending on their complexity and descriptive power in computer science [7]. To represent a knowledge domain, diverse lightweight ontologies such as vocabularies (on the left-hand side of Fig. 5) can be created and linked to form heavyweight ontologies (on the right-hand side of Fig. 5). Heavyweight ontologies tend to be very abstract, which is desirable from a scientific point of view, but they tend to be unwieldy for practical computer solutions. An embryonic ontology for railway safety and risk is suggested by Figueres-Esteban et al. [8]. It attempts to establish a common semantic framework among the current safety organizational knowledge to select data from different systems and enable data analysis. It means that it is not necessary to change the knowledge of each


Fig. 5 Different types of ontologies depending on their complexity (after [7])

organization, just to describe and match it. That means that ontologies have to provide the semantics to obtain and use relevant data for railway safety.
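As a toy illustration of what 'describe and match' can mean in practice, the following sketch (our own, not the ontology of [8]) maps organization-specific labels onto a shared concept so that records from different systems can be related without changing either organization's own terminology; the vocabulary and record fields are invented for the example.

```python
# Lightweight ontology: map local jargon from different organizations onto shared concepts.
SHARED_CONCEPTS = {
    "passenger": "person_on_train",           # railway operating company
    "ticket buyer": "person_on_train",        # economist / revenue view
    "individual at risk": "person_on_train",  # safety expert
}

def to_shared(term: str) -> str:
    """Translate an organization-specific label into the shared concept."""
    return SHARED_CONCEPTS.get(term.lower(), term.lower())

operator_record = {"label": "Passenger", "count": 412}
safety_record = {"label": "Individual at risk", "exposure_minutes": 35}

# Because both labels resolve to the same concept, the two data sets can be joined
# without changing either organization's own terminology.
if to_shared(operator_record["label"]) == to_shared(safety_record["label"]):
    print("records describe the same concept:", to_shared(operator_record["label"]))
```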

2.4 Visualisation

Since visualisation has been developed from different disciplines (e.g. computer science, engineering, psychology or management sciences), its definition has been expressed in terms that might indicate different levels of abstraction and understanding depending on the discipline, generating conflicts and inconsistencies [9]. Although we can find references related to the understanding and insight of data by means of visual perception to support the cognitive process in data analysis [10], visualisation has historically been divided into information visualisation and scientific visualisation. Scientific visualisation focuses on visual techniques to depict scientific and spatial data, whilst information visualisation focuses on abstract and non-spatial data. Visual analytics is a different proposition altogether: it focuses on interactive visual tools to analyse large datasets. The term 'visual analytics' arose around 2005, being defined as a combination of "…automated analysis techniques with interactive visualizations for an effective understanding, reasoning and decision making on the basis of very large and complex data sets." [11]. VA is a multidisciplinary area that attempts to obtain insight from massive, incomplete, inconsistent and conflicting data in order to support data analysis, but it usually requires human judgment [12]. Visual analytics, as a discipline, aims to support complex analysis, of which RECI is an example. The visual aspects are important for human oversight; through dash-boarding and visual monitoring, humans can assess whether the processing proceeds as expected and whether the algorithms produce sensible outcomes. It also offers visual pointers for human intervention, for instance prompting a safety manager to decide between


Fig. 6 Main areas involved in visual analytics (after [1])

options. Visual Analytics has yielded five pillars where visualisation can support data analysis tasks, viz. data management; data analysis/mining; risk communication; human–computer interaction; and information/scientific visualisation (Fig. 6). In relation to enterprise architecture, human–computer interaction is the most important one, since it provides the basis for a business support system.

3 SMARTER Projects for RECI

Keeping in the back of our minds that RECI solutions require a business approach to fit the wider enterprise architecture, this paper uses a straightforward business approach for designing RECI projects. The approach derives from Doran's original SMART acronym [13]. This paper adds a few elements that are frequently used to capture scientific progress in a fast-moving domain, to form SMARTER:

S Specific: target a specific area for improvement,
M Measurable: quantify or at least suggest an indicator of progress,
A Assignable: specify who will do it,
R Realistic: what results can realistically be achieved, given resources,
T Time-related: specify when the result(s) can be achieved,
E Exciting: electrifying and motivating,
R Reach: pushing past the state-of-the-art.


3.1 Reach

The last point, reach, addresses the need to push the boundaries of current knowledge beyond the state-of-the-art. In many ways this is the core business of researchers in RECI and any other innovative domain. The basic assumption is that researchers, such as the authors and audience of this work, are in sync with current developments in their domain by keeping up with literature, experimenting with the ideas and methods of predecessors and taking the next step. In terms of progress for RECI there are roughly three ways to move beyond the state-of-the-art: (i) prove (or disprove) existing theories and methods by systematic scientific testing, or design alternative routes to prove or disprove a theory; (ii) expand the usability of existing methods and theories by generalizing (inductive), specializing (deductive) or transferring them to different domains, provided that scientific evidence supports the expansion; and, last but not least, (iii) develop novel theories and methods and provide evidence for their correctness. In many ways this is the core business for any scientific endeavour; RECI is no exception.

3.2 Exciting

The second-last point, exciting, tends to be a more personal aspect. The work has to be thrilling, electrifying and motivating. In some ways excitement is the buoy that scientific progress floats on. Industries are tempted by solutions that improve their business and are willing to pay handsomely for them. Companies are motivated by elements of competition to try to outperform competitors. But also on a personal level, researchers are excited by the developments in their field. A good research programme or project cannot thrive without some sense of excitement for it to be truly successful, and RECI researchers are wise to keep that in mind. This is especially important when RECI researchers discuss research and solutions with industry: they have to be inspired themselves and be able to convey the excitement of the novelty and progress to funders. The works in this volume attest to the energy and drive associated with RECI research: cutting-edge computer science meets advanced reliability engineering mathematics.

3.3 Specific

Returning to the first point, specific, we find elements that are associated with the more traditional use of the SMART acronym. This is mostly about making the area of work explicit. Describe the improvement you are pursuing, what routes are open to you and what the expected impact is for science and for industry. Try to link to a broader enterprise architecture environment regardless of


whether that enterprise is digitized or not. Map out what boundaries you think there might be and in what way they influence your work. Explain in as much detail as is required the concepts, theories and methods you will be using, what parts of them are relevant and which need to be changed or improved. Explain which technologies are required and how they contribute to science and business. Not unimportant in this is understanding your own position as a researcher, individually or as part of a team: what is it about you and your track record that makes you suitable to pursue this line of investigation? Your funders and peers will ask themselves: why are you doing this at this point in time?

3.4 Realistic

Before moving on to more mundane elements of project management we treat realism. Up until now we proposed that progress and excitement on a specific topic breed the kind of ambition that RECI research requires, but of course some deliberation is required to consider limitations. Many things can limit the reach of a research project: funds, time, skills, equipment, connections to industry, access to data, and even the maturity of scientific theories can limit what is possible within a RECI research project. It helps if a consistent analysis is performed of possible problems for the work. Even a simple method, such as a SWOT analysis, could help. In this stage it is important for those involved to be honest with themselves and their partners about what can be achieved; projects that promise results that cannot be achieved are particularly unwelcome in a business environment, especially if the promised functionalities play a pivotal role in an enterprise architecture. If performed well, the analysis informs those involved about what personal development is required (perhaps in terms of learning mathematical techniques, programming languages or oral communication). It may be necessary to revisit the first point (specific) to tone targets down (or, indeed, move them even higher).

3.5 Measurable

Moving into the more mundane practicalities of RECI (project) research, it becomes necessary to measure progress as the project moves forward. Ultimately the research works toward a target, and the researcher(s) themselves as well as funders or other interested partners need to understand whether the targets can be met or have to change. Traditionally in research a research plan provides a stepwise plan for progress, mostly based on key deliveries. But equally, quality indicators can be assigned or designed; examples include, but are not limited to: expenditure, progress against a backlog or energy efficiency. A consistent tracking system tends to work better than vague targets, and it increases trust in achieving the targets, or adjusting them in a timely manner, for everyone involved in the research.


3.6 Assignable

Also delving into the practicalities of research, assignable focuses on specifying who will deliver what to achieve progress. From the researchers' perspective it is mostly their own progress that is relevant, but especially in RECI projects there are many dependencies on stakeholders. If the research contributes to a business in a wider enterprise architecture, stakeholders may have to convey business constraints, access to systems, data, exchange formats etc. in a timely manner. As simple as this sounds, very often there are difficulties in providing access to data. Assigning responsibilities is often key to the success of a multi-disciplinary research team. And addressing PhD students in this: yes, you can assign responsibilities to your professor about the amount of time you wish them to spend on networking events and reviewing your work. With computer science playing into the RECI domain there is a special concern for rapid changes. The world of IT is moving forward at breakneck speed and software that is state-of-the-art today might be obsolete two years from now; experienced programmers you work with might only be available to you briefly because they are mobile in their assignments and businesses are frequently bought and sold. Ensuring that the right people are connected probably entails re-assigning the same task to another person; keeping track of assignable tasks may be key to the success of the work.

3.7 Time-Related

Last but not least, time-related is an important issue, even when the project is well designed. Whether the work is done as a student on assignment or within a research programme running for several years, time is always against you. The steps above should provide sufficient attention to time issues but, more often than not, plans change, deliveries are delayed, and data turns out to be owned by a party that was not part of the initial discussions. Yet professors don't like PhD processes overrunning, clients don't like delays and students' terms don't alter because the project isn't finished yet. Counterintuitively, controlling time asks for an activity that costs time: meetings. It is worth discussing progress, complications and delays on a regular basis. This is where the points 'specific' and 'measurable' come into their strength. Supervisors, clients and funders have to understand time-pressures and complications and typically want to play their part in deciding whether and how targets and timelines change. Any time spent on adjusting plans in collaboration with stakeholders is time well spent, with the up-side that unnecessary time-stresses are usually relieved. Therefore, it is sensible to take a meeting schedule into account when considering the points 'specific' and 'measurable.' It is well understood that the SMARTER method does not provide a solution for RECI research as a research field as a whole, nor does it provide a blueprint for good science, but it does provide a blueprint for setting up RECI projects to meet the requirements


for implementation in a wider business environment within an enterprise architecture. Of course, SMARTER is not the only way to design research projects in alignment with enterprise architecture demands; it is worthwhile reading up on Zachman [3] and/or TOGAF [4] to get a better understanding of EA business approaches.

4 Conclusion This work captures some important points about the business viability of RECI solutions. The work draws on earlier work on business sensibility for the practical implementation of advanced analytics solutions and turns it toward RECI solutions. Four building blocks were found to be relevant: enterprise architecture, analytics & data, ontology and visualization. This work elaborates on enterprise architecture because it provides the business ecosystem for analytics solutions: RECI solutions do not exist on their own but fit within a large network of business processes, and they have to demonstrate added value within that network. The work provides some guidelines for assessing the viability of RECI proposals through an established business acronym: SMARTER. The approach may be of use when assessing the added value of envisaged RECI research and/or solutions to fit into the business environment; in that sense it can help build a more solid case for project funding and management. We believe that the approach is equally of use in the design stage and the execution of a RECI research project. Also, the work invites researchers to scrutinize the business ecosystems in which their solutions add value. After all, adding value to a business ecosystem is a crucial success factor for analytics solutions.

References
1. Van Gulijk, C., Figueres-Esteban, M., Hughes, P., Loukiavov, A.: Introduction to IT transformation of safety and risk management systems. In: Mahboob, Q., Zio, E. (eds.) Handbook of RAMS in Railway Systems, pp. 631–650. CRC Press, New York (2018)
2. TOGAF: The Open Group Architecture Framework. https://www.opengroup.org/togaf (2006)
3. Zachman-Institute: The Concise Definition of The Zachman Framework. https://www.zachman.com/about-the-zachman-framework (2020)
4. ISO/IEC/IEEE 42010:2011: Systems and Software Engineering—Architecture Description. https://www.iso.org/standard/50508.html (2011)
5. RSSB: The Railway Functional Architecture, Report T912. https://www.rssb.co.uk/en/research-catalogue/CatalogueItem/T912 (2011)
6. Brewster, C., Ciravegna, F., Wilks, Y.: Knowledge acquisition for knowledge management: position paper. In: Proceedings of the IJCAI-2001 Workshop on Ontology Learning, pp. 121–34 (2001)
7. Uschold, M., Gruninger, M.: Ontologies and semantics for seamless connectivity. ACM SIGMOD Rec. 33(4), 58–64 (2004)
8. Figueres-Esteban, F., Hughes, P., El-Rashidy, R., Van Gulijk, C.: Manifestation of ontologies in graph databases for big data risk analysis. In: Haugen, S., Barros, A., Van Gulijk, C., Kongsvik, T., Vinnem, J. (eds.) Safety and Reliability—Safe Societies in a Changing World, pp. 3189–3193 (2018)
9. Spiegelhalter, D.: The Art of Statistics. Pelican Books, London (2019)
10. Card, S.K., Mackinlay, J.D., Shneiderman, B.: Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999)
11. Keim, D., Kohlhammer, J., Ellis, G., Mansmann, F.: Mastering the Information Age: Solving Problems with Visual Analytics, pp. 57–86 (2010)
12. Thomas, J.J., Cook, K.A.: Illuminating the path: the research and development agenda for visual analytics. IEEE Computer Society (2005)
13. Doran, G.T.: There's a SMART way to write management's goals and objectives. Manag. Rev. 70(11), 35–36 (1981)

Method for Determining the Structural Reliability of a Network Based on a Hyperconverged Architecture Igor Ruban, Heorhii Kuchuk, Andriy Kovalenko, Nataliia Lukova-Chuiko, and Vitalii Martovytsky

Abstract The features of a network operation, which is based on a hyperconverged architecture, are considered. A hierarchical graph is built, isomorphic to the network structure, in which the network hypervisor is the center. The graph vertices are stratified depending on the length of the path to the center. Sets of graph branches are constructed for each level of stratification. Utilization rates for the nodes and branches of the graph are calculated. On their basis, the level of network operation quality is determined, depending on the availability of operable branches and the reliability of the nodes and communication links. Logical functions are constructed that describe the performance of the branches. To obtain a scalar indicator of the structural reliability of a network, the distribution of a discrete random variable of the number of operable branches is considered. An iterative algorithm for obtaining its numerical characteristics is proposed. The algorithm is based on finding the generating polynomial of the distribution. The mathematical expectation of a given random variable is chosen as an indicator of the structural reliability of the network. The analysis of the results of the structural reliability calculation for a network based on a hyperconverged architecture, depending on the functional redundancy, the number of levels of stratification, the degree of system complexity is performed. I. Ruban · A. Kovalenko (B) · V. Martovytsky Department of Electronic Computers, Kharkiv National University of Radio Electronics, Kharkiv 61166, Ukraine I. Ruban e-mail: [email protected] V. Martovytsky e-mail: [email protected] H. Kuchuk Department of Computer Engineering and Programming, National Technical University Kharkiv Polytechnic Institute, Kharkiv 61002, Ukraine e-mail: [email protected] N. Lukova-Chuiko Department of Cyber Security and Information Protection, Taras Shevchenko National University of Kyiv, Kyiv 01601, Ukraine e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_9


1 Introduction 1.1 Motivation Currently, networks based on a hyperconverged platform (HCP) are increasingly in demand on the network market [1]. HCPs are a further development of the convergence ideology, which means combining memory, computing and network resources into a common, pre-configured pool. Hyperconverged platforms add modularity. Due to this, all the necessary virtualized computing resources, network systems and network storage systems operate autonomously within separate modules. Hyperconvergence implies combining several different technological layers in one of the infrastructure components at the stage of creating a platform [2]. With hyperconverged structures, the technological layers of a computer system no longer need to be divided into separate pools of resources (computation, storage), since they are initially built into each unified element using a hypervisor, and a new platform is created on these elements. Improvements in the operability of hypervisors made it possible to overcome the constraints linked to access to data on local media of individual servers, and to replace the classic network with a virtual one. The situation is similar for the computer network, which is emulated by soft switches and routers and can be completely virtualized. Unification and simplicity of nodes, horizontal scaling and load balancing via fast interconnect, and the capability to define most infrastructure components in software and control them using a single system are the main features of modern hyperconverged platforms [3]. Modules in an HCP are grouped to provide fault tolerance, high performance, and flexibility in resource pooling. The platform hypervisor controls groups and modules in a hierarchy chain. Therefore, a network on a hyperconverged platform belongs to the class of structurally complex systems with hierarchical control. Any new qualities in such systems require that their reliability, and the means to control it, be carefully analysed. Reliability calculations are especially important when designing a network on the hyperconverged platform.

1.2 State of the Art Currently, many methods, techniques and algorithms for analysing the reliability of structurally complex systems have been proposed [4–14]. However, they do not completely take the hyperconvergence factor into account. Analytical methods are used to calculate the reliability indices of technical systems. These are the methods of the theory of random processes, the theory of expert assessments, decomposition, logical-probabilistic, asymptotic, analytical and statistical methods. In practice, the methods of simulation and statistical modelling are used [15–20].


Analytical methods for calculating reliability indices are rooted in the theory of random processes. The reliability calculation of complex technical systems is often based on the assumption that the uptime and the recovery time of elements have exponential probability distributions [4]. The construction of Markov reliability models is as follows. Based on the information on the structure and principles of the target system operation, the set of its probable states is determined and divided into two subsets: operable states and failure states. A transition graph is built, its vertices being the states of the system and its edges the probable transitions between the states. From the transition graph, a system of equations is formed whose solution yields the required reliability indices [5]. The assessment of the reliability parameters of technical systems using graphs makes it possible to take into account any factors that may affect the system. Describing a system by a state graph has a disadvantage: data entry and the methods used to determine reliability characteristics become complex for systems with a large number of states [6]. The processes that occur in systems with arbitrary distributions of time intervals (Erlangian, normal) are semi-Markov, which means that within these processes the probability of the system transitioning from one state to another depends on the time the system spent in the first state [7]. Methods based on semi-Markov processes have restricted use (they enable determining only stationary values of the reliability indices), since in general they cannot be used to develop a mathematical model of a restorable technical system that takes into account structural redundancy and an arbitrary repair discipline. Multidimensional Markov processes describe the operation of technical systems with arbitrary distributions of uptime and recovery times, taking into account structural and time redundancy, control of technical means and several types of failures. With the method of multidimensional Markov processes, reliability indices are calculated using statistical modelling, which requires considerable time and computer memory. To assess the reliability of complex technical systems with a small number of states, asymptotic methods can be used [8]. The disadvantage of asymptotic methods, which limits their application, is the locality of the solutions obtained. They make it possible to find solutions to the problem only within small limits of change of the system parameters. In practice, however, it is often necessary to go beyond these limits [9]. Logical-probabilistic methods of analysing the reliability of complex technical systems use the mathematical apparatus of the binary algebra of logic and the theory of probability [10, 11]. The methods of queuing theory, which include the differential method of decomposition into phases and Kendall's method [12], make it possible to reduce a non-Markov model to a Markov one. The methods of stepwise approximation of failure and recovery rates of elements are used to assess the reliability of systems that have a small number of states and slowly varying rates [13]. The methods of heuristic prediction are based on statistical processing of independent assessments of the values of the expected reliability indices of the target object [14, 15]. The method does not enable establishing the calculation error and is used exclusively for the case of highly reliable elements and systems.
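To make the state-graph approach described above concrete, the following minimal sketch (our own illustration in Python, with failure and repair rates that are purely assumed and not taken from the cited sources) solves the stationary equations of a two-state Markov model of a single repairable element; larger transition graphs are treated in the same way, only with a bigger generator matrix.

```python
import numpy as np

# Two-state Markov model of a repairable element:
# state 0 = operable, state 1 = failed.
lam, mu = 0.002, 0.10          # assumed failure and repair rates, 1/hour

Q = np.array([[-lam,  lam],    # generator (transition-rate) matrix of the state graph
              [  mu,  -mu]])

# Stationary distribution pi solves pi @ Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

steady_state_availability = pi[0]   # equals mu / (lam + mu), about 0.98 here
print(steady_state_availability)
```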


The decomposition method is based on the construction of mathematical models that enable obtaining sufficiently accurate upper and lower bounds of the reliability index being assessed [17]. The method of statistical modelling is used to study the behaviour of probabilistic systems under conditions when the internal interactions in these systems are not fully known [18]. The method is based on multiple tests of the constructed model with subsequent statistical processing of the obtained data. In general, simulation modelling methods are universal and enable considering systems with a large number of elements [19]. However, their use as a method for studying reliability problems is advisable only when it is difficult or impossible to obtain an analytical solution [21]. The comparative analysis of existing methods (an assessment of their capabilities) shows that, to assess the reliability and efficiency of operation of each complex technical system with a large number of states, it is necessary, based on traditional methods, to develop a methodology that takes into account the peculiarities of the operation and the singularity of a particular system, and which makes it possible to evaluate the errors in calculating reliability indices with the required accuracy. So, the following conclusions can be made. Analytical methods are important for studying the reliability of structurally complex systems, since for a large number of factors affecting the reliability of systems, high fidelity of simulation cannot be achieved in practice. However, the existing analytical methods for assessing the reliability of structurally complex systems have the following disadvantages: the methods are complex and focused on simple structures; there are difficulties in studying the non-stationary characteristics of reliability; it is impossible to study dependent processes or to analyse systems with a variable structure. When developing a mathematical model for the operation of a structurally complex system and methods for its analysis, the peculiarities of its operation should be taken into account. The exponential law of reliability cannot be applied to complex systems: the initial data in the models are inadequate to the physical processes that occur in the systems. Thus, when studying each structurally complex system, it is necessary, based on traditional methods, to develop a methodology that takes into account the peculiarities of the operation and the singularity of a particular system.

1.3 Goal and Tasks The goal of the chapter is to propose a method that enables determining the structural reliability of a network based on a hyperconverged platform at the network design stage. To achieve the goal, the following tasks shall be performed: – assessment of the structural reliability of a separate network module on a hyperconverged platform; – consideration of the reliability indices of the network on a hyperconverged platform;


– development of the algorithm for calculating the structural reliability of a network on a hyperconverged platform.

2 Assessing the Structural Reliability of a Module of the Network Based on HCP When designing a network based on a hyperconverged platform, two aspects of reliability can be singled out: hardware and structural [2]. The hardware aspect is thought of as the problem of ensuring the reliability of equipment, individual devices and their elements that form a communication network. The structural aspect of reliability reflects the operation of the network as a whole, depending on the state of nodes and communication links. The structural reliability of the network is linked, first of all, to whether there are ways of delivering information between the corresponding nodes of the network. When considering the structural reliability of structurally complex systems with a hierarchical structure, the reliability of simple chains is initially considered. These chains do not contain repeated vertices, and hence no loops or cycles, or a set of such chains between a given pair of nodes. Thus, we consider the reliability of the connectivity of the two poles of the network with known reliability indices of the edges and nodes included in the chains connecting them, or, in other words, the structural reliability of a two-pole network [5]. The index of the quantitative assessment of this characteristic is the probability of the connectivity of the given source and destination of a two-pole network within a given time t. The features of hyperconverged structures make it possible to represent individual modules of the system as two-pole subnets. Unfortunately, this method cannot be used to solve practical problems of large dimensions, since its computational complexity is determined by the exponent 2^n, where n is the number of network elements. Among the exact methods, the method for assessing network structural reliability by a set of paths should be noted [22]. Its processing time is also exponential, 2^{m_{x,y}} − 1, where m_{x,y} is the number of paths from the source x of the network to the destination y. The method enables obtaining the exact values of the required connectivity probability (structural reliability) Px,y(t) between the given poles of the network, at least on low-power structures. The principle of the method is to form all possible combinations of the m_{x,y} paths, i = 1, …, m_{x,y}, that is, to form the set of combinations C^{i}_{m_{x,y}}. Each combination is a disjunctive assembly of the elements of the paths included in it. The resulting expression for Px,y(t) is determined by an alternating sum, each summand of which is the product of the reliability values of the elements of the corresponding set of paths. Among the exact methods for assessing the structural reliability of two-pole networks, the method of sequential decomposition of the initial structure with respect


to the bridge connection (the Moore–Shannon method [23]) is of particular significance. Its advantage over the above methods of this class is its significantly lower processing time, estimated as 2^{k_{x,y}}, where k_{x,y} is the number of levels of network decomposition, equal, in the general case, to the number of decomposition elements, that is, bridge connections. For comparison, it should be noted that even for the simplest bridge circuits, the number of their simple chains is m_{x,y} = 2^{k_{x,y}+1}, and hence the complexity of solving this problem by the method of direct enumeration of simple chains will be 2^{2^{k_{x,y}+1}} − 1, which is unacceptable under real conditions. However, taking into account the peculiarities of the network on HCP, this method can be modified, which will significantly reduce its computational complexity.

2.1 Modified Moore-Shannon Method for Assessing the Structural Reliability of a Network Based on HCP As an example, consider the HPE Hyper Converged 250 System. The system switching diagram is shown in Fig. 1. Figure 1 is transformed into the form of the simplest two-pole diagram (Fig. 2). The target value P1,4(t) for the given structure, which comprises the bridge connection br23 (or br32) with the reliability p23(t) (to simplify the calculations, the reliability of the network nodes is in this case considered to be equal to 1), is equal to

Fig. 1 Switching Diagram for HPE HC250 Hyperconverged System

Fig. 2 The simplest two-pole c(v2, v4) bridge diagram (vertices v1–v4; edges br12, br13, br24, br34 and the bridge br23/br32)

P1,4(t) = p23(t) · P14(t)|_{p23=1} + (1 − p23(t)) · P14(t)|_{p23=0},   (1)

where P14(t)|_{p23=1} is the reliability of connectivity of the network vertex v1 with the vertex v4 (the reliability of the collection of all simple chains from node v1 to v4) under the assumption that the bridge reliability equals 1, which is equivalent to contraction (merging) of nodes v2 and v3 along the edge br23 (Fig. 3); P14(t)|_{p23=0} is the same reliability under the assumption that the bridge is completely unreliable, p23(t) = 0, which corresponds to removing the edge br23 from the network (Fig. 4). The initial structure is sequentially decomposed (by contracting nodes and removing edges) until the remaining substructures are combinations of series–parallel models. In the considered example, these substructures are the results of the decomposition shown in Figs. 3 and 4. Here k1,4 = 1, and the bridge connection br23 was used as the element of decomposition. Inserting the convolutions of the obtained substructures, computed according to the series–parallel diagrams, into (1) results in

P1,4(t) = p23(t) · (1 − (1 − p12(t)) · (1 − p13(t))) · (1 − (1 − p24(t)) · (1 − p34(t))) + (1 − p23(t)) · (1 − (1 − p12(t) · p24(t)) · (1 − p13(t) · p34(t))).   (2)

br12

v1

v4 v23

br13

Fig. 3 The substructure of decomposition when p23 (t) = 1

br34


Fig. 4 The decomposition substructure when p23(t) = 0

After expanding expression (2) and grouping, we finally obtain

P1,4(t) = p12(t)p24(t) + p13(t)p34(t) + p12(t)p34(t)p23(t) + p24(t)p13(t)p23(t) − p12(t)p24(t)p13(t)p34(t) − p12(t)p24(t)p34(t)p23(t) − p12(t)p24(t)p13(t)p23(t) − p12(t)p13(t)p34(t)p23(t) − p24(t)p13(t)p34(t)p23(t) + 2p12(t)p24(t)p13(t)p34(t)p23(t).   (3)

The calculation by the method of direct enumeration of simple chains (Table 1) gives the value

P1,4(t) = p12(t)p24(t) + p13(t)p34(t) + p12(t)p34(t)p23(t) + p24(t)p13(t)p32(t) − p12(t)p24(t)p13(t)p34(t) − p12(t)p24(t)p34(t)p23(t) − p12(t)p24(t)p13(t)p32(t) − p12(t)p13(t)p34(t)p23(t) − p24(t)p13(t)p34(t)p32(t) + p12(t)p24(t)p13(t)p34(t)p23(t) + p12(t)p24(t)p13(t)p34(t)p32(t).   (4)
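As a quick sanity check of the algebra (our own sketch, not part of the original chapter; it assumes the sympy library is available), the factoring expression (2) can be expanded symbolically and compared with (3):

```python
import sympy as sp

p12, p13, p24, p34, p23 = sp.symbols('p12 p13 p24 p34 p23')

# Eq. (2): factoring on the bridge br23 with series-parallel convolutions
eq2 = (p23 * (1 - (1 - p12) * (1 - p13)) * (1 - (1 - p24) * (1 - p34))
       + (1 - p23) * (1 - (1 - p12 * p24) * (1 - p13 * p34)))

# Eq. (3): the expanded and grouped form quoted above
eq3 = (p12*p24 + p13*p34 + p12*p34*p23 + p24*p13*p23
       - p12*p24*p13*p34 - p12*p24*p34*p23 - p12*p24*p13*p23
       - p12*p13*p34*p23 - p24*p13*p34*p23
       + 2*p12*p24*p13*p34*p23)

print(sp.expand(eq2 - eq3) == 0)   # True: (2) indeed expands to (3)
```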

Expressions (3) and (4) comprise summands that correspond to all simple chains as well as to their disjunctive combinations. They differ only in the constituents marked identically, and the identity of (3) and (4) holds if the condition p23(t) = p32(t) is met. Therefore, the proposed modification of the method for assessing the structural reliability of two-pole networks is valid only in cases where the reliability values of the bridge connections along which the decompositions are performed are symmetric. Consider the following example. Let the bridge connection be a unidirectional (oriented) arc br23 (Fig. 5). This means that there is no arc br32, so p32(t) = 0. The decomposition of such a structure according to the condition p23(t) = 1 (Fig. 6) will add an error to the result

Table 1 The matrix of aggregate simple chains

| No | C^i_{m1,4}, i = 1…4 | br12 | br24 | br13 | br34 | br23 | br32 | Sign | Chain |
| 1  | 1    | 1 | 1 | 0 | 0 | 0 | 0 | + | p12 p24 |
| 2  | 2    | 0 | 0 | 1 | 1 | 0 | 0 | + | p13 p34 |
| 3  | 3    | 1 | 0 | 0 | 1 | 1 | 0 | + | p12 p34 p23 |
| 4  | 4    | 0 | 1 | 1 | 0 | 0 | 1 | + | p24 p13 p32 |
| 5  | 12   | 1 | 1 | 1 | 1 | 0 | 0 | − | p12 p24 p13 p34 |
| 6  | 13   | 1 | 1 | 0 | 1 | 1 | 0 | − | p12 p24 p34 p23 |
| 7  | 14   | 1 | 1 | 1 | 0 | 0 | 1 | − | p12 p24 p13 p32 |
| 8  | 23   | 1 | 0 | 1 | 1 | 1 | 0 | − | p12 p13 p34 p23 |
| 9  | 24   | 0 | 1 | 1 | 1 | 0 | 1 | − | p24 p13 p34 p32 |
| 10 | 34   | 1 | 1 | 1 | 1 | 1 | 1 | − | p12 p24 p13 p34 p23 p32 |
| 11 | 123  | 1 | 1 | 1 | 1 | 1 | 0 | + | p12 p24 p13 p34 p23 |
| 12 | 234  | 1 | 1 | 1 | 1 | 1 | 1 | + | p12 p24 p13 p34 p23 p32 |
| 13 | 341  | 1 | 1 | 1 | 1 | 1 | 1 | + | p12 p24 p13 p34 p23 p32 |
| 14 | 412  | 1 | 1 | 1 | 1 | 0 | 1 | + | p12 p24 p13 p34 p32 |
| 15 | 1234 | 1 | 1 | 1 | 1 | 1 | 1 | − | p12 p24 p13 p34 p23 p32 |

Fig. 5 The bridge diagram with unidirectional bridge connection

P1,4(t), since nodes v2 and v3 can be merged in this case, as the simplest path (br12, br23, br34) and its combinations containing the arc br23 can pass through the combined vertex v23, but the reverse merge of nodes v3 and v2 is inadmissible. In this case, the simple path (br13, br32, br24) and its combinations with other simple paths through the given vertex do not exist. In other words, in the resulting Moore–Shannon expression for P1,4(t), four significant summands will appear whose true values equal 0, as each of them contains a multiplier whose


Fig. 6 The substructure of the decomposition of the diagram in Fig. 5 when p23(t) = 1

actual value is determined by the value p32(t) = 0. This determines the error of the required reliability P1,4(t). The given substantiation of the need for symmetry of the bridge connection, as an element of the decomposition of bridge diagrams, constrains the use of the Moore–Shannon method to undirected networks when an accurate assessment of the structural reliability of two-pole configurations of network nodes is required, but it considerably reduces the computational complexity of the structural reliability assessment of a separate unit.
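The point about bridge symmetry can be checked numerically. The sketch below (our own illustration, with arbitrary edge reliabilities rather than HPE HC250 values) computes the exact two-pole connectivity probability by enumerating all edge states and compares it with the factoring formula (2): the two agree for the symmetric bridge, while for the oriented arc of Fig. 5 the formula overestimates the reliability, as argued above.

```python
from itertools import product

# Illustrative edge reliabilities for the bridge network of Fig. 2
p = {'br12': 0.9, 'br13': 0.8, 'br24': 0.85, 'br34': 0.95, 'br23': 0.7}

def decomposition(p):
    """Eq. (2): factoring on the bridge br23 (assumes a symmetric bridge)."""
    a = 1 - (1 - p['br12']) * (1 - p['br13'])        # v1 to the merged node v23
    b = 1 - (1 - p['br24']) * (1 - p['br34'])        # merged node v23 to v4
    c = 1 - (1 - p['br12'] * p['br24']) * (1 - p['br13'] * p['br34'])  # bridge removed
    return p['br23'] * a * b + (1 - p['br23']) * c

def exact(p, bridge_directed=False):
    """P(v4 reachable from v1), by enumeration of all 2^5 edge states."""
    edges = list(p)
    total = 0.0
    for states in product((0, 1), repeat=len(edges)):
        up = {e for e, s in zip(edges, states) if s}
        prob = 1.0
        for e, s in zip(edges, states):
            prob *= p[e] if s else 1 - p[e]
        adj = {v: set() for v in ('v1', 'v2', 'v3', 'v4')}
        if 'br12' in up: adj['v1'].add('v2'); adj['v2'].add('v1')
        if 'br13' in up: adj['v1'].add('v3'); adj['v3'].add('v1')
        if 'br24' in up: adj['v2'].add('v4'); adj['v4'].add('v2')
        if 'br34' in up: adj['v3'].add('v4'); adj['v4'].add('v3')
        if 'br23' in up:
            adj['v2'].add('v3')
            if not bridge_directed:                  # symmetric bridge br32
                adj['v3'].add('v2')
        seen, stack = {'v1'}, ['v1']
        while stack:
            for w in adj[stack.pop()]:
                if w not in seen:
                    seen.add(w); stack.append(w)
        if 'v4' in seen:
            total += prob
    return total

print(decomposition(p), exact(p))          # equal for a symmetric bridge
print(exact(p, bridge_directed=True))      # smaller: the error discussed above
```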

2.2 Selecting the Network Reliability Indices on the HCP In a hyperconverged system (HCS) the hypervisor is the superordinate control. All other nodes have a certain functional subordination. That is why groups of nodes of the same subordination order can be singled out in an HCS; in this way the HCS is stratified. The nodes of the lower-level stratum, numbered N, will be called peripheral nodes (PN). Usually, the stratum of the lower level of the HCS has some functional redundancy. Therefore, the failure of one or more nodes of this level does not lead to the system failure, but to a decrease in the quality of operation. The degree of the decrease in operational quality depends not only on the number of failed PN, but also on their role in the structure of the HCS. N strata are the result of this process. The upper level is determined by the stratum numbered n = 1. Consider a stratum numbered n (1 < n < N), that is, one of the strata of the intermediate control level. The failure of any node of these strata leads to the failure of the whole lower branch of the control structure (all subordinate nodes down to the peripheral ones). However, the structure of the HCS implies meridian and latitude connections within a strict hierarchical structure. A meridian connection links two nodes of different strata, while latitude connections link nodes of the same stratum. As a result, the impact of a single node failure on the general network operational capability is reduced. Every node and every communication link of the HCS can have a complicated structure. However, in the structural graph of the system,


Fig. 7 The stratified HCS structure

they are represented as a single element that has a specific value of the reliability index. The concept of a graph branch can be introduced in the structural diagram of the HCS. A branch will be thought of as a two-pole subgraph that has the central vertex, one of the vertices of the lower level, and all the paths in the graph that connect them. Thus, in Fig. 7 the HCS structure has 2 meridian connections, (A1, C2) and (A1, C4), and 2 latitude connections, (C2, C3) and (C4, C5). S branches can be singled out in this structure; double lines in Fig. 7 show the edges of one of these branches, (A1, C2). An important feature of such systems is the ambiguity in determining system failure. The system unambiguously fails when the central unit fails. In other cases, there is a slight decrease in the quality of the system operation, and it is very difficult to obtain a scalar index of it by standard approaches. It is possible to move from reliability indices to efficiency indices [23] by determining, for example, the discrete distribution of the number of operable branches Pk(t) = P(K(t) = k), where K(t) is a stepwise random variable that determines the number of operable branches at time t. For an isotropic system (all branches are structurally and reliability-wise equivalent), it makes sense to consider the mathematical expectation of the random value of a quality measure of the network operation:

E(Q(t)) = Σ_{k=0}^{M} Qk · pk(t),   (5)


where E is the function of mathematical expectation, Qk is the level of quality of operation under the condition that there are k operable branches, and M = dim(S_N) is the number of peripheral nodes, which equals the number of branches of the HCS structure. For a non-isotropic system, Qk values vary depending on the combination of operable branches. To describe Qk, it can in this case be presented as an infinite polynomial:

Qk = Σ_{i=0}^{∞} ei · k^i,   (6)

then,

E(Q(t)) = Σ_{i=0}^{∞} αi · ei,   (7)

where αi is the i-th initial moment of the K(t) distribution, αi = Σ_{k=0}^{M} k^i · pk.

Coefficients Qk are determined quite simply if the quality measure has a certain physical meaning, for example, the volumes of received and transmitted information. In other cases, significant difficulties arise that lead to a decrease in the accuracy and reliability of the determined values; as a result, this approach becomes meaningless. Let us consider the approach that uses a vector of reliability indices that depend on the uptime. The thresholds for the reduction of operation quality are determined. In particular, for an isotropic unrestorable system the following probability is calculated:

P(t, m) = P(T0(m) > t) = τm(t) = P(ξ(t) ≥ N − m),

where T0(m) is the uptime until the failure of the (N − m + 1)-th branch occurs and ξ(t) is the number of operable branches at the moment of time t. For various values of m, a set of N curves is obtained, that is, the curves of the probability of occurrence of a certain number of inoperable branches (Fig. 8). If the values P* determine the threshold of operability of the isotropic system, then a random vector of system uptime values t1*, t2*, t3*, …, of length N, is formed depending on the number of inoperable branches. In addition to the considered indices, the indices of system use are of interest. These include the system utilization ratio. It shows how likely it is that the required branch will be operable at an arbitrary moment in time for a restorable system, and at a fixed moment in time for non-restorable systems.
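As an illustration of how the curves of Fig. 8 and the threshold uptimes t_m* can be obtained, the sketch below assumes N independent, identical branches whose reliability follows an exponential law with rate λ; N, λ and the threshold P* are purely illustrative values, not figures from the chapter.

```python
import math

N = 6            # number of branches (illustrative)
lam = 0.01       # assumed exponential branch failure rate, 1/day (illustrative)
P_star = 0.9     # operability threshold P*

def p_k(t, k):
    """P(exactly k of the N branches are operable at time t), binomial model."""
    r = math.exp(-lam * t)                       # assumed branch reliability law
    return math.comb(N, k) * r**k * (1 - r)**(N - k)

def P_t_m(t, m):
    """P(t, m) = P(xi(t) >= N - m): at least N - m branches still operable."""
    return sum(p_k(t, k) for k in range(N - m, N + 1))

def t_star(m, dt=1.0):
    """Largest grid time for which P(t, m) still exceeds the threshold P*."""
    t = 0.0
    while P_t_m(t + dt, m) >= P_star:
        t += dt
    return t

for m in range(1, 5):
    print(m, t_star(m))      # t1* < t2* < t3* < t4*, the curves of Fig. 8
```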

Fig. 8 Set of probability curves (threshold uptimes t1*, t2*, t3*, t4*)

2.3 The Algorithm of Structural Reliability Calculation Based on the properties of the hyperconverged structure and the results discussed in Sect. 2, the following notations for its basic parameters are introduced: • n is the number of strata; • ri is the factor of system branching when passing from the i-th stratum to the (i + 1)-th one; in the isotropic system it equals the number of subordinate nodes of the (i + 1)-th stratum per node of the i-th stratum; • Ri is the probability of failure of one node of the i-th stratum; • Pi = 1 − Ri is the probability that one node of the i-th stratum is in the operable condition; • Rij is the probability of failure of at least one meridian line of communication among the nodes of the i-th and j-th strata; • Pij = 1 − Rij is the probability of the operability of all meridian lines of communication among the nodes of the i-th and j-th strata; • N = r1 · r2 · … · rn−1 is the number of peripheral nodes and, consequently, of graph branches of the HCS structure. When calculating the structural reliability of a stratum, a logical reliability function is calculated for one branch of an isotropic network, but the structural reliability of its modules is calculated beforehand. For this, the logical-probabilistic design method is used.


The mixed form of the probability function is further built for this branch:

Pn(f1, f2, …, fn−1) = Qn · (1 − ∏_{i=1}^{n−1} Rni^{fi}).   (8)

For this branch, the polynomial

φn^{(n)}(f1, f2, …, fn−1) = 1 − Pn + Pn · z = 1 − Pn · (1 − z)   (9)

is built, where the subscript shows the number of ranks in the system, and the superscript shows the upper rank of the node to which the given polynomial refers. The polynomial (9) is further raised to the power equal to the branching factor of the tier, rn−1; thus, the following polynomial is obtained:

φn^{(n−1)} = (φn^{(n)})^{rn−1}.   (10)

After replacing Boolean variables in (10) that correspond to the (n − 1)-th rank, it is raised to the power of rn−2 . As a result of the multistep execution of this procedure, the generating polynomial for the distribution of the number of operable branches of the HCS network is obtained, which enables determining the indices of structural reliability.
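One way to realize this multistep procedure is sketched below; this is our own assumed implementation, not the authors' code. It assumes a branch is operable only if every node and meridian link on its root-to-leaf path is operable, with independent failures, and it raises the polynomial of each tier to the branching factor exactly as in (9)–(10). The coefficients of the resulting generating polynomial give the distribution pk of the number of operable branches, from which the expectation or readiness-type indices follow; the parameter values in the usage line are illustrative only.

```python
import numpy as np

def branch_count_poly(r, q, q_root=1.0):
    """
    Generating polynomial of the number K of operable branches:
    the k-th coefficient of the returned array is P(K = k).
      r      -- branching factors per tier transition, e.g. [2, 3]
      q      -- per transition, probability that a child node and its meridian
                uplink both work (same length as r), e.g. [Pi * Pij, ...]
      q_root -- reliability of the central node (hypervisor)
    """
    g = np.array([0.0, 1.0])                  # a reachable peripheral node: exactly z^1
    for ri, qi in zip(reversed(r), reversed(q)):
        unit = np.zeros_like(g); unit[0] = 1.0
        child = (1 - qi) * unit + qi * g      # cf. (9): 1 - P + P*z, per child subtree
        sub = np.array([1.0])
        for _ in range(ri):                   # cf. (10): raise to the branching factor
            sub = np.convolve(sub, child)
        g = sub
    unit = np.zeros_like(g); unit[0] = 1.0
    return (1 - q_root) * unit + q_root * g   # central node failure kills every branch

pk = branch_count_poly([2, 3], [0.95, 0.9])   # illustrative two-tier example, 6 branches
E_K = sum(k * c for k, c in enumerate(pk))    # expected number of operable branches
P_at_least = lambda m: float(sum(pk[m:]))     # readiness for a minimum of m branches
print(E_K, P_at_least(4))
```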

3 Results and Discussion The proposed approach was applied to determine the degree of readiness of a conditionally fully connected network of the HCS, which has a four-tiered structure. The modules between which there was no virtual link connection have zero bandwidth. The numbers of modules on the tiers are: r1 = 4; r2 = 2; r3 = 3; r4 = 2. The general number of modules is N = ∏_{k=1}^{4} rk = 32. The module reliabilities were given as Pi = 1 − Ri for the modules numbered i. The probabilities of the operable condition of the communication lines among the modules were given as Pij = 1 − Rij. The results of modelling are shown in Figs. 9 and 10 for the following variants: (V1) Pi = 1; Ri = 0; Pij = 0.9; Rij = 0.1; (V2) Pi = 1; Ri = 0; Pij = 0.8; Rij = 0.2; (V3) Pi = 0.9; Ri = 0.1; Pij = 0.9; Rij = 0.1. The diagrams show: m is the minimum allowed number of operable branches; K is the coefficient of system readiness; Pm is the system reliability for a given m.


Fig. 9 Distribution of the number of operable branches

The results of modelling show that when the values of the probabilities Ri and Rij are low, the numerical values of the structural reliability of the system decrease as m rises; however, when the values of m are high, the distribution becomes unimodal. The coefficient of system readiness increases when the minimum permissible number of operable branches decreases, and when Ri = 0 it can reach values very close to 1 with comparatively low redundancy (Fig. 10).

Fig. 10 Readiness coefficients


4 Conclusions The features of a network operation, which is based on a hyperconverged architecture, are considered. A hierarchical graph is built, isomorphic to the network structure, in which the network hypervisor is the center. The graph vertices are stratified depending on the length of the path to the center. Sets of graph branches are constructed for each level of stratification. Utilization rates for the nodes and branches of the graph are calculated. On their basis, the level of network operation quality is determined, depending on the availability of operable branches and the reliability of the nodes and communication links. Logical functions are constructed that describe the performance of the branches. To obtain a scalar indicator of the structural reliability of a network, the distribution of a discrete random variable of the number of operable branches is considered. An iterative algorithm for obtaining its numerical characteristics is proposed. The algorithm is based on finding the generating polynomial of the distribution. The mathematical expectation of a given random variable is chosen as an indicator of the structural reliability of the network. The analysis of the results of the structural reliability calculation for a network based on a hyperconverged architecture, depending on the functional redundancy, the number of levels of stratification, the degree of system complexity is performed.

References
1. Linthicum, D.: Hyperconverged technology a natural fit for edge computing. https://www.cisco.com/c/en/us/solutions/data-center/hyperconverged-technology (2019)
2. Merlac, V., Smatkov, S., Kuchuk, N., Nechausov, A.: Resourses Distribution Method of University e-learning on the Hypercovergent platform. In: 2018 IEEE 9th International Conference on Dependable Systems, Service and Technologies (DESSERT'2018), Kyiv, pp. 136–140 (2018). https://doi.org/10.1109/DESSERT.2018.8409114
3. Kuchuk, N., Mozhaiev, O., Mozhaiev, M., Kuchuk, H.: Method for calculating of R-learning traffic peakedness. In: 2017 4th International Scientific-Practical Conference Problems of Infocommunications Science and Technology, PIC S and T 2017–Proceedings, pp. 359–362 (2017). https://doi.org/10.1109/INFOCOMMST.2017.8246416
4. Al-Kuwaiti, M., Kyriakopoulos, N., Hussein, S.: Network dependability, fault-tolerance, reliability, security, survivability: a framework for comparative analysis. In: 2006 International Conference on Computer Engineering and Systems, IEEE Xplore 26 February 2007, INSPEC: 9232341, Cairo, Egypt (2007). https://doi.org/10.1109/ICCES.2006.320462
5. Villemeur, A.: Reliability, Availability, Maintainability and Assessment (Methods and Techniques, vol. 1), p. 367. Wiley & Sons (1992)
6. Donets, V., Kuchuk, N., Shmatkov, S.: Development of software of e-learning information system synthesis modeling process. Adv. Inf. Syst. 2(2), 117–121 (2018). https://doi.org/10.20998/2522-9052.2018.2.20
7. Mukhin, V., Kuchuk, N., Kosenko, N., Kuchuk, H., Kosenko, V.: Decomposition Method for Synthesizing the Computer System Architecture. Advances in Intelligent Systems and Computing, AISC, vol. 938, pp. 289–300 (2020). https://doi.org/10.1007/978-3-030-16621-2_27
8. Kuchuk, G., Kovalenko, A., Komari, I.E., Svyrydov, A., Kharchenko, V.: Improving big data centers energy efficiency: traffic based model and method. In: Kharchenko, V., Kondratenko, Y., Kacprzyk, J. (eds.) Green IT Engineering: Social, Business and Industrial Applications. Studies in Systems, Decision and Control, vol. 171. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-00253-4_8
9. Shubinsky, I.B., Zamyshlyaev, A.M.: Topological semimarkov method for calculation of stationary parameters of reliability and functional safety of technical systems. Reliab. Theory Appl. 2 (2012)
10. Ruban, I.V., Martovytskyi, V.O., Kovalenko, A.A., Lukova-Chuiko, N.V.: Identification in informative systems on the basis of users' behaviour. In: Proceedings of the International Conference on Advanced Optoelectronics and Lasers (CAOL 2019), pp. 574–577 (2019). https://doi.org/10.1109/CAOL46282.2019.9019446
11. Kuchuk, H., Kovalenko, A., Ibrahim, B.F., Ruban, I.: Adaptive compression method for video information. Int. J. Adv. Trends Comput. Sci. Eng., 66–69 (2019). https://doi.org/10.30534/ijatcse/2019/1181.22019
12. Wilkie, D.: Pictorial representation of Kendall's rank correlation coefficient. Teaching Stat. 2, 76–78 (1980)
13. CENELEC EN 50126: Railway applications—the specification and demonstration of Reliability, Availability, Maintainability and Safety (RAMS) (1998)
14. Attar, H., Khosravi, M.R., Igorovich, S.S., Georgievan, K.N., Alhihi, M.: Review and performance evaluation of FIFO, PQ, CQ, FQ, and WFQ algorithms in multimedia wireless sensor networks. Int. J. Distrib. Sens. Netw. 16(6), 155014772091323 (2020). https://doi.org/10.1177/1550147720913233
15. Singh, S., Gupta, V., Grover, A., Dhori, K.J.: Diagnostic circuit for latent fault detection in SRAM row decoder. In: Quality Electronic Design (ISQED), 2020 21st International Symposium, pp. 395–400 (2020). https://doi.org/10.1109/ISQED48828.2020.9136968
16. Ismail, A., Jung, W.: Research trends in automotive functional safety. In: 2013 International Conference on Quality, Reliability, Risk, Maintenance, and Safety Engineering (QR2MSE), Chengdu, China, INSPEC: 13828750 (2013). https://doi.org/10.1109/QR2MSE.2013.6625523
17. Svyrydov, A., Kuchuk, H., Tsiapa, O.: Improving efficiency of image recognition process: approach and case study. In: Proceedings of 2018 IEEE 9th International Conference on Dependable Systems, Services and Technologies (DESSERT 2018), pp. 593–597 (2018). https://doi.org/10.1109/DESSERT.2018.8409201
18. Kovalenko, A., Kuchuk, H.: Methods for synthesis of informational and technical structures of critical application object's control system. Adv. Inf. Syst. 2(1), 22–27 (2018). https://doi.org/10.20998/2522-9052.2018.1.04
19. Semenov, S., Sira, O., Gavrylenko, S., Kuchuk, N.: Identification of the state of an object under conditions of fuzzy input data. East.-Eur. J. Enterp. Technol. 1(4), 22–30 (2019). https://doi.org/10.15587/1729-4061.2019.157085
20. Zaitseva, E.N., Levashenko, V.G.: Importance analysis by logical differential calculus. Autom. Remote Control 74, 171–182 (2013). https://doi.org/10.1134/S000511791302001X
21. Zaitseva, E., Levashenko, V., Rabcan, J., Krsak, E.: Application of the structure function in the evaluation of the human factor in healthcare. Symmetry 12(1), 93 (2020). https://doi.org/10.3390/sym12010093
22. Kuo, W., Zuo, M.J.: Optimal Reliability Modeling: Principles and Applications. Wiley & Sons, Hoboken, NJ (2003)
23. Moore, E.F., Shannon, C.E.: Reliable circuits using less reliable relays—part I. J. Franklin Inst. 262(3), 191–208 (1956)

Database Approach for Increasing Reliability in Distributed Information Systems Roman Ceresnak and Karol Matiasko

Abstract Nowadays, data replication plays a key role while increasing system reliability. A vast amount of the data entering the system has to be protected, not only while processing but also while storing it. Since distributed processing happens in many cases, while processing the data and the data are placed on various calculation knots, it is necessary to ensure an error, respectively, a calculation unit outage. The data loss will not happen, even in the case of a hardware error. This article describes a method based on a reliability index, helping us to determine a replication coefficient. Right, on the replication coefficient basis, we can reduce the amount of unimportant data and thus to reduce not only the size of a place taken on the Keywords Replication method · Distributed data processing · Database system · System reliability

1 Introduction The modern world, where data are a commodity, changes every moment. According to official statistics, about 1.7 MB of data were created every second in 2020, representing a daily data flow of about 2.5 quintillion bytes. Based on this statistic, it is clearly appropriate to take time into account during data manipulation. While examining this fact, we came across a study [1] dealing with the time factor in data acquisition. The researchers introduce a new, elaborate classification of tables against the background of time-limited validity, emphasizing the efficiency of cure, recovery, and information acquisition. The mentioned data significantly influence the decision-making process. Subsequently, they deal with different types of table definition, their characteristics, and their suitability for use. R. Ceresnak (B) · K. Matiasko Faculty of Management Science and Informatics, University of Zilina, Zilina, Slovakia e-mail: [email protected] K. Matiasko e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_10


The cited study clearly shows that the time factor is an essential part of faster information acquisition from a database and of providing the data to users. In many cases, various types of indexes, primary indexes [2], secondary indexes [3], or temporary indexes [4], are used to acquire stored data in relational databases faster. All the mentioned structures can effectively reduce the time needed for data acquisition from the relational database. As was mentioned at the beginning of the chapter, the data created by users are stored in massive amounts and in various types. Precisely this storing of data of multiple types, structures, and sizes led to the creation of databases capable of manipulating data with a free structure. This type of database comes directly from the basic conventions of relational databases; however, it makes it possible to store non-structured data, and this type is called a nonrelational database. Nowadays, we know various kinds of nonrelational databases. The basic types are Key-Value, Graph, Document, and Wide Column [5]. Data replication belongs to the essential characteristics of nonrelational databases. Even if nonrelational databases can effectively work with the data, a crucial aspect of big data is its size. For the more effective provision of data storage for the whole data amount, a distributed system was created. This system is based on the principle of several calculation knots, for example computers, which complete the calculation together as one unit. Effective data sharing in this storage system is complicated by unpredictable knot failures, unreliable network connections, and limited bandwidth. Some applications, such as scientific applications, demand available data when needed, at least with high probability. The probability of failure increases in some data storage because many of the data knots are used in cloud or hardware storage. Thus, the improvement of data availability in the data storage system becomes a significant challenge for system designers. Replication is often used as a tool for data availability improvement in storage systems, such as the Google file system. If more copies of a block exist on various data knots, the chances that at least one copy is available increase. If one data knot fails, the data are still available in replicas. Administration costs are significantly higher with the increasing amount of replicas. Various models do not necessarily make the availability significantly better, but they will bring useless expenses. The critical question is how many replicas will ensure optimal system operation when calculation knots drop out. Suppose all replicas are allowed to be active; in that case, the copies can be used not only for availability improvement but also for improving load balance and overall performance if the models are suitably distributed [6]. While setting up replica placement, we need to consider the system size and the number of calculation knots where the calculation operations will be done. We will solve these questions precisely in the following article.


The main benefits of this article will be: • The designed method, which provides the replication coefficient based on database table usage. • Making data replication more effective, thus reducing the number of calculation knots and making the replication process faster. • Reduction of the overload of the distributed system needed for processing the replication process. • Optimization of calculation tools. The rest of the paper is structured as follows. Related works are summarized in Sect. 2. Section 3 presents our system model. In Sect. 4, we provide experiments, and in Sect. 5, we conclude.

2 Related Work Current innovations such as Sybase Replication Server, Prophet Symmetric Replication Technology, Ingres Replicator, and exchange screens such as DEC-ACMS, give fundamental capacities to the physical replication of information pieces. Lamentably they allow unwinding of coherency at fixed levels just: old-style 2PC, primary secondary approach, idealistic utilizing timestamps, or uncontrolled. These business frameworks do not give instruments, whose uphold application explicit coherency conditions. Our research on the value of replication in the distributed system is based on HDFS and increased reliability for reliable storage of huge files across distributed commodity machines in a large cluster. It stores each document as an arrangement of squares; all squares in a record are of a similar size, aside from the last one. A document’s squares are duplicated for understanding execution and adaptation to internal failure. HDFS presents a basic, however, profoundly successful tripling strategy to assign copies for a court. The default copy position strategy of HDFS has placed one reproduction on one hub in the neighborhood rack, another on a seat in a distant stand. They keep going on an alternate corner in a similar far off the shelf. This imitation situation strategy cuts between rack compose traffic, which by and considerably improves composing execution. The reason for the rack-mindful copy arrangement methodology is to enhance information dependability, accessibility, and organization transmission capacity use. Overseers of HDFS groups can indicate the default replication factor (number of reproductions) or the replication factor for available information. They can likewise execute their copy situation methodology for HDFS. We came across work that deals with a time-oriented database architecture during our research in the paper [7], which manages undefined values and proposes a comprehensive classification of systems on transactions, accesses, and indices. Since various data types can enter our system, whether it is structured or unstructured data, modeling undefined values is recorded in the mentioned work. Furthermore, it covers synchronization processes using groups of data. The critical component of


the mentioned article are solutions for effective data acquisition with emphasis on undefined values and states. In studying the system’s reliability, we also examined the problem dealing with the sensitivity and accuracy of the problem provided. In the paper [8], the author deals with managing the temporary system’s granularity and proposes a data-sharing model based on the reliability, sensitivity, and accuracy of data providers. It provides a system concept that introduces a cash prospect, which is then evaluated in the experiment section. Optimization of the data flow by historical data aggregation and limitation of the data amount is a core part for the system decision making, whereas the time for data transferring is strictly limited. Another look at increasing reliability is given in the studies in the article [9]. The main idea is to manage the data asynchronously, and then the data is merged. Study data began in 2006 and is the result of an in-depth analysis. The study’s achieved result is the creation of architectural design for a distributed information system with asynchronous update data. During the development of the researchers concluded came the need to store versioned data on the server. Their different approach to solving reliability in the mentioned paper uses new techniques of storing versioned data in a unitemporal relational database. Storage is a departure from traditional security practices. The created solution can also preserve the advantages of RDBMS, such as referential integrity and transaction processing. Replication factor and reproduction arrangement are essential questions of replication of the board. The issue of dynamic replication is the executive’s system for HDFS has drawn significant consideration. CDRM [10] is a savvy dynamic replication. The executives conspire for enormous scope distributed storage framework. It develops a cost model to catch the connection between accessibility and replication factor. Given this model, lower bound on the copy reference number to fulfill accessibility necessity can be resolved. CDRM further places copies among circulating hubs to limit hindering likelihood, improve load parity, and generally speaking execution. DARE (Is the method for adaptive data replication for efficient cluster scheduling) [11] is a versatile information replication component for HDFS. It utilizes probabilistic testing and a severe maturing calculation autonomously at every hub to decide the number of imitations to distribute to each record and every reproduction area. DARE exploits existing far off information recoveries and chooses a subset of the information to be embedded into the document framework, thus making an imitation without expending additional organization and calculation assets. Khan et al. [12] present a calculation that finds the ideal number of codeword images required for recuperation for any XOR-based deletion code and delivers recuperation plans that utilize a base measure of information. This calculation improves I/O execution by and by the vast square sizes used in cloud record frameworks, such as HDFS. CDRM and DARE attempt to make sense of appropriate replication factor for each record and spot them insensible data nodes as per the current outstanding burden and hub limit. In any case, they don’t consider the accessibility of information, that has low reproduction factors. DiskReduce presents a RAID (Redundant Array of Independent Disks) strategy only from time to time utilized information yet doesn’t


bring up step by step instructions to pass judgment on the cool information. In ERMS, we utilize unique imitation approaches for various information. It expands the replication number for hot information to improve execution. It utilizes RAID for cold information to spare an extra room. It utilizes the default tripling strategy for ordinary information. We likewise portray a particular reproduction situation technique for the additional imitations of hot information and RAID equality. Abawajy [13] formulated the data replication problem and designed a distributed data replication algorithm with a consistency guarantee for the data grid. The approach consists of a systematic organization of the data grid sites into distinct regions, a new replica placement policy, and a new quorum-based replica management policy. The quorum serves essential tools for providing a uniform and reliable way to achieve consistency among the system’s replicas. The main advantage of quorum-based replication protocols is their resilience to a node and network failures. This is because any quorum with fully operational nodes can grant reading and writing permissions, improving the system’s availability. The distributed article from researchers (SON 2003) showed a synchronization conspired for appropriated data frameworks. The plan builds the unwavering quality just as the level of the simultaneousness of the framework. A token is utilized to assign a read-compose duplicate. The technique permits exchanges to work on an information object if more than one symbolic copy is accessible. The serializability hypothesis for duplicated information and the recuperation instruments related to the plan are talked about. Another methodology, which specialists look at, is replicating our issue in document replication. ORCS and DMS, proposed in [14] and keep all reproductions of a record steady once its substance change, are among different instruments acquainted with saving imitations consistency. DMS is proposed to adjust the information area to the client asks for and diminish the separation between peruser hubs and information imitations. What’s more, ORCS and DMS consider each host’s extra accessible rooms in assessing the replication factor. So, to accumulate the required data, they use observing instruments like NWS and Ganglia and store this data in an information base. As one of its main points of interest, this instrument considers the entrance territory in document replication. Our designed mechanism is based on the principle to watch total data choose operation overload from database scheme, and based on estimated replication borders, to set the replication coefficient, that will serve as a replication coefficient in the distributed data processing.

3 Our Contribution Data replication is nowadays a key component in increasing system reliability, but it also ensures that when a server outage, data knot outage or blackout happens, no data loss occurs and the system reliably continues to work. Nowadays, we see that the replication coefficient is set to 3 in many developer articles and blogs. In our judgment, the replication coefficient should be

170

R. Ceresnak and K. Matiasko

based on data overload in data storage. Individual choose, actualization, and delete data operations for individual tables should settle, how many times are the data replicated in the established system (Figs. 1 and 2). We used a method giving us information about individual operations overload in database Oracle for these purposes and it looks as follows:
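The exact monitoring statement is not reproduced in the chapter. Purely as an illustration, per-table write and read counters of the kind described above could be pulled from Oracle's dictionary views (DBA_TAB_MODIFICATIONS for inserts/updates/deletes and V$SEGMENT_STATISTICS for logical reads); the schema name, the credentials and the python-oracledb driver used below are assumptions of this sketch, not part of the original setup.

import oracledb  # assumed client library; any Oracle driver would work the same way

conn = oracledb.connect(user="monitor", password="secret", dsn="dbhost/pdb1")  # hypothetical
cur = conn.cursor()

# Writes per table: inserts + updates + deletes tracked by table monitoring.
cur.execute("""SELECT table_name, inserts + updates + deletes
                 FROM dba_tab_modifications
                WHERE table_owner = :o""", o="STUDY")            # hypothetical schema name
writes = dict(cur.fetchall())

# Reads per table approximated by logical reads on the corresponding segments.
cur.execute("""SELECT object_name, SUM(value)
                 FROM v$segment_statistics
                WHERE owner = :o AND statistic_name = 'logical reads'
                GROUP BY object_name""", o="STUDY")
reads = dict(cur.fetchall())

table_stats = {t: (reads.get(t, 0), w) for t, w in writes.items()}  # table -> (reads, writes)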

Fig. 1 Scale up mechanism

Fig. 2 Scale down mechanism


Using this view, we obtain the overall statistics for a specific data model: the number of accesses to each table and the total number of records in each table, over all users who have access to the data model, as shown in Fig. 3. On the basis of these values we can clearly state that some records are more important for correct application operation, meaning that search, update, and delete operations were performed on the data in some tables more often than in others. The replication coefficient must therefore be based on the real values obtained from the database. According to the total number of record reads, we defined a rule that determines how many times the data we are working with has to be replicated so that no record loss occurs.

Fig. 3 Structure of data in system


On the basis of the recommended literature, where the replication coefficient is set to 3, we decided that the value 3 is indeed valid for the key tables; however, our designed algorithm, which watches the load of the individual tables, performs the replication of the data as follows:
• The algorithm divides the records according to the number of reads and writes in each table.
  – We defined a weight coefficient for writing, equal to 1, because the data writing operation to the database is in many cases performed in the background.
  – We defined a weight coefficient for reading, equal to 2, because reading is in many cases a more frequent operation than writing data to the database.
• Subsequently, the mean load of the records is calculated over all tables by the formula:

x = (Σ_{i=1}^{n} (Rc_i ∗ Wrc + Wc_i ∗ Rwc)) / n,   (1)

where x is the mean coefficient, n is the number of tables, Rc_i is the number of reads from table i, Wrc is the weight coefficient of reading, Wc_i is the number of writes to table i, and Rwc is the weight coefficient of writing.
• The algorithm goes through the values obtained from the database and performs the following operation: if the load of table i is smaller than x, i.e., if Rc_i ∗ Wrc + Wc_i ∗ Rwc < x, the replication coefficient of that table is set to 2; otherwise it is set to 3.
• Subsequently, the algorithm repeats this step for all tables, and the result is a diverse replication coefficient for the individual tables.
Based on the running algorithm, we created a structure that defines the replication coefficient for every table, using the database model portrayed in Fig. 3, which serves for teaching the subject Basics of Databases at the Faculty of Management Science and Informatics of the University of Žilina. The file that drives the replication control for the individual tables has a simple structure: the name of the table and its replication coefficient. A short sketch of this rule is given below.
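The chapter does not list the implementation of this rule, so the following few lines of Python are only a sketch of the decision logic described above; the table names and counts are placeholder values.

W_READ, W_WRITE = 2, 1          # weight coefficients defined above

def replication_coefficients(table_stats, low=2, high=3):
    """table_stats maps table name -> (reads, writes); returns table -> replication factor."""
    loads = {t: r * W_READ + w * W_WRITE for t, (r, w) in table_stats.items()}
    x = sum(loads.values()) / len(loads)            # mean load over all tables, formula (1)
    return {t: (low if load < x else high)          # below-average tables get the lower factor
            for t, load in loads.items()}

# Placeholder statistics for three tables of the teaching schema.
stats = {"student": (90_000, 12_000), "teacher": (70_000, 3_000), "st_field": (15_000, 500)}
print(replication_coefficients(stats))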


The records, and also the tables, can of course change during the monitoring of the replication coefficient, and this file is used to control correct data replication. We experimentally compared the value of the average replication coefficient with the record counts from the tables and, based on these values, edited the file with the replication coefficients. We also recorded in the file when it is appropriate to increase, respectively decrease, the value of the replication coefficient. For these purposes we created two simple methods: automatic scaling upwards and automatic scaling downwards.
A. Scaling up policy
Automatic scaling upwards makes it possible to adjust the number of needed data replicas based on the database statistics and the replication coefficient. The architecture of the automatic scaling is shown in Fig. 1. The created architecture for replication coefficient scaling works as follows:
• The CloudWatch service watches the statistics of the individual data.
• If heavy reading of the records occurs, the statistics change accordingly.
• The invoked method receives the new statistics and the values are recalculated using formula (1).
  – If the reading or writing operations did not change the computed replication number, no change is made.
  – If the reading or writing operations did change the computed replication number, the replication coefficient of the affected tables is updated.
B. Scale down policy
Automatic scaling downwards makes it possible to adjust the number of needed data replicas based on the database statistics and the replication coefficient. The architecture of the automatic scaling is shown in Fig. 2. A sketch of the decision logic common to both policies is given below.
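The shared decision step of both policies can be sketched as follows; this is our illustration, not the authors' implementation, and how the fresh statistics arrive (for example from a CloudWatch-triggered job) is deliberately left out.

def rescale(current_factors, fresh_stats):
    """Recompute the per-table factors from fresh statistics and return only
    the tables whose replication factor has to change (up or down)."""
    desired = replication_coefficients(fresh_stats)     # rule from the sketch above
    changes = {}
    for table, new_factor in desired.items():
        old_factor = current_factors.get(table, 3)      # 3 is the HDFS default
        if new_factor != old_factor:
            changes[table] = new_factor                 # > old means scale up, < old scale down
    return changes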

4 Experiments

We used the model portrayed in Fig. 3 to verify the correct behavior of our application. For these purposes we watched the values produced during one term by the students completing the subject Basics of database systems. Using the statistics described above, we obtained the values, which we then substituted into formula (1) to get the average coefficient. Reliability in the experimental operation is understood as the system's ability to secure the data inserted into the system in the event of the failure of multiple data nodes. We measure this value over 10,000 attempts and monitor how many times the system did not provide us with the data in time, or did not provide the data at all.


This coefficient helped us to determine how many times the data are supposed to be replicated for the table, and the results were as follows:
• personal_data replication coefficient = 3,
• student replication coefficient = 3,
• study_subject replication coefficient = 3,
• subject_year replication coefficient = 3,
• teacher replication coefficient = 3,
• st_field replication coefficient = 2,
• st_program replication coefficient = 2,
• subject replication coefficient = 2.

We created the file with the identical structure based on these values. The replication coefficient values enter the system whenever a map-reduce operation runs, as an input parameter saying how many times the data will occur in the system. Based on this, we dynamically changed the replication values according to the data that appeared in the system. The Hadoop Distributed File System (HDFS) stores files as data blocks and distributes them across the whole cluster. The blocks are replicated several times to ensure high availability of the data, because HDFS was designed to be fault tolerant and to run on commodity hardware. The replication factor is a property that can be set in the HDFS configuration file (the dfs.replication property), which makes it possible to change the global replication factor for the whole cluster. For every block stored in HDFS, n − 1 additional replicas are distributed across the cluster. In the configuration process we added the corresponding property for every table for replication purposes. We did not change its location; by default it is placed in the conf/ folder.

The created configuration has the default replication coefficient set to 3. For the first five tables on the list this coefficient is correct, but for the last three tables it has to be changed to 2, so the following command is sent to HDFS, which ensures the change of the replication coefficient to 2 for the given path: hadoop fs -setrep -w 2 /my/replica_file. For these purposes we used the cloud of the Amazon company with the following configuration (Table 1). We chose the following configuration for Hadoop.
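Because a separate command is needed for every table whose coefficient differs from the default, a small driver can read the table-to-coefficient file and issue the corresponding commands. This is only a sketch based on the description above; the file format and the HDFS directory layout are assumptions, not the authors' tooling.

import subprocess

def apply_replication(plan_file, hdfs_root="/data"):   # hypothetical HDFS root directory
    """Each line of plan_file is '<table_name> <replication_coefficient>'."""
    with open(plan_file) as f:
        for line in f:
            if not line.strip():
                continue
            table, factor = line.split()
            # -w waits until the re-replication actually finishes
            subprocess.run(["hadoop", "fs", "-setrep", "-w", factor,
                            f"{hdfs_root}/{table}"], check=True)

# apply_replication("replica_plan.txt")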

Table 1 Cluster configuration

EC2 Instance   a1.medium, vCPU: 1, Mem (GiB): 2
EMR cluster    master: 1x m3.xlarge, core: 2x m4.4xlarge

Host namenode
  HostName ec2-18-216-40-160.us-east-2.compute.amazonaws.com
  User ubuntu
  IdentityFile ~/.ssh/MyLab_Machine.pem
Host datanode1
  HostName ec2-18-220-65-115.us-east-2.compute.amazonaws.com
  User ubuntu
  IdentityFile ~/.ssh/MyLab_Machine.pem
Host datanode2
  HostName ec2-52-15-229-142.us-east-2.compute.amazonaws.com
  User ubuntu
  IdentityFile ~/.ssh/MyLab_Machine.pem
Host datanode3
  HostName ec2-18-220-72-56.us-east-2.compute.amazonaws.com
  User ubuntu
  IdentityFile ~/.ssh/MyLab_Machine.pem

In the very first run of the experiments, with the recommended replication coefficient value of 3 and with the values obtained using our method of setting the replication coefficient, the time needed for the edit was as follows:
• with the default value, the time was 2 min and 38 s,
• with the edited value, the time was 2 min and 24 s.
We decided to replicate the data within one region instead of in a single availability zone, which reduces the number of errors caused by anything from a network outage to various disasters. Regarding the chosen replication coefficient value of 2, in the case of the failure of one node it could mean that we had the data in only one replica. As it turned out, the time required for distributed data processing in the system was shortened by using our method. However, this is not all we achieved with our approach. In the recommended solution, where each table has a replication coefficient of 3, up to 720 MB is needed for the listed tables at 30 MB per table (the tables listed at the beginning of the experiment). Using our method, this size was reduced to 630 MB, which ultimately reduced the storage space requirements and the time required to process the same amount of data.


The system's reliability at a replication coefficient of 3, over 10,000 experiments with a period of 5 min, was determined to be 99.99%. This result means that in only one attempt were the values not accessible to the user in the case of the failure of one node. We then performed the same experiment with the reduced replication coefficient and obtained precisely the same result. We monitored this fact directly on the console. When creating new records, we always had a margin of 10 s before requesting the data from the user side, and in one situation this led to the system failing to provide the data to the user. We were also able to reduce the total load of the nodes in the cluster by 2% per node. Related to this method is the reduction of the amount of storage needed to replicate the data in the data storage and of the computational units required to perform the required operations.

5 Conclusion

Distributed data processing nowadays plays a very important role in processing data in extensive data clusters. The main goal of data processing in extensive data clusters is dividing the data and processing it on several data nodes, thus ensuring better flexibility of the data processing and the possibility to perform operations nearly independently of each other. Exactly this way of manipulation has become a very effective tool for processing big data, which is very popular nowadays and is a key part of many industries. This chapter was devoted to the question of effective data replication, concretely to the replication coefficient, in order to optimize the number of replicas of individual data. We developed a replication method allowing us to make the way replicas are created more effective, based on the database statistics. The designed algorithm consists of two parts: the first part is based on the principle of obtaining statistical values about the total load of the individual database tables; the second part is based on effective data management able to increase, respectively reduce, the replication coefficient when the load of the tables changes, without decreasing the overall reliability or losing data in the system, thus ensuring that no data is lost in the case of a system drop-out and that constant operation is ensured. The experiments bring various useful information about the created method's performance and efficiency. After applying our replication method, the time needed for the total edit and data replication in the cluster decreased, and we no longer needed as many data nodes as would be required to preserve the standard replication coefficient. Based on the experiments, it can be seen that the created method works effectively not only at the beginning of the processing, but that it can also effectively adjust the number of replicas in standard operation according to the changing number of demanded records in the database. Our next work is devoted to the generalization of this model and to providing an API interface for full use of the created method, not only for the Amazon cloud but also for other databases such as MySQL, Postgres, and MS SQL. We also plan to evaluate the designed system empirically from the point of view of consistency and performance in other settings that require a fast reaction to data demands.

Acknowledgements This work was supported by the Grant System of the University of Zilina No. 1/2020 (8056).

References
1. Yang, T., Fu, C.P., Hsu, C.H.: File replication, maintenance, and consistency management services in data grids. J. Supercomput. (2010)
2. Abad, C.L., Lu, Y., Campbell, R.H.: DARE: adaptive data replication for efficient cluster scheduling. In: Proceedings—IEEE International Conference on Cluster Computing, ICCC (2011)
3. Wei, Q., Veeravalli, B., Gong, B., Zeng, L., Feng, D.: CDRM: a cost-effective dynamic replication management scheme for cloud storage cluster. In: Proceedings—IEEE International Conference on Cluster Computing, ICCC (2010)
4. Abawajy, J.H., Deris, M.M.: Data replication approach with consistency guarantee for data grid. IEEE Trans. Comput. (2014)
5. Khan, O., Burns, R., Plank, J., Pierce, W., Huang, C.: Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads. In: Proceedings of FAST 2012: 10th USENIX Conference on File and Storage Technologies (2012)
6. Powell, G., McCullough-Dieter, C.: Indexes and clusters. In: Oracle SQL: Jumpstart with Examples (2005)
7. Kvet, M., Toth, S., Krsak, E.: Concept of temporal data retrieval: undefined value management. Concurrency Computat. Pract. Exp. 32(13) (2019). https://doi.org/10.1002/cpe.5399
8. Kvet, M.: Data distribution in ad-hoc transport network. In: 2019 International Conference on Information and Digital Technologies (IDT), June 2019. https://doi.org/10.1109/dt.2019.8813437
9. Janech, J., Tavac, M., Kvet, M.: Versioned database storage using unitemporal relational database. In: IEEE 15th International Scientific Conference on Informatics, November 2019. https://doi.org/10.1109/informatics47936.2019.9119269
10. Wiktorski, T.: NOSQL databases. In: Advanced Information and Knowledge Processing (2019)
11. Kvet, M., Matiasko, K.: Temporal flower index eliminating impact of high water mark. In: Communications in Computer and Information Science (2018)
12. Carns, P., Ligon III, W., Ross, R., Thakur, R.: PVFS: a parallel file system for Linux clusters. In: Proceedings of the 4th Annual Linux Showcase and Conference (2000)
13. Gellman, R.: Privacy in the clouds: risks to privacy and confidentiality from cloud computing. World Privacy Forum (2009)
14. Xie, S., Cheng, Y.: RAFR: a high reliability replication algorithm for cloud storage. In: CSAE 2012—Proceedings, 2012 IEEE International Conference on Computer Science and Automation Engineering (2012)
15. Kvet, M., Matiasko, K.: Time as the important factor of the data retrieval—table type classification. In: Advances in Intelligent Systems and Computing (2017)
16. de Haan, L.: Mastering Oracle SQL and SQL*Plus (2005)
17. Xiong, J., Li, J., Tang, R., Hu, Y.: Improving data availability for a cluster file system through replication. In: IPDPS Miami 2008—Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM (2008)

Time Dependent Reliability Analysis of the Data Storage System Based on the Structure Function and Logic Differential Calculus Patrik Rusnak and Michal Mrena

Abstract Nowadays, data storage systems are an integral part of our lives. Therefore, the main focus is on making such systems reliable and accessible at all times. Reliability analysis of such systems provides insight into the components and topological parts in which a system is most vulnerable. Several approaches can be chosen to represent the system; one of the most used is known as the structure function. This approach allows us to represent a system of any complexity and also allows us to use the tools of logic algebra such as logic differential calculus. The aim of this work is to show the use of logic differential calculus and the structure function for the calculation of time-dependent importance measures of components in the time-dependent reliability analysis of a selected data storage system. After the computation of all importance measures, the problem areas of the data storage will be identified from a reliability point of view. Keywords Structure function · Logic differential calculus · Reliability analysis · Importance measures

1 Introduction

Reliability engineering is a multidisciplinary scientific field that provides the methods necessary to quantify the reliability of a system, to test the design of the system, to analyze the system and its components, etc. An important step in the reliability evaluation of systems is the development of their mathematical representation. As has been shown in [1], this mathematical representation must allow investigating the system failure, e.g., the mechanisms of failure and its consequences; measuring system reliability; analyzing

P. Rusnak (B) · M. Mrena Faculty of Management Science and Informatics, University of Zilina, Zilina, Slovakia e-mail: [email protected] M. Mrena e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_11


critical states of system reliability; and elaborating maintenance of the system, fault diagnosis and prognosis. The most often used mathematical representation of a system in reliability analysis is a model that takes into account two important states of the system: failure and working state. This mathematical model is known as a Binary-State System (BSS), it mainly uses Boolean logic, and it was one of the first to be introduced [2–4]. It will also be used in this work. Another mathematical representation of a system is known as a Multi-State System (MSS). This representation can work with more than two performance levels and is used to define multiple states for the system and its components and to perform a more detailed reliability analysis of the states of the system or its components [5, 6]. There are various methods to evaluate system reliability and failure based on these mathematical models. All these methods can be divided into four groups depending on the mathematical background [1, 3]: methods based on the structure function, stochastic methods, Monte-Carlo simulation and methods based on the universal generating function. The structure-function-based methods permit mathematically representing a system of any structural complexity [2] and will be used in this work. The structure function defines the unambiguous dependency of the system performance level on the component states and is used to represent systems composed of n components [7]. The structure function can be viewed as a Boolean function for a BSS and can easily be used in reliability analysis of the steady-state system [7, 8]. Such a mathematical representation is time-independent. An important advantage of this representation is the possibility to use the well-developed and useful mathematical approach of Boolean algebra in the reliability evaluation of the investigated system. Effective methods in reliability analysis were developed with the application of Boolean algebra for minimal cut/path set definition [8], frequency characteristics of system reliability [7], or importance measures calculation [9]. The structure function plays a relevant role in modern developments in reliability analysis, for example, in the case of multi-function system reliability [10], the general multilinear expression of the structure function of an arbitrary semi-coherent system [11], or graph models and algorithms for reliability assessment [12]. The disadvantage of these methods is that the analysis of systems is performed in the stationary state. On the other hand, the structure function in the form of a Boolean function can be used for the calculation of the reliability function of the system, which represents the probability of the system being in a functioning state during its mission time or a specific time. In this case, special methods for this calculation should be developed [3]. Although the reliability function is important in reliability analysis, it is not sufficient to give a complete picture of system reliability. Another necessary constituent of reliability evaluation is importance analysis. Methods for the calculation of Importance Measures (IMs), which quantify the influence of the system components on the whole system, based on the application of system representation by the structure function and logic differential calculus, have been considered in [13, 14] for a system in the stationary state. Taking all the previously mentioned information into account, the structure function in the form of a Boolean function is a simple mathematical representation of a system in reliability analysis which can be formed for a system of any structural complexity and evaluated based on well-developed methods related to Boolean functions. The disadvantage of this mathematical representation is the impossibility of time-dependent analysis. Therefore, the development of new approaches for reliability analysis of systems based on the structure function that allow time-dependent analysis of the system is needed. One such approach can be based on the application of logic differential calculus. This approach can then be used to perform time-dependent reliability analysis on systems described by the structure function. Data storage systems are one such group of systems, especially when the Redundant Array of Independent Disks (RAID) [15] is considered.

2 Structure Function

The structure function is a mapping that defines values of system state for each combination of states of the system components. If we assume that the system is composed of n components, then this mapping is as follows [3]:

φ(x1, x2, . . . , xn) = φ(x) : {0, 1}^n → {0, 1},   (1)

where xi is a variable that defines the state of component i for i = 1, 2, . . . , n and x = (x1, x2, . . . , xn) is a vector of states of the system components (state vector). For example, let us consider a data storage system consisting of two main modules, in which the same data is stored. In the first part, two hard drive disks (HDDs) are organized in RAID 0. In RAID 0, the capacity of the unit is equal to the sum of capacities of the used drives, which implies no redundancy of data. Therefore, failure of one drive means that the entire RAID 0 is lost. In the second part, a single HDD is used to store data. At least one part must be in working state to write and read data successfully. According to the system description, its structure function can be represented by the following logic expression:

φ(x1, x2, x3) = x1 ∧ x2 ∨ x3,   (2)

where the operator ∧ represents the Boolean operation AND and the operator ∨ represents the Boolean operation OR. From this point forward, we will assume that the analyzed system is coherent, which means that the structure function φ(x) is non-decreasing in each of the variables and all the components are relevant for system operation [3, 14]. The data storage system is an example of a coherent system because its structure function is non-decreasing in each of the variables, and each component (HDD) is needed for system operation. This means that there is no situation in which the repair of an HDD would result in a system failure, or in which the failure of an HDD would repair a failed system.
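As a small illustration of our own (not part of the original text), the structure function (2) and the coherence property described above can be checked directly in Python; the function names are chosen only for this sketch.

from itertools import product

def phi(x1, x2, x3):
    """Structure function (2): RAID 0 branch (x1 AND x2) in parallel with a single HDD x3."""
    return (x1 & x2) | x3

def is_coherent(structure, n):
    """A system is coherent if the structure function is non-decreasing in every variable."""
    for i in range(n):
        for state in product((0, 1), repeat=n):
            lowered = list(state); lowered[i] = 0
            raised = list(state); raised[i] = 1
            if structure(*lowered) > structure(*raised):
                return False
    return True

print(is_coherent(phi, 3))   # True for the example data storage system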


Knowledge of the structure function allows us to investigate topological properties of the system. For example, we can use it to find the most reliable topology in a set of systems with different topologies [8] or evaluate the importance of the components of a system and find those with the greatest influence on system operation from a topological point of view [8, 13, 16–18]. However, its knowledge is not sufficient for performing time-dependent reliability analysis, which deals with the evaluation of the reliability of the system over time. For such purposes the system state function can be used. The system state at time t can be obtained from the system state function z(t) that has the following form [3]:

z(t) = φ(x(t)) = φ(x1(t), x2(t), . . . , xn(t)) : [0, ∞) → {0, 1},   (3)

where xi(t) for i = 1, 2, . . . , n is a function that defines the state of the i-th component at time t. Although the system state function z(t) is closely related to the structure function φ(x), these two functions are very different in their nature because the former is a function of time, while the latter is a function defining the system topology, which is independent of time. If we consider the data storage system, its state function has the following form:

φ(x1(t), x2(t), x3(t)) = x1(t) ∧ x2(t) ∨ x3(t).   (4)

The system state function can be viewed as a composition of the system structure function and one specific realization of the state functions of all the system components, which means that the system state function z(t) can also be viewed as one realization of uncountably many system state functions. This implies that the evolution of the system over time can be viewed as the following stochastic process: {Z(t); t ≥ 0},

(5)

where Z (t) is a random variable modelling behavior of the system at time t. Let us evaluate function Z (t) at fixed time. In such a case, we obtain random variable X that takes value from set {0, 1} with probability A or U . These probabilities are known as system availability and unavailability, and they represent one of the basic reliability characteristics of a BSS [3]. In terms of single system component, those probabilities are pi and qi and are defined as follows [3]: pi = Pr{xi = 1}, qi = Pr{xi = 0}, pi + qi = 1.

(6)

If we know random variable xi , which models behavior of component i at fixed time, for each system component, i.e., for i = 1, 2, . . . , n, and if we assume that the components are independent, then random variable X can be obtained by combining


random variables xi using the structure function. This allows us to compute the system state probabilities using the following formula [3]: p = Pr{φ(x) = 1}, q = Pr{φ(x) = 0},

(7)

where x = (x1 , x2 , . . . , xn ) is a vector of random variables modeling behavior of the system components at fixed time. This definition implies that the system availability A and unavailability U can be viewed as functions of component state probabilities [3]: A = A( p) = Pr{φ(x) = 1}, U = U (q) = Pr{φ(x) = 0}, A + U = 1,

(8)

where p = ( p1 , p2 , . . . , pn ) and q = (q1 , q2 , . . . , qn ) are vectors whose elements are the state probabilities of individual system components. Formula (8) allows us to find the system state probabilities if we know the structure function of the system and the state probabilities of the components. It can be used to investigate how specific changes in state probabilities of one or more components influence the system state probabilities or other reliability measures [3, 14], but it does not allow us to perform dynamic (time-dependent) analysis of a BSS. For this task, random variable X has to be replaced by function Z (t), which defines how properties of random variable X changes over time. In this case, the system availability A(t) and unavailability U (t) become functions of time, i.e.: A(t) = A(P(t)) = Pr{φ(x(t)) = 1}, t ≥ 0, U (t) = U (Q(t)) = Pr{φ(x(t)) = 0}, t ≥ 0,

(9)

A(t) + U (t) = 1, t ≥ 0, where P(t) = (P1 (t), P2 (t), . . . , Pn (t)) and Q(t) = (Q 1 (t), Q 2 (t), . . . , Q n (t)) are vector-valued functions, whose elements are functions defining the state probabilities of individual system components over time, and x(t) = (x1 (t), x2 (t), . . . , xn (t)) is a vector of random variables modelling behavior of the system components over time. This function can be used to find how reliability of the system or importance of the components change as time flows. The most important result of previous formulae is that the system state probabilities can be viewed as a function of the components state probabilities combined using the structure function (static analysis based on (8)) or as a composition of functions defining the state probabilities of the system components over time (timedependent analysis based on (9)) defined again by the structure function. This means if the system components are independent and we know the structure function of the system and the (time-dependent) state probabilities of the components, then we are able to find the (time-dependent) system state probabilities. As one can see, a BSS


can be analyzed with respect to time (dynamic analysis) or regardless of time (static analysis). This implies that reliability measures of a BSS might or might not depend on time. In reliability analysis, it is also necessary to compute the system reliability R, which represents the probability that the system will be functioning during its mission time (the period of time during which the system is required to operate properly). It should be pointed out that the system reliability has the same meaning as the system availability for unrepairable systems, which are the main focus of this work. Therefore, the system reliability is defined as follows [3, 14]:

R = R(p) = Pr{φ(x) = 1},   (10)

where p = (p1, p2, . . . , pn) is a vector of probabilities of the components being functional during the mission time and pi is the probability that component i will be functioning during the mission time (it agrees with the reliability of the component). A complementary measure to system reliability is system unreliability, which agrees with the probability that the system will fail during the mission time [3, 14]:

F = F(q) = Pr{φ(x) = 0} = 1 − R(p),   (11)

where q = (q1, q2, . . . , qn) is a vector of unreliabilities of the components and qi = 1 − pi is the probability of a failure of component i during the mission time (it agrees with the unreliability of the component). As an example, we will compute R and F for the data storage system. Thanks to the fact that the storage system has a parallel topology with a serial branch of two HDDs and a single HDD in the other branch, the R of the data storage system has the following form:

R = p1 ∗ p2 + p3 − p1 ∗ p2 ∗ p3.   (12)

In case of the F for the data storage system, it can be easily computed as follows F = 1 − R = q1 ∗q3 +q2 ∗q3 −q1 ∗q2 ∗q3 . In this example, we will be working with same HDDs with p = 0.8. Therefore, the R and F of the data storage system is as follows: R = 0.8 ∗ 0.8 + 0.8 − 0.8 ∗ 0.8 ∗ 0.8 = 0.928 and F = 1 − 0.928 = 0.072. Definitions of system reliability and unreliability are computed for the whole mission time of the system, but they do not take the specific time values into account. Therefore, they allow us to compute reliability or unreliability of the system only for given values of reliabilities/unreliabilities of the components. If we want to find functions R(t) and F(t) that define time courses of system reliability and unreliability (occurrence of system failure), we have to replace vector x by its time-dependent version. Similarly, vector p of reliabilities of the components has to be replaced by P(t) = (P1 (t), P2 (t), . . . , Pn (t)) and vector q of unreliabilities of the components by time-dependent vector Q(t) = (Q 1 (t), Q 2 (t), . . . , Q n (t)) [14]. In this case, Q i (t) is lifetime distribution of component i. This distribution defines the probability that the


component fails no later than at time t, given it is working at time 0. After finishing this process, we obtain functions R(t) and F(t) that are defined as follows:

R(t) = R(P(t)) = Pr{φ(x(t)) = 1},   (13)

F(t) = F(Q(t)) = Pr{φ(x(t)) = 0} = 1 − R(t).   (14)

The procedure described above allows us to find reliability and unreliability (failure) function of the system if the structure function of the system is known, and we have information about lifetime distributions of all the system components. This proves that structure function, which is a static representation of the system (it defines system topology independent of time), can be used in time-dependent (dynamic) reliability analysis. We will show how the R(t) and F(t) can be computed on the data storage system. We will assume that each HDD is independent and because all HDDs have the same type, they are also identically distributed. Furthermore, we will be working with exponential distribution with λ = 1/5000 days as lifetime distribution of each HDD. Therefore, the R(t) and F(t) for data storage system has the following form: R(t) = P1 (t) ∗ P2 (t) + P3 (t) − P1 (t) ∗ P2 (t) ∗ P3 (t).

(15)

F(t) = 1 − R(t) = Q 1 (t) ∗ Q 3 (t) + Q 2 (t) ∗ Q 3 (t) − Q 1 (t) ∗ Q 2 (t) ∗ Q 3 (t) (16) The computed values of R(t) and F(t) for 10 000 days are depicted in Fig. 1, where R(t) is shown as a blue solid line and F(t) is shown as a red dotted line.

Fig. 1 Reliability and unreliability function of the data storage system
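The curves in Fig. 1 can be reproduced with a few lines of Python; this is our numerical sketch under the stated assumption of independent HDDs with exponential lifetimes and λ = 1/5000 per day, not the authors' code.

import math
from itertools import product

LAMBDA = 1.0 / 5000.0   # failure rate of each HDD per day (exponential lifetimes)

def phi(x1, x2, x3):
    return (x1 & x2) | x3          # structure function (2)

def system_reliability(p):
    """Sum over all working state vectors of the product of component probabilities."""
    r = 0.0
    for x in product((0, 1), repeat=3):
        if phi(*x):
            r += math.prod(p[i] if x[i] else 1 - p[i] for i in range(3))
    return r

for t in (1000, 5000, 10000):
    p_t = [math.exp(-LAMBDA * t)] * 3      # P_i(t) for identical, independent HDDs
    r_t = system_reliability(p_t)
    print(t, round(r_t, 4), round(1 - r_t, 4))   # R(t) and F(t)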


3 Logic Differential Calculus

Definition (1) of the structure function corresponds to the definition of a Boolean function [13]. This means it is possible to use the mathematical methodology of Boolean algebra in reliability analysis based on structure functions. One useful part of this methodology, which can be used to analyze how the failure of a component affects the system operation, is logic differential calculus [19]. The central term of this methodology is the logic derivative defined as follows [13, 19]:

∂φ(x)/∂xi = φ(xi, x) ⊕ φ(¬xi, x) = φ(0i, x) ⊕ φ(1i, x),   (17)

where the first operand of XOR ⊕ is the structure function of the system when component i is in state 0, and the second is the structure function when component i is in state 1. For example, the logic derivative of (2) with respect to variable x2 has the following form:

∂φ(x1, x2, x3)/∂x2 = φ(x1, 0, x3) ⊕ φ(x1, 1, x3) = (x1 ∧ 0 ∨ x3) ⊕ (x1 ∧ 1 ∨ x3) = x3 ⊕ (x1 ∨ x3) = x1 ∧ ¬x3.   (18)

From the resulting logic derivative for variable x2 it is possible to say that a change of the value of x2 will result in a change of the value of the Boolean function φ if the variable x1 has value 1 and the variable x3 has value 0. The logic derivative (17) can be used to analyze how a change of a component state affects the system state [8, 13]. However, in order to analyze the direction of the component state change, a direct partial logic derivative (DPLD) is needed. This type of logic derivative can be used to analyze how a specific change of a component state (from 0 to 1 or from 1 to 0) affects the system functionality (from 0 to 1 or from 1 to 0). This direct derivative is defined as follows [20]:

∂φ(1 → 0)/∂xi(1 → 0) = ∂φ(0 → 1)/∂xi(0 → 1) = ¬φ(0i, x) ∧ φ(1i, x),   (19)

where ∧ denotes the Boolean operation AND and ¬ denotes the negation of the argument interpreted as a Boolean function. Given the Boolean function (2), the direct partial logic derivatives with respect to variable x2 have the following forms:


∂φ(1 → 0)/∂x2(1 → 0) = ¬(x1 ∧ 0 ∨ x3) ∧ (x1 ∧ 1 ∨ x3) = ¬x3 ∧ (x1 ∨ x3) = x1 ∧ ¬x3,
∂φ(0 → 1)/∂x2(0 → 1) = ¬(x1 ∧ 0 ∨ x3) ∧ (x1 ∧ 1 ∨ x3) = ¬x3 ∧ (x1 ∨ x3) = x1 ∧ ¬x3.   (20)

It is possible to see, that DPLDs are the same as the derivation (18). It is important to point out that the logic derivative (17) is composed of direct and inverse (change of the variable value result in the opposite change of the function value) partial logic derivatives that are connected using Boolean operation OR. In reliability analysis, all previously mentioned DPLDs can be mostly used to find critical states of the system [8, 13], which describe situations in which a failure/repair of one or more system components results in a failure/repair of the system. They can also be used to compute importance measures, which will be presented later.
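For completeness, a tiny sketch of our own shows how the DPLD (19) can be evaluated by brute force for the example system; it reuses the phi function from the earlier sketch.

from itertools import product

def dpld_failure(structure, i, n):
    """Return the set of state vectors of the remaining components for which
    the failure of component i (1 -> 0) causes the system failure (1 -> 0)."""
    critical = set()
    for x in product((0, 1), repeat=n):
        up = list(x); up[i] = 1
        down = list(x); down[i] = 0
        if structure(*up) == 1 and structure(*down) == 0:
            critical.add(tuple(v for j, v in enumerate(x) if j != i))
    return critical

def phi(x1, x2, x3):
    return (x1 & x2) | x3

print(dpld_failure(phi, 1, 3))   # component x2 is critical only for (x1, x3) = (1, 0)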

4 Importance Measures In previous chapters, the theoretical background for reliability analysis like structure function, reliability, unreliability, availability and unavailability were presented. Reliability and Unreliability are useful measures of the system for reliability analysis, but they do not measure the importance of components for the functioning of the system. Logic differential calculus was also introduced as an approach that can be used for structure function to determine how failure of a system component affects the system operation. An important part of reliability analysis is an estimation of influence of a component or a group of components on system operation. Such estimation is implemented by IMs [14] and can be used, for example, to optimize system reliability or to plan its maintenance. There are many IMs, and each of them takes into account different factors that make a system component more important than others. According to [14], IMs can be divided into three categories: structure, reliability, and lifetime IMs. In this chapter, IMs will be presented as well as standard computation and computation that uses DPLD for structure and reliability IMs. In case of lifetime IMs, they are standardly computed by using the reliability function. The approach that has been developed for lifetime IMs computation based on the structure function and DPLDs will be depicted [21]. The efficiency of the new approach will be shown on selected systems in the next chapter. This part is based on results presented in [21].


4.1 Reliability Importance Measures

Reliability IMs take into account not only the system structure in the form of the structure function but also the probabilities of the components functioning and failing [14]. The best known reliability IM is Birnbaum's Importance (BI), which takes into account the system topology and the probabilities of the components being functional and failed. This measure can be computed using the reliability as follows [14]:

BIi = ∂R/∂pi,   (21)

and it agrees with the probability that a failure of component i results in system failure, i.e., with the probability that the component is critical for the system. We will show its computation for the data storage system with the same type of HDD with probability p = 0.8 of functioning during the mission time. As a first step, we compute the partial derivative for each HDD and then use it to compute BI according to (21). For HDD 2, which is represented by the Boolean variable x2, ∂R/∂p2 is p1 − p1 ∗ p3 and therefore BI2 = 0.8 − 0.8 ∗ 0.8 = 0.16. As for the other HDDs, BI1 = p2 − p2 ∗ p3 = 0.16 and BI3 = 1 − p1 ∗ p2 = 0.36. From those values we can conclude that a failure of HDD 3 is the most problematic, because this failure will result in system failure with the highest probability. Alternatively, BI can also be computed using DPLD as follows [13]:

BIi = Pr{∂φ(1 → 0)/∂xi(1 → 0) = 1}.   (22)

We will show its computation for the data storage system with the same type of HDD with probability p = 0.8 of functioning during the mission time. We first compute the DPLD for each HDD and then use it to compute BI. For HDD 2, which is represented by the Boolean variable x2, the DPLD ∂φ(0 → 1)/∂x2(0 → 1) is x1 ∧ ¬x3, and by transforming it into a probabilistic form we get BI2 = p1 − p1 ∗ p3 = 0.16. As for the other HDDs, BI1 = p2 − p2 ∗ p3 = 0.16 and BI3 = 1 − p1 ∗ p2 = 0.36. Another useful type of reliability IM is the Criticality Importance (CI). This IM extends the BI and corresponds to the probability that the system failure has been caused by a failure of component i, given that the system has failed [14]. This is shown by the following formula:

CIi = BIi · qi/F,   (23)

and it can be used to find components whose failures have resulted in system failure with the greatest probability when we know that the system has failed. It is typically used in system maintenance to identify components whose repairs will result in system repair with the greatest probability [3, 14]. As it was for BI, we will compute


Fig. 2 Time-dependent BI measures for HDDs of the storage system (BI versus time in days for HDD 1, HDD 2 and HDD 3)

the CI for each HDD in the data storage system. In the case of HDD 1, using BI1 = 0.16, the probability of HDD failure during the mission time q = 1 − p = 0.2, and the system unreliability F = 0.072, we can compute CI1 = 0.16 ∗ 0.2/0.072 = 0.444. As for the other HDDs, CI2 = 0.444 and CI3 = 1, which means that if we choose to repair HDD 3 when it is in a non-working state, the system will start to work, given that the system had failed.
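These values are easy to verify; the following sketch (ours, not the authors' code) computes BI as the probability of the corresponding DPLD being equal to 1 and derives CI from it.

from itertools import product
import math

p = [0.8, 0.8, 0.8]                      # component reliabilities
q = [1 - v for v in p]

def phi(x1, x2, x3):
    return (x1 & x2) | x3

def prob_of(event):                      # event: state vector -> 0/1
    total = 0.0
    for x in product((0, 1), repeat=3):
        if event(*x):
            total += math.prod(p[i] if x[i] else q[i] for i in range(3))
    return total

F = prob_of(lambda *x: 1 - phi(*x))      # system unreliability, 0.072
for i in range(3):
    bi = prob_of(lambda *x: phi(*[1 if j == i else x[j] for j in range(3)])
                            & (1 - phi(*[0 if j == i else x[j] for j in range(3)])))
    ci = bi * q[i] / F
    print(f"HDD {i + 1}: BI = {bi:.2f}, CI = {ci:.3f}")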

4.2 Lifetime Importance Measures

Reliability IMs assume that the state probabilities pi and qi of the components are known and do not depend on time. If we know how these probabilities change in time, we can investigate how the BI and CI of the components vary during the system mission. For this purpose, lifetime IMs are used [14]. These IMs depend on the positions of the components within the system and on the components' lifetime distributions. The first IM that will be presented in this chapter is the time-dependent BI of component i at time t. This IM agrees with the probability that the system is, at time t, in a state in which component i is critical for the system, and it is standardly computed by partial differentiation of the reliability function R(t) with respect to Pi(t) [14]:

BIi(t) = ∂R(t)/∂Pi(t).   (24)

We will show its computation for the data storage system with the same type of HDDs, which are independent and have the exponential distribution with λ = 1/5000 days as the lifetime distribution. Using the system reliability function, it is possible to compute the derivative for each HDD and then use it to compute the time-dependent BI. Therefore, the time-dependent BIs are BI1(t) = ∂R(t)/∂P1(t) = P2(t) − P2(t) ∗ P3(t), BI2(t) = ∂R(t)/∂P2(t) = P1(t) − P1(t) ∗ P3(t) and BI3(t) = ∂R(t)/∂P3(t) = 1 − P1(t) ∗ P2(t).


Time courses of those IMs are depicted in Fig. 2, where BI1(t) is represented by a green line, BI2(t) by a red dotted line and BI3(t) by a blue line. From those time courses we can conclude that a failure of HDD 3 is the most problematic throughout the whole time, because this failure will result in system failure with the highest probability, and its value rises sharply over the first 5,000 days. As for the other HDDs, their importance rises slightly for around 3,000 days and then starts to decrease slowly. This is mostly caused by their placement in series and by the fact that HDD 3 is in parallel with them. The previously mentioned approach for the computation of the time-dependent BI uses the reliability function, which can be obtained by a transformation of the structure function. This process is shown in Fig. 3 on the left side. Another possibility, which we suggest in this work, is to use DPLDs for the computation of the time-dependent BI. This approach will allow us to use other IMs that can be computed using DPLDs in time-dependent reliability analysis, as is shown in Fig. 3 on the right side, and it is defined as follows:

BIi(t) = Pr{∂Z(1 → 0, t)/∂xi(1 → 0, t) = 1}.   (25)

This new way of computing the time-dependent BI is based on the same approach as was shown previously in the case of the structure function, thanks to the fact that the structure function and the DPLD are both Boolean functions; in the case of the DPLD, the meaning of the reliability (availability) function changes to the time-dependent BI.

Fig. 3 Showing the new approach for computation of the time-dependent BI


We will show its computation on the data storage system. According to this approach, we need to compute the DPLD for each HDD. These DPLDs are shown in Table 1. In the next step, we transform them into the time-dependent probabilistic form and thereby obtain the time-dependent BI for each HDD. These BIs are BI1(t) = P2(t) ∗ Q3(t) = P2(t) − P2(t) ∗ P3(t), BI2(t) = P1(t) ∗ Q3(t) = P1(t) − P1(t) ∗ P3(t) and BI3(t) = Q1(t) + Q2(t) − Q1(t) ∗ Q2(t) = 1 − P1(t) ∗ P2(t). As we can see, those time-dependent BIs are exactly the same as in the case of the normally used approach. Another useful time-dependent IM is CIi(t). This IM can be computed from BIi(t) as follows [14]:

CIi(t) = BIi(t) · Qi(t)/F(t),   (26)

and it corresponds to the probability that component i has failed by time t and that component i is critical for the system at time t, given that the system has failed by time t [14]. The time-dependent CIs for each HDD of the data storage system are

CI1(t) = BI1(t) · Q1(t)/F(t) = (Q1(t) ∗ Q3(t) − Q1(t) ∗ Q2(t) ∗ Q3(t)) / (Q1(t) ∗ Q3(t) + Q2(t) ∗ Q3(t) − Q1(t) ∗ Q2(t) ∗ Q3(t)),
CI2(t) = BI2(t) · Q2(t)/F(t) = (Q2(t) ∗ Q3(t) − Q1(t) ∗ Q2(t) ∗ Q3(t)) / (Q1(t) ∗ Q3(t) + Q2(t) ∗ Q3(t) − Q1(t) ∗ Q2(t) ∗ Q3(t)), and
CI3(t) = BI3(t) · Q3(t)/F(t) = (Q1(t) ∗ Q3(t) + Q2(t) ∗ Q3(t) − Q1(t) ∗ Q2(t) ∗ Q3(t)) / (Q1(t) ∗ Q3(t) + Q2(t) ∗ Q3(t) − Q1(t) ∗ Q2(t) ∗ Q3(t)) = 1.

Table 1 DPLD for each HDD of the storage system

Number of HDD   DPLD
1               x2 ∧ ¬x3
2               x1 ∧ ¬x3
3               ¬x1 ∨ ¬x2

Fig. 4 Time-dependent CI measures for HDDs of the storage system (CI versus time in days for HDD 1, HDD 2 and HDD 3)


Their time courses can be seen in Fig. 4, where C I 1 (t) is represented by green line, C I 2 (t) is represented by red dotted line and C I 3 (t) is represented by a blue solid line. From those time courses, we can conclude that repair of the HDD 3 will surely result in system repair at any time point if we know that the system has failed because its value is always 1. This is caused by its placement in one branch in the parallel topology of the system. As for the other HDDs, their value of time-dependent CI slowly decreases as time flows, which means that repair of one of those HDDs will result in system repair with less probability at a later time.
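The curves in Figs. 2 and 4 can be reproduced by evaluating the formulas above on a time grid; the following fragment is an illustration of ours, not the authors' code, under the exponential lifetime assumption used in this example.

import math

LAM = 1.0 / 5000.0                       # per-day failure rate of every HDD

def Q(t):                                # lifetime distribution of a single HDD
    return 1.0 - math.exp(-LAM * t)

for t in range(0, 10001, 2500):
    q1 = q2 = q3 = Q(t)
    p1, p2, p3 = 1 - q1, 1 - q2, 1 - q3
    bi = [p2 - p2 * p3, p1 - p1 * p3, 1 - p1 * p2]       # time-dependent BI of HDD 1, 2, 3
    f = q1 * q3 + q2 * q3 - q1 * q2 * q3                 # system unreliability F(t)
    ci = [bi[i] * q / f if f > 0 else float("nan")
          for i, q in enumerate((q1, q2, q3))]
    print(t, [round(v, 3) for v in bi], [round(v, 3) for v in ci])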

5 Case Study

We will show how the time-dependent reliability analysis based on the structure function can be performed, and how the time-dependent BI and CI measures can be computed with the use of DPLDs, on the following task of choosing a specific RAID for a data storage system with eight HDDs from the reliability and capacity points of view. This data storage system is composed of four HDDs of type HGST HUH721212ALN604 (type 1), which are labelled as HDD 1, 2, 3 and 4, and four HDDs of type Seagate ST12000NM0007 (type 2), which are labelled as HDD 5, 6, 7 and 8. The capacity of each HDD is 12 TB and the HDDs are located in the system according to their label. A selected HDD can be in two states: the first state represents a working HDD that can be used to store or retrieve data, and the second state represents a failed HDD that cannot be used to store or retrieve data. The storage system can also be in two states: either the system is working (data can be retrieved from it or stored in it) or it is in the failed state (data cannot be retrieved from it or stored in it). As the desired RAIDs for the analysis, RAID 0 + 1 and RAID 1 + 0 were chosen as the main interest of the data storage owner. These RAIDs are known as nested RAIDs [15], because they are combinations of two different RAIDs: RAID 0 + 1 is a RAID 1 that uses RAID 0 arrays as its storage units, and RAID 1 + 0 is the reverse order. This is useful for finding a good tradeoff between the capacity and reliability demands. According to this requirement, we chose to focus on two different variants for each desired RAID, and we also added RAID 0 and RAID 1 to further illustrate our findings on those extreme cases. RAID 1 is known for mirroring, which means that each HDD holds the same data, and if at least one HDD is working, the data can be accessed [15]. Taking all of this into account, the RBDs for each RAID are shown in Fig. 5. The state of HDD i is represented by a block with the Boolean variable xi.


Fig. 5 RBD for each RAID

According to the RBDs shown in Fig. 5, the structure functions for each RAID are as follows:

φR0(x1, . . . , x8) = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x6 ∧ x7 ∧ x8,
φR01(x1, . . . , x8) = (x1 ∧ x2 ∧ x3 ∧ x4) ∨ (x5 ∧ x6 ∧ x7 ∧ x8),
φR01_2(x1, . . . , x8) = (x1 ∧ x2) ∨ (x3 ∧ x4) ∨ (x5 ∧ x6) ∨ (x7 ∧ x8),
φR1(x1, . . . , x8) = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x5 ∨ x6 ∨ x7 ∨ x8,
φR10(x1, . . . , x8) = (x1 ∨ x2 ∨ x3 ∨ x4) ∧ (x5 ∨ x6 ∨ x7 ∨ x8),
φR10_2(x1, . . . , x8) = (x1 ∨ x2) ∧ (x3 ∨ x4) ∧ (x5 ∨ x6) ∧ (x7 ∨ x8),   (27)

where x1, x2, x3, x4 are Boolean variables representing the states of the HDDs of type 1 and x5, x6, x7, x8 represent the states of the HDDs of type 2. From this point onward, we will show the computation in the quantitative reliability analysis with results for RAID 0 + 1, together with the results for the other RAIDs. The first step is to compute the reliability and the reliability function of each RAID. The reliability and the reliability function of RAID 0 + 1 are as follows:

RR01 = p1 ∗ p2 ∗ p3 ∗ p4 + p5 ∗ p6 ∗ p7 ∗ p8 − p1 ∗ p2 ∗ p3 ∗ p4 ∗ p5 ∗ p6 ∗ p7 ∗ p8,
RR01(t) = P1(t) ∗ P2(t) ∗ P3(t) ∗ P4(t) + P5(t) ∗ P6(t) ∗ P7(t) ∗ P8(t) − P1(t) ∗ P2(t) ∗ P3(t) ∗ P4(t) ∗ P5(t) ∗ P6(t) ∗ P7(t) ∗ P8(t),   (28)


where pi and Pi(t) for i = 1, 2, . . . , 8 represent the probability that the i-th component is working during its mission time or at the specific time t. The lifetime distribution of each component in this system agrees with the exponential distribution, where λi = 1/MTTFi for i = 1, 2, . . . , 8. The MTTFs of the HDDs were obtained from data published by the Backblaze storage company in 2019 [22]. The MTTF of each component can be seen in Table 2. According to the reliability function of each RAID, the MTTFs shown in Table 2 and the lifetime distributions of the HDDs, it is possible to compute their reliability functions over a specific time period (up to 300,000 days). The result of this computation can be seen in Fig. 6. From it we can see that the most reliable RAID is RAID 1 and the most unreliable is RAID 0, as expected. As for the other RAIDs, the most reliable ones are RAID 0 + 1 of type 2, then RAID 0 + 1, and RAID 1 + 0. This follows mostly from the topology and from the MTTF of each HDD. In order to better understand how each HDD influences the system from the reliability point of view, the BI and CI of each HDD can be computed. In this storage system, we have only two types of HDDs and the placement of the HDDs of each type in the system is similar. This means that the values of the IMs for HDDs 1–4 will be the same, as well as for HDDs 5–8. Thanks to this, we will focus on how each type of HDD affects the storage system. The first IM that will be computed is the time-dependent BI. If we use DPLDs, the time-dependent BI for RAID 0 + 1 has the following form:

Table 2 MTTF for each component of the storage system

Component   Component name            MTTF [days]
1–4         HGST HUH721212ALN604      91,129.75
5–8         Seagate ST12000NM0007     14,028.64

Fig. 6 Reliability function for each RAID (reliability versus time in days, up to 300,000 days, for RAID 0, RAID 1, RAID 0+1, RAID 1+0, RAID 0+1 2 and RAID 1+0 2)

BI1(t) = P2(t) ∗ P3(t) ∗ P4(t) − P2(t) ∗ P3(t) ∗ P4(t) ∗ P5(t) ∗ P6(t) ∗ P7(t) ∗ P8(t),
BI5(t) = P6(t) ∗ P7(t) ∗ P8(t) − P1(t) ∗ P2(t) ∗ P3(t) ∗ P4(t) ∗ P6(t) ∗ P7(t) ∗ P8(t).   (29)
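As an illustration of how the case-study curves can be obtained (our sketch, not the authors' code), the two expressions in (29) can be evaluated with the exponential lifetime model and the MTTFs from Table 2.

import math

MTTF = {"HGST": 91_129.75, "Seagate": 14_028.64}    # days, from Table 2

def P(t, mttf):                                      # survival function of one HDD
    return math.exp(-t / mttf)

for t in (12_000, 60_000, 120_000):
    p_h = P(t, MTTF["HGST"])                         # HDDs 1-4
    p_s = P(t, MTTF["Seagate"])                      # HDDs 5-8
    bi1 = p_h ** 3 - p_h ** 3 * p_s ** 4             # BI of an HGST drive in RAID 0+1
    bi5 = p_s ** 3 - p_h ** 4 * p_s ** 3             # BI of a Seagate drive in RAID 0+1
    print(t, round(bi1, 4), round(bi5, 4))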

Values of the time-dependent BI for both types of HDDs and for each RAID are shown in Figs. 7 and 8. Thanks to those values, it is possible to see exactly how important the components are for the system. In the case of RAID 0 and RAID 1, both types have almost the same importance for the system, which is caused by the topology of all HDDs in those RAIDs. The only difference is in the slope, which is caused by the different MTTFs. As for the other RAIDs, for both types of RAID 0 + 1 the most important type of HDD is HGST, which is caused by the fact that this type is more reliable than the Seagate, and this is further confirmed by its placement in the RAID. On the other hand, in the case of both types of RAID 1 + 0 the most important type is Seagate. This is caused by the fact that, in order to successfully

Fig. 7 Time-dependent BI for type HGST

Fig. 8 Time-dependent BI for type Seagate


save or read data in those situations, an HDD of this type is needed. Because its MTTF is worse in comparison with HGST, this makes it much more important for the system. The time-dependent BI gives us information about the importance of the selected component for the system. To see this influence further, we computed the time-dependent CI for both types; its values can be seen in Figs. 9 and 10. As we can see, the conclusions drawn from the BI are further confirmed here. The most interesting case is RAID 1 + 0. In this RAID, the most important type of HDD is Seagate. This is mostly caused by the placement of the HDDs of this type in the RAID, because in order to get or store data, at least one working component of this type is needed. This also holds true for the second type of RAID 1 + 0.

Fig. 9 Time-dependent CI for type HGST

Fig. 10 Time-dependent CI for type Seagate


Fig. 11 Reliability function for RAID 1 + 0 type 2

6 Conclusion

The current state and level of technology bring new challenges to the development of reliability engineering, which has to be able to deal with the analysis of complex systems, i.e. systems composed of many components with various behaviors. Investigation of such systems requires the development of new methods that allow analyzing their properties in a reasonable time. A solution to this problem can be the use of the methodology of logic differential calculus, whose application in time-independent reliability analysis has been considered in [13, 14]. However, for real-world problems, it is very important to be able to perform time-dependent analysis, which allows us to find how the properties of the system change as time flows. In this paper, we showed that logic differential calculus can also be used in time-dependent reliability analysis, which expands the possibilities of its application in solving real-world problems, in a case study of data storage systems. From the computed measures we can see that the most interesting RAIDs, offering a good tradeoff between reliability and capacity, are both types of RAID 0 + 1 and RAID 1 + 0 of type 2. In the case of RAID 1 + 0 of type 2 (variant 1), if we change the placement of the HDDs (variant 2), we can get a significant improvement from the reliability point of view without changing the capacity. If HDDs 1, 3, 5 and 7 are of type 1 (HGST) and the other HDDs are of type 2 (Seagate), then the most important type will be type 1 and the reliability of such a data storage system improves sharply (Fig. 11). In this paper, we primarily dealt with IMs based on the concept of criticality (BI and CI). However, there also exist other types of IMs, which are based on another concept, known as the concept of minimal cut sets or minimal path sets. A typical example of such IMs is Fussell-Vesely's importance [14], which quantifies how a failure (repair) of a component contributes to a failure (functioning) of the system. Therefore, in future work, we would like to focus on the possibility of applying logic differential calculus in time-dependent analysis based on minimal cut sets and minimal path sets.

Acknowledgements This work was supported by the Slovak Research and Development Agency under the grant No. SK-SRB-18-0002.


References

1. Zio, E.: Reliability engineering: old problems and new challenges. Reliab. Eng. Syst. Saf. 94, 125–141 (2009). https://doi.org/10.1016/j.ress.2008.06.002
2. Block, H.W., Barlow, R.E., Proschan, F.: Statistical theory of reliability and life testing: probability models. J. Am. Stat. Assoc. 72, 227 (1977). https://doi.org/10.2307/2286944
3. Rausand, M., Høyland, A.: System Reliability Theory: Models, Statistical Methods, and Applications. Wiley (2003)
4. Grouchko, D., Kaufmann, A., Cruon, R.: Mathematical Models for the Study of the Reliability of Systems. Elsevier Science (1977)
5. Natvig, B.: Multistate Systems Reliability Theory with Applications. Wiley (2010)
6. Lisnianski, A., Levitin, G.: Multi-State System Reliability: Assessment, Optimization and Applications. World Scientific (2003)
7. Schneeweiss, W.G.: A short Boolean derivation of mean failure frequency for any (also noncoherent) system. Reliab. Eng. Syst. Saf. 94, 1363–1367 (2009). https://doi.org/10.1016/j.ress.2008.12.001
8. Kvassay, M., Levashenko, V., Zaitseva, E.: Analysis of minimal cut and path sets based on direct partial Boolean derivatives. Proc. Inst. Mech. Eng. Part O J. Risk Reliab. 230, 147–161 (2016). https://doi.org/10.1177/1748006X15598722
9. Armstrong, M.J.: Reliability-importance and dual failure-mode components. IEEE Trans. Reliab. 46, 212–221 (1997). https://doi.org/10.1109/24.589949
10. Zhang, J.: Multi-function system reliability. In: Proceedings – Annual Reliability and Maintainability Symposium. Institute of Electrical and Electronics Engineers Inc. (2019)
11. Marichal, J.L.: Structure functions and minimal path sets. IEEE Trans. Reliab. 65, 763–768 (2016). https://doi.org/10.1109/TR.2015.2513017
12. Brinzei, N., Aubry, J.-F.: Graphs models and algorithms for reliability assessment of coherent and non-coherent systems. Proc. Inst. Mech. Eng. Part O J. Risk Reliab. 232, 201–215. https://doi.org/10.1177/1748006X17744381
13. Zaitseva, E., Levashenko, V., Kostolny, J.: Importance analysis based on logical differential calculus and Binary Decision Diagram. Reliab. Eng. Syst. Saf. 138, 135–144 (2015). https://doi.org/10.1016/J.RESS.2015.01.009
14. Kuo, W., Zhu, X.: Importance Measures in Reliability, Risk, and Optimization: Principles and Applications. John Wiley and Sons (2012)
15. Silberschatz, A., Galvin, P.B., Gagne, G.: Operating System Concepts, 10th edn. John Wiley & Sons, Inc. (2018)
16. Butler, D.: Complete importance ranking for components of binary coherent systems, with extensions to multi-state systems. Nav. Res. Logist. Q. (1979). https://doi.org/10.1002/nav.3800260402
17. Al Luhayb, A.S.M., Coolen-Maturi, T., Coolen, F.P.A.: Smoothed bootstrap for survival function inference. In: Proceedings of the International Conference on Information and Digital Technologies 2019, IDT 2019, pp. 296–303. Institute of Electrical and Electronics Engineers Inc. (2019)
18. Papadopoulos, V., Giovanis, D.G.: Reliability analysis. In: Mathematical Engineering, pp. 71–98. Springer Verlag (2018)
19. Steinbach, B., Posthoff, C.: Boolean differential calculus. Synth. Lect. Digit. Circuits Syst. 12, 1–217 (2017). https://doi.org/10.2200/S00766ED1V01Y201704DCS052
20. Yanushkevich, S.N., Miller, D.M., Shmerko, V.P., Stanković, R.S.: Decision Diagram Techniques for Micro- and Nanoelectronic Design: Handbook. CRC Press (2005)
21. Rusnak, P., Rabcan, J., Kvassay, M., Levashenko, V.: Time-dependent reliability analysis based on structure function and logic differential calculus. In: Advances in Intelligent Systems and Computing, pp. 409–419 (2019)
22. Klein, A.: Backblaze Hard Drive Stats for 2019. https://www.backblaze.com/blog/hard-drive-stats-for-2019/ (2020). Accessed 11 Oct 2020

Energy Efficiency for IoT

Andriy Luntovskyy and Bohdan Shubyn

Abstract The given paper investigates energy efficiency issues for IoT solutions based on combined sensor and contactless systems. The combined energy-efficient solutions are, namely, hierarchically organized infrastructure WSNs and self-organized ad-hoc WSNs, which are widely interoperable with 4G/5G base stations and micro-cells, backboned Wi-Fi access points, as well as with inexpensive and energy-efficient RFID/NFC reader farms. To provide energy-efficient WSN protocols, a holistic multi-layered approach is used, which is based on the Low-Duty-Cycle principle, energy harvesting, LEACH clustering and topology optimization, and efficient OS and software frameworks enabling data reduction.

Keywords 5G · WSN · RFID · NFC · IoT · Energy-efficient protocols · Low data unit costs

1 Motivation

Since 2003, P2P systems (Internet of Things, fog computing) in combination with the convenient C-S communication model based on the IoP (the so-called Internet of People) as well as server-less structures (SLMA, robotics) have gained popularity. Cloud-based solutions then became a trend (2010), with predominant use of load-balanced "thin clients" whose functionality is delegated to the clouds [1–3]. IoT solutions are constructed using fog computing: the workload is shifted to the edge (minor QoS requirements), to energy-autarkic and resource-economizing small nodes (higher energy-efficiency requirements) [1, 2].

A. Luntovskyy (B) BA Dresden University of Cooperative Education (Saxon Study Academy), Hans-Grundig-Str. 25, 01307 Dresden, Germany e-mail: [email protected] B. Shubyn Lviv Polytechnic National University, ITRE, Professorska 2, Lviv 79013, Ukraine © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_12


Fig. 1 Demarcation and development phases for mobile internet and IoT [4–8]

1.1 Demarcation Between IoP, IoS and IoT

First, let us give a demarcation between the so-called IoT, IoS and IoP concepts (Fig. 1). The Internet (of People) consists of up to N ~ 1–5 billion devices nowadays. Since 2003, innovative technological developments in the so-called Internet of Services (IoS) have been driven with the aim of creating new delivery channels for services and new business models. The creation of such services based on cloud computing became possible in 2010 through the use of open platforms and interface architectures like SOA (Service-Oriented Architecture), respectively Enterprise SOA, which provide IoS for companies [4–8].

1.2 IoT Enabling Network Technologies

Which IoT-relevant network technologies do we know? As an example, a smart mobile device can be considered as "All-in-One" (Fig. 2a). The following IoT-relevant network technologies work together and can be considered IoT enabling network technologies [1–3, 9–16]:
(1) Mobile networks (LTE, 5G)
(2) WSN (Wireless Sensor Networks; ZigBee/IEEE 802.15.4, EnOcean, Z-Wave)


Fig. 2 Interoperability: a Smartphone as “All-in-One” [4–8]; b Hierarchical cell concept

(3) 6LoWPAN (IPv6 over Low power Wireless Personal Area Networks, based on ZigBee)
(4) Wi-Fi (Wireless Fidelity, IEEE 802.11ac, ax, ah)
(5) Bluetooth v5 (IEEE 802.15.1), IrDA (Infrared Data Association)
(6) RFID (Radio Frequency ID) and NFC (Near Field Communication)
(7) GPS (Global Positioning System)
(8) Powerline, Homeplug, PoE (Power over Ethernet)
(9) KNX (Konnex), LON (Local Operating Network)
(10) Infrequently, QR (Quick Response) codes and watermarks (as steganographic applications) can also be considered.

Modern IoT devices are interoperable with 2G–5G mobile networks using the convenient hierarchical cell concept (pico-cells, micro-cells, mega-cells and giga-cells, refer to Fig. 2b). These networks are completely interoperable (Table 1).


Table 1 Overview of hierarchical cells

Type | Distance | Mobility (km/h) | Deployment by …
Giga-cell | ~1000 km | ~4700 (~1.3 km/s) | International providers, satellite radio
Macro-cell | ~1–5 km | Up to 500 | National and regional providers
Micro-cell | ~100–300 m | Up to 120 | Metropolitan areas, city districts, campus, office area
Pico-cell (indoor) | ~10 m | ~10 | Hotspots at railway stations, airports, hotels, restaurants, clubs, home area
Pico-cell (personal area) | ~10 cm | Stationary | PAN, wearable, smart stuff

Most of them use the established combinations of IPv4, IPv6, TCP and UDP for data transfer (Fig. 3). An exception is ZigBee, which provides its own transport protocol and is based on the IEEE 802.15.4 standards (the lowest layers). ZigBee is specified for WSNs and corresponds to the basic IoT concept of data economy and energy efficiency. A further development of ZigBee is 6LoWPAN, which uses IPv6. The IoT applications (refer to Fig. 4) have access to the data via various higher-level protocols and platforms (HTTP, CoAP, SEP 2.0, MQTT, OneM2M, OPC UA, ROS) [4–6].

Fig. 3 Architecture and components of an IoT device


Fig. 4 Protocol stack for IoT

These protocols are widely used in IoT. For example, HTTP can also be used for data queries and data string transfer. RPL (Routing Protocol for Low power and Lossy Networks) is optimized for use in combination with IPv6 and WSNs.

1.3 Case Study 1: Energy-Efficient IoT with LoRa WAN

LoRa WAN devices are used for so-called low-power wide-area Long Range communication (LPWAN). The Semtech Corporation acts as a pioneer of the LoRa Alliance [17]. The LoRa protocol is based on chirp spread spectrum modulation techniques (CSSM). LPWANs are operated in license-free frequency ranges like SRD and ISM: 2.4 GHz, 868/915, 433 and 169 MHz. The basic modules are available as open source software (OSS). LoRa WAN provides asymmetrical links and energy efficiency for M2M scenarios (Fig. 5). The uplink (UL) range is about d ~ 10 km (from the end device to the network); however, only low data rates of approx. 292 bps to 50 kbps are available. The typical ranges for LoRa extend from 2 km in urban areas up to 40 km in rural areas. Multiple LoRaWAN networks are deployed across Europe and worldwide: they provide IoT coverage in the Netherlands, Switzerland, South Korea, etc. These are the first countries with area-wide LoRaWAN coverage. As an example, the Swisscom LPN can be mentioned:
• frequency band from 863 to 870 MHz (SRD band for Europe),
• gateway PTx = 500 mW (27 dBm).
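The ~292 bps figure quoted above can be reproduced from the commonly used LoRa bit-rate relation R_b = SF · (BW / 2^SF) · CR. The sketch below assumes a 125 kHz channel and coding rate 4/5; the 50 kbps upper bound corresponds to the separate FSK mode and is not produced by this formula.

```python
def lora_bitrate(sf: int, bw_hz: float, cr: float = 4 / 5) -> float:
    """Approximate LoRa (CSS) coded bit rate in bit/s for spreading factor sf."""
    return sf * (bw_hz / 2 ** sf) * cr

for sf in (7, 9, 12):
    print(f"SF{sf:2d}, 125 kHz, CR 4/5: {lora_bitrate(sf, 125e3):7.1f} bps")
# SF 7 -> ~5469 bps, SF 9 -> ~1758 bps, SF 12 -> ~293 bps
```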


Fig. 5 LoRA architecture (own representation, based on LoRA alliance)

1.4 To the Structure of this Work

The remainder of this paper is structured as follows:
• State of the art of energy-efficient approaches and solutions for hierarchically organized infrastructure WSNs and self-organized ad-hoc WSNs (Sect. 2). They are widely interoperable with the 4G/5G base stations and micro-cells and the backboned Wi-Fi access points (refer to Sect. 2).
• To provide energy-efficient WSN protocols, a holistic multi-layered method is used, which is based on the Low-Duty-Cycle principle, energy harvesting, LEACH clustering and topology optimization, and efficient OS and software frameworks enabling data reduction (Sect. 3).
• With the aim of further optimization, inexpensive and energy-efficient RFID/NFC reader farms can be used in combination with WSNs (refer to Sect. 4).
• The paper provides case studies on the discussed subjects. The conclusions are offered in Sect. 5.

2 Energy-Efficient Approaches and Solutions

This section delineates hierarchically organized infrastructure WSNs from self-organized ad-hoc WSNs and surveys the state of the art of energy-efficient approaches and solutions [4–8].


Section 3 then gives a brief overview of common methods for WSN energy efficiency and discusses the most important tradeoffs between the factors that influence WSN energy efficiency and QoS on the various network layers.

2.1 Energy Efficiency for Combined Infrastructure Wireless Sensor Networks

Nowadays, widespread wireless networks are based on hierarchical structures and are integrated into fixed infrastructures built on Wi-Fi access points and mobile base stations. Mobile end devices are assigned to a specific 4G/5G base station (BS) and supplied with data. These BSs schedule the communication: radio channel access and synchronization. A comparison of WSN technologies is presented in Table 2. The most important problem in all of these technologies is energy-efficient operation while transferring small telegrams (sometimes under 100 bit) over short ranges. Energy-efficient sensor nodes are characterized by durability, interoperability and the guarantee of quality of service (QoS) requirements in the constructed WSN. In addition, they provide a long lifetime, high reliability and inexpensive customization options [1–16].

Table 2 WSN in comparison

Feature | EnOcean | KNX-RF | Z-Wave | ZigBee (IEEE 802.15.4) | 6LoWPAN (based on ZigBee) | NanoNET
Frequency, MHz | 868 | 868 | 868 | 2400 (world-wide) | 2400 | 2400
MAC layer | CSMA | Beaconing, CSMA | Beaconing, CSMA | CSMA/CA, TDMA, ALOHA | Beaconing | –
IPv6 support | – | – | – | – | Available | –
Topology | Star, mesh | Star | Star, mesh | Star, mesh | Star, mesh | Mesh
Data rate, kbit/s | 125 | 16.4 | 9.6/40 | 250 | 250 | 2000
Node number | 2^32 | 256 | 2^32 | 2^16 (256 in cluster) | 2^16 (256 in cluster) | 2^48
Security | AES | In mid-term | AES | | |
Energy consumption | Very low | Low | Low | Low | Very low | Middle
Collision probability | Very low | + | + | Low | Low | Very low
Energy harvesting | Available | No | No | No | No | No
Distance, m | 30–300 | 10–100 | 20–200 | 10–75 | 10–100 | 40–250

2.2 Case Study 2: Multi-Layered Monitoring and Control for Infrastructure WSN

Monitoring of wireless sensor networks can be organized hierarchically (Fig. 6). The SNs are coupled into clusters with a single hop to the Wi-Fi AP; the sensor stratum is thus formed. The highest gateway stratum is based on the middle backbone stratum of the aggregated Wi-Fi access points. The gateway stratum enables holistic management, monitoring and control of the wireless sensor networks. As an example of "best practices", Infotec BRAIN for monitoring of wireless sensor networks can be considered. The system provides energy efficiency and low data unit costs (refer to Sect. 3). Next, a multi-layered combined solution for anti-drone defense is discussed [13, 14, 18]. Drones are becoming increasingly easier and less expensive to access and can thus offer criminals and terrorists a new means of attack (Fig. 7). Note: "BOS—Behörden und Organisationen mit Sicherheitsaufgaben" (German) means "authorities and organizations with security tasks" and is a collective term for institutions entrusted with the prevention of dangers to safety and security (emergency services).

Fig. 6 Multi-layered monitoring and control based on an infrastructure WSN


Fig. 7 Anti-drone defense system AMBOS [13, 14, 18]

The so-called AMBOS system for defence against unmanned aerial objects for BOS, developed by the Fraunhofer IDMT Institute for Digital Media Technology, enables the security forces to support the detection of potential threats from drones. The system has a hierarchical architecture. The AMBOS sensor system originates from the field of audio signal processing and can detect possible threats using different sensor modalities such as radio, acoustics, electro-optics, infrared and radar. The AMBOS system is being developed by partners from science and industry, who have been working since 2017 on information support for the detection, localization and assessment of flying drones.

3 A Multi-Layered Approach and the Principles for Energy Efficiency in WSN and WPAN

Generally, a WSN consists of spatially distributed autonomous sensor nodes (SN), which are suitable for cooperative monitoring of physical parameters (industry) and environmental conditions, e.g. temperature, sound, vibration, pressure, movement, contamination or pollutant concentration. The main principles for achieving energy efficiency in WSN/WPAN are summarized in Fig. 8. Possible application scenarios for wireless sensor networks are as follows:
• numerous traffic scenarios—tunnels, railways and highways,
• disaster control, e.g. GITEWS (German Indonesian Tsunami Early Warning System) for measuring temperature, pressure and wind speed in tsunami areas, interoperable with satellite radio and GPS,
• military, e.g. MFI (Micromechanical Flying Insects, drones),
• environment and biology, e.g. ZebraNET (Princeton University),
• animal and patient monitoring, healthcare (smart pills).


Fig. 8 Holistic multi-layered method for energy efficiency: sub-optimal-architecture with trade-offs [4–8]

Wireless sensor piconets are in general subject to the requirement of unified development with optimization of energy consumption in all layers of the OSI reference model, top-down and across layers. The uniform approach to energy efficiency via the so-called "Low Duty Cycle" is followed (Fig. 9). To increase energy efficiency in WSNs, a combination of the following, often contradictory, optimization methods is used:

Fig. 9 Energy efficiency via “Low Duty Cycle” principle


1. Optimized software components and frameworks with data reduction and improved data acquisition,
2. So-called data-based energy efficiency,
3. Optimized routing, i.a. under use of IPv6,
4. Optimized topology and clusters with dynamic selection of the nodes and CHs (cluster heads), e.g. LEACH,
5. Energy harvesting and management of power supply (dynamic voltage, scheduling, event-based processing in nodes).

Optimization of the energy consumption in a WSN can only be considered as a sub-optimal process with several trade-offs (compromises), for example [4–8]:
• Trade-off "Transmission Power—Noise",
• Trade-off "Frequency—Energy",
• Trade-off "Voltage—Data Rate",
• Trade-off "Energy—Accuracy",
• Trade-off "Energy—Latency".

Efficient dynamic cluster formation with dynamic selection of nodes and CHs (cluster heads) is known as the LEACH method. The LEACH principle can be described as follows (Fig. 10). Nodes that have already been CHs cannot act as CHs again for 1/P rounds, where P is the desired share of cluster heads within the network. Furthermore, in each new round a node becomes a cluster head if a random number Z drawn by the node satisfies Z < T(n), where T(n) is the LEACH threshold. Finally, at the end of the round, each node that did not become a CH connects to the nearest cluster head and becomes a cluster member (Join Cluster). Each of the cluster heads schedules a slotted communication plan (cluster schedule) for every node of its own cluster for successful data transfer (Slot 1 … Slot m). The procedure extends the average life expectancy of the WSN by up to 4 times (refer to Fig. 10) compared to static solutions: (1) directly connected sensor nodes and (2) statically coupled nodes and CHs in the clusters.
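As a rough illustration of this cluster-head rotation, the following Python sketch applies the classical LEACH threshold T(n) = P / (1 − P · (r mod 1/P)) due to Heinzelman et al.; the node count, the share P and the number of rounds are arbitrary placeholders, and the radio/energy model of the full protocol is omitted.

```python
import random

def leach_threshold(p: float, rnd: int, was_ch_recently: bool) -> float:
    """Classical LEACH threshold T(n); zero for nodes that were CH within 1/P rounds."""
    if was_ch_recently:
        return 0.0
    return p / (1.0 - p * (rnd % round(1.0 / p)))

def elect_cluster_heads(n_nodes: int = 100, p: float = 0.05, rounds: int = 40):
    last_ch_round = {node: -10**9 for node in range(n_nodes)}
    ch_counts = []
    for rnd in range(rounds):
        chs = []
        for node in range(n_nodes):
            recently = (rnd - last_ch_round[node]) < round(1.0 / p)
            if random.random() < leach_threshold(p, rnd, recently):
                chs.append(node)
                last_ch_round[node] = rnd
        ch_counts.append(len(chs))   # remaining nodes would now join the nearest CH
    return ch_counts

print(elect_cluster_heads())   # number of elected CHs per round
```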

4 Energy Efficiency in Contactless Communication Via RFID and NFC

With the aim of further optimization, inexpensive and energy-efficient RFID and NFC readers, as well as farms of them, can be combined hierarchically with the 4G/5G mobile BSs, Wi-Fi APs and WSN SNs (refer to Sect. 1). RFID/NFC technology has its origins in music (by Lew Theremin), trading and retail, logistics and warehouse management over shorter ranges (10 cm–10 m).


Fig. 10 LEACH-clustering for energy efficiency [4–8]

4.1 Energy Efficiency Via RFID and NFC

What is an RFID (Radio Frequency ID) transponder today? It is used for fast, uncomplicated data exchange between transmitter and receiver over a short distance. Insufficient data security is a major disadvantage. However, increasing the data security is possible by reducing the working distance; alternatively, encryption can be provided by the active models [19–21].


The essential construction elements of RFID technology are the transmitters and receivers; a distinction is made as to whether communication is active or passive. RFID enables automatic and contactless identification and localization of (mobile) objects over short distances. This modus operandi provides energy efficiency at a distance of up to 10 m in passive mode, as well as up to 100 m for active systems. For applications in trade and logistics, multiple identification is possible, as well as identification via EPC (Electronic Product Code) with 64, 96 or 128 bit. In combination with RFID technology, the EPC is suitable for the detection and tracking of objects (products, logistic units, packaging, transport pallets, systems, service relationships, documents, reusable transport containers, storage locations, locations) that are equipped with a transponder carrying an EPC, without visual or touch contact. It is an internationally used key and code system. There are multiple constructions of RFID transponders (Fig. 11a–e). The operation principle (refer to Fig. 11f) is as follows:
1. The reader sends commands via an electromagnetic field (transmitting).
2. Reception takes place via the antenna of the transponder.
3. Forwarding to the microchip, if present, for pre-processing is possible.
4. The transponder sends the response via the electromagnetic field.
5. Data processing takes place in the reader.

Furthermore, inexpensive and loosely coupled RFID transponders with sensors are suitable for the measurement of physical and chemical variables; in particular, measurements of

Fig. 11 RFID types and communication principle [19–21]


pressure, acceleration, expansion, humidity and electrical conductivity are possible. Typical RFID frequencies are as follows:
• Low Frequency (LF): 30–500 kHz, for motor vehicle immobilizers and animal identification,
• High Frequency (HF): 10–15 MHz, for smart labels on goods in logistics and retail.
For RFID technology, the following ITU-coordinated frequency bands are defined: LW at 125–134 kHz; SW at 13.56 MHz; UHF at 865–869 MHz (EU); UHF at 950 MHz (USA/Asia); SHF at 2.45 and 5.8 GHz. The following modes are distinguished:
• Passive RFID: antenna only,
• Active RFID: with memory, micro-controller and battery, therefore shorter life expectancy, but programmable and reconfigurable.
RFID transponders are often used together with wireless sensors for measuring physical and chemical variables, in particular pressure, acceleration, elongation, humidity and electrical conductivity. Energy efficiency can also be provided via the inexpensive contactless NFC technology by networking machines and work-pieces. NFC is based on the unified RFID specifications [22, 23] ISO 14443 and ISO 15693. NFC devices are energy-efficient with a long lifetime of up to 12–72 months. NFC devices are frequently unsafe due to attacks by third parties, but encryption is optionally available. NFC devices can provide holistic tracing of products from the supplier to the customer and process automatic entries in the machines' and work-pieces' databases.

4.2 Case Study 3: Energy-Efficient Monitoring and Management of Farm Animals Via RFID and Wi-Fi

The use of RFID technology for energy-efficient monitoring and management of farm animals is depicted in Fig. 12.

5 Conclusions and Outlook

The given paper presents a demarcation between IoT, IoS and IoP and provides an overview of interoperable IoT technologies: 2G–5G, WSN, LoRa WAN, Wi-Fi 5/6 and RFID/NFC, which promise "a new technological breakthrough" for widespread modern energy-efficient systems. Furthermore, energy efficiency is examined as one of the most important issues for IoT.


Fig. 12 RFID and Wi-Fi based monitoring and management of farm animals

An overview of energy-efficient approaches and solutions for hierarchically organized infrastructure WSNs and self-organized ad-hoc WSNs is provided. They are widely interoperable with 4G/5G base stations and micro-cells as well as backboned Wi-Fi access points. A qualitative comparison of the RFID/NFC technologies has been carried out by the criteria of data rate, distance and energy consumption. With the aim of further optimization, inexpensive and energy-efficient RFID/NFC reader farms can be used in combination with conventional WSNs. To provide energy-efficient WSN protocols, a holistic multi-layered method is offered, based on the Low-Duty-Cycle principle, energy harvesting, LEACH clustering and topology optimization, and efficient OS and software frameworks enabling data reduction. The paper has work-in-progress (WIP) status. The following case studies and deployment scenarios on the energy efficiency of IoT were discussed:
• Case Study 1: Energy-Efficient IoT with LoRa WAN,
• Case Study 2: Multi-Layered Monitoring and Control for Infrastructure WSN,
• Case Study 3: Energy-Efficient Monitoring and Management of Farm Animals via RFID and Wi-Fi.
In the mid-term, the standards for ML and AI will accompany IoP, IoS and clouds, IoT and fog, as well as the industries, digital economy and everyday life all over the world and for each institution. Therefore, it is very important for such systems to provide energy efficiency and low data unit costs in view of the increasing demands on QoS [4–8].


Acknowledgements The authors are grateful to the colleagues from BA Dresden (Saxon Study Academy), Lviv Polytechnic National University and the University of Žilina, especially to Prof. Dr. habil. A. Haensel, Mr. E. Zumpe, Mr. M. Stoll, Mr. A. Podogrygora, Mr. T. Zobjack, Prof. Dr. habil. M. Klymash, Dr. D. Guetter, Prof. Dr. V. Levashenko and Prof. Dr. E. Zaitseva for inspiration and challenges in fulfilling this work.

References

1. Gessler, R., Krause, T.: Wireless-Netzwerke für den Nahbereich. Vieweg+Teubner, Wiesbaden (2009). ISBN 978-3-8348-0247-7, 342 S
2. Dressler, F.: CCS Labs/Paderborn University/Drahtlose Selbstorganisierende Netzwerke (2020). https://www.ccs-labs.org/team/dressler/
3. 6LoWPAN (2020). http://www.scantec.de/
4. Luntovskyy, A., Gütter, D.: Moderne Rechnernetze: Protokolle, Standards und Apps in kombinierten drahtgebundenen, mobilen und drahtlosen Netzwerken, 481 Seiten, 263 Abb. Springer Nature (2020). ISBN 9783658256166. https://www.springer.com/gp/book/9783658256166
5. Luntovskyy, A., Gütter, D.: Moderne Rechnernetze – Übungsbuch: Aufgaben und Musterlösungen zu Protokollen, Standards und Apps in kombinierten Netzwerken, 145 Seiten, 44 Abb. Springer Nature (2020). ISBN 9783658256180. https://www.springer.com/gp/book/9783658256180
6. Schill, A., Springer, T.: Verteilte Systeme: Grundlagen und Basistechnologien: Kompakte Darstellung der Grundlagen und Techniken Verteilter Systeme, 2. Ausgabe. Springer-Verlag, Berlin Heidelberg (2012). ISBN 9783642257957. https://www.springer.com/gp/book/9783642257957
7. Luntovskyy, A., Spillner, J.: Architectural Transformations in Network Services and Distributed Systems: Current Technologies, Standards and Research Results in Advanced (Mobile) Networks, p. 344. Springer Vieweg, Wiesbaden (2017). ISBN 9783658148409. https://www.springer.com/gp/book/9783658148409
8. Luntovskyy, A., Gütter, D., Melnik, I.: Planung und Optimierung von Rechnernetzen. Methoden, Modelle, Tools für Entwurf, Diagnose und Management im Lebenszyklus von drahtgebundenen und drahtlosen Rechnernetzen, 415 Seiten, 245 Abb. Springer/Vieweg+Teubner, Wiesbaden (2012). ISBN 9783834814586. https://www.springer.com/gp/book/9783834814586
9. Virtenio Sensors (2020). https://www.virtenio.com/
10. EnOcean Alliance (2020). https://www.enocean-alliance.org/
11. Preon 32 Sensor Nodes (2020). https://www.virtenio.com/
12. Smarter World Sensors (2020). https://smarterworld.de/
13. AMBOS (2020). https://www.fraunhofer-innovisions.de/oeffentliche-sicherheit/ambos/
14. Infotec BRAIN—Monitoring of Wireless Sensor Networks (2020). https://infotec-edv.de/
15. Field Protocol (2020). https://www.computer-automation.de/feldebene/zensoren/die-protokoll-frage.74052.html
16. Otte, C.: Industrie 4.0 (2020). https://bdi.eu/artikel/news/was-bedeutetindustrie-40/
17. LoRa Alliance (2020). https://lora-alliance.org/
18. BSI—Federal Office for Information Security (2020). https://www.bsi.bund.de/
19. RFID Transponders (2020). https://moodle.ruhr-uni-bochum.de/m/mod/wiki/view.php?pageid=1144#toc-5
20. RFID in der Industrie 4.0 (2020). https://ihk-industrie40.de/leitfaden-industrie-4-0/technologien/rfid-technologie/
21. RFID (2020). https://www.elektronik-kompendium.de/sites/kom/0902021.htm
22. NFC in der Industrie 4.0 (2020). https://industriemagazin.at/a/nfc-alsproduktivitaetsturbo
23. NFC (2020). https://www.elektronik-kompendium.de/sites/kom/1107181.htm

Image Analysis and Other Applications of Computational Intelligence in Reliability Engineering

Knowledge-Based Multispectral Remote Sensing Imagery Superresolution

Sergey A. Stankevich, Iryna O. Piestova, Mykola S. Lubskyi, Sergiy V. Shklyar, Artur R. Lysenko, Oleg V. Maslenko, and Jan Rabcan

Abstract Software techniques for remotely sensed imagery superresolution enhance data reliability and veracity. The most common approach to superresolution is the processing of a few images of the same scene captured simultaneously with a subpixel shift relative to each other. These conditions exclude radiometric inconsistency between images, and the subpixel shift allows extracting additional land surface details. The general superresolution approach can be adapted to multispectral remote sensing imagery registered in different spectral bands. In this case, the intrinsic radiometric inconsistency can be overcome by translating the input bands into an additional virtual band joint for all inputs. Typically, such an additional band overlaps all input ones in the spectrum. The necessary knowledge for band translation consists of the spectral responses of all bands as well as the subpixel shifts between the restored images. In this way, the spectral radiance for a new spectral band is estimated, and each input band image is transformed into a new image in the same spectral range. The obtained images are suitable for any existing superresolution technique, for example, one using Gaussian regularization in the frequency domain. The last step of the proposed method is image improvement after superresolution using a convolutional artificial neural network.

Keywords Multispectral remote sensing imagery · Spectral band translation · Superresolution · Subpixel shift · Pan-sharpening · Convolutional neural network

1 Introduction

Remote sensing applications related to economics, natural resources and social aspects use data from airborne and spaceborne survey systems [1].

S. A. Stankevich (B) · I. O. Piestova · M. S. Lubskyi · S. V. Shklyar · A. R. Lysenko · O. V. Maslenko
Scientific Centre for Aerospace Research of the Earth, NAS of Ukraine, Kyiv, Ukraine
e-mail: [email protected]
J. Rabcan
University of Žilina, Žilina, Slovakia
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_13


The main purpose of remote sensing data analysis is to provide reliable detection and recognition of objects on land or at sea, which in most cases have small details and low contrast [2]. The probability of detecting such objects depends directly on the informativeness of the aerospace imagery. Consider a general model that quantifies the informativeness of multispectral aerospace images as a fraction of the total information capacity [3]. The main components of the model that influence the informativeness value are the spectral differences between the radiometric signals of the object and the background, the equivalent signal-to-noise ratio in a multispectral image, and the spatial resolution. In information theory, informativeness generally refers to the amount or proportion of useful information. The informativeness of an aerospace image with respect to a particular remote sensing application is determined by the amount of information useful for the correct identification of objects and background characteristics [4]. Information on the detection of objects and the background is contained in the spectral, energy and spatial distribution of their optical signals. This study focuses on improving spatial resolution based on knowledge, as the most powerful component in the overall model for assessing informativeness [5, 6]. In this study, knowledge means external information and the rules for its processing, which in principle can be divided into two types. The first type is external knowledge about spectral signals in the image and their distribution, transformation, redistribution, etc. This is well-formalized knowledge that is easily programmed and often found in image processing software implementing band processing operations, for example, resampling, merging, band mathematics, and many more. The second type comprises external knowledge about the desired image parameters and the rules for their analysis in a specific remote sensing application. This knowledge is vague and poorly formalized because it contains uncertain data and cannot be programmed directly. Along with other methods [7], artificial intelligence (AI) techniques are useful here; in particular, neural networks can be trained on good examples and then applied. Thus, a two-level image processing and analysis framework is proposed, which conducts precise image processing at the first level and AI-based analysis at the second. Software methods for remote sensing image superresolution enhance the reliability and credibility of data [8]. Various types of two-dimensional interpolation (nearest neighbor, bicubic, spline, etc.) [9, 10] are considered the simplest methods for spatial resolution enhancement. There are also widely used sharpening approaches and related metrics, such as the modulation transfer function (MTF) [11]. These approaches are usually implemented in the frequency (inverse filtration) or wavelet domains [12, 13]. Frequency domain analysis is a universal method for imagery processing that allows extracting, modifying and merging particular spatial frequency components of an image. The most common frequency domain tool is the discrete Fourier transform (DFT). Superresolution approaches based on multiple image processing are considered separately. The most common approach to superresolution is to process multiple images of the same scene captured at the same time with a subpixel shift relative


to each other [14]. These conditions eliminate radiometric inconsistency between the images, and the subpixel shift allows the extraction of additional details of the land surface.

2 Spectral Band Translation

Panchromatic data typically overlap several narrow spectral bands, which allows obtaining images with higher spatial resolution but reduced spectral resolution. This makes it possible to use panchromatic data for spatial resolution enhancement of other spectral bands that overlap the panchromatic band, for object detection purposes [15]. However, composing panchromatic data with multispectral data for resolution enhancement (the pan-sharpening technique) always introduces radiometric distortion into the resulting images [16]. To avoid this issue, it has been proposed to use high-pass filtering [17], which extracts the high-frequency information from the panchromatic image in the frequency domain. The extracted high-frequency components are inserted into the multispectral image, represented in the frequency domain. These techniques are applicable when a particular imaging system provides panchromatic data. In contrast, the proposed spatial resolution enhancement technique considers the case when there is no panchromatic band among the acquired data. For resolution enhancement of a multispectral remote sensing image using multiple images of low spatial resolution, the input data must satisfy the following criteria [11]:
– the presence of a linear subpixel shift between images relative to one another;
– a short time delay between acquisitions of the land surface spectral features, to avoid the influence of changes of the surface objects' spectra that occur over time;
– representation of the same physical feature by the input multispectral data—all bands need to be in the same spectral band.

The requirement of a subpixel shift between images is achieved through the time delay between acquiring images in different spectral bands, which is usually inherent in modern multispectral images. To satisfy the third requirement, the technique translates the existing multispectral data into panchromatic bands—all narrow bands of a particular spectrum are expanded into a spectral range of unified width. In the case of two bands, where each band represents the spectral radiance R1(λ) and R2(λ), each spectral range corresponds to its own spectral response function S1(λ) and S2(λ). The initial spectral response functions of the scanning systems are determined instrumentally during laboratory calibration and typically are given in the technical documentation [18]. Within the wavelengths common to the bands S1(λ) and S2(λ), a new combined virtual spectral band with the spectral response function S3(λ) = S1(λ) ∪ S2(λ) is formed. The spectral radiance R3(λ) in the combined spectral band will be:

R_3(\lambda) = \frac{1}{\pi} \int E_0(\lambda)\, \rho(\lambda)\, \tau^2(\lambda)\, S_3(\lambda)\, d\lambda,   (1)


Fig. 1 Spectral response functions of two initial bands (left) and resulting image’s spectral response function after band-to-broadband translation of input images (right)

where E_0(λ) is the spectral irradiance of the land surface by the Sun, ρ(λ) is the spectral reflectance of the land surface, and τ(λ) is the spectral transmittance of the atmosphere. Equation (1) includes all functions except the spectral reflectance, which may differ from pixel to pixel. Neglecting the non-uniformity of spectral reflectance, it can be assumed that:

R_3 \cong \left( \frac{S_3(\lambda)}{S_1(\lambda)} + \Delta S_1 \right) \times R_1,   (2)

or

R_3 \cong \left( \frac{S_3(\lambda)}{S_2(\lambda)} + \Delta S_2 \right) \times R_2,   (3)

where ΔS_1, ΔS_2 are corrections resulting from the nonlinearity of the functions E_0(λ) and τ(λ). Applying relations (2) and (3) to each pixel of the input band images results in two additional broadband images in a joint unified spectral range that overlaps both initial spectral bands of the input images (Fig. 1). The difference between the input and resulting images is determined by the spectral radiance of the input images. The described technique can be applied to any number of initial bands for translation into the same number of wide panchromatic bands, which satisfy the above requirements on data suitable for spatial resolution enhancement.
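A minimal sketch of how the band-to-broadband translation of Eqs. (2)–(3) might be applied per pixel, assuming the response-function ratio is approximated by the ratio of band-integrated responses and the correction term ΔS_i is neglected. The triangular response curves and the toy radiance image below are invented for illustration and are not the Sentinel-2A responses used later in the chapter.

```python
import numpy as np

def band_to_broadband(radiance, wavelengths, s_band, s_joint, delta=0.0):
    """Translate a narrow-band radiance image R_i into the joint band R_3.

    The ratio of band-integrated spectral responses stands in for
    S_3(lambda)/S_i(lambda); `delta` is the (here neglected) correction
    term Delta S_i of Eqs. (2)-(3).
    """
    scale = np.trapz(s_joint, wavelengths) / np.trapz(s_band, wavelengths)
    return (scale + delta) * radiance

# Hypothetical triangular responses of two adjacent bands and their union S3.
lam = np.linspace(450.0, 700.0, 251)                        # nm
s1 = np.clip(1.0 - np.abs(lam - 490.0) / 40.0, 0.0, None)   # "blue" band
s2 = np.clip(1.0 - np.abs(lam - 560.0) / 40.0, 0.0, None)   # "green" band
s3 = np.maximum(s1, s2)                                     # joint virtual band

r1 = np.random.default_rng(0).uniform(50.0, 120.0, (4, 4))  # toy radiance image
print(band_to_broadband(r1, lam, s1, s3).round(1))
```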

3 Image Superresolution

Image resolution can be enhanced in different ways [19]. One of the approaches is to create an up-scaled superresolution image. The main problem is that the pixel


number of the base image is a few times smaller than in the superresolution image. There are many approaches to solving this, for example, interpolation [20]. However, interpolation does not add any new information to the target image, so such techniques are considered ineffective for image resolution enhancement. Therefore, a method has been developed that recreates a double-sized high-resolution image using several low-resolution subpixel-shifted images of the same size and scene [21]. The DFT has been applied as the basic imagery processing approach, as it decomposes imagery data represented in the spatial domain into its periodic components. The DFT output represents the image in the frequency domain, where each point represents a particular frequency component contained in the image. The following technique performs superresolution in the frequency domain owing to the ability to process large images quickly.

3.1 Fourier Transform

There are different versions of the Fourier transform, which differ by a common scalar multiplier, the origin of the image, and the sign of the exponent [22]. Let X = {X(y, x), y = 0, …, m − 1, x = 0, …, n − 1} be an image with m × n dimensions. The Fourier transform would be:

\hat{X}(\eta, \xi) = \sum_{y=0}^{m-1} \sum_{x=0}^{n-1} X(y, x)\, \exp\!\left( 2\pi i \left( \frac{y\eta}{m} + \frac{x\xi}{n} \right) \right).   (4)



The result of the transform \hat{X}(\eta, \xi) is a periodic function of each argument, that is, for integer k_1 and k_2:

\hat{X}(\eta + k_1 m,\ \xi + k_2 n) = \hat{X}(\eta, \xi).   (5)

The inverse transform is:

X(y, x) = \frac{1}{mn} \sum_{\eta} \sum_{\xi} \hat{X}(\eta, \xi)\, \exp\!\left( -2\pi i \left( \frac{y\eta}{m} + \frac{x\xi}{n} \right) \right),   (6)

where the sum is calculated over the period.
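For readers implementing Eqs. (4)–(6) with a numerical library, the following short check relates the chapter's conventions to NumPy: the positive-exponent forward transform of Eq. (4) equals mn times ifft2, and the inverse of Eq. (6) is fft2 scaled by 1/(mn). This is only a consistency check under that assumption, not part of the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 6
X = rng.normal(size=(m, n))

X_hat = m * n * np.fft.ifft2(X)         # forward transform of Eq. (4)
X_back = np.fft.fft2(X_hat) / (m * n)   # inverse transform of Eq. (6)

print(np.allclose(X_back.real, X), np.max(np.abs(X_back.imag)) < 1e-10)
```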

3.2 Image Shift Direction

The shift is a vector that shows the coordinate difference between the respective points of the estimated image and the reference image. Images could be represented as


matrices:

X_{ref} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad X_1 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad X_2 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.

Then the shift of X_1 relative to X_ref is (Δy_1, Δx_1) = (1, 1), and the shift of X_2 relative to X_ref is (Δy_2, Δx_2) = (0, 2); see the example in Fig. 2. Let us study the relation between the high-resolution image and the low-resolution images. Consider a resampling operator which doubles the size of the pixel footprint and takes account of the subpixel shift. This operator maps the high-resolution image into one of the low-resolution images. Assume the shift is less than a half-pixel: −0.5 ≤ Δy ≤ 0.5, −0.5 ≤ Δx ≤ 0.5. Then the operator would be as follows:

Y(y, x) = (0.5 + \Delta y)(0.5 + \Delta x) X(y-1, x-1) + (0.5 + \Delta y) X(y-1, x) + (0.5 + \Delta y)(0.5 - \Delta x) X(y-1, x+1) + (0.5 + \Delta x) X(y, x-1) + X(y, x) + (0.5 - \Delta x) X(y, x+1) + (0.5 - \Delta y)(0.5 + \Delta x) X(y+1, x-1) + (0.5 - \Delta y) X(y+1, x) + (0.5 - \Delta y)(0.5 - \Delta x) X(y+1, x+1).   (7)

Figure 3 represents an example of the above approach: Y(y, x) is a pixel of the target high-resolution image, and it is composed of 9 pixels of the low-resolution image X.

Fig. 2 Examples of images with pixel shift


Fig. 3 Example of subpixel-shifted images

To see how X depends on Y, the operator can be applied to Y(y−1, x−1), Y(y−1, x), Y(y−1, x+1), …, Y(y+1, x+1), keeping on the right side of the result only the X(y, x) components:

Y(y-1, x-1) = \dots + (0.5 - \Delta y)(0.5 - \Delta x) X(y, x)
Y(y-1, x) = \dots + (0.5 - \Delta y) X(y, x) + \dots
Y(y-1, x+1) = \dots + (0.5 - \Delta y)(0.5 + \Delta x) X(y, x) + \dots
\vdots
Y(y+1, x+1) = (0.5 + \Delta y)(0.5 + \Delta x) X(y, x) + \dots

The convolution matrix of the transform equals:

\begin{pmatrix} (0.5 - \Delta y)(0.5 - \Delta x) & (0.5 - \Delta y) & (0.5 - \Delta y)(0.5 + \Delta x) \\ (0.5 - \Delta x) & 1 & (0.5 + \Delta x) \\ (0.5 + \Delta y)(0.5 - \Delta x) & (0.5 + \Delta y) & (0.5 + \Delta y)(0.5 + \Delta x) \end{pmatrix}.   (8)

The transfer function, which is a shift and blur operator, can be obtained via the Fourier transform of the convolution matrix:

T(\eta, \xi) = \left( (0.5 - \Delta y)\, e^{-2\pi i \eta / m} + 1 + (0.5 - \Delta y)\, e^{2\pi i \eta / m} \right) \times \left( (0.5 - \Delta x)\, e^{-2\pi i \xi / n} + 1 + (0.5 - \Delta x)\, e^{2\pi i \xi / n} \right).   (9)


3.3 Arbitrary Shift

In general, the image shift is not always less than the pixel size. Let us extend the previous approach to the general solution. Assume that Δy_int and Δx_int are the integers closest to Δy and Δx, respectively. The remainder parts are denoted Δy_frac and Δx_frac: Δy = Δy_int + Δy_frac, −0.5 ≤ Δy_frac ≤ 0.5, and Δx = Δx_int + Δx_frac, −0.5 ≤ Δx_frac ≤ 0.5. For this case, the transfer function is a superposition of the integer shift operator and the shift-blur operator described above. The transfer function for the integer shift is:

T_{int}(\eta, \xi) = \exp\!\left( 2\pi i \left( \frac{\eta\, \Delta y_{int}}{m} + \frac{\xi\, \Delta x_{int}}{n} \right) \right).   (10)

The transfer function for the subpixel shift is obtained from (9). Thus, the total transfer function is:

T(\eta, \xi) = e^{2\pi i \eta \Delta y_{int}/m} \left( (0.5 - \Delta y_{frac})\, e^{-2\pi i \eta/m} + 1 + (0.5 - \Delta y_{frac})\, e^{2\pi i \eta/m} \right) \times e^{2\pi i \xi \Delta x_{int}/n} \left( (0.5 - \Delta x_{frac})\, e^{-2\pi i \xi/n} + 1 + (0.5 - \Delta x_{frac})\, e^{2\pi i \xi/n} \right).   (11)
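A possible NumPy sketch of this shift-blur transfer function: it builds the separable 3 × 3 kernel of Eq. (8) for the fractional shift, applies the integer-shift phase of Eq. (10), and takes the kernel's FFT under NumPy's sign convention. The function and variable names are our own and this is not the authors' reference implementation.

```python
import numpy as np

def shift_blur_transfer(m: int, n: int, dy: float, dx: float) -> np.ndarray:
    """Transfer function of the shift-blur resampling operator on an m x n grid."""
    dyi, dyf = np.rint(dy), dy - np.rint(dy)        # integer / fractional split
    dxi, dxf = np.rint(dx), dx - np.rint(dx)
    ky = np.array([0.5 - dyf, 1.0, 0.5 + dyf])      # taps at offsets -1, 0, +1 (Eq. 8)
    kx = np.array([0.5 - dxf, 1.0, 0.5 + dxf])
    kernel = np.zeros((m, n))
    for i, wy in zip((-1, 0, 1), ky):
        for j, wx in zip((-1, 0, 1), kx):
            kernel[i % m, j % n] = wy * wx
    eta = np.arange(m)[:, None]
    xi = np.arange(n)[None, :]
    integer_shift = np.exp(2j * np.pi * (eta * dyi / m + xi * dxi / n))
    return integer_shift * np.fft.fft2(kernel)

T = shift_blur_transfer(64, 64, dy=1.3, dx=-0.4)
print(T.shape, abs(T[0, 0]))   # DC gain is 4, the factor appearing in Eqs. (12)-(13)
```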

3.4 Pixel Number Reduction

The low-resolution image has a grid that is twice less dense in each direction than the high-resolution image. Thus, the final step needed to create a half-sized low-resolution image is pixel number reduction. Let X = {X(y, x), y = 0, …, 2m − 1, x = 0, …, 2n − 1} be a 2m × 2n image. Let Y be an image with m × n dimensions that consists of those pixels of X whose coordinates are both even numbers: Y(y, x) = X(2y, 2x), 0 ≤ y < m, 0 ≤ x < n. Then the Fourier transforms of the images X and Y are related as follows:









4Y (η, ξ ) = X (η, ξ ) + X (η ± m, ξ ) + X (η, ξ ± n) + X (η ± m, ξ ± n). 

Because of the periodicity of X , the sign + or − does not matter.

(12)

Knowledge-Based Multispectral Remote Sensing Imagery Superresolution

227

3.5 Linear Regression Model with a Priory Data Having solved the inverse task we can now solve the direct one. Let low-resolution images have m × n dimensions and high-resolution images have 2m × 2n dimen

sions. K is the number of low-resolution images and Y k , k = 1, K be the Fourier 

transforms of low-resolution images. X is the Fourier transform of the unobserved high-resolution image. Suppose that the low-resolution images are obtained from the high-resolution image by a shift-blur operator (see (9) or (11) for the transfer function), pixel number reduction (see (12)) and addition of an error. In the frequency domain, the Fourier transform of the k observed image equals:, 







4Y k = Tk (η, ξ ) X (η, ξ ) + Tk (η ± m, ξ ) X (η ± m, ξ ) + Tk (η, ξ ± n) X (η, ξ ± n) 



+Tk (η ± m, ξ ± n) X (η ± m, ξ ± n) + 4 E(η, ξ ), (13) where T k is the optical transfer function for the blur-shift operator for the k-th 

observed image; E k (η, ξ ) is the Fourier transform of the error in the k-th observed image. The errors are supposed to have zero mean and variance equal to γ E (η, ξ ): E Eˆ k (η, ξ ) = 0, E|E(η, ξ )|2 = γ E (η, ξ ). 

(14)



It is also supposed that X (η, ξ ) has a priory mean X pri (η, ξ ) and variance equal to γ X (η, ξ ).  2   E Xˆ (η, ξ ) = Xˆ pri (η, ξ ), E  Xˆ (η, ξ ) − Xˆ pri (η, ξ ) = γ X (η, ξ ),

(15)



where γ E is a spectral density of the error, X pri is a Fourier transform of a priory X pri , γ X is a spectral density of X − X pri .   ˆ E(η, ξ ) are complex variables such that E(−η, −ξ ) and E(η, ξ ) are conjugate 



complex numbers. The same holds true for X (η, ξ ) and X pri (η, ξ ). γ X may or may not depend on k. It is assumed that γ X does not depend on k. For this model γ X is assumed to be estimated as the mean spectral density of K input low-resolution images and γ E is assumed to be estimated as the mean pairwise difference of K input low-resolution images. So, the newly developed algorithm, which embodies complete workflow for image superresolution, is presented in Fig. 4.

228

S. A. Stankevich et al. Input data

Evaluation of input images’ shift

Yes

Shift > pixel

No

Cut images with integer precision

Evaluation of γX, γY and transfer functions

Resolution enhancement

Imagery filtering

Inversed Fourier Transform

Fig. 4 Superresolution technique algorithm flowchart

4 Convolutional Neural Networks Implement Image processing, improvement, and analysis are one of the first applications of artificial intelligence (AI). AI-powered image improvement methods simulate human retoucher’s work using precedent-based machine learning algorithms. The approach based on the neural network ensures more or less successful image restoration and improvement. Convolutional neural network (CNN) fulfills such tasks most successfully. Convolution neural networks were applied to image processing as early as the last century. In recent decades they have been actively developing. This was achieved through efficient training with powerful GPU and simpler access to open data. Convolutional neural networks implement pattern-based image improvement methods. The basic image improvement operations are pattern extraction, their nonlinear transformation and reconstruction. The main functions of CNN are denoising, sharpening, and fine image details restoring.

Knowledge-Based Multispectral Remote Sensing Imagery Superresolution

229

CNN for image improvement can be subdivided into several types: linear CNN [23], residual CNN [24], recursive CNN [25], densely connected CNN [26], attentionbased CNN [27], generative adversarial CNN [28], etc. Ones differ in many parameters and structure. The main focus of CNN training is visual perception quality because training images are evaluated by the human expert. In remote sensing applications, the use of CNN can significantly improve the efficiency of satellite image visual interpretation. However, the application of existing CNNs to satellite imagery is not always possible and advisable. Firstly, there are not many satellite images suitable for training compared to general-purpose images. Secondly, the network training time for analyzing and improving satellite images is much longer. Satellite images can be taken at different times of the year, at long time intervals. The changes that have occurred can introduce an error both during training and during operation. Satellite imagery are specific images. The different spatial resolution of the satellite imagery (from low to high) requires different training of CNN. Usually CNN cannot excellently work with images of different spatial resolution. Noise type, contrast, and other satellite imagery features are very different too. The amount of space imagery is very large, so the application of CNN for processing it is quite resource-intensive. One type of CNN that can be used to process space imagery is SRCNN. Therefore, it was adapted for the final improvement of high spatial resolution images in this work.
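As an indicative sketch only, the block below defines a minimal SRCNN-style network in PyTorch with the 9-1-5 layer structure of the original SRCNN; the network actually adapted and trained in this chapter is not published here, so the architecture and channel counts should be read as placeholders rather than the authors' model.

```python
import torch
import torch.nn as nn

class MiniSRCNN(nn.Module):
    """Minimal SRCNN-style network: patch extraction, non-linear mapping,
    reconstruction. Illustrative placeholder, not the chapter's trained CNN."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Maps an already upscaled (e.g. bicubic) band to its improved version.
        return self.body(x)

model = MiniSRCNN()
print(model(torch.rand(1, 1, 64, 64)).shape)   # torch.Size([1, 1, 64, 64])
```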

5 Actual Resolution Evaluation A no-reference method based on modulation transfer function (MTF) has been engaged to evaluate the actual resolution of digital images. The MTF allows to evaluate objectively the actual resolution of any digital image and thus to determine its change after enhancement [29]. The MTF is the absolute value of Fourier transform from impulse response h(x, y) of image:   ∞ ∞      −2πiξ x −2πiηy  h(x, y) · e ·e d x d y , T (ξ, η) =   

(16)

−∞ −∞

where ξ, η are the spatial frequencies, which correspond to x and y spatial dimensions in the image. In optical sciences, the system’s impulse response is also referred to as the point spread function (PSF) [30]. There are many known methods for the MTF evaluation directly by digital image: using special test target (resolution chart); by the image of a bright point, which can be considered as PSF; by the image of a narrow line; by the jump response of the contrast edge in the image—edge spread function (ESF), etc. The latter method is

230

S. A. Stankevich et al.

the most preferred one because it provides better accuracy due to the more statistics available after measuring [31]. The Gaussian approximation of MTF was considered in this research. Let the bidirectional PSF of the image is described by the Gaussoid of the form [32]: h(x, y) =

2 x2 1 − y · e− 2σ 2 · e 2ς 2 , 2π · σ · ς

(17)

where σ and ς are the Gaussiod’s parameters along the spatial axes x and y. In this case, after applying the Fourier transform to (17), the bidirectional MTF will take the following form: T (ξ, η) = e−2π

2

·σ 2 ·ξ 2

· e−2π

2

·ς 2 ·η2

.

(18)

A typical MTF of the digital image is shown in Fig. 5. The value of the actual spatial resolution of the image can be determined using the MTF for a preassigned modulation threshold T * . The actual spatial resolution r of the image is determined by the spatial frequency ζ * corresponding to the modulation threshold T * [33, 34]: r≡

1 . ξ∗

(19)

The r value itself is an objective estimate that provides comparing the results of the satellite images resolution enhancement.

Fig. 5 Gaussian approximation of MTF with σ = 0.32 parameter value

Knowledge-Based Multispectral Remote Sensing Imagery Superresolution

231

6 Results The possibilities of the proposed approach were demonstrated on the spatial resolution enhancement of the visible bands of the Sentinel-2A satellite multispectral image (Fig. 6a). The following data processing chain was put in operation: the spectral bands of the original image were translated into a single spectral band joint to all ones. Thereafter it is possible to apply a superresolution to the unified image set. Mutual subpixel shifts between images have been estimated and the algorithm described in Sect. 3 has been run. The resolution enhancement algorithm implements an image processing dataflow that is explained by the Fig. 7 diagram. An arbitrary number of spectral bands of the original multispectral satellite image (RGB 1x) through band translation are converted into a set of panchromatic images (Pan 1x) of the joint spectral band, which overlaps all input ones. Next, the superresolution procedure is executed, as a result of which one enhanced resolution panchromatic image (Pan 2x) is formed from several subpixel-shifted panchromatic images. The sequent step of processing is a standard pan-sharpening based on the enhanced resolution panchromatic image and the original (non-translated) spectral bands of the RGB 1x image. The result of pan-sharpening is an RGB 2x multispectral image that is already quite close to the desired one. It remains only to slightly improve one’s quality using a convolutional neural network (CNN) and in this way to obtain the final image (Enhanced RGB). The whole processing dataflow of Fig. 7 was applied to the input multispectral satellite image Fig. 6a, and images Fig. 6b–e were derived sequentially. An actual resolution was evaluated for Fig. 6a, d, and e images. The Gaussian approximation of MTF was used. The ESFs both along the columns (horizontal) and along the lines (vertical) were extracted from the evaluated image to obtain a bidirectional MTF. To do this, the entire image was scanned with a 4 × 4 sliding window, the columns and lines were averaged to obtain four-pixel ESF realizations, and those that correspond to the well-traced edges in the image, were selected. This technique is described in more detail in [35]. Since the image processing result is two estimates of spatial resolution—vertical r x and horizontal r y —then the geometric mean is used as a general anisotropic estimate of resolution: r=



rx · r y .

(20)

As a result of test images analysis, averaged ESF Fig. 8 was extracted. The values of the Gaussoids’ parameter σ calculated by the ESFs as well as their corresponding actual resolutions of the test images for the modulation threshold T * = 0.25 are contained in Table 1. The observed slight decrease in a pixel resolution of the image after subpixel processing and subsequent pan-sharpening is explained by doubling the size of the output image: the rescaled pixel resolution of the input image will be r = 2 r

232

S. A. Stankevich et al.

a)

b)

d)

c)

e)

Fig. 6 Sentinel-2A multispectral satellite images under resolution enhancement Zilina (Slovakia), September 19, 2020, 10 m input spatial resolution: a—original 3-band natural color composite image; b—translated band panchromatic image; c—panchromatic image after 3 translated bands superresolution; d—natural color composite image after pan-sharpening; e—final output image after AI-based improvement

= 5.966 pix. The terminational CNN provides additional resolution enhancement approaching the limit for directly acquired satellite images. Thus, the performed evaluation of actual spatial resolution after the proposed processing shows its practical efficiency for multispectral satellite images.

Knowledge-Based Multispectral Remote Sensing Imagery Superresolution

Fig. 7 Multispectral satellite image resolution enhancement dataflow diagram

Fig. 8 Averaged edge spread functions of Fig. 7a, d, and e test images

233

234

S. A. Stankevich et al.

Table 1 Resolution enhancement evaluation results

Test image                  Gaussoid's parameter σ, pix    Resolution r, pix (for T* = 0.25)
Figure 6a (RGB 1×)          0.791                          2.983
Figure 6d (RGB 2×)          0.844                          3.185
Figure 6e (RGB enhanced)    0.331                          1.246

7 Conclusions

The proposed multispectral imagery spatial resolution enhancement technique proved to be quite efficient, especially when no panchromatic image is available among the input data. The band translation approach allows not only restoring images with enhanced spatial resolution but also maintaining the radiometric correctness of the resulting images. The chosen superresolution method by itself provides a 46.6% enhancement in actual spatial resolution, and the terminal CNN engagement improves it significantly further. Future development of this technique should concentrate on the band translation method, particularly on improving the approximation of the resulting data signal: spectra classification [36] instead of linear interpolation. The CNN also requires special adaptation to remote sensing image handling; this method is widely used in quite different AI-based applications. Generally, CNNs offer a very high potential for remote sensing data improvement, especially based on radiometric signal classification, notwithstanding that this technology still needs a large amount of pre-labelled data as a machine learning basis.

Acknowledgements This work was supported by the Slovak Research and Development Agency under the grant No. SK-SRB-18-0002.

References 1. Kokhan, S.S.: Application of multispectral remotely sensed imagery in agriculture. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXVIII, part 7B, pp. 337−341 (2010) 2. Burciu, Z., Abramowicz-Gerigk, T., Przybyl, W., Plebankiewicz, I., Januszko, A.: The impact of the improved search object detection on the SAR action success probability in maritime transport. Sensors 20, 3962 (2020) 3. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(4), 623–656 (1948) 4. Stankevich, S.A.: Quantitative analysis of informativeness of hyperspectral aerospace imagery in solving thematic tasks of Earth remote sensing (in Ukrainian). Rep. NAS Ukraine 10, 136– 139 (2006) 5. Stankevich, S., Piestova, I., Shklyar, S., Lysenko, A.: Satellite dual-polarization radar imagery superresolution under physical constraints. In: Shakhovska, N., Medykovskyy, M.O. (eds.)

Advances in Intelligent Systems and Computing IV, vol. 1080, pp. 439–452. Springer, Cham (2020) 6. Piestova, I.O., Stankevich, S.A., Kostolny, J.: Multispectral imagery super-resolution with logical reallocation of spectra. In: Proceedings of the International Conference on Information and Digital Technologies, pp. 322–326. Zilina, Slovakia (2017) 7. Zaitseva, E., Levashenko, V.: Construction of a reliability structure function based on uncertain data. IEEE Trans. Reliab. 65(4), 1710–1723 (2016) 8. Stankevich, S.A., Andreiev, A.A., Lysenko, A.R.: Multiframe remote sensed imagery superresolution. In: Proceedings of the 15th International Scientific-Practical Conference on Mathematical Modeling and Simulation Systems (MODS 2020). Chernihiv National University of Technology, Chernihiv, Ukraine (2020) 9. Rukundo, O., Cao, H.: Nearest neighbor value interpolation. Int. J. Adv. Comput. Sci. Appl. 3(4), 25–30 (2012) 10. Keys, R.: Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 29(6), 1153–1160 (1981) 11. Stankevich, S.A., Shklyar, S.V., Podorvan, V.N., Lubskyi, N.S.: Thermal infrared imagery informativity enhancement using sub-pixel co-registration. In: Proceedings of the 2016 International Conference on Information and Digital Technologies, pp. 245–248. Rzeszow, Poland (2016) 12. Asokan, A., Anitha, J.: Lifting wavelet and discrete cosine transform-based super-resolution for satellite image fusion. In: Singh, V., Asari, V., Kumar, S., Patel, R. (eds.) Computational Methods and Data Engineering. Advances in Intelligent Systems and Computing, vol. 1227. Springer, Singapore (2021) 13. Basha, S.A., Vijayakumar, V.: Wavelet transform based satellite image enhancement. J. Eng. Appl. Sci. 13(4), 854–856 (2018) 14. Stankevich, S.A., Popov, M.O., Shklyar, S.V., Sukhanov, K.Y., Andreiev, A.A., Lysenko, A.R., Kun, X., Cao, S., Yupan, S., Boya, S.: Estimation of mutual subpixel shift between satellite images: software implementation. Ukr. J. Remote Sens. 24, 9–14 (2020) 15. Sekrecka, A., Kedzierski, M., Wierzbicki, D.: Pre-processing of panchromatic images to improve object detection in pansharpened images. Sensors 19(23), 5146–5172 (2019) 16. Aiazzi, B., Baronti, S., Selva, M.: Image fusion through multiresolution oversampled decompositions. In: Stathaki, T. (ed.) Image Fusion: Algorithms and Applications, pp. 27–66. Academic Press (2008) 17. Gonzalez-Audicana, M., Saleta, J.L., Catalan, R.G., Garcia, R.: Fusion of multispectral and panchromatic images using improved IHS and PCA mergers based on wavelet decomposition. IEEE Trans. Geosci. Remote Sens. 6(42), 1291–1299 (2004) 18. Markham, B., Barsi, J., Kvaran, G., Ong, L., Kaita, E., Biggar, S., Czapla-Myers, J., Mishra, N., Helder, D.: Landsat-8 Operational Land Imager radiometric calibration and stability. Remote Sens. 6, 12275–12308 (2014) 19. Yue, L., Shen, H., Li, J., Yuan, Q., Zhang, H., Zhang, L.: Image super-resolution: the techniques, applications, and future. Signal Process. 128, 389–408 (2016) 20. Patil, V.H., Bormane, D.S.: Interpolation for super resolution imaging. In: Sobh, T. (ed.) Innovations and Advanced Techniques in Computer and Information Sciences and Engineering, pp. 483–489. Springer, Dordrecht (2007) 21. Lyalko, V.I., Popov, M.A., Stankevich, S.A., Shklayr, S.V., Podorvan, V.N.: Prototype of satellite infrared spectroradiometer with superresolution. J. Inf. Control Manage. Syst. 2(12), 153–164 (2014) 22. Stone, H.S., Orchard, M.T., Chang, E.-C., Martucci, S.A.: A fast direct Fourier-based algorithm for subpixel registration of images. IEEE Trans. Geosci. Remote Sens. 39(10), 2235–2243 (2001) 23. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image superresolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision, Lecture Notes in Computer Science, vol. 8692. Springer, Cham (2014)


24. Lim, B., Son, S., Kim, H., Nah, S., Lee, K. M.: Enhanced deep residual networks for single image super-resolution. In: 2017 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1132–1140. Honolulu, HI, USA (2017) 25. Tai, Y., Yang, J., Liu, X., Xu, C.: MemNet: a persistent memory network for image restoration. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4539–4547. Venice, Italy (2017) 26. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision—ECCV 2018. Lecture Notes in Computer Science, vol. 11211, pp. 286–301. Springer, Cham (2018) 27. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image superresolution. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2472–2481. Salt Lake City, UT, USA (2018) 28. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Loy, C.C.: ESRGAN: enhanced super-resolution generative adversarial networks. In Computer Vision – ECCV 2018 Workshops. Munich, Germany (2019) 29. Zhang, R., Zhang, X., Gong, Z., Ji, X., Luo, S.: Fusion image quality assessment based on modulation transfer function. In: Proceedings of IEEE International Symposium on Image and Data Fusion, pp. 1–5. Tengchong, Yunnan, China (2011) 30. Seshadrinathan, K., Pappas, T.N., Safranek, R.J., Chen, J., Wang, R. Sheikh, H.R., Bovik, A.C.: Image quality assessment. In: The Essential Guide to Image Processing, pp. 553–595. Academic Press, San Diego (2009) 31. Kang, J., Hao, Q., Cheng, X.: Measurement and comparison of one- and two-dimensional modulation transfer function of optical imaging systems based on the random target method. Opt. Eng. 53(10), 8 (2014) 32. Li, T., Feng, H.: Comparison of different analytical edge spread function models for MTF calculation using curve-fitting. Proc. SPIE. 7498, 74981H, 8 (2009) 33. Becker, S., Haala, N.: Determination and improvement of spatial resolution for digital aerial images. ISPRS Arch. XXXVI, 1/W3, 6 (2005) 34. Viallefont-Robinet, F., Helder, D., Fraisse, R., Newbury, A., van den Bergh, F., Lee, D.H., Saunier, S.: Comparison of MTF measurements using edge method: towards reference data set. Opt. Express 26(26), 33625–33648 (2018) 35. Stankevich, S.A.: Evaluation of the spatial resolution of digital aerospace image by the bidirectional point spread function parameterization. In: Shkarlet, S., Morozov, A., Palagin, A. (eds.) Advances in Intelligent Systems and Computing, vol. 1265, pp. 317–327. Springer Nature, Cham (2021) 36. Levashenko, V., Zaitseva, E., Puurinen S.: Fuzzy classifier based on fuzzy decision tree. In: Proceedings of EUROCON 2007 The International Conference on Computer as a Tool 2007, pp. 823–827

Neural Network Training Acceleration by Weight Standardization in Segmentation of Electronic Commerce Images V. Sorokina and S. Ablameyko

Abstract We propose the use of weight standardization in the segmentation of electronic commerce images, with the main goal of accelerating neural network training. Weight standardization is aimed at micro-batch training, characterized by 1–2 images per GPU, while making it possible to obtain results comparable to or better than those achieved when training with batch normalization. We validate the approach on the segmentation of 21 classes of electronic commerce images. All experiments demonstrate an improvement of micro-batch training both in quality (about 3%) and in speed (faster by a factor of 1.6). Moreover, weight standardization also improves the results in classification and object detection tasks. Keywords Segmentation · Convolutional neural network (CNN) · YOLACT · Weight standardization

1 Introduction

Over the past few years, the computer vision community has made rapid progress in the image segmentation task. State-of-the-art approaches to instance segmentation, such as Mask R-CNN (Mask Region-based Convolutional Neural Network) [1] and FCIS (Fully Convolutional Instance-aware Semantic Segmentation) [2], build directly on advances in object detection such as Faster R-CNN (Faster Region-based Convolutional Neural Network) [3] and R-FCN (Region-based Fully Convolutional Network) [4].

V. Sorokina (B) · S. Ablameyko Belarussian State University, Nezavisimosti av. 4, 220050 Minsk, Belarus S. Ablameyko e-mail: [email protected] S. Ablameyko United Institute of Informatics Problems of the National Academy of Sciences of Belarus, Surganova str. 6, 220012 Minsk, Belarus © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_14


Mask R-CNN is a typical two-stage segmentation method, characterized by the generation of potential regions of interest (ROI) in the first stage and the classification and segmentation of these regions in the second stage. In further research there were attempts to improve the accuracy of this method, for instance by enriching the Feature Pyramid Network (FPN) [5] or by eliminating the incompatibilities between the assessment of the ground truth of the mask and the accuracy of its localization [6]. These two-stage methods require resampling for each ROI and its subsequent postprocessing, which makes them unable to achieve real-time speed (30 frames per second) even when the image size is reduced. One-stage methods produce position-sensitive maps, which are reassembled into final masks in a position-dependent way [2, 7]. Although one-stage segmentation methods are conceptually faster than two-stage ones, they still require resampling or other non-trivial computations. This limits their speed and prevents them from reaching real-time operation. Finally, some methods first perform semantic segmentation and then detect boundaries [8] or cluster pixels [9, 10]. Again, these methods involve multiple steps and/or include expensive clustering procedures, which limits their viability for real-time applications. All these facts lead to one of the most significant problems: training a neural network for real-time segmentation requires considerable processing power (several graphics processing units, GPUs), and it takes a long time to achieve the same results as with two-stage methods. In this paper, an approach for real-time segmentation of electronic commerce images is proposed. This approach is based on the use of YOLACT [11] modified by weight standardization [12]. The main idea is the acceleration of neural network training. For this purpose, the segmentation task is solved using one or two GPUs, while the size of the images is not decreased and the prediction accuracy is not lost. The goal is to recognize objects of 21 classes pixel by pixel in images.

2 Training Set

The training set consists of 40,032 images divided into 21 classes of objects. It was created by combining an identified object with an arbitrary background using rotation, dilation and centering. Each image is 800 × 800 px in size, in RGB format with 8 bits per channel (Fig. 1). To solve the segmentation task, each image from the training set has a binary mask, i.e. a one-channel image where 0 represents the background and 255 the foreground (the object itself). The test set consists of similar images and includes 12,098 images.


Fig. 1 Examples of the images in training set

3 Weight Standardization

The YOLACT neural network used in this research belongs to the class of convolutional neural networks. In the architecture of a convolutional neural network, the weights are the elements of the kernel, the matrix involved in the convolution operation. Each kernel slides over the corresponding input channels of the image, generating a processed version of them. To improve the performance and stabilize the operation of the neural network, normalization methods applied to some layers of the network are used. One of these methods is batch normalization. The gist of this method is that some layers of the neural network are fed with preprocessed data having zero expected value and unit variance. Batch gradient descent is an implementation of gradient descent in which at each iteration the whole training sample is scanned and only after that the model weights are changed. Batch normalization influences the neural network training process in a fundamental way: it reduces the Lipschitz constants of the loss function and makes the gradients more Lipschitz, that is, the loss function has better smoothness [12].
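As a minimal illustration of what batch normalization does to the activations (not the authors' code), the PyTorch fragment below shows that each channel of the normalized feature map has approximately zero mean and unit variance over the batch and spatial axes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16, 32, 32) * 3.0 + 5.0   # a batch of feature maps (N, C, H, W)

bn = nn.BatchNorm2d(num_features=16, affine=False)  # no learnable scale/shift
y = bn(x)  # in training mode the batch statistics are used

# statistics are computed per channel over the batch and spatial axes
print(y.mean(dim=(0, 2, 3)))  # approximately zero for every channel
print(y.std(dim=(0, 2, 3)))   # approximately one for every channel
```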


The Lipschitz constant of a function $f$ is the value $L$ for which $f$ satisfies $|f(x_1) - f(x_2)| \le L \,\|x_1 - x_2\|$ for all $x_1, x_2$. These results are obtained for activations that are standardized by batch normalization to have zero expected value and unit variance. Batch normalization therefore takes into consideration the Lipschitz constants related to the activations, not to the weights, which are what the optimizer changes directly. To optimize the weights directly, the weight standardization method is used. Its idea is to calibrate the weights in the convolutional layers to smooth the loss landscape. As a result, there is no need to worry about transferring the smoothing effects from the activations to the weights. The aim of weight standardization is to accelerate the training of neural networks, as batch normalization does, but without using a large batch size during training. In [12] it was proved that weight standardization reduces the Lipschitz constants of the loss function and of its gradients. Thus, training is improved. Figure 2 compares the normalization methods (batch normalization, layer normalization, instance normalization and group normalization) with weight standardization [12]. Each graph shows the feature map tensor, where N is the batch axis, C is the channel axis, and H and W are the spatial axes. Blue pixels are normalized to the same mean and variance, calculated by aggregating the values of those pixels. Consider a standard convolutional layer with bias equal to zero:

$$ y = W * x, $$

where $W \in \mathbb{R}^{O \times l}$ denotes the weights of the layer and $*$ is the convolution operator. Here $O$ is the number of output channels and $l$ is the number of input channels within the kernel region of each output channel. In Fig. 2, $O = C_{out}$ and $l = C_{in} \times \text{kernel\_size}$.



Fig. 2 Comparing normalization methods on the activation functions (blue) and Weight Standardization (orange)


In the weight standardization method, instead of directly optimizing the loss function $\mathcal{L}$ on the original weights, the weights $\widehat{W}$ are reparametrized as a function of $W$, i.e. $\widehat{W} = \mathrm{WS}(W)$, and the loss function $\mathcal{L}$ is optimized with respect to $W$ by the stochastic gradient descent method:

$$ y = \widehat{W} * x, \qquad \widehat{W}_{i,j} = \frac{W_{i,j} - \mu_{w_{i,\cdot}}}{\sigma_{w_{i,\cdot}} + \varepsilon}, $$

where

$$ \mu_{w_{i,\cdot}} = \frac{1}{l} \sum_{j=1}^{l} W_{i,j}, \qquad \sigma_{w_{i,\cdot}} = \sqrt{\frac{1}{l} \sum_{j=1}^{l} \left( W_{i,j} - \mu_{w_{i,\cdot}} \right)^2}. $$

Like batch normalization, weight standardization adjusts the first and second moments of the weights of each output channel individually in the convolutional layers. Weight standardization normalizes the weights in a differentiable way, so that the gradients are normalized during backpropagation. It should be noted that affine transformations are not applied to $\widehat{W}$, as it was shown in [12] that the use of affine transformations damages training.

4 Image Segmentation Using YOLACT

The main idea behind the YOLACT architecture [11] is to add a mask branch to an existing one-stage model in order to solve the segmentation task, similarly to what Mask R-CNN does for Faster R-CNN, but without an explicit feature localization step (for example, feature repooling). For this, the segmentation task is divided into two simpler parallel tasks whose results can be combined to form the final masks. The first branch uses a Fully Convolutional Network (FCN) to create a set of prototype masks of the same size as the image itself, which do not depend on any instance of the detected objects. The second branch is added to the object detection branch and predicts, for each anchor, a vector of mask coefficients that encodes a representation of the instance in prototype space. Finally, for each instance that survives Non-Maximum Suppression (NMS), a mask for that instance is created by a linear combination of the outputs of the two branches. The YOLACT architecture is shown in Fig. 3. To train the model, three loss functions are used: the classification loss ($L_{cls}$), the bounding box regression loss ($L_{box}$) and the mask loss ($L_{mask}$), with weights 1, 1.5 and 6.125, respectively. The functions $L_{cls}$ and $L_{box}$ are defined similarly to [13]. To calculate $L_{mask}$, the pixel-wise binary cross entropy (BCE) between the obtained masks $M$ and the ground truth masks $M_{gt}$ is used: $L_{mask} = \mathrm{BCE}(M, M_{gt})$.
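A schematic combination of the three loss terms with the stated weights might look as follows. This is a simplified sketch, not the actual YOLACT implementation; the classification and box-regression losses are assumed to be computed elsewhere, and the tensor shapes are dummies.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_loss: torch.Tensor,
               box_loss: torch.Tensor,
               pred_masks: torch.Tensor,
               gt_masks: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the three terms: L = L_cls + 1.5 * L_box + 6.125 * L_mask."""
    # pixel-wise binary cross entropy between assembled masks and ground truth
    mask_loss = F.binary_cross_entropy(pred_masks, gt_masks)
    return cls_loss + 1.5 * box_loss + 6.125 * mask_loss

# example usage with dummy values
cls_l = torch.tensor(0.7)              # softmax cross entropy, computed elsewhere
box_l = torch.tensor(0.4)              # smooth-L1 box regression loss, computed elsewhere
pred = torch.rand(2, 1, 200, 200)      # predicted masks with values in [0, 1]
gt = (torch.rand(2, 1, 200, 200) > 0.5).float()
print(total_loss(cls_l, box_l, pred, gt))
```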


Fig. 3 YOLACT architecture

The backbone used is ResNet-101, and the base image size is 800 × 800 pixels. To train the regressor, the smooth-L1 loss function is used; for classification, softmax cross entropy. Weight standardization was applied to the convolutional layers of the ResNet-101 backbone together with group normalization, i.e. the standard 2D convolution operation of the PyTorch framework was modified, as sketched below.
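A minimal sketch of such a modification is given below: a drop-in replacement for torch.nn.Conv2d that standardizes the weights of each output filter according to the formulas above before performing the convolution, combined with group normalization as in the backbone described here. This is an illustrative re-implementation in the spirit of [12], not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """2D convolution with weight standardization applied to the kernel."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight                                     # shape (O, C_in, kH, kW)
        mean = w.mean(dim=(1, 2, 3), keepdim=True)          # per output channel mean
        var = ((w - mean) ** 2).mean(dim=(1, 2, 3), keepdim=True)
        w_hat = (w - mean) / (var.sqrt() + 1e-5)            # standardized weights
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# example: a ResNet-style fragment combining WSConv2d with group normalization
layer = nn.Sequential(
    WSConv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.GroupNorm(num_groups=32, num_channels=128),
    nn.ReLU(inplace=True),
)
out = layer(torch.randn(2, 64, 56, 56))
print(out.shape)  # torch.Size([2, 128, 56, 56])
```

Because the standardization is part of the forward pass, the gradients flow through the mean and variance of the weights, which is what normalizes the gradients during backpropagation.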

5 Experiments and Results

Two YOLACT networks were trained: a classical one and one with weight standardization. The training of the model was aimed at recognizing 21 classes of electronic commerce objects: the model (a person in full height), shoes (four classes), clothes (five classes), food (five classes), cosmetics (five classes) and a background class. An NVIDIA T4 GPU was used for training with batch_size = 2; we find that this batch size is sufficient for using weight standardization. Training takes 5 days. The same results without weight standardization were achieved in 8 days, so weight standardization makes it possible to accelerate the training process by a factor of 1.6. Weight standardization was used in the convolutional layers during the forward pass of neural network training. It allowed the classification of objects to be improved by an average of 3%, and object detection by 4%. The neural network evaluation results are shown in Fig. 4. The results of the comparison between the classical YOLACT and YOLACT with weight standardization are shown in Table 1, where FPS is frames per second and AP is average precision, a popular metric for measuring the accuracy of object detectors that gives the average precision over recall values from 0 to 1; APS is the average precision of the segmentation, APM the average precision of the mask, and APL the average precision of the localization.


Fig. 4 NN evaluation results

Table 1 Average precision of the trained neural networks

Method                                  FPS     AP      AP50    AP75    APS     APM     APL
YOLACT                                  28.3    33.7    53.5    35.9    17.2    35.6    45.7
YOLACT with weight standardization      28.3    36.8    59.2    38.2    22.4    37.2    47.2

6 Discussions and Conclusion

In the course of the research, a dataset was prepared and annotated, and two YOLACT networks, a classical one and one with weight standardization, were built. These models were trained on the prepared dataset, validated and tested. It was found that weight standardization allowed the training to be sped up (fewer epochs were required to obtain the same results as without weight standardization). Moreover, the approach made it possible to reduce the requirements for training the network (fewer batches can be used). It was shown that the use of weight standardization gives a significant improvement in the metrics when training with micro-batches, which is typical for solving the segmentation task. Compared to training the classical YOLACT network, the use of weight standardization gives an improvement of 3% for the classification problem, 4% for the object detection problem and 3% for the segmentation problem. The process can be re-executed by using the YOLACT network and adding a weight standardization block to the existing one-stage model. The demonstrated method could be applied to other neural network architectures, and also used for YOLACT when increasing the size of the input image and, as a result, of the output mask.


References 1. He, K.: Mask R-CNN. In: IEEE International Conference on Computer Vision (ICCV), Venice, 22–29 October 2017, pp. 2980–2988 2. Li, Y. et al.: Fully convolutional instance-aware semantic segmentation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 21–26 July 2017, pp. 4438–4446 3. Ren, S. et al.: Faster R-CNN: towards real-time object detection with region proposal networks. In: 29th Annual Conference on Neural Information Processing Systems (NIPS), Montreal, 7–12 December 2015, Vol. 39, no. 6, pp. 1137–1149 4. Dai, J. et al.: R-FCN: object detection via region-based fully convolutional networks. In: 30th Annual Conference on Neural Information Processing Systems (NIPS), Barcelona, 4–9 December 2016, pp. 379–387 5. Liu, S. et al.: Path aggregation network for instance segmentation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, 18–22 June 2018, pp. 8759–8768. IEEE 6. Huang, Z. et al.: Mask scoring R-CNN. In: 2019 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, 15–21 June, 2019, pp. 6402–6411 7. Dai, J. et al.: Instance-sensitive fully convolutional networks. In: 14th European Conf. on Computer Vision, Amsterdam, 11–14 October 2016, Vol. 9910, pp. 534–549 8. Kirillov, A. et al.: InstanceCut: from edges to instances with MultiCut. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 21–26 July 2017, pp. 7322–7331 9. Bai, M., Urtasun, R.: Deep watershed transform for instance segmentation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 21–26 July, 2017, pp. 2858–866. 10. Liang, X. et al.: Proposal-free network for instance-level object segmentation. Trans. Pattern Anal. Mach. Intell. 40(12), 2978–2991 (2018) 11. Bolya, D. et al.: YOLACT: real-time instance segmentation. https://arxiv.org/abs/1904.02689. Accessed 13 Aug 2020 12. Qiao, S. et al.: Weight standardization. https://arxiv.org/abs/1903.10520. Accessed 13 Aug 2020 13. Liu, W. et al.: SSD: single shot MultiBox detector. In: 14th European Conference on Computer Vision, Amsterdam, 11–14 October 2016, pp. 21–27

Waterproofing Membranes Reliability Analysis by Embedded and High-Throughput Deep-Learning Algorithm Darya Filatova and Charles El-Nouty

Abstract The reliability analysis in civil engineering requires an understanding of the stability, durability, and rigidity principles over the planned period of exploitation. We develop a high-performance deep learning algorithm for the classification of water-repellent-membrane defects with a consequent reliability analysis. Based on a CNN architecture and on a mixed technique which uses K-fold cross-validation with synthetic dataset augmentation, the proposed methodology consists of sequential transformations of pixel-image intensities to find and classify damaged fragments on the membrane's surface. The "domain of confidence" reliability metric is introduced to analyze the further behavior of membranes. The computational experiments showed that the methodology can be successfully applied in near-real time with improved throughput, providing conclusive results and allowing AI-based, automated, and embedded devices to be designed for on-site use, removing the human error factor.

1 Introduction

The exploitation of any civil engineering system is associated with uncertainty and risk that cannot be foreseen either at the design stage or during the construction period. To this day, in civil engineering, the reliability issues (stability, durability, and rigidity) are still the most important points during engineering design. Despite that, the construction's maintenance and its constant monitoring also require a quantitative understanding of the risk in the post-construction period. In particular, when taking managerial actions one wants to know exactly the reliability of the continued operation of an existing structure as well as the possible consequences. The everyday observed technical and technological improvements are leading to the appearance of new materials with very

D. Filatova (B) P-A-R-I-S, CHArt, EPHE, Paris, France e-mail: [email protected] C. El-Nouty LAGA, UMR 7539, F-93430, Universite Sorbonne Paris Nord, Paris, France e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_15


exceptional properties, implying numerous simplifications of construction principles and at the same time posing numerous questions on how to estimate the reliability of the upcoming construction [1]. Computer modeling helps to respond to these questions by increasing the number of alternative propositions for the same problem solution. Respecting rigorous standard design principles, these computer-based models are used for the selection of the right material and the development of reliable constructions. In addition, taking into account some optimality criteria (e.g. energy losses), the structure can be modified and simplified without loss of safety only by means of the choice of "good" materials. For example, hydro isolation leads to the reduction of the energy losses of roofs. That is to say, the monolithic concrete stronghold surface of the roof can be reinforced by a hydro-protective layer. The key to this technology is the polymer's chemical reactions with the concrete surface [1–3]. However, such waterproofing systems stop working effectively if scratches and cracks occur, leading to penetration of water inside the structure. The place of damage and the place where the water soaks through usually do not coincide. As one can conclude, due to the transformations of the technical parameters, the reliability of the entire construction decreases. Recently, many papers were dedicated to the impact of new materials on the water migration phenomenon. The experimental research methodology is based on thermographic surveys or electronic analysis of digital images. Although this helps to identify pathologies in waterproofing layers, the methodologies require the final analysis to be completed by experts. The human factor decreases the reliability of the analysis. The question which arises here is how to minimize the risk of the decision-making process [4, 5]. The answer requires a search for a new methodology. The fourth industrial revolution would not have been possible without the development of information technologies. The preference given to the creation of cyber-physical systems is widely reflected in civil engineering. The creation of any object (physical system) is based on a detailed information model used as a control element at each stage of the object's life cycle. It is not difficult to see that the more complex the object is, the more complex the information model is and the more diverse information is generated. Very often, several information systems can be associated with each such object. Analysis of vast and heterogeneous flows of information (i.e. Big Data) becomes impossible for a "simple human user" of these information systems. The problem is even more complex in the case of the hydro-isolation surface analysis. Thermographic pictures or digital images contain unstructured information, which is still difficult to process by the classical methods of statistical inference. As an alternative approach, it is better to use artificial intelligence algorithms, which can find and classify even early-stage hidden faults, giving and evaluating information on the construction's reliability in real time [6]. Using the achievements of pattern recognition applied to the analysis of biological materials (i.e. for cancer cell detection [7, 8]), we aim to develop a method capable of identifying, classifying, and estimating in real time the risks related to the defects on the surface of water-repellent membranes.
Hence, thinking about quality and performance speed-up, we will discuss a deep-learning algorithm which works on an augmented data set containing real and artificially generated images of some waterproofing membranes.


Fig. 1 The thermographic survey of damaged flat roof by using high resolution thermal imaging equipment

The rest of this paper is organized in the following manner. In Sect. 2, the artificial surface model of waterproof membranes as well as the propagation of damages are discussed. Section 3 concerns the AI-based methodology for damage detection and classification. Section 4 contains some simulation experiments. Section 5 concludes the research that has been done and indicates its further development.

2 Problem Formulation

2.1 Water-Repellent Membranes: Some Comments

The thermographic survey allows identifying various damages on the surfaces of structures. For example, it can be conducted to determine the location of a leak and to conclude on the cause of the presence of water inside the roof. To analyze the results, it is necessary to indicate the area with "weakened" properties. Figure 1 shows an example of a thermographic photograph of a flat roof taken at 30 °C. One can notice that the temperature distribution varies greatly. Dark-red fragments indicate thinning processes of the surface layer (over-warmed fragments), while dark-blue fragments indicate the presence of water (cracks, holes, etc.). Hence, the surface layer is damaged, the reliability of the entire structure has deteriorated, and repair is required. Recently, to avoid these kinds of problems and to make surfaces more resistant to variations of weather conditions, special classes of waterproof materials are applied to the surface [3, 4].


Fig. 2 The electronic leak detection survey based on digital image analysis

Known as water-repellent membranes and distinguished as liquid-applied or sheet-based, these materials form continuous thin layers, about 2 to 4 mm thick, laminated to vertical, horizontal, and more complex-geometry surfaces. It is necessary to admit that the pore structure of the membrane allows the element to breathe and at the same time protects it against water penetration. Totally covered by these membranes, the elements of the construction (i.e., landscaped concrete decks, underneath areas, basements, roofs, terrace slabs, balconies) have better resistance to UV radiation, temperature, wear and abrasion, and chemical pollution. Since water no longer seeps into the structural element, its reliability stays high for a longer time [4]. However, any puddles of water remaining on the surface sooner or later provoke erosion and corrosion of the hydro-repellent layer (see Fig. 2). The initiation of the destruction processes is related to pore deformation and destruction (see Fig. 3). If this process is discovered at an early stage and the required managerial actions are applied, the reliability of the construction will not be harmed [5]. In other words, by classifying the pores of the membrane as damaged or not damaged based on digital image analysis, one can draw conclusions on the reliability of the water-protection layer. Moreover, to avoid human error, this analysis can be done employing AI methods [6].

2.2 The Reliability Model of Membrane Surface

Once applied to the surface of the construction, the membrane has a regular monochrome structure. Gradually, due to environmental influences, its surface changes color non-homogeneously, and due to erosion its pores lose the ability to adjust their size after variations of temperature (see Figs. 2 and 3) and to return to the initial size. We use these observations to introduce the generalized model of the membrane's surface shown in Fig. 4.


Fig. 3 Examples of corrupted liquid–applied waterproofing membrane: pore deformation, erosive swellings, and micro-cracks

Fig. 4 The idealized model of the membrane surface: the regular and monochrome structure

Let the bounded closed set

$$ D = [0, n-1] \times [0, n-1] \subset \mathbb{N}^2 \qquad (1) $$

correspond to the $n$-by-$n$ pixel digital image I of the membrane's surface. Without loss of generality, we set I as a gray-scale image, where each pixel $(x, y) \in D$ has an intensity $f_{(x,y)} \in F$, and $F = \{f_{\min}, \ldots, f_{\max}\} \subset \mathbb{R}_+$ is an ordered finite set. We place the centers of the pores at the vertices $(ih, jh)$ of the regular grid associated with D, where $h$ corresponds to the grid spacing and $i, j \in \{1, \ldots, n_h\}$ with $n_h$

$$ P(s_1 \leftarrow s_0 \mid r(t) > r_0) = p, \qquad (3) $$

• took a value bigger than the critical level $r_1$:

$$ P(s_1 \leftarrow s_0 \mid r(t) > r_1) = 1. \qquad (4) $$

As stated in [2, 3], the aptitude for pore breathability, and thus its reliability, depends on the ratio of the number of damaged regions $\kappa_\ell^1(t)$ to the number of all regions $\kappa_\ell(t)$, $t \in (t_0, t_1]$, $\ell \in L$. Both quantities can be counted as the cardinal numbers of the corresponding sets. The following function describes the time-varying reliability of the pore $\omega_\ell \in \Omega$:

$$ R_{\omega_\ell}(t) = \exp\!\left( - \frac{\kappa_\ell^1(t)}{\kappa_\ell^*} \right), \qquad (5) $$

where $\kappa_\ell^*$ indicates the threshold value of the number of damaged regions, $\kappa_\ell^* \le \kappa_\ell(t_0)$. Finally, we introduce the sets $\Omega_{+1}$ and $\Omega_{-1}$ of non-corrupted and corrupted pores, $\Omega_{+1} \cup \Omega_{-1} = \Omega$ and $\Omega_{+1} \cap \Omega_{-1} = \emptyset$. For any $\ell \in L$ the pore $\omega_\ell \in \Omega_{-1}$ if at least one of the following conditions

$c_1$: $\forall k \in L$, $k \ne \ell$: $R_\ell(t) \cap R_k(t) \ne \emptyset$,
$c_2$: $\forall (x, y) \in R_\ell(t)$: $s_{(x,y)} = s_1$,
$c_3$: $\forall t \in (t_0, t_1]$: $R_{\omega_\ell}(t) \le R^*_{\omega_\ell}$ ($R^*_{\omega_\ell}$ is the threshold value of the reliability (5)),

is fulfilled; otherwise $\omega_\ell \in \Omega_{+1}$. Applying the same reasoning as in the case of the damaged regions of one pore, we can estimate the reliability of the whole surface covered by the membrane, i.e.

$$ R_D(t) = \exp\!\left( - \frac{\sum_{\omega_i \in \Omega_{-1}} A^{(i)}(t)}{A^*} \right), \qquad (6) $$

where $A^{(i)}$ refers to the area of the damaged pore $\omega_i$ and $A^*$ is the threshold value of the surface area that may be taken up by damaged pores. To avoid confusion with the accuracy metric, we will call (6) the "domain of confidence" reliability metric. The research problem of this paper can be formulated as follows: how can one find the damages on the membrane, classify them, and estimate the reliability of the surface covered by the membrane using digital images?
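As a small numerical illustration of the two reliability measures (5) and (6), the sketch below evaluates them for invented input values (the counts, areas and thresholds are purely illustrative).

```python
import numpy as np

def pore_reliability(kappa_damaged: int, kappa_star: int) -> float:
    """Time-varying pore reliability (5): R = exp(-kappa_damaged / kappa_star)."""
    return float(np.exp(-kappa_damaged / kappa_star))

def surface_reliability(damaged_areas: np.ndarray, a_star: float) -> float:
    """Domain-of-confidence reliability metric (6) for the whole membrane surface."""
    return float(np.exp(-damaged_areas.sum() / a_star))

# illustrative numbers: 3 of the pore's regions are damaged, the threshold is 10 regions
print(pore_reliability(kappa_damaged=3, kappa_star=10))   # ~0.741

# areas (in pixels) of the corrupted pores and the allowed total damaged area
areas = np.array([120.0, 85.0, 240.0])
print(surface_reliability(areas, a_star=5000.0))          # ~0.915
```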

3 Research Methodology

3.1 Method Overview

The successful solution of the research problem depends on the analysis of the visual flow of information associated with the digital image. To avoid error-prone, time-consuming "manual" recognition of anomalies (damaged areas of the membrane) on large surfaces of the construction (e.g. roofs), it is beneficial to eliminate the human factor, replacing it with a high-performance embedded device. In general, each such device needs an efficient and high-performance embedded system equipped with a computational algorithm capable of completing the analysis in real time. Hence, the solution has to address the computational aspects. Methods of artificial intelligence are extensively applied in image recognition. Over the past decades, neural networks and deep learning algorithms have allowed significant progress to be made in anomaly classification using image recognition,


Table 1 The main stages of image recognition methodology (Stage – Purpose)

Preprocessing – To improve the quality of initial image; to speed up the detection of the location and the classification of the anomalies; to remove significant variations in initial data due to non-linear luminance effect [9, 10]
Segmentation – To extract desired features, i.e. to detect edges of the membrane pores; to avoid the artifacts detection [11]
Classification – To classify the extracted features according to the predefined image pattern or rules [6]
Post-processing – To avoid problems of overfitting or underfitting; to correct possible errors; to improve the classification quality [10, 11, 13, 14]
Evaluation – To conclude on the recognition quality [12]

and to develop their own methodology. This methodology consists of several consecutive steps, namely: preprocessing, segmentation, classification, post-processing, and evaluation (see Table 1). The choice of the neural network architecture and, consequently, the implementation of the deep learning algorithm depend on the purpose of the task to be solved, the availability and variety of data, and the processing speed requirements. In our case, when implementing the image recognition algorithm, special attention should be paid to achieving maximum accuracy at the classification stage without introducing time-consuming calculations. The "training–validation" or "training–estimation–validation" of the neural networks has the biggest impact on the accuracy of the "prediction". The model training instances should be carefully selected to avoid model overfitting or underfitting by splitting the dataset into partitions, i.e. "training and validation sets" or "training, test, and validation sets". When an embedded solution with stable results is required in real-world applications, the training and validation sets are always available, but not the test set. As a solution to this problem, the dataset is very often augmented by artificially generated images with high-quality labeled examples. This technique accelerates pattern detection for unsupervised, semi-supervised, or supervised implementations. Considering the results of previous experience [12–14], we will use the k-fold cross-validation method for the dataset augmented by artificially generated images for CNN-based segmentation and classification, as illustrated by the sketch below.
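The following fragment is one possible way to organize such a training loop; placing the synthetic images only in the training portion of each fold (so that validation is always done on real data) is an assumption made for this sketch, not a detail stated by the authors.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
real_images = rng.random((300, 64, 64))       # placeholder for real labeled samples
synthetic_images = rng.random((100, 64, 64))  # placeholder for generated samples

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(real_images)):
    # augment only the training part of the fold with synthetic data
    train_set = np.concatenate([real_images[train_idx], synthetic_images])
    val_set = real_images[val_idx]
    print(f"fold {fold}: train={len(train_set)} images, validation={len(val_set)} images")
    # ... train the CNN on train_set and evaluate it on val_set here ...
```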

3.2 Dataset

The real image dataset considered in this study was created by time-lapse shooting. It contains ten sets of 1024 × 1024 pixel digital gray-scale images (samples) of the same polymer membrane applied to a flat roof (see Figs. 2 and 3). Each set contains 300 images (the boundaries of the regions of interest are manually labeled) and


Fig. 6 The heterogeneities of the surface of the membrane for different values of Hurst parameter (H = 0.15, H = 0.5, H = 0.85)

Fig. 7 Some examples of the initial configuration of the pores involved in corruption process (the edge of pore is marked by dim-gray color ( f = 105), the internal part of pore is marked by light-gray color ( f = 211), the corruptions are marked by grey color ( f = 211))

refers to one of the ten observation periods labeled τ1, τ2, …, τ10. To generate the augmented set, ten series of 100 gray-scale images of the same size, corresponding to the same observations, were produced as follows. The corruption processes are observed on the membrane surface during the exploitation period [t0, t1] on the one hand as non-homogeneous edge micro-cracks and as irrecoverable extensions of the pores (which means the loss of the water-repellent ability), and on the other hand as gradient color changes of the background (mostly due to chemical reactions). Visually, both phenomena can be monitored as changes of the pixels' intensities. Since these transformations have a rather random nature, the intensities f(x,y)(t) (we recall that (x, y) ∈ D and t ∈ [t0, t1]) follow some spatial stochastic process. In this work we use the fractional Brownian field with Hurst parameter 0 < H < 1 to model the non-uniform transformations of the color. We refer to [15] for the detailed description of this process. The proper selection of the roughness parameter 2H allows modeling non-smooth and smooth heterogeneities on the membrane surface (see Fig. 6 and compare with Fig. 2). To obtain the augmented set we created a set of "initial" images, which contained pores simulated by an unconditional Strauss process (r0 = 0.2, β = 100, γ = 0.1)¹ (Fig. 7). We set H(τ1) = 0.50, H(τ2) = 0.55, …, H(τ10) = 0.95 and imitated 100 evolutionary processes of the surface, starting each time from the initial image. Using the methodology presented in [15], we simulated the corruption processes of the pores as independent stochastic processes.

¹ This is a point process with density f_pdf(x) = β^{n(x)} γ^{s(x)}, which depends on the two parameters β, γ and the spatial coordinates x; n(x) is the number of points and s(x) is the number of pairs of points whose distance is at most r0. In our case, the clusters formed by this process correspond to the pores with activated corruption processes.


The simulated images were rescaled to 1024 × 1024 pixel images. Finally, the whole dataset contains 4000 gray-scale images.

3.3 Image Recognition: Basic Technique

3.3.1 Preprocessing

The grayscale image preprocessing consists in consecutive modifications of each pixel intensity $f_{(x,y)} \in F$, $(x, y) \in D$, of the original image I:

$$ I \xrightarrow{\text{normalization}} \tilde{I} \xrightarrow{\text{equalization}} \tilde{\tilde{I}} \xrightarrow{\text{gamma-correction}} \tilde{\tilde{I}}^{\,*}, \qquad F \xrightarrow{\text{normalization}} \tilde{F} \xrightarrow{\text{equalization}} \tilde{\tilde{F}} \xrightarrow{\text{gamma-correction}} \tilde{\tilde{F}}^{\,*}, $$

where $\tilde{F}$, $\tilde{\tilde{F}}$ and $\tilde{\tilde{F}}^{\,*}$ are the ordered finite sets which contain the modified intensities $\tilde{f}_{(x,y)}$, $\tilde{\tilde{f}}_{(x,y)}$ and $\tilde{\tilde{f}}^{\,*}_{(x,y)}$, correspondingly.

      to determine g f˜k = f˜min + f˜max − f˜min T f˜k , and, hence, the set of normalized and equalized intensities:     F= g  fk ,

2

 0≤k ≤ L −1 .

The gamma-correction step depends on the parameters optical device, which is not an interest of this paper.

Waterproofing Membranes Reliability Analysis by Embedded …

3.3.2

255

Segmentation

The normalized and equalized image I˜˜ is used for the patterns recognition. Each ¯ ⊂ D, where  ¯ is a set of the pattern can be assigned to an unknown object ω¯ ∈  patterns. To estimate (5) and (6) we need to know areas of corrupted pores. Hence, we have to find edge of ω. ¯ For that reason, the main purpose of segmentation should ¯ and the description of its elements by be focused on the construction of the set  the contours Cω¯ and the areas Aω¯ . One of the method to find the edge of the pattern consists in the particular transformations and the comparison of the intensities f i∗j of neighboring pixels. ˜˜ i, j ∈ {1, ..., n}, be an element of n × n matrix F∗ associated with Let f i∗j ∈ F, two spatial variables θ1 and θ2 . We denote the non-integer row index by θ1 and the non-integer column index by θ2 . Let g (θ1 , θ2 ) be a function smooth enough. To ¯ and find Cω¯ and Aω¯ the following four phases the edge-based segmentation form  algorithm is considered (see Table 2): 1. Set up k 1.

(10)

Waterproofing Membranes Reliability Analysis by Embedded …

3.3.4

257

Post-processing and Evaluation

Post-processing was completed after segmentation to reduce the number of possible false positive patterns. All the patterns with areas smaller than nine pixels were removed from consideration to avoid confusions with background noise. Moreover, the elimination of isolated pixels as well as “small” neighborhoods amalgamation were done by the opening operation [15]. To evaluate the segmentation-classification algorithm we use the fact that the training dataset already contains manually labeled patterns. For this reason, the contours and the areas of corrupted pores as well as their quantity are known. In an ideal situation, all domains of patterns labeled on the image and that discovered by the algorithm suppose to coincide. However, the algorithm could find more, less, or the same quantity of patterns of interest. The same concerns the contours and areas. Since the primary goal is the estimate of the reliability of the surface covered by the membrane, we will use the same metrics as in [14], namely: DoC :=

min{ 1 ,2 }  i=1

P(A1i ∩ A2i ) , P(A1i )

(11)

where A1i is the area manually marked and labeled as the corrupted pore, A2i is the area identified by the algorithm, 1 and 2 are quantities of the corrupted pores on labeled image and defined by the algorithm, the notation P denotes the number of pixels, which belong to the area.

4 Results In order to extract and classify corrupted pores we applied deep learning strategy using CNN models included in Deep Learning Toolbox MatLab R2020b. The input to the CNN was 1024 × 1024 gray-scale image from the dataset. The output layer for the CNN was a softmax layer with two categories: corrupted and non-corrupted pores.3 To find optimum of the goal function (10) the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) method was applied [16]. For the classifiers, the training-validation was completed for 10 − f old cross-validation. The other parameters, important for the classifiers, were taken to distinguish non-corrupted pores, namely the initial radius r = 2, the critical value of radius r˜ = 4, the critical value of the critical damaged regions p˜ = 0.7. The CNN was trained with the following parameters: the momentum 0.9, the weight decay rate 5 · 10−4 , the initial learning rate 5 · 10−3 with the shrinking parameter 0.99995, and 104 training steps. The simulations were run twice separately for each observation instant with different values of the smoothing parameter in (7). The experiments were completed 3

Despite the multi-class classification requires at least three classes, we left two classifiers to be capable to increase the number of classes without rewriting the classification algorithm.

258

D. Filatova and C. El-Nouty 1

5x5 window 7x7 window

value of DoC metric

0.9

0.8

0.7

0.6

0.5 1

2

3

4

5

6

7

8

9

10

the observation time instant,

Fig. 8 The algorithm performance metrics for the surface covered by the waterproofing membrane for different settings of kernel smoothing parameter in (7)

on the computer with processor Intel (R) Core (TM) i7-470HQ [email protected] RAM 16 Go and graphic card NVIDIA GeForce GTX 860M 1029MHz under 64bit Windows 8.1 OS. The performance of the algorithm given by the metric (11) and the variations of reliability given by the metric (6) for the surface covered by the waterproofing membrane are shown in Fig. 8 and Fig. 9. Let us make some comments. As it was pointed in [14] the noise on the membrane surface complicated the segmentation: the pores are hardly visible and it gives the effect of “false positive regions detection”. It was also recommended to use different values of the kernel smoothing parameter to get optimal values of the DoC metric. In these experiments, the application of 7 × 7 window “caught” more correctly identified regions. Since the quantity of truly recognized regions has been improved, the correctness of damaged surface classification has been increased also. This implies that the reliability corresponding to 7 × 7 window (black points on Fig. 9) would better explain the state of the roof under study. Subsequently, taking the threshold value of reliability, it is possible to obtain predicted values of the optimal operating time of the investigated surface covered by the waterprofing membrane (in our field study it was the flat roof).

5 Conclusions This study presented the machine learning method based on the deep-learning algorithm for the reliability analysis of the damaged surfaces covered by the water-

Waterproofing Membranes Reliability Analysis by Embedded …

259

1

value of reliability metric, R

5x5 window 7x7 window

0.8

0.6

0.4

0.2

0

0

2

4

6

8

10

the observation time instant, Fig. 9 The reliability metrics for the surface covered by the waterproofing membrane for different settings of kernel smoothing parameter in (7)

proofing membrane. Qualitative evaluation of the method reached applicable outcomes for the selected dataset of images. The correctness of damages recognition was evaluated by the “domain of confidence” metric. This allowed showing the reliability of the construction element under study. The main attention was concentrated on damages to membrane pore. This method has the possibility of expanding classes of damages that could happen on the surface. The main advantages of this method can be listed as the possibility of expansion of classes of damages that could happen on the surface, the embeddedness and the high-throughput of implementation, the reliability of construction can be estimated in near-real-time decreasing human error factor. Acknowledgements Many thanks to prof. Vladimir Sidorov from Moscow State University of Civil Engineering (MGSU) for discussing the topic and conducting field studies of the damaged roof made for the creation of the training dataset used in this study.

References 1. Cui, H., Li, Y., Zhao, X., Yin, X., Yu, J., Ding, B.: Multilevel porous structured polyvinylidene fluoride/polyurethane fibrous membranes for ultrahigh waterproof and breathable application. Compos. Commun. 6, 63–67 (2017) 2. Lee, K., Kim, D., Chang, S.-H., Choi, S.-W., Park, B., Lee, C.: Numerical approach to assessing the contact characteristics of a polymer-based waterproof membrane. Tunn. Undergr. Space Technol. 79, 242–249 (2018)

260

D. Filatova and C. El-Nouty

3. Rupal, A., Sharma, S.K., Tyagi, G.D.: Experimental investigation on mechanical properties of polyurethane modified bituminous waterproofing membrane. Mater. Today: Proc. 27(1), 467–474 (2020) 4. Gu, H., Li, G., Li, P., Liu, H., Chadyagondo, T.T., Li, N., Xiong, J.: Superhydrophobic and breathable SiO2/polyurethane porous membrane for durable water repellent application and oilwater separation. Appl. Surf. Sci. 512 (2020). https://doi.org/10.1016/j.apsusc.2019.144837 5. Francke, B., Piekarczuk, A.: Experimental investigation of adhesion failure between waterproof coatings and terrace tiles under usage loads. Buildings 10(3) (2020). https://doi.org/10.3390/ buildings10030059 6. Song, Y., Huang, Z., Shen, Ch., Humphrey, S., Lange, D.A.: Deep learning-based automated image segmentation for concrete petrographic analysis. Cem. Concr. Res. 135 (2020). https:// doi.org/10.1016/j.cemconres.2020.106118. 7. Miotto, R., Wang, F., Wang, S., Jiang, X., Dudley, J.T.: Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform 19(6), 1236–1246 (2018) 8. Kose, U., Alzubi, J.A. (Eds.): Deep Learning for Cancer Diagnosis, Studies in Computational Intelligence. Springer (2021) 9. Ramírez-Gallego, S., Krawczyk, B., García, S., Wo´zniak, M., Herrera, F.: A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239, 39–57 (2017) 10. Lou, Q., Peng, J., Wu, F., Kong, D.: Variational model for image segmentation. In: Bebis G. et al. (eds.) Advances in Visual Computing. ISVC Lecture Notes in Computer Science, vol. 8034. Springer, Berlin (2013) 11. Ma, T., Antoniou, C., Toledo, T.: Hybrid machine learning algorithm and statistical time series model for network-wide traffic forecast. Transp. Res. Part C: Emerg. Technol. 111, 352–372 (2020) 12. Oksuz, I., Ruijsink, B., Puyol-Antón, E., Clough, J.R., Cruz, G., Bustin, A., et al.: Automatic CNN-based detection of cardiac MR motion artefacts using k-space data augmentation and curriculum learning. Med. Image Anal. 55, 136–147 (2019) 13. Kroese, D.P., Botev, Z.I.: Spatial Process Simulation. In: Schmidt V. (eds.) Stochastic Geometry, Spatial Statistics and Random Fields. Lecture Notes in Mathematics, vol 2120. Springer, Cham (2015) 14. Filatova, D., El-Nouty, Ch., Punko, U.: High-throughput deep learning algorithm for diagnosis and defects classification of waterproofing membranes. Int. J. Comput. Civ. Struct. Eng. 16(2), 26–38 (2020) 15. Tosta, T., Faria, P.R. , Alves Neves, L., Zanchetta do Nascimento, M.: Computational method for unsupervised segmentation of lymphoma histological images based on fuzzy 3-partition entropy and genetic algorithm. Expert Syst. Appl. 81, 223–243 (2017) 16. Engelmann, A., Jiang, Y., Muhlpfordt, T., Houska, B., Faulwasser, T.: Toward distributed OPF using ALADIN. IEEE Trans. Power Syst. 34(1), 584–594 (2019)

Network of Autonomous Units for the Complex Technological Objects Reliable Monitoring Oleksandr Chemerys, Oleksandr Bushma, Oksana Lytvyn, Alexei Belotserkovsky, and Pavel Lukashevich

Abstract Nowadays autonomous remote monitoring and control systems are required for the reliable operation of complex, especially distributed over large areas or inaccessible places, technological facilities and systems for various purposes (agriculture, urban studies, environmental protection, emergency natural and man-made situations, etc.), as well as for the operational technological or managerial decisions. To monitor the state of an object, as a rule, a distributed stationary system of sensors for various purposes is used. In this case, it is necessary to take into account the balance between functionality and cost of the system in each specific case. By functionality we mean not only a set of capabilities, but also energy consumption, which should be as low as possible for remote monitoring, performance, in particular, when processing camera images and transmitting data on the network, and protection from external interventions, both objective (technical obstacles), and malicious. The speed of the received data and its completeness remains a problem when using stationary systems. The solution may increase stationary points, but at the same time the network efficiency will decrease. Authors consider the relevant and perspective IIoT (Industrial Internet-of-Things) development concept—a scalable heterogeneous network consisting of fixed and mobile nodes for monitoring the state of complex distributed technological objects. Many issues must be solved comprehensively at designing and creating such a network. This is especially true for control systems, data transmission channels and data stream processing, their analysis, scalability, decision making. The paper describes a new concept for development of a multi-level architecture IoT network for monitoring the state of geographically distributed technological objects, consisting of a heterogeneous set of nodes (stationary and mobile units) equipped with various sensors and video cameras. O. Chemerys Pukhov Institute for Modelling in Energy Engineering NAS of Ukraine, Kiev, Ukraine O. Bushma · O. Lytvyn Borys Grinchenko Kyiv University, Kiev, Ukraine A. Belotserkovsky (B) · P. Lukashevich United Institute of Informatics Problems of The National Academy of Sciences of Belarus, Minsk, Belarus © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_16

1 Introduction

Today, autonomous and remote monitoring and control systems are necessary to ensure the uninterrupted functioning of complex technological objects and systems for various purposes, especially those distributed over large areas or located in hard-to-reach places. This applies first of all to areas such as agriculture, urbanism, environmental protection, and natural and human-made emergency situations. The result of such monitoring is the prompt adoption of technological and managerial decisions. Monitoring systems of this kind make it possible to observe production in real time, to control the consumption of natural resources, and to prevent emissions of harmful substances and emergencies, or at least to minimize the damage they cause. Modern industrial enterprises are complex, geographically distributed objects, and their management requires the introduction of modern computer technologies. Monitoring systems work with maximum efficiency when embedded in a common corporate information system. They have to forecast potential dangers that may lead to the destruction of the object and to negative effects on the environment. Danger, or hazard, is a property inherent in a complex technical system; it can be realized in the form of direct or indirect damage to the affected object, gradually or suddenly, as a result of a system failure. The analysis of real emergencies, events, factors and human activity has made it possible to identify a number of danger properties of technical systems [1]:

1. Any technical system is potentially dangerous.
2. Technogenic hazards exist if the daily flows of matter, energy and information in the technosphere exceed threshold values.
3. Sources of technogenic hazards are elements of the technosphere.
4. Human-made hazards operate in space and time.
5. Technogenic hazards affect humans, the natural environment and elements of the technosphere at the same time.

Technogenic hazards worsen human health, lead to injuries, material losses and degradation of the natural environment. To realize the danger, at least three conditions must be met: (1) the danger really exists, (2) the object is in the danger zone, and (3) the object does not have sufficient means of protection. To confront technogenic hazards, it is necessary to know the state of the object at each point in the space that it occupies. Comparison of the measured parameters with the virtual model of the object makes it possible to make decisions about the appropriate actions at the moments of time preceding emergency situations. Thus, this allows you to prevent disasters. In the event of emergencies, such system should promptly provide an action plan to minimize damage, protect personnel and areas adjacent to the facility, automatically perform the series of actions to terminate the facility’s activities, etc. Thus, the monitoring system in conjunction with the acceptance system solutions are an integral part of the information structure of complex objects, in particular, industrial enterprises.

With regard to potentially dangerous objects (PDO), monitoring is the continuous collection of information, observation and control of the facility, including procedures for risk analysis, measurement of process parameters of objects, emissions of harmful substances, the state of the environment in the areas near the object. Monitoring data and information on various processes and phenomena serve as the basis for risk analysis and forecasting. The purpose of forecasting emergency situations (ES) is to identify the time of occurrence of a hazardous situation, the possible location, scale and consequences for the population and the environment. For constructions and buildings of industrial objects that have been in operation for a long time, the cause of accidents can be the degradation of material properties, the ultimate levels of accumulated damage, the formation and propagation of cracks, and cavitation wear [2]. In complex systems, accidents are of a logical and probabilistic nature. Therefore, a disaster or failure scenario can be drawn up, the corresponding logical risk function and a probability risk polynomial may be designed. At the stage of operation, the assessment and analysis of the risk of a system accident are carried out on the basis of the corresponding scenarios using the results of monitoring on the values of wear of elements, real loads and vibrations, the features of operation, and the preparedness of the service personnel. Quantitative risk assessment and analysis allow you to make informed decisions about extending the life of the system; develop proposals to ensure safe operation; organize training of personnel in emergency situations. Monitoring is an integral part of safety and risk management systems for complex technical, technological, economic, organizational and social systems. As an information technology, monitoring is designed to assess the technical state of a complex system and make a decision on extending the resource and ensuring the safe operation of the system with an extended resource [3]. Information technologies play a key role nowadays for the creation and operation of monitoring systems for industrial objects.

2 The Complex Technological Objects Monitoring

Assessment of the existing structure of the natural and human-made risk management system has shown that, in relation to PDO, monitoring is a permanent collection of information, supervision and control of an object, including procedures for risk analysis, the measurement of technological process parameters at objects, emissions of harmful substances, and the state of the environment in the territories near the object. The main operational and tactical tasks handled by a PDO monitoring system can be divided into three categories:

(a) identification of the situation, drawing up the necessary safety data sheets, studying the causes of fires and emergencies, and taking measures to ensure safety;
(b) forecasting the development of situations leading to fires and emergencies, modeling the dynamics of their development, assessing the resources required for their elimination, and assessing the need to evacuate the population;
(c) development and analysis of a fire and emergency response strategy: dividing the territory into sections and service areas and assigning responsible employees to them, determining the number of necessary units and their composition, distributing forces and means among objects to achieve tactical goals, creating closed zones and patrol zones, and organizing evacuation (complete or partial).

Thus, the main functions of the monitoring system are: tracking with reference to the object and real time of operation of controllers and sensors, both in event mode and in the mode of periodic survey at the initiative of the central controller from the dispatch center, automatically mode including; fixation of events occurring at controlled objects, with their binding to the object, geographic coordinates and real time; emergency and warning signaling about the occurrence of emergency situations at controlled facilities; automated analysis of events in the form of various reports. Forecasting is the feature of such system too: consequences of accidents at facilities using active chemically hazardous substances and during their transportation; consequences of accidents at explosive objects; consequences of accidents on oil and gas pipelines, etc. To monitor the state of an object, as a rule, a distributed stationary system of sensors for various purposes is used. At the same time, it is necessary to consider the balance between functionality and cost of the system in each specific use case. By term functionality we mean not only a set of capabilities, but also power consumption, which should be as low as possible to perform remote monitoring, performance, in particular, when processing camera images and transferring data in the network, protection from external interference, both objective (technical obstacles) and malicious. When using stationary systems, the problem is the efficiency of the data obtained and their completeness. The solution may be to increase stationary points, but this reduces the efficiency of the network. An example is ABB’s developments aimed at monitoring the technological equipment of industrial enterprises. In 2019, the company formulated the concept of integrating a predictive emission monitoring system (PEMS) with traditional continuous air emission monitoring systems (CEMS). This provided a wide range of analytical solutions for various applications in various industries [4]. Thus, modern monitoring systems are in most cases stationary and their aim is the physical parameters measuring that affect the state of objects. They also notify personnel in the event of deviations of the monitored parameters from the boundary values. The authors propose the concept of an integrated heterogeneous stationarymobile monitoring system, which is a network of independent devices capable for independent collection and preliminary analysis of data including visual data.

3 IoT for Technological Objects Monitoring Recently, the concept of the Internet of Things (IoT) has been rapidly developing. It is based on a network of automatically interacting autonomous devices. The application of the main provisions of the concept in monitoring systems for distributed industrial objects will allow us to solve such problems as increasing the quality of assessing the situation, system performance, scalability, energy efficiency. The simplest monitoring systems based on IoT technology appeared some time ago. Technical capabilities, wireless data transmission and many possible types of sensors to determine the parameters of objects were taken into account. Separate rooms or buildings were monitored. The sensors were static using both wired and wireless communication channels. In [5], a real-time monitoring and control system is presented. It was designed to monitor environmental parameters such as temperature, humidity and pressure readings in any plant and to control by peripheral systems. The parameters are controlled from the monitoring room using wireless communication using Zigbee technology. It is designed for highly sensitive and mission critical applications. Remotely, the system allows the user to effectively control and manage office equipment and equipment via a mobile phone, sending commands in the form of SMS messages and receiving information about the status of the equipment. An example of the use of mobile devices as part of monitoring systems is Facebook’s Aquila project to create unmanned aerial vehicles, which aimed to provide broadband Internet access to residents of hard-to-reach areas. The project began in 2014, but in 2018 Facebook decided to stop creating Internet drones on its own [6]. Another example of the use of the Internet of Drones as part of IoT is a project by Uavia (France) aimed at creating a tool that allows businesses to conduct aerial inspections and surveillance in real time from anywhere in the world at any time. It was assumed that one or more users can simultaneously collect raw aerial photographs, analyze and obtain the data necessary for use in making management decisions [7]. In [8, 9], the concept and structure of the radiation background monitoring system for the post-accident situation at nuclear power plants is presented. The system is based on a set of UAV with radiation sensors. Monitoring data is transmitted to ground stations by wireless communication channels. Further, through this interface, the drone system interacts with the control and decision-making center. Nevertheless, the implementation of such a system requires the solution of many technical and organizational problems. Recently, the concept of the Industrial Internet of Things (IIoT), an IoT for corporate use, has also been developing. This concept defines a system of interconnected computer networks and connected industrial facilities with built-in sensors and software for collecting and exchanging data, with the possibility of remote monitoring and control in an automated mode, without human intervention. The use of IoT principles for the design of monitoring systems by complex industrial enterprises is a perspective and actual task. Younana et al. [10] shows the future of IIoT where system structure includes mobile networks, vehicular and wireless sensors network. They

allow data sharing between different networks, so the units of the monitoring network can exchange information and act as a single team. The level and dynamics of development of the Industrial Internet of Things and the areas of its application have led to an increase in the complexity of control systems [11, 12]. For a heterogeneous system containing a network of devices of different types, at the design level this means the complexity of their formal description and of modeling the processes and all possible normal and abnormal conditions. At the construction level, it means structural and functional complexity and the complexity of management, decision-making and the choice of strategies. At the same time, it is necessary to ensure the required degree of operability, autonomy, quality and efficiency of data transmission and processing, as well as of the feedback used for network management, under the various states of the monitoring network itself and of the monitored objects or environment.

4 Monitoring System Design Considering the development of IIoT concept, the actual and promising for monitoring the state of complex distributed technological objects is the use of a scalable heterogeneous network consisting of stationary and mobile nodes. When designing and creating such a network, many issues must be addressed in an integrated manner. This is especially true of control systems, data transmission channels and data flow processing, their analysis, scalability, decision making. The main goal of the paper authors is the development of principles for building a multi-level architecture of a network system for monitoring the state of geographically distributed technological objects, consisting of a heterogeneous set of nodes (stationary and mobile devices) equipped with various sensors and video cameras. The technical basis for the development is a distributed heterogeneous network that functionally unites a certain number of stationary and mobile nodes equipped with the necessary set of sensor elements. In general, structurally such a system may be represented as it is shown in Fig. 1. In this representation, the basis of the system is a network of separate autonomous devices U = {U1 , U2 , . . . , U N }. The network is heterogeneous both in terms of a set of devices and communication channels, which can be based on wired channels for fixed devices, and based on wireless communication for mobile devices. Each device is equipped with P sensors that determine a set of measured parameters. In Fig. 1 the dashed circles indicate the monitoring areas for each device. Through the communication channels, the devices transfer data through the local dispatcher to the monitoring and decision-making system. They receive control commands back. At their level, the devices make decisions on movement, avoidance of obstacles, etc. By exchanging information through the dispatcher, the devices provide efficient

Fig. 1 Simplified block diagram of the monitoring system

movement for the best coverage of the monitoring space. Thus, the system can be described by the following tuple:

N = \langle P, U(V, S, XY), C \rangle,   (1)

where P is a set of measured parameters, V is the speed of movement of the device, S is the direction of movement, X Y is the coordinates of the device in space, C is the set of communication channels that connect U in the network. Mobile platforms are currently widely used in various fields, particular, in industry (robocars), in the elimination of the consequences of man-made accidents and disasters (in environments inaccessible to humans or dangerous for them), in military affairs (intelligence robots) etc. A typical mobile platform is a small-sized remotely controlled vehicle consisting of a body, a power supply, a transmission and mover. It is equipped with the required set of sensors and a photo and/or video camera. It can also be equipped with a manipulator for performing the necessary operations. Either wheels with tires or tracks with sprung rollers are used as mover of the platform, which ensures effective movement in various environments and surfaces, including over rough terrain, loose soils, snow, etc. In economically justified cases, it is advisable to use unmanned aerial vehicles (UAVs, drones) to perform specific monitoring tasks as movers of a mobile platform.
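To make the description (1) concrete, one possible in-memory encoding of a unit and of the network is sketched below; the Python class, field names and example values are illustrative assumptions of this sketch, not part of the authors' design.

```python
# Illustrative encoding of the system description (1): each unit carries its
# measured parameters P, kinematic state (V, S, XY) and communication channels C.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MonitoringUnit:
    name: str
    parameters: Dict[str, float]                        # P: last measured values
    speed: float = 0.0                                   # V: 0 for a stationary unit
    heading_deg: float = 0.0                             # S: direction of movement
    position: Tuple[float, float] = (0.0, 0.0)           # XY: coordinates in space
    channels: List[str] = field(default_factory=list)    # C: e.g. ["WSNS", "CLNS"]


# A hypothetical two-unit network: one stationary and one mobile unit.
network = [
    MonitoringUnit("U1", {"temp_C": 21.4, "rh_pct": 48.0}, channels=["WRNS", "WSNS"]),
    MonitoringUnit("U2", {"smoke_ppm": 0.1}, speed=1.2, heading_deg=90.0,
                   position=(12.5, 3.0), channels=["WSNS", "CLNS"]),
]
print([u.name for u in network])
```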

Fig. 2 Functional schema of a heterogeneous network of autonomous devices

The set of stationary and mobile sensor nodes belongs to the lower executive level of control, since it determines the quality of the monitoring procedure implementation from the practical system point of view. On the other hand, the executive level is completely determined both by the design of its main controlled system facilities (sensors, technical vision, communication, movement, physical manipulations in a controlled environment), and by the design of the information-measuring and control system (IMCS), organizing the operation of all equipment. The heterogeneous network connecting mobile (MSP) and stationary (SSP) sensor platforms with each other and with the IMCS through coordinated collection of wired (WRNS), wireless (WSNS), cellular (CLNS) and satellite (SANS) segments, providing reliable communication in a virtual network environment (Fig. 2). Data flow control is provided by IMCS using a router. A feature of the network is that both mobile MSP platforms and stationary SSPs are connected by two of its segments, at least. However, mobile platforms do not use the wired WRNS segment. The principles of multiple connection are formed when designing a network, taking into account the information load of the corresponding group of platforms, the spatial area of their responsibility, the criticality of this area in terms of monitoring emergency situations, the need to supply and maintain energy on the platform, as well as the economic component of a particular project. Additional information flows transmitted over different segments provide data redundancy, maintaining the required level of information reliability of the system. The management of the reservation procedure rests with the IMCS. It is obvious that multiple connections significantly increase the information component of the system’s reliability, but to a certain extent reduces its hardware and software components. In this case, one should take into account the increase in power

consumption of platforms and the cost of obtaining information in IMCS due to duplication of data streams over more expensive transmission channels. One of the most important tasks of ensuring the reliability of an ergatic (with human participation) monitoring system is a component that takes into account the parameters of the communication channel with an operator. The principles of formation of the information field of the means of visual data output make a significant contribution to ensuring the required level of reliability of the monitoring process as a whole [13, 14]. This becomes extremely relevant in critical situations when the prevention of an accident at a controlled facility depends on the operator’s actions. The optimal choice should be considered a figurative representation of monitoring parameters, including the use of bar graph displays or synthesis of bar graph data representation at regular display. This approach is successfully implemented on the basis of software or hardware-software solutions [15]. Data received from peripheral nodes is collected and accumulated at the central node of the system. The processing of the received information aims to create an up-to-date dynamic model of the technological object, on which the monitoring process is organized, and the environment. The model in real time displays the current state and captures the manifestations of critical and close to them situations according to certain parameters. The created dynamic model is characterized by a high level of detail and reliability, since it is formed on the basis of relevant information simultaneously from both stationary and mobile nodes of the system. One of the most actual tasks is programming the trajectory of mobile platforms. In a deterministic known environment, it is most effective to use distance as a key parameter in motion control. However, such an approach in a previously unknown external environment can lead to significant errors and, as a rule, to collisions with obstacles. Sensors, as a rule, solves this problem: the control logic is transformed into the conditions for control and navigation along the landmarks identified by the sensor system. In this case, the description of the control goal can be carried out in other terms (for example, “to move straight ahead until the appearance of an obstacle detected in front of him by the ultrasonic sensor”). This approach is implemented on the basis of a decision tree. The initial data for this is the construction of the information space of the working environment of the platform using touch elements for navigation, motion control, the formation of appropriate behavior, as well as the possible impact on the environment using manipulators. The trajectories of movement of mobile sensor elements are promptly corrected by means of the system in accordance with the identified critical and close to them states of the technological object and the environment. This configuration of the system will make it possible to promptly identify and prevent emergency situations at the technological facility, adequately respond to such cases, increase the efficiency of the operating personnel, quickly and adequately neutralize critical situations through the intervention of the relevant emergency services. The current situations at the monitoring object are ranged into 4 levels: “red”, “orange”, “yellow” and “green”. The “green” level corresponds to the normal situation, and the “red” level corresponds to the critical one. 
In the case of a “yellow” situation, the IMCS directs additional mobile sensor platforms to the source (or sources) generating the deviation from the normal state of the system. The “orange” level, assessed as pre-critical, requires the intervention of the IMCS operator and the urgent concentration of sensors and manipulator-equipped platforms at the appropriate places to prevent the situation from escalating to the “red” level. In a critical (“red”) situation, it is necessary to deploy the maximum permissible number of mobile platforms to monitor events and to involve the operating personnel in prompt actions to prevent an emergency. Thus, the design of a monitoring system for complex objects involves creating a network of stationary and moving devices equipped with sensors of various kinds and connected by wired and wireless communication channels. Tasks such as optimal placement, trajectory planning, data exchange between units, data preprocessing, and decision making based on these data and on a virtual model of the object have to be solved. The devices transmit current information, including visual information, to the subsystem responsible for making decisions and for displaying the real state of the object on its virtual model. A special place is occupied by the processing of the video streams from the many cameras installed on the system devices, especially the moving ones, and by the display of video information on the monitors of the dispatching control center for a clear presentation of the current state of the object. The next section is devoted to these issues.
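The four-level escalation logic just described can be summarized in a small dispatch sketch; the level names follow the text, while the function name and the wording of the actions are only an illustrative paraphrase.

```python
# Sketch of the four-level escalation logic described above; the actions are a
# paraphrase of the text and the function name is illustrative only.
def escalation_actions(level: str) -> list:
    actions = {
        "green":  ["continue routine monitoring"],
        "yellow": ["dispatch additional mobile sensor platforms to the deviation source"],
        "orange": ["notify the IMCS operator",
                   "concentrate sensors and manipulator platforms at the affected area"],
        "red":    ["deploy the maximum permissible number of mobile platforms",
                   "involve operating personnel to prevent an emergency"],
    }
    return actions[level.lower()]


print(escalation_actions("orange"))
```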

5 System Graphic Data Processing Among the whole set of heterogeneous stationery and mobile sensors of the information-measuring and control system (IMCS), one should separately highlight the image sensors—video cameras. In addition to the basic patterns of using video cameras for video surveillance systems, modern methods of processing images from cameras additionally allow: navigating mobile devices (both onboard cameras and external stationary cameras); detect movement, changes in the environment, lighting; detect smoke and fire; conduct non-contact technical quality control; perform detection of heterogeneous objects; to detect abandoned and dangerous objects, to carry out personal identification and prevent intrusions, etc. [16, 17]. The following subsections provide a detailed description of the use of video cameras for solving navigation problems (navigation through the onboard cameras of the device and navigation through the stationary cameras of an industrial facility).

5.1 Solving the Problem of Visual Navigation by Onboard Cameras of a Mobile Unit Recently, a lot of autonomous mobile devices, service robots, UAVs have appeared. They have become cheaper and more affordable for industry and users in general. A necessary condition for the functioning of such an apparatus is the determination of their position. Satellite positioning systems (GNSS) are the main basic means of navigation for mobile devices. However, such a system may not be applicable for many tasks. It may not be accurate enough in urban conditions, it can be easily blocked or even substituted [18], which affects the reliability of the system as a whole. Besides, these systems are practically not used for indoor navigation. For navigation of autonomous devices, methods for analyzing images from the onboard camera are proposed to use. Such approaches are very promising and are being widely developed at present [19–23]. Today, image navigation is one of the most affordable and effective ways of positioning an autonomous vehicle indoors (indoor navigation). Despite the large abundance of publications on the subject at the moment, there is no completely ready-made universal solution for performing autonomous image navigation. Many of the approaches described in the literature are very specific and highly specialized, and some of them require an additional set of sensors (laser rangefinder, stereo camera (stereo pair)). Nevertheless, the actively developing group of Visual SLAM methods [19, 20, 22, 24, 25] can be taken as the basis for image navigation methods. When the SLAM algorithm works, it extracts particular visual features of the image. Then, using analytical methods, features from different images are compared and placed on a 3D map of the environment created and filled online; simultaneously marked on the map and the location of the camera at the shooting time. A more detailed description of the functioning of these algorithms is beyond the scope of this article. This type of navigation method is well suited for the operation of an arbitrary device in a previously unknown environment. However, it can also be trained in advance for given environmental conditions, which will significantly improve the quality of navigation, its accuracy. It may also reduce the time for building a digital map of the environment inside each device, allowing all devices to use a single consistent map of the entire object. The coordinated map of the object also allows you to set in advance the permitted and prohibited zones for movement, determine the permitted routes of movement, set the intervals for the inspection of the object, describe the scenarios of actions in case of emergencies.
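As an illustration of the feature-extraction front end on which Visual SLAM methods rely, the sketch below detects and matches ORB keypoints between two consecutive camera frames using OpenCV; it covers only the first stage of such a pipeline (no map building or pose estimation), and the image file names are placeholders.

```python
# Minimal ORB feature detection and matching between two frames (OpenCV);
# only the front end of a Visual SLAM pipeline, for illustration.
import cv2

frame_prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
frame_curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(frame_prev, None)
kp2, des2 = orb.detectAndCompute(frame_curr, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

print(f"{len(matches)} tentative correspondences between the two frames")
# In a full Visual SLAM system these correspondences would feed pose
# estimation and the incremental 3D map of the environment.
```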

5.2 Solving the Problem of Visual Navigation of a Mobile Unit by External Stationary Cameras In addition to the above-described approach of visual navigation using onboard cameras, there is also the possibility of visual navigation of the unit by images from external stationary cameras. This approach will be most suitable for determining the position of a mobile robot relative to a previously known floor plan or open area. The described approach makes it possible to supplement the methods of visual onboard navigation, to increase the overall reliability of the system, to enhance the accuracy of navigation, to determine the location and state of the device in case of loss of communication with it, and in other emergencies. This approach is actively used to control the movement of robots in an environment with obstacles, for precise control of UAVs indoors [26–28]. Industrial facilities are equipped with a developed network of Closed Circuit Television (CCTV) cameras for security purposes. The signal from such cameras can already be processed practically without modifications, so we may use it for the specified navigation tasks. This fact is quite favorable for the development of the method of navigation. The proposed navigation method is based on the principles of image processing and analysis, motion detection, selection of specialized markers and objects of a given type, joint processing of images from multiple cameras with an intersecting field of view, etc. For more reliable detection and identification of robots, it is proposed to designate them with special machine-readable markers (QR code, Aruco, April Tag, ARToolKit, etc.). It should be noted that the operation of this system will require significant efforts associated with creating a digital map of an industrial area, initial calibration, and adjustment of all stationary cameras used for navigation, applying machine-readable markers to mobile robots, etc. However, this approach has additional advantages. Stationary cameras are potentially located along all planned routes, it is possible to notice changes on the route quickly (people and other obstacles, accumulations of robots, emergencies). Thus, we may plan a detour trajectory, determine the position of faulty devices, carry out visual navigation in inconvenient conditions such as smoke and fire, where mobile cameras are not so effective. The methods described in these subsections can obviously be applied together in a hybrid approach that combines the advantages of both methods.
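A minimal sketch of detecting machine-readable markers in a stationary-camera image is given below, assuming the ArUco module of opencv-contrib-python; the dictionary choice and file name are arbitrary, and the camera calibration needed to map pixels to the floor plan is omitted.

```python
# Sketch: detecting ArUco markers attached to mobile robots in the image of a
# stationary CCTV camera (requires opencv-contrib-python); illustrative only.
import cv2

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

frame = cv2.imread("cctv_frame.png")            # placeholder for a camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
if ids is not None:
    for marker_id, marker_corners in zip(ids.flatten(), corners):
        centre = marker_corners[0].mean(axis=0)   # pixel coordinates of the marker centre
        print(f"robot marker {marker_id} seen at pixel {centre}")
# Mapping pixel coordinates onto the floor plan would additionally require the
# calibration of the stationary cameras mentioned in the text.
```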

6 Conclusions

A reliable monitoring system is based on a combination of means that receive data from the object and transmit them to a central node, and these means are subject to stringent requirements on the corresponding set of parameters. In this case, it is optimal to use a pool of stationary and mobile multisensory platforms with different movers. The received data are transmitted over a heterogeneous network that simultaneously uses wired, wireless, cellular and satellite technologies. The management of data duplication, the increase of its information content and the organization of parallel transmission channels are implemented by the central node of the system. The algorithms used to increase the reliability and adequacy of the information are determined by the level of criticality of the situation that has arisen at the controlled object. In a critical or emergency situation, the role of video streams increases and the requirements on the reliability of decision making become significantly more important. The use of a figurative representation of information and of bar graph display means provides the required speed and reliability of data reading and accelerates the adoption of correct decisions in emergency situations. Thus, a promising monitoring system should be designed as a distributed heterogeneous infrastructure consisting of stationary and mobile units equipped with video cameras and a set of sensors that collect the information needed to build the object state profile. This infrastructure also includes a video surveillance system, the storage and processing of video information streams, and the assessment of the quality of functioning of the large technical system, which helps to make decisions more accurately.

References 1. Mosyagin, A.A.: Monitoring of potentially dangerous objects based on logical and probabilistic modeling. Abstract of dissertation research for the degree of candidate of technical sciences. M: Academy of the Ministry of Internal Affairs, 27p. (2009) (in rus.) 2. Solozhentsev, E.D.: Scenario Logic-Probabilistic Risk Management in Business and Technology. SPb. Publishing house “Business-Press”, 432p. (2004) (in rus.) 3. Tkachenko, T.E.: Monitoring of industrial objects as the basis for the prevention of technogenic emergencies. Sci. Educ. Probl. Civ. Prot. 1, 62–65 (2013) (in rus.) 4. Predictive emission monitoring systems monitoring emissions from industry. ABB Meas. Anal. ABB, 8p. (2019) 5. Trivedi, R., Vora, V.: Real-time monitoring and control system for industry. IJSRD – Int. J. Sci. Res. Dev. 1(2), 142–147 (2013). ISSN (online): 2321-0613 6. Russell, J.: Facebook is reportedly testing solar-powered internet drones again — this time with Airbus. TechCrunch. https://techcrunch.com/2019/01/21/facebook-airbus-solar-dronesinternet-program/. Accessed 30 May 2019 7. UAVIA releases its “Uavia Inside” program for drone solutions providers. Paris, France, 07 May 2019. https://www.uavia.eu/PR_20190506_UAVIA_INSIDE 8. Kharchenko, V., Yastrebenetsky, M., Fesenko, H., Sachenko, A., Kochan, V.: NPP post-accident monitoring system based on unmanned aircraft vehicle: reliability models. Nucl. Radiat. Saf. 4(76), 50–55 (2017) 9. Sachenko, A., Kochan, V., Kharchenko, V., Yastrebenetsky, M., Fesenko, H., Yanovsky, M.: NPP post-accident monitoring system based on unmanned aircraft vehicle: concept, design principles. Nucl. Radiat. Saf. 1(73), 24–29. https://doi.org/10.32918/nrs.2017.1(73).04 10. Younana, M., Housseina, E.H., Elhoseny, M., Alia, A.A.: Challenges and recommended technologies for the industrial internet of things: a comprehensive review. Measurement 151 (2020). https://doi.org/10.1016/j.measurement.2019.107198

11. Grösser, S.N.: Complexity management and system dynamics thinking. In: Grösser, S., ReyesLecuona, A., Granholm, G. (eds.) Dynamics of Long-Life Assets. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-45438-2_5 12. Schott, P., Lederer, M., Eigner, I., Bodendorf, F.: Case-based reasoning for complexity management in Industry 4.0. J. Manuf. Technol. Manag. https://doi.org/10.1108/jmtm-082018-0262 13. Duffy, V.G.: Handbook of Digital Human Modeling: Research for Applied Ergonomics and Human Factors Engineering. CRC Press, 1006p. (2016) 14. da Cruz, P.M.A.M.: Semantic figurative metaphors in information visualization. Coimbra: [s.n.]. Tese de doutoramento. Disponível na (2016). http://hdl.handle.net/10316/31166 15. Bushma, A.V., Turukalo, A.V.: Software controlling the LED bar graph displays. Semicond. Phys. Q. Electron. Optoelectron. 23(3), 329–335 (2020). https://doi.org/10.15407/spqeo23. 03.329 16. Connell, J., Fan, Q., Gabbur, P., Haas, N., Pankanti, S., Trinh, H.: Retail video analytics: an overview and survey. Proc. SPIE – Int. Soc. Opt. Eng. 8663 (2013). https://doi.org/10.1117/ 12.2008899 17. Olatunji I.E., Cheng, C.-H.: Video analytics for visual surveillance and applications: an overview and survey. In: Tsihrintzis, G., Virvou, M., Sakkopoulos, E., Jain, L. (eds.) Machine Learning Paradigms. Learning and Analytics in Intelligent Systems, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-030-15628-2_15 18. EXCLUSIVE: Drones vulnerable to terrorist hijacking, researchers say [Electronic resource]. – Mode of access: http://www.foxnews.com/tech/2012/06/25/drones-vulnerable-to-terroristhijacking-researchers-say/ – Date of access: 15.05.2015 19. Davison, A.J., et al.: MonoSLAM: real-time single camera slam. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1052–1067 (2007) 20. Larson, C.D.: An integrity framework for image-based navigation systems. In: Larson, C.D. (ed.) Air Force Inst Of Tech Wright-Patterson Afb Oh School of Engineering and Management, vol. AFIT/DEE/ENG/10-03 (2010) 21. Robertson, D., Cipolla, R.: An image-based system for urban navigation. In: The 15th British Machine Vision Conference (BMVC04), pp. 819–828 (2004) 22. Roumeliotis, S.I., Johnson, A.E., Montgomery, J.F.: Augmenting inertial navigation with image-based motion estimation. In: Proceedings of the IEEE International Conference on Robotics and Automation, ICRA’02, vol. 4, pp. 4326–4333 (2002) 23. Templeton, T.: Autonomous vision-based landing and terrain mapping using an mpc-controlled unmanned rotorcraft In: IEEE International Conference on Robotics and Automation, Roma, 10–14 April 2007, pp. 1349–1356 (2007) 24. Taketomi, T., Uchiyama, H., Ikeda, S.: Visual SLAM algorithms: a survey from 2010 to 2016. IPSJ T Comput. Vis. Appl. 9, 16 (2017). https://doi.org/10.1186/s41074-017-0027-2 25. Huang, B., Zhao, J., Liu, J.: A survey of simultaneous localization and mapping with an envision in 6G wireless networks (2019) 26. Pizarro, D., Marron, M., Peon, D., Mazo, M., Garcia, J.C., Sotelo, M.A., Santiso, E.: Robot and obstacles localization and tracking with an external camera ring. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2008), Pasadena, CA, USA, 19–23 May 2008; pp. 516–521 (2008) 27. Ji, Y., Yamashita, A., Asama, H.: Automatic calibration and trajectory reconstruction of mobile robot in camera sensor network. In: Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), Gothenburg, Sweden, 24–28 August 2015, pp. 
206–211 (2015) 28. Pizarro, D., Santiso, E., Mazo, M., Marron, M.: Pose and sparse structure of a mobile robot using an external camera. In: Proceedings of the IEEE International Symposium on Intelligent Signal Processing (WISP 2007), Alcala de Henares, Spain, 3–5 October 2007, pp. 1–6 (2007)

A Correlative Method to Rank Sensors with Information Reliability: Interval-Valued Numbers Case Mykhailo O. Popov, Oleksandr V. Zaitsev, Ruslana G. Stambirska, Sofiia I. Alpert, and Oleksandr M. Kondratov

Abstract Multisensor systems are increasingly used to obtain information when investigating incidents, assessing risks, and solving similar problems. However, as the complexity of the tasks grows, the amount of data grows too, which leads to the well-known Big Data problem. One way to overcome these difficulties is to expand the functionality of the sensors, but this requires knowledge of their information reliability. The paper proposes a method to rank sensors by an information reliability criterion, based on the idea that the information reliability of any sensor can be calculated from estimates of the proximity of its data to the data of the other sensors. This idea makes it possible to assess the information reliability of the sensors of a multisensor system in the absence of information about the class affiliations of the test objects. To assess the information reliability of the sensors, a correlation approach is proposed, and the case is considered when the data have the form of interval-valued numbers. The developed method makes it possible to determine the information reliability of the sensors on an ordinal scale. To facilitate the practical application of the developed method, a numerical example is given. Keywords Multisensor system · Ranking · Pearson correlation coefficient · Information reliability of sensor · Interval-valued number

M. O. Popov · S. I. Alpert (B) Scientific Centre for Aerospace Research of the Earth of the Institute of Geological Sciences of the National Academy of Sciences of Ukraine, Kiev, Ukraine O. V. Zaitsev · R. G. Stambirska Department of Information Technologies of the Military Diplomatic Academy Named Evgen Bereznyak Kyiv, Kiev, Ukraine O. M. Kondratov Department of Information Technologies of the Scientific Research Institute, Kiev, Ukraine © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 C. van Gulijk and E. Zaitseva (eds.), Reliability Engineering and Computational Intelligence, Studies in Computational Intelligence 976, https://doi.org/10.1007/978-3-030-74556-1_17

1 Introduction Nowadays systems that have groups of sensors, or so-called multisensor systems, are widely used to obtain information about objects of the surrounding world [1]. Both technical devices and experts can be used as sensors in the structure of such systems. The classical scheme of the multisensor system building assumes that the function of each sensor in the system is to collect primary data about objects of interest. Then this data, collected by all available sensors in the system, are transmitted to the higher level, where data aggregation, processing, and recognition of observed objects are performed. However, with increasing complexity and dynamics of object behavior, the amount of collected and processed data is growing rapidly, and we have to face the known difficulties of the Big Data problem [2]. One of the ways to overcome these difficulties is to expand the computational capabilities of sensors with the addition of intelligent operations to their function. However, it is difficult to make further applying of intellectual data processing results without knowledge about the information reliability of sensors. The information reliability of a sensor is understood as its ability to assess reality adequately [3]. The simplest approaches to assessment of the sensor information reliability are testing with known input data [4] or model research of the sensor [5]. But both of these approaches are not always possible to implement in practice. Therefore, it was proposed to evaluate the reliability of the sensor by comparing the information that is given by this sensor with information from other sensors [6, 7]. If the information from the sensor is close to the information from most of other sensors, then that sensor is considered as reliable, and vice versa, if the information of the sensor is not consistent with the data from most of other sensors, then that sensor is not considered as reliable. Various approaches and metrics are known to assess the degree of data proximity generated by different sensors [8]. One of the easiest ways to assess the degree of data proximity is to use a correlation approach. In this paper, we consider a recognition system that consists of intellectualized sensors, in which each sensor provides information about an object in the form of a set of hypotheses as concerns its possible state. The proximity of data generated by different sensors is calculated using the correlation coefficient of hypotheses probability distributions about the state of the object. However, it should take into consideration that the correlation assessment procedure depends on the type of calculation data that represent the probabilities of hypotheses. If there are one-valued (point) numbers, then the Pearson coefficient is usually used for calculation the correlation of hypotheses distributions formed by sensors [9, 10]. In [11], it is proposed to determine the correlation of hypotheses probability distributions according to Pearson, taking into account the quality of information contained in these distributions. For cases where the law of initial data distribution generated by sensors is unknown, it is proposed to use the Sherman rank correlation coefficient in [12]. Another method for calculating rank correlation is proposed in [13]. In [14], a formula

is proposed for calculating the correlation of hypotheses probability distributions, which makes it possible to evaluate their information consistency. However, in series of studies, for example, in [15], it is experimentally proved that in situations of uncertainty, unambiguous point assessments are not always optimal. In such situations, as shown in [16], there are often better assessments in the form of interval-valued numbers. Nowadays the advantages of interval assessments are embodied in communication networks [17], classification systems [18], forecasting problems [19], and others. In this paper, we consider a system containing a set of intellectualized sensors, each of them forms its own estimates about objects of the surrounding world in the form of interval-valued numbers. The task is to develop a method for determining the sensors information reliability of the system based on their estimates of test objects under conditions of uncertainty. The uncertainty is manifested in the fact that the class affiliation of test objects is unknown a priori. The material is organized in the following order. Firstly, we provide the necessary theoretical information about interval-valued numbers and a computational procedure for determining the correlation coefficient of interval-values numbers sequences. Next, the problem statement is formulated and the method of its solution is described in detail at the mathematical-algorithmic level. To facilitate the practical adoption of the developed method, an illustrative numerical example is provided.

2 Theoretical Information

2.1 Interval-Valued Numbers and Sequences

An interval-valued number is a bounded closed subset [\underline{x}, \overline{x}] = \{z \in \mathbb{R};\ \underline{x} \le z \le \overline{x}\}, where \mathbb{R} is the set of real numbers. Let there be two interval numbers [\underline{x}, \overline{x}] and [\underline{y}, \overline{y}]. In interval arithmetic, elementary operations with these numbers are performed according to the following rules [20]:

[\underline{x}, \overline{x}] + [\underline{y}, \overline{y}] = [\underline{x} + \underline{y},\ \overline{x} + \overline{y}];   (1)

[\underline{x}, \overline{x}] - [\underline{y}, \overline{y}] = [\underline{x} - \overline{y},\ \overline{x} - \underline{y}];   (2)

[\underline{x}, \overline{x}] \cdot [\underline{y}, \overline{y}] = [\min(\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}),\ \max(\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y})];   (3)

[\underline{x}, \overline{x}] / [\underline{y}, \overline{y}] = [\min(\underline{x}/\underline{y}, \underline{x}/\overline{y}, \overline{x}/\underline{y}, \overline{x}/\overline{y}),\ \max(\underline{x}/\underline{y}, \underline{x}/\overline{y}, \overline{x}/\underline{y}, \overline{x}/\overline{y})].   (4a)

Rule (4a) is applied when \underline{y} \ne 0 and \overline{y} \ne 0. In cases where the divisor in (4a) is a one-valued positive number, i.e. \underline{y} = \overline{y} = k (k > 0), the division rule is simplified to:

[\underline{x}, \overline{x}] / [k, k] = [\underline{x}/k,\ \overline{x}/k].   (4b)

Below we will use the fact that any interval-valued number [\underline{x}, \overline{x}] can also be represented as follows [21]:

[\underline{x}, \overline{x}] = t\underline{x} + (1 - t)\overline{x}, where 0 \le t \le 1.   (5)
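To make rules (1)–(4b) and representation (5) concrete, a minimal Python sketch of interval arithmetic is given below; the class and method names are illustrative and are not taken from the cited works [20, 21].

```python
# Minimal sketch of interval arithmetic, rules (1)-(4b), and the
# parametric point representation (5); names are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float  # lower endpoint (underlined value in the text)
    hi: float  # upper endpoint (overlined value in the text)

    def __add__(self, other):                       # rule (1)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):                       # rule (2)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):                       # rule (3)
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __truediv__(self, other):                   # rules (4a)/(4b)
        if other.lo == 0 or other.hi == 0:
            raise ZeroDivisionError("rule (4a) requires nonzero endpoints")
        quotients = [self.lo / other.lo, self.lo / other.hi,
                     self.hi / other.lo, self.hi / other.hi]
        return Interval(min(quotients), max(quotients))

    def point(self, t: float) -> float:             # representation (5), 0 <= t <= 1
        return t * self.lo + (1.0 - t) * self.hi


if __name__ == "__main__":
    a, b = Interval(0.2, 0.4), Interval(0.1, 0.3)
    print(a + b, a - b, a * b, a / Interval(2.0, 2.0), a.point(0.5))
```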

Suppose we have some ordered sequence P of N interval-valued numbers:

P = \{P_n = [\underline{p}_n, \overline{p}_n];\ n = 1, 2, \ldots, N\}.   (6)

The ordered sequence P is called normalized [22] if its elements satisfy two conditions. The first condition is defined by the following group of three expressions:

0 \le \underline{p}_n \le \overline{p}_n \le 1;   (7)

\sum_{n=1}^{N} \underline{p}_n \le 1;   (8)

\sum_{n=1}^{N} \overline{p}_n \ge 1.   (9)

The second condition requires that for any k (k \in \{1, 2, \ldots, N\}) the following two inequalities hold:

\sum_{n=1}^{N} \underline{p}_n + (\overline{p}_k - \underline{p}_k) \le 1;   (10)

\sum_{n=1}^{N} \overline{p}_n - (\overline{p}_k - \underline{p}_k) \ge 1.   (11)

For probabilistic cases, condition (7) is satisfied by definition. If the interval-valued numbers from the ordered sequence P do not satisfy conditions (8) and (9), then the following normalizing transformation [22] should be conducted:

(\underline{p}_n)_{norm} = \dfrac{\underline{p}_n}{\underline{p}_n + \sum_{j=1, j \ne n}^{N} \overline{p}_j};   (12)

(\overline{p}_n)_{norm} = \dfrac{\overline{p}_n}{\overline{p}_n + \sum_{j=1, j \ne n}^{N} \underline{p}_j}; \quad n = 1, 2, \ldots, N.   (13)

If the interval-valued numbers from the ordered sequence P satisfy conditions (8) and (9), but conditions (10) and (11) are not satisfied, then the following normalizing transformation [22] should be conducted:

(\underline{p}_n)_{norm} = \max\left[\underline{p}_n,\ 1 - \sum_{j=1, j \ne n}^{N} \overline{p}_j\right];   (14)

(\overline{p}_n)_{norm} = \min\left[\overline{p}_n,\ 1 - \sum_{j=1, j \ne n}^{N} \underline{p}_j\right]; \quad n = 1, 2, \ldots, N.   (15)

N 

r X,Y

 ˆ X − X (Yn − Yˆ ) n n=1 =r =   2   2 ,

N  N ˆ ˆ X Y − X × − Y n n n=1 n=1

(16)

where Xˆ and Yˆ are sample means. Using formula (5), the intervals X n and Yn from the sequences X and Y could be represented as following [22]:

280

M. O. Popov et al.

X n = t x n + (1 − t)x n ; 0 ≤ t ≤ 1;

(17a)

Yn = θ y n + (1 − θ )y n ; 0 ≤ θ ≤ 1.

(17b)

Mean over intervals of the sequence X will be ˆ¯ Xˆ = t xˆ + (1 − t)x,

(18a)

and mean over intervals of the sequence Y will be Yˆ = θ yˆ + (1 − θ ) yˆ¯ .

(18b)

In expressions (17a), (17b) and (18a), (18b) such denotations are used: xˆ =

1 N 1 N x n ; xˆ¯ = xn; n=1 n=1 N N

yˆ =

1 N 1 N y n ; yˆ¯ = y . n=1 n=1 n N N

Now one can write: ˆ¯ X n − Xˆ = t (x n − x) ˆ + (1 − t)(x n − x), and

(19)

Yn − Yˆ = θ (y n − yˆ ) + (1 − θ )(y n − yˆ¯ ).

(20)

Using expressions (17a), (17b)–(20) and omitting intermediate calculations we have:  N  X n − Xˆ (Yi − Yˆ ) = c22 + (c12 − c22 )t + (c21 − c22 )θ n=1

+ (c11 − c12 − c21 + c22 )tθ,

(21)

where c11 = c21 =

N n=1

N n=1

(x n − x)(y ˆ n − yˆ ); c12 = (x n − x)(y ˆ n − yˆ ); c22 =

N n=1

N n=1

(x n − x)(y ˆ n − yˆ¯ ); (x n − x)(y ˆ n − yˆ¯ ).

Next, we define for the sequence X: N  n=1

X n − Xˆ

2

= a22 + 2(a12 − a22 )t + (a11 + a22 − 2a12 )t 2 ,

(22)

A Correlative Method to Rank Sensors …

281

where a 11 =

N n=1

(x n − x) ˆ 2 ; a 22 =

a 12 =

N n=1

N n=1

ˆ¯ 2 ; (x n − x)

ˆ¯ (x n − x)(x ˆ n − x).

Similarly, we define for the sequence Y: N  n=1

Yn − Yˆ

2

= b22 + 2(b12 − b22 )θ + (b11 + b22 − 2b12 )θ 2 ,

(23)

where b11 =

N n=1

(y n − yˆ )2 ; b22 =

b12 =

N n=1

N n=1

( y¯n − yˆ¯ )2 ;

(y n − yˆ )( y¯n − yˆ¯ ).

After substituting expressions (18a), (18b)–(23) into formula (16), the correlation coefficient between two sequences of normalized interval-valued numbers takes the form: c22 + (c12 − c22 )t + (c21 − c22 )θ + (c11 − c12 − c21 + c22 )tθ  r=  . a22 + 2(a12 − a22 )t + (a11 + a22 − 2a12 )t 2 × b22 + 2(b12 − b22 )θ + (b11 + b22 − 2b12 )θ 2

(24)

Formula (24) shows that the correlation coefficient between sequences depends on two parameters −t and θ, i.e. r = r (t, θ ). Point values of the correlation coefficient are limited by the interval [–1, 1], so the   r = r (t, θ ) must be within this interval, regardless of a pair of values  0 values t , θ 0  t 0 , θ 0 ∈ [0, 1]. Formula (24) also shows that the values of the correlation coefficient for different pairs (t, θ ), 0 ≤ t, θ ≤ 1, en sequences X = X n = [x n , x n ]; n = 1, 2, . . . , N

and Y = Yn = [y n , y n ]; n = 1, 2, . . . , N lies in the certain closed subinterval of   interval [−1, 1]. This subinterval r , r can be determined by varying the values t and θ. The value r shows the lower, minimum level of correlation between sequences, and the value shows the upper, maximum level. Thus, the formula (24) allows us to calculate the correlation coefficient between two  normalized interval sequences X  and Y as an interval-valued number r = r , r .

282

M. O. Popov et al.

Numbers r and r are calculated by solving two optimization problems: 1. 2.

Minimize the functional r (t, θ ) under the constraints 0 ≤ t ≤ 1; 0 ≤ θ ≤ 1. The solution of this problem is the number r . Maximize the functional r (t, θ ) under the constraints 0 ≤ t ≤ 1; 0 ≤ θ ≤ 1. The solution of this problem is the number r .

If the solution of the first task is r > 0, then there is a positive correlation between the interval sequences. In the second task, the solution r < 0 means that the correlation between sequences is negative. In the case of r < 0 and r > 0 there is a negative-positive kind of relationship between the sequences. The result r = r = 0 indicates that there is no correlation between sequences.

3 Setting the Objectives Many information systems (monitoring, surveillance, etc.) have the structure shown in Fig. 1. There we have a set of intellectual sensors that observe the object of interest, and each sensor gives its own independent assessment of class affiliation of this object. Next, assessments of sensors are jointly processed and a single integrated assessment that shows class affiliation of object is formed basing on the results. The reliability of the integrated assessment depends on the reliability of the individual sensors. As we have already noted, the most accurate assessment of the information reliability of sensors can be obtained by testing it on a set of objects with a known class affiliation of each one. However, that is not always possible; much more often, only objects with a partially or completely absent a priori information about their class affiliation can be used for testing. In such conditions, determining the information reliability of sensors remains an open problem.

Fig. 1 The structure of a multisensor information system

A Correlative Method to Rank Sensors …

283

In our study, the situation is considered when a set of  of q objects is available for evaluating the sensors information reliability, in which each object belongs to one of the defined N classes, but its class affiliation is a priori unknown. We assume that a multisensor system contains of Kintellectualized sensors and  during considering a certain object of the interest ωq ωq ∈ Ω each sensor forms its own independent assessment of the object class membership. The assessment of the kth sensor (k∈{1,2,…,K}) has the form: Sk =



     H1k , p1k , . . . , Hnk , pnk , . . . , HNk , p kN ,

(25)

where Hnk is the hypothesis of the kth sensor, that shows the affiliation of the object to the n-class; pnk is the probability of the specified hypothesis; N is the number of classes that the object belongs to. Observing the object, all sensors form and consider the same set of hypotheses, and properties of the sensor are appeared in   the probability distribution  k the individual p1 . . . p kN , which it assigns to hypotheses H1k . . . HNk . The result of observing an object with any kth sensor (k ∈ {1, 2,…, K}) can be represented by a probability distribution in the form of a sequence   p k = p1k . . . , pnk . . . , p kN .

(26)

The key challenge is to evaluate the information reliability of each sensor based only on hypotheses, which are formed by the sensors of a multisensor system during the observing indeterminate test objects. The purpose of our work is to develop a correlation method for solving this problem. The problem is solved for the case when probability assessments assigned by sensors for hypotheses that are expressed as interval-valued numbers. k For example,  sensor’s assessment of the hypothesis probability Hn has the  the Kth form pnk = p kn , pnk .

4 The Method The method is based on the idea that the information reliability of each individual sensor of a multisensor system is determined by the proximity of this sensor estimates to the estimates of other sensors of the system. Two assumptions have been made: Assumption 1 The closer the estimates of this sensor are to the estimates of all other sensors of the multisensor system, the greater its information reliability in framework of the system. Assumption 2 The level of the proximity of two sensors assessments is determined by the value of their correlation coefficient.

284

M. O. Popov et al.

Fig. 2 Scheme of the correlation method for determining the information reliability of sensors

The method consists of seven steps   (Fig. 2). Step 1. Any object ωq ωq ∈ Ω is selected and each of the K sensors of the system independently forms its probability distribution of hypotheses, regarding it. So, we have K sequences:  

1 , . . . , p1 , . . . , p1 1 p1 , p1,q ,..., P 1 = p1,q n,q N ,q =  1,q 

k , . . . , pk , . . . , pk k , pk = p , ..., P k = p1,q n,q 1,q N ,q  1,q 

K K K K K K P = p1,q , . . . , pn,q , . . . , p N ,q = p1,q , p1,q , . . . ,







 1 , p1 pn,q n,q , . . . ,  k pk , pn,q ,...,  n,q  K K pn,q , pn,q , . . . ,



 ⎫ p1N ,q , p1N ,q ⎪ ⎪ ⎬  ⎪ pkN ,q , pkN ,q .   ⎪ ⎪ ⎪ p NK ,q , p NK ,q ⎭

(27)

Step 2. The interval-valued numbers of the sequences (27) are normalized according to the procedure described by formulae (7)–(15).

Step 3. The correlation between the distributions of normalized interval-valued numbers obtained on the qth object \omega_q is calculated. To do that, we first change the form of representation of the interval-valued numbers: all the interval-valued numbers normalized in the previous step that make up the sequences (27) are converted to point form by formula (5). For example, if p_n^k = [(\underline{p}_n^k)_{norm}, (\overline{p}_n^k)_{norm}] is a normalized interval number, it is converted to the form (P_n^k)_{norm} = t(\underline{p}_n^k)_{norm} + (1 - t)(\overline{p}_n^k)_{norm}, where 0 \le t \le 1, and so on. Then, substituting the normalized and transformed input data into formula (24), we calculate the correlation coefficients between the sequences that have been formed by the system's sensors based on the results of observing the qth object. As a result, a correlation matrix is constructed:

R_q = \begin{bmatrix} r_q^{1,1} & \cdots & r_q^{1,K} \\ \vdots & \ddots & \vdots \\ r_q^{K,1} & \cdots & r_q^{K,K} \end{bmatrix},   (28)

where r_q^{k,m} is the correlation coefficient between the sequences that have been formed by the kth and mth sensors for the qth object (k, m ∈ {1, 2, \ldots, K}).

Step 4. It is checked whether all objects from the set Ω have already been used for testing. If "no", it is necessary to go back to Step 1 and select the next object ω'. If "yes", we go to the next step.

Step 5. The correlation matrix averaged over the set Ω of Q objects is calculated:

\bar{R} = \frac{1}{Q}\sum_{q=1}^{Q} R_q = \begin{pmatrix} \bar{r}^{1,1} & \cdots & \bar{r}^{1,K} \\ \vdots & \ddots & \vdots \\ \bar{r}^{K,1} & \cdots & \bar{r}^{K,K} \end{pmatrix},   (29)
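The following Python sketch (not from the original chapter) illustrates Steps 3–5 under stated assumptions: the interval correlation of formula (24) is defined earlier in the chapter and is not reproduced here, so the sketch approximates it by evaluating an ordinary Pearson correlation on the point forms P(t) = t·lower + (1 − t)·upper over a grid of t values and taking the minimum and maximum (the authors report using MATLAB's fmincon minimization for this step). All function and variable names are illustrative.

```python
import numpy as np

def point_form(intervals, t):
    """Convert a sequence of intervals [(lo, up), ...] to point form t*lo + (1 - t)*up."""
    arr = np.asarray(intervals, dtype=float)
    return t * arr[:, 0] + (1.0 - t) * arr[:, 1]

def interval_correlation(seq_a, seq_b, t_grid=np.linspace(0.0, 1.0, 101)):
    """Approximate interval correlation between two interval-valued sequences.

    Stand-in for formula (24): Pearson correlation of the point forms,
    minimized/maximized over a grid of t values (an assumption, not the
    original formula).
    """
    values = [np.corrcoef(point_form(seq_a, t), point_form(seq_b, t))[0, 1]
              for t in t_grid]
    return min(values), max(values)

def correlation_matrix(sensor_sequences):
    """Matrix R_q of pairwise interval correlations for one object, cf. Eq. (28)."""
    K = len(sensor_sequences)
    return [[interval_correlation(sensor_sequences[k], sensor_sequences[m])
             for m in range(K)] for k in range(K)]

def average_matrices(matrices):
    """Element-wise average of the per-object matrices R_q over Q objects, cf. Eq. (29)."""
    Q, K = len(matrices), len(matrices[0])
    return [[(sum(matrices[q][k][m][0] for q in range(Q)) / Q,
              sum(matrices[q][k][m][1] for q in range(Q)) / Q)
             for m in range(K)] for k in range(K)]
```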

where \bar{r}^{k,m} = \frac{1}{Q}\sum_{q=1}^{Q} r_q^{k,m}; k, m ∈ {1, 2, \ldots, K}.

Step 6. The correlation between each individual kth sensor and the combination of all other sensors of the system is calculated as

R^k = \frac{1}{K-1}\sum_{i=1,\, i \neq k}^{K} \bar{r}^{k,i}, \quad k \in \{1, 2, \ldots, K\}.   (30)

In accordance with the above assumptions, each number R^k calculated by formula (30) reflects the value of the information reliability of the kth sensor within the multisensor system.

Step 7. The sensors are ranked according to the value of information reliability. Since the values R^k are interval-valued numbers, that is, each R^k = [\underline{R}^k, \overline{R}^k], we use the method given in [24] for ranking. Its essence is as follows. Suppose we need to compare two interval-valued numbers [\underline{x}, \overline{x}] and [\underline{y}, \overline{y}]. Then, according to [24], the possibility that [\underline{x}, \overline{x}] ≥ [\underline{y}, \overline{y}] is defined by the formula:

psb([\underline{x}, \overline{x}] \ge [\underline{y}, \overline{y}]) = \max\left\{ 1 - \max\left( \frac{\overline{y} - \underline{x}}{(\overline{x} - \underline{x}) + (\overline{y} - \underline{y})},\, 0 \right),\, 0 \right\}.   (31)

Applying formula (31), we rank a set of K interval-valued numbers. To rank all interval-valued numbers R^k (k = 1, 2, \ldots, K), and thus to determine the information reliability of each sensor of the multisensor system, each interval-valued number is first compared with all others. This is done sequentially according to the formula:

psb(R^i \ge R^j) = \max\left\{ 1 - \max\left( \frac{\overline{R}^j - \underline{R}^i}{(\overline{R}^i - \underline{R}^i) + (\overline{R}^j - \underline{R}^j)},\, 0 \right),\, 0 \right\}; \quad i, j \in \{1, 2, \ldots, K\}.   (32)

For simplicity, we denote psb(R^i ≥ R^j) = psb_{ij}. Then a matrix is formed from all psb(·) values calculated with the help of formula (32):

PSB = \begin{pmatrix} psb_{11} & \cdots & psb_{1K} \\ \vdots & \ddots & \vdots \\ psb_{K1} & \cdots & psb_{KK} \end{pmatrix}.   (33)

Then we find the sum of the elements in each row of the PSB matrix:

psb_1 = psb_{11} + \cdots + psb_{1k} + \cdots + psb_{1K},
\ldots
psb_k = psb_{k1} + \cdots + psb_{kk} + \cdots + psb_{kK},
\ldots
psb_K = psb_{K1} + \cdots + psb_{Kk} + \cdots + psb_{KK}.   (34)

The values psb_k (k = 1, 2, \ldots, K) calculated in this way are ranked in descending order, and an ordinal statistic is constructed in which the position of each psb value determines the information reliability rank of the corresponding sensor of the multisensor system. For example, suppose we have three sensors S1, S2, S3 and the following values are calculated for them: psb_1 = 1.2, psb_2 = 1.3, psb_3 = 0.9. Their ordering gives the order statistic psb_2 > psb_1 > psb_3. It follows that S2 > S1 > S3, i.e. sensor S2 has the highest information reliability, sensor S1 follows, and sensor S3 has the lowest. Thus, the developed correlation method makes it possible to determine the information reliability of sensors on an ordinal scale. Let us illustrate the proposed method with a numerical example.
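A minimal Python sketch of Step 7 (not part of the original chapter) is given below; it takes interval-valued reliabilities R^k and implements formulas (31)–(34). The three interval values used in the demonstration are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)

def psb(a: Interval, b: Interval) -> float:
    """Possibility degree that interval a >= interval b, Eqs. (31)/(32)."""
    a_lo, a_up = a
    b_lo, b_up = b
    width = (a_up - a_lo) + (b_up - b_lo)
    return max(1.0 - max((b_up - a_lo) / width, 0.0), 0.0)

def rank_sensors(reliabilities: List[Interval]) -> List[int]:
    """Rank sensors by interval-valued reliability R^k via Eqs. (33)-(34).

    Returns 0-based sensor indices sorted from most to least reliable.
    """
    K = len(reliabilities)
    # PSB matrix, Eq. (33)
    psb_matrix = [[psb(reliabilities[i], reliabilities[j]) for j in range(K)]
                  for i in range(K)]
    # Row sums, Eq. (34)
    row_sums = [sum(row) for row in psb_matrix]
    return sorted(range(K), key=lambda k: row_sums[k], reverse=True)

# Hypothetical interval-valued reliabilities R^1, R^2, R^3 (illustrative only)
R = [(0.45, 0.70), (0.50, 0.80), (0.20, 0.40)]
print(rank_sensors(R))  # -> [1, 0, 2]: sensor 2 ranked first, then sensor 1, then sensor 3
```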


5 Numerical Example

Step 1. Formation of the input data. Suppose the input data are generated by 3 sensors of a multisensor system. Observing 2 objects, each sensor formed a set of 4 hypotheses about their probable states. Two sequences with interval estimates of the hypothesis probabilities were formed per sensor, as represented in Table 1.

Step 2. Normalization. Each row sequence is normalized according to the procedure defined by formulae (7)–(15). In this case, the table elements meet the requirements of conditions (7)−(9) (Table 2). Some elements do not meet conditions (10), (11) (Table 3). Therefore, we normalize the corresponding sequences according to formulae (14), (15). The normalized elements are presented in Table 4.

Step 3. Conversion of the interval-valued numbers into point form by formula (5) (Table 5).

Step 4. Calculation of correlation coefficients. This step is performed in the following sequence.

Table 1 Incoming data

Sensor/Object | H1 Low | H1 Up | H2 Low | H2 Up | H3 Low | H3 Up | H4 Low | H4 Up
1/1           | 0.10   | 0.20  | 0.20   | 0.40  | 0.50   | 0.60  | 0.10   | 0.20
1/2           | 0.09   | 0.21  | 0.18   | 0.43  | 0.42   | 0.62  | 0.05   | 0.17
2/1           | 0.05   | 0.30  | 0.11   | 0.30  | 0.18   | 0.40  | 0.28   | 0.50
2/2           | 0.15   | 0.35  | 0.05   | 0.25  | 0.25   | 0.42  | 0.22   | 0.61
3/1           | 0.30   | 0.40  | 0.30   | 0.60  | 0.10   | 0.20  | 0.10   | 0.20
3/2           | 0.30   | 0.50  | 0.30   | 0.40  | 0.10   | 0.15  | 0.10   | 0.25

H—number of hypothesis, Sensor—number of sensor, Object—number of object, Low—lower bound of interval, Up—upper bound of interval

Table 2 Results of the checking under the conditions (7)−(9)

Sensor/Object | Condition (7) | Condition (8) | Condition (9)
1/1           | [0; 1]        | 0.90          | 1.40
1/2           | [0; 1]        | 0.74          | 1.43
2/1           | [0; 1]        | 0.62          | 1.50
2/2           | [0; 1]        | 0.67          | 1.63
3/1           | [0; 1]        | 0.80          | 1.40
3/2           | [0; 1]        | 0.80          | 1.30


Table 3 Results of the checking under the conditions (10), (11)

S/Obj | H1 Low | H1 Up | H2 Low | H2 Up | H3 Low | H3 Up | H4 Low | H4 Up
1/1   | 1.00   | 1.30  | 1.10   | 1.20  | 1.00   | 1.30  | 1.00   | 1.30
1/2   | 0.86   | 1.31  | 0.99   | 1.18  | 0.94   | 1.23  | 0.86   | 1.31
2/1   | 0.87   | 1.25  | 0.81   | 1.31  | 0.84   | 1.28  | 0.84   | 1.28
2/2   | 0.87   | 1.43  | 0.87   | 1.43  | 0.84   | 1.46  | 1.06   | 1.24
3/1   | 0.90   | 1.30  | 1.10   | 1.10  | 0.90   | 1.30  | 0.90   | 1.30
3/2   | 1.00   | 1.10  | 0.90   | 1.20  | 0.85   | 1.25  | 0.95   | 1.15

H—number of hypothesis, S—number of sensor, Obj—number of object, Low—lower bound of interval, Up—upper bound of interval

Table 4 The results of normalization by formulae (14), (15)

S/Obj | H1 Low | H1 Up | H2 Low | H2 Up | H3 Low | H3 Up | H4 Low | H4 Up
1/1   | 0.10   | 0.20  | 0.20   | 0.30  | 0.50   | 0.60  | 0.10   | 0.20
1/2   | 0.09   | 0.21  | 0.18   | 0.43  | 0.42   | 0.62  | 0.05   | 0.17
2/1   | 0.05   | 0.30  | 0.11   | 0.30  | 0.18   | 0.40  | 0.28   | 0.50
2/2   | 0.15   | 0.35  | 0.05   | 0.25  | 0.25   | 0.42  | 0.22   | 0.55
3/1   | 0.30   | 0.40  | 0.30   | 0.50  | 0.10   | 0.20  | 0.10   | 0.20
3/2   | 0.30   | 0.50  | 0.30   | 0.40  | 0.10   | 0.15  | 0.10   | 0.25

Table 5 The results of the form conversion by formula (5)

Sensor/Object | Hypothesis 1             | Hypothesis 2             | Hypothesis 3             | Hypothesis 4
1/1           | t·0.10 + (1 − t)·0.20    | t·0.20 + (1 − t)·0.30    | t·0.50 + (1 − t)·0.60    | t·0.10 + (1 − t)·0.20
1/2           | t·0.09 + (1 − t)·0.21    | t·0.18 + (1 − t)·0.43    | t·0.42 + (1 − t)·0.62    | t·0.05 + (1 − t)·0.17
2/1           | t·0.05 + (1 − t)·0.30    | t·0.11 + (1 − t)·0.30    | t·0.18 + (1 − t)·0.40    | t·0.28 + (1 − t)·0.50
2/2           | t·0.15 + (1 − t)·0.35    | t·0.05 + (1 − t)·0.25    | t·0.25 + (1 − t)·0.42    | t·0.22 + (1 − t)·0.55
3/1           | t·0.30 + (1 − t)·0.40    | t·0.30 + (1 − t)·0.50    | t·0.10 + (1 − t)·0.20    | t·0.10 + (1 − t)·0.20
3/2           | t·0.30 + (1 − t)·0.50    | t·0.30 + (1 − t)·0.40    | t·0.10 + (1 − t)·0.15    | t·0.10 + (1 − t)·0.25


Table 6 Correlation coefficients between the sequences

Object | Sensor 1 & Sensor 2 | Sensor 1 & Sensor 3 | Sensor 2 & Sensor 3
1      | [0.045; 0.097]      | [–0.45; –0.38]      | [–0.90; –0.77]
2      | [–0.33; 0.33]       | [–0.62; –0.21]      | [–0.90; –0.59]

Table 7 The average level of correlation of each sensor per object

Object | Sensor 1     | Sensor 2     | Sensor 3
1      | [0.69; 0.76] | [0.47; 0.56] | [0.22; 0.32]
2      | [0.42; 0.96] | [0.28; 0.77] | [0.14; 0.50]

4.1. Substituting the normalized and transformed input data into formulae (16), we calculate the correlation coefficients between the sequences formed by the sensors of the multisensor system based on the observation results (see Table 6). The calculations were conducted in MATLAB using the minimum search function fmincon.

4.2. The average level of sensor correlation is calculated by formula (29) (Table 7).

4.3. The average information reliability of each sensor in the multisensor system over the set of all objects is calculated by formula (30) (Table 8).

Step 5. Determination of the information reliability of the sensors. This is carried out by ranking the sensors according to the procedure above, using formulae (29)−(34) (Table 9). By adding the elements in the rows of Table 9 we obtain the values of information reliability for each sensor (Table 10). Thus, Sensor 3 has the highest information reliability (psb = 1.43).

Table 8 The average level of correlation of each sensor (over all objects)

Sensor 1     | Sensor 2     | Sensor 3
[0.56; 0.86] | [0.37; 0.66] | [0.18; 0.41]

Table 9 Possibility degrees psb_ij obtained by formula (32)

Probability | Sensor 1 | Sensor 2 | Sensor 3
Sensor 1    | 0.50     | 0.18     | 0.00
Sensor 2    | 0.81     | 0.50     | 0.06
Sensor 3    | 0.00     | 0.93     | 0.50

Table 10 The value of information reliability

Sensor 1 | Sensor 2 | Sensor 3
0.68     | 1.37     | 1.43
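As a quick plausibility check (not part of the original chapter), the row sums of Table 9 reproduce the reliability values of Table 10; the short snippet below, with the matrix copied from Table 9, illustrates formula (34).

```python
# PSB matrix from Table 9 (rows/columns: Sensor 1, Sensor 2, Sensor 3)
psb_matrix = [
    [0.50, 0.18, 0.00],
    [0.81, 0.50, 0.06],
    [0.00, 0.93, 0.50],
]

# Row sums, formula (34): information reliability per sensor (Table 10)
row_sums = [round(sum(row), 2) for row in psb_matrix]
print(row_sums)  # [0.68, 1.37, 1.43] -> Sensor 3 ranked highest
```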


6 Conclusion

One of the main trends in the development of multisensor systems is the intellectualization of sensors. In this case, the information reliability of the sensors is especially important for data processing. The paper proposes a method to rank sensors by an information reliability criterion, based on the idea that the information reliability of any sensor can be calculated from estimates of the proximity of its data to the data of the other sensors. The implementation of this idea makes it possible to assess the information reliability of the sensors of a multisensor system in the absence of information about the class affiliations of the test objects. To assess the information reliability of the sensors, a correlation approach is used, and the case is considered in which the data have the form of interval-valued numbers. The developed method makes it possible to determine the information reliability of the sensors on an ordinal scale. To facilitate the practical application of the developed method, a numerical example is given. The developed method may be useful for evaluating the effectiveness of multisensor systems operating under uncertainty.

References

1. Xiong, N., Svensson, P.: Multi-sensor management for information fusion: issues and approaches. Inf. Fusion 3, 163–186 (2002)
2. Soille, P., Marchetti, P.: Proceedings of the 2016 Conference on Big Data from Space (BiDS'16), EUR 27775 EN (2016). https://doi.org/10.2788/854791
3. Rogova, G., Nimier, V.: Reliability in information fusion: literature survey. In: Proceedings of the 7th International Conference on Information Fusion, Stockholm, Sweden (2004)
4. Zhu, J., Wang, X., Song, Y.: Evaluating the reliability coefficient of a sensor based on the training data within the framework of evidence theory. IEEE Access 6, 30952–30601 (2018). https://doi.org/10.1109/access.2018.2816915
5. Yuan, K., Xiao, F., Fei, L., Kang, B., Deng, Y.: Modeling sensor reliability in fault diagnosis based on evidence theory. Sensors 16(113) (2016). https://doi.org/10.3390/s16010113
6. Elouedi, Z., Mellouli, K., Smets, P.: Assessing sensor reliability for sensor data fusion within the transferable belief model. IEEE Trans. Syst. Man Cybern. 34(1), 782–787 (2004)
7. Schubert, J.: Creating prototypes for fast classification in Dempster-Shafer clustering. In: Gabbay, D.M., Kruse, R., Nonnengart, A., Ohlbach, H.J. (eds.) Qualitative and Quantitative Practical Reasoning, Proceedings of the First International Joint Conference ECSQARU-FAPR'97, Bad Honnef, June 1997 (LNAI 1244), pp. 525–535. Springer, Berlin (1997)
8. Deza, E., Deza, M.M.: Dictionary of Distances, 444 p. École Normale Supérieure, Paris (2008)
9. Song, Y., Wang, X., Lei, L., Xue, A.: Evidence combination based on credibility and separability. In: 12th International Conference on Signal Processing (ICSP), pp. 1392–1396 (2014). https://doi.org/10.1109/icsp2014.7015228
10. Jiang, W., Wang, S., Liu, X., Zheng, H., Wei, B.: Evidence conflict measure based on OWA operator in open world. PLoS ONE 12(5), e0177828 (2017). https://doi.org/10.1371/journal.pone.0177828


11. Li, D., Deng, Y.: A new correlation coefficient based on generalized information quality. IEEE Access 7, 175411–175419 (2019). https://doi.org/10.1109/access.2019.2957796
12. Shi, F., Su, X., Qian, H., Yang, N., Han, W.: Research on the fusion of dependent evidence based on rank correlation coefficient. Sensors 17, 2362 (2017). https://doi.org/10.3390/s17102362
13. Su, X., Xu, P., Mahadevan, S., Deng, Y.: On consideration of dependence and reliability of evidence in Dempster-Shafer theory. J. Inf. Comput. Sci. 11, 4901–4910 (2014)
14. Sun, G., Guan, X., Yi, X., Zhao, J.: Conflict evidence measurement based on the weighted separate union kernel correlation coefficient. IEEE Access 6, 30458–30472 (2018). https://doi.org/10.1109/access.2018.2844201
15. Wintle, B.C., Fraser, H., Wills, B.C., Nicholson, A.E., Fidler, F.: Verbal probabilities: very likely to be somewhat more confusing than numbers. PLoS ONE 14(4), e0213522 (2019). https://doi.org/10.1371/journal.pone.0213522
16. Nguyen, H.T., Kreinovich, V., Zuo, Q.: Interval-valued degrees of belief: applications of interval computations to expert systems and intelligent control. Int. J. Uncertainty Fuzziness Knowl.-Based Syst., 1–42. World Scientific Publishing Company (1996)
17. Zhang, Y., Liu, Y., Chao, H.-C., Zhang, Z.: Classification of incomplete data based on evidence theory and an extreme learning machine in wireless sensor networks. Sensors 18, 1046 (2018). https://doi.org/10.3390/s18041046
18. Yu, X.C., He, H., Hu, D., Zhou, W.: Land cover classification of remote sensing imagery based on interval-valued data fuzzy c-means algorithm. Sci. China: Earth Sci. 57, 1306–1313 (2014). https://doi.org/10.1007/s11430-013-4689-z
19. Maia, A.L.S., de Carvalho, F.A.T., Ludermir, T.B.: Forecasting models for interval-valued time series. Neurocomputing 71, 3344–3352 (2008)
20. Moore, R.E., Kearfott, R.B., Cloud, M.J.: Introduction to Interval Analysis, 234 p. Society for Industrial and Applied Mathematics, Philadelphia, PA (2009)
21. Ren, A., Wang, Y., Xue, X.: A novel approach based on preference-based index for interval bilevel linear programming problem. J. Inequalities Appl. 112, 16 p. (2017). https://doi.org/10.1186/s13660-017-1384-1
22. Wang, Y.-M., Yang, J.-B., Xu, D.-L., Chin, K.-S.: On the combination and normalization of interval-valued belief structures. Inf. Sci. 177, 1230–1247 (2007)
23. Pandian, P., Kavitha, K.: On correlation between two real interval sets. J. Phys.: Conf. Ser. 1000, 012055 (2018)
24. Xu, Z.S., Da, Q.L.: The uncertain OWA operator. Int. J. Intell. Syst. 17, 569–575 (2002)

COVID-19 Pandemic Risk Analytics: Data Mining with Reliability Engineering Methods for Analyzing Spreading Behavior and Comparison with Infectious Diseases

Alicia Puls and Stefan Bracke

Abstract In December 2019, the world was confronted with the outbreak of the respiratory disease COVID-19 ("Corona"). The first infection—confirmed case—was detected in the city of Wuhan, Hubei, China. At first it was an epidemic in China, but in the first quarter of 2020 it evolved into a pandemic, which continues to this day. This paper focuses on data analytics regarding COVID-19 infection data. The goal is data mining considering model uncertainty, pandemic spreading behavior with lockdown impact in Germany, Italy, Japan, New Zealand and France in the first and second wave. Furthermore, a comparison with other infectious diseases (measles and influenza) is made. Statistical models and methods from reliability engineering, like the Weibull distribution model or trend tests, are used to analyze the occurrence of infection.

1 Introduction

In December 2019, the world was confronted with the outbreak of the respiratory disease COVID-19 ("Corona"). The first infection—confirmed case—was detected in the city of Wuhan, Hubei, China. At first it was an epidemic in China, but in the first quarter of 2020 it evolved into a pandemic, which continues to this day. The COVID-19 pandemic, with its incredible speed of spread, shows the vulnerability of a globalized and networked world. The first months of the pandemic were characterized by a heavy burden on health systems. Worldwide, the populations of countries were affected by severe restrictions, like educational system shutdown, public traffic system breakdown or a comprehensive lockdown. The severity of the burden was dependent on many factors, e.g. government, culture or health system. However, the burden occurred in each country with slight time lags, cf. Bracke et al. [1].


This paper focuses on data analytics regarding infection data of the COVID-19 pandemic. It is a continuation of the research study "COVID-19 pandemic data analytics: Data heterogeneity, spreading behavior, and lockdown impact", published by Bracke et al. [1]. The goal of this assessment is the evaluation and analysis of infection data, considering model uncertainty, pandemic spreading behavior with lockdown impact and the early second wave in Germany, Italy, Japan, New Zealand and France. Furthermore, a comparison with other infectious diseases (measles and influenza) is made. The used database from Johns Hopkins University (JHU) runs from 01/22/2020 until 09/22/2020 with daily data; the dynamic development after 09/22/2020 is not considered. The measles/influenza analytics are based on the Robert Koch Institute (RKI) database as of 09/22/2020. Statistical models and methods from reliability engineering, like the Weibull distribution model or trend tests, are used to analyze the occurrence of infection.

2 Goal of Research Study

The overarching goal is the analysis of the development of infection occurrence within the mid-early COVID-19 pandemic time (12.2019–09.2020). The detailed topics are as follows:

1. Overview with respect to data quality and the impact on uncertainty,
2. Impact of lockdown measures based on spreading behavior analytics,
3. Detection and analysis of the spreading behavior in the early second wave,
4. Comparison of COVID-19 spreading behavior with other infectious diseases (influenza/measles).

These topics are discussed based on data from five different reference countries. The selection of the countries was based on the following characteristics:

1. Germany: Data quality/access, lockdown: distance regulations and restrictions on contact,
2. Italy: First massive outbreak in Europe, hard lockdown,
3. Japan: Soft lockdown, socially usual high hygienic standard,
4. New Zealand: Hard lockdown, effective border closures and soft second wave,
5. France: Hard lockdown in first wave and strong spreading in second wave.

3 Methods

This section shows the statistical fundamentals for analyzing the COVID-19 pandemic data. The spreading behavior and the impact of the lockdown in the different countries are analyzed by using the Weibull distribution model. The Weibull distribution model is frequently used within reliability engineering and risk analytics, cf. Birolini [2]. The applicability of this model for the evaluation of the occurrence of infection is based on the exponential progress. In addition to classical methods of virology such as the SIR model (cf. Kermack and McKendrick [9]), e.g. applied to COVID-19 in D'Arienzo and Coniglio [4] in combination with the basic reproduction number, the Weibull distribution model offers the possibility to gain knowledge with regard to the infection development. The easy interpretability of the Weibull parameters allows the analysis of the spreading behavior, in particular the spreading speed. This is the first advantage in comparison to the use of an exponential distribution model. The second advantage is the normalization of the Weibull distribution function: it allows an easy comparison of measurement data based on different time ranges (samples). Therefore, the analysis of the spreading behavior before and after lockdown as well as the comparison of COVID-19 with other infectious diseases are based on the determination and interpretation of these Weibull parameters. While the analysis of the second wave is also done with the Weibull model, its detection is conducted with a Cox-Stuart trend test (significance test).

3.1 Weibull Distribution Model

The two-parameter Weibull distribution model is given by Eq. (1), cf. Weibull [16]:

F(x) = 1 - \exp\left( -\left( \frac{x}{T} \right)^{b} \right).   (1)

The parameters, besides the life span variable x, are the scale parameter T (in lifetime analysis: characteristic life span) and the shape parameter b. By varying the parameter b, different failure rates can be described, therefore the Weibull model can be used flexibly for different applications, cf. Rinne [13]. The shape parameter b gives hints regarding the character of the failure period: early failure period, random failure or operation-time-related failure behavior. The Weibull parameters are estimated by using the Maximum Likelihood Estimator (MLE), cf. Fisher [6]. For the occurrence of infection, the shape parameter b, as the gradient of the model, is interpreted as the spreading speed (transfer thinking from reliability engineering: there, the shape parameter b describes the occurrence of damage cases within a product fleet). The scale parameter T gives another hint of the spreading speed considering the first infection case: representing the x value at the probability 0.632, T indicates, when comparing different models (e.g. countries), how fast the infection cases progress in relation to the total days.
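As an illustration of this fitting step (not part of the original chapter), the following Python sketch estimates T and b by MLE with scipy, assuming the daily cumulative confirmed cases have been expanded into one "day of occurrence" observation per case; the variable names and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import weibull_min

# Toy input: cumulative confirmed cases per day since the index case (illustrative values)
cumulative_cases = np.array([1, 3, 7, 16, 35, 70, 130, 220, 350, 520])

# Expand to one observation per case: the day (1-based) on which each case was confirmed
daily_new = np.diff(np.concatenate(([0], cumulative_cases)))
days = np.repeat(np.arange(1, len(cumulative_cases) + 1), daily_new)

# MLE fit of the two-parameter Weibull model of Eq. (1), location fixed to 0
b, loc, T = weibull_min.fit(days, floc=0)
print(f"shape b = {b:.2f}, scale T = {T:.1f} days")
```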


3.2 Cox-Stuart Trend Test

The Cox-Stuart trend test is a non-parametric statistical test for detecting trends in a sample, based on the binomial distribution. The data is divided at the midpoint into two sequences and the paired differences D are built. For the detection of the second wave, the one-sided form of the test is used to determine an upward trend. Therefore, the number of positive signs in D is defined as S+. The null hypothesis states that S+ follows a binomial distribution with the number of experiments n equal to the number of elements of D and a probability of 0.5. If the p value of the test is smaller than the significance level α, the null hypothesis is rejected and an uptrend is confirmed; cf. Cox and Stuart [3], Papula [10]:

p = P(X \le S_{+}) = \sum_{k \le S_{+}} \binom{n}{k}\, 0.5^{k}\,(1-0.5)^{n-k} \le \alpha.   (2)

For detecting the second wave, the confirmed daily cases are analyzed as a time series, so the Cox-Stuart trend test is applicable.
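A minimal Python sketch of this test (not part of the original chapter) is shown below. The chapter does not state the sign convention for the paired differences, so the sketch assumes D is taken as first half minus second half, which makes Eq. (2) a left-tail binomial probability for an upward trend; dropping zero differences is a common convention and is also an assumption here.

```python
from scipy.stats import binom

def cox_stuart_upward_p(series):
    """One-sided Cox-Stuart test for an upward trend, following Eq. (2).

    Assumption: paired differences are first half minus second half, so an
    upward trend produces few positive signs and a small left-tail p value.
    """
    n_half = len(series) // 2
    first, second = series[:n_half], series[-n_half:]
    diffs = [a - b for a, b in zip(first, second)]
    diffs = [d for d in diffs if d != 0]        # drop ties (common convention)
    s_plus = sum(1 for d in diffs if d > 0)     # number of positive signs S+
    return binom.cdf(s_plus, len(diffs), 0.5)   # p = P(X <= S+)

# Example: 14 daily confirmed cases with an increasing tendency (illustrative data)
cases = [3, 5, 4, 6, 5, 7, 8, 10, 9, 12, 14, 13, 16, 18]
print(cox_stuart_upward_p(cases))  # small p value -> upward trend at alpha = 0.05
```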

4 Database

The basis for the presented research study is the documentation of the worldwide infection data of the Johns Hopkins University (JHU). The COVID-19 dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University has documented confirmed cases, recovered cases as well as death cases by countries and regions, starting at 01/22/2020, cf. JHU [8]. Furthermore, the data was compared with the Robert Koch Institute (RKI) data on a case-by-case basis, cf. RKI [12]. As a frame for the data basis of the presented study, Table 1 shows an overview of key data with regard to the 1st infection (confirmed case) and the lockdown of the countries in focus in this paper. For the presented research work, the dates of the first infection and of the lockdown are relevant for each analyzed country.

Table 1 Dates of first infection and lockdown per country

Country     | 1st infection | Lockdown
Germany     | 01/28/2020    | 03/22/2020
Italy       | 01/28/2020    | 03/09/2020
Japan       | 01/16/2020    | 03/28/2020
France      | 01/24/2020    | 03/17/2020
New Zealand | 02/28/2020    | 03/26/2020


Note: The date of the "1st infection" represents the index case, the first documented case. The design of the lockdown had different characteristics: The Italian lockdown included the prohibition of leaving home (except for necessities). The lockdown in Japan was a strong recommendation of the government; a real lockdown cannot be enforced under the Japanese constitution. In Germany, distance regulations came into effect on 22nd March 2020. The lockdown in France was characterized by curfew and movement restrictions. The lockdown in New Zealand focused on strict border closures, favored by the island location.

5 Data Quality and Impact on Uncertainty

It must be considered that the data quality within the JHU database differs for reasons related to the reporting countries: e.g., data can be incomplete and censored, depending on the collection and reporting system of the country. Furthermore, the different definitions of facts (e.g. death case: death with COVID-19 or because of COVID-19) have an impact on the data. This section gives a brief overview of the uncertainty with regard to data acquisition; for detailed explanations cf. Bracke et al. [1].

First of all, the type of measuring method has to be considered. Three aspects can be mentioned:

• Criteria for testing (test strategy, e.g. symptom-based or area-wide),
• Reporting system (reporting procedure),
• Accessibility of the health department (e.g. weekend impact).

The definition of a case of illness (confirmed case) is also relevant for the database. Other relevant definitions are infection, recovered or death case. The post mortem analysis has a strong impact on death cases and can differ between countries and regions (e.g. post mortem analyses required by law or individual decision by the doctor who determines the cause of death).

Apart from the political measures and the differences in measurement and definitions considered here, many other factors influence virus spread and thus the data situation. Some of these uncertainty factors are (without claiming to be conclusive), cf. Dimmock et al. [5]:

• Seasonality and climatic effects, cf. Sajadi et al. [14],
• Frequency of susceptible individuals in the population, like urbanity and persons in agglomerations (population density),
• Differences in behavior, e.g. culturally or climatically determined,
• Type of treatment, cf. Gattinoni et al. [7].

With the knowledge that these uncertainties occur, the comparative data is kept as constant as possible. Therefore:

298

A. Puls and S. Bracke

• The data is differentiated between before and after lockdown,
• Countries with a similar industrialization standard are compared,
• Ranked data is used and the time is normalized to the arrival date.

In addition, the results are checked for plausibility and uncertainties during the analyses.

6 Analyses of the COVID-19 Spreading Behavior

This section focuses on COVID-19 data analytics: the analysis of the spreading behavior within different countries in the first wave, considering the spreading before and after lockdown, and the analysis of the early second wave.

6.1 Infection (Confirmed Cases) Before Lockdown

The comparison of the development of the infection in Germany, Italy, Japan, France and New Zealand with the focus on confirmed cases is shown in Fig. 1 (log-log-scale). Weibull distribution models (cf. Eq. 1) are fitted based on the cumulative cases; the Weibull parameters are estimated by using the MLE and are shown in Table 2, the shape parameter incl. confidence belt with confidence level γ = 0.95. The Weibull model and its parameters show the infection development in a sound way: the Weibull fit is based on the data range from the first known infection case (day 1) until the lockdown of the analyzed country.

Fig. 1 Weibull distribution models, representing the infection development, confirmed cases, interval 1st infection case (index case) until lockdown related to the analyzed countries (log-log-scale)


Table 2 Weibull model parameters before lockdown starting with 1st infection case (cumulative confirmed cases related to analyzed country). Confidence level γ = 0.95

Country     | Cases  | T (d) | Shape b (confidence belt)
Germany     | 22,213 | 53    | 19.61 ≤ 19.82 ≤ 20.03
Italy       | 9,172  | 38    | 12.91 ≤ 13.12 ≤ 13.34
Japan       | 1,466  | 52    | 4.60 ≤ 4.79 ≤ 5.00
France      | 7,652  | 52    | 17.48 ≤ 17.80 ≤ 18.12
New Zealand | 283    | 26    | 9.95 ≤ 10.96 ≤ 12.01

Not all uncertainty factors could be corrected for, due to unavailable information. However, the different characteristics of the spreading behavior in the countries under consideration can be seen. The Weibull models show the different infection developments: the shape parameter represents the gradient and can be interpreted as the spreading speed of the infection within the population. A steep curve means a high spreading speed. Japan shows the lowest shape parameter (spreading speed). A possible reason for this effect is the socially usual high hygienic standard (e.g. wearing masks, social distance), which was additionally enforced by the COVID-19 pandemic. The shape parameters of Germany, Italy and France are on a similar level; this gives a clear hint regarding the high spreading speed within a short time period (a few weeks) in Europe. The small number of cases and the lower T of New Zealand can be explained by the fast decision and implementation of the lockdown (about one month after the first case) in comparison to the other analyzed countries (about two months), cf. Table 1.

Note: In reliability engineering, typical wear-out mechanisms can have 1.5 ≤ b ≤ 3. Brittle failures can have b ~ 8. Transferring this thinking (from the reliability engineering point of view): the shape parameters of the COVID-19 spreading behavior in Germany, Italy and France (cf. Table 2) show a very strong gradient, respectively spreading speed, within the population in comparison to strong failure spreading behaviors within technical product fleets (e.g. automobiles) in the use phase; e.g. cf. Sochacki and Bracke [15].

6.2 Infection (Confirmed Cases) After Lockdown

The lockdown measure causes a significant change regarding the spreading speed (behavior) in the analyzed countries. Figure 2 shows the Weibull distribution model fits based on the confirmed-case data after lockdown, considering a time span of approximately 28 days. The Weibull distribution model once again describes the distribution of the measurement data (confirmed cases) after the lockdown; it is not a prediction of the expectable future confirmed cases. The estimated Weibull parameters are shown in Table 3.


Fig. 2 Weibull distribution models, representing the infection development, confirmed cases, time span 28 days from the lockdown related to the analyzed countries (log-log-scale)

Table 3 Weibull model parameters since lockdown (confirmed cases, approx. 28 days) related to analyzed countries. Confidence level γ = 0.95

Country     | Cases   | T (d) | Shape b (confidence belt)
Germany     | 122,971 | 15    | 1.78 ≤ 1.79 ≤ 1.80
Italy       | 126,415 | 18    | 2.31 ≤ 2.32 ≤ 2.33
Japan       | 11,765  | 20    | 2.56 ≤ 2.60 ≤ 2.64
New Zealand | 1,173   | 10    | 1.46 ≤ 1.53 ≤ 1.60
France      | 121,605 | 20    | 2.47 ≤ 2.48 ≤ 2.50

First, the reduction of the shape parameter in the comparison before and after lockdown (cf. Tables 2 and 3) is clearly visible for all analyzed countries. The change is significant, as shown by the comparison of the confidence intervals of the shape parameters (before/after lockdown). The spreading speed (gradient) is significantly reduced; the biggest change is observed in Germany. The smallest spreading speed after lockdown can be seen in New Zealand. There, strict border closures resulted in a halt to the spreading of the virus; the gradient (shape parameter) is close to a random failure rate characteristic (shape parameter b ~ 1). Besides, the low population density (18.6/km²) can be a reason for this low spreading speed. Japan shows the highest shape parameter (spreading speed) after lockdown in comparison to the other countries, maybe explainable by the highest population density (334/km²). Despite these differences in population density (an uncertainty factor), the characteristics of the lockdown (soft, hard, etc.) in the different countries can be interpreted as the reason for the different Weibull distribution model fit results.


6.3 Second Wave Detection

To detect the beginning of the second wave, a trend test is done. Therefore, the daily confirmed cases since 1 July are analyzed. At this time, the spreading of the first wave had decreased and there were few new cases. A one-sided trend test (upward trend) according to Cox and Stuart [3] is performed with 14 data points and a significance level α of 5%, cf. Eq. 2. The tested hypotheses are as follows:

• Null hypothesis: There is no upward trend,
• Alternative hypothesis: There is an upward trend (second wave).

The sample size of 14 days is chosen to mitigate outliers and data falsifications like the weekend impact. If the p value of the test is smaller than α, the date of the first data point in the tested data set is detected as the beginning of the second wave. Otherwise, the sample is moved one day to the right and the test is performed again. The Cox and Stuart significance test was performed 71 times in this way on the example of Germany, cf. Fig. 3. The results are the p values, shown as black points plotted on the right ordinate. As a comparison, the daily confirmed cases are depicted on the left ordinate. The corresponding dates are assigned to these values. Additionally, the significance level is shown as a horizontal red line. All points placed under the red line represent those tests which result in an upward trend. According to the described approach, the first point under the red line represents the beginning of the second wave; here the p value is below the significance level, so the null hypothesis is rejected and the alternative hypothesis of an upward trend is assumed.
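The rolling detection procedure described above can be sketched as follows (not part of the original chapter); it slides a 14-day window over the daily confirmed cases and reports the first window with a significant uptrend, reusing the one-sided Cox-Stuart p value from Sect. 3.2 under the same sign-convention assumption. The data series and names are illustrative.

```python
from scipy.stats import binom

def cox_stuart_upward_p(series):
    """One-sided Cox-Stuart p value (assumption: differences first half minus second half)."""
    n_half = len(series) // 2
    diffs = [a - b for a, b in zip(series[:n_half], series[-n_half:]) if a != b]
    s_plus = sum(1 for d in diffs if d > 0)
    return binom.cdf(s_plus, len(diffs), 0.5)

def detect_second_wave(daily_cases, window=14, alpha=0.05):
    """Return the index of the first day of the first window with a significant uptrend."""
    for start in range(len(daily_cases) - window + 1):
        if cox_stuart_upward_p(daily_cases[start:start + window]) < alpha:
            return start          # detected beginning of the second wave
    return None                   # no upward trend detected

# Illustrative daily confirmed cases since 1 July (flat phase followed by an increase)
daily_cases = [5, 4, 6, 5, 5, 4, 6, 5, 7, 9, 12, 15, 19, 24, 30, 37, 45, 55, 66, 80]
print(detect_second_wave(daily_cases))  # start index of the first significant window
```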

Fig. 3 Second wave detection with Cox-Stuart trend test, p values (black points), alpha = 0.05 (red line), daily confirmed cases Germany since 01 July (blue). Data inventory 09/22/2020


Table 4 Detected beginning of the second wave per country, based on application of the Cox and Stuart significance test

Country     | Begin second wave
Germany     | 07/08/2020
Italy       | 07/13/2020
Japan       | 07/02/2020
New Zealand | 08/05/2020
France      | 07/10/2020

Based on this analysis procedure, the second wave in Germany started on 8th July 2020. In comparison with the daily confirmed cases, this date seems plausible as the beginning of a second wave; from then on an increase in the number of cases can be seen. The results of the trend tests for the other countries, and thus the respective beginnings of the second wave, are documented in Table 4. A Europe-wide second wave can be recognized; the beginnings of the second waves of Germany, Italy and France lie in a similar period. The second wave in Japan started earlier, while the New Zealand second wave began about one month later.

6.4 Infection (Confirmed Cases) Second Wave

The spreading of the early second wave is analyzed with a Weibull model fit. Therefore, a period of about 50 days after the beginning of the second wave is considered for the analyzed countries Germany, Italy, Japan, New Zealand and France. This time span allows a comparison with the shape parameter (spreading speed) of the first wave, the spreading until lockdown, due to the same number of analyzed days. Figure 4 shows the Weibull distribution models of the second wave. The estimated Weibull parameters are shown in Table 5.

France has the highest spreading speed, which corresponds to the high number of cases in a short time (hard second wave). The curve of New Zealand is the flattest and its shape parameter is the smallest, so the second wave is not very pronounced there, which is consistent with the low number of cases. When comparing the shape parameters with those of the first wave, it is noticeable that the early second wave shows a spreading speed on a lower (moderate) level; rather, the propagation speed is in the range of the course after the lockdown. By comparing the confidence intervals of the gradient (shape parameter b), it becomes clear that the spreading speed of the early second wave is nevertheless significantly higher than that of the time after the lockdown.


Fig. 4 Weibull distribution models, representing the infection development, confirmed cases, time span 50 days from the second wave related to the analyzed countries (log-log-scale)

Table 5 Weibull model parameters since the beginning of the second wave (confirmed cases, 50 days) related to analyzed countries. Confidence level γ = 0.95

Country     | Cases   | T (d) | Shape b (confidence belt)
Germany     | 44,492  | 37    | 2.58 ≤ 2.60 ≤ 2.61
Italy       | 29,851  | 41    | 2.99 ≤ 3.01 ≤ 3.04
Japan       | 43,820  | 37    | 2.86 ≤ 2.88 ≤ 2.90
New Zealand | 264     | 27    | 1.85 ≤ 2.05 ≤ 2.25
France      | 105,860 | 42    | 3.52 ≤ 3.54 ≤ 3.56

In summary, the spreading speed of the early second wave is lower than that of the first wave and slightly higher than that of the time after the lockdown. It has to be considered that the data inventory of this paper is 09/22/2020, so further developments of the second wave are not analyzed.

7 Comparison of the COVID-19 Spreading Behavior with Other Infectious Diseases

To see the COVID-19 spreading in a greater context, it is compared with the spreading of influenza and measles in Germany. The seasons 2014/15 until 2016/17 are chosen as comparable seasons with average case numbers and course. In a first step, the first 56 days since the first case are analyzed for each season. Then a 3-year average is estimated and the spreading is compared with that of COVID-19 in Germany before and after lockdown.


7.1 Influenza and Measles

To gain knowledge regarding the spreading speed of typical infectious diseases, Weibull distribution models are fitted to three seasons each of influenza and measles. To create comparability with the COVID-19 spreading, the first 56 days (8 weeks) are analyzed. Since the data from RKI [11] is given on a weekly basis, the analysis of the ranked data is based on a transformation (1 week corresponds to 7 days), as sketched below. Figure 5 shows the resulting Weibull distribution models; the Weibull parameters are listed in Table 6. There are more influenza cases than measles cases, and the parameters vary between the seasons and the diseases. The spreading speed (shape parameter) of measles is slightly slower than that of influenza. Since the confidence intervals of the two diseases overlap, this difference is not significant.
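One possible implementation of this week-to-day transformation (not from the original chapter, so the exact mapping is an assumption) is to assign all cases reported in calendar week w to day 7·w of the season and then fit the Weibull model to the resulting ranked day values, analogous to the daily COVID-19 data:

```python
import numpy as np
from scipy.stats import weibull_min

# Illustrative weekly reported cases for one season (weeks 1..8 since the index case)
weekly_cases = [3, 6, 10, 16, 24, 30, 22, 14]

# Transformation: 1 week corresponds to 7 days; all cases of week w are placed on day 7*w
days = np.repeat(7 * np.arange(1, len(weekly_cases) + 1), weekly_cases)

# Weibull MLE fit (location fixed to 0) on the first 56 days of the season
b, loc, T = weibull_min.fit(days, floc=0)
print(f"shape b = {b:.2f}, scale T = {T:.1f} days")
```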

Fig. 5 Weibull distribution models, representing the infection development, influenza and measles, time span 56 days from the index case related to the analyzed seasons (log-log-scale)

Table 6 Weibull model parameters since index case (56 days) related to analyzed diseases and seasons. Confidence level γ = 0.95

Disease   | Season | Cases | T (d) | Shape b (confidence belt)
Influenza | 14/15  | 116   | 43    | 2.52 ≤ 2.93 ≤ 3.39
Influenza | 15/16  | 266   | 47    | 3.18 ≤ 3.52 ≤ 3.87
Influenza | 16/17  | 510   | 45    | 2.59 ≤ 2.78 ≤ 2.98
Measles   | 14/15  | 72    | 38    | 1.91 ≤ 2.32 ≤ 2.78
Measles   | 15/16  | 22    | 30    | 1.38 ≤ 2.01 ≤ 2.77
Measles   | 16/17  | 128   | 40    | 2.48 ≤ 2.87 ≤ 3.29


In addition, the spreading speeds within each disease differ only slightly and not significantly between the seasons. Therefore, summarizing the seasons into an average (3-year average), as done in the next step, is legitimate.

7.2 Comparison: COVID-19 Versus Influenza Versus Measles

On the basis of the known infection cases, a 3-year average is developed for influenza and measles. Weibull distribution models are fitted for these averages and for the COVID-19 spreading before and after lockdown on the example of Germany. The comparison of the COVID-19/influenza/measles models is shown in Fig. 6 and the corresponding parameters are documented in Table 7. It is clearly visible that the COVID-19 spreading differs from the spreading of other infectious diseases like influenza and measles. The spreading speed (shape parameter) of COVID-19 before lockdown is significantly larger (influenza: factor ~6.6; measles: factor ~6.3) than that of the 3-year averages of influenza and measles.

Fig. 6 Weibull distribution models, representing the infection development, comparison of COVID-19 before and after lockdown with 3-year average of measles and influenza 2013/14–2015/16, time span about 50 days (log-log-scale). Database: Germany

Table 7 Weibull model parameters since index case (about 50 days) and since lockdown (LD; 28 days) related to analyzed diseases. Database: Germany. Confidence level γ = 0.95

Disease                  | Cases   | T (d) | Shape b (confidence belt)
COVID-19 before LD       | 22,213  | 53    | 19.61 ≤ 19.82 ≤ 20.03
COVID-19 after LD        | 122,971 | 15    | 1.78 ≤ 1.79 ≤ 1.80
Influenza 3-year average | 296     | 45    | 2.73 ≤ 3.00 ≤ 3.28
Measles 3-year average   | 75      | 43    | 2.58 ≤ 3.13 ≤ 3.74


This shows the extreme progress of the COVID-19 pandemic in comparison to influenza and measles; the spreading behavior of COVID-19 is not comparable with "normal" infectious diseases. In the 3-year average, no significant difference can be seen between the spreading speeds of measles and influenza. Obviously, only the hard measures of the lockdown (e.g. restrictions on contact, distance regulations) lead to the level of a "normal" infectious disease (influenza and measles) spreading behavior. Without any measures, the spreading speed (gradient) of COVID-19 is higher by a factor of ~6.6 compared to influenza and by a factor of ~6.3 compared to measles.

Note: The spreading of influenza and measles is also affected by basic immunity caused by available vaccines, which do not exist for the COVID-19 disease.

8 Summary

The application of statistical methods of reliability engineering enabled a detailed analysis of the occurrence of infection. Using Weibull distribution models, the spreading behavior can be evaluated. The main part of the data analytics was the application of a model normally used for damage cases within reliability engineering analytics to describe the cases of a pandemic. The shape parameter b, as the gradient of the Weibull distribution model (log-log-scale), is interpreted as the spreading speed.

When analyzing data from different countries, uncertainty factors and data quality must be taken into account: differences between the countries like test strategies or reporting systems, definitions of cases or external influences like seasonality affect the database. This was handled by analyzing ranked data and countries with similar industrialization standards. By comparing the shape parameters and their confidence intervals from the time before and after the lockdown, it became clear that the lockdown measures could significantly reduce the COVID-19 spreading speed in all analyzed countries.

The Cox and Stuart trend test detected the beginning of the second wave, depending on the analyzed country, in July or August 2020. Thereby a Europe-wide second wave could be determined; the detected starting points lie within a short time range. In terms of spreading speed, it could be noticed that the early second wave is much more moderate than the first wave. In comparison to the time after the lockdown, the spreading speed is higher. With the data inventory of 09/22/2020, further developments were not analyzed and the impact of the autumn respectively winter season is not considered. Considering this seasonal effect, a mixed distribution—with different increasing gradients—can be expected. Further research studies will focus on a long-term comparison of the first and second COVID-19 waves.

To see the COVID-19 spreading in a greater context, it was compared with the spreading of influenza and measles on the example of Germany.


Thereby, the Weibull distribution model was fitted to the beginning of the season to enable a comparison with the COVID-19 spreading. It became clearly visible that the COVID-19 spreading differs from the spreading of other infectious diseases. The spreading speed (shape parameter) of COVID-19 before lockdown is significantly higher, by a factor of ~6.6 in comparison to the 3-year average of influenza and by a factor of ~6.3 in comparison to measles. This shows the extreme progress—the spreading speed—of COVID-19; it is not comparable with "normal" infectious diseases (influenza/measles). Only the COVID-19 time period under the strong impact of the lockdown measures shows a spreading speed on the level of influenza or measles.

References

1. Bracke, S., Puls, A., Grams, L.: COVID-19 pandemic data analytics: data heterogeneity, spreading behavior, and lockdown impact. In: Proceedings of the 30th European Safety and Reliability Conference and the 15th Probabilistic Safety Assessment and Management Conference. Research Publishing (2020)
2. Birolini, A.: Reliability Engineering. Theory and Practice. Springer, Berlin (2017)
3. Cox, D.R., Stuart, A.: Some quick sign tests for trend in location and dispersion. Biometrika 42(1–2), 80–95 (1955)
4. D'Arienzo, M., Coniglio, A.: Assessment of the SARS-CoV-2 basic reproduction number, R0, based on the early phase of the COVID-19 outbreak in Italy. Biosaf. Health 2(2), 57–59 (2020)
5. Dimmock, N.J., Easton, A.J., Leppard, K.N.: Introduction to Modern Virology. Wiley, New York (2016)
6. Fisher, R.A.: On an absolute criterion for fitting frequency curves. Messenger Math. 41, 155–160 (1912)
7. Gattinoni, L., Chiumello, D., Caironi, P., et al.: COVID-19 pneumonia: different respiratory treatments for different phenotypes? Intensiv. Care Med. 46, 1099–1102 (2020)
8. JHU: COVID-19 dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). Johns Hopkins University, Coronavirus Resource Center (2020)
9. Kermack, W.O., McKendrick, A.G.: A contribution to the mathematical theory of epidemics. Proc. R. Soc. A 115, 700–721 (1927)
10. Papula, L.: Mathematische Formelsammlung. Für Ingenieure und Naturwissenschaftler. 12. Auflage. Springer Vieweg (2017)
11. RKI: SurvStat@RKI 2.0, Robert Koch Institut (2020). https://survstat.rki.de
12. RKI: COVID-19 (Coronavirus SARS-CoV-2), September 2020. Robert Koch Institut (2020)
13. Rinne, H.: The Weibull Distribution: A Handbook. CRC Press, Taylor & Francis Group (2008)
14. Sajadi, M.M., Habibzadeh, P., Vintzileos, A., et al.: Temperature, humidity, and latitude analysis to estimate potential spread and seasonality of coronavirus disease 2019 (COVID-19). JAMA Netw. Open 3(6), 1–11 (2020)
15. Sochacki, S., Bracke, S.: The comparison of the estimation and prognosis of failure behavior in product fleets by the RAPP method with state-of-the-art risk prognosis models within the usage phase. In: Cepin, M., Bris, R. (eds.) Safety and Reliability. Theory and Applications, pp. 3481–3490. CRC Press (2017)
16. Weibull, W.: A statistical distribution function of wide applicability. ASME J. Appl. Mech. 18(3), 293–297 (1951)