Knowledge-Based Software Engineering: 2022: Proceedings of the 14th International Joint Conference on Knowledge-Based Software Engineering 3031175824, 9783031175824

This book contains extended versions of  the works and new research results presented at the 14th International Joint Co


English Pages 218 [219] Year 2023


Table of contents :
Preface
Artificial Intelligence as Dual Use Technology
Attacks in Android Mobile User Interfaces Exposing User Privacy and Personal Data
Contents
Part I Software Development Techniques and Tools
1 Proposal of a Middleware to Support Development of IoT Firmware Analysis Tools
1.1 Introduction
1.1.1 Background
1.1.2 Contribution
1.2 Characterization of Firmware Vulnerability Detection Methods
1.2.1 Categorization of Features
1.2.2 Research on Firmware Vulnerability Analysis
1.2.3 Result
1.3 Our Approach
1.3.1 Firmware Splitting
1.3.2 Static Strings
1.3.3 Control Flow Graph
1.3.4 Identifying Network Functions
1.4 Conclusions and Future Work
References
2 Feature-Based Cloud Provisioning for Rehosting
2.1 Introduction
2.2 Motivating Examples
2.2.1 Trial-and-Error Searches for Optimal Cloud Service Configuration in Design Process
2.2.2 Manual Provisioning with Console Used by Engineers in Construction Process
2.3 Feature-Based Cloud Provisioning Method for Rehosting
2.3.1 Overview of Proposed Method
2.3.2 Cloud Feature Model
2.3.3 Cloud Provisioning Tool
2.4 Evaluation
2.4.1 Evaluation Method
2.4.2 Results
2.5 Discussion
2.5.1 Effects of Application to Design Process
2.5.2 Effects of Application to Construction Process
2.6 Related Work
2.7 Conclusion
References
3 Pattern to Improve Reusability of Numerical Simulation
3.1 Introduction
3.2 Background
3.2.1 Modelica Language
3.2.2 Adopting Design Patterns to Modelica
3.3 Patterns for Physical Modeling and Their Adaptation
3.4 Case Study
3.4.1 Moving Ball
3.4.2 SIR Model
3.5 Conclusions
References
4 SpiderTailed: A Tool for Detecting Presentation Failures Using Screenshots and DOM Extraction
4.1 Introduction
4.2 Proposed Method
4.2.1 Take Screenshots
4.2.2 Extract Visual Properties
4.2.3 Comparison of Visual Properties
4.2.4 Output of Comparison Results
4.3 Implementation
4.3.1 Take Screenshots Function
4.3.2 Extract Visual Properties Function
4.3.3 Comparison of Visual Properties Function
4.3.4 Comparison Results Output Function
4.4 Evaluation
4.4.1 Preparing Web Page for Evaluation
4.4.2 Establishment of Evaluation Criteria
4.4.3 Experiment 1: Evaluation of SpiderTailed
4.4.4 Experiment 2: Evaluation of Manual Observation
4.5 Results and Discussion
4.5.1 RQ1: What Are the Strengths of SpiderTailed Compared to the Manual Observation
4.5.2 RQ2: Comparison of the Time Consumption
4.5.3 RQ3: Challenges of the SpiderTailed Method
4.6 Related Work
4.7 Limitations and Validity
4.7.1 Limitations
4.7.2 Validity
4.8 Conclusion
References
Part II AI/ML-Based Software Development
5 Collecting Insights and Developing Patterns for Machine Learning Projects Based on Project Practices
5.1 Introduction
5.2 Related Work
5.3 Research Subject and Hypothesis
5.3.1 ML-based Service System
5.3.2 Architecture Design Pattern for ML Service Systems
5.3.3 Research Hypothesis
5.4 Proposed Method
5.4.1 Overview
5.4.2 Reference Development Model and Collection of Insights
5.4.3 Construction of Patterns from Collected Insights
5.5 Practice
5.6 Discussion
5.7 Conclusions
References
6 Supporting Code Review by a Neural Network Using Program Images
6.1 Introduction
6.2 Related Work
6.3 CNN-BI System
6.3.1 Training Method and the Training Data
6.3.2 Preparing the Learning Data
6.3.3 Check List
6.4 Experimental Study
6.4.1 Overview
6.4.2 Applying Supervised Learning
6.4.3 Visualization of the Training
6.4.4 Verification of the Categorization
6.4.5 Types of Defects Inferred
6.4.6 Review Process
6.4.7 Result
6.5 Discussion
6.5.1 The Answer to RQ1
6.5.2 The Answer to RQ2
6.5.3 Internal Validity
6.5.4 External Validity
6.6 Conclusion
References
7 Safety and Risk Analysis and Evaluation Methods for DNN Systems in Automated Driving
7.1 Introduction
7.2 Related Work
7.2.1 Machine Learning Systems Engineering and Safety
7.2.2 Safety Guidelines and Research Trends for Automated Driving
7.2.3 Model and STAMP and Related Methods
7.3 Safety Challenges for Machine Learning Systems
7.3.1 Safety Challenges of Automated Driving
7.3.2 The “Question” that Forms the Core of the Research
7.3.3 Research Goals
7.4 Proposal of Safety and Risk Analysis and Evaluation Methods for DNN Systems
7.4.1 Step 1: System-level Safety Analysis
7.4.2 Step 2: Scenario and Training Data Generation for High-Risk Scenes
7.4.3 Step 3: DNN Design Modeling and Problem Analysis
7.4.4 Step 4: Design Labels with Safety in Mind
7.4.5 Step 5: Model Evaluation by Risk
7.4.6 Step 6: Setting Evaluation Criteria
7.4.7 Step 7: Model Improvement Through Debugging and Modification Techniques
7.5 Safety Arguments and Case Studies
7.5.1 Safety Arguments for DNN Safety Analysis and Assessment Methodology
7.5.2 Embodiment of Steps 1–3
7.6 Conclusion
References
8 Regulation and Validation Challenges in Artificial Intelligence-Empowered Healthcare Applications—The Case of Blood-Retrieved Biomarkers
8.1 Introduction
8.2 Key Issues and Challenges
8.2.1 Biomarkers
8.2.2 Automating Interventions and Patient's Journey
8.2.3 The Importance of Regulation
8.2.4 Validation of Neural Networks in Health Applications and Continuous Integration
8.2.5 The Role of Machine Learning in Health Care
8.3 Related Work
8.4 Building a Blood Exam-Based Personalised Recommender System
8.4.1 Development Methodologies
8.5 Conclusion and Research Key Findings
References
Part III Educational and Assistive Software
9 Multi-agent Simulation for Risk Prediction in Student Projects with Real Clients
9.1 Introduction
9.2 Software Development Project Course with Real Clients
9.3 Multi-agent Model of Student Projects
9.3.1 NetLogo
9.3.2 Student Project Model
9.3.3 Dependencies Among Tasks in a Project
9.3.4 Task Allocation to Project Members
9.3.5 Members’ Skill and Performance
9.3.6 Risk Prediction by Simulating in Our Model
9.4 Simulation Results
9.5 Questionnaire Survey
9.6 Related Work
9.7 Conclusion and Future Works
References
10 Automatic Scoring in Programming Examinations for Beginners
10.1 Introduction
10.2 Preliminaries
10.2.1 Presburger Arithmetic
10.2.2 Notation
10.3 Proposed Methods
10.3.1 Programming Language
10.3.2 Program Verification
10.3.3 Automatic Scoring
10.3.4 Examination System
10.4 Experiments
10.4.1 Count
10.4.2 Bubble Sort
10.4.3 Binary Search
10.5 Discussion
10.6 Conclusion
References
11 A Study on Analyzing Learner Behaviors in State Machine Modeling Using Process Mining and Statistical Test
11.1 Introduction
11.2 Theoretical Background
11.2.1 State Machine Model
11.2.2 Model Log
11.2.3 Event Log
11.2.4 Process Mining
11.2.5 Identification of Different Activity and Transition
11.3 Method
11.3.1 Activity and Transition Extraction
11.3.2 Difference Identification
11.4 Experiment
11.4.1 Modeling Task and Answers
11.4.2 Result
11.4.3 Consideration
11.5 Related Work
11.6 Conclusion
References
12 Supporting Conveyance of Webpages by Highlighting Text for Visually Impaired Persons
12.1 Introduction
12.2 Related Work
12.3 Emphasis Expressions in this Research
12.4 Outline of Our Method
12.4.1 Approach of Our Method
12.4.2 Structure of Our Method
12.5 Deciding the Reading Voice for Emphasized Expressions
12.5.1 Weighting of Text
12.5.2 Weighting of Voice
12.5.3 Determination of the Reading Method
12.6 Evaluation
12.6.1 Experimental Design
12.6.2 Results
12.6.3 Discussion
12.7 Conclusion
References
Part IV Requirements Analysis and Software Modeling
13 Comparative Study on Functional Resonance Matrices
13.1 Introduction
13.2 Related Work
13.2.1 FRAM
13.2.2 FRAM Matrix Representation
13.2.3 Matrix Representation
13.3 Functional Aspect Resonance Matrix
13.4 Comparative Study on Matrix Representations
13.5 Discussion
13.5.1 Novelty
13.5.2 Effectiveness
13.5.3 Computational Cost
13.5.4 Limitations
13.6 Summary
References
14 A Method for Matching Patterns Based on Event Semantics with Requirements
14.1 Introduction
14.2 Related Work
14.3 Proposed Method
14.3.1 Characteristic Semantic Representation
14.4 Experimental Pattern Matching
14.5 Conclusions
References
15 Digital SDGs Framework Towards Knowledge Integration
15.1 Introduction
15.2 Related Work
15.2.1 SDGs
15.2.2 DX
15.2.3 Knowledge Integration
15.3 Issues
15.4 DSDG Framework
15.4.1 Classification of SDGs in Enterprises
15.4.2 DSDG Strategy Map
15.4.3 DSDG Framework
15.4.4 SDGsVCM
15.5 Case Study
15.5.1 DSDG Strategy Map
15.5.2 DSDG Framework
15.5.3 SDGsVCM
15.6 Discussion
15.7 Summary
References
16 Hierarchical User Review Clustering Based on Multiple Sub-goal Generation
16.1 Introduction
16.2 Relevant Work
16.3 The Existing Clustering Method
16.3.1 Ward Method
16.4 Comprehensive Clustering Method
16.4.1 LDA Topic Model
16.4.2 The Distance-Based Clustering Algorithm
16.5 Experiment and Evaluation
16.5.1 Purpose of Experiments
16.5.2 Experiment and Discussion
16.6 Conclusion
References

Learning and Analytics in Intelligent Systems 30

Maria Virvou Takuya Saruwatari Lakhmi C. Jain   Editors

Knowledge-Based Software Engineering: 2022 Proceedings of the 14th International Joint Conference on Knowledge-Based Software Engineering (JCKBSE 2022), Larnaca, Cyprus, August 22–24, 2022

Learning and Analytics in Intelligent Systems Volume 30

Series Editors George A. Tsihrintzis, University of Piraeus, Piraeus, Greece Maria Virvou, University of Piraeus, Piraeus, Greece Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK

The main aim of the series is to publish books, in both hard copy and soft copy form, on all aspects of learning, analytics and advanced intelligent systems and related technologies. These disciplines are strongly related and complement one another significantly. Thus, the series encourages cross-fertilization by highlighting research and knowledge of common interest. The series allows a unified/integrated approach to themes and topics in these scientific disciplines, which will result in significant cross-fertilization and research dissemination. To maximize dissemination of research results and knowledge in these disciplines, the series publishes edited books, monographs, handbooks, textbooks and conference proceedings. Indexed by EI Compendex.

Editors Maria Virvou Department of Informatics University of Piraeus Piraeus, Greece

Takuya Saruwatari NTT Data Corporation Tokyo, Japan

Lakhmi C. Jain KES International Shoreham-by-Sea, UK

ISSN 2662-3447 ISSN 2662-3455 (electronic) Learning and Analytics in Intelligent Systems ISBN 978-3-031-17582-4 ISBN 978-3-031-17583-1 (eBook) https://doi.org/10.1007/978-3-031-17583-1 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This book contains extended versions of the works and new research results presented at the 14th International Joint Conference on Knowledge-based Software Engineering (JCKBSE2022). JCKBSE2022 was originally planned to take place in Larnaca, Cyprus. Unfortunately, the COVID-19 pandemic forced it to be rescheduled as an online conference.

JCKBSE is a well-established international biennial conference that focuses on the applications of artificial intelligence to software engineering. The 14th International Joint Conference on Knowledge-based Software Engineering (JCKBSE2022) was organized by the Department of Informatics of the University of Piraeus, Greece. As in previous years, the majority of submissions originated from Japan, while Greece was second. Each submitted paper was rigorously reviewed by at least two independent reviewers. Finally, 16 papers were accepted for presentation at JCKBSE2022 and inclusion in its Proceedings.

The papers accepted for presentation at JCKBSE2022 address topics such as the following:

• Architecture of knowledge-based systems, intelligent agents and softbots
• Architectures for knowledge-based shells
• Automating software design and synthesis
• Decision support methods for software engineering
• Development of multi-modal interfaces
• Development of user models
• Development processes for knowledge-based applications
• Empirical/evaluation studies for knowledge-based applications
• Intelligent user interfaces and human–machine interaction
• Internet-based interactive applications
• Knowledge acquisition
• Knowledge engineering for process management and project management
• Knowledge management for business processes, workflows and enterprise modeling
• Knowledge technologies for semantic web
• Knowledge technologies for service-oriented systems, Internet of services and Internet of Things
• Knowledge technologies for web services
• Knowledge-based methods and tools for software engineering education
• Knowledge-based methods and tools for testing, verification and validation, maintenance and evolution
• Knowledge-based methods for software metrics
• Knowledge-based requirements engineering, domain analysis and modeling
• Methodology and tools for knowledge discovery and data mining
• Ontologies and patterns in UML modeling
• Ontology engineering
• Program understanding, programming knowledge, modeling programs and programmers
• Software engineering methods for Intelligent Tutoring Systems
• Software life cycle of intelligent interactive systems
• Software tools assisting the development

In addition to technical paper presenters, in JCKBSE2022 we had the following distinguished researchers as keynote speakers:

1. Dr. Haruki Ueno, National Institute of Informatics, Japan
2. Dr. Efthimios Alepis, University of Piraeus, Greece

First and foremost, we would like to thank Prof. Dr. Shuichiro Yamamoto, International Professional University of Technology in Nagoya, Japan and Prof. Dr. Nikolaos G. Bourbakis, Wright State University, Ohio, USA, for acting as Honorary Chairs of JCKBSE2022. We also would like to thank the authors for choosing JCKBSE2022 as the forum for presenting the results of their research. Additionally, we would like to thank the reviewers for taking the time to review the submitted papers rigorously. For putting together the Web site of JCKBSE2022 and for managing the conference administration system and coordinating JCKBSE2022, we would like to thank Easy Conferences Ltd., Nicosia, Cyprus. Finally, we would like to thank Springer personnel for their wonderful job in producing the JCKBSE2022 proceedings, which are available at https://link.springer.com/.

George A. Tsihrintzis, Piraeus, Greece
Takuya Saruwatari, Tokyo, Japan
Maria Virvou, Piraeus, Greece
The JCKBSE2022 General Chairs

Artificial Intelligence as Dual Use Technology

Haruki Ueno National Institute of Informatics, Japan

Abstract: This keynote speech discusses artificial intelligence (AI) from the point of view of Dual Use Technology (DUT). DUT is any technology with the potential for both peaceful and military applications, and it plays an increasing role for nations globally. AI is a key technology in DUT. This keynote speech presents the roots, history and current trends of AI and DUT and their interrelationship. The DUT dilemma is also discussed in relation to national security. The role of government has increased, so nation-level policy, long-range goals, and financial support to academia, the military and industry should be decided through democratic procedures. In addition, global trends are discussed by reviewing some typical countries.

Short CV: Dr. Haruki Ueno is Professor Emeritus of The National Institute of Informatics, and The Graduate University for Advanced Study (Sokendai), Japan. He received a B.E.E. degree from the National Defense Academy and a Ph.D. degree in Electrical Engineering from Tokyo Denki University. His career includes the positions of (i) Research Associate of the Institute of Medical Informatics, Missouri University, (ii) Professor in the Department of Information Science, Graduate School of Tokyo University, and (iii) Professor in the Department of Informatics, National Institute of Informatics and Sokendai. He has authored/co-authored more than 300 scientific publications which have appeared in international journals, international conferences and national journals, books and book chapters. He proposed the concept and technology of the object model for advanced knowledge systems and, using the proposed model, developed knowledge-based systems for program understanding, symbiotic robots, digital system troubleshooting (with IBM), and other applications. He is the founder of the JCKBSE series.

Attacks in Android Mobile User Interfaces Exposing User Privacy and Personal Data

Efthimios Alepis University of Piraeus, Greece

Abstract: This keynote speech presents research on problems that appear in the user interfaces (UI) of recent mobile operating systems (OS). Mobile devices are highly dependent on the design of user interfaces, since their size and computational cost introduce considerable constraints. Therefore, both user experience (UX) and user interfaces are considered top priorities among major mobile OS platforms. This keynote speech highlights pitfalls in the design of the Android UI, which can greatly expose users and break user trust in the UI, by demonstrating how deceptive it can be in a number of UI attacks.

Short CV: Dr. Efthimios Alepis is Associate Professor in the Department of Informatics, University of Piraeus. He has authored/co-authored more than 180 scientific papers, which have been published in international journals, book chapters and international conferences. He has received several awards from Google and Microsoft for his research in the security of their operating systems. He is the founder of the Greek company “Software Engineering Innovation Group—SEIG”, the activities of which include, among others, the production of innovative software and IT services. His current research interests are in the areas of object-oriented programming, mobile software engineering, human–computer interaction, affective computing and security in mobile applications.

Contents

Part I Software Development Techniques and Tools

1 Proposal of a Middleware to Support Development of IoT Firmware Analysis Tools
Minami Yoda, Shuji Sakuraba, Yuichi Sei, Yasuyuki Tahara, and Akihiko Ohsuga

2 Feature-Based Cloud Provisioning for Rehosting
Misaki Mito, Takatoshi Ohara, Ryo Shimizu, Hideyuki Kanuka, and Minoru Tomisaka

3 Pattern to Improve Reusability of Numerical Simulation
Junichi Ichimurar and Takako Nakatani

4 SpiderTailed: A Tool for Detecting Presentation Failures Using Screenshots and DOM Extraction
Takato Okajima, Takafumi Tanaka, Atsuo Hazeyama, and Hiroaki Hashiura

Part II AI/ML-Based Software Development

5 Collecting Insights and Developing Patterns for Machine Learning Projects Based on Project Practices
Hironori Takeuchi, Kota Imazaki, Noriyoshi Kuno, Takuo Doi, and Yosuke Motohashi

6 Supporting Code Review by a Neural Network Using Program Images
Kazuhiko Ogawa and Takako Nakatani

7 Safety and Risk Analysis and Evaluation Methods for DNN Systems in Automated Driving
Tomoko Kaneko, Yuji Takahashi, Shinichi Yamaguchi, Jyunji Hashimoto, and Nobukazu Yoshioka

8 Regulation and Validation Challenges in Artificial Intelligence-Empowered Healthcare Applications—The Case of Blood-Retrieved Biomarkers
Dimitrios P. Panagoulias, Maria Virvou, and George A. Tsihrintzis

Part III Educational and Assistive Software

9 Multi-agent Simulation for Risk Prediction in Student Projects with Real Clients
Fumihiro Kumeno

10 Automatic Scoring in Programming Examinations for Beginners
Yoshinori Tanabe and Masami Hagiya

11 A Study on Analyzing Learner Behaviors in State Machine Modeling Using Process Mining and Statistical Test
Shinpei Ogata, Hiroyuki Nakagawa, Haruhiko Kaiya, and Hironori Takeuchi

12 Supporting Conveyance of Webpages by Highlighting Text for Visually Impaired Persons
Daisuke Sayama, Junko Shirogane, Hajime Iwata, and Yoshiaki Fukazawa

Part IV Requirements Analysis and Software Modeling

13 Comparative Study on Functional Resonance Matrices
Shuichiro Yamamoto

14 A Method for Matching Patterns Based on Event Semantics with Requirements
Maiko Onishi, Shinpei Ogata, Kozo Okano, and Daisuke Bekki

15 Digital SDGs Framework Towards Knowledge Integration
Shuichiro Yamamoto

16 Hierarchical User Review Clustering Based on Multiple Sub-goal Generation
Shuaicai Ren, Hiroyuki Nakagawa, and Tatsuhiro Tsuchiya

Part I

Software Development Techniques and Tools

Chapter 1

Proposal of a Middleware to Support Development of IoT Firmware Analysis Tools

Minami Yoda, Shuji Sakuraba, Yuichi Sei, Yasuyuki Tahara, and Akihiko Ohsuga

Abstract Although Internet of Things (IoT) devices have made our lives increasingly convenient, there have been many reports of vulnerabilities, such as information leakage and use of these devices as springboards for attacks. To detect vulnerabilities in IoT devices, researchers have proposed many vulnerability detection methods based on firmware analysis. The general flow of vulnerability detection is to load firmware into an analysis program called an SRE (software reverse engineering) tool and then apply each study's own algorithm to find vulnerabilities. However, firmware analysis requires complex analysis preparation before such algorithms can be applied, and the development cost of this preparation is burdensome for researchers. In this study, we surveyed research on firmware vulnerability detection. We chose ten studies on vulnerability detection. The results of our survey showed that the commonly used functions for firmware analysis are firmware splitting, static strings, graphs, and network functions. Moreover, although many studies use similar functions, each study implemented them independently. Based on the survey results, we propose a middleware that standardizes analysis preparation. By using our middleware, researchers do not need to develop the basic functions for finding vulnerabilities. This will reduce the development cost of preparing for analysis and allow researchers to spend more time developing vulnerability detection algorithms, which is the original purpose of their work.

Keywords IoT · Firmware analysis · Static analysis · Backdoor

M. Yoda (B) · Y. Tahara · A. Ohsuga: Graduate School of Informatics and Engineering, The University of Electro-Communications, Chofu, Tokyo, Japan
S. Sakuraba: Graduate School of Information Systems, The University of Electro-Communications, Chofu, Tokyo, Japan
Y. Sei: JST PRESTO, Chiyoda, Tokyo, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Virvou et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_1

1.1 Introduction

1.1.1 Background

Products containing firmware, such as IoT devices, are becoming more prevalent in every household and throughout society. Although this trend has made our lives more convenient, there has been no end to the attacks on, and reports of vulnerabilities in, the firmware of IoT devices. As an example, security researchers Sasnauskas et al. discovered a vulnerability in Jetstream and Wavlink routers that allows privileged commands to be executed remotely on IoT devices, which could be used as a stepping-stone for malware such as Mirai.¹ These routers made headlines because they are sold through a wide range of e-commerce outlets, including Amazon and eBay, and are likely to be widely used among average homeowners. Other reports have indicated that IDs and passwords are embedded in the firmware of Zyxel's routers,² which have been registered as CVEs. These identifiers are maintained by MITRE, Inc.³ IDs and passwords written directly into a device's firmware cannot be changed by the device's user and can be removed or changed only by vendor updates. Therefore, if an IoT device's ID and password are leaked to a third party, the device may be hacked or manipulated, which is dangerous. Moreover, IDs and passwords embedded in firmware represent the most important vulnerability of which users must be aware, ranking first in OWASP's ranking of vulnerabilities in IoT devices.⁴

Cybersecurity specialists have proposed many methods to detect vulnerabilities in IoT devices using firmware analysis. Because firmware is distributed in binary form, it is difficult to analyze as it is. Therefore, firmware is generally analyzed using a method called software reverse engineering (SRE). SRE tools disassemble and analyze firmware, and this approach has been used to find vulnerabilities in many firmware images.

¹ CyberNews. Walmart-exclusive router and others sold on Amazon & eBay contain hidden backdoors to control devices, 2020. https://cybernews.com/security/walmart-exclusive-routers-othersmade-in-china-contain-backdoors-to-control-devices/.
² Undocumented user account in Zyxel products (CVE-2020-29583)—EYE, 2021. https://www.eyecontrol.nl/blog/undocumented-user-account-in-zyxel-products.html.
³ Common Weakness Enumeration (CWE™). https://cwe.mitre.org/.
⁴ OWASP. Owasp-iot-top-10-2018, 2018. https://owasp.org/www-pdf-archive/OWASP-IoT-Top10-2018-final.pdf.

However, most of the methods in papers on vulnerability detection using SRE tools spend more time on understanding the firmware's functionality than on developing algorithms for vulnerability detection. Prominent SRE tools such as IDA Pro and Ghidra are not designed to estimate the role of each firmware function for analysis. Although SRE tools can estimate functions to some extent using symbol tables, this symbol analysis capability has limits, and some firmware images contain many symbols whose roles cannot be guessed. Therefore, each researcher must go through a highly complicated preparatory process before applying their own detection algorithm to the firmware, and the development cost of this process is burdensome for researchers.

1.1.2 Contribution

This study surveyed research on firmware vulnerability detection. We chose ten studies on vulnerability detection. The survey showed that the functions most commonly used for firmware analysis are firmware splitting, static strings, graphs, and network functions. Based on the survey results, we propose a middleware that provides these common features and takes over the complex preparation process for vulnerability analysis. The survey also showed that many studies rely on the same features for analysis, yet each study implements them on its own. Thus, by using our middleware, researchers will be able to lower the cost of analysis preparation and spend more time developing vulnerability detection algorithms, which is the primary purpose of their work. Some tools from existing studies are available, but they are difficult to use, which keeps the barrier to firmware analysis high. Our middleware is designed to be user-friendly so that researchers can use it more easily.

In Sect. 1.2, we present a survey of research on finding firmware vulnerabilities to determine which functions are commonly used in vulnerability analysis. We use the results to design our method for standardizing the steps of firmware analysis. In Sect. 1.3, we present our middleware, which is based on the results of Sect. 1.2 and comprises four methods. In Sect. 1.4, we discuss our conclusions and directions for future work.

1.2 Characterization of Firmware Vulnerability Detection Methods

We surveyed ten studies on firmware vulnerability analysis to investigate common features in firmware vulnerability analysis. We included only papers that revealed previously unknown vulnerabilities or that were referenced in many other papers.

1.2.1 Categorization of Features

To summarize the survey, we categorized the surveyed technologies by the features they use to find vulnerabilities. For example, if Research A uses firmware splitting in its algorithm to find vulnerabilities, we check ‘Firmware splitting’ for Research A. In this way, we recorded every feature that the researchers use to find vulnerabilities. From the surveyed studies, we derived eight feature categories: (1) splitting firmware, (2) finding static strings, (3) finding memory values, (4) using symbolic execution, (5) using an emulator, (6) using a graph, (7) using machine learning, and (8) finding network functions.

Splitting Firmware. Firmware splitting extracts the individual files contained in a firmware image. A firmware image normally packs several files together in compressed form, so it must be split into individual files before each file can be loaded into an SRE tool and analyzed.

Finding Static Strings. Static strings are string values embedded in firmware. An embedded ID and password often cause a vulnerability that allows an attacker to log in to the firmware and control the device. Studies therefore often use static strings to find such vulnerabilities.

Finding Memory Value. A memory value is a value stored in memory. Memory holds the values that a user inputs, so whatever an attacker enters is saved as a memory value.

Using Symbolic Execution. Symbolic execution is a technology that traces the path and impact of input values on the execution of a program.

Using Emulator. Emulation is a well-known technique for the dynamic analysis of firmware. An emulator reproduces the behavior of a device in a virtual environment.

Using Graph. A graph describes the connections within a program. A well-known example is the control flow graph, which shows all paths through a firmware program.

Using Machine Learning. Machine learning refers to deep learning or any other machine learning technique. If a study uses machine learning, we check this feature regardless of the type of algorithm.


Finding Network Function. A network function is a function in the firmware that handles network communication, identified, for example, by a socket symbol or a network port. Network functions are often used in firmware analysis because they provide a path for the input values of an attacker.

1.2.2 Research on Firmware Vulnerability Analysis

Firmalice. Shoshitaishvili et al. proposed Firmalice, a method to detect authentication evasion (i.e., backdoors) using symbolic execution [1]. Backdoors allow authentication through unintended paths in a software device’s design. Firmalice describes the points where authentication evasion can occur for each firmware (i.e., execution points for privileged programs) in the form of a security policy and detects authentication evasion based on this policy. The security policy mainly describes three types of information: intentionally embedded log-in information, intentionally hidden authentication interfaces, and unintentional bugs. The first step in the analysis process is to construct a firmware relationship graph using static analysis. Using the constructed graph, the program performs symbolic execution to detect the privileged execution points of programs. If the program detects a privileged program’s execution point, it checks whether the execution point allows for authentication evasion.

PIE. PIE is a method for detecting bugs and finding hidden commands by converting firmware’s binary code into an intermediate language using LLVM, dividing it into individual functions and components, and extracting all commands executed for each protocol [2]. To identify functions, the method learns from samples such as core function programs and complex protocols with multiple servers and then characterizes them by the number of basic blocks, the number of branches (e.g., if-then-else and loop), and the number of conditional statements. The method uses a control flow graph (CFG) and a data flow graph (DFG) to parse the code.

FIRMADYNE. FIRMADYNE is a bug detection tool that uses a QEMU emulator for dynamic firmware analysis [3]. There are three detection objectives: (1) detection of public web pages accessible from the firmware image’s LAN interface, (2) detection of Simple Network Management Protocol (SNMP) information using the Snmpwalk tool to detect all unauthenticated SNMP information, and (3) detection of known vulnerabilities (i.e., 60 vulnerabilities taken from the Metasploit framework).


Stringer. Thomas et al. proposed a method called Stringer, which focuses on functions such as strcmp() and strncmp() that compare user input to an embedded password string [4]. This method weights functions commonly used as backdoors, determines the functions with high weights to be backdoor candidates, and labels each function’s basic blocks with the set of sequences of static data against which user input must be matched to reach the blocks. Using these sets, the method assigns a score to each function to measure the extent to which static data influence the function’s branching. The authors demonstrated the effectiveness of this lightweight analysis by running it on a data set of 2,451,532 binaries from 30 different COTS device vendors. The analysis discovered three backdoors and recovered a proprietary command set.

HumIDIFy. Thomas et al. also proposed HumIDIFy, a method to detect backdoors using a classifier based on semi-supervised learning [5]. This method gathers symbol information from binaries and trains a semi-supervised classifier to create a backdoor detection model, whose output is compared with the expected functionality profile that is defined by hand for a range of applications. To specify these profiles, the authors developed Binary Functionality Description Language, a domain-specific language for encoding the static analysis passes used in identifying a binary’s specific functionality traits. The authors evaluated the technique at scale by measuring its performance on a large firmware data set. By sampling that data set, their method identified several binaries containing unexpected functionality, notably a backdoor in Tenda router firmware. Although the method effectively found a backdoor, it requires time and effort to generate a model before proceeding with backdoor detection.
DTaint. DTaint is a system capable of detecting taint-style vulnerabilities using Firmadyne [6]. Taint analysis identifies the information flow of untrustworthy input that affects a sensitive sink or part of the system. File- and network-related symbols are easy to target and taint to determine where their data comes from. Insecure sink functions (such as strcpy(), memcpy(), and system()) or code patterns (e.g., loop buffer copies) are used to detect unsafe data paths.

FirmUp. FirmUp finds firmware programs that are similar to CVE-registered programs [7]. This method performs assembly analysis by treating binary similarity as a back-and-forth game.

FirmFuzz. FirmFuzz uses a QEMU-based emulator to analyze Linux-based IoT firmware [8]. It collects usernames and passwords for web apps and utilizes fake device drivers to detect attacks and vulnerabilities from web apps. Specifically, it identified previously undiscovered vulnerabilities of four types: (1) command injection, (2) three pre- and one post-authentication buffer overflows, (3) one pre-authentication reflected XSS vulnerability, and (4) one pre-authentication null pointer dereference.

John et al. John et al. designed a method for Android malware detection utilizing graph convolutional networks [9]. The authors focused on system calls, socket calls (read/send), and binder drivers because network-related system calls are commonly used by malware. A graph is constructed from the flow of system calls, and the flow graph of programs related to these calls is evaluated as a candidate for malware.

KARONTE. KARONTE is a vulnerability detection system focusing on inter-process communication: files, shared memory, environment variables, sockets, and command arguments [10]. The method monitors data communication between binaries by constructing graphs on a per-binary basis and correlating data communication and memory location with static taint analysis.

String Search. String Search finds a function by searching for lines that use strcmp() or strncmp() symbols [11]. These symbols are used to compare user input to hardcoded log-in information. Thus, this method considers functions that use these symbols as backdoor function candidates.

Socket Search. Socket Search finds lines that use a strcmp() or strncmp() symbol around the socket function [12]. According to vulnerability-finding research, a backdoor is always accessible by the TCP/UDP function, so the hardcoded strings around the socket symbol are candidates for log-in information.
User Input Search. Yoda et al. proposed a method for detecting hardcoded log-in information (username and password) in IoT devices using static analysis with a focus on the user input value [13], because users must enter correct log-in values to gain access to IoT devices.

Algorithm 1 Main program of firmware analysis (example usage of our middleware)

RawBinaries ← getRawBinaries()
while RawBinaries.hasNext() do
    RawBinary ← RawBinaries.getBinary()
    ELFs ← RawBinary.splitFirmware()
    while ELFs.hasNext() do
        importedELF ← importELFintoSRE(ELFs.getELF())
        if importedELF then
            strings ← staticStrings(importedELF)
            graphs ← controlFlowGraph(importedELF)
            networks ← identifyingNetwork(importedELF)
            algorithmByEachResearch(strings, graphs, networks)
        end if
    end while
end while

1.2.3 Result

Table 1.1 summarizes the technologies used for firmware analysis in the surveyed studies, organized into the eight feature categories. The features used most often in firmware analysis were firmware splitting, graphs, network functions, and static strings. Based on these results, we propose a middleware that provides the following four types of function: firmware splitting, static strings, graphs, and network functions.

Table 1.1 Summary of technologies

Rows (research): Firmalice; PIE; FIRMADYNE; Stringer; HumIDIFy; DTaint; FirmUp; FirmFuzz; John et al.; KARONTE; String Search; Socket Search; User Input Search

Feature (column)            Number of ✔
Splitting firmware          12
Finding static strings      8
Finding memory value        4
Using symbolic execution    1
Using emulator              3
Using graph                 9
Using machine learning      2
Finding network function    6
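The selection of the four middleware functions from the totals in Table 1.1 can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionary holds the "Number of ✔" row of the table, and the helper name top_features is introduced here for illustration and is not part of the proposed middleware.

```python
# Check-mark totals per feature, copied from the last row of Table 1.1.
feature_counts = {
    "Splitting firmware": 12,
    "Finding static strings": 8,
    "Finding memory value": 4,
    "Using symbolic execution": 1,
    "Using emulator": 3,
    "Using graph": 9,
    "Using machine learning": 2,
    "Finding network function": 6,
}


def top_features(counts, k):
    """Return the k feature names with the highest totals, highest first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical helper call: pick the four most frequently used features.
selected = top_features(feature_counts, 4)
print(selected)
```

The four names returned correspond to the functions that the middleware is proposed to provide.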

Fig. 1.1 Overview of our middleware (labels in the figure: Firmware; Firmware Split; SRE; Static strings, Graph, Network Function; Newly developed analysis tools; Vulnerabilities; a legend marks the components that belong to our middleware)

1.3 Our Approach

Based on the results presented in the previous section, we propose a middleware that provides four functions: firmware splitting, static strings, graphs, and network functions. Figure 1.1 shows an overview of our middleware, and Algorithm 1 shows an example of its use. Our approach provides the common features needed to analyze firmware for vulnerabilities. Researchers simply call our middleware as functions from their own programs and use the results to analyze the firmware. Without our middleware, researchers have to develop these common features themselves; because our tool provides them as a foundation, researchers can focus on developing their core algorithms. Our main original contributions lie in static strings, graphs, and the identification of network functions, which we describe in detail below.

1.3.1 Firmware Splitting

Firmware splitting is the most utilized function in Table 1.1. It is a highly important function involving a complex process: if firmware is input directly into SRE software, the software cannot correctly recognize and analyze entry points, so the firmware must be split into a form that the SRE software can read. Most vulnerability analysis studies extract the file system from the firmware, then extract the ELF files from the file system one by one and load them into the SRE tool. Therefore, this method provides a module that takes the firmware to be split as input and outputs the firmware file system and its ELF files. This method uses Binwalk (footnote 5) to split the firmware.

5 Binwalk, https://github.com/ReFirmLabs/binwalk.
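The splitting step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the middleware's implementation: it assumes the binwalk command-line tool is installed and on PATH and that it accepts the -e (extract), -M (recursive) and -C (output directory) options documented for Binwalk 2.x, and the function names split_firmware and find_elf_files are hypothetical. ELF files are recognized by their four-byte magic number.

```python
import os
import subprocess

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def split_firmware(image_path, out_dir):
    """Extract a firmware image with the Binwalk CLI (assumed to be on PATH).

    Assumption: -e extracts known file types, -M recurses into extracted
    files, and -C sets the output directory, as in Binwalk 2.x.
    """
    subprocess.run(["binwalk", "-e", "-M", "-C", out_dir, image_path], check=True)


def find_elf_files(root):
    """Return sorted paths of regular files under root that start with the ELF magic."""
    elf_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Extracted file systems often contain symlinks; skip them.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                if handle.read(4) == ELF_MAGIC:
                    elf_paths.append(path)
    return sorted(elf_paths)
```

A caller would run split_firmware on an image and then pass each path returned by find_elf_files to the SRE tool.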


1.3.2 Static Strings

Previous studies have discovered many vulnerabilities through analysis utilizing static strings, and embedded strings/passwords ranked first on OWASP’s IoT vulnerability ranking list. IDs and passwords embedded in string comparison functions such as strcmp() represent a common vulnerability related to static strings. Therefore, this methodology provides a function to enumerate string comparison functions. For the function’s implementation, we refer to previous studies on String Search [11] and User Input Search [13]. We combine static strings and memory values. Many studies use static strings but do not parse memory values. A value supplied by an attacker is stored in memory, so parsing memory values helps find vulnerabilities. We provide the result of embedded-password verification by tracing the origin of the value that is compared with the embedded password.
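As a small illustration of what "static strings" means at the byte level, the sketch below lists runs of printable ASCII characters in a binary blob, similar in spirit to the Unix strings utility. It is an assumption-laden example only: the function name extract_static_strings and the minimum length of four are choices made here, and the sketch does not trace string comparison functions or memory values as the proposed middleware does.

```python
import re


def extract_static_strings(data, min_length=4):
    """Return (offset, text) pairs for runs of printable ASCII bytes in data."""
    # Bytes 0x20-0x7e are the printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


# Made-up bytes standing in for part of a firmware binary.
blob = b"\x00\x01admin\x00\x7fELF\x02\x00secret123\x00ab\x00"
for offset, text in extract_static_strings(blob):
    print(offset, text)
```

Short runs such as "ab" are dropped by the length threshold; candidate credentials would still have to be confirmed by checking where the strings are compared with user input.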

1.3.3 Control Flow Graph

A program’s control flow graph is utilized when weighting functions suspected of having vulnerabilities and when detecting hidden authentication paths, as well as in other processes. IDA Pro and Ghidra, well-known SRE tools, provide standard control flow graph functionality. However, if program restoration is incomplete, the control flow graph may also be left incomplete. For example, Ghidra may not be able to trace data back to its parent function. Alternatively, a child function’s path may be interrupted because of limitations in the tool’s analysis capability. If the control flow graph is incomplete, the program’s exact flow cannot be determined, which affects the accuracy of the analysis. Therefore, this method provides a function to complete the flow graph.
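One way an incomplete graph shows up in practice is as functions that cannot be reached from the entry point. The sketch below illustrates only that symptom, using a breadth-first search over a hand-written call graph; it is not the completion method of the proposed middleware, and the graph contents and function names are made up for the example.

```python
from collections import deque


def reachable_from(graph, entry):
    """Breadth-first search over a graph given as {node: [successor, ...]}."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def unreached_nodes(graph, entry):
    """Return nodes that appear in the graph but are not reachable from entry."""
    all_nodes = set(graph)
    for successors in graph.values():
        all_nodes.update(successors)
    return sorted(all_nodes - reachable_from(graph, entry))


# Hypothetical call graph in which the edge into check_password was not recovered.
call_graph = {
    "main": ["init", "serve"],
    "serve": ["handle_request"],
    "handle_request": [],
    "init": [],
    "check_password": ["strcmp"],
}
print(unreached_nodes(call_graph, "main"))
```

Nodes reported here are candidates for missing edges that a completion step would need to investigate.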

1.3.4 Identifying Network Functions

IoT devices’ network functions can easily become entrance points for attacks and intrusions. In this way, IoT devices can be used as stepping-stones. Alternatively, the data they contain can be leaked through remote control by a third party. Therefore, this method provides a function to detect a firmware’s network functions and then display a list of functions that reference them. We achieve the specific implementation by incorporating SocketSearch [12], a method developed in a previous study, and developing a network function classifier using machine learning. This study maintains firmware data for approximately 40,000 real-world products, and we extract network functions from these data to use as training data.


A related study revealed that program content can be understood from method names with 80% accuracy [14]. In some cases, IDA Pro and Ghidra cannot recover function names, whereas the current method can detect a network function even if its method name is unknown.
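The symbol-based part of this idea, listing the functions that reference network symbols, can be sketched as below. The sketch is an assumption: the set of libc socket-API names, the input format (a mapping from each function to the symbols it calls) and the function name functions_referencing_network are chosen here for illustration, and the machine learning classifier mentioned above is not shown.

```python
# Common socket-API symbols; the exact set used by the middleware is not specified.
NETWORK_SYMBOLS = {
    "socket", "bind", "listen", "accept", "connect",
    "recv", "recvfrom", "send", "sendto",
}


def functions_referencing_network(calls_by_function):
    """Map each function that calls a network symbol to the sorted symbols it calls."""
    result = {}
    for function, callees in calls_by_function.items():
        hits = sorted(set(callees) & NETWORK_SYMBOLS)
        if hits:
            result[function] = hits
    return result


# Hypothetical input; FUN_... mimics the placeholder names SRE tools give unnamed functions.
calls = {
    "FUN_00401000": ["socket", "bind", "listen", "strcmp"],
    "FUN_00402000": ["strcpy", "printf"],
    "handle_client": ["recv", "strncmp", "send"],
}
print(functions_referencing_network(calls))
```

Because the lookup relies on the called symbols rather than on the caller's own name, it still works for functions whose names were not recovered.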

1.4 Conclusions and Future Work

In this study, we proposed the concept of a middleware that standardizes the analysis functions required for firmware vulnerability analysis. Although there are many studies on firmware vulnerability analysis, each study requires extensive time for common preliminaries because of the complexity of firmware analysis. By utilizing our middleware, researchers can shorten the time required for firmware analysis and concentrate on the essential vulnerability analysis. As future work, we will develop the proposed middleware and experimentally validate it. We plan to measure how much the time needed to find vulnerabilities changes with and without the middleware. After this validation, we will release our middleware on GitHub so that many researchers can use it.

Acknowledgements This work was supported by JSPS KAKENHI (Grant Numbers JP18H03229, JP18H03340, 18K19835, JP19H04113, JP19K12107, and JP21H03496).

References

1. Y. Shoshitaishvili, R. Wang, C. Hauser, C. Kruegel, G. Vigna, Firmalice: automatic detection of authentication bypass vulnerabilities in binary firmware, in The 2015 Network and Distributed System Security (USA, 2015)
2. L. Cojocar, J. Zaddach, R. Verdult, B. Herbert, A. Francillon, D. Balzarotti, PIE: parser identification in embedded systems, in Annual Computer Security Applications Conference (USA, 2015)
3. D.D. Chen, M. Egele, M. Woo, D. Brumley, Towards automated dynamic analysis for Linux-based embedded firmware, in The 2016 Network and Distributed System Security (USA, 2016)
4. S.L. Thomas, T. Chothia, F.D. Garcia, Stringer: measuring the importance of static data comparisons to detect backdoors and undocumented functionality, in The European Symposium on Research in Computer Security (Norway, 2017)
5. S.L. Thomas, T. Chothia, F.D. Garcia, HumIDIFy: a tool for hidden functionality detection in firmware, in The Network and Distributed System Security Symposium (USA, 2017)
6. K. Cheng, Q. Li, Q. Wang, Q. Chen, Y. Zheng, L. Sun, Z. Liang, DTaint: detecting the taint-style vulnerability in embedded device firmware, in 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (Luxembourg, 2018)
7. Y. David, N. Partush, E. Yahav, FirmUp: precise static detection of common vulnerabilities in firmware, in The Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (USA, 2018)
8. P. Srivastava, H. Peng, J. Li, H. Okhravi, H. Shrobe, M. Payer, FirmFuzz: automated IoT firmware introspection and analysis, in The Internet of Things Security and Privacy (United Kingdom, 2019)
9. T.S. John, T. Thomas, S. Emmanuel, Graph convolutional networks for android malware detection with system call graphs, in Third ISEA ASIA Security and Privacy (India, 2020)
10. N. Redini, A. Machiry, R. Wang, C. Spensky, A.C. Continella, Y. Shoshitaishvili, C. Kruegel, G. Vigna, KARONTE: detecting insecure multi-binary interactions in embedded firmware, in 41st IEEE Symposium on Security and Privacy, Virtual (2020)
11. M. Yoda, S. Sakuraba, Y. Sei, Y. Tahara, A. Ohsuga, Detection of the hardcoded login information from socket and string compare symbols, in Annals of Emerging Technologies in Computing (2021)
12. M. Yoda, S. Sakuraba, Y. Sei, Y. Tahara, A. Ohsuga, Detection of the hardcoded login information from socket symbols, in International Conference on Computing, Electronics & Communications Engineering, Virtual (2020)
13. M. Yoda, S. Sakuraba, Y. Sei, Y. Tahara, A. Ohsuga, Detecting hardcoded login information from user input, in The 40th IEEE International Conference on Consumer Electronics, Virtual (2022)
14. S. Suzuki, H. Aman, M. Kuwahara, A decision tree-based model for judging the compatibility between java method’s name and implementation and its evaluation, in FOSE (Japan, 2017)

Chapter 2

Feature-Based Cloud Provisioning for Rehosting

Misaki Mito, Takatoshi Ohara, Ryo Shimizu, Hideyuki Kanuka, and Minoru Tomisaka

Abstract As IT infrastructure is increasingly rehosted to public clouds, IT system development must become more efficient. Traditionally, configuring and provisioning cloud services to meet the same requirements as on-premises systems is a time-consuming manual process that is prone to human error. Therefore, we define a cloud feature model based on rehosting requirements and propose an automatic provisioning method for cloud environments in which engineers select model elements from the same design perspective as on-premises cases. We evaluated our method’s effectiveness in two projects for rehosting to an AWS environment. The results show that our method reduced man-hours by 20% in the design process and by 60% in the construction process compared with the conventional method. In the design process, our method eliminated the trial-and-error consideration of cloud services. In the construction process, our method helped introduce Infrastructure as Code, and the automatic matching function immediately increased productivity after provisioning. Therefore, our method improved efficiency in IT system development associated with rehosting.

Keywords Cloud computing · Feature model · Infrastructure as code

M. Mito (B) · T. Ohara · R. Shimizu · H. Kanuka
Research and Development Group, Hitachi, Ltd., Kanagawa, Japan
e-mail: [email protected]

M. Tomisaka
Services and Platforms Business Unit, Hitachi, Ltd., Tokyo, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Various et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_2

2.1 Introduction

Infrastructure as a Service (IaaS) provides the infrastructure layer of IT systems, and the global public cloud market for IaaS was predicted to grow by 40.7% in 2020 [6]. Rehosting is a migration strategy that takes an existing application and hosts it on a cloud platform provided by IaaS [1]. Rehosting is often used as a migration strategy for a rapid transition to a cloud service [13]. However, no new application architecture, data flows, server operating systems, software, etc., are designed as part of the migration. In Japan, the share of in-house IT system development by user companies is low (58% in the U.S. vs. 31.2% in Japan [4]), and the majority of IT system design and development is outsourced to IT vendor companies with many engineers. Therefore, IT vendor companies work to improve their profit margins by streamlining their IT system development for rehosting.

Figure 2.1 shows that the IT system development flow for rehosting consists of requirement definition, design, construction, and testing. While the requirement definition and testing processes for rehosting are conducted in the same way as for on-premises systems, the design and construction processes are implemented differently despite having the same purposes.

Fig. 2.1 Overview of IT system development process

In the design process, hardware and software that satisfy the defined requirements are selected, parameters are defined, and a design document is prepared. Figure 2.1 shows how, in rehosting, a cloud service that corresponds to an on-premises appliance (e.g., a router) is selected, and its parameters are set and combined. In many cases, the design document is treated as a deliverable to the client company.

In the construction process, the necessary system components are procured, the operating system and software are installed, and parameters are set based on the design documents. For on-premises systems, IT vendors procure and configure appliances and deploy them at the customer’s company. For rehosting, engineers perform the procurement and installation tasks by entering the parameters of the design document on the console provided by the public cloud vendor. Engineers who mostly (or solely) develop on-premises IT systems are likely to make errors in these tasks because they may lack specific cloud computing knowledge. In addition, engineers are likely to lag behind in technical learning because of the high frequency of service releases in public cloud services. Therefore, manual work that requires cloud-specific knowledge should be reduced to make IT system development associated with rehosting more efficient and increase profit margins.


Table 2.1 Example architectures corresponding to requirements

Targets                   On-premises  Cloud (AWS)                  Cloud (AWS)                      Cloud (AWS)
Symbol                    (a)          (b)                          (c)                              (d)
Load balancer             (graphic)    Network Load Balancer (NLB)  Application Load Balancer (ALB)  Application Load Balancer (ALB)
Server                    (graphic)    (graphic)                    (graphic)                        (graphic)
Protocol                  HTTP         TCP / UDP                    HTTP                             HTTP
Availability Zones (AZs)  N/A          1                            1                                2
Satisfies requirements?   Yes          No                           Yes                              Yes
Available provisioning?   No           Yes                          No                               Yes

2.2 Motivating Examples

2.2.1 Trial-and-Error Searches for Optimal Cloud Service Configuration in Design Process

In the design process, engineers must search through trial and error for the optimal cloud service configuration that satisfies on-premises requirements. Table 2.1 shows example configurations for the requirement of maintaining high availability with load balancing using HTTP as the protocol. Conventional IT vendor companies often have many engineers who develop on-premises IT systems. These engineers can easily determine an on-premises configuration that satisfies project requirements based on their experience, as shown in Table 2.1a. In the cloud, a managed load-balancer service provided by the public cloud vendor (e.g., Elastic Load Balancing (ELB) in Amazon Web Services (AWS)) is commonly used, because cloud services do not correspond 1:1 with on-premises appliances. Of the several ELB types, the Network Load Balancer (NLB) and Application Load Balancer (ALB) are shown in Table 2.1b–d. Their protocols differ, and the ELBs that satisfy the requirements of the example in this section are the ALBs in Table 2.1c, d, which are capable of using HTTP. The difference between Table 2.1c and d is the number of Availability Zones (AZs), which are geographically separated data centers in AWS. This design parameter could seem arbitrary to an engineer inexperienced in AWS, but the configuration in Table 2.1c fails in provisioning because of a dependency in which an ALB requires two or more AZs. In addition, unless there is a requirement to minimize communication latency, the de facto standard for using AWS is to distribute the system across multiple AZs to ensure high availability. Thus, Table 2.1d is the only cloud architecture that can be provisioned and meets the requirements. Engineers therefore end up determining the optimal configuration of cloud services that satisfies the on-premises requirements through trial and error.
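The two checks that an engineer performs by trial and error in this example can be written down explicitly. The sketch below is illustrative only: the class and function names are hypothetical, and the rules encoded are just the two stated in the text (the protocol must be HTTP, and an ALB requires two or more AZs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadBalancerConfig:
    label: str
    kind: str          # "NLB" or "ALB"
    protocols: tuple   # protocols the load balancer type can handle
    az_count: int      # number of Availability Zones


def satisfies_requirement(config, required_protocol="HTTP"):
    """The example requirement: load balancing with HTTP as the protocol."""
    return required_protocol in config.protocols


def can_be_provisioned(config):
    """Dependency stated in the text: an ALB requires two or more AZs."""
    if config.kind == "ALB":
        return config.az_count >= 2
    return config.az_count >= 1


candidates = [
    LoadBalancerConfig("NLB, 1 AZ", "NLB", ("TCP", "UDP"), 1),
    LoadBalancerConfig("ALB, 1 AZ", "ALB", ("HTTP",), 1),
    LoadBalancerConfig("ALB, 2 AZs", "ALB", ("HTTP",), 2),
]
results = {
    c.label: (satisfies_requirement(c), can_be_provisioned(c)) for c in candidates
}
usable = [label for label, (ok, prov) in results.items() if ok and prov]
print(usable)
```

Only the ALB spread over two AZs passes both checks, which matches the conclusion drawn from Table 2.1.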


Step 1. Create Virtual Private Cloud (VPC)
Step 2. Create Subnet in VPC
Step 3. Create Internet Gateway (IGW)
Step 4. Attach IGW to VPC
Step 5. Create Route Table
Step 6. Add IGW Route to Route Table
Step 7. Change Route Table of Subnet

Fig. 2.2 Required steps with console for manually provisioning simple network layer, created from partial excerpt of AWS’ webinar document [2]

2.2.2 Manual Provisioning with Console Used by Engineers in Construction Process

Manual provisioning using a console during the construction process is error-prone and increases the required man-hours. Section 2.1 explains how engineers provision cloud environments by entering the design document parameters into the console provided by the public cloud vendor. Figure 2.2 shows an example of engineers manually provisioning a network layer on AWS. The following four AWS services are handled in the procedure shown in Fig. 2.2: a virtual private cloud (VPC), the configuration unit of the AWS network; a subnet, the network segmentation unit; an internet gateway, the communication access point between the AWS network and the internet; and a route table, the network routing destination setting in VPCs and subnets. As Fig. 2.2 shows, it takes seven steps to provision the network layer, and their order depends on what each cloud service provides. For example, in Fig. 2.2, the execution order of steps 1 and 2 cannot be swapped because the subnet is created inside the VPC; provisioning would fail if the order were swapped. Furthermore, input errors are likely to occur because the process is manual. As a result, provisioning may fail, or servers may be unable to communicate with each other after provisioning, which increases the man-hours required for IT system development.

Infrastructure as Code (IaC) tools are intended to automate manual provisioning. The infrastructure and software development team defines the configuration of the infrastructure, including cloud services, as code and then constructs and manages the infrastructure with automated tools [8]. In the public cloud, various IaC tools are available from public cloud vendors, and cloud service configurations can be written using IaC tools to automate the management and provisioning of networks and servers [3]. However, the required IaC syntax rules are costly to learn and write for engineers who are familiar with on-premises systems.
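The order dependence among the seven steps of Fig. 2.2 can be made explicit with a dependency map and a topological sort, which is essentially what IaC tools resolve automatically. In the sketch below, only the dependency of the subnet on the VPC is taken from the text; the remaining dependencies, the step identifiers and the helper first_violation are assumptions made for illustration. It uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# step -> set of steps that must already be done (partly assumed, see lead-in)
DEPENDS_ON = {
    "create_vpc": set(),
    "create_subnet": {"create_vpc"},
    "create_igw": set(),
    "attach_igw_to_vpc": {"create_vpc", "create_igw"},
    "create_route_table": {"create_vpc"},
    "add_igw_route": {"attach_igw_to_vpc", "create_route_table"},
    "change_subnet_route_table": {"create_subnet", "create_route_table"},
}


def first_violation(order):
    """Return (step, missing prerequisites) for the first out-of-order step, or None."""
    done = set()
    for step in order:
        missing = DEPENDS_ON[step] - done
        if missing:
            return step, sorted(missing)
        done.add(step)
    return None


# One valid execution order derived from the dependencies.
valid_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(valid_order)

# Swapping steps 1 and 2 is reported as a violation.
print(first_violation(["create_subnet", "create_vpc"]))
```

A manual operator has to keep such an ordering in mind for every change, whereas a tool can derive it from the declared dependencies.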


2.3 Feature-Based Cloud Provisioning Method for Rehosting

2.3.1 Overview of Proposed Method

To solve the aforementioned problems, we propose a feature-based cloud provisioning method for rehosting. Figure 2.3 overviews our method. Our method is equipped with a cloud feature model (CFM), i.e., a feature model (FM) for rehosting, and a provisioning tool that automatically provisions a cloud environment in conjunction with the CFM. The CFM expresses rehosting requirements, and engineers can easily design cloud environments by selecting features on the CFM, thus eliminating the need for trial and error. The provisioning tool is composed of IaC internally and provisions the cloud environment in accordance with the CFM selection results. Because the IaC is prepared in advance, engineers do not need to learn IaC or write new IaC when requirements change. In addition, IaC improves IT system development by eliminating the manual work on the console provided by the public cloud vendor. In the following sections, we detail our method.
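The idea that selection results drive pre-written IaC can be sketched without any AWS dependency by building a CloudFormation-style resource fragment as a plain dictionary. This is a hedged illustration, not the authors' tool: the function name build_app_resources, its parameters and the rule of adding a load balancer when two or more servers are selected are assumptions, and the fragment omits properties (such as an image ID and subnets) that a deployable template would need. The resource types AWS::EC2::Instance and AWS::ElasticLoadBalancingV2::LoadBalancer are real AWS CloudFormation types.

```python
import json


def build_app_resources(servers, autoscaling=False, instance_type="c5.2xlarge"):
    """Return a CloudFormation-style "Resources" fragment for the app tier.

    Assumption: only the fixed-size (auto scaling OFF) case is sketched;
    the auto scaling branch is intentionally left out and yields no resources.
    """
    resources = {}
    if not autoscaling:
        for index in range(1, servers + 1):
            resources[f"App{index}"] = {
                "Type": "AWS::EC2::Instance",
                "Properties": {"InstanceType": instance_type},
            }
        if servers >= 2:  # assumed rule: balance load across multiple servers
            resources["AppLoadBalancer"] = {
                "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
                "Properties": {"Type": "application", "Scheme": "internal"},
            }
    return resources


# Hypothetical selection result: two app servers, auto scaling OFF.
fragment = {"Resources": build_app_resources(servers=2)}
print(json.dumps(fragment, indent=2))
```

The engineer only supplies the selection values; the structure of the generated resources is fixed in the pre-written code.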

Fig. 2.3 Overview of proposed method for solving problems in each process (problems: trial-and-error searching for the optimal configuration of cloud services in the design process, and manual provisioning with the console by engineers in the construction process; proposed method: the engineer selects and configures features on the CFM (three-tier web), and the cloud provisioning tool takes the selection results as input, generates and deploys IaC, and provisions the cloud environment)

2.3.2 Cloud Feature Model

Figure 2.4 shows the CFM constructed for the three-tier web architecture. The CFM was developed based on feature-oriented domain analysis FM (FODA-FM) [7], which is a basic form of FM. In FODA-FM, the FM is developed by organizing the viewpoints used to define requirements and basic design as features instead of strictly mapping implementation materials and features 1:1 as in Quinton et al. [10]. The three feature selection states are “mandatory,” in which a feature is selected regardless of the user’s intended selections; “optional,” in which the user specifies whether a feature is selected on a per-feature basis; and “selective,” in which a feature is selected from multiple features. When these selection states are set, the CFM can restrict engineers from selecting a combination of features that cannot be provisioned. The CFM can also express dependencies among features through “Required.” We use “Required” to constrain a configuration with two application (AP) servers in AWS to two AZs when load balancing is used to increase availability.

Fig. 2.4 Outline of three-tier web CFM (legend: Mandatory, Optional, Selective, Required). Features under the root “Web system”:
  Network: Web server publishing scope (within company (Private) / outside company (Public)); DC (AZ) (1 / 2)
  Web server: EBS capacity (GiB) (10 / 100); Instance type (m5.large / r5.xlarge); Middleware (Apache / Nginx)
  App server: EBS capacity (GiB) (10 / 100); Instance type (t3.medium / c5.2xlarge); Middleware (uCosminexus Application Server / JBoss); Auto scaling OFF (Servers: 1 / 2); Auto scaling ON (Min servers: 1 / 4; Max servers: 4 / 16; Target tracking CPU utilization: 60%)
  DB server: Username (Admin); DBMS (HiRDB / Amazon RDS); DB size (S / M / L)

The basic policy for adopting features into the CFM is to use the requirements of the source on-premises system, because cloud services for rehosting are chosen based on those requirements. This reduces both the new learning cost associated with rehosting and the number of wrong selections caused by misunderstanding cloud terminology. However, some dependencies between cloud services do not exist on-premises (e.g., the relationship between NLBs and ALBs and the number of AZs in Table 2.1). Thus, the minimum essential concepts that do not exist on-premises but help determine the combination of cloud services are also used as features (e.g., the “number of AZs” in Table 2.1 is defined as the “DC (AZ)” feature in Fig. 2.4). In addition, features such as “DC (AZ)” affect the entire system and are placed under “Network” so that engineers do not have to set the same item more than once. Figure 2.5 shows that the CFM can be operated via a graphical user interface (GUI) to reduce the learning cost for engineers.
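How the selection states and the “Required” dependency restrict a selection can be illustrated with a small validator. This is a sketch under assumptions rather than the CFM implementation: “selective” is interpreted here as choosing exactly one feature from a group, the feature names are simplified stand-ins for those in Fig. 2.4, and the function name validate_selection is hypothetical.

```python
def validate_selection(selection, mandatory, alternatives, requires):
    """Return a list of violations for a feature selection.

    selection:    set of selected feature names
    mandatory:    set of features that must always be selected
    alternatives: {group: set of features}; exactly one per group ("selective")
    requires:     list of (feature, required_feature) pairs ("Required")
    """
    errors = []
    for feature in sorted(mandatory - selection):
        errors.append(f"mandatory feature missing: {feature}")
    for group, options in sorted(alternatives.items()):
        chosen = sorted(options & selection)
        if len(chosen) != 1:
            errors.append(f"{group}: expected exactly one option, got {chosen}")
    for feature, needed in requires:
        if feature in selection and needed not in selection:
            errors.append(f"{feature} requires {needed}")
    return errors


# Simplified stand-ins for part of the three-tier web CFM.
MANDATORY = {"Network", "App server"}
ALTERNATIVES = {
    "DC (AZ)": {"AZ=1", "AZ=2"},
    "Servers": {"Servers=1", "Servers=2"},
}
REQUIRES = [("Servers=2", "AZ=2")]  # two AP servers are constrained to two AZs

bad = {"Network", "App server", "Servers=2", "AZ=1"}
print(validate_selection(bad, MANDATORY, ALTERNATIVES, REQUIRES))
```

A GUI built on such rules can reject the invalid combination at selection time, before any provisioning is attempted.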

2.3.3 Cloud Provisioning Tool Figure 2.6 overviews the cloud provisioning tool. The tool provisions a cloud environment by the following procedure. 1. The results of the selections the engineer made for the CFM are input into the constant design document generation tool.

2 Feature-Based Cloud Provisioning for Rehosting

21

Fig. 2.5 GUI for CFM selection, written in Japanese (i.e., the native language of the engineers evaluating the proposed method)

Fig. 2.6 Overview of cloud provisioning tool. The diagram shows the following flow: the selection results are input into the constant design document generation tool (pre-coded IaC using AWS CDK plus constant design document generation code), which generates an AWS CloudFormation (AWS CFn) template and a constant design document (one sheet per cloud service, e.g., EC2 and ELB); the engineer optimizes the constant design document for the project; the optimized document is input into the AWS CFn template generation tool (pre-coded IaC using AWS CDK), which generates the AWS CFn template that is used to provision the cloud environment.




M. Mito et al.

2. The tool includes pre-coded IaC written with the AWS Cloud Development Kit (AWS CDK) and constant design document generation code. The pre-coded IaC takes the selection results into variables for each layer and generates an AWS CloudFormation (AWS CFn) template that reflects them. The generation code then produces a constant design document from the AWS CFn template, with the content defined for each cloud service.
3. The engineer edits and optimizes the generated constant design document based on the project requirements. Sometimes, the engineer writes additional constant design documents so that additional cloud services can be provisioned.
4. The optimized constant design documents are input into the AWS CFn template generation tool, which is written using AWS CDK. This tool reads the constant design documents with classes defined for each cloud service and generates AWS CFn templates using AWS CDK functions.
5. The cloud environment is provisioned using the AWS CFn template generated in the previous step.

The constant design document is composed in a spreadsheet format, so the engineer can edit the values in the document as needed without an in-depth understanding of the IaC syntax rules. An optimized constant design document can be used as a deliverable to the client company, thus saving the IT system development effort otherwise spent on preparing that deliverable. Our method also provides an automatic matching function to check whether the environment has actually been constructed per the constant design document. The auto-check function obtains cloud service parameters in the provisioned environment using the AWS command-line interface (CLI) and compares them with the parameters set in the constant design document. The function can be used for unit testing.
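The procedure above can be sketched in a few lines. The following is a simplified illustration and not the authors' tool: it uses plain Python dictionaries instead of AWS CDK so that it runs without any AWS dependency, and the function names `build_template` and `diff_parameters`, the selection keys, and the rule that a load balancer is added for two or more servers are all assumptions. The resource types, the `c5.2xlarge` instance type, and the `internal` scheme are taken from the fragments visible in Fig. 2.6.

```python
def build_template(selection):
    """Build a CloudFormation-style template (as a dict) for one application layer."""
    resources = {}
    if not selection["autoscaling"]:
        # One EC2 instance per requested server: App1, App2, ...
        for i in range(1, selection["servers"] + 1):
            resources[f"App{i}"] = {
                "Type": "AWS::EC2::Instance",
                "Properties": {"InstanceType": selection["instance_type"]},
            }
    # Assumption: a load balancer is added as soon as there are two or more servers.
    if selection["servers"] >= 2:
        resources["AppLoadBalancer"] = {
            "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
            "Properties": {"Scheme": "internal"},
        }
    return {"Resources": resources}


def diff_parameters(expected, actual):
    """Compare design-document parameters with parameters read from the environment.

    Both arguments map a resource name to a flat dict of parameters.
    Returns (resource, parameter, expected_value, actual_value) for every mismatch.
    """
    mismatches = []
    for resource, params in expected.items():
        found = actual.get(resource, {})
        for name, value in params.items():
            if found.get(name) != value:
                mismatches.append((resource, name, value, found.get(name)))
    return mismatches
```

In this sketch, `actual` stands in for the parameters that the auto-check function would obtain through the AWS CLI; how those values are collected is outside the scope of the example.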

2.4 Evaluation

2.4.1 Evaluation Method

We evaluated our method with two rehosting projects in the AWS environment. Although the engineers’ proficiency levels differed for each project, the same engineers were responsible for coordinating the two projects. We interviewed engineers about the advantages and disadvantages of our method in terms of usability and functionality. In addition, we asked the engineers who would manage the project to estimate the relative man-hours (for the design and construction processes, respectively) required if the project were carried out using our method rather than the conventional method (shown as cloud lift migration in Fig. 2.1). The engineers who managed the projects we evaluated are very experienced in IT systems development and were expected to make highly accurate estimations of the relative man-hours required for the proposed and the conventional methods. The CFM used in the evaluation was based on the three-tier web architecture shown in Fig. 2.4.


Table 2.2 Comments on the proposed method through evaluation

Process: Design process
– Automatic generation of the required constant design documents eliminated the need for prior consideration of cloud services∗.
– A separate constant design document was required for cloud services not covered by the architecture. We believe that the required man-hours will be reduced further when the architecture is expanded in the future.
– We were able to easily select configurations using the GUI. However, executing commands such as constant design document generation using the CLI was difficult until we became accustomed to the commands. Thus, the ability to use the GUI would be preferable.

Process: Construction process
– Despite knowing IaC exists, I could not introduce it because of its syntax and other obstacles. However, I was able to introduce it with the help of the tool.
– I was impressed by the automation, which eliminated the need for console operations and shortened the system construction time.
– No human error was made during the construction process, and the quality of the work was reliable.
– After constructing the environment, we could reduce man-hours by performing mechanical parameter checks that could be used for quality assurance checks.

Process: On the whole
– We rely on our tools because they enable us to make a profit even when the price per unit of work drops due to competition.
– It was difficult to construct an execution environment for the tool.

∗ Engineers select a cloud service that satisfies their requirements and describes the cloud service configuration in a design document

2.4.2 Results

2.4.2.1 Over/Under Use of Proposed Method and Functions

Table 2.2 lists comments from the engineers on the usability and the advantages or disadvantages of the functions when using our method.

2.4.2.2 Effect of Man-hour Reduction

Figure 2.7 shows the man-hours estimated for the design and construction processes for each of the two projects. The values in Fig. 2.7 are ratios of the man-hours required, as estimated by the engineers, rather than absolute man-hours. Thus, with the conventional method, the project in Fig. 2.7b required twice as many man-hours for the construction process as the project in Fig. 2.7a did. Figure 2.7 shows that our method reduced man-hours by 20% in the design process and by 60% in the construction process for both projects, regardless of the proficiency level of the engineers.
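The reduction rates can be checked with a short calculation. The values below are the relative man-hours as they appear in Fig. 2.7; their assignment to the individual bars (conventional versus proposed, design versus construction) is an assumption, chosen because it is the reading consistent with the 20% and 60% figures and with the factor of two stated in the text.

```python
# Relative man-hours read from Fig. 2.7 as (conventional, proposed).
# Assumption: the mapping of values to bars is inferred from the text.
FIG_2_7 = {
    "Project #1": {"design": (10, 8), "construction": (5, 2)},
    "Project #2": {"design": (10, 8), "construction": (10, 4)},
}


def reduction_percent(conventional, proposed):
    """Percentage of man-hours saved by the proposed method."""
    return round(100 * (conventional - proposed) / conventional)


reductions = {
    project: {process: reduction_percent(*pair) for process, pair in processes.items()}
    for project, processes in FIG_2_7.items()
}
```

Under this reading, both projects show the same relative savings even though the conventional construction effort of Project #2 is twice that of Project #1.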

Fig. 2.7 Estimated man-hours required for projects in design and construction process: (a) Project #1, (b) Project #2. Each panel compares the ratio of man-hours* for the conventional method and the proposed method in the design and construction processes. Values shown in the panels: 10, 8, 5, 2 for (a) and 10, 8, 10, 4 for (b). *Engineer-estimated ratio of man-hours required by project

2.5 Discussion

2.5.1 Effects of Application to Design Process

In Table 2.2, “eliminated the need for prior consideration of cloud services” is mentioned. Our method can generate the necessary constant design documents without any preliminary study. This indicates that the trial-and-error process during the preliminary study is no longer necessary. There were no comments directly evaluating the structure of the CFM itself. However, the lack of comments implying the unsuitability of the CFM structure for rehosting indicates that it worked appropriately for the engineers. Figure 2.7 shows that man-hours were reduced. Although some of the engineers had less than one year of development experience using AWS, they could provision development using the CFM regardless of their experience. However, the constant design document template that was automatically generated by the CFM did not support all of the cloud services used by the project. Therefore, man-hours were only reduced by 20%. Table 2.2 shows that man-hours could have been further reduced by expanding the number of cloud services targeted by the CFM.

2.5.2 Effects of Application to Construction Process

Table 2.2 and Fig. 2.7 show that our method facilitated IaC implementation and reduced IaC man-hours in both projects. Furthermore, productivity was greatly improved using the automatic matching function immediately after provisioning. The “on the whole” comment in Table 2.2 indicates that the reduced project man-hours can be partially reflected in the order price, which makes prices more competitive. However, the engineers using our method were not accustomed to coding, so they had a hard time constructing the environment because they mainly worked on the console using the conventional method. We talked with the engineers and found that this could be improved by using a script to introduce the execution environment in a batch.


2.6 Related Work

For the public cloud, García-Galán et al. [5] reported constructing an AWS EC2 virtual server by building an FM for IaaS. They adopted a software product line development method, in which code is reused for the mass production of individual software products [9]. However, because the construction target was limited to a single server, it is not possible to construct, for example, an entire system that requires a three-tier web architecture. Quinton et al. constructed a cloud knowledge model that defines the constraint relationships between FMs for cloud systems, target system knowledge, and domain knowledge, and used it to automatically construct cloud environments [10]. The 1:1 correspondence between features and scripts leads to the assumption that a parameter specifying the number of AZs is embedded in each feature that indicates the server type, and that switching is performed accordingly. However, this means that parameters commonly specified for the entire system (e.g., the number of AZs) are specified for each server, a redundancy that can easily cause engineers to make mistakes in their settings. Similar to our method, the ARGON cloud environment provisioning tool proposed by Sandobalin et al. [11] uses a GUI. With ARGON, engineers can model cloud services, select cloud service models on the GUI, and specify combinations among the models. Furthermore, Sandobalin et al. showed that selecting models on a GUI is a faster and more stable provisioning method than directly using Ansible or other IaC tools [12]. However, ARGON requires almost the same number of parameters to be set as a console provided by a public cloud vendor does. Moreover, trial and error by engineers cannot be avoided.

2.7 Conclusion

When designing IT systems for rehosting, engineers search for the best cloud service configuration to meet the on-premises requirements through trial and error. When constructing the systems, manual provisioning with consoles is error-prone and time-consuming. Therefore, we proposed a feature-based cloud provisioning method that reduces IT system development costs for public cloud rehosting. For the design process, we constructed an FM for rehosting (CFM); for the construction process, we developed an automatic cloud provisioning tool that works in conjunction with the CFM. We evaluated the effectiveness of our method with two projects for rehosting to an AWS environment. The design and construction processes required 20% and 60% fewer man-hours, respectively, than with the conventional method. Thus, our method improves the efficiency of IT system rehosting. In the future, we will further reduce man-hours by expanding the scope of the CFM and improving the execution environment. We will also consider provisioning non-AWS cloud environments by using IaC tools other than the AWS CDK.


References

1. N. Ahmad, Q.N. Naveed, N. Hoda, Strategy and procedures for migration to the cloud computing, in 2018 IEEE 5th International Conference on Engineering Technologies and Applied Sciences (ICETAS) (2018), pp. 1–5. https://doi.org/10.1109/ICETAS.2018.8629101
2. Amazon Web Services, Inc. Webinar for Beginners, Networking on AWS (Jan 2015). https://www.slideshare.net/AmazonWebServicesJapan/webinar-aws-43351630 (in Japanese). Last accessed 30 Mar 2022
3. B. Campbell, The Definitive Guide to AWS Infrastructure Automation, 1st edn. (Apress, Berkeley, CA, 2020). https://doi.org/10.1007/978-1-4842-5398-4
4. Information-technology Promotion Agency (IPA), Japan, DX Strategy, People, and Technology in Japan-U.S. Comparative Study, DX white paper 2021 (2021) (in Japanese)
5. J. García-Galán, P. Trinidad, O.F. Rana, A. Ruiz-Cortés, Automated configuration support for infrastructure migration to the cloud. Future Gener. Comput. Syst. 55, 200–212 (2016)
6. Gartner, Inc. Gartner Says Worldwide IAAS Public Cloud Services Market Grew 40.7% in 2020 (2021). https://www.gartner.com/en/newsroom/press-releases/2021-06-28-gartner-says-worldwide-iaas-public-cloud-services-market-grew-40-7-percent-in-2020. Last accessed 30 Mar 2022
7. K.C. Kang, S.G. Cohen, J.A. Hess, W.E. Novak, A.S. Peterson, Feature-Oriented Domain Analysis (FODA) Feasibility Study. Report (1990)
8. K. Morris, Infrastructure as Code: Managing Servers in the Cloud (O’Reilly Media, Inc., 2016)
9. K. Pohl, G. Böckle, F. van der Linden, Software Product Line Engineering, 1st edn. (Springer, Heidelberg, 2005). https://doi.org/10.1007/3-540-28901-1
10. C. Quinton, D. Romero, L. Duchien, Saloon: a platform for selecting and configuring cloud environments. Softw.: Pract. Experience 46(1), 55–78 (2016). https://doi.org/10.1002/spe.2311
11. J. Sandobalin, E. Insfran, S. Abrahao, An infrastructure modelling tool for cloud provisioning, in 2017 IEEE International Conference on Services Computing (SCC) (2017), pp. 354–361. https://doi.org/10.1109/SCC.2017.52
12. J. Sandobalin, E. Insfran, S. Abrahao, On the effectiveness of tools to support infrastructure as code: model-driven versus code-centric. IEEE Access 8, 17734–17761 (2020). https://doi.org/10.1109/access.2020.2966597
13. O. Stephen, 6 Strategies for Migrating Applications to the Cloud (AWS Cloud Enterprise Strategy Blog, Amazon Web Services, Inc., 2016). https://aws.amazon.com/jp/blogs/enterprise-strategy/6-strategies-for-migrating-applications-to-the-cloud/. Last accessed 30 Mar 2022

Chapter 3

Pattern to Improve Reusability of Numerical Simulation

Junichi Ichimurar and Takako Nakatani

Abstract In recent years, products have become more complex due to more numerous and more sophisticated requirements, and development time has become shorter to respond to changes in the market. Therefore, a reusable system is important for the efficient use of simulation. In this paper, we propose patterns for reusable simulation in modeling languages and describe how to apply them. The pattern divides the simulation target into State and Actions and relates them through connections across which the through variables sum to zero. Furthermore, the Strategy pattern is applied to increase reusability during Action development. We also apply the pattern using the Modelica language, an object-oriented physical modeling language. By applying these patterns, a new reusable simulation library can be created from scratch.

Keywords Physical modeling · Object-oriented · Design pattern · Modelica language

3.1 Introduction

In recent years, products in the automobile industry have become more complicated than ever due to the demand for CASE (Connected, Autonomous, Shared & Services, Electric). The product development cycle must also be shortened, so model-based development (MBD) using simulation in the upstream process is becoming more important. At the same time, manufacturing has focused on technologies that fuse “Real (Measurement)” and “Virtual (Simulation),” such as the Digital Twin. This makes it possible to examine virtually (by simulation) problems that are difficult to investigate experimentally. Thus, simulation technology has taken on an important role in the design and development process.

J. Ichimurar (✉) · T. Nakatani
The Open University of Japan, 2-11 Wakaba, Mihama-ku, Chiba, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Various et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_3


The following three points are important for the technology used to simulate physical behavior.

1. Highly readable
2. Accurate and fast
3. Highly expandable and reusable

To satisfy the above requirements, object-oriented modeling languages have been developed for numerical simulation (e.g., the Modelica language [1], the bond graph method [2], and VHDL-AMS [3]). These languages provide libraries for mechanics, electricity/magnetism, heat, hydraulics/pneumatics, etc., which are commonly used in the industrial field, making it easy to create and simulate models. However, when beginners make a new library, object-oriented thinking may not be well utilized, and the resulting library structure has low reusability. On the other hand, design patterns have been proposed for designing highly reusable object-oriented software [4]. In this paper, we consider patterns for modeling languages and libraries that make simulations highly expandable and reusable. We also consider how the patterns can be applied to build libraries, using the Modelica language as an example. This paper is organized as follows.

Section 2: Gives an overview of the Modelica language and previous research applying design patterns.
Section 3: Presents the patterns required for reusable simulation.
Section 4: Shows examples of the pattern applied to the moving ball model and the SIR model.
Section 5: Summarizes and discusses prospects.

3.2 Background

3.2.1 Modelica Language

This section describes the Modelica language, one of the physical modeling languages. The Modelica language is an object-oriented language designed to simulate large, complex physical phenomena [1, 5]. To improve readability, the physical phenomena to be reproduced are modeled graphically and intuitively by means of diagrams called object diagrams. In addition, the elements that reproduce the phenomena are created in a collection called a library, and mathematical expressions are defined inside the elements. These elements are used as instances to build simulation models. To enable graphical representation and calculation, the Modelica language has the following characteristics in how equations are formulated and implemented.


Fig. 3.1 a Partial excerpt of Modelica language. In the Modelica language, the above and below are equivalent expressions, b Mass-Spring-Damper model in Modelica, c process to calculation in Modelica language

Possible to express by equation

General programming languages consist of assignment statements, in which the value of the right-hand side is assigned to the variable on the left-hand side. In the Modelica language, on the other hand, a relationship can be described as an equation between the left-hand side and the right-hand side (see Fig. 3.1a). This feature allows physical phenomena to be expressed without regard to the computational procedure.

Acausal connections and through/across variables

In the Modelica language, a connection between elements has a special meaning. In an ordinary block diagram, the value of an output port is assigned to the input ports to which it is connected. In the Modelica language, on the other hand, through variables and across variables are defined, and at a connection the through variables sum to zero and the across variables are equal (Fig. 3.1b).

The actual calculation method of the formula is not defined

The Modelica language specification does not define how to solve the model. Therefore, each tool that supports Modelica interprets and analyzes the mathematical expressions itself (Fig. 3.1c).
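In equation form, the connection rule for through and across variables can be written as follows for $n$ connectors joined at one connection point. The symbols $a_i$ (across variable of connector $i$) and $f_i$ (through variable of connector $i$) are introduced here for illustration only and are not the notation of the original paper.

$$ a_1 = a_2 = \dots = a_n, \qquad \sum_{i=1}^{n} f_i = 0 $$

For an electrical connection, for example, the across variable is the voltage and the through variable is the current, so the second equation corresponds to Kirchhoff's current law.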

3.2.2 Adopting Design Patterns to Modelica

With the goal of designing reusable classes and libraries, E. Gamma et al. classified their experience in designing object-oriented software into 23 design patterns [4].


Fig. 3.2 a Numerical simulation modeling and reusability, b relationship between state and action

C. Clauß et al. have also applied these design patterns to Modelica with the goal of creating highly reusable libraries for different physical domains [6]. They provide examples of three design patterns.

Adapter: To adapt interfaces to different physical domains.
Strategy: To adapt different algorithm signals with time dependence.
Abstract Factory: To provide a unique interface for creating/using different parameter sets and equations for models.

On the other hand, they did not consider patterns that include the acausal connection feature described in the previous section. The use of acausal connections is important for making reusable libraries, but Modelica beginners may not be able to select appropriate through variables and may build a library with low reusability. In the following sections, we discuss patterns that include the connection feature and clarify the role of connections in improving reusability. We then discuss the process of applying the patterns.

3.3 Patterns for Physical Modeling and Their Adaptation

In this section, we consider patterns for improving reusability in numerical simulation models. When reuse is not a consideration, the simulation model is created by directly coding (formulating) the phenomenon. In this case, it is necessary to code each phenomenon separately. On the other hand, to reuse formulas that have been studied for similar phenomena, it is useful to decompose the phenomena into components and build a model by assembling the components (Fig. 3.2a). When we consider a model that simulates a change in state, the amount of change in state is determined based on the current state, and the next state is determined based on the amount of change in state (Fig. 3.2b, where “State” is the element that calculates the current state and “Action” is the element that calculates the amount of change in state). Since the amount of change in state differs for each phenomenon, this part is extracted, and the calculation of the amount of change is made into a component.


Fig. 3.3 To modify the model and easily to reuse

Fig. 3.4 Connection relationships and adding terms

This makes it possible to simulate various phenomena by selecting the components corresponding to the phenomena at model-building time (Fig. 3.3). When considering the reuse of components, it is important to be able to easily add and delete terms in the equation (Fig. 3.4). To achieve this, the “Action” class should be split into individual terms. In addition, the equation of “Action” before the division and the equation assembled from “Actions” after the division should be equivalent. Therefore, we define two variables for which the following relationships hold.

Through variables: A through variable is defined as a value proportional to the time variation calculated by each “Action” (the coefficient of proportionality depends on the attributes of the State, like mass in the case of translational motion). The sum of the through variables of the Actions at a connection is proportional to the amount of change in state; if the amount of change in state used by the “State” class is also defined as a through variable, the sum of all through variables at the connection can be expressed as zero.

Across variables: Across variables are defined as the state quantity variables. Across variables are equal across connections.

These two variables make it possible to split the equation and still convert it back to the original equation.
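As a sketch of why the split form is equivalent to the original equation, consider a single state quantity $s$ governed by an equation with $K$ terms. The symbols $c$, $T_k$, $f_0$, $f_k$ and the sign convention below are assumptions introduced for illustration; they are not taken from the original paper.

$$ c\,\frac{ds}{dt} = \sum_{k=1}^{K} T_k(s) $$

Let the State expose the through variable $f_0 = c\,\frac{ds}{dt}$ and let Action $k$ expose $f_k = -T_k(s_k)$, where $s_k$ is the across variable seen by that Action. The connection then imposes

$$ s_0 = s_1 = \dots = s_K = s, \qquad f_0 + \sum_{k=1}^{K} f_k = 0 , $$

and substituting the definitions of $f_0$ and $f_k$ gives back $c\,\frac{ds}{dt} - \sum_{k=1}^{K} T_k(s) = 0$, i.e., the original equation. Adding or deleting a term corresponds to connecting or disconnecting one Action, which changes only the number of summands and leaves the State and the other Actions untouched.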


Fig. 3.5 Physical modeling language pattern

These are summarized in the pattern of the class diagram shown in Fig. 3.5. The formula is divided into “State” and “Action,” which are related by “Through variables” the sum of which is zero between connections and “Across variables” the values of which are equal. Action is further divided by each term. The overall structure is like the Strategy pattern but adds a Connection relationship between State and Abstract Actions.

3.4 Case Study

In this section, we show examples of applying the proposed pattern to libraries for the Moving Ball and SIR models.

3.4.1 Moving Ball

In this section, we apply the pattern to the behavior of a single object, using Eqs. (3.1) and (3.2) for a falling ball and a rolling ball, respectively.

$$ m \frac{d^2 x}{dt^2} = -mg - b \frac{dx}{dt} \tag{3.1} $$

$$ m \frac{d^2 x}{dt^2} = -\mu m g \cos\theta - m g \sin\theta \tag{3.2} $$

Fig. 3.6 The use case of applying the pattern to ball behavior (See. 4.1)


Here, m is the mass of the ball, x is the displacement of the ball, g is the acceleration of gravity, b is the drag coefficient of air, μ is the coefficient of friction, and θ is the slope of the hill (Step 0).

First, we separate the State and Actions classes (see Fig. 3.6, Step 1). In this case, displacement x, velocity v, and acceleration a are the state quantities. Velocity is the time derivative of displacement, and acceleration is the time derivative of velocity. We define these attributes and this behavior in the State class.

Next, we define the connection settings (see Fig. 3.6, Step 2). The across variables are displacement x, velocity v, and acceleration a. The through variable is a value proportional to the time derivative of a state quantity, so it can be chosen to be proportional to velocity, acceleration, or jerk j (= da/dt). In this case, a value proportional to acceleration is chosen as the through variable, and mass m is selected as the coefficient of proportionality. The product of acceleration and mass has the dimension of force, so the through variable also satisfies the force balance. Note that if a value proportional to velocity is selected as the through variable, the through variable satisfies the momentum conservation law. In this case, the State defines the left-hand sides of Eqs. (3.1) and (3.2), and the Actions define the right-hand sides.

Then, we divide the Actions by term (see Fig. 3.6, Step 3). Equation (3.1) is divided into gravity and damping force, and Eq. (3.2) is divided into friction and the gravity force in the direction of the slope. Finally, we define the common part as the abstract class (see Fig. 3.6, Step 4). In this case, the abstract class defines the through and across variables exposed to the outside.

The left side of Fig. 3.7 shows the library created in these steps. It consists of one “State” component and four “Action” components. The right side of Fig. 3.7 shows the models and simulation results using this library. The results show falling with a damping force and rolling with a friction force, and the same phenomena are calculated as before splitting into components.
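The four steps can be mirrored outside Modelica. The following Python sketch is an illustration under stated assumptions, not the authors' library: the class names `BallState`, `Action`, `Gravity`, and `Damping`, the explicit time-stepping scheme, and the parameter values are all hypothetical. The State owns x, v, and m; each Action contributes one force term; and the force balance plays the role of the through-variable sum at the connection.

```python
from abc import ABC, abstractmethod


class Action(ABC):
    """Abstract Action: one term of the right-hand side of Eq. (3.1)."""

    @abstractmethod
    def force(self, x, v, m):
        """Return this term's force for the given state (across variables)."""


class Gravity(Action):
    def __init__(self, g=9.81):
        self.g = g

    def force(self, x, v, m):
        return -m * self.g


class Damping(Action):
    def __init__(self, b):
        self.b = b

    def force(self, x, v, m):
        return -self.b * v


class BallState:
    """State: holds displacement x, velocity v, and mass m."""

    def __init__(self, m, x=0.0, v=0.0):
        self.m, self.x, self.v = m, x, v

    def step(self, actions, dt):
        # Force balance at the connection: m * a - sum(forces) = 0.
        total_force = sum(action.force(self.x, self.v, self.m) for action in actions)
        acceleration = total_force / self.m
        # Simple time stepping (an assumption; Modelica tools choose their own solvers).
        self.v += acceleration * dt
        self.x += self.v * dt


def simulate(state, actions, dt, steps):
    """Advance the state by `steps` time steps of size `dt` and return it."""
    for _ in range(steps):
        state.step(actions, dt)
    return state
```

Passing `[Gravity(), Damping(b=2.0)]` reproduces Eq. (3.1), while passing only `[Gravity()]` gives free fall; in both cases `BallState` is unchanged, which is the reuse the pattern aims for. Friction and slope-gravity Actions for Eq. (3.2) could be added in the same way.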

Fig. 3.7 Left: moving ball library based on Fig. 3.6. Upper right: simulate falling ball with damping using this library. Upper left: simulate rolling ball with friction using this library

3.4.2 SIR Model

Next, we apply the patterns to the SIR model, a basic mathematical model of the epidemic dynamics of infectious diseases transmitted directly from person to person. This case study involves modeling that includes interactions between state elements.

$$ \begin{cases} \dfrac{d}{dt} S(t) = -\beta S(t) I(t) \\ \dfrac{d}{dt} I(t) = \beta S(t) I(t) - \gamma I(t) \\ \dfrac{d}{dt} R(t) = \gamma I(t) \end{cases} \tag{3.3} $$

S(t) is the susceptible population, I(t) is the infectious population, R(t) is the recovered population, β is the infection rate per infected person, and γ is the recovery rate. In this case, deaths are not considered for the sake of simplicity. Figure 3.8 shows the result of applying Steps 1 to 4 of the previous section. The model includes the state quantities of the susceptible, infectious, and recovered populations. Since they all represent a population, we select population p as the across variable. Also, the time derivative of the population, dp, is selected as the through variable. Actions consist of two classes, for the infection and recovery terms. In Step 4, when considering several independent state variables, we should consider whether there are any common relationships other than the through and across variables exposed to the outside. For both the infection and recovery terms, one state is related to an increase in population, while the other state is related to a decrease in population, and these changes in population are equal. Thus, this relationship is also added to the abstract class.

Fig. 3.8 Class diagram of SIR model

Figure 3.9 shows the model and simulation results using this library. The red line shows the change in the susceptible population, the green line shows the infectious population, and the blue line shows the recovered population. In addition, when considering death, the model can be easily changed, as in Fig. 3.10. The relationship between the time variation of deaths and each population is the same as in recovery, so the recovery component can be reused without modification.

Fig. 3.9 Modelica diagram of SIR model (left) and results (right)

Fig. 3.10 Modelica diagram of SIR+Death model (left) and results (right)
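The same structure can be sketched in Python for the SIR case. This is a hedged illustration rather than the authors' Modelica library: the names `Population`, `Transfer`, `Infection`, and `Recovery`, the explicit Euler stepping, and all parameter values are assumptions. The abstract `Transfer` class carries the common relationship described above, namely that what one population loses the other gains.

```python
from abc import ABC, abstractmethod


class Population:
    """State: holds the across variable p and accumulates the through variable dp."""

    def __init__(self, p):
        self.p = p
        self.dp = 0.0


class Transfer(ABC):
    """Abstract Action: whatever leaves `source` enters `sink` at the same rate."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink

    @abstractmethod
    def rate(self):
        """Return the transfer rate from source to sink."""

    def apply(self):
        r = self.rate()
        self.source.dp -= r
        self.sink.dp += r


class Infection(Transfer):
    def __init__(self, source, sink, beta):
        super().__init__(source, sink)
        self.beta = beta

    def rate(self):
        # beta * S * I, with S the source and I the sink population
        return self.beta * self.source.p * self.sink.p


class Recovery(Transfer):
    def __init__(self, source, sink, gamma):
        super().__init__(source, sink)
        self.gamma = gamma

    def rate(self):
        # gamma * I, proportional to the source population only
        return self.gamma * self.source.p


def simulate(states, actions, dt, steps):
    """Explicit Euler integration (an assumption made for this sketch)."""
    for _ in range(steps):
        for state in states:
            state.dp = 0.0
        for action in actions:
            action.apply()
        for state in states:
            state.p += state.dp * dt
```

To add deaths, a fourth `Population` can be connected to the infectious population through another `Recovery` instance with its own rate, which mirrors the reuse without modification described for Fig. 3.10.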


3.5 Conclusions

In this paper, we proposed a pattern for a reusable simulation library. The pattern is based on the Strategy pattern and divides the model into State and Actions. To add and remove terms without changing the structure of the library, through and across variables were defined as the relationship between Action and State. We also proposed how to apply the pattern. The first step is to divide the model into State and Actions. The next step is to define the through and across variables as the relationship between State and Actions to improve reusability. Then, the Actions are divided by term so that terms in the formula can be easily added or deleted. Finally, a common abstract class is defined for the Actions to make the library more expandable. By applying these patterns, a new reusable simulation library can be created from scratch for domains where no library is provided. In the future, we will investigate how pattern structures differ in other physical modeling languages from the aspect of reusability.

References

1. Modelica Homepage. https://modelica.org/index.html. Last accessed 11 Apr 2022
2. A.K. Samantaray, B.O. Bouamama, Model-Based Process Supervision: A Bond Graph Approach (Springer, London, 2008)
3. VHDL Analog and Mixed-Signal Extensions, IEEE Standard 1076.1-1999
4. E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley Professional, Boston, MA, 1994)
5. P. Fritzson, Principles of Object-Oriented Modeling and Simulation with Modelica 3.3: A Cyber-Physical Approach (Wiley, 2014)
6. C. Clauß, T. Leitner, A. Schneider, P. Schwarz, Object-oriented modelling of physical systems with Modelica using design patterns, in System Design Automation (Springer, Boston, MA, 2001), pp. 195–208

Chapter 4

SpiderTailed: A Tool for Detecting Presentation Failures Using Screenshots and DOM Extraction

Takato Okajima, Takafumi Tanaka, Atsuo Hazeyama, and Hiroaki Hashiura

Abstract In smartphone and Web applications, a major problem arises when the rendering result of a GUI component does not meet a defined specification. In this study, we refer to such problems as Presentation Failures. Although many methods have been proposed to detect Presentation Failures, most of them work by directly comparing images of the application output with the specification. Therefore, a large difference between the two may induce misdetections of Presentation Failures. This study proposes a method that extracts abstract visual properties from the output of the before and after versions of a web page and compares the corresponding extracted visual properties. Specifically, screenshots are taken from each image for each element, and visual properties are extracted using computer vision techniques. Furthermore, we implement the proposed method as a tool (SpiderTailed) and describe the results of an experiment comparing it with a manual observation test to confirm the effectiveness of the method.

Keywords Image processing · Regression testing · Presentation failure

T. Okajima (B) · H. Hashiura
Graduate School of Engineering, Nippon Institute of Technology, 4-1 Gakuendai, Miyashiro, Minami-Saitama, Saitama 345-8501, Japan

T. Tanaka
Faculty of Engineering, Tamagawa University, 6-1-1 Tamagawagakuen, Machida, Tokyo 194-0041, Japan

A. Hazeyama
Department of Information Science, Tokyo Gakugei University, 4-1-1 Nukuikita-machi, Koganei, Tokyo 184-8501, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Various et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_4


T. Okajima et al.

4.1 Introduction

Graphical user interfaces (GUIs) are the main form of user interface in today’s applications and have become indispensable. Pengnate et al. [1] point out that the appearance of a Web page influences user impressions and is a factor in commercial success. When the rendering result of a GUI component in a smartphone or Web application does not meet its specification, it can be a major problem. As an example, consider an application with buttons on the screen. If the buttons are misaligned and extend off the screen, or if their color blends in with the background color, the buttons may be difficult to find, or the user may not be able to use the functionality at all. A problem in which the rendering result of a GUI component does not meet the defined specification is called a Presentation Failure [2].

In general, testing to detect Presentation Failures is performed manually by visually checking the rendered Web page under test. However, the cost of such manual testing increases in proportion to the number of screens and pages. This problem has become even more pronounced with the increase in regression testing due to the spread of agile development. Many methods have been proposed to detect Presentation Failures automatically, and most of them work by directly comparing images. Consequently, a large difference between the two images may induce misdetections. This study proposes a method to detect Presentation Failures in regression testing by extracting abstract visual properties from the rendered output of the before and after versions of a web page and comparing the properties extracted from each.

The rest of this paper is organized as follows: In Sect. 4.2, we describe how to extract visual properties in our method; in Sect. 4.3, we explain how the proposed method is implemented; in Sect. 4.4, we describe how we evaluate our method; in Sect. 4.5, we present the evaluation results and discussion; in Sect. 4.6, we describe related work; in Sect. 4.7, we describe the limitations and validity of the proposed method; and in Sect. 4.8, we present our conclusion.

4.2 Proposed Method

Figure 4.1 shows an overview of the proposed method. The proposed method first receives the HTML of two different versions of a Web page as input and then performs the following four processes in order.

1. Take screenshots
2. Extract visual properties
3. Comparison of visual properties
4. Output of comparison results

4 SpiderTailed: A Tool for Detecting Presentation Failures …


Fig. 4.1 An overview of the proposed method

4.2.1 Take Screenshots

This procedure first generates the DOM from the web page under test. Next, a selector for each element is generated from the DOM. Then, using these selectors and the browser’s rendering of the HTML, a screenshot (partial image) of the element corresponding to each selector is obtained.
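The selector-generation step can be illustrated with a minimal sketch. It is not the tool's actual code: the tool uses Beautiful Soup, whereas this sketch uses only the Python standard library, and the selector scheme (a chain of `tag:nth-of-type(n)` steps), the class name `SelectorCollector`, and the function name `generate_selectors` are assumptions made for illustration. It also assumes well-formed markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SelectorCollector(HTMLParser):
    """Collect one selector per element while walking the markup.

    Hypothetical helper: assumes every non-void element is explicitly closed.
    """

    def __init__(self):
        super().__init__()
        self.path = []            # selector steps from the root to the open element
        self.child_counts = [{}]  # per open element: tag name -> children seen so far
        self.selectors = []

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        step = f"{tag}:nth-of-type({counts[tag]})"
        self.selectors.append(" > ".join(self.path + [step]))
        if tag not in VOID_TAGS:
            # The element stays open until its end tag arrives.
            self.path.append(step)
            self.child_counts.append({})

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.path:
            self.path.pop()
            self.child_counts.pop()


def generate_selectors(html):
    """Return the selectors of all elements in document order."""
    collector = SelectorCollector()
    collector.feed(html)
    return collector.selectors


if __name__ == "__main__":
    page = "<html><body><div><p>a</p><p>b</p></div><img src='x.png'></body></html>"
    for selector in generate_selectors(page):
        print(selector)
```

Each selector produced this way could then be handed to a browser automation tool such as Selenium to locate the element and capture its partial screenshot.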


Table 4.1 List of visual properties to be extracted

#  Property name     Property meaning
1  width             Element width
2  height            Element height
3  x-location        Absolute coordinate (x)
4  y-location        Absolute coordinate (y)
5  relative-x        Relative coordinate (x)
6  relative-y        Relative coordinate (y)
7  background-color  Background color
8  font-color        Font color

4.2.2 Extract Visual Properties

This procedure uses computer vision techniques to extract visual properties from the screenshots obtained in the previous procedure. Table 4.1 lists the visual properties to be extracted. The extraction method for each visual property is described below.

Width and height (width, height) are taken from the width and height of the screenshot of each element.

Absolute coordinates (x-location, y-location) are extracted by template matching and feature matching. Specifically, the screenshot corresponding to a selector is used as the search space, and the image of each element is matched against it as the search target. The coordinates obtained in this process are used as the visual properties of the absolute coordinates. SIFT [3] is used to obtain the feature points in the proposed method.

Relative coordinates (relative-x, relative-y) are extracted by template matching and feature matching, as for absolute coordinates. The difference lies in the search space: absolute coordinates use the screenshot, whereas relative coordinates use the parent element of the selector. The coordinates obtained from this process are used as the visual properties of the relative coordinates.

The background color (background-color) is extracted using histograms. A histogram of each RGB color is obtained from the screenshot of each element, and the color with the highest occupancy ratio is used as the background color visual property.

The font color (font-color) is extracted using a histogram and a text detection technique. The text is first detected in the screenshot of each element. Next, a histogram of each RGB color, excluding the background color, is computed from the image cropped to the bounding rectangle of the detected text. EAST [4] is used as the text detection technique in the proposed method.
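The histogram-based color extraction can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: pixels are given as plain (R, G, B) tuples instead of OpenCV images, the histogram counts whole RGB triples, and the font color is taken to be the most frequent non-background color in the detected text region (the text does not spell out this last selection step). The function names are hypothetical.

```python
from collections import Counter


def dominant_color(pixels, exclude=None):
    """Return the most frequent (R, G, B) tuple, optionally ignoring one color."""
    histogram = Counter(p for p in pixels if p != exclude)
    if not histogram:
        return None  # nothing left after excluding the given color
    return histogram.most_common(1)[0][0]


def extract_colors(element_pixels, text_region_pixels):
    """Estimate background-color and font-color for one element.

    element_pixels: pixels of the element's screenshot.
    text_region_pixels: pixels inside the bounding rectangle of the detected
    text (assumed to come from a separate text detector).
    """
    background = dominant_color(element_pixels)
    # Assumption: the font color is the dominant color once the background is removed.
    font = dominant_color(text_region_pixels, exclude=background)
    return {"background-color": background, "font-color": font}


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    element = [white] * 90 + [black] * 10
    text_region = [white] * 6 + [black] * 4
    print(extract_colors(element, text_region))
```

Excluding the background before counting matters because, inside the text bounding rectangle, background pixels can still outnumber glyph pixels.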


4.2.3 Comparison of Visual Properties

In this procedure, the visual properties extracted in the previous procedure are compared for each selector. The proposed method compares the values of every visual property except the absolute coordinates (x-location, y-location). If any of the values do not match, the element is detected as a Presentation Failure.
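A minimal sketch of this comparison rule is given below. The data layout (a dictionary from selector to property dictionary) and the function name are assumptions for illustration. The text does not say how elements present in only one version are handled, so this sketch simply skips them.

```python
# Properties excluded from the comparison, as described in the text.
IGNORED_PROPERTIES = {"x-location", "y-location"}


def find_presentation_failures(before, after):
    """Return {selector: [names of differing properties]}.

    before / after: dict mapping a selector to a dict of visual properties.
    Assumption: selectors that exist in only one version are not reported.
    """
    failures = {}
    for selector in sorted(before.keys() & after.keys()):
        differing = [
            name
            for name, value in before[selector].items()
            if name not in IGNORED_PROPERTIES and after[selector].get(name) != value
        ]
        if differing:
            failures[selector] = differing
    return failures


if __name__ == "__main__":
    before = {"#buy": {"width": 100, "x-location": 10, "font-color": "#000000"}}
    after = {"#buy": {"width": 120, "x-location": 50, "font-color": "#000000"}}
    print(find_presentation_failures(before, after))
```

Because the comparison is an exact match on each value, any small change in a compared property flags the element.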

4.2.4 Output of Comparison Results

In this procedure, a report is generated to present the test results to the user based on the comparison results obtained in the previous procedure. The report contains the following information for each selector:

1. Screenshots of the two different versions
2. Extracted visual properties of the two different versions
3. HTML code corresponding to the selector

4.3 Implementation

The proposed method was implemented as a tool, which the authors named SpiderTailed. Figure 4.2 shows the structure of the tool. Python 3.7 and the libraries described in the following subsections were used to implement the tool. Following the procedure described in the previous section, the tool was implemented as the following four functions.

1. Take screenshots function
2. Extract visual properties function
3. Comparison of visual properties function
4. Comparison results output function

Fig. 4.2 Construction of SpiderTailed


Fig. 4.3 Example of visual property extraction

4.3.1 Take Screenshots Function

The Take Screenshots function was implemented using Beautiful Soup [5] and Selenium [6]. Beautiful Soup is an HTML parsing library and is used to extract the DOM. Selenium is a front-end testing tool and is used to take screenshots.

4.3.2 Extract Visual Properties Function

The Extract Visual Properties function was implemented using OpenCV [7], an image processing library. The extracted visual properties of each GUI element are saved in XML format. Figure 4.3 shows an example of visual property extraction for a button on a Web page. Each extracted visual property is described as a child element of the property tag.
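Saving properties as children of a property tag can be sketched with the standard library as follows. Only the property tag and the property names come from the text. The enclosing `element` tag, its `selector` attribute, and the function names are hypothetical, since the exact schema shown in Fig. 4.3 is not reproduced here.

```python
import xml.etree.ElementTree as ET


def properties_to_xml(selector, properties):
    """Serialize one element's visual properties to an XML string.

    Assumed layout: <element selector="..."><property><width>..</width>...</property></element>
    """
    element = ET.Element("element", {"selector": selector})
    prop = ET.SubElement(element, "property")
    for name, value in properties.items():
        # Each visual property becomes a child element of the property tag.
        ET.SubElement(prop, name).text = str(value)
    return ET.tostring(element, encoding="unicode")


def xml_to_properties(xml_text):
    """Read the properties back as a {name: text} dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root.find("property")}


if __name__ == "__main__":
    xml_text = properties_to_xml("#submit", {"width": 120, "height": 40,
                                             "background-color": "#ffffff"})
    print(xml_text)
    print(xml_to_properties(xml_text))
```

Values are read back as strings, so a comparison over two such files compares the textual form of each property.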

4.3.3 Comparison of Visual Properties Function

The Comparison of Visual Properties function was implemented using the standard Python library to compare the aforementioned XML files.

4.3.4 Comparison Results Output Function

The Comparison Results Output function was implemented using the standard Python library. The report of comparison results is output in HTML format.


Fig. 4.4 Web page prepared for evaluation

4.4 Evaluation

An experiment was conducted to verify the effectiveness of the proposed tool. The experiment consisted of detecting Presentation Failures on web pages into which Presentation Failures had been artificially inserted, both with the proposed tool and by manual observation. Twelve subjects, undergraduate and graduate students at Nippon Institute of Technology, inspected two versions of a web page created in advance.

4.4.1 Preparing Web Page for Evaluation

Figure 4.4 shows the web page made for the evaluation. In the experiment, 127 elements of this web page were selected, and a second version of the page was made in which Presentation Failures were intentionally inserted into 47 randomly selected elements. In this experiment, a Presentation Failure refers to any difference, however small, between the two Web pages.

4.4.2 Establishment of Evaluation Criteria

In this study, the following three research questions (RQs) were established to clarify the evaluation criteria.

RQ1 What are the strengths of the proposed tool compared to manual observations?
RQ2 Is the time consumption for the proposed tool practical?
RQ3 What are the challenges of the proposed method?


Table 4.2 Classification of test results

                                 Whether the corresponding selector has a Presentation Failure or not
SpiderTailed or manual result    Presentation failure     Not presentation failure
Presentation failure             True Positive (TP)       False Positive (FP)
Not presentation failure         False Negative (FN)      True Negative (TN)

To answer RQ1, we classified the test results of the proposed tool and manual observations. Table 4.2 shows the classification of the test results. Based on the classified test results, comparisons were made by calculating the Accuracy, Precision, Recall, Specificity, and F-measure.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{4.1} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{4.2} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{4.3} \]

\[ \text{Specificity} = \frac{TN}{FP + TN} \tag{4.4} \]

\[ \text{F-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{4.5} \]
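The five measures in Eqs. (4.1) to (4.5) can be computed directly from the four counts. The sketch below is illustrative only. The function name is hypothetical, the example counts are the SpiderTailed row reported in Table 4.3, and no guard against zero denominators is included.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the measures of Eqs. (4.1)-(4.5) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (fp + tn),
        "f_measure": 2 * recall * precision / (recall + precision),
    }


if __name__ == "__main__":
    # Counts taken from the SpiderTailed row of Table 4.3.
    spider = classification_metrics(tp=47, tn=55, fp=26, fn=0)
    print({name: round(value, 3) for name, value in spider.items()})
    # prints accuracy 0.797, precision 0.644, recall 1.0,
    # specificity 0.679, f_measure 0.783
```

These rounded values agree with the SpiderTailed row of Table 4.4.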

To answer RQ2, the tool was run 10 times, and the recorded time consumption was compared with the time consumption of the manual test recorded in the experiment. To answer RQ3, we investigated which visual properties caused the misdetections (FP, FN) of GUI elements in Experiment 1.

4.4.3 Experiment 1: Evaluation of SpiderTailed

In the evaluation using SpiderTailed, visual properties were first generated from the two versions of the Web page using the proposed tool. The test results were then verified by checking the report that the tool output after comparing the generated visual properties.

Table 4.3 Classification of test results for experiments 1 and 2

#  Method            TP   TN   FP  FN
1  SpiderTailed      47   55   26  0
2  Manual (sum)      328  878  94  236
3  Manual (average)  27   73   8   20

Table 4.4 Comparison of accuracy, precision, recall, specificity, and F-measure

#  Method        Accuracy  Precision  Recall  Specificity  F-measure
1  SpiderTailed  0.797     0.644      1.000   0.679        0.783
2  Manual        0.785     0.777      0.582   0.903        0.665

4.4.4 Experiment 2: Evaluation of Manual Observation

For the manual observation evaluation, subjects opened the two versions of the web page in their browsers and visually inspected them. The browser used by the subjects was Google Chrome, and a sheet for recording Presentation Failures was distributed in advance. The recording sheet was pre-printed with the web page into which the Presentation Failures had been intentionally inserted, and the subjects were asked to mark on the sheet any part that appeared to be a Presentation Failure. The test had a time limit of 8 min (480 s).

4.5 Results and Discussion

Table 4.3 shows the classification of test results for Experiments 1 and 2. To confirm the effectiveness of SpiderTailed, the numbers of detections (TP + TN) and misdetections (FP + FN) were calculated from the tabulated results, and a χ² test was performed. The test showed no statistically significant difference between the results (p = 0.76).
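The χ² test on detections versus misdetections can be sketched as follows. The text does not state which variant of the test was used or which manual counts were entered. This sketch assumes Pearson's χ² test on a 2 × 2 table without continuity correction, using the summed manual counts from Table 4.3. Under these assumptions the p-value comes out at about 0.76. The function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Assumption: no continuity correction. Returns (statistic, p_value).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With one degree of freedom, the upper-tail probability of the
    # chi-square distribution reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Rows: SpiderTailed, Manual (sum); columns: TP + TN, FP + FN (Table 4.3).
    statistic, p_value = chi_square_2x2(47 + 55, 26 + 0, 328 + 878, 94 + 236)
    print(round(statistic, 4), round(p_value, 2))
```

A p-value this large means the difference in the proportion of correct classifications between the tool and the subjects is not statistically significant.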

4.5.1 RQ1: What Are the Strengths of SpiderTailed Compared to the Manual Observation?

Table 4.4 shows the accuracy, precision, recall, specificity, and F-measure calculated from Table 4.3.


Fig. 4.5 Comparison of the time consumption

Table 4.5 Breakdown of time consumption for SpiderTailed

#  Number of process               1      2       3     4
1  Time consumption (average, s)   72.45  147.13  0.01  1.13

In the manual observation, there were many false negatives, resulting in a markedly lower Recall. SpiderTailed, on the other hand, produced many false positives, resulting in markedly lower Precision and Specificity. However, SpiderTailed outperformed manual observation in terms of Accuracy, Recall, and F-measure. These results show that SpiderTailed is capable of testing with high Accuracy, Recall, and F-measure.

4.5.2 RQ2: Comparison of the Time Consumption

Figure 4.5 compares the time consumption for Experiment 1 and Experiment 2. With manual observation, the majority of the subjects used the entire time limit (480 s). In contrast, SpiderTailed completed the test in an average of 221 s. Table 4.5 shows a breakdown of the time consumption for SpiderTailed. In Experiment 1, the second process (extracting visual properties) took the most time. This is because SpiderTailed uses a neural network-based method without a GPU. These results show that SpiderTailed runs in a practical amount of time and is faster and more stable than manual testing.


Table 4.6 Classification of the misdetection based on root cause

#  Property name  width  height  relative-x  relative-y  background-color  font-color
1  FP             4      2       3           4           4                 2
2  FN             0      0       0           0           0                 0

4.5.3 RQ3: Challenges of the SpiderTailed Method

Table 4.6 tabulates the visual properties that caused false positives in the experiment using SpiderTailed (Experiment 1). First, the number of misdetections is discussed. In Experiment 1, the most common causes of misdetections were the visual properties width, relative-y, and background-color.

Next, the specific causes of the misdetections are discussed. Misdetections for relative-x, relative-y, width, height, and background-color were caused by resizing the parent element. Of these, width and height could be improved by comparing the ratio to the size of the parent element instead of comparing the raw values. Similarly, relative-x and relative-y could be improved by comparing positional relationships rather than coordinates.

Misdetection for font-color was caused by the algorithm of the SpiderTailed method. When extracting visual properties, the extraction result is uniquely determined based on a fixed condition, so incorrect visual properties were sometimes selected as extraction results. This problem is difficult to resolve because of the limitations of the algorithm. These results show that the performance of SpiderTailed can be further improved by addressing the causes of misdetection.
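The ratio-based improvement suggested for width and height can be illustrated with a short sketch. This is not part of SpiderTailed: the function names, the dictionary layout, and the tolerance value are all hypothetical, and the sketch only shows the idea of comparing sizes relative to the parent element.

```python
def size_ratio(element, parent):
    """Return (width ratio, height ratio) of an element relative to its parent."""
    return (element["width"] / parent["width"],
            element["height"] / parent["height"])


def same_relative_size(before, before_parent, after, after_parent, tolerance=0.01):
    """True if the element occupies the same share of its parent in both versions.

    tolerance is a hypothetical threshold on the difference between ratios.
    """
    ratios_before = size_ratio(before, before_parent)
    ratios_after = size_ratio(after, after_parent)
    return all(abs(b - a) <= tolerance for b, a in zip(ratios_before, ratios_after))


if __name__ == "__main__":
    # The parent grows from 200x100 to 300x150 and the child scales with it,
    # so a raw comparison of width and height would flag the child.
    print(same_relative_size({"width": 100, "height": 50}, {"width": 200, "height": 100},
                             {"width": 150, "height": 75}, {"width": 300, "height": 150}))
```

With such a rule, a child that merely scales with a resized parent would no longer be reported, while a child whose share of the parent changes still would be.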

4.6 Related Work

In regression testing of web applications, REG SUIT [8] is an existing tool for detecting differences in rendering results between different versions of web pages. REG SUIT supports regression testing by giving the user feedback on the differences in images and code between the before and after versions of a web page. Hallé et al. [9] implemented Cornipickle, a specification description language for the GUI of Web pages, and proposed a method to detect Presentation Failures by comparing the described specification with the DOM properties of the actual Web page. Mahajan et al. [10] proposed an approach that automatically debugs the GUI of a Web page by comparing a mockup diagram of the Web page with a screenshot of the implemented Web page and automatically determining whether there is a Presentation Failure using Perceptual Image Differencing (PID).


Fig. 4.6 Examples of limitations

Tanno and Adachi [11] proposed a semi-automatic method to efficiently check whether a screen is normal by obtaining the difference between the correct screen and the screen to be compared and feeding the result back to the tester, and pointed out that this can reduce the time required.

Unlike the semi-automatic method of [11], our method requires only two Web pages and detects Presentation Failures fully automatically. In addition, whereas other methods directly compare images [8, 10, 11], this research compares visual properties, so when a Presentation Failure is detected, its cause can be identified by referring to the values of the extracted visual properties. Finally, although SpiderTailed does not accept abstract input such as mockup diagrams of Web pages [10], it does not require the user to describe the state of the GUI in a proprietary language [9].

4.7 Limitations and Validity

4.7.1 Limitations

In SpiderTailed, false positives occur when the element from which the visual property is extracted is affected by changes in its parent element. For example, if no background color is set for the extraction target and the background color of the parent element is changed, the background color of the extraction target also changes (Fig. 4.6). The method does not take these effects into account and therefore detects a Presentation Failure for both elements.

4.7.2 Validity

Some of the Presentation Failures in this experiment may be unrecognizable by manual observation. It should therefore be noted that what SpiderTailed can detect differs from what manual observation can detect.


4.8 Conclusion

In this study, we proposed a method (SpiderTailed) to detect Presentation Failures by extracting the visual properties of each GUI element from screenshots of a web page and comparing the before and after versions. Experimental results showed that testing with the tool is faster than manual testing and achieves higher Accuracy, Recall, and F-measure than manual observation.

Acknowledgements This work was supported by JSPS KAKENHI Grant Number 21K12179.

References

1. S.F. Pengnate, R. Sarathy, An experimental investigation of the influence of website emotional design features on trust in unfamiliar online vendors. Comput. Human Behav. 67, 49–60 (2017)
2. K. Moran, B. Li, C. Bernal-Cárdenas, D. Jelf, D. Poshyvanyk, Automated reporting of GUI design violations for mobile apps, in Proceedings of the 40th International Conference on Software Engineering ICSE ’18 (Association for Computing Machinery, New York, NY, USA, 2018), pp. 165–175. https://doi.org/10.1145/3180155.3180246
3. D.G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94
4. X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, EAST: an efficient and accurate scene text detector, in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 2642–2651. https://doi.org/10.1109/CVPR.2017.283
5. BeautifulSoup, https://www.crummy.com/software/BeautifulSoup/. Accessed 11 Apr 2022
6. Selenium, https://www.selenium.dev/. Accessed 11 Apr 2022
7. OpenCV, https://opencv.org/. Accessed 11 Apr 2022
8. REG SUIT, https://reg-viz.github.io/reg-suit/. Accessed 11 Apr 2022
9. S. Hallé, N. Bergeron, F. Guérin, G. Le Breton, O. Beroual, Declarative layout constraints for testing web applications. J. Logical Algebraic Methods Programming 85(5, Part 1), 737–758 (2016). Special Issue on Automated Verification of Programs and Web Systems. https://doi.org/10.1016/j.jlamp.2016.04.001, https://www.sciencedirect.com/science/article/pii/S2352220816300293
10. S. Mahajan, W.G.J. Halfond, Detection and localization of HTML presentation failures using computer vision-based techniques, in Proceedings of the 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST) (2015), pp. 1–10. https://doi.org/10.1109/ICST.2015.7102586
11. H. Tanno, Y. Adachi, Support for finding presentation failures by using computer vision techniques, in Proceedings of the 2018 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW) (2018), pp. 356–363. https://doi.org/10.1109/ICSTW.2018.00073

Part II

AI/ML-Based Software Development

Chapter 5

Collecting Insights and Developing Patterns for Machine Learning Projects Based on Project Practices

Hironori Takeuchi, Kota Imazaki, Noriyoshi Kuno, Takuo Doi, and Yosuke Motohashi

Abstract Machine learning (ML) techniques have been introduced into various domains in recent years. Thus, it is important to construct reusable knowledge on projects that develop ML-based service systems so that such projects can be implemented effectively. In this study, the collection of insights as well as the development of architecture and design patterns for ML-based service systems are considered. We propose a method for collecting insights and developing patterns for ML projects by referring to a development model based on project practices.

Keywords Machine learning · ML service system · Pattern

H. Takeuchi (B)
Musashi University, Tokyo, Japan

K. Imazaki
Information-technology Promotion Agency, Tokyo, Japan

N. Kuno
Mitsubishi Electric Corporation, Tokyo, Japan

T. Doi
Lifematics Inc., Tokyo, Japan

Y. Motohashi
NEC Corporation, Tokyo, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Various et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_5

5.1 Introduction

At present, numerous machine learning (ML) techniques are available as application programming interfaces (APIs). Therefore, it is possible to use ML techniques for practical business applications. Accordingly, enterprises have started to implement such techniques in their business functions. In this study, we consider projects for developing service systems in enterprises using ML APIs. New features have emerged in service systems that use ML techniques (ML-based service systems). Furthermore, when ML techniques are applied to these business functions, it is important to acquire training data on the target business domain. Thus, sufficient knowledge of, or prior experience in, the target business domain is essential. For this reason, representatives of both the IT division and the relevant business division are required to participate in the project. As a result, numerous challenges arise in ML service system development projects (ML projects) in the requirements, design, implementation, and test phases [6, 8].

It is necessary to collect insights from project practices and to create reusable knowledge for the effective undertaking of ML projects. In this study, we focus on the patterns of ML projects as reusable knowledge. Practitioners obtain insights through ML project practices and share them within their organizations. However, such insights are described in different formats and are not shared beyond organizations. We propose a method for constructing reusable knowledge for ML projects as patterns from project practices. In this method, we use a development model for ML-based service systems so that practitioners can provide the insights derived from their projects by referring to the model. Moreover, we define the steps by which practitioners construct patterns from the collected insights using an enterprise architecture (EA)-based generic ML architecture and design pattern model. By applying the proposed method in practice, we attempt to confirm that practitioners can collect insights on ML projects effectively and can construct patterns as reusable knowledge for conducting ML projects.

The remainder of this paper is organized as follows: In Sect. 5.2, we describe related studies. In Sect. 5.3, we introduce the ML-based service system and the architecture and design patterns for ML service systems, and we define the research hypothesis of this work. In Sect. 5.4, we present our method for constructing reusable knowledge for ML projects from project practices, and in Sect. 5.5, we demonstrate the practice in which we apply the proposed method. Finally, the results of the practice are discussed in Sect. 5.6, and the key points and future studies are summarized in Sect. 5.7.

5.2 Related Work

Through a review of the literature and surveys of several actual projects, many software engineering challenges that arise in projects for developing ML systems were identified in [6, 8], and the extraction of knowledge for conducting ML projects was emphasized. Best practices in ML projects for constructing knowledge were collected through a literature survey [9]. Furthermore, a general workflow for developing ML systems was introduced in [1] as reusable knowledge for ML projects. The role of data scientists in ML system development projects was discussed in [5]. By combining these research outcomes, a project model that represents the relations among the project activities, stakeholders, and project goals was subsequently proposed [12].

An architecture for representing an entire system is required as knowledge in practical projects in which big data analytics or ML techniques are applied [3]. Several reference architectures for development teams have also been proposed [2, 4]. In [17], a reference architecture for intelligent systems was presented, which combines digital strategies and architectures [18] with artificial intelligence.

The development of architecture and design patterns has been considered as a knowledge resource for software system architecture, and many related works have been published. In [16], several patterns focusing on the operational stability of ML systems were introduced. Moreover, through a systematic literature review, software engineering patterns for ML systems were identified and formalized in [14, 15]. These patterns are typically described as itemized documents and are mainly aimed at data scientists or ML application developers. The representation of ML patterns as EA-based models was investigated in [10].

5.3 Research Subject and Hypothesis

5.3.1 ML-based Service System

In this study, we use ArchiMate [13] as an EA modeling language. We use three business concepts and three application concepts that are defined in ArchiMate to represent a project that develops a system using ML techniques, as follows.

• Business service: an explicitly defined exposed business behavior.
• Business process: a sequence of business behaviors achieving a specific outcome.
• Business object: a concept that is used within a particular business domain.
• Application service: an explicitly defined exposed application behavior.
• Application component: an encapsulation of application functionality that is aligned to the implementation structure.
• Data object: data that are structured for automated processing.

Using this EA modeling approach, we represent a practical ML project in which we develop an ML-based service system for an enterprise function, as illustrated in Fig. 5.1.

Fig. 5.1 ML-based service system represented by ArchiMate

5.3.2 Architecture Design Pattern for ML Service Systems

A software design pattern is a form of reusable knowledge in software engineering. In software design patterns, best practices are formalized so that engineers can use them to solve typical problems that occur when an application or system is designed. Although no standard format exists, many patterns are described using the following items:

• Intent: Objective of the pattern
• Problem: Forces that the pattern seeks to resolve
• Solution: Suggested activities to solve the problem
• Context: Environmental information on the system
• Discussion: Pre-conditions or limitations for applying the pattern.
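The five items above can be captured in a small data structure, which makes it easy to check that a pattern description is complete. The sketch below is illustrative: the class name, field layout, and `missing_items` helper are assumptions, and the example text is abridged from the pattern summarized in Table 5.1.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DesignPattern:
    """One software design pattern described by the five items."""
    name: str
    intent: str
    problem: str
    solution: List[str]
    context: List[str]
    discussion: List[str]

    def missing_items(self):
        """Return the names of items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # Abridged from Table 5.1.
    pattern = DesignPattern(
        name="Data flows up, model flows down with federated learning",
        intent="Improve response time and prediction performance for local users",
        problem="A cloud-deployed model cannot respond in real time; privacy must be preserved",
        solution=[
            "Deploy the ML model to the local device",
            "Re-train the model on each device with locally collected data",
            "Average the difference models and update the cloud model",
        ],
        context=["The application runs on a personal local device"],
        discussion=["There are sufficient computing resources on the local device"],
    )
    print(pattern.missing_items())
```

Keeping the items as separate fields also helps when insights arrive in different formats, since each one has to be mapped onto the same five slots.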

Several security patterns [16], and architecture and design patterns [14, 15] have been introduced for ML service systems. For example, in [14], an ML system architecture pattern known as the “data flows up, model flows down with federated learning” pattern was described. This pattern, described using the items outlined above, is presented in Table 5.1. The solution that is proposed in this pattern is depicted in Fig. 5.2.

As indicated in the context field, the “data flows up, model flows down” pattern is implemented in ML systems that use mobile/edge devices, such as mobile phones, cameras, and IoT devices. A major retail company in the USA introduced an ML system to identify traffic problems in the parking lots of their stores.¹ They trained an ML model on the image data collected by the parking lot security cameras and deployed it at each store (model flows down). The model is applied to real-time images from the security cameras in the stores, and notifications are sent to the staff when traffic jams occur. When a store encounters an adversarial situation, e.g., a false positive or false negative, the application sends the corresponding set of images to the cloud for model re-training (data flows up).

¹ https://www.ibm.com/products/maximo/remote-monitoring.

Table 5.1 Data flows up, model flows down with federated learning pattern

Intent: Improve the response time for the input query and prediction performance based on the local users’ own queries and output results

Problem: The ML application cannot return its prediction results in real time if the ML model is deployed to the cloud environment. The prediction performance depends on the users’ own queries. If the data collected from the local device are stored in the cloud environment for the retraining, the user’s privacy and data confidentiality must be preserved

Solution:
1. Deploy the ML model to the local device
2. In each device, the ML model is re-trained by the locally collected data
3. The difference models provided by local devices are averaged and updated into the ML model on the cloud

Context:
• The application runs on a personal local device such as a smart phone
• This problem should be considered in the system design phase

Discussion:
• There are sufficient computing resources on the local device
• Re-training based on each user’s log data does not have any negative impact
• Prediction models can be combined with one another

Fig. 5.2 Solution proposed in “data flows up, model flows down with federated learning” pattern

Federated learning is a special case of this pattern. It is implemented on a Google Keyboard using Android, which is known as Gboard (see footnote 2). When users type texts, their phones store information regarding the current context and whether they selected the suggestions provided by Gboard. Gboard uses an ML model locally and re-trains the model using these stored data. Thus, Gboard reflects the specific behavior of users and the suggestions are improved. The difference between the original model and re-trained models is uploaded to Google Cloud. The base ML model in the cloud is updated at fixed intervals using these uploaded difference models and re-deployed.

Footnote 2: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html.
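Steps 2 and 3 of the solution in Table 5.1 (local re-training, then averaging the difference models into the cloud model) can be illustrated with a minimal sketch. It is not the Gboard implementation: the weight representation as named lists of floats and the function names `local_delta` and `apply_averaged_deltas` are assumptions made only for illustration, and the sketch uses an unweighted average.

```python
from typing import Dict, List

# Hypothetical representation: a model is a mapping from layer name to weights.
Weights = Dict[str, List[float]]


def local_delta(base: Weights, retrained: Weights) -> Weights:
    """Difference model computed on a device: retrained weights minus base weights."""
    return {
        name: [r - b for r, b in zip(retrained[name], base[name])]
        for name in base
    }


def apply_averaged_deltas(base: Weights, deltas: List[Weights]) -> Weights:
    """Average the uploaded difference models and add the result to the cloud model."""
    count = len(deltas)
    updated: Weights = {}
    for name, values in base.items():
        averaged = [
            sum(delta[name][i] for delta in deltas) / count
            for i in range(len(values))
        ]
        updated[name] = [v + a for v, a in zip(values, averaged)]
    return updated


# Two devices re-train the same base model on their own local data.
base_model = {"w": [1.0, 2.0]}
device_a = {"w": [2.0, 2.0]}
device_b = {"w": [1.0, 4.0]}

uploaded = [local_delta(base_model, device_a), local_delta(base_model, device_b)]
new_base = apply_averaged_deltas(base_model, uploaded)
```

Only the difference models leave the devices in this sketch; the locally collected data stay where they were gathered, which is the privacy property the pattern aims for.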


H. Takeuchi et al.

Fig. 5.3 Overview of proposed method

5.3.3 Research Hypothesis

As mentioned previously, the best practices or architecture and design patterns for ML-based service systems are based on literature surveys. This means that experts with sufficient knowledge of both software engineering and ML techniques need to analyze the literature and construct patterns. Moreover, the insights that are obtained through project practices are not systematized as best practices or patterns unless they are published. Although a large organization conducting various types of ML projects has published its own insights as patterns [7], it is not clear how it systematized the insights into such patterns. Within this context, we consider the following research question (RQ).

RQ How can practitioners construct reusable knowledge for ML projects from real project practices?

For this RQ, we propose a method to collect insights from project practices and construct reusable knowledge as patterns from the insights. Furthermore, we confirm the effectiveness of the proposed method through practice.

5.4 Proposed Method

5.4.1 Overview

In this study, we consider knowledge construction based on ML project practices. Figure 5.3 presents an overview of the proposed method. The proposed method consists of the following steps:

1. Prepare a development model based on ML project practices.
2. Derive insights from ML projects by referencing the development model.
3. Construct patterns from the collected insights.

5.4.2 Reference Development Model and Collection of Insights

In the proposed method, a development model for collecting insights is prepared as the first step. In this study, we used the agile development model for ML service systems proposed in [11]. This model was extended from the general agile development model and the ML workflow model based on actual ML project practices. Figure 5.4 depicts the development model. As the second step, practitioners provide their insights that are derived from the ML projects by referencing this development model. In this study, we arranged a workshop where practitioners participated and shared their insights.

5.4.3 Construction of Patterns from Collected Insights

Patterns are constructed from the collected insights as the third step. The insights provided by practitioners describe recommended activities for conducting ML projects effectively, as well as the project phase during which these activities are conducted. For example, the insight “We should consider the metrics on the reliability, safety, or fairness as well as that on the accuracy when defining the metrics of the project success” is a recommended activity for the project planning phase. In this section, we outline steps for constructing patterns from insights with this formatted information.

We use an EA model for the ML architecture and design patterns [10]. The relationships among key elements are sometimes not clearly described in the pattern documents, which hinders a common understanding of the patterns among stakeholders. In contrast, the EA model using ArchiMate represents the relationships among the pattern elements. Figure 5.5 presents a generic ML architecture and design pattern using ArchiMate [10]. In this model, the pattern elements in the ML architecture and design patterns can be connected to one another by means of relationships such as realization, composition, access, flow, serving, assignment, and association. For example, the solution, which is represented as a principle, is considered to realize the objective of the pattern, which is represented as an outcome. This relationship can be represented using the realization notation defined in ArchiMate.

The descriptions in the collected insights correspond to the “Solution in the pattern” and “Phase where the pattern is applied” in the generic model. We embody the generic model and construct specific pattern models according to the following steps:

Fig. 5.4 Agile development model for ML service systems


Fig. 5.5 Generic ML architecture and design pattern represented using ArchiMate

1. Obtain the “Objective of the pattern” by analyzing the “Solution in the pattern” described in the insight.
2. Analyze the issue to be solved by the solution and current situation, and derive the “Goal achieved by applying the pattern,” “Situation to be improved,” and “Assessment result of the situation to be improved.”
3. Analyze the exceptions in the solution and identify the pre-condition or limitation in the pattern.
4. Assess whether the solution can be applied only to the specific ML service system, and derive “System user” or “Device where system is running” if necessary.

Every model element corresponds to the description in the pattern documents. Therefore, the constructed pattern model can be converted into the pattern document.


Table 5.2 Summary of collected insights

| | Plan and design | Data | Training | App. development | Deployment and maintenance | Organization | Governance | Sum |
|---|---|---|---|---|---|---|---|---|
| Results of workshop | 8 | 4 | 7 | 2 | 3 | 4 | 0 | 28 |
| Results of Serban et al. | 0 | 5 | 11 | 3 | 6 | 3 | 1 | 29 |

Fig. 5.6 Pattern model constructed from collected insights

5.5 Practice

We hosted an online workshop at the Working Conference on Machine Learning Software Engineering (MLSE2021) in Japan on July 2, 2021. A total of 12 practitioners with some experience in ML projects participated in the workshop. All participants accessed an online canvas where the reference ML development model was presented, and posted the insights that they had obtained through their ML projects on the canvas. As a result, we collected 28 insights and categorized them based on viewpoints extended from those in [9]. Table 5.2 displays a summary of the collected insights.

From the collected insights, we selected “Confirm the processes and methods by which data are collected” as an example and applied the proposed pattern construction method. This insight corresponded to the description in the solution element in the pattern. By analyzing the purpose of the activity described in this insight, it was found that certain fields in the training data could not be used for the prediction because the runtime input for the prediction was not clearly defined when collecting the training data. This issue is known as “target leakage.” Therefore, the intent of the pattern was avoiding target leakage and the goal of the pattern was improving the accuracy. As a result, we obtained the pattern model for this insight, as illustrated in Fig. 5.6, and assigned the pattern name “Check the origin of the data.” The pattern descriptions were converted from the pattern model. Table 5.3 presents the constructed pattern. Through the example analysis, it was confirmed that the RQ could be addressed by the proposed method.

Table 5.3 Constructed pattern (“Check the origin of the data”)

Intent: Avoid target leakage whereby the ML model is trained by data that are not available at the prediction runtime

Problem: The runtime input data for the prediction are sometimes not clearly defined when training the model. As a result, the accuracy of the prediction cannot be obtained as expected

Solution: Confirm the processes and methods by which data are collected

Context: This pattern is applied when designing the overall ML system

Discussion: A mechanism for collecting data must exist
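The conversion from a pattern model to a pattern document can be sketched as follows. This is only an illustration under assumptions: the class `PatternModel`, its field names, and the functions `missing_elements` and `to_pattern_document` are hypothetical, and the sketch flattens the ArchiMate model into the five document fields of Table 5.3 rather than modeling elements and relationships.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class PatternModel:
    """Hypothetical, flattened stand-in for a constructed pattern model."""
    name: str
    intent: str
    problem: str
    solution: str
    context: str
    discussion: str


def missing_elements(model: PatternModel) -> List[str]:
    """Names of elements that are still empty, so gaps are visible before conversion."""
    return [f.name for f in fields(model) if not getattr(model, f.name).strip()]


def to_pattern_document(model: PatternModel) -> str:
    """Render every model element as one labelled line of the pattern document."""
    gaps = missing_elements(model)
    if gaps:
        raise ValueError(f"pattern model is incomplete: {gaps}")
    lines = [f"Pattern: {model.name}"]
    for f in fields(model):
        if f.name != "name":
            lines.append(f"{f.name.capitalize()}: {getattr(model, f.name)}")
    return "\n".join(lines)


# Values taken from Table 5.3.
check_origin = PatternModel(
    name="Check the origin of the data",
    intent="Avoid target leakage",
    problem="The runtime input data for the prediction are sometimes not clearly defined",
    solution="Confirm the processes and methods by which data are collected",
    context="This pattern is applied when designing the overall ML system",
    discussion="A mechanism for collecting data must exist",
)
document = to_pattern_document(check_origin)
```

Because each document field is generated from a model element, an empty element is detected before the document is produced, which mirrors the observation that a converted description has no missing elements.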

5.6 Discussion

In the proposed method, practitioners provide insights on their ML projects by referencing the development model. This reference model is based on the project practice and the insights are collected as best practices. In the implemented practice, 12 practitioners discussed their ML projects for two hours, based on which we could obtain almost as many insights as those in the literature survey-based method [9]. Therefore, it is expected that project insights can be collected effectively from practitioners using the proposed method. Table 5.2 demonstrates that we could collect insights that are different from those obtained by the literature survey-based method. This is because the detailed activities during the project planning stage were represented in the reference model. For example, we could obtain the following insights in the project planning stage:

• Agree with the business division on the business goal and the goal to be achieved by the ML-based service system.
• Agree with not only the goals, but also the available computing resources.
• Confirm both the potential users of ML-based service systems and the number of such users.


It is expected that the proposed method can be used complementarily with the literature survey-based method. However, it is not clear whether insights from ML project practices can be exhaustively collected using the proposed method. Furthermore, it is not confirmed whether the number or quality of the collected insights depends on the experiences or skills of the practitioners. The investigation of these items through continuous insight collection is necessary, which will be one of our future studies. Through the implemented practice, we confirmed that we could successfully construct patterns of ML projects using the collected insights with the proposed method. Moreover, we could obtain the pattern description without missing elements by converting the constructed model. This means that practitioners can systematize reusable knowledge on ML projects as patterns from the collected data without the support of experts with strong skills in software engineering and ML. However, when constructing the pattern models of ML projects, it is necessary to know the quality characteristics that are required for ML-based service systems. Thus, determining the typical issues or risks in ML-based service systems and systematizing them as knowledge will be investigated in future work.

5.7 Conclusions

In this study, we focused on projects for the development of ML service systems in which ML techniques are applied to enterprise functions. We considered a method for collecting insights on ML projects from the practices and the construction of reusable knowledge as patterns from the collected insights. We proposed a reference development model based on real project practices, and the steps for constructing the patterns as models from the EA-based generic ML architecture and design pattern. Through the practice, we confirmed that practitioners could collect insights effectively and the patterns of ML projects could be constructed successfully from the collected insights. Future studies will focus on investigating the quality or coverage of the collected insights and systematizing the typical issues or risks in ML-based service systems as the knowledge required for the pattern development.

Acknowledgements This work was supported by a JSPS Grant-in-Aid for Scientific Research (KAKENHI), Grant No. JP19K20416, and the JST-Mirai Project (Engineerable AI Techniques for Practical Applications of High-Quality Machine Learning-based Systems), Grant No. JPMJMI20B8.

References

1. S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, T. Zimmermann, Software engineering for machine learning: a case study, in Proceedings of the 41st International Conference on Software Engineering (2019), pp. 291–300
2. Y. Demchenko, C. de Laat, P. Membrey, Defining architecture components of the big data ecosystem, in Proceedings of the International Conference on Collaboration Technologies and Systems (CTS) (2014), pp. 104–112
3. S. Earley, Analytics, machine learning, and the internet of things. IEEE ITPro 17(1), 10–13 (2015)
4. J. Heit, J. Liu, M. Shah, An architecture for the deployment of statistical models for the big data era, in Proceedings of IEEE International Conference on Big Data (2016), pp. 1377–1384
5. M. Kim, T. Zimmermann, R. DeLine, A. Begel, The emerging role of data scientists on software development teams, in Proceedings of the 38th International Conference on Software Engineering (2016), pp. 96–107
6. F. Kumeno, Software engineering challenges for machine learning applications: a literature review. Intell. Decis. Technol. 13, 463–476 (2019)
7. V. Lakshmanan, S. Robinson, M. Munn, Machine learning design patterns: solutions to common challenges in data preparation, model building, and MLOps. O’Reilly (2020)
8. L.E. Lwakatare, A. Raj, J. Bosch, H.H. Olsson, I. Crnkovic, A taxonomy of software engineering challenges for machine learning systems: an empirical investigation, in Proceedings of the 20th International Conference on Agile Software Development (XP) (2019), pp. 227–243
9. A. Serban, K. van der Blom, H. Hoos, J. Visser, Adoption and effects of software engineering best practices in machine learning, in Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (2020), pp. 3:1–3:12
10. H. Takeuchi, T. Doi, H. Washizaki, S. Okuda, N. Yoshioka, Enterprise architecture based representation of architecture and design patterns for machine learning systems, in Proceedings of the 13th Workshop on Service oriented Enterprise Architecture for Enterprise Engineering (IEEE 25th EDOC Workshop) (2021), pp. 246–250
11. H. Takeuchi, H. Kaiya, H. Nakagawa, S. Ogata, Reference model for agile development of machine learning-based service systems, in Proceedings of the 3rd International Workshop on Machine Learning Systems Engineering (Companion Proceedings of the 28th Asia-Pacific Software Engineering Conference) (2021), pp. 115–118
12. H. Takeuchi, S. Yamamoto, AI service system development using enterprise architecture modeling, in Proceedings of the 23rd International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (Procedia Computer Science vol. 159) (2019), pp. 923–932
13. The Open Group: ArchiMate 3.1 - A Pocket Guide. Van Haren Publishing (2019)
14. H. Washizaki, F. Khomh, Y.G. Guéhéneuc, H. Takeuchi, S. Okuda, N. Natori, N. Shioura, Software engineering patterns for machine learning applications (SEP4MLA) - part 2, in Proceedings of the 27th Conference on Pattern Languages of Programs (PLoP 2020) (2020)
15. H. Washizaki, H. Uchida, F. Khomh, Y.G. Guéhéneuc, Software engineering patterns for machine learning applications (SEP4MLA), in Proceedings of the 9th Asian Conference on Pattern Languages of Programs (AsianPLoP 2020) (2020)
16. H. Yokoyama, Machine learning system architectural pattern for improving operational stability, in Proceedings of IEEE International Conference on Software Architecture Companion (2019), pp. 267–274
17. A. Zimmermann, R. Schmidt, D. Jugel, M. Möhring, Evolution of enterprise architecture for intelligent digital systems, in Proceedings of the 14th International Conference on Research Challenges on Information Science (2020), pp. 145–153
18. A. Zimmermann, R. Schmidt, K. Sandkuhl, D. Jugel, J. Bogner, M. Möhring, Evolution of enterprise architecture for digital transformation, in Proceedings of the IEEE 22nd International Enterprise Distributed Object Computing Workshop (2018), pp. 87–96

Chapter 6

Supporting Code Review by a Neural Network Using Program Images

Kazuhiko Ogawa and Takako Nakatani

Abstract In system development, code reviews are conducted to improve the quality of programs. The purpose of this study is to support code reviews and thereby improve the quality of the program. To support code review, we use a convolutional neural network (CNN) that was trained with programs labeled “withDefects” or “withoutDefects”. The training data are images of program fragments. In this paper, we call the trained CNN the CNN-BI (CNN based Bug Inference) system. The CNN-BI system infers whether or not there is a defect in the image of a program fragment. In order to validate the effectiveness of the CNN-BI system for review, we conducted experiments. In the experiments, we provided reviewers with a list of the results inferred by CNN-BI for each program fragment, and we analyzed the results of their reviews. As a result of the analysis, the effectiveness was observed in some examples. In this paper, we describe a method for making the training data and the results of the experiments.

Keywords Bug inference · Convolutional neural network · Image of source code · Code review

K. Ogawa (B) · T. Nakatani
The Open University of Japan, 2-11, Wakaba, Mihama-ku, Chiba 261-8586, Japan
e-mail: [email protected]
T. Nakatani
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Virvou et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_6

6.1 Introduction

Code review is one of the ways to find defects in a program. The reviewer checks whether the developed program satisfies the program specifications. If a defect is found by the review, the program will be corrected. In order to improve the quality of the program by code review, it is necessary to be able to point out as many defects as possible. The purpose of this research is to support code review so that defects in the program can be found efficiently.

The ability to find images of cats from many images with high accuracy [7] has prompted research into applying CNNs to image recognition. The authors are surrounded by developers who can point out defects by just glancing at the program. Based on these research results and our experience, we expected that if we trained a CNN with images of programs labeled with defects, the CNN would be able to infer the presence or absence of defects in the program. In this paper, a CNN trained with program fragments labeled with the presence of defects is referred to as the CNN-BI system. In order to evaluate whether or not this research has achieved its purpose, we answer the following two research questions.

• RQ1: Can the CNN-BI system that has learned the attributes of programs with defects infer the existence of defects in a program?
• RQ2: Can a reviewer who uses the results inferred by the CNN-BI system point out more defects than a reviewer who does not refer to the results?

The structure of this paper is as follows: In Sect. 6.2, we present related work. Section 6.3 presents examples of training data and the machine learning process in order to introduce the CNN-BI system. In Sect. 6.4, the design of our experiment is described. In Sect. 6.5, we discuss the results of the experiments and in the final section conclude this paper with our future work.

6.2 Related Work

Cyclomatic complexity developed by T. McCabe [10] is a metric that quantifies the complexity of a program by the complexity of its control structure. A program with a high cyclomatic complexity value can be regarded as a complex program and therefore it may contain defects. Thus, it is necessary to focus the review on the parts of the program where the value of the cyclomatic complexity is high. Metrics for quantifying the complexity of object-oriented programs have also been developed [2, 8]. There are other ways to find error-prone program fragments that should be reviewed with focus.

Recently, research has been conducted to apply deep learning to improve the quality of software. Kondo et al. [6] applied deep learning to verify that a modification does not introduce any defects before it is committed. By using the source code of the modified part of the program, they predicted the inclusion of defects due to the modified program. Morisaki et al. [11] focused on the beauty of the appearance of programs to evaluate the quality of a program. They considered that a beautiful program is an easy-to-read program, and that if the program is easy to read, its quality will be improved. Thus, they applied CNN to evaluate the beauty of programs. Chen et al. [1] applied deep learning techniques to predict whether a program is defective. They developed an algorithm to convert character codes into RGB color codes. Using this algorithm, programs can be converted to an image composed of color pixels. They applied the images as training data. They proposed a method for inferring whether or not there was a defect in the program using the converted image, and evaluated its effectiveness.

Their method converts one program into one image. Therefore, as the size of the program increases, it becomes difficult to identify the location of defects. To solve this problem, the method proposed in this paper divides the program into fragments of a certain number of lines and converts them into images. This makes it possible to infer the presence or absence of defects regardless of the size of the program. The reviewer’s effort to find defects should be reduced if defects are inferred only within a certain number of lines. In this paper, we evaluate whether or not the review efficiency can be improved by specifying programs that contain defects inferred by the CNN. Our next research topic is how to use the inference results of CNN to improve the quality of actual software. In this paper, we report the result of our experimental study in which we delivered a checklist based on the inference results to reviewers and validated the effectiveness in supporting reviews.
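As a concrete illustration of the cyclomatic complexity metric mentioned at the start of this section, the following sketch approximates it by counting decision points and adding one. It is an approximation under assumptions: it analyzes Python source with the standard `ast` module (the programs studied in this chapter are written in Visual Basic), and the function name and the chosen set of decision nodes are illustrative rather than a definitive implementation of McCabe's metric.

```python
import ast

# Node types treated as decision points in this approximation.
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def approx_cyclomatic_complexity(source: str) -> int:
    """Count decision points in Python source and add one for the entry path."""
    tree = ast.parse(source)
    decisions = 0
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of `and` / `or` adds one more branch.
            decisions += len(node.values) - 1
    return decisions + 1


SAMPLE = """
def classify(x):
    if x > 0 and x < 10:
        return 1
    elif x == 0:
        return 0
    for i in range(x):
        pass
    return -1
"""

sample_complexity = approx_cyclomatic_complexity(SAMPLE)
```

In the sample, the `if`, the `and`, the `elif`, and the `for` each contribute one decision point, so the approximation yields 5. A reviewer could rank functions by such a value to decide where to look first.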

6.3 CNN-BI System

6.3.1 Training Method and the Training Data

There are various ways to conduct a review efficiently [4, 5, 17]. In this study, in order to support the review, a checklist and programs are provided to reviewers. The checklist shows a label indicating the presence or absence of defects for each fragment of the programs. In order to create the checklist, we used supervised learning by labeling each program with the presence or absence of defects. We had programs developed in the past; therefore, we knew the presence or absence of defects in each program. In this research, we focused on the fact that humans can point out defective parts intuitively rather than through syntactic analysis. Hence, we used images of the programs for deep learning. We expected that a neural network trained on these data would be able to classify programs according to the presence or absence of defects. The CNN-BI system is one of the neural networks trained by supervised learning with images of programs [12–14]. Since the size of each program is different, we divided each program into multiple program fragments with the same number of lines and converted these fragments into images as training data. However, it was necessary to solve problems in imaging the program fragments.

6.3.2 Preparing the Learning Data

The training data were 44 programs of a sales management system developed by five developers with three to fifteen years of experience in Visual Basic 2008. In constructing the CNN-BI system, we explored the way to create the program images as follows:

• The size of program fragments: We split the program into program fragments of 30 lines each from the beginning. We have also attempted to divide a program into function units. However, as the size of the function increases, a single function is split into multiple fragments. The accuracy of inference was compared between two strategies: dividing the program into 30-line fragments, and splitting the program first into functions and then into 30-line fragments. No significant differences were found. Therefore, we chose the first strategy.
• Color code of programs: Except for monochrome, there was no significant difference in the learning results of color coding. Therefore, we referred to the color coding of Sakura editor (see footnote 1), a Japanese text editor that is suitable for programming and has a function to freely change the color coding for highlighting keywords. We applied the color-coded patterns to the program’s image based on the standard patterns of the editor as follows:
  – Statement: blue (FF0000)
  – Reserved word: orange (00A5FF)
  – Comment: green (008000)
  – The name of module: bright green (bold, 90EE90)
  – Global variable: deep blue (bold, 8B0000)
  – Local variable: black (000000)
  – UI object’s identification: pink (CBC0FF)
  – User defined function, the name of sub-routine: red (0000FF)
  – The name of class: brown (2A2AA5)
  – String: purple (800080)
• Background color: In machine learning trials with different background colors, the result of training data with a white background was acceptable.
• Training data: The images of the program fragments were saved as jpg files with the label of the presence or absence of known defects.
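The fragment splitting and color coding described above can be sketched as follows. This is a hedged illustration, not the authors' tool: the function names and dictionary keys are hypothetical, and the interpretation of the listed hex codes as blue-green-red byte order is an assumption inferred from the values themselves (for example, blue is given as FF0000 and red as 0000FF).

```python
from typing import Dict, List, Tuple

FRAGMENT_LINES = 30  # fragment size reported in the text

# Color codes as listed in the text; keys are hypothetical identifiers.
COLOR_CODES: Dict[str, str] = {
    "statement": "FF0000",         # blue
    "reserved_word": "00A5FF",     # orange
    "comment": "008000",           # green
    "module_name": "90EE90",       # bright green (bold)
    "global_variable": "8B0000",   # deep blue (bold)
    "local_variable": "000000",    # black
    "ui_object": "CBC0FF",         # pink
    "user_function": "0000FF",     # red
    "class_name": "2A2AA5",        # brown
    "string": "800080",            # purple
}


def bgr_hex_to_rgb(code: str) -> Tuple[int, int, int]:
    """Convert a six-digit hex code, assumed to be in BGR order, to an (R, G, B) tuple."""
    blue, green, red = (int(code[i:i + 2], 16) for i in (0, 2, 4))
    return (red, green, blue)


def split_into_fragments(source: str, size: int = FRAGMENT_LINES) -> List[Tuple[int, int, List[str]]]:
    """Split source text into consecutive fragments of `size` lines from the beginning.

    Each item is (start_line, end_line, lines) with 1-based, inclusive line numbers.
    The last fragment may be shorter than `size`.
    """
    lines = source.splitlines()
    fragments = []
    for start in range(0, len(lines), size):
        chunk = lines[start:start + size]
        fragments.append((start + 1, start + len(chunk), chunk))
    return fragments


# A 65-line program yields two full fragments and one 5-line remainder.
program = "\n".join(f"line {i}" for i in range(1, 66))
fragments = split_into_fragments(program)
```

Keeping the start and end line numbers with each fragment is what later allows an inference result for an image to be mapped back to a location in the original program.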

6.3.3 Check List

We create a checklist using the inference results of the CNN-BI system to assist in the review. The checklist consists of the group label of the presence or absence of defects, the start and end line numbers of each program fragment within the original program, the image file name of the program fragment, and the output of the SOFTMAX function. An example checklist is shown in Fig. 6.1.

Footnote 1: https://sakura-editor.github.io.

Fig. 6.1 A checklist example
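A checklist with the four kinds of fields described in Sect. 6.3.3 could be assembled roughly as in the sketch below. The class and function names, the image file names, and the order of the two softmax outputs (defective first) are assumptions for illustration; the actual layout is the one shown in Fig. 6.1.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ChecklistRow:
    label: str                    # "withDefects" or "withoutDefects"
    start_line: int               # first line of the fragment in the original program
    end_line: int                 # last line of the fragment in the original program
    image_file: str               # image file name of the program fragment
    softmax: Tuple[float, float]  # assumed order: (P(withDefects), P(withoutDefects))


def build_checklist(
    predictions: Sequence[Tuple[int, int, str, Tuple[float, float]]]
) -> List[ChecklistRow]:
    """Turn per-fragment inference results into checklist rows."""
    rows = []
    for start_line, end_line, image_file, probs in predictions:
        label = "withDefects" if probs[0] >= probs[1] else "withoutDefects"
        rows.append(ChecklistRow(label, start_line, end_line, image_file, probs))
    return rows


# Hypothetical inference results for two fragments of one program.
checklist = build_checklist([
    (1, 30, "pg1_001.jpg", (0.82, 0.18)),
    (31, 60, "pg1_002.jpg", (0.10, 0.90)),
])
flagged = [row for row in checklist if row.label == "withDefects"]
```

Filtering on the label gives the reviewer the line ranges to inspect first, while the softmax output indicates how confident the inference was.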

6.4 Experimental Study

6.4.1 Overview

We performed supervised learning to make the CNN-BI system learn the features of program defects. We verified the results and confirmed that the CNN-BI system was able to learn the presence or absence of defects in the program. The trained CNN-BI system was given images of program fragments generated from multiple programs to obtain a list of defective program fragments. We used the results to create a review checklist and evaluated its effectiveness in supporting review. The subjects in charge of the review were three engineers with 17, 18, and 20 years of experience, respectively. Based on their years of experience, we considered the subjects to have almost the same abilities.

6.4.2 Applying Supervised Learning

In order to make the CNN-BI system classify program fragments into two categories, i.e., the presence or absence of known defects, we applied supervised learning. The numbers of program images used as training data were 1,158 with defects and 654 without defects. Supervised learning was executed in Python using TensorFlow-gpu and Keras provided by Google on Windows 10 Home Edition. The configuration of the development environment was CPU: Core i7-6700K, main memory: 32 GB, GPU: Geforce RTX2700super. Since the number of training data was small, we applied VGG16 (VGG ILSVRC 16 Layers) [16] for transfer learning and cross-validation. Furthermore, it was confirmed that overfitting could be avoided and that the accuracy of learning was improved. In the supervised learning of the CNN-BI system, the last convolutional block (BLOCK5) and the fully connected layer of VGG16 were trained. We utilized Global Average Pooling (GAP) [9] for the fully connected layer, because the learning model using Global Average Pooling could suppress overfitting more than Flattening. The models and parameters used for supervised learning are as follows:

• The model of the final fully connected layers: GlobalAveragePooling2D()(x), Dense(256, activation='relu')(x), Dropout(0.5)(x)
• Data augmentation: channel_shift_range, brightness_range
• Learning parameters: optimizer SGD with lr=0.00002, nesterov=True, decay=1e-6

The classification result based on the presence or absence of a program defect is output via the SOFTMAX function.

Fig. 6.2 Heatmap of a result of a program in the category of with defects
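The configuration listed in Sect. 6.4.2 can be sketched with the Keras functional API as follows. This is a sketch under stated assumptions rather than the authors' code: the input size of 224 x 224, the ImageNet initial weights, the two-unit softmax output layer, and the categorical cross-entropy loss are not given in the text, and the function and dictionary names are hypothetical. TensorFlow is imported inside the function so that the configuration can be inspected without it.

```python
# Values taken from the text; key names are hypothetical.
TRAINING_CONFIG = {
    "trainable_block": "block5",   # last convolutional block of VGG16
    "dense_units": 256,
    "dropout_rate": 0.5,
    "learning_rate": 0.00002,
    "nesterov": True,
    "num_classes": 2,              # with defects / without defects
}


def build_cnn_bi_sketch(input_shape=(224, 224, 3), config=TRAINING_CONFIG):
    """Build a VGG16-based transfer-learning classifier (requires TensorFlow)."""
    from tensorflow import keras

    base = keras.applications.VGG16(
        weights="imagenet", include_top=False, input_shape=input_shape
    )
    # Freeze everything except the last convolutional block.
    for layer in base.layers:
        layer.trainable = layer.name.startswith(config["trainable_block"])

    x = keras.layers.GlobalAveragePooling2D()(base.output)
    x = keras.layers.Dense(config["dense_units"], activation="relu")(x)
    x = keras.layers.Dropout(config["dropout_rate"])(x)
    outputs = keras.layers.Dense(config["num_classes"], activation="softmax")(x)
    model = keras.Model(inputs=base.input, outputs=outputs)

    # The text also lists decay=1e-6. Older Keras releases accepted a `decay`
    # argument; it is omitted here because newer releases express decay through
    # learning-rate schedules. Momentum is not stated in the text; with the
    # Keras default of 0.0 the nesterov flag has no effect.
    optimizer = keras.optimizers.SGD(
        learning_rate=config["learning_rate"], nesterov=config["nesterov"]
    )
    model.compile(
        optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"]
    )
    return model
```

Calling `build_cnn_bi_sketch()` downloads the VGG16 weights on first use. The data augmentation options named in the text (`channel_shift_range`, `brightness_range`) are left out because their values are not reported.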

6.4.3 Visualization of the Training

Image features can be learned by deep learning. The validity of the learning was confirmed by a heatmap [3, 15] that visualized the learning results. Figures 6.2 and 6.3 show the heatmaps when the CNN-BI system learned with a program fragment with defects and a program fragment without a defect, respectively. Note that the left side of each figure is the program image used for training, and the right side of the figure is the heatmap.

Fig. 6.3 Heatmap of a result of a program in the category of without defects

A heatmap visualizes the most focused part and unfocused part of a program using colors on a scale from red to blue. We validated whether the CNN-BI system could properly learn the features of program fragments with defects. As a result, the heatmap that represented the case of a program with defects in the SQL statements properly showed the location of the SQL statements in red. Similarly, in other program images, the heatmap confirmed that the system focused on the defective part in the program. Thus, it was concluded that the CNN-BI system learned the presence or absence of defects by focusing on the locations where defects exist in the program images.

6.4.4 Verification of the Categorization

Five programs were used for verification, and these programs were divided in the same manner as the training data. As a result, we got 314 jpg files of program fragments. These programs were developed by five engineers. The development language was Visual Basic 2008. The domain of these programs was the sales management system. The domain and the program language were the same as the programs used for the training of the CNN-BI system. Note that we knew which program fragment was defective. We input these program fragments into the CNN-BI system to classify them into defective and non-defective groups. The results are shown in Table 6.1. The types of defects included in each program are different. In order to clarify the inference characteristics of the CNN-BI system, we analyzed how much and what kind of defects the system could infer.


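Table 6.1 Classification evaluation

| Program ID | #frag.a | #defective fragment | #inferred defect | #TP | #FP | #FN | Preci. | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| pg1 | 15 | 1 | 4 | 1 | 3 | 0 | 0.25 | 1.00 | 0.40 |
| pg2 | 28 | 6 | 4 | 1 | 3 | 5 | 0.25 | 0.17 | 0.20 |
| pg3 | 50 | 8 | 6 | 3 | 3 | 5 | 0.50 | 0.38 | 0.43 |
| pg4 | 86 | 30 | 31 | 16 | 15 | 14 | 0.52 | 0.53 | 0.52 |
| pg5 | 135 | 51 | 26 | 22 | 4 | 29 | 0.85 | 0.43 | 0.57 |
| Total | 314 | 96 | 71 | 43 | 28 | 53 | 0.61 | 0.45 | 0.51 |

a #frag.: the number of program fragments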
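The precision, recall, and F1 columns of Table 6.1 follow from the TP, FP, and FN counts by the standard definitions. The short sketch below recomputes them; the function name is hypothetical, and returning `None` for an undefined ratio is a choice made here, not something stated in the text.

```python
from typing import Optional, Tuple


def precision_recall_f1(
    tp: int, fp: int, fn: int
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Standard definitions; a ratio with a zero denominator is reported as None."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (TP, FP, FN) per program, taken from Table 6.1.
counts = {
    "pg1": (1, 3, 0),
    "pg2": (1, 3, 5),
    "pg3": (3, 3, 5),
    "pg4": (16, 15, 14),
    "pg5": (22, 4, 29),
    "Total": (43, 28, 53),
}
metrics = {
    name: tuple(round(value, 2) for value in precision_recall_f1(*c))
    for name, c in counts.items()
}
```

Rounded to two decimals, the recomputed values agree with the table, for example 0.85, 0.43, and 0.57 for pg5.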

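Table 6.2 Evaluation by type of defect

| Type of defects | #defects | TP | FP | FN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| SQL statement | 20 | 16 | 10 | 3 | 0.62 | 0.84 | 0.71 |
| IF block | 5 | 2 | 10 | 3 | 0.17 | 0.40 | 0.24 |
| FOR block | 3 | 2 | 1 | 1 | 0.67 | 0.67 | 0.67 |
| DO block | 1 | 0 | 0 | 1 | 0.00 | – | 0.00 |
| Others | 67 | 23 | 16 | 25 | 0.59 | 0.48 | 0.53 |
| Total | 96 | 43 | 37 | 33 | 0.54 | 0.57 | 0.55 |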

6.4.5 Types of Defects Inferred

Table 6.2 shows the types and numbers of defects that could be inferred within the program fragments by the CNN-BI system. The recall of “SQL statement” was 0.84. The defects in SQL statements should be reviewed and found by reviewers, since SQL statements are simple strings that are out of scope of the editor’s structural checking. There were nineteen defects in the SQL statements. This is a quarter of the defects within the program fragments. Since the recall for SQL statements was 0.84, we may be able to contribute to improving the quality of the program by presenting the inference results to the reviewers. To support the review, we developed a checklist based on the inference results of the CNN-BI system. An experimental study was conducted to evaluate the effectiveness of the checklist.

6.4.6 Review Process The process of the experimental study was as follows: 1. (Experiment1) The subjects review the five programs as usual and deliver a defects report for each program.

6 Supporting Code Review by a Neural Network …


The defect report consists of the content of each defect and the location (line number of the program) where the defect was found.
2. A period of 3 weeks was set so that the reviewers would forget the results of the review in Experiment1.
3. (Experiment2) The subjects use the checklist to review the five programs and create a defect report for each program.
4. The subjects submit their subjective impressions of the effectiveness of the checklist.

6.4.7 Result

In order to evaluate the effectiveness of the checklist, we compared the defect reports of Experiment1 and Experiment2 quantitatively and qualitatively.

• Quantitative Analysis
Table 6.3 shows the number of actual defects pointed out by the reviewers rather than the total number of defects reported, because the three subjects had the same technical level. Both experiments resulted in low precision and recall, which may imply that the defects were hard to detect. In fact, the programs had already been released, and the 96 defects had been discovered after the systems’ releases. In addition, there is no significant difference between the results of Experiment1 and Experiment2. To clarify the differences between the two experiments, we analyzed the review results by type of defect. Table 6.4 shows the results of the analysis. In the table, a number in parentheses is the number of program fragments classified into the group “withDefect” in the checklist. The numbers after TP, FP, and FN represent the ID of each experiment. The SQL statement and IF block defects found in Experiment2, which used the checklist, were located in the program fragments marked as defective in the checklist. This may indicate the effectiveness of the checklist. In our future work, we will interview each reviewer to clarify their thoughts during their reviews.

• Qualitative Analysis
Furthermore, we investigated which reviewers discovered the two additional defects found in Experiment2. Reviewer ReB found both defects, reviewer ReC found one of them, and reviewer ReA found neither. Reviewer ReA reviewed the entire program and then reconfirmed it according to the checklist. After the experiment, reviewer ReA said that he could not trust the checklist because its TP seemed low. Reviewer ReB used the checklist mainly to decide where to review. Reviewer ReC also used the checklist to focus the review. Since the reviewers’ technical capabilities are considered to be equivalent, these differences may be due to differences in how the checklist was used and in the time spent on the reviews. Nevertheless, each reviewer found more defects in Experiment2 than in Experiment1. The results are shown in Table 6.5.


Table 6.3 The results of reviews

Experiment ID  #defects  TP  FP  FN  Recall  Precision  F1
1              96        29  64  67  0.30    0.31       0.31
2              96        26  41  70  0.27    0.39       0.32

Table 6.4 Evaluation by type of defect

Type of defects  #Defects  TP1  TP2     FP1  FP2     FN1  FN2
SQL statement    20        6    8(8)    17   17(6)   14   12(1)
IF block         5         2    3(2)    15   10(5)   2    2(1)
FOR block        3         1    0       0    0       2    3(1)
DO block         1         0    0       0    0       1    1(0)
Others           67        20   17(4)   30   13(4)   47   50(7)
Total            96        29   28(14)  62   40(15)  66   68(10)

Table 6.5 The number of defects found in SQL statements and the time spent on reviews

             Experiment1             Experiment2
Reviewer ID  #Defects  Review time   #Defects  Review time
ReA          3         1:29:03       5         1:34:18
ReB          6         3:48:44       8         1:03:04
ReC          4         3:15:53       5         2:55:34

As for the review time, ReA's time increased and the times of ReB and ReC decreased in Experiment2. As mentioned above, the reviewers differed in their review styles. ReB and ReC, who reviewed based on the checklist, were able to point out at least as many defects as in Experiment1 in a shorter time. Thus, we can conclude that the checklist contributed to improving the productivity of reviews (the number of defects found per unit time).
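The productivity measure used here, defects found per unit time, can be recomputed directly from Table 6.5. The sketch below is illustrative only: it assumes that the listed review time is the denominator for the listed SQL statement defect count, and the helper names are hypothetical.

```python
def to_hours(hms):
    """Convert an 'h:mm:ss' string to hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


# (defects found, review time) per reviewer and experiment, copied from Table 6.5.
TABLE_6_5 = {
    "ReA": {"Experiment1": (3, "1:29:03"), "Experiment2": (5, "1:34:18")},
    "ReB": {"Experiment1": (6, "3:48:44"), "Experiment2": (8, "1:03:04")},
    "ReC": {"Experiment1": (4, "3:15:53"), "Experiment2": (5, "2:55:34")},
}


def defects_per_hour(table):
    """Return reviewer -> experiment -> defects found per hour of review."""
    return {
        reviewer: {exp: n / to_hours(t) for exp, (n, t) in runs.items()}
        for reviewer, runs in table.items()
    }


if __name__ == "__main__":
    for reviewer, rates in defects_per_hour(TABLE_6_5).items():
        print(reviewer, {exp: round(rate, 2) for exp, rate in rates.items()})
```

Under this assumption the rate rises for all three reviewers in Experiment2, including ReA, whose review time increased slightly.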

6.5 Discussion

6.5.1 The Answer to RQ1

We answer the first research question. RQ1: Can the CNN-BI system that has learned the attributes of programs with defects infer the existence of defects in a program? The heat map obtained during learning confirmed that the CNN-BI system made its inferences by focusing on the defective part of the program.


6.5.2 The Answer to RQ2

We answer the second research question. RQ2: Can a reviewer who uses the results inferred by the CNN-BI system point out more defects than a reviewer who does not refer to the results? In our study, the results inferred by the CNN were used as a checklist. In particular, we observed an improvement in productivity for reviews of SQL statement defects. Among the program fragments containing SQL statements that were used in the training, the number of fragments with defects and the number of fragments without defects were almost equal. This may be the reason why the CNN-BI system was able to infer the presence of defects in SQL statements effectively. It is also possible that the CNN-BI system merely detected the characteristic shape of an SQL statement in the image of the program fragment. However, we reject this possibility. The precision and recall for inferring the presence or absence of SQL statement defects were 0.62 and 0.84, respectively. If the CNN-BI system had simply detected program fragments containing SQL statements, the precision of the inference results would have been higher and the recall would have been lower.

6.5.3 Internal Validity

There is a threat to internal validity that “the ability of the test subjects was high and the checklist may have increased the productivity of the review.” The results depend on the ability of the test subjects, and there were only three subjects. Therefore, we cannot deny that we happened to be blessed with good subjects and thus obtained good results. We need more experimental studies. Furthermore, we have not evaluated whether the checklist works well for novice programmers, but such an evaluation does not seem necessary, because formal program reviews are done by highly qualified engineers. However, from the viewpoint of education, we expect that the effectiveness of self-review can be evaluated for beginners. There was another problem with our experiments: prior to the experiment, we needed to explain how to use the checklist so that the reviewers could use it properly.

6.5.4 External Validity

There is a threat to external validity, such as “Can the CNN-BI system infer the presence or absence of defects for programs written in other languages?” In this research, the programs used as training data were written in Visual Basic. We consider that the CNN-BI system is language-dependent. We have not yet verified how effectively the CNN-BI system works for programs written in other languages; this requires additional research. However, if images of program fragments are generated in the same way as in this study and supervised learning is performed with a CNN, we believe that a new CNN-BI system will be able to infer the existence of defects in program fragments.

6.6 Conclusion

The purpose of this study is to support code reviews and improve the quality of programs. In order to achieve this goal, we trained a convolutional neural network with images of programs labeled “withDefects” or “withoutDefects” and obtained the CNN-BI system. The CNN-BI system can infer the existence of a defect within an image of a program fragment. In this paper, we introduced the CNN-BI system and verified its effectiveness for reviewers who use a checklist based on its inference results. As a result, we conclude that the inference results of the CNN-BI system can work as a checklist that helps reviewers point out more defects. In our future work, we will update the experimental research process with more detailed interviews with the subjects and verify the effectiveness of the CNN-BI system. We will also evaluate the effectiveness of CNN-BI for novice reviewers.

References

1. J. Chen, K. Hu, Y. Yu, Z. Chen, Q. Xuan, Y. Liu, V. Filkov, Software visualization and deep transfer learning for effective software defect prediction. Proc. ICSE 20, 578–589 (2020)
2. S.R. Chidamber, C.F. Kemerer, A metrics suite for object-oriented design. IEEE Trans. Softw. Eng. 20, 476–493 (1994)
3. F. Chollet, Deep Learning with Python, 2nd edn. (Manning Publications, 2017)
4. M.E. Fagan, Design and code inspections to reduce errors in program development. IBM Syst. J. 15(3), 182–211 (1976)
5. T. Gilb, D. Graham, Software Inspection (Addison-Wesley, 1993)
6. M. Kondo, K. Mori, O. Mizuno, C. Eun-Hye, Just-in-time defect prediction applying deep learning to source code changes. IPSJ J. 59(4), 1250–1261 (2018)
7. Q.V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng, Building high-level features using large scale unsupervised learning. arXiv:1112.6209 (2011)
8. W. Li, S. Henry, Object-oriented metrics that predict maintainability. J. Syst. Softw. 23, 111–122 (1993)
9. M. Lin, Q. Chen, S. Yan, Network in network, in International Conference on Learning Representations 2014 (2014)
10. T.J. McCabe, A complexity measure. IEEE Trans. Softw. Eng. SE-2(4), 308–320 (1976)
11. M. Morisaki, Deep learning use cases and their points and tips: 6. review source code, AI. IPSJ Magazine 59(11), 985–988 (2018)
12. K. Ogawa, T. Nakatani, Predicting fault proneness of programs with CNN, in 11th International Conference on Agents and Artificial Intelligence, vol. 1 (2019), pp. 321–328
13. K. Ogawa, T. Nakatani, Research for improve accuracy using multiple models by CNN-BI system. Proceed. Softw. Sympos. 2020, 55–64 (2020)
14. K. Ogawa, T. Nakatani, Proposal of source code image analysis method for improving program quality using deep learning. Proceed. Softw. Sympos. 2021, 142–151 (2021)
15. R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128(4), 336–359 (2020)
16. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2015)
17. E. Yourdon, Structured Walkthroughs, 4th ed. (Prentice Hall, 1989)

Chapter 7

Safety and Risk Analysis and Evaluation Methods for DNN Systems in Automated Driving

Tomoko Kaneko, Yuji Takahashi, Shinichi Yamaguchi, Jyunji Hashimoto, and Nobukazu Yoshioka

Abstract There is great concern about the quality of current AI, especially DNNs (Deep Neural Networks), in terms of safety and reliability. In the case of automated driving, risk management procedures for systems containing DNNs have been verified. In this paper, we present the whole picture in a demonstrative manner. We also show the problems, the methods for solving them, and their significance in detail. The system-level safety analysis is connected to the development of the stand-alone DNN model and to its development process. Using this method, we aim to build a mechanism that can improve the DNN repeatedly.

Keywords AI · Deep Neural Network · Safety · Risk management · Autonomous driving · Safety argumentation · STAMP/STPA

T. Kaneko (B) · Y. Takahashi: National Institute of Informatics, Tokyo, Japan
T. Kaneko: NTTDATA, Tokyo, Japan
S. Yamaguchi: Keio University, Tokyo, Japan
J. Hashimoto: Gree, Tokyo, Japan
N. Yoshioka: Waseda University, Tokyo, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Virvou et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_7


T. Kaneko et al.

7.1 Introduction

As IoT and AI advance, it is important to achieve safety in complex computer systems. At present it is difficult to assure the safety of AI operation; in particular, several accidents have occurred in the demonstration phase of automated driving, which is an AI system. The AI used in automated driving systems to recognize objects in front of the driver and other vehicles has a risk of unpredictable behavior in response to unlearned data, making it a technically challenging topic. Therefore, following the overall risk management framework, the authors have studied safety arguments for risk assessment (risk identification, risk analysis), risk response, and risk evaluation in DNN (Deep Neural Network) systems, which have high uncertainty among AI systems. In particular, to solve the problem of “not being able to assure the operation of critical scenes when taking corrective measures” during the development of DNNs, we aimed to improve the accuracy of labels included in high-risk critical scenes in the DNN model by capturing risks in the system operation phase, so that operation in critical scenes can be assured. This paper outlines the concrete implementation of this improvement mechanism. It presents related work in Sect. 7.2, safety issues of automated driving and the research questions of this study in Sect. 7.3, the proposed safety and risk analysis and evaluation methods for DNN systems in Sect. 7.4, case studies in Sect. 7.5, and conclusions and future issues in Sect. 7.6.

7.2 Related Work

7.2.1 Machine Learning Systems Engineering and Safety

Systems that include machine learning, such as deep learning, have been used successfully for prediction and classification in IT systems. However, in fields where human lives are at stake, such as traffic, transportation, and medical care, the impact of serious errors can be significant, so ensuring safety is a challenge [1]. JST has identified four important technical challenges for the systematization of AI software engineering (also called machine learning engineering, software 2.0, etc.) and basic research: (1) quality assurance of machine learning itself, (2) ensuring safety as a whole system, (3) an engineering framework to solve problems efficiently, and (4) countermeasures to address the black-box problem. In relation to “(1) quality assurance of machine learning itself,” there is debugging and correction technology, and research [2, 3] has been conducted to improve the accuracy of image classification on a label-by-label basis. The authors are also studying “(2) ensuring the safety of the entire system” [4].

7 Safety and Risk Analysis and Evaluation Methods for DNN Systems …


7.2.2 Safety Guidelines and Research Trends for Automated Driving

ISO/PAS 21448:2019 (SOTIF) [5] is an international standard for the safety of intended functions such as automated driving. However, it does not fit the concept of machine-learning AI well, in that even for automated driving it takes the conventional approach of responding to component failures. Various guidelines have also been issued in Japan. The AI Product Quality Assurance Guidelines [6] summarize the perspectives of quality assurance, and their automated driving section presents the process of developing systems that incorporate AI. The Safety Assessment Framework for Automated Driving (JAMA) [7] presents the safety argumentation structure for automated driving systems and deals with disturbance-centered risks based on physical principles. The Machine Learning Quality Management Guideline [8] deals with internal attribute risk. The AIQM reference also presents an example of object detection in automated driving. As for research, a review of more than 170 test-based papers on safe automated driving [9], studies on cognitive uncertainty in automated driving [10, 11], and causal models of safety assurance for machine learning [12] have been published. However, no guidelines or studies have yet gone into the performance and risk assessment of the AI itself and presented specific safety improvement methods for the AI itself based on system risk.

7.2.3 Model and STAMP and Related Methods

A model is an abstraction of objects in the real world that discards details for a certain purpose and makes the objects easier to handle under that purpose. As shown in Fig. 7.1, in the problem domain an analytical model is created by analyzing the problem, and in the solution domain a design model is created and implemented to obtain a solution in the real world. In this study, we propose a two-phased modeling method in which the analytical model is used for the DNN system development in step 1 and the design model is used for the DNN alone in step 3, as described in Sect. 7.4.

STAMP (System Theoretic Accident Model and Processes) [14] is an accident model based on systems theory, which is a very important point. Unlike conventional failure causation models such as the domino model and the Swiss cheese model, which underlie conventional safety analysis methods, this model is based on the interaction of the entire system, as shown in Fig. 7.2. STPA (System Theoretic Process Analysis) is a representative hazard analysis method based on the STAMP model. There are also methods such as CAST (Causal Analysis based on STAMP), which analyzes an accident or incident after it has occurred. In this study, risks at the system operation level are identified in step 1 by STAMP/STPA hazard analysis, and CAST is used in step 3 to solve the problems of the stand-alone DNN.


Fig. 7.1 Role of the model [13]

Fig. 7.2 STAMP interaction model [14]

7.3 Safety Challenges for Machine Learning Systems

7.3.1 Safety Challenges of Automated Driving

There are many issues in the safety of automated driving [15]. Among them, the “inability to assure operation in critical situations” is a problem whose solution is highly desired by automobile companies and others. In this research, we are working on the construction of a system that can iteratively improve the DNN as a solution to this problem.


7.3.2 The “Question” that Forms the Core of the Research

The four research questions for this study are as follows.

(a) What improvements can be made in image recognition for automated driving, where it is challenging to ensure operation in critical situations?
(b) Machine learning has the risk of unpredictable behavior for untrained data, and it is necessary to improve the coverage of learning for high-risk scenes that lead to accidents. What methods are possible to obtain specific high-risk scenes in the dataset that is the input for machine learning?
(c) How can we evaluate the performance of machine learning systems with uncertainty, especially DNNs, so as to contribute to safety?
(d) Can (a) to (c) be implemented as a series of mechanisms to enable safety verification of DNNs in accordance with the life cycle of the machine learning system?

7.3.3 Research Goals

To ensure safety, it is necessary to analyze the hazards caused by the DNN behavior, create a dataset with appropriate labels and characteristics, and calculate the risk. In addition, since technologies to reduce the risk of hazards are necessary, we will evaluate and verify these technologies and form a system that enables the safety verification of DNNs through this series of efforts.

7.4 Proposal of Safety and Risk Analysis and Evaluation Methods for DNN Systems

Safety is defined as “the absence of unacceptable risk” in ISO/IEC Guide 51, and risk is defined as “the effect of uncertainty on the objective” in JIS Q31000. Therefore, the authors propose the DNN safety analysis and evaluation method as a framework for a series of risk management steps (A. Situation Determination, B. Risk Assessment, B-1 Risk Identification, B-2 Risk Analysis, B-3 Risk Evaluation, C. Risk Response) to control risks appropriately. By taking a holistic view of risk management, we can make the effect of uncertainty (risk) of the DNN acceptable and ensure the safety of the DNN. Figure 7.3 describes the purpose, implementation items, and application techniques for each process of risk management as it applies to the process of DNN system development. It outlines step 1, which belongs to the development process of the system incorporating the DNN, and steps 2–7, which belong to the DNN model development process. The details of the steps, including the input and output between the processes, are described in Sect. 7.4.1 and the subsequent sections.


Fig. 7.3 DNN safety risk analysis and assessment methodology procedures

Although this framework is abstract, it is versatile and meets the need to evaluate not only DNNs alone but also the entire system. However, to confirm its effectiveness, the framework must be demonstrated in a concrete case study, so we assume that the DNN debugging and correction technique is applied to image recognition for automated driving.

In this study, we propose two-phased modeling with STAMP and its application to the hazard analysis method STPA and the accident analysis method CAST. We use STAMP because STAMP modeling [14] is suitable for analytical modeling of the entire system including the DNN, treating the DNN as one component and capturing its interaction with the other system elements as a whole. Two-phased modeling means first creating the analytical model described in Sect. 7.2.3 and then building a design model to analyze the identified problem areas of the DNN unit. The modeling is described in the form of an interaction model that captures the entire system with STAMP. In addition, STPA is used for risk analysis of the analytical model. We use STPA in order to analyze the risk of complex situations that may lead to accidents in the real world and to make unknown hazards known comprehensively. A detailed analysis of this model leads to the uncertainty of the DNN as a factor that causes accidents.

To improve the problem areas of DNNs, design models are constructed for each process of DNN model development, namely (i) training data collection and preparation, (ii) data preprocessing, (iii) model generation, and (iv) model evaluation, and the problems are analyzed. Steps 2–7 are then derived as improvement policies for each process. Conventional problem analysis is based on the domino model and the Swiss cheese model and is conducted by understanding causal relationships through repeated “why-why” questioning and the like. However, the interaction model captured by STAMP is suitable for structural analysis of the problems of the DNN itself, because it treats the influence of the external environment, such as triggering conditions, as risks. Risk is caused not only by the component itself, and the interaction model is therefore important for reducing the risk to an acceptable level and ensuring the safety of the system. This is why we propose applying STAMP/CAST analysis to problem analysis.

On the other hand, judging the outcome of the interaction based on the unsafe control actions (UCAs) obtained by STPA can generate high-risk scenarios. The approach of this research is to use these scenarios as training data to improve the model, and to design labels with safety in mind based on the scenarios. A scenario is a “knowledge representation used in a predefined sequence of events to determine the outcome of an interaction between known entities.”

To further improve the model using training data that reflect high-risk scenes and debugging and correction techniques that can improve the accuracy of each label, we identify the labels that should be improved based on the risk analysis of each image of object detection by the DNN model. The labels are identified as label pairs in order to clarify the performance limit of each label in recognition. To measure the effectiveness of the improvement, we set up evaluation criteria for the debugging and correction techniques. The proposed criterion is to use the confusion matrix for each important scene obtained from the series of risk analyses. To make the series of safety and risk analyses and evaluations work as a mechanism for improvement, the procedures and the input–output summary are set up as shown in Sect. 7.4.1 and thereafter.

7.4.1 Step 1: System-level Safety Analysis

Step 1 covers the risk management processes from A. Situation Determination through B-2 Risk Analysis, carried out as part of DNN system development.

Objective: To address risks in the operational phase that could lead to accidents.

Implementation: Using various guidelines related to automated driving, company needs, and an understanding of accident situations as inputs, we determine rough targets for improvement based on the situations that need to be addressed, and then model the entire system in the operational phase. Furthermore, from the safety (risk) analysis at the system level, the hazards and the risks caused by the DNN alone are identified (output).

Techniques: Two-phased modeling, in which STAMP modeling is performed for the analytical and design models, STPA hazard analysis, and factor analysis.
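As a minimal sketch of how candidate unsafe control actions could be tabulated before an analyst judges them, the following crosses each control action with the four UCA guide types commonly used in STPA and with a set of contexts. The controller name, control action, and contexts in the usage example are hypothetical and are not taken from the authors' analysis. The output is only a worksheet of candidates: deciding which of them are actually hazardous remains a manual step.

```python
from dataclasses import dataclass
from itertools import product

# The four guide types commonly used in STPA to look for unsafe control actions.
UCA_TYPES = (
    "not provided",
    "provided",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
)


@dataclass(frozen=True)
class UcaCandidate:
    controller: str
    control_action: str
    uca_type: str
    context: str


def enumerate_uca_candidates(controller, control_actions, contexts):
    """Cross every control action with every UCA type and context."""
    return [
        UcaCandidate(controller, action, uca_type, context)
        for action, uca_type, context in product(control_actions, UCA_TYPES, contexts)
    ]


if __name__ == "__main__":
    # Hypothetical controller, control action, and contexts for illustration.
    candidates = enumerate_uca_candidates(
        "DNN recognizer",
        ["report detected pedestrian"],
        ["rainy night", "clear daytime"],
    )
    for c in candidates:
        print(f"{c.controller}: '{c.control_action}' {c.uca_type} [{c.context}]")
```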


7.4.2 Step 2: Scenario and Training Data Generation for High-Risk Scenes

Step 2 conducts B-2 Risk Analysis, which is part of B. Risk Assessment, as (i) training data collection and preparation in DNN model development.

Objective: To enhance the learning of high-risk scenes.

Implementation: Using the hazards from step 1 as input, exhaustive scenarios of high-risk scenes, including environmental and temporal conditions, are generated (output) as requirements for the training data.

Technique: Scenario generation by label representation of data.
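Scenario generation by label representation can be pictured as enumerating combinations of object and environmental conditions. The sketch below is an illustration under assumptions, not the authors' tool: the attribute values are the BDD100K examples mentioned later in the chapter (Person, Car, Rider; rainy, snowy; daytime, night), and the hazard string and dictionary keys are hypothetical.

```python
from itertools import product

# Example attribute values; a real dataset defines many more.
OBJECT_CONDITIONS = ("Person", "Car", "Rider")
WEATHER = ("rainy", "snowy")
TIME_OF_DAY = ("daytime", "night")


def generate_scenarios(hazard, objects=OBJECT_CONDITIONS,
                       weather=WEATHER, time_of_day=TIME_OF_DAY):
    """Return one label-based scenario per combination of conditions."""
    scenarios = []
    for obj, w, t in product(objects, weather, time_of_day):
        scenarios.append({
            "labels": {"object": obj, "weather": w, "timeofday": t},
            "text": f"{hazard}: {obj} in {w} weather at {t}",
        })
    return scenarios


if __name__ == "__main__":
    # "Object not recognized" is a hypothetical hazard used for illustration.
    for scenario in generate_scenarios("Object not recognized"):
        print(scenario["text"])
```

Enumerating the full product is what makes the scenario set exhaustive with respect to the chosen attributes; it also shows why the number of scenarios grows quickly as attributes are added.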

7.4.3 Step 3: DNN Design Modeling and Problem Analysis

Step 3 is a process of identifying and analyzing problems in DNN model development.

Objective: To address the uncertainty of the DNN by itself.

Implementation: The problems of the DNN itself are analyzed, taking as input the risks posed by the DNN that were obtained in step 1. The design model of each process of DNN model development is detailed, the problems of the DNN that cause hazards in the operational phase are analyzed, and the improvement policies for the problems are identified (output).

Technique: Problem analysis of the stand-alone DNN by STAMP/CAST. Since the improvement method of this process is directly related to overcoming DNN uncertainty, specific examples and explanations are provided in Sect. 7.5.2.

7.4.4 Step 4: Design Labels with Safety in Mind

Step 4 implements C. Risk Response as (ii) data preprocessing in DNN model development.

Objective: To enhance model safety by improving the data.

Implementation: A dataset is built that considers the safety of critical labels according to risk. Using the scenarios from step 2, company needs, accident conditions, etc. as input, the object conditions and environmental conditions used are set as labels and attributes, and the label design method for the dataset needed to ensure safety is identified (output).

Technique: Labeling that reflects the safety analysis. By tying the results of the hazard analysis to the labels in the dataset as risk parameters, labels can be designed with safety in mind.


7.4.5 Step 5: Model Evaluation by Risk

Step 5 conducts B-3 Risk Evaluation as (iv) model evaluation in DNN model development. The labels (inputs) included in the high-risk scenes are checked to see whether they can be targeted for problem-solving in the target DNN, and a risk evaluation is performed.

Objective: To quantitatively capture the uncertainty of DNNs and evaluate the safety of learning models.

Implementation: Using the scenarios from step 2 and the safety dataset from step 4 as input, the risk of each scene is evaluated and compared for the model, and scenes with relatively high risk are identified (output).

Technique: Risk calculation.
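The chapter does not give the risk formula, so the following is only a sketch under a stated assumption: risk is scored as severity multiplied by the observed misrecognition rate, in the spirit of the common view of risk as a combination of the severity of harm and its probability. The severity scale, the scene names, and all counts are made-up illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneResult:
    scene: str
    severity: int         # assumed ordinal severity from hazard analysis (1 = low, 3 = high)
    n_objects: int        # labeled objects in the scene's test images
    n_misrecognized: int  # objects the model missed or mislabeled

    @property
    def risk(self):
        """Assumed score: severity times the observed misrecognition rate."""
        return self.severity * self.n_misrecognized / self.n_objects


def rank_by_risk(results):
    """Return the scenes ordered from highest to lowest risk score."""
    return sorted(results, key=lambda r: r.risk, reverse=True)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    results = [
        SceneResult("Car, snowy, daytime", severity=2, n_objects=400, n_misrecognized=40),
        SceneResult("Person, rainy, night", severity=3, n_objects=200, n_misrecognized=60),
        SceneResult("Rider, rainy, daytime", severity=3, n_objects=100, n_misrecognized=10),
    ]
    for r in rank_by_risk(results):
        print(f"{r.scene}: risk={r.risk:.2f}")
```

Any monotone combination of severity and error rate would serve the same purpose here, which is to compare scenes and pick out those with relatively high risk.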

7.4.6 Step 6: Setting Evaluation Criteria

Step 6 conducts B-3 Risk Evaluation as (iv) model evaluation in DNN model development.

Objective: To set model performance criteria that reflect risk.

Implementation: Not only accuracy but also risk-valued criteria, confusion matrices, etc. are measured. Evaluation criteria are defined based on the results of the risk evaluation in step 5 (input) so that it can be confirmed whether labels included in risky scenes can be targeted for problem-solving by the target DNN. New evaluation criteria are set (output) to measure the performance of the risk-responsive technology based on the risk evaluation.

Technique: Confusion matrix by important scene.
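A confusion matrix per important scene, and the confused label pairs derived from it, can be sketched as follows. This is an illustration under assumptions: each evaluation record is taken to be a (scene, true label, predicted label) triple, and the scene names and records in the usage example are invented.

```python
from collections import Counter, defaultdict


def confusion_by_scene(records):
    """Map scene -> Counter keyed by (true_label, predicted_label)."""
    matrices = defaultdict(Counter)
    for scene, true_label, predicted_label in records:
        matrices[scene][(true_label, predicted_label)] += 1
    return dict(matrices)


def confused_label_pairs(matrix):
    """Return off-diagonal (true, predicted) pairs, most frequent first."""
    pairs = [(pair, n) for pair, n in matrix.items() if pair[0] != pair[1]]
    return sorted(pairs, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented evaluation records for illustration only.
    records = [
        ("rainy night", "Person", "Person"),
        ("rainy night", "Person", "Rider"),
        ("rainy night", "Person", "Rider"),
        ("rainy night", "Rider", "Person"),
        ("clear daytime", "Person", "Person"),
        ("clear daytime", "Car", "Car"),
    ]
    for scene, matrix in confusion_by_scene(records).items():
        print(scene, confused_label_pairs(matrix))
```

Keeping one matrix per scene, rather than one overall matrix, means that a label pair confused mainly in a high-risk scene is not averaged away by the many easy scenes.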

7.4.7 Step 7: Model Improvement Through Debugging and Modification Techniques

Step 7 implements C. Risk Response as (iii) model (re)generation in DNN model development.

Objective: To improve model performance.

Implementation: Taking the criteria from step 6 and the labels whose accuracy is to be improved as input, the model is debugged and modified, and it is improved (output) on a label-by-label basis.

Technique: Automatic debugging and correction technology.
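The inputs and outputs described for steps 1–7 imply a dependency structure between the steps. The sketch below records that structure as read from the text above and checks that it admits a valid execution order. The `depends_on` edges are an interpretation of the stated inputs, not a specification from the authors, and the feedback loop of repeated improvement is not modeled. `graphlib` requires Python 3.9 or later.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Step:
    number: int
    name: str
    technique: str
    depends_on: tuple  # step numbers whose outputs this step takes as input


STEPS = (
    Step(1, "System-level safety analysis",
         "Two-phased modeling, STPA hazard analysis, factor analysis", ()),
    Step(2, "Scenario and training data generation",
         "Scenario generation by label representation of data", (1,)),
    Step(3, "DNN design modeling and problem analysis",
         "STAMP/CAST problem analysis", (1,)),
    Step(4, "Label design with safety in mind",
         "Labeling that reflects safety analysis", (2,)),
    Step(5, "Model evaluation by risk", "Risk calculation", (2, 4)),
    Step(6, "Setting evaluation criteria",
         "Confusion matrix by important scene", (5,)),
    Step(7, "Model improvement",
         "Automatic debugging and correction", (6,)),
)


def execution_order(steps=STEPS):
    """Return step numbers in an order that respects all dependencies."""
    graph = {s.number: set(s.depends_on) for s in steps}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    by_number = {s.number: s for s in STEPS}
    for n in execution_order():
        print(n, by_number[n].name, "|", by_number[n].technique)
```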


7.5 Safety Arguments and Case Studies

7.5.1 Safety Arguments for DNN Safety Analysis and Assessment Methodology

For camera image recognition in automated driving, we used GSN [16] to present an overall view of the safety arguments (Fig. 7.4) for the application of debugging and correction techniques focused on high-risk and critical scenes, organized into the layers of Why, What, and How. This shows the validity of the contents of the empirical experiments that demonstrate the effectiveness of the improved debugging and modification techniques. In the Why layer, the goal of the overall safety assurance is presented, and in the What layer, the validity of the implementation items based on the risk management procedures is logically presented. In the How layer, we present a concrete example of how the DNN debugging target was derived by a systematic and objective procedure. The concrete contents of steps 1–3 of the How layer are shown in Sect. 7.5.2. The safety argument also includes steps 8–10 for validation, which are not detailed in this paper.

7.5.2 Embodiment of Steps 1–3 We will focus only on steps 1 through 3 of the aforementioned procedure and explain the relationship and significance of the steps in a nutshell. In step 1, after determining the preconditions, the system level of automatic driving is modeled as a combined system of millimeter-wave and LiDAR, and then detailed modeling is performed focusing only on camera images, and hazard analysis due to DNN is conducted in accordance with the STPA procedure. As an example, several unsafe control actions (UCAs) can be identified that violate the system-level safety constraint of “correctly recognizing objects such as pedestrians and vehicles.” In step 2, scenarios that violate safety constraints are generated for the obtained multiple UCAs. By generating scenarios for each object condition and environmental condition, high-risk scenes under different conditions can be extracted comprehensively. In addition, the safety of the data input to the DNN can be comprehensively analyzed by hazard analysis and verbalized as scenarios, and the differences in risk and system requirements can be embodied as label attributes of data used in training and testing of machine learning to ensure the sufficiency of requirement analysis and data design in designing data. For example, in BDD100 K [17], in addition to object conditions such as Person, Car, and Rider, environmental conditions such as weather such as rainy and snowy, and time of days such as daytime and night are labeled attributes. The “A” and “B” are set as the “A” and “B” are set as the “B”. Scenarios are verbalized as a data set of high-risk scenes with the above-mentioned labels and

7 Safety and Risk Analysis and Evaluation Methods for DNN Systems …


Fig. 7.4 Safety argument, overall view (excerpts from steps 1–3)

used for training data to enhance the learning of scenes that are lacking in terms of safety. In addition, in step 1, following the STPA procedure, the analysis of the causal factors (HCFs) of the UCAs that violate the safety constraints in the system-level safety analysis identified DNN recognition as a problematic component. The DNN recognition identified as a factor was then analyzed as a problem in step 3. In step 3, the safety constraints, problem situations, unsafe control actions, and process model inadequacies of each process of DNN model development were identified to capture the problems inherent in each step. As a result, improvement measures for steps 2, 4, 6, and 7 were obtained for each process. Figure 7.5 shows the problem analysis for step 7. We modeled the current DNN as affected by the external environment


T. Kaneko et al.

Fig. 7.5 Current DNN model (problem analysis status)

Fig. 7.6 Improved DNN model (debugging correction technique applied)

such as triggering conditions, referring to the causal model of safety assurance for machine learning [12]. Figure 7.6 shows the improved DNN model after applying debugging correction techniques. The interaction model was used to analyze the problem and illustrate how to improve the debugging correction technique by using label pairs.
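The idea of identifying label pairs within a high-risk scene can be illustrated with a small sketch. It is not the authors' implementation: the records are made up, the attribute values only mirror the BDD100K-style labels mentioned above, and the function names `confusion_pairs` and `misclassified_pairs` are hypothetical. The sketch restricts a confusion count to one scene (rainy, night) and lists the misclassified (true, predicted) pairs, which are the candidates a debugging and correction technique might target.

```python
from collections import Counter

# Hypothetical records: (true label, predicted label, weather, time of day).
# The values are invented for illustration and are not taken from BDD100K.
records = [
    ("Person", "Person", "rainy", "night"),
    ("Person", "Rider", "rainy", "night"),
    ("Person", "Car", "rainy", "night"),
    ("Rider", "Person", "rainy", "night"),
    ("Car", "Car", "clear", "daytime"),
    ("Person", "Person", "clear", "daytime"),
    ("Person", "Rider", "rainy", "night"),
]


def confusion_pairs(records, weather, time_of_day):
    """Count (true, predicted) label pairs within one scene."""
    pairs = Counter()
    for true_label, predicted, w, t in records:
        if w == weather and t == time_of_day:
            pairs[(true_label, predicted)] += 1
    return pairs


def misclassified_pairs(pairs):
    """Return off-diagonal pairs, most frequent first (ties in label order)."""
    errors = [(pair, n) for pair, n in pairs.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: (-item[1], item[0]))


night_rain = confusion_pairs(records, "rainy", "night")
worst = misclassified_pairs(night_rain)
for (true_label, predicted), count in worst:
    print(f"{true_label} -> {predicted}: {count}")
```

In this toy data the most frequent error in the rainy night scene is a Person recognized as a Rider, so that pair would be the first candidate for improvement.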

7.6 Conclusion

In this study, we obtained the following answers to the research questions listed in Sect. 3.2.


(a) For improving image recognition in automated driving, where ensuring operation in critical situations is a challenging task, we considered the definition of safety and proposed to establish, within the risk management procedure, a “mechanism for improvement by identifying label pairs to be improved in high-risk situations and using them as input for DNN debugging and correction techniques.”

(b) To obtain specific high-risk scenes in the dataset that is the input for machine learning, a system-level hazard analysis is performed using STAMP/STPA to create exhaustive scenarios corresponding to the dataset, which is represented by labels as input for supervised learning (steps 1–2). To obtain the risky scenes in the dataset, we also devised a method (step 3) to add and modify labels for safety reasons. In real-world automated driving, it is difficult to respond to risk by assuming complete safety of cognition, since an infinite number of situations can occur. However, unknown risks at the system level can be turned into known risks more comprehensively by capturing the components of the system in an interaction model with STAMP/STPA. We propose to use these exhaustive scenarios as training data for DNNs, which is an unprecedented way of applying STAMP/STPA.

(c) For performance evaluation of DNNs, we proposed a method to create confusion matrices for important scenes and to identify the label pairs whose accuracy we want to increase and those whose accuracy we do not want to decrease. In step 3, we proposed introducing a mechanism that implements improvements in each development process of the DNN model by applying the STAMP/CAST accident analysis method to the problem analysis of the DNN itself.

(d) The arguments in Fig. 7.4 are presented to show whether (a) to (c) can be safely implemented as a series of mechanisms that enable safety verification of DNNs in accordance with the lifecycle of machine learning systems.

As a future issue, since the proposed system is still at the hypothesis stage, it is necessary to conduct rigorous verification of input/output, quantitative performance evaluation, and effectiveness measurement through demonstration experiments.

Acknowledgements The authors would like to express their sincere thanks to the eAI project for their great cooperation in this research. This research was supported by the Grant-in-Aid for Scientific Research on “Establishment of Accident Analysis Methodology for Socio-technical Systems by System Theory (21K21301)” and the Japan Science and Technology Agency (JST) Project for the Creation of Future Society JPMJMI20B8.

References

1. Japan Science and Technology Agency (JST), Strategic Proposal for “Establishment of New Generation Software Engineering to Ensure Safety and Reliability of AI Applied Systems”
2. J. Sohn, et al., Search based repair of deep neural networks. https://arxiv.org/abs/1912.12463


3. S. Tokui, et al., NeuRecover: regression-controlled repair of deep neural networks with training history, in SANER 2022
4. T. Kaneko, N. Yoshioka, Ensuring the Safety of Entire Machine Learning Systems with STAMP S&S, in Learning and Analytics in Intelligent Systems (LAIS) book series (Springer, 2022)
5. ISO/PAS 21448:2019 Road Vehicles—Safety of the Intended Functionality, ISO Std. Jan 2019. https://www.iso.org/standard/70939.html
6. AI Product Quality Assurance Consortium, AI Product Quality Assurance Guidelines, http://www.qa4ai.jp/
7. Japan Automobile Manufacturers Association, Safety Evaluation Framework for Automatic Driving Ver1.0, http://www.jama.or.jp/safe/automated_driving/pdf/framework.pdf
8. National Institute of Advanced Industrial Science and Technology, Machine Learning Quality Management Guidelines, 2nd ed. https://www.digiarc.aist.go.jp/publication/aiqm/guidelinerev2.html
9. M. Hoss, et al., A Review of Testing Object-Based Environment Perception for Safe Automated Driving, https://arxiv.org/abs/2102.08460
10. R. Salay, K. Czarnecki, et al., PURSS: Towards perceptual uncertainty aware responsibility sensitive safety with ML, in SafeAI 2020
11. R. Salay, et al., A Safety Analysis Method for Perceptual Components in Automated Driving, in 2019 IEEE ISSRE
12. S. Burton, A causal model of safety assurance for machine learning, https://arxiv.org/abs/2201.05451
13. IPA: STAMP/STPA (Application Edition) for Beginners—Future Safety from Systems Thinking, 2018
14. N.G. Leveson, Engineering a Safer World, MIT Press
15. T. Kaneko, et al., Ensuring the Safety of Entire Machine Learning Systems with STAMP S&S, LAIS book series, vol. 1 (Springer, 2022)
16. T. Kelly, R. Weaver, The Goal Structuring Notation—A Safety Argument Notation, in Proceedings of the Dependable Systems and Networks 2004
17. BDD100K, https://www.bdd100k.com

Chapter 8

Regulation and Validation Challenges in Artificial Intelligence-Empowered Healthcare Applications—The Case of Blood-Retrieved Biomarkers

Dimitrios P. Panagoulias, Maria Virvou, and George A. Tsihrintzis

Abstract Biomarkers have been proposed as powerful classification features for training neural network-based and other machine learning and artificial intelligence-based prognostic models in the field of personalized medicine and for targeted interventions in patient management. Biomarkers are measurable indications of a health state that can be derived from a blood sample, tissue or other bodily fluid. An example of a biomarker is the electrocardiogram, which records electrical signals from the heart and thus evaluates the heart's condition. Biomarkers can lead to actionable insights and are therefore important tools for patient management and treatment administration. In this paper, we outline a medical application with a machine learning backbone built on biomarkers retrieved from blood exams that define health states (obesity, metabolic syndrome and systolic pressure), using the rational unified process and the cross industry standard process for data mining. By adopting novel ways to deploy these industry standards, we can identify requirements and challenges related to the health sector and thus design and propose smart solutions that add value for all stakeholders. New technologies have the potential to create new pathways in medicine by bridging the gap between the laboratory and the patient; however, strong medical validation of processes is required to ensure usability and patient safety. We recognise regulation and validation as key challenges and important factors for improving the development of health care applications. To this end, we define when a software application is considered a medical device. Since the regulator is identified as an important stakeholder, strategies are suggested for the proper handling of this stakeholder throughout the production cycle.

D. P. Panagoulias · M. Virvou · G. A. Tsihrintzis (B)
Department of Informatics, University of Piraeus, Karaoli and Dimitriou 80, 185 34 Piraeus, Greece
e-mail: [email protected]

D. P. Panagoulias
URL: https://www.unipi.gr/unipi/en/

M. Virvou
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Various et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_8


Keywords Personalised medicine · Bioinformatics · Biomarkers · Pattern recognition · Microservices · Rational Unified Process (R.U.P) · Innovation strategies · Golden circle of innovation · Machine learning

8.1 Introduction

Biomarker identification opens pathways to patient evaluation and personalized treatment [1]. Biomarkers are considered key research variables (Table 8.1) and features in the training of prognostic and predictive machine learning models [2]. They offer quantitative indications of medical states, derived from blood samples, tissue and other bodily fluids. Changes in biochemical values can also be of great importance and are likewise associated with biomarker discovery. In the last 20 years, the medical industry has undergone a significant change: the doctrine of one-treatment-for-all has been superseded by a per-patient approach. Biomarkers, which are measurable health indicators, combined with statistics and deep learning techniques, aim to provide key solutions that replace older practices which led to inadequate or excessive treatment of patients. For example, the HER2 marker [3], a genetic and protein biomarker predictive of response to treatment, is used in clinical practice for breast cancer patients. When a medical application is used for therapeutic or diagnostic purposes, it is considered a medical device. As a medical device it is bound by regulations, and the software must therefore meet technical requirements locally, globally or both. This makes regulatory supervision necessary in different phases of the production cycle. For the medical application to be considered a tool by the medical community, the usability and utility of the application, along with any new implementations, must be validated and verified before it can be used in a clinical scenario. Therefore, system transparency and the inclusion of medical professionals, both during the different production phases and at the deployment stage, are necessities that ensure proper usage and patient safety. To better understand the problems at hand, it is important to recognise the main stakeholders [4] in the patient’s journey

Table 8.1 Examples of biomarkers

Name | Description | Health results
AST/ALT ratio | A normal AST:ALT ratio should be below 1 | Indicator of liver damage
Thyroglobulin | Thyroglobulin is a protein made by cells in the thyroid | Thyroid cancer metastases
Troponin I | Troponin I (TnI) is the protein subunit that inhibits muscle contraction in the absence of calcium | Myocardial infarction
Fasting glucose | Measures the amount of sugar in the diabetes index | Diabetes, CHD, mortality, poor cognitive function


and evaluate their contributions. It is also important to evaluate each specific process from inception to creation. To that end, the Cross Industry Standard Process for Data Mining (C.R.I.S.P.) and the Rational Unified Process (R.U.P.) will be used to formulate the building blocks for each phase of the production cycle. To create a useful framework for rapid interventions and an effective review process, it is necessary to synchronize the regulatory, medical and supervisory bodies and to place quality control triggers and thresholds strategically. For that purpose, by adapting known methodologies, we outline methods in which the regulatory and medical supervisory bodies are recognised as stakeholders and are integrated into the production cycle (Sect. 8.4). A method to add transparency regarding the usability and validity of the neural networks is proposed in Sect. 8.2.5. The paper is organized as follows: In Sect. 8.2 we summarize the key issues and challenges that are addressed in this paper. Section 8.3 discusses related literature, while in Sect. 8.4 future research will be discussed along with the tools that will be used for the implementations (application development). Finally, in Sect. 8.4 we discuss our conclusions and findings.

8.2 Key Issues and Challenges

The development of a health device or application can face many obstacles from a regulatory and medical validation standpoint, especially when machine learning and artificial intelligence are used to automate interventions or are deployed as assistive tools in a medical process. Awareness of the related guidelines, and their incorporation into the product development cycle, can add to the quality of the end product and reduce the time between inception and production of a market-ready end product. With these challenges and issues in mind, we look into the deployment of a web application that handles blood exams and uses deep neural networks to classify the user by weight category. This allows a personalized recommendation system to offer advice and interventions related to nutrition and heart risk. In Sect. 8.2.3 we outline the importance of regulation when developing a health or medical application. The regulator is the stakeholder that ensures patient safety, and conformity to regulation is an important task in the production cycle. In Sect. 8.2.5 we discuss how implementing a rule that explores innovation in product development can improve the design process of a machine learning pipeline. This assists the validation process of the deployed neural networks, adds transparency to the process and explains the usability of the product to all parties involved. In summary, we propose a methodology in which the validation process, in both the regulatory and the medical scope, is simplified by adapting known methodologies, making the development of an A.I. health care application easier and faster. Smart interventions are easier to deploy and value is added to the patient’s experience. In this paper we outline the process by briefly describing the proposed adaptations, firstly of the golden circle of innovation (Sect. 8.2.5) and secondly of the cross industry standard process for data mining and the rational unified process (Sect. 8.4).


Fig. 8.1 Categories of biomarkers and clinical use cases

8.2.1 Biomarkers

Biomarkers are used by all medical and paramedical disciplines to answer three key questions: how much, what and for whom (Fig. 8.1) [1]. Many biomarkers are easy to measure and some are part of routine medical examinations. Biomarkers can provide an early warning system for the occurrence of diseases in population groups with specific characteristics. Blood pressure, heart rate and pulse are indicators of cardiovascular function. Cholesterol and triglycerides provide information about metabolic processes and are used to assess the risk of coronary heart disease. Body measurements such as weight, body mass index (BMI) and waist-to-hip ratio serve as indicators of obesity, chronic metabolic disorders and fat deposits. T-cell measurements provide information about the state of the immune system. Cortisol, a steroid hormone, is often produced in response to chronic stress. As seen in Fig. 8.1, biomarkers are pharmacodynamic when they define the quantity of treatment, predictive when they are used to define the treatment options, and prognostic when they are used to choose whom to treat. An individual biomarker that exceeds a certain threshold indicates a risk that a disease will develop in the future. By taking into account other indicators and parameters that are collected as part of a patient's record, a new biomarker could be designed and associated with specific health information. For example, studies show that older men are at higher risk of serious health problems due to a combination of immune system and neuroendocrine disorders, whereas in older women high health risks are usually due to high systolic blood pressure in combination with other signals of health issues related to specific biomarkers and metrics. Biomarkers are used throughout our research as measurable variables to trigger actions, classify users and automate system actions.
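The notion of a biomarker exceeding a threshold can be sketched in a few lines. This is an illustration under stated assumptions, not the system described in this chapter: the function names and the patient values are hypothetical, the BMI cut-off of 30 is the commonly used obesity threshold, and the AST/ALT limit of 1 is taken from Table 8.1. Clinical thresholds would have to be set by medical professionals.

```python
def body_mass_index(weight_kg, height_m):
    """BMI is weight in kilograms divided by height in metres squared."""
    return weight_kg / (height_m ** 2)


# Illustrative thresholds only (assumptions for this sketch).
THRESHOLDS = {
    "bmi": 30.0,           # commonly used obesity cut-off
    "ast_alt_ratio": 1.0,  # Table 8.1: a normal ratio is below 1
}


def flag_biomarkers(values, thresholds=THRESHOLDS):
    """Return the names of biomarkers at or above their threshold."""
    return sorted(
        name
        for name, value in values.items()
        if name in thresholds and value >= thresholds[name]
    )


# Hypothetical patient measurements.
patient = {
    "bmi": body_mass_index(95.0, 1.75),
    "ast_alt_ratio": 40.0 / 50.0,
}
flags = flag_biomarkers(patient)
print(flags)
```

For this hypothetical patient only the BMI is flagged, since the AST/ALT ratio of 0.8 stays below 1.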


8.2.2 Automating Interventions and Patient’s Journey

The patient’s journey is a non-linear process that integrates all parts of the healthcare ecosystem, from hospitals to doctors, intensive care and outpatient treatment. Although it is easy to think of a patient’s journey as the interactions between the doctor and the patient, there are many more points of contact that guide the path to recovery, including: Awareness, Education, Diagnosis and Treatment, Lifestyle Compliance or Behavior Changes, and Continuous and Preventive Health (Wellness). We implement a plug-and-play system that tests the idea of revamping vital parts of the patient's journey using artificial intelligence. It is a hybrid system in the sense that human intervention is not eliminated. Rather, A.I. is used as a tool to improve outcomes by limiting the steps in the patient's journey and enhancing the patient's experience, and to reduce the challenges that physicians face by allocating resources better.

8.2.3 The Importance of Regulation

Regulation of medical devices and applications is set at a European level under the Medical Devices Directive (MDD). However, as in the case of the U.K., there are also local agencies, such as the MHRA, that interpret and enforce the legislation as transposed into U.K. law. According to the regulators, if a health application is used for therapeutic or diagnostic purposes it is considered a medical device. To diagnose or to offer treatment recommendations on medical conditions without the approval of a regulatory board is considered similar to practicing medicine without a license. The U.S. Food and Drug Administration (FDA) and the U.S. Federal Trade Commission (FTC) are the federal regulators of diagnostic and assistive software. The FDA authorizes software prior to release, on the front end, while the FTC can remove or limit software companies that make claims about released products that are unsupported by evidence. A software application that can use attachments (display screens, sensors or other components) can be considered a medical device and thus must comply with the related regulations [5, 6]. Software that performs patient-specific analysis and provides a diagnosis is similarly considered a medical device [7]. On the other hand, software functions [6] that perform calculations to assist routine processes in clinical practice fall into the category of general tools. The same applies to functions that are intended for patient education or patient access to reference information. A very important aspect of health-related software is data protection and privacy, which is also heavily regulated both in the U.S.A. and in Europe.


8.2.4 Validation of Neural Networks in Health Applications and Continuous Integration

Using artificial neural networks (ANNs) in medical applications can be challenging [8] because of the often experimental nature of ANN construction. In many cases, such as in the U.S., medical neural networks are regulated by the Food and Drug Administration [9]. Also important for a health app is its usability, which can be evaluated by impact, ease of use, perceived ease of use and user control [10]. We consider usability in terms of the impact of an application, measured as the improvement in outcome achieved through patient accessibility and time saved between diagnosis and treatment. If a patient has more access to information and a preliminary diagnosis of symptoms, this can lead to better management of a disease, a treatment that is more beneficial at an earlier stage of a diagnosis, improved quality of life and so on. When referring to usability in terms of impact from the medical professional's standpoint, we examine and measure the quality of the administrative tools by whether they reduce workload, simplify tasks and improve outcomes. Finally, the regulator, tasked with validating the same usability questions, will decide whether the suggested usability conforms with regulation and adds value for patients and doctors without compromising data integrity and security. The medical and regulatory stakeholders are recognised in the production cycle as defined by the R.U.P methodology. For continuous integration to be implemented, these stakeholders are also users of the released system: they validate new implementations (doctors, regulators) and new machine learning models before release, and they address usability concerns (doctors, regulators, patients) via an embedded review system based on ease of use, perceived ease of use, user control and impact for all users of the application. The logical view of the system, including the definition of its users, will be described in the next section.
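The release gate and the embedded review system described above might be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `ReleaseCandidate`, the role names, the model name and the 1 to 5 score scale are all assumptions introduced for this example. The only elements taken from the text are the two validating stakeholders (doctors, regulators) and the four review criteria.

```python
from dataclasses import dataclass, field
from statistics import mean

# The four usability criteria named in the text.
CRITERIA = ("impact", "ease_of_use", "perceived_ease_of_use", "user_control")
# Stakeholders that must validate a new model before release.
REQUIRED_APPROVERS = {"doctor", "regulator"}


@dataclass
class ReleaseCandidate:
    """Hypothetical record of a new model awaiting validation."""
    name: str
    approvals: set = field(default_factory=set)
    reviews: list = field(default_factory=list)  # one score dict per review

    def approve(self, role):
        if role not in REQUIRED_APPROVERS:
            raise ValueError(f"{role!r} cannot approve a release")
        self.approvals.add(role)

    def add_review(self, scores):
        missing = [c for c in CRITERIA if c not in scores]
        if missing:
            raise ValueError(f"missing criteria: {missing}")
        self.reviews.append(scores)

    def usability_summary(self):
        """Average score per criterion; empty if there are no reviews."""
        if not self.reviews:
            return {}
        return {c: mean(r[c] for r in self.reviews) for c in CRITERIA}

    def can_release(self):
        """True only when every required stakeholder has approved."""
        return REQUIRED_APPROVERS <= self.approvals


candidate = ReleaseCandidate("weight-classifier-v2")  # hypothetical name
candidate.approve("doctor")
blocked = candidate.can_release()
candidate.approve("regulator")
released = candidate.can_release()

# Scores on an assumed 1 to 5 scale.
candidate.add_review({"impact": 4, "ease_of_use": 5,
                      "perceived_ease_of_use": 3, "user_control": 4})
candidate.add_review({"impact": 5, "ease_of_use": 4,
                      "perceived_ease_of_use": 4, "user_control": 4})
summary = candidate.usability_summary()
```

Keeping approval and review as separate operations reflects the distinction in the text: validation of a model is restricted to doctors and regulators, while usability feedback can come from every user of the application, including patients.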

8.2.5 The Role of Machine Learning in Health Care

We propose an adaptation of the golden circle of innovation [11], which is more commonly used to address organisational issues [12] framed as the what, the why and the how, to explain and better outline a machine learning pipeline in health care (Fig. 8.2). With this proposed methodology it is easier to align the needs and requirements of all stakeholders and to explain the importance of machine learning interventions to medical professionals. The golden circle of innovation as a process is split into three aspects [11]. In the first aspect the why of innovation is examined. By embracing a mission that fulfills a general need, the necessity of innovation is justified. Einstein is often quoted as saying that if one does what one always did, then the outcome will always be the same. In that


sense, change is a very important driver and thus change is equal to innovation by definition [13]. The how of innovation in [11] is defined as the generation of innovative ideas by using effective managerial mechanisms, while the what recognises two different dimensions in innovation: innovation as a process and innovation as an outcome. In the first dimension of what, the drivers or rationale behind the innovation must be stated. The level indicates who is involved. The direction defines how the innovation starts and whether it is bottom-up or top-down. The source defines whether the innovation was initiated internally or externally [14], and the locus refers to the geographical extent of the innovation. In the second dimension of what, the form of innovation defines whether the innovation refers to a product, a service and/or an improvement [15]. The magnitude of innovation refers to its potential impact, and the type separates innovation into a technical type and an administrative type. By adapting all the definitions mentioned to the machine learning health ecosystem, we redefine and summarize the rule as shown below [11].

The What in Machine Learning

The what in this adaptation is where the problem is defined. Here we outline the process and the expected outcome. For the case examined in this paper, the problem we aim to solve is improving outcomes via artificial intelligence: automating mundane processes with pattern recognition and thus easing the patient's emotional journey, making personalized information available and decreasing the workload of medical professionals.

The Why in Machine Learning

In this step the solution to the what is proposed. By defining health states, in this paper, we give the machine “eyes” by creating triggers that automate interventions via recommendations or request manual responses, as described in Sect. 8.4. By first making classifications, we can then identify health states.

The How in Machine Learning

In this step, the technical road map is outlined (tools, algorithms, neural network architecture, related case studies); it will be further analysed in the C.R.I.S.P methodology. The accuracy of the algorithm offers a proof of concept and thus completes the adapted circle.

8.3 Related Work

The importance of data protection and the related challenges in the distributed era are well documented in [16], and in [17] alternative ways of teaching with synchronous and asynchronous learning methods, taking into consideration the General Data Protection Regulation, have been explored. Personalization of software via intelligent systems is examined in [18], where machine-learning methodologies based on artificial immune systems are proposed and tested. A strong relation between blood exams, weight and nutrition was established via neural networks in [19, 20], while different classification methodologies to identify


Fig. 8.2 The (adapted) golden circle of innovation

weight-related patterns in blood exams, and more specifically in the standard biochemistry profile, were explored in [1, 21]. A more precise methodology was proposed in [22], where an average accuracy of about 84% was achieved in correctly classifying body mass index and metabolic syndrome and an accuracy of 74% in identifying systolic blood pressure. We have used these findings to insert a machine learning algorithm into our system. Our goal is to use this proposed system as a benchmark and a canvas for implementing health applications that can benefit from faster deployments and regulatory compliance and that add value for medical professionals and patients alike.

8.4 Building a Blood Exam-Based Personalised Recommender System

Our proposed web application classifies patients through machine learning algorithms by extracting valuable information from blood exams. The methodology was developed and validated in [1, 19–22]. In this web application, patients/users are classified based on their weight and on their metabolic state, as derived from metabolic syndrome guidelines. The system assesses the risk of high blood pressure and makes personalised recommendations based on nutritional patterns identified in healthy groups, as derived from system parameters and other lifestyle factors. Health care professionals can monitor the process, validate the results and intervene when necessary, whereas regulators ensure conformity with regulation. To define and outline the machine learning pipeline we have used the Cross Industry Standard Process for Data Mining, and we have used the rational unified process to analyse the different phases and production cycles of the software's implementation.
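One way to picture how professionals "intervene when necessary" is a confidence threshold that decides whether a classification is handled automatically or sent for manual assessment. The sketch below is an assumption-laden illustration: the function `route_case`, the class labels, the probabilities and the 0.80 threshold are all hypothetical, and in practice the threshold would be defined by the supervising doctor.

```python
def route_case(probabilities, threshold):
    """Pick the most probable class and decide how the case is handled.

    probabilities: mapping of class label to predicted probability.
    threshold: minimum probability for the system to act automatically.
    """
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    action = "automatic" if confidence >= threshold else "manual_review"
    return label, confidence, action


# Hypothetical classifier outputs and a hypothetical threshold.
uncertain = route_case(
    {"metabolic_syndrome": 0.55, "no_metabolic_syndrome": 0.45}, 0.80
)
confident = route_case(
    {"obese": 0.92, "overweight": 0.06, "normal": 0.02}, 0.80
)
print(uncertain)
print(confident)
```

With these inputs the 55% prediction is routed to manual review, while the 92% prediction proceeds automatically.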


Fig. 8.3 Elements involved in the C.R.I.S.P methodology

8.4.1 Development Methodologies

The Cross Industry Standard Process for Data Mining (C.R.I.S.P-D.M) is a valuable tool for itemizing a data analysis problem. Since machine learning is mainly a data-driven and algorithmic procedure, it can benefit significantly from C.R.I.S.P to outline and explain the key variables and the questions involved in understanding the problem at hand. C.R.I.S.P was developed in 1996 by SPSS and is a process model consisting of six phases (Fig. 8.3) that outline the life cycle of a data analysis project. C.R.I.S.P has been proposed as a methodology for handling large data sets to ensure the quality of results for policy makers [23]. Other iterations of C.R.I.S.P have also been proposed with quality assurance in mind [24], as well as extraction and optimization strategies for machine learning applications [25]. With neural networks we can recognise patterns that compress the patient’s journey and ease the patient’s emotional journey by limiting unnecessary and mundane tasks. By creating awareness and improving symptom management, we can limit the need for specialized treatment, speed up processes, and reduce severity and fatalities. We can also reduce the challenges physicians face and the time between objectives, lessen the educational challenges that a patient may face, and offer better task coordination for medical professionals. A simplified version of the patient’s journey can be seen in Fig. 8.4, where the business issue of C.R.I.S.P is analysed. The journey starts upon identification of a symptom, followed by a doctor visit, examinations, and a second visit where the first read of the exams takes place. By reading the exams, the doctor will identify, based on the values, the existence of a medical state, a problem in the values that may require more examinations to confirm the results, and so on. A second read may then take place, leading to a recommendation by the doctor.


Fig. 8.4 Simplified patient’s journey

Recognising the path and the issues, we see that some of the processes can be replaced by automation and machine learning, and this is the basis of our system. For the purpose of this paper only the business issue was analysed, and from this starting point we developed neural networks that can produce accurate results and automation, offering faster paths in the patient’s journey. It is important to note that the outcomes of C.R.I.S.P are used as elements of the rational unified process. R.U.P is an agile software development methodology that splits the project life cycle into four phases. During each of the phases, all six core development disciplines take place: business modelling, requirements, analysis and design, implementation, testing, and deployment. However, certain disciplines are more important and take up more time in particular phases. For example, business modelling mostly takes place during the early phases, inception and elaboration. Each of the four phases has objectives that need to be completed for the project to move on to the next phase. The more linear and recursive approach of R.U.P [26], compared to other iterative development methodologies, benefits a health application or medical device project, since in many cases some of the development milestones need validation from different actors before the project can continue to the next phase. Failure to validate leads to the redesign of key parameters and thus to a waste of significant resources. Knowledge-based software and e-commerce applications have been used as benchmarks to examine the usability of the R.U.P methodology in using, designing, building and evaluating adaptive applications and clustering algorithms [27]. Using R.U.P, we propose an adapted methodology in which all stakeholders are recognised and assigned particular tasks. We have adopted the C.R.I.S.P methodology as an external contributor to answer important product usability questions following the golden circle of innovation. The process is visualized in an adapted R.U.P map that incorporates the different production cycles (Fig. 8.5), where the main tasks are also recognised. The C.R.I.S.P methodology (Crisp), the regulatory task (reg) and the validation process (val) are integrated in the R.U.P interactive map, where the time of occurrence of each task can be seen in each phase of the production cycle. For the purpose of this paper only the logical view of the R.U.P methodology is described, to offer a general understanding of the functionality offered to the end users.
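The idea that a phase cannot be left until its objectives are validated can be sketched as a simple phase gate. The four phase names are the standard R.U.P phases; everything else is an assumption made for illustration. In particular, the mapping of the Crisp, reg and val tasks to phases is hypothetical and is not read off Fig. 8.5, and the function `next_phase` is an invented name.

```python
from enum import Enum


class Phase(Enum):
    INCEPTION = 1
    ELABORATION = 2
    CONSTRUCTION = 3
    TRANSITION = 4


# Hypothetical mapping of phases to the sign-offs they require.
# The task names follow the text (crisp, reg, val); the assignment
# to phases is an assumption made for this sketch.
REQUIRED_SIGN_OFFS = {
    Phase.INCEPTION: {"crisp"},
    Phase.ELABORATION: {"crisp", "reg"},
    Phase.CONSTRUCTION: {"reg", "val"},
    Phase.TRANSITION: {"reg", "val"},
}


def next_phase(current, sign_offs):
    """Advance only when every sign-off required by the phase is present.

    Returns the resulting phase and the sorted list of missing sign-offs.
    """
    missing = REQUIRED_SIGN_OFFS[current] - set(sign_offs)
    if missing:
        return current, sorted(missing)
    if current is Phase.TRANSITION:
        return current, []  # last phase: nothing further to advance to
    return Phase(current.value + 1), []


stuck = next_phase(Phase.ELABORATION, {"crisp"})
advanced = next_phase(Phase.ELABORATION, {"crisp", "reg"})
print(stuck)
print(advanced)
```

Returning the missing sign-offs alongside the phase makes it visible which stakeholder is holding up progress, which matches the point that failed validation forces redesign and wastes resources.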


Fig. 8.5 R.U.P methodology with implementations

Logical View of Proposed Application

The design model, as per the R.U.P methodology, will give a concrete description of system behaviour and distributed roles. Later in the production cycle, the design model will be refined into separate design artifacts. As seen in Fig. 8.6, the patient logs into the basic system, where an automatic classification based on blood exams (weight class, metabolic syndrome indication, and heart risk) can be acquired and a basic automatic recommendation can take place. Depending on the patient’s willingness to contribute his or her health data to the A.I system, those data will either be used to re-calibrate the A.I system and be promoted to the advanced system, or be discarded. When a patient wants to contribute and new models are trained with the new data, the regulator, alongside the supervising doctor, will have to give a verdict on the usability, usefulness and validity of the new models with the tools provided and according to the predefined criteria, and decide whether the new models will be promoted into production. In the advanced system, the results of the classification model are shown depending on the probability of the outcome. If the probability is below a threshold of certainty defined by the supervising doctor (for example, if there is a 55% chance that the patient has metabolic syndrome, belongs in a weight class or has a high blood pressure risk), the results will trigger a manual assessment of individual blood exams or issue a request to the patient for further examination for a better definition of the outcome. If the results are above that predefined threshold, the process (diagnosis, health risk assessment and recommendation) will be conducted automatically. The basic and the advanced system share the same analyzer (neural network model and classifier); the difference between the systems is that the advanced system will


Fig. 8.6 Analysis and design, the logical view case model

trigger further evaluations that depend on the outcome and/or health state. The importance, which is depicted in Fig. 8.6 with dotted and colored arrows (red, yellow and green), relates to the different triggers that depend on the severity of the health state, and signifies whether a nurse, a medical professional or a specialist/physician will be involved in facilitating the diagnosis. For each stakeholder, a different user interface offers the necessary tools and visual components related to the described processes, and system notifications activate calls to action. Finally, a review system will verify the usability of the application, as discussed in Sect. 8.2.4. As this is a case study based on a particular methodology, applications in more complex medical tasks can also be implemented. The health care professionals and regulators are offered a secondary user interface where the methodologies and technologies used in the system are well described and results are properly visualized (System transparency in Fig. 8.6), to increase transparency, trust and continuous integration.
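The certainty threshold described above can be illustrated with a minimal Python sketch. The function name `route_result`, the returned labels, and the default threshold of 0.80 are assumptions made only for illustration; the chapter states that the supervising doctor defines the threshold and gives 55% merely as an example of an uncertain outcome.

```python
def route_result(probability: float, threshold: float = 0.80) -> str:
    """Decide how a classification result is handled (hypothetical helper).

    probability: classifier confidence for the predicted outcome, in [0, 1].
    threshold:   certainty threshold; 0.80 is an assumed placeholder, since
                 the chapter leaves the value to the supervising doctor.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < threshold:
        # Below the threshold: manual assessment or further examination.
        return "manual_assessment"
    # At or above the threshold: the process runs automatically.
    return "automatic"


if __name__ == "__main__":
    print(route_result(0.55))  # the 55% example from the text
    print(route_result(0.93))
```

Keeping the threshold as a parameter reflects the text's point that it is set by the supervising doctor rather than fixed in the system.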

8.5 Conclusion and Research Key Findings

When modeling a medical application, a solid starting point is the patient’s journey. With smart technologies, physicians’ challenges [28] can be reduced and the time loops between the patient’s steps can be minimized. Integrating customized tools into the patient’s journey can benefit patients in multiple ways [29], thus improving the health care experience. Educational challenges can be addressed, and the communication gap between health professionals and patients can be partially bridged. By creating awareness and better symptom administration, we can limit the need for specialized and costly treatments, speed up processes, and reduce severity and fatalities [30].


In this paper, we have adapted the golden circle of innovation as a way to identify the contribution of machine learning and thus offer a validation framework with which medical professionals can promote or dismiss proposed methodologies. We have outlined the use of the established C.R.I.S.P and R.U.P methodologies as ways to better align stakeholder needs and requirements. Since medical validation and regulatory confirmation are the main challenges that have been addressed, both validation and regulation have been added to the production cycle road map as defined by the R.U.P methodology (Fig. 8.5), where C.R.I.S.P is also embedded in the R.U.P process.

Acknowledgements This work has been partly supported by the University of Piraeus Research Center.

References

1. D.P. Panagoulias, D.N. Sotiropoulos, G.A. Tsihrintzis, Towards personalized nutrition applications with nutritional biomarkers and machine learning, in Advances in Assistive Technologies: Selected Papers in Honour of Professor Nikolaos G. Bourbakis, vol. 3, ed. by G.A. Tsihrintzis, M. Virvou, A. Esposito, L.C. Jain (Springer, Cham, 2022) Volume 28, pp. 73–122
2. J. Wiens, E.S. Shenoy, Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology. Clin. Infect. Dis. 66(1), 149–53 (2018)
3. N. Omar, B. Yan, M. Salto-Tellez, HER2: an emerging biomarker in non-breast and non-gastric cancers. Pathogenesis 2(3), 1–9 (2015)
4. N.S. Nipa, M. Alam, M.S. Haque, Identifying relevant stakeholders in digital healthcare, in International Conference on Applied Intelligence and Informatics 2021 (Springer, Cham, 2021), pp. 349–357
5. J.A. Armontrout, J. Torous, M. Cohen, D.E. McNiel, R. Binder, Current regulation of mobile mental health applications. J. Am. Acad. Psychiatry Law. 46(2), 204–211 (2018)
6. Your Medical Device, FDA.GOV. https://www.fda.gov/medical-devices/overview-device-regulation/classify-your-medical-device (Last visited Jan. 30, 2020)
7. How to Determine if Your Product is a Medical Device, FDA.GOV. https://www.fda.gov/medical-devices/classify-your-medical-device/how-determine-if-your-product-medical-device (Last visited Jan. 30, 2020)
8. M.A. Ahmad, C. Eckert, A. Teredesai, Interpretable machine learning in healthcare, in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (2018), pp. 559–560
9. R. Schnall, H. Cho, J. Liu, Health information technology usability evaluation scale (Health-ITUES) for usability assessment of mobile health technology: validation study. JMIR mHealth and uHealth 6(1), e8851 (2018)
10. L. Zhou, J. Bao, I.M. Setiawan, A. Saptono, B. Parmanto, The mHealth App Usability Questionnaire (MAUQ): development and validation study. JMIR mHealth and uHealth 7(4), e11500 (2019)
11. J. Spruijt, T. Spanjaard, K. Demouge, The golden circle of innovation: what companies can learn from NGOs when it comes to innovation, in Modern Marketing for Non-Profit Organizations: International Perspectives, ed. by S. Smyczek (University of Economics in Katowice Publishing House, Forthcoming, Katowice, 2013)
12. A. Sharma, B.J. Singh, Understanding lean six sigma 4.0 through golden circle model. EasyChair (pre-print) (2020 May 2)
13. M.M. Crossan, M. Apaydin, A multi-dimensional framework of organizational innovation: a systematic review of the literature. J. Manage. Stud. 47(6), 1154–91 (2010)
14. S.E. Reid, U. De Brentani, The fuzzy front end of new product development for discontinuous innovations: a theoretical model. J. Prod. Innovat. Manage. 21(3), 170–84 (2004)
15. H.W. Chesbrough, Open Innovation: The New Imperative for Creating and Profiting from Technology (Harvard Business Press, 2003)
16. E.A. Politou, E. Alepis, M. Virvou, C. Patsakis, Privacy and Data Protection Challenges in the Distributed Era (Springer, 2022), pp. 1–185. ISBN 978-3-030-85442-3
17. E. Mougiakou, S. Papadimitriou, M. Virvou, Synchronous and asynchronous learning methods under the light of general data protection regulation. IISA 1–7 (2020)
18. D.N. Sotiropoulos, G.A. Tsihrintzis, Machine learning paradigms - artificial immune systems and their applications in software personalization. Intelligent Systems Reference Library 118, 3–327 (2017). ISBN 978-3-319-47192-1. Springer
19. D.P. Panagoulias, D.N. Sotiropoulos, G.A. Tsihrintzis, Biomarker-based deep learning for personalized nutrition, in Proceedings of the 33rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Virtually (1–3 November 2021), pp. 73–122
20. D.P. Panagoulias, D.N. Sotiropoulos, G.A. Tsihrintzis, Nutritional biomarkers and machine learning for personalized nutrition applications and health optimization, in Proceedings of the Twelfth IEEE International Conference on Information, Intelligence, Systems and Applications, Chania, Greece, vol. 12–14 (2021), pp. 731–733
21. D.P. Panagoulias, D.N. Sotiropoulos, G.A. Tsihrintzis, Nutritional biomarkers and machine learning for personalized nutrition applications and health optimization. (extended) Intell. Decis. Technol. 645–653 (2021)
22. D.P. Panagoulias, D.N. Sotiropoulos, G.A. Tsihrintzis, SVM-based blood exam classification for predicting defining factors in metabolic syndrome diagnosis. Electronics 11(6), 857 (2022)
23. E. Alogogianni, M. Virvou, Addressing the issue of undeclared work–Part I: applying associative classification per the CRISP-DM methodology. Intell. Decis. Technol. 15(4), 721–747 (2021)
24. S. Studer, T.B. Bui, C. Drescher, A. Hanuschkin, L. Winkler, S. Peters, K.R. Müller, Towards CRISP-ML (Q): a machine learning process model with quality assurance methodology. Mach. Learning Knowl. Extract. 3(2), 392–413 (2021)
25. W. Duch, R. Adamczak, K. Grabczewski, A new methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Trans. Neural Netw. 12(2), 277–306 (2001)
26. S. Ahangama, D.C. Poo, Unified structured process for health analytics. Int. J. Med. Health Biomed. Bioeng. Pharm. Eng. 8(11), 798–806 (2014)
27. A. Savvopoulos, M. Virvou, D.N. Sotiropoulos, G.A. Tsihrintzis, Clustering for user modeling in recommender e-commerce application: a RUP-based intelligent software life-cycle. JCKBSE 295–304 (2008)
28. D. Kauw, P.R. Huisma, S.K. Medlock, M.A. Koole, E. Wierda, A. Abu-Hanna, M.P. Schijven, B.J. Mulder, B.J. Bouma, M.M. Winter, M.J. Schuuring, Mobile health in cardiac patients: an overview on experiences and challenges of stakeholders involved in daily use and development. BMJ Innovat. 6(4) (2020)
29. M.L. Jacobs, J. Clawson, E.D. Mynatt, My journey compass: a preliminary investigation of a mobile tool for cancer patients, in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2014 Apr 26), pp. 663–672
30. J.P. Gregg, T. Li, K.Y. Yoneda, Molecular testing strategies in non-small cell lung cancer: optimizing the diagnostic journey. Transl Lung Cancer Res. 8(3), 286 (2019)

Part III

Educational and Assistive Software

Chapter 9

Multi-agent Simulation for Risk Prediction in Student Projects with Real Clients

Fumihiro Kumeno

Abstract We have conducted a software development project course with real clients: non-profit organizations of local communities, local governments, and special support schools. Although such a project course gives students experience of project work in real software development, many risks are inherent in the projects. In this paper, we focus on the risks of schedule delay and workload imbalance. We propose an analysis approach to predict such risks based on a multi-agent simulation model that includes task dependencies expressed using the program evaluation and review technique (PERT). The simulation results show that task allocation methods may affect the likelihood of schedule delay and workload imbalance.

Keywords Student project with real clients · Risk analysis · Multi-agent simulation

F. Kumeno
Department of Information Technology and Media Design, Nippon Institute of Technology, 4-1 Gakuendai, Miyashiro-machi, Minamisaitama-gun, Saitama pref 345-8501, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Various et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_9

9.1 Introduction

Project-based learning is an effective teaching approach for understanding software development practices. Instead of just acquiring knowledge and taking tests in lectures, students actively apply what they learn to solve problems. We conducted a software development project course with real clients: local non-profit organizations, local governments, and special schools. Student projects with real clients are effective; students gain experience of exploring real-world problems and the practical skill of applying what they learn in a real-world setting. Depending on the client, the problems, required technologies, difficulties, and scales vary widely. We must predict many risks of software development for real clients as well as risks inherent in student group work, such as dropout, conflict situations, and free-riding [1–3]. Students are required to manage risks by themselves. Although they get some advice from teachers to support their management of various projects, they must rely on their experience and intuition because there is no established systematic methodology for risk management in this context.

The risks to be predicted in a project depend on the problem and the client; however, schedule delay and workload imbalance are common to all projects. In classroom-based training, a schedule delay affects only the classroom. In real client projects, students must avoid schedule delays because they may harm the client. It is also important to avoid an extreme imbalance of workload among project members. In this paper, we propose an analysis approach to predict such risks in student projects. We model a student team and project using the multi-agent approach and predict schedule delay and workload imbalance risks through simulations. We present a multi-agent model that expresses a typical student team in our project course and show its simulation results. To discuss the validity of the simulation, we compare its results with the results of a questionnaire survey of real students conducted at the end of the project course.

The remainder of the paper is organized as follows. Section 9.2 briefly introduces our software development project course with real clients. Section 9.3 presents the multi-agent models for the projects described in Sect. 9.2. Section 9.4 presents simulation results based on some project scenarios. In Sect. 9.5, we explain the questionnaire survey of real students and discuss the validity of the simulation. We discuss related work in Sect. 9.6. Finally, Sect. 9.7 presents the conclusion and future work.

9.2 Software Development Project Course with Real Clients

Our one-year project course is for third-year software engineering students at the Department of Information Technology and Media Design, Nippon Institute of Technology. The clients are local non-profit organizations, local governments, and special support schools near our department. They have information technology problems, such as a lack of skill or budget for developing electronic teaching materials, websites, and database applications. We gather their software development or software maintenance problems and explain them to the attending students. The students form teams of 3–5 members, and each team selects one of the explained problems. In each team, the student members make the project plan and decide how to progress the project. We give them additional explanations of the problems along with technical and project management advice; however, we do not intervene in their projects. Students face and cope with the various situations that occur as the projects progress. When we define a project’s success as the state where the client accepts and begins to use the product, or where the client’s evaluation of the product is high, success or failure depends on the team’s development and project management capabilities and on the difficulty of the problem. The following risk factors cause project failures in student projects.


• Compared with professional software development teams, the knowledge, skill, and experience of student members are insufficient
• Trial-and-error activities arise because students must progress the project and learn new techniques or knowledge simultaneously
• Poor understanding of the significance of project management
• Poor communication capability
• Student dropout, conflict situations, free-riders

Consequently, each year the number of successful projects is almost the same as the number of failed projects. Project failures are bad events for clients; however, our clients accept them because they understand that student projects are part of educational activities. We believe that there is educational significance in the experience of project failure, which gives the students concerned a good opportunity to understand the importance of project management.

9.3 Multi-agent Model of Student Projects

We focus on the risks of schedule delay and workload imbalance, which are common to project teams in our course. To understand and predict the process by which schedule delay or workload imbalance occurs in a team and becomes serious, we construct a multi-agent simulation model in which project members are agents who complete virtual tasks to finish the project. We construct the model using the following approach:

• We first model a simple, abstract scenario by eliminating problem-dependent elements and specific complex situations.
• We model one project, not our entire project course, in which several projects progress in parallel.
• Models can involve dependencies among tasks expressed using the program evaluation and review technique (PERT).
• Static and dynamic task allocations to agents can be defined.
• Agents’ skills can be defined, and their performance can be changed dynamically.
• The model can include a situation where an agent fails to finish some task.

9.3.1 NetLogo

We employ NetLogo [4], a popular multi-agent simulation tool suitable for modeling complex systems that develop over time in the natural and social sciences. In NetLogo, a virtual world consists of four types of agents: turtles, patches, links, and the observer. Turtles are agents that move around in the world. The world is two-dimensional and divided into a grid of patches. Each patch is a square piece of “ground” over which turtles can move. Links are agents that connect two turtles. The observer is a special agent that monitors and controls the world; it can give instructions to other agents or create new agents. In NetLogo models, time passes in discrete steps called “ticks.” NetLogo includes a built-in tick counter; thus, we can track how many ticks have passed.

9.3.2 Student Project Model

A project team consists of 3–5 turtles (students). The project workspace is the world, a 10 × 10 grid plane containing 100 patches. Patches are tasks that must be completed to finish the project; we refer to the minimum working unit of a project as a task. Each turtle moves around in the world to find a patch that no other turtle has begun. Each patch has three states: “no turtles begin to do,” “some turtle begins to do,” and “finished by some turtle.” When a turtle finds a patch whose state is “no turtles begin to do,” it moves to and mounts the patch, changing the state to “some turtle begins to do” and finally to “finished by some turtle.” The project is finished when all patches are finished.

Figure 9.1 shows a snapshot of the user interface of our model. The square in the right area is the world’s monitor, which visualizes the project’s progress. There are 100 patches in the square, and each number is a patch’s identifier. The line chart in the left-center area visualizes each turtle’s performance: the y-axis indicates workload, and the x-axis indicates the number of ticks. The other dialog boxes around the line chart are used to monitor the detailed performance of the turtles. Figure 9.1 shows the initial state of the project, where all patches are red (“no turtles begin to do”) and four turtles are waiting in the lower-left area of the square.
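The three patch states and the rule that a turtle may only mount an unstarted patch can be sketched in Python as follows. This is not the NetLogo model itself, which the chapter does not list; the names `PatchState`, `make_world`, `claim_patch`, `finish_patch`, and `project_finished` are hypothetical and chosen only for this illustration.

```python
from enum import Enum


class PatchState(Enum):
    """The three patch states named in the text."""
    NOT_STARTED = "no turtles begin to do"
    IN_PROGRESS = "some turtle begins to do"
    FINISHED = "finished by some turtle"


def make_world(size: int = 10) -> dict:
    """Create a size x size grid of patches, all initially not started."""
    return {(x, y): PatchState.NOT_STARTED
            for x in range(size) for y in range(size)}


def claim_patch(world: dict, position: tuple) -> bool:
    """A turtle mounts a patch only if no turtle has begun it."""
    if world[position] is PatchState.NOT_STARTED:
        world[position] = PatchState.IN_PROGRESS
        return True
    return False


def finish_patch(world: dict, position: tuple) -> None:
    """Mark a patch that is being worked on as finished."""
    if world[position] is not PatchState.IN_PROGRESS:
        raise ValueError("only a patch in progress can be finished")
    world[position] = PatchState.FINISHED


def project_finished(world: dict) -> bool:
    """The project is finished when all patches are finished."""
    return all(state is PatchState.FINISHED for state in world.values())


if __name__ == "__main__":
    world = make_world()
    print(len(world), claim_patch(world, (0, 0)), claim_patch(world, (0, 0)))
```

The second call to `claim_patch` on the same position returns `False`, which mirrors the constraint that a turtle looks for a patch no other turtle has begun.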

Fig. 9.1 Snapshot of the user interface of our model (Initial state of a project)


Fig. 9.2 Snapshot of the user interface of our model (Project in progress)

When the project starts, each turtle moves around in the square and turns patches blue, meaning “finished by some turtle” (Fig. 9.2). When the project is completed, all patches are blue, and all turtles move back to their original positions (Fig. 9.3).

Fig. 9.3 Snapshot of the user interface of our model (Terminal state of a project)


Fig. 9.4 PERT diagram example (activity on arrow)

Fig. 9.5 Mapping patches to each arrow in a PERT diagram

9.3.3 Dependencies Among Tasks in a Project

We model dependencies among tasks based on a PERT diagram by defining turtles’ constraints and behaviors when they search for red patches. Figure 9.4 shows an example of a PERT diagram (activity on arrow). Node (1) is the initial state, and node (7) is the project’s finished state. Each arrow (except dummy arrows) indicates an activity required to finish the project. We map a group of patches to each arrow. For example, we map the patches in area A to arrow (2)–(3) and the patches in area B to arrow (3)–(6) (see Fig. 9.5). With the constraint that turtles cannot begin patches in area B until all patches in area A are finished, we define the dependency between arrows (2)–(3) and (3)–(6). By defining constraints for all arrows in this manner, we model the task dependencies of the whole PERT diagram.

9.3.4 Task Allocation to Project Members

The static allocation of patches to turtles (static task allocation to project members) is also modeled by defining constraints on turtles’ behaviors when they search for red patches. Figure 9.6 shows an example of static allocation: Turtle 1 finds and performs only the patches in Area 1, which is allocated to Turtle 1. We model dynamic allocation as the state in which there is no such constraint and each turtle greedily searches for patches.
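The difference between the two allocation modes can be sketched in Python as a filter on the patches a turtle is allowed to start. The function names and the round-robin way of producing an equal static split are assumptions for illustration; the chapter's own example allocates contiguous areas (Fig. 9.6) rather than alternating patches.

```python
def impartial_allocation(patches: list, turtles: list) -> dict:
    """Static allocation sketch: deal patches to turtles in turn so that
    every turtle receives an equal share (assumed round-robin scheme)."""
    return {patch: turtles[i % len(turtles)] for i, patch in enumerate(patches)}


def candidate_patches(turtle, unstarted, allocation=None) -> list:
    """Patches a turtle may start.

    allocation=None models dynamic allocation: no constraint, so the turtle
    may greedily take any unstarted patch. Otherwise only the patches that
    are statically allocated to this turtle are returned.
    """
    if allocation is None:
        return sorted(unstarted)
    return sorted(p for p in unstarted if allocation[p] == turtle)


if __name__ == "__main__":
    allocation = impartial_allocation(list(range(100)), [0, 1, 2, 3])
    print(candidate_patches(1, {1, 2, 5}, allocation))  # static
    print(candidate_patches(1, {1, 2, 5}))              # dynamic
```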


Fig. 9.6 Example of static task allocation to turtles

9.3.5 Members’ Skill and Performance

Furthermore, we use the number of ticks required to finish one patch to model members’ skills and performance. If a turtle has a high skill or performs well, we set the number of ticks to finish one patch to a low value. If a turtle’s skill is low or it performs poorly, we set it to a high value. These settings can be changed dynamically. We also implement a failure behavior in which a turtle abandons the patch it is working on. When a failure rate is set for a turtle, the turtle fails to perform a patch with the prescribed probability.
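A minimal Python sketch of this idea treats skill as the number of ticks one patch costs and failure as a random event with the prescribed probability. The function name `attempt_patch` is hypothetical, and the choice that a failed attempt still consumes the full number of ticks is an assumption; the chapter does not say how many ticks an abandoned patch costs.

```python
import random


def attempt_patch(ticks_per_patch: int, failure_rate: float,
                  rng: random.Random) -> tuple:
    """Simulate one attempt at a patch and return (ticks_spent, finished).

    ticks_per_patch: low for a skilled turtle, high for an unskilled one.
    failure_rate:    probability that the turtle abandons the patch.
    Assumption: a failed attempt still costs ticks_per_patch ticks.
    """
    failed = rng.random() < failure_rate
    return ticks_per_patch, not failed


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    attempts = [attempt_patch(3, 0.2, rng) for _ in range(1000)]
    failures = sum(1 for _, finished in attempts if not finished)
    print(f"{failures} of 1000 attempts failed")
```

Passing the random generator explicitly keeps repeated simulations reproducible, which matters when results from many runs are compared.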

9.3.6 Risk Prediction by Simulating in Our Model

Many condition settings of the model are possible by combining the following conditions: the number of team members, the structure of the PERT diagram, task allocations, and members’ skills and performance. When simulating a model under some condition, if the number of ticks it takes to finish the project (NTPF) is high, meaning that it takes a long time to finish the project, we interpret that this condition directly affects the risk of schedule delay. When simulating repeatedly, if there is great variability among the resulting NTPFs, we also interpret that a volatile schedule risk is inherent in this condition. To analyze the risk of workload imbalance, we focus on the variation among turtles of the following values: the total number of patches that a turtle has done (TNP) and the total number of ticks it takes the turtle to finish these patches (TNT). When simulating a model under some condition, if there is great variability of these values among turtles, we interpret that a serious risk of workload imbalance is inherent in this condition.
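These indicators can be summarized over repeated runs with a short Python sketch. It takes the variation among turtles to be the maximum minus the minimum value within one run, which is how the Range columns are defined in Sect. 9.4. The record layout, the function names, and the use of the sample standard deviation are assumptions; the chapter does not state which form of standard deviation was used.

```python
from statistics import mean, stdev


def value_range(values: list) -> float:
    """Maximum difference among turtles within one simulation run."""
    return max(values) - min(values)


def summarize_runs(runs: list) -> dict:
    """Summarize repeated simulation runs (at least two).

    Each run is a dict with hypothetical keys:
      'ntpf': ticks needed to finish the project,
      'tnp':  patches finished per turtle,
      'tnt':  ticks spent per turtle.
    """
    ntpf = [run["ntpf"] for run in runs]
    tnp_ranges = [value_range(run["tnp"]) for run in runs]
    tnt_ranges = [value_range(run["tnt"]) for run in runs]
    return {
        "ntpf_avg": mean(ntpf),
        "ntpf_sd": stdev(ntpf),  # sample SD, an assumption
        "tnp_range_avg": mean(tnp_ranges),
        "tnp_range_sd": stdev(tnp_ranges),
        "tnt_range_avg": mean(tnt_ranges),
        "tnt_range_sd": stdev(tnt_ranges),
    }


if __name__ == "__main__":
    # Two made-up runs with four turtles, for illustration only.
    runs = [
        {"ntpf": 100, "tnp": [25, 25, 25, 25], "tnt": [98, 100, 99, 100]},
        {"ntpf": 104, "tnp": [24, 26, 25, 25], "tnt": [101, 104, 100, 103]},
    ]
    print(summarize_runs(runs))
```

A high `ntpf_avg` or `ntpf_sd` would point to schedule risk, while high range values would point to workload imbalance, following the interpretation given above.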

9.4 Simulation Results

We conducted simulation experiments to explore how the condition settings described in the previous section affect the risks of schedule delay and workload imbalance. We ran 100 simulations under each scenario, recorded the NTPFs, TNPs, and TNTs, and conducted a statistical analysis to compare the results across scenarios. We assumed four team members because all real project teams consist of 3–5 students. We also assumed that the performance of each turtle changes dynamically during the project. This assumption is based on a questionnaire survey of real students, in which many students answered that their performance changes in every exercise. In our experiments, if a turtle finishes one patch within two ticks, we define the performance value of the turtle as 30. If the performance value is 60, the turtle can finish two patches within two ticks. To implement the variation in a turtle’s performance, we set the performance value to a normally distributed random floating-point number with a mean of 20.0 and a standard deviation of 5.0. All values and conditions concerning turtles’ performance and skill in our experiments are tentative settings for analyzing the risks of schedule delay and workload imbalance. The results and analysis of the simulation experiments showed that the condition on task allocation affects the NTPFs, TNPs, and TNTs. We show three comparisons that led to this conclusion.

Static and Impartial Allocation versus Dynamic Allocation: We conduct simulations and compare the results under the following cases. Case1: a static task allocation where we allocate patches to turtles impartially. Case2: each turtle greedily searches for patches and finishes them. Table 9.1 summarizes the results. Range p1–4 is the range of TNPs, i.e., the maximum difference of TNPs in one simulation. Range j1–4 is the range of TNTs, i.e., the maximum difference of TNTs in one simulation. Comparing Case1 and Case2, the average (Avg.) and standard deviation (SD) of NTPFs are very close. Although the average and SD of Range p1–4 in Case1 are 0 because of the condition, those in Case2 are also less than 1.0. The average and SD of Range j1–4 in Case1 are slightly larger than those in Case2.

Table 9.1 Comparison of Case1 and Case2

       Case1 NTPF   Case1 Range j1–4   Case1 Range p1–4   Case2 NTPF   Case2 Range j1–4   Case2 Range p1–4
Avg.   102.0        3.1                0.0                101.8        2.5                0.5
SD     1.2          1.5                0.0                1.1          0.9                0.9


Table 9.2 Comparison of Case1 and Case2 (There are skill differences)

       Case1 NTPF   Case1 Range j1–4   Case1 Range p1–4   Case2 NTPF   Case2 Range j1–4   Case2 Range p1–4
Avg.   123.3        27.6               0.0                112.6        3.4                1.9
SD     8.5          22.7               0.0                2.6          1.3                1.0

Table 9.2 summarizes the simulation results for the case where we add skill differences to Case1 and Case2. We assume that each turtle has 25 special patches that it finishes with a performance value 30% higher, and that it finishes the other 75 patches with a performance value 30% lower. The allocation of special patches is random and mutually exclusive. The average and SD of NTPFs in Case1 are larger than those in Case2. For Range j1–4, the results in Case1 are much larger than those in Case2. In Case1 the result of Range p1–4 is 0; however, we can interpret that the risks are higher than those in Case2.

Static Allocation Based on Arrows in a PERT Diagram versus Dynamic Allocation: We add the constraint based on the PERT diagram (Fig. 9.4) to the above Case2 with the skill differences. We refer to this case as Case3. Table 9.3 presents the comparison of Case2 and Case3; the results are close in both cases.

Table 9.3 Comparison of Case3 and Case2

       Case3 NTPF   Case3 Range j1–4   Case3 Range p1–4   Case2 NTPF   Case2 Range j1–4   Case2 Range p1–4
Avg.   114.7        4.2                2.0                112.6        3.4                1.9
SD     2.8          1.7                1.2                2.6          1.3                1.0

Case4 is the case where we statically allocate the patches in each arrow to turtles. We allocate the patches in arrows (2)–(3) and (3)–(6) to turtles 0 and 2, the patches in arrows (4)–(5) and (5)–(6) to turtles 1 and 3, and the patches in arrow (6)–(7) to all turtles. Table 9.4 compares the simulation results in Case3 and Case4. The average and SD of NTPFs and Range j1–4 in Case4 are larger than those in Case3. The results of Range p1–4 in Case4 are smaller than those in Case3, but the differences are smaller than for the other results. We can interpret that Case3 is better than Case4 for avoiding the risks.

Table 9.4 Comparison of Case3 and Case4

       Case3 NTPF   Case3 Range j1–4   Case3 Range p1–4   Case4 NTPF   Case4 Range j1–4   Case4 Range p1–4
Avg.   114.7        4.2                2.0                120.7        16.5               0.5
SD     2.8          1.7                1.2                6.6          8.1                0.9

The Effect of Problematic Member: In addition to the above simulations, we conduct simulations under scenarios where problematic members take many ticks to finish patches or often fail because of their poor skills. To keep the scenarios simple, we assume that there is one problematic member: a turtle whose skill is 50% of that of the other turtles and that fails to finish a patch with a probability of 20%. Table 9.5 compares the simulation results in Case3 and Case4 under this scenario. Except for Range p1–4, the averages and SDs in Case4 are higher than those in Case3. In particular, the results of NTPFs and Range j1–4 in Case4 are very high. Table 9.5 also presents the coefficient of variation (CV) of each value to compare the dispersion of the two cases.

Table 9.5 Comparison of Case3 and Case4 (There is a problematic member)

                 Case3 NTPF   Case3 Range j1–4   Case3 Range p1–4   Case4 NTPF   Case4 Range j1–4   Case4 Range p1–4
Avg.             149.3        8.2                27.8               1236.4       828.2              6.0
SD               5.8          3.8                2.5                431.3        324.5              1.1
CV (= SD/Avg.)   3.9%         47.1%              9.0%               34.9%        39.2%              17.8%

A possible cause of the results in Case4 is that turtles cannot begin the patches in arrow (6)–(7) for a long time because it takes many ticks for the problematic turtle to finish all patches in arrows (2)–(3) and (3)–(6). When we increase the failure probability, some projects fail to finish in Case4, but there are no such projects in Case3. From the above simulation results, we infer that the dynamic allocation approach is better than the static approach for avoiding the risks of schedule delay and workload imbalance.

9.5 Questionnaire Survey

As a faculty-development activity in our project course, we conduct a questionnaire survey at the end of the course. To compare the simulation results reported in the previous section with the results of real projects in our course, we added the following questions to the survey:

Q1: Do you think your project progressed on schedule? (Ans. 1: Disagree, 2: Disagree a little, 3: Agree a little, 4: Agree)
Q2: Were there any changes in task assignment in your project? (Ans. 1: Not at all, 2: A little, 3: Quite a lot, 4: Very much)
Q3: Do you think members’ workloads were lopsided? (Ans. 1: No, almost the same, 2: A little lopsided, 3: Very lopsided)

Q1 concerns the schedule, Q2 relates to task allocation, and Q3 is closely related to the balance of workload. We analyzed the relationship between Q2 and the other questions with Spearman’s rank correlation coefficient. Spearman’s ρ for Q2 and Q1 is 0.36 (test of no correlation: p = 0.0086), which indicates a positive tendency between Q2 and Q1. Spearman’s ρ for Q2 and Q3 is −0.41 (test of no correlation: p = 0.0057), which indicates a negative tendency between Q2 and Q3. We also made cross-tabulations by dividing the answers to Q2 into two groups.


Table 9.6 Cross-tabulation of Q1 and Q2 (N = 43)

Q1                     Q2: negative   Ans = 1   Ans = 2   Q2: positive   Ans = 3   Ans = 4
1: Disagree            3              2         1         0              0         0
2: Disagree a little   4              1         3         6              6         0
3: Agree a little      9              3         6         15             10        5
4: Agree               1              0         1         5              3         2

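Table 9.7 Cross-tabulation of Q2 and Q3 (N = 43)

Q3                      Q2: negative  Ans = 1  Ans = 2  Q2: positive  Ans = 3  Ans = 4
1: No, almost the same             2        2        0             5        2        3
2: A little lopsided               7        0        7            17       13        4
3: Very lopsided                   8        4        4             4        4        0

The "Q2: negative" and "Q2: positive" columns are the totals of Ans = 1–2 and Ans = 3–4, respectively.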

One is the negative group (answers 1 or 2); the other is the positive group (answers 3 or 4). Table 9.6 presents the cross-tabulation of Q2 and Q1. In the negative group, the answers to Q1 are widely distributed from 1 to 4. In the positive group, nobody chose answer 1, and answer 4 was chosen more often than in the negative group. We infer that the positive group contains more students who felt that their projects progressed on schedule than the negative group does. Table 9.7 presents the cross-tabulation of Q2 and Q3. In the negative group, answer 3 is the most frequent, followed by answer 2, and answer 1 is the least frequent. In the positive group, answer 2 is the most frequent, followed by answer 1, and answer 3 is the least frequent. We infer that students in the positive group tended to feel that their workloads were less lopsided than students in the negative group did. These results suggest that frequent changes in assignation may lessen the risks of schedule delay and workload imbalance, which is consistent with the simulation results.
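The rank correlation between Q2 and Q1 can be rechecked from the published cell counts alone. The following is a minimal sketch, assuming the counts are transcribed correctly from Table 9.6 (rows Q1 = 1..4, columns Q2 = 1..4); the helper names `average_ranks` and `spearman_rho` are illustrative, and ties are handled with average ranks, which may differ in detail from the tool the author used.

```python
from math import sqrt

# Cell counts transcribed from Table 9.6 (assumed correct):
# rows are Q1 answers 1..4, columns are Q2 answers 1..4.
TABLE_9_6 = [
    [2, 1, 0, 0],
    [1, 3, 6, 0],
    [3, 6, 10, 5],
    [0, 1, 3, 2],
]


def average_ranks(values):
    """Return the rank of each value, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Expand the table back into one (Q1, Q2) pair per respondent.
pairs = [(q1 + 1, q2 + 1)
         for q1, row in enumerate(TABLE_9_6)
         for q2, count in enumerate(row)
         for _ in range(count)]
q1_answers = [p[0] for p in pairs]
q2_answers = [p[1] for p in pairs]

negative = sum(1 for a in q2_answers if a <= 2)  # Q2 answers 1 or 2
positive = sum(1 for a in q2_answers if a >= 3)  # Q2 answers 3 or 4
rho = spearman_rho(q2_answers, q1_answers)
print(len(pairs), negative, positive, round(rho, 2))
```

With the counts as transcribed, this gives N = 43, group sizes of 17 (negative) and 26 (positive), and ρ ≈ 0.36 for Q2 and Q1, in line with the value reported in the text.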

9.6 Related Work

Software risk management is a vital activity for the success of software projects. Software engineering researchers and practitioners have proposed many approaches and techniques for managing risks in software projects [5]. Software process simulation modeling (SPSM) is an approach that helps us to understand and analyze various phenomena in software projects [6]. SPSM has also been applied to software risk management, mainly in risk analysis and risk management planning activities [7]. Previous studies on applying SPSM to risk management have mainly assumed software projects carried out by professional members, and it is not easy to apply their results to student projects in the same way. Several simulation tools have been proposed for software project education [8–11]; they support students or trainees in learning and understanding software projects in educational settings.

There are various SPSM paradigms: hybrid simulation, state-based simulation, queuing models, discrete-event simulation, system dynamics, agent-based simulation, Petri-net models, simulation-based teaching, and Monte-Carlo simulation. We adopt multi-agent-based simulation (MABS) for two reasons. One reason is that we need information on each member’s state to analyze the risk of workload imbalance among student members. The other reason is that members’ cooperative work and their individual skills and performance are closely related to the risks of schedule delay and workload imbalance. We can naturally model a member’s state and team performance by defining the project process in terms of multi-agent activities. MABS is a promising approach for exploring the relationships between team composition, formation, and performance, and it is not limited to software projects [12]. In [12], it was reported that most cooperation methods do not consider time constraints, action dependencies, action failure, plan robustness, or dynamic task changes. As far as we have surveyed, we believe this is true. Our multi-agent model considers task dependencies based on PERT as well as agents’ task failures. A few studies analyze schedule risk or cost estimation risk through simulations based on PERT [13, 14]; however, their approaches are not MABS.

9.7 Conclusion and Future Works

The final goal of our work is to establish a risk prediction method for student projects with real clients. As the first step toward this goal, we focused on the risks of schedule delay and workload imbalance and proposed a MABS approach to analyze their causes and processes. The contribution of the paper is a multi-agent model of student projects, together with simulation results suggesting that frequent changes in assignation may lessen the risks of schedule delay and workload imbalance. To start the analysis with simple cases, we presented a simple model comprising only members’ skills, performances, task allocations, and task dependencies. Therefore, our model and simulation can also be applied to group work other than software projects. We also reported questionnaire survey results. These results seem to be consistent with the simulation results; however, it is unclear whether they arise for the same reason. We should validate and verify our model through project case studies, by building closer models, and by implementing them on other MAS platforms. To make the model closer to real student projects, we need to implement specific models that include the aspects of student software projects and an agent model of clients.

Acknowledgements This work was supported by KAKENHI 19K03011.


References

1. M.L. Pertegal-Felices, A. Fuster-Guilló, M.L. Rico-Soliveres, J. Azorín-López, A. Jimeno-Morenilla, Practical method of improving the teamwork of engineering students using team contracts to minimize conflict situations. IEEE Access 7, 65083–65092 (2019)
2. J. He, Free-Riding in Student Software Development Teams: An Exploratory Study, AMCIS 2009 Proceedings, p. 575 (2009)
3. S. Koolmanojwong, B. Boehm, A Look at Software Engineering Risks in a Team Project Course, IEEE 26th Conference on Software Engineering Education and Training (CSEE&T), pp. 21–30 (2013)
4. U. Wilensky, NetLogo, Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL. http://ccl.northwestern.edu/netlogo. Last accessed 11 Feb 2022
5. J. Masso, F.J. Pino, C. Pardo, F. García, M. Piattini, Risk management in the software life cycle: a systematic literature review. Comput. Stand. Interfaces 71 (2020)
6. J.A. García-García, J.G. Enríquez, M. Ruiz, C. Arévalo, A. Jiménez-Ramírez, Software process simulation modeling: systematic literature review. Comput. Stand. Interfaces 70 (2020)
7. D. Liu, Q. Wang, J. Xiao, The Role of Software Process Simulation Modeling in Software Risk Management: a Systematic Review, 2009 3rd International Symposium on Empirical Software Engineering and Measurement, Lake Buena Vista, FL, pp. 302–311 (2009)
8. I. Cohen, M. Iluz, A. Shtub, A simulation-based approach in support of project management training for systems engineers. Syst. Eng. 17(1), 26–36 (2014)
9. D.C.C. Peixoto, R.F. Resende, C.I.P.S. Pádua, An Educational Simulation Model Derived From Academic and Industrial Experiences, 2013 IEEE Frontiers in Education Conference (FIE), pp. 691–697 (2013)
10. A. Drappa, J. Ludewig, Simulation in Software Engineering Training, Proceedings of the 22nd International Conference on Software Engineering (ICSE ’00), pp. 199–208 (2000)
11. A. Jain, B. Boehm, SimVBSE: Developing a Game for Value-based Software Engineering, IEEE 19th Conference on Software Engineering Education and Training (CSEE&T), pp. 103–114 (2006)
12. E. Andrejczuk, J. Rodríguez-Aguilar, C. Sierra, A Concise Review on Multiagent Teams: Contributions and Research Opportunities, European Conference on Multi-Agent Systems International Conference on Agreement Technologies, Lecture Notes in Computer Science, pp. 31–39 (2017)
13. A.P. Hendradewa, Schedule risk analysis by different phases of construction project using CPM-PERT and Monte-Carlo simulation. IOP Conf. Ser.: Mater. Sci. Eng. 528, 012035 (2019)
14. V. Singh, V. Malik, R. Mittal, Risk analysis in software cost estimation: a simulation-based approach. Turk. J. Comput. Math. Educ. 12(6), 2176–2183 (2021)

Chapter 10

Automatic Scoring in Programming Examinations for Beginners

Yoshinori Tanabe and Masami Hagiya

Abstract When evaluating a beginner’s ability in programming, a standard method is to have them write programs that solve given problems and to judge the correctness of those programs. To apply this method in examinations taken by a large number of people, the correctness of the programs must be evaluated automatically. With the method of preparing inputs and their expected outputs and comparing the latter with the output of the submitted program, an incorrect program may be mistakenly judged as correct. In this study, we propose a programming language based on Presburger arithmetic for which the correctness of any program written in the language can be judged automatically. In addition, because beginners tend to stumble over the restrictions needed to realize automatic scoring, we introduce an environment in which programs are created by combining blocks, so that only programs conforming to the restrictions can be created. We confirmed that answers to some typical questions can be created and automatically scored.

10.1 Introduction

The Japanese government defines standard educational contents for high schools and updates them regularly. Under the 2022 revision, all high school students are required to study the subject ‘Informatics I’. This subject includes programming, and standard textbooks include programs written in Python, JavaScript, and other languages. From 2025, programming problems will be given in the unified entrance examination for universities in Japan.

Y. Tanabe (✉)
Tsurumi University, Yokohama, Japan
National Institute of Informatics, Tokyo, Japan

M. Hagiya
University of Tokyo, Tokyo, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Virvou et al. (eds.), Knowledge-Based Software Engineering: 2022, Learning and Analytics in Intelligent Systems 30, https://doi.org/10.1007/978-3-031-17583-1_10


var count, i;
input a(10);
count := 0;
i := 0;
repeat 10 {
  if (a[i] > 0 and a[i] != 19937)
    count += 1;
  i += 1;
}

Fig. 10.1 Example of “wrong” program

Against this background, a project for reforming university entrance examinations in the field of informatics was conducted, in which the second author played a key role. Based on the outcome of the project, the authors proposed a method to systematically generate multiple-choice questions that can evaluate the examinee’s understanding of programs [1].

If a large number of answers need to be processed in a short time, as in Japan’s unified university entrance examination, in which hundreds of thousands of students participate, or if the result is needed immediately after the examinee submits the answer, as in computer-based testing, the answers should be scored automatically. Multiple-choice questions can naturally be scored automatically. For problems that ask the examinee to create a program, on the other hand, the program created by the examinee must be evaluated automatically. A commonly used method for this purpose is to prepare a large number of input values and their correct outputs in advance and to have a system check whether each submitted program outputs the correct values for all the inputs. This method is widely used on competitive programming sites such as Topcoder and Codeforces [2].

This method has its limits in principle. As a somewhat extreme example, consider a problem that asks for the number of positive values in array a to be stored in a variable count, and assume that the code shown in Fig. 10.1 is submitted. This program is wrong because it behaves incorrectly when the input array contains the value 19937. However, it is extremely difficult to detect the error using the method described above. In practice, the questioner tries to prepare a set of inputs for which typical wrong programs output a value different from the correct answer; however, this is not an easy task.

A correct program should not be mistakenly judged as incorrect. This applies to examinations in general, and especially to entrance examinations. The abovementioned method of checking the outputs for multiple input values against the correct answers does not make this type of error. On the other hand, judging an incorrect program as correct may be considered more permissible than judging a correct program as incorrect. A multiple-choice test, for example, has this character: it can give a positive score to an examinee who does not have a correct understanding.
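The limit of test-based judging illustrated by Fig. 10.1 can be reproduced in a few lines. The sketch below is a Python transcription of the program in Fig. 10.1 together with a hypothetical judge; the names `reference_count`, `submitted_count`, and `passes_tests`, as well as the random test suite, are assumptions made for illustration and do not describe any particular judging system.

```python
import random


def reference_count(a):
    """Intended behaviour: count the positive values in the array."""
    return sum(1 for x in a if x > 0)


def submitted_count(a):
    """Python transcription of Fig. 10.1, including its deliberate flaw."""
    count = 0
    for x in a:
        if x > 0 and x != 19937:
            count += 1
    return count


def passes_tests(candidate, test_inputs):
    """Test-based judging: accept iff outputs match on every prepared input."""
    return all(candidate(a) == reference_count(a) for a in test_inputs)


# Hypothetical test suite: 1000 arrays of 10 values drawn from a small range,
# so the special value 19937 never appears among the inputs.
rng = random.Random(0)
random_tests = [[rng.randint(-100, 100) for _ in range(10)] for _ in range(1000)]

# The wrong program is accepted by the random suite ...
print(passes_tests(submitted_count, random_tests))
# ... and is rejected only by an input that happens to contain 19937.
print(passes_tests(submitted_count, [[19937] + [0] * 9]))
```

Adding more random inputs does not help unless one of them contains the special value, which is why a judgement that covers all possible inputs is attractive for examinations.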


However, the input values are not disclosed to the examinees. Those who write programs that are judged as incorrect will be dissatisfied if some incorrect programs are judged as correct. For questions in entrance examinations that ask examinees to write a correct program, it is desirable to judge whether the answer is correct or incorrect based on absolute criteria.

A practical programming language is Turing complete; thus, it is generally impossible to determine whether a program is correct. In this paper, we propose a programming language with assertions. The expressive power of the language is limited so that it can be decided automatically whether an assertion is satisfied for every possible input. In other words, programs written in this language can be scored correctly and automatically. By showing examples, we demonstrate that typical programming examination problems can be created using this language.

This programming language has several restrictions that enable the verification mentioned above. For example, in a multiplication, at least one of the two operands must be a constant. Writing a program with these restrictions in mind is not easy, especially for beginners. To alleviate this difficulty, we developed a program editing environment in which the examinee creates a program by combining parts (blocks) prepared in advance. Programs created in this way are guaranteed to comply with the restrictions of the language.

The rest of the paper is organized as follows. The background and notation are presented in Sect. 10.2, and the method is presented in Sect. 10.3. Section 10.4 covers some experiments. In Sect. 10.5, we discuss some issues with the proposed method. Finally, Sect. 10.6 concludes the paper.
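The multiplication restriction mentioned above is a purely syntactic property, so it can be checked before any verification is attempted. The following is a minimal sketch of such a check, assuming Python's expression syntax as a stand-in for the proposed language; the function `is_linear` is hypothetical and is not part of the authors' tool. It is deliberately conservative: only a literal constant, possibly negated, counts as a constant operand.

```python
import ast


def is_linear(expression: str) -> bool:
    """Return True if every multiplication has at least one constant operand.

    Only integer constants, variable names, unary minus, and the binary
    operators +, -, * are accepted; anything else is rejected.
    """
    def is_constant(node) -> bool:
        # A literal constant, optionally wrapped in unary minus.
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return is_constant(node.operand)
        return isinstance(node, ast.Constant)

    def check(node) -> bool:
        if isinstance(node, ast.Constant):
            return isinstance(node.value, int) and not isinstance(node.value, bool)
        if isinstance(node, ast.Name):
            return True
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return check(node.operand)
        if isinstance(node, ast.BinOp):
            if not (check(node.left) and check(node.right)):
                return False
            if isinstance(node.op, (ast.Add, ast.Sub)):
                return True
            if isinstance(node.op, ast.Mult):
                return is_constant(node.left) or is_constant(node.right)
        return False

    return check(ast.parse(expression, mode="eval").body)


print(is_linear("3 * x + y"))  # constant times variable is allowed
print(is_linear("x * y"))      # variable times variable is rejected
```

A block-based editor enforces the same property by construction, because it offers no block that multiplies two variables, so a check of this kind is only needed when programs are typed as free text.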

10.2 Preliminaries

10.2.1 Presburger Arithmetic

The theory Th(Z, 0, 1, =, +, −, = 0 and x != 2) or z -= 5. In the following, we use such expressions without notice.


program = { declaration } , { statement }
declaration = ( "var" | "input" ) , var_declaration , { "," , var_declaration } , ";"
var_declaration = variable | ( array , "(" , constant , ")" )
statement = block | input_statement | assignment_statement | conditional_statement | repeat_statement | break_statement | limited_while_statement | assert_statement | assume_statement
block = "{" , { statement } , "}"
input_statement = "input" , ( variable | array )
assignment_statement = ( variable | array_element ) , ":=" , expression , ";"
conditional_statement = "if" , condition , statement , [ "else" , statement ]
repeat_statement = "repeat" , constant , statement
break_statement = "break" , ";"
limited_while_statement = "while" , condition , limited_block
limited_block = "{" , { limited_assignment_statement } , "}" , ";"
limited_assignment_statement = ( variable | array_element ) , "+=" , constant , ";"
assert_statement = "assert" , extended_condition
assume_statement = "assume" , extended_condition
condition = atomic_condition | "not" , condition | condition , "or" , condition
atomic_condition = expression , "