Big Data, Cloud Computing, and Data Science Engineering (ISBN 3031196074, 9783031196072)

This book presents scientific results of the 7th IEEE/ACIS International Conference on Big Data, Cloud Computing, Data Science and Engineering (BCD 2022), held on August 4–6, 2022, in Da Nang, Vietnam.


English · 189 pages · 2023


Table of contents:
Foreword
Contents
Contributors
Research on Development of Sponsorship Effect Analysis Module Using Text Mining Technique
1 Introduction
2 Theoretical Background
2.1 Definition of Sponsorship and Sponsorship Effect
2.2 Study of Text Analysis Techniques
2.3 Prior Studies Using News Text Analysis
3 Research Method and Result
3.1 Research and Development Process
3.2 Data Collection
3.3 Development of Sponsor Effectiveness Analysis Module
3.4 Building a Simple Dashboard
4 Conclusions
References
A Study on the Relationship Between ESG News Keywords and ESG Ratings
1 Introduction
2 Theoretical Background
2.1 Overview of ESG
2.2 Related Works
3 Research Method
3.1 Data Collection
4 Data Analysis
4.1 Frequency Analysis and Word Cloud
4.2 Relationship Analysis of the Influence on ESG Rating
5 Conclusion
References
Development of Associated Company Network Visualization Techniques Using Company Product and Service Information—Using Cosine Similarity Function
1 Introduction
2 Theoretical Background and Prior Research
2.1 Text Mining Research
2.2 Social Network Analysis Study
2.3 Associated Companies Analysis
3 Research Method
4 Analysis Results
5 Conclusion
References
Hybrid CNN-LSTM Based Time Series Data Prediction Model Study
1 Introduction
2 Theoretical Background
2.1 Time-Series Analysis
2.2 Recurrent Neural Network
2.3 Long Short-Term Memory
2.4 Convolutional Neural Network
3 Research Method
3.1 Methodology to Predict Time-Series Using CNN-LSTM
3.2 Performance Procedure
4 Experiments and Results
4.1 Evaluation Methods
4.2 Evaluation Methods
4.3 Comparative Evaluation of Deep Learning Models
5 Conclusion
References
A Study on Predicting Employee Attrition Using Machine Learning
1 Introduction
2 Theoretical Background
2.1 Data Analytics in the Human Resources (HR)
2.2 Machine Learning Algorithm
3 Research Design
3.1 Research Framework
3.2 Feature Selection Using Filter Method
3.3 Classification Prediction Model
3.4 Machine Learning Performance Measurement
4 Research Result
4.1 Data Set
4.2 Data Preprocessing
4.3 Feature Selection
4.4 Ensemble Prediction Model Performance
5 Conclusion
References
A Study on the Intention to Continue Using a Highway Driving Assistance (HDA) System Based on Advanced Driver Assistance System (ADAS)
1 Introduction
2 Theoretical Background
2.1 Advanced Driver Assistance Systems (ADAS)
2.2 Highway Driving Assistance System (HDA)
2.3 HDA (Highway Driving Assist) User Characteristics
2.4 Protection Motivation Theory (PMT)
2.5 Technology Acceptance Model (TAM)
3 Research Method
3.1 Data Collection
3.2 Research Model and Selection of Variables
4 Data Result Analysis
5 Conclusions
References
Security Policy Deploying System for Zero Trust Environment
1 Introduction
2 Related Work
2.1 Current Security Threat Trends
2.2 Current Malware Technologies
2.3 Zero Trust Model
3 Current Security Environment Analysis
3.1 Security Environments Analysis
3.2 Requirement for Zero Trust Model Based Security Environment
4 Security Policy Deploying System
4.1 System Architecture
4.2 Security Policy Deploying Process
5 Implements
5.1 Verification Environments and Items
5.2 Verification Result
6 Conclusion
References
TTY Session Audit Techniques for Linux Platform
1 Introduction
2 Related Work
2.1 Enterprise System Operation Process
2.2 Security Threats Can Occur in Service Management
2.3 OS Protect Security Technique
3 Security Requirements for Information Service Operation
3.1 Security Environment Analysis About Current Information Service Operation
3.2 Security Requirement for Safe Information Service Operation
4 TTY Session Audit System for Linux Platform
4.1 TTY Session Audit System Architecture
4.2 TTY Session Audit System Usage Process
5 Implements
5.1 Verification Environment and Items
5.2 Verification Result
6 Conclusion
References
A Study on the Role of Higher Education and Human Resources Development for Data Driven AI Society
1 Introduction: The Necessity of Ethics in the Data Driven AI Society
2 Why Do We Need to Educate AI Developers on Ethics?
2.1 Target of AI Ethics Education and the Importance of Developer Education
2.2 Meaning of Professional Ethics Education
2.3 Developer-Centered Ethics Education in Computer Science
3 Current Status of AI Ethics Education in Higher Education
3.1 Type A: Technology-Based Ethics Convergence Education
3.2 Type B: Social Issues and Norm Education of Artificial Intelligence
3.3 Type C: Field-Oriented Practical Training
3.4 Analysis Results: Imbalance in Education Types
4 How Should Universities Educate AI Ethics?
5 Conclusions
References
The Way Forward for Security Vulnerability Disclosure Policy: Comparative Analysis of US, EU, and Netherlands
1 Introduction
2 Understanding Vulnerability Disclosure Policy
2.1 Cultural Aspects
2.2 Technological Aspects
2.3 Environmental Aspects
3 Vulnerability Disclosure Policy of Major Countries
3.1 United States
3.2 European Union
3.3 Netherlands
4 Key Challenges for Development of Vulnerability Disclosure Policy
4.1 Comparative Analysis
4.2 Protection of Security Researchers
4.3 Expansion of Vulnerability Disclosure Policy
4.4 Systematic Management of Known Vulnerability
5 Conclusions
References
Study on Government Data Governance Framework: Based on the National Data Strategy in the US, the UK, Australia, and Japan
1 Introduction
2 Background
2.1 Data Governance
2.2 Data Governance Approaches
2.3 Data Governance Framework
3 Research Method
3.1 National Data Strategy
3.2 DGF of the DGI
4 Data Governance Framework for National Data Strategy
4.1 Rules and Rules of Engagement
4.2 People and Organizational Bodies
4.3 Additional
5 Conclusions
References
A Study on the Attack Index Packet Filtering Algorithm Based on Web Vulnerability
1 Introduction
2 Related Study
2.1 Open Web Application Security Project Top 10
2.2 Packet Classification
3 Research Method
3.1 Packet Filtering Algorithm
4 Comparative Verification
4.1 Keyword Frequency Analysis Result
5 Conclusion
References
Analysis of IoT Research Topics Using LDA Modeling
1 Introduction
2 Related Study
2.1 Definition of IoT
2.2 Research on Internet of Things Research Trends
3 Research Method
3.1 Analysis Method
3.2 Analysis Data
4 Research Result
4.1 Keyword Frequency Analysis Result
4.2 Topic Modeling Result
5 Conclusion
References
Log4j Vulnerability Analysis and Detection Pattern Production Technology Based on Snort Rules
1 Introduction
2 Theoretical Background
2.1 Log4j Vulnerability
3 Research Method
3.1 Attack Configuration Analysis
3.2 Detection Pattern
4 Countermeasures of Log4j Vulnerability
4.1 Threat IP Blocking
4.2 Security Update of Log4j
4.3 Snort Detection Policy
5 Conclusions
References
A Study on Technology Innovation at Incheon International Airport: Focusing on RAISA
1 Introduction
2 Theoretical Background
2.1 Robot, Artificial Intelligence and Service Automation (RAISA)
2.2 RAISA at Airports
3 Case Analysis
3.1 Case Selection
3.2 RAISA at Incheon International Airport
3.3 Discussion
4 Conclusion
References
Author Index


Studies in Computational Intelligence 1075

Roger Lee   Editor

Big Data, Cloud Computing, and Data Science Engineering

Studies in Computational Intelligence Volume 1075

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Roger Lee Editor

Big Data, Cloud Computing, and Data Science Engineering

Editor Roger Lee Software Engineering and Information Technology Institute Central Michigan University Mount Pleasant, MI, USA

ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-031-19607-2 ISBN 978-3-031-19608-9 (eBook) https://doi.org/10.1007/978-3-031-19608-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

The purpose of the 7th IEEE/ACIS International Conference on Big Data, Cloud Computing, Data Science and Engineering (BCD 2022), held on August 4–6, 2022, in Da Nang, Vietnam, was to bring together researchers, scientists, engineers, industry practitioners, and students to discuss, encourage, and exchange new ideas, research results, and experiences on all aspects of big data, cloud computing, data science, and engineering, and to discuss the practical challenges encountered along the way and the solutions adopted to solve them. The conference organizers selected the best 15 papers from those accepted for presentation at the conference in order to publish them in this volume. The papers were chosen based on review scores submitted by members of the program committee and underwent further rigorous rounds of review.

In chapter "Research on Development of Sponsorship Effect Analysis Module Using Text Mining Technique," Weonsun Choi, Kyunghyun Lee, Yoonje Sung, and Gwangyong Gim collected more than 200,000 online sports media articles to analyze the sponsorship effect by minimizing the omission of sponsor brands exposed through online media and detecting sophisticated data. A commercial dictionary in the sports field was established, and a sponsorship effect analysis module was developed by quantifying text data using morpheme analysis and TF-IDF.

In chapter "A Study on the Relationship Between ESG News Keywords and ESG Ratings," Jaeyoung So, Myungho Lee, Jihun Park, and Gwangyong Gim attempt to reveal the correlation between daily ESG-related news and each E/S/G grade, focusing on the keywords of the main disclosure metrics according to the ESG disclosure guidance by KRX.

In chapter "Development of Associated Company Network Visualization Techniques Using Company Product and Service Information—Using Cosine Similarity Function," Kyunghyun Lee, Weonsun Choi, Beonghwa Jeon, and Gwangyong Gim closely examine a network of related companies that provide artificial intelligence solutions and application services, dividing them into companies that develop artificial intelligence technology and companies that apply it to various fields.

In chapter "Hybrid CNN-LSTM Based Time Series Data Prediction Model Study," Chungku Han, Hyeonju Park, Youngsoo Kim, and Gwangyong Gim present a hybrid CNN-LSTM that addresses the long-term dependency problem and can reduce the learning time for long time-series data.

In chapter "A Study on Predicting Employee Attrition Using Machine Learning," Simon Gim and Eun Tack Im use three machine learning methods, random forest, XGBoost, and artificial neural networks, to predict employee attrition and help prevent the unwanted loss of employees due to resignations.

In chapter "A Study on the Intention to Continue Using a Highway Driving Assistance (HDA) System Based on Advanced Driver Assistance System (ADAS)," Myungho Lee, Jaeyoung So, and Jaewook Kim present a study identifying factors affecting the intention to continue using the ADAS-based HDA system, which will inform the development of autonomous vehicles.

In chapter "Security Policy Deploying System for Zero Trust Environment," Sung-Hwa Han proposes a security policy deployment system for a zero trust environment. The proposed system consists of three components operating in two systems and uses an object-based security policy deployment method.

In chapter "TTY Session Audit Techniques for Linux Platform," Sung-Hwa Han proposes a TTY session audit technique that can check the commands entered by users accessing the Linux environment and the resulting messages.

In chapter "A Study on the Role of Higher Education and Human Resources Development for Data-Driven AI Society," Ji Hun Lim, Jeong Eun Seo, Jun Hyuk Choi, and Hun Yeong Kwon investigate and analyze the state of higher education for training AI experts. They point out the problems of artificial intelligence ethics education in higher education and suggest educational methods to be adopted in the future.

In chapter "The Way Forward for Security Vulnerability Disclosure Policy: Comparative Analysis of US, EU, and Netherlands," Yoon Sang Pil analyzes the cases of the US, the EU, and the Netherlands to derive key elements of the VDP and proposes tasks to be considered when operating and improving the VDP in the future.

In chapter "Study on Government Data Governance Framework: Based on the National Data Strategy in the US, the UK, Australia, and Japan," Jeong Eun Seo and Hun Yeong Kwon analyze the NDS of the USA, the UK, Australia, and Japan based on the DGF of the DGI to derive the essential considerations in formulating national data strategies and suggest the components of a Government Data Governance Framework.

In chapter "A Study on the Attack Index Packet Filtering Algorithm Based on Web Vulnerability," Min Su Kim proposes an application layer-based web packet filtering algorithm to improve web service continuity and operational efficiency based on an attack index for web vulnerabilities.

In chapter "Analysis of IoT Research Topics Using LDA Modeling," Daesoo Choi analyzes research trends for the "Internet of Things," a core technology, and attempts to gain insight by examining the fields of engineering and the social sciences separately.

In chapter "Log4j Vulnerability Analysis and Detection Pattern Production Technology Based on Snort Rules," WonHyung Park and IlHyun Lee analyze the Log4j
vulnerability in detail and propose Snort detection policy technology so that security control systems can detect it quickly and accurately.

In chapter "A Study on Technology Innovation at Incheon International Airport: Focusing on RAISA," Seo Young Kim and Min Seo Park present a study that categorizes the core technologies utilized in airports and explores the advantages of robot, artificial intelligence, and service automation (RAISA), such as attaining process efficiency and providing customer convenience.

It is our sincere hope that this volume provides stimulation and inspiration and that it will be used as a foundation for works to come.

August 2022

Dr. Vo Thi Thanh Thao Korea University of Information and Communication Technology The University of Danang Da Nang, Vietnam Dr. Jong-Bae Kim Soongsil University Seoul, South Korea

Contents

Research on Development of Sponsorship Effect Analysis Module Using Text Mining Technique . . . . 1
Weonsun Choi, Kyunghyun Lee, Yoonje Sung, and Gwangyong Gim

A Study on the Relationship Between ESG News Keywords and ESG Ratings . . . . 15
Jaeyoung So, Myungho Lee, Jihun Park, and Gwangyong Gim

Development of Associated Company Network Visualization Techniques Using Company Product and Service Information—Using Cosine Similarity Function . . . . 29
Kyunghyun Lee, Weonsun Choi, Beonghwa Jeon, and Gwangyong Gim

Hybrid CNN-LSTM Based Time Series Data Prediction Model Study . . . . 43
Chungku Han, Hyeonju Park, Youngsoo Kim, and Gwangyong Gim

A Study on Predicting Employee Attrition Using Machine Learning . . . . 55
Simon Gim and Eun Tack Im

A Study on the Intention to Continue Using a Highway Driving Assistance (HDA) System Based on Advanced Driver Assistance System (ADAS) . . . . 71
Myungho Lee, Jaeyoung So, and Jaewook Kim

Security Policy Deploying System for Zero Trust Environment . . . . 83
Sung-Hwa Han

TTY Session Audit Techniques for Linux Platform . . . . 95
Sung-Hwa Han

A Study on the Role of Higher Education and Human Resources Development for Data Driven AI Society . . . . 107
Ji Hun Lim, Jeong Eun Seo, Jun Hyuk Choi, and Hun Yeong Kwon

The Way Forward for Security Vulnerability Disclosure Policy: Comparative Analysis of US, EU, and Netherlands . . . . 119
Yoon Sang Pil

Study on Government Data Governance Framework: Based on the National Data Strategy in the US, the UK, Australia, and Japan . . . . 133
Jeong Eun Seo and Hun Yeong Kwon

A Study on the Attack Index Packet Filtering Algorithm Based on Web Vulnerability . . . . 145
Min Su Kim

Analysis of IoT Research Topics Using LDA Modeling . . . . 153
Daesoo Choi

Log4j Vulnerability Analysis and Detection Pattern Production Technology Based on Snort Rules . . . . 165
WonHyung Park and IlHyun Lee

A Study on Technology Innovation at Incheon International Airport: Focusing on RAISA . . . . 175
Seo Young Kim and Min Seo Park

Author Index . . . . 185

Contributors

Daesoo Choi Department of Software Engineering, Joongbu University, Seoul, South Korea
Jun Hyuk Choi School of Cybersecurity, Korea University, Seoul, South Korea
Weonsun Choi Department of IT Policy and Management, Soongsil University, Seoul, South Korea
Gwangyong Gim Department of Business Administration, Soongsil University, Seoul, South Korea
Simon Gim Industrial and Labor Relations School, Cornell University, Ithaca, NY, USA
Chungku Han Graduate School of IT Policy and Management, Soongsil University, Seoul, South Korea
Sung-Hwa Han Department of Information Security, Tongmyong University, Busan, South Korea
Eun Tack Im Department of Business Administration, Soongsil University, Seoul, South Korea
Beonghwa Jeon Department of IT Policy and Management, Soongsil University, Seoul, South Korea
Jaewook Kim Department of Business Administration, Soongsil University, Seoul, South Korea
Min Su Kim Department of Information Security Engineering, Sun Moon University, Seoul, South Korea
Seo Young Kim College of Business Administration, Inha University, Incheon, South Korea
Youngsoo Kim Department of IT Policy and Management, Soongsil University, Seoul, South Korea
Hun Yeong Kwon School of Cybersecurity, Korea University, Seoul, South Korea
IlHyun Lee Department of Information Security, Sangmyung University, Cheonan-si, Chungcheongnam-do, Republic of Korea
Kyunghyun Lee Department of Strategy Consulting, Korea Insight Institute, Seoul, South Korea
Myungho Lee Department of Business Administration, Soongsil University, Seoul, South Korea
Ji Hun Lim School of Cybersecurity, Korea University, Seoul, South Korea
Hyeonju Park Department of IT Policy and Management, Soongsil University, Seoul, South Korea
Jihun Park Department of IT Outsourcing, IT Nomads Co., Ltd., Seoul, South Korea
Min Seo Park College of Business Administration, Inha University, Incheon, South Korea
WonHyung Park Department of Information Security, Sangmyung University, Cheonan-si, Chungcheongnam-do, Republic of Korea
Yoon Sang Pil School of Cybersecurity, Korea University, Seoul, South Korea
Jeong Eun Seo School of Cybersecurity, Korea University, Seoul, South Korea
Jaeyoung So Department of IT Policy Management, Soongsil University, Seoul, South Korea
Yoonje Sung Department of IT Policy and Management, Soongsil University, Seoul, South Korea

Research on Development of Sponsorship Effect Analysis Module Using Text Mining Technique Weonsun Choi, Kyunghyun Lee, Yoonje Sung, and Gwangyong Gim

Abstract To expand the sponsorship market, sponsorship activities based on sponsorship effect analysis data obtained through scientific and systematic analysis must be carried out. As the development of the media industry and the COVID-19 pandemic have caused many changes in the way professional sports are broadcast, it is necessary to upgrade the analysis of sponsorship effects. In a crisis situation where the sponsorship effect analysis market is shrinking due to the COVID-19 pandemic, the development of brand exposure analysis programs that can be used on online platforms will expand the sponsorship market and promote changes in the domestic analysis market, which relies on overseas analysis programs. In this study, more than 200,000 online sports media articles were collected to analyze the sponsorship effect by minimizing the omission of sponsor brands exposed through online media and detecting sophisticated data. A commercial dictionary in the sports field was established, and a sponsorship effect analysis module was developed by quantifying text data using morpheme analysis and TF-IDF. A UI-based module was implemented to analyze the results of the custom morpheme analyzer and the Elasticsearch Term Vectors API.

Keywords Text mining · Morpheme analysis · BoW (Bag of Words) · TF-IDF (Term Frequency-Inverse Document Frequency) · Sponsored brand · Sponsorship effect analysis


1 Introduction

To expand the sponsorship market, sponsorship activities based on sponsorship effect analysis data obtained through scientific and systematic analysis must be carried out. Title sponsorships in professional sports leagues amount to billions of won and account for a large portion of domestic sponsorship, and many other companies sponsor for larger or smaller amounts of money. However, even though many companies carry out sponsorship activities, data analysis of sponsorship results for marketing often cannot be performed because the analysis cost is high relative to the sponsorship amount. Hence, regardless of the size of the sponsorship, an automated analysis system is urgently needed to reduce the high cost of analysis caused by excessive manpower and time input and to provide evidence for attracting and managing sponsors and for decision-making. In addition, as the development of the media industry and the COVID-19 pandemic have caused many changes in the way professional sports are broadcast, technology that can analyze the vast amounts of data exposed through media such as online outlets and SNS is needed. In a crisis where the sponsorship effect analysis market is shrinking due to the COVID-19 pandemic, the development of brand exposure analysis programs that can be used on online platforms will expand the sponsorship market and change the domestic analysis market, which relies on overseas analysis programs. To analyze the sponsorship effect through the brand exposure of a company using news articles, a text analysis technique that processes and analyzes news articles, which are unstructured text data, is required. Text analysis techniques have long been a major topic in the field of information retrieval, but have come to be spotlighted again as technical and social interest in big data grows. The main approaches to text analysis in the big data field include sentiment analysis and topic network analysis. In the case of sentiment analysis, the quality of the sentiment expression dictionary tends to have a great influence on the analysis results, so there are limitations in applying it to practical work. Topic network analysis has the advantage of making it possible to grasp major topics at a glance, but the quality of the results is affected by the quality of the dictionary used when identifying the main topics. In this study, statistical techniques based on keyword frequency, which are traditional methods, are used instead of sentiment analysis or topic networks. After deriving text data and the exposure frequency of corporate brands, the purpose of this study is to verify, through sponsorship effect analysis, the impact of corporate sports sponsorship activities on brand awareness, image, and consistency.


2 Theoretical Background

2.1 Definition of Sponsorship and Sponsorship Effect

2.1.1 Concepts and Features of Sports Sponsorship

Sponsorship is the relationship between a sponsor that supports funds, technologies, facilities, and equipment and the party that receives this support [1]. Basically, sponsorship is defined on the basis of an exchange that provides mutual benefits, and its effect is maximized when a cooperative relationship is formed [2, 3]. The parties to a sports sponsorship contract include broadcasters, sponsor companies, and event organizers such as sports organizations, individual athletes, and events. In general, it is easy to see a company name or product in various sports-related content such as TV broadcasts, stadiums, and posters. Many sports events carry company or product names. In sports venues, numerous corporate logos and product name signs are placed, and players' uniforms also carry logos indicating company names [4, 5]. This is designed to reach the audience who visit the stadium and the viewers who encounter the broadcast through various content. Sports sponsorship is therefore a useful opportunity to make consumers aware of a company, going beyond simple financial support to achieve commercial goals, and it is considered and used as part of marketing communication activities between companies and customers. Over the years, the area and scope of sports marketing, such as organizing and sponsoring various sports events, operating professional clubs, sponsoring star players, and sponsoring international games, have been expanding [6]. By utilizing sponsorship, sports organizers aim to secure the organization's finances, expand and maintain the organization, and successfully host sports events. Companies intend to increase brand equity by maximizing their corporate image, actively utilizing the acquired sponsorship rights to increase promotional effect and efficiency [7].

2.1.2 Effects of Sports Sponsorship

Companies use sports sponsorship as a marketing tool in the expectation of achieving various effects tailored to their environment, along with increasing sales and generating interest. When sports sponsorship is used as a marketing tool, a company can continuously expose its name or product name to large sports audiences and viewers during competitions [8]. Sports have expanded their share of the mass media due to internationalization, popularity, and diversity. For this reason, companies naturally deliver their brand images to consumers by combining them with specific images of sports, taking advantage of the easy commercial exposure that sports media provide.


Companies that recognize the marketing effect of using sports see sports as a product and strive to build their corporate image and brand awareness through it. If a company's brand image creation and increased recognition have a positive effect on the value of its brand equity, this can lead to a long-term increase in sales. Thus, active and strategic marketing using sports sponsorship can pursue various achievements [9].

2.2 Study of Text Analysis Techniques

Text analysis has been actively studied in the fields of information retrieval and machine translation. In recent years, it has also been used in marketing areas such as customer segmentation and churn prediction after converting voice from customer call records to text in the big data environment, and it is being actively studied in interactive systems such as chatbots. Among general text analysis areas other than machine translation, the analysis procedure in the classification field follows the usual data analysis process: text quantification, feature selection, and application of machine learning techniques. Text quantification is carried out using techniques such as keyword appearance frequency and vectorization so that operations on the information carried by texts can be performed, and feature selection uses dimension-reduction techniques to increase efficiency in terms of time or resources. Through machine learning techniques, analysis tasks for predicting and classifying target variables can then be performed. There are various methods for each step. This study examines TF-IDF (Term Frequency-Inverse Document Frequency), which utilizes the frequency of keyword appearance among text quantification techniques, and the word2vec method, which quantifies keywords using information in a vector space.

2.2.1 Text Quantization Techniques

Word2vec is a method of quantifying text that has recently attracted attention. It is a methodology that quantitatively calculates the relationship of each word by finding association rules from the adjacency information between words, computing proximity in vector form from the positional relationships between words. Since word2vec does not rely on labeled training in advance, it can be seen as machine learning based on unsupervised learning. To calculate vectors for the relationships between words, the corpus to be analyzed must be large. Word2vec has two methodologies: continuous bag-of-words (CBOW) and continuous skip-gram (Skip-gram) [10]. CBOW estimates a single word from multiple surrounding words; it is useful for finding one target word from its context, is known to work effectively with relatively small amounts of data, and is computationally fast. Skip-gram is used to predict a number of words associated with a single word and to estimate which words appear next [10].

Another method is TF-IDF, which uses the frequency of keywords. The term frequency (TF) factor uses the simple occurrence frequency of a keyword in a document as its weight: if a keyword occurs often within the target document, it has a high weight. The inverse document frequency (IDF) corrects the weight of each keyword on the premise that a keyword's specificity is inversely proportional to the number of documents in which it appears. The correction assigns low weights to keywords that commonly appear in many documents and relatively high weights to keywords that rarely appear, in fewer documents. In this way, the inverse document frequency acts as a counterweight reflecting keyword specificity, preventing common keywords that are frequently used across many documents from emerging as key words of a document [11]. Since this study extracts brands exposed in news articles based on text, the TF-IDF method is used among the quantification techniques mentioned above. Because it weights keywords by their simple occurrence frequency in documents, the frequency-oriented TF-IDF method is a more reasonable choice here than the one-hot representation or word2vec, which focus on word order and surrounding words.
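To make the TF-IDF weighting concrete, the following is a minimal sketch using scikit-learn's TfidfVectorizer on a few pre-tokenized article bodies; the corpus shown is invented for illustration, and the study's actual pipeline computes these weights through Elasticsearch as described in the research method.

# Minimal TF-IDF sketch on illustrative, whitespace-joined tokenized articles.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "sponsor brand exposure stadium uniform",
    "league title sponsor broadcast viewer",
    "brand awareness image sponsorship effect",
]

vectorizer = TfidfVectorizer()            # TF x IDF weight per keyword
tfidf = vectorizer.fit_transform(corpus)

# Print the weight of every keyword that appears in the first document.
for term, col in vectorizer.vocabulary_.items():
    weight = tfidf[0, col]
    if weight > 0:
        print(f"{term}: {weight:.3f}")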

2.3 Prior Studies Using News Text Analysis

Big data can be analyzed with various techniques, applied according to whether the data are structured or unstructured. Big data analysis is actively conducted in many studies because diverse data can be collected, analyzed, and used for prediction. For unstructured data, sentiment analysis, network analysis, and topic modeling are used, mainly to analyze text in order to grasp social conditions or problems or to study people's perceptions and emotions. Park et al. [12] compared changes in Chinese tourists' perception of tourism in Korea before and after the THAAD deployment through big data analysis. After collecting search data related to Korean tourism from the Chinese portal site 'Baidu', co-occurrence word frequency, semantic network analysis, and structural equivalence (CONCOR) analysis were conducted for each keyword. The analysis confirmed that 'China' and 'problem' appeared as top keywords after the THAAD deployment, while specific tourism information keywords such as 'Seoul', 'free travel', and 'hotel' decreased, indicating a negative effect on tourism. Yoo and Lim [13] conducted frequency analysis, semantic network analysis, and co-occurrence network analysis on news article data containing the keyword "COVID-19 emotion" in BigKinds. 'Mind', 'health', and 'anxiety' showed high frequency for COVID-19 emotions, and keywords for psychological anxiety and depression such as 'anxiety' and 'psychology' showed high connection centrality. Based on this, it was suggested that not only physical quarantine but also psychological quarantine should be provided in the prolonged COVID-19 situation. Previous big data studies mainly conducted prediction and classification using data mining techniques. Among big data techniques, text mining is a multidisciplinary research area involving linguistics and computer science [14]. This analysis method is used to discover meaningful results centering on the words in a text and to interpret them in various directions depending on the researcher. Recently, it has been used to identify consumer or market preferences and trends based on online postings in various fields such as business administration and economics.

3 Research Method and Result

3.1 Research and Development Process

This study analyzed existing technologies and products and the sponsorship effect analysis process through internal and external environmental analysis, and established the contents of the development plan. After that, online sports media data related to professional sports were collected and used as a data pool for module development. Using the collected data, a commercial dictionary in the field of sports was built, mainly around the K-League, morpheme analysis was performed with sponsor brands as the main keywords, and a sponsorship effect analysis module was developed based on text data quantified using TF-IDF (Fig. 1).

3.2 Data Collection

A Python-based module was implemented to efficiently import article data related to volleyball (V-League), soccer (K-League and the national team), and basketball (KBL, WKBL) from among the articles served through Naver Sports on the portal site Naver. The data collection module was implemented using the requests library (Fig. 2). A total of 209,712 articles were collected from Naver Sports by date (2020.01.01–2021.10.31) and by event (volleyball, soccer, basketball). The collected data were accumulated in a database in BSON form using MongoDB, a NoSQL database, together with data to be used as weights in the sponsorship effect analysis, such as each article's views and netizens' reactions (like, sad, etc.) (Fig. 3).
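As a rough illustration of this collection step, the sketch below requests article data and stores it in MongoDB with pymongo; the endpoint, query parameters, and response field names are hypothetical placeholders rather than the actual Naver Sports interface.

# Minimal, hedged sketch of the article collection and storage step.
import requests
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client["sponsorship"]["articles"]        # stored as BSON documents

def collect(date: str, category: str) -> int:
    # Hypothetical endpoint and parameters, for illustration only.
    url = "https://example.com/sports/articles"
    resp = requests.get(url, params={"date": date, "category": category}, timeout=10)
    resp.raise_for_status()
    saved = 0
    for item in resp.json().get("articles", []):
        articles.insert_one({
            "date": date,
            "category": category,           # volleyball / soccer / basketball
            "title": item.get("title"),
            "body": item.get("body"),
            "views": item.get("views"),      # later used as an exposure weight
            "reactions": item.get("reactions"),
        })
        saved += 1
    return saved

if __name__ == "__main__":
    print(collect("2020-01-01", "soccer"))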


Fig. 1 Research and development process

Fig. 2 Implementation of article data collection module


Fig. 3 Collect and accumulate article data

3.3 Development of Sponsor Effectiveness Analysis Module

3.3.1 Morpheme Analysis

For the accuracy of the sponsorship effect analysis, the data were organized through a preprocessing step that removes special characters, e-mail addresses, Chinese characters, and excess spaces from the article body text. A dictionary of the individual names of domestic sports clubs and sponsoring companies was created (Figs. 4 and 5). The morpheme analyzer is based on the Nori morphological analyzer, a Korean morpheme analyzer built into Elasticsearch as a plug-in and included in Lucene's analysis module. A custom morpheme analyzer was created by adding the dictionary of individual sports club and sponsor company names. Using the 'mecab-ko-dic' dictionary generated with a CRF model, text can be divided into morphemes and tagged with parts of speech.
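The sketch below shows how such an index with a Nori-based custom analyzer and a user dictionary of club and sponsor names might be created from Python; the index name, dictionary file, and filter choice are illustrative assumptions, and the analysis-nori plugin must be installed.

# Hedged sketch of an Elasticsearch index using a Nori tokenizer with a user dictionary.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "sponsor_nori_tokenizer": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed",
                    # custom dictionary of club and sponsor brand names
                    "user_dictionary": "user_dict_sponsors.txt",
                }
            },
            "analyzer": {
                "sponsor_nori": {
                    "type": "custom",
                    "tokenizer": "sponsor_nori_tokenizer",
                    "filter": ["nori_part_of_speech"],   # drops unwanted POS tags
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # term_vector enables the term vectors API used later for BoW / TF-IDF
            "body": {"type": "text", "analyzer": "sponsor_nori", "term_vector": "yes"}
        }
    },
}

es.indices.create(index="sports-articles", body=settings)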


Fig. 4 Morphological analysis

Fig. 5 Official sponsor with the club where the sponsorship is signed


Fig. 6 Principles of MeCab

As shown in Fig. 6, the input sentence is constructed as a combination of all possible words in the dictionary, and scores for nodes (words) and edges (adjacent tag pairs) are required to find the most suitable combination (red lines) among them. These scores are extracted from the training data: words with high frequency in the training data receive high scores, and frequent adjacent tag pairs receive high scores (Fig. 7). To increase the accuracy of the analysis, the following parts of speech were excluded from the tokenizing process: ending (E), exclamation (IC), conjunction (MAJ), pronoun (NP), delimiter (SC), ellipsis (SE), period, question mark, exclamation mark (SF), spacing (SP), verb suffix (XSV), positive designator (VCP), negative designator (VCN), number (SN), dependent noun (NNBC), numeral (NR), adjective (VA), adjective suffix (XSA), auxiliary verb or adjective (VX), dependent noun (NNB), general adverb (MAG), verb (VV), noun suffix (XSN), particle (J), root (XR), determiner (MM), Chinese characters (SH), other symbols (SY), and proper noun (NNP). The vocabulary list for learners of Korean from the National Institute of Culture, Sports and Tourism was processed as a stop-word list to prevent, as much as possible, unnecessary general nouns from appearing in the analysis results.
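A small sketch of how this POS exclusion and stop-word filtering could look in Python is shown below; the tag set mirrors the list above, while the token examples and stop-word file name are placeholders.

# Hedged sketch of POS-tag exclusion and stop-word filtering on (token, tag) pairs.
EXCLUDED_TAGS = {"E", "IC", "MAJ", "NP", "SC", "SE", "SF", "SP", "XSV", "VCP",
                 "VCN", "SN", "NNBC", "NR", "VA", "XSA", "VX", "NNB", "MAG",
                 "VV", "XSN", "J", "XR", "MM", "SH", "SY", "NNP"}

with open("stopwords_ko.txt", encoding="utf-8") as f:
    STOP_WORDS = {line.strip() for line in f if line.strip()}

def keep(token: str, pos: str) -> bool:
    """Keep only tokens whose POS tag is not excluded and that are not stop words."""
    return pos not in EXCLUDED_TAGS and token not in STOP_WORDS

tokens = [("스폰서", "NNG"), ("하다", "VV"), ("브랜드", "NNG")]   # illustrative pairs
print([t for t, p in tokens if keep(t, p)])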

3.3.2 Frequency and Sponsor Effect Analysis

The BoW (bag-of-words) algorithm, which quantifies text data by focusing on the frequency of word appearance without considering word order, was used together with term frequency and inverse document frequency. Sponsorship effect analysis was performed using the TF-IDF (Term Frequency-Inverse Document Frequency) technique, which weights the importance of each word within the document-term matrix (DTM). The custom morpheme analyzer and Elasticsearch's term vectors API were combined to tokenize the article body text, derive the frequency of each word in a document (BoW), and extract TF-IDF values through the analysis module (Fig. 8).
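The following is a minimal sketch of pulling per-term statistics for one indexed article through the term vectors API and turning them into a TF-IDF-style weight; the index, document id, and field names follow the earlier sketch and are illustrative.

# Hedged sketch: BoW counts and a TF-IDF-style weight from the term vectors API.
import math
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.termvectors(
    index="sports-articles",
    id="article-0001",
    fields=["body"],
    term_statistics=True,    # include doc_freq per term
    field_statistics=True,   # include doc_count for the field
)

field = resp["term_vectors"]["body"]
doc_count = field["field_statistics"]["doc_count"]

for term, stats in field["terms"].items():
    tf = stats["term_freq"]                          # BoW count in this article
    idf = math.log(doc_count / (1 + stats["doc_freq"]))
    print(term, tf, round(tf * idf, 3))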


Fig. 7 Part of speech tags and learning vocabulary lists

3.4 Building a Simple Dashboard

A module based on a simple UI was implemented to analyze the results from the custom morpheme analyzer and the Elasticsearch term vectors API. When the events (soccer, volleyball, basketball) and the period to be analyzed are selected, the analysis results are generated in the form of word clouds and Excel files (Fig. 9). Through the simple dashboard, the top 50 words with the highest term frequency (TF) among all words in the articles, together with their number of exposures across the entire document set, were tabulated (Fig. 10).
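As a sketch of this output step, the snippet below writes the top terms to an Excel sheet and renders a word cloud; the counts, column names, and file paths are invented for illustration.

# Hedged sketch of the dashboard outputs: top-frequency table and word cloud.
import pandas as pd
from wordcloud import WordCloud

# term -> exposure count over the selected period (illustrative values)
counts = {"sponsorA": 1520, "sponsorB": 980, "clubX": 760}

top50 = pd.DataFrame(
    sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:50],
    columns=["term", "exposures"],
)
top50.to_excel("sponsorship_exposure.xlsx", index=False)

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(counts).to_file("sponsorship_wordcloud.png")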

Fig. 8 Customizing morpheme analyzer

4 Conclusions

In this study, online news text data and the exposure frequency of corporate brands were derived using statistical techniques based on keyword frequency, and a sponsorship effect analysis was then performed. Through this, an online sports media sponsorship effect analysis module was developed to verify the effect of a company's sports sponsorship activities on brand awareness, image, and consistency. In this process, more than 200,000 online sports media articles were collected for the development of the analysis module. After preprocessing the collected data, the text of each sports article was tokenized using the morpheme analyzer and the corporate name dictionary. Only text related to companies was extracted, and exposure frequency was analyzed by counting sponsor companies. A module capable of analyzing the sponsorship effect was developed using BoW and TF-IDF, and a simple dashboard was built that presents the results as a word cloud, a table, and an Excel file. This can reduce the manpower and time required to analyze sponsorship effects. It also expands the sponsorship market by providing sponsors and sports-related organizations with refined data on the effectiveness of sponsorship exposure. Ultimately, it can have a positive effect on the development of the sports industry. This study is also expected to enhance the technological competitiveness of domestic companies and strengthen their ability to respond to global companies entering the domestic sponsorship effect analysis market. By exporting the brand exposure analysis system to overseas sponsorship effect analysis companies or agencies, and by producing such products in Korea without relying on overseas services, export and import-substitution effects can be expected.


Fig. 9 Simple dashboard excel sheet

Fig. 10 Simple Dashboard top exposure list


Since this study developed the analysis module using limited data, the prototype system needs to be applied and continuously improved. Through future research on the analysis methodology using the module, it can be used to analyze domestic sponsorship effects, the domestic sponsorship of foreign brands, and the overseas sponsorship of domestic brands. By evaluating the exact value of sponsorship, it is expected to expand the domestic sponsorship market and support active overseas sponsorship activities by Korean corporate brands, which will have a positive impact on the development of the sports industry.

References

1. Park, J.H., Kwak, M.S., Cho, K.M.: Relationship between sponsorship activities factors and corporate image and brand assets using meta-analysis. J. Korea Sports Ind. Manage. Assoc. 19(6), 67–83 (2014)
2. Kim, Y.M.: The relationship between sports event propensity and related factors and sponsorship effectiveness. J. Korea Sports Ind. Manage. Assoc. (2014)
3. Kim, Y.M.: Sports marketing communication: a scholarly history (2002)
4. Lee, J.H.: The effect of golf participants' preference for players and sports sponsorship awareness on product purchase intention. Ph.D. thesis, Sungkyunkwan University's Graduate School of General Studies (2017)
5. Yoon, D.Y.: A comparative analysis of the effect of a company's brand image on the sponsorship effect through sponsorship of domestic and foreign golfers. Master's thesis, Chung-Ang University (2020)
6. Mahony, D.F., Howard, D.R.: Sport business in the next decade: a general overview of expected trends. J. Sport Manag. 15, 275–296 (2001)
7. Moon, J.W.: A study on the halo effect in image transfer by sports sponsorship. Ph.D. thesis, Sungkyunkwan University's Graduate School of General Studies (2014)
8. Mullin, B.J., Hardy, S., Sutton, W.A.: Sport Marketing. Human Kinetics, Champaign, IL (1993)
9. Jeong, H.B.: The influence of corporate participation in sports sponsorship on brand recognition and image change: focusing on the 2014 FIFA World Cup in Brazil. Master's thesis, Hanyang University (2016)
10. Choi, J.W., Han, S.H., Lee, M.Y., Ahn, J.M.: A study on predicting corporate bankruptcy using text mining methodology. Prod. Papers (formerly Productivity Study) 29(1), 201–228 (2015)
11. Kim, N.G., Lee, D.H., Choi, H.C.: Text analysis technology and utilization trends. J. Korean Commun. Assoc. 42(2), 471–492 (2017)
12. Park, K.Y., Han, H.R., Choi, S.D.: Comparison of changes in Chinese tourists' perception of Korean tourism before and after the THAAD deployment using big data. Tourism Leisure Res. 31(2), 25–43 (2019)
13. Yoo, S.Y., Lim, K.G.: News agenda analysis using text mining and semantic network analysis: focusing on COVID-19-related emotions. Korean Intell. Inf. Syst. Assoc. 27(1), 47–64 (2021)
14. Kim, S.H., Lee, Y.J., Shin, J.Y., Park, K.Y.: Text mining for macroeconomic analysis. Anal. Korean Econ. 26(1), 1–70 (2020)

A Study on the Relationship Between ESG News Keywords and ESG Ratings Jaeyoung So, Myungho Lee, Jihun Park, and Gwangyong Gim

Abstract Since the UN Principles for Responsible Investment (PRI) were launched in 2006, interest and investment related to ESG have continued to increase. In particular, global asset managers such as BlackRock build asset portfolios in consideration of ESG ratings, and ESG has now become important non-financial information that investors should consider when making investment decisions. The top group of the MSCI ESG Index shows higher returns and lower stock price volatility, proving that ESG ratings are important investment information. However, ESG ratings are announced only once a year, so while they have become important information for investors making mid- to long-term investments, non-institutional investors encounter ESG news every day rather than ESG ratings. The purpose of this paper is to reveal the correlation between daily ESG-related news and each E/S/G grade, focusing on the keywords of the main disclosure metrics according to the ESG disclosure guidance of the KRX.

Keywords ESG news · Text mining · ESG ranking · Random Forest · Light GBM





1 Introduction

Increasing awareness of the importance of integrating environmental, social, and governance (ESG) policies into a company's strategy and operations is raising worldwide demand for the disclosure of non-financial information, such as ESG, for investment decisions [1]. In the case of ExxonMobil in the U.S., shares fell sharply as the company defied shareholder opinion calling for a decarbonization management strategy [2]. In the case of Danone, a French company representative of ESG management, the ESG-championing CEO was forced out by hedge fund shareholders due to falling financial performance [3]. Since the Paris Agreement, countries have been striving to reduce greenhouse gas emissions, and Korea has established a carbon neutrality committee under the president and has been implementing the Basic Carbon Neutrality Act since September 2021. In particular, the global asset management companies BlackRock, Vanguard, and SSGA (State Street Global Advisors) have launched and operated various funds and ETFs using ESG indicators [4], and BlackRock's chairman has demanded mandatory ESG reporting from investee companies [5]. In addition, domestic companies' carbon emission assets and liabilities amounted to 523.7 billion won and 709.2 billion won, respectively, in 2020 [6]. The National Pension Service, an institutional investor in Korea, declared that it would make more than 50% of its investments in consideration of ESG [7]. Depending on whether ESG is reflected in the management strategy, the distribution of companies with theoretically excellent ESG ratings and companies with low ESG ratings is differentiated as shown in Fig. 1, and shareholder performance (payoff) varies accordingly [8]. This theoretical assumption is demonstrated by the Morgan Stanley Capital International (MSCI) ESG Index prices, as shown in Fig. 1: the USA SRI index price, which represents the top 25% of ESG-rated companies, shows the highest return. The calculation process of the MSCI ESG Indexes is briefly summarized in Table 1 [9]. As investment performance becomes differentiated, the importance of ESG becomes increasingly prominent, and investors will accordingly be interested in companies' ESG news as well as in ESG ratings announced once a year. It is therefore time to look at daily ESG news from a defined perspective and to study its relationship with ESG ratings.

Fig. 1 Hypothetical impact of ESG within Merton credit-risk model and MSCI ESG index price



Table 1 Standard MSCI ESG Indexes and construction methodology

MSCI ESG Screened: Market-capitalization weighted
MSCI ESG Universal: Market-cap weight tilt from 0.5 to 2.0 depending on the MSCI ESG rating and the MSCI ESG rating change (upgrade, neutral, or downgrade)
MSCI ESG Focus: Optimize index-level ESG score under tracking-error and sector constraints
MSCI ESG Leaders: Best-in-class selection of the top 50% of ESG-rated companies in terms of free-float market cap per GICS sector and sub-region (to avoid regional or sector biases); market-capitalization weighted
MSCI SRI: Best-in-class selection of the top 25% of ESG-rated companies in terms of free-float market cap per GICS sector and sub-region (to avoid regional or sector biases); market-capitalization weighted
MSCI ACWI: MSCI All Country World Index

2 Theoretical Background

2.1 Overview of ESG

ESG definitions are presented differently by institutions according to their purpose of establishment, business characteristics, and stakeholders. Taken together, however, the concept of ESG is centered on the capital market and is defined as "the important non-financial factors that can affect investment decisions and long-term financial value" [1]. The process by which ESG is graded is largely divided into frameworks and evaluation: a framework provides ESG concepts and implementation measures, while evaluation refers to the act of measuring and grading companies based on a framework. The main frameworks containing the contents of ESG are shown in Table 2; in addition, as of January 2021, there are about 374 certification criteria worldwide [10]. In general, the evaluation is carried out in the year following the evaluation year, after requesting data for the evaluation year from the company, and the results are presented in the second half of the year after the evaluation year [12]. As companies complain of difficulties in ESG management due to the variety of frameworks, the International Financial Reporting Standards (IFRS) Foundation has established the International Sustainability Standards Board (ISSB) under its umbrella to integrate several frameworks and create investor-centered global disclosure standards for sustainable management and ESG. It published a prototype in November 2021 and an exposure draft in March 2022, and aims to establish ESG disclosure standards in 2022 [13].

Table 2 ESG framework (Source: KICPA, ESG Academy Textbook) (continued)

GRI (Global Reporting Initiative)
• NGO established by CERES (Coalition for Environmentally Responsible Economies) and UNEP (United Nations Environment Programme) in 1997
• A representative global information disclosure initiative that provides standards for 'sustainability reporting'
• The GRI Standards are used to provide transparency into how organizations contribute to sustainable development or achieve related goals and how they impact the economy, the environment, and society, including how organizations impact human rights and how these impacts are managed, thereby supporting greater transparency and accountability through disclosure

PCAF (Partnership for Carbon Accounting Financials)
• A global partnership of financial institutions that started in the Netherlands in 2015; an accounting methodology for evaluating and disclosing greenhouse gas emissions from corporate loans and investments
• In November 2020, PCAF published the guideline 'PCAF Standard' for measuring and reporting greenhouse gas emissions

CDP (Carbon Disclosure Project)
• A global initiative to report environmental information, including climate change information; it started in 2000 when financial institutions around the world shared the perception that climate change acts as both an opportunity and a crisis for businesses
• As of 2019, more than 8400 companies, accounting for more than 50% of global market capitalization, participate in the CDP program, and 525 investment institutions (approximately $96 trillion in assets under management) require companies to disclose climate change, water, and forest-related information

TCFD (Task Force on Climate-related Financial Disclosures)
• A global consultative body established in 2015 by the Financial Stability Board (FSB) at the request of the G20 for the disclosure of climate change-related information
• The recommendation published by the TCFD in 2017 aims to enable companies to reflect climate change-related risks and opportunities in organizational risk management and decision-making through the disclosure of four main items
• The four key disclosure areas are governance, strategy, risk management, and metrics and targets

SBTi (Science Based Targets initiative)
• Strengthens corporate climate action by providing guidelines and methodologies for setting science-based targets (SBT: science-based greenhouse gas emission reduction goals in line with the Paris Agreement goals) to respond to the climate crisis and achieve the goals of the Paris Agreement; established and operated jointly by the United Nations Global Compact (UNGC), the Carbon Disclosure Project (CDP), the World Resources Institute (WRI), and the World Wildlife Fund (WWF)
• A 'science-based target' provides the target-setting criteria and requirements for corporate-level reductions needed to limit future global warming to within 1.5 °C or 2 °C, and establishes public confidence in the target through third-party certification by the SBTi

SASB (Sustainability Accounting Standards Board)
• Standard established by the US Sustainability Accounting Standards Board (SASB) in 2018 as a guideline for the disclosure of sustainable management information ('ESG information') by US listed companies
• Unlike other sustainability information disclosure standards (GRI, etc.), the focus is on the disclosure of ESG information to investors
• Provides ESG indicators that are comparable across companies in the same industry by presenting industry-specific standards according to 11 major sectors and 77 industry classifications


2.2 Related Works

Most of the preceding studies on ESG empirically analyzed the effect of ESG on corporate management performance through statistical methods, and many showed that corporate ESG management had a positive effect on financial performance such as ROA and corporate value [14–16]. In addition, analyses of the relationship between ESG and corporate innovation [17] and between ESG and credit rating [18] confirmed a positive (+) relationship. These studies are characterized by the use of integrated ESG scores. Previous studies on corporate value using ESG news [19, 20] and studies related to ESG text mining have defined major ESG issues using the word 'ESG' [21]. In Korea, the ESG rating of KCGS is the most widely used in ESG-related research [11]. To understand the degree of influence of each variable, many studies use the Random Forest (RF) technique [21, 22]. In addition, studies use the Light Gradient Boosting Machine (Light GBM) for the classification of short texts [23, 24].

3 Research Method

3.1 Data Collection

This study collected news related to 60 companies that ranked among the top five listed companies at each ESG grade in 2020; these companies were announced by KCGS in 2021. Based on the ESG disclosure guidelines of the Korea Exchange (2020), four disclosure areas were selected for each of the environmental (E), social (S), and governance (G) sectors. Essential keywords for each disclosure area were selected using the 'Big Kinds' news service provided by the Korea Press Foundation; 11 major newspapers and 8 business newspapers were used to select the essential keywords. Using these, frequency analysis and regression analysis were conducted to derive keywords that affect ESG grades. From 2018 to 2020, the total number of articles related to the 60 companies was 830,353, containing 50,933,129 keyword occurrences, of which the 60 selected keywords accounted for 473,184 occurrences, about 0.93%. Major disclosure areas were selected using the disclosure guidelines [1] of the Korea Exchange. The analysis of related words in the 12 E/S/G disclosure areas was conducted using Big Kinds. Big Kinds calculates weights based on the Topic Rank algorithm, which surfaces keywords with high semantic similarity to search terms by performing co-occurrence and word clustering on the search results [25]. Among the related search words for each disclosure field, the top five with commonality with the ESG evaluation indicators (metrics) were selected (Table 3).


Table 3 Key words analysis

Environmental (E)
  Climate: GHG (E1), carbon neutral (E2), emission (E3), air pollution (E4), climate change (E5)
  Waste disposal: Waste (E6), recycling (E7), plastics bottle (E8), waste plastic (E9), packaging material (E10)
  Water: Water resource (E11), water management (E12), industrial water (E13), wastewater (E14), underground water (E15)
  Energy: Renewable energy (E16), coal fuel (E17), solar power (E18), ESS (E19), power plant (E20)

Social (S)
  Labor: Contract worker (S1), union (S2), collective agreement (S3), strike (S4), intern (S5)
  Industrial safety: Industrial accident (S6), safety accident (S7), safety management (S8), serious disaster (S9), recall (S10)
  Win–win management: Fair trade (S11), vendors (S12), shared growth (S13), subcontract (S14), monopoly (S15)
  Gender equality: Gender (S16), gender equality (S17), sexual assault (S18), parental leave (S19), ministry of gender (S20)

Governance (G)
  Shareholder: Shareholder meeting (G1), voting right (G2), management (G3), minority shareholder (G4), major shareholder (G5)
  Independent BOD: Responsible management (G6), board of director (G7), reappointment (G8), outside director (G9), inside director (G10)
  Governance system: Governance (G11), sustainability (G12), ESG (G13), subsidiary (G14), internal trade (G15)
  Ethics compliance: Anti-fraud (G16), private information (G17), privacy (G18), audit committee (G19), compliance (G20)

4 Data Analysis

4.1 Frequency Analysis and Word Cloud

For the 473,184 keyword occurrences secured through data collection, frequencies were calculated as shown in Table 4. The most frequent keywords were in the governance sector, such as affiliates (subsidiary), board of directors, and major shareholder, followed by solar power, outside director, and ESG. A word cloud was used to visualize the frequencies, and the proportions of the major search terms by sector are shown in Fig. 2.
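As a minimal illustration of this visualization step, the following sketch builds a word cloud from per-keyword counts taken from Table 4 (the dictionary is abbreviated here, and the wordcloud package is an assumed tool; the authors' actual tooling is not stated in the paper):

# Minimal word-cloud sketch for the frequency visualization described above
# (counts are a few environmental-sector values from Table 4; the full
# dictionary would hold all 60 keywords).
from wordcloud import WordCloud
import matplotlib.pyplot as plt

env_freq = {"solar power": 23123, "ESS": 21537, "power plant": 15599,
            "renewable energy": 9475, "emission": 5148, "waste": 4404}

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(env_freq)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()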


Table 4 Frequency analysis

Section         Key word   Frequency     Section      Key word   Frequency
Environmental   E1         1693          Social       S11        1113
                E2         688                        S12        6475
                E3         5148                       S13        3964
                E4         655                        S14        2876
                E5         2748                       S15        2554
                E6         4404                       S16        508
                E7         3650                       S17        113
                E8         1043                       S18        1187
                E9         1008                       S19        855
                E10        2163                       S20        343
                E11        594           Governance   G1         10,982
                E12        149                        G2         9947
                E13        775                        G3         8440
                E14        1397                       G4         2542
                E15        614                        G5         24,202
                E16        9475                       G6         1959
                E17        1168                       G7         57,567
                E18        23,123                     G8         2912
                E19        21,537                     G9         16,305
                E20        15,599                     G10        4864
Social          S1         3161                       G11        14,002
                S2         7648                       G12        7294
                S3         1586                       G13        15,177
                S4         14,463                     G14        121,946
                S5         5993                       G15        1951
                S6         4324                       G16        52
                S7         2655                       G17        4160
                S8         1166                       G18        1187
                S9         158                        G19        783
                S10        7420                       G20        719

As illustrated in Fig. 2, solar power, power generation, renewable energy, and ESS (energy storage system) consistently occupy a large proportion in the environmental field across the three years. Among them, ESS received particular attention in 2019. In the social sector, strikes, labor unions, and suppliers continue to account for a large portion across the three years, and recalls account for the largest portion in 2020. In the field of governance, affiliates, the board of directors, and major shareholders continue to occupy a large proportion across the three years, and ESG is characterized by a high level in 2020.

Fig. 2 Key word cloud analysis (word clouds for E, S, and G by fiscal year: FY 2018, FY 2019, FY 2020)

4.2 Relationship Analysis of the Influence on ESG Rating

This study attempts to understand the degree of relationship between each independent variable and the ESG grade by using the Random Forest (RF) model and the Light GBM model, which can measure the influence of each independent variable in an ensemble model. The RF model creates multiple samples by bagging (short for bootstrap aggregating), develops a model for each sample, and then combines the results into one model, thereby obtaining stability of the algorithm. Rather than generating a model from a single sample, several different samples can be used to represent the population well. In addition, for nominal variables (categorical data), the prediction results are combined by voting or by taking the highest probability value, and for continuous variables (numerical data), the values are aggregated by averaging. Bagging can also use parallel processing, which makes model generation highly efficient because independent models are built on independent datasets [21, 22].

The Light GBM model processes samples by boosting, which pursues higher model accuracy by adding more weight to misclassified cases: each subsequent model corrects the previous one by giving a higher weight to the cases that were misclassified. For numerical data, the predicted result values are combined using a weighted average. If bagging learns in parallel, boosting can be said to model sequentially. The boosting method may achieve higher accuracy, but it can become vulnerable to outliers [23, 24].

(1) Random Forest (RF) Analysis Result

The RF model is a kind of decision tree (DT) model and is one of the machine learning methods for decision-making. In the experiment of this paper, due to the scarcity of data, the train data and test data were divided 9:1 for analysis; with the scarce data, accuracy was highest with a 9:1 split compared with 7:3. Performance evaluation of classification is usually performed through accuracy. However, if the dataset is imbalanced, precision (positive predictive value, PPV), recall (sensitivity), and the F1 score should also be considered. The classification results are divided into True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) [22, 23] (Table 5).

As a result of the RF analysis, significant results could be derived in the E and S sectors. The accuracy of the G category was low; it is estimated that this is because the G category accounted for 64.8% (306,991 cases) of the 473,184 keyword occurrences yet did not provide a clear signal, since good news and bad news are heavily intermixed when it comes to governance.

Table 5 RF analysis result

             E      S      G
Accuracy     0.72   0.72   0.39
Precision    0.59   0.63   0.24
Recall       0.72   0.72   0.39
F1-score     0.64   0.66   0.28

• Accuracy: ratio of the number of correct predictions to all predictions = (TP + TN) / All
• Precision: proportion of predicted-positive cases whose actual value is also positive = TP / (predicted Positive)
• Recall: proportion of actual-positive cases that are predicted positive = TP / (actual Positive)
• F1 score: harmonic mean of precision and recall = 2 × Precision × Recall / (Precision + Recall)
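A minimal scikit-learn sketch of this RF setup follows (the file name, feature columns, and label column are hypothetical stand-ins; the 9:1 split and weighted-average metrics mirror the description above but this is not the authors' actual code):

# Minimal sketch of the RF experiment described above (assumed data layout).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv("esg_keyword_counts.csv")          # hypothetical file: one row per company
X = df[[f"E{i}" for i in range(1, 21)]]              # keyword frequency features (E1..E20)
y = df["E_grade"]                                    # assumed KCGS grade column for the E sector

# 9:1 train/test split, as in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=42)

rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_tr, y_tr)
pred = rf.predict(X_te)

# Accuracy, precision, recall, and F1 (weighted over grade classes)
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred, average="weighted", zero_division=0))
print("recall   :", recall_score(y_te, pred, average="weighted", zero_division=0))
print("f1-score :", f1_score(y_te, pred, average="weighted", zero_division=0))

# Per-feature influence, as plotted in Fig. 3
for name, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1])[:5]:
    print(name, round(imp, 3))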

Fig. 3 Impact of factors on the RF (panels: E and S)

Figure 3 shows the influence of each indicator in the E and S fields, where significant analysis results were obtained. In the E division, the energy sector had the highest impact, led by E20 (power plant), E18 (solar power), and E19 (ESS), while in the S sector, shared growth and labor were highly influential, led by S13 (shared growth), S2 (union), and S4 (strike).

(2) Light Gradient Boosting Model (GBM) Analysis Result

In this study, RF was tested as the main model in consideration of the sample limitations, and the boosting-based Light GBM was tested additionally. Like RF, Light GBM was analyzed by dividing the train data and test data 9:1; the results can be seen in Table 6. Unlike previous studies that reported high performance of Light GBM in text analysis, the accuracy on the test data was low in this study. The reason the train data accuracy is much higher than the test data accuracy can be attributed to overfitting of the boosting method, given that the data samples come from only about 60 companies spread across the seven grade levels. Due to these limitations, similar results were obtained even when the ratio of train data to test data was changed. As seen in Fig. 4, in the GBM model, unlike RF, E19, E10, and E20 were found to have a high effect, showing a slight difference from the RF results. In the S sector, S2, S13, and S4 were highly influential, with a slight difference in order as well.

Table 6 Light GBM analysis result

                      E      S      G
Train data accuracy   0.963  0.931  0.994
Test data accuracy    0.500  0.333  0.444
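For comparison with the RF sketch, a minimal LightGBM sketch of the boosting experiment follows (same hypothetical data layout; comparing train and test accuracy exposes the overfitting discussed above, but hyperparameters are illustrative, not the study's tuned values):

# Minimal LightGBM sketch of the boosting experiment described above.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("esg_keyword_counts.csv")           # hypothetical file, as in the RF sketch
X, y = df[[f"E{i}" for i in range(1, 21)]], df["E_grade"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=42)

clf = lgb.LGBMClassifier(objective="multiclass", n_estimators=200, learning_rate=0.05)
clf.fit(X_tr, y_tr)

# A large gap between these two numbers indicates overfitting on the small sample
print("train accuracy:", accuracy_score(y_tr, clf.predict(X_tr)))
print("test accuracy :", accuracy_score(y_te, clf.predict(X_te)))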

Fig. 4 Impact of factors on the GBM (panels: E and S)

5 Conclusion

As a result of studying the relationship between ESG keywords and ESG grades, the E and S sectors showed accuracy at about the 72% level. This demonstrates the predictability of each E/S/G grade through text mining, and it is expected that predictive models will become possible through additional studies. Unfortunately, no strong relationship was found in the G sector; it is estimated that simple frequency analysis and regression analysis have limits there because good news and bad news are heavily mixed in the G sector. Unlike previous studies that focus on correlation analysis between the comprehensive word 'ESG' and the integrated ESG score, or between the integrated ESG score and corporate value, this study takes a step further and analyzes the relationship with keywords from Environment (E), Social (S), and Governance (G), considering disclosure and evaluation indicators. In addition, it is expected that this study will serve as a reference when determining the areas to focus on when establishing a company's ESG management strategy or allocating resources.

This study may have data limitations, since it analyzed 60 companies and 60 keywords over 3 years. Because the evaluation results for each of E/S/G have only been disclosed since 2018, analysis of data before 2018, as well as analysis by industry, was not performed. If the number of companies analyzed is expanded and more years are added to secure sufficient data, a grade prediction study for each of E/S/G through text analysis is expected to become possible. In addition, machine learning techniques such as time series analysis and deep learning, as well as Random Forest and Light GBM, should be considered. Additionally, the ESG grades announced by KCGS are limited to seven stages; thus, it is also suggested to conduct research using ESG evaluation indicators with more granular score systems such as MSCI grades (AAA–CCC).


References
1. Korea Exchange (KRX), ESG Information Guidance (2020)
2. Hankook Economic Daily, "Manage ESG"… The activist fund that attacks ExxonMobil, URL: https://www.hankyung.com/international/article/2020120865561 (8 Dec 2020)
3. Reuters news, URL: https://www.reuters.com/article/us-danone-management-idUSKBN2B60PN (15 March 2021)
4. Ruy, J.S.: Global ESG Investment and Policy Trend. Financial Investment Association (2020)
5. Blackrock, Blackrock CEO Client letter, URL: https://www.blackrock.com/kr/2021-blackrock-client-letter (2021)
6. Financial Supervisory Service, Analyzes the financial disclosure status of GHG emission rights of listed corporations and prepares best practices for note disclosure (9 April 2021)
7. Maeil Economic Daily, National Pension Service "ESG investment to expand to half of total assets next year", URL: https://www.mk.co.kr/news/economy/9878703 (19 May 2021)
8. MSCI, Foundations of ESG Investing in Corporate Bonds (2020)
9. MSCI, "Understanding MSCI ESG indexes," URL: https://www.msci.com/www/researchpaper/understanding-msci-esg-indexes/01525548808
10. KICPA, ESG Academy textbook, URL: https://kicpaacademy.com/product-category/academycourse/esg-expert/ (2021)
11. Lee, J.K., Lee, J.H.: Current status and future directions of research on 'sustainable management': focusing on the ESG measurement index. J. Strat. Manag. 23(2), 65–92 (2020)
12. KCGS, Announcement of ESG evaluation and rating of listed companies in 2020 (2021)
13. IFRS Foundation, URL: https://www.ifrs.org/news-and-events/news/2022/03/issb-deliversproposals-that-create-comprehensive-global-baseline-of-sustainability-disclosures/ (2022)
14. Oh, S.H., Lee, S.T.: A study on the relationship between ESG evaluation factors and corporate value. Comput. Account. Res. 17(2), 205–223 (2019)
15. Lim, U.B.: The effect of non-financial information on corporate performance: focusing on ESG score. Int. Account. Stud. 86, 119–144 (2019)
16. Kim, Y.K.: Effect of corporate non-financial information (ESG) disclosure on financial performance and corporate value. Regul. Stud. 29(1), 35–59 (2020)
17. Choi, M.H.: Corporate innovation, sustainable management and corporate value. Tax Account. Res. 67, 55–73 (2021)
18. Kim, D.Y.: A study on the relationship between ESG evaluation information of a sound company and KIS credit score. J. Global Bus. Adm. 17(3), 131–155 (2020)
19. Kang, W., et al.: A study on the relationship between non-financial indicators and the market performance of firms: market response analysis of events used to develop ESG indicators. Yonsei Manage. Res. 57(2), 1–22 (2020)
20. Lim, H.J.: Analysis of SME ESG issues using text mining. Hum. Soc. Sci. 21 12(4) (2021)
21. Thakkar, R., et al.: Environmental fire hazard detection and prediction using random forest algorithm. In: 2022 International Conference for Advancement in Technology (ICONAT), Goa, India, Jan 21–22 (2022)
22. Swarupa, A.N.V.K., et al.: Disease prediction: smart disease prediction system using random forest algorithm. In: 2021 IEEE International Conference on Intelligent Systems, Smart and Green Technologies (ICISSGT) (2021)
23. Alzamzami, F., et al.: Light gradient boosting machine for general sentiment classification on short texts: a comparative evaluation. IEEE Access (2020)
24. Abdurrahman, M.H., et al.: A review of light gradient boosting machine method for hate speech classification on twitter. IEEE Xplore (2020)
25. Bigkinds, Big Kinds User Manual (2022)

Development of Associated Company Network Visualization Techniques Using Company Product and Service Information—Using Cosine Similarity Function

Kyunghyun Lee, Weonsun Choi, Beonghwa Jeon, and Gwangyong Gim

Abstract The purpose of this study is to develop the concept of visualizing a social network of related companies using companies' product and service introduction text data and the cosine similarity function. To visualize the network of related companies, the 'Service Product Information' and 'Investment Attraction Information' from the 'Big Data Platform and Center Construction Project' promoted by the Korea Intelligence Information Society Promotion Agency were used. Looking at the results of visualizing the network between companies using the service product data, groups of various related companies are derived, and the characteristics of each group's technology and services were examined. In this study, the network of related companies providing artificial intelligence-related solutions and application services was closely examined. It is divided into companies that develop artificial intelligence technology and companies that apply it to various fields. Companies related to artificial intelligence are connected in the network around keywords such as interactive, assistant, and platform. By checking what types of business apply it, this study explored the fields and companies that apply artificial intelligence to various products and services.

Keywords Big data · Visualization · Start-ups · Related companies · Similar companies · Competitors · Cosine similarity · Network analysis · Term frequency · Inverse document frequency

K. Lee, Department of Strategy Consulting, Korea Insight Institute, Seoul, South Korea, e-mail: [email protected]
W. Choi · B. Jeon, Department of IT Policy and Management, Soongsil University, Seoul, South Korea, e-mail: [email protected]; B. Jeon e-mail: [email protected]
G. Gim (B), Department of Business Administration, Soongsil University, Seoul, South Korea, e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_3


1 Introduction

Heinrich's Law showed that there must be a number of minor accidents and prior signs associated with them before a major accident occurs [1]. The important point is whether one can detect and predict these accidents in advance. One of the reasons it is difficult to recognize an accident in advance is that the phenomena sending the causal signals are irregular, diverse, scattered, and paradoxically complicated. It is therefore very important to collect these various forms of complex phenomena and situations and analyze the inherent complexity through mathematical and visual expressions in order to judge the past and present and predict the future.

Scholars try to identify the relationships between entities in various ways in each discipline in order to be aware of what will happen in advance and prepare for the future. This set of procedures and methodologies has been called by various terms, but in conjunction with the transition to an informatization- and knowledge-based society, it has been named Social Network Analysis (SNA) since the twenty-first century [2]. The methodology, which is also referred to simply as a 'social network' in Korea, was introduced in Korea in the early 1980s. The initial usage visualized and digitized the connection relationships between people and analyzed them based on the attributes inherent in those relationships. Many studies have been conducted in the field of social science, where it is used as a tool to analyze the structural properties of large and small societies, such as defining the roles of members present in groups of each unit and grouping members with similar tendencies [3]. The importance of social network analysis research is also being highlighted in various academic fields such as economics, policy studies, and business administration, which require a series of procedures to grasp social structural characteristics and eliminate conflicts and irrationalities [4]. As there are groups with unique characteristics in the field of industrial and corporate analysis, various social network analysis studies can be conducted to discover the characteristics inherent in them and solve the immediate problems [5–10].

Since the 2000s, the target of identifying relationships has been gradually expanding from 'people' to 'data', and social network analysis theory is rapidly evolving [11]. The cause can be found in the phenomenon that the informatization of society is gradually accelerating and a great deal of knowledge is rapidly spreading on the web. Finding knowledge, data, and meaning of significant value in this flood of information is complex and involves a process that takes a lot of time and effort. Research on social network analysis methodology is drawing attention due to the innovative development of computing technology and the emergence of the concept of big data. Big data has features such as large Volume, Variety, and Velocity, and it is very difficult to find meaning, value, and patterns in raw big data itself. Thus, an advanced data processing and analysis system is essential, and social network analysis is expected to play this role [3]. One of the reasons why social network analysis has received attention can be seen in the diversification of its fields of application and data. As the potential value of social network analysis was extended to the humanities and social fields and


natural sciences, studies were conducted to apply analysis techniques and data by combining the characteristics of each academic domain. A virtuous cycle continues, such as expanding the types of research that can be analyzed through social network analysis and the cross-application of various methodologies [3]. Research in social network analysis has typically ranged from analyzing person-to-person relationships to analyzing the structure of academic knowledge data in papers and patents. There is also sentiment analysis, which collects text indicating the opinions and emotions of Internet users and derives trends, reputations, and levels of awareness through network mapping, in order to identify potential technologies and research areas for the future [12].

In order to understand the complex relationships between companies, this study visualized the network between similar companies by applying social network analysis to text data introducing the products and services of companies. This study aims to establish the concept of the applicability of social network analysis to corporate and industrial analysis.

2 Theoretical Background and Prior Research

2.1 Text Mining Research

Big data is not easy to collect, store, analyze, and visualize due to the large amount of data and the diversity of its contents. Sometimes the vast amount of data itself is part of the problem to be solved [3]. Text mining is an efficient way to analyze big data with these characteristics. Text mining aims to discover patterns and knowledge in text through the search, extraction, and natural language processing of useful information from large amounts of unstructured and semi-structured text data [13, 14]. One of the advantages of text mining is that it is possible not only to extract concepts that appear in the text, but also to identify and visualize relationships with other concepts [15]. Existing traditional content analysis must rely on arbitrary analysis items created by the researcher. Hence, it is difficult to ensure external validity, because analysis is impossible for parts outside a specific range and also depends on the coder. It also has difficulty analyzing large amounts of data [16]. Text mining techniques can resolve these shortcomings and the differing interpretations of each classification through various classifications. Thus, text mining can overcome the limitations of existing content analysis [17, 18].


2.2 Social Network Analysis Study

The name of the analysis method for analyzing the relationships between the words constituting a text is used in various forms. Typically, it is referred to with terms such as word network or text network analysis. It has been called by different expressions, such as semantic networks and conceptual networks, that focus on the form of words, but recently the term Network Text Analysis (NTA) has mainly been used [19]. Text is one of the most heavily used means of mutual communication among members of society, and it exerts a great influence on these various communication tools [20]. The combination of words in sentences is not created by chance [21], but is the result of combining the author's conscious and unconscious thoughts. The emphasis of these thoughts can be confirmed through the frequency and relationships of word use, and the combination of words eventually tends to create a specific meaning [22].

In the era of big data, the amount of unstructured text data that can be analyzed continues to increase, and the demand for creating new and varied knowledge through analysis is increasing. This increase in demand naturally leads to an increase in interest in text analysis and active research, and the field of researching analysis methodologies and technologies is growing rapidly [23]. Research related to text analysis has become the new field of text mining, linked to data mining. Text network analysis, which combines text mining, a kind of data mining, with social network analysis, refers to a technique that analyzes the meaning latent beneath the surface meaning of the text by applying a scientific method [24].

Next, this study looks at the process of text network analysis. Data analysis methodology refers to any process of obtaining valuable and insightful information by analyzing data, starting from the relatively low-value step of basic data collection [25]. This includes all steps from collection to data preprocessing, analysis, and visualization. Data collection is mainly done through automated methods using computers, and various collection methods exist. First, one can use a public API, or collect data using web crawling or web scraping. Alternatively, it is possible to collect data from a web server using a log collector or through a Rich Site Summary (RSS) feed [26]. Data collection through public APIs is linked to public data from search portals or institutions, so prior consultation with the relevant institution is required. Web crawling is mainly used to automatically collect open unstructured data over the Internet, while web scraping can extract only the specific information desired by the user [8].

Text analysis is generally performed in the order of document collection, parsing and filtering, structuring, frequency analysis, and similarity analysis, and text analysis-related techniques are largely divided into two stages: structured text analysis and utilization [3]. Text documents such as papers, patents, and research reports are provided in the form of databases, but in most studies, text documents are collected directly through crawling. The collected text is variously applied to frequency analysis, clustering, and classification [18].


Various preprocessing and analysis steps are required to utilize text data. In particular, morpheme analysis, which is the basis of natural language processing, has required various research methodologies and complex processing techniques. In recent years, however, morpheme analysis programs have been developed as open source and provided in various forms, increasing convenience for researchers. In this study, Python and open-source morpheme analyzers and dictionaries were used for morpheme analysis. For Korean, the open-source morpheme analysis engine MeCab was used together with the mecab-ko-dic dictionary, and OpenNLP was used for English morpheme analysis.
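As a minimal sketch of this tokenization step, the following uses the konlpy wrapper around the MeCab engine (the wrapper choice and the example sentence are assumptions; the paper does not state which Python binding was used, and MeCab with mecab-ko-dic must be installed locally):

# Minimal morpheme-analysis sketch (konlpy's Mecab wrapper over open-source MeCab
# and mecab-ko-dic; the authors' exact tooling is not specified in the paper).
from konlpy.tag import Mecab

mecab = Mecab()  # requires MeCab and mecab-ko-dic installed on the system
intro = "대화형 인공지능 비서 플랫폼을 개발하는 스타트업"   # example one-line introduction
print(mecab.nouns(intro))   # noun tokens that could feed the TF-IDF vocabulary
print(mecab.pos(intro))     # (morpheme, part-of-speech) pairs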

2.3 Associated Companies Analysis

One study showed that a decision-making process for corporate management in the supply chain could be supported through social network analysis [6], and the ecosystem of local clusters was studied by analyzing sales and purchase transaction data [7]. Various studies have been conducted on the social network analysis of industries and companies from the perspective of the supply chain. In addition, studies on the advancement of the standard industry classification were conducted, including a study on a linkage model for the standard industry classification through topic modeling using the industry association table [10]. Studies on public offering prices and stock valuation were conducted through benchmarking model analysis using cases of similar companies and competitors, or information on similar companies. However, those studies did not utilize social network analysis; they used the standard industry classification and the list of classification groups of the same companies in the existing stock market [10, 27].

3 Research Method

In this study, the 'service product information' and 'investment attraction information' data on start-up companies from the 'digital industrial innovation big data platform', part of the NIA 'big data platform and center construction project', were visualized. Based on the introductions of the products and services produced or provided by each company, groups of related companies can be found that reflect the companies' businesses and technologies. In addition, by combining investment attraction information, the business and technology sectors where investment is being actively made were examined. By analyzing the words used in the introductions of companies' products and services, a network was built between companies that use similar words. Also, the node size of each company was varied according to the investment amount, so that related companies can be viewed at a glance [28] (Tables 1 and 2).


Table 1 Data sets used in the research

Category                           Data sets link                                       Number of data   Remark
Service product information        https://www.bigdata-dx.kr/product/DX062000040001     12,031           Digital industrial innovation big data platform
Investment attraction information  https://www.bigdata-dx.kr/product/DX062000010001     6714             Digital industrial innovation big data platform

Table 2 Field value information of service product information and investment attraction information

Service product information: Product serial number, company serial number, product Korean name, product English name, one line introduction, product details, search tag content, homepage URL, App Store URL, Google Play Store URL

Investment attraction information: Investment serial number, company serial number, investment date, investment stage name, investment amount, company value amount, investment attraction remarks, press release publisher name, press release title, press release URL, investment institution number

First of all, the service product information and investment attraction information data were merged into one dataset on the company serial number, and the analysis was conducted using the one-line introduction, investment date, and investment amount field values (Table 3). After merging the datasets, missing values and outliers were processed: rows with a missing one-line introduction, investment date, or investment amount were removed, and companies that received investments of less than 100 million won were removed. One of the characteristics of startups is innovation in solving market and customer problems through the convergence of various technologies and services and the combination of hardware and software.

Table 3 Data field values used for analysis

Company serial number: To distinguish startups, each company is assigned a unique numeric serial number
One line introduction: One-line introductory information about the products and services the startup is producing or providing
Investment date: The date on which the startup received the investment, expressed as 8 characters in year-month-day order
Investment amount: Amount the startup raised from investors (Unit: KRW)
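A minimal pandas sketch of this merge-and-filter step follows (the file names and English column names are hypothetical stand-ins for the platform field values listed in Tables 2 and 3):

# Minimal sketch of the dataset preparation described above (assumed data layout).
import pandas as pd

products = pd.read_csv("service_product_info.csv")       # 12,031 rows
investments = pd.read_csv("investment_info.csv")         # 6714 rows

# Merge the two datasets on the company serial number
df = investments.merge(products, on="company_serial_number", how="inner")

# Drop rows missing the one-line introduction, investment date, or investment amount
df = df.dropna(subset=["one_line_introduction", "investment_date", "investment_amount"])

# Keep only companies that raised at least 100 million KRW
df = df[df["investment_amount"] >= 100_000_000]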


For this reason, problems arise in defining startups and grouping similar companies through the existing industry classification. These problems can be addressed by network-visualizing related companies through the main words of each startup's product and service introduction data.

The startup ecosystem network divides the unstructured product and service introduction data into morpheme units and applies Term Frequency-Inverse Document Frequency (TF-IDF) analysis, a universal weighting used in information retrieval and text mining. TF-IDF measures how important a word is within a specific document using statistical figures and can be used to identify similarities between startups' product and service information; in this study, it was used to define similarities between startups. The TF-IDF value is defined as the product of the TF and IDF values:

TF-IDF = tf(d, t) · idf(D, t)

The term frequency tf(d, t) indicates how often a particular word t appears in a document d; as the value increases, the importance of that word also increases. The document frequency DF counts the number of documents in the collection D in which the word t appears, so a word that is used frequently across the document group has a high DF. The Inverse Document Frequency (IDF) is the reciprocal (inversely proportional) of this DF value, indicating how common or rare a word is across the entire document collection. If company pairs whose cosine similarity, computed on the TF-IDF vectors, is 0.6 or higher are visualized (cosine similarity being a good method to measure the similarity between analysis units using the data processing results), the startup ecosystem network can be represented as follows. One vector may be projected onto another vector and expressed as the product of the vector lengths along the same direction:

A · B = |A||B| cos θ

The value of cos θ ranges from −1 to 1 and takes its maximum value of 1 when the two vectors point in the same direction: cos θ is 1 at 0° and −1 at 180°. Cosine similarity is therefore a formula that can be used to determine how similar two vectors are:

cos θ = (A · B) / (|A||B|)

When coding cosine similarity in Python, a divide-by-zero problem can occur, so an eps term should be applied. Eps is a value added to prevent the denominator from becoming zero when all elements of x and y are zero; eps is a very small value which in most cases is rounded away, so it does not affect the calculation result.
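A minimal sketch of this similarity-network construction follows (the file name and column names continue the assumptions of the merge sketch above; TfidfVectorizer and NetworkX are stand-ins for the authors' unstated tooling, and the 0.6 edge threshold is the one used in this study):

# Minimal sketch of the TF-IDF / cosine-similarity network described above.
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

def cos_sim(x, y, eps=1e-8):
    # eps keeps the denominator from becoming zero when a vector is all zeros
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

df = pd.read_csv("merged_startups.csv")                  # output of the merge sketch (hypothetical file)
vectors = TfidfVectorizer().fit_transform(df["one_line_introduction"]).toarray()
ids = df["company_serial_number"].tolist()

G = nx.Graph()
G.add_nodes_from(ids)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        s = cos_sim(vectors[i], vectors[j])
        if s >= 0.6:                                     # connect only sufficiently similar companies
            G.add_edge(ids[i], ids[j], weight=s)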


4 Analysis Results

Each company in the startup ecosystem network is represented by a circle, called a node in network visualization analysis. The line connecting these nodes is called an edge, and in the startup ecosystem network, edge-connected companies have keywords similar to each other's products and services, so they can be interpreted as related companies. The size of each node was visualized by scaling the investment amount to a value between 20 and 300. Also, in the case of edges, the higher the cosine similarity, the thicker the edge becomes [28] (Fig. 1).

In the social network analysis, 1122 companies out of the 3517 companies that received investments of more than 100 million won in 2019–2020 were processed into data, and the final 385 companies' data were used by matching the company code with the service product information. Social network analysis was then conducted by connecting networks between companies with an inter-company cosine similarity of 0.6 or more. As a result, the social network visualization shown in Fig. 2 was derived, and a separate HTML file was created so that it can be downloaded and verified (Download link: https://url.kr/bwhy2i).

The network of artificial intelligence-related companies, which forms one community among the visualization results of all related companies, is marked as the yellow group in Fig. 3, and the network of these related companies was analyzed. Looking at the business fields of artificial intelligence-related companies, as seen in Fig. 4, they are divided into companies that develop eye-tracking and interactive AI technologies and companies that use artificial intelligence in various business fields such as stocks, finance, coding education, mobility, clothing, real estate, chest CT, skin, health, infants, and golf. The major similar keywords between the companies connected at the center of the artificial intelligence-related network are summarized in Fig. 5. Looking at the major similar keywords, it can be seen that companies are connected by keywords such as interactive, assistant, platform, and emotion.

Fig. 1 Associated company network visualization outline


Fig. 2 Associated company network visualization results


Fig. 3 Network of companies related to Artificial Intelligence (AI)

Artificial intelligence-related companies frequently use the keyword 'assistant', reflecting that artificial intelligence plays an auxiliary role in helping people judge and act. It was also observed that interactive products and services are provided to offer a convenient UI for human use. Related companies that use artificial intelligence to analyze human emotions, not just data on simple facts, can also be seen. It was found that artificial intelligence-related companies introduce their services in various forms such as platforms, software, and solutions, and mainly express them with the keyword platform.

Fig. 4 Keywords related to the business fields of artificial intelligence (AI)

Fig. 5 Similar keywords between AI-related companies’ networks



5 Conclusion

The social network analysis method is not a sudden fad among analysis techniques, but one of the universal methodologies whose theoretical foundation has been laid continuously by various scholars over decades. Thanks to advanced computing technology, the previously unrecognized potential functions and mathematical processing capabilities of social network analysis have come to the fore, and it can be judged that the time has come for it to be in demand in various fields. Compared with other statistical methods, social network analysis is a relatively old methodology, but paradoxically, this can be seen as an opportunity to explore many subsequent studies through network analysis visualization [29]. In relation to corporate and industrial analysis research, social network analysis visualization can be expected to raise the quality level of related fields by discovering companies and partners to invest in and by upgrading the search for new businesses.

What is important in social network analysis is that developing the analysis technology is, of course, a prerequisite, but expertise in the field of analysis must also be secured. The analysis result consists of a visualized network and numbers explaining the network, and the relationships between nodes can be grasped through the result. However, all the steps of true social network analysis can be said to be completed only when an insightful interpretation is presented by an expert in the field [30]. Therefore, effective research and field application are possible only when the technology for analyzing data and expertise in the field are present at the same time.

This study attempted to explain the concept and applicability of social network analysis to the field of corporate and industrial analysis through limited data and to present how it can be used. Future studies should be conducted so that the results can be applied directly in the relevant field, based on richer data and expertise in related fields. However, a rigorous comparison based on a clear understanding of existing statistical analysis methods must precede any verification of the research value of social network analysis technology. Traditional statistical analysis and social network analysis are complementary, and together they are expected to solidify the role of accelerating the big data research approach in the field of corporate and industrial analysis.

References
1. Heinrich, H.W.: Industrial Accident Prevention. A Scientific Approach, 2nd edn. (1941)
2. Woon, S.D.: Social Network Analysis. Kyunmoonsa, Seoul (2013)
3. Park, S.-j., Lee, J.-U.: Big Data era of social network analysis techniques and the use of the sports field strategy. Korea J. Sports Sci. 23(5), 933–946 (2014)
4. Scott, J.: Social Network Analysis, 3rd edn. MPG Books, Cornwall (2012)
5. Jang, B.-s., Lee, J.: A study on the use of similar company information and determination of the offering price. Korean J. Finan. Stud. 4(1), 205–232 (1998)


6. Jung, J.: A study on the industrial clusters in a region using big data. J. Korea Contents Assoc. 17(2), 543–554 (2017)
7. Choi, H.H., Koo, Y.M.: A study on industrial classification system using topic model. J. Korea Soc. Innov. 15(5), 27–67 (41 pages)
8. Rodriguez-Rodriguez, R., Leon, R.D.: Social network analysis and supply chain management. Int. J. Prod. Manage. Eng. 4(1), 35–40 (2016)
9. Lee, Y.-s., Kim, H-b., Dong, K., Kang, I-k.: A linkage model building for technology and industry classification code using text mining techniques. DongGuk University, Korea Institute of Science and Technology Information (2015)
10. Oh, O.: The study on the stock valuation utilizing earnings reports and proxy firm valuation method. J. Taxation Account. 18(3), 217–239 (2017)
11. Jung, E.-e.: Social media utilization strategy using social network analysis of big data: focusing on the case of the Postal Service Headquarters. Master's thesis, Yonsei University Graduate School (2012)
12. Bae, S.: A study on using big data based on network analysis. Korea Institute of Science and Technology Evaluation and Planning (2014)
13. Hotho, A., Nürnberger, A., Paaß, G.: A brief survey of text mining. Ldv Forum 20(1), 19–62 (2005)
14. Kim, D.-k., Kim, I.-s.: An analysis of hotel selection attributes present in online reviews using text mining. J. Tourism Sci. 41(9), 109–127
15. Paranyushkin, D.: Identifying the pathways for meaning circulation using text network analysis. 26. Nodus Labs (2011)
16. Pollak, S., Coesemans, R., Daelemans, W., Lavrač, N.: Detecting contrast patterns in newspaper articles by combining discourse analysis and text mining. Pragmatics (Quarterly Publication of the International Pragmatics Association (IPrA)) 21(4), 647–683 (2011)
17. Kam, M., Song, M.: Analysis of differences in content and tone according to newspaper companies using text mining. J. Intell. Inf. Syst. 18(3), 53–77
18. Kim, D.J.: An empirical study on filter bubbles in Youtube: using social network analysis and text network analysis
19. Diesner, J., Carley, K.M.: Revealing social structure from texts: meta-matrix text analysis as a novel method for network text analysis. In: Causal Mapping for Research in Information Technology, p. 81 (2004)
20. Kwon, H.: A semantic network analysis of newspaper coverage of the 20th general election in Korea: comparing conservative and progressive newspapers. J. Polit. Commun. (42), 39–87 (2016)
21. Choi, Y.-j., Kwon, S-h.: A semantic network analysis of the newspaper articles on big data. J. Cybercommun. Acad. Soc. 31(1), 241–286 (2014)
22. Park, H.-W., Leydesdorff, L.: Understanding the KrKwic: a computer program for the analysis of Korean text. J. Korean Data Anal. Soc. 6(5), 1377–1387 (2004)
23. Kim, N.-g., Lee, D.-h., Choi, H.-C., Wong, W.X.S.: Investigations on techniques and applications of text analytics. J. Korean Inst. Commun. Inf. Sci. 42(2), 471–492 (2017)
24. Roberts, C.W. (ed.): Text Analysis for the Social Sciences: Methods for Drawing Statistical Inferences from Texts and Transcripts. Lawrence Erlbaum Associates (1997)
25. Lee, M.: Big data analytics and public data leverage. J. KIISE 30(6), 33–39 (2012)
26. Yu, Y., Baek, S.: Issue analysis of the related mass media's news articles on the 2015 revised national curriculum using automated text analysis. J. Curriculum Eval. 19(3), 127–156 (2016)
27. Im, O.-h.: A study on the disability perspective in articles by network text analysis and content analysis: focused on the types of disability in the welfare law for disabled persons. Ph.D. thesis, Kyonggi University Graduate School (2019)


28. Lee, K.: Visualization of the Korean startup ecosystem. Integrated data map data story. www.bigdata-map.kr/datastory/new/story_28
29. Backstrom, L., Leskovec, J.: Supervised random walks: predicting and recommending links in social networks. In: WSDM '11 Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 635–644 (2011)
30. Hong, S.-m.: The World of Social Networks and the Use of Big Data. Powerbook, Seoul (2013)

Hybrid CNN-LSTM Based Time Series Data Prediction Model Study

Chungku Han, Hyeonju Park, Youngsoo Kim, and Gwangyong Gim

Abstract Recently, with the development of deep learning technology, its application in various fields is increasing. It is also important to apply deep learning techniques to time series data prediction models. For time series data analysis, various applied deep learning technologies are examined, and the current level and future status are reviewed. Time series data prediction models are generally based on deep learning algorithms such as the Recurrent Neural Network (RNN). The RNN model has a long-term dependency problem; we use the LSTM model to solve it. The LSTM model, however, requires significant training time to achieve predictive validity for time series data with rapidly changing characteristics such as fine dust pollution. A combined neural network learning model between the Convolutional Neural Network (CNN) and LSTM offers the possibility of solving this problem. The CNN calculates a feature map for each unit section of the long-term time series data, and the LSTM performs a time series trend learning operation on the feature map data. Therefore, the CNN-LSTM combined model can reduce the learning time for long-term time series data. In a verification experiment using AirKorea PM2.5 data, the prediction accuracy and predictive power consistently improved in rapidly changing time series sections.

Keywords Machine learning · Deep learning · Time-series analysis · Convolutional neural network · CNN · LSTM · CNN-LSTM

C. Han, Graduate School of IT Policy and Management, Soongsil University, Seoul, South Korea, e-mail: [email protected]
H. Park · Y. Kim, Department of IT Policy and Management, Soongsil University, Seoul, South Korea, e-mail: [email protected]; Y. Kim e-mail: [email protected]
G. Gim (B), Department of Business Administration, Soongsil University, Seoul, South Korea, e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_4


1 Introduction

With the recent development of deep learning technology, applications of the technology in various fields are increasing. Among them, analysis of time series data is becoming increasingly important. A huge amount of time series data is being generated in fields such as IoT, healthcare, smart cities, the climate environment, and finance. With the development of atmospheric measurement technology, air pollution measurement data have been steadily accumulated over the past decade. Hence, research to analyze and predict air pollution information is recognized as an important task [1, 2].

Traditional prediction models based on statistical approaches such as logistic regression analysis and the Support Vector Machine (SVM) have mainly been used to predict fine dust pollution. Recently, research on prediction models using deep learning has been attempted. Although introducing deep learning into time series analysis can be seen as a relatively new attempt, deep learning is a very flexible technology and has high potential value in time series analysis. At a time when rapidly advancing computing technology and predictive models capable of analyzing enormous amounts of data are required, deep learning is being sought as a solution to these concerns [3]. However, the fine dust pollution prediction models currently used do not have high predictive power, and the Recurrent Neural Network (RNN) related deep learning models used for prediction suffer from long-term dependency problems when learning long-term time-series data. This is problematic for the performance of a model intended to analyze and predict large amounts of time series data over a long period. Therefore, it is necessary to present a practical prediction model for such long-term, large-capacity time series data [3, 4].

Expanding this research further, a composite structural model combining the LSTM and CNN models, which have been used in the existing time series prediction field, is applied to time series data to show meaningful performance. Recurrent neural network models such as RNN and LSTM have a computational cost problem when learning time series data over a long period, and as a solution, we study a CNN-combined model to improve time series prediction performance. This paper collected ultra-fine dust (PM2.5), fine dust (PM10), nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3), temperature, humidity, wind speed, and precipitation data for the past two years from Air Korea. PM10 and PM2.5 are predicted by time zone by grasping the significance between variables.

The composition of this paper is as follows. Section 2 reviews existing studies for predicting fine dust pollution through the theoretical background and examines their limitations. Section 3 presents the fine dust pollution prediction model that combines CNN-LSTM, the method proposed in this paper. Section 4 presents the experimental environment and outcome analysis, and Sect. 5 presents conclusions and future research tasks.


2 Theoretical Background

2.1 Time-Series Analysis

It was confirmed in a 1980s study using weighted averages that combining two or more prediction methods may be more accurate than obtaining time series prediction results from a single prediction model. In that line of work, experimental analyses of several weight-estimation methods were conducted, and the 'ensemble' method of mixing several techniques showed better prediction results than the method of selecting the single 'best one' to improve prediction performance [1, 3–5].

2.2 Recurrent Neural Network

Figure 1 shows the basic structure of the RNN used in deep learning for sequence data such as time-series data. The RNN is a kind of deep neural network architecture [6, 7] with a deep structure in the time dimension, which has been widely used in time series modeling [8]. Traditional neural networks consider all units of the input vector to be independent of each other; as a result, they cannot use sequential information. In contrast, the RNN model adds a hidden state generated from the sequential information of the time series, and the output depends on this hidden state. Figure 1 shows the RNN model unfolded over the entire sequence.

Fig. 1 A recurrent neural network and the unfolding architecture [3]


Fig. 2 LSTM’s recurrent module [3, 11]

2.3 Long Short-Term Memory

The RNN has been considered a model that works well for time-series data prediction, but the vanishing gradient problem makes it difficult to learn 'long-term dependencies' when processing long-term time-series data [8–10]. For this problem, an efficient gradient-based method called Long Short-Term Memory (LSTM) was introduced. The LSTM is designed to prevent gradient loss using memory cells [11, 12]. Figure 2 shows the memory cell structure of the LSTM, consisting of four elements: an input gate, an output gate, a forget gate, and a self-recurrent neuron. The gates control the interaction between adjacent memory cells and the memory cell itself. Whether the input signal may change the state of the memory cell is controlled by the input gate, while the output gate controls whether the state of the memory cell may change the state of other memory cells. The forget gate can also choose to remember or discard the previous state [3, 11, 13–16].
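For reference, the standard formulation of these gates can be written as follows (this notation is added here and is not taken from the source; σ is the sigmoid function and ⊙ denotes element-wise multiplication):

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)        (input gate)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)        (forget gate)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)        (output gate)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)     (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t             (cell state update)
h_t = o_t ⊙ tanh(c_t)                        (hidden state / output)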

2.4 Convolutional Neural Network

Convolutional Neural Networks (CNNs) are deep learning models designed to handle data with a grid-like structure, such as photographic images [2, 17–19]. CNNs are generally composed of convolution and pooling layers and a fully connected layer (Fig. 3). The convolution and pooling layers are responsible for extracting features from the input image, and the image is classified in the fully connected layer. The layer playing the key role here is the convolution layer. In a digital image, the pixel values can be viewed as a two-dimensional grid, which is treated as a matrix, and a 'kernel', another small grid of parameters, is used to extract features of the image. The image features extracted through the kernel make the overall image learning process efficient [20, 21].


Fig. 3 Schematic diagram of the convolution neural network (CNN) structure and learning process [20]

3 Research Method

Among the air pollution measurement data collected and released on the Air Korea site of the National Institute of Environmental Research, the data were organized as time series measured every hour from January 2019 to July 2021. Figure 4 is a schematic diagram of the research method of this study. For the officially collected fine dust data of PM10 and PM2.5, the basic statistical properties of the data were first identified through basic statistical analysis. Figure 5 shows the overall trend of the time series data used in this study. As can be seen from the trend curve, since the purpose of this study is to develop a deep learning prediction model for rapidly changing time series, the fine dust time series data measured by time zone were used as they are, and in the process the final data set was constructed.

Fig. 4 CNN-LSTM based prediction model for time-series data


For the final dataset, a total of 22,628 cases covering 911 days of time-series data, seasonal variables were inserted as additional variables in consideration of seasonal periodicity.

3.1 Methodology to Predict Time-Series Using CNN-LSTM

The time series data prediction model proposed in this study is a hybrid CNN-LSTM deep learning model. The structure of the proposed model is shown in Fig. 6. The proposed model first reshapes the one-dimensional time series data into a three-dimensional array structure for the CNN operation. A convolution operation is then performed in the CNN layer on the time series data processed into this three-dimensional array form. The feature map is extracted through the computation of the CNN layer using Conv1D, and the time-series data are then learned by the connected LSTM cells of the recurrent layer.
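A minimal Keras sketch of such a Conv1D-plus-LSTM stack follows (the window length, number of input variables, layer sizes, and training settings are illustrative assumptions, not the tuned configuration of the study):

# Minimal Keras sketch of a hybrid CNN-LSTM time-series model (hyperparameters
# are illustrative; the paper's tuned configuration is not reproduced here).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

WINDOW, FEATURES = 24, 9    # e.g. 24 hourly steps and 9 input variables (assumed)

model = tf.keras.Sequential([
    layers.Conv1D(64, kernel_size=3, padding="same", activation="relu",
                  input_shape=(WINDOW, FEATURES)),   # feature maps per unit section
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(50),                                  # trend learning on the feature maps
    layers.Dense(1),                                  # next-hour PM2.5 concentration
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# X: (samples, WINDOW, FEATURES) sliding windows, y: next-step PM2.5 values
X = np.random.rand(100, WINDOW, FEATURES).astype("float32")   # placeholder data
y = np.random.rand(100, 1).astype("float32")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)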

3.2 Performance Procedure

The procedure for predicting the air pollution data in this study follows these steps. First, a high-quality basic training data set is secured through the process of selecting and collecting data. Second, data preprocessing is performed: consistency checks were conducted on the data adopted for this study, and missing values were corrected. Third, model training is performed, tracking the optimal hyperparameters for the proposed model.
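For the missing-value correction mentioned in the second step, one common approach for hourly sensor data is time-ordered interpolation; the sketch below is a minimal pandas example with made-up values, since the paper does not state which correction method was used.

```python
import pandas as pd

# Made-up hourly measurements with gaps; column names are assumptions.
df = pd.DataFrame({
    "PM10": [35.0, None, 41.0, 44.0, None, 39.0],
    "PM25": [18.0, 20.0, None, 22.0, 21.0, None],
})

# Linear interpolation for interior gaps, then forward/backward fill
# for any gaps remaining at the edges of the series.
df_clean = df.interpolate(method="linear").ffill().bfill()
print(df_clean)
```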

4 Experiments and Results

4.1 Evaluation Methods

In this study, the time-series prediction performance of the models was compared using the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). The formulas are as follows [13, 22, 23]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (1)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (2)$$

Fig. 5 Time-series trend plot for PM10 and PM2.5 (concentration in ppm)


Fig. 6 CNN-LSTM based time-series prediction model

Here, $n$ is the number of test samples, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
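A direct NumPy implementation of Eqs. (1) and (2) might look as follows; the sample values are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (2)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

y_true = [30.0, 42.0, 55.0]   # actual values
y_pred = [28.0, 45.0, 50.0]   # predicted values
print(rmse(y_true, y_pred), mae(y_true, y_pred))
```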

4.2 Model Configuration

The hybrid CNN-LSTM time-series prediction deep learning model consists of a total of six layers. The first layer, which receives the time-series data, is a convolutional layer: it takes the one-dimensional data transformed into matrix form, sets the kernel size and stride to the tuned optimal values, and generates a feature map of the input data through padding.
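A minimal TensorFlow/Keras sketch of a Conv1D-then-LSTM stack of the kind described above is given below. The filter count, kernel size, window length, and optimizer are assumptions for illustration; the paper only states that the kernel, stride, and padding were tuned to optimal values.

```python
from tensorflow import keras
from tensorflow.keras import layers

window = 24  # assumed number of timesteps per input sample

model = keras.Sequential([
    layers.Input(shape=(window, 1)),                 # (timesteps, features)
    layers.Conv1D(64, kernel_size=3, padding="same",
                  activation="relu"),                # feature-map extraction
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                                 # temporal dependencies
    layers.Dropout(0.2),
    layers.Dense(1),                                 # next-step prediction
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()
```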

4.3 Comparative Evaluation of Deep Learning Models

Figure 7 shows the results of comparing the prediction performance of the LSTM and CNN-LSTM models over six prediction intervals for PM10 and PM2.5. The evaluation was conducted over a total of six intervals, comparing actual values and predictions for 72-hour spans: from February 1 to 3, from March 1 to 3, from April 1 to 3, from May 1 to 3, and from June 1 to 3, respectively. The hybrid CNN-LSTM model was trained on the PM10 and PM2.5 time-series data, and the LSTM model was trained on the same intervals. Compared to the predictive performance of the LSTM-based deep learning model, the CNN-LSTM-based deep learning model produces results closer to the actual values. To compare and evaluate the predictive performance of the LSTM and hybrid CNN-LSTM deep learning models, the RMSE values used in this study are compared in Table 1 and the MAE values in Table 2.


Fig. 7 Actual versus prediction performance (LSTM vs. Hybrid CNN-LSTM model)

The items in bold in Tables 1 and 2 refer to the RMSE and MAE averages for the hybrid CNN-LSTM-based models.

Table 1 Average values of RMSE for LSTM and CNN-LSTM

Item                          RMSE
                              PM10                      PM2.5
                              LSTM       CNN-LSTM       LSTM       CNN-LSTM
Average                       5.230      4.956          3.631      3.414
N                             6          6              6          6
Standard deviation            1.567      1.489          1.177      1.271
Standard error of the mean    0.640      0.608          0.481      0.519

Table 2 Average values of MAE for LSTM and CNN-LSTM

Item                          MAE
                              PM10                      PM2.5
                              LSTM       CNN-LSTM       LSTM       CNN-LSTM
Average                       3.779      3.536          2.656      2.463
N                             6          6              6          6
Standard deviation            1.065      0.853          0.402      0.653
Standard error of the mean    0.435      0.348          0.164      0.267


Fig. 8 Hybrid CNN-LSTM and LSTM model performance for PM10 and PM2.5

Figure 8 summarizes the RMSE and MAE results used to compare the time-series prediction performance of the LSTM and hybrid CNN-LSTM models, and shows a graph of the results over the six intervals. The hybrid CNN-LSTM deep learning model was verified against the LSTM model through RMSE on the time-series intervals selected for this study, and it showed consistently improved predictive power over the LSTM model in the PM10 prediction intervals. Tables 1 and 2 show the RMSE and MAE used to compare the prediction errors of the LSTM and CNN-LSTM models, computed from the differences between actual and predicted values for the fine dust (PM10) and ultrafine dust (PM2.5) time-series data. For PM10, the mean RMSE of the LSTM model is 5.2302, while that of the proposed hybrid CNN-LSTM model is 4.9559, showing the improved predictive power of the hybrid CNN-LSTM model over LSTM. In addition, on the MAE basis, which excludes the effect of outliers, the hybrid CNN-LSTM model is lower by about 0.24, again indicating improved predictive performance over LSTM. Similarly, in the PM2.5 prediction intervals, the hybrid CNN-LSTM model showed consistently improved predictive power over the LSTM model. Looking at the RMSE results, the mean RMSE of the LSTM model is 3.6309, while that of the proposed hybrid CNN-LSTM model is 3.4145; therefore, the predictive power of the hybrid CNN-LSTM model is superior to that of LSTM. On the MAE basis, the MAE of the LSTM model is 2.6564 and the MAE of the hybrid CNN-LSTM model is 2.4627.


Therefore, the hybrid CNN-LSTM model showed improved time-series prediction performance compared to the LSTM model.

5 Conclusion

In this study, a hybrid CNN-LSTM time-series prediction model was built using fine dust data collected from January 2019 to July 2021, and its prediction performance was evaluated through comparison experiments with an existing recurrent neural network deep learning model. We confirm that the hybrid CNN-LSTM deep learning model achieves improved prediction performance compared to the recurrent neural network-based model (LSTM) in the time-series prediction domain. This can be presented as a meaningful result for improving predictive power in the various fields dealing with time-series data. In particular, RMSE and MAE were used to evaluate the models' prediction performance on nonlinear time-series data. By using this statistical verification method, consistency of model evaluation could be secured for the time-series prediction results over the various intervals. In addition, the predictive power for the next step was verified by separating the test intervals from the training intervals, following time-series cross-validation methods. In future studies, we will expand the work to cross-validate the predictive performance of a previously trained time-series model when skipping n steps ahead, in order to trace the causal relationships of influencing factors and extend the validation stage of the deep learning-based time-series prediction model.

References 1. Nielsen, A.: Practical Time Series Analysis. O’Reilly Media, Inc. (2019) 2. Hirschman, I.I., Widder, D.V.: The Convolution Transform. Courier Corporation (2012) 3. Bao, W., Yue, J., Rao, Y.: A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE 12(7), e0180944 (2017). https://doi. org/10.1371/journal.pone.0180944 4. Traore, B.B., Kamsu-Foguem, B., Tangrara, F.: Deep convolution neural network for image recognition. Ecol. Inf. 48, 257–268 (2018) 5. Winkler, R., Makridakis, S.: The combination of forecasts. J. R. Stat. Soc. Ser. A (General) 146, 150–157 (1983). https://doi.org/10.2307/2982011 6. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., et al.: Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012) 7. Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006) 8. Palangi, H., Deng, L., Shen, Y.L., Gao, J.F., He, X.D., Chen, J.S., et al.: Deep sentence embedding using long short-term memory networks: analysis and application to information retrieval. IEEE-ACM Trans Audio Speech Lang. 24(4), 694–707 (2016). https://doi.org/10.1109/taslp. 2016.2520371


9. Lee, H., Song, J.: Introduction to convolutional neural network using Keras; an understanding from a statistician. CSAM (Commun. Stat. Appl. Methods) 26, 591–610 (2019) 10. Palangi, H., Ward, R., Deng, L.: Distributed compressive sensing: a deep learning approach. IEEE Trans. Signal Process. 64(17), 4504–4518 (2016) 11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). PMID: 9377276 12. van den Oord, A., et al.: Wavenet: A generative model for raw audio. arXiv preprint arXiv: 1609.03499 (2016) 13. TensorFlow:. TensorFlow [WWW Document]. Tensor-Flow (2018). https://www.tensorflo w.org. Last accessed on 12 Oct 2018 14. Traore, B.B., Kamsu-Foguem, B., Tangara, F.: Integrating MDA and SOA for improving telemedicine services. Telemat. Inform. 33, 733–741 (2016). https://doi.org/10.1016/j.tele. 2015.11.009 15. Traore, B.B., Kamsu-Foguem, B., Tangara, F.: Data mining techniques on satellite images for discovery of risk areas. Expert Syst. Appl. 72, 443–456 (2017). https://doi.org/10.1016/j.eswa. 2016.10.010 16. Traore, B.B., Kamsu-Foguem, B., Tangara, F., Tiako, P.: Software services for supporting remote crisis management. Sustain. Cities Soc. 39, 814–927 (May) (2018) 17. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015) 18. LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel, L.D.: Handwritten digit recognition with a back-propagation network. Adv. Neur. Inf. Process. Syst., 396–404 (1990) 19. LeCun, Y., Cortes, C., Burges, C.J.: MNIST handwritten digit database. ATT Labs Online Available Httpyann Lecun Comexdbmnist 2 (2010) 20. Yamashita, R., Nishio, M., Do, R.K.G., et al.: Convolutional neural networks: an overview and application in radiology. Insights Imag. 9, 611–629 (2018). https://doi.org/10.1007/s13244018-0639-9 21. Geron, A.: Processing sequences Using RNNs and CNNs. In: Hands On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd edn. O’Reilly Media Inc., Sebastopol (2019) 22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates, Inc. (2012) 23. Yoon, S.-j.: Do It! Deep Learning Textbook. EasysPub Inc. (2021)

A Study on Predicting Employee Attrition Using Machine Learning Simon Gim and Eun Tack Im

Abstract As corporations go through the Great Resignation, a post-pandemic economic trend marked by a surge in employee resignations, employee attrition has become one of the most significant problems for any organization. Employee attrition is defined as a reduction in the number of workers from various causes such as retirement, resignation, and termination. Because employees are important human resources (HR) of the organization and the subjects who own other valuable resources that the organization needs, diverse opportunity costs occur when employee attrition takes place. To prevent such unwanted loss of valuable assets, various efforts have been made to predict and prevent employee attrition. In this study, three machine learning methods, Random Forest, XGBoost, and Artificial Neural Network, were used to predict employee attrition. Kaggle's IBM HR Analytics Employee Attrition and Performance dataset, which contains information on 1,470 employees, was used as the data set. The variable to be predicted was whether or not employees leave the organization, and a total of 35 variables such as academic background and environment satisfaction were considered. 'Accuracy', 'Precision', 'Sensitivity', and 'F1-Score' were used as measures of the prediction performance of the models. The results showed that XGBoost had the best performance in Accuracy while Random Forest showed the best performance in Precision. Artificial Neural Network showed the best performance in both Sensitivity and F1-Score.

S. Gim (B) Industrial and Labor Relations School, Cornell University, Ithaca, NY, USA e-mail: [email protected] E. T. Im Department of Business Administration, Soongsil University, Seoul, South Korea © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_5


1 Introduction

During the COVID-19 pandemic, the United States experienced record-high unemployment as the unemployment rate escalated to almost 15% [1]. Employees who managed to avoid lay-offs were still forced to leave their face-to-face offices and work remotely at home. Health concerns also increased the need for employer-based health insurance, which left many employees unsatisfied [2]. Two years into the pandemic, the situation has changed, as employees refuse to return to their pre-pandemic work conditions or even quit their jobs without hesitation. According to a March 2021 Gallup poll, 48% of U.S. workers across all job categories were not engaged or were actively disengaged, leading to 3.6 million resignations in May 2021 alone [3]. Even among current employees, 44% are "job seekers," according to Willis Towers Watson's 2022 Global Benefits Attitudes Survey, and 33% are active job hunters who looked for new work in the fourth quarter of 2021 [4]. Now, the era of the Great Resignation, the term suggested by Anthony Klotz to predict a post-pandemic surge in employee resignations [1, 5], has arrived.

Employees are important human resources (HR) of the organization and the subjects who own other valuable resources that the organization needs. With the Great Resignation, maintaining human resources within the organization has become the most important task for every organization. Even setting aside the Great Resignation, however, employee attrition is becoming more common amid organizational changes such as intensifying competition and accelerating market changes [6]. Employee attrition is also expected to increase as the labor market becomes more flexible. Employee attrition is defined as a reduction in manpower in any organization that occurs in diverse forms such as retirement, resignation, and termination [7]. The terms 'turnover' and 'attrition' are often used together but are distinct business terminologies. The main difference between the two is that when turnover occurs, the organization searches for human resources to replace the leaving employee, whereas in the case of attrition, the organization leaves the job open or eliminates the position itself [8]. The attrition of employees not only incurs costs because the invested human capital is no longer used for production activities but also negatively affects performance as a result [9].

In recent years, both human resource departments and general employees have become aware of the negative impact of human resource outflow on the organization and are seeking ways to study the causes behind attrition. Previously, companies relied on traditional methods such as interviews or surveys conducted before employees leave the company to obtain the data needed for such research [10]. However, these traditional methods have shown limitations in that they cannot encourage employees to provide accurate and honest answers and suffer from data errors due to bias created in the survey itself. To overcome such shortcomings, machine learning techniques have recently been adopted to develop comprehensive and robust analyses of HR data. Machine learning techniques have proven effective in providing cumulative insights from large-scale, historical, and labelled datasets not only in the HR field but also in diverse other fields of business including finance and marketing [11].


The aim of this paper is to predict employee attrition using different machine learning algorithms. The second section of the paper covers the conceptual background on feature selection and prediction model algorithms and related works done in the past. Section 3 describes the purpose and design of the research model. Section 4 contains the detailed experimental results of the prediction models and an analysis of how the prediction models functioned. Section 5 presents important implications based on the analysis.

2 Theoretical Background

2.1 Data Analytics in Human Resources (HR)

As employee attrition becomes one of the most important issues for corporations and machine learning algorithms emerge as an innovative breakthrough in quantitative methods, various efforts have been made to better understand which features have the most influence on employee attrition through machine learning. Alao and Adeyemo [12] identified employee-related attributes related to the prediction of employee attrition using demographic and job-related data of employees of one of the higher institutions in Nigeria. The study used the Waikato Environment for Knowledge Analysis (WEKA) and See5 for Windows and generated decision tree models and rule sets. The results showed that salary and length of service were the determining factors for predicting employee attrition [12]. This study shows how the decision tree model, one of the most representative machine learning algorithms, can be used to predict employee attrition. However, the limitation of the study lies in the small size of the data, which consisted of 309 records of employees between 1978 and 2006. Tzeng et al. [13] applied a Support Vector Machine to predicting nurses' intention to quit. This study used data consisting of 380 cases of nurses in three hospitals located in Taiwan. Working motivation, job satisfaction, and stress levels were used as predictors. The Support Vector Machine showed a high accuracy rate of 89.2% in prediction [13]. The study is meaningful in successfully adapting a machine learning method to predicting nurses' turnover. However, the limitation of the study is that only three predictors were used in the experiment, making it hard to establish the link between turnover and employee attributes. The performance of machine learning algorithms has also been evaluated in previous studies. Ajit [14] applied the Extreme Gradient Boosting technique to predict employee turnover. Data from the HR information systems of a global retailer, consisting of 73,115 data points with key features such as age, pay, and peer attrition, were used for the analysis. The research showed that the Extreme Gradient Boosting classifier is a superior algorithm with significantly high accuracy and relatively low runtimes for predicting employee turnover [14].


The significance of the study is that it utilized a large data set with a machine learning algorithm to produce a meaningful prediction of employee turnover. Sexton et al. [15] used a Neural Network to predict employee turnover and incorporated the Neural Network Simultaneous Optimization Algorithm (NNSOA) to optimize the network. This study found that the NNSOA was extremely helpful in eliminating unnecessary weights from the Neural Network model, and the NeuroShell software was shown to produce the second-best result [15]. Although this study presented the effectiveness of the Neural Network method, it is limited by the small data set used: the data were employee records of a small, family-owned company with fewer than 100 employees. In addition, there is an issue of generalization to other companies or industries because of the specificity of the data.

2.2 Machine Learning Algorithm

2.2.1 Random Forest

Random Forest is one of the most commonly used machine learning algorithms and combines the output of multiple decision trees to come to a single result [16]. In Random Forest, decision trees based on individual CARTs (Classification and Regression Trees) are combined so that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [17]. It is one of the representative techniques of bagging (Bootstrap Aggregation), in which multiple datasets are generated from the sample through bootstrap resampling and the resulting models are then aggregated. In building each model, a subset of the feature variables and part of the total data are used [18]. Random Forest is characterized by being relatively resistant to overfitting, being less influenced by outliers, and achieving high accuracy. Random Forest creates each decision tree by resampling the entire data. Because part of the total data is resampled with replacement, it has the advantage of being effective in the presence of missing values [16]. It also has the advantage of effectively reducing the variance of the prediction results by reflecting various variable situations, since variables that would otherwise receive little consideration are reflected when building the models. Figure 1 shows the structure of the Random Forest model (bagging).

Fig. 1 Random Forest model (bagging)
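A minimal scikit-learn sketch of the bagging idea is shown below; the synthetic data and hyperparameters are placeholders, not the settings used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the HR features and the attrition label.
X, y = make_classification(n_samples=1470, n_features=20,
                           weights=[0.81, 0.19], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Many CART trees, each grown on a bootstrap sample of the rows
# and a random subset of the features, then aggregated by voting.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            bootstrap=True, random_state=42)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```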


Fig. 2 Gradient Boosting model [22]

2.2.2 Extreme Gradient Boosting (XGBoost)

Extreme Gradient Boosting (XGBoost) is an improved model of the boosting technique used with decision trees [19]. Boosting is an ensemble technique that combines numerous weak classifiers to create a strong classifier. However, existing boosting techniques had the disadvantage of being too slow due to sequential learning. XGBoost improves this slow learning speed through parallelization. XGBoost uses the Classification and Regression Trees (CART) model, the most representative model of decision trees [20]. The biggest advantage of the XGBoost model is that it can scale to billions of examples even in very restricted memory environments through its parallelized tree boosting techniques. Also, unlike previous boosting techniques, the learning speed is very fast while performing these steps [19]. In addition, the XGBoost model has the advantage of being adaptable to the characteristics of the data and the purpose of the model through parameter adjustment [21]. Figure 2 [22] shows the structure of the Gradient Boosting model.
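The same kind of placeholder data can be fed to XGBoost; the hyperparameters below simply illustrate the regularization, subsampling, and parallelization options discussed above and are not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1470, n_features=20,
                           weights=[0.81, 0.19], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Boosted CART trees: each new tree fits the errors of the previous ones,
# with regularization and subsampling to limit overfitting.
xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4,
                    subsample=0.8,         # row subsampling
                    colsample_bytree=0.8,  # column subsampling
                    reg_lambda=1.0,        # L2 penalty on leaf weights
                    n_jobs=-1)             # parallelized tree construction
xgb.fit(X_train, y_train)
print("test accuracy:", xgb.score(X_test, y_test))
```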

2.2.3 Multi-Layer Perceptron

A Multi-Layer Perceptron (MLP) refers to an Artificial Neural Network (ANN) in which one or more hidden layers, which are intermediate layers, exist between the input layer and the output layer [23]. An MLP has a structure similar to a single-layer perceptron, but the input-output characteristics of the intermediate layers are nonlinear. Through this structure, the MLP has the advantage of solving problems that the single-layer perceptron could not solve [24]. An MLP uses training data consisting of inputs and target outputs to adjust the weights, which represent connection strengths, so as to minimize the error between the network's output and the target output for each input [23]. Artificial Neural Network models are learning algorithms inspired by humans' biological neural networks, as seen in Fig. 3. An Artificial Neural Network finds patterns through repetitive learning processes on its own data, similar to the neural activity in the human brain [24]. Through the generalization of the results of this repetitive learning process, it can become a useful tool for research that needs to make predictions.
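A comparable scikit-learn sketch of a multi-layer perceptron is shown below; the hidden-layer sizes and activation are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

X, y = make_classification(n_samples=1470, n_features=20,
                           weights=[0.81, 0.19], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Hidden layers of nonlinear (ReLU) units between input and output;
# weights are adjusted by back-propagation to reduce the output error.
mlp = make_pipeline(
    MaxAbsScaler(),  # neural networks are sensitive to feature scale
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                  max_iter=500, random_state=42),
)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```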


Fig. 3 Framework of a neural network and a neuron in the human body [25]

The most unique characteristic of the Artificial Neural Network is its flexibility [26]. It can flexibly cope with complex situations even though the learning process is complicated. Thus, it has the advantage of being able to handle complex and diverse data easily. Regardless of whether the data contain qualitative (categorical) variables or quantitative (continuous) variables, all variables can be readily analyzed by an Artificial Neural Network. Also, non-linear combinations of input variables are possible, resulting in high predictability even without statistical assumptions [26]. The disadvantage of the Artificial Neural Network is that it is not easy to understand the interrelationship between the input variables and the output variable, making the interpretation of the results difficult. Also, there is a risk of overfitting or falling into a local minimum due to the increased trial and error of the model [27] (Table 1).

Table 1 Advantages and disadvantages of each model

Category: Random Forest
  Advantages: 1. Excellent generalization power coming from the combination of multiple decision trees; 2. High predictive power
  Disadvantages: 1. Difficulty in interpreting the result; 2. Danger of overfitting

Category: Extreme Gradient Boosting
  Advantages: 1. Fast process speed; 2. High predictive power
  Disadvantages: 1. Danger of overfitting; 2. Hard to interpret

Category: Multi-layer perceptron
  Advantages: 1. High predictive power; 2. No assumptions about linearity and correlations are needed
  Disadvantages: 1. No explanation for models' output; 2. Different random weight initializations can lead to different validation accuracy


Fig. 4 Research framework

3 Research Design

3.1 Research Framework

The purpose of this study is to predict employee attrition in corporate human resource management using machine learning. For the research, the HR attrition and performance data set [28] from Kaggle, an open big data platform, was used. For this data set, variables that were unnecessary for prediction were removed: variables containing only a single value and variables with different values in every row. Data scaling was then performed, which is intended to improve predictive power by narrowing the gap between data measured in different units. The data were then divided in a ratio of 70:30 to construct predictive models using the Random Forest, Extreme Gradient Boosting (XGBoost), and Deep Neural Network techniques. Training was conducted using 70% of the data, and Accuracy, Precision, Recall, and F1-score were measured on the remaining 30% to evaluate the performance of the predictive models (Fig. 4).

3.2 Feature Selection Using Filter Method

Many of the independent variables used in machine learning prediction projects are often not ideal. Also, if there is a large number of features in the dataset, it can cause model overfitting and degrade prediction results.


It also requires more time and is more computationally demanding [29]. It is necessary to address this problem by removing features, i.e., independent variables that are irrelevant to predicting the dependent variable. The process of finding and removing such features is called feature selection. There are largely three approaches to feature selection: the filter method, the wrapper method, and the embedded method. The filter method removes unrelated variables before the machine learning modeling stage by assessing their relationship with the dependent variable to be predicted through statistical methods in the data preprocessing stage. Features are selected using the chi-square test and correlation coefficients according to the data types of the dependent and independent variables [30]. The wrapper method trains the machine learning model on a feature combination and then finds the combination that shows the best performance by repeatedly changing the feature combination [30]. Fast modeling algorithms such as Support Vector Machines and Naïve Bayes are used because a large time cost is incurred in repeatedly fitting the model; there is also a risk of overfitting because the training data set is used repeatedly. The embedded method performs feature selection while the model is being trained, and includes not only algorithms such as CART and C4.5 but also methods such as ElasticNet that penalize features that do not contribute to the model [30]. The purpose of this study is to compare the accuracy of machine learning models in predictions using HR data. Thus, the same variables were used as inputs to all the machine learning models. Feature selection was performed through statistical methods that consider the type of the dependent variable to be predicted and the types of the independent variables used to predict it.

3.3 Classification Prediction Model

The dependent variable of the Kaggle data set used in this study, 'Attrition', is a categorical variable consisting of 'yes' and 'no'. Because it is categorical, a classification predictive model is needed. Among the representative ensemble methods, the Random Forest method, which uses bagging, and the Extreme Gradient Boosting method, which is the representative boosting method, were used. This study also compared performance by building a classification model using the Deep Neural Network method, which is an Artificial Neural Network approach. Random Forest and Extreme Gradient Boosting are ensemble machine learning algorithms. Unlike singular models, an ensemble model creates multiple models and then combines them to produce improved results. Within ensemble models, there is a difference depending on whether the bagging method or the boosting method is used.


Random Forest is an extension of the bagging (Bootstrap Aggregation) method which creates a comprehensive model from several singular CARTs, each built from a subset of the features and a sub-sample of the data rather than the entire data set [31]. Thus, Random Forest is effective for building models on datasets with missing values. Random Forest also has the advantage of effectively reducing the variance of the prediction results by reflecting various variable situations and making effective and stable predictions [32]. While the Random Forest model based on the bagging method has the advantage of showing stable prediction results, Extreme Gradient Boosting has the advantage of developing models specialized for particular problems [32]. This is because Extreme Gradient Boosting is based on a boosting method that generates and trains models sequentially by weighting the residuals of the learned models [33]. In addition, Extreme Gradient Boosting attempts to overcome the overfitting shortcoming of the Gradient Boosting model by incorporating regularization and subsampling and applying a penalty to the weighted values. Artificial Neural Networks, one of the most important deep learning methods, are computing systems inspired by the human neural network, which consists of complexes of interconnected nodes [34, 35]. In a deep learning algorithm, the input layer receives the independent variables necessary for the prediction, which are then multiplied by the weights connecting several hidden layers. Along the way, activation functions such as Sigmoid, Tanh, and ReLU are applied so that the computed values become nonlinear. Forward propagation then derives the prediction result for the dependent variable at the output layer. Learning proceeds by repeating back-propagation, which compares the predicted result with the real value, propagates the error back toward the input layer, and adjusts the weights [34].

3.4 Machine Learning Performance Measurement

A confusion matrix is a classification table of predicted results against the actual values to be predicted, and it is commonly used as the basis for evaluating the performance of classification machine learning. Accuracy refers to the proportion of samples for which the actual positive and negative classes are correctly predicted. However, if the data of the variable to be predicted are skewed to one side, a high number can be produced, creating the illusion that the performance is good. To prevent this misinterpretation, Precision and Sensitivity are additionally calculated. Precision refers to the ratio of actual positives among the samples predicted as positive. Sensitivity is the ratio of actual positives predicted as positive among all actual positives. The F1-score combines precision and sensitivity and is an indicator for evaluating the overall performance of the machine learning model. Table 2 shows how each performance measure is calculated from the confusion matrix.


Table 2 Confusion matrix and machine learning performance measure

                      Actual Positive    Actual Negative
Predicted Positive    tp                 fp
Predicted Negative    fn                 tn

$$Precision = \frac{tp}{tp + fp} \qquad Sensitivity = \frac{tp}{tp + fn}$$

$$Accuracy = \frac{tp + tn}{tp + fp + fn + tn} \qquad F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
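The four measures can be computed directly from the confusion-matrix counts, as in the small sketch below with hypothetical labels (1 = attrition).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # hypothetical actual labels
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])  # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", (tp + tn) / (tp + fp + fn + tn), accuracy_score(y_true, y_pred))
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall   :", tp / (tp + fn), recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
```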

4 Research Result

4.1 Data Set

To predict employee attrition, this study used Kaggle's IBM HR Analytics Employee Attrition and Performance dataset, which contains information on 1,470 employees [28]. The variable to be predicted through this dataset is whether or not employees leave the organization. There are 35 variables including the target variable 'Attrition', which consists of 'yes' and 'no'. Other variables include academic background, environment satisfaction, and work-life balance.

4.2 Data Preprocessing

The purpose of data preprocessing is to improve the predictive power of the predictive models by eliminating errors in the data. For this HR attrition prediction model, data preprocessing was done by deleting 'EmployeeCount' and 'Over18', which had only one unique value, and by deleting 'EmployeeNumber', which has a different value in every row. Data scaling was performed to adjust data with different units to the same range; for this research, each feature value is expressed between −1 and 1. For positive-valued data, the MaxAbsScaler, which is equivalent to the MinMaxScaler in that case, was used.
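A sketch of this preprocessing with pandas and scikit-learn is shown below. The CSV file name follows the usual name of the Kaggle download and is an assumption, as are the exact column spellings.

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# Assumed file name for Kaggle's IBM HR Analytics dataset [28].
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Drop single-valued columns and the per-row identifier.
df = df.drop(columns=["EmployeeCount", "Over18", "EmployeeNumber"])

# Separate the target and scale numeric features to the [-1, 1] range.
y = (df.pop("Attrition") == "Yes").astype(int)
num_cols = df.select_dtypes("number").columns
df[num_cols] = MaxAbsScaler().fit_transform(df[num_cols])
print(df.shape, y.mean())
```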

4.3 Feature Selection

For better predictive machine learning modeling, feature selection was performed. Feature selection extracts independent variables that have a significant relationship with the target variable, using statistical techniques from the filter methods applied in the preprocessing stage.


Table 3 T-test for feature selection

                            Significance of    Significance of    Mean diff
                            equal variance     mean diff
Age                         0.282              0.000              3.954
DailyRate                   0.811              0.030              62.142
DistanceFromHome            0.026              0.004              −1.717
EnvironmentSatisfaction     0.000              0.000              0.307
HourlyRate                  0.503              0.793              0.378
JobInvolvement              0.000              0.000              0.251
JobLevel                    0.070              0.000              0.509
JobSatisfaction             0.120              0.000              0.310
MonthlyIncome               0.000              0.000              2045.647
MonthlyRate                 0.860              0.561              −293.529
NumCompaniesWorked          0.004              0.116              −0.295
PercentSalaryHike           0.595              0.606              0.134
PerformanceRating           0.825              0.912              −0.003
RelationshipSatisfaction    0.055              0.079              0.135
StockOptionLevel            0.399              0.000              0.318
TotalWorkingYears           0.006              0.000              3.618
TrainingTimesLastYear       0.975              0.023              0.208
WorkLifeBalance             0.000              0.030              0.123
YearsAtCompany              0.114              0.000              2.238
YearsInCurrentRole          0.000              0.000              1.581
YearsSinceLastPromotion     0.257              0.206              0.289
YearsWithCurrManager        0.000              0.000              1.515

Since the target variable is a categorical variable, its relationship with the continuous variables was confirmed through a t-test, and a cross table (chi-square test) was used for its relationship with the categorical variables. Independent variables without a significant relationship, i.e., those with a p-value greater than 0.1, were removed. As a result of the t-test analysis, 'HourlyRate', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', and 'YearsSinceLastPromotion' were removed. As a result of the cross table analysis, 'Gender' and 'Education' were removed. The results of feature selection are shown in Tables 3 and 4.
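A simplified sketch of this filter step is shown below; it assumes `df` and `y` are the preprocessed features and 0/1 target from the sketch in Sect. 4.2, uses a Welch t-test for the continuous features and a chi-square test on the cross table for the categorical ones, and applies the paper's 0.1 threshold. The exact test variant used in the paper (equal-variance checks, software) is not reproduced here.

```python
import pandas as pd
from scipy import stats

def filter_features(df, y, alpha=0.1):
    """Keep features whose relationship with the binary target y is
    significant at the given level (t-test for numeric columns,
    chi-square on the cross table for categorical columns)."""
    keep = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            _, p = stats.ttest_ind(df.loc[y == 1, col],
                                   df.loc[y == 0, col], equal_var=False)
        else:
            _, p, _, _ = stats.chi2_contingency(pd.crosstab(df[col], y))
        if p <= alpha:
            keep.append(col)
    return keep

# e.g. selected = filter_features(df, y)  # df, y from the preprocessing sketch
```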

4.4 Ensemble Prediction Model Performance

Random Forest, XGBoost, and Artificial Neural Network were used to predict the target variable 'Attrition', which is a categorical variable.


Table 4 Cross table test for feature selection

                              Pearson chi-square
Attrition * BusinessTravel    0.000
Attrition * Department        0.005
Attrition * Education         0.525
Attrition * EducationField    0.011
Attrition * Gender            0.145
Attrition * JobRole           0.000
Attrition * MaritalStatus     0.000
Attrition * OverTime          0.000

Out of the 1,470 records in the data, 1,197 (81.4%) answered "No", meaning the employee did not quit, and 273 (18.6%) answered "Yes", meaning they quit. For better machine learning performance, the data were randomly split into 1,029 training records, 70% of the data, and 441 test records. This study evaluated the classification performance of the machine learning models by targeting 'Yes', which means employee attrition. The results of the performance measurement using Accuracy, Precision, Sensitivity, and F1-score are shown in Table 5. The results show that XGBoost had the best Accuracy with 0.871 and ANN had the worst with 0.859. Regarding Precision, Random Forest showed very good performance compared to XGBoost and ANN with 0.857; however, regarding Sensitivity it showed the opposite result with 0.174. For Sensitivity, ANN had the best result with 0.420. For the F1-score, which combines Precision and Sensitivity, ANN showed the best performance with 0.483. One of the unique aspects of the result is that Random Forest, which is generally good at prediction, showed abnormally low performance in Sensitivity. Random Forest had a high Precision of 0.857, the ratio of actual positives among the predicted positives, but a very low Sensitivity of 0.174, the ratio of correctly predicted positives among the actual positives. Since each model is built from sub-samples of the data, the small number of "Yes" cases in this data, which mean attrition, appears to have limited the performance of Random Forest. This research interprets the cause of the difference in predictive power to be that Random Forest builds its models through sub-sampling bagging: because employees who leave ('yes' in Attrition), the cases being targeted, make up only a small proportion of the data set, the probability of building models with an even smaller number of such cases increases.

Table 5 Prediction model performance

                             Accuracy    Precision    Sensitivity    F1 score
Random Forest                0.866       0.857        0.174          0.289
Extreme Gradient Boost       0.871       0.650        0.377          0.477
Artificial neural network    0.859       0.569        0.420          0.483


This is because Random Forest then works with an even smaller number of targeted 'yes' cases within each model. In other words, the performance of Random Forest is limited when the targeted sample size is small.

5 Conclusion

Employee attrition has been identified as a crucial issue in organizations as the Great Resignation continues. In this era, this research aimed to predict employee attrition using three different prediction models. Random Forest, XGBoost, and Artificial Neural Network were used to predict the target variable, and 'Accuracy', 'Precision', 'Sensitivity', and 'F1-Score' were used to measure performance. The results showed that XGBoost had the best performance in Accuracy, while Artificial Neural Network had the worst. Regarding Precision, Random Forest showed the best performance and Artificial Neural Network the worst. Regarding Sensitivity, however, Random Forest showed the worst result while ANN had the best. Regarding the F1-Score, ANN showed the best performance while Random Forest had the worst. This research has implications for applying diverse machine learning models to predict employee attrition. It also compared the performance of three different models, highlighting the weaknesses and strengths of each. This research also has important implications for HR professionals and corporations. By showing the effectiveness of machine learning tools in predicting employee attrition, this study provided tools and models that can enhance the understanding and prediction of attrition. For practitioners in the HR field, this study also provides guidance on the use of different prediction models for different purposes. For future studies, a bigger and more diverse data set can be used for more accurate prediction. In this research, the limitations of a small data set appeared when Random Forest, which is known for its prediction performance, showed abnormally low Sensitivity. Because this research interpreted the size of the targeted sample to be the cause, a data set with more and more diverse data would increase accuracy. There is also a limitation regarding generalization: although this research used HR data of a sizeable corporation, there is still a lack of information and data to generalize to different departments and industries. Thus, future studies should be conducted on combined data from different departments and industries.

References 1. Hopkins, J.C., Figaro, K.A.: The great resignation: an argument for hybrid leadership. Int. J. Bus. Manage. Res. 9(4), 393–400 (2021) 2. Klotz, A.C., Zimmerman, R.D.: On the turning away: an exploration of the employee resignation process. In: Research in Personnel and Human Resources Management. Emerald Group


Publishing Limited (2015) 3. Watts, J.M.: Resignation. Fire Technol. 49(1), 1–2 (2013) 4. Allman, K.: Career matters: ‘The great resignation’ sweeping workplaces around the world. LSJ Law Soc. NSW J. 81, 46–47 (2021) 5. Klotz, A.C., Bolino, M.C.: Saying goodbye: the nature, causes, and consequences of employee resignation styles. J. Appl. Psychol. 101(10), 13986 (2016) 6. Cho, G.S.: The effects of commitment to organizational change on employee‘s turnover intention: an empirical investigation for Korean bank employees under mergers and acquisitions. J. Hum. Resour. Manage. Res. 13(1), 167–182 (2006) 7. Chakraborty, R., Mridha, K., Shaw, R., Ghosh, A.: Study and prediction analysis of the employee turnover using machine learning approaches. In: 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON), pp. 1–6 (2021) 8. Sisodia, D.S., Vishwakarma, S., Pujahari, A.: Evaluation of machine learning models for employee churn prediction. In: 2017 International Conference on Inventive Computing and Informatics (ICICI), pp. 1016–1020 (2017) 9. Kwon K.: The Relationship between employee turnover and firm performance: an explorative study, 16(1), 1–26. Korea Labor Institute (2016) 10. PARK, Y., Lee, D.G.: Development of a resignation prediction model using HR data. In: Proceedings of the Korean Institute of Information and Communication Sciences Conference, pp. 100–103. The Korea Institute of Information and Communication Engineering (2021) 11. Zhao, Y., Hryniewicki, M.K., Cheng, F., Fu, B., Zhu, X.: Employee turnover prediction with machine learning: a reliable approach. Adv. Intell. Syst. Comput. 869 (2019) 12. Alao, D.A.B.A., Adeyemo, A.B.: Analyzing employee attrition using decision tree algorithms. Comput. Inf. Syst. Dev. Inf. Allied Res. J. 4(1), 17–28 (2013) 13. Tzeng, H.M., Hsieh, J.G., Lin, Y.L.: Predicting nurses’ intention to quit with a support vector machine. CIN Comput. Inf. Nurs. 22(4), 232–242 (2004) 14. Ajit, P.: Prediction of employee turnover in organizations using machine learning algorithms. Algorithms 4(5), C5 (2016) 15. Sexton, R.S., McMurtrey, S., Michalopoulos, J.O., Smith, A.M.: Employee turnover: a neural network solution. Comput. Oper. Res. 32, 2635–2651 (2005) 16. Belgiu, M., Dr˘agu¸t, L.: Random forest in remote sensing: a review of applications and future directions. ISPRS J. Photogram. Remote Sens. 114, 24–31 (2016) 17. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001) 18. Valle, M.A., Ruz, G.A.: Turnover prediction in a call center: behavioral evidence of loss aversion using random forest and naïve bayes algorithms. Appl. Artif. Intell. 29(9), 923–942 (2015) 19. Chen, T.: Xgboost: extreme gradient boosting. R Package Version. 0.4-2 1(4), 1–4 (2015) 20. Kim, Y., Choi, H., Kim, S.: A study on risk parity asset allocation model with XGBoost. J. Intell. Inf. Syst. 26(1), 135–149 (2020) 21. Jain, R., Nayyar, A.: Predicting employee attrition using XGBoost machine learning approach. In: 2018 International Conference On System Modeling and Advancement in Research Trends (SMART). IEEE (2018) 22. Gim, G.: Kaggle Data Strategy Practice Using SPSS, R, Python, 1st edn. Cheong-Ram Publication (2022) 23. Somers, M.J.: Application of two neural network paradigms to the study of voluntary employee turnover. J. Appl. Psychol. 84(2), 177 (1999) 24. Esmaieeli Sikaroudi, A.M., Ghousi, R., Sikaroudi, A.: A data mining approach to employee turnover. J. Ind. Syst. Eng. 8(4), 106–123 (2015) 25. 
Im, E.T., Gim, G.: Developing AI models with the AutoML platform Wise Prophet, R, Python, 1st edn. Cheong-Ram Publication (2022) 26. Umang, S.: A comparison study between ANN and ANFIS for the prediction of employee turnover in an organization. In: 2018 International Conference on Computing, Power and Communication Technologies (GUCON). IEEE (2018)


27. Jain, A.K., Mao, J., Mohiuddin, K.M.: Artificial neural networks: a tutorial. Computer 29(3), 31–44 (1996) 28. Kaggle, IBM HR analytics employe attrition & performance. https://www.kaggle.com/datasets/ pavansubhasht/ibm-hr-analytics-attrition-dataset 29. Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R.P., Tang, J., Liu, H.: Feature selection: a data perspective. ACM Comput. Surv. (CSUR) 50(6), 1–45 (2017) 30. Jovi´c, A., Brki´c, K., Bogunovi´c, N.: A review of feature selection methods with applications. In: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1200–1205. IEEE (2015) 31. Lee, K., Hong C., Lee, E.H., Yang, W.H.: Comparison of artificial intelligence methods for prediction of mechanical properties. IOP Conf. Ser. Mater. Sci. Eng. 967(1), 012031. IOP Publishing (2020) 32. Bühlmann, P.: Bagging, boosting and ensemble methods. In: Handbook of Computational Statistics, pp. 985–1022. Springer, Berlin (2012) 33. Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 367–378 (2002) 34. Wu, Y.-C., Feng, J.-W.: Development and application of artificial neural network. Wireless Pers. Commun. 102(2), 1645–1656 (2018) 35. Jenkins, B.K., Tanguay, A.R.: Handbook of Neural Computing and Neural Networks. MIT Press, Boston (1995)

A Study on the Intention to Continue Using a Highway Driving Assistance (HDA) System Based on Advanced Driver Assistance System (ADAS) Myungho Lee, Jaeyoung So, and Jaewook Kim

Abstract The purpose of this study was to identify factors affecting the intention to continue using the ADAS-based HDA system, which will lead the development of autonomous vehicles. For this purpose, an online and offline survey was conducted targeting 450 men and women over 20 who had used the HDA system nationwide. Of these, 409 valid responses were analyzed and the hypotheses were verified using the SPSS 18.0 and AMOS 22.0 programs. As a result of the analysis, among the characteristics of the HDA system, convenience had a positive (+) effect only on perceived ease of use, safety had a positive (+) effect only on perceived usefulness, and interoperability was confirmed to have a positive (+) effect only on perceived ease of use. In addition, among user characteristics, innovativeness had a positive (+) effect only on perceived usefulness, perceived playfulness had a positive (+) effect only on perceived ease of use, and self-efficacy was confirmed to have a positive (+) effect on both perceived ease of use and usefulness. On the other hand, among the protection motivation variables, perceived severity and response efficacy had a positive (+) effect on continued use intention, but perceived vulnerability was not significant; perceived ease of use had a positive (+) effect on perceived usefulness, and both perceived ease of use and usefulness had a positive (+) effect on the intention to continue use. Based on these results, this study presents practical implications for the future development and improvement of the HDA system, and suggests that future studies could provide more useful management and marketing implications if results are calculated, compared, and analyzed for potential consumers as well as experienced users.

M. Lee (B) · J. Kim Department of Business Administration, Soongsil University, Seoul, South Korea e-mail: [email protected] J. Kim e-mail: [email protected] J. So Department of IT Policy Management, Soongsil University, Seoul, South Korea © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_6


Keywords ADAS · HDA · System characteristics · User characteristics · Protective motive behavior · Perceived ease of use · Perceived usefulness · Intention to continue use

1 Introduction

BCG (Boston Consulting Group) predicted that the global autonomous vehicle market will reach about 42 billion dollars (about 50 trillion won) in 2025 and grow to 77 billion dollars (about 90 trillion won) by 2035 [1]. The core technology driving the growth of autonomous vehicles is the Advanced Driver Assistance System (ADAS), a system that supports safe and convenient driving during the autonomous driving process. Although the autonomous vehicle industry is growing, it is expected that for the time being new technologies will be applied on top of the conventional automobile industry rather than an abrupt shift to fully autonomous driving, so attention should be paid to the importance of these functions as the core foundation influencing the development of the industry [2]. In particular, ADAS functions are useful in preventing the fatigue caused by sustained concentration from leading to major accidents, by maintaining the distance from the car ahead and keeping the lane through lane recognition on highways where serious accidents occur, so attention needs to be paid to the development and evolution of the system. To date, more than 90% of road accidents have been due to driver error or mistake, and such driver error on the highway has serious, irreversible consequences, which makes its importance self-evident [3]. Despite this importance, domestic consumers' awareness of ADAS technology is very limited. According to the results of the '2019 Global Automotive Consumer Study' by the Deloitte Group, one of the world's leading consulting firms, in Korea only 41% of respondents in 2018 said they trusted self-driving cars, and in 2019 only 39% said they trusted them; the study suggested that the future acceptance rate is not bright unless there is a clear improvement in the safety, convenience, and experience of the autonomous driving system [4]. Therefore, based on this need, this study identifies the antecedent factors influencing the intention to continue using the ADAS-based HDA system, which currently plays a key role in the development of autonomous vehicles, and presents useful management and marketing implications based on the findings, thereby contributing to the smooth development and evolution of the autonomous vehicle industry.


2 Theoretical Background

2.1 Advanced Driver Assistance Systems (ADAS)

The Advanced Driver Assistance System (ADAS) is a set of systems and technologies for developing a vehicle capable of fully autonomous driving. Its core technologies consist of Adaptive Cruise Control (ACC), the Intelligent Parking Assist System (IPAS), the Lane Departure Warning System (LDWS), and Autonomous Emergency Braking (AEB) [5]. ACC adjusts the speed according to the traffic environment; IPAS increases parking convenience by detecting the parking position of the vehicle itself; LDWS warns the driver when the vehicle leaves the driving lane due to driver negligence; and AEB is an emergency automatic braking system that activates the brakes to prevent accidents such as collisions. The reason ADAS technology is particularly noteworthy is that its early warning functions can prevent almost all collisions [6]. 93% of traffic accidents are caused by human factors, that is, driver negligence, and 80% of these accidents arise from some type of negligence (phone use, drowsiness, etc.) within 3 s before the accident. The Swiss insurance company AXA claims that an early warning 1.5 s in advance can prevent 90% of rear-end collisions and that an early warning 2.0 s in advance can prevent almost all collisions, emphasizing the importance of the ADAS system functions [7]. ADAS is an important technology for moving from the autonomous vehicle era predicted for the near future into the era of fully autonomous vehicles. The main technologies of ADAS include BVM (rear-vehicle monitor), FCA (forward collision avoidance assistance), SEA (safe exit assistance), HDA (highway driving assistance), LFA (lane departure avoidance assistance), and NSCC (navigation-based smart cruise control). As reviewed above, ADAS includes safety technologies that actively engage in risk recognition as well as convenience-oriented technologies that support more comfortable driving, so it can be expected that both safety and convenience should be considered in selecting the variables that affect the intention to continue using ADAS [8] (Fig. 1).

2.2 Highway Driving Assistance System (HDA)

HDA is a system that assists drivers on the highway; it maintains and supplements the driving state by recognizing the distance to the vehicle ahead and the lane on the highway. HDA integrates the vehicle-to-vehicle distance control function (ASCC) that maintains the distance between vehicles, the lane-keeping function (LKAS) that helps prevent accidents caused by leaving the lane, and navigation information, so that the vehicle itself can maintain the inter-vehicle distance and the lane on the highway, corresponding to Level 2 autonomous driving [9].


Fig. 1 Correlation between crashes and early warnings [6]

As it is the function most closely related to the life of the driver, the HDA market is spurring the advancement of the technology. The currently released HDA2 automatically changes lanes when the turn signal is activated and adds a deceleration function when entering an IC or JC on a highway. In addition, in conjunction with NSCC (Navigation-Based Smart Cruise Control), the speed-reduction function at corners and the recognition speed for a vehicle changing lanes or cutting in are faster than in the existing HDA [10]. The number of traffic accident fatalities in Korea decreased by 35.6% over the past 10 years as of 2018, but the fact that Korea's highway accident rate is the highest among OECD countries calls for further technological development of this function. The key to the HDA system is that the best results are created through the interworking of each function as well as through safety and convenience [11].

2.3 HDA (Highway Driving Assist) User Characteristics

Variables influencing the adoption of new technologies such as HDA have consisted of innovativeness, involvement, and spontaneity according to studies such as Oh [12], Seddon and Min-Yen [13], and Venkatesh and Davis [14]. Innovativeness is a strong voluntary will to utilize new information technology, and individuals with such spontaneity are expected to accept new information technology or information systems more easily as well as to utilize them better. It can therefore be estimated that this trait has a close relationship with Bandura's self-efficacy [15]. On the other hand, as a more specific variable closely related to involvement, this study considered perceived playfulness, the pleasure experienced by users while using a computer or the Internet. Therefore, in this study, HDA user characteristics were composed of innovativeness, perceived playfulness, and self-efficacy.


2.4 Protection Motivation Theory (PMT)

Driving on the highway is closely related to the fear for one's safety, and Rogers proposed protection motivation theory to explain behavioral change in response to the fear appeals that individuals experience. Under the influence of cognitive information processing theory and expectancy-value theory, this theory hypothesizes that a factor causing severe fear triggers a cognitive mediating process, which in turn affects protection motivation and changes behavior [16]. Rogers suggested that the three main variables of protection motivation theory are the severity of the incident, the likelihood of one's exposure to the incident, and the effectiveness of the countermeasures against the threat. Accordingly, in this study the sub-variables of protection motivation theory were composed of perceived severity, recognizing the serious physical and financial damage caused by a traffic accident while driving on the highway; perceived vulnerability, the degree to which one feels that such accidents can be reduced through the HDA system; and response efficacy, recognizing the usefulness of HDA, a highway driving support system, in preventing accidents that occur while driving [16].

2.5 Technology Acceptance Model (TAM)

The model of this study is based on the technology acceptance model, established by Davis in 1985 to explain users' intention to use computers. The technology acceptance model assumes that attitudes are formed by the perceived usefulness and perceived ease of use of a new technology and that these attitudes lead to the intention to use it. That is, perceived usefulness and perceived ease of use act as mediators between the intention to use and the system characteristics and external variables assumed in this study. Perceived usefulness is the degree to which a user believes that using a specific system enhances his or her performance, and perceived ease of use refers to the degree to which a person believes that using the technology is free of effort [17] (Fig. 2).

Fig. 2 Technology acceptance model (TAM)


3 Research Method

3.1 Data Collection

This study established a research model and hypotheses with the aim of identifying the variables affecting the intention to continue using HDA, an ADAS-based highway driving assistance system, conducted a survey to test the hypotheses, and verified the results with statistical programs. The questionnaire was developed from prior research and administered both online and offline. The survey targeted a total of 450 men and women over 20 years of age who had experience of highway driving in a car equipped with an HDA system, and the valid responses were used for the final analysis. The collected data were analyzed with the statistical programs SPSS 18.0 and AMOS 22.0.

3.2 Research Model and Selection of Variables

In this study, the independent variables were composed of system characteristics, user characteristics, and protection motivation behavior; the mediating variables were set as perceived ease of use and perceived usefulness according to the technology acceptance model; and the dependent variable was the intention to continue use. In addition to the safety and convenience of the overall ADAS system, interoperability was added as an HDA system characteristic, considering that the HDA system produces the best results through the efficient interworking of the navigation, inter-vehicle distance, and emergency braking functions. Since HDA is a recent technology, the user characteristics were composed of innovativeness, perceived playfulness (which maximizes involvement), and self-efficacy. Protection motivation behavior consists of perceived severity, judging the severity of accidents on highways; perceived vulnerability, recognizing how much the HDA system prevents such accidents; and response efficacy, recognizing how effective HDA is in preventing accidents. System characteristics and user characteristics are therefore linked to continued use intention through the mediators of perceived ease of use and usefulness, while protection motivation theory, an individual's internal psychological account of risk, is independent of the technology acceptance model and was modeled as having a direct influence. The research model is shown in Fig. 3.

4 Data Result Analysis

In order to verify the causal relationship between the variables established in this study, the AMOS 22.0 program was used for analysis as shown in Fig. 4.


Fig. 3 Research model

Fig. 4 Path analysis

To test the hypotheses established in this study, path analysis was performed as above using AMOS 22.0. Hypothesis acceptance was judged on the basis of a critical ratio (C.R.) of at least ±1.96 and a significance level (p-value) below 0.05. The results of the research model hypothesis test are shown in Table 1. The results of the path analysis are summarized as follows. As a result of the hypothesis test, 12 out of 18 paths were accepted and 6 hypotheses were rejected.


Table 1 Hypothesis test result

H     Theory     Path coefficient   Std. error   C.R      P-Value     Adoption
1–1   PE ← CO    0.348              0.062        5.617    0.001***    Pass
1–2   PU ← CO    0.096              0.059        1.625    0.104       Fail
2–1   PE ← SA    0.003              0.043        0.069    0.945       Fail
2–2   PU ← SA    0.088              0.039        2.217    0.027       Pass
3–1   PE ← IA    0.088              0.039        2.244    0.025       Pass
3–2   PU ← IA    0.058              0.036        1.614    0.107       Fail
4–1   PE ← PI    −0.022             0.043        −0.501   0.616       Fail
4–2   PU ← PI    0.103              0.039        2.621    0.009       Pass
5–1   PE ← PP    0.079              0.038        2.056    0.04        Pass
5–2   PU ← PP    0.057              0.035        1.623    0.105       Fail
6–1   PE ← SE    0.514              0.069        7.424    0.001***    Pass
6–2   PU ← SE    0.231              0.07         3.307    0.001***    Pass
7     CU ← PS    0.087              0.043        2.02     0.043       Pass
8     CU ← PV    −0.024             0.025        −0.941   0.347       Fail
9     CU ← RE    0.086              0.043        1.979    0.048       Pass
10    PU ← PE    0.484              0.064        7.538    0.001***    Pass
11    CU ← PE    0.233              0.058        4.028    0.001***    Pass
12    CU ← PU    0.534              0.06         8.914    0.001***    Pass

*** p < 0.001. CO Convenience, SA Safety, IA Interoperability, PI Individual Innovation, PP Perceived Playfulness, SE Self-Efficacy, PS Perceived Severity, PV Perceived Vulnerability, RE Response Efficacy, PE Perceived Ease of Use, PU Perceived Usefulness, CU Intention to Continue Use

Convenience among the HDA system characteristics (hypothesis H1-1) had a significant positive (+) effect on perceived ease of use, but under hypothesis H1-2 it did not have a significant effect on perceived usefulness. Safety among the HDA system characteristics had no significant effect on perceived ease of use (hypothesis H2-1) but had a positive (+) effect on perceived usefulness (hypothesis H2-2). Interoperability among the HDA system characteristics had a positive (+) effect on perceived ease of use (hypothesis H3-1) but no significant effect on perceived usefulness (hypothesis H3-2). Among the HDA user characteristics, innovativeness did not have a significant effect on perceived ease of use (hypothesis H4-1) but had a positive (+) effect on perceived usefulness (hypothesis H4-2). Perceived playfulness among the user characteristics had a positive (+) effect on perceived ease of use (hypothesis H5-1) but no significant effect on perceived usefulness (hypothesis H5-2).


Among the user characteristics, self-efficacy had a positive (+) effect on perceived ease of use (hypothesis H6-1) and a positive (+) effect on perceived usefulness (hypothesis H6-2). Perceived severity (hypothesis H7) had a positive (+) effect on intention to continue use, perceived vulnerability (hypothesis H8) had no significant effect on intention to continue use, and response efficacy (hypothesis H9) had a positive (+) effect on intention to continue use. Perceived ease of use had a positive (+) effect on perceived usefulness (hypothesis H10), perceived ease of use also had a positive (+) effect on continued use intention (hypothesis H11), and perceived usefulness likewise had a positive (+) effect on the intention of continuous use (hypothesis H12).
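
The path estimates above were obtained with AMOS 22.0; purely as an illustration, the sketch below shows how an equivalent path model could be specified and estimated in Python, assuming the semopy package and a hypothetical CSV of per-respondent composite scores (`hda_survey_scores.csv`). Variable names follow the abbreviations in Table 1; this is not the authors' actual analysis script.

```python
# Hypothetical re-specification of the study's path model (the authors used
# AMOS 22.0, not Python). Columns of the assumed CSV hold composite scores
# named after the abbreviations in Table 1.
import pandas as pd
import semopy

MODEL_DESC = """
PE ~ CO + SA + IA + PI + PP + SE
PU ~ CO + SA + IA + PI + PP + SE + PE
CU ~ PE + PU + PS + PV + RE
"""

df = pd.read_csv("hda_survey_scores.csv")  # hypothetical data file

model = semopy.Model(MODEL_DESC)
model.fit(df)

# inspect() reports estimates, standard errors, z-values (the analogue of the
# AMOS critical ratio) and p-values; paths with |z| >= 1.96 and p < 0.05 would
# be accepted under the criterion used in this study.
print(model.inspect())
```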

5 Conclusions

This study focused on ADAS, a prerequisite key technology for the era of fully self-driving cars, and in particular on the HDA system, which is most closely related to human safety, in order to verify the effects of system characteristics, user characteristics, and protection motivation behavior on perceived ease of use, perceived usefulness, and the intention to continue use. The practical implications of the hypothesis test results are as follows.

First, convenience among the HDA system characteristics had a positive (+) effect on perceived ease of use but did not significantly affect perceived usefulness. This suggests that drivers who perceive the HDA system as convenient feel that it is not difficult and does not require much effort, but do not recognize that this convenience improves practical driving performance or guarantees safety on the highway. Future HDA system design can therefore increase both perceived ease of use and usefulness by focusing on a system that is easy to operate and at the same time secures practical safety.

Second, HDA system safety had no significant effect on perceived ease of use but had a positive (+) effect on perceived usefulness. This means that even when the currently provided HDA system is recognized as ensuring safety in highway driving, its operation is still perceived as difficult. Therefore, when the system focuses on practical safety-related performance while keeping operation at an appropriately manageable level, the likelihood of continued use can be maximized.

Third, interoperability among the HDA system characteristics had a positive (+) effect on perceived ease of use but did not significantly affect perceived usefulness. This reflects the perception that, although the integration of the functions constituting the HDA system makes operation easy, it does not contribute much to securing safety and convenience, which are the actual purposes of the system. Future systems should therefore focus on delivering performance that ensures optimal safety and comfortable driving in addition to easy interworking between functions.

Fourth, among the user characteristics, innovativeness had a positive (+) effect only on perceived usefulness.


These results suggest that drivers who are satisfied with the safe highway driving produced by the HDA system and who are inclined to actively adopt a new system still experience some difficulty in operating it; innovative groups, however, tend to accept a degree of difficulty. Therefore, if the difficulty of operating the system is appropriately tuned and its performance is maximized, the likelihood of continued use by the innovative group can be increased.

Fifth, among the user characteristics, perceived playfulness had a positive (+) effect only on perceived ease of use. This means that users who experience enjoyment and immersion with the HDA system operate it easily but do not perceive that it will actually improve their safety and driving performance. In the future, the system should therefore be designed to reflect elements of interest and enjoyment in its operation while emphasizing that this enjoyment is itself a process that further increases safety.

Self-efficacy among the user characteristics had a positive (+) effect on both perceived ease of use and usefulness. This indicates that the more confident drivers are in their ability to use the HDA system successfully, the easier they find the system to use and the more convinced they are that it secures practical highway safety. If the system therefore provides more detailed and effective usage guides and after-sales support that strengthen the user's sense of self-efficacy, perceptions of convenience and of actual performance can be improved at the same time.

Among the protection motivation behaviors, only perceived severity and response efficacy had a positive (+) effect on the intention to continue use, and perceived vulnerability was not significant. This means that the more HDA users perceive that highway accidents cause great damage to life and property, the more loyal they become to the system. The HDA industry therefore needs an effective message strategy concerning the severity of highway accidents, and somewhat dramatic visual and auditory messages are appropriate for conveying this risk. In addition, since response efficacy significantly affected the intention of continuous use, the user's perceived ease of use and usefulness can be secured when each function of the HDA system operates as expected. On the other hand, perceived vulnerability did not significantly affect the intention of continuous use, which means that even when the HDA system's contribution to preventing highway accidents is recognized as high, this alone does not strongly drive continued use. As the hypotheses above show, not only practical performance but also factors such as perceived playfulness and user innovativeness affect perceived usefulness and ease of use, so the intention to continue use can be further increased if fun and interesting elements and operation methods customized to user tendencies are taken into account.

Perceived ease of use had a positive (+) effect on perceived usefulness, and both perceived ease of use and usefulness had a positive (+) effect on the intention to continue use. This suggests that once people perceive that operating the HDA system is generally easy and convenient, this perception positively influences their perception of the system's actual performance.


Designers should therefore keep in mind the importance of ease of operation when improving performance to secure practical highway safety. At the same time, even highly innovative users experience some difficulty in operating the system, yet the HDA user group is largely innovative, and taking on a certain degree of challenge when using new devices gives them a sense of accomplishment; with this in mind, making operation excessively easy should be avoided. Lastly, since perceived ease of use and usefulness both have a positive (+) effect on the intention to continue use, companies with limited financial and human resources need not choose between convenience and performance through a strategy of selection and concentration alone: the intention to continue use is maximized when the two are balanced in a way that satisfies the driver. The results of this study can provide useful guidelines for future work verifying the intention to continue using the overall ADAS system, and if differences between consumer groups are identified, more useful management and marketing implications can be expected.

References

1. Oh, C.: The era of autonomous driving is fast approaching... The need to supplement the system and resolve the imbalance in support. Electric Newspaper. http://www.electimes.com/news/articleView.html?idxno=217610 (2021)
2. Son, J.C.: Future mobility-based, self-driving car commercialization trend. Information and Communication Planning and Evaluation Institute (2013)
3. Deloitte Anjin Group: Global Automotive Consumer Research Report (2019)
4. Seo, H.H.: A study on the intention to use autonomous vehicles—focusing on network externalities and financial considerations. Ph.D. thesis, Soongsil University (2018)
5. Transportation Science Research Institute: Technology trends related to autonomous driving such as ADAS. Global New Technology Trend Analysis News Letter (2016)
6. K-ADAS homepage. https://k-adas.co.kr/ADAS
7. Park, J.H.: A study on civil liability for autonomous vehicle accidents. Ph.D. thesis, Jeju National University Graduate School (2021)
8. HMG Journal: 10 charms of the 3rd generation K5 ADAS (2019)
9. Kumho Tire: What is Highway Driving Assist System HDA? https://blog.kumhotire.co.kr/1158 (2017)
10. Jo, J.: Lane change by yourself... Imminent release of Hyundai and Kia's 'HDA2'. CNET Korea. https://www.cnet.co.kr/view/?no=20190429160227 (2011)
11. Jeong, J.: Korea's traffic accident fatalities enter the second half-life… The fastest among OECD scoops. ZDNet Korea. https://zdnet.co.kr/view/?no=20210614104814 (2021)
12. Oh, H.Y.: The effect of relative advantage and risk perception on intention to use mobile simple payment service: focusing on the moderating effect of consumer innovation propensity. Finan. Consum. Res. 5(1), 33–64 (2015)
13. Seddon, P., Min-Yen, K.: A partial test and development of the DeLone and McLean model of IS success. In: International Conference on Information Systems, p. 99 (1994)
14. Venkatesh, V., Davis, F.D.: A theoretical extension of the technology acceptance model: four longitudinal field studies. Manage. Sci. 46(2), 186–204 (2000)
15. Bandura, A.: The explanatory and predictive scope of self-efficacy theory. J. Soc. Clin. Psychol. 4(3), 359–373 (1986)
16. Rogers, R.W.: A protection motivation theory of fear appeals and attitude change. J. Psychol. 91(1), 93–114 (1975)
17. Davis, F.D.: A technology acceptance model for empirically testing new end-user information systems: theory and results. Doctoral dissertation, Massachusetts Institute of Technology (1985)

Security Policy Deploying System for Zero Trust Environment Sung-Hwa Han

Abstract An internal user in the enterprise environment knows details of the information service, so it is easy for such a user to mount a malicious attack. When the zero trust model is adopted to address this problem, many systems must be managed at the same time. However, because the information service environments all differ, it is difficult to deploy the same security policy to many systems. In this study, to solve this problem, we propose a security policy deploying system for zero trust based environments. The proposed system consists of three components operating on two systems and uses an object-based security policy deployment method. To verify its effectiveness, we implemented software following the suggested architecture and verified the target function and performance. As a result, the architecture proposed in this study can deploy a security policy across different system environments, and because it consumes few resources, it is confirmed to be effective. However, since we focused only on deploying a single policy without considering the dependencies between security policies, additional research is needed to deploy complex security policies at the same time.

Keywords Security policy deploying · Different system environment · Zero trust model · Security architecture · Malicious attack

1 Introduction

Recently, various types of security threats have appeared. An external malicious attacker who directly accesses an information service performs a denial-of-service attack or obtains important information used in the enterprise environment. Alternatively, a malicious attacker may install a backdoor program for a later attack [1]. Through this, the malicious attacker collects detailed information about the service and then performs a more damaging attack.

S.-H. Han (B) Department of Information Security, Tongmyong University, Busan, Korea e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_7


However, malicious attacks do not occur only from the outside. The internal user system is a trusted subject on the internal network, so it is easy for internal users to access the information service and to obtain various information about it. Because of this advantage, more attacks in enterprise environments are now launched by internal users. An external malicious attacker also knows the advantage of the internal user's logical position, so an external attacker uses cross-platform techniques to deliver malware to the target system through the internal user system [2]. In addition, malicious attackers apply many protective techniques to their malware, so security techniques that detect malware have great difficulty detecting current malware [3]. Because of these problems in the enterprise security environment, the zero trust model has been proposed. In the zero trust model, any subject accessing a service resource is regarded as an attacker. The service resource may be an application service, but it may also be a system, file/directory, process, and so on. Only sufficiently verified subjects can access service resources. As more organizations recognize the advantages of the zero trust model, the number of cases of switching to a zero trust model based security environment is increasing [4]. A zero trust model based security architecture should protect all service resources in the enterprise environment, so many security techniques are applied to protect the various information services. An information service in today's enterprise environment is not composed of a single kind of system and application: one service can consist of many systems or many applications, and each system or application may be of a different type. The security manager must deploy a security policy that allows only limited access to such complex information services, which becomes harder as the information service grows more complex. In this study, we propose a security policy deploying system that can solve the security policy deployment problem arising in such zero trust based environments. The proposed system can deploy the same security policy to services composed of different systems and applications. By applying this system to an enterprise environment, many systems can enforce the same security policy, and with this security function the information service can realize the zero trust model in the enterprise environment. All service functions must be provided correctly, so the functions of the system proposed in this study must also be provided correctly. To this end, the proposed system is empirically implemented and its effectiveness is checked by verifying the targeted security function.


2 Related Work

2.1 Current Security Threat Trends

Many types of security threats have appeared recently, and many techniques are applied in those threats. The legacy security threat was an external malicious attacker attacking information services inside the enterprise environment; the attacker stopped the information service with a DoS (Denial of Service) attack or obtained important information saved in the system [5]. Various security techniques are applied in the enterprise environment: Firewall and UTM (Unified Threat Management) as well as IPS (Intrusion Prevention System) and IDS (Intrusion Detection System). It is therefore very difficult to attack the target system from the outside by bypassing these security techniques [6]. However, a malicious attacker on the inside finds it easy to attack: it is easy for internal users to obtain information such as the system OS, network, and application versions of the target service. In most enterprise environments, a perimeter based model is applied to increase the efficiency of the security techniques. In this perimeter based security model, the internal user is a trusted subject, so most security techniques do not enforce security functions against internal users; likewise, from the application service point of view, the internal user is trusted and the internal user's access is allowed [7]. In recent security threats, the rate of attacks generated by internal users is higher than that generated by external users [8]. The method in which a malicious attacker attacks directly has the disadvantage that the attacker's position can be easily identified. Malicious attackers are also aware of the internal user's advantage and try to exploit it: the attacker is physically outside, but logically attacks the target from the inside.

2.2 Current Malware Technologies

Malicious attackers use malware to take advantage of the internal user's position. A malicious attacker generates malware that can attack information on the target system and then distributes it to the internal user system. The first piece of malware gathers information about the target system. This malware must not be deleted by an IPS or anti-virus while it is being distributed, so the malicious attacker applies a protective technique to it, such as obfuscation or packing [9]. Next, a second piece of malware enters the internal system through the backdoor installed by the first and carries out the main attack.


Fig. 1 Obfuscation technique applied to malware

Obfuscation is a technique for protecting software ownership and software logic. There are various types of obfuscation, such as jumping code injection, renaming, redundant code injection, and replacement, as shown in Fig. 1. When an obfuscation technique is applied to malware, its logic is mixed so that it looks like normal software [10]. IPS, UTM, and anti-virus products all use malware signatures. A malware signature is generated by a malware analyst and identifies the features of the malware, so a security technique that uses malware signatures can detect malware by analyzing a file or process. However, malware to which obfuscation has been applied exhibits features different from the original malware signature, so an IPS or anti-virus cannot detect it. Another technique a malicious attacker can apply to malware is packing, which disables the anti-virus detection function [11]. As shown in Fig. 2, the packing technique compresses the malware. A simple compression can be decompressed by the anti-virus, and in the decompressed state the anti-virus can apply the malware signature and detect the malware. However, the packing technique distributes and compresses the execution logic of the malware; since the logic is distributed, the malware's features change and the anti-virus cannot detect the packed malware. When distributing malware to user systems, a cross-platform technique is used. Cross-platform is a characteristic of software that can operate on various platforms, but malicious attackers interpret it as malware that can operate on various platforms or as a technique for delivering malware to the target system. Malware to which the cross-platform technique is applied is therefore not detected until it has been delivered to the target system.
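
One widely used heuristic against packed samples is entropy analysis, in the spirit of the work cited in [11]: heavily packed or compressed executables tend to have near-random byte distributions. The sketch below is an illustration of that idea only, not part of the system proposed in this paper; the 7.0 bits/byte threshold and the file path are assumptions.

```python
# Illustrative Shannon-entropy check often used to flag packed executables
# (in the spirit of the entropy analysis cited in [11]); not part of this paper.
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())

def looks_packed(path: str, threshold: float = 7.0) -> bool:
    """Heuristically flag a file as packed/compressed if its entropy is high."""
    with open(path, "rb") as f:
        data = f.read()
    return shannon_entropy(data) >= threshold

if __name__ == "__main__":
    print(looks_packed("sample.bin"))  # hypothetical file path
```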


Fig. 2 Sequence running an executable packed using a UPX packing algorithm

2.3 Zero Trust Model

As described above, in an enterprise environment there are malicious attackers on the external side and malware generated by external attackers on the internal side, and it is difficult to determine which access is authorized in such an environment. The zero trust model was therefore proposed [12]. The zero trust model is a security model proposed by John Kindervag [13], who pointed out the limitations of the perimeter based security model chosen by many organizations. The perimeter based security model maximizes the efficiency of security techniques and minimizes the interference of security functions with the information service, so it focuses on monitoring the outside and minimizes security enforcement inside [14]. However, this approach is not suitable for the current environment: malicious attacks already occur on the internal side of the enterprise environment, and the perimeter based security model cannot deny such internal attacks [15]. In the zero trust model, by contrast, all access subjects are considered malicious attackers, so all access is denied by default and only the access of verified subjects is allowed. Verification here includes both user verification and user environment verification. In a zero trust model based security architecture, every subject accessing a service resource is verified: the user's identification and authentication are enforced, and the user's device OS, installed software and versions, network, and so on are checked. If the user environment is not safe enough, access is denied [16]. Users are assigned only the minimum privileges, and even allowed users are not trusted: the zero trust model monitors user behavior, and if an unauthorized behavior is attempted, it is denied immediately.
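
As a concrete illustration of the default-deny, least-privilege decision just described, the following sketch shows one possible shape of such an access check. It is an assumption for explanatory purposes only; the user table, posture fields, and privilege ordering are hypothetical, and this is not the mechanism of any particular zero trust product or of the system proposed later in this paper.

```python
# Minimal zero-trust style access decision: deny by default, allow only a
# verified user on a verified device, and grant only the minimum privilege.
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user_id: str
    authenticated: bool          # identification and authentication result
    device_os_patched: bool      # device posture: OS/software version check
    network_trusted: bool        # network environment check
    requested_privilege: str     # e.g. "read", "write", "admin"

# Hypothetical least-privilege table: the most each verified user may do.
MAX_PRIVILEGE = {"alice": "read", "bob": "write"}
PRIVILEGE_ORDER = ["read", "write", "admin"]

def decide(req: AccessRequest) -> bool:
    """Return True only when every verification step passes (default deny)."""
    if not (req.authenticated and req.device_os_patched and req.network_trusted):
        return False
    allowed = MAX_PRIVILEGE.get(req.user_id)
    if allowed is None:
        return False
    return PRIVILEGE_ORDER.index(req.requested_privilege) <= PRIVILEGE_ORDER.index(allowed)

print(decide(AccessRequest("alice", True, True, True, "write")))  # False: exceeds least privilege
```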


3 Current Security Environment Analysis

3.1 Security Environments Analysis

The structure of an information service is very complex. Even a single service involves various systems: each system has a different OS installed, the applications running on each OS may differ, and each system plays a different role. The server in charge of the online service is separated from the server in charge of the mobile service, and because there are various services, there are systems that distinguish them. There is, of course, a system that manages accounts, a system that provides the board service, a system dedicated to e-commerce, and a system that handles marketing [17]. It is very difficult for a security manager to manage the security of a service architecture in which so many systems are related. An appropriate security policy must be defined for each system, and the security application that deploys the security policy must be managed separately. A security policy consists of a subject, an object, and a privilege. Because each application is different, the subjects and objects constituting the security policy also differ, and because the operating environment of each system differs, the way each application works differs too; consequently, the privileges in the security policies all differ. The number of such complex services will keep increasing. Legacy services built long ago are still being provided while new services are frequently added, and the systems constituting a newly provided service must interoperate with the legacy systems. However, a newly built system installs the latest OS and the latest application versions, so its operating environment differs from that of the legacy systems.

3.2 Requirement for Zero Trust Model Based Security Environment

Security policies are also used in a zero trust model based security architecture. Applications running on a system can be managed only by defining the users, user devices, and user privileges that may access that system. However, if the systems that make up the information service are diverse and their environments all differ, the security manager's workload grows. Security management must protect complex information services while keeping the burden on the security manager to a minimum. Therefore, it should be possible to convert the security manager's management actions into a form suited to each system environment.


Fig. 3 Security policy deploying system architecture

4 Security Policy Deploying System

4.1 System Architecture

For the case in which a complex service is provided within a zero trust model based security architecture, this study proposes the security policy deploying system shown in Fig. 3, which can reduce the burden on the security manager in charge of the security policy protecting this service. Because the proposed system adopts a policy-object concept, a security policy can be deployed to different system environments. The architecture of the security policy deploying system consists of three components. The Security Managing Interface is used by the security manager to manage the security policy: through it, the security manager sets the subject, object, and privilege of the security policy, and the policy set here is transmitted to the Security Policy Server. The Security Policy Server stores the policy set by the security manager in the Policy Database. When the Security Policy Server saves the security policy, it converts it into an enforce policy and saves that as well. The Security Policy Server then delivers the generated enforce policy to the policy agent of each security technique.

4.2 Security Policy Deploying Process

The main function of the security policy deploying system proposed in this study is to deploy the same security policy to different system and application environments. The security policy deployment process is shown in Fig. 4.


Fig. 4 Security policy deploying process

The security manager generates the security policy using the Security Managing Interface. The generated security policy is delivered to the Security Policy Server and saved. The security policy set by the security manager contains the policy attributes of subject, object, and privilege. The Security Policy Server generates the enforce policy by converting these policy attributes into subjects, objects, and privileges that can be enforced on each system, and then delivers the generated enforce policy to the policy agent of the security technique. If the security manager wants to review the enforce policy that is being enforced by the security technique, he can do so by checking the Policy Database.
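
To make the conversion step more concrete, the sketch below renders one abstract policy (subject, object, privilege) into a host-firewall rule for each of the target operating systems used later in the verification. It is only an illustration under stated assumptions: the paper does not publish its actual conversion logic, and the rule templates and field mapping here are assumed for this example.

```python
# Illustrative enforce-policy conversion: one abstract policy is rendered into
# an OS-specific host firewall rule. The rule templates are assumptions for
# illustration only, not the paper's actual conversion rules.
from dataclasses import dataclass

@dataclass
class SecurityPolicy:
    subject_ip: str   # subject: the client allowed to access
    object_port: int  # object: the service port being protected
    privilege: str    # privilege: "allow" or "deny"

def to_enforce_policy(policy: SecurityPolicy, target_os: str) -> str:
    """Convert one abstract policy into an OS-specific enforce rule (sketch)."""
    if target_os == "windows10":
        action = "allow" if policy.privilege == "allow" else "block"
        return (f'netsh advfirewall firewall add rule name="ztp_{policy.object_port}" '
                f"dir=in action={action} protocol=TCP "
                f"localport={policy.object_port} remoteip={policy.subject_ip}")
    if target_os == "centos8":
        verb = "accept" if policy.privilege == "allow" else "reject"
        return ("firewall-cmd --permanent --add-rich-rule="
                f"'rule family=\"ipv4\" source address=\"{policy.subject_ip}\" "
                f"port port=\"{policy.object_port}\" protocol=\"tcp\" {verb}'")
    if target_os == "solaris11":
        verb = "pass" if policy.privilege == "allow" else "block"
        return (f"{verb} in quick proto tcp from {policy.subject_ip} "
                f"to any port = {policy.object_port}")
    raise ValueError(f"unsupported target OS: {target_os}")

policy = SecurityPolicy("10.0.0.5", 22, "allow")
for os_name in ("windows10", "centos8", "solaris11"):
    print(to_enforce_policy(policy, os_name))
```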

5 Implements

5.1 Verification Environments and Items

The hardware environment of the five systems is an i5-8500 CPU, 8 GByte of memory, and a 256 GByte SSD. The security managing interface system and the security policy server system run Redhat Enterprise Linux 8.4, and the security technique (the host firewall function) was verified on Windows 10, CentOS 8.4, and Solaris 11. The security policy deploying system proposed in this study deploys the security policy through several related unit functions. Therefore, the function and performance verification items of the proposed system are defined as shown in Table 1.

5.2 Verification Result

For Func_01, the security policy was generated normally and the attributes of the generated security policy could be checked in the policy database, so it was confirmed that the security managing interface operates normally, as shown in Fig. 5.


Table 1 Function and performance verify items

Verify ID   Description
Func_01     ● Check the creation and storage of security policy by the security managing interface
            ● If the security policy is normally created, you can check the attributes of the generated security policy in the policy database
Func_02     ● Check whether enforce policy is normally created by the security policy server
            ● If the enforce policy is normally created, you can check the attribute of the enforce policy to be enforced for each Windows 10, CentOS 8.4, and Solaris 11 OS in the policy database
Func_03     ● Check whether the applied security policy is normally applied in each system
            ● If the enforce policy is normally received, the enforce policy is correctly enforced in the security technique (host firewall)

Fig. 5 Func_01 verified result

As a result of verifying Func_02, the attributes of the enforce policy to be enforced for each of Windows 10, CentOS 8.4, and Solaris 11 were confirmed in the policy database. Therefore, it was confirmed that the enforce policy generating function of the security policy server operates normally, as shown in Fig. 6. Func_03 was then verified. As a result, it was confirmed that the firewall policy was normally enforced in each system, and therefore that the policy deploying function of the security policy server operates normally, as shown in Fig. 7. Since all of these unit functions were confirmed to operate normally, it was confirmed that the security policy deploying system proposed in this study correctly provides the targeted security function.

6 Conclusion

Due to IT convergence, the scope of information services will widen and the number of service types will increase. As the number of services in the enterprise environment increases, the security threats will also increase.


Fig. 6 Func_02 verified result

Fig. 7 Func_03 verified result

As the amount of malware protected by its own evasion techniques has increased, the attacker's position has become indistinguishable. The zero trust model proposed to solve this problem is evaluated as being able to enhance the security of the current enterprise environment. A zero trust model based security architecture must be able to enforce a security policy for complex information services, and in this study a security policy deploying system was proposed to meet this requirement. To check the effectiveness of the proposed system, it was implemented empirically and its functions were verified. As a result, it was confirmed that the security policy deploying system proposed in this study provides the targeted security function. This study focused on security policy deploying; at present, converting the actual enforce attributes for the security subject, object, and privilege according to each system takes a long time. As a next study, I will improve the performance of the system proposed here.

Acknowledgements This research was supported by the Tongmyong University Research Grants no. 2021A017.


References

1. Cardona-Rivera, R.E., Young, R.M.: Toward combining domain theory and recipes in plan recognition. In: Workshops at the Thirty-First AAAI Conference on Artificial Intelligence (2017)
2. Xu, P., Zhang, Y., Eckert, C., Zarras, A.: HawkEye: cross-platform malware detection with representation learning on graphs. In: International Conference on Artificial Neural Networks, Springer, Cham, pp. 127–138 (2021)
3. Zolkipli, M.F., Jantan, A.: Malware behavior analysis: learning and understanding current malware threats. In: 2010 Second International Conference on Network Applications, Protocols and Services, IEEE, pp. 218–221 (2010)
4. Kindervag, J.: No more chewy centers: the zero trust model of information security. Forrester Research Inc. (2016)
5. Cetinkaya, A., Ishii, H., Hayakawa, T.: An overview on denial-of-service attacks in control systems: attack models and security analyses. Entropy 21(2), 210 (2019)
6. Harvey, J., Kumar, S.: A survey of intelligent transportation systems security: challenges and solutions. In: 2020 IEEE 6th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS), IEEE, pp. 263–268 (2020)
7. Rapuzzi, R., Repetto, M.: Building situational awareness for network threats in fog/edge computing: emerging paradigms beyond the security perimeter model. Futur. Gener. Comput. Syst. 85, 235–249 (2018)
8. Gheyas, I.A., Abdallah, A.E.: Detection and prediction of insider threats to cyber security: a systematic literature review and meta-analysis. Big Data Analytics 1(1), 1–29 (2016)
9. Singh, J., Singh, J.: Challenge of malware analysis: malware obfuscation techniques. Int. J. Inf. Secur. Sci. 7(3), 100–110 (2018)
10. Ogiso, T., Sakabe, Y., Soshi, M., Miyaji, A.: Software obfuscation on a theoretical basis and its implementation. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 86(1), 176–186 (2003)
11. Bat-Erdene, M., Park, H., Li, H., Lee, H., Choi, M.S.: Entropy analysis to classify unknown packing algorithms for malware detection. Int. J. Inf. Secur. 16(3), 227–248 (2017)
12. Flanigan, J.: Zero trust network model. Tufts University, Medford, MA, USA (2018)
13. Kindervag, J., Balaouras, S., Hill, B., Mak, K.: Control and protect sensitive information in the era of big data. For Security & Risk Professionals, Technical Report (2012)
14. Kouvelas, A., Saeedmanesh, M., Geroliminis, N.: Enhancing model-based feedback perimeter control with data-driven online adaptive optimization. Transp. Res. Part B Methodological 96, 26–45 (2017)
15. Teerakanok, S., Uehara, T., Inomata, A.: Migrating to zero trust architecture: reviews and challenges. Secur. Commun. Netw. (2021)
16. de Weever, C., Andreou, M.: Zero trust network security model in containerized environments. University of Amsterdam, Amsterdam, The Netherlands (2020)
17. Chen, M., Zhang, D., Zhou, L.: Providing web services to mobile users: the architecture design of an m-service portal. Int. J. Mobile Commun. 3(1), 1–18 (2005)

TTY Session Audit Techniques for Linux Platform Sung-Hwa Han

Abstract Malicious attackers in an enterprise environment are more often internal than external, because it is easy for internal users to gather information about the service environment and to access the service. A malicious attacker who has access to the system can reach various kinds of information and leak or modify important information. For safe service operation, a behavior audit function is required that checks the users accessing the system in order to identify and deny unauthorized behavior. However, current operating systems provide only limited audit functions such as network or system resource monitoring and process listing. In this study, we propose a TTY session audit technique that can record the commands entered by a user accessing a Linux environment and the resulting messages. To verify the effectiveness of the proposed technique, its function and performance are checked after an empirical implementation. If the user behavior monitoring technique proposed in this study is applied to an enterprise environment, it can support safe service operation by monitoring user behavior. However, since this study focuses on TTY session auditing, the suggested technique cannot monitor background processing, and additional research on this is necessary.

Keywords TTY session · Audit · User access · Command line messages · Malicious attack · Background processing

1 Introduction

When operating an information service, the principle of least privilege should be applied, and only a limited number of system administrators should have access to the information system on which the service runs. A user who has access to the system must check the normal operation of the information service within the authorized authority.

This research was supported by the Tongmyong University Research Grants no. 2021A023.

S.-H. Han (B) Department of Information Security, Tongmyong University, Busan, South Korea e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_8


If the information service behaves abnormally, it should be corrected. The system manager's permission is assigned by setting the permissions of the system account and group account, and the system account or system group account executes system commands or accesses files within the set permissions. Since the information system manager is usually assigned only limited privileges, superuser privileges such as the root account are used when system-wide changes are required [1]. In recent security incidents, attacks by insiders are more common than attacks by outsiders. This is because it is easier for an insider to collect information about the service and, being physically inside, to access the target system while bypassing the security solutions. An internal user with malicious intent can access the target system, leak or modify important information, and install additional malware if necessary. Malware installed by malicious attackers usually acts as a backdoor, receiving commands from the outside and executing them [2]. For the safe operation of information services, the security of information systems must be strengthened. To this end, the operating system provides security techniques such as SELinux and AppArmor [3]. However, the application processes and user data files that make up a service are difficult to protect using SELinux or AppArmor, so other security techniques are needed. In this study, we propose a TTY session audit system that enables post-incident response by auditing the behavior of users who have accessed the information system. The proposed system saves to a log all input and output text that occurs in the TTY session of a user accessing the Linux operating system. The TTY session audit system must not only save all text generated in the TTY session as a log but also provide a function to search this log. In addition, the resources consumed in providing this security function must be small enough not to affect other user application services. Therefore, to confirm the effectiveness of this study, the function and performance are verified through an empirical implementation of the proposed system. If the proposed system works correctly, the system manager or security manager can detect unauthorized behavior by auditing the saved TTY session logs.

2 Related Work

2.1 Enterprise System Operation Process

An information service is designed and implemented according to the organization's strategy and then operated. The information service has its own purpose, and service quality must be achieved by satisfying the targeted function and performance. To this end, various standard models for information service provision have been defined.


A methodology for evaluating the maturity of the information service operation organization has also been presented [4]. The quality of an information service depends on the purpose of the service, and the quality evaluation methods and tools differ from service to service. Nevertheless, it is common for an information service manager or system manager to monitor the operating status of the information service and to correct any abnormal operation. Although the details depend on the operation tools in use, the manager must check the status of the application process that provides the information service, execute it when necessary, and check the status of related important files/directories. If the application process that provides the information service operates abnormally, the manager restarts the process or corrects the error [5]. Likewise, when an important file/directory needed by the application process is modified by malware, the manager restores it, and faults in the user data files/directories accessed by the application process are also corrected [6].

2.2 Security Threats That Can Occur in Service Management

The manager accesses the information system to manage the information service. The tools used for this are application service monitoring software, an Enterprise Security Management (ESM) tool, or a terminal tool that can access the system directly [7]. Application service monitoring software and ESM handle only limited data for checking the status of the application service, and the commands for processing that data are also limited, so using such software for application service monitoring can be considered safe. A terminal tool, however, is not limited in the data or commands it can handle: a terminal tool using a protocol such as Secure Shell (SSH) or Telnet can execute every command available on the system. Even though privilege restrictions are enforced on the manager account, many commands can still be executed because the OS provides SETUID-enabled commands [8]. Recently created malware exploits these features. The malware enters the target system through the manager's system using a cross-platform technique [9] and, once inside, has the same authority as the user of the TTY tool; with that user's privileges it can execute many system commands. Malware that obtains superuser privileges can access application processes or user data, abnormally terminate the application process, or leak important information to the outside.
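
To illustrate the point about SETUID commands expanding what a terminal user can effectively do, the short sketch below lists SETUID-root executables under a directory. It is a simple illustration, not part of the proposed audit system, and the search root is an arbitrary choice.

```python
# List SETUID-root executables: each such binary runs with root's effective
# privileges regardless of which account invokes it from the terminal.
import os
import stat

def find_setuid(root: str = "/usr/bin"):
    """Yield paths of regular files owned by root (uid 0) with the SETUID bit set."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # skip unreadable or vanished entries
            if stat.S_ISREG(st.st_mode) and (st.st_mode & stat.S_ISUID) and st.st_uid == 0:
                yield path

if __name__ == "__main__":
    for p in find_setuid():
        print(p)
```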


2.3 OS Protection Security Techniques

An OS is composed of a boot image that generates the kernel and of many system applications that provide user services. These system applications are very important because they act as the user interface. Most current operating systems enforce user permissions to protect the boot image file as well as the many files/directories that support the system applications [10]. However, these permissions are not suitable for protecting important system files/directories because the permission model is too simple [11]. Current operating systems therefore enforce separate security techniques to protect important system files/directories and system processes: the Windows operating system provides Group Policy Objects (GPO), and the Linux operating system provides SELinux or AppArmor [12]. All three security techniques enforce a pre-defined access control policy on user behavior, and if the object a user wants to access is registered in the access control policy, the security technique denies the access [13]. The access control policy registered in these security techniques is defined by the OS provider; it is stored in an OS policy file or the registry and is encrypted, so it is difficult to inspect its contents. When SELinux or AppArmor is applied in the enterprise environment, the manager can modify some access control policies, but the configuration process is complicated and difficult. GPO provides a GUI based management interface, but it too is difficult to use because considerable knowledge is required. As a result, the usage rate of these security techniques is not high [14].

3 Security Requirements for Information Service Operation

3.1 Security Environment Analysis of Current Information Service Operation

Most information services are implemented as application processes, and the information service is operated by adding user data to them; the service can operate only when both the application process and the user data are available. Therefore, for safe operation of an information service, not only the running application process but also the application files/directories needed to execute it must be protected, and so must the services and files/directories where user data is stored. Windows can protect the files/directories used to run an application, and the services and files/directories where user data is stored, using GPO. Current Linux, however, does not provide a technique that protects these application files/directories and user data. SELinux and AppArmor, which Linux provides, apply a pre-defined access control policy, and to modify this policy all the complex relations between application processes must be considered and configured.


It is therefore not appropriate to protect an application file/directory, process, or user data file through SELinux or AppArmor in a Linux environment. Recent malware can be executed in a shell environment. Such malware can leak important information by acquiring user or superuser rights, installing backdoor programs, or accessing user data files. Before malware enters the system it can be blocked by an Intrusion Prevention System (IPS), and if it bypasses the IPS it can be detected and deleted by the anti-virus running on the system. However, current malware is difficult for an IPS or anti-virus to detect because it applies techniques such as obfuscation and packing [15].

3.2 Security Requirements for Safe Information Service Operation

As discussed, it is very difficult in a Linux environment to deny unauthorized access by monitoring access to application files or user data files. The OS should provide a real-time access control function for application processes, files, and user data files, but such a function is not currently provided, and even if it were, setting a security policy would be difficult because of the relations between application processes and the system's file/directory access mechanism. Therefore, a post response method is required to protect the application process, files, and user data files. The post response method allows access to the application process, file, or user data file, but records the access in a log; the log is reviewed when necessary to identify the subject of unauthorized access and to support analysis of how it occurred. To apply this post response method to the enterprise environment, the requirements shown in Table 1 must be satisfied.

4 TTY Session Audit System for Linux Platform

4.1 TTY Session Audit System Architecture

In this study, we analyzed the requirements described in Table 1 and considered a security function that monitors all user access to the Linux system regardless of the access type. The target security function should be able to monitor and save all of the user's input/output messages, and the user should not be able to bypass this monitoring function. To meet these requirements, the TTY session audit system shown in Fig. 1 is proposed. The proposed system monitors and saves a user behavior log containing all input and output messages generated in the TTY sessions through which Linux can be accessed.


Table 1 Post response method requirement

Requirement: Based on monitoring policy
● Monitoring access to all application files/directories degrades system performance
● Therefore, only the access to important file/directory is monitored

Requirement: Consider all user access type
● There are various ways to access user file/directory
● Therefore, it is necessary to monitor all access tools, regardless of the tools the user uses

Requirement: Begin to end monitoring
● If monitoring performance is low, file/directory access may be missed
● It should provide sufficient monitoring capability to monitor all user file/directory access
● Monitoring must be enforced from the time the user logs in to the system until logout

Requirement: Input/Output status audit
● The user can execute various commands. Therefore, it is necessary to create and write a log of all commands executed by the user
● Also, the user uses system commands to check the system status or get system information. Therefore, the output text displayed by the user's command execution should also be written to the log

Requirement: Prevention of monitoring function bypass
● The file access monitoring function should always be enforced when a user logs into the system
● The user must not be able to bypass this file access monitoring function

Fig. 1 TTY session audit system architecture

By reviewing the stored user behavior log, the security manager can check every action of users accessing application files/directories or user data files. The TTY session audit system proposed in this study consists of four components. The Security Policy Manager is the security manager's interface for setting the monitoring policy for access to the system; through it, the security manager sets the user's IP address, user ID, and TTY protocol type.


The Monitoring Module monitors user behavior on the Linux system over the TTY protocol and saves the user's input and output text messages generated in the session as a log. The Monitoring Module is a library and does not operate independently: Linux provides a PAM interface through which additional functions can be enforced for TTY sessions, and this interface is used to activate the library that monitors a user's TTY session. The Audit Database is the storage that holds the user behavior logs generated by the Monitoring Module; each behavior log is a text file in which the user's input/output text is divided into time units and saved. The Audit Reviewer is the security manager's interface for auditing the user behavior logs stored in the Audit Database; with it, the security manager can query and review the logs.
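As a point of reference, mainstream Linux distributions ship a stock PAM session module, pam_tty_audit, that records TTY input through the kernel audit subsystem. The excerpt below is a hedged illustration of how a monitoring library of this kind is activated at login so that the user cannot skip it; it shows the stock module rather than the Monitoring Module proposed here, but the PAM-based activation mechanism is analogous.

```
# /etc/pam.d/sshd (illustrative excerpt, not part of the proposed system)
# Declaring the module as a "required" session module means it is enforced
# for every SSH login and cannot be bypassed by the user after login.
session    required    pam_tty_audit.so enable=*
```

In the proposed architecture, the same kind of PAM session hook is what satisfies the "prevention of monitoring function bypass" requirement in Table 1.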

4.2 TTY Session Audit System Usage Process
The operation sequence of the TTY session audit system proposed in this study is shown in Fig. 2. First, the security manager sets the monitoring policy using the Security Policy Manager; the policy is written to the Monitoring Module's policy file. When a user accesses the Linux system over the TTY protocol, the PAM interface duplicates the user's TTY session and forwards one copy to the Monitoring Module. The Monitoring Module gathers the text messages generated in this TTY session in time units and stores them in the Audit Database. To review a user's behavior log, the security manager runs the Audit Reviewer, which lists the TTY session logs stored in the Audit Database. When the security manager selects a TTY session log to review, the Audit Reviewer replays it according to its time units.
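To make this record-and-replay flow concrete, the following minimal Python sketch captures a shell session's text in time units and replays a stored log with its original timing. It is an illustration only: the log path, JSON-lines format, and function names are assumptions made for the example and do not describe the actual implementation of the Monitoring Module or Audit Reviewer.

```python
import json
import os
import pty
import sys
import time

LOG_PATH = "/var/log/tty_audit/session.jsonl"  # hypothetical audit-database location


def record_session(log_path: str = LOG_PATH) -> None:
    """Run a shell under a pseudo-terminal and log its output in time units."""
    log = open(log_path, "a", buffering=1)

    def master_read(fd: int) -> bytes:
        data = os.read(fd, 1024)  # text generated in the duplicated TTY session
        log.write(json.dumps({"t": time.time(),
                              "data": data.decode("utf-8", "replace")}) + "\n")
        return data  # forwarded unchanged to the user's real terminal

    # pty.spawn duplicates the terminal I/O: every chunk of session output
    # passes through master_read before it reaches the user's screen.
    pty.spawn(["/bin/bash"], master_read)
    log.close()


def replay_session(log_path: str = LOG_PATH) -> None:
    """Replay a stored behavior log according to its recorded time units."""
    previous = None
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)
            if previous is not None:
                time.sleep(min(entry["t"] - previous, 2.0))  # cap long pauses
            previous = entry["t"]
            sys.stdout.write(entry["data"])
            sys.stdout.flush()


if __name__ == "__main__":
    replay_session() if "--replay" in sys.argv else record_session()
```

In the terms of the proposed system, the recording role corresponds to the Monitoring Module, the replay role to the Audit Reviewer, and the flat log file stands in for the Audit Database.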

Fig. 2 TTY session audit system's security function provision process


Table 2 Function verification items

Func_01
● Verify that the security manager can use the Security Policy Manager to set the monitoring policy
● If the monitoring policy is registered normally, the user ID, the user system's IP address, and the TTY protocol type that make up the monitoring policy are stored in the policy file used by the Monitoring Module

Func_02
● When the user accesses the Linux system from a remote system using the TTY protocol, check whether the behavior log for the text generated in the TTY session is generated and saved normally
● If the behavior log is saved normally, the input/output text of the user's behavior is saved according to its time unit

Func_03
● Check that the security manager can use the Audit Reviewer to review the stored user behavior log
● If it can be reviewed normally, the user behavior is replayed over time

5 Implements
5.1 Verification Environment and Items
The TTY session audit system proposed in this study should correctly provide the targeted security function and should not interfere with the execution of other application processes. Therefore, after an empirical implementation of the proposed system, its function and performance were verified. The Linux system used for verification was equipped with an i5-8500 CPU, 8 GB of memory, and a 256 GB SSD; the remote user system that accesses the Linux system over the TTY protocol had the same specification. For the software environment, Red Hat Enterprise Linux 8.4 was used on both systems, and SSH2 and Telnet were chosen as the TTY protocols on the remote user system. The proposed TTY session audit system provides its security function, monitoring the TTY session and reviewing the user behavior log, through several cooperating unit functions. Therefore, the function verification items are defined as shown in Table 2.

5.2 Verification Result
As a result of verifying Func_01, it was confirmed that the user ID, the user system's IP address, and the TTY protocol type that make up the monitoring policy were correctly saved in the policy file used by the Monitoring Module, as shown in Fig. 3.


Fig. 3 Func_01 verification result

As shown in Fig. 4, verifying Func_02 confirmed that the behavior log for the text generated in the TTY session was correctly generated when the user accessed the Linux system from the remote system using the TTY protocol. As a result of verifying Func_03, as shown in Fig. 5, it was confirmed that when the security manager selects a user behavior log in the Audit Reviewer, the user behavior is replayed according to the selected log.
Fig. 4 Func_02 verification result

Fig. 5 Func_03 verification result


6 Conclusion
Security threats to information services occur more often internally than externally. It is best to prevent such behavior in advance, but it is difficult to enforce the necessary security techniques in a complex application operation environment. Therefore, a post-response approach should be applied to provide a safe information service. In this study, we proposed a TTY session audit system that can monitor user behavior for this post response. The proposed system records, as logs, the input/output text generated in every user TTY session accessing the Linux system, and the security manager can review the stored behavior logs. Functional verification confirmed that the proposed TTY session audit system correctly provides the targeted security function. This study focused on user behavior; however, malware uses background programs to perform various attacks, so we plan to conduct additional research on tracing background programs.
Acknowledgements This research was supported by the Tongmyong University Research Grants no. 2021A023.


A Study on the Role of Higher Education and Human Resources Development for Data Driven AI Society
Ji Hun Lim, Jeong Eun Seo, Jun Hyuk Choi, and Hun Yeong Kwon

Abstract Data-based AI technology is bringing many benefits to our society. However, artificial intelligence that uses vast amounts of data can also pose numerous risks. For this reason, ethical artificial intelligence is needed. In recent years, the international community has made efforts to enact artificial intelligence ethics principles, but concrete measures to implement them have been lacking. We were interested in higher education as a way to share and practice the ethical values that an artificial intelligence society should have. This paper investigates and analyzes the educational status of the higher education institutions that train AI experts. To this end, we surveyed the artificial intelligence ethics curricula of 28 universities in 10 countries, classified the education methods into three types, and analyzed their characteristics. Finally, this paper points out the problems of artificial intelligence ethics education in higher education and suggests the educational methods to be adopted in the future.
Keywords Data ethics · AI ethics · Professional ethics · Higher education · Human resources development

J. H. Lim · J. E. Seo · J. H. Choi · H. Y. Kwon (B) School of Cybersecurity, Korea University, Seoul, South Korea e-mail: [email protected] J. H. Lim e-mail: [email protected] J. E. Seo e-mail: [email protected] J. H. Choi e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_9


1 Introduction: The Necessity of Ethics in the Data Driven AI Society
AI research began after the first introduction of the electronic computer in the 1940s, based on the theoretical concepts formulated by the British scientist Alan Turing [1]. Soon after, four researchers proposed what became known as the "Dartmouth Proposal" in August 1955 [2], and the following year the 1956 Dartmouth Conference heralded the beginning of serious research in artificial intelligence. Although research in artificial intelligence has a history of more than 60 years, it was not until recent years that the general population began to experience AI technology in their daily lives [3]. AI has expanded its capability to collect, analyze, and utilize data with the development of data science and the advancement of ICT infrastructure, and as the data-based approach is actively adopted, products and services that incorporate AI technology are emerging in various areas. Many companies demonstrated their AI products and services at the world's biggest consumer electronics show, CES 2022, and emphasized that AI technology is revolutionizing the user experience as well as raising our ability to deal with social problems [4]. In other words, AI technology is not the end goal; rather, it is a tool for the prosperity of mankind, increasing the welfare of individuals and society and bringing new innovations [5, 6]. However, as artificial intelligence becomes common in our society, it presents various ethical challenges as well [7]. Considering the potential harm that AI can cause, we need the capacity to deal with the various ethical issues that such technology can bring in advance [8]. As Paul Goodman put it, "technology is a branch of moral philosophy" [9], and ethics has become more important for AI than ever. Early ethical studies on artificial intelligence addressed the ontological status of artificial intelligence, the attribution of responsibility for actions performed by artificial intelligence, and the problem of human existence posed by superintelligence [10, 11]. Recently, however, AI ethics has also been viewed from a political perspective [5] and is considered a way to conscientiously distribute profits to society [12]. As a result, artificial intelligence ethics is drawing attention as a major policy issue for governments around the world seeking to foster the artificial intelligence industry as a new technology industry [13]. Accordingly, various AI principles and ethical regulations have been published based on these studies since 2016. According to these ethical principles, AI must ensure accountability, fairness, and transparency. To realize this, we need to address implementation challenges such as data privacy, security, diversity, contestability, reliability, and auditability [7]. However, in realizing AI ethics or data ethics, the preparation of ethical principles does not necessarily guarantee the ethical values of our society. Ethical principles are abstract [14], and it has been pointed out that such principle-based theory is likely to produce moral routinization [15]. In the end, to realize AI ethics, it is necessary to think about how to apply AI ethics principles in the real world. The members of society who develop and use the technology called AI should be able to share and practice the ethical


values that an AI society should have. Education is useful in that respect. Education refers to the act of leading the learner from an existing natural state to an ideal state [16]. The role of education is more important than ever in realizing the ideal state of an ethical AI society. From this point of view, this paper aims to investigate and analyze the curricula of higher education institutions that foster AI professionals and to present the role education should play in realizing ethical principles.

2 Why Do We Need to Educate AI Developers on Ethics?
2.1 Target of AI Ethics Education and the Importance of Developer Education
Recent AI policies distinguish between developers, suppliers, and users [17]. A developer is a technical expert (or organization) that builds an AI system. A supplier has the power to make decisions about the use of the technology and provides a service. A user is a person who uses the service [18]. The ethics of AI should also be considered separately for each of these entities. Just as ethical issues can differ greatly depending on the perspective of producers, suppliers, and consumers in the existing logistics sector, there is a great difference in the perspectives of the subjects of each stage of the development, distribution, consumption, and disposal of information-based goods such as software, AI systems, and robotics [19]. Therefore, AI ethics education should be conducted separately for each target, as shown in Fig. 1. To build an ethical AI society, ethics education for all the subjects described above is necessary. In this paper, however, we focus on AI ethics education for developers. This is because AI operates on the basis of learning programmed by technology. Developers, who understand the nature of AI, should exercise ethical capabilities from the design stage of AI, in that the ethical knowledge of the people directly involved in the technology can lead to the ethical use of AI technology. Moreover, it is almost impossible to expect all users to behave morally perfectly through ethics education alone, and it is difficult to respond to the potential threats

Fig. 1 Subjects of artificial intelligence ethics education


posed by AI by leaving everything to consumers (users) [20]. As part of expert education, developer-centered artificial intelligence ethics education is therefore needed.

2.2 Meaning of Professional Ethics Education
Professional ethics education stems from the specialized theories, knowledge, and skills required to maintain the status and role of a professional with technical professionalism [21]. The crucial difference between professional and non-professional jobs is that professionals utilize skills supported by the rich knowledge that makes up a system called a "body of theory" [22]. In addition, the process by which a particular occupation is professionalized is carried out in stages, in the following order: (1) full-time professional activity, (2) the establishment of a school or equivalent institution for training, (3) the formation of associations, and (4) the enactment of a charter of ethics [23]. From this point of view, AI developers, whose work is based on specialized theory, knowledge, and technology, meet the requirements of professionals, and recent moves to establish AI training institutions and associations suggest that AI development is becoming a profession. Professional ethics is essential as a quality for professionals; its greatest significance is that it can act as a criterion for disciplining professional behavior even when not compulsory, and it can give ethical value to the way professionals exercise their expertise [24]. As professional ethics is particularly emphasized, professional education institutions provide such education for lawyers, doctors, soldiers, and engineers, and AI developers likewise need a professional understanding of, and sense of calling for, the technology that underpins their status and role.

2.3 Developer-Centered Ethics Education in Computer Science
Developers should share responsibility for the entire life of a product, from the initial stage of its development to its use and quality control management. They need a high degree of technical competency to minimize technical errors and risks from a safety and prevention point of view, and they must also develop the necessary ethical aptitude. Ethics education for such AI developers and experts can be approached as part of ethics education for engineering. To provide the knowledge and intellectual understanding needed to become a socially responsible engineer, engineering ethics education needs to promote an understanding of the ethical values involved in the influence that science and technology have on our society [25–28].


3 Current Status of AI Ethics Education in Higher Education
From an education perspective, a pressing question is how to ensure the knowledge and skills needed to develop and deploy AI systems that align with fundamental human principles and values and with our legal system, and that serve the common good. Such work needs to be supported by solid knowledge not only of the technical aspects of AI but also of its implications for law, ethics, the economy, and society [29]. Artificial intelligence ethics education can therefore be promoted in various ways; in particular, opening AI ethics classes in undergraduate or graduate courses at universities that train professional AI developers can be an effective method [30]. This study investigated the AI ethics education curricula of 28 universities in 10 countries and analyzed the purpose and method of artificial intelligence ethics education in higher education [31–57]. Table 1 classifies the educational methods of these universities into three types.

3.1 Type A: Technology-Based Ethics Convergence Education
This type is not knowledge-transfer education that simply instills mechanical behavior or a code of ethics; rather, as in the case of Harvard University, it provides ethical insight and problem-solving skills so that technical experts can respond fundamentally to the ethical issues that AI technology will bring about [58]. Under this approach, current computer science classes on AI, machine learning, operating systems, programming languages, and more include philosophical content. All curricula involving Embedded EthiCS are taught jointly by a computer science professor and a philosophy Ph.D. with experience in providing philosophical background [59]. This type of education has the advantage of allowing students to grasp what ethical elements each specific AI technology involves.

3.2 Type B: Social Issues and Norm Education of Artificial Intelligence
Many universities around the world have opened courses related to AI ethics to provide education on the social issues and norms of AI while striving to spread its value [60]. As early as 2014, Stanford University in the U.S. organized the course "AI—Philosophy, Ethics, and Impact", in which discussions on AI and ethics were conducted as part of the curriculum [33]. In this type of class, students learn the history and philosophy of AI and acquire knowledge of AI ethics, including its ethical, legal, social, and economic effects. This type of class is the easiest to find among the universities surveyed.


Table 1 The current status of artificial intelligence ethics education at universities

Type A: Technology-Based Ethics Convergence Education
USA: Harvard University, MIT
Canada: McGill University
Japan: The University of Tokyo
China: Peking University

Type B: Social Issues and Norm Education of Artificial Intelligence
USA: Stanford University, Carnegie Mellon University
UK: Oxford, Cambridge, Imperial College London, The University of Edinburgh
Switzerland: Swiss Federal Institute of Technology Lausanne (EPFL)
Israel: Hebrew University of Jerusalem (HUJI)
Australia: The University of Melbourne, The University of Queensland, The University of Sydney
China: Chinese Academy of Sciences
Japan: Nihon, Tokyo Institute of Technology, Hosei
South Korea: Korea University, Kyunghee University, Yonsei University, Seoul National University of Science and Technology, Sungkyunkwan University

Type C: Field-Oriented Practical Training
Canada: University of Toronto
Singapore: National University of Singapore
Japan: Waseda University

Type B classes are often conducted as a single course by philosophy or law experts, and they serve to give AI developers basic ethical knowledge and an awareness of social norms.


3.3 Type C: Field-Oriented Practical Training
As AI technology is applied in more and more fields of our society, a growing number of companies are using it. Waseda University in Japan conducts practical classes on AI ethics and data security to foster the data science talent needed at industrial sites [61]. Through these classes, students have an opportunity to experience ethical situations that arise in the actual working environment of AI companies and to consider dilemma situations that cannot be experienced in ordinary classes. The University of Toronto in Canada and NUS in Singapore also run courses that allow artificial intelligence experts to acquire the ethical practices needed in the field [39, 48]. Field-oriented practical education is a necessary educational method in that it emphasizes practice by artificial intelligence developers and experts, beyond simply acquiring knowledge theoretically in the classroom. However, this type of education cannot be conducted by an educational institution alone; it requires cooperation from private companies, governments, and research institutes.

3.4 Analysis Results: Imbalance in Education Types
Comparing the three education types, as shown in Fig. 2, Type B education was the most common worldwide. This is because it is the most conventional form of ethics education: the class starts from the perspective of philosophy and ethics and deals with the overall concept of AI ethics in a single course. Many schools open such classes because the curriculum is less burdensome to operate and students can gain basic awareness of AI's ethical issues, related laws, and ethical principles. However, knowledge-transfer education that simply instills mechanical behavior or ethical norms has its limits. In contrast, technology-based ethics convergence education is a learning method that reviews the ethical issues that may arise in using each technology within technical classes such as programming languages, machine learning, and computing hardware, and prepares technical means to secure the reliability of AI. This method requires a deep understanding of technology as well as philosophy or law, so engineering instructors and humanities and social science instructors must cooperate, and because this mode of operation is required across various computer science classes, opening such classes is burdensome. As a result, only five universities adopt classes of this type. Finally, field-oriented practical education is conducted through industry-academia cooperation between universities and private enterprises. In these classes, students leave the classroom and acquire practical ethical knowledge from experts at private companies that actually use AI. Waseda University conducts this educational method, but it has the limitation that the class is non-regular rather than regular and is not a required class for students in AI-related departments.


Fig. 2 Percentage of AI ethics education types

4 How Should Universities Educate AI Ethics?
As can be seen from the examples above, the goal of ethics education for computer science experts, specifically AI developers, is to provide the knowledge and understanding that strengthen the ethical capability needed to become a socially responsible technical expert. The social responsibilities of technical experts are to: (1) realize the complicated relationship between technological advances and social welfare, (2) understand the responsibilities of the expert and the rules for fulfilling them, and (3) learn to make mature decisions and deal with ethical issues when they are raised. Figure 3 shows the process of this ethics education.

Fig. 3 Design of AI ethics curriculum


First, based on the specific technical expertise of AI, developers should identify the ethical issues the technology can cause and be trained to implement trustworthy technology; in other words, they should strive to solve AI ethics problems technically. In parallel, developers need to be trained to design ethical norms based on an understanding of society's ethical norms. Ethics for the AI developer is most akin to a professional code of conduct. Such a code depends on autonomous regulation and mutual checks and balances within the expert groups most capable of taking responsibility for adopting technology in the most ethical manner, as well as predicting the consequences of such adoption [62]. The AI developer community also needs such voluntary regulation based on a code of ethical conduct. Rather than blindly following ideal regulations made by governments or private companies, we need to strengthen the capability of AI developers to take the matter into their own hands and develop their own regulations and guidelines. Finally, ethical dilemma education is important. Ethics comprises three elements: cognition, emotion, and behavior. While these elements may seem separate, they are only so in concept; all three come into play together and are interconnected, so the goal of ethics education must address all three as they culminate in ethical behavior. In other words, even someone who has learned the technically and socially normative right behavior for AI ethics must have the ability to translate it into actual behavior. Therefore, AI ethics education needs to train computer scientists to make ethical decisions and put them into action. Recently, AI technology has been integrating closely into business models, requiring developers to balance commercial demands against their own motivations [63]. The dilemma discussion methodology may help resolve this issue. It is based on the theory of cognitive development developed by Piaget [64] and Kohlberg [65] and comprises the following stages: (1) recognition of the ethical problem, (2) a potential position statement, (3) moral reasoning and discussion, and (4) discussion of the position statement [66]. Through such training, AI developers must internalize, through constant ethical reflection, the ability to deal with the countless moral dilemmas they will face in the field. To encompass all of these educational methods, higher education institutions should provide various types of classes so that students can develop a range of ethical competencies, rather than adopting only one of the Types A, B, and C discussed above.

5 Conclusions
With the wide adoption of AI technology in our society today, together with the convenience it brings, we also face various ethical and moral dilemmas associated with it. It should be noted that ethics is a practical ideology: in today's fast-paced and ever-changing society, it is difficult if not impossible to adhere to a set of absolute or a priori values [67]. This may be the reason why there


is much criticism of the lack of concrete action behind the countless ethical principles announced in the global community today. In the end, the realization of an ethical AI society will only be possible when all the constituencies of that society fully understand the true nature of the ethical values those principles purport to advance and act out their own parts. Most importantly, we need ethical domain experts who can analyze social changes, recognize moral issues, and execute ethical solutions to resolve them. Because highly specialized experts such as AI developers hold a practical monopoly on the knowledge and technical operation of their domain, it is extremely difficult for the public to discover and control their unethical conduct. Therefore, AI developers need a degree of ethics and morality proportional to their expertise and to the autonomous freedom they enjoy in their domain. Also, considering that the AI machine learning process is based on the designs of the AI developer, the ethical capability of the developer is all the more important. Seen from this perspective, the authors argue that the importance of institutions such as universities in nurturing computer science experts such as AI developers has become all the greater. The university can act as an incubator that promotes the convergence of diverse IT technologies and provides sustainability for action. Furthermore, it has the social responsibility [68] to identify the issues we face today and act upon them. The university should no longer simply train capable developers to code well; it should also enable its trainees to take responsibility for the issues that their creations can raise in the AI society. To achieve this goal, we need to develop AI ethics education that can be put into action, rather than simply transferring ethical knowledge, through curricula such as (1) technical education, (2) ethical norms education, and (3) practical education.

References 1. Turing, A.M.: Computing machinery and intelligence. Mind 59(236), 433–460 (1950) 2. McCarthy, J., Minsky, M.L., Rochester, N., Shannon, C.E.: A proposal for the Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Mag. 27(4), 12–14 (2006) 3. Kim, M.J.: The necessity of artificial intelligence ethics and international trends. J. Korean Inst. Commun. Sci. 34(10), 45–54 (2017) 4. CES. https://www.ces.tech/Topics/Robotics-Machine-Intelligence/Artificial-Intelligence.aspx 5. High-level expert group on artificial intelligence: ethics guidelines for trustworthy AI (2019) 6. Villani, C., Bonnet, Y., Rondepierre, B.: For a Meaningful Artificial Intelligence Towards a French and European Strategy (2018) 7. Lim, J.H., Kwon, H.Y.: A study on the modeling of major factors for the principles of AI ethics. In: DG.O2021: The 22nd Annual International Conference on Digital Government Research, pp. 208–218 (2021) 8. Leslie, D.: Understanding artificial intelligence ethics and safety: a guide for the responsible design and implementation of AI systems in the public sector (2019) 9. Martin, M.W., Schinzinger, R.: Ethics in Engineering. McGraw-Hill Companies Inc., New York (2004)


10. Samuel, A.L.: Some moral and technical consequences of automation a refutation. Science 132(3429), 741–742 (1960) 11. Wiener, N.: Some moral and technical consequences of automation. Science 131(3410), 1355– 1358 (1960) 12. Swan, M.: Is technological unemployment real? An assessment and a plea for abundance economics. In: LaGrandeur, K., Hughes, J.J. (eds.) Surviving the Machine Age, pp. 19–33. Springer International Publishing, New York (2017) 13. West, S.M., Whittaker, M., Crawford, K.: Discriminating Systems, pp. 1–33. AI Now (2019) 14. Lee, D.I.: Practical Ethics Theology. Catholic Archdiocese of Seoul Catholic Publishing House, Seoul (2003) 15. Cambell, C.S.: Experience and moral life: a phenomenological approach to bioethics. In: Dubose, E.R. Hamel, R.P., O’Connell, L.J. (eds.) A Matter of Principles? Ferment in U.S. Bioethics. Trinity Press International, Valley Forge, PA (1994) 16. Yoon, G.G., Jeong, S.C.: The essence of human nature and education. In: Introduction to Pedagogy. Taeyoung Publishing, Seoul (2014) 17. European Parliament: Draft report with recommendations to the commission on a framework of ethical aspects of artificial intelligence, robotics and related technologies (2020/2012)(INL) (2020) 18. Stahl, B.C.: Artificial Intelligence for a Better Future: An Ecosystem Perspective on the Ethics of AI and Emerging Digital Technologies. Springer Nature, Switzerland (2021) 19. Lim, S.S.: Moral education in the age of artificial intelligence: from the perspective of consumer ethics. J. Korean Ethics Stud. 117, 89–116 (2017) 20. Reich, R.: Embedded EthiCS for a Better Tomorrow. https://blog.ncsoft.com/ai-frameworkep06-210617/ 21. Sullivan, W., Benner, P.: Challenges to professionalism: work integrity and the call to renew and strengthen the social contract of the professions. Am. J. Crit. Care 14(1), 78–84 (2005) 22. Caplow, T.: The Sociology of Work. University of Minnesota Press, Minneapolis (1954) 23. Wilensky, H.L.: The professionalization of everyone? Am. J. Sociol. 70(2), 137–158 (1964) 24. Kim, M.R., Yoon, S.P., Kwon, H.Y.: The role of professional ethics and the direction of ethical standards in the age of artificial intelligence. Kookmin Law Rev. 32(3), 9–53 (2020) 25. Harris, C.E.: The good engineer: giving virtue its due in engineering ethics. Sci. Eng. Ethics 14(2), 153–164 (2008) 26. Pritchard, M.S.: Professional responsibility: focusing on the exemplary. Sci. Eng. Ethics 4(2), 215–233 (1998) 27. Crawford-Brown, D.J.: Virtue as the basis of engineering ethics. Sci. Eng. Ethics 3(4), 481–489 (1997) 28. Herkert, J.R.: Future directions in engineering ethics research: microethics, macroethics and the role of professional societies. Sci. Eng. Ethics 7(3), 403–414 (2001) 29. Dignum, V.: The role and challenges of education for responsible AI. Lond. Rev. Educ. 19(1), 1–11 (2021) 30. Lee, S.G.: Improvement tasks for ethical use of artificial intelligence. Issues Perspect. 1759, 1–4 (2020) 31. Harvard University. https://embeddedethics.seas.harvard.edu/ 32. MIT Electrical Engineering & Computer Science Department. https://www.eecs.mit.edu/aca demics/undergraduate-programs/curriculum/ 33. Stanford University. https://web.stanford.edu/class/cs122/ 34. Carnegie Mellon University. https://www.cs.cmu.edu/bs-in-artificial-intelligence/curriculum 35. University of Oxford. https://www.conted.ox.ac.uk/courses/artificial-intelligence-ethics? code=O21C008V5Y#teaching_container 36. University of Cambridge. 
https://www.ice.cam.ac.uk/course/mst-artificial-intelligence-ethicsand-society 37. Imperial College London. https://www.imperial.ac.uk/computing/current-students/courses/ 70052/ 38. The University of Edinburgh. http://www.drps.ed.ac.uk/17-18/dpt/cxphil10167.htm


39. University of Toronto. https://ethics.utoronto.ca/ethics-of-artificial-intelligence-in-context-eth 1000y/ 40. McGill University. https://www.mcgill.ca/study/2021-2022/courses/ecse-557 41. Swiss Federal Institute of Technology Lausanne (EPFL). https://edu.epfl.ch/coursebook/en/ the-ethics-and-law-of-artificial-intelligence-HUM-392 42. Hebrew University of Jerusalem (HUJI). https://csrcl.huji.ac.il/event/ai-law-and-policy-dayI 43. The University of Melbourne. https://handbook.unimelb.edu.au/2021/subjects/comp90087 44. The University of Queensland (UQ). https://my.uq.edu.au/programs-courses/course.html?cou rse_code=BSAN7210 45. The University of Sydney (USYD). https://www.sydney.edu.au/courses/units-of-study/2022/ phil/phil2680.html 46. University of China Academy of Science. https://bii.ia.ac.cn/peai 47. Peking University (PKU). https://elective.pku.edu.cn/elective2008/edu/pku/stu/elective/contro ller/courseDetail/getCourseDetail.do?kclx=BK&course_seq_no=BZ2021104834240_11656 48. National University of Singapore (NUS). https://www.ntu.edu.sg/scse/admissions/progra mmes/graduate-programmes/certificate-in-ai-ethics-and-governance 49. Nihon University College of Science and Technology. https://www.cst.nihon-u.ac.jp/syllabus/ 2020/course/1/P24K/8000/index.html 50. The University of Tokyo Center for Research and Development of Higher Education. https:// catalog.he.u-tokyo.ac.jp/detail?code=4890-1047&year=2020&interface_language=en 51. Tokyo Institute of Technology. https://www.titech.ac.jp/news/2020/047163 52. Hosei University. https://onl.sc/vFRtS2x 53. Korea University. https://info.korea.ac.kr/info/under/ai_course.do 54. Department of Computer Science and Engineering, KyungHee University. https://com.khu.ac. kr/ce/user/contents/view.do?menuNo=1600080 55. Department of Artificial Intelligence, College of Artificial Intelligence Convergence, Yonsei University. https://ai.yonsei.ac.kr/eng/eng3_1_a.php 56. Seoul National University of Science and Technology. https://aai.seoultech.ac.kr/curriculum/ curriculum/ 57. Department of Applied Artificial Intelligence, Sungkyunkwan University. https://skb.skku.edu/ skkuaai/curriculum.do#a 58. Grosz, B.J., Grant, D.G., Vredenburgh, K., Behrends, J., Hu, L., Simmons, A., Waldo, J.: Embedded EthiCS: integrating ethics across CS education. Commun. ACM 62(8), 54–61 (2019) 59. Harvard. https://embeddedethics.seas.harvard.edu/about 60. Kim, J.M.: Artificial intelligence ethics issues and curriculum trends. SW-Centered Soc. 7, 38–44 (2019) 61. PR TIMES: Seminar on the utilization of AI and data science from a data ethics and security perspective. https://prtimes.jp/main/html/rd/p/000000126.000053429.html 62. Davis, M.: Thinking like an engineer: the place of a code of ethics in the practice of a profession. Philos. Public Aff. 20(2), 150–167 (1991) 63. Wayner, P.: 12 ethical dilemmas gnawing at developers today. https://www.infoworld.com/art icle/2607452/12-ethical-dilemmas-gnawing-at-developers-today.html 64. Piaget, J.: The Moral Judgment of the Child (Gabain, M., Trans.). Free Press, New York (1965) 65. Kohlberg, L.: The development of modes of moral thinking and choice in the year ten to sixteen. Ph.D. dissertation, University of Chicago (1958) 66. Galbraith, R.E., Jones, T.M.: Moral Reasoning: A Teaching Handbook for Adapting Kohlberg to the Classroom. Greenhaven Press, New York (1976) 67. Dewey, J.: The Quest for Certainty: A Study of the Relation of Knowledge and Action, pp. 8–9. Minton, Balch & Company, New York (1929) 68. 
Jongbloed, B., Enders, J., Salerno, C.: Higher education and its communities: interconnections, interdependencies and a research agenda. High. Educ. 56(3), 303–324 (2008)

The Way Forward for Security Vulnerability Disclosure Policy: Comparative Analysis of US, EU, and Netherlands
Yoon Sang Pil

Abstract If someone who finds a vulnerability does not disclose it, the vulnerability may never be revealed or may be exploited by other malicious users. Therefore, it is necessary to disclose vulnerabilities to enhance security. Some countries are already operating a vulnerability disclosure policy (VDP). As a case study, this paper analyzed the cases of the US, EU, and Netherlands to derive key elements of the VDP and to propose tasks to be considered when operating and improving the VDP in the future. The main implications of this study are that: (1) it is important to protect security researchers from criminal or civil liability, (2) the VDP should be spread across the public and private sectors, and (3) known vulnerabilities should be managed, with transparency and accountability ensured for the managing entity.
Keywords Vulnerability · Security vulnerability · Vulnerability disclosure · Vulnerability disclosure policy (VDP) · Security researcher · White hacker

1 Introduction
A program cannot be perfect because programmers are human beings; a program is bound to contain errors or defects. Among these errors or defects, the factors that may breach security become security vulnerabilities. Vulnerabilities cannot be revealed unless they cause specific risks or the person who finds them discloses them. Therefore, it is reasonable to assume that most vulnerabilities have not been discovered yet. In this respect, it can be said that vulnerabilities can be weakened by many eyeballs [1]. With the more diverse and creative approaches of white hackers, it is easier to find and remove vulnerabilities. In this regard, disclosing vulnerabilities is an essential policy for a reliable digital environment. According to a survey by the U.S. National Telecommunications and Information Administration (NTIA), more than 54% of companies could save marketing and development costs for software products and services by implementing and managing vulnerability disclosure policies [2].
Y. S. Pil (B) School of Cybersecurity, Korea University, Seoul, South Korea e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_10


Originally, the vulnerability disclosure policy (hereinafter "VDP") was a practice that began in the private sector. Hackers often uploaded the vulnerabilities they found to their online communities and discussed solutions as research activities. Later, some tech companies such as Microsoft and Google stepped up to implement VDPs. The VDP is now required for public purposes at the national level. The international community has also officially noted the necessity of the VDP for reporting and disclosing vulnerabilities and has begun to present policy directions. In February 2021, the Working Party on Security in the Committee on Digital Economy Policy of the OECD published a VDP recommendation report [3]. It is also notable that the United States (US) and the European Union (EU) are taking the lead in institutionalizing the VDP, and the Netherlands is already cited by the EU as a model example of VDP implementation [4]. Therefore, this paper aims to identify the common direction of the VDP in major countries and to propose key tasks for managing and improving the VDP. To do this, the study first clarifies the background and significance of the VDP and then, as a case study, compares and analyzes the current situation in the US, EU, and Netherlands, which have been responding actively.

2 Understanding Vulnerability Disclosure Policy
2.1 Cultural Aspects
The VDP requires the cooperation of security researchers and is based on the values of openness and sharing. This culture resembles the practices of the early professional community in the period when computer and Internet technology were developing. Hacking was originally a term for play or pranks, and it carried the meaning of opening up computers and information to ensure free access to them [5]. Hackers shared various hacking techniques through their communities; when they found security vulnerabilities, they shared them, discussed them together, and informed the companies concerned. It was also a hacker's honor to receive some sort of badge or credit from the company [6]. Hacking was even seen as the act of finding aesthetic elements and beauty in a perfect program, something that could free the mind or soul [7]. Social motivation was also an important factor in the hacker community: strong social motivations, such as academic exchange and discussion, were formed by sharing passion and competing with fellow hackers [8]. Furthermore, hacking keeps authority and control in check, because the culture itself pursues liberal political values [9]. It is important to understand that computers and hacking were originally approached from the perspective of pranks, curiosity, and research. Since then, as computers and the Internet have become part of daily life and gained value, cybercrime has also increased, but those original values still remain. Many hackers in fact believe that hacking serves a useful purpose by finding security flaws and vulnerabilities [10]. Therefore, it is important to strengthen and support the good-faith hacker community and its culture.


2.2 Technological Aspects
A vulnerability only manifests itself through an external approach. In other words, while defects in automobiles directly affect stable operation, defects in software do not lead to vulnerabilities unless they are found, and even if a defect leads to a vulnerability, it does nothing unless there is a direct command or manipulation that exploits it in an attack. These attributes can be described as passive vulnerability. Ultimately, vulnerabilities require external access, whether they are used for attack or defense. To find and fix vulnerabilities, researchers must approach them in the same way as those who want to attack them; in the end, the difference between the attacker and the defender lies solely in intention. In this respect, intentions and actions aimed at strengthening security should be actively encouraged. Once a security vulnerability has been discovered, it should be recognized that the risk can be reduced by making it known in some way. This is because there is no way to know whether someone else has already found the vulnerability or is already using it as a zero-day vulnerability. A vulnerability that exists but is unknown to the defender leaves the defender effectively without countermeasures: we cannot know whether someone is exploiting it, which means we cannot recognize whether our products or services are being exploited. The vulnerability may not have been discovered by anyone else, but since we cannot know this either, any way of increasing the probability of being informed that a zero-day vulnerability exists helps to enhance security. Otherwise, some zero-day vulnerabilities can become forever-day vulnerabilities. Conversely, if vulnerabilities can be made known through research and disclosure, this increases the possibility of turning zero-day vulnerabilities into N-day vulnerabilities. Once disclosed, there is a risk until the patch is developed, but after N days the vulnerability is removed, the patch spreads, and the overall risk can be eliminated or lowered [3]. Therefore, disclosing and researching security vulnerabilities is crucial in that it can ultimately improve security [11]. The effectiveness and efficiency of vulnerability discovery and analysis procedures are enhanced when more white hackers participate and probe the system through creative methods [12].

2.3 Environmental Aspects
With the development of the Internet of Things (IoT), cybersecurity risk is also increasing. Gartner predicted that the IoT market will grow significantly, reaching about 5.8 billion IoT endpoints [13]. The problem is that most IoT devices being manufactured are sold on the market without security in mind. Typical home router software, for example, is about four to five years old and contains rudimentary vulnerabilities, with patches unavailable or, where available, only partially applicable [14]. Symantec predicted that, as communication technology develops, IoT devices connected to 5G networks will become more vulnerable to direct attacks. In


particular, IoT-based security threats are expected to go beyond Distributed Denial of Service (DDoS) attacks, and threats that abuse poorly secured home Wi-Fi routers or private users' devices in various ways are expected to increase [15]. It is therefore worth noting that vulnerable devices will increasingly be connected to each other, which means that the possibility of vulnerabilities spanning devices may increase. In addition, the growing number of ways in which cybercriminals can reach the physical space connected through the IoT network is an important issue from a security perspective. Along with this, another environmental factor is regulation. There are concerns that finding vulnerabilities in good faith may be treated as cybercrime, even though white hackers and security researchers need to actively discover and report vulnerabilities. In fact, security researchers are always at risk of becoming involved in legal disputes: about 60% of security experts are concerned about legal liability when reporting vulnerabilities [16]. Therefore, the practice of unconditionally treating and punishing the act of finding security vulnerabilities as a crime fails to take into account the social context and environmental changes [17].

3 Vulnerability Disclosure Policy of Major Countries
3.1 United States
The U.S. manages the VDP from the perspective of IoT cybersecurity. The U.S. House of Representatives proposed the IoT Cybersecurity Improvement Act of 2020, and the bill became law on December 4, 2020. The VDP is stipulated in Article 5. In accordance with Article 5(a), the National Institute of Standards and Technology (NIST) shall prepare guidelines for reporting, coordinating, and publishing vulnerability information on federal agencies' information systems and for receiving such information from IoT device manufacturers. Under Article 6(a), the director of the Office of Management and Budget (OMB) shall, in consultation with the secretary of the Department of Homeland Security (DHS), develop and oversee policies, standards, and guidelines for resolving vulnerabilities of IoT devices and information systems, and the secretary of the DHS shall provide operational and technical assistance in implementing such guidelines. Specifically, the secretary may, in consultation with the director of the OMB, implement and supervise the information security policies and practices of federal agency information systems, except for certain national security matters. The secretary may also issue binding operational directives (BOD) requiring agencies to prepare and implement policies, principles, standards, and guidelines for security enhancement and risk mitigation, and may develop and operate coordinated vulnerability disclosure policies (hereinafter "CVD") in coordination with business and other stakeholders. In September 2020, the Cybersecurity and


Infrastructure Security Agency (CISA) of the DHS established a VDP manual (BOD 20-01) [18]. Under the manual, the VDP aims to reduce risks to agencies, infrastructure, and the public by protecting vulnerability reporters, who can help defend the public interest, and by allowing time to remove vulnerabilities before disclosure. Accordingly, all federal agencies are required to publish the contact information of security personnel and post the VDP on the agency's website. In addition, the scope of the VDP must be expanded by at least one Internet-connected system or service every 90 days, and by September 2022 federal agencies must include all systems and services within the scope of the VDP. The VDP process should be easier than tweeting about the vulnerability, requiring that vulnerability reports be received through web forms or dedicated web applications. In January 2022, CISA selected Bugcrowd and EnDyna as the operating platforms for federal agencies' VDPs; as a result, the first channel in the U.S. through which citizens can report vulnerabilities in federal agencies has been further expanded [19]. In May 2022, the Department of Justice (DOJ) announced a new policy under which people would not be charged under the Computer Fraud and Abuse Act (CFAA) for acts aimed at researching security in good faith [20]. Therefore, when the CFAA is to be applied, federal prosecutors should comply with this policy. According to the policy, good-faith security research means accessing a computer for the purpose of good-faith testing, investigation, and removal of security defects or vulnerabilities, in a manner that does not cause harm to stakeholders or third parties, where the information obtained is used primarily to improve security. If the evidence confirms that the act was for the purpose of such good-faith security research and was performed accordingly, the prosecutor shall not prosecute.

3.2 European Union
On 12 March 2019, the European Parliament adopted the Cybersecurity Act. Recital no. 30 highlights the opportunity for organizations to identify and mitigate vulnerabilities before they are disclosed to the public by providing structured procedures for reporting them to the owners of information systems or services. In addition, the Act strengthened the functions and powers of the European Union Agency for Cybersecurity (ENISA) by making it a permanent institution and having it support the security policies of member states. Article 6 stipulates the functions of ENISA, including support for the establishment of voluntary VDPs by member states. Article 54 regulates the elements of the European cybersecurity certification framework and requires that rules on the reporting and management of previously unknown security vulnerabilities in ICT products, services, and processes be included in certification schemes. Article 55 requires the manufacturers or providers of certified ICT products and services to provide and disclose security information: the security support period, the manufacturer's contact details, a channel for receiving vulnerability reports from users and security researchers, a list of disclosed security vulnerabilities, and the supervisory agencies.


Also, through its 2020 impact assessment, the E.U. concluded that new requirements should be institutionalized for cross-border cooperation and information sharing by competent agencies and private actors, and it recommends a legal basis for CVD at the E.U. level. Specifically, it asks member states to develop a common policy framework for CVD and requires national CSIRTs to function as coordinators, suggesting that ENISA act as the agency that registers newly discovered vulnerabilities [21]. In 2022, ENISA investigated the current status of CVD policy in the E.U. member states and recommended future improvement tasks. According to the report, the E.U. should first be able to grant waivers to security researchers and, where necessary, provide a safe harbour by amending criminal laws. Furthermore, ENISA emphasized that security researchers should be recognized and protected with the status of whistleblowers and requested that the role of ethical hackers be defined [22].

3.3 Netherlands
The Netherlands is considered the leading country in operating the CVD recommended at the E.U. level. The National Cyber Security Center (NCSC) publicly requests security vulnerability reports concerning government agencies and has prepared web pages and procedures to receive them [23]. The NCSC asks that security vulnerabilities found in government systems or websites be reported to the center, and it does not disclose them to the public until measures to mitigate the vulnerabilities have been taken. In 2017, the NCSC published a guideline for CVD so that agencies and companies can refer to it when implementing their own policies [24]. According to the guideline, organizations should operate a CVD to enhance security, and security researchers who report vulnerabilities under the CVD should be exempted from legal charges. The policy should state that malicious acts such as uploading malicious code, extorting the authorities, or leaking data are prohibited. A person who reports in accordance with the guideline is presumed to act in good faith, and it should be made clear that an investigation can still be carried out if the specified procedure is not followed or if intentionally malicious code is confirmed by internal analysis. Unless that is the case, however, the guideline states that no civil or criminal responsibility will attach to the act of finding and reporting vulnerabilities, allowing security researchers to conduct research in good faith. It also requires that information about the reporter be kept confidential and that the organization communicate with the reporter from time to time about the report and related follow-up measures. Once mitigating measures for the reported vulnerabilities are completed, the reporter is allowed to disclose the vulnerabilities freely. In addition, if appropriate, compensation can be paid in consideration of the severity of the vulnerability, the quality of the report, and so on. The most important success factor of the Netherlands model is its bottom-up approach. The Netherlands basically believes that it is difficult to require companies


The Netherlands basically believes that it is difficult to require companies or institutions to implement CVD unless good-faith relations with the individuals who report vulnerabilities are maintained. Therefore, it is more important to recognize the role of security researchers and white hackers acting in good faith and to form an ecosystem that protects them from unreasonable legal responsibilities [25]. This culture was born from a court decision in 2008 [26]. Researchers at Radboud University investigated the security of smart cards for public transportation and confirmed that they could find security vulnerabilities in NXP's MIFARE chips used in the card readers. They could use public transportation without limit and enter government buildings without authorization. Accordingly, the researchers proposed a six-month patch period to NXP and the Dutch government before their academic presentation. However, NXP sought a restraining order to prevent the presentation, claiming that disclosure of the related information could be dangerous considering its intellectual property rights. The court emphasized that the public interest served by freedom of expression was more important and that, considering the increasing use of electronic products, the researchers' intention to inform society of the problem in a free and transparent way was crucial. In addition, the court held that the harm that might occur is caused not by the act of revealing the vulnerabilities identified through research, but inherently by NXP, which produced the defective chips. With this precedent, since 2008 Dutch security researchers have relied on the decision to argue that it is possible to disclose vulnerabilities in a responsible and coordinated manner and that doing so is actually desirable for enhancing security. Since then, as hacking incidents increased, the Dutch government published the first CVD guideline through the NCSC in 2013 [27].

4 Key Challenges for Development of Vulnerability Disclosure Policy

4.1 Comparative Analysis

From a comprehensive analysis of the above cases, this study derives the following implications. First, VDP is recognized and introduced at the national level. The U.S. is taking the lead in introducing VDP at the federal government level and making efforts to spread it. In particular, it is characterized by the mandatory implementation of VDP by federal agencies and contractors through the IoT Cybersecurity Act of 2020. Although there is no direct legal basis, the E.U. indirectly requires member states and companies to implement VDP by establishing cybersecurity certification schemes through the EU Cybersecurity Act and including VDP in those schemes. It also directs ENISA to support the implementation of VDP and to comprehensively manage the related data, and discussions are currently underway to legislate CVD. In the Netherlands, VDP is implemented autonomously, without an institutional basis, as social awareness has changed through the cases described above. In all three cases, guidelines have been published to serve as references for implementing VDP.


Table 1 Comparison of vulnerability disclosure policy systems

Category | United States | European Union | Netherlands
Legal base of VDP | IoT Cybersecurity Act (2020) | EU Cybersecurity Act §54 1(m) (2019) | –
Mandatory VDP | Only for federal agencies and contractors | Member states and providers of ICT products, services, and processes | –
VDP guideline | DHS CISA Binding Operational Directive 20-01 (2020) | Recommendations from ENISA: Coordinated Vulnerability Disclosure Policies in the EU (2022) | NCSC Coordinated Vulnerability Disclosure: The Guideline (2017)
Safe harbour to researchers | DOJ Justice Manual 9-48.000 (2022) | Provided by each organization's CVD (amendment or enactment recommended where necessary) | Provided by each organization's CVD

Regarding liability issues, the U.S. has announced a legal manual on the CFAA at the DOJ. The E.U. and the Netherlands basically specify the safe harbour when designing a VDP so that each organization can implement it autonomously (Table 1).

4.2 Protection of Security Researchers

Considering global trends, vulnerability research in good faith should be explicitly allowed and such a culture should be created. Just as IT companies and large global manufacturers are establishing and disclosing procedures that allow vulnerability research, VDP with exemption provisions should be prepared across the public and private sectors. It is necessary to break away from the rather simple and one-sided security view that external intrusion into products or services must be prevented perfectly; flexibility and resilience should be secured beyond this rigid concept. Fundamentally, the effectiveness and efficiency of vulnerability discovery and analysis increase when more white hackers participate and probe the system through various creative methods [12]. In other words, the basic way to identify and remove security vulnerabilities is to have more security experts look for them, because the more security researchers participate in vulnerability detection processes such as VDP and bug bounties, the more bugs and security vulnerabilities can be found [28]. After all, the best way to quickly detect, identify, and remove security vulnerabilities is verification by many people. In this regard, white hackers are strong allies whom our society must actively accept and support.


Therefore, the approach of punishing every such act unconditionally fails to take the social context into account [17]. Considering the specific intention or means of the act, good-faith acts should be encouraged. Accordingly, in order to legalize vulnerability research, it is necessary to protect security researchers from legal charges such as copyright infringement or computer crimes and to find ways to allow the research and reporting of vulnerabilities. In fact, 4% of security researchers were found to keep vulnerabilities to themselves, even when they find them, in order not to get involved in legal problems [29]. In addition, 60% of researchers said that they are afraid of becoming criminals if they announce security vulnerabilities found through research, even though they started with good intentions, and that they need a policy that ensures a waiver from the related legal charges.

4.3 Expansion of Vulnerability Disclosure Policy

In the near future, VDP should spread to all public and private sectors. Regarding the current situation, the proportions of industries collaborating with hackers are as follows: 59% for Internet and online service providers, 47% for finance, 43% for computer software, 41% for distribution and e-commerce, 37% for media and entertainment, 32% for education, 31% for government programs, and 25% for telecommunications. In particular, consumer product manufacturers stand at 20% and medical technology companies at 16% [30]. If digital technology is applied broadly, vulnerabilities should be disclosed in most areas; considering the coming spread of IoT devices, the consumer product manufacturing sector, which currently accounts for only 20%, should increasingly operate VDP and cooperate with white hackers. Governments have already started to actively utilize VDP by creating guidelines or enacting laws and applying them to government agencies. It is now time to consider policies that can spread VDP in the private sector. In this regard, from an economic viewpoint, the bug bounty system is a more voluntary and active implementation of VDP, and its driver is the 'deal'. Bug bounty can be said to be the most representative vulnerability marketplace in the legal domain, that is, the white market [31]. The sooner vulnerabilities are disclosed, the higher the security of the entire society; therefore, creating a competitive environment that provides appropriate compensation when vulnerabilities are identified can be the most effective way to respond to cyber security vulnerability risks [32]. In the long term, it is desirable for all companies to implement VDP voluntarily, as in the case of the Netherlands. In this case, it is necessary for the government and public institutions to support small and medium-sized enterprises, or to link them with bug bounty platforms. VDP should also be linked to policies that improve society's overall awareness of cybersecurity and hackers. An education system to train security experts such as white hackers at the national level, basic cybersecurity education and publicity for ordinary citizens, and VDP education and promotion for companies should be implemented together. Accordingly, it is necessary to spread awareness of security overall, improve the culture, and implement a voluntary


VDP. In addition, policies that select, award, and promote some best cases that have improved security through voluntary VDP need to be combined to further spread the VDP.

4.4 Systematic Management of Known Vulnerabilities

The threat of known vulnerabilities should also be considered. According to Trend Micro's 2021 security forecast report, not only zero-day vulnerabilities but also known and unpatched vulnerabilities, so-called n-day vulnerabilities, are expected to pose a greater threat [33]. In fact, along with the presence of vulnerabilities in most products and services, another big problem is that even when vulnerabilities are identified and known, patches are not applied. One study shows that it took an average of 67 days to remove vulnerabilities, and 36 days for serious vulnerabilities; if a vulnerability was not addressed within 30 days of discovery, no action had usually been taken even after 90 days, and in the case of serious vulnerabilities, about 65% remained after 90 days if they were not resolved within 30 days [34]. Another study found that 60% of security incidents occurred because of unpatched known vulnerabilities. Moreover, 88% of respondents said it would take an average of 12 days to apply patches even after they were distributed, due to coordination issues that had to be discussed with other departments of the organization, and 72% said it was difficult to prioritize the items that needed patches [35]. In this regard, vulnerabilities reported and disclosed through VDP should be systematically managed so that attacks using known vulnerabilities can be responded to efficiently. Who should manage the vulnerabilities is also an important question, and transparency must be guaranteed if the government is in charge. The U.S. has traditionally operated a National Vulnerability Database through DHS, and the E.U. gave this responsibility to ENISA. It may also be possible to operate a vulnerability database by forming a specialized institution through public and private cooperation. Most importantly, in any case, there is a need for a policy that ensures the transparency and accountability of the agency that manages the vulnerability database; obligations to report to the legislature on a regular basis and to publish white papers can be considered.

5 Conclusions

This study has investigated and compared the cases of the U.S., the E.U., and the Netherlands as major jurisdictions that are actively implementing VDP. In a nutshell, the best form of operating VDP is based on autonomy, because operating a VDP autonomously means that there is a high awareness of security and that the organization has an active will to solve the problem of security vulnerabilities.


However, it is difficult to expect such a level of autonomy in the current situation. Accordingly, the U.S. and the E.U. are trying to spread VDP by law, and the Netherlands is implementing VDP well as a special case. This is because the Dutch court viewed vulnerability research in good faith and pointed out that the vulnerability problem is created not by such research or disclosure but essentially by defects in the software. Based on this case, the culture around security vulnerability research has become positive in the Netherlands. In the future, VDP should spread to all areas. To this end, this study proposed protecting security researchers from legal charges. In addition, considering the IoT environment, VDP should be implemented most actively in the product manufacturing field, and to activate VDP, mechanisms such as bug bounties, which provide appropriate compensation, should be utilized systematically. When operating a VDP, the operator must also establish a system that allows security researchers to make contact and report. Specifically, it is necessary to accurately identify the assets of the systems and services that the operator runs and to clearly present the scope to which the operator wants to apply the VDP. This can serve as an important standard for determining legal responsibility in the event of a problem, because the service provider or operator determines the allowable range of authority in advance. For reported vulnerabilities, internal capabilities to substantially verify and patch them should be provided, and the reporter should be notified whether the vulnerability has been patched while communication with the reporter continues from time to time. Through these processes, it will be possible to establish a trustworthy relationship between organizations and security researchers and to activate VDP. Furthermore, since it is important to respond to known vulnerabilities, a policy that can systematically manage vulnerabilities is needed. A vulnerability database can be operated, and in this case, a policy that ensures the transparency and accountability of the managing entity should be designed; obligations to report to the legislature, disclose statistics, and publish white papers can be considered. VDP is essential to strengthen the voluntary security of all related parties, including the government and companies. Along with digital transformation, the impact of cyberspace is growing, and as malicious approaches increase, threats and damage will inevitably rise. It should be possible to strengthen security while simultaneously providing institutional improvement and support, and as a basis, white hackers and security researchers should be supported so that vulnerability research is activated.

References 1. Raymond, E.S.: The Cathedral and the Bazaar, p. 7 (1997) 2. U.S. NTIA: Vulnerability Disclosure Attitudes and Actions, p. 10 (2016) 3. OECD: Encouraging Vulnerability Treatment: Responsible Management, Handling and Disclosure of Vulnerabilities, DSTI/CDEP/SDE(2020)3/FINAL, p. 17 (2021) 4. Schaake, M., Pupillo, L., Ferreira, A., Varisco, G.: Software Vulnerability Disclosure in Europe: Technology, Policies and Legal Challenges, p. 13. Center for European Policy Studies (2018)


5. Levy, S.: Hackers: Heroes of the Computer Revolution. O’Reilly Media, Inc., Sebastopol, California (2010) 6. Microsoft: Microsoft thanks the following people for reporting this issue to us and working with us to protect customers; Li0n of A3 Security Consulting Co., Ltd. (http://www.a3sc.co.kr) for reporting the out of process privilege elevation vulnerability. Microsoft Security Bulletin MS02-062—Moderate (2002). Access: 10 July 2022. URL: https://docs.microsoft.com/en-us/ security-updates/SecurityBulletins/2002/ms02-062 7. Sterling, B.: The Hacker Crackdown: Law and Disorder on the Electronic Frontier, p. 53. Bantam Books, New York (1992) 8. Himanen, P., Torvalds, L., Castells, M.: The Hacker Ethic: A Radical Approach to the Philosophy of Business, p. 51. Random House, New York (2001) 9. Powell, A.: Hacking in the public interest: authority, legitimacy, means, and ends. New Media Soc. 18(4), 60 (2016) 10. Leeson, P.T., Coyne, C.J.: The economics of computer hacking. J. Law Econ. Policy 1(2), 530 (2005) 11. Swire, P.: A model for when disclosure helps security: what is different about computer and network security? J. Telecommun. High Technol. Law 3(1), 206 (2004) 12. Zhao, M., Grossklags, J., Chen, K.: An exploratory study of white hat behaviors in a web vulnerability disclosure program. In: Proceedings of the 2014 ACM Workshop on Security Information Workers (SIW’14), p. 51. ACM (2014) 13. Gartner: Gartner Says 5.8 Billion Enterprise and Automotive IoT Endpoints Will Be in Use in 2020, 29 Aug. 2019. https://www.gartner.com/en/newsroom/press-releases/2019-08-29-gar tner-says-5-8-billion-enterprise-and-automotive-io 14. Schneier, B.: The Internet of Things is Wildly Insecure-and Often Unpatchable. WIRED, 2014.01.06. Access: 11 July 2022. URL: https://www.wired.com/2014/01/theres-no-goodway-to-patch-the-internet-of-things-and-thats-a-huge-problem/ 15. Symantec: Security Forecasts to Watch for 2019 (2019). Access: 21 June 2022. URL: https:// www.symantec.com/connect/blogs/2019-0 16. U.S. NTIA: Vulnerability Disclosure Attitudes and Actions, p. 6 (2016) 17. Katyal, N.K.: Deterrence’s difficulty. Mich. Law Rev. 95(8), 2445 (1997) 18. U.S. CISA: Develop and Publish a Vulnerability Disclosure Policy, Binding Operational Directive 20-01 (2020) 19. U.S. CISA: CISA Announced New Vulnerability Disclosure Policy Platform (2022) 20. U.S. DOJ, 9-48.000: Computer Fraud and Abuse Act, Justice Manual 9-48.000 (2022) 21. European Commission: Impact Assessment Report Part 1/3, p. 66 (2020) 22. ENISA: Coordinated Vulnerability Disclosure Policies in the EU, pp. 74–75 (2022) 23. Netherlands NCSC: Coordinated Vulnerability Disclosure. Access: 16 May 2022. URL: https:// www.ncsc.nl/english/security 24. Netherlands NCSC: Coordinated Vulnerability Disclosure: The Guideline (2017) 25. ENISA: Coordinated Vulnerability Disclosure Policies in the EU, p. 59 (2022) 26. NXP BV v.: Radbound University (2008) 27. Stevens, Y., Tran, S., Atkinson, R., Andrey, S.: See something, say something: coordinating the disclosure of security vulnerabilities in Canada. In: Cybersecure Policy Exchange, pp. 7–8 (2021) 28. Maillart, T., Zhao, M., Grossklegs, J., Chuang, J.: Given enough eyeballs, all bugs are shallow? Revisiting Eric Raymod with bug bounty programs. J. Cybersecur. 3(2), 87 (2017) 29. U.S. NTIA: Vulnerability Disclosure Attitudes and Actions, p. 5 (2016) 30. HackerOne: The 2021 Hacker Report, p. 6 (2021) 31. Algarni, A.M., Malaiya, Y.K.: Software vulnerability markets: discoverers and buyers. Int. J. Comput. Inf. Sci. 
Eng. 8(3), 72 (2014) 32. Cavusoglu, H., Cavusoglu, H., Raghunathan, S.: Efficiency of vulnerability disclosure mechanisms to disseminate vulnerability knowledge. IEEE Trans. Software Eng. 33(3), 183 (2007) 33. Trend Micro: Trend Micro Security Predictions for 2021, p. 12 (2020)


34. Contrast Security: 2020 Application Security Observability Report, p. 21 (2020) 35. Ponemon Institute for ServiceNow: Cost and Consequences of Gaps in Vulnerability Response (2019)

Study on Government Data Governance Framework: Based on the National Data Strategy in the US, the UK, Australia, and Japan Jeong Eun Seo and Hun Yeong Kwon

Abstract Most government agencies today have a perception that data is essential. However, creating a culture that encourages public servants to perceive data as an asset and make data-driven decisions is challenging. Data governance helps reduce the cost of data management and create value from data. However, data is often dispersed across many organizations and is stored and utilized under different data policies, which can lead to accountability issues, poor data quality, and weaker data-driven economic growth. A government data governance framework is one of the solutions to this problem, but there is a lack of discussion of a national data governance framework. Therefore, this paper analyzes the NDS of the US, the UK, Australia, and Japan based on the DGF of the DGI to derive the essential considerations in formulating national data strategies, and then suggests the components of a Government Data Governance Framework. These components are essential elements to be discussed in the establishment of an NDS, and this paper's results can help establish a new NDS or modify an established one. Keywords Data governance · Data governance framework · National Data Strategy · Government data governance framework · DGI

1 Introduction Most government agencies today have a perception that data is essential. However, creating a culture that encourages public servants to perceive data as an asset and make data-driven decisions is challenging [1]. Data governance helps reduce the cost of data management and create value from the data. However, data is often dispersed across many organizations with different data policies in place, stored, and utilized. J. E. Seo · H. Y. Kwon (B) School of Cybersecurity, Korea University, Seoul, South Korea e-mail: [email protected] J. E. Seo e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_11


It can lead to accountability issues and poor data quality, a red flag for economic growth based on data utilization [2]. A government data governance framework is one of the solutions to this problem [3], but there is a lack of discussion of a national data governance framework. Therefore, this paper aims to analyze the National Data Strategies (NDS) of the United States, the United Kingdom, Australia, and Japan based on the Data Governance Framework (DGF) of the Data Governance Institute (DGI) to derive the essential considerations in formulating an NDS. The rest of the article is structured as follows. Section 2 introduces the concept of data governance, data governance approaches, and data governance frameworks. The research methodology for this study is in Sect. 3, and Sect. 4 applies the NDS to the DGF presented in Sect. 3 and explains the results. Finally, Sect. 5 covers this paper's main conclusions and contributions and topics for future work.

2 Background 2.1 Data Governance There is no established definition of data governance. Khatri and Brown [4] defined data governance as “organizational decision-making and accountability for data assets”. Weber et al. [5] defined data governance as “defining the dynamics associated with data and assigning responsibility for decision-making”. Laudon and Jane [6] state that data governance deals with the policies and processes for managing the availability, usability, integrity, and security of the data employed in an enterprise, with particular emphasis on promoting privacy, security, data quality, and compliance with government regulations.

2.2 Data Governance Approaches A common problem with data governance is that the flow and logic of the data may not follow the organization’s structure. Inconsistent organizational structure and the flow of use of data can lead to data silo problems, unclear accountability issues, and data control failures throughout the entire lifecycle [2]. Given these issues, it is challenging to clarify the choice of the data governance approach, but it is crucial [7]. Figure 1 describes the approach to data governance. There are three main approaches to data governance. The three approaches are not mutually exclusive and can be used to complement each other [2].


Fig. 1 Data governance approaches [2]

2.2.1 Planning and Control Approach

The Planning and Control approach, commonly used in IT governance frameworks, focuses on setting goals, allocating budgets, and defining, implementing, monitoring, and evaluating projects at a set interval [8]. The disadvantage of this approach is that it does not adapt quickly to change [9]. However, continuous monitoring can help plan appropriate projects and rational allocation of resources.

2.2.2 Organizational Approach

The Organizational approach emphasizes organizational structure, responsibilities, and reporting [10]. This approach uses top-level design principles to set up an organizational structure for data governance with defined authority over data. It recommends having a separate decision-making structure for the data area, and it is recommended that this structure include Chief Data Officers (CDOs), Chief AI Officers (CAIOs), Chief Privacy Officers (CPOs), and Chief Ethics Officers, as well as responsibilities for data stewardship [11].

2.2.3 Risk-Based Approach

A risk-based approach is a data governance approach that focuses on managing risk [12]. In particular, it is known to be an effective way to manage AI-based risks such as data or algorithm errors, data or algorithm bias, and data-embedded discrimination [13].

2.3 Data Governance Framework Because no definition of data governance has been established and data governance approaches vary, the Data Governance Framework (DGF) is also interpreted and


designed from various perspectives. Sarsfield [14] defined the DGF as a "process for organizing data asset management." Tomusange et al. [15] noted that "DGFs use data aggregation and standardization to reduce the time, human and financial investments required in repeated data collection." In particular, Mao et al. [16] described the DGF as a means to "aid government decision-making via data analysis and processing, secure data and enable data evaluation" and also proposed a framework for government data governance based on the central platform concept. Organizational structures are often characterized by distributed development, so it is necessary to ensure the ready availability, accuracy, and integration of critical business data [2]. An imperfect organizational DGF will cause inconsistent data standards, poor quality, and low management efficiency. Therefore, DGFs are increasingly valued in government and enterprise informatics [3].

3 Research Method

This study examines the NDS of the United States, the United Kingdom, Australia, and Japan based on the Data Governance Institute (DGI)'s DGF, together with the three data governance approaches. In detail, the items of each NDS that conform to the components of the government DGF revised and presented for this study are matched to determine which components are essential to the formulation of a government data strategy, as sketched below. Furthermore, we check whether any additional components need to be included and present them.
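For illustration only, a minimal sketch of this matching step is shown below. The component names follow the framework components used in this paper (Tables 1 and 3), while the country mappings are hypothetical placeholders rather than the actual coding of the four strategies.

```python
# Illustrative sketch of the matching step described above.
# The per-country mappings are hypothetical placeholders, not the study's coding.
DGF_COMPONENTS = [
    "Mission and vision", "Goals", "Governance metrics", "Success measures",
    "Funding strategies", "Compliance requirements", "Data definitions",
    "Data standards", "Decision rights", "Accountabilities", "Controls",
    "Data stakeholders", "Data Governance Office", "Data stewards", "Data quality",
]

nds_mapping = {
    # country -> {DGF component: matching NDS element or None}
    "USA": {"Mission and vision": "Mission statement", "Goals": "Principles"},
    "UK":  {"Mission and vision": None, "Goals": "Missions"},
}

def missing_components(mapping):
    """For each country, list the DGF components with no matching NDS item."""
    return {country: [c for c in DGF_COMPONENTS if not items.get(c)]
            for country, items in mapping.items()}

if __name__ == "__main__":
    for country, gaps in missing_components(nds_mapping).items():
        print(country, "not matched:", gaps)
```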

3.1 National Data Strategy

Central and local governments have announced several strategies related to data. We limit the scope of this research to documents published by central governments, that is, NDSs that apply to the entire country. The U.S. Federal Data Strategy, announced in 2019, was conceived and developed in 2018 when the President designated "Leveraging Data as a Strategic Asset" as a Cross-Agency Priority (CAP) Goal. The Federal Data Strategy, which comprises the Mission Statement, 10 Principles, and 40 Practices, provides a whole-of-government vision and offers guidance on how agencies should manage and use federal data. It is designed to derive value from data while supporting robust data governance and protecting security, privacy, and confidentiality [17, 18]. The UK's Department for Digital, Culture, Media & Sport (DCMS) released its NDS in September 2020. The strategy states that the UK government is aware of the value of data and affirms that data is key to driving growth in the digital sphere and in the UK economy. The strategy comprises Data Opportunities, Missions, and Pillars of Effective Use, which include Data Foundations, Skills, Availability, and Responsibility [19].


Australia's Department of the Prime Minister and Cabinet released the Australian Data Strategy in December 2021. The strategy aims to foster industry and transform Australia into a modern data society by 2030 by ensuring efficiency and safety in data use, based on recognition of the importance of data. The strategy sets out how the government will enhance effective, safe, and secure data use over the period to 2025 and focuses on three key themes: maximising the value of data, trust and protection, and enabling data use [20]. The Japanese government announced its NDS in 2021. The NDS is a comprehensive statement of the data strategy promoted by the Japanese government; it clarifies the basic concepts of the data strategy and establishes its fundamental values: a philosophy, a vision of society based on that philosophy, and principles for realizing the vision. The strategy comprises the Philosophy, the Vision, Principles, Trust, a Platform, a Service Platform, Data, Digital Infrastructure, Human Resources and Organizations, Security, and International Collaboration [21].

3.2 DGF of the DGI

The DGF of the DGI contains ten components, as shown in Table 1 [22]. This study focuses on a government DGF, not a corporate DGF. Therefore, among the components, 'Rules' is divided into compliance requirements, data definitions, and data standards, and 'Data Governance Processes' is replaced by 'Data Quality'. We include 'data standards', which means setting rules for the data itself, under 'Rules'. Data quality is meaningful only if it is managed throughout the data processing process. 'Data governance processes' are the methods used to govern data, which likewise means managing them throughout the data processing process. In a DGF for government, improving data quality is more meaningful than managing the processes themselves, so this study replaces the Data Governance Processes component with Data Quality.

4 Data Governance Framework for National Data Strategy

The result of matching the NDS to the DGF modified for this study is shown in Table 2. The NDS of the United States, the United Kingdom, Australia, and Japan contain most of the DGI DGF components that were modified and presented in this study. The Rules and Rules of Engagement domain shows distinct country-specific characteristics. Because all four countries have similar orientations, all components of the People and Organizational Bodies domain and the Process domain are included in every NDS. In particular, data quality improvement is an essential item in all four countries, where the relevant discussions have continued and plans have been formulated. Below, the analysis results for Rules and Rules of Engagement, People and Organizational Bodies, and additional components are examined.


Table 1 The DGI data governance framework [22]

Domain | Components | Descriptions
Rules and rules of engagement | Mission and vision | The direction
 | Goals | Goals should be specific, measurable, actionable, relevant, and timely
 | Governance metrics |
 | Success measures |
 | Funding strategies |
 | Rules | Data-related policies, standards, compliance requirements, business rules, data definitions, etc.
 | Decision rights | For the responsibility of the data governance
 | Accountabilities | To define accountabilities that can be baked into everyday processes and the organization's software development life cycle
 | Controls | Risk management
People and organizational bodies | Data stakeholders | An individual or group that could affect or be affected by the data under discussion; groups who create data, those who use data, and those who set rules and requirements for data
 | Data Governance Office (DGO) | Facilitates and supports data governance and data stewardship activities
 | Data stewards | The set of activities that ensure data-related work is performed according to policies and practices established through governance; set policy and specify standards, or craft recommendations that are acted on by a higher-level Data Governance Board
Processes | Data governance processes | Proactive, reactive, and ongoing data governance processes; the methods used to govern data

Table 2 Comparison of National Data Strategy using DGF

(Columns: Domain | Components | USA | UK | Australia | Japan. The rows follow the domains and components of the modified DGF in Table 1, and each cell cites the NDS element matched to that component: for the USA, the Mission Statement, Principles, the OMB, and individual Practices (e.g., Practices 2, 7, 11, 18, 20, 21, 29, 31, and 34–36); for the UK, the Opportunities, Missions 1–3, Pillars 1–4, and Annex A; for Australia, the Vision, the three key themes, the policy landscape, and strategy sections 1.1–3.2; for Japan, the Philosophy, Vision, Principles, Trust, Platform, Service Platform, Data, Digital Infrastructure, Organization, Digital Agency, and International Collaboration.)


4.1 Rules and Rules of Engagement

The United States has ten Principles and 40 Practices under a clear Mission Statement and annually establishes and publishes an Action Plan to achieve them. The 40 Practices contain most of the DGF components proposed in this study, and the actions for each practice consist of several milestones; data definitions are included in the Action Plan. Each milestone includes a measurement and a target date, and additionally specifies a reporting mechanism, whether it is required or encouraged, the responsible party, and so on, clearly stating the executor and responsible entity of the NDS and the implementation schedule. The UK does not include a vision in its NDS, but it reveals its vision through other official channels [23]. The UK includes funding strategies in the BUDGET document published annually by the UK government under the NDS, and it specifies a 'list of actions and owners' in the NDS's Annex to ensure accountability. The Australian NDS lists the components needed to achieve each theme, with three key themes under a clear vision, and an Action Plan is regularly announced to achieve the goals of each key theme. The Action Plan includes the title, description, and end date of each project underway, indicating its purpose, budget, responsibility, and implementation schedule. The Australian NDS also presents the landscape of data- and digital-related policies previously announced by the Australian government in the part "In brief: the policy landscape," and its budget is allocated based on the Digital Economy Strategy. Japan does not appear to have included discussions on compliance requirements and decision rights in its NDS because the Japanese government is the largest data holder in Japan, and the government is focused on providing "a platform of platforms" for the entire country. As far as the budget is concerned, the Digital Agency is only referred to as reviewing the information systems budget; however, like the United States and the United Kingdom, Japan reflects the budget for each fiscal year under the NDS. All four governments include in their NDS most of the components that make up Rules and Rules of Engagement in the larger context, but there are some differences in clarity and specificity. The United States secured clarity and concreteness through its Action Plan, and the UK secured them through the NDS itself. Along with its NDS Action Plan, Australia secured clarity and specificity through the Digital Economy Strategy, the Australian Cyber Security Strategy 2020, and the Digital Government Strategy. In the case of Japan, the structure of the NDS is presented by describing the role of the Japanese government as "Government as a platform," and a plan for the core tasks corresponding to each layer is presented to secure the specificity of the NDS. These results are attributed to political and cultural differences between countries.


4.2 People and Organizational Bodies

The United States has established a government-wide data policy organization by restructuring the relevant organizations around data and placing a Chief Data Officer (CDO) Council and a Data Policy Committee in the OMB, which has policy promotion and budgetary powers. The UK began discussions with stakeholders even before the strategy was formulated, incorporated the results into the NDS, and mentions the importance of collaboration with stakeholders throughout the NDS. In addition, the Chief Data Officer for Government is designated as a government-wide data officer, and a Central Digital and Data Office (CDDO) under the Cabinet Office handles practical data-related tasks. In Australia, the Department of the Prime Minister and Cabinet, the Department of Home Affairs, and the Digital Transformation Agency have established and announced digital and data-related strategies, many of which are referred to in the NDS. Meanwhile, Australia recognized the importance of data value through the NDS and specified the need to establish a data sharing system; to this end, it builds and implements the DATA Scheme. The Office of the National Data Commissioner (ONDC), an affiliate of the Department of the Prime Minister and Cabinet, oversees and supports the DATA Scheme and most of the programs included in the Australian NDS. Japan launched a Data Strategy Task Force (TF) consisting of business people, associations, professors, and civil servants and announced its NDS based on the results discussed and published by the TF. It also established and operates the Digital Agency, which acts as a control tower to implement the data strategy at the government-wide level. The U.S., UK, and Japanese governments attach great importance to communication and cooperation with stakeholders, have a government-wide data officer or a department in charge that acts as a control tower, and design and implement data-related policies at the cross-government level. As stated in "The policy landscape," the Australian government assigns a department responsible for each digital and data-related strategy and specifies the institutions that perform the supervisory functions necessary to carry it out. These results mean that the governments designed their NDS with emphasis on the organizational approach among the data governance approaches. Governments are committed to complex and large goals: to deal with more and more diverse data than corporations, to foster relevant industries, and at the same time to ensure the trust of the people. Therefore, it is understandable that the organizational approach was adopted more often than the other approaches.


4.3 Additional

The United States also includes many practices that need to be planned and implemented at the national level, such as: Provide Resources Explicitly to Leverage Data Assets, Use Data to Guide Decision-Making, Preparing to Share, Use Data to Increase Accountability, Monitor and Address Public Perceptions, Connect Data Functions Across Agencies, Increase Capacity for Data Management and Analysis, Align Quality with Intended Use, Promote Wide Access, and Review Data Releases for Disclosure Risk. The NDS of the UK includes developing data capabilities, securing international data flows, enacting legislation to mandate participation in data initiatives, facilitating a policy of openness for public data, securing a legal framework for the use of data based on trust, providing shared benefits, and ensuring the right to control self-information. Australia includes building a data sharing system, improving data security capabilities, enhancing data trust, building data infrastructure including data laws and regulations, data integration, and developing data capabilities in its NDS. Japan also secures international data flows, ensures data subjects' control over their personal information, activates the policy of opening up public data, includes matters concerning data-based administration in the NDS, and includes the maintenance of digital infrastructure among the main components of the strategy. Therefore, the items common to the NDS mentioned above, namely "development of data capabilities, securing international data flows, activating the policy of open public data, ensuring the privacy of data subjects, and data-driven administration," are considered appropriate to be included as components of the DGF for government. Table 3 presents the components that must be included in the DGF for government.

5 Conclusions

In this study, the NDS of the US, the UK, Australia, and Japan were analyzed based on the DGF of the DGI, and the components that must be included in an NDS were identified. The DGF for government derived from this study presents the components that must be discussed when establishing a new NDS or modifying an established one. In addition, by comparing the NDSs of the United States, the United Kingdom, Australia, and Japan, which have different backgrounds, we provide a basis for choosing a DGF approach that fits the situation of the country establishing the NDS. However, our research has some limitations. First, most of the DGFs currently in place are designed to acquire business value through data asset management [24].


Table 3 DGF for government

Domain | Components
Rules and rules of engagement | Mission and vision; Goals; Governance metrics; Success measures; Funding strategies; Data rules and definitions (Compliance requirements, Data definitions, Data standards); Decision rights; Accountabilities; Controls
People and organizational bodies | Data stakeholders; A Data Governance Office (DGO); Data stewards
Process | Data quality; Increase capacity for data management and analysis
Public interest | Right to control self-information; Evidence-based administration; Open data
National interest | Championing the international flow of data

However, because governments must pursue public values, there is a gap between the data governance frameworks designed and built to date and government data governance frameworks [25]. These differences can affect strategies, governance approaches, and even designs when implementing data governance decisions [4]. Therefore, government DGFs should be designed to pursue public service values such as improving citizenship and promoting social equity [16]. Although the DGF for government in this paper includes components such as 'Right to control self-information', 'Evidence-based administration', 'Open data', and 'Championing the international flow of data', additional components must be included, in terms of pursuing public interests and fostering related industries, in order to establish a better "national" data strategy. Second, we need to consider a data governance framework that can be linked across government organizations and sectors. The lack of links between government organizations and sectors creates barriers to long-term data collection and sharing [26] and will pose a significant challenge to economic growth based on government data. While the organization's role and decision-making authority have been mentioned as a solution, data governance decisions on the separation of duties and responsibilities and on data ownership need to be discussed continuously [27]. Based on these discussions, a government DGF design that considers the links between government organizations and sectors is needed.


References 1. Benfeldt, O., Persson, J.S., Madsen, S.: Data governance as a collective action problem. Inf. Syst. Front. 22(2), 299–313 (2020) 2. Janssen, M., Brous, P., Estevez, E., Barbosa, L.S., Janowski, T.: Data governance: organizing data for trustworthy artificial intelligence. Gov. Inf. Q. 37(3), 101493, 1–8 (2020) 3. McGuirk, P.M., O’Neill, P.M., Mee, K.J.: Effective practices for interagency data sharing: insights from collaborative research in a regional intervention. Aust. J. Public Adm. 74(2), 199–211 (2015) 4. Khatri, V., Brown, C.V.: Designing data governance. Commun. ACM 53(1), 148–152 (2010) 5. Weber, K., Otto, B., Österle, H.: One size does not fit all—a contingency approach to data governance. J. Data Inf. Qual. (JDIQ) 1(1), 1–27 (2009) 6. Laudon, K.C., Jane, P.: Management Information Systems: Managing the Digital Firm, 13th edn. Pearson Education Limited (2014) 7. Koltay, T.: Data governance, data literacy and the management of data quality. IFLA J. 42(4), 303–312 (2016) 8. Van De Haes, S., Grembergen, W., Debreceny, R.S.: COBIT 5 and enterprise governance of information technology: building blocks and research opportunities. J. Inf. Syst. 27(1), 307–324 (2013) 9. Janssen, M.V., Der Voort, H.: Adaptive governance: towards a stable, accountable and responsive government. Gov. Inf. Q. 33(1), 1–5 (2016) 10. Mullon, P.A., Ngoepe, M.: An integrated framework to elevate information governance to a national level in South Africa. Rec. Manag. J. 29(1/2), 103–116 (2019) 11. Rothstein, H., Borraz, O., Huber, M.: Risk and the limits of governance: exploring varied patterns of risk-based governance across Europe. Regul. Gov. 7(2), 215–235 (2013) 12. Ladley, J.: Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program. Academic Press (2019) 13. Janssen, M., Kuk, G.: The challenges and limits of big data algorithms in technocratic governance. Gov. Inf. Q. 33(3), 371–377 (2016) 14. Sarsfield, S.: The Data Governance Imperative. IT Governance Publishing (2009) 15. Tomusange, I., Yoon, A., Mukasa, N.: The data sharing practices and challenges in Uganda. Proc. Assoc. Inf. Sci. Technol. 54(1), 814–815 (2017) 16. Mao, Z., Wu, J., Qiao, Y., Yao, H.: Government data governance framework based on a data middle platform. Aslib J. Inf. Manag. 74(2), 289–310 (2021) 17. Office of Management and Budget, “Background”. Access: 13 June 2022. URL: https://str ategy.data.gov/background/ 18. Office of Management and Budget: Federal Data Strategy (2019) 19. Department for Digital, Culture, Media & Sport: National Data Strategy (2020) 20. Department of the Prime Minister and Cabinet: Australian Data Strategy (2021) 21. Cabinet Office: National Data Strategy (包括的データ戦略) (2021) 22. The Data Governance Institute: The DGI Data Governance Framework (2020) 23. UK Gov: National Data Strategy. Access: 30 June 2022. URL: https://www.gov.uk/guidance/ national-data-strategy 24. Panian, Z.: Some practical experiences in data governance. World Acad. Sci. Eng. Technol. 62(1), 939–946 (2010) 25. Rajagopalan, M.R., Vellaipandiyan, S.: Big data framework for national e-governance plan. In: 2013 Eleventh International Conference on ICT and Knowledge Engineering, pp. 1–5. IEEE (2013) 26. Paskaleva, K., Evans, J., Martin, C., Linjordet, T., Yang, D., Karvonen, A.: Data governance in the sustainable smart city. Informatics 4(4), 41–59 (2017) 27. Alhassan, I., Sammon, D., Daly, M.: Data governance activities: an analysis of the literature. J. Decis. Syst. 25(sup 1), 64–75 (2016)

A Study on the Attack Index Packet Filtering Algorithm Based on Web Vulnerability Min Su Kim

Abstract Web service is a representative Internet service that uses the TCP/IP communication protocol, and malicious attack attempts through the open HTTP port 80 are a known vulnerability. Web vulnerability attacks can be classified mainly into client-side attacks and server-side attacks. A server-side attack is a direct attack against the server and is mostly a type of attack that modifies or steals data in the database. Attack techniques include SQL injection, XSS, and CSRF, and security at the network layer using security equipment such as firewalls, IDS, and IPS cannot block attacks against the application layer. Therefore, this study proposes an application layer-based web packet filtering algorithm to improve web service continuity and operational efficiency based on an attack index for web vulnerabilities. Keywords Web vulnerability · Web packet filtering algorithm · Client-side attack · Server-side attack · Web service continuity

1 Introduction

With the development of information and communication technology in the knowledge information society, information is provided in various forms throughout society. However, due to continuous attacks exploiting network and system vulnerabilities, efforts are being made to respond to cyber threats based on the risk of web vulnerabilities. The characteristics of the major cyber threats in 2020 [1] announced by the EU Network Information Security Agency (ENISA) include access control policies for services provided to many unspecified persons under the operation of the web protocol, and responses to unknown or new vulnerabilities.

This research is supported by Joongbu University. M. S. Kim (B) Department of Information Security Engineering, Sun Moon University, Seoul, South Korea e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_12


Web service attacks are ongoing because internal information leakage, malicious code distribution, and defacement attacks are possible through such vulnerabilities [2]. Web service is a representative Internet service that uses the TCP/IP communication protocol, and malicious attack attempts through the open HTTP port 80 are a known vulnerability. Web vulnerability attacks can be classified mainly into client-side attacks and server-side attacks. A server-side attack is a direct attack against the server and is mostly a type of attack that modifies or steals data in the database. Therefore, this study proposes an application layer-based web packet filtering algorithm to improve web service continuity and operational efficiency based on an attack index for web vulnerabilities.

2 Related Study

2.1 Open Web Application Security Project Top 10

The OWASP Top 10, updated four years after the 2017 edition, comprises A01 Broken Access Control (access control vulnerabilities), A02 Cryptographic Failures (formerly Sensitive Data Exposure), A03 Injection (untrusted data used in instructions or query statements), A04 Insecure Design (vulnerabilities caused by ineffective control design), A05 Security Misconfiguration (unnecessary functions activated or installed), A06 Vulnerable and Outdated Components (end-of-support or outdated versions), A07 Identification and Authentication Failures (user identification, authentication, and session management not properly performed), A08 Software and Data Integrity Failures (integrity-related vulnerabilities), A09 Security Logging and Monitoring Failures (failure to detect and respond to ongoing attacks), and A10 Server-Side Request Forgery (manipulated requests) [3], as shown in Fig. 1. For each OWASP vulnerability category, the following data elements exist, with the following meanings [3]:
a. CWEs Mapped: number of CWEs mapped to each vulnerability category

Fig. 1 2021 OWASP Top 10


b. Incidence Rate: proportion of applications vulnerable to the CWE in the population tested in that year
c. Coverage: percentage of applications tested by all organizations for a specific CWE
d. Weighted Exploit: the Exploit sub-score from the CVSSv2 and CVSSv3 scores of CVEs mapped to the CWE, on a 10-point scale
e. Weighted Impact: the Impact sub-score from the CVSSv2 and CVSSv3 scores of CVEs mapped to the CWE, on a 10-point scale
f. Total Occurrences: total number of applications found with the CWE mapped to the category
g. Total CVEs: total number of CVEs in the NVD database mapped to the CWEs of the category

1. Broken Access Control

a | b | c | d | e | f | g
34 | 3.81% | 47.22% | 6.92 | 5.93 | 318,487 | 19,103

Previously, a single service was in charge of access control, but access control has become more difficult as functions are divided across services, which appears to be why this category moved to first place.

2. Cryptographic Failures

a | b | c | d | e | f | g
29 | 4.49% | 34.85% | 7.29 | 6.81 | 233,788 | 3075

OWASP determined that the Sensitive Data Exposure category of 2017 was not a root cause but a symptom that occurs widely in various areas.

3. Injection

a | b | c | d | e | f | g
33 | 3.37% | 47.90% | 7.25 | 7.15 | 274,228 | 32,078

Injection has long covered various vulnerabilities such as SQL, NoSQL, operating system command, ORM (Object Relational Mapping), LDAP, EL (Expression Language), and OGNL (Object Graph Navigation Library) injection, and it is still considered a high-risk category.

4. Insecure Design

a | b | c | d | e | f | g
40 | 3.00% | 42.51% | 6.46 | 6.78 | 262,407 | 2691

In the case of a badly designed application, findings from security tests conducted later often cannot be easily remediated, so in many cases the application must be operated with the risks accepted.

5. Security Misconfiguration

a | b | c | d | e | f | g
20 | 4.51% | 44.84% | 8.12 | 6.56 | 208,387 | 789

Even when a securely designed application is developed, options are often set incorrectly for testing and debugging; for this reason, OWASP raised the ranking of this category by one level.

6. Vulnerable and Outdated Components

a | b | c | d | e | f | g
3 | 8.77% | 22.47% | 5.00 | 5.00 | 30,457 | 0

For applications that combine various open source libraries and components, known vulnerabilities may exist in the provided libraries and components, and using components with known vulnerabilities embeds those vulnerabilities in the software.

7. Identification and Authentication Failures

a | b | c | d | e | f | g
22 | 2.55% | 45.72% | 7.40 | 6.50 | 132,195 | 3897

This category appears to have declined in ranking as the use of development frameworks increases in currently developed software.

8. Software and Data Integrity Failures

a | b | c | d | e | f | g
10 | 2.05% | 45.35% | 6.94 | 7.94 | 47,972 | 1152

This is one of the most heavily weighted categories in the CVE/CVSS data, but it did not change position; it relates to software updates, critical data, and CI/CD pipelines whose integrity is not verified.

9. Security Logging and Monitoring Failures

a | b | c | d | e | f | g
4 | 6.51% | 39.97% | 6.87 | 4.99 | 53,615 | 242

Logging and monitoring are difficult to test for security and do not show up well in CVE/CVSS data.

10. Server-Side Request Forgery


a | b | c | d | e | f | g
1 | 2.72% | 67.22% | 8.28 | 6.72 | 9503 | 385

This category received above-average ratings for exploitability and impact potential in the data, while showing a relatively low incidence with above-average test coverage. In addition, Table 1 shows the evaluation of attack possibility (E), vulnerability distribution (P), ease of detection (D), and technical impact (T), the detailed risk factors of the 2017 OWASP Top 10 [4]; the pattern step of this study refers to them for the attack index.

Table 1 OWASP Top 10 detailed risk factors

Risk | E | P | D | T | S
Injection | 3 | 2 | 3 | 3 | 8
Authentication | 3 | 2 | 2 | 3 | 7
Sensitive data exposure | 2 | 3 | 2 | 3 | 7
XML external entities | 2 | 2 | 3 | 3 | 7
Broken access control | 2 | 2 | 2 | 3 | 6
Security misconfiguration | 3 | 3 | 3 | 2 | 6
Cross-site scripting | 3 | 3 | 3 | 2 | 6
Insecure deserialization | 1 | 2 | 2 | 3 | 5
Vulnerable components | 2 | 3 | 2 | 2 | 4.7
Insufficient logging and monitoring | 2 | 3 | 1 | 2 | 4
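As a rough illustration of how these risk factors might feed the attack index, the sketch below turns the E, P, D, and T values of Table 1 into a per-category weight; the simple averaging is an assumption made for illustration and is not the exact formula used in this study.

```python
# Hypothetical attack-index weighting derived from Table 1 (E, P, D, T values).
# The simple average used here is an illustrative assumption, not the paper's method.
RISK_FACTORS = {
    "Injection":                           (3, 2, 3, 3),
    "Authentication":                      (3, 2, 2, 3),
    "Sensitive data exposure":             (2, 3, 2, 3),
    "XML external entities":               (2, 2, 3, 3),
    "Broken access control":               (2, 2, 2, 3),
    "Security misconfiguration":           (3, 3, 3, 2),
    "Cross-site scripting":                (3, 3, 3, 2),
    "Insecure deserialization":            (1, 2, 2, 3),
    "Vulnerable components":               (2, 3, 2, 2),
    "Insufficient logging and monitoring": (2, 3, 1, 2),
}

def attack_index(category):
    """Average the exploitability, prevalence, detectability, and impact scores."""
    e, p, d, t = RISK_FACTORS[category]
    return (e + p + d + t) / 4.0

if __name__ == "__main__":
    for name in RISK_FACTORS:
        print(f"{name}: {attack_index(name):.2f}")
```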

2.2 Packet Classification

In order to detect security threat packets, techniques that inspect the details of the payload of network traffic against signatures are used [5]. Packet classification is defined as assigning packets that share fields of the IP/TCP/UDP header information to a flow, and algorithms have been developed for classifying packets against a given rule in a dynamic packet flow [6, 7] (Fig. 2); a simple sliding-window sketch is given below.
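Below is a minimal sketch of signature-based payload inspection with a sliding window, in the spirit of Fig. 2; the signature strings are illustrative examples, not the rule set of this study.

```python
# Minimal sliding-window signature matching over a packet payload (illustrative only).
SIGNATURES = [b"union select", b"<script>", b"../"]  # example patterns, not the study's rules

def match_signatures(payload):
    """Slide a window over the payload and report which signatures occur."""
    hits = []
    low = payload.lower()
    for sig in SIGNATURES:
        width = len(sig)
        for start in range(0, len(low) - width + 1):   # window advances one byte at a time
            if low[start:start + width] == sig:
                hits.append(sig.decode())
                break
    return hits

if __name__ == "__main__":
    pkt = b"GET /item?id=1 UNION SELECT password FROM users HTTP/1.1"
    print(match_signatures(pkt))   # ['union select']
```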

3 Research Method

3.1 Packet Filtering Algorithm

1. Classification step


Fig. 2 Sliding window pattern matching

It is the classification according to the importance of security threats for each individual packet, and is classified according to the characteristics of the packets flowing into the network, and Fig. 3 shows the classification for the application of the web packet filtering algorithm by differential processing according to its characteristics. The lowest step is a step for packet classification according to the attack technique, and a detailed pattern is applied in the subsequent extraction step. 2. Extraction step After going through the classification step for each individual packet, 7 Layer patterns are extracted based on the 3, 4, 7 Layer PDU (Protocol Data Unit) details based on the first filtered packet and applied to each Rule. Figure 4 shows the rule pattern according to the PDU of the packet and is applied to threat packet extraction. 3. Pattern step Based on the individual attack pattern applied to the payload of the packet and the additional points according to the risk assessment, the matching point

Fig. 3 Hierarchical relationship between packet filtering algorithm application patterns

A Study on the Attack Index Packet Filtering Algorithm Based on Web …

151

Fig. 4 Rule extraction according to packet PDU

Fig. 5 SQL-injection pattern

between the incoming packet and the threat packet pattern is measured as shown in Fig. 5. P(r1, pk) is a formula that verifies the similarity between Rule 1 and the incoming packet, and as the set of denominators increases, a penalty is applied. | | | | | r 1 ∩ pk | | | r 1 ∩ pk |=| | P(r1 , pk ) = || | | |r1 | + | pk | − |r1 ∩ pk | | r 1 ∪ pk

4 Comparative Verification 4.1 Keyword Frequency Analysis Result For the verification of the web packet filtering algorithm proposed in this study, Table 2 shows the results of comparative verification of the existing network security equipment (A), the signature-based packet filter algorithm (B), and the proposed algorithm (C). Basically, the 3 Layer-based signature rule is applied to the verification target, and the packet filtering algorithm is applied up to the PDU area of 4 Layer. In the proposed algorithm, the matching point with the pattern is measured through the similarity evaluation to the rule for the payload of 7 Layer and the dynamic pattern in the algorithm rule.

152 Table 2 Comparative verification result

M. S. Kim Section

A

B

C

3 layer







4 layer

X





7 layer

X

X



Signature rule







Dynamic pattern

X





5 Conclusion A malicious attack attempt toward a web service vulnerability results in tampering or stealing data in the database. We proposed an application layer-based web packet filtering algorithm to improve web service continuity and operational efficiency based on the attack index for web vulnerabilities. As the attack index, a rule applying the vulnerability attack pattern is determined based on the OWASP Top 10 risk assessment value and the number of packets of the L7 protocol, and the matching points with the patterns are measured through similarity evaluation to the rule for the individual attack pattern applied to the payload of the attack packet. According to the measured packets, the IP-based network intrusion prevention policy is applied to prevent the primary damage. Based on the payload information of 7 Layer, the system’s secure coding is applied to signature-based filtering, and the analysis step for the pattern is applied to the new pattern. By applying the dual filtering rule of 3, 4 Layer and 7 Layer, the existing web filtering algorithm is somewhat inefficient for pattern matching, but the application of the algorithm presented in this study will increase the security at the network layer. Acknowledgements This is paper was supported by Joongbu University Research & Development Fund, in 2022.

References 1. European Union Agency for Network and Information Security (ENISA): ENISA Threat Landscape 2020—List of Top 15 Threats (2020) 2. Jin, H.H., Kim, H.K.: A study on web vulnerability risk assessment model based on attack results: focused on cyber kill chain. J. Korea Inst. Inf. Secur. Cyptol. 31(4) (2021) 3. OWASP. https://owasp.org/www-project-top-ten (2022) 4. OWASP: OWASP Top 10-2017 (2017) 5. Sung, J., Seok-Min, K., Lee, Y., Taeck-Geun, K., Kim, B.: High-speed pattern matching algorithm using TCAM. J. Korea Inf. Process. Soc. 12(4) (2005) 6. Wang, Z., Che, H., Kumar, M., Das, S.K.: Consistent TCAM policy table up date with zero impact on data path processing. IEEE Comput. 56(12) (2004) 7. Jeong, H., Song, I., Lee, Y., Kwon, T.: An efficient update algorithm for packet classification with TCAM. J. Korean Inst. Commun. Inf. Sci. 31(2A) (2006)

Analysis of IoT Research Topics Using LDA Modeling Daesoo Choi

Abstract The three major characteristics of the 4th industrial revolution are hyperconnectivity, superintelligence, and hyperconvergence. The purpose of this study is to analyze the research trend of ‘Internet of Things’, a core technology that represents these changes. In particular, an attempt was made to gain insight by identifying the fields of engineering and social sciences separately. Additionally, macroscopic research trends were confirmed through topic modeling. As a result, differences in research topics for each field were identified, and topics that could confirm research trends were derived. Keywords Internet of things (IoT) · Text mining · Topic modeling · Latent Dirichlet allocation (LDA)

1 Introduction Interest in the 4th industrial revolution is continuously maintained. Although there is an argument that it should be included in the scope of the third industrial revolution, that is, the information revolution, which has greatly changed human life patterns through computers and the Internet, many people have the intention to define the faster and wider flow of change as a new concept. There seems to be Starting with the Davos Forum in 2016, discussions on the concept, scope, and content of the Fourth Industrial Revolution have been ongoing in various fields. These concerns and discussions can be broadly divided into two categories. One is related to innovative technology or technological innovation that leads the fourth industrial revolution, and the other is social-economic change.

This research is supported by Joongbu University. D. Choi (B) Department of Software Engineering, Joongbu University, Seoul, South Korea e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_13

153

154

D. Choi

The growing interest in technology that leads new changes is very natural. It has been a common pattern in the past for new technologies to be developed out of human need. But these days, new technologies are emerging, and it is more natural for them to change our lifestyles. Therefore, we pay attention to the emergence of new technologies and respond sensitively to changes in technology. Among the technologies that drew attention in the discussion of the 4th industrial revolution, there are the Internet of Things, big data, and artificial intelligence. This technology can be said to be an advanced version of the information technology that led the third industrial revolution. It is no exaggeration to say that the keywords ‘super-connection, hyper-convergence, and super-intelligence’ were derived based on these technologies. In particular, the Internet of Things (IoT) is very innovative in that the subject of information delivery is things, not people, and is evaluated as a core technology leading the 4th industrial revolution. Therefore, it is very important to check the research trend of the Internet of Things, and it can be said that it is a necessary activity. The increasing interest in new changes is also progressing at the social-economic level. In particular, the emergence of new products-services and new processes developed using innovative technologies helps to predict major social-economic changes. Therefore, it is also necessary to examine the research trends of the Internet of Things in research in the field of social sciences. The purpose of this study is to predict the future direction of technological innovation in related fields by analyzing the research trends of the Internet of Things. In particular, by examining not only the research on technology itself, but also research in the social sciences that utilizes it, we want to check the changes in the socialeconomic dimension together. Since the Internet of Things (IoT) is a technology applicable to various products and services, it will be a meaningful attempt to predict the future direction of technology convergence and industrial convergence.

2 Related Study 2.1 Definition of IoT The term Internet of Things (IoT) was mentioned in a report published in 2005 by the International Telecommunication Union (ITU Internet Reports 2005: The Internet of Things), and discussion of standardization began in earnest [1]. Since then, numerous products have been developed using the Internet of Things (IoT), and are still in progress. Sometimes, the precise technical concept of the Internet of Things is lost, and it is used incorrectly as a buzzword to emphasize the innovativeness of a product. Regarding the definition of IoT, market research institute IDC published a rather unique definition of a business of things, and Gartner expressed it as a network of things that can be sensed and interacted with [2, 3]. In fact, the Internet of Things

Analysis of IoT Research Topics Using LDA Modeling

155

(IoT), which is used in various materials, is spreading from technology-oriented to business-oriented.

2.2 Research on Internet of Things Research Trends Various studies are being conducted on the Internet of Things (IoT) research trends in Korea. Research methods can be divided into two main categories. One is a pattern of deriving insights from major literature data, and the other is a method of analyzing information (title, keyword, text) of a number of papers published in a certain period. The analysis methods and key conclusions of the major papers are as follows. Kim and Pyo [4] summarized research trends using the method of separately analyzing reports and major papers from major organizations related to the Internet of Things. Issues at the technical level ranging from the architecture, protocol, scalability, big data, and security and privacy of the Internet of Things are summarized, and product status and The market outlook is described in a separate chapter. In summary, it is said that products that are still connected are still in trend, and that intelligent IoT products will be developed in the future [4]. Joo and Na [5] analyzed 101 IoT related papers published in academic journals from 2010 to 2015. The subject of each paper was divided into technology, industry, service, policy, and others. The research methods of each paper were classified into literature review, case analysis, survey, interview, test, mathematical prediction and analysis, and others. Through this analysis, it was pointed out that research on the Internet of Things was lacking and that it was necessary to expand research topics and methods [5]. Lee [6] conducted research on a number of IoT related papers from 2015 to 2019. He confirmed the change of the research topic by analyzing the frequency and centrality of keywords. The main results are summarized as follows. The frequency of keywords for car and intelligence was consistently high, and it was confirmed that research related to the industrial revolution was actively conducted after 2017. He also confirmed that the topic of security and intelligence is continuously discussed with many connections with other studies [6]. In order to confirm objective data on research trends, this study collected and analyzed Internet related thesis information using text mining. As mentioned in the introduction, I tried to confirm the macroscopic flow of IoT research from various angles by comparing research in the fields of engineering and social sciences and analyzing research trends using topic modeling.

156

D. Choi

3 Research Method 3.1 Analysis Method Text mining is a method of extracting meaningful information from text. In this study, word cloud and topic modeling were used among text mining research methods. Word cloud is a method of analyzing the frequency of words and proceeds through the following procedure. First, the collected dataset is checked, and only the desired language is left and removed. Remove stopwords and spaces, and calculate the frequency to proceed with visualization. Although it is possible to indirectly examine which concepts are being studied a lot through frequently used words, there is a limit to looking at the overall research trend. The method used to supplement this is topic modeling. Topic modeling can be said to be a text mining technique for finding topics in a set of documents. Topic modeling is based on the intuition that certain words will appear frequently in documents about a particular topic. Latent Dirichlet Allocation (LDA) is a representative topic modeling algorithm. It is a probabilistic model of which topics exist in each document.

3.2 Analysis Data This study was mainly divided into two stages. First, text frequency analysis was conducted to check what topics the Internet of Things is being researched on in engineering and social science fields, and secondly, topic modeling was performed to confirm the research trends. In the first analysis, papers published in the SCOPUS journal were analyzed by dividing them into engineering journals and social science journals. Although there were overwhelmingly many papers in the field of engineering, 300 papers were used in consideration of the number of journals in the social science field. ‘Internet of Things (IoT)’ was used as a search term, and searched papers were selected in order of accuracy. The second topic modeling was obtained from the overseas electronic information service of the Academic Research Information Service (www.riss.kr) operated by the Korea Research and Education Information Service (KERIS), and abstracts of 100 English papers published from 2016 to 2021.

Analysis of IoT Research Topics Using LDA Modeling

157

Fig. 1 Engineering graph

4 Research Result 4.1 Keyword Frequency Analysis Result The results confirmed through frequency analysis are as follows. In engineering papers, words such as system, smart, cloud, data, network, energy, application, monitoring took priority, and in social science papers, network, system, application, data, service, smart, technology, using appeared to be many. The exact ranking can be confirmed in Figs. 1 and 2. Looking at the pictures visualized through the word cloud in Figs. 3 and 4, in the field of engineering, there are also many technical words, and in the field of social science, words related to socio-economic phenomena stand out. However, it can be seen that there are also many technical words in the field of social science. In particular, the words network, application, enabled, data, service, etc., which are relatively frequently used in social science fields, make it possible to infer that IoT related economic and management research has been actively conducted.

4.2 Topic Modeling Result In order to confirm the main research fields and themes of the IoT, an abstract of each study was created as a single document, and then the topic was derived using

158

D. Choi

Fig. 2 Social science graph

Fig. 3 Engineering word-cloud

LDA analysis. Five topics were set in order to minimize overlap in 100 papers. 15 words were extracted from one topic, and the results are shown in Table 1. The topic modeling result shows that the same word is duplicated for each topic, making it difficult to determine the topic label. It is judged that the number of data to be analyzed needs to be expanded for a clearer classification. Nevertheless, I tried to classify the words by considering the priority of the words in each topic. The first topic is ‘device’. Most of the IoT research is being conducted in the engineering field, and among them, the device related part occupied one topic. The second topic

Analysis of IoT Research Topics Using LDA Modeling

159

Fig. 4 Social science word-cloud

Table 1 Topic modeling results No. Main words

Topic label

0

0.022 * “iot” + 0.010 * “system” + 0.010 * “network” + 0.009 * “device” + Device 0.007 * “data” + 0.006 * “internet” + 0.006 * “based” + 0.006 * “model” + 0.005 * “proposed” + 0.005 * “sensor” + 0.005 * “thing” + 0.005 * “security” + 0.004 * “paper” + 0.004 * “used” + 0.004 * “application”

1

0.020 * “iot” + 0.012 * “system” + 0.009 * “network” + 0.008 * “based” + Application 0.007 * “device” + 0.007 * “internet” + 0.007 * “application” + 0.007 * “model” + 0.007 * “thing” + 0.007 * “data” + 0.006 * “proposed” + 0.005 * “sensor” + 0.005 * “using” + 0.005 * “security” + 0.004 * “time”

2

0.025 * “iot” + 0.009 * “system” + 0.008 * “based” + 0.007 * “data” + Data 0.007 * “network” + 0.006 * “device” + 0.006 * “internet” + 0.006 * “thing” + 0.005 * “model” + 0.005 * “proposed” + 0.005 * “security” + 0.005 * “application” + 0.004 * “technology” + 0.004 * “smart” + 0.004 * “time”

3

0.026 * “iot” + 0.011 * “based” + 0.009 * “system” + 0.008 * “device” + Smart 0.007 * “proposed” + 0.006 * “network” + 0.006 * “smart” + 0.006 * “application” + 0.006 * “data” + 0.005 * “internet” + 0.005 * “security” + 0.005 * “thing” + 0.005 * “result” + 0.005 * “paper” + 0.004 * “technology”

4

0.030 * “iot” + 0.011 * “system” + 0.009 * “based” + 0.009 * “data” + Security 0.008 * “network” + 0.008 * “device” + 0.006 * “proposed” + 0.006 * “internet” + 0.006 * “security” + 0.006 * “thing” + 0.005 * “application” + 0.004 * “also” + 0.004 * “time” + 0.004 * “sensor” + 0.004 * “paper”

is ‘application’. The application required to control and utilize IoT based products was selected as the next topic. The third topic is ‘data’. One of the characteristics of IoT products is that they generate a lot of data by exchanging digital signals. The fourth topic is ‘smart’. It can be seen as a field that provides smart service and smart solution by utilizing device, application, and data. The fifth topic is ‘security’. In IoT research, the part related to security is always under consideration. The pyLDAvis

160

D. Choi

package was used to visualize the LDA analysis results. The results shown in Figs. 5, 6, 7, 8 and 9 were derived.

Fig. 5 Topic1 graph

Fig. 6 Topic2 graph

Analysis of IoT Research Topics Using LDA Modeling

161

Fig. 7 Topic3 graph

Fig. 8 Topic4 graph

5 Conclusion The purpose of this study is to predict the future direction of technological innovation by analyzing the research trends of the Internet of Things. Research in the field of engineering and research in the social sciences were divided and examined together. Text mining was used to analyze the advantages and disadvantages of existing research methods, and the results are as follows.

162

D. Choi

Fig. 9 Topic5 graph

First of all, the keyword frequency analysis confirmed through the study title clearly confirmed the difference between engineering and social science. The words frequently used in the engineering field were ‘system, smart, cloud, data, network, energy’ in the order, and in social science papers, ‘network, system, application, data, service’ appeared in the order. Since understanding of technology is an important topic, it is judged that the term for technology is used a lot in social sciences as well. In particular, the frequent use of the words ‘application, data, service’ can be inferred that research on the commercialization and business of IoT based products is being actively conducted. In the LDA analysis conducted to model the major research areas of the Internet of Things, five topics were derived, and ‘device, application, data, smart, security’ was finally selected. Considering the process of developing IoT technology and applying it, it can be seen as a natural result. Data generated in the process of developing and operating devices and applications of IoT, providing smart solutions and services using them, and even security issues can be explained. In conclusion, it is summarized as follows. First, the field of interest in the Internet of Things in the engineering field and the field receiving attention in the social sciences are similar and have strong connectivity. It can be analyzed that research is being conducted with an even interest in the areas connected to the development of technology, commercialization using it, and social change. It is considered that many scholars are interested in researching this technology because it is a technology of high interest worldwide and a representative technology that leads innovation. Second, predictable results were derived for the representative research topics of the Internet of Things, but additional research is needed to derive more diverse topics. In order to classify and analyze the overall study, useful results have been derived, but there are insufficient areas to find future research directions or detailed categories.

Analysis of IoT Research Topics Using LDA Modeling

163

This study is meaningful in that it was possible to macroscopically confirm the research trends of the Internet of Things through objective data and analysis methods. However, although it was possible to confirm the overall trend of IoT research, there were insufficient parts to predict the future direction. To compensate for this, machine learning needs to be performed through more data, and time series analysis needs to be added.

References 1. 2. 3. 4. 5.

ITU internet reports 2005: the internet of things, ITU (2005) IDC industrializing IoT: from concept to reality (2016) Gartner.: The internet of things and related definitions (2016) Kim, Pyo.: Internet of things (IoT) research trends. J. Korean Electromagn. Soc. 27(4) (2016) Joo, C., Na, H.: Domestic research trend analysis on internet of things (IoT). Informatization Policy 22(3) (2015) 6. Lee, T.: Domestic internet of things research trends based on keyword centrality analysis. J. Korean Contents Assoc. 20(12) (2020)

Log4j Vulnerability Analysis and Detection Pattern Production Technology Based on Snort Rules WonHyung Park and IlHyun Lee

Abstract Recently, a vulnerability was discovered in Log4j, an open source logging library that is very widely used by Apache. Currently, the risk of this vulnerability is at the ‘highest’ level, and it is being used in many systems that developers are not aware of, so it is evaluated as the worst vulnerability ever. This Log4j vulnerability takes a lot of time from identifying the existence of the Log4j program in the existing 3rd-party products used by companies, making it difficult to respond to the initial response. In the future, there is a strong possibility that hacking accidents due to the LOG4J vulnerability will continue to occur. In this paper, we analyze the Log4j vulnerability in detail and propose the Snort detection policy technology so that the security control system can quickly and accurately detect it. Through this, as a security officer, you can effectively monitor the LOG4J vulnerability attack detected by the snort rule and based on the Log4j Indicator Of Compromise (IoC), it is possible to actively and quickly respond to LOG4J vulnerability situations with high risk. Keywords Open source · Log4J · Vulnerability · Security monitoring system · Snort detection rules

1 Introduction As the IT (Information Technology) industry develops and becomes more sophisticated, the convergence industry market with OT (Operational Technology) is also growing in size. It is not an exaggeration to say that development is indispensable in almost all industrial fields as the fields are also diversified and subdivided into AI, big data, virtual reality, and IoT within the rapidly developing OT/IT industry. It is a time when various forms of development such as programs and platforms are needed in various fields in order not to be left behind in a developing industry. However, there is a disadvantage that it requires too much time and effort for a developer to develop W. Park · I. Lee (B) Department of Information Security, Sangmyung University, Cheonan, 31, Sangmyeongdae-gil, Dongnam-gu, Cheonan-si, Chungcheongnam-do, Republic of Korea e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_14

165

166

W. Park and I. Lee

Fig. 1 Trends in use of open source

by himself. In order to solve these shortcomings, many companies are distributing open source (OSS, Open Source Software), and borrowing and developing has the advantage that immediate function implementation is possible, so it is widely used all over the world (Fig. 1). On the other hand, taking advantage of open source distribution also has fatal disadvantages [1]. Most companies do not check the existing open source in the development stage, and their main goal is to create a profit by developing any product, and the security awareness of the developed products is low or overlooked. In addition, there is a limit in that it is difficult to add the security technology of IoT devices close to daily life, such as routers or home appliances, so there is a limit in that rapid industrial development is easy to be exposed to vulnerabilities and intrusions from a security point of view [2]. This is the part that security experts were concerned about, and in the end, the Log4j vulnerability, which is evaluated as the worst vulnerability ever, was discovered at the end of last year [3, 4]. A vulnerability discovered in Log4j, an open source logging library used by Apache, that can be used as a vulnerability by adding a specific string value to a location where the log is recorded [5, 6]. It is a vulnerability with a large potential problem. For this reason, the Log4j vulnerability has been designated as the ‘highest’ level of risk, and to prevent this, it takes a lot of time and effort to even identify the presence or absence of open source in third-party (3RD-PARTY) products used by many companies, and most Developers are not aware of this and are being used in the system, so there are significant limitations. Therefore, in this paper, the Snort detection policy technology is explained based on the results of analyzing and researching the Log4j vulnerability. As a security expert, I have expectations to be able to respond effectively to the vulnerability.

Log4j Vulnerability Analysis and Detection Pattern Production …

167

2 Theoretical Background 2.1 Log4j Vulnerability Log4j stands for Log For Java, and it is one of the projects for Java in JakartaProject. Log4j is one of several Java logging frameworks developed by the Apache Foundation. It is a Java-based logging utility used to log all Java-based programs and multiple servers for maintenance and management. As it is an open source that is essential for managing all Java-based servers and programs, it is an open source software developed by most IT companies such as Apple, Amazon, and Twitter, as well as global companies such as companies running websites and government agencies, known to be affected. In November of last year, the first vulnerability was discovered in the very widely used open source library Log4j, and security updates began to be released in December. Unlike other vulnerabilities previously discovered, CVSS (Common Vulnerability Scoring System) is set to 10 because remote code execution is possible with just one line of code in the payload sending a request, and it can execute malicious files or penetrate the system, was selected More than 80% of the code used for application development comes from open source libraries, but it can be said that this was caused by not being supervised during the development stage. As the proportion of open source libraries is so large, there is a high possibility that it will become a passageway for infringement accidents. The scope of this Log4j vulnerability is very wide, and it takes a lot of time and effort to identify whether the library is used and version information in the company, and it is difficult to patch without damaging others because they are dependent on each other due to the nature of the open source. In addition, in the case of introduced third-party products, there is a problem of having to wait until the corresponding vendor provides a security update. Currently, several companies and product companies are sending update advisory emails to major institutions, companies, and member companies, or providing continuous guidance such as FAQs so that they can understand Log4j vulnerabilities and take necessary actions, and whether or not to use the vulnerable Log4j program. Scanners that can be checked easily are distributed to each company. The Log4j vulnerability is a vulnerability that occurs in the Java Naming and Directory Interface (JNDI) used for configuration, log messages and parameters in Log4j. An attacker can exploit the Lookup function to obtain arbitrary information loaded into the attacker’s Lightweight Directory Access Protocol (LDAP) server. code can be executed (Fig. 2). JNDI is Java to provide one common interface to interact with different naming and directory services such as Remote Method (RMI), LDAP, Active Directory, Domain Name Service (DNS), Common Object Request Broker Architecture (CORBA), etc. It is a base interface, and uses Java serialization for naming or binding to a Java object of the directory service to get the object in units of bytes. A Naming Service here is an entity made up of a name and a value, also called “Bindings”, which can

168

W. Park and I. Lee

Fig. 2 JNDI structure

be used to find objects based on their names using “Search” or “Lookup” operations. A directory service is a special type of naming service that can store and retrieve directory objects. Starting with the CVE-2021-44,228 vulnerability, which opened the door to the Log4j vulnerability, new vulnerabilities related to security updates continue to be discovered. In this paper, several vulnerabilities emerged after the announcement of the Log4j vulnerability, but CVE-2021-44,228, which had the highest CVSS of 10, will be explained.

3 Research Method 3.1 Attack Configuration Analysis Looking at the attack configuration in Fig. 3, unlike a normal user’s request, the attacker makes a GET method request to a vulnerable server and requests the directory service with the attacker’s server in the User-Agent value of the header and the exploit code as JNDI payload. At this time, the requested part can be requested not only in the User-Agent value but also in other HTTP header parts such as referer and XForward-ip, and the request can be made by inserting a JNDI query anywhere in the log. A web server with a vulnerable version of Log4j responds to the request and requests the User-Agent value from the attacker server. The attacker server

Log4j Vulnerability Analysis and Detection Pattern Production …

169

Fig. 3 CVE-2021-44,228 attack configuration diagram

creates and responds to a malicious Java program file, and again the vulnerable web server requests the created malicious Java program file. After that, the malicious Java program in the attacker’s server may be executed to perform remote code execution (RCE), or various attacks intended by the attacker may be executed.

3.2 Detection Pattern The initial attack path of the Log4j vulnerability started from ${jndi:ldap://attacker_server/exploit} and started to attack by modifying the service provider of the JNDI registry. Also, as shown in Fig. 4, use the lower and upper functions that change the case, use the “-” sign that returns the default value, or use the non-existent environment variables such as “env” and “date”. Finding, has been transformed into methods of obfuscating “jndi” and “ldap” strings to evade detection policies, such as using them to print the following word characters. In addition, as shown in Fig. 4, the attacker opens a reverse shell on the C2 server by encoding the string behind /Basic/Command/Base64 behind /Basic/Command/Base64 behind the attacker server, downloads the spearhead attack script, and executes it, “burpcollaborator.net” This involves sending a “DNS” beacon to the server to execute commands directly with a payload that checks if the server is vulnerable [7]. As such, WAF/IPS is an effective measure to prevent attacks from outside, but attackers are trying to bypass it by continuously changing the syntax to avoid it [8].

170

W. Park and I. Lee

Fig. 4 Detection bypass attack technique

4 Countermeasures of Log4j Vulnerability 4.1 Threat IP Blocking After the Log4j vulnerability was disclosed, a number of representative attack site IPs and distribution sites with malicious exploit files were announced as shown in Table 1. Corporate security officers refer to the following attack destination IP, distribution destination IP, and OSINT (Open Source Intelligence) reputation inquiry and malicious influence of IPs with multiple threats [8]. This could be one of the countermeasures [9]. One thing to note is that the IP bands of cloud and hosting companies related to work are highly likely to be used by hackers, and it is also found that not only hackers attack, but also scanning at institutions majoring in or researching cyber security. Therefore, it is necessary to proceed while checking the ASN Name or Org Name in OSINT so as not to interfere with work when blocking, and whether or not there is work relevance.

Log4j Vulnerability Analysis and Detection Pattern Production …

171

Table 1 Incident indicator IP Spread IP

Attack IP 89.163.243.88

DE

http://62.210.130.250/lh.sh

62.210.130.250

FR

45.137.155.55

UK

http://45.130.229.168:1389/Exploit

45.130.229.168

SG

195.54.160.149

RU

http://45.137.155.55/xmrig.exe

45.137.155.55

UK

218.22.21.22

CN

http://185.154.53.140/get

185.154.53.140

RU

104.248.128.115

DE

http[:]//80.71.158[.]12/kinsing

80.71.158.12

UK

45.153.160.140

NL

http://164.52.212[.]196/st.sh

164.52.212.196

IN

107.189.7.175

LU

http://165.22.213.147:7777/backdoor.sh

165.22.213.147

IN

185.220.103.113

NL

http://212.96.189.52/mvt/bash

212.96.189.52

CZ

194.195.246.87

DE

http://45.153.240.94:1389/drydat

45.153.240.94

DE

206.29.176.51

US

http://185.8.172.132:1389/a

185.8.172.132

IR

4.2 Security Update of Log4j At the time of discovery of the vulnerability, it was the most cumbersome problem to understand and respond to the usage and version information of Log4j in each company [10]. It is difficult for security personnel to respond one by one, so several security companies have developed and distributed the Log4j vulnerability scanner for free. In addition, you can check the usage and version information through the basic search function for each operating system. If a vulnerable version is identified after scanning, security personnel check each version and remove the JndiLookup class in the case of Log4j 2.0-beta9 ~ 2.10.0 or if the update is inevitably not possible, Log4j 2.10 ~ 2.14.1 In this case, you must update to the latest version of Log4j 2.15.0 (Table 2). Table 2 Log4j vulnerability versions CVE

Affected versions

Patch versions

CVE-2021-4104

Log4j 1.x

Update to Log4j2.x recommended

CVE-2021-44,228

Log4j 2.0-beta9 ~ 2.14.1

Log4j 2.15.0

CVE-2021-45,046

Log4j 2.0-beta9 ~ 2.12.1 Log4j 2.13.0 ~ 2.15. 0

Log4j 2.16.0

CVE-2021-45,105

Log4j 2.0-alpha1 ~ 2.16.0 (Except 2.12.3)

Log4j 2.17.0

CVE-2021-44,832

Log4j 2.0-beta7 ~ 2.17.0

Log4j 2.17.1 2.12.4, 2.3.2

172

W. Park and I. Lee

Table 3 PCRE code of snort detection IPS, WAF detection policy ${jndi:ldap:/” ${jndi:rmi:/” ${jndi:ldaps:/” ${jndi:dns:/” ${jndi:iiop:/” ${jndi:http:/” ${jndi:nis:/” ${jndi:nds:/” ${jndi:corba:/” Bypass detection policy $%7Bjndi:” %2524%257Bjndi” %2F%252,524%25,257Bjndi%3A” ${jndi:${lower:” ${::-j}${” ${${env:BARFOO:-j}” ${::-l}${::-d}${::-a}${::-p}” ${base64:JHtqbmRp”

4.3 Snort Detection Policy After an attack is announced, various security companies and expert have to update their security equipment with additional attack phrases to bypass the detection policy, starting with the initially discovered attack payload. The bypass detection techniques disclosed so far include the above-mentioned obfuscation attack, as in the bypass detection policy in Table 3, it is a JNDI query after all, but with double encoding to avoid inspection to bypass the detection policy, and to create an attack variant that does not contain the string “JNDI”, they refer to the documentation guidelines of Log4j 2 Lookup and abuse it.

5 Conclusions In this paper, the Log4j vulnerability, which is called the worst vulnerability in history, was studied. As mentioned earlier, as the industry as a whole is converging with IT, the use of open source is on the rise. As the use of open source increases, the proportion of open source in developing programs is also increasing. For wellknown vulnerabilities and security threats such as the OWASP TOP 10, it can be

Log4j Vulnerability Analysis and Detection Pattern Production …

173

seen that the security process such as secure coding is generally well performed, but the vulnerability check for open source is insufficient. Since the possibility of discovering vulnerabilities in other open sources is high, it is not an exaggeration to say that this Log4j vulnerability is the beginning of an open source vulnerability. Attacks to bypass the currently public detection rules are continuously being attempted, and related new vulnerabilities are expected to continue to emerge in the future. Therefore, security personnel should share and block representative attack destination IPs to reduce the risk, and continuously update the detection rule of security equipment. While the Log4j library is widely used around the world, there has been no case of successful attack, information theft or breach, and IoC for host analysis is not disclosed. There are difficulties. It is necessary to conduct detailed outbound communication inspection of servers that have already penetrated, centering on important systems within the company, and through continuous monitoring, it is necessary to check from various viewpoints, such as whether unusual external communication occurs and whether abnormal processes exist. In addition, in order to prevent another major confusion from repeating because this incident has exposed the security vulnerabilities for the use of open source, a process for checking open source vulnerabilities should be introduced from the stage of code development for program development [11]. At the beginning of the introduction, there will be difficulties in the transition period, but if the inspection process such as input value verification for open source starts to be introduced little by little, awareness will be solidified like the OWASP TOP10, and the possibility of the worst case will be lowered.

References 1. Kim, J.-H.: Complete conquest of open source SW and shared works Open source also has copyright (2018). https://www.etnews.com/20180619000087?m=1 2. Ryu, S.-M.: Open source vulnerabilities analysis. In: Proceedings of the Korea Information Processing Society Conference, pp. 149–151 (2019) 3. KrCERT. Log4j threat response report (2022). https://www.krcert.or.kr/data/reportView.do? bulletin_writing_sequence=36476 4. Jun, K.: The worst security vulnerability ‘Log4j’, remaining issues and challenges for security officers (2022). https://www.boannews.com/media/view.asp?idx=105394&page=1&kind=1 5. Apache. https://logging.apache.org/log4j/2.x/security.html. (2022) 6. IglooCorporation. Apache log4j vulnerability analysis and countermeasures (2022). http:// www.igloosec.co.kr/BLOG_Apache%20Log4j%20%EC%B7%A8%EC%95%BD%EC% A0%90%20%EB%B6%84%EC%84%9D%20%EB%B0%8F%20%EB%8C%80%EC% 9D%91%EB%B0%A9%EC%95%88?bbsCateId=1 7. Kozmer. https://github.com/kozmer/log4j-shell-poc. (2022) 8. Akamai, Application Security Threat Research Team. Threat intelligence for log4j CVE: key findings and related content (2021). https://www.akamai.com/ko/blog/security/threat-intellige nce-on-log4j-cve-key-findings-and-their-implications 9. All4Chip. S2W, Announcement of log4j security vulnerability countermeasures—analysis report (2021). https://all4chip.com/archive/news_view.php?no=13654

174

W. Park and I. Lee

10. Won, B.-C.: https://www.boannews.com/media/view.asp?idx=103344. (2021) 11. Lim, Y.-K.: The second log 4j incident is coming again… Open source inspection should be done in the development (2022). https://zdnet.co.kr/view/?no=20220310162310

A Study on Technology Innovation at Incheon International Airport: Focusing on RAISA Seo Young Kim and Min Seo Park

Abstract Airports are implementing innovative technologies represented as Information and Communication Technologies (ICT) to prepare for increase of customer numbers and provide better experience in post-corona era. This study aims to explore the role of technologies recently introduced in airports and the benefits of such technology-driven transition. To achieve the purpose, this study focuses on technologies of robot, AI and service automation (RAISA) along service process from the perspective of airport customers. For the analysis, the best practice of airport in South Korea is selected and provide basic data for RAISA utilization. Through the case analysis, this study categorizes the core technologies utilized in airports and explores the advantages, for instance attaining process efficiency and providing customer convenience from RAISA. In conclusion, despite suffering from COVID19 pandemic, it can be a possible opportunity to restructure and reorganize the airport service process through implementation of RAISA. Keywords Technology innovation · Digital transition · Airport best practice

1 Introduction Since World Health Organization (WHO) declared the COVID-19 pandemic, there have been major impacts in daily life and economy of the globe. Air transport has been so severely affected by the pandemic that air travels have been closed by governments [1]. Such shut-down led to the sharp fall of numbers of customers and revenues for airports. Recently, Airport Council International (ACI) World published analysis and assessment regarding the economic impact of the COVID-19 pandemic to airports across the world. In the report, the global passenger traffic in 2021 was 4.4 billion S. Y. Kim (B) · M. S. Park College of Business Administration, Inha University, Incheon, South Korea e-mail: [email protected] M. S. Park e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9_15

175

176

S. Y. Kim and M. S. Park

Fig. 1 Quarterly global paasenger traffic projection compared to pre-COVID-19 forecast (2019– 2022, in billons of passengers)

which is only at 48.3% of 2019 whereas airport revenue is expected to be reduced by 45.2% comparing to 2019 [2]. Despite of such dramatic economic impact on airport occurred and on-going COVID-19 pandemic, the projection on global passenger indicates future recovery on global passenger traffic to some extent as Fig. 1 [2]. Nevertheless, such positive scenario is going under the pre-COVID-19 forecast on global passenger traffic, it would provide airports a signal to be brace for post-COVID-19 era. According to ACI World, technology would provide priorities for airports preparing pre-COVID19 era and airports have been showed accelerating investment and improvement on technologies [3, 4]. Along the trend in airports, the study aims to conduct a case analysis in order to explore the role of technologies recently introduced in airports and figure out the manner technologies improve services at airports. The study selected Incheon International Airport of South Korea as the target case while a “Four-stage Construction Project”, a project to improve airport itself more green, smart and safe based on cutting edge technologies, is in progress. In this study, technologies mainly focused are categorized into Robot, Artificial Intelligence and Service Automation (RAISA) while such categories are emerging in prior studies regarding service industry. It is expected that this study provides the basic data for RAISA utilization and the way RAISA improves service efficiency and experience.

A Study on Technology Innovation at Incheon International Airport: …

177

2 Theoretical Background 2.1 Robot, Artificial Intelligence and Service Automation (RAISA) RAISA technologies indicate technologies of robots, artificial intelligence and service automation and RAISA has entered our daily lives. RAISA provides improvements on operation processes, cost optimization, customer experience and service capacity to various industries such as education, manufacturing, transportation, and hospitality [5]. The main example of RAISA can be front desk robots and delivery robots for robot, chatbots and AI platform for AI and kiosks for service automation [6]. Numbers of prior studies tried to explore pros and cons of RAISA introduced in several service sectors [7–9]. The main benefits of RAISA can be found as improved work efficiency from automated service provided by robots, providing customers convenience with completing orders with simple controls on kiosks or vending machines and better understanding on customers with AI analysis. On the other hand, the application of RAISA would induce replacement of human functions which would lead to human employees and lack of emotional bond with customers during the service process. However, the tendency of RAISA application is expected to continue, and such digital transition has been occurring in airports nowadays as well.

2.2 RAISA at Airports The advance of technology offers airports an opportunity to reconsider their business and operational processes, especially regarding recent safety and health issues under pandemic [4]. According to a survey conducted by ACL World and SITA, airports have been agile in adapting to health and safety requirement with increasing investment on IT representative of automated and touchless solutions for internal and external customers [3]. The major technologies at airports under category of RAISA are shown as below (Table 1): AI adopted by airports is utilized as a business intelligence and also allowing robots and automated services provide useful functions to customers. Robots at airports can be distinguished from traditional self-service technologies (SSTs), for instance kiosks, while robots provide more flexible, customized and personalized services through the process very similar with human employee would [12]. On the other hand, service automation plays an important role at airports by improving customer experience, increasing service efficiency and ensuring security [12]. Moreover, the adoption of RAISA at airports brings great value under COVID-19 pandemic. Disinfection and sterilization robots and also a kind of ambulance robot

178

S. Y. Kim and M. S. Park

Table 1 Comparison of input and output variables in previous research Reference

Input variables

Robot

Humanoid service robot (e.g., robot-guide) Mobility robot (e.g., self-driving cars)

Artificial intelligence

Bag tracking system Baggage checking system Chatbots

Service automation

Self-service (e.g., kiosks or mobile for check-in) Self-boarding gates (e.g., biometric) Self bag-drop

Source from [4, 10, 11]

placed in airports ensure safety and health for airport staff and users by offering disinfection function and supplying medical tools automatically [13]. This study explores circumstances which airports are facing under COVID-19 and the manner RAISA is operated upto this chapter. Next, the study explains the process of case selection and analyse the case in order to explore the role of RAISA in airports.

3 Case Analysis 3.1 Case Selection The study selected Incheon International Airport, South Korea as a case to be analyzed while this airport already announced provision based on digital transition and a huge project is under proceeding. In 2017, Incheon International Airport started a “Four-stage Construction Project” comprising expansion of terminal, construction of a new airstrip and other surrounding facilities. The project aims to expand the travel capacity and provide better value and experience to travelers. Numbers of cutting-edge Information Communication Technologies (ICT) are introduced into the airport as part of the project and the project is expected to finish in 2024. Moreover, the airport recently announced “New Vision of 2030” and presented technology-driven future direction to reduce time for departure, provide various experiences to customers with augmented reality (AR) and virtual reality (VR), and the shift to energy-friendly airport as a new ESG platform. Accompanying ongoing technology-driven innovation, Incheon International Airport is introducing RAISA technologies across the whole service process from the ticketing to the departure. Therefore, the study selected Incheon International Airport as the case analyzed and all information in respect to cases is collected from official website of Incheon International Airport.

A Study on Technology Innovation at Incheon International Airport: …

179

3.2 RAISA at Incheon International Airport IN this chapter, the case of RAISA in Incheon International Airport will be introduced and analyzed along the procedure of the customer uses RAISA technologies from the ticketing to the departure. The study set the procedure into 7 stages stem from the framework suggested by Choi [14] as Fig. 2. In this study, the procedure inserted one more stage at the beginning to reflect the manner RAISA functions on customer experience before reaching at the airport. To sum up, the whole procedure suggested and RAISA related to each stage is shown as Table 2. Before a customer reach Incheon International Airport, it is able to check various information on the flight, immigration, baggage, shopping and transportation through a chatbot named “Airbot”. Such chatbot is expected to ensure customer convenience with improved efficiency and effectiveness of information search [15] (Fig. 3). Moreover, Incheon Airport is also offering a mobile application named “Incheon Airport Guide App”. This application provides more timely information to customers by offering real-time information on traffic navigation to the airport and congestion of parking area and departure hall. This application would allow customers to have useful information before and on the way to the airport. The chatbot and application, certainly, can be accessed in any place with Internet (Fig. 4). In the check-in stage, Incheon Airport collaborates with airline to offer an automated pre-check-in service allowing customers complete check-in procedure via website and mobile application in advance. User of such service is able to directly

Fig. 2 Departure procedures post COVID-19 [14]

Table 2 RAISA of Incheon international airport along departure procedure Reference

Input variables

Before airport

Chatbot (“Airbot”) Mobile application (“Incheon Airport Guide App”)

Self-examination

Chatbot

Check-in

Mobile and web check-in Check-in Kiosk self bag drop

Departure hall

Robots for guide, disinfection, cleaning, cart

Security check

“Ai-driven X-ray auto screen system”

Immigration

Biometric immigration (“Smart Entry Service”)

Departure gate

Robots for guide, disinfection, cleaning, cart

Boarding

None

180

S. Y. Kim and M. S. Park

Fig. 3 Interface of airbot. https://airbot.kr/airBot/airbot.do?type=pc

Fig. 4 Interface of Incheon airport guide application [21]

head to security check [16]. The airport is also equipped with check-in kiosk for customers not using pre-check-in service. Moreover, kiosks are provided in selfcheck-in zone for self-bag drop. Such automated check-in procedure is expected to reduce waiting time and congestion and allow airport to accommodate more customers than traditional way. In many sectors at airport, especially at the departure hall, Incheon Airport is operating various robots to enhance customer convenience, health and safety. Guide robots named Airstar is a type of guide robot providing information on location of facilities, flights, congestion, and travel-related regulations as well as guiding to a place customers desire to visit. The second robot the airport offers is disinfection and sterilization robot. This robot is designed to check real-time issue and event in airport and automatically conduct disinfection work along the route from departure gate to quarantine zone. Except robots for guiding and disinfection, Incheon Airport have also implemented cleaning robots, baggage handling robots, and serving robots (Fig. 5).

A Study on Technology Innovation at Incheon International Airport: …

181

Fig. 5 The guide robot in Incheon airport. https://www.airport.kr/co/ko/cmm/cmmBbsView. do?PAGEINDEX=1&SEARCH_STR=&FNCT_CODE=121&SEARCH_TYPE=&SEARCH_ FROM=2017.05.11&SEARCH_TO=2018.05.11&NTT_ID=23197

There has been service automation provided in both stage of security check and immigration. For the security check, the Incheon International Airport introduced an AI-driven X-ray auto screening system. In the system, AI automatically screen the Xray image of the baggage and check whether any prohibited items are contained. Yet, the system is not perfectly perceiving all type of items, but it would be developed in the future by improving the X-ray image quality and perception capacity after deep-learning. On the other hand, an automated immigration system is introduced in the airport. Now, kiosks equipped along the immigration route scan customers’ passport, face, and fingerprint to perceive identification. The system nowadays is 4th generation and the time needed for traditional face-to-face could be reduced from 30 min to an hour. However, the airport also established a roadmap for brand-new auto identification system called SmartPass. This new system will be operated with face registration, identification machine and sharing system. Reaching at airport, customers could register their own facial information on face identification machine. After that, the identification information will be shared throughout all facilities and all procedure until boarding. Hence, customers would not be required to have their passport or tickets checked while each stage will be equipped with automatic facial scanning system on the route to the boarding gate. Incheon International Airport expects time spent on passenger procedure will be decreased by 10%. In addition, this non-faceto-face and automatic system would provide customer improved convenience and reduction on waiting time and also the operation of airport could be more efficient.

182

S. Y. Kim and M. S. Park

3.3 Discussion The implementation of RAISA at the airport can be expected to change customers’ understanding on airport service experience and service quality. Applying technologies regarding RAISA in Incheon Airport differs the current service from the traditional manner by making customers engage in emotional and intellectual way rather than personal interaction with staff [17]. While customers’ experience on service is a major component of service quality in hospitality industry [18], RAISA is expected to provide a memorable experience by offering service based on a set of brand-new technologies and it would positively impact on perceived service quality. In addition, robots, AI, and technologies on service automation can be considered as a process innovation providing efficiency, reliability, and speed of service, and these factors also place customers in positive impression on perceived service quality [17]. RAISA at Incheon Airport consists of chatbots, robots, and automated security check and immigration process supported by AI. Chatbots provide reliable and standardized information at more efficient manner than traditional personal interaction such as through information desk and phone call. The reliability and way to interact compose the service quality of chatbot and such quality is considered as positive antecedent factors to customers’ relational, emotional experience and intention to re-visit [19]. Regarding robots and AI at the airport, they could be considered as an additional process innovation, while these technologies provide a new experience, standardized service, efficient assist on using facilities at the airport. In addition, cases like disinfection robot could improve health and safety issues through service process at the airport. In conclusion, RAISA at Incheon Airport is applied into each stage of service process. Those technologies in cases bring about process innovation by creating value from reliability, convenience, homogeneity, and efficiency, thus the implementation of RAISA could have a positive impact on service experience and perceived quality.

4 Conclusion

This study explores RAISA in airports, where the implementation of ICT is considered an alternative path toward a return to normality after COVID-19. The case of Incheon International Airport was selected as a best practice of applying RAISA. As the case analysis shows, the airport has introduced robots, AI, and service automation throughout the departure procedure. The results show that these technologies offer a new experience that is more standardized, convenient, reliable, and efficient than the traditional manner. This implies that creating process innovation by applying RAISA at airports can serve as preparation for the post-COVID era, bringing a number of advantages and offering customers improved perceived service quality.

However, this study is limited in that its results are derived from the analysis of a single case, focus mainly on the departure procedure, and lack empirical exploration. These limitations could be addressed by future research. A future empirical study could explore factors related to RAISA, for instance by deriving quality dimensions of RAISA at airports or by examining perceived risk and benefit factors for RAISA and how those factors affect customers' intentions, thereby extending the results of this study. In addition, future empirical work could examine not only the advantages of RAISA identified here but also its drawbacks regarding emotional interaction noted in prior research [20].

References

1. Twinn, I., Qureshi, N., Rojas, D.S.P., Conde, M.L.: The impact of COVID-19 on airports: an analysis. IFC 1–6 (2020)
2. ACI World official website: The impact of COVID-19 on the airport business and the path to recovery (2021). Access: 13 Mar 2022. https://aci.aero/2021/03/25/the-impact-of-covid-19on-the-airport-business-and-the-path-to-recovery/
3. ACI World official website: Airports invest in technology to advance industry recovery (2021). Access: 13 Mar 2022. https://aci.aero/2021/03/11/airports-invest-in-technology-to-advanceindustry-recovery/
4. Shallow, B.: Airport technology priorities in a time of pandemic. ACI Insights (2021). Access: 13 Mar 2022. https://blog.aci.aero/airport-technology-priorities-in-a-time-of-pandemic/
5. Ivanov, S.H., Webster, C.: Adoption of robots, artificial intelligence and service automation by travel, tourism and hospitality companies—a cost-benefit analysis. Sofia University (2017)
6. Lukanova, G., Ilieva, G.: Robots, artificial intelligence and service automation in hotel. In: Robots, Artificial Intelligence, and Service Automation in Travel, Tourism and Hospitality. Emerald Publishing Limited (2019)
7. Buhalis, D., Harwood, T., Bogicevic, V., Viglia, G., Beldona, S., Hofacker, C.: Technological disruptions in services: lessons from tourism and hospitality. J. Serv. Manag. 30(4), 484–506 (2019)
8. Ivanov, S., Gretzel, U., Berezina, K., Sigala, M., Webster, C.: Progress on robotics in hospitality and tourism: a review of the literature. J. Hosp. Tour. Technol. 10(4), 489–521 (2019)
9. Samala, N., Katkam, B.S., Bellamkonda, R.S., Rodriguez, R.V.: Impact of AI and robotics in the tourism sector: a critical insight. J. Tourism Futures 8(1), 73–87 (2022). Emerald Publishing Limited
10. SITA: 2020 Air transport IT insights (2021). Access: 17 Mar 2022. https://www.sita.aero/resources/surveys-reports/air-transport-it-insights-2020/
11. Ivanov, S., Hristov, S., Webster, C., Berezina, K.: Adoption of robots and service automation by tourism and hospitality companies. Revista Turismo & Desenvolvimento 27(28), 1501–1517 (2017)
12. Wirtz, J., Kunz, W., Paluch, S.: The service revolution, intelligent automation and service robots. Eur. Bus. Rev. 29(5), 38–44 (2021)
13. Zeng, Z., Chen, P.J., Lew, A.A.: From high-touch to high-tech: COVID-19 drives robotics adoption. Tour. Geogr. 22(3), 724–734 (2020)
14. Choi, J.H.: Changes in airport operating procedures and implications for airport strategies post-COVID-19. J. Air Transp. Manag. 94, 1–13 (2021)
15. Ashfaq, M., Yun, J., Yu, S., Loureiro, S.M.C.: Chatbot: modeling the determinants of users’ satisfaction and continuance intention of AI-powered service agents. Telematics Inform. 54, 101473 (2020)
16. Korean Air: More convenient travel abroad with smart check-in. Korean Air News Room (2019). Access: 19 Mar 2022
17. Naumov, N.: The impact of robots, artificial intelligence, and service automation on service quality and service experience in hospitality. In: Robots, Artificial Intelligence, and Service Automation in Travel, Tourism and Hospitality. Emerald Publishing Limited (2019)
18. Choi, J.H.: Interrelationships among perceived service quality, customer attitudes, satisfaction and revisit intention in hotel service encounters. J. Tourism Leisure Res. 15(2), 59–77 (2003)
19. Kim, S.J., Park, C.: The effect of chatbot characteristics on customer experience and intention to reuse. Acad. Customer Satisfaction Manag. 23(1), 119–142 (2021)
20. Huang, M.H., Rust, R.T.: Artificial intelligence in service. J. Serv. Res. 21(2), 155–172 (2018)
21. Jeong, B.W.: A new version of customized ‘Incheon Airport Guide Application’ is newly released. Dae Han News (2015)

Author Index

C
Choi, Daesoo, 153
Choi, Jun Hyuk, 107
Choi, Weonsun, 1, 29

G
Gim, Gwangyong, 1, 15, 29, 43
Gim, Simon, 55

H
Han, Chungku, 43
Han, Sung-Hwa, 83, 95

I
Im, Eun Tack, 55

J
Jeon, Beonghwa, 29

K
Kim, Jaewook, 71
Kim, Min Su, 145
Kim, Seo Young, 175
Kim, Youngsoo, 43
Kwon, Hun Yeong, 107, 133

L
Lee, IlHyun, 165
Lee, Kyunghyun, 1, 29
Lee, Myoungho, 71
Lee, Myungho, 15
Lim, Ji Hun, 107

P
Park, Hyeonju, 43
Park, Jihun, 15
Park, Min Seo, 175
Park, WonHyung, 165
Pil, Yoon Sang, 119

S
Seo, Jeong Eun, 107, 133
So, Jaeyoung, 15, 71
Sung, Yoonje, 1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 1075, https://doi.org/10.1007/978-3-031-19608-9
