Parallel and Distributed Processing Techniques and Applications [1 ed.] 9781683925781, 9781601325082

Proceedings of the 2019 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'19)


WORLDCOMP’19

PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON PARALLEL & DISTRIBUTED PROCESSING TECHNIQUES & APPLICATIONS

Parallel and Distributed Processing Techniques and Applications

PDPTA’19
Editors: Hamid R. Arabnia, Kazuki Joe, Hayaru Shouno, Fernando G. Tinetti

U.S. $159.95 ISBN 9781601325082


Publication of the 2019 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE’19) July 29 - August 01, 2019 | Las Vegas, Nevada, USA https://americancse.org/events/csce2019

Copyright © 2019 CSREA Press


This volume contains papers presented at the 2019 International Conference on Parallel & Distributed Processing Techniques & Applications. Their inclusion in this publication does not necessarily constitute endorsements by editors or by the publisher.

Copyright and Reprint Permission
Copying without a fee is permitted provided that the copies are not made or distributed for direct commercial advantage, and credit to source is given. Abstracting is permitted with credit to the source. Please contact the publisher for other copying, reprint, or republication permission.

American Council on Science and Education (ACSE)

Copyright © 2019 CSREA Press
ISBN: 1-60132-508-8
Printed in the United States of America
https://americancse.org/events/csce2019/proceedings

Foreword

It gives us great pleasure to introduce this collection of papers to be presented at the 2019 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'19), July 29 - August 1, 2019, at the Luxor Hotel (a property of MGM Resorts International), Las Vegas, USA. The preliminary edition of this book (available in July 2019 for distribution on site at the conference) includes only a small subset of the accepted research articles. The final edition (available in August 2019) will include all accepted research articles. This is due to deadline extension requests received from most authors who wished to continue enhancing the write-up of their papers (by incorporating the referees' suggestions). The final edition of the proceedings will be made available at https://americancse.org/events/csce2019/proceedings.

An important mission of the World Congress in Computer Science, Computer Engineering, and Applied Computing, CSCE (the federated congress with which this conference is affiliated), includes "Providing a unique platform for a diverse community of constituents composed of scholars, researchers, developers, educators, and practitioners. The Congress makes concerted effort to reach out to participants affiliated with diverse entities (such as: universities, institutions, corporations, government agencies, and research centers/labs) from all over the world. The congress also attempts to connect participants from institutions that have teaching as their main mission with those who are affiliated with institutions that have research as their main mission. The congress uses a quota system to achieve its institution and geography diversity objectives." By any definition of diversity, this congress is among the most diverse scientific meetings in the USA. We are proud to report that this federated congress has authors and participants from 57 different nations, representing a variety of personal and scientific experiences that arise from differences in culture and values. As can be seen below, the program committee of this conference, as well as the program committees of all other tracks of the federated congress, are as diverse as its authors and participants.

The program committee would like to thank all those who submitted papers for consideration. About 65% of the submissions were from outside the United States. Each submitted paper was peer-reviewed by two experts in the field for originality, significance, clarity, impact, and soundness. In cases of contradictory recommendations, a member of the conference program committee was charged to make the final decision; often, this involved seeking help from additional referees. In addition, papers whose authors included a member of the conference program committee were evaluated using the double-blind review process. One exception to the above evaluation process was for papers that were submitted directly to chairs/organizers of pre-approved sessions/workshops; in these cases, the chairs/organizers were responsible for the evaluation of such submissions. The overall acceptance rate for regular papers was 23%; 19% of the remaining papers were accepted as poster papers (at the time of this writing, we had not yet received the acceptance rates for a couple of individual tracks). We are very grateful to the many colleagues who offered their services in organizing the conference.
In particular, we would like to thank the members of the Program Committee of PDPTA'19, members of the congress Steering Committee, and members of the committees of federated congress tracks that have topics within the scope of PDPTA. Many individuals listed below will be requested after the conference to provide their expertise and services for selecting papers for publication (extended versions) in journal special issues as well as for publication in a set of research books (to be prepared for publishers including Springer, Elsevier, BMC journals, and others).

- Prof. Abbas M. Al-Bakry (Congress Steering Committee); University President, University of IT and Communications, Baghdad, Iraq
- Prof. Emeritus Nizar Al-Holou (Congress Steering Committee); Professor and Chair, Electrical and Computer Engineering Department; Vice Chair, IEEE/SEM-Computer Chapter; University of Detroit Mercy, Detroit, Michigan, USA
- Prof. Hamid R. Arabnia (Congress Steering Committee); Graduate Program Director (PhD, MS, MAMS); The University of Georgia, USA; Editor-in-Chief, Journal of Supercomputing (Springer); Fellow, Center of Excellence in Terrorism, Resilience, Intelligence & Organized Crime Research (CENTRIC)
- Dr. P. Balasubramanian; School of Computer Science and Engineering, Nanyang Technological University, Singapore
- Prof. Dr. Juan-Vicente Capella-Hernandez; Universitat Politecnica de Valencia (UPV), Department of Computer Engineering (DISCA), Valencia, Spain
- Prof. Juan Jose Martinez Castillo; Director, The Acantelys Alan Turing Nikola Tesla Research Group and GIPEB, Universidad Nacional Abierta, Venezuela
- Prof. Emeritus Kevin Daimi (Congress Steering Committee); Director, Computer Science and Software Engineering Programs, Department of Mathematics, Computer Science and Software Engineering, University of Detroit Mercy, Detroit, Michigan, USA
- Prof. Leonidas Deligiannidis (Congress Steering Committee); Department of Computer Information Systems, Wentworth Institute of Technology, Boston, Massachusetts, USA; Visiting Professor, MIT, USA
- Prof. Mary Mehrnoosh Eshaghian-Wilner (Congress Steering Committee); Professor of Engineering Practice, University of Southern California, California, USA; Adjunct Professor, Electrical Engineering, University of California Los Angeles (UCLA), California, USA
- Prof. George A. Gravvanis (Congress Steering Committee); Director, Physics Laboratory & Head of Advanced Scientific Computing, Applied Math & Applications Research Group; Professor of Applied Mathematics and Numerical Computing, Department of ECE, School of Engineering, Democritus University of Thrace, Xanthi, Greece
- Prof. Houcine Hassan; Department of Computer Engineering (Systems Data Processing and Computers), Universitat Politecnica de Valencia, Spain
- Prof. Hiroshi Ishii; Department Chair, Tokai University, Minato, Tokyo, Japan
- Prof. Makoto Iwata; School of Information, Kochi University of Technology, Kami, Kochi, Japan
- Prof. George Jandieri (Congress Steering Committee); Georgian Technical University, Tbilisi, Georgia; Chief Scientist, The Institute of Cybernetics, Georgian Academy of Science, Georgia; Editorial Member, International Journal of Microwaves and Optical Technology, The Open Atmospheric Science Journal, American Journal of Remote Sensing
- Prof. Kazuki Joe (Session Chair, PDPTA); Nara Women's University, Nara, Japan
- Prof. Byung-Gyu Kim (Congress Steering Committee); Multimedia Processing Communications Lab (MPCL), Department of Computer Science and Engineering, College of Engineering, SunMoon University, South Korea
- Prof. Tai-hoon Kim; School of Information and Computing Science, University of Tasmania, Australia
- Prof. Louie Lolong Lacatan; Chairperson, CE Department, College of Engineering, Adamson University, Manila, Philippines; Senior Member, International Association of Computer Science and Information Technology (IACSIT), Singapore; Member, International Association of Online Engineering (IAOE), Austria
- Prof. Dr. Guoming Lai; Computer Science and Technology, Sun Yat-Sen University, Guangzhou, P. R. China
- Prof. Hyo Jong Lee; Director, Center for Advanced Image and Information Technology, Division of Computer Science and Engineering, Chonbuk National University, South Korea
- Dr. Andrew Marsh (Congress Steering Committee); CEO, HoIP Telecom Ltd (Healthcare over Internet Protocol), UK; Secretary General of World Academy of BioMedical Sciences and Technologies (WABT), a UNESCO NGO, The United Nations
- Prof. Salahuddin Mohammad Masum; Computer Engineering Technology, Southwest Tennessee Community College, Memphis, Tennessee, USA
- Dr. Ali Mostafaeipour; Industrial Engineering Department, Yazd University, Yazd, Iran
- Prof. Hiroaki Nishikawa; Faculty of Engineering, Information and Systems, University of Tsukuba, Japan
- Prof. Dr., Eng. Robert Ehimen Okonigene (Congress Steering Committee); Department of Electrical & Electronics Engineering, Faculty of Engineering and Technology, Ambrose Alli University, Nigeria
- Prof. James J. (Jong Hyuk) Park (Congress Steering Committee); Department of Computer Science and Engineering (DCSE), SeoulTech, Korea; President, FTRA; EiC, HCIS Springer, JoC, IJITCC; Head of DCSE, SeoulTech, Korea
- Dr. Prantosh K. Paul; Department of CIS, Raiganj University, Raiganj, West Bengal, India
- Prof. Dr. R. Ponalagusamy; Department of Mathematics, National Institute of Technology, India
- Dr. Manik Sharma; Department of Computer Science and Applications, DAV University, Jalandhar, India
- Prof. Hayaru Shouno (Session Chair, PDPTA); The University of Electro-Communications, Japan
- Dr. Akash Singh (Congress Steering Committee); IBM Corporation, Sacramento, California, USA; Chartered Scientist, Science Council, UK; Fellow, British Computer Society; Senior Member, IEEE; Member, AACR, AAAS, and AAAI
- Ashu M. G. Solo (Publicity); Fellow, British Computer Society; Principal/R&D Engineer, Maverick Technologies America Inc.
- Prof. Fernando G. Tinetti (Congress Steering Committee); School of CS, Universidad Nacional de La Plata, La Plata, Argentina; also at Comision Investigaciones Cientificas de la Prov. de Bs. As., Argentina
- Prof. Hahanov Vladimir (Congress Steering Committee); Vice Rector and Dean of the Computer Engineering Faculty, Kharkov National University of Radio Electronics, Ukraine; Professor of Design Automation Department, Computer Engineering Faculty, Kharkov; IEEE Computer Society Golden Core Member
- Dr. Haoxiang Harry Wang (CSCE); Cornell University, Ithaca, New York, USA; Founder and Director, GoPerception Laboratory, New York, USA
- Prof. Shiuh-Jeng Wang (Congress Steering Committee); Director of Information Cryptology and Construction Laboratory (ICCL) and Director of Chinese Cryptology and Information Security Association (CCISA); Department of Information Management, Central Police University, Taoyuan, Taiwan; Guest Ed., IEEE Journal on Selected Areas in Communications
- Prof. Layne T. Watson (Congress Steering Committee); Fellow of IEEE; Fellow of The National Institute of Aerospace; Professor of Computer Science, Mathematics, and Aerospace and Ocean Engineering, Virginia Polytechnic Institute & State University, Blacksburg, Virginia, USA
- Prof. Jane You (Congress Steering Committee); Associate Head, Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong

We would like to extend our appreciation to the referees and the members of the program committees of individual sessions, tracks, and workshops; their names do not appear in this document but are listed on the web sites of individual tracks. As sponsors-at-large, partners, and/or organizers, each of the following (separated by semicolons) provided help for at least one track of the Congress: Computer Science Research, Education, and Applications Press (CSREA); US Chapter of World Academy of Science; American Council on Science & Education & Federated Research Council (http://www.americancse.org/). In addition, a number of university faculty members and their staff (names appear on the cover of the set of proceedings), several publishers of computer science and computer engineering books and journals, chapters and/or task forces of computer science associations/organizations from three regions, and developers of high-performance machines and systems provided significant help in organizing the conference as well as providing some resources. We are grateful to them all.

We express our gratitude to keynote, invited, individual conference/track, and tutorial speakers; the list of speakers appears on the conference web site. We would also like to thank the following: UCMSS (Universal Conference Management Systems & Support, California, USA) for managing all aspects of the conference; Dr. Tim Field of APC for coordinating and managing the printing of the proceedings; and the staff of the Luxor Hotel (Convention department) at Las Vegas for the professional service they provided. Last but not least, we would like to thank the Co-Editors of PDPTA'19: Prof. Hamid R. Arabnia, Prof. Kazuki Joe, Prof. Hayaru Shouno, and Prof. Fernando G. Tinetti.

We present the proceedings of PDPTA'19.

Steering Committee, 2019
http://americancse.org/

Contents

SESSION: PARALLEL AND DISTRIBUTED PROCESSING AND ALGORITHMS + PERFORMANCE ANALYSIS AND RELATED ISSUES

A Dynamic Adaptation Strategy for Energy-Efficient Keyframe-Based Visual SLAM
  Abdullah Khalufa, Graham Riley, Mikel Lujan ... 3

On the Distance and Spatial Complexity of Complete Visibility Algorithms for Oblivious Mobile Robots
  Rory Hector, Ramachandran Vaidyanathan ... 11

Enhancing Scheduling Robustness with Partial Task Completion Feedback and Resource Requirement Biasing
  Nicolas Grounds, John K. Antonio ... 19

Using Apache Spark for Distributed Computation on a Network of Workstations
  Jordan Koeller, Mark Lewis, David Pooley ... 26

Thread and Process Efficiency in Python
  Roger Eggen, Emeritus Maurice Eggen ... 32

An Actor-Based Runtime Environment for Heterogeneous Distributed Computing
  Ahmed Abdelmoamen Ahmed, Tochukwu Eze ... 37

SESSION: COMMUNICATION SYSTEMS: INTERCONNECTION NETWORKS AND TOPOLOGIES

Fault-tolerant Routing based on Fault Information of Cross-edges in Dual-cubes
  Nobuhiro Seki, Kousuke Mouri, Keiichi Kaneko ... 47

On the Construction of Optimal Node-Disjoint Paths in Folded Hypercubes of Even Dimensions
  Cheng-Nan Lai ... 54

The Weakly Dimension-Balanced Pancyclic on Tm,n for m or n Being Even and the Other Being Odd
  Ruei-Yu Wu, Zong-You Lai, Justie Su-Tzu Juan ... 59

SESSION: HPC AND NOVEL APPLICATIONS + DATA SCIENCE + OPTIMIZATION METHODS

High-Performance Host-Device Scheduling and Data-Transfer Minimization Techniques for Visualization of 3D Agent-Based Wound Healing Applications
  Nuttiiya Seekhao, Grace Yu, Samson Yuen, Joseph JaJa, Luc Mongeau, Nicole Y. K. Li-Jessen ... 69

PECT: A Program Energy Consumption Tuning Tool
  Cuijiao Fu, Depei Qian, Tianming Huang, Zhongzhi Luan ... 77

New Dynamic Warp Throttling Technique for High Performance GPUs
  Gwang Bok Kim, Jong Myon Kim, Cheol Hong Kim ... 83

Modified Data Aggregation for Aerial ViSAR Sensor Networks in Transform Domain
  Mohammad Reza Khosravi, Sadegh Samadi ... 87

SESSION: CLOUD COMPUTING, EDGE COMPUTING, AND APPLICATIONS

Dynamically Adjusting the Stale Synchronous Parallel Model in Edge Computing
  Sangsu Lee, Taeho Lee, Yili Wang, Hee Yong Youn ... 93

A Blockchain-based IoT Platform Integrated with Cloud Services
  Debrath Banerjee, Hai Jiang ... 100

SESSION: INTERNATIONAL WORKSHOP ON MATHEMATICAL MODELING AND PROBLEM SOLVING - MPS

Fast Bayesian Restoration of Poisson Corrupted Images with INLA
  Takahiro Kawashima, Hayaru Shouno ... 109

Joint Replenishment Policy in Multi-Product Inventory System using Branching Deep Q-Network with Reward Allocation
  Hiroshi Suetsugu, Yoshiaki Narusue, Hiroyuki Morikawa ... 115

Molecular Activity Prediction Using Graph Convolutional Deep Neural Network Considering Distance on a Molecular Graph
  Masahito Ohue, Ryota Ii, Keisuke Yanagisawa, Yutaka Akiyama ... 122

Acceleration of Machine Learning-based Sequence Alignment Generation for Homology Modeling
  Masato Narui, Shuichiro Makigaki, Takashi Ishida ... 129

Layout Analysis using Semantic Segmentation for Imperial Meeting Minutes
  Sayaka Iida, Yuki Takemoto, Yu Ishikawa, Masami Takata, Kazuki Joe ... 135

A Discrete Three-wave System of Kahan-Hirota-Kimura Type and the QRT Mapping
  Yuko Takae, Masami Takata, Kinji Kimura, Yoshimasa Nakamura ... 142

Improvement of the Thick-Restart Lanczos Method in Single Precision Floating Point Arithmetic using Givens Rotations
  Masana Aoki, Masami Takata, Kinji Kimura, Yoshimasa Nakamura ... 149

On an Implementation of Two-Sided Jacobi Method
  Sho Araki, Masami Takata, Kinji Kimura, Yoshimasa Nakamura ... 156

A Study on the Effects of Background in Oddball Tasks for a User's Motivation and Event-Related Potential
  Tadashi Koike, Tomohiro Yoshikawa, Takeshi Furuhashi ... 163

A Method to Acquire Multiple Satisfied Solutions
  Tomohiro Yoshikawa, Kouki Maruyama ... 169

Performance Evaluation of MEGADOCK Protein-Protein Interaction Prediction System Implemented with Distributed Containers on a Cloud Computing Environment
  Kento Aoyama, Yuki Yamamoto, Masahito Ohue, Yutaka Akiyama ... 175

Structure of Neural Network Automatically Generating Fonts for Early-Modern Japanese Printed Books
  Yuki Takemoto, Yu Ishikawa, Masami Takata, Kazuki Joe ... 182

Applying CNNs to Early-Modern Printed Japanese Character Recognition
  Suzuka Yasunami, Norie Koiso, Yuki Takemoto, Yu Ishikawa, Masami Takata, Kazuki Joe ... 189

Shape Recognition Technique for High-accuracy Mid-surface Mesh Generation
  Megumi Okumoto, Sumadi Jien, Junko Niiharu, Kiyotaka Ishikawa, Hirokazu Nishiura ... 196

SESSION: LATE BREAKING PAPERS: PARALLEL & DISTRIBUTED PROCESSING AND APPLICATIONS

A GPU-MapCG based Parallelization of BSO Metaheuristic for Molecular Docking Problem
  Hocine Saadi, Nadia Nouali Taboudjemat, Malika Mehdi, Ousmer Sabrine, Hafida Benboudjelthia ... 205

Resilient and Hierarchical Controller Placement Problem for Collaborative Virtual SDN Services
  Sakir Yucel ... 211

Docker-based Platform for Real-time Face Recognition
  Jongkwon Jang, Sanggil Yeoum, Moonseong Kim, Byungseok Kang, Hyunseung Choo ... 218

Controller Placement Problem for Virtual SDN Services
  Sakir Yucel ... 222

Multi-Start Parallel Tabu Search for the Blocking Job Shop Scheduling Problem
  Adel Dbah, Nadia Nouali Taboudjemat, Abdelhakim AitZai, Ahcene Bendjoudi ... 229

Atomic Commitment Protocol in Distributed Systems with Fail-Stop Model
  Sung-Hoon Park, Su-Chang Yoo ... 236


SESSION: PARALLEL AND DISTRIBUTED PROCESSING AND ALGORITHMS + PERFORMANCE ANALYSIS AND RELATED ISSUES
Chair(s): TBA


A Dynamic Adaptation Strategy for Energy-Efficient Keyframe-Based Visual SLAM Abdullah Khalufa, Graham Riley, and Mikel Luján School of Computer Science, The University of Manchester, Manchester, United Kingdom

Abstract— This paper presents a new light-weight dynamic control strategy for VSLAM. The control strategy relies on sensor motion to adapt, at runtime, the control parameters of the algorithm and of the computing device used. We evaluate the strategy on two platforms, a desktop and a mobile processor, in the context of both direct and indirect keyframe-based VSLAM algorithms, ORB-SLAM and DSO, using a control metric based on the change in camera pose over the trajectory of the sensor. As control parameters, the strategy uses DVFS (Dynamic Voltage and Frequency Scaling) on the device and a frame-skipping technique which attempts to identify and skip frames of low value to the accuracy of the VSLAM algorithm. We present results from execution on a number of synthetic and real scenes taken from the ICL-NUIM and EuRoC MAV datasets, respectively, illustrating the power savings and the impact on accuracy and robustness. Our results show a best-case power reduction of 75% with marginal impact on the accuracy and robustness of the VSLAM algorithms over multiple runs of most of the scenes, compared to the original real-time baseline versions of the algorithms. Analysis of the scenes which show the most impact on robustness indicates this is caused by certain critical points in the trajectory, which motivates the continued search for improved control metrics.

Keywords: SLAM, Runtime Adaptation, Energy-Accuracy tradeoffs

1. Introduction Visual simultaneous localisation and mapping (VSLAM) is the process of simultaneously estimating and tracking the pose (primarily position and orientation) of a moving platform using a camera sensor while building a map of its surrounding environment. VSLAM is a key building block in many vision and robotics applications ranging from Unmanned Aerial Vehicles (UAVs) to Augmented Reality (AR) and gaming gadgets. This wide range of applications has resulted in a variety of VSLAM algorithms, formulated mainly to achieve accurate estimation in each target environment. The deployment of these algorithms on platforms with limited computational and power resources is challenging and involves trading different performance objectives, such as accuracy, power consumption, or frame rate, at design time by exploring the available parameter space [1][2][3].

However, during VSLAM runtime, trading these performance objectives requires knowledge about the deployment environment and the nature of the camera motion and of the scene, all of which are usually unknown in advance. Coupled with the fact that VSLAM algorithms interpret the scene in different ways, a portable adaptation strategy that relies purely on the camera pose as a heuristic can prove useful for adapting a small set of general parameters with large influence on the performance objectives. Using the change in camera motion as a heuristic can capture difficulties within VSLAM tracking, where slow motion usually implies redundant and easy-to-track scenery, assuming that the scene is relatively rich in tracking information and static (i.e. without moving objects). For example, in the case of a quad-copter equipped with a camera, a sudden increase in motion can lead to a significant impact on tracking accuracy due to the lack of adequate tracking information, whereas slow and steady movements can be utilised for saving power, for example, to maximise battery life. The ability to adapt to these unpredictable and sudden changes in motion while controlling the accuracy and power consumption is essential for prolonging the range of tracking achievable.

In this paper, we propose a portable dynamic adaptation strategy capable of adapting to unknown levels of variation in VSLAM camera motion to reduce power consumption with minimal impact on accuracy, in the context of a fixed frame rate. This is achieved by using the estimated change in sensor pose as a metric to quantify these variations, and subsequently adapting DVFS and dynamically skipping frames which can be expected to be redundant. Our goal is to explore how effective it can be to rely only on sensor motion change as a heuristic for guiding the adaptation process. We focus on two prominent algorithms, ORB-SLAM [4] and Direct Sparse Odometry (DSO) [5], in the case study. The contributions of this paper are:

- A novel run-time linear adaptation strategy based on min-max normalisation of the change in VSLAM tracking state.
- A case study of two portable adaptation parameters for improving power consumption with minimal impact on the accuracy of the two algorithms, on two computing devices, using scenes from the EuRoC MAV and ICL-NUIM datasets.
- An analysis of the robustness of the algorithms over multiple runs through the same scene.
- The identification and analysis of critical points in the sensor trajectory which impact robustness.

2. Methodology

2.1 The Case for VSLAM Runtime Adaptation

In an environment where a VSLAM sensor experiences changing motion, the power consumption and accuracy can be balanced in accordance with the change. For example, if the sensor moves slowly and steadily, the observed scene may be essentially redundant within the tracking frame window. In this case, power can be reduced with little impact on accuracy. Although both ORB-SLAM and DSO perform tracking based on keyframes, the amount of their computation is partially influenced by the camera dynamics. To illustrate this, we designed a synthetic dataset where the camera observes a scene rich in features throughout, to ensure robust tracking in both algorithms and to rule out any effect caused by changes in the observed scene itself. The camera has a sinusoidal 2D motion, with slow speed at the sine wave peaks but faster in between, as shown in Fig. 1; this means that the observed scene is most similar and easiest to track at the peaks. We ran both ORB-SLAM and DSO on the synthetic dataset to observe how they are affected in terms of computation intensity and power consumption. To facilitate this, we pinned the two main threads of each algorithm, the tracker and mapper threads, to separate CPU cores and measured the number of cycles executed per second along with the power consumed by both cores. (The measurements were obtained using Linux turbostat on the same desktop machine (see Section 3.2) with the same frequency scaling governor (Performance) and under a fixed frame rate (30 fps).)

Fig. 1: The camera trajectory (red) aligned with the camera change D (green) representing the differences between each subsequent estimated camera pose.

Fig. 2: Runtime power consumption and computation intensity of the ORB-SLAM (a) and DSO (b) main threads, running a synthetic dataset with varying camera speed. The ORB-SLAM tracker and mapper threads are insensitive to the variations. The DSO mapper is less sensitive to the variations than its tracker, but has a larger impact on power consumption.

The amount of computation performed by the ORB-SLAM tracker and mapper threads does not appear to be correlated with the change in camera motion shown in Fig. 1, as can be seen in the top plot of Fig. 2. On the other hand, the DSO tracker and mapper threads are affected to a certain degree by the change, as can be seen in Fig. 2, where relatively less power consumption can be observed when the camera moves slowly. In the same figure, the DSO mapper is more dominant in terms of the amount of computation and shows more impact on power consumption than the tracker, but it is less sensitive to the change in camera motion. Given that both ORB-SLAM and DSO take little advantage of slow camera motion and redundant scenery to perform fewer computations and save power, we see an opportunity to perform runtime adaptations based on sensor motion to reduce power consumption with minimal impact on SLAM accuracy. From the above discussion, the benefit from such adaptation can be expected to be greater for ORB-SLAM than for DSO.

2.2 The Adaptation Strategy

In order to take advantage of camera motion variations, knowledge about the nature of the variation is required for the adaptation. In practice, the variations in camera motion will depend on the hosting platform, e.g. drones or smartphones, and will usually be unpredictable in advance. To address this issue we propose a linear adaptation model based on min-max scaling of the change to a monitored metric, $D$ (camera pose, for example), denoted by $D'$, and a predefined range over which a selected control parameter, $X$, may be varied (for example, the DVFS levels available on the processor). Depending on the type of correlation between the scaled change, $D'_t$, and the applied change to the control parameter, $X$, the model can be written for positive correlation as:

$$X_p = X_{min} + (X_{max} - X_{min})\,D'_t \qquad (1)$$

and for negative correlation as:

$$X_n = X_{max} - (X_{max} - X_{min})\,D'_t \qquad (2)$$

where $D'_t$ is defined as:

$$D'_t = \frac{D_t - \min(D)}{\max(D) - \min(D)} \qquad (3)$$

In the above equations, $X_{min}$ and $X_{max}$ are the minimum and maximum values allowed for the control parameter $X$, and $\min(D)$ and $\max(D)$ are the minimum and maximum values of the metric $D$ encountered as the sensor progresses along its trajectory. This model can be used for parameter adaptation in both the SLAM formulation and the target hardware platform, as demonstrated in this work. The advantages of this approach are two-fold. First, it provides the ability to adapt and respond quickly to larger changes as they are encountered by the SLAM system, without the need for prior knowledge about the nature of the motion or scene; assuming positive correlation, for example, when the change is at its peak ($D_t = \max(D)$) the model immediately ensures that the adaptation parameter operates at its highest value, leading to enhanced robustness, as illustrated in Fig. 3. Second, it decouples the variation in the metric from the adaptation process, which means the model can work with any change metric. In this paper, we use the change in sensor motion as a heuristic for guiding the adaptation process. In the case of constant motion, the control parameter $X$ will operate at its highest or lowest level, depending on the correlation; in such a case, the value $X_{max}$/$X_{min}$ can easily be tuned to achieve the desired tradeoff.

Fig. 3: A visual illustration of Equation 1: when the change $D$ (green) is at its peak (blue), the parameter $X_p$ is at its highest level.

The model is used to explore two general adaptation strategies that do not involve significant changes to the SLAM base code and settings, which is useful if the formulation parameters and settings are already tuned through design space exploration, for example. Further, these adaptations can be applied to a variety of SLAM formulations and platforms. In this paper, the first adaptation strategy attempts to identify and skip redundant frames dynamically before they are fed to the SLAM system, indirectly affecting power consumption; the second targets the platform through DVFS adaptations, thus directly influencing power consumption.
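As a minimal sketch of Eqs. 1-3 (the class name and the running-min/max bookkeeping are choices of this sketch, not taken from the paper), the model can be implemented as a small controller:

```python
import math

class MinMaxAdapter:
    """Linear adaptation model of Section 2.2 (Eqs. 1-3), a minimal sketch.

    Scales an observed change metric D into [0, 1] using the running
    min/max encountered so far (Eq. 3), then maps it onto the allowed
    range [x_min, x_max] of a control parameter X, with either positive
    (Eq. 1) or negative (Eq. 2) correlation.
    """

    def __init__(self, x_min: float, x_max: float, positive: bool = True):
        self.x_min, self.x_max = x_min, x_max
        self.positive = positive
        self.d_min = math.inf    # running min(D)
        self.d_max = -math.inf   # running max(D)

    def update(self, d_t: float) -> float:
        # Track the extremes of D encountered along the trajectory.
        self.d_min = min(self.d_min, d_t)
        self.d_max = max(self.d_max, d_t)
        span = self.d_max - self.d_min
        d_scaled = (d_t - self.d_min) / span if span > 0 else 0.0   # Eq. 3
        if self.positive:
            return self.x_min + (self.x_max - self.x_min) * d_scaled  # Eq. 1
        return self.x_max - (self.x_max - self.x_min) * d_scaled     # Eq. 2
```

For example, a negatively correlated instance could drive a frame-skip count, while a positively correlated instance could drive DVFS levels, matching the two strategies of Sections 2.4.1 and 2.4.2.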

2.3 Motion-based Metric

Characterising the change in SLAM state with a small number of parameters has been proposed in [6], where the Kullback-Leibler divergence is used as a measure of the change in the scene, which usually implies that the sensor has moved, unless the scene has dynamic objects. In this work, however, we focus on the change in motion estimated by the SLAM system as a heuristic for the tracking difficulty, where faster motion implies more difficult tracking. For each frame fed to the SLAM system, there is an estimated translational and rotational value computed for the camera pose. We define the change in camera motion at time $t$ between two subsequent frames as:

$$D_t = \Delta_p^\top \Delta_p \qquad (4)$$

where $\Delta_p$ is the absolute difference between the camera pose at time $t$ and $t-1$. The metric $D$ can then be used with Eq. 3 to guide the adaptations.
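As a concrete illustration, Eq. 4 can be computed from consecutive pose estimates. Representing the pose as a 6-vector of translation and rotation components is an assumption of this sketch (the paper does not fix a parameterisation), and the geometric-mean smoothing mentioned later in Section 3.4 is included for completeness, with an illustrative window size:

```python
import numpy as np

def motion_change(pose_t: np.ndarray, pose_prev: np.ndarray) -> float:
    # Eq. 4: D_t = Delta_p^T Delta_p, with Delta_p the absolute difference
    # between the camera poses estimated at times t and t-1.
    delta_p = np.abs(pose_t - pose_prev)
    return float(delta_p @ delta_p)

def geometric_mean_smooth(values: list, window: int = 5) -> float:
    # Running geometric mean over the last `window` metric values, used in
    # Section 3.4 to damp spikes in D caused by long runs of skipped frames.
    # The window size here is illustrative; the paper fixes it by experiment.
    recent = np.asarray(values[-window:], dtype=float) + 1e-12  # avoid log(0)
    return float(np.exp(np.mean(np.log(recent))))
```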

2.4 The Adapted Parameters

2.4.1 Dynamic Frame Skipping

Skipping frames can be useful in reducing the amount of computation required of the tracker thread. Assuming a fixed frame rate, when a frame is skipped the tracker thread becomes idle until a new frame arrives for processing. However, deciding which frames to skip without a significant impact on accuracy is not trivial. Dense SLAM formulations usually require all frames to be processed to build a full map of the scene, whereas keyframe-based formulations rely on only a subset of the frames chosen for tracking, depending on formulation requirements; for example, the change in the field of view or in scene brightness. Slow motion usually implies observing a similar scene across multiple frames, allowing the possibility that these could be skipped to save power with little impact on tracking accuracy. This holds unless there are dynamic objects in the scene, in which case it may be preferable to couple the motion metric with a scene similarity metric comparing subsequent frames, to avoid skipping potentially useful frames. The larger the change in motion, the fewer frames should be skipped; Eq. 2 is used to determine the number of frames skipped in succession before processing a new frame, with a preset maximum on the number of successively skipped frames. We denote this adaptation by (S) throughout the rest of the paper.
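A minimal sketch of the (S) adaptation follows, using Eq. 2 (negative correlation) with the number of successively skipped frames as the control parameter; the values X_min = 0 and max_skip = 4, and the loop structure, are illustrative assumptions rather than the paper's settings:

```python
def frames_to_skip(d_scaled: float, max_skip: int = 4) -> int:
    # Eq. 2 with X_min = 0 and X_max = max_skip: the larger the scaled
    # motion change d_scaled (in [0, 1]), the fewer frames are skipped.
    return round(max_skip * (1.0 - d_scaled))

# Hypothetical feeding loop: after processing a frame, skip the next n frames.
# skip_budget = 0
# for frame in camera_stream:
#     if skip_budget > 0:
#         skip_budget -= 1
#         continue                     # tracker idles until the next frame
#     slam.process(frame)
#     skip_budget = frames_to_skip(current_scaled_change)
```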

2.4.2 DVFS Adaptation

Most current CPUs employ Dynamic Voltage and Frequency Scaling (DVFS), which governs the operating frequency by a predefined policy, to save energy or deliver extra performance. The goal here is to adapt the frequency based on the scaled change in motion of the VSLAM sensor to save power with minimal impact on accuracy: the larger the change in motion, the higher the desired frequency (i.e. there is a positive correlation between the measured change and the desired frequency). Since the correlation is positive, Eq. 1 can be used to determine the appropriate frequency. To facilitate this adaptation, the SLAM tracker and mapper threads are pinned to different CPUs with the Performance governor enabled, allowing them to operate at the highest frequency. The frequency is then updated at runtime using the value X, according to the model. This adaptation is denoted by (F) throughout the rest of the paper.
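A sketch of the (F) adaptation: Eq. 1 maps the scaled change onto the discrete DVFS levels exposed by the processor. The cpufreq sysfs write shown is one plausible Linux mechanism and an assumption of this sketch; the paper only states that the frequency is updated at runtime with the Performance governor enabled:

```python
def pick_frequency(d_scaled: float, levels_khz: list) -> int:
    # Eq. 1 (positive correlation): larger motion change -> higher frequency.
    f_min, f_max = levels_khz[0], levels_khz[-1]   # levels sorted ascending
    target = f_min + (f_max - f_min) * d_scaled
    return min(levels_khz, key=lambda f: abs(f - target))  # snap to a level

def apply_frequency(cpu: int, freq_khz: int) -> None:
    # Capping scaling_max_freq makes the Performance governor run the core
    # at this level (requires root; an assumed mechanism, not the paper's
    # documented interface).
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_max_freq"
    with open(path, "w") as fh:
        fh.write(str(freq_khz))
```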

3. Experimental setup

The goals of the experiments are two-fold: first, to evaluate the effectiveness of the runtime adaptation strategy described in Section 2.2 in saving power, and to quantify the impact on the accuracy and robustness of ORB-SLAM and DSO; and, second, to understand the implications and/or limitations of relying only on motion change as a metric to drive the adaptation strategies. To achieve these goals, we applied the strategies to fixed frame rate versions of both algorithms, denoted by (RT) for Real-Time, running on datasets with a variety of motion dynamics, namely EuRoC MAV and ICL-NUIM, which are described next.

3.1 Datasets

For our evaluation, we picked two widely known datasets: EuRoC MAV [7] and ICL-NUIM [8]. We did not use the monoVO dataset because it does not provide the full ground truth trajectory, which is essential for isolating in-sequence abnormal drift or loss of track, as required by our evaluation. EuRoC MAV has real video scenes captured by a stereo camera mounted on a drone at 20 frames per second (FPS). Since we are evaluating the monocular version of both algorithms, we used the left camera footage in our work. The dataset has 11 sequences with varying difficulty in terms of motion and scene, which makes it suitable for our evaluation. ICL-NUIM, on the other hand, is a synthetic indoor dataset containing a total of 8 video sequences from office and living room scenes with complete ground truth trajectories. The dataset runs at 30 FPS, which makes it ideal for evaluating the adaptation strategies under a real-time constraint; we thus use it to evaluate the adaptation applied to DSO on both a desktop and a mobile processor. We pick a set of scenes from both datasets that have varying difficulty and different motion dynamics. Table 1 summarises the sequences selected from both datasets.

Table 1: Datasets used for evaluating the adaptation.

  Dataset                        Sequence   Frames   Scene
  EuRoC MAV (Real @ 20 FPS)      MH01       3682     Machine Hall/Easy
                                 MH02       3040     Machine Hall/Easy
                                 MH05       2273     Machine Hall/Difficult
                                 V101       2912     Vicon Room/Easy
                                 V201       2280     Vicon Room/Easy
                                 V202       2348     Vicon Room/Medium
  ICL-NUIM (Synthetic @ 30 FPS)  of kt1     966      Office Scene
                                 of kt3     1241     Office Scene
                                 lr kt0     1509     Living Room
                                 lr kt1     966      Living Room

3.2 Platforms Setup

The strategy is evaluated on both a desktop and a mobile processor. The desktop processor is an Intel Skylake i7-6700 CPU with 16 GB of RAM, running Ubuntu 16.04 LTS with kernel 4.13.0-37-generic. We use acpi-cpufreq as the frequency scaling driver. Linux turbostat is used for recording power measurements, which are sampled at 1 kHz. Intel TurboBoost was disabled in all of the experiments to prevent it from interfering with our control of DVFS. The mobile processor is a Snapdragon 820 (msm8996) with 3 GB of main memory, hosted on a development board and running Android 6.0. The default frequency scaling driver is used for the adaptive DVFS strategy. To measure power, the ARM energy probe is used to sample the power drawn by the processor at 10K samples per second [9].

3.3 SLAM Setup

We use the open source code bases for both DSO (https://github.com/JakobEngel/dso) and ORB-SLAM2 (https://github.com/raulmur/ORB_SLAM2). Both algorithms track scene features within a window of frames. The evaluated DSO version is pure visual odometry, meaning that it does not perform loop closure when it revisits a previous location, as opposed to ORB-SLAM. The evaluation is based on the monocular version of both algorithms. Throughout all experiments, we apply the adaptation strategies to the base real-time version of each algorithm (RT), and this is used as the reference case to evaluate our adaptation strategies. For DSO we set preset = 1 with the multithreading mode disabled (nomt = 1). Otherwise, the default DSO settings were used, except for setting the parameter gth to 3 on the ICL-NUIM sequences, in accordance with the guidelines in [5]. Since fixed frame rate execution is enforced on both algorithms at the rate given by the dataset, frames are skipped by default if they do not meet this constraint. In addition, for all sequences, all frames are preloaded into the main memory of the test machine before the start. After algorithm initialisation, the adaptation strategies and power measurements commence and continue until the end of the video sequence.

3.4 Evaluation

Both ORB-SLAM2 and DSO were originally designed and evaluated on desktop processors. For this reason, we evaluate the adaptation strategy with both algorithms running on the desktop processor using EuRoC MAV, as it has more realistic motion dynamics than ICL-NUIM. We use the ICL-NUIM dataset to evaluate a cross-compiled version of DSO running on a mobile processor, for comparison with the desktop version; ICL-NUIM sequences have a real-time frame rate, making them suitable for evaluating the impact of operating on less powerful processors. To facilitate the DVFS adaptations, the tracker and mapper threads are pinned on separate CPU cores with the Performance governor enabled. Redundant frames are skipped at run-time before they are fed to the system. In each experiment, a total of 20 runs were performed for each sequence to evaluate the impact of applying the adaptation strategies on the tracking accuracy and robustness of each algorithm, compared to the baseline (RT) versions. Low variability in the accuracy results (based on the difference between the achieved track and the ground truth) from run to run implies higher (better) robustness. For each run, the estimated trajectory is scaled and aligned to the ground truth trajectory using sim(3) Umeyama alignment [10], then the RMSE (Root Mean Squared Error) is calculated over the resulting aligned trajectory. If the track is completely lost in a run, we set the RMSE to 'infinity'. The adaptation strategies are applied to the real-time baseline version of each code separately and in combination. We evaluate each strategy considering only the change in motion as the adaptation metric. We evaluate the combination of the DVFS adaptation (F) and the dynamic frame skipping adaptation (S), since our goal is to allow both the SLAM system and the platform to adapt to the change in motion to achieve the highest power improvement. In terms of the predefined range over which the strategies can vary, that is Xmax and Xmin, for each dataset we set fixed and portable values across all algorithms (NB. DVFS ranges are platform-dependent). Skipping a large number of frames in a row can lead to undesirable spikes in the change-in-motion metric defined in Section 2.3, which in some cases degrades the overall robustness. To avoid this effect, we employ a running geometric mean to smooth the change over a predefined window (with the fixed window size determined by experiment).
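The per-run evaluation (sim(3) Umeyama alignment [10] followed by RMSE over the aligned trajectory) can be sketched as follows. This assumes the estimated and ground-truth trajectories are already associated frame-by-frame as n x 3 arrays; it is a standard construction of the Umeyama method, not the authors' code:

```python
import numpy as np

def align_sim3(est: np.ndarray, gt: np.ndarray):
    # Umeyama [10]: scale s, rotation R, translation t minimising
    # ||gt - (s * R @ est + t)||^2 over the n x 3 trajectories.
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)                 # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # reflection correction
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / len(est)
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    # RMSE over the aligned trajectory, as described in Section 3.4.
    s, R, t = align_sim3(est, gt)
    aligned = s * est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```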

4. Results and Analysis

For each sequence, we evaluate the improvement in power consumption and the impact on accuracy and robustness after adapting DVFS and/or frame skipping (denoted by (F) and (S), respectively), relative to the baseline version of each algorithm (denoted by RT) running on the target platform. The results for each sequence are shown in figures having two vertical subplots, showing power reduction and accuracy results respectively, and sharing the x-axis. The accuracy and robustness results for the proposed adaptations are presented as violin plots showing the distribution of the RMSE values of the 20 runs, with their mean and median values.

4.1 Results on Desktop Processor

Fig. 4a (ORB-SLAM, EuRoC MAV), Fig. 4b (DSO, EuRoC MAV) and Fig. 5a (DSO, ICL-NUIM) show the impact on accuracy and robustness, along with the power reduction achieved, relative to the baseline real-time (RT) version of each algorithm running on the desktop, using scenes from the EuRoC MAV and ICL-NUIM datasets. Overall, applying the adaptation to ORB-SLAM shows greater benefit in terms of power reduction than for DSO. This conforms with the analysis performed on the tracker and mapper threads of each algorithm in Section 2.1. It should be remembered that ORB-SLAM has the advantage of being a full SLAM algorithm, whereas DSO is visual odometry only and therefore does not correct and relocalise the track in the case of loop closure. This is apparent in the overall accuracy and robustness results in Fig. 4, showing the EuRoC MAV sequences. Considering the results for each sequence, the accuracy results obtained from applying the proposed adaptation strategies fall into three categories based on the impact on the accuracy and robustness compared to the baseline (RT).

Improved Robustness: Applying the adaptation strategies shows an improvement in the robustness of the 20 runs performed after applying the adaptations, compared to the baseline. See, for example, MH05 in Figs. 4a and 4b, a sequence characterised as difficult, with fast motion and a dark scene. Skipping redundant frames, (RT+S), enables both algorithms to keep useful frames within their tracking window, resulting in improved robustness. The combination (RT+F+S) then achieves the best trade-off between robustness and power reduction compared to the baseline (RT).

Fig. 4: Results of applying the adaptation strategies to ORB-SLAM (a) and DSO (b) using the selected EuRoC MAV sequences, running on a desktop (20 runs). All sequences share the x-axis and each sequence has two subplots: the top is the adaptation's power reduction relative to the baseline (RT) (higher is better) and the bottom is the impact on accuracy and robustness (lower is more accurate and less vertical spread is more robust). (a) ORB-SLAM: Applying the adaptations to ORB-SLAM has marginal impact on the accuracy and robustness compared to the baseline (RT) on the majority of the tested sequences. The highest power reduction, relative to the ORB-SLAM baseline (RT), is achieved with the (RT+F+S) adaptations on all sequences. (b) DSO: Several of the adaptations have similar or better robustness compared to the DSO baseline (RT), especially (RT+F) and (RT+F+S), which also have larger power reductions compared to (RT+S).

Marginal Impact on Accuracy and Robustness: This is the case for most of the tested sequences for ORB-SLAM (Fig. 4a) and for sequences V201, MH01 and MH02 for DSO (Fig. 4b), where almost all of the adaptation strategies achieve similar accuracy to the baseline, with marginal variability in the RMSE values of the 20 runs (i.e. there is high robustness). Adapting DVFS only, (RT+F), results in much higher power reduction than frame skipping only, (RT+S), while the combination of both, (RT+F+S), results in an improvement of between 65-75% over the baseline power consumption of ORB-SLAM, and between 39-60% with DSO. The reason for such improvement with minimal impact on accuracy is that the change in camera motion corresponds to redundancy in the observable tracking information. This is where the proposed adaptation strategy meets its goal, ensuring that skipping redundant frames and lowering the CPU frequency have minimal impact on the accuracy and robustness while providing significant improvements in power consumption.

Significant Accuracy and Robustness Impact: Looking at Figures 4a and 4b, the sequence most impacted in terms of robustness after applying the adaptations to both algorithms is V202. To better understand this impact, Fig. 6 shows a boxplot summarising the error of the estimated 20 camera tracks recorded at ten intervals throughout the tracking of ORB-SLAM running scene V202. In this figure, two prominent or critical points can be identified at which an impact on the tracking robustness can be observed. The first is at 10%, where the combined (RT+F+S) adaptation shows a large variation in the error (the black box). Following this, the (RT+F+S) box can be seen to return to the normal range of the drift error, which implies that ORB-SLAM has corrected its path through loop closure. The same happens at 90% of tracking with (RT+S), indicating that another critical point has been encountered.

Fig. 5: This figure follows similar formatting to Fig. 4, and shows the results after applying the adaptation strategies to DSO using the selected ICL-NUIM sequences (operating at 30 fps) on both desktop (a) and mobile (b) processors for comparison. (a) DSO-desktop: The adaptations show similar results to those obtained in Fig. 4b, with (RT+F+S) achieving the highest reduction in power. (b) DSO-mobile: The robustness of DSO for sequences of kt1 and lr kt0 is impacted when running on the mobile processor, especially with (RT+F), while (RT+F+S) achieves the best energy-accuracy tradeoff.

Fig. 6: Error recorded at ten intervals throughout ORB-SLAM's 20 tracks of sequence V202; there is abnormal drift at different points when applying frame skipping (RT+S) and the combination of DVFS and frame skipping (RT+F+S).

DSO was found to be more impacted by these critical points, since the version we use is pure visual odometry with no loop closure. The impact of these critical points, signified by the change of level in the error in Fig. 6, can mainly be attributed to the inability of the motion change metric to identify these critical tracking points, since it only quantifies the change in motion regardless of the motion type itself. For example, the change may be due to pure rotational or shaky motion, which can affect the redundancy of the information available. This case motivates the continued search for new metrics capable of identifying specific types of motion or scene complexities that give rise to such critical points.

4.2 Results on Mobile Processor

For the Snapdragon processor, we compare the DSO results with those obtained from running DSO on a desktop, shown in Fig. 5a, using the same ICL-NUIM sequences, as we want to observe the impact of running on a less powerful mobile processor. The Snapdragon processor results are presented in Fig. 5b. These results show that the mobile processor, after applying the adaptations, can handle the tracking computation in sequences of kt3 and lr kt1 with similar accuracy and robustness as achieved on the desktop. This is not the case, however, when the (RT+F) adaptation is used for sequences of kt1 and lr kt0. For these sequences, there is an impact on the accuracy and robustness compared to the RT baseline results on the mobile processor. This happens because, due to the slow camera motion, the frequency is driven to a low level by the (RT+F) adaptation. With this adaptation, operating at low frequencies on the Snapdragon leads to a large number of random frames being skipped in order to enforce the required real-time execution, and these frames may contain valuable tracking information. Applying the (RT+S) adaptation for these sequences leads (by design) to skipping mainly redundant frames, reducing the amount of computation required and producing similar accuracy and robustness results to those obtained for the RT baseline. (RT+F+S) achieves similar accuracy and robustness to (RT) and (RT+S), but achieves by far the highest power savings since, in this case, mostly redundant frames are skipped and the real-time enforcement is achieved.


5. Related work

The evaluation of VSLAM can be based on multiple metrics, for example trajectory and map accuracy, per-frame execution time, resource utilisation, etc. [11][12][13]. The trajectory accuracy can be measured by evaluating the RMSE over all the estimated camera translational poses [14], and this is the metric used in this paper. Finding the desired trade-off between these performance metrics involves tuning a large number of parameters, originating from both the VSLAM algorithm and the execution platform, and can be achieved by Design Space Exploration (DSE) [1][2][3]. During runtime, however, adapting all of these parameters to cope with changing motion and environment is not feasible, especially under a real-time constraint. Saeedi et al. [6] propose the use of a small set of parameters based on information theory to measure the change in the scene, which usually implies a change in motion, while in [15] a selected set of "knobs" specific to KinectFusion [16], a dense form of VSLAM, is adapted based on the sensor motion. Their control model is designed on the assumption that movement between successive frames is small, which is possible in such an environment, but is not always the case for a wider range of VSLAM applications. In this work, we also use the change in motion estimated by the algorithm as a heuristic. However, we propose a portable approach, capable of adapting to a changing environment, which does not require explicit knowledge of the VSLAM algorithm. Further, we explore runtime adaptations on two of the most widely used keyframe-based formulations.

6. Conclusion and Future Work In this paper, we have presented a runtime adaptation strategy which provides significant power reduction with small impact on the accuracy and robustness, relying only on sensor motion for guidance. Evaluation of the proposed strategy using two VSLAM-related algorithms on two machines shows that the desired objective is achieved for the majority of sequences examined. Analysis of the sequences for which the objective is not fully met reveals that the robustness is impacted only at certain critical points in the tracking, which are not identified by the current motion metric, and, therefore, cannot be responded to. This motivates the continued search for alternative metrics which would be capable of identifying and characterising motion type, rather than simply relying on pure motion changes. Future work will also consider incorporating information from the scene, which can be expected to provide further improvements.

Acknowledgments Abdullah Khalufa is sponsored by Umm Al-Qura University, Mecca, Saudi Arabia. We also acknowledge the support of EPSRC grants RAIN Hub (EP/R026084/1) and PAMELA (EP/K008730/1). Mikel Luján is supported by an Arm/RAEng Research Chair appointment.

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

11

On the Distance and Spatial Complexity of Complete Visibility Algorithms for Oblivious Mobile Robots Rory Hector and Ramachandran Vaidyanathan Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, USA

Abstract— Much work has been done on oblivious mobile robots operating in real-planes in distributed look-computemove environments. Most of this work has focused on algorithms that terminate in finite time, although some examine the time complexity. In this paper we examine two other aspects of these algorithms. The distance complexity reflects the distance moved by the robots in executing the algorithm. This measure is important as the energy spent and mechanical wear-andtear are correlated with how far robots move. The spatial complexity is the “area” occupied by the robots. This is important to ensure that the swarm can operate in tight spaces. We examine a class of “convex polygon based complete visibility algorithms” and present matching upper and lower bounds on their distance and spatial complexities. Keywords: Mobile robots, distributed algorithms, complete visibility, distance complexity, spatial complexity

1. Introduction Swarm robotics [2], [27], [30], in which a large number of small autonomous robots collectively solve a problem, has drawn much interest in recent years. Computationally, they form a distributed system, often communicating only by sight. While the broad field deals with many aspects of robotics (sensors, actuators, control, image processing, etc.), in this paper we will consider only the distributed computing aspect [23]. The classical model of distributed computing by mobile robots abstracts each robot as a point on the real plane [15]. In this model, all robots execute the same algorithm and proceed in “Look-Compute-Move” (LCM) cycles. An active robot first obtains a snapshot of its surroundings (Look), then computes a destination based on the snapshot (Compute), and finally moves to the destination (Move). Typically, robots are assumed to be oblivious; that is, a robot bases its current computation solely on the current “look” [15]. Further, robots are silent and communicate indirectly only through what they see and where they move. A variant of this model with more direct communication is the robots with lights model [9], [15], [22], where each robot has an externally visible light that can assume colors from a constant sized set; robots explicitly communicate with each other using these colors. The colors are persistent; that is,

the color is not erased at the end of an LCM cycle. Except for the lights, the robots are oblivious as in the classical model. In general, these robot swarms could operate in a plane [21], [18] or in 3-dimensional space [5]. We will however consider a 2-dimensional model on the real plane; nevertheless, some of the ideas we discuss may be extensible to 3-dimensional models. In this paper, we consider the problem of complete visibility [12], [28] in which robots, starting at an arbitrary initial position, must place themselves in a position in which each robot can see every robot. That is, assuming opaque robots, the final configuration has no three robots that are collinear. This problem is fundamental as from a position of complete visibility, several other problems can be solved (such as leader election and pattern formation [29]). Specifically, we will consider a class of complete visibility algorithms on the robots with lights model in which complete visibility is achieved by placing all n robots on the corners of an n-sided convex polygon [20], [24]; we will refer to these algorithms as “convex polygon based complete visibility algorithms.” Contributions: In this paper we formally define the “distance complexity” of an autonomous robots algorithm. In particular, we study the distance travelled by robots in executing a convex polygon based complete visibility algorithm. This is an important consideration as physical robots in a swarm are considerably affected by the distance they travel (impacting the time, energy used, and component wear-andtear). Since robots operate on the real plane (with infinite precision in terms of how close together or spread apart they could be), it is more meaningful to talk of the distance traveled in terms of the initial distance between robots (or the “diameter” of the initial configuration). In this paper we show that the above convex polygon based complete visibility algorithms all work with an optimal “distance complexity” of O(1); this idea of distance complexity is enunciated more precisely in Section 4. We also consider the area needed (relative to the initial area occupied by the robots) for the algorithms to run. This “spatial complexity” is an important measure of the algorithm’s ability to allow the swarm to operate within tight spaces. We prove that the spatial complexity of the above class of complete visibility algorithms is also optimal

ISBN: 1-60132-508-8, CSREA Press ©

12

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

at O(1), with details appearing in Section 5. Previous Work: Flocchini et al. and Prencipe [16], [23] provide a comprehensive discussion on oblivious mobile robots, including different models, problems and techniques. A large variety of problems have been addressed for autonomous robot swarm models, including gathering/rendezvous/convergence, pattern formation, dispersal, and several other coordination problems [1], [22], [19], [8], [31], [10], [17], [13]. The complete visibility problem has been considered under different contexts [12], [20], [24]. The idea of distance covered has been considered before, particularly in the context of the gathering problem [6], [7], [14]. Bhagat et al. [3] consider the number of moves by a robot (but not the distance covered). Distance is considered in a different way when the robots operate on a graph (as opposed to the real plane) [4], [11].

2. Convex Polygon Based Complete Visibility A polygon is convex if and only if any line joining two points in or on the polygon, lies entirely in or on the polygon. Figure 1 shows and example of a convex polygon. If robots are positioned at the vertices of a convex polygon, then they are completely visible to each other. To our knowledge, all complete visibility algorithms so far follow this strategy. However, we recognize that complete visibility does not require this approach. In this paper we consider this class of “convex polygon based complete visibility algorithms.” Each algorithm in this class has the following three phases: • Phase 0 (Initialization): A set of points that are not collinear has (well-defined) a convex hull. This phase is to perturb robots out of a collinear initial configuration, so that they can start with a convex hull. If all n robots are in a collinear initial configuration, then a subset of robots move a small constant distance perpendicular to the line on which they lie, thus forming a convex hull. Figure 1(a,b) illustrates this. • Phase 1 (Interior Depletion): By the start of this phase, all robots are on the perimeter or inside of a convex polygon. The goal of this phase is to move all of the robots from the interior of the convex polygon to the perimeter. This results in a configuration such that all robots are either corners of the convex polygon or lie along its edges. (see Figure 1(c)). • Phase 2 (Edge Depletion): Robots on the sides of the convex polygon will move outward to become corners of a new convex polygon. This may iterate multiple times until no side robots remain, and thus, all robots are corner robots (see Figure 1(d)). This configuration is sufficient for complete visibility. Particular algorithms considered in this paper themselves differ in the techniques used and in the time taken for these

(a)

(b)

(c)

(d)

Fig. 1: An illustration of the phases of a convex polygon based complete visibility algorithm; (a),(b) show possible actions in Phase 0; (c), (d) show Phases 1 and 2, respectively.

phases. In particular the algorithm of Sharma et al. [25], [24] Vaidyanathan et al. [28], [26] and Di Luna et al. [12], [20] run in constant, O(log n) and O(n) time, respectively.

3. Distance Complexity Let {ri : 0 ≤ i < n} be the set of robots. In this paper, the position of robot ri is the only meaningful state of the robot that we will be concerned with. At any instant in time t, let pi (t) = (xi (t), yi (t)) be the coordinates of the position of robot ri . Let di,j (t) be the distance between robots ri and rj . At time t, the configuration is C (t) = {pi (t) : 0 ≤ i < n}. Where there is no danger of confusion, we will drop the (t) attribute. Further, it must be noted that C is a multiset with n elements, with some of the pi ’s possibly having the same value (as some configurations may position two different robots on the same point). Let the diameter of a configuration C be the largest of the distances between the robots in that configuration. That is: dia(C ) = max {di,j : 0 ≤ i, j < n} For a given algorithm A , a given initial configuration C0 , and a final configuration Cf that A takes the robots to. Let robot ri move a total distance of δi as it goes from C0 to Cf . Clearly δi is a function of A , C0 , Cf (although the notation for δi does not show this for clarity). Let the distance moved between the initial and final configurations, averaged over all the robots, be δavg (A , C0 , Cf ) = Ave(di ).

ISBN: 1-60132-508-8, CSREA Press ©

i

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

Let the maximum distance moved by any robot between the initial and final configurations be δmax (A , C0 , Cf ) = Max(di ). i

We now define two distance complexity measures for the algorithm. The average distance complexity of algorithm A is   δavg (A , C0 , Cf ) . Davg (A ) = Max C0 ,Cf dia (C0 ) The average distance complexity of A is the distance moved by the robots on average, expressed in terms of the diameter of the initial configuration. This “initial diameter” is necessary as point robots on a real plain allow any configuration to be expanded or contracted arbitrarily without affecting the execution of the algorithm. Further, the distance complexity considers the worst possible initial and final configurations. Without this, an initial configuration that is the same as the final configuration will give a distance complexity of 0. In a similar manner, the maximum distance complexity of algorithm A is   δmax (A , C0 , Cf ) . Dmax (A ) = Max C0 ,Cf dia (C0 ) It should be noted that the distance complexity can take many forms. In general with an initial diameter of D, an algorithm could (a) cause no robot movement resulting in a distance complexity of “0,” or (b) cause the robots to move a constant distance each (independent of both n and D) 1 resulting in a “decreasing” distance complexity Θ D , or (c) cause robots to move distances that are functions of n and/or D1+ǫ for ǫ > 0; here the distance complexity could be a function of n and Dǫ . However, for the convex polygon based complete visibility algorithms, the distance complexities, likely are Ω(1). Conjecture 1: Every convex polygon based complete visibility algorithm has a distance complexity of Ω(1). Proof outline: Arrange the n robots in an initial configuration so that they are placed in two concentric n circles of diameter D and D 2 with 2 robots uniformly placed on each circle. We conjecture that to reduce the average distances moved, the final n-corner convex polygon must occupy the area between the two circles. If so, it can now be shown that the sum of the minimum distances from the robots’ initial positions to the final convex hull is Ω(nD), establishing an average (and hence maximum) distance complexity of Ω(1).  We will show that the convex polygon based complete visibility algorithms that we study in this paper all have a maximum (and hence, average) distance complexity of O(1), which if Conjecture 1 is true, is optimal.

13

4. Distance Analysis In this section, we analyze convex polygon based complete visibility algorithms to determine their distance complexity. Along the way, we also establish the optimality of the distance complexity of the algorithms studied. We now go through the three phases (see Section 2) of these algorithms and show that each one of them has a maximum distance complexity of O(1). Before we proceed, we observe that for all of the convex polygon based complete visibility algorithms [12], [20], [24], [28], [25], [26] that we study, no robot moves a distance of greater than D (the initial configuration diameter) at any step. This implies that the constant time algorithm of Sharma et al. [25], [24] has a constant distance complexity. In the remaining discussion, we will deal more specifically with the algorithms of Vaidyanathan, Sharma et al. [28], [26] and Di Luna et al. [12], [20]; however, our discussion also applies to the techniques used in the constant time algorithm of Sharma et al.

4.1 Phase 0: Initialization Recall that the goal of Phase 0 is to take a linear arrangement of robots and make it a nonlinear configuration that has a convex hull. A subset of robots (for example, the endpoint robots) move a small distance perpendicular to the line on which they lie. We will make a change to the algorithms such that instead of moving a small constant distance, each moving robot will travel a distance not exceeding the distance between itself and its furthest visible neighbor. Their movement breaks the collinearity of the initial configuration. Immediately upon completion of this phase (whether or not robots must actually move) it is clear that a convex polygon has been formed. Let D be the diameter of the initial configuration before Phase 0. Clearly, the maximum distance that a robot may move is D and the diameter of the resultant configuration at the end of Phase 0 would be Θ(D). With no loss of generality, let us call the diameter of the configuration at the end of Phase 0 simply D. Lemma 2: The maximum (and hence, average) distance complexity of Phase 0 of any convex polygon based complete visibility algorithm is O(1). 

4.2 Phase 1: Interior Depletion At the start of Phase 1, all robots are at the corners, on sides, or in the interior of a convex polygon of diameter D. The goal of Phase 1 is to move all interior robots to the sides. In the algorithms we consider, every robot in the interior moves a constant c ≥ 1 number of times within the convex polygon in order to get to a side. Clearly these movements cannot cover a distance of greater than cD, where D is the diameter of the convex polygon.

ISBN: 1-60132-508-8, CSREA Press ©

14

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

Lemma 3: The maximum (and hence, average) distance complexity of Phase 1 of any convex polygon based complete visibility algorithm is O(1).  Unlike usual complexity measures, the distance complexity can be a decreasing function or even 0, where the robots move o(D) distance (or possibly not move at all). Therefore, a O(1) distance complexity is not necessarily the best possible. However, we show now that there exists a configuration for which the average (and hence, maximum) distance complexity for Phase 1 is Ω(1). Let the convex polygon at the start of Phase 1 be an equilateral triangle with three robots at its corners and the remaining n − 3 robots clustered in the interior infinitely close to the centroid of the equilateral triangle. The diameter D of this configuration is the side length of the triangle. The = D √ = shortest distance from the centroid to a side is D 2 3 Θ(D). Therefore, each of the n − 3 interior robots moves  a distance  of at least D and the average distance moved is  Thus the average distance complexity is D. at least n−3 n  Ω

 (n−3)D nD

= Ω(1).

Lemma 4: The average (and hence, maximum) distance complexity of Phase 1 is Ω(1). 

4.3 Phase 2: Edge Depletion Recall that at the start of Phase 2, all robots are either at corners of the convex hull or on its sides. The goal of this phase is to move the side robots outward to new corners; the corner robots do not move for the rest of the algorithm. Phase 2 of all of the convex polygon based complete visibility algorithms builds on the idea of a “safety triangle” that allows side robots to move to corners of a new polygon in a manner that (a) keeps corners of the existing polygon as corners of the new polygon, and (b) keeps the new polygon convex; we will talk about safety triangles in more detail below. The different algorithms we consider move side robots at different times and in slightly different ways in Phase 2, but the underlying idea of all of these safety triangle based movements is similar. As noted earlier, the constant time algorithms need no additional consideration to establish a O(1) distance complexity. In this section we will consider the O(log n)-time algorithm of Vaidyanathan et al. [28], that uses a procedure called “corner insertion” to complete Phase 2. While the algorithm of Di Luna et al. [12], [20] runs in O(n) time, the ideas in this section apply to it as well. 4.3.1 Safety and Pivot Triangle Consider a side S of a polygon P0 (see Figure 2). Let S make interior angles θ0 and φ0 with its neighboring sides at corners c1 , c2 , respectively. Because P0 is convex, θ0 , φ0 < π. In Phase 2, side points move to corners of a new convex polygon P1 ; for example, a side point on side S in Figure 2

moves to a new point x (that forms a corner of P1 ). The question is, how “far” from the side S can the point x be?

Fig. 2: Safety Triangle (Figure Not to Scale) Define the safety angle of corner c1 (with interior angle θ0 ) to be π − θ0 (1) θ′ = 4 Notice that because 0 < θ0 < π, the safety angle satisfies 0 < θ′ < π4 . Each of side S and its neighboring side on corner c1 has an associated safety angle θ′ (see Figure 2). Corner c2 with interior angle φ0 has a safety angle φ′ = π−θ0 4 . The triangle c1 , c2 , h (see Figure 2) is called the safety triangle of side S. In a similar manner, the side adjacent to S on corner c1 also has a safety triangle. To see the significance of the safety triangle, observe that the interior angle at corner c1 that is bounded by sides of the two safety triangles is θ0 + π π − θ0 =

π 1 Similarly, tan φi > 4φi . Therefore,

Hi
0, forward the message to a, and, go back to Step 1. Otherwise, go to Step 4. Step 4 Among the spare neighbor nodes N1 (u, v) except for the node that forwarded the message to c, select the node b

ISBN: 1-60132-508-8, CSREA Press ©

50

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

that has the largest limited-global information. Step 5 If P (b) > 0, forward the message to b, and, go back to Step 1. Otherwise, go to Step 6. Step 6 Report the failure of message delivery and terminate the algorithm. By using the algorithm, routing from the current node c = (0, 0, 0, 0, 0) to the destination node d = (1, 1, 1, 1, 1) is as follows: (1) The preferred neighbor nodes of the current node c = (0, 0, 0, 0, 0) to the destination node d = (1, 1, 1, 1, 1) are (0, 0, 0, 0, 1) and (0, 0, 0, 1, 0). Because P (0, 0, 0, 0, 1) = 0.8 and P (0, 0, 0, 1, 0) = 0.6, the node (0, 0, 0, 0, 1) with the largest limited-global information value is selected, and the message is forwarded to it. (2) The preferred neighbor node of the current node c = (0, 0, 0, 0, 1) to d is (0, 0, 0, 1, 1). Because P (0, 0, 0, 1, 1) = 0.5 > 0, it is selected, and the message is forwarded to it. (3) The preferred neighbor node of the current node c = (0, 0, 0, 1, 1) to d is (1, 0, 0, 1, 1). Because P (1, 0, 0, 1, 1) = 0, it is not selected and the message is forwarded to one of the spare neighbor nodes of c to d, that is, the nodes (0, 0, 0, 0, 1) and (0, 0, 0, 1, 0). However, (0, 0, 0, 0, 1) is the node that forwarded the message to c, and it is excluded. Because P (0, 0, 0, 1, 0) = 0.6 > 0, it is selected and the message is forwarded to it. (4) The preferred neighbor node of c = (0, 0, 0, 1, 0) to d is (0, 0, 0, 1, 1). However, it is the node that forwarded the message to c, and it is excluded. Hence, the message is forwarded to one of the spare neighbor nodes of c to d, that is, the nodes (0, 0, 0, 0, 0) and (1, 0, 0, 1, 0). Because P (0, 0, 0, 0, 0) = 0.7 and P (1, 0, 0, 1, 0) = 0.4, the node (0, 0, 0, 0, 0) with the largest limited-global information value is selected, and the message is forwarded to it. (5) The preferred neighbor nodes of the current node c = (0, 0, 0, 0, 0) to the destination node d = (1, 1, 1, 1, 1) are (0, 0, 0, 0, 1) and (0, 0, 0, 1, 0). However, (0, 0, 0, 1, 0) is the node that forwarded the message to c, and it is excluded. Because P (0, 0, 0, 0, 1) = 0.8 > 0, it is selected and the message is forwarded to it. Then, (2) to (5) are infinitely repeated, and the infinite path (0, 0, 0, 0, 0) → (0, 0, 0, 0, 1) → (0, 0, 0, 1, 1) → (0, 0, 0, 1, 0) → (0, 0, 0, 0, 0) → (0, 0, 0, 0, 1) → · · · is obtained. This failure is caused by the structure of the dual-cube. That is, if the classes of the current node and the destination node are different, it is necessary to take the cross-edge that is incident to the intermediate destination node. Hence, if the cross-edge itself is faulty or another node to which the cross-edge is incident is faulty, it is necessary to take the cross-edge that is not included in the shortest paths to the destination node. However, the routing algorithms by the previous methods always give priority to the preferred

neighbor nodes and forward the message to one of them first. Therefore, if the current node is not the intermediate destination node and there is a non-faulty node in the preferred neighbor nodes, the neighbor node connected to the current node by the cross-edge is a spare node, and the message does not forwarded to the node. Consequently, the message will be forwarded inside the cluster infinitely unless the current node does not have any preferred node and the node connected to the current node by the cross-edge has the largest limited-global information among the spare neighbor nodes.

4.3 Improvement To avoid the infinite loop explained in the previous section, we propose additional limited-global information by which each node can identify if the cross-edge incident to its neighbor node is non-faulty and the node to which the cross-edge is non-faulty. We call the information as fault information of the cross-edge. Definition 9: For a node u in a dual-cube, its fault information of the cross-edge Γ(u) is defined as follows:  0 if u or the cross-edge incident to u is faulty, Γ(u) = 1 otherwise. We assume that each node can judge if its faulty neighbor node u is faulty or not in constant time on startup of the system. If u is faulty, then the node records that Γ(u) = 0. Otherwise, because u is non-faulty, it checks its crossedge and calculate the value of Γ(u), and distribute it to its neighbor nodes. We introduce a new routing algorithm that uses the fault information of cross-edges in Fig. 4. procedure FTR(c, d, p) /* ** c: current node ** d: destination node ** p: previous node */ begin d := d(c, d); if d = 0 then begin deliver the message to c; exit end; b (b); b∗ := arg maxb∈N0 (c,d)\{p} Pd−1 ∗ ∗ b∗ if Pd−1 (b ) > 0 and (b I(d) or Γ(b∗ )=0) then FTR(b∗ , d, c) else begin b (b); b∗ := arg maxb∈N1 (c,d)\{p} Pd+1 ∗ b∗ if Pd−1 (b ) > 0 then FTR(b∗ , d, c) else exit(’delivery failed’) end end

Fig. 4: Our fault-tolerant routing algorithm.

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

51

5. Evaluation Experiment To evaluate our method, we compare the method with those by Jiang and Wu and by Park et al. in a computer experiment. The procedure of the experiment is as follows: Step 1 Repeat the following steps from 2 to 4 for 10,000 times with the ratios of faulty nodes f = 0, 0.1, . . . , 0.4 and D2n−1 (n = 4, 5, 6). Step 2 Select the ⌊f 22n−1 ⌋ faulty nodes randomly in D2n−1 . Step 3 Select two distinct nodes randomly among the nonfaulty nodes as the source node s and the destination node d. Step 3 If there is not any fault-free path between s and d, go back to Step 2. Step 4 Apply the methods and measure the numbers of successful routings and the lengths of the constructed paths and the fault-free shortest paths between s and d. For evaluation of the fault-tolerant routing algorithms, we use the following indices: (ratio of successful routings) = (number of successful routings) / (number of trials) and (ratio of detours) = (sum of constructed path lengths) / (sum of fault-free shortest path lengths) − 1.

5.2 Results

Table 1: Ratio of successful routings in D7 . Jiang&Wu

Park et al.

Our method

1.000 0.982 0.890 0.714 0.565

1.000 0.961 0.886 0.794 0.705

1.000 0.977 0.915 0.836 0.744

Table 2 shows the result of the ratios of detours by the method by Jiang and Wu, the method by Park et al., and our method in D7 . The result is also depicted in Fig. 6. Table 2: Ratio of detours in D7 .

Ratio of faulty nodes 0.0 0.1 0.2 0.3 0.4

0.6 our method 0.4

Park et al. Jiang&Wu

0.2

0

0.1

0.2 Ratio of Faulty Nodes

0.3

0.4

Fig. 5: Ratio of successful routings in D7 . 0.4 0.35

our method

0.3

Park et al.

0.25

Jiang&Wu

0.2

0.15

Table 1 shows the result of the ratios of successful routings by the method by Jiang and Wu, the method by Park et al., and our method in D7 . The result is also depicted in Fig. 5.

Ratio of faulty nodes 0.0 0.1 0.2 0.3 0.4

0.8

0

Ratio of Detours

5.1 Procedure

Ratio of Successful Routings

1

Jiang&Wu

Park et al.

Our method

0.0000 0.0835 0.1700 0.1710 0.1270

0.0000 0.0397 0.0885 0.1250 0.1290

0.0000 0.0402 0.0940 0.1470 0.1540

Table 3 shows the result of the ratios of successful routings by the method by Jiang and Wu, the method by Park et al., and our method in D9 . The result is also depicted in Fig. 7.

0.1 0.05 0

0

0.1

0.2 Ratio of Faulty Nodes

0.3

0.4

Fig. 6: Ratio of detours in D7 . Table 4 shows the result of the ratios of detours by the method by Jiang and Wu, the method by Park et al., and our method in D9 . The result is also depicted in Fig. 8. Table 5 shows the result of the ratios of successful routings by the method by Jiang and Wu, the method by Park et al., and our method in D11 . The result is also depicted in Fig. 9. Table 6 shows the result of the ratios of detours by the method by Jiang and Wu, the method by Park et al., and our method in D11 . The result is also depicted in Fig. 10.

5.3 Discussion First, we compare our method with the method by Park et al. In the evaluation experiment with faulty nodes, from Fig. 5, Fig. 7, and Fig. 9, our method improved the ratios of successful routings compared to the method by Park by at most 0.042 in D7 , at most 0.043 in D9 , and 0.037 in D11 . From Fig. 6, Fig. 8, and Fig. 10, the ratios of detours of our method is slightly larger than those of the method by Park et al. However, this is caused by the fact that our method

ISBN: 1-60132-508-8, CSREA Press ©

52

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

0.4

Table 3: Ratio of successful routings in D9 . Jiang&Wu

Park et al.

Our method

1.000 0.982 0.890 0.714 0.565

1.000 0.961 0.886 0.794 0.705

1.000 0.977 0.915 0.836 0.744

0.35

our method

0.3

Park et al.

0.25

Jiang&Wu

Ratio of Detours

Ratio of faulty nodes 0.0 0.1 0.2 0.3 0.4

0.2

0.15

1 Ratio of Successful Routings

0.1 0.05

0.8

0

0

0.1

0.6

Park et al.

0.2

0

0.1

0.2 Ratio of Faulty Nodes

0.3

0.4

Fig. 7: Ratio of successful routings in D9 . solved the trials that could not be solved by the method by Park et al. Hence, our method outperforms the method by Park et al. Next, we compare our method with the method by Jiang and Wu. In the evaluation experiment with faulty nodes, from Fig. 5, Fig. 7, and Fig. 9, our method showed better performance in D7 and D9 than the method by Jiang and Wu. On the other hand, in D11 , their method showed comparable results to our method. However, the ratios of detours of the method by Jiang and Wu are much larger than our method from Fig. 6, Fig. 8, and Fig. 10. Hence, our method outperforms the method by Jiang and Wu.

6. Conclusion and Future Works In this paper, we have proposed a fault-tolerant routing method in dual-cubes by introducing the fault information of cross-edges and modifying the routing algorithm. We have conducted an experiment to evaluate our method

Table 4: Ratio of detours in D9 .

Ratio of faulty nodes 0.0 0.1 0.2 0.3 0.4

0.4

Table 5: Ratio of successful routings in D11 .

Jiang&Wu

0

0.3

Fig. 8: Ratio of detours in D9 .

our method 0.4

0.2 Ratio of Faulty Nodes

Ratio of faulty nodes 0.0 0.1 0.2 0.3 0.4

Jiang&Wu

Park et al.

Our method

1.000 0.950 0.841 0.769 0.625

1.000 0.936 0.831 0.727 0.589

1.000 0.949 0.861 0.764 0.624

by comparing it with the method by Park et al. and the method by Jiang and Wu. From the experimental results, we have shown that our method outperforms the method by Park et al. in the ratios of successful routings and the method by Jiang and Wu in the ratios of detours. However, there still remain the cases in which the routing fails. Introducing additional limited-global information and improving the fault-tolerant routing algorithm are included in the future work.

Acknowledgment This study was partly supported by a Grant-in-Aid for Scientific Research (C) of the Japan Society for the Promotion of Science under Grant No. 17K00093.

References [1] Y. Li and S. Peng, “Dual-cube: A new interconnection network for high-performance computer clusters,” in Proceedings of the Interna-

Table 6: Ratio of detours in D11 .

Jiang&Wu

Park et al.

Our method

0.0000 0.0963 0.1870 0.2240 0.2390

0.0000 0.0428 0.0917 0.1470 0.1830

0.0000 0.0424 0.1000 0.1680 0.2140

Ratio of faulty nodes 0.0 0.1 0.2 0.3 0.4

ISBN: 1-60132-508-8, CSREA Press ©

Jiang&Wu

Park et al.

Our method

0.0000 0.1130 0.1580 0.2400 0.3210

0.0000 0.0393 0.0909 0.1500 0.2070

0.0000 0.0397 0.0992 0.1690 0.2280

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

53

Ratio of Successful Routings

1 [9]

0.8 [10]

0.6 our method 0.4

[11]

Park et al. Jiang&Wu

[12]

0.2 [13]

0

0

0.1

0.2 Ratio of Faulty Nodes

0.3

0.4 [14]

Fig. 9: Ratio of successful routings in D11 . [15]

0.4 [16]

0.35

our method Park et al.

0.25

Jiang&Wu

Ratio of Detours

0.3

[17] [18]

0.2

0.15

[19]

0.1 [20]

0.05 0

0

0.1

0.2 Ratio of Faulty Nodes

0.3

0.4

Fig. 10: Ratio of detours in D11 .

[21] [22] [23]

[2] [3]

[4] [5] [6]

[7]

[8]

tional Computer Symposium, Workshop on Computer Architecture, pp. 51–57, Dec. 2000. C. L. Seitz, “The cosmic cube,” Communications of the ACM, vol. 28, pp. 22–33, Jan. 1985. S.-Y. Chen and S.-S. Kao, “Hamiltonian related properties with and without faults of the dual-cube interconnection network and their variations,” International Journal of Mathematical, Computational, Physical, Electrical and Computer Engineering, vol. 10, pp. 201–205, Apr. 2016. J.-C. Chen and C.-H. Tsai, “Conditional edge-fault-tolerant hamiltonicity of dual-cubes,” Information Sciences, vol. 181, pp. 620–627, Feb. 2011. Z. Jiang and J. Wu, “A limited-global information model for faulttolerant routing in dual-cube,” The International Journal of Parallel, Emergent and Distributed Systems, vol. 21, pp. 61–77, Feb. 2006. K. Kaneko and S. Peng, “Node-to-set disjoint paths routing in dual-cube,” in Proceedings of the 2008 International Symposium on Parallel Architectures, Algorithms, and Networks, pp. 77–82, May 2008. K. Kaneko and S. Peng, “Set-to-set disjoint paths routing in dualcube,” in Proceedings of the 2008 Ninth International Symposium on Parallel and Distributed Computing, Applications and Technologies, pp. 129–136, Dec. 2008. Y. Li, S. Peng, and W. Chu, “Hamiltonian cycle embedding for fault tolerance in dual-cube,” in Proceedings of the IASTED International

[24] [25] [26] [27] [28]

[29] [30] [31]

Conference on Networks, Parallel and Distributed Processing, and Applications, pp. 1–6, Oct. 2002. Y. Li, S. Peng, and W. Chu, “Fault-tolerant cycle embedding in dualcube with node faulty,” in Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 71–75, Aug. 2003. Y. Li, S. Peng, and W. Chu, “Efficient collective communications on dual-cube,” The Journal of Supercomputing, vol. 28, pp. 71–90, Apr. 2004. J. Park, N. Seki, and K. Kaneko, “Stochastic fault-tolerant routing in dual-cubes,” IEICE Transactions on Information and Systems, vol. E100-D, pp. 1920–1921, July 2017. Y.-K. Shih, H.-C. Chuang, S.-S. Kao, and J. J. M. Tan, “Mutually independent hamiltonian cycles in dual-cubes,” The Journal of Supercomputing, vol. 54, pp. 239–251, Nov. 2010. A. Bossard and K. Kaneko, “The set-to-set disjoint-path problem in perfect hierarchical hypercubes,” The Computer Journal, vol. 55, pp. 769–775, June 2012. A. Bossard and K. Kaneko, “Node-to-set disjoint-path routing in hierarchical cubic networks,” The Computer Journal, vol. 55, pp. 1440– 1446, Dec. 2012. A. Bossard and K. Kaneko, “Set-to-set disjoint paths routing in hierarchical cubic networks,” The Computer Journal, vol. 57, pp. 332– 337, Feb. 2014. J.-S. Fu, G.-H. Chen, and D.-R. Duh, “Node-disjoint paths and related problems on hierarchical cubic networks,” Networks, vol. 40, pp. 142– 154, Oct. 2002. Q.-P. Gu and S. Peng, “An efficient algorithm for the k-pairwise disjoint paths problem in hypercubes,” Journal of Parallel and Distributed Computing, vol. 60, pp. 764–774, June 2000. D. Kocík and K. Kaneko, “Node-to-node disjoint paths problem in a Möbius cube,” IEICE Transactions on Information and Systems, vol. E100-D, pp. 1837–1843, Aug. 2017. K. Kaneko and N. Sawada, “An algorithm for node-to-node disjoint paths problem in burnt pancake graphs,” IEICE Transactions on Information and Systems, vol. E90-D, pp. 306–313, Jan. 2007. C.-N. Lai, “An efficient construction of one-to-many node-disjoint paths in folded hypercubes,” Journal of Parallel and Distributed Computing, vol. 74, pp. 2310–2316, Apr. 2014. L. Lipták, E. Cheng, J.-S. Kim, and S. W. Kim, “One-to-many nodedisjoint paths of hyper-star networks,” Discrete Applied Mathematics, vol. 160, pp. 2006–2014, Sept. 2012. Y. Suzuki and K. Kaneko, “An algorithm for disjoint paths in bubblesort graphs,” Systems and Computers in Japan, vol. 37, pp. 27–32, Nov. 2006. Y. Suzuki, K. Kaneko, and M. Nakamori, “Node-disjoint paths in a transposition graph,” IEICE Transactions on Information and Systems, vol. E89-D, pp. 647–653, Feb. 2006. R.-Y. Wu, G.-H. Chen, Y.-L. Kuo, and G. J. Chang, “Node-disjoint paths in hierarchical hypercube networks,” Information Sciences, vol. 177, pp. 4200 – 4207, Oct. 2007. Y. Xiang and I. A. Stewart, “One-to-many node-disjoint paths in (n, k)-star graphs,” Discrete Applied Mathematics, vol. 158, pp. 62– 70, Jan. 2010. K. Menger, “Zur allgemeinen Kurventhoerie,” Fundamenta Mathematicae, vol. 10, no. 1, pp. 96–115, 1927. J. Wu, “Reliable unicasting in faulty hypercubes using safety levels,” IEEE Transactions on Computers, vol. 46, pp. 241–247, Feb. 1997. M. Myojin and K. Kaneko, “A fault-tolerant routing algorithm using directed probabilities in hypercube networks,” in Proceedings of the 2012 International Conference on Parallel and Distributed Processing Techniques and Applications, vol. 1, pp. 131–136, July 2012. J. Al-Sadi, K. Day, and M. 
Ould-Khaoua, “Probability-based faulttolerant routing in hypercubes,” The Computer Journal, vol. 44, no. 5, pp. 368–373, 2001. D. T. Duong and K. Kaneko, “Fault-tolerant routing based on approximate directed routable probabilities for hypercubes,” Future Generation Computer Systems, vol. 37, pp. 88–96, July 2014. L. B. Ngoc, B. T. Thuan, Y. Hirai, and K. Kaneko, “Stochastic link-fault-tolerant routing in hypercubes,” Journal of Advances in Computer Networks, vol. 4, pp. 100–106, June 2016.

ISBN: 1-60132-508-8, CSREA Press ©

54

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |



2QWKH&RQVWUXFWLRQRI2SWLPDO 1RGH'LVMRLQW3DWKVLQ)ROGHG+\SHUFXEHV RI(YHQ'LPHQVLRQV &KHQJ1DQ/DL 'HSDUWPHQWRI%XVLQHVV&RPSXWLQJ 1DWLRQDO.DRKVLXQJ8QLYHUVLW\RI6FLHQFHDQG7HFKQRORJ\.DRKVLXQJ7DLZDQ



VWUXFWXUHRIDIFXEHZKHUH         DQG    DUH WKH IRXU FRPSOHPHQW OLQNV :LWK WKHKHOSRIFRPSOHPHQWOLQNVWKHGLDPHWHURIDQQIFXEHLV UHGXFHG WR ªQº $Q QFXEH DQG DQ QIFXEH KDYH FRQQHFWLYLWLHVQDQGQUHVSHFWLYHO\7KHFRQQHFWLYLW\RID QHWZRUN LV WKH PLQLPXP QXPEHU RI QRGHV ZKRVH UHPRYDO FDQPDNHWKHQHWZRUNGLVFRQQHFWHGRUWULYLDO 

$EVWUDFW  1RGHGLVMRLQW SDWKV KDYH PDGH VLJQLILFDQW

FRQWULEXWLRQV WR WKH VWXG\ RI URXWLQJ UHOLDELOLW\ DQG IDXOW WROHUDQFHRIDQLQWHUFRQQHFWLRQQHWZRUN ,QWKLVSDSHUZH FRQVWUXFW P QRGHGLVMRLQW SDWKV IURP RQH VRXUFH QRGH WR RWKHUP QRWQHFHVVDULO\GLVWLQFW WDUJHWQRGHVUHVSHFWLYHO\ LQ DQ QGLPHQVLRQDO IROGHG K\SHUFXEH VR WKDW QRW RQO\ LV WKHLU WRWDO OHQJWK PLQLPL]HG EXW WKHLU PD[LPDO OHQJWK LV DOVR PLQLPL]HG LQ WKH ZRUVW FDVH ZKHUH PdQ DQG Q LV HYHQ ,Q DGGLWLRQ WKHVH P QRGHGLVMRLQW SDWKV FDQ EH FRQVWUXFWHG LQ 2 PQPQORJ Q  WLPH DQG HDFK SDWK LV HLWKHUVKRUWHVWRUQHDUO\VKRUWHVW  



.H\ZRUGV )ROGHG K\SHUFXEH K\SHUFXEH QRGHGLVMRLQW SDWKVPDWFKLQJRSWLPL]DWLRQ

,QWURGXFWLRQ 0RGHUQKDUGZDUHWHFKQRORJ\KDVPDGHLWFRPPRQWREXLOG DODUJHVFDOHPXOWLSURFHVVRUV\VWHPFRQVLVWLQJRIKXQGUHGV RUHYHQWKRXVDQGVRISURFHVVRUV%HIRUHGHVLJQLQJDPXOWL SURFHVVRUV\VWHPLWVLQWHUFRQQHFWLRQQHWZRUN QHWZRUNIRU VKRUW LQZKLFKQRGHVDQGOLQNVUHVSHFWLYHO\FRUUHVSRQGWR SURFHVVRUV DQG FRPPXQLFDWLRQ FKDQQHOV PXVW EH GHWHUPLQHG ILUVW 6LQFH WKH WRSRORJ\ RI D QHWZRUN PDNHV JUHDWLQIOXHQFHVLQWKHV\VWHPSHUIRUPDQFHPDQ\SRVVLEOH RSWLRQV KDYH EHHQ SURSRVHG LQ WKH OLWHUDWXUH IRU SUDFWLFDO LPSOHPHQWDWLRQV DQGRU WKHRUHWLFDO VWXGLHV )RU WKH ODWWHU SXUSRVH WKH IROGHG K\SHUFXEH ZDV RQH RI WKH QHWZRUNV ZKLFK KDYH UHFHLYHG PXFK DWWHQWLRQ IURP RXWVWDQGLQJ UHVHDUFKHUV>@  $ IROGHG K\SHUFXEH LV EDVLFDOO\ D K\SHUFXEH ZLWK DGGLWLRQDO OLQNV DXJPHQWHG ZKHUH WKH DGGLWLRQDO OLQNV FRQQHFW DOO SDLUV RI QRGHV ZKRVH GLVWDQFHV DUH ORQJHVW LQ WKH K\SHUFXEH $Q QGLPHQVLRQDO K\SHUFXEH DEEUHYLDWHG WRDQQFXEH >@FRQVLVWVRIQQRGHVWKDWDUHODEHOHGZLWK QELQDU\VWULQJVRIOHQJWKQ7ZRQRGHVDUHFRQQHFWHGE\D OLQN LI DQG RQO\ LI WKH\ GLIIHU E\ H[DFWO\ RQH ELW 7KH GLDPHWHU RI DQ QFXEH LV Q 2Q WKH RWKHU KDQG DQ Q GLPHQVLRQDOIROGHGK\SHUFXEH DEEUHYLDWHGWRDQQIFXEH >@ LVEDVLFDOO\DQQFXEHDXJPHQWHGZLWKQFRPSOHPHQWOLQNV (DFKFRPSOHPHQWOLQNFRQQHFWVWZRQRGHVZKRVHODEHOVDUH WKH FRPSOHPHQWV RI HDFK RWKHU )LJXUH  VKRZV WKH

 $ VLPSOH SDWKLQDQHWZRUNFRQVLVWVRIDVHTXHQFHRI PXWXDOO\ GLVWLQFW QRGHV VXFK WKDW WKHUH LV D OLQN EHWZHHQ DQ\ WZR FRQVHFXWLYH QRGHV 7ZR SDWKV DUH LQWHUQDOO\ QRGHGLVMRLQW GLVMRLQW IRU VKRUW  LI WKH\ GR QRW VKDUH DQ\ FRPPRQQRGH H[FHSWWKHLU HQG QRGHV 'LVMRLQWSDWKV KDYH PDGH WKHPVHOYHV SOD\ DQ VLJQLILFDQW UROH LQ WKH VWXG\ RI URXWLQJUHOLDELOLW\DQGIDXOWWROHUDQFHRIDQHWZRUNEHFDXVH SDUDOOHO URXWLQJ XVHV WKHP WR DYRLG FRQJHVWLRQ DFFHOHUDWH WUDQVPLVVLRQ UDWH DQG SURYLGH DOWHUQDWLYH WUDQVPLVVLRQ URXWHV7KHUHDUHWKUHHNLQGVRIGLVMRLQWSDWKV>@LHRQH WRRQH > @ RQHWRPDQ\ >      @ DQG PDQ\WRPDQ\ >  @ 7KH RQHWRRQH GLVMRLQW SDWKV DOVR FDOOHG WKH FRQWDLQHU >@ KDYH FRPPRQ HQG QRGHV $FFRUGLQJWR0HQJHU VWKHRUHPWKHUHH[LVWVDFRQWDLQHUIRU DQ\WZRQRGHVLQDQHWZRUNZLWKLWVZLGWKLHWKHQXPEHU RIGLVMRLQWSDWKVLQLWQRWOHVVWKDQWKH FRQQHFWLYLW\RIWKH QHWZRUN 7KH VWXG\ RI FRQWDLQHUV SURYLGHV LPSRUWDQW PHDVXUHV VXFK DV ZLGHGLVWDQFH DQG ZLGHGLDPHWHU IRU 

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

55

VLVWKHVRXUFHQRGHDQGGG«GPDUHP QRWQHFHVVDULO\ GLVWLQFW  GHVWLQDWLRQ QRGHV LQ DQ QFXEH ZKHUH PdQ DQG V ^G G « GP` 6LQFH DQQFXEH LV QRGH V\PPHWULF ZH

DQDO\]LQJWKHUHOLDELOLW\DQGIDXOWWROHUDQFHRIDQHWZRUN  7KHRQHWRPDQ\GLVMRLQWSDWKVIURPDFRPPRQQRGH WRRWKHUPXWXDOO\ GLVWLQFWQRGHV ZHUH ILUVWVWXGLHG LQ>@ DQG DQ ,QIRUPDWLRQ 'LVSHUVDO $OJRULWKP ,'$ IRU VKRUW  ZDV SURSRVHG RQ WKH K\SHUFXEH %\ WDNLQJ DGYDQWDJHV RI GLVMRLQWSDWKVWKH,'$KDVQXPHURXVSRWHQWLDODSSOLFDWLRQV WR VHFXUH DQG IDXOWWROHUDQW VWRUDJH DQG WUDQVPLVVLRQ RI LQIRUPDWLRQ,QDGGLWLRQWKHVWXG\RIRQHWRPDQ\GLVMRLQW SDWKVSURYLGHVLPSRUWDQWPHDVXUHVVXFKDVVWDUGLDPHWHU>@ DQG5DELQQXPEHU>@IRUDQDO\]LQJWKHUHOLDELOLW\DQGIDXOW WROHUDQFHRIDQHWZRUN$VGHVFULEHGOLWHUDOO\WKHPDQ\WR PDQ\ GLVMRLQWSDWKVFRQQHFWWZRVHWVRIQRGHV,QRUGHUWR UHGXFH WKH WUDQVPLVVLRQ FRVW DQG ODWHQF\ WKH WRWDO OHQJWK DQG PD[LPDO OHQJWK RI GLVMRLQW SDWKV DUH UHTXLUHG WR EH PLQLPL]HG UHVSHFWLYHO\ ZKHUH WKH OHQJWK RI D SDWK LV WKH QXPEHURIOLQNVLQLW  5RXWLQJIXQFWLRQVKDYHEHHQVKRZQWREHHIIHFWLYHRQ GHULYLQJYDULRXVGLVMRLQWSDWKVLQWKHK\SHUFXEH>@DQGLWV YDULDQWVVXFKDVWRUL>@JHQHUDOL]HGK\SHUFXEHV>@DQG IROGHG K\SHUFXEHV >@ ,Q WKH SUHYLRXV SDSHU >@ P GLVMRLQW SDWKV IURP RQH VRXUFH QRGH WR RWKHU P QRW QHFHVVDULO\ GLVWLQFW  WDUJHW QRGHV UHVSHFWLYHO\ KDYH EHHQ FRQVWUXFWHGE\WKHDLGRIDPD[LPDOSDUWLDOURXWLQJIXQFWLRQ LQ DQ QIFXEH VR WKDW QRW RQO\ LV WKHLU WRWDO OHQJWK PLQLPL]HG EXW WKHLU PD[LPDO OHQJWK LV DOVR PLQLPL]HG LQ WKHZRUVWFDVHZKHUHPdQDQGQLVRGG,QWKLVSDSHUZH IXUWKHUVKRZWKDWIRUHYHQQWKLVNLQGRIGLVMRLQWSDWKVFDQ DOVREHFRQVWUXFWHG,QDGGLWLRQHDFKSDWKLVHLWKHUVKRUWHVW RUQHDUO\VKRUWHVW6LQFHHYHQQPDNHVDQQIFXEHQRORQJHU ELSDUWLWHDSDUWLFXODUURXWLQJIXQFWLRQPXVWEHGHVLJQHGIRU LW 6LPLODUO\ WKH SUREOHP RI FRQVWUXFWLQJ RSWLPDO GLVMRLQW SDWKV LQ DQ QIFXEH ZDV ILUVW WUDQVIRUPHG LQWR D FRUUHVSRQGLQJSUREOHPRIFRQVWUXFWLQJGLVMRLQWSDWKVLQDQ Q FXEH ZLWK VSHFLDO SURSHUWLHV DQG WKHQ LWV VROXWLRQV DUHDSSOLHGWRGHULYHWKHUHTXLUHGSDWKV LQWKHQIFXEH %\ WKH DLG RI WKH QHZO\ GHVLJQHG URXWLQJ IXQFWLRQ DQG WKH FRQVWUXFWLRQSURFHGXUHVRI>@LWZLOOEHVKRZQWKDWWKHVH RSWLPDO P GLVMRLQW SDWKV FDQ EH FRQVWUXFWHG LQ 2 PQ PQORJ Q  WLPH 6LQFH ,'$ >@ UHOLHV KHDYLO\ RQ RQHWR PDQ\ GLVMRLQW SDWKV WKH FRQVWUXFWLRQ RI RSWLPDO GLVMRLQW SDWKVLQDQQIFXEHLVQRWRQO\WKHRUHWLFDOO\LQWHUHVWLQJEXW DOVRSUDFWLFDOLQUHDODSSOLFDWLRQV  7KH UHVW RI WKLV SDSHU LV RUJDQL]HG DV IROORZV ,QWKH QH[W VHFWLRQ WKH URXWLQJ IXQFWLRQV DQG WKH FRQVWUXFWLRQ RI GLVMRLQW VKRUWHVW SDWKV LQ DQ QFXEH DUH GHVFULEHG %RWK RI WKHP DUH QHFHVVDULO\ XVHG LQ WKLV SDSHU $V VKRZQ LQ 6HFWLRQ  WKH RSWLPDO GLVMRLQW SDWKV LQ DQ QIFXEH FDQ EH FRQVWUXFWHG LQ 2 PQPQORJ Q  WLPH ,Q 6HFWLRQ  WKLV SDSHUFRQFOXGHVZLWKVRPHUHPDUNVRQWKHHIILFLHQF\RIRXU UHVXOWVDQGGHVFULEHVWKHIXWXUHZRUN)RUWKHEUHYLW\RIWKLV SDSHU ZKHQHYHU ZH GLVFXVV WLPH FRPSOH[LW\ ZH PHDQ ZRUVWFDVHWLPHFRPSOH[LW\ 



ᇩᇪᇫ DVVXPH V ͲͲǤ Ǥ ǤͲ Q LH WKH RULJLQ ZLWKRXW ORVV RI JHQHUDOLW\/HW' ^GG«GP`EHDPXOWLVHWDQG, ^N N«NP`EHDVHWRIPGLVWLQFWLQWHJHUVUDQJLQJIURPWR Q DFWXDOO\NN«NPGHQRWHPGLPHQVLRQVRIDQQFXEH  $ PXOWLVHW LV D FROOHFWLRQ RI HOHPHQWV LQ ZKLFK PXOWLSOH RFFXUUHQFHV RI WKH VDPH HOHPHQW DUH DOORZHG ,Q >@ D RQHWRRQH FRUUHVSRQGHQFH ) ' o , ZDV UHIHUUHGWR DV D URXWLQJ IXQFWLRQ DQG LW ZDV VKRZQ WKDW URXWLQJ IXQFWLRQV FDQEHHIIHFWLYHO\XVHGWRGHULYHPGLVMRLQWSDWKVIURPVWR GG«GPUHVSHFWLYHO\LQDQQFXEH  $V GHILQHG LQ >@ ZH OHW HNW ͲNWିଵ ͳͲQିNW  DQG GL GLGLxxxGLQ ZKHUH dWdP dLdP DQG GLN GHQRWHV WKH NWK ELW IURP WKH OHIW  RI GL IRU DOO dNdQ ,QWXLWLYHO\ ) GL NWPHDQVWKDW HNW   LVFKRVHQIRUWKHQRGHIROORZLQJV ZKHQ ZH URXWH IURP V WR GL 6LQFH GLĭ GL GLNW  DVVXUHV WKDW HNW  LVLQFOXGHGLQDVKRUWHVWSDWKIURPVWRGLZHSUHIHU D)ZLWK GLĭ GL IRUDOOdLdPLQRUGHUWRURXWHDVKRUWHVW SDWK IURPV WR GL 8QIRUWXQDWHO\ WKH SUHIHUUHG )GRHV QRW DOZD\VH[LVWIRUDUELWUDU\'DQG,  ,Q >@ DQ RSWLPDO 2 PQ  FRQVWUXFWLRQ SURFHGXUH QDPHG3DWKVZDVSURSRVHG:LWKLQSXWDUJXPHQWV)PQ ' DQG , LW ZDV VKRZQ LQ >@ WKDW 3DWKV ) P Q ' ,  FDQ SURGXFH P GLVMRLQW SDWKV GHQRWHG E\ 4 4 « 4P IURPVWRGG «GPUHVSHFWLYHO\VRWKDW 4LLV VKRUWHVW ZLWKOHQJWK_GL_LI GLĭ GL IRUDOOdLdPZKHUH_GL_GHQRWHV WKHQXPEHURIELWVFRQWDLQHGLQGLLHWKHGLVWDQFHIURPV WR GL 7KH IROORZLQJ OHPPD GHVFULEHV WKH DERYH UHVXOWV IRUPDOO\ /HPPD>@*LYHQDURXWLQJIXQFWLRQ)'o,VXFKWKDW GLĭ GL  IRU DOO dLdP WKHQ 3DWKV ) P Q ' ,  FDQ SURGXFHPGLVMRLQWSDWKV44«4PIURPVWRGG« GP UHVSHFWLYHO\ LQ DQ QFXEH VR WKDW 4L LV VKRUWHVW ZLWK OHQJWK_GL_IRUDOOdLdP ,Q>@DRQHWRRQHPDSSLQJ :^GG«GP`o ^ « Q` ZDVUHIHUUHG WRDVD SDUWLDOURXWLQJIXQFWLRQ ZKHUHPdQ2EYLRXVO\ZHFDQREWDLQDURXWLQJIXQFWLRQ ) E\ GHILQLQJ , ^: G  : G  « : GP ` DQG ) GL : GL  IRUDOOdLdP0RUHRYHULI GLȍ GL IRUDOOdLdPWKHQZH KDYH GLĭ GL GLȍ GL  DQG KHQFH /HPPD  HQVXUHV WKDW WKHUHH[LVWPGLVMRLQWVKRUWHVWSDWKVIURPVWRGG«GP 

 &RQVWUXFWLQJ RSWLPDO GLVMRLQW SDWKV LQ DQ QIFXEHIRUHYHQQ

3UHOLPLQDU\

6XSSRVHWKDWVLVWKHVRXUFHQRGHDQGWW«WPDUHP QRW QHFHVVDULO\ GLVWLQFW  WDUJHW QRGHV LQ DQ QIFXEH ZKHUH Pd Q V ^W W « WP` DQG Q LV HYHQ 6LQFH IROGHG

,Q WKLV VHFWLRQ ZH EULHIO\ GHVFULEH VRPH UHVXOWV RI >@ ZKLFK DUH QHFHVVDULO\ XVHG LQ WKLV SDSHU )RU UHIHUHQWLDO LQWHJULW\ZHDOVRIROORZWKHV\PEROVLQ>@6XSSRVHWKDW

ᇩᇪᇫ K\SHUFXEHVDUHQRGHV\PPHWULFZHPD\DVVXPHV ͲͲǤ Ǥ ǤͲ  QLHWKHRULJLQZLWKRXWORVVRIJHQHUDOLW\,QWKLVVHFWLRQ LWZLOOEHVKRZQWKDWPGLVMRLQWSDWKVIURPVWRWW«WP





ISBN: 1-60132-508-8, CSREA Press ©

56

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

UHVSHFWLYHO\ LQWKHQIFXEH FDQEHFRQVWUXFWHGLQ2 PQ PQORJ Q  WLPH VR WKDW QRW RQO\ LV WKHLU WRWDO OHQJWK PLQLPL]HG EXW WKHLU PD[LPDO OHQJWK LV PLQLPL]HG LQ WKH ZRUVWFDVH  $FFRUGLQJWRRXUFRQVWUXFWLRQPHWKRGHDFKQRGHWLLV ILUVWPDSSHGWRDQRGHGLLQDQ Q FXEHIRUDOOdLdPE\ VXEVWLWXWLQJ WKH OLQNV RI WKH Q WK GLPHQVLRQ RI Q  FXEH IRU WKH FRPSOHPHQW OLQNV RI QIFXEH 7KH LGHD RI QRGHPDSSLQJFRPHVIURPWKHREVHUYDWLRQWKDWDQ\VKRUWHVW SDWKDQGQHDUO\VKRUWHVWSDWKEHWZHHQWZRQRGHVLQDIROGHG K\SHUFXEHFRQWDLQVDWPRVWRQHDQGWZRFRPSOHPHQWOLQNV UHVSHFWLYHO\ 7KHQ E\ WKH DLG RI SURFHGXUH 3DWKV P GLVMRLQW SDWKV 4 4 « 4P IURP Q WR G G « GP UHVSHFWLYHO\ FDQ EH FRQVWUXFWHG LQ WKH Q FXEH $V VKRZQ ODWHU HDFK SDWK IURP V WR WL GHQRWHG E\ 5L FDQ EH GHULYHG IURP 4L IRU DOO dLdP VR WKDW 5 5 « 5P DUH RSWLPDODQGGLVMRLQWWRHDFKRWKHU,QWKHUHVWRIWKLVVHFWLRQ ZH ILUVW GHVFULEH KRZ WKH QRGHPDSSLQJ ZRUNV DV ZHOO DV WKHQHZURXWLQJIXQFWLRQ7KHQZHVKRZKRZ4LDQG5LDUH FRQVWUXFWHGDQGJLYHRXUPDLQUHVXOW  /HW9QDQG9QGHQRWHWKHWZRVHWVRIQRGHVRIDQQ IFXEH LHQFXEH DQGDQ Q FXEHUHVSHFWLYHO\6LQFHQ LVHYHQZHREWDLQWKHIROORZLQJQRGHPDSSLQJE\VOLJKWO\ PRGLI\LQJWKDWLQ>@

RWKHUZLVH FLȍ FL  IRUDOOdLdP  3URRI)RUDOOdLdPZHOHWGL FLH[FHSW FLȍ FL DQG_FL_  Q,QFDVHRIWKDWZHOHWGL FL FL xxx FLQ LH GLN FLN IRUDOOdNdQ:LWKRXWORVVRIJHQHUDOLW\ZH PD\DVVXPH _FM_ QIRUDOOUdMdSZKHUHUdSdP)RUDOO SdMdP OHW G M GMGMxxx GMȍ FM   GMȍ FM GMȍ FM  xxx GMQ2EYLRXVO\ G MGM LVDOLQNLQWKH Q FXEH  'HILQHDURXWLQJIXQFWLRQ )^GG«GSG S« G P`o^: F  : F « : FP `VXFKWKDW ) GL : FL IRU DOO dLdS DQG ) G M : FM  IRU DOO SdMdP ,W LV HDV\ WR YHULI\ WKDW GLĭ GL GL: FL FL: FL  IRU DOO dLdU GŒĭ GŒ  GM: FM FMȍ FM  IRU DOO UdMdS DQG G Mĭ G M G M: FM  GMȍ FM FMȍ FM  IRU DOO SdMdP %\ /HPPD  3DWKV ) P Q ^G G « GS G S « G P` ^: F  : F « : FP ` FDQSURGXFHPGLVMRLQWSDWKV44« 4S4 S«  4 PIURPQWRGG«GSG S«G P UHVSHFWLYHO\ LQ DQ Q FXEH VR WKDW 4L DQG 4 M DUH ERWK VKRUWHVWZLWKOHQJWKV_GL_DQG_G M_UHVSHFWLYHO\IRUDOOdLdS DQG SdMdP )RU DOO SdMdP FRQVWUXFW 4M DV WKH FRPELQDWLRQ RI 4 M DQG D OLQN G M GM  DQG 4M KDV OHQJWK _G M_ %\ WDNLQJ DGYDQWDJH RI WKH SURSHUWLHV RI WKH SUHIHUUHG: DVVXPHGDERYH LWZDVVKRZQLQ>@WKDW4 4«4PDUHGLVMRLQWWRHDFKRWKHU  'HILQHI9Qo9QDVDQRGHPDSSLQJZKLFKPDSVD QRGHX XXxxxXQ9QWRDQRGH[ [[xxx[Q9QVXFK WKDWIRUDOOdMdQ[M XMLIXQ DQG[M XMHOVH XQ   ,WLVHDV\WRYHULI\WKDWIRUHYHU\OLQN XY LQDQ Q FXEH WKHUHLVDOLQN I X  I Y LQDQQIFXEH)RUdLdPOHW5L EH FRQVWUXFWHG DV IROORZV )RU HDFK OLQN X Y  LQ 4L ZH KDYHOLQN I X I Y LQFOXGHGLQ5L%\GHILQLWLRQZHKDYH HLWKHU GLQ  DQG GLN WLN RU GLQ  DQG GLN WLN IRU DOO dNdQ ,W IROORZV WKDW I GL WL ZKLFK LPSOLHV WKDW 5L LV D SDWKIURPVWRWLLQDQQIFXEHEHFDXVH I Q VDQG4LLVD SDWKIURPQWRGLLQWKH Q FXEH,QDGGLWLRQ5LDQG4L KDYHWKHVDPHOHQJWK3OHDVHUHIHUWR>@55«5PDUH GLVMRLQWWRHDFKRWKHU  )RUDOOdLdPHDFK5LLVHLWKHUVKRUWHVWZLWKOHQJWK_FL_ LI FLȍ FL RUQHDUO\VKRUWHVWZLWKOHQJWK_FL_DQG_FL_ UHVSHFWLYHO\IRU_FL_ QDQG_FL_dQRWKHUZLVH FLȍ FL   DVVKRZQEHORZ6LQFHWKHOHQJWK RI5LLV _GL_IRU DOOdLdU DQGWKHOHQJWKRI5MLV_G M_IRUDOOSdMdPLWLVVXIILFLHQW WRVKRZWKDW_GL_ _FL_IRUDOOdLdU_GM_ _FM_IRUDOOUdMdS DQG _G M_ _FM_ IRU DOO SdMdP )RU DOO dLdU ZH KDYH GL FLZKLFKLPSOLHV_GL_ _FL_)RUDOOUdMdSZHKDYH_FM_  Q ZKLFK LPSOLHV _GM_ Q _FM_ Q _FM_ )RU DOO S dMdP ZH KDYH _G M_ _GM_ EHFDXVH RI G M GMGMxxx GMȍ FM ͳ GMȍ FM GMȍ FM  xxxGMQ DQG GMȍ FM FMȍ FM  6LQFH GM FM ZH KDYH _G M_ _GM_ _FM_ ZKLFK LPSOLHV  _G M_ _FM_  ,W LV QRW GLIILFXOW WR YHULI\ WKDW DOO RI G G « GS G S«G PFDQEHREWDLQHGLQ2 P WLPHDQGHDFK5LFDQ EH GHULYHG IURP 4L LQ 2 Q  WLPH IRU DOO dLdP 6LQFH SURFHGXUH 3DWKV WDNHV 2 PQ  WLPH DQG WKH SUHIHUUHG :

'HILQLWLRQ/HW M9Qo9QEHDQRGHPDSSLQJZKLFK PDSV D QRGH [  [[xxx[Q9Q WR D QRGH X  XXxxxXQ 9QVXFKWKDWHLWKHUXQ DQGXN [NIRUDOOdNdQLI_[_dQ RUXQ DQGXM [NIRUDOOdNdQHOVH _[_!Q ZKHUH[N XNXQ^`IRUDOOdNdQ  /HW FL M WL  IRU DOO dLdP ,W LV QRW GLIILFXOW WR YHULI\ WKDW _FL_dQ DQG _FL_ GLVWI Q V WL  ZKHUH GLVWI Q V WL  LV WKH GLVWDQFHIURPVWRWLLQWKHQIFXEH,WVKRXOGEHQRWHGWKDW _FL_ Q KROGV LI _WL_ Q RU _WL_ Q /HW Z EH D ZHLJKW IXQFWLRQ VXFK WKDW HLWKHU Z _FL_  LI _FL_ Q RU Z _FL_  HOVH _FL_dQ ,QIDFW Z _FL_ GHQRWHVWKHDGGLWLRQDOOHQJWK WKDW WKH QHDUO\ VKRUWHVW SDWK IURP V WR WL  ZLOO FDXVH LQ FRPSDULVRQ ZLWK WKH VKRUWHVW RQH 6XSSRVH WKDW :  ^F F«FP`o^«Q`LVDSDUWLDOURXWLQJIXQFWLRQ $Q : ZDVVDLGWR EHRSWLPDO LI 3:d3:  IRU DQ\ :   ^F F « FP` o ^  « Q` ZKHUH 3:  σP L  FLȍ FL uZ ȁFL ȁ  3OHDVH UHIHU WR >@ VXFK DQ RSWLPDO : FDQ EH GHWHUPLQHG LQ 2 PQQORJ Q  WLPH E\ VROYLQJ D FRUUHVSRQGLQJ PD[LPXP ZHLJKWHG ELSDUWLWH PDWFKLQJ SUREOHP :LWKRXW ORVV RI JHQHUDOLW\ ZH PD\ DVVXPHWKDW FLȍ FL IRUDOOdLdUDQG FMȍ FM  IRUDOOU dMdP ZKHUH dUdP 6LPLODU WR >@ VXFK DQ : FDQ EH GHWHUPLQHG LQ 2 P Q u2 P Q  Q ORJ Q  2 PQPQORJQ WLPH  



/HPPD  ,Q DQ QIFXEH P GLVMRLQW SDWKV 5 5 « 5P IURP V WR W W « WP UHVSHFWLYHO\ FDQ EH FRQVWUXFWHG LQ 2 PQPQORJ Q  WLPH VR WKDW HDFK 5L LV HLWKHU VKRUWHVW ZLWK OHQJWK _FL_ LI FLȍ FL  RU QHDUO\ VKRUWHVW ZLWK OHQJWK _FL_ DQG _FL_ UHVSHFWLYHO\ IRU _FL_ Q DQG _FL_dQ 

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

57

DVVXPHG DERYH  FDQ EH FRPSXWHG LQ 2 PQPQORJ Q  WLPH WKH FRQVWUXFWLRQ RI 5 5 « 5P WDNHV 2 PQ PQORJ Q WLPHDVDZKROH                 …

2 PQQORJ Q  WLPH ZH FRQMHFWXUH WKDW WKH WRWDO WLPH FRPSOH[LW\ FDQ EH IXUWKHU UHGXFHG LI WKH FRQVWUXFWLRQ PHWKRG FDQ EH VOLJKWO\ PRGLILHG DQG LWLV FXUUHQWO\ XQGHU RXUVWXG\>@  



)RU DOO dLdP LI _WL_ Q RU _WL_ Q WKHQ ZH KDYH _FL_ QDQG/HPPDHQVXUHVWKDW5LLVHLWKHUVKRUWHVWZLWK OHQJWK _FL_ Q RU QHDUO\ VKRUWHVW ZLWK OHQJWK _FL_ Q 2QWKHRWKHUKDQG _WL_zQDQG_WL_zQ ZHKDYH_FL_dQ DQG /HPPD  HQVXUHV WKDW WKH 5L LV HLWKHU VKRUWHVW ZLWK OHQJWK_FL_dQRUQHDUO\VKRUWHVWZLWKOHQJWK_FL_dQ +HQFHWKHOHQJWKRI5LLVQRWJUHDWHUWKDQQ6LQFHWKHUH H[LVWV D FDVH LQ >@ VXFK WKDW LWV PD[LPDO OHQJWK LV QRW OHVV WKDQ Q WKH PD[LPDO OHQJWK RI 5 5 « 5P LV PLQLPL]HGLQWKHZRUVWFDVH  )RUDOOdLdP/HPPDHQVXUHVWKDWWKHOHQJWKRI5LLV HTXDOWR_FL_ FLȍ FL uZ _FL_ ZKLFKLPSOLHVWKDWWKHWRWDO OHQJWK RI 5 5 « 5P LV σP L  ȁFL ȁ  P ^ σP L  FLȍ FL uZ ȁFL ȁ ` σL  ȁFL ȁ 3: :H FODLP WKDW σP L  ȁFL ȁ 3:LVPLQLPL]HGEHFDXVHLIWKHUHH[LVWPGLVMRLQW SDWKV5 5 «5 PIURPVWRWW«WPUHVSHFWLYHO\VR WKDWWKHLUWRWDOOHQJWKLVOHVVWKDQWKDWRI55«5PWKHQ LW OHDGV WR D FRQWUDGLFWLRQ DV VKRZQ EHORZ :H KDYH P σP L  OL  σL  ȁFL ȁ 3 :ZKHUHOLLVWKHOHQJWKRI5 LIRUDOOd LdP 'HILQHDSDUWLDO URXWLQJIXQFWLRQ : ^F F «FP` o^«Q`DVIROORZV : FL NL UHVS : FL Q  LI HNL  UHVSQ LVLQFOXGHG5 LDOOdLdP,I FLȍ FL WKHQ ZHKDYHOLt_FL_ _FL_ FLȍ FL uZ _FL_ 2WKHUZLVH FLȍ FL   5 L LV QRW VKRUWHVW DQG KHQFH ZH KDYH OLt_FL_ FLȍ FL u Z _FL_ +HQFHZHKDYH_FL_ FLȍ FL uZ _FL_ dOLIRUDOOdLd P P ZKLFK LPSOLHV σP L  ȁFL ȁ ^ σL  FLȍ FL uZ ȁFL ȁ `d P P σP L  OL  σL  ȁFL ȁ 3 :6LQFH 3: ^σL  FLȍ FL uZ ȁFL ȁ ` ZH KDYH 3: 3: ZKLFK FRQWUDGLFWV WR WKDW : LV RSWLPDO 6LQFH IROGHGK\SHUFXEHVDUHQRGHV\PPHWULFZH KDYHWKH IROORZLQJWKHRUHPZKLFKLVWKHPDLQUHVXOWRIWKLVSDSHU  



5HIHUHQFHV >@ &&&KHQDQG-&KHQ1HDUO\RSWLPDORQHWRPDQ\ SDUDOOHO URXWLQJ LQ VWDU QHWZRUNV ,((( 7UDQVDFWLRQV RQ3DUDOOHODQG'LVWULEXWHG6\VWHPVYROQRSS  >@ ; % &KHQ 0DQ\WRPDQ\ GLVMRLQW SDWKV LQ IDXOW\ K\SHUFXEHV ,QIRUPDWLRQ 6FLHQFHV YRO  QR  SS >@ 0 'LHW]IHOELQJHU 6 0DGKDYDSHGG\ DQG , + 6XGERURXJK 7KUHH GLVMRLQW SDWK SDUDGLJPV LQ VWDU QHWZRUNV3URFHHGLQJVRIWKHWKLUG,(((6\PSRVLXP RQ 3DUDOOHO DQG 'LVWULEXWHG 3URFHVVLQJ  SS  >@ ' 5 'XK DQG * + &KHQ 2Q WKH 5DELQ QXPEHU SUREOHP1HWZRUNVYROQRSS >@ $ (O$PDZ\ DQG 6 /DWLIL 3URSHUWLHV DQG SHUIRUPDQFH RI IROGHG K\SHUFXEHV ,((( 7UDQVDFWLRQVRQ3DUDOOHODQG'LVWULEXWHG6\VWHPVYRO QRSS >@ - ) )DQJ 7KH ELSDQFRQQHFWLYLW\ DQG P SDQFRQQHFWLYLW\RIWKHIROGHGK\SHUFXEH7KHRUHWLFDO &RPSXWHU 6FLHQFH YRO  QR  SS   >@ -6)X)DXOWIUHHF\FOHVLQIROGHG K\SHUFXEHV ZLWK PRUHIDXOW\HOHPHQWV,QIRUPDWLRQ3URFHVVLQJ/HWWHUV YROQRSS  >@ = *DOLO (IILFLHQW DOJRULWKPV IRU ILQGLQJ PD[LPXP PDWFKLQJLQJUDSKV$&0&RPSXWLQJ6XUYH\VYRO QRSS  >@ 6 *DR DQG ' ) +VX 6KRUW FRQWDLQHUV LQ &D\OH\ JUDSKV 'LVFUHWH $SSOLHG 0DWKHPDWLFV YRO  SS   >@43*XDQG63HQJ$QHIILFLHQWDOJRULWKPIRUWKH NSDLUZLVH GLVMRLQW SDWKV SUREOHP LQ K\SHUFXEHV -RXUQDORI3DUDOOHODQG'LVWULEXWHG&RPSXWLQJYRO SS >@4 3 *X DQG 6 3HQJ 1RGHWRVHW DQG VHWWR VHW FOXVWHU IDXOW WROHUDQW URXWLQJ LQ K\SHUFXEHV 3DUDOOHO &RPSXWLQJYROSS >@6 @ 6 @& 1 /DL &RQVWUXFWLQJ DOO VKRUWHVW QRGHGLVMRLQW SDWKV LQ WRUXV QHWZRUNV -RXUQDO RI 3DUDOOHO DQG 'LVWULEXWHG&RPSXWLQJYROSS -DQXDU\  >@& 1 /DL 2Q WKH FRQVWUXFWLRQ RI DOO VKRUWHVW YHUWH[GLVMRLQW SDWKV LQ &D\OH\ JUDSKV RI DEHOLDQ JURXSV 7KHRUHWLFDO &RPSXWHU 6FLHQFH YRO  SS 0DUFK >@& 1 /DL 2SWLPDO FRQVWUXFWLRQ RI QRGHGLVMRLQW VKRUWHVW SDWKV LQ IROGHG K\SHUFXEHV -RXUQDO RI 3DUDOOHO DQG 'LVWULEXWHG &RPSXWLQJ YRO  SS $SULO >@& 1 /DL 2Q WKH FRQVWUXFWLRQ RI RSWLPDO QRGH GLVMRLQW SDWKV LQ IROGHG K\SHUFXEHV RI RGG GLP HQVLRQV 3URFHHGLQJV RI WKH  ,QWHUQDWLRQDO &RQIHUHQFH RQ 3DUDOOHO DQG 'LVWULEXWHG 3URFHVVLQJ 7HFKQLTXHVDQG$SSOLFDWLRQVSS-XO\ >@&1/DL&RQVWUXFWLQJRSWLPDOQRGHGLVMRLQWSDWKVLQ IROGHGK\SHUFXEHVPDQXVFULSW >@6 & /LDZ DQG * - &KDQJ *HQHUDOL]HG GLDPHWHUV DQG 5DELQ QXPEHUV RI QHWZRUNV -RXUQDO RI &RPELQDWRULDO 2SWLPL]DWLRQ YRO  SS   >@0 0D 7KH VSDQQLQJ FRQQHFWLYLW\ RI IROGHG K\SHUFXEHV ,QIRUPDWLRQ 6FLHQFHV YRO  QR  SS  >@@@':DQJ(PEHGGLQJ+DPLOWRQLDQF\FOHVLQWRIROGHG K\SHUFXEHVZLWKIDXOW\OLQNV-RXUQDORI3DUDOOHODQG 'LVWULEXWHG&RPSXWLQJYROSS  >@5 4. That is a contradiction. Therefore, Tm, 3 does not contain every (4k + 2)WDBC for 4 < k < m/2 - 1. When k = 1, 2, 3 or 4, we find (4k + 2)-WDBC on Tm, 3 is even as Figure 2 shows.

… (a) k=1



… (b) k=2

(c) k=3

… (d) k=4 Fig. 2: For m is even, Tm, 3 contains (4k + 2)-WDBC. Lemma 3. If m is even, Tm, 3 does not contain any (2k + 1)WDBC for k = 9 or 10 < k < m – 2. Proof. Since we already know Tm, 3 does not contain any (2k + 1)-DBC for 7 < k < m – 1 by [11]. Assume that C is a (2k + 1)WDBC on Tm, 3. By Property 1 and 2, we know that |E1(C)| is

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

even; |E2(C)| is odd and C contains odd number of edges in bridges2, so let |E1(C)| = k1 and |E2(C)| = k2 where k1 + k2 = 2k + 1, k1 be even, k2 be odd and |k1 - k2| = 3. Note that in such assumption, k1 = k + 3 or k – 1, both < m, so that there is no edge in bridge1 be used in C. This proof is similar to Lemma 2. Since cycle is a 2-regular graph, we know that for i Î Zm and i¢ = (i + 1) mod m, 6 ³ dc((i¢, 0)) + dc((i¢, 1)) + dc((i¢, 2)) = |𝐸$"(C)| + |𝐸$"¢ (C)| + 2|𝐸&"¢ (C)| is even. If |𝐸$"(C)| + |𝐸$"¢ (C)| = 2 then |𝐸&"¢ (C)| £ 2, this case only occurs when i = m - 1 or i = (k1 / 2) - 1. If |𝐸$"(C)| + |𝐸$"¢ (C)| = 4 then |𝐸&"¢ (C)| £ 1, this case only occurs when 0 £ i £ (k1 / 2) - 2. Thus k2 = |E2(C)| £ 4 + (k1 / 2) - 1. If k1 = k2 + 3, then k2 £ 9 and k1 £ 12. If k1 = k2 - 3, then k2 £ 3 and k1 £ 0. When k = 9, 2k + 1 = 19, so that k1 = 8 and k2 = 11, a contradiction. When 10 < k < m – 2, 21 < 2k + 1 = k1 + k2 £ 21, a contradiction. Therefore, Tm, 3 does not contain every (2k + 1)-WDBC for k = 9 or 10 < k < m – 2. Note that a DBC is a WDBC by the definition. According to Lemma 3 and Table 1, we have the result that when k = 1, 6, 8, 10 or m – 2, it is possible to find (2k + 1)-WDBC on Tm, 3 for m is even. Figure 3 shows the structure of (2k + 1)-WDBC on Tm, 3 for k = 1, 6, 8, 10, respectively. And Lemma 4 give a (2k + 1)-WDBC on Tm, 3 for k = m – 2.


Lemma 5. If one of m and n is odd and the other is even, then Tm,n contains a (2k + 1)-WDBC where k = m − 2 for odd m > 3 and even n, or k = n − 2 for odd n > 3 and even m.

Proof. Assume that C is a (2k + 1)-WDBC on Tm,n. Without loss of generality, let m be odd and n be even. By Properties 1 and 2, |E2(C)| is even, |E1(C)| is odd, and C contains an odd number of edges in bridges1, so |E1(C)| ≥ m. Since there is no (2k + 1)-DBC on Tm,n for k = m − 2 [11], we let |E1(C)| = m = k + 2 and |E2(C)| = m − 3 = k − 1. Let g1 = (k − 1)/2, b1 = ((k − 1) mod (2n − 2))/2, and n1 = ⌊(k − 1 − 2b1)/(2n − 2)⌋ − 1. If g1 ≤ n − 1, we construct the (2k + 1)-cycle C = ⟨(0, 0), …, (0, g1), …, (m − 1, g1), …, (m − 1, 0), (0, 0)⟩ shown in Figure 5, where each ellipsis denotes a straight subpath between the listed corner vertices. According to Figure 5, |E1(C)| = m = k + 2 and |E2(C)| = 2g1 = k − 1, so ||E1(C)| − |E2(C)|| = 3. That is, C is a (2k + 1)-WDBC for k = m − 2. If g1 > n − 1, we construct the (2k + 1)-cycle C shown in Figure 6. Then |E1(C)| = m = k + 2 and |E2(C)| = 2n − 2 + 2n1(n − 1) + 2b1 = k − 1. Because ||E1(C)| − |E2(C)|| = 3, C is a (2k + 1)-WDBC for k = m − 2.

Fig. 3: For even m, Tm,3 contains a (2k + 1)-WDBC; panels (a)-(d) show the constructions for k = 1, 6, 8, 10.

Lemma 4. If m ≥ 6 is even, Tm,3 contains a (2k + 1)-WDBC for k = m − 2.

Proof. For k = m − 2, let |E1(C)| = m = k + 2, |E2(C)| = k − 1 = m − 3, and b1 = (k − 1) − 3 = k − 4 = m − 6 ≥ 0. We can construct the (2k + 1)-cycle C = ⟨(0, 0), (0, 2), (0, 1), (1, 1), (1, 2), (2, 2), (2, 1), ..., (b1 − 1, 2), (b1, 2), …, (m − 1, 2), …, (m − 1, 0), (0, 0)⟩; Figure 4 shows the structure of C. We have |E1(C)| = m = k + 2 and |E2(C)| = 4 + b1 + 1 = k − 1, so ||E1(C)| − |E2(C)|| = 3. That is, C is a (2k + 1)-WDBC for k = m − 2.

Fig. 4: The construction of C on Tm,3 for Lemma 4.

Fig. 5: The construction of C on Tm,n when g1 ≤ n − 1 for Lemma 5.

Fig. 6: The construction of C on Tm,n when g1 > n − 1 for Lemma 5.

Lemma 6. If one of m and n is even and the other is odd, then Tm,n contains no (2k + 1)-WDBC where 1 ≤ k < m − 2 for odd m > 3 and even n, or 1 ≤ k < n − 2 for odd n > 3 and even m.

Proof. Assume that C is a (2k + 1)-WDBC on Tm,n. Without loss of generality, let m be odd and n be even. By Properties 1 and 2, |E2(C)| is even, |E1(C)| is odd, and C contains an odd number of edges in bridges1, so |E1(C)| ≥ m. Then |E2(C)| = 2k + 1 − |E1(C)| ≤ 2k + 1 − m < 2(m − 2) + 1 − m = m − 3, and ||E1(C)| − |E2(C)|| > 3, a contradiction. Therefore, Tm,n contains no (2k + 1)-WDBC where 1 ≤ k < m − 2 for m >


3 is odd and n is even, or where 1 ≤ k < n − 2 for n > 3 odd and m even.

3 Main results

In this section, for even integer m ≥ 4, we prove that Tm,3 contains a weakly dimension-balanced cycle of length 4k + 2 for every integer m/2 − 1 ≤ k ≤ ⌊(3m − 2)/4⌋, and that Tm,n, where n ≥ 5 is odd, contains one for every 1 ≤ k ≤ ⌊(mn − 2)/4⌋. Theorem 1 begins with the case n = 3; Lemma 7 and Theorem 2 then present the general case.

Theorem 1. For even integer m ≥ 4, Tm,3 contains a (4k + 2)-WDBC for every m/2 − 1 ≤ k ≤ ⌊(3m − 2)/4⌋.

Proof. When k = ⌊(3m − 2)/4⌋ = (3m − 2)/4, Tm,3 embeds a Hamiltonian WDBC for any even integer m [8]. Thus we only need to discuss the cases (a) m/2 − 1 ≤ k ≤ ⌊(3m − 2)/4⌋ − 1, and (b) k = ⌊(3m − 2)/4⌋ = (3m − 4)/4 when 3m mod 4 = 0. Both imply 4k < 3m − 2. Since there is no (4k + 2)-DBC in these cases [11] and |E1(C)| must be even by Property 1, we know that {|E1(C)|, |E2(C)|} = {2k, 2k + 2}. In the following, we construct a WDBC C on Tm,3 with |E1(C)| = 2k + 2 and |E2(C)| = 2k for 2m + 2 ≤ 4k + 2 < 3m. Let m1 = (2k + 2 − m)/2, b1 = ((2k − 4) mod 4)/2 = (k − 2) mod 2 = k mod 2, and n1 = (2k − 4 − 2b1)/4 = ⌊k/2⌋ − 1. We construct the (4k + 2)-cycle C shown in Figure 7. Obviously, |E1(C)| = m + 2m1 = 2k + 2 and |E2(C)| = 4 + 2(2n1) + 2b1 = 2k. Hence C is a (4k + 2)-WDBC. Note that, by the definition b1 = k mod 2, the only possible values of b1 are 0 and 1. If 2n1 + m1 + 1 ≥ m − 2 when b1 = 1, or 2n1 + m1 + 1 ≥ m when b1 = 0, the construction of C would fail. Thus we need to ensure 2n1 + m1 + 1 < m − 2 when b1 = 1, and 2n1 + m1 + 1 < m when b1 = 0. By definition, 2n1 + m1 + 1 = (2k − 4 − 2b1 + 2k + 2 − m)/2 + 1 = (4k − m − 2b1 − 2)/2 + 1 = (4k − m)/2 − b1 < (3m − 2 − m)/2 − b1 = m − 1 − b1. Therefore, 2n1 + m1 + 1 < m − 2 if b1 = 1, and 2n1 + m1 + 1 < m − 1 < m if b1 = 0. Hence, C is a well-defined (4k + 2)-WDBC.
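As a concrete check of Theorem 1's parameter bookkeeping, the following worked instance (our own example for m = 8 and k = 4; it is not taken from the paper) verifies the edge counts and the well-definedness condition:

```latex
% Worked instance of Theorem 1 (m = 8, k = 4, cycle length 4k + 2 = 18):
\[
  m_1 = \frac{2k+2-m}{2} = 1, \qquad
  b_1 = k \bmod 2 = 0, \qquad
  n_1 = \left\lfloor k/2 \right\rfloor - 1 = 1,
\]
\[
  |E_1(C)| = m + 2m_1 = 10 = 2k+2, \qquad
  |E_2(C)| = 4 + 4n_1 + 2b_1 = 8 = 2k,
\]
so $\bigl||E_1(C)| - |E_2(C)|\bigr| = 2$ and the well-definedness condition
$2n_1 + m_1 + 1 = 4 < m = 8$ (the case $b_1 = 0$) holds; $C$ is an $18$-WDBC.
```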

Fig. 7: The construction of C on Tm,3 for Theorem 1.

Lemma 7. For even integer m ≥ 4 and odd n ≥ 5, Tm,n contains a (4k + 2)-WDBC for every 1 ≤ k ≤ max{m, n} − 1.

Proof. Note that Tm,n = Cm × Cn. According to m, n and k, we divide this proof into two cases.

Case 1: 1 ≤ k ≤ min{m, n} − 1. If m > n, construct the (4k + 2)-cycle C = ⟨(0, 0), …, (0, k), …, (k + 1, k), …, (k + 1, 0), …, (0, 0)⟩. Obviously, |E1(C)| = 2k + 2 and |E2(C)| = k + k = 2k. Hence C is a (4k + 2)-WDBC. If n > m, construct the (4k + 2)-cycle C = ⟨(0, 0), …, (0, k + 1), …, (k, k + 1), …, (k, 0), …, (0, 0)⟩. Obviously, |E1(C)| = k + k = 2k and |E2(C)| = 2k + 2. Hence C is a (4k + 2)-WDBC.

Case 2: min{m, n} ≤ k ≤ max{m, n} − 1. If m > n, let b1 = ((2k − 2n + 4) mod 2(n − 2))/2 = (k − n + 2) mod (n − 2) and n1 = ((2k + 2) − (2n − 2) − 2b1)/2(n − 2) = ⌊(k − n + 2)/(n − 2)⌋. Then construct the (4k + 2)-cycle D1 shown in Figure 8. Because m ≥ 4 is even, n ≥ 5 is odd, and 2n1 = 2⌊(k − n + 2)/(n − 2)⌋ ≤ k − 3, the structure of D1 is well defined. Besides, |E1(D1)| = k + k = 2k and |E2(D1)| = 2n − 2 + 2n1(n − 2) + 2b1 = 2k + 2. That is, D1 is a (4k + 2)-WDBC. If n > m, let a1 = ((2k − 2m + 4) mod 2(m − 2))/2 = (k − m + 2) mod (m − 2) and m1 = ((2k + 2) − (2m − 2) − 2a1)/2(m − 2) = ⌊(k − m + 2)/(m − 2)⌋. Then construct the (4k + 2)-cycle D2 shown in Figure 9. Because m ≥ 4 is even, n ≥ 5 is odd, and 2m1 = 2⌊(k − m + 2)/(m − 2)⌋ ≤ k − 3, the structure of D2 is well defined. Besides, |E1(D2)| = 2m − 2 + 2m1(m − 2) + 2a1 = 2k + 2 and |E2(D2)| = k + k = 2k. That is, D2 is a (4k + 2)-WDBC.

Fig. 8: The construction of D1 on Tm,n for Lemma 7.

Fig. 9: The construction of D2 on Tm,n for Lemma 7.

Theorem 2. For even integer m ≥ 4 and odd n ≥ 5, Tm,n contains a (4k + 2)-WDBC for every 1 ≤ k ≤ ⌊(mn − 2)/4⌋.

Proof. According to Lemma 7, when 1 ≤ k ≤ max{m, n} − 1, Tm,n embeds a (4k + 2)-WDBC for even m ≥ 4 and odd n ≥ 5. In addition, when k = ⌊(mn − 2)/4⌋ and mn mod 4 = 2, Tm,n embeds a Hamiltonian WDBC for any m, n ≥ 3 [8]. Thus we only need to discuss the case max{m, n} ≤ k ≤ ⌊(mn − 2)/4⌋ − 1, or k = ⌊(mn − 2)/4⌋ for mn mod 4 = 0. That can


be rewritten as: max{m, n} ≤ k ≤ ⌈(mn − 2)/4⌉ − 1. Since there is no (4k + 2)-DBC in these cases [11] and |E1(C)| must be even by Property 1, we know that {|E1(C)|, |E2(C)|} = {2k, 2k + 2}. In the following, we construct a WDBC C on Tm,n with |E1(C)| = 2k + 2 and |E2(C)| = 2k. Let a1 = (2k + 2 − m) mod (n − 1), m1 = (2k + 2 − m − a1)/(n − 1), b1 = ((2k − 2n + 2) mod (2n − 2))/2 = (k − n + 1) mod (n − 1) = k mod (n − 1), and n1 = (2k − 2n + 2 − 2b1)/(2n − 2) = ⌊k/(n − 1)⌋ − 1. Note that 4k < mn − 2. According to the value of m1 and a1 + 2b1, we separate this proof into three cases.

Case 1: m1 > 0 and a1 + 2b1 < n − 1. First let m2 = m1 − 1, a2 = a1 + n − 1, g1 = (a2 mod 4)/2 and s1 = (a2 − 2g1)/4. If s1 = 0, we construct the cycle D1 = ⟨(0, 0), …, (0, n − 1), (1, n − 1), …, (1, 0), (2, 0), …, (2, n − 1), (3, n − 1), ..., (2n1, n − 1), (2n1 + 1, n − 1), …, (2n1 + m2 + g1 + 1, n − 1), (2n1 + m2 + g1 + 1, n − 2), …, (2n1 + 1, n − 2), (2n1 + 1, n − 3), …, (2n1 + m2 + 1, n − 3), (2n1 + m2 + 1, n − 4), …, (2n1 + 1, n − 4), (2n1 + 1, n − 5), …, (2n1 + m2 + 1, n − 5), (2n1 + m2 + 1, n − 6), ..., (2n1 + 1, 0), …, (m − 2, 0), …, (m − 2, b1), (m − 1, b1), …, (m − 1, 0), (0, 0)⟩; Figure 10 shows the construction of D1. By the construction of D1, |E1(D1)| = m + m2(n − 1) + 2g1 = m + (m1 − 1)(n − 1) + a1 + n − 1 = 2k + 2 and |E2(D1)| = 2n − 2 + 2n1(n − 1) + 2b1 = 2k. Hence, D1 is a (4k + 2)-WDBC. Similarly, if s1 > 0, we construct the (4k + 2)-cycle D2 shown in Figure 11. Then |E1(D2)| = m + m2(n − 1) + 4s1 + 2g1 = m + (m1 − 1)(n − 1) + (a1 + n − 1) = 2k + 2 and |E2(D2)| = 2n − 2 + 2n1(n − 1) + 2b1 = 2k, so D2 is a (4k + 2)-WDBC. According to Figures 10 and 11 (note that g1 ≤ 1), if 2n1 + m2 + 3 > m − 1 or b1 ≥ n − 2s1 − 2g1, there would be an overlap when constructing D1 and D2. Thus we need to ensure 2n1 + m2 + 3 ≤ m − 1, and b1 < n − 2s1 − 2 for g1 = 1 or b1 < n − 2s1 for g1 = 0. Because 4k < mn − 2, 2n1 + m2 + 3 = 2n1 + m1 + 2 = (4k − m − 2n + 4 − a1 − 2b1)/(n − 1) + 2 = (4k − m − a1 − 2b1 + 2)/(n − 1) < (mn − m − a1 − 2b1)/(n − 1) = m − (a1 + 2b1)/(n − 1). Since 0 ≤ a1 + 2b1 and 2n1 + m1 + 2 is an integer, 2n1 + m2 + 3 < m − (a1 + 2b1)/(n − 1) ≤ m, so 2n1 + m2 + 3 ≤ m − 1. Next we check the range of b1. Since a1 + 2b1 < n − 1 and a1 = 4s1 + 2g1 − n + 1, we get 2s1 + g1 + b1 < n − 1, i.e., b1 < n − 2s1 − 1 − g1. If g1 = 0, b1 < n − 2s1 − 1 < n − 2s1; if g1 = 1, b1 < n − 2s1 − 2. In summary, D1 and D2 are well-defined cycles.

Fig. 10: The construction of D1 on Tm,n for Theorem 2.

Fig. 11: The construction of D2 on Tm,n for Theorem 2.

Case 2: m1 = 0, or n − 1 ≤ a1 + 2b1 < 2n − 2. Let g2 = (a1 mod 4)/2 and s2 = (a1 − 2g2)/4. We use m1, s2 and g2 in place of m2, s1 and g1 in D1 or D2, and construct a cycle D3 (respectively D4) for s2 = 0 (respectively s2 > 0); Figure 12 shows the structure of D4 for s2 > 0. By Case 1, |E1(D1)| = |E1(D2)| = m + m2(n − 1) + 4s1 + 2g1 = 2k + 2 and |E2(D1)| = |E2(D2)| = 2n − 2 + 2n1(n − 1) + 2b1 = 2k, so |E1(D3)| = |E1(D4)| = m + m1(n − 1) + 4s2 + 2g2 = 2k + 2 and |E2(D3)| = |E2(D4)| = |E2(D1)| = |E2(D2)| = 2k. That is, D3 and D4 are (4k + 2)-WDBCs.


Fig. 12: The construction of D4 on Tm,n for Theorem 2.

Similarly, we need to check that 2n1 + m1 + 3 ≤ m − 1, and that b1 < n − 2s2 − 2 for g2 = 1 or b1 < n − 2s2 for g2 = 0. Note that 2n1 + m1 + 3 = (4k − m − 2n + 4 − a1 − 2b1)/(n − 1) + 3 = (4k − m + n − a1 − 2b1 + 1)/(n − 1) < (mn − m + n − a1 − 2b1 − 1)/(n − 1) = m − (a1 + 2b1 − n + 1)/(n − 1). Besides, a1 + 2b1 ≥ n − 1, so 2n1 + m1 + 3 < m − ((a1 + 2b1 − n + 1)/(n − 1)) ≤ m. Because 2n1 + m1 + 3 is an integer, 2n1 + m1 + 3 ≤ m − 1. Next we check the range of b1. Since a1 + 2b1 < 2n − 2


and a1 = 4s2 + 2g2, we have 4s2 + 2g2 + 2b1 < 2n − 2. Hence, b1 < n − 2s2 − 1 − g2. If g2 = 0, b1 < n − 2s2 − 1 < n − 2s2; if g2 = 1, b1 < n − 2s2 − 2. In conclusion, D3 and D4 are well-defined cycles.

Case 3: a1 + 2b1 ≥ 2n − 2. In this case, we construct the (4k + 2)-cycle D5 = ⟨(0, 0), D1((0, 0), (2n1 + 1, n − 1)), (2n1 + 1, n − 1), …, (2n1 + m1 + 2, n − 1), (2n1 + m1 + 2, n − 2), …, (2n1 + 1, n − 2), (2n1 + 1, n − 3), …, (2n1 + m1 + 2, n − 3), ..., (2n1 + 1, n − a1 − 1), …, (2n1 + m1 + 1, n − a1 − 1), (2n1 + m1 + 1, n − a1 − 2), …, (2n1 + 1, n − a1 − 2), (2n1 + 1, n − a1 − 3), …, ..., (2n1 + 1, 0), D1((2n1 + 1, 0), (0, 0)), (0, 0)⟩; Figure 13 shows the structure of D5. Hence, |E1(D5)| = m + m1(n − 1) + a1 = 2k + 2 and |E2(D5)| = 2n − 2 + 2n1(n − 1) + 2b1 = 2k. That is, D5 is a (4k + 2)-WDBC. According to Figure 13, we need to ensure 2n1 + m1 + 2 ≤ m − 3 and b1 ≤ n − 1, or the construction of D5 would fail. Since 4k < mn − 2, we have 2n1 + m1 + 2 < m − (a1 + 2b1)/(n − 1), similar to Case 2. Because a1 + 2b1 ≥ 2n − 2, (a1 + 2b1)/(n − 1) ≥ 2, so 2n1 + m1 + 2 < m − ((a1 + 2b1)/(n − 1)) ≤ m − 2. In addition, 2n1 + m1 + 2 is an integer, so 2n1 + m1 + 2 ≤ m − 3. Next we check the range of b1: by the definition b1 = k mod (n − 1), b1 < n − 1. In brief, D5 is a well-defined cycle.

Fig. 13: The construction of D5 on Tm,n for Theorem 2.

In conclusion, by [11], Lemmas 3 and 4, and Theorem 1, we have Corollary 1; by [11], Theorem 2, and Lemma 5, we obtain Corollary 2.

Corollary 1. For even integer m, Tm,3 is (2m − 3)-WDB pancyclic.

Corollary 2. For integers m, n ≥ 4, Tm,n is (a) WDB bipancyclic when one of m, n is even and the other is odd; and (b) (2n − 4)-WDB pancyclic when m is even and n is odd, and (2m − 4)-WDB pancyclic when m is odd and n is even.

4 Conclusions

In this paper, we studied whether the toroidal mesh graph Tm,n contains a weakly dimension-balanced cycle whose length is l for any integer 3 ≤ l ≤ mn. We give Table 2 as a summary; for convenience, in the table m ≥ 4 is even. Because the WDB pancyclic problem on Tm,n has now been discussed when both m and n are even [9] and when one of m, n is even and the other is odd (this paper), we want to study in the future whether Tm,n contains a weakly dimension-balanced cycle whose length is l for any integer 3 ≤ l ≤ mn when both m and n are odd.

Table 2: Summary of this paper
4k-WDBC: for n = 3, yes when k = 1, 2, 3 and m/2 ≤ k ≤ ⌊mn/4⌋ [11], no when 3 < k < m/2 (Lemma 1); for odd n ≥ 4, yes when 1 ≤ k ≤ ⌊mn/4⌋ [11].
(4k + 2)-WDBC: for n = 3, yes when k = 1, 2, 3, 4 and m/2 − 1 ≤ k ≤ ⌊(3m − 2)/4⌋ (Fig. 2, Thm. 1), no when 4 < k < m/2 − 1 (Lemma 2); for odd n ≥ 4, yes when 1 ≤ k ≤ ⌊(mn − 2)/4⌋ (Thm. 2).
(2k + 1)-WDBC: for n = 3, yes when k = 1, 6, 8, 10 or k = m − 2 (Fig. 3, Lemmas 3 and 4), no when k = 9 or 10 < k < m − 2 (Lemma 3); for odd n ≥ 4, yes when n − 2 ≤ k ≤ mn/2 − 1 ([11], Lemma 5), no when 1 ≤ k < n − 2 (Lemma 6).

To determine a good partition size, a modified octree algorithm is used. This heuristic consists of two passes: top-down and bottom-up. At the end of each pass, each node (partition) is labeled either white (insignificantly low density) or black (significantly high density). The first pass starts top-down from the whole simulation grid and splits on the density conditions given the parameters ρlow and ρhigh. Once splitting is done, i.e., all nodes satisfy one of the density conditions (node density < ρlow or > ρhigh), the program returns a list of black node dimensions; here, the smallest volume dimensions are picked to minimize the partition size. The program then executes the second pass in an attempt to combine consecutive partitions if and only if the density condition requirements still hold. This bottom-up pass further minimizes the number of host-device communications.
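The following is a minimal sketch of such a two-pass heuristic. It is our own illustration of the idea described above, not the authors' code: the density grid, thresholds and minimum box side are invented placeholders.

```python
import numpy as np

def label(density, box, rho_low, rho_high):
    """'white' if uniformly low density, 'black' if uniformly high, else None."""
    z0, y0, x0, z1, y1, x1 = box
    d = density[z0:z1, y0:y1, x0:x1]
    if (d < rho_low).all():
        return "white"
    if (d > rho_high).all():
        return "black"
    return None

def top_down(density, box, rho_low, rho_high, min_side, black):
    """Pass 1: split until every node satisfies one of the density conditions."""
    z0, y0, x0, z1, y1, x1 = box
    lab = label(density, box, rho_low, rho_high)
    if lab == "black" or (lab is None and min(z1 - z0, y1 - y0, x1 - x0) <= min_side):
        black.append(box)            # uniformly dense, or mixed but too small to split
    elif lab is None:                # mixed densities: split into 8 octants
        zm, ym, xm = (z0 + z1) // 2, (y0 + y1) // 2, (x0 + x1) // 2
        for child in [(a0, b0, c0, a1, b1, c1)
                      for a0, a1 in ((z0, zm), (zm, z1))
                      for b0, b1 in ((y0, ym), (ym, y1))
                      for c0, c1 in ((x0, xm), (xm, x1))]:
            top_down(density, child, rho_low, rho_high, min_side, black)

def bottom_up(black):
    """Pass 2: merge x-adjacent boxes with identical z/y extents. The union of
    two uniformly 'black' boxes is still 'black', so the density condition
    keeps holding, and fewer boxes means fewer host-device copies."""
    merged, cur = [], None
    for b in sorted(black):
        if cur and cur[0:2] == b[0:2] and cur[3:5] == b[3:5] and cur[5] == b[2]:
            cur = cur[:5] + (b[5],)  # extend the current box along x
        else:
            if cur:
                merged.append(cur)
            cur = b
    if cur:
        merged.append(cur)
    return merged

density = np.random.rand(32, 32, 32)   # toy stand-in for a density grid
black = []
top_down(density, (0, 0, 0, 32, 32, 32), 0.05, 0.10, 4, black)
print(len(black), "->", len(bottom_up(black)), "partitions after merging")
```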


Given a wound healing model m = ⟨wc, mc, ic, r⟩, where wc denotes the wound configuration (dimensions, wound location), mc denotes the model configuration (dimensions), ic denotes the initial conditions (patient's cytokine levels, treatment type, etc.) and r denotes the model rules, if we know the optimal subvolume size of m, then we can choose a scaled subvolume size for a model m′ = ⟨wc′, mc, ic′, r⟩. For example, if the wound depth of m and m′ is 1 mm and 0.5 mm respectively in the x-direction, then we can scale the optimal subvolume size of m in the x-direction by 1/2 and reuse it for m′. This way, once we have determined an optimal subvolume size for model m, we can apply it to any model m′ as defined earlier.
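A minimal sketch of this scaling rule, under our own naming (the function and tuples below are illustrative, not part of the framework):

```python
# Scale each axis of the tuned subvolume by the ratio of the new wound size
# to the old wound size.
def scale_subvolume(opt_sv, wound, wound_new):
    return tuple(max(1, round(s * wn / w))
                 for s, w, wn in zip(opt_sv, wound, wound_new))

# wound depth halves in x (1 mm -> 0.5 mm): the x subvolume size halves too
print(scale_subvolume((32, 16, 16), (1.0, 2.0, 2.0), (0.5, 2.0, 2.0)))
# -> (16, 16, 16)
```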

6. Results

6.1 Model Configurations
We have developed VF ABM models for two different mammals: rat and human. As discussed in Section 2.3, the rat VF ABM serves as a test model due to its size and the availability of empirical data. The model configurations were determined based on empirical data and vocal fold literature reviews [17], [18], [19], [20], [22], [23], [24]. Table 2 summarizes the configurations of both models.

Table 2: Summary of mammal vocal fold simulation configurations
                                        Rat      Human
World Size (3D patches)                 2.0 M    153.9 M
ECM Data (data points)                  6.0 M    0.46 G
Chemical Data (data points)             16.0 M   1.23 G
Platelets (initial number of cells)     78.8 K   34 M
Neutrophils (initial number of cells)   517      1.72 M
Macrophages (initial number of cells)   315      0.97 M
Fibroblasts (initial number of cells)   3.5 K    12.20 M

6.2 Performance Evaluation
To evaluate the quality of the subvolume size recommendation made by the heuristic proposed in Section 5, an experiment was run with different HADC partition sizes for different model and wound configurations. As shown in Fig. 4, the 2-pass octree technique worked well for small and medium size wounds (sizes up to 1/3 of the VF size). However, it did not work well for large wound sizes, as it recommended a sub-optimal partition. Thus, it is best for the modeler to use the heuristic to obtain the recommendation from a smaller wound model m and scale the partition size for m′ with larger wounds. The original frame rate of the human VF visualization was bounded by the compute time of approximately 7 s per iteration [36], [44]. This resulted in poor interactivity due to the 0.14 fps frame rate. With the scheduling technique described in Section 4, the frame rate improved to 7.2 fps. By using the subvolume size recommended by the heuristic proposed in Section 5 for HADC optimization, the visualization performance improved significantly, to 42.8 fps. For performance evaluation purposes, the human VF ABM was compared against our previous and other similar ABM work (Table 3). The bacteria-macrophage-antibiotic

Fig. 3: (a) Comparison of vocal fold images (collagen in red; elastin and cells in green) of real rat vocal folds (uninjured control on the left, scarred in the middle) [43] and a zoomed image obtained from the VF ABM simulation (right). (b) Visualization of a human vocal fold showing both ECM proteins and signaling proteins (chemical gradients in turquoise-pear) during the healing process. The ECM proteins include collagen (red), hyaluronic acid (blue), and elastin (green). The healing and elapsed time statistics are displayed in the top-left and top-right corners of the screen, respectively. This image is the result of a transfer function that emphasizes newly deposited ECM proteins in the wound area and assigns low opacity to existing ECM proteins outside of the wound area.

ABM was implemented with FLAME GPU [27]. FLAME GPU is a widely used modern HPC ABM framework and thus serves as a good performance standard. Additionally, we included a well-regarded high-performance visualization prototype, MegaMol, for comparison [45]. Although MegaMol is an atomic-level visualization prototype, this powerful visualization tool is particle-based. Since particle-based simulation engines offer frameworks that can be adapted to visualize cellular-level data, MegaMol is well suited to serve as our visualization performance comparison base. Our VF ABM is able to process data orders of magnitude larger than FLAME GPU at a similar frame rate. Furthermore, we are


Fig. 4: Data copy time speedup using HADC. For each wound configuration, the model was run with different HADC partition sizes. The partition size that resulted in the best average data transfer time was used to compute the speedup, which was compared with the speedup using the partition size recommended by the heuristic described in Section 5. The 2-pass octree technique worked well for small and medium size wounds. In contrast, the scaled 2-pass octree, using the scaled recommendation from the smaller wound model m for m′ with larger wounds, worked well for all wound sizes.

Table 3: Performance and scale comparison with existing biological visualization platforms
                            #Data Points (×10^6)   In Situ Support   Frame Rate (fps)   Hardware
MegaMol [45]                100                    Yes               10                 Intel Core i7-2600 (16 GB RAM), NVIDIA GeForce GTX TITAN
2D FLAME GPU^a [27]         0.48                   No                33^b               Intel Core i7 (8 GB RAM), NVIDIA GeForce 830M
3D VF ABM (original [10])   0.46                   Yes               0.13               Intel Xeon E5-2699 v4 (128 GB RAM), NVIDIA Tesla M40
3D VF ABM (optimized)       1700                   Yes               42.8               Intel Xeon E5-2699 v4 (128 GB RAM), NVIDIA Tesla M40

^a Bacteria-Macrophage-Antibiotic ABM. ^b Derived frame rate.

able to visualize 17x more data points than MegaMol at a 4.2x better frame rate. It is worth noting that MegaMol does perform a more sophisticated rendering process than our framework. However, while our framework couples the visualization engine with grid-based real-time data generation, MegaMol does not support this feature [45].
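The 17x and 4.2x figures follow directly from Table 3's numbers; a quick check:

```python
# Arithmetic behind the comparison claims, using Table 3's numbers.
vf_points, mm_points = 1700e6, 100e6   # data points: optimized VF ABM vs MegaMol
vf_fps, mm_fps = 42.8, 10.0
print(vf_points / mm_points)           # 17.0  -> "17x more data points"
print(vf_fps / mm_fps)                 # 4.28  -> "~4.2x better frame rate"
```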

7. Conclusion
We presented a scheduling technique for high-performance 3D ABM for wound healing applications using GPU hyper tasking (GHT) to achieve a high level of simulation interactivity. To achieve optimal concurrency in this scheduling scheme, different tasks were assigned to different devices. This task assignment resulted in


output data being distributed across the CPU host and multiple GPUs. Thus, a Host-Device Activity-Aware Data Copy (HADC) technique was proposed to minimize host-device data copies. The resulting framework was used to implement a VF tissue repair response to injuries. The in situ visualization of the ECM and signaling proteins in the healing VF was capable of processing and rendering 1.7 billion data points at an average frame rate of 42.8 fps. Our ABM framework offers biomedical researchers a tool to efficiently explore the large amounts of output data with a high level of user-simulation interactivity. We are currently exploring verification techniques to quantitatively compare model-generated ECM images to fluorescence microscopy images obtained from real tissue samples. Further, we plan to develop optimization techniques for the visualization of inflammatory cells. As cell populations exhibit both structural and mobility properties, capturing their dynamics efficiently through visualization is a challenging task. The goal would be to couple cell visualization with the current framework while maintaining a high level of interactivity.

Acknowledgment The work is supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under Grant No. R01DC005788, the Natural Sciences and Engineering Research Council of Canada under Grant No. RGPIN-2018-03843, and the Canadian Institutes of Health Research under Grant No. 388583. The authors gratefully acknowledge the support provided by the National Science Foundation under Grant No. CNS-1429404 (MRI Project). The authors would like to thank Sujal Bista for guidance in developing the visualization component and the UMIACS staff for assistance with the VirtualGL configuration.

References [1] C. M. Macal, “Everything you need to know about agent-based modelling and simulation,” Journal of Simulation, vol. 10, no. 2, pp. 144–156, 2016. [2] N. Li, K. Verdolini, G. Clermont, Q. Mi, E. N. Rubinstein, P. A. Hebda, and Y. Vodovotz, “A patient-specific in silico model of inflammation and healing tested in acute vocal fold injury,” PloS one, vol. 3, no. 7, p. e2789, 2008. [3] F. Wall, “Agent-based modeling in managerial science: an illustrative survey and study,” Review of Managerial Science, vol. 10, no. 1, pp. 135–193, 2016. [4] R. M. D’Souza, M. Lysenko, and K. Rahmani, “Sugarscape on steroids: simulating over a million agents at interactive rates,” in Proceedings of Agent2007 conference. Chicago, IL, 2007. [5] N. Collier and M. North, “Parallel agent-based simulation with repast for high performance computing,” Simulation, vol. 89, no. 10, pp. 1215–1235, 2013. [6] J. T. Murphy, E. S. Bayrak, M. C. Ozturk, and A. Cinar, “Simulating 3-d bone tissue growth using repast hpc: Initial simulation design and performance results,” in Winter Simulation Conference (WSC), 2016. IEEE, 2016, pp. 2087–2098. [7] M. H. Swat, G. L. Thomas, J. M. Belmonte, A. Shirinifard, D. Hmeljak, and J. A. Glazier, “Multi-scale modeling of tissues using compucell3d,” Methods in Cell Biology, vol. 110, p. 325, 2012.


[8] S. Coakley, M. Gheorghe, M. Holcombe, S. Chin, D. Worth, and C. Greenough, “Exploitation of high performance computing in the flame agent-based simulation framework,” in High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012, pp. 538–545. [9] M. Cytowski and Z. Szymanska, “Large-scale parallel simulations of 3d cell colony dynamics,” Computing in Science & Engineering, vol. 16, no. 5, pp. 86–95, 2014. [10] N. Seekhao, C. Shung, J. JaJa, L. Mongeau, and N. Y. Li-Jessen, “High-performance agent-based modeling applied to vocal fold inflammation and repair,” Frontiers in Physiology, vol. 9, p. 304, 2018. [11] M. Rivi, L. Calori, G. Muscianisi, and V. Slavnic, “In-situ visualization: State-of-the-art and some use cases,” PRACE White Paper, pp. 1–18, 2012. [12] T. V. Project, “VirtualGL background,” http://www.virtualgl.org/ About/Background, Tech. Rep., 2015. [13] D. Commander, “User’s guide for turbovnc 0.6. retrieved january 14, 2010,” 2009. [14] F. Gottrup, M. S. Ågren, and T. Karlsmark, “Models for use in wound healing research: a survey focusing on in vitro and in vivo adult soft tissue,” Wound Repair and Regeneration, vol. 8, no. 2, pp. 83–96, 2000. [15] X. Lim, I. Tateya, T. Tateya, A. Muñoz-Del-Río, and D. M. Bless, “Immediate inflammatory response and scar formation in wounded vocal folds,” Annals of Otology, Rhinology & Laryngology, vol. 115, no. 12, pp. 921–929, 2006. [16] N. V. Welham, X. Lim, I. Tateya, and D. M. Bless, “Inflammatory factor profiles one hour following vocal fold injury.” The Annals of otology, rhinology, and laryngology, vol. 117, no. 2, pp. 145–152, 2008. [17] S. Kurita, “A comparative study of the layer structure of the vocal fold,” Vocal Fold Physiology, pp. 3–21, 1981. [18] M.-C. Su, T.-H. Yeh, C.-T. Tan, C.-D. Lin, O.-C. Linne, and S.-Y. Lee, “Measurement of adult vocal fold length,” The Journal of Laryngology & Otology, vol. 116, no. 6, pp. 447–449, 2002. [19] J. K. Kutty and K. Webb, “Tissue engineering therapies for the vocal fold lamina propria,” Tissue Engineering Part B: Reviews, vol. 15, no. 3, pp. 249–262, 2009. [20] J.-M. Prades, J. M. Dumollard, S. Duband, A. Timoshenko, C. Richard, M. D. Dubois, C. Martin, and M. Peoc’h, “Lamina propria of the human vocal fold: histomorphometric study of collagen fibers,” Surgical and Radiologic Anatomy, vol. 32, no. 4, pp. 377–382, 2010. [21] N. Y. Li, Y. Vodovotz, P. A. Hebda, and K. V. Abbott, “Biosimulation of inflammation and healing in surgically injured vocal folds,” The Annals of otology, rhinology, and laryngology, vol. 119, no. 6, p. 412, 2010. [22] S. Zörner, M. Kaltenbacher, and M. Döllinger, “Investigation of prescribed movement in fluid–structure interaction simulation for the human phonation process,” Computers & fluids, vol. 86, pp. 133–140, 2013. [23] N. Y. Li, H. K. Heris, and L. Mongeau, “Current understanding and future directions for vocal fold mechanobiology,” Journal of Cytology & Molecular Biology, vol. 1, no. 1, p. 001, 2013. [24] P. Bhattacharya and T. Siegmund, “A computational study of systemic hydration in vocal fold collision,” Computer methods in biomechanics and biomedical engineering, vol. 17, no. 16, pp. 1835–1852, 2014. [25] P. Richmond, D. Walker, S. Coakley, and D. Romano, “High performance cellular level agent-based simulation with flame for the gpu,” Briefings in bioinformatics, vol. 11, no. 3, pp. 334–347, 2010. [26] P. Richmond and M. K. 
Chimeh, “Flame gpu: Complex system simulation framework,” in High Performance Computing & Simulation (HPCS), 2017 International Conference on. IEEE, 2017, pp. 11–17. [27] A. de Paiva Oliveira and P. Richmond, “Feasibility study of multiagent simulation at the cellular level with flame gpu,” in FLAIRS Conference, 2016, pp. 398–403. [28] S. Tamrakar, P. Richmond, and R. M. D’Souza, “Pi-flame: A parallel immune system simulator using the flame graphic processing unit environment,” Simulation, vol. 93, no. 1, pp. 69–84, 2017. [29] M. J. North, T. R. Howe, N. T. Collier, and J. R. Vos, “The repast simphony runtime system,” in Proceedings of the agent 2005 conference on generative social processes, models, and mechanisms, vol. 10. ANL/DIS-06-1, co-sponsored by Argonne National Laboratory and The University of Chicago, 2005, pp. 13–15. [30] T. Emonet, C. M. Macal, M. J. North, C. E. Wickersham, and P. Cluzel, “Agentcell: a digital single-cell assay for bacterial chemotaxis,” Bioinformatics, vol. 21, no. 11, pp. 2714–2721, 2005. [31] G. R. Mirams, C. J. Arthurs, M. O. Bernabeu, R. Bordas, J. Cooper, A. Corrias, Y. Davit, S.-J. Dunn, A. G. Fletcher, D. G. Harvey, et al., “Chaste: an open source c++ library for computational physiology and biology,” PLoS Computational Biology, vol. 9, no. 3, p. e1002970, 2013. [32] S. Hoehme and D. Drasdo, “A cell-based simulation software for multi-cellular systems,” Bioinformatics, vol. 26, no. 20, pp. 2641–2642, 2010. [33] J. Starruß, W. de Back, L. Brusch, and A. Deutsch, “Morpheus: a user-friendly modeling environment for multiscale and multicellular systems biology,” Bioinformatics, vol. 30, no. 9, pp. 1331–1332, 2014. [34] M. Falk, M. Ott, T. Ertl, M. Klann, and H. Koeppl, “Parallelized agent-based simulation on cpu and graphics hardware for spatial and stochastic models in biology,” in Proceedings of the 9th International Conference on Computational Methods in Systems Biology. ACM, 2011, pp. 73–82. [35] L. Zhang, B. Jiang, Y. Wu, C. Strouthos, P. Z. Sun, J. Su, and X. Zhou, “Developing a multiscale, multi-resolution agent-based brain tumor model by graphics processing units,” Theoretical Biology and Medical Modelling, vol. 8, no. 1, p. 46, 2011. [36] N. Seekhao, C. Shung, J. JaJa, L. Mongeau, and N. Y. Li-Jessen, “Real-time agent-based modeling simulation with in-situ visualization of complex biological systems: a case study on vocal fold inflammation and healing,” IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016. [37] U. Ayachit, A. Bauer, B. Geveci, P. O’Leary, K. Moreland, N. Fabian, and J. Mauldin, “Paraview catalyst: Enabling in situ data analysis and visualization,” in Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization. ACM, 2015, pp. 25–29. [38] T. Kuhlen, R. Pajarola, and K. Zhou, “Parallel in situ coupling of simulation with a fully featured visualization system,” 2011. [39] A. Henderson, J. Ahrens, C. Law, et al., The ParaView Guide. Kitware, Clifton Park, NY, 2004. [40] H. Childs, E. Brugger, K. Bonnell, J. Meredith, M. Miller, B. Whitlock, and N. Max, “A contract based system for large data visualization,” in Visualization, 2005. VIS 05. IEEE, 2005, pp. 191–198. [41] Y. Su, Y. Wang, and G. Agrawal, “In-situ bitmaps generation and efficient data analysis based on bitmaps,” in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2015, pp. 61–72. [42] A. Krekhov, J. Grüninger, R. Schlönvoigt, and J. Krüger, “Towards in situ visualization of extreme-scale, agent-based, worldwide disease-spreading simulations,” in SIGGRAPH Asia 2015 Visualization in High Performance Computing. ACM, 2015, p. 7. [43] J. M. Coppoolse, T. Van Kooten, H. K. Heris, L. Mongeau, N. Y. Li, S. L. Thibeault, J. Pitaro, O. Akinpelu, and S. J. Daniel, “An in vivo study of composite microgels based on hyaluronic acid and gelatin for the reconstruction of surgically injured rat vocal folds,” Journal of Speech, Language, and Hearing Research, vol. 57, no. 2, pp. S658–S673, 2014. [44] N. Seekhao, J. JaJa, L. Mongeau, and N. Y. Li-Jessen, “In situ visualization for 3d agent-based vocal fold inflammation and repair simulation,” Supercomputing Frontiers and Innovations, vol. 4, no. 3, p. 68, 2017. [45] S. Grottel, M. Krone, C. Müller, G. Reina, and T. Ertl, “Megamol: a prototyping framework for particle-based visualization,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 2, pp. 201–214, 2015.


PECT: A Program Energy Consumption Tuning Tool
Cuijiao Fu, Depei Qian, Tianming Huang, Zhongzhi Luan
Sino-German Joint Software Institute, School of Computer Science and Engineering, Beihang University, Beijing, China
{fucuijiao, depeiq, tianmingh, luanzhongzhi}@buaa.edu.cn

Zhongzhi Luan is the contact author.

Abstract - Reducing energy consumption during application operation has become a very urgent demand. Previous researchers have mostly optimized the energy consumption of application runtime by optimizing the hardware architecture strategy or the way software guides hardware operation. Optimizing energy consumption from the software level is more flexible and offers more optimization space. In this paper, a source-code-oriented software energy consumption optimization scheme is proposed, and an optimization tool is designed and implemented. This tool can analyze the program structure and estimate the program energy consumption. We provide it to users in the form of an Eclipse plugin. Developers can use this plugin to debug programs from the perspective of energy consumption, helping them develop applications with lower energy consumption.

Keywords: Energy consumption, Optimization, Source code oriented, Tuning tool

1 Introduction

The problem of energy consumption in big data centers and high-performance computing environments is getting more and more serious. However, due to the lack of tools for sensing program energy consumption that are highly integrated with the programming environment, software developers at present seldom consider program energy consumption when designing and implementing software, and do not understand the impact of their decisions on program energy consumption when programming. If developers could observe the energy consumption of their code while coding, they could modify the code at any time to optimize its energy consumption. As a debugging tool that assists developers in energy consumption optimization, such a tool should have the following functions. 1) Do not destroy the original programming environment: keep the original basic functions, so that it remains convenient for users to write, compile and run programs. 2) Locate energy consumption hotspot functions: source code is a combination of modules (functions); similar to performance optimization, this requires finding the hotspot functions for energy consumption, which are usually the modules with the greatest optimization potential. 3) Calculate the energy consumption information of code segments: within the hotspot functions, code blocks (loop bodies, etc.) often account for most of the energy consumption, so the tool should be able to calculate code-level energy consumption. 4) Understand the distribution of energy consumption during function calls: this can not only assist developers in locating energy consumption hotspot functions, but also help developers understand the impact of code changes on other modules (functions).

2 Related works

Early research on the optimization of computer system energy consumption focused on circuit improvement of computer system hardware and optimization of the corresponding physical parameters, such as the dynamic voltage scaling (DVS) technology commonly used by processor manufacturers in their processor chips, i.e., reducing system energy consumption by synchronously reducing the power supply voltage and clock frequency during program operation according to the characteristics of CMOS circuits. After hardware optimization became relatively mature, more attention turned to software optimization, trying to combine low-power software with hardware to effectively reduce the energy consumption of the system. In the beginning, the optimization of software energy consumption focused mostly on embedded system software, because embedded systems mostly use battery power, and enabling the software to run longer under the condition of limited energy resources is a practical engineering problem. This problem has been solved very maturely in the past few decades. Energy consumption optimization at the software level is more flexible and involves more fields: from the bottom hardware system structure, to the middle operating system, to the upper application programs, all can be taken as targets of software-level energy consumption optimization. At present, software-level energy consumption optimization can be divided into two levels: the application software layer and the platform software layer. The following are two common methods for optimizing energy consumption at these two software levels. Platform software is the software that provides basic services such as computation and storage for the application software running on it; it is the direct consumer of hardware resources. Reasonable resource management and scheduling can achieve a good energy consumption optimization effect, which is especially obvious in cluster systems. Using an energy-aware resource scheduling strategy, a cluster system can save a large amount of power for the data center and reduce operating costs while ensuring service quality. There are also


a lot of researches in this area. For example, resources can be scheduled to adapt to tasks to reduce the overall system energy consumption; the resource scheduling technology used in Facebook's data center, described in [], can reduce the energy consumption of its data center every year. Another example is to reduce hardware idle time dynamically to save energy: when running MapReduce tasks, a distributed cluster can adopt corresponding strategies to prevent Reduce nodes from waiting all the time for all nodes performing Map to finish their task and return []. There are also virtualization allocation policies in Eucalyptus []. The application software runs on the platform software and is the final consumer of energy, so studying application software with low energy consumption can also reduce the energy consumption of the system. There are two main directions in the research of energy consumption optimization at the application software level, namely low-power compilation optimization and source-code-oriented energy consumption optimization. Modern compilers are powerful and can greatly optimize code performance. However, the goal of current compiler optimization is to shorten the code execution time and reduce the storage space used when the program runs; technology for low-power compilation optimization is very lacking. At present, low-power compilation optimization methods include methods to optimize the generated code in the code generation stage of the compilation process, such as in [], [], []. These methods are all good explorations of low-power compilation optimization, but they are not mature enough compared with compiler performance optimization technology. At the same time, many of these techniques rely too much on hardware characteristics and lack universality; low-power compilation optimization technology needs further research and improvement. In addition to low-power compiler optimization, there is another big direction for application-software-level energy consumption optimization, which is to directly optimize the source code of the application software, so as to complete the optimization of energy consumption in the process of writing or modifying it. Due to the strong demand for energy conservation and the simplicity of the platform architecture, source-code-oriented energy consumption optimization has previously achieved good results on embedded platforms; on general computing platforms, due to the complexity of the system and platform, it has made slow progress. According to the object of optimization, code-oriented energy consumption optimization can be divided into instruction level, statement level and module level []. Instruction-level optimization can use the commands provided by the compiler to optimize code energy consumption with small granularity. Statement-level energy consumption optimization attempts to reduce energy consumption by adopting more efficient and lower-energy code structures and data structures; this level is close to programmers and debuggers and can guide them in energy consumption debugging. The level studied in this paper is statement-level energy consumption optimization. Module-level energy consumption optimization selects a lower-energy algorithm or module design method according to the context. According to the optimization method, source-code-oriented energy consumption optimization can be divided into two categories: source code transformation and algorithm optimization. There are four main types of source code transformation: data structure transformation, loop transformation, internal transformation of programs, and transformation of operators and control structures. Loop transformation is used most frequently: changing the loop structure in program source code can often improve data access performance (such as page table hits), thereby also improving energy consumption. Compared with program transformation, algorithm optimization is more abstract, needs to be combined with the specific implementation of program functions, and is difficult to implement. The literature [] gives the performance and energy consumption of several of the most common sorting algorithms. To sum up, existing research on software energy consumption optimization is mainly focused on embedded platforms; for general-purpose computing, a field with even greater energy-saving requirements, this research is still not mature enough. The complexity of the platform itself makes software-oriented energy consumption optimization, especially at the application software level, difficult. However, the urgent demand for energy saving technologies in the IT industry, together with the generality and flexibility of software energy consumption optimization methods compared with hardware methods, gives software energy consumption optimization very important research value. Among them, the source-code-oriented method at the application software level optimizes energy consumption at the source of software development, which is more general and can provide guidance and help for many developers during development. Based on the existing source-code-oriented energy consumption optimization research, this paper studies a program debugging method for energy consumption optimization and implements it as a plugin in an integrated development environment, providing developers with energy consumption optimization for the programs they write.

3 Overall Design and Implementation of Tools

According to the requirement analysis above, we implemented the tool in Eclipse as a plugin. Because the original development environment is preserved, developers can keep their original development habits and optimize the energy consumption of a program by repeatedly writing, instrumenting, running and debugging it. The debugging process involves developers, hardware platforms and program debugging tools for energy


consumption optimization. The system can provide three functions.

1) Structure analysis: The system we have developed enables users to obtain the function call structure of the program they have written and presents it to users in the form of pictures. At the same time, this structural analysis can be combined with the energy consumption measurement function to show the function-level energy consumption distribution to users, so that users can obtain the ratio of each function's energy consumption to that of the overall program, helping users locate energy consumption hotspot functions and focus on optimizing the functions with high energy consumption.
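As a toy illustration of the function-level breakdown this view provides (the function names and joule values below are invented for the example; they are not PECT output):

```python
# Rank functions by measured energy and show each one's share of the total.
energy_j = {"main": 1.2, "parse_input": 8.9, "solve": 31.4, "write_output": 3.1}
total = sum(energy_j.values())
for fn, e in sorted(energy_j.items(), key=lambda kv: -kv[1]):
    print(f"{fn:14s} {e:6.1f} J  {100 * e / total:5.1f} %")  # hotspot ranking
```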

2) Energy consumption measurement: For the evaluation of the optimization effect of the redundancy optimization module introduced in the previous article, and for the positioning of energy consumption hotspot functions in the structural analysis module, it is necessary to measure the energy consumption while the program runs; this module completes that function. The specific working principle of the energy consumption measurement module is that when the target program runs (after structure analysis and pile insertion [], before redundancy optimization, and after redundancy optimization), PAPI [] is used to obtain the values of the specified hardware performance counters, and these values are then substituted into the energy consumption model to obtain the energy consumption. This module can measure the energy consumption of the whole program, of functions in the program, and of code segments specified by the user, at multiple granularities. At the same time, it can display these results to users in the form of intuitive and rich charts, so that they can evaluate the optimization effect and obtain detailed energy consumption information about the program. The tool displays the energy consumption information of the program code for developers mainly in four categories: real-time power consumption diagram, selected code information, function energy consumption information, and energy consumption allocation information during function calls. Combined with the structural analysis and redundancy optimization functions above, debugging for program energy consumption is truly achieved.

When the tool is implemented, each function is added to Eclipse in the form of a plugin, which gives better expansibility and facilitates future improvement of its functions. At present, the tool mainly includes the following plugins: the monitoring operation plugin, the real-time power consumption diagram plugin, the function call energy consumption allocation plugin, the function energy consumption analysis table plugin, the highlighted code plugin, the code segment energy consumption calculation plugin, and the code access redundant statement positioning plugin. Our plugins are based on three application frameworks: Command, JFace Viewer and Editor. According to the framework used, the plugins can be divided into three categories, as shown in the list of debugging tool plugins in Table I. Below we introduce how to implement our plugins on these frameworks.
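The following sketch shows the counter-to-energy step in the spirit described above. The linear model and its coefficients are invented placeholders (PECT's actual model and calibration are not given in this excerpt); the counter names are standard PAPI presets, but the values here are hard-coded rather than read via PAPI:

```python
# Hypothetical counter-based energy model: static power over time plus a
# weighted sum of hardware performance counter values.
COEFF_J_PER_EVENT = {
    "PAPI_TOT_INS": 0.6e-9,   # assumed energy per retired instruction
    "PAPI_L2_TCM": 4.5e-9,    # assumed energy per L2 cache miss
    "PAPI_FP_OPS": 1.1e-9,    # assumed energy per floating-point operation
}

def estimate_energy_joules(counters, static_power_w, elapsed_s):
    """energy ~= static power * time + sum(coefficient * counter value)."""
    dynamic = sum(COEFF_J_PER_EVENT[name] * val for name, val in counters.items())
    return static_power_w * elapsed_s + dynamic

print(estimate_energy_joules(
    {"PAPI_TOT_INS": 2.0e9, "PAPI_L2_TCM": 3.0e7, "PAPI_FP_OPS": 5.0e8},
    static_power_w=10.0, elapsed_s=1.2))
```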

TABLE I: TUNING TOOL PLUGIN LIST
Monitor running plugin; Real-time Power Consumption Diagram Plugin; Function Call Energy Allocation Plugin; Code Segment Energy Consumption Calculation Plugin; Accessing Redundant Statement Locating Plugin: org.eclipse.ui.command, org.eclipse.ui.handlers, org.eclipse.bindings, org.eclipse.ui.menus
Function Energy Consumption Analysis Table Plugin: org.eclipse.ui.views, org.eclipse.ui.perspectiveExtensions
Highlight Code Plugin: org.eclipse.help.context, org.eclipse.help.contexts

Application frameworks.

1) Command: Eclipse uses Command to provide user interface operations. A Command can appear in Eclipse menus, toolbars and popup menus by extending the org.eclipse.ui.menu extension point. Five plugins in the tool (the monitoring and running plugin, the real-time power consumption diagram plugin, the function call energy consumption allocation plugin, the code segment energy consumption calculation plugin, and the access redundancy statement positioning plugin) are based on the Command framework and need to be triggered by developers before execution. A Command in Eclipse is a declarative description of a component and has nothing to do with implementation details. Command definition is mainly implemented through extension points. The three extension points in Table II are closely related to Command; among them, the org.eclipse.ui.handlers extension point connects a Command with a class that needs to inherit the org.eclipse.core.commands.AbstractHandler class and implement the IHandler interface (execute method). Once the Command is executed, this class is called. This class defines the behavior of the real Command, so it is also called a handler.

TABLE II: COMMAND-DEPENDENT EXTENSION POINTS
org.eclipse.ui.command: Declarative description of the component.
org.eclipse.ui.handlers: Defines the real behavior; once the Command executes, the class executes.
org.eclipse.ui.menu: The form of the Command and where it appears in the user interface.

2) JFace Viewer framework: Eclipse JFace Viewer is mainly used to display various domain models. Without changing these domain models, a List, Tree or Table can be used to display them. JFace Viewer applies the MVC pattern to fill the domain model into the corresponding component (Control). Our function energy consumption analysis table plugin is a TableViewer based on Eclipse JFace, so we focus on how to implement this function with TableViewer. Because the table display is designed in MVC mode, it is necessary to cooperate with the ContentProvider and LabelProvider when using the TableViewer to display data. The content provider converts the input domain model data into an array. The
GLVSOD\ WKHVH SOXJLQ IXQFWLRQ HQHUJ\ FRQVXPSWLRQ DQDO\VLV WDEOH SOXJLQ PRGHOV -)DFH 9LHZHU DSSOLHV 09& SDWWHUQ GHVLJQ WR ILOO WKH KLJKOLJKWHG FRGH SOXJLQ FRGH VHJPHQW HQHUJ\ FRQVXPSWLRQ GRPDLQ PRGHO LQWR WKH FRUUHVSRQGLQJ FRPSRQHQW &RQWURO  FDOFXODWLRQ SOXJLQ FRGH DFFHVV UHGXQGDQW VWDWHPHQW 2XU IXQFWLRQ HQHUJ\ FRQVXPSWLRQ DQDO\VLV WDEOH SOXJLQ LV D SRVLWLRQLQJ SOXJLQ 2XU SOXJLQV DUH EDVHG RQ WKUHH 7DEOH9LHZHU EDVHG RQ (FOLSVH -)DFH VR ZH ZLOO IRFXV RQ KRZ DSSOLFDWLRQ IUDPHZRUNV &RPPDQG -)DFH 9LHZHU DQG (GLWRU WR LPSOHPHQW WKLV IXQFWLRQ ZLWK 7DEOH9LHZHU %HFDXVH WKH $FFRUGLQJ WR WKH IUDPHZRUN XVHG E\ WKH SOXJLQV WKH SOXJLQV WDEOH GLVSOD\ LV GHVLJQHG LQ 09& PRGH LW LV QHFHVVDU\ WR FDQ EH GLYLGHG LQWR WKUHH FDWHJRULHV DV VKRZQ LQ WKH OLVW RI FRRSHUDWH ZLWK WKH &RQWHQW 3URYLGHU DQG /DEHO3URYLGHU ZKHQ XVLQJ WKH 7DEOH9LHZHU WR GLVSOD\ GDWD 7KH FRQWHQW SURYLGHU FRQYHUWV WKH LQSXW GRPDLQ PRGHO GDWD LQWR DQ DUUD\ 7KH

ISBN: 1-60132-508-8, CSREA Press ©

80

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

RULJLQDO GRPDLQ PRGHO FDQ EH UHSUHVHQWHG E\ DQ\ W\SH VXFK DV OLVWV KDVKHV HWF $IWHU EHLQJ HQFDSVXODWHG E\ WKH FRQWHQW SURYLGHU LW ZLOO EH DQ DUUD\ RI WKH FRUUHVSRQGLQJ W\SH 7KH ODEHO SURYLGHU GLVSOD\V WKH GDWD HQFDSVXODWHG E\ WKH FRQWHQW SURYLGHU RQ WKH WDEOH LQ WKH VSHFLILHG FRQWHQW DQG IRUPDW )RU H[DPSOH LQ WKH (FOLSVH -)DFH 9LHZHU ER[ DQ HOHPHQW RI WKH DUUD\ LV DXWRPDWLFDOO\ PDSSHG WR D URZ LQ WKH WDEOH WKDW LV WKH REMHFW HQFDSVXODWHG LQ WKH VHOHFWLRQ RI WKH 7DEOH9LHZHU LV WKH FRUUHVSRQGLQJ REMHFW LQ WKH DUUD\ 7DEOH9LHZHU SURYLGHV D VHW RI ORJLF DQG UHODWHG LQWHUIDFHV DQG FODVVHV WR HQVXUH GHYHORSHUV FDQ ZULWH WDEOHEDVHG LQWHUIDFH GHVLJQ DQG GHYHORSPHQW ZLWK KLJK HIILFLHQF\ ,Q (FOLSVHEDVHG VRIWZDUH GHYHORSPHQW SURFHVVHV VXFK DV SOXJLQ GHYHORSPHQW DQG ULFK FOLHQW SODWIRUP GHYHORSPHQW 7DEOH9LHZHU LV ZLGHO\ XVHG ,W GLVSOD\V GDWD LQ WKH IRUP RI D WDEOH 'HYHORSHUV FDQ FXVWRPL]H WKH VW\OH RI GLVSOD\ FRQWHQW IRQW VW\OH EROG LWDOLF HWF  GLVSOD\ SLFWXUHV DQG WH[W LQ WKH VDPH FROXPQ DQG VRUW HDFK FROXPQ LQ WKH WDEOH HWF  (GLWRU IUDPHZRUN (FOLSVH SURYLGHV DQ (GLWRU IUDPHZRUN IRU GHYHORSHUV WR H[WHQG WKH (GLWRU (GLWRU PDLQO\ FRPSOHWHV WKH IXQFWLRQV RI WH[W VHJPHQW VHJPHQWDWLRQ NH\ZRUG KLJKOLJKWLQJ DQG FRQWH[W SURPSWLQJ LQ WKH HGLWRU :KHQ H[SDQGLQJ WKH (GLWRU ZH ILUVW QHHG WR GHWHUPLQH RXU RZQ WH[W EORFNLQJ UXOHV DQG WH[W UHQGHULQJ UXOHV WKHQ LPSRUW WKHVH UXOHV LQ WKH FRQILJXUDWLRQ DQG ILQDOO\ JHQHUDWH (GLWRU REMHFWV (GLWRU REMHFW FRQVLVWV RI 6RXUFH9LHZHU&RQILJXUDWLRQ DQG 'RFXPHQW3URYLGHU LQ ZKLFK 'RFXPHQW3URFLGHU SURYLGHV WH[W EORFNLQJ UXOHV DQG 6RXUFH9LZHU&RQILJXUDWLRQ SURYLGHV WH[W GLVSOD\ 7KH (GLWRU IUDPHZRUN FDOFXODWHV ZKHWKHU WKH FXUUHQW FRGH FRQIRUPV WR D FHUWDLQ UXOH LQ UHDO WLPH DQG FRPSOHWHV WDVNV VXFK DV WH[W V\QWD[ KLJKOLJKWLQJ DQG WH[W SURPSWLQJ LQ SDLUV RI (GLWRU UHJLRQV LQ UHDO WLPH LQ FRPELQDWLRQ ZLWK WH[W UHQGHULQJ UHJXODWLRQV :KHQ LPSOHPHQWLQJ WKH SOXJLQ ZH QHHG WR LPSOHPHQW WKH FODVV LQKHULWHG IURP )LOH'RFXPHQW3URYLGHU UHJLVWHU WKH WH[W EORFNLQJ PHWKRG LQ LW DQG FRPSOHWH WKH FXVWRP EORFNLQJ PHWKRG E\ LQKHULWLQJ 5XOH%DVHG3DUWLWRLQ6FDQQHU $QG UHQGHU WKH FRUUHVSRQGLQJ WH[W EORFN LQ 6RXUFH9LHZHU&RQILJXUDWLRQ +HUH ZH GHILQH WKH EORFN DQG UHQGHULQJ UXOHV ZLWK KLJK FRGH EULJKWQHVV EHWZHHQ WKH IXQFWLRQ B &86,240

N/A

Randomized kd-trees

998

0.716

2,510

0.715

hierarchical k-means tree

5.3 Performances prediction

Fig. 4. Hyperparameter optimization for randamized kd-trees

of

protein

structure

Table IV shows the performances of protein structure prediction with different parameters for reduction of k-nearest neighbor predictions. Using randomized kd-trees, the prediction becomes much faster than the original method, and it almost did not affect the accuracy of protein structure prediction. The acceleration is approximately 6-fold and it is much smaller than the acceleration of k-nearest neighbor itself (approximately 47-fold.) This is because protein structure prediction consists of multiple process including sequence alignment, model generation, model optimization, and so on. Thus, the influence of the acceleration of k-nearest neighbor prediction becomes much smaller. About the influence of reduction of k-nearest neighbor predictions, when the parameter was enough large (>1/4), the time needed for structure prediction decreased almost in

ISBN: 1-60132-508-8, CSREA Press ©

134

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

proportion to the parameter and the accuracy of structure prediction is kept. However, the parameter becomes smaller than 1/4, he time needed for structure prediction also decreased but the accuracy of structure prediction becomes worse. Thus, we think is the best parameter. Using the parameter, proposed method was approximately 21-times faster than the original one in exchange of trivial decrease in accuracy. Table IV. Prediction time and accuracy of structure prediction r

kNN algorithm

Prediction time (sec.)

Accuracy (TM-score)

1

Original (brute force)

7,643

0.512

1

Randomized kd-trees

1,242

0.511

1/2

Randomized kd-trees

712

0.511

1/4

Randomized kd-trees

354

0.510

1/8

Randomized kd-trees

213

0.498

1/16

Randomized kd-trees

153

0.487

1/32

Randomized kd-trees

112

0.461

6 Conclusions In this research, we proposed a new method to accelerate a machine learning-based sequence alignment generation method for homology modeling. Proposed method was approximately 21-times faster than the original one in exchange of trivial decrease in accuracy of structure prediction. As result, the prediction time of proposed method becomes reasonable even for proteome-level prediction and now we can apply the method for homology detection search and so on. Unfortunately, the optimization of hyperparameters was performed with smaller dataset and the parameter search was also insufficient. Thus, more optimization would be performed as future work. In addition, reduction of score calculation by k-nearest neighbor does not use any information about the length of a protein sequence and the number of gaps in the initial alignment. Thus, using these information, we may dynamically optimize the parameter, which controls the range of ignored regions.

Acknowledgement This work was supported by JSPS KAKENHI Grant Number 18K11524.

7 References [1] S. K. Burley et al., “Protein Data Bank: The single global archive for 3D macromolecular structure data,” Nucleic Acids Res., 2019. [2] M. T. Muhammed and E. Aki-Yalcin, “Homology modeling in drug discovery: Overview, current applications, and future perspectives,” Chemical Biology and Drug Design. 2019. [3] S. F. Altschul et al., “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucleic Acids Research. 1997. [4] J. Söding, “Protein homology detection by HMM-HMM comparison,” Bioinformatics, 2005. [5] M. Lozajic et al., “A Completely Reimplemented MPI Bioinformatics Toolkit with a New HHpred Server at its Core,” J. Mol. Biol., 2017. [6] J. Kopp, L. Bordoli, J. N. D. Battey, F. Kiefer, and T. Schwede, “Assessment of CASP7 predictions for templatebased modeling targets,” in Proteins: Structure, Function and Genetics, 2007. [7] S. Makigaki and T. Ishida, “Sequence alignment using machine learning for accurate template-based protein structure predcition,” Bioinformatics, btz483, 2019. [8] N. K. Fox, S. E. Brenner, and J. M. Chandonia, “SCOPe: Structural Classification of Proteins - Extended, integrating SCOP and ASTRAL data and classification of new structures,” Nucleic Acids Res., 2014. [9] J. H. Freidman, J. L. Bentley, and R. A. Finkel, “An Algorithm for Finding Best Matches in Logarithmic Expected Time,” ACM Trans. Math. Softw., 2002. [10] R. H. Chanop Silpa-Anan, “Optimized KD-trees for image descriptor matching,” CVPR, 2008. [11] K. Mikolajczyk and J. Matas, “Improving descriptors for fast tree matching by optimal linear projection,” in Proceedings of the IEEE International Conference on Computer Vision, 2007. [12] M. Muja and D. Lowe, “Fast approximate nearest neighbors with autormatic alogrithm configuration,” in Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, 2009. [13] Y. Zhang and J. Skolnick, “Scoring function for automated assessment of protein structure template quality,” Proteins Struct. Funct. Genet., 2004.

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

135

Layout analysis using semantic segmentation for Imperial Meeting Minutes Sayaka Iida1, Yuki Takemoto1, Yu Ishikawa2, Masami Takata1, Kazuki Joe1 1 Nara Women's University, Japan 2 Shiga University

Abstract - In this paper, we propose a layout analysis method for text extraction of Imperial Meeting Minutes. Character recognition accuracy depends on character extraction accuracy. In order to improve the accuracy of character segmentation, it is necessary to analyze the layout of the document image. Histogram analysis is usually used for layout analysis. However, it is difficult to perform general-purpose layout analysis using only histogram patterns, and it is necessary to visually consider the document configuration for accurate document area extraction. Therefore, we propose a layout analysis method using semantic segmentation. We apply the proposed method to Imperial Meeting Minutes and compare the case of layout analysis with histogram and extracted characters to confirm the usefulness of the proposed method. Keywords: Layout Analysis, Imperial Meeting Minutes, Semantic Segmentation

1

Introduction

The National Diet Library [1] provides a web service called Imperial Meeting Minutes Search System [2]. Imperial Meeting Minutes Search System digitally publishes the fast recording of all sessions of Imperial Meeting during 18901947. The meeting minutes were published before font standardization, and therefore, with different fonts and formats depending on years. Their images are searchable by the tables of contents, index or speaker. The meeting minutes of last two years have been converted to text to be searchable by text while the meeting minutes prior to 1944 do not have any text data not to be searchable by text. Therefore, it is necessary to convert the non-textified Imperial Meeting Minutes into text. Since the number of meeting minutes available for the Imperial Meeting Minutes Search System is huge (about 200,000), manual text conversion is practically impossible in cost. Although we proposed the character recognition method for early-modern printed books [3], a huge amount of training data is required to improve the recognition rate. Also, in order to automatically convert the meeting minute images into text, it is necessary to analyze the layout of the meeting minute images in detail. The character recognition depends largely on correct character clipping. It depends on the accuracy of document area clipping from document images. The meeting minute images are organized in columns with containing straight lines in addition to the main document parts. Layout analysis is required to clip the document area from the meeting minute images.

In general, histogram analysis is used for the layout analysis of document images. In this method, a pattern is detected from the shape and the change of a pixel projection histogram, and the document area is clipped from the document image according to the pattern. Since boundary area between document parts and other parts is very difficult to be clearly detected just using the histogram pattern, the performance of the general layout analysis with histogram is not always enough. In order to appropriately clip the document area, it is necessary to visually confirm the structure of the document. However, the number of undocumented meeting minutes currently available in the Imperial Meeting Minutes Search System is too huge to visually identify the layout and clip each area by hand. In this paper, we propose a layout analysis method using semantic segmentation [4] for the Imperial Meeting Minutes. We apply the method to some of the Imperial Meeting Minutes so that we compare the case of layout analysis with the accuracy of histogram and character extraction, and show the usefulness of the proposed method. The structure of this paper is shown below. Section 2 introduces existing research on semantic segmentation, and Section 3 describes the histogram method, which is a classical layout analysis method. Section 4 proposes a layout analysis method using semantic segmentation, and Section 5 describes experimental methods, experimental results, and discussions. Section 6 gives conclusions.

2

Existing Research Segmentation

on

Semantic

Semantic segmentation is a method to cluster pixels belonging to the same object in an image. The most semantic segmentation methods mainly use CNN (Convolution Neural Network). There are two problems in semantic segmentation using CNN before FCN (Fully Convolution Network). The first problem is that the size of the input image has to be fixed in order to use the output from the fully-connected layer. The second problem is that the pool layer discards the location information. In the first full connectivity layer issue, FCN promotes the CNN architecture without the full connectivity layer. Because FCNs do not use the fully connected layer, segmentation maps are created using images of any size. Most approaches to semantic segmentation after FCN are CNN architectures that do not have the fully connected layer. For the second pooling layer problem, the encoder-decoder architecture is one of the main approaches for the position information retention. The encoder path is useful for high-

ISBN: 1-60132-508-8, CSREA Press ©

136

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

speed computations to recover object boundaries in the decoder path. This section describes existing CNNs that incorporate the encoder-decoder architecture. SegNet [6] is one of the early encoder-decoder architectures. It mainly targets RGB landscape images and extracts roads, people, cars, etc. from the images. In SegNet, to increase the resolution of segmentation map, the index information of maximum pooling layers is sent to the decoder, and feature map upsampling is performed. Also, it performs batch normalization after the convolution layer to prevent gradient loss and explosion. U-net [7] is a CNN proposed for segmentation of biomedical images. It mainly targets gray scale images such as radiographs and MRI images, and detects the position of internal organs and lesions from the images. In U-Net, the connection from the encoder to the decoder is skipped so that the sum for the channel is taken. U-net retains the detail features by the connection skip. Besides U-net, there are FCN and ResNet as architectures to make the connection skip. The difference with U-net is that FCN takes the channel sum while ResNet takes the residual out of the input. The image size of the target data trained by CNN for the above mentioned existing semantic segmentation is up to 500 by 500 pixels. The image size of Imperial Meeting Minutes targeted in this paper is about 3,200 pixel width and 4,500 pixel height. It is desirable that the CNN architecture performing semantic segmentation for Imperial Meeting Minutes should be lightweight. Furthermore, the conference recording image is binary, and the boundary of the document area is blank. The document area is surrounded by blanks, so it is not necessary to keep the segmentation map boundary details. The boundaries between the character area and the frame area are clearly separated in black and white. It is considered easier to distinguish details than landscapes or biomedical images. Therefore, in this paper, we use SegNet Basic, which is a lightweight model to reduce the number of layers rather than SegNet.

3

Existing layout analysis method

Conventional layout analysis for document images typically creates rectangles using several histograms. In the histogram based method, first, a paper image is converted into a black and white binary image to obtain a projection histogram of black pixels constituting the document. A pattern is found from the shape of the histogram and the amount of change to be divided linearly. The extracted pattern is enclosed in a rectangle, and the document is extracted based on the area and the aspect ratio of the rectangle. However, it is difficult to distinguish the difference between the document and other elements contained in the document image using only the histogram pattern. To get accurate rectangles, we need to visually consider the composition of the document.

Figure 1 Imperial Meeting Minute images

There are two ways to create rectangles: top-down and bottom-up. The top-down method analyzes the rough layout structure and then gradually creates small rectangles. For example, an image is clipped vertically or horizontally along a space expected to be a boundary of character strings or a boundary of a paragraph to create a rectangle of the document area. The top-down method cannot take account of crosscolumn headings or non-rectangular areas in the document image. The bottom-up method gradually integrates a rectangle detected from an image. Small parts in the document image are classified to be merged as rectangles taking into account the distance, shape, and area between the parts. Figure 1 shows images of imperial meeting minutes targeted in this paper. The images of the meeting minutes are divided into two to five columns. Some have titles and other do not. Document columns and titles are framed. That is, in order to perform layout analysis on the meeting minute image, it is necessary to detect the line boundary and the pattern of the straight line from the histogram. However, the frame lines included in the meeting minute image are not clear straight lines, but are distorted due to the distortion of the original meeting record, or the deviation at the scanning time. As in the case of frame lines, the characters may be misaligned. In addition, meeting minutes contain noise such as ink stains. That is, in the meeting minute image, layout analysis using only the black pixel projection histogram along the horizontal the vertical directions is difficult, and it is necessary to adjust the size and position of the rectangle visually to some extent. The proposed method uses semantic segmentation that does not require visual adjustment at the time of region extraction. Semantic segmentation is a method of automatically classifying pixels that belong to the same object, and suitable for extracting regions from images. A character area, a document area, and a frame area are extracted from the meeting minute image, and layout analysis is performed in the bottom-up manner.

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

i

ii

A

B

C

D

In the layout analysis process using Label A to D, as is similar to the top-down method, the structure is roughly extracted and then the detail rectangle is created. Because the purpose of this layout analysis is to extract document and character areas, labels containing only frame areas are not generated. Meeting minute images are binary images so the color of binary images is expressed as a 256 gray scale. It is difficult to segment an image whose label boundary is blank as in document area, too. Therefore, the space between characters of meeting minute images is smoothed with a Gaussian filter shown in equation (1) to be used as an aid for border judgment of the document area. The kernel of the Gaussian filter used in this paper is square. ʹ݇‫ ݁ݖ݅ݏ‬ଶ (1) ሻ ʹߪ ଶ ξʹߨߪ The width and the height of the filter mask are odd. The value σ is determined from the kernel size using equation (2). ‫ݏݏݑܽܩ‬ሺ݇‫݁ݖ݅ݏ‬ሻ ൌ

Figure 2 Six labels to train the model

4

137

Proposed method

We propose a layout analysis method using semantic segmentation for meeting minute image. We use the SegNet Basic architecture for the semantic segmentation. In the proposed method, layout analysis processing is performed using the segmentation map output by semantic segmentation. In addition, when training with meeting minute images, Gaussian filters with different filter sizes are applied to the meeting minute images for comparison. The model is trained with a pair of meeting minute images and target labels to perform semantic segmentation. The labels used in this paper are the following six types.

ߪ ൌ ͲǤ͵Ͳሺ

ͳ

݁‫݌ݔ‬ሺെ

݇‫݁ݖ݅ݏ‬Ȃ ʹ ሻ  ൅ ͲǤͺͲ ʹ

We compare the images applied with and without the Gaussian filter. The maximum Gaussian filter size is less than 60, assuming that the distance between two adjacent characters is approximately 60 pixels. The conditions of the size K (n) of the Gaussian filter are as follows. z Condition 1: without Gaussian filter z Condition 2: increase of 1 kernel size ‫ܭ‬ሺ݊ሻ ൌ ʹ݊ െ ͳሼ݊ ‫ א‬ηȁͲ ൏ ݊ ൏ ͵Ͳሽ

z Label A Document area and Frame area z Label B Document area and Character area, Frame area z Label C Character area z Label D Document area and Character area The six label types are shown in Figure 2. In the labels, the red, yellow and green area represents character, frame and document area, respectively. Label i and ii do not include the document area while Label A to D include the document area. In the layout analysis process using Label i and ii, small areas are integrated to get a rectangle as in the bottom-up method.

(3)

z Condition 3: increase of 5 kernel size

z Label i Character area z Label ii Character area and Frame area

(2)

‫ܭ‬ሺ݊ሻ ൌ ͸݊ െ ͷሼ݊ ‫ א‬ηȁͲ ൏ ݊ ൏ ͳͲሽ

(4)

z Condition 4: increase of 9 kernel size ‫ܭ‬ሺ݊ሻ ൌ ͳͲ݊ െ ͻሼ݊ ‫ א‬ηȁͲ ൏ ݊ ൏ ͸ሽ

(5)

Learning is performed with a total of 24 patterns for each of the six labels and each of the above four conditions. The number of meeting minute images to learn is five. In order to reduce the calculation, we divide 20 sheets into squares of 1,024px side and learn meeting minute images. We explain the flow of the proposed method. First, we perform semantic segmentation to extract regions. Next, the preprocessing is performed to the output of the semantic segmentation. Finally, the character extraction processing is

ISBN: 1-60132-508-8, CSREA Press ©

138

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

performed. When outputting a character area in the character extraction processing, we use the character area label of the segmentation map. 6WDUW

5HJLRQH[WUDFWLRQ

&UHDWHUHFWDQJOHIURP DUHD

6KRUWUDQJHUHFWDQJOH LQWHJUDWLRQ

The flow of the layout analysis processing using semantic segmentation is described as shown in Figure 4. First, using the trained model, we label the segments using semantic segmentation on the meeting minute images. Next, the original image is preprocessed using the segmentation map output. If the output segmentation map includes a frame area, we replace black pixels in the area determined as frame area with white pixels. If the output segmentation map includes document area, the image is clipped for each document area. Finally, the character extraction processing is performed as described above, but when the output segmentation map includes character area, the area extraction is performed using the character area.

5

,PDJHH[WUDFWLRQ

5.1 (QG

Figure 3

Character extraction process flow 6WDUW

,QFOXGLQJ IUDPHDUHD"

)UDPHUHPRYDO

,QFOXGLQJWKH GRFXPHQW DUHD"

([WUDFWLPDJHVE\ GRFXPHQWDUHD

(QG

Figure 4

The procedure of the character extraction process is described as shown in Figure 3. In the process, characters in the document are extracted one by one using bottom-up layout analysis. First, in the area extraction process, the area of the detail part is detected. The detected area is represented as the outermost contour of an image element. Next, we generate a rectangle that encloses the detected area. If the generated rectangles are overlapped or close, they are combined. Finally, the image is clipped within the rectangle. In the area extraction, the outermost contour of the rectangle is detected. Here, ink stains and incomplete boundaries on the document image interfere with the region extraction. Also, the characters constructed with several parts, which are also other characters, are not extracted well at all.

Experiments Method

In this paper, we compare the character extraction accuracy of Imperial Meeting Minute images. The target meeting minute images are used for training samples and test samples. We apply the histogram method and the proposed method to the meeting minute images to extract characters, and compare the proportions of correctly extracted characters. From the comparison results, the usefulness of layout analysis using semantic segmentation is confirmed. In the histogram method, the meeting minute images are divided into several stages using pixel projection histograms. In addition, we compare the results of the total 24 patterns of the six segmentation maps output by the proposed method and the four conditions of the Gaussian filter size extensions. The number of characters evaluated for learning and test images is 1,293 and 837, respectively. In the character extraction processing, rectangle integration fails when each part constructing a character are apart. Such characters were excluded from the evaluation. Since it is another open problem, we do not discuss in this paper. 5.2

Pre-processing using output of semantic segmentation

Experiments

Results and discussions

Layout analysis is applied to the learning and test images on the model and the extracted characters are compared. In the histogram method, 46.3% of characters in the learning images and 4.78% in the test images are extracted. The histogram approach cannot remove borders and brackets in documents. These noises are considered to hinder region extraction. The results of semantic segmentation on test images using the proposed method are shown in figure 5. When the frame is blurred or dirty, or when there is a blank area at the border of the frame, it is recognized as character area even if the frame area is incorrectly recognized. In some cases, frame joints or straight parts of characters may be recognized as frame area. In addition, ink stains may be recognized as a character or a frame. The misrecognized area is the noise of the subsequent processing. In the case of condition 4, the segmentation map is blurry. The reason is that the kernel size is too large.

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

139

Table 1 The number of extracted characters of learning images (%)

Original Image

ii

A

B

C

D

Average

93.6

91.3

93.3

96.4

93.6

41.5

84.9

75.3

79.5

94.5

97.8

86.3

39.6

78.8

82.0

78.0

93.4

97.1

94.4

47.5

82.1

0.70

0.23

0

40.5

0

87.7

21.5

Average

Condition4

C

Condition4

Condition3

B

Condition3

Condition2

A

Condition2

Condition1

ii

Condition1

i

i

62.9

62.3

70.3

83.0

68.6

54.1

Table 2 The number of extracted characters of test images (%) A

B

C

D

Average

79.6

79.6

84.0

87.2

82.2

78.0

81.8

70.0

82.9

84.5

83.0

85.2

82.9

81.4

Condition3

66.8

60.2

73.4

84.6

81.7

83.0

75.0

Condition4

7.41

1.08

0.60

1.31

0.48

77.5

14.7

Average

Tables 1 and 2 show the proportion of characters extracted from each image by the proposed method. Table 1 shows the percentage of extracted characters for learning images. In the case of conditions 1, 2 and 3 of the proposed method, the proportion of extracted characters of the learning images is 84.9%, 78.8% and 82.1% in average, respectively. The proportion of extracted characters of the test images is 81.8%, 81.4% and 75.5% in average, respectively. Under conditions 1, 2 and 3, the proportion of characters extracted in average is higher than layout analysis using the histogram. This is because frame removal and document area extraction are performed from the document images. For condition 4 of the proposed method, the average proportion of extracted characters of the learning and test images is 21.5% and 14.7%, respectively. Condition 4 cannot extract most characters except label D. For label D with condition 4, 87.7% and 77.5% of the characters in the learning and test images are extracted, respectively. This is due to the fact that the segmentation map cannot be correctly generated except for label D.

ii

Condition2

Figure 5 Semantic segmentation results for labels and Gaussian filter conditions

i Condition1

D

55.9

55.9

60.6

64.0

62.4

80.4

Table 3

The average number of extracted characters of learning images with each label (%)

ISBN: 1-60132-508-8, CSREA Press ©

Do not include

Include

140

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

document area

document area

Condition1

92.5

81.2

Condition2

77.4

79.5

Condition3

80.0

83.1

Condition4

0.46

32.1

Table 4

The average number of extracted characters of test images with each label (%) Do not include document area

Include document area

Condition1

79.6

82.9

Condition2

76.4

83.9

Condition3

63.5

80.7

Condition4

4.24

20.0

In condition 4 with labels i and ii, the character area is used, but character extraction is failed because the output is blurred. In the case of condition 4 with labels A, B, and C, the vicinity of the frame line is divided into document areas. With regard to label D, the border line portion is not recognized as a document area as well as in the case of condition 4. When the proportion of extracted characters in Table 1 is the highest, label B is used and the Gaussian filter size extension is with condition 2. When the proportion of extracted characters in Table 2 is the highest, label B is used and the Gaussian filter is with condition 1. With labels i and ii, document area is not contained while with labels A to D, document area is contained. The average percentage of extracted characters with the labels that do not contain document area is 62.6% for learning images and 55.9% for test images. The average number of extracted characters with the labels that contain document areas is 69.0% for learning images and 66.9% for test images. Therefore, we find that the labels including document area is more effective. In addition, among the labels including document area, the average of the number of extracted characters with label B is the largest when condition 4 is excluded. Table 3 and 4 shows the proportion of the average number of extracted characters of learning and test images with each (including/not including document area) label, respectively. We validate the effectiveness of blank parts interpolation by the Gaussian filter when performing semantic

segmentation for binary images. We compare the conditions of the Gaussian filter when document area is not included and included. In the case of not included and condition 1, 92.5% of the text is extracted from the learning images while 79.6% from the test images. When text area is not included, the Gaussian filter with condition 1 achieves the highest percentage of character extraction. When character area is included with condition 3, 1.9% of the characters of learning images are extracted. In condition 2, 1.7% of characters are extracted from learning images and it is smaller than that in condition 1. From the test images, 1.0% of characters are extracted with condition 2. In the case of condition 3, 2.2% of characters are extracted and it is smaller than in condition 1. Therefore, when the text area is included, the blank part interpolation by the Gaussian filter may be effective but not always necessarily. From the above, we find that the accuracy of character extraction is improved by performing layout analysis with the proposed method. The accuracy of character extraction depends on the accuracy of region extraction by semantic segmentation. In the future, we aim to improve the accuracy of semantic segmentation by increasing training data, changing conditions of Gaussian filter, and applying noise and blurring to input images. We confirm that it is not possible to correctly recognize the joint of the frame and the straight portion in characters as well as accurately extract character area and frame area. In addition, as applying semantic segmentation to binary images, white space is complemented with Gaussian filters, but the usefulness is not shown from the evaluation results.

6

Conclusions

In this paper, we propose a layout analysis method using semantic segmentation for the purpose of extracting the text of Imperial Meeting minute images. In the proposed method, the text area, the frame area, and the document area are extracted using semantic segmentation. The CNN architecture used for the semantic segmentation adopts SegNet Basic. There are six combinations of objects included in the label images to be trained. Furthermore, Gaussian filters of different kernel sizes are applied to the images to support segmentation of binary images. In order to validate the proposed method, the ratio of the extracted characters by the layout analysis method using histogram and the proposed method was compared. Also, in the proposed method, the type of objects included in the label and the conditions of the Gaussian filter size are varied to compare the proportions of the character extraction. In the layout analysis method by histogram, meeting minute images are scanned for each column using a pixel projection histogram. In the proposed method, layout analysis is performed using the output of semantic segmentation, and the document area is extracted. Changing the object types contained in the labels and the Gaussian filter size, they are applied to the learning image to compare the proportions of

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

the extracted characters. The first to sixth labels are character area, character and frame area, document and frame area, document, character and frame area, character area, document and character area, respectively. There are four types of Gaussian filter size extension. Character extraction was performed on learning and test images using the histogram method and the proposed method. The number of characters extracted for each was compared. In the case of semantic segmentation, the label selection and Gaussian filter size were examined. The original images have 1,293 characters for learning and 837 characters for test. As a result, 46.3% and 4.78% of characters in the learning images and the test images were extracted by the histogram method, respectively. In the proposed method, when using the kernel sizes increasing of 9, the proportion of the extracted characters of the learning and the test images were 21.5% and 14.7% in average, respectively. Regardless to say, it showed poor performance. In the case of other increasing sizes, the proportion of the extracted characters of learning images with without filter, the kernel size of increasing of 1 and 5 is 84.9%, 82.1% and 78.8%, respectively. In the case of test images, they are 81.8%, 81.4% and 75.0%. Therefore, compared with the histogram method, the proposed method is effective except in the case of using the kernel sizes increasing of 9. In addition, we showed the conditions where the number of extracted characters for learning and test images count the highest values. In the case of learning images using document area, text area and frame area as labels, 97.8% characters are extracted with kernel sizes increasing of 5. In the case of test images labeled with document area, text area and frame area, 87.2% characters are extracted without Gaussian filter.

141

segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, p. 3431-3440, 2015. [6] Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. "Segnet: A deep convolutional encoder-decoder architecture for image segmentation." IEEE transactions on pattern analysis and machine intelligence, 39.12: 2481-2495, 2017. [7] RONNEBERGER, Olaf; FISCHER, Philipp; BROX, Thomas. “U-net: Convolutional networks for biomedical image segmentation”. In: International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, p.234-241, 2015.

Acknowledgment This work is partially supported by Grant-in-Aid for scientific research from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT) No. 17H01829.

As future work, it is necessary to improve the output accuracy of semantic segmentation. We aim to improve the accuracy by increasing the learning data, changing the conditions of the Gaussian filter, and applying de-noising and blurring to the input images.

7

References

[1] “National Diet Library,Japan” (in japanese) 2019/5/1 accessed http://www.ndl.go.jp/ [2] “Imperial meeting minutes search system” (in japanese) 2019/5/1 accessed http://teikokugikai-i.ndl.go.jp/ [3] FUJIMOTO, Kaori, et al. Early-Modern Printed Character Recognition using Ensemble Learning. In: Proceedings of PDPTA2017, p.288-294, 2017. [4] THOMA㸪 Martin. A survey of semantic segmentation. arXiv preprint arXiv:1602.06541㸪 2016. [5] LONG㸪 Jonathan; SHELHAMER㸪 Evan; DARRELL㸪 Trevor. Fully convolutional networks for semantic

ISBN: 1-60132-508-8, CSREA Press ©

142

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

A Discrete Three-wave System of Kahan-Hirota-Kimura Type and the QRT Mapping Yuko Takae1 , Masami Takata2 , Kinji Kimura3 , and Yoshimasa Nakamura1 1 Graduate School of Informatics, Kyoto University, Kyoto, Kyoto, JAPAN 2 Research Group of Information and Communication Technology for Life, Nara Women’s University, Nara, Nara, JAPAN 3 Department of Electrical and Electronics Engineering, University of Fukui, Fukui, Fukui, JAPAN Abstract— The integrable three-wave interaction system is a well-known partial differential equation appearing in fields such as nonlinear optics and plasma physics. By eliminating the spatial derivative term from the three-wave system, we obtain the three-wave ordinary differential equation (ODE) system. Petrera et al. performed a Kahan-Hirota-Kimura discretization of the three-wave ODE system and succeeded in finding three conserved quantities of the system. However, Lax pairs for and solutions of the discrete three-wave system have not yet been obtained. In this paper, we derive three conserved quantities of the discrete three-wave system of Kahan-Hirota-Kimura type using computer algebra. Moreover, we show that in our discretized system, there is a certain variable, corresponding to the Hamiltonian of the three-wave ODE system, that is a variable in the QuispelRoberts-Thompson (QRT) mapping and can be expressed in terms of elliptic functions. By obtaining an elliptic expression for the Hamiltonian of the three-wave ODE system, it is possible to show the stability, which means that the value corresponding to the Hamiltonian is analytic, of the difference scheme though the implicit Runge-Kutta method always possesses the A-stable. Keywords: three-wave interaction system, Quispel-Roberts-

Thompson mapping, Hamiltonian, Kahan-Hirota-Kimura discretization, Gröbner basis

1. Introduction The study of dynamical systems has contributed significantly to developments in science and engineering. We can investigate various properties of a system, such as a natural phenomenon, by constructing a dynamical systems model that describes the behaviour of the system. There is a special class of dynamical systems–called integrable systems–that have remarkable properties, such as conserved quantities and exact solutions, even if the systems are nonlinear. Discrete integrable systems, derived via "integrable discretization," a process that preserves the properties of the original continuous integrable system, have been attracting attention. An understanding of these systems has led to the

discovery of a close relationship between soliton equations and numerical algorithms. In this paper, we present a discretization of an integrable three-wave system. In section 2, the three-wave interaction system is discussed. In section 3, we review the Quispel-Roberts-Thompson (QRT) mapping. In section 4, we introduce an integrable discretization called the KahanHirota-Kimura discretization. In section 5, we obtain a discrete three-wave system of Kahan-Hirota-Kimura type. In section 6, we show a relationship between the discrete three-wave system of Kahan-Hirota-Kimura type and the Quispel-Roberts-Thompson (QRT) mapping. In Section 7, we conclude this paper.

2. The Three-Wave system The three-wave system is a well known partial differential equation (PDE) appearing in the fields of nonlinear optics and plasma physics[2]. The three-wave system is as follows: ∂z1 ∂z1 + α1 = ǫz 2 z 3 , ∂t ∂x ∂z2 ∂z2 + α2 = ǫz 3 z 1 , ∂t ∂x ∂z3 ∂z3 + α3 = ǫz 1 z 2 . ∂t ∂x

(1) (2) (3)

Here, the parameters αi (i = 1, 2, 3) and ǫ are real numbers, and z i (i = 1, 2, 3) represent the complex conjugates of zi (i = 1, 2, 3), respectively. We consider the case where the following condition is satisfied: ∂zi = 0, ∂x

i = 1, 2, 3.

(4)

We then obtain the following system of ordinary differential equations (ODE): dz1 = z2z3, dt

dz2 = z3z1, dt

dz3 = z1z2. dt

(5)

Hereafter, we will refer to this system as “ the three-wave ODE system. ”

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

The three-wave ODE system can be written in the following form: H(p, q) ≡

3 

k=1

pk −

3 

qk

= p1 p2 p3 − q1 q2 q3 , dpj ∂H = =− dt ∂qj

(6)

k=1

3 

qk ,

(7) (8)

k =j

3  dqj ∂H pk . = =− dt ∂pj

(9)

k =j

By choosing qk ≡ zk to be the coordinate variables and pk ≡ zk to be the momentum variables, the system admits a Hamiltonian formulation and possesses a Hamiltonian H(p, q) given by (7).

3. The Quispel-Roberts-Thompson (QRT) mapping

143

Specifically, f1 (x) =(a21 x2 + a22 x + a23 )(b31 x2 + b32 x + b33 ) − (a31 x2 + a32 x + a33 )(b21 x2 + b22 x + b23 ), (15) f2 (x) =(a31 x2 + a32 x + a33 )(b11 x2 + b12 x + b13 ) − (a11 x2 + a12 x + a13 )(b31 x2 + b32 x + b33 ), (16) f3 (x) =(a11 x2 + a12 x + a13 )(b21 x2 + b22 x + b23 ) − (a21 x2 + a22 x + a23 )(b11 x2 + b12 x + b13 ). (17) Hereafter, we will refer to this mapping as “ the symmetric QRT mapping. ”Each member of this family possesses a 1-parameter family of invariant curves that fill the plane. (a11 + Kb11 )(xn )2 (xn+1 )2 + (a12 + Kb12 )((xn )2 (xn+1 ) + (xn )(xn+1 )2 ) + (a13 + Kb13 )((xn )2 + (xn+1 )2 ) + 2(a22 + Kb22 )(xn )(xn+1 ) + (a23 + Kb23 )((xn ) + (xn+1 )) + (a33 + Kb33 ) =0

(18)

The Quispel-Roberts-Thompson mapping (QRT mapping), introduced in [15] [16], is an 18-parameter family of birational transformations of the plane. The mapping is as follows:

where the constant of integration K is invariant on each curve [16]. The biquadratic equation (18) can be parametrized in terms of elliptic functions [1]. We consider the symmetric biquadratic relations:

f1 (y n ) − xn f2 (y n ) , (10) f2 (y n ) − xn f3 (y n ) g1 (xn+1 ) − yn g2 (xn+1 ) , (11) g2 (xn+1 ) − yn g3 (xn+1 ) ⎞⎛ 2 ⎞ ⎛ x a11 a12 a13 ⎝ a21 a22 a23 ⎠ ⎝ x ⎠ 1 a31 a32 a33 ⎛ ⎞⎛ 2 ⎞ b11 b12 b13 x × ⎝ b21 b22 b23 ⎠ ⎝ x ⎠ , (12) 1 b31 b32 b33 ⎞⊤ ⎛ 2 ⎞ ⎛ y a11 a12 a13 ⎝ a21 a22 a23 ⎠ ⎝ y ⎠ 1 a31 a32 a33 ⎞⊤ ⎛ 2 ⎞ ⎛ b11 b12 b13 y × ⎝ b21 b22 b23 ⎠ ⎝ y ⎠ .(13) 1 b31 b32 b33

ax2 y 2 +b(x2 y+xy 2 )+c(x2 +y 2 )+2dxy+e(x+y)+f = 0, (19) where x and y are variables (complex numbers) and a, b, c, d, e, f are given constants. Firstly, we apply a linear fractional transformation to (19):

xn+1

=

y n+1 = ⎞ ⎛ f1 ⎝ f2 ⎠ = f3

⎞ g1 ⎝ g2 ⎠ g3 ⎛

=

The symbols ⊤ and × represent the transposition of a matrix and the outer product of two vectors, respectively. If ajk = akj , bjk = bkj (j, k = 1, 2, 3), the mapping is called a symmetric QRT mapping and can be written as a 3-point map. xn+1 =

f1 (xn ) − xn−1 f2 (xn ) . f2 (xn ) − xn−1 f3 (xn )

(14)

x → (αx + β)/(γx + δ),

y → (αy + β)/(γy + δ), (20)

where α, β, γ, δ are generally complex and αδ = βγ. We can choose α, β, γ, δ so that b and e vanish in (19) and so that a = f = 0. Dividing (19) by a, the biquadratic relation can be written in the following form: x2 y 2 + 1 + c(x2 + y 2 ) + 2dxy = 0.

(21)

We consider (21) to be a quadratic equation in y; thus, the solution of (21) can be expressed as follows: ! dx ± −c + (d2 − 1 − c2 )x2 − cx4 y=− . (22) c + x2 The argument of the square root is a quartic polynomial in x. We can write it as a perfect square by transforming the variable x into the variable u, where 1

x = k 2 sn(u).

(23)

Here, sn(u) is a Jacobian elliptic sn function with argument u and modulus k, where k + k −1 =

ISBN: 1-60132-508-8, CSREA Press ©

(d2 − 1 − c2 ) . c

(24)

144

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

The argument of the square root is −c[1 − (k + k −1 )x2 + x4 ] = −c(1 − sn2 (u))(1 − k 2 sn2 (u)) = −c cn2 (u) dn2 (u).

We define a parameter η by −1 . c= (k sn2 (η))

(25)

(26)

Then from (24), we can choose the sign of η so that d=

cn(η) dn(η) . (k sn2 (η))

(27)

Substituting these expressions into (22), it follows that 1

y = k2

sn(u) cn(η) dn(η) ± sn(η) cn(u) dn(u) . 1 − k 2 sn2 (u) sn2 (η)

(28)

Using the addition formula, we simplify this result to 1

y = k 2 sn(u ± η).

(29)

Thus, the equation for y is the same as it is for x (23) but with u replaced by u ± η. The conserved quantity HQ of the symmetric QRT mapping is as follows: N HQ = , (30) D 2 2 2 2 N =a11 xn−1 xn + a12 xn−1 xn + a13 xn−1 + a21 xn−1 x2n + a22 xn−1 xn + a23 xn−1 + a31 x2n + a32 xn + a33 , (31) D =b11 x2n−1 x2n + b12 x2n−1 xn + b13 x2n−1 + b21 xn−1 x2n + b22 xn−1 xn + b23 xn−1 + b31 x2n + b32 xn + b33 . (32)

4. The Kahan-Hirota-Kimura cretization

dis-

In this section, we introduce a discretization method called the Kahan-Hirota-Kimura discretization. The discretization method was introduced in 1993 by W. Kahan, first appearing in his unpublished notes [11]. It is applicable to any system of ODEs for x : R → Rn satisfying

dx = f (x) = Q(x) + Bx + c, (33) dt where each component of Q : Rn → Rn is a quadratic form, B ∈ Rn×n , and c ∈ Rn . Consider a numerical integration method xn → xn+1 with a step size δ. The Kahan-HirotaKimura discretization reads as follows: xn+1 − xn 1 = Q(xn , xn+1 ) + B(xn + xn+1 ) + c, (34) δ 2 where 1 Q(xn , xn+1 ) = (Q(xn + xn+1 ) − Q(xn ) − Q(xn+1 ) 2 (35)

is the symmetric bilinear form corresponding to the quadratic form Q. It is sometimes more useful to use 2δ for the time step size to avoid powers of 2 in the various formulas. Kahan applied the discretization method (34) to a scalar Riccati equation and a two-dimensional Lotka-Volterra system [12]. The most remarkable feature of the scheme is that it produces solutions that stay on closed curves. Most other schemes produce solutions that either spiral in towards the equilibrium point or spiral out of the equilibrium point. Petrera, Pfadler, and Suris applied the discretization to many integrable systems, including the three-wave system, and showed that in most cases, the discretization preserves the integrability [14]. The discretization method coincides with the following Runge-Kutta method when the applied system is restricted to quadratic vector fields[3]. Let us consider the Runge-Kutta method of order s for the following ODE (36): dx = f (t, x). (36) dt The Runge-Kutta method of order s for (36) is as follows: s

xn+1 − xn bi ki , = δ i=1 ⎛

ki = f ⎝δ(n + ci ), xn + δ

(37) s j=1



aij kj ⎠ , i = 1, . . . , s.

(38)

It is well-known that the Runge-Kutta method can be represented by the so-called Butcher tableau, which puts the coefficients of the integrator (37) and (38) in a table as follows: c A (39) bT where

⎛ ⎞ b1 ⎜ ⎜ ⎟ A = (aij ), b = ⎝ ... ⎠ , c = ⎝ bs ⎛

⎞ c1 .. ⎟ . . ⎠ cs

(40)

The Kahan-Hirota-Kimura discretization of the ODE (36) and the Butcher tableau of the scheme are expressed as follows: [3]   n+1 xn+1 − xn + xn 1 x 1 n − f (xn+1 ). = − f (x ) + 2f δ 2 2 2 (41) 0 1 2

1

− 41

1 − 14

− 21

2 − 12

− 21

2 − 12

Some properties of the discretization method follow from those of the Runge-Kutta method. For example, let us

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

consider the stability of a linear ODE. We introduce the stability function R(z) =

det(I − zA + zebT ) , det(I − zA)

(42)

and the domain of absolute stability

R = {z ∈ C : |R(z)| < 1}

(43)

where e stands for the vector of ones [5]. We can reduce a linear ODE to a one-dimensional linear ODE by a variable transformation of the coefficient matrix. Thus, we apply the Runge-Kutta method to the linear test problem dx = λx, dt

(44)

where λ is an eigenvalue of the coefficient matrix. The Runge-Kutta method (37) with (38) applied to the linear test problem is as follows: (45)

xn+1 = R(λδ)xn .

If the domain of absolute stability R contains the left half plane, the Runge-Kutta method is said to be A-stable, i.e.,for any fixed δ, if the eigenvalues of the linear ODE lie in the left half-plane, the numerical method is stable. The classical Runge-Kutta method (RK4) is represented by the following Butcher tableau. 0 1 2 1 2

1 2

0

1 2

1

0

0

1

1 6

1 3

1 3

1 6

5. A discrete three-wave system of Kahan-Hirota-Kimura type In this section, we introduce a discrete three-wave system of Kahan-Hirota-Kimura type. Petrera, Pfadler, and Suris introduced a discrete three-wave system of Kahan-HirotaKimura type by carrying out the Kahan-Hirota-Kimura discretization of the three-wave ODE system [14]. Through the variable transformations z1 z2 z3 w1 = − , w 2 = − , w 3 = . (48) i i i from (5), the three-wave ODE system can be rewritten as follows: dw1 dw2 dw3 = i w2 w3 , = i w3 w1 , = i w1 w2 . dt dt dt (49) Using wi = xi + i yi ,

det(I − zA + zebT ) 1 1 1 = 1 + z + z2 + z3 + z4 det(I − zA) 2 6 24 (46)

The region R of absolute stability for the Runge-Kutta method does not include the whole left half plane. Thus, RK4 is not A-stable. We analyse the stability of the Kahan-Hirota-Kimura discretization scheme (41). The stability function of the Kahan-Hirota-Kimura discretization is as follows: R(z) =

det(I − zA + zebT ) z+2 =− det(I − zA) z−2

(47)

The region R of absolute stability of the Kahan-HirotaKimura discretization includes the whole left half plane Rez < 0. Thus, the Kahan-Hirota-Kimura discretization is A-stable.

(50)

i = 1,2,3,

from (49), we obtain dxi (51) = xj y k + yj x k , dt dyi (52) = xj xk − yj yk , dt where (i, j, k) represents one of the cyclic permutations of (1, 2, 3). The Kahan-Hirota-Kimura discretization is as follows:  n+1  xi − xni + yjn+1 xnk , ykn + yjn xn+1 =xnj ykn+1 + xn+1 j k δ (53)   n+1 n − yi yi xnk − yjn ykn+1 − yjn+1 ykn . + xn+1 =xnj xn+1 j k δ (54) In matrix form:

The stability function for RK4 is as follows: R(z) =

145

where

  n  x xn+1 = A(x, y, δ) yn y n+1  n+1   n  x x −1 , (55) = A ⇐⇒ (x, y, δ) yn y n+1

A(x, y, δ) = ⎛ 1 −δy3 ⎜ −δy3 1 ⎜ ⎜ −δy2 −δy1 ⎜ ⎜ 0 −δx3 ⎜ ⎝ −δx3 0 −δx2 −δx1



−δy2 −δy1 1 −δx2 −δx1 0

0 −δx3 −δx2 1 −δy3 −δy2

−δx3 0 −δx1 −δy3 1 −δy1

−δx2 −δx1 0 −δy2 −δy1 1



⎟ ⎟ ⎟ ⎟. ⎟ ⎟ ⎠

(56)

The purpose of our work is to obtain the solutions of the discrete system. Setting uni = 2yin − 2i xni ,

ISBN: 1-60132-508-8, CSREA Press ©

vin = 2yin + 2i xni

i = 1, 2, 3, (57)

146

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

-0.27415

n

r

d3wave -0.27416

-0.27417

0.6

-0.27418

0.4 -0.27419

0.2 u3n

0

-0.27420

-0.2 -0.27421

-0.4 -0.6

-0.27422

-0.27423 0

-0.6

-0.4

-0.2 u1n

0

0.2

0.4

0.6 -0.6

-0.4

-0.2

0

0.2 u2n

0.4

(58) (59)

(i, j, k) represents one of the cyclic permutations of (1, 2, 3). Here, vin (i = 1, 2, 3) denotes the complex conjugate of uni (i = 1, 2, 3), respectively. Hereafter, we shall call (58)(59) the discrete three-wave system of Kahan-Hirota-Kimura type.

6. Relationship between the discrete system and the QRT mapping In this section, we obtain the conserved quantities of the discrete three-wave system and a relationship between the Kahan-Hirota-Kimura type discrete three-wave system and the QRT mapping. In the three-wave ODE system (5), the quantity r = z1 z2 z3 − z1 z2 z3

(60)

rn = un1 un2 un3 − v1n v2n v3n

(61)

is a Hamiltonian and is thus conserved. However, in the discrete three-wave system (58)-(59),

is not a conserved quantity. From the formula for rn , we can see that the Kahan-Hirota-Kimura type discrete three-wave system has a periodic solution. We assume the following biquadratic equation: 2  2 2 2 a0 (rn ) rn+1 + a1 (rn ) rn+1 + a2 rn rn+1 2  2 + a3 (rn ) + a4 rn+1 + a5 rn rn+1 + a6 rn + a7 rn+1 + a8 = 0.

100

150

200

250

300

350

400

450

500

Fig. 2: the behaviour of rn

Fig. 1: An orbit of the discrete three-wave system of Kahan-Hirota-Kimura type with initial values (u01 , u02 , u03 ) = (−0.36 + 0.26i, −0.28 + 0.51i, 0.52 + 0.118i) and δ = 0.1. from (53) and (54) we obtain  n+1    ui − uni /δ = vjn+1 vkn + vjn vkn+1 /2,     n+1 unk + unj un+1 − vin /δ = un+1 /2, vi j k

50

0.6

(62)

where ai (i = 0, . . . , 8) are complex constants. We set the initial values u0i , vi0 (i = 1, 2, 3), compute the time evolution

rn (i = 0, . . . , 9) through (58) ~ (59) and (61), and solve the following equations:  2  2  1 2 A1,1 = r0 , A1,2 = r0 r1 , r  2  2 A1,3 = r0 r1 , A1,4 = r0 ,  2 A1,5 = r1 , A1,6 = r0 r1 , A1,7 = r0 , A1,8 = r1 ,

A1,9 = 1, · · ·  2  2  9 2 A9,1 = r8 , A9,2 = r8 r9 , r  2  2 A9,3 = r8 r9 , A9,4 = r8 ,  2 A9,5 = r9 , A9,6 = r8 r9 , A9,7 = r8 , A9,8 = r9 , A9,9 = 1, ⎞⎛ ⎛ A1,1 · · · A1,9 ⎜ .. .. .. ⎟ ⎜ ⎝ . . . ⎠⎝ A9,1

···

A9,9

⎞ a0 .. ⎟ = 0. . ⎠ a8

(63)

We observe that these equations have non-trivial solutions. In particular, a0 = 1,

a1 = a2 = a3 = a4 = 0,

a 6 = a7 ,

(64)

Eq. (62) is as follows:   2 2 (rn ) rn+1 + a5 rn rn+1 + a6 rn + rn+1 + a8 = 0. (65) Thus, a suitable choice of a5 , a6 , a8 yields the conserved quantities of the discrete system. To simplify the calculation, we assume h1 and h2 in the following equation are the conserved quantities:  2 (rn ) rn−1 + rn+1 + h1 rn + h2 = 0. (66) We obtain (66) by subtracting (65) from (67) and assuming h1 = a5 , h2 = a6 , yielding:  n−1 2 n 2   r (r ) + a5 rn−1 rn + a6 rn−1 + rn + a8 = 0. (67)

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

We obtain the following equations:  2 (rn ) rn−1 + rn+1 + h1 rn + h2 = 0,  n+1 2  n  r r + rn+2 + h1 rn+1 + h2 = 0,

(68) (69)

where h1 and h2 are expressed in terms of r ,r ,r , and rn+2 . Moreover, using the computer algebra system REDUCE and the following equations: n−1

n

n+1

rn−1 = u1n−1 u2n−1 u3n−1 − v1n−1 v2n−1 v3n−1 , rn = un1 un2 un3 − v1n v2n v3n ,

(70) (71)

un+2 un+2 − v1n+2 v2n+2 v3n+2 , rn+2 = un+2 1 2 3

(73)

un+1 un+1 − v1n+1 v2n+1 v3n+1 , rn+1 = un+1 1 2 3    n n−1 un1 − u1n−1 v2 v3 + v3n v2n−1 = ,  n n−1 2 n n−1   n δ n−1  + v1 v 3 v v u2 − u2 = 3 1 , δ 2     n v n v n−1 + v2n v1n−1 u3 − u3n−1 = 1 2 ,  n n−1 2 n n−1   n δ n−1  + u 3 u2 u u v1 − v1 = 2 3 , δ 2     n un un−1 + un1 u3n−1 v2 − v2n−1 = 3 1 ,  n n−1 2 n n−1   n δ n−1  + u 2 u1 u u v3 − v3 = 1 2 , δ 2 

and

   n+2 n+1  n+2 v3 + v3n+2 v2n+1 v u1 − un+1 1 = 2 ,  n+2 n+1 2 n+2 n+1   n+2 δ n+1  v 1 + v1 v 3 v u2 − u2 = 3 ,  n+2 n+1 2 n+2 n+1   n+2 δ n+1  v 2 + v2 v 1 v u3 − u3 = 1 , δ 2     n+2 + un+2 un+1 un+2 un+1 v1 − v1n+1 3 3 2 = 2 , δ 2     n+2 + un+2 un+1 un+2 un+1 v2 − v2n+1 1 1 3 = 3 , δ 2     n+2 + un+2 un+1 un+2 un+1 v3 − v3n+1 2 2 1 = 1 , δ 2 we obtain the explicit representation of h1 and h2 in of uni (i = 1, 2, 3) and vin (i = 1, 2, 3):

(72)

(un1 )4 (v1n )4

We can check that h1 and h2 are conserved quantities of the discrete three-wave system (58)-(59) by using the Gröbner basis in the computer algebra Risa/ASIR [17]. Details of the Risa/ASIR program can be found in in Appendix A.2. Let G be the Gröbner basis of the discrete three-wave ODE system, and let hn1 , hn2 be the conserved quantities. The result is as follows:   * Numerator hn+1 − hn1 − → 0, (88) 1 G  n+1  * Numerator h2 − hn2 − → 0. (89) G

We see that (66) can be expressed as follows: rn+1 =

(74) (75)

(77) (78) (79)

(80) (81) (82) (83) (84) (85) terms

−h1 rn − h2 − rn−1 (rn) (rn )

2

2

.

(90)

We note that (90) is a special case of the QRT mapping (14), where the conditions f1 (rn ) = f2 (rn ) =

(76)

2(−3(un1 )6 (un2 )2 (un3 )2 (v1n )4 + · · · ) − 8(un1 ) − 3un2 un3 (v1n )3 v2n v3n + · · · 105 terms = , (86) 42 terms n 9 n 3 n 3 n 6 4((u ) (u2 ) (u3 ) (v1 ) − · · · ) h2 = n 6 n 61 (u1 ) (v1 ) − 12(un1 )5 un2 un3 (v1n )5 v2n v3n + · · · 336 terms = . (87) 106 terms Details of the REDUCE program are in Appendix A.1. h1 =

147

n

f3 (r ) ⎞ a11 a12 a13 ⎝ a21 a22 a23 ⎠ a31 a32 a33 ⎞ ⎛ b11 b12 b13 ⎝ b21 b22 b23 ⎠ b31 b32 b33 ⎛

= =

=

−h1 rn − h2 , n 2

(r ) , 0, ⎞ ⎛ 0 0 0 ⎝ 0 0 0 ⎠, 0 0 1 ⎞ ⎛ 1 0 0 ⎝ 0 h1 h2 ⎠ . 0 h2 1

(91) (92) (93) (94)

(95)

are satisfied. For this reason, the variables rn = un1 un2 un3 − v1n v2n v3n are those in which the QRT mapping takes place. As discussed above, the integration is carried out in terms of elliptic functions. Thus, rn is integrated in terms of elliptic functions. The Kahan-Hirota-Kimura discretization does not strictly preserve the value of r (60). However, the discrete analogue of the Hamiltonian (61) of the Kahan-Hirota-Kimura discretization is expressed via the elliptic functions.

7. Conclusion We derive the conserved quantities of the discrete threewave system of Kahan-Hirota-Kimura type using computer algebra. Moreover, the variables rn = un1 un2 un3 − v1n v2n v3n , which correspond to the Hamiltonian of the continuous three-wave ODE system and can be expressed in terms of elliptic functions, are the variables in which the QRT mapping takes place. The QRT mapping takes place in terms of elliptic functions. Thus, rn is integrated in terms of elliptic functions. It is known that the solutions of the continuous threewave ODE system can be expressed in terms of hyperelliptic functions. However, no relationship between the discrete three-wave system of Kahan-Hirota-Kimura type and elliptic functions has been found. In this paper, we show a

ISBN: 1-60132-508-8, CSREA Press ©

148

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

new relationship between the discrete three-wave system of Kahan-Hirota-Kimura type and elliptic functions via the QRT mapping. Finding the solutions uni , vin (i = 1, 2, 3) of the discrete three-wave system of Kahan-Hirota-Kimura type requires future work.

Acknowledgment This work was supported by JSPS KAKENHI Grant Number JP17H02858 and JP17K00167. We are deeply grateful to Dr. Kazuki Maeda from the University of Fukuchiyama. His advice and comments greatly helped us to understand A-stabile within the context of the implicit Runge-Kutta method.

References [1] Baxter, R.J.: Exactly Solved Models in Statistical Mechanics, Academic Press, London, (1982). [2] Benney, D.J., and Newell, A.C.: The propagation of nonlinear wave envelopes, J. Math. Phys. 46, pp.133–139 (1967). [3] Celledoni, E., McLachlan, R.I., Owren, B., and Quispel, G.R.W.: Geometric properties of Kahan’s method, J. Phys. A: Math. Theor. 46 2, 025201 (2013). [4] Fairlie, D.: An elegant integrable system, Phys. Lett. A 119, p.438 (1987). [5] Hairer, E., and Wanner, G.: Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems (second ed.), Springer Verlag, Berlin, (1996). [6] Hirota, R.: Exact Solution of the Korteweg-de Vries equation for multiple collisions of solitons, Phy. Rev. Lett. 27, p.1192 (1971). [7] Hirota, R., and Kimura, K.: Discretization of the Euler top, J. Phys. Soc. Jpn. 69 3, pp.627–630 (2000). [8] Hirota, R., and Yahagi, H.: Recurrence Equations, An Integrable System, J. Phys. Soc. Jpn. 71, pp.2867–2872 (2002). [9] Iatrou, A., and Roberts, J.A.G.: Integrable mappings of the plane preserving biquadratic invariant curves II, Nonlinearity 15, pp.4599489 (2002). [10] Ivanov, R.: Hamiltonian formulation and integrability of a complex symmetric nonlinear system, Phys. Lett. A 350, pp.232–235 (2006). [11] Kahan, W.: Unconventional numerical methods for trajectory calculations, Unpublished lecture notes (1993). [12] Kahan, W., and R.-C. Li: Unconventional schemes for a class of ordinary differential equationswith alications to the Korteweg de Vries equation, J. Comput. Phys. 134 2, pp.316–331 (1997). [13] Kimura, K., and Hirota, R.: Discretization of the Lagrange top, J. Phys. Soc. Jpn. 69 10, pp.3193–3199 (2000). [14] Petrera, M., Pfadler, A., and Suris, Y.B.: On integrability of HirotaKimura type discretization, Regular Chaotic Dyn. 16, pp.245–289 (2011). [15] Quispel, G.R.W, Roberts, J.A.G., and Thompson, C.J.: Integrable maings and soliton equations, Phys. Lett. A 126, pp.419–421 (1988). [16] Quispel, G.R.W., Roberts J.A.G., and Thompson, C.J.: Integrable maings and soliton equations II, Physica. D 34, pp.183–192 (1989). [17] [18]

Appendix

off nat on ezgcd ans := solve({uu1 − u1 = v2 ∗ vu3 + v3 ∗ vu2, uu2 − u2 = v1∗vu3+v3∗vu1, uu3−u3 = v1∗vu2+v2∗vu1, vu1−v1 = u2 ∗ uu3 + u3 ∗ uu2, vu2 − v2 = u1 ∗ uu3 + u3 ∗ uu1, vu3 − v3 = u1∗ uu2 + u2 ∗ uu1}, {uu1, uu2, uu3, vu1, vu2, vu3}) r1 := sub(ans,uu1 ∗ uu2 ∗ uu3 − vu1 ∗ vu2 ∗ vu3) t:={num(sub(ans, uuu1−uu1−(vu2∗vuu3+vu3∗vuu2))), num(sub(ans, uuu2 − uu2 − (vu1 ∗ vuu3 + vu3 ∗ vuu1))), num(sub(ans, uuu3 − uu3 − (vu1 ∗ vuu2 + vu2 ∗ vuu1))), num(sub(ans, vuu1 − vu1 − (uu2 ∗ uuu3 + uu3 ∗ uuu2))), num(sub(ans, vuu2 − vu2 − (uu1 ∗ uuu3 + uu3 ∗ uuu1))), num(sub(ans, vuu3 − vu3 − (uu1 ∗ uuu2 + uu2 ∗ uuu1)))} ans := solve(t, {uuu1, uuu2, uuu3, vuu1, vuu2, vuu3}) r2 := sub(ans, uuu1 ∗ uuu2 ∗ uuu3 − vuu1 ∗ vuu2 ∗ vuu3) ans := solve({u1 − ud1 = vd2 ∗ v3 + vd3 ∗ v2, u2 − ud2 = vd1∗v3+vd3∗v1, u3−ud3 = vd1∗v2+vd2∗v1, v1−vd1 = ud2∗u3+ud3∗u2, v2−vd2 = ud1∗u3+ud3∗u1, v3−vd3 = ud1 ∗ u2 + ud2 ∗ u1}, {ud1, ud2, ud3, vd1, vd2, vd3}) rd1 := sub(ans,ud1 ∗ ud2 ∗ ud3 − vd1 ∗ vd2 ∗ vd3) r0 := u1 ∗ u2 ∗ u3 − v1 ∗ v2 ∗ v3 ans := solve({r02 ∗ (rd1 + r1) + h1 ∗ r0 + h2 = 0, r12 ∗ (r0 + r2) + h1 ∗ r1 + h2 = 0}, {h1, h2}); end

B Checking command for conserved quantities

To check the conserved quantities, we adopt the Risa/Asir[17] command h1 = (2 ∗ (−3 ∗ u16 ∗ u22 ∗ u32 ∗ v14 + 4 ∗ u15 ∗ u23 ∗ u32 ∗ v13 ∗ v2 + ... − 4 ∗ u3 ∗ v3 + 1) h2 = (4 ∗ (u19 ∗ u23 ∗ u33 ∗ v16 − 2 ∗ u18 ∗ u24 ∗ u33 ∗ v15 ∗ v2 − ... − 6 ∗ u3 ∗ v3 + 1) G1 = subst(h1, u1, uu1, u2, uu2, u3, uu3, v1, vu1, v2, vu2, v3, vu3) G2 = subst(h2, u1, uu1, u2, uu2, u3, uu3, v1, vu1, v2, vu2, v3, vu3) V = [uu1, uu2, uu3, vu1, vu2, vu3] G=nd_gr_trace([ uu1 − u1 − (v2 ∗ vu3 + v3 ∗ vu2), uu2 − u2 − (v1 ∗ vu3 + v3 ∗ vu1), uu3 − u3 − (v1 ∗ vu2 + v2 ∗ vu1), vu1 − v1 − (u2 ∗ uu3 + u3 ∗ uu2), vu2 − v2 − (u1 ∗ uu3 + u3 ∗ uu1), vu3 − v3 − (u1 ∗ uu2 + u2 ∗ uu1)], V, 1, 1, 0); load(”gr”) p_nf(nm(H1 − G1), G, V, 0); p_nf(nm(H2 − G2), G, V, 0); end

A Conserved quantities through computer algebra To obtain conserved quantities through computer algebra, we use the Reduce[18] command as follows:

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

149

Improvement of the Thick-Restart Lanczos Method in Single Precision Floating Point Arithmetic using Givens rotations Masana Aoki1 , Masami Takata2 , Kinji Kimura3 , and Yoshimasa Nakamura1 1 Graduate School of Informatics, Kyoto University, Kyoto, Kyoto, JAPAN 2 Research Group of Information and Communication Technology for Life, Nara Women’s University, Nara, Nara, JAPAN 3 Department of Electrical and Electronics Engineering, University of Fukui, Fukui, Fukui, JAPAN Abstract— This paper proposes an improvement to the Thick-Restart Lanczos method, which can compute the truncated eigenvalue decomposition (EVD). EVD is one of the most basic computations in linear algebra and plays an important role in scientific and technical computation. Truncated EVD is used in molecular orbital computation and vibration analysis. As the restart algorithm of the improved method, the reorthogonalized eigen vectors of small matrices are needed. Ishida et.al. have improved the augmented implicitly restarted Lanczos bidiagonalization method, which can compute the truncated singular value decomposition, using the QR decomposition in terms of the Householder reflector. On the other hand, in this paper, these are computed using the QR decomposition based on the Givens rotation. Hence, a QR decomposition based on the Givens rotation should be implemented. For implementation of QR decomposition, knowledge of localization for the eigen vectors of the diagonally dominant matrix is required. This study concludes that, in single precision floating point arithmetic, these improvements increase both the speed and orthogonality of the truncated EVD compared with a conventional algorithm. Keywords: Givens rotation, Thick-restart-Lanczos method, TR-

LAN

1. Introduction Eigenvalue decomposition (EVD) is known as the most basic computation in linear algebra and it is important in scientific and technical computation. In particular, truncated EVD has been adopted for some molecular orbital computations and vibration analyses. The Thick-Restart Lanczos (TRL) method [8] has been proposed to compute truncated EVD. Ishida et.al. have improved the augmented implicitly restarted Lanczos bidiagonalization method[1], [2], which can compute the truncated singular value decomposition, using the QR decomposition in terms of the Householder reflector[3]. On the other hand, in this paper, the TRL method is improved as follows. Once eigen vectors have been computed using the QR decomposition based on the Givens rotation, then the reorthogonalized eigen vectors of small matrices is used in the restart algorithm. To improve

the orthogonality of the eigenvectors, the QR decomposition based on the Givens rotation must be implemented properly, using knowledge of the localization of the eigenvectors of the diagonally dominant matrix. Consequently, several numerical experiments show that, in single precision floating point arithmetic, the improvements proposed in this study reduce the computation time and improve the orthogonality of the truncated EVD compared with a conventional algorithm. Section 2 introduces the Lanczos [4] and TRL methods, which are Krylov subspace methods. Section 3 explains the QR decomposition based on the Givens rotation. Section 4 discusses the new restart strategy. Section 5 compares the conventional method, as implemented in TRLAN [9], with the proposed method.

2. Krylov Subspace Method

The Krylov subspace is composed of a matrix A ∈ R^{n×n}, an initial vector q_1 ∈ R^n (q_1 ≠ 0), and an iteration number k. The Krylov vectors q_1, Aq_1, A²q_1, … are obtained by repeatedly multiplying q_1 by A. The order-k Krylov subspace is spanned by the k Krylov vectors:

K(A, q_1, k) = span{q_1, Aq_1, …, A^{k−1}q_1}.  (1)

Krylov subspace methods have been developed to solve linear problems using the Krylov subspace.

2.1 Lanczos method

The Lanczos method, which is a Krylov subspace method, reduces a symmetric matrix to a smaller tridiagonal symmetric matrix. By applying the modified Gram–Schmidt method to the basis vectors of the Krylov subspace K(A, q_1, k), orthonormal bases are obtained, and

h_{i,j} = v_i^T A v_j = v_i^T A^T v_j = (A v_i)^T v_j = v_j^T (A v_i) = h_{j,i}  (2)

is satisfied. Thereafter, A is transformed into the tridiagonal symmetric matrix by the Lanczos method; Algorithm 1 shows the pseudocode. In the k-th step, eq. (3) is satisfied:

A V_k = V_k S_k + β_k v_{k+1} e_k^T,  (3)

Algorithm 1 Lanczos method
1: Generate an initial vector q_1, set to random numbers;
2: v_1 := q_1/|q_1|;
3: β_0 := 0;
4: q_0 := 0;
5: for j := 1 to k do
6:   r_j := A v_j;
7:   α_j := v_j^T r_j;
8:   r_j := r_j − α_j v_j − β_{j−1} v_{j−1};
9:   β_j := |r_j|;
10:  v_{j+1} := r_j/β_j;
11: end for

where

S_k := \begin{bmatrix}
\alpha_1 & \beta_1  &          &              &              \\
\beta_1  & \alpha_2 & \beta_2  &              &              \\
         & \beta_2  & \ddots   & \ddots       &              \\
         &          & \ddots   & \ddots       & \beta_{k-1}  \\
         &          &          & \beta_{k-1}  & \alpha_k
\end{bmatrix}.  (4)
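As an illustration of Algorithm 1, the following is a minimal NumPy sketch for a dense symmetric matrix, with no reorthogonalization; the function name and interface are ours, not those of TRLAN.

import numpy as np

def lanczos(A, q1, k):
    """Sketch of Algorithm 1: k-step Lanczos tridiagonalization of a
    symmetric matrix A with start vector q1 (no reorthogonalization)."""
    n = A.shape[0]
    V = np.zeros((n, k + 1))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    V[:, 0] = q1 / np.linalg.norm(q1)
    for j in range(k):
        r = A @ V[:, j]
        alpha[j] = V[:, j] @ r
        r -= alpha[j] * V[:, j]
        if j > 0:
            r -= beta[j - 1] * V[:, j - 1]
        beta[j] = np.linalg.norm(r)
        V[:, j + 1] = r / beta[j]
    # S_k of eq. (4) is tridiagonal, built from alpha and beta[:-1].
    S = np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)
    return V, S, beta[k - 1]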

The stopping criterion of the Lanczos method is determined using the Wilkinson theorem [7]. Let λ_j^{(k)} ∈ R and y_j^{(k)} ∈ R^n be an eigenvalue and the corresponding normalized eigenvector of S_k. In the Lanczos method,

S_k y_j^{(k)} = λ_j^{(k)} y_j^{(k)}  (5)

is satisfied. From

x_j^{(k)} := V_k y_j^{(k)} ∈ R^n,  (6)
||x_j^{(k)}||_2 = 1,  (7)

and eq. (3),

A x_j^{(k)} − λ_j^{(k)} x_j^{(k)} = A V_k y_j^{(k)} − V_k λ_j^{(k)} y_j^{(k)} = A V_k y_j^{(k)} − V_k S_k y_j^{(k)} = (A V_k − V_k S_k) y_j^{(k)} = (e_k^T y_j^{(k)}) β_k v_{k+1}.  (8)

Moreover, from ||v_{k+1}||_2 = 1, we compute

||A x_j^{(k)} − λ_j^{(k)} x_j^{(k)}||_2 = ||(e_k^T y_j^{(k)}) β_k v_{k+1}||_2 = (e_k^T y_j^{(k)}) β_k.  (9)

Thus, when λ_i(A) (i = 1, …, n) is an eigenvalue of A,

min_i |λ_j^{(k)} − λ_i(A)| ≤ ||A x_j^{(k)} − λ_j^{(k)} x_j^{(k)}||_2 = (e_k^T y_j^{(k)}) β_k  (10)

is satisfied by the Wilkinson theorem. Hence, by using (e_k^T y_j^{(k)}) β_k, the stopping criterion can be implemented.

2.2 Thick-Restart Lanczos method

The TRL method is an improved Lanczos method that uses a restart strategy. In the Lanczos method, until an approximate matrix is obtained, the Krylov subspace is expanded and a new basis vector is added at each iteration. Consequently, the method takes more memory and computation time as the cost of reorthogonalization increases. To reduce the cost of reorthogonalization, the number of bases of the Krylov subspace must be limited. Within the limited subspace, an initial vector of the Krylov subspace is chosen for the next cycle; this operation is called a restart. TRLAN implements the TRL method, which has been developed as a restart strategy only for real symmetric matrices.

The number of desired eigenpairs and the number of bases in the Krylov subspace are set to l and m (l < m ≪ n), respectively. In the TRL method, once the Lanczos method performs m iterations, the iteration is restarted with a new initial vector ṽ_{l+1} ∈ R^n. After m iterations of the Lanczos method,

A V_m = V_m S_m + β_m v_{m+1} e_m^T  (11)

is satisfied by eq. (3). Then, the eigenvalues γ_1, …, γ_m and the corresponding eigenvectors y_1, …, y_m of S_m are obtained as follows:

S_m y_i = γ_i y_i,  i = 1, …, m.  (12)

Among the m eigenvalues, the desired approximate eigenvalues and the corresponding normalized eigenvectors are set as γ_1, …, γ_l and y_1, …, y_l, respectively. When

D_l := diag(γ_1, …, γ_l),  (13)
Y_l := [y_1 … y_l]  (14)

are defined, then, by eq. (12),

S_m Y_l = Y_l D_l  (15)

is satisfied. By eqs. (11) and (15),

A V_m Y_l = V_m S_m Y_l + β_m v_{m+1} e_m^T Y_l = V_m Y_l D_l + β_m v_{m+1} e_m^T Y_l  (16)

is satisfied. By eq. (16), when

Ṽ_l := V_m Y_l,  (17)
η := e_m^T Y_l  (18)

are defined, then

A Ṽ_l = Ṽ_l D_l + β_m v_{m+1} η.  (19)

The i-th vector of Ṽ_l is set to ṽ_i (i = 1, …, l), and ṽ_{l+1} := v_{m+1}. By the definition of Ṽ_l, ṽ_1, …, ṽ_l, ṽ_{l+1} are orthogonal; hence ṽ_i (i = l + 2, …, m), α̃_i (i = l + 1, …, m), and β̃_i (i = l + 1, …, m − 1) can be computed by eq. (19) when the initial vector is set to ṽ_{l+1} and the Lanczos method is restarted.

Algorithm 2 TRL method
1: Set m and l;
2: Input Lanczos decomposition A V_m = V_m S_m + β_m v_{m+1} e_m^T;
3: for i := 1, 2, … do
4:   Compute all eigenvalues γ_1, …, γ_m and the normalized eigenvectors y_1, …, y_m of S_m;
5:   Extract the required eigenvalues γ_1, …, γ_l and the eigenvectors y_1, …, y_l;
6:   D_l := diag(γ_1, …, γ_l);
7:   Y_l := [y_1 … y_l];
8:   Ṽ_l := V_m Y_l;
9:   η := e_m^T Y_l;
10:  ṽ_{l+1} := v_{m+1};
11:  r̃_{l+1} := A ṽ_{l+1};
12:  α̃_{l+1} := ṽ_{l+1}^T r̃_{l+1};
13:  r̃_{l+1} := r̃_{l+1} − Σ_{j=1}^{l} β_m η(j) ṽ_j − α̃_{l+1} ṽ_{l+1};
14:  β̃_{l+1} := |r̃_{l+1}|;
15:  ṽ_{l+2} := r̃_{l+1}/β̃_{l+1};
16:  Ṽ_{l+1} := [Ṽ_l ṽ_{l+1}];
17:  S̃_{l+1} := [ D_l, β_m η^T ; β_m η, α̃_{l+1} ];
18:  Adopt A Ṽ_{l+1} = Ṽ_{l+1} S̃_{l+1} + β̃_{l+1} ṽ_{l+2} e_{l+1}^T and apply the Lanczos method m − l − 1 times;
19: end for
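For concreteness, lines 4–10 of Algorithm 2 can be sketched in NumPy as follows; this is an illustration assuming the l largest Ritz values are wanted, with names of our choosing, not the TRLAN implementation itself.

import numpy as np

def thick_restart(Vm, Sm, v_next, ell):
    """Sketch of lines 4-10 of Algorithm 2: keep ell Ritz pairs and form
    the restarted quantities of eqs. (17)-(18)."""
    gamma, Y = np.linalg.eigh(Sm)           # eigenpairs of the small matrix S_m
    idx = np.argsort(gamma)[::-1][:ell]     # keep the ell largest Ritz values
    Yl = Y[:, idx]
    Dl = np.diag(gamma[idx])
    Vl = Vm[:, :Sm.shape[0]] @ Yl           # V~_l := V_m Y_l   (eq. (17))
    eta = Yl[-1, :]                         # eta := e_m^T Y_l  (eq. (18))
    return Vl, Dl, eta, v_next              # v~_{l+1} := v_{m+1}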

Using

Ṽ_m := [ṽ_1 … ṽ_l ṽ_{l+1} … ṽ_m],  (20)

S̃_m := \begin{bmatrix}
D_l      & \beta_m \eta^T     &                    &                    &                    \\
\beta_m \eta & \tilde\alpha_{l+1} & \tilde\beta_{l+1}  &                    &                    \\
         & \tilde\beta_{l+1}  & \tilde\alpha_{l+2} & \ddots             &                    \\
         &                    & \ddots             & \ddots             & \tilde\beta_{m-1}  \\
         &                    &                    & \tilde\beta_{m-1}  & \tilde\alpha_m
\end{bmatrix},  (21)

A Ṽ_m is satisfied as follows:

A Ṽ_m = Ṽ_m S̃_m + β̃_m ṽ_{m+1} e_m^T,  (22)
Ṽ_m^T Ṽ_m = I.  (23)

When all elements of β_m η are sufficiently small, the stopping criterion follows the Wilkinson theorem: γ_i and ṽ_i (i = 1, …, l) are close to an eigenvalue and the corresponding eigenvector. Algorithm 2 shows the pseudocode of the TRL method.

3. QR decomposition based on the Givens rotation

To apply QR decomposition to an m × n (m ≥ n) matrix A, the Givens rotation, which is an orthogonal transformation, is introduced:

G_k(i_k, j_k) = \begin{bmatrix}
1 &        &               &        &                &        \\
  & \ddots &               &        &                &        \\
  &        & \cos\theta_k  & \cdots & \sin\theta_k   &        \\
  &        & \vdots        & I      & \vdots         &        \\
  &        & -\sin\theta_k & \cdots & \cos\theta_k   &        \\
  &        &               &        &                & 1
\end{bmatrix}.  (24)

Here, cos(θ_k) is placed in the elements (i_k, i_k) and (j_k, j_k); the element (i_k, j_k) stores sin(θ_k), and the element (j_k, i_k) is −sin(θ_k). Applied to a vector, the rotation annihilates the j_k-th component:

G_k(i_k, j_k) [ … x_{i_k} … y_{j_k} … ]^T = [ … √(x_{i_k}² + y_{j_k}²) … 0 … ]^T,  (25)

cos(θ_k) = x_{i_k}/√(x_{i_k}² + y_{j_k}²),  sin(θ_k) = y_{j_k}/√(x_{i_k}² + y_{j_k}²).  (26)

For the computation of cos(θ_k) and sin(θ_k), Algorithm 3 is adopted. By this transformation, an element in one of two rows or columns can be changed to 0. When Givens rotations are applied from the left side of the matrix A repeatedly, an upper triangular matrix R is obtained, and by composing the Givens rotations used to obtain R in reverse order, the orthogonal matrix Q can be computed. Eq. (27) is a representative strategy for transforming the subdiagonal elements into 0:

G(n, n+1) ⋯ G(n, m−1) G(n, m) × ⋯ × G(2, 3) ⋯ G(2, m−1) G(2, m) × G(1, 2) ⋯ G(1, m−1) G(1, m) A = R.  (27)

Here, the bottom terms G(1, 2), ⋯, G(1, m−1), G(1, m), the middle terms G(2, 3), ⋯, G(2, m−1), G(2, m), and the top terms G(n, n+1), ⋯, G(n, m−1), G(n, m) zero out the off-diagonal components of the first, the second, and the n-th column, respectively. In this strategy, Q is computed

using eq. (28):

Q = G(1, m)^T G(1, m−1)^T ⋯ G(1, 2)^T × G(2, m)^T G(2, m−1)^T ⋯ G(2, 3)^T × ⋯ × G(n, m)^T G(n, m−1)^T ⋯ G(n, n+1)^T.  (28)

Here, the subscript k of eq. (25) is omitted in eq. (28). In the TRL method, the non-diagonal elements of the matrix S_m that appears in the final stage of convergence are much smaller in absolute value. An EVD of S_m is expressed as

S_m Y_m = Y_m D_m,  (29)

where the matrix Y_m consists of the eigenvectors. To extract the required eigenvalues γ_1, …, γ_l and the eigenvectors y_1, …, y_l, the vectors of Y_m must be sorted in the desired order. As a strategy of QR decomposition for the sorted Y_m, not eq. (27) but eq. (30) should be adopted:

G(n, n+1) ⋯ G(m−2, m−1) G(m−1, m) × ⋯ × G(2, 3) ⋯ G(m−2, m−1) G(m−1, m) × G(1, 2) ⋯ G(m−2, m−1) G(m−1, m) A = R.  (30)

Here, the bottom terms G(1, 2), ⋯, G(m−2, m−1), G(m−1, m), the middle terms G(2, 3), ⋯, G(m−2, m−1), G(m−1, m), and the top terms G(n, n+1), ⋯, G(m−2, m−1), G(m−1, m) zero out the off-diagonal components of the first, the second, and the n-th column, respectively. In this strategy, Q is computed using the following:

Q = G(m−1, m)^T G(m−2, m−1)^T ⋯ G(1, 2)^T × G(m−1, m)^T G(m−2, m−1)^T ⋯ G(2, 3)^T × ⋯ × G(m−1, m)^T G(m−2, m−1)^T ⋯ G(n, n+1)^T.  (31)

In the TRL method, not Y_l := [y_1 … y_l] but Q in eq. (31) should be used as Y_l. To avoid overflow and underflow, the Givens rotation should be implemented as in Algorithm 3; the fused multiply–accumulate can be adopted in the double-underlined parts of lines 12 and 17 of Algorithm 3.

Algorithm 3 Implementation of the Givens rotation
1: f ← |x_{i_k}|
2: g ← |y_{j_k}|
3: t ← max(f, g)
4: if t = 0 then
5:   cos(θ_k) ← 1
6:   sin(θ_k) ← 0
7:   √(x_{i_k}² + y_{j_k}²) ← 0
8: else
9:   u ← f/t
10:  v ← g/t
11:  if f ≥ g then
12:    r ← √(1 + v²)
13:    cos(θ_k) ← u/r
14:    sin(θ_k) ← v/r
15:    √(x_{i_k}² + y_{j_k}²) ← r × t
16:  else
17:    r ← √(1 + u²)
18:    cos(θ_k) ← u/r
19:    sin(θ_k) ← v/r
20:    √(x_{i_k}² + y_{j_k}²) ← r × t
21:  end if
22: end if

4. New Restart Strategy

4.1 Rayleigh Quotient in Eigenvalue Decomposition

The Rayleigh quotient [6] of a computed eigenvector x̃_i is defined as

ρ = (1/||x̃_i||²) x̃_i^T A x̃_i.  (32)

ρ in eq. (32) satisfies the following equation for the computed eigenvector x̃_i:

ρ = arg min_z ||A x̃_i − z x̃_i||_2.  (33)

Here, ρ closely approximates an eigenvalue λ_i (i = 1, …, n) of A.

4.2 Implementation

In the TRL algorithm, the EVD of the small matrix S_m is performed internally, with the result used at the restarting point of the algorithm. If computation errors are ignored, the eigenvectors obtained by the EVD form an orthogonal matrix; in practice, it is known that the orthogonality in the Lanczos algorithm worsens because of rounding errors. To avoid this problem, an algorithm is proposed that restarts with orthogonalization of the eigenvectors of the small matrix S_m: the eigenvectors are decomposed into a column-orthogonal matrix and an upper triangular matrix using the QR decomposition [5] in terms of the Givens rotation. The whole algorithm is described in Algorithm 4. In the conventional algorithm, l vectors are extracted from the eigenvectors y_1, …, y_m and set as the new Y_l, where Y_l is an m × l matrix. The new algorithm uses the QR decomposition based on the Givens rotation, Y_l = QR, for orthogonalizing Y_l (Algorithm 4).

Algorithm 4 TRL algorithm (proposed algorithm)
1: Set m and l;
2: Input Lanczos decomposition A V_m = V_m S_m + β_m v_{m+1} e_m^T;
3: for i := 1, 2, … do
4:   Compute all eigenvalues γ_1, …, γ_m and the normalized eigenvectors y_1, …, y_m of S_m;
5:   Extract the required eigenvalues γ_1, …, γ_l and the eigenvectors y_1, …, y_l;
6:   D_l := diag(γ_1, …, γ_l);
7:   Y_l := [y_1 … y_l];
8:   Compute the QR decomposition Y_l = QR using the Givens rotation;
9:   Y_l ← Q;
10:  [D_l]_{i,i} ← (Y_l^T S_m Y_l)_{i,i} for i = 1, …, l;
11:  Ṽ_l := V_m Y_l;
12:  η := e_m^T Y_l;
13:  ṽ_{l+1} := v_{m+1};
14:  r̃_{l+1} := A ṽ_{l+1};
15:  α̃_{l+1} := ṽ_{l+1}^T r̃_{l+1};
16:  r̃_{l+1} := r̃_{l+1} − Σ_{j=1}^{l} β_m η(j) ṽ_j − α̃_{l+1} ṽ_{l+1};
17:  β̃_{l+1} := |r̃_{l+1}|;
18:  ṽ_{l+2} := r̃_{l+1}/β̃_{l+1};
19:  Ṽ_{l+1} := [Ṽ_l ṽ_{l+1}];
20:  S̃_{l+1} := [ D_l, β_m η^T ; β_m η, α̃_{l+1} ];
21:  Adopt A Ṽ_{l+1} = Ṽ_{l+1} S̃_{l+1} + β̃_{l+1} ṽ_{l+2} e_{l+1}^T and apply the Lanczos method m − l − 1 times;
22: end for
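Lines 8–10 of Algorithm 4 amount to re-orthogonalizing Y_l and refreshing D_l by Rayleigh quotients. A minimal NumPy sketch follows, using a generic QR in place of the Givens-based one purely for illustration; the sign conventions of np.linalg.qr may differ from the Givens construction, which is harmless for orthogonalization.

import numpy as np

def orthogonalized_restart(Sm, Yl):
    """Sketch of lines 8-10 of Algorithm 4: re-orthogonalize Y_l by QR
    and redefine D_l by the Rayleigh quotients of eq. (38)."""
    Q, _ = np.linalg.qr(Yl)    # Y_l = QR; Y_l <- Q (column-orthogonal)
    Yl = Q
    # [D_l]_{i,i} <- (Y_l^T S_m Y_l)_{i,i}: Rayleigh quotients that closely
    # approximate eigenvalues of S_m.
    Dl = np.diag(np.diag(Yl.T @ Sm @ Yl))
    return Yl, Dl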

Let the orthogonal matrix Q be the new Y_l:

Y_l ← [y_1, y_2, …, y_l],  (34)
Y_l = QR,  (35)
Y_l ← Q.  (36)

When Y_l of eq. (36), for which the orthogonality is improved, is adopted,

x = ||S_m Y_l − Y_l D_l||  (37)

is defined by using the computed eigenvalues D_l(1 : l, 1 : l) in line 10 of Algorithm 4. By improving the orthogonality of Y_l, x becomes larger. To avoid this problem, D_l is redefined by using the Rayleigh quotients in eq. (38):

[D_l]_{i,i} ← (Y_l^T S_m Y_l)_{i,i}.  (38)

Here, [D_l]_{i,i} are the Rayleigh quotients, which closely approximate the eigenvalues of S_m using the eigenvectors Y_l. Moreover, by eq. (16),

Ṽ_l^T A Ṽ_l = Y_l^T S_m Y_l = D_l  (39)

is led. Therefore, in terms of the vectors Y_l, [D_l]_{i,i}, which are close to eigenvalues of A, can be regarded as the Rayleigh quotients. To satisfy

S_m Y_l = Y_l D_l  (40)

approximately, it is set that [D_l]_{i,i} ← (Y_l^T S_m Y_l)_{i,i}.

Using eq. (17), line 11 of the proposed TRL algorithm is implemented as follows:

Ṽ_l := V_m G(m−1, m)^T G(m−2, m−1)^T ⋯ G(1, 2)^T × G(m−1, m)^T G(m−2, m−1)^T ⋯ G(2, 3)^T × ⋯ × G(m−1, m)^T G(m−2, m−1)^T ⋯ G(l, l+1)^T.  (41)

Eq. (41) can be computed using xROT in LAPACK. Using eq. (31), line 10 of the proposed TRL algorithm is also implemented as follows:

D_l ← G(l, l+1) ⋯ G(m−2, m−1) G(m−1, m) × ⋯ × G(2, 3) ⋯ G(m−2, m−1) G(m−1, m) × G(1, 2) ⋯ G(m−2, m−1) G(m−1, m) × S_m G(m−1, m)^T G(m−2, m−1)^T ⋯ G(1, 2)^T × G(m−1, m)^T G(m−2, m−1)^T ⋯ G(2, 3)^T × ⋯ × G(m−1, m)^T G(m−2, m−1)^T ⋯ G(l, l+1)^T.  (42)

In more detail,

X^{(0)} := S_m,  (43)
X^{(1)} := G(m−1, m) X^{(0)} G(m−1, m)^T,  (44)
X^{(2)} := G(m−2, m−1) X^{(1)} G(m−2, m−1)^T, ⋯.  (45)

Eqs. (43), (44), and (45) can be computed using xROT in LAPACK. In X^{(i)}, all elements except X^{(i)}_{m−1,m}, X^{(i)}_{m,m−1}, X^{(i)}_{m−1,m−1}, and X^{(i)}_{m,m} can be handled simply by xROT. As a remark, in exact arithmetic X^{(i)} is a symmetric matrix, so the implementation must be devised to maintain symmetry. Moreover, the diagonal elements must be computed with high accuracy.

Consequently,

pp = X^{(i)}_{m−1,m−1},  (46)
pq = X^{(i)}_{m−1,m},  (47)
qq = X^{(i)}_{m,m},  (48)

X^{(i)}_{m−1,m} = cos(θ_k) × sin(θ_k) × (qq − pp) + pq × (cos(θ_k) − sin(θ_k)) × (cos(θ_k) + sin(θ_k)),  (49)
X^{(i)}_{m,m−1} = X^{(i)}_{m−1,m},  (50)
X^{(i)}_{m−1,m−1} = cos²(θ_k) × pp + 2 cos(θ_k) sin(θ_k) × pq + sin²(θ_k) × qq,  (51)
X^{(i)}_{m,m} = sin²(θ_k) × pp − 2 cos(θ_k) sin(θ_k) × pq + cos²(θ_k) × qq  (52)

are adopted.

5. Experiments

5.1 Environment

For the experimental environment, a computer equipped with an Intel(R) Xeon(R) Silver 4116 @ 2.10 GHz (2 CPUs) and 192 GB of memory is used. Each program is compiled with gfortran 7.4.0, using LAPACK 3.8.0 [10] as the computation library. Sparse matrices are stored in CRS format. For the numerical computation, we use single precision floating point arithmetic.

For these numerical experiments, two types of matrices are prepared. First, real sparse matrices A1 ∈ R^{1,000,000×1,000,000} and A2 ∈ R^{1,800,000×1,800,000} are used as input; each row contains 1,000 elements consisting of uniform random numbers in [0, 1). A1 and A2 are examples of large-scale sparse matrices; by performing EVD on them, it is determined whether the new implementation can solve actual problems accurately. Second, real tridiagonal matrices A3 ∈ R^{10,000×10,000} and A4 ∈ R^{50,000×50,000} are used, in which all diagonal elements are 0 and all off-diagonal elements are 1. The i-th eigenvalue (i = 1, ⋯, n) of A3 and A4 is 2 cos(iπ/(n+1)), where n is the matrix size, so the large eigenvalues of these matrices are clustered around 2 and −2; these matrices therefore pose difficult problems. By solving the EVD of these matrices, it is determined whether the implementation can solve difficult problems with high speed and accuracy. The output is l (l = 10, 20, 30) eigenpairs corresponding to the larger eigenvalues of the input matrices. We use

(1/l) Σ_{1≤i≤l} ||A ṽ_i − σ̃_i ṽ_i||  (53)

as the average error value and

max_{1≤i≤l} ||A ṽ_i − σ̃_i ṽ_i||  (54)

as the maximum error value for a machine-computed eigenpair (σ̃_i, ṽ_i) of A. Moreover, the orthogonal errors

||Ṽ_l^T Ṽ_l − I||  (55)

are used to check the orthogonality of Ṽ_l = [ṽ_1, ṽ_2, …, ṽ_l].
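The three measures of eqs. (53)–(55) can be evaluated directly. A small NumPy sketch, assuming dense inputs for readability:

import numpy as np

def evaluation_errors(A, V, sigma):
    """Average/maximum residuals of eqs. (53)-(54) and the orthogonality
    error of eq. (55) for computed eigenpairs (sigma[i], V[:, i])."""
    residuals = [np.linalg.norm(A @ V[:, i] - sigma[i] * V[:, i])
                 for i in range(V.shape[1])]
    avg_err = np.mean(residuals)                              # eq. (53)
    max_err = np.max(residuals)                               # eq. (54)
    orth_err = np.linalg.norm(V.T @ V - np.eye(V.shape[1]))   # eq. (55)
    return avg_err, max_err, orth_err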



5.2 Discussion of Numerical Experiment

Fig. 1: Performance of truncated EVD on A1–A4: (a) average of eigenvalue errors; (b) maximum of eigenvalue errors; (c) orthogonal errors of Ṽ; (d) computation time; (e) iteration number. (C and P denote the conventional and proposed algorithms, respectively.)

Figure 1 shows the computational results of the truncated EVD. In the case of the proposed algorithm, which restarts with orthogonalization of the eigenvectors of the small matrix S_m, the computation time and the iteration number decrease compared with the conventional algorithm. The proposed algorithm is terminated once the eigenvalues reach the tolerance accuracy; the computation time is shortened because the proposed algorithm needs fewer iterations than the conventional one. For matrices A3 and A4, the orthogonality of the eigenvectors is improved, while the accuracy of the eigenvalues is diminished; this is caused by the fact that the proposed algorithm is terminated at the tolerance accuracy of the eigenvalues. As (a), (b), and (c) in Fig. 1 show, although the accuracy of the eigenvalues in the proposed algorithm is lower than that of the conventional algorithm, the orthogonality of the eigenvectors is better. The eigenvectors are emphasized in practical problems, so slightly worse eigenvalue accuracy should have no significant effect. In the proposed algorithm, the orthogonality for matrices A3 and A4, whose EVD is difficult, is equivalent to that for matrices A1 and A2, whose elements are random numbers. This shows that the orthogonality of the proposed algorithm is less sensitive to the nature of the matrix than that of the conventional algorithm. It is therefore appropriate to apply the proposed algorithm to the experimental matrices.

6. Conclusions

This paper has presented an improved TRL algorithm for computing the truncated EVD of a large-scale sparse input matrix. The proposed algorithm restarts with orthogonalization of the eigenvectors of the small matrix S_m generated inside the TRL algorithm; at restart, the improved implementation executes the QR decomposition based on the Givens rotation. Numerical experiments verify that the iteration number and the computation time are reduced compared with a conventional algorithm in single precision floating point arithmetic. The orthogonality of the proposed algorithm is less sensitive to the nature of the matrix than that of the conventional algorithm, and the proposed algorithm terminates at the tolerance accuracy of the eigenvalues. Future research will apply this algorithm to practical problems.


Acknowledgment

This work was supported by JSPS KAKENHI Grant Numbers JP17H02858 and JP17K00167.

References
[1] Baglama, J., and Reichel, L.: Augmented implicitly restarted Lanczos bidiagonalization methods, SIAM Journal on Scientific Computing, 27(1), pp. 19–42 (2005).
[2] Calvetti, D., et al.: An implicitly restarted Lanczos method for large symmetric eigenvalue problems, Electronic Transactions on Numerical Analysis, 2(1), pp. 1–21 (1994).
[3] Ishida, Y., Takata, M., Kimura, K., and Nakamura, Y.: Improvement of the Augmented Implicitly Restarted Lanczos Bidiagonalization Method in Single Precision Floating Point Arithmetic, IPSJ Transactions on Mathematical Modeling and its Applications, 11(3), pp. 19–25 (2018).
[4] Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, J. Res. Nat. Bureau Standards, Sec. B, no. 45, pp. 255–282 (1950).
[5] Golub, G. H., and Van Loan, C. F.: Matrix Computations, Johns Hopkins University Press, 4th edition (2012).
[6] Parlett, B. N.: The Symmetric Eigenvalue Problem, Society for Industrial and Applied Mathematics (1998).
[7] Wilkinson, J. H.: The Algebraic Eigenvalue Problem, Clarendon Press (1965).
[8] Wu, K., and Simon, H.: Thick-restart Lanczos method for large symmetric eigenvalue problems, SIAM J. Matrix Anal. and Appl., 22(2), pp. 602–616 (2000).
[9] Wu, J.: Research on Eigenvalue Computations (1999).
[10] Linear Algebra PACKage (LAPACK).

On an Implementation of Two-Sided Jacobi Method

Sho Araki1, Masami Takata2, Kinji Kimura3, and Yoshimasa Nakamura1
1 Graduate School of Informatics, Kyoto University, Kyoto, Kyoto, JAPAN
2 Research Group of Information and Communication Technology for Life, Nara Women's University, Nara, Nara, JAPAN
3 Department of Electrical and Electronics Engineering, University of Fukui, Fukui, Fukui, JAPAN

Abstract— The Jacobi method for singular value decomposition can compute all singular values and singular vectors with high accuracy; previously published studies have reported that the Jacobi method is more accurate than the QR algorithm. The computation cost of the Jacobi method is higher than that of the method that combines the QR algorithm with bidiagonalization by the Householder transformation; however, the cost is insignificant for very small matrices. Moreover, the Jacobi method can be implemented on embedded systems such as FPGAs because of its simple operation pattern. Based on the Jacobi method, one-sided and two-sided Jacobi methods have been proposed. The one-sided method has already been implemented in LAPACK, whereas many parts of the implementation of the two-sided Jacobi method can still be improved. Thus, in this paper, we improve the two-sided Jacobi method. We confirmed through experiments that the two-sided Jacobi method has a shorter computation time and a higher accuracy than the one-sided Jacobi method for small matrices.

Keywords: singular value decomposition, one-sided Jacobi method, two-sided Jacobi method, false position method, secant method, fused multiply–accumulate

1. Introduction

Many mathematical applications require solving a generalized eigenvalue problem comprising a symmetric matrix and a positive definite symmetric matrix, although these applications use only some eigenvalues and the corresponding eigenvectors. The Sakurai-Sugiura method [15] is known as a truncated eigenvalue decomposition and uses a column space. To compute the column space, a rectangular matrix should be decomposed using a singular value decomposition. Generally, a given matrix is transformed into a bidiagonal matrix using the Householder transformation [3] as a preprocessing step. In [1], a computation method for the column space, applied to a bidiagonal matrix, has been proposed. The method combines the DQDS (differential qd with shift) [7], [13] and OQDS (orthogonal qd with shift) methods [12]. The Sakurai-Sugiura method needs only the column space, which is based on the left singular vectors, of a given upper bidiagonal matrix. Because

the row space of a lower bidiagonal matrix is equal to the column space of the upper bidiagonal matrix, it can be computed from the right singular vectors obtained by the OQDS method, as proposed in [1]. To reduce the computation costs and improve accuracy, the Sakurai-Sugiura method was modified by Imakura et al. [9]. The modified Sakurai-Sugiura method requires both the left and right singular vectors. The OQDS method achieves high accuracy when only the column space is required; thus, the OQDS method is compatible with the original Sakurai-Sugiura method. However, when the left singular vectors of the lower bidiagonal matrix are computed by the OQDS method, it requires a matrix twice as large as the given matrix, and the left singular vectors are obtained by extracting the smaller matrix from the larger one. Consequently, it is not guaranteed that the computed left singular vectors of the lower bidiagonal matrix have high orthogonality. Hence, to implement the modified Sakurai-Sugiura method with high accuracy, it is necessary to establish a highly accurate singular value decomposition method that provides all singular values and both left and right singular vectors. James Demmel and Kresimir Veselic reported that the Jacobi method is more accurate than QR [4]. For singular value decomposition, one-sided and two-sided Jacobi methods have been proposed [5], [6], [2], [8], [10]. The one-sided Jacobi method was implemented in LAPACK [11], whereas many parts of the implementation of the two-sided Jacobi method can still be improved. Thus, in this paper, we improve the two-sided Jacobi method. Experimental results confirmed that the two-sided Jacobi method has a shorter computation time and a higher accuracy than the one-sided Jacobi method for small matrices.

2. Target matrices

The two-sided Jacobi method for eigenvalue decomposition can compute the eigenvalues and eigenvectors of a real symmetric matrix; more precisely, the target matrix can be extended to a Hermitian matrix. The two-sided Jacobi method for singular value decomposition can even be designed to perform computations on

Fig. 1: Space sharing of the upper triangular matrix ((L^{(i)})^T and R^{(i)} are stored together).

complex matrices of any size. However, in this paper, we consider only real upper triangular matrices. By preprocessing with the QR and LQ decompositions in the case of rectangular matrices, the singular value decomposition of a rectangular matrix can be reduced to that of an upper triangular matrix. Moreover, since the method extends easily to a complex upper triangular matrix, the singular value decomposition using the two-sided Jacobi method is designed here for computations on real upper triangular matrices.

3. Singular value decomposition using the two-sided Jacobi method

3.1 Outline

Let J^{(i)}, K^{(i)}, N^{(i)}, and M^{(i)} be products of rotation matrices, and let R^{(i)} and L^{(i)} be a real upper and lower triangular matrix, respectively. In a singular value decomposition using the two-sided Jacobi method, eqs. (1) and (2) are computed repeatedly:

K^{(i)} R^{(i)} J^{(i)} = L^{(i)},  (1)
N^{(i)} L^{(i)} M^{(i)} = R^{(i+1)},  i = 0, 1, ⋯.  (2)

By these iterative computations, R^{(i)} and L^{(i)} converge to a diagonal matrix. At convergence, the left singular vectors U and the right singular vectors V can be computed as follows:

U = (K^{(0)})^T (N^{(0)})^T (K^{(1)})^T (N^{(1)})^T ⋯ (K^{(m−1)})^T (N^{(m−1)})^T,  (3)
V = J^{(0)} M^{(0)} J^{(1)} M^{(1)} ⋯ J^{(m−1)} M^{(m−1)},  (4)

where m is the number of iterations until convergence. The matrix multiplications in eqs. (3) and (4) are accomplished by Givens rotations. Fig. 1 shows that R^{(i)} and L^{(i)} are stored together in the upper triangular matrix; in this case, no memory needs to be allocated for R^{(i)} and L^{(i)} as separate matrices, so R^{(i)} and L^{(i)} can be computed in the same memory area.

As shown in eq. (7), R_{j,k} is converted to 0 by using the rotation matrices P and Q, where I denotes an identity matrix:

P = \begin{bmatrix}
I      & 0      & \cdots & \cdots & 0      \\
0      & c_1    & \cdots & s_1    & \vdots \\
\vdots & \vdots & I      & \vdots & \vdots \\
\vdots & -s_1   & \cdots & c_1    & 0      \\
0      & \cdots & \cdots & 0      & I
\end{bmatrix},  (5)

Q = \begin{bmatrix}
I      & 0      & \cdots & \cdots & 0      \\
0      & c_2    & \cdots & -s_2   & \vdots \\
\vdots & \vdots & I      & \vdots & \vdots \\
\vdots & s_2    & \cdots & c_2    & 0      \\
0      & \cdots & \cdots & 0      & I
\end{bmatrix},  (6)

P \times \begin{bmatrix}
\ddots & \cdots  & \cdots & \cdots  & \cdots \\
\vdots & R_{j,j} & \cdots & R_{j,k} & \vdots \\
\vdots & \vdots  & \ddots & \vdots  & \vdots \\
\vdots & 0       & \cdots & R_{k,k} & \vdots \\
\cdots & \cdots  & \cdots & \cdots  & \ddots
\end{bmatrix} \times Q
= \begin{bmatrix}
\ddots & \cdots        & \cdots & \cdots        & \cdots \\
\vdots & \hat R_{j,j}  & \cdots & 0             & \vdots \\
\vdots & \vdots        & \ddots & \vdots        & \vdots \\
\vdots & 0             & \cdots & \hat R_{k,k}  & \vdots \\
\cdots & \cdots        & \cdots & \cdots        & \ddots
\end{bmatrix}.  (7)

By repeating eq. (7), R^{(i)} can be transformed into L^{(i)}. Note, however, that eq. (7) alone is not the computation of L^{(i)} from R^{(i)}; it is therefore not expressed using R^{(i)}_{j,j}, R^{(i)}_{j,k}, R^{(i)}_{k,k}, L^{(i)}_{j,j}, and L^{(i)}_{k,k}. Since P and Q are rotation matrices, θ_1 and θ_2 satisfy c_1 = cos θ_1, s_1 = sin θ_1, c_2 = cos θ_2, and s_2 = sin θ_2. Hereafter, we discuss only those elements whose values change:

\begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix}
\begin{bmatrix} R_{j,j} & R_{j,k} \\ 0 & R_{k,k} \end{bmatrix}
\begin{bmatrix} c_2 & -s_2 \\ s_2 & c_2 \end{bmatrix}
= \begin{bmatrix} \hat R_{j,j} & 0 \\ 0 & \hat R_{k,k} \end{bmatrix}.  (8)

To compute L^{(i)} from R^{(i)}, eq. (8) must be repeated many times; in the iterative procedure, the ordering strategy for erasing the off-diagonal elements explained in section 3.2 is adopted. The same procedure is used to obtain R^{(i+1)} from L^{(i)}. The computation of c_1, s_1, c_2, s_2, R̂_{j,j}, and R̂_{k,k} from R_{j,j}, R_{j,k}, and R_{k,k} is explained in sections 3.3, 3.4, and 3.5.
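To make eq. (8) concrete, the following NumPy sketch applies one annihilation step to the 2×2 block, given rotation coefficients from any of the methods of sections 3.3–3.5; this is an illustration with names of our choosing, not the authors' code.

import numpy as np

def apply_two_sided_rotation(Rjj, Rjk, Rkk, c1, s1, c2, s2):
    """One step of eq. (8): left and right rotations diagonalize the
    2x2 upper-triangular block [[Rjj, Rjk], [0, Rkk]]."""
    B = np.array([[c1, s1], [-s1, c1]]) @ \
        np.array([[Rjj, Rjk], [0.0, Rkk]]) @ \
        np.array([[c2, -s2], [s2, c2]])
    # Off-diagonals of B are (numerically) zero after a correct step.
    return B[0, 0], B[1, 1]   # R^_jj, R^_kk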

3.2 Ordering strategy and convergence criterion

In the ordering strategy, the off-diagonal elements of the upper triangular matrix R^{(i)} are reduced to 0. Because the non-zero elements then appear in the lower triangular part, that part is set as the lower triangular matrix L^{(i)}. The details are as follows. If |R^{(0)}_{1,1}| ≥ |R^{(0)}_{n,n}|, the off-diagonal elements are reduced to 0 in the order (1, 2), (1, 3), ⋯, (1, n), (2, 3), (2, 4), ⋯, (n−2, n−1), (n−2, n), (n−1, n); then the off-diagonal elements of the lower triangular matrix L^{(i)} are reduced to 0 in the order (2, 1), (3, 1), ⋯, (n, 1), (3, 2), (4, 2), ⋯, (n−1, n−2), (n, n−2), (n, n−1). If |R^{(0)}_{1,1}| < |R^{(0)}_{n,n}|, the off-diagonal elements are reduced to 0 in the order (n−1, n), (n−2, n), ⋯, (1, n), (n−2, n−1), (n−3, n−1), ⋯, (1, 3), (1, 2); then the off-diagonal elements of L^{(i)} are reduced to 0 in the order (n, n−1), (n, n−2), ⋯, (n, 1), (n−1, n−2), (n−1, n−3), ⋯, (3, 1), (2, 1).

With the two-sided Jacobi method, all off-diagonal elements converge to 0. Computationally, since the number of iterations is finite, the off-diagonal elements may not become exactly 0; therefore, when eq. (9) is satisfied, the element is set to R_{j,k} ← 0:

|R_{j,k}| ≤ ε √(|R_{j,j}| × |R_{k,k}|).  (9)

Once all off-diagonal elements converge to 0, the iteration is terminated.

3.3 Implementation method using the arctangent function

Unlike the one-sided Jacobi method, singular value decomposition using the two-sided Jacobi method requires many operations to decide c_1, s_1, c_2, and s_2; in numerical computation, such a large number of operations introduces numerous rounding errors into the computed variables. Therefore, we propose an implementation using the arctangent function, which decreases the number of operations by computing c_1, s_1, c_2, and s_2 from tan^{−1}, θ_1, and θ_2:

α = tan^{−1}(R_{j,k}/(R_{j,j} − R_{k,k})),  (10)
β = tan^{−1}(−R_{j,k}/(R_{j,j} + R_{k,k})),  (11)
θ_1 = (α + β)/2,  (12)
θ_2 = (α − β)/2,  (13)
c_1 = cos(θ_1), s_1 = sin(θ_1), c_2 = cos(θ_2), s_2 = sin(θ_2).  (14)

Here, −π/2 ≤ θ_1 ≤ π/2 and −π/2 ≤ θ_2 ≤ π/2. Then, the computed c_1, s_1, c_2, and s_2 are substituted into eqs. (15) and (16):

u = c_1 + c_2,  (15)
R̂_{j,j} = R_{j,j} + (s_2/u) × R_{j,k},  R̂_{k,k} = R_{k,k} − (s_1/u) × R_{j,k}.  (16)

The fused multiply–accumulate can be adopted in the double-underlined parts of the equations; it reduces the error of the final result by performing a product–sum operation in one instruction without intermediate rounding, and is therefore important for achieving high accuracy. In the case that |x_0| is much larger than |x_i| (i = 1, ⋯, q), the method based on T_q = Σ_{i=1}^{q} x_i and S_q = x_0 + T_q is suitable for computing S_q = Σ_{i=0}^{q} x_i; this process is adopted in eq. (16).
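A direct transcription of eqs. (10)–(16) in Python follows as a sketch; math.atan2 is our pragmatic choice to handle zero denominators, and the fused multiply–accumulate of the original is not reproduced here.

import math

def arctan_coefficients(Rjj, Rjk, Rkk):
    """Sketch of eqs. (10)-(14): rotation coefficients via the arctangent."""
    alpha = math.atan2(Rjk, Rjj - Rkk)    # eq. (10); atan2 handles Rjj == Rkk
    beta = math.atan2(-Rjk, Rjj + Rkk)    # eq. (11)
    theta1 = 0.5 * (alpha + beta)         # eq. (12)
    theta2 = 0.5 * (alpha - beta)         # eq. (13)
    return (math.cos(theta1), math.sin(theta1),
            math.cos(theta2), math.sin(theta2))

def update_diagonal(Rjj, Rjk, Rkk, c1, s1, c2, s2):
    """Sketch of eqs. (15)-(16): new diagonal entries after one rotation."""
    u = c1 + c2
    return Rjj + (s2 / u) * Rjk, Rkk - (s1 / u) * Rjk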

3.4 Rutishauser’s implementation method

By using Rutishauser's implementation method [14] from the two-sided Jacobi method for eigenvalue decomposition, the fused multiply–accumulate, which can achieve high accuracy, prevents errors from entering c_1, s_1, c_2, and s_2. t_1 and t_2 are decided as follows:

h_1 = (R_{j,j} − R_{k,k})/R_{j,k},  f_1 = √(1 + h_1²),  t_1 = 1/(h_1 ± f_1),  (17)
h_2 = −(R_{j,j} + R_{k,k})/R_{j,k},  f_2 = √(1 + h_2²),  t_2 = 1/(h_2 ± f_2).  (18)

Here, the signs of f_1 and f_2 have to be matched with the signs of h_1 and h_2, respectively. Then, by using t_1 and t_2,

v_1 = 1 − t_1 × t_2,  w_1 = t_1 + t_2,  (19), (20)
u_1 = max(|v_1|, |w_1|),  (21)
c_1 = (v_1/u_1)/√((v_1/u_1)² + (w_1/u_1)²),  s_1 = (w_1/u_1)/√((v_1/u_1)² + (w_1/u_1)²),  (22)

and

v_2 = 1 + t_1 × t_2,  w_2 = t_1 − t_2,  (23)
u_2 = max(|v_2|, |w_2|),  c_2 = (v_2/u_2)/√((v_2/u_2)² + (w_2/u_2)²),  s_2 = (w_2/u_2)/√((v_2/u_2)² + (w_2/u_2)²),  (24)

are computed. Finally, eqs. (25) and (26) can be computed using c_1, s_1, c_2, and s_2:

u = c_1 + c_2,  (25)
R̂_{j,j} = R_{j,j} + (s_2/u) × R_{j,k},  R̂_{k,k} = R_{k,k} − (s_1/u) × R_{j,k}.  (26)
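A small Python sketch of eqs. (17)–(24) follows, with the ± signs matched to h_1 and h_2 via copysign; the helper cs and its scaling assume u_1, u_2 ≠ 0, and the function names are ours.

import math

def rutishauser_coefficients(Rjj, Rjk, Rkk):
    """Sketch of eqs. (17)-(24): Rutishauser-style c1, s1, c2, s2."""
    h1 = (Rjj - Rkk) / Rjk
    f1 = math.copysign(math.sqrt(1.0 + h1 * h1), h1)   # sign matched to h1
    t1 = 1.0 / (h1 + f1)
    h2 = -(Rjj + Rkk) / Rjk
    f2 = math.copysign(math.sqrt(1.0 + h2 * h2), h2)   # sign matched to h2
    t2 = 1.0 / (h2 + f2)

    def cs(v, w):
        # eqs. (21)-(22)/(24): scale by max(|v|, |w|) before normalizing.
        u = max(abs(v), abs(w))
        r = math.hypot(v / u, w / u)
        return (v / u) / r, (w / u) / r

    c1, s1 = cs(1.0 - t1 * t2, t1 + t2)   # eqs. (19)-(22)
    c2, s2 = cs(1.0 + t1 * t2, t1 - t2)   # eqs. (23)-(24)
    return c1, s1, c2, s2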

3.5 Implementation method using the Givens rotation

3.5.1 Implementation of the Givens rotation

Consider the Givens rotation:

cos(θ) = x/√(x² + y²),  sin(θ) = y/√(x² + y²).  (27)

For computing cos(θ), sin(θ), and √(x² + y²), Algorithm 1 is adopted. To avoid overflow and underflow, the Givens rotation should be implemented as in Algorithm 1; the fused multiply–accumulate can be adopted in the double-underlined parts of lines 12 and 17 of Algorithm 1.

Algorithm 1 Implementation of the Givens rotation
1: f ← |x|
2: g ← |y|
3: t ← max(f, g)
4: if t = 0 then
5:   cos(θ) ← 1
6:   sin(θ) ← 0
7:   √(x² + y²) ← 0
8: else
9:   u ← f/t
10:  v ← g/t
11:  if f ≥ g then
12:    r ← √(1 + v²)
13:    cos(θ) ← u/r
14:    sin(θ) ← v/r
15:    √(x² + y²) ← r × t
16:  else
17:    r ← √(1 + u²)
18:    cos(θ) ← u/r
19:    sin(θ) ← v/r
20:    √(x² + y²) ← r × t
21:  end if
22: end if

3.5.2 Detail of the implementation

For computing c_1, s_1, c_2, and s_2, Algorithm 2 is adopted. Here, the function SIGN(A, B) returns the value of A with the sign of B. Then, eqs. (28) and (29) are computed using c_1, s_1, c_2, and s_2:

u ← c_1 + c_2,  (28)
R̂_{j,j} ← R_{j,j} + (s_2/u) × R_{j,k},  R̂_{k,k} ← R_{k,k} − (s_1/u) × R_{j,k}.  (29)

The fused multiply–accumulate can be adopted in the double-underlined parts of the equations.

Algorithm 2 Implementation method using the Givens rotation
1: h_1 ← R_{j,j} − R_{k,k}
2: g_1 ← |R_{j,k}|
3: f_1 ← |h_1| + √(h_1² + R_{j,k}²)   {the Givens rotation is adopted in the underlined part}
4: g_1 ← SIGN(R_{j,k}, R_{j,k}/h_1)
5: h_2 ← R_{j,j} + R_{k,k}
6: f_2 ← |h_2| + √(h_2² + R_{j,k}²)   {the Givens rotation is adopted in the underlined part}
7: g_2 ← SIGN(R_{j,k}, −R_{j,k}/h_2)
8: if f_1 ≥ f_2 then
9:   t_1 ← g_1/f_1
10:  ĉ_1 ← −t_1 × g_2 + f_2
11:  ŝ_1 ← t_1 × f_2 + g_2
12:  Compute c_1 and s_1 using the Givens rotation for x ← ĉ_1 and y ← ŝ_1
13:  ĉ_2 ← t_1 × g_2 + f_2
14:  ŝ_2 ← t_1 × f_2 − g_2
15:  Compute c_2 and s_2 using the Givens rotation for x ← ĉ_2 and y ← ŝ_2
16: else
17:  t_2 ← g_2/f_2
18:  ĉ_1 ← −g_1 × t_2 + f_1
19:  ŝ_1 ← f_1 × t_2 + g_1
20:  Compute c_1 and s_1 using the Givens rotation for x ← ĉ_1 and y ← ŝ_1
21:  ĉ_2 ← g_1 × t_2 + f_1
22:  ŝ_2 ← −f_1 × t_2 + g_1
23:  Compute c_2 and s_2 using the Givens rotation for x ← ĉ_2 and y ← ŝ_2
24: end if

The fused multiply–accumulate can be adopted in the double-underlined parts of Algorithm 2.
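Algorithm 1 scales by t = max(|x|, |y|) before squaring, so x² + y² is never formed directly. A line-by-line Python sketch (signs are dropped via the absolute values, exactly as in the algorithm):

import math

def safe_givens(x, y):
    """Sketch of Algorithm 1: cos(theta), sin(theta), and sqrt(x^2 + y^2)
    without overflow/underflow, by scaling with t = max(|x|, |y|)."""
    f, g = abs(x), abs(y)
    t = max(f, g)
    if t == 0.0:
        return 1.0, 0.0, 0.0
    u, v = f / t, g / t
    if f >= g:
        r = math.sqrt(1.0 + v * v)   # FMA usable for 1 + v*v (line 12)
    else:
        r = math.sqrt(1.0 + u * u)   # FMA usable for 1 + u*u (line 17)
    return u / r, v / r, r * t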

3.6 Addition of a sorting function to the two-sided Jacobi method

In the case |R^{(0)}_{1,1}| ≥ |R^{(0)}_{n,n}|: if |R̂_{j,j}| < |R̂_{k,k}| and s_1 > 0 are satisfied, we set c_1 ← s_1 and s_1 ← −c_1, which means θ_1 ← θ_1 − π/2; and if |R̂_{j,j}| < |R̂_{k,k}| and s_1 ≤ 0 are satisfied, we set c_1 ← −s_1 and s_1 ← c_1, which means θ_1 ← θ_1 + π/2. If |R̂_{j,j}| < |R̂_{k,k}| and s_2 > 0 are satisfied, we set c_2 ← s_2 and s_2 ← −c_2, which means θ_2 ← θ_2 − π/2; and if |R̂_{j,j}| < |R̂_{k,k}| and s_2 ≤ 0 are satisfied, we set c_2 ← −s_2 and s_2 ← c_2, which means θ_2 ← θ_2 + π/2. If π/2 is subtracted from or added to both θ_1 and θ_2, we set R̂_{j,j} ← R̂_{k,k} and R̂_{k,k} ← R̂_{j,j}; otherwise, we set R̂_{j,j} ← −R̂_{k,k} and R̂_{k,k} ← −R̂_{j,j}.

In the case |R^{(0)}_{1,1}| < |R^{(0)}_{n,n}|: if |R̂_{j,j}| ≥ |R̂_{k,k}| and s_1 > 0 are satisfied, we set c_1 ← s_1 and s_1 ← −c_1, which means θ_1 ← θ_1 − π/2; and if |R̂_{j,j}| ≥ |R̂_{k,k}| and s_1 ≤ 0 are satisfied, we set c_1 ← −s_1 and s_1 ← c_1, which means θ_1 ← θ_1 + π/2. If |R̂_{j,j}| ≥ |R̂_{k,k}| and s_2 > 0 are satisfied, we set c_2 ← s_2 and s_2 ← −c_2, which means θ_2 ← θ_2 − π/2; and if |R̂_{j,j}| ≥ |R̂_{k,k}| and s_2 ≤ 0 are satisfied, we set c_2 ← −s_2 and s_2 ← c_2, which means θ_2 ← θ_2 + π/2. If π/2 is subtracted from or added to both θ_1 and θ_2, we set R̂_{j,j} ← R̂_{k,k} and R̂_{k,k} ← R̂_{j,j}; otherwise, we set R̂_{j,j} ← −R̂_{k,k} and R̂_{k,k} ← −R̂_{j,j}. By adding the above operation, the two-sided Jacobi method acquires the function of sorting the singular values from larger to smaller. Note that after the above operation, c_1 and c_2 are still nonnegative.

4. Correction of c_1, s_1, c_2, or s_2

By using Rutishauser's implementation method [14], we can correct c_1, s_1, c_2, or s_2. Let c̃, s̃, ĉ, and ŝ be a variable representing both c_1 and c_2, a variable representing both s_1 and s_2, a result of correction for c_1 and c_2, and a result of correction for s_1 and s_2, respectively.

4.1 False position method

Fig. 2: False position method

Fig. 2 shows the image of the false position method. In the initial setting, x_1 and x_2 have different values, and the sign of f(x_1) is set to be different from that of f(x_2). In the false position method, x_M in eq. (30) is set to a new position to compute the real root x of f(x) = 0:

x_M = (x_1 × f(x_2) − x_2 × f(x_1))/(f(x_2) − f(x_1)).  (30)

Here, if the sign of f(x_1) is equal to that of f(x_M), then x_1 ← x_M; on the other hand, if the sign of f(x_2) is equal to that of f(x_M), then x_2 ← x_M. As shown in Fig. 2, x_M is set to a new x_1.

4.2 Secant method

Fig. 3: Secant method

Fig. 3 shows the image of the secant method. In the secant method, the following recurrence relation is adopted in order to compute the real root x of f(x) = 0:

x_{n+1} = x_n − f(x_n) × (x_n − x_{n−1})/(f(x_n) − f(x_{n−1})) = (x_{n−1} f(x_n) − x_n f(x_{n−1}))/(f(x_n) − f(x_{n−1})).  (31)

From the initial settings x_0 and x_1, the sequence x_2, x_3, ⋯ converges to the real root x as the points are computed in order.
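Eq. (31) in code, as a generic sketch (the function name, iteration cap, and tolerance are illustrative choices, not part of the paper's implementation):

def secant_root(f, x0, x1, iters=20, tol=1e-15):
    """Sketch of the secant recurrence of eq. (31) for f(x) = 0."""
    for _ in range(iters):
        fx0, fx1 = f(x0), f(x1)
        if abs(fx1 - fx0) < tol:      # avoid division by a vanishing secant
            break
        x0, x1 = x1, (x0 * fx1 - x1 * fx0) / (fx1 - fx0)
    return x1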

4.3 Correction method

Theoretically, c̃² + s̃² = 1 is satisfied. Computationally, however, the equation is not satisfied because of rounding errors. Therefore, we propose a correction method for c̃ and s̃. Assuming that s̃ is correct, c̃ is decided by

x² + s̃² = 1.  (32)

Assuming that c̃ is correct, s̃ is computed using

c̃² + x² = 1.  (33)

Equations (32) and (33) can be used properly by introducing c̃ = cos θ and s̃ = sin θ: for −π/4 ≤ θ ≤ π/4, eq. (32) is used, whereas eq. (33) is adopted for π/4 < θ ≤ π/2 or −π/2 ≤ θ < −π/4. In a singular value decomposition using the two-sided Jacobi method, c̃ ≥ 0 is satisfied; therefore, we can assume −π/2 ≤ θ ≤ π/2. When the nonlinear single equation f(x) = 0 and the initial numbers x_0 and x_1 are given, x_2, the result of one iteration of the secant method, is exactly equal to x_M achieved by the false position method:

x_2 = (x_0 f(x_1) − x_1 f(x_0))/(f(x_1) − f(x_0)).  (34)

In the case −π/4 ≤ θ ≤ π/4, c̃ is recomputed using eq. (32); this case is equivalent to c̃ ≥ |s̃|. In order to compute c̃, the initial numbers are set to x_0 = 1 and x_1 = c̃. When f(x) = x² + s̃² − 1,

ĉ = ((c̃² + s̃² − 1) − c̃s̃²)/((c̃² + s̃² − 1) − (s̃²)) = 1 − s̃ × s̃/(1 + c̃)  (35)

is obtained, where ĉ is more suitable for satisfying f(x) = x² + s̃² − 1. The Givens rotation for vectors x and y is defined as

x ← c̃x + s̃y,  y ← −s̃x + c̃y.  (36)

When using ĉ instead of c̃, with

z_1 = s̃/(1 + c̃),  (37)

the rotation becomes

x ← (1 − s̃ × z_1)x + s̃y = s̃(−z_1 x + y) + x,  (38)
y ← −s̃x + (1 − s̃ × z_1)y = −s̃(z_1 y + x) + y.  (39)

The case π/4 < θ ≤ π/2 is equivalent to c̃ < |s̃| and s̃ ≥ 0. In order to compute s̃, the initial numbers are set to x_0 = 1 and x_1 = s̃. When f(x) = c̃² + x² − 1,

ŝ = ((c̃² + s̃² − 1) − s̃c̃²)/((c̃² + s̃² − 1) − (c̃²)) = 1 − c̃ × c̃/(1 + s̃)  (40)

is obtained, where ŝ is more suitable for satisfying f(x) = c̃² + x² − 1. When using ŝ instead of s̃, the Givens rotation for vectors x and y can be represented as follows:

z_2 = c̃/(1 + s̃),  (41)
x ← c̃x + (1 − c̃ × z_2)y = c̃(−z_2 y + x) + y,  (42)
y ← −(1 − c̃ × z_2)x + c̃y = c̃(z_2 x + y) − x.  (43)

The case −π/2 ≤ θ < −π/4 is equivalent to c̃ < |s̃| and s̃ ≤ 0. To compute s̃, the initial numbers are set to x_0 = −1 and x_1 = s̃. When f(x) = c̃² + x² − 1,

ŝ = (−(c̃² + s̃² − 1) − s̃c̃²)/((c̃² + s̃² − 1) − (c̃²)) = −1 + c̃ × c̃/(1 − s̃)  (44)

is obtained, where ŝ is more suitable for satisfying f(x) = c̃² + x² − 1. When using ŝ instead of s̃, the Givens rotation for vectors x and y can be represented as follows:

z_3 = c̃/(1 − s̃),  (45)
x ← c̃x + (−1 + c̃ × z_3)y = c̃(z_3 y + x) − y,  (46)
y ← −(−1 + c̃ × z_3)x + c̃y = c̃(−z_3 x + y) + x.  (47)

Note that the fused multiply–accumulate can be adopted in the double-underlined parts.
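For illustration, the z_1 form of eqs. (37)–(39) can be applied to a vector pair as follows; a minimal Python sketch for the case c̃ ≥ |s̃|, with the function name chosen by us.

def corrected_rotation(c, s, x, y):
    """Sketch of eqs. (37)-(39): apply the Givens rotation with the
    corrected cosine (1 - s*z1) implicitly, for the case c >= |s|."""
    z1 = s / (1.0 + c)                 # eq. (37)
    x_new = s * (-z1 * x + y) + x      # eq. (38); FMA-friendly grouping
    y_new = -s * (z1 * y + x) + y      # eq. (39)
    return x_new, y_new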

Table 1: Experimental Environment
CPU:       Intel(R) Xeon(R) Silver 4116 @ 2.10GHz (2 CPUs)
RAM:       192 GB
OS:        Ubuntu 18.04 LTS
Compiler:  gfortran 7.4.0
Options:   -O3 -mtune=native -march=native
Software:  Lapack 3.8.0
Precision: single precision

5. Experiments

We checked whether the two-sided Jacobi method (arctan and Rutishauser versions) has a shorter computation time and a higher accuracy than the one-sided Jacobi method implemented in LAPACK [11] for small matrices. Table 1 shows the experimental environment. We used eight matrices for the comparison:
• A1 (dimension size: 500 × 500, an upper triangular matrix)
• A2 (dimension size: 1000 × 1000, an upper triangular matrix)
• A3 (dimension size: 1500 × 1500, an upper triangular matrix)
• A4 (dimension size: 2000 × 2000, an upper triangular matrix)
• A5 (dimension size: 500 × 500, an upper triangular matrix)
• A6 (dimension size: 1000 × 1000, an upper triangular matrix)
• A7 (dimension size: 1500 × 1500, an upper triangular matrix)
• A8 (dimension size: 2000 × 2000, an upper triangular matrix)

In A1, A2, A3, and A4, all elements are set to random numbers in [0, 1] generated by a uniform random number generator. In A5, A6, A7, and A8, all elements are set to 1. The two-sided Jacobi method (arctan and Rutishauser versions) showed a shorter computation time and a higher accuracy than the one-sided Jacobi method implemented in LAPACK [11] for the 500 × 500, 1000 × 1000, and 1500 × 1500 upper triangular matrices with elements generated by a uniform random number generator, and also for the 500 × 500 and 1000 × 1000 upper triangular matrices in which all elements are 1. The performance results are given in Tables 2 and 3.

6. Conclusion

We confirmed through experiments that the two-sided Jacobi method (arctan and Rutishauser versions) has a shorter computation time and a higher accuracy than the one-sided Jacobi method for small matrices. For future work, we plan to apply our two-sided Jacobi method to implement the modified Sakurai-Sugiura method [9].

Table 2: Comparison of Jacobi SVD Algorithms (I). Each cell lists ||U^T U − I||_F, ||V^T V − I||_F, ||A − UΣV^T||_F, and computation time [s].

      One-sided Jacobi                           Two-sided Jacobi (arctan)
A1    1.94*10^-4, 8.20*10^-5, 8.13*10^-4, 1.777      4.34*10^-5, 4.32*10^-5, 3.73*10^-4, 1.185
A2    5.89*10^-4, 1.73*10^-4, 2.56*10^-3, 15.116     8.91*10^-5, 8.67*10^-5, 1.06*10^-3, 9.711
A3    1.06*10^-3, 2.68*10^-4, 4.39*10^-3, 50.523     1.40*10^-4, 1.31*10^-4, 2.08*10^-3, 42.888
A4    1.47*10^-3, 4.15*10^-4, 7.21*10^-3, 120.362    1.95*10^-4, 1.74*10^-4, 3.13*10^-3, 159.974
A5    1.97*10^-4, 9.38*10^-5, 9.32*10^-4, 1.446      4.50*10^-5, 4.49*10^-5, 5.84*10^-4, 1.128
A6    5.35*10^-4, 2.08*10^-4, 1.94*10^-3, 12.709     9.14*10^-5, 9.15*10^-5, 1.73*10^-3, 9.467
A7    9.92*10^-4, 3.49*10^-4, 3.38*10^-3, 45.221     1.39*10^-4, 1.39*10^-4, 3.73*10^-3, 43.477
A8    1.69*10^-3, 5.17*10^-4, 5.43*10^-3, 110.136    1.85*10^-4, 1.84*10^-4, 5.88*10^-3, 162.029

Table 3: Comparison of Jacobi SVD Algorithms (II). Each cell lists ||U^T U − I||_F, ||V^T V − I||_F, ||A − UΣV^T||_F, and computation time [s].

      Two-sided Jacobi (Rutishauser)             Two-sided Jacobi (Givens rotation)
A1    4.33*10^-5, 4.31*10^-5, 3.77*10^-4, 1.134      4.34*10^-5, 4.33*10^-5, 3.60*10^-4, 1.126
A2    8.93*10^-5, 8.59*10^-5, 1.07*10^-3, 9.420      8.86*10^-5, 8.61*10^-5, 1.06*10^-3, 10.733
A3    1.39*10^-4, 1.29*10^-4, 2.01*10^-3, 41.058     1.40*10^-4, 1.29*10^-4, 2.11*10^-3, 47.813
A4    1.95*10^-4, 1.71*10^-4, 3.17*10^-3, 154.635    1.95*10^-4, 1.72*10^-4, 3.14*10^-3, 168.760
A5    4.52*10^-5, 4.51*10^-5, 5.71*10^-4, 1.127      4.48*10^-5, 4.51*10^-5, 5.72*10^-4, 1.156
A6    9.16*10^-5, 9.14*10^-5, 1.77*10^-3, 9.560      9.16*10^-5, 9.13*10^-5, 1.77*10^-3, 9.627
A7    1.39*10^-4, 1.38*10^-4, 3.52*10^-3, 43.005     1.39*10^-4, 1.38*10^-4, 3.40*10^-3, 43.303
A8    1.85*10^-4, 1.84*10^-4, 5.06*10^-3, 164.593    1.86*10^-4, 1.84*10^-4, 4.96*10^-3, 164.633

Acknowledgment

This work was supported by JSPS KAKENHI Grant Numbers JP17H02858 and JP17K00167.

References
[1] S. Araki, H. Tanaka, M. Takata, K. Kimura, and Y. Nakamura: Fast Computation Method of Column Space by using the DQDS Method and the OQDS Method, Proc. of PDPTA 2018, pp. 333–339 (2018).
[2] R. P. Brent, F. T. Luk, and C. van Loan: Computation of the singular value decomposition using mesh-connected processors, Journal of VLSI and Computer Systems, Vol. 1, pp. 242–270 (1985).
[3] J. Demmel: Applied Numerical Linear Algebra, SIAM, Philadelphia (1997).
[4] J. Demmel and K. Veselic: Jacobi's Method is More Accurate than QR, SIAM J. Matrix Anal. Appl., Vol. 13, No. 4, pp. 1204–1245 (1992).
[5] Z. Drmac and K. Veselic: New fast and accurate Jacobi SVD algorithm: I, SIAM J. Matrix Anal. Appl., Vol. 29, pp. 1322–1342 (2008).
[6] Z. Drmac and K. Veselic: New fast and accurate Jacobi SVD algorithm: II, SIAM J. Matrix Anal. Appl., Vol. 29, pp. 1343–1362 (2008).
[7] K. V. Fernando and B. N. Parlett: Accurate singular values and differential qd algorithms, Numer. Math., Vol. 67, pp. 191–229 (1994).
[8] G. E. Forsythe and P. Henrici: The cyclic Jacobi method for computing the principal values of a complex matrix, Transactions of the American Mathematical Society, Vol. 94, pp. 1–23 (1960).
[9] A. Imakura and T. Sakurai: Block Krylov-type complex moment-based eigensolvers for solving generalized eigenvalue problems, Numer. Alg., Vol. 75, pp. 413–433 (2017).
[10] E. Kogbetliantz: Solution of linear equations by diagonalization of coefficients matrix, Quarterly of Applied Mathematics, Vol. 13, pp. 123–132 (1955).
[11] Linear Algebra PACKage (LAPACK).
[12] U. von Matt: The orthogonal qd-algorithm, SIAM J. Sci. Comput., Vol. 18, pp. 1163–1186 (1997).
[13] B. N. Parlett and O. A. Marques: An Implementation of the dqds Algorithm (Positive Case), Lin. Alg. Appl., Vol. 309, No. 1–3, pp. 217–259 (2000).
[14] H. Rutishauser: The Jacobi Method for Real Symmetric Matrices, Numerische Mathematik, Vol. 9, No. 1, pp. 1–10 (1966).
[15] T. Sakurai and H. Tadano: CIRR: a Rayleigh–Ritz type method with contour integral for generalized eigenvalue problems, Hokkaido Math. J., Vol. 36, pp. 745–757 (2007).

A Study on the Effects of Elements of Games in Oddball Tasks for User's Motivation and Event-Related Potential

Tadashi KOIKE1, Tomohiro YOSHIKAWA1, and Takeshi FURUHASHI1
1 Graduate School of Engineering, Nagoya University, Nagoya, Japan

Abstract— P300 is one of the event-related potentials that arise when an infrequent stimulus appears. Oddball tasks are often used to measure P300. This research attempts to create a game-based oddball task so that users can enjoy the tasks. There are several elements in a game; this paper discusses the effects of score, feedback system, and background in oddball tasks.

Keywords: P300, Oddball Tasks, Score, Feedback System, Background

1. Introduction

The authors investigate the degree of dementia from several parameters: age, schooling history, the latency of P300, the task difficulty, and so on [1]. P300 is one of the event-related potentials (ERPs), arising 300 ms after an infrequent stimulus appears [2], [3]. In this paper, the latency of P300 is the time from the appearance of the stimulus to the peak of P300, and the amplitude of P300 is the potential value at that peak (Fig. 1).

Fig. 1: Latency and Amplitude of P300

Oddball tasks are often used to measure P300. These are tasks in which infrequent stimuli appear among frequent stimuli to induce P300. We can observe a user's P300 with an electroencephalograph (EEG) while he/she attempts oddball tasks. P300 is observed clearly when a user concentrates on the stimuli by counting the number of infrequent stimuli or pushing a button when an infrequent stimulus appears. The authors used an oddball task with yellow circles (Fig. 2), although there are various oddball tasks [4]. A large circle is a frequent stimulus (standard stimulus), and a small circle is an infrequent stimulus (target stimulus). There are three types of target stimuli, whose radius ratios are 90%, 70%, and 50% of a standard stimulus. Users are requested to push a button when a target stimulus appears. The task difficulty is the radius ratio of the target stimulus, because pushing the button is easier when the difference between the radii of the target and standard stimuli is larger.

Fig. 2: Stimulus in Oddball Tasks

However, conventional oddball tasks are too monotonous; hence, users often feel bored and the amplitude of P300 gets smaller [5], [6]. For this reason, this research attempts to create a game to measure P300. There are several elements of games (score, feedback system, BGM, sound effects, and background) [7], and it is difficult to discuss the effects of all elements at the same time. Thus, this paper discusses the effects of score, feedback system, and background, which are some of the many elements mentioned above.

2. Related Studies

2.1 A brief history of P300

The P300 was first reported about 55 years ago (Sutton et al., 1965) [8]. Its discovery arose from the confluence of increased technological capability for human neuroelectric measurements and the impact of information theory (Sutton et al., 1979) [9]. The term "P300" refers to the canonical ERP component, which is also called "P3"; the terms "P3a" and "P3b" denote its two subcomponents. The primary theoretical approaches to P300 have been refined in many previous reviews (Donchin and Coles, 1988; Hillyard

and Kutas, 1983; Hillyard and Picton, 1987; Johnson, 1986; Johnson, 1998; Molnar, 1994; Picton, 1992; Price and Smith, 1974; Verleger, 1988) [10]–[18]. P300 is known to be related to many factors. The latency of P300 gets longer with various factors such as target type, task difficulty, stimulation interval, low-frequency stimulation ratio, user's age, dementia, and so on. Previous studies reported that the latency of P300 in healthy adults becomes longer with aging and also depends on the task difficulty [19], [20]. We have investigated some of these factors by studying the relation between P300 and several such elements.

2.2 A brief history of the oddball task

The oddball task was first proposed about 45 years ago (Squires, N.K., Squires, K.C. & Hillyard, S.A., 1975) [21]. The task was developed to induce P300. For example, in a visual oddball task, there might be a 95% chance for a square to be presented to a user and a 5% chance for a circle. When the targets (e.g., circles) appear, the user must respond, such as by pushing a button or counting the number of target stimuli, to concentrate on the targets. The oddball task and its variants have been used in more than one thousand published studies (Herrmann & Knight, 2001; Picton, 1992) [22], [16]. Fig. 3 shows an example of a visual oddball task [23]. There are not only visual oddball tasks but also auditory oddball tasks [24].

Fig. 4: Oddball Task with Score

Fig. 5: Oddball Task with Feedback System

3.3 Oddball Task with Changed Background Fig. 3: Visual Oddball Task (Aamir Saeed Malik, 2017)

3. Proposed Tasks 3.1 Oddball Task with Score Score is introduced for users to have a goal. “Rank” is determined according to the total score, and the goal is “getting the best rank.” Score is calculated by subtracting the time until pushing a button from 1400 ms (the sum of stimulus-on time and stimulus-off time) as 1 point per 1 ms. For example, in the case of pushing a button 500 ms after a target stimulus appearing, the user gets 900 points (Fig.4). A user loses 700 points (=stimulus-off time) if he/she pushes a button after a standard stimulus appearing. Thus a user can get more points by pushing a button more quickly after only target stimulus.

3.2 Oddball Task with Feedback System Oddball task with feedback system has the time for a feedback (200 ms). The system indicates “circle” when a user correctly pushes a button after a target stimulus, on

In this paper, the oddball task with yellow circle and black background (Fig.2) is called the conventional background task, and that with yellow circle and game style background is called the proposed background task. The game style background uses “Super Mario Brothers ©1985 Nintendo” because aged people know the game well and the action of the character matches a game-based oddball task.

4. Outline of Experiment
4.1 Oddball Task with Score and Feedback System
This experiment was carried out to measure P300 with the conventional and the proposed oddball tasks. The subjects were 12 male college/graduate students. The wireless biosignal measurement device Polymate Mini (Miyuki Giken) was used as the electroencephalograph. The electrode was placed at "Pz," with "A1" and "A2" as the reference, according to the international 10-20 system [25]. The sampling frequency was 1000 Hz. The interstimulus interval was 1400 ms (stimulus-on time: 700 ms; stimulus-off time: 700 ms). One set comprised 22 target stimuli and 88 standard stimuli. A subject had three sets of the proposed tasks (see Fig. 2) and three sets of the conventional task, and the order differed between subjects. The evaluation method was a rating scale method to evaluate the user's interest; subjects answered the questionnaire (Fig. 6) after each set. Brainwaves were passed through a band-pass filter (1-5 Hz), and the 700 ms after stimulus onset were used. The latency and amplitude of P300 were evaluated from each user's arithmetic mean waveform. Arithmetic mean waveforms cancel noise by averaging the waveforms recorded under the same condition. The latency is expected to remain unchanged across tasks, whereas the amplitude is expected to be larger, because it increases when a user concentrates on a task [26] [27].
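A minimal sketch of this preprocessing (band-pass filtering and same-condition averaging), assuming SciPy and NumPy; the filter order is our assumption, as the paper states only the pass band:

import numpy as np
from scipy.signal import butter, filtfilt

FS = 1000.0  # sampling frequency [Hz], as in the experiment

# 1-5 Hz band-pass filter (the 4th order is our choice, not stated in the text)
b, a = butter(4, [1.0, 5.0], btype="bandpass", fs=FS)

def mean_waveform(epochs: np.ndarray) -> np.ndarray:
    """Arithmetic mean waveform: average same-condition epochs to cancel noise.

    epochs: shape (n_trials, 700), each row the 700 ms (700 samples at 1000 Hz)
            following a stimulus onset.
    """
    filtered = filtfilt(b, a, epochs, axis=1)  # zero-phase band-pass filtering
    return filtered.mean(axis=0)               # average across trials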

Fig. 6: Questionnaire used in the experiment

4.2 Oddball Task with Changed Background
This experiment was carried out to measure P300 with the conventional and the proposed background tasks. The subjects were 10 elderly people (79.5 years old on average). The interstimulus interval was 1000 ms (stimulus-on time: 500 ms; stimulus-off time: 500 ms). A subject had three sets (the difficulty 50 conventional task, the difficulty 70 conventional task, and the difficulty 70 proposed task), and the order of the sets was fixed. The other conditions were the same as in 4.1.

5. Result of Experiment
5.1 Result of the Rating Scale Method
5.1.1 Oddball Task with Score
The results are shown in Fig. 7. They show that the scores of "concentrating," "not tired," "enjoying," and "not boring" in the tasks with a score were better than those in the tasks without a score. The paired t-test for the four ratings indicates significant differences in the three ratings of "concentrating," "enjoying," and "not boring" (p = 7.31×10⁻⁴, p = 1.25×10⁻⁷, and p = 2.83×10⁻⁹, respectively). The significance level is 0.003 (= 0.05/18) using the Bonferroni method because there are 18 t-tests in this paper (12 in the results of the rating scale method and 6 in the results of brainwaves).

Fig. 7: Scores of all difficulties w/ and w/o score

5.1.2 Oddball Task with Feedback System
The results are shown in Fig. 8. They show that the scores of "concentrating," "enjoying," and "not boring" in the tasks with the feedback system were better than those in the tasks without it. The paired t-test for the four ratings indicates significant differences in the two ratings of "enjoying" and "not boring" (p = 1.44×10⁻⁵ and p = 2.00×10⁻³, respectively).

Fig. 8: Scores of all difficulties w/ and w/o feedback system

5.1.3 Oddball Task with Changed Background
The results are shown in Fig. 9. They show that the scores of "concentrating," "enjoying," and "not boring" in the proposed background task were better than those in the conventional task. The paired t-test for the four ratings indicates that there are no significant differences in any of the ratings (p = 2.9×10⁻², p = 0.69, p = 8.9×10⁻², and p = 0.55, respectively).

Fig. 9: Scores of difficulty 70 of proposed and conventional background tasks
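The paired t-tests with the Bonferroni-corrected level used throughout Section 5.1 could be computed as in the following sketch (assuming SciPy; the rating values are hypothetical, not the measured data):

from scipy import stats

with_score    = [5, 4, 5, 5, 4, 5, 4, 5, 5, 4, 5, 5]   # hypothetical per-subject ratings
without_score = [3, 2, 4, 3, 3, 2, 3, 3, 4, 2, 3, 3]

t, p = stats.ttest_rel(with_score, without_score)  # paired t-test
alpha = 0.05 / 18                                  # Bonferroni correction for 18 tests
print(f"p = {p:.3g}, significant: {p < alpha}")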

5.2 Result of Brainwaves

5.2.1 Oddball Task with Score
The latency and amplitude of P300 are shown in Fig. 10. There were no differences between the two tasks. The paired t-test for the two values (latency and amplitude over all difficulties) indicates no significant differences (p = 0.05 and p = 0.51, respectively).

Fig. 10: Latency and amplitude w/ and w/o score

5.2.2 Oddball Task with Feedback System
The latency and amplitude of P300 are shown in Fig. 11. There were no differences between the two tasks. The paired t-test for the two values (latency and amplitude over all difficulties) indicates no significant differences (p = 0.33 and p = 0.48, respectively).

Fig. 11: Latency and amplitude w/ and w/o feedback system

5.2.3 Oddball Task with Changed Background
The latency and amplitude of P300 are shown in Fig. 12. There were no differences between the two tasks. The paired t-test for the two values (latency and amplitude at difficulty 70) indicates no significant differences (p = 0.44 and p = 8.4×10⁻², respectively).

Fig. 12: Latency and amplitude with the proposed and the conventional background tasks

6. Examination
6.1 Examination of the Rating Scale Method
In the experiment with the oddball task with score, there were significant differences in the three ratings of "concentrating," "enjoying," and "not boring." In the experiment with the oddball task with feedback system, there were significant differences in the two ratings of "enjoying" and "not boring." Users can enjoy oddball tasks more, without feeling tired or bored, when the tasks have a "score" or a "feedback system," compared with the simple oddball task.


On the other hand, there were no significant differences in any of the ratings in the experiment with the changed background. The following discusses the reasoning behind the subjects' opinions. Almost all opinions were along the lines of "There is little difference between the two tasks." Moreover, some of the subjects did not know the game "Super Mario Brothers ©1985 Nintendo." In this experiment, the subjects were around 80 years old; if the subjects had been 60 years old or younger, the result might have been different, because they would be more likely to know the game. On the other hand, two subjects said that the proposed task was easier for them than the conventional task because their eyes were tired from looking at the yellow circle on the black background, which is unrelated to the game-style background itself. In addition, fixing the order of the tasks was a problem: the subjects felt more tired in the proposed tasks, which were done in the last set. A re-experiment is needed because of these problems.

6.2 Examination of Brainwaves
There were no differences in any comparison between the two tasks. Latency is related to how fast subjects recognize a target stimulus; whether a task has game elements or not has no relation to the stimulus itself, hence there were no differences in latency. Amplitude is related to the user's concentration. In the rating scale method with and without score, there was a significant difference in the rating of "concentrating," so the amplitude of the task with score was expected to be larger than that without score. However, the amplitudes of the two tasks were similar. Greater concentration might be needed to make the amplitude larger, which may be achievable by adopting all of the game elements.

7. Conclusion
This paper discussed the effects of "score," "feedback system," and "background" in oddball tasks. The following results were found. There were significant differences in the two ratings of "enjoying" and "not boring" in the rating scale method with "score" and with "feedback system." On the other hand, there were no significant differences in any of the ratings in the rating scale method with the changed "background"; a re-experiment is needed because there were some problems in the experiment on the effect of the background. There were no differences in the latency and amplitude of P300 in any of the tasks with game elements. In future research, a game-based oddball task that helps users enjoy the task will be created.

Acknowledgements
This study was supported by the Center of Innovation (COI) Program of the Japan Science and Technology Agency (JST).

References
[1] K. Miwa, T. Yoshikawa, T. Furuhashi, M. Hoshiyama, T. Makino, M. Yanagawa, Y. Suzuki, H. Umegaki and M. Kuzuya: "Study on estimation of MMSE score by using the latency of p300 and alpha wave", SCIS&ISIS2018 (2018).


[2] J. Polich: "Updating p300: an integrative theory of p3a and p3b", Clinical Neurophysiology, 118, 10, pp. 2128–2148 (2007).
[3] V. V. Ogryzko, R. L. Schiltz, V. Russanova, B. H. Howard and Y. Nakatani: "The transcriptional coactivators p300 and cbp are histone acetyltransferases", Cell, 87, 5, pp. 953–959 (1996).
[4] K. Takakura, T. Yoshikawa and T. Furuhashi: "A study on age dependency on start point of delay of p300 peak latency (in Japanese)", IEICE Technical Report, 115, 318, pp. 41–46 (2015).
[5] H. Nittono: "Measuring attention to video clips: An application of the probe stimulus technique using event-related brain potentials (in Japanese)", Physiological Psychology and Psychophysiology, 24, 1, pp. 5–18 (2006).
[6] J. P. Rosenfeld, K. Bhat, A. Miltenberger and M. Johnson: "Event-related potentials in the dual task paradigm: P300 discriminates engaging and non-engaging films when film-viewing is the primary task", International Journal of Psychophysiology, 12, 3, pp. 221–232 (1992).
[7] J. McGonigal: "Reality is broken: Why games make us better and how they can change the world", Penguin (2011).
[8] S. Sutton, M. Braren, J. Zubin and E. John: "Evoked-potential correlates of stimulus uncertainty", Science, 150, 3700, pp. 1187–1188 (1965).
[9] S. Sutton: "P300–thirteen years later", Evoked Brain Potentials and Behavior, Springer, pp. 107–126 (1979).
[10] E. Donchin and M. G. Coles: "Is the p300 component a manifestation of context updating?", Behavioral and Brain Sciences, 11, 3, pp. 357–374 (1988).
[11] S. A. Hillyard and M. Kutas: "Electrophysiology of cognitive processing", Annual Review of Psychology, 34, 1, pp. 33–61 (1983).
[12] S. A. Hillyard and T. W. Picton: "Electrophysiology of cognition", Comprehensive Physiology, pp. 519–584 (2011).
[13] R. Johnson: "A triarchic model of p300 amplitude", Psychophysiology (1986).
[14] R. Johnson Jr: "The amplitude of the p300 component of the event-related potential: Review and synthesis", Advances in Psychophysiology, 3, pp. 69–137 (1988).
[15] M. Molnár: "On the origin of the p3 event-related potential component", International Journal of Psychophysiology, 17, 2, pp. 129–144 (1994).
[16] T. W. Picton: "The p300 wave of the human event-related potential", Journal of Clinical Neurophysiology, 9, 4, pp. 456–479 (1992).
[17] R. L. Price and D. B. Smith: "The p3(00) wave of the averaged evoked potential: A bibliography", Physiological Psychology, 2, 3, pp. 387–391 (1974).
[18] R. Verleger: "Event-related potentials and cognition: A critique of the context updating hypothesis and an alternative interpretation of p3", Behavioral and Brain Sciences, 11, 3, pp. 343–356 (1988).
[19] D. S. Goodin, K. C. Squires, B. H. Henderson and A. Starr: "Age-related variations in evoked potentials to auditory stimuli in normal human subjects", Electroencephalography and Clinical Neurophysiology, 44, 4, pp. 447–458 (1978).
[20] Y. Sata, M. Inagaki, S. Shirane and M. Kaga: "Visual perception of Japanese characters and complicated figures", No To Hattatsu, 34, 4, pp. 300–306 (2002).
[21] N. K. Squires, K. C. Squires and S. A. Hillyard: "Two varieties of long-latency positive waves evoked by unpredictable auditory stimuli in man", Electroencephalography and Clinical Neurophysiology, 38, 4, pp. 387–401 (1975).
[22] C. S. Herrmann and R. T. Knight: "Mechanisms of human attention: event-related potentials and oscillations", Neuroscience & Biobehavioral Reviews, 25, 6, pp. 465–476 (2001).
[23] A. Malik and H. Amin: "Designing EEG Experiments for Studying the Brain: Design Code and Example Datasets", Elsevier Science (2017).
[24] B. Kotchoubey and S. Lang: "Event-related potentials in a semantic auditory oddball task in humans", Neuroscience Letters, 310, pp. 93–96 (2001).
[25] H. H. Jasper: "The ten-twenty electrode system of the international federation", Electroencephalography and Clinical Neurophysiology, 10, 2, pp. 371–375 (1958).
[26] Y. Ishikura, G. Ikeda, K. Akimoto, M. Hata, A. Kusumoto, A. Kidokoro, M. Kontani, H. Kawashima, Y. Kiso and Y. Koga: "Arachidonic acid supplementation decreases p300 latency and increases p300 amplitude of event-related potentials in healthy elderly men", Neuropsychobiology, 60, 2, pp. 73–79 (2009).


[27] C. C. Duncan-Johnson and E. Donchin: "On quantifying surprise: The variation of event-related potentials with subjective probability", Psychophysiology, 14, 5, pp. 456–467 (1977).


A Method to Acquire Multiple Satisfied Solutions

Tomohiro Yoshikawa 1, Kouki Maruyama 1
1 Graduate School of Engineering, Nagoya University, Nagoya, Japan

Abstract - In general, the main purpose of a Genetic Algorithm (GA) is to acquire the solution with the highest evaluation value in a single-objective problem, or Pareto solutions with various evaluation values in a multi-objective problem. However, in engineering problems, the acquisition of multiple satisfied solutions, which satisfy certain conditions, is often more strongly desired than acquiring a single best solution. In addition, to provide design choices, the satisfied solutions should have design variable patterns different from one another. In some problems, there are multiple objective functions that are intended to approximate certain target values rather than to be maximized/minimized. Such objective functions can be unified into a single objective function by summing up the errors from the target values. Through this unification of objectives, computing resources for searching can be assigned to diversity in the design variable space rather than in the objective space. Engineering problems also often involve many constraints; in such problems, the unification of objectives can likewise be applied to the constraints. In this paper, a method for acquiring multiple satisfied solutions by GA in many-constrained multi-objective optimization problems is proposed. The proposed method is applied to a real-world problem and compared with the Island model to investigate its performance. Keywords: Satisfied Solutions, Design Variable Patterns, Unification of Objectives, Genetic Algorithm, Island Model

1 Introduction

With the improvement of computer performance, Genetic Algorithms (GAs) have been actively applied to engineering problems [1-3]. GA is an optimization method that imitates the evolution of living creatures. In general, the main purpose of GA is to acquire the solution with the highest evaluation value in a single-objective problem, or Pareto solutions with various evaluation values in a multi-objective problem. In both cases, evaluation values have the highest priority, and the variety of individuals is considered in the objective space. However, in engineering problems, the acquisition of multiple satisfied solutions that satisfy certain conditions is often more strongly desired than acquiring a single best solution [4]. In addition, to provide design choices, the satisfied solutions should have design variable patterns different from one another.

Because of the characteristics of GA, when it is applied to the acquisition of satisfied solutions, once a satisfied solution is acquired, the population search is performed intensively very close to the acquired solution, because individuals differing only slightly from the satisfied solution are also likely to be satisfied solutions. As a result, many satisfied solutions that are very similar to the first one in the design variable space are often acquired, and these similar solutions usually have no practical meaning. Many methods that can maintain the diversity of design variables have been proposed [5-7]. However, these methods aim to prevent solutions from converging to local optima by maintaining the diversity of design variables, rather than to acquire various types of satisfied solutions. When GA is applied to multi-objective optimization problems, searches are performed to acquire various and uniformly distributed solutions in the objective space [8]. In this case, the diversity of design variables is generally not considered: various solutions are acquired in the objective space rather than in the design variable space. In general, different solutions in the objective space have different design variables, but there is no guarantee of this. In contrast, some problems have multiple objective functions that should approximate certain target values rather than be maximized/minimized. Such objective functions can be unified into a single objective function by summing up the errors from the target values. Through this unification of objectives, computing resources for searching can be assigned to diversity in the design variable space rather than in the objective space. Engineering problems often involve many constraints, in which case the acquisition of feasible solutions, i.e., solutions that satisfy all constraints, is required. The satisfaction of constraints has a high affinity with the approximation of certain evaluation values described above; acquiring various satisfied (feasible) solutions in the design variable space is expected to be achieved by applying this unification of objectives to the constraints. In this study, a method for acquiring multiple satisfied solutions in unified single-objective optimization problems using GA is proposed. To investigate the effectiveness of the proposed method, an experiment is conducted in which the proposed method is applied to a two-objective optimization problem with many constraints [9] and compared with the Island model [7], one of the most representative methods for maintaining the diversity of design variables.

2 Unification of objectives

2.1 Multi-objective optimization problems

There are multiple objective functions. When these functions are not to be maximized/minimized but rather approximated to certain target values, they can be unified into a single objective function by summing up the errors from the target values of each objective function. The formulas for this unification of objectives are shown in eqs. (1), (2), and (3).

$\min F = \sum_{i=1}^{m} \hat{f}_i$  (1)

$\hat{f}_i = \dfrac{\bar{f}_i}{f_i^{\max}} \quad (i = 1, 2, \ldots, m)$  (2)

$\bar{f}_i = \begin{cases} \max(|f_{ti} - f_i| - th_i,\ 0) & \text{(a)} \\ \max(f_{ti} - f_i,\ 0) & \text{(b)} \\ \max(f_i - f_{ti},\ 0) & \text{(c)} \end{cases}$  (3)

Here, $f_i^{\max}$ is the maximum value of $\bar{f}_i$ among all individuals. When a function aims to keep the error from the target value $f_{ti}$ within an allowable value $th_i$, (a) is selected as $\bar{f}_i$. When a function aims to obtain a value larger than the target $f_{ti}$, (b) is selected. When a function aims to obtain a value smaller than the target $f_{ti}$, (c) is selected. An individual for which $F$ equals 0 is a satisfied solution, which indicates that all functions satisfy the given conditions.
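A minimal Python sketch of this unification (our illustration; the mode strings are assumptions, not the authors' notation):

def unified_objective(f, f_t, modes, th, f_max):
    """Unify objective errors into a single value F (eqs. (1)-(3)).

    f     : evaluation values f_i of one individual
    f_t   : target values f_ti
    modes : per-objective case: "a" (keep |f_ti - f_i| within th_i),
            "b" (obtain a value above f_ti), "c" (obtain a value below f_ti)
    th    : allowable errors th_i (used only by case "a")
    f_max : normalization constants f_i^max
    """
    F = 0.0
    for fi, fti, mode, thi, fmax in zip(f, f_t, modes, th, f_max):
        if mode == "a":
            bar = max(abs(fti - fi) - thi, 0.0)
        elif mode == "b":
            bar = max(fti - fi, 0.0)
        else:  # "c"
            bar = max(fi - fti, 0.0)
        F += bar / fmax  # eq. (2) normalization, summed as in eq. (1)
    return F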

2.2 Many-constrained optimization problems

In optimization problems with many constraints, the unification below can be applied [10, 11].

$\min F = \sum_{i=1}^{m} \hat{f}_i + \sum_{i=1}^{l} \hat{g}_i$  (4)

$\hat{g}_i = \dfrac{g_i}{g_i^{\max}} \quad (i = 1, 2, \ldots, l)$  (5)

Here, $\hat{f}_i$ is the same as in eq. (2) of Section 2.1, $g_i$ is the amount by which the i-th constraint is violated, and $g_i^{\max}$ is the maximum value of all $g_i$. In this case, an individual for which $F$ equals 0 is a feasible and satisfied solution, as in constrained multi-objective optimization problems.
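Extending the sketch above to eq. (4), the normalized constraint violations are simply added to F (again our own illustration):

def unified_with_constraints(f_hat_sum, g, g_max):
    """Eq. (4): add normalized constraint violations (eq. (5)) to the unified objective.

    f_hat_sum : sum of the normalized objective errors from eqs. (1)-(3)
    g         : violation amounts g_i of each constraint (0 when satisfied)
    g_max     : maximum observed violation of each constraint
    """
    return f_hat_sum + sum(gi / gmax for gi, gmax in zip(g, g_max))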

3 Proposed method

To acquire multiple satisfied solutions, a method is proposed that has the features described below. The flow of the proposed method is illustrated in Fig. 1, which shows the minimization of f.

3.1 Flow of proposed method

First, initial individuals are generated randomly, and then "neighbors" are defined: when the distance d_xy between two individuals x and y in the design variable space is not greater than the neighbor range, which is input in advance, x and y are defined as mutual neighbors. After defining neighbors, one child (C) is generated, and the child's neighbors (C_n) are defined. When C_n contains at least one satisfied solution, the child is not evaluated, and the generation of a child is repeated. When C_n does not contain a satisfied solution, the child is evaluated. After the evaluation, the population is selected according to the flow (see Fig. 1). In the flow, |A| is the number of individuals in A, where A is a set of individuals. Furthermore, f(x) > f(y) denotes that y's evaluation value is better than that of x, and x ← y indicates that the information of x is updated with that of y. After the selection, the generation of a child is repeated, and this process continues until the end condition is satisfied.
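The following rough Python sketch shows our reading of this loop (the selection rules are simplified to the replace-the-worst-neighbor case; names and details are ours, not the authors' code):

import random

def search(population, neighbor_range, n_max, evaluate, distance, max_evals):
    """population: list of dicts {"x": design vector, "F": unified objective value};
    evaluate(x) returns F (0 means a satisfied solution);
    distance is measured in the design variable space (e.g., Manhattan).
    A real implementation would also cap the number of skipped generations."""
    evals = 0
    while evals < max_evals:
        x_child = generate_child(population)  # e.g., neighborhood crossover + mutation
        neighbors = [p for p in population
                     if distance(p["x"], x_child) <= neighbor_range]
        # If a satisfied solution already exists among the neighbors,
        # skip evaluation and spend the budget on other areas.
        if any(p["F"] == 0 for p in neighbors):
            continue
        child = {"x": x_child, "F": evaluate(x_child)}
        evals += 1
        if len(neighbors) > n_max:
            worst = max(neighbors, key=lambda p: p["F"])  # the worst neighbor (C_n^nad)
            if child["F"] < worst["F"]:
                population.remove(worst)  # the child replaces the worst neighbor
                population.append(child)
        else:
            population.append(child)
    return [p for p in population if p["F"] == 0]  # the acquired satisfied solutions

def generate_child(population):
    """Placeholder for crossover/mutation between nearby parents."""
    return random.choice(population)["x"]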

3.2 Features of proposed method

• Distributing computing resources dynamically: When a satisfied solution exists among the neighbors of a new child, the child is deleted without evaluation so that computing resources are assigned to searching other areas (see Fig. 2(a)).
• Sequential update: Like MOEA/D [12, 13], a good child with a high fitness value can become a parent immediately; thus, high convergence can be expected.
• Defining neighbors in the design variable space: Defining neighbors by the neighbor range in the design variable space enables a group search. Group search helps maintain diversity and sometimes yields high convergence [14] within each group. It also lets us adjust the granularity of the distance between the acquired satisfied solutions in the design variable space: when the neighbor range is large, the distance between satisfied solutions is expected to be large, and vice versa.
• Neighborhood crossover: High convergence can be expected because of neighborhood crossovers [15, 16].
• Mechanism for maintaining diversity in the design variable space: When the number of neighbors C_n of a new child (C) is greater than the maximum neighbor population (n_max) and f(C) is better than f(C_n^nad), the child replaces C_n^nad, and the information on the worst individual is updated (C_n^nad: the worst individual among the child's neighbors, see Fig. 2(b)). When the number of neighbors C_n of a new child (C) is less than n_max and there is an individual whose number of neighbors is greater than n_max, the child replaces the worst individual among that individual's neighbors, and the information on the worst individual is updated (see Fig. 2(c)).

Fig. 2: Mechanism for maintaining diversity: (a) Case (A), (b) Case (B), (c) Case (C)

4 Experiment

In this study, an experiment was conducted in which the proposed method was applied to a real-world engineering problem [9]. This is a constrained two-objective optimization problem comprising 222 design variables, 54 constraints, and two objective functions (f1 is minimized, and f2 is maximized). In constrained optimization problems, feasible solutions are defined as those that satisfy all constraints [17]. In this problem, satisfied solutions are further defined as those feasible solutions that satisfy certain conditions (evaluation values less than, or greater than, the target values). In the experiment, the conditions were set to f1 ≤ 3.0 and f2 ≥ 34, corresponding to cases (c) and (b) of eq. (3), respectively. These values are the evaluation values of the solution designed by a human in the benchmark problem [9].

4.1 Problem settings

In the experiment, searches using the Island model and the proposed method with the unification of the objective functions and constraints described in Section 2 were compared. In the proposed method, multiple groups are generated by defining neighbors based on the neighbor range, which provides a similar feature for maintaining diversity in the design variable space; thus, the Island model was chosen as the baseline for comparison.

4.2 Experimental conditions

In the searches using the Island model and the proposed method, the numbers of individuals, evaluations, and trials were 100, 30,000, and 21, respectively. The initial population for both methods was the same in every trial. In the search using the proposed method, the maximum neighbor population n_max was 15, the neighbor range was 22.2 using the Manhattan distance, and the crossover rate with neighbors was 0.7. In the search using the Island model, the number of islands was 5, 10, or 15. Based on the result of a pre-experiment, there was no migration.

4.3 Results

The results for the Island model are shown in Tables 2 to 4. Table 2 shows the number of islands that succeeded in acquiring satisfied solutions and the number of satisfied solutions. The number of satisfied solutions decreased as the number of islands increased (see Table 2), presumably because the number of individuals per island decreased as the number of islands increased; the number of islands that could acquire satisfied solutions thus decreased because of the low convergence caused by the small number of individuals. Table 3 shows the distance between satisfied solutions within each island for the trial whose number of acquired satisfied solutions was the median of the 21 trials when the number of islands was 5. In this trial, 4 islands acquired satisfied solutions, named "island 1" to "island 4." The distances between satisfied solutions within each island were very small, which shows that very similar satisfied solutions were acquired (see Table 3). It is thought that when a satisfied solution was acquired on a certain island, very similar solutions also became satisfied solutions and were acquired by the intensive search around the first one. In practice, these similar satisfied solutions can be regarded as a single satisfied solution; thus, in this trial, the substantial number of acquired satisfied solutions was 4, the number of islands that succeeded in acquiring satisfied solutions. In the Island model, because there is no mechanism for controlling the distance between islands, the distance between islands can become small and the diversity in the design variable space cannot always be maintained; this tendency was observed when migration was used in the pre-experiment. The distances between the islands that acquired satisfied solutions (each island regarded as one satisfied solution) are shown in Table 4; the distance between islands was calculated as the distance between the centers of gravity of the satisfied solutions in each island. The distance between islands was sufficiently large (see Table 4), and the influence of the number of islands on this distance was very small. In other words, regardless of the number of islands, the distance between islands was around 50 in Manhattan distance in the Island model without migration. Although it is possible to acquire satisfied solutions with the Island model, it is difficult to control the distance between islands, that is, between satisfied solutions, explicitly. The results for the proposed method are shown in Tables 5 and 6. Table 5 shows the number of satisfied solutions acquired by the proposed method: the smaller the neighbor range, the more satisfied solutions were acquired. Table 6 shows the distance between satisfied solutions; it was confirmed that the granularity of the distance between satisfied solutions is adjustable by changing the neighbor range. It was confirmed that diverse satisfied solutions were also acquired by the Island model without migration. However, because the Island model does not explicitly control the distance between islands, it is difficult to adjust the granularity of the satisfied solutions, while the proposed method can adjust it by changing the neighbor range. The evaluation values of the satisfied solutions are shown in Table 7 for the Island model and in Table 8 for the proposed method; there was no large difference between the two methods (see Tables 7 and 8).

Table 2: Number of islands which succeeded in acquiring satisfied solutions and number of satisfied solutions using Island model (no migration)
  Number of islands                                   5     10     15
  Islands succeeding in acquiring satisfied solutions 4     2.6    1.3
  Number of satisfied solutions                       80    25.7   8.4

Table 3: Distance between satisfied solutions in each island (Manhattan distance)
        island 1   island 2   island 3   island 4
  Min.    0.05       0.05       0.05       0.05
  Max.    1.70       1.45       2.35       1.50
  Ave.    0.61       0.41       0.75       0.71

Table 4: Distance between islands (Manhattan distance)
  Number of islands    5       10      15
  Min.               42.42   47.27   45.53
  Max.               52.18   50.69   48.26
  Ave.               48.15   49.04   47.05

Table 5: Number of satisfied solutions in the proposed method
  Neighbor range                  4.4     8.8    22.2
  Number of satisfied solutions  14.24    4.95    3.9

Table 6: Distance between satisfied solutions in the proposed method (Manhattan distance)
  Neighbor range    4.4     8.8     22.2
  Min.             5.03   16.66   43.69
  Max.            25.05   38.69   50.6
  Ave.            14.78   31.14   47.54

Table 7: Evaluation values of satisfied solutions in Island model
  Number of islands    5       10      15
  f1  Ave.           2.993   2.993   2.994
      Std.           0.004   0.004   0.002
  f2  Ave.           34.03   34.04   34.08
      Std.           0.12    0.12    0.14

Table 8: Evaluation values of satisfied solutions in the proposed method
  Neighbor range       4.4     8.8    22.2
  f1  Ave.           2.994   2.993   2.993
      Std.           0.005   0.005   0.004
  f2  Ave.           34.08   34.01   34.01
      Std.           0.20    0.02    0.02


5 Conclusion

In this study, the unification of objective functions in multi-objective optimization problems and many-constrained optimization problems was introduced, and a method for acquiring multiple satisfied solutions in the unified single-objective optimization problems was proposed. To investigate the effectiveness of the proposed method, an experiment was conducted on a two-objective optimization problem with 54 constraints, comparing the proposed method with the Island model. The results showed that both the Island model and the proposed method could acquire diverse satisfied solutions in the design variable space. They also showed that the proposed method could adjust the granularity of the distance between the acquired satisfied solutions in the design variable space, while the Island model could not. Although satisfied solutions were acquired in all trials, there was one trial in which only one satisfied solution was acquired. Because the purpose of this study is to acquire various satisfied solutions in engineering problems, multiple satisfied solutions should be acquired in all trials; a study of the appropriate neighbor range is therefore necessary. Further studies are also needed to make the proposed method more suitable for engineering problems.

6 References

[1] D. Dasgupta, Z. Michalewicz, "Evolutionary algorithms in engineering applications," Springer-Verlag Berlin Heidelberg, 1997.
[2] C. A. C. Coello, G. B. Lamont, "Applications of multi-objective evolutionary algorithms," World Scientific, 2004.
[3] K. Deb, "Optimization for engineering design: Algorithms and Examples," 2nd ed., PHI Learning Pvt. Ltd., 2012.
[4] K. Deb, "Evolutionary Algorithms for Multi-Criterion Optimization in Engineering Design," Evolutionary Algorithms in Engineering and Computer Science 2, pp. 135-161, 1999.
[5] L. Nguyen, L. Bui, H. Abbass, "A new niching method for the direction-based multi-objective evolutionary algorithm," IEEE Symposium on Computational Intelligence in Multi-Criteria Decision-Making, pp. 1-8, 2013.
[6] H. Toshio, N. Kenta, "Structural Morphogenesis by the Genetic Algorithms Considering Diversity of Solution," J. Struct. Constr. Eng., AIJ, No. 614, pp. 35-43, 2007.
[7] D. Whitley, S. Rana, R. B. Heckendorn, "Island model genetic algorithms and linearly separable problems," In: AISB International Workshop on Evolutionary Computing, Springer, Berlin, Heidelberg, pp. 109-125, 1997.
[8] H. Li, Q. Zhang, "Multiobjective Optimization Problems With Complicated Pareto Sets, MOEA/D and NSGA-II," IEEE Transactions on Evolutionary Computation, Vol. 13, No. 2, pp. 284-302, 2009.
[9] T. Kohira, H. Kemmotsu, A. Oyama, T. Tatsukawa, "Proposal of Simultaneous Design Optimization Benchmark Problem of Multiple Car Structures Using Response Surface Method," The Japanese Society for Evolutionary Computation, 2017. http://ladse.eng.isas.jaxa.jp/benchmark/index.html
[10] K. Deb, "An efficient constraint handling method for genetic algorithms," Computer Methods in Applied Mechanics and Engineering, Vol. 186, No. 2, pp. 311-338, 2000.
[11] E. Mezura-Montes, "Constraint-Handling in Evolutionary Optimization," Springer, 2009.
[12] Q. Zhang, H. Li, "MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition," IEEE Transactions on Evolutionary Computation, Vol. 11, No. 6, pp. 712-731, 2007.
[13] Q. Zhang, W. Liu, H. Li, "The performance of a New Version of MOEA/D on CEC09 Unconstrained MOP Test Instances," In Proceedings of the IEEE Congress on Evolutionary Computation, No. 1, pp. 203-208, 2009.
[14] D. Whitley, S. Rana, R. B. Heckendorn, "Island model genetic algorithms and linearly separable problems," AISB International Workshop on Evolutionary Computing, Vol. 1305, pp. 109-125, 1997.
[15] S. Kikuchi, T. Suzuki, "The Effect of Neighborhood Crossover in Evolutionary Search Methods for Landscape Photograph Geocoding Support," IPSJ SIG Technical Report, Vol. 2010-FI-98, No. 10, 2010.
[16] S. Watanabe, T. Hiroyasu, M. Miki, "NCGA: Neighborhood Cultivation Genetic Algorithm for Multi-Objective Optimization Problems," Proceedings of the Genetic and Evolutionary Computing Conference, pp. 458-465, 2002.
[17] E. Mezura-Montes, C. A. C. Coello, "Constraint-handling in nature-inspired numerical optimization: Past, present and future," Swarm and Evolutionary Computation, Vol. 1, No. 4, pp. 173-194, 2011.
[18] J. B. Kruskal, M. Wish, "Multidimensional scaling," Vol. 11, Sage, 1978.


Figure 1: Flow of the proposed method


Performance evaluation of MEGADOCK protein–protein interaction prediction system implemented with distributed containers on a cloud computing environment

Kento Aoyama 1,2, Yuki Yamamoto 1, Masahito Ohue 1, and Yutaka Akiyama 1
1 Department of Computer Science, School of Computing, Tokyo Institute of Technology, Tokyo, Japan
2 AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology, Ibaraki, Japan
Email: {aoyama, y_yamamoto}@bi.c.titech.ac.jp, {ohue, akiyama}@c.titech.ac.jp

Abstract— Container-based virtualization, a lightweight virtualization technology, has begun to be introduced into large-scale parallel computing environments. In the bioinformatics field, where various dependent libraries and software tools need to be combined, container technology, which isolates the software environment and enables rapid distribution in an immediately executable format, is expected to have many benefits. In this study, we employed Docker, an implementation of Linux containers, implemented a distributed computing environment for our protein–protein interaction prediction system MEGADOCK with virtual machine instances on the Microsoft Azure cloud computing environment, and evaluated its parallel performance. Both when MEGADOCK was executed directly on the virtual machines and when it was executed with Docker containers on the virtual machines, the execution speeds achieved were almost equal, even when the number of worker cores was increased to approximately 500. By standardizing portable and executable software environments, container techniques contribute greatly to improving the productivity and reproducibility of scientific research. Keywords: container-based virtualization, cloud computing, MPI, Docker, MEGADOCK, protein–protein interaction (PPI)

1. Introduction

In the field of bioinformatics and computational biology, various software tools are utilized for research activities. Management of software environments, such as dependent software libraries, is one of the most challenging issues in computational research. Recently, as a solution to the complication of software environments, the introduction of container-based virtualization, a lightweight virtualization approach with excellent performance, has been advancing [1], [2]. Particularly in the field of genome research, pipeline software systems consisting of multiple pieces of software are commonly used, which tend to complicate the environment. For this reason, case studies have been reported, including those on environment management and distributed processing using container-based virtualization technology [3], [4]. In container-based virtualization, a software execution environment, including dependent software libraries and execution binaries, is isolated as a container, and immediate software distribution in an executable format can be realized [5], [6]. This feature facilitates the management and distribution of software environments, making it easier to introduce new software. It has also been reported that container-based virtualization performs better than hypervisor-based virtualization, which is used to implement common virtual machines (VMs), and that, when properly configured, it performs almost as well as running on a physical machine [7]. Container-based virtualization has developed in areas such as dynamic load balancing in parallel distributed platforms on cloud environments because it enables rapid environment building and application abstraction [8]. Although its introduction has not advanced in the computing environments of research institutes and universities, due to concerns regarding performance degradation by virtualization, there is a tailwind from excellent benchmark results in parallel computing environments and reports of application research cases. Recently, container-based virtualization technology has begun to be adopted in supercomputing environments. As an example, the National Energy Research Scientific Computing Center (NERSC) in the United States, which operates the large-scale supercomputer Cori, developed open-source container-based virtualization software for high-performance computing called Shifter [9]. There are reports on the utilization of container-based virtualization in various application research from NERSC [10]. In addition, another container implementation, Singularity [11], is available on the TSUBAME 3.0 supercomputer of Tokyo Institute of Technology and on the AI Bridging Cloud Infrastructure (ABCI) of the National Institute of Advanced Industrial Science and Technology (AIST), both of which are first-tier supercomputing environments in Japan [12]. Given the above, support for container-based virtualization is urgently needed in the bioinformatics and computational biology fields.



Fig. 1: Overview of virtualization technologies: hypervisor-based (left) and container-based virtualization (right)

In this study, we focused on MEGADOCK [13], [14], protein–protein interaction (PPI) prediction software, as an example of bioinformatics software that can predict PPIs between various proteins by parallel computing. We introduced distributed processing in MEGADOCK using Docker containers [1], an implementation of container-based virtualization, and then evaluated its computational performance on the Microsoft Azure public cloud computing environment [15] by comparing it with a simple parallel implementation using the message passing interface (MPI) [16].

2. Overview of container-based virtualization

There are two major concepts of virtualization approaches in the context of applications running on a cloud environment: hypervisor-based and container-based.

2.1 Hypervisor-based virtualization

In hypervisor-based virtualization, the virtual environment is provided by a higher-level "hypervisor" that manages the OS (supervisor), which in turn manages the application (Fig. 1, left). The "virtual machine" (VM) widely used in general cloud environments is provided by hypervisor-based virtualization, which enables users to use various operating systems, such as Windows and Linux, as the guest OS, managed by a hypervisor running on the host OS or hardware. There are various implementations of hypervisor-based virtualization, such as Kernel Virtual Machine (KVM) [17], Hyper-V [18] used in Microsoft Azure, Xen [19] used in Amazon Web Services, and VMware [20].

2.2 Container-based virtualization

In container-based virtualization, containers are realized by isolating the namespaces of user processes running on the host OS (Fig. 1, right). This virtualization is mainly implemented by namespaces [5], one of the Linux kernel features. Namespaces can isolate user processes from the global namespaces into individual namespaces, providing separate namespaces for mount points, processes, users, networks, hostnames, etc. Therefore, users can work in an isolated application environment that is separate from the host environment. Container-based virtualization is sometimes called kernel-sharing virtualization because containers running on the same host use the same kernel. According to a previous study, the performance overheads of a container are smaller than those of VMs in various aspects because resource management in containers is under the direct control of the host kernel [7]. Moreover, the data size of container images tends to be smaller than that of VM images, which offers a significant advantage for application deployment.

2.3 Docker

Docker [1] is the most popular set of tools and platform for managing, deploying, and sharing Linux containers. It is open-source software hosted on GitHub, operated by the Moby project [21], written in Go, and contributed to by developers worldwide. There are several related toolsets and services in the Docker ecosystem, such as Docker Hub [6], the largest container image registry service for exchanging user-developed container images. Docker Machine [22] provides container environments on Windows and macOS using a combination of Docker and the hypervisor-based approach.

2.3.1 Sharing container images via Docker Hub

A container image can include all the dependencies necessary to execute the target application: code, runtime, system tools, system libraries, and configurations. Thus, it enables us to reproduce the same application environment in a container as built, and to deploy it onto machines with other specifications. Users can easily share their own application environments with each other by uploading (push) container images to Docker Hub [6], the largest container image registry service for Docker containers, and downloading (pull) the same container image onto a different machine environment (Fig. 2).

2.3.2 Filesystem of a Docker container image

Docker adopts a layered file system for container images to reduce their total file size. Every image layer corresponds to a batch of user operations or differential changes of the file system in the container, and each layer has a hashed ID. This layered file system has great benefits for reproducing operations, rollbacks, and reuse of container layers with the same hash ID; these contribute to the reproducibility of the application and also of the research. Note that such a layered file system does not show good performance in file I/O latency due to its differential layer management, such that we usually directly perform data I/O through a mount point where the target data are stored and attached to the container.

Fig. 2: Sharing a Docker container image via Docker Hub
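As a minimal sketch of this pull-and-run workflow from the client side, using the Docker SDK for Python (the image name and paths are hypothetical, and this is not the authors' tooling):

import docker  # Docker SDK for Python

client = docker.from_env()

# Pull a published container image from Docker Hub ...
client.images.pull("example/megadock", tag="latest")

# ... and run the same application environment on any Docker host,
# mounting a host directory for data I/O as discussed in 2.3.2.
logs = client.containers.run(
    "example/megadock:latest",
    "megadock -R data/receptor.pdb -L data/ligand.pdb",
    volumes={"/home/user/data": {"bind": "/opt/MEGADOCK/data", "mode": "rw"}},
    remove=True,
)
print(logs.decode())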

Fig. 3: A layered filesystem in a Docker container image built from a Dockerfile. Each instruction of the Dockerfile (FROM ubuntu:18.04; RUN apt update -y && apt install …; RUN wget http://…; RUN make) produces one image layer with its own hashed ID and size (129.5 MB, 40.8 MB, 42.5 MB, and 4.61 MB in the figure), and the built image is then executed as, e.g., > docker run -it megadock:latest megadock -R data/1gcq_r.pdb …

3. MEGADOCK

MEGADOCK is PPI prediction software for large-scale parallel computing environments, developed by Ohue et al. [13], [14]. MEGADOCK supports MPI, OpenMP, and GPU parallelization, and has achieved massively parallel computation on TSUBAME 2.5/3.0, the K computer, etc. The MPI-parallel implementation on the Microsoft Azure [15] public cloud (MEGADOCK-Azure [16]) as well as the predicted PPI database MEGADOCK-Web [23] have been developed to promote the use of this software in more general environments.

3.1 MEGADOCK-Azure [16]

MEGADOCK-Azure has three main functions: client, resource group management, and task distribution on Microsoft Azure VMs. Fig. 4 shows the system architecture of the parallel processing infrastructure on Microsoft Azure VMs using MEGADOCK-Azure. The client function uses the Azure command line interface (Azure CLI) to allocate the necessary compute resources on Microsoft Azure, including VMs and virtual networks. It can also transfer data (e.g., protein 3-D coordinate (pdb) files) and results to/from a representative VM (master VM).

Fig. 4: System architecture of MEGADOCK-Azure [16]

The resource group management function provides control over VMs, virtual networks, storage, etc.; they are allocated to the same resource group on Microsoft Azure so that they can communicate with each other. The task distribution function provides appropriate allocation of the PPI prediction calculations for each protein pair on the secured VMs. One VM is assigned as the master and the rest as workers; the master sends MEGADOCK calculations to the workers by specifying a pdb file, and the calculations are distributed in the master-worker model. Because MEGADOCK calculations can be performed independently for each protein pair, the system is implemented with hybrid parallelization: thread-level parallelization over the rotational angles of a protein pair and process-level parallelization over different protein pairs [24]. The master-worker model was implemented using the MPIDP framework [24], [25], such that the master and the workers communicate via MPI, while the workers do not communicate with each other. Each worker process executes its calculation in multiple threads using OpenMP (OpenMP and CUDA when using GPUs).
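A minimal mpi4py sketch of such a master-worker distribution (our illustration of the pattern, not the MPIDP code; the task contents are hypothetical):

from mpi4py import MPI

TAG_TASK, TAG_DONE, TAG_STOP = 1, 2, 3

def master(comm, tasks):
    """Hand out docking tasks (pdb pairs) to workers, one at a time."""
    status = MPI.Status()
    n_workers = comm.Get_size() - 1
    for task in tasks:
        comm.recv(source=MPI.ANY_SOURCE, tag=TAG_DONE, status=status)
        comm.send(task, dest=status.Get_source(), tag=TAG_TASK)
    for _ in range(n_workers):
        comm.recv(source=MPI.ANY_SOURCE, tag=TAG_DONE, status=status)
        comm.send(None, dest=status.Get_source(), tag=TAG_STOP)

def worker(comm):
    """Request a task, run it, and report back until told to stop."""
    status = MPI.Status()
    comm.send(None, dest=0, tag=TAG_DONE)  # announce readiness
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        run_docking(task)                  # one protein-pair docking calculation
        comm.send(None, dest=0, tag=TAG_DONE)

def run_docking(pair):
    pass  # placeholder: invoke the docking calculation for one pdb pair

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        master(comm, [("1gcq_r.pdb", "1gcq_l.pdb")])  # hypothetical task list
    else:
        worker(comm)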

3.2 MEGADOCK with container-based virtualization

The portability of software-dependent environments, their deployment/management problems, and the improvement of execution performance are still pressing issues. The use of a cloud computing environment, as in the case of the VM-based MEGADOCK-Azure, is one solution, but there remain concerns about the complexity of reusing existing local computing resources, vendor lock-in, and the performance overheads of hypervisor-based virtualization. In this study, we tried to solve these problems by using Docker containers. Introducing container techniques into MEGADOCK has the following advantages:

• Docker containers are able to run in almost all environments over various cloud computing infrastructures using the same container image, as well as in our local environments.


• The container-based virtualization approach generally shows superior performance to the hypervisor-based approach, both in running and in deploying.
• There are compatible container environments [9] available in several high-performance computing (HPC) environments, such as the TSUBAME 3.0 and ABCI supercomputers, such that it can even become a model for a standard application package in HPC environments.

To maintain compatibility between different environments, we implemented the MEGADOCK system using Docker containers running on VM instances of Microsoft Azure, referring to the MEGADOCK-Azure architecture. The system overview is shown in Fig. 5. The Docker container image used for the system is generated from a build recipe (Dockerfile) and can run the same MEGADOCK calculation as the VM images. The containers on different VM instances are connected through an overlay network using Docker networking functions and are able to communicate with each other using the MPI library. Thereby, we can run the MEGADOCK docking calculation on the cloud environment using containers, and the system remains compatible with other environments.

4. Performance evaluation

We present two experiments to evaluate the parallel performance of the container-distributed MEGADOCK system using Docker features.

4.1 Experiment 1. Parallel performance on a cloud environment

First, we measured the execution time and the parallel speed-up ratio of the distributed MEGADOCK system while changing the number of worker processor cores of the VM instances in Microsoft Azure, under the master-worker model on the MPIDP framework.

4.1.1 Experimental setup

We selected Standard_D14_v2, a high-end VM instance on Microsoft Azure. The specifications of the instance are listed in Table 1, and the software environment is shown in Table 2.

Table 1: Experiment 1. Specifications of Standard_D14_v2
  CPU        Intel Xeon E5-2673, 2.40 [GHz] × 16 [core]
  Memory     112 [GB]
  Local SSD  800 [GB]

Table 2: Experiment 1. Software environment
                  Virtual Machine                  Docker
  OS (image)      SUSE Linux Enterprise Server    library/ubuntu
  Version (tag)   12                              14.04
  Linux Kernel    3.12.43                         N/A
  GCC             4.8.3                           4.8.4
  FFTW            3.3.4                           3.3.5
  OpenMPI         1.10.2                          1.6.5
  Docker Engine   1.12.6                          N/A

Fig. 5: Overview of the overlay network (Swarm mode) using Docker Network: MPIDP master and worker processes in Docker containers on Standard_D14_v2 instances communicate via MPI and write to the local SSDs

The measured data were obtained from the output of the time command, and the median of 3 runs of the calculations was selected. To avoid slow data transfer between nodes, all output results of the docking calculations were written to the local SSDs attached to each VM instance. In the Docker container case, to avoid unnecessary performance degradation due to the layered file system of the container, all output files were stored in a data volume on the local SSD mounted inside the container. For the MPI and OpenMP configurations, the number of processes was selected such that every node had four processes, and the number of threads was fixed to four in all cases (OMP_NUM_THREADS=4). In the container case, each node ran one container for the MEGADOCK calculation, which performs the same as when it runs directly on a VM.

4.1.2 Dataset

Protein hetero-dimer complex structure data from the protein–protein docking benchmark version 1.0 [26] were used for performance evaluation. All 59 hetero-dimer protein complexes were used, and all-to-all combinations of each binding partner (59 × 59 = 3,481 pairs) were calculated to predict their possible PPIs.

4.1.3 Experiment result

The execution times of MEGADOCK running with Docker containers on the VM instances and MEGADOCK running directly on the VM instances (MEGADOCK-Azure) are shown in Fig. 6. Each bar shows the execution time for the given number of VM instances, and the error bars show the standard deviation of the measurements. Fig. 7 shows the strong-scaling scalability for the same results. The label "ideal" indicates ideal linear scaling with the number of worker cores. According to the scalability results, both achieved good speed-up up to 476 worker cores: ×35.5 when running directly on the VMs, and ×36.6 with the Docker containers on the VMs.

Fig. 6: Experiment 1: Execution time comparison between MEGADOCK directly running on VMs and with Docker containers (execution time [sec] vs. number of VMs; 16 cores per VM)

Fig. 7: Experiment 1: Strong-scaling performance of MEGADOCK (relative to VM = 1) on the benchmark dataset (speed-up vs. number of worker cores; ideal linear scaling shown for reference)

4.2 Experiment 2. Execution performance on a GPU-attached bare-metal node Additionally, to investigate the overhead of the containerbased virtualization, we measured the execution time of MEGADOCK running on a local node with various conditions and compared the results of performance. 4.2.1 Experimental setup We used a bare-metal (not a virtual machine) GPUattached node for this experiment. The specifications are

listed in Table 3, and environmental settings are shown in Table 4. Table 3: Experiment 2. Specifications of physical machine CPU Memory Local SSD GPU

Intel Xeon E5-1630, 3.7 [GHz] × 8 [core] 32 [GB] 128 [GB] NVIDIA Tesla K40

Table 4: Experiment 2. Software and environmental settings OS (image) Version (tag) Linux Kernel GCC FFTW OpenMPI Docker Engine NVCC NVIDIA Docker NVIDIA Driver

bare-metal CentOS 7.2.1511 3.10.0 4.8.5 3.3.5 1.10.0 1.12.3 8.0.44 1.0.0 rc.3 367.48

Docker library/ubuntu 14.04 N/A 4.8.4 3.3.5 1.6.5 N/A N/A N/A N/A

Docker (GPU) nvidia/cuda 8.0-devel N/A 4.8.4 3.3.5 N/A N/A 8.0.44 N/A 367.48

We used the physical machine and measured the execution time of the MEGADOCK calculation for 100 pairs of pdb data, for each of the following conditions (Fig. 8): (a) MEGADOCK using MPI library, (b) MEGADOCK using MPI library in a container, (c) MEGADOCK using GPU, and (d) MEGADOCK using GPU in a container.

ISBN: 1-60132-508-8, CSREA Press ©

180

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

a)

Bare-metal machine Process

Process

Process

c)

b)

Process

GPU

Docker Container Process

Process

Bare-metal machine

Bare-metal machine

Process

d)

Process Process

Bare-metal machine

Docker Container Process

GPU

Fig. 8: Experiment 2: Overview of experimental conditions: (a) MEGADOCK using MPI library, (b) MEGADOCK using MPI library in a container, (c) MEGADOCK using GPU, and (d) MEGADOCK using GPU in a container

We used NVIDIA Docker [27] to invoke NVIDIA GPU device from inside of the Docker container. All the result data of MEGADOCK calculation were output to the network-attached storage (NAS) on the same network in all the cases, and we used the “volume” option to mount the path to the NAS when using Dockers. The data were obtained by the time command to measure the duration from start to end of the execution, and we selected the median of six repeated runs. As it had the same configuration as that of Experiment 1, the number of Docker containers per node was one, the number of processes was four, and the number of threads per process was fixed as four. 4.2.2 Dataset PPI predictions were performed for 100 pairs of pdb data randomly obtained from the KEGG pathway [28]. 4.2.3 Experiment result 

4.2.2 Dataset
PPI predictions were performed for 100 pairs of PDB data randomly obtained from the KEGG pathway database [28].

4.2.3 Experiment result

(Fig. 9 plot: execution time [sec] of the CPU (MPI) and GPU versions, bare-metal versus Docker.)
Fig. 9: Experiment 2: Execution time comparison between MEGADOCK running on bare-metal and with Docker container

Fig. 9 shows the execution time of the MEGADOCK calculation for 100 pairs under each condition. The MEGADOCK calculation using the MPI library in the Docker container (b) was approximately 6.3% slower than the same calculation using the MPI library in the bare-metal environment (a). Note that this experiment was performed on a single node, so the communication cost was sufficiently small because there was no inter-node MPI communication. On the other hand, the MEGADOCK calculation using the GPU in the Docker container (d) performed almost the same as the calculation using the GPU in the bare-metal environment (c). This means that MEGADOCK-GPU, which does not use the MPI library, runs at the same execution speed even in a Docker container.

5. Discussion

In Experiment 1, the execution on a single VM instance with MEGADOCK-Azure (VM) took an unusually long time for reasons that remain unknown, and this affected the scalability results: the irregularity inflates the apparent scalability beyond what is expected. The cause should be investigated; however, this is difficult because the calculation is time-consuming. We did not attempt parallelization over multiple GPU nodes using the MPI library in this study; however, VM instances with attached NVIDIA GPU devices are now generally available. Moreover, Microsoft Azure offers more sophisticated VM instances for HPC applications that are interconnected by InfiniBand and support low-latency communication using remote direct memory access (RDMA). We have already performed experiments over multiple VM instances with GPUs [16] or HPC instances; further experiments with Docker containers are our future challenge. Additionally, an alternative approach to the task distribution of MEGADOCK is possible. In this study, we used the MPIDP framework, which uses the MPI library to realize dynamic task distribution over multiple nodes, because it is a built-in function of MEGADOCK; however, it could be replaced by another framework such as MapReduce. Moreover, our current implementation lacks the functionality to recover from unpredictable failures of the MPI processes, containers, VM instances, or hardware, so we should introduce a more fault-tolerant framework that provides auto-recovery from failures and redundant execution. We are considering container orchestration frameworks such as Kubernetes [29] and Apache Mesos [30] to resolve these issues.

6. Conclusion

We implemented the protein–protein interaction prediction system MEGADOCK using Docker containers and their networking functions on the VM instances of Microsoft Azure. Through a benchmark experiment of protein docking calculations, we confirmed that the performance is almost equivalent to


the same calculation performed directly on the VM instances. Whether MEGADOCK ran directly on the virtual machines or inside Docker containers on the virtual machines, the execution speed was almost equal even when the number of worker cores increased to approximately 500. In the second experiment, we performed the MEGADOCK calculation with MPI and with GPU on a single bare-metal machine, both in a Docker container and in the bare-metal environment, to investigate the performance overhead of container virtualization. The results showed a small performance degradation of approximately 6.3% for the MPI version in the container compared with bare-metal, whereas the GPU version performed almost equally in the container and on bare-metal. Containers enable us to isolate software dependencies and system software stacks, which offers a great advantage to users sharing software packages across platforms and makes it easy to distribute the latest research achievements. Virtualization technologies have evolved in the context of general cloud computing environments, and nowadays many research institutions have introduced container environments into their HPC infrastructures. To improve productivity and retain scientific reproducibility, it is necessary to introduce such software engineering techniques into research activities.

Code Availability
The entire source code of MEGADOCK is available as open source in the GitHub repository, which also contains a build recipe (Dockerfile) for building Docker container images to perform the PPI prediction calculations in various environments: https://github.com/akiyamalab/MEGADOCK

Acknowledgment This work was partially supported by KAKENHI (Grant No. 17H01814 and 18K18149) from the Japan Society for the Promotion of Science (JSPS), the Program for Building Regional Innovation Ecosystems “Program to Industrialize an Innovative Middle Molecule Drug Discovery Flow through Fusion of Computational Drug Design and Chemical Synthesis Technology” from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT), the Research Complex Program “Wellbeing Research Campus: Creating new values through technological and social innovation” from JST, Microsoft Business Investment Funding from Microsoft Corp., and Leave a Nest Co., Ltd.

References
[1] "Docker." https://www.docker.com/, [Accessed May 8, 2019].
[2] S. Graber, "LXC - Linux containers." https://linuxcontainers.org/, [Accessed May 8, 2019].


[3] P. Di Tommaso, E. Palumbo, M. Chatzou, et al. "The impact of Docker containers on the performance of genomic pipelines," PeerJ, vol. 3, no. e1273, 2015.
[4] P. Di Tommaso, A. B. Ramirez, et al. "Benchmark Report: Univa Grid Engine, Nextflow, and Docker for running Genomic Analysis Workflows," Univa White Paper, 2016.
[5] E. W. Biederman, "Multiple Instances of the Global Linux Namespaces," In Proc. the 2006 Ottawa Linux Symp., vol. 1, pp. 101–112, 2006.
[6] "Docker Hub." https://hub.docker.com/, [Accessed May 8, 2019].
[7] W. Felter, A. Ferreira, R. Rajamony, J. Rubio, "An updated performance comparison of virtual machines and Linux containers," In Proc. IEEE ISPASS 2015, pp. 171–172, 2015.
[8] L. M. Vaquero, L. Rodero-Merino, R. Buyya, "Dynamically scaling applications in the cloud," ACM SIGCOMM Computer Communication Review, vol. 41(1), pp. 45–52, 2011.
[9] W. Bhimji, S. Canon, D. Jacobsen, et al. "Shifter: Containers for HPC," J. Phys. Conf. Ser., vol. 898, no. 082021, 2017.
[10] D. Bard, "Using containers and supercomputers to solve the mysteries of the Universe," DockerCon 16, 2016.
[11] G. M. Kurtzer, V. Sochat, M. W. Bauer, "Singularity: Scientific containers for mobility of compute," PLoS One, vol. 12(5), pp. 1–20, 2017.
[12] "TOP500 Lists." https://www.top500.org/, [Accessed May 8, 2019].
[13] M. Ohue, T. Shimoda, S. Suzuki, et al. "MEGADOCK 4.0: An ultra-high-performance protein–protein docking software for heterogeneous supercomputers," Bioinformatics, vol. 30(22), pp. 3281–3283, 2014.
[14] M. Ohue, Y. Matsuzaki, N. Uchikoga, et al. "MEGADOCK: An all-to-all protein–protein interaction prediction system using tertiary structure data," Protein Pept. Lett., vol. 21(8), pp. 766–778, 2014.
[15] "Microsoft Azure." https://azure.microsoft.com/, [Accessed May 8, 2019].
[16] M. Ohue, Y. Yamamoto, Y. Akiyama, "Parallel computing of protein-protein interaction prediction system MEGADOCK on Microsoft Azure," IPSJ Tech. Rep., vol. 2017-BIO-49, no. 4, 2017.
[17] A. Kivity, U. Lublin, A. Liguori, et al. "kvm: the Linux virtual machine monitor," Proc. the Linux Symp., vol. 1, pp. 225–230, 2007.
[18] A. Velte, T. Velte, "Microsoft Virtualization with Hyper-V," 2010.
[19] "Xen Project." https://www.xen.org, [Accessed May 8, 2019].
[20] "VMware - Virtualization Overview." https://www.vmware.com/pdf/virtualization.pdf, [Accessed May 8, 2019].
[21] "Moby project." https://mobyproject.org, [Accessed May 8, 2019].
[22] "Docker Machine." https://docs.docker.com/machine/, [Accessed May 8, 2019].
[23] T. Hayashi, Y. Matsuzaki, K. Yanagisawa, et al. "MEGADOCK-Web: an integrated database of high-throughput structure-based protein-protein interaction predictions," BMC Bioinform., vol. 19(Suppl 4), no. 62, 2018.
[24] Y. Matsuzaki, N. Uchikoga, M. Ohue, et al. "MEGADOCK 3.0: a high-performance protein–protein interaction prediction software using hybrid parallel computing for petascale supercomputing environments," Source Code Biol. Med., vol. 8(1), no. 18, 2013.
[25] M. Kakuta, S. Suzuki, K. Izawa, et al. "A massively parallel sequence similarity search for metagenomic sequencing data," Int. J. Mol. Sci., vol. 18(10), no. 2124, 2017.
[26] R. Chen, J. Mintseris, J. Janin, Z. Weng, "A protein-protein docking benchmark," Proteins, vol. 52(1), pp. 88–91, 2003.
[27] "NVIDIA - nvidia-docker." https://github.com/NVIDIA/nvidia-docker, [Accessed May 8, 2019].
[28] M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, K. Morishima, "KEGG: new perspectives on genomes, pathways, diseases and drugs," Nucl. Acids Res., vol. 45(D1), pp. D353–D361, 2017.
[29] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, J. Wilkes, "Borg, Omega, and Kubernetes," Queue, vol. 14, no. 1, 2016.
[30] B. Hindman, A. Konwinski, M. Zaharia, et al. "Mesos: A Platform for Fine-grained Resource Sharing in the Data Center," In Proc. the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11), pp. 295–308, 2011.


Structure of Neural Network Automatically Generating Fonts for Early-Modern Japanese Printed Books

Y. Takemoto1, Y. Ishikawa2, M. Takata1, and K. Joe1
1 Nara Women's University, 2 Shiga University

Abstract – A huge amount of learning data is required for multi-font character recognition of early-modern Japanese printed books; however, there is a limit to conventional manual character-image collection. Therefore, we have proposed a method to automatically generate character images of a specific font of early-modern Japanese printed books with deep learning. In order to improve the accuracy of the neural network that generates fonts automatically, it is necessary to find the most suitable structure of the neural network. In this paper, we perform experiments combining various structures and parameters, and find a combination for which the reproduction accuracy of the generated character images is optimal.

Keywords: Font Generation, Deep Learning, Convolution Neural Network, and Early-Modern Japanese Printed Books

1 Introduction

With the spread of personal computers and smartphones, the Internet has made our everyday life convenient. Information from all over the world can be obtained instantly by searching the Internet, and online shopping allows us to buy various products without going to the stores directly. Such services are available to anyone connected to the Internet, regardless of location or time. The National Diet Library [1] provides a service that allows us to browse image data of its collection. In this service, valuable books that are difficult to read in the library are also made public. Some of these books are called early-modern Japanese printed books, published from the Meiji to the early Showa eras. However, browsing the image data of the books has a drawback: text search of the document contents is not available, so it is necessary to read through the pages of a book to find where the needed information is described. To improve the convenience of the service, the image data of the books should be quickly converted into text. Characters are usually converted from image data by hand, but the number of books made public on the Internet is huge, so manual text conversion incurs a large cost. Therefore, it is

necessary to convert image data into text automatically by computer. At present, most books can be automatically converted into text using optical character recognition (OCR); however, early-modern books cannot be accurately recognized by commercially available OCR software. Therefore, we have proposed a multi-font character recognition method for early-modern Japanese printed books [2][3][4]. The accuracy of character recognition is influenced by the amount of training data: for more accurate character recognition, it is necessary to prepare enough character images from early-modern books. At present, the main method of collecting learning data is manual extraction of character images from the image data, so a more efficient collection method is required. As a new way of collecting character images, we have proposed a method that automatically generates, using deep learning [5], character images in the font of a specific publisher and publication age of early-modern Japanese printed books [6]. In this method, we constructed a neural network (NN) consisting of four convolution layers and four deconvolution layers. In an experiment of automatic font generation using this NN, we generated character images in the font of the publisher Shinshindo in the middle of the Meiji period from Gothic-font character images. The results showed that the generated images can be used as learning data for the multi-font character recognition method; the average matching rate of pixel values with the newly generated Shinshindo character images was 73.69%. In deep learning, the structure and parameter set of the NN greatly affect the accuracy of the generated images. Therefore, in this paper, we conduct experiments to find the structure of the NN suitable for font generation for early-modern Japanese printed books, aiming to improve the accuracy of automatically generated character images. In the rest of the paper, we show the structure and parameter set of deep learning NNs more suitable for font generation. Section 2 introduces the NN that automatically generates fonts for early-modern Japanese printed books. Section 3 describes the experimental method for finding the structure and parameters of the NN suitable for automatic font generation, and section 4 shows the results.


2 Automatic font generation

2.1 The method

Table. 1 Result of automatic font generation with training data (columns: character images of the Gothic font; character images from the early-modern book; automatically generated character images)

In this section, we show the method that automatically generates character images of a particular font of early-modern Japanese printed books with deep learning. In this method, we consider that a character image has character-specific features and font-specific features, and we aim at converting only the font-specific features while keeping the character-specific features. The character-specific features indicate the positions of the dots and lines which make up a character; the font-specific features indicate the thickness and shape of those dots and lines. When character images of a modern font are input into the constructed NN, character images converted into the particular font of early-modern Japanese printed books are output. The NN consists of 4 convolutional layers and 4 deconvolutional layers with a filter size of 3x3; to unify the image sizes between input and output, only the final deconvolutional layer has a filter size of 4x4. The numbers of output channels in the convolutional layers are 128, 256, 512, and 1,024. The width of padding is set to 1, and the stride is set to 2; by setting the stride to 2, the layers also provide the functions of pooling and unpooling. As the activation function, we use the ReLU function f(x) = max(0, x). Dropout [7] is applied to the four convolutional layers to prevent overfitting, with the dropout rate set to 0.7.
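As a concrete reference, here is a minimal sketch of this encoder–decoder network, assuming PyTorch (the paper does not name its framework), grayscale 64x64 inputs, a sigmoid output to keep pixel values in [0, 1], and output_padding=1 on the 3x3 deconvolutions so that stride-2 upsampling exactly doubles the spatial size; these details are our assumptions:

```python
import torch
import torch.nn as nn

class FontGenNet(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [1, 128, 256, 512, 1024]
        enc = []
        for cin, cout in zip(chs[:-1], chs[1:]):
            enc += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),  # stride 2 doubles as pooling
                    nn.ReLU(inplace=True),
                    nn.Dropout2d(p=0.7)]                           # dropout on each conv layer
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1024, 512, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(512, 256, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 1, 4, stride=2, padding=1),    # final 4x4 filter restores 64x64
            nn.Sigmoid())                                          # assumed output activation

    def forward(self, x):
        return self.decoder(self.encoder(x))

# A batch of Gothic-font character images (N, 1, 64, 64) maps to images
# of the same size in the target early-modern font.
out = FontGenNet()(torch.randn(2, 1, 64, 64))
assert out.shape == (2, 1, 64, 64)
```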

2.2 Experiment and result

We performed an experiment of automatic font generation with 1,297 data sets [6]; we review the experiment and its results in this subsection. The character images of the early-modern Japanese printed book in the data sets are all extracted from books published by Shinshindo in the middle of the Meiji era. 1,200 sets are used as training data, and the remaining 97 sets are used as test data. Table. 1 shows the results of automatic font generation using training data, and Table. 2 shows the results using test data. The left column of each table is the Gothic character image, the middle column is the character image from the early-modern book, and the right column is the automatically generated character image. Comparing the three kinds of character images shows that the Gothic font used for the input image is converted into a font close to that of the target early-modern Japanese printed book. The average matching rate of pixel values between the automatically generated character images and the character images of the early-modern Japanese printed book is 99.79% when generated from training data and 73.69% from test data. The average matching rate for test data is thus much lower than that for training data. Therefore, we compared the PDC features of the character images automatically generated from test data with those of the character images of the early-modern Japanese printed book.


Table. 2 Result of automatic font generation with test data (columns: character images of the Gothic font; character images from the early-modern book; automatically generated character images)

Table. 3 Examples of character images that failed font conversion (columns: character images from the early-modern book; automatically generated character images)

When the PDC features of an automatically generated character image and of the corresponding character image of the early-modern Japanese printed book are similar, the automatically generated character image can be used as learning data for the multi-font character recognition method. From this viewpoint, we showed that the proposed method automatically generates character images that reproduce the font of an early-modern Japanese printed book, and that the learning data of the multi-font character recognition method can be complemented by the generated images. Although we succeeded in reproducing the font of the early-modern Japanese printed book used for the learning data, the font conversion is sometimes insufficient for unknown characters, making them unsuitable for learning, and the character composition may be broken. Table. 3 shows examples of character images for which the font conversion failed. The automatically generated character images in the table not only fail to convert the font but also lose the features that make up the character. To perform more accurate font conversion, it is necessary to improve the structure of the NN that automatically generates fonts. The details are described in section 3.

3 Improvement Methodologies

We consider the following 5 factors in the structure of the NN that affect the accuracy of generated character images:

I. Number of convolutional layers
II. Number of output channels


III. Filter size
IV. Kind of pooling
V. Number of pooling layers

As the number of convolutional layers increases, the features of the character image can be learned more deeply. As the number of output channels in each convolutional layer increases, the number of features that can be extracted from character images increases. As the filter size grows, features of larger local areas can be extracted from the character image. The larger these parameters, the more features can be learned from the training data; however, when too many features are learned from the training data, the accuracy for unknown data decreases. Therefore, we should find, by experiment, the combination of structure and parameters that gives high accuracy for test data. Pooling is performed to correct the positional deviation of character images. As pooling is repeated, the size of the image is reduced and fine details of the image are lost. Also, the pixel values of the image output from the pooling layer change depending on the kind of pooling. Therefore, we consider that there is a kind of pooling suitable for font conversion. In order to find the structure of the NN suitable for automatic font generation, we perform several experiments by changing the parameters of the above five factors. The parameter values investigated in the experiments are as follows (a sketch enumerating the resulting search space follows the list):

I. Number of convolutional layers: 2, 3, 4
II. Number of output channels in the first layer: 8, 16, 32, 64, 128
III. Filter size: 3x3, 5x5, 7x7, …
IV. Kind of pooling: max pooling, average pooling
V. Number of pooling layers: all layers, outer 1 layer, outer 2 layers, …
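As a rough sketch of the resulting search space (the names and the enumeration itself are ours, not the paper's):

```python
from itertools import product

layer_counts   = [2, 3, 4]
first_channels = [8, 16, 32, 64, 128]
filter_sizes   = [3, 5, 7, 9, 11, 13]        # 3x3, 5x5, ... (upper bound assumed)
pool_kinds     = ["max", "average"]

def pooling_patterns(n_layers):
    # Pooling in both layers when n=2; otherwise all layers, outer 1, outer 2, ...
    if n_layers == 2:
        return ["all layers"]
    return [f"outer {k} layer(s)" for k in range(1, n_layers)] + ["all layers"]

configs = []
for n, c0, k, kind in product(layer_counts, first_channels, filter_sizes, pool_kinds):
    channels = [c0 * 2 ** i for i in range(n)]   # channel count doubles per layer
    for pattern in pooling_patterns(n):
        configs.append({"layers": n, "channels": channels,
                        "filter": f"{k}x{k}", "pooling": (kind, pattern)})
print(len(configs), "candidate structures")
```

Each candidate structure is then trained and scored by the average pixel-value matching rate on the test data, as described below.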

We automatically generate fonts with 2, 3, and 4 convolutional layers. The number of output channels in the first layer starts from 8 with a maximum of 128; for the second and subsequent layers, the number of channels doubles at each layer. The filter size is increased by 2 starting from 3x3. The pooling kinds are max pooling and average pooling. In the case of 2 convolutional layers, pooling is performed in both layers; in the case of 3 layers, in all layers, the outer 2 layers, or the outer 1 layer; and in the case of 4 layers, in all layers, the outer 3 layers, the outer 2 layers, or the outer 1 layer. We now explain the experimental method of automatic font generation. The font of the input character images is Gothic. The character images of the early-modern Japanese printed book that we use for learning data are extracted from the books published by Shinshindo in the middle of the Meiji era. We prepare 500 data sets: 450 data sets are used as training data and the remaining 50 as test data. The image size is standardized to 64x64. The number of learning epochs is 5,000 and the mini-batch size is 50. We adjust the learning rate appropriately so that the average pixel-value matching rate between the images automatically generated from the test data and the character images of the early-modern Japanese printed book increases while suppressing overfitting.

Fig. 1 Result of font generation with max pooling and 2 convolution layers

4 Experiment Results

We show the results of automatic font generation by combining various parameters. Subsection 4.1 presents the results when max pooling is used, and subsection 4.2 the results when average pooling is used. Subsection 4.3 shows the results of automatic font generation using the combination of parameters most suitable for automatic font generation, obtained from the results of subsections 4.1 and 4.2.

4.1 Automatic font generation with max pooling

First, we show the result of automatic font generation with 2 convolution layers. Fig. 1 shows the average matching rate of the pixel values between the generated image and the character image of the early-modern Japanese printed book for each parameter with 2 convolutional layers. The vertical axis represents the numbers of output channels in the convolutional layers, and the horizontal axis represents the average matching rate of pixel values; the bar graph is the average matching rate for training data, and the line graph is that for test data. The larger the number of output channels, the higher the average matching rate of pixel values for training data and the lower that for test data. As the filter size increases, the average matching rate of pixel values for test data increases up to 9x9, but it changes little beyond 11x11.


Fig. 2 Result of font generation with max pooling and 3 convolution layers

Next, we show the result of automatic font generation with 3 convolution layers. Fig. 2 shows the average matching rate of the pixel values between the generated image and the character image of the early-modern Japanese printed book for each parameter with 3 convolution layers. As in the case of 2 convolutional layers, when the number of output channels increases, the average matching rate of pixel values for training data becomes high and that for test data becomes low. However, in the case of pooling in all layers and in the case of pooling in the outer 2 layers, the average matching rate of the pixel values for the test data is high when the numbers of output channels are (128, 256, 512). Also, as the filter size increases, the average matching rate of pixel values for training data increases as the number of pooling layers increases, and that for test data increases as the number of pooling layers decreases. From this, we consider that the more pooling layers there are, the more features of the training data are learned, making it easy to fall into overfitting.

Finally, we show the result of automatic font generation with 4 convolution layers. Fig. 3 shows the average matching rate of the pixel values between the generated image and the character image of the early-modern Japanese printed book for each parameter with 4 convolution layers. The larger the number of output channels, the higher the average matching rate of pixel values for training data. The average matching rate of pixel values for test data is higher with fewer output channels in the case of pooling in the outer 1 layer, but higher with more output channels in the case of pooling in the outer 2 or more layers. Also, the average matching rate of pixel values for training data increases as the number of pooling layers increases, and that for test data increases as the number of pooling layers decreases; we consider the reason to be the same as in the case of 3 convolutional layers.

Fig. 3 Result of font generation with max pooling and 4 convolution layers

Table. 4 Images generated from test data for each number of output channels in the first layer (8, 16, 32, 64, 128) with 2, 3, and 4 convolutional layers

In all cases, as the number of output channels increases, the average matching rate of pixel values for test data decreases. However, looking at the character images generated from the test data for each number of output channels shown in Table. 4, the boundary of the character is clearer and the accuracy of the generated image is higher as the number of output channels increases. We consider that the binarization performed when comparing pixel values causes the average matching rate to come out higher for generated images of lower accuracy: the character images of the early-modern Japanese printed book have distortions of shape and ink blurring and rubbing, so even if the features of the font are reproduced, the pixel values do not always match. We infer that the average matching rate of the pixel values becomes high in such cases because gray pixels change to black pixels when binarized.
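The metric itself is simple; a minimal sketch of our reading of it (the binarization threshold is an assumption, since the paper does not state one):

```python
import numpy as np

def pixel_match_rate(generated, reference, threshold=128):
    """Fraction of pixels that agree after binarizing both 8-bit grayscale
    images at the given (assumed) threshold."""
    g = np.asarray(generated) >= threshold
    r = np.asarray(reference) >= threshold
    return float((g == r).mean())

# Averaging this rate over all image pairs gives the "average matching
# rate of pixel values" reported in the figures and tables.
```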


Table. 6 Best combination of parameters for each number of convolution layers with max pooling

                                     2 conv layers    3 conv layers      4 conv layers
Number of output channels            (128, 256)       (128, 256, 512)    (128, 256, 512, 1024)
Filter size                          13x13            7x7                5x5
Number of pooling layers             All layers       Outer 2 layers     Outer 2 layers
Avg. matching rate (training data)   98.15%           99.86%             100%
Avg. matching rate (test data)       75.97%           77.75%             78.41%
(The generated images from training and test data are omitted here.)

Fig. 4 Result of font generation with average pooling and 2 convolution layers

Table. 5 Character images of the dataset (columns: character images of the Gothic font; character images from the early-modern book)

Table. 6 shows the combination of parameters with the best pixel-value matching rate and the automatically generated images for each number of convolutional layers. The average matching rate of pixel values for the training data is at or near 100%. The average matching rate of pixel values for test data is higher as the number of convolutional layers increases. Compared with the character images of the data set shown in Table. 5, the generated images reproduce the font of the early-modern Japanese printed book more clearly as the number of convolution layers increases. From this, we find that the more convolution layers, the higher the font reproduction accuracy.

4.2 Automatic font generation with average pooling

Fig. 5 Result of font generation with average pooling and 3 convolution layers

First, we show the result of automatic font generation with 2 convolution layers. Fig. 4 shows the average matching rate of the pixel values of the generated image and the character image of the early-modern Japanese printed book for each parameter with 2 convolution layers. As the number of output channels increases, the average matching rate of pixel values to training data increases, and that to test data does not change so much. As the filter size increases, the average matching rate of pixel values increases for both training data and test data.

Next, we show the result of automatic font generation with 3 convolution layers. Fig. 5 shows the average matching rate of the pixel values between the generated image and the character image of the early-modern Japanese printed book for each parameter with 3 convolution layers. As the number of output channels increases, the average matching rate of pixel values for training data increases and that for test data decreases. When comparing filter sizes, the average matching rate of pixel values increases significantly between 3x3 and 5x5 for both training data and test data, but changes little between 5x5 and 7x7. The larger the number of pooling layers, the higher the average matching rate of pixel values for training data and the lower that for test data.

Finally, we show the result of automatic font generation with 4 convolution layers. Fig. 6 shows the average matching rate of the pixel values between the generated image and the character image of the early-modern Japanese printed book for each parameter with 4 convolution layers. As the number of output channels increases, the average matching rate of pixel values for training data increases. The average matching rate of the pixel values for the test data is higher with fewer output channels when the number of pooling layers is small, and higher with more output channels when the number of pooling layers is large. When comparing the filter sizes, the average matching rate of pixel values for training data increases as the filter size increases; that for test data may increase from 3x3 to 5x5 but may decrease from 5x5 to 7x7, which we attribute to overfitting. As the number of pooling layers increases, the average matching rate of pixel values for training data decreases, but that for test data increases.

Fig. 6 Result of font generation with average pooling and 4 convolution layers

As with max pooling, in the case of 3 or 4 convolutional layers the average matching rate of pixel values for test data decreases as the number of output channels increases. However, looking at the character images generated from the test data for each number of output channels shown in Table. 8, the boundary of the characters becomes clearer as the number of output channels increases. Even with average pooling, then, the more unclear the boundary, the higher the matching rate of pixel values after binarization.

Table. 8 Images generated from test data for each number of output channels in the first layer (8, 16, 32, 64, 128) with 3 and 4 convolutional layers

Table. 7 shows the combination of parameters with the best pixel-value matching rate and the automatically generated images for each number of convolutional layers. The average matching rate of pixel values for both training data and test data is higher as the number of convolutional layers increases. Similar to max pooling, the generated images reproduce the font of the early-modern Japanese printed book more clearly as the number of convolution layers increases.

Table. 7 Best combination of parameters for each number of convolution layers with average pooling

                                     2 conv layers    3 conv layers      4 conv layers
Number of output channels            (128, 256)       (128, 256, 512)    (128, 256, 512, 1024)
Filter size                          7x7              7x7                7x7
Number of pooling layers             All layers       Outer 2 layers     Outer 2 layers
Avg. matching rate (training data)   82.48%           99.46%             99.97%
Avg. matching rate (test data)       77.16%           77.43%             78.09%
(The generated images from training and test data are omitted here.)

4.3 Automatic font generation with the most suitable structure

When comparing the results in Table. 6 and Table. 7, max pooling has higher average matching rates of pixel values, and the boundaries of the generated images are clearer. We assume that this is due to the difference in the pixel values selected by the pooling: max pooling takes the maximum pixel value within the pooling range, while average pooling takes the average of the pixel values in the range. Taking the average of pixel values at character boundaries makes the boundaries unclear, which is considered to reduce the accuracy of the generated image. Therefore, max pooling is more suitable for automatic font generation. Based on the results in subsections 4.1 and 4.2, we selected the structure and parameters that increase the average matching rate of pixel values and automatically generated fonts using the 1,297 data sets. As a result, the average matching rate of the pixel values for the test data is highest with 4 convolutional layers, output channels (64, 128, 256, 512), a filter size of 7x7, and max pooling applied to the outer 2 layers. At this time, the average matching rate of pixel values for training data is 79.10%, and that for test data is


76.56%. When dropout is applied to the 3rd and 4th convolutional layers with a dropout rate of 0.7, the average matching rate of the pixel values for the training data is 78.21% and that for the test data is 77.63%; the average matching rate for the test data is slightly higher, but the generalization ability does not improve. As a result of increasing the number of training data, the average matching rate of pixel values is lower than the result of subsection 4.1; this is because the number of character types increased, so the number of features to be learned increased and the NN could not learn them sufficiently. When the image size is changed from 64x64 to 32x32, with 4 convolutional layers, output channels (128, 256, 512), a filter size of 5x5, and max pooling applied to the outer two layers, the average matching rate of pixel values for training data is 99.45% and that for test data is 76.22%. The automatic font generation with the new NN structure shows better results than the existing method in either case, which shows that the structure of the NN improves the accuracy of the generated images. In this paper, we experimented with a unified filter size across the convolutional layers; however, since the image size changes with pooling, changing the filter size for each layer may improve the accuracy further. We will conduct further experiments in the future to improve the accuracy of font conversion.

5 Conclusions

In this paper, in order to generate more accurate character images, we experimented to find the most suitable structure of the NN that automatically generates fonts. Fonts were automatically generated by NNs combining various structures and parameters with 500 data sets, and the average matching rates of the pixel values were compared. The character images used for the data sets are those of the Gothic font and of the font used in the books published by Shinshindo in the middle of the Meiji era. The varied structures and parameters of the NN are the number of convolutional layers, the number of output channels, the filter size, and the kind and number of pooling layers. As a result, we find that the larger the number of convolutional layers, the number of output channels, and the filter size, the higher the average matching rate of pixel values for test data. Max pooling works better than average pooling, and the average matching rate is highest when pooling is applied to the outer 2 layers. When the data set is expanded to 1,297 pairs, with 4 convolutional layers, output channels (64, 128, 256, 512), a filter size of 7x7, and max pooling applied to the outer 2 layers, the average matching rate of pixel values for training data was 99.32% and that for test data was 75.79%: the font conversion accuracy was the highest. Since the average matching rate of the pixel values for the test data is improved compared with the existing method, we consider that the conversion accuracy of the font is improved by changing the structure of the NN. In the future, we will experiment with automatic font generation using various combinations of filter sizes in each convolutional layer, aiming for more accurate font conversion.

6 References

[1] National Diet Library. http://www.ndl.go.jp, [Accessed April 15, 2019].
[2] Chisato Ishikawa, Naomi Ashida, Yurie Enomoto, Masami Takata, Tsukasa Kimesawa and Kazuki Joe, "Recognition of Multi-Fonts Character in Early-Modern Printed Books," Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'09), Vol. II, pp. 728–734, 2009.
[3] Manami Fukuo, Yurie Enomoto, Naoko Yoshii, Masami Takata, Tsukasa Kimesawa and Kazuki Joe, "Evaluation of the SVM-based Multi-Fonts Kanji Character Recognition Method for Early-Modern Japanese Printed Books," Proceedings of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Vol. II, pp. 727–732, 2011.
[4] Taeka Awazu, Kazumi Kosaka, Masami Takata and Kazuki Joe, "A Multi-Fonts Kanji Character Recognition Method for Early-Modern Japanese Printed Books," Information Processing Society of Japan TOM, Vol. 9(2), pp. 33–40, 2016.
[5] Takayuki Okatani, Masaki Saito, "Deep Learning," IPSJ SIG Technical Report, Vol. 2013-CVIM-185, No. 19, pp. 1–6, 2013.
[6] Yuki Takemoto, Yu Ishikawa, Masami Takata, Kazuki Joe, "Automatic Font Generation for Early-Modern Japanese Printed Books," Proceedings of the 2018 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2018), Vol. I, pp. 326–332, 2018.
[7] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, 15, pp. 1929–1958, 2014.

Acknowledgment This work is partially supported by Grant-in-Aid for scientific research from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT) No. 17H01829.


Applying CNNs to Early-Modern Japanese Printed Character Recognition

S. Yasunami1, N. Koiso1, Y. Takemoto1, Y. Ishikawa2, M. Takata1 and K. Joe1
1 Nara Women's University, 2 Shiga University

Abstract - Offline Japanese character recognition has been studied for handwritten characters as well as printed characters. We have investigated a third type of Japanese character recognition: recognition for early-modern Japanese printed books. Since these books were produced by typographical printing, their characters have features of both current printed characters and handwritten characters. The major problem in early-modern Japanese printed character recognition is the lack of learning data, because characters with low appearance frequency are very difficult to find in early-modern printed books. If enough learning data were available, the use of a CNN for early-modern printed character recognition would be very promising. In this paper, we propose the combinatorial use of tens of current fonts and several early-modern Japanese printed character sets as learning data for a CNN, and show the experimental results of the CNN.

Keywords: early-modern Japanese character recognition, CNN, digital archiving

1 Introduction

Offline character recognition is a classical field of pattern recognition research. Japanese character recognition has been investigated almost separately from alphabet-based Euro-American character recognition research because of the huge Japanese character set and the extremely complicated character shapes. In the 1980s, feature extraction for Japanese handwritten character recognition was a hot topic in Japan, and fundamental algorithms such as the Peripheral Direction Contributivity feature (PDC) [1], the Weighted Direction Index Histogram Method [2], and Cellular Features [3] were proposed. Using one of the above feature extraction algorithms, we presented a large-scale neural network that recognizes 300 Japanese handwritten characters [4][5] taken from ETL8 [6], while Euro-American character recognition adopted convolution-based neural networks applied to character images [7]. In the 1990s, Japanese handwritten character recognition seemed to mature, and a 97.76% recognition rate for about 3,000 Japanese handwritten characters taken from ETL9 [6] was reported using an effective feature extraction method [8]. (The ETL8 and ETL9 are collections of handwritten Japanese character images of the primary school education set (about 1,000 characters) and the JIS level-1 set (about 3,000 characters), respectively; each character is written by 200 examinees.) Finally, the record was raised to 98.69% in 2000 [9], and it was not overridden until deep learning appeared: in 2015, a Japanese domestic workshop paper reported that Japanese handwritten character recognition using a CNN with the same data set as [8] and [9] achieves 99.70% [10]. At this point, we can say that Japanese handwritten character recognition research has really matured. In general, Japanese character recognition research consists of handwritten and printed character recognition. The latter has already matured, and OCR is commercially available. We found that there is a third kind of Japanese character recognition: that for early-modern printed books. "Early-modern" means the Meiji, Taisho, and first twenty years of the Showa eras (1868-1945). In those days, typographical printing was used for publication, which is completely different from current standardized fonts. There are more than 20,000 publishers of early-modern Japanese printed books, and the printed characters are completely different from the current standardized fonts; furthermore, different publishers generated different early-modern printed characters in different publication years. Considering the above facts, we assumed that the recognition of early-modern Japanese characters has the characteristics of both handwritten and printed (current font) character recognition. So we proposed a third Japanese character recognition method [11-13] with a small number of learning data taken from the National Diet Library Digital Collection [14], where 35,000 titles of early-modern Japanese printed books are open to the public as picture images. In [15], we presented an ensemble learning method consisting of two classifiers (SVM and LVQ) and three feature extraction methods [1-3] for early-modern Japanese printed books, where we used six sets of data: each set has 2,678 types of characters, and each set was generated by a different publisher. The recognition rate for unknown data is about 90%, which is not so good compared with [8-10]. The reason we could not obtain a good recognition rate is simply a lack of data sets: in [8-10], the number of data sets is 200, where each set has JIS level-1 (about 3,000) characters. To improve the recognition rate, we need more data.


The number of collectable early-modern printed character images of the font generated by one publisher is at most 2,000 in our experience. This follows from Zipf's law [16]: the appearance frequency of the k-th most frequent element is proportional to 1/k of the total. Namely, characters with lower appearance frequency are printed fewer times in the entire set of books, so it is very difficult to collect such rarely appearing characters. In addition, characters that were never printed in a book cannot be collected from that book at all, because no metal type existed for them. If we could use 200 sets of early-modern printed character images, each generated by a different publisher and covering 3,000 character types, we would obtain a recognition rate similar to [8][9]; applying a CNN, we might obtain results more like [10]. The serious problem is that we cannot find enough character images in [14]. We notice, however, that we may use tens of current fonts as learning data for early-modern Japanese printed book recognition instead of tens of early-modern publishers; this may be especially suitable for training a CNN. In this paper, we propose a method to apply a CNN to early-modern Japanese printed book recognition by increasing the data sets with 21 types of current fonts instead of early-modern Japanese printed characters from different publishers. The structure of this paper is as follows. Section 2 briefly explains the CNN for early-modern Japanese printed character recognition. Section 3 presents CNN-based early-modern Japanese printed character recognition with a small set of data. Section 4 reports CNN-based Japanese printed character recognition learned with current fonts. Section 5 explains CNN-based Japanese printed character recognition with 21 current fonts and five early-modern Japanese printed character sets.

2 The CNN for early-modern Japanese printed character recognition

In our previously proposed multi-font character recognition method, it is difficult to accurately extract the complex features specific to early-modern printed Japanese characters, because the features extracted from character images had to be defined in advance. As described above, feature extraction used to be a hot topic for Japanese character recognition. The CNN has had a great impact on Japanese character recognition research because the features to be extracted no longer need to be designed in advance; they are learned automatically. Therefore, we propose character recognition of early-modern printed Japanese books using a CNN in this paper.

When a character image from an early-modern printed Japanese book is given, the proposed neural network outputs, for each character used as learning data, the probability that the given image is recognized as that character. The output is normalized to real numbers between 0 and 1 using a softmax function. The middle layers of the neural network are composed of a sequence of convolution layers and pooling layers; we use max pooling for the pooling layers. We repeat the experiments while changing the number of convolutional layers and the filter properties of the convolutional layers to find the optimal structure of the neural network. Learning data is used for neural network training, and test data is used to confirm recognition accuracy. Both learning and test data are sets of pairs: a character image from an early-modern printed Japanese book and a label indicating which character the image represents.
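A minimal sketch of such a network, assuming PyTorch, 62x62 grayscale inputs, and the best structure found later in section 3.1 (three convolutional layers with (160, 320, 640) filters and 7x7 kernels); the fully connected output head is our assumption, as the paper does not describe it:

```python
import torch
import torch.nn as nn

class EarlyModernCNN(nn.Module):
    def __init__(self, n_classes=2667):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 160, 7, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                     # 62 -> 31
            nn.Conv2d(160, 320, 7, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                     # 31 -> 15
            nn.Conv2d(320, 640, 7, padding=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(2))                     # 15 -> 7
        self.classifier = nn.Linear(640 * 7 * 7, n_classes)

    def forward(self, x):
        logits = self.classifier(torch.flatten(self.features(x), 1))
        return torch.softmax(logits, dim=1)      # per-character probabilities in [0, 1]

probs = EarlyModernCNN()(torch.randn(1, 1, 62, 62))
assert probs.shape == (1, 2667)
```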

3 CNN based recognition using early-modern printed Japanese characters

This section describes character recognition experiments on early-modern printed Japanese books to verify whether a CNN can be applied to early-modern printed Japanese character recognition with a very small number of learning data sets. As explained in section 1, a CNN outperforms conventional feature-extraction-based recognition methods when enough learning data sets are available. We have presented an ensemble learning method with several feature extraction methods that recognizes about 90% of unknown data from a small number of early-modern printed Japanese character sets; if enough learning data sets were given, the ensemble learning method would perform relatively well, and a CNN would outperform it. Here, we investigate CNN-based recognition for early-modern printed Japanese books with a small number of learning data sets. In early-modern printed Japanese books, the image data is scanned as facing pages, and each character image is clipped from the image data and used. The size of all character images is 62x62. The character types used in this recognition experiment are 2,667 types, excluding the character types whose strokes are greatly broken from the 2,678 types of JIS level-1 Kanji, JIS level-2 Kanji, and Hiragana used in [15]. Using the character images of these character types, we prepare 6 data sets consisting of 6 different fonts from different publication ages and publishers. Among these, 5 data sets are used as learning data and the remaining data set is used as test data to check the accuracy of character recognition by the CNN. Recognition experiments are performed by changing the number of convolutional layers, the kernel sizes, and the numbers of filters that define the CNN. We try to find the structure of the CNN suitable for early-modern printed Japanese character recognition.



Fig. 1 Recognition rate in Experiment 1

Fig. 2 Recognition rate in Experiment 2

3.1 Recognition experiments

We perform three experiments to find the best structure of the CNN for early-modern printed Japanese character recognition. In Experiment 1, in order to determine the kernel size suitable for the character images used this time, the kernel size is varied. In Experiments 2 and 3, the number of layers and the number of filters are changed to determine the structure of the convolutional layers, respectively. In Experiment 1, the number of filters in the first convolutional layer is 25 and in the second layer 50; only the kernel size is changed. The kernel sizes are the 5 types 3x3, 5x5, 7x7, 9x9, and 11x11, represented by Exp.1-A, Exp.1-B, Exp.1-C, Exp.1-D, and Exp.1-E, respectively. Fig. 1 shows the recognition rate in each experiment. As the kernel size increases, the recognition rate improves; in the case of 7x7, the average recognition rate for test data is the highest, at 88.02%. We therefore consider the 7x7 filter optimal and use it as the kernel size in the following experiments.

the numbers of filters to (40, 80, 160, 320), (80, 160, 320, 640) and (160, 320, 640, 1280). Fig. 3 shows the recognition rates for test data. In Experiment 3, the recognition rate is less than 88% in all cases; we cannot obtain better results than Exp. 2-F.

Fig. 3 Recognition rate in Experiment 3

In Experiment 2, the number of layers is increased to 3 and the number of filters is changed, to compare with the recognition rates of Experiment 1. We set the numbers of filters in the convolutional layers to (5, 10, 20), (10, 20, 40), (20, 40, 80), (40, 80, 160), (80, 160, 320) and (160, 320, 640), represented by Exp.2-A, Exp.2-B, Exp.2-C, Exp.2-D, Exp.2-E, and Exp.2-F, respectively. Fig. 2 shows the recognition rate for each experiment. The recognition rate increases as the total number of filters increases. In Exp. 2-F, where the convolutional part is three layers with (160, 320, 640) filters, the recognition rate for test data reaches 90.36%, the highest recognition rate.

Discussion

From subsection 3.1, the best recognition result with CNN is 90.36%. In our previous study using ensemble learning, recognition results by Adaboost.M1 are reported to be approximately 86-92%, with an average recognition rate of 88.88%. Therefore, we find that our proposed method with CNN cannot improve the recognition rate so much. The number of data sets used in this experiment is only five, and we consider that the features necessary to correctly recognize each character have not been sufficiently learned. Therefore, the recognition rate would be improved if the learning data is given enough.

In Experiment 3, the experiment is performed with four convolutional layers. From the results of Experiment 2, we consider that the recognition rate increased as the total number of filters increased. Therefore, the experiment is performed by increasing the number of filters, starting from the closest one to Exp.2-F where the recognition rate is the highest. We set


4 CNN based Japanese printed character recognition learned with current fonts

Table. 1 Recognition rate for each current font character

Font name                      Recognition rate (%)
HG Marugothic M-PRO            96.37
HG Gothic E                    99.10
HG Mincho E                    96.24
MS Gothic                      98.87
Toppan Bunkyu Gothic           99.97
Toppan Bunkyu Midashi Mincho   99.30
Tsukushi A Round Gothic        99.83
Tsukushi B Round Gothic        99.97
Klee                           99.70
Hiragino Kaku Gothic           99.70
Hiragino Maru Gothic ProN      100.0
Hiragino Mincho ProN           99.87
Hiragino Kaku Gothic ProN      99.80
Hiragino Kaku Gothic Std       99.97
Kokutai                        99.97
YuKyokasho                     100.0
Meiryo                         90.68
Osaka                          99.97
YuGothic                       99.93
YuMincho +36p Kana             97.67
YuMincho                       99.27

Section 3 shows that CNN-based early-modern Japanese printed character recognition requires a large amount of image data. However, as explained in section 1, the appearance frequency becomes smaller and smaller as more characters are gathered: in our experience, finding more than 1,000 character types is difficult, and collecting more than 2,000 is almost impossible even from at least 200 different publishers. On the other hand, a CNN requires a large number of training data sets. Therefore, we prepare character images of current fonts to use as learning data for the CNN. Since current fonts are easy to obtain, it is unnecessary to consider the frequency of character appearance, and a sufficient number of character images can be obtained. This section describes early-modern printed Japanese character recognition using current fonts as learning data.

4.1 Preliminary experiments

Before performing the recognition experiments, we verify whether the CNN used in this paper is useful as a character recognition method for current fonts. We use 21 kinds of current fonts. For each font, we use 3,011 character types of JIS level-1 Kanji and Hiragana. One of the 21 current fonts is used as test data, and the remaining 20 are used as learning data. The data set used for testing is rotated for cross-validation, so the experiment is performed 21 times in total. As the structure of the CNN, we set the numbers of filters in the three layers to (160, 320, 640) and all kernel sizes to 7x7; these are taken from the setting with the highest recognition result in section 3. The results of the preliminary experiment are shown in Tab. 1. The average recognition rate over the 21 experiments is about 99%, which confirms that the CNN-based character recognition of current fonts works correctly.
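The evaluation loop is a leave-one-font-out cross-validation; a sketch (train_and_evaluate is a hypothetical stand-in for training and testing the CNN of section 2):

```python
# Leave-one-font-out cross-validation over the 21 current fonts (sketch).
fonts = [f"font_{i:02d}" for i in range(21)]    # placeholder font identifiers

def train_and_evaluate(train_fonts, test_font):
    # Hypothetical: train the CNN on 20 fonts, return the recognition
    # rate on the held-out font's 3,011 character images.
    return 0.99

rates = [train_and_evaluate(fonts[:i] + fonts[i + 1:], f)
         for i, f in enumerate(fonts)]
print("average recognition rate:", sum(rates) / len(rates))
```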

4.2 Recognition experiments

In this recognition experiment, all current fonts used in the preliminary experiment are used as learning data, and character recognition for early-modern Japanese printed characters is performed. The test data is the six data sets used in section 3. For the 6 x 2,667 characters of test data, the learning data of the 21 current fonts with 3,011 character types (JIS level-1 Kanji and Hiragana) are used. The recognition experiment is carried out using the same CNN structure as in the preliminary experiment. The average recognition rate is 71.57%. From this result, we confirm that existing OCR trained only on current fonts cannot recognize early-modern Japanese printed characters correctly. Early-modern Japanese printed books are typographically printed, so a considerable number of characters are blurred; furthermore, some characters from poorly preserved books have lost parts of the character through ink spreading. Thus, the features specific to early-modern Japanese printed characters have to be learned in some way. Since early-modern Japanese printed character

Recognition rate(%) 96.37 99.10 96.24 98.87 99.97 99.30 99.83 99.97 99.70 99.70 100.0 99.87 99.80 99.97 99.97 100.0 90.68 99.97 99.93 97.67 99.27

images also include the features that are different from the current fonts, we notice that we use the combination of current fonts and several early-modern Japanese printed character sets for learning data.

5 CNN based early-modern Japanese printed character recognition using both current fonts and early-modern characters

In section 4.2, a CNN learned the character features from current fonts, and we found that the trained CNN has difficulty recognizing early-modern Japanese printed characters. Although character images from early-modern Japanese printed books are indispensable for early-modern Japanese printed character recognition, learning data consisting only of images from early-modern Japanese printed books are not sufficient. Therefore, we propose a new method that uses both character images of early-modern Japanese printed books and of current fonts as learning data. This makes it possible to extract the features of character images both from early-modern Japanese printed books and from current fonts. The recognition of character types that are misrecognized because of font differences will thereby be improved. Furthermore, by using early-modern Japanese printed characters as learning data, it is possible to capture features such as blurs that are characteristic of early-modern Japanese printed books.
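As an illustration of how such a combined learning set might be assembled, the following sketch uses stub loaders (load_font_images and load_early_modern_set are placeholders we invented; only the set sizes and the leave-one-out protocol of section 5.1 come from the text):

def load_font_images(font):
    # Stub: would render the 3,011 JIS level-1 Kanji and Hiragana in `font`.
    return [(font, i) for i in range(3011)]

def load_early_modern_set(set_id):
    # Stub: would load one publisher's set of 2,667 early-modern character images.
    return [(set_id, i) for i in range(2667)]

def build_learning_data(fonts, early_modern_sets, test_set):
    # Combine all current-font images with every early-modern set that is
    # not held out as test data.
    data = [img for font in fonts for img in load_font_images(font)]
    data += [img for s in early_modern_sets if s != test_set
             for img in load_early_modern_set(s)]
    return data

data = build_learning_data([f"font{k}" for k in range(21)],
                           [f"Data{k}" for k in range(1, 7)], "Data1")
print(len(data))  # 21 * 3011 + 5 * 2667 = 76566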


5.1 Recognition experiments

In the experiment of this section, each of the six data sets of early-modern Japanese printed character images is used as test data in turn. The learning data are the 21 data sets of current fonts described in section 4 and the five early-modern Japanese printed character sets that are not used as the test data. We set the number of filters in the three convolutional layers to (160, 320, 640) and each kernel size to 7x7 as the structure of the CNN, the same as in sections 3 and 4. The experimental results are shown in Table 2. The average recognition rate is 96.89%, an improvement of about 6% compared to our previous methods. We find that the recognition rate is improved when not only character images of early-modern Japanese printed books but also those of current fonts are given as learning data. Moreover, in the existing method, differences of up to 6% were found in the recognition rates across the data sets. We consider that one reason for this is the difference in fonts between publishers, that is, the fact that characters whose shapes differ cannot be recognized as the same character. In this experiment, we use 21 current fonts, which makes it possible to learn various character features with different character shapes. Therefore, we expect that similar recognition results would be obtained using any other early-modern Japanese printed characters. On the other hand, there are some characters that could not be recognized. These characters are so degraded that even Japanese readers have difficulty recognizing them. We aim to further improve the recognition rate by performing recognition of such degraded character images with the help of compensating information.

Table 2. Recognition rate for each data set of early-modern printed Japanese character images

Data ID    Recognition rate (%)
Data1      96.36
Data2      97.21
Data3      97.35
Data4      97.54
Data5      96.88
Data6      95.98

6 Related work

As far as we know, there is no other research group working on early-modern Japanese printed character recognition. In this section, we review our early-modern Japanese printed character recognition research project, running since 2008. [11] is our first report for the project. In [11], we adopted PDC [1] for feature extraction and an SVM as a classifier, targeting only 10 kinds of early-modern printed characters where each character has 50 images from different publishers. The recognition rate for test data is 97.5%. We analyzed the results and found noise unique to early-modern printed books in the misrecognized images, so we proposed a new noise removal method for early-modern Japanese printed character recognition that improves the recognition rate up to 99.0%. This high recognition rate is obtained because of the small number of character types (just 10). In [12], we reported collecting 262 kinds of early-modern printed characters where each character has 10 images from different publishers. The recognition rate is 92.7%. As one can easily guess, the deterioration of the recognition rate comes from the lack of training data. At this point we noticed that characters with a low frequency of appearance are difficult to find in early-modern printed books,


as described by Zipf's law [16]. In fact, we can easily find more than one hundred images from different publishers for characters with a high frequency of appearance, but when the number of character types reaches 256, it becomes very difficult to find more than 10 publishers by hand. We realized that the main problem in early-modern Japanese printed character recognition is the difficulty of finding books from different publishers that include characters with a low frequency of appearance. We therefore proposed a data collection method for early-modern Japanese printed books in [17]; however, the method requires an appropriate recognizer for early-modern Japanese printed characters, and we have not achieved such a system yet. We also investigated various feature extraction methods other than PDC [1]. In [18], we applied the Weighted Direction Index Histogram Method [2] and Cellular Features [3] to compare the three kinds of feature extraction methods. We found that misrecognized characters are evenly distributed among the three feature extraction methods; in other words, by using multiple feature extraction methods, it would be possible to reduce the number of required training data sets. As described in section 1, we presented an ensemble learning method consisting of two classifiers (SVM and LVQ) and three feature extraction methods [1-3] with six sets of data, where each set has 2,678 types of characters and each set was generated by different publishers. The recognition rate for test data is about 90% [15]. We also investigated another method to increase the amount of learning data. In [19], we proposed a challenging method to generate character images of an early-modern publisher that are not found in any printed books: we developed a deep neural network that reads currently available fonts and generates the corresponding character images of a specific publisher. We showed that when the DNN learns enough, it generates unknown character images of the publisher. In [20], we show the optimal structure of the font-generating DNN. Besides data sets, there is another problem in constructing an OCR for early-modern Japanese printed books. In the early-modern era, the Japanese literacy rate was very low. Most Japanese could read Hiragana and Katakana, which are phonetic characters, while less educated people could not read the several thousand Kanji characters. For those people, most


early-modern Japanese printed books adopt the Ruby system: each Kanji character is printed with a sequence of small Hiragana or Katakana characters that teaches readers how to pronounce the character. This is very useful for less educated people, but it prevents OCRs from clipping the Kanji characters. In [21], we proposed a Ruby removal method using genetic programming and showed that 99% of Ruby characters are removed. We are also working on the OCR for early-modern Japanese printed books itself. Besides the above subtopics, layout analysis is needed to develop the OCR; we present a new method of layout analysis for early-modern Japanese printed books in [22].

7 Conclusions

In this paper, we propose a method using a CNN to improve the recognition rate of early-modern Japanese printed character recognition. When recognition is performed only with early-modern Japanese printed character images, the recognition rate is similar to that of our previous method. When only current fonts are used as learning data for early-modern Japanese printed character recognition, the recognition rate is about 70%, which is insufficient for practical character recognition. This is because blurred characters are common in the character images of early-modern Japanese printed books but do not exist in current fonts; it confirms that existing OCRs cannot readily be applied to early-modern Japanese printed character recognition. When character images of current fonts and of early-modern Japanese printed characters are used together as learning data, the average recognition rate reaches about 97%. The character images of current fonts compensate for learning data consisting only of early-modern Japanese printed characters, which are not sufficient by themselves, while the features specific to early-modern Japanese printed character images are learned from the early-modern printed character images. As a result, we observed that the recognition rate for early-modern Japanese printed character images can be greatly improved by providing the characters of current fonts and of early-modern books together as learning data. The recognition rate could be further improved by providing more kinds of current fonts. In addition, we are investigating a new method to recognize characters that cannot be read because of noise such as broken or blurred strokes.

8 References

[1] Hagita, N., Naito, S. and Masuda, I.: Handprinted Chinese Characters Recognition by Peripheral Direction Contributivity Feature (in Japanese), IEICE, Vol. J66-D, No. 10, pp. 1185-1192 (1983).
[2] Tsuruoka, S., Kurita, M., Harada, T., Kimura, F. and Miyake, Y.: Handwritten "KANJI" and "HIRAGANA" Character Recognition Using Weighted Direction Index Histogram Method (in Japanese), IEICE, Vol. J70-D, No. 7, pp. 1390-1397 (1987).
[3] Oka, R.: Handwritten Chinese-Japanese Characters Recognition Using Cellular Features (in Japanese), IEICE, Vol. J66-D, No. 1, pp. 17-24 (1983).
[4] Mori, Y. and Joe, K.: A Large-Scale Neural Network Which Recognizes Handwritten Kanji Characters, Proceedings of 2nd Neural Information Processing Systems, pp. 415-422 (1990).
[5] Joe, K., Mori, Y. and Miyake, S.: Construction of a Large-scale Neural Network: Simulation of Handwritten Japanese Character Recognition on NCUBE, Concurrency: Practice and Experience, John Wiley & Sons Inc., Vol. 2, No. 2, pp. 79-107 (1990).
[6] ETL Character Database. http://etlcdb.db.aist.go.jp/ Accessed 2019-06-10.
[7] LeCun, Y. et al.: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, Vol. 1, No. 4, pp. 541-551 (1989).
[8] Kato, N., Suzuki, M., Omachi, S., Aso, H. and Nemoto, Y.: A Handwritten Character Recognition System Using Directional Element Feature and Asymmetric Mahalanobis Distance, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 21, No. 3, pp. 258-262 (1999).
[9] Nakajima, T., Wakabayashi, T., Kimura, F. and Miyake, Y.: Accuracy Improvement by Compound Discriminant Functions for Resembling Character Recognition (in Japanese), IEICE Trans. D-2, Vol. 83, No. 2, pp. 623-633 (2000).
[10] Sasaki, K., Chen, K. and Baba, T.: A Software Implementation of Handwritten Japanese Character Recognition Using Convolutional Neural Network, Technical Report of JCEEE-Kyushu, p. 348 (2015).
[11] Ishikawa, C., Ashida, N., Enomoto, Y., Takata, M., Kimesawa, T. and Joe, K.: Recognition of Multi-Fonts Character in Early-Modern Printed Books, PDPTA2009, Vol. 2, pp. 728-734 (2009).
[12] Fukuo, M., Enomoto, Y., Yoshii, N., Takata, M., Kimesawa, T. and Joe, K.: Evaluation of the SVM Based Multi-Fonts Kanji Character Recognition Method for Early-Modern Japanese Printed Books, PDPTA2011, Vol. 2, pp. 727-732 (2011).
[13] Awazu, T., Kosaka, K., Takata, M. and Joe, K.: A Multi-Fonts Kanji Character Recognition Method for Early-Modern Japanese Printed Books, IPSJ Trans. on TOM, Vol. 9, No. 2, pp. 33-40 (2016).
[14] National Diet Library Digital Collection. http://dl.ndl.go.jp. Accessed 2019-04-15.
[15] Fujimoto, K., Ishikawa, Y., Takata, M. and Joe, K.: Early-Modern Printed Character Recognition using Ensemble Learning, PDPTA2017, pp. 288-294 (2017).
[16] Zipf, G. K.: The Psycho-Biology of Language, Houghton Mifflin, Boston, Mass. (1935).
[17] Kosaka, K., Awazu, T., Ishikawa, Y., Takata, M. and Joe, K.: An Effective and Interactive Training Data Collection Method for Early-Modern Japanese Printed Character Recognition, PDPTA2015, Vol. 1, pp. 276-282 (2015).
[18] Kosaka, K., Fujimoto, K., Ishikawa, Y., Takata, M. and Joe, K.: Comparison of Feature Extraction Methods for Early-Modern Japanese Printed Character Recognition, PDPTA2016 Final Edition, pp. 408-414 (2016).
[19] Takemoto, Y., Ishikawa, Y., Takata, M. and Joe, K.: Automatic Font Generation for Early-Modern Japanese Printed Books, PDPTA2018, pp. 326-332 (2018).
[20] Takemoto, Y., Ishikawa, Y., Takata, M. and Joe, K.: Structure of Neural Network Automatically Generating Fonts for Early-Modern Japanese Printed Books, PDPTA2019, accepted (2019).
[21] Awazu, T., Fukuo, M., Takata, M. and Joe, K.: A Multi-Fonts Kanji Character Recognition Method for Early-Modern Japanese Printed Books with Ruby Characters, ICPRAM2014, pp. 637-645 (2014).
[22] Iida, S., Takemoto, Y., Ishikawa, Y., Takata, M. and Joe, K.: Layout Analysis Using Semantic Segmentation for Imperial Meeting Minutes, PDPTA2019, accepted (2019).

Acknowledgment

This work is partially supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT), No. 17H01829.


Shape Recognition Technique for High-accuracy Mid-surface Mesh Generation

M. Okumoto, S. Jien, J. Niiharu, K. Ishikawa and H. Nishiura
Software Development, Integral Technology, Co., Ltd., Osaka, Osaka, Japan

Abstract - Computer Aided Engineering (CAE) is an essential technology in the automotive development process. In terms of mesh generation for the structural analysis of resin products, it has been a challenge to generate high quality mid-surface meshes automatically, because of the difficulty of creating high quality mid-surfaces from 3D CAD data. This paper introduces a new method of extracting mid-surfaces that aims to generate high quality mid-surface meshes. Recognition of end-terminal surfaces plays an important role in mid-surface generation, and hereby the Offset and SURF techniques are studied and compared in terms of recognition accuracy. The proposed technique is shown to recognize end-terminal surfaces with high accuracy. This new method achieves automatic extraction of high quality mid-surface meshes and improves the efficiency of the whole CAE modelling process.

Keywords: Mid-mesh, Automation, Shape recognition, Mesh control

1 Introduction

Recently, the necessity to cut time and cost in the product development cycle has become important in the CAE market, as production strategies evolve toward many product variations in small quantities. It is also predictable that social problems such as depopulation, a declining birth rate and an aging population will cause man-power shortages and the deterioration of engineering skills as the number of retiring skilled workers keeps increasing. To answer the market and social changes mentioned above, automation of the CAE process, backed by a knowledge database gained from skilled workers and a fast computing system, is demanded [1].

The whole CAE process is composed of three parts: pre-process, analysis process, and post-process. Pre-processing generates the mesh model from 3D CAD data that will be used for the analysis computation. In the case of sheet metal or resin products, mid-mesh modeling is the most accepted modeling approach. The main focus of this paper is the automation and speed-up of pre-processing, especially the mid-mesh generation process.

Several different approaches to generate mid-meshes automatically have been proposed [2][3]. Medial Axis Transform (MAT) is one of the well-known methods, which

calculates the mid-surface from the wall-thickness information of the 3D model [4][5]. Quadros and Shimada proposed the Chordal Axis Transform (CAT), using a bubble packing algorithm to first generate a one-layer tetra mesh and then divide the tetra mesh right in the middle to create the mid-surface [6]. Sheen et al. proposed a method that first detects face-pairs and then generates the mid-surface from the recognized face-pairs through offsetting [7]. The face-pairs method has the advantage of more flexibility in controlling mid-mesh generation based on specific rules. However, traceability problems and imperfect results are more likely to occur for complicated shapes of 3D CAD data, and for this reason manual modification/repair is needed. The absence of a mid-surface reference for manual repair is the main obstacle: the mid-surface has to be created by hand, which is not always easy. Depending on the shape complexity, a tremendous amount of time is needed for the mesh modification process due to the tedious manual repair work, and therefore it is difficult to speed up modelling without automation.

In general, geometry simplification is recommended before the mid-surface generation process, where unnecessary features such as fillets at the bottom of ribs are eliminated manually. As a first step toward the automation and speed-up of high-quality mid-mesh generation, a new method called the "SURF" technique was developed for automatically recognizing geometry features [8][9]. This technique can be utilized to recognize basic features such as end-terminals, ribs, flanges, and many others. It can also be applied to recognize the unnecessary features that have to be eliminated during the geometry simplification process. Therefore, the geometry simplification process can be automated and the manual workload can be reduced.

This paper serves as an investigation and evaluation of the automatic recognition of unnecessary features during the mid-mesh generation process. Among several features, the discussion in this paper focuses only on the recognition of end-terminal surfaces, which is considered the most important feature for mid-mesh generation. To obtain recognition results with high accuracy, a hybrid method of "SURF" in combination with other algorithms is applied [9]. In this paper, the new shape recognition technique called the "Offset technique" is proposed, and the capabilities of the Offset and SURF techniques as shape recognition methods are introduced. Experiments with the SURF technique for shape


recognition were compared and tested with 3D plate-like models. The shape recognition methods are presented in chapter 2. Chapter 3 gives the comparison of the experimental results and an evaluation of the capability of the proposed method for shape recognition. The final mid-mesh generation result is presented in chapter 4. Finally, chapter 5 gives the conclusion and comments for future development.

2 Shape recognition technique

In this section, two methods of shape recognition are studied: the Offset technique in section 2.1 and the SURF technique in section 2.2. Section 2.1 outlines the Offset technique for recognizing end-terminal surfaces. Section 2.2 gives a brief introduction and the general methodology of the SURF technique for the recognition of end-terminal surfaces.

2.1 Offset technique

The Offset technique recognizes shape features by evaluating the change of surface area at each offset step. The offset process is executed in the direction facing the inner side of the model (refer to Fig. 1). End-terminal surfaces tend to show a larger change in surface area than the other surfaces; in other words, the area of an end-terminal surface tends to go to zero along the offset process. An experiment with the Offset technique is performed for the model shown in Fig. 2, and the tendency of the area changes during the offset process is shown in Fig. 3. In Fig. 3, the vertical axis indicates the offset surface area ratio, and the horizontal axis indicates the number of offsets. In this graph, F1 is a fillet surface at the bottom of a rib, while T1 and T2 are end-terminal surfaces; P1 and P2 are other surfaces. Referring to Fig. 3, it is clear that T1 and T2, which have the characteristics of end-terminal features, show a drastic decrease in surface area compared to the other types of surfaces. This offset surface area ratio can be used to easily distinguish end-terminal surfaces from the others: end-terminal surfaces have larger changes of offset surface area, or a smaller angle φ as indicated in Fig. 3. All surfaces whose angle is smaller than the threshold angle φT have a high probability of being end-terminal surfaces. However, it is not easy to fix the value of φT, because the offset area ratio differs depending on the geometry contours. The variance of the offset area ratio across different models will cause instability if the recognition of end-terminal surfaces is determined only by the threshold value. For example, if the threshold angle φT is set to φ1 as indicated in Fig. 3, T1 will be recognized as an end-terminal surface; on the other hand, if φT is set to φ2 (Fig. 3), T1 will be recognized as an other surface. Therefore, deciding the recognition of end-terminals based only on a threshold value is prone to misjudgment. In addition, models with complicated contours will show irregular changes of the offset area ratio, which may cause ambiguity in reliability and accuracy. Due to the above-mentioned restrictions, it is necessary to consider the value of φT carefully. Furthermore, since the offset has to be repeated several times to observe the surface area changes, the calculation time increases; thus this method is not efficient in terms of computation time, and the computation time grows with the size of the model.

Fig. 1 Offset technique for recognition of end-terminal surfaces.

Fig. 2 Sample model for explanation of Offset technique

Fig. 3 Graph showing the relationship between numbers of offset and offset surface area ratio.


2.2 SURF technique

The SURF technique was invented to recognize features and shapes efficiently and accurately. Utilizing the SURF technique improves both the calculation time and the recognition accuracy. The SURF technique has a function to calculate the edge connection between two adjacent surfaces of a 3D CAD model (see Fig. 4). By referring to this edge connection information, geometric shapes/features can be recognized by analyzing the patterns of collections of edge connections. The edge connection is calculated as

edge connection = (v × nT) ∙ nA    (1)

Fig. 4 Calculation of surface connection using SURF technique

where v indicates the direction vector of the edge shared between the two adjacent surfaces A and T, and nT and nA represent the normal vectors of each surface, calculated in the vicinity of the shared edge. The edge connection calculated by Eq. (1) can be classified into three types, as shown in Fig. 5:

• Edge connection > 0 is a concave connection (Fig. 5a).
• Edge connection < 0 is a convex connection (Fig. 5b).
• Edge connection = 0 is a planar connection (Fig. 5c).

Fig. 5 Edge connection; (a) concave connection, (b) convex connection and (c) planar connection

In the case of end-terminal surfaces, the edge connections between the two adjacent surfaces will be convex-convex connections, as shown in Fig. 6. This characteristic of end-terminal surfaces defined by the SURF technique can help solve the recognition problems of the Offset technique, so improvements in recognition accuracy and computation time can be expected by adopting the SURF technique. Figure 7 shows the recognition result of adapting the SURF technique to the Offset technique. The SURF technique is applicable not only to recognizing end-terminal surfaces; it can also be extended to recognize ribs, flanges and many other shapes/features. Another advantage of the SURF technique over the Offset method is the reduction in computation cost, thanks to its capability to find any type of feature in a simple way. Details about using the SURF technique to recognize various types of geometric features are given in [8].
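As a minimal illustration of Eq. (1), the following sketch computes and classifies the edge connection for given vectors; the example input vectors and the tolerance eps are our assumptions:

import numpy as np

def edge_connection(v, n_t, n_a):
    # Eq. (1): (v x nT) . nA, with v the shared-edge direction and
    # nT, nA the surface normals near the shared edge.
    return float(np.dot(np.cross(v, n_t), n_a))

def classify_connection(value, eps=1e-9):
    # The sign of the edge connection gives the connection type of Fig. 5.
    if value > eps:
        return "concave"
    if value < -eps:
        return "convex"
    return "planar"

# Example with axis-aligned unit vectors; prints "concave" for this input.
v = np.array([0.0, 1.0, 0.0])    # shared edge direction
n_t = np.array([0.0, 0.0, 1.0])  # normal of surface T
n_a = np.array([1.0, 0.0, 0.0])  # normal of surface A
print(classify_connection(edge_connection(v, n_t, n_a)))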

Fig. 6 Recognition rule for end-terminal surface by relating to surface connection calculated by SURF technique

Fig. 7 Recognition result for end-terminal surfaces using SURF technique

3 Shape recognition examples

In this section, shape recognition using the SURF and Offset techniques is tested for its capability to recognize end-terminal surfaces. The evaluation and validation of recognition accuracy for two different models are shown in Figs. 8 and 9. The equations to find φT and the offset distance x of the Offset technique are given in Eqs. (2) and (3), respectively.


φT = cos^(-1)( 1 / √(1 + NMAX²) )    (2)

x = TMAX / (2 NMAX)    (3)

where TMAX is the maximum thickness of the model and NMAX is the maximum offset number. x is obtained from Eq. (3), and φT is then calculated from Eq. (2) by referring to NMAX, as shown in Fig. 10. For these models with simple contours, NMAX is defined as 20 and φT is defined as φ1.
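A small numeric check of Eqs. (2) and (3) as reconstructed above; NMAX = 20 follows the text, while the TMAX value is an assumed thickness, since none is given:

import math

N_MAX = 20    # maximum offset number, from the text
T_MAX = 2.0   # assumed maximum model thickness

phi_T = math.degrees(math.acos(1.0 / math.sqrt(1.0 + N_MAX ** 2)))  # Eq. (2): ~87.1 deg
x = T_MAX / (2.0 * N_MAX)                                           # Eq. (3): 0.05

# Eq. (2) equals atan(N_MAX): the angle of a decline line in Fig. 3 that
# reaches zero area exactly at N_MAX offsets.
assert abs(phi_T - math.degrees(math.atan(N_MAX))) < 1e-9
print(phi_T, x)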

Fig. 8 Sample model A for shape recognition experimentation.

3.1 Result of shape recognition with Model A

First, the recognition result of end-terminal surfaces for model A is shown in Fig. 11. Model A has a total of 5 end-terminal surfaces. The Offset technique without the SURF technique was able to recognize all the end-terminal surfaces, shown in blue in Fig. 11. However, this result contains one extra surface that is not actually an end-terminal surface; the recognition error is shown in Fig. 12. The Offset technique recognizes end-terminal surfaces by observing the changes of the offset surface area, so every surface whose offset surface area ratio is close to zero is assumed to be an end-terminal surface. Accordingly, the recognition error in model A occurs because the incorrectly recognized surface has an offset rate below φT. On the other hand, the recognition result of end-terminal surfaces using both the Offset technique and the SURF technique is shown in Fig. 13, where all the end-terminal surfaces are correctly recognized. The SURF technique can find the incorrect end-terminal surfaces by looking at the type of edge connections: end-terminal surfaces should have convex-convex connections by definition, while the misrecognized surface of model A (Fig. 12) has convex-concave connections. By using the SURF technique, it was easy to find the incorrect end-terminal surfaces. This result demonstrates the effectiveness and capability of the SURF technique for the recognition of end-terminal surfaces.

Fig. 9 Sample model B for shape recognition experimentation.

Fig. 11 Recognition results of model A using Offset technique.

Fig. 10 Graph of offset area ratio indicating the recognition region for end-terminal surfaces, which lies within the range of φT.

Fig. 12 End-terminal surfaces recognition error of model A.


Fig. 13 Recognition results of model A using Offset technique and SURF technique.

Fig. 15 Recognition results of model B using Offset technique and SURF technique.

3.2 Result of shape recognition with Model B

The recognition result of end-terminal surfaces using the Offset technique for model B is shown in Fig. 14. The total number of end-terminal surfaces in model B is 77; the Offset technique without the SURF technique could recognize only about 27 of them. Although there are no excessively recognized surfaces in model B, the Offset technique alone shows poor performance. The failure in recognizing the end-terminal surfaces is connected with the tendency of the declining offset surface area, where the area ratio does not fall below φT. In contrast, the experiment using both the Offset technique and the SURF technique shows a remarkably improved result: all 77 end-terminal surfaces were recognized properly, though 3 surfaces that are not actually end-terminal surfaces were also picked up (Fig. 15). Although there is an error of excessively recognized surfaces, the result is acceptable and can be put to practical use for mid-mesh generation, because this recognition error is considered unimportant and has no direct influence on mid-mesh generation. These results confirm and validate the effectiveness and potential of the SURF technique for improving the recognition accuracy of end-terminal surfaces. For future implementations, a tremendous reduction of processing time can be expected by using only the SURF technique for shape recognition, without the Offset technique: using the SURF technique to find the convex-convex connections of end-terminal surfaces is much easier and saves a lot of time. An experiment validating the recognition accuracy and calculation time of the SURF technique alone is outside the scope of this paper.

4 Mid-mesh Generation

Figures 16 and 17 give the mid-mesh results generated after recognizing the end-terminal surfaces using the SURF technique. The automatic mid-mesh generation is mainly based on the recognized face-pairs [7]. Since the end-terminal surfaces are recognized using the SURF technique before face-pair detection, the improper pairing problem is corrected and the recognition accuracy is improved; in the end, a high-quality mid-mesh is generated. It has been shown that the SURF technique has a high potential for shape recognition, and improvements in the mid-mesh generation results can be expected by applying this technique together with common existing methods. The improvement in mid-mesh results will in turn reduce the amount of manual work and thus improve the modelling efficiency, with less dependency on human operators. The difference in modeling skill between novice and experienced operators influences the output result; in this respect, the automated mid-mesh generation tool provides robust modeling by eliminating the influence of differences in the skill level of human operators.

Fig. 16 Mid-mesh generation results of model A.

Fig. 14 Recognition results of model B using Offset technique


Fig. 17 Mid-mesh generation results of model B.

5 Conclusion

In this paper, the "Offset" technique was proposed as a new method for shape recognition of end-terminal surfaces, and the advantages of using "Offset" along with the "SURF" technique were introduced. The validation of recognition accuracy was carried out through experiments with two sample models. The implementation of shape recognition using the SURF technique for mid-mesh generation is summarized as follows:

1) The Offset technique recognizes features through the evaluation of offset surface area changes towards the inner side of 3D plate-like models. Since the changes of surface area differ depending on the complexity of the model contours, the recognition result is mostly affected by the determination of φT (the threshold angle; section 2.1). Therefore, the selection of φT is a vital part of the Offset technique.

2) The SURF technique recognizes feature shapes based on a calculation of the edge connection (concave, convex or planar) of the 3D CAD model, and its capability and effectiveness for the recognition of end-terminal surfaces have been demonstrated.

3) The comparative study using the SURF technique also shows improved accuracy for both sample models.

4) Finally, the improvement in the recognition of end-terminal surfaces also gives a better result for mid-mesh generation.

The SURF technique has a high potential for shape recognition and improves the geometry traceability during mid-mesh generation. Faster computation and improved accuracy will reduce the amount of manual work needed, so a highly efficient and fast modeling process can be expected. Moreover, reliance on the skill level of human operators can be eliminated through the automation of mid-mesh generation using the SURF technique. Further improvement and innovation are still necessary for better shape recognition and mid-mesh generation results. We are continually working on the development of effective, high-speed algorithms to answer the challenge of rapid changes in market demands.

6 Acknowledgments

Thanks to the Japanese automobile makers for their collaboration, evaluation and valuable feedback on the improvement of meshing efficiency for thin-walled components.

7 References

[1] Nishiura, H.: Automatic Construction for CAE Analysis, The Japan Society of Mechanical Engineers 20th Computational Mechanics Division Conference, No. 07-36, pp. 26-28 (2007).
[2] Onodera, M.: Medial-Surface Generation Technique for Transforming an Assembly Model for Finite Element Analysis, Transactions of The Japan Society of Mechanical Engineers, Series A (in Japanese), Vol. 71, Issue 707, No. 07-36, pp. 26-28 (2005).
[3] Onodera, M., Yoshimitsu, H., Goto, K. and Kongo, C.: Automatic Generation Technique of Suitable Medial-Surface Model for FEM, Transactions of The Japan Society of Mechanical Engineers, Series A (in Japanese), Vol. 71, Issue 708, No. 04-0991, pp. 169-176 (2005).
[4] Ramanathan, M. and Gurumoorthy, B.: Generating the Mid-Surfaces of a Solid Using 2D MAT of its Faces, Computer-Aided Design and Applications, Vol. 1, pp. 665-674 (2004).
[5] Armstrong, C., Robinson, T. and Ou, H.: Recent Advances in CAD/CAE Technologies for Thin-Walled Structures Design and Analysis, Fifth International Conference on Thin-Walled Structures (2008).
[6] Quadros, W. R. and Shimada, K.: Hex-Layer: Layered All-Hex Mesh Generation on Thin Section Solids via Chordal Surface Transformation, Proceedings of the 11th International Meshing Roundtable, Springer-Verlag (2002).
[7] Sheen, D. P., Son, T., Ryu, C., Lee, S. et al.: Dimension Reduction of Solid Models by Mid-Surface Generation, International Journal of CAD/CAM, Vol. 7, No. 1 (2009).
[8] Jien, S. and Nishiura, H.: Automatic Tool for Improving the Meshing Efficiency for Thin-Walled Objects, SAE Int. Journal of Materials and Manufacturing, Vol. 11, No. 4, pp. 361-372.
[9] Jien, S., Ishikawa, K., Niharu, J. and Nishiura, H.: Automatic Mid-mesh Generation Tool with Shape Recognition and Mesh Control Capability, 2019 JSAE Annual Congress (Spring).


SESSION

LATE BREAKING PAPERS: PARALLEL & DISTRIBUTED PROCESSING AND APPLICATIONS

Chair(s): TBA


A GPU-MapCG based parallelization of BSO metaheuristic for Molecular Docking problem

Hocine Saadi
CERIST Research Center Algiers, and University Djillali LIABES of Sidi Bel Abbes, Algeria
Email: [email protected]

Nadia Nouali Taboudjemat
CERIST Research Center, Algeria
Email: [email protected]

Malika Mehdi, Ousmer Sabrine and Hafida Benboudjelthia
Universite des Sciences et de la Technologie Houari Boumediene, Algiers, Algeria
Email: [email protected], [email protected], [email protected]

Abstract—Molecular docking is fast becoming a key instrument in the drug discovery process. This method helps deliver new drug candidates quickly and at lower cost. It consists of calculating the optimum orientation, position, and conformation of a new molecule (drug candidate) with respect to an existing molecule (protein) to form a stable molecular complex with overall minimum energy. In this paper, we propose a parallel model of the bees swarm optimization metaheuristic to solve the molecular docking problem. We use the MapCG framework to implement the MapReduce model on graphics processing units (GPUs). MapCG was developed to simplify the programming process on GPUs and to design portable applications independent of the hardware architecture. Our solution can run sequentially on a CPU, or in parallel on a GPU, without changing the code. Experiments docking a set of protein-ligand complexes show that our solution achieves good performance: the parallel implementation using MapCG on a GPU gains an average speedup of 10x with respect to a single CPU core.


I. INTRODUCTION

Molecular docking (MD) methods play an important role in the fields of computational chemistry and biomedical engineering [1]. They are frequently used to predict the binding position and orientation of a ligand (drug candidate) when interacting with a protein receptor (the origin of the disease) [2] (see Fig. 1), and to predict the affinity and activity of the ligand-protein complex as measured by a scoring function (binding affinity) [3]. The latter gives a detailed description of protein-ligand poses in chemical detail, and provides information about the bound conformation and the stability of the protein-ligand complex [4]. In protein-ligand docking, metaheuristics are used as search algorithms to find as many binding poses of the ligand against the target protein as possible by traveling through the search space. These algorithms play a central role in the docking accuracy [5]. Many metaheuristics have been proposed in the literature for MD. Some well known ones are: Genetic Algorithm (GA) [6], Simulated Annealing (SA) [7], and Particle Swarm Optimization (PSO) [8]. However, these algorithms need improvement to better explore the search space. In docking simulations, most of the CPU time is spent in the evaluation phase (scoring function) of the metaheuristic [9]. In this phase, the calculation of the scoring function

Fig. 1. Illustration of docking process

consumes more than 80% of the total execution time, which is considered a big challenge for the performance of MD. This bottleneck could be resolved through the parallelization of this calculation in order to enhance the performance and the quality of the docking process. Recently, Graphics Processing Units (GPUs) have been playing an important role in the general purpose computing field. There currently exist several high-level frameworks to program GPUs instead of using low-level GPU APIs such as CUDA or OpenCL [10][11]. Although low-level APIs can achieve very good performance, they require high development and maintenance costs and raise portability issues: developers are required to write a specific code version for each potential target architecture. Most parallel docking solutions in the literature propose solutions with low-level GPU APIs such as CUDA or other languages like OpenCL [Ref], which limits the usability of those solutions. In this work, we propose a parallel BSO based metaheuristic for the MD problem. We propose a new approach that adds a layer to separate the code from the GPU architectures;


the aim is to propose a portable application which can run on heterogeneous system architectures without changing the code. To achieve this goal, we use the MapCG framework and the MapReduce [12] programming model to parallelize the BSO-based metaheuristic.

The remainder of the paper is structured as follows. In section II, we present the background needed to understand the remaining sections: we first briefly introduce the MapReduce model, which we have chosen as the parallel model, and then the MapCG framework used to implement this model on GPUs. In section III we introduce the sequential version of BSO for MD, then the parallel version using the MapReduce model. In section IV we give details on the experiments and discuss the obtained results. Finally, in section V we summarize the results and present some concluding remarks and work perspectives.

II. BACKGROUND

A. MapReduce

MapReduce is a parallel programming model designed to process large data sets on distributed systems (clusters). Initially used by Google for web page indexing, it was later popularized to deal with all kinds of information [13]. In this model, the user defines the computation in terms of a Map and a Reduce function, and the associated run-time system divides the computation in parallel across the different nodes of the cluster (see Fig. 2). Developers can use the MapReduce model to easily create parallel programs without worrying about the architecture on which the code will be executed, or about inter-machine communication. In the MapReduce model, the Map function divides and distributes the work over different nodes and produces a set of intermediate key/value pairs for each piece of data read; a specific library then groups all values which have the same key K and passes them to the Reduce function. The Reduce function processes and assembles the values provided by the Map function to form the result, possibly with a single output [12]. In general, the steps to write a MapReduce program are:

1) Choose an appropriate approach to split the data so that it can be parallelized by the Map function.
2) Choose a good key to use according to the problem.
3) Write the code for the Map function.
4) Write the code for the Reduce function.

Fig. 2 illustrates an example of text processed in parallel with the MapReduce model; the aim is to find the number of occurrences of each word in the input text file. First, the lines of the file are split into blocks. Then, in the "Map" phase, keys are created with an associated value; in this example a key is a word and the value is the number 1, indicating that the word is present once. All identical keys (same words) are then grouped together. Finally, in the Reduce phase, a treatment is performed on all the values of the same key (here, the values are added together to obtain the number of occurrences of each word).
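The following is a minimal Python sketch of the word-count example of Fig. 2, with a toy driver standing in for the run-time system; it illustrates the model only and is not MapCG code:

from collections import defaultdict

def map_fn(line):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: sum all values grouped under the same key.
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)
    for line in lines:                  # splitting phase
        for word, one in map_fn(line):  # map phase
            groups[word].append(one)    # grouping of identical keys
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(word_count(["put out", "out put", "put away"]))
# {'put': 3, 'out': 2, 'away': 1}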

B. MapCG Framework

MapReduce is a model introduced to create powerful parallel programs in an easy way. However, this model was designed only for distributed systems such as CPU clusters; it cannot be used directly for heterogeneous systems with GPU processors. To deal with this issue, the MapCG framework was created to allow programmers to write MapReduce parallel programs on GPUs [14]. With this framework, the programmer only needs to write one version of the program, defining the Map and Reduce functions; the MapCG run-time library then automatically generates the GPU version by source code translation. This operation ensures portability with a high level of abstraction [15]. Fig. 3 shows the MapCG framework architecture. The latter contains two parts. The first part provides programmers a unified, high-level parallel programming environment which allows them to write MapReduce code once. The second part is the MapCG runtime, which executes the MapReduce code efficiently on heterogeneous platforms and bridges the gaps between different hardware features. In the execution step, the input data is split into pieces by the Splitter() function, and these pieces are passed to the Map function. The Map function processes the data and emits intermediate (key/value) pairs using the MapCG_emit_intermediate() function. The intermediate pairs are then grouped and passed to the Reduce function, which emits data using the MapCG_emit() function. The data emitted by Reduce can then be obtained by invoking the MapCG_get_output() function. A hash table is used to group the key/value pairs on the GPU. Implementing this table on the GPU is hard, because the data must be dynamically allocated; for this purpose MapCG uses its own memory allocation system to dynamically allocate memory on the GPU and uses a closed-addressing hash table. Another problem is concurrent insertion: the MapCG framework uses a lock-free algorithm to solve it, which guarantees that insertion never gets blocked by any particular thread [16].

III. METHODOLOGY

A. BSO metaheuristic for MD

The Bees Swarm Optimization (BSO) metaheuristic is among the newest swarm intelligence algorithms. It is based on a swarm of artificial bees cooperating to solve an NP-hard problem [17]. BSO simulates the collective behavior of honey bees in nature. In this algorithm, an artificial bee named InitBee works out a first solution, named Sref, with some good features. We use Sref as a starting point to derive a group of disjoint solutions called the Search Area, in order to maximally exploit the search space using a certain strategy. Then, every bee takes one solution from the Search Area group, considers it as its starting point, and performs a local search (intensification) to look for other potential solutions. After that, every bee communicates the solutions found to all its neighbors through a table named the Dance Table. The best solution stored in the Dance Table becomes the new reference solution during the


Fig. 2. An example of a text file treated with MapReduce Model

Fig. 3 MapCG architecture overview

Fig. 4 Representation of ligand in the Molecular Docking problem

next iteration. A scoring function is used to choose reference solutions according to the quality of the solutions found, and a taboo list is used to store reference solutions in order to avoid cycles. The algorithm stops when the optimal solution is found or the maximum number of iterations is reached. To adapt BSO to the molecular docking problem, we have to define the solution encoding (the representation of our problem) and the scoring function.

The encoding of solutions: The molecular docking problem can be formulated as an optimization problem [18]. It can be mathematically expressed by a vector S and an objective function E(S):

Given: S = (x, y, z, a, b, c, r, t1, t2, t3, ..., tn)
Sought: S* such that E(S*) ≤ E(S) for every S

We search the set of potential solutions S, which is often very large, for the best solution S* by minimizing the scoring function E(S) calculated for each potential solution. S represents the vector of decision variables; it can be seen as a set of three fields, as shown in Fig. 4: S = (D1, D2, D3)

D1, Translation: the variables x, y and z represent the position of the center of the ligand.
D2, Orientation: the orientation is represented by a quaternion; a, b and c denote the vector of the rotation axis, and r denotes the rotation angle along this axis.
D3, Torsion: ti is the angle associated with the i-th rotatable bond, so n angles are required to represent all the rotatable bonds, where n is the number of rotatable bonds.

We use three parameters to denote the translation, four parameters for the orientation, and n parameters for the torsion, so 7 + n parameters are needed to represent the ligand's conformation against the protein target.

Fitness function: The fitness function (scoring function) calculates the affinity between the ligand and the protein; it is based on an energy prediction, where lower energy means better docking. In our solution we use the semi-empirical force field scoring function of the Autodock 4.2 framework [19], which involves four terms: the van der Waals, hydrogen bond, electrostatics, and solvation energies.
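A minimal sketch of this encoding as a data structure; only the 7 + n layout comes from the text, while the default values and the dimension helper are our additions:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Solution:
    # D1: translation of the ligand center.
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    # D2: orientation as an axis (a, b, c) plus rotation angle r.
    a: float = 0.0
    b: float = 0.0
    c: float = 1.0
    r: float = 0.0
    # D3: one torsion angle per rotatable bond (n entries).
    torsions: List[float] = field(default_factory=list)
    # Energy assigned by the scoring function; lower is better.
    energy: float = float("inf")

    def dimension(self) -> int:
        # 3 (translation) + 4 (orientation) + n (torsions) = 7 + n.
        return 7 + len(self.torsions)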

B. Parallel BSO for MD

In this section, we present the design and implementation of the parallel strategy for the BSO metaheuristic. We first describe the overall design with the MapReduce model, and then present the implementation with the MapCG framework on the GPU. All the BSO search algorithm steps, such as the determination of the reference solution, the determination of the regions,


and the neighbor search, are executed sequentially on the CPU, because their complexity is negligible compared to that of the evaluation step (the scoring function). The latter is parallelized using the MapReduce model, in which we calculate the free energy on the GPU efficiently by evaluating several solutions in parallel with Map workers. The main idea is to map the dance table, which contains the solutions, into several pieces, and then choose the best among all the solutions found in the first step with the Reduce function. Fig. 5 explains the details of our parallel approach. We parallelize the evaluation step of the BSO algorithm as follows. Initially, the dance table generated by the neighborhood step is split into several pieces Mi (for Mapper i) using the split function provided by the MapCG framework; the dance table contains all the solutions generated for one iteration. The number of pieces Mi is determined by the split function based on the GPU capacity (number of cores) and the data size. Each piece Mi is passed to a Map function (a Map worker) to evaluate the solutions of that piece. During processing, each Map function produces intermediate (key, value) pairs: the key represents the number of the piece Mi, and the value represents a solution S which contains the position, the orientation and the conformation. We add two elements to the vector of each solution: an index of the solution, and the energy associated with this solution calculated during this step. A solution S is thus represented as follows: S = (Index, Translation, Orientation, Rotation, Energy). These (key, value) pairs are sent by the intermediate function to the Reduce function. In this reduction step, the MIN function is applied by the Reduce function to find the best solution, i.e. the one with the smallest energy value, of each piece Mi. Finally, the pairs generated during the reduction step are emitted by the emit function, which calls the get output function in order to obtain the solution that has the best energy value over all Mi pieces by choosing the minimum energy among the different pairs. The pseudocode of the Map and Reduce functions is given in Algorithm 1:

Algorithm 1 The Map and Reduce pseudocode for the evaluation step of BSO
1: Map (String: key, String: value):
2:   count_i[K]
3:   for i = 0 to Mi do
4:     // Mi is a subset of solutions
5:     Evaluate(Si)
6:   end for
7:   emit_intermediate(...)
8: Reduce (String: key, Iterator* value):
9:   count_i[K]
10:  for j = 0 to K do
11:    Choose the solution with minimum energy()
12:  end for
13:  emit(count_i)
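Building on the Solution sketch above, the following is one possible Python rendering of Algorithm 1; the splitting policy, the piece count and the scoring_fn argument are our assumptions:

def map_evaluate(piece_id, piece, scoring_fn):
    # Map worker: score every solution of its piece of the dance table and
    # emit intermediate (piece_id, solution) pairs.
    for sol in piece:
        sol.energy = scoring_fn(sol)
        yield piece_id, sol

def reduce_min_energy(piece_id, solutions):
    # Reduce worker: keep only the lowest-energy solution of the piece.
    return piece_id, min(solutions, key=lambda s: s.energy)

def best_solution(dance_table, scoring_fn, pieces=4):
    # Split the dance table, run map and reduce per piece, then take the
    # global minimum over the per-piece winners.
    chunks = [dance_table[i::pieces] for i in range(pieces)]
    winners = []
    for pid, chunk in enumerate(chunks):
        scored = [s for _, s in map_evaluate(pid, chunk, scoring_fn)]
        if scored:
            winners.append(reduce_min_energy(pid, scored)[1])
    return min(winners, key=lambda s: s.energy)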

IV. EXPERIMENTS AND RESULTS

This section shows the experimental results. We compare the performance of the BSO algorithm implemented sequentially with the parallel version implemented with the MapCG framework. First, we describe the data set and the environment used for the evaluation, and then we show the experimental results. We use a GeForce GT 740M card; this GPU is a 1.03 GHz processor with 384 cores and 2 GB of memory. On the CPU side, we use an Intel Core i5, a 4 GHz processor with 4 cores. On the software side, we use C++ with the MapCG framework, and we run the experiments on Linux Mint 17.1 (64-bit), which was chosen for its stability and performance.

The results of this experimentation depend on two factors: the size of the data (large, medium, or small) and the number of iterations of the BSO metaheuristic. For the size of the data, two elements increase the complexity and therefore affect the results: first, the number of flexible residues of the ligand, known as the degrees of freedom; second, the number of atoms in the protein and the ligand. To calculate the interaction energy between the ligand and the protein, we must calculate the energy between each atom of the ligand and each atom of the protein for n iterations. The number of solutions generated depends on the size of the vector which represents a solution. The number of variables for the position and orientation fields is constant; however, it varies for the rotatable bonds field (flexible residues). Hence, the number of solutions depends on the number of flexible bonds in the residues, which in turn varies the number of regions (bees) and the population generated by the BSO search algorithm. Table I describes the population size according to the number of flexible bonds.

For the benchmark, we use a protein called Cutinase and a ligand called 3QPApropre from the RCSB protein data bank [20]. The protein has 63 atoms, while the ligand has 40 atoms and 4 flexible bonds. In this experiment we vary the number of iterations of the BSO algorithm and measure the execution time of the sequential and parallel versions of BSO; we then calculate the speedup obtained with the parallel version on the GPU with respect to the sequential execution on the CPU. Table II shows that the parallel BSO algorithm outperforms the sequential algorithm, especially for large data sizes, where we evaluate a huge number of solutions. We get better occupancy of the GPU when we increase the number of iterations. We obtained an average speedup of 10x with respect to the sequential algorithm. Indeed, the performance of the parallel algorithm with the MapCG framework on the GPU is better than that of the sequential algorithm on the CPU.


Fig. 5. Parallel BSO design with MapReduce Model

TABLE I. POPULATION SIZE ACCORDING TO DATA SIZE

Degree of freedom    Regions size    Dance table size
2                    192             36288
4                    384             145452
5                    480             226800
6                    576             326592

TABLE II. PERFORMANCE COMPARISON BETWEEN SEQUENTIAL AND PARALLEL BSO AND SPEEDUP

Data size    Sequential    Parallel    Speedup
small        140.91        15.51       9.08
medium       550.34        60.21       9.15
large        1236.23       115.51      10.7

V. CONCLUSION

In this article we proposed a new parallel method to explore the search space of the docking process and obtain the best position and conformation of a small molecule, called the ligand, with respect to a macromolecule. This method is based on a parallel bees swarm optimization (BSO) algorithm. We parallelized and implemented the calculation of the fitness function of BSO on GPU architectures with the MapCG framework. This framework is based on the MapReduce model, and our solution allows writing one version of the code which can be executed on both CPU and GPU without changing a single line. The results show a total speedup exceeding 10x for the evaluation stage with respect to the single-CPU version. The MapCG framework and the MapReduce model are a potentially fruitful area for future research on parallelizing metaheuristics on GPUs and CPUs, and a good tool to accelerate the docking process, which can help speed up the drug design process.

REFERENCES

[1] X.-Y. Meng, H.-X. Zhang, M. Mezei, and M. Cui, "Molecular docking: a powerful approach for structure-based drug discovery," Current Computer-Aided Drug Design, vol. 7, no. 2, pp. 146-157, 2011.
[2] S. F. Sousa, P. A. Fernandes, and M. J. Ramos, "Protein-ligand docking: current status and future challenges," Proteins: Structure, Function, and Bioinformatics, vol. 65, no. 1, pp. 15-26, 2006.
[3] B. Mukesh and K. Rakesh, "Molecular docking: a review," Int J Res Ayurveda Pharm, vol. 2, pp. 746-1751, 2011.
[4] I. Pechan and B. Feher, "Molecular docking on FPGA and GPU platforms," in Field Programmable Logic and Applications (FPL), 2011 International Conference on. IEEE, 2011, pp. 474-477.
[5] L. Guo, Z. Yan, X. Zheng, L. Hu, Y. Yang, and J. Wang, "A comparison of various optimization algorithms of protein-ligand docking programs by fitness accuracy," Journal of Molecular Modeling, vol. 20, no. 7, p. 2251, 2014.
[6] C.-L. Li, Y. Sun, D.-Y. Long, and X.-C. Wang, "A genetic algorithm based method for molecular docking," in Advances in Natural Computation, L. Wang, K. Chen, and Y. S. Ong, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 1159-1163.
[7] S.-Y. Yue, "Distance-constrained molecular docking by simulated annealing," Protein Engineering, Design and Selection, vol. 4, no. 2, pp. 177-184, 1990.
[8] V. Namasivayam and R. Günther, "PSO@AutoDock: A fast flexible molecular docking program based on swarm intelligence," Chemical Biology & Drug Design, vol. 70, no. 6, pp. 475-484, 2007.
[9] J. Fang, A. L. Varbanescu, B. Imbernon, J. M. Cecilia, and H. E. P. Sánchez, "Parallel computation of non-bonded interactions in drug discovery: NVIDIA GPUs vs. Intel Xeon Phi," in IWBBIO, 2014, pp. 579-588.
[10] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel computing experiences with CUDA," IEEE Micro, vol. 28, no. 4, pp. 13-27, 2008.
[11] J. E. Stone, D. Gohara, and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 3, p. 66, 2010.
[12] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72-77, 2010.
[13] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[14] C. Hong, D. Chen, W. Chen, W. Zheng, and H. Lin, "MapCG: writing parallel program portable between CPU and GPU," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2010, pp. 217-226.
[15] L. Liu, Y. Zhang, M. Liu, C. Wang, and J. Wang, "A-MapCG: An adaptive MapReduce framework for GPUs," in Networking, Architecture, and Storage (NAS), 2017 International Conference on. IEEE, 2017, pp. 1-8.
[16] C.-T. Hong, D.-H. Chen, Y.-B. Chen, W.-G. Chen, W.-M. Zheng, and H.-B. Lin, "Providing source code level portability between CPU and GPU with MapCG," Journal of Computer Science and Technology, vol. 27, no. 1, pp. 42-56, 2012.
[17] H. Drias, S. Sadeg, and S. Yahi, "Cooperative bees swarm for solving the maximum weighted satisfiability problem," in International Work-Conference on Artificial Neural Networks. Springer, 2005, pp. 318-325.
[18] Y. Liu, W. Li, Y. Wang, and M. Lv, "An efficient approach for flexible docking base on particle swarm optimization," in 2009 2nd International Conference on Biomedical Engineering and Informatics. IEEE, 2009, pp. 1-7.
[19] G. M. Morris, R. Huey, W. Lindstrom, M. F. Sanner, R. K. Belew, D. S. Goodsell, and A. J. Olson, "AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility," Journal of Computational Chemistry, vol. 30, no. 16, pp. 2785-2791, 2009.
[20] RCSB PDB, "Protein Data Bank," http://www.rcsb.org/.

ISBN: 1-60132-508-8, CSREA Press ©

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'19 |

211

Resilient and Hierarchical Controller Placement Problem for Collaborative Virtual SDN Services

Sakir Yucel
Wexford, PA, USA
[email protected]

Abstract – Controller placement problem in the software defined networking (SDN) is an optimization problem modeled after the classical facility location optimization problem. It has evolved to addressing the same problem in WAN, placing multiple controllers, building a hierarchy of controllers, covering multi-domain controllers, incorporating the resiliency of controllers, and recently trying to solve the controller placement problem in hypervisor and virtual network environments. Thanks to virtual network infrastructures, container execution environments and SDN, organizations are provisioning their workloads on on-premises, co-located, or public cloud infrastructures based on demand and service level objectives. The problem that we address in this paper is the collaborative controller placement problem, where infrastructure providers including network and service providers collaborate in offering end-to-end services to organizations and offer resources for running each other's SDN controllers. The collaborative controller placement problem is modeled after the collaborative facility location problem. In this paper, we develop mixed integer programming formulations for the problem. We conclude that it is not fundamentally different from the various forms of the controller placement problem, but it can be used in different settings where the infrastructure providers may charge for the resources and may create marketplace platforms. We outline such settings and further address hierarchy and resiliency scenarios in the model.

Keywords: Software Defined Network, Collaborative Controller Placement Problem, Virtual SDN Service

1 Introduction

Software-defined networking (SDN) separates the network control plane and the forwarding plane, enabling an SDN controller to program the routing decisions on multiple network elements. SDN provides the network operators with the capability to configure, manage, secure, and optimize network resources quickly through software, thereby enabling them to program the networks in a very fast and agile manner.

Network virtualization (NV) allows creating multiple virtual networks over physical network resources. Each virtual network could be tailored to the requirements of different applications and/or tenants. Virtual networks share the physical network resources.

SDNs evolved from a centralized control plane to being an enabler for virtual networks and network-as-a-service offerings. NV and SDN bring new business opportunities for network providers. One challenge for network and service providers is to effectively allocate resources and program the network elements with network flows that satisfy the customer requirements. Placing the SDN controllers in optimum locations of the SDN network is significant in timely programming of the network. This challenge has been identified as the "controller placement problem", an optimization problem that has evolved over time as NV and SDN progressed.

The controller placement problem evolved over time: the first solutions addressed placing a single controller in an SDN. It has evolved to addressing the same problem in WAN [1][2], placing multiple controllers [5], building a hierarchy of controllers [6][7][8], covering multi-domain controllers [9][10][11], incorporating the resiliency of controllers [14][15][16], and recently trying to solve the controller placement problem in hypervisor and virtual network environments [12][13]. Many studies have been done on the optimal placement of the controllers in different settings. Most studies formulate the problem as integer programming and offer some heuristics for approximations. The controller placement problem is NP-hard, and some heuristics and algorithms are presented in these studies [3][4].

In [17], we developed terminology and formulations for the controller placement problems in recent and innovative virtual SDN deployments for which no such formulations had been done before. In this paper, we extend that work to cover scenarios where various network, cloud and infrastructure providers may collaborate with each other and with the customers to offer flexible and dynamic virtual networks utilizing the SDN. The actual networks could be in the edge networks of corporations, in the networks of edge infrastructure providers, in the networks of public and private clouds, or in the networks of network providers. These networks could be wireline and wireless.


The original controller placement problem was characterized as a facility location problem [3][4]. The controller placement problem that we address in this paper is the collaborative controller placement problem, which can be characterized after the collaborative facility location problem. The collaborative facility location problem has been studied. [18] defines the model as: "The goal of the cooperative facility location model is to share distribution centres between participating carriers with the aim of reducing costs and improving distribution efficiency." It also states: "The cooperative carrier facility location model can be expected to lead to particular outcomes that differ from a traditional facility location setting. In the traditional application of facility location models, a large number of potential sites are considered, out of which typically a small number of sites are opened. In a DC sharing context, however, it is assumed that each carrier starts from a set of open DCs of which the number and locations are already (near to) optimal for this carrier when working independently. When considering collaboration, we hence start from this given set of opened facilities, and the model will investigate whether savings can be achieved from collaboration. These savings can only result from either keeping all existing DCs open, but finding a better allocation of transport routes, or from closing a number of DCs and reoptimising the allocation of transport routes."

Such collaborations are not new to the telecommunications industry. The Telecommunications Act of 1996 required ILECs (Incumbent Local Exchange Carriers) to unbundle their network elements and to provide them to other requesting operators. Similarly, it required the ILECs to provide the requesting operators with collocation of equipment necessary for interconnection or access to unbundled network elements. The industry saw virtual network operators coming up, renting facility and equipment from the ILEC and offering services on top of the rented infrastructure. The industry also saw network access points which served as meeting points where many providers pooled infrastructure and established peering agreements, diverting traffic from transit connections. Although unbundling and colocation requirements have been phased out, a network provider could still rent its infrastructure to virtual operators where it makes sense economically and business-wise. This model is more prevalent in wireless (such as the Google Fi project), where a virtual mobile operator rents infrastructure and wireless bandwidth from the wireless network provider and offers services.

With SDN and NV, the collaboration extends to a new dimension where colocation is not just for physical resources but includes virtualized VMs and containerized components. From corporations' perspective, virtual infrastructures including virtual network infrastructures allow them to provision their workloads where the resources best support the services and applications: on-premises, co-located, clouds, and network and other providers' infrastructure. Besides, they can move the workloads from place to place based on demand and set-forth policies. The boundaries between different network domains are blurring as corporations and different types of infrastructure providers collaborate to offer end-to-end services over the networks of various network, service, cloud, content delivery and other infrastructure providers.

The controller placement problem that we address in this paper is the collaborative controller placement problem in SDNs, where multiple infrastructure providers could collaborate in offering hypervisors and/or container execution environments for hosting SDN controllers. Any infrastructure provider can join this collaboration. The SDN controllers placed on the collaborated resources may belong to any infrastructure provider or the customer. The collaboration may extend to include the corporate customers of the collaborating infrastructure providers, in which case the infrastructure providers could run their SDN controllers within the corporation-managed networking infrastructures. In this paper, we consider the following scenarios for collaborative controller placement:

- Achieving economies of scale by sharing: This is a very general scenario where various infrastructure providers share devices, networks, other infrastructures and services for reducing the cost. Specifically for the collaborative controller placement problem, we envision that they collaborate in sharing hosts or container execution environments for running each other's controllers as well as the controllers of the customers of the collaborating providers.


- Accelerating demand for fixed-mobile convergence: Sharing of infrastructure, packet forwarding and processing platforms among operators is typical for this purpose. An example is sharing of services and appliances among service creation platforms, including edge devices in home networks and data centers. Similar sharing is possible in the access networks by the access network operators, including the non-traditional operators such as municipalities deploying FTTx. For the problem we are addressing in this paper, we envision the scenario where such networks are implemented with an SDN approach and collaborative controller placement is desired among the partners.

- New telco strategy: Network virtualization brings up the opportunity for providers to offer the network as a service over the network infrastructure. The network infrastructure owner opens up the network to service providers and application providers on top of the basic network resources. With this strategy, network providers aim to reinforce their position in the value chain of service platform offerings and open software ecosystems. When the network providers open such platforms, those platforms would present lower entry barriers for new service providers to develop and offer innovative services, thereby attracting more service providers over to the platforms. For the problem we are addressing in this paper, we envision a network provider owning an SDN network that serves multiple service providers. In this scenario, the network provider and service providers collaborate in offering hosts or container execution environments to host each other's SDN controllers.

- Virtual mobile network operators: This scenario is similar to the new telco strategy above but over the cellular networks. For the problem we are addressing in this paper, we envision an SDN approach for controlling the cellular networks.

- Wireless mesh networks: The wireless mesh networks built by commercial operators, communities and municipalities could support service providers and specialized operators to offer various services over such mesh networks. Similarly to the virtual mobile network scenario over cellular networks, we envision an SDN approach for controlling the wireless mesh networks.

- End-to-end services: In general, multi-domain controllers in WAN, on-premises and cloud domains need to communicate for end-to-end orchestration of the managed services. For such services, collaboration among providers and the enterprise would be beneficial to all.


- Programmability of CPE networks, access networks, data center networks, cloud networks: As SDN helps the evolution of service creation architectures, new scenarios are envisioned that will allow collaborating with commercial network providers on controlling these networks. For example, the provider may extend the services over to the CPE in the CORD (Central Office Re-architected as a Datacenter) model and may utilize the infrastructure elements of the customer. In this example, we envision an SDN approach for controlling the customer network where the customer collaborates with the provider in providing a place for the SDN controller.

- Move control to provider: In the opposite direction of the previous scenarios, SDN flexibly allows moving network control functions from the customer's site into the service provider's network. This could be done to simplify controlling the wide area network connectivities (MPLS VPN, IP VPNs, Internet access), as in a managed SD-WAN service.

- Other scenarios: We can see similar scenarios in other settings. For example, service providers becoming content providers through partnerships and joint ventures. Such business relationships allow them to control the networks of subsidiaries, partners and joint ventures. In a different example, a content delivery network (CDN) provider collaborates with content providers in using hosts.

We are not addressing the problem where a service provider, a service orchestrator or a service broker may provide a colocation service where telcos, cloud providers, CDN providers and enterprises could place their controllers. We are addressing the problem where some of these players collaborate in providing resources for running each other's controllers and may charge for this.

The models we develop in this paper could be used beyond the listed scenarios of employing SDN for network control. Not just in SDN environments, the models are generic enough to be usable in problems where some common controller or facilitator component is needed to provide some facility to a number of other components and collaboration is desired. One characteristic of an environment that can take advantage of this solution is support for virtualization or a container execution environment. Another characteristic is agility. With SDN, it is possible for a controller to be assigned to control a set of network elements quickly and then configure them at once, which is something that we couldn't do as fast in non-SDN IP environments. For example, our model should be easily applicable to service mesh applications like Istio, where mesh providers place an infrastructure container that acts as the networking/message broker among communicating containerized components.

2 Collaborative Virtual SDN Model and Definitions

We defined terminology and concepts on network virtualization in [17]. We reuse them and add a few more definitions. We also use some terminology and concepts from the collaborative facility location problem [18].


To help with defining terms, let us consider the scenario where a large enterprise has an SDN in a campus network in a branch and rents a virtual SDN from a service provider to connect the branch to other corporate locations. For large enterprises in this scenario, their SDN would include their campus network and the virtual SDN network over the provider's SDN network. There could be various business arrangements between the customer and the provider. In this scenario, controllers are instantiated to:

1. Manage the carrier network: SDN controllers are instantiated for controlling the provider network
2. Manage the customer network: The customer uses this instantiated controller to manage their SDN network

We will call the controller in the first item above a Provider Controller and denote it with PC. We will call the controller in the second item a Customer Controller and denote it with CC. Although the customer in this scenario is a large enterprise, the customer could be another provider. Customer controllers manage the virtual SDN networks of the respective customers, mainly for traffic engineering among data centers, corporate offices and the cloud connectivity. In this scenario, provider controllers control customer controllers for managing the customer virtual networks. Provider and customer controllers need to communicate when the provider needs to modify or create a new virtual network for the customer. For example, if part of the network is experiencing faults or security attacks, the provider may provision another virtual network for the customer on another part of its SDN network. This may be necessary in cases such as natural disasters, power outages, security bugs, and malicious attacks on the provider network.
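As an illustrative, non-normative sketch, the PC/CC relationship above can be captured with a simple object model; the class and method names are hypothetical:

```python
# Hypothetical object model for the PC/CC relationship described above.
class CustomerController:          # CC: manages one customer's virtual SDN
    def __init__(self, vsdn_id):
        self.vsdn_id = vsdn_id

class ProviderController:          # PC: manages the carrier SDN and its CCs
    def __init__(self):
        self.customer_controllers = []

    def provision_vsdn(self, vsdn_id):
        # E.g., after a fault or attack, carve out a new vSDN for the customer.
        cc = CustomerController(vsdn_id)
        self.customer_controllers.append(cc)
        return cc
```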

3 Controller Placement Problem for Collaborative Virtual SDN Model

The cooperative facility location problem addressed in this paper can be defined as a multi-company, multi-stage, capacitated facility location problem in which multiple sourcing is allowed [18]. Some network providers are open to cooperation in various forms, as we outlined in the Introduction section. Multiple network and service providers may participate in the collaboration by offering hypervisor hosts, physical servers, and container execution environments.

The providers have some centralized control logic for controlling the networks. In implementation, SDN controller instances could be deployed in distributed, hierarchical and resilient architectures. SDN controllers could be hosted by hypervisors, physical servers and container execution environments in various stacks outlined in [17]. When the SDN controllers are modeled as containerized components running in container execution environments, they could be provisioned, duplicated and migrated by taking advantage of container orchestration services such as Kubernetes. The challenge here is how to calculate their locations and configurations most optimally. We will address the single-stage (single hierarchy) and multi-stage (multiple hierarchy) cases.

3.1 Single Hierarchy

This collaborative model with one level of hierarchy is not fundamentally different from our original model in [17]. In the original model, network providers provide resources for hosting the SDN controllers. In this model, the resources are provided by any collaborating provider. In the end, some resources (hosts, container execution environments) are made available for the optimization algorithm to choose from. Even the collaborative facility location problem paper states that "[cooperation related constraint] can be eliminated and the model can be reduced to the classic formulation of a multi-product two-stage capacitated facility location model" [18]. There are some differences, though, as outlined below:

1. Cost: costs could differ when offered by multiple providers vs. when offered by the network provider alone. The cost related variables are therefore more complex in the collaborative model.
2. The collaborative model could be used in different settings, which we elaborate below.

The collaborative facility location problem paper states that the collaboration model could be used in different settings than the traditional facility location problem [18]: "each carrier starts from a set of open DCs of which the number and locations are already (near to) optimal for this carrier when working independently. When considering collaboration, we hence start from this given set of opened facilities, and the model will investigate whether savings can be achieved from collaboration. These savings can only result from either keeping all existing DCs open, but finding a better allocation of transport routes, or from closing a number of DCs and re-optimising the allocation of transport routes."


In one setting, the virtual network providers may choose on their own from the set of available hosts in the cooperation model, without considering the cost for all. This is similar to the original model in [17]. In a different setting, one centralized algorithm could be used to solve for all virtual networks, aiming to minimize the overall cost for all globally.

In the collaborative facility location problem, multi-sourcing is allowed, which means demand in one customer zone for a particular product type can be fulfilled from more than one DC. In the collaborative controller placement problem, multi-sourcing corresponds to a load-balancing and resiliency model where an SDN network element could be controlled by more than one (practically two) SDN controllers. This model has benefits in distributing the load among the controllers of the same provider as well as in increasing the availability of the control plane of the SDN network [12]. In practice, though, work is needed to achieve this load balancing and coordination among the controllers. This in turn should work together with redundancy in an active/active configuration.

Similar to the capacitated collaborative facility location model [18], the capacity of each SDN controller should be constrained because a controller can handle up to a certain number of SDN network elements. This capacity can be specified as:

1. The number of SDN network elements a controller could support
2. Virtual machine configuration: CPU and memory configuration, an instance configuration like [low, medium, high, premium] in the hypervisor, or the component configuration in the container execution environment [17]

Formulations for different objectives in [17] apply to the single-stage collaboration model, and for this reason, we will not present details. Additional considerations for the collaboration model are summarized below:

1. The notations for the SDN Network Specification should include the id of the provider, where the provider could be any infrastructure provider or the enterprise. Below are examples of how the id of the provider could be incorporated into the notations.

G(V,E)p: SDN network of provider p
Vp: Set of SDN network elements of provider p
hp: A host or an SDN network element of provider p capable of executing SDN controllers. This variable holds the capacity of the host.


2. A new variable for the cost of offering the host should be included in the notation. If the host is provided free of additional charges, then this variable would be 0.

Cph: The cost of host h offered by provider p.

3. The decision variables should include the ids of the providers. For example:

xc(cr0, hp) = 1, if vSDN controller cr of virtual SDN network r is hosted on host h of provider p at hierarchy level 0, meaning the resource requirements of cr can be satisfied by the available capacity of host h.
xv(vr, cr0, hp) = 1, if virtual SDN network element vr is controlled by vSDN controller cr on host h of provider p at hierarchy level 0.

4. Latency related objective functions from [17] are applicable. Host related objectives from [17] could be augmented to take in the cost of the offered hosts. Alternatively, objective functions for cost should be defined in addition to the existing ones. One objective function could be to minimize the cost for a single provider. Another objective function could be to minimize the cost of the collaborating providers:

argmin Σ_hp Σ_cr Σ_vr Cph · xc(cr0, hp) · xv(vr, cr0, hp)

where r is the virtual SDN network for which the objective function is trying to minimize the cost. The second objective function, which tries to minimize the cost for all the virtual SDN networks, should minimize the sum of these costs over all r's. The collaborative controller placement model opens up economic considerations such as profit maximization and pricing, in addition to the costs over the collaboration platform. Such considerations are outside the scope of this paper.
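For concreteness, here is a minimal sketch of a single-provider cost minimization in the spirit of the formulation above, using the PuLP MIP library with illustrative data. Note that the objective above multiplies xc and xv; a MIP solver needs such products linearized, so this sketch simply charges each host's cost once per placed controller, and all names and capacities are placeholders:

```python
# Sketch of the single-hierarchy cost objective (illustrative data shapes).
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

providers = ["p1", "p2"]
hosts = {"p1": ["h1", "h2"], "p2": ["h3"]}
cost = {("p1", "h1"): 4, ("p1", "h2"): 6, ("p2", "h3"): 5}   # Cph
controllers = ["c1", "c2"]      # vSDN controllers of network r, level 0
elements = ["v1", "v2", "v3"]   # virtual SDN network elements of r
CAP = 2                          # elements one controller can manage

prob = LpProblem("collab_ctrl_placement", LpMinimize)
# xc[c,p,h] = 1 if controller c is hosted on host h of provider p
xc = LpVariable.dicts("xc", [(c, p, h) for c in controllers
                             for p in providers for h in hosts[p]], cat="Binary")
# xv[v,c] = 1 if element v is controlled by controller c
xv = LpVariable.dicts("xv", [(v, c) for v in elements
                             for c in controllers], cat="Binary")

# Objective: total host cost of the placed controllers (linearized form)
prob += lpSum(cost[(p, h)] * xc[(c, p, h)] for c in controllers
              for p in providers for h in hosts[p])
for c in controllers:   # each controller sits on exactly one host
    prob += lpSum(xc[(c, p, h)] for p in providers for h in hosts[p]) == 1
for v in elements:      # each element is assigned to exactly one controller
    prob += lpSum(xv[(v, c)] for c in controllers) == 1
for c in controllers:   # capacity constraint on each controller
    prob += lpSum(xv[(v, c)] for v in elements) <= CAP
prob.solve()
```

For the multi-sourcing variant mentioned above, the per-element constraint would be relaxed to require two controllers per element instead of one.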

3.2 Multi-stage: Hierarchy of Controllers

In the collaborative facility location problem paper, a two-stage scenario was addressed [18]. We generalize the problem to multi-stage. Multi-stage in the collaborative controller placement problem refers to creating a hierarchy of controllers at multiple hierarchy levels. We additionally address the resiliency of the controllers at each hierarchy level. One constraint in the multi-stage scenario is that the number of controllers at each hierarchy level should be bounded. There are various applicable constraints:

1. Capacities of the SDN controllers at a higher layer in the hierarchy. This is the number of SDN controllers at layer k-1 that an SDN controller at layer k could support, and is determined by the capacity allocated to the SDN controller.


We assume it could be specified the same way as in the single-stage case, where an SDN controller at the leaf level has capacity for the number of SDN network elements it can control. Similarly, an SDN controller at layer k will have capacity for the number of layer k-1 SDN controllers that it can support.
2. The redundancy settings: The number of SDN controllers at each hierarchy level depends on the redundancy settings. Examples are N+1 and N+N, where N is the number of required SDN controllers.
3. The utilization settings: The provider may want to under-provision and therefore restrict the utilization of SDN controllers.

The number of controllers at each hierarchy level is related to the cost of deploying an SDN controller, similar to the fixed cost of having a facility in the facility location problem. For cost reasons, this number may need to be minimized. For resiliency and utilization reasons, this number may need to be constrained within a range.

Notation: In addition to the variables defined for the single-hierarchy case and the ones from [17], we introduce the following variables:

Rrk: Redundancy setting for virtual SDN network r at hierarchy level k. Its value is an enumeration over the possible settings of [N+1, N+N].
Xrk: The number of SDN controllers for virtual SDN network r at hierarchy level k.
Urk: Utilization setting for virtual SDN network r at hierarchy level k. Its value is a percentage.

Decision Variables: We will use the same decision variables as in the single-hierarchy case, with one difference. In the single-hierarchy case, the level of the hierarchy is 0, whereas in the multi-stage case the level is a variable identified by k.

Constraints: Due to lack of space, typical constraints are not included in this paper. We will state only some constraints related to the notations and decision variables introduced above.

K, the number of levels in the hierarchy, can be specified. If not specified, it is calculated to be equal to ceiling(log_n |Vr|). Vr is the set of virtual SDN network elements of vSDN network r and |Vr| is the number of them. n is the average number of SDN controllers needed and is equal to average(Xrk) over all k's in virtual SDN network r. At the leaf level, |Vr|/n controllers are needed. At level k, |Xk-1|/n controllers are needed, where |Xk-1| is the number of SDN controllers at level k-1. This goes the same all the way up to the highest level in the hierarchy, hence the log in the formula.

Xrk can be specified. If not specified, it is calculated. Its minimum value at the leaf hierarchy level is calculated as (|Vr| · Rrk) / (max(Xrk) · Urk), and similarly for higher levels, except that the number of SDN controllers at the lower hierarchy level should be used in the calculation rather than the number of SDN network elements. When Xrk is specified, its value should be bigger than the calculated value.

The redundancy setting for each level in the hierarchy could be specified for each virtual SDN network. For example, the redundancy setting could be specified by associating each virtual network element with two virtual controllers at the lowest level in the hierarchy and by enforcing N+1 redundancy for the virtual controllers at each level in the hierarchy. An N+1 redundancy setting for the leaf level in the hierarchy may not yield any solution, because the latency between the redundant SDN controller and all SDN nodes may not be within the acceptable limit if the nodes are dispersed over long latencies.

Objective Functions: Various latency and host related objective functions are developed in [17] for the general virtual SDN network model. Since the collaborative controller placement problem is not fundamentally different, the objective functions in [17] are applicable. In the multi-stage and resilient case, additional latency objective functions should be defined since a hierarchy of redundant controllers is to be created. They are to address:

1. Latencies between the redundant pairs at the same level
2. Latencies between an SDN controller at level k and its child controllers at level k-1

Similar to the single-hierarchy case, host related objective functions should be augmented with costs, or new cost related objective functions should be used in formulations. In the multi-stage and resilient case, the costs of SDN controllers at the higher layers in the hierarchy and the cost of redundant controllers for each level should be incorporated into the cost objective function. The formulation is similar to the one in the single-stage case.
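As a hedged illustration of the sizing rules above (this is our reading of the calculation; the per-controller capacity parameter and the redundancy handling are assumptions, not the paper's exact definitions):

```python
import math

def num_levels(num_elements, n):
    """K = ceiling(log_n |Vr|) when K is not specified explicitly."""
    return math.ceil(math.log(num_elements, n))

def controllers_at_level(children, capacity, utilization, redundancy="N+1"):
    """Minimum controllers at one level: enough capacity at the target
    utilization, plus redundant instances (N+1 adds one, N+N doubles)."""
    base = math.ceil(children / (capacity * utilization))
    return base + (base if redundancy == "N+N" else 1)

# Example: 200 leaf elements, each controller handles 20 at 80% utilization
print(num_levels(200, 13))                 # -> 3 hierarchy levels
print(controllers_at_level(200, 20, 0.8))  # -> 14 leaf controllers with N+1
```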

4 Conclusions and Future Work

Thanks to virtual network infrastructures, container execution environments and SDN, organizations are


provisioning their workloads on on-premises, co-located, or public cloud infrastructures based on demand and service level objectives. Infrastructure providers including network and service providers are collaborating in offering end-to-end services to organizations. The controller placement problem in SDN is an optimization problem modeled after the classical facility location optimization problem. The problem that we addressed in this paper is the collaborative controller placement problem, where infrastructure providers offer resources for running each other's SDN controllers. In this paper, we introduced the problem, defined relevant terms, developed notation for them, and developed mixed integer programming formulations. We conclude that it can be modeled after the collaborative facility location problem and that it is not fundamentally different from the various forms of the controller placement problem, but it can be used in different settings where the infrastructure providers may charge for the resources and may create marketplace platforms. We outlined such settings and further addressed the hierarchy and resiliency scenarios in the model. Our future work includes further developing optimization algorithms and heuristics, and evaluating them under the various scenarios described in the paper.

5 References

[1]. Peng Xiao, Wenyu Qu, Heng Qi, Zhiyang Li, Yujie Xu; "The SDN Controller Placement Problem for WAN", IEEE/CIC ICCC 2014 Symposium on Privacy and Security in Communications
[2]. Kshira Sagar Sahoo, Sampa Sahoo, Anamay Sarkar, Bibhudatta Sahoo and Ratnakar Dash; "On the Placement of Controllers for Designing a Wide Area Software Defined Networks", Proc. of the 2017 IEEE Region 10 Conference (TENCON), Malaysia, November 5-8, 2017
[3]. Stanislav Lange, Steffen Gebert, Thomas Zinner, Phuoc Tran-Gia, David Hock, Michael Jarschel, and Marco Hoffmann; "Heuristic Approaches to the Controller Placement Problem in Large Scale SDN Networks", IEEE Transactions on Network and Service Management, Vol. 12, No. 1, March 2015
[4]. Stanislav Lange, Steffen Gebert, Joachim Spoerhase, Piotr Rygielski, Thomas Zinner, Samuel Kounev, and Phuoc Tran-Gia; "Specialized Heuristics for the Controller Placement Problem in Large Scale SDN Networks", 2015 27th International Teletraffic Congress
[5]. Othmane Blial, Mouad Ben Mamoun, and Redouane Benaini; "An Overview on SDN Architectures with Multiple Controllers", Journal of Computer Networks and Communications, Volume 2016, Article ID 9396525
[6]. Bela Genge, Piroska Haller; "A Hierarchical Control Plane for Software-Defined Networks-based Industrial Control Systems", 2016 IFIP Networking Conference (IFIP Networking) and Workshops


[7]. Rinku Shah, Mythili Vutukuru, Purushottam Kulkarni; "Cuttlefish: Hierarchical SDN Controllers with Adaptive Offload", 2018 IEEE 26th International Conference on Network Protocols (ICNP)
[8]. Rinku Shah, Mythili Vutukuru, Purushottam Kulkarni; "Devolve-Redeem: Hierarchical SDN Controllers with Adaptive Offloading", APNet'17 Proceedings of the First Asia-Pacific Workshop on Networking
[9]. Franciscus X. A. Wibowo, Mark A. Gregory, Khandakar Ahmed, Karina M. Gomez; "Multi-Domain Software Defined Networking: Research Status and Challenges", Journal of Network and Computer Applications, March 2017
[10]. Tao Hu, Peng Yi, Zehua Guo, Julong Lan, and Jianhui Zhang; "Bidirectional Matching Strategy for Multi-Controller Deployment in Distributed Software Defined Networking", IEEE Access (Volume 6)
[11]. Kévin Phemius, Mathieu Bouet, Jérémie Leguay; "DISCO: Distributed Multi-domain SDN Controllers", 2014 IEEE Network Operations and Management Symposium (NOMS)
[12]. Andreas Blenk, Arsany Basta, Johannes Zerwas, Martin Reisslein, Wolfgang Kellerer; "Control Plane Latency With SDN Network Hypervisors: The Cost of Virtualization", IEEE Transactions on Network and Service Management, Vol. 13, No. 3, September 2016
[13]. Andreas Blenk, Arsany Basta, Johannes Zerwas, Wolfgang Kellerer; "Pairing SDN with Network Virtualization: The Network Hypervisor Placement Problem", 2015 IEEE Conference on Network Function Virtualization and Software Defined Network
[14]. Maryam Tanha, Dawood Sajjadi, Jianping Pan; "Enduring Node Failures through Resilient Controller Placement for Software Defined Networks", 2016 IEEE Global Communications Conference (GLOBECOM)
[15]. Bala Prakasa Rao Killi, Seela Veerabhadreswara Rao; "Controller Placement With Planning for Failures in Software Defined Networks", 2016 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS)
[16]. Nancy Perrot, Thomas Reynaud; "Optimal Placement of Controllers in a Resilient SDN Architecture", 2016 12th Int. Conference on the Design of Reliable Communication Networks (DRCN 2016)
[17]. Sakir Yucel; "Controller Placement Problem for Virtual SDN Services", The 25th Int'l Conf on Parallel and Distributed Processing Techniques and Applications (PDPTA'19)
[18]. Lotte Verdonck, Patrick Beullens, An Caris, Katrien Ramaekers and Gerrit K Janssens; "Analysis of collaborative savings and cost allocation techniques for the cooperative carrier facility location problem", Journal of the Operational Research Society (2016) 67, 853–871


Docker-based Platform for Real-time Face Recognition

Jongkwon Jang1, Sanggil Yeoum1, Moonseong Kim2, Byungseok Kang3, and Hyunseung Choo1
1 Department of Electrical and Computer Engineering, Sungkyunkwan University, South Korea {jangjk, sanggil12, choo}@skku.edu
2 Department of Liberal Arts, Seoul Theological University, South Korea [email protected]
3 Department of Electronics, Computing and Mathematics, University of Derby, UK [email protected]

Abstract - Face recognition has emerged as a replacement for existing methods of identifying people and is being used in various fields. With the development of devices and communication technologies, platforms using face recognition are becoming more diverse. In this paper, we propose a face recognition system composed of client devices (Smart Mobile Device, PC) and a server (Docker) for the IoT environment. The client offloads the face recognition job to the server platform and receives the decision result in real time. We virtualize the server through Docker to improve the performance as well as the functionality. The virtualized server provides fast response times when numerous clients request face recognition. Testbed measurements are carried out to compare the performance with a single server while increasing the number of clients. The proposed Docker-based face recognition platform shows only a small increase in response time and increases CPU and memory utilization by 18.56% and 36.3%, respectively.

Keywords: Face Recognition, Image Processing, Container, Internet of Things, Distributed System

1 Introduction

In order to identify a person, we normally use a password, Personal Identification Number (PIN), ID card, or key combination. These traditional methods are very vulnerable to object loss or hacking. To overcome this problem, face recognition using Deep Learning technology has emerged. Face recognition is a technology that converts the characteristics of a person's face into digital data and then identifies the person by comparing it to an existing database [1]. A typical use case is the elderly care center [2]. It is used in various domains such as medical welfare, criminal suspect detection, finance, and access control [3-5]. The face recognition field is developing greatly as world-class researchers continue to pay attention.

Most platforms that use face recognition process work such as face detection and face recognition on a Smart Mobile Device (SMD). As a result, the SMD performs high-throughput tasks, resulting in several problems such as heat generation and speed reduction. In general, high-throughput work is handled by a server. Node.js is a JavaScript-based software platform used to build these server environments.

In a Software Defined Network (SDN) environment, the server provides faster response times than when handling face detection and face recognition on a client device. However, when multiple client devices request service from one server at the same time, the server becomes overloaded. An overloaded server has a longer delay, which can result in far longer service times than when the work is handled by the client. For that reason, research on virtualization platforms has recently been actively conducted to decentralize the load on the server. Docker, a representative software, can solve the overload problem and significantly reduce response times.

We propose an efficient face recognition platform for the IoT environment. Our system supports face recognition and container technology in terms of both functionality and performance. For functionality, the proposed platform enables multi-face recognition in real time on images received from multiple client sources. For collecting the client data, we use APIs from Naver, a famous portal web site in South Korea [6]. A web page is provided to manage insertion, deletion, and modification of the information stored in the database. The web page provides an interface that allows the client to view photos taken by the SMD camera in real time and to manage the information of the person. For performance, the server takes full charge of high-throughput jobs, such as face detection and recognition. At the same time, Docker virtualizes the server to provide fast response times when several clients request service from the server.

2 Related Work

IoT connects various objects to the Internet by incorporating sensors and communication functions into them. Communication for the IoT environment is evolving with 5G, Cloud Computing and IoT application software [7]. At the same time, research that benefits people by combining IoT and face recognition is being actively carried out. A number of platforms have been launched that provide various functions such as face age estimation and face effects through face recognition, targeting the younger generation [8, 9]. However, face recognition requires a server machine because it requires high throughput/performance. Several software platforms are being developed to configure a network of servers and clients. Constructing these networks does not solve the problems that occur when multiple clients request services from a single server.


To solve this problem, distributed systems and virtualization technologies are actively researched.

Recently, several software platforms have been continuously developed for efficient network application development. Among them, Node.js has been in use and its infrastructure is expanding. Node.js is a key module that interprets JavaScript and uses a JavaScript engine called V8, developed by Google. Because it uses an asynchronous I/O model, it is possible to receive another task even while an existing operation is proceeding. However, the event model is based on a single thread. As a result, the delay time increases when the event queue has many callback functions [10]. Several methods have been proposed to compensate for these disadvantages of Node.js. A typical method is the Cluster Module provided by Node.js. Since Node.js uses only one core, resources are wasted on computers with multi-core CPUs. The Cluster Module creates N servers when the number of CPU cores is N. [11] uses the Cluster Module and Docker to evaluate server performance.

Virtual Machines (VM) and Containers are traditional examples of virtualization techniques. The VM is separated from the host OS by an operating system on top of the virtualized hardware and runs through a hypervisor. The hypervisor serves to help the VM run and manage the guest OS and to help host machines distribute resources to VMs. However, many studies have shown that VM technology lags far behind Container technology. The reason is that a VM has a Guest OS on top of the Host OS, while the Container only has the binaries required to run the application. This can cause the VM to take longer to configure the environment than the Container. One of the most representative Container technologies is Docker [12, 13]. Docker uses an image file to configure features and preferences for running the container. Multiple containers are created by running an image file. Docker solves the problems of a single server and builds a stable and efficient server environment. Docker-related research is currently active.
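As a rough Python analogue (not the paper's Node.js code) of the per-core scaling idea behind the Cluster Module, one worker server can be spawned per CPU core, each on its own port:

```python
# Spawn one worker server per CPU core, each listening on its own port.
import multiprocessing as mp
import socketserver

class Handler(socketserver.BaseRequestHandler):
    def handle(self):
        data = self.request.recv(4096)  # placeholder for a recognition request
        self.request.sendall(data)      # echo back instead of a real result

def serve(port):
    with socketserver.TCPServer(("0.0.0.0", port), Handler) as srv:
        srv.serve_forever()

if __name__ == "__main__":
    for i in range(mp.cpu_count()):     # N workers for N cores
        mp.Process(target=serve, args=(9000 + i,)).start()
```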

3 Proposed Platform

3.1 Platform Overview

We propose a platform that provides real-time multi-face recognition for clients using SMDs and PCs in the IoT environment. The platform consists of client devices (SMD, PC) and a server (Controller, Docker Containers). The PC is used to access the web page. The platform has the ability to insert and modify not only general person recognition data but also celebrity recognition data and client information stored in the server's database. In terms of performance, since the server is composed of several Docker Containers, even if multiple clients request face recognition from the server simultaneously, it provides real-time service without delay and distributes the server load. Docker Containers are dynamically assigned to clients. By using Docker Containers, many clients receive quick response times when requesting face recognition from the server.


As shown in Figure 1, a Docker Container consists of a DB Server, a Socket Server, and a Web Server. The Socket Server handles TCP socket communication with a client device. Across all TCP connections of the containers, the containers' IP addresses are the same and the ports are assigned differently. The DB Server stores various data such as images and personal information. When a client connects to a web page, the Web Server needs an IP address and port, just like the Socket Server. The Server Controller assigns the client to the appropriate container. The server also receives and processes requests from the client and uses the CFR API and News API provided by Naver to handle celebrity recognition.
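A minimal sketch (names and the least-loaded policy are our assumptions, not the paper's implementation) of a Server Controller that maps clients to containers sharing one IP but distinct ports:

```python
# Hypothetical sketch of the Server Controller's container assignment:
# every container shares the host IP but listens on a distinct port.
class ServerController:
    def __init__(self, host_ip, ports):
        self.host_ip = host_ip
        self.load = {p: 0 for p in ports}   # active clients per container port

    def assign(self):
        port = min(self.load, key=self.load.get)  # pick least-loaded container
        self.load[port] += 1
        return self.host_ip, port

    def release(self, port):
        self.load[port] -= 1

ctrl = ServerController("203.0.113.10", [9001, 9002, 9003])
ip, port = ctrl.assign()   # the client then connects to ip:port
```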

Figure 1. Platform Overall Structure

The client requests face recognition from the server using a client device (e.g., SMD or PC), and receives the face recognition result through the SMD and web page anytime and anywhere. The SMD is based on Android and works through an application created using Android Studio. The client takes a picture of the person whose information they want to know through the SMD and receives the information from the server. In addition, through the web page, it is possible to view, edit, delete, and insert real-time images, information of recognized persons, and information of unrecognized faces.

3.2 Platform Architecture

Figure 1 shows the flow between client and server. The client requests face recognition from the server through the SMD and PC. The server consists of a controller and several Docker Containers. Figures 2 and 3 show the two cases in which clients request face recognition from the server using SMD and PC.

The first case is when a client uses the SMD (see Figure 2). The client uses the SMD to take an image of the person whose information is wanted, saves it, and sends it to the server. The server assigns a container to the client through the controller. The container stores the transmitted images and sends them to the web page, while simultaneously detecting and recognizing faces. When face recognition is performed, the facial feature data previously stored in the DB are compared to the characteristic values in the image transferred from the client. In addition, we use the CFR API and News API provided by Naver to recognize celebrities. After identifying who the celebrity is through the CFR API, the current information about the celebrity is obtained through the News API. Finally, the container sends the information to the web page and SMD after the face detection and face recognition process is finished. The client confirms the received information through the SMD and web page.
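The comparison step can be pictured as a nearest-neighbor search over stored feature vectors; the following is an illustrative sketch (the threshold, vector size, and names are arbitrary), not the platform's actual recognizer:

```python
import numpy as np

def identify(probe, db_embeddings, db_names, threshold=0.6):
    """Return the closest enrolled identity, or None if no match is close enough."""
    dists = np.linalg.norm(db_embeddings - probe, axis=1)  # Euclidean distances
    best = int(np.argmin(dists))
    return db_names[best] if dists[best] < threshold else None

db = np.array([[0.1, 0.9, 0.2], [0.8, 0.1, 0.5]])
print(identify(np.array([0.12, 0.88, 0.21]), db, ["alice", "bob"]))  # -> alice
```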


Figure 2. Block diagram when client uses SMD

The second case is when the client uses a PC (see Figure 3). The PC is mainly used when clients access the web page. After accessing the web page, the client requests the server to correct or insert information for a person who is not recognized. The server assigns a container to the client through the controller. The container updates the database after completing the modification of the recognized person's information and the insertion of the unrecognized person's information. Finally, the container provides a pop-up notification on the web page and displays the changed information to show that the requested task is complete.

Figure 3. Block diagram when client uses PC

4 Performance Evaluation

We virtualize the server by using Docker Container technology. The server consists of three main components: a Socket Server, a Web Server and a DB Server. The Socket Server and Web Server run on the Node.js platform based on JavaScript syntax, with a MySQL server for the DB, and JavaScript, HTML, and CSS for the web page. The SMD client device is based on the Android Studio program. We experiment with the following cases to show that the server with Docker performs well. The experiment consists of three cases. The first case is the Single Server, which means one server created on the Node.js platform. The second is the Cluster Module Server case. The cluster module provided by the Node.js platform is used to create as many servers as the number of cores in the CPU. The last is the Docker Server case. In this case, multiple containers are created based on Docker Container technology.

Figure 4. Response time by case

Figure 4 shows a comparison of the performance of each case. Case 1 consists of one server. When many clients request face recognition with one server, a certain number of callback functions accumulate in the server's event queue. As the number of client requests increases, the response time accordingly increases at a faster rate than in Cases 2 and 3. Case 2 is composed of four servers because our computer has a quad-core CPU. The increase ratio of the response time is smaller than in Case 1. However, the client face recognition requests that can be handled by four servers are limited. Therefore, if the requests exceed what the four servers can accommodate, the response time increases greatly. Case 3 assigns a number of Docker Containers to clients appropriately. Since each Docker Container is assigned only as many clients as it can process, the response time is clearly shorter than in Cases 1 and 2. In Figure 4, the slope ratio of the cases is approximately 13:11:1, and the slope of Case 3 is the smallest. In other words, in Case 3 the increase in response time is smaller than in Cases 1 and 2 even as the number of clients increases. Therefore, the case with the best response time as the number of clients increases is Case 3 using Docker.
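A hypothetical helper for Case 3's setup, launching several containers from one image with distinct published ports (the image name and port numbers are placeholders, not the paper's configuration):

```python
import subprocess

def launch_containers(n, image="face-recog-server", container_port=9000):
    """Start n containers; each publishes its own host port to the same
    container port, matching the same-IP/different-port scheme above."""
    ids = []
    for i in range(n):
        out = subprocess.run(
            ["docker", "run", "-d", "-p", f"{9001 + i}:{container_port}", image],
            capture_output=True, text=True, check=True)
        ids.append(out.stdout.strip())  # `docker run -d` prints the container id
    return ids
```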

Figure 5. CPU utilization by case


Figure 6. Memory utilization by case

Figures 5 and 6 show CPU and memory utilization for each case, depending on the number of clients. For Cases 1 and 2, CPU and memory utilization remain smaller than in Case 3 even as the number of clients increases. This means that the Single Server and Cluster Module Server do not make efficient use of computer hardware resources. On the other hand, since Case 3 uses several Docker Containers, CPU and memory utilization increase. Even if a computer's performance is good, it is meaningless if its resources are not utilized properly. Depending on the number of clients, the average CPU utilization is larger by 18.56% and 6.53%, and the average memory utilization is larger by 36.3% and 20.3%, than in Cases 1 and 2, respectively. That is, Case 3 efficiently uses the hardware resources. We have demonstrated that the server with Docker performs better on three performance indicators: response time, CPU utilization, and memory utilization.

5 Conclusion

This paper proposes a platform addressing both functionality and performance aspects. On the functionality side, the client requests face recognition from the server through a client device (SMD, PC) and then receives the results from the server. The platform also provides a real-time multi-face recognition function and functions that recognize celebrities using the CFR API and News API provided by the Naver web site. In addition, through the web page, it is possible to check the images seen on the SMD in real time, and to confirm and modify the images and information of recognized persons. For those who are not recognized, images and information can be inserted. On the performance side, Docker provides clients with fast response times and distributes the load of the server to deliver improved performance. We compared the Single Server, Cluster Module Server and Docker Server and found that when using Docker, the increase in response time is the smallest, while average CPU and memory utilization are higher than in Cases 1 and 2. These performance indicators demonstrate the excellence of the proposed Docker server. Future research directions of this paper are to develop various service functions, such as linking face recognition with calendaring, and to improve performance by using the Cluster Module and Docker together.

6 Acknowledgment

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICT Consilience Creative program (IITP-2019-2015-0-00742) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation). Prof. Choo is the corresponding author.

7 References

[1] A. K. Jain, B. Klare, and U. Park, "Face recognition: Some challenges in forensics," In Face and Gesture 2011, 726-733, 2011.
[2] M. S. Hossain and G. Muhammad, "Cloud-assisted speech and face recognition framework for health monitoring," Mobile Networks and Applications, 20(3), 391-399, 2015.
[3] M. S. Hossain, "Patient state recognition system for healthcare using speech and facial expressions," Journal of Medical Systems, 40(12), 272, 2016.
[4] L. Y. Mano, B. S. Faiçal, L. H. Nakamura, P. H. Gomes, G. L. Libralon, R. I. Meneguete, and J. Ueyama, "Exploiting IoT technologies for enhancing Health Smart Homes through patient identification and emotion recognition," Computer Communications, 89, 178-190, 2016.
[5] N. H. Motlagh, M. Bagaa, and T. Taleb, "UAV-based IoT platform: A crowd surveillance use case," IEEE Communications Magazine, 55(2), 128-134, 2017.
[6] Clova Face Recognition (CFR), Naver Developers, https://developers.naver.com/products/clova/face (accessed on 10 July 2019).
[7] P. Hu, H. Ning, T. Qiu, Y. Zhang, and X. Luo, "Fog computing based face identification and resolution scheme in internet of things," IEEE Transactions on Industrial Informatics, 13(4), 1910-1920, 2017.
[8] P. M. Kumar, U. Gandhi, R. Varatharajan, G. Manogaran, R. Jidhesh, and T. Vadivel, "Intelligent face recognition and navigation system using neural learning for smart security in Internet of Things," Cluster Computing, 1-12, 2017.
[9] N. K. Jayant and S. Borra, "Attendance management system using hybrid face recognition techniques," In 2016 Conference on Advances in Signal Processing (CASP), 412-417, 2016.
[10] S. Tilkov and S. Vinoski, "Node.js: Using JavaScript to build high-performance network programs," IEEE Internet Computing, 14(6), 80-83, 2010.
[11] J. Zhu, P. Patros, K. B. Kent, and M. Dawson, "Node.js scalability investigation in the cloud," In Proceedings of the 28th Annual Intl. Conf. on Computer Science and Software Engineering, 201-212, 2018.
[12] C. Anderson, "Docker [software engineering]," IEEE Software, 32(3), 102-c3, 2015.
[13] C. Boettiger, "An introduction to Docker for reproducible research," ACM SIGOPS Operating Systems Review, 49(1), 71-79, 2015.


Controller Placement Problem for Virtual SDN Services

Sakir Yucel
Wexford, PA, USA
[email protected]

Abstract – Network Virtualization (NV) and Software Defined Networking (SDN) bring new business opportunities for network and service providers in innovative service creation and management. To capture opportunities and shine in the competitive network-as-a-service market, network and service providers need to excel in addressing the changing customer requirements and in the operations and management of the SDN resources. One challenge is to place SDN controllers in optimum locations of the SDN networks so that providers can effectively allocate resources and program the network elements with network flows that satisfy the customer requirements. This challenge has been identified as the "controller placement problem", an optimization problem that has evolved over time as NV and SDN progressed. We address the problem in modern and innovative service offerings where the providers could offer virtual networks to tenants. We introduce mixed integer programming formulations to decide where to place the SDN controllers for optimal management of the virtual tenant networks with respect to differing application requirements and scenarios. We introduce algorithms to satisfy the latency and cost related objectives.

Keywords: Software Defined Network, Controller Placement Problem, Virtual SDN Network, Virtual SDN Service

1 Introduction

Network virtualization (NV) allows creating multiple virtual networks over physical network resources. Each virtual network could be tailored to the requirements of different applications and/or tenants. Virtual networks share the physical network resources. Software-defined networking (SDN) separates the network control plane and the forwarding plane enabling an SDN controller to program the routing decisions on multiple network elements. Programming the network flows by centralized SDN controllers helps the network operators to configure the network flows based on set forth policies (security policy, traffic engineering and QoS policy) and provider’s business objectives. Further, analytics and intelligence could be employed by the controller to make better forwarding decisions. SDN controllers use southbound interfaces for programming the network elements with the calculated network flows. Northbound interfaces supported by SDN controllers could offer innovative APIs opening up many possibilities for dynamic service creation and management.

SDNs evolved from providing a centralized control plane to being an enabler for virtual networks and network-as-a-service offerings. NV and SDN bring new business opportunities for network providers. By combining NV and SDN, network providers can provision many virtual SDN networks (vSDNs) on their physical SDN network resources and offer them as network-as-a-service to multiple tenants. NV and SDN help reduce the cost of building and operating networks. They also offer flexibility of innovation for network and service providers to capture opportunities and shine in the competitive network-as-a-service market. To do that, network providers need to excel at addressing changing customer requirements and at the operations and management of the SDN resources. One challenge for network and service providers is to effectively allocate resources and program the network elements with network flows that satisfy the customer requirements. Placing the SDN controllers in optimum locations of the SDN network is significant for timely programming of the network. This challenge has been identified as the "controller placement problem", an optimization problem that has evolved over time as NV and SDN progressed. The first solutions addressed placing a single controller in an SDN. The problem then evolved to addressing the same question in WANs [1][2], placing multiple controllers [5], building a hierarchy of controllers [6][7][8], covering multi-domain controllers [9][10][11], incorporating the resiliency of controllers [14][15][16], and, recently, solving the controller placement problem in hypervisor and virtual network environments [12][13]. Many studies have addressed the optimal placement of controllers in different settings. Most formulate the problem as an integer program and offer heuristics for approximation. The controller placement problem is NP-hard, and therefore heuristics and algorithms are presented in these studies [3][4]. The problem we address in this paper builds on the existing research on the placement of controllers in virtual SDN networks, particularly extending the work in [12]. Our objective in this paper is to develop terminology and formulations for the controller placement problem in recent and innovative virtual SDN deployments, for which no such formulations have been done before.


Network and service providers have objectives of simplifying operations via streamlined procedures, achieving faster results, and thereby reducing overall cost and risk. Our work can be used by network/service providers for various business objectives, including:

• Effective management of service creation complexities: creating services using programmable, DevOps-style production environments and agile methodologies
• Offering multiple virtual networks to their customers as network-as-a-service

For the latter, we particularly consider the following network/service architectures and business agreements among network/service providers:

• The network provider offers the network resources to be shared by a plurality of service providers, while the network provider optimizes the SDN controller resources
• The service provider offers application specific/centric virtual networks through service chaining and other mechanisms, and by supporting differentiated QoS, security and SLAs for different applications (mission critical, real-time, etc.)
• The service provider provides fast lanes and better QoS to technology companies such as OTT providers over its Internet network (which is possible under the current network neutrality policy). The service provider provisions separate networks for such tech companies so they can manage their own networks and services, while the service provider optimizes the SDN controller resources
• The service provider offers hosts to the controllers of the virtual tenant networks while optimizing the SDN controller resources

In section Virtual SDN Model and Definitions, we describe the virtual SDN model and provide relevant definitions. In section Controller Placement Problem for Virtual SDN Model, we provide mathematical formulations for the problem and present approximation algorithms.

2 Virtual SDN Model and Definitions

In this section, we describe the virtual SDN model and provide definitions for its various components. A physical network element such as a switch or a router has physical resources such as ingress and egress ports and interfaces, packet processing resources (network processors), switching fabric, CPU, memory, buffers/queues, and other specialized hardware units for performing functions such as lookup, forwarding, queueing, packet scheduling, and modulating/demodulating.


Network function virtualization (NFV) supports executing many network functions over general-purpose, commodity servers without requiring purpose-built network element hardware. Our focus is not NFV per se. Our approach is general in assuming a virtualization layer (via a hypervisor) or a container execution environment that allows the network/service provider to create virtual network elements over a physical network element. A hypervisor provides a virtual environment for guest operating systems; examples are KVM, ESX and Hyper-V. A container execution environment is capable of running a container such as a Docker container. A physical link has a bandwidth (data rate) and a propagation delay associated with it. Multiple virtual links can be created over a physical link, each sharing the total bandwidth of the physical link, usually by bandwidth reservation. The propagation delay of all the virtual links on a physical link is assumed to be the same. An SDN network contains SDN network elements and links. SDN network elements include SDN routers and switches. We assume each SDN network element is capable of running a hypervisor or container environment, and hence can also be called a host or server (though we will not use the term server in this paper, to avoid confusion with network servers such as DNS servers). Therefore, each SDN network element is capable of hosting multiple instances of virtual SDN network elements. A virtual SDN (vSDN) network has virtual SDN network elements and virtual links. Virtual SDN network elements include virtual routers and virtual switches. A virtual switch represents network elements in a virtual network at layer 2, whereas a virtual router does so at layer 3. Virtual SDN network elements can come in various software stacks, such as the following:

1. Host (SDN network element) + hypervisor + virtual switch or virtual router
2. Host (SDN network element) + hypervisor + guest operating system + virtual switch or virtual router
3. Host (SDN network element) + hypervisor + nested hypervisor + guest operating system + virtual switch or virtual router
4. Host (SDN network element) + container execution environment + virtual switch or virtual router as containerized applications


In stack 2, multiple guest operating system instances can be deployed over the hypervisor, each running virtual switch or router software. In stack 3, a hypervisor can host multiple other hypervisors in a nested configuration; each nested hypervisor can host one or more guest operating systems, each running virtual switch or router software. In stack 4, multiple containerized virtual switch or router components can run over the container execution environment, as in the Linux Foundation's OVN deployments in OpenStack, Docker, and Kubernetes environments. The software stack of the network elements is an implementation detail as far as this paper is concerned and will not be elaborated further, except for including the additional latency in the formulations. In any of the stacks, resources are virtually partitioned to create multiple tenant virtual SDN networks. An SDN network thereby supports multiple virtual SDN networks, offering a multi-tenant network where the network provider can lease each virtual network to a separate tenant customer. An SDN network has SDN controller(s). A virtual SDN network has its virtual SDN controller(s), which we refer to as vSDN controllers. Thanks to multi-tenancy, each vSDN controller only sees its corresponding vSDN network elements and links. Hypervisors and/or container execution environments abstract the underlying physical SDN network (e.g., by providing topology information of virtual network elements) and support the multi-tenancy. Virtual SDN controllers may have similar stacks to the virtual SDN network elements. In addition, special hypervisors could support running multiple SDN controller applications, offering a vNIC and visibility into the virtual network, but this approach may not support the strong isolation needed for multi-tenancy. In this case, the hypervisor forwards the control traffic from the virtual switches towards the corresponding vSDN controller.

3 Controller Placement Problem for Virtual SDN Model

Our formulation of the controller placement problem for the virtual SDN model builds on the model in [12]. Our model enhances [12] in the following ways:

• It addresses the problem in recent applications of virtual network service as network-as-a-service offerings with SDN
• It incorporates differing objectives of the network providers. Objectives are defined separately in the formulation so that the optimization can be tailored to the specified objective.

Network providers may have differing objectives, such as the following.

Latency related objectives:
• Minimize the latency between vSDN controllers and virtual network elements, considering all the virtual networks of all tenants.
• Rather than minimizing the latencies, the network provider may want to keep all latencies below a threshold value.

Host related objectives:
• Minimize the number of physical servers used by the controllers. This is a typical advantage of virtualization: by running multiple instances of the vSDN controllers on the same physical server for different virtual SDNs, various benefits are sought, such as lower cost, ease of management, and ease of deployment. Under this objective, the network provider runs as many controller instances on the same hosts as the host capacity allows.
• The primary risk with the above objective is that multiple eggs end up in the same basket. The network provider may prefer some level of separation for resiliency and load balancing. The separation factor could be supplied as an input to the models and algorithms presented below.

Hierarchy related objectives:
• The network provider may want to create a hierarchy of controllers for scalability purposes.

Resiliency related objectives:
• The network provider may want to assure resiliency so that (a) no virtual network element is served by a single virtual controller at any time, and (b) no virtual controller at any level in the hierarchy is without a redundant hot standby.

Other objectives:
• Migration of a vSDN controller could be costly. The network provider may want to minimize the number of such migrations in general, or may want to limit their number in a given interval.

We thereby define the controller placement problem for the vSDN network provider as follows. The network provider provides virtual SDNs to tenants. Assume that the provider SDN network is capable of supporting the tenant virtual networks and that all tenant virtual networks have been calculated. The following requirements are specified:

1. There must be at least one vSDN controller per tenant virtual network.


2. A host (e.g., an SDN network element) can support n vSDN controllers, where n is specific to the host, based on its capacity (e.g., CPU and memory) and the resource requirements of the vSDN controllers.
3. The objective(s) of the network provider should be satisfied as much as possible.

3.1 Notations

Our work builds on [12], and we use the same notation where possible. Although we address hierarchy and resiliency related objectives in our separate work [17], we include some notation and variables on hierarchy and resiliency for consistency with that further work. In this paper, we assume all vSDN controllers are at the same level, namely level 0 in the hierarchy.

3.1.1 SDN Network Specification

G(V,E): the SDN network.
V: set of SDN network elements.
v: an SDN network element, v ∈ V.
E: set of SDN network links.
e: an SDN network link, e ∈ E.
H: set of SDN network elements capable of hosting vSDN controllers, H ⊆ V.
H̄: set of additional hosts capable of hosting vSDN controllers. These are not SDN network elements, and H ∩ H̄ = ∅.
h: a host or an SDN network element capable of executing vSDN controllers, h ∈ H or h ∈ H̄. This variable holds the capacity of the host in CPU and memory for hosting vSDN controllers. Note that packet processing, switching fabric and queues are not relevant to the controllers; they are relevant to the virtual network elements. For simplicity, the capacity is specified as an enumeration of [low, medium, high, premium].
δ(e): bandwidth and propagation delay on link e.

3.1.2 vSDN Network Specification

We assume the virtual SDN networks have been calculated and are supplied as input.

R: set of all vSDN networks.
G(V^r, E^r): vSDN network r.
V^r: set of virtual SDN network elements of vSDN network r, V^r ⊆ V.
v^r: a virtual SDN network element of vSDN network r.
E^r: set of virtual links in vSDN network r, E^r ⊆ E.
e^r: a virtual link in vSDN network r.
K: number of levels in the hierarchy.
C^r: set of vSDN controllers for network r.
c^r_k: a vSDN controller at hierarchy level k. For vSDNs with no hierarchy of controllers, k = 0 for all controllers. This variable holds the resource requirement of the controller, which depends on its level in the hierarchy, chosen from [low, medium, high, premium].
μ(v^r): mapping of virtual SDN network element v^r to SDN network element v.
t^r: maximum value of the latency for programming all virtual SDN network elements in r.
λ(e^r): latency on virtual link e^r in vSDN network r. The latency of each virtual link e^r is calculated from the bandwidth reserved for e^r on the physical link e and the propagation delay on the physical path. Latency due to the virtualization or container execution environment is assumed to be the same for all virtual links and is ignored.
Π^r: set of shortest paths between all virtual SDN network elements in r. Shortest paths are pre-calculated using λ(e^r), the latency of the virtual links, per virtual SDN network, not based on latency in the physical SDN network.
π(s,t)^r: shortest path between two virtual SDN network elements s and t in r, where π(s,t)^r ∈ Π^r.
d(s,t)^r: latency between two virtual SDN network elements s and t in r, where π(s,t)^r ∈ Π^r.

3.2 Formulations for Different Objectives

In this section, we provide mixed integer programming formulations for the latency and host related objectives.

3.2.1 Decision Variables

The decision variables are:

x_h(h) = 1 if host h is selected to host a vSDN controller.
x_c(c^r_k, h) = 1 if vSDN controller c^r of vSDN network r is hosted on host h at hierarchy level k, meaning the resource requirements of c^r can be satisfied by the available capacity of host h.
x_v(v^r, c^r_0, h) = 1 if virtual SDN network element v^r is controlled by vSDN controller c^r on host h at hierarchy level 0.

3.2.2 Objective Functions

We provide formulations for the latency and host related objectives outlined in the previous section. The network provider may use some of the objectives together, while some combinations cannot be used together because they cannot produce a solution.

Latency related objectives:

The main latency related objective is to place vSDN controllers so that the latencies between a vSDN controller and its vSDN network elements are kept at a reasonable level for each vSDN controller. This way, configuration changes are programmed to the virtual SDN network elements within the latency objectives of the network provider.


The network provider may have a service level agreement with the virtual network tenants to guarantee a maximum latency for programming the virtual SDN network elements. This maximum latency value could be different for each virtual SDN network; the variable t^r represents this maximum. We assume that a vSDN controller can program all of its virtual SDN network elements in parallel and that the latencies between the vSDN controller and its vSDN network elements overlap. Therefore, we take the maximum value among the latencies experienced by a vSDN controller when it is programming its virtual SDN network elements.

1. Minimize the maximum latencies over all virtual SDN networks:

$\operatorname{argmin} \sum_{r \in R} \max_{v^r \in V^r} \left( \sum_{c_0^r} \sum_{h} x_v(v^r, c_0^r, h)\, d(v^r, c_0^r) \right)$

The argument of the max is the set of all latencies for a given virtual SDN network; the argument of the outer sum is the set of maxima of the individual virtual SDN networks. The objective is to minimize the total sum of all maxima.

2. Minimize the average latency of each virtual SDN network:

$\operatorname{argmin} \sum_{r \in R} \frac{1}{|V^r|} \left( \sum_{v^r} \sum_{c_0^r} \sum_{h} x_v(v^r, c_0^r, h)\, d(v^r, c_0^r) \right)$

3. Minimize the average latency over all virtual SDN networks:

$\operatorname{argmin} \frac{1}{|R|} \sum_{r \in R} \left( \sum_{v^r} \sum_{c_0^r} \sum_{h} x_v(v^r, c_0^r, h)\, d(v^r, c_0^r) \right)$

If only these objectives are used, latency values could be higher than reasonable for some virtual SDN networks as long as the values for other virtual SDN networks are low, keeping the total at a minimum. For this reason, we do not propose using the above objective functions by themselves, but together with a constraint that bounds the maximum latency of each virtual SDN network by that network's maximum value, for which we provide a formulation later. When combined with that constraint, these objectives could still be pursued if the network provider sees value in overachieving the latency objectives beyond the values advertised or agreed in the SLA.

Another latency related objective is to control/minimize the latency among the vSDN controllers, as they communicate with each other. This latency is particularly critical in a hierarchy of controllers, where the higher level vSDN controllers communicate with the lower level vSDN controllers [17].

Host related objectives:

The network provider may want to minimize the number of hosts to reduce the overall cost of deploying and managing them; each host can add additional cost for the network provider (e.g., licensing, hardware resources, operations and management costs). Note that this objective is not about the number of vSDN controllers: that number is calculated by the controller placement algorithm and supplied as an input to our model. The objective is to minimize the number of hosts that execute all the vSDN controllers; the network provider may find an advantage when multiple vSDN controllers are hosted on the same host. This objective can be specified simply by minimizing the sum of the host selection decision variables, $\operatorname{argmin} \sum_{h} x_h(h)$.
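To make the latency objectives concrete, here is a minimal C++ sketch of how objective 1 could be evaluated for a fixed placement. The dense per-network latency vectors are an illustrative assumption: they stand for the d(v^r, c_0^r) values already resolved through x_v, with one controller and host per element.

```cpp
#include <algorithm>
#include <vector>

// Sketch: value of objective 1 for a fixed placement. perNetLatency[r][v]
// holds d(v^r, c_0^r) for the controller actually assigned to element v
// of vSDN r (i.e., the x_v terms are already resolved).
double sumOfMaxLatencies(
        const std::vector<std::vector<double>>& perNetLatency) {
    double total = 0.0;
    for (const auto& net : perNetLatency) {      // for each vSDN r
        double worst = 0.0;
        for (double d : net)                     // for each element v^r
            worst = std::max(worst, d);          // max latency in network r
        total += worst;                          // sum of the maxima
    }
    return total;
}
```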

Alternatively, the network provider may want to maximize the spread of the vSDN controllers over as many hosts as possible. The network provider may choose this objective for resiliency purposes, to reduce the number of vSDN controllers affected by the unplanned unavailability of a host. This objective is to minimize the number of vSDN controllers per host by deploying as many hosts as needed, constrained by the number of available hosts. It can be achieved by maximizing $\sum_{h} x_h(h)$, subject to the controller assignment and host capacity constraints.

The network provider may also choose to limit the number of vSDN controllers on a host even though the host has the capacity to execute more than the limit. The limiting number can be specific to each host, since each host can have a different capacity. In that case, the network provider can use the constraint $\sum_{r} \sum_{k} x_c(c^r_k, h) \le nc(h)$ for each host h, where nc(h) is the limiting number for host h.

3.3 Constraints

Due to lack of space, typical constraints are not included in this paper.
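For the reader's orientation, the following LaTeX block sketches the kind of typical constraints implied by the decision variables. These are our own assumed reconstructions, not the paper's formulation; cap(h) stands for the controller capacity of host h.

```latex
% Assumed, illustrative constraints (not the paper's own formulation):
% (1) every virtual element gets exactly one level-0 controller and host
\sum_{c_0^r}\sum_{h} x_v(v^r, c_0^r, h) = 1 \qquad \forall r,\ \forall v^r
% (2) an element can only use a controller on the host that runs it
x_v(v^r, c_0^r, h) \le x_c(c_0^r, h) \qquad \forall r,\ v^r,\ c_0^r,\ h
% (3) hosting any controller marks the host as selected
x_c(c_k^r, h) \le x_h(h) \qquad \forall r,\ k,\ h
% (4) host capacities are respected
\sum_{r}\sum_{k} x_c(c_k^r, h) \le cap(h)\, x_h(h) \qquad \forall h
```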


3.4 Algorithms

The models presented in the previous section are all based on the controller placement problem, which has been proven to be NP-hard. We devise two algorithms to provide reasonable solutions matching the different objectives of the network provider. The algorithms presented in this paper address the latency and host related objectives only; resilience and hierarchy related objectives are addressed in our other work [17].

Inputs to the algorithms:
• The SDN network specification
• The specifications for each virtual SDN network

Algorithm 1: Optimize for latency related objectives
1. Run the controller placement problem optimization per virtual SDN network, using the specific latency related objective.
2. Provision the vSDN controllers from the output of step 1.

The above algorithm is simple and provides a good solution for all tenant virtual SDN networks with respect to their number of controllers and their latency requirements. The first step uses an existing optimization, which yields the number of vSDN controllers along with the locations and resource requirements of the vSDN controllers. Since it employs the basic controller placement optimization, which is modeled after the classical facility location problem, the latencies are near optimum after running it. This illustrates that a simple algorithm is available to satisfy tenant requirements; with this algorithm, the network provider can meet and even exceed the SLA with the tenant customers. However, the output may be far from optimal for the network provider in terms of the host related objectives and the associated costs, in the sense that the service provider may need to provision a larger number of hypervisor instances than necessary, compared to a solution that optimizes the host related objectives. Certainly, some virtual networks overlap, and any host can execute multiple vSDN controllers; in the worst case, however, all the available hosts could be deployed to execute the vSDN controllers. Could the network provider use an algorithm that favors a smaller number of hosts while keeping the latencies within the limits? The second algorithm below serves that purpose.

Algorithm 2: Optimize for the minimum number of hosts
1. Combine all the virtual SDN network graphs into a single weighted graph, where the weight of a link is the number of edges mapped onto it considering all tenant virtual networks.
2. Normalize the weight and the other link related metrics (e.g., latency) into a single value, for example: weight*100 - latency_weight (a value less than 100) + other_weights (a value less than 10).
3. Run the optimization for the global graph, similar to the classical facility location problem, keeping the constraint that a host can support n vSDN controllers, where n is specific to the host.
4. Sort the locations by the number of tenants they belong to.
5. Let S be the set of all tenant virtual networks.
6. Repeat until S becomes empty (all tenants have at least one SDN controller), walking down the sorted list of locations: pop the location from the beginning and push it into the list of selected locations, then remove from S the tenant virtual networks that this location belongs to (see the sketch after this discussion).

After this algorithm, all hypervisor locations are determined and tenant virtual networks are assigned their hypervisors. From the service provider's perspective, the number of hypervisors shared by multiple tenant virtual networks is maximized; some tenant virtual networks get more than one hypervisor assignment. With this algorithm, tenant virtual networks may get fewer hypervisors compared to guaranteeing them the optimum number of hypervisors as in the first algorithm. This trade-off can be customized by passing different weight formulas to the algorithm. The weight assignment supports some flexibility and control over the facility location optimization: the operator can modify the weights (e.g., increase the weight of the latency to ensure the latency constraints of the tenant virtual networks are respected, or reduce the weight value by the number of links in all networks) to fine tune the facility location optimization.
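Steps 4 to 6 of Algorithm 2 amount to a greedy covering pass over the candidate locations. The C++ sketch below is our illustration under simple assumptions: a Location type carrying the ids of the tenant vSDNs it belongs to, and integer tenant ids. It also skips locations that no longer cover any uncovered tenant, a small refinement of the stated steps.

```cpp
#include <algorithm>
#include <set>
#include <vector>

struct Location {            // a candidate hypervisor location from step 3
    int id;
    std::set<int> tenants;   // tenant vSDNs whose virtual graphs contain it
};

// Greedy covering pass (steps 4-6): choose locations covering the most
// tenants first, until every tenant vSDN has at least one location.
std::vector<int> pickHosts(std::vector<Location> locs,
                           std::set<int> uncovered) {
    std::sort(locs.begin(), locs.end(),          // step 4: sort by coverage
              [](const Location& a, const Location& b) {
                  return a.tenants.size() > b.tenants.size();
              });
    std::vector<int> chosen;
    for (const auto& loc : locs) {               // step 6: walk the list
        if (uncovered.empty()) break;            // all tenants covered
        bool useful = false;                     // refinement: skip hosts
        for (int t : loc.tenants)                // covering no remaining
            if (uncovered.count(t)) { useful = true; break; }
        if (!useful) continue;
        chosen.push_back(loc.id);                // select this location
        for (int t : loc.tenants) uncovered.erase(t);
    }
    return chosen;
}
```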

4 Conclusions and Future Work

One challenge for network and service providers in capturing opportunities in the network-as-a-service market is to place SDN controllers in optimum locations of the SDN network so that providers can effectively allocate resources and program the network elements with network flows that satisfy the customer requirements. This challenge has been identified as the "controller placement problem", an optimization problem that has evolved over time as Network Virtualization (NV) and Software Defined Networking (SDN) progressed. We address the problem in modern and innovative service offerings where the providers offer virtual networks to tenants.


Virtual SDN networks are offered in a number of scenarios, and the controller placement problem is challenging in such settings due to the plurality of virtual tenant networks and their differing requirements. On top of this, network providers may want to impose their own objectives on the problem. In such complex optimization applications, where objective functions conflict with each other, we believe mixed integer programming formulations should separate the objective functions and allow the provider to choose the ones most relevant to their business. Since such optimizations are NP-hard, algorithms should be devised as approximations to the objective functions. Our contribution in this paper is to present new formulations that bring the controller placement problem up to date with the innovative and modern applications of SDN over virtual networks, and to demonstrate how different objective functions can be formulated and how algorithms can be devised to yield approximate solutions to them. The paper presents the mathematical formulations of the controller placement problem for virtual SDN services and describes algorithms that provide approximate solutions with some heuristics. Further evaluation of the algorithms and heuristics is the subject of our future work. Another area of further research is to apply the controller placement problem in component/container models.

5 References

[1]. Peng Xiao, Wenyu Qu, Heng Qi, Zhiyang Li, Yujie Xu; "The SDN Controller Placement Problem for WAN", IEEE/CIC ICCC 2014 Symposium on Privacy and Security in Communications.
[2]. Kshira Sagar Sahoo, Sampa Sahoo, Anamay Sarkar, Bibhudatta Sahoo and Ratnakar Dash; "On the Placement of Controllers for Designing a Wide Area Software Defined Networks", Proc. of the 2017 IEEE Region 10 Conference (TENCON), Malaysia, November 5-8, 2017.
[3]. Stanislav Lange, Steffen Gebert, Thomas Zinner, Phuoc Tran-Gia, David Hock, Michael Jarschel, and Marco Hoffmann; "Heuristic Approaches to the Controller Placement Problem in Large Scale SDN Networks", IEEE Transactions on Network and Service Management, Vol. 12, No. 1, March 2015.
[4]. Stanislav Lange, Steffen Gebert, Joachim Spoerhase, Piotr Rygielski, Thomas Zinner, Samuel Kounev, and Phuoc Tran-Gia; "Specialized Heuristics for the Controller Placement Problem in Large Scale SDN Networks", 2015 27th International Teletraffic Congress.
[5]. Othmane Blial, Mouad Ben Mamoun, and Redouane Benaini; "An Overview on SDN Architectures with Multiple Controllers", Journal of Computer Networks and Communications, Volume 2016, Article ID 9396525.
[6]. Bela Genge, Piroska Haller; "A Hierarchical Control Plane for Software-Defined Networks-based Industrial Control Systems", 2016 IFIP Networking Conference (IFIP Networking) and Workshops.
[7]. Rinku Shah, Mythili Vutukuru, Purushottam Kulkarni; "Cuttlefish: Hierarchical SDN Controllers with Adaptive Offload", 2018 IEEE 26th International Conference on Network Protocols (ICNP).
[8]. Rinku Shah, Mythili Vutukuru, Purushottam Kulkarni; "Devolve-Redeem: Hierarchical SDN Controllers with Adaptive Offloading", APNet'17 Proceedings of the First Asia-Pacific Workshop on Networking.
[9]. Franciscus X. A. Wibowo, Mark A. Gregory, Khandakar Ahmed, Karina M. Gomez; "Multi-Domain Software Defined Networking: Research Status and Challenges", Journal of Network and Computer Applications, March 2017.
[10]. Tao Hu, Peng Yi, Zehua Guo, Julong Lan, and Jianhui Zhang; "Bidirectional Matching Strategy for Multi-Controller Deployment in Distributed Software Defined Networking", IEEE Access, Vol. 6.
[11]. Kévin Phemius, Mathieu Bouet, Jérémie Leguay; "DISCO: Distributed Multi-domain SDN Controllers", 2014 IEEE Network Operations and Management Symposium (NOMS).
[12]. Andreas Blenk, Arsany Basta, Johannes Zerwas, Martin Reisslein, Wolfgang Kellerer; "Control Plane Latency With SDN Network Hypervisors: The Cost of Virtualization", IEEE Transactions on Network and Service Management, Vol. 13, No. 3, September 2016.
[13]. Andreas Blenk, Arsany Basta, Johannes Zerwas, Wolfgang Kellerer; "Pairing SDN with Network Virtualization: The Network Hypervisor Placement Problem", 2015 IEEE Conference on Network Function Virtualization and Software Defined Network.
[14]. Maryam Tanha, Dawood Sajjadi, Jianping Pan; "Enduring Node Failures through Resilient Controller Placement for Software Defined Networks", 2016 IEEE Global Communications Conference (GLOBECOM).
[15]. Bala Prakasa Rao Killi, Seela Veerabhadreswara Rao; "Controller Placement With Planning for Failures in Software Defined Networks", 2016 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS).
[16]. Nancy Perrot, Thomas Reynaud; "Optimal Placement of Controllers in a Resilient SDN Architecture", 2016 12th Int. Conference on the Design of Reliable Communication Networks (DRCN 2016).
[17]. Sakir Yucel; "Resilient and Hierarchical Controller Placement Problem for Collaborative Virtual SDN Services", The 25th Int'l Conf on Parallel and Distributed Processing Techniques and Applications (PDPTA'19).


Multi-start Parallel Tabu Search for the Blocking Job Shop Scheduling Problem

Adel Dabah
Division of Computer Science and Engineering, CERIST Research Center, Algiers, Algeria
[email protected]; [email protected]

Nadia Nouali Taboudjemat
Division of Computer Science and Engineering, CERIST Research Center, Algiers, Algeria
[email protected]

Abdelhakim AitZai
Department of Computer Science, University of Sciences and Technology Houari Boumedienne, Algiers, Algeria
[email protected]

Ahcene Bendjoudi
Division of Computer Science and Engineering, CERIST Research Center, Algiers, Algeria
[email protected]

Abstract—The Blocking Job Shop Scheduling (BJSS) problem is a version of the well known Job Shop Scheduling Problem (JSSP). It consists of scheduling a set of jobs on a set of machines without storage space, which induces blocking situations. This problem is very common in production chains, and solving it efficiently yields an important economic gain. In addition to its tremendous search space, which grows exponentially with the number of jobs and the number of machines, the BJSS is a particularly challenging problem due to the high ratio of infeasible to explored solutions. An efficient way to explore only feasible solutions is to use a Feasibility Recovery Strategy (FRS), which reorders the execution of some jobs to ensure feasibility. However, it slows down the search process considerably, incurring a huge time to explore a small area of the search space. For this reason, we propose in this paper an efficient multi-start parallel Tabu Search algorithm based on an FRS. The proposed parallelization can be viewed as a high level parallelization in which several instances of the TS algorithm explore the search space simultaneously. This parallel scheme exploits the computing power offered by a cluster-based supercomputer. Experiments on well known benchmark instances show the positive impact of the parallelization, not only in speeding up the execution time but also in improving the quality of the obtained solutions.
Index Terms—Job Shop, Blocking constraint, Parallel Tabu Search.

I. INTRODUCTION

The classical Job Shop Scheduling Problem (JSSP) is one of the most studied problems in the literature. It consists of scheduling a set of jobs on a set of machines, where each job has its own path through the machines. The execution of a job on a machine is called an operation. In several application areas (production chains, train scheduling, etc.), there is no storage space between machines, which leads to blocking situations: a job that has completed its processing time remains on the machine (blocking it from processing other jobs) until its next machine becomes available for processing. This is known as the blocking constraint. Thus, the JSSP with the blocking constraint is called the Blocking Job Shop

Scheduling (BJSS) problem. This problem appears to be even more difficult to solve than the classical JSSP, which is already an NP-hard optimization problem. In this paper, we address a particular case of the blocking constraint, namely blocking with swap allowed. A swap represents a deadlock situation that involves several jobs, where each job is waiting for a machine blocked by another job in the swap. In other words, we deal with a version of the BJSS problem where solutions that contain a swap situation are accepted. The BJSS problem has been treated by several authors using exact and approximate methods. On the one hand, the exact resolution of this problem [1], [3], [11] is impractical for large problem instances; on the other hand, most of the existing approximate approaches for this problem [1], [8], [12], [14]–[16] either do not take into account the specificity of the blocking constraint, or involve a random factor during the search, which induces low performance and a lack of stability for large problem instances. The Tabu Search (TS) method is one of the most widely used metaheuristics for combinatorial optimization problems. It explores the search space by moving from one solution to another using a neighborhood function. Due to the impact of the blocking constraint, the classical TS neighborhood (based on switching two concurrent critical operations) produces infeasible solutions in most cases, and therefore a low ratio of feasible to explored solutions and poor quality results. In order to ensure the feasibility of neighboring solutions, we proposed in [4] an FRS that re-schedules the execution of some jobs. However, the recovery step makes the TS algorithm very slow, inducing a huge execution time to explore a small area of the search space. To overcome this drawback, we propose in this paper a high level parallel TS approach that exploits the computing power of a cluster-based supercomputer. In this parallel approach, the search space is divided among several instances of the TS algorithm, where each instance explores its sub-search space independently from the others.


Thereby, a rigid synchronization and a single search strategy are used. Experiments on reference benchmarks show the positive impact of our parallel TS algorithm on expanding the explored search space and improving the quality of the obtained results: with the same execution time as the serial version, the search space explored by our parallel approach is 240 times bigger than that of our serial TS algorithm. By taking benefit from both the parallelism and the diversification gain, we have been able to improve the best results for a large number of benchmarks. The remainder of this paper is organized as follows: Section 2 introduces the blocking job shop scheduling problem and the model used to solve it. Section 3 describes our proposed parallel TS algorithm and its components. Section 4 discusses the computational results. Finally, conclusions and perspectives are presented in Section 5.

II. BLOCKING JOB SHOP SCHEDULING PROBLEM

In this section, we present the basic information related to the Blocking Job Shop Scheduling (BJSS) problem and the model used to solve it.

A. Problem Formulation

The classical Job Shop Scheduling Problem (JSSP) aims to schedule a set J of n jobs (J1, ..., Jn) on a set M of m machines (M1, ..., Mm). The execution of a job on a machine is called an operation, and we denote by O the set of all operations $(o_1, \ldots, o_{n \cdot m})$. Each operation $o_i$ needs to use a machine M(i) for an uninterrupted duration called its processing time $p_i$. Each job has its own sequence of crossing the machines, which creates a precedence constraint between successive operations of a job. Finally, this problem assumes disjunctive renewable machines, i.e., each machine can process at most one job at a given time. A solution (schedule) for this problem assigns starting and finishing times $t_i$ and $c_i$ to each operation $o_i$ (i = 1, ..., n·m) while satisfying all constraints. Our goal is to minimize the makespan $C_{max} = \max\{c_i : i = 1, \ldots, n \cdot m\}$. The JSSP assumes an unlimited capacity for intermediate buffers between machines, which is not possible for a lot of real world manufacturing systems. To model these systems, the BJSS problem is used. The latter is a version of the classical JSSP with no intermediate buffers, where a job has to wait on a machine (blocking it from treating other jobs) until its next machine becomes available for processing. There are two different cases of the BJSS problem, depending on the application area and the specification of the manufacturing system, namely blocking with swap allowed (BWS) and blocking with no swap (BNS). In blocking with no swap, a schedule with a swap situation is considered infeasible, while in BWS, all the operations in the swap move simultaneously to their subsequent machines. The swap situation is a specification of the blocking constraint; it represents a deadlock situation between several jobs. This means that each job in the swap is waiting for the liberation of a machine blocked by another job in the same swap. This

situation is represented by a zero length cycle in the alternative graph representation.

B. Alternative Graph Modelization

The BJSS problem can be modeled using the alternative graph representation introduced by Mascis and Pacciarelli [11], which is a generalization of the disjunctive graph of Roy and Sussman [17]. This model can be defined as a graph G = (N, F, A), where N represents the set O of nodes (operations) with two additional dummy nodes (start and finish) modeling the beginning and the end of a schedule; F represents the set of fixed arcs imposed by the precedence constraints, where each arc (q, p) ∈ F corresponds to a precedence relation between two consecutive operations of a job and $f_{qp}$ represents its length; and A is a set of alternative pairs representing the processing order for concurrent operations (operations that use the same machine). Each pair ((i, j), (h, k)) corresponds to the processing order between two concurrent operations, and $a_{ij}$ is the length of alternative arc (i, j). Each arc represents the fact that one operation must be completed before the machine can start processing the other operation. A selection S1 is a set of arcs obtained from A by choosing at most one arc from each pair. Let G(S1) = (N, F ∪ S1) be the graph representation of the selection S1. A selection S1 is considered feasible if there is no positive length cycle in G(S1), and its evaluation (makespan) is equal to the longest path in G(S1). We say that S1 is a complete selection if exactly one arc is chosen from each pair in A, so that |A| = |S1|. In this way, we define a schedule (a solution of the problem) as a complete feasible selection. Given a feasible selection S1, let l(i, j) be the length of the longest path from operation i to operation j in the graph G(S1). Finally, an ideal operation is an operation whose machine becomes immediately available after the end of its processing time; otherwise, the operation is called a blocking operation. In the latter case, we denote by σ(i) the operation immediately following $o_i$ in the same job.


Fig. 1. Alternative pairs between blocking and ideal operations.
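Both the feasibility test (no positive length cycle in G(S1)) and the evaluation (longest path) can be carried out with a Bellman-Ford style longest-path relaxation. The following C++ sketch is our illustration, assuming a simple adjacency-list representation of the fixed and selected arcs; it only detects positive cycles reachable from the start node, which suffices when the selection graph is connected from the start.

```cpp
#include <limits>
#include <vector>

struct Arc { int to; double len; };               // a fixed or selected arc

// Longest-path relaxation from node `start` (the dummy start node).
// Returns false if a positive length cycle is detected (infeasible
// selection); otherwise dist[finish] is the makespan of the selection.
bool longestPaths(const std::vector<std::vector<Arc>>& g, int start,
                  std::vector<double>& dist) {
    const double NEG = -std::numeric_limits<double>::infinity();
    dist.assign(g.size(), NEG);
    dist[start] = 0.0;
    for (std::size_t pass = 0; pass + 1 < g.size(); ++pass) // |N|-1 passes
        for (std::size_t u = 0; u < g.size(); ++u)
            if (dist[u] != NEG)
                for (const Arc& a : g[u])
                    if (dist[u] + a.len > dist[a.to])
                        dist[a.to] = dist[u] + a.len;       // relax upward
    for (std::size_t u = 0; u < g.size(); ++u)              // extra pass:
        if (dist[u] != NEG)
            for (const Arc& a : g[u])
                if (dist[u] + a.len > dist[a.to])
                    return false;     // still improving => positive cycle
    return true;
}
```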

1) Alternative Pair Generation: In the following, we describe the process used to generate all alternative pairs. Figure 1 shows the alternative pairs between a blocking operation $o_i$ and an ideal operation $o_r$, where M(i)

= M(r). Since these operations cannot be executed at the same time, we associate them with an alternative pair. The first alternative arc (σ(i), j), having length 0, represents the situation where $o_i$ is processed before $o_r$: since $o_i$ is a blocking operation, M(i) can begin processing $o_r$ only after the starting time of σ(i) (when $o_i$ leaves M(i)). The other alternative arc is based on the fact that $o_r$ is an ideal operation, which means that the machine M(r) becomes immediately available after the processing time $p_r$; thereby, the other alternative arc goes directly from r to i with length $p_r$.

III. THE PROPOSED PARALLEL TS ALGORITHM

In this section, we present our proposed parallel TS approach for the BJSS problem with swap allowed. The goal of the parallelization is to accelerate the TS method by dividing the search space into several parts, where each part is explored by a TS process (thread). For this reason, we begin this section by introducing our TS algorithm and its components, in addition to the proposed recovery strategy used by each parallel process during the search.

The TS algorithm is a local search metaheuristic used to solve combinatorial optimization problems. It was introduced by Glover in 1986 [5]. As depicted in Figure 2, the TS algorithm aims to reach the global optimum s* by moving, at each iteration, from one solution to another. This is done by exploring, for a solution s ∈ S, its entire neighborhood N(s); the best solution s′ in this neighborhood is selected as the new solution, even if its quality is worse than that of s. To avoid local optima, the latest k visited configurations are stored in a short-term memory, forbidding any move that results in one of these configurations. This memory is called the Tabu List (TL). For additional information on the method, the reader may refer to Glover [6], [7].

Fig. 2. Tabu Search trajectory.

Moreover, Algorithm 1 and Table I describe the general structure and the symbols used by each parallel instance of the TS algorithm.

TABLE I. THE DESCRIPTION OF THE SYMBOLS USED IN OUR TS ALGORITHM

Symbol     Description
s          a feasible schedule (solution).
S          the set of all feasible solutions.
Best       the makespan of the best solution (s*) found by the TS.
s*         the best solution found by the TS.
Cost(s)    the makespan of solution s.
sc         the best solution in Cand(s).
TL         the Tabu List.
Cand(s)    the set of unforbidden neighbor solutions of s.
N(s)       the set of all neighbor solutions of schedule s.

Algorithm 1 Pseudo-code of our proposed TS algorithm.
BEGIN
1. Generate an initial solution s ∈ S;
2. s* := s;
3. TL := ∅;
REPEAT
4. Cand(s) := {s′ ∈ N(s) | the move from s to s′ is not tabu OR s′ satisfies the aspiration criterion};
5. Generate a solution sc ∈ Cand(s);
6. Update TL (insert the move from s to sc into TL);
7. s := sc;
8. IF Cost(s) < Cost(s*) THEN s* := s;
UNTIL stop-criteria = true
RETURN s*;
END.
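As an illustration only, Algorithm 1's control flow could be transcribed into C++ as below. The solution type, the cost function, and the neighborhood generator are left as template parameters because they are the problem-specific machinery described in the next subsection; the move codes are assumed to be encoded as long integers.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Generic transcription of Algorithm 1. NeighFn must return a
// std::vector<std::pair<long, Sol>> of (move code, neighbor solution);
// CostFn returns the makespan of a solution.
template <class Sol, class CostFn, class NeighFn>
Sol tabuSearch(Sol s, CostFn cost, NeighFn neighbors,
               int maxIter, std::size_t tabuLen) {
    Sol best = s;                                    // s* := s
    std::deque<long> tl;                             // the Tabu List (TL)
    for (int it = 0; it < maxIter; ++it) {           // stop criterion
        auto cand = neighbors(s);                    // N(s)
        int pick = -1;
        for (int i = 0; i < (int)cand.size(); ++i) {
            bool tabu = std::find(tl.begin(), tl.end(),
                                  cand[i].first) != tl.end();
            bool aspire = cost(cand[i].second) < cost(best);
            if ((!tabu || aspire) &&                 // Cand(s) membership
                (pick < 0 || cost(cand[i].second) < cost(cand[pick].second)))
                pick = i;                            // best sc so far
        }
        if (pick < 0) break;                         // Cand(s) is empty
        tl.push_back(cand[pick].first);              // update the TL
        if (tl.size() > tabuLen) tl.pop_front();     // FIFO eviction
        s = cand[pick].second;                       // s := sc
        if (cost(s) < cost(best)) best = s;          // s* := s
    }
    return best;
}
```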

A. TS Algorithm for the BJSS Problem

Many authors in the literature [8], [9], [13] have tried to solve the BJSS problem using the TS algorithm. However, their results are generally of low quality, due to the inability to explore the search space efficiently. In the following, we describe our adaptation of the method to the BJSS problem. We begin by introducing the classical neighborhood structure (N1) used, and the steps of our proposed recovery strategy.

1) Neighborhood Structure: The results of the TS algorithm depend essentially on the neighborhood used. Our neighborhood function is based on the classical JSSP neighborhood (N1), where neighbor solutions are obtained by permuting two concurrent critical operations (two operations on the critical path that use the same machine). In the classical JSSP, the N1 neighborhood always produces feasible solutions and converges to a local optimum [18]. However, applying it to the blocking case produces infeasible solutions in 98% of cases. According to this classical neighborhood, a neighbor solution s′ of a solution s is obtained by permuting two successive critical operations $o_i$ and $o_j$ assigned to the same machine. Let

$i' = \begin{cases} \sigma(i) & \text{if } o_i \text{ is a blocking operation} \\ i & \text{if } o_i \text{ is an ideal operation} \end{cases}$

$j' = \begin{cases} \sigma(j) & \text{if } o_j \text{ is a blocking operation} \\ j & \text{if } o_j \text{ is an ideal operation} \end{cases}$

In the alternative graph representation, the neighbor solution s′ is obtained from s by replacing the alternative arc (i′, j) on a critical path in G(s) by its mate (alternative) (j′, i). In 98% of the cases, the neighbor solution s′ is not feasible, due to the existence of a positive length cycle. For this reason, a recovery strategy is unavoidable. In [8], [9], Gröflin and Klinkert propose a way to recover the feasibility of the neighbor solutions; however, the makespan quality of their neighbor solutions is low. To both restore feasibility and generate good quality neighboring solutions, we present in the following our proposed recovery strategy.

Fig. 3. The proposed sequential TS algorithm.

2) The Proposed Feasibility Recovery Strategy (FRS): Our goal here is to restore the feasibility of the neighbor solution s′ by re-scheduling the execution of some jobs. In other words, the proposed recovery strategy removes some of the arcs that cause the infeasibility (Step 1) and then uses a heuristic to complete the neighbor solution (Step 2). In the alternative graph, s′ represents a complete selection; therefore, A = ∅ and G(s′) = (N, F ∪ s′) is the corresponding graph. Let J(i) be the job containing operation $o_i$.

Step 1: Remove arcs that cause infeasibility. After replacing arc (i′, j) by its alternative (j′, i) in the neighborhood function, we remove from s′ all the incoming and outgoing alternative arcs of job J(i). The result of this step is a feasible partial selection s′ and a set of unselected alternative pairs in A.

Step 2: Reconstruct the neighbor solution. Given the partial selection s′ from Step 1, this step extends s′ at each iteration with a new alternative arc (l, m) until a new feasible solution is obtained. This arc is chosen using the AMCC (Avoid Maximum Current Cmax) heuristic. The idea behind this heuristic is to generate good quality neighbor solutions: the AMCC focuses on quality by avoiding the unselected alternative arc that would increase the makespan the most. Thereby, the AMCC selects the pair ((l, m), (e, f)) such that $l(0,e) + a_{ef} + l(f,n) = \max_{(u,v) \in A} \{\, l(0,u) + a_{uv} + l(v,n) \,\}$, and sets s′ := s′ ∪ {(l, m)}. In order to ensure the feasibility of s′ and to prevent positive length cycles, we must fix other alternative pairs related to the arc selected by the AMCC heuristic. This means we check, for each unselected alternative pair ((u, v), (p, q)) ∈ A, whether adding the arc (u, v) (resp. (p, q)) to s′ would produce an infeasible selection; if so, the arc (p, q) (resp. (u, v)) is added to s′. At the end of this step, we have a complete feasible neighbor solution s′.
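The AMCC selection in Step 2 can be sketched as follows in C++; the AltArc/AltPair types and the precomputed longest-path arrays l0 and ln are illustrative assumptions. The function identifies the arc to avoid, so the caller adds the other arc of the returned pair to the partial selection.

```cpp
#include <vector>

struct AltArc  { int from, to; double len; };    // an alternative arc
struct AltPair { AltArc first, second; };        // the pair ((i,j),(h,k))

// l0[v] : longest path l(0, v) from the start node to v
// ln[v] : longest path l(v, n) from v to the finish node
// Returns the index of the pair chosen by AMCC; `avoided` is set to the
// arc whose selection would increase the makespan the most, so the
// caller selects the *other* arc of that pair.
int amccSelect(const std::vector<AltPair>& unselected,
               const std::vector<double>& l0, const std::vector<double>& ln,
               AltArc& avoided) {
    int bestIdx = -1;
    double worst = -1e300;
    for (int p = 0; p < (int)unselected.size(); ++p)
        for (const AltArc* a : {&unselected[p].first, &unselected[p].second}) {
            double val = l0[a->from] + a->len + ln[a->to]; // l(0,u)+a_uv+l(v,n)
            if (val > worst) { worst = val; bestIdx = p; avoided = *a; }
        }
    return bestIdx;
}
```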

B. Taxonomy of Parallel Tabu Search Algorithm

In this section, we give a brief introduction to the classifications of parallel TS. Trienkens and Bruin [2] classified the parallelizations of the TS algorithm into two groups: low level and high level. Compared to the sequential TS version, a low level parallelization does not change the way in which the search space is explored; it is only faster. On the other hand, a high level parallelization has a different behavior from the sequential version, since it is based on several parallel TS threads (instances) exploring the search space simultaneously. Crainic et al. [2] introduced a taxonomy of parallel TS algorithms which remains the most used in the literature to date. This taxonomy is based on three dimensions. The first dimension, "control cardinality", defines whether the parallel TS trajectory is controlled by one processor (1-control, i.e., the search space is explored in the same way as in a sequential TS) or distributed among several processors (P-control). The second dimension, "control and communication type", manages the communication, the organization, and the


233

D. The Proposed High-level Parallelization Scheme

In this section, we describe the proposed parallel TS approach that exploits the computing power offered by a cluster-based supercomputer, along the aspects of the parallel TS taxonomy introduced above.

1) Control Cardinality and Communication: The proposed parallelization can be seen as a high level parallelization in which the search space is explored simultaneously by several TS processes, each of which uses the recovery strategy defined above. Therefore, a P-control model is considered. Each process has its own search path, which allows the search space to be explored efficiently; e.g., using 100 parallel processes, the explored search space is 100 times bigger than with the sequential version, which may allow the solution quality to improve. In our implementation, we opted for the rigid synchronization model, in which there are no communications between the parallel processes. The goal here is to diversify the search and to prevent the parallel processes from falling into the same area of the search space (taking the same path in the search space). Our parallel TS algorithm ends when all the parallel processes have completed their iterations; before that, each process sends its best explored solution to the process with id 0, which reports the final result of the parallel TS method.

2) Search Differentiation: The way the search space is explored depends essentially on two aspects: the initial solutions and the search strategies used.

Initial Solution: The initial solution represents the starting point of the TS algorithm. We can say that, regardless of the quality of the initial solution, the TS algorithm is able to reach a good final solution. A very simple way to generate the initial solution is to schedule the jobs sequentially in a random way, as explained in [16]. In our parallelization of the TS algorithm, we consider multiple starting points, which means that each parallel process (TS instance) creates its own randomly generated initial solution. The main goal is to divide the search space over all the parallel TS instances, aiming to explore this tremendous search space efficiently and to improve the quality of the results.

Search Strategies: The search strategy defines the search trajectory of the parallel TS algorithm. In our case, the trajectory is controlled by the heuristic used in the recovery step. Our parallel TS approach is based on a single search strategy model: all parallel processes use the same heuristic (AMCC) in the recovery step. The choice of the AMCC heuristic is motivated by its good results in terms of makespan quality.
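With MPI, which the implementation described in Section IV uses, the rigid synchronization scheme reduces to a single final reduction towards process 0. A minimal sketch, assuming a runTabuSearch() stand-in for one independent TS instance:

```cpp
#include <mpi.h>
#include <cstdio>

// Stand-in for one independent TS instance (Algorithm 1 plus the FRS);
// assumed to return the best makespan found from the given random seed.
static double runTabuSearch(unsigned seed) { return 1000.0 + seed % 97; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each process runs its own TS from a different random starting point.
    double myBest = runTabuSearch(1234u + static_cast<unsigned>(rank));

    // Rigid synchronization: the only exchange is this final reduction,
    // which delivers the overall best makespan to process 0.
    double globalBest = 0.0;
    MPI_Reduce(&myBest, &globalBest, 1, MPI_DOUBLE, MPI_MIN, 0,
               MPI_COMM_WORLD);
    if (rank == 0) std::printf("best makespan: %.0f\n", globalBest);

    MPI_Finalize();
    return 0;
}
```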


3) Tabu List: In order to avoid the trap of local optima, each parallel process uses a Tabu List (TL). The TL elements must include enough information to faithfully remember the visited solutions. The TL is a circular list updated at each iteration using a first-in-first-out (FIFO) strategy. Given a candidate arc (i, j) and its alternative pair ((i, j), (h, k)), the associated move consists of replacing the candidate arc (i, j) by its alternative arc (h, k). To avoid returning to the previous solution or going through the same solution a second time, we save both directions of the pair ((i, j), (h, k)) associated with the performed move. In order to simplify the management of the TL, we use a unique codification for each alternative pair; to check whether a move is forbidden, we search for the codification of this pair (move) in the TL. Using a global TL for all parallel processes would allow exploring the search space efficiently and would avoid several processes exploring the same search trajectory; however, the communication and synchronization cost of this strategy would be very high. In our parallel TS approach, the FRS used by each parallel process already prevents the processes from falling into the same search trajectory: even when the same move is performed by several parallel processes, it is very rare to come across the same solution. Thereby, there is no need for a global TL, which would engender a high communication cost.
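The per-process FIFO tabu list with a unique codification per alternative pair can be sketched as below; packing the two arc identifiers of the pair into one 64-bit code is our illustrative encoding choice.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>
#include <utility>

// FIFO tabu list storing one unique code per alternative pair.
// Both directions of a performed move map to the same pair code,
// so storing the pair forbids undoing the move.
class TabuList {
    std::deque<std::uint64_t> fifo;          // eviction order
    std::unordered_set<std::uint64_t> codes; // O(1) membership test
    std::size_t capacity;
public:
    explicit TabuList(std::size_t cap) : capacity(cap) {}

    // Unique code for the pair ((i,j),(h,k)): order the two arc ids so
    // that either direction of the move yields the same code.
    static std::uint64_t code(std::uint32_t arcA, std::uint32_t arcB) {
        if (arcA > arcB) std::swap(arcA, arcB);
        return (std::uint64_t(arcA) << 32) | arcB;
    }
    bool forbidden(std::uint64_t c) const { return codes.count(c) != 0; }
    void push(std::uint64_t c) {
        fifo.push_back(c);
        codes.insert(c);
        if (fifo.size() > capacity) {        // FIFO eviction
            codes.erase(fifo.front());
            fifo.pop_front();
        }
    }
};
```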

TABLE II
COMPARING OUR OBTAINED MAKESPAN RESULTS WITH THE BEST KNOWN SOLUTIONS FOR THE BJSS PROBLEM WITH SWAP ALLOWED. (⋆ = [3], ≀ = [15], ⋄ = [16])

Inst.   Best     Seq.TS   PTS     RC_seq   RC_best
La21    1433⋆    1506     1410     -6.3     -1.6
La22    1339⋄    1397     1303     -6.7     -2.7
La23    1436⋆    1501     1338    -10.8     -6.8
La24    1354⋆    1430     1399     -2.1      3.3
La25    1313⋆    1453     1343     -7.5      2.3
La26    1989⋄    2014     1914     -4.9     -3.8
La27    2005⋆    2034     1959     -3.6     -2.3
La28    2020⋆    1997     1878     -5.9     -7.0
La29    1725⋆    1849     1819     -1.6      5.4
La30    2049⋄    1994     1942     -2.6     -5.2
La31    2921≀    2813     2748     -2.3     -5.9
La32    3237≀    3064     2935     -4.2     -9.3
La33    2844≀    2709     2662     -1.7     -6.4
La34    2848≀    2823     2731     -3.2     -4.1
La35    2923≀    2850     2722     -4.5     -6.9
La36    1701⋆    1787     1685     -5.7     -0.9
La37    1848⋆    1838     1815     -1.2     -1.8
La38    1598⋆    1683     1606     -4.8      0.5
La39    1714⋆    1748     1705     -2.4     -0.5
La40    1714⋆    1729     1708     -1.2     -0.4

Among the most recent works on the BJSS problem with swap allowed (BWS) we mention the approximate parallel B&B proposed by Dabah et al. [3], the TS methods proposed by Gröflin et al. [8], [9], the IFS and CP-OPT algorithms proposed by Oddi et al. [15], and finally the Iterated Greedy (IG) algorithm proposed by Pranzo et al. [16]. To the best of our knowledge, these results are the best known solutions to date.

Table II shows the makespan results of our parallel TS approach for the BJSS problem with swap allowed. Column Inst. reports the instance names. Column Best gives the makespan and the reference of the best solution in the literature. Column Seq.TS reports the best makespan over four runs of our sequential TS method using 5,000 iterations. Column PTS reports the results of our parallelization approach using 240 CPU cores; our parallel TS algorithm thus contains 240 parallel processes, each using the same number of iterations as the serial version. Finally, columns RC_seq and RC_best report the relative change of our parallel results with respect to the sequential and best known results, computed as RC_seq = (PTS - Seq.TS) / Seq.TS × 100 and RC_best = (PTS - Best) / Best × 100. Negative values indicate an improvement in solution quality. For each instance, the bold underlined results indicate that our approach improves the current best solution, which is a very difficult goal to achieve.

The first thing to notice from Table II is the good performance of our sequential version, which uses our feasibility recovery strategy based on the AMCC heuristic. Even with such a low number of iterations, this version improves the best known results on 8 instances. Increasing the number of iterations of this serial version would probably further improve the quality of the obtained results; however, the time needed for each instance would grow unacceptably, especially for large instances. In such a situation, we have two options: either wait a huge amount of time for each instance to finish, or take advantage of the parallel TS method, which can explore several portions of the search space simultaneously and therefore covers a much larger search space within the same execution time as the sequential version.

The main conclusion from Table II is the positive impact of the parallelization on solution quality. Using the same running time as the sequential version, our parallel approach improves all the results of the sequential version, as indicated by column RC_seq. As reported in column RC_best, our parallel results show a substantial improvement over the best known results, especially for the large instances. More precisely, over the twenty benchmark instances, our parallel TS approach improves on the best known results for 16 instances. This improvement demonstrates the ability of our parallel TS algorithm not only to find good solutions, but to find them in a short amount of time. The deviation of our results from the best results in the literature on the unimproved instances (La24, La25, La29 and La38) ranges from 0.5% to 5.4%.
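As a concrete arithmetic check of the relative-change formulas, the short fragment below recomputes both metrics for instance La21, with values taken from Table II; it is an illustrative sketch of ours, not part of the authors' implementation.

#include <cstdio>

// Relative change, in percent, of a new makespan against a reference;
// negative values mean the new result improves on the reference.
static double relative_change(double new_makespan, double reference) {
    return (new_makespan - reference) / reference * 100.0;
}

int main() {
    // Values for instance La21, taken from Table II.
    const double best = 1433.0, seq_ts = 1506.0, pts = 1410.0;
    std::printf("RC_seq  = %.1f%%\n", relative_change(pts, seq_ts));  // -6.4 (Table II rounds this to -6.3)
    std::printf("RC_best = %.1f%%\n", relative_change(pts, best));    // -1.6
    return 0;
}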


Table III compares the sequential results (Seq.TS) with 50,000 iterations (best of ten executions) against the parallel results (PTS) with only 5,000 iterations (only one execution is performed).

TABLE III
MAKESPAN RESULTS OF THE PARALLEL AND SEQUENTIAL TS FOR THE BJSS PROBLEM WITH SWAP ALLOWED.

METHODS
Sequential TS: 50,000 iterations (best of 10 executions).
Parallel TS (PTS): 5,000 iterations (only one execution).

Instance   Size    PTS    Seq.TS
La21       15×10   1410   1467
La22       15×10   1303   1347
La23       15×10   1338   1442
La24       15×10   1398   1398
La25       15×10   1343   1373
La26       20×10   1914   1929
La27       20×10   1959   1960
La28       20×10   1878   1880
La29       20×10   1819   1803
La30       20×10   1942   1965
La31       30×10   2748   2715
La32       30×10   2935   2987
La33       30×10   2662   2672
La34       30×10   2731   2729
La35       30×10   2722   2776
La36       15×15   1685   1713
La37       15×15   1815   1802
La38       15×15   1606   1630
La39       15×15   1705   1697
La40       15×15   1708   1692

The time needed by our parallel TS for each instance is 100 times less than the time needed by the sequential TS. Even with such a small running time compared with the sequential approach, our parallel results obtained from a single execution are mostly better than the sequential results. Indeed, over the twenty most difficult instances of the Lawrence benchmark, the parallel results are better on 14 instances, which clearly indicates the huge benefit of using parallelism to explore the tremendous BJSS search space. The good results of our parallel version can be explained by the diversification gain: our parallel approach explores several areas of the search space simultaneously, which gives it a better chance of coming across good solutions. For this reason, it is more beneficial to use parallelization than to increase the number of iterations of the sequential version.

V. CONCLUSION
We discussed in this paper the parallelization of the Tabu Search method to efficiently solve the Blocking Job Shop Scheduling problem with swap allowed. This problem is one of the most challenging scheduling problems due to its high ratio of infeasible to explored solutions. For this reason, a feasibility recovery strategy is used; however, the latter considerably slows down the search process. We therefore proposed a high-level parallel TS approach that exploits the computing power offered by a cluster-based supercomputer. This approach divides the search space over several TS instances that explore it simultaneously. The obtained results show the significant impact of using parallel architectures not only to accelerate the exploration process, but also to improve the quality of the obtained results. Indeed, the diversification gain allowed us to improve on most of the state-of-the-art results for the large Lawrence instances, which confirms the efficiency of the proposed parallel approach. As a perspective, we plan to extend this approach to the BJSS problem without swap and to explore other parallelization models of the TS algorithm.

REFERENCES

[1] Abdelhakim AitZai, Brahim Benmedjdoub, and Mourad Boudhar. A branch and bound and parallel genetic algorithm for the job shop scheduling problem with blocking. International Journal of Operational Research, 14(3):343–365, 2012.
[2] Teodor Gabriel Crainic, Michel Toulouse, and Michel Gendreau. Toward a taxonomy of parallel tabu search heuristics. INFORMS Journal on Computing, 9(1):61–72, 1997.
[3] Adel Dabah, Ahcene Bendjoudi, and Abdelhakim AitZai. Efficient parallel B&B method for the blocking job shop scheduling problem. In 2016 International Conference on High Performance Computing & Simulation (HPCS), pages 784–791. IEEE, 2016.
[4] Adel Dabah, Ahcene Bendjoudi, and Abdelhakim AitZai. An efficient tabu search neighborhood based on reconstruction strategy to solve the blocking job shop scheduling problem. Journal of Industrial and Management Optimization, 13(4):2015–2031, 2017.
[5] Fred Glover. Future paths for integer programming and links to artificial intelligence. Computers & Operations Research, 13(5):533–549, 1986.
[6] Fred Glover. Tabu search - Part I. ORSA Journal on Computing, 1(3):190–206, 1989.
[7] Fred Glover. Tabu search - Part II. ORSA Journal on Computing, 2(1):4–32, 1990.
[8] Heinz Gröflin and Andreas Klinkert. A new neighborhood and tabu search for the blocking job shop. Discrete Applied Mathematics, 157(17):3643–3655, 2009.
[9] Heinz Gröflin, Dinh Nguyen Pham, and Reinhard Bürgy. The flexible blocking job shop with transfer and set-up times. Journal of Combinatorial Optimization, 22(2):121–144, 2011.
[10] S. Lawrence. Resource constrained project scheduling: an experimental investigation of heuristic scheduling techniques (supplement). Graduate School of Industrial Administration, 1984.
[11] Alessandro Mascis and Dario Pacciarelli. Job-shop scheduling with blocking and no-wait constraints. European Journal of Operational Research, 143(3):498–517, 2002.
[12] Yazid Mati, Nidhal Rezg, and Xiaolan Xie. A taboo search approach for deadlock-free scheduling of automated manufacturing systems. Journal of Intelligent Manufacturing, 12(5-6):535–552, 2001.
[13] Yazid Mati and Xiaolan Xie. Multiresource shop scheduling with resource flexibility and blocking. IEEE Transactions on Automation Science and Engineering, 8(1):175–189, 2011.
[14] Carlo Meloni, Dario Pacciarelli, and Marco Pranzo. A rollout metaheuristic for job shop scheduling problems. Annals of Operations Research, 131(1-4):215–235, 2004.
[15] Angelo Oddi, Riccardo Rasconi, Amedeo Cesta, and Stephen F. Smith. Iterative improvement algorithms for the blocking job shop. In ICAPS, 2012.
[16] Marco Pranzo and Dario Pacciarelli. An iterated greedy metaheuristic for the blocking job shop scheduling problem. Journal of Heuristics, 131:587–611, 2015.
[17] Bernard Roy and B. Sussmann. Les problèmes d'ordonnancement avec contraintes disjonctives. Note DS, 9, 1964.
[18] Peter J. M. Van Laarhoven, Emile H. L. Aarts, and Jan Karel Lenstra. Job shop scheduling by simulated annealing. Operations Research, 40(1):113–125, 1992.

Atomic Commitment Protocol in Distributed Systems with Fail-Stop Model

Sung-Hoon Park, Dept. of Computer Engineering, Chungbuk National Univ., Chungbuk, Korea, [email protected]
Su-Chang Yoo, Dept. of Computer Engineering, Chungbuk National Univ., Chungbuk, Korea, [email protected]

Abstract— This paper defines the Non-Blocking Atomic Commitment problem in a message-passing asynchronous system and determines a failure detector to solve the problem. This failure detector, which we call the modal failure detector star and denote by M*, is strictly weaker than the perfect failure detector P but strictly stronger than the eventually perfect failure detector ◊P. The paper shows that, in any environment, the problem is solvable with M*.

Based on these properties, they defined several failure detector classes: perfect failure detectors P, weak failure detectors W, eventually weak failure detectors ◊W, and so on. In [3] and [4] they studied the "weakest" failure detector to solve Consensus. They showed that the weakest failure detector to solve Consensus with any number of faulty processes is Ω+Σ, and the one with the number of faulty processes bounded by ⎡n/2⎤ (i.e., fewer than ⎡n/2⎤ faulty processes) is ◊W. After the work of [8], several studies followed. For example, the weakest failure detector for stable leader election is the perfect failure detector P [4], and the one for Terminating Reliable Broadcast is also P [1,3]. In the work closest to ours, Guerraoui and Kouznetsov presented a failure detector class for the mutual exclusion problem that differs from the above weakest failure detectors. This failure detector, called the Trusting failure detector, satisfies three properties, i.e., strong completeness, eventual strong accuracy and trusting accuracy, so that it can solve the mutual exclusion problem in asynchronous distributed systems with crash failures. They used the bakery algorithm to solve the mutual exclusion problem with the trusting failure detector.

1.2 Road Map
The rest of the paper is organized as follows. Section 2 addresses motivations and related works, and Section 3 overviews the system model. Section 4 introduces the modal failure detector star M*. Section 5 shows that M* is sufficient to solve the problem. Section 6 concludes the paper with some practical remarks.

Keywords—Asynchronous Distributed Systems, Atomic Commitment, Fault Tolerance, Mobile Computing System.

I. INTRODUCTION
We address the fault-tolerant Non-Blocking Atomic Commitment problem, NB-AC for short, in an asynchronous distributed system where communication between a pair of processes is by a message-passing primitive, channels are reliable, and processes can fail by crashing. To ensure transaction failure atomicity in a distributed system, an agreement problem must be solved among a set of participating processes. This problem, called the Atomic Commitment problem (AC), requires the participants to agree on an outcome for the transaction: commit or abort [5,11,12,17]. When it is required that every correct participant eventually reach an outcome despite the failure of other participants, the problem is called Non-Blocking Atomic Commitment (NB-AC) [2,6]. The problem of Non-Blocking Atomic Commitment becomes much more complex in distributed systems (as compared to single-computer systems) due to the lack of both a shared memory and a common physical clock, and because of unpredictable message delays. Evidently, the problem cannot be solved deterministically in a crash-prone asynchronous system without any information about failures: there is no way to determine whether a process has crashed or is just slow, so no deterministic algorithm can guarantee Non-Blocking Atomic Commitment in such a system. In this sense, the problem stems from the famous impossibility result that consensus cannot be solved deterministically in an asynchronous system that is subject to even a single crash failure [7].

1.1 Failure Detectors
In this paper, we introduce a modal failure detector M* and show that the Non-Blocking Atomic Commitment problem is solvable with it in environments with a majority of correct processes. The concept of (unreliable) failure detectors was introduced by Chandra and Toueg [3,4], who characterized failure detectors by two properties: completeness and accuracy.

II. MOTIVATIONS AND RELATED WORKS
The main difficulty in solving the Non-Blocking Atomic Commitment problem in the presence of process crashes lies in the detection of crashes. As a way of getting around the impossibility of Consensus, Chandra and Toueg extended the asynchronous model of computation with unreliable failure detectors and showed in [4] that the FLP impossibility can be circumvented using failure detectors. More precisely, they have shown that Consensus can be solved (deterministically) in an asynchronous system augmented with the failure detector ◊S (Eventually Strong) and the assumption of a majority of correct processes. Failure detector ◊S guarantees Strong Completeness, i.e., eventually, every process that crashes is permanently suspected by every process, and Eventual Weak Accuracy, i.e., eventually, some correct process is never suspected. Failure detector ◊S can, however, make an arbitrary number of mistakes, i.e., false suspicions. The Non-Blocking Atomic Commitment problem, simply NB-AC, is an agreement problem and is thus impossible to solve in asynchronous distributed systems


with crash failures. This stems from the FLP result, which states that the consensus problem cannot be solved in asynchronous systems. Can we also circumvent the impossibility of solving NB-AC using some failure detector? The answer is of course "yes". The NB-AC algorithm of D. Skeen [16] solves the NB-AC problem assuming the capability of the failure detector P (Perfect) in asynchronous distributed systems. This failure detector ensures Strong Completeness (recalled above) and Strong Accuracy, i.e., no process is suspected before it crashes [2]. Failure detector P never makes a mistake and obviously provides more knowledge about failures than ◊S. But it is stated in [7] that failure detector ◊S cannot solve the NB-AC problem, even if only one process may crash. This means that NB-AC is strictly harder than Consensus, i.e., NB-AC requires more knowledge about failures than Consensus. An interesting question is then: "What is the weakest failure detector for solving the NB-AC problem in asynchronous systems with unreliable failure detectors?" In this paper, as an answer to this question, we show that there is a failure detector weaker than the perfect failure detector P that solves NB-AC. This means that the weakest failure detector for NB-AC is not the perfect failure detector P.

III. MODEL
We consider in this paper a crash-prone asynchronous message-passing system model augmented with the failure detector abstraction [3].

The Non-Blocking Atomic Commitment problem. Atomic commitment problems are at the heart of distributed transactional systems. A transaction originates at a process called the Transaction Manager (abbreviated TM), which accesses data by interacting with various processes called Data Managers (abbreviated DM). The TM initially performs a begin-transaction operation, then various write and read operations, by translating writes and reads into messages sent to the DMs, and finally an end-transaction operation. To ensure the so-called failure atomicity property of the transaction, all DMs on which write operations have been performed must resolve an Atomic Commitment problem as part of the end-transaction operation. These DMs are called participants in the problem. In this paper we assume that the participants know each other and know about the transaction. The atomic commitment problem requires the participants to reach a common outcome for the transaction among two possible values: commit and abort. We will say that a participant AC-decides commit (respectively, AC-decides abort). The write operations performed by the DMs become permanent if and only if the participants AC-decide commit. The outcome AC-decided by a participant depends on votes (yes or no) provided by the participants; we will say that a participant votes yes (respectively, votes no). Each vote reflects the ability of the participant to ensure that its data updates can be made permanent.


We do not make any assumption on how votes are defined, except that they are not predetermined. For example, a participant votes yes if and only if no concurrency control conflict has been locally detected and the updates have been written to stable storage; otherwise the participant votes no. A participant can AC-decide commit only if all participants vote yes. In order to exclude trivial situations where participants always AC-decide abort, it is generally required that commit be decided if all votes are yes and no participant crashes. We consider the Non-Blocking Atomic Commitment problem, NB-AC, in which a correct participant AC-decides even if some participants have crashed. NB-AC is specified by the following conditions:
- Uniform-Agreement: No two participants AC-decide different outcomes.
- Uniform-Validity: If a participant AC-decides commit, then all participants have voted yes.
- Termination: Every correct participant eventually AC-decides.
- Non-Triviality: If all participants vote yes and there is no failure, then every correct participant eventually AC-decides commit.
Uniform-Agreement and Uniform-Validity are safety conditions; they ensure the failure atomicity property of transactions. Termination is a liveness condition which guarantees non-blocking. Non-Triviality excludes trivial solutions to the problem where participants always AC-decide abort. This condition can be viewed as a liveness condition from the application point of view, since it ensures progress, i.e., transaction commit, under reasonable expectations, when there is no crash and no participant votes no.
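As a minimal illustration of this decision rule (our sketch, not the paper's protocol), the fragment below derives the outcome from the collected votes; the names Vote, Outcome and decide are assumptions.

#include <cstdio>
#include <vector>

enum class Vote { Yes, No };
enum class Outcome { Commit, Abort };

// Commit is possible only when every participant voted yes
// (Uniform-Validity); a "no" vote, or a suspected crash, forces abort.
Outcome decide(const std::vector<Vote>& votes, bool crash_suspected) {
    if (crash_suspected) return Outcome::Abort;
    for (Vote v : votes)
        if (v == Vote::No) return Outcome::Abort;
    return Outcome::Commit;  // all yes, no failure: the Non-Triviality case
}

int main() {
    const std::vector<Vote> votes = {Vote::Yes, Vote::Yes, Vote::Yes};
    std::printf("%s\n", decide(votes, false) == Outcome::Commit ? "commit" : "abort");
    return 0;
}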

IV. THE MODAL FAILURE DETECTOR STAR M*
Each module of the failure detector M* outputs a subset of Π (i.e., an element of 2^Π). Initially, every process is suspected. However, if a process is once confirmed to be correct by a correct process, then the confirmed process id is removed from the output of M*. If the confirmed process is suspected again, the suspected process id is re-inserted into the output of M*. The most important property of M*, called Modal Accuracy, is that a process that was once confirmed to be correct is not suspected before it crashes. Let HM be any history of such a failure detector M*; HM(i, t) represents the set of processes that process i suspects at time t. For each failure pattern F, M*(F) is defined as the set of all failure detector histories HM that satisfy the following properties:

Strong Completeness: There is a time after which every process that crashes is permanently suspected by every correct process: ∀i ∈ correct(F), ∀j ∈ crashed(F), ∃t'': ∀t' > t'', j ∈ HM(i, t').

Eventual Strong Accuracy: There is a time after which every correct process is never suspected by any correct process: ∀i ∈ correct(F), ∃t: ∀t' > t, ∀j ∈ correct(F), j ∉ HM(i, t').

Modal Accuracy: Initially, every process is suspected; after that, any process that is once confirmed to be correct is not suspected before it crashes. More precisely: ∀i, j ∈ Ω: j ∈ HM(i, t0) ∧ (∃t, t0 < t < t': j ∉ HM(i, t)) ∧ j ∈ Ω − F(t') ⇒ j ∉ HM(i, t').

Note that Modal Accuracy does not require that failure detector M* keep the Strong Accuracy property over every process at all times t; it only requires that M* never make a mistake, before a crash, about a process that was confirmed at least once to be correct. If M* outputs some crashed processes, then M* accurately knows that they have crashed, since they had already been confirmed to be correct before crashing. However, concerning those processes that have never been confirmed, M* does not necessarily know whether they crashed (or which processes crashed).
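One possible way to model the output of an M* module at a single process is sketched below (our illustration: the class and method names are assumptions, and the code mirrors only the suspected-set bookkeeping implied by the three properties, not the underlying detection mechanism).

#include <set>

using Process = int;

// Sketch of an M* module at one process. Initially every process is
// suspected; a process confirmed to be correct is removed from the
// output, and, by Modal Accuracy, it is suspected again only after it
// has actually crashed (confirmed processes are never falsely suspected).
class ModalFailureDetectorStar {
public:
    explicit ModalFailureDetectorStar(const std::set<Process>& all_processes)
        : suspected_(all_processes) {}  // initially, every process is suspected

    // Evidence gathered that p is alive: remove it from the output.
    void confirm(Process p) { suspected_.erase(p); }

    // A previously confirmed process is suspected again; under Modal
    // Accuracy this happens only after p has crashed.
    void suspect(Process p) { suspected_.insert(p); }

    // The history value HM(i, t): the currently suspected set.
    const std::set<Process>& output() const { return suspected_; }

private:
    std::set<Process> suspected_;
};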


V. SOLVING THE NB-AC PROBLEM WITH M*
We give in Figure 1 an algorithm solving NB-AC using M* in any environment of a group where at least one node is available. The algorithm relies on the eventual strong accuracy property of M*: with this property and the assumption that at least one node is available, we can implement the algorithm of Figure 1.

Var status: {rem, try, ready} initially rem
Var coordinator: initially NULL
Var token: initially empty list
Var group_i: set of processes
Periodically(τ) do request M* for HM

1.  Upon received (trying, upper_layer)
2.    if not (status = try) then
3.      wait until ∀j ∈ group_i : j ∉ HM
4.      status := try
5.      send (ready, i) to ∀j ∈ group_i
6.  Upon received (ok, j)
7.    token := token ∪ {j}
8.    if group = token then
9.      send (commit, i) to ∀j ∈ group; status := rem
10. Upon received (ready, j)
11.   if status = rem then send (ok, i) to j
12.     coordinator := j
13.     status := ready
      else send (no, i) to j
14. Upon received (no, j)
15.   if status = try then
16.     send (abort, i) to ∀j ∈ group
17.     status := rem
18. Upon received (abort, j)
19.   if status = ready then do abort()
20.     status := rem
21. Upon received (commit, j)
22.   if status = ready then commit transaction
23.     status := rem
24. Upon received HM from M*_i
25.   if (status = try ∧ ∃j ∈ my_group : j ∈ HM) then
        send (abort, i) to ∀j ∈ my_group; abort-transaction()
26.     status := rem
27.   if (status = ready ∧ coordinator ∈ HM) then
        coordinator := NULL; abort-transaction()
28.     status := rem

Figure 1: NB-AC algorithm using M*: process i.

The algorithm of Figure 1 solves NB-AC using M* in any environment E of a group with any number of correct processes (f < n). Our algorithm of Figure 1 assumes:
• each process i has access to the output of its modal failure detector module M*_i;
• at least one process is available.
In our algorithm of Figure 1, each process i has the following variables:
1. a variable status, initially rem, which holds one of the states {rem, try, ready};
2. a variable coordinator, initially NULL, which denotes the coordinator when i sends its ok message to another node;
3. a list token, initially empty, keeping the ok messages that i has received from each member of the group.

Description of [Lines 1-5] of Figure 1: the idea of our algorithm is inspired by the well-known NB-AC algorithm of D. Skeen [4,7]. The processes that wish to attempt their atomic commitment first wait for a group whose members are all alive, based on the information HM from the failure detector M*. Each such process eventually knows the group, by the eventual strong accuracy property of M* (line 3 of Figure 1), and then sets its status to "try", meaning that it is trying to commit. It sets the variable group with all members and sends the message (ready, i) to all nodes in the group.
Description of [Lines 6-10]: the coordinator, asking every process of the group for a ready to proceed with an atomic commitment, does not take steps until all "ok" messages are received from the group. It eventually receives ok or no messages from the group, and accordingly commits or aborts the transaction.
Description of [Lines 11-15]: upon receiving a "ready" message from the coordinator, the node sends "ok" to the coordinator and sets its status to "ready", meaning that it is waiting for a decision, which is either "commit" or "abort".
Description of [Lines 16-18]: if the coordinator receives the message "no" from a node of the group, it sends the "abort" message to every member of the group, after which it returns to the "rem" state.
Description of [Lines 19-21]: node i, upon receiving "abort" from coordinator j, aborts the transaction if it is in the ready state.
Description of [Lines 22-24]: node i, upon receiving "commit" from coordinator j, commits the transaction if it is in the ready state.
Description of [Lines 25-27]: when node i receives the failure detector history HM from M*, if it is a coordinator and learns that a node of the group has died, it sends the abort message to all members of the group.
Description of [Lines 28-29]: upon receiving the failure detector history HM from M*, if node i is waiting for a decision from the coordinator and learns that the coordinator has died, it aborts the transaction.

Now we prove the correctness of the algorithm of Figure 1 in terms of two properties: Uniform-Agreement and Uniform-Validity. Let R be an arbitrary run of the algorithm for some failure pattern F ∈ E (f