Performance Analysis of Parallel Applications for HPC [1st ed. 2023] 9819943655, 9789819943654

This book presents a hybrid static-dynamic approach for efficient performance analysis of parallel applications on HPC systems.


English Pages 271 [259] Year 2023


Table of contents :
Preface
Acknowledgments
Contents
Acronyms
1 Background and Overview
1.1 Background of Performance Analysis
1.2 Hybrid Static-Dynamic Approaches
1.3 Overview of Book Structure
References
Part I Performance Analysis Methods: Communication Analysis
2 Fast Communication Trace Collection
2.1 Introduction
2.2 Related Work
2.3 Design Overview
2.4 Live-Propagation Slicing Algorithm
2.4.1 Slicing Criterion
2.4.2 Dependence of MPI Programs
2.4.3 Intra-procedural Analysis
2.4.4 Inter-procedural Analysis
2.4.5 Discussions
2.5 Implementation
2.5.1 Compilation Framework
2.5.2 Runtime Environment
2.6 Evaluation
2.6.1 Methodology
2.6.2 Validation
2.6.3 Performance
2.6.3.1 Memory Consumption
2.6.3.2 Execution Time
2.7 Applications
2.7.1 Optimize Process Placement of MPI Programs
2.7.2 Sensitivity Analysis of Communication Patterns to Input Parameters
2.8 Limitations and Discussions
2.9 Conclusions
References
3 Structure-Based Communication Trace Compression
3.1 Introduction
3.2 Overview
3.3 Extracting Communication Structure
3.3.1 Intra-procedural Analysis Algorithm
3.3.2 Inter-procedural Analysis Algorithm
3.4 Runtime Communication Trace Compression
3.4.1 Intra-process Communication Trace Compression
3.4.2 Inter-process Communication Trace Compression
3.5 Decompression and Performance Analysis
3.6 Implementation
3.7 Evaluation
3.7.1 Methodology
3.7.2 Communication Trace Size
3.7.3 Trace Compression Overhead
3.7.3.1 Intra-process Overhead
3.7.3.2 Inter-process Overhead
3.7.3.3 Compilation Overhead of Cypress
3.7.4 Case Study
3.7.4.1 Analyzing Communication Patterns
3.7.4.2 Performance Prediction
3.8 Related Work
3.9 Conclusions
References
Part II Performance Analysis Methods: Memory Analysis
4 Informed Memory Access Monitoring
4.1 Introduction
4.2 Overview
4.2.1 Spindle Framework
4.2.2 Sample Input/Output: Memory Trace Collector
4.3 Static Analysis
4.3.1 Intra-procedural Analysis
4.3.1.1 Extracting Program Control Structure
4.3.1.2 Building Memory Dependence Trees
4.3.2 Inter-procedural Analysis
4.3.3 Special Cases and Complications
4.4 Spindle-Based Runtime Monitoring
4.4.1 Runtime Information Collection
4.4.2 Spindle-Based Tool Developing
4.4.2.1 Memory Bug Detector (S-Detector)
4.4.2.2 Memory Trace Collector (S-Tracer)
4.5 Evaluation
4.5.1 Experiment Setup
4.5.2 Spindle Compilation Overhead
4.5.3 S-Detector for Memory Bug Detection
4.5.4 S-Tracer for Memory Trace Collection
4.6 Related Work
4.7 Conclusion and Future Work
References
Part III Performance Analysis Methods: Scalability Analysis
5 Graph Analysis for Scalability Analysis
5.1 Introduction
5.2 Design Overview
5.3 Graph Generation
5.3.1 Static Program Structure Graph Construction
5.3.2 Sampling-Based Profiling
5.3.2.1 Associate Vertices with Performance Data
5.3.2.2 Graph-Guided Communication Dependence
5.3.2.3 Indirect Function Calls
5.3.3 Program Performance Graph
5.4 Scaling Loss Detection
5.4.1 Location-Aware Problematic Vertex Detection
5.4.2 Backtracking Root Cause Detection
5.5 Implementation and Usage
5.6 Evaluation
5.6.1 Experimental Setup
5.6.2 PSG Analysis
5.6.3 Performance Overhead
5.6.4 Case Studies with Real Applications
5.6.4.1 Zeus-MP
5.6.4.2 SST
5.6.4.3 Nekbone
5.7 Related Work
5.8 Conclusion
References
6 Performance Prediction for Scalability Analysis
6.1 Introduction
6.1.1 Motivation
6.1.2 Our Approach and Contributions
6.2 Base Prediction Framework
6.3 Definitions
6.3.1 Communication Sequence
6.3.2 Sequential Computation Vector
6.4 Sequential Computation Time
6.4.1 Deterministic Replay
6.4.2 Acquire Sequential Computation Time
6.4.3 Concurrent Replay
6.5 Representative Replay
6.5.1 Challenges for Large-Scale Applications
6.5.2 Computation Similarity
6.5.3 Select Representative Processes
6.6 Convolute Computation and Communication Performance
6.7 Implementation
6.8 Evaluation
6.8.1 Methodology
6.8.2 Sequential Computation Time
6.8.2.1 The Number of Representative Replay Groups
6.8.2.2 Validation of Sequential Computation Time
6.8.2.3 Analysis of Sequential Computation Time
6.8.3 Performance Prediction for HPC Platforms
6.8.4 Performance Prediction for Amazon Cloud Platform
6.8.5 Message Log Size and Replay Overhead
6.8.6 Performance of SIM-MPI Simulator
6.9 Discussions
6.10 Related Work
6.11 Conclusion
References
Part IV Performance Analysis Methods: Noise Analysis
7 Lightweight Noise Detection
7.1 Introduction
7.2 vSensor Design
7.3 Fixed-Workload V-Sensors
7.3.1 Fixed-Workload V-Sensor Definition
7.3.2 Analysis of Intra-procedure
7.3.3 Analysis of Inter-procedure
7.3.4 Multiple-Process Analysis
7.3.5 Whole Program Analysis
7.4 Regular-Workload V-Sensors
7.4.1 Instruction Sequences
7.4.2 Regular Workload Definition
7.5 Program Instrumentation
7.5.1 V-Sensor Selection
7.5.2 Inserting External V-Sensors
7.5.3 Analyzing External V-Sensors
7.6 Runtime Performance Variance Detection
7.6.1 Smoothing Data
7.6.2 Normalizing Performance
7.6.3 History Comparison
7.6.4 Multiple-Process Analysis
7.6.5 Performance Variance Report
7.7 Experiment
7.7.1 Experimental Setup
7.7.2 Overall Analysis of Fixed V-Sensors
7.7.3 Analysis of Regular V-Sensors
7.7.4 External Analysis of V-Sensors
7.7.5 V-Sensor Distribution
7.7.6 Injecting Noise
7.7.7 Case Studies
7.8 Related Work
7.9 Conclusion
References
8 Production-Run Noise Detection
8.1 Introduction
8.2 Overview
8.3 Performance Variance Detection
8.3.1 Fixed-Workload Fragments
8.3.2 State Transition Graph
8.3.3 Performance Data Collection
8.3.4 Identifying Fixed-Workload Fragments
8.3.5 Performance Variance Detection
8.4 Performance Variance Diagnosis
8.4.1 Variance Breakdown Model
8.4.2 Quantifying Time of Factors
8.4.3 Progressive Variance Diagnosis
8.5 Implementation
8.6 Evaluation
8.6.1 Evaluation Setup
8.6.2 Overhead and Detection Coverage
8.6.3 Verification of Fixed Workload Identification
8.6.4 Comparing with Profiling Tools
8.6.5 Case Studies
8.6.5.1 Detection of a Hardware Bug
8.6.5.2 Detection of Memory Problem
8.6.5.3 Detection of IO Performance Variance
8.7 Related Work
8.8 Conclusion
References
Part V Performance Analysis Framework
9 Domain-Specific Framework for Performance Analysis
9.1 Introduction
9.2 Overview
9.2.1 PerFlow Framework
9.2.2 Example: A Communication Analysis Task
9.3 Graph-Based Performance Abstraction
9.3.1 Definition of PAG
9.3.2 Hybrid Static-Dynamic Analysis
9.3.3 Performance Data Embedding
9.3.4 Views of PAG
9.4 PerFlow Programming Abstraction
9.4.1 PerFlowGraph
9.4.2 PerFlowGraph Element
9.4.3 Building Performance Analysis Pass
9.4.3.1 Low-Level API Design
9.4.3.2 Example Cases
9.4.4 Performance Analysis Paradigm
9.4.5 Usage of PerFlow
9.5 Evaluation
9.5.1 Experimental Setup
9.5.2 Overhead and PAG
9.5.3 Case Study A: ZEUS-MP
9.5.4 Case Study B: LAMMPS
9.5.5 Case Study C: Vite
9.6 Related Work
9.7 Conclusion
References
10 Conclusion and Future Work


Jidong Zhai · Yuyang Jin · Wenguang Chen · Weimin Zheng

Performance Analysis of Parallel Applications for HPC


Jidong Zhai Department of Computer Science and Technology Tsinghua University Beijing, China

Yuyang Jin Department of Computer Science and Technology Tsinghua University Beijing, China

Wenguang Chen Department of Computer Science and Technology Tsinghua University Beijing, China

Weimin Zheng Department of Computer Science and Technology Tsinghua University Beijing, China

ISBN 978-981-99-4365-4    ISBN 978-981-99-4366-1 (eBook)
https://doi.org/10.1007/978-981-99-4366-1

Jointly published with Posts and Telecom Press, Beijing, China. The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: Posts & Telecom Press.

© Posts & Telecom Press 2023. This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore. Paper in this product is recyclable.

Preface

Modern supercomputers have brought about an unprecedented growth in computing power. However, many parallel applications fail to efficiently utilize this power due to performance bugs, such as load imbalance, resource contention, and network congestion, as well as the performance variance of HPC systems. Identifying performance bottlenecks and understanding the performance behaviors of parallel applications on HPC systems are therefore crucial tasks of performance analysis. Unfortunately, conventional performance analysis techniques mainly rely on dynamic approaches, which only analyze performance data collected at runtime. These traditional techniques often incur significant overhead and thus become ineffective when used on exascale supercomputers.

This book aims to introduce a hybrid static-dynamic approach to achieve lightweight performance analysis for parallel applications. By utilizing program structures extracted during static analysis, this approach can guide dynamic analysis and effectively reduce unnecessary and redundant data collection and analysis, thereby reducing both runtime and space overheads. This book covers the core idea and related theories of hybrid static-dynamic approaches and demonstrates a series of innovative techniques for various performance analysis scenarios, such as communication pattern analysis, communication trace compression, memory access monitoring, scalability analysis, and performance variance detection. By showcasing these specific performance analysis techniques, the book emphasizes the effectiveness of static analysis in assisting dynamic analysis.

This book is structured as follows: Chap. 1 provides an introduction to the background of performance analysis and an overview of the hybrid static-dynamic approach. It also presents several performance analysis methods utilizing this approach. Chapters 2 and 3 cover two communication analysis techniques: communication trace collection and communication trace compression. Chapter 4 introduces a memory analysis method. Chapters 5 and 6 discuss two scalability analysis methods that use graph analysis and representative replay. Chapters 7 and 8 present two variance detection methods based on source code and binaries, respectively. Chapter 9 details a performance analysis framework for parallel applications. Finally, Chap. 10 summarizes the challenges of the hybrid static-dynamic approach.

Beijing, China

Jidong Zhai Yuyang Jin Wenguang Chen Weimin Zheng

Acknowledgments

We are very grateful for the contributions of our students and research collaborators. We list the main contributors to the drafts of each chapter as follows:

• Chapter 1: Yuyang Jin, Jidong Zhai.
• Chapter 2: Jidong Zhai, Tianwei Sheng, Jiangzhou He, Wenguang Chen, Weimin Zheng.
• Chapter 3: Jidong Zhai, Jianfei Hu, Xiongchao Tang, Xiaosong Ma, Wenguang Chen.
• Chapter 4: Haojie Wang, Jidong Zhai, Xiongchao Tang, Bowen Yu, Xiaosong Ma, Wenguang Chen.
• Chapter 5: Yuyang Jin, Haojie Wang, Xiongchao Tang, Zhen Zheng, Teng Yu, Torsten Hoefler, Xu Liu, Jidong Zhai.
• Chapter 6: Jidong Zhai, Wenguang Chen, Weimin Zheng.
• Chapter 7: Xiongchao Tang, Jidong Zhai, Xuehai Qian, Bingsheng He, Wei Xue, Wenguang Chen.
• Chapter 8: Liyan Zheng, Jidong Zhai, Xiongchao Tang, Haojie Wang, Teng Yu, Yuyang Jin, Shuaiwen Leon Song, Wenguang Chen.
• Chapter 9: Yuyang Jin, Haojie Wang, Runxin Zhong, Chen Zhang, Jidong Zhai.
• Chapter 10: Yuyang Jin, Jidong Zhai.

We presented the proposal of this book in 2021 after a discussion with Springer Senior Editor Dr. Celine Lanlan Chang. We thank Celine for her informative suggestions and great patience during the development of this book. We are also thankful to Zhenhua Li and Ruijun He for their insightful suggestions. Finally, we would like to thank our organization, the Department of Computer Science and Technology at Tsinghua University, for providing an excellent environment, support, and facilities for the preparation of this book. This book is supported in part by the National Natural Science Foundation of China (NSFC) under grants 62225206 and U20A20226.


Acronyms

CFG     Control flow graph
CPU     Central processing unit
DFG     Data flow graph
FLOPS   Floating point operations per second
HPC     High performance computing
IO      Input/output devices (usually refers to storage devices)
IR      Intermediate representation
MPI     Message passing interface
P2P     Point to point
PCG     Program call graph
PMU     Performance monitoring unit
SPMD    Single program, multiple data

Chapter 1

Background and Overview

Abstract Performance analysis is essential for understanding the performance behaviors of large-scale parallel applications on modern supercomputers. Current performance analysis techniques are based on either profiling or tracing. Profiling incurs low costs during runtime but misses important information for identifying underlying bottlenecks, while tracing brings unacceptable overhead at large scales. In this book, we leverage static information, such as program structures and data dependence, from source codes and executable binaries to guide dynamic analysis, which achieves the analyzability of tracing with the overhead of profiling. We apply this approach to many performance analysis tasks, including memory monitoring, communication analysis, scalability analysis, and noise detection.

1.1 Background of Performance Analysis

Modern supercomputers are bringing unprecedented growth in computing power to the world. The top-ranked supercomputer [1] has already reached exascale peak performance, which means that it is capable of performing at least 10^18 double-precision operations per second (1 EFlops). For instance, ORNL's Frontier has 8,730,112 cores and achieves 1.102 EFlops. This unprecedented growth in recent years poses a number of challenges to the developers of parallel applications. It is common that large-scale parallel applications fail to fully utilize the performance of modern supercomputers due to various performance issues, including load imbalance, resource contention, inefficient memory access, and network congestion, as well as the performance variance of HPC systems.

Performance is the core issue for high-performance computing. In particular, performance analysis techniques play an important role in identifying performance bottlenecks and understanding the performance behaviors of parallel applications. In the past decades, researchers from all over the world have made great efforts in developing performance analysis techniques using two traditional approaches: profiling and tracing.

Profiling-based approaches [2–4] collect statistical information at runtime with low overhead. Summarizing the data statistically loses important information such as the order of events, control flow, and possible dependence and delay paths. Thus, such approaches can only provide a coarse-grained insight into application bottlenecks, and substantial human effort is required to identify the root cause of performance issues.

Tracing-based approaches [5–8] capture performance data as time series, which allows tracking dependence and delay sequences to identify the root causes of performance issues. Their major drawback is the prohibitive storage and runtime overhead introduced by detailed data logging. Thus, tracing-based analysis cannot be used for large-scale parallel applications.

To accommodate the increasing scale of supercomputers, parallel applications are implemented with complex data and control flow within parallel units, as well as complex interactions between parallel units, which incurs large data, control, and communication dependence and makes it difficult to understand performance behaviors. Tracing-based approaches incur significant runtime and storage overhead, so they cannot feasibly be extended to large-scale systems such as exascale machines. Profiling-based approaches cannot effectively identify the underlying bottlenecks of large-scale parallel applications with complex dependence due to the loss of important information. Thus, we conclude that identifying performance bottlenecks for large-scale parallel applications is still an important open problem.
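To make the contrast concrete, both kinds of tools are commonly built on the MPI profiling interface (PMPI), which lets a library intercept MPI calls. The minimal sketch below is illustrative only, not the implementation of any particular tool: the profiling part keeps O(1) aggregate state per call site, while the tracing part appends one record per message; the trace file name is a hypothetical choice.

```c
#include <mpi.h>
#include <stdio.h>

/* Aggregate statistics kept by a profiling-style wrapper. */
static long long send_calls = 0;
static long long send_bytes = 0;

/* Trace file used by a tracing-style wrapper (one record per message). */
static FILE *trace_file = NULL;

/* Intercept MPI_Send via the PMPI interface; the real call is PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    int size;
    MPI_Type_size(datatype, &size);

    /* Profiling: constant-size summary, no per-message history. */
    send_calls++;
    send_bytes += (long long)count * size;

    /* Tracing: one record per message, preserving order and peers,
       at the cost of storage that grows with the message count. */
    if (trace_file == NULL)
        trace_file = fopen("rank_trace.log", "w");
    fprintf(trace_file, "SEND dest=%d tag=%d bytes=%lld time=%.6f\n",
            dest, tag, (long long)count * size, MPI_Wtime());

    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```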

1.2 Hybrid Static-Dynamic Approaches

To achieve lightweight and accurate performance analysis of large-scale parallel applications, this book introduces a hybrid static-dynamic approach that reduces the runtime and space overhead and guarantees the accuracy of the analysis. The core idea of this hybrid approach is that static data, like program structures, can guide dynamic analysis and effectively reduce unnecessary and redundant data collection. Figure 1.1 shows the overview of the hybrid static-dynamic approach.

Fig. 1.1 Overview for lightweight performance analysis

Based on this hybrid approach, this book proposes a series of innovative techniques for various performance analysis tasks, including communication pattern analysis, communication trace compression, memory analysis, scalability analysis, performance variance detection, and performance prediction.
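As a simple illustration of this idea (a hypothetical sketch, not taken from the book's tools), consider a halo-exchange loop whose communication arguments are loop-invariant. Static analysis can establish that invariance at compile time, so the runtime only needs to record the loop trip count rather than one event per iteration. The comm_template structure and all names below are illustrative assumptions.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical compressed trace entry: one template per communication
   site plus a repeat count, instead of one record per message. */
struct comm_template {
    int site_id;   /* static identifier of the MPI call site           */
    int dest;      /* destination, loop-invariant and known statically */
    int bytes;     /* message size, loop-invariant                     */
    int repeat;    /* loop trip count, filled in at runtime            */
};

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sendbuf[64] = {0}, recvbuf[64];
    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;
    int iters = 1000;  /* the only quantity unknown until runtime */

    /* A conventional tracer would log 2 * iters events per process for
       this loop, although the communication arguments never change. */
    for (int i = 0; i < iters; i++)
        MPI_Sendrecv(sendbuf, 64, MPI_DOUBLE, right, 0,
                     recvbuf, 64, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Guided by the statically known loop structure, a hybrid tool can
       emit one template per call site and record only the trip count. */
    struct comm_template t = {1, right, (int)(64 * sizeof(double)), iters};
    if (rank == 0)
        printf("site %d: dest=%d bytes=%d repeat=%d\n",
               t.site_id, t.dest, t.bytes, t.repeat);

    MPI_Finalize();
    return 0;
}
```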

1.3 Overview of Book Structure

Here is the content structure of this book:

Fast Communication Trace Collection Communication patterns of parallel applications are important for optimizing application performance and designing better communication subsystems. Communication patterns can be extracted from communication traces. However, existing approaches to generate communication traces need to execute complete parallel applications on full-scale systems, which is time-consuming and expensive. Chapter 2 introduces a novel technique, which can perform fast communication trace collection for large-scale parallel applications on small-scale systems. The core idea is to reduce the original program to obtain a program slice through static analysis and to execute the program slice to acquire the communication traces.

Program Structure-Based Communication Trace Compression The problem size and the execution scale on supercomputers keep growing, producing a prohibitive volume of communication traces. To reduce the size of communication traces, existing dynamic compression methods introduce large compression overhead as the job scale grows. Chapter 3 introduces a hybrid static-dynamic method that leverages information acquired from static analysis to facilitate more effective and efficient dynamic trace compression. This scheme extracts a program communication structure tree at compile time, which naturally contains crucial iterative computing features such as the loop structure, allowing subsequent runtime compression to “fill in,” in a “top-down” manner, event details into the known communication template.

Informed Memory Access Monitoring Memory monitoring is of critical use in understanding applications and evaluating systems. Due to the dynamic nature of programs' memory accesses, a common practice today leaves large amounts of address examination and data recording to runtime, at the cost of substantial performance overhead (and large storage time/space consumption if memory traces are collected). Chapter 4 introduces a novel memory access monitoring technique, which distills predictable memory access patterns into a compact program structure summary at compile time. Leveraging the static structural information, this technique dramatically reduces the amount of instrumentation that incurs heavy runtime memory address examination or recording.

Graph Analysis for Scalability Analysis Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Chapter 5 proposes a novel technique. It first leverages static compiler techniques to build a program structure graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, lightweight techniques are adopted to collect performance data according to the graph structure and generate a program performance graph. With this graph, backtracking root cause detection can automatically and efficiently detect the root cause of scaling loss.

Performance Prediction for Scalability Analysis Performance prediction of parallel applications is critically important for designers of large-scale parallel computers and developers of parallel applications. However, it is difficult because the execution time of parallel applications is determined by several factors, including sequential computation time in each process, communication time, and their convolution. Despite previous efforts, it remains an open problem to estimate sequential computation time in each process accurately and efficiently for large-scale parallel applications. Chapter 6 introduces a novel approach to predict the sequential computation time accurately and efficiently. It needs only a single node of the target platform; the whole target system need not be available. It employs deterministic replay techniques to execute any process of a parallel application on a single node at real speed. Besides, it leverages representative replay techniques, which only need to execute representative parallel processes instead of all of them.

Lightweight Noise Detection Noise in the performance of parallel and distributed systems is becoming increasingly challenging. The runtimes of different executions can vary greatly even with a fixed number of computing nodes. Many HPC applications on supercomputers exhibit such variance. This not only leads to unpredictable execution times but also renders the system's behavior unintuitive. The efficient online detection of variations in performance is an open problem in HPC research. Chapter 7 introduces an approach to detect variations in the performance of systems. The key finding of this approach is that the source code of programs can represent performance at runtime better than an external detector.

Production-Run Noise Detection Existing detection approaches either bring too much overhead and hurt applications' performance or rely on nontrivial source code analysis that is impractical for production-run parallel applications. Chapter 8 introduces a performance variance detection and diagnosis framework for production-run parallel applications. This approach is based on an important observation that most parallel applications contain code snippets that are repeatedly executed with fixed workloads, which can be used for performance variance detection in production-run environments.


Domain-Specific Framework for Performance Analysis Although a large number of performance tools have been designed, accurately pinpointing root causes for complex performance issues still needs specific in-depth analysis. To implement each such analysis, significant human effort and domain knowledge are normally required. Chapter 9 introduces a domain-specific programming framework to reduce the burden of implementing accurate performance analysis. This framework abstracts the step-by-step process of performance analysis as a dataflow graph.

Conclusions and Future Works Chapter 10 summarizes the advantages of the hybrid static-dynamic approach for performance analysis and presents a series of interesting open problems that require further research.

References

1. TOP500 website (2020). http://top500.org/
2. Vetter, J., & Chambreau, C. (2005). mpiP: Lightweight, scalable MPI profiling.
3. Tallent, N. R., Adhianto, L., & Mellor-Crummey, J. M. (2010). Scalable identification of load imbalance in parallel executions using call path profiles. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1–11). IEEE Computer Society.
4. Tallent, N. R., et al. (2009). Diagnosing performance bottlenecks in emerging petascale applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (pp. 1–11). IEEE.
5. Intel Trace Analyzer and Collector. https://software.intel.com/en-us/trace-analyzer
6. Zhai, J., Chen, W., & Zheng, W. (2010). PHANTOM: Predicting performance of parallel applications on large-scale parallel machines using a single node. In PPoPP.
7. Geimer, M., et al. (2010). The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience, 22(6), 702–719.
8. Linford, J. C., et al. (2017). Performance analysis of OpenSHMEM applications with TAU Commander. In Workshop on OpenSHMEM and Related Technologies (pp. 161–179). Springer.

Part I

Performance Analysis Methods: Communication Analysis

Chapter 2

Fast Communication Trace Collection

Abstract Communication patterns of parallel applications are important for optimizing application performance and designing better communication subsystems. Communication patterns can be extracted from communication traces. However, existing approaches are time-consuming and expensive because they generate communication traces by executing the entire parallel applications on full-scale systems. We propose a novel technique, namely, Fact, which can perform Fast Communication Traces collection for large-scale parallel applications on smallscale systems. Our idea is to reduce the original program and obtain a program slice through static analysis and to execute the program slice to acquire the communication traces, which is based on an observation that most computation and message contents in parallel applications are not relevant to their spatial and volume communication attributes and therefore can be removed for the purpose of communication trace collection. We have implemented Fact and evaluated it with NPB programs and Sweep3D. The results show that Fact can reduce resource consumption by two orders of magnitude in most cases. For example, Fact collects communication traces of 512-process Sweep3D on a 4-node (32 cores) platform in just 6.79 s, consuming 1.25 GB memory, while the original program takes 256.63 s and consumes 213.83 GB memory on a 32-node (512 cores) platform.

2.1 Introduction

Communication pattern is a key factor affecting the performance of message passing parallel applications. Different applications exhibit different communication patterns, which can be characterized by three key attributes: volume, spatial, and temporal [1, 2].¹ Figure 2.1 presents the spatial and volume communication attributes of CG in the NAS Parallel Benchmark (NPB) [3] with 64 processes. The gray level of a cell at the xth row and yth column represents the communication volume (in bytes) between two processes x and y.

Fig. 2.1 The communication spatial and volume attributes of the NPB CG program (CLASS = D, NPROCS = 64)

¹ The communication volume is specified by the number of messages and the message size. The spatial attribute is characterized by the distribution of message source and destination. The temporal behavior is captured by the message generation rate.
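A communication matrix like the one visualized in Fig. 2.1 can be computed from per-message trace records by summing message sizes over each (sender, receiver) pair. The sketch below assumes a simplified record of (source, destination, bytes) per message; it is illustrative and does not reflect the exact trace format of any specific tool.

```c
#include <stdio.h>
#include <stdlib.h>

/* One hypothetical trace record: who sent how many bytes to whom. */
struct msg_record {
    int src;
    int dst;
    long bytes;
};

/* Accumulate per-message records into a P x P spatial/volume matrix:
   matrix[src][dst] holds the total bytes sent from src to dst, i.e.,
   the quantity shown as the gray level of a cell in Fig. 2.1. */
void build_comm_matrix(const struct msg_record *trace, size_t n,
                       int nprocs, long long *matrix) {
    for (int i = 0; i < nprocs * nprocs; i++)
        matrix[i] = 0;
    for (size_t k = 0; k < n; k++)
        matrix[trace[k].src * nprocs + trace[k].dst] += trace[k].bytes;
}

int main(void) {
    /* Tiny example: 4 processes exchanging a few messages. */
    struct msg_record trace[] = {
        {0, 1, 1024}, {1, 0, 1024}, {2, 3, 4096}, {0, 1, 512}
    };
    int nprocs = 4;
    long long matrix[16];
    build_comm_matrix(trace, sizeof(trace) / sizeof(trace[0]), nprocs, matrix);

    for (int s = 0; s < nprocs; s++) {
        for (int d = 0; d < nprocs; d++)
            printf("%8lld", matrix[s * nprocs + d]);
        printf("\n");
    }
    return 0;
}
```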

level of a cell at the .xth row and .yth column represents the communication volume (in byte) between two processes x and y. Proper understanding of communication patterns of parallel applications is important to optimize the communication performance of these applications [4, 5]. For example, with the knowledge of spatial and volume communication attributes, MPIPP [4] optimizes the performance of message passing interface (MPI) programs on nonuniform communication platforms by tuning the scheme of process placement. Besides, such knowledge can also help design better communication subsystems. For instance, for circuit-switched networks used in parallel computing, communication patterns are used to pre-establish connections and eliminate the runtime overhead of path establishment [6]. Furthermore, a recent work shows spatial and volume communication attributes can be employed by replay-based MPI debuggers to reduce replay overhead significantly [7]. Here, we focus on MPI-based parallel applications due to their popularity, but our approach can be applied to other message passing parallel programs. Previous work on communication patterns of parallel applications mainly relies on traditional trace collection methods [2, 8, 9]. A series of trace collection and analysis tools have been developed, such as ITC/ITA, KOJAK, Paraver, TAU, and VAMPIR [10–14]. These tools need to instrument original programs at the invocation points of communication routines. The instrumented programs are executed on full-scale parallel systems, and communication traces are collected during the execution. The collected communication trace files record type, size, source destination, etc. for each message. The communication patterns of parallel applications can be easily generated from the communication traces [8]. However, traditional communication trace collection methods have two main limitations: Huge Resource Requirement Typically, parallel applications are designed to solve complex scientific computational problems and tend to consume huge com-

2.1 Introduction

11

For example, ASCI SAGE routinely runs on 2000–4000 processors [15], and the FT program in the NPB consumes more than 600 GB of memory for the Class E input [3]. Therefore, it is impossible to use traditional trace collection methods to collect communication patterns of large-scale parallel applications without full-scale systems.

Long Trace Collection Time  Although traditional trace collection methods do not introduce significant overhead to collect communication traces, they do require executing the entire parallel application from beginning to end. This results in very long trace collection time. Again, we use ASCI SAGE as an example, which takes several months to complete even on a system with thousands of CPUs. This is prohibitively long for trace collection and prevents many interesting explorations of using communication traces, such as input sensitivity analysis of communication patterns.

We have two observations on existing communication trace collection and analysis approaches: (i) Many important applications of communication pattern analysis, such as process placement optimization [4, 16] and subgroup replay [7], do not require temporal attributes. (ii) Most computation and message contents in message passing parallel applications are not relevant to their spatial and volume communication attributes. Motivated by the above observations, we expect to address the following problem: If we can tolerate missing the temporal attributes in communication traces, can we find a way to collect communication traces that still include all spatial and volume attributes in a more efficient way? For purposes of illustration, we use communication patterns in the rest of this chapter to refer to the spatial and volume attributes of communications.

We propose a novel technique, called Fact [17], which can perform Fast Communication Trace collection for large-scale parallel applications on small-scale systems. Our idea is to reduce the original program to a program slice through static analysis and to execute the program slice to acquire communication traces. The program slice preserves all the variables and statements in the original program relevant to the spatial and volume attributes but deletes any unrelated parts. In order to recognize the relevant variables and statements, we propose a live-propagation slicing algorithm (LPSA) to simplify original programs. By solving an inter-procedural data flow equation, it can identify all the variables and statements affecting the communication patterns.

We have implemented Fact and evaluated it with NPB programs and Sweep3D. The results show that Fact can preserve the spatial and volume communication attributes of original programs and reduce resource consumption by two orders of magnitude in most cases. For example, Fact collects communication traces of 512-process Sweep3D on a 4-node (32-core) platform in just 6.79 s using 1.25 GB of memory, while the original program takes 256.63 s and consumes 213.83 GB of memory on a 32-node (512-core) platform.

The remainder of this chapter is organized as follows. Section 2.2 gives a discussion of related work. In Sect. 2.3, we present an overview of our approach
followed by our live-propagation slicing algorithm in Sect. 2.4. Section 2.5 describes the implementation of Fact. Our experimental results are reported in Sect. 2.6. Section 2.7 presents two application examples of Fact. We discuss our work in Sect. 2.8. Finally, we conclude in Sect. 2.9. A proof of the correctness of our algorithm is given in the Appendix.

2.2 Related Work

Communication patterns of parallel applications have been studied extensively by many research groups [1, 2, 9, 18, 19]. Typically, these studies have mainly relied on instrumentation-based trace collection methods. A series of trace collection and analysis tools have been developed by both academia and industry, such as ITC/ITA [10], KOJAK [13], Paraver [14], VAMPIR [11], and TAU [12]. These tools instrument original programs and execute them to acquire communication traces. Additionally, mpiP [20] is a lightweight profiling library for MPI applications that only collects statistical information about MPI functions. However, all these traditional trace collection methods require the execution of the entire instrumented program, which restricts their wide usage for analyzing large-scale applications. Our method adopts a technique similar to traditional trace collection methods to capture the communication patterns at runtime. However, our method only requires executing the program slice rather than the entire program. Therefore, our method can analyze large-scale applications on small-scale systems. Moreover, previous methods for trace compression [21] can be integrated into our framework.

A few studies have tried to compute a symbolic expression of the communication patterns for a given parallel program through data flow analysis [22, 23]. Shao et al. proposed a technique named communication sequence to represent communication patterns of applications [22]. Ho and Lin described an algorithm for static analysis of communication structures in programs written in a channel-based message passing language [23]. Since these approaches employ only static analysis techniques to represent communication patterns of applications, they suffer from the intrinsic limitations of static analysis. For example, they cannot deal with program branches, loops, and the effects of input parameters. In fact, our approach is a hybrid of static analysis and traditional trace collection methods. In our approach, program slicing is used to simplify the original program at compile time, and then a custom communication library is used to collect communication traces from the program slice at runtime. Therefore, our approach can address the above limitations of static analysis.

Our approach exploits the technique of program slicing in the compiler. Program slicing was first proposed by Mark Weiser [24]. Traditionally, it has been used to assist in tedious and error-prone tasks such as program debugging and software maintenance in sequential programs [25–27]. It has also been used to hide I/O latency for parallel applications [28].

2.3 Design Overview

Fact consists of two primary components, a compilation framework and a runtime environment, as shown in Fig. 2.2. The compilation framework is divided into two phases, intra-procedural analysis followed by inter-procedural analysis. The program is sliced based on the results of the inter-procedural analysis. Finally, the communication traces are collected in the runtime environment.

During the intra-procedural analysis phase, Fact parses the source code of an MPI program and identifies the invoked communication routines. The relevant arguments of these routines that determine communication patterns are collected. Information about control dependence, data dependence, and communication dependence for each procedure is gathered, which will be explained in detail in Sect. 2.4.2. During the inter-procedural analysis phase, the program call graph is built based on the information about callsites collected during the intra-procedural phase. LPSA is used to identify all the variables and statements that affect the communication patterns. The output of the compilation framework is the program slice as well as directives for usage at runtime. A program slice is a skeleton of the original program that cannot be executed on the system directly.

The runtime environment of Fact provides a custom MPI communication library to collect the communication traces from the program slice based on the directives inserted at compile time.

Fig. 2.2 Overview of Fact

The program slice is linked to the custom communication library and executed on a small-scale system. The communication traces of applications are collected during the execution according to the specified problem size, input parameters, and number of processes.

Figure 2.3 uses an example to illustrate the differences between the sliced program and the original program. The example program is a parallel matrix-matrix multiplication program C = A × B based on the domain decomposition algorithm. The problem is decomposed by assigning each worker task a number of consecutive columns of matrix B and replicating matrix A to all tasks. Each worker task computes one or more columns of the result matrix C. Process 0 is the master task, which is in charge of distributing the matrices and collecting the results but does not take part in the calculation. The main differences after slicing in Fact are as follows:

1. Line 4, the declaration of arrays A, B, and C, is replaced with dummy arrays at Line 5.
2. Lines 14–20, the source code for initializing matrices A and B, are deleted.
3. Lines 41–49, the main computation code for the matrix multiplication on each worker task, are deleted.
4. Lines 23 and 31, which compute the value of the variable offset, are not relevant to the communication pattern, and these two lines are deleted.
5. Additional directives for usage at runtime are added for the MPI routines at Lines 7, 8, 24, 26, 32, 37, 39, and 50 (M means marked and U means unmarked). The marked MPI routines will be executed at runtime, and the unmarked ones will not; their precise definitions are given in Sect. 2.4.

The sliced program is linked with the custom communication library, and the communication traces are collected at runtime. At runtime, the library will judge the state of each MPI communication routine based on the directives. In this example, the six unmarked communication routines at Lines 24, 26, 32, 37, 39, and 50 will not be executed at runtime, since the contents of these messages do not affect the communication patterns. The original program consumes about 3N² memory (for the three N × N matrices), performs 2N³/(P − 1) floating point operations and three communication operations for each worker process, and performs 3(P − 1) communication operations for the master process (where N is the size of the matrix in each dimension and P is the number of processes). Both the memory consumption and the computation time are reduced in the sliced program. Meanwhile, the communication time is reduced at runtime.

2.4 Live-Propagation Slicing Algorithm

From a formal point of view, the definition of a program slice is based on the concept of a slicing criterion [24]. A slicing criterion is a pair ⟨p, V⟩, where p is a program point and V is a subset of the program variables.

Fig. 2.3 Illustration of program slicing with a parallel Fortran matrix-matrix multiplication program

A program slice on the slicing criterion ⟨p, V⟩ is a subset of program statements that preserves the behavior of the original program at the program point p with respect to the program variables in V. Therefore, determining the slicing criterion and designing an efficient slicing algorithm according to the actual problem requirements are the two key challenges in the compilation framework.

2.4.1 Slicing Criterion

Since our goal is to collect communication traces for analyzing spatial and volume communication attributes, we record the following communication properties in LPSA for a given parallel program:

• For point-to-point communication, we record message type, message size, message source and destination, message tag, and communicator id.
• For collective communication, we record message type, sending message size, receiving message size, root id (if it exists), and communicator id.

Message size, source, and destination are used to compute spatial and volume communication attributes, while message type, message tag, and communicator id are useful for other communication analyses. In an MPI program, these properties can be acquired directly from the corresponding parameters of the MPI communication routines. For example, in the routine MPI_Send in Fig. 2.4, the parameters count and type determine the message size. The parameters dest and comm determine the message destination. The message tag and communicator id can be acquired from the parameters tag and comm. The parameter buf does not affect the communication patterns directly. However, it may sometimes affect the communication patterns indirectly through data flow propagation, and we will analyze this case in the following subsections. A Comm variable is defined to represent those parameters that determine the communication patterns directly.

Fig. 2.4 Comm variables in the routine for MPI_Send (The variables marked with Comm directly determine the communication patterns of the parallel program)

Definition 2.1 (Comm Variable) A Comm variable is a parameter of a communication routine in a parallel program whose value directly determines the communication patterns of the parallel program.

As MPI is a standard communication interface, we can explicitly mark the Comm variables of each MPI routine. In Fig. 2.4, the Comm variables in the routine MPI_Send are marked. All the parameters, except buf, are Comm variables. When a communication routine is identified in the source code, the corresponding Comm variables are collected. For each procedure P, we use a Comm Set, C(P), to record all the Comm variables: C(P) = {(ℓ, v) | ℓ is the unique label of v, and v is a Comm variable}. For example, the Comm Set for the parallel matrix multiplication program in Fig. 2.3 is (note that we use the line number of the variable as its unique label):

C(P) = {(7, myid), (8, nprocs), (24, N), (24, dest), (25, tag), (26, size), (27, dest), (27, tag), (32, size), (33, source), (33, tag), (37, N), (37, master), (37, tag), (39, size), (39, master), (39, tag), (50, size), (50, master), (51, tag)}.

The Comm Set C(P) is the slicing criterion for simplifying the original program in LPSA, and it will be optimized during the phase of data dependence analysis.
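Figure 2.4 is not reproduced here; the annotated call below conveys the same information for MPI_Send. The wrapper function and variable names are illustrative only, while the parameter roles follow the MPI standard.

```c
#include <mpi.h>

/* Which arguments of MPI_Send are Comm variables (cf. Fig. 2.4):
 * every parameter except the message buffer itself. */
void send_row(double *row, int n, int dest, MPI_Comm comm) {
    MPI_Send(row,        /* buf:   not a Comm variable (message contents)  */
             n,          /* count: Comm variable, determines message size  */
             MPI_DOUBLE, /* type:  Comm variable, determines message size  */
             dest,       /* dest:  Comm variable, message destination      */
             99,         /* tag:   Comm variable, message tag              */
             comm);      /* comm:  Comm variable, communicator id          */
}
```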

2.4.2 Dependence of MPI Programs

For convenience, we assume that a control flow graph (CFG) is built for each procedure and that the program call graph (PCG) is constructed for the whole program. To describe our slicing algorithm easily, we use a statement instead of a basic block as a node in the CFG. We assume that each statement in the program is uniquely identified by its label ℓ and is associated with two sets: DEF[ℓ], the set of variables whose values are defined at ℓ, and USE[ℓ], the set of variables whose values are used at ℓ. In an MPI program, there are three main types of dependence for statements and variables that would change the behavior at a given program point: data dependence (dd), control dependence (cd), and communication dependence (md).

Data Dependence  Data dependence between statements means that the program's computation might be changed if the relative order of the statements were reversed [29]. To analyze the data dependence, we must first calculate the reaching definitions for each procedure. We define the GEN and KILL sets for each node in the CFG. Then we adopt the iterative algorithm presented in [30] to calculate the reaching definitions. The data flow graph (DFG) can be constructed based on the results of the reaching definitions analysis. A node in the DFG is either a statement or a predicate statement. An edge represents the data dependence of the variables. The data dependence information computed by the reaching definitions is stored in the data structures of DU and UD chains [31].

Definition 2.2 (DU and UD Chain) A def-use (DU) chain links each definition of a variable to all of its possible uses. A use-def (UD) chain links each use of a variable to the set of its definitions that can reach that use without any other intervening definition.

Example The DU chain for (10, size) and the UD chain for (32, size) in Fig. 2.3 are DU(10, size) = {26, 32, 39, 50} and UD(32, size) = {10}.
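As an illustration only (a hypothetical C fragment, not the Fortran code of Fig. 2.3), DU and UD chains relate a definition to its uses as in the comments below.

```c
#include <mpi.h>

/* "size" has a single definition and two uses; the DU chain of the
 * definition lists both uses, and each use's UD chain contains only
 * that definition. */
void distribute(int nprocs, int N, double *col, MPI_Comm comm) {
    int size = N / (nprocs - 1);                        /* definition of size            */
    for (int dest = 1; dest < nprocs; dest++) {
        MPI_Send(&size, 1, MPI_INT, dest, 0, comm);     /* use: UD chain = {definition}  */
        MPI_Send(col, size, MPI_DOUBLE, dest, 1, comm); /* use of size as a message size */
    }
}
```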

We can further optimize the Comm Set based on the results of data flow analysis. If there are no other intervening definitions between consecutive Comm variables, we keep only the last Comm variable. Therefore, the Comm Set for the program in Fig. 2.3 can be optimized to C(P) = {(7, myid), (8, nprocs), (27, dest), (33, source), (37, N), (50, size), (50, master), (51, tag)}.

Control Dependence  If a statement X determines whether statement Y is executed, statement Y is control-dependent on statement X. For example, the statement at Line 32 in Fig. 2.3 is control-dependent on the if statement at Line 13 and the do while statement at Line 30. The DFG does not include control dependence information. Control dependence can be computed with the post-dominance frontier algorithm [32]. We convert control dependence into data dependence by treating the predicate statement as a definition statement and then incorporating the control dependence into the UD chains.

Example After converting the control dependence of Lines 13 and 30 into data dependence, the UD chain for size at Line 32 in Fig. 2.3 is UD(32, size) = {10, 13, 30}.

Communication Dependence  Communication dependence is an inherent characteristic of MPI programs due to their message passing behavior. MPI programs use an explicit communication model to exchange data between different processes. For example, sending and receiving routines for point-to-point communications are usually used in pairs in the programs.

Definition 2.3 (Communication Dependence) Statement x in process i is communication-dependent on statement y in process j if
1. Process j sends a message to process i through explicit communication routines.
2. Statement x is a receiving operation, and statement y is a sending operation (x ≠ y).

For example, in Fig. 2.3, the MPI_Recv routine at Line 37 is communication-dependent on the MPI_Send routine at Line 24. In MPI programs, both point-to-point communications and collective communications can introduce communication dependence. Communication dependence can be computed by identifying all potential matching communication operations in MPI programs. Although, in general, it is a difficult problem for static analysis to determine the matching operations, we find it sufficient in practice to deal with this problem using simple heuristics.

We conservatively connect all potential sending operations with a receiving operation and adopt heuristics, such as mismatched tags or data types of message buffers, to prune edges that cannot represent real matches. We will further discuss the communication dependence issues in Sect. 2.4.5.

In MPI programs, the message is exchanged through the message buffer variable, buf. The communication dependence can therefore be represented with the message buffer variable. msg_buf(ℓ) is used to denote the message buffer variable in the communication statement ℓ. Additional considerations for non-blocking communications will be described in the implementation of the runtime environment.

Definition 2.4 (MD Chain) A Message-Dependence (MD) Chain links each message receiving buffer variable to all of its sending operations.

Example The MD chain for variable A at Line 37 in Fig. 2.3 is MD(37, A) = {24}. The message buffer variable in the MPI communication routine at Line 24 is msg_buf(24) = {A}.

Definition 2.5 The slice set of an MPI program M with respect to the slicing criterion C(M), denoted by S(C(M)), consists of all statements ℓ on which the values of the variables in C(M) directly or indirectly depend. More formally:

S(C(M)) = { ℓ | v →^{d_1} ⋯ →^{d_n} ℓ, v ∈ C(M), n > 0, and d_i ∈ {cd, dd, md} for 1 ≤ i ≤ n }.

We use the symbol → to denote the dependence between variables and statements. For computing the program slice with respect to the slicing criterion C(M), we define the LIVE variable to record the dependence relationships between the variables of the program. A Comm variable itself is also a LIVE variable by the definition of LIVE variable.

Definition 2.6 (LIVE Variable) A variable x is LIVE if a change of its value at statement ℓ can affect the value of some Comm variable v directly or indirectly through the dependences of MPI programs, denoted by v →* (ℓ, x). There is a LIVE set for each procedure P: LIVE[P] = {(ℓ, x) | v →* (ℓ, x), v ∈ C(P)}.

2.4.3 Intra-procedural Analysis

During the intra-procedural analysis phase, data dependence, control dependence, and communication dependence are collected and put into the corresponding data structures. Each procedure P is associated with two sets, WL[P] and LIVE[P]. WL[P] is a worklist that holds the variables waiting to be processed, and LIVE[P] holds the LIVE variables for procedure P. As program slicing is a backward data flow problem, we use a worklist algorithm to traverse the UD chains and iteratively find all the LIVE variables. We put the statements that define LIVE variables into the slice set S(P), and we mark MPI statements that define LIVE variables or have communication dependence with marked MPI statements. The main body of the analysis algorithm is given in Algorithm 1, where receive_buf denotes the message buffer variables in the receiving operations. The worklist WL[P] for each procedure is initialized with its Comm Set, and LIVE[P] is initialized with the empty set.

Algorithm 1 Compute LIVE set and mark MPI statements for intra-procedure
1:  procedure INTRA-LIVE(P)
2:    input: worklist WL[P] and LIVE set LIVE[P]
3:    output: program slice set of the procedure: S(P)
4:    Change ← False
5:    while WL[P] ≠ φ do
6:      Remove an item (ℓ_i, v) from WL[P]
7:      if (ℓ_i, v) ∉ LIVE[P] then
8:        Change ← True
9:        LIVE[P] ← {(ℓ_i, v)} ∪ LIVE[P]
10:       ⊳ Process communication dependence
11:       if (ℓ_i, v) ∈ receive_buf then
12:         for ℓ_j ∈ MD(ℓ_i, v) do
13:           Mark MPI statement ℓ_j
14:           S(P) = S(P) ∪ {ℓ_j}
15:           WL[P] ← {(ℓ_j, msg_buf(ℓ_j))} ∪ WL[P]
16:         end for
17:       else   ⊳ Process control and data dependence
18:         for ℓ_k ∈ UD(ℓ_i, v) do
19:           if ℓ_k ∈ MPI_Routines then
20:             Mark MPI statement ℓ_k
21:           end if
22:           S(P) = S(P) ∪ {ℓ_k}
23:           for x ∈ USE[ℓ_k] do
24:             WL[P] ← {(ℓ_k, x)} ∪ WL[P]
25:           end for
26:         end for
27:       end if
28:     end if
29:   end while
30:   return S(P)
31: end procedure

Example After running Algorithm 1, the final LIVE set for the program in Fig. 2.3 is LIVE[P] = {(7, myid), (8, nprocs), (27, dest), (33, source), (37, N), (50, size), (50, master), (51, tag), (22, nprocs), (30, nprocs), (13, myid), (13, master), (10, cols), (10, N), (9, N), (9, nprocs)}. The slice set of the program is S(P) = {3, 7, 8, 9, 10, 11, 12, 13, 22, 30}. The MPI routines at Lines 7–8 are marked by the algorithm.

In Algorithm 1, the statements not in the slice set, except MPI routines, are deleted, while all the MPI routines are retained. If an MPI routine is left unmarked by Algorithm 1, no LIVE variable is defined in it and it has no communication dependence with marked statements. The retained unmarked MPI routines serve the runtime environment of Fact for collecting communication traces.

Fig. 2.5 Marked MPI point-to-point communication routines by Algorithm 1 (M means marked)

For example, the six unmarked communication routines in Fig. 2.3 do not actually need to transfer messages over the network. We only need to collect the values of their Comm variables at runtime. Therefore, the communication time of the original program can be significantly reduced. In contrast, for MPI routines marked by Algorithm 1, LIVE variables are defined in these routines, or communication dependence exists for them. For MPI routines used for message passing, this means that the contents of the messages are relevant to the communication patterns. In Fig. 2.5, the LIVE variable num is defined in the MPI_Irecv at Line 5, which is communication-dependent on the MPI_Send at Line 2. Therefore, both MPI routines are marked by the algorithm and will be executed at runtime.

Algorithm 1 is sufficient for MPI programs with a single function, such as the program in Fig. 2.3. Real parallel applications are usually modularized into several procedures. In the following subsection, we present additional considerations for inter-procedural analysis.
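Before turning to inter-procedural analysis, the following hypothetical C fragment (it is not the code of Fig. 2.5 and uses made-up names) mirrors the situation just described: a received value later determines a message size, so both ends of that transfer are marked.

```c
#include <mpi.h>

/* num is received and then used as the count of a later message, so it is
 * LIVE; the MPI_Irecv defines a LIVE variable and the matching MPI_Send is
 * marked through the MD chain, so both must really execute at runtime. */
void exchange(int rank, double *payload, MPI_Comm comm) {
    int num = 1024;
    if (rank == 0) {
        MPI_Send(&num, 1, MPI_INT, 1, 0, comm);           /* marked (M)              */
    } else if (rank == 1) {
        MPI_Request req;
        MPI_Irecv(&num, 1, MPI_INT, 0, 0, comm, &req);    /* marked (M): defines num */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Send(payload, num, MPI_DOUBLE, 0, 1, comm);   /* unmarked (U): contents irrelevant */
    }
}
```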

2.4.4 Inter-procedural Analysis

Slicing across procedure boundaries is complicated by the necessity of passing the LIVE variables into and out of procedures. Because program slicing is a backward data flow problem and the slicing criterion can arise either in the calling procedure (caller) or in the called procedure (callee), a LIVE variable can propagate bidirectionally between the caller and the callee through parameter passing. To obtain a precise program slice, we adopt a two-phase traversal over the PCG, a top-down phase followed by a bottom-up phase. Additionally, the UD chains built during the intra-procedural phase are refined to consider the side effects of procedure calls. We assume that all the parameters are passed by reference; our algorithm can be extended to the case where they are passed by value.

MOD/REF Analysis  To build precise UD chains, we use the results of inter-procedural MOD/REF analysis. For example, in Fig. 2.6, before incorporating the information from the MOD/REF analysis, UD(4, a) = {2, 3}. We compute the following sets in the MOD/REF analysis for each procedure [33]: GMOD(P) and GREF(P).

Fig. 2.6 An example of LIVE variable propagation from the caller to the callee

GMOD(P) is the set of variables that are modified by an invocation of procedure P, while GREF(P) is the set of variables that are referenced by an invocation of procedure P [34]. The information from the MOD/REF analysis tells us whether a variable is modified or referenced by procedure calls. With these results, we can refine the UD chains built during the intra-procedural analysis. For example, UD(4, a) = {3}.

Extension of MD Chains  The MD chains collected during the intra-procedural phase do not include inter-procedural communication dependence. During the inter-procedural analysis phase, MD chains are extended to cover cross-procedural dependence. At the same time, Algorithm 1 is extended to Algorithm 2, which will be invoked by Algorithm 3. Only the parts that differ from Algorithm 1 are listed here. P:ℓ_j denotes the statement ℓ_j in procedure P.

Algorithm 2 Extension of INTRA-LIVE(P)
1:  procedure INTRA-LIVE-EXT(P)
    ...
12:   for each P:ℓ_j ∈ MD(ℓ_i, v) do
13:     Mark MPI statement P:ℓ_j
14:     S(P) = S(P) ∪ {ℓ_j}
15:     WL[P] ← {(ℓ_j, msg_buf(ℓ_j))} ∪ WL[P]
    ...

Top-Down Analysis  The top-down phase propagates the LIVE variables from the caller to the callee over the PCG by binding the actual parameters of the caller to the formal parameters of the callee. As a LIVE variable can be modified by the called procedure via parameter passing, we need to find the corresponding definition of this variable in the called procedure. For example, in Fig. 2.6, we can compute from the intra-procedural analysis that (3, a) is a LIVE variable. This calling context is then passed into the procedure bar. The corresponding formal parameter in bar is the parameter b. There may be several definitions of b in procedure bar; however, due to the property of backward data flow analysis, we only care about the last definitions of variable b (a set, due to the effects of control flow). This definition appears in statement 9 in bar.

Fig. 2.7 An example of LIVE variable propagation from the callee to the caller

In addition, we put this statement into the slice set and put its USE variables into the worklist of procedure bar. Other LIVE variables in procedure bar can then be computed iteratively by Algorithm 2. In procedure foo, the actual parameter (3, a) is no longer put into its worklist. We define the LIVE_Down function to formalize this data flow analysis.

Definition 2.7 (LIVE Down) Procedure P invokes procedure Q, v is a LIVE variable and also an actual parameter at callsite ℓ in procedure P, and v′ is the corresponding formal parameter in procedure Q. LIVE_Down(P, ℓ, v, Q) returns the statement set (L is the label set) of the last definitions of v′ in procedure Q: LIVE_Down(P, ℓ, v, Q) = L.

Bottom-Up Analysis  The bottom-up phase is responsible for propagating the LIVE variables from the callee to the caller. For a LIVE variable in the called procedure, if its definition is a formal parameter, we need to propagate the LIVE information by binding the formal parameters to the actual parameters. For example, in Fig. 2.7, the formal parameter b in procedure bar is a LIVE variable computed by the intra-procedural analysis. We need to propagate this information into the calling procedure foo. The corresponding actual parameter is the parameter a in foo. We put this variable into the worklist of procedure foo, and Algorithm 2 is used to compute the other LIVE variables. The LIVE_Up function is defined as follows:

Definition 2.8 (LIVE Up) Procedure Q is invoked by procedure P, v is a LIVE variable and also a formal parameter (the label of the procedure entry point is ℓ_0) in procedure Q, and v′ is the corresponding actual parameter in procedure P at callsite ℓ′. LIVE_Up(Q, ℓ_0, v, P) returns the label of the callsite and the actual parameter as a pair: LIVE_Up(Q, ℓ_0, v, P) = (ℓ′, v′).

The final algorithm for program slicing based on live propagation is given in Algorithm 3, which invokes Algorithm 2. The output of LPSA is the program slice set as well as the directives for MPI routines. Our experimental results show that LPSA can converge within three to four iterations of the outer loop. Let C(M) be the slicing criterion for a given MPI program M, and let S(M) be the slice set computed by LPSA. Then the correctness of the algorithm can be stated by Theorem 2.1. A sketch of the proof of this theorem is given in the Appendix.
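To make the two propagation directions concrete before presenting the full algorithm, here is a hypothetical C fragment in the spirit of Figs. 2.6 and 2.7 (the names foo and bar match the discussion, but the statements and labels are invented and do not correspond to the figures' numbering).

```c
#include <mpi.h>

/* LIVE_Down: a is LIVE at the call site in foo (it later feeds a message
 * size), so the analysis descends into bar and continues from the last
 * definition of the bound formal parameter b.
 * LIVE_Up: b is used as a message size inside bar, so it is LIVE there and
 * the corresponding actual parameter a is added to foo's worklist. */
static void bar(int *b, double *buf, MPI_Comm comm) {
    *b = *b * 2;                                /* last definition of *b in bar      */
    MPI_Send(buf, *b, MPI_DOUBLE, 1, 0, comm);  /* *b determines a message size      */
}

void foo(double *buf, MPI_Comm comm) {
    int a = 8;                                  /* definition pulled in by LIVE_Up   */
    bar(&a, buf, comm);                         /* call site: actual a bound to b    */
    MPI_Send(buf, a, MPI_DOUBLE, 1, 1, comm);   /* a used as a message size: LIVE    */
}
```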

Algorithm 3 Pseudo-code for the live-propagation slicing algorithm (LPSA)
1:  input: An MPI program M
2:  output: Program slice S(M) and marked information
3:  For each procedure P: Build UD and MD Chains
4:  For each procedure P: Build Comm Set C(P)
5:  MOD/REF analysis over the PCG
6:  For each procedure P: Refine UD and MD chains
7:  For each procedure P: WL[P] ← C(P)
8:  For each procedure P: LIVE[P] ← ∅
9:  Change ← True
10: while (Change = True) do
11:   Change ← False
12:   ⊳ Top-Down Phase
13:   for procedure P in Pre-Order over PCG do
14:     call INTRA-LIVE-EXT(P)
15:     for Q ∈ successor(P) do
16:       for parameter v at callsite ℓ do
17:         if ((ℓ, v) ∈ LIVE[P]) then
18:           L = LIVE_Down(P, ℓ, v, Q)
19:           for ℓ′ ∈ L do
20:             if ℓ′ ∈ MPI_Routines then
21:               Mark MPI statement ℓ′
22:             end if
23:             S(Q) = S(Q) ∪ {ℓ′}
24:             for x ∈ USE[ℓ′] do
25:               WL[Q] ← {(ℓ′, x)} ∪ WL[Q]
26:             end for
27:           end for
28:         end if
29:       end for
30:     end for
31:   end for
32:   ⊳ Bottom-Up Phase
33:   for procedure Q in Post-Order over PCG do
34:     call INTRA-LIVE-EXT(Q)
35:     for P ∈ predecessor(Q) do
36:       for formal parameter v at ℓ_0 in Q do
37:         if ((ℓ_0, v) ∈ LIVE[Q]) then
38:           (ℓ′, x) = LIVE_Up(Q, ℓ_0, v, P)
39:           WL[P] ← {(ℓ′, x)} ∪ WL[P]
40:           S(P) = S(P) ∪ {ℓ′}
41:         end if
42:       end for
43:     end for
44:   end for
45: end while
46: For each procedure P: return S(P)

Theorem 2.1 S(C(M)) = S(M).

2.4.5 Discussions

A common question about our algorithm is how it works with applications whose communication behavior depends on message data or even input data. As demonstrated by Theorem 2.1, the LPSA algorithm always guarantees that the generated program slice preserves such message data or input data and the related computation statements. We performed experiments with 7 NPB programs and Sweep3D. There are a few marked MPI communication routines. For example, in the NPB BT program in Fig. 2.8, MPI_Bcast is used to broadcast the iteration count and datatype to the slave processes. In order not to affect the communication patterns, these routines will be executed at runtime. An interesting observation is that all the marked MPI communication statements are collective communications and no point-to-point communication is marked in these programs, i.e., the message contents of all the point-to-point communications are irrelevant to the communication patterns. This can be explained by the fact that in mature MPI applications, collective communications are generally preferred over point-to-point communications for distributing such pattern-determining values. Thus, although our approach for matching communication routines is quite conservative, we find it works well for all applications we have tested.

There is some work on more accurate communication matching algorithms. MPI-ICFG [35] is considered an effective approach to identify potential matching operations. Recently, Bronevetsky [36] has also proposed a uniform data flow framework to address this problem. We are studying more MPI applications and may adopt these techniques when there is a demand. In addition, we are defining annotation constructs that programmers can use to tag matching communication operations.
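Figure 2.8 is not reproduced here; a hedged C sketch of the kind of marked broadcast it describes follows (BT itself is written in Fortran, and the names below are invented).

```c
#include <mpi.h>

/* The broadcast content (an iteration count) later controls how many
 * communication steps every process performs, so the buffer is LIVE and the
 * collective is marked (M): it must really be executed at runtime. */
void run(int rank, MPI_Comm comm) {
    int niter = 0;
    if (rank == 0) niter = 200;              /* e.g., read from the input deck    */
    MPI_Bcast(&niter, 1, MPI_INT, 0, comm);  /* marked (M) by LPSA                */
    for (int it = 0; it < niter; it++) {     /* niter drives the exchanges below  */
        /* ... point-to-point exchanges whose count depends on niter ... */
    }
}
```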

Fig. 2.8 Marked MPI collective communications by LPSA for NPB-3.3/bt.f program (M means marked)

2.5 Implementation

2.5.1 Compilation Framework

We have implemented LPSA for Fact in the production compiler Open64 [37], and our patch for Open64 can be downloaded from the website (www.hpctest.org.cn/resources/fact-1.0.tgz). Open64 is the open-source version of the SGI Pro64 compiler under the GNU General Public License (GPL). As shown in Fig. 2.9, the major functional modules of Open64 are the front end (FE), pre-optimizer (PreOPT), inter-procedural analysis, loop nest optimizer (LNO), global scalar optimizer (WOPT), and code generator (CG). To exchange data between different modules, Open64 utilizes a common intermediate representation (IR), called WHIRL. WHIRL consists of five levels of abstraction, from very high level to lower levels. Each optimization module works on a specific level of WHIRL.

Fact is implemented in the PreOPT and inter-procedural analysis modules, as shown in Fig. 2.9. In the PreOPT phase, the CFG is created for each procedure. Control dependence analysis is carried out on the CFG in reverse dominator tree order, while the data dependence is collected into the DU and UD chains. The inter-procedural analysis module can be further divided into three main phases: Inter-procedural Local Analysis (IPL), Inter-procedural Analyzer (IPA), and Inter-Procedural Optimizer (IPO). During the IPL phase, we parse the WHIRL tree and identify the communication routines.

Fig. 2.9 Fact in the Open64 infrastructure

Communication dependence is collected into MD chains, and Comm variables are stored in the form of summary data. During the IPA phase, the PCG is constructed, MOD/REF analysis is performed on it, and the DU and UD chains built in the IPL phase are refined. MD chains are extended to consider cross-procedural communication dependence. By solving an inter-procedural data flow equation during the IPA phase, we compute the LIVE sets and slice sets for each procedure and mark the necessary MPI statements. During the IPO phase, we delete all the statements that are not in the slice sets, except MPI routines, and remove the variables that are not in the LIVE sets from the symbol table. The marked information for MPI communication routines is retained in the program slice. Currently, Fact only supports Fortran programs; supporting other languages remains future work.

2.5.2 Runtime Environment

The runtime environment is in charge of collecting communication traces from the program slice. It provides a custom MPI wrapper communication library that differentiates MPI routines based on their functions. MPI routines used to create and shut down the MPI runtime environment, such as MPI_Init and MPI_Finalize, are not modified and are executed directly by the library. For MPI routines used to manage communication contexts, such as MPI_Comm_split and MPI_Comm_dup, the library executes these routines and collects information about the rank translation between different communicators. For MPI routines used for message passing, such as MPI_Send, MPI_Irecv, and MPI_Bcast, the library first judges the state of the MPI routine based on the results of the LPSA analysis. If the communication routine is marked, we need to execute it and, at the same time, collect the communication property information. Otherwise, only the related information is recorded. In addition, for unmarked non-blocking communication routines, the request parameters of these routines are set so that the library guarantees that the corresponding operations, such as MPI_Wait or MPI_Waitall, are not actually executed. We use the MPI profiling layer (PMPI) to capture the communication events, record the communication traces to a memory buffer, and eventually write them to local disks. Figure 2.10 gives an example of collecting communication traces for the MPI_Send routine. In Fig. 2.10, myid is a global variable computed with PMPI_Comm_rank. Our runtime environment also provides a series of communication trace analyzers that can generate communication profiles of applications, such as the distribution of message sizes and the communication topology graph.
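In the spirit of Fig. 2.10, the following is a hedged sketch of such a PMPI wrapper. The helpers fact_is_marked() and fact_record() and the trace record layout are assumptions for illustration, not Fact's actual interface, and the MPI_Send prototype must match the mpi.h in use.

```c
#include <mpi.h>

/* Assumed helpers: fact_is_marked() consults the directive emitted at compile
 * time for the current call site; fact_record() appends one record to the
 * in-memory trace buffer that is eventually flushed to a local disk file. */
extern int  fact_is_marked(void);
extern void fact_record(const char *op, int src, int dest, int count,
                        MPI_Datatype type, int tag, MPI_Comm comm);

static int myid;   /* set once with PMPI_Comm_rank in the MPI_Init wrapper */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    /* Always record the spatial and volume properties of the message. */
    fact_record("MPI_Send", myid, dest, count, type, tag, comm);

    /* For an unmarked send the contents are irrelevant to the pattern, so the
     * actual transfer is skipped; marked sends go to the real implementation. */
    if (!fact_is_marked())
        return MPI_SUCCESS;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}
```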

Fig. 2.10 Pseudo-code for collecting the communication traces for MPI_Send routine at runtime

2.6 Evaluation

2.6.1 Methodology

We evaluate Fact with ASCI Sweep3D (S3) [38] as well as 7 NPB programs [3]: BT, CG, EP, FT, LU, MG, and SP. NPB is a set of scientific benchmarks derived from computational fluid dynamics applications. We use version 3.3 of NPB and the Class D data set. Sweep3D is an application in the ASCI Purple suite that is used to solve a three-dimensional particle transport problem. In our experiments, the problem size in Sweep3D is fixed for each process (150 × 150 × 150). The main communication routines used in the NPB programs and Sweep3D are listed in Table 2.1.

We perform our experiments on two platforms: a test platform and a validation platform. The test platform is a small-scale system used to collect communication traces with Fact, while the validation platform is a large-scale system used to validate the communication traces collected with Fact and to record the memory requirements and execution time of the original programs. Details of the two platforms are given below:

• Test Platform (32 cores): A small cluster of four nodes, where each node has two quad-core Intel Xeon E5345 2.33 GHz CPUs and 8 GB of memory, connected with Gigabit Ethernet. Our custom communication library is implemented based on mpich2-1.0.7 [39].
• Validation Platform (512 cores): A cluster consisting of 32 nodes. Each node has four quad-core AMD 8347 1.9 GHz CPUs and 32 GB of memory, connected with a 20 Gbps InfiniBand network. The MPI library is mvapich-1.1.0 [40].

Table 2.1 Main communication routines used in NPB programs and Sweep3D (S3). The rows list the routines MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Barrier, MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Alltoall, and MPI_Comm_Split; the columns (BT, CG, EP, FT, LU, MG, SP, and S3) mark which routines each program uses
2.6.2 Validation

We give the proof of correctness of our algorithm in the Appendix. In addition, we also validate the implementation of Fact by comparing the communication traces collected by Fact with traces collected by traditional trace collection methods on the validation platform. We perform the comparison with the seven NPB programs and Sweep3D for different numbers of processes (64, 128, 256, and 512). The experimental results show that the traces are identical, except that the traces collected by Fact do not include time stamps.

Communication patterns of parallel applications can be extracted from the communication traces. Figure 2.11 shows the extracted communication patterns for BT, LU, and MG with Fact (due to space limitations, we only show the communication patterns for a subset of the applications). The gray level of the cell at the ith row and jth column represents the communication volume (in bytes) between processes i and j. From the figures, we can see that the communications in these programs exhibit good locality. Most of the communications occur between adjacent processes around the diagonal of the communication matrix.
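A minimal sketch of how such a volume matrix can be accumulated from the collected traces follows. It assumes a simple per-message record with sender, receiver, and byte count, which is not Fact's actual trace format.

```c
#include <stddef.h>

/* One recorded point-to-point message: sender rank, receiver rank, bytes. */
struct msg_record { int src; int dst; size_t bytes; };

/* Accumulate an nprocs x nprocs communication volume matrix from n records
 * (row = receiver, column = sender, matching the plots in Fig. 2.11). */
void volume_matrix(const struct msg_record *rec, size_t n,
                   size_t *volume, int nprocs)
{
    for (size_t k = 0; k < (size_t)nprocs * nprocs; k++)
        volume[k] = 0;
    for (size_t k = 0; k < n; k++)
        volume[(size_t)rec[k].dst * nprocs + rec[k].src] += rec[k].bytes;
}
```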

2.6.3 Performance

2.6.3.1 Memory Consumption

To present the advantages of Fact over the traditional trace collection methods, we collect communication traces of the NPB programs (Class D) and Sweep3D (150 × 150 × 150) with a large data set on a small-scale system, the four-node test platform, which has only 32 GB of memory in total. The memory requirements of these programs, except EP and LU, for 512 processes exceed the memory capacity of the test platform.

Fig. 2.11 Extracted communication patterns for BT, LU, and MG (Class = D, NPROCS = 64) with Fact. (a) BT. (b) LU. (c) MG

For example, the NPB FT with Class D input for 512 processes will consume about 126 GB of memory. Therefore, the traditional trace collection methods cannot collect the communication traces on such a small-scale system due to the memory limitation. Our experimental results, shown in Fig. 2.12, demonstrate that Fact is able to collect the communication traces for these programs on the test platform. Moreover, it consumes very little memory. The memory requirements of the original programs are collected on the validation platform. In most cases, the memory consumption for collecting the communication traces with Fact is reduced by two orders of magnitude compared to the original programs. For example, Sweep3D only consumes 0.13 GB of memory for 64 processes and 1.25 GB of memory for 512 processes with Fact, while the original program consumes 26.61 GB and 213.83 GB of memory, respectively. Figure 2.13 shows the memory consumption of Fact compared to the Null benchmark.

Fig. 2.12 The memory consumption (in gigabytes) of Fact for collecting the communication traces of NPB programs (Class D) and Sweep3D (150 × 150 × 150) on the test platform. (a) BT. (b) CG. (c) EP. (d) FT. (e) LU. (f) MG. (g) SP. (h) Sweep3D

Fig. 2.13 The memory consumption of Fact compared to the Null micro-benchmark when collecting the communication traces for different programs. AVG is the arithmetic mean over all the programs

Null is a micro-benchmark that contains only the invocations of MPI_Init and MPI_Finalize and no other computation or communication operations. Null is used to provide a lower bound on the memory consumption of an MPI program for different numbers of processes. Note that the memory consumption of Null grows as the number of processes increases. This is because the MPI communication library itself consumes a certain amount of memory for process management. As shown in Fig. 2.13, the memory consumption of Fact for all the programs is close to the Null benchmark for different numbers of processes. Among these programs, BT and SP consume relatively more memory than the others. In contrast, EP and CG consume the least memory. For example, with 512 processes, the Null benchmark consumes 1.04 GB of memory, while EP and CG consume 1.11 GB and 1.22 GB of memory, respectively. Additionally, the memory buffer in the runtime library of Fact also consumes a certain amount of memory, no more than 320 KB per process. The results show that the LPSA algorithm in Fact can effectively reduce the memory requirements of the original programs.
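For reference, the Null micro-benchmark described above is essentially the following minimal C program (the original may differ in trivial details).

```c
#include <mpi.h>

/* Null: only MPI initialization and finalization, with no computation or
 * communication; it serves as a lower bound on per-process memory usage. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}
```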

2.6.3.2 Execution Time

Figure 2.14 shows the execution time of Fact when collecting the communication traces on the test platform. As the traditional trace collection methods cannot collect the communication traces on the test platform, the execution time of the original programs is collected on the validation platform.

Fig. 2.14 The execution time (in seconds) of Fact when collecting the communication traces of NPB programs and Sweep3D on the test platform (32 cores). (a) BT. (b) CG. (c) EP. (d) FT. (e) LU. (f) MG. (g) SP. (h) Sweep3D

In addition, as the problem size is fixed for each process in Sweep3D, its execution time increases with the number of processes.

Since Fact deletes the irrelevant computations of the original program at compile time and only executes the necessary communication operations at runtime, the execution time of the original program can be reduced significantly. For example, Fact takes just 0.28 s to collect the communication traces of BT for 64 processes, while the original program still takes 1175.65 s running on the 512-core validation platform. As few communication operations are used in the EP program, its execution time is negligible after slicing.

When collecting communication traces for 512 processes, the execution time with Fact increases slightly for a few programs. This is caused by two factors: (1) Besides the computation time in the program slice, the execution time with Fact also includes the time for recording the communication traces into the memory buffer and eventually writing them into disk files. When the number of processes is small enough, the buffer of the file system can hold all the trace files in memory. When the number of processes increases, the size of the communication trace files exceeds the buffer limit of the file system, the file system flushes the buffer to the hard disk, and as a result the I/O time increases dramatically for a large number of processes. (2) As the communication operations marked at compile time are executed at runtime, such as the MPI_Bcast invocation of the BT program in Fig. 2.8, the communication time also increases on such a small-scale system for a large number of processes due to network contention.

Overall, the execution time with Fact is acceptable for most developers studying communication patterns on such a small-scale system. In addition, Fact can benefit when more nodes are available. Figure 2.15 illustrates the results when collecting the communication traces of MG with 512 processes and BT with 400 processes using more nodes on the test platform. Fact shows good scalability with the number of nodes. For example, with 12 nodes, Fact only takes 2.43 s and 3.46 s to collect the communication traces for MG and BT, respectively. This is because both the I/O time and the communication time mentioned above are reduced as the number of nodes increases.

2.7 Applications

2.7.1 Optimize Process Placement of MPI Programs

In this section, we present an application that uses the communication traces collected with Fact to optimize communication performance. Modern parallel computers, such as SMP (symmetric multiprocessor) clusters and multi-clusters, exhibit nonuniform communication cost. Therefore, it is important to map virtual parallel processes to physical processors (or cores) in an optimal way to obtain scalable performance.

Fig. 2.15 The execution time when collecting the communication traces of MG with 512 processes and BT with 400 processes by using more nodes on the test platform

Current research works, such as MPIPP and MPISX [4, 41], address this problem. However, the communication topology graphs (CTGs) in these methods are acquired with the traditional, costly trace collection methods. In this chapter, we present a cost-effective process placement method, called MAPIX, using the Fact framework. The process placement approach of MAPIX includes three steps:

(1) Collect communication traces of the application and generate the CTG using Fact. In MAPIX, collective communications are decomposed into point-to-point communications based on their specific implementation algorithms. In fact, collective communications are system- and MPI-implementation-dependent, such as the hardware support for collectives in BlueGene/L. Finally, the CTG is composed of the decomposed collective communications and the point-to-point communications, and it includes a communication count matrix and a communication volume matrix.
(2) Generate the network topology graph (NTG) of the target platform. This is done automatically with a parallel ping-pong benchmark. A communication latency matrix and a communication bandwidth matrix between each pair of processor cores are recorded in the NTG.
(3) Compute the optimized process placement scheme by mapping the CTG to the NTG with the heuristic K-way graph mapping algorithm presented in [4]. The communication cost in MAPIX is computed with the Hockney model [42]; a minimal sketch of this cost model appears after the experimental setup below.

Finally, the parallel application is submitted to the parallel system with the optimized process placement scheme.

Fig. 2.16 The normalized time for NPB programs with the optimized process placement scheme (MAPIX) compared to the MPI default placement scheme. (The base time in seconds is listed on the x-axis.) AVG means the geometric mean

We performed our experiment on a nonuniform communication platform, a 16-node cluster connected with Gigabit Ethernet. Each node has two 1.6 GHz Intel Xeon dual-core processors and 4 GB of memory. The MPI communication library is MPICH-1.2.7 [43]. The data set of the NPB is Class C with 64 processes. The MPI default placement scheme used is the round-robin order. The CTGs of the NPB programs are collected with Fact on the test platform. The results are shown in Fig. 2.16. The performance of BT, CG, LU, FT, and SP is greatly improved. For example, the execution time of BT improves from 103.8 s to 90.7 s. Because EP performs very little communication and the MPI default placement scheme is almost optimal for MG, the performance of these two programs is only marginally improved by MAPIX. Overall, the performance of the seven NPB programs with MAPIX shows a 9.8% improvement on average over the MPI default scheme.

The advantage of MAPIX over MPIPP is that the cost to acquire the CTG of the application is significantly reduced. Figure 2.17 shows the time cost, on a logarithmic scale, to acquire the CTGs of the NPB programs with MPIPP and MAPIX. From the figure, we can see that MAPIX is much more efficient than MPIPP at acquiring the CTGs of the programs.
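For step (3) above, the Hockney model prices a message as a latency term plus size over bandwidth. A minimal sketch of how a placement's cost could be evaluated under that model follows; the array layouts and function name are assumptions for illustration, not MAPIX's actual code.

```c
#include <stddef.h>

/* Hockney model: time(src, dst, bytes) = latency[src][dst] + bytes / bandwidth[src][dst].
 * map[] assigns each of the p processes to a core; the total cost sums the
 * model over the CTG's message count and volume matrices. */
double placement_cost(int p, const int *map,
                      const double *count,    /* p x p: number of messages          */
                      const double *volume,   /* p x p: bytes exchanged              */
                      const double *lat,      /* cores x cores: latency in seconds   */
                      const double *bw,       /* cores x cores: bandwidth in bytes/s */
                      int cores)
{
    double cost = 0.0;
    for (int i = 0; i < p; i++)
        for (int j = 0; j < p; j++) {
            if (i == j) continue;
            size_t c = (size_t)map[i] * cores + map[j];
            cost += count[(size_t)i * p + j] * lat[c]
                  + volume[(size_t)i * p + j] / bw[c];
        }
    return cost;
}
```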

2.7.2 Sensitivity Analysis of Communication Patterns to Input Parameters

As with any other profile-based analysis, there are arguments about the usage of communication traces because communication traces are in fact dependent on the input parameters.

Fig. 2.17 The time comparison (in seconds) to acquire the CTGs of NPB programs between MPIPP and MAPIX

Thus, it is important to reveal the relationship between input parameters and communication patterns. Traditionally, developers manually express the application-specific knowledge of the parallel applications, but this is very time-consuming and error-prone. It is greatly desired that we can perform sensitivity analysis of communication patterns to key input parameters automatically. Because of the prohibitive resource and time requirements of traditional methods, it is too expensive to get communication traces even for a single input, not to mention the sensitivity analysis, which requires multiple executions. Fact enables us to explore the sensitivity analysis in a cost-effective way.

We use Sweep3D as an example. Sweep3D has four key input parameters affecting the communication performance of the program: i, j, mk, and mmi. The values of i and j determine the mapping shape of the processes. The number of processes must be equal to the product of i and j. The computation granularity of the pipeline is determined by a k-plane blocking factor (mk) and an angle blocking factor (mmi). In our experiments, it takes less than 1 s to collect seven sets of communication traces by Fact with different input parameters on the test platform. The number of processes is 64.

Figure 2.18 exhibits the communication locality [2] of Sweep3D when varying the input parameters i and j. Note that the two sets of input parameters present different communication localities for the application. For example, in Fig. 2.18a, Process 8 communicates with Processes 0, 9, and 18 frequently, while in Fig. 2.18b, Process 8 communicates with Processes 4, 9, and 12 more frequently. Table 2.2 shows that when the value of mk increases, the message size increases. The same holds for the parameter mmi.

Fig. 2.18 The sensitivity of the communication spatial locality of Sweep3D to the input parameters i and j. (a) mk = 10, mmi = 3, i = 8, j = 8. (b) mk = 10, mmi = 3, i = 4, j = 16

Table 2.2 The sensitivity of message size (msg_size) in bytes and message count (msg_count) to the input parameters

Input paras   mk = 10, mmi = 1   mk = 20, mmi = 1   mk = 10, mmi = 3   mk = 20, mmi = 3   mk = 50, mmi = 6
msg_size      12,000             24,000             36,000             72,000             360,000
msg_count     17,280             9216               5760               3072               576

When mk = 50 and mmi = 6, the message size changes to 360,000 bytes, and the message count is 576 in Sweep3D. This information is very important for understanding the communication behavior of Sweep3D.

2.8 Limitations and Discussions

Absence of Temporal Attributes  The Fact framework is based on the observation that many applications of communication trace analysis can be done without temporal attributes. We reiterate some known applications here: MPI process mapping optimization [4, 16], communication subsystem design [44], and subgroup replay-based MPI debuggers [7]. In addition, Fact can also be used for performance prediction as described in [45, 46]. We do not intend to, and cannot, list all potential applications of Fact, but we believe these known applications are sufficient to support its usefulness.

Fact cannot be used to support performance optimization and performance debugging that require temporal information.

For example, traces collected by Fact cannot be used with performance tuning tools such as the Intel Trace Analyzer [10] to get the overhead of a message transfer. Some automatic performance tuning algorithms, such as critical path analysis [47], cannot be supported by Fact either. However, we would like to mention that although traces collected by Fact do not include temporal attributes, they do include more than spatial and volume attributes. One important attribute preserved in the Fact framework is the temporal order attribute. Traces collected by Fact contain all the message operations and their sequences, in addition to statistics on the volume and spatial attributes. The temporal order attribute has many potential applications. For example, it can be used to identify similarity between MPI processes [46]. Due to space limitations, we omit further discussion of this issue.

Communication Nondeterminism  Like any trace-based approach, Fact is limited in representing many executions with one trace. One possible way to address the problem is to obtain traces multiple times and observe the sensitivity of communication patterns to different executions. In this sense, Fact has a significant advantage over traditional approaches because it costs much less to acquire traces. Fact may also affect the communication patterns of nondeterministic applications in a more subtle way. Because Fact may remove computation from the original programs, it changes the load balancing characteristics, which may in turn change the communication patterns. We will investigate this direction further and see whether this is a real problem in practice.

2.9 Conclusions

In this chapter, we propose a novel approach, called Fact, to acquire communication traces of large parallel message passing applications on small-scale systems. Our approach preserves the spatial and volume communication attributes while greatly reducing the time and memory overhead of the trace collection process. We have implemented Fact and evaluated it with several parallel programs. Experimental results show that Fact is very effective in reducing the resource requirements and collection time. In most cases, we obtain one to two orders of magnitude of improvement. To the best of our knowledge, Fact is the first work that can collect communication traces of large-scale parallel applications on small-scale systems. With Fact, we are able to explore more applications of communication patterns. In the future, we will evaluate Fact with more parallel applications.

References 1. Chodnekar, S., et al. (1997). Towards a communication characterization methodology for parallel applications. In Proceedings Third International Symposium on High-Performance Computer Architecture.


Chapter 3

Structure-Based Communication Trace Compression

Abstract Communication traces are increasingly important, both for the performance analysis and optimization of parallel applications and for designing next-generation HPC systems. Meanwhile, the problem size and the execution scale on supercomputers keep growing, producing a prohibitive volume of communication traces. To reduce the size of communication traces, existing dynamic compression methods introduce large compression overhead as the job scale increases. We propose a hybrid static-dynamic method, called Cypress, which leverages information acquired from static analysis to facilitate more effective and efficient dynamic trace compression. Cypress extracts a program communication structure tree at compile time using inter-procedural analysis. This tree naturally contains crucial iterative computing features such as the loop structure, allowing subsequent runtime compression to "fill in," in a "top-down" manner, event details into the known communication template. Results show that Cypress reduces intra-process and inter-process compression overhead by up to 5× and 9×, respectively, over state-of-the-art dynamic methods while only introducing very low compilation overhead. (© 2014 IEEE. Reproduced, with permission, from Jidong Zhai, et al., Cypress: combining static and dynamic analysis for top-down communication trace compression, SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2014.)

3.1 Introduction

Communication traces are indispensable in analyzing the communication characteristics of MPI (message passing interface) programs for performance problem identification and optimization [1, 2]. They are also highly useful for designing/co-designing future HPC (high-performance computing) systems [3], such as exascale systems, where trace-driven simulators (such as DIMEMAS [4], BSIM [5], and SIM-MPI [6]) are often employed to predict and compare application performance under alternative design choices. Many communication trace collection tools have been developed, such as Intel ITC/ITA [7], Vampir [8], TAU [9], Kojak [10], and Scalasca [11]. Typically,


these collection tools instrument MPI programs with the MPI profiling layer (PMPI) and record communication operation details (e.g., message type, size, source/destination, and timestamp) during the program execution. However, as applications scale up (in terms of both the number of processes and the problem size), the volume of communication traces increases dramatically. For example, the ASC benchmark SMG2000 [12] generates about 5 TB of communication traces with only a small problem size (64 × 64 × 32) for 22,538 processes [13]. A large volume of communication traces not only puts pressure on trace collection and storage but also interferes with the execution of user programs. Ironically, while trace analysis becomes more crucial for exascale applications' performance debugging and system design, trace collection itself becomes less and less affordable on such systems. To reduce the size of communication traces for large-scale parallel programs, several recent studies investigated communication trace compression to address the ever-increasing trace size [14–18]. For most applications, these approaches can produce orders-of-magnitude reduction in trace sizes, facilitating more efficient trace storage and processing. Meanwhile, dynamic trace compression methods take a "bottom-up" approach to discover patterns from the event sequence itself. As reported by existing research, they may have difficulty in compressing complex communication patterns or have very high computational complexity to process such patterns [15]. Also, trace compression itself brings nontrivial overhead (see Fig. 3.1). In particular, the inter-process compressed trace comparison and merge is a rather expensive procedure, with an O(n²) complexity in merging a pair of per-process traces (where n is the total length of compressed "patterns" after intra-process compression) [14]. We have found that the inter-process compression overhead grows linearly with the number of processes with a recent compression tool [18], which makes it challenging to scale to exascale workloads. In this chapter, we propose a novel "top-down" technique, called Cypress [19], for effective, scalable communication trace compression. Cypress combines static program analysis with dynamic runtime trace compression. It extracts program structure at compile time, obtaining critical loop/branch control structures, which

Fig. 3.1 Sample segments of MPI program and its communication trace


enable the runtime compression module to easily identify similar communication patterns. This approach is motivated by the observation that most of the communication information needed by trace compression can be acquired from the program structure. For example, in Fig. 3.1, a compiler can effortlessly gather that MPI_Waitall and MPI_Reduce are in the outermost loop, while MPI_Isend and MPI_Irecv are in the innermost loop but in different branch structures. We propose using compiler techniques to statically extract the program structure as an ordered tree called the communication structure tree (CST). Our research has found that such structural information provides dynamic trace compression with valuable guidance by offering a "big picture" naturally lacking in bottom-up runtime pattern search. During intra-process trace compression, only communication operations at the same vertex of the CST need to be compared, resulting in a dramatically smaller search space. During inter-process compression, since all the per-process CSTs have the same structure, merging traces is of O(n) computational complexity, compared to the O(n²) of existing dynamic methods. In addition, we have found that the extra overhead caused by such static analysis is negligible, making this hybrid "top-down" method appealing for exascale applications/systems, where we can choose to pay this small, one-time static analysis cost and significantly trim the trace volume as well as the trace compression overhead, both of which grow with the job execution time or the number of processes used. More specifically, we consider the major contributions of this work as follows: Combining Static and Dynamic Analysis for Trace Compression This approach pays a negligible cost for extra program structure analysis and storage, which is independent of the job execution time or the number of processes used. In exchange, the structural information acquired at compile time enables the dynamic compression to gain enormous enhancement in both compression effectiveness and efficiency. Sequence-Preserving Trace Compression and Replay for Trace-Driven Performance Prediction We enable sequence-preserving trace compression, a feature not afforded by most current trace compression tools, using the CST data structure. This allows more accurate trace-driven simulation or performance analysis. We implemented Cypress in the LLVM compiler framework [20]. We evaluated it using NPB programs and a real-world application with a variety of communication patterns and compared our method with a state-of-the-art dynamic method. Results show that Cypress can improve compression ratios in the majority of test cases and achieve an average fivefold and ninefold reduction in intra- and inter-process trace compression overhead, respectively. Finally, we use a real-world application to demonstrate how to use Cypress to analyze program performance and replay the program for performance prediction. The remaining part of this chapter is organized as follows: In Sect. 3.2, we present an overview of the Cypress system. We introduce the key static data structure, the communication structure tree, in Sect. 3.3, followed by intra-process and inter-process communication trace compression algorithms in Sect. 3.4. We describe the


design of the replay engine and the implementation of Cypress in Sects. 3.5 and 3.6, respectively. Our experimental results are reported in Sect. 3.7. Section 3.8 discusses related work. Finally, we conclude in Sect. 3.9.

3.2 Overview

Cypress is a hybrid communication trace compression system consisting of both a static and a dynamic analysis module, as shown in Fig. 3.2. The rest of our discussion focuses on MPI programs/traces, although the Cypress approach is general and can be applied to other communication libraries. The static analysis module first generates an intermediate summary for each procedure by identifying MPI communication invocations, which collectively determine the program communication patterns. It also identifies control structures, such as loops and branches, that may affect the execution of communication operations. The static module then constructs the call graph for the whole program, combines the above intermediate summaries through inter-procedural analysis, and stores the resulting program communication structure in a compressed text file. The dynamic analysis module implements a customized communication library with the MPI profiling layer. Like existing dynamic compression tools, it

Fig. 3.2 Overview of Cypress


Fig. 3.3 A simplified MPI program for Jacobi iteration

compresses the intra-process communication traces on the fly and conducts an inter-process trace compression at the end of the program's execution. However, the aforementioned program structural information from static analysis contains crucial iterative computing features such as the loop structure, enabling top-down dynamic compression. The key idea is that when informed of such a priori structures, Cypress can focus on communication traces generated by the same piece of code, which are very likely to have high redundancy. We illustrate the Cypress workflow with a simplified MPI code snippet for Jacobi iteration (Fig. 3.3). In the static analysis module, Cypress identifies the loop (Line 8) containing four branches (Lines 9, 11, 13, 15), each calling an MPI routine. Such a loop body is likely to generate similar traces across different iterations. Because Cypress knows which communication traces correspond to the same call site at runtime, it can pinpoint its similarity search and avoid expensive dynamic probing. Similarly, Cypress also tries to merge traces generated by the same piece of code in different processes for further compression. Figure 3.4 illustrates the compression process, where both intra-process and inter-process repeating patterns are identified. In Fig. 3.4, we use the term loop-level to denote the repeating patterns within each process and task-level to denote the repeating patterns between different processes. Cypress combines the merits of both static analysis and dynamic analysis. The static module can collect the complete program communication structure with little overhead at compile time, while the dynamic module can record the input-dependent control structures, such as loop iteration counts and branch outcomes. When combined, the static program structure enables the dynamic module to perform informed and scalable trace processing that compresses more efficiently.
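As a concrete illustration of the kind of structure Cypress exploits, the following minimal halo-exchange sketch (hypothetical C/MPI code, not the actual program of Fig. 3.3; the variable names and 1D decomposition are ours) shows an outer iteration loop whose body contains rank-dependent branches, each issuing an MPI call:

#include <mpi.h>
#include <string.h>

#define N 1024
#define STEPS 100

/* Hypothetical 1D Jacobi-style iteration: each process exchanges boundary
 * values with its neighbors inside branches and then reduces a residual.
 * The loop/branch/MPI-call nesting is exactly what the static CST captures. */
int main(int argc, char **argv) {
    int rank, size;
    double local[N], halo_left, halo_right, res = 0.0, gres;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(local, 0, sizeof(local));

    for (int step = 0; step < STEPS; step++) {          /* loop structure   */
        if (rank > 0)                                    /* branch structure */
            MPI_Send(&local[0], 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD);
        if (rank < size - 1)                             /* branch structure */
            MPI_Recv(&halo_right, 1, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank < size - 1)                             /* branch structure */
            MPI_Send(&local[N - 1], 1, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD);
        if (rank > 0)                                    /* branch structure */
            MPI_Recv(&halo_left, 1, MPI_DOUBLE, rank - 1, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Allreduce(&res, &gres, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

Every iteration of such a loop body emits the same short sequence of communication events, which is precisely the redundancy the top-down compression can exploit.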


Fig. 3.4 Communication trace compression (intra-process and inter-process) for Jacobi iteration

3.3 Extracting Communication Structure

As mentioned earlier, Cypress leverages static analysis techniques to extract a program communication structure at compile time. The program communication structure records communication invocations and related control structures, which can be used not only to accelerate the identification of repetitive communication operations but also to help store compressed communication traces. To this end, we propose a tree-based data structure called the communication structure tree (CST). To retain the original sequence of communication operations traced, the CST is organized as an ordered tree recording the communication invocations and the program control structures. In our MPI-oriented design, the leaf nodes represent MPI communication invocations, while non-leaf nodes represent program control structures, including branch nodes and loop nodes. Edges represent the hierarchy of the program communication structure. Note that we use an ordered tree to organize all the nodes of the CST because a pre-order traversal of the CST exactly matches the static structure of the program. As a result, we can easily capture the program execution at runtime over the static CST. The CST of a given MPI program is built in two major phases: intra-procedural analysis and inter-procedural analysis. Below, we give more details on their respective processing.
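As a rough, hypothetical illustration of such an ordered tree, a CST vertex could be represented along the following lines; the field names and the pre-order GID numbering helper are our own sketch, not Cypress's internal implementation:

/* Hypothetical CST vertex: leaves are MPI call sites; inner vertices are
 * loop or branch control structures. Children are kept in source order, so
 * a pre-order walk of the tree matches the static program structure. */
typedef enum { CST_ROOT, CST_LOOP, CST_BRANCH, CST_MPI_CALL } cst_kind_t;

typedef struct cst_vertex {
    cst_kind_t kind;
    int gid;                       /* unique global id, assigned in pre-order */
    const char *mpi_name;          /* MPI routine name for leaf vertices      */
    struct cst_vertex **children;  /* ordered child vertices                  */
    int num_children;
} cst_vertex_t;

/* Assign global ids (GIDs) by a pre-order traversal of the tree. */
static int cst_assign_gids(cst_vertex_t *v, int next_gid) {
    v->gid = next_gid++;
    for (int i = 0; i < v->num_children; i++)
        next_gid = cst_assign_gids(v->children[i], next_gid);
    return next_gid;
}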


3.3.1 Intra-procedural Analysis Algorithm

The intra-procedural analysis phase builds an intermediate CST for each procedure. This is done by collecting its control flow graph (CFG) and identifying all the loop and branch structures. For loops, Cypress uses a classic dominator-based algorithm [21] (Fig. 3.5). This phase identifies all the MPI communication invocations and user-defined functions in the program, each of which is represented as a leaf node in the intermediate CST. (Note that the user-defined function nodes will be refined during inter-procedural analysis.) If a procedure does not contain any MPI or user-defined function calls, its intra-procedural CST is null. Finally, we add a virtual root vertex to connect all the first-level vertices. For each vertex in the CST, we assign a unique global id (denoted by GID) in pre-order, which is useful for handling MPI asynchronous communications. Algorithm 4 gives the complete process of building an intra-procedural CST. Figure 3.6 shows the intra-procedural CST for the function main in our sample MPI program (Fig. 3.5).

Fig. 3.5 An MPI program for illustrating the usage of the CST


Algorithm 4 Intra-procedural analysis algorithm in Cypress
 1: input: A CFG for a procedure p (a node of the CFG is a basic block)
 2: initialize: T ← φ
 3: Identify all the loop and branch structures in the CFG
 4: Insert a virtual root vertex root into T
 5: for all node n in Post-Order over CFG do
 6:   if n is a loop header node then
 7:     Insert a loop vertex n into T
 8:     for all vertex v ∈ T do
 9:       if v ∈ successor(n) then
10:         Insert an edge e from n to v into T
11:       end if
12:     end for
13:   else if n is a branch node then
14:     For each path insert a branch vertex n′ into T
15:     for all vertex v ∈ T do
16:       if v ∈ successor(n′) then
17:         Insert an edge e from n′ to v into T
18:       end if
19:     end for
20:   else
21:     for all invocation i ∈ n do
22:       if i is an MPI invocation or a user-defined function then
23:         Insert a leaf vertex i into T
24:       end if
25:     end for
26:   end if
27: end for
28: for all vertex v ∈ T do
29:   if v ∈ successor of the function entry node then
30:     Insert an edge from root to v into T
31:   end if
32: end for
33: Recursively delete the leaf nodes that are not MPI invocations or user functions

Fig. 3.6 Intra-procedural CST for the function main in Fig. 3.5

3.3.2 Inter-procedural Analysis Algorithm

The inter-procedural analysis phase combines the intermediate results from the intra-procedural phase into a complete CST. The core idea is to connect all the intra-procedural CSTs according to the function-call relationships. To do this, we


Algorithm 5 Pseudo-code for constructing the communication structure tree (CST)
 1: input: Intra-procedural CSTs for each procedure p: I_CST(p)
 2: input: The program call graph (PCG)
 3: Change ← True
 4: /* Bottom-Up inter-procedural analysis */
 5: while (Change == True) do
 6:   Change ← False
 7:   for all procedure p in Post-Order over PCG do
 8:     for all vertex v in Pre-Order over I_CST(p) do
 9:       if v is a user-defined function then
10:         Replace the vertex v with its intra-procedural CST I_CST(v)
11:         Change ← True
12:       end if
13:     end for
14:   end for
15: end while


Fig. 3.7 A complete CST for the program in Fig. 3.5

first construct a program call graph (PCG), followed by inter-procedural analysis to iteratively replace user-defined functions with their intra-procedural CSTs. A bottom-up inter-procedural analysis algorithm (Algorithm 5) effectively reduces the number of iterations. At the end of this process, the CST of the function main is the final CST for the program. Figure 3.7 portrays the complete CST for the MPI program in Fig. 3.5. After the analysis of Algorithm 5, we obtain an MPI program communication structure tree. However, there are some irrelevant vertices in the CST that will not be used during the trace compression phase, such as the leaf vertices that are not MPI invocations. We conduct a pruning pass over the CST and delete all the irrelevant vertices. Our pruning algorithm includes two steps: (1) delete all the leaf vertices in the CST that are not MPI invocations; (2) repeat step 1 until all the leaf vertices are MPI invocations. We use an iterative DFS (depth-first search) algorithm over the CST to accomplish the two steps above. Recursive function calls create a challenging problem for building the CST. In Cypress, we unroll all the recursions and convert each recursive function call into an approximate loop control structure, adopting the method proposed by Emami et al. [22]. At compile time, we insert a pseudo-loop node at the entry point of the recursive function and replace all the internal recursive functions with branch nodes


Fig. 3.8 Conversion of recursive function calls in the CST

Fig. 3.9 Instrumented functions during the static phase in Cypress

in the CST. Figure 3.8 shows an example of converting recursive function calls in Cypress. At runtime, we record the branch outcomes and compress the repetitive communication operations. We will further illustrate this in Sect. 3.4. The static CST serves as a template for runtime communication trace compression. Any communication invocation or program control structure has a corresponding vertex in the CST to match. To inform our runtime compression library of the currently executing vertex in the CST, we introduce two extra functions, PMPI_COMM_Structure and PMPI_COMM_Structure_Exit, as shown in Fig. 3.9. Cypress automatically inserts code to bracket each control structure with this pair of functions during static analysis, where the id field helps the runtime analysis identify the matching CST vertex.
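To make the bracketing concrete, the sketch below shows what an instrumented loop with one branch might look like after the static phase. The single-integer signature of the two functions and the exact placement of the calls are our assumptions (the text only states that an id field identifies the matching CST vertex and that the loop's PMPI_COMM_Structure call is invoked once per iteration):

#include <mpi.h>

/* Assumed interface of the Cypress runtime instrumentation: the id argument
 * names the CST vertex (GID) the bracketed control structure maps to. */
void PMPI_COMM_Structure(int id);
void PMPI_COMM_Structure_Exit(int id);

void exchange_step(int rank, int steps, double *buf) {
    for (int s = 0; s < steps; s++) {
        PMPI_COMM_Structure(1);        /* loop vertex 1: called once per
                                          iteration so iterations can be counted */
        if (rank > 0) {
            PMPI_COMM_Structure(2);    /* branch vertex 2: records the outcome   */
            MPI_Send(buf, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD);
            PMPI_COMM_Structure_Exit(2);
        }
    }
    PMPI_COMM_Structure_Exit(1);       /* loop vertex 1 finished                 */
}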


3.4 Runtime Communication Trace Compression

At runtime, Cypress again adopts two-phase communication trace processing, consisting of intra-process and inter-process trace compression. Both phases utilize the program communication structure to achieve effective and low-overhead compression. To organize and store compressed traces efficiently, Cypress uses a data structure similar to the CST, called the compressed trace tree (CTT). It is an ordered tree with the same edges and the same number of vertices as the CST. Each CTT vertex, however, is associated with a linked list storing runtime information.

3.4.1 Intra-process Communication Trace Compression

During the intra-process compression phase, repeated communication operations for each process are compressed and stored in the CTT. This phase is completed on the fly during the program execution. At the beginning of the program execution, Cypress initializes the CTT according to the CST and sets the linked list of each CTT vertex to null. It maintains a program pointer, p. Facilitated by the ordered nature of the CTT and the functions instrumented at compile time, the pointer p always points to the CTT vertex that is currently being executed. This enables the runtime compression to "fill in" event details into the known communication template. Below, we give more details on compressing each type of CTT vertex. Communication Vertex Compression For each communication operation, the following parameters are collected: communication type, size, direction, tag, context, and time. For the communication time, two types of recording methods are supported in Cypress. One records the average time and the standard deviation of repeated communication operations. The other uses a histogram to record the distribution of the communication time [14]. For each incoming communication operation, the current Cypress implementation only compares it with the last one in the same CTT vertex, merging them if all their communication parameters (all but the communication time) match. Potentially, one can set a larger sliding window for each leaf vertex to find more similar communication patterns. There is clearly a trade-off between compression cost and compression effectiveness. We find our current implementation adequate for most parallel programs. Loop Vertex Compression For each loop vertex, we need to record its actual iteration count. This is done by incrementing a counter every time the PMPI_COMM_Structure function associated with this loop vertex is invoked. The counter stops after PMPI_COMM_Structure_Exit is called. For nested loops, the inner loop iterations during each round of the outer loop iteration are recorded to recover the correct communication sequence.
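A rough sketch of how a profiling-layer wrapper might record and merge a communication vertex's records, as described above, is given below. The record layout and the cypress_current_vertex_records() accessor are hypothetical stand-ins for Cypress's actual runtime structures:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical per-vertex trace record kept in the CTT's linked list. */
typedef struct comm_record {
    int count, dest, tag, size;          /* message parameters               */
    long repeat;                         /* how many identical events merged */
    double time_sum, time_sqsum;         /* to derive mean/stddev later      */
    struct comm_record *next;
} comm_record_t;

/* Assumed accessor for the record list of the CTT vertex currently being
 * executed (tracked by the bracketing instrumentation calls). */
extern comm_record_t **cypress_current_vertex_records(void);

int MPI_Send(const void *buf, int count, MPI_Datatype dt, int dest, int tag,
             MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, dt, dest, tag, comm);   /* real send */
    double elapsed = MPI_Wtime() - t0;

    int unit;
    MPI_Type_size(dt, &unit);
    comm_record_t **head = cypress_current_vertex_records();
    comm_record_t *last = *head;

    /* Compare only with the last record at this CTT vertex; merge if all
     * parameters except the communication time match. */
    if (last && last->count == count && last->dest == dest &&
        last->tag == tag && last->size == unit * count) {
        last->repeat++;
        last->time_sum += elapsed;
        last->time_sqsum += elapsed * elapsed;
    } else {
        comm_record_t *r = calloc(1, sizeof(*r));
        r->count = count; r->dest = dest; r->tag = tag; r->size = unit * count;
        r->repeat = 1; r->time_sum = elapsed; r->time_sqsum = elapsed * elapsed;
        r->next = *head;
        *head = r;
    }
    return rc;
}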


Fig. 3.10 Compressing communication traces over a nested loop in the CTT

Fig. 3.11 Branch compression example

Figure 3.10 shows a program containing a nested loop with varied inner loop iteration counts. For leaf vertices, similar traces are compressed. a × n means that the trace a repeats n times. Cypress can further compress loop counts with striding patterns, using tuples like <0, k − 1, 1>, which means that the iteration count runs from 0 to k − 1 with a stride of 1. For the outermost loop, we only need to record its iteration count k. Branch Vertex Compression As mentioned earlier, Cypress records all branch outcomes at runtime. Moreover, if a branch vertex is a child of one or more loops, the current iteration number for all the parent loop vertices should also be recorded. Figure 3.11 shows how to record the branch outcomes in a CTT. Here, the branch is selected with an alternating pattern, which can again be denoted by tuples like <0, 8, 2> (branch taken at iteration numbers from 0 to 8 with a stride of 2). Asynchronous Communication For asynchronous communication, the MPI library uses request handlers to connect asynchronous communication routines with checking functions (e.g., MPI_Wait, MPI_Waitall). To associate each non-blocking communication routine (e.g., MPI_Isend, MPI_Irecv) with its corresponding checking function, we map its request handler to its unique GID in


Fig. 3.12 Example of mapping between the request handler and GID

the CST. Then, in the checking function, the request handler is replaced with this GID. Figure 3.12 shows an example of such mapping. During the decompression phase, the MPI checking function and asynchronous communication routine can be paired again using the GID. For partial completion in MPI programs, such as MPI_Waitsome, MPI_Testsome, and MPI_Testany, we also use the GID to record the actual non-blocking communication completion. During the decompression phase, we can replay the complete communication sequence using the GID and CTT structure. Nondeterministic Events Nondeterministic events complicate trace compression. For a non-blocking wildcard receive (e.g., MPI_Irecv with MPI_ANY_SOURCE), the source is not known when the routine is posted, so the non-blocking wildcard receive cannot be matched upon invocation. In Cypress, these wildcard receives are cached, with compression delayed until the corresponding checking functions are executed.
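The request-handler-to-GID association described above could be kept in a small lookup table that is filled in the non-blocking call and consulted in the checking call. The sketch below uses hypothetical helper names and a fixed-size table; Cypress's actual bookkeeping is not specified in the text:

#include <mpi.h>

#define MAX_PENDING 1024

/* Hypothetical table mapping outstanding MPI_Request handles to the GID of
 * the CST vertex that issued them. */
static MPI_Request pending_req[MAX_PENDING];
static int         pending_gid[MAX_PENDING];
static int         num_pending;

/* Assumed accessor for the GID of the currently executing CST vertex. */
extern int cypress_current_gid(void);

int MPI_Isend(const void *buf, int count, MPI_Datatype dt, int dest, int tag,
              MPI_Comm comm, MPI_Request *req) {
    int rc = PMPI_Isend(buf, count, dt, dest, tag, comm, req);
    if (num_pending < MAX_PENDING) {            /* remember request -> GID */
        pending_req[num_pending] = *req;
        pending_gid[num_pending] = cypress_current_gid();
        num_pending++;
    }
    return rc;
}

int MPI_Wait(MPI_Request *req, MPI_Status *status) {
    int gid = -1;
    for (int i = 0; i < num_pending; i++) {
        if (pending_req[i] == *req) {           /* look up the issuing GID  */
            gid = pending_gid[i];
            pending_req[i] = pending_req[--num_pending];
            pending_gid[i] = pending_gid[num_pending];
            break;
        }
    }
    int rc = PMPI_Wait(req, status);
    /* A real tool would now record the wait against gid instead of the
     * opaque request handle, so Isend/Wait can be re-paired at replay time. */
    (void)gid;
    return rc;
}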

3.4.2 Inter-process Communication Trace Compression

Finally, the inter-process trace compression phase identifies similar communication patterns across different processes, at the end of the program execution (e.g., MPI_Finalize). This is where the static-CST-assisted Cypress obtains its most outstanding advantage over traditional dynamic-only methods: the knowledge obtained at compile time enables Cypress to perform informed compression, scalable to large numbers of parallel processes. Owing to the prevailing SPMD (single program multiple data) model, most processes in common applications today execute the same path in the program call graph. As a result, processes generate highly similar communication traces. Still, dynamic-only methods face the challenging communication trace alignment task [14]. For example, when compressing the sequences (a:3, b:4) and (b:2, a:2) (trace: repeat count) for two processes, dynamic methods may produce three different results using different strategies: (b:2, a:5, b:4), (a:3, b:6, a:2), and (a:5, b:6). Therefore, the computational complexity in compressing a pair of per-process traces for dynamic methods is O(n²) (n is the


Fig. 3.13 Performing inter-process communication trace compression with the CTTs. (For simplicity, only partial linked lists of the vertices in the CTTs are shown)

number of compressed trace events for each process), which makes it challenging to scale with ever-increasing system size. Cypress solves this problem elegantly, thanks to its top-down structural information: statically extracted CSTs and dynamically populated CTTs based on the former. The SPMD nature of parallel programs dictates that the CTTs of most processes share the common structure of the single source code. As any MPI communication invocation corresponds to a unique CTT vertex, we only need to compare the communication invocations at the same vertex in the CTT. If a process has not executed a certain call path in the CTT, the call path is ignored for this process. In Fig. 3.13, we demonstrate the process of merging the CTTs of an even process p0 and an odd process p1 of the program in Fig. 3.5. Cypress traverses the CTTs in pre-order. It first merges the virtual root vertices 0 (referred to by their GID). Next, it merges the loop vertices 1. Since both processes have the same iteration count, a single value is recorded. After that, it merges the branch vertices 2. Since process p1 does not take this branch, it just records the information for p0. It then merges the leaf vertices 3, where we just record the repeating communication operations for process p0, and so on. Finally, it merges the two CTTs into one. This procedure continues until all per-process CTTs are combined into the merged CTT. We can use a parallel algorithm to merge all the CTTs. Therefore, the computational complexity of Cypress is O(n log P) when merging P per-process traces, each with length n. In contrast, the complexity for the dynamic-only method is dependent


on the per-process communication traces; the worst-case complexity is O(nP log P). To effectively compress similar communication invocations for different processes, we need to encode the process rank in a uniform way. To this end, Cypress adopts an existing relative ranking method [14]. For example, we use the current process rank plus or minus a constant value to denote the source or destination process. This method is effective for most parallel programs, especially stencil applications.
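The relative ranking idea can be illustrated with a tiny encode/decode pair: a peer is stored as an offset from the calling rank, so that, for example, the "right neighbor" compresses to the same value (+1) on every process. This is a sketch of the general technique from [14], not Cypress's exact encoding:

/* Encode a peer rank as an offset relative to the local rank, so that the
 * same compressed record matches across processes. */
static int encode_relative(int my_rank, int peer_rank) {
    return peer_rank - my_rank;
}

/* Recover the concrete peer rank when decompressing/replaying for my_rank. */
static int decode_relative(int my_rank, int offset) {
    return my_rank + offset;
}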


3.5 Decompression and Performance Analysis

Compressed communication traces stored in the CTT by Cypress can be decompressed by traversing the CTT in pre-order and performing the following tasks depending on the vertex type: (1) for a loop vertex, iteratively traversing its child vertices according to the recorded loop count; (2) for a branch vertex, traversing its child vertices according to the recorded branch outcome; and (3) for a communication vertex, printing the communication trace stored in the per-vertex linked list. Communication traces acquired by Cypress can be used for various trace-based performance analyses. As a proof-of-concept prototype, we integrated Cypress with a trace-driven performance simulator, SIM-MPI [6], as shown in Fig. 3.14. SIM-MPI can simulate various MPI communication routines. The LogGP communication model [23] is used to simulate point-to-point routines, while collective routines are decomposed into point-to-point operations [24]. In addition to decompressed communication traces, SIM-MPI also needs the sequential computation time of the target program on a target platform, which can be obtained by using deterministic replay on a single node of the target platform [6]. Combining the above results, SIM-MPI predicts the overall performance for a given parallel program.
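The pre-order decompression walk described above can be sketched as a small recursive routine over a simplified CTT vertex; the types are hypothetical and omit details such as per-iteration child orderings and branch-outcome lists:

#include <stdio.h>

/* Simplified, hypothetical CTT vertex for illustrating the pre-order walk. */
typedef enum { CTT_LOOP, CTT_BRANCH, CTT_COMM } ctt_kind_t;

typedef struct ctt_vertex {
    ctt_kind_t kind;
    int loop_count;                /* recorded iterations (loop vertices)   */
    int branch_taken;              /* recorded outcome (branch vertices)    */
    const char *event;             /* compressed trace text (comm vertices) */
    struct ctt_vertex **children;
    int num_children;
} ctt_vertex_t;

static void decompress(const ctt_vertex_t *v) {
    switch (v->kind) {
    case CTT_LOOP:                 /* replay the body loop_count times      */
        for (int it = 0; it < v->loop_count; it++)
            for (int c = 0; c < v->num_children; c++)
                decompress(v->children[c]);
        break;
    case CTT_BRANCH:               /* follow only the recorded outcome      */
        if (v->branch_taken)
            for (int c = 0; c < v->num_children; c++)
                decompress(v->children[c]);
        break;
    case CTT_COMM:                 /* emit the stored communication event   */
        printf("%s\n", v->event);
        break;
    }
}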

Fig. 3.14 Integrating Cypress with a trace-driven simulator


3.6 Implementation

We implemented the static analysis module of Cypress as a plug-in of the LLVM compiler [20], identifying loop and branch structures over the control flow graph at the LLVM intermediate representation (IR) level. Our prototype stores the program CST in a compressed text file. The runtime compression library of Cypress is implemented with the MPI profiling layer (PMPI) and is independent of any specific MPI implementation. We have tested Cypress with Intel MPI-4.0, Intel MPI-4.1, and MPICH2-1.5. Users do not need to manually modify any application source code to use Cypress.

3.7 Evaluation

3.7.1 Methodology

We evaluate Cypress with the NPB benchmarks and a real-world application, which collectively present a variety of communication patterns, to assess the benefits of our approach. We design several groups of experiments to answer the following questions:

1. How does the compression performance of Cypress compare with that of existing dynamic methods?
2. What is the compression overhead of Cypress compared to dynamic methods, both intra-process and inter-process?
3. What is the compilation overhead of Cypress to build the CSTs?
4. How can the compressed traces of Cypress be used to analyze the communication performance of a given parallel program?

We use the Explorer-100 cluster system at Tsinghua University as our experimental platform, which has a peak performance of 104 TFlops. Compute nodes, each with two 6-core Intel Xeon X5670 processors and 48 GB of memory, are interconnected via a QDR InfiniBand network. The operating system is Red Hat Enterprise Linux Server 5.5, and the MPI library is Intel MPI-4.0.2. We used the NPB 3.3 programs [25], including the BT, CG, DT, EP, FT, LU, MG, and SP benchmarks. All tests use the Class D problem size. We also tested a real-world computational fluid dynamics application, LESlie3d [26], to demonstrate the effectiveness of using Cypress compressed traces to analyze the communication performance of a given parallel program. The grid size of LESlie3d is 193 × 193 × 193. We compare Cypress with three other techniques: Gzip, ScalaTrace [14], and ScalaTrace-2 [18]. Gzip is a popular technique for compressing user documents and data on Linux systems, and it is also the trace compression method used in the OTF library [27]. ScalaTrace is the state of the art for lossless dynamic trace


compression developed at North Carolina State University. ScalaTrace-2 improves the performance of ScalaTrace by using a loop-agnostic inter-node compression scheme. However, the probabilistic method used in ScalaTrace-2 only preserves partial communication information and may lose much information in exchange for better compression [18], whereas Cypress retains trace details. Cypress supports two types of trace compression files: a normal binary file and a version further compressed with Gzip (labeled Cypress+Gzip in the figures). Gzip brings extra compression at very small overhead and can be integrated into Cypress. ScalaTrace-2 currently does not support Gzip-compressed trace files. To compare the results fairly, we add extra Gzip support for ScalaTrace-2 (labeled ScalaTrace2+Gzip in the figures).

3.7.2 Communication Trace Size

Figure 3.15 shows the total trace size of NPB programs in log-scale, collected with different methods.

Fig. 3.15 Total communication trace sizes (KB, log-scale) of NPB programs with different compression tools. (a) BT. (b) CG. (c) DT. (d) EP. (e) FT. (f) LU. (g) MG. (h) SP


Since Gzip cannot perform inter-process trace compression, the trace sizes increase linearly with the number of processes. In contrast, both Cypress and the dynamic methods (ScalaTrace and ScalaTrace-2) achieve near-constant trace sizes (DT, EP, and LU) or sublinear scaling of trace sizes (BT, CG, FT, MG, and SP) as the number of processes increases. For most NPB programs (DT, FT, LU, MG, and SP), Cypress shows an order of magnitude improvement over ScalaTrace. Although ScalaTrace-2 improves the compression ratios over ScalaTrace, Cypress outperforms ScalaTrace-2 for DT, EP, FT, and LU. At the same time, Gzip brings a significant improvement in compression effectiveness for both Cypress and ScalaTrace-2. With Gzip incorporated (the "+Gzip" bars), Cypress offers up to an order of magnitude improvement over ScalaTrace-2 for the majority of NPB programs. The only case where Cypress significantly underperforms is SP, due to its varied message sizes and message tags. However, as the overhead analysis below shows, ScalaTrace-2 introduces much more compression overhead than Cypress for SP as the number of processes increases (Fig. 3.16). Among the NPB codes, MG and SP feature complex communication patterns, as shown in Fig. 3.17.

Fig. 3.16 Intra-process compression overhead in terms of time and memory (per process) with ScalaTrace and Cypress. (a) BT. (b) CG. (c) FT. (d) LU. (e) MG. (f) SP

The gray level of a cell at the x-th row and y-th column represents the communication volume (in bytes) between processes x and y. MG solves a three-dimensional discrete Poisson equation using a V-cycle multi-grid method. There is a nested 3D torus for some particular communication processes, which results in irregular communication operations between different processes. For example, processes 8–11 and processes 12–15 have different communication patterns. SP also presents nonuniform communication patterns between processes. Moreover, for some loops in SP, the message sizes and message tags of sending and receiving communications vary for each process. Without

Fig. 3.17 Communication patterns of MG and SP (64 processes). (a) MG. (b) SP

such information, dynamic methods fail to compress communications effectively (see the MG and SP trace sizes of ScalaTrace). To address these complex patterns, ScalaTrace-2 [18] proposes a special algorithm that introduces large compression overhead. However, Cypress is able to acquire this information automatically from the source code and achieve effective trace compression, as discussed further in the overhead analysis below.

3.7.3 Trace Compression Overhead

3.7.3.1 Intra-process Overhead

Figure 3.16 shows the intra-process compression overhead in terms of memory and time for both Cypress and dynamic methods. Among the NPB benchmarks, DT, EP, and FT contain few communication operations, resulting in very low memory and time overhead regardless of compression methods. We include only FT to represent this group. For time overhead (bars in Fig. 3.16), Cypress introduces very little overhead on the NPB programs, about 1.58% on average. The maximum overhead of Cypress is 5.18% for CG on 512 processes. With both ScalaTrace and ScalaTrace-2, however, the time overhead varies greatly. For example, ScalaTrace incurs a 400%+ overhead on 512 processes for MG. This is caused by the complex nested branches in MG to implement different types of message exchange. For ScalaTrace-2, the time overhead varies from less than 1% to about 60% (LU on 512 processes). In summary, the average intra-process compression overhead for NPB programs is 51.05% for ScalaTrace, 9.1% for ScalaTrace-2, and only 1.58% for Cypress. For memory consumption (line+symbol graphs in Fig. 3.16), Cypress outperforms dynamic methods significantly. At runtime, Cypress only uses extra memory resources to store the CTT, consuming about 2.2 MB (1.79%) memory for each


process of NPB programs on average, and the memory consumption changes little with increasing number of processes. ScalaTrace, on the other hand, consumes much more memory, averaging 34 MB (36.31%) per process. Its memory consumption also increases rapidly with the number of processes.

3.7.3.2 Inter-process Overhead

Figure 3.18 shows the inter-process compression overhead (in seconds) on a log-scale for NPB programs. Due to space limits, we only list the results for BT, CG, LU, MG, and SP, given the small overhead for DT, EP, and FT. Cypress shows 1.5 orders of magnitude improvement for BT and CG and 2 orders of magnitude improvement for LU over ScalaTrace and ScalaTrace-2. Because the computational complexities for Cypress and dynamic methods are O(n) vs. O(n²) in merging a pair of per-process traces (where n is the length of compressed traces per process), the inter-process overhead is significantly reduced by Cypress. For MG and SP, Cypress shows about two to five times improvement over ScalaTrace-2. In summary, the average inter-process compression overhead for NPB programs is 170.69% for ScalaTrace, 30.3% for ScalaTrace-2, and 3.29% for Cypress. These results confirm that our method can be extended to much larger HPC systems.

3.7.3.3 Compilation Overhead of Cypress

To build the CST of an MPI program, we add an extra phase to the LLVM compiler. Table 3.1 shows the compilation overhead for NPB programs. For most NPB programs, the compilation overhead is negligible. The average

Fig. 3.18 Inter-process trace compression overhead (in seconds, log-scale) with different compression methods

Table 3.1 Compilation overhead of Cypress (in seconds)

Compile time    BT     CG     DT     EP     FT     LU     MG     SP
w/o Cypress     7.19   0.95   0.40   0.34   1.56   6.57   2.51   7.40
w/ Cypress      7.44   1.06   0.45   0.43   1.61   6.73   2.60   7.51
Overhead (%)    3.51   11.20  12.79  27.72  3.21   2.54   3.67   1.51


compilation overhead is 8.27% for NPB programs. The maximum time to build the CST is 0.25 s for BT.

3.7.4 Case Study

In this section, we use a real-world application, LESlie3d, to demonstrate the use of Cypress to analyze the performance of a parallel program. LESlie3d (Large-Eddy Simulations with Linear-Eddy Model in 3D) is a computational fluid dynamics program used to investigate a wide array of turbulence phenomena, such as mixing, combustion, acoustics, and general fluid mechanics. Figure 3.19 shows the trace sizes collected with different methods. Here, Cypress brings about 1.5 orders of magnitude improvement over ScalaTrace and 4 orders of magnitude improvement over Gzip.

3.7.4.1 Analyzing Communication Patterns

A basic use of the compressed traces of Cypress is to analyze program communication patterns. Figure 3.20 shows the communication patterns when the number of processes is 32 and 64. We observe communication locality in this application: for example, process 0 only communicates with processes 1, 2, and 8. There are only two types of message sizes, 43 KB and 83 KB. With this information, we can perform communication optimization for this application. For instance, on a nonuniform cluster system, we can improve program communication performance with process mapping techniques [28].

Fig. 3.19 Compressed communication traces of LESlie3d with Gzip, ScalaTrace, and Cypress

Fig. 3.20 Extracted communication patterns of LESlie3d with Cypress traces. (a) 32 processes. (b) 64 processes

Fig. 3.21 Predicted execution time of LESlie3d with Cypress traces

3.7.4.2 Performance Prediction

Cypress preserves the complete communication sequence of the original traces. We decompress the LESlie3d communication traces and feed them into the SIM-MPI simulator. The network parameters needed by SIM-MPI are acquired using two nodes of the Explorer-100 cluster. The sequential computation time is acquired with the deterministic replay technique on a single node of the Explorer-100 cluster. In Fig. 3.21, SIM-MPI obtains high prediction accuracy with the compressed traces, with an average prediction error of 5.9%. We can also see that the program speedup grows slowly as the number of processes increases. This is because the


communication time of LESlie3d increases with the number of processes. For example, with 32 processes, the average communication time percentage is only 2.85%, whereas it reaches 32.47% with 512 processes.

3.8 Related Work

Communication traces are widely used to characterize communication patterns and identify performance bottlenecks for large-scale parallel applications [1, 2]. Many trace collection tools have been developed, such as Intel ITC/ITA [7], Vampir [8], TAU [9], Kojak [10], and Scalasca [11]. mpiP [29] is a popular communication profiling tool, which collects statistical information for MPI programs and helps developers analyze communication performance. Open Trace Format (OTF) [27] is a fast and efficient trace format library with regular Zlib compression, but it does not support inter-process communication trace compression, so the trace volume increases with the number of processes. Knupfer et al. [30] proposed using compressed complete call graphs (cCCG) to compress communication traces. However, their method is post-mortem, and the original traces still have to be collected. Noeth et al. [14] proposed a scalable communication trace compression method, called ScalaTrace. The compression algorithm maintains a queue of MPI events and attempts to greedily compress the first matching sequence. Their method extends regular section descriptors (RSD) and power-RSD (PRSD) to express repeating communication events involved in loop structures. Since all the compression work is completed during the program execution, large compression overhead can be introduced for parallel programs when processing complex communication patterns, especially for inter-process trace compression. Wu et al. proposed the ScalaTrace-2 framework [18] to address the compression inefficiencies of ScalaTrace. Compared to ScalaTrace, ScalaTrace-2 can significantly improve compression rates for scientific applications with inconsistent behavior across time steps and nodes. However, the inter-process compression still introduces large overhead for large-scale parallel applications. Xu et al. [15] introduced a framework for identifying the maximal loop nest based on Crochemore's algorithm. Their algorithm can efficiently discover long-range repeating communication patterns due to outer loops in MPI traces. However, their algorithm cannot process inter-process trace compression. Krishnamoorthy et al. [16] augmented the SEQUITUR compression algorithm for communication trace compression. Although they employed various optimizations to reduce the overhead of dynamic compression, large overhead is still incurred in their system. Ratn et al. [31] proposed a method for adding timing information to compressed traces generated by ScalaTrace. They rely on delta times rather than absolute timestamps to express similarities for repeating communication patterns. Cypress focuses on trace-driven simulation, so it uses two simple mechanisms, histograms


and mean values, to represent communication time. Their work can be integrated with ours to better record communication time. Our previous work [32] proposed a slicing technique to efficiently acquire the MPI communication traces of a large-scale application on a small-scale system, but it does not compress the collected communication traces. In the future, the two approaches can be combined to effectively analyze communication patterns of parallel applications. Shao et al. [17] proposed a static compiler framework to analyze communication patterns of parallel applications. However, due to pointer aliasing and unknown program input, their method can only identify static and persistent communication patterns; dynamic communication behavior cannot be processed using their framework.

3.9 Conclusions

In this chapter, we propose a top-down method, called Cypress, which can effectively compress communication traces for large-scale parallel applications with very low overhead by combining static and dynamic analysis. Our approach leverages a program's inherent static structure to improve the efficiency of trace compression. We implement Cypress and evaluate it with several parallel programs. Results show that our method can improve compression ratios significantly in the majority of test cases compared with a state-of-the-art dynamic method and only incurs 1.58% and 3.29% overhead on average for intra- and inter-process compression, respectively. To the best of our knowledge, Cypress is the first work that leverages program structure to improve communication trace compression.

References

1. Vetter, J. S., & Mueller, F. (2002). Communication characteristics of large-scale scientific applications for contemporary cluster architectures. In Proceedings 16th International Parallel and Distributed Processing Symposium (pp. 853–865).
2. Becker, D., et al. (2007). Automatic trace-based performance analysis of metacomputing applications. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.
3. Snavely, A., et al. (2002). A framework for application performance modeling and prediction. In SC '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (pp. 1–17).
4. Choudhury, N., Mehta, Y., & Wilmarth, T. L., et al. (2005). Scaling an optimistic parallel simulation of large-scale interconnection networks. In Proceedings of the Winter Simulation Conference, 2005 (pp. 591–600).
5. Susukita, R., Ando, H., & Aoyagi, M., et al. (2008). Performance prediction of large-scale parallel system and application using macro-level simulation. In Proceedings SC'08 (pp. 1–9).
6. Zhai, J., Chen, W., & Zheng, W. (2010). PHANTOM: Predicting performance of parallel applications on large-scale parallel machines using a single node. In PPoPP.
7. Intel Ltd. Intel Trace Analyzer & Collector. http://www.intel.com/cd/software/products/asmona/eng/244171.htm


8. Nagel, W. E., et al. (1996). VAMPIR: Visualization and analysis of MPI resources.
9. Shende, S. S., & Malony, A. D. (2006). The TAU parallel performance system. The International Journal of High Performance Computing Applications, 20(2), 287–311.
10. Mohr, B., & Wolf, F. (2003). KOJAK-A tool set for automatic performance analysis of parallel programs. In Euro-Par 2003 Parallel Processing: 9th International Euro-Par Conference.
11. Geimer, M., et al. (2010). The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience, 22(6), 702–719.
12. Advanced Simulation and Computing Program. The ASC SMG2000 benchmark code. https://asc.llnl.gov/computing_resources/purple/archive/benchmarks/smg/
13. Wylie, B. J. N., Geimer, M., & Wolf, F. (2008). Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Scientific Programming, 16(2–3), 167–181.
14. Noeth, M., et al. (2009). ScalaTrace: Scalable compression and replay of communication traces for high-performance computing. Journal of Parallel and Distributed Computing, 69(8), 696–710.
15. Xu, Q., Subhlok, J., & Hammen, N. (2010). Efficient discovery of loop nests in execution traces. In 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (pp. 193–202).
16. Krishnamoorthy, S., & Agarwal, K. (2010). Scalable communication trace compression. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (pp. 408–417). IEEE Computer Society.
17. Shao, S., Jones, A. K., & Melhem, R. G. (2006). A compiler-based communication analysis approach for multiprocessor systems. In Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
18. Wu, X., & Mueller, F. (2013). Elastic and scalable tracing and accurate replay of non-deterministic events. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS'13 (pp. 59–68). ACM.
19. Zhai, J., et al. (2014). Cypress: Combining static and dynamic analysis for top-down communication trace compression. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 143–153). IEEE.
20. The LLVM Compiler Framework. http://llvm.org
21. Muchnick, S. S. (1997). Advanced compiler design and implementation. San Francisco, CA, USA: Morgan Kaufmann Publishers.
22. Emami, M., Ghiya, R., & Hendren, L. J. (1994). Context-sensitive interprocedural points-to analysis in the presence of function pointers. In PLDI'94 (pp. 242–256). ACM.
23. Alexandrov, A., et al. (1997). LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing, 44(1), 71–79.
24. Zhang, J., et al. (2009). Process mapping for MPI collective communications. In Euro-Par 2009 Parallel Processing: 15th International Euro-Par Conference.
25. Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center.
26. Duque, E. P., et al. (2012). Ifdt–intelligent in-situ feature detection, extraction, tracking and visualization for turbulent flow simulations. In 7th International Conference on Computational Fluid Dynamics (Vol. 2).
27. Knupfer, A., et al. (2006). Introducing the open trace format (OTF). In International Conference on Computational Science (pp. 526–533).
28. Chen, H., et al. (2006). MPIPP: An automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. In Proceedings of the 20th Annual International Conference on Supercomputing (pp. 353–360). ACM.
29. Vetter, J. S., & McCracken, M. O. (2001). Statistical scalability analysis of communication operations in distributed applications. In Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming (pp. 123–132).

References

69

30. Knupfer, A., & Nagel, W. E. (2005). Construction and compression of complete call graphs for post-mortem program trace analysis. In 2005 International Conference on Parallel Processing (pp. 165–172). IEEE. 31. Ratn, P., et al. (2008). Preserving time in large-scale communication traces. In Proceedings of the 22nd Annual International Conference on Supercomputing (pp. 46–55). New York, NY, USA: ACM. 32. Zhai, J., et al. (2009). FACT: Fast communication trace collection for parallel applications through program slicing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. https://doi.org/10.1145/1654059.1654087

Part II

Performance Analysis Methods: Memory Analysis

Chapter 4

Informed Memory Access Monitoring

Abstract Memory monitoring is of critical use in understanding applications and evaluating systems. Due to the dynamic nature of programs' memory accesses, common practice today leaves large amounts of address examination and data recording to runtime, at the cost of substantial performance overhead and large storage consumption for memory traces. Recognizing the memory access patterns available at compile time and the redundancy in runtime checks, we propose a novel memory access monitoring and analysis framework, Spindle. Unlike methods delaying all checks to runtime or performing task-specific optimization at compile time, Spindle performs common static analysis to distill predictable memory access patterns into a compact program structure summary. Custom memory monitoring tools can then be developed on top of Spindle, leveraging the extracted structural information to dramatically reduce the amount of instrumentation that incurs heavy runtime memory address examination or recording. Our evaluation demonstrates the effectiveness of two Spindle-based tools, performing memory bug detection and trace collection, respectively, with a variety of programs. Results show that these tools can aggressively prune online memory monitoring processing, fulfilling the desired tasks with significantly reduced performance overhead (2.54× on average for memory bug detection and over 200× on average for access tracing, compared with state-of-the-art solutions).

4.1 Introduction

Memory access behavior is crucial for understanding applications and evaluating systems. It is widely monitored in system and architecture research, for memory bug or race condition detection [1–3], information flow tracking [4, 5], large-scale system optimization [6–8], and memory system design [9–11]. Memory access monitoring and tracing need to obtain and check/record the memory addresses visited by a program, and this process is quite expensive. Even given complete source-level information, much of the relevant information regarding locations to be accessed at runtime is not available at compile time. For example, it is common that during static analysis, we see a heap object accessed repeatedly


in a loop, but do not have any of the parameters needed to perform our desired examination or tracing: where the object is allocated, how large it is, or how many iterations there are in a particular execution of the loop. As a result, existing memory checking tools mostly delay the checking/transcribing of such memory addresses to execution time, with the associated instructions instrumented to perform task-specific processing. Such runtime processing brings substantial performance overhead (typically a 2× or greater application slowdown [12, 13] for online memory access checking, and much more for memory trace collection [14–16]).

However, there is important information not well utilized at compile time. Even with actual locations, sizes, branch-taken decisions, or loop iteration counts unknown, we still see patterns in memory accesses. In particular, accesses to large objects are not isolated events that have to be verified or recorded individually at runtime. Instead, they form groups with highly similar (often identical) behaviors, with the relative displacement in locations visited given plainly in the code. The processing tasks that are delayed to execution time often perform the same checking or recording on individual members of such large groups of highly homogeneous accesses. In addition, the memory access patterns recognizable during static analysis summarize common structural information useful to many memory checking/tracing tasks.

Based on these observations, we propose Spindle [17], a new platform that facilitates hybrid static+dynamic analysis for efficient memory monitoring. It leverages common static analysis to identify, in the target program, the sources of redundancy in runtime memory address examination. By summarizing groups of memory accesses with statically identified program structures, such compact intermediate analysis results can be passed to Spindle-based tools, which further perform task-specific analysis and code instrumentation. The regular, predictable patterns contained in Spindle-distilled structural information allow diverse types of memory access checking to be done more efficiently: by computing rather than collecting memory accesses whenever possible, even examination that has to be conducted at runtime can be elevated from instruction to object granularity, with the amount of instrumentation dramatically pruned.

We implement Spindle on top of the open-source LLVM compiler infrastructure [18]. On top of it, we implement two proof-of-concept custom tools, a memory bug detector (S-Detector) and a memory trace collector (S-Tracer), that leverage the common structural information extracted by Spindle to optimize their specific memory access monitoring tasks. We evaluated Spindle and these custom tools with popular benchmarks (NPB, SPEC CPU2006, Graph500, and PARSEC) and open-source applications covering areas such as machine learning, key-value stores, and text processing. Results show that S-Detector can reduce the amount of instrumentation by 64% on average using Spindle static analysis results, allowing a runtime overhead reduction of up to 30.25× (2.54× on average) over the Google AddressSanitizer [12]. S-Tracer, meanwhile, reduces the trace collection time overhead by up to over 500× (228× on average) over the popular PIN tool [14] and cuts the trace storage space overhead by up to over 10,000× (248× on average).


Spindle is publicly available at https://github.com/thu-pacman/Spindle.

4.2 Overview

4.2.1 Spindle Framework


Spindle is designed as a hybrid memory monitoring framework. Its main module performs static analysis to extract program structures relevant to memory accesses. Such structural information allows Spindle to obtain regular or predictable patterns in memory accesses. Different Spindle-based tools utilize these patterns in different ways, with the common goal of reducing the amount of instrumentation that leads to costly runtime check or information collection. Figure 4.1 gives the overall structure of Spindle, along with sample memory monitoring tools implemented on top of it. To use Spindle-based tools, end users only have to compile their application source code with the Spindle-enhanced LLVM modules, whose output then goes through tool-specific analysis and instrumentation. More specifically, the common static analysis performed by Spindle will

Fig. 4.1 Spindle overview


generate a highly compact memory access skeleton (MAS), describing the structured, predictable memory access components. Spindle tool developers write their own analyzer, which uses the MAS to optimize their code instrumentation, aggressively pruning unnecessary or redundant runtime checks or monitoring data collection. In general, such task-specific tools compute groups of memory addresses visited before or after program execution to avoid examining individual memory accesses at runtime. As illustrated in Fig. 4.1, each such Spindle-based tool (the memory bug detector S-Detector and the memory trace collector S-Tracer in this case) generates its own instrumented application code. As our results will show, for typical applications, the majority of memory accesses are computable given a small amount of runtime information, leading to a dramatic reduction of instrumentation and runtime collection.

End users then execute their tool-instrumented applications, again with task-specific runtime libraries linked. The instrumented code conducts runtime processing to perform the desired form of memory access monitoring, such as bug or race condition detection, security checking, or memory trace collection. The runtime libraries capture dynamic information to fill in parameters (such as the starting address of an array or the actual iteration count of a loop) to instantiate the Spindle MAS and complete the memory monitoring tasks. In addition, all the "unpredictable" memory access components, identified by Spindle at compile time as input-dependent, are monitored/recorded in the traditional manner.

Spindle's static analysis workflow to produce the MAS is further divided into multiple stages, performing intra-procedural analysis, inter-procedural analysis, as well as tool-specific analysis and instrumentation. During the intra-procedural stage, Spindle analyzes the program control flow graph and identifies the dependences among memory access instructions. The dependence checking is then expanded across functions in inter-procedural analysis.

One limitation of the current Spindle framework is that it requires source-level information of target programs. As this work is a proof-of-concept study, also considering the current trend of open-source software adoption [19, 20], our evaluation uses applications with source code available. Spindle can potentially work without source code, though: it starts with LLVM IR and can therefore employ open-source tools such as Fcd [21] or McSema [22] to translate binary code into IR. In future work, however, we are more interested in direct static analysis, performing tasks such as loop and dependency detection on binaries.

4.2.2 Sample Input/Output: Memory Trace Collector

We take S-Tracer, our Spindle-based trace collector, as an example to give a more concrete picture of how Spindle works. Suppose the application to be monitored is the bubble sort program listed in Fig. 4.2. S-Tracer's output, given in Fig. 4.3, is a complete yet compressed memory access trace, consisting of its MAS (coupled


Fig. 4.2 Sample bubble sort program

Fig. 4.3 Memory traces of the bubble sort program

with corresponding dynamic parameters) and dynamic traces collected in the conventional manner. In the static trace, we list the structure of the program, including the control flow, the memory access pattern, and the call graph. There are information items that cannot be determined during static analysis, such as the base address of array A and its size N (which is also the final value of the loop induction variables i and j), as well as the value of flag, which is data-dependent and determines the control flow of this program. The "Instrumented code 1" shown in Fig. 4.1 records these missing values at execution time, and they compose the dynamic trace shown on the right. This new trace format, though slightly more complex than traditional traces, is often orders of magnitude smaller. A straightforward post-processor can easily take S-Tracer traces and restore the traditional full traces. More practically, an S-Tracer trace driver performing similar decompression can be prepended to typical memory trace consumers to enable fast replay without involving large trace files or slow I/O.
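To make the decompression described above concrete, the sketch below regenerates a conventional per-access stream from one hypothetical MAS entry of the kind S-Tracer might record for a computable strided access. The entry layout, the field names, and the emit_access sink are illustrative assumptions, not S-Tracer's actual trace format.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical MAS entry for one computable access inside a loop:
     * addr(i) = base + stride * i, for i in [0, trip_count).
     * base and trip_count come from the dynamic trace; stride is static. */
    typedef struct {
        uint64_t base;        /* runtime-recorded base address (e.g., &A[0]) */
        int64_t  stride;      /* static byte stride (e.g., sizeof(int)) */
        uint64_t trip_count;  /* runtime-recorded loop iteration count */
    } mas_entry_t;

    static void emit_access(uint64_t addr)
    {
        /* Stand-in for a trace consumer or a file writer. */
        printf("LOAD 0x%llx\n", (unsigned long long)addr);
    }

    /* Expand one MAS entry into a conventional per-access trace. */
    static void expand_entry(const mas_entry_t *e)
    {
        for (uint64_t i = 0; i < e->trip_count; i++)
            emit_access(e->base + (uint64_t)((int64_t)i * e->stride));
    }

    int main(void)
    {
        mas_entry_t e = { 0x10000000ULL, 4, 8 };  /* example dynamic parameters */
        expand_entry(&e);
        return 0;
    }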


4.3 Static Analysis

4.3.1 Intra-procedural Analysis

During this first step, Spindle extracts a program's per-function control structure to identify memory accesses whose traces can be computed and hence can be (mostly) skipped in dynamic instrumentation.

4.3.1.1 Extracting Program Control Structure

A program's memory access patterns (or the lack thereof) are closely associated with its control flow. It is therefore not surprising that the structure we extract resembles the program's control flow graph (CFG); we call this graph the M-CFG. Unlike traditional control flow graphs, the M-CFG records only instructions containing memory references (rather than entire basic blocks), program control structures (loops and branches), and function calls. For loops and branches, we also record related variables, such as loop boundaries and branch conditions. With the M-CFG, memory access instructions are embedded within the program's basic control structures, as illustrated in Fig. 4.4 for the aforementioned BubbleSort function (Fig. 4.2). Here, the M-CFG records a nested loop containing two memory accesses and a branch with a function call. Section 4.3.1.2 discusses dependence analysis regarding memory access instructions and the identification of computable memory accesses, while Sect. 4.3.2 discusses the handling of function calls.
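For illustration, the snippet below is a bubble-sort-style loop similar in shape to the BubbleSort of Fig. 4.2 (the figure's exact code is not reproduced here). The comments mark the only elements an M-CFG keeps for it: the two loops with their bounds, the two loads, the branch condition, and the function call.

    void Swap(int *S, int i);   /* callee; the call site is recorded in the M-CFG */

    void BubbleSort(int *A, int N)
    {
        for (int i = 0; i < N; i++) {              /* M-CFG: loop L0, bound N */
            for (int j = 0; j < N - i - 1; j++) {  /* M-CFG: loop L1, bound N - i - 1 */
                int flag = A[j] > A[j + 1];        /* M-CFG: loads of A[j] and A[j+1] */
                if (flag)                          /* M-CFG: branch on data-dependent flag */
                    Swap(A, j);                    /* M-CFG: call to Swap */
            }
        }
    }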

4.3.1.2 Building Memory Dependence Trees

In Spindle, we classify all memory accesses into either computable or non-computable types. Computable accesses can have their traces computed based on the static trace, with the help of little or no dynamic information; non-computable ones, on the other hand, need to fall back to traditional instrumentation and runtime tracing.

Fig. 4.4 The M-CFG for the function BubbleSort


For such classification, we build a memory dependence tree for each memory access instruction. It records the data dependence between a specific memory access instruction and its related variables. The tree is rooted at the memory address accessed, with non-leaf nodes denoting operators between variables, such as addition or multiplication, and leaf nodes denoting variables in the program. Edges hence intuitively denote dependence. Below, we list the types of leaf nodes in memory dependence trees:

• Constant value: value determined at compile time
• Base memory address: start address of a contiguously allocated memory region (such as an array), with value acquired at compile time for global or static variables and at runtime for dynamically allocated variables
• Function parameter: value determined at either compile time or runtime (see Sect. 4.3.2)
• Data-dependent variable: value dependent on data not predictable at compile time, to be collected at runtime
• Function return value: value collected at runtime
• Loop induction variable: variable regularly updated at each loop iteration, value determined at compile time or runtime

Algorithm 6 Building a memory dependence tree

Input: a worklist WL[A]; predefined leaf types Type
Output: the memory dependence tree T(A)

    Insert a root node r into T(A)
    while WL[A] ≠ ∅ do
        Remove an item v1 from WL[A]
        if v1 ∉ Type then
            for each v2 ∈ UD(v1) do
                if v2 ∈ Type then
                    Insert a leaf node v2 into T(A)
                    Insert an edge from v1 to v2
                else
                    Insert an operator node v2 into T(A)
                    Add all variables used in v2 to WL[A]
                end if
            end for
        else
            Insert a leaf node v1 into T(A)
            Insert an edge from r to v1
        end if
    end while
    return T(A)

The memory dependence tree is built by performing a backward data flow analysis at compile time. Specifically, for each memory access, we start from the variable storing this memory address and traverse its use-define data structure, which describes the relation between the definition and use of each variable, to identify all the variables and operators affecting it. This traversal is an iterative


Fig. 4.5 Sample memory dependence tree

process that stops when all the leaf nodes are categorized into one of the types listed above. We give the worklist algorithm (Algorithm 6) that performs this backward data flow analysis: we repeatedly take variables storing memory addresses from the worklist WL[A] and iteratively find all the related variables through the use-define structure UD(v), until the worklist becomes empty.

Figure 4.5 shows a group of instructions (generated from the source code in Fig. 4.2) and the memory dependence tree corresponding to the variable %array.1 in the last line. Here, getelementptr is an instruction that calculates the address of an aggregate data structure (where an addition operation is implied) and does not access memory. We omit certain arguments of this instruction for simplicity. sext performs type casting. As to the leaf nodes, %A is an array base address, 4 is a constant value, and %i.0 is a loop induction variable.

Such a dependence tree allows us to approach the central task of Spindle: computable memory access identification. This is done by analyzing the types of the leaf nodes in the memory dependence tree. Intuitively, a memory access is computable if the leaf nodes of its dependence tree are either constants (trivial) or loop induction variables (computable by replicating the computation performed in the original program, using initial plus final values collected at compile time or runtime). The M-CFG and the memory access dependence trees, preserving control flows, data dependencies, and operations to facilitate such replication, can be viewed as a form of program pruning that only retains computation relevant to memory address calculation. By replacing each memory instruction of the M-CFG with its dependence tree, we obtain a single graph representing the main memory access patterns of a single function. Note that such dependence analysis naturally handles aliases.
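The classification itself reduces to inspecting leaf types. The sketch below uses assumed type names and an assumed tree layout (not Spindle's actual data structures): an access is treated as computable when every leaf is a constant, a base address, or a loop induction variable, and as non-computable as soon as a data-dependent variable, a function return value, or an unresolved function parameter appears.

    #include <stdbool.h>
    #include <stddef.h>

    /* Assumed node categories, mirroring the leaf types of Sect. 4.3.1.2. */
    typedef enum {
        LEAF_CONSTANT,
        LEAF_BASE_ADDRESS,
        LEAF_FUNC_PARAMETER,    /* unresolved until inter-procedural analysis */
        LEAF_DATA_DEPENDENT,
        LEAF_FUNC_RETURN,
        LEAF_LOOP_INDUCTION,
        NODE_OPERATOR           /* non-leaf: +, *, sext, ... */
    } node_kind_t;

    typedef struct dep_node {
        node_kind_t kind;
        struct dep_node *children[2];   /* NULL for leaves */
    } dep_node_t;

    /* An access is computable if every leaf can be resolved from static
     * information plus a few runtime-recorded values (base addresses and
     * loop bounds); other leaves force a fallback to runtime tracing. */
    static bool is_computable(const dep_node_t *n)
    {
        if (n == NULL)
            return true;
        switch (n->kind) {
        case LEAF_CONSTANT:
        case LEAF_BASE_ADDRESS:
        case LEAF_LOOP_INDUCTION:
            return true;
        case NODE_OPERATOR:
            return is_computable(n->children[0]) &&
                   is_computable(n->children[1]);
        default:    /* data-dependent, function return, unresolved parameter */
            return false;
        }
    }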

4.3.2 Inter-procedural Analysis

At the end of the intra-procedural analysis, we have a memory dependence tree for every memory access within each function. Below, we describe how Spindle analyzes memory address dependence across functions.


Fig. 4.6 Transformation of dependence tree

The core idea here is to propagate function arguments plus their dependences from the caller to the callee and to replace all the function parameters in the callee's dependence trees with the actual parameters. For this, we first build a program call graph (PCG), on which we subsequently perform top-down inter-procedural analysis. Algorithm 7 gives the detailed process.

Algorithm 7 Inter-procedural analysis

Input: the dependence trees for each procedure p
Input: the program call graph (PCG)

    Change ← True
    /* Top-down inter-procedural analysis */
    while Change == True do
        Change ← False
        for all procedures p in pre-order over the PCG do
            for all dependence trees d in p do
                if a leaf node l of d is a function parameter then
                    Replace l with its actual parameter
                    Change ← True
                end if
            end for
        end for
    end while

Figure 4.6 illustrates the transformation that a dependence tree in function Swap (Fig. 4.2) undergoes during inter-procedural analysis. After intra-procedural analysis, the dependence tree for the load instruction Load3 of function Swap has two leaf nodes that are function parameters, which cannot be resolved at that point because the variables %S and %i.0 are undetermined. During inter-procedural analysis, these two nodes are replaced with their actual parameters, a base address %A and a loop induction variable %i.0. Now, the dependence tree rooted at %array.1 is computable. For function calls forming a loop in the PCG, such as recursive calls, we currently do not perform parameter replacement for any function in this loop during inter-procedural analysis, as when these functions terminate is typically data-dependent.


Fig. 4.7 NPB CG code with index array colidx

4.3.3 Special Cases and Complications

Index Arrays If a memory dependence tree has data-dependent variables as its leaf nodes, we normally consider it non-computable. However, we may still be able to extract regular patterns. An index array is an important case of such data-dependent variables, storing "links" to other data structures, as explained below. Figure 4.7 gives a simplified version of a code snippet from NPB CG [23], where the array z is repeatedly accessed via the index array colidx, which cannot be determined at compile time. However, we find that in many programs (including this one) the index array itself is not modified across multiple iterations of accesses. Therefore, there is still significant room for finding repeated access patterns and removing redundancy. To this end, Spindle performs the following extra evaluation during its static analysis. First, it compares the size of the index array with its total access count. If the latter is larger, we only need to record the content of the index array and compute the memory accesses accordingly, rather than instrumenting them at runtime. Such evaluation needs to be repeated if the content of the index array changes, of course. This is the case with the example given in Fig. 4.7, where the total memory access count through the index array colidx is i*m, greater than the size of colidx. Thus, at runtime, we only need to record its content at the beginning of this nested loop and the base address of array z. Combining such information with the memory dependence tree, we can compute all the memory access locations (a sketch of the resulting instrumentation appears at the end of this subsection).

Multi-Threaded Programs The discussion so far has been focused on analyzing single-threaded programs. However, Spindle's methodology can also be easily applied to multi-threaded applications. Spindle is thread-safe, and we perform the same static analysis as for single-threaded programs, except that we also mark the point where a new thread is created and record relevant parameter values. With parallel executions, during dynamic memory monitoring (discussed in the next section), the current thread ID can easily be fetched along with information such as loop iteration counts and branches taken, which allows us to distinguish runtime information collected by different threads. Note that certain techniques need to be augmented to handle multi-threaded executions. For example, the index array technique (Sect. 4.3.3) needs to be protected by an additional check, as an array could be modified by another thread.


Again, for addresses or values that cannot be determined at compile time, such as shared objects or branches affected by other threads, we fall back to runtime instrumentation. Typical SPMD codes will thus share the same static MAS, supplemented by per-thread or even per-process runtime information, making Spindle even more appealing in efficiency and scalability. If a significant amount of output is generated, such as with memory trace collection, Spindle gives users the option to examine a single thread's memory accesses or to correlate accesses from all threads (though trace interleaving is a separate topic that requires further study). For example, with pthreads, Spindle instruments pthread_create to record where a new thread is created. During multi-threaded execution, the appropriate thread ID is recorded for each function. Thus, we know which thread the dynamic information collected by Spindle belongs to and can therefore apply per-thread static analysis, similar to that in single-threaded executions.
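Returning to the index-array case above, the sketch below shows what the instrumented code could look like for a simplified stand-in of the Fig. 4.7 pattern (it is not the original NPB CG source). The record_array and record_base calls are hypothetical runtime-library stubs, and the decision that the accesses through colidx outnumber its elements is assumed to have been made by Spindle at compile time.

    #include <stdio.h>
    #include <stddef.h>

    /* Stubs for hypothetical runtime-library calls (illustrative only). */
    static void record_array(const char *name, const void *data, size_t bytes)
    {
        fprintf(stderr, "record %s: %zu bytes at %p\n", name, bytes, data);
    }
    static void record_base(const char *name, const void *addr)
    {
        fprintf(stderr, "record base of %s: %p\n", name, addr);
    }

    void sparse_kernel(double *a, const double *z, const int *colidx, int n, int m)
    {
        /* colidx's content and z's base are logged once; the n*m accesses
         * z[colidx[k]] can then be computed offline from this log instead
         * of being instrumented individually. */
        record_array("colidx", colidx, (size_t)m * sizeof(int));
        record_base("z", z);
        for (int i = 0; i < n; i++)
            for (int k = 0; k < m; k++)
                a[i] += z[colidx[k]];
    }

Likewise, the pthread_create instrumentation just mentioned could take a form similar to the wrapper below; spindle_record_thread_create is again a made-up name standing in for the actual recording call.

    #include <pthread.h>
    #include <stdio.h>

    /* Stub for a hypothetical runtime-library call (illustrative only). */
    static void spindle_record_thread_create(void *(*start_routine)(void *), void *arg)
    {
        (void)start_routine;
        (void)arg;
        fprintf(stderr, "new thread created\n");
    }

    /* Records where a new thread starts so that later runtime records can
     * be attributed to the right thread against the shared static MAS. */
    int instrumented_pthread_create(pthread_t *tid, const pthread_attr_t *attr,
                                    void *(*start_routine)(void *), void *arg)
    {
        spindle_record_thread_create(start_routine, arg);
        return pthread_create(tid, attr, start_routine, arg);
    }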

4.4 Spindle-Based Runtime Monitoring

This section illustrates how Spindle's static analysis results can be used to reduce runtime instrumentation. We first describe the common runtime information to be obtained through instrumentation and then present two samples of Spindle-based tool design, for memory bug detection and for memory trace collection, respectively.

4.4.1 Runtime Information Collection

During program runs, Spindle's static memory access skeleton is supplemented by information not available at compile time. Generally, three cases require instrumentation: control structures, input-dependent variables, and non-computable memory accesses.

Control Structures Spindle needs to record the initial values of all the loop induction variables and the loop iteration counts if they are unknown at compilation time. Moreover, for a loop with multiple exit points, we need to instrument each exit point to track where the loop exits. Similarly, for conditional branches in the MAS, we need to record their taken statuses to track the paths taken.

Input-Dependent Variables For input-dependent variables, runtime information is necessary, but certain static analysis can indeed reduce runtime overhead. For instance, the address of a dynamically allocated memory region can be obtained at runtime by collecting actual values. An optimization in Spindle is that we do not instrument every instruction that references input-dependent variables, but only where they are defined, initialized, or updated. For example, for a global variable


Fig. 4.8 Sample runtime information collection

needed by the analysis, Spindle leverages static analysis to record its value only at the beginning of the program and then again upon each update.

Non-computable Memory Accesses For non-computable memory accesses (as mentioned in Sect. 4.3.1.2), we fall back to conventional dynamic monitoring/instrumentation.

Figure 4.8 shows an example of runtime information collection for the BubbleSort routine discussed earlier in Sect. 4.2.2. The left side gives the dependence tree of the variable %array.1 in function Swap, where the undetermined address %A and the final value %N of loop L0's induction variable need to be collected at runtime. Note that L0's initial index value (0) and increment (1) can be determined at compile time. The right side lists the instrumented BubbleSort code. Here, Spindle automatically instruments three memory accesses by inserting the highlighted statements (for A, N, and the branch-related flag, which falls outside the dependence tree shown). variable_id, loop_id, and path_id are also automatically generated by Spindle for its runtime library to find the appropriate static structures.
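Expressed in source form, the instrumentation sketched in Fig. 4.8 might look roughly like the following (a stand-in, not the figure's actual code); the spindle_record_* names and the id arguments are illustrative placeholders for the variable_id, loop_id, and path_id that Spindle generates.

    #include <stdint.h>
    #include <stdio.h>

    /* Stubs for hypothetical runtime-library calls (illustrative only). */
    static void spindle_record_value(int variable_id, long value)
    {
        printf("var %d = %ld\n", variable_id, value);
    }
    static void spindle_record_loop_bound(int loop_id, long bound)
    {
        printf("loop %d bound = %ld\n", loop_id, bound);
    }
    static void spindle_record_path(int path_id, int taken)
    {
        printf("path %d taken = %d\n", path_id, taken);
    }

    void Swap(int *S, int i);   /* as called from BubbleSort in Fig. 4.2 */

    /* Instrumented BubbleSort: only the base address of A, the loop bound N,
     * and the data-dependent branch outcomes are recorded; the individual
     * addresses of A[j] and A[j+1] are never logged. */
    void BubbleSort(int *A, int N)
    {
        spindle_record_value(0, (long)(intptr_t)A);
        spindle_record_loop_bound(0, N);
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N - i - 1; j++) {
                int flag = A[j] > A[j + 1];
                spindle_record_path(0, flag);
                if (flag)
                    Swap(A, j);
            }
        }
    }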

4.4.2 Spindle-Based Tool Developing

Spindle performs automatic code instrumentation for runtime information collection, based on its static analysis. To build a memory monitoring tool on top of Spindle, users only need to supply additional code using its API to perform custom analysis, as illustrated below. Our two sample tools, S-Detector and S-Tracer, each take under 500 lines of code to implement, covering both the compile-time analysis and the runtime library.

4.4.2.1 Memory Bug Detector (S-Detector)

Memory bugs, such as buffer overflow, use after free, and use before initialization, may cause severe runtime errors or failures, especially with programming languages like C and C++. A series of tools, software- or hardware-based, have been developed to detect memory bugs at compile time or runtime. Among them, Memchecker [24] uses hardware support for memory access monitoring and debugging and is therefore fast (only 2.7% performance overhead for SPEC CPU 2000). Such special-purpose hardware is nevertheless not yet adopted by general processors. ARCHER [25] relies on static analysis only, so it faces the difficult trade-off between accuracy (false positives) and soundness (false negatives), like other static tools. A recent, state-of-the-art tool is AddressSanitizer (ASan) [12], an industrial-strength memory bug detection tool developed by Google and now built into the LLVM compiler. ASan inserts memory checking instructions (such as checks for out-of-bound array accesses) into programs at compile time and then uses shadow memory [26] for fast runtime checking. Despite being well implemented and highly tuned, ASan still introduces a two to three times slowdown to SPEC programs.

In this work, we present S-Detector, a memory bug detector that leverages Spindle-gathered static information to eliminate unnecessary instrumentation and facilitate efficient online memory checking. Our proof-of-concept implementation of S-Detector can currently detect invalid accesses (e.g., out-of-bound array accesses and use after free) and memory leaks (dynamically allocated objects remaining unfreed upon program termination). With Spindle's MAS, S-Detector is aware of a program's groups of memory accesses and is therefore able to perform checking at a coarser granularity. For example, with dynamically allocated arrays, even when neither the starting address (base) nor the size (bound) is known at compile time, the accesses are given relative to these two values and can therefore be checked for out-of-bound bugs at compile time. With existing tools like ASan, however, such checks are delayed till runtime and repeated at every memory access. Therefore, S-Detector performs aggressive memory check pruning by proactively conducting compile-time access analysis and replacing instruction-level checks with object-level ones. Only for accesses labeled "non-computable" by Spindle does S-Detector fall back to traditional instrumentation. Below, we illustrate S-Detector's memory check pruning with two sample scenarios, both contained in the same code snippet from SPEC CPU2006 mcf (Fig. 4.9).

In-structure Accesses This sample code references an array of structures (new), issuing multiple accesses to members of its elements. In this case, assisted with the Spindle-extracted MAS, all access targets can be represented as addr = struct_base + constant_offset. Once S-Detector finds that the constant offset is valid for this struct, i.e., offset. p0 ) can be predicted.)

The prediction accuracy with Phantom is much higher than that with the regression-based approach for most programs. We also analyze the average communication time percentage for these programs on the InfiniBand network in Fig. 6.14. For most of these programs, communication time increases with the number of processes. Among these programs, Sweep3D is the most communication-intensive, with a maximum communication percentage of 46.12%. We predict the performance for Sweep3D on three target platforms. The real execution time is measured on each target platform to validate our predicted results.
All the message logs are collected on the Explorer platform. As shown in Fig. 6.15, Phantom achieves high prediction accuracy on these platforms. Prediction errors on the Dawning, DeepComp-F, and DeepComp-B platforms are on average 2.67%, 1.30%, and 2.34%, respectively, with a maximum error of only −6.54% on the Dawning platform


for 128 processes. Phantom shows better prediction accuracy as well as greater stability across different platforms than the regression-based approach. For example, on the DeepComp-B platform, the prediction error for Phantom is 4.53% with 1024 processes, versus 23.67% for the regression-based approach (p0 = 32, 64, 128 used for training). Note that although the Dawning platform has a lower CPU frequency and peak performance than the DeepComp-F platform, it delivers better application performance below 256 processes. The DeepComp-B platform shows the best performance for Sweep3D among the three platforms.

6.8.4 Performance Prediction for Amazon Cloud Platform

Figure 6.16 shows the prediction results for the Amazon cloud platform with Phantom. The sequential computation time of the representative processes is acquired using a single node of the Amazon EC2 platform. Phantom shows very high prediction accuracy for most of the applications: the prediction error is less than 7% on average over all the applications. For both 113.GemsFDTD and 128.GAPgeofem, the prediction errors are relatively high. This is because communication time accounts for most of the execution time in these programs, and communication contention becomes more serious as the number of processes grows. For example, the communication time percentage in 113.GemsFDTD is more than 80% with 128 processes. In addition, we find that 113.GemsFDTD, 127.wrf2, and 128.GAPgeofem are not scalable on the cloud platform.

Comparing the communication time on the HPC platform with that on the Amazon cloud platform in Figs. 6.14 and 6.16, we find that communication becomes a main bottleneck on the cloud platform. On the HPC platform, which uses a DDR IB network, the communication time for most of the programs is less than 40% of the total execution time, while on the Amazon cloud platform, which uses 10 Gigabit Ethernet, the communication time for some of the applications exceeds 40%. Figure 6.17 shows the latency and bandwidth on both platforms. The HPC platform has much lower latency for small messages than the cloud platform, while the bandwidths of the two platforms are very close for large messages. We can conclude that communication latency is a main factor hindering tightly coupled parallel applications on the current cloud platform.

6.8.5 Message Log Size and Replay Overhead

In Phantom, we only record the message logs for the representative processes and the processes executing on the same node as them. As shown in Sect. 6.8.2.1, the number of process groups is far smaller than the number of processes, so the message log size is reasonable for all the programs. As shown in Table 6.4, SP has the largest message logs, while EP has the smallest due to its little communication.


Fig. 6.16 Predicted time with Phantom on the Amazon EC2 cloud platform (Measured denotes the real execution time of the applications; comm denotes the percentage of total execution time spent in communication)


Fig. 6.17 Latency and bandwidth on HPC platform and Amazon cloud platform

Table 6.4 Message log size (in Gigabytes, except EP in Kilobytes)

Proc. #   BT     CG     EP       LU     MG     SP     S3-S   S3-W
16        3.01   2.12   1.03 K   1.39   0.3    5.49   0.39   0.14
32        6.14   1.59   1.03 K   2.79   0.62   11.2   0.78   0.55
64        3.5    1.85   1.03 K   2.79   0.46   6.38   0.78   0.55
128       2.6    1.85   1.04 K   3.32   0.58   4.75   0.92   1.29
256       1.99   1.99   1.06 K   3.06   0.72   3.65   0.84   1.28

Figure 6.18 shows the replay-based execution time compared with normal execution for each program with 256 processes. Because most of the incoming messages are read from the log files and little synchronization overhead is introduced, the replay-based execution time is lower than the normal execution time for most programs. For example, in weak-scaling Sweep3D with 256 processes, communication and synchronization costs account for more than 46% of the execution time, about 32.62 s, while the overhead introduced during the replay is 8.28 s. As a result, the replay-based execution time is much smaller than the normal execution time.

6.8.6 Performance of SIM-MPI Simulator

The SIM-MPI simulator has high efficiency, since only the communication operations need to be simulated. All the simulations in this chapter are executed on a server node equipped with two quad-core Xeon E5504 processors (2.0 GHz) and 12 GB of memory. Table 6.5 gives the performance of the SIM-MPI simulator for different programs. For most programs, the simulation time is less than one minute. Among these programs, LU has the longest simulation time due to its frequent communication operations.


Fig. 6.18 The elapsed time of replay-based execution compared with normal execution for each program with 256 processes

Table 6.5 Performance of SIM-MPI simulator (in seconds)

Proc. #   BT      CG      EP     LU      MG     SP      S3-S   S3-W
16        0.49    0.79    0.04   4.17    0.20   0.93    0.36   0.19
32        1.51    1.98    0.09   8.57    0.60   2.82    0.79   0.43
64        3.24    4.20    0.15   17.33   0.75   6.21    1.62   0.92
128       9.97    11.27   0.25   34.19   1.35   19.36   2.75   1.97
256       30.38   21.19   0.49   66.47   2.73   39.41   5.24   4.24

6.9 Discussions

Problem Size The problem size we can deal with is limited by the scale of the host platform, since we need to execute the parallel application with the same problem size and the same number of parallel processes on it to collect the message logs required in the replay phase. It should be noted that neither the CPU speed nor the interconnect performance of the host platform is relevant to the accuracy of performance prediction on target platforms in our framework. This implies that we can even generate message logs on a host platform with fewer processors/cores than the target platform. The only hard requirement for the host platform is its memory size. There are several potential ways to address this limitation. One is to use grid computing techniques, executing applications on grid systems that provide a larger memory size than any single host platform. Another promising way is to use SSD (solid-state drive) devices and virtual memory to trade speed for cost. Note that the message logs only need to be collected once for one application for a


given problem size, which is a favorable feature of our approach that avoids a high cost for message log collection.

Node of Target Platforms We assume that we have at least one node of the target platform, which enables us to measure computation time at real execution speed. This raises the question of how we can predict the performance of a target platform without even a single node. Our approach can work with a single-node simulator, which is usually ready years before the full parallel machine. It is clear that this will be much slower than measurement. Thanks to the representative replay technique we propose, we only need to simulate a few representative processes, and the simulation can also be performed in parallel.

I/O Operations Our current approach only models and simulates the communication and computation of parallel applications. However, I/O operations are also an important aspect of parallel applications. In this chapter, we focus on how to acquire sequential computation time; we believe the framework of our approach can be extended to cope with I/O operations, although there are many pending issues to investigate.

Nondeterministic Applications As a replay-based framework, Phantom has limitations in predicting performance for applications with nondeterministic behavior. Phantom can only predict the performance of one possible execution of a nondeterministic application. However, we argue that for well-behaved parallel applications, nondeterministic behavior should not have a significant impact on performance, because otherwise it would imply poor performance portability. So we believe it is acceptable to use the predicted performance of one execution to represent the performance of well-behaved applications.

6.10 Related Work

Performance prediction of parallel applications has been the subject of a large body of prior work. There are two types of approaches for performance prediction. One approach is to build an analytical model of the application on the target platform [3, 4, 6, 14, 29, 30]. Spafford et al. also propose a domain-specific language for performance modeling [31]. The main advantage of analytical methods is their low cost. However, constructing analytical models of parallel applications requires a thorough understanding of the algorithms and their implementations. Most such models are constructed manually by domain experts, which limits their accessibility to normal users. Moreover, a model built for one application cannot be applied to another. Phantom is an automatic framework that requires little user intervention.

The second approach is to develop a system simulator and execute applications on it for performance prediction. Simulation techniques can capture detailed performance behavior at all levels and can be used automatically to model a given


program. However, an accurate system simulator is extremely expensive, not only in terms of simulation time but especially in memory requirements. Existing simulators, such as BigSim and MPI-SIM [7, 32, 33], are still inadequate for simulating the very large problems that are of interest to high-end users. Trace-driven simulation [1, 12] and macro-level simulation [13] have better performance than detailed system simulators, since they only need to simulate the communication operations. In previous work, the sequential computation time is usually acquired by analytical methods or extrapolation; we have discussed their limitations in Sect. 6.1. The representative replay technique proposed in this chapter acquires more accurate computation time, which can be used in both trace-driven simulation and macro-level simulation. Our prototype system, Phantom, is also a trace-driven simulator integrated with representative replay. Moreover, Phantom adopts the FACT technique, which makes it feasible to collect the communication traces needed by the trace-driven simulator on a small-scale system.

Yang et al. propose a cross-platform prediction method based on relative performance between target platforms without program modeling, code analysis, or architecture simulation [8]. Their approach works well for iterative parallel codes that behave predictably. In order to measure partial iteration performance, their approach requires a full-scale target platform to be available, while our approach only requires a single node of the target platform. Lee et al. present piecewise polynomial regression models and artificial neural networks that predict application performance as a function of its input parameters [34]. Barnes et al. [5] employ regression-based approaches to predict parallel program scalability, and their method shows good accuracy for some applications. However, the number of processors used for training is still very large for good accuracy, and their method only supports load-balanced workloads. Statistical techniques have been widely used for studying program behavior from large-scale data [35, 36]. Our approach is inspired by this previous work and also adopts statistical clustering.

6.11 Conclusion

In this chapter, we demonstrate the benefit of an automatic and accurate prediction method for large-scale parallel applications. We propose a novel technique that uses deterministic replay to acquire accurate sequential computation time for large-scale parallel applications on a single node of the target platform and integrate this technique into a trace-driven simulation framework to accomplish effective performance prediction. We further propose a representative replay scheme that employs the similarity of computation patterns in parallel applications to significantly reduce the time of prediction. We verify our approach on a traditional HPC platform and the latest Amazon EC2 cloud platform, and the experimental results show that our approach achieves high prediction accuracy on both types of platforms.


The approach proposed in this chapter is a combination of operating system and performance analysis techniques.

References

1. Snavely, A., et al. (2002). A framework for application performance modeling and prediction. In SC '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (pp. 1–17).
2. Marin, G., & Mellor-Crummey, J. (2004). Cross-architecture performance predictions for scientific applications using parameterized models. In SIGMETRICS.
3. Kerbyson, D. J., et al. (2001). Predictive performance and scalability modeling of a large-scale application. In Supercomputing (pp. 37–48).
4. Sundaram-Stukel, D., & Vernon, M. K. (1999). Predictive analysis of a wavefront application using LogGP. In PPoPP '99: Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 141–150).
5. Barnes, B. J., et al. (2008). A regression-based approach to scalability prediction. In Proceedings of the 22nd Annual International Conference on Supercomputing (pp. 368–377). ACM.
6. Mathias, M., Kerbyson, D., & Hoisie, A. (2003). A performance model of non-deterministic particle transport on large-scale systems. In Workshop on Performance Modeling and Analysis. ICCS.
7. Zheng, G., Kakulapati, G., & Kale, L. V. (2004). BigSim: A parallel simulator for performance prediction of extremely large parallel machines. In 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings (pp. 78–87).
8. Yang, X., et al. (2008). Compiler-assisted application-level checkpointing for MPI programs. In 2008 The 28th International Conference on Distributed Computing Systems.
9. Amazon Inc. (2011). High Performance Computing (HPC). http://aws.amazon.com/ec2/hpcapplications/
10. Zhai, Y., et al. (2011). Cloud versus in-house cluster: Evaluating Amazon cluster compute instances for running MPI applications. In SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
11. Choudhury, N., Mehta, Y., & Wilmarth, T. L., et al. (2005). Scaling an optimistic parallel simulation of large-scale interconnection networks. In Proceedings of the Winter Simulation Conference, 2005 (pp. 591–600).
12. Labarta, J., et al. (1996). DiP: A parallel program development environment. In European Conference on Parallel Processing (pp. 665–674). Springer.
13. Susukita, R., Ando, H., & Aoyagi, M., et al. (2008). Performance prediction of large-scale parallel system and application using macro-level simulation. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (pp. 1–9).
14. Barker, K. J., Pakin, S., & Kerbyson, D. J. (2006). A performance model of the Krak hydrodynamics application. In 2006 International Conference on Parallel Processing (ICPP'06) (pp. 245–254).
15. Xue, R., et al. (2009). MPIWiz: Subgroup reproducible replay of MPI applications. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 251–260).
16. Maruyama, M., Tsumura, T., & Nakashima, H. (2005). Parallel program debugging based on data-replay. In PDCS'05 (pp. 151–156).
17. LeBlanc, T. J., & Mellor-Crummey, J. M. (1987). Debugging parallel programs with instant replay. IEEE Transactions on Computers, 36(4), 471–482.
18. Bouteiller, A., Bosilca, G., & Dongarra, J. (2007). Retrospect: Deterministic replay of MPI applications for interactive distributed debugging. In EuroPVM/MPI (pp. 297–306).


19. Zhai, J., et al. (2015). Performance prediction for large-scale parallel applications using representative replay. IEEE Transactions on Computers, 65(7), 2184–2198.
20. Zhai, J., Chen, W., & Zheng, W. (2010). PHANTOM: Predicting performance of parallel applications on large-scale parallel machines using a single node. In PPoPP.
21. Shao, S., Jones, A. K., & Melhem, R. G. (2006). A compiler-based communication analysis approach for multiprocessor systems. In Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
22. Alexandrov, A., et al. (1997). LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing, 44(1), 71–79.
23. Sur, S., et al. (2006). RDMA read based rendezvous protocol for MPI over InfiniBand: Design alternatives and benefits. In PPoPP '06: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 32–39).
24. Zhang, J., et al. (2009). Process mapping for MPI collective communications. In Euro-Par.
25. Tsinghua University. SIM-MPI Simulator. http://www.hpctest.org.cn/resources/sim-mpi.tgz
26. Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center.
27. Standard Performance Evaluation Corporation (2007). SPEC MPI2007 Benchmark Suite. http://www.spec.org/mpi2007/
28. LLNL. ASCI Purple Benchmark. https://asc.llnl.gov/computing_resources/purple/archive/benchmarks
29. Hoisie, A., et al. (2000). A general predictive performance model for wavefront algorithms on clusters of SMPs. In Proceedings 2000 International Conference on Parallel Processing (pp. 219–228).
30. Meng, J., et al. (2012). Dataflow-driven GPU performance projection for multi-kernel transformations. In SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (pp. 82:1–82:11).
31. Spafford, K. L., & Vetter, J. S. (2012). Aspen: A domain specific language for performance modeling. In SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (pp. 84:1–84:11).
32. Wilmarth, T., Zheng, G., & Bohm, E. J., et al. (2005). Performance prediction using simulation of large-scale interconnection networks in POSE. In Proceedings of the 19th Workshop on Parallel and Distributed Simulation (pp. 109–118).
33. Prakash, S., & Bagrodia, R. (1998). MPI-SIM: Using parallel simulation to evaluate MPI programs. In Winter Simulation Conference (pp. 467–474).
34. Lee, B. C., Brooks, D. M., & de Supinski, B. R., et al. (2007). Methods of inference and learning for performance modeling of parallel applications. In PPoPP '07: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 249–258).
35. Zhong, Y., et al. (2004). Array regrouping and structure splitting using whole-program reference affinity. In PLDI'04 (pp. 255–266).
36. Sherwood, T., et al. (2002). Automatically characterizing large scale program behavior. In ASPLOS X: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 45–57).

Part IV

Performance Analysis Methods: Noise Analysis

Chapter 7

Lightweight Noise Detection

Abstract Performance variance of parallel and distributed systems is becoming increasingly severe. The runtimes of different executions can vary greatly even with a fixed number of computing nodes, and many HPC applications on supercomputers exhibit such variance. Efficient online performance variance detection is an open problem in HPC research. To solve it, we propose an approach, called vSensor, to detect the performance variance of systems. The key finding of this study is that the source code of a program can better represent its runtime performance than an external detector. Specifically, many HPC applications contain code snippets that are executed with fixed-workload patterns, e.g., an invariant workload or a linearly growing workload. This observation allows us to automatically identify these snippets of workload-related code and use them to detect performance variance. We evaluate vSensor on the Tianhe-2A system with a large number of parallel applications, and the results indicate that it can efficiently identify variations in system performance. The average overhead with 4,096 processes is less than 6% for fixed-workload v-sensors. With vSensor, we identified a problematic node with slow memory and a network issue on the Tianhe-2A system that degraded programs' performance by 21% and 3.37×, respectively. (© 2022 IEEE. Reproduced, with permission, from Jidong Zhai et al., Leveraging code snippets to detect variations in the performance of HPC systems, IEEE Transactions on Parallel and Distributed Systems, 2022.)

7.1 Introduction

Current large-scale, high performance computing (HPC) systems [1–3] suffer from serious variations in performance. Because of this [4], the runtimes of different executions of the same application may vary considerably. This behavior is very common in major HPC centers, and an example is shown in Fig. 7.1. On a fixed number of nodes of the Tianhe-2A supercomputer [5], the NPB-FT program with CLASS D input using 1,024 processes was executed many times. A segment of a longer execution is shown in Fig. 7.1. The system itself or other jobs generated background


Fig. 7.1 Variation in performance with a fixed number of computing nodes. The runtime of NPB-FT using 1,024 processes varied significantly among different executions on a fixed number of nodes of the Tianhe-2 HPC system

noise. The variance in performance across different executions was significant, and the longest runtime was more than three times the shortest one.

Both normal users and application developers can be adversely affected by the prevalent variance in performance. For common users, it can lead to unexpected behavior of an executing application, resulting in violations of performance-related requirements and higher resource costs. Furthermore, it becomes complicated to measure and compare the performance of different programs. For application developers, the background variance in performance can hide the benefit of a novel optimization technique.

Based on previous work [3, 6, 7], the variance in performance can be caused by a number of reasons, e.g., network contention, OS scheduling, zombie processes, and hardware faults. Certain types of performance variance can be successfully prevented according to their causes, whereas others are not preventable. For instance, if a problematic node leads to a variance in performance, system users can replace it with another node and execute the job again. By contrast, users have few options for avoiding the variance caused by network contention, because the network is generally shared by many users. Therefore, we need to understand two key issues before we resubmit a job or blame the system for such variation: (1) the degree of the variance in performance and (2) its root cause.

Four main methods are currently used to identify variance in the performance of applications, but none of them is good enough to address the two challenges stated above. (1) Rerun. A basic way of observing variance in performance is to execute a program many times and compare the durations of the different executions. This strategy consumes more system resources and requires a long time for re-executions. (2) Performance models [1, 8]. An accurate performance model estimates the execution time of an application, so that the difference between its estimated and observed performance can be measured. However, most models can estimate only the total variance in performance, rather than identifying the cause of such variance. In addition, a performance model is usually not portable: a model designed for one program usually fails to deliver satisfactory predictions for another. (3) Tracing and profiling [9, 10]. These methods have major drawbacks even though they have been used in many cases. Due to the omission of information on the time sequence of applications, profiling-based methods fail to find time-dimensional


variance in performance. Tracing-based methods generate massive amounts of trace data, particularly as the problem size and job scale increase [11]. Moreover, the overhead of tracing-based methods limits their use for identifying variance in performance on the fly. Even with trace compression techniques [12, 13], knowledge of the application is necessary to understand the collected traces, which is not practical for normal users. (4) Benchmarking. With benchmarks of fixed-work quanta (FWQ), we can measure the variance in performance. When a fixed amount of work is executed repeatedly, we can observe a variance in performance if the execution times vary for the same job. For instance, variance in network performance may be identified continuously by repeatedly performing and timing the same communication. The key problem is the intrusiveness of this approach: it may incur extra variance in performance due to the contention for resources between the benchmarks and the original application. It is therefore not suitable for production programs.

Although a considerable amount of research has been dedicated to detecting the variance in performance and its root causes, successful detection on the fly remains an outstanding problem for large-scale parallel applications. To solve this problem, we develop vSensor, a lightweight approach for detecting variance in HPC applications on the fly. Our main insight is that common parallel applications may include code snippets that exhibit the same behavior as FWQ benchmarks. Many code fragments in a loop, for instance, may execute in a similar manner across multiple iterations. Such snippets can be regarded as FWQ benchmarks incorporated into a program. We call such a snippet a v-sensor. V-sensors can detect variance in performance during execution without significant overhead.

Using v-sensors to track variance in the performance of applications has three advantages: (1) There is no need for a performance model, which can be very difficult to build. (2) It has low interference and minimal cost. During execution, vSensor does not need to run external programs, such as external benchmarks. This avoids the overhead and resource contention caused by monitoring daemons. (3) V-sensors can make the identification of the root causes of a variance in performance easier.

However, we need to handle two challenges in implementing v-sensors as a tool to detect variance in performance. (1) Detection of v-sensors. It is challenging to identify v-sensors in a large amount of code; for instance, different workloads can be processed by snippets in different iterations, various arguments can influence program behavior, and programs can take various paths. For complicated applications, it is unrealistic for programmers to mark v-sensors by themselves, even if the given application is well understood. (2) Reducing on-the-fly tracking overhead. Although v-sensors belong to the original applications, the variance in performance still needs to be observed by additional analysis and measurement. We need to keep the overhead low to avoid increasing the execution time or exacerbating the variance.

To handle these difficulties, we use a hybrid static-dynamic method for program analysis. During compilation, we use a dependence propagation algorithm to automatically locate the v-sensors and then create and set up suitable v-sensors.


During runtime, a lightweight analysis algorithm identifies performance variance on the fly based on the v-sensors. We use the characteristics of each v-sensor to identify variance in its performance by comparing it with historical data on its behavior, and by comparing the performance of the same v-sensor across processes on a large-scale parallel system we identify variance among processes.

To measure the efficacy of our method, we develop vSensor [14] as a tool chain integrated with LLVM [15]; it supports HPC applications written in MPI [16]. In experiments, vSensor was tested on 4,096-process MPI programs on the Tianhe-2A HPC system. The results show that vSensor detects v-sensors effectively and incurs an overhead of less than 6% when using fixed-workload v-sensors for online variance detection. Two further classes of v-sensors, i.e., regular-workload v-sensors and external v-sensors, significantly enhance the detection capability and applicability of our tool. We also verify the usefulness of vSensor in empirical scenarios: on the Tianhe-2A system, it detected a bad node that degraded the performance of CG by 21%, and it identified a serious network issue that slowed the FT program down by 3.37×.

We make the following contributions in this chapter:

• We obtain the key insight that fixed-work quanta benchmarks inside applications, called v-sensors, are valuable for detecting variance, and we propose a lightweight online performance variance detection algorithm based on the v-sensors.
• We develop vSensor, a tool that detects performance variance in HPC applications and automatically identifies v-sensors in the source code by combining static and dynamic analyses.
• We evaluate vSensor on up to 4,096 processes and verify its ability to identify a network issue that slowed the application down by 3.37×.

The remainder of this chapter is arranged as follows: the overall design of vSensor is provided in Sect. 7.2. The compile-time v-sensor search method is given in Sect. 7.3. V-sensors with variable workloads, i.e., regular v-sensors, are detailed in Sect. 7.4. Rules for the selection of v-sensors are described in Sect. 7.5. The on-the-fly detection algorithm is provided in Sect. 7.6, and the results of the evaluation are given in Sect. 7.7. Section 7.8 summarizes related work, and we conclude in Sect. 7.9.

7.2 vSensor Design

The key idea of vSensor is to locate code snippets with certain recognizable patterns, i.e., v-sensors, in user programs and to detect system-wide performance variance via their runtime statistics. We classify v-sensors inside user programs into two categories: fixed v-sensors and regular v-sensors. Fixed-workload v-sensors execute a constant sequence of instructions at each occurrence; owing to this fixed workload, the variance in their execution times directly reflects the variance in the performance of the system on which they run.


Code snippets whose instruction sequences vary between executions can still be eligible as variance sensors, for instance, if each execution is a multiple of the same basic workload. Such a scenario occurs in applications when a loop structure is executed with different numbers of iterations under different program states. We define all such eligible variant sensors as regular v-sensors and normalize their execution time series to detect system variance. A formal and generalized definition of regular v-sensors is given in Sect. 7.4, and their normalization is discussed in Sect. 7.6.2. Fixed sensors and regular sensors constitute internal v-sensors and rely on the user program. In contrast, external v-sensors use external benchmarks to detect performance variance; they are an auxiliary mechanism for programs with an insufficient number of internal v-sensors. We discuss the use case for external v-sensors and the trade-off between extra detection coverage and overhead in Sect. 7.5.2. With the help of these kinds of v-sensors, the performance variance of a program can be evaluated during execution.

We develop vSensor as a performance tool for large-scale HPC applications. vSensor is composed of a static module and a dynamic module. The static module automatically detects v-sensors with the help of compiler technologies. We use LLVM [15] to locate v-sensors; to process applications written in different languages, such as Fortran, C, and C++, the implementation analyzes the LLVM intermediate representation (LLVM-IR). However, most HPC applications are compiled not with LLVM but with vendor compilers to optimize their performance. Therefore, to allow developers to use their preferred compiler, our tool builds a mapping from the LLVM-IR to the source code and performs the instrumentation in the source code. The dynamic module gathers data on the fly, performs performance analysis at runtime, and provides a performance summary. The summary is updated during application execution, so users can check the real-time results before the program finishes.

We show the workflow of vSensor in Fig. 7.2. The steps shown in the figure are described next.

1 Compile. vSensor uses LLVM to map the original source code to the LLVM-IR.

2 Identify v-sensors. The LLVM-IR provides the instructions and basic blocks of the program. The v-sensor detection algorithm of Sects. 7.3 and 7.4 is executed at this level. Our tool first processes the call graph of the program to handle special situations, such as recursive calls, and then analyzes loops and function calls. Because it targets multi-process applications, it also analyzes the behavior of each v-sensor across the processes of a parallel program.

3 Map to source. Given the v-sensors detected on the LLVM-IR in step 2, our tool locates them in the source code.


Fig. 7.2 vSensor workflow. The static module (1) compiles the source code to LLVM-IR, (2) identifies v-sensors (call graph preprocessing, loop analysis, function call analysis, and process analysis), (3) maps them to locations in the source, and (4) instruments the source; the dynamic module (5) compiles the modified source, (6) runs the executable to collect performance data, (7) analyzes the data (data preprocessing, intra-process analysis, and inter-process analysis), and (8) visualizes the variance report

4 Instrument. In this step, our tool instruments the source code based on the information obtained in the previous step. To obtain a reasonable v-sensor selection during instrumentation, we follow several guiding rules that are detailed in Sect. 7.5.

5 Compile. To retain the vendor compilers' optimizations, we compile the instrumented programs with the default compilers and the original compilation arguments.

6 Run. Our tool executes the instrumented program in the real environment. During runtime, the instrumented functions generate the related performance monitoring data.

7 Analyze. The performance data are first preprocessed so that they can be used to detect variance. By comparing the gathered data with historical records, our algorithm detects variance within each process during runtime, as detailed in Sect. 7.6. Data gathered from the same v-sensor in different processes are used to identify performance variance among processes.

8 Visualize. Finally, our tool generates figures that show the distribution of the performance variance. The results are updated on the fly because new data are continuously generated and analyzed during execution.

The v-sensor identification in step 2 and the variance analysis in step 7 are the most complicated steps in our design and are explained next.


7.3 Fixed-Workload V-Sensors

In this section, we define the v-sensor and develop a dependence propagation algorithm to identify it. We present our basic ideas in the C language and illustrate our examples with MPI [16], which is popular in HPC. Although we use the LLVM-IR to describe our algorithm, the approach also works for other parallel applications based on message passing.

7.3.1 Fixed-Workload V-Sensor Definition

In the vSensor design, we use v-sensors as benchmarks embedded within HPC applications. We regard a code snippet in a loop as a v-sensor, so the v-sensor runs repeatedly; note that the number of snippets within a loop can be large. Accordingly, we define a fixed v-sensor of a loop as a code snippet with a fixed amount of work in it. We show this in Fig. 7.3, in which snippet-2 is a v-sensor with a fixed workload in the for loop. Although a code snippet can be very small according to this definition, we must select the granularity of the v-sensors carefully to keep the overhead of online detection and instrumentation appropriate. In our design, we consider only function calls and loops as candidates for v-sensors. The idea underlying vSensor is shown in Fig. 7.4, where we assign identifying numbers to function calls and loops and highlight the arguments of the functions, the variables in the loops, and the global variables. In this case, the v-sensor candidates for Loop-2 are Call-1 and Call-2, and the candidate for Loop-4 is Loop-5. However, the count++ statement is not a v-sensor candidate for Loop-3 because it is neither a function call nor a loop. We now establish the concept of the quantity of work. Code snippets are categorized according to their purpose into three types: IO, network, and computation snippets. In Fig. 7.4, Call-3 is a network snippet and Loop-4 is a computation snippet. The fixed-workload criterion differs for each type of snippet, and the way the quantity of work is calculated is always based on user-defined standards, even for snippets of the same kind.

Fig. 7.3 V-sensor
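Since the code of Fig. 7.3 is not reproduced in this text, the following C sketch only illustrates the idea it describes; the loop bounds and statements are assumptions. One snippet depends on the iteration of the enclosing loop, while the other, playing the role of snippet-2, performs the same amount of work every time and can therefore serve as a fixed-workload v-sensor.

#include <stdio.h>
#define N 1000

int main(void) {
    double a[N], sum = 0.0;
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    for (int step = 0; step < 100; step++) {
        /* snippet-1: the bound depends on step, so the workload changes
           between iterations; this is not a fixed v-sensor */
        for (int i = 0; i < step; i++)
            sum += a[i];

        /* snippet-2: the same N updates in every iteration of the outer
           loop, so it can serve as a fixed-workload v-sensor */
        for (int i = 0; i < N; i++)
            a[i] = a[i] * 1.01 + 0.5;
    }
    printf("%f %f\n", sum, a[0]);
    return 0;
}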


Fig. 7.4 Code examples for vSensor

Our default decision rules are as follows, and our tool allows users to add constraints:

• IO. An IO v-sensor requires that, for a given IO operation, the sizes of the input and output do not change.
• Network. A network v-sensor requires that, for a communication operation, the message size remains unchanged across iterations.
• Computation. A computation v-sensor requires that the computation snippet executes the same instruction sequence across iterations.

To guarantee versatility, vSensor permits users to specify additional factors that determine the v-sensors. For instance, the same instruction sequence can behave differently under different cache miss rates, so users can decide whether a constant cache miss rate should be an additional requirement for v-sensors. Further variables, including IO frequency, network distance, and communication destination, can be used either as dynamic rules based on runtime information or as static rules based on information available at compilation. The effect of additional rules on the choice of v-sensors is shown in Fig. 7.5. Static rules for v-sensor selection are applied during compilation; intuitively, stricter static rules generate fewer v-sensors.

Fig. 7.5 Static rules and dynamic rules


V-sensors can be further classified at runtime using the dynamic rules, which rely on online information. As shown in Fig. 7.5, our tool identifies v-sensors based on static rules and ignores all dynamic rules during compilation. At runtime, the records of a v-sensor are then split into groups according to performance-related data that are available only during execution, and the performance variance of each group is identified separately using the dynamic rules. In MPI communication, for example, the network destination is normally known during compilation, so it should be used as a static rule; conversely, the cache miss rate can be considered only as a dynamic rule. Using ranges of cache miss rates, such as 0–10% or 10–20%, we can classify the performance records gathered at runtime into different groups.

7.3.2 Intra-procedural Analysis

This section describes the intra-procedural analysis of code snippets that invoke no function. The candidate v-sensors considered in this analysis are pure computation loops, because programs usually call functions for IO and network operations. Under the rules of Sect. 7.3.1, the workload of such a snippet is fixed if its instruction sequence does not change across iterations; the quantity of work is therefore determined by the branch statements and loops. A candidate is not a v-sensor if it contains a control expression that is influenced by variables that change between executions. In our design, we use a compiler technique called use-define chain analysis to analyze the dependences between variables. Pure computation snippets are shown in Fig. 7.6. The outermost loop (Loop-n) has three subloops, Loop-1, Loop-2, and Loop-3. Both Loop-1 and Loop-2 use the index of Loop-n, which is n. Loop-1 and Loop-2 are not v-sensors of Loop-n because the variable n is updated between their executions. By contrast, the variable n does not influence Loop-3's control expression, and thus its quantity of work is unchanged during iterations of Loop-n.

Fig. 7.6 Example of intra-procedural analysis


Based on this analysis, Loop-3 is a v-sensor of Loop-n. Similarly, in Fig. 7.4, Loop-3 and Loop-5 are the v-sensors of Loop-1 and Loop-4, respectively, according to our intra-procedural analysis.
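Figure 7.6 itself is not reproduced in this text, so the following C sketch only mirrors the structure described above; the loop bounds and statements are assumptions. Loop-1 and Loop-2 depend on the index n of the enclosing Loop-n, while Loop-3 does not and thus qualifies as a v-sensor of Loop-n.

/* Loop-1 and Loop-2 use the index n of the outer Loop-n in their control
   expressions, so their workloads change between iterations of Loop-n;
   Loop-3 does not depend on n and is therefore a v-sensor of Loop-n. */
void kernel(double *a, double *b, int size) {
    for (int n = 0; n < size; n++) {           /* Loop-n */
        for (int i = 0; i < n; i++)            /* Loop-1: bound depends on n */
            a[i] += 1.0;
        for (int i = n; i < size; i++)         /* Loop-2: bound depends on n */
            b[i] -= 1.0;
        for (int i = 0; i < size; i++)         /* Loop-3: independent of n */
            a[i] = 0.5 * a[i] + 0.5 * b[i];
    }
}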

7.3.3 Inter-procedural Analysis

The principles of our inter-procedural analysis are illustrated in Fig. 7.7. Assume that three calls C1, C2, and C3 are made to the function F within loop L and that a snippet S exists in F. S qualifies as a v-sensor of L if the following conditions are met:

• For all enclosing loops within F, snippet S is a v-sensor. This implies that the quantity of work of S is invariant during one execution of F.
• For all invocations of F within L, namely C1, C2, and C3, any global variables or arguments of F that influence S's quantity of work must be invariant across the iterations of L. Hence, S performs the same quantity of work in every invocation of F.

We provide an example of the dependence propagation used in inter-procedural analysis in Fig. 7.8, which shows how to decide whether a function call is a v-sensor.

Fig. 7.7 Illustration for the principle of the analysis of inter-procedure

Fig. 7.8 Example of inter-procedural analysis


Figure 7.8 shows the dependency relations obtained from the inter-procedural analysis; the argument-workload relation of each function must be analyzed. For instance, the global variable GLBV and the argument x influence the workload of foo, so if GLBV and x remain invariant among calls, foo processes the same amount of work. Because k does not influence the control flow of foo, Call-1 is a v-sensor of Loop-2. By contrast, Call-1 does not act as a v-sensor of Loop-1 because n is updated during its iterations, and Call-2 is a v-sensor of neither Loop-1 nor Loop-2. In Fig. 7.8, we also take Loop-5, a v-sensor of Loop-4, as an example. Loop-5 acts as a v-sensor of both Loop-1 and Loop-2 as well, because it depends on neither global variables nor function arguments. Because the x argument changes across invocations of Call-2, Loop-4 does not act as a v-sensor of Loop-2.
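Figure 7.8 is not reproduced here, so the following C sketch only approximates the structure the text describes; the concrete loop bounds and arithmetic are assumptions. GLBV and the argument x determine foo's workload, k does not, Call-1 passes an x that is invariant inside Loop-2, and Call-2 passes an x that changes on every iteration.

int GLBV = 64;                          /* global variable that influences foo's workload */

double foo(int x, int k) {
    double s = k;                        /* k affects the result, not the control flow */
    for (int i = 0; i < x * GLBV; i++)   /* Loop-4: bound depends on x and GLBV */
        for (int j = 0; j < 128; j++)    /* Loop-5: fixed bound, independent of x and GLBV */
            s += i * j;
    return s;
}

void driver(int size) {
    double r = 0.0;
    for (int n = 0; n < size; n++)       /* Loop-1 */
        for (int m = 0; m < 8; m++) {    /* Loop-2 */
            r += foo(n, m);              /* Call-1: x == n is invariant inside Loop-2, so
                                            Call-1 is a v-sensor of Loop-2; n changes across
                                            Loop-1, so it is not a v-sensor of Loop-1 */
            r += foo(m, 0);              /* Call-2: x == m changes every iteration, so
                                            Call-2 is a v-sensor of neither loop; Loop-5,
                                            however, remains a v-sensor of Loop-4, Loop-2,
                                            and Loop-1 */
        }
    (void)r;
}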

7.3.4 Multiple-Process Analysis

For multi-process programs, such as MPI programs, the workload of each process also needs to be analyzed. Figure 7.9 shows a snippet whose workload remains the same across iterations but differs among processes. Each process obtains a rank ID from MPI_Comm_rank, and the rank influences the workload of Loop-1: the workloads of different processes differ, but for a given rank the workload remains the same across iterations. Specifically, only the odd-ranked processes execute count++. To detect inter-process variance at runtime, vSensor uses only snippets that have fixed workloads across all processes (Fig. 7.9). To analyze the dependence between process ranks and workloads, we apply a similar method. First, we identify the functions, such as MPI_Comm_rank and gethostname, that return process identifiers. Second, we examine the relation between the quantity of work and the process identifier (such as host name or rank ID). Third, a snippet is regarded as having a fixed workload among processes if it is not influenced by rank variables.

Fig. 7.9 Analyzing different processes
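The code of Fig. 7.9 is not reproduced, so the following MPI sketch only illustrates the situation described above; the loop bounds are assumptions. The workload of Loop-1 is fixed across iterations within one process but depends on the rank, so it is excluded from inter-process comparison even though it is a fixed v-sensor within each process.

#include <mpi.h>

void rank_dependent_work(int iters) {
    int rank, count = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int step = 0; step < iters; step++) {
        for (int i = 0; i < 1000; i++) {   /* Loop-1 */
            if (rank % 2 == 1)             /* the rank determines the workload:
                                              only odd ranks execute count++ */
                count++;
        }
    }
    (void)count;
}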


Fig. 7.10 Analyzing whole programs

7.3.5 Whole Program Analysis

In this section, we show how the different procedures of a program are analyzed together. To determine the order of analysis, we perform a topological sort of the program's call graph and conduct a bottom-up analysis over it: to propagate information from a callee to its callers, the callee must be analyzed before the callers. In this way, the inter-procedural analysis can detect v-sensors across procedures.

Some special situations must be handled. Recursive calls create cycles in the call graph, which prevents topological sorting. For applications with function pointers, the call targets are difficult to determine during compilation; we therefore remove invocations through function pointers from the call graph. Figure 7.10 illustrates this procedure. Many applications also call external functions whose code is not available, such as MPI functions, fopen, and printf, and without the code we cannot analyze their behavior during compilation. Hence, we take a conservative approach: an external function without a description is regarded as having a never-fixed workload, and snippets that invoke such functions are not regarded as v-sensors. This prevents false positives, although some real v-sensors may be missed. Moreover, users can describe the behavior of external functions themselves, such as which arguments may influence the workload, and our tool offers descriptions of the common functions in the C and MPI libraries.
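The actual pass operates on the LLVM-IR call graph; the following C sketch is only a simplified illustration of the bottom-up propagation described above, and all names (Func, analyze, and the fields) are invented for the example. Callees are processed before their callers, recursive edges are skipped so that the traversal stays acyclic, and any function that calls an undescribed external function is conservatively marked as never-fixed.

#include <stdbool.h>

#define MAX_CALLEES 16

struct Func {
    const char  *name;
    struct Func *callees[MAX_CALLEES];   /* call-graph edges (calls through function
                                            pointers already removed) */
    int          ncallees;
    bool         external_undescribed;   /* external function without a description */
    bool         never_fixed;            /* result of the analysis */
    int          state;                  /* 0 = unvisited, 1 = in progress, 2 = done */
};

/* Post-order (bottom-up) traversal: a callee is analyzed before its callers,
   recursive edges are ignored, and the never-fixed property is propagated
   from callees to callers. */
static void analyze(struct Func *f) {
    if (f->state != 0)
        return;
    f->state = 1;
    f->never_fixed = f->external_undescribed;
    for (int i = 0; i < f->ncallees; i++) {
        struct Func *c = f->callees[i];
        if (c->state == 1)
            continue;                    /* recursive call: skip this edge */
        analyze(c);
        if (c->never_fixed)
            f->never_fixed = true;       /* calling a never-fixed function taints the caller */
    }
    f->state = 2;
}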

7.4 Regular-Workload V-Sensors

In this section, we formally define regular-workload v-sensors, provide concrete examples, and discuss how to identify them via compiler techniques.

Fig. 7.11 Instruction sequence. (a) Pair 1. (b) Pair 2. (c) Pair 3

7.4.1 Instruction Sequences

Code snippets with a fixed runtime workload can be used as variance sensors because they execute the same instruction sequence each time. Two identical instruction sequences are comparable in terms of their execution times, and the difference between them directly reflects system-wide performance variance. To reflect variance, however, the instruction sequences need not be exactly identical. In Fig. 7.11, different colors denote different instructions. In each pair, instruction sequences 1 and 2 are multiples of the same basic sequence of instructions, and we define such a pair of instruction sequences as comparable. The execution times of comparable sequences can reflect system variance once each has been normalized by its multiplicity of the common instruction sequence.

7.4.2 Regular Workload Definition

The key insight of the previous section is that any two comparable instruction sequences can be used to detect variance. As a result, we can leverage code snippets with varying workloads as variance sensors as long as the instruction sequences of their executions form pairs of comparable sequences. We define such code snippets as regular-workload v-sensors. In Fig. 7.12, different colors denote different sets of comparable sequences. In the fixed-workload sensor, all execution sequences are identical and thus comparable to one another. The regular-workload sensor in the figure contains two separate sets of comparable instruction sequences; although the three instruction sequences in blue have different multiplicities (3, 2, 1), they share a common basic instruction sequence and can thus be used to detect performance variance.

We now provide a concrete example of a regular-workload v-sensor. Figure 7.13 shows two nested loops, L1 and L2. The inner loop L2 has a varying workload that depends on the iteration variable of the outer loop: as n increases, the inner loop increases its iteration range, enters different branch statements, and then executes a constant function call. In its first two occurrences, it executes the function call foo() once and twice, respectively. The instruction sequences of these two occurrences are comparable, as they are multiples of the same instruction sequence of the function foo().


Fig. 7.12 Fixed and regular workload comparison. (a) Fixed workload sensor. (b) Regular workload sensor

Fig. 7.13 Illustration of a code snippet for v-sensor with a regular workload

A similar analysis applies to its third and fourth occurrences, which execute the function bar() different numbers of times. Therefore, loop L2 is a regular-workload v-sensor, as it contains two different sets of comparable instruction sequences.
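Since the code of Fig. 7.13 is not reproduced, the following C sketch only reproduces the pattern described above; the bounds and branch condition are assumptions. Each occurrence of the inner loop L2 is a multiple of either foo()'s or bar()'s instruction sequence, so all occurrences fall into two sets of comparable sequences.

static volatile double sink;
static void foo(void) { sink += 1.0; }   /* stands in for one constant-work call */
static void bar(void) { sink += 2.0; }   /* stands in for another constant-work call */

void regular_sensor_example(void) {
    for (int n = 0; n < 4; n++) {         /* L1 */
        for (int i = 0; i <= n; i++) {    /* L2: iteration count depends on n */
            if (n < 2)
                foo();                    /* occurrences 1 and 2 call foo() 1 and 2 times */
            else
                bar();                    /* occurrences 3 and 4 call bar() 3 and 4 times */
        }
    }
}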

7.5 Program Instrumentation

7.5.1 V-Sensor Selection

Having discussed the different types of v-sensors in Sects. 7.3 and 7.4, we now show how to choose appropriate v-sensors for instrumentation, using the example in Fig. 7.14 for illustration.

• Scope. The scope of a snippet is the loop within which it is a valid v-sensor. Figure 7.14 contains two v-sensors, S1 and S2, and two loops, L1 and L2.


Fig. 7.14 V-sensor scope

The scope of S2 is L1, which is larger than that of S1, namely L2. S1's workload varies among the iterations of L1 but remains invariant among those of L2, so across iterations of L1 the records of S1 cannot be compared with those of its earlier iterations, because the workload may have changed. A v-sensor with a wider scope therefore gathers longer-lived data that can be used to identify performance variance over a longer period. Snippets that are v-sensors of the outermost loop have a global, whole-program scope and are called global v-sensors. In our design, we select global v-sensors for instrumentation.

• Granularity. A compromise between detection capability and the resulting overhead must be found. Small snippets can detect performance variance in a fine-grained manner but are costly in terms of performance; large snippets, by contrast, can miss high-frequency variance. In our design, users can set the value of max-depth: depth-0 is the outermost loop, and our tool instruments v-sensors at depths less than max-depth. Nevertheless, the execution time of a v-sensor is known only at runtime, and the compile-time analysis is merely a prediction; hence, runtime optimizations, detailed in Sect. 7.6, are used to minimize the cost of fine-grained v-sensors.

• Nested v-sensors. When v-sensors are nested, at most one of them is instrumented, because the auxiliary functions inserted by vSensor have varying workloads. If an inner v-sensor is instrumented, the outer v-sensors then contain an instrumentation function call and are disqualified from being v-sensors. In our design, we instrument the outermost v-sensors.

To assess the performance of a v-sensor over time, our tool instruments, at compilation, special functions such as Tick and Tock (shown in Fig. 7.3) before and after its execution.


At runtime, these instrumented functions trigger the algorithm that detects performance variance.
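The following C sketch shows what the instrumented code could look like. The text only names the functions Tick and Tock, so the identifier argument and the exact signatures are assumptions rather than vSensor's real interface.

/* Hypothetical instrumentation interface. */
void Tick(int vsensor_id);   /* records a start timestamp for this v-sensor */
void Tock(int vsensor_id);   /* records the end timestamp and feeds the
                                runtime detection algorithm */

void instrumented(double *a, int size, int steps) {
    for (int step = 0; step < steps; step++) {
        Tick(0);                              /* inserted before the v-sensor */
        for (int i = 0; i < size; i++)        /* the selected global v-sensor */
            a[i] = a[i] * 1.01 + 0.5;
        Tock(0);                              /* inserted after the v-sensor */
    }
}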

7.5.2 Inserting External V-Sensors

Leveraging code snippets in user programs as sensors of performance variance introduces only a slight overhead to the original application, but such v-sensors are uncontrollable in terms of their distribution and duration at runtime. Ideally, vSensor should sense the system throughout program execution as evenly as possible, so that problems of system variance in otherwise uncovered periods are not missed. To achieve this goal, vSensor provides (1) a code instrumentation mechanism at compilation that inserts calls to predefined benchmarks into the original program and (2) a trigger mechanism that uses a user-defined policy to trigger these sensors selectively. For clarity, we call such inserted sensors external v-sensors and those identified in the original program internal v-sensors.

Figure 7.15 shows an example of an external v-sensor. The inserted calls are placed at the beginning of the loop body, and a predefined fixed-workload benchmark is executed only after the trigger condition has been checked; a sketch of such a guard is given after the parameter list below. When triggered, the inserted sensors are treated exactly like fixed-workload v-sensors when computing the overall system variance. In vSensor, we use FWQ/FTQ [7] as the default workload for external v-sensors, and we expose interfaces so that users can define their own workload.

The triggering mechanism of external v-sensors gives users control over the sensing coverage, distribution, and overhead of the sensors. Users can provide their own policy to vSensor that defines a trigger condition for each external v-sensor. In the current implementation, users can define the following control parameters:

• Minimum interval: The minimum interval between triggered v-sensors. Both internal and external v-sensors are taken into account.
• Duration: The duration of each external v-sensor.

Fig. 7.15 Illustration for inserted fixed-work


Fig. 7.16 External v-sensors: a timeline of internal and external sensors showing the minimum interval, the duration of a triggered external sensor, and untriggered sensors

• Maximum overhead: The maximum performance overhead introduced by external v-sensors. The performance overhead of external v-sensors is monitored at runtime. If it is excessive, the subsequent external v-sensors are disabled until the overall overhead falls below a given threshold.
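As a minimal sketch of the guard mentioned above, the code below checks the three parameters before running the predefined benchmark. The variable names, threshold values, and the fwq_benchmark stand-in are all illustrative assumptions, not vSensor's real API; in the real tool, internal v-sensors would also update the time of the last triggered sensor.

#include <mpi.h>

static double program_start   = -1.0;
static double last_sensor_end = 0.0;     /* end time of the last triggered sensor */
static double external_time   = 0.0;     /* accumulated external-sensor time */

static const double MIN_INTERVAL = 0.05; /* s, user-defined minimum interval */
static const double DURATION     = 0.01; /* s, user-defined sensor duration */
static const double MAX_OVERHEAD = 0.02; /* user-defined fraction of elapsed time */

static void fwq_benchmark(double duration) {
    /* simplified stand-in for the predefined fixed-work quantum */
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < duration)
        ;
}

void external_vsensor(void) {
    double now = MPI_Wtime();
    if (program_start < 0.0)
        program_start = now;
    double elapsed = now - program_start;

    if (now - last_sensor_end < MIN_INTERVAL)
        return;                                   /* too close to the previous sensor */
    if (elapsed > 0.0 && external_time > MAX_OVERHEAD * elapsed)
        return;                                   /* overhead budget exhausted */

    fwq_benchmark(DURATION);                      /* treated like a fixed v-sensor */
    external_time   += MPI_Wtime() - now;
    last_sensor_end  = MPI_Wtime();
}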

7.5.3 Analyzing External V-Sensors

We show the triggering mechanism in Fig. 7.16. The first two external sensors are not triggered because they are too close to the previous v-sensor. The third one satisfies the minimum-interval requirement and is allowed to execute for the predefined duration. An external v-sensor may also remain untriggered if the maximum tolerable overhead has been reached.

External v-sensors can control the sensing distribution much better than internal v-sensors and help increase sensing coverage, but they introduce additional overhead to the program. To quantify this overhead, we build a performance model and discuss the trade-off between the increase in coverage and the additional overhead. Suppose that the sensing coverage without external v-sensors is C0, with performance overhead H0. The overhead H0 is relatively low because internal v-sensors leverage code snippets of the original program as variance benchmarks and add only work such as time-stamp retrieval and variance analysis. External v-sensors, however, insert and trigger external benchmarks that increase coverage while imposing a larger overhead. Assume that the additional coverage is C and that the overall overhead of both internal and external v-sensors is H. The relationship between C and H is:

H ≥ H0 + C / (1 − C0 − C)

The formula indicates a higher overhead as the original sensing coverage C0 increases, which is also shown in Fig. 7.17: it plots how the ideal overhead of vSensor grows to achieve a designated additional coverage under different initial coverages. With a 20% overhead, 16.7% additional coverage is obtained when there is no internal v-sensor; if the initial coverage is 50%, only 8.3% additional coverage can be achieved with the same overhead. In conclusion, external v-sensors are more efficient when the sensing coverage of internal v-sensors is low.


Fig. 7.17 Relation between ideal overhead and extra coverage provided by external v-sensors

Thus, external v-sensors are expected to play a key role in detecting performance variance when fixed-workload code snippets are rare in the user's program.

7.6 Runtime Performance Variance Detection

We show in this section how to use the gathered performance data to detect variance in performance.

7.6.1 Smoothing Data

Due to operating system interruptions, HPC systems usually contain noise that is short but frequent. Because this noise is usually generated in the kernel [7], it typically cannot be avoided and has a regular, periodic pattern. We therefore regard system noise as a system characteristic rather than an instance of performance variance. vSensor focuses on periodic and long-lasting performance variance, but v-sensors with short durations can be influenced by noise and can yield false alerts. To prevent such false positives, we aggregate the data and use the average value over a limited time slice, 1000 us by default. We illustrate noise under time resolutions of 10 and 1000 us in Fig. 7.18, where the v-sensors have a periodic workload that takes about 10 us to execute.

Fig. 7.18 Background noise filtering


For each execution, the wall time is recorded. The performance data appear unstable at a high resolution (10 us), but when averaged over the longer period of 1000 us, the plot becomes smooth. With this method, we can focus on significant and long-lasting performance variance because a large amount of noise is filtered out. Moreover, the overhead caused by fine-grained v-sensors is reduced because our analysis algorithm is executed only once per time slice.
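A minimal sketch of this aggregation step is shown below; only the 1000-us window is taken from the text, and the structure and function names are assumptions.

#define SLICE_US 1000.0   /* default aggregation window */

struct Slice {
    double start_us;   /* start time of the current slice */
    double sum_us;     /* accumulated wall time of v-sensor executions */
    long   count;      /* number of executions in the slice */
};

/* Adds one raw record; returns 1 and writes the slice average when the
   current slice is complete, 0 otherwise. */
int add_record(struct Slice *s, double timestamp_us, double wall_us, double *avg_us) {
    if (timestamp_us - s->start_us >= SLICE_US) {
        int ready = s->count > 0;
        if (ready)
            *avg_us = s->sum_us / s->count;   /* smoothed value for the finished slice */
        s->start_us = timestamp_us;           /* open a new slice */
        s->sum_us   = wall_us;
        s->count    = 1;
        return ready;
    }
    s->sum_us += wall_us;
    s->count  += 1;
    return 0;
}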

7.6.2 Normalizing Performance

As described in Sect. 7.3.1, v-sensors are categorized into those for IO, network, and computation, which allows us to attribute observed performance variance to a cause. In addition, multiple v-sensors of the same kind reflect the same aspect of the system, so their performance data can be combined to boost the precision of detection. For instance, if we have ten network v-sensors that are each invoked once every 1000 us, each v-sensor alone supports an analysis granularity of 1000 us, but a joint analysis of all of them together supports a finer granularity, such as 100 us, which helps us better grasp the variance in network performance. Because different v-sensors do not have the same workloads, we can compare only normalized performance rather than raw execution times: the fastest record is normalized to 1.0, and every other record is compared with it. For example, the normalized performance of a record that takes twice as long as the fastest one is 0.5. V-sensors of the same kind indicate a certain part of the HPC system, and thus their normalized performance indicates the performance of the part they represent: if the performance of a system module degrades, the related normalized performance also degrades. For regular-workload v-sensors, an additional step is needed for each individual v-sensor. Assume that a regular v-sensor is executed three times with wall times t0, t1, and t2, executing the same basic instruction sequence each time with different multiplicities m0, m1, and m2. Before the execution times of the occurrences are compared, each record is normalized by its multiplicity mi, which yields the normalized record ti/mi. The normalization between different v-sensors is then performed as before, with each v-sensor comparing its records against its fastest normalized record.
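A short sketch of this normalization follows; the function name is illustrative, but the arithmetic mirrors the definition above.

/* The fastest per-unit time maps to 1.0; a record twice as slow maps to 0.5.
   For fixed v-sensors the multiplicity is simply 1. */
double normalized_performance(double wall_time, double multiplicity,
                              double fastest_per_unit) {
    double per_unit = wall_time / multiplicity;   /* t_i / m_i */
    return fastest_per_unit / per_unit;
}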

7.6.3 History Comparison

To keep the analysis overhead low, vSensor tracks the collected performance data, such as wall times, and compares them with the historical performance of the system. Because the quantity of work of a v-sensor is invariant, we keep only a standard time scalar for each v-sensor rather than a long list of records. As described in Sect. 7.6.1, each time slice yields one record that indicates the average execution time over that slice.

Fig. 7.19 Online detection. Ten records of a v-sensor are shown with their wall times and cache miss levels (high/low). In Case 1, the cache miss rate is expected to be constant and records 2, 4, and 6 are flagged as variance; in Case 2, the cache miss rate is used as a dynamic rule, the records are grouped by cache miss level, record 4 is flagged as variance in the low-cache-miss group, and no variance is detected in the high-cache-miss group

Our tool converts the standard execution time of each v-sensor into a normalized value relative to the fastest record. In addition to the execution time, we use the performance monitoring unit (PMU [17]) to collect other metrics, including memory accesses and cache misses. Note that vSensor can also disable the short-term v-sensor analysis to reduce overhead, that is, it can turn off the Tick and Tock functions. We illustrate the online detection process in Fig. 7.19. A v-sensor is executed ten times, and its cache miss rate and wall time are collected; for simplicity, the cache miss rates are labeled only as low or high. If the cache miss rate is expected to be constant, records 2, 4, and 6 are reported as variance because of their relatively long durations. By contrast, if the cache miss rate is treated as a dynamic rule, the records are first grouped into high and low cache miss rates. Based on this grouping, no variance is detected in the high-cache-miss group, whereas record 4 is reported as variance in the low-cache-miss group.
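The per-slice check could look like the following sketch. The threshold, the number of groups, and the data structure are assumptions; the text only specifies that each record is compared with the fastest one and that dynamic rules such as the cache miss rate split records into groups.

#define NGROUPS 4   /* e.g., ranges of the cache miss rate used as a dynamic rule */

struct SensorState {
    double best_per_unit[NGROUPS];   /* fastest normalized record seen per group */
};

/* Returns 1 if this slice's record indicates a performance variance. */
int check_slice(struct SensorState *st, int group, double per_unit_time,
                double threshold /* assumed value, e.g., 0.8 */) {
    double *best = &st->best_per_unit[group];
    if (*best == 0.0 || per_unit_time < *best)
        *best = per_unit_time;                    /* keep only one scalar per group */
    double normalized = *best / per_unit_time;    /* 1.0 means as fast as the best */
    return normalized < threshold;
}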

7.6.4 Multiple-Process Analysis

For inter-process analysis, vSensor runs an analysis server as a dedicated process. The server gathers performance-related information from every process and compares the performance of the same v-sensor across processes to detect variance. The processes deliver these data by updating shared files or by communicating with the analysis server. Experiments indicate that the data transmitted to the server are limited and cause no significant network or IO degradation (detailed in Sect. 7.7). To further reduce the cost of analysis, each process keeps its data in a local buffer and transmits them to the analysis server only after the buffer reaches a size threshold; this produces a modest number of batched, network-friendly messages.


Fig. 7.20 Performance matrix example

7.6.5 Performance Variance Report

We develop a visualizer to help users interpret the detected variance. For each category, i.e., IO, network, and computation, our tool creates a performance matrix. Figure 7.20 shows an example of a performance matrix that displays the performance of 128 processes over 100 s. The x-axis represents time with a resolution of 200 ms, and the y-axis represents the MPI rank, i.e., the process ID. Colors show the degree of performance variation: high performance is shown in dark blue and low performance in whiter shades, so performance variance appears as the whiter parts of the figure. Figure 7.20 shows that this application generally exhibited high performance; in Sect. 7.7, we analyze cases with significant performance differences. The white parts show when and where the performance variance occurred, and the category of the matrix indicates the aspect of the system that incurred it. For instance, if the tool reports that certain processes are slower than others, the corresponding nodes are possibly unstable. Our tool identifies performance variance with low overhead and minimal manual interference; users then decide whether to re-execute the job or reboot the system when vSensor detects a performance variance and identifies bottlenecks.

7.7 Experiment

7.7.1 Experimental Setup

Methodology vSensor is implemented with LLVM version 3.5.0. We use ROSE, Dragonegg-3.5.0, and Clang-3.5.0 to instrument the code, and Python to implement the visualizer. The current version provides most of the functionality described in the previous sections, which is evaluated next.

Benchmark We use eight programs to measure the performance of vSensor, including five kernels, SP, FT, CG, LU, and BT, from the NAS Parallel Benchmarks (NPB) [18], and three real applications, RAXML [19], AMG [20], and LULESH [21].


Platform The evaluation was performed on the Tianhe-2A supercomputer, which has an efficient, high-speed interconnection network. Each computing node contains two Xeon E5-2692(v2) CPUs and 64 GB of memory. We used 4,096 processes in total.

In our evaluation, we first use vSensor to detect fixed-workload v-sensors, verify the results, and analyze the runtime overhead of the detection algorithms. Second, we evaluate the benefits of regular v-sensors and external v-sensors. Third, we analyze the capability of vSensor to identify performance variance. Finally, we focus on identifying variance in real cases on the Tianhe-2A HPC system.

7.7.2 Overall Analysis of Fixed V-Sensors

We show the metrics used for compile-time analysis and runtime identification in Table 7.1. The left columns show the compile-time information, and the right columns show the online results when using 4,096 MPI processes.

Table 7.1 Validation of results of 4,096 MPI processes with fixed v-sensors

Program  Code (KLoc)  Snippets  V-sensors  Instrumented (number and type)  Workload max error  Performance overhead  Sense-time coverage  Frequency (kHz)
BT       11.3         679       169        46 Comp + 1 Net                 2.7%                0.24%                 4.3%                 182.3
CG       2.0          214       50         6 Comp + 5 Net                  0.8%                2.02%                 6.8%                 0.493
FT       2.5          340       75         12 Comp                         3.3%                5.05%                 7.7%                 2.318
LU       7.7          1671      185        63 Comp                         2.2%                5.29%                 70.9%                283.5
SP       6.3          697       149        37 Comp + 1 Net                 0.8%                0.90%                 6.3%                 1070
AMG      75.0         4695      413        65 Comp + 6 Net                 0.0%                0.08%                 0.43%                0.010
LULESH   5.3          1671      103        22 Comp + 6 Net                 8.6%                4.15%                 31.9%                3155
RAXML    36.2         4744      650        159 Comp                        2.3%                0.16%                 14.3%                4530


Our results include the number of lines of code, the numbers of candidate snippets and detected v-sensors, and the number and types of instrumented v-sensors. Most instrumented v-sensors are computation sensors. As mentioned in Sect. 7.3.1, calls and loops are the v-sensor candidates. For instance, AMG has 75K lines of code and 4,695 candidates, among which our tool identifies 413 v-sensors during compilation. However, it chooses only 71 v-sensors for the final instrumentation, which indicates that a large number of snippets with varying workloads are filtered out, significantly reducing the instrumentation overhead.

We then verify the results by checking whether the workloads of the detected v-sensors are indeed invariant. For example, we check the message sizes of the network v-sensors, and the results show that the arguments are fixed. The most difficult validation concerns the computation v-sensors, for which we use the PMU to check, via instruction counts, whether the workload changes across iterations. For each v-sensor, we record a vector of instruction counts, v0, v1, ..., vn, one per execution; in theory, all these values should be identical. In practice, the sequence of vi shows some variance because PMU measurements are not 100% accurate [22]. To account for this inaccuracy, we denote by Ps = Mean(vi)/Min(vi) the variance of the PMU data of one v-sensor, by Pa = MAX(Ps) the maximum over all v-sensors of a process, and by Pm = MAX(Pa) the maximum over all processes. The Workload max error column of Table 7.1 reports Pm − 1. Our tool incurs an error of less than 9%, which shows the validity of its compile-time identification.

We measure the performance cost of our tool by comparing the execution time of the instrumented program with that of the original program. Owing to the large performance variance on the Tianhe-2A HPC system, we execute each program several times and use the shortest time. Table 7.1 shows that the increase in execution time is less than 6%, which demonstrates the runtime efficiency of vSensor.

The two right-most columns of Table 7.1 present the sense-time coverage and frequency of each program. The sense-time coverage is the ratio of the total duration of all v-sensors to the total execution time, and the average frequency is the ratio of the number of v-sensor executions to the total time. Table 7.1 shows that these two metrics vary across programs. For instance, LU has a v-sensor frequency of 283.5 kHz and a coverage of 70.9%, indicating that a v-sensor occurs every few microseconds on average and that most of LU's execution is monitored by vSensor. However, some programs, such as BT and AMG, have low coverage. The root cause is a lack of detectable fixed-workload v-sensors in these programs, mainly the result of dynamic workloads in their computational patterns. As shown in Sects. 7.7.3 and 7.7.4, regular v-sensors and external v-sensors can further improve the coverage.


7.7.3 Analysis of Regular V-Sensors

We evaluate vSensor on regular v-sensors using the same configuration as above. As shown in Fig. 7.21, the regular v-sensors increase the average coverage by 14.3%. FT and SP show the most significant improvements, 38.5% and 35.0% respectively, because they contain many loops with variable numbers of iterations that are identified as regular v-sensors. vSensor counts the number of iterations of a regular v-sensor and divides the loop's execution time by that iteration count, so regular v-sensors let vSensor cover more of the execution time. With both fixed and regular v-sensors, the only program with less than 10% coverage is AMG; even for programs with such dynamic computation patterns, external v-sensors can improve the detection, as described in Sect. 7.7.4.

Figure 7.22 shows the overhead of fixed- and regular-workload v-sensors. After enabling regular v-sensors, the applications other than FT show a small performance penalty of 2.3% on average. FT has a large overhead due to a frequently invoked regular v-sensor: most of its workload consists of nested inner loops with nonconstant loop bounds, which cannot be handled by fixed v-sensors. Although each instrumented function is cheap, the body of the innermost loop performs a comparable amount of work, so the instrumentation becomes a major source of overhead. It is worth noting that this extra overhead comes with a nontrivial increase in sense-time coverage. Sampling is an effective way to keep the overhead small: testing FT with a 25% sampling rate yields an 11.5% performance overhead and 11.6% coverage.

Fig. 7.21 Coverage with or without regular-workload v-sensors

Fig. 7.22 Performance overhead with or without regular-workload v-sensors


Sampling in vSensor is convenient because the sampling rate can be set at runtime, and more complex strategies, such as adaptive sampling, can also be applied.

7.7.4 Analysis of External V-Sensors

As Table 7.1 and Fig. 7.21 indicate, it is difficult to find internal v-sensors in AMG, so we use external v-sensors to improve the coverage for this type of application. Section 7.5.2 analyzes the relation between the extra coverage and the overhead in this case and shows that low-coverage programs gain extra coverage at a smaller performance overhead than high-coverage programs. We evaluate 4,096-process AMG with 200-ms external v-sensors and target coverages of 5%, 10%, 20%, and 30%. Figure 7.23 shows the results. The ideal coverage line coincides with the real coverage, which verifies that vSensor achieves the target coverage under all four configurations. The real overhead is consistent with the theoretical value under small coverage targets but higher than the theoretical value under a high coverage target. This extra overhead results from dependences among processes; furthermore, the extended execution time leads to additional external v-sensor invocations, which introduces more overhead but is still necessary because vSensor tries to maintain the target coverage throughout the execution.

Fig. 7.23 AMG results with external v-sensors

7.7.5 V-Sensor Distribution

We analyze the distribution of v-sensors because the performance of vSensor is influenced by their temporal characteristics. The main concepts pertaining to the distribution of v-sensors are shown in Fig. 7.16: the length of a v-sensor is called its duration, and the distance between consecutive v-sensors is called the interval.


Fig. 7.24 The durations of v-sensors

Fig. 7.25 The intervals between v-sensors

We group the v-sensors into four categories based on execution time: (1) longer than 1 s, (2) 1 ms to 1 s, (3) 100 us to 10 ms, and (4) less than 100 us. The durations and execution intervals for each case are shown in Figs. 7.24 and 7.25. We omit the case longer than 1 s from the plots because only one network v-sensor of that length occurs in all evaluations. Most sense times are shorter than 100 us, which shows both that most v-sensors are fine-grained snippets and that the aggregation of Sect. 7.6.1 is necessary. Because most intervals are shorter than 1 s, vSensor is guaranteed to detect performance variance that lasts longer than 1 s. Some programs nevertheless include intervals longer than 1 s. In RAXML, for example, many I/O operations on a distributed file system last a long time, but vSensor can still observe such performance variance because a large number of v-sensors are spread across the code. By contrast, AMG has long stretches without v-sensors, and its v-sensors cover only a small fraction of the entire execution; Table 7.1 shows its correspondingly low frequency and coverage, which are caused by its adaptive mesh refinement algorithm. Because the workload changes on the fly, only a few fixed-workload snippets exist in this situation. Overall, vSensor works well with the large number of HPC applications that have static workloads, and external v-sensors can deal with the remaining programs.


7.7.6 Injecting Noise

We manually inject noise into the 128-process cg.D.128 benchmark of NPB and analyze it with both the mpiP [23] profiler and vSensor. Because of the difficulty of conducting such experiments on the Tianhe-2A HPC system, this experiment is run locally on a dual Xeon E5-2670(v3) cluster with 100 Gbps 4×EDR InfiniBand.

Compared with Profiling Figure 7.26 shows the mpiP results for a regular execution with no noise injection. The computation and communication times are around 75 s and 50 s, respectively. The profile cannot explain the small time differences between processes that are caused by variance among compute nodes or workloads. We then inject noise and execute the code again: during program execution, some nodes start running a noise program (noiser), with which the application competes for resources such as memory and CPU. The noise is injected twice, for 10 s each time, and the performance degrades accordingly. Figure 7.27 shows the mpiP profile after injecting noise. In contrast to Fig. 7.26, the MPI time increases from around 50 s to around 65 s, whereas the computation time remains the same. From Fig. 7.27 alone, we cannot determine whether variance has affected the program, and because of the larger amount of MPI communication it is not easy to analyze network issues. By analyzing the code together with the mpiP results, we can explain the CG behavior: the CPUs have idle time between communications, so the injected workload mostly occupies the gaps between MPI calls. As a result, the communication of some processes is delayed significantly while the computation is barely stretched. The processes with degraded performance incur long waiting times in communication, which mpiP attributes to MPI time.

Fig. 7.26 mpiP results for a regular execution

Fig. 7.27 Profiling a noise-injected execution with mpiP


Fig. 7.28 Results of a noise-injected execution with vSensor

Profiling of this kind is popular among nonexpert MPI users, but mpiP is unable to locate the injected noise; vSensor makes up for this deficiency. Figure 7.20 illustrates the performance matrix of a regular execution, whereas Fig. 7.28 shows two white blocks that represent the injected noise: processes 24–47 and 72–96 are injected with noise at around 34 s and 66 s, respectively. vSensor therefore provides an extra benefit over typical profilers in identifying performance variance.

Compared with Tracing For further analysis, an external MPI tracer, ITAC [24], is used to gather traces. ITAC produces 501.5 MB of data, much more than the 8.8 MB produced by vSensor. For realistic applications, trace-based analysis tools can generate very large amounts of data, with corresponding time and memory overheads, whereas vSensor avoids this difficulty: in this execution, the data generated for 128 processes over 140 s amount to only 8.8 MB, i.e., about 0.5 KB/s per process. Accordingly, even for 16,384 processes, the total generation rate would be only 8 MB/s, occupying a small portion of the network bandwidth.

7.7.7 Case Studies

In this section, we show how vSensor identifies performance variance on the Tianhe-2A HPC system. We use CG as an example and show its computation performance matrix for 256 processes in Fig. 7.29. A white line appears near process 100, indicating that this process slows the execution. vSensor accurately detects the problematic processes; if they reside on the same node, the host node should be replaced. To confirm this, we use micro-benchmarks to check the system's performance, including memory and CPU. The results show that CPU performance is normal, but the memory-related performance of one node is only 55% of that of the other nodes. This information is reported to the system administrator, and we replace the node and re-execute the job. With the problematic node replaced, the execution time of CG improves from the original 80.04 s to 66.05 s, a 21% improvement.


Fig. 7.29 Results generated by vSensor for CG

Fig. 7.30 Results generated by vSensor for FT

Although a pre-test before the execution of an application can catch this kind of performance variance, the same nodes can still exhibit variance during execution, which a traditional pre-test cannot handle, as illustrated in Fig. 7.1. For instance, we execute FT with 1,024 processes on a fixed cluster and observe performance variance during program execution. Figure 7.1 shows that a normal FT execution takes 23.31 s, whereas an abnormal one can take 78.66 s, a slowdown of 3.37×. Our tool identifies this performance variance, as shown in the network performance matrix in Fig. 7.30, which indicates that the degradation occurs from 16 to 67 s. Checking the FT code further, we find that it exchanges data between processes through MPI_Alltoall, which requires the cooperation of all processes and involves heavy network communication. We observe that the Tianhe-2A HPC system occasionally suffers such performance problems when running FT. These network problems may be due to network congestion, which cannot be prevented; however, vSensor can report such issues and leave the decision of whether to continue to the users.


7.8 Related Work

Performance Variance Variance in performance is an important problem in current HPC systems [3]. Skinner et al. [6] studied performance variability across HPC systems and pointed out the performance that could be gained if such variance were removed. Hoefler et al. [7] characterized the influence of operating system noise on HPC applications and concluded that the pattern of the noise has a larger influence than its intensity. Tsafrir et al. [25] showed that periodic clock interrupts cause major variance in general-purpose operating systems; OS co-scheduling can be a solution to this problem [26]. Agarwal et al. [27] explored performance distributions in scalable parallel programs and pointed out that long-tail noise can cause significant performance degradation. Ferreira et al. [2] injected different kinds of noise into programs on HPC systems and studied their influence. Beckman et al. [28] demonstrated the synchronization overhead caused by system interrupts, and Phillips et al. [29] showed that daemon processes can prevent nodes from computing efficiently in HPC systems.

Variance Detection Previous studies have applied models to identify performance variance. Petrini et al. [1] used analytic models to detect the performance-related problems of SAGE on HPC systems, which can be used to quantify the influence of system noise on performance. Lo et al. [30] built a general toolkit to analyze architectural variance using the roofline model [31]. Many researchers, such as Calotoiu [32], Yeom [33], and Lee [34], have examined accelerating model construction; however, constructing a suitable model for large-scale HPC applications still poses many unsolved problems. Jones et al. [9] analyzed system traces to find reasons for performance variance, but traces can become very large as the problem size grows [11], and trace collection can significantly degrade application performance [13, 35]; accurate noise identification in this way is thus impractical on HPC systems. Statistical tracers [36] and profilers [23] are lightweight alternatives but are less accurate. Our tool, vSensor, focuses on variance in system performance; other tools, such as PRODOMETER [37], AutomaDeD [38], and STAT [39], focus on bug detection and program faults and thus differ from our tool. PerfScope [40] analyzed system call invocations to identify potentially buggy functions, and Sahoo et al. [41] monitored program invariants to find software bugs. Overall, previous works have been unable to identify performance variance on the fly in a lightweight manner, whereas vSensor exploits program features and integrates both static and dynamic analysis to do so efficiently.


7.9 Conclusion

We propose a lightweight tool, called vSensor, that detects variance in system performance on the fly. We show that features of online program execution can be obtained directly from the code without relying on an external detector; in particular, a large number of HPC programs contain repeatedly executed snippets with a fixed quantity of work. Our tool automatically identifies such static-workload snippets via compilation techniques and uses them as v-sensors to evaluate performance. We validate vSensor on the Tianhe-2A HPC system, and the results show that it incurs a performance cost of less than 6% with 4,096 processes when using fixed v-sensors. Specifically, it identified a node with slow memory, improving performance by 21%, and it identified a serious network bottleneck that slowed performance by 3.37×.

References 1. Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. SC’03. Phoenix, AZ, USA: ACM. 2. Ferreira, K. B., et al. (2013). The impact of system design parameters on application noise sensitivity. In 2010 IEEE International Conference on Cluster Computing (Vol. 16, No 1, pp. 117–129). 3. Mondragon, O. H., et al. (2016). Understanding performance interference in next-generation HPC systems. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 384–395). IEEE. 4. Wright, N. J., et al. (2009). Measuring and understanding variation in benchmark performance. In DoD High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC), 2009 (pp. 438–443). IEEE. 5. TOP500 website (2020). http://top500.org/. 6. Skinner, D., & Kramer, W. (2005). Understanding the causes of performance variability in HPC workloads. In Proceedings of the IEEE International Workload Characterization Symposium, 2005 (pp. 137–149) IEEE. 7. Hoefler, T., Schneider, T., & Lumsdaine, A. (2010). Characterizing the influence of system noise on large-scale applications by simulation. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. SC’10 (pp. 1–11). 8. Gong, Y., He, B., & Li, D. (2014). Finding constant from change: Revisiting network performance aware optimizations on iaas clouds. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 982–993). IEEE. 9. Jones, T. R., Brenner, L. B., & Fier, J. M. (2003). Impacts of operating systems on the scalability of parallel applications. In Lawrence Livermore National Laboratory, Technical Report UCRL-MI-202629. 10. Tallent, N. R., Adhianto, L., & Mellor-Crummey, J. M. (2010). Scalable identification of load imbalance in parallel executions using call path profiles. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1–11). IEEE Computer Society.


11. Wylie, B. J. N., Geimer, M., & Wolf, F. (2008). Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Scientific Programming, 16(2–3), 167–181. 12. Geimer, M., et al. (2010). The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience, 22(6), 702–719. 13. Zhai, J., et al. (2014). Cypress: Combining static and dynamic analysis for top-down communication trace compression. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 143–153). IEEE. 14. Zhai, J., et al. (2022). Leveraging code snippets to detect variations in the performance of HPC systems. IEEE Transactions on Parallel and Distributed Systems, 33(12), 3558–3574. 15. Lattner, C., & Adve, V. (2004). LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization (p. 75). IEEE Computer Society. 16. MPI Documents. http://mpi-forum.org/docs/ 17. Mucci, P., et al. (2004). Automating the large-scale collection and analysis of performance data on Linux clusters1. In Proceedings of the 5th LCI International Conference on Linux Clusters: The HPC Revolution. 18. Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center. 19. Pfeiffer, W., & Stamatakis, A. (2010). Hybrid MPI/Pthreads parallelization of the RAxML phylogenetics code. In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)(pp. 1–8). IEEE. 20. Yang, U. M., et al. (2002). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics, 41(1), 155–177. 21. Karlin, I., Keasler, J., & Neely, J. R. (2013). Lulesh 2.0 updates and changes. Technical Report Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States). 22. Weaver, V. M., Terpstra, D., & Moore, S. (2013). Non-determinism and overcount on modern hardware performance counter implementations. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 215–224). IEEE. 23. Vetter, J., & Chambreau, C. (2005). mpip: Lightweight, scalable MPI profiling. 24. Intel Trace Analyzer and Collector. https://software.intel.com/en-us/trace-analyzer 25. Tsafrir, D., et al. (2005). System noise, OS clock ticks, and fine-grained parallel applications. In Proceedings of the 19th Annual International Conference on Supercomputing. ICS’05 (pp. 303–312). New York, NY, USA: ACM. ISBN: 1-59593-167-8. 26. Jones, T. (2012). Linux kernel co-scheduling and bulk synchronous parallelism. International Journal of High Performance Computing Applications, 26, 1094342011433523. 27. Agarwal, S., Garg, R., & Vishnoi, N. K. (2005). The impact of noise on the scaling of collectives: A theoretical approach. In High Performance Computing–HiPC 2005 (pp. 280– 289). Springer. 28. Beckman, P., et al. (2006). The influence of operating systems on the performance of collective operations at extreme scale. In 2006 IEEE International Conference on Cluster Computing (pp. 1–12). IEEE. 29. Phillips, J. C., et al. (2002). NAMD: Biomolecular simulation on thousands of processors. In Supercomputing, ACM/IEEE 2002 Conference (pp. 36–36). 30. Lo, Y. J., et al. (2014). Roofline model toolkit: A practical tool for architectural and program analysis. 
In International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (pp. 129–148). Springer. 31. Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4), 65–76. 32. Calotoiu, A., et al. (2016). Fast multi-parameter performance modeling. In 2016 IEEE International Conference on Cluster Computing (CLUSTER) (pp. 172–181). IEEE. 33. Yeom, J.-S., et al. (2016). Data-driven performance modeling of linear solvers for sparse matrices. In International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (pp. 32–42). IEEE.


34. Lee, S., Meredith, J. S., & Vetter, J. S. (2015). Compass: A framework for automated performance modeling and prediction. In Proceedings of the 29th ACM on International Conference on Supercomputing (pp. 405–414). ACM. 35. Wu, X., & Mueller, F. (2013). Elastic and scalable tracing and accurate replay of nondeterministic events. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ICS’13 (pp. 59–68). ACM. 36. Tallent, N. R., et al. (2011). Scalable fine-grained call path tracing. In Proceedings of the International Conference on Supercomputing (pp. 63–74). ACM. 37. Mitra, S., et al. (2014). Accurate application progress analysis for large-scale parallel debugging. ACM SIGPLAN Notices, 49(6), 193–203. ACM. 38. Laguna, I., et al. (2015). Diagnosis of performance faults in LargeScale MPI applications via probabilistic progress-dependence inference. IEEE Transactions on Parallel and Distributed Systems, 26(5), 1280–1289. 39. Arnold, D. C., et al. (2007). Stack trace analysis for large scale debugging. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE. 40. Dean, D. J., et al. (2014). Perfscope: Practical online server performance bug inference in production cloud computing infrastructures. In Proceedings of the ACM Symposium on Cloud Computing (pp. 1–13). 41. Sahoo, S. K. et al. (2013). Using likely invariants for automated software fault localization. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 139–152).

Chapter 8

Production-Run Noise Detection

Abstract The performance variance detection approach in Chap. 7 relies on nontrivial source code analysis that is impractical for production-run parallel applications. In this chapter, we further propose Vapro, a performance variance detection and diagnosis framework for production-run parallel applications. Our approach is based on an important observation that most parallel applications contain code snippets that are repeatedly executed with a fixed workload, which can be used for performance variance detection. To effectively identify these snippets at runtime even without program source code, we introduce the state transition graph (STG) to track program execution and then conduct lightweight workload analysis on the STG to locate variance. To diagnose the detected variance, Vapro leverages a progressive diagnosis method based on a hybrid model combining variance breakdown and statistical analysis. Results show that the performance overhead of Vapro is only 1.38% on average. Vapro can detect the variance in real applications caused by hardware bugs, memory, and IO. After fixing the detected variance, the standard deviation of the execution time is reduced by up to 73.5%. Compared with the state-of-the-art variance detection tool based on source code analysis, Vapro achieves 30.0% higher detection coverage.

8.1 Introduction Performance variance has been confirmed as a serious problem when running parallel programs on data centers [1], supercomputers [2, 3], and cloud platforms [4–6]; it occurs across processes or threads within one execution as well as between executions. As the execution time of a parallel program is mostly determined by the slowest process or thread, performance variance may slow down the whole program even when only one process or thread is affected. As shown in Fig. 8.1, the time spent on the same task with fixed nodes varies greatly. Variance not only leads to performance degradation or resource waste but also makes applications' behavior unstable and hard to understand. Performance variance comes from various sources, including OS interruption [7, 8], memory errors [9], cache conflicts [10], network interference [11], and many


Fig. 8.1 100 repeated executions of 256-process NPB-CG on the same group of nodes on the Tianhe-2A supercomputer

other hardware or software faults [12]. The varying symptoms of performance variance make detection and diagnosis extremely difficult [12]. General approaches like rerunning, tracing applications, and executing benchmarks during the execution of applications can help detect variance. However, such intrusive approaches introduce large overhead and cannot be deployed in a production environment, which is a serious limitation given the poor reproducibility of performance variance. Therefore, a lightweight online detection and diagnosis approach is necessary to find out whether and why performance variance happens during the execution of a running program. To address this problem, we leverage an important observation that many parallel applications contain code snippets that are repeatedly executed with a fixed workload [9, 13, 14]. For example, applications such as neural networks and image processing repeatedly execute certain math kernels to perform the same computation (with different data) in each iteration. These fixed-workload code snippets can be used as benchmarks inside programs to detect and diagnose performance variance at runtime, since each of them is expected to take an unchanged execution time across executions. Although other works have tried compiler techniques to identify fixed-workload snippets for variance detection, they have major limitations regarding variance detection, diagnosis, and usability. For the state-of-the-art tool, vSensor [9]: (1) it misses many fixed-workload snippets that cannot be determined at compilation time and fails to handle complex alias analysis [15]; (2) it cannot diagnose variance since it neglects the crucial properties of fixed workload for variance diagnosis; and (3) it is impractical for closed-source applications and libraries, which are common in the production environment. Therefore, how to detect variance without source code and how to diagnose performance variance remain open problems. To overcome the limitations of existing approaches, we have to solve two main challenges. (1) How to identify. An application generates a continuous instruction flow at runtime. We need to split the instruction flow into a set of fragments (i.e., executions of code snippets) and identify fragments with fixed workload at runtime. This is challenging because only limited runtime information is available for identification. (2) How to diagnose. Programs expose little semantic information at runtime, and the causes of variance are numerous. In addition, although various kinds of runtime performance data provide rich information, we must keep the overhead small for production environments, which limits the amount of collected data.


In this work, we propose Vapro [16], a lightweight performance variance detection and diagnosis tool that does not require source code and is therefore practical for production-run parallel applications. Vapro is based on two important observations missed by previous works. First, many code snippets have de facto fixed workload or only a few classes of workload; they are usable benchmarks inside programs but can only be identified at runtime. Second, the comparability of fixed workload makes it ideal for variance diagnosis: by comparing various performance information of fixed-workload fragments, the differences among them can effectively expose the causes of variance. Based on these observations, we propose a series of novel approaches. There are three main contributions in our work.

• We propose a new data structure, called the state transition graph (STG), to track program execution and reorganize the collected dynamic fragments. With a fixed-workload fragment identification algorithm executed on the STG, we perform a lightweight online analysis to detect performance variance and quantify its influence.
• To diagnose variance without source code, we propose a progressive diagnosis method based on a hybrid model combining variance breakdown and statistical analysis. It takes both software and hardware into consideration and is able to progressively locate fine-grained reasons with a small overhead, which can effectively guide variance diagnosis.
• We evaluate Vapro on real applications with up to 2,048 processes to verify its efficacy on large-scale parallel applications. Vapro only introduces 1.38% performance overhead on average and has 30.0% higher detection coverage than the state-of-the-art tool. Vapro detects variances resulting from a hardware problem on Intel processors, the distributed file system, memory, and computing resource competition. Experimental results show that optimizations based on the crucial reports from Vapro reduce the standard deviations of execution time by up to 73.5% and bring speedups of up to 24.0%.

In this work, we focus on detecting and diagnosing the performance variance caused by the external environment, such as the variance caused by hardware, the OS, and communication functions implemented in shared libraries. Vapro helps users and system maintainers identify whether applications are running with performance slowdowns caused by the environment. For the detected variance, Vapro provides the most probable causes to help fix such problems.

8.2 Overview Vapro is packaged as a dynamic library to perform data collection and analysis. It requires no recompilation or re-linking of applications. Figure 8.2 illustrates the workflow of Vapro. Each step is described in detail below:


Fig. 8.2 Vapro overview

1. Intercepting Vapro splits the running progress of an application into a number of fragments (i.e., an execution of a code snippet) by intercepting the external functions provided by dynamic libraries. For a repeatedly executed code snippet, it generates many fragments at runtime.
2. Building STG (Sect. 8.3.2) Vapro generates an STG as a representation of the running progress of a program.
3. Performance Data Collection (Sect. 8.3.3) Vapro records runtime information for fragments, including elapsed time, function parameters, and performance counters.
4. Identifying Fixed-Workload Fragments (Sect. 8.3.4) Vapro identifies fragments with fixed workload by clustering for each STG edge and vertex.
5. Variance Detection (Sect. 8.3.5) Vapro automatically locates performance variance by analyzing the clustering result.
6. Progressive Variance Diagnosis (Sect. 8.4) For detected variance, Vapro leverages a breakdown model and a statistical method to progressively pinpoint the potential causes.
7. Visualization For variance detection, Vapro plots a heat map to illustrate the normalized performance and reports the region of variance and the quantified performance loss. For variance diagnosis, Vapro breaks down the variance and shows the impact and time duration for each factor.


8.3 Performance Variance Detection Vapro is based on the observation of fixed workload. In this section, we will introduce how Vapro locates performance variance by analyzing fixed-workload fragments.

8.3.1 Fixed-Workload Fragments A code snippet can generate several sets of fixed-workload fragments. Figure 8.3 shows an example of a code snippet between two MPI invocations. It is not a code snippet with a fixed workload at compilation time, since the loop termination condition is determined by two nonconstant variables. However, although this snippet is executed hundreds of times in a program execution, there are only seven different workloads. By distinguishing these different workloads at runtime and dividing them into separate sets of fixed-workload fragments, Vapro exploits code snippets that cannot be identified by static analysis-based tools such as vSensor [9].
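To make the idea concrete, the sketch below gives an illustrative C++/MPI function in the spirit of the snippet in Fig. 8.3; the function and variable names are hypothetical and not taken from the AMG source. Its loop bounds depend on runtime data, so no compiler can prove a fixed workload, yet across an execution only a handful of distinct workloads actually occur.

#include <mpi.h>
#include <vector>

// Illustrative only: the amount of work depends on halo sizes known only at
// runtime, but in practice these sizes take only a few distinct values, so the
// executions of this region form a few sets of de facto fixed-workload fragments.
// (weights is assumed non-empty.)
double exchange_and_reduce(const std::vector<std::vector<double>>& halo,
                           const std::vector<double>& weights,
                           int src, int dst, MPI_Comm comm) {
  int ready = 0;
  MPI_Recv(&ready, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);  // fragment starts here
  double sum = 0.0;
  for (std::size_t n = 0; n < halo.size(); ++n)          // nonconstant loop bound
    for (std::size_t i = 0; i < halo[n].size(); ++i)     // nonconstant loop bound
      sum += halo[n][i] * weights[i % weights.size()];   // same arithmetic per element
  MPI_Send(&sum, 1, MPI_DOUBLE, dst, 0, comm);           // fragment ends here
  return sum;
}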

8.3.2 State Transition Graph To identify potential code snippets with fixed workload in a parallel application, we first split the running progress of the application into a set of fragments. We propose a new data structure, named the state transition graph (STG), to organize these fragments. We give a formal definition of the STG below.

Definition 8.1 A state transition graph (STG) is a representation of the running progress of a parallel application. In an STG, vertices record a program's running states, while edges represent the transitions between different states.

An application's running progress is partitioned into a set of fragments. From one fragment to another, the program has a state transition. The STG is built during program execution. Vapro creates a vertex for each running state and an edge whenever the program transfers from one state to another. A key point

Fig. 8.3 A code snippet with fixed workloads in AMG


Fig. 8.4 A context-free STG

of building an STG is to attach fragments to it according to their running states. In Vapro, we have two alternative approaches to recording a running state, based on call-site and call-path information, respectively. Using call-site information as running states generates a context-free STG, while using call paths generates a context-aware STG. Context-Free STG In a context-free STG, the state of a fragment is only determined by the call site of the corresponding invocation. We use the CG program from the NPB benchmarks [17] as a running example to show how we build a state transition graph. Figure 8.4 shows the context-free STG for a nested loop in CG.1 A vertex in Fig. 8.4 represents a communication call site in the source code. Edges in Fig. 8.4 represent the transitions between communication call sites, i.e., the computation code snippets between them. With a context-free STG, all the communication and IO invocation fragments from the same call site are attached to the same vertex, and all the computation fragments from the same computation block are attached to the same edge. Context-Aware STG Different from a context-free STG, a context-aware STG takes the entire call stack of external invocations into consideration. Invocations from the same call site may have different call paths. For example, each vertex or edge in Fig. 8.4 corresponds to two vertices or edges in a context-aware STG, since the code is executed in both the warm-up and the real test stages with different call paths.

1 It is the cgitmax loop in cg.f:1170-1360.
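As a rough illustration of how a context-free STG can be organized, the sketch below (our own simplified data layout, not Vapro's implementation) keys vertices by call-site address and edges by pairs of consecutive call sites, attaching the collected fragments to them.

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Fragment {                     // one execution of a code snippet
  double elapsed_sec;                 // elapsed time of the fragment
  std::vector<double> workload;       // workload vector (e.g., TOT_INS, message size)
};

struct ContextFreeSTG {
  using Site = std::uintptr_t;        // call-site address serves as the running state
  std::map<Site, std::vector<Fragment>> vertices;               // communication/IO fragments
  std::map<std::pair<Site, Site>, std::vector<Fragment>> edges; // computation fragments
  Site prev_site = 0;

  // Called at every intercepted external invocation: the computation done since
  // the previous invocation becomes an edge fragment, the invocation itself a
  // vertex fragment, and the running state advances.
  void on_invocation(Site site, const Fragment& comm_frag, const Fragment& comp_frag) {
    if (prev_site != 0)
      edges[{prev_site, site}].push_back(comp_frag);
    vertices[site].push_back(comm_frag);
    prev_site = site;
  }
};

A context-aware STG would instead key states by the full call path (e.g., a hash of the call stack), at the cost of the call-stack backtracking overhead discussed in Sect. 8.6.2.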


Fig. 8.5 Performance data of fixed-workload computation fragments in 16-process B-scale CG under the computation and memory noises

8.3.3 Performance Data Collection Performance Counters Performance counters, including software counters such as the number of page faults and context switches and hardware counters such as the performance events provided by the performance monitoring unit (PMU), are valuable information for understanding performance. Vapro collects runtime data through performance counters for fixed-workload identification and variance diagnosis. It adopts different methods for computation, communication, and IO workloads. We elaborate on each type below. Computation Workload The ideal way to classify the workload of two computation fragments is by comparing their instruction flows. However, its enormous overhead makes it impossible for lightweight online analysis. We have to find proxy metrics that are able to represent the workload and remain stable even under performance variance. Figure 8.5 shows the values of TOT_INS (total number of instructions) and TSC (time-stamp counter, a high-precision clock in the CPU) for fixed-workload fragments in CG. We inject CPU and memory noises2 while CG is executing. The results show that TOT_INS is stable and insensitive to the noises, while the TSC, i.e., the execution time, is affected. Thus, Vapro takes TOT_INS as a crucial proxy metric for computation workload analysis by default. Users are able to specify other PMU metrics for analysis in Vapro as well, e.g., the number of load and store instructions, or the cache miss rate. Collecting more performance metrics improves the precision of workload representation but introduces extra overhead. Communication Workload Different from computation workload, PMU metrics of the CPU cannot directly reflect communication workload. For example, if a receiving process is waiting for its sending process via busy waiting, it will generate lots of memory access instructions. As a result, the number of memory access instructions is proportional to the waiting time rather than the actual communication time. To address this problem, Vapro

2 In this work, the computation noise is generated by executing stress [18] on the same CPU core as the applications, and the memory noise is generated by executing stream [19] on the idle cores.


uses communication invocation arguments, including the message size, the source and destination processes, and other invocation-specific information, such as the scope of a broadcast communication, instead of PMU values to approximate communication workload. Vapro records the elapsed time of each communication invocation to analyze its performance. Although the elapsed time can be affected by load imbalance and some other factors, we take them into account as a whole since they reflect communication performance to some degree. For more precise timing of non-blocking communication, users can also choose communication libraries that expose the underlying communication time, such as an MPI library with an enhanced profiling layer [20]. IO Workload Similar to communication workload, Vapro collects function parameters to identify IO workload. Parameters that have an influence on IO performance are recorded, such as data sizes, file descriptors, and IO modes.
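To illustrate the kind of counter collection described above, the following self-contained sketch reads the retired-instruction count around a toy computation fragment using the PAPI library; this is our own example of the general technique, not Vapro's code, and error handling is omitted for brevity.

#include <papi.h>
#include <cstdio>

int main() {
  int evset = PAPI_NULL;
  long long tot_ins = 0;

  PAPI_library_init(PAPI_VER_CURRENT);
  PAPI_create_eventset(&evset);
  PAPI_add_event(evset, PAPI_TOT_INS);   // total retired instructions as the workload proxy

  PAPI_start(evset);
  volatile double sum = 0.0;             // stand-in for a computation fragment
  for (int i = 0; i < 1000000; ++i) sum += i * 0.5;
  PAPI_stop(evset, &tot_ins);            // TOT_INS stays stable under noise; elapsed time does not

  std::printf("fragment workload: %lld instructions (sum=%f)\n", tot_ins, (double)sum);
  return 0;
}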

8.3.4 Identifying Fixed-Workload Fragments


So far, we have attached fragments with runtime information to the STG. However, as shown in Fig. 8.6b, the fragments on an STG edge or vertex can have various workload patterns, which cannot be directly used by Vapro for variance analysis. To identify fragments with fixed workload, we propose a lightweight approach based on workload clustering. Although Fig. 8.6 only shows the clustering of computation fragments on edges, we similarly identify fixed-workload communication and IO fragments on STG vertices. We represent all kinds of workload with a workload vector, which contains normalized performance metrics and/or invocation arguments, and then cluster these workload vectors. Vapro has to cluster millions of fragments collected at runtime without any prior knowledge, such as the number of clusters (i.e., the number of different workloads). A large number of algorithms have been studied to decide the optimal number of clusters during clustering, such as hierarchical clustering

Fig. 8.6 Clustering fragments by their workload. The workload of fragments is represented with different shapes: (a) source code (Send; computation with two classes of workload; Recv), (b) STG edge with different workloads, (c) workload clustering by TOT_INS and other PMU metrics, (d) STG edges with fixed workload


and minimum radius-based automatic k-means clustering [21]. However, they have high time complexities of at least O(n^2) [22] and O(n^{dk+1}) [23], where k, d, and n are the number of clusters, dimensions, and vectors to be clustered. Most of these algorithms require complex computation (nonlinear time complexity in the number of vectors), which is not suitable for lightweight performance analysis, especially for production-run parallel applications. Clustering Algorithm To address this problem, we present an ad hoc clustering algorithm (Algorithm 10) that leverages properties of the performance metrics. The Euclidean norm, i.e., the length of a vector, is used to classify different workload vectors. This is because a smaller norm of a workload vector, such as a smaller number of cache misses, usually means better performance. Performance variance usually enlarges these metrics rather than decreasing them. Metrics for which larger is better are converted into their opposites. Then, for fixed-workload fragments, the norms have a concentrated distribution near the smallest norm of all data, which indicates stable performance. After selecting the least norm among the unprocessed fragments, we find all fragments whose distance from the fragment with the least norm is smaller than a predefined threshold (5% in our implementation). For example, computation fragments with 1000–1050 instructions and 200–210 load and store instructions are put into the same cluster. The computational complexity of this algorithm is linear with respect to the number of workload vectors, disregarding the sorting, so it introduces a small overhead. This will be shown in the evaluation in Sect. 8.6.2.

Algorithm 10 Clustering algorithm for identifying fixed-workload snippets
1: for all edge/vertex ∈ STG do
2:   Sort all fragments attached to this edge/vertex according to the norms of their workload vectors
3:   while unprocessed fragments exist do
4:     Select the fragment with the smallest norm
5:     Find similar fragments whose distance from the selected fragment is less than a predefined threshold
6:     Move them into a new cluster
7:   end while
8:   Report clusters with too few fragments
9: end for
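The sketch below is a direct, simplified rendering of Algorithm 10 in C++ (our own illustration with hypothetical types; Vapro's actual implementation may differ). It clusters the fragments attached to a single STG edge or vertex using the norm ordering and a 5% relative distance threshold.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Frag {
  std::vector<double> w;    // workload vector (normalized metrics / arguments)
  bool clustered = false;
};

static double norm(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

static double dist(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(s);
}

// Cluster the fragments of one STG edge/vertex (Algorithm 10, lines 2-7).
std::vector<std::vector<std::size_t>> cluster_fragments(std::vector<Frag>& frags,
                                                        double rel_thresh = 0.05) {
  std::vector<double> norms(frags.size());
  for (std::size_t i = 0; i < frags.size(); ++i) norms[i] = norm(frags[i].w);

  std::vector<std::size_t> order(frags.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return norms[a] < norms[b]; });

  std::vector<std::vector<std::size_t>> clusters;
  for (std::size_t seed : order) {                 // smallest unprocessed norm first
    if (frags[seed].clustered) continue;
    frags[seed].clustered = true;
    std::vector<std::size_t> c{seed};
    double radius = rel_thresh * norms[seed];      // predefined relative threshold
    for (std::size_t j : order) {
      if (norms[j] > norms[seed] + radius) break;  // sorted norms allow an early stop
      if (!frags[j].clustered && dist(frags[seed].w, frags[j].w) <= radius) {
        frags[j].clustered = true;
        c.push_back(j);
      }
    }
    clusters.push_back(std::move(c));              // small clusters are reported (line 8)
  }
  return clusters;
}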

In Vapro, we do not strictly require the workload in the same cluster to be identical and tolerate a small difference. The main reason is the inherent error of the PMU mechanism [24]. Additionally, Vapro aims to detect performance variance that has a significant performance impact, so small workload differences do not prevent the detection of severe performance variance. During post-processing (Line 8), clusters with too few fragments (fewer than five in our current implementation) but long execution time are reported, which means that the corresponding execution path is not executed repeatedly. Users need to pay attention to whether these fragments represent abnormal performance.


8.3.5 Performance Variance Detection After workload clustering, Vapro uses these fixed-workload fragments to detect performance variance. For a parallel application, Vapro detects performance variance both within a single process (in the temporal dimension) and across multiple processes (in the spatial dimension). Intra-process Detection For fragments with the same workload, Vapro calculates the normalized performance of every fragment. As shown in Fig. 8.7b, fragments with different workloads are analyzed separately. For each fragment in a cluster, Vapro normalizes its performance according to its time consumption. The performance of the fastest fragment is normalized to 1, and the others are between 0 and 1. Then, the normalized performance of both clusters is merged to produce an overall performance report. To report the performance of profiled programs concisely, Vapro merges the normalized performance from all clusters for computation, network, and IO, respectively. Inter-process Detection Since different processes and threads often have similar tasks in parallel applications, Vapro detects inter-process variance by analyzing fixed-workload fragments from multiple processes or threads. As shown in Fig. 8.8, Vapro uses dedicated server processes for inter-process analysis. The server processes collect performance data from clients periodically. Each time, the server processes analyze the data for the last time window. The periods of analysis

Fig. 8.7 Detecting variance from multiple fragment clusters: (a) execution time of fragments, (b) performance of fragments, (c) overall performance. Circles and triangles mean fragments with different workload and lines mean their normalized performance

Fig. 8.8 Periodic analysis for multiple processes


Fig. 8.9 8-thread PageRank under a memory noise

windows are overlapped so that the analyzed results from different periods can be concatenated together. Vapro servers report normalized performance as a heat map, where performance variance is represented by light-colored blocks. Figure 8.9 shows an example of multi-threaded PageRank under injected memory noise, where the vertical axis denotes different processes or threads and the horizontal axis denotes time progress. Variance Locating Vapro automatically pinpoints variance by a region-growing method. It regards a contiguous region of the heat map with normalized performance below a threshold (0.85 in our implementation) as a possible variance. All possible variances are reported to users ranked by their impact on performance, which is calculated from the normalized performance. Users are also able to select regions of interest on the heat map for diagnosis. Sampling Similar to most performance monitoring tools, sampling is an optional approach for Vapro to trade off between overhead and accuracy. By skipping the recording of part of the external invocations, Vapro dynamically achieves a desired balance between overhead and detection ability. Heuristic sampling policies can be adopted by Vapro, such as skipping short fragments instead of long ones, to maintain high detection coverage with low overhead.
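For illustration, the following small sketch (our own simplified code, not Vapro's) computes the per-cluster normalized performance used above and flags the fragments that fall below the 0.85 threshold used for variance locating.

#include <algorithm>
#include <cstddef>
#include <vector>

// Normalized performance within one fixed-workload cluster: the fastest fragment
// maps to 1.0 and the others to values between 0 and 1 (times assumed positive).
std::vector<double> normalize_performance(const std::vector<double>& elapsed_sec) {
  std::vector<double> perf(elapsed_sec.size(), 1.0);
  if (elapsed_sec.empty()) return perf;
  double fastest = *std::min_element(elapsed_sec.begin(), elapsed_sec.end());
  for (std::size_t i = 0; i < perf.size(); ++i)
    perf[i] = fastest / elapsed_sec[i];
  return perf;
}

// Indices of fragments whose normalized performance is below the threshold;
// contiguous flagged regions on the heat map are reported as possible variance.
std::vector<std::size_t> flag_possible_variance(const std::vector<double>& perf,
                                                double threshold = 0.85) {
  std::vector<std::size_t> flagged;
  for (std::size_t i = 0; i < perf.size(); ++i)
    if (perf[i] < threshold) flagged.push_back(i);
  return flagged;
}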

8.4 Performance Variance Diagnosis In this section, we describe how Vapro automatically diagnoses the detected variance. Based on a variance breakdown model (Sect. 8.4.1), the execution time is broken down into several factors (Sect. 8.4.2). Vapro adopts a progressive analysis method to effectively locate the causes of variance (Sect. 8.4.3).

8.4.1 Variance Breakdown Model Vapro leverages the crucial comparability of fixed workload to diagnose variance. Since fixed-workload fragments without performance variance should have the same execution time and similar results of performance counters, differentiating


Fig. 8.10 Variance breakdown model. Nodes with vertical text indicate that the underlying fine-grained factors are omitted. FE and BS mean frontend bound and bad speculation

performance counters can reveal the reasons for variance. In this work, we only use performance counters inside processors and the OS to illustrate our approach, even though the sources of variance vary and hundreds of counters exist. However, only a small number of counters can be collected simultaneously due to overhead constraints. To diagnose the variance with a small overhead, we propose a variance breakdown model to guide the direction of diagnosis. As shown in Fig. 8.10, it covers both hardware and software variance. A node in Fig. 8.10 represents a factor accounting for part of the execution time, which corresponds to certain hardware or software performance counters. Nodes are organized hierarchically according to the inclusion relation of their execution and form several stages for variance diagnosis. The model first divides computation time into five stage-one (S1) factors. For the time when processes are running on CPUs, the variance breakdown model divides it into four S1 factors according to the top-down structure of PMU events [25]. For example, the S1 factor backend bound represents the time spent on computation and memory access, i.e., the S2 factors core bound and memory bound in Fig. 8.10. These S2 factors can be further broken down into S3 factors. The time of process suspension caused by the OS is included in the S1 factor suspension. Similarly, suspension can be further divided into fine-grained factors, such as page faults and other common OS events, and is extendable to cover more factors in diagnosis. By differentiating the time on each factor for fixed-workload fragments, Vapro quantifies the variance caused by each factor.
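In equation form (our notation), the S1 decomposition of a fragment's computation time can be written as

T_{\text{fragment}} = \underbrace{T_{\text{retiring}} + T_{\text{frontend bound}} + T_{\text{bad speculation}} + T_{\text{backend bound}}}_{\text{process running on the CPU}} + T_{\text{suspension}},

where the first four terms follow the top-down PMU classification of on-CPU time and the last term covers OS-induced suspension.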

8.4.2 Quantifying Time of Factors To compare the impact of different factors on variance diagnosis, Vapro quantifies the time cost for each factor by collecting corresponding performance counters. Performance counters have different units and Vapro classifies them according to whether they can be directly quantified in time. In Fig. 8.10, the factors with


Fig. 8.11 Variance breakdown of the fixed-workload fragments of 16-process CG under concurrent computing noise and memory contention

background color are directly quantifiable in time, such as how long a process spends on the CPU frontend bound.3 With the help of well-designed hardware PMU events, a top-down time breakdown is feasible for factors in CPUs [25]. Since this breakdown relies on formulas derived from the meaning of PMU events, we call this a formula-based method. However, there are still many factors that cannot be directly quantified in time. For example, the OS provides users with the count of page faults, but we cannot directly calculate the time of page faults from this count. We propose an OLS-based (ordinary least squares) statistical method to estimate the time of unquantified factors for variance diagnosis. Vapro processes fixed-workload fragments separately to leverage their comparability. For each cluster, all factors are normalized to the range 0–1. Then, Vapro checks the multicollinearity of factors by the Farrar-Glauber test [26]. In multivariate OLS, multicollinearity means that one explanatory variable can be linearly correlated with others; this makes the estimated coefficients unstable and possibly reduces the precision of results. Since some factors are related to each other, for example, a page fault in user space is also a context switch, multicollinearity tends to occur in our analysis. Vapro removes the multicollinear factors one by one until multicollinearity no longer exists in the OLS. Vapro takes execution time as the explained variable and the factors as explanatory variables for OLS. In the OLS results, only factors with a significant influence (p < 0.05) on the time are considered in the following diagnosis. After scaling the coefficients to undo the normalization, we obtain the estimated time impact of each factor. For factors excluded from OLS due to multicollinearity, their coefficients are estimated from their multicollinear relationships. Thus, Vapro calculates the time of each factor to facilitate the performance diagnosis in Sect. 8.4.3. To verify this OLS-based statistical method, we compare it with the formula-based method. For the injected noise, which will be shown in Fig. 8.11, the impact of backend bound and suspension estimated by the formula-based method (89.4 and 4.9%) is consistent with the statistical method (86.6 and 3.1%).

3 On the Intel Ivy Bridge CPUs, the time fraction of frontend bound is equal to IDQ_UOPS_NOT_DELIVERED.CORE / (4 * CPU_CLK_UNHALTED.THREAD).
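For the unquantified factors, the regression can be written as follows (our notation, one OLS fit per fixed-workload cluster):

t_i = \beta_0 + \sum_{j=1}^{m} \beta_j\, x_{ij} + \varepsilon_i, \qquad i = 1, \dots, n,

where t_i is the execution time of fragment i, x_{ij} is the normalized value of factor j (e.g., its page-fault or context-switch count) in that fragment, and, after undoing the normalization, the coefficient \beta_j estimates the time attributable to factor j; only coefficients with p < 0.05 are kept for the following diagnosis.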


8.4.3 Progressive Variance Diagnosis Vapro adopts a progressive diagnosis method based on the above variance breakdown model, which progressively locates the major factors in the current stage and then diagnoses their fine-grained factors. The major factors are decided according to their contribution to variance, i.e., how much slowdown a factor causes. To calculate the contribution, fragments costing more than k_a times the time of the fastest fragment are regarded as abnormal fragments (k_a = 1.2 in our implementation), and the others are normal ones. Vapro takes the average time of each factor over the normal fragments as a reference value. Thus, the contribution of a factor in an abnormal fragment is the difference between the time of this factor in that fragment and the reference value. By summing up the contributions over all abnormal fragments, we obtain the contribution of a factor during a period of execution. Figure 8.11 shows fixed-workload fragments injected with computing noise and memory contention using the same method as in Fig. 8.5. Each point represents a fragment, and its marker indicates the major factor responsible for the variance of fragments under variance. BE and SP mean backend bound and suspension. The dashed line shows the region boundary of different major factors. Since the noises mainly increase two S1 factors, suspension and backend bound, we take them as axes and omit the other three S1 factors. The average of the normal fragments is the origin in Fig. 8.11. Thus, the coordinates of a fragment give the contributions of the factors. Vapro selects the factors contributing more than a threshold (0.25 in our implementation) of the overall variance as major factors for further diagnosis. Then, the server notifies clients to collect data for the fine-grained factors. This diagnosing process repeats until the most fine-grained factors are determined. In such a progressive way, Vapro requires only a small number of concurrently active performance counters and thus imposes low overhead. As a trade-off, this method costs n client-server data transfer periods and n server analysis latencies to locate an Sn factor in the variance breakdown model. Compared with the long execution time of applications in the production environment, this diagnosis method reacts efficiently. Vapro reports the impact and duration of each factor to users. The impact of a factor is calculated by summing up its contributions from all abnormal fragments. The duration of a factor is the total time of the abnormal fragments whose major factors include it. For example, in the case of Fig. 8.11, process suspension accounts for 60.3% of the slowdown and influences 24.2% of the execution time. Previous noise detection tools cannot break down variance for analysis in this way because they lack the precondition of fixed workload.
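In formula form (our notation), the contribution of a factor f over a period is

C_f = \sum_{i \in A} \bigl( t_{f,i} - \bar{t}^{\,\text{normal}}_{f} \bigr),

where A is the set of abnormal fragments (those slower than k_a = 1.2 times the fastest fragment), t_{f,i} is the time of factor f in abnormal fragment i, and \bar{t}^{\text{normal}}_{f} is the average time of factor f over the normal fragments; factors with C_f exceeding 0.25 of the overall variance are treated as major factors for the next diagnosis stage.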


8.5 Implementation In this section, we discuss the implementation details of how Vapro records performance data into the STG and processes performance data for a parallel application. Intercepting External Functions We leverage the runtime symbol look-up interface on Linux, i.e., dlsym, and an environment variable of the dynamic linker, LD_PRELOAD, to transparently intercept these functions (a sketch of this interception pattern is given after the list below). Currently, Vapro supports the following external functions:

• Communication: MPI communication functions
• IO: POSIX IO interfaces and MPI-IO functions
• Multi-threading: main POSIX pthread interfaces
• User-defined explicit invocations
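As mentioned above, a minimal sketch of the dlsym/LD_PRELOAD interception pattern follows; it is our own illustration (wrapping only MPI_Send, with a stub standing in for Vapro's STG update), not Vapro's actual code.

#include <dlfcn.h>
#include <mpi.h>
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for Vapro's recording logic: attach a communication
// fragment (call site, elapsed time, message size) to the STG.
static void record_fragment(void* call_site, double seconds, int count) {
  std::fprintf(stderr, "site=%p time=%.6fs count=%d\n", call_site, seconds, count);
}

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm) {
  using real_send_t = int (*)(const void*, int, MPI_Datatype, int, int, MPI_Comm);
  static real_send_t real_send =
      reinterpret_cast<real_send_t>(dlsym(RTLD_NEXT, "MPI_Send"));  // resolve the real symbol

  void* site = __builtin_return_address(0);  // call site = running state (GCC/Clang builtin)
  auto t0 = std::chrono::steady_clock::now();
  int rc = real_send(buf, count, datatype, dest, tag, comm);
  auto t1 = std::chrono::steady_clock::now();
  record_fragment(site, std::chrono::duration<double>(t1 - t0).count(), count);
  return rc;
}

Built as a shared library (e.g., with g++ -shared -fPIC) and loaded via LD_PRELOAD, such a wrapper requires no recompilation or re-linking of the application, which matches the deployment model described in Sect. 8.2.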

Although most parallel applications heavily rely on external libraries, some of them execute with very few external function invocations for a long period. For these programs, Vapro inserts user-defined invocations into programs with the support of Dyninst [27], a binary rewriting tool. We insert explicit invocations at some key points, such as the entry and exit of functions. The binary exponential backoff strategy [28] is applied to adapt the frequency of profiling data collection and limit the overhead of Vapro. Performance Data Analysis Vapro provides a multi-threaded server for online variance detection and diagnosis. For large-scale parallel applications, Vapro supports concurrent data collection with multiple servers to improve throughput. By equally assigning parallel processes to different servers, servers can achieve load balance. Further optimizations are feasible with data collection frameworks such as MRNet [29], which organizes servers into a tree-like structure.

8.6 Evaluation 8.6.1 Evaluation Setup Vapro is implemented as a dynamic library and is preloaded when programs are executed. We collect TOT_INS for workload clustering and evaluate Vapro with real cases. Platform We conduct the multi-process evaluation on the Tianhe-2A supercomputer, whose nodes have dual 12-core Intel Xeon E5-2692 v2 processors and 50 Gbps networks. The multi-threaded applications are evaluated on a server with dual 12-core Intel E5-2670 v3 processors. Applications We evaluate (1) BERT [30], an efficient inference framework for the popular natural language processing model BERT; (2) PageRank [31], a


multi-threaded graph computing application; (3) WordCount [32], a MapReduce-style program; (4) AMG [33], a parallel algebraic multigrid solver; (5) CESM [34], a state-of-the-art climate simulator with more than 500,000 lines of code; (6) six programs from the PARSEC suite [35], which covers image processing, finance, and hardware design applications; and (7) seven programs from the NPB benchmarks [17] with the E-class problem size. These programs are from diverse fields and cover both multi-threading and multi-processing.

8.6.2 Overhead and Detection Coverage Overhead Table 8.1 shows the performance overhead and detection coverage of Vapro and the state-of-the-art vSensor [9]. Since vSensor does not support multi-threaded applications, we present the evaluation results of multi-process applications in detail. The overhead of Vapro is small since Vapro is triggered only when external functions are invoked, which are time-consuming communication and IO operations. Therefore, the overhead of Vapro is bounded by a small ratio of the cost of external function invocations. Vapro with context-aware STG has higher overhead (3.81%) than that with context-free STG (1.80%) due to the costly call-stack backtracking operation for the call-path information. Although both Vapro and vSensor introduce low performance overhead, vSensor, as a tool relying on source code analysis, fails to work on complex applications with large codebases, such as CESM. Vapro is configured with a 15-second reporting period, and one server process serves 256 application processes. The overhead of Vapro servers is only 0.4% (1/256) of the resources used by the applications. For the storage overhead, Vapro generates 12.8 or 47.4 KB of data per second for one thread or process on average. Since detailed performance data can be periodically analyzed and merged as normalized performance, Vapro has a small storage requirement. Detection Coverage Since only the variance during the execution of fixed-workload fragments is detected, we define detection coverage as the ratio of the time spent on repeated fixed-workload fragments to the total execution time. To exemplify the importance of coverage, we compare different tools for the C-scale NPB SP program under a computing noise that lasts 1 second. As shown in the red circles in Fig. 8.12, Vapro accurately detects the 50% performance loss caused by OS process scheduling, which equally divides CPU time between the application and the noise process. However, vSensor incorrectly reports a 90% performance loss lasting 1/10 s, since its detection coverage (8.7%) is significantly lower than Vapro's (36.4%). The low coverage means vSensor does not collect enough fragments to correctly show the impact of the context switches caused by OS scheduling. The detection coverage of Vapro exceeds 70.0% and outperforms vSensor by 30.0% on average, which is critical for the precision of detection. More importantly,

Table 8.1 Performance overhead and detection coverage (with 2048 processes for CESM, 1024 processes for the other multi-process applications, and 16 threads for multi-threaded programs). CA and CF mean Vapro with context-aware and context-free STG

Multi-process applications        Overhead (%)               Coverage (%)
                                  vSensor   CA      CF       vSensor   CA      CF
AMG                               2.02      1.34    0.37     0.00      57.5    66.4
CESM                              N/A       8.06    0.02     N/A       33.8    47.7
NPB  BT                           0.22      2.07    2.00     80.1      83.7    86.2
     CG                           0.00      0.29    0.72     19.5      78.3    78.2
     EP                           0.00      1.04    1.04     0.0       87.5    87.5
     FT                           2.21      4.73    5.13     93.2      72.0    72.2
     LU                           1.12      8.56    2.88     65.9      97.4    97.7
     MG                           1.03      6.99    2.86     76.2      5.1     77.7
     SP                           1.22      1.19    1.23     29.4      66.6    66.3
Mean                              0.98      3.81    1.80     45.5      64.7    75.5

Multi-threaded applications       Overhead (%) CF            Coverage (%) CF
BERT                              0.75                       72.8
PageRank                          2.70                       47.3
WordCount                         0.41                       74.1
PARSEC  FFT                       0.16                       66.9
        Blackscholes              0.00                       84.9
        Canneal                   2.63                       81.3
        Ferret                    0.02                       79.0
        Swaptions                 0.00                       92.4
        Vips                      1.85                       96.7
Mean                              0.95                       74.1


Fig. 8.12 1024-process SP under a computing noise

Vapro works on programs with runtime-fixed workload, such as AMG and EP, which the static analysis-based vSensor cannot handle. In Table 8.1, context-free STG outperforms context-aware STG with 10.8% higher average coverage and smaller overhead. This is because workload clustering overcomes the disadvantage of context-free STG. According to these evaluation results, context-free STG is more favorable, and we use it in the following experiments.

8.6.3 Verification of Fixed Workload Identification To verify the fixed workload identification algorithm, we record the exact execution paths and compare them with the clustering results of Vapro. We evaluate four applications with medium codebases. All loops and branches in their hotspots, which cover more than 80% of the execution time, are instrumented to record execution paths. Table 8.2 shows the clustering results in terms of the homogeneity, completeness, and V-measure scores [36]. All the completeness scores are equal to one, which means that fragments with the same workload are in the same cluster. For PageRank, the homogeneity score (0.74) indicates that some fragments with different workloads are clustered together. By inspecting its source code, we find that some fragments with approximately equal workload (e.g., a common 100,000-iteration loop with fewer than 20 differing arithmetic operations) are put into one cluster.

Table 8.2 Verification of fixed workload identification. C, H, and V mean completeness, homogeneity, and V-measure (the harmonic mean of C and H) scores. Programs are executed with 16 processes or threads

Applications          CG     FT     EP     PageRank
Number of fragments   3801   640    16     672
C                     1.00   1.00   1.00   1.00
H                     1.00   1.00   1.00   0.74
V                     1.00   1.00   1.00   0.85


Since these workload differences are small, such mixed clusters do not hinder Vapro from detecting variance that significantly impacts performance.

8.6.4 Comparing with Profiling Tools We generate parallel computational noises to interfere with processes. After the 2048-process E-scale CG is started, computing noises are injected into two different computing nodes. Figure 8.13 shows that Vapro accurately locates the performance variance (two white boxes) and reports a 42.8% computation performance loss. With the regression based on the variance breakdown model, Vapro reports that involuntary context switches have a significant negative influence (p < 0.001) on performance. The execution time breakdown provided by profiling tools is often misleading. We take mpiP [37] as an example, whose result summarizes the computation time and communication time. The result of mpiP in Fig. 8.14 shows that the communication time increases and the computation time remains the same, which suggests a network problem. mpiP highlights the significant increase in communication time caused by dependences but omits the relatively small changes in computation time. However, with the help of fixed-workload fragments, Vapro catches the nuanced change and diagnoses it. vSensor, in contrast, cannot pinpoint the source of the variance although the variance is detected.

Fig. 8.13 Detection results of 2048-process CG under software noises by Vapro

Fig. 8.14 Results of 2048-process CG by mpiP


Fig. 8.15 Detection results of an HPL execution under hardware variance detected by Vapro

8.6.5 Case Studies In this subsection, we present three case studies covering variance caused by a hardware cache problem, a memory problem, and IO variance.

8.6.5.1 Detection of a Hardware Bug

In this case, we evaluate High Performance Linpack (HPL) [38] with 36 processes on a computing node with dual 18-core Intel Xeon Gold 6140 processors. All processes are bound to dedicated processor cores to mitigate the interference of OS scheduling. In our tests, HPL usually has stable performance, with a performance variation of less than 2%, since it has relatively little communication. However, Vapro captures an abnormal execution with 22.2% longer execution time than a normal run. Figure 8.15 shows the normalized performance reported by Vapro. From the figure, we can see that there is a large performance variance among processes, especially for the processes on the second processor socket, whose process IDs are between 16 and 31. With the progressive variance diagnosis, Vapro reports that 96.6% of the slowdown results from the backend bound in the CPU pipeline. The fine-grained breakdown shows that the L2 and DRAM bound (48.2 and 38.0% of the slowdown) are mainly responsible for this variance. This result implies that extra cache misses and memory accesses impair the performance. By recording several low-level micro-architecture PMU events related to the cache and repeating the execution, we verify that this variance is correlated with a PMU metric counting the number of CPU cycles stalled on L2 cache misses.4 This phenomenon matches a severe Intel processor hardware bug related to the L2 cache [10, 39], which makes data in the L2 cache get evicted and randomly generates significant slowdowns. To mitigate this problem, we leverage the huge page mechanism to decrease the frequency of problematic L2 cache evictions. Figure 8.16 shows the cumulative distribution function of the HPL performance. With the original page size of 2 MB,

4 The event name is CYCLE_ACTIVITY.STALLS_L2_MISS.


Fig. 8.16 Distribution of HPL performance

significant performance degradation is shown on the left side of the figure. After using 1 GB pages, the standard deviation of the execution time is reduced by 51.3%. For this problem, which influences all programs on the problematic processors, Vapro provides an online detection and diagnosis approach based on comparing fixed-workload fragments from different processes. Since such inter-process comparison fails without the precondition of fixed workload, other profiling tools, such as perf [40], cannot achieve it. vSensor fails in this case as well, since the HPL used here is a closed-source application provided by Intel. Vapro not only facilitates an early stop for the affected programs but also avoids time-consuming re-executions for diagnosing this non-deterministic problem.

8.6.5.2 Detection of Memory Problem

In this case, we execute the 128-process Nekbone [41], a computational fluid dynamics solver, on Tianhe-2A. As shown in Fig. 8.17, Vapro locates processes on a node that is slower than the others. By breaking down the variance, 97.2% of the slowdown is caused by the backend bound, and nearly all of it is contributed by the memory bound. With memory tests, we find that the memory bandwidth of this node is 15.5% lower than that of the others. Replacing this problematic node yields a speedup of 1.24×. We have reported this finding to the system administrator. One could argue that this variance could be detected by benchmarks in advance, but as shown in the previous case and in Fig. 8.1, variance happens even when programs are executed on the same nodes.

Fig. 8.17 Detection results of computation performance for 128-process Nekbone by Vapro


Fig. 8.18 Detection results of IO performance for 512-process RAxML by Vapro

Fig. 8.19 IO performance of consecutive read and write operations with fixed workload in RAxML

8.6.5.3 Detection of IO Performance Variance

The third case study focuses on RAxML [42], a popular phylogenetic analysis application. We execute this application with 512 processes and observe significant execution time variance: the execution time ranges from 41.1 to 68.0 seconds over 10 consecutive executions. Vapro suggests that both computation and communication performance are stable. However, as shown in Fig. 8.18, Vapro reports IO performance variance for the first process, which has significantly lower performance than the others. Vapro identifies the most varied fixed-workload IO fragments and plots their execution time in Fig. 8.19. Following this important hint, we further investigate RAxML and find that it merges data from multiple small files. Thus, its performance is vulnerable to the variance of the shared distributed file system. To reduce distributed file system accesses, we implement a simple file buffer for these files. This optimization yields a 17.5% speedup and a 73.5% reduction in the standard deviation of the overall execution time. In this case, although we cannot collect performance counters from the distributed file system due to security reasons, Vapro still efficiently filters out irrelevant factors to provide crucial hints for the solution.

8.7 Related Work Detecting Variance by General Approaches There have been several general approaches to performance variance detection and diagnosis. Micro-benchmarking is a classic approach to detecting system variance [6, 8], but running benchmarks


is intrusive; it interferes with other applications and is not suitable for online detection in production. The major drawback of tracing is its prohibitive data volume and performance overhead [43–45]. Program profiling [37] often discards time-sequence information, so it is difficult to detect performance variance in the time dimension. Although variance can be detected by performance modeling [46], building accurate models is extremely difficult.

Detecting Variance Caused by the Environment For detecting performance variance, an effective approach is differentiation, which means determining the processes or periods with different behavior. Among works based on fragment-level fixed-workload differentiation, vSensor [9] identifies such fixed-workload code snippets with static analysis to detect variance. However, relying on source code analysis, vSensor is impractical for proprietary programs and misses snippets with de facto fixed workload that cannot be determined at compilation. Shah [47] estimated the impact of external interference on bulk-synchronous MPI applications by comparing fixed-workload segments. However, neither this work nor vSensor can diagnose the causes of variance. In contrast, Vapro diagnoses variance and provides crucial guidance for solving it, which is not supported by these works. Many research works are able to detect or diagnose performance variance with differentiation by other methods. IASO [48] detects fail-slow, i.e., extremely severe performance variance, by monitoring the response time of requests. X-ray [49] leverages performance summarization and deterministic replay to locate basic-block-level causes of performance anomalies. UBL [50] predicts performance anomalies in the cloud by unsupervised learning. VarCatcher [51] detects and analyzes variance patterns using a parallel characteristic vector. Compared with these works, Vapro is able to locate, quantify, and diagnose variance online, which none of the above works alone can cover.

Detecting Variance Caused by Applications Software bugs lead to performance variance as well. STAT [52] detects the root cause of program hangs by finding the processes with a different call stack. AutomaDeD [53] uses a Markov model to find bugs by comparing the control-flow behavior history of processes and finding the least-progressed process. Su et al. [54] identify several performance bugs by recording function-level variance. PerfScope [55] analyzes system call invocations to locate candidate buggy functions. Sahoo et al. [56] find software bugs by monitoring program invariants. Other tools, such as Hytrace [57] and PRODOMETER [58], focus on this topic as well. Orthogonal to these works, Vapro focuses on the diagnosis of performance variance caused by the external environment instead of functional and performance bugs inside applications.


8.8 Conclusion

We present Vapro, an online, lightweight performance variance detection and diagnosis tool for production-run parallel applications. We design a novel data structure to dynamically identify fixed workloads for variance detection. Based on the variance breakdown model, Vapro diagnoses variance and reports the most probable causes, which previous tools cannot provide.

References

1. Ballani, H., et al. (2011). Towards predictable datacenter networks. In Proceedings of the ACM SIGCOMM 2011 Conference (pp. 242–253).
2. Ferreira, K. B., et al. (2013). The impact of system design parameters on application noise sensitivity. In 2010 IEEE International Conference on Cluster Computing (Vol. 16, No. 1, pp. 117–129).
3. Mondragon, O. H., et al. (2016). Understanding performance interference in next-generation HPC systems. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 384–395). IEEE.
4. Schad, J., Dittrich, J., & Quiané-Ruiz, J.-A. (2010). Runtime measurements in the cloud: Observing, analyzing, and reducing variance. Proceedings of the VLDB Endowment, 3(1–2), 460–471.
5. Schwarzkopf, M., Murray, D. G., & Hand, S. (2012). The seven deadly sins of cloud computing research. In 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’12).
6. Maricq, A., et al. (2018). Taming performance variability. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18) (pp. 409–425).
7. Beckman, P., et al. (2006). The influence of operating systems on the performance of collective operations at extreme scale. In 2006 IEEE International Conference on Cluster Computing (pp. 1–12). IEEE.
8. Hoefler, T., Schneider, T., & Lumsdaine, A. (2010). Characterizing the influence of system noise on large-scale applications by simulation. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. SC’10 (pp. 1–11).
9. Tang, X., et al. (2018). vSensor: Leveraging fixed-workload snippets of programs for performance variance detection. In ACM SIGPLAN Notices (Vol. 53, No. 1, pp. 124–136). ACM.
10. McCalpin, J. D. (2018). HPL and DGEMM performance variability on the Xeon Platinum 8160 processor. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 225–237). IEEE.
11. Gong, Y., He, B., & Li, D. (2014). Finding constant from change: Revisiting network performance aware optimizations on IaaS clouds. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 982–993). IEEE.
12. Gunawi, H. S., et al. (2018). Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS), 14(3), 23.
13. Sherwood, T., Sair, S., & Calder, B. (2003). Phase tracking and prediction. ACM SIGARCH Computer Architecture News, 31(2), 336–349. ACM.
14. Sherwood, T., Perelman, E., & Calder, B. (2001). Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT’01) (pp. 3–14). IEEE.


15. Aho, A. V., et al. (2006). Compilers: Principles, techniques, and tools (2nd Ed.). USA: Addison-Wesley Longman Publishing. ISBN: 978-0-321-48681-3.
16. Zheng, L., et al. (2022). Vapro: Performance variance detection and diagnosis for production-run parallel applications. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 150–162). https://doi.org/10.1145/3503221.3508411
17. Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center.
18. Stress. https://packages.debian.org/buster/stress
19. McCalpin, J. (2018). Memory Bandwidth: STREAM Benchmark Performance Results. https://www.cs.virginia.edu/stream/ (Visited on March 20, 2018).
20. Vetter, J. (2002). Dynamic statistical profiling of communication activity in distributed applications. ACM SIGMETRICS Performance Evaluation Review, 30(1), 240–250.
21. Yu, T., et al. (2019). Large-scale automatic K-means clustering for heterogeneous many-core supercomputer. IEEE Transactions on Parallel and Distributed Systems (TPDS), 31, 997–1008.
22. Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data: Recent advances in clustering, J. Kogan, C. Nicholas, & M. Teboulle (Eds.). Berlin, Heidelberg: Springer. ISBN: 978-3-540-28349-2.
23. Inaba, M., Katoh, N., & Imai, H. (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the Tenth Annual Symposium on Computational Geometry (pp. 332–339).
24. Weaver, V. M., Terpstra, D., & Moore, S. (2013). Non-determinism and overcount on modern hardware performance counter implementations. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 215–224). IEEE.
25. Yasin, A. (2014). A top-down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14) (pp. 35–44). IEEE.
26. Farrar, D. E., & Glauber, R. R. (1967). Multicollinearity in regression analysis: The problem revisited. The Review of Economics and Statistics, 49, 92–107.
27. Bernat, A. R., & Miller, B. P. (2011). Anywhere, any-time binary instrumentation. In Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools (pp. 9–16).
28. Goodman, J., et al. (1988). Stability of binary exponential backoff. Journal of the ACM (JACM), 35(3), 579–602.
29. Brim, M. J., et al. (2010). MRNet: A scalable infrastructure for the development of parallel tools and applications. In Cray User Group.
30. The cuBERT framework. https://github.com/zhihu/cuBERT
31. The parallel PageRank program. https://github.com/nikos912000/parallel-pagerank
32. The MapReduce framework. https://github.com/sysprog21/mapreduce
33. Yang, U. M., et al. (2002). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics, 41(1), 155–177.
34. Kay, J. E., et al. (2015). The community earth system model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society, 96(8), 1333–1349.
35. Bienia, C., et al. (2008). The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT’08) (pp. 72–81).
36. Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 410–420).
37. Vetter, J., & Chambreau, C. (2005). mpiP: Lightweight, scalable MPI profiling.
38. Dongarra, J. J., Luszczek, P., & Petitet, A. (2003). The LINPACK benchmark: Past, present and future. Concurrency and Computation: Practice and Experience, 15(9), 803–820.


39. Intel. Addressing Potential DGEMM/HPL Perf Variability on 24-Core Intel Xeon Processor Scalable Family. White Paper, Number 606269, Revision 1.0 (2018).
40. De Melo, A. C. (2010). The new linux perf tools. In Slides from Linux Kongress (Vol. 18, pp. 1–42).
41. The Nekbone program. https://github.com/Nek5000/Nekbone
42. Stamatakis, A. (2006). RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21), 2688–2690.
43. Zhai, J., et al. (2014). Cypress: Combining static and dynamic analysis for top-down communication trace compression. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 143–153). IEEE.
44. Wylie, B. J. N., Geimer, M., & Wolf, F. (2008). Performance measurement and analysis of large-scale parallel applications on leadership computing systems. Scientific Programming, 16(2–3), 167–181.
45. Jones, T. R., Brenner, L. B., & Fier, J. M. (2003). Impacts of operating systems on the scalability of parallel applications. Lawrence Livermore National Laboratory, Technical Report UCRL-MI-202629.
46. Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. SC’03. Phoenix, AZ, USA: ACM.
47. Shah, A., Müller, M., & Wolf, F. (2018). Estimating the impact of external interference on application performance. In European Conference on Parallel Processing (pp. 46–58). Springer.
48. Panda, B., et al. (2019). IASO: A fail-slow detection and mitigation framework for distributed storage services. In 2019 USENIX Annual Technical Conference (USENIX ATC’19) (pp. 47–62).
49. Attariyan, M., Chow, M., & Flinn, J. (2012). X-ray: Automating root-cause diagnosis of performance anomalies in production software. In Presented as Part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12) (pp. 307–320).
50. Dean, D. J., Nguyen, H., & Gu, X. (2012). UBL: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems. In Proceedings of the 9th International Conference on Autonomic Computing (pp. 191–200).
51. Zhang, W., et al. (2016). VarCatcher: A framework for tackling performance variability of parallel workloads on multi-core. IEEE Transactions on Parallel and Distributed Systems (TPDS), 28(4), 1215–1228.
52. Arnold, D. C., et al. (2007). Stack trace analysis for large scale debugging. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.
53. Laguna, I., et al. (2015). Diagnosis of performance faults in large-scale MPI applications via probabilistic progress-dependence inference. IEEE Transactions on Parallel and Distributed Systems, 26(5), 1280–1289.
54. Su, P., et al. (2019). Pinpointing performance inefficiencies via lightweight variance profiling. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’19) (pp. 1–19).
55. Dean, D. J., et al. (2014). PerfScope: Practical online server performance bug inference in production cloud computing infrastructures. In Proceedings of the ACM Symposium on Cloud Computing (pp. 1–13).
56. Sahoo, S. K., et al. (2013). Using likely invariants for automated software fault localization. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 139–152).
57. Dai, T., et al. (2018). Hytrace: A hybrid approach to performance bug diagnosis in production cloud infrastructures. IEEE Transactions on Parallel and Distributed Systems (TPDS), 30(1), 107–118.
58. Mitra, S., et al. (2014). Accurate application progress analysis for large-scale parallel debugging. ACM SIGPLAN Notices, 49(6), 193–203. ACM.

Part V

Performance Analysis Framework

Chapter 9

Domain-Specific Framework for Performance Analysis

Abstract In this book, we propose several performance analysis approaches for communication analysis, memory monitoring, etc. However, to implement each such analysis, significant human efforts and domain knowledge are required. To alleviate the burden of implementing specific performance analysis tasks, we propose a domain-specific programming framework, named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of main performance analysis sub-tasks, called passes, which can either be provided by PerFlow’s built-in analysis library or be implemented by developers to meet their requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution and then leverage various graph algorithms to automate the analysis. We demonstrate the efficacy of PerFlow by three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks. In addition, PerFlow is able to perform analysis and locate performance bugs automatically and effectively.

9.1 Introduction

Performance analysis is indispensable for understanding and optimizing applications and is widely used in different fields, including scientific computing [1, 2], machine learning [3, 4], and data processing [5, 6]. Due to the complexity of load imbalance, communication dependence, resource contention, etc. [7–9], effective analysis currently requires significant human effort and domain knowledge, and understanding the performance behavior of parallel applications remains challenging. A large number of performance tools have been proposed to facilitate performance analysis based on either profiling or tracing. Profiling-based tools [10–12] record program snapshots at regular intervals, yielding overall statistical performance data with very low overhead. Tracing-based tools [13–16] record all event traces during program execution, which contain plentiful information, including computation, memory access, and communication characteristics.


These tools provide various performance data, which are the basis of performance analysis. However, to locate the underlying performance bugs hidden behind complex performance data and communication dependence, in-depth analysis is further required. Researchers have proposed many in-depth performance analysis approaches to locate different kinds of performance bugs in different scenarios, such as critical path analysis [17–19], root cause analysis [8, 20], etc.

Existing approaches only focus on a specific aspect of the performance issues of parallel programs. However, a performance issue in a complex parallel program may involve multiple factors interleaved in a complex way. (1) Complex communications, locks, and data dependence unpredictably hide performance bugs. (2) Different performance bugs interact with each other, which means that the detected performance bugs may stem from several kinds of performance issues, including load imbalance, resource contention, etc. Identifying root causes in a new scenario requires specific in-depth analysis approaches, and implementing such approaches normally requires significant human effort and domain knowledge. Therefore, we conclude that an easy-to-use framework for easing the implementation of in-depth performance analysis is necessary.

Designing a general framework for effective performance analysis has two key challenges. (1) Providing a unified form to express different performance analysis tasks is difficult. To meet the needs of various scenarios, the algorithms of constraint-solving-based analysis approaches are designed specifically and differ greatly. We observe that a typical performance analysis approach is a step-by-step process: each step performs only a basic analysis, and the results of one step are further processed by the next. Inspired by this observation, we abstract performance analysis tasks as a general dataflow graph. A vertex of the dataflow graph corresponds to a step, while the data on an edge record intermediate results between steps. (2) Providing a unified form to represent the performance of a program is difficult, since analysis approaches rely on significantly different program and performance data, including performance monitor unit data, program structure, communication patterns, data dependence, and many more. Many existing works utilize graphs to represent program behavior and design task-driven methods to solve their problems, including program debugging [21, 22], performance modeling [23], communication trace compression [24], etc. [20, 25, 26]. Inspired by these works, we represent the performance of a program as a graph structure.

In this work, we focus on the domain of performance analysis and propose PerFlow [27], a domain-specific framework to ease the implementation of in-depth performance analysis tasks. In PerFlow, we abstract the step-by-step process of performance analysis as a dataflow graph [28], called PerFlowGraph, where the analysis steps, called passes, correspond to vertices and the intermediate results of each analysis step correspond to the data on edges. We leverage hybrid static-dynamic analysis to generate a Program Abstraction Graph (PAG) as a unified form to represent the performance of a parallel program and then implement the tasks of analysis steps with graph operations and algorithms on the generated PAG.


We provide a built-in analysis pass library containing several basic performance analysis sub-tasks and low-level APIs to build user-defined passes. With PerFlow, developers only need to describe their specific performance analysis tasks as PerFlowGraphs. PerFlow is able to automatically perform specific in-depth analysis tasks and report results specified by developers. In summary, there are four main contributions in our work.

• We propose PerFlow,1 a domain-specific framework for performance analysis. PerFlow provides a dataflow-based programming interface for developers to customize specific performance analysis tasks with ease.
• We present a Program Abstraction Graph, which is a unified performance representation of parallel programs.
• We provide a performance analysis pass library and some built-in performance analysis paradigms. Developers can directly use passes and paradigms to perform analysis.
• We demonstrate the efficacy and efficiency of PerFlow by three case studies on real-world applications with up to 700K lines of code, leveraging different PerFlow dataflow graphs to detect performance bugs in different scenarios.

We evaluate PerFlow with both benchmarks and real-world applications. Experimental results show that PerFlow can detect scalability bugs, load imbalance, and resource contention with different PerFlow dataflow graphs more effectively and efficiently compared with mpiP [11], HPCToolkit [10], and Scalasca [13]. Besides, PerFlow significantly eases the implementation of the scalability analysis task in ScalAna [20]. Applications can achieve up to 25.29× performance improvements by fixing detected performance bugs.

9.2 Overview

To help developers deal with the complexities of implementing specific performance analysis tasks, we develop PerFlow, a domain-specific programming framework that shields developers from these complexities and automatically performs the specified performance analysis. In PerFlow, the step-by-step process of performance analysis is abstracted as a dataflow graph, namely, the PerFlowGraph. Using PerFlow, developers only need to describe performance analysis tasks as PerFlowGraphs, and PerFlow will run the program and perform the performance analysis automatically. In this section, we introduce the PerFlow framework and give an example to illustrate how to program with PerFlow.

1 PerFlow is available at https://github.com/thu-pacman/PerFlow.


Fig. 9.1 The framework of PerFlow

9.2.1 PerFlow Framework

The overview of PerFlow is shown in Fig. 9.1. It consists of two components: a graph-based performance abstraction and a PerFlow programming abstraction.

Graph-Based Performance Abstraction In this component, the performance of a program execution is represented as a Program Abstraction Graph, whose vertices represent code snippets and whose edges represent control flow, data movement, and dependence (Sect. 9.3). Taking an executable binary as input, PerFlow first leverages hybrid static-dynamic analysis (Sect. 9.3.2) to extract program structures and collect performance data. Then performance data are embedded into the program structure to build a PAG, which describes the performance of a program run (Sect. 9.3.3).

PerFlow Programming Abstraction This component abstracts the process of performance analysis as a dataflow graph. The PerFlow programming abstraction consists of two concepts: performance analysis passes and a dataflow-based programming model. As the core of the programming abstraction, the performance analysis pass library provides various built-in passes (Sect. 9.4.3.2), which are built with low-level APIs on top of the performance abstraction. The passes perform graph algorithms, such as breadth-first search, subgraph matching, etc., on the PAGs and complete basic analysis sub-tasks. Intermediate results, which are the inputs/outputs of passes, are organized as sets. The elements of a set are PAG vertices and edges. A pass takes sets as input, updates the sets, and outputs them. (Note that a PAG is the environment shared by all passes in a PerFlowGraph, and a set is a subset of PAG vertices flowing along the edges of a PerFlowGraph.)


As a programming framework, PerFlow also provides a dataflow-based (high-level) API, allowing users to analyze the performance of parallel programs in different scenarios with ease and high efficiency. Developers only need to combine passes into a PerFlowGraph according to the demands of their analysis tasks. Then PerFlow will automatically run the programs and perform the specified performance analysis. PerFlow currently supports MPI, OpenMP, and Pthreads programs in C, C++, and Fortran. The hybrid static-dynamic module is easy to extend to other programming models, such as CUDA, and other architectures, such as ARM. By design, PerFlow is a cross-platform framework.

9.2.2 Example: A Communication Analysis Task

We take a communication analysis task as an example to illustrate how to program with PerFlow. When analyzing the communication performance of a program execution, the balance of communications is one of the key points. If communications exhibit imbalanced behavior, developers need to break them down to determine whether the cause of the imbalance is different message sizes, load imbalance before the communications, or something else. We summarize the step-by-step process of this communication analysis task as a PerFlowGraph in Fig. 9.2. It reports key attributes (including function name, communication patterns, debug info, and execution time) of the detected communication calls with performance bugs. The report module provides both human-readable text and visualized graphs. Listing 9.1 shows the implementation of the PerFlowGraph with PerFlow's high-level Python APIs.

Fig. 9.2 A communication analysis task represented as a dataflow graph (PerFlowGraph)

pflow = PerFlow()
# Run the binary and return a program abstraction graph
pag = pflow.run(bin = "./a.out", cmd = "mpirun -np 4 ./a.out")

# Build a PerFlowGraph
V_comm = pflow.filter(pag.V, name = "MPI_*")
V_hot = pflow.hotspot_detection(V_comm)
V_imb = pflow.imbalance_analysis(V_hot)
V_bd = pflow.breakdown_analysis(V_imb)
attrs = ["name", "comm-info", "debug-info", "time"]
pflow.report(V_imb, V_bd, attrs)

Listing 9.1 A communication analysis task written using PerFlow’s Python API


9.3 Graph-Based Performance Abstraction

A program can be naturally represented as a graph. The code snippets of programs correspond to the vertices in the graph, while the relationships between these code snippets, such as control/data flow and dependence across threads/processes, correspond to the edges in the graph. The performance data can be stored as attributes of vertices and edges. In PerFlow, we use a Program Abstraction Graph to represent the performance of a program run. In this section, we first introduce the definition of PAG and then describe how to leverage hybrid static-dynamic analysis to extract the PAG structure and how to embed performance data on a PAG.

9.3.1 Definition of PAG

A PAG is a (weighted) directed graph G = (V, E).

Vertex (V) Each vertex v ∈ V represents a code snippet or a control structure of a program; its labels and properties indicate the type of the vertex and the data recorded on it. (1) The labels of a vertex include function, call, loop, and instruction. Call vertices are divided into user-defined function calls, communication function calls, external function calls, recursive calls, indirect calls, etc. (2) The properties of a vertex are various performance data, including execution time, performance monitor unit (PMU) data, communication data, the number of function calls, iteration count, etc., depending on the specific requirements of analysis tasks and the view of the PAG.

Edge (E) Each edge e = (v_src, v_dest) ∈ E connects two vertices v_src and v_dest; its labels and properties indicate the type of the edge and the data recorded on it. (1) The labels of an edge include intra-procedural, inter-procedural, inter-thread, and inter-process. An intra-procedural edge represents the control flow within a function. An inter-procedural edge represents a function call relationship. An inter-thread edge represents data dependence across different threads, such as waiting events caused by locks. An inter-process edge represents communication between different processes, including synchronous point-to-point (P2P) communications, asynchronous P2P communications, and collective communications. (2) The properties of an edge can be performance data, such as the execution time of communications, the amount of communication data, the time of waiting events, etc., depending on the type of edge and the runtime data.
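To make the definition concrete, the following minimal sketch (our own illustration, not PerFlow code; it assumes the python-igraph package, which PerFlow also uses for PAG storage in Sect. 9.5.1, and the vertex names and numbers are invented) builds a tiny directed graph whose vertices and edges carry the labels and properties described above:

import igraph

pag = igraph.Graph(directed=True)

# Vertices carry a label (vertex kind) and performance properties.
pag.add_vertex(name="main",     label="function", time=12.4)
pag.add_vertex(name="loop_1",   label="loop",     time=11.8, iters=1000)
pag.add_vertex(name="MPI_Send", label="call",     time=3.2)

# Edges carry a label describing the relationship between code snippets.
pag.add_edge("main", "loop_1",     label="intra-procedural")
pag.add_edge("loop_1", "MPI_Send", label="inter-procedural")

# Properties are ordinary attributes and can be queried directly.
hottest = max(pag.vs, key=lambda v: v["time"])
print(hottest["name"], hottest["time"])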


9.3.2 Hybrid Static-Dynamic Analysis

Hybrid static-dynamic analysis is leveraged to collect data for PAG generation. Static analysis extracts the main structure of the PAG, while dynamic analysis collects performance data and the required structure that cannot be obtained statically, such as indirect calls, locks, communications, etc., by monitoring the program at runtime. Static analysis can significantly reduce the runtime overhead of pure dynamic analysis.

Static Analysis PerFlow statically analyzes the binary using Dyninst [29] to extract static information, including the control flow, static call relationships, and debug information. The static analysis also marks the function calls whose information cannot be obtained statically so that it can be filled in at runtime.

Dynamic Analysis PerFlow provides a built-in runtime data collection module using sampling-based approaches. The collection module collects runtime data that cannot be obtained statically, including performance monitor unit (PMU) data, communication data, lock information, indirect call relationships, etc.

9.3.3 Performance Data Embedding

Performance data embedding associates performance data with the attributes of the corresponding vertices. We first identify the corresponding vertex through the calling context of each piece of data and then associate the performance data with that vertex. In Fig. 9.3, we give an example to illustrate the process of performance data embedding. Figure 9.3a is a calling context, and Fig. 9.3b shows a PAG. Starting from the main vertex, the Loop_1, foo, and pthread_create vertices are matched against the calling context during the search. Finally, this piece of data is embedded into the pthread_create vertex.

Fig. 9.3 Illustration of performance data embedding
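A rough sketch of this matching step (illustrative only, continuing the igraph-style graph used above; the attribute and helper names are ours, not PerFlow's API) walks the calling context from the outermost frame downward and accumulates the sample on the deepest vertex it can resolve:

def embed_sample(pag, calling_context, cycles):
    # calling_context is ordered from the outermost frame to the innermost,
    # e.g. ["main", "Loop_1", "foo", "pthread_create"].
    current = None
    for frame in calling_context:
        if current is None:
            candidates = list(pag.vs.select(name=frame))
        else:
            candidates = [w for w in current.successors()
                          if w["name"] == frame]
        if not candidates:
            break                      # stop at the deepest resolvable frame
        current = candidates[0]
    if current is not None:
        # Accumulate the sampled cycles on the matched vertex.
        prev = current.attributes().get("cycles") or 0
        current["cycles"] = prev + cycles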


9.3.4 Views of PAG

PerFlow provides two views of PAG: a top-down view and a parallel view. We take an example for explanation. Listing 9.2 shows an MPI + Pthreads program example with three functions (main, foo, and add). (Static analysis is performed on executable binaries; the example code is only for ease of understanding.)

Top-Down View of PAG The top-down view of PAG contains only intra-procedural and inter-procedural edges. Figure 9.4a shows the three PAGs of main, foo, and add generated through intra-procedural structure extraction. Figure 9.4b shows a PAG that merges each function's PAG via inter-procedural edges (only the vertices involved in merging are marked). Figure 9.4c shows the top-down view of PAG with performance data in each vertex after performance data embedding (only the vertices with performance data are marked). The color saturation of vertices represents the severity of hotspots.

int main() {
  while (iter--) {                     // Loop_1
    foo(...);
    pthread_mutex_lock(...);
    sum += local_sum;
    pthread_mutex_unlock(...);
    pthread_join(...);
  }
  printf(...);
}
void foo(...) {
  pthread_create(..., add, ...);
  for (...) sum += B[i];               // Loop_2
  MPI_Sendrecv(sum, ...);
}
void* add(...) {
  pthread_mutex_lock(...);
  for (...) sum += A[i];               // Loop_3
  pthread_mutex_unlock(...);
}

Listing 9.2 An MPI + Pthreads program example

Fig. 9.4 Generating the top-down view of PAG. The color saturation of vertices represents the severity of hotspots. Only relevant vertices are marked. (a) Intra-procedural structure extraction. (b) Merge with inter-procedural edges. (c) Performance data embedding


Fig. 9.5 Parallel view of PAG. The color saturation of vertices represents the severity of hotspots. Only relevant vertices are marked. (a) Flows of two threads. (b) The parallel view of a PAG

Parallel View of PAG The parallel view of PAG contains all types of edges including intra-procedural, inter-procedural, inter-thread, and inter-process edges. To build a parallel view of PAG, (1) we generate a flow for each process and thread. A flow is the vertex access sequence recorded by pre-order traversal through a specific part of the top-down view of PAG. Figure 9.5a shows the generated flows for all threads. (2) Then we add inter-thread and inter-process edges, which represent locks, communications, etc., across flows of different processes and threads. (3) We further embed performance data into the PAG. Finally, a parallel view of PAG is formed. Figure 9.5b shows the generated parallel view of PAG.
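A minimal sketch of step (1), generating a flow, is shown below (again written against the igraph-style graph used earlier; generate_flow is our own helper name, and the real implementation walks only the part of the top-down view reached by the given process or thread):

def generate_flow(pag, root_name):
    # Pre-order traversal of the top-down view, producing the vertex
    # access sequence ("flow") for one process or thread.
    root = pag.vs.find(name=root_name)
    flow, stack, visited = [], [root], set()
    while stack:
        v = stack.pop()
        if v.index in visited:
            continue
        visited.add(v.index)
        flow.append(v)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(v.successors()))
    return flow

# Inter-thread and inter-process edges (locks, communications, etc.) are
# then added across the flows of different processes and threads.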

9.4 PerFlow Programming Abstraction

9.4.1 PerFlowGraph

PerFlow uses a dataflow graph (PerFlowGraph) to represent all analysis steps and phases in a performance analysis task, including the running phase, the analysis sub-tasks, the result reporting phase, etc. The key observation from existing performance analysis approaches and our own experience is that the process of performance analysis is similar to a dataflow graph: developers analyze profiles and traces step by step and finally identify performance bugs. Thus we design a dataflow-based programming abstraction to represent the process of performance analysis. In the rest of this section, we introduce the elements of a PerFlowGraph, as well as performance analysis passes and paradigms.


Fig. 9.6 The relationship of sets and performance analysis passes

9.4.2 PerFlowGraph Element

In a PerFlowGraph, each vertex represents an analysis sub-task, and each edge represents the input to, or output from, a vertex. We use a performance analysis pass to complete the sub-task of a vertex and use sets as the data flowing along edges. We introduce the elements of the PerFlowGraph below.

Set A set can contain PAG vertices V, PAG edges E, or both (V, E). In PerFlow, we model all code snippets and program structures as PAG vertices and all data/control dependence and data movements as PAG edges (details in Sect. 9.3.1). The contents of sets are updated as they flow through the vertices of PerFlowGraphs.

Pass A performance analysis pass takes sets as input. After performing its analysis sub-task, it outputs sets as the input of the next pass. As shown in Fig. 9.6, the input sets flow through a performance analysis pass, and then output sets are generated and continue flowing. The format of inputs and outputs is determined by the design of the passes. Developers can flexibly use and combine passes to build the structure of the PerFlowGraph. PerFlow provides high-level APIs and a built-in pass library for PerFlowGraph construction. The built-in pass library provides hotspot detection, differential analysis, critical path identification, imbalance analysis, breakdown analysis, etc. Besides, PerFlow also provides low-level APIs, which allow developers to write user-defined passes to meet their requirements. We introduce several built-in passes and their implementations using low-level APIs in Sect. 9.4.3.2.

9.4.3 Building Performance Analysis Pass

We introduce the design of the low-level APIs and how to build performance analysis passes with them below.

9.4.3.1 Low-Level API Design

We design three types of APIs: graph operation APIs, graph algorithm APIs, and set operation APIs.

Graph Operation APIs provide interfaces for developers to access the attributes of PAG vertices and edges, including name, type, performance data, debug information, etc., or even to transform the PAG.


Here we denote the inputs and outputs of a pass that uses graph operation APIs as I and O, respectively. It may happen that O ⊈ I (∃ e ∈ O, but e ∉ I), which means graph operations can add new elements to the output.

Graph Algorithm APIs provide many graph algorithms, such as breadth-first search, subgraph matching, community detection, etc. Developers can use these algorithms, combined with constraints, to achieve specific analysis tasks.

Set Operation APIs include element sorting, filtering, and classification, as well as computing the intersection, union, complement, and difference of sets. Different from graph operations, for a pass that only uses set operations, the outputs must be a subset of the inputs (O ⊆ I). We take the operation filter as an example. It is designed to deliver specific PAG vertices and edges to specific passes. The criterion of a filter can be the type, name, or other attributes of vertices and edges. A filter can distinguish communication vertices by matching the name attribute against the string MPI_*, and IO vertices by matching the name against strings such as istream::read or by the types of vertices.
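As an illustration of such a set operation, a filter pass could be sketched as follows (our own sketch; fnmatch is used only to obtain the wildcard matching, and the real low-level API may look different):

import fnmatch

def filter_pass(V, name_pattern):
    # A pure set operation: the output is always a subset of the input.
    return [v for v in V if fnmatch.fnmatch(v["name"], name_pattern)]

# e.g. keep only communication vertices:
# V_comm = filter_pass(pag.vs, "MPI_*")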

9.4.3.2 Example Cases

We further introduce four built-in performance analysis passes and illustrate how to use PerFlow's low-level APIs to develop passes with graph algorithms on the PAG and set operations.

A: Hotspot Detection Hotspot detection refers to identifying the code snippets with the highest value of a specific metric, such as total execution cycles, cache misses, instruction count, etc. The most common form is to identify the most time-consuming code snippets, whose metric is total execution cycles or execution time. As shown in Listing 9.3, a hotspot detection pass is built.

B: Performance Differential Analysis Performance differential analysis refers to a comparison of program performance with input data, parameters, or different executions as the independent variable. The comparison helps analysts understand the trend of performance as the input changes. The performance difference can be intuitively represented on a top-down view of PAG, and we leverage the graph difference to perform differential analysis. The graph difference algorithm is performed on the top-down view of PAG. As shown in Fig. 9.7, G1 and G2 are two PAGs with different inputs, and G3 is the graph difference between G1 and G2.

# Define a "hotspot detection" pass
# Input:  The vertex set of a PAG - V
#         Sorting metric - m
#         The number of returned vertices - n
# Output: Hotspot vertex set
def hotspot(V, m, n):
    return V.sort_by(m).top(n)

Listing 9.3 The implementation of hotspot detection pass
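Used on its own, the pass might be invoked as shown below (the metric name "time" is only an assumed attribute of the PAG vertices, not a documented PerFlow name):

# The ten most time-consuming vertices of the PAG
V_hot = hotspot(pag.V, "time", 10)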


Fig. 9.7 Graph difference on the top-down view of PAGs. The color saturation of vertices represents the severity of hotspots. (a) G1. (b) G2. (c) G3 = (G2 − G1)

# Define a "differential analysis" pass # Input: Vertex sets of two PAGs - V1, V2 # Output: A set of difference vertices def differential_analysis(V1, V2): V_res = [] for (v1, v2) in (V1, V2): v = pflow.vertex() for metric in v1.metrics: v[metric] = v1[metric] - v2[metric] V_res.append(v) return V_res

Listing 9.4 The implementation of performance differential analysis pass

The color saturation of vertices represents the severity of hotspots. We find that the color saturation of MPI_Reduce is not the highest in G1 and G2, but it is the highest in G3, which means the performance of the non-hotspot vertex MPI_Reduce varies significantly with different inputs. Vertices that behave like MPI_Reduce are identified as having performance issues through performance differential analysis. The graph difference intuitively shows the changes in performance between program runs with different inputs. We implement this pass in Listing 9.4.

C: Causal Analysis Performance bugs can propagate through complex inter-process communications as well as inter-thread locks and lead to many secondary performance bugs, which makes root cause detection even harder. Paths consisting of edges in the parallel view of the PAG can well represent correlations among these performance bugs in different processes and threads. We leverage a graph algorithm, lowest common ancestor (LCA) [30], together with specific restrictions, to detect the correlations and thus achieve causal analysis. The goal of the LCA algorithm is to find the deepest vertex that has both v and w as descendants in a tree or directed acyclic graph. The causal analysis pass is designed based on the LCA algorithm. Listing 9.5 shows its implementation. The pass takes vertices with performance bugs as inputs and regards them as descendants. After performing the LCA algorithm, the detected common ancestors of these descendants are recorded and output as the vertices that cause the performance bugs.


# Define a "causal analysis" pass # Input: A set of vertices with performance bugs - V # Output: A set of vertices that cause the bugs def casual_analysis(V) V_res, S = [], [] # S for scanned vertices for (v1, v2) in (V, V): if v1!=v2 and v1 not in S and v2 not in S: # v1 and v2 are regarded as descendants v, path = pflow.lowest_common_ancestor(v1, v2) # v is the detected lowest common ancestor # path is an edge set if v in V: V_res.append(v) return V_res

Listing 9.5 The implementation of the causal analysis pass
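The lowest_common_ancestor primitive used above belongs to PerFlow's graph algorithm APIs. A rough way to realize it on a directed acyclic graph (our own sketch; it returns only the ancestor vertex, ignores the path that the listing also receives, and the depth map is an assumed precomputed input) is to intersect the ancestor sets of the two vertices and keep the deepest one:

def ancestors(v):
    # All vertices that can reach v, found by a reverse traversal.
    seen, frontier = set(), [v]
    while frontier:
        u = frontier.pop()
        for p in u.predecessors():
            if p.index not in seen:
                seen.add(p.index)
                frontier.append(p)
    return seen

def lowest_common_ancestor(pag, v1, v2, depth):
    # depth maps a vertex index to its depth in the top-down view; the
    # common ancestor with the greatest depth is the "lowest" one.
    common = ancestors(v1) & ancestors(v2)
    if not common:
        return None
    return pag.vs[max(common, key=lambda i: depth[i])]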

D: Contention Detection Contention refers to a conflict over a shared resource across processes or threads, which negatively impacts the performance of the processes or threads competing for the resource. It can cause several kinds of misbehavior, such as unwanted synchronization or periodicity, deadlock, livelock, and many more, which require expensive human effort to detect. We observe that these misbehaviors exhibit specific patterns in the parallel view of PAGs. Subgraph matching [31], which searches for all embeddings of a subgraph query in a large graph, is leveraged to search for these patterns in the PAGs and detect resource contention. The contention detection pass determines whether resource contention exists among the vertices of the input sets. The input of a contention detection pass is a set of vertices detected by the previous pass, while the outputs are the detected subgraph embeddings. We define a set of candidate subgraphs to represent resource contention patterns. Then we identify resource contention by searching for embeddings of the candidate subgraphs around the vertices of the input set. Listing 9.6 shows the implementation of the contention detection pass.

# Define a "contention detection" pass
# Input:  Vertex set - V
# Output: Subgraph embeddings
def contention_detection(V):
    # Build a candidate subgraph with a contention pattern
    sub_pag = pflow.graph()
    sub_pag.add_vertices([(1,"A"), (2,"B"), (3,"C"), (4,"D"), (5,"E")])
    sub_pag.add_edges([(1,3), (2,3), (3,4), (3,5)])
    # Execute the subgraph matching algorithm
    V_ebd, E_ebd = pflow.subgraph_matching(V.pag, sub_pag)
    return V_ebd, E_ebd

Listing 9.6 The implementation of the contention detection pass


9.4.4 Performance Analysis Paradigm

A performance analysis paradigm is a specific PerFlowGraph for an analysis task. We summarize some typical performance analysis approaches of existing tools [10, 11, 13–15] as built-in analysis paradigms, such as an MPI profiler paradigm (inspired by mpiP [11]), a critical path paradigm (inspired by the work of Böhme et al. [17] and Schmitt et al. [18]), a scalability analysis paradigm (inspired by the work of Böhme et al. [8] and ScalAna [20]), etc.

We take the scalability analysis paradigm as an example to show how to implement a performance analysis paradigm. The scalability analysis task in [20] first detects code snippets with scaling loss and imbalance, then finds the complex dependence between the detected code snippets with a backtracking algorithm, and finally identifies the root causes of scaling loss. We decompose the scalability analysis task into multiple steps. Most of the steps can be completed with PerFlow's built-in passes, and we only need to implement the backtracking step as a user-defined pass. As shown in Fig. 9.8, we build the PerFlowGraph of the scalability analysis paradigm, containing three built-in passes (differential analysis pass, hotspot detection pass, and imbalance analysis pass), a user-defined pass (backtracking analysis pass), a union operation, and a report module. Listing 9.7 shows the implementation of the scalability analysis paradigm, which consists of two parts. (1) Writing a backtracking analysis pass. We first write a backtracking analysis pass, which is not provided by our built-in pass library. As shown in Listing 9.7, this pass implements a backward traversal through communications and control/data flow with several graph operation APIs, including neighbor acquisition (v.es at Line 13), edge filtering (select at Lines 13, 17, 20, and 22), attribute access (v[...] at Lines 15–16, 18–19), and source vertex acquisition (e.src at Line 25). (2) Building the PerFlowGraph of the scalability analysis paradigm. We build a PerFlowGraph with built-in and user-defined passes. The differential analysis pass (Line 30) takes two executions (i.e., a small-scale run and a large-scale run) as input and outputs all vertices with their scaling loss. Then the hotspot analysis pass (Line 31) outputs the vertices with the poorest scalability, while the imbalance analysis pass (Line 32) outputs the vertices that are imbalanced across processes. The union operation (Line 33) merges the two sets (the outputs of the hotspot analysis pass and the imbalance analysis pass) as the input of the backtracking analysis pass (Line 34). Finally, the backtracking paths and the root causes of the scaling loss are stored in (V_bt, E_bt) and reported (Line 36).

Fig. 9.8 The PerFlowGraph of the scalability analysis paradigm


# Define a "scalability analysis" paradigm # Input: PAGs of two program runs - PAG1, PAG2 def scalability_analysis_paradigm(PAG1, PAG2): # Part 1: Define a "backtracking analysis" pass # Input: A set of vertices with performance bugs - V # Output: Vertices and edges on backtracking paths def backtracking_analysis(V): V_bt, E_bt, S = [], [], [] # S for scanned vertices for v in V: if v not in S: S.append(v) in_es = v.es.select(IN_EDGE) while len(in_es) != 0 and v[name] not in pflow.COLL_COMM: if v[type] == pflow.MPI: e = in_es.select(type = pflow.COMM) elif v[type] == pflow.LOOP or v[type] == pflow.BRANCH: e = in_es.select(type = pflow.CTRL_FLOW) else e = in_es.select(type = pflow.DATA_FLOW) V_bt.append(v) E_bt.append(e) v = e.src return V_bt, E_bt

27 29 31 33 35

# Part 2: Build the PerFlowGraph of scalability analysis paradigm V1, V2 = PAG1.vs, PAG2.vs V_diff = pflow.differential_analysis(V1, V2) V_hot = pflow.hotspot_detection(V_diff) V_imb = pflow.imbalance_analysis(V_diff) V_union = pflow.union(V_hot, V_imb) V_bt, E_bt = backtracking_analysis(V_union) attrs = ["name", "time", "dbg-info", "pmu"] pflow.report([V_bt, E_bt], attrs)

37 39 41 43

# Use the scalability analysis paradigm pag_p4 = pflow.run(bin = "./a.out", cmd = "mpirun -np 4 ./a.out") pag_p64 = pflow.run(bin = "./a.out", cmd = "mpirun -np 64 ./a.out") scalability_analysis_paradigm(pag_p4, pag_p64)

Listing 9.7 The implementation of the scalability analysis paradigm

9.4.5 Usage of PerFlow

In summary, there are two main ways for developers to implement specific analysis tasks with PerFlow's APIs: using paradigms and building PerFlowGraphs.

Using Paradigms Developers can directly use built-in paradigms to obtain the corresponding performance analysis reports. An example that shows how to use a paradigm is given at Lines 38–43 in Listing 9.7. We run a program at two process scales, 4 and 64, and directly feed the resulting PAGs into the scalability analysis paradigm.

Building PerFlowGraphs PerFlow provides a built-in performance analysis pass library for building PerFlowGraphs.


For scenarios where analysis tasks have already been designed, the example in Listing 9.7 has shown the complete implementation process. For scenarios in which developers do not know what analysis to apply, PerFlow supports an interactive mode. It is advisable to first use a general built-in analysis pass, such as hotspot detection. The output of each pass provides insights that help determine or design the next passes, so analysts can add analysis passes to the PerFlowGraph step by step until a PerFlowGraph for the detailed analysis emerges. If the built-in passes cannot satisfy the demands, developers need to write their own passes and combine these user-defined passes with built-in ones to build PerFlowGraphs. Writing user-defined passes requires some basic knowledge; the implementations of several passes have been introduced (four built-in passes in Sect. 9.4.3.2 and the backtracking analysis pass at Lines 5–26 in Listing 9.7).

9.5 Evaluation

9.5.1 Experimental Setup

Experimental Platforms We perform the experiments on two clusters: (1) Gorgon, a cluster with dual Intel Xeon E5-2670 (v3) processors and 100 Gbps 4xEDR InfiniBand, and (2) the national supercomputer Tianhe-2A. Each node of Tianhe-2A has two Intel Xeon E5-2692 (v2) processors (24 cores in total) and 64 GB of memory. The Tianhe-2A supercomputer uses a customized high-speed interconnection network. PerFlow uses Dyninst (v10.1.0) [29] for static binary analysis, as well as a PMPI wrapper, the PAPI library (v5.4.3) [32], and the libunwind library (v1.3.1) for dynamic data collection. The PAG is stored in the graph processing system igraph [33].

Evaluated Programs We use a variety of parallel programs to evaluate the efficiency and efficacy of PerFlow, including BT, CG, EP, FT, IS, LU, MG, and SP from the widely used NPB benchmark suite (v3.3) [34], plus several real-world applications, ZEUS-MP [35], LAMMPS [36], and Vite [37]. For the NPB programs, problem size Class C is used.

Methodology In our evaluation, we first present the static and runtime overhead, as well as the space cost, of the hybrid static-dynamic analysis module (all evaluated programs run with 128 processes on Gorgon). Then we show basic features of the top-down view and the parallel view of the PAGs for all evaluated programs (128 processes for the parallel view). Finally, we use three real-world applications to demonstrate the process of performing customized performance analysis with PerFlow. In addition, we compare the results of PerFlow with four state-of-the-art tools, mpiP (v3.5) [11], HPCToolkit (v2020.12) [10], Scalasca (v2.5) [13], and ScalAna [20], by studying the performance of ZEUS-MP. For HPCToolkit, we set the sampling frequency to 200 Hz, which is the same as PerFlow.


Table 9.1 The overhead of PerFlow

Program        BT    CG    EP    FT    MG    SP    LU    IS    ZMP   LMP   Vite
Static (Sec.)  0.20  0.06  0.03  0.09  0.12  0.19  0.23  0.04  1.50  5.34  0.73
Dynamic (%)    0.44  3.73  0.13  1.83  0.92  1.08  1.42  0.03  1.56  0.71  0.03
Space (B)      346K  57K   35K   215K  464K  449K  184K  28K   2.4M  22M   1.6M

For Scalasca, we first profile the program and determine where tracing is needed, which significantly reduces instrumentation overhead.

9.5.2 Overhead and PAG

Static Analysis We first evaluate the cost of static analysis on the executable binaries. As shown in Table 9.1, the static analysis incurs only very low overhead (0.03–5.34 s, 0.77 s on average). For LAMMPS, a software package with over 700K lines of code, the static analysis costs only 5.34 s.

Dynamic Analysis For all programs, both PMU data and communication data are collected during dynamic analysis. The runtime performance overhead of dynamic analysis is 1.11% on average (0.03–3.73%), as shown in Table 9.1. The variance in dynamic overhead is caused by the different complexities of communication patterns. CG implements collective communications with three point-to-point communications, which makes its communication pattern more complicated. Thus, the runtime overhead of CG is much higher than that of the other programs (3.73%).

Space Cost The space cost of PerFlow is the storage size of the PAGs. Table 9.1 shows that the space costs for the evaluated programs range from 28 kilobytes to 22 megabytes, with an average of 2.5 megabytes. The storage cost for the LAMMPS package is only 22 megabytes.

Basic Features of PAG Table 9.2 shows the code size and binary size, as well as the vertex and edge counts of both the top-down view and the parallel view of the generated PAGs, for all evaluated programs. Programs with larger binaries tend to have PAGs with more vertices and edges.

9.5.3 Case Study A: ZEUS-MP

We use PerFlow and four state-of-the-art tools, mpiP [11], HPCToolkit [10], Scalasca [13], and ScalAna [20], to analyze the performance of ZEUS-MP. ZEUS-MP simulates three-dimensional astrophysical phenomena with computational fluid dynamics using an MPI programming model.


Table 9.2 Code size, binary size, and basic features of the top-down view and parallel view of PAG for evaluated programs. #V and #E are the numbers of vertices and edges, respectively

Program   Code (KLoc)  Binary (bytes)  Top-down view #V / #E  Parallel view #V / #E
BT        11.3         490K            3,283 / 3,282          420,224 / 462,404
CG        2.0          97K             321 / 320              41,088 / 55,176
EP        0.6          60K             111 / 110              14,208 / 34,360
FT        2.5          222K            2,904 / 2,903          371,712 / 409,128
MG        2.8          270K            4,701 / 4,700          601,728 / 712,432
SP        6.3          357K            2,252 / 2,251          288,256 / 322,364
LU        7.7          325K            1,566 / 1,565          200,448 / 284,780
IS        1.3          37K             325 / 324              41,600 / 69,816
ZEUS-MP   44.1         2.2M            11,981 / 11,980        1,533,568 / 2,805,760
LAMMPS    704.8        14.67M          85,230 / 85,229        10,909,440 / 16,423,808
Vite      15.9         2.8M            7,118 / 7,117          970,624 / 984,866

We run ZEUS-MP with a problem size of 256×256×256 for different numbers of processes ranging from 16 to 2,048 on the Tianhe-2A supercomputer. Experimental results show that ZEUS-MP does not scale well on 2,048 processes, where the speedup is only 72.57× (16 processes as the baseline).

Performance Analysis with PerFlow We use the scalability analysis paradigm in Fig. 9.8 to analyze the scalability problems. PerFlow first runs ZEUS-MP with 16 and 2,048 processes. Figure 9.9 shows the output of the differential analysis pass: Loop, mpi_waitall_, and mpi_allreduce_ vertices are detected with scaling loss. The output of the imbalance analysis pass is the set of imbalanced vertices, which are marked with black boxes in Fig. 9.10. Then the backtracking analysis pass builds paths between these imbalanced vertices, which represent how the performance bugs propagate (shown as red bold arrows). Figure 9.10 shows partial results due to space limitations. Finally, the imbalanced process vertices of loop_10.1 in bvald_ and loop_1.1.1 in newdt_ are detected as the underlying reasons for ZEUS-MP's poor scalability. As shown in Listing 9.8, the load imbalance of loop_10.1 at bvald.F: 358 causes some processes to wait in mpi_waitall_ at nudt.F: 227. The delays in these processes cause waiting events for some processes in mpi_waitall_ at nudt.F: 269 and then propagate to mpi_waitall_ at nudt.F: 328.

Fig. 9.9 The output vertices of differential analysis pass on the top-down view of PAG


Fig. 9.10 Partial results of the backtracking analysis pass on the parallel view of ZEUS-MP's PAG. The vertices with boxes are the output of the imbalance analysis pass, and red bold arrows represent the edges detected by the backtracking analysis pass

      subroutine bvald (rl1, ru1, rl2, ru2, rl3, ru3, d)
357     do k=ks-1,ke+1                            ! Loop 10
358       do i=is-1,ie+1                          ! Loop 10.1
359         if (abs(nijb(i,k)) .eq. 1) then
360           d(i,js-1,k) = d(i,js  ,k)
361           d(i,js-2,k) = d(i,js+1,k)
            ...
391     call MPI_IRECV(d(1,je+j+uu,1), 1, j_slice, n2p ...
399     call MPI_ISEND(d(1,je+j-ll,1), 1, j_slice, n2p ...

      subroutine nudt
207     call bvald (1,0,0,0,0,0,d)
        ...
227     call MPI_WAITALL (nreq, req, stat, ierr)
        ...
242     call bvald (0,0,1,0,0,0,d)
        ...
269     call MPI_WAITALL (nreq, req, stat, ierr)
        ...
284     call bvald (0,0,0,0,1,0,d)
        ...
328     call MPI_WAITALL (nreq, req, stat, ierr)
        ...
361     call MPI_ALLREDUCE(buf_in(1), buf_out(1), 1 ...

Listing 9.8 ZEUS-MP code with performance bugs

Finally, the synchronization in mpi_allreduce_ at nudt.F: 361 becomes a scaling issue. In conclusion, the load imbalance propagates through three non-blocking point-to-point communications and causes the poor scalability of mpi_allreduce_ and ZEUS-MP.


Comparison We run ZEUS-MP with four state-of-the-art tools, mpiP, HPCToolkit, Scalasca, and ScalAna, on both 16 and 2,048 processes. (1) mpiP generates statistical profiles, which present communication hotspots and other communication data, including message size, call count, debug information, etc. In the report of mpiP, the mpi_allreduce_ in nudt_ takes 0.06% and 7.93% of the total time on 16 and 2,048 processes, respectively. However, detecting the scaling loss of each communication call still requires significant human effort. (2) HPCToolkit provides fine-grained, loop-level hotspots. In addition to hotspot analysis, HPCToolkit [38] can also detect multiple scalability issues in mpi_allreduce_ and mpi_waitall_. But the root cause of the poor scalability and the underlying reasons cannot be easily obtained without performance analysis skills. (3) Scalasca, a tracing-based tool, can automatically detect root causes from event traces. Even with human intervention to limit tracing, its runtime overhead is 56.72% (not including I/O) and its storage cost is 57.64 gigabytes on 128 processes for function-level event traces, while PerFlow incurs only 1.56% runtime overhead and 2.4 megabytes of storage. (4) Besides, to implement the scalability analysis task with PerFlow, developers only need to write 27 lines of code with 7 high-level APIs and 5 low-level APIs (as shown in Listing 9.7). In contrast, the source code of ScalAna has thousands of lines.

Optimization We fix the root cause by changing ZEUS-MP into a hybrid MPI + OpenMP programming model. An OpenMP #pragma on loop_10.1 in bvald_ allows idle processors to share the workload of busy processors, which mitigates the load imbalance between processes. We also perform this optimization on the other detected code snippets with load imbalance. With these optimizations, the speedup of ZEUS-MP increases from 72.57× to 77.71× on 2,048 processes (16 processes as the baseline). Meanwhile, the performance of ZEUS-MP is improved by 6.91% on 2,048 processes.

9.5.4 Case Study B: LAMMPS

LAMMPS is an open-source software package for large-scale molecular dynamics simulation. It is implemented with the hybrid MPI + OpenMP programming model. We run LAMMPS with 6,912,000 atoms and 2,048 processes (in.clock.static as the input) on the Tianhe-2A supercomputer. With simple profiling, we notice that communication takes up to 28.91% of the total time. To analyze the performance issue of LAMMPS, we design the PerFlowGraph in Fig. 9.11. The PerFlowGraph detects imbalanced vertices and performs causal analysis repeatedly until the output set no longer changes, and we identify the final outputs as the root causes.

Performance Analysis with PerFlow After running the program, a PAG is generated. Passing through the hotspot detection pass and the communication filter, MPI_Send and MPI_Wait are detected as communication hotspots with 7.70% and 7.42% of the total time, respectively. The imbalance analysis pass detects that some processes of MPI_Send and MPI_Wait are imbalanced vertices with longer execution time.



Fig. 9.11 PerFlowGraph designed for performance analysis on LAMMPS

Fig. 9.12 Illustration of the process of PerFlowGraph on the parallel view of LAMMPS’s PAG

processes’ MPI_Send and MPI_Wait vertices are imbalanced, with longer execution times. As shown in Fig. 9.12 (we only show a partial parallel view of the PAG due to space limitations), the top-down vertical axis represents the data flow, and the horizontal axis represents different parallel processes. The vertices with boxes are imbalanced MPI_Send and MPI_Wait calls. The output of the causal analysis pass indicates that the long execution time of MPI_Send and MPI_Wait in CommBrick::reverse_comm (comm_brick.cpp: 544, 547) is caused by loop_1.1 in PairLJCut::compute (pair_lj_cut.cpp: 102-137). Figure 9.12 shows the result of the causal analysis: the paths consisting of bold edges are causal relationships, which show how the performance bug in loop_1.1 propagates to MPI_Send and MPI_Wait. The cause is that processes 0, 1, and 2 spend more time in loop_1.1 than the others. As shown in Listing 9.9, each process sends buffers to its neighbors using blocking communication, which propagates the performance bug in processes 0, 1, and 2 (loop_1.1) to the other processes (MPI_Send and MPI_Wait).

void PairLJCut::compute(){
  for (ii = 0; ii < inum; ii++) {           // Loop_1
    for (jj = 0; jj < jnum; jj++) {...}     // Loop_1.1

void CommBrick::reverse_comm(){
  for (int iswap = nswap-1; iswap >= 0; iswap--) {
    if (size_reverse_recv[iswap]) MPI_Irecv(...);
    if (size_reverse_send[iswap]) MPI_Send(...);
    if (size_reverse_recv[iswap]) MPI_Wait(...);

Listing 9.9 LAMMPS code with performance bugs



Optimization The imbalance in loop_1.1 is the root cause, and the performance bugs of MPI_Send and MPI_Wait are secondary bugs, which means that our optimization target is to make loop_1.1 more balanced. We add balance commands into the input file to adjust the size and shape of sub-domains of processes every 250 steps during simulation. With the optimization, the performance improves significantly from 118.89 time steps/s to 134.54 time steps/s (improved by 13.77%) on 2,048 processes.

9.5.5 Case Study C: Vite

Vite implements the distributed-memory Louvain method for graph community detection using the MPI + OpenMP programming model. We evaluate its performance on a weighted graph with 600,000 vertices and 11,520,982 edges, using eight processes and different numbers of threads per process ranging from 2 to 8 (on Gorgon). As shown in Fig. 9.13, the red dashed line represents the execution time of the original version of Vite. We observe that Vite has extremely poor scalability as the number of threads grows: the execution time on eight threads is even longer than that on two threads. As shown in Fig. 9.14, we design a PerFlowGraph, which sets up different branches for comprehensive diagnosis, to detect the performance issues of Vite.

Performance Analysis with PerFlow PerFlow first runs Vite with two and eight threads on eight processes, and two PAGs are generated. Figure 9.15a

Fig. 9.13 Scalability of Vite with eight processes and different numbers of threads ranging from 2 to 8

Fig. 9.14 PerFlowGraph designed for performance analysis on Vite



Fig. 9.15 The output of different passes. (a) A partial output of the hotspot detection pass on the top-down view of PAG. Dozens of vertices are detected as hotspots. (b) A partial output of the differential analysis pass on the top-down view of PAG. Only three _M_realloc_insert vertices are detected

Fig. 9.16 A partial output of the contention detection pass on the parallel view of Vite’s PAG

shows a partial output of the hotspot detection pass. The darker the color of a vertex, the longer the execution time of its corresponding code snippet. We notice that there are dozens of hotspots, including several operations of _Hashtable. The output of the differential analysis pass is shown in Fig. 9.15b, from which we detect that the _M_realloc_insert calls in distExecuteLouvainIteration have scalability issues. The report of the causal analysis pass shows that the _M_realloc_insert vertices themselves and the _M_emplace vertices are the root causes. As shown in Fig. 9.16, the contention detection pass



searches for resource contention around the detected _M_realloc_insert vertices. The vertical direction from top to bottom represents the control/data flow, and the horizontal direction represents different parallel processes and threads. Each vertex stands for a code snippet in a thread or a process. We hide irrelevant inter-process and inter-thread edges to simplify the parallel view of the PAG for better presentation (the complete parallel view of the PAG is much more complex). The subgraphs in red circles are detected embeddings of the resource contention pattern in different processes and code snippets. In the zoomed-in subgraph, it can be seen that resource contention exists in allocate, reallocate, and deallocate (called by _M_realloc_insert and _M_emplace). We find that the reason for the resource contention is that the underlying memory allocation operations are not thread-safe by themselves: when a thread allocates memory, an implicit lock must be acquired before the operation is performed. These locks lead to resource contention in the memory allocation vertices, causing performance degradation and scalability issues as the number of threads grows.

Optimization The results indicate that the key to optimization is to reduce the resource contention in allocate, reallocate, and deallocate. We apply two approaches. (1) First, we use static thread-local variables to replace the default stack variables so that they are initialized only once, which significantly reduces the number of allocate and deallocate calls. (2) Second, we change the data structure from unordered_map to a customized vector-based hashmap for tiny objects, which allocates memory statically to avoid frequent memory reallocation. With these optimizations, the performance and multi-threaded scalability improve significantly. As shown in Fig. 9.13, the performance of Vite is improved by 25.29× for eight threads, and the speedup increases from 0.56× to 1.46× for eight threads (two threads as baseline).
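As a minimal sketch of these two ideas (not Vite’s actual patch; the function and type names below are hypothetical), a static thread_local buffer is allocated once per thread and reused across calls, and a small vector-based map replaces unordered_map, so the hot path no longer goes through the locked shared-heap allocator.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Tiny vector-backed map for a small number of keys: lookups are linear,
// but the backing vector grows geometrically and is reused, so almost no
// allocate/reallocate/deallocate calls are issued on the hot path.
struct FlatMap {
  std::vector<std::pair<int64_t, double>> slots;
  void clear() { slots.clear(); }            // keeps the allocated capacity
  double& operator[](int64_t key) {
    for (auto& kv : slots)
      if (kv.first == key) return kv.second;
    slots.emplace_back(key, 0.0);
    return slots.back().second;
  }
};

double heaviest_neighbor_community(const std::vector<int64_t>& comm_ids,
                                   const std::vector<double>& weights) {
  // One buffer per thread, constructed on first use and reused afterwards:
  // no per-call construction/destruction, hence no allocator lock traffic.
  static thread_local FlatMap local_map;
  local_map.clear();
  for (std::size_t i = 0; i < comm_ids.size(); ++i)
    local_map[comm_ids[i]] += weights[i];

  double best = 0.0;
  for (const auto& kv : local_map.slots)
    if (kv.second > best) best = kv.second;
  return best;
}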

9.6 Related Work

Performance Tools Existing tools are based on either profiling or tracing. (1) Profiling-based tools collect performance data with very low overhead. mpiP [11] is a lightweight profiling library that provides statistical performance data for MPI functions. HPCToolkit [10], gprof [39], and VTune [40] are lightweight profilers for general applications and architectures. Arm MAP [41] and CrayPat [42] are performance analysis tools specially designed for the Arm and Cray X1 platforms, respectively. (2) Tracing-based tools collect rich information for in-depth analysis [43]. Based on Score-P [16], TAU [15, 44], Vampir [14, 45], and Scalasca [13, 46] provide visualization of the generated trace data and offer direct insights. Paraver [47–49] is a trace-based performance analyzer that offers great flexibility in data collection and analysis.

Performance Analysis To satisfy the demands of different scenarios, researchers have made great efforts in providing specific in-depth analyses. Böhme et al. [8]



propose an approach to identify the root cause of imbalance by replaying event traces forward and backward. Tallent et al. [50] propose a lightweight detection technique focusing on lock contention. Böhme et al. [17] and Schmitt et al. [18] detect performance bugs through critical-path analysis. Load imbalance analysis [12, 19, 51], critical-path analysis [17, 18], and other approaches [52] have been proposed to detect performance bugs.

Graph-Based Analysis STAT [21] designs a 3D-Trace/Space/Time Call Graph with stack traces for large-scale program debugging. Cypress [24] and ScalAna [20] generate graphs with program structure and runtime data for communication trace compression and scaling loss detection. wPerf [25] uses a wait-for graph with thread-level waiting events to identify bottlenecks. Spindle [26] builds a Memory-centric Control Flow Graph for efficient memory access monitoring. ProGraML [53] represents programs as directed multigraphs and leverages deep learning models for further analysis. These works use graphs to reveal information that complex program structures and dependences hide in traces and profiles.

9.7 Conclusion

In this chapter, we present PerFlow, a domain-specific programming framework for easing the implementation of in-depth performance analysis tasks. PerFlow provides a dataflow-based programming abstraction, which allows developers to build customized performance analysis tasks by describing the analysis process as a PerFlowGraph with built-in or user-defined passes. We first propose a Program Abstraction Graph (PAG) to represent the performance of parallel programs and then build passes with graph operations and algorithms. We also provide several paradigms, i.e., specific combinations of passes, for general and common analysis tasks. In addition, PerFlow provides easy-to-use Python APIs for programming. We evaluate PerFlow with both benchmarks and real-world applications. Experimental results show that PerFlow can effectively ease the implementation of performance analysis and provide insightful guidance for optimization.

References

1. Ao, Y., et al. (2017). 26 PFlops stencil computations for atmospheric modeling on Sunway TaihuLight. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS’17) (pp. 535–544). IEEE.
2. Ravikumar, K., Appelhans, D., & Yeung, P. K. (2019). GPU acceleration of extreme scale pseudo-spectral simulations of turbulence using asynchronism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’19) (pp. 1–22).



3. Zhang, H., et al. (2017). Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In Proceedings of the 2017 USENIX Conference on Annual Technical Conference (USENIX ATC’17) (pp. 181–193).
4. Huang, K., et al. (2021). Understanding and bridging the gaps in current GNN performance optimizations. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’21) (pp. 119–132).
5. Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
6. Armbrust, M., et al. (2015). Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1383–1394).
7. Petrini, F., Kerbyson, D. J., & Pakin, S. (2003). The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (SC’03). Phoenix, AZ, USA: ACM.
8. Böhme, D., et al. (2010). Identifying the root causes of wait states in large-scale parallel applications. In 2010 39th International Conference on Parallel Processing (pp. 90–100). IEEE.
9. Hidayetoğlu, M., et al. (2019). MemXCT: Memory-centric X-ray CT reconstruction with massive parallelization. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’19) (pp. 1–56).
10. Adhianto, L., et al. (2010). HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22(6), 685–701.
11. Vetter, J., & Chambreau, C. (2005). mpiP: Lightweight, scalable MPI profiling.
12. Tallent, N. R., Adhianto, L., & Mellor-Crummey, J. M. (2010). Scalable identification of load imbalance in parallel executions using call path profiles. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1–11). IEEE Computer Society.
13. Geimer, M., et al. (2010). The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience, 22(6), 702–719.
14. Nagel, W. E., et al. (1996). VAMPIR: Visualization and analysis of MPI resources.
15. Shende, S. S., & Malony, A. D. (2006). The TAU parallel performance system. The International Journal of High Performance Computing Applications, 20(2), 287–311.
16. Score-P homepage. Score-P Consortium. http://www.score-p.org
17. Böhme, D., et al. (2012). Scalable critical-path based performance analysis. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS’12) (pp. 1330–1340). IEEE.
18. Schmitt, F., Dietrich, R., & Juckeland, G. (2017). Scalable critical-path analysis and optimization guidance for hybrid MPI-CUDA applications. The International Journal of High Performance Computing Applications, 31(6), 485–498.
19. Böhme, D., Wolf, F., & Geimer, M. (2012). Characterizing load and communication imbalance in large-scale parallel applications. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), May 2012 (pp. 2538–2541). https://doi.org/10.1109/IPDPSW.2012.321
20. Jin, Y., et al. (2020). ScalAna: Automating scaling loss detection with graph analysis. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’20) (pp. 1–14).
21. Arnold, D. C., et al. (2007). Stack trace analysis for large scale debugging. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.
22. Bronevetsky, G., et al. (2010). AutomaDeD: Automata-based debugging for dissimilar parallel tasks. In 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN’10) (pp. 231–240). IEEE.
23. Bhattacharyya, A., Kwasniewski, G., & Hoefler, T. (2015). Using compiler techniques to improve automatic performance modeling. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation. San Francisco, CA, USA: ACM.



24. Zhai, J., et al. (2014). Cypress: Combining static and dynamic analysis for top-down communication trace compression. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 143–153). IEEE.
25. Zhou, F., et al. (2018). wPerf: Generic off-CPU analysis to identify bottleneck waiting events. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18) (pp. 527–543).
26. Wang, H., et al. (2018). Spindle: Informed memory access monitoring. In 2018 USENIX Annual Technical Conference (pp. 561–574).
27. Jin, Y., et al. (2022). PerFlow: A domain specific framework for automatic performance analysis of parallel applications. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 177–191). https://doi.org/10.1145/3503221.3508405
28. Culler, D. E. (1986). Dataflow architectures. Annual Review of Computer Science, 1(1), 225–253.
29. Williams, W. R., et al. (2016). Dyninst and MRNet: Foundational infrastructure for parallel tools. In Tools for High Performance Computing 2015 (pp. 1–16). Springer.
30. Schieber, B., & Vishkin, U. (1988). On finding lowest common ancestors: Simplification and parallelization. SIAM Journal on Computing, 17(6), 1253–1262.
31. Shi, T., et al. (2020). GraphPi: High performance graph pattern matching through effective redundancy elimination. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’20) (pp. 1–14). IEEE.
32. PAPI tools. http://icl.utk.edu/papi/software/.
33. Csardi, G., Nepusz, T., et al. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695, 1–9.
34. Bailey, D., et al. (1995). The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center.
35. Hayes, J. C., et al. (2006). Simulating radiating and magnetized flows in multiple dimensions with ZEUS-MP. The Astrophysical Journal Supplement Series, 165(1), 188.
36. LAMMPS: Large-scale Atomic/Molecular Massively Parallel Simulator. Available at: http://lammps.sandia.gov (2013).
37. Ghosh, S., et al. (2018). Distributed Louvain algorithm for graph community detection. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS’18) (pp. 885–895). IEEE.
38. Wei, L., & Mellor-Crummey, J. (2020). Using sample-based time series data for automated diagnosis of scalability losses in parallel programs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’20) (pp. 144–159).
39. Graham, S. L., Kessler, P. B., & McKusick, M. K. (1982). Gprof: A call graph execution profiler. ACM SIGPLAN Notices, 17(6), 120–126.
40. Reinders, J. (2005). VTune performance analyzer essentials. Intel Press.
41. January, C., et al. (2015). Allinea MAP: Adding energy and OpenMP profiling without increasing overhead. In Tools for High Performance Computing 2014 (pp. 25–35). Springer.
42. Kaufmann, S., & Homer, B. (2003). CrayPat: Cray X1 performance analysis tool. In Cray User Group (May 2003).
43. Becker, D., et al. (2007). Automatic trace-based performance analysis of metacomputing applications. In 2007 IEEE International Parallel and Distributed Processing Symposium (pp. 1–10). IEEE.
44. TAU homepage. University of Oregon. http://tau.uoregon.edu
45. Vampir homepage. Technical University Dresden. http://www.vampir.eu
46. Scalasca homepage. Jülich Supercomputing Centre and German Research School for Simulation Sciences. http://www.scalasca.org
47. Labarta, J., et al. (1996). DiP: A parallel program development environment. In European Conference on Parallel Processing (pp. 665–674). Springer.
48. Servat, H., et al. (2009). Detailed performance analysis using coarse grain sampling. In European Conference on Parallel Processing (pp. 185–198). Springer.



49. Paraver homepage. Barcelona Supercomputing Center. http://www.bsc.es/paraver
50. Tallent, N. R., Mellor-Crummey, J. M., & Porterfield, A. (2010). Analyzing lock contention in multithreaded applications. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’10) (pp. 269–280).
51. Gamblin, T., et al. (2008). Scalable load-balance measurement for SPMD codes. In International Conference for High Performance Computing, Networking, Storage and Analysis, 2008 (SC 2008) (pp. 1–12). https://doi.org/10.1109/SC.2008.5222553
52. Yu, T., & Pradel, M. (2016). SyncProf: Detecting, localizing, and optimizing synchronization bottlenecks. In Proceedings of the 25th International Symposium on Software Testing and Analysis (pp. 389–400).
53. Cummins, C., et al. (2020). ProGraML: Graph-based deep learning for program optimization and analysis. Preprint arXiv:2003.10536.

Chapter 10

Conclusion and Future Work

Although static analysis has several benefits and has shown great efficiency in many performance analysis scenarios, it still faces several challenges for real-world applications. In this chapter, we state some open problems for further research.

Alias Analysis Data dependences extracted by static analysis are very useful for many performance analysis tasks, for example, memory access monitoring in Chap. 4 and noise detection in Chap. 7. However, real-world applications written in C, C++, and Fortran make heavy use of pointers, which makes it difficult to extract data dependences. Alias analysis, or pointer analysis, is a classic static analysis technique used to establish points-to relationships between pointers and variables or storage locations. Although many alias analysis approaches, such as context-insensitive algorithms, flow-insensitive algorithms, etc., have been proposed, fully precise static alias analysis is still a challenging problem. In performance analysis, one possible solution is to use dynamic analysis to obtain partial pointer-related information and thus compensate for the precision loss of static alias analysis. The hybrid static-dynamic approach not only means that static analysis guides dynamic analysis but also that dynamic analysis can guide static analysis.
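The following fragment is a minimal sketch of the difficulty; it is not taken from any application discussed in this book, and all names are ours. Whether the loop carries a dependence between the two pointers depends entirely on the call site, so a static analysis that cannot prove the pointers disjoint must conservatively assume a dependence, while a dynamic run observes only one of the two cases.

#include <cstdio>

// Does b[i] alias an element that a later iteration writes? Statically
// unknown: the answer depends on the call site, so a conservative static
// analysis must assume a loop-carried dependence.
void scale(double* a, const double* b, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i] * 2.0;
}

int main() {
  double x[8] = {1, 1, 1, 1, 1, 1, 1, 1};
  double y[8];
  scale(y, x, 8);        // disjoint arrays: iterations are independent
  scale(x + 1, x, 7);    // overlapping arrays: each iteration reads the value
                         // written by the previous one (loop-carried dependence)
  std::printf("%f %f\n", y[0], x[7]);
  return 0;
}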




Input Dependency Most applications have input configuration files and input data files. Input configuration files contain parameters that determine which branches are executed, and thus which specific algorithms are used, including physical methods, data distribution, communication patterns, etc. The input data largely affect the load characteristics of applications, especially for sparse problems. Static analysis cannot handle these input configuration files and input data files, while dynamic analysis under a particular input only reflects one aspect of an application’s workload characteristics. How to combine static and dynamic analysis to characterize the holistic workload of applications needs further study.

Compiler Optimization Compiler optimizations can make the performance behavior of applications more difficult to understand. Compilers optimize applications at compile time, for example, by loop unrolling and inlining. These optimizations make the performance behavior of the executed instructions differ from the original source code, so it is difficult to directly map performance issues back to the original source code. In addition, after compiler optimizations, executable binaries contain less detail than the source code, such as branch control information.

Source Code Limitation Some real-world applications, for example in the military science and national security fields, are not available in source code for static analysis. Moreover, their binaries are often not compiled with debugging options, so the information that can be obtained during static analysis is very limited. For such cases, it is extremely challenging to use the limited static data to assist dynamic analysis.

In conclusion, this book presents a hybrid static-dynamic approach for efficient performance analysis of parallel applications on HPC systems. We demonstrate this hybrid approach with various performance analysis tasks, including communication analysis, memory analysis, scalability analysis, and noise analysis. Static analysis provides reliable insight from source code and binaries, which can effectively guide dynamic analysis and reduce runtime analysis overhead. In addition, static data is independent of problem size and system scale, allowing for good scalability. Therefore, the hybrid static-dynamic approach is a promising solution for future exascale HPC. Finally, we summarize four major open problems and future research directions for the hybrid static-dynamic approach: alias analysis, input dependency, compiler optimization, and source code limitation.