Network Management in Cloud and Edge Computing [1st ed. 2020] 9789811501371, 9811501378

Traditional cloud computing and the emerging edge computing have greatly promoted the development of Internet applications.


English Pages 152 [148] Year 2020


Table of Contents:
Preface
Book Organization
Acknowledgments
Contents
1 Introduction
1.1 Research Background
1.2 Content Summary
1.3 Key Contributions
1.4 Chapter Arrangement
References
2 A Survey of Resource Management in Cloud and Edge Computing
2.1 Latency Optimization for Cloud-Based Service Chains
2.2 Toward Shorter Task Completion Time
2.3 Container Placement and Reassignment for Large-Scale Network
2.4 Near-Optimal Network System for Data Replication
2.5 Distributed Edge Caching in Short Video Network
2.6 The Controllability of Dynamic Temporal Network
References
3 A Task Scheduling Scheme in the DC Access Network
3.1 Introduction
3.2 Dynamic Differentiated Service with Delay-Guarantee
3.2.1 The Components of Latency
3.2.2 Design Philosophy
3.2.3 D3G Framework
3.2.4 Adjusting Resource Allocation
3.3 Deployment
3.4 D3G Experiment
3.4.1 Overall Performance
3.4.2 Algorithm Dynamism
3.4.3 System Scalability
3.5 Conclusion
References
4 A Cross-Layer Transport Protocol Design in the Terminal Systems of DC
4.1 Introduction
4.2 TAFA's Control Scheme
4.3 Key Challenges
4.4 TAFA: Task-Aware and Flow-Aware
4.4.1 Task-Awareness
4.4.1.1 End-Host Operations
4.4.1.2 Switch Operations
4.4.1.3 Multiple Priority Queues
4.4.2 Flow-Awareness
4.4.3 Algorithm Implementation
4.5 System Stability
4.6 TAFA Experiment
4.6.1 Setup
4.6.2 Overall Performance of TAFA
4.6.3 TAFA vs. Task-Aware
4.6.4 TAFA vs. Flow-Aware
4.7 Conclusion
References
5 Optimization of Container Communication in DC Back-End Servers
5.1 Container Group-Based Architecture
5.2 Problem Definition
5.2.1 Objective
5.2.1.1 Communication Cost
5.2.1.2 Resource Utilization Cost
5.2.1.3 Residual Resource Balance Cost
5.2.2 Constraints
5.3 Container Placement Problem
5.3.1 Problem Analysis
5.3.2 CA-WFD Algorithm
5.4 Container Reassignment Problem
5.4.1 Problem Analysis
5.4.2 Sweep&Search Algorithm
5.4.2.1 Sweep
5.4.2.2 Search
5.5 Implementation
5.6 Experiment
5.6.1 Performance of CA-WFD
5.6.1.1 Algorithm Performance
5.6.1.2 Algorithm Variations
5.6.2 Performance of Sweep&Search
5.6.2.1 Algorithm Performance
5.6.2.2 Algorithm Efficiency
5.7 Approximation Analysis of Sweep&Search
5.8 Conclusion
References
6 The Deployment of Large-Scale Data Synchronization System for Cross-DC Networks
6.1 Motivation of BDS+ Design
6.1.1 Baidu's Inter-DC Multicast Workload
6.1.2 Potentials of Inter-DC Application-Level Overlay
6.1.3 Limitations of Existing Solutions
6.1.4 Key Observations
6.2 System Overview
6.3 Near-Optimal Application-Level Overlay Network
6.3.1 Basic Formulation
6.3.2 Decoupling Scheduling and Routing
6.3.3 Scheduling
6.3.4 Routing
6.4 Dynamic Bandwidth Separation
6.4.1 Design Logic
6.4.2 Integrated to BDS+
6.4.2.1 Online Traffic Prediction Algorithm
6.4.2.2 Dynamic Bandwidth Separation
6.5 System Design
6.5.1 Centralized Control of BDS+
6.5.2 Dynamic Bandwidth Separation of BDS+
6.5.3 Fault Tolerance
6.5.4 Implementation and Deployment
6.6 BDS+ Experiment
6.6.1 BDS+ Over Existing Solutions
6.6.1.1 Methodology
6.6.1.2 BDS+ vs. Gingko
6.6.1.3 BDS+ vs. Other Overlay Multicast Techniques
6.6.2 Micro-benchmarks
6.6.2.1 Scalability
6.6.2.2 Fault Tolerance
6.6.2.3 Choosing the Values of Key Parameters
6.6.2.4 In-Depth Analysis
6.6.3 BDS+'s Dynamic Bandwidth Separation
6.6.3.1 Further Improvements Over BDS+
6.6.3.2 BDS+'s Prediction Algorithm
6.7 Conclusion
Appendix
References
7 Storage Issues in the Edge
7.1 Introduction
7.1.1 The Characteristics of Edge Caching in Short Video Network
7.1.2 Limitations of Existing Solutions
7.2 AutoSight Design
7.2.1 System Overview
7.2.2 Correlation-Based Predictor: CoStore
7.2.3 Caching Engine: Viewfinder
7.3 AutoSight Experiment
7.3.1 Experiment Setting
7.3.2 Performance Comparison
7.4 Conclusion
References
8 Computing Issues in the Edge
8.1 Background
8.2 DND: Driver Node Algorithm
8.2.1 Parameters and Variable Declarations
8.2.2 Modeling
8.2.3 Abstraction of Topology
8.3 DND Experiment
8.3.1 Communication Radius
8.3.2 Nodes Density
8.3.3 Nodes Velocity
8.3.4 Control Time
8.4 Conclusion
References



Yuchao Zhang • Ke Xu

Network Management in Cloud and Edge Computing

Yuchao Zhang Beijing University of Posts and Telecomm Beijing, China

Ke Xu Tsinghua University Beijing, China

ISBN 978-981-15-0137-1    ISBN 978-981-15-0138-8 (eBook)
https://doi.org/10.1007/978-981-15-0138-8

© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Both cloud computing and edge computing have their specific advantages, and this book addresses the challenges in both.

Traditional cloud services provide more and more convenient services to network users, but resource heterogeneity and differing server configurations pose serious challenges to network performance. These challenges lie in the overall procedure of application processing. First, when an end user sends a request to a network application, the server should perform access control and decide whether to accept it. Then the server should schedule all the accepted requests and perform transmission control. While a request is being served, it calls for server communication and coprocessing to satisfy the required functions. For a request that needs cross-region services, different IDCs (Internet data centers) have to perform information synchronization and data transmission. Finally, the computed results are sent back in response to the request. To provide high performance and shorter latency to network users, this book clarifies the overall procedure of network application requests and, for each step, proposes a customized solution and optimization scheme.

The emerging edge computing provides obvious advantages at the edge by enabling computing and caching near end users. But due to the limitations of edge servers, the design principles are quite different from those in the cloud. This book focuses on both the computation and storage problems and provides general mechanisms to address these challenges.

Overall, this book first analyzes the working procedures of cloud computing and edge computing and then provides detailed solutions for both trends. It can contribute to both the traditional data center network and the emerging edge network.


Book Organization

This book contains eight chapters organized in two logical parts. Despite the continuity and contrast between chapters, we have tried to keep each chapter self-contained to maximize reading flexibility.

Chapter 1 Data centers have many features, such as a virtualized resource environment, modular infrastructure, automated operation and maintenance management, rapid expansion capability, efficient resource utilization, and reliable redundant backup. Edge computing, in turn, has latency advantages due to its short distance to end users. Meanwhile, a variety of business applications have exploded, placing higher demands on the basic functions and service performance of edge servers. This chapter briefly introduces these two trends.

Chapter 2 This chapter gives a comprehensive survey of related issues in cloud and edge computing, including (1) the whole cloud computing process, starting from service access to the data center, to data transmission control, to server back-end communication, and to data synchronization service support, tracking the complete flow of a service request, and (2) the two key issues at the edge: storage and computing. We carry out an in-depth analysis of related work for each of these topics.

Chapter 3 This chapter designs a new type of task access control mechanism, which performs delay prediction on each intermediate server to pre-calculate the overall response delay of a request and redistributes and adjusts data center resources according to the estimated response delay, so that the estimated delay of the load stream falls within the user's patience function. On this basis, this chapter also introduces a delay feedback mechanism to ensure that non-interactive load streams are not greatly affected, ensuring algorithm fairness.

Chapter 4 This chapter designs an efficient transmission protocol between the end system and the end user: a dual-aware scheduling mechanism (with both task-level awareness and data stream-level awareness) in the end system that cooperates with an improved ECN mechanism at the terminal to minimize the completion time of multitasking. Specifically, this chapter first studies the potential impact of a scheduling mechanism that considers both the task level and the data flow level on data center task scheduling, which leads to the task-aware and flow-aware (TAFA) scheduler and a data transfer protocol with improved ECN. Through a task scheduling strategy with priority adjustment, data streams and tasks are serialized more reasonably, and resource competition among tasks is minimized.

Chapter 5 This chapter addresses the communication issues within the container group of the same application by means of container redistribution between hosts. Specifically, this chapter designs a redistribution mechanism, FreeContainer, that can sense communication between containers using a novel two-stage online adjustment algorithm. In the first stage, some hosts are vacated to reserve more space for the next-stage redistribution algorithm. The second-stage container redistribution algorithm utilizes an improved variable neighborhood search algorithm to find a


better distribution scheme. The FreeContainer algorithm does not require hardware modifications and is completely transparent to online applications. It was deployed on Baidu's server clusters, and we performed extensive measurements and evaluations in a real network environment. The results on online application requests indicate that, compared with current adjustment algorithms, FreeContainer can significantly reduce the communication overhead between containers and increase the throughput of the cluster under traffic bursts.

Chapter 6 This chapter introduces the BDS system, a transmission scheduling platform for large-scale data synchronization tasks. Specifically, BDS uses a centrally controlled approach to coordinate multiple scheduling tasks. By implementing the algorithm, BDS is able to dynamically allocate bandwidth resources between different transport tasks, maximizing traffic according to task priorities. To verify its transmission performance, BDS has been deployed and evaluated on Baidu's intra-domain network and compared with current techniques.

Chapter 7 This chapter presents AutoSight, a distributed edge caching system for short video network, which significantly boosts cache performance. AutoSight consists of two main components, addressing two corresponding challenges: (i) the CoStore predictor, which handles the non-stationarity and unpredictability of local access patterns by analyzing the complex correlations among videos, and (ii) a caching engine, Viewfinder, which handles the temporal and spatial video popularity patterns by automatically adjusting its future sight according to video life spans. All these insights and experiments are based on real traces of more than 28 million videos with 100 million accesses from 488 servers located in 33 cities. Experiment results show that AutoSight brings significant improvements to distributed edge caching in short video network.

Chapter 8 This chapter proposes a method for computing controllability in edge networks by analyzing the controllability of dynamic temporal networks. It also calculates the minimum number of driver nodes needed to ensure network controllability. These insights are critical for a variety of applications in the future edge network.

Beijing, China
January 2019

Yuchao Zhang Ke Xu

Acknowledgments

A significant part of this book is the result of work performed in the Department of Computer Science and Technology, Tsinghua University. Some parts of the material presented in this book stem from work performed at the Hong Kong University of Science and Technology (HKUST) and Beijing University of Posts and Telecommunications (BUPT). We are grateful to all the partners, faculty, and students. Some implementation work was done at Baidu, Huawei, and Kuaishou; we thank them for their efficient cooperation. Also, thanks to the students in my group, namely Shuang Wu, Pengmiao Li, and Peizhuang Cong, for their help and contributions.

Finally, we hope this book will be useful to researchers, students, and other participants working on related topics. It would be great if we can inspire them to make their own contributions to the cloud and edge computing field!



Chapter 1

Introduction

Abstract Data centers have many features, such as a virtualized resource environment, modular infrastructure, automated operation and maintenance management, rapid expansion capability, efficient resource utilization, and reliable redundant backup. Edge computing, in turn, has latency advantages due to its short distance to end users. Meanwhile, a variety of business applications have exploded, placing higher demands on the basic functions and service performance of edge servers. This book analyzes these two trends in detail. This chapter introduces the research background, content summary, key contributions, and the arrangement of the remaining chapters.

1.1 Research Background

The rapid growth of service chains is changing the landscape of cloud-based applications. Different stand-alone components are now handled by cloud servers, providing cost-efficient and reliable services to Internet users. The workloads from service chains are more complex than traditional non-interactive (or batch) workloads: non-interactive workloads, such as scientific computing and image processing, can be processed on one specific server and do not need interactions with other servers. Not being strictly time-sensitive, they can be scheduled to run anytime as long as the work finishes before a soft deadline. Interactive workloads from service chains, in contrast, have to go through multiple stand-alone components to apply the corresponding functions, which unavoidably introduces additional latency. Meanwhile, these interactive chained services typically process real-time user requests, such as business transactions and complex gaming control. The performance of service chains therefore urgently needs to be ensured.

Nowadays, data centers have become the cornerstones of modern computing infrastructure and a dominating paradigm in the externalization of IT resources. Data center tasks always consist of rich and complex flows which traverse different parts of the network at potentially different times. To minimize the network contention among different tasks, task serialization has been widely suggested. This


approach applies a task-level metric and aims to serve one task at a time with synchronized network access. While serialization is a smart design to avoid task-level interference, our study shows that the flow-level network contention within a task can, however, largely affect the task completion time. This prolongs the tail as well as the average task completion time and unavoidably reduces the system's applicability to delay-sensitive applications.

Containerization [1] has become a popular virtualization technology owing to many promising properties, such as being lightweight, scalable, highly portable, and well isolated, and the emergence of software containerization tools, e.g., Docker [2], further allows users to create containers easily on top of any infrastructure. Therefore, more and more Internet service providers are deploying their services in the form of containers in modern data centers.

For large-scale online service providers, such as Google, Facebook, and Baidu, an important data communication pattern is inter-DC multicast of bulk data – replicating massive amounts of data (e.g., user logs, web search indexes, photo sharing, blog posts) from one DC to multiple DCs in geo-distributed locations. Our study on the workload of Baidu shows that inter-DC multicast already amounts to 91% of inter-DC traffic, which corroborates the traffic patterns of other large-scale online service providers [3, 4]. As more DCs are deployed globally and bulk data keep exploding, inter-DC traffic needs to be replicated in a frequent and efficient manner.

While there have been tremendous efforts toward better inter-DC network performance (e.g., [3, 5–9]), the focus has been on improving the performance of the wide area network (WAN) path between each pair of DCs. These WAN-centric approaches, however, are incomplete, as they fail to leverage the rich application-level overlay paths across geo-distributed DCs, as well as the capability of servers to store and forward data. As illustrated in Fig. 1.1, the performance of inter-DC multicast could be substantially improved by sending data in parallel via multiple overlay servers acting as intermediate points to circumvent slow WAN paths and performance bottlenecks in DC networks.

[Figure 1.1 layout: (a) Direct replication over pair-wise inter-DC WANs between DC A, DC B, and DC C; (b) Replication leveraging overlay paths through overlay servers.]

Fig. 1.1 A simple network topology illustrating how overlay paths reduce inter-DC multicast completion time. Assume that the WAN link between any two DCs is 1 GB/s and that A wants to send 3 GB of data to B and C. Sending data from A to B and C separately takes 3 s (a), but using overlay paths A → B → C and A → C → B simultaneously takes only 2 s (b). The circled numbers show the order in which each data piece is sent
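To make the arithmetic in Fig. 1.1 concrete, the following Python sketch replays the two strategies on the topology from the caption. It is illustrative only: the 1 GB block size, the time-slotted schedule, and the link model are assumptions taken from the caption, not a description of any system presented later in the book.

```python
# Illustrative sketch of the Fig. 1.1 example (assumed parameters: 1 GB/s links,
# 3 GB of data split into 1 GB blocks b1, b2, b3). The schedule is hand-written
# to show why store-and-forward overlay paths help; it is not a real scheduler.

LINK_RATE_GBPS = 1.0   # every inter-DC link: 1 GB/s
BLOCK_GB = 1.0         # block size
BLOCKS = {"b1", "b2", "b3"}

def direct_replication():
    # A pushes the full 3 GB to B and to C over two separate WAN links in parallel.
    return len(BLOCKS) * BLOCK_GB / LINK_RATE_GBPS        # 3.0 s

def overlay_replication():
    # Time-slotted schedule (each link carries at most one block per slot):
    #   slot 1: A->B b1, A->C b2
    #   slot 2: A->B b3, A->C b3, B->C b1, C->B b2
    schedule = [
        [("A", "B", "b1"), ("A", "C", "b2")],
        [("A", "B", "b3"), ("A", "C", "b3"), ("B", "C", "b1"), ("C", "B", "b2")],
    ]
    have = {"A": set(BLOCKS), "B": set(), "C": set()}
    for slot, transfers in enumerate(schedule, start=1):
        start_of_slot = {node: set(blocks) for node, blocks in have.items()}
        for src, dst, blk in transfers:
            # Store-and-forward: a node may only forward blocks it already holds.
            assert blk in start_of_slot[src]
            have[dst].add(blk)
        if have["B"] == BLOCKS and have["C"] == BLOCKS:
            return slot * BLOCK_GB / LINK_RATE_GBPS
    raise RuntimeError("multicast not finished")

print(direct_replication())    # 3.0 s, as in Fig. 1.1a
print(overlay_replication())   # 2.0 s, as in Fig. 1.1b
```

The overlay schedule finishes earlier because A pushes distinct blocks on both of its links in the first slot, and the receivers forward those blocks to each other in the second slot, so the source's uplinks and the inter-receiver links are busy at the same time.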


It is important to notice that these overlay paths should be bottleneck-disjoint; that is, they do not share common bottleneck links (e.g., A → B → C and A → C → B in Fig. 1.1), and such bottleneck-disjoint overlay paths are available in abundance in geo-distributed DCs.

In recent years, many short video platforms have been developing at an incredible speed (such as Kuaishou [10], Douyin/TikTok [11], YouTube Go [12], Instagram Stories [13], and so on). These platforms allow users to record and upload very short videos (usually within 15 s). As a result of the massive number of short videos uploaded by distributed users, the caching problem is becoming more challenging than in the traditional centralized Content Delivery Network (CDN), whose traffic is dominated by a few popular items for long periods of time [14]. To handle scalability and improve users' Quality of Experience (QoE), the emerging edge computing naturally matches the demand for distributed storage, and the abovementioned platforms have thus resorted to edge caching servers to store and deliver the massive short videos, so that not all requests have to be fetched from the back-end/origin server, which usually introduces extra user-perceived latency.

In recent years, with the rapid development of Internet of Things technology, a variety of smart devices that can be connected to the network have emerged. Some of these smart devices differ from traditional ones in being highly mobile, such as fast-moving vehicles in the Internet of Vehicles, which makes the Internet of Vehicles a kind of dynamic network. In fact, dynamic networks exist in all aspects of our lives, such as delay-tolerant networks, opportunistic-mobility networks, social networks, friendship networks, etc. [15, 16]. The topology of these networks is not static but changes over time, with connections among nodes established or destroyed irregularly. The resulting rapidly changing topology makes it difficult to maintain controllability of the entire network. To address this challenge, it is necessary to propose a way to analyze and solve the controllability of dynamic networks.

1.2 Content Summary

In this chapter, we present the measurement and analysis of interactive workloads on Baidu's cloud platform. The real-world trace indicates that these interactive workloads suffer from significantly longer latency than general non-interactive workloads. A typical case is shown in Fig. 1.2a, where Nuomi is a group-buying application, Waimai is a take-out service, and Alipay is an online payment platform. When a user clicks an item on Nuomi, the latency is quite short because this query does not require many interactions among services. However, the story is different when the user orders a take-out and purchases the item. In this case, the request goes through Nuomi, Waimai, and then Alipay. In other words, this interactive workload consists of several highly dependent operations which have to be processed on different servers separately, as in Fig. 1.2b: there are six procedures for interactive workloads and only two procedures for non-interactive

[Figure 1.2 layout: (a) Data flow of a user request across nuomi.com, waimai.baidu.com, and alipay.com; (b) Interactive vs. non-interactive processing of a request through Services 1–3.]

Fig. 1.2 The processes of interactive and non-interactive workloads

workloads. It is easy to see that such interactive workloads in chained applications introduce extra latency to users because these requests are handled by different services multiple times. Unfortunately, we find that most of the existing workload scheduling approaches aim to reschedule [17] or to leverage different priorities [18, 19] on individual queues. In other words, these optimizations are made on intermediate servers separately, so the overall latency of interactive workloads remains unpredictable. To better optimize the overall latency of chained services, we apply a latency estimation approach to predict the overall latency and try to accelerate the interactive workloads. Furthermore, we design a feedback scheme to ensure workload fairness and avoid remarkable degradation of non-interactive workloads. Our real-world deployments at Baidu indicate that the proposed delay-guarantee (D3G) framework can successfully reduce the latency of interactive applications with minimal effect on other workloads.

In this chapter, we for the first time investigate the potential of considering both flow-level and task-level interference together for data center task scheduling. We provide the design of TAFA (task-aware and flow-aware) to obtain better serialization and minimize possible flow and task contentions. TAFA adopts dynamic priority adjustment for task scheduling. Different from FIFO-LM [20], this design can successfully emulate shortest-task-first scheduling while requiring no prior knowledge about the task. Further, TAFA gives a more reasonable and efficient approach to reducing task completion time by considering the relationship among different flows in one task, rather than treating them all the same. With this intelligent adjustment at the flow level, TAFA provides shorter flow waiting times, leading to earlier finish times. As the total completion time of a task consists of waiting time and processing time, TAFA for the first time combines the two aspects together.
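To illustrate how priority adjustment can emulate shortest-task-first scheduling without prior knowledge of task sizes, here is a minimal multi-level-feedback sketch: a task is demoted to a lower priority as its cumulative traffic grows, so short tasks naturally finish at high priority. The queue count and demotion thresholds are assumptions for illustration only; TAFA's actual adjustment rules are presented in Chap. 4.

```python
# Generic multi-level-feedback sketch: tasks start at the highest priority and are
# demoted as their cumulative sent bytes cross thresholds, so short tasks complete
# at high priority (emulating shortest-task-first without knowing sizes in advance).
# The thresholds and queue count below are made-up illustration values.

DEMOTION_THRESHOLDS_MB = [10, 100, 1000]       # queue 0 -> 1 -> 2 -> 3

def priority(sent_mb: float) -> int:
    level = 0
    for threshold in DEMOTION_THRESHOLDS_MB:
        if sent_mb >= threshold:
            level += 1
    return level                                # 0 is the highest priority

def pick_next(tasks: dict) -> str:
    """tasks: task_id -> cumulative MB already sent.
    Serve the task in the highest-priority (lowest-numbered) queue; ties are
    broken FIFO by insertion order, which Python dicts preserve."""
    return min(tasks, key=lambda t: priority(tasks[t]))

tasks = {"query-A": 2.0, "backup-B": 850.0, "index-C": 45.0}
print(pick_next(tasks))    # "query-A": it has sent the least, so it stays at priority 0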


To achieve shorter processing time, TAFA solves the dominant-resource matching problem and classifies requirements and VMs into different resource-dominant types. By allocating resource-dominant VMs to the corresponding resource-dominant requirements, we can achieve shorter processing times.

Generally, each Internet service has several modules which are instantiated as a set of containers, and the containers belonging to the same service often need to communicate with each other to deliver the desired service [21–24], resulting in heavy cross-server communication and degraded service performance [19, 21]. If these containers are placed on the same server, the communication cost can be greatly reduced. However, the containers belonging to the same service are generally intensive in the same resource (e.g., containers of big data analytics services [25–27] are usually CPU-intensive, and containers of data transfer applications [4, 28–31] are usually network I/O-intensive). Assigning these containers to the same server may cause heavily imbalanced resource utilization of servers, which could affect system availability, response time, and throughput [32, 33]. Balanced utilization is desirable for three reasons. First, it prevents any single server from getting overloaded or breaking down, which improves service availability. Second, servers usually exhibit exponential response times when resource utilization is high [34]; load balancing guarantees acceptable resource utilization, so that servers can respond quickly. Third, no server becomes a bottleneck under balanced workload, which improves the overall throughput of the system.

Figure 1.3 shows an example. Suppose there are two services (denoted by SA and SB) to be deployed on two servers. Each service has two containers (CA1, CA2 and CB1, CB2, respectively). The containers of SA are CPU-intensive, while the containers of SB are network I/O-intensive. Figure 1.3a shows a solution which assigns one of SA's containers and one of SB's containers to each server. This approach achieves high resource utilization of both CPU and network I/O but incurs high communication cost between the two servers. Figure 1.3b shows another solution, where the containers of the same service are assigned to the same server. The communication overhead is thus significantly reduced; however, the utilization of CPU and network I/O is highly imbalanced across the two servers.


Fig. 1.3 The conflict between container communication and server resource utilization. (a) High network communication overhead, balanced resource utilization. (b) Imbalanced resource utilization, low network communication overhead

Table 1.1 Resource utilization in a data center from Baidu (http://www.baidu.com)

Resource | Top 1% | Top 5% | Top 10% | Mean
CPU      | 0.943  | 0.865  | 0.821   | 0.552
MEM      | 0.979  | 0.928  | 0.890   | 0.626
SSD      | 0.961  | 0.927  | 0.875   | 0.530

We further explore the conflict between container communication and resource utilization in a data center with 5,876 servers from Baidu. To the best of our knowledge, the containers of the same service in this data center are placed as close as possible in order to reduce communication cost. Table 1.1 gives the top 1%, top 5%, and top 10% CPU, MEM (memory), and SSD (solid-state drive) utilization of servers in this data center, which shows that resource utilization is highly imbalanced among servers.

Reducing container communication cost while keeping server resource utilization balanced is never an easy problem. In this chapter, we try to address this conflict in large-scale data centers. Specifically, the conflict lies in two related phases of an Internet service's life cycle, i.e., container placement and container reassignment, and we accordingly study two problems. The first is the container placement problem, which strives to place a set of newly instantiated containers into a data center. The objective of this phase is to balance resource utilization while minimizing the communication cost of these containers after placement. The second is the container reassignment problem, which tries to optimize a given placement of containers by migrating containers among servers. Such a reassignment approach can be used for online periodic adjustment of the placement of containers in a data center. We formulate these two problems as multi-objective optimization problems, which are both NP-hard. For the container placement problem, we propose an efficient Communication-Aware Worst Fit Decreasing (CA-WFD) algorithm, which subtly extends the classical Worst Fit Decreasing bin packing algorithm to container placement. For the container reassignment problem, we propose a two-stage algorithm named Sweep&Search which can seek a container migration plan efficiently. We deploy our algorithms in Baidu's data centers and conduct extensive experiments to evaluate the performance. The results show that the proposed algorithms can effectively reduce the communication cost while simultaneously balancing resource utilization among servers in real systems. The results also show that our algorithms outperform the state-of-the-art strategies used by some top containerization service providers by up to 90%.
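As a rough illustration of the placement flavor behind CA-WFD, the sketch below sorts containers by demand (the "decreasing" part) and scores each feasible server by its residual capacity (the "worst fit" part) plus the traffic the container exchanges with containers already placed there (the "communication-aware" part). The single resource dimension, the pairwise traffic matrix, and the weight alpha are simplifying assumptions; the actual multi-resource algorithm is defined in Chap. 5.

```python
# Illustrative communication-aware worst-fit-decreasing placement (CA-WFD-style).
# Assumptions (hypothetical): one resource dimension, a symmetric traffic matrix
# between containers, and a weight `alpha` trading off balance vs. co-location.

def ca_wfd(containers, demand, traffic, servers, capacity, alpha=0.5):
    """containers: list of ids; demand[c]: resource demand; traffic[(a, b)]: volume;
    servers: list of ids; capacity[s]: server capacity. Returns {container: server}."""
    placement, used = {}, {s: 0.0 for s in servers}
    # Worst Fit Decreasing: consider the largest containers first.
    for c in sorted(containers, key=lambda c: demand[c], reverse=True):
        best, best_score = None, None
        for s in servers:
            if used[s] + demand[c] > capacity[s]:
                continue                                      # capacity constraint
            residual = capacity[s] - used[s] - demand[c]      # worst-fit term
            affinity = sum(traffic.get((c, o), 0.0) + traffic.get((o, c), 0.0)
                           for o, srv in placement.items() if srv == s)
            score = residual + alpha * affinity               # communication-aware term
            if best_score is None or score > best_score:
                best, best_score = s, score
        if best is None:
            raise ValueError(f"no server can host container {c}")
        placement[c] = best
        used[best] += demand[c]
    return placement

# Tiny usage example with the two services of Fig. 1.3.
demand = {"CA1": 4, "CA2": 4, "CB1": 2, "CB2": 2}
traffic = {("CA1", "CA2"): 10.0, ("CB1", "CB2"): 10.0}
print(ca_wfd(list(demand), demand, traffic, ["s1", "s2"], {"s1": 8, "s2": 8}))
```

With these made-up numbers the chatty pairs end up co-located (as in Fig. 1.3b); lowering alpha shifts the decision toward the balanced layout of Fig. 1.3a.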


This chapter first introduces BDS+, an application-level centralized near-optimal network system, which splits data into fine-grained units and sends them in parallel via bottleneck-disjoint overlay paths with dynamic bandwidth sharing. These paths are selected dynamically in response to changes in network conditions and the data delivery status of each server. Note that BDS+ selects application-level overlay paths and is therefore complementary to network-layer optimization of WAN performance. While application-level multicast overlays have been applied in other contexts (e.g., [35–38]), building one for inter-DC multicast traffic poses two challenges. First, as each DC has tens of thousands of servers, the resulting large number of possible overlay paths makes it unwieldy to update overlay routing decisions at scale in real time. Prior work either relies on local reactive decisions by individual servers [39–41], which leads to suboptimal decisions for lack of global information, or restricts itself to strictly structured (e.g., layered) topologies [42], which fails to leverage all possible overlay paths. Second, even a small increase in the delay of latency-sensitive traffic can cause significant revenue loss [43], so the bandwidth usage of inter-DC bulk-data multicasts must be tightly controlled to avoid negative impact on other latency-sensitive traffic.

To address these challenges, BDS+ fully centralizes the scheduling and routing of inter-DC multicast. Contrary to the intuition that servers must retain certain local decision-making to achieve desirable scalability and responsiveness to network dynamics, BDS+'s centralized design is built on two empirical observations (Sect. 6.2): (1) while it is hard to make centralized decisions in real time, most multicast data transfers last for at least tens of seconds and thus can tolerate slightly delayed decisions in exchange for near-optimal routing and scheduling based on a global view; (2) centrally coordinated sending rate allocation is amenable to minimizing the interference between inter-DC multicast traffic and latency-sensitive traffic.

The key to making BDS+ practical is updating the overlay network in near real time (within a few seconds) in response to performance churns and dynamic arrivals of requests. BDS+ achieves this by decoupling its centralized control into two optimization problems: scheduling of data transfers and overlay routing of individual data transfers. Such decoupling attains provable optimality and, at the same time, allows BDS+ to update overlay routing and scheduling in a fraction of a second; this is four orders of magnitude faster than solving routing and scheduling jointly for the workload of a large online service provider (e.g., sending 10^5 data blocks simultaneously along 10^4 disjoint overlay paths).

In practice, there is always a fixed upper bound on the bandwidth available for bulk-data multicast, because the multicast overlay network shares the same inter-DC WAN with online latency-sensitive traffic. Existing solutions always reserve a fixed amount of bandwidth for the latency-sensitive traffic, according to its peak value. This guarantees strict bandwidth separation, but the side effect is wasted bandwidth, especially when the online traffic is in its valley. To further improve link utilization, BDS+ implements dynamic bandwidth separation that can predict online traffic and reschedule bulk-data transfers. In other words, BDS+ achieves dynamic bandwidth separation between bulk-data multicast and online traffic to further speed up data transfer.

We have implemented a prototype and integrated it in Baidu. We first deployed BDS+ in 10 DCs and ran a pilot study on 500 TB of data transfer for 7 days (about 71 TB per day). Our real-world experiments show that BDS+ achieves 3–5× speedup over Baidu's existing solution named Gingko, and it can eliminate the incidents of excessive bandwidth consumption by bulk-data transfers. Using micro-benchmarking, we show that BDS+ outperforms techniques widely used in CDNs, that BDS+ can handle the workload of Baidu's inter-DC multicast traffic with one general-purpose server, and that BDS+ can handle various failure scenarios.1 We then use trace-driven simulations to evaluate BDS+ with dynamic bandwidth separation; the results show that BDS+ further speeds up bulk data transfer by 1.2 to 1.3 times in networks where online and offline services are deployed together.
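The following sketch contrasts static and dynamic bandwidth separation on a single inter-DC link. The moving-average predictor, the safety headroom, and all the numbers are assumptions for illustration; BDS+'s online traffic prediction algorithm is described in Chap. 6.

```python
# Sketch of dynamic vs. static bandwidth separation on one inter-DC link.
# Assumptions: a trivial moving-average predictor for online (latency-sensitive)
# traffic and a fixed safety headroom; the real predictor is more involved.

def predict_online_gbps(history, window=3):
    recent = history[-window:]
    return sum(recent) / len(recent)                 # naive moving average

def bulk_share(capacity_gbps, history, headroom_gbps=1.0):
    predicted = predict_online_gbps(history)
    # Give bulk traffic only what the predicted online traffic (plus headroom)
    # leaves free, and never a negative share.
    return max(0.0, capacity_gbps - predicted - headroom_gbps)

capacity = 10.0                                      # Gb/s, assumed link capacity
peak_online = 7.0                                    # Gb/s, assumed peak of online traffic
online_history = [6.8, 5.2, 3.1, 2.9, 3.0]           # online traffic is in its valley

static_share = capacity - peak_online                # peak-based separation: 3.0 Gb/s
dynamic_share = bulk_share(capacity, online_history) # 6.0 Gb/s while online traffic is low
print(static_share, dynamic_share)
```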

There have been tremendous efforts toward better caching performance in the traditional CDN, and these caching algorithms can be classified into two categories: the simple but effective reactive caching algorithms, such as First-In First-Out (FIFO), Least Recently Used (LRU), Least Frequently Used (LFU), k-LRU, and their variants, and the proactive caching algorithms, such as DeepCache [44]. These prior solutions work well in the traditional centrally controlled CDN but become ineffective in the emerging short video network due to the following two essential differences.

First, the non-stationary user video access pattern. The basic assumption of reactive caching policies is a stationary user access pattern, i.e., recently or frequently requested content should be kept in cache because such content is assumed to have a greater chance of being visited in the future. However, a study [14] shows that in short video network, popular content expires very quickly (within tens of minutes), indicating that past popularity cannot represent future popularity; this is the root cause of the failure of these reactive policies (see Sect. 7.1.1).

Second, the temporal and spatial video popularity pattern, i.e., the change of video popularity varies across edge caching servers and across time periods. Our study on the workload of Kuaishou shows that it takes less than 1 h for a popular video to become unpopular during peak hours but more than 3 h late at night. Existing proactive caching policies that try to predict future content popularity always focus on a fixed sight, making them fail in edge caching scenarios.

To address the above challenges, this chapter presents AutoSight, a distributed caching mechanism that works on edge caching servers for short video network. AutoSight allows edge servers to retain their respective local caching sights to adapt to local video access patterns and video life spans. AutoSight's distributed design is built on two empirical observations: (1) although the historical video access data on an individual edge server is non-stationary (see Fig. 7.1), making popularity prediction difficult, there are sufficient correlations among videos within the same edge server, because users tend to request related videos; these cross visits improve distributed prediction; and (2) the temporal and spatial video popularity pattern challenges the future sight of a caching policy (see Fig. 7.2), but a distributed design allows adaptive future sights, enabling edge servers to make decisions according to different video expiration speeds.
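The sketch below casts the two observations into caching terms: a video's score combines its own recent accesses with those of correlated videos (the CoStore intuition), and the prediction horizon, the "sight", is derived from locally observed life spans (the Viewfinder intuition). The data structures, the 0.5 cross-visit weight, and the median-based horizon are assumptions for illustration, not the algorithms of Chap. 7.

```python
# Sketch of an AutoSight-style caching decision on one edge server.
# Assumptions (not from the book): the median observed life span sets the future
# "sight"; a video's score is its own recent accesses plus a discounted share of
# accesses to correlated videos, scaled down as the video nears expiration.

from statistics import median

def future_sight(life_spans_min):
    # Adaptive horizon: look ahead roughly as long as a typical local video
    # stays popular (e.g., <1 h during peak hours, ~3 h late at night).
    return median(life_spans_min)

def score(video, recent_accesses, correlated, age_min, sight_min, life_span_min):
    remaining = max(0.0, life_span_min - age_min)
    if remaining == 0.0:
        return 0.0                                   # already expired locally
    own = recent_accesses.get(video, 0)
    cross = sum(recent_accesses.get(v, 0) for v in correlated.get(video, ()))
    return (own + 0.5 * cross) * min(1.0, remaining / sight_min)

def choose_cache(videos, recent_accesses, correlated, ages, life_spans, slots):
    sight = future_sight(list(life_spans.values()))
    ranked = sorted(videos,
                    key=lambda v: score(v, recent_accesses, correlated,
                                        ages[v], sight, life_spans[v]),
                    reverse=True)
    return ranked[:slots]

recent = {"v1": 120, "v2": 40, "v3": 5}
corr = {"v1": ["v2"], "v2": ["v1"], "v3": []}
ages = {"v1": 30, "v2": 10, "v3": 200}          # minutes since upload
spans = {"v1": 60, "v2": 90, "v3": 180}         # observed life spans in minutes
print(choose_cache(["v1", "v2", "v3"], recent, corr, ages, spans, slots=2))
```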

1. As the existing solutions use fixed bandwidth separation, this series of experiments uses BDS+ without dynamic bandwidth separation as the comparison baseline, while BDS+ with dynamic bandwidth separation is evaluated separately.


We have implemented a prototype of AutoSight and evaluated it with Kuaishou data traces. Experiments show that AutoSight achieves a much higher hit rate than both reactive caching policies and the state-of-the-art proactive caching policies.

There has been some research and progress on the controllability of networks in recent years. However, this research only proved the controllability of complex networks in mathematical theory and did not apply the theory to the controllability of mobile dynamic networks. This chapter draws on the mathematical theory in these papers and applies it to the controllability of dynamic networks. A dynamic network needs several driver nodes to ensure the controllability of the entire network. By analyzing the connection states of the dynamic network during the control time, the minimum number of driver nodes can be obtained.
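For a single directed snapshot of such a network, a classical structural-controllability result computes the minimum number of driver nodes as the number of nodes minus the size of a maximum matching. The sketch below applies that per-snapshot rule with a simple augmenting-path matching; it is only an illustrative baseline, not the DND algorithm developed in Chap. 8, and the example graphs are made up.

```python
# Sketch: minimum driver nodes of one directed snapshot of a dynamic network,
# using the structural-controllability rule N_D = max(N - |maximum matching|, 1).
# This is a per-snapshot illustration, not the DND algorithm of Chap. 8.

def min_driver_nodes(nodes, edges):
    """nodes: iterable of node ids; edges: iterable of directed (u, v) pairs."""
    nodes = list(nodes)
    succ = {u: [] for u in nodes}
    for u, v in edges:
        succ[u].append(v)
    match_right = {}                          # v -> u currently matched to v

    def augment(u, visited):
        # Try to match u to some successor, re-matching partners along the way.
        for v in succ[u]:
            if v in visited:
                continue
            visited.add(v)
            if v not in match_right or augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matching = sum(augment(u, set()) for u in nodes)   # Kuhn's algorithm
    return max(len(nodes) - matching, 1)

# A directed path 1 -> 2 -> 3 is controllable from a single driver node.
print(min_driver_nodes([1, 2, 3], [(1, 2), (2, 3)]))             # 1
# A star 1 -> 2, 1 -> 3, 1 -> 4 leaves two leaves unmatched, so it needs 3 drivers.
print(min_driver_nodes([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)]))  # 3
```

For a temporal network one could, under this simplification, repeat the computation on every snapshot within the control time and take the maximum, which is one simple way to bound the number of drivers needed.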

1.3 Key Contributions

The main contributions are summarized as follows:

• We present the measurement and latency analysis of service chains in Baidu networks and disclose the long latency of interactive workloads.
• We design the D3G algorithm to accelerate interactive workloads in a global manner rather than on each independent server and leverage a latency estimation algorithm and a feedback scheme to ensure fairness.
• We evaluate our methods on servers in Baidu networks, and the extensive experiment results show that D3G succeeds in accelerating interactive chained applications while ensuring workload fairness.

On task scheduling, this chapter mainly makes three contributions, described as follows. Firstly, we point out that flow contentions within one task also degrade system performance, and a task-aware scheme which ignores flow relationships yields longer completion times. We give a simple example in Sect. 4.1 and analyze the disadvantages of leaving out information on flow contention. Secondly, we design a scheduling framework called TAFA, which achieves both task-awareness and flow-awareness. Task-awareness ensures short tasks are prioritized over long tasks and enables TAFA to emulate STF scheduling without knowing task sizes beforehand. Flow-awareness optimizes the scheduling order and achieves shorter task completion times. Thirdly, we solve a practical issue in the framework design. We make TAFA applicable by considering a realistic environment with multiple resources and heterogeneous VMs. To avoid mismatches between resource requirements and allocation, we solve the dominant-resource matching problem, which occurs when heterogeneous requirements of flows conflict with heterogeneous virtual machine (VM) configurations. By using the concept of dominant resource, we can choose the proper VM for particular flows.


The container placement and reassignment work is extended from our preliminary work [45] with significant improvements, including:

• We disclose a new problem (i.e., the container placement problem) that places a set of newly instantiated containers into a data center, which is a necessary and important phase in a service's life cycle.
• We propose the CA-WFD algorithm to solve the container placement problem and conduct extensive experiments to evaluate its performance.
• We refine the algorithms proposed for the container reassignment problem and significantly extend the experimental study of this problem.

On inter-DC data synchronization, our contributions are summarized as follows:

• Characterizing Baidu's workload of inter-DC bulk-data multicast to motivate the need for application-level multicast overlay networks (Sect. 6.1).
• Presenting BDS+, an application-level multicast overlay network that achieves near-optimal flow completion time via a centralized control architecture (Sects. 6.2 and 6.3).
• Introducing dynamic bandwidth separation to further improve link utilization in networks where online and offline services are deployed together (Sects. 6.2 and 6.4).
• Demonstrating the practical benefits of BDS+ through a real-world pilot deployment and large-scale simulations at Baidu (Sects. 6.5 and 6.6).

On edge caching, our contributions are summarized as follows:

• Characterizing Kuaishou's workload of short video network to motivate the need for a distributed edge caching policy.
• Presenting AutoSight, a distributed edge caching mechanism working on edge caching servers for short video network, which solves the problems of non-stationary user access patterns and temporal/spatial video popularity patterns.
• Demonstrating the practical benefits of AutoSight with real traces.

Finally, this chapter raises the controllability challenges faced by dynamic networks; takes the vehicular network on a road as an example, with the moving speed, communication radius, and average density of vehicles and the control time as variable parameters; and designs an algorithm to calculate the minimum number of driver nodes required by a dynamic network. After experiments on simulation data, we obtain the influence of the vehicles' moving speed, communication radius, average density, and the control time on the number of driver nodes required by a vehicular network. Deploying the minimum number of driver nodes ensures the controllability of the entire network while conserving resources and improving efficiency, ultimately guiding the deployment of mobile networks.


1.4 Chapter Arrangement

Chapter 3 The rest of this chapter is organized as follows. We measure the performance and latency of different service workloads in Sect. 3.1. To reduce the total latency of interactive workloads, we design a new algorithm called D3G and present its detailed design in Sect. 3.2. Section 3.3 describes the deployment of D3G, and Sect. 3.4 evaluates the experimental results on Baidu networks. Section 3.5 concludes this chapter.

Chapter 4 The rest of this chapter is organized as follows. We introduce the main system framework, TAFA, in Sect. 4.4. Section 4.5 proves the properties of TAFA, and Sect. 4.6 shows the simulation results. Finally, we conclude this chapter and point out future work in Sect. 4.7.

Chapter 5 The rest of this chapter is structured as follows. Section 5.1 introduces the architecture of container group-based services. Definitions of the container placement problem and the container reassignment problem are given in Sect. 5.2. Our solutions to the two problems are proposed in Sects. 5.3 and 5.4, respectively. We implement our solutions in large-scale data centers of Baidu, with the details given in Sect. 5.5, and Sect. 5.6 compares our solutions with state-of-the-art designs through extensive evaluations. At last, Sect. 5.8 concludes the chapter.

Chapter 6 The rest of this chapter is organized as follows. We provide a case for an application-level multicast overlay network in Sect. 6.1. To optimize inter-DC multicasts on an overlay network with dynamic separation from latency-sensitive traffic, we present BDS+, a fully centralized near-optimal network system with dynamic bandwidth separation for inter-DC data multicast, in Sect. 6.2. Section 6.5 presents the system design and implementation of BDS+, and Sect. 6.6 compares BDS+ with three existing solutions. Section 6.7 concludes this chapter.

Chapter 7 The rest of this chapter is organized as follows. We characterize the short video network and illustrate its essential differences from the traditional CDN in Sect. 7.1. We show the AutoSight design in Sect. 7.2 and evaluate our approach using real traces in Sect. 7.3. Section 7.4 concludes this chapter.

Chapter 8 The remainder of this chapter is organized as follows. In Sect. 8.1 we introduce the background of the application scenario. Section 8.2 describes the variables and specific formulas of our model. Section 8.3 presents the results and analysis. Finally, the conclusions and future work are laid out in Sect. 8.4.

References

1. Soltesz, S., Fiuczynski, M.E., Bavier, A., Peterson, L.: Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors. In: ACM Sigops/Eurosys European Conference on Computer Systems, pp. 275–287 (2007) 2. Docker: http://www.docker.com/ (2016)


3. Kumar, A., Jain, S., Naik, U., Raghuraman, A., Kasinadhuni, N., Zermeno, E.C., Gunn, C.S., Ai, J., Carlin, B., Amarandei-Stavila, M., et al.: BwE: flexible, hierarchical bandwidth allocation for WAN distributed computing. In: ACM SIGCOMM, pp. 1–14 (2015) 4. Zhang, Y., Xu, K., Yao, G., Zhang, M., Nie, X.: Piebridge: a cross-dr scale large data transmission scheduling system. In: Proceedings of the 2016 Conference on ACM SIGCOMM 2016 Conference, pp. 553–554. ACM (2016) 5. Savage, S., Collins, A., Hoffman, E., Snell, J., Anderson, T.: The end-to-end effects of Internet path selection. ACM SIGCOMM 29(4), 289–299 (1999) 6. Jain, S., Kumar, A., Mandal, S., Ong, J., Poutievski, L., Singh, A., Venkata, S., Wanderer, J., Zhou, J., Zhu, M., et al.: B4: experience with a globally-deployed software defined WAN. ACM SIGCOMM 43(4), 3–14 (2013) 7. Hong, C.-Y., Kandula, S., Mahajan, R., Zhang, M., Gill, V., Nanduri, M., Wattenhofer, R.: Achieving high utilization with software-driven WAN. In: ACM SIGCOMM, pp. 15–26 (2013) 8. Zhang, H., Chen, K., Bai, W., Han, D., Tian, C., Wang, H., Guan, H., Zhang, M.: Guaranteeing deadlines for inter-datacenter transfers. In: EuroSys, p. 20. ACM (2015) 9. Zhang, Y., Xu, K., Wang, H., Li, Q., Li, T., Cao, X.: Going fast and fair: latency optimization for cloud-based service chains. IEEE Netw. 32, 138–143 (2017) 10. kuaishou: Kuaishou. https://www.kuaishou.com (2019) 11. TikTok: Tiktok. https://www.tiktok.com (2019) 12. Go, Y.: Youtube go. https://youtubego.com (2019) 13. Stories, I.: Instagram stories. https://storiesig.com (2019) 14. Zhang, Y., Li, P., Zhang, Z., Bai, B., Zhang, G., Wang, W., Lian, B.: Challenges and chances for the emerging shortvideo network. In: Infocom, pp. 1–2. IEEE (2019) 15. Casteigts, A., Flocchini, P., Quattrociocchi, W., Santoro, N.: Time-varying graphs and dynamic networks. Int. J. Parallel Emergent Distrib. Syst. 27(5), 387–408 (2012) 16. Xiao, Z., Moore, C., Newman, M.E.J.: Random graph models for dynamic networks. Eur. Phys. J. B 90(10), 200 (2016) 17. Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., Shenker, S.: pFabric: minimal near-optimal datacenter transport. ACM SIGCOMM Comput. Commun. Rev. 43(4), 435–446 (2013). ACM 18. Dogar, F.R., Karagiannis, T., Ballani, H., Rowstron, A.: Decentralized task-aware scheduling for data center networks. ACM SIGCOMM Comput. Commun. Rev. 44(4), 431–442 (2014). ACM 19. Zhang, Y., Xu, K., Wang, H., Shen, M.: Towards shorter task completion time in datacenter networks. In: 2015 IEEE 34th International Performance Computing and Communications Conference (IPCCC), pp. 1–8. IEEE (2015) 20. Dogar, F.R., Karagiannis, T., Ballani, H., Rowstron, A.: Decentralized task-aware scheduling for data center networks. SIGCOMM Comput. Commun. Rev. 44, 431–442 (2014) 21. Yu, T., Noghabi, S.A., Raindel, S., Liu, H., Padhye, J., Sekar, V.: Freeflow: high performance container networking. In: Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 43–49. ACM (2016) 22. Burns, B., Oppenheimer, D.: Design patterns for container-based distributed systems. In: 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16) (2016) 23. Zhang, Y., Xu, K., Wang, H., Li, Q., Li, T., Cao, X.: Going fast and fair: latency optimization for cloud-based service chains. IEEE Netw. 32(2), 138–143 (2018) 24. Shen, M., Ma, B., Zhu, L., Mijumbi, R., Du, X., Hu, J.: Cloud-based approximate constrained shortest distance queries over encrypted graphs with privacy protection. IEEE Trans. Inf. 
Forensics Secur. 13(4), 940–953 (2018) 25. Ananthanarayanan, G., Kandula, S., Greenberg, A.G., Stoica, I., Lu, Y., Saha, B., Harris, E.: Reining in the outliers in map-reduce clusters using mantri. OSDI 10(1), 24 (2010) 26. Li, M., Andersen, D.G., Park, J.W., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., Su, B.-Y.: Scaling distributed machine learning with the parameter server. In: 11th


USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583– 598 (2014) 27. Wu, X., Zhu, X., Wu, G.-Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014) 28. Zhang, Y., Jiang, J., Xu, K., Nie, X., Reed, M.J., Wang, H., Yao, G., Zhang, M., Chen, K.: Bds: a centralized near-optimal overlay network for inter-datacenter data replication. In: Proceedings of the Thirteenth EuroSys Conference, p. 10. ACM (2018) 29. Xu, K., Li, T., Wang, H., Li, H., Wei, Z., Liu, J., Lin, S.: Modeling, analysis, and implementation of universal acceleration platform across online video sharing sites. IEEE Trans. Serv. Comput. 11, 534–548 (2016) 30. Wang, H., Li, T., Shea, R., Ma, X., Wang, F., Liu, J., Xu, K.: Toward cloud-based distributed interactive applications: measurement, modeling, and analysis. IEEE/ACM Trans. Netw. 26(99), 1–14 (2017) 31. Zhang, Y., Xu, K., Shi, X., Wang, H., Liu, J., Wang, Y.: Design, modeling, and analysis of online combinatorial double auction for mobile cloud computing markets. Int. J. Commun. Syst. 31(7), e3460 (2018) 32. Gavranovi´c, H., Buljubaši´c, M.: An efficient local search with noising strategy for Google machine reassignment problem. Ann. Oper. Res. 242, 1–13 (2014) 33. Wang, T., Xu, H., Liu, F.: Multi-resource load balancing for virtual network functions. In: IEEE International Conference on Distributed Computing Systems (2017) 34. Hong, Y.-J., Thottethodi, M.: Understanding and mitigating the impact of load imbalance in the memory caching tier. In: Proceedings of the 4th Annual Symposium on Cloud Computing, p. 13. ACM (2013) 35. Liebeherr, J., Nahas, M., Si, W.: Application-layer multicasting with Delaunay triangulation overlays. IEEE JSAC 200(8), 1472–1488 (2002) 36. Wang, F., Xiong, Y., Liu, J.: mTreebone: a hybrid tree/mesh overlay for application-layer live video multicast. In: ICDCS, p. 49 (2007) 37. Andreev, K., Maggs, B.M., Meyerson, A., Sitaraman, R.K.: Designing overlay multicast networks for streaming. In: SPAA, pp. 149–158 (2013) 38. Mokhtarian, K., Jacobsen, H.A.: Minimum-delay multicast algorithms for mesh overlays. IEEE/ACM TON, 23(3), 973–986 (2015) 39. Kosti´c, D., Rodriguez, A., Albrecht, J., Vahdat, A.: Bullet: high bandwidth data dissemination using an overlay mesh. ACM SOSP 37(5), 282–297 (2003). ACM 40. Repantis, T., Smith, S., Smith, S., Wein, J.: Scaling a monitoring infrastructure for the Akamai network. ACM Sigops Operat. Syst. Rev. 44(3), 20–26 (2010) 41. Huang, T.Y., Johari, R., Mckeown, N., Trunnell, M., Watson, M.: A buffer-based approach to rate adaptation: evidence from a large video streaming service. In: SIGCOMM, pp. 187–198 (2014) 42. Nygren, E., Sitaraman, R.K., Sun, J.: The Akamai network: a platform for high-performance internet applications. ACM SIGOPS Oper. Syst. Rev. 44(3) (2010) 43. Zhang, Y., Li, Y., Xu, K., Wang, D., Li, M., Cao, X., Liang, Q.: A communication-aware container re-distribution approach for high performance VNFs. In: IEEE ICDCS 2017, pp. 1555–1564. IEEE (2017) 44. Narayanan, A., Verma, S., Ramadan, E., Babaie, P., Zhang, Z.-L.: Deepcache: a deep learning based framework for content caching. In: Proceedings of the 2018 Workshop on Network Meets AI & ML, pp. 48–53. ACM (2018) 45. Zhang, Y., Li, Y., Xu, K., Wang, D., Li, M., Cao, X., Liang, Q.: A communicationaware container re-distribution approach for high performance VNFs. In: IEEE International Conference on Distributed Computing Systems (2017)

Chapter 2

A Survey of Resource Management in Cloud and Edge Computing

Abstract This chapter summarizes related work along the complete processing path of a service: from service access at the data center, to data transmission control, to back-end server communication, and to the supporting data synchronization services, tracking the complete flow of data through the service. Comprehensive and in-depth research on each of these stages is surveyed.

2.1 Latency Optimization for Cloud-Based Service Chains

As more applications are deployed on clouds for better system scalability and lower operation cost, service chains are developing quickly. Many studies have shown that latency is particularly problematic when interaction latency occurs together with network delays [1]. To minimize the latency of cloud-based applications, many researchers have focused on minimizing datacenter latency, which does advance the state of the art [2]. This literature can be classified into two categories. First, some studies focused on network and processing latency. For example, Webb et al. [3] proposed a nearest-server assignment to reduce client-server latency, and Vik et al. [4] explored spanning tree problems in distributed interactive application systems for latency reduction. Moreover, [5] and [6] introduced game theory into this topic and modeled the latency problem in DCs as a bargaining game, and Seung et al. [7] proposed a resource allocation scheme that guarantees resources. The other kind of related work concerns web services, an application model for decentralized computing and an effective mechanism for data and service integration on the Web [8]. Web services have become relatively mature in recent years. Some studies succeeded in dissecting latency contributors [9], showing that back-office traffic accounts for a significant fraction of web transactions in terms of requests and responses [10].

Although the above studies have already made excellent latency optimizations, they ignored the interactions among multiple services (e.g., the case in Fig. 1.2), so interactive workloads suffer from longer latencies due to the processing procedures on multiple intermediate servers. Many researchers investigated the latency of interactive applications, and their research showed that although these applications are quite delay-sensitive, service performance is greatly affected by interactions. To address this problem, some studies suggested that the interactions of different services should be further dissected to better understand the performance implications [10], and some researchers have already begun to pay attention to interaction latency [11]. Our study explores the potential to reduce the response time of service chains while simultaneously protecting non-interactive workloads. In particular, we accelerate the interactive workloads by building a new dedicated queue and adjusting resource allocation among the different queues. By leveraging a feedback scheme, we can bound the influence on non-interactive workloads. We describe the algorithm in detail in Sect. 3.2 after introducing our motivation in Sect. 3.1.
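As a rough illustration of this idea only (not the D3G algorithm, which Chapter 3 presents in detail), the following sketch adjusts the resource share of a dedicated interactive queue with a simple feedback rule; all names, parameters, and thresholds here are hypothetical:

    def adjust_interactive_share(share, interactive_latency, latency_target,
                                 batch_throughput, batch_floor,
                                 step=0.05, max_share=0.8):
        # Grow the interactive queue's share while its latency target is missed,
        # but only as long as the feedback signal says non-interactive (batch)
        # workloads still meet their throughput floor; otherwise give share back.
        if batch_throughput < batch_floor:
            return max(0.0, share - step)      # bound the impact on non-interactive workloads
        if interactive_latency > latency_target:
            return min(max_share, share + step)
        return share

In the real system, the interactive latency estimate and the fairness bound are provided by the latency estimation algorithm and feedback scheme described in Sect. 3.2.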

2.2 Toward Shorter Task Completion Time

Although today's widely deployed data center networks are provisioned with high bandwidth and computing capacity, task completion time can still be reduced to a large extent [12]. In this chapter, we describe the nature of today's datacenter transport protocols, which are either flow-aware or task-aware, and show how awareness at one level is isolated from the other. As a result, although flow completion time or task completion time may appear to decrease, flow-aware protocols remain blind to the task level and vice versa. In particular, good flow-level awareness can help make task completion time shorter, while good task-level awareness can help flows cooperate harmoniously. Figure 2.1 shows the development history of scheduling protocols, from DCTCP (2010) to FIFO-LM (2014), which can be categorized into two broad classes, flow-aware and task-aware, both of which advance the state-of-the-art technique. We give a brief description and explanation of this progression. As the forerunner of many flow-aware TCP-like protocols, DCTCP [13] leverages Explicit Congestion Notification (ECN) in the network to provide feedback to end hosts.

Fig. 2.1 Brief development history of transport protocols (flow-aware: DCTCP, D3, D2TCP, PDQ, pFabric, PASE; task-aware: FIFO-LM; task- and flow-aware: TAFA; timeline 2010–2015)


Experiments show that DCTCP delivers better throughput than TCP while using 90% less buffer because it elegantly reduces the queue length. However, it is a deadline-agnostic protocol that throttles all flows equally, irrespective of whether their deadlines are near or far, so it may be less effective for online data-intensive (OLDI) applications [14]. Motivated by these observations, D3 [15] uses explicit rate control for the datacenter environment according to flow deadlines. D3 can determine the rate needed to satisfy a flow's deadline when it knows the flow's size and deadline. Although it outperforms TCP in terms of short-flow latency and burst tolerance, D3 has the practical drawback of requiring changes to the switch hardware, which prevents it from coexisting with legacy TCP [14]. Deadline-Aware Datacenter TCP (D2TCP) [14] is a deployable transport protocol compared with D3. Via a gamma-correction function, D2TCP uses ECN feedback and deadlines to modulate the congestion window. Besides, D2TCP can coexist with TCP without hurting bandwidth or deadlines. Preemptive Distributed Quick (PDQ) flow scheduling [16] is designed to complete flows quickly and meet flow deadlines; it builds on traditional real-time scheduling techniques, earliest deadline first and shortest job first, which help PDQ outperform TCP, RCP [17], and D3 significantly. pFabric [18] decouples flow scheduling from rate control. Unlike the protocols above, in pFabric each flow carries a single priority number set independently, according to which switches execute a scheduling/dropping mechanism. Although pFabric achieves near-optimal flow completion time, it does not support work conservation in a multi-hop setting because end hosts always send at the maximum rate. To make these flows back off and give way to a lower-priority flow at a subsequent hop, explicit feedback from switches, i.e., a higher-layer control, is needed.

From a network perspective, tasks in DCNs typically comprise multiple flows, which traverse different servers at different times. Treating the flows of one task in isolation optimizes at the flow level while hurting task completion time. To overcome this limitation of flow-aware schemes, task-aware protocols have been proposed to explicitly take higher-layer information into consideration. A task-aware scheduling scheme was proposed by Fahad R. Dogar in [12]. Using a First-In-First-Out scheme to reduce both the average and the tail task completion time, Dogar implemented First-In-First-Out with Limited Multiplexing (FIFO-LM), which changes the level of multiplexing when heavy tasks are encountered and thus helps heavy tasks avoid being blocked or even starved. However, FIFO is not the most effective method to reduce average completion time at either the flow level or the task level, and simply distinguishing elephant tasks from mouse tasks is too coarse-grained; as [19] argues, DCNs should follow the principle of "more load, more differentiation." Further, [20] and [21] give methods that ensure user-level performance guarantees. Without cross-layer cooperation, these protocols are largely blind to each other, making scheduling inefficient. To give our method the advantages of both the flow level and the task level, we propose TAFA with the idea of mutual assistance: flow-level scheduling helps tasks complete earlier, and task-level scheduling helps related flows cooperate. Our work performs well even in a multi-resource sharing environment.
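To make the rate-control rules described above concrete, the following minimal sketch (our illustration, not code from the book or from the original papers) shows how DCTCP [13] smooths the fraction of ECN-marked packets into an estimate alpha and cuts the window by alpha/2, and how D2TCP [14] applies gamma correction so that near-deadline flows back off less; the gain g and the deadline-imminence factor d are protocol parameters, and all variable names are ours.

    def dctcp_update(cwnd, alpha, ecn_fraction, g=1.0 / 16):
        # Smooth the per-window fraction of ECN-marked packets, then cut cwnd by alpha/2.
        alpha = (1 - g) * alpha + g * ecn_fraction
        if ecn_fraction > 0:                    # congestion observed in this window
            cwnd = max(1.0, cwnd * (1 - alpha / 2))
        return cwnd, alpha

    def d2tcp_update(cwnd, alpha, ecn_fraction, d, g=1.0 / 16):
        # D2TCP's gamma correction: penalty p = alpha ** d, with d > 1 for urgent
        # (near-deadline) flows and d < 1 for far-deadline flows.
        alpha = (1 - g) * alpha + g * ecn_fraction
        p = alpha ** d
        if ecn_fraction > 0:
            cwnd = max(1.0, cwnd * (1 - p / 2))
        return cwnd, alpha

With d = 1, D2TCP degenerates to DCTCP; d > 1 (tight deadline) shrinks the penalty alpha ** d and d < 1 (loose deadline) enlarges it, so far-deadline flows yield bandwidth during congestion.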


2.3 Container Placement and Reassignment for Large-Scale Network

In this section, we survey several problems that are related to ours, including the multi-resource generalized assignment problem, the Google Machine Reassignment Problem, traffic-aware virtual machine placement, network function placement, and container deployment and migration.

Multi-resource Generalized Assignment Problem (MRGAP) MRGAP [22, 23] is an extension of the Generalized Assignment Problem (GAP) [24, 25], where multiple resources are associated with the items and bins (a standard formulation is sketched at the end of this section). Solutions to MRGAP usually contain two phases: the first phase aims to obtain an initial feasible solution, and the second phase attempts to further improve the solution. Gavish et al. [23] proposed two heuristics to generate the initial solution and a branch-and-bound algorithm to improve the solution. Privault et al. [26] computed the initial solution by the bounded variable simplex method and optimized the solution by a simulated annealing algorithm. Mitrović-Minić and Punnen [27] and Yagiura et al. [28] generated a random initial solution in the first phase and adopted local search techniques in the second phase. Mazzola and Wilcox [29] combined Pivot and Complement (P&C) and the heuristic proposed in [23] to obtain high-quality solutions. In [30], Shtub et al. proposed a gradient descent-based solution to the dynamic MRGAP (DMRGAP), where the resource requirements of items change over time and an item can be assigned to several bins. Although we show that MRGAP is equivalent to the simplified CPP in Sect. 5.3.1, we emphasize that CPP and CRP are more complex than MRGAP because of the containerization-specific constraints (i.e., conflict, spread, co-locate, and transient constraints), which makes the above solutions inapplicable in our scenarios.

Google Machine Reassignment Problem (GMRP) GMRP was formulated by the Google research team as the subject of the ROADEF/EURO Challenge, which aims to maximize resource usage by reassigning processes among the machines in data centers. Gavranović et al. proposed the winning solution [31], noisy local search (NLS), which combines local search techniques and a noising strategy in reallocation. Different from NLS, we divide the reassignment into two steps, namely Sweep and Search. With the help of Sweep, we mitigate the hot hosts and obtain better initial conditions for the following local search procedure. The evaluation results in Sect. 5.6 show that Sweep&Search yields significantly better results than directly applying local search techniques.

Traffic-Aware Virtual Machine Placement Like containerization, the virtual machine (VM) is also a popular virtualization technique, where isolated operating systems run above a hypervisor layer on bare metal. Since each VM runs a full operating system [32], VMs usually have bigger sizes and consume more power than containers. Hence, traditional VM placement is mainly concerned with optimizing energy consumption, resource utilization, and VM migration overhead [33]. Since the pioneering work of [34], many efforts have been made to mitigate inter-server communications by traffic-aware VM placement [35–44].


Meng et al. [34] defined the traffic-aware VM placement problem and proposed a two-tier approximate algorithm to minimize inter-VM communications. Choreo [36] adopts a greedy heuristic to place VMs so as to minimize application completion time. Li et al. [37] proposed a series of traffic-aware VM placement algorithms to optimize traffic cost as well as single-dimensional resource utilization cost. Rui et al. [42] adopt a system optimization method to re-optimize VM distributions for the joint optimization of resource load balancing and VM migration cost. Different from this work, we optimize both communication overhead and multi-resource load balancing. Besides, since containers can be deployed in VMs instead of physical machines, the solutions proposed in our book are orthogonal to these VM placement strategies. Therefore, VM resource utilization and inter-VM communications can be optimized by container placement/reassignment, and those of physical machines can be optimized by VM placement.

Network Function Placement Network functions virtualization (NFV) has recently gained wide attention from both industry and academia, making the study of network function placement a popular research topic [45–60]. Wang et al. [45] studied the flow-level multi-resource load balancing problem in NFV and proposed a distributed solution based on the proximal Jacobian ADMM (alternating direction method of multipliers). Marotta et al. [49] proposed a mathematical model based on robust optimization theory to minimize the power consumption of the NFV infrastructure. In [53], an affinity-based heuristic is proposed to minimize inter-cloud traffic and response time. Zhang et al. [54] proposed a Best-Fit Decreasing-based heuristic algorithm to place network functions so as to achieve high utilization of single-dimensional resources. Taleb et al. studied the network function placement problem from many aspects, including minimizing the path between users and their respective data anchor gateways [55], measuring existing NFV placement algorithms [56], placing Packet Data Network (PDN) Gateway functionality and the Evolved Packet Core (EPC) in the cloud [57, 58, 60], and modeling cross-domain network slices for 5G [59]. Since network functions work in chains while containers are deployed in groups, the communication patterns of the two systems are totally different. Hence, the communication optimization solutions in NFV are not applicable to container placement. Besides, none of these works aims at the joint optimization of communication overhead and multi-resource load balancing in data centers.

Container Deployment and Migration A lot of work has studied deploying containers among virtual machines or physical machines for various optimization purposes. Zhang et al. [61] proposed a novel container placement strategy for improving physical resource utilization. The works [62–64] studied the container placement problem for minimizing energy consumption in the cloud. Mao et al. [65] presented a resource-aware placement scheme to improve the system performance in a heterogeneous cluster. Nardelli et al. [66] studied the container placement problem for optimizing deployment cost. However, none of the above work considers the communication cost among containers.


Container migration issues have also been extensively studied in the literature. The first part of the related work concentrates on developing container live migration techniques. The works [67, 68] proposed solutions for live migrating Linux containers, while [69] proposed techniques for live migrating Docker containers. The prior works [70–72] further optimized existing container migration techniques to reduce migration overhead. The second part of the related work focuses on container migration strategies. Li et al. [73] aimed to achieve load balancing of cloud resources through container migration. Guo et al. [74] proposed a container scheduling strategy based on neighborhood division in microservices, with the purpose of reducing the system load imbalance and improving the overall system performance. Kaewkasi and Chuenmuneewong [75] applied ant colony optimization (ACO) to container scheduling, aiming to balance resource usage and achieve better performance. Xu et al. [76] proposed a resource scheduling approach for container-virtualized cloud environments to reduce the response time of customers' jobs and improve resource utilization. Again, none of the above work considers the communication cost among containers.
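Returning to the MRGAP mentioned at the beginning of this section, a standard formulation can be sketched as follows; the notation is ours and simplified, and it omits the containerization-specific conflict, spread, co-locate, and transient constraints that Chapter 5 treats exactly. With binary variables $x_{ij}$ indicating that item $j$ is assigned to bin $i$:

$$
\min \sum_{i}\sum_{j} c_{ij}\,x_{ij}
\quad \text{s.t.} \quad
\sum_{i} x_{ij} = 1 \;\;\forall j, \qquad
\sum_{j} r_{ijk}\,x_{ij} \le C_{ik} \;\;\forall i,k, \qquad
x_{ij} \in \{0,1\},
$$

where $c_{ij}$ is the cost of assigning item $j$ to bin $i$, $r_{ijk}$ is the amount of resource $k$ consumed by item $j$ on bin $i$, and $C_{ik}$ is bin $i$'s capacity of resource $k$. In the container setting, items play the role of containers, bins of hosts, and resources of CPU, memory, disk, and so on.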

2.4 Near-Optimal Network System for Data Replication

Here we discuss some representative work related to BDS+ in three categories.

Overlay Network Control Overlay networks offer great potential for various applications, especially data transfer applications. Representative overlay networks include peer-to-peer (P2P) networks and content delivery networks (CDNs). The P2P architecture has been verified by many applications, such as live streaming systems (CoolStreaming [77], Joost [78], PPStream [79], UUSee [80]), video-on-demand (VoD) applications (OceanStore [81]), distributed hash tables [82], and, more recently, Bitcoin [83]. However, self-organizing systems based on P2P principles suffer from long convergence times. CDNs distribute services spatially relative to end users to provide high availability and performance (e.g., to reduce page load time), serving many applications such as multimedia [84] and live streaming [85]. We briefly introduce the two baselines used in the evaluation section: (1) Bullet [86], which enables geo-distributed nodes to self-organize into an overlay mesh. Specifically, each node uses RanSub [87] to distribute summary ticket information to other nodes and receives disjoint data from its sending peers. The main difference between BDS+ and Bullet lies in the control scheme: BDS+ is a centralized method that has a global view of data delivery states, while Bullet is a decentralized scheme in which each node makes its decisions locally. (2) Akamai designs a three-layer overlay network for delivering live streams [88], where a source forwards its streams to reflectors and reflectors send outgoing streams to the edge sinks. There are two main differences between Akamai and BDS+.


First, Akamai adopts a three-layer topology where edge servers receive data from their parent reflectors, while BDS+ explores a larger search space through a finer-grained allocation that is not limited to three coarse-grained layers. Second, the receiving sequence of data must be sequential in Akamai because it is designed for live streaming, whereas there is no such requirement in BDS+; the side effect is that BDS+ has to decide the optimal transmission order as additional work.

Data Transfer and Rate Control Rate control of transport protocols at the DC level plays an important role in data transmission. DCTCP [89], PDQ [90], CONGA [91], DCQCN [92], and TIMELY [93] are all classical protocols showing clear improvements in transmission efficiency. Some congestion control protocols, like the credit-based ExpressPass [94], and load balancing protocols, like Hermes [95], can further reduce flow completion time by improving rate control. On this basis, the recently proposed NUMFabric [96] and Domino [97] further explore the potential of centralized TCP for speeding up data transfer and improving DC throughput. To some extent, co-flow scheduling [98, 99] has similarities to multicast overlay scheduling in terms of data parallelism, but that work focuses on flow-level problems, while BDS+ is designed at the application level.

Centralized Traffic Engineering Traffic engineering (TE) has long been a hot research topic, and many existing studies [100–106] have illustrated the challenges of scalability, heterogeneity, etc., especially at the inter-DC level. Representative TE systems include Google's B4 [107] and Microsoft's SWAN [108]. B4 adopts SDN [109] and OpenFlow [110, 111] to manage individual switches and deploy customized strategies on the paths. SWAN is another online traffic engineering platform, which achieves high network utilization with its software-driven WAN.

Network Change Detection Detecting network changes is quite important not only for traffic prediction but also for many other applications, such as abnormality detection, network monitoring, and security. Two basic but mature methods are widely used: the exponentially weighted moving average (EWMA) control scheme [112, 113] and change point detection algorithms [114]. When predicting the next value, EWMA gives higher weights to recent observations and geometrically decreasing weights to older observations (a minimal sketch of this prediction rule is given at the end of this section). Although EWMA generates geometric moving averages smoothly, it faces an essential sensitivity problem; in other words, it cannot identify abrupt changes. In contrast, change point detection algorithms can solve exactly this problem, in both online [115–117] and offline [118–121] manners. BDS+ combines these two methods by designing a sliding observation window, which makes BDS+'s prediction algorithm both stable and sensitive.

Overall, an application-level multicast overlay network with dynamic bandwidth separation is essential for data transfer in inter-DC WANs. Applications like user logs, search engine indexes, and databases would greatly benefit from bulk-data multicast. Furthermore, such benefits are orthogonal to prior WAN optimizations, further improving inter-DC application performance.
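The EWMA prediction rule mentioned above can be sketched in a few lines; this is our minimal illustration of the general scheme, not BDS+'s actual traffic predictor (which also uses a sliding observation window and change point detection), and the sample values are made up:

    def ewma_predict(history, weight=0.3):
        # Forecast the next value: the most recent sample gets weight 'weight',
        # older ones receive geometrically decreasing weights weight * (1 - weight) ** age.
        estimate = history[0]
        for x in history[1:]:
            estimate = weight * x + (1 - weight) * estimate
        return estimate

    # e.g., predicted latency-sensitive traffic (Gb/s) on a link for the next window:
    print(ewma_predict([2.1, 2.3, 2.0, 4.8, 5.0]))  # a recent spike pulls the forecast up

Because old observations never fully vanish, a sudden but persistent level shift is tracked only gradually, which is exactly the sensitivity limitation that change point detection and BDS+'s sliding window are meant to address.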


2.5 Distributed Edge Caching in Short Video Network

Here we discuss some representative caching policies in traditional CDNs and some related edge caching systems.

Existing caching policies The most representative caching policies, such as FIFO, LRU, LFU, and their variations, are simple but effective in traditional CDNs, where the frequency of content visits can be modeled as a Poisson distribution. Under these policies, the future popularity of a content item is represented by its historical popularity. But in short video networks, the user access pattern is non-stationary and no longer Poisson-distributed, so these policies become inefficient. The same problem exists in TTL (time-to-live)-based caching policies. For example, in caching (feedforward) networks where the access pattern is a Markov arrival process, [122] gives joint consideration to both TTL and request models and drives evictions by stopping times. Ferragut et al. [123] set an optimal timer to maximize the cache hit rate, but it only works for Pareto-distributed access patterns with Zipf-distributed file popularity and thus becomes invalid in short video networks. Basu et al. [124] propose a cache named f-TTL with two timers, so as to filter out non-stationary traffic, but it still relies on locally observed access patterns to change TTL values, regardless of future popularity. One of the promising attempts in recent years is learning-based proactive prediction. Narayanan et al. [125] train a characteristics predictor to predict the future popularity of objects and interoperate with traditional LRU and LFU, boosting the number of cache hits. However, it looks a fixed length into the future to predict object popularity (1–3, 12–14, and 24–26 h), ignoring the temporal and spatial video popularity patterns, and thus cannot handle the varying life spans in short video networks. Pensieve [126] trains a neural network model as an adaptive bitrate (ABR) algorithm; it complements our AutoSight framework in the sense that its dynamic control rules give us a reference when facing the varying life span problem, but it ignores temporal pattern information and only works for live streaming. Besides, [127] introduces reinforcement learning for making cache decisions, but it works only when user requests comply with a Markov process, while the AutoSight proposed in this book can work under arbitrary non-stationary user access patterns.

Edge caching systems Edge computing was proposed to enable offloading of latency-sensitive tasks to edge servers instead of the cloud and has achieved rapid development in many areas such as 5G, wireless and mobile networks [128], and video streaming [129]. Cachier [130] uses a caching model on edge servers to balance load between edge and cloud for image-recognition applications, but it does not predict future image loads. Gabry et al. [131] study the content placement problem in edge caching to maximize energy efficiency; the analysis in that work informs the design of our AutoSight network topology, but what we consider under such a topology is caching for short video networks rather than energy saving.


Ma et al. [129] propose a geo-collaborative caching strategy for mobile video networks, suggesting that joint caching over multiple edges can improve QoE, which provides strong support for our AutoSight design. While that work tries to reveal the characteristics of different mobile videos, we focus on short video networks at edge servers, with their unique user access patterns and video popularity patterns.

2.6 The Controllability of Dynamic Temporal Network

Dynamic network A network is composed of nodes and the relationships between nodes. The rapid development of the Internet has brought human beings into the network era, with all kinds of social networks, and the industrial Internet, such as large power grids and the Internet of Things, connects the whole world into a huge network. At present, network research has penetrated mathematics, life science, information science, and many other fields, and its fundamental goal is to find effective means to control network behavior and make it serve human beings. The paper [132] first applied classical control theory to the analysis of network control. In a directed network, a directed edge from node N1 to node N2 denotes that N2 can be controlled by N1 under some conditions. In that work, the network is abstracted as a linear time-invariant system, and the Kalman controllability criterion from control theory is introduced: the sufficient and necessary condition for the network to achieve complete controllability is that the controllability matrix reaches full rank, so the controllability problem of the network is transformed into the problem of calculating the rank of the controllability matrix (a small numerical illustration is given below). The work in [133] theoretically proved that the minimum set of driver nodes needed to control the entire network is determined by the maximum matching of the network, in which all unmatched nodes are driver nodes. Thus, the structural controllability problem is transformed into a classical graph-theoretic matching problem, which reduces the time complexity of the network controllability problem. However, this method can only be applied to directed networks and networks with unknown edge weights. According to the Popov-Belevitch-Hautus criterion, the paper [134] proved that the minimum number of driver nodes required by the network equals the maximum geometric multiplicity over all eigenvalues of the network matrix; for undirected networks, it equals the maximum algebraic multiplicity over all eigenvalues. This greatly reduces the computational complexity of the related problems. References [135–138] presented theoretical approaches to describing the controllability of networks or proposed ways to change the states of some nodes to stabilize the network. None of the papers and methods mentioned above has solved well the controllability problems arising in dynamic networks.
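As a small numerical illustration of the Kalman rank criterion mentioned above (our example, not taken from [132–134]), model the network as the linear time-invariant system dx/dt = Ax + Bu; it is fully controllable iff the controllability matrix [B, AB, ..., A^{n-1}B] has rank n:

    import numpy as np

    def is_controllable(A, B):
        # Kalman rank criterion: stack [B, AB, A^2 B, ...] and check full rank.
        n = A.shape[0]
        blocks = [B]
        for _ in range(n - 1):
            blocks.append(A @ blocks[-1])
        C = np.hstack(blocks)              # controllability matrix
        return np.linalg.matrix_rank(C) == n

    # A 3-node directed chain 1 -> 2 -> 3; A[i][j] != 0 means an edge from node j to node i.
    A = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    B = np.array([[1.0], [0.0], [0.0]])    # a single driver node attached to node 1
    print(is_controllable(A, B))           # True: one driver node suffices for this chain

Driving only node 2 or node 3 instead leaves the controllability matrix rank-deficient, which is the intuition behind choosing driver nodes carefully in the vehicular networks discussed next.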


Fig. 2.2 Illustration of vehicle-to-vehicle communication

Internet of Vehicles The Internet of Vehicles refers to the use of vehicular electronic sensing devices to enable two-way data exchange and sharing between vehicles, people, and transportation facilities through wireless communication technology, car navigation systems, intelligent terminal facilities, and information processing systems. It is a comprehensive, intelligent decision-making information system that realizes real-time monitoring, scientific dispatching, and effective management of vehicles, people, objects, roads, etc., thereby improving road transportation conditions and traffic management efficiency [139]. Vehicle-to-vehicle communication, shown in Fig. 2.2, takes place between vehicles through their vehicle-mounted terminals and is mainly used for two-way data transmission between vehicles. The communication technologies used include microwave, infrared, and dedicated short-range communication, featuring high safety and real-time requirements. The vehicle terminals can collect information such as the speed, position, direction, and running alarms of surrounding vehicles in real time. Through wireless communication technology, these vehicles form an interactive communication platform that can exchange pictures, text messages, videos, and audio information in real time [140, 141] (Fig. 2.3). The communication between the vehicle and the control center means that the vehicle-mounted mobile terminal establishes interconnection with the remote traffic control center through the public access network, to complete data transmission and information exchange and to accomplish the interaction and storage of data between the vehicle and the traffic control center. It is mainly used in vehicle navigation, vehicle remote monitoring, emergency rescue, information and entertainment services, and so on. Moreover, it has the characteristics of long distance and high-speed movement [139]. In a vehicular network, vehicles are virtualized as mobile network nodes, and roadside units (RSUs) are virtualized as stationary network nodes. The environmental information of the road and the vehicles is collected through sensors in the vehicles and the RSUs. The structure of the vehicular network presents a dynamic topology: the high-speed movement of vehicle nodes makes the topology of the vehicle network change


Fig. 2.3 Illustration of communication between the vehicle and the control center

rapidly, and the access status changes dynamically due to the dynamic network topology.

References 1. Mauve, M., Vogel, J., Hilt, V., Effelsberg, W.: Local-lag and timewarp: providing consistency for replicated continuous applications. IEEE Trans. Multimedia 6(1), 47–57 (2004) 2. Shao, Z., Jin, X., Jiang, W., Chen, M., Chiang, M.: Intra-data-center traffic engineering with ensemble routing. In: INFOCOM, 2013 Proceedings IEEE, pp. 2148–2156. IEEE (2013) 3. Webb, S.D., Soh, S., Lau, W.: Enhanced mirrored servers for network games. In: Proceedings of the 6th ACM SIGCOMM Workshop on Network and System Support for Games, pp. 117– 122. ACM (2007) 4. Vik, K.-H., Halvorsen, P., Griwodz, C.: Multicast tree diameter for dynamic distributed interactive applications. In: INFOCOM 2008. The 27th Conference on Computer Communications IEEE. IEEE (2008) 5. Guo, J., Liu, F., Zeng, D., Lui, J.C., Jin, H.: A cooperative game based allocation for sharing data center networks. In: INFOCOM, 2013 Proceedings IEEE, pp. 2139–2147. IEEE (2013) 6. Xu, K., Zhang, Y., Shi, X., Wang, H., Wang, Y., Shen, M.: Online combinatorial double auction for mobile cloud computing markets. In: Performance Computing and Communications Conference (IPCCC), 2014 IEEE International, pp. 1–8. IEEE (2014) 7. Seung, Y., Lam, T., Li, L.E., Woo, T.: Cloudflex: seamless scaling of enterprise applications into the cloud. In: INFOCOM, 2011 Proceedings IEEE, pp. 211–215. IEEE (2011) 8. Yue, K., Wang, X.-L., Zhou, A.-Y., et al.: Underlying techniques for web services: a survey. J. Softw. 15(3), 428–442 (2004)


9. Zaki, Y., Chen, J., Potsch, T., Ahmad, T., Subramanian, L.: Dissecting web latency in ghana. In: Proceedings of the 2014 Conference on Internet Measurement Conference, pp. 241–248. ACM (2014) 10. Pujol, E., Richter, P., Chandrasekaran, B., Smaragdakis, G., Feldmann, A., Maggs, B.M., Ng, K.-C.: Back-office web traffic on the Internet. In: Proceedings of the 2014 Conference on Internet Measurement Conference, pp. 257–270. ACM (2014) 11. Wang, H., Shea, R., Ma, X., Wang, F., Liu, J.: On design and performance of cloud-based distributed interactive applications. In: 2014 IEEE 22nd International Conference on Network Protocols (ICNP), pp. 37–46. IEEE (2014) 12. Dogar, F.R., Karagiannis, T., Ballani, H., Rowstron, A.: Decentralized task-aware scheduling for data center networks. ACM SIGCOMM Comput. Commun. Rev. 44, 431–442 (2014) 13. Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M.: Data center TCP (DCTCP). ACM SIGCOMM Comput. Commun. Rev. 41(4), 63–74 (2011) 14. Vamanan, B., Hasan, J., Vijaykumar, T.: Deadline-aware datacenter TCP (D2TCP). ACM SIGCOMM Comput. Commun. Rev. 42(4), 115–126 (2012) 15. Wilson, C., Ballani, H., Karagiannis, T., Rowtron, A.: Better never than late: meeting deadlines in datacenter networks. ACM SIGCOMM Comput. Commun. Rev. 41(4), 50–61 (2011). ACM 16. Hong, C.-Y., Caesar, M., Godfrey, P.: Finishing flows quickly with preemptive scheduling. ACM SIGCOMM Comput. Commun. Rev. 42(4), 127–138 (2012) 17. Dukkipati, N., McKeown, N.: Why flow-completion time is the right metric for congestion control. ACM SIGCOMM Comput. Commun. Rev. 36(1), 59–62 (2006) 18. Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., Shenker, S.: pFabric: minimal near-optimal datacenter transport. ACM SIGCOMM Comput. Commun. Rev. 43(4), 435–446 (2013). ACM 19. Zhang, H.: More load, more differentiation – a design principle for deadline-aware flow control in DCNS. In: INFOCOM, 2014 Proceedings IEEE. IEEE (2014) 20. Shen, M., Gao, L., Xu, K., Zhu, L.: Achieving bandwidth guarantees in multi-tenant cloud networks using a dual-hose model. In: 2014 IEEE 33rd International Performance Computing and Communications Conference (IPCCC), pp. 1–8. IEEE (2014) 21. Xu, K., Zhang, Y., Shi, X., Wang, H., Wang, Y., Shen, M.: Online combinatorial double auction for mobile cloud computing markets. In: 2014 IEEE 33rd International Performance Computing and Communications Conference (IPCCC), pp.1–8. IEEE (2014) 22. Gavish, B., Pirkul, H.: Computer and database location in distributed computer systems. IEEE Trans. Comput. (7), 583–590 (1986) 23. Gavish, B., Pirkul, H.: Algorithms for the multi-resource generalized assignment problem. Manag. Sci. 37(6), 695–713 (1991) 24. Ross, G.T., Soland, R.M.: A branch and bound algorithm for the generalized assignment problem. Math. Program. 8(1), 91–103 (1975) 25. Oncan, T.: A survey of the generalized assignment problem and its applications. INFOR 45(3), 123–141 (2007) 26. Privault, C., Herault, L.: Solving a real world assignment problem with a metaheuristic. J. Heuristics 4(4), 383–398 (1998) 27. Mitrovi´c-Mini´c, S., Punnen, A.P.: Local search intensified: very large-scale variable neighborhood search for the multi-resource generalized assignment problem. Discret. Optim. 6(4), 370–377 (2009) 28. Yagiura, M., Iwasaki, S., Ibaraki, T., Glover, F.: A very large-scale neighborhood search algorithm for the multi-resource generalized assignment problem. Discret. Optim. 1(1), 87–98 (2004) 29. 
Mazzola, J.B., Wilcox, S.P.: Heuristics for the multi-resource generalized assignment problem. Nav. Res. Logist. 48(6), 468–483 (2001) 30. Shtub, A., Kogan, K.: Capacity planning by the dynamic multi-resource generalized assignment problem (DMRGAP). Eur. J. Oper. Res. 105(1), 91–99 (1998)


31. Gavranovi´c, H., Buljubaši´c, M.: An efficient local search with noising strategy for Google machine reassignment problem. Ann. Oper. Res. 242, 1–13 (2014) 32. Sharma, P., Chaufournier, L., Shenoy, P., Tay, Y.C.: Containers and virtual machines at scale: a comparative study. In: International Middleware Conference, p. 1 (2016) 33. Mann, Z.D., Szabó, M.: Which is the best algorithm for virtual machine placement optimization? Concurr. Comput. Pract. Exp. 29(7), e4083 (2017) 34. Meng, X., Pappas, V., Zhang, L.: Improving the scalability of data center networks with traffic-aware virtual machine placement. In: INFOCOM, 2010 Proceedings IEEE, pp. 1–9 (2010) 35. Popa, L., Kumar, G., Chowdhury, M., Krishnamurthy, A., Ratnasamy, S., Stoica, I.: Faircloud: sharing the network in cloud computing. ACM SIGCOMM Comput. Commun. Rev. 42(4), 187–198 (2012) 36. Lacurts, K., Deng, S., Goyal, A., Balakrishnan, H.: Choreo: network-aware task placement for cloud applications. In: Conference on Internet Measurement Conference, pp. 191–204 (2013) 37. Li, X., Wu, J., Tang, S., Lu, S.: Let’s stay together: towards traffic aware virtual machine placement in data centers. In: INFOCOM, 2014 Proceedings IEEE, pp. 1842–1850 (2014) 38. Ma, T., Wu, J., Hu, Y., Huang, W.: Optimal VM placement for traffic scalability using Markov chain in cloud data centre networks. Electron. Lett. 53(9), 602–604 (2017) 39. Zhao, Y., Huang, Y., Chen, K., Yu, M., Wang, S., Li, D.S.: Joint VM placement and topology optimization for traffic scalability in dynamic datacenter networks. Comput. Netw. 80, 109– 123 (2015) 40. Rai, A., Bhagwan, R., Guha, S.: Generalized resource allocation for the cloud. In: ACM Symposium on Cloud Computing, pp. 1–12 (2012) 41. Wang, L., Zhang, F., Aroca, J.A., Vasilakos, A.V., Zheng, K., Hou, C., Li, D., Liu, Z.: Greendcn: a general framework for achieving energy efficiency in data center networks. IEEE J. Sel. Areas Commun. 32(1), 4–15 (2013) 42. Rui, L., Zheng, Q., Li, X., Jie, W.: A novel multi-objective optimization scheme for rebalancing virtual machine placement. In: IEEE International Conference on Cloud Computing, pp. 710–717 (2017) 43. Gu, L., Zeng, D., Guo, S., Xiang, Y., Hu, J.: A general communication cost optimization framework for big data stream processing in geo-distributed data centers. IEEE Trans. Comput. 65(1), 19–29 (2015) 44. Shen, M., Xu, K., Li, F., Yang, K., Zhu, L., Guan, L.: Elastic and efficient virtual network provisioning for cloud-based multi-tier applications. In: 2015 44th International Conference on Parallel Processing (ICPP), pp. 929–938. IEEE (2015) 45. Wang, T., Xu, H., Liu, F.: Multi-resource load balancing for virtual network functions. In: IEEE International Conference on Distributed Computing Systems (2017) 46. Taleb, T., Bagaa, M., Ksentini, A.: User mobility-aware virtual network function placement for virtual 5G network infrastructure. In: IEEE International Conference on Communications, pp. 3879–3884 (2016) 47. Mehraghdam, S., Keller, M., Karl, H.: Specifying and placing chains of virtual network functions. In: 2014 IEEE 3rd International Conference on Cloud Networking (CloudNet), pp. 7–13. IEEE (2014) 48. Kawashima, K., Otoshi, T., Ohsita, Y., Murata, M.: Dynamic placement of virtual network functions based on model predictive control. In: NOMS 2016 – 2016 IEEE/IFIP Network Operations and Management Symposium, pp. 1037–1042 (2016) 49. Marotta, A., Kassler, A.: A power efficient and robust virtual network functions placement problem. In: Teletraffic Congress, pp. 
331–339 (2017) 50. Addis, B., Belabed, D., Bouet, M., Secci, S.: Virtual network functions placement and routing optimization. In: IEEE International Conference on Cloud NETWORKING, pp. 171–177 (2015)


51. Wang, F., Ling, R., Zhu, J., Li, D.: Bandwidth guaranteed virtual network function placement and scaling in datacenter networks. In: IEEE International Performance Computing and Communications Conference, pp. 1–8 (2015) 52. Ghaznavi, M., Khan, A., Shahriar, N., Alsubhi, K., Ahmed, R., Boutaba, R.: Elastic virtual network function placement. In: IEEE International Conference on Cloud Networking (2015) 53. Bhamare, D., Samaka, M., Erbad, A., Jain, R., Gupta, L., Chan, H.A.: Optimal virtual network function placement in multi-cloud service function chaining architecture. Comput. Commun. 102(C), 1–16 (2017) 54. Zhang, Q., Xiao, Y., Liu, F., Lui, J.C.S., Guo, J., Wang, T.: Joint optimization of chain placement and request scheduling for network function virtualization. In: IEEE International Conference on Distributed Computing Systems, pp. 731–741 (2017) 55. Taleb, T., Bagaa, M., Ksentini, A.: User mobility-aware virtual network function placement for virtual 5G network infrastructure. In: 2015 IEEE International Conference on Communications (ICC), pp. 3879–3884. IEEE (2015) 56. Laghrissi, A., Taleb, T., Bagaa, M., Flinck, H.: Towards edge slicing: VNF placement algorithms for a dynamic & realistic edge cloud environment. In: 2017 IEEE Global Communications Conference, pp. 1–6. IEEE (2017) 57. Prados, M.B.J., Laghrissi, A., Taleb, A.T., Taleb, T., Bagaa, M., Flinck, H.: A queuing based dynamic auto scaling algorithm for the LTE EPC control plane. In: 2018 IEEE Global Communications Conference, pp. 1–6. IEEE (2018) 58. Bagaa, M., Taleb, T., Ksentini, A.: Service-aware network function placement for efficient traffic handling in carrier cloud. In: 2014 IEEE Wireless Communications and Networking Conference (WCNC), pp. 2402–2407. IEEE (2014) 59. Bagaa, M., Dutra, D.L.C., Addad, R.A., Taleb, T., Flinck, H.: Towards modeling cross-domain network slices for 5G. In: 2018 IEEE Global Communications Conference, pp. 1–6. IEEE (2018) 60. Bagaa, M., Taleb, T., Laghrissi, A., Ksentini, A.: Efficient virtual evolved packet core deployment across multiple cloud domains. In: 2018 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6. IEEE (2018) 61. Zhang, R., Zhong, A.-M., Dong, B., Tian, F., Li, R.: Container-VM-PM architecture: a novel architecture for docker container placement. In: International Conference on Cloud Computing, pp. 128–140. Springer (2018) 62. Piraghaj, S.F., Dastjerdi, A.V., Calheiros, R.N., Buyya, R.: A framework and algorithm for energy efficient container consolidation in cloud data centers. In: 2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS), pp. 368–375. IEEE (2015) 63. Dong, Z., Zhuang, W., Rojas-Cessa, R.: Energy-aware scheduling schemes for cloud data centers on google trace data. In: 2014 IEEE Online Conference on Green Communications (OnlineGreencomm), pp. 1–6. IEEE (2014) 64. Shi, T., Ma, H., Chen, G.: Energy-aware container consolidation based on PSO in cloud data centers. In: 2018 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE (2018) 65. Mao, Y., Oak, J., Pompili, A., Beer, D., Han, T., Hu, P.: Draps: dynamic and resource-aware placement scheme for docker containers in a heterogeneous cluster. In: 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), pp. 1–8. IEEE (2017) 66. Nardelli, M., Hochreiner, C., Schulte, S.: Elastic provisioning of virtual machines for container deployment. 
In: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion, pp. 5–10. ACM (2017) 67. Qiu, Y.: Evaluating and improving LXC container migration between cloudlets using multipath TCP. Ph.D. dissertation, Carleton University, Ottawa (2016) 68. Machen, A., Wang, S., Leung, K.K., Ko, B.J., Salonidis, T.: Live service migration in mobile edge clouds. IEEE Wirel. Commun. 25(1), 140–147 (2018) 69. Pickartz, S., Eiling, N., Lankes, S., Razik, L., Monti, A.: Migrating Linux containers using CRIU. In: International Conference on High Performance Computing, pp. 674–684. Springer (2016)


70. Ma, L., Yi, S., Carter, N., Li, Q.: Efficient live migration of edge services leveraging container layered storage. IEEE Trans. Mob. Comput. 18, 2020–2033 (2018) 71. Ma, L., Yi, S., Li, Q.: Efficient service handoff across edge servers via docker container migration. In: Proceedings of the Second ACM/IEEE Symposium on Edge Computing, p. 11. ACM (2017) 72. Nadgowda, S., Suneja, S., Bila, N., Isci, C.: Voyager: complete container state migration. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 2137–2142. IEEE (2017) 73. Li, P., Nie, H., Xu, H., Dong, L.: A minimum-aware container live migration algorithm in the cloud environment. Int. J. Bus. Data Commun. Netw. (IJBDCN) 13(2), 15–27 (2017) 74. Guo, Y., Yao, W.: A container scheduling strategy based on neighborhood division in micro service. In: NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium, pp. 1–6. IEEE (2018) 75. Kaewkasi, C., Chuenmuneewong, K.: Improvement of container scheduling for docker using ant colony optimization. In: 2017 9th International Conference on Knowledge and Smart Technology (KST), pp. 254–259. IEEE (2017) 76. Xu, X., Yu, H., Pei, X.: A novel resource scheduling approach in container based clouds. In: 2014 IEEE 17th International Conference on Computational Science and Engineering (CSE), pp. 257–264. IEEE (2014) 77. Zhang, X., Liu, J., Li, B., Yum, Y.-S.: CoolStreaming/DONet: a data-driven overlay network for peer-to-peer live media streaming. In: INFOCOM, vol. 3, pp. 2102–2111. IEEE (2005) 78. Joost: http://www.joost.com/ 79. Ppstream: http://www.ppstream.com/ 80. Uusee: http://www.uusee.com/ 81. Oceanstore: http://oceanstore.cs.berkeley.edu/ 82. Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., Stoica, I., Yu, H.: Opendht: a public DHT service and its uses. In: ACM SIGCOMM, vol. 35, pp. 73–84 (2005) 83. Eyal, I., Gencer, A.E., Sirer, E.G., Van Renesse, R.: Bitcoin-NG: a scalable blockchain protocol. In: NSDI (2016) 84. Zhu, W., Luo, C., Wang, J., Li, S.: Multimedia cloud computing. IEEE Signal Process. Mag. 28(3), 59–69 (2011) 85. Sripanidkulchai, K., Maggs, B., Zhang, H.: An analysis of live streaming workloads on the Internet. In: IMC, pp. 41–54. ACM (2004) 86. Kosti´c, D., Rodriguez, A., Albrecht, J., Vahdat, A.: Bullet: high bandwidth data dissemination using an overlay mesh. ACM SOSP 37(5), 282–297 (2003). ACM 87. Rodriguez, A., Albrecht, J., Bhirud, A., Vahdat, A.: Using random subsets to build scalable network services. In: USITS, pp. 19–19 (2003) 88. Andreev, K., Maggs, B.M., Meyerson, A., Sitaraman, R.K.: Designing overlay multicast networks for streaming. In: SPAA, pp. 149–158 (2013) 89. Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M.: Data center TCP (DCTCP). In: ACM SIGCOMM, pp. 63–74 (2010) 90. Hong, C.Y., Caesar, M., Godfrey, P.B.: Finishing flows quickly with preemptive scheduling. ACM SIGCOMM Comput. Commun. Rev. 42(4), 127–138 (2012) 91. Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Lam, V.T., Matus, F., Pan, R., Yadav, N.: CONGA: distributed congestion-aware load balancing for datacenters. In: ACM SIGCOMM, pp. 503–514 (2014) 92. Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M., Liron, Y., Padhye, J., Raindel, S., Yahia, M.H., Zhang, M.: Congestion control for large-scale RDMA deployments. ACM SIGCOMM 45(5), 523–536 (2015) 93. 
Mittal, R., Lam, V.T., Dukkipati, N., Blem, E., Wassel, H., Ghobadi, M., Vahdat, A., Wang, Y., Wetherall, D., Zats, D.: TIMELY: RTT-based congestion control for the datacenter. In: ACM SIGCOMM, pp. 537–550 (2015)



94. Cho, I., Jang, K.H., Han, D.: Credit-scheduled delay-bounded congestion control for datacenters. In: ACM SIGCOMM, pp. 239–252 (2017) 95. Zhang, H., Zhang, J., Bai, W., Chen, K., Chowdhury, M.: Resilient datacenter load balancing in the wild. In: ACM SIGCOMM, pp. 253–266 (2017) 96. Nagaraj, K., Bharadia, D., Mao, H., Chinchali, S., Alizadeh, M., Katti, S.: Numfabric: fast and flexible bandwidth allocation in datacenters. In: ACM SIGCOMM, pp. 188–201 (2016) 97. Sivaraman, A., Cheung, A., Budiu, M., Kim, C., Alizadeh, M., Balakrishnan, H., Varghese, G., McKeown, N., Licking, S.: Packet transactions: high-level programming for line-rate switches. In: ACM SIGCOMM, pp. 15–28 (2016) 98. Chowdhury, M., Stoica, I.: Coflow: an application layer abstraction for cluster networking. In: ACM Hotnets. Citeseer (2012) 99. Zhang, H., Chen, L., Yi, B., Chen, K., Geng, Y., Geng, Y.: CODA: toward automatically identifying and scheduling coflows in the dark. In: ACM SIGCOMM, pp. 160–173 (2016) 100. Chen, Y., Alspaugh, S., Katz, R.H.: Design insights for MapReduce from diverse production workloads. California University Berkeley Department of Electrical Engineering and Computer Science, Technical Report (2012) 101. Kavulya, S., Tan, J., Gandhi, R., Narasimhan, P.: An analysis of traces from a production MapReduce cluster. In: CCGrid, pp. 94–103. IEEE (2010) 102. Mishra, A.K., Hellerstein, J.L., Cirne, W., Das, C.R.: Towards characterizing cloud backend workloads: insights from Google compute clusters. ACM SIGMETRICS PER 37(4), 34–41 (2010) 103. Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In: Proceedings of the Third ACM Symposium on Cloud Computing, p. 7. ACM (2012) 104. Sharma, B., Chudnovsky, V., Hellerstein, J.L., Rifaat, R., Das, C.R.: Modeling and synthesizing task placement constraints in Google compute clusters. In: SoCC, p. 3. ACM (2011) 105. Wilkes, J.: More Google cluster data. http://googleresearch.blogspot.com/2011/11/ (2011) 106. Zhang, Q., Hellerstein, J.L., Boutaba, R.: Characterizing task usage shapes in Google’s compute clusters. In: LADIS (2011) 107. Jain, S., Kumar, A., Mandal, S., Ong, J., Poutievski, L., Singh, A., Venkata, S., Wanderer, J., Zhou, J., Zhu, M., et al.: B4: experience with a globally-deployed software defined WAN. ACM SIGCOMM 43(4), 3–14 (2013) 108. Hong, C.-Y., Kandula, S., Mahajan, R., Zhang, M., Gill, V., Nanduri, M., Wattenhofer, R.: Achieving high utilization with software-driven WAN. In: ACM SIGCOMM, pp. 15–26 (2013) 109. McKeown, N.: Software-defined networking. INFOCOM Keynote Talk 17(2), 30–32 (2009) 110. OpenFlow: Openflow specification. http://archive.openflow.org/wp/documents 111. McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., Turner, J.: Openflow: enabling innovation in campus networks. ACM SIGCOMM 38(2), 69–74 (2008) 112. Roberts, S.: Control chart tests based on geometric moving averages. Technometrics 1(3), 239–250 (1959) 113. Lucas, J.M., Saccucci, M.S.: Exponentially weighted moving average control schemes: properties and enhancements. Technometrics 32(1), 1–12 (1990) 114. Adams, R.P., MacKay, D.J.: Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742 (2007) 115. Page, E.: A test for a change in a parameter occurring at an unknown point. Biometrika 42(3/4), 523–527 (1955) 116. Desobry, F., Davy, M., Doncarli, C.: An online kernel change detection algorithm. IEEE Trans. Signal Process. 
53(8), 2961–2974 (2005) 117. Lorden, G., et al.: Procedures for reacting to a change in distribution. Ann. Math. Stat. 42(6), 1897–1908 (1971)



118. Smith, A.: A Bayesian approach to inference about a change-point in a sequence of random variables. Biometrika 62(2), 407–416 (1975) 119. Stephens, D.: Bayesian retrospective multiple-changepoint identification. Appl. Stat. 43, 159– 178 (1994) 120. Barry, D., Hartigan, J.A.: A Bayesian analysis for change point problems. J. Am. Stat. Assoc. 88(421), 309–319 (1993) 121. Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732 (1995) 122. Berger, D.S., Gland, P., Singla, S., Ciucu, F.: Exact analysis of TTL cache networks: the case of caching policies driven by stopping times. ACM SIGMETRICS Perform. Eval. Rev. 42(1), 595–596 (2014) 123. Ferragut, A., Rodríguez, I., Paganini, F.: Optimizing TTL caches under heavy-tailed demands. ACM SIGMETRICS Perform. Eval. Rev. 44(1), 101–112 (2016). ACM 124. Basu, S., Sundarrajan, A., Ghaderi, J., Shakkottai, S., Sitaraman, R.: Adaptive TTL-based caching for content delivery. ACM SIGMETRICS Perform. Eval. Rev. 45(1), 45–46 (2017) 125. Narayanan, A., Verma, S., Ramadan, E., Babaie, P., Zhang, Z.-L.: Deepcache: a deep learning based framework for content caching. In: Proceedings of the 2018 Workshop on Network Meets AI & ML, pp. 48–53. ACM (2018) 126. Mao, H., Netravali, R., Alizadeh, M.: Neural adaptive video streaming with pensieve. In: Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 197–210. ACM (2017) 127. Sadeghi, A., Sheikholeslami, F., Giannakis, G.B.: Optimal and scalable caching for 5G using reinforcement learning of space-time popularities. IEEE J. Sel. Top. Signal Process. 12(1), 180–190 (2018) 128. Li, X., Wang, X., Wan, P.-J., Han, Z., Leung, V.C.: Hierarchical edge caching in deviceto-device aided mobile networks: modeling, optimization, and design. IEEE J. Sel. Areas Commun. 36(8), 1768–1785 (2018) 129. Ma, G., Wang, Z., Zhang, M., Ye, J., Chen, M., Zhu, W.: Understanding performance of edge content caching for mobile video streaming. IEEE J. Sel. Areas Commun. 35(5), 1076–1089 (2017) 130. Drolia, U., Guo, K., Tan, J., Gandhi, R., Narasimhan, P.: Cachier: edge-caching for recognition applications. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 276–286. IEEE (2017) 131. Gabry, F., Bioglio, V., Land, I.: On energy-efficient edge caching in heterogeneous networks. IEEE J. Sel. Areas Commun. 34(12), 3288–3298 (2016) 132. Lombardi, A., Hörnquist, M.: Controllability analysis of networks. Phys. Rev. E 75(5) Pt 2, 056110 (2007) 133. Liu, Y.-Y., Slotine, J.-J., Barabási, A.-L.: Controllability of complex networks. Nature 473(7346), 167 (2011) 134. Yuan, Z., Zhao, C., Di, Z., Wang, W.X., Lai, Y.C.: Exact controllability of complex networks. Nat. Commun. 4(2447), 2447 (2013) 135. Cornelius, S.P., Kath, W.L., Motter, A.E.: Realistic control of network dynamics. Nat. Commun. 4(3), 1942 (2013) 136. Pasqualetti, F., Zampieri, S., Bullo, F.: Controllability metrics, limitations and algorithms for complex networks. IEEE Trans. Control Netw. Syst. 1(1), 40–52 (2014) 137. Francesco, S., Mario, D.B., Franco, G., Guanrong, C.: Controllability of complex networks via pinning. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 75(2), 046103 (2007) 138. Wang, W.X., Ni, X., Lai, Y.C., Grebogi, C.: Optimizing controllability of complex networks by minimum structural perturbations. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 85(2) Pt 2, 026115 (2012) 139. 
Gerla, M., Lee, E.K., Pau, G., Lee, U.: Internet of vehicles: from intelligent grid to autonomous cars and vehicular clouds. In: Internet of Things (2016)



140. Kaiwartya, O., Abdullah, A.H., Cao, Y., Altameem, A., Liu, X.: Internet of vehicles: motivation, layered architecture network model challenges and future aspects. IEEE Access 4, 5356–5373 (2017) 141. Alam, K.M., Saini, M., Saddik, A.E.: Toward social Internet of vehicles: concept, architecture, and applications. IEEE Access 3, 343–357 (2015)

Chapter 3

A Task Scheduling Scheme in the DC Access Network

Abstract Microservices have been attracting more and more attention in recent years. A broad spectrum of online interactive applications are now programmed as service chains in the cloud, seeking better system scalability and lower operation cost. Different from conventional batch jobs, most of these applications are composed of multiple stand-alone services that communicate with each other. These step-by-step operations unavoidably introduce higher latency to the delay-sensitive chained services. In this chapter, we aim at designing an optimization approach to reduce the latency of chained services. Specifically, we present a measurement and analysis of chained services on Baidu's cloud platform; our real-world trace indicates that these chained services suffer from significantly high latency because they are handled by different queues on cloud servers multiple times. This unique feature introduces a significant challenge in optimizing a microservice's overall queueing delay. To address this problem, we propose a delay-guaranteed approach that accelerates the overall queueing of chained services while maintaining fairness across all workloads. Our real-world deployment at Baidu shows that the proposed design reduces the latency of chained services by 35% with minimal effect on other workloads.

3.1 Introduction

In this section, we conduct measurements in Baidu networks and disclose the long latency experienced by service chains. Aiming at accelerating interactive workloads while not affecting non-interactive workloads (to ensure fairness), we then motivate this chapter. As the largest Chinese search engine, Baidu has dozens of applications deployed in its networks. These applications cover every corner of people's lives, and they can further cooperate with each other to provide more comprehensive functions (as shown in the introduction section) [1–5]. To evaluate the performance of these services, we measured the workload latency from one server cluster at Baidu. In particular, we monitor all the workloads, record



Fig. 3.1 The workload latency in Baidu networks (x-axis: time of day, 00:00–24:00; y-axis: latency in ms; two curves: interactive and non-interactive workloads)

the response time of service calls, and then calculate the average latency per minute for both interactive and non-interactive workloads by analyzing the trace log. We take the log of these two kinds of workloads from 0:00 to 24:00 on 1 April 2016 and plot the statistics in Fig. 3.1. The x-axis denotes the time of day, while the y-axis denotes the service latency in ms. From these results, we come to the following conclusions: (1) The average latency for non-interactive workloads is about 60–70 ms, while that for the interactive workloads is nearly 500 ms, i.e., the interactive workloads suffer from roughly 7 times longer latency than the non-interactive workloads. (2) Even when the network is not congested (e.g., during the midnight hours), the interactive workload latency is still much longer than the non-interactive workload latency. (3) When there is a slight burst, for example around 11:00 or 16:00, the performance of interactive workloads degrades noticeably, making the latency even higher. As we analyzed before, non-interactive workloads can be completed in just one instance, while interactive workloads have to go through different servers one after another [6–11]. To optimize the performance of delay-sensitive interactive workloads, we should accelerate the processing of these workloads, for example by assigning higher priorities or allocating more resources. However, improving interactive workloads will unavoidably affect non-interactive workloads because they share the same infrastructure [12–17]. So a fair optimization scheme should have the following characteristics: (1) reduce the latency for delay-sensitive interactive workloads; (2) ensure fairness across all workloads (i.e., not degrade non-interactive workloads severely).



3.2 Dynamic Differentiated Service with Delay-Guarantee

In this section, we study the essence of the latency gap in Sect. 3.2.1 before introducing the design philosophy of our approach in Sect. 3.2.2. According to this philosophy, we design an algorithm called Dynamic Differentiated service with delay-Guarantee (D3G), which reduces the latency of service chains while ensuring workload fairness.

3.2.1 The Components of Latency

As interactive applications consist of basic functions applied on different servers, their workloads must go through multiple servers in a specific order so that the required functions are applied step by step. Therefore, the interactive workloads are queued in servers multiple times, while the non-interactive workloads are queued only once. To be specific, we analyze the latency of interactive (R_{i,j}) and non-interactive (R_{i,i}) workloads: as interactive workloads travel across multiple servers and are queued in each one, the final latency is the sum of the queueing and serving time on each server. Besides, the transfer time between different servers also contributes to the latency, while non-interactive workloads only go through one particular server with a single queueing and serving time. Thus, the overall latency of interactive workloads is much longer than that of non-interactive workloads (as shown in Fig. 3.1).
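To make this decomposition concrete, the display below writes the two latencies explicitly. The notation is ours, not from the original text: q_k and s_k denote the queueing and serving times at the k-th server on the chain, and t_{k,k+1} the transfer time between consecutive servers, assuming the interactive request visits K servers.

W(R_{i,j}) = \sum_{k=1}^{K} (q_k + s_k) + \sum_{k=1}^{K-1} t_{k,k+1}   (interactive: queued and served at each of the K servers)

W(R_{i,i}) = q_1 + s_1   (non-interactive: queued and served once)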

3.2.2 Design Philosophy

As users' patience is limited and they abandon the system once the latency exceeds their patience, the interactive workload latency W(R_{i,j}) is essential for these delay-sensitive applications. With this expected patience, we have the theorem of system-leaving, rephrased as follows:

Theorem 3.1 When the overall latency W(R_{i,j}) exceeds the users' patience for R_{i,j}, users will abandon the system, and this leads to an abandoning rate on the waiting queues.

To ensure continuous service and prevent users from abandoning the system, interactive workloads should be scheduled within the user tolerance. To do so, we perform rescheduling and resource adjustment in this work. Specifically, we separate interactive workloads from non-interactive ones and let them wait in different queues. We use two queues in each server: Ql represents the queue for non-interactive requests, and Qr represents the queue for interactive requests. The two



queues share the infrastructure and resources of the same server. The difficulty in accelerating interactive workloads is that allocating more resources to Qr will unavoidably affect the processing of Ql. Thus, how to share the resources of one server among different workloads becomes a key concern. To address this issue, we design D3G, which adjusts the resource allocation among different kinds of workloads automatically and in real time. To make D3G more intelligent, we design an estimation algorithm to pre-calculate the processing time on other servers. Furthermore, we also introduce a feedback scheme to reduce the negative impact on non-interactive workloads.

3.2.3 D3G Framework

As described in the previous subsection, we separate the interactive workloads from the non-interactive workloads and let them queue independently. We design a latency estimation algorithm, and once the estimated latency exceeds the user tolerance, we dynamically adjust the resource allocation among queues according to a feedback scheme. Thus, the interactive workloads get accelerated in all intermediate servers and finally enjoy a latency comparable to that of the non-interactive workloads. To be specific, when a request arrives at a server, a matching scheme checks whether this request should be forwarded further, i.e., if it is an interactive request, it is queued in Qr, otherwise in Ql, together with its source (s), destination (d), and function (f). Then the latency estimation algorithm pre-calculates the overall latency of this request. If it exceeds the user patience, we dynamically adjust the resource allocation according to the feedback scheme, which works automatically and in real time. In the latency estimation algorithm, when a request with ⟨s, d, f⟩ enters a queue, we update the queue information and record the entrance time. Once a request begins to be served, we record the beginning time and the queueing time. If this request is an interactive one, it is transmitted to the next service after being served. If this request is a non-interactive one, we can obtain the finish time directly, calculated by summing the beginning time and the service time. Thus, we can calculate the overall estimated latency. In the feedback scheme, we formulate the arrival rate of workloads λ, the server's serving rate β, and the user's abandon rate γ before using a Markov chain to model the two queues. After calculating the queue length and the expected waiting time, we equalize workload latencies by adjusting the resource allocation. The detailed adjustment is described in the next subsection. Overall, D3G converts the performance optimization problem into a resource allocation problem. By estimating the latencies of different network services, D3G mitigates the imbalance by adjusting the allocated resources. The real-time estimation algorithm and the feedback scheme make D3G work efficiently and automatically.
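The following Python sketch illustrates the bookkeeping behind this two-queue separation and latency estimation. It is a minimal sketch under our own simplifying assumptions; the class and function names are hypothetical and not part of the deployed D3G code.

import time
from collections import deque

class ServerQueues:
    """One server with two queues: Qr (interactive) and Ql (non-interactive)."""
    def __init__(self, service_time_estimate):
        self.qr, self.ql = deque(), deque()
        self.service_time = service_time_estimate   # estimated per-request serving time

    def enqueue(self, req):
        # Matching scheme: interactive requests (those that must be forwarded
        # to another server) go to Qr, all others to Ql.
        (self.qr if req["interactive"] else self.ql).append(req)
        req["enter_time"] = time.time()

    def estimate_latency(self, req, remaining_hops):
        # Pre-calculate the overall latency: waiting behind the current queue,
        # plus serving here, plus the estimated time on the remaining servers.
        queue = self.qr if req["interactive"] else self.ql
        wait_here = len(queue) * self.service_time
        return wait_here + self.service_time * (1 + remaining_hops)

def needs_adjustment(server, req, remaining_hops, user_patience):
    # If the estimate exceeds user patience, the feedback scheme shifts
    # resources toward the interactive queue.
    return server.estimate_latency(req, remaining_hops) > user_patience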



3.2.4 Adjusting Resource Allocation

To calculate the resources allocated to interactive workloads (μr) and non-interactive workloads (μl), we model the queueing problem in the adjusting scheme. To analyze the arrival and leaving rates of requests, we adopt a Markov model to represent the state transitions of the two queues. By modeling different distributions of service time and abandon rate, we can calculate the latency expectation of the various workloads. Finally, the feedback scheme adjusts the resource allocation to ensure fairness. The arrival process of requests is a discrete-time random process, and the number of requests in the future is related only to the number at present, i.e., each queue can be formulated as a Markov chain, and the queue length can be calculated. When a request arrives, the queue length increases by 1; conversely, when a request abandons the queue, the number decreases. Let λ denote the arrival rate, and let 1/β and 1/γ denote the expected service time and abandoning time, respectively; we can then model each queue as an M/M/1 queue and obtain the queue length Nt as a function of λ, θ, and μ. As described before, the service time follows an exponential distribution, and the services are mutually independent, so the waiting time of a request is a convolution of these distributions. So far, the expected waiting time on one server can be calculated. Recall that interactive requests are queued multiple times in different servers, while non-interactive requests are queued only once. With the expected waiting time, we assume the transfer time between servers follows a Gaussian distribution [18] and then make the overall latencies of the two queues equal to each other. Finally, we can work out the allocation rates for the interactive workloads μr and the non-interactive workloads μl. With this adjusted allocation, interactive workloads from delay-sensitive applications can enjoy a reduced latency that is within the user tolerance.
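As a rough illustration of this adjustment step, the sketch below equalizes the two expected latencies using a plain M/M/1 waiting-time formula; it ignores abandonment and transfer time, so it is not the exact model in the text, and all names and numbers are ours.

def expected_latency(arrival_rate, service_rate, hops=1):
    # M/M/1 sojourn time per server: 1 / (mu - lambda); an interactive request
    # pays this cost once per hop (transfer time omitted in this sketch).
    if service_rate <= arrival_rate:
        return float("inf")
    return hops / (service_rate - arrival_rate)

def split_capacity(total_mu, lam_r, lam_l, hops_r, eps=1e-4):
    # Binary-search the share mu_r given to the interactive queue so that the
    # two expected latencies are (approximately) equal; mu_l gets the rest.
    lo, hi = lam_r + eps, total_mu - lam_l - eps
    for _ in range(60):
        mu_r = (lo + hi) / 2
        gap = (expected_latency(lam_r, mu_r, hops_r)
               - expected_latency(lam_l, total_mu - mu_r))
        lo, hi = (mu_r, hi) if gap > 0 else (lo, mu_r)
    return mu_r, total_mu - mu_r

# Example: interactive requests cross 4 servers, non-interactive just 1.
print(split_capacity(total_mu=10.0, lam_r=2.0, lam_l=3.0, hops_r=4))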

3.3 Deployment

We implement D3G on servers in Baidu networks, and the algorithm is written in the C language. The servers run the Linux operating system and are configured with Tomcat web servers based on Java. We choose four servers, each configured with 4 GB of memory, two cores, and 100 Mbps of public network bandwidth. As for the clients, there are 36 end-hosts, each configured with an Intel i5 1.7 GHz CPU and 2 GB of memory. All these end-hosts constantly send either interactive or non-interactive requests to the servers. The interactive workloads need to be served by each server, while the non-interactive workloads can be processed by just one server.



We conduct a series of experiments in the next section: (1) Overall performance. We measure the average response time of both interactive and non-interactive workloads against the state-of-the-art scheme without D3G. (2) Algorithm dynamism. We test the algorithm's performance under a dynamic scenario. (3) System scalability. We evaluate the optimization of D3G at expanding scales.

3.4 D3G Experiment

As described in the deployment section, we conduct three groups of experiments to test the algorithm's efficiency and evaluate the average response time and service performance under different network environments. Recall the example in the motivation section: the interactive workloads suffer from roughly seven times longer latency than non-interactive workloads. The experiment results in this section show that D3G significantly reduces latency for time-sensitive workloads. At the same time, non-interactive workloads are not seriously affected and still enjoy short latencies. Besides verifying the effectiveness of the D3G algorithm, we also demonstrate its practicability for large-scale deployment.

3.4.1 Overall Performance

In this subsection, we design several groups of experiments to evaluate the performance of D3G under different network environments. As interactive workloads go through multiple functions, these workloads are in fact larger than non-interactive workloads. So we set the interactive workload length from 100 to 200 KB, and that of non-interactive workloads from 1 to 100 KB. We start the experiment in a 9-to-1 model: every 9 end-hosts keep sending interactive and non-interactive requests to one server. We run the experiment 200 times and calculate the average response time per 20 runs with upper and lower error bars. Figure 3.2a shows the results without D3G, from which we can observe that the average latency for interactive workloads is about 120 ms and that for non-interactive workloads is about 75 ms. Figure 3.2b shows the optimized results after deploying D3G, where the interactive workload latency is reduced by 33% (to 80 ms) on average with minimal impact on the non-interactive workload latency.


Fig. 3.2 Performance when non-interactive requests are (1 KB, 100 KB] and interactive requests are (100 KB, 200 KB]. (a) Latency without D3G. (b) Latency with D3G (x-axis: experiment #; y-axis: response time in ms; curves: interactive and non-interactive)

Figure 3.2a shows that interactive requests suffer from more than two times longer latency than non-interactive requests before implementing D3G, while Fig. 3.2b shows the optimization results after implementing D3G. For interactive requests, the average latency is about 90 ms, and for the non-interactive workloads, the latency is 78 ms. From this figure, we can also observe that at the 100th ms, none of the interactive workloads in Fig. 3.2a are finished while the non-interactive workloads are all finished. In Fig. 3.2b, 78% of interactive workloads and 99% of non-interactive workloads are finished. From these experiments, we can see that D3G works well in accelerating interactive workloads under various circumstances.

3.4.2 Algorithm Dynamism

To evaluate our algorithm in dynamic scenarios, we simulate a dynamic situation to verify the real-time efficiency of D3G. We send only non-interactive requests during the first 60 s, then begin to send interactive requests at the 60th second and stop sending them at the 130th second. Figure 3.3a shows the average delay of this dynamic process. The latency is quite long for a short period of time (from 60 to 70 s). Then it begins to drop because more resources are allocated to the interactive queue. When the interactive workloads stop at the 130th second, the non-interactive workload latency drops. From these experiments, we can conclude that the interactive workloads are accelerated after implementing D3G and the performance of non-interactive workloads is not seriously affected.


Fig. 3.3 Performance for different parameters. (a) Average delay for the dynamic scenario (x-axis: time in s; y-axis: average latency in ms). (b) Average delay for different scales (x-axis: number of requests, 50–500; y-axis: average delay in ms)

3.4.3 System Scalability

Finally, we extend the experiment scale and increase concurrency to test the algorithm's scalability. We speed up the request sending rate, and Fig. 3.3b shows the average latency at various scales. When there are 50 concurrent requests, the average latency is about 65 ms for non-interactive workloads and 80 ms for interactive ones. When the number of concurrent requests increases to 500, the average latencies are about 150 and 170 ms, respectively. These results indicate that our algorithm extends to large-scale systems. Furthermore, if the interactive workloads are handled by more cloud servers, the latency without D3G will be even higher (as in the case in the motivation section), and the improvement from our algorithm will be even more obvious. From the above deployment and evaluations, we can conclude that D3G successfully reduces the interactive workload latency to a reasonable range with no distinct impact on non-interactive workloads, even at expanding scales. We believe that the core idea of D3G, reducing latency for interactive workloads of time-sensitive applications, will soon be adopted by today's microservices.

3.5 Conclusion

For cloud-based service chains, we measure and analyze their performance in Baidu networks, and the results show that these delay-sensitive, microservice-like applications suffer from long latency due to the extra delay introduced by their multiple stand-alone components. In this chapter, we propose a new algorithm called Dynamic Differentiated service with delay-Guarantee (D3G), which aims at reducing the overall latency for



chained applications while ensuring workload fairness. To this end, we design two queues in the servers: one for interactive requests and the other for non-interactive requests. To keep the latency within the user tolerance, we design a latency estimation algorithm to pre-calculate the interaction latency. Furthermore, to guarantee fairness, we introduce a feedback control scheme based on resource allocation to ensure the performance of non-interactive workloads. A wide range of detailed evaluation results demonstrate that D3G succeeds in accelerating chained services and ensuring workload fairness. As microservice-like applications have many obvious advantages, such as clean boundaries, better system scalability, and lower operation cost, they will surely attract more and more attention, and we believe that D3G will further reveal its effectiveness along with the development of service chains.

References 1. Wang, H., Shea, R., Ma, X., Wang, F., Liu, J.: On design and performance of cloud-based distributed interactive applications. In: 2014 IEEE 22nd International Conference on Network Protocols (ICNP), pp. 37–46. IEEE (2014) 2. Pujol, E., Richter, P., Chandrasekaran, B., Smaragdakis, G., Feldmann, A., Maggs, B.M., Ng, K.-C.: Back-office web traffic on the internet. In: Proceedings of the 2014 Conference on Internet Measurement Conference, pp. 257–270. ACM (2014) 3. Zaki, Y., Chen, J., Potsch, T., Ahmad, T., Subramanian, L.: Dissecting web latency in ghana. In: Proceedings of the 2014 Conference on Internet Measurement Conference, pp. 241–248. ACM (2014) 4. Yue, K., Wang, X.-L., Zhou, A.-Y., et al.: Underlying techniques for web services: a survey. J. Softw. 15(3), 428–442 (2004) 5. Seung, Y., Lam, T., Li, L.E., Woo, T.: Cloudflex: seamless scaling of enterprise applications into the cloud. In: INFOCOM, 2011 Proceedings IEEE, pp. 211–215. IEEE (2011) 6. Xu, K., Zhang, Y., Shi, X., Wang, H., Wang, Y., Shen, M.: Online combinatorial double auction for mobile cloud computing markets. In: Performance Computing and Communications Conference (IPCCC), 2014 IEEE International, pp. 1–8. IEEE (2014) 7. Guo, J., Liu, F., Zeng, D., Lui, J.C., Jin, H.: A cooperative game based allocation for sharing data center networks. In: INFOCOM, 2013 Proceedings IEEE, pp. 2139–2147. IEEE (2013) 8. Vik, K.-H., Halvorsen, P., Griwodz, C.: Multicast tree diameter for dynamic distributed interactive applications. In: INFOCOM 2008. The 27th Conference on Computer Communications. IEEE. IEEE (2008) 9. Webb, S.D., Soh, S., Lau, W.: Enhanced mirrored servers for network games. In: Proceedings of the 6th ACM SIGCOMM Workshop on Network and System Support for Games, pp. 117– 122. ACM (2007) 10. Shao, Z., Jin, X., Jiang, W., Chen, M., Chiang, M.: Intra-data-center traffic engineering with ensemble routing. In: INFOCOM, 2013 Proceedings IEEE, pp. 2148–2156. IEEE (2013) 11. Sivaraman, A., Cheung, A., Budiu, M., Kim, C., Alizadeh, M., Balakrishnan, H., Varghese, G., McKeown, N., Licking, S.: Packet transactions: high-level programming for line-rate switches. In: ACM SIGCOMM, pp. 15–28 (2016) 12. McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., Turner, J.: Openflow: enabling innovation in campus networks. ACM SIGCOMM 38(2), 69–74 (2008) 13. McKeown, N.: Software-defined networking. INFOCOM Keynote Talk 17(2), 30–32 (2009) 14. Hong, C.-Y., Kandula, S., Mahajan, R., Zhang, M., Gill, V., Nanduri, M., Wattenhofer, R.: Achieving high utilization with software-driven WAN. In: ACM SIGCOMM, pp. 15–26 (2013)



15. Jain, S., Kumar, A., Mandal, S., Ong, J., Poutievski, L., Singh, A., Venkata, S., Wanderer, J., Zhou, J., Zhu, M., et al.: B4: experience with a globally-deployed software defined WAN. ACM SIGCOMM 43(4), 3–14 (2013) 16. Zhang, H., Chen, L., Yi, B., Chen, K., Chowdhury, M., Geng, Y.: CODA: toward automatically identifying and scheduling coflows in the dark. In: ACM SIGCOMM, pp. 160–173 (2016) 17. Chowdhury, M., Stoica, I.: Coflow: an application layer abstraction for cluster networking. In: ACM Hotnets. Citeseer (2012) 18. Pebesma, E., Cornford, D., Dubois, G., Heuvelink, G.B., Hristopulos, D., Pilz, J., Stohlker, U., Morin, G., Skoien, J.O.: Intamap: the design and implementation of an interoperable automated interpolation web service. Comput. Geosci. 37(3), 343–352 (2011)

Chapter 4

A Cross-Layer Transport Protocol Design in the Terminal Systems of DC

Abstract Data centers are now used as the underlying infrastructure of many modern commercial operations, powering both large Internet services and a growing number of data-intensive scientific applications. The tasks in these applications consist of rich and complex flows that require different resources at different time slots. The existing data center scheduling frameworks are, however, based on either task- or flow-level metrics. This simplifies design and deployment but hardly unleashes the potential for obtaining low task completion time for delay-sensitive applications. In this chapter, we show that the performance (e.g., tail and average task completion time) of existing flow-aware and task-aware network scheduling is far from optimal. To address this problem, we carefully examine the possibility of considering task- and flow-level metrics together and present the design of TAFA (task-aware and flow-aware) for data center networks. This approach seamlessly combines the existing flow and task metrics while successfully avoiding their respective problems of flow isolation and flow indiscrimination. The evaluation results show that TAFA obtains near-optimal performance and reduces task completion time by over 35% for existing data center systems.

4.1 Introduction

Scheduling policies determine the order in which tasks and flows are scheduled across the network [1–6]. In this section, we show how flow-aware and task-aware scheduling waste resources when applied separately, and we motivate TAFA, which combines the two levels of awareness, improving task completion time by 2x over flow-aware fair sharing and by 20% over task-aware FIFO-LM. Before giving a specific example, we introduce the definitions of task and flow:

Definition 4.1 (Task) Tasks consist of multiple flows and respond to a user request completely. Applications in DC perform rich and complex tasks (such as executing a search query or generating a user's wall).




Definition 4.2 (Flow) Flows are the fundamental unit of a basic action, and a series of flows together form one task that serves a user request. Besides, flows traverse different parts of the network at potentially different times, and there are tight relationships among flows, such as sequencing and parallelization. The TCT of a particular task depends on the finish time of the last flow belonging to this task.

With the concepts of task and flow, we consider a small cluster with CPU and link resources and two tasks, each with two steps separated by a barrier. This situation resembles map-reduce: map tasks are CPU intensive while reduce tasks are network intensive, as in the example in [7]. There are two flows in each task, and each flow has two stages. The CPU processing stage needs two units of CPU time, and the network processing stage consumes two units of link time. Further, the network processing stage cannot begin until the CPU processing stage finishes.

Flow-aware Consider the flow-aware fair sharing (FS) scheduling scheme. Assuming all flows are infinitely divisible, scheduling all four map flows would fully use up the cluster's CPU for the first 4t, and then the four reduce flows become runnable and the cluster fairly allocates the link to them. Each flow gets only 1/4 of the resources due to contention, keeping the link busy for another 4t. Thus, both tasks finish at time 8t.

Task-aware Obviously, flow-aware fair sharing is not a good choice here. Now we consider the task-aware scheduling FIFO-LM in [8]. According to FIFO, the two flows of task#1 are scheduled first, and, carrying the same task-id, the CPU stages of the two flows share the CPU fairly, so this phase occupies 2t of CPU; then at 2t, the network stage of task#1 and the CPU stages of task#2 can start. At 4t, the network stages of task#2 can start. The schedule is shown in the upper part of Fig. 4.1 (where bottleneck X denotes the CPU and bottleneck Y denotes the network). The two tasks now finish at time 4t and 6t. The average completion time is reduced from 8t to 5t compared with FS.

Note, however, that this result is still far from optimal. We now show how to reduce TCT beyond the task-aware scheme: the core idea is to make the task-aware scheduling scheme flow-aware. As described in Definition 4.2, flow completion time is closely related to task completion time. To reduce task completion time, we should distinguish the different flows of one task, because reducing the average flow completion time will also shorten the task completion time. So here we discard the fair sharing method among flows within one step and make the cluster serve flows one by one (later, in Sect. 4.4, we introduce FQH to decide the flow serving order). As shown in the lower part of Fig. 4.1, the CPU stages of the two flows are not served simultaneously; one of them finishes processing early, so the corresponding reduce phase can start at 1t (while in FIFO, this reduce phase starts at 2t). Along this line, the flows of task#2 can also be scheduled in advance. Thus, the finishing times of the two tasks are 3t and 5t. The average TCT is half of FS (4t ← 8t) and 20% less than task-aware (4t ← 5t). From this simple scenario, we can see that with a simple one-by-one order, we can reduce the average completion time by 20% over FIFO-LM.


Fig. 4.1 Distilling the benefits of the task-aware and flow-aware scheme (TAFA) over task-aware-only scheduling (upper panel: task-aware only; lower panel: task- and flow-aware; each panel shows flows A and B of tasks #1 and #2 on bottleneck X (CPU) and bottleneck Y (network) over time 0–6t)

The above example highlights that isolated flow-awareness and task-awareness are inefficient and do not optimize task completion time, which indicates that disregarding cross-layer awareness leads to wasted resources. Before describing our framework TAFA, which is both flow-aware and task-aware, we analyze the difficulties in Sect. 4.3, and then we show how TAFA outperforms state-of-the-art protocols in Sect. 4.4.

4.2 TAFA's Control Scheme

Now we take a closer look at TAFA's control scheme, trying to find out why it performs better than FIFO in the above scenario. We present the analysis of TAFA in a simplified setting. We consider N tasks (T1, ..., TN) in a map-reduce scenario; task Ti consists of ni flows, with ni ∈ [nmin, nmax]. The number of tasks consisting of ni flows is mi, i.e., \sum_{i=n_{min}}^{n_{max}} m_i = N. It is further assumed that all flows have the same size l and share a single bottleneck resource with serving speed v. Of course, not all flows have the same size in reality; however, the relationship between flows and tasks is what we care about most here, so this toy example still allows us to capture the impact of two-level awareness on task completion time.

In FIFO, for a task consisting of ni flows, all of these flows are scheduled synchronously, so the serving speed allocated to a specific flow is v/ni, and the completion time of this task is l/(v/ni). So the completion time of all N tasks is calculated as:

TCT_{FIFO} = \sum_{i=n_{min}}^{n_{max}} m_i \times \frac{l \times n_i}{v}   (4.1)

In TAFA, for a task consisting of ni flows, the flows are not scheduled synchronously but in a particular order (here we also use the one-by-one scheme for simplicity), so the map stage of the second flow can start as soon as the first flow finishes its map stage, because the first flow releases the resource when it moves into the reduce phase; we call this circumstance pre-processing. So the completion time of this task in TAFA is:

\frac{l}{v} + (n_i - 1) \times \frac{l/2}{v} = \frac{l(n_i + 1)}{2v}   (4.2)

As the pre-processing also exists between tasks, the completion time of all N tasks is calculated as:

TCT_{TAFA} = \sum_{i=n_{min}}^{n_{max}} m_i \times \frac{l(n_i + 1)}{2v} - (N - 1) \times \frac{l/2}{v}   (4.3)

To make the comparison clear, we set ni to 2, N to 10, and each mi to 1 and calculate the task completion time for both schemes:

TCT_{FIFO} = \sum_{i=1}^{10} \frac{l \times 2}{v} = \frac{20l}{v}, \qquad TCT_{TAFA} = \sum_{i=1}^{10} \frac{l(2 + 1)}{2v} - 9 \times \frac{l/2}{v} = \frac{21l}{2v}   (4.4)

From this scenario, we can see that TAFA reduces TCT to roughly 50% of FIFO's.
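A quick numeric check of Eqs. (4.1)–(4.4); this is a sketch with our own function names, with l and v set to 1.

def tct_fifo(task_flow_counts, l=1.0, v=1.0):
    # Each task's n_i flows share the bottleneck, so the task takes l * n_i / v.
    return sum(l * n / v for n in task_flow_counts)

def tct_tafa(task_flow_counts, l=1.0, v=1.0):
    # One-by-one serving pipelines the two stages: the first flow costs l/v,
    # each further flow adds l/(2v); consecutive tasks overlap by l/(2v).
    per_task = sum(l / v + (n - 1) * (l / 2) / v for n in task_flow_counts)
    overlap = (len(task_flow_counts) - 1) * (l / 2) / v
    return per_task - overlap

tasks = [2] * 10                      # N = 10 tasks, n_i = 2 flows each
print(tct_fifo(tasks))                # 20.0  -> 20 l / v
print(tct_tafa(tasks))                # 10.5  -> 21 l / (2v)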

4.3 Key Challenges

Task completion time is affected by two factors: one is the scheduling order, and the other is rate control. For the former, the scheduling order is related not only to the task order but also to the flow scheduling order, and the latter is in fact a key element (because task completion time depends on the finish time of a task's last flow). For rate control, to shorten task completion time, short flow first (SFF) is known



to be the most effective. To achieve SFF, the congestion window should be adjusted according to flow size, while in reality almost all existing protocols ignore flow size. Solving these two problems is not easy, and we list the key challenges in this section. First, achieving shortest task first (STF) without prior knowledge. As first-in-first-out is not an effective scheduling method for reducing average task completion time, we should try to achieve STF at the task level. However, as described before, tasks consist of multiple flows, which traverse different parts of the network at potentially different times, so we cannot know the total size of a task until all of its flows have arrived, leaving the priority of tasks unclear at scheduling time. The key challenge here is to achieve short task completion time without prior knowledge of tasks. Second, flow-level awareness. As task completion time depends on the last flow's finish time, and SFF is the most effective method to shorten flow finish time, the key problem is to schedule short flows earlier than long ones. The difficulty lies in making the rate of a shorter flow higher than that of a longer flow. To sum up, an efficient scheduling scheme should take both aspects into account and be both task-aware and flow-aware. To achieve shorter task completion time, we design our model TAFA, whose framework is introduced in the next section.

4.4 TAFA: Task-Aware and Flow-Aware

In this section, we describe TAFA's scheduling heuristic, which combines task-awareness and flow-awareness to make more reasonable scheduling decisions. As task completion time depends on the last flow's finish time, two questions should be clarified to determine the order that minimizes TCT. One is the task schedule order, which is a well-known NP-hard problem [8], and the other is the flow completion time, which can reduce task completion time. We develop heuristics that enable STF with no prior knowledge using commodity switches (Sect. 4.4.1). To give the detailed flow scheduling method for reducing completion time, we design FQH in Sect. 4.4.2.

4.4.1 Task-Awareness

The task scheduling policy determines the order in which tasks are scheduled across the network; since one task consists of multiple flows, the initial priorities of these flows depend on the task order. In this subsection, we focus on task priority. At a high level, TAFA's main mechanisms include priority queuing and ECN marking, which adjust the priority of tasks dynamically according to the bytes they have sent.


4.4.1.1 End-Host Operations

In TAFA, end-hosts are responsible for two things: one is to generate the task-id, and the other is to respond to rate control according to the markings applied by switches. For the former, the end-host assigns a globally unique identifier (task-id) to each task. When an end-host produces a new task, each flow of this task is tagged with the task-id. To generate this id, each host maintains a monotonically increasing counter. Unlike PIAS [9], in which tags are carried by packets, TAFA lets flows carry these tags, making the task that each flow belongs to clear to the switch; and unlike the task-aware scheme in [8], which just separates heavy tasks from short ones, we classify tasks into multiple priorities, which will be explained in Sect. 4.4.1.2. For rate control, we should first explain the relationship among multiple flows. As tasks consist of a number of complex flows, which traverse different servers at different times to respond to a user request, not all of them are active at the same time. Though there is huge diversity among different applications, according to their communication patterns the relationships of flows can be grouped into three categories:

• Parallel flows: these may be requests to a cluster of storage servers; the flows of such tasks are parallel;
• Sequential access tasks: as the flow order is sequential, the flows in one task should be scheduled one after another;
• Partition-aggregate tasks: these may involve tens or hundreds of flows, and the flow order is particularly important.

However, PIAS has a serious weakness in that it ignores the relationship of flows in the same task, and each flow has its priority adjusted in isolation. For a sequential access task, if a previous flow is heavy, its priority will be gradually demoted, while the subsequent short flows with higher priorities will finish earlier; but the subsequent results are useless without the previous result. To avoid this situation, TAFA adjusts the sending rate and order according to the markings applied by switches whenever necessary. The detailed scheme is described in Sect. 4.4.2.
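A minimal sketch of the end-host side tagging is shown below; the class and field names are hypothetical, as the text does not give the real implementation details.

import itertools

class EndHost:
    def __init__(self, host_id):
        self.host_id = host_id
        self._counter = itertools.count()          # monotonically increasing counter

    def new_task_id(self):
        # Globally unique task-id: (host_id, local sequence number).
        return (self.host_id, next(self._counter))

    def tag_flows(self, flows, task_id):
        # Every flow of the task carries the same task-id, so the switch can
        # account bytes per task rather than per flow.
        for f in flows:
            f["task_id"] = task_id
        return flows

host = EndHost(host_id=7)
tid = host.new_task_id()
print(host.tag_flows([{"size": 12_000}, {"size": 3_500}], tid))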

4.4.1.2 Switch Operations

TAFA switches need to support two functions: priority queuing and ECN marking. For priority queuing, end-hosts tag a task-id onto each flow of each task, so that the order of each flow is decided. The only thing switches need to do is maintain the queues. Flows wait in different priority queues (more than two) at the switch. Whenever a link is idle or has enough resources to schedule a new flow, the first flow in the highest non-empty priority queue is served. As the number of bytes one task has sent rises, the



priority of this task is gradually demoted, so the flows of this task are also affected by the demotion: they are tagged with a lower priority and wait longer in the queue. For ECN marking (a feature already available in modern commodity switches [10]), flows are marked with congestion experienced (CE) in the network to provide multi-bit feedback to the end-hosts. TAFA employs a very simple marking scheme at switches, with only one parameter, the marking threshold ϒ. If the bytes sent by one task exceed ϒ, the flows of this task are marked with CE; otherwise they are not marked. This ECN marking notifies the end-host to demote the task's priority: end-hosts whose flows are marked with ECN should tag their flows with a lower priority than the current one. This feedback scheme ensures that tasks with fewer and shorter flows are scheduled earlier than those with more and longer flows. Using this short-task-first scheme, we not only reduce the TCT but also ensure that heavy tasks are not starved.

4.4.1.3 Multiple Priority Queues

As the key problem in flow-level scheduling [9] is determining the threshold ϒ, here we use a vector of thresholds ϒ (consisting of ϒ1, ϒ2, ..., ϒτ−1), where τ is the number of priority queues and ϒi is the threshold of priority queue i. When the accumulated bytes of one task exceed ϒi, the flows of this task are marked with ECN, and the task is demoted to the next lower priority queue. The advantage of a threshold vector over a single threshold lies in the demotion process: a single threshold cannot handle bursts. Before explaining the reason, consider two flow size distributions [11] (as shown in Fig. 4.2): the first distribution is from a data center supporting web search [10], and the other is from a cluster running large data mining jobs [12]. According to the analysis by Alizadeh, these two workloads are a diverse mix of small and large flows: over 95% of all bytes come from 30% of the flows in the web search workload, while in the data mining workload more than 95% of the bytes come from 4% of the flows and 80% of the flows are smaller than 10 KB. These analyses point to an important extreme case: a burst of short tasks and short flows. Assume there are multiple new tasks, each containing plenty of flows, so all these flows are in the highest priority queue. If all the priority queues had the same threshold, all these flows would fall into the lower priority queue simultaneously, which would make the multiple queues useless. A stage-wise threshold vector solves this problem: with ϒ1 < ϒ2 < ... < ϒτ−1, the speed at which tasks are demoted into lower priority queues is progressively slowed down, avoiding concurrent priority demotions. As the values of the threshold set are related to the actual flows, we further show the robustness to traffic variations in the simulation section.
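The sketch below shows this stage-wise demotion logic with an increasing threshold vector; the threshold values and names are illustrative, not a production configuration.

THRESHOLDS = [100_000, 400_000, 1_600_000]     # bytes; plays the role of ϒ1 < ϒ2 < ϒ3

def on_bytes_sent(task):
    """Update a task's priority after more of its bytes pass the switch.

    task is a dict with 'sent' (accumulated bytes) and 'priority'
    (0 = highest queue). Returns True if the task's flows should be CE-marked,
    i.e. the end-host must tag subsequent flows with the lower priority.
    """
    level = task["priority"]
    if level < len(THRESHOLDS) and task["sent"] > THRESHOLDS[level]:
        task["priority"] = level + 1             # demote to the next queue
        return True                              # mark flows with ECN/CE
    return False

task = {"sent": 0, "priority": 0}
for chunk in (80_000, 50_000, 300_000, 1_500_000):
    task["sent"] += chunk
    print(task["sent"], on_bytes_sent(task), task["priority"])

Because the thresholds grow from queue to queue, a burst of short tasks does not cross all thresholds at once, so the queues do not empty into the same lower level simultaneously.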



Fig. 4.2 Two flow size distributions. One is web search workload, and the other is data mining workload

4.4.2 Flow-Awareness

To introduce TAFA in detail, in this subsection we show how to make task-aware scheduling flow-aware. Like DCTCP and D2TCP, we require that the switches support ECN, which is true of today's data center switches. As short flow first (SFF) is known to be the most effective way to shorten flow completion time, we modulate the congestion window in a size-aware manner. When congestion occurs, long flows back off aggressively, and short flows back off only a little. With this size-aware congestion avoidance scheme, more flows can be scheduled early. To explain the flow-awareness of TAFA, we start with D2TCP and build size-awareness on top of it. Like DCTCP and D2TCP, the sender maintains α, the estimated fraction of packets that are marked when the buffer occupancy exceeds the threshold K. α is updated once every RTT as follows:

\alpha = (1 - g) \times \alpha + g \times f   (4.5)

where f is the fraction of packets marked with CE bits in the last data window and 0 < g < 1 is the weight given to the new sample. D2TCP maintains d as the deadline imminence factor, where a larger d implies a closer deadline, and designs the penalty function as:

p = \alpha^{d}   (4.6)

The size of the congestion window W is then calculated from p. However, as [13] proposed, a flow rate control scheme should respect the differentiation principle, i.e., when the traffic load becomes heavier, the differences between the rates of flows at different levels should increase. D2TCP violates this principle and works badly in some scenarios [13]. So we design the penalty function in TAFA as:

p = \alpha / s   (4.7)

where s is determined from flow size information. Let Smax be an upper bound used to truncate a larger s, and let Sc be the remaining size a flow still has to transmit; we set:

s = \frac{S_{max}}{S_c}   (4.8)

Note that α ≤ 1 and s ≥ 1; therefore, p = α/s = α × Sc/Smax ≤ 1. So we resize the congestion window W as follows:

W = \begin{cases} W \times (1 - p) & \text{with congestion} \\ W + 1 & \text{without congestion} \end{cases}   (4.9)

With this simple algorithm, we compute p = α × Sc/Smax and use it to adjust the congestion window size. If there is no congestion in the last window, W is increased by 1 as in TCP, while if any congestion occurs, W is decreased by a fraction p. When all flow sizes are equal, Sc/Smax = 1, and severe congestion causes a full back-off similar to TCP and DCTCP. For flows of different sizes,

\frac{\partial(1 - p)}{\partial s} = \frac{\partial(1 - \alpha/s)}{\partial s} = \frac{\alpha}{s^{2}} > 0, \qquad \frac{\partial^{2}(1 - p)}{\partial s \, \partial \alpha} = \frac{1}{s^{2}} > 0   (4.10)

which indicates that TAFA meets the PD principle in [13], i.e., the difference between two flows of different sizes increases when the traffic load becomes heavier.
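A compact rendering of this size-aware window update is shown below; it is a sketch, the parameter value for g is illustrative, and α would come from the ECN feedback loop in practice.

def update_cwnd(cwnd, alpha, s_c, s_max, congested):
    """Size-aware back-off: p = alpha * S_c / S_max, following Eqs. (4.7)-(4.9)."""
    if not congested:
        return cwnd + 1                      # additive increase, as in TCP
    s = s_max / s_c                          # s >= 1, Eq. (4.8)
    p = alpha / s                            # shorter remaining flow -> smaller penalty
    return cwnd * (1 - p)

def update_alpha(alpha, marked_fraction, g=0.0625):
    # EWMA of the fraction of CE-marked packets, Eq. (4.5).
    return (1 - g) * alpha + g * marked_fraction

# Under the same congestion level, a flow with little data left backs off less:
print(update_cwnd(40, alpha=0.8, s_c=10_000,  s_max=1_000_000, congested=True))  # ~39.7
print(update_cwnd(40, alpha=0.8, s_c=900_000, s_max=1_000_000, congested=True))  # ~11.2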



Fig. 4.3 TAFA overview (end-hosts generate the task-id and adjust cwnd; the switch maintains an input queue and multiple priority queues, from highest to lowest, and tags ECN)

4.4.3 Algorithm Implementation

We now introduce the framework of the TAFA algorithm; Fig. 4.3 gives an abstract overview of TAFA. There are several queues with different priorities for input tasks. When flows from a new task arrive, they are marked with the highest priority and enter the first queue. The lengths of the flows from one task are added up and compared with ϒk, the threshold of queue k. When the total length exceeds ϒk, the corresponding task is degraded from priority k to k + 1, and the following flows from this task enter queue k + 1 directly. As the scheduling proceeds from the high-priority queue to the low-priority queues, with this accumulated task length TAFA successfully emulates shortest-task-first scheduling while requiring no prior knowledge about tasks. We design TAFA to realize this process as shown in Algorithm 1, where F_i^j denotes the jth flow of Task_i and P(T)/P(F) denotes the priority of a task/flow. EnQueue(P, F) is a function that makes F enter queue P. This algorithm adjusts the priority of tasks dynamically according to the task length, making short tasks be scheduled earlier than long ones without prior knowledge of task length. While many other works schedule under the assumption that task (or flow) sizes are already known, TAFA makes this process more practical. For flow-level adjustment, we design Algorithm 2 to take the congestion extent and the flow size into consideration. Switches tag ECN when congestion occurs and send this signal back to the end-hosts within the flows. When an end-host receives flows with ECN, it adjusts its congestion window cwnd. In TAFA, which aims at shorter completion time, the size of cwnd depends on the congestion extent α and the flow size Sc: when traffic congestion α becomes serious, all flows back off in direct proportion to α, but for smaller flows the penalty function p is smaller than for longer ones, making the back-off slighter and allowing more data to be sent. When an end-host receives flows without ECN, it acts as general TCP and just increases cwnd by 1. The end-host then sends packets according to this updated congestion window size.



Algorithm 1 TAFA
Initialization
while F_i^j arrives do
    TimeStamp(F_i^j)
    if T_i == newTask then
        P(T_i) ← 1
    end if
    P(F_i^j) ← CheckPriority(T_i)
    EnQueue(P(F_i^j), F_i^j)
    l(F_i^j) ← length(F_i^j)
    AddLength(T_i, l(F_i^j))
    if length(T_i) > ϒ_k then
        Degrade(T_i)
    end if
end while
FlowLevelScheduling

Algorithm 2 Flow-level scheduling
Switch:
if Congestion occurs then
    Tag(ECN)
end if
SendBackToEndhost
Endhost:
if ECN == true then
    s ← Smax / Sc
    α ← (1 − g) × α + g × f
    p ← α/s
    cwnd ← cwnd × (1 − p)
else
    cwnd ← cwnd + 1
end if
SendBytes(cwnd)

4.5 System Stability

To demonstrate the steady-state behavior of TAFA, we present an analysis of TAFA in a simple setting where we assume there are N flows with the same RTT T sharing a resource of capacity C [14]. We use this assumption to capture the impact of p on TAFA's performance. Like [14], we analyze the following parameters: the back-off penalty p as a function of the flow sizes Sc and Smax, the window size W^o at which the switch starts marking packets with the CE codepoint, the amplitude of the queue oscillations A, and the maximum queue length Qmax. See Fig. 4.4.



Fig. 4.4 Window size change of a single sender (queue size over time, oscillating between W1 and W2 around W^o with amplitude A)

Let X(W1, W2) denote the number of packets sent by a flow while its window grows from W1 to W2 (W2 > W1). As there is no congestion, the window increases by 1 per RTT, so this process takes W2 − W1 RTTs:

X(W_1, W_2) = \int_{t_0}^{t_1} W \, dt = \frac{W_2^{2} - W_1^{2}}{2}   (4.11)

When the switch starts to mark the CE codepoint, W^o = (CT + K)/N, where K denotes the marking threshold. As it takes one more RTT before the senders react to the congestion marks, another W^o packets are sent during this period. As a consequence, the fraction of marked packets, α, is calculated by dividing the number of packets marked in this RTT by the total number of packets sent while the window increases from W1 to W2:

\alpha = \frac{X(W^{o}, W^{o} + 1)}{X\left((W^{o} + 1)(1 - p), \, W^{o} + 1\right)} = \frac{(W^{o} + 1)^{2} - (W^{o})^{2}}{(W^{o} + 1)^{2} - \left((W^{o} + 1)(1 - p)\right)^{2}}   (4.12)

Simplifying and rearranging Equation 4.12 gives:

\alpha(2p - p^{2}) = \frac{(W^{o} + 1)^{2} - (W^{o})^{2}}{2(W^{o} + 1)^{2}}   (4.13)

As p = α × s (s = Sc/Smax), we can rewrite the equation as follows when assuming p is small:

\alpha p = \alpha^{2} s = \frac{(W^{o} + 1)^{2} - (W^{o})^{2}}{2(W^{o} + 1)^{2}}   (4.14)

Solving this equation gives us:

\alpha = \sqrt{\frac{(W^{o} + 1)^{2} - (W^{o})^{2}}{2s(W^{o} + 1)^{2}}}, \qquad p = \sqrt{\frac{s\left[(W^{o} + 1)^{2} - (W^{o})^{2}\right]}{2(W^{o} + 1)^{2}}}   (4.15)

(4.16)

So the maximum queue length is:

$$Q_{max} = N(W^o + 1) - C \times T = N\left(\frac{CT + K}{N} + 1\right) - CT = CT + K + N - CT = N + K \qquad (4.17)$$

Revealed by Equation (4.16), we observe an important property: the amplitude of the queue oscillations of TAFA can be calculated as follows:

$$O\!\left(N W^o \sqrt{\frac{2}{W^o + 2}}\right) = O\!\left(N \cdot \frac{CT + K}{N}\sqrt{\frac{2}{\frac{CT + K}{N} + 2}}\right) = O\!\left((CT + K)\sqrt{\frac{2N}{CT + K + 2N}}\right) = O\!\left(\sqrt{C \times T}\right) \qquad (4.18)$$

The above equation implies that the amplitude of the queue oscillations for small N is in $O(\sqrt{C \times T})$, indicating that this system has characteristics of stability.
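For intuition, here is a small worked instance of Equation (4.17); the numbers below are illustrative assumptions, not values taken from the experiments in this chapter:

$$N = 10 \text{ flows}, \quad K = 65 \text{ packets} \;\Rightarrow\; Q_{max} = N + K = 75 \text{ packets}.$$

Note that this bound does not contain the bandwidth-delay product $C \times T$, so the maximum queue length does not grow with link capacity.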


4.6 TAFA Experiment
In this section, we evaluate the performance of TAFA using extensive simulations. To understand the performance at large scale, we run trace-driven simulations using traces from production clusters. Our evaluation consists of three parts. First, we evaluate TAFA's basic performance, including its TCT, throughput, resource utilization, threshold vector, number of priority queues, and how it handles concurrent tasks. Building on these, we show how TAFA achieves benefits compared to task-aware schemes and flow-aware scheduling schemes in realistic data center networks under load.

4.6.1 Setup
Flows We use realistic workloads that have been observed in production data centers: the web search distribution in [10] and the large data mining jobs in [12]. As given in [11], flows arrive according to a Poisson process. According to the empirical traffic distributions used for benchmarks, both clusters have a diverse mix of heavy and light flows with heavy-tailed characteristics, and we also analyze TAFA's performance across these two different workloads.
Trace Using the Google cluster traces in [15] and [16], we illustrate the heterogeneity of server configurations in one of the clusters [17], where the CPUs and memory of each server are normalized. We use the information of over 900 users on a cluster of 12K servers as the input of TAFA and evaluate its performance against other policies based on these traces.
TAFA To generalize our work, we consider three sets of experiments. First, we test TAFA's parameter sensitivity. For tasks containing many flows, we use the task completion time (TCT), defined as the finish time of the last flow in the task, and consider the average TCT across all tasks from end-hosts. We also find that the resource utilization of TAFA is high and that the demotion threshold and the number of priority queues affect the finishing time of flows. Second, we compare TAFA with task-aware policies. As introduced before, flow completion time seriously affects TCT; with flow-level knowledge, we can schedule flows more properly to advance the finish time of the last flow in a task, i.e., to reduce TCT. Lastly, we compare TAFA with flow-aware policies and demonstrate that TAFA can shorten flow response time (FRT) significantly.

4.6.2 Overall Performance of TAFA
To evaluate how TAFA adapts to realistic activity, we set up an environment that mimics a typical DCN scenario. The front-end comprises three clients; each client persistently sends out tasks and tags the flows of these tasks with a separately maintained marker.


Fig. 4.5 Comparison between different demotion thresholds (average TCT under loads of 20, 50, and 100 requests, same vs. varied thresholds)

Each task is initialized to the highest priority and is demoted once the amount of data it has sent reaches the threshold. The small cluster is configured in proportion to [15]. CPU and memory units are normalized to the maximum server. The six configuration rates are: (0.50, 0.50), (0.50, 0.25), (0.50, 0.75), (1.00, 1.00), (0.25, 0.25), and (0.50, 0.12).
Threshold We first evaluate the impact of varying the threshold values in switches. As there is more than one priority queue in a switch, a task may be demoted from a higher-priority queue to a lower one depending on the bytes it has sent. The demotion is governed by the queue threshold ϒ. As described in Sect. 4.4, instead of using a global threshold for all queues, we use a vector of ϒ (consisting of ϒ1, ϒ2, ..., ϒτ−1), where τ is the number of priority queues and ϒi is the threshold of priority queue i. To test the effect of different values, we consider a scenario with three queues in switches, where ϒ1 and ϒ2 are set to the mean and three quarters of the largest task size, respectively. As a contrast experiment, we set the two thresholds to the same value (the mean task size). Figure 4.5 shows the results of three experiments with 20, 50, and 100 requests. From this figure, we find that the incremental threshold clearly outperforms the fixed threshold, and the advantage grows with the number of tasks.
Queues Regardless of whether the access scheme is sequential or aggregational, the order of flows influences the final completion result, because the final result can only be formed after all the flows return. So when an earlier flow is heavy and blocks the following ones, multiple queues can handle this scenario. The number of priority queues affects the degree of optimization. We set the number of requests from clients to 20, 40, 60, 80, and 100, respectively, using different numbers of priority queues (2, 3, 4, 8). Figure 4.6 gives the results, from which we conclude that multiple queues can optimize TCT to some extent, but as the number of queues increases, the benefit diminishes. Thus, for a specific DCN, more queues are not always better; the number should be set to an appropriate value by considering the overhead of adding an additional queue.
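As a toy illustration of how such an incremental threshold vector can be derived from observed task sizes and used for demotion, consider the following Python sketch. The helper names and the specific statistics (mean and three quarters of the maximum) mirror the experiment above but are otherwise our own simplification.

```python
# Sketch: build a demotion-threshold vector for three priority queues and
# map a task to a queue based on how much data it has already sent.
def demotion_thresholds(task_sizes):
    mean_size = sum(task_sizes) / len(task_sizes)
    return [mean_size, 0.75 * max(task_sizes)]   # [threshold_1, threshold_2]

def queue_for_task(bytes_sent, thresholds):
    for k, threshold in enumerate(thresholds):
        if bytes_sent <= threshold:
            return k                              # 0 is the highest priority
    return len(thresholds)                        # lowest-priority queue

sizes_kb = [10, 20, 40, 80, 100]                  # hypothetical task sizes
thresholds = demotion_thresholds(sizes_kb)        # [50.0, 75.0]
print(queue_for_task(30, thresholds))             # 0: not yet demoted
print(queue_for_task(60, thresholds))             # 1: demoted once
print(queue_for_task(90, thresholds))             # 2: demoted twice
```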


Fig. 4.6 TCT comparison of different numbers of queues (2, 3, 4, and 8 priority queues; average TCT vs. load)
Fig. 4.7 The increasing droptail finish time in TAFA and FIFO (time vs. number of tasks)

Fairness To test the fairness of TAFA among all end-hosts, we extract all the flows of three specified end-hosts by filtering the source address. During each period of time, we add up the resources assigned to each user and calculate the proportion of resource allocation (Fig. 4.7). Figure 4.8 shows the sharing rate of CPU, while Fig. 4.9 shows the rate of the link; the average rates for all three end-hosts are about 33% with stable distributions.

4.6.3 TAFA vs. Task-Aware
We compare TAFA's performance against FIFO, which is used by the task-aware scheduling in [8]. Figure 4.7 shows the results of an experiment with 3 clients, 100 tasks, and task sizes of 1–100 KB. In this case, TAFA reduces the droptail task completion time by about 34% compared to FIFO.


Fig. 4.8 CPU sharing among three end-hosts (CPU share vs. time)
Fig. 4.9 Link sharing among three end-hosts (link share vs. time)

To generalize the case, we test TAFA's performance by increasing the number of concurrent requests from 10 tasks per client to 50 tasks, and we show that TAFA can handle large-scale concurrency. Figure 4.10 gives the comparison of average task completion time. The reason why TAFA outperforms task-aware schemes lies in its knowledge of flow information. As the task completion time is determined by the last flow of a task, scheduling flows earlier will certainly reduce TCT. The flow-level scheduling algorithm makes TAFA flow-aware, so that TAFA can significantly improve task completion time compared to task-aware-only policies.

4.6.4 TAFA vs. Flow-Aware
In this subsection, we evaluate TAFA against flow-aware schemes. For the experiment, we consider Google's trace files in [15] and [16]. Following the experiments in [8], we compare TAFA's performance against D2TCP.


Fig. 4.10 The number of concurrent tasks (average TCT of TAFA vs. FIFO under loads of 10–50 tasks per client)
Fig. 4.11 100 requests with task size from 1 to 100 KB (TCT of TAFA vs. D2TCP)

Figure 4.11 shows the results of an experiment with 100 requests and task sizes from 1 to 100 KB. TAFA clearly takes a shorter time to finish these tasks, reducing the tail completion time by 36% compared to D2TCP. We observe even larger gains in the long-task simulation. We plot the average TCT in Fig. 4.12, where the shortest task of each end-host is 10 MB. The results show that TAFA reduces the average TCT by 45% compared with D2TCP. In short, TAFA works well for both workloads. TAFA also achieves very good performance for the CDF of task completion, shown in Fig. 4.13: TAFA finishes scheduling all the tasks at about time 4, while D2TCP needs more than 6. To test the scalability of TAFA, we increase the number of requests and simulate the situation with 500 tasks; Figs. 4.14, 4.15, and 4.16 show the task completion time for short tasks, long tasks, and the average CDF, respectively. From these figures, we conclude that TAFA adapts well to large-scale environments. The reason why TAFA achieves better results than D2TCP is that TAFA takes flow size into consideration when adjusting the congestion window. When congestion occurs, short flows back off less than long ones, so short flows are scheduled earlier; thus, TAFA reduces the average task completion time.


Fig. 4.12 100 requests with task size from 10 M to ∞ (TCT of TAFA vs. D2TCP)
Fig. 4.13 Average CDF of task completion time (TAFA vs. D2TCP)
Fig. 4.14 500 tasks with size from 1 to 100 KB (TCT of TAFA vs. D2TCP)


Fig. 4.15 500 tasks with size from 10 M to ∞ (TCT of TAFA vs. D2TCP)
Fig. 4.16 Average CDF of task completion time (TAFA vs. D2TCP)

4.7 Conclusion
In this chapter, we studied the scheduling problem in data center networks (DCNs), where the existing protocols are either task-aware or flow-aware. To optimize task completion time (TCT), we presented TAFA, which is both task-aware and flow-aware. As tasks consist of many flows, the completion time depends on the last flow's finishing time, so good flow-level scheduling helps task-level optimization. At the task level, TAFA adopts a heuristic demotion algorithm, which demotes the priority of heavy tasks without prior knowledge of task size, so that TAFA obtains the advantages of shortest-job-first, which is known to be the most effective method to reduce average completion time on a single link. Besides, we set a vector of thresholds for the different priority queues instead of a fixed one, making the demotion from a higher-priority queue to a lower-priority queue more reasonable. The incremental threshold gives TAFA the ability to handle bursts. At the flow level, TAFA modulates the congestion window in a size-aware manner, making long flows back off more aggressively than short flows.


As to the rate control problem, we take flow size into consideration and adjust the congestion window according to an estimate calculated from the flow size and the fraction of packets that were marked in the last RTT. This scheme helps shorter flows back off less than longer ones and lets short flows be scheduled earlier, resulting in shorter task completion time. Our large-scale simulations driven by real production data center traces show that, compared to traditional task-aware-only or flow-aware-only policies, TAFA can significantly reduce the average task completion time.

References
1. Nagaraj, K., Bharadia, D., Mao, H., Chinchali, S., Alizadeh, M., Katti, S.: NUMFabric: fast and flexible bandwidth allocation in datacenters. In: ACM SIGCOMM, pp. 188–201 (2016)
2. Zhang, H., Zhang, J., Bai, W., Chen, K., Chowdhury, M.: Resilient datacenter load balancing in the wild. In: ACM SIGCOMM, pp. 253–266 (2017)
3. Cho, I., Jang, K.H., Han, D.: Credit-scheduled delay-bounded congestion control for datacenters. In: ACM SIGCOMM, pp. 239–252 (2017)
4. Mittal, R., Lam, V.T., Dukkipati, N., Blem, E., Wassel, H., Ghobadi, M., Vahdat, A., Wang, Y., Wetherall, D., Zats, D.: TIMELY: RTT-based congestion control for the datacenter. In: ACM SIGCOMM, pp. 537–550 (2015)
5. Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M., Liron, Y., Padhye, J., Raindel, S., Yahia, M.H., Zhang, M.: Congestion control for large-scale RDMA deployments. ACM SIGCOMM 45(5), 523–536 (2015)
6. Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Lam, V.T., Matus, F., Pan, R., Yadav, N.: CONGA: distributed congestion-aware load balancing for datacenters. In: ACM SIGCOMM, pp. 503–514 (2014)
7. Grandl, R., Ananthanarayanan, G., Kandula, S., Rao, S., Akella, A.: Multi-resource packing for cluster schedulers. In: Proceedings of the 2014 ACM Conference on SIGCOMM, pp. 455–466. ACM (2014)
8. Dogar, F.R., Karagiannis, T., Ballani, H., Rowstron, A.: Decentralized task-aware scheduling for data center networks. ACM SIGCOMM Comput. Commun. Rev. 44(4), 431–442 (2014)
9. Bai, W., Chen, L., Chen, K., Han, D., Tian, C., Sun, W.: PIAS: practical information-agnostic flow scheduling for data center networks. In: Proceedings of the 13th ACM Workshop on Hot Topics in Networks, p. 25. ACM (2014)
10. Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M.: Data center TCP (DCTCP). ACM SIGCOMM Comput. Commun. Rev. 41(4), 63–74 (2011)
11. Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., Shenker, S.: pFabric: minimal near-optimal datacenter transport. ACM SIGCOMM Comput. Commun. Rev. 43(4), 435–446 (2013)
12. Greenberg, A., Hamilton, J.R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D.A., Patel, P., Sengupta, S.: VL2: a scalable and flexible data center network. ACM SIGCOMM Comput. Commun. Rev. 39(4), 51–62 (2009)
13. Zhang, H.: More load, more differentiation – a design principle for deadline-aware flow control in DCNs. In: IEEE INFOCOM 2014. IEEE (2014)
14. Obermuller, N., Bernstein, P., Velazquez, H., Reilly, R., Moser, D., Ellison, D.H., Bachmann, S.: Expression of the thiazide-sensitive Na-Cl cotransporter in rat and human kidney. Am. J. Physiol.-Renal Physiol. 269(6), F900–F910 (1995)


15. Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In: Proceedings of the Third ACM Symposium on Cloud Computing, p. 7. ACM (2012)
16. Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage traces. http://code.google.com/p/googleclusterdata/
17. Wang, W., Li, B., Liang, B.: Dominant resource fairness in cloud computing systems with heterogeneous servers. arXiv preprint arXiv:1308.0083 (2013)

Chapter 5

Optimization of Container Communication in DC Back-End Servers

Abstract Containerization has been used in many applications for isolation purposes due to its lightweight, scalable, and highly portable properties. However, applying containerization in large-scale Internet data centers faces a big challenge. Services in data centers are usually instantiated as a group of containers, which often generate heavy communication workloads, resulting in inefficient communication and degraded service performance. Although assigning the containers of the same service to the same server can reduce the communication overhead, this may cause heavily imbalanced resource utilization, since containers of the same service are usually intensive to the same resource. To reduce communication cost as well as balance resource utilization in large-scale data centers, we explore the container distribution issues in a real industrial environment and find that the conflict lies in two phases – container placement and container reassignment. The objective of this chapter is to address the container distribution problem in these two phases. For the container placement problem, we propose an efficient Communication Aware Worst Fit Decreasing (CA-WFD) algorithm to place a set of new containers into data centers. For the container reassignment problem, we propose a two-stage algorithm called Sweep&Search to optimize a given initial distribution of containers by migrating containers among servers. We implement the proposed algorithms in Baidu's data centers and conduct extensive evaluations. Compared with the state-of-the-art strategies, the evaluation results show that our algorithms outperform them by up to 70% and increase the overall service throughput by up to 90%.

5.1 Container Group-Based Architecture
Containerization has gained tremendous popularity because of its convenience and good performance in deploying applications and services. First, containers provide good isolation via namespace technologies (e.g., chroot [1]), eliminating conflicts with other containers. Second, containers put everything in



one package (code, runtime, system tools, system libraries) and do not need any external dependencies to run processes [2], making containers highly portable and fast to distribute. To ensure service integrity, a function of a particular application may be instantiated as multiple containers. For example, in Hadoop, each mapper or reducer is implemented as one container, and the layers in a web service (e.g., load balancer, web search, back-end database) are deployed as container groups. These container groups are deployed in clouds or data centers and managed by orchestrators such as Kubernetes [3] and Mesos [4]. Using name services, these orchestrators can quickly locate the containers on different servers, so application upgrades and failure recovery can be handled well. As containers are easy to build, replace, or delete, such an architecture makes it convenient to maintain container group-based applications.
But the container group-based architecture also introduces a side effect, i.e., low communication efficiency. As the functions deployed in the same container group belong to the same service, they need to exchange control messages and transfer data. Therefore, communication efficiency within a container group greatly affects the overall service performance [5]. However, simple consolidation strategies may result in imbalanced utilization of multiple resources, because containers of the same group are usually intensive to the same resource. The above orchestrators make it possible to leverage containers, but how to manage container groups so as to reduce communication overhead and balance resource utilization remains an open problem.

5.2 Problem Definition
In this section, we formalize the above trade-off by analyzing the overall costs under specified constraints. Let H denote the set of servers in a data center. Each server has multiple types of resources. Let R denote the set of resource types. For each server h ∈ H, let P(h, r) denote the capacity of resource r ∈ R. Let S denote the set of services. Each service is built from a set of containers, and each container may have several replicas. For a specific container c, let D_c^r denote its requirement for resource r (r ∈ R). The set of containers to be placed is denoted by C.

5.2.1 Objective
According to Sect. 5.1, there are two aspects to quantify in the total overhead of any distribution status, i.e., the communication cost and the resource utilization (we consider both in-use resources and residual resources).


5.2.1.1 Communication Cost
So far we can formulate the overall communication cost as follows. For each container c ∈ C, let H(c) denote the server that container c is assigned to. For a pair of containers c_i and c_j, let f(c_i, c_j) denote the communication cost incurred by these two containers. Since the communication overhead exists mainly in host networks, if c_i and c_j are placed on the same server (H(c_i) = H(c_j)), the communication cost is negligible, i.e., f(c_i, c_j) = 0. Thus, the overall communication cost for the data center is the sum of the communication cost produced by all possible container pairs, which is given by

$$C_{cost} = \sum_{\forall c_i, c_j \in C,\, c_i \ne c_j} f\left(H(c_i), H(c_j)\right). \qquad (5.1)$$

The next two metrics measure the resource utilizations of servers, which are resource utilization cost and residual resource balance cost.

5.2.1.2 Resource Utilization Cost
If the resource utilization of a server is much higher than that of others, it will easily become the bottleneck of a service, seriously degrading the overall performance. The ideal situation is that all servers enjoy equal resource utilization. For each resource type r ∈ R, the resource utilization cost for r is defined as the variance of the usage of r over all servers, i.e.,

$$\sum_{h \in H} \frac{\left[U(h, r) - \bar{U}(r)\right]^2}{|H|}, \qquad (5.2)$$

where U(h, r) denotes the utilization of resource r on server h, $\bar{U}(r)$ is the mean utilization of resource r over all servers, and |H| is the number of servers. This metric reflects whether resource r is used in a balanced way among servers. The total resource utilization cost for the data center is the sum of the resource utilization costs of all resource types, which is given by

$$U_{cost} = \sum_{r \in R} \sum_{h \in H} \frac{\left[U(h, r) - \bar{U}(r)\right]^2}{|H|}. \qquad (5.3)$$

5.2.1.3 Residual Resource Balance Cost
Any amount of CPU resource without any available RAM is useless for coming requests, so the residual amounts of multiple resources should be balanced [6]. For two different resources r_i and r_j, let t(r_i, r_j) represent the target proportion between resource r_i and resource r_j. The residual resource balance cost incurred by r_i and r_j is defined as

$$cost(r_i, r_j) = \sum_{h \in H} \max\left\{0,\; A(h, r_i) - A(h, r_j) \times t(r_i, r_j)\right\}, \qquad (5.4)$$

where A(h, r) refers to the residual available resource r on server h. This metric reflects whether different types of resources are used according to the expected proportion. The total residual resource balance cost for the data center is the sum of the residual balance cost over all possible pairs of resource types, which is given by

$$B_{cost} = \sum_{\forall r_i, r_j \in R,\, r_i \ne r_j} cost(r_i, r_j). \qquad (5.5)$$

Based on the above definitions, the overall resource utilization of servers can be measured by the sum of the total resource utilization cost and the total residual resource balance cost. It is easy to see that a smaller cost indicates more balanced resource utilization among servers. A commonly used approach to optimizing multiple objectives is to transform them into a single scalar [7]. We adopt this approach in this chapter and define the objective to be minimized as a weighted sum of all the costs defined above, i.e.,

$$Cost = w_U \cdot U_{cost} + w_B \cdot B_{cost} + w_C \cdot C_{cost}. \qquad (5.6)$$
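To make the objective concrete, the sketch below evaluates the three cost terms and the weighted sum of Equation (5.6) for a toy placement. The data structures, the example values, and the weights are illustrative assumptions, not the production implementation described later in this chapter.

```python
# Toy evaluation of Cost = wU*Ucost + wB*Bcost + wC*Ccost (Equation (5.6)).
from itertools import combinations

def u_cost(util):                       # util[server][resource] in [0, 1]
    resources = util[next(iter(util))].keys()
    total = 0.0
    for r in resources:
        mean = sum(u[r] for u in util.values()) / len(util)
        total += sum((u[r] - mean) ** 2 for u in util.values()) / len(util)
    return total

def b_cost(avail, target_ratio):        # avail[server][resource], target t(ri, rj)
    return sum(max(0.0, a[ri] - a[rj] * t)
               for (ri, rj), t in target_ratio.items()
               for a in avail.values())

def c_cost(placement, f):               # placement[container] = server
    return sum(f(placement[ci], placement[cj])
               for ci, cj in combinations(placement, 2))

util = {"h1": {"cpu": 0.8, "mem": 0.5}, "h2": {"cpu": 0.2, "mem": 0.5}}
avail = {"h1": {"cpu": 0.2, "mem": 0.5}, "h2": {"cpu": 0.8, "mem": 0.5}}
placement = {"c1": "h1", "c2": "h1", "c3": "h2"}
f = lambda hi, hj: 0.0 if hi == hj else 1.0      # cross-server pairs cost 1
cost = 1.0 * u_cost(util) + 0.5 * b_cost(avail, {("cpu", "mem"): 1.0}) \
       + 0.1 * c_cost(placement, f)
print(round(cost, 3))                            # 0.44 for this toy input
```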

5.2.2 Constraints
To minimize the above cost, containers should be placed or reassigned to the most suitable servers, but this process must satisfy some strict constraints.
(a) Capacity Constraint: First, the resources consumed by the containers on each server cannot exceed the capacity of the server for each resource type, i.e.,

$$\sum_{c \in C,\, H(c) = h} D_c^r \le P(h, r), \quad \forall h \in H, \forall r \in R. \qquad (5.7)$$

(b) Conflict Constraint: Second, as mentioned earlier, each container may have several replicas for parallel processing purposes. Generally, the replicas of the same container cannot be placed on the same server. The conflict constraint can be represented by

$$c, c' \text{ are replicas of the same container} \;\Rightarrow\; H(c) \ne H(c'), \quad \forall c, c' \in C,\, c \ne c'. \qquad (5.8)$$


(c) Spread Constraint: Third, a specific function in a high-performance application is usually implemented on multiple containers to support concurrent operations. For example, a basic search function in web services is usually instantiated on different servers or even in different data centers. As these containers are sensitive to the same resource, they cannot be put on the same server; otherwise, there would be serious waste of other resources like memory and I/O. Therefore, for each service S_i ∈ S, let M(S_i) ∈ N be the minimum number of different servers on which at least one container of S_i should run. We can define the following spread constraint for each service:

$$\sum_{h_i \in H} \min\left(1,\; \left|\{c \in S_i \mid H(c) = h_i\}\right|\right) \ge M(S_i), \quad \forall S_i \in S. \qquad (5.9)$$

(d) Co-locate Constraint: Fourth, some services require critical data transmission delay among containers. In order to satisfy the latency requirement, containers with critical frequent interactions should be assigned to the same server. The co-locate constraint can be represented by

$$c, c' \text{ must be co-located} \;\Rightarrow\; H(c) = H(c'), \quad \forall c, c' \in C,\, c \ne c'. \qquad (5.10)$$

(e) Transient Constraint: The container reassignment problem assumes a given placement of containers and tries to further improve the initial placement by migrating containers among servers. For each container c ∈ C, let H(c) and H′(c) denote the original server and the new server (after migration) that container c is assigned to. In order to guarantee service availability, a migrated container cannot be destroyed at the original server until the new instance has been created on the new server. Therefore, resources are consumed at both the original server H(c) and the new server H′(c) during container migration. This constraint can be represented by

$$\sum_{c \in C,\; H(c) = h \,\vee\, H'(c) = h} D_c^r \le P(h, r), \quad \forall h \in H, \forall r \in R. \qquad (5.11)$$
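The sketch below illustrates how the capacity constraint (5.7) and the transient constraint (5.11) might be checked for a proposed migration. The dictionary-based data layout is a simplifying assumption for illustration only.

```python
# Check capacity (5.7) and transient (5.11) constraints for one server.
def capacity_ok(containers, demand, capacity):
    # demand[c][r] = D_c^r, capacity[r] = P(h, r)
    return all(sum(demand[c][r] for c in containers) <= capacity[r]
               for r in capacity)

def transient_ok(old_residents, incoming, demand, capacity):
    # During migration the server hosts its old containers AND the incoming
    # replicas at the same time, so both sets count against the capacity.
    return capacity_ok(set(old_residents) | set(incoming), demand, capacity)

demand = {"c1": {"cpu": 0.4}, "c2": {"cpu": 0.3}, "c3": {"cpu": 0.5}}
capacity = {"cpu": 1.0}
print(capacity_ok({"c1", "c2"}, demand, capacity))           # True: 0.7 <= 1.0
print(transient_ok({"c1", "c2"}, {"c3"}, demand, capacity))  # False: 1.2 > 1.0
```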

Based on the above discussions, we formally define the problems to be addressed in this chapter.
• Container Placement Problem (CPP). Given a set of new containers, find the optimal placement of containers such that the total cost defined by (5.6) is minimized while the constraints (5.7), (5.8), (5.9), and (5.10) are not violated.
• Container Reassignment Problem (CRP). Given an initial placement of containers, find the optimal new placement of containers such that the total cost defined by (5.6) is minimized while the constraints (5.7), (5.8), (5.9), (5.10), and (5.11) are not violated.


5.3 Container Placement Problem
In this section, we first show that CPP is NP-hard and then propose a heuristic algorithm called Communication Aware Worst Fit Decreasing to approximate the optimal solution of CPP.

5.3.1 Problem Analysis
To prove that CPP is NP-hard, let us consider the Multi-Resource Generalized Assignment Problem (MRGAP) [8] first. Given m agents A = {1, 2, ..., m}, n tasks T = {1, 2, ..., n}, and l resources R = {1, 2, ..., l}, each agent i has cap_{i,r} units of resource r, and each task j requires req_{j,r} units of resource r. Assigning task j to agent i induces a cost cost_{i,j}. MRGAP tries to assign each task to exactly one agent so as to minimize the total cost without violating the resource constraints, i.e.,

$$\min \sum_{i \in T} cost_{i, A(i)} \qquad (5.12)$$
$$\text{s.t.} \quad \sum_{i \in T(j)} req_{i,r} \le cap_{j,r}, \quad \forall j \in A, \forall r \in R$$

where A(i) denotes the agent that task i is assigned to and T(j) denotes the set of tasks on agent j. It is widely accepted that MRGAP is a strongly NP-hard problem [9]. We note that MRGAP is essentially a simplified version of CPP. Suppose we deploy service S into some empty servers with only the capacity constraint considered (i.e., we do not consider constraints (5.8), (5.9), and (5.10)). Because the total cost is 0 before deploying S (note that the servers are initially empty), we can denote the final cost by $\sum_{c \in C_S} Cost_c$, where $C_S$ denotes the set of containers of service S and $Cost_c$ denotes the increment of the cost in Equation (5.6) caused by placing container c. This Simplified CPP (SCPP) can be formulated as follows:

$$\min \sum_{c \in C_S} Cost_c$$
$$\text{s.t.} \quad \sum_{c \in C_S,\, H(c) = h} D_c^r \le P(h, r), \quad \forall h \in H, \forall r \in R. \qquad (5.13)$$


From Equations (5.12) and (5.13), it is easy to see that MRGAP is equivalent to SCPP if we regard agents as servers and tasks as containers. In other words, MRGAP is a special case of CPP. Therefore, it follows that CPP is NP-hard. Although MRGAP is a special case of CPP, there are key differences between them, which make existing solutions to MRGAP inapplicable to CPP. First, there are more constraints in CPP, so feasible solutions to MRGAP may be infeasible for CPP. Second, unlike in MRGAP, the assignment cost in CPP (i.e., the increment of cost induced by an assignment) depends on the assignment sequence and can even be negative (note that a proper placement can improve resource utilization without increasing communication overhead and thus decrease the overall cost).

5.3.2 CA-WFD Algorithm
As large-scale data centers usually have thousands of containers and servers, approaches that try to find optimal solutions are impractical for CPP due to their high computational complexity. In this section, we propose a heuristic algorithm to approximate the optimal solution to CPP, which is extended from the Worst Fit Decreasing (WFD) [10] strategy. The basic idea of WFD is to sort the items in decreasing order of size and assign each item to the bin with the largest residual capacity. WFD is widely used for load balancing [11] because it tends to distribute slack among multiple bins. However, we face several challenges in applying WFD to CPP. The first is how to measure the sizes of containers and the capacities of servers. A commonly used approach is to transform the multidimensional resource vector into a scalar. Since different designs of the scalar may yield different performance [12], we need to carefully scale the resource vectors in CPP. Moreover, WFD is traditionally applied to balance resource utilization, so we have to extend WFD to CPP, where both resource load balance and communication overhead reduction are considered.
To measure the size of a container, we define its dominant requirement, i.e., the maximum requirement over the different resources, expressed as $\max_{r \in R} D_c^r$. We use the weighted sum of residual resources to measure the available capacity of a server, defined as $\sum_{r \in R} w_r \cdot A(h, r)$, where $w_r$ is the weight of resource r. In fact, motivated by prior research [12], we proposed and tested several designs based on real-world environments and finally chose these two metrics. To extend WFD to CPP, instead of simply picking the server with the largest free space as in WFD, we take two steps to select a server for a new container. In the first step, we put emphasis on load balance and select d candidate servers with the most available resources. In the second step, to reduce communication overhead, we choose the server that hosts the maximum number of containers belonging to the same service as the new container.


Algorithm 1 CA-WFD
1: C ← the set of new containers to be placed
2: H ← the set of servers
3: Sort containers in C according to their sizes
4: while C ≠ ∅ do
5:    Pick the container c ∈ C with the largest size
6:    Pick d servers Hd with the largest available capacity that can accommodate c without violating any constraint
7:    Pick the server h ∈ Hd that accommodates the most containers belonging to the same service as c
8:    Assign c to h
9:    C ← C \ {c}
10: end while

We propose the Communication Aware Worst Fit Decreasing (CA-WFD) algorithm shown in Algorithm 1. The algorithm sorts the containers according to their sizes (measured by dominant requirements) in line 3. Then, it repeatedly assigns the largest container until all the containers are assigned (lines 4–10). Each time, the algorithm picks the top d servers with the largest residual capacity (measured by the weighted sum of residual resources) and chooses the one hosting the most containers that belong to the same service as c, so as to minimize the increment of Ccost.
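A compact Python rendering of the CA-WFD loop is given below. The size and capacity metrics follow the definitions above, while the data structures, the feasibility check, and the tie-breaking details are our own simplifications rather than the exact production code.

```python
# Sketch of CA-WFD: sort containers by dominant requirement, shortlist the d
# emptiest feasible servers, then prefer the one hosting the most containers
# of the same service.
def ca_wfd(containers, servers, weights, d=2):
    # containers: list of {"id", "service", "demand": {r: D_c^r}}
    # servers: dict server -> {"free": {r: A(h, r)}, "hosted": [service, ...]}
    placement = {}

    def size(c):                                   # dominant requirement
        return max(c["demand"].values())

    def capacity(h):                               # weighted residual sum
        return sum(weights[r] * servers[h]["free"][r] for r in weights)

    def fits(c, h):
        return all(servers[h]["free"][r] >= c["demand"][r] for r in c["demand"])

    for c in sorted(containers, key=size, reverse=True):
        candidates = sorted((h for h in servers if fits(c, h)),
                            key=capacity, reverse=True)[:d]
        if not candidates:
            raise RuntimeError(f"no feasible server for {c['id']}")
        best = max(candidates,
                   key=lambda h: servers[h]["hosted"].count(c["service"]))
        placement[c["id"]] = best
        for r, need in c["demand"].items():
            servers[best]["free"][r] -= need
        servers[best]["hosted"].append(c["service"])
    return placement
```

Shortlisting only d servers keeps the per-container decision cheap while still letting the communication-aware tie-break pull containers of the same service together.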

5.4 Container Reassignment Problem
CRP aims to optimize a given initial placement of containers by migrating containers among servers. As mentioned earlier, all the constraints should be satisfied during the migrations in order to guarantee the online services. Since containers are already placed on servers initially, the residual capacities of servers that can be utilized during migrations are quite limited, making CRP challenging. Classical heuristic algorithms have been used to solve similar problems [13–15]. However, the existing approaches are inefficient in handling large containers on hot hosts due to the transient constraints in CRP.

5.4.1 Problem Analysis
Figure 5.1 shows the additional constraint and challenge in CRP, where a single resource and homogeneous servers are considered for simplicity. There are six containers placed on three servers. Initially (as shown in Fig. 5.1a), three containers (CA1, CB1, and CC1) are placed on Server 1, whose resource requirements are 20%, 20%, and 30%, respectively. Two containers (CA2 and CB2) are placed on Server 2, whose resource requirements are both 50%. One container (CC2) is placed on Server 3, whose resource requirement is 40%.


Fig. 5.1 There are three services each with two containers: (a) the current assignment; (b) the impractical reassignment; (c) the optimal solution

Suppose the capacity of each server is 100%. Obviously, the optimal placement of containers is as shown in Fig. 5.1c, i.e., each server has two containers and the total resource utilization is 70%. As illustrated in Fig. 5.1b, it is impossible to reach the optimal placement from the initial placement by migrating containers concurrently. This is because, to achieve the optimal placement, we need to move CC1 from Server 1 to Server 3 and move CB2 from Server 2 to Server 1. However, migrating CB2 from Server 2 to Server 1 is infeasible, since the transient constraint would be violated on Server 1 (the sum of resource consumption on Server 1 would exceed its capacity if CB2 were migrated). For the above example, if we first move CC1 from Server 1 to Server 3 and the transient resource is released on Server 1 after migration, then CB2 can be migrated from Server 2 to Server 1 without violating any constraint. Inspired by this observation, we propose a two-stage container reassignment algorithm named Sweep&Search to solve the problem.

5.4.2 Sweep&Search Algorithm
The Sweep&Search algorithm has two stages, Sweep and Search. The Sweep stage tries to handle the large containers on the hot servers, i.e., it tries to migrate the large containers to the expected locations. Based on the placement produced by the Sweep stage, the Search stage adopts a tailored variable neighborhood local search to further optimize the placement of containers. Note that the Sweep&Search algorithm is only used to compute the migration plan, i.e., which server each container will be migrated to, so all the placement changes (e.g., migrate, shift, swap) in the algorithm description are hypothetical. After the migration plan is figured out, the containers are physically migrated to their target servers as follows: first, for each container, a new replica of the container is constructed on the target server; second, the workload mapped to the old replica of the container is redirected to the new replica; third, the old replica of the container is physically deleted. The Sweep&Search algorithm takes the transient resource constraints into account when computing the migration plan, so the resource constraints at both the original servers and the target servers can always be satisfied during migration.


Algorithm 2 Sweep
1: Sort H in descending order according to the residual capacity
2: Hhot ← {h | U(h) > safety threshold}
3: N ← the size of Hhot
4: Hspare ← the top N spare servers
5: for each container c on h ∈ Hspare do
6:    h′ ← FindHost(c)
7:    Migrate c from h to h′
8: end for
9: for each h ∈ Hhot do
10:   while U(h) > Ū(h) do
11:       pick a container c on h
12:       pick a spare server h′ ∈ Hspare that can accommodate c without violating any constraint
13:       Migrate c from h to h′
14:   end while
15: end for

5.4.2.1 Sweep

Recall that one of our objectives is to balance resource utilization among servers. The traditional approaches for load balancing normally move workload directly from servers with high resource utilization to servers with low resource utilization. However, as shown in Fig. 5.1, the large containers are hard to migrate due to the transient resource constraints. To address this issue, we propose a novel two-step approach in the Sweep stage. In the first step, we try to empty the spare servers as much as possible by moving out containers from the spare servers to other servers. This will free up space for accommodating more large containers from the hot servers. In the second step, we move large containers from hot servers to spare servers so that resource utilization among servers can be balanced. The pseudo-code of Sweep is shown in Algorithm 2. The algorithm first selects a set of hot servers (i.e., the servers whose resource utilization is higher than a predefined safety threshold). Suppose the number of hot servers is N . The algorithm then tries to clear up N spare servers (i.e., the top N servers with the lowest resource utilization). When working on a spare server, the algorithm tries to migrate as many containers as possible from the spare server to other servers (lines 5–8). For a specific container c, the procedure FindHost returns a normal server (i.e., neither a hot server nor a spare server) that can accommodate c. After that, the resources occupied by the containers which have been migrated can be released on the spare servers. Then, the algorithm tries to migrate containers from hot servers to spare servers to balance the resource utilization (lines 9–15). Specifically, the algorithm iterates over the hot servers and repeatedly migrates containers from each hot server to spare servers until the resource utilization of the hot server is below the average of all servers if possible.


Fig. 5.2 Three kinds of moves Sweep&Search explores. (a) Shift. (b) Swap. (c) Replace

5.4.2.2 Search

The Sweep stage mainly focuses on balancing the resource utilization among servers. However, the communication cost may still be high after the Sweep stage. The Search stage further optimizes the solution produced by the Sweep stage using a local search algorithm. The local search algorithm incrementally adjusts the placement through three basic moves: shift, swap, and replace.
A shift move reassigns a container from one server to another server (Fig. 5.2a). It is the simplest neighbor exploration that directly reduces the overall cost. For example, reassigning a container from a hot server to a spare server reduces Ucost; moving a CPU-intensive container away from a server with little residual CPU reduces Bcost; moving a container closer to its group members reduces Ccost.
A swap move exchanges the assignment of two containers on two different servers (Fig. 5.2b). It is easy to see that the size of the swap neighborhood is O(n²), where n is the number of containers. To limit the branching, we cut off the neighbors that obviously violate the constraints or worsen the overall cost.
The replace move is more complex than the shift and swap moves: it shifts a container from one server (the original server) to another server (the relay server) and meanwhile shifts zombie containers on the relay server to other servers (the target servers) (Fig. 5.2c). A zombie container is a container that was planned to be moved to the relay server (from another server) earlier in the search, but whose actual migration has not been executed yet. We represent zombie containers with dashed edges in the figure. Since zombie containers have not been physically migrated, reassigning them to other servers does not incur additional overhead. Replace is clearly more powerful than shift and swap, but its overhead is much higher, because there are many potential moves for a zombie container and replace must explore all the possible branches. Fortunately, the overhead can be bounded. In each iteration of the Search algorithm, one shift and one swap are accepted, which generates three zombies, so there are three cases for the next replace phase. In each branch, we try to move the zombie container away from its assigned host, which is another shift operation. Hence, in the i-th iteration there are at most 3i zombie containers. If we assume the overhead of exploring a shift neighbor is o_s, the total overhead of Sweep&Search is linear in o_s.


Algorithm 3 Search
1: Pcrt ← the initial container placement
2: repeat
3:    Sort H by resource utilization
4:    N ← H(Top(δ)) + H(Tail(δ))
5:    Pshift ← shiftSearch(Pcrt, N)
6:    Pswap ← swapSearch(Pcrt, N)
7:    Preplace ← replaceSearch(Pcrt, N)
8:    Pcrt ← arg min(Cost(P)), P ∈ {Pshift, Pswap, Preplace}
9: until Cost(Pcrt) < T
10: Pbest ← Pcrt
11: Output Pbest

The local search is described by Algorithm 3. The algorithm iterates until the overall cost falls below the pre-set threshold T. In each iteration, three procedures are executed: shiftSearch, swapSearch, and replaceSearch.
The shiftSearch procedure attempts to migrate containers from hot servers to spare servers to reduce the total cost. It first randomly selects a set of hot servers and a set of spare servers. Then, it tries to shift a container on the selected hot servers to one of the spare servers, with the condition that the total cost is reduced after the shift move.
The swapSearch procedure aims at reducing the total cost by swapping the locations of containers. It first randomly selects a set of hot servers and a set of non-hot servers. For each container on the selected hot servers, the algorithm tries to find a container on the selected non-hot servers such that the total cost is reduced if the locations of the two containers are swapped.
The replaceSearch procedure tries to reduce the total cost by reassigning a container from h_A to h_B under the premise of moving a zombie container from h_B to h_C. It first chooses a set of hot servers as origin servers. For each container c on an origin server h, it selects a set of non-hot servers as relay servers. For each zombie container c′ on a relay server h′, replaceSearch tries to find a target server h″ from a set of randomly selected spare servers such that the overall cost is reduced if c is reassigned from h to h′ while c′ is moved from h′ to h″.
In Sect. 5.7, we give a detailed algorithm analysis and prove that the deviation between Sweep&Search and the theoretical optimal solution has an upper bound.
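As an illustration of the shift neighborhood used by shiftSearch, the sketch below greedily accepts container shifts that lower a caller-supplied cost function. The cost and feasibility callbacks and the toy data are assumptions of this example, not the FreeContainer code.

```python
# One shift-style local search pass: move a container from a hot server to a
# spare server and keep the move only if the total cost drops.
def shift_search(placement, hot_servers, spare_servers, cost, feasible):
    best_cost = cost(placement)
    for c, src in list(placement.items()):
        if src not in hot_servers:
            continue
        for dst in spare_servers:
            if not feasible(c, dst, placement):
                continue
            placement[c] = dst
            new_cost = cost(placement)
            if new_cost < best_cost:          # accept the improving shift
                best_cost = new_cost
                break
            placement[c] = src                # revert the non-improving shift
    return placement, best_cost

# Toy usage: the cost counts containers on "h1", so moving c1 away helps.
placement = {"c1": "h1", "c2": "h2"}
cost = lambda p: sum(1 for s in p.values() if s == "h1")
feasible = lambda c, dst, p: True
print(shift_search(placement, {"h1"}, ["h3"], cost, feasible))
# ({'c1': 'h3', 'c2': 'h2'}, 0)
```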

5.5 Implementation
In this section, we present the implementation of our solutions at Baidu. All the proposed algorithms have been implemented in a middleware product called FreeContainer [16], which is built into the data center orchestration system that manages virtualized services.



Fig. 5.3 System performance before and after deploying our solution. (a) Request response time. (b) Throughput per service

Table 5.1 Service information
Service | Containers    | Replicas per container
S1      | 3052 (large)  | 1 (small)
S2      | 378 (medium)  | 6 (medium)
S3      | 96 (small)    | 10 (large)
S4      | 192 (medium)  | 3 (small)
S5      | 98 (small)    | 6 (medium)

FreeContainer is deployed in an Internet data center with 6,000 servers, where 35 services are deployed together with some other background services. To evaluate the performance improvement of the data center after deploying our solution, we conduct a series of experiments to measure the following system features: response time, service throughput, and resource utilization.
Response Time The service response time refers to the total response time of a particular request. As each request goes through multiple containers, an efficient container communication scheme should give low network latency. We show the results in Fig. 5.3a, where the small red (big blue) points denote the request response times before (after) deploying our solution. The average response times are 451 and 365 ms, respectively. We conclude that communication latency is reduced by up to 20% through container distribution optimization.
Service Throughput To validate the online performance of the proposed algorithms, we perform stress tests on this data center and measure the throughput of five representative services. The representative services are selected as follows. We classify the services into three types, i.e., small, medium, and large, with respect to the number of containers and the mean number of replicas per container, respectively. Table 5.1 summarizes the types of the selected services; our intention is to cover as many service types as possible. The service throughput before and after deploying FreeContainer is shown in Fig. 5.3b.


Fig. 5.4 (a) CDF of CPU utilizations of servers. (b) CDF of Memory utilization of servers. (c) CDF of SSD utilization of servers. (d) Impact of CPU utilization on response time

Taking S2 as an example, the maximum throughput before deploying FreeContainer is 510,008 queries per second (qps), which rises to 972,581 qps after implementing our algorithms (an increase of 90%). We also observe that for S5, the throughput is improved from 295,581 qps to 384,761 qps, a 30% increase. This is because there are only 588 interactive containers in S5 but 2,268 containers in S2. The results imply that the benefit is more significant for services with more containers to communicate.
Resource Utilization Resource utilization is another performance indicator: if resource utilization is balanced among servers, the throughput is generally also good. We measure resource utilization under stress tests and show the results in Fig. 5.4 (CPU in Fig. 5.4a, MEM in Fig. 5.4b, and SSD in Fig. 5.4c). From these figures, we can see that our solution eliminates the long tails of resource utilization. Taking CPU utilization as an example, there are about 800 servers whose utilization exceeds 80%. To show the influence of high resource utilization, we classify the servers according to CPU utilization (in 10% bins) and calculate the average response time of queries on these classified servers. The results show that when the average CPU utilization is below 60%, the latency stays below 50 ms, but after


that, the latency increases significantly with the increasing CPU utilization. For the servers with CPU utilization higher than 80%, the latency increases to 200 ms, and for the servers with CPU utilization higher than 90%, the latency increases to 800 ms (16 times longer than that under 60% CPU utilization). Thus, the affected servers suffer from long request latency due to CPU resource shortage and become the bottlenecks of the overall service performance. From the above results, we can conclude that our solution can lead to more balanced resource utilization.

5.6 Experiment
We have conducted extensive experiments with various parameter settings in Baidu's large-scale data centers to evaluate CA-WFD and Sweep&Search. As we cannot deploy comparison systems in real data centers due to safety concerns, evaluations are performed in two experimental data centers, where 2,513 servers accommodate 10,000+ containers of 25 services in DCA and 4,361 servers accommodate 25,000+ containers of 29 services in DCB. The configurations of servers in the two data centers are summarized in Table 5.2. Resource requirements of typical services in DCA (SAi) and DCB (SBi) are given in Table 5.3.

Table 5.2 Server configuration
Data center | Server no. | CPU   | MEM   | SSD
DCA         | 2077       | 1.000 | 1.000 | 0.400
DCA         | 169        | 0.342 | 0.750 | 1.000
DCA         | 112        | 0.589 | 0.750 | 1.000
DCB         | 2265       | 0.830 | 0.667 | 0.242
DCB         | 1394       | 0.430 | 0.333 | 0.242
DCB         | 160        | 1.000 | 1.000 | 0.606

Table 5.3 Service information
Service | CPU   | MEM   | SSD   | CTNR. no. | DUPL. no.
SA1     | 0.102 | 0.119 | 0.068 | 3052      | 1
SA2     | 0.153 | 0.190 | 0.137 | 2142      | 6
SA3     | 0.084 | 0.111 | 0.062 | 960       | 10
SA4     | 0.080 | 0.206 | 0     | 777       | 7
SA5     | 0.110 | 0.111 | 0.043 | 588       | 6
SB1     | 0.017 | 0.044 | 0.030 | 10090     | 2
SB2     | 0.210 | 0.157 | 0.124 | 1974      | 6
SB3     | 0.056 | 0.065 | 0.033 | 1920      | 6
SB4     | 0.014 | 0.050 | 0.033 | 1152      | 6
SB5     | 0.132 | 0.083 | 0.046 | 960       | 10


The values in Tables 5.2 and 5.3 are normalized, with the top server configuration in each resource dimension normalized to 1. These real-world data show that both the server configurations and the resource requirements of containers are significantly heterogeneous.

5.6.1 Performance of CA-WFD
5.6.1.1 Algorithm Performance

We consider a scenario where two new services (SA and SB) are being deployed into DCA and DCB, respectively. SA is instantiated as 2,000+ containers, and SB is instantiated as 5,000+ containers. These containers have different resource requirements. In this set of experiments, to deploy these newly instantiated containers into the data centers, CA-WFD is compared with four state-of-the-art container distribution strategies that are used by top container platform providers (e.g., Docker [17], Swarm [18], and Amazon [19]).
• CA-WFD is the Communication Aware Worst Fit Decreasing algorithm proposed in Sect. 5.3. In the evaluation, we set d to 2 in line 6 of Algorithm 1, i.e., two candidate servers are picked each time.
• Random assigns containers randomly and serves as a baseline in the evaluation.
• HA (High Availability) selects the server with the fewest containers of the service at the time of each container's deployment. HA is applied to optimize load balance as well as service availability. However, it may also induce heavy communication overhead, because the containers spread over all the servers.
• ENF (Emptiest Node First) chooses the server with the fewest total containers. ENF aims at balancing the load of all the servers at a coarse granularity. Note that fewer containers do not necessarily mean lower resource utilization, since one big container (e.g., SB2 in Table 5.3) can consume more resources than several small containers (e.g., SB1 in Table 5.3).
• Binpack assigns containers to the server with the least available amount of CPU. Binpack tends to minimize the number of servers used.
Table 5.4 compares the total cost after placing the containers with the different algorithms. For clarity, in this section the cost values are min-max normalized [20], with the lower and upper bounds normalized to 0 and 1, respectively. The upper and lower bounds are calculated under ideal conditions. Take the communication cost, for example: the upper bound of Ccost of a service is calculated when all the containers of the service are spread over as many servers as possible, while the lower bound is calculated when these containers are all placed on the same server or the nearest servers (still under the capacity constraint, conflict constraint, and spread constraint).


Table 5.4 The costs under different placement strategies
Data center | Algorithm | Ucost | Bcost | Ccost
DCA         | CA-WFD    | 0.069 | 0.703 | 0.403
DCA         | Random    | 0.114 | 0.824 | 0.400
DCA         | HA        | 0.111 | 0.709 | 0.424
DCA         | ENF       | 0.087 | 0.715 | 0.406
DCA         | Binpack   | 0.163 | 0.686 | 0.396
DCB         | CA-WFD    | 0.048 | 0.053 | 0.695
DCB         | Random    | 0.062 | 0.147 | 0.747
DCB         | HA        | 0.050 | 0.116 | 0.769
DCB         | ENF       | 0.042 | 0.090 | 0.758
DCB         | Binpack   | 0.088 | 0.170 | 0.698

Fig. 5.5 The resource utilization of servers in DCA under different placement strategies. (a) CPU utilization. (b) MEM utilization. (c) SSD utilization

With respect to the resource utilization cost (Ucost), CA-WFD performs overwhelmingly better than the other algorithms, by up to 57.7% in DCA and up to 45.5% in DCB. The second-best algorithm is ENF; the reason is that both CA-WFD and ENF tend to assign containers to servers with more free space, which benefits load balance. However, ENF regards the server with the fewest containers as the "emptiest," which is imprecise. The Ucost of Binpack is almost twice that of CA-WFD, because Binpack assigns containers to the fewest servers, which harms load balance. CA-WFD also achieves better or similar performance in both data centers in terms of the residual resource balance cost (Bcost) and the communication cost (Ccost), which confirms the effectiveness of CA-WFD. HA performs obviously worse than the other designs in terms of Ccost, because HA spreads the containers across the servers, which induces more cross-server communication.
Figure 5.5 shows the CDF of resource utilization of DCA servers. The horizontal axis represents the 2,513 servers in DCA, and the vertical axis represents the utilization of three different resources. CA-WFD yields more balanced resource usage than the other algorithms, which is consistent with the results in Table 5.4. As server resource utilization reflects data center performance under stress tests, i.e., when bursts occur, we can expect better service throughput when placing containers with CA-WFD.


5.6.1.2 Algorithm Variations

As illustrated in Sect. 5.3, CA-WFD uses the dominant requirement and the weighted sum of residual resources to represent the "size" of containers and servers, respectively. Motivated by [12], we evaluated several design choices in our experimental data centers, and the results confirmed the effectiveness of our choice. In this section, we compare CA-WFD with two representative variants, which run the same procedure as Algorithm 1 but with different metrics.
• DR-WS (dominant requirement-weighted sum), i.e., the design choice we adopt in Sect. 5.3.
• WS-DP (weighted sum-dot product) sorts the containers by the weighted sum of their requirement vectors (i.e., $\sum_{r \in R} w_r \cdot D_c^r$) and selects the best server according to the dot product of the container's requirement vector and the server's residual resource vector (i.e., $\sum_{r \in R} a_r \cdot D_c^r \cdot A(h, r)$), where $a_r = \exp(0.01 \cdot avdem_r)$ and $avdem_r = \frac{1}{|R|}\sum_{r \in R} D_c^r$. Simulations in [12] show that WS-DP performs well in vector bin packing [21].
• C-C (CPU-CPU) is a single-dimensional version of the Communication Aware Worst Fit Decreasing strategy, which only considers CPU utilization when sorting and placing containers.
Table 5.5 compares the costs of CA-WFD and its variants. DR-WS (i.e., our design choice in Sect. 5.3.2) outperforms the other two designs in both data centers. A point worth noting is that WS-DP performs obviously worse in DCB than in DCA. We attribute this to the fact that WS-DP cannot effectively capture the resource features in more heterogeneous environments (note that in Table 5.2, the server capacity of DCA is more heterogeneous than that of DCB). C-C assigns containers to the server with the maximum residual CPU; hence, containers tend to be packed on servers with high-end CPUs. This explains why it gains a slightly better result for Ccost than DR-WS and WS-DP but poor performance for Ucost and Bcost. This implies that a single-dimensional placement strategy is insufficient in real-world environments, because optimizing a single resource easily leads to poor utilization of other resources.

Data centers DCA

DCB

Designs DR-WS WS-DP C-C DR-WS WS-DP C-C

Ucost 0.069 0.072 0.139 0.048 0.076 0.091

Bcost 0.703 0.705 0.953 0.053 0.103 0.244

Ccost 0.403 0.405 0.399 0.695 0.682 0.671

5.6 Experiment

83

In summary, compared with the state-of-the-art algorithms, CA-WFD gains much balanced multi-resource utilization without inducing heavy communication overhead, which furtherly yields a better performance of services.

5.6.2 Performance of Sweep&Search 5.6.2.1

Algorithm Performance

We compared Sweep&Search with the following two alternative solutions, NLS and Greedy. Again, we evaluate these algorithms in experimental data centers for safety concerns. • Sweep&Search (S&S) is the container reassignment algorithm we propose in Sect. 5.4. To speed up the convergence of the Search procedure in Algorithm 3, 1 we empirically set wu , wb , and wc in Equation (5.6) as 1, |H| , and 1 2 , |C| respectively, so that the three components of cost (i.e., wu ∗ Ucost , wb ∗ Bcost , and wc ∗ Ccost ) fall in similar value ranges. Besides, we set δ in Sweep as 2%. • NLS is a noisy local search method, which is based on the winner team solution for Google Machine Reassignment Problem (GMRP) [6]. This method reallocates processes among a set of machines to improve the overall efficiency. In the evaluation, NLS adopts the same value of wu , wb , and wc as Sweep&Search in local searching. • Greedy is a greedy algorithm, which tries to move containers from the “hottest” server to the “sparest” server each time. This algorithm reduces Ucost directly in a straightforward way. Table 5.6 shows the total costs produced by Sweep&Search, NLS, and Greedy, separately. In DCA , compared with Greedy (NLS), Sweep&Search achieves 40.4% (30.6%), 69.0% (66.0%), and 9.1% (6.2%) better performance in terms of Ucost , Bcost , and Ccost , respectively. In DCB , the benefits are 33.9%(21.2%), 72.7%(80.4%), and 6.3%(3.8%), respectively. The results show that Sweep&Search can jointly optimize communication overhead and balance resource utilizations. Figure 5.6a shows the CDF of CPU utilization of the 2,513 servers in DCA . The horizontal axis represents the 2513 servers in DCA and the vertical axis represents the CPU utilization. There are about 330 servers whose CPU utilizations exceed Table 5.6 The costs under different container reassignment algorithms

Data centers DCA

DCB

Algorithms Sweep&Search NLS Greedy Sweep&Search NLS Greedy

Ucost 0.034 0.049 0.057 0.041 0.052 0.062

Bcost 0.121 0.356 0.390 0.033 0.168 0.121

Ccost 0.329 0.351 0.362 0.606 0.630 0.647

84

5 Optimization of Container Communication in DC Back-End Servers 1

0.9 CPU Utilization

0.8 CPU Utilization

1

Greedy NLS S&S

0.6 0.4 0.2 0 0

500

Greedy NLS S&S

0.8 0.7 0.6 0.5 2200

1000 1500 2000 2500 Servers

(a)

0.95

2500

1

Greedy NLS S&S

0.9

0.9 0.85 0.8 2200

2400 Servers

(b)

SSD Utilization

Memory Utilization

1

2300

2300 2400 Servers

2500

0.8 Greedy NLS S&S

0.7 0.6 2200

(c)

2300 2400 Servers

2500

(d)

Fig. 5.6 The resource utilizations of servers in DCA under different reassignment strategies. (a) CPU utilization. (b) CPU tail utilization. (c) Memory tail utilization. (d) SSD tail utilization Table 5.7 Average CPU utilization of bottleneck servers

Algorithm Sweep&Search NLS Greedy

Top 1% 0.571 0.719 0.736

Top 5% 0.570 0.692 0.703

Top 10% 0.569 0.657 0.675

60% under the greedy algorithm and 210 servers under the NLS algorithm. But when leveraging Sweep&Search, the highest CPU utilization falls down to 52%, which is much better than Greedy and NLS. For high-performance network services, the overall throughput of the system is generally determined by hot servers. We collect the resource usages of the top 300 hot servers produced by each algorithm, and the results are shown in Fig. 5.6. Taking SSD as an example, the average utilization of the top 300 hot servers under Greedy, NSL, and Sweep&Search are 97.71%, 93.15%, and 81.33%, respectively. To clearly show the quantified optimization results, the average CPU utilization of the top hot servers is shown in Table 5.7. The overall average CPU utilization of the 2,513

5.7 Approximation Analysis of Sweep&Search Table 5.8 The costs under different parameter settings of Sweep&Search in DCA

85 δ Value 2% 10% 20%

Ucost 0.034 0.032 0.033

Bcost 0.121 0.042 0.034

Ccost 0.329 0.206 0.168

servers is 51.16%. We can see that Sweep&Search’s performance is very close to the lower bound and outperforms Greedy and NSL by up to 70%. We attribute this to the following reasons. First, we take Ucost and Bcost into consideration to minimize the difference in resource utilizations and balance residual multiple resources. Second, the Sweep stage makes room for the following search procedure, based on which the Search stage could explore more branches to find better solutions.

5.6.2.2

Algorithm Efficiency

In this section, we show the effectiveness of Sweep and then evaluate the impacts of different parameter settings on the performance of Sweep&Search in DCA . In the evaluation, we in turn set δ = 2%(10%, 20%), which means that in each exploring iteration, we select 4%(20%, 40%) servers as the set of candidates from the top 2%(10%, 20%) and the tail 2%(10%, 20%) and leverage neighbor searching on these candidate servers. Table 5.8 shows the costs under different parameter settings. Ucost , Bcost , and Ccost all benefit for a larger δ, which is consistent with the analysis in Sect. 5.7. Especially, compared with 2%, by setting δ to 10% (20%), Bcost and Ccost are improved by 65.3% (71.9%) and 37.4% (48.9%), respectively. However, Ucost gains smaller improvement than Bcost and Ccost , which implies that a small δ can produce a good result in resource load balance. Note that although larger δ yields a reduction in the total cost, it also spends more time to sweep the servers in Algorithm 2.

5.7 Approximation Analysis of Sweep&Search In this section, we would like to prove that the output of Sweep&Search is (1 + , θ )-approximate to the theoretical optimum result P ∗ (for simplification, we use Pˆ instead of Pbest in the subsequent analysis), where  is an accuracy parameter and θ is a confidence parameter that represents the possibility of that accuracy [22]. More specifically, this (1 + , θ )-approximation can be formulated as the following inequality: P r[|Pˆ − P ∗ | ≤ P ∗ ] ≥ 1 − θ.

(5.14)

86

5 Optimization of Container Communication in DC Back-End Servers

If  = 0.05 and θ = 0.1, it means that the output of Sweep&Search Pˆ differs from the optimal solution P ∗ by at most 5% (the accuracy bound) with a probability 90% (the confidence bound). Sweep introduces no deviation, so the deviation of Pˆ and P ∗ mainly comes from the Search stage which consists of two parts: one is to select a subset of host set H to run Search, the other is the stopping condition. Specifically, the core idea of the search algorithm is to select some candidate hosts from H, expand to search three kinds of neighbors for a few iterations, and generate an approximate result in each iteration. Let bi be the branch that is explored on hi and xi be the minimum cost of bi . Assume there are n branches in total; a fact that can be easily seen is that the optimal result (minimum cost) is Q∗ = min{x1 , . . . , xn }. Given an approximation ratio , ˆ meets we would like to prove that the output of Sweep&Search Pˆ with cost Q ˆ − Q∗ | ≤ Q∗ with a bounded probability. We split the total error  into two |Q parts and try to bound the above two errors separately. Bound the error from stopping conditions In each iteration, we explore 2δ ˆ iter = min{min{x1 , . . . , branches. Let Q (1−)Q∗iter }, which means that the stopping condition is a balance of the x2δ }, 2 following two: (1) the minimum cost on these branches reaches the best result; and (2) the threshold 1− 2 is reached. ˆ iter in each iteration satisfies |Q ˆ iter − Q∗ | ≤  Q∗ Lemma 5.1 The output Q iter 2 iter Proof There are two parts: (1) If the minimum cost on these 2δ branches reaches ˆ iter = Q∗ ; (2) If the minimum cost on these branches the current best one, then Q iter ˆ iter = 1− Q∗ . is greater than the current best, then Q iter 2 1− ∗ ∗ ∗ | =  Q∗ , so the deviation is bounded ˆ iter −Q | ≤ | Q −Q Overall, |Q iter iter iter 2 2 iter by 2 . Bound the error from subset selection We now try to prove that we can bound ˆ − Q∗ | by exploring 2δ hosts in each iteration. |Q Before giving the proof, we first introduce the Hoeffding Bound: Hoeffding inequality: There are k random identical and independent variables Vi . For any ε, we have P r[|V − E(V )| ≥ ε] ≤ e−2ε k . 2

(5.15)

With this Hoeffding Bound, we have the following lemma: ˆ − Q∗ | ≤  Q∗ ] when exploring 2δ Lemma 5.2 There is an upper bound for P r[|Q 2 hosts in each iteration. Proof Assume the minimum cost of all branches is in uniform distribution (range ˆ = (min{x1 , . . . , x2δ }) and E(Q) ˆ = (1 − xi )2δ . Let from a to b), so we have Q b−a xi , and Y = 2δ Y , we have Yi = 1 − b−a i i=1

5.8 Conclusion

87

ˆ = Y < Y 2δ . E(Q) n

(5.16)

ˆ and the expectation of the minimum Yi is associated As Y is associated with Q ∗ ∗ ˆ with Q , so Q and Q are linked together. More specifically, E(Y ) = E(Yi )2δ , 1 E(Yi ) = E(Y ) 2δ . Thus, we have E(Q∗ ) = (1 −

n xi n ) = E(Yi )n = E(Y ) 2δ b−a

(5.17)

and n n ˆ − E(Q∗ )| ≥  ] < P r[|Y 2δ − E(Y ) 2δ | ≥  ]. P r[|E(Q) 2 2

(5.18)

Through the above Hoeffding Bound (In Equation 5.15), we have n

n

P r[|Y 2δ − E(Y ) 2δ | ≥

 2  ] ≤ e−2( 2 ) ∗2δ . 2

(5.19)

Finally, ˆ − E(Q∗ )| ≥  ] ≤ e− 2 δ . P r[|E(Q) 2

(5.20)

Now we can combine the above two errors: the error rate from stopping condition is within 2 , and the error rate from subset selection is also bounded to 2 within a probability. So we can propose the whole theorem as follows. Theorem Let Q∗ be the theoretical optimal result (with the minimum overall cost) of the container group reassignment problem; Sweep&Search can output an ˆ where |Q ˆ − Q∗ | ≤ Q∗ with probability at least e− 2 δ . approximate result Q As a conclusion, we can give an upper bound to the deviation, and the possibility is associated with the number of selected hosts in each searching iteration. With a given accuracy, we can further improve that probability by exploring more hosts.

5.8 Conclusion More and more Internet service providers deploy their services in containers due to the promising properties of containerization. However, applying containerization in large-scale Internet data centers faces the trade-off between communication cost and multi-resource load balance. In this chapter, we go into the container distribution problem in large-scale data centers and break it down into two stages, i.e., container placement problem and container reassignment problem, which are both NP-hard. For container placement, we propose an efficient heuristic named Communication Aware Worst Fit Decreasing which extends WFD to CPP by considering both multiple resource

88

5 Optimization of Container Communication in DC Back-End Servers

load balance and communication overhead reduction. For container reassignment problem, we design a two-stage algorithm called Sweep&Search to re-optimize the container distribution, which firstly handles overloaded servers and then optimizes the objectives by local search techniques. Extensive experiments have been conducted to evaluate our algorithms. The results show that our algorithms outperform the state-of-the-art solutions up to 70%. We further implemented our solutions in a data center with more than 6,000 servers and 35 services, and the measurements indicate that our solutions can effectively reduce the communication overhead among interactive containers while simultaneously increasing the overall service throughput up to 90%.

References 1. FreeBSD.chroot FreeBSD ManPages: http://www.freebsd.org/cgi/man.cgi (2016) 2. Felter, W., Ferreira, A.P., Rajamony, R., Rubio, J.C.: An updated performance comparison of virtual machines and Linux containers. In: 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 171–172. IEEE (2015) 3. Kubernetes: http://kubernetes.io/ (2016) 4. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R.H., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol. 11, pp. 22–22 (2011) 5. Yu, T., Noghabi, S.A., Raindel, S., Liu, H., Padhye, J., Sekar, V.: Freeflow: high performance container networking. In: Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 43–49. ACM (2016) 6. Gavranovi´c, H., Buljubaši´c, M.: An efficient local search with noising strategy for google machine reassignment problem. Ann. Oper. Res. 242, 1–13 (2014) 7. Marler, R.T., Arora, J.S.: Survey of multi-objective optimization methods for engineering. Struct. Multidiscip. Optim. 26(6), 369–395 (2004) 8. Gavish, B., Pirkul, H.: Algorithms for the multi-resource generalized assignment problem. Manag. Sci. 37(6), 695–713 (1991) 9. Sahni, S., Gonzalez, T.: P-complete approximation problems. J. ACM 23(3), 555–565 (1976) 10. Johnson, D.S.: Fast algorithms for bin packing. J. Comput. Syst. Sci. 8(3), 272–314 (1974) 11. Lakshmanan, K., Niz, D.D., Rajkumar, R., Moreno, G.: Resource allocation in distributed mixed-criticality cyber-physical systems. In: 2010 IEEE 30th International Conference on Distributed Computing Systems, pp. 169–178. IEEE (2010) 12. Panigrahy, R., Talwar, K., Uyeda, L., Wieder, U.: Heuristics for vector bin packing. research. microsoft. com (2011) 13. Mitrovi´c-Mini´c, S., Punnen, A.P.: Local search intensified: very large-scale variable neighborhood search for the multi-resource generalized assignment problem. Discrete Optim. 6(4), 370–377 (2009) 14. Dıaz, J.A., Fernández, E.: A tabu search heuristic for the generalized assignment problem. Eur. J. Oper. Res. 132(1), 22–38 (2001) 15. Masson, R., Vidal, T., Michallet, J., Penna, P.H.V., Petrucci, V., Subramanian, A., Dubedout, H.: An iterated local search heuristic for multi-capacity bin packing and machine reassignment problems. Expert Syst. Appl. 40(13), 5266–5275 (2013) 16. Zhang, Y., Li, Y., Xu, K., Wang, D., Li, M., Cao, X., Liang, Q.: A communicationaware container re-distribution approach for high performance VNFs. In: IEEE International Conference on Distributed Computing Systems (2017)

References

89

17. Container distribution strategies: https://docs.docker.com/docker-cloud/infrastructure/ deployment-strategies/ (2017) 18. Docker swarm strategies: https://docs.docker.com/swarm/scheduler/strategy/ (2017) 19. Service, A.E.C.: Amazon ECS task placement strategies. https://docs.aws.amazon.com/ AmazonECS/latest/developerguide/task-placement-strategies.html (2017) 20. Han, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco (2005) 21. Christensen, H.I., Khan, A., Pokutta, S., Tetali, P.: Approximation and online algorithms for multidimensional bin packing: a survey. Comput. Sci. Rev. 24, 63–79 (2017) 22. Han, Z., Hong, M., Wang, D.: Signal Processing and Networking for Big Data Applications. Cambridge Press, Cambridge/New York (2017)

Chapter 6

The Deployment of Large-Scale Data Synchronization System for Cross-DC Networks

Abstract Many important cloud services require replicating massive data from one datacenter (DC) to multiple DCs. While the performance of pair-wise inter-DC data transfers has been much improved, prior solutions are insufficient to optimize bulkdata multicast, as they fail to explore the rich inter-DC overlay paths that exist in geo-distributed DCs, as well as the remaining bandwidth reserved for online traffic under fixed bandwidth separation scheme. To take advantage of these opportunities, we present BDS+, a near-optimal network system for large-scale inter-DC data replication. BDS+ is an application-level multicast overlay network with a fully centralized architecture, allowing a central controller to maintain an up-to-date global view of data delivery status of intermediate servers, in order to fully utilize the available overlay paths. Furthermore, in each overlay path, it leverages dynamic bandwidth separation to make use of the remaining available bandwidth reserved for online traffic. By constantly estimating online traffic demand and rescheduling bulkdata transfers accordingly, BDS+ can further speed up the massive data multicast. Through a pilot deployment in one of the largest online service providers and largescale real-trace simulations, we show that BDS+ can achieve 3–5× speedup over the provider’s existing system and several well-known overlay routing baselines of static bandwidth separation. Moreover, dynamic bandwidth separation can further reduce the completion time of bulk data transfers by 1.2 to 1.3 times.

6.1 Motivation of BDS+ Design We start by providing a case for an application-level multicast overlay network. We first characterize the inter-DC multicast workload in Baidu, a global-scale online service provider (Sect. 6.1.1). We then show the opportunities of improving multicast performance by leveraging disjoint application-level overlay paths available in geo-distributed DCs and by leveraging dynamic bandwidth separation (Sect. 6.1.2). Finally, we examine Baidu’s current solution of inter-DC multicast (Gingko) and draw lessons from real-world incidents to inform the design of BDS+ (Sect. 6.1.3). We conclude all these observations, which are based on a dataset of

© Springer Nature Singapore Pte Ltd. 2020 Y. Zhang, K. Xu, Network Management in Cloud and Edge Computing, https://doi.org/10.1007/978-981-15-0138-8_6

91

92

6 The Deployment of Large-Scale Data Synchronization System for Cross-DC. . .

Table 6.1 Inter-DC multicast (replicating data from one DC to many DCs) dominantes Baidu’s inter-DC traffic

(a) Proportion of multicast transfers destined to percent of DCs.

Type of application All applications Blog articles Search indexing Offline file sharing Forum posts Other DB sync-ups

% of multicast traffic 91.13% 91.0% 89.2% 98.18% 98.08% 99.1%

(b) Proportion of multicast transfers larger than certain threshold.

Fig. 6.1 Inter-DC multicasts (a) are destined to a significant fraction of DCs, and (b) have large data sizes

Baidu’s inter-DC traffic collected in a duration of 7 days. The dataset comprises of about 1265 multicast transfers among 30+ geo-distributed DCs (Sect. 6.1.4).

6.1.1 Baidu’s Inter-DC Multicast Workload Share of inter-DC multicast traffic Table 6.1 shows inter-DC multicast (replicating data from one DC to multiple DCs) as a fraction of all inter-DC traffic.1 We see that inter-DC multicast dominates Baidu’s overall inter-DC traffic (91.13%), as well as the traffic of individual application types (89.2 to 99.1%). The fact that inter-DC multicast traffic amounts to a dominating share of inter-DC traffic highlights the importance of optimizing the performance of inter-DC multicast. Where are inter-DC multicasts destined? Next, we want to know if these transfers are destined to a large fraction (or just a handful) of DCs and whether they share common destinations. Figure 6.1a sketches the distribution of the percentage

1 The

overall multicast traffic share is estimated using the traffic that goes through one randomly sampled DC, because we do not have access to information of all inter-DC traffic, but this number is consistent with what we observe from other DCs.

6.1 Motivation of BDS+ Design

93

of Baidu’s DCs to which multicast transfers are destined. We see that 90% of multicast transfers are destined to at least 60% of the DCs and 70% are destined to over 80% of the DCs. Moreover, we found a great diversity in the source DCs and the sets of destination DCs (not shown here). These observations suggest that it is untenable to pre-configure all possible multicast requests; instead, we need a system to automatically route and schedule any given inter-DC multicast transfers. Sizes of inter-DC multicast transfers Finally, Fig. 6.1b outlines the distribution of data size of inter-DC multicast. We see that for over 60% multicast transfers, the file sizes are over 1 TB (and 90% are over 50 GB). Given that the total WAN bandwidth assigned to each multicast is on the order of several Gb/s, these transfers are not transient but persistent, typically lasting for at least tens of seconds. Therefore, any scheme that optimizes multicast traffic must dynamically adapt to any performance variation during a data transfer. On the flip side, such temporal persistence also implies that multicast traffic can tolerate a small amount of delay caused by a centralized control mechanism, such as BDS+ (Sect. 6.2). These observations together motivate the need for a systematic approach to optimizing inter-DC multicast performance.

6.1.2 Potentials of Inter-DC Application-Level Overlay It is known that, generally, multicast can be delivered using application-level overlays [1]. Here, we show that inter-DC multicast completion time (defined by the time until each destination DC has a full copy of the data) can be greatly reduced by an application-level overlay network. Note that an application-level overlay does not require any network-level support, so it is complementary to prior work on WAN optimization. The basic idea of an application-level overlay network is to distribute traffic along bottleneck-disjoint overlay paths [2], i.e., the two paths do not share a common bottleneck link or intermediate server. In the context of inter-DC transfers, two overlay paths either traverse different sequences of DCs (Type I) or traverse different sequences of servers of the same sequence of DCs (Type II), or some combination of the two. Next, we use examples to show bottleneck-disjoint overlay paths can arise in both types of overlay paths and how they improve inter-DC multicast performance. Examples of bottleneck-disjoint overlay paths In Fig. 1.1, we have already seen how two Type I overlay paths (A → B → C and A → C → B) are bottleneckdisjoint, and how it improves the performance of inter-DC multicast. Figure 6.2 shows an example of Type II bottleneck-disjoint overlay paths (traversing the same sequence of DCs but different sequence of servers). Suppose we need to replicate 36 GB data from DC A to B and C via two bottleneck-disjoint paths, (1) A → C: from A through B to C using IP-layer WAN routing with 2 GB/s capacity, or (2) A → b → C, from A to a server b in B with 6 GB/s capacity and b to C with

94

6 The Deployment of Large-Scale Data Synchronization System for Cross-DC. . .

A sends B & C a 36GB-file which consists of six 6-GB blocks (6 ) DC A 6GB/s

DC B Server b

3GB/s

(b) Direct replication: 18 sec

2GB/s DC A

DC B

DC C

DC C

(a) Set up and topology

1 st

DC B DC A

(= max(36GB/6GB/s, 36GB/2GB/s))

DC B

DC C DC A

step:

2 nd step:

DC C

1st step: 3 sec = max(18GB/6GB/s, 6GB/2GB/s)

3 rd step:



… 7 th step:

2nd step: 6 sec

(c) Chain replication: 13 sec (=6GB/6GB/s+6GB/3GB/s+6GB/3GB/s+ 6GB/3GB/s+6GB/3GB/s+6GB/3GB/s+6GB/3GB/s)

= max(18GB/6GB/s, 18GB/3GB/s, 12GB/2GB/s)

(d) Intelligent overlay: 9 sec (= 3s+6s)

Fig. 6.2 An illustrative example comparing the performance of an intelligent application-level overlay (d) with that of baselines: naive application-level overlay (c) and no overlay (b)

3 GB/s capacity. The data is split into six 6GB-blocks. We consider three strategies. (1) Direct replication: if A sends data directly to B and C via WAN paths (Fig. 6.2b), the completion time is 18 s. (2) Simple chain replication: a naive use of applicationlevel overlay paths is to send blocks through server b acting as a store-and-relay point (Fig. 6.2c), and the completion time is 13 s (27% less than without overlay). (3) Intelligent multicast overlay: Fig. 6.2d further improves the performance by selectively sending blocks along the two paths simultaneously, which completes in 9 s (30% less than chain replication, and 50% less than direct replication). Bottleneck-disjoint overlay paths in the wild It is hard to identify all bottleneckdisjoint overlay paths in our network performance dataset, since it does not have per-hop bandwidth information of each multicast transfer. Instead, we observe that if two overlay paths have different end-to-end throughput at the same time, they should be bottleneck-disjoint. We show one example of bottleneck-disjoint overlay paths in the wild, which consists of two overlay paths A → b → C and A → C, where the WAN routing from DC A to DC C goes through DC B and b is a server in B (these two paths are topologically identical to Fig. 6.2). If BWA→C = 1, they BWA→b→C

are bottleneck-disjoint (BWp denotes the throughput of path p). Figure 6.3 shows the distribution of BWA→C among all possible values of A, b, and C in the dataset. BWA→b→C

We can see that more than 95% pairs of A → b → C and A → C have different end-to-end throughput, i.e., they are bottleneck disjoint.

6.1 Motivation of BDS+ Design

95

Fig. 6.3 There is a significant performance variance among the inter-DC overlay paths in our network, indicating that most pairs of overlay paths are bottleneck disjoint

1

CDF

0.8 0.6 0.4 0.2 0 0.5

1

1.5

2

Ratio 100%

Latency-sensitive traffic experienced 30× longer del

80% 60% 40% 20% Day1 12:00

Safety threshold

More than 50% bandwidth is wasted

Day1 24:00

Day2 12:00

Outgoing (outbound) link

Day2 24:00

Incoming (inbound) link

Fig. 6.4 The utilization of the inter-DC link in 2 days: The traffic valley on the 1st day results in nearly 50% bandwidth waste. Inter-DC bulk data transfer on the 2nd day caused severe interference on latency-sensitive traffic

Interaction with latency-sensitive traffic The existing multicast overlay network shares the same inter-DC WAN with latency-sensitive traffic. Despite using standard QoS techniques, and giving the lowest priority to bulk data transfers, we still see negative impacts on latency-sensitive traffic by bursty arrivals of bulk-data multicast requests and inefficiency on bulk-data transfer when latency-sensitive traffic is in its valley. Figure 6.4 shows the bandwidth utilization of an inter-DC link in 2 days during which a 6-hour-long bulk data transfer started at 11:00pm on the second day. The blue line denotes the outgoing bandwidth, and the green line denotes the incoming bandwidth. We can see that the bulk data transfer caused excessive link utilization (i.e., exceeding the safety threshold of 80%), and as a result, the latencysensitive online traffic experienced over 30× delay inflation. Also, at 4:00–5:00am in the first day, near 50% of the bandwidth was being wasted. These cases show that an algorithm with dynamical interactions with latency-sensitive traffic would be more reasonable and efficient.

6.1.3 Limitations of Existing Solutions Realizing and demonstrating the potential improvement of an application-level overlay network has some complications. As a first-order approximation, we

96

6 The Deployment of Large-Scale Data Synchronization System for Cross-DC. . .

can simply borrow existing techniques from multicast overlay networks in other contexts. But the operational experience of Baidu shows two limitations of this approach that will be described below. Existing solutions of Baidu To meet the need of rapid growth of inter-DC data replication, Baidu has deployed Gingko, an application-level overlay network a few years ago. Despite years of refinement, Gingko is based on a receiver-driven decentralized overlay multicast protocol, which resembles what was used in other overlay networks (such as CDNs and overlay-based live video streaming [3–5]). The basic idea is that when multiple DCs request a data file from a source DC, the requested data would flow back through multiple stages of intermediate servers, where the selection of senders in each stage is driven by the receivers of the next stage in a decentralized fashion. Limitation 1: Inefficient local adaptation The existing decentralized protocol lacks the global view and thus suffers from suboptimal scheduling and routing decisions. To show this, we sent a 30 GB file from one DC to two destination DCs in Baidu’s network. Each DC had 640 servers, each with 20 Mbps upload and download bandwidth (in the same magnitude of bandwidth assigned to each bulk-data transfer in production traffic). This 30 GB file was evenly stored across all these 640 servers. Ideally, if the servers select the best source for all blocks, 30×1024 the completion time will be 640×20 Mbps × 60 s/min = 41 min. But as shown in Fig. 6.5, servers in the destination DCs on average took 195 min (4.75× the optimal completion time) to receive data, and 5% of servers even waited for over 250 min. The key reason for this problem is that individual servers only see a subset of available data sources (i.e., servers who have already downloaded part of a file) and thus cannot leverage all available overlay paths to maximize the throughput. Such suboptimal performance could occur even if the overlay network is only partially decentralized (e.g., [6]), where even if each server does have a global view, local adaptations by individual servers would still create potential hotspots and congestion on overlay paths. Limitation 2: High computation overhead To obtain a global view and achieve optimal scheduling protocols, existing centralized protocols suffer from high computation overhead. Most formulations are superlinear, so the computational overhead of centralized protocols always grows exponentially, making them intractable in practice. 1 0.8

CDF

Fig. 6.5 The CDF of the actual flow completion time at different servers in the destination DCs, compared with that of the ideal solution

0.6 Current Solution Ideal solution

0.4 0.2 0 0

100

200

300

400

Completion Time (m)

6.2 System Overview

97

Limitation 3: Fixed bandwidth separation As shown in Fig. 6.4, a fixed separation of link bandwidth would result in both excessive utilization and underutilization. Ideally, if we can make full use of the available bandwidth left by online traffic in real time, then the link utilization would be more stable. In this particular example, about 18.75% bandwidth was wasted in those 2 days (while still caused excessive utilization case).

6.1.4 Key Observations The key observations from this section are following: • Inter-DC multicasts amount to a substantial fraction of inter-DC traffic, have a great variability in source-destination, and typically last for at least tens of seconds. • Bottleneck-disjoint overlay paths are widely available between geo-distributed DCs. • Existing solutions that rely on local adaptation can have suboptimal performance and negative impact on online traffic. • Dynamic bandwidth separation can be helpful to improve link utilization by making full use of the remaining bandwidth of online services.

6.2 System Overview To optimize inter-DC multicasts on overlay network with dynamical separation with latency-sensitive traffic, we present BDS+, a fully centralized near-optimal network system with dynamic bandwidth separation for data inter-DC multicast. Before presenting the details, we first highlight the intuitions behind the design choices and the challenges behind its realization. Centralized control Conventional wisdom on wide-area overlay networks has relied, to some extent, on local adaptation of individual nodes (or relay servers) to achieve desirable scalability and responsiveness to network dynamics (e.g., [3, 6– 8]), despite the resulting suboptimal performance due to lack of global view or orchestration. In contrast, BDS+ takes an explicit stance that it is practical to fully centralize the control of wide-area overlay networks and still achieve nearoptimal performance in the setting of inter-DC multicasts. The design of BDS+ coincides with other recent works that centralize the management of large-scale distributed systems, e.g., [9]. At a high level, BDS+ uses a centralized controller that periodically pulls information (e.g., data delivery status) from all servers, updates the decisions regarding overlay routing, and pushes them to agents running locally on servers (Fig. 6.6). Note that when the controller fails or is unreachable, the system will fall back to a decentralized control scheme to ensure graceful performance degradation to local adaptation (Sect. 6.5.3).

98

6 The Deployment of Large-Scale Data Synchronization System for Cross-DC. . .

BDS + Controller 1

2

DC

Server

1 2

Data transmission Gather data delivery status to the controller Push overlay routing decisions to servers

Fig. 6.6 The centralized design of BDS+

Our centralized design is driven by several empirical observations: 1. Large decision space: The sheer number of inter-DC overlay paths (which grow exponentially with the increasing number servers acting as overlay nodes) makes it difficult for individual servers to explore all available overlay paths based only on local measurements. In contrast, we could significantly improve overlay multicast performance by maintaining a global view of data delivery status of all servers and dynamically balancing the availability of various data blocks, which turns out to be critical to achieving near-optimal performance (Sect. 6.3.3). 2. Large data size: Unlike latency-sensitive traffic which lasts on timescales of several to 10 s of milliseconds, inter-DC multicasts last on much coarser timescales. Therefore, BDS+ can tolerate a short delay (of a few seconds) in order to get better routing decisions from a centralized controller which maintains a global view of data delivery and is capable of orchestrating all overlay servers. 3. Flexible traffic control: BDS+ can enforce bandwidth allocation by setting limit rates in each data transfer, while each server can use Linux Traffic Control (tc) to enforce the limit on the teal bandwidth usage. This allows BDS+ to leverage flexible dynamic bandwidth separation. Once any network changes are detected, BDS+ could easily adjust bandwidth for each data transfer by controlling the sending rate at all servers in a centralized fashion (no matter to reserve more bandwidth when online traffic burst, or to reduce transfer rate when online traffic is in valley) (Sect. 6.5.4). 4. Lower engineering complexity: Conceptually, the centralized architecture moves the control complexity to the centralized controller, making BDS+ amenable to a simpler implementation, in which the control logic running locally in each server can be stateless and triggered only on arrivals of new data units or control messages.

6.3 Near-Optimal Application-Level Overlay Network

99

The key to realizing centralized control In essence, the design of BDS+ performs a trade-off between incurring a small update delay in return for the near-optimal decisions brought by a centralized system. Thus, the key to striking such a favorable balance is a near-optimal yet efficient overlay routing algorithm that can update decisions in near real time. At a first glance, this is indeed intractable. For the workload at a scale of Baidu, the centralized overlay routing algorithm must pick the next hops for 105 of data blocks from 104 servers. This operates at a scale that could grow exponentially when we consider the growth in the number of possible overlay paths that go through these servers and with finer grained block partitioning. With the standard routing formulation and linear programming solvers, it could be completely unrealistic to make near-optimal solutions by exploring such a large decision space (Sect. 6.6.2.4). The key to realizing dynamic bandwidth separation Dynamic bandwidth separation raises two requirements, one is to reserve enough bandwidth for latencysensitive online traffic so as to avoid negative impacts on these services, and the other is to make full use of the residual bandwidth so as to reduce the completion time of bulk data transfer. With the traditional strict safety threshold and decentralized protocols, it could be impossible to make efficient bandwidth usage in the dynamic and mixed deployed network (Sect. 6.6.3). The following two section will present how BDS+ works.

6.3 Near-Optimal Application-Level Overlay Network The core of BDS+ is a centralized decision-making algorithm that periodically updates overlay routing decisions at scale and in near real-time. BDS+ strikes a favorable trade-off between solution optimality and near real-time updates by decoupling the control logic into two steps (Sect. 6.3.2): overlay scheduling, i.e., which data blocks to be sent (Sect. 6.3.3), and routing, i.e., which paths to use to send each data block (Sect. 6.3.4), each of which can be solved efficiently and nearoptimally.

6.3.1 Basic Formulation We begin by formulating the problem of overlay traffic engineering. Table 6.2 summarizes the key notations. The overlay traffic engineering in BDS+ operates at a fine granularity, both spatially and temporally. To exploit the many overlay paths between the source and destination DCs, BDS+ splits each data file into multiple data blocks (e.g., 2 MB).

100

6 The Deployment of Large-Scale Data Synchronization System for Cross-DC. . .

Table 6.2 Notations used in BDS+’s decision-making logic Variables B b ρ(b) Ps,s  p l c(l)

T Tk (T )

Meaning Set of blocks of all tasks A block The size of block b Set of paths between a source and destination pair A particular path A link on a path Capacity of link l A scheduling cycle The k-th update cycle

wb,sk

Binary: if s is chosen as destination server for b at Tk

Rup (s)

Upload capacity of server s

Rdown (s)

Download capacity of server s

(T )

fb,pk

Bandwidth allocated to send b on path p at Tk

To cope with changes of network conditions and arrivals of requests, BDS+ updates the decisions of overlay traffic engineering every T (by default, 3 s2 .). Now, the problem of multicast overlay routing can be formulated as following: Input BDS+ takes as input the following parameters: the set of all data blocks B, each block b with size ρ(b), the set of paths from server s  to s, Ps  ,s , the update cycle interval T , and for each server s the upload (resp. download) capacity Rup (s) (resp. Rdown (s)). Note that each path p consists of several links l, each defined by a pair of servers or routers. We use c(l) to denote the capacity of a link l. Output For each cycle Tk , block b, server s, and path p ∈ Ps  ,s destined to s, (T ) (T ) (T ) BDS+ returns as output a 2-tuple wb,sk , fb,pk , in which wb,sk denotes whether (Tk ) denotes how server s is selected as the destination server of block b in Tk , fb,p (T )

much bandwidth is allocated to send block b on path p in Tk , and fb,pk = 0 denotes path p is not selected to send block b in Tk . Constraints • The allocated bandwidth on path p must not exceed the capacity of any link l in p, as well as the upload capacity of the source server Rup (s), and the download capacity of the destination server Rdown (s  ).

2 We

use a fixed interval of 3 s, because it is long enough for BDS+ to update decisions at a scale of Baidu’s workload, and short enough to adapt to typical performance churns without noticeable impact on the completion time of bulk data transfers. More details in Sect. 6.6

6.3 Near-Optimal Application-Level Overlay Network

101

(T ) (T ) (T ) fb,pk ≤ min minl∈p c(l), qb,sk · Rup (s  ), wb,sk · Rdown (s)

(6.1)

for ∀b, p ∈ Ps  ,s (T ) (T ) where qb,sk = 1 − i k, since otherwise, the multicast is already complete. Next, we prove that in a simplified setting, BDS+’s completion time in A is strictly less than B. To simplify the calculation of BDS+’s completion time, we now make a few assumptions (which are not critical to our conclusion): (1) all servers have the same upload (resp. download) bandwidth Rup (resp. Rdown ), and (2) no two duplicas share the same source (resp. destination) server, so the upload (resp. download) bandwidth of each block is Rup (resp. Rdown ). Now we can write the completion time in the two cases as following: tA = tB =

V min{c(l),

kRup kRdown m−k , m−k }

V k R

(6.6)

k R

Rdown k2 Rdown 1 up min{c(l), m−k , 2 up , k1m−k , m−k2 } 1 m−k2 1

where V denotes the total size of the untransmitted blocks, V = N (m − k)ρ(b) = N N 2 (m − k1 )ρ(b) + 2 (m − k2 )ρ(b). In the production system of Baidu, the interDC link capacity c(l) is several orders of magnitudes higher than upload/download capacity of a single server, so we can safely exclude c(l) from the denominator in the equations. Finally, if we denote min{Rup , Rdown } = R, then tA = (m−k)V and kR 1 )V . tB = (m−k k1 R We can show that

(m−k)V kR

is a monotonically decreasing function of k:

d (m − k)2 Nρ(b) Nρ(b) m2 d (m − k)V = = (1 − 2 ) < 0 dk kR dk kR R k

(6.7)

Now, since k > k1 , we have tA < tB .

References 1. Chu, Y.-H., Rao, S.G., Zhang, H.: A case for end system multicast. ACM SIGMETRICS Perform. Eval. Rev. 28(1), 1–12 (2000). ACM, New York 2. Datta, A.K., Sen, R.K.: 1-approximation algorithm for bottleneck disjoint path matching. Inf. Process. Lett. 55(1), 41–44 (1995) 3. Andreev, K., Maggs, B.M., Meyerson, A., Sitaraman, R.K.: Designing overlay multicast networks for streaming. In: Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 149–158 (2003) 4. Sripanidkulchai, K., Maggs, B., Zhang, H.: An analysis of live streaming workloads on the internet. In: IMC, pp. 41–54. ACM (2004) 5. Zhang, X., Liu, J., Li, B., Yum, Y.-S.: CoolStreaming/DONet: a data-driven overlay network for peer-to-peer live media streaming. In: INFOCOM, vol. 3, pp. 2102–2111. IEEE (2005)

References

119

6. Huang, T.Y., Johari, R., Mckeown, N., Trunnell, M., Watson, M.: A buffer-based approach to rate adaptation: evidence from a large video streaming service. In: SIGCOMM, pp. 187–198 (2014) 7. Repantis, T., Smith, S., Smith, S., Wein, J.: Scaling a monitoring infrastructure for the Akamai network. ACM SIGOPS Oper. Syst. Rev. 44(3), 20–26 (2010) 8. Mukerjee, M.K., Hong, J., Jiang, J., Naylor, D., Han, D., Seshan, S., Zhang, H.: Enabling near real-time central control for live video delivery in CDNS. ACM SIGCOMM Comput. Commun. Rev. 44(4), 343–344 (2014). ACM 9. Gog, I., Schwarzkopf, M., Gleave, A., Watson, R.N.M., Hand, S.: Firmament: Fast, Centralized Cluster Scheduling at Scale. In: OSDI, pp. 99–115. USENIX Association, Savannah (2016). [Online]. Available: https://www.usenix.org/conference/osdi16/technical-sessions/ presentation/gog 10. Cohen, B.: Incentives build robustness in bittorrent. In: Proceedings of the First Workshop on the Economics of Peer-to-Peer Systems, pp. 1–1 (2003) 11. Garg, N., Vazirani, V.V., Yannakakis, M.: Primal-dual approximation algorithms for integral flow and multicut in trees. Algorithmica 18(1):3–20 (1997) 12. Garg, N., Koenemann, J.: Faster and simpler algorithms for multicommodity flow and other fractional packing problems. SIAM J. Comput. 37(2):630–652 (2007) 13. Reed, M.J.: Traffic engineering for information-centric networks. In: IEEE ICC, pp. 2660– 2665 (2012) 14. Fleischer, L.K.: Approximating fractional multicommodity flow independent of the number of commodities. In: SIDMA, pp. 505–520 (2000) 15. Friedrich and Pukelsheim: The three sigma rule. Am. Stat. 48(2):88–91 (1994) [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/00031305.1994.10476030 16. Adams, R.P., MacKay, D.J.: Bayesian online changepoint detection. arXiv preprint :0710.3742 (2007) 17. Roberts, S.: Control chart tests based on geometric moving averages. Technometrics 1(3):239– 250 (1959) 18. Lucas, J.M., Saccucci, M.S.: Exponentially weighted moving average control schemes: properties and enhancements. Technometrics 32(1):1–12 (1990) 19. Smith, A.: A Bayesian approach to inference about a change-point in a sequence of random variables. Biometrika 62(2):407–416 (1975) 20. Stephens, D.: Bayesian retrospective multiple-changepoint identification. Appl. Stat. 43, 159– 178 (1994) 21. Barry, D., Hartigan, J.A.: A Bayesian analysis for change point problems. J. Am. Stat. Assoc. 88(421):309–319 (1993) 22. Green, P.J.: Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika 82(4):711–732 (1995) 23. Page, E.: A test for a change in a parameter occurring at an unknown point. Biometrika 42(3/4):523–527 (1955) 24. Desobry, F., Davy, M., Doncarli, C.: An online kernel change detection algorithm. IEEE Trans. Signal Process. 53(8):2961–2974 (2005) 25. Lorden, G., et al.: Procedures for reacting to a change in distribution. Ann. Math. Stat. 42(6):1897–1908 (1971) 26. Bayesian changepoint detection. https://github.com/dtolpin/bocd 27. Kumar, A., Jain, S., Naik, U., Raghuraman, A., Kasinadhuni, N., Zermeno, E.C., Gunn, C.S., Björn Carlin, J.A., Amarandei-Stavila, M., et al.: BwE: flexible, hierarchical bandwidth allocation for WAN distributed computing. In: ACM SIGCOMM, pp. 1–14 (2015) 28. Wang, H., Li, T., Shea, R., Ma, X., Wang, F., Liu, J., Xu, K.: Toward cloud-based distributed interactive applications: measurement, modeling, and analysis. In: IEEE/ACM ToN (2017) 29. 
Chen, Y., Alspaugh, S., Katz, R.H.: Design insights for MapReduce from diverse production workloads. University of California, Berkeley, Department of Electrical Engineering & Computer Sciences, Technical Report (2012)

120

6 The Deployment of Large-Scale Data Synchronization System for Cross-DC. . .

30. Kavulya, S., Tan, J., Gandhi, R., Narasimhan, P.: An analysis of traces from a production mapreduce cluster. In: CCGrid, pp. 94–103. IEEE (2010) 31. Mishra, A.K., Hellerstein, J.L., Cirne, W., Das, C.R.: Towards characterizing cloud backend workloads: insights from Google compute clusters. ACM SIGMETRICS PER 37(4):34–41 (2010) 32. Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In: Proceedings of the Third ACM Symposium on Cloud Computing, p. 7. ACM (2012) 33. Lamport, L.: The part-time parliament. ACM TOCS 16(2):133–169 (1998) 34. The go programming language: https://golang.org 35. Kosti´c, D., Rodriguez, A., Albrecht, J., Vahdat, A.: Bullet: high bandwidth data dissemination using an overlay mesh. ACM SOSP 37(5), 282–297 (2003). ACM 36. Solve linear programming problems – matlab linprog: https://cn.mathworks.com/help/optim/ ug/linprog.html?s_tid=srchtitle 37. cluster-trace-v2018 from ali: https://github.com/alibaba/clusterdata/blob/v2018/cluster-tracev2018/trace_2018.md

Chapter 7

Storage Issues in the Edge

Abstract Recent years have witnessed a rapid increase of short video traffic in content delivery network (CDN). While the video contributors change from large video studios to distributed ordinary end users, edge computing naturally matches the cache requirements from short video network. But the distributed edge caching exposes some unique characteristics: non-stationary user access pattern and temporal and spatial video popularity pattern, which severely challenge the edge caching performance. While the Quality of Experience (QoE) in traditional CDN has been much improved, prior solutions become invalid in solving the above challenges. In this chapter, we present AutoSight, a distributed edge caching system for short video network, which significantly boosts cache performance. AutoSight consists of two main components, solving the above two challenges, respectively: (i) the CoStore predictor, which solves the non-stationary and unpredictability of local access pattern, by analyzing the complex video correlations, and (ii) a caching engine Viewfinder, which solves the temporal and spatial video popularity problem by automatically adjusting future horizon according to video life span. All these inspirations and experiments are based on the real traces of more than 28 million videos with 100 million accesses from 488 servers located in 33 cities. Experiment results show that AutoSight brings significant boosts on distributed edge caching in short video network.

7.1 Introduction We start by characterizing the short video network, illustrating the essential difference with traditional CDN, and then we show the limitations of existing schemes and draw lessons from real-world traces to inform the design of AutoSight. The findings are based on real datasets from Kuaishou’s caching network collected during 4 days, from 9 October 2018 to 12 October 2018.

© Springer Nature Singapore Pte Ltd. 2020 Y. Zhang, K. Xu, Network Management in Cloud and Edge Computing, https://doi.org/10.1007/978-981-15-0138-8_7

121

122

7 Storage Issues in the Edge

7.1.1 The Characteristics of Edge Caching in Short Video Network Short video platforms allow users to upload seconds of videos (usually within 15 s) to the network [1–6]; such convenience on content uploading and accessing finally leads to a revolution in the way network works and the way videos are cached [7–12]. Non-stationary user access pattern Figure 7.1 shows the request number of two particular videos v1 and v2 , from which we can see the non-stationary of the user access pattern (with sudden increase and decrease). In the first 2 minutes, v1 receives less requests while v2 receives more, but the access pattern reverses at 2nd min (request burst on v1 while valley on v2 ). There is also a similar reverse at the 5th min. This indicates that the video popularity in the past could not represent that in the future because the access pattern is no longer stationary. Temporal and spatial video popularity pattern Figure 7.2 shows the life spans of two videos from an edge server but appear in different time periods, which are significantly different from each other. This figure illustrates the variability of video life span during different time period, and similar situations also occur among different edge servers, which means that in some temporal and spatial video popularity pattern, videos expire at different speeds; this finding motivates us to equip our caching engine with auto-adjusted future horizons (Viewfinder in Sect. 7.2.3).

v1 v2

18

Access Number

16 14 12 10 8 6 4 2 0 1

2

3

4

Time(min) Fig. 7.1 Non-stationary video access pattern

5

6

7.1 Introduction

123

v1 v2

70

Access Number

60 50 40 30 20 10 0 0

50

100

150

200

Time(min) Fig. 7.2 Temporal and spatial video popularity pattern

7.1.2 Limitations of Existing Solutions Realizing caching improvement of short video network has some complications. As a first-order approximation, we planned to simply borrow existing techniques from traditional CDN. But the above two characteristics result in inefficiency of existing approach that will be described below. Reactive caching policy The most representative heuristic reactive policies are FIFO, LRU, and LFU; such policies become inefficient under the non-stationary access pattern [13–17]. In the case shown in Fig. 7.1, v1 will be ejected at the 2nd min if there is a new video request under both LRU and LFU, because v1 is less recently used and less frequently used in the first 2 mins. However, as we pointed out in Sect. 7.1.1, the popularity in the past could not represent that in the future due to the non-stationary access pattern; v1 gets more requests in the next time slot, so the ejected video at the 2nd min should be v2 . Key Observation 1: The non-stationary access pattern makes the heuristic reactive caching policy invalid. Proactive caching policy Existing learning-based proactive caching policies always look a fixed length into the future to predict video popularity, for example,

t time, which is a fixed length probability window, and their output is a sequence of k future popularity probabilities, where k represents the number of probabilities to predict during the future t time. In this book we name this period future horizon. Definition 1 Future horizon. It represents how far we look into the future when we plan the caching, it is the length of future time t , during which period of time, the predicted video popularity could represent the popularity of a current video.

124

7 Storage Issues in the Edge

With a fixed-length future horizon, these policies conduct a replacement if the new video gets a higher predicted popularity than an already cached one. But in the short video network scenario, the average video life span on different edge servers or during different time periods significantly varies from each other; there is no “one size fits all.” In the case shown in Fig. 7.2, if future horizon t is set to 2 h, even though those policies can have 100% prediction accuracy, the one would be ejected at 50th min is v1 rather than v2 , because the predicted popularity of v1 is less than that of v2 in the next 2 h. But in reality, v1 is much more popular than v2 in the near future. Ejecting v1 will obviously downgrade the performance. Key Observation 2: Temporal and spatial video popularity pattern on different edge servers and at different time periods make the fixed-horizon proactive caching policy inefficient.

7.2 AutoSight Design The core of AutoSight is a distributed caching algorithm that makes adaptive caching for edge servers. There are two main components in AutoSight: a correlation analyzer named CoStore, which solves the non-stationary access pattern problem by analyzing the videos correlations, and a caching engine named Viewfinder, which solves the temporal/spatial popularity problem by automatically adjusting horizon to adapt to different edge caching server during different time periods.

7.2.1 System Overview AutoSight takes an explicit stance that it works the distributed edge caching servers to handle the non-stationary user access pattern and temporal/spatial video popularity pattern, significantly boosting the cache hit rate in the setting of short video network. AutoSight uses a correlation-based predictor that predicts the number of times a video will be requested by making real-time analysis of videos’ cross visits and uses a caching engine with adaptive future horizon to make caching decisions. The framework of AutoSight is shown in Fig. 7.3.

7.2.2 Correlation-Based Predictor: CoStore For a particular video in short video network, although the historical access pattern is non-stationary, correlations with other videos could provide much room to make accurate prediction [18]. Inspired by the long short-term memory (LSTM) network that has already shown its dominance in natural language processing (NLP), machine translation, and sequence prediction, CoStore is built on LSTM, using not

7.2 AutoSight Design

125

AutoSight CoStore

S1 S2

Y Cache Mem

Viewfinder Access seq. Life span

Cross visits from other edge servers Cross visits to other edge servers

Δt

Download videos from backend server Upload local videos to backend server

Fig. 7.3 The design of distributed edge caching and the framework of AutoSight

only video access pattern but also video correlation as input features and predicting request numbers within the future horizon (Sect. 7.2.3). In particular, for video vi at time t, the input consists of two sets of access sequence: S1 = {rv1i , rv2i , . . . , rvt i } and S2 = {rv1j , rv2j , . . . , rvt j }, where rvki denotes the request number of vi at time k and vj is the most related video to vi at time t. The output of CoStore is the expected request number of vi during the future horizon t .

7.2.3 Caching Engine: Viewfinder As shown in Fig. 7.2, edge caching servers are experiencing temporal and spatial video popularity pattern. Recall the case shown in Sect. 7.1.2; inappropriate future horizon t (too short-sighted or too long-sighted) always leads to inefficient or even wrong caching decisions. We therefore design Viewfinder, which can adjust future horizon automatically. The challenge here is that there are too many options to explore (in the granularity of seconds/minutes), which introduces unacceptable overhead. Viewfinder changes it into a classification problem that chooses t from a predefined set:

T = {60min, 120min, 160min, 180min, 200min, 360min}. This significantly reduces the computation overhead, and the experiment results show that Viewfinder works well in edge caching servers, disclosing that the quicker videos get expired, the shorter sight the caching policy should be.

126

7 Storage Issues in the Edge

7.3 AutoSight Experiment In this section, we evaluate our approach AutoSight using real traces and show the results of applying AutoSight on them versus the existing representative policies.

7.3.1 Experiment Setting Algorithms We compare AutoSight with four existing solutions: FIFO, LRU, LFU, and LSTM-based prediction scheme without the auto-adjusted future horizon. Datasets We analyze the traces from 2 cities with 1,128,989 accesses to 132,722 videos in 24 h. Each trace item contains the timestamp, anonymized source IP, video ID and url, file size, location, server ID, cache status, and consumed time. Thus we can deploy and evaluate different caching policies.

7.3.2 Performance Comparison

We first provide a dataset introduction and analysis for these two cities, then compare the overall hit rate on edge servers among the five caching policies; after that, we look into AutoSight itself and show the power of Viewfinder with auto-adjusted future horizons.

Dataset Analysis Figure 7.4a shows the access number per minute during the 24 hours on a particular edge server; the number of requested videos per minute is higher during 20:00–21:00 than at midnight.


Fig. 7.4 Dataset analysis and the performance of Viewfinder. (a) User access pattern. (b) Video popularity pattern


Fig. 7.5 Performance of Viewfinder. (a) The power of Viewfinder. (b) Hit rate comparison

Figure 7.4b shows the popularity of two specific videos from these two time periods, respectively: v1 is the popular video during 20:00–21:00, and it expires more quickly than v2, which is relatively popular at midnight, illustrating the temporal video popularity pattern.

The power of Viewfinder To evaluate the effect of the caching engine Viewfinder with the auto-adjusted future horizon, we show the cache hit rate with Viewfinder set to a fixed Δt. Figure 7.5a shows the corresponding cache hit rate under different future horizons. The optimal value varies with time: during late midnight, when video life spans are longer, Viewfinder tends to be long-sighted, while during leisure time (e.g., 20:00–21:00), when video life spans are shorter, Viewfinder tends to be short-sighted. These results further emphasize the necessity of Viewfinder with an adaptive future horizon.

Overall cache hit rate As analyzed in Sect. 7.1.2, the non-stationary access pattern makes reactive caching policies inefficient, and the temporal/spatial video popularity pattern also invalidates learning-based policies with a fixed-length future horizon. Figure 7.5b shows the overall cache hit rate of the five policies. AutoSight outperforms all the existing algorithms.

7.4 Conclusion

In this chapter, we analyze the Kuaishou dataset and use trace-driven experiments to motivate and investigate edge caching performance for short video networks. We first disclose the characteristics of the non-stationary user video access pattern and the temporal/spatial video popularity pattern and illustrate, with two real cases, why existing caching policies fail. Then we design AutoSight, a distributed edge caching system for short video networks, built on CoStore and Viewfinder. Results show that enabling AutoSight in edge caching servers significantly outperforms the existing algorithms.


References

1. Lorden, G., et al.: Procedures for reacting to a change in distribution. Ann. Math. Stat. 42(6), 1897–1908 (1971)
2. Desobry, F., Davy, M., Doncarli, C.: An online kernel change detection algorithm. IEEE Trans. Signal Process. 53(8), 2961–2974 (2005)
3. Page, E.: A test for a change in a parameter occurring at an unknown point. Biometrika 42(3/4), 523–527 (1955)
4. Adams, R.P., MacKay, D.J.: Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742 (2007)
5. Lucas, J.M., Saccucci, M.S.: Exponentially weighted moving average control schemes: properties and enhancements. Technometrics 32(1), 1–12 (1990)
6. Roberts, S.: Control chart tests based on geometric moving averages. Technometrics 1(3), 239–250 (1959)
7. Ferragut, A., Rodríguez, I., Paganini, F.: Optimizing TTL caches under heavy-tailed demands. ACM SIGMETRICS Perform. Eval. Rev. 44(1), 101–112 (2016)
8. Berger, D.S., Gland, P., Singla, S., Ciucu, F.: Exact analysis of TTL cache networks: the case of caching policies driven by stopping times. ACM SIGMETRICS Perform. Eval. Rev. 42(1), 595–596 (2014)
9. Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732 (1995)
10. Barry, D., Hartigan, J.A.: A Bayesian analysis for change point problems. J. Am. Stat. Assoc. 88(421), 309–319 (1993)
11. Stephens, D.: Bayesian retrospective multiple-changepoint identification. Appl. Stat. 43, 159–178 (1994)
12. Smith, A.: A Bayesian approach to inference about a change-point in a sequence of random variables. Biometrika 62(2), 407–416 (1975)
13. Mao, H., Netravali, R., Alizadeh, M.: Neural adaptive video streaming with Pensieve. In: Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 197–210. ACM (2017)
14. Sadeghi, A., Sheikholeslami, F., Giannakis, G.B.: Optimal and scalable caching for 5G using reinforcement learning of space-time popularities. IEEE J. Sel. Top. Signal Process. 12(1), 180–190 (2018)
15. Narayanan, A., Verma, S., Ramadan, E., Babaie, P., Zhang, Z.-L.: DeepCache: a deep learning based framework for content caching. In: Proceedings of the 2018 Workshop on Network Meets AI & ML, pp. 48–53. ACM (2018)
16. Basu, S., Sundarrajan, A., Ghaderi, J., Shakkottai, S., Sitaraman, R.: Adaptive TTL-based caching for content delivery. ACM SIGMETRICS Perform. Eval. Rev. 45(1), 45–46 (2017)
17. OpenFlow: OpenFlow specification. http://archive.openflow.org/wp/documents
18. Zhang, Y., Li, P., Zhang, Z., Bai, B., Zhang, G., Wang, W., Lian, B.: Challenges and chances for the emerging short video network. In: INFOCOM, pp. 1–2. IEEE (2019)

Chapter 8

Computing Issues in the Edge

Abstract Along with the development of IoT and mobile edge computing in recent years, everything can be connected to the network at any time, resulting in highly dynamic networks with time-varying connections. Controllability has long been recognized as one of the fundamental properties of such temporal networks; it can provide valuable insights for the construction of new infrastructures and thus urgently needs to be explored. In this chapter, we take smart transportation as an example: we first disclose the controllability problem in the IoV (Internet of Vehicles) and then design a DND (driver node) algorithm based on Kalman's rank condition to analyze the controllability of dynamic temporal networks and to calculate the minimum number of driver nodes. Finally, we conduct a series of experiments to analyze the controllability of the IoV network, and the results show the effects of vehicle density, speed, and connection radius on network controllability. These insights are critical for a variety of applications in future smart connected living.

8.1 Background

Consider a scene from everyday life: vehicles travel along a road section at a certain density, and each vehicle moves along the road at a certain speed [1–5]. Each vehicle has a certain communication capability to exchange the required data with other vehicles within its communication radius. If each car is regarded as a node and an edge is drawn between two cars that can exchange data with each other, then at some point in time the nodes and the connections between them can be abstracted as an undirected graph, as shown in Fig. 8.1. Since the vehicles are constantly moving, the connection relationships between them are dynamic, so they need to be recalculated every small period of time [6–10]. The time interval for recalculating the connection relationships is called the refresh time. The refresh time is usually related to the velocity of the vehicles: to ensure the controllability of the network, the faster the vehicle speed, the higher the refresh frequency.


Fig. 8.1 The abstract graph of vehicle network

Fig. 8.2 The abstract graph of vehicle network with driver nodes

When we set a so-called control time, the network of all vehicles in the entire area needs to reach a controllable state within this time period, and this period includes at least one refresh time, usually more than one. During the control time, the connection status of all vehicles in the entire area is updated after each refresh [11–16]. We need not only the connection status at the current time but also the previous connection status. Finally, we obtain the connection state accumulated over one control time, which can be abstracted into an undirected graph. According to the undirected graph generated with those parameters, and under the premise that the whole network is controllable, we can calculate the required driver nodes. As shown in Fig. 8.2, the red point is a driver node. This will be explained in detail in Sect. 8.2. In the area we want to analyze, the density, the velocity, the communication radius of the vehicles, the refresh time, and the control time we set will all affect the state of the entire network and hence the minimum number of driver nodes [17–19]. We want to find the relationship between the minimum number of driver nodes and these main parameters by varying one parameter at a time (the control variable method).


8.2 DND: Driver Node Algorithm

8.2.1 Parameters and Variable Declarations

In this section, we define the driver node problem in the vehicle network in order to obtain the minimum number of driver nodes. Table 8.1 defines the symbols used in the problem definition and in the description that follows.

8.2.2 Modeling

Set a coordinate system XoY for the calculated range. Device nodes are randomly generated in the whole region according to the normal distribution X ~ N(P, 1), where the coordinates of node i are:

    $x_i = \mathrm{random}[0, X)$    (8.1)

    $y_i = \mathrm{random}[0, Y)$    (8.2)

Table 8.1 Notation definitions

  Notation     Meaning
  X            The length of the calculated range
  Y            The width of the calculated range
  N            The number of nodes in the calculated range
  N_D          Number of driver (control) nodes in the calculated range
  P            Device density in every Y*Y area of the calculated range
  R            Communication radius of the device
  (x_i, y_i)   Coordinates of device i
  V_i^x        The speed of device i in the X direction
  V_i^y        The speed of device i in the Y direction
  T            Control time
  t            Refresh time
  A            0-1 adjacency matrix of the graph; A_ij = 1 means that devices i and j are connected by an edge, and A_ij = 0 means that there is no edge between devices i and j


Calculate the adjacency matrix according to the generated graph:

    $A = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & A_{ij} & \vdots \\ A_{N1} & \cdots & A_{NN} \end{bmatrix}, \quad 1 \le i, j \le N$    (8.3)

and note that:

    $A_{ij} = \begin{cases} 1, & \| n_i, n_j \|_2 \le R \\ 0, & \| n_i, n_j \|_2 > R \end{cases}$    (8.4)

    $\| n_i, n_j \|_2 = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$    (8.5)

Coordinate update of a node in a unit interval:

    $(x_i, y_i) = \begin{cases} (x_i + V_i^x,\ y_i + V_i^y), & x_i + V_i^x \le X,\ y_i + V_i^y \le Y \\ (X,\ y_i + V_i^y), & x_i + V_i^x > X,\ y_i + V_i^y \le Y \\ (x_i + V_i^x,\ Y), & x_i + V_i^x \le X,\ y_i + V_i^y > Y \\ (X,\ Y), & x_i + V_i^x > X,\ y_i + V_i^y > Y \end{cases}$    (8.6)

Node velocity update:

    $V_i^x = \begin{cases} -V_i^x, & x_i = X \\ V_i^x, & x_i < X \end{cases}$    (8.7)

    $V_i^y = \begin{cases} -V_i^y, & y_i = Y \\ V_i^y, & y_i < Y \end{cases}$    (8.8)
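A compact NumPy sketch of Eqs. (8.6)–(8.8) is shown below; it handles only the upper boundaries X and Y, exactly as the equations do (handling the lower boundary 0 would be symmetric), and the array layout is an implementation assumption.

```python
import numpy as np

def step(pos: np.ndarray, vel: np.ndarray, X: float, Y: float):
    """One unit-interval update of node positions per Eq. (8.6), with the
    velocity reflection of Eqs. (8.7)-(8.8). pos and vel are (N, 2) arrays."""
    pos, vel = pos + vel, vel.copy()
    hit_x = pos[:, 0] >= X            # nodes that reached the X boundary
    hit_y = pos[:, 1] >= Y            # nodes that reached the Y boundary
    pos[hit_x, 0], pos[hit_y, 1] = X, Y   # clamp to the boundary (Eq. 8.6)
    vel[hit_x, 0] *= -1                   # reverse X velocity (Eq. 8.7)
    vel[hit_y, 1] *= -1                   # reverse Y velocity (Eq. 8.8)
    return pos, vel
```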

The adjacency matrix after a refresh:

    $A_{ij}^{t+1} = \begin{cases} 0, & A_{ij}^{t} = A_{ij} = 0 \\ 1, & A_{ij}^{t} \neq 0 \ \text{or} \ A_{ij} \neq 0 \end{cases}$    (8.9)

where A is the adjacency matrix calculated from the node distances at the current moment and A^t is the state of the adjacency matrix at the previous moment. The number of driver nodes [10]:

    $N_D = \mu(\lambda^M)$    (8.10)


It should be noted that A is assumed to describe a connected graph. If the graph in the current state is not connected, we calculate each connected subgraph separately and finally sum the results over all connected subgraphs. A sketch of this computation is given below.
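The following NumPy sketch puts Eqs. (8.4)–(8.10) together: it builds the distance-based adjacency matrix, accumulates edges across refreshes, and counts driver nodes per connected component as the maximum eigenvalue multiplicity (the adjacency matrix of an undirected graph is symmetric, so algebraic and geometric multiplicities coincide). The numerical tolerance for grouping equal eigenvalues is an implementation assumption.

```python
import numpy as np

def adjacency(pos: np.ndarray, R: float) -> np.ndarray:
    """Eqs. (8.4)-(8.5): connect nodes whose Euclidean distance is at most R."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    A = (d <= R).astype(int)
    np.fill_diagonal(A, 0)
    return A

def accumulate(A_prev: np.ndarray, A_now: np.ndarray) -> np.ndarray:
    """Eq. (8.9): an edge is kept if it existed at the previous refresh or exists now."""
    return np.maximum(A_prev, A_now)

def connected_components(A: np.ndarray):
    """Simple DFS over the adjacency matrix, yielding index lists of components."""
    n, seen = len(A), set()
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in np.flatnonzero(A[u]):
                if v not in seen:
                    seen.add(int(v))
                    stack.append(int(v))
        yield comp

def num_driver_nodes(A: np.ndarray, tol: float = 1e-6) -> int:
    """Eq. (8.10): N_D = mu(lambda_M), summed over connected components."""
    total = 0
    for comp in connected_components(A):
        eig = np.sort(np.linalg.eigvalsh(A[np.ix_(comp, comp)].astype(float)))
        best = run = 1
        for a, b in zip(eig, eig[1:]):            # find the largest cluster of
            run = run + 1 if b - a < tol else 1   # (numerically) equal eigenvalues
            best = max(best, run)
        total += best
    return total
```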

8.2.3 Abstraction of Topology

In this section, we briefly introduce how to map a network topology into a matrix using a simple example, and we describe the process of calculating the required driver nodes based on this matrix. We take a connected subgraph from the link status diagram above as an example. Each node in the figure is given a serial number, as shown in Fig. 8.3. According to the topology, an adjacency matrix A can be obtained, as shown in Fig. 8.4. After calculation, the eigenvalue vector of the matrix is λ = [3.521, 2.284, 1.272, −0.846, −0.231, −0.618, 1.618]^T. The eigenvalues are all distinct, so the maximum algebraic multiplicity is μ(λ_M) = 1, and we select λ_M = 1.618. The number of driver nodes is now known, and we can proceed to find the driver nodes themselves as follows. We form the matrix B = A − λ_M E_N, shown in Fig. 8.5, where E_N is the identity matrix, and perform elementary column transformations on B to obtain the column canonical form shown in Fig. 8.6.

Fig. 8.3 Network topology

Fig. 8.4 Matrix A of network shown in Fig. 8.3


Fig. 8.5 Matrix B

Fig. 8.6 Column canonical form of B

The last row of the column canonical form is linearly dependent on the others. The node corresponding to it is the driver node, colored red in Fig. 8.3.
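A sketch of this identification step is shown below. It follows the chapter's procedure (form B = A − λ_M E_N and look for rows that are linearly dependent on the others) but uses a greedy numerical rank test instead of explicit elementary column transformations; the tolerance and the greedy order are implementation assumptions.

```python
import numpy as np

def driver_node_indices(A: np.ndarray, lam_M: float, tol: float = 1e-6):
    """Return the indices of driver nodes: rows of B = A - lam_M * I that are
    linearly dependent on the rows kept before them."""
    A = np.asarray(A, dtype=float)
    B = A - lam_M * np.eye(len(A))
    kept, drivers = [], []
    for i, row in enumerate(B):
        candidate = np.array(kept + [row])
        if np.linalg.matrix_rank(candidate, tol=tol) == len(candidate):
            kept.append(row)        # row is independent of those kept so far
        else:
            drivers.append(i)       # dependent row -> this node must be driven
    return drivers
```

Applied to the 7-node example with λ_M = 1.618, this should flag a single node, consistent with μ(λ_M) = 1.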

8.3 DND Experiment

The experiment sets up a scene of a road Y meters wide and X meters long. According to the width of the road, the entire road is divided into X/Y segments. Nodes in each segment are generated according to the density and obey a Poisson distribution. In each segment, the position (x_i, y_i) of each node is random, and each node is assigned a velocity V_i^x in the X direction and V_i^y in the Y direction; the magnitude of the velocity is chosen randomly as V^x ± 2 or V^y ± 2. Both the communication radius and the control time are set according to the experimental requirements. Node positions are periodically updated according to their velocities. When a node reaches the boundary of the region, its movement is reversed at the next moment, so that the number of nodes in the entire region does not change. According to the model, we compare experimental results obtained by assigning different values to the node speed, communication radius, density, and control time. To eliminate the randomness caused by randomly generated nodes, the same random nodes are used for different parameter assignments when the density is fixed. Moreover, the driver nodes are calculated 100 times for each assignment, and the average value is taken as the final result. A sketch of this evaluation loop is given below.
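The sketch below illustrates the evaluation loop for one parameter (the communication radius); it reuses adjacency() and num_driver_nodes() from the sketch in Sect. 8.2, fixes the random node layouts so that all radii are evaluated on the same nodes, and averages over repeated runs. The uniform layout generator and the default values are placeholders, not the book's exact settings.

```python
import numpy as np

def sweep_radius(radii, X, Y, N, runs: int = 100, seed: int = 0):
    """Average driver-node count per communication radius over `runs` layouts,
    reusing the same layouts for every radius so results are comparable."""
    rng = np.random.default_rng(seed)
    layouts = [rng.uniform([0.0, 0.0], [X, Y], size=(N, 2)) for _ in range(runs)]
    return {R: float(np.mean([num_driver_nodes(adjacency(p, R)) for p in layouts]))
            for R in radii}
```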


Fig. 8.7 Relationship between number of driver nodes and communication radius

8.3.1 Communication Radius

When only considering the state at a certain time, i.e., without considering the impact of refreshes and node movement, the relationship between the number of driver nodes and the node communication radius is shown in Fig. 8.7. The abscissa is the communication radius, and the ordinate represents the number of required driver nodes; the density is given per 90 m² area. Five curves with different colors represent different node densities. Note that at an abscissa value of 10, fewer driver nodes are required under the blue curve (density 3) than under the red curve (density 5) or the yellow curve (density 8). The graph shows that as the communication radius increases, the number of driver nodes needed in the whole network decreases and tends to the minimum of 1. The reason for the overall downward trend is that, as the communication radius increases, each node can establish connections with more nodes, so the number of edges in the whole network increases. Under different densities, the number of connections grows at different rates, and the curves therefore decline at different rates; the higher the density, the faster the number of connections grows.

8.3.2 Node Density

Similarly, with all other conditions the same as above and considering only the static state at a certain time point, the relationship between the number of driver nodes and the node density is shown in Fig. 8.8. Five curves with different colors represent different node communication radii. The number of driver nodes decreases as the density increases and tends to a minimum of 1. Moreover, under different communication radii, the number of driver nodes declines at different rates as the node density increases.


Fig. 8.8 Relationship between number of driver nodes and density

Fig. 8.9 Relationship between number of driver nodes and velocity

The larger the communication radius, the faster the count decreases to the minimum value. This is because, for the same increase in node density, the number of new connections each node gains is proportional to the square of the radius; therefore, the larger the communication radius, the faster the curve approaches the minimum value.

8.3.3 Node Velocity

Taking node velocity as the control variable, the other parameters are combined into different cases. The resulting numbers of driver nodes are shown in Fig. 8.9. Six curves with different colors correspond to different combinations of control time, node communication radius, and node density. According to the overall trend of the six curves, as the node velocity increases, the number of required driver nodes shows a downward trend and ultimately tends to the minimum value of 1. This is because, as the velocity increases, the refresh frequency increases, and so the connectivity accumulated by the whole network within the same control time increases; the effect is even more pronounced for longer control times. As explained above, different curves decline at different rates.


Fig. 8.10 Relationship between number of driver nodes and control time

8.3.4 Control Time

Taking the control time as the control variable, the other parameters are combined into different situations. The resulting numbers of driver nodes are shown in Fig. 8.10. Six curves with different colors correspond to different combinations of node density, node communication radius, and node velocity. As the control time increases, the total number of refreshes also increases, which means the connectivity of the whole network increases; this makes the number of required driver nodes show a downward trend, finally tending to 1. The driver nodes required in the latter two experiments are far fewer than those in the former two experiments. This is because the influence of velocity and control time has been taken into account, which improves the connectivity of the whole graph compared with the former two experiments. Although the experimental data are simulated, they are generated according to realistic conditions; if actual road condition data can be obtained through the relevant agencies, real results can be produced. In any case, the relationships between the parameters obtained through the experiments are general.

8.4 Conclusion

The rapid movement of nodes in a dynamic network leads to rapid changes in the network structure, and the controllability of the network faces enormous challenges. We apply cybernetics and system theory to real network scenarios and accurately calculate the minimum number of driver nodes needed to keep the whole network controllable in the corresponding scenarios. We obtain the relationship between several main parameters and the minimum number of driver nodes in the network. Applying these conclusions to specific scenarios can greatly reduce deployment cost and improve the efficiency of the network.


This book completes only a part of the work in related fields; the deployment of specific access points and other application scenarios needs to be explored further.

References

1. Xiao, Z., Moore, C., Newman, M.E.J.: Random graph models for dynamic networks. Eur. Phys. J. B 90(10), 200 (2016)
2. Casteigts, A., Flocchini, P., Quattrociocchi, W., Santoro, N.: Time-varying graphs and dynamic networks. Int. J. Parallel Emergent Distrib. Syst. 27(5), 387–408 (2012)
3. Gerla, M., Lee, E.K., Pau, G., Lee, U.: Internet of vehicles: from intelligent grid to autonomous cars and vehicular clouds. In: Greengard, S. (ed.) Internet of Things. MIT Press, Cambridge (2016)
4. Alam, K.M., Saini, M., Saddik, A.E.: Toward social internet of vehicles: concept, architecture, and applications. IEEE Access 3, 343–357 (2015)
5. Kaiwartya, O., Abdullah, A.H., Cao, Y., Altameem, A., Liu, X.: Internet of vehicles: motivation, layered architecture, network model, challenges and future aspects. IEEE Access 4, 5356–5373 (2017)
6. Wang, W.X., Ni, X., Lai, Y.C., Grebogi, C.: Optimizing controllability of complex networks by minimum structural perturbations. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 85(2) Pt 2, 026115 (2012)
7. Francesco, S., Mario, D.B., Franco, G., Guanrong, C.: Controllability of complex networks via pinning. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 75(2), 046103 (2007)
8. Pasqualetti, F., Zampieri, S., Bullo, F.: Controllability metrics, limitations and algorithms for complex networks. IEEE Trans. Control Netw. Syst. 1(1), 40–52 (2014)
9. Cornelius, S.P., Kath, W.L., Motter, A.E.: Realistic control of network dynamics. Nat. Commun. 4(3), 1942 (2013)
10. Yuan, Z., Zhao, C., Di, Z., Wang, W.X., Lai, Y.C.: Exact controllability of complex networks. Nat. Commun. 4(2447), 2447 (2013)
11. Lombardi, A., Hörnquist, M.: Controllability analysis of networks. Phys. Rev. E 75(5) Pt 2, 056110 (2007)
12. Mauve, M., Vogel, J., Hilt, V., Effelsberg, W.: Local-lag and timewarp: providing consistency for replicated continuous applications. IEEE Trans. Multimedia 6(1), 47–57 (2004)
13. Wang, H., Shea, R., Ma, X., Wang, F., Liu, J.: On design and performance of cloud-based distributed interactive applications. In: 2014 IEEE 22nd International Conference on Network Protocols (ICNP), pp. 37–46. IEEE (2014)
14. Pujol, E., Richter, P., Chandrasekaran, B., Smaragdakis, G., Feldmann, A., Maggs, B.M., Ng, K.-C.: Back-office web traffic on the internet. In: Proceedings of the 2014 Conference on Internet Measurement Conference, pp. 257–270. ACM (2014)
15. Zaki, Y., Chen, J., Potsch, T., Ahmad, T., Subramanian, L.: Dissecting web latency in Ghana. In: Proceedings of the 2014 Conference on Internet Measurement Conference, pp. 241–248. ACM (2014)
16. Yue, K., Wang, X.-L., Zhou, A.-Y., et al.: Underlying techniques for web services: a survey. J. Softw. 15(3), 428–442 (2004)
17. Li, X., Wang, X., Wan, P.-J., Han, Z., Leung, V.C.: Hierarchical edge caching in device-to-device aided mobile networks: modeling, optimization, and design. IEEE J. Sel. Areas Commun. 36(8), 1768–1785 (2018)
18. Sadeghi, A., Sheikholeslami, F., Giannakis, G.B.: Optimal and scalable caching for 5G using reinforcement learning of space-time popularities. IEEE J. Sel. Top. Signal Process. 12(1), 180–190 (2018)
19. Mao, H., Netravali, R., Alizadeh, M.: Neural adaptive video streaming with Pensieve. In: Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 197–210. ACM (2017)