Edge Intelligence: From Theory to Practice (ISBN 3031221540, 9783031221545)

This graduate-level textbook is ideally suited for lecturing the most relevant topics of edge computing and its ties to artificial intelligence (AI) and machine learning (ML) approaches.


Table of contents:
Preface
Acknowledgments
Contents
1 Distributed Computing Continuum Systems
1.1 Introduction
1.2 Related Work
1.3 System Management in the Cartesian Space
1.3.1 Cartesian Blanket
1.3.2 Computing Continuum Characteristics
1.3.3 Current Issues and Challenges
1.3.3.1 Reactive Management System
1.3.3.2 System's Stability Is Linked to the Infrastructure
1.3.3.3 Unknown System Derivatives
1.3.3.4 Lack of Causality Relations
1.4 System Management in the Markovian Space
1.4.1 Vision
1.4.1.1 System State
1.4.1.2 Markov Blanket
1.4.1.3 Equilibrium
1.4.1.4 Adaptation
1.4.1.5 An Illustrative Example
1.4.2 Learning
1.4.2.1 Design Phase Learning
1.4.2.2 Runtime Phase Learning
1.5 Use Case Interpretation
1.5.1 Application Description
1.5.2 SLOs as Application Requirements
1.5.3 Developing the DAG
1.6 Conclusion
References
2 Containerized Edge Computing Platforms
2.1 Containers vs. Virtual Machines
2.2 Container Engines
2.2.1 Docker
2.2.2 Podman
2.2.3 LXD
2.3 Container Orchestration Platforms
2.3.1 Self-Hosted vs. Managed Container Orchestration
2.3.2 Well-Known Container Orchestration Platforms
2.4 Kubernetes
2.4.1 Kubernetes Cluster
2.4.1.1 Control Plane Components
2.4.1.2 Node Components
2.4.2 Kubernetes Objects and Resource Types
2.4.3 Container Interfaces
2.4.3.1 Container Runtime Interface (CRI)
2.4.3.2 Container Network Interface (CNI)
2.4.3.3 Container Storage Interface (CSI)
2.4.4 Accessing and Managing Kubernetes Resources
2.4.4.1 Kubernetes Kubectl
2.4.4.2 Kubernetes Dashboard
2.5 Kubernetes SDKs
2.5.1 Kubernetes Python Client
2.5.1.1 Installing the Python Client
2.5.1.2 Using the Python Client
2.5.2 Kubernetes Java Client
2.5.2.1 Installing the Java Client
2.5.2.2 Using the Java Client
2.6 Summary
References
3 AI/ML for Service Life Cycle at Edge
3.1 Introduction
3.1.1 State of the Art
3.1.1.1 Wireless Networking
3.1.1.2 Service Placement and Caching
3.1.1.3 Computation Offloading
3.1.2 Grand Challenges
3.2 AI/ML for Service Deployment
3.2.1 Motivation Scenarios
3.2.1.1 The Heterogeneous Network
3.2.1.2 Response Time of Micro-Services
3.2.1.3 A Working Example
3.2.2 System Model
3.2.2.1 Describing the Correlated Micro-Services
3.2.2.2 Calculating the Response Time
3.2.3 Problem Formulation
3.2.4 Algorithm Design
3.2.4.1 Variables
3.2.4.2 The SAA-RP Framework
3.2.4.3 The GASS Algorithm
3.3 AI/ML for Running Services
3.3.1 System Description and Model
3.3.2 Algorithm Design
3.3.3 RL-Based Approach
3.4 AI/ML for Service Operation and Management
3.4.1 System Model
3.4.2 Problem Analysis
3.4.3 Dispatching with Routing Search
3.4.4 Scheduling with Online Policy
3.5 Summary
References
4 AI/ML for Computation Offloading
4.1 Introduction
4.2 AI/ML Optimizes Task Offloading in the Binary Mode
4.2.1 System Model
4.2.1.1 Local Execution Latency Evaluation
4.2.1.2 Task Offloading Latency
4.2.1.3 Battery Energy Consumption
4.2.1.4 Problem Formulation
4.2.2 Cross-Edge Computation Offloading Framework
4.3 AI/ML Optimizes Task Offloading in the Partial Mode
4.3.1 System Model and Overheads
4.3.1.1 System Model
4.3.1.2 Overheads
4.3.2 Problem Formulation
4.3.3 Solution
4.3.3.1 Allocation of CPU Frequency and Power
4.3.3.2 Solution of Offloading Policy
4.3.3.3 Algorithm Analysis
4.4 AI/ML Optimizes Complex Jobs
4.4.1 System Model and Problem Formulation
4.4.1.1 A Working Example
4.4.1.2 Problem Formulation
4.4.2 Algorithm Design
4.4.2.1 Finding Optimal Substructure
4.4.2.2 Optimal Data Splitting
4.4.2.3 Dynamic Programming-Based Embedding
4.5 Summary
References
5 AI/ML Data Pipelines for Edge-Cloud Architectures
5.1 Introduction
5.2 State-of-the-Art Stream Processing Solutions for Edge-Cloud Architectures
5.3 Data Pipeline in Existing Platforms
5.4 Critical Challenges for Data Pipeline Solutions
5.5 MapReduce
5.5.1 Limitations of MapReduce
5.5.2 Beyond MapReduce
5.6 NoSQL Data Storage Systems
5.6.1 Apache Cassandra
5.6.2 Apache Flink
5.6.2.1 Flink Connectors
5.6.2.2 Flink Architecture
5.6.2.3 Flink Deployment Plan
5.6.3 Apache Storm
5.6.3.1 Storm Concepts
5.6.3.2 Storm Deployment Architecture
5.6.4 Apache Spark
5.6.4.1 Spark Architecture
5.6.4.2 Spark Execution Engine
5.7 Conclusion
References
6 AI/ML on Edge
6.1 Introduction
6.2 System Overflow
6.2.1 Caching on the Edge
6.2.2 Training on the Edge
6.2.3 Inference on the Edge
6.2.4 Offloading on the Edge
6.3 Edge Training
6.3.1 Architecture
6.3.2 Training Optimization
6.3.3 Federated Learning
6.4 Edge Inference
6.4.1 Model Design
6.4.2 Model Compression
6.4.2.1 Network Pruning
6.4.2.2 Quantization
6.4.2.3 Knowledge Distillation
6.5 Summary
References
7 AI/ML for Service-Level Objectives
7.1 SLO Script: A Language to Implement Complex Elasticity-Driven SLOs
7.1.1 SLOs and Elasticity
7.1.2 Motivation
7.1.3 Research Challenges
7.1.4 Language Requirements Overview
7.1.5 SLO Script Language Design and Main Abstractions
7.1.5.1 SLO Script Overview and Language Meta-Model
7.1.5.2 StronglyTypedSLO
7.1.5.3 Strongly Typed Metrics API
7.1.5.4 SLOC Object Model
7.2 A Middleware for SLO Script
7.2.1 Research Challenges
7.2.2 Framework Overview
7.2.2.1 Architecture
7.2.2.2 SLOC CLI
7.2.3 Mechanisms
7.2.3.1 Orchestrator-Independent SLO Controller
7.2.3.2 Provider-Independent SLO Metrics Collection and Processing Mechanism
7.2.4 Implementation
7.2.4.1 Orchestrator-Independent SLO Controller
7.2.4.2 Provider-Independent SLO Metrics Collection and Processing Mechanism
7.3 Evaluation
7.3.1 Demo Application Setup
7.3.2 Qualitative Evaluation
7.3.3 Performance Evaluation
7.4 Summary
References


Javid Taheri · Schahram Dustdar · Albert Zomaya · Shuiguang Deng

Edge Intelligence: From Theory to Practice


Javid Taheri Department of Computer Science Karlstad University Karlstad, Sweden

Schahram Dustdar Distributed Systems Group TU Wien Vienna, Austria

Albert Zomaya School of Computer Science University of Sydney Sydney, NSW, Australia

Shuiguang Deng College of Computer Science & Technology Zhejiang University Hangzhou, China

ISBN 978-3-031-22154-5    ISBN 978-3-031-22155-2 (eBook)
https://doi.org/10.1007/978-3-031-22155-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my mother, Fakhri, for her true love and support throughout my life, and to my father, Javad, for his pure heart, unconditional love, and being my moral compass in life.
Javid Taheri

To all students of Computer Science who aim at contributing to the improvement of society using technology.
Schahram Dustdar

To the memory of my mother, Elizabeth, who always helped me to see the hope inside myself. I am eternally grateful.
Albert Zomaya

To all my students who made contributions to this book.
Shuiguang Deng

Preface

Edge Intelligence: From Theory to Practice is a reference book suitable for lecturing the most relevant topics of edge computing and its ties to artificial intelligence (AI) and machine learning (ML) approaches. The book starts from the basics and gradually advances, step by step, to the ways AI/ML concepts can help or benefit from edge computing platforms. Using practical labs, each topic is brought down to earth so that students can practice their learned knowledge with industry-approved software packages.

The book is structured into seven chapters; each comes with its own dedicated set of teaching materials (practical skills, demonstration videos, questions, lab assignments, etc.). Chapter 1 opens the book and comprehensively introduces the concept of distributed computing continuum systems that led to the creation of edge computing. Chapter 2 motivates the use of container technologies and shows how they are used to implement programmable edge computing platforms. Chapter 3 introduces ways to employ AI/ML approaches to optimize service life cycles at the edge. Chapter 4 goes deeper into the use of AI/ML and introduces ways to optimize the spreading of computational tasks across edge computing platforms. Chapter 5 introduces AI/ML pipelines to efficiently process the data generated on the edge. Chapter 6 introduces ways to implement AI/ML systems on the edge and to deal with their training and inferencing procedures considering the limited resources available at edge nodes. Chapter 7 motivates the creation of a new orchestrator-independent object model to describe objects (nodes, applications, etc.) and requirements (SLAs) for underlying edge platforms.

To provide hands-on experience to students and to improve their technical capabilities step by step, seven sets of Tutorials-and-Labs (TaLs) are also designed. Codes and instructions for each TaL are provided on the book website and accompanied by videos to facilitate the learning process. TaL 1 shows how to install basic software packages (VirtualBox, Visual Studio Code) and programming environments (Node.js) and write sample programs (e.g., a "Hello, World" program) in preparation for all future TaLs. TaL 2 shows how to install a stand-alone Kubernetes (K8) platform and how to containerize a sample Node.js application, as well as how to deploy it on a K8 platform and consume its provided services. TaL 3 shows how to profile containers and develop an external auto-scaler to scale up/down K8 services. TaL 4 describes placement algorithms and how they can be implemented for K8 platforms, as well as how to develop, containerize, and deploy a placement solver. TaL 5 shows how to build containers to emulate the behavior of various components of distributed computing continuum platforms (sensors, edge nodes, and cloud servers), as well as how to develop a basic fault detector to collect historic data from sensors and identify their faulty readings in the offline mode. TaL 6 shows how to develop, containerize, and deploy an ML algorithm on the edge nodes to detect faults in the online mode. TaL 7 demonstrates how to build a configurable service-level objective (SLO) controller for Kubernetes and trigger an elasticity strategy upon violation of the SLO.

The book comes with a website to help lecturers prepare the teaching materials and students access the provided videos, instructions, labs, etc. It can be reached using the following link: https://sites.google.com/view/edge-intelligence-book.

Javid Taheri, Karlstad, Sweden
Schahram Dustdar, Vienna, Austria
Albert Zomaya, Sydney, NSW, Australia
Shuiguang Deng, Hangzhou, China

October 2022

Acknowledgments

We would like to express our thanks and deepest appreciation for the many comments and suggestions that we have received from several of our colleagues during various stages of the writing of this book. Particularly, we would like to thank Dr. Muhammad Usman from Karlstad University (Sweden), Dr. Thomas Pusztai and Dr. Praveen Kumar Donta from Vienna University of Technology (Austria), Dr. Wei Li, Dr. Yucen Nan and Dr. Mohammadreza Hoseinyfarahabady from the University of Sydney (Australia), and Dr. Hailiang Zhao, Dr. Cheng Zhang, Dr. Yishan Chen and Dr. Haowei Chen from Zhejiang University (China).



Chapter 1

Distributed Computing Continuum Systems

Abstract This chapter presents our vision of the need to develop new management technologies to harness distributed computing continuum systems. These systems are concurrently executed on multiple computing tiers: cloud, fog, edge, and IoT. This simple idea raises many challenges due to the inherent complexity of the underlying infrastructures, to the extent that current methodologies for managing Internet-distributed systems are no longer appropriate for them. To this end, we present a new methodology, based on a mathematical artifact called the Markov blanket, to cope with the complex characteristics of distributed computing continuum systems. We also develop the concept of equilibrium for these systems, providing a more flexible management framework compared with the currently used threshold-based ones. Because developing the entire methodology requires a great effort, we finish the chapter with an overview of the techniques required to develop it.

This chapter reuses literal text and materials from S. Dustdar et al., "On distributed computing continuum systems," IEEE Transactions on Knowledge and Data Engineering, https://doi.org/10.1109/TKDE.2022.3142856. ©2022 IEEE, reprinted with permission.

1.1 Introduction

The last decade has been dominated by applications based on cloud infrastructures; however, in recent years, we have witnessed the appearance of new computing tiers, such as the edge and the fog. These tiers provide valuable features to applications, such as very low latency or privacy enhancements. Ultimately, new applications are emerging that take advantage of all the available computing tiers; hence, they are systems that are simultaneously executed on the edge, fog, and cloud computing tiers. These systems are known as "computing continuum" systems.

The applications provided by distributed computing continuum systems enable developments that have belonged to the infrastructure landscape for some years now, such as smart cities with greener and more sustainable spaces, non-polluted avenues with autonomous vehicles, sustainable and efficient manufacturing with accurate traceability of products, or health systems able to personalize anyone's treatment independently of their current location.

Most research efforts have been devoted to solving specific challenges of these systems, such as data caching, resource allocation, or scheduling, among others. However, general methodologies for the design and management of these systems remain one of the fundamental challenges. This has been overlooked because the community has been using traditional methods for design and management that were developed for the first Internet-based systems, that is, those comprising a server and a client, which are completely specified through the software application itself. In general, these systems are developed from a top-down perspective; hence, each system's architecture is defined ad hoc to solve a precise problem. This methodology is still valid for the cloud paradigm, but it falls short for the computing continuum paradigm. For instance, the concept of developing an architecture for a system needs to be re-thought for computing continuum systems: one can find a multitude of different architectures within the same system. Furthermore, in some situations, the desired architecture will not be possible to implement; from a management perspective, the mechanisms developed need to be able to deal with situations where the architecture is just another dynamic feature of the system.

We claim that new methodologies are required because computing continuum systems have fundamental characteristics that make previous methods inadequate: the application and the underlying software infrastructure are seamlessly blended. Therefore, the characteristics of the system are driven by its infrastructure and are similar to those of complex systems; for example, as cited in [1], a system is complex if its behavior crucially depends on the details of the system. We foresee a methodology that takes into account the shared characteristics of all such systems and changes the role of the underlying infrastructure's software. Hence, with this change of mindset and by leveraging each system's infrastructure data, we will be able to manage these systems, obtaining robust, adaptive, and cooperative systems.

The objectives of this chapter are twofold: first, to present a general methodology derived from cloud computing and show its limitations in the computing continuum context and, second, to motivate that a completely new approach is required to design and manage these systems, sketching our vision of how we foresee this new methodology.

The rest of this chapter is organized as follows. In Sect. 1.2, we show related work for managing cloud systems; in Sect. 1.3, we explain the reasons why we need a new approach for managing computing continuum systems. Section 1.4 details our vision for the management approach and outlines the learning techniques that will be required to develop it. Finally, in Sect. 1.6, we present our conclusions.


1.2 Related Work

In recent years, research efforts have focused on developing the edge and fog tiers. Through a review of the literature, one can find three different approaches. First, there is research targeting specific topics of these tiers, such as data caching [2, 3], service deployment [4], computation offloading [5, 6], resource allocation [7], scheduling [8, 9], traffic grooming [10], or service decomposition [11], among others. Second, there is research that addresses these specific topics but contextualizes them in a specific domain. For example, in [12], a QoE- and energy-aware methodology for offloading is contextualized in a smart city, and in [13], a scheduling solution is presented in a smart manufacturing context. Similarly, in [14], scheduling is taken to a healthcare application. Third, there is research that focuses on general architectures for specific contexts, such as [15], where a specific fog architecture for banking applications is described, or [16], where architectures for a healthcare system are presented. But, to the best of our knowledge, there is a lack of transversal and general methodologies for managing "computing continuum" systems. Specifically, transversal methodologies apply to any application domain, and general methodologies are able to solve any specific issue.

In cloud computing, which has been under research for a couple of decades now, there is a methodology of the kind we claim is required for computing continuum systems. In this regard, the work on elasticity [17–19] develops a methodology for managing cloud systems that matches the needs of computing continuum systems in terms of generality and transversality. It starts by abstracting any application with three variables: Resources, Quality, and Cost. The system then keeps track of the values of these variables, and if any of them goes beyond a certain threshold, elasticity strategies are applied. Simply put, these change the application configuration in the cloud infrastructure so that its current needs are matched with the expected performance, and therefore the Resources, Quality, and Cost variables return to their expected range of values. It is worth mentioning that in [20], some difficulties are already identified when dealing with multi-cloud systems.

The evolution of cloud computing is now going toward serverless computing [21]; this basically implies that the thresholds, commonly known as service-level objectives (SLOs), that control the application state require a higher level of abstraction. This is more convenient for cloud users, as they no longer need to determine low-level specifications of the application, such as the range of CPU usage, but higher-level ones, such as the ratio between the cost of the infrastructure and the efficiency of the application. This is called cost-efficiency, as developed in [22–24], which proposes a cloud management system completely controlled through high-level SLOs. A similar approach is required to manage computing continuum systems; however, the characteristics of the underlying infrastructure of such systems make this cloud computing approach inadequate.
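This threshold-based elasticity control can be summarized in a short sketch. The following Python fragment is only an illustration: the metric source (read_state), the scaling hook (apply_elasticity_strategy), and the concrete SLO ranges are hypothetical placeholders rather than any specific cloud provider's API.

```python
from dataclasses import dataclass

@dataclass
class Range:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high

# High-level SLOs expressed as lower/upper bounds (illustrative values only).
SLOS = {
    "resources": Range(0.2, 0.8),    # e.g., fraction of provisioned capacity in use
    "quality":   Range(0.9, 1.0),    # e.g., fraction of requests served within deadline
    "cost":      Range(10.0, 50.0),  # e.g., monetary cost per hour
}

def read_state() -> dict:
    """Placeholder for real monitoring; returns the current (R, Q, C) values."""
    return {"resources": 0.85, "quality": 0.95, "cost": 32.0}

def apply_elasticity_strategy(variable: str, value: float, bounds: Range) -> None:
    """Placeholder for a real reconfiguration, e.g., horizontal or vertical scaling."""
    direction = "up" if value > bounds.high else "down"
    print(f"{variable} out of range ({value}); scaling {direction}")

def control_loop_iteration() -> None:
    # If any variable leaves its expected range, trigger an elasticity strategy.
    state = read_state()
    for variable, bounds in SLOS.items():
        if not bounds.contains(state[variable]):
            apply_elasticity_strategy(variable, state[variable], bounds)

control_loop_iteration()
```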


Before continuing, it is worth detailing two concepts that will recur throughout the chapter. A computing continuum application refers only to an application developed on top of a computing continuum fabric of resources; this can range from a video analysis application to an entire vehicle fleet management system for a smart city. A computing continuum system refers to all the resources required to enable an application of the computing continuum; this includes the application itself, but just as another component of the system.

1.3 System Management in the Cartesian Space

In this section, we discuss how the cloud computing management approach could be applied to computing continuum systems. Then, by reasoning about the characteristics of these emerging systems, we show why this methodology is not sufficient.

1.3.1 Cartesian Blanket

Similarly to what is done with the cloud, the system space of a computing continuum application can be represented with three variables: Resources, Quality, and Cost. Hence, it can be represented in a three-dimensional Cartesian space. Cloud-based systems can usually be abstracted as a virtually unlimited and homogeneous source of resources; this simplifies the system's Cartesian representation using Resources, Quality, and Cost. In contrast, given the heterogeneity of computing continuum systems, their encoding in this space is not straightforward. Nevertheless, we can assume that, given a domain, this Cartesian frame could be shared. Simply put, we foresee that healthcare applications can share the same quality axis, as it can be interpreted similarly for all cases. But this might not be the case for an application devoted to controlling product distribution. In any case, it is conceivable to develop transformations between Cartesian frames to relate applications from different contexts. Ideally, formalizing relations between systems from different domains could allow the further development of transversal methodologies seeking, for instance, cooperation between systems.

It is also worth noting that cloud systems use SLOs to check whether the system is performing as expected. In this regard, only high-level SLOs can be represented in the Cartesian space, as it is a high-level abstraction of the system; low-level SLOs cannot be represented. In general, high-level SLOs will be expressed as lower and upper boundaries for each of the axes. For instance, two limits on the Cost axis represent the range in Cost that a computing continuum system can assume. One could argue that Cost should only have an upper limit. However, it is important to take into account that simply using infrastructure has an associated Cost, and being below a threshold could imply that some required infrastructure component is not being properly used; hence, the system performance could be endangered. Visually, this develops a hexahedron in the Cartesian space that represents the available configurations for the system state. Within that space, the system is operating with its specified characteristics.

Unfortunately, this is not entirely true. As previously mentioned, computing continuum systems do not have an unlimited pool of homogeneous infrastructure resources, as is usually assumed for the cloud. Therefore, their operative system space is linked to the actual system's infrastructure. In other words, the underlying infrastructure of the system has to be represented in the Cartesian space to understand the possible configurations for the system's space. Infrastructure components have fixed characteristics of Resources, Quality, and Cost; they are represented as points in that space. At this point, one could argue that the same infrastructure can have different characteristics, and this is true for the cloud, where we can have virtual machines with different specifications depending on the need, making the space fully continuous. But as the infrastructure approaches the edge, the resources are limited; therefore, we assume that using an infrastructure component implies using all its capabilities. The system's infrastructure can then be added as points inside the Cartesian space. Certainly, infrastructure that lies outside the hexahedron will not be adequate for the computing continuum system. Therefore, the possible configurations for the system state space can be visually interpreted as a stretched blanket linked to the infrastructure points and limited by its specification hexahedron, as can be seen in Fig. 1.1. Finally, an elasticity strategy, which can be understood as a reconfiguration of the system's underlying infrastructure, is put in place if the system state is outside of its SLO-defined hexahedron, to make the system return to its space.

Fig. 1.1 The figure shows the three axes of the Cartesian space: Resources, Quality, and Cost. For each of them, a pair of thresholds limits the system state from a requirement perspective. On top, the hexahedron generated from the thresholds is sketched; inside, the Cartesian blanket is drawn by linking the points that represent the system's underlying infrastructure. © 2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022) [25]
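As a rough illustration of the Cartesian representation described above, the sketch below encodes the SLO hexahedron as lower/upper bounds per axis and keeps only the infrastructure points that can anchor the blanket. The component list, the bounds, and the units are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Point:
    """A (Resources, Quality, Cost) coordinate in the Cartesian space."""
    resources: float
    quality: float
    cost: float

@dataclass
class Hexahedron:
    """Lower/upper SLO bounds on each axis, spanning the admissible region."""
    r: Tuple[float, float]
    q: Tuple[float, float]
    c: Tuple[float, float]

    def contains(self, p: Point) -> bool:
        return (self.r[0] <= p.resources <= self.r[1]
                and self.q[0] <= p.quality <= self.q[1]
                and self.c[0] <= p.cost <= self.c[1])

# Illustrative SLO bounds (arbitrary units) and infrastructure points.
slo_box = Hexahedron(r=(2.0, 16.0), q=(0.8, 1.0), c=(1.0, 20.0))
infrastructure: List[Point] = [
    Point(resources=4.0, quality=0.9, cost=5.0),     # e.g., an edge node
    Point(resources=32.0, quality=0.99, cost=40.0),  # e.g., a cloud VM outside the Cost bound
]

# The "Cartesian blanket" can only be anchored to points inside the hexahedron.
anchors = [p for p in infrastructure if slo_box.contains(p)]
print(f"{len(anchors)} of {len(infrastructure)} components can anchor the blanket")
```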

1.3.2 Computing Continuum Characteristics

In Sect. 1.3.1, it was already observed that computing continuum systems have different characteristics than cloud systems. In this regard, the characteristics of computing continuum systems are inherited from their underlying infrastructure, similarly to what happens with the cloud; but the latter can abstract its infrastructure as virtually unlimited, homogeneous, and centralized. This simplifies its analysis and allows a management system such as the one explained in Sect. 1.2. On the contrary, the infrastructure of computing continuum systems is large, diverse, and distributed. These systems can have applications that target an entire city; this means that the infrastructure is spread through the city, with components ranging from sensors or micro-controllers up to large computing units or robots.

Another characteristic of distributed computing continuum systems is that their performance depends on many interactions between their components. Therefore, the application pipeline is long and contains many ramifications. This, together with the size of the system, makes it fragile, as it becomes difficult to know which part of the pipeline might break first and how this issue could propagate. Furthermore, this impedes knowing with certainty how such an issue will affect the system's performance. Finally, these systems have to be considered open systems. This means that the system is influenced by environmental events. For example, network congestion or a sharp increase in user requests can occur unpredictably, as they can be the consequence of a crowd gathering due to an unrelated social event. Events like these can create a high level of uncertainty with respect to which components can stop servicing as expected and break the application's pipeline.

The combination of the described characteristics makes computing continuum systems behave similarly to complex systems; therefore, the methodology required to manage them has to take this into account, and this completely changes the paradigm with respect to previous Internet systems.


1.3.3 Current Issues and Challenges

The previous analysis of the system state representation, its management methodology, and the characteristics of these systems has exposed a set of issues that motivate the need for a new methodology for representing and managing computing continuum systems.

1.3.3.1 Reactive Management System

The first issue is that the management system is reactive: the corrective action is performed once the state of the system is outside the configured thresholds. This approach is not valid for most complex systems; in fact, their propensity for cascading errors makes the reactive action useless once derived errors arise. It could be argued that this can be solved by taking the action earlier, for example, by narrowing down the thresholds. But this has other implications: it constrains the state space of the system, which implies much less design flexibility in terms of Resources, Quality, and Cost. In summary, thresholds are not a suitable solution for computing continuum systems.

1.3.3.2 System's Stability Is Linked to the Infrastructure

One can also realize that once the state of the system is outside the Cartesian blanket stitched to the infrastructure points, the system is no longer under control, because at least one component of its state is not properly linked to its current infrastructure state. Hence, the stability of the system does not depend on the location of the system state with respect to the SLO frames, but with respect to its underlying infrastructure. This different behavior, with respect to cloud systems, stems from the fact that the infrastructure components are not as flexible as in a cloud-based system. In this regard, the relation of these systems with their underlying infrastructure needs to be further developed.

1.3.3.3 Unknown System Derivatives

Following the previous discussion, it can be argued that the managing action must be performed as soon as the state of the system slightly deviates from its Cartesian blanket space. However, this reveals another shortcoming of this representation. The Cartesian representation does not encode the system derivatives; in other words, it is a representation that does not provide information on the evolution of the system. The characteristics of the system do not allow us to define derivatives of the state in order to understand whether a small deviation is negligible or whether the state is actually going to deviate further. This is on top of the fact that these systems are not free from environmental noise, which must be accounted for to understand the nature of a deviation. Furthermore, in such a case, the performed actions need to be very carefully sized, as they could aggravate the situation. In summary, computing continuum systems require the sense of evolution to be included in their management mechanisms as well.

1.3.3.4 Lack of Causality Relations

This also relates to the last flaw identified in this representation: the lack of a notion of causality. The actions that can be taken in the cloud computing paradigm are basically either vertical or horizontal scaling, or sometimes a combination of both. However, the infrastructure of computing continuum systems requires more detailed actions, as they depend on each infrastructure component. Therefore, understanding the cause of a deviation in the system's state is fundamental to overcoming it, and this implies that a causality model is also required.

1.4 System Management in the Markovian Space

This section first develops our vision for the new approach to manage distributed computing continuum systems, and then it provides research lines with respect to learning techniques that can be used to fully develop the methodology.

1.4.1 Vision

Our vision for continuum systems is manifold; here is our description from different angles.

1.4.1.1 System State

Understanding the state of the system through high-level variables provides an abstraction that is essential to generalize any type of such systems. Furthermore, it facilitates communication between infrastructure providers and application developers. Therefore, our vision keeps the same idea of using Resources, Quality, and Cost as in cloud computing systems, so that Sys = (R, Q, C) (Fig. 1.2). Nevertheless, the heterogeneity of these systems presents a challenge, as already identified in [20]. As mentioned above, systems that share a domain or run similar applications are easier to relate by abstracting their common characteristics. But this is not the case when dealing with systems from different domains. For instance, developing the same abstraction to represent quality in a system that enables autonomous driving and in another used for managing city waste is a challenge. Therefore, our methodology will define the state of the system, i.e., its Resources, Quality, and Cost, depending on the application domain.

Fig. 1.2 The system state represented as a node, encoding its high-level variables: Resources, Quality, and Cost. ©2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)

The high-level definition of the system state limits its observability; therefore, a set of metrics, which are observable, has to be defined based on the application requirements to provide the link between the computing continuum resources and the system state. In other words, it is not feasible to observe the low-level resources of the computing continuum fabric to directly obtain the system state. Figure 1.3 shows how the low-level resources are aggregated and filtered into a set of metrics. Similarly, this set of metrics is related to the system state. Thus, depending on the application domain, the system state will depend more strongly on network latency aspects or on the efficient use of computational resources. It is worth noticing two details from Fig. 1.3: (1) the arrows keep the knowledge of causality relations, which are important to understand the reasons for the system's behavior; and (2) both the set of metrics and the relation of these metrics with the system state depend on the application requirements, whereas the relation between the computing continuum resources and the set of metrics does not.

Fig. 1.3 The figure shows how the system state is influenced by a set of metrics. Both the set of metrics and their level of influence mostly depend on the specified application requirements. The figure also shows how causal relations for the metric values can be connected with the computing continuum resources or, in other words, the underlying infrastructure. ©2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)

The methodology discussed in this chapter also requires the ability to act in order to provide the system with adaptive mechanisms. The system has to be able to change the configuration of its underlying infrastructure in order to overcome any possible issue encountered. Providing the system state with capabilities to directly act on the computing continuum resources is not feasible due to the large conceptual difference between the high-level representation of the system and the actual low-level adaptation of the underlying infrastructure. Therefore, a set of actions is needed to move from the higher level to the lower level, as seen in Fig. 1.4. This can be a set of different configurations that allows the system to run within its expected requirements. However, it might not be possible to describe all of these configurations at design time; therefore, the system requires the capacity to learn new configurations given the application constraints and the current state of the computing continuum resources. In any case, this will be further expanded in Sect. 1.4.1.4.
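The aggregation and filtering described above can be pictured with a small sketch. The readings, aggregation functions, and weights below are hypothetical and application-dependent; they only illustrate the chain from low-level resources to metrics to the high-level state Sys = (R, Q, C).

```python
from statistics import mean

# Hypothetical low-level readings gathered from the computing continuum fabric.
raw_readings = {
    "edge-node-1": {"cpu_util": 0.72, "latency_ms": 35.0, "energy_w": 4.1},
    "edge-node-2": {"cpu_util": 0.40, "latency_ms": 22.0, "energy_w": 3.2},
    "cloud-vm-1":  {"cpu_util": 0.55, "latency_ms": 110.0, "energy_w": 65.0},
}

def aggregate_metrics(readings: dict) -> dict:
    """Aggregate and filter low-level resources into a set of observable metrics."""
    return {
        "avg_cpu_util": mean(r["cpu_util"] for r in readings.values()),
        "avg_latency_ms": mean(r["latency_ms"] for r in readings.values()),
        "total_energy_w": sum(r["energy_w"] for r in readings.values()),
    }

def infer_state(metrics: dict) -> tuple:
    """Relate the metrics to the high-level state Sys = (R, Q, C).

    The weights below are application-dependent placeholders: a latency-sensitive
    application, for example, would weight avg_latency_ms more heavily in Quality.
    """
    resources = metrics["avg_cpu_util"]
    quality = max(0.0, 1.0 - metrics["avg_latency_ms"] / 200.0)
    cost = 0.1 * metrics["total_energy_w"]
    return resources, quality, cost

metrics = aggregate_metrics(raw_readings)
print("Sys = (R, Q, C) =", infer_state(metrics))
```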

1.4.1.2 Markov Blanket

It is now possible to provide a wider view of the system representation by merging the previous ideas together, as seen in Fig. 1.5. This gives the opportunity to introduce a concept, namely the Markov blanket (MB), that provides several key features to the managing methodology for emerging computing continuum systems. As defined in [26], the MB of a variable is the set of variables that provide enough information to infer its state. Formally, if the MB of the variable V consists of the variables X and Y, we can say that P(V | X, Y, Z) = P(V | X, Y). This definition implies that, without a direct observation of the variable V, it is possible to infer its state given the observation of its MB variables. This is a useful tool to infer the system state given that, as previously stated, it is not possible to obtain direct observations of it. Additionally, the MB is used as a causality filter in our methodology; that is, the granularity or detail of the observations used to infer the system state can be chosen, provided that the selected set of observations is meaningful to the system state. Hence, the set of metrics and actions can be selected in order to easily match the application requirements. Furthermore, it allows the creation of nested representations of the system. Therefore, besides providing a tool to choose granularity, it allows focusing on a precise application that is running on the entire system. If the application is managing the mobility of autonomous cars in a city, for example, a nested MB representation allows focusing on a smaller application, such as the visualization of the street cameras, in order to solve any issue there, while keeping clear track of the relations with the other parts of the system (thanks to the causality links encoded).

Fig. 1.4 The figure shows the relation of the system state with the set of actions. In this case, the set of actions needs to be learned, as well as the precise relation of the state with them, in order to develop a system with autonomous and adaptive capabilities. The figure also shows how the adaptation capabilities of the system influence the computing continuum resources. ©2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)

Fig. 1.5 This figure shows the complete perspective on the required representation of the system to apply the methodology depicted in this chapter. It is worth noticing the central gray space that sets the boundaries of the Markov blanket and defines the system state. ©2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)

It is important to remark that the MB variables include both the metrics and the actions, as seen in the gray area of Fig. 1.5. To understand the current state of the system, the metrics that relate to the underlying infrastructure are required, as well as the current configuration of the system, provided by the state of the action variables. Usually, the MB is represented as a directed acyclic graph (DAG), which provides a framework for using Bayesian inference; this can initially help to construct a graph to represent the system and later on to develop knowledge on the causes of an event and their propagation up- or downstream. Formally, the MB-DAG consists of the parents, the children, and the spouses of the central node. Figure 1.6 shows a more detailed representation of an MB for a computing continuum system. It can be seen how several metrics influence the system state and how the system state influences the action states; additionally, there can be direct links from metrics to actions and also from resources of the underlying infrastructure to actions. Therefore, building the DAG representation is not trivial and requires a proper definition of the components, as well as of their links, to keep track of causality effects between them. To that end, Sect. 1.4.2 will take advantage of the DAG representation to detail learning techniques that can be used to build it.

Fig. 1.6 The Markov blanket (MB) for the system state is represented as a directed acyclic graph (DAG). Each M represents an element from the set of metrics, each A represents an element from the set of actions, each R makes reference to a resource, and each C makes reference to a component in the underlying infrastructure of the system. ©2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)
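The MB-DAG structure lends itself to a compact illustration. The toy DAG below loosely mirrors Fig. 1.6 (metrics feeding the system state, the state driving actions, plus a few direct resource-to-action links); the edges are invented, and the helper simply collects parents, children, and spouses as defined above.

```python
# A small, illustrative DAG stored as parent -> children adjacency lists.
# Node names loosely follow Fig. 1.6: resources (R*), metrics (M*),
# the system state (Sys), and actions (A*); the exact edges are made up.
dag = {
    "R1": ["M1"], "R2": ["M2", "A2"],
    "M1": ["Sys"], "M2": ["Sys", "A1"],
    "Sys": ["A1", "A2"],
    "A1": [], "A2": [],
}

def parents(node: str) -> set:
    return {p for p, children in dag.items() if node in children}

def markov_blanket(node: str) -> set:
    """Parents, children, and spouses (other parents of the node's children)."""
    children = set(dag[node])
    spouses = {p for child in children for p in parents(child)} - {node}
    return parents(node) | children | spouses

# The system state can be inferred from its blanket alone (metrics and actions,
# plus any resource that directly influences an action).
print(sorted(markov_blanket("Sys")))  # ['A1', 'A2', 'M1', 'M2', 'R2']
```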

1.4.1.3

Equilibrium

The causality filter provided by the MB also offers a high-level perspective on the system's scope, in terms of the data, or information, it can use and the adaptation capabilities it can have. This scope draws a separation between the system and the environment, which brings in a new concept: the system's equilibrium. We can put some of the concepts explained so far into a visual metaphor: picture the system as a blanket on top of an underlying infrastructure to which it can be tied and over which it can be stretched. In this scenario, having the blanket properly stretched and attached to the underlying infrastructure means having the system in equilibrium. Once the underlying infrastructure changes due to its dynamic and complex behavior, the blanket becomes rippled or uneven, and an adaptation capability is needed to tie and stretch the blanket over other infrastructure resources. In this regard, we use the concept of equilibrium to quantify the need for the system to adapt: if the system is not in equilibrium, an adaptation is required; otherwise, it can stay as it is. Management decisions based on the system's equilibrium aim at freeing the system from threshold dependencies, providing a much more flexible framework.

Computing continuum systems do not have such an intrinsic equilibrium defined. However, this does not mean that it cannot be described and leveraged. In this sense, the equilibrium of a computing continuum system can be defined as an operation mode given by a specific infrastructure configuration. This implies that the equilibrium depends on the evolution of the metrics that define the system state and on the specific configuration of the computing continuum resources, both of which are captured by the MB nodes. Therefore, if relations are created between the MB nodes during graph construction, their derivatives will provide the meaningful information needed to describe the system's evolution and, consequently, an equilibrium that enables proactive rather than reactive behavior. It is worth emphasizing that, because this research is framed in a probabilistic approach, the system derivatives will be probability distributions over the variation of a random variable.

The equilibrium and the derivatives of the system bring another concept to the surface: the temporal evolution of the system, because the previous arguments would not make sense for static systems. Figure 1.7 shows the temporal dimension of the system. A change in the configuration of the computing continuum's underlying infrastructure (sensed through the system metrics) leads to a degradation of the equilibrium state of the application; therefore, a set of actions is activated to reconfigure the state of the underlying infrastructure and preserve the equilibrium. It is worth noticing that, given the complexity of these systems, the equilibrium of the system state can be found at different configurations, which leads to different equilibrium states or operation modes. Here, we assume that a different configuration of the underlying infrastructure will not assure that the operation mode of the system is maintained; it only assures the satisfaction of its requirements.
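To make the idea of quantifying the need to adapt more concrete, the following Python sketch (our own illustration, not a prescribed formula of the methodology) summarizes a metric's expected operating regime as a Gaussian and scores the departure of freshly observed samples from it with a Kullback-Leibler divergence; the metric, its parameters, and the numbers are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL divergence KL(P || Q) between two 1-D Gaussians."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

def equilibrium_score(observed_samples, expected_mu, expected_sigma):
    """Departure of an observed metric from its expected operating distribution.
    0 means the metric behaves exactly as expected; larger values signal a
    growing need to adapt, without committing to a fixed threshold."""
    mu_o = float(np.mean(observed_samples))
    sigma_o = float(np.std(observed_samples) + 1e-9)  # avoid zero variance
    return gaussian_kl(mu_o, sigma_o, expected_mu, expected_sigma)

# Example: response-time samples (ms) drifting away from the expected regime.
rng = np.random.default_rng(0)
steady = rng.normal(120.0, 10.0, size=200)    # close to equilibrium
drifting = rng.normal(180.0, 25.0, size=200)  # after an infrastructure change
print(equilibrium_score(steady, 120.0, 10.0))    # small value
print(equilibrium_score(drifting, 120.0, 10.0))  # much larger value
```

The score is returned as a continuous quantity rather than compared against a hard threshold, in line with the intention of leaving threshold-based management behind.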

1.4.1.4 Adaptation

Triggered by an equilibrium disturbance, an adaptation process needs to be performed. This adaptation process (a set of actions) aims at changing the system configuration, with respect to the underlying infrastructure, to reach a new operation mode for the system and recover the equilibrium of the system state. This process faces two main challenges: (1) knowing the configuration capabilities of the system to determine the complete set of actions and (2) selecting the best strategy, or set of actions, to adapt.

The first challenge raises a crucial issue for these systems, because the space of possible configurations grows with the system's size and complexity (it is generally very large) and may be partially unknown. We foresee two approaches to deal with it; both require a learning framework, knowing that it is not feasible to encode all possible configurations. The first option is to expose the system's low-level options for adaptation and, given some constraints, let the system learn how low-level adaptations map onto solving higher-level disturbances. This would allow the system to turn the initial set of low-level adaptations into higher-level ones, which could eventually lead to better decisions for the second challenge of selecting the best action. The second option is to develop high-level adaptation frameworks from partial representations of the system, using the nested capacity of the MB representation. This clearly reduces the space of possible configurations, which might make it possible to develop a significant list of high-level adaptations. However, it will never cover the entire space and might not expose adequate solutions for the overall system. In both options, the system must be provided with learning capabilities to develop its configuration space, or set of actions.

Furthermore, given the space of possible adaptations, the system must choose the best possible option, which is the second challenge faced in adaptation. In general, finding the best action to recover the system's equilibrium, given its complexity, the partially observable data, and the stochastic nature of several underlying phenomena, is already a challenge, and it is usually solved with heuristics. Moreover, because these systems are open and have many interconnected relations with the underlying infrastructure, finding a solution to this problem might also require a learning framework. In fact, the selected actions can affect the system from different perspectives and prevent the system state from reaching its equilibrium, which can lead to the need to develop larger policies until new equilibrium states are found. Figure 1.7 adds the learning framework to the previous schema of the system, taking the chosen reconfiguration of the system as input, and shows how learning can be achieved through an iterative process.

Interestingly, there is a principle from neuroscience called the free energy principle (FEP) [27], which explains the adaptive behavior of the brain and, in general, of any system that has adaptive capacities.

However, this generalization is disputed by the work in [28], which claims the principle is not that general. In any case, there are several similarities with our approach that are worth mentioning. The FEP already uses an MB to develop a representation of the brain and, furthermore, categorizes the MB nodes into two groups: the sensing and the acting. In our representation, the sensor nodes are metrics, and the active nodes are the capabilities of the system to modify the underlying infrastructure. The FEP states that, as an adaptive system, the brain acts as if it were minimizing the free energy of the system or, equivalently, maximizing the evidence lower bound (ELBO). In this regard, the free energy is the difference between an expected observation and the obtained observation. Two main challenges need to be addressed in order to determine whether the FEP also applies to the systems considered in this work. First, it would be required to quantify the relations between the MB nodes and the system state, which in the case of the metrics seems attainable (for the actions, this is still a very open issue). Second, our system must be able to maintain a representation of the environment, so that it can have an expected observation given an action. This is commonly known as a generative model of the environment, which is also a relevant challenge that still needs to be tackled.

Fig. 1.7 This figure presents a complete perspective on the system's representation. It emphasizes the temporal dimension of the system and how the configuration of the computing continuum resources follows this temporal dimension. Additionally, it presents the structure of the learning framework for the adaptation capabilities of the system, which resembles the typical framework for reinforcement learning. ©2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)

1.4.1.5 An Illustrative Example

The following presents an illustrative metaphor to better understand the previous methodologies. Imagine that the computing continuum application is simply about transporting containers across the ocean; here, the computing continuum system is a boat that carries the containers. For the sake of the metaphor, we assume that the containers cannot be fixed to the boat; thus, if the boat tilts too much and loses its "equilibrium," the containers will fall and be lost in the ocean. There are two main requirements for the boat to properly handle the transportation: reaching the destination harbor and not losing any container. The first requirement is achieved if the boat has sophisticated technologies to guide it, such as GPS or a compass, and the destination harbor is fixed within the navigation frame. The second requirement is more challenging, because the stability of the boat depends on the weather and sea conditions. Going back for a moment to the computing continuum, the first requirement corresponds to the application working regardless of the computing continuum's underlying infrastructure, as with ordinary Internet applications. The second requirement takes into account that the environment hosting the application is dynamic and complex. Therefore, to assess the stability of the containers, the boat has several inclinometers, sailors who observe the surrounding state of the ocean waves, and a report on the weather conditions of its current area for the next couple of hours. The causal relations behind these observations can be much deeper; for example, one could think that, to better understand or predict the tilt of the boat, it is necessary to observe ocean and atmosphere characteristics such as the temperature of the undersea currents or the position and trajectory of the closest squall. This could improve some prediction models, but it drastically increases the complexity of the system. A similar idea applies to the possible configurations of the boat to overcome a wind gust or a large wave: if the sailors are experienced enough, adapting the sail position can be enough, but if they are not, they could try more options, which in some situations can lead to worse scenarios. This mimics the benefits brought by the MB in terms of filtering causality for the system state and keeping the focus on the system state by observing the metrics and the system's configuration. The metaphor also provides a clear view of the idea of system equilibrium: if the boat tilts when waves come, it can lose containers. Therefore, an equilibrium is required to dynamically change the system configuration and maintain the application performance as expected.

In this subsection, we have developed our vision of the methodology required to manage computing continuum systems. In this regard, we have identified several aspects of the methodology that will require leveraging learning techniques. These aspects can be separated into two classes: the design phase, where the methodology is adapted to a specific computing continuum system, and the runtime phase, where the methodology is used to manage the system. From this perspective, the design phase will require learning algorithms to optimize the DAG representation of the system, in terms of the nodes that are represented (the set of metrics and the set of actions) and their relations.


For the runtime phase, we need algorithms that can update the system representation based on the current situation of the system and, more importantly, algorithms that learn which adaptive mechanism to select and how to perform the adaptation.

1.4.2 Learning

This section details learning on the system representation during both the design and the runtime phases. In the design phase, we discuss various graph-learning algorithms used to refine the system representation, including MB learning. In the runtime phase, the FEP performs predictions and action selection through active inference. These two learning approaches are applied iteratively, as shown in Fig. 1.7.

1.4.2.1 Design Phase Learning

The DAG is built from conditional dependencies between states, given a set of available data [29]. Explaining these data using a DAG is considered an NP-hard problem [30]. Still, several software tools (Java-based) and libraries (for MATLAB, Weka, R, C++, Python, etc.) are available to build an initial DAG [31, 32]. Once the DAG is constructed, learning can take place to keep track of causality relations (the strength or weakness of a relation), to identify redundant relations and states, to discover optimal MBs, etc. There are several graph-learning algorithms in the literature to extract knowledge from graph data. These algorithms are used to estimate proximity, structural information, failure-state predictions, etc. Some of them are also useful to extract knowledge from DAGs, and most of them operate on the DAG as an adjacency matrix to simplify the problem. A summary of graph-learning algorithms for DAGs is presented in Table 1.1. Here, proximity indicates the strength of the causality relationship between two states.

Table 1.1 Summary of design phase learning algorithms. ©2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)

Algorithms               MF    RWL         SRL   GNN   RL      SDNE  GDL
Proximity                ✔     ✔           ✔     ✔     ✗       ✔     ✔
Stability                ✗     ✗           ✗     ✗     ✔       ✗     ✔
Reactive                 ✗     ✔           ✔     ✔     ✔       ✗     ✔
Handling missing values  ✗     ✔           ✗     ✗     ✔       ✗     ✔
Prediction accuracy      –     High        Low   High  Medium  Low   High
Complexity               Low   Reasonable  High  High  High    High  High


Stability in the table implies that small changes in the inputs will not drastically affect the results or outcome. Reactive means that the algorithm responds and acts upon the unexpected occurrence of events during the learning process.

Matrix factorization (MF), or decomposition, is used to estimate the strength of the relations (proximity) in a DAG. In general, MF reduces a matrix into constituent parts in order to more easily understand and capture complex relations [33]. It also helps to identify contested edges (edges whose removal improves the prediction accuracy among the states) in the DAG [34]. Non-negative MF further helps to identify irrelevant states or relations in the DAG; it can be improved by combining it with a manifold learning strategy to exploit non-linear relations in the DAG [35]. Besides these benefits, there are a few limitations, including sensitivity to outliers and noise: MF is unstable until noise and outliers are removed. Because computing continuum systems are generally very complex, due to the characteristics discussed in Sect. 1.3.2, using MF would mostly not yield accurate results for these systems.

Random walk learning (RWL) can identify states with similar properties within a short hop distance over the DAG, so that redundant states can be quickly identified and removed. Unlike MF, random walks can handle missing information and noisy data. The learning accuracy on a DAG increases as more data becomes available. The dynamic characteristics of the computing continuum, where states or relations may evolve over time, require more learning effort because new relations appear while old relations may not remain valid for long. Random walks can perform learning on such dynamic systems through continuous learning [36]. Besides these advantages, RWL has certain limitations: (1) it can predict proximity only if a relationship already exists between two states in the system, and (2) it can create uncertainty in the relations for small systems because of its random strategies and the limited data available. RWL is also computationally expensive. RWL can benefit computing continuum system representations by predicting the strength of causality relations between states; it also adapts to dynamic changes in the computing continuum representation and learns accordingly.

Statistical relational learning (SRL) is used for link prediction (properties of relations) on a complex and/or uncertain DAG. There are several categories of SRL techniques, such as probabilistic relational models (PRM), Markov logic networks (MLN), Bayesian logic programs (BLP), and relational dependency networks (RDN) [37]. These techniques efficiently predict the strength of relations among the states. PRM, BLP, and MLN struggle to extract inferences for small systems or for large systems with unlabeled information. SRL fails to produce accurate predictions when instances in the system are unlabeled, and it becomes computationally very expensive if labels are missing. SRL may therefore not fit the computing continuum unless label data are provided without missing information. Due to the scalability and dynamic nature of the computing continuum, it is strenuous to provide such accurate and complete information, and therefore SRL may not properly fit computing continuum scenarios.
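As a small, hedged illustration of how matrix factorization can estimate proximity on a DAG, the following sketch factorizes a toy adjacency matrix with scikit-learn's NMF implementation; the states and the matrix are hypothetical, and this only shows the mechanics rather than a complete MF-based learner.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical adjacency matrix of a small DAG (rows/cols = states M1, M2, A1, R1).
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=float)

# Factorize A ~= W @ H with non-negative factors.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(A)
H = model.components_

# The reconstructed entries can be read as proximity scores: strongly
# reconstructed non-edges hint at missing relations, while weakly
# reconstructed edges are candidates for removal.
proximity = W @ H
print(np.round(proximity, 2))
```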


Graph neural networks (GNN) follow a supervised learning strategy and use a neural network framework to process the DAG. GNN learning can operate at three general levels: systems, states, or relations [38]. System-level and state-level learning strategies can classify the states of a representation according to their characteristics, predict possible relations among the states, and group nodes with similar properties to simplify the learning strategies. Relation-level learning strategies can predict possible relations among the states and classify relations depending on their characteristics. Combining GNN and SRL techniques can help estimate the relational density between the states of a representation [39]. GNN also has certain limitations: (1) it is challenging to represent the system in a form compatible with neural networks, (2) it is computationally expensive, and (3) the GNN model must be customized (deciding the number of GNN layers, the aggregation function, the style of message passing, etc.). For computing continuum representations, GNN is useful for classifying states, predicting the strength of relations, and grouping nodes or relations with similar properties.

Reinforcement learning (RL) can be used to learn and extract knowledge from a DAG by considering accuracy, diversity, and efficiency in a reward function [40]. RL and GNN have been combined to extract the behavior of a DAG [41]; this combination has also been used to identify cascades of failure states in the DAG [42]. These techniques are suitable for small systems because they are computationally hungry, but their adaptability also makes them a good fit for dynamic environments such as the computing continuum.

Structural deep network embedding (SDNE) exploits first- and second-order proximity in the learning process to achieve structure preservation and sparsity [43]. The first-order proximity extracts the pairwise similarity between any two states (the local network), whereas the second-order proximity identifies missing legitimate links throughout the system; consequently, the second-order proximity can identify redundant states in the DAG. The characteristics of the local and global systems are learned using these two proximity factors. SDNE can remember the initial structure of the DAG and easily reconstruct the original one in case of a failure during learning. In addition, SDNE has shown the best performance in terms of classification and link prediction. However, it is essential to identify the balance point between the first- and second-order proximity factors to achieve optimal results. SDNE is helpful to predict the strength of causality relations in computing continuum representations; since SDNE is computationally heavy, its cost can be reduced by grouping similar states or relations using GNN.

Geometric deep learning (GDL) efficiently classifies similar states of a system and parses heterogeneous systems efficiently. GDL differs from traditional GNN, deep learning, and convolutional neural network (CNN) algorithms, which can be very inaccurate in this setting. The primary reason is that GNN, deep learning, and CNN are based on convolutions.


Hence, they work efficiently only on Euclidean data [44]. Computing continuum data and representations may be non-Euclidean; that is, states can differ vastly in their neighbors and connectivity, which makes convolutions hard to apply. Because GDL does not attempt to resolve the curse of dimensionality, it does not require additional abstractions to perform learning on computing continuum representations. It is dynamic and accurate, with low computational overhead, for learning complex computing continuum systems.

Bayesian network structure learning (BNSL) extracts the correlations among random variables from data. BNSL follows three learning approaches: constraint-based, score-based, and hybrid approaches that combine both. The constraint-based approach learns structural dependencies from the data using conditional-independence tests, while the score-based approach minimizes/maximizes scores as objective functions [45]. These approaches are primarily useful to learn a set of relations between the states and to quantify the strength of causality relations [46].

During design phase learning, all the algorithms discussed above identify and remove redundant states and relations from the system representations. They also reconstruct the system representation according to the strength of the causality relations. It is also worth detailing that the MB concept itself can be used to optimize a DAG. In this regard, we need to separate the conceptual perspective of building the system's representation with the state variables as the central node of the MB from using the MB concept to optimize the final representation of the system. The latter allows centering an MB at each node of the DAG in order to verify the dependency of the current relations; this is a very popular technique for feature selection in machine learning.

Markov blanket learning (MBL) follows a supervised learning strategy to extract the DAG that characterizes a target state. Unlike other supervised learning algorithms (neural networks, regression, support vector machines, decision trees, etc.), MBL returns a generative model, which is more robust even in the presence of missing values or redundant features. Similar to BNSL, MBL can follow either constraint-based or score-based approaches. The constraint-based approach is further divided into topology-based approaches (which tackle data efficiency) and non-topology-based approaches (which greedily test the states' independence). The time complexity of topology-based learning is reasonable compared with the non-topology-based and score-based approaches. The primary goal of these learning strategies is to reduce the loss of accuracy in MB discovery while keeping a reasonable complexity [47], and a few improvements have been made to these algorithms to meet this goal. A recursive MBL algorithm is introduced in [48] for learning the BN while removing states whose statistical dependencies do not affect the other states of the system. The primary goal of MBL is feature selection, or discovering the best set of MBs [49–51]. MBL produces a robust system by reducing the number of states and non-linear relations [52]. By using the independence relations between the states, an MB is discovered for each node, and the MBs are then connected consistently to form an updated global system [49].


Learning an MB with low computation and high efficiency is crucial, and several MBL algorithms have been introduced in the literature for this purpose. Ling et al. [53] proposed an efficient approach to discover the MB using local structure learning, in which the MB is identified based on the currently selected parents and children (PC); this approach yields more true-positive states (for the selected MBs) in a large DAG with minimal computation. A local structure learning algorithm for efficient MB discovery is introduced in [54] to distinguish parents from children based on the edge directions of the DAG. A minimum message length-based MBL algorithm is introduced in [55] for large-scale DAGs with perfect and imperfect data. A perfect DAG means that detailed information about the states and their relations is maintained in a conditional probability table, and in this case MBL is straightforward; for an imperfect DAG, naive Bayes is used to assume independence between the MB and the remaining states. A topology-based MB discovery algorithm is introduced by Gao et al. [56] using a simultaneous MB algorithm, in which false PC states and the coexistence property of spouse states are identified and removed simultaneously to minimize computation time. A selection via group alpha-investing (SGAI) technique is introduced for efficient MB selection from a set of multiple MBs using representative sets [57]; SGAI avoids unnecessary computations for parameter regularization, which minimizes the overall complexity of MB discovery. There are clear benefits of using MBL in the computing continuum. MBL-based feature selection helps identify the best resource (state) for service placement [58] with rapid actions through quick decisions. MBL can also help decide which subset of metrics influences a resource's workload among all the metrics; identifying fewer metrics to determine the influence on a resource reduces the computational burden.
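To ground the MB terminology used above, the following sketch extracts the Markov blanket of a node from a DAG using its standard definition (parents, children, and the children's other parents); the graph and node names are hypothetical, and this is not a full MBL algorithm such as those in [53–57].

```python
import networkx as nx

def markov_blanket(dag: nx.DiGraph, node):
    """Markov blanket of `node`: its parents, its children, and the other
    parents of those children (spouses)."""
    parents = set(dag.predecessors(node))
    children = set(dag.successors(node))
    spouses = {p for c in children for p in dag.predecessors(c)} - {node}
    return parents | children | spouses

# Hypothetical fragment of a system DAG: metrics (M), actions (A), resources (R).
dag = nx.DiGraph([
    ("M_cpu", "SystemState"), ("M_latency", "SystemState"),
    ("SystemState", "A_scale"), ("R_edge_node", "A_scale"),
    ("A_scale", "R_cloud"),
])
print(markov_blanket(dag, "SystemState"))
# {'M_cpu', 'M_latency', 'A_scale', 'R_edge_node'}
```

Centering such a blanket on each node of the DAG is exactly the feature-selection trick mentioned above: everything outside the blanket can be ignored when reasoning about that node.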

1.4.2.2 Runtime Phase Learning

As discussed in Sect. 1.4.1.4, the states of a system representation are differentiated by an MB into sensory and active states. In an MB, the internal and external states are conditionally independent of each other; they influence one another only via the sensory and active states [59]. The internal and active states directly affect the structural integrity of an MB through active inference. Active inference helps to minimize the uncertainties of the sensory states in the MB and to derive the minimal free energy for the internal states. The FEP works mainly on the basis of active inference and performs learning and perception on the system [60]. Active inference captures the behavior of the system through a generative model and predicts the sensory states. It also helps to eliminate prediction errors by updating an internal model that generates predictions through perceptual inference and perceptual learning. The FEP mainly performs predictions of future states and, through active inference, is able to improve them.


Furthermore, through active inference, the FEP is able to learn which action to select. Initially, predictive coding consistently updates the system representation based on the predicted sensory state information; the FEP then minimizes prediction errors by comparing the actual sensory state information with the predicted data. The predictions generated by the FEP are more accurate than those of traditional machine learning algorithms, and the FEP is very powerful in minimizing prediction errors through backpropagation [61]. Actions are the source of system control. These actions are selected and controlled by the FEP through the transmission of variational messages based on the predictions of the sensory states. The theory of active inference inspires the proper selection of an action; active inference is also useful to generate a deep generative model for adaptive actions through exploration. The critical limitation of active inference is scalability, which can be addressed using deep neural networks or deep RL algorithms [62]. Traditional machine learning strategies and the FEP have been combined in the literature to achieve better system performance. For example, an artificial neural network has been incorporated with the FEP to enhance the learning rate in terms of better action selection and control [63]. In [64], the FEP is used as active inference for RL to speed up the learning process and reach the goal state quickly. The FEP has also been used for reinforced imitation learning to minimize exploration through learning and perception. This discussion shows that combining traditional ML with the FEP can extend the learning paradigm: such a strategy reduces cost and produces high performance by quickly reaching the goal state.
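The following toy sketch illustrates the flavor of this prediction-error-driven loop, with perception as a gradient step on the squared prediction error and action selection as picking the action whose predicted sensory outcome is closest to a preferred observation. It is a drastic simplification of the FEP and active inference (fixed-variance Gaussians, a linear generative model), and all names and numbers are hypothetical.

```python
# Toy active-inference loop: a single internal state predicts a single sensory
# metric through a linear generative model. Minimizing squared prediction error
# stands in for minimizing free energy (fixed-variance Gaussian case).

GAIN = 2.0  # assumed generative model: sensed = GAIN * internal

def predict(internal):
    return GAIN * internal

def perceive(internal, sensed, lr=0.1, steps=25):
    """Perception: update the internal state to explain the sensed value."""
    for _ in range(steps):
        error = sensed - predict(internal)
        internal += lr * GAIN * error  # gradient step on 0.5 * error**2
    return internal

def select_action(internal, target, actions):
    """Active inference (toy): pick the action whose predicted sensory outcome
    is closest to the preferred (target) observation."""
    def expected_error(a):
        return (target - predict(internal + a)) ** 2
    return min(actions, key=expected_error)

internal = 0.0
sensed_load = 1.6                            # hypothetical metric reading
internal = perceive(internal, sensed_load)   # converges near 0.8
action = select_action(internal, target=1.0, actions=[-0.5, 0.0, 0.2, 0.5])
print(round(internal, 2), action)            # 0.8 -0.5
```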

1.5 Use Case Interpretation

In this section, we present a use case to exemplify how we envision our methodology.

1.5.1 Application Description

We set the application in the traffic control area: checking whether car drivers are using their phones hand-held while driving. To do so, several cameras are installed along the roads, and videos are recorded to detect driver activity using AI inference.1 The system is composed of a set of sensors distributed along the roadside. The basic set requires cameras to record driver activity, radars to obtain car speed, ground sensors to count the number of cars on the road, and a light system to allow recordings without daylight.

Additionally, the sensors need to connect to small computational units to pre-process data; for cameras, AI inference boards are also expected to be within their proximity. The application will also have a larger server to process and store the data. Finally, the application is ready to offload some of its inference or processing needs to a public cloud. Figure 1.8 shows a simplified schema of the application architecture.

1 www.newscientist.com/article/2300329-australias-ai-cameras-catch-over-270000-drivers-using-their-phones/amp/.

Fig. 1.8 Schema of the application's architecture. ©2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)

1.5.2 SLOs as Application Requirements

Service-level objectives (SLOs) are used by cloud providers to map requirements into measurable metrics of the system. In our approach, we go one step further and focus on high-level SLOs [22] in order to encode service-oriented requirements and to relate the various SLOs with the three highest-level variables of the system: Resources, Quality, and Cost. However, this advantage comes at the cost of mapping these SLOs onto measurable parts of the system, which is not always straightforward. For this use case, we have selected the following SLOs.

SLO 1: Percentage of drivers recognized. This SLO is strongly linked with the Quality of the system. It is computed by dividing the total number of drivers recognized by the system by the total number of cars. To get the total number of cars, we need the output of the ground sensors, and to obtain the number of recognized drivers, we need the output of the AI-based inference system. However, in order to have a complete picture of the status of the application regarding this SLO, we also need to (1) consider the number of cameras in place, (2) account for obstructions between cars, (3) consider the video quality, which involves the image resolution as affected by environmental conditions such as daylight, fog, and rain, and (4) account for the inference model. All this information does not directly provide the SLO value, but it is expected to explain the current situation of the SLO; in fact, it can be understood as a Markov blanket around the SLO.


SLO 2: Percentage of drivers recognized driving faster (e.g., 50% faster) than the speed limit. This SLO is closely related to the previous one but maps more directly to the business strategy of the application. Since drivers using their phones while driving beyond the speed limit have a greater chance of causing an accident, they usually receive much heavier traffic fines. The only difference with respect to the previous SLO is the need to incorporate data from the radars into the overall equation.

SLO 3: System up-time. This SLO is related to the Resources that the application is using. It is assessed by considering the total up-time of the application with respect to the total time elapsed. However, we also want to account for the causes and the relations between the system components that allow this level of availability. Hence, with this SLO, we also take into account other ways to measure the system's availability, such as a periodic ping to the computing units or checking the continuous stream of data from the sensors. More specifically, we imagine end-to-end tests that could perform an overall check of the entire system, so that we can verify that it is up and functioning well.

SLO 4: Expected remaining energy above a certain value. IoT and edge devices can have energy constraints if they are not plugged into a source of energy. This SLO ensures that devices do not run out of power and relates both the Resources and the Cost of the system. To measure it, we need to track the operation time of energy-constrained devices, as well as their charging times.

SLO 5: Percentage of inferences at the public cloud. The use of public clouds could be required at specific moments, for example, during periods of excess requests to the application from users and/or sensors. However, this type of offloading sharply increases the cost of the application due to the fees applied by public cloud providers. Therefore, this SLO is used to keep control over unexpected Costs. It is computed by keeping track of offloadings to public clouds.

SLO 6: Privacy level. One of the most important challenges is to provide a qualitative measure of the privacy standards of an application; this type of requirement is a must for "computing continuum" applications. For this use case, we consider a metric called privacy concern to categorize the privacy of the system with respect to the system topology and the use of public clouds, which require extra privacy measures. This metric is balanced against the privacy measures in place to provide the final privacy level of the system. Therefore, by keeping track of the offloaded instances, as well as the current privacy concern of the system and the implemented privacy measures, we are able to compute this SLO.
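A minimal sketch of how some of these SLOs could be computed from raw counters is shown below; the counter names and values are hypothetical and only serve to make the mapping from metrics to SLOs concrete.

```python
# Hypothetical counters gathered from the sensors and the platform; the field
# names are illustrative, not part of the original system.
metrics = {
    "cars_counted": 1_000,          # ground sensors
    "drivers_recognized": 910,      # AI inference output
    "uptime_seconds": 86_000,
    "elapsed_seconds": 86_400,
    "inferences_total": 50_000,
    "inferences_on_public_cloud": 1_200,
}

def slo_recognition_rate(m):
    """SLO 1: percentage of drivers recognized (Quality)."""
    return 100.0 * m["drivers_recognized"] / m["cars_counted"]

def slo_uptime(m):
    """SLO 3: system up-time (Resources)."""
    return 100.0 * m["uptime_seconds"] / m["elapsed_seconds"]

def slo_cloud_share(m):
    """SLO 5: percentage of inferences offloaded to the public cloud (Cost)."""
    return 100.0 * m["inferences_on_public_cloud"] / m["inferences_total"]

print(f"SLO1 = {slo_recognition_rate(metrics):.1f}%")  # 91.0%
print(f"SLO3 = {slo_uptime(metrics):.2f}%")            # 99.54%
print(f"SLO5 = {slo_cloud_share(metrics):.1f}%")       # 2.4%
```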


1.5.3 Developing the DAG

From the previous description of the required SLOs, we can draft a DAG that expresses the relations among the low-level metrics defined on "computing continuum" resources, the set of high-level SLOs, and the system state variables. As shown in Fig. 1.3, this DAG expresses the first part of our new representation. Additionally, we expect to find a mapping between this representation and a causal representation of the system. As shown in Fig. 1.4, the second part of the system's representation relates the high-level system variables and, possibly, some SLOs with the system's adaptive means; we will not unfold this part in this book chapter.

A first approximation to the DAG can be seen in Fig. 1.9. To build this first approximation, we go through the needs of each SLO to identify the measurable resources that can be considered. This might not be the final graph, because we might have ignored relevant resources that could influence an SLO; it can even change due to new constraints or requirements. Hence, from this point, the different techniques depicted in Sect. 1.4.2 are required to gradually adjust the representation to the final configuration of the system. It is important to highlight that this DAG is also a germinal causal graph for the application. This causal graph is expected to give the system the capability to trace back the causes of its current status and, in turn, the capacity to perform precise actions to solve any encountered issue. However, the causal graph will also require further refinement to ensure that the causality relations are well defined. We expect both DAGs (the system's Markov blanket and the causal graph) to deviate and differentiate as they come to encode their expected features precisely. However, building both from the system's SLOs ensures finding common points in order to move from the abstraction of causality analysis to the actual measurements and components of the system.
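The following sketch builds a small slice of such a DAG with networkx, wiring measurable resources to SLO-level nodes and these to the three high-level variables; the edge choices follow the SLO descriptions above and are illustrative rather than the definitive graph of Fig. 1.9.

```python
import networkx as nx

# A small slice of the Fig. 1.9 graph: measurable resources feed the SLOs,
# which in turn feed Quality, Resources, and Cost.
dag = nx.DiGraph()
dag.add_edges_from([
    ("Count sensor data", "% identified drivers"),
    ("Cameras in use", "Video quality"),
    ("Light condition", "Video quality"),
    ("Inference model", "% identified drivers"),
    ("Video quality", "% identified drivers"),
    ("Radar data", "% identified drivers above speed limit"),
    ("% identified drivers", "% identified drivers above speed limit"),
    ("% identified drivers", "Quality"),
    ("% identified drivers above speed limit", "Quality"),
    ("Devices ping", "System up-time"),
    ("Sensors cont. stream", "System up-time"),
    ("System up-time", "Resources"),
    ("Operation time", "Expected remaining energy"),
    ("Devices in use", "Expected remaining energy"),
    ("Expected remaining energy", "Resources"),
    ("Offloaded instances", "% cloud usage"),
    ("% cloud usage", "Cost"),
    ("Offloaded instances", "Privacy level"),
    ("Privacy measures", "Privacy level"),
])

assert nx.is_directed_acyclic_graph(dag)
print(sorted(dag.predecessors("Quality")))  # direct influences on Quality
```

Starting from such a draft, the learning techniques of Sect. 1.4.2 would then prune, reweight, or extend the edges as evidence accumulates.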

Fig. 1.9 DAG of the system: from the measurable "computing continuum" resources to the high-level system variables. © 2022 IEEE, reprinted, with permission, from S. Dustdar et al. (2022)

1.6 Conclusion

This chapter highlights the need to develop new methodologies to manage computing continuum systems; this includes developing a new representation of these systems and a framework within which to develop tools and mechanisms for them. The elasticity paradigm developed for managing cloud systems falls short when dealing with computing continuum systems. In short, it has been seen that, in the Cartesian space, it is not able to properly grasp the characteristics of the underlying infrastructure on which these systems are based, nor does it provide active mechanisms to resolve the perturbations these systems can suffer. Nevertheless, the system space composed of Resources, Quality, and Cost is kept to provide a high-level representation.

The methodology develops a new representation for these systems based on the Markov blanket concept. This provides manifold advantages when dealing with complex systems: (1) it provides a causality filter to control the scope of the system representation, (2) it allows us to develop nested representations to focus on specific issues, (3) it draws a formal separation between the system and the environment, providing space for the development of cooperative interfaces between different systems, and (4) it encodes causality relations within the system components to provide knowledge of the system's evolution. The managing framework is based on the concept of equilibrium, leaving behind the use of thresholds. In this regard, the equilibrium aims at alerting the system in advance so that the adaptive mechanisms performed are more efficient. Additionally, it also encodes the system derivatives, which allows adjusting the adaptive mechanisms to the precise needs of the system.

This new approach focuses on the complexity inherent to these systems due to their underlying infrastructure. However, it does not solve the problem of their complexity; it only sets out the tools to manage it. Therefore, this chapter also presents a survey of the learning methods required to fully develop the methodology, ranging from methods to build and maintain an optimized representation of the system to mechanisms to develop the self-adaptive capacities of computing continuum systems.

References 1. Giorgio Parisi. Complex systems: a physicist’s viewpoint. Physica A: Statistical Mechanics and its Applications, 263(1-4):557–564, feb 1999. 2. Bo Li, Qiang He, Feifei Chen, Hai Jin, Yang Xiang, and Yun Yang. Auditing Cache Data Integrity in the Edge Computing Environment. IEEE Transactions on Parallel and Distributed Systems, 32(5):1210–1223, 2021. 3. Xin Gao, Xi Huang, Yinxu Tang, Ziyu Shao, and Yang Yang. History-Aware Online Cache Placement in Fog-Assisted IoT Systems: An Integration of Learning and Control. IEEE Internet Things J., page 1, 2021. 4. Tien-Dung Nguyen, Eui-Nam Huh, and Minho Jo. Decentralized and Revised Content-Centric Networking-Based Service Deployment and Discovery Platform in Mobile Edge Computing for IoT Devices. IEEE Internet Things J., 6(3):4162–4175, 2019. 5. Dragi Kimovski, Roland Matha, Josef Hammer, Narges Mehran, Hermann Hellwagner, and Radu Prodan. Cloud, Fog or Edge: Where to Compute? IEEE Internet Computing, page 1, 2021. 6. Huaming Wu, Katinka Wolter, Pengfei Jiao, Yingjun Deng, Yubin Zhao, and Minxian Xu. EEDTO: An Energy-Efficient Dynamic Task Offloading Algorithm for Blockchain-Enabled IoT-Edge-Cloud Orchestrated Computing. IEEE Internet Things J., 8(4):2163–2176, 2021. 7. Jianji Ren, Haichao Wang, Tingting Hou, Shuai Zheng, and Chaosheng Tang. Collaborative Edge Computing and Caching With Deep Reinforcement Learning Decision Agents. IEEE Access, 8:120604–120612, 2020. 8. He Li, Kaoru Ota, and Mianxiong Dong. Deep Reinforcement Scheduling for Mobile Crowdsensing in Fog Computing. ACM Trans. Internet Technol., 19(2), apr 2019. 9. S Nastic, T Pusztai, A Morichetta, V Casamayor Pujol, S Dustdar, D Vij, and Y Xiong. Polaris Scheduler: Edge Sensitive and SLO Aware Workload Scheduling in Cloud-Edge-IoT Clusters. In 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 2021. 10. Ruijie Zhu, Shihua Li, Peisen Wang, Mingliang Xu, and Shui Yu. Energy-efficient Deep Reinforced Traffic Grooming in Elastic Optical Networks for Cloud-Fog Computing. IEEE Internet Things J., page 1, 2021.


11. Badraddin Alturki, Stephan Reiff-Marganiec, Charith Perera, and Suparna De. Exploring the Effectiveness of Service Decomposition in Fog Computing Architecture for the Internet of Things. IEEE Transactions on Sustainable Computing, page 1, 2019. 12. Yifan Dong, Songtao Guo, Jiadi Liu, and Yuanyuan Yang. Energy-Efficient Fair Cooperation Fog Computing in Mobile Edge Networks for Smart City. IEEE Internet Things J., 6(5):7543– 7554, 2019. 13. Xiaomin Li, Jiafu Wan, Hong-Ning Dai, Muhammad Imran, Min Xia, and Antonio Celesti. A Hybrid Computing Solution and Resource Scheduling Strategy for Edge Computing in Smart Manufacturing. IEEE Transactions on Industrial Informatics, 15(7):4225–4234, 2019. 14. Randa M Abdelmoneem, Abderrahim Benslimane, and Eman Shaaban. Mobility-aware task scheduling in cloud-Fog IoT-based healthcare architectures. Computer Networks, 179:107348, oct 2020. 15. Elena Hernández-Nieves, Guillermo Hernández, Ana-Belén Gil-González, Sara RodríguezGonzález, and Juan M Corchado. Fog computing architecture for personalized recommendation of banking products. Expert Systems with Applications, 140:112900, feb 2020. 16. David C Klonoff. Fog computing and edge computing architectures for processing data from diabetes devices connected to the medical internet of things. Journal of diabetes science and technology, 11(4):647–652, 2017. 17. Schahram Dustdar, Yike Guo, Benjamin Satzger, and Hong Linh Truong. Principles of elastic processes. IEEE Internet Computing, 15(5):66–71, sep 2011. 18. Georgiana Copil, Daniel Moldovan, Hong-Linh Truong, and Schahram Dustdar. Multi-level Elasticity Control of Cloud Services. In Service-Oriented Computing, volume 8274 LNCS, pages 429–436. Springer, Berlin, Heidelberg, 2013. 19. Philipp Hoenisch, Dieter Schuller, Stefan Schulte, Christoph Hochreiner, and Schahram Dustdar. Optimization of Complex Elastic Processes. IEEE Transactions on Services Computing, 9(5):700–713, sep 2016. 20. Hong Linh Truong, Schahram Dustdar, and Frank Leymann. Towards the Realization of Multidimensional Elasticity for Distributed Cloud Systems. Procedia Computer Science, 97:14–23, jan 2016. 21. Johann Schleier-Smith, Vikram Sreekanti, Anurag Khandelwal, Joao Carreira, Neeraja J. Yadwadkar, Raluca Ada Popa, Joseph E. Gonzalez, Ion Stoica, and David A. Patterson. What serverless computing is and should become. Communications of the ACM, 64(5):76–84, may 2021. 22. Stefan Nastic, Andrea Morichetta, Thomas Pusztai, Schahram Dustdar, Xiaoning Ding, Deepak Vij, and Ying Xiong. SLOC: Service level objectives for next generation cloud computing. IEEE Internet Computing, 24(3):39–50, may 2020. 23. T Pusztai, S Nastic, A Morichetta, V Casamayor Pujol, S Dustdar, X Ding, D Vij, and Y Xiong. A Novel Middleware for Efficiently Implementing Complex Cloud-Native SLOs. In 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 2021. 24. T Pusztai, S Nastic, A Morichetta, V Casamayor Pujol, S Dustdar, X Ding, D Vij, and Y Xiong. SLO Script: A Novel Language for Implementing Complex Cloud-Native Elasticity-Driven SLOs. In 2021 IEEE International Conference on Web Services (ICWS), 2021. 25. Schahram Dustdar, Victor Casamajor Pujol, and Praveen Kumar Donta. On distributed computing continuum systems. IEEE Transactions on Knowledge and Data Engineering, NA:1–14, 2022. 26. Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. 27. Karl Friston, James Kilner, and Lee Harrison. 
A free energy principle for the brain. Journal of Physiology Paris, 100:70–87, 2006. 28. Vicente Raja, Dinesh Valluri, Edward Baggs, Anthony Chemero, and Michael L. Anderson. The Markov blanket trick: On the scope of the free energy principle and active inference. Physics of Life Reviews, sep 2021. 29. Xue-Wen Chen, Gopalakrishna Anantha, and Xiaotong Lin. Improving bayesian network structure learning with mutual information-based node ordering in the k2 algorithm. IEEE Trans. Knowl. Data Eng., 20(5):628–640, 2008.


30. Mark Bartlett and James Cussens. Integer linear programming for the bayesian network structure learning problem. Artificial Intelligence, 244:258–271, 2017. 31. Mauro Scanagatta, Antonio Salmerón, and Fabio Stella. A survey on bayesian network structure learning from data. Progress in Artificial Intelligence, 8(4):425–439, 2019. 32. Zhaolong Ling, Kui Yu, Yiwen Zhang, Lin Liu, and Jiuyong Li. Causal learner: A toolbox for causal structure and markov blanket learning. arXiv preprint arXiv:2103.06544, 2021. 33. Junpeng Li, Changchun Hua, Yinggan Tang, and Xinping Guan. A fast training algorithm for extreme learning machine based on matrix decomposition. neurocomputing, 173:1951–1958, 2016. 34. Jonathan Strahl, Jaakko Peltonen, Hirsohi Mamitsuka, and Samuel Kaski. Scalable probabilistic matrix factorization with graph-based priors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5851–5858, 2020. 35. Chong Peng, Zhao Kang, Yunhong Hu, Jie Cheng, and Qiang Cheng. Nonnegative matrix factorization with integrated graph and feature learning. ACM Transactions on Intelligent Systems and Technology (TIST), 8(3):1–29, 2017. 36. Giang H Nguyen, John Boaz Lee, Ryan A Rossi, Nesreen K Ahmed, Eunyee Koh, and Sungchul Kim. Dynamic network embeddings: From random walks to temporal random walks. In 2018 IEEE International Conference on Big Data (Big Data), pages 1085–1092. IEEE, 2018. 37. Alexandrin Popescul and Lyle H Ungar. Statistical relational learning for link prediction. In IJCAI workshop on learning statistical models from relational data, volume 2003. Citeseer, 2003. 38. Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020. 39. Devendra Singh Dhami, Siwen Yan, and Sriraam Natarajan. Bridging graph neural networks and statistical relational learning: Relational one-class gcn. arXiv preprint arXiv:2102.07007, 2021. 40. Sephora Madjiheurem and Laura Toni. Representation learning on graphs: A reinforcement learning application. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3391–3399. PMLR, 2019. 41. Daniele Gammelli, Kaidi Yang, James Harrison, Filipe Rodrigues, Francisco C Pereira, and Marco Pavone. Graph neural network reinforcement learning for autonomous mobility-ondemand systems. arXiv preprint arXiv:2104.11434, 2021. 42. Eli Meirom, Haggai Maron, Shie Mannor, and Gal Chechik. Controlling graph dynamics with reinforcement learning and graph neural networks. In International Conference on Machine Learning, pages 7565–7577. PMLR, 2021. 43. Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conf. on Knowledge discovery and data mining, pages 1225–1234, 2016. 44. Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5115–5124, 2017. 45. Marco Scutari, Catharina Elisabeth Graafland, and José Manuel Gutiérrez. Who learns better bayesian network structures: Accuracy and speed of structure learning algorithms. Int J Approx Reason., 115:235–253, 2019. 46. Sangmin Lee and Seoung Bum Kim. Parallel simulated annealing with a greedy algorithm for bayesian network structure learning. EEE Trans. Knowl. 
Data. Eng., 32(6):1157–1166, 2019. 47. Tian Gao and Qiang Ji. Efficient score-based markov blanket discovery. Int J Approx Reason., 80:277–293, 2017. 48. Ehsan Mokhtarian, Sina Akbari, AmirEmad Ghassami, and Negar Kiyavash. A recursive markov blanket-based approach to causal structure learning. arXiv preprint arXiv:2010.04992, 2020.


49. Shunkai Fu and Michel C Desmarais. Fast markov blanket discovery algorithm via local learning within single pass. In Conference of the Canadian Society for Computational Studies of Intelligence, pages 96–107. Springer, 2008. 50. Zhaolong Ling, Kui Yu, Hao Wang, Lin Liu, Wei Ding, and Xindong Wu. Bamb: A balanced markov blanket discovery approach to feature selection. ACM Transactions on Intelligent Systems and Technology (TIST), 10(5):1–25, 2019. 51. Xianglin Yang, Yujing Wang, Yang Ou, and Yunhai Tong. Three-fast-inter incremental association markov blanket learning algorithm. Pattern Recognition Letters, 122:73–78, 2019. 52. Jean-Philippe Pellet and André Elisseeff. Using markov blankets for causal structure learning. Journal of Machine Learning Research, 9(7), 2008. 53. Zhaolong Ling, Kui Yu, Hao Wang, Lei Li, and Xindong Wu. Using feature selection for local causal structure learning. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(4):530–540, 2021. 54. Shuai Yang, Hao Wang, Kui Yu, Fuyuan Cao, and Xindong Wu. Towards efficient local causal structure learning. IEEE Trans. on Big Data, pages 1–1, 2021. 55. Yang Li, Kevin B Korb, and Lloyd Allison. Markov blanket discovery using minimum message length. arXiv preprint arXiv:2107.08140, 2021. 56. Tian Gao and Qiang Ji. Efficient markov blanket discovery and its application. IEEE Trans. Cybern., 47:1169–1179, 2017. 57. Kui Yu, Xindong Wu, Wei Ding, Yang Mu, and Hao Wang. Markov blanket feature selection using representative sets. IEEE Trans Neural Netw Learn Syst., 28(11):2775–2788, 2017. 58. Cosmin Avasalcai, Christos Tsigkanos, and Schahram Dustdar. Adaptive management of volatile edge systems at runtime with satisfiability. ACM Trans. Internet Technol., 22(1), September 2021. 59. Michael Kirchhoff, Thomas Parr, Ensor Palacios, Karl Friston, and Julian Kiverstein. The markov blankets of life: autonomy, active inference and the free energy principle. Journal of The royal society interface, 15(138):20170792, 2018. 60. Karl J Friston, Lancelot Da Costa, and Thomas Parr. Some interesting observations on the free energy principle. Entropy, 23:1076, 2021. 61. Christopher L Buckley, Chang Sub Kim, Simon McGregor, and Anil K Seth. The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology, 81:55–79, 2017. 62. Beren Millidge. Applications of the free energy principle to machine learning and neuroscience. CoRR, abs/2107.00140, 2021. 63. Myoung Won Cho. Simulations in a spiking neural network model based on the free energy principle. Journal of the Korean Physical Society, 75(3):261–270, 2019. 64. Noor Sajid, Philip J Ball, Thomas Parr, and Karl J Friston. Active inference: demystified and compared. Neural Computation, 33(3):674–712, 2021.

Chapter 2

Containerized Edge Computing Platforms

Abstract This chapter covers details of containerized edge computing platforms. We begin with a basic introduction and definition of core concepts before highlighting container use case scenarios. We review what a container engine is and what alternatives are available in the market. We also provide details on automated container management processes that free operators from tasks such as re-creating and scaling containers. We also elaborate how container orchestration helps in managing container networking and storage functions. We will discuss the Kubernetes orchestration platform, which is currently the most widely used container orchestrator, and provide a high-level overview of the Kubernetes API for various programming languages.

2.1 Containers vs. Virtual Machines

The possibility of creating and using containers has existed for decades. However, they became broadly available in 2008, when Linux integrated container capabilities into its kernel, and they have been widely used since the emergence of the Docker open-source containerization platform in 2013. Containers are lightweight, executable application components that include all the operating system (OS) libraries and dependencies needed to run code in any environment. In other words, a container is an executable software unit that packages an application's code, libraries, dependencies, and other components so that it can be ported across a variety of computing environments; containerized software applications run as smoothly on a local desktop as on a cloud platform. The process of designing, packaging, and deploying programs in containers is known as containerization. We can containerize distinct parts of an application in separate containers or containerize the entire application into one container. Containers are executed on physical server hardware and share a single OS, whereas virtual machines use hardware, software, and firmware features to create several independent machines on top of a single host, each running a separate OS. The design difference between the two technologies is shown in Fig. 2.1. Understanding the terms "containers," "containerization," and "container orchestration" helps comprehend why software engineers designed containers decades ago.


Fig. 2.1 An architectural comparison between virtual machines and containers

Engineers devised the idea of running software in an isolated manner on top of a physical server to help utilize the abundance of resources at their disposal; this is how the concept of virtual machines emerged. Using a hypervisor (hardware, firmware, or software that generates, runs, and monitors virtual machines) on top of a physical server's hardware, engineers were able to create many virtual computers, a procedure now commonly referred to as virtualization. By using virtualization, we can run multiple operating systems on the same physical machine. Each virtual machine runs its own OS (referred to as the guest OS). As a result, each virtual machine can serve applications, libraries, and binaries distinct from the other virtual machines co-located with it on the same host. This increases the usable processing power, lowers hardware costs, and reduces the operational footprint: an entire server no longer needs to be dedicated to a single program, freeing up computing resources that can be better utilized elsewhere.

Virtual machines, however, are not without flaws. Because each virtual machine contains an OS image, binaries, and libraries, its size can rapidly grow to several gigabytes. Powering on a virtual machine may also take minutes rather than seconds. This slowness is a performance barrier when running sophisticated applications and/or performing disaster recovery, because minutes can easily add up to hours. When shifting from one computing domain to another, virtual machines also have difficulty running software smoothly; in an era when users switch between devices to access services from anywhere at any time, this can be a limiting factor. In comparison, containers are more lightweight, share the host server's resources, and, perhaps most importantly, are designed to work in any environment, whether on-premises, in the cloud, or on local laptops.


Enterprises use containers for different purposes. The following are the main reasons they are used in cloud computing environments:

• To ensure that applications can be migrated from the development environment to the production environment with minimal changes.
• To restructure legacy software applications while optimizing them to work seamlessly in the cloud.
• To make it possible for micro-service-based software applications to run in the cloud.
• To utilize identical container images to assist engineering teams in implementing continuous integration and continuous development (CI/CD) pipelines in a DevOps culture.
• To take advantage of "lift and shift" cloud migration techniques to migrate legacy or on-premises software applications to the cloud.
• To easily schedule the deployment of recurrent tasks in the background.
• To reduce the number of hardware resources required to virtualize applications and consequently lower cloud computing costs.
• To use multi-cloud strategies to distribute computing workloads between cloud platforms and on-premises data centers.

2.2 Container Engines

A container engine is a software program that accepts user requests, pulls images, and runs containers on the end user's behalf. Docker is one of the most feature-rich and widely utilized container engines, with millions of software solutions already using it. Lately, several cloud providers have also developed container engines. Although Docker is an advanced and powerful standalone platform with a robust toolkit to control containerized processes, there is a variety of Docker alternatives with particular use cases and capabilities. This section discusses Docker, as well as a few alternatives to the Docker ecosystem.

2.2.1 Docker

Docker [1] is an open-source project that offers a container-based software development solution written in Go. Docker relies on a client-server model. Docker is essentially a container engine that creates containers on top of an OS by leveraging Linux kernel capabilities such as namespaces and control groups. It simplifies the implementation of the container principles and features enabled by Linux Containers (LXC). With Docker, we can get containers up and running with just a few commands and settings.


In addition to being a container technology, Docker also includes well-defined wrapper components that simplify the packaging of programs. Before Docker, it was not easy to execute containers. By packaging all of an application's system requirements into a container, Docker does a great job of decoupling the application from the infrastructure. For example, if we have a Python file, we can execute it on any host with the Python runtime installed. Similarly, once we have packaged a container with the required applications using Docker, we can execute it on any host that has Docker installed. Docker had a monolithic architecture when it was first released. Later on, it was divided into three components (Docker engine, Docker containerd, and Docker runC). The fundamental reason was that Docker and other significant stakeholders wanted a standard container runtime and management layer. As a result, containerd and runC were handed over to community governance (the Cloud Native Computing Foundation and the Open Container Initiative, respectively), with contributions from various communities. The official high-level Docker architecture diagram, which depicts the typical Docker workflow, is shown in Fig. 2.2. The Docker engine comprises three components: the Docker daemon, an API interface, and the Docker CLI. The Docker daemon (dockerd) is a systemd service that runs continuously. It is in charge of building Docker images and running containers. Images are one of the core building blocks of Docker: to run a Docker container, we first need an image, which contains the OS libraries, dependencies, and all the tools required to execute an application successfully. A registry is a repository for maintaining Docker images; using the Docker registry, we can store and share images. The Docker client is used for interacting with containers, and we refer to it as the Docker user interface.
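Because dockerd exposes an API that clients other than the Docker CLI can call, the workflow above can also be driven programmatically. The following minimal sketch uses the Docker SDK for Python (the docker package, installable with pip install docker); the nginx image, the container name, and the port mapping are illustrative assumptions rather than part of this chapter's examples.

# A minimal sketch using the Docker SDK for Python ("pip install docker");
# the image, name, and port mapping below are illustrative assumptions.
import docker

client = docker.from_env()          # connects to the local Docker daemon (dockerd)

client.images.pull("nginx:latest")  # pull an image from the Docker registry

# Run a container from the image, publishing container port 80 on host port 8080.
container = client.containers.run(
    "nginx:latest", detach=True, ports={"80/tcp": 8080}, name="demo-nginx")
print(container.status)             # e.g., "created" or "running"

container.stop()
container.remove()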

2.2.2 Podman

Podman [2] is another popular container engine in the container ecosystem. It manages user requests and loads; verifies container images from a registry server; monitors, allocates, and isolates system resources; and runs containers using a bundled container runtime. It provides a user interface that hides the complexities of interacting with system security rules and policies (such as SELinux), allowing users to interact with and use containers. Podman, from Red Hat, is a daemon-free, open-source, Linux-native container engine for creating, running, and managing Linux Open Container Initiative (OCI) containers and container images. Podman has a command-line interface similar to Docker's, but it works differently. Docker and Podman differ in that the former relies on a persistent, self-contained runtime, the dockerd daemon, to manage its objects. Podman, on the other hand, does not depend on a daemon to function. Containers are launched as child processes of Podman itself, which connects directly to the registry and the Linux kernel via a runtime process. Because of this, Podman is known as a daemon-less container technology. The lack of a daemon improves Podman's flexibility as a container engine, because it removes the reliance on a single process that could act as a single point of failure and cause child processes to be orphaned or to fail.

Fig. 2.2 Docker architecture overview (the Docker client issuing build, pull, push, and run commands; the Docker host running the Docker daemon with its images and containers; and the Docker registry)


Podman is also distinct from Docker in that it does not require root access. This adds an extra security buffer by limiting potentially dangerous processes that could change critical system configurations or expose the container and the encapsulated application. Podman can also run Pods, collections of one or more containers that are handled as a unified entity and share a set of resources. Thanks to this functionality, Podman users can migrate their workloads to Kubernetes.
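Because Podman can serve a Docker-compatible REST API endpoint on demand (for example, via podman system service), existing Docker tooling can often talk to it without modification. The sketch below is a hedged illustration of this idea using the same Docker SDK for Python as before; the socket path is an assumption that depends on how the Podman API service was started.

# A sketch assuming Podman's Docker-compatible API service has been started,
# e.g. with: podman system service --time=0 unix:///tmp/podman.sock
# (the socket path is an illustrative assumption).
import docker

# Podman has no persistent daemon; the API endpoint is served on demand,
# so the same Docker SDK client simply points at Podman's socket instead.
client = docker.DockerClient(base_url="unix:///tmp/podman.sock")

for container in client.containers.list(all=True):
    print(container.name, container.status)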

2.2.3 LXD

LXD [3] is an open-source container engine that is optimized for LXC (Linux Containers). LXC allows users to run applications in segregated containers or virtual environments (similar to virtual machines) without managing separate kernels. LXD provides a daemon for handling networking and data storage and for maintaining numerous LXC containers, while offering an interface to the LXC software library. Although LXC can be used independently, it has limited capabilities; LXD offers those extra features and, as a result, relies on LXC to function. LXC and LXD, on the other hand, serve a small segment of the container technology ecosystem and have a small user base. They are also better suited to use cases that require long-term persistent contexts for virtual application execution than to those that rely on short-lived containers. Unlike Docker, which advocates running only one process per container, LXD containers can run multiple processes. Docker containers are also more portable than LXD containers due to how Docker abstracts resources. Finally, unlike Docker, which can run on Windows and Mac OS X, LXD is only available for Linux.

2.3 Container Orchestration Platforms

Container orchestration is the process of automating the management of containers, which relieves us from responsibilities such as creating, scaling, and updating them. Container orchestration also aids in the management of networking and storage resources for containers. Containers have grown in popularity as a means of developing software. For many, containers are the preferred method for developing new, modern software and migrating legacy programs. Ease of use is one of the reasons containers are so popular. Containers are straightforward to build and run, but there is a catch: as the number of containers increases, the time spent on managing those containers also increases. Even simple apps might be made up of dozens of containers in a micro-services architecture. Container orchestration platforms allow us to cut down on the time we spend managing container life cycles.


Container orchestration comprises several tools used to manage containers and reduce operational workload. To justify their need, consider the following scenario. Assume we have 50 containers that we must upgrade one at a time. Although we could do this manually, it would take a long time and much human effort. A container orchestration platform can instead be asked (using just a simple YAML configuration file) to do the same work for us. This upgrade is just one of a container orchestration platform's many features. These platforms are designed to handle nearly all the tasks necessary to keep a containerized application running. They can instantaneously restart crashed containers, scale containers when the load increases, and ensure that containers are deployed uniformly across all available nodes. If any container misbehaves (e.g., takes all available memory on its node), the container orchestration platform will guarantee that the other containers are moved to other nodes. A container orchestration platform can also ensure that a specific storage volume remains accessible to containers even if they are moved to other nodes, and it can prevent some containers from communicating with the Internet while allowing others to communicate only with specific endpoints.

2.3.1 Self-Hosted vs. Managed Container Orchestration

We can build our own container orchestration platform from the ground up using an open-source platform, but we will then be responsible for installing and configuring all of its components. Although this option provides complete control over the platform and allows us to tailor it to our specific requirements, it is seldom taken. The alternative is to select one of the managed platforms. When we select a managed container orchestration platform, the cloud provider handles the installation and operation of the platform's features. In fact, we are usually not required to understand how these platforms work behind the scenes; all we want from the platform is to manage our containers. Managed platforms are helpful in these situations and save us a considerable amount of setup and maintenance time and effort. Google GKE, Azure AKS, Amazon EKS, IBM Cloud Kubernetes Service, and Red Hat OpenShift are well-known examples of managed container orchestration solutions. The list of features will vary depending on the platform we choose. In order to build a reliable, fully functional infrastructure, we will need to consider some additional components besides the platform itself. For instance, a container orchestration platform is not responsible for maintaining our container images, so we will need to set up an image registry. Depending on the underlying infrastructure and the deployment mode, whether we are using a public cloud or an on-premises data center, we may also need to implement a load balancer ourselves. Most managed container orchestration platforms, however, take care of cloud load balancers and other cloud services for their customers as well.


2.3.2 Well-Known Container Orchestration Platforms

This section will look at a few well-known container orchestration platforms.

Kubernetes [4] was released by Google in 2014 and has since gained global recognition. Kubernetes is an open-source container deployment and management platform. Its main job is to maintain the desired states that we define using YAML configuration files. Not only does it keep our containers up and running; it also provides advanced networking and storage features. Kubernetes also monitors the health of the cluster. It is a complete platform for running modern applications.

Docker Swarm [5] is a cluster management platform from the Docker ecosystem that includes technologies ranging from development to production deployment frameworks. A combination of Docker Compose, Docker Swarm, an overlay network, and a state-management data store such as etcd can be used to manage a cluster of Docker containers. However, Docker Swarm's functionality is still evolving compared to other well-known open-source container cluster management platforms. Given the large number of Docker contributors, it should not take long before Docker Swarm offers the best features available in other tools. Docker has also laid out a decent plan for utilizing Docker Swarm in production environments.

OpenShift [6] is a container orchestration platform developed by Red Hat. Its major goal is to provide a Kubernetes-like platform for deploying containers on-premises or in hybrid cloud environments. In reality, OpenShift is built on top of Kubernetes and offers largely similar features. Nevertheless, there are numerous differences between Kubernetes and OpenShift as well. The notion of build-related artifacts is the most important one: OpenShift implements these artifacts as first-class Kubernetes resources. A further distinction is that Kubernetes is largely agnostic about the underlying OS, whereas OpenShift is strongly tied to Red Hat Enterprise Linux. Furthermore, OpenShift bundles several components that are optional in vanilla Kubernetes; for instance, Prometheus can be used for monitoring, and Istio can be used for service mesh realization. In summary, although Kubernetes provides great freedom and flexibility, OpenShift aims to offer a comprehensive package for its users, who are mainly enterprises.

Mesos [7] is a cluster management platform that excels at container orchestration. It originated as a research project at the University of California, Berkeley, was released as an open-source project, and has been adopted by well-known companies such as Twitter, eBay, and Airbnb. Mesos, however, is not a container-specific solution. Instead, it can be used for bare-metal clustering for workloads other than containers (e.g., big data). Marathon is a general framework for deploying and managing containers on a Mesos cluster. On a Mesos cluster, we can also run a Kubernetes cluster.

Nomad [8] is a container-friendly orchestration platform developed by HashiCorp. It follows the same principle as Kubernetes when handling large-scale applications. Nomad can handle both containerized and non-containerized workloads. Other HashiCorp tools, such as Consul and Terraform, have native


integration support with Nomad. The main use cases for Nomad are container orchestration, non-containerized application orchestration, and automated service networking with Consul.

2.4 Kubernetes

Kubernetes is a container orchestration platform that supports declarative configuration and automation. It is currently the most widely used container orchestrator. It was initially developed at Google before being handed over to the Cloud Native Computing Foundation as an open-source project. The advantages of Kubernetes over alternative orchestration platforms are mainly due to the extensive and powerful functionalities it offers in several domains, including the following operations.
• Container deployment and rollouts: Kubernetes deploys a specified number of containers to a given host and keeps them running in the desired state. A change to a Deployment is referred to as a rollout. Rollouts can be started, paused, resumed, or rolled back in Kubernetes.
• Service discovery and load balancing: Kubernetes creates and maintains a DNS name (or an IP address) to automatically expose a container to the Internet or to other containers. In case of excessive traffic to a container, Kubernetes offers load balancing and scaling features to distribute the load across the cluster nodes and ensure performance and reliability.
• Self-healing: Kubernetes can automatically restart or replace failing containers. It can also remove containers that do not meet previously defined health check criteria.
• Storage provisioning: Kubernetes can mount persistent local or cloud storage for containers.
• Support and portability across cloud vendors: As stated previously, Kubernetes is widely supported by all major cloud providers. This support is particularly crucial for businesses using a hybrid cloud architecture to deliver containerized applications.
• Rapidly growing ecosystem of open-source projects: Kubernetes is backed by a rapidly growing set of usability and networking tools that expand its capabilities.

2.4.1 Kubernetes Cluster

A Kubernetes cluster consists of a number of nodes that work together to run containerized applications. A containerized application packages its dependencies together with other essential services; such packages are lightweight and more adaptable than virtual machines. Containerized apps enable Kubernetes cluster users to easily


develop, migrate, and manage applications. In a Kubernetes cluster, it is possible to run containers on multiple machines and/or environments (physical, virtual, on-premises, and cloud). Unlike virtual machines, Kubernetes containers are not limited to a single operating system. Instead, they can share operating systems and execute anywhere. A minimal Kubernetes cluster consists of one master node and several worker nodes. Depending on the type of cluster, these nodes can be physical machines or virtual machines (Fig. 2.3).

2.4.1.1 Control Plane Components

The control plane components make global decisions about the cluster (such as scheduling) and detect and respond to cluster-level events (such as starting up a new Pod to meet a Deployment's replica requirement). Control plane components can run on any node in the cluster.

Kube-apiserver is the control plane component that exposes the Kubernetes API. It serves the cluster's state, through which all other components communicate with each other, and is therefore also called the cluster's front end. The kubectl interface is commonly used to access the Kubernetes API; kubectl is a command-line tool that lets users create and manage containerized application instances through the Kubernetes API.

Etcd offers a persistent and highly available key-value data store. It acts as the cluster's backing store for all cluster-related data and information.

Kube-scheduler is the control plane component responsible for watching for freshly created Pods that have not yet been assigned to a node and selecting a node for them to run on. It considers the following requirements: shared resource requirements, hardware and software constraints, affinity specifications, data locality, and deadlines.

Kube-controller-manager is the control plane component that runs controller processes. Each controller is logically a separate process, but they are all compiled into a single binary and operate as a single process (to decrease complexity). It includes the following controllers.
• Node controllers are in charge of detecting and responding to node failures.
• Endpoints controllers fill the Endpoints objects with data (i.e., they join Services with Pods).
• Service account and token controllers create default accounts and API access tokens for new namespaces.

Cloud-controller-manager is the Kubernetes control plane component that embeds cloud-specific control logic. The cloud-controller-manager allows us to connect our cluster to a specific cloud provider's API and isolates the components that interface with the cloud platform from those that deal with our on-premises cluster. Only controllers relevant to our cloud provider are run via the cloud-controller-manager.

Fig. 2.3 An overview of a Kubernetes control plane and its node components


The cluster has no cloud-controller-manager if we operate Kubernetes locally or in a learning setup (e.g., on our laptop). The cloud-controller-manager, like the kube-controller-manager, is a single binary that integrates several conceptually distinct control loops into a single process. It can be scaled horizontally (i.e., run in several copies) to boost performance or tolerate failures. The following controllers may have cloud provider dependencies.
• Node controller: checks with the cloud provider to determine whether a node has been deleted in the cloud after it stops responding.
• Route controller: configures routes in the cloud infrastructure.
• Service controller: creates, updates, and deletes cloud provider load balancers.

2.4.1.2 Node Components

Every node runs node components that keep Pods operating and provide the Kubernetes runtime environment.

Kubelet is an agent that runs on each worker node. It ensures that the containers in a Pod are running. The kubelet takes a set of PodSpecs from multiple sources and ensures that the containers defined in those PodSpecs are up and running. The kubelet does not manage containers that were not created by Kubernetes.

Kube-proxy is a network proxy service that runs on each node in a Kubernetes cluster and implements part of the Kubernetes Service concept. On worker nodes, kube-proxy maintains network rules. These network rules allow network sessions inside or outside of the cluster to communicate with Pods. If an OS packet filtering layer exists, kube-proxy uses it; otherwise, kube-proxy forwards the traffic itself.

Container runtime is the software that manages container execution. Kubernetes supports numerous container runtimes (including containerd, Docker, and CRI-O) as well as any implementation of the Kubernetes Container Runtime Interface (CRI).

2.4.2 Kubernetes Objects and Resource Types

Kubernetes objects are defined as persistent entities of a Kubernetes system. Kubernetes employs these entities to represent the cluster's status. For instance, they are capable of describing the following:
• All running applications and the nodes on which they are deployed.
• Policies for managing and controlling applications, including fault tolerance and rollout policies.
• The resources available to reliably execute the deployed applications.

Fig. 2.4 An overview of Kubernetes objects composition (a Namespace containing a Deployment, its ReplicaSet of labelled Pods with their IP addresses, and a Service whose selector targets the Pods labelled app: beta)

The rest of this section provides high-level information about a few well-known Kubernetes core objects that are commonly used (Fig. 2.4).

Namespace is a technique in Kubernetes for segregating sets of resources within a single cluster. Within a single namespace, resource names must be unique. Namespaces are designed for situations with many users spread across several different teams or projects.

Label is a key/value pair associated with objects such as Pods. Labels are meant to specify identifying attributes of objects that are relevant and meaningful to users, but they do not imply any semantics to the core system. Labels can be vital for organizing and selecting subsets of objects. Labels can be attached to objects at creation time and added or modified later on. Each object can carry a set of labels, and every key must be unique for a given object.

When taking the Kubernetes learning journey, it is critical to understand how an application can be deployed in a Kubernetes cluster. Deployment is possible with a diverse selection of resource types, such as Pods, Deployments, ReplicaSets, and Services. Here, we explore what these resource types offer.

Pod is the smallest deployable unit in Kubernetes. One or more containers can be wrapped inside a single Pod. Containers embedded in a single Pod share the same resources and network. Each Pod has its own IP address, which can be used to communicate with other Pods.

Deployment defines the next level of abstraction above a Pod. The life cycle of Pods is managed via a Deployment, which handles everything needed to keep its Pods in the desired state.


One of the most prominent aspects of a Deployment is the ability to control the number of replicas; with Deployments, we do not have to maintain the replica count manually.

ReplicaSet keeps a consistent set of replica Pods running at all times. As a result, it is frequently used to guarantee the availability of a certain number of identical Pods. A ReplicaSet has fields such as (1) a selector that tells it how to find the Pods it can acquire, (2) a number of replicas that tells it how many Pods it should keep, and (3) a Pod template that tells it what data to put into new Pods created to meet the replica count. The objective of a ReplicaSet is then fulfilled by creating and deleting Pods as necessary to reach the specified number; new Pods are constructed from the Pod template.

Service is an abstract way to expose an application running on a set of Pods. It defines a logical set of Pods and a policy for accessing them. The need for Services arises because Pods are frequently created and destroyed: certain Pods have to communicate with other Pods inside the cluster, but a Pod's IP address changes whenever the Pod is re-created or restarted. A Kubernetes Service solves this problem by providing a stable virtual IP address in front of the changing set of Pods. For example, when a Service is created, its virtual IP address is advertised to every Pod (e.g., as an environment variable) so that other Pods can communicate with it. If the set of available Pods changes, the Service is updated, and traffic is directed appropriately without the need for manual intervention. A selector is typically used to determine the Pods targeted by a Service.
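To make the relationship between labels, selectors, and Services concrete, the following minimal sketch uses the Kubernetes Python client (introduced later, in Sect. 2.5) to list exactly the Pods that a Service with selector app: beta would target; the namespace and the label value are illustrative assumptions chosen to match Fig. 2.4.

# A minimal sketch, assuming a cluster reachable via the local kubeconfig and
# Pods labelled app=beta in the "default" namespace (illustrative assumptions).
from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

# List only the Pods carrying the label app=beta, i.e., the same set of Pods
# a Service whose selector is {app: beta} would route traffic to.
pods = core_api.list_namespaced_pod(namespace="default", label_selector="app=beta")
for pod in pods.items:
    print(pod.metadata.name, pod.status.pod_ip)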

2.4.3 Container Interfaces

Kubernetes is built using a bottom-up approach to facilitate the requirements of modular cloud-native applications. Kubernetes uses plugins, services, and interfaces to extend the platform's fundamental capabilities. Its inherent configuration allows us to adjust and personalize the overall platform; however, customization stretches beyond flags and local configuration settings. Extensions can integrate seamlessly with the rest of the architecture, providing native-like functionality and extending the cluster operator's command set. Custom hardware can also be supported using extensions. This section covers interface plugins that serve three distinct purposes: runtime plugins, network plugins, and storage plugins. In short, we will go over the concepts of the Container Runtime Interface (CRI), Container Network Interface (CNI), and Container Storage Interface (CSI) in terms of extending the Kubernetes platform's functionality.

2.4.3.1 Container Runtime Interface (CRI)

The container runtime is considered the heart of every Kubernetes deployment. It is the part of the architecture that manages hardware resources, starts and stops containers, and ensures containers have the resources they require to run efficiently. CRI plugins help users get the most out of the CRI API: with the correct plugin, runtime engines, such as Docker, can become more adaptable, and CRI plugins allow anyone to use alternative container runtimes without rewriting existing code. One of the most popular CRI implementations is CRI-O [9], a container runtime best known for being exceptionally light and agile. It is compatible with various Kubernetes flavors such as Kubic [10] (which comes pre-configured to run CRI-O), Minikube [11], and kubeadm. It fully complies with the Open Container Initiative (OCI) and eliminates the need for Docker.

2.4.3.2 Container Network Interface (CNI)

A CNI plugin is in charge of adding a network interface (i.e., one end of a virtual Ethernet (vEth) pair) to the container network namespace and making any required changes to the host (i.e., attaching the other end of the vEth pair to a bridge). By calling the relevant IP Address Management (IPAM) plugin, it then assigns an IP address to the interface and configures the routes according to the IPAM scheme. Several CNI plugins have been developed to serve Kubernetes clusters with varying connectivity requirements. One of the most well-known CNI plugins is Flannel [12]. Flannel carries out its network functions without the assistance of any external database; instead, it uses the Kubernetes API to set up its default VXLAN overlay. Flannel can also be integrated with other tools to satisfy the specific networking requirements of a large number of users.

2.4.3.3 Container Storage Interface (CSI)

The final concept concerns the storage interface. For managing storage devices, Kubernetes has traditionally relied on a built-in volume plugin mechanism, which was never open enough to allow third-party management tools to work without problems. CSI is regarded as the solution because it provides CSI volumes and dynamic storage block provisioning. Third-party storage vendors can now provide cluster operators with persistent and dynamic storage blocks without waiting months for support to be implemented in Kubernetes itself. A key distinction between CSI plugins and core Kubernetes volume plugins is that CSI plugins do not need to be built and delivered alongside the core Kubernetes binaries. Ceph [13] is one of the best-known CSI plugins. The Ceph CSI plugin interfaces between the container orchestrator and a Ceph cluster.


It enables Ceph volumes to be dynamically provisioned and attached to workloads. Separate CSI plugins are provided to support CephFS-backed volumes and RADOS Block Device (RBD) volumes.

2.4.4 Accessing and Managing Kubernetes Resources

This section looks at two Kubernetes management tools that we can employ to easily manage Kubernetes clusters.

2.4.4.1 Kubernetes Kubectl

The Kubernetes command-line tool is called kubectl [14]. We leverage it to execute commands against Kubernetes clusters to deploy applications, inspect and manage cluster resources, and view application logs. A quick reference guide is provided in Table 2.1 to help anyone get started with kubectl commands.

2.4.4.2 Kubernetes Dashboard

The Kubernetes dashboard [15] is the official general-purpose web user interface (UI) for Kubernetes clusters. The dashboard can be used to manage workloads deployed to a Kubernetes cluster, troubleshoot them, and provide control functions for the cluster's resources. It gives a quick overview of the applications deployed in the cluster and can create or change a specific Kubernetes resource (such as a Deployment or a Pod). For example, we can use a deployment wizard to scale a deployment, restart a failed Pod, or deploy additional workloads. Interestingly, the dashboard is nothing more than a Kubernetes-managed container that enables users to access cluster information. The dashboard UI is not enabled by default. We have to run the following command to deploy dashboard version 2.4.0 in the cluster:

$kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.4.0/aio/deploy/recommended.yaml

By default, the dashboard is deployed with a minimal role-based access control (RBAC) configuration to safeguard cluster data, and it only accepts Bearer Tokens as a form of authentication. To create a token for accessing the dashboard, we can follow the official guide [16]. The dashboard is then reached through a local proxy started with the kubectl command-line tool:

$kubectl proxy


Table 2.1 A quick reference guide to Kubernetes commands

Cluster info:
  Get clusters: $kubectl config get-clusters
  Get cluster info: $kubectl cluster-info
Contexts:
  Get list of contexts: $kubectl config get-contexts
  Switch context: $kubectl config use-context ctxt
Objects:
  Get details of objects: $kubectl get all|nodes|pods
Labels:
  Get pod labels: $kubectl get pods --show-labels
  Get pods by label: $kubectl get pods -l environment=production,tier!=backend
Describe:
  Describe object details: $kubectl describe pod/[id]
Pod:
  Generate YAML descriptor file: $kubectl run hello-world --generator=run-pod/v1 --image=hello-world:latest --output yaml --export --dry-run > pod.yml
  Create or update pod: $kubectl apply -f pod.yml
Deployment:
  Generate YAML descriptor file: $kubectl run hello-world --image=hello-world:latest --output yaml --export --dry-run > deploy.yml
  Create or update deployment: $kubectl apply -f deploy.yml
Service:
  Generate YAML descriptor file: $kubectl expose deployment hello-world --port 8080 --target-port=8080 --output yaml --export --dry-run > hello-world-service.yml
  Create or update service: $kubectl apply -f hello-world-service.yml
Delete:
  Delete an object: $kubectl delete pods|svc [id]
Logs:
  Get logs: $kubectl logs -l app=hello
  Watch logs in real time: $kubectl attach pod-name

Then, the dashboard is accessible via:

http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/

When we access the dashboard UI for the first time on an empty cluster, as shown in Fig. 2.5, we land on the welcome page. There, we can see which system applications are running in our cluster's kube-system namespace (including the dashboard itself). Note that the user interface is only accessible from the machine where the proxy command is run.


Fig. 2.5 Kubernetes dashboard UI

2.5 Kubernetes SDKs

This section provides an overview of the Kubernetes API [17] clients for different programming languages. If we create applications against the Kubernetes REST API, we do not have to implement the API methods and request or response types ourselves; we can use a client library for the programming language of our choice. Client libraries also handle routine activities such as authentication. If the API client is running inside the Kubernetes cluster, most client libraries can discover and use the Kubernetes service account to authenticate; otherwise, they can obtain the credentials and the API server address from the kubeconfig file. The client libraries for Dotnet, Go, Python, Java, JavaScript, and Haskell are officially maintained by Kubernetes SIG API Machinery; other client libraries are maintained by their authors rather than the Kubernetes team. To get anyone started, we provide two client library examples in this section.

2.5.1 Kubernetes Python Client

2.5.1.1 Installing the Python Client

We can install the Python client from source using the following commands.

$ git clone --recursive https://github.com/kubernetes-client/python.git
$ cd python
$ python setup.py install


Alternatively, we can install the Python client directly from PyPI using the following command.

$ pip install kubernetes

2.5.1.2 Using the Python Client

Using the Python client library, here is an example that shows how we can list all Pods deployed in a Kubernetes cluster. The given code snippets require Python 2.7 or 3.5+ to run successfully.

from kubernetes import client, config

# We may set configs in the Configuration class directly or use the helper utility.
config.load_kube_config()

k8s_version1_api = client.CoreV1Api()
print("List Pods with their IPs:")
ret_list = k8s_version1_api.list_pod_for_all_namespaces(watch=False)
for i in ret_list.items:
    print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.name, i.metadata.namespace))
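As noted in the introduction to Sect. 2.5, most client libraries can also authenticate with the Pod's service account when the code itself runs inside the cluster. The snippet below is a minimal sketch of that fallback pattern (try the in-cluster configuration first, then the local kubeconfig); the fallback logic is an illustrative convention rather than part of the official example.

from kubernetes import client, config
from kubernetes.config import ConfigException

# Prefer the in-cluster service account when running inside a Pod;
# otherwise fall back to the local kubeconfig file (e.g., a developer laptop).
try:
    config.load_incluster_config()
except ConfigException:
    config.load_kube_config()

v1 = client.CoreV1Api()
print("Number of Pods:", len(v1.list_pod_for_all_namespaces(watch=False).items))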

Here is another example leveraging the Python client library to create an Nginx deployment in the Kubernetes cluster.

"""
Create a deployment based on file deployment.yaml.
"""

from os import path

import yaml

from kubernetes import client, config


def main():
    config.load_kube_config()

    with open(path.join(path.dirname(__file__), "deployment.yaml")) as f:
        deployment_file = yaml.safe_load(f)

    k8s_version1_api = client.AppsV1Api()
    resp = k8s_version1_api.create_namespaced_deployment(
        body=deployment_file, namespace="default")
    print("New deployment is created. status='%s'" % resp.metadata.name)


if __name__ == '__main__':
    main()


Here is the content of deployment.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
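Once the Deployment above exists, the same client can also be used to change its desired state, for example by scaling the replica count. The following minimal sketch patches only the spec.replicas field; the deployment name and namespace match the YAML above, and the target replica count of 3 is an arbitrary example.

from kubernetes import client, config

config.load_kube_config()
apps_api = client.AppsV1Api()

# Patch only the spec.replicas field of the deployment created above;
# Kubernetes then reconciles the underlying ReplicaSet to the new count.
apps_api.patch_namespaced_deployment(
    name="my-nginx-deployment",
    namespace="default",
    body={"spec": {"replicas": 3}})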

2.5.2 Kubernetes Java Client

2.5.2.1 Installing the Java Client

We can install the Java client to our local Maven repository by executing the following commands.

$ git clone --recursive https://github.com/kubernetes-client/java
$ cd java
$ mvn install

When using a Maven project, we can include the Java client dependency via the project's pom.xml file.

<dependency>
    <groupId>io.kubernetes</groupId>
    <artifactId>client-java</artifactId>
    <version>10.0.0</version>
</dependency>

2.5.2.2 Using the Java Client

Using the Java client library, here is an example that shows how we can list all Pods deployed in a Kubernetes cluster.

import io.kubernetes.client.util.Config;
import io.kubernetes.client.openapi.Configuration;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.openapi.models.V1Pod;
import io.kubernetes.client.openapi.models.V1PodList;
import io.kubernetes.client.openapi.apis.CoreV1Api;
import java.io.IOException;

/**
 * A simple example of how to interact with a Kubernetes
 * cluster using the Java Client API.
 */
public class ListPods {
    public static void main(String[] args) throws IOException, ApiException {
        ApiClient api_client = Config.defaultClient();
        Configuration.setDefaultApiClient(api_client);

        CoreV1Api core_api = new CoreV1Api();
        V1PodList pod_list = core_api.listPodForAllNamespaces(
            null, null, null, null, null, null, null, null, null, null);
        System.out.println("Deployed Pods in Kubernetes Cluster:");
        for (V1Pod item : pod_list.getItems()) {
            System.out.println(item.getMetadata().getName());
        }
    }
}

Here is another example leveraging the Java client library to copy a file from a Pod to the host in a Kubernetes cluster.

import io.kubernetes.client.util.Streams;
import io.kubernetes.client.util.Config;
import io.kubernetes.client.util.exception.CopyNotSupportedException;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.Configuration;
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.Copy;
import java.io.InputStream;
import java.io.IOException;

/**
 * An example of how to copy a file hello-world.yaml
 * from Pod to host using the Java API.
 */
public class CopyFile {
    public static void main(String[] args) throws ApiException, IOException,
            CopyNotSupportedException, InterruptedException {
        String pod_namespace = "kube-system";
        String pod_name = "hello-world";

        ApiClient api_client = Config.defaultClient();
        Configuration.setDefaultApiClient(api_client);

        Copy copy = new Copy();
        InputStream dataStream = copy.copyFileFromPod(
            pod_namespace, pod_name, "hello-world.yaml");
        Streams.copy(dataStream, System.out);

        System.out.println("File copied.");
    }
}

2.6 Summary

This chapter covered the fundamental concepts of containers and related technologies. A container is a self-contained application package that can easily run in any production environment. Containers have grown in popularity because they are simple to use and well suited to edge computing environments, as they are significantly lighter than virtual machines. In particular, containers are popular for new, advanced software development and for migrating legacy applications. This chapter also discussed the various container engines available to manage individual containers. The choice of container engine depends on the use case; Docker, for example, is one of the most popular container engines due to its user-friendly features. Despite these ease-of-use benefits, a standalone container engine proves insufficient for managing a container's entire life cycle: as the number of containers increases, operators have to put in more effort to manage them efficiently. As a result, they need a container orchestration platform to cut down on the time spent managing container life cycles. Among the various container orchestration platforms, Kubernetes is the most popular open-source solution.


It provides numerous features for quickly deploying containers across a cluster of edge nodes, along with both command-line and graphical tools for managing clusters (e.g., kubectl and the Kubernetes dashboard UI). In addition, Kubernetes client APIs for various programming languages are now available, which should further increase Kubernetes adoption.

References

1. Docker. Docker Documentation. https://docs.docker.com/get-started/overview/. Accessed 2021-11-01.
2. Podman. https://docs.podman.io/en/latest/. Accessed 2021-11-05.
3. LXD. https://linuxcontainers.org/lxd/introduction/. Accessed 2021-11-05.
4. Kubernetes. https://kubernetes.io/docs/concepts/overview/components/. Accessed 2021-10-31.
5. Swarm. Docker Documentation. https://docs.docker.com/engine/swarm/. Accessed 2021-11-05.
6. OpenShift. https://www.redhat.com/en/technologies/cloud-computing/openshift/container-platform. Accessed 2021-11-05.
7. Apache Mesos. http://mesos.apache.org/. Accessed 2021-11-05.
8. Nomad by HashiCorp. https://www.nomadproject.io/. Accessed 2021-11-05.
9. CRI-O. https://cri-o.io/. Accessed 2021-11-22.
10. openSUSE Kubic. https://kubic.opensuse.org/. Accessed 2021-11-22.
11. Minikube. https://minikube.sigs.k8s.io/docs/start/. Accessed 2021-11-22.
12. Flannel. https://github.com/flannel-io/flannel. Accessed 2021-11-22.
13. Ceph CSI. https://github.com/ceph/ceph-csi. Accessed 2021-11-22.
14. Kubectl. https://kubernetes.io/docs/tasks/tools/. Accessed 2021-11-18.
15. Kubernetes Dashboard. https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/. Accessed 2021-11-18.
16. Creating a sample user. https://github.com/kubernetes/dashboard/blob/482ca6993db46abd7762dee6aa05e4f9e3dcb06a/docs/user/access-control/creating-sample-user.md. Accessed 2021-11-18.
17. Client Libraries. https://kubernetes.io/docs/reference/using-api/client-libraries/. Accessed 2021-11-18.

Chapter 3

AI/ML for Service Life Cycle at Edge

Abstract This chapter analyzes how artificial intelligence and machine learning algorithms can be used to advance edge computing. In the first section, we propose several key performance indicators that are universal for edge computing problems at different levels of the hierarchy, including performance, cost, and efficiency. We divide the typical problems in the life cycle of services in edge computing systems into topology, content, and service, following a bottom-up approach. In the second section, we use several examples of micro-service function chain redundancy placement and deployment problems to show how intelligent algorithms help in decision-making for the deployment mode of services. In the third section, we study how online learning algorithms, especially deep reinforcement learning methods, help in solving the job scheduling problems in the running mode of services. Finally, we study the sample-and-learning framework for the load balancing problems in the operation mode of services.

This chapter reuses literal text and materials from
• S. Deng et al., "Burst Load Evacuation Based on Dispatching and Scheduling In Distributed Edge Networks," in IEEE Transactions on Parallel and Distributed Systems, 2021, https://doi.org/10.1109/TPDS.2021.3052236. ©2021 IEEE, reprinted with permission.
• S. Deng et al., "Dynamical Resource Allocation in Edge for Trustable Internet-of-Things Systems: A Reinforcement Learning Method," in IEEE Transactions on Industrial Informatics, 2020, https://doi.org/10.1109/TII.2020.2974875. ©2020 IEEE, reprinted with permission.
• S. Deng et al., "Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence," in IEEE Internet of Things Journal, 2020, https://doi.org/10.1109/JIOT.2020.2984887. ©2020 IEEE, reprinted with permission.
• H. Zhao et al., "Distributed Redundant Placement for Microservice-based Applications at the Edge," in IEEE Transactions on Services Computing, 2022, https://doi.org/10.1109/TSC.2020.3013600. ©2022 IEEE, reprinted with permission.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Taheri et al., Edge Intelligence, https://doi.org/10.1007/978-3-031-22155-2_3


3.1 Introduction

As a distributed computing paradigm, mobile edge computing (MEC) is expected to decentralize data and provide services with robustness and elasticity at the network edge. Edge computing faces resource allocation problems in different layers, such as CPU cycle frequency, access jurisdiction, radio frequency, and bandwidth. As a result, it has great demand for powerful optimization tools to enhance system efficiency. Both heuristics and AI technologies are capable of handling this task. Essentially, AI models extract unconstrained optimization problems from real scenarios and then find asymptotically optimal solutions iteratively with stochastic gradient descent (SGD) methods; either statistical learning methods or deep learning methods can help. Reinforcement learning, including multi-armed bandit theory, multi-agent learning, and deep Q-networks (DQN), is playing a growing and important role in resource allocation problems for the edge [1]. When optimizing the edge, we focus on the following indicators.

Performance of optimization algorithms is problem-dependent. For example, it could be the ratio of successful offloading when it comes to computation offloading problems. Similarly, it could be the service provider's revenue (to be maximized) and the cost of hiring base stations (BSs) (to be minimized) when it comes to service placement problems. Although the computation scenario has changed from cloud clusters to a complex system of devices, edges, and cloud, this criterion still plays a very important role.

Cost usually consists of computation cost, communication cost, and energy consumption. Computation cost reflects the demand for computing resources such as achieved CPU cycle frequency and allocated CPU time. Communication cost reflects the demand for communication resources such as power, frequency band, and access time. Many works also focus on minimizing the delay (latency) caused by the allocated computation and communication resources. Energy consumption is not unique to edge computing, but it is more crucial here due to the limited battery capacity of mobile devices. Cost reduction is crucial because edge computing promises a dramatic reduction in delay and energy consumption by tackling the key challenges for realizing 5G.

Efficiency promises a system with excellent performance and low overhead. The pursuit of efficiency is the key factor in improving existing algorithms and models, especially for AI on edge. Many approaches such as model compression, conditional computation, and algorithm asynchronization have been proposed to improve the efficiency of training and inference of deep AI models.

Using a bottom-up approach, the key concerns in how AI optimizes the edge are categorized into three layers: topology, content, and service.

Topology: We pay close attention to the orchestration of edge sites (OES) and wireless networking (WN). We define an edge site as a micro data center with applications deployed, attached to a small-cell base station (SBS).


OES studies the deployment and installation of wireless telecom equipment and servers. In recent years, research efforts on the management and automation of unmanned aerial vehicles (UAVs) have become very popular [2–4]. UAVs with a small server and an access point can be regarded as moving edge servers with maneuverability. Therefore, many works explore scheduling and trajectory planning problems with the minimization of the energy consumption of UAVs. For example, Chen et al. study the power consumption of UAVs when caching popular contents under predictions, where a conceptor-based echo state network (ESN) algorithm is proposed to learn the mobility pattern of users. With the help of this effective machine learning technique, the proposed algorithm greatly outperforms benchmarks in terms of transmit power and QoE satisfaction. WN studies data acquisition and network planning. The former concentrates on the fast acquisition of rich, but highly distributed, data at subscribed edge devices, while the latter concentrates on network scheduling, operation, and management. Fast data acquisition includes multiple access, radio resource allocation, and signal encoding/decoding. Network planning studies efficient management with protocols and middleware. In recent years, there has been an increasing trend toward intelligent networking, which involves building intelligent wireless communication mechanisms with popular AI technologies. For example, Zhu et al. propose learning-driven communication, which exploits the coupling between communication and learning in edge learning systems [5]. Sun et al. study resource management in F-RANs (fog radio access networks) with DRL. In order to minimize long-term system power consumption, an MDP is formulated, and the DQN technique is utilized to make intelligent decisions on the user equipment's communication modes [6].

Content: We place an emphasis on data provisioning, service provisioning, service placement, service composition, and service caching. For data and service provisioning, the available resources can be provided by remote cloud data centers and edge servers. In recent years, there have been research efforts on constructing lightweight QoS-aware service-based frameworks [7–9]. The shared resources can also come from mobile devices if a proper incentive mechanism is employed. Service placement is an important complement to service provisioning; it studies where and how to deploy complex services on possible edge sites. In recent years, many works have studied service placement from the perspective of application service providers (ASPs). For example, Chen et al. deploy services on basic communication and computation infrastructures under a limited budget [10]. Multi-armed bandit theory, a branch of reinforcement learning, was adopted to optimize the service placement decision. Service composition studies how to select candidate services for composition in terms of energy consumption and the QoE of mobile end users [11]. It opens up research opportunities where AI technologies can be utilized to generate better service selection schemes. Service caching can also be viewed as a complement to service provisioning; it studies how to design a caching pool to store frequently visited data and services. Service caching can also be studied in a cooperative way [12], which leads to research opportunities where multi-agent learning can be utilized to optimize QoE in large-scale edge computing systems.


Service: We focus on computation offloading, user profile migration, and mobility management. Computation offloading studies the load balancing of various computational and communication resources through edge server selection and frequency spectrum allocation. More and more research efforts focus on dynamically managing the radio and computational resources of multi-user multi-server edge computing systems, for example, utilizing Lyapunov optimization techniques [13, 14]. In recent years, optimizing computation offloading decisions via DQN has become popular [15–17]. It models the computation offloading problem as a Markov decision process (MDP) and maximizes the long-term utility performance. The utility can be composed of the above QoE indicators and evolves according to the iterative Bellman equation. Asymptotically optimal computation offloading decisions are achieved based on a deep Q-network. User profile migration studies how to adjust the placement of user profiles (configuration files, private data, logs, etc.) when mobile users are continuously moving. User profile migration is often associated with mobility management [18]. In [19], the proposed JCORM algorithm jointly optimizes computation offloading and migration by formulating cooperative networks. It opens research opportunities where more advanced AI technologies can be utilized to improve optimality. Many existing research efforts study mobility management from the perspective of statistics and probability theory, and there is strong interest in realizing mobility management with AI. In the following, we present the state-of-the-art research works on this topic.

Figure 3.1 gives an example of how AI technologies are utilized in the mobile edge computing (MEC) environment. Firstly, we need to identify the problem to be studied. Taking performance optimization as an example, the optimization goal, decision variables, and potential constraints need to be confirmed. The goal to be optimized could be a combination of task execution delay, transmission delay, and task dropping cost. Tasks can be offloaded either in full or in part. If the long-term stability of the system is considered, the Lyapunov optimization technique can be used to formalize the problem. Finally, we should design an algorithm to solve the problem. In fact, the model construction is decided not only by the problem under study but also by the optimization algorithm to be applied. Taking DQN as an example, we have to model the problem as an MDP with finite states and actions. Thus, constraints cannot appear explicitly in the long-term optimization problem; the most common way to handle them is to reformulate the constraints as penalties and add the associated penalty terms to the optimization goal.
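As a hedged illustration of this "constraints as penalties" idea, the sketch below defines the kind of per-slot reward a DQN agent could maximize for a binary offloading decision; all weights, the deadline, and the cost structure are illustrative assumptions, not a formulation taken from the works cited in this chapter.

# A minimal sketch of reformulating constraints as penalties in the per-slot
# reward of an RL agent; every weight and threshold below is an assumption.

def reward(exec_delay, handover_delay, dropped, energy, battery_level,
           deadline=0.1, w_delay=1.0, w_drop=5.0, w_energy=0.5, w_violation=10.0):
    """Negative cost: delay, task-dropping cost, and energy, with constraint
    violations (missed deadline, depleted battery) added as penalty terms."""
    cost = w_delay * (exec_delay + handover_delay) + w_drop * dropped \
        + w_energy * energy
    # Constraints cannot appear explicitly in the long-term problem,
    # so they are re-expressed as penalties on the optimization goal.
    if exec_delay + handover_delay > deadline:
        cost += w_violation
    if battery_level <= 0.0:
        cost += w_violation
    return -cost

# Example: a slot where the task met its deadline on a charged battery.
print(reward(exec_delay=0.04, handover_delay=0.01, dropped=0,
             energy=0.2, battery_level=0.8))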

Fig. 3.1 The utilization of AI technology for performance optimization (from problem definition, through model construction, to algorithm design with a DQN-based solution). ©2020 IEEE, reprinted, with permission, from S. Deng et al. (2020)

3.1.1 State of the Art

Considering that current research efforts on AI for edge concentrate on wireless networking, service placement, service caching, and computation offloading, we only focus on these topics in this chapter.

3.1.1.1 Wireless Networking

5G technology promises eMBB, URLLC, and mMTC in a real-time and highly dynamic environment. Researchers have reached a consensus that AI technologies should and can be integrated across the wireless infrastructure and mobile users [20]. We believe that AI should be applied to achieve intelligent network optimization in a fully online manner. One of the typical works in this area is [5].


This paper advocates a new set of design principles for wireless communication on edge, with machine learning technologies and models embedded, which are collectively named learning-driven communication. It can be achieved across the whole process of data acquisition, which consists in turn of multiple access, radio resource management, and signal encoding.

Learning-driven multiple access advocates that the unique characteristics of wireless channels should be exploited for functional computation. Over-the-air computation (AirComp) is a typical technique used to realize it [21, 22]. In [23], the authors put this principle into practice based on broadband analog aggregation (BAA). Concretely, Zhu et al. suggest that simultaneously transmitted model updates in federated learning should be aggregated in the analog domain by exploiting the waveform-superposition property of multi-access channels [23]. The proposed BAA can dramatically reduce communication latency compared with traditional orthogonal frequency-division multiple access (OFDMA). The work in [24] explores over-the-air computation for model aggregation in federated learning. More specifically, Yang et al. put the principle into practice by modelling the device selection and beam-forming design as a sparse and low-rank optimization problem, which is computationally intractable [24]. To solve the problem with a fast convergence rate, they proposed a difference-of-convex-functions (DC) representation via successive convex relaxation. The numerical results show that the proposed algorithm can achieve lower training loss and higher inference accuracy compared with state-of-the-art approaches. This contribution can also be categorized as model adaptation in AI on edge, but it accelerates federated learning from the perspective of fast data acquisition.

Learning-driven radio resource management promotes the idea that radio resources should be allocated based on the value of the transmitted data, not just the efficiency of spectrum utilization. Therefore, it can be understood as importance-aware resource allocation, and an obvious approach is importance-aware re-transmission. In [25], the authors proposed a re-transmission protocol named importance-aware automatic-repeat-request (importance ARQ). Importance ARQ makes the trade-off between signal-to-noise ratio (SNR) and data uncertainty under the desired learning accuracy. It can achieve fast convergence while avoiding learning performance degradation caused by channel noise.

Learning-driven signal encoding stipulates that signal encoding should be designed by jointly optimizing feature extraction, source coding, and channel encoding. To that end, the authors of [26] propose a hybrid federated distillation (HFD) scheme based on separate source-channel coding and over-the-air computing. It adopts sparse binary compression with error accumulation in source-channel coding. For both digital and analog implementations over Gaussian multiple-access channels, HFD can outperform the vanilla version of federated learning in a poor communication environment. This principle has something in common with dimensionality reduction and quantization from model adaptation in AI on edge, but it reduces the feature size at the source of data transmission. It opens up great research opportunities for the co-design of learning frameworks and data encoding.


Learning-driven communication contributes to AI for wireless networking from the perspective of energy consumption and radio resource efficiency. Zhang et al. propose a deep reinforcement learning (DRL)-based decentralized algorithm to maximize the sum capacity of vehicle-to-infrastructure users while meeting the latency and reliability requirements of vehicle-to-vehicle (V2V) pairs [27]. Lu et al. design a deep RL-based energy trading algorithm to address the supply-demand mismatch problem for a smart grid with a large number of microgrids (MGs) [28]. Shen et al. utilize graph neural networks (GNNs) to develop scalable methods for power control in K-user interference channels [29]. They first model the K-user interference channel as a complete graph and then use it to learn the optimal power control with a graph convolutional neural network. Temesgene et al. study an energy minimization problem where the baseband processes of virtual small cells, powered solely by energy harvesters and batteries, can be opportunistically executed in a grid-connected edge server [30]. Based on multi-agent learning, several distributed fuzzy Q-learning-based algorithms are tailored to this problem.

3.1.1.2 Service Placement and Caching

Many researchers study service placement from the perspective of application service providers (ASPs). They model the data and service placement problem as a Markov decision process (MDP) and utilize AI methods (such as reinforcement learning) to achieve the optimal placement decision. A typical work implementing this idea is [10], where the authors propose a spatial-temporal algorithm based on the multi-armed bandit (MAB); they achieve the optimal placement decisions while learning the benefit of each choice. They also study how many SBSs should be rented for edge service hosting to maximize the expected utility within predefined time intervals, where the expected utility is composed of the delay reduction of all mobile users. Another MAB-based algorithm, named SEEN, is also proposed to learn the local users' service demand patterns of SBSs. It automatically balances exploitation and exploration according to whether a given set of SBSs has been chosen before. Another work attempts to integrate AI approaches with service placement [31]. With a DQN-based algorithm, this work jointly decides on which SBS to deploy each data block and service component, as well as how much harvested energy should be stored in mobile devices.

Service caching can be viewed as a complement to service placement [32]. Edge servers can be equipped with a dedicated service cache to satisfy user demands for popular contents. A wide range of optimization problems on service caching have been proposed to endow edge servers with learning capability. Sadeghi et al. study sequential fetch-cache decisions based on dynamic prices and user requests [12]. They consider SBSs with efficient fetch-cache decision-making schemes operating in dynamic settings and formulate a cost minimization problem that takes service popularity into account. For the long-term stochastic optimization problem, several computationally efficient algorithms were developed based on Q-learning.

3.1.1.3 Computation Offloading

Computation offloading can be considered the most active topic when it comes to AI for edge. It studies the transfer of resource-intensive computational tasks from resource-limited mobile devices to the edge or cloud. This process involves the allocation of many resources, ranging from CPU cycles to channel bandwidth. Therefore, AI technologies with strong optimization abilities have been extensively used in recent years. Among these AI technologies, Q-learning and its derivatives (such as DQN) are in the spotlight. For example, Qiu et al. design a Q-learning-based algorithm for computation offloading [33]. They formulate the computation offloading problem as a non-cooperative game in multi-user multi-server edge computing systems and prove that a Nash equilibrium exists. They also propose a model-free Q-learning-based offloading mechanism to help mobile devices learn their long-term offloading strategies toward maximizing their long-term utilities.

More works are based on DQN, because the curse of dimensionality can be overcome with non-linear function approximation. For example, Min et al. studied computation offloading for IoT devices with energy harvesting in multi-server MEC systems [15]. The utility to be maximized is formed from the overall data sharing gains, the task dropping penalty, energy consumption, and computation delay, and it is updated according to the Bellman equation. DQN is then used to generate the optimal offloading scheme. In [16] and [35], the computation offloading problem is formulated as an MDP with finite states and actions. The state set is composed of the channel qualities, the energy queue, and the task queue; the action set is composed of offloading decisions in different time slots. A DQN-based algorithm was then proposed to minimize the long-term cost. Based on DQN, task offloading decisions and wireless resource allocation are jointly optimized to maximize the data acquisition and analysis capability of the network [36, 37]. The work in [38] studied the knowledge-driven service offloading problem for the Internet of Vehicles. The problem was formulated as a long-term planning optimization problem and solved based on DQN. In summary, computation offloading problems in various industrial scenarios have been extensively studied from all sorts of perspectives.

There also exist works that explore the task offloading problem with other AI technologies. For example, the authors of [39] proposed a long short-term memory (LSTM) network to predict task popularity and then formulated a joint optimization of the task offloading decisions, computation resource allocation, and caching decisions. A Bayesian learning automata-based multi-agent learning algorithm was then proposed to find the optimal placement.

3.1.2 Grand Challenges

In this section, we will highlight grand challenges across the whole theme of AI for edge research. Although closely related, each challenge has its own merits.


Model Establishment: If we want to use AI methods, the mathematical models have to be limited, and the formulated optimization problem needs to be restricted. On the one hand, this is because the optimization basis of AI technologies, the SGD (stochastic gradient descent) and MBGD (mini-batch gradient descent) methods, may not work well if the original search space is constrained. On the other hand, especially for MDPs, the state set and action set cannot be infinite, and discretization is necessary to avoid the curse of dimensionality before further processing. The common solution is to change the constraints into penalties and incorporate them into the global optimization goal. This status quo greatly restricts the establishment of mathematical models, which in turn leads to performance degradation. It can be viewed as a compromise for the utilization of AI methods. Therefore, how to establish an appropriate system model poses great challenges.

Algorithm Deployment: The state-of-the-art approaches often formulate a combinatorial, NP-hard optimization problem with fairly high computational complexity. Very few approaches can actually achieve an analytic, approximately optimal solution with convex optimization methods. In fact, for AI for edge, the solution mostly comes from iterative learning-based approaches, and therefore many challenges arise when implementing them on the edge in an online manner. Furthermore, the decision about which edge device should undertake the responsibility for deploying and running the proposed complicated algorithms is often left unanswered. The existing research efforts usually concentrate on their specific problems and do not provide details on how they could be deployed on real platforms.

Balance Between Optimality and Efficiency: Although AI technologies can provide solutions that are optimal, the trade-off between optimality and efficiency cannot be ignored when it comes to resource-constrained edges. Thus, how to improve the usability and efficiency of edge computing systems for different application scenarios with AI technologies embedded is a severe challenge. The trade-off between optimality and efficiency should be struck based on the characteristics of dynamically changing QoE requirements and the network resource structure. It is often coupled with the service subscribers' pursuit of superiority and the utilization of available resources.

3.2 AI/ML for Service Deployment

MEC brings developments in the network architecture and innovation in service patterns. Considering that small-scale data centers can be deployed near cellular towers, there are exciting possibilities that micro-service-based applications can be delivered to mobile devices without backbone transmission. Container technologies (such as Docker [40] and its dominant orchestration and maintenance tool, Kubernetes [41]) are becoming the mainstream solution for packaging, deploying, maintaining, and healing applications. Each micro-service can be decoupled from the application and packaged as a Docker image. Kubernetes is naturally suitable for building cloud-native applications that leverage the benefits of the distributed edge, because it can hide the complexity of micro-service orchestration while managing their availability with lightweight containers. This has greatly motivated application service providers (ASPs) to participate in service provision within the access and core networks.

Service deployment by ASPs is the carrier of service provision; it touches on where to place the services and how to deploy their instances. In the last two years, research has studied placement at the network edge from the perspective of the quality of experience (QoE) of end users or the budget of ASPs [42–48]. These works commonly have two limitations. First, the to-be-deployed service is only studied in an atomic way: it is often treated as a single abstract function with given input and output data sizes, and more complex properties of services (such as time series or composition properties) are not fully taken into consideration. Second, high availability of the deployed service is not carefully studied. Due to the heterogeneity of edge sites (such as different CPU cycle frequencies and memory footprints, varying background load, and transient network interrupts), the service provision platform might greatly slow down or even crash. The default assignment, deployment, and management of containers do not fully take the heterogeneity of both physical and virtualized nodes into consideration. Besides, the healing capability of Kubernetes consists principally of monitoring the status of containers, pods, and nodes and restarting failed services in time; this is not enough for high availability. Vayghan et al. found that, in their specific test environment, when a pod failure happens, the outage time of the corresponding service could be dozens of seconds, whereas when a node failure happens, the outage time could be dozens of minutes [49, 50]. Therefore, with the vanilla version of Kubernetes, high availability might not be ensured, especially for latency-critical cloud-native applications. Moreover, one micro-service could have several alternative execution solutions. For example, electronic payment, as a micro-service of a composite service, can be executed by PayPal, WeChat Pay, or Alipay.1 This versatility further complicates the placement problem, because the placement and speed of such services can greatly exacerbate it.

In order to solve the above problems, in this chapter we propose a distributed redundant placement framework, sample average approximation-based redundancy placement (SAA-RP), for micro-service-based applications with a sequential combinatorial structure. For such an application, if all micro-services are placed on one edge site, network congestion is inevitable. Therefore, we adopt a distributed placement scheme, which is naturally suitable for the distributed edge. Redundancy is the core of SAA-RP: one candidate micro-service is allowed to be dispatched to multiple edge sites. By creating multiple candidate instances, it enables a faster response to service requests. To be specific, it alleviates the risk of a long delay when a candidate is assigned to only one edge site. With one candidate deployed on more than one edge site, requests from different end users at different locations can be balanced, so as to ensure the high availability of the service and the robustness of the provision platform. Specifically, we derive expressions to decide where each candidate should be dispatched, along with the number of instances each edge site must host. By collecting user requests for different service composition schemes, we model the distributed redundant placement as a stochastic discrete optimization problem and approximate the expected value through Monte Carlo sampling. During each sampling, we solve the deterministic problem with an efficient evolutionary algorithm.

1 Both Alipay and WeChat Pay are third-party mobile and online payment platforms, established in China.

3.2.1 Motivation Scenarios

3.2.1.1 The Heterogeneous Network

Let us consider a typical scenario for the pre-5G heterogeneous network (HetNet), which is the physical foundation of redundant service placement at the edge. As demonstrated in Fig. 3.2, for a given region, the wireless infrastructure of the access network can be simplified into a macro base station (MBS) and several small-cell base stations (SBSs). The MBS is indispensable in any HetNet to provide ubiquitous coverage and support capacity; its cell radius ranges from 8 km to 30 km. The SBSs, including femtocells, microcells, and picocells, are part of the network densification for densely populated urban areas. Without loss of generality, WiFi access points, routers, and gateways are viewed as SBSs for simplification. Their cell radius ranges from 0.01 km to 2 km. The SBSs can be logically interconnected to transfer signaling, broadcast messages, and select routes. To match realistic deployments where SBSs are set up by different mobile telecom carriers (MTCs), we assume that SBSs are not fully interconnected and not always directly reachable from each other; however, we reasonably assume that any two SBSs are mutually reachable over multiple hops, so that the SBSs form an undirected connected graph. This can be seen in Fig. 3.2. Each SBS has a corresponding small-scale data center attached for the deployment of micro-services and the allocation of resources. In this scenario, end users with their mobile devices can move arbitrarily within a certain range. For example, an end user may work within a building and rest at home. In this case, the connected SBS of each end user does not change.

Fig. 3.2 A typical scenario for a pre-5G HetNet. ©2022 IEEE, reprinted, with permission, from H. Zhao et al. (2022)

3.2.1.2 Response Time of Micro-Services

A micro-service-based application consists of multiple micro-services, and each micro-service can be executed by one of several available candidate services. Take an arbitrary e-commerce application as an example. When we shop in a client browser, we first search the items we want using site search APIs. Second, we add them to the cart and pay for them. The electronic payment can be accomplished by Alipay, WeChat Pay, or PayPal by invoking their APIs. After that, we can review and rate the purchased items. In this example, each micro-service is focused on a single business capability. In addition, the considered application might have complex compositional structures with sophisticated correlations between the before-and-after candidate micro-services to handle bundle sales. For example, when we are shopping on Taobao (the world's biggest e-commerce website, established in China), only Alipay is supported for online payment.

The application in the above example has a linear structure. For simplicity, we only address sequentially composed applications in this chapter. Note that any complicated case that can be represented using a directed acyclic graph (DAG) can be decomposed into several linear chains by applying the flow decomposition theorem [51].

The pre-5G HetNet allows SBSs to share a mobile service provision platform, where user configurations and contextual information can be uniformly managed [52]. As we have mentioned before, the unified platform can be implemented with Kubernetes. In our scenario, each mobile device sends its service request to the nearest SBS (e.g., the one with the strongest signal). However, if there is no SBS accessible, the request has to be responded by the MBS and processed by cloud data centers. All the possibilities of the response status of the first micro-service are discussed below.


1. The requested service is deployed at the chosen SBS. It will be processed by this SBS instantly.
2. The requested service is not deployed at the nearest SBS, but is accessible on other SBSs. This leads to multi-hop transfers between the SBSs until the request is responded by another SBS. In this case, the request will route through the HetNet until it is responded by an SBS that deploys the required service.
3. The requested service is not deployed on any SBS in the HetNet. It can only be processed by the cloud through backbone transmission.

For the subsequent micro-services, the response status also faces several possibilities:

1. The previous service was processed by an SBS. Under this circumstance, if the service instance can be found in the HetNet, a multi-hop transfer is required. Otherwise, it has to be processed by the cloud.
2. The previous service was processed by the cloud. Under this circumstance, the service instances for the subsequent micro-services should always be responded by the cloud, without unnecessary backhaul.

Our job is to find an optimal redundant placement policy with the trade-off between resource occupation and response time considered. We should know which services should be made redundant and where to deploy them.

3.2.1.3 A Working Example

This subsection describes a small-scale working example.

Micro-Services and Candidates: Figure 3.3 demonstrates a chained application consisting of four micro-services. The micro-services have two, one, two, and two candidates, respectively. The first service composition scheme is $c_1^1 \to c_2^1 \to c_3^1 \to c_4^2$, and the second service composition scheme is $c_1^2 \to c_2^1 \to c_3^2 \to c_4^1$. In practice, the composition scheme is decided by the daily usage profile of end users. Because of bundle sales, part of the composition might be fixed.

Fig. 3.3 Two service composition schemes for a four-micro-service app. ©2022 IEEE, reprinted, with permission, from H. Zhao et al. (2022)

Service Placement of Instances: Figure 3.4 shows the undirected connected graph consisting of six SBSs. The number tagged inside each SBS is its maximum number of placeable services. For example, at most four services can be placed on SBS2. This constraint exists because the edge sites have very limited computation and storage resources. The squares beside each SBS are the deployed services. For example, SBS1 deploys two services ($c_1^1$ and $c_2^1$). Notice that, because of the redundancy mechanism, the same service might be deployed on multiple SBSs. For example, $c_2^1$ is dispatched to both SBS1 and SBS2. Figure 3.4 also demonstrates two mobile devices, MD1 and MD2, which are located beside SBS1 and SBS5, respectively; that is, MD1's closest SBS is SBS1 and MD2's closest SBS is SBS5. Note that the service request of each mobile device is responded by the nearest SBS. Thus, SBS1 and SBS5 are the responding SBSs for MD1 and MD2, respectively.

Fig. 3.4 The placement of each candidate on the HetNet. ©2022 IEEE, reprinted, with permission, from H. Zhao et al. (2022)

Response Time Calculation: We assume that the service composition scheme of MD1 is the red one in Fig. 3.3 and MD2's is the blue one. The number tagged inside each candidate is the executing sequence. Let us take a closer look at MD1.


$c_1^1$ is deployed on SBS1; thus, the response time of $c_1^1$ is equal to the sum of the expenditure of time on wireless access between MD1 and SBS1 and the processing time of $c_1^1$ on SBS1. $c_2^1$ can also be found on SBS1; thus, its transfer expenditure is zero, and the response time of $c_2^1$ consists only of the processing time of $c_2^1$ on SBS1. $c_3^1$ is on SBS2; thus, its transfer expenditure is equal to the routing time from SBS1 to SBS2. Here, we assume that the routing between two nodes always selects the nearest path in the undirected graph (one hop, SBS1 $\to$ SBS2, in this case). Thus, the response time of $c_3^1$ consists of the routing time from SBS1 to SBS2 and the processing time of $c_3^1$ on SBS2. $c_4^2$ is only on SBS6; thus, reaching it requires two hops (SBS2 $\to$ SBS4 $\to$ SBS6 or SBS2 $\to$ SBS5 $\to$ SBS6). Finally, the output of $c_4^2$ needs to be transferred back to MD1 via SBS1. The nearest path from SBS6 to SBS1 requires three hops (e.g., SBS6 $\to$ SBS4 $\to$ SBS2 $\to$ SBS1). Thus, the response time of $c_4^2$ consists of the routing time from SBS2 to SBS6, the processing time of $c_4^2$ on SBS6, the routing time from SBS6 to SBS1, and the wireless transmission time from SBS1 to MD1.

The response time of the first composition scheme is the sum of the response times of $c_1^1$, $c_2^1$, $c_3^1$, and $c_4^2$. The same procedure applies to MD2. In addition, two unexpected cases need to be addressed. The first one is that if a mobile device is covered by no SBS, the response should be made by the MBS, and all micro-services need to be processed by the cloud. The second one is that if a required service is not deployed on any SBS, then a communication link from the SBS processing the last service to the cloud should be established; this service and all services after it will be processed on the cloud. The response time is calculated in a different way for these cases. As can be inferred, a better service placement policy can lead to less time spent. Our aim is to find a service placement policy that minimizes the response time of all mobile devices. What needs to be determined is not only how many instances are required for each candidate but also which edge sites should host them. The next section will demonstrate the system model.

3.2.2 System Model

The HetNet consists of $N$ mobile devices, indexed by $\mathcal{N} \triangleq \{1, \ldots, i, \ldots, N\}$; $M$ SBSs, indexed by $\mathcal{M} \triangleq \{1, \ldots, j, \ldots, M\}$; and one MBS, indexed by 0. Considering that each mobile device might be covered by several SBSs, let us use $\mathcal{M}_i$ to denote the set of SBSs whose wireless signal covers the $i$th mobile device. Correspondingly, $\mathcal{N}_j$ denotes the set of mobile devices that are covered by the $j$th SBS. Notice that the service request from a mobile device is responded by its nearest available SBS, and otherwise by the MBS. The MBS here provides ubiquitous signal coverage and is always connectable to each mobile device. Both the SBSs and the MBS are connected to the backbone.

3.2.2.1 Describing the Correlated Micro-Services

We consider an application with $Q$ sequential composite micro-services $t_1, \ldots, t_Q$, indexed by $\mathcal{Q}$. $\forall q \in \mathcal{Q}$, micro-service $t_q$ has $C_q$ candidates $\{s_q^1, \ldots, s_q^c, \ldots, s_q^{C_q}\}$, indexed by $\mathcal{C}_q$. Let us use $\mathcal{D}(s_q^c) \subseteq \mathcal{M}$ to denote the set of SBSs on which $s_q^c$ is deployed. Our redundancy mechanism allows that $|\mathcal{D}(s_q^c)| > 1$; it means that a service instance could be dispatched to more than one SBS. Besides, let us use $E(s_q^c)$, $c \in \mathcal{C}_q$, to represent that, for micro-service $t_q$, the $c$th candidate is selected for execution. $E(s_q^c)$ can be viewed as a random event with an unknown distribution. Further, $\mathbb{P}(E(s_q^c))$ denotes the probability that $E(s_q^c)$ happens. Thus, for each mobile device, its selected candidates can be described as a $Q$-tuple:

$$E(s) \triangleq \big( E(s_1^{c_1}), \ldots, E(s_Q^{c_Q}) \big), \qquad (3.1)$$

where $q \in \mathcal{Q}$, $c_q \in \mathcal{C}_q$. The sequential composite application might have correlations between the before-and-after candidates. The definition below gives a rigorous mathematical description.

Definition 3.1 (Correlations of Composite Service) $\forall q \in \mathcal{Q}\setminus\{1\}$, $c_1 \in \mathcal{C}_{q-1}$, and $c_2 \in \mathcal{C}_q$, candidates $s_q^{c_2}$ and $s_{q-1}^{c_1}$ are correlated iff $\mathbb{P}\big(E(s_q^{c_2}) \mid E(s_{q-1}^{c_1})\big) \equiv 1$ and, $\forall c_2' \in \mathcal{C}_q\setminus\{c_2\}$, $\mathbb{P}\big(E(s_q^{c_2'}) \mid E(s_{q-1}^{c_1})\big) \equiv 0$.

With the above definition, the probability $\mathbb{P}(E(s))$ can be calculated by

$$\mathbb{P}(E(s)) = \mathbb{P}\big(E(s_1^{c_1})\big) \cdot \prod_{q=2}^{Q} \mathbb{P}\big(E(s_q^{c_q}) \mid E(s_{q-1}^{c_{q-1}})\big). \qquad (3.2)$$
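The chain-rule factorization in Eq. (3.2) is straightforward to evaluate once the marginal distribution of the first candidate and the conditional probabilities between adjacent candidates are known. The following minimal Python sketch illustrates it for the four-micro-service example of Fig. 3.3; the probabilities and the names p_first and p_cond are our own hypothetical placeholders, not values from the text.

# Probability of one composition scheme via the chain rule of Eq. (3.2).
# Candidates are identified by (micro-service index q, candidate index c).

# Hypothetical marginal distribution of the first micro-service's candidates.
p_first = {1: 0.6, 2: 0.4}                      # P(E(s_1^c))

# Hypothetical conditionals P(E(s_q^c) | E(s_{q-1}^c')) for q = 2..4,
# indexed as p_cond[q][(c_prev, c)].  A correlated pair (Definition 3.1)
# simply has probability 1 for one successor and 0 for the others.
p_cond = {
    2: {(1, 1): 1.0, (2, 1): 1.0},              # t_2 has a single candidate
    3: {(1, 1): 0.7, (1, 2): 0.3, (2, 1): 0.0, (2, 2): 1.0},
    4: {(1, 1): 0.2, (1, 2): 0.8, (2, 1): 1.0, (2, 2): 0.0},
}

def scheme_probability(scheme):
    """scheme is a tuple (c_1, ..., c_Q) of chosen candidate indices."""
    prob = p_first[scheme[0]]
    for q in range(2, len(scheme) + 1):
        prob *= p_cond[q][(scheme[q - 2], scheme[q - 1])]
    return prob

# The two composition schemes of the working example (Fig. 3.3).
print(scheme_probability((1, 1, 1, 2)))   # c_1^1 -> c_2^1 -> c_3^1 -> c_4^2
print(scheme_probability((2, 1, 2, 1)))   # c_1^2 -> c_2^1 -> c_3^2 -> c_4^1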

3.2.2.2 Calculating the Response Time

The response time of one service chain consists of data uplink transmission time, service execution time, and data downlink transmission time. The data uploaded is mainly encoded service requests and configurations, while the output is mainly the feedback on successful service execution or a request to invoke the next candidate. If all requests are responded to within the access network, most of the time is spent on routing over multiple hops between SBSs. Notice that, except for the last one, each candidate's data downlink transmission time is equal to the data uplink transmission time of the next micro-service's candidate.

The first service of a service chain: For the $i$th mobile device and its selected candidate $s_1^{c_1}(i)$ of the initial micro-service $t_1$, where $c_1 \in \mathcal{C}_1$, one of the following two cases may happen.

- If the $i$th mobile device is not covered by any SBSs around (i.e., $\mathcal{M}_i = \emptyset$), the request has to be responded by the MBS and processed by the cloud data center.
- If the $i$th mobile device is covered by an SBS ($\mathcal{M}_i \neq \emptyset$), the nearest SBS $j_i^\star \in \mathcal{M}_i$ is chosen and connected. The following cases must then be considered.
  - If the candidate $s_1^{c_1}(i)$ is not deployed on any SBSs from $\mathcal{M}$ (i.e., $\mathcal{D}(s_1^{c_1}(i)) = \emptyset$), the request has to be forwarded to the cloud data center through a backbone transmission.
  - If $\mathcal{D}(s_1^{c_1}(i)) \neq \emptyset$ and $j_i^\star \in \mathcal{D}(s_1^{c_1}(i))$, the request can be directly processed by SBS $j_i^\star$ without any hops.
  - If $\mathcal{D}(s_1^{c_1}(i)) \neq \emptyset$ and $j_i^\star \notin \mathcal{D}(s_1^{c_1}(i))$, the request has to be responded by $j_i^\star$ and processed by another SBS from $\mathcal{D}(s_1^{c_1}(i))$.

We use $j_p(s)$ to denote the SBS that actually processes $s$. Remember that the routing between two nodes always selects the nearest path. Thus, for $q = 1$, $j_p(s_1^{c_1}(i))$ can be obtained by (3.3), where $\zeta(j_1, j_2)$ is the shortest number of hops from node $j_1$ to node $j_2$:

$$j_p(s_1^{c_1}(i)) = \begin{cases} \text{cloud}, & \mathcal{M}_i = \emptyset \text{ or } \mathcal{D}(s_1^{c_1}(i)) = \emptyset; \\ j_i^\star, & \mathcal{D}(s_1^{c_1}(i)) \neq \emptyset,\ j_i^\star \in \mathcal{D}(s_1^{c_1}(i)); \\ \arg\min_{j^\bullet \in \mathcal{D}(s_1^{c_1}(i))} \zeta(j_i^\star, j^\bullet), & \text{otherwise.} \end{cases} \qquad (3.3)$$

To calculate the response time of $s_1^{c_1}(i)$, we use $d(i, j)$ to denote the reciprocal of the bandwidth between $i$ and $j$. The expenditure of time on wireless access is inversely proportional to the bandwidth of the link. Besides, the expenditure of time on routing is directly proportional to the number of hops between the source node and the destination node. As a result, the data uplink transmission time $\tau_{in}(s_1^{c_1}(i))$ is summarized in (3.4), where $\tau_b$ is the time on backbone transmission, $\alpha$ is the size of the input data stream from each mobile device to the initial candidate (measured in bits), and $\beta$ is a positive constant representing the rate of the wired links between SBSs:

$$\tau_{in}(s_1^{c_1}(i)) = \begin{cases} \alpha \cdot d(i, 0) + \tau_b, & \mathcal{M}_i = \emptyset; \\ \alpha \cdot d(i, j_i^\star) + \tau_b, & \mathcal{M}_i \neq \emptyset,\ \mathcal{D}(s_1^{c_1}(i)) = \emptyset; \\ \alpha \cdot d(i, j_i^\star), & \mathcal{M}_i \neq \emptyset,\ j_i^\star \in \mathcal{D}(s_1^{c_1}(i)); \\ \alpha \cdot d(i, j_i^\star) + \beta \cdot \min_{j^\bullet \in \mathcal{D}(s_1^{c_1}(i))} \zeta(j_i^\star, j^\bullet), & \text{otherwise.} \end{cases} \qquad (3.4)$$

We use $\tau_{exe}(j_p(s_1^{c_1}(i)))$ to denote the micro-service execution time on the SBS $j_p$ for the candidate $s_1^{c_1}(i)$. The data downlink transmission time is the same as the uplink time of the next micro-service.

The intermediate services of a service chain: For the $i$th mobile device and its selected candidate $s_q^{c_q}(i)$ of micro-service $t_q$, where $q \in \mathcal{Q}\setminus\{1, Q\}$, $c_q \in \mathcal{C}_q$, the analysis of its data uplink transmission time is correlated with $j_p(s_{q-1}^{c_{q-1}}(i))$ (the SBS that processes $s_{q-1}^{c_{q-1}}(i)$). $\forall q \in \mathcal{Q}\setminus\{1\}$, the calculation of $j_p(s_q^{c_q}(i))$ is summarized in (3.5), which is closely related to (3.3):

$$j_p(s_q^{c_q}(i)) = \begin{cases} \text{cloud}, & \mathcal{M}_i = \emptyset \text{ or } \mathcal{D}(s_q^{c_q}(i)) = \emptyset; \\ j_p(s_{q-1}^{c_{q-1}}(i)), & \mathcal{D}(s_q^{c_q}(i)) \neq \emptyset,\ j_p(s_{q-1}^{c_{q-1}}(i)) \in \mathcal{D}(s_q^{c_q}(i)); \\ \arg\min_{j^\bullet \in \mathcal{D}(s_q^{c_q}(i))} \zeta(j_p(s_{q-1}^{c_{q-1}}(i)), j^\bullet), & \text{otherwise.} \end{cases} \qquad (3.5)$$

To calculate the response time of $s_q^{c_q}(i)$, the following cases may happen.

- If $j_p(s_{q-1}^{c_{q-1}}(i)) \in \mathcal{D}(s_q^{c_q}(i))$, then the data downlink transmission time of the previous candidate $s_{q-1}^{c_{q-1}}(i)$, which is also the data uplink transmission time of the current candidate $s_q^{c_q}(i)$, is zero. That is, $\tau_{out}(s_{q-1}^{c_{q-1}}(i)) = \tau_{in}(s_q^{c_q}(i)) = 0$. This is because both $s_{q-1}^{c_{q-1}}(i)$ and $s_q^{c_q}(i)$ are deployed on the SBS $j_p(s_{q-1}^{c_{q-1}}(i))$.
- If $j_p(s_{q-1}^{c_{q-1}}(i)) \notin \mathcal{D}(s_q^{c_q}(i))$, the following cases must be considered.
  - If $j_p(s_{q-1}^{c_{q-1}}(i)) = \text{cloud}$ (i.e., the request of $t_{q-1}$ from the $i$th mobile device is responded by the cloud data center), then the invocation for $t_q$ can be directly processed by the cloud without backhaul. Here, the data uplink transmission time of $t_q$ is zero.
  - If $j_p(s_{q-1}^{c_{q-1}}(i)) \neq \text{cloud}$ and $\mathcal{D}(s_q^{c_q}(i)) = \emptyset$ (i.e., $s_q^{c_q}(i)$ is not deployed on any SBSs in the HetNet), the invocation for $t_q$ has to be responded by the cloud data center through backbone transmission.
  - If $j_p(s_{q-1}^{c_{q-1}}(i)) \neq \text{cloud}$, $\mathcal{D}(s_q^{c_q}(i)) \neq \emptyset$, but $j_p(s_{q-1}^{c_{q-1}}(i)) \notin \mathcal{D}(s_q^{c_q}(i))$ (i.e., both $s_{q-1}^{c_{q-1}}(i)$ and $s_q^{c_q}(i)$ are processed by SBSs in the HetNet but not the same one), we can calculate the data uplink transmission time by finding the shortest path from $j_p(s_{q-1}^{c_{q-1}}(i))$ to an SBS in $\mathcal{D}(s_q^{c_q}(i))$.

The above analysis is summarized in Eq. (3.6):

$$\tau_{in}(s_q^{c_q}(i)) = \begin{cases} 0, & j_p(s_{q-1}^{c_{q-1}}(i)) \in \mathcal{D}(s_q^{c_q}(i)); \\ 0, & j_p(s_{q-1}^{c_{q-1}}(i)) = \text{cloud}; \\ \tau_b, & j_p(s_{q-1}^{c_{q-1}}(i)) \neq \text{cloud},\ \mathcal{D}(s_q^{c_q}(i)) = \emptyset; \\ \beta \cdot \min_{j^\bullet \in \mathcal{D}(s_q^{c_q}(i))} \zeta(j_p(s_{q-1}^{c_{q-1}}(i)), j^\bullet), & \text{otherwise.} \end{cases} \qquad (3.6)$$

The last service of a service chain: For the $i$th mobile device and its selected candidate $s_Q^{c_Q}(i)$ of the last micro-service $t_Q$, where $c_Q \in \mathcal{C}_Q$, the data uplink transmission time $\tau_{in}(s_Q^{c_Q}(i))$ can be calculated by (3.6), with every $q$ replaced by $Q$. However, for the data downlink transmission time $\tau_{out}(s_Q^{c_Q}(i))$, the following cases may happen.

- If $j_p(s_Q^{c_Q}(i)) = \text{cloud}$ (i.e., the chosen candidate of the last micro-service is processed by the cloud data center), the processed result needs to be returned from the cloud to the $i$th mobile device through backhaul transmission via the MBS.
- If $j_p(s_Q^{c_Q}(i)) \neq \text{cloud}$ (i.e., the chosen candidate of the last micro-service is processed by an SBS in the HetNet), the result should be delivered to the $i$th mobile device through $j_p(s_Q^{c_Q}(i))$ and $j_i^\star$. Note that $j_p(s_Q^{c_Q}(i))$ and $j_i^\star$ could be the same.

Equation (3.7) summarizes the calculation of $\tau_{out}(s_Q^{c_Q}(i))$:

$$\tau_{out}(s_Q^{c_Q}(i)) = \begin{cases} \tau_b + \alpha \cdot d(i, 0), & j_p(s_Q^{c_Q}(i)) = \text{cloud}; \\ \beta \cdot \zeta(j_p(s_Q^{c_Q}(i)), j_i^\star) + \alpha \cdot d(i, j_i^\star), & \text{otherwise.} \end{cases} \qquad (3.7)$$

Based on the above analysis, the response time of the $i$th mobile device is

$$\tau(E(s(i))) = \sum_{q=1}^{Q} \Big( \tau_{in}(s_q^{c_q}(i)) + \tau_{exe}\big(j_p(s_q^{c_q}(i))\big) \Big) + \tau_{out}(s_Q^{c_Q}(i)). \qquad (3.8)$$
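To make the pipeline of Eqs. (3.3)–(3.8) concrete, the sketch below walks one request chain through the model. It is a simplified illustration under our own assumptions: the HetNet is an undirected graph whose hop counts $\zeta$ are computed with a BFS, the cloud is represented by the string 'cloud', and all inputs (placements, bandwidths, execution times) are hypothetical placeholders rather than values from the text.

from collections import deque

def hops(adj, src, dst):
    """Shortest number of hops zeta(src, dst) in the undirected SBS graph."""
    seen, frontier = {src: 0}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            return seen[u]
        for v in adj[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                frontier.append(v)
    return float("inf")

def chain_response_time(chain, deploy, adj, j_star, d, alpha, beta, tau_b, tau_exe):
    """Response time of one composition scheme, following Eqs. (3.3)-(3.8).

    chain   : list of selected candidates, e.g. ['c11', 'c21', 'c31', 'c42']
    deploy  : candidate -> set of SBSs hosting it (D(s))
    j_star  : the SBS nearest to the mobile device (coverage assumed non-empty)
    d       : dict of reciprocal bandwidths, e.g. d[('md', j_star)]
    tau_exe : (candidate, host) -> execution time
    """
    total, prev_host = 0.0, None
    for q, cand in enumerate(chain):
        sites = deploy.get(cand, set())
        if q == 0:                                   # first service, Eqs. (3.3)-(3.4)
            if not sites:
                host, t_in = "cloud", alpha * d[("md", j_star)] + tau_b
            elif j_star in sites:
                host, t_in = j_star, alpha * d[("md", j_star)]
            else:
                host = min(sites, key=lambda j: hops(adj, j_star, j))
                t_in = alpha * d[("md", j_star)] + beta * hops(adj, j_star, host)
        else:                                        # later services, Eqs. (3.5)-(3.6)
            if prev_host in sites:
                host, t_in = prev_host, 0.0
            elif prev_host == "cloud" or not sites:
                host, t_in = "cloud", 0.0 if prev_host == "cloud" else tau_b
            else:
                host = min(sites, key=lambda j: hops(adj, prev_host, j))
                t_in = beta * hops(adj, prev_host, host)
        total += t_in + tau_exe[(cand, host)]
        prev_host = host
    # downlink of the last service, Eq. (3.7)
    if prev_host == "cloud":
        total += tau_b + alpha * d[("md", "mbs")]
    else:
        total += beta * hops(adj, prev_host, j_star) + alpha * d[("md", j_star)]
    return total

The objective of the placement problem, Eq. (3.8) summed over all devices, is simply the sum of this quantity over every mobile device's chosen composition scheme.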

3.2.3 Problem Formulation

Our job is to find an optimal redundant placement policy that minimizes the overall latency under the limited capability of the SBSs. The heterogeneity of edge sites is directly embodied in the number of services that can be deployed; $b_j$ denotes this number for the $j$th SBS. In the heterogeneous edge, $b_j$ could vary considerably, but the following constraint should be satisfied:

$$\sum_{q \in \mathcal{Q}} \sum_{c \in \mathcal{C}_q} \mathbb{1}\big\{ j \in \mathcal{D}(s_q^c) \big\} \le b_j, \quad \forall j \in \mathcal{M}, \qquad (3.9)$$

where $\mathbb{1}\{\cdot\}$ is the indicator function. Finally, the optimal placement problem can be formulated as

$$\mathcal{P}_1: \min_{\mathcal{D}(s_q^c)} \sum_{i=1}^{N} \tau(E(s(i))) \quad \text{s.t. (3.9)},$$

where the decision variables are $\mathcal{D}(s_q^{c_q}(i)), \forall q \in \mathcal{Q}, c \in \mathcal{C}_q$, and the optimization goal is the sum of the response time of all mobile devices.

3.2.4 Algorithm Design

In this section, we elaborate our algorithm for $\mathcal{P}_1$. We encode the decision variables as $\mathbf{x}$ and propose the SAA-RP framework, which includes an optimization algorithm: the genetic algorithm-based server selection (GASS) algorithm.

3.2.4.1 Variables

Let us use $\mathbf{W}(i) \triangleq (\mathrm{canIdx}(t_1), \ldots, \mathrm{canIdx}(t_Q))$ to denote the random vector of the chosen service composition scheme of the $i$th mobile device; $\mathrm{canIdx}(t_q)$ returns the index of the chosen candidate of the $q$th micro-service. $\mathbf{W}(i)$ and $E(s(i))$ describe the same thing from different perspectives. We use $\mathbf{W} \triangleq (\mathbf{W}(1), \ldots, \mathbf{W}(N))$ to denote the global random vector taking all mobile devices into account. We use $\mathbf{x} \triangleq [\mathbf{x}(b_1), \ldots, \mathbf{x}(b_M)]$ to encode the global deploy-or-not vector. $\forall j \in \mathcal{M}$, $\mathbf{x}(b_j)$ is a deployment vector for SBS $j$, whose length is $b_j$. By doing this, constraint (3.9) can be removed, because it is reflected in how $\mathbf{x}$ encodes a solution. $\forall j \in \mathcal{M}$, each element of $\mathbf{x}(b_j)$ is chosen from $\{0, 1, \ldots, \sum_{q=1}^{Q} C_q\}$, that is, the global index of each candidate. Any service can appear in any number of SBSs in the HetNet; thus, the redundancy mechanism is also reflected in how $\mathbf{x}$ is encoded. $\forall j \in \mathcal{M}$, $\mathbf{x}(b_j) = 0$ means that the $j$th SBS does not deploy any service. In short, $\mathbf{x}$ is a new encoding of $\mathcal{D}(s_q^{c_q}(i))$. As a result, we can reconstitute $\tau(E(s(i)))$ as $\tau(\mathbf{x}, \mathbf{W}(i))$. Therefore, the optimization goal can be written as

$$g(\mathbf{x}) \triangleq \mathbb{E}[G(\mathbf{x}, \mathbf{W})] = \mathbb{E}\left[ \sum_{i=1}^{N} \tau(\mathbf{x}, \mathbf{W}(i)) \right], \qquad (3.10)$$

and the optimal placement problem turns into

$$\mathcal{P}_2: \min_{\mathbf{x} \in \mathbb{X}} g(\mathbf{x}).$$

$\mathcal{P}_2$ is a stochastic discrete optimization problem with independent variable $\mathbf{x}$. $\mathbb{X}$ is the feasible region and, although finite, is still very large. Therefore, an enumeration approach is inadvisable. Besides, the problem has an uncertain random vector $\mathbf{W}$ with probability distribution $\mathbb{P}(E(s(i)))$.

3.2.4.2 The SAA-RP Framework

In $\mathcal{P}_2$, the random vector $\mathbf{W}$ is exogenous because the decision on $\mathbf{x}$ does not affect the distribution of $\mathbf{W}$. Also, for a given $\mathbf{W}$, $G(\mathbf{x}, \mathbf{W})$ can be easily evaluated for any $\mathbf{x}$; that is, the observation of $G(\mathbf{x}, \mathbf{W})$ is constructive. As a result, we can apply the sample average approximation (SAA) approach [53] and handle the uncertainty. SAA is a classic Monte Carlo simulation-based method. In the following, we elaborate on how we apply the SAA method to $\mathcal{P}_2$.

Formally, we define the SAA problem $\mathcal{P}_3$. Here, $\mathbf{W}^1, \mathbf{W}^2, \ldots, \mathbf{W}^R$ are independently and identically distributed (i.i.d.) random samples of $R$ realizations of the random vector $\mathbf{W}$. The SAA function is defined as

$$\hat{g}_R(\mathbf{x}) \triangleq \frac{1}{R} \sum_{r=1}^{R} G(\mathbf{x}, \mathbf{W}^r), \qquad (3.11)$$

and the SAA problem $\mathcal{P}_3$ is defined as

$$\mathcal{P}_3: \min_{\mathbf{x} \in \mathbb{X}} \hat{g}_R(\mathbf{x}).$$

Using Monte Carlo sampling, with support from the Law of Large Numbers [54], when $R$ is large enough the optimal value of $\hat{g}_R(\mathbf{x})$ converges to the optimal value of $g(\mathbf{x})$ with probability one (w.p.1). As a result, we only need to solve $\mathcal{P}_3$ as well as possible.

The SAA-RP framework is presented in Algorithm 1. In this algorithm, we need to select the sample sizes $R$ and $R'$ appropriately. As the sample size $R$ increases, the optimal solution of the SAA problem converges to that of its original/true problem, but the computational complexity of solving it increases at least linearly (maybe even exponentially) in $R$ [53]. Therefore, when we choose $R$, the trade-off between the quality of the optimality and the computation effort should be taken into account. Here, $R'$ is used to obtain an estimate of the true objective with the solution obtained from the SAA problem. In order to obtain an accurate estimate, we choose a relatively large sample size $R'$ ($R' \gg R$). Inspired by [53], we replicate generating and solving the SAA problem with $L$ i.i.d. replications. In steps 4 to 8, we invoke GASS to obtain an asymptotically optimal solution for each replication and record the best result up to that point. In steps 10 to 13, we estimate the true objective value of each replication. After that, those estimates are compared with the average of the optimal SAA values. If the maximum gap is smaller than the tolerance, SAA-RP returns the best solution among the $L$ replications and the algorithm terminates; otherwise, we increase $R$ and $R'$ and explore again.

Algorithm 1 SAA-Based Redundant Placement (SAA-RP)
1: Choose initial sample sizes $R$ and $R'$ ($R' \gg R$)
2: Choose the number of replications $L$ (indexed by $l \in \mathcal{L}$)
3: Set up a gap tolerance $\epsilon$
4: for $l = 1$ to $L$ in parallel do
5:   Generate $R$ independent samples $\mathbf{W}_l^1, \ldots, \mathbf{W}_l^R$
6:   Invoke GASS to obtain the minimum value of $\hat{g}_R(\mathbf{x}_l)$, with the form $\frac{1}{R}\sum_{r=1}^{R} G(\mathbf{x}_l, \mathbf{W}_l^r)$
7:   Record the optimal value $\hat{g}_R(\hat{\mathbf{x}}_l^*)$ and the corresponding variable $\hat{\mathbf{x}}_l^*$ returned by GASS
8: end for
9: $\bar{v}_R^* \leftarrow \frac{1}{L}\sum_{l=1}^{L} \hat{g}_R(\hat{\mathbf{x}}_l^*)$
10: for $l = 1$ to $L$ in parallel do
11:   Generate $R'$ independent samples $\mathbf{W}_l^1, \ldots, \mathbf{W}_l^{R'}$
12:   $v_{R'}^l \leftarrow \frac{1}{R'}\sum_{r'=1}^{R'} G(\hat{\mathbf{x}}_l^*, \mathbf{W}_l^{r'})$
13: end for
14: Get the worst replication $v_{R'}^\bullet \leftarrow \max_{l \in \mathcal{L}} v_{R'}^l$
15: if the gap $v_{R'}^\bullet - \bar{v}_R^* < \epsilon$ then
16:   Choose the best solution $\hat{\mathbf{x}}_l^*$ among all $L$ replications
17: else
18:   Increase $R$ (for exploring) and $R'$ (for evaluation)
19:   goto Step 4
20: end if
21: return the best solution $\hat{\mathbf{x}}_l^*$
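The essence of Algorithm 1 is an outer Monte Carlo loop around an inner deterministic optimizer. The Python sketch below is our own simplification of that structure; sample_W, gass, and G are placeholders standing for the scenario sampler, the GA of Algorithm 2, and the objective $G(\mathbf{x}, \mathbf{W})$, and all default parameter values are arbitrary.

def saa_rp(sample_W, gass, G, R=50, R_eval=1000, L=8, eps=1e-2, max_rounds=5):
    """Sample-average-approximation wrapper in the spirit of Algorithm 1.

    sample_W(n)    -> list of n scenario samples of W
    gass(samples)  -> placement x minimizing (1/R) * sum_r G(x, W_r)
    G(x, W)        -> response-time objective for one scenario
    """
    for _ in range(max_rounds):
        # Steps 4-8: solve L independent SAA replications.
        replications = []
        for _ in range(L):
            samples = sample_W(R)
            x_hat = gass(samples)
            g_hat = sum(G(x_hat, w) for w in samples) / R
            replications.append((x_hat, g_hat))

        # Step 9: average of the SAA optimal values.
        v_bar = sum(g for _, g in replications) / L

        # Steps 10-14: re-evaluate each candidate on a much larger sample (R' >> R).
        evals = []
        for x_hat, _ in replications:
            big = sample_W(R_eval)
            evals.append(sum(G(x_hat, w) for w in big) / R_eval)
        v_worst = max(evals)

        # Steps 15-19: stop if the optimality-gap estimate is small enough.
        if v_worst - v_bar < eps:
            best = min(range(L), key=lambda l: evals[l])
            return replications[best][0]
        R, R_eval = 2 * R, 2 * R_eval   # explore harder next round

    return replications[min(range(L), key=lambda l: evals[l])][0]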

3.2.4.3 The GASS Algorithm

GASS is a genetic algorithm (GA)-based algorithm. The detailed procedure is demonstrated in Algorithm 2. In GASS, we first initialize all necessary parameters, including the population size $P$, the number of iterations $it$, and the probabilities of crossover $P_c$ and mutation $P_m$. After that, we randomly generate the initial population from the domain $\mathbb{X}$. In steps 6 to 10, GASS executes the crossover operation. At the beginning of this operation, GASS checks whether a crossover needs to be executed. If yes, GASS randomly chooses two chromosomes according to their fitness values; the fitter a chromosome, the greater its chance of being selected. Then the latter parts of $\mathbf{x}_{p_1}$ and $\mathbf{x}_{p_2}$ are exchanged, starting from the $\mathbf{x}(b_j)$ position. In steps 11 to 13, GASS executes the mutation operation: it checks whether each chromosome should mutate according to the mutation probability $P_m$. At the end, only the chromosome with the best fitness value is returned.


Algorithm 2 GA-Based Server Selection (GASS)
1: Initialise the population size $P$, the number of iterations $it$, and the probabilities of crossover $P_c$ and mutation $P_m$
2: Randomly generate $P$ chromosomes $\mathbf{x}_1, \ldots, \mathbf{x}_P \in \mathbb{X}$
3: for $t = 1$ to $it$ do
4:   $\forall p \in \{1, \ldots, P\}$, renew the optimisation goal of $\mathcal{P}_2$, that is, $\hat{g}_R(\mathbf{x}_p)$, according to (3.11)
5:   for $p = 1$ to $P$ do
6:     if rand() < $P_c$ then
7:       Choose two chromosomes $p_1$ and $p_2$ according to the probability distribution $\mathbb{P}(p \text{ is chosen}) = \frac{1/\hat{g}_R(\mathbf{x}_p)}{\sum_{p'=1}^{P} 1/\hat{g}_R(\mathbf{x}_{p'})}$
8:       Randomly choose SBS $j \in \mathcal{M}$
9:       Crossover the segments of $\mathbf{x}_{p_1}$ and $\mathbf{x}_{p_2}$ after the partitioning point $\mathbf{x}(b_{j-1})$: $[\mathbf{x}_{p_1}(b_j), \ldots, \mathbf{x}_{p_1}(b_M)] \leftrightarrow [\mathbf{x}_{p_2}(b_j), \ldots, \mathbf{x}_{p_2}(b_M)]$
10:    end if
11:    if rand() < $P_m$ then
12:      Randomly choose SBS $j \in \mathcal{M}$ and re-generate the segment $\mathbf{x}_p(b_j)$
13:    end if
14:  end for
15: end for
16: return $\arg\min_p \hat{g}_R(\mathbf{x}_p)$ from the $P$ chromosomes
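A compact way to read Algorithm 2 is as a standard fitness-proportionate GA over per-SBS deployment segments. The sketch below is a minimal Python rendering under our own assumptions: a chromosome is a list of per-SBS lists of global candidate indices (0 = empty slot), and fitness is any callable returning $\hat{g}_R(\mathbf{x})$. Selection, crossover point, and mutation follow the listing, but details such as elitism and tie-breaking are deliberately simplified.

import random

def gass(fitness, capacities, n_candidates, pop_size=30, iters=100, pc=0.8, pm=0.1):
    """GA-based server selection (sketch of Algorithm 2).

    fitness(x)   -> estimated objective g_hat_R(x); smaller is better
    capacities   -> [b_1, ..., b_M], slots per SBS
    n_candidates -> total number of candidate services (global indices 1..n)
    """
    def random_segment(b):
        return [random.randint(0, n_candidates) for _ in range(b)]

    # Step 2: random initial population.
    pop = [[random_segment(b) for b in capacities] for _ in range(pop_size)]

    for _ in range(iters):
        scores = [fitness(x) for x in pop]                 # step 4
        weights = [1.0 / s for s in scores]                # fitter => larger weight
        for p in range(pop_size):
            if random.random() < pc:                       # steps 6-9: crossover
                # may occasionally pick the same parent twice, which is a no-op
                x1, x2 = random.choices(pop, weights=weights, k=2)
                j = random.randrange(len(capacities))      # partition at SBS j
                x1[j:], x2[j:] = x2[j:], x1[j:]
            if random.random() < pm:                       # steps 11-12: mutation
                j = random.randrange(len(capacities))
                pop[p][j] = random_segment(capacities[j])
    return min(pop, key=fitness)                           # step 16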

3.3 AI/ML for Running Services

We are now embracing an era of IoT. The number of IoT devices has proliferated rapidly with the popularization of mobile phones, wearable devices, and various kinds of sensors. There were 8.6 billion IoT connections established by the end of 2018, and the number is predicted by Ericsson to increase to 22.3 billion by 2022. To face the challenges of reliable connection and low latency in the future, researchers have turned their attention to a novel computing paradigm called edge computing [55]. In contrast to cloud computing, edge computing refers to decentralizing tasks to the edge of the network. In the edge computing paradigm, plenty of edge servers are established close to IoT devices to deal with the requests from these devices before they are routed to the core network [56]. Therefore, the computation and transmission between devices and the cloud server can be partly migrated to edge servers or the cloud server. This enables IoT devices to fulfil complex analysis tasks with lower latency, higher performance, and less energy consumption [57] by taking advantage of the services deployed on edges. We can even establish a standalone cluster where the edge servers work cooperatively to get full control and improve the offline scheduling capability at the edge side. With the help of edge servers in proximity, applications are also enabled to learn from the mobile users' real-time context information [58] and improve their quality of experience (QoE) [59].


However, simply establishing such a service provisioning system is not enough [60], because there is another significant problem that should be considered in this scenario: trust management. Besides typical issues like key escrow and application distribution [61], we should also be careful about complying with business agreements. A service-level agreement (SLA) is a commitment between a service provider and a client. It can contain numerous service-performance metrics with corresponding service-level objectives, such as the TAT (turnaround time) that describes the expected time to complete a certain task. Therefore, we can go a step further and ask how to manage the services on edge servers so that their trustworthiness can be improved, or at least guaranteed. An effective but challenging way to do so is to keep the services running effectively by allocating appropriate resources to them. Therefore, we need to explore the relationship between service effectiveness and the resource allocation scheme [62]. We also need to find a policy that determines how to allocate resources for the services on edge servers. In particular, the policy should be a dynamic one that tells how to allocate resources in different time periods, because the requests produced by IoT devices may vary over time and we must care about trustworthiness all the time. This is difficult because there is no existing model for the edge service provisioning system in the IoT environment, and most policies in existing works are static.

Take the context of the "smart city" as an example. In this field, plenty of IoT sensors are distributed around the city to collect data [63]. Traditionally, the collected data will be uploaded to the cloud for analysis. When decisions are made, some control data will be sent back to corresponding controllers for further actions. For example, Alibaba researchers use their City Brain to help reduce traffic congestion: they collect traffic information from different roads with webcams and upload it to AliCloud; then, they use the services on it to determine how to adjust the traffic lights. But things become easier if we take advantage of the edge computing paradigm. To be more specific, we can divide the road map into several independent regions and use the edge servers in every region to establish a service provisioning system. Doing this allows us to conduct the analyses locally, and the traffic lights of each region can be adjusted in time.

Figure 3.5 shows an example platform with three edge servers making up a mini service provisioning system to provide four services with different IoT functions. For simplification, we assume these services have the same parameters except for their functionality. The four related smart city services are presented with four colors: $s_1$–$s_4$ are, respectively, TrafficLightService, NoiseService, AirQualityService, and CriminalDetectService, each with a predefined expected TAT. For edge server #1, the distribution of service requests over time is shown by a stacked area chart on the right. As services can process faster with more resources, we should carefully allocate the limited resources of this server; for example, in time period t = 1, $s_1$ is used more and thus can receive more resources. However, if we do not change the resource allocation scheme, the trustworthiness of the system will decay in t = 4, because in this time period $s_2$ is the most popular service. Therefore, some of the declared TATs may not be satisfied if we do not change the resource allocation strategy. Similarly, it would be better to reallocate resources in t = 7.


Fig. 3.5 A motivation scenario with multiple edge servers in a smart city. ©2020 IEEE, reprinted, with permission, from S. Deng et al. (2020)

Using this simple example, we demonstrated the importance of dynamic resource allocation. In fact, because the relationship between the allocated resources and system performance may be more complex, we should also consider the cooperation of edge servers and the cloud. To this end, we elaborate on the following topics in this chapter.

1. We investigate the resource limitations of edge servers and the resource consumption of services, and consider multi-edge-cloud cooperation to model the reliability of edge service provisioning.
2. We quantify the reliability with the resources allocated to services so that we can explore how they impact the provisioning.
3. We model the process of resource allocation as a Markov decision process and train policies based on a reinforcement learning-based approach.
4. We conduct a series of experiments to evaluate the generated policies with the YouTube dataset and compare them with other similar approaches.

3.3.1 System Description and Model

In a reliable edge service provisioning system, there are mainly three types of entities: (1) IoT devices that communicate with the edge servers to send requests and receive responses, (2) edge services that help to fulfil specific tasks, and (3) hosts in clouds and edge servers that deploy and execute services. As the requests of IoT devices may be imbalanced and dynamic, and requested services may have different functionalities and processing capacities, the edge servers are integrated into this system to make a geographically local cluster. The cluster helps to make use of the edge resources, but some external complexity in reliability management is also introduced. Therefore, we need to quantify the reliability and explore the factors that may impact it, so that we can adjust the service resource allocation scheme in an appropriate way. In this section, we give a brief description of the properties of these system entities and how they interact with each other. The resource allocation scheme is then modelled based on them. Here, we mainly focus on the service average turnaround time of the SLA in system management. It is formulated in this section to help understand the definition of reliability in this context.

A. Average Turnaround Time Estimation: Suppose there is a cloud server $h_0$ and $n$ edge servers ($h_1, \ldots, h_n$) in this provisioning system for services $s_0, s_1, \ldots, s_m$ ($s_0$ is a virtual service standing for "idle"). To explain the running of the system clearly, we introduce the "service request life cycle (SRLC)". A SRLC starts with the creation of a request and ends when the results are received by the invoking IoT device. Figure 3.6 shows three typical SRLCs. Shown in blue, the user with device #1 wants to request content from Facebook; with the corresponding service on edge server #1, the request is easily handled by it. Shown in pink, the user with device #3 wants to access Instagram; in this case, because edge server #2 does not allocate resources to Instagram, it will dispatch the request to the cloud server. Shown in orange, the user with device #2 wants to access Facebook; in this case, multiple edge servers need to work together to provide this service. Using these simple examples, an entire life cycle on the platform can be divided into four steps.

Fig. 3.6 An illustration of a service request life cycle. ©2020 IEEE, reprinted, with permission, from S. Deng et al. (2020)

(1) Access step: A request for service $s_i$ is produced by the IoT device, and the request is sent to the nearest edge server $h_j$ (called the access server $H_a$) via a wireless link. The time cost is calculated as

$$T_A = \frac{D_i^{in}}{v_u^j}, \qquad (3.12)$$

where $D_i^{in}$ is the average input data size of $s_i$ and $v_u^j$ is the average data transmission rate between $h_j$ and the IoT devices in its serving area. Note that we use the principle of proximity to select an appropriate edge server; this can be replaced by other methods.

(2) Routing step: The access server $h_j$ selects an appropriate server $h_k$ to handle this request and then sends the request to it (called the executor server $H_e$). The time cost is calculated as

$$T_R = \frac{D_i^{in}}{B_{j,k}}, \qquad (3.13)$$

where $B_{j,k}$ describes the topology and bandwidth between $h_j$ and $h_k$. The elements of the matrix $B$ are non-negative values, where $B_{j,k}$ is set to $+\infty$ if $j = k$ and to 0 if $h_k$ is not accessible from $h_j$. Because an appropriate routing policy is needed to make decisions, we use $p_{i,j,k}$ to denote the probability that server $h_j$ dispatches requests for service $s_i$ to server $h_k$. In reality, developers may like to use the round-robin principle to balance the workload. With this principle, requests are routed to the different hosts in order, namely $p_{\ast,\ast,k}^t = \frac{1}{n+1}$. However, this implicitly assumes that the machines are homogeneous and does not consider the system context. In our model, because we assume a heterogeneous platform with different processing capacities, we use a weighted round-robin approach that gives host $h_k$ a larger probability if it has better processing capacity for servicing $s_i$. To be more specific, we determine this probability as $p_{i,j,k} = \frac{\mu_{i,k}}{\sum_{q=0}^{n} \mu_{i,q}}$.
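The dispatch probability $p_{i,j,k}$ is just a normalization of per-host processing capacities, which the following small Python sketch makes explicit. The capacity values are hypothetical, and mu[i][k] is simply our shorthand for $\mu_{i,k}$.

# Weighted round-robin dispatch probabilities p_{i,j,k} = mu_{i,k} / sum_q mu_{i,q}.
# mu[i][k]: processing capacity of host h_k for service s_i (hypothetical numbers).
mu = {
    1: [4.0, 2.0, 1.0, 1.0],   # service s_1 on hosts h_0 (cloud), h_1, h_2, h_3
    2: [4.0, 0.0, 3.0, 1.0],   # a zero means the host allocates nothing to s_2
}

def dispatch_probabilities(service_id):
    caps = mu[service_id]
    total = sum(caps)
    return [c / total for c in caps]

print(dispatch_probabilities(1))   # e.g. [0.5, 0.25, 0.125, 0.125]
print(dispatch_probabilities(2))   # hosts with zero capacity are never chosen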

(3) Execution step: When the executor server $h_k$ receives the request, it uses the corresponding service to fulfil the task. Assume that the processing capacity (e.g., the ability to handle instructions, measured in MIPS) for service $s_i$ on host $h_k$ is $\mu_{i,k}$ and the workload (e.g., the number of instructions needed to run the program) of $s_i$ is $w_i$. The time cost is calculated as

$$T_E = \frac{w_i}{\mu_{i,k}}, \qquad (3.14)$$

where we assume $\mu_{i,k}$ is proportional to $l_{i,k}$, the resource (e.g., CPU or memory) allocated to $s_i$ on $h_k$ [34]. When $\mu_k$ is the maximum processing capacity of $h_k$ and the resource limitation of $h_k$ is $L_k$, we have $\mu_{i,k} = \frac{l_{i,k}}{L_k}\mu_k$ according to Eq. (13) in [34].

(4) Backhaul step: The results go back to the access server and finally to the IoT device to finish the whole life cycle. The time cost is calculated as

$$T_B = \frac{D_i^{out}}{B_{k,j}} + \frac{D_i^{out}}{v_u^j}, \qquad (3.15)$$

where $D_i^{out}$ is the average output data size of $s_i$. The total time of a service request life cycle $\phi$ is

$$T_\phi = T_A + T_R + T_E + T_B. \qquad (3.16)$$
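Putting Eqs. (3.12)–(3.16) together, the turnaround time of a single request path $\phi = (i, j, k)$ is a simple sum of four terms. The Python sketch below computes it for hypothetical parameters; all variable names (d_in, d_out, v_u, bandwidth, workload, mu) are our own and only mirror the symbols of the model.

def srlc_time(i, j, k, d_in, d_out, v_u, bandwidth, workload, mu):
    """Turnaround time T_phi of one request path (service i, access j, executor k).

    d_in[i], d_out[i] : average input/output data sizes of service s_i
    v_u[j]            : wireless rate between h_j and its IoT devices
    bandwidth[j][k]   : B_{j,k}; use float('inf') when j == k (no routing cost)
    workload[i]       : w_i, instructions needed by s_i
    mu[i][k]          : mu_{i,k}, processing capacity of h_k for s_i
    """
    t_access = d_in[i] / v_u[j]                              # Eq. (3.12)
    t_route = d_in[i] / bandwidth[j][k]                      # Eq. (3.13)
    t_exec = workload[i] / mu[i][k]                          # Eq. (3.14)
    t_back = d_out[i] / bandwidth[k][j] + d_out[i] / v_u[j]  # Eq. (3.15)
    return t_access + t_route + t_exec + t_back              # Eq. (3.16)

# Hypothetical example: service 1, access server 1, executed on server 2.
print(srlc_time(
    i=1, j=1, k=2,
    d_in={1: 2.0}, d_out={1: 0.5}, v_u={1: 10.0},
    bandwidth={1: {2: 50.0}, 2: {1: 50.0}},
    workload={1: 200.0}, mu={1: {2: 400.0}},
))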

B. Dynamic Service Resource Allocation: The resource allocation scheme $P$ can be represented as a matrix whose element $P_{j,i}$ is the resource quota (in %) for $s_i$ on $h_j$. Without loss of generality, we mainly focus on computation resources like CPU and memory, because these resources are more expensive than storage and networking resources. Therefore, the resource allocated to service $s_i$ on $h_j$ can be calculated as $l_{i,j} = P_{j,i} L_j$. As mentioned before, a significant task in the reliable service provisioning system is to ensure the SLAs of services. But as the requests are time-varying, a fixed service resource allocation scheme cannot work well all the time. Therefore, it is effective for the system to reschedule the resource allocation scheme during runtime. In our work, the allocation replanning for the resources on hosts is denoted by the matrix $A^t \in \mathbb{R}^{(n+1)\times(n+1)}$, which describes how the service allocation scheme is replanned in time period $t$. The service resource allocation scheme of the next time period is produced by

$$P^{t+1} = A^t \times P^t, \qquad (3.17)$$

where $P^{t+1} \in \mathbb{R}^{(n+1)\times(m+1)}$. As $P_{j,i}$ is represented as a percentage, we have $\sum_{i=0}^{m} P_{j,i} = 1$ and $0 \le P_{j,i} \le 1$. This is a constraint that can be used in the training of the reinforcement learning-based approach: for a neural network that generates $P$, a softmax layer can be added to ensure this constraint.
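The row-stochastic constraint on $P$ ($\sum_i P_{j,i} = 1$, $0 \le P_{j,i} \le 1$) is exactly what a row-wise softmax enforces, which is why the text suggests a softmax output layer. The NumPy sketch below is our own illustration with made-up matrices: it builds a valid scheme from arbitrary network outputs and then applies one replanning step of Eq. (3.17).

import numpy as np

def to_allocation(logits):
    """Row-wise softmax: turns arbitrary network outputs into a valid scheme P
    (each row is non-negative and sums to 1, as the constraint requires)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def replan(A, P):
    """One replanning step P^{t+1} = A^t @ P^t (Eq. 3.17). If each row of A is
    itself a convex combination (non-negative, summing to 1), the product is
    again a valid allocation scheme."""
    return A @ P

# Hypothetical example: 2 hosts (rows) x 3 services (columns, incl. the idle s_0).
P_t = to_allocation(np.array([[0.2, 1.0, 0.5],
                              [1.5, 0.1, 0.3]]))
A_t = np.array([[0.9, 0.1],     # host 0 mostly keeps its own scheme
                [0.3, 0.7]])    # host 1 mixes in host 0's scheme
P_next = replan(A_t, P_t)
print(P_next, P_next.sum(axis=1))   # rows still sum to 1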


3.3.2 Algorithm Design

In this reliable IoT service provisioning system, we mainly care about how SLAs are fulfilled, especially the TAT in the SLA. Thus, before developing our approach (DeraDE: dynamic service resource allocation for distributed edges), we should clarify the expected turnaround time of services and relate it to system reliability. Denoting the number of requests for service $s_i$ observed on $h_j$ (or its frequency) in time period $t$ by $\omega_{i,j}^t$, and the probability that $h_j$ dispatches $s_i$ to $h_k$ by $p_{i,j,k}^t$, the probability of the request path $\phi = (i, j, k)$ is

$$\begin{aligned} Pr\{\phi = (i, j, k)\} &= Pr\{s = s_i, H_a = h_j, H_e = h_k\} \\ &= Pr\{H_e = h_k \mid s = s_i, H_a = h_j\} \cdot Pr\{s = s_i \mid H_a = h_j\} \cdot Pr\{H_a = h_j\} \\ &= p_{i,j,k}^t \cdot \frac{\omega_{i,j}^t}{\sum_{i=0}^{m} \omega_{i,j}^t} \cdot \frac{\sum_{i=0}^{m} \omega_{i,j}^t}{\sum_{i=0}^{m}\sum_{j=0}^{n} \omega_{i,j}^t}. \end{aligned} \qquad (3.18)$$

As introduced before, we can get $p_{i,j,k}^t = \mu_k P_{k,i}^t / \sum_{q=0}^{n} \mu_q P_{q,i}^t$ with the definition of $P_{j,i}$. Denoting the total size of $s_i$'s input and output by $D_i$ ($D_i = D_i^{in} + D_i^{out}$) in time period $t$, the expected TAT $E[T]$ of this time period can be calculated as

$$E[T] = \sum_{i=0}^{m}\sum_{j=0}^{n}\sum_{k=0}^{n} Pr\{\phi = (i, j, k)\}\, T_\phi = \frac{1}{\sum_{i=0}^{m}\sum_{j=0}^{n} \omega_{i,j}^t} \sum_{i=0}^{m}\sum_{j=0}^{n}\sum_{k=0}^{n} \frac{\mu_k P_{k,i}^t\, \omega_{i,j}^t}{\sum_{q=0}^{n} \mu_q P_{q,i}^t} \cdot \left( \frac{D_i}{v_u^j} + \frac{D_i^{in}}{B_{j,k}} + \frac{w_i}{P_{k,i}^t \mu_k} + \frac{D_i^{out}}{B_{k,j}} \right). \qquad (3.19)$$

When there is more than one time period in service provision, we should consider them synthetically. As mentioned before, the resource allocation scheme of the next time period is determined by that of the previous time period. We use a Markov decision process to describe this. A Markov decision process is a stochastic control process described by a four-tuple $(X, Y, P, R)$, where $X$ is the state space, $Y$ is the action space, $P_{y^t}(x^t, x^{t+1})$ is the probability that action $y^t$ in state $x^t$ will lead to $x^{t+1}$, and $R^t$ is the immediate reward for applying action $y^t$ when the state is $x^t$. We can use the policy $\pi$ to describe the distribution of actions for different states:

$$\pi(y^t \mid x^t) = Pr[y = y^t \mid x = x^t]. \qquad (3.20)$$
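Equation (3.19) is an expectation over request paths, weighted by observed request frequencies and dispatch probabilities. The NumPy sketch below estimates it for small hypothetical inputs; it follows the structure of the formula rather than any specific implementation from the text, and all argument names are our own.

import numpy as np

def expected_tat(omega, P, mu_max, w, d_in, d_out, v_u, B):
    """Evaluation of E[T] in the spirit of Eq. (3.19).

    omega[i, j] : observed request count of service i at access server j
    P[k, i]     : resource quota of service i on host k (rows sum to 1)
    mu_max[k]   : maximum processing capacity of host k
    w[i], d_in[i], d_out[i] : workload and data sizes of service i
    v_u[j]      : wireless rate of access server j
    B[j, k]     : bandwidth between hosts j and k (np.inf when j == k)
    """
    num_services, num_hosts = omega.shape
    total_requests = omega.sum()
    e_t = 0.0
    for i in range(num_services):
        # dispatch probabilities p_{i,*,k}, weighted by allocated capacity
        cap = mu_max * P[:, i]
        p_dispatch = cap / cap.sum()
        for j in range(num_hosts):
            pr_access = omega[i, j] / total_requests
            for k in range(num_hosts):
                if cap[k] == 0:
                    continue                        # never dispatched there
                t_phi = ((d_in[i] + d_out[i]) / v_u[j]   # wireless up + down
                         + d_in[i] / B[j, k]             # routing
                         + w[i] / cap[k]                 # execution
                         + d_out[i] / B[k, j])           # backhaul
                e_t += pr_access * p_dispatch[k] * t_phi
    return e_t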

In this way, the Markov decision process provides a mathematical framework for modelling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. In our model, we denote .x t = .P t as the state of the provisioning system at the beginning of time period t. With observation t t t .x , the provisioning system strategically decides an action .y = .A following a t resource replanning policy .π . Because .E [T ] can be obtained at the end of this time period, the reliability score for applying action .y t for state .x t can be calculated as Rt = T  − Et [T ],

(3.21)

.

where .T  is a threshold to make a smaller .E[T ] receive more reliability score than the larger ones. Therefore, given the time period t, if the current state .x t = x, we can generate a sequence like seq = [.x t , y t , x t+1 , y t+1 , ...] according to .π. The accumulative reliability score for seq can be calculated as Gt = Rt+1 + γ Rt+2 + · · · =

∞ 

.

γ k Rt+k+1 ,

(3.22)

k=0

where .0 ≤ γ ≤ 1 is the discount factor to show how the subsequent reliability score will contribute to .Gt . As these sequences may be diverse, we can use the expectation of .Gt to evaluate the value for state .x t = x under a policy .π . Based on that, we can evaluate the global reliability score for applying action .y t = y at state .x t = x with an action-value function:  .

Qπ (x, y) = Eπ

∞ 

 Gt |x = x, y = y t

t

(3.23)

k=0

Thus, given a behavior policy β (where β(y | x) describes the probability of applying action y in state x and ρ_β is the probability density function of β), the expected global reliability score for policy π can be represented as

J_β(π) = \int_{x ∈ X} \int_{y ∈ Y} ρ_β(x) Q_π(x, y) \,dx \,dy
       = E_{x∼ρ_β, y∼β}[ Q_π(x, y) ]        (3.24)

The goal of our algorithm is to maximize the global reliability score (which is equivalent to minimizing the average turnaround time); it can be formulated as

π* = arg max_π J_β(π)        (3.25)


3.3.3 RL-Based Approach

In our model, the resource allocation scheme P and the replanning action A can be represented as vectors in R^{n^2+2n+1} and R^{nm+n+m+1}, respectively. By concatenating their rows, we can represent the policy π with a deterministic function f: R^{n^2+2n+1} → R^{nm+n+m+1} that generates an action for any given state. To meet this demand, we use a neural network Π with parameters θ_Π to approximate this function:

y^t = Π(x^t; θ_Π)        (3.26)

There is another term, Q_π(x, y), in Eq. 3.24 as well. We represent it with the function f_Q: R^{n^2+2n+1} × R^{nm+n+m+1} → R and use another neural network Q, whose parameters are denoted by θ_Q, to approximate f_Q:

Q_π(x, y) = Q(x, y; θ_Q)        (3.27)

Now we only need to clarify the structures of Π and Q. To train them with adequate exploration, we use an actor-critic structure [64] and off-policy learning, meaning that the behavior policy β used to generate experience differs from the policy being learned. In our work, we add noise N^t drawn from a random process to the actions determined by Π in the actor module to construct the behavior policy β:

y^t_β = Π(x^t; θ_Π) + N^t        (3.28)

Thus, given state x^t at the beginning of time period t, we can take the action y^t_β generated by β for Eq. 3.17. After replanning, the reward can be represented by R^t, and the current state of the provisioning system changes to x^{t+1}. Repeating this process, tuples of the form (x^t, y^t_β, R^t, x^{t+1}) are stored in a replay memory buffer M for future training. Taking advantage of the following Bellman equation for Q_π(x^t, y^t), derived from Eq. 3.23, we can use

Q̂_π(x^t, Π(x^t; θ_Π)) = E[ R^t + γ Q_π(x^{t+1}, Π(x^{t+1}; θ_Π)) ]        (3.29)

as the target of Q_π(x^t, Π(x^t; θ_Π)). We can create a supervised learning task to train the network Q on data batches (batch size N) sampled from M:

L_Q = \frac{1}{N} \sum_{i=1}^{N} \left( Q̂_π(x^t, Π(x^t; θ_Π)) − Q^i_π(x^t, Π(x^t; θ_Π)) \right)^2        (3.30)

and the parameters of Q can be updated with

θ_Q = θ_Q − η_Q ∇_{θ_Q} L_Q        (3.31)


The DeraDE algorithm is equivalent to searching for the best policy π that maximizes the accumulative reward. We can update θ_Π with ∇_{θ_Π} J_β(θ_Π), computed by

θ_Π = θ_Π − η_Π ∇_{θ_Π} J_β(θ_Π)        (3.32)

The gradient ∇_{θ_Π} J_β(θ_Π) can be approximated by

∇_{θ_Π} J_β(θ_Π) ≈ E_{x∼ρ_β}[ ∇_{θ_Π} Π(x; θ_Π) · ∇_y Q(x, y; θ_Q)|_{y=Π(x;θ_Π)} ]        (3.33)

Because the parameters of Q are updated frequently while also appearing in the gradients of both Q and Π, training the networks Π and Q directly may diverge. To ensure a converging process, we introduce target networks Π' and Q' with the same structure and initial parameters as Π and Q, but updated more conservatively using the following formulas:

θ_{Π'} = τ θ_Π + (1 − τ) θ_{Π'}
θ_{Q'} = τ θ_Q + (1 − τ) θ_{Q'}        (3.34)

The framework of this process is shown in Fig. 3.7, and the details of the approach are shown in Algorithm 3.
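The update rules above are essentially those of a deterministic-policy actor-critic with target networks. The following PyTorch sketch (an illustration under assumed state/action dimensions and hyper-parameters, not the book's implementation) shows how Eqs. 3.28–3.34 map onto code; the provisioning-system dynamics are replaced by random placeholder transitions.

# PyTorch sketch (illustrative only) of the actor-critic updates in Eqs. 3.28-3.34:
# an actor Pi, a critic Q, their target copies, a replay buffer, and soft target updates.
import random
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA, TAU = 8, 4, 0.9, 0.01   # assumed toy dimensions and hyper-parameters

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, critic = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM + ACTION_DIM, 1)
actor_tgt, critic_tgt = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM + ACTION_DIM, 1)
actor_tgt.load_state_dict(actor.state_dict())
critic_tgt.load_state_dict(critic.state_dict())
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay = []   # stores (x, y, R, x_next) tuples, as in the replay memory buffer M

def behavior_action(x, noise_std=0.1):
    # Eq. 3.28: behavior policy = deterministic actor output plus exploration noise
    with torch.no_grad():
        return actor(x) + noise_std * torch.randn(ACTION_DIM)

def soft_update(net, tgt):
    # Eq. 3.34: conservative (soft) update of the target networks
    for p, p_tgt in zip(net.parameters(), tgt.parameters()):
        p_tgt.data.mul_(1.0 - TAU).add_(TAU * p.data)

def train_step(batch_size=32):
    batch = random.sample(replay, batch_size)
    x, y, r, x_next = (torch.stack(t) for t in zip(*batch))
    # critic target (Eq. 3.29) and mean-squared loss (Eq. 3.30), then update (Eq. 3.31)
    with torch.no_grad():
        q_target = r.unsqueeze(1) + GAMMA * critic_tgt(torch.cat([x_next, actor_tgt(x_next)], 1))
    q_loss = nn.functional.mse_loss(critic(torch.cat([x, y], 1)), q_target)
    opt_critic.zero_grad(); q_loss.backward(); opt_critic.step()
    # actor update via the deterministic policy gradient (Eqs. 3.32-3.33),
    # implemented as descent on Loss_actor = -Q(x, Pi(x)) as in Fig. 3.7
    actor_loss = -critic(torch.cat([x, actor(x)], 1)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    soft_update(actor, actor_tgt); soft_update(critic, critic_tgt)

# fill the buffer with placeholder transitions (stand-ins for the provisioning system) and train once
for _ in range(64):
    x = torch.randn(STATE_DIM)
    replay.append((x, behavior_action(x), torch.tensor(1.0), torch.randn(STATE_DIM)))
train_step()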

Fig. 3.7 Framework of the reinforcement learning based approach (the service provisioning system feeds transition tuples (x^{t_i}, y_β^{t_i}, R^{t_i}, x^{t_i+1}) into a replay buffer used to train the actor network Π and the critic network Q, with actor loss Loss_actor = −Q(x^{t_i}, Π(x^{t_i}))). ©2020 IEEE, reprinted, with permission, from S. Deng et al. (2020)

Algorithm 3 DeraDE Algorithm
1: Initialise V: the batch size; θ^0_Q: the initial parameters of Q; θ^0_Π: the initial parameters of Π
2: Output Q: the action-value network; Π: the action network
3: for each episode do
4:   initialise the service provisioning system
5:   for each time period t do
6:     select an action y^t with Eq. 3.28;
7:     get new state x^{t+1} with Eq. 3.17;
8:     get reliability score R^t with Eq. 3.21;
9:     store tuple (x^t, y^t, R^t, x^{t+1}) in buffer M;
10:    sample V tuples {(x^{n_i}, y^{n_i}, R^{n_i}, x^{n_i+1}) | 1 ≤ i ≤ V} ⊂ M to compute ∇_{θ_Q} L_Q and update Q;
11:    compute ∇_{θ_Π} J_β with Eq. 3.33 to update Π;
12:    update Q' and Π' using Eq. 3.34;
13:  end for
14: end for

3.4 AI/ML for Service Operation and Management

Edge computing has numerous extensions to today's network architectures. Edge computing refers to a framework that facilitates low latency through edge servers deployed close to the sources of data or to applications deployed on users' devices [65, 66]. The edge network can be any functional part from the users' side to the center of the cloud server. These parts consist of different entities playing vital roles in integrating traditional networks. They provide computing, storage, caching, and transmission to support different application services in a real-time, dynamic, and intelligent manner for clients in the edge network [67]. Unlike traditional cloud computing (centralized servers), algorithms and strategy selection in edge computing push computing and intelligence closer to the actual activity generated by users. It requires services computing to move from central servers to the edge. Thus, there are differences related to multi-heterogeneous processing, bandwidth capacity, resource utilization, and privacy protection [68–70].

Emerging service utilization patterns require increasingly more computing capability from end users. Offloading addresses the problems of users' devices regarding storage resources, computing performance, and energy efficiency. Recent studies have introduced this technique as consisting of the offloading algorithm, the offloading strategy, and the offloading system [65]. Offloading faces a new problem in edge environments as the number of end users continuously increases. With the rapid growth of IoT and mobile devices, limited computation and network resources among end users in a specific edge network will become more and more common. Due to the heterogeneous nature of devices in edge networks, the traffic load in the network endures non-uniform and dynamic burst loads [71]. Recently published research reports that bursty traffic in the cloud radio access network (C-RAN) architecture is a significant and difficult problem [72]; meanwhile, an edge network combining edge servers with a remote cloud center is a popular technique in C-RAN for 5G [73]. It is challenging to tackle this problem when a burst load occurs in the edge environment. To reduce latency and maintain an acceptable QoS of edge networks, we focus on the problem of burst load evacuation using offloading strategies in edge environments. There are many occasions where load bursts may occur, including sudden shopping crowds in malls, crowds on festival streets, traffic during rush hour, and crowds in tourist areas. Such scenes have prominent regional or temporal characteristics. For example, in the edge computing network for vehicular infrastructure, edge servers deployed with small-cell base stations close to a vehicle may not be enough to satisfy the computation demand burst of each user (vehicle) [74]. Besides, many IoT/mobile devices that can connect to the Internet are scattered across these areas. Applications such as intelligent sensors, healthcare systems, and smart home/city devices run constantly. Numerous devices can generate bursty traffic, and thus edge servers must be able to support services from IoT/mobile devices [75]. For instance, many mobile users would cause a sudden influx of requests for their services at crowded sports events, concerts, or similar scenes.


However, current protocols commonly used in the Internet of Things focus only on information transmission [76]. Such protocols, for example HTTP, MQTT, and AMQP, cannot optimally handle the bursty traffic caused by a considerable number of client requests. It is therefore necessary to adopt a proper mechanism to tackle the evacuation of burst loads. For the distributed mobile edge system, a key lesson from studies on load balancing is that the scheduling of tasks in the network should follow an optimization algorithm under the constraints of the network links and edge server resources [65]. The different types of equipment (e.g., routers and switches) are fixed assets in an edge environment; in other words, the bandwidth of the links in the network is typically assumed to be a fixed rate. Similarly, the computational resources of each edge server are fixed. As more and more users join edge-enabled networks, these limited resources can quickly become performance bottlenecks. Therefore, designing an efficient offloading strategy is crucial when a burst load occurs at an edge server under limited resources.

To cope with bursty data arrivals of multi-user application offloading in an edge environment, many researchers have designed algorithms with the objective of minimizing the overall queuing latency of users. The authors in [77] propose an orchestrator for mobile augmented reality in edge computing; the major challenge there is determining to what extent the latency can be reduced. Data-intensive services in edge computing face load balancing conditions that constantly change between edge servers [78]; based on the limited resources of edge servers, the authors proposed a scheme using a genetic algorithm. Under a resource-constrained distributed edge environment, the authors in [68] focus on edge provisioning of computational resources to ensure stable latency of complex tasks for each mobile device. Network latency and computational latency are the primary considerations in most edge network systems when seeking an optimal evacuation policy. In addition, edge networks do not exclude cloud center networks; instead, recent studies have shown a promising trend in which edge computing and traditional cloud computing come together to form a new architecture, and this heterogeneous edge computing system consists of multiple layers [79]. In these platforms, edge servers need to cooperate closely with remote servers, both to resolve the lack of computing resources on users' devices and to preprocess computing tasks before relaying information to remote servers; both reduce service latency and save computing and networking costs. For example, in the field of security cameras for road traffic control or indoor monitoring, edge servers need to compress the recorded video before transmission; this ensures maximum efficiency before transferring the data to the remote server over links with limited capacity. In particular, for face recognition, edge servers may need to access portrait data from the database of the remote server, or some intelligent prediction algorithms need their models, frameworks, and parameters updated in real time [80]. Applications of intelligent computing can make good use of the edge computing mode [81]. To mitigate the latency of mobile augmented reality (AR) applications, a hierarchical computation architecture consisting of several layers, such as an edge layer and a cloud layer, is widely adopted [77]. However, the computation workload of each task is not easy to predict, because some heavy computation workloads are hard to know before their completion. For instance, the latency caused by the computation of an object recognizer and a renderer is difficult to evaluate in advance for mobile augmented reality (AR) [77]. Besides, tasks from the users in the edge environment can arrive as an arbitrary sequence. From the perspective of the deployment of edge servers, idle edge servers can often be used as supplements: when a burst load of tasks occurs, they can join other edge servers to help offload and process these tasks. The computational resource provisioning problem across multiple layers, however, is particularly challenging for burst loads, because different factors (the workload of an edge server, the transmission latency between different layers, etc.) affect the performance of the provisioning policy. Through analysis, three factors affect the overall latency of the burst load when implementing an offloading strategy: the link capacity, the network transmission speed, and the computational latency. Load balancing in a multilayer edge network is paramount in edge environments. Offloading the burst load is a particular case of load balancing and usually includes the following main challenges: (1) the limited link bandwidth in the edge network, (2) the limited edge network transmission speed, and (3) efficiency in an online policy-based distributed edge network. This section addresses the challenges mentioned above and proposes an efficient algorithm to minimize the offloading latency. To solve the burst load problem, we focus on the three-layer architecture of the edge environment (Fig. 3.8), that is, the collaborated migration layer, the computing sharing layer, and the remote assisted layer. We denote dispatching as task migration from an edge server to other edge and remote servers. We aim to minimize the delay of all tasks after the dispatching and scheduling stages.

3.4.1 System Model

Our edge network system consists of multiple edge servers, as shown in Fig. 3.8. We present the index set of tasks as N ≜ {1, 2, ..., N} and the index set of edge servers as M ≜ {1, 2, ..., M}. Each edge server is able to receive tasks from other edge servers. This task migration goes from a heavily loaded edge server toward others with less load. The workload of edge servers rises drastically when a considerable number of requests initiated by users flood into them; under this condition, a burst load occurs. Assume each edge server can offer computing resources for tasks during the burst load that happens in the computing sharing layer [82] and connects to the remote server directly. Each task consumes this resource at a different level. Here, in the migration collaboration layer, we assume there are N tasks at one specific edge network that need offloading as soon as possible. A total of M edge servers


Fig. 3.8 The edge environment. ©2021 IEEE, reprinted, with permission, from S. Deng et al. (2021)

are deployed in this network that are able to execute the offloaded tasks [65]. After allocating all of the tasks in the burst load, we can decide whether a task waits in the burst load or migrates to another edge server immediately. We aim to minimize the makespan of the task offloading procedure. The different stages of tasks in the edge network are shown in Fig. 3.12. For example, some tasks become suspended under the limited capacity of links in the period [t_0, t_1] due to the burst load. Then the system migrates these tasks from the original edge server to another in the period [t_1, t_2]. At the last stage, the system schedules these tasks at the edge server with our proposed scheduling algorithm, and they finally complete at t_3. The migration of tasks only occurs at the migration collaboration layer. The computing sharing layer and the remote assisted layer are responsible for the computation of the burst-load tasks cooperatively.

Task migration delay before offloading  When a burst load occurs at an edge server, it accumulates many tasks in a short period. These tasks from users' requests wait at the edge server to be dispatched to another edge server. According to the system manager, our proposed evacuation policy starts when tasks in the burst load accumulate to some extent. Thus, our designed mechanism only focuses on dealing with all tasks involved at once after the burst load has formed. The system caches tasks in the task pool. The caching phase lasts for a period until they can start the migration. We denote the indicator of the ith task's


latency in the burst load as I^†_i(t), where I^†_i(t) ∈ {0, 1}. I^†_i(t) = 1 means the ith task is waiting in the task pool of the edge server at the tth time slot; otherwise, I^†_i(t) = 0. Once a burst load occurs in the edge network, all of the tasks are cached in the edge server until they can migrate. The latency of the ith task is calculated as

T^i_1 = \sum_{t>0} I^†_i(t),        (3.35)

where T represents the time when the offloading finishes.

Task migration latency between edge servers  Compared with traditional networks, several edge servers that are close to each other might be linked and constructed as a local area network. We assume that each edge server is linked to several servers to form a migration collaboration layer (Fig. 3.9). After the burst load occurs, tasks can be migrated through these links according to our optimal strategy. The system of an edge network can monitor the link-status information as well as the available resources of each edge server. Under our offloading scheme, a task can migrate from one edge server to another using an optimal routing policy. We present the index set of links as K = {1, ..., K} and the index set of available routings in the edge network as R = {1, ..., R}. Before a task is migrated to another server, the migration cost is computed. An available routing in the edge environment consists of several links; each link's latency is denoted as l_k, k ∈ K [83]. Each link allows only a limited number of tasks to pass in a time slot t during

Fig. 3.9 The migration of the burst load. ©2021 IEEE, reprinted, with permission, from S. Deng et al. (2021)


migration; this is presented as c_k(t), where the maximum number is presented as c^{max}_k. The following condition must be satisfied:

c_k(t) = \sum_{i ∈ N} task^i_k(t) ≤ c^{max}_k,   t > 0, k ∈ K        (3.36)

where task^i_k(t) = 1 means the ith task is migrated through the kth link; otherwise, task^i_k(t) = 0. Task migration must take place over the available links between edge servers in our edge network. In other words, each task, after waiting in the burst load queue, should start its migration. We denote the ith task's migration state as I^‡_{i,k}(t) = 1 if the ith task is being migrated on the kth link at the tth time slot; otherwise, I^‡_{i,k}(t) = 0. We present this vector as

I^‡_i(t) = [ I^‡_{i,1}(t) · · · I^‡_{i,|K|}(t) ]^T        (3.37)

where I^‡_{i,k}(t) ∈ {0, 1}, i ∈ N, k ∈ K, t > 0        (3.38)

Because a task is migrated as a whole and through one specific link, it cannot be duplicated or split. It should satisfy the following constraint:

\sum_{k ∈ K} I^‡_{i,k}(t) ∈ {0, 1}        (3.39)

Because temporary congestion may occur at some links, we can estimate the latency of migration for each task as

T^i_2 ≥ l_1 + l_2 + ... + l_k,   k ∈ K        (3.40)
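As a small illustration of the two constraints above, the following Python sketch (with assumed, illustrative link latencies and capacities, not the book's code) computes the congestion-free migration latency bound of Eq. 3.40 and checks the per-link capacity condition of Eq. 3.36 for one time slot.

# Illustrative sketch: Eq. 3.40 (route latency as the sum of its link latencies) and
# Eq. 3.36 (per-link capacity check for one time slot).
from collections import Counter

link_latency = {0: 2.0, 1: 1.5, 2: 3.0}   # l_k for each link k (assumed values)
c_max = {0: 2, 1: 1, 2: 2}                # c_k^max for each link k (assumed values)

def migration_latency(route):
    """Congestion-free lower bound on T_2^i for a task using this sequence of links."""
    return sum(link_latency[k] for k in route)

def capacity_ok(links_used_this_slot):
    """True if the number of tasks on every link in this slot stays within c_k^max."""
    counts = Counter(links_used_this_slot)
    return all(counts[k] <= c_max[k] for k in counts)

print(migration_latency([0, 2]))   # 5.0
print(capacity_ok([0, 0, 1]))      # True: link 0 used twice (<= 2), link 1 once (<= 1)
print(capacity_ok([1, 1]))         # False: link 1 exceeds its capacity of 1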

Task processing latency at the edge and remote servers  Once a task arrives at an edge server, after our dispatching strategy, the edge network system will arrange the task's processing to minimize the makespan of all tasks. As shown in Fig. 3.10, the latency in this stage includes three parts: pre-process latency, remote-assistance latency, and final-execution latency. We present the indicator of task execution by

I^§_{i,j}(t) ∈ {0, 1},   i ∈ N, j ∈ M        (3.41)

This means the ith task is being executed on the jth edge server during the tth time slot if I^§_{i,j}(t) = 1; otherwise, I^§_{i,j}(t) = 0. The start time of the ith task after arriving at the jth edge server is represented as t̃^s_{i,j}, and its final completion time at the jth server as t̄^c_{i,j}.


Fig. 3.10 Task scheduling for executing and requesting data. ©2021 IEEE, reprinted, with permission, from S. Deng et al. (2021)

Fig. 3.11 Task processing sequence. ©2021 IEEE, reprinted, with permission, from S. Deng et al. (2021)

The computation delays of the ith task are denoted as X̃_{i,j}, X̂_{i,j}, and X̄_{i,j} during these steps, that is, the pre-process, the remote-assistance, and the final-process, respectively. We represent the CPU cycles needed in the pre-process and final-process parts of the ith task as ẽ_i and ē_i, respectively [84]. As shown in Fig. 3.11, the phases of each task are processed in sequence in our scheduling stage. We represent the CPU frequency of each edge server as f_j, where j ∈ M. The latency of the ith task generated by remote-assistance is denoted as X̂_{i,j}, where j is the edge server that processed


the task i. The latency for buffering a task at servers is defined as ξ_i. Using them, we can calculate the total execution time of a task as follows:

T^i_3 = \frac{ẽ_i}{f_j} + X̂_{i,j} + \frac{ē_i}{f_j} + ξ_i,        (3.42)

where i ∈ N, j ∈ M. The computing work of each phase of a task on a certain edge server should never stop until the work is finished. However, because a server can only process one job in a time slot, the task execution should satisfy the following constraints:

\sum_{j ∈ M} I^§_{i,j}(t) ≤ 1,   i ∈ N, t > 0,        (3.43)

I^§_i(t) = [ I^§_{i,1}(t), I^§_{i,2}(t), · · · , I^§_{i,M}(t) ],        (3.44)

rank\left( \sum_{t=t'}^{t} I^§_i(t) \right) = 1,   t > 0,        (3.45)

t' ∈ { t̃^s_{i,j}, t̄^s_{i,j}, t̂^s_{i,j} },   t ∈ { t̃^c_{i,j}, t̄^c_{i,j}, t̂^c_{i,j} }        (3.46)

When the rank of the matrix is "1," it means that the ith task is executed only on one specific edge server. Because the system is dynamic in both the dispatching and scheduling stages, once a task is migrated to another edge server, it will start its processing as soon as that server has free computation resources. Therefore, the task processing times should satisfy the following constraints in the pre-process, remote-assistance, and final-process phases:

t̂^s_{i,j} ≥ t̃^c_{i,j}
t̄^s_{i,j} ≥ t̂^c_{i,j},        (3.47)

where i ∈ N (Fig. 3.12).

3.4.2 Problem Analysis

Tasks are offloaded along the optimal route to obtain a minimal makespan. Our proposed strategy is based on knowledge of the link connections in the edge environment. In the edge network, there may be a large number of tasks (such as the cloud-edge collaborative machine learning models mentioned earlier) that require cloud-edge cooperation. In these platforms, the processing time of the model itself may be uncertain, and the data transmission latency between the edge network and the cloud center server is also uncertain; it is therefore difficult to determine the edge-cloud transmission delay. In other


Fig. 3.12 Task offloading stages. ©2021 IEEE, reprinted, with permission, from Deng et al. (2021)

words, due to the uncertainty in task processing, it is difficult to know how much time a task needs to be executed [85]. In this chapter, the uncertainty consists of the transmission latency between the edge server and the remote server and the latency due to the processing at the edge server and the remote server. As shown in Fig. 3.11, the latency caused by remote-assistance begins the moment the task's pre-process completes and finishes when the task returns to the original edge server. We define the latency of the ith task from the start of the burst load to its completion as a_i, i ∈ N. There are three stages throughout a task's lifetime, T^i_1, T^i_2, and T^i_3, corresponding to the task latency in the burst load, the task migration latency among edge servers, and the total processing latency of the task, respectively. We need to arrange the strategy of each task in the burst caching phase, migration phase, and processing phase. Since a task might be migrated from the original edge server to another edge server, we should decide which link is scheduled for each task to migrate on at each time slot and which edge server is scheduled to execute each task. The total latency of a task can be formulated as

a_i = T^i_1 + T^i_2 + T^i_3,   i ∈ N.        (3.48)
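Putting Eqs. 3.35, 3.40, 3.42, and 3.48 together, the short sketch below (with toy numbers, not the book's code) shows how the three stage latencies combine into a_i and how the makespan objective of the next formulation is obtained; all parameter names and values are assumptions for illustration.

# Illustrative sketch: per-task latency a_i = T1 + T2 + T3 (Eq. 3.48) and the
# makespan max(a_1, ..., a_N) that problem P1 minimizes (Eq. 3.49).
def total_latency(wait_slots, route_links, e_pre, e_fin, f_j, x_remote, xi, link_latency):
    t1 = wait_slots                                  # Eq. 3.35: slots spent waiting in the task pool
    t2 = sum(link_latency[k] for k in route_links)   # Eq. 3.40: migration over the chosen route
    t3 = e_pre / f_j + x_remote + e_fin / f_j + xi   # Eq. 3.42: pre-process + remote-assistance + final-process + buffering
    return t1 + t2 + t3

link_latency = {0: 1.0, 1: 2.0}
tasks = [                                            # (wait, route, e_pre, e_fin, f_j, X_hat, xi) -- assumed values
    (3, [0],    4.0, 2.0, 2.0, 1.5, 0.2),
    (1, [0, 1], 6.0, 3.0, 3.0, 2.0, 0.1),
]
a = [total_latency(*t, link_latency) for t in tasks]
print(a, max(a))                                     # makespan = the most time-consuming task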


To minimize the makespan of all tasks in a burst load, we formulate the problem as

P_1: \min_{I^†_i(t), I^‡_i(t), I^§_i(t)} \max(a_1, · · · , a_N)
     s.t. (3.36), (3.39), (3.40), (3.43), (3.45), (3.47)
          t̃^s_{i,j} ≥ T^i_1(t) + T^i_2(t)
          t > 0, i ∈ N, j ∈ M        (3.49)

where max(a_1, · · · , a_N) represents the most time-consuming task in the edge computing network. We can divide the task offloading problem P_1 into two independent stages: the dispatching stage and the scheduling stage. During the dispatching stage, the task is uninterruptedly migrated from the server where the burst load occurs to other servers through the provided links. In the scheduling stage, tasks are pre-processed, and the remote central server cooperates with the edge servers to complete the processing of the offloaded tasks. Therefore, we break down the problem into two sub-problems, P_2 and P_3. As shown in Fig. 3.8:

• The collaborated migration layer is composed of the topological connection links of edge servers in the edge network. They form a network with a limited number of nodes and links. Each server shares these links as possible task offloading paths. Tasks from the burst load accumulate at a specific edge server; we propose the OPTD algorithm (Algorithm 4) to obtain an optimal decision for offloading tasks among servers, considering the capacity and migration rate constraints.
• The computing sharing layer consists of each node in the collaborated migration layer at the dispatching stage. In general, each node in the collaborated migration layer represents one edge server in the computing sharing layer. An edge server in the computing sharing layer receives tasks that are migrated through the collaborated migration layer and shares its computation resources to process the offloaded tasks.
• The remote assisted layer is used to assist edge servers in processing tasks, that is, to cooperate with the edge servers. We propose Algorithm 6 to schedule the tasks between edge and remote servers to obtain an optimized latency at the scheduling stage.

3.4.3 Dispatching with Routing Search

Assuming that the connections/links in a specific edge environment are known, we denote R = {1, 2, ..., R} as the set of available offloading routes. Each task of the burst load uses an offloading route to migrate between servers. In x = {r_1, r_2, · · · , r_N}, r_i represents the index of the route that the ith task chooses, where i ∈ N, r_i ∈ R.


A feasible x represents the solution of our algorithm for optimal routing in the task dispatching stage, as shown in Algorithm 4. Furthermore, based on the analysis of the original problem, we can formulate the optimal routing problem P_2 of the task dispatching stage as

P_2: \min_{x ∈ X} \max(r_1, · · · , r_N)
     s.t. (3.39), (3.40), (3.43), (3.45), (3.36)        (3.50)

To find the optimal route, we investigate the total offloading time of all tasks with the makespan criterion. We present y as the makespan among all tasks during the dispatching stage. The optimal route for task dispatching is computed using Algorithm 4. We apply a method called derivative-free optimization [86, 87], based on computational learning theory [88], to determine the route for each task. An optimization method is proposed to search for a promising solution in our algorithm with a sampling model; under a particular hypothesis, this algorithm discriminates good solutions from bad ones. In each of the T iterations, applying the updated hypothesis, a new optimal solution may be generated from this space. The hypothesis is a function mapping in the set H = {h : X → Q}, where Q is the set of optimal solutions. We aim to find a suitable hypothesis h ∈ H that can construct an optimal solution space from which to sample a feasible routing for each task. We consider y = c(x) as the makespan of the tasks after the dispatching stage. Assume an initial solution set S_0 = {(x_0, y_0), · · · , (x_U, y_U)} of size U, and let R be the set of available routes sorted in ascending order of the latency for migrating tasks. The value x^i is the route that the ith task chooses, where x^i ∈ R. The sliding window moves to the right every W tasks in the solution space, where W ≤ N. We randomly generate a number λ in each iteration t. If λ is less than α (a predefined threshold), we traverse the tasks in the window and proceed with a heuristic strategy to allocate a lower-latency routing to the task; otherwise, a random route is allocated to the task. Once the offloading route is selected for the selected task, we fix the offloading route for that task index in I, where I denotes the set of indexes of tasks in the burst load. After task index i is allocated a feasible offloading route, all allocated tasks are denoted as the set I_f, and I_v represents the indexes of the remaining tasks. After all the tasks in the window are traversed, we compare the newly constructed solution x̃ with makespan ỹ against the minimum-value solution of the previous iteration. If x̃ is better (ỹ is smaller), then the search space represented by the hypothesis h_t (the updated I_f and I_v) is accepted; this space is used as the sampling space for the next round of solutions. Otherwise, according to line 16 of the algorithm, a global search is introduced to reconstruct h_t. Finally, we sample from the hypothesis space to get y_h and then compare it with y_t and ỹ to get y_min. The next round of sampling is started afterward. Because the feasible region constructed by h_t is large, we define α as a threshold for choosing a low-latency routing for a task i. The makespan for each routing


vector is calculated heuristically, based on the modified α and T, to obtain a better solution. At the beginning of the algorithm, we initialize a feasible solution for each task by randomly choosing another edge server to offload it to. A fast routing is chosen with the following probability:

P = \frac{\sum_{l_k ∈ r_i} l_k}{\sum_{r ∈ R_m} \sum_{l_k ∈ r} l_k}        (3.51)

where .ri ∈ R, .Rm is the set of routes for offloading to the mth edge server. Here, wsz is a parameter to control the window size in each iteration, and we set the window N N size .W =  wsz  in each iteration t. We set iteration number as .T = iter , where iter is a parameter to control the iterations. Assuming the solution of task dispatching with size of N, after a simple analysis of Algorithm 4, the overall complexity of the algorithm would be .O(NT). Algorithm 4 Optimal Routing for Task Dispatching (OPTD) 1: Generate the window size W , the initial solution size U,the parameter of routing selection γ 2: Initial S0 = {(x0 , y0 ), · · · , (xU , yU )}, Vt = ∅ 3: for t = 1 to T do 4: (xt , yt ) = (xmin , ymin ), Iv = I, If = ∅ 5: for i = (t − 1) ∗ W to t ∗ W do 6: if λ < α then 7: x˜ti = xti /γ , Iv = Iv \ {i}, If = If ∪ {i} 8: else 9: x˜ti = rand(1, |R|) 10: end if 11: end for 12: y˜t = c(x˜t ) 13: if y˜t < yt then 14: construct ht from the updated Iv and If 15: else 16: randomly choose i with the size of |If | to replace the elements in If , Iv = I \ If , construct ht 17: end if 18: for t = 1 to U do 19: Sample xh from ht , yh = c(xh ), Vt = Vt ∪ (xh , yh ) 20: end for 21: ymin = min{yh , yt , y˜t }, (xh , yh ) ∈ Vt 22: end for 23: return (xmin , ymin )

100

3 AI/ML for Service Life Cycle at Edge

Fig. 3.13 Task scheduling for executing and requesting data. ©2021 IEEE, reprinted, with permission, from S. Deng et al. (2021)

3.4.4 Scheduling with Online Policy The online scheduling algorithm is to arrange the order of distributed execution of tasks on edge servers. We adopt distributed edge servers in the edge network to realize a parallel strategy in scheduling. After pre-processing a task by an edge server, this task is migrated to the remote server for remote-assistance. Meanwhile, other edge servers may migrate their task to the remote server too. Our target is to optimize the strategy to minimize the makespan for all tasks during offloading. Thus, we adopt a method to make the scheduling of tasks as a re-entrant scheduling problem [89]. As shown in Fig. 3.13, edge servers may communicate with each other and the remote server to offload tasks. Under the re-entrant line scheduling problem, edge servers need to offload tasks to a set of shared remote servers. In general, the total consuming time of the remote server will be much greater than any edge server in the edge environments. Thus, we simply assume the remote server as the bottleneck of the system and denote it as btn [89]. As shown in Fig. 3.13, to construct a re-entrant line scheduling, we denote .step1 , step2 , · · · , step9 to correspond to each step in edge servers and remote server. For example, .step1 and step3 represent the pre-process and finalprocess in the f irst edge server; the steps of .step2 , step5 , and step8 represent .btn1 , btn2 , and btn3 of the remote server. We regard all the steps as a cycle that starts from .step1 and completes at .step9 . As shown in Fig. 3.13, dotted lines connect each edge server with the remote server, and the sequence constructs the

3.4 AI/ML for Service Operation and Management

101

virtual processing of tasks representing a cycle of re-entrant line scheduling. When a task completes at the final processing step of each edge server, one cycle in Algorithm 6 is concluded. For example, if there are only three edge servers with a btn server, the f ourth task in .step1 , the third task in .step2 , and the second task in .step3 are one cycle in our algorithm shown in Fig. 3.14. We denote the set of tasks’ index dispatched to the j th edge server as .Nj , where N = N1 ∪ N2 , · · · , ∪NM .

.

(3.52)

The pipeline structure has good parallel efficiency, and the distributed edge system can maximize the utilization of the limited computation resources. In a parallel structure, the start times of the task execution should satisfy the following constraints: .

s = Tini , . t˜1,j

(3.53)

s c ≥ t¯i−1,j . t˜i,j

(3.54)

s c ≥ t˜i,j ,. tˆi,j

(3.55)

s c ≥ tˆi,j , i ∈ Nj \ {1}, j ∈ M t¯i,j

(3.56)

The task execution in each edge server in our parallel strategy is followed by the btn server (remote server). Each cycle starts at the start time of the first task executed by the btn server; edge servers are not allowed to start ahead of the cycle. As shown in Fig. 3.14, .step2 and .step5 represent the steps in btn server, and the btn server always executes tasks without any delay. Suppose the .step3 and .step1 represent the ith server; if the third task at the .step3 completes, the ith server cannot immediately start the task 6; this is because the btn server has not started a new cycle, and the f ourth task has not started yet. Therefore, the btn server paces all edge servers in the computing sharing layer. Hereafter, i is denoted the unique sequence number of tasks in all servers; this means the ith task stands for the task arrived that server and is indexed by i by this server. The start time of the task in each edge server should satisfy the following constraint: s c s = max{t¯i−1,M , tˆi,j } t˜i,1

.

(3.57)

We must make sure to execute a task at the edge server at a certain step after its migration to this edge server has been done. In the same view of the task migration stage on any link, the task is only allowed to start its migration after it has migrated at the previous hop of routing it belonged. We call this special buffer safety stock (a term used by logistics [90]); it is used in order to mitigate the risk of stockouts of tasks. Safety stocks are used in our system to make each step of the edge server feasible and efficient at the scheduling stage. We present the index of tasks as .F˜j =

102

3 AI/ML for Service Life Cycle at Edge

Fig. 3.14 Task scheduling for in pipeline. ©2021 IEEE, reprinted, with permission, from S. Deng et al. (2021)

{1, 2, · · · , |Nj |} for pre-processing and .Fˆj and F¯j for remote-assistance and finalprocessing. The safety stock at step d of the j th edge server is on the left side of the red dotted line shown in Fig. 3.14. Each step in edge servers now has an offset before the edge server starts processing the task. To get a feasible schedule, those safety stocks in each step should be set up before the step starts work. It is similar to having some delay in each step when the pipeline structure begins. We denote the initialization delay for this kind of delay at the beginning of the pipeline organization as .Tini . .Tini is the cost time of processing in Algorithm 5. After initialization of the scheduling, the task start time at each step should be given by .Tini at the scheduling stage. As stated above, all edge servers start their work in parallel according to Algorithm 5. We present .b˜j and .b¯j as sets to cache the task from the previous step for pre-processing and final-processing; .j ∈ M, and .bˆj is denoted the task pool at a remote server that caches tasks from the j th edge server. .bˆj represents the task pool at the j th server in order to receive the arrived tasks from the collaborated migration layer. Processing each task in the edge server will follow the first-come-first-served (FCFS) rule in the task pool. Suppose task left in the step of final-process at the edge server after the initialization scheduling is .Nj . Thus, based on the analysis of the original problem .P1 and .P2 , after tasks arrive at the edge server by arranging each task to a routing, we can get the problem of optimal parallel scheduling for tasks as .P3 which is the same problem of .P1 as  c P3 : min max t¯|N j |,j .

j ∈M

s.t.(3.53), (3.54), (3.55), (3.56), (3.57)

(3.58)

3.4 AI/ML for Service Operation and Management

103

Algorithm 5 Initialization Scheduling 1: while the ith task arrive at j th edge server or |bˆj | = 0 or|b¯j | = 0 do 2: start the arrived task followed FCFS rule at the first step of each edge server 3: push the task to bˆj 4: if the (|Fˆj | − 1)th task has completed at step of pro-processing and the btn remote server is not busy then 5: start the arrived task followed FCFS rule 6: push the task to bˆj 7: end if 8: if the (|F¯j | − 1)th has task completed at step of final-process and the j th edge server is not busy then 9: break the while loop, calculate the stop time t¯c¯ |Fj |,j

10: end if 11: end while j 12: when all the edge servers stop the initialisation, Tini = max(t¯|cF¯ 13: Tini =

1 ,T2 ,··· max(Tini ini

|M| , Tini )

j |,j

, tˆcˆ

|Fj |,j

).

 c means the maximum value in the set of task completion time at where .max t¯|N j |,j each step of all edge servers. In general, after we solve the problem of .P2 , we can get the routing arrangement for each task. Then based on the solution of .P2 , we can solve the problem of .P3 to achieve the makespan of tasks’ offloading. To solve .P3 , we have designed a mechanism that event-driven scheduling is provided, as shown in the right part of the red dotted line in Fig. 3.14. In Algorithm 6, we design a heuristically online strategy that makes btn server as busy as possible. We do not use any processing value in advance to make any favorable permutation of tasks. Edge servers may process some tasks at time t when some previous task completes or the remote server starts a new cycle. We denote the indicator set .sj (t) = {0, 1}, t > 0, j ∈ M ∪ {btn} as these edge servers and the remote server. .sj (t) = 1 means the j th edge server is busy processing some task; otherwise, .sj (t) = 0. We denote the cycles started by the remote server as .dj and the edge server as .dj , j ∈ M. .startCycleEn is a set of the indicator where .startCycleEnj = 1; that is, the j edge server is allowed to process the task; otherwise, .startCycleEnj = 0, .j ∈ M. The scheduling algorithm will be executed in parallel by all edge and remote servers. The algorithm is an event-driven mechanism; thus, we can deduce that the overall complexity of the algorithm is .O(1). In this algorithm, we denote the index of the set of running edge servers in the time interval t as RS, where .t > 0. .p˜ j , .pˆ j , and .p˜ j represent the step at the edge server j for pre-processing, remote-assistance, and final-processing, respectively; where .j ∈ M, .j stands for the remote server.

104

3 AI/ML for Service Life Cycle at Edge

Algorithm 6 Optimal Parallel Scheduling for Task (OPST) 1: For every distributed edge server, the event driven algorithm is implemented parallel 2: Initial dj = 0, dj = 0, startCycleEn = 0 Monitor values of bˆj (t), b¯j (t), j ∈ M, t ≥ 0 s or t = t¯c or t = t˜c ) do 3: while (t = tˆi,1 i,j i,j s ˆ 4: if t = ti,1 then 5: dj = jj + 1 6: Start the task from b˜j (t) in FIFO order at the first step of the j server, j in M \ RS 7: if sj (t) = 1 then 8: startCycleEn[j ] = 1 9: end if 10: end if c then 11: if t = tˆi,j 12: Start the task at the pˆ j +1 step of the remote server in FIFO order from bˆj +1 (t) 13: end if c then 14: if t = t˜i,j 15: Start the task at step f i of the j th server in FIFO order from b¯j (t) 16: end if c and startCycleEn[j ] = 1 and d > d then 17: if t = t¯i,j j j 18: Start the task from b˜j (t) in FIFO order at the first step of the j th server, increase dj as dj + 1 19: if dj = dj then 20: startCycleEn[j ] = 0 21: end if 22: end if 23: end while

3.5 Summary This chapter discusses how AI and ML methods can be used to optimize the full life cycle of edge services. We divide typical problems for the life cycle of services in edge computing systems into topology, content, and service using a bottom-up approach. In the first section, we give an overall discussion of AI for edge. In the second section, we use several examples of micro-service function chain redundancy placement and deployment problems to show how intelligence algorithms can help in the decision-making for the deployment mode of services. We also studied how online learning algorithms can help in solving the job scheduling problems in the running mode of services.

References 1. Shuiguang Deng, Hailiang Zhao, Weijia Fang, Jianwei Yin, Schahram Dustdar, and Albert Y. Zomaya. Edge intelligence: The confluence of edge computing and artificial intelligence. IEEE Internet of Things Journal, 7(8):7457–7469, 2020. 2. J. Xu, Y. Zeng, and R. Zhang. Uav-enabled wireless power transfer: Trajectory design and energy optimization. IEEE Transactions on Wireless Communications, 17(8):5092–5106, Aug 2018.

References

105

3. B. Li, Z. Fei, and Y. Zhang. Uav communications for 5g and beyond: Recent advances and future trends. IEEE Internet of Things Journal, 6(2):2241–2263, April 2019. 4. M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong. Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-ofexperience. IEEE Journal on Selected Areas in Communications, 35(5):1046–1061, May 2017. 5. Guangxu Zhu, Dongzhu Liu, Yuqing Du, Changsheng You, Jun Zhang, and Kaibin Huang. Towards an intelligent edge: Wireless communication meets machine learning. CoRR, abs/1809.00343, 2018. 6. Y. Sun, M. Peng, and S. Mao. Deep reinforcement learning-based mode selection and resource management for green fog radio access networks. IEEE Internet of Things Journal, 6(2):1960– 1971, April 2019. 7. Hongyue Wu, Shuiguang Deng, Wei Li, Jianwei Yin, Qiang Yang, Zhaohui Wu, and Albert Y. Zomaya. Revenue-driven service provisioning for resource sharing in mobile cloud computing. In Michael Maximilien, Antonio Vallecillo, Jianmin Wang, and Marc Oriol, editors, ServiceOriented Computing, pages 625–640, Cham, 2017. Springer International Publishing. 8. S. Deng, Z. Xiang, J. Yin, J. Taheri, and A. Y. Zomaya. Composition-driven iot service provisioning in distributed edges. IEEE Access, 6:54258–54269, 2018. 9. S. Deng, Z. Xiang, J. Taheri, K. A. Mohammad, J. Yin, A. Zomaya, and S. Dustdar. Optimal application deployment in resource constrained distributed edges. IEEE Transactions on Mobile Computing, pages 1–1, 2020. 10. L. Chen, J. Xu, S. Ren, and P. Zhou. Spatio–temporal edge service placement: A bandit learning approach. IEEE Transactions on Wireless Communications, 17(12):8388–8401, Dec 2018. 11. ShuiGuang Deng, Hongyue Wu, Wei Tan, Zhengzhe Xiang, and Zhaohui Wu. Mobile service selection for composition: An energy consumption perspective. IEEE Trans. Automation Science and Engineering, 14(3):1478–1490, 2017. 12. S. Zhang, P. He, K. Suto, P. Yang, L. Zhao, and X. Shen. Cooperative edge caching in user-centric clustered mobile networks. IEEE Transactions on Mobile Computing, 17(8):1791– 1805, Aug 2018. 13. H. Zhao, W. Du, W. Liu, T. Lei, and Q. Lei. Qoe aware and cell capacity enhanced computation offloading for multi-server mobile edge computing systems with energy harvesting devices. In 2018 IEEE International Conference on Ubiquitous Intelligence Computing, pages 671–678, Oct 2018. 14. H. Zhao, S. Deng, C. Zhang, W. Du, Q. He, and J. Yin. A mobility-aware cross-edge computation offloading framework for partitionable applications. In 2019 IEEE International Conference on Web Services, pages 193–200, Jul 2019. 15. M. Min, L. Xiao, Y. Chen, P. Cheng, D. Wu, and W. Zhuang. Learning-based computation offloading for iot devices with energy harvesting. IEEE Transactions on Vehicular Technology, 68(2):1930–1941, Feb 2019. 16. X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis. Performance optimization in mobileedge computing via deep reinforcement learning. In 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), pages 1–6, Aug 2018. 17. S. Deng, Z. Xiang, P. Zhao, J. Taheri, H. Gao, A. Zomaya, and J. Yin. Dynamical resource allocation in edge for trustable iot systems: a reinforcement learning method. IEEE Transactions on Industrial Informatics, pages 1–1, 2020. 18. ShuiGuang Deng, Hongyue Wu, Daning Hu, and J. Leon Zhao. Service selection for composition with qos correlations. IEEE Trans. Services Computing, 9(2):291–303, 2016. 19. C. Zhang, H. Zhao, and S. Deng. 
A density-based offloading strategy for iot devices in edge computing systems. IEEE Access, 6:73520–73530, 2018. 20. Mingzhe Chen, Ursula Challita, Walid Saad, Changchuan Yin, and Mérouane Debbah. Machine learning for wireless networks with artificial intelligence: A tutorial on neural networks. CoRR, abs/1710.02913, 2017. 21. Omid Abari, Hariharan Rahul, and Dina Katabi. Over-the-air function computation in sensor networks. CoRR, abs/1612.02307, 2016.

106

3 AI/ML for Service Life Cycle at Edge

22. Guangxu Zhu, Li Chen, and Kaibin Huang. Over-the-air computation in MIMO multi-access channels: Beamforming and channel feedback. CoRR, abs/1803.11129, 2018. 23. Guangxu Zhu, Yong Wang, and Kaibin Huang. Low-latency broadband analog aggregation for federated edge learning. CoRR, abs/1812.11494, 2018. 24. Kai Yang, Tao Jiang, Yuanming Shi, and Zhi Ding. Federated learning via over-the-air computation. CoRR, abs/1812.11750, 2018. 25. Dongzhu Liu, Guangxu Zhu, Jun Zhang, and Kaibin Huang. Wireless data acquisition for edge learning: Importance aware retransmission. CoRR, abs/1812.02030, 2018. 26. Jin-Hyun Ahn, Osvaldo Simeone, and Joonhyuk Kang. Wireless federated distillation for distributed edge learning with heterogeneous data. ArXiv, abs/1907.02745, 2019. 27. X. Zhang, M. Peng, S. Yan, and Y. Sun. Deep reinforcement learning based mode selection and resource allocation for cellular v2x communications. IEEE Internet of Things Journal, pages 1–1, 2019. 28. X. Lu, X. Xiao, L. Xiao, C. Dai, M. Peng, and H. V. Poor. Reinforcement learning-based microgrid energy trading with a reduced power plant schedule. IEEE Internet of Things Journal, 6(6):10728–10737, Dec 2019. 29. Yifei Shen, Yuanming Shi, Jun Zhang, and Khaled B. Letaief. A graph neural network approach for scalable wireless power control. ArXiv, abs/1907.08487, 2019. 30. Dagnachew Azene Temesgene, Marco Miozzo, and Paolo Dini. Dynamic control of functional splits for energy harvesting virtual small cells: a distributed reinforcement learning approach. ArXiv, abs/1906.05735v1, 2019. 31. Y. Chen, S. Deng, H. Zhao, Q. He, and H. Gao Y. Li. Data-intensive application deployment at edge: A deep reinforcement learning approach. In 2019 IEEE International Conference on Web Services, pages 355–359, Jul 2019. 32. Z. Piao, M. Peng, Y. Liu, and M. Daneshmand. Recent advances of edge cache in radio access networks for internet of things: Techniques, performances, and challenges. IEEE Internet of Things Journal, 6(1):1010–1028, Feb 2019. 33. X. Qiu, L. Liu, W. Chen, Z. Hong, and Z. Zheng. Online deep reinforcement learning for computation offloading in blockchain-empowered mobile edge computing. IEEE Transactions on Vehicular Technology, pages 1–1, 2019. 34. Y. Chen, C. Lin, J. Huang, X. Xiang, and X. S. Shen. Energy efficient scheduling and management for large-scale services computing systems. IEEE Transactions on Services Computing, 10(2):217–230, 2015. 35. X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis. Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning. IEEE Internet of Things Journal, 6(3):4005–4018, June 2019. 36. Liang Huang, Suzhi Bi, and Ying-Jun Angela Zhang. Deep reinforcement learning for online offloading in wireless powered mobile-edge computing networks. CoRR, abs/1808.01977, 2018. 37. Lei Lei, Huijuan Xu andXiong Xiong, Kan Zheng, Wei Xiang, and Xianbin Wang. Multiuser resource control with deep reinforcement learning in iot edge computing. ArXiv, abs/1906.07860, 2019. 38. Qi Qi and Zhanyu Ma. Vehicular edge computing via deep reinforcement learning. CoRR, abs/1901.04290, 2019. 39. Zhong Yang, Yuanwei Liu, Yue Chen, and Naofal Al-Dhahir. Cache-aided noma mobile edge computing: A reinforcement learning approach. ArXiv, abs/1906.08812, 2019. 40. Docker: Modernize your applications, accelerate innovation, [n.d.]. 41. Kubernetes: Production-grade container orchestration. 42. T. Ouyang, Z. Zhou, and X. Chen. 
Follow me at the edge: Mobility-aware dynamic service placement for mobile edge computing. IEEE Journal on Selected Areas in Communications, 36(10):2333–2345, Oct 2018. 43. T. He, H. Khamfroush, S. Wang, T. La Porta, and S. Stein. It’s hard to share: Joint service placement and request scheduling in edge clouds with sharable and non-sharable resources. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pages 365–375, July 2018.

References

107

44. B. Gao, Z. Zhou, F. Liu, and F. Xu. Winning at the starting line: Joint network selection and service placement for mobile edge computing. In IEEE INFOCOM 2019—IEEE Conference on Computer Communications, pages 1459–1467, April 2019. 45. L. Chen, J. Xu, S. Ren, and P. Zhou. Spatio–temporal edge service placement: A bandit learning approach. IEEE Transactions on Wireless Communications, 17(12):8388–8401, Dec 2018. 46. F. Ait Salaht, F. Desprez, A. Lebre, C. Prud’homme, and M. Abderrahim. Service placement in fog computing using constraint programming. In 2019 IEEE International Conference on Services Computing (SCC), pages 19–27, July 2019. 47. Y. Chen, S. Deng, H. Zhao, Q. He, Y. Li, and H. Gao. Data-intensive application deployment at edge: A deep reinforcement learning approach. In 2019 IEEE International Conference on Web Services (ICWS), pages 355–359, July 2019. 48. Liuyan Liu, Haisheng Tan, Shaofeng H.-C. Jiang, Zhenhua Han, Xiang-Yang Li, and Hong Huang. Dependent task placement and scheduling with function configuration in edge computing. In Proceedings of the International Symposium on Quality of Service, IWQoS ’19, New York, NY, USA, 2019. Association for Computing Machinery. 49. L. A. Vayghan, M. A. Saied, M. Toeroe, and F. Khendek. Deploying microservice based applications with kubernetes: Experiments and lessons learned. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 970–973, July 2018. 50. Leila Abdollahi Vayghan, Mohamed Aymen Saied, Maria Toeroe, and Ferhat Khendek. Kubernetes as an availability manager for microservice applications. CoRR, abs/1901.04946, 2019. 51. Ravindra K Ahuja, Thomas L Magnanti, and James B Orlin. Network flows. 1988. 52. Developing software for multi-access edge computing, Feb 2019. 53. Anton J. Kleywegt, Alexander. Shapiro, and Tito. Homem-de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002. 54. Christian Robert and George Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013. 55. Yun Chao Hu, Milan Patel, Dario Sabella, Nurit Sprecher, and Valerie Young. Mobile edge computing—a key technology towards 5g. ETSI white paper, 11(11):1–16, 2015. 56. Shuiguang Deng, Longtao Huang, Javid Taheri, and Albert Y Zomaya. Computation offloading for service workflow in mobile cloud computing. IEEE Transactions on Parallel and Distributed Systems, 26(12):3317–3329, 2015. 57. Michael Till Beck, Martin Werner, Sebastian Feld, and S Schimper. Mobile edge computing: A taxonomy. In Proc. of the Sixth International Conference on Advances in Future Internet, pages 48–55. Citeseer, 2014. 58. Shuiguang Deng, Longtao Huang, Javid Taheri, Jianwei Yin, MengChu Zhou, and Albert Y Zomaya. Mobility-aware service composition in mobile communities. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(3):555–568, 2017. 59. Wei-Jen Hsu, Thrasyvoulos Spyropoulos, Konstantinos Psounis, and Ahmed Helmy. Modeling spatial and temporal dependencies of user mobility in wireless mobile networks. IEEE/ACM Transactions on Networking (ToN), 17(5):1564–1577, 2009. 60. Jianping Wang. Exploiting mobility prediction for dependable service composition in wireless mobile ad hoc networks. IEEE Transactions on Services Computing, 4(1):44–55, 2011. 61. Shangguang Wang, Yali Zhao, Lin Huang, Jinliang Xu, and Ching-Hsien Hsu. Qos prediction for service recommendations in mobile edge computing. Journal of Parallel and Distributed Computing, 2017. 
62. Yuyi Mao, Jun Zhang, and Khaled B Letaief. Dynamic computation offloading for mobile-edge computing with energy harvesting devices. IEEE Journal on Selected Areas in Communications, 34(12):3590–3605, 2016. 63. Xiang Sun and Nirwan Ansari. Edgeiot: Mobile edge computing for the internet of things. IEEE Communications Magazine, 54(12):22–29, 2016. 64. Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multiagent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.

108


Chapter 4

AI/ML for Computation Offloading

Abstract This chapter discusses how the growth of the mobile Internet and the Internet of Things (IoT) motivated the creation of the edge computing network paradigm, which aims to resolve latency and bandwidth bottlenecks. Edge computing brings computation closer to end users to improve network stability and to enable task offloading for device terminals. Because designing efficient offloading mechanisms is complicated by stringent real-time requirements, we elaborate on how artificial intelligence (AI) and machine learning (ML) are used in recent studies to optimize task offloading in edge computing platforms. Showing how AI/ML technologies can deliver more accurate offloading strategies while lowering decision-making costs, we cover long-term optimization and Markov decision optimization for binary offloading, partial offloading, and complex job offloading problems.

This chapter reuses literal text and materials from
• H. Zhao et al., "A Mobility-Aware Cross-Edge Computation Offloading Framework for Partitionable Applications," 2019 IEEE International Conference on Web Services (ICWS), 2019, https://doi.org/10.1109/ICWS.2019.00041. ©2019 IEEE, reprinted with permission.
• H. Chen et al., "Mobility-Aware Offloading and Resource Allocation for Distributed Services Collaboration," in IEEE Transactions on Parallel and Distributed Systems, 2022, https://doi.org/10.1109/TPDS.2022.3142314. ©2022 IEEE, reprinted with permission.
• S. Deng et al., "Dependent Function Embedding for Distributed Serverless Edge Computing," in IEEE Transactions on Parallel and Distributed Systems, 2022, https://doi.org/10.1109/TPDS.2021.3137380. ©2022 IEEE, reprinted with permission.

4.1 Introduction

We will start with a quick introduction to computation offloading and some of the key concerns to consider. Computation offloading migrates heavy computational tasks to other servers (neighboring edge servers or back-end cloud servers), aiming to lower the computational strain on local devices (end users). As a result, both the local and edge server sides of such systems must collaborate to perform computation offloading. The critical challenges covered by the computation offloading procedure are:

• Decide whether end users should offload their computation or not;
• Decide whether to perform binary or partial computation offloading;
• Decide when to offload tasks;
• Decide where (cloud or edge servers) to offload the computation.

To tackle the above problems, we will introduce some studies for computation offloading, including traditional framework and AI approaches. Traditional approaches for offloading The authors of [1] introduced ThinkAir, a code offloading framework that uses parallel processing to run mobile cloud computing apps in the cloud. Cuckoo for Android is developed by authors in [2] as a platform for code offloading under the condition that intermittent network connection occurs. Servers deployed in clusters are widely used in the collaborative mode scenarios of edge computing and cloud computing. The authors of [3] implemented a cluster-based offloading algorithm to load balance and enable task offloading with dynamic resource changes without increasing the system’s complexity. The authors of [4] developed a framework called CloudAware to realize the function of service discovery (monitoring the available resources in the network, the task load intensity of the network, traffic, etc.) to realize the offloading of computing tasks. KubeEdge is a new infrastructure to facilitate edge and cloud collaboration for computation offloading. KubeEdge provides seamless communication between edge and cloud servers by the same runtime environment [5]. Motivation of AI/ML methods Traditional approaches for the computation offloading face, more or less, the same problem. Few of them can automatically react to changes in the network environment and end users. The existing computation offloading approach may no longer be relevant as the system evolves, resulting in a considerable reduction in computation offloading efficiency. Furthermore, the methodologies given by typical computation offloading frameworks are not universal, and traditional computation offloading frameworks are usually applicable to certain computing environments. To tackle the aforementioned challenges, AI/ML-based computation offloading approaches must be implemented, because they are inherently more adaptable in terms of scalability, flexibility, and portability. AI/ML approaches for computation offloading These approaches can be divided into various types according to their characteristics; yet, there are three main categories for computation offloading [6]. • Regression Algorithm is a statistical method used to model the mapping of outcomes to input data features. The most common algorithm in regression analysis is linear regression. It finds the curve or complex linear combination that best fits the input data according to a specific law. Regression analysis is widely used in forecasting and is an important method in ML. The regression itself reveals the relationship between the outcomes and the set of variables in


a fixed dataset, inferring the causal relationship between the independent and dependent variables. • Deep Learning is an essential branch of ML. It is a method of learning data representations based on artificial neural networks. Deep learning uses information processing and communication modes to train neural networks. With the help of unsupervised or semi-supervised feature learning, deep learning can efficiently extract data features hierarchically. • Reinforcement Learning is a model that interacts with the environment. It iteratively learns and generates a decision-making action to maximize a reward. Because the model’s reward may not immediately match the action space, a reinforcement learning approach must understand the relationship between the state space and the action space through constant action feedback. This is to help the algorithm determine the optimum action for obtaining large rewards. The current iteration impacts the next repetition of the last reward; thus, the method allows reinforcement learning to learn from experience rather than directly from the dataset (e.g., through continuous trial and error). Regression algorithms for computation offloading Meurisch et al. [7] proposed an online method to implement a better offload strategy by investigating unknown accessible services (such as neighboring cloudlets or the remote cloud and networks) in an energy-efficient manner. They look at a probing technique for evaluating these unknown services by offloading micro-tasks and correctly forecasting the performance of tasks with heavy computation using regression models. Yang et al. [8] proposed an adaptive computation offloading framework to address some of the challenges of computationally and resource-intensive applications. In order to deal with the large dimension of the unloading problem, the authors combined the deep neural network and the Markov process with the regression module to design a computation offloading system. Lin et al. [9] presented FOTO to minimize energy usage and job reaction time. FOTO converts the minimization problem of the server’s energy usage and response time into a multi-objective optimization problem. FOTO improves resource usage by optimizing server resources, making it ideal for IoT and intelligent medical applications. Crutcher et al. [10] use KDN technology to anticipate computation offloading overhead for edge networks. Their proposed algorithm maps each server to a point in a multidimensional space, where each dimension corresponds to a predictor. They utilize the hyperprofile of each task to run KNN and choose the server corresponding to the computing task. Furthermore, their proposed regression model accurately assesses a variety of network prediction variables; it significantly improves the one-to-one connection between the expected offloading job and the server. Deep learning for computation offloading Yu et al. [11] devised a supervised learning approach in the edge network to model the user offloading model. It reduces the computational overhead of edge computing devices, based on a deep learning method for multi-label classifications. Kang et al. [12] designed a Neurosurgeon scheduler deployed between the terminal device and the data


center to divide the optimal combination of offloading and local execution automatically. This scheduler is highly flexible and can significantly reduce the latency and energy consumption of mobile devices. Alelaiwi et al. [13] created an architecture based on a deep learning algorithm. They used historical data to assess the user’s request to the edge server node and the response time of computing offloading. Their approach learns to forecast the response time of new task requests and accomplishes the goal of work offloading by selecting the edge server node with the shortest response time. This prediction model has a small average error, high accuracy, and a great performance assessment, as it was jointly trained by the cloud and edge server nodes. Another load balancing scheduling approach [14] is proposed based on a k-means clustering technique with optimized min-max. The approach employs an adaptive differential evolution algorithm to improve the local search ability in the task scheduling decision space, hence improving the algorithm’s efficiency and accuracy. Under specific conditions, the proposed solutions can successfully address the challenges of task scheduling and resource allocation in edge computing. Reinforcement learning for computation offloading Mao et al. [15] employed graph neural networks and reinforcement learning techniques to reduce the latency of DAG job scheduling in a cluster. The graph neural network is primarily used to solve DAG job feature extraction and assess each job. Furthermore, the task nodes in the jobs are configured to feed these metric values into the reinforcement learning model, and the best scheduling policy is formed through iterations. Tong et al. [16] designed QL-HEFT, a new scheduling method for complicated heterogeneous tasks based on the Q-learning algorithm, and the current HEFT algorithm for heterogeneous tasks. The algorithm measures timely reward using the ranking value in the Q-learning architecture. The agent effectively updates the Q-table through the adaptive learning process to acquire the strategy of shortest execution time for heterogeneous tasks. Wang et al. [17] designed an algorithm combining deep reinforcement learning and federated learning in the edge network environment. By designing appropriate caching and communication mechanisms, the edge AI framework cooperatively runs applications on edge devices and edge server nodes. At the same time, the edge AI framework provides a set of information sharing mechanisms, which significantly reduces the communication frequency between devices. Therefore, edge AI frameworks can provide more efficient training and inference for models while supporting the dynamic improvement of systems and applications. The edge AI framework achieves excellent performance in the edge systems’ caching and computing offloading evaluation scenarios. Tan et al. [18] used a deep reinforcement learning framework to build a strategy for fast communication, high caching efficiency, and optimal allocation of computing resources based on vehicle mobility over the Internet of Vehicles. This method has also been theoretically validated, and it has a significant operational cost-benefit on the Internet of Vehicles. Combining deep reinforcement learning with the genetic algorithm (DRGO), Qiu et al. [19] designed an online computing offloading method for model-free deep reinforcement learning. It is used in the combination


of blockchain and mobile edge computing. The method uses a Markov decision process to model blockchain tasks, consequently evaluating the dynamic performance of the decision process: average cost, task loss rate, and average transfer time. In addition, reinforcement learning is used to maximize the performance of long-term computation offloading. Due to the algorithm’s high complexity, this method introduces the adaptive genetic algorithm to accelerate the convergence speed in a high-dimensional space. Compared to greedy, genetic, and deterministic policy gradient algorithms, the DRGO method has shown efficient performance.

4.2 AI/ML Optimizes Task Offloading in the Binary Mode

Intelligent mobile devices (e.g., IoT sensors and wearable devices) have become indispensable tools in our daily routines. However, the limited computing and storage resources and battery capacities of mobile devices cannot meet the needs of various applications. In recent years, mobile edge computing (MEC) has emerged as a promising paradigm to overcome these limitations. MEC provides various resources and services for mobile devices by pushing computing capabilities away from the centralized cloud to the network edge, that is, into the vicinity of the widespread wireless access network [20]. In an MEC system, an edge site/server is a micro data center with applications deployed, attached to a small base station (SBS). User workloads can be offloaded from their mobile devices to a nearby edge site for processing to reduce the service cost measured by quality of experience (QoE). For scenarios where mobile devices can move arbitrarily, designing the computation offloading strategy is difficult due to the frequent heterogeneous edge site selection and user profile handover. Because the wireless coverage of SBSs often overlaps in real-world scenarios, a collaboration network of edge sites can be constructed by the mobile network operator (MNO). In addition, application partitioning and offloading have been extensively studied in MEC and distributed systems [21]. Based on these studies, computational tasks can be cooperatively and distributively processed [22] by partitioning tasks and offloading them to multiple edge sites; what cannot be ignored here is that the scheduling, communication, and coordination costs increase. Therefore, it is critical to trade off between the offloading and computing latency of mobile devices and the communication and coordination costs. In addition to communication and coordination costs, offloading latency is also coupled with the energy consumption of mobile devices, mainly because performance may be compromised by insufficient battery energy [23]. In many situations, it is not appropriate or sometimes even impossible (e.g., for outdoor sensors) to recharge batteries frequently. Additionally, frequent connections and data transmissions to more than one edge site in a time slot may lead to excessive transient discharges and consequently greatly harm the battery life of mobile devices [24]. Due to the strong need for green computing, several energy harvesting (EH) technologies [25] have been developed to collect recyclable and clean energy (e.g., wind, solar radiation, and human motion energy) [26]. Therefore, we aim for self-sustainability and perpetual operation of mobile devices.

4.2.1 System Model

Formally, we refer to an MEC system consisting of N mobile devices equipped with EH components, indexed by $\mathcal{N}$, and M SBSs, indexed by $\mathcal{M}$. SBSs are interconnected via X2 links for data transmission and coordination [27]. The time horizon is discretized into time slots with length $\tau$, indexed by $\mathcal{T} \triangleq \{1, 2, \dots\}$. Figure 4.1 shows an example system that provides health monitoring and analytics to wearable devices, such as smart bracelets and intelligent glasses. Three wearable devices move following the Gauss-Markov mobility model across six time slots. At the beginning of each time interval, each wearable device generates a task offloading request with a certain probability. The request can be successfully responded to if and only if it is not timed out (it will be dropped otherwise). First, the pre-processing and packing of monitored data are carried out by the wearable devices. Next, the pre-processed data are split into equal pieces and then offloaded to chosen edge sites for analytics. In each time interval, harvestable energy comes from light, kinetic, wind, and other sources. The EH components are implemented in the same way as described in [28]. Without aggregating geo-distributed data to a centralized data center, health analytics are carried out by edge sites via cross-edge Map-Reduce queries [29]. Considering that we focus on the selection of the edge sites, we assume that edge sites can carry out application phase transitions instantaneously and without failure. Also, because the returned analytical results are usually much smaller than the offloaded data, the latency of the downlink transmission is ignored.
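
As a concrete illustration of the slotted model above, the following minimal Python sketch draws the per-slot offloading demand of each device from independent Bernoulli distributions with probabilities $\rho_i$; all parameter values here are hypothetical and not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N = 3                                 # number of mobile devices (hypothetical)
T = 6                                 # number of time slots (hypothetical)
rho = np.array([0.6, 0.4, 0.8])       # per-device demand probabilities rho_i (hypothetical)

# A[t, i] = 1 if device i generates an offloading request in slot t (i.i.d. Bernoulli(rho_i))
A = (rng.random((T, N)) < rho).astype(int)

for t in range(T):
    print(f"slot {t}: demand vector A(t) = {A[t]}")
```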

4.2.1.1 Local Execution Latency Evaluation

For the entire time horizon, the location of each SBS is fixed, whereas mobile devices can move arbitrarily in different time slots. We denote the set of SBSs covering the ith mobile device in time slot t as $\mathcal{M}_i(t)$; correspondingly, $\mathcal{N}_j(t)$ denotes the set of mobile devices covered by edge site j in the tth time slot. We model the task offloading demands of mobile devices as an i.i.d. Bernoulli distribution. In each time slot, the ith mobile device's offloading demand is generated with probability $\rho_i$. We set $\mathbf{A}(t) \triangleq \{\times_{i \in \mathcal{N}} A_i(t)\} \subseteq \{0, 1\}^N$ as the demand vector, that is, $\Pr\{A_i(t) = 1\} = \rho_i$. The local execution includes data pre-processing and packing. We set the data sizes to be processed for local execution and offloading as $\mu_i^l$ and $\mu_i^r$, respectively. Correspondingly, the two parts need $\eta_i^l$ and $\eta_i^r$ CPU cycles, respectively. For local execution, the execution latency $\tau_i^{lc}$ is $\eta_i^l / f_i$.

Fig. 4.1 A motivation scenario. ©2019 IEEE, reprinted, with permission, from H. Zhao et al. (2019)


The energy consumption of local execution can be computed using the following formula:

$\mathcal{E}_i^l = \kappa_i \cdot \eta_i^l f_i^2, \quad i \in \mathcal{N},\ t \in \mathcal{T},$   (4.1)

where $\kappa_i$ is the effective switched capacitance that depends on the architecture of the chip.
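
For instance, the local-execution bookkeeping (Eq. 4.1 together with the latency $\tau_i^{lc} = \eta_i^l / f_i$) can be sketched as follows; the numerical values are illustrative only.

```python
def local_execution(eta_l_cycles: float, f_i_hz: float, kappa_i: float):
    """Return (latency in seconds, energy in joules) of the local part.

    eta_l_cycles : CPU cycles needed for local pre-processing/packing (eta_i^l)
    f_i_hz       : local CPU cycle frequency f_i
    kappa_i      : effective switched capacitance of the device chip
    """
    latency = eta_l_cycles / f_i_hz                 # tau_i^lc = eta_i^l / f_i
    energy = kappa_i * eta_l_cycles * f_i_hz ** 2   # Eq. (4.1)
    return latency, energy

# Illustrative values (hypothetical): 1e8 cycles at 1 GHz, kappa = 1e-28
print(local_execution(1e8, 1e9, 1e-28))
```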

4.2.1.2 Task Offloading Latency

Let us denote $\mathbf{I}_i(t) \triangleq [I_{i,1}(t), \dots, I_{i,|\mathcal{M}_i(t)|}(t)]$, $I_{i,j}(t) \in \{0, 1\}$, as the edge site selection indicator. We assume that the jth edge site can be assigned to at most $N_j^{max}$ mobile devices due to its limited computational capability; this leads to the following constraint:

$\sum_{i \in \mathcal{N}} I_{i,j}(t) \le N_j^{max}, \quad j \in \mathcal{M}_i(t),\ t \in \mathcal{T}.$   (4.2)

For each mobile device i, the latency in offloading comes from the wireless uplink transmission, as well as the computation collaboration between edge sites. We denote the small-scale fading channel power gain from the ith mobile device to the jth SBS by $\zeta_{i,j}(t)$. The channel power gain can be obtained by $h_{i,j}(t) \triangleq \zeta_{i,j}(t)\, g_0 \left(\frac{d_0}{d_{i,j}(t)}\right)^{\lambda}$, where $d_0$ and $d_{i,j}(t)$ denote the reference distance and the real distance between i and j, respectively, $\lambda$ denotes the path loss exponent, and $g_0$ denotes the path loss constant. As a result, the achievable rate from the ith mobile device to the jth edge site, $R_{i,j}(t)$, can be computed using the following formula:

$R_{i,j}(t) \triangleq \omega \log_2 \left(1 + \frac{h_{i,j}(t)\, p_i^{tx}}{I + \sigma_0}\right), \quad j \in \mathcal{M}_i(t),$   (4.3)

where $\omega$ represents the allocated bandwidth, I is the maximum received average power of interference, $\sigma_0$ is the additive background noise, and $p_i^{tx}$ represents the fixed transmit power. Having all latency aspects, the transmission latency can be computed as follows:

$\tau_{i,j}^{tx}(t) = \frac{\mu_i^r}{\sum_{j \in \mathcal{M}_i(t)} I_{i,j}(t)} \cdot \frac{1}{R_{i,j}(t)}, \quad j \in \mathcal{M}_i(t).$   (4.4)
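
A minimal sketch of Eqs. (4.3)–(4.4), assuming illustrative channel and system parameters that are not taken from the chapter:

```python
import math

def achievable_rate(bandwidth_hz, h_ij, p_tx_w, interference_w, noise_w):
    """Eq. (4.3): achievable uplink rate in bit/s."""
    return bandwidth_hz * math.log2(1.0 + h_ij * p_tx_w / (interference_w + noise_w))

def transmission_latency(mu_r_bits, selected_sites, rate_bps):
    """Eq. (4.4): the offloaded data mu_i^r is split evenly over the chosen edge sites."""
    return (mu_r_bits / selected_sites) / rate_bps

R = achievable_rate(bandwidth_hz=10e6, h_ij=1e-6, p_tx_w=0.2,
                    interference_w=1e-9, noise_w=1e-10)   # hypothetical values
print(transmission_latency(mu_r_bits=4e6, selected_sites=2, rate_bps=R))
```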

Based on (4.3) and (4.4), the energy consumption of transmission would be $\mathcal{E}_{i,j}^{tx}(t)$, which is equal to $p_i^{tx} \cdot \tau_{i,j}^{tx}(t)$. Similarly, the latency of computing, $\tau_{i,j}^{rc}(t)$, would be as follows:

$\tau_{i,j}^{rc}(t) = \frac{\eta_i^r}{f_j \cdot \sum_{j \in \mathcal{M}_i(t)} I_{i,j}(t)}, \quad j \in \mathcal{M}_i(t),$   (4.5)


where $f_j$ is the CPU cycle frequency of edge site j. Without loss of generality, $\forall i \in \mathcal{N}, j \in \mathcal{M}$, we set $f_j \gg f_i$. To execute the computation task successfully, we also need to satisfy the following constraint:

$\tau_d \ge \max_{j \in \mathcal{M}_i(t)} \left\{ \tau_{i,j}^{tx}(t) + \tau_{i,j}^{rc}(t) \right\} + \tau_i^{lc} + \varphi \cdot \sum_{j \in \mathcal{M}_i(t)} I_{i,j}(t),$   (4.6)

where $\tau_d$ is the execution deadline, which, without loss of generality, we set as $\tau_d \le \tau$. Although the transmission latency can be reduced if multiple edge sites are chosen, the coordination cost could be unjustifiable. Here, we simply assume that the coordination cost is proportional to the number of chosen edge sites. Thus, it can be described as $\varphi \cdot \sum_{j \in \mathcal{M}_i(t)} I_{i,j}(t)$, where $\varphi$ is the unit latency cost. As can be inferred, it is critical to trade off between the cross-edge collaboration cost and the offloading latency.
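
To make the trade-off in Eq. (4.6) concrete, the sketch below checks whether a candidate edge-site selection meets the deadline once the coordination cost (the unit latency cost per chosen site) is included; all inputs are hypothetical.

```python
def meets_deadline(tau_tx, tau_rc, tau_lc, phi, tau_d):
    """Eq. (4.6): tau_tx / tau_rc are dicts {site j: latency} for the chosen sites."""
    chosen = list(tau_tx.keys())
    if not chosen:
        return False
    slowest = max(tau_tx[j] + tau_rc[j] for j in chosen)
    total = slowest + tau_lc + phi * len(chosen)
    return total <= tau_d

# Two chosen sites: site 1 is slower; coordination adds phi per chosen site.
print(meets_deadline(tau_tx={1: 0.08, 2: 0.05},
                     tau_rc={1: 0.04, 2: 0.03},
                     tau_lc=0.02, phi=0.01, tau_d=0.2))
```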

4.2.1.3 Battery Energy Consumption

Denote the battery energy level of the ith mobile device at the beginning of the tth time slot as $\psi_i(t)$. We set $\psi_i(T) < +\infty$ as $T \to +\infty, i \in \mathcal{N}$. The energy consumption has the following constraint:

$\mathcal{E}_i^l + \sum_{j \in \mathcal{M}_i(t)} \mathcal{E}_{i,j}^{tx}(t)\, I_{i,j}(t) \le \psi_i(t), \quad i \in \mathcal{N},\ t \in \mathcal{T}.$   (4.7)

Successive energy packets with size $E_i^h(t)$ arrive at the beginning of each time slot. We assume $E_i^h(t)$ is i.i.d. with the maximum value of $E_{i,h}^{max}$ in different time slots. We denote $\alpha_i(t)$ as the amount of energy units to be stored in the mobile device. Thus, the energy level of the ith mobile device can be computed using the following formula:

$\psi_i(t+1) = \psi_i(t) - \sum_{j \in \mathcal{M}_i(t)} \mathcal{E}_{i,j}^{tx}(t) \cdot I_{i,j}(t) - \mathcal{E}_i^l + \alpha_i(t),$   (4.8)

where

$0 \le \alpha_i(t) \le E_i^h(t), \quad i \in \mathcal{N},\ t \in \mathcal{T}.$   (4.9)
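
The battery recursion in Eqs. (4.8)–(4.9) can be simulated directly; the sketch below uses hypothetical energy values and caps the harvested amount by the arriving packet.

```python
def battery_step(psi, e_local, e_tx_per_site, selection, e_harvest, alpha):
    """One step of Eq. (4.8); alpha must satisfy 0 <= alpha <= e_harvest (Eq. 4.9)."""
    assert 0.0 <= alpha <= e_harvest
    spent = e_local + sum(e_tx_per_site[j] for j, chosen in selection.items() if chosen)
    return psi - spent + alpha

psi = 5.0                                  # initial battery level (hypothetical units)
for t in range(3):
    e_h = 0.5                              # harvested packet E_i^h(t) (hypothetical)
    psi = battery_step(psi, e_local=0.3,
                       e_tx_per_site={1: 0.2, 2: 0.25},
                       selection={1: True, 2: False},
                       e_harvest=e_h, alpha=e_h)
    print(f"slot {t}: battery level = {psi:.2f}")
```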

4.2.1.4 Problem Formulation

The total energy consumption of the ith mobile device is $\mathcal{E}_i^l + \sum_{j \in \mathcal{M}_i(t)} \mathcal{E}_{i,j}^{tx}(t) \cdot I_{i,j}(t)$. Because a rapid discharge of the battery is unusual and to some extent even unsafe, we introduce the "safe discharging threshold" $\psi_i^{safe}$ as the maximum transient discharge w.r.t. the ith mobile device; it can be expressed as

$\mathcal{E}_i^l + \sum_{j \in \mathcal{M}_i(t)} \mathcal{E}_{i,j}^{tx}(t)\, I_{i,j}(t) \le \psi_i^{safe}, \quad i \in \mathcal{N},\ t \in \mathcal{T}.$   (4.10)

Furthermore, it is also possible that no feasible solution with a sufficient battery energy level under (4.6) is found. Therefore, as described before, those timeout requests have to be dropped. We set $D_i(t) \in \{0, 1\}$ as the indicator of dropping the task, that is, $D_i(t) = \mathbb{1}\{\mathbf{I}_i(t) = \mathbf{0}\}$. When the ith mobile device's request in the tth time slot is dropped, a penalty $\varrho_i$ is generated. We set $\varrho_i \ge \tau_d$ to ensure that a task is preferred to be executed rather than dropped. We denote $\boldsymbol{\mathcal{E}}_i(t)$ as the offloading energy allocation vector with size $|\{j \in \mathcal{M} \mid \mathbb{1}\{I_{i,j} = 1\}\}|$. The overall cost of the ith mobile device in the tth time slot can be computed as

$C(\mathbf{I}_i(t)) \triangleq \max_{j \in \mathcal{M}_i(t): I_{i,j}(t)=1} \left\{ \tau_{i,j}^{tx}(t) + \tau_{i,j}^{rc}(t) \right\} + \tau_i^{lc} + \varphi \cdot \sum_{j \in \mathcal{M}_i(t)} I_{i,j}(t) + \varrho_i \cdot D_i(t).$   (4.11)

Notice that this is a general model without specific structural assumptions. In practice, a proper value of $\varphi$ can be determined based on the types of coordination through a multiple-criteria decision-making process. Consequently, the edge site selection problem can be formulated as follows:

$\mathcal{P}_1: \quad \min_{\forall i,\ \mathbf{I}_i(t),\ \alpha_i(t)} \ \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[ \sum_{t=0}^{T-1} \sum_{i \in \mathcal{N}} C(\mathbf{I}_i(t)) \right]$

$\text{s.t.} \quad (4.2), (4.6), (4.7), (4.9), (4.10).$

In the problem considered, the state of the system is composed of the task request, the harvestable energy, as well as the battery energy level. The action lies in the selection of the edge site and the harvesting of energy. It can be checked that the allowable action set depends only on the current system state and is irrelevant to the state and action history. Therefore, $\mathcal{P}_1$ is an MDP. In principle, it can be solved optimally with standard MDP algorithms. To that end, the system first needs to be discretized, which usually leads to a very large solution space that would be impossible to traverse in a rational time. Furthermore, the large memory requirement for storing the optimal policy with the discretized system cannot be ignored either. To overcome this, a deep Q-network (a model-free learning algorithm) can be utilized to solve the problem directly, although we should also consider the fact that fine-tuned quantization of "states" and "actions" may lead to severe performance degradation.
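
Before moving to the CCO framework, the per-slot cost of Eq. (4.11), which such a learning agent would observe as a (negative) reward, can be sketched as follows; the drop penalty and all latencies are hypothetical.

```python
def slot_cost(tau_tx, tau_rc, tau_lc, phi, penalty, dropped):
    """Eq. (4.11): overall cost of device i in one slot.

    tau_tx / tau_rc map the chosen edge sites to their transmission/computing
    latency; `dropped` mirrors the indicator D_i(t).  A dropped request is
    treated here as incurring only the penalty term.
    """
    if dropped or not tau_tx:
        return penalty
    slowest = max(tau_tx[j] + tau_rc[j] for j in tau_tx)
    return slowest + tau_lc + phi * len(tau_tx)

print(slot_cost({1: 0.08}, {1: 0.04}, tau_lc=0.02, phi=0.01,
                penalty=0.5, dropped=False))
```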

4.2.2 Cross-Edge Computation Offloading Framework

Fig. 4.2 The cross-edge computation offloading (CCO) framework. ©2019 IEEE, reprinted, with permission, from H. Zhao et al. (2019)

Figure 4.2 demonstrates the architecture of the cross-edge computation offloading (CCO) framework. The centralized framework consists of mobile devices, edge sites (edge server + SBS), and the mobile network operator (MNO). The resource manager is the core component of the CCO algorithm that is supposed to be executed across the time horizon. In each time slot, each mobile device generates computation task requests with a certain probability distribution. Those requests, along with the basic information of the senders (e.g., app type, local CPU cycle frequency, battery energy level, etc.), will be sent to the MNO. In turn, the MNO monitors the channel state information of edge sites for dynamic scaling. By executing the CCO algorithm, the MNO chooses edge sites for each mobile


device for offloading. Finally, edge sites send the analytical results to the MNO to determine the final downlink transmission. In this subsection, we demonstrate the details on how to obtain the asymptotic optimal solution of .P1 by our CCO algorithm. We use a vector .(t)  [ψ1 (t), ..., ψN (t)] to represent the system energy queues in the tth time slot. For a given set of non-negative parameters .θ  [θ1 , ..., θN ], the non-negative Lyapunov function .L((t)) is defined as follows: L((t)) 

.

 1 (ψi (t) − θi )2 = ψi (t)2 , 2 N

N

i=1

i=1

(4.12)

where θi ≥

V · ω log2 (1 +

.

tx hmax  i,j pi

0 )

pitx μri ηr +Mi − i fj





μri ω log2 (1 +

  saf e max + min Ei,all , ψi

tx hmax i,j pi

0 )

max   l + M  max ,  max  p tx (τ − τ lc ), hmax  max and .Ei,all d t∈T:j ∈M hi,j (t). j =1 i,j i i,j i i i,j Equation (4.12) tends to keep battery energy backlog near a non-zero value .θi for the ith mobile device. We can define the conditional Lyapunov drift .((t)) as

((t))  E[L((t + 1)) − L((t))|(t)].

(4.13)

.

Notice that the lower bound of .θi is not tightened; that is, the larger battery capacity of mobile devices will lead to more optimized solution. According to Eq. (4.8), we can obtain ⎡ ψi (t + 1)2 ≤ ψi (t)2 + 2ψi (t) ⎣αi (t) − il −



.

⎤ tx i,j (t) · Ii,j (t)⎦

j ∈M max 2 max 2 +(Ei,h ) + (Ei,all ) .

Therefore, ((t)) ≤

N 

.

i=1

⎡ ψi (t) ⎣αi (t) − il −



⎤ tx i,j (t) · Ii,j (t)⎦ + C,

j ∈M

   max 2 max 2 where .C  12 N i=1 (Ei,h ) +(Ei,all ) . Using Lyapunov optimization principles, we can obtain the Nnear-optimal solution to .P1 by minimizing the upper bound of .((t)) + V · i=1 C(Ii (t)), irrespective of Eq. (4.7).

4.2 AI/ML Optimizes Task Offloading in the Binary Mode

123

up

With .V ((t)) defined by up

V ((t)) 

.

N 

M    tx ψi (t) αi (t) − il − i,j (t)Ii,j (t) j =1

i=1

+V

N 

C(Ii (t)) + C,

(4.14)

i=1

we introduce the deterministic problem .P2 in every time slot t as P2 :

.

up

min

V ((t))

s.t.

(4.2), (4.6), (4.9), (4.10),

∀i,Ii (t),αi (t)

where the constraint (4.7) is ignored because it violates the conditions of the vanilla version of Lyapunov optimization for i.i.d. random events [30]. To simplify the saf e max . This means Eq. (4.10) can be reasonably ignored. problem, we set .ψi ≥ Ei,all Algorithm 1 Cross-Edge Computation Offloading (CCO) 1: At the beginning of the tth time slot, obtain i.i.d. random events .A(t), Eh (t) h (t)] and channel state information. [E1h (t), ..., EN 2: .∀i ∈ N, decide .Ii (t), αi (t) by solving the deterministic problem .P2 . 3: .∀i ∈ N, update the battery energy level .ψi (t) by (4.8). 4: .t ← t + 1.



Algorithm 1 summarizes our proposed CCO algorithm. To asymptotic optimally solve .P2 , it can be divided into two sub-problems: optimal energy harvesting and optimal edge site selection. • Optimal energy harvesting. The optimal amount of harvested energy .αi (t) can be obtained by solving the following sub-problem: PEH : min 2

.

∀i,αi (t)

N 

ψi (t)αi (t)

i=1

s.t. (4.9). It is easy to obtain that  αi (t) = EiH (t) · 1 ψi (t) ≤ 0 , i ∈ N.

.

(4.15)

124

4 AI/ML for Computation Offloading

• Optimal edge site selection. The optimal decision vector .Ii (t) can be obtained by solving the following sub-problem: Pes 2

.

⎧ ⎡ ⎤ N ⎨   : min ψi (t) ⎣il + i,j (t)Ii,j (t)⎦ − ∀i,Ii (t) ⎩ i=1

+V ·

N 



j ∈Mi (t)

C(Ii (t))

i=1

s.t.(4.2), (4.6), (4.10). As a highly non-convex combinatorial optimization problem, it is difficult to obtain the optimal solution of .Pes 2 (t) using branch-and-bound methods. Therefore, inspired by the sampling-and-classification (SAC) algorithm [31], we propose the following SAC-based edge site selection (SES) algorithm; it is shown in Algorithm 2.

Algorithm 2 SAL-Based Edge Site Selection (SES) 1: for s = 1 to S0 do 2: Sample the sth solution Is (t) from the feasible solution space X (denoted as UX ): ∀j ∈ M, assign min{Njmax , |Nj (t)|} connections to elements in set Nj (t) randomly.  tx  rc (t) + τ lc + ϕ ·  (t) + τi,j 3: ∀i ∈ N, update Di (t) as 1{ maxj ∈Mi (t) τi,j j ∈Mi (t) Ii,j (t) > i    τd ∨  i (t) < ψi (t) }. 4: For those mobile devices who satisfy Di (t) = 1, update Ii (t) as 0. 5: end for (Is (t)), 6: I (t) ← argminIs (t)∈S0 (t) GPes 2 7: Initialize the hypothesis h0 . 8: Q0 (t) ← ∅. 9: for k = 1 to K do 10: Construct the binary-labeled dataset: For all Is (t) ∈ Sk−1 (t), y s (t) = sign{γk − (Is (t))}, Qk (t)  {(I1 (t), y 1 (t)), ..., (ISk−1 (t), y Sk−1 (t))} where γk is the threshold for GPes 2 labelling. 11: Obtain the hypothesis with a binary classification algorithm L(·): hk ← L(Qk (t)). 12: Initialise Sk (t) as ∅. 13: for s = 1 to Sk do 14: Sample with ε-greedy policy:  Get Is (t) from

Hhk , with probability ε UX , with probability 1 − ε,

where Hhk is the distribution transformation of hypothesis hk . 15: Sk (t) ← Sk (t) ∪ {Is (t)}. 16: end for (Is (t)). 17: I (t) ← argminIs (t)∈Sk (t)∪{I (t)} GPes 2 18: end for 19: return I (t).

4.3 AI/ML Optimizes Task Offloading the Partial Mode


Under the error-target independence proposed in Definition 2 in [31], which assumes that the error of the learned classifier/hypothesis h in each iteration is independent of the target approximation area, the SES algorithm can obtain a polynomial improvement over the uniform search on the positive solution set .Dh  {I(t)|h(I(t)) = +1}. A tightened query complexity of the SAC algorithms in discrete domains is presented in Theorem 1 in [31]. The SES algorithm first initializes the solution set .S0 (t) = {I1 (t), ..., IS0 (t)} by i.i.d. sampling from the uniform distribution over solution space .{0, 1}×i∈N Mi (t) with constraint (4.2) embedded (lines 1–5), where .Is (t)  {×i∈N Ii (t)} is the sth generated solution. The way of sampling from the feasible solution set follows the idea that we maximize the utilization of computation and communication resources of each edge site. Tasks of mobile devices that miss the computation task’s deadline or have insufficient battery energy will be dropped (.Di (t) ← 1). In each iteration, SES algorithm queries the objective function .GPes to evaluate the 2 generated solutions and then forms a binary classification dataset .Qk (t); a threshold .γk is used to label the solution as .+1 and .−1. In the classification phase (line 11), a binary classifier is trained on .Qk (t), to approximate the region .Dγk  {I(t) ∈ Sk (t)|GPes ≤ γk }. During the sampling phase (lines 14–15), solutions are sampled 2 from distribution .Hhk , and the universal set of feasible solutions are tuned by .ε. As it has been pointed out in [32], uniform sampling with the positive region .Dhk is straightforward and efficient, that is, .Hhk ← Uhk . Throughout the procedure, the best-so-far solutions are recorded (lines 6 and 17), and the best solution will be returned as the final answer (line 19).
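
A minimal, self-contained Python sketch of the sampling-and-classification loop that SES follows (uniform sampling, threshold labelling, a simple learned region, and ε-greedy resampling). It optimizes a toy binary objective rather than the true objective of the edge-site selection sub-problem, and every detail below is illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, S0, SK, K, EPS = 8, 40, 40, 5, 0.7

def objective(x):               # toy stand-in for the true objective to minimize
    return float(np.sum(x) + 3.0 * np.abs(x[0] - x[1]))

def uniform_sample(n):          # U_X: uniform sampling over the binary solution space
    return rng.integers(0, 2, size=(n, DIM))

pop = uniform_sample(S0)
best = min(pop, key=objective)

for k in range(K):
    vals = np.array([objective(x) for x in pop])
    gamma = np.median(vals)                       # labelling threshold gamma_k
    positives = pop[vals <= gamma]                # the "+1" region, approximated by ...
    center = positives.mean(axis=0)               # ... a simple per-bit frequency model h_k

    def sample_from_hk(n):                        # sample inside the learned region
        return (rng.random((n, DIM)) < center).astype(int)

    take_hk = rng.random(SK) < EPS                # epsilon-greedy mixing with U_X
    pop = np.where(take_hk[:, None], sample_from_hk(SK), uniform_sample(SK))
    cand = min(pop, key=objective)
    if objective(cand) < objective(best):         # keep the best-so-far solution
        best = cand

print("best solution:", best, "objective:", objective(best))
```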

4.3 AI/ML Optimizes Task Offloading in the Partial Mode

4.3.1 System Model and Overheads

As illustrated in Figs. 4.3 and 4.4, we consider an application/project composed of multiple mobile users (MUs), denoted by $\mathcal{N} = \{1, 2, \dots, N\}$. Each MU is required to finish a composition task chain. MU-1, as the master, is responsible for sending commands (generated from one task) and aggregating results produced by other MUs to complete the project shown in Fig. 4.4. MUs $\{i \mid i \in \mathcal{N}, i \ne 1\}$ execute tasks individually and return the demanded results back to MU-1. This is exactly the same as in federated learning, where the central server orchestrates algorithms for training nodes (slaves) and aggregates newly updated parameters. The task model shown in Fig. 4.4 is a specific round of collaboration, like one training round in federated learning. Each task can be offloaded to edge servers or processed locally. The optimization objective is to provide valuable schemes by drawing optimal offloading and allocation policies to promote/realize green computing in mobile environments. We assume the master (MU-1) needs to interact with the others twice, once to publish




Fig. 4.3 MEC network with mobile users, APs, and servers. ©2022 IEEE, reprinted, with permission, from H. Chen et al. (2022)


Fig. 4.4 Illustration of service dependency in task chains. ©2022 IEEE, reprinted, with permission, from H. Chen et al. (2022)



contents and once to collect results, that is, broadcasting averaged gradients and aggregating updated gradients in federated learning.

4.3.1.1 System Model

To optimize task offloading in the partial mode, we need a concrete model for the following aspect in a platform. Task model For clarity, computation task chains of MUs are detailed in Fig. 4.4. For example, MU .i, ∀i ∈ N has .Mi computation tasks to be executed in a sequence, defined as .Mi = {1, 2, ..., Mi }. We take .mi,j to indicate the j th task of .Mi , ∀i ∈ N, j  Mi . Each task .mi,j gets ready for execution only after outputs of its all predecessors have been received. For instance, .m2,k2 could be started after results of .m1,k1 and .m2,k2 −1 are returned. We use .Ii,j and .Oi,j to indicate the input and output data of task .mi,j , respectively. The workload of task .mi,j is represented by .Di,j . For convenience, we introduce two empty tasks for each user .i, ∀i ∈ N, referred as to .mi,0 and .mi,Mi +1 , in a way that .Ii,0 = Oi,Mi +1 = 0 and .Di,0 = Di,Mi +1 = 0. We take DAG form .G = (V , E) to represent the whole project, where V indicates the set of tasks and E describes the dependency between two tasks. In this example, MU-1 needs to distribute tasks to other MUs .{i |i ∈ N, i = 1} and aggregate their results. Network model Edge servers are deployed by mobile network operator in strategic points to finish computation tasks on behalf of MUs, denoted by .S = {1, 2, ..., S}. Each server is equipped with an access point (AP) to communicate with MUs via wireless LAN (WLAN). APs have different antenna allocation schemes, and edge servers are heterogeneous in computational resources. MUs have the option of offloading tasks to edge servers or running them locally. It’s assumed that MUs can be only allowed to communicate with at most one edge server concurrently; it means offloading decision follows a binary policy. We use s .a i,j ∈ {0, 1} to represent the binary decision of .mi,j as described below.  s . ai,j

=

1, offloaded to edge server#s 0, otherwise

and  .

s ai,j ≤ 1.

(4.16)

s∈S

Notice that pseudo-tasks are naturally processed at local MUs, that is,   s s a = a s∈S i,0 s∈S i,Mi +1 = 0. Based on orthogonal frequency-division multiplexing (OFDM) technique, we consider an orthogonal partition on spectrum to enable multi-user offloading/downloading service. Therefore, we

.



can safely assume that MUs communicate with edge servers over orthogonal channels and would not interfere with each other. Mobility model We use a random way-point model to characterize the mobility paths of MUs. Each user .i ∈ N stays at the initial point .pi,0 for a random time .ti,0 ∈ (0, tmax ]. Then, the next point .pi,1 is randomly selected. The MU moves to point .pi,1 at .vi,0 m/s (meters per second); .vi,0 is generated randomly in .[vmin , vmax ]. The user i stays in .pi,1 for a random time .ti,1 ∈ [tmin , tmax ]. This process is repeated until the user leaves the platform. In this case, given an arbitrary time point .t ∈ (t0 , tK ), we can obtain its location .li . Because there is s and channel gain .hs , there would a one-to-one mapping between distance .di,j i,j also be a bijection between time point t and channel gain .hsi,j . 4.3.1.2

Overheads

Smart devices are capable of processing many applications locally, some even computationally heavy such as image pre-processing and animation rendering. Yet, because of the performance, local computing is only considered as an alternative strategy when channel state is poor. Dynamic voltage and frequency scaling (DVFS) technology also provides theoretical and technical support to increase the energy efficiency of devices when performing tasks. This subsection aims to analyze the energy consumption and latency for performing tasks locally and producing results. Task .mi,j can be executed locally only after the output data of .mi,j −1 has reached MU i. We define .T Ri,j to represent the transmission latency between tasks .mi,j −1 and .mi,j as follows: l .T Ri,j

=



 s ai,j −1

1−



s d = where .τi,j

Oi,j −1 c Ri,j

 s ai,j

d τi,j

(4.17)

s

c is the downlink is the download delay from edge server and .Ri,j

l bitrate (explained in the next section). .T Ri,j only exists when task .mi,j −1 is .mi,j is executed on the local device, indicated computed in the edge server and task  s  s by . ai,j ai,j ). −1 (1 − s

s

To calculate the completion latency, we let $f_{i,j}$ denote the CPU clock frequency for finishing task $m_{i,j}$. It can be expressed as

$CL_{i,j}^l = \frac{D_{i,j}}{f_{i,j}}$   (4.18)



and the corresponding energy consumption of local execution is

$E_{i,j}^l = \left(\alpha f_{i,j}^2 + \beta\right) CL_{i,j}^l = \frac{\alpha D_{i,j}^2}{CL_{i,j}^l} + \beta\, CL_{i,j}^l$   (4.19)

where .α and .β are parameters decided according to the CPU model. Notice that the DVFS technology divides .fi,j into discrete values ranging from 0 to .fpeak . For simplicity, we regard .fi,j as a continuous value to attain its optimal solution. Devices just need to choose the closest element from the set in practice or use other approaches to select the best set of frequencies. Overheads of edge computing To compute this overhead, we will elaborate on the concrete process of edge computing. This process starts when the task .mi,j −1 is completed and the result must be sent to .mi,j . It could be divided into two stages. • Offloading or migrating: Before task .mi,j can be executed, the edge server should get the output of task happens when .mi,j −1 is i,j −1 . Offloading  .m  s s completed locally, that is, . 1 − s ai,j −1 s ai,j = 1. MU i is required to send output data from task .j − 1 to the target edge server with adjustable power .pi,j . It comes with transmission delay as formulated below. u τi,j =

.

Oi,j −1 s Ri,j

(4.20)

s denotes the bit rate of uplink to edge server s; it can be computed where .Ri,j as   s h p i,j i,j s .Ri,j = B log 1 + (4.21) σ2

where B is the communication bandwidth, .σ 2 denotes the variance of additive white Gaussian noise (AWGN) originating from the receiver (e.g., receiver θ  8 denotes the channel gain caused by thermal noise), and .hsi,j = G 4π3·10 s Fc d i,j

path loss and shadowing attenuation. Here, G is the antenna gain, .Fc indicates s denotes the the carrier frequency, .θ represents the path loss exponent and .di,j distance between the target server s and the MU i, and .θ represents the path c is the downlink bitrate based on fixed power .p c . loss exponent. .Ri,j 

By introducing $f(x) = \sigma^2 \left(2^{x/B} - 1\right)$, we will have

$p_{i,j} = \frac{1}{h_{i,j}^s} f\!\left(\frac{O_{i,j-1}}{\tau_{i,j}^u}\right)$   (4.22)



Accordingly, the energy overhead can be calculated by

$E_{i,j}^u = p_{i,j}\, \tau_{i,j}^u = \frac{\tau_{i,j}^u}{h_{i,j}^s} f\!\left(\frac{O_{i,j-1}}{\tau_{i,j}^u}\right)$   (4.23)
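
A small sketch of the power/energy bookkeeping in Eqs. (4.22)–(4.23), with hypothetical channel and task parameters:

```python
def f_inv_rate(x_bps, bandwidth_hz, noise_var):
    """f(x) = sigma^2 * (2**(x/B) - 1): received power needed to sustain rate x."""
    return noise_var * (2.0 ** (x_bps / bandwidth_hz) - 1.0)

def upload(o_prev_bits, tau_u_s, h_s, bandwidth_hz, noise_var):
    """Return (transmit power, transmit energy) for sending O_{i,j-1} in tau_u seconds."""
    rate = o_prev_bits / tau_u_s
    p = f_inv_rate(rate, bandwidth_hz, noise_var) / h_s          # Eq. (4.22)
    e = p * tau_u_s                                              # Eq. (4.23)
    return p, e

print(upload(o_prev_bits=2e6, tau_u_s=0.05, h_s=1e-6,
             bandwidth_hz=10e6, noise_var=1e-10))
```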

We take a triad .{CP Us , STs , BDs } to express the current status of each edge server .∀s ∈ S where .CP Us , STs , BDs indicates the percentage of the remaining CPU, storage, and bandwidth resources; .0 ≤ CP Us , STs , BDs ≤ 1. Task .mi,j can be offloaded to edge server s only if its demand for these three resources could be satisfied; that is, .cpui,j ≤ CP Us , sti,j ≤ STs , bdi,j ≤ BDs . The channel state is location-based, since the channel gain .hsi,j is related s from the edge server. Therefore, it is essential for MUs to to the distance .di,j be aware of channel state information (CSI) before making any offloading strategy. Suppose APs are channel-aware and there exist feedback channels to transfer CSI from APs to MUs. Before uploading, MUs send offloading decisions to edge servers deployed with controllers. They determine how many resources (.cpui,j , sti,j , bdi,j ) the task needs based on historical records, check whether the offloading is feasible (sufficient computational capability, bandwidth, and storage resources), and send back control signal to MUs. Task migration occurs when two consecutive tasks .mi,j −1 and .mi,j are executed on two different edge servers. We use .τ0 to indicate the migration delay. Therefore, the latency for transmission can be calculated as  c . T Ri,j

= 1−

 s

 s ai,j −1



s u ai,j τi,j

+

s

 s

 s ai,j −1

1−



 s s ai,j −1 ai,j

τ0

s

(4.24) c The energy consumption of edge computing can be given by .Ei,j =  s  s u ai,j (1 − ai,j −1 )Ei,j . s

s

• Edge execution: Similar to local computing, the completion latency of edge computing can be calculated as

$CL_{i,j}^c = \frac{D_{i,j}}{f_s^c}$   (4.25)

where .fsc is the CPU frequency of the sth edge server. Without loss of generality, we set .min∀s∈S fsc > fpeak . Overheads of user-user offloading In this section, we will elaborate on the impact of other types of overhead caused by the energy consumption and time spent retrieving results from other MUs. For example, task .m2,k2 needs to retrieve the output of .m1,k1 . Let .Ai,j denote the execution platform of j th task in ith MU. .Ai,j = 0 indicates task .mi,j chooses local computing; .Ai,j = s means it is about to be offloaded to



edge server s. Assume that .mi,j is a predecessor of .mi ,j where .i = i . According to different .Ai,j and .Ai ,j , we divide these overheads into five categories. (1) .0 < Ai,j = Ai ,j : In this case, there is no need for extra communication since tasks .mi,j and .mi ,j can be executed consecutively at the same edge server. (2) .0 = Ai,j < Ai ,j : It means the result of computation task .mi,j is required to be sent from MU i to .Ai ,j . The additional overheads in terms of delay and energy consumption could be, respectively, calculated as ex

τi,j (i , j ) =

.

Oi,j A

Ri,ji ,j

(4.26)



and ex .ei,j

  i ,j =



Oi,j A



A

Ri,ji ,j hi,ji ,j



f

Oi,j ex τi,j (i , j )

 (4.27)

A



In this case, MU i dispatches result .Oi,j to .Ai ,j at bitrate .Ri,ji ,j with a constant power .p0 . (3) .0 = Ai ,j < Ai,j : This is similar to the previous case with an additional download from .Ai,j . The extra cost can be computed as   Oi,j ex

i ,j = c τi,j Ri ,j

.

(4.28)

(4) .0 < Ai,j = Ai ,j : In this case, .mi,j and .mi ,j are executed on different edge ex = τ . servers. As aforementioned before, the migration delay is .τi,j 0 (5) .0 = Ai,j = Ai ,j : This happens when the two tasks are executed in two local devices i and .i . It needs to exchange result of .mi,j from MU i to MU

.i where APs act as relay nodes. The extra latency could be calculated as   Oi,j Oi,j ex

i , j = max + c τi,j Ri,j Ri ,j

.

(4.29)

max is the maximum uploading rate MU i could achieve with edge where .Ri,j servers .S. It can be regarded as one uploading and one downloading process.   ex i , j = p0 Oi,j . Accordingly, the extra energy consumption would be .ei,j R max i,j



4.3.2 Problem Formulation

To formulate the expression of the delay $T_i$ of each MU $i, \forall i \in \mathcal{N}$, we use the finish time $FT_{i,M_i+1}$ of task $m_{i,M_i+1}$ to equivalently represent the total latency of MU i. We can express the finish time of the jth task of MU i ($m_{i,j}$) as a sum of two parts:

$FT_{i,j} = RT_{i,j} + CL_{i,j}$   (4.30)

where $RT_{i,j}$ is the ready time for $m_{i,j}$ to be executed and $CL_{i,j} = \sum_s a_{i,j}^s\, CL_{i,j}^c + \left(1 - \sum_s a_{i,j}^s\right) CL_{i,j}^l$ denotes the completion latency of task $m_{i,j}$. A task is ready for execution only after the outputs of all its predecessors have reached its execution platform. Let $pred(m_{i,j})$ indicate the predecessor task set of task $m_{i,j}$. $RT_{i,j}$ can be computed as

$RT_{i,j} = \max\left\{ FT_{i,j-1} + TR_{i,j}^l + TR_{i,j}^c,\ T_{i,j}^{max} \right\}$   (4.31)

In the .max{·} function, the former indicates the arrival time of the output of predecessor task from MU i (.mi,j −1 ), and the latter .Tmax the lati,j denotes  est arrival time of predecessors from other MUs . i |i ∈ N, i = i , .Tmax  i,j   maxi =i,mi ,j ∈pred (mi,j ) F Ti ,j + τiex

,j (i, j ) . When .pred(mi,j ) has only one elel c ment .mi,j −1 , we set .Tmax i,j = 0, that is, .RTi,j = F Ti,j −1 + T Ri,j + T Ri,j . As a result, we can obtain the expressions of the overall latency .Ti of MU i as Ti = F Ti,Mi +1

(4.32)

.

For task $m_{i,j}$, we can express its energy consumption as

$E_{i,j} = \left(1 - \sum_s a_{i,j}^s\right) E_{i,j}^l + \sum_s a_{i,j}^s E_{i,j}^c + e_{i,j}^{ex}$   (4.33)

The total energy consumption of MU i can be calculated as the sum of the energy consumption of tasks from $m_{i,1}$ to $m_{i,M_i+1}$, as expressed below:

$E_i = \sum_{j=1}^{M_i+1} E_{i,j}$   (4.34)
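
The recursions (4.30)–(4.34) can be evaluated directly once per-task latencies and energies are known. The sketch below does so for a single task chain without inter-user dependencies (i.e., with the latest-arrival term set to zero); all numbers are hypothetical.

```python
def chain_finish_time_and_energy(tasks):
    """tasks: list of dicts with keys tr_l, tr_c (transfer delays), cl (completion
    latency) and e (per-task energy, Eq. 4.33), in chain order."""
    ft, total_energy = 0.0, 0.0
    for task in tasks:
        rt = ft + task["tr_l"] + task["tr_c"]   # ready time (Eq. 4.31, single predecessor)
        ft = rt + task["cl"]                    # finish time (Eq. 4.30)
        total_energy += task["e"]               # accumulate Eq. (4.34)
    return ft, total_energy                     # T_i = FT_{i,M_i+1} (Eq. 4.32)

chain = [
    {"tr_l": 0.00, "tr_c": 0.00, "cl": 0.04, "e": 0.20},
    {"tr_l": 0.01, "tr_c": 0.00, "cl": 0.03, "e": 0.05},
    {"tr_l": 0.00, "tr_c": 0.02, "cl": 0.02, "e": 0.08},
]
print(chain_finish_time_and_energy(chain))
```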

In this work, our objective is to jointly optimize the energy consumption and response delay which are proved to be two competitive objectives. In real platforms, each participant in the collaboration is supposed to be self-governed. They have



their own concerns about the energy consumption and response delay to complete the entire project. For example, there could exist someone who is urgently needed to finish work (large .ωiT ), even at the cost of high energy consumption (small .ωiE ). Parameters .ωiT and .ωiE will influence other workers’ decisions to reduce response time. In other words, what we care about is not only the performance of the whole project but also the concerns of each participant. We will minimize the weighted sum of response delay and energy consumption of each participant as our objective; it can be defined as   ωiT Ti +ωiE Ei P 1 : min (a,p,f )

.

i∈N

s.t. 0 ≤ pi,j ≤ ppeak ,

(4.35)

0 ≤ fi,j ≤ fpeak , s ai,j ∈ {0, 1}, ∀i, j.

where .0 < ωiT < 1 is the normalized weight parameter that indicates how much MU i cares about the entire latency and normalized .0 < ωiE < 1 denotes the importance of energy consumption, which is decided by the preference of MU i and its specific application. It is interesting to mention that parameters .ωiT and .ωiE are allowed to take any value before normalization. When set to specific values, P 1 turn into a special problem (e.g., max span minimization problem). Notice that solutions s , f , and .p ) of tasks .m , ∀i ∈ N, j ∈ M are spatio-temporally coupled (.ai,j i,j i,j i,j i with each other. This kind of relations is reflected in the changes of network status brought by different locations of MU i, which is determined by the ready time .RTi,j and mobility pattern. For inner-user dependency, the location where the task .mi,j is ready for execution depends on the amount of time previous tasks .mi,j , 1 ≤ j ≤ s , f , p . For inter-user dependency, any j −1 take, which in turn depends on .ai,j

i,j i,j task that needs results from other MUs cannot start execution unless results have s been received. Once the location to execute the task .mi,j changes, the distance .di,j from MU i to the edge server .s, ∀s ∈ S will change accordingly as well; this will further affect the decisions and performance of later tasks. In other words, solutions

s .a , fi ,j , pi ,j , ∀j ∈ Mi of other MUs can influence the performance of MU i. To i ,j avoid such extra complexity, we assume that the resources of each edge server are enough for MUs.

4.3.3 Solution

This section outlines our solutions for resource allocation and offloading, respectively.



4.3.3.1 Allocation of CPU Frequency and Power

We denote the “receive-process” as the one that is receiving results from other MUs and the “send-process” as the one that is sending result/output to other MUs. As shown in Fig. 4.4, MUs can be both receive-process and send-process. It can be inferred that only receiving output from other MUs will influence subsequent tasks. To decouple this inter-user dependency, we introduce N auxiliary variables .Qi , 1 ≤ i ≤ N to indicate the end time of receive-process for MU i. For example, MU-1 finishes receiving after all results from other MUs have reached its execution platform. .Q1 can be computed as   l c Q1  max F T1,r1 −1 + T R1,r + T R1,r , Tmax 1,r1 , i = 1 1 1

.

(4.36)

As for MU .i, 1 < i ≤ N , it only receives the result of .m1,k1 from MU-1. The .Qi can be computed as   l c max ,1 < i ≤ N Qi  max F Ti,ki −1 + T Ri,k + T R , T i,ki i,ki i

.

(4.37)

We can rewrite .RTi,j as

RTi,j =

.

⎧ ⎪ ⎪ ⎨Q1

Qi ⎪ ⎪ ⎩F T

i = 1, j = r1 1 < i ≤ N, j = ki l i,j −1 + T Ri,j

c + T Ri,j

(4.38)

otherwise

where the coupling term .max{·} is eliminated. Note that there exists a bijection u ; this means that we can obtain .f between .fi,j and .CLli,j and .pi,j and .τi,j i,j and l u .pi,j by determining .CL i,j and .τi,j , respectively. P 1 could be further approximately reformulated as   ωiT Ti +ωiE Ei .P 1 − AP : min (a,τ u ,CLl )

s.t.

s ai,j

i∈N

∈ {0, 1},

0 ≤ pi,j ≤ ppeak , 0 ≤ fi,j ≤ fpeak , ∀i, j, N  

 l c sum F Ti,ki −1 + T Ri,k ≤ 0, . + T R + T − 2Q i i,k i,k i i i

(4.39) (C1)

i=2 l c F T1,r1 −1 + T R1,r + T R1,r + Tsum 1,r1 − NQ1 ≤ 0, . 1 1

(C2)



where C1 and C2 approximate to Eqs. (4.36) and (4.37),  respectively. This indicates the inter-user constraints. Specifically, .Tsum  i =i,(i ,j )∈pred(i,j ) F Ti ,j + i,j τiex

,j (i, j ). In this way, we decouple the solution of MUs. .P 1 − AP is a non-convex MINLP problem. A closer observation of .P 1 − AP shows that the feasible set of a is not related to .τ u , CLl . In other words, we can solve a and .τ u , CLl separately. Furthermore, .P 1 − AP would become a convex problem if .a is given. By assuming .a is given, the Lagrangian function of .P 1 − AP can be expressed as   l u L(τi,j , τi,j , λ, μ) = ωiT Ti + ωiE Ei i∈N



.

N    l c sum + T R + T − 2Q F Ti,ki −1 + T Ri,k i i,k i,k i i i

(4.40)

i=2

  l c sum + μ F T1,r1 −1 + T R1,r + T R + T − N Q 1 1,r1 1,r1 1 where .λ and .μ are Lagrangian multipliers. Based on the KKT condition, .λ ≥ 0 and μ ≥ 0.

.

Proposition 4.1 (Adapted from Theorem 4.1 in [34] and Proposition in [33])  3.1 s = 0 is ∗ of task j for MU i with . The optimal CPU frequency allocation .fi,j ai,j s

given by  ∗ fi,j = min

.

G + ωiE β ωiE α

 , fpeak

(4.41)

For MU-1, when .0 < j ≤ k1 , G = μ+(N −1)λ, and when .k1 < j ≤ r1 −1, G = μ; otherwise, .G = ω1T . For MU .i, 0 < i ≤ N, when .0 < j ≤ ki − 1, G = λ, when T T .ki − 1 < j ≤ ri , G = ω + μ, otherwise .G = ω . i 1 Proof For MU 1, when .0 < j ≤ k1 , the derivative of the Lagrangian function L with respect to .CLli,j is ∂L ∂CLl1,j .

=

  ∂ ω1E E1 + μF T1,r1 −1 + λ (N − 1) F T1,k1 ∂CLl1,j ⎞

⎛ = ω1E ⎝−

αD1,j 2 2 CLl1,j

+ β ⎠ + μ + (N − 1) λ

(4.42)



$  D1,j where .CLl1,j ranges within . fpeak , +∞ and . function with .CLl1,j . If ∗ fpeak ; otherwise, .f1,j

.

∂L | D ∂CLl1,j CLl1,j = f 1,j

%

=

∂L ∂CLl1,j

is a monotonously increasing

∗ = > 0, the optimal solution is .f1,j

peak

μ+(N −1)λ+ω1E β . ω1E α

When .k1 < j ≤ r1 − 1, we have ∂L ∂CLl1,j

=

  ∂ ω1E E1 + μF T1,r1 −1 ∂CLl1,j



.

= ω1E ⎝−

Similarly, we

∗ obtain .fi,j

% = min

αD1,j 2 CLl1,j



(4.43)

+ β⎠ + μ

2

μ+ω1E β , fpeak ω1E α

& ; otherwise, the derivative of L

is ∂L ∂CLl1,j

=

  ∂ ω1T T1 + ω1E E1 ⎛

.

=

∗ = min Thus, we have .fi,j

ω1E

%

∂CLl1,j

⎝−

αD1,j CLl1,j

⎞ 2

(4.44)

+ β ⎠ + ω1T 2

ω1T +ω1E β , fpeak ω1E α

& . As for other MUs, the mathe-

matical derivation of .fi,j , 1 < i ≤ N is similar to the proof for .f1,j .

 

With the mathematical analysis of Proposition 4.1, there are some interesting numerical properties. For the master, MU-1, the optimal CPU frequency of .m1,j , 0 < j ≤ k1 is proportional to .λ and .μ; in fact, the larger the .λ and .μ, the tighter the constraints of inter-user dependency. When .λ and .μ exceed a certain threshold, MU-1 has to compute tasks with the maximum frequency .fpeak at the ∗ only price of the maximum energy consumption. Similarly, .k1 < j ≤ r1 − 1, .f1,j depends on .μ, because task j will not be affected by'the first dependency. Otherwise, ∗ is decided by .ωT ωE , that is, the comparison of when .r1 − 1 < j ≤ M1 , .f1,j 1 1 preference on delay and energy consumption instead of .λ and .μ. This is because the rest of .M1 would not be affected by dependency. ∗ is proportional The execution of task .j, 0 < j ≤ ki − 1 of MU .i, 1 < i ≤ N, .fi,j to the value of .λ. The larger the .λ, the more urgent the ith MU is to complete task .ki . As a result, MU i will run tasks with a higher CPU frequency regardless of the ∗ for task .j, 0 < j ≤ k − 1 is not related to .λ, additional overhead. Note that .fi,j i because its performance has no direct impact on the second interdependency. On

4.3 AI/ML Optimizes Task Offloading the Partial Mode

137

∗ for task .j, k − 1 < j ≤ M is decided by .λ. The rest of the the other hand, .fi,j i i tasks .j, ri < j ≤ Mi can be regarded as an independent task sequence; that is, their ∗ only depends on .ωT and .ωE . solution .fi,j i i u )∗ of task .j, 1 ≤ j ≤ M + 1 for Proposition 4.2 The optimal transmit delay .(τi,j i    s  s MU .i ∈ N with . 1 − s ai,j −1 s ai,j = 1 is given by

⎧ Oi,j −1 ⎨  ∗ ⎪ ppeak hsi,j u . τi,j = B log2 (1+ σ 2 ) ⎪ ⎩ ln 2·Oi,j −1 B(W (ge−1 )+1)

hsi,j ≤

σ2 z ppeak [− W (−ze−z )

λ(N−1)+μ ω1E ppeak

μ and .g = E 2 − ω1E ppeak ω1 σ ωiT h1,j ω1T 1+ E and .g = E 2 −1. For MU .{i |i ω1 ppeak ω1 σ hsi,j λ λ

when .0 < j ≤ ki − 1, .z = 1 + ri − 1, .z = 1 + z = 1+

.

ωiT ωiE ppeak

h1,j (λ(N−1)+μ) ω1E σ 2 hsi,j μ

and .g =

when .k1 + 1 < j ≤ r1 − 1, .z = 1 + r1 −1 < j ≤ M1 +1, .z =

μ ωiE ppeak

ωiE ppeak hsi,j μ

and .g = hsi,j ωiT

and .g =

ωiE σ 2

of .J (z) = zexp(z).

(4.45)

otherwise.

For MU-1, when .0 < j ≤ k1 , .z = 1 +

.

− 1]

ωiE σ 2

and .g =

ωiE σ 2

− 1;

1; and when ∈ N, i = 1} ,

− 1; when .ki − 1 < j ≤

− 1; and when .ri − 1 < j ≤ Mi + 1,

− 1. .W (x) is Lambert function, the inverse function

Proof The proof is similar to the proof for Proposition 1; it can be completed by deriving the first-order and second-order derivatives of the Lagrangian function L y with respect to .τi,j , where .i ∈ N, j ∈ Mi .   There are interesting properties that if channel gain .hsi,j is weaker than a certain threshold (e.g., caused by deep shading), MUs have to offload tasks with the ∗ = p maximum power as .pi,j peak . The threshold is inversely proportional to .λ and ∗ .μ. This is because tighter dependency means greater demand for .p i,j ; otherwise, s u )∗ . This is when .hi,j is higher than the threshold, the MUs will obtain a lower .(τi,j because .W (x) is an increasing function when .x > −1/e. Furthermore, similar to Proposition 1, .λ, .μ, and .ωT have different effects according to the index of task. Theorem 4.1 We can quickly obtain the optimal .λ∗ and .μ∗ , with iteration as .

 + λ(t + 1) = λ(t) + (t) · Y

(4.46)

μ(t + 1) = [μ(t) + (t) · Z]+

(4.47)

and .

 sum l c where .Y = N i=2 (2Qi − F Ti,ki −1 − T Ri,ki −T Ri,ki − Ti,ki ) and .Z =  l c sum where .(t) is diminishing step size. N Q1 − F T1,r1 −1 − T R1,r − T R1,r − τ1,r 1 1 1

138

4 AI/ML for Computation Offloading

The number of iterations for the normal Lagrangian method is non-deterministic; this is due to the  random initial values of .λ and .μ. However, if the diminishing step size satisfies . ∞ n=0 (n) = 0 and .limt→∞ (t) = 0, the Lagrangian method will quickly converge, regardless of values for the initial points.

4.3.3.2

Solution of Offloading Policy

∗ and .f ∗ assuming .a We obtain .pi,j i,j is given. This MINLP problem is accordingly i,j s . It is a difficult problem since it is transformed into ILP program with variables .ai,j ( non-convex with exponential solution space as . i∈N (Mi )S+1 . In addition, due to the mobility of users and task dependencies, there exists a spatio-temporal causality among offloading strategies. We define .γi,j ∈ {0, 1, . . . , S} to indicate the offloading policy of task .mi,j , where .γi,j = 0 represents local execution and .γi,j = s, 1 ≤ s ≤ S indicates offloading to server s. We denote . as the Markov chain state set and .φ ∈  as each state composed of the set of offloading policies .{γ1 , γ2 , · · · , γN }. .γi is a .Mi dimensional vector indicating the offloading decision of MU i, and each dimension is .γi,j for task .mi,j . For brevity, P 1 as a common optimization  we re-express problem .minφ∈ xφ , where .xφ = i∈N ωiT Ti + ωiE Ei . To construct our problem-specific Markov chain, we associate each state with a percentage of time .πφ when .φ is in use. Thus, it can be transferred to an equivalent minimum-weight independent set (MWIS) problem, which holds the same optimal value, as

P 1 − MW I S : min



πφ xφ

φ∈

.

s.t.



πφ = 1

(4.48)

φ∈

We consider .xφ to be the weight of .πφ . By introducing log-sum-exp function, .P 1 − MW I S can be approximated to a convex problem P 1 − ρ : min



πφ xφ +

φ∈

.

s.t.



1  πφ log πφ ρ φ∈

(4.49)

πφ = 1

φ∈

' where the approximation gap is upper bounded by .log || ρ; .ρ is a positive constant.

4.3 AI/ML Optimizes Task Offloading the Partial Mode

$Based on the) KKT condition, φ∈ exp(−ρxφ ) would be

.

the

139

optimal

value

for

 '  1 ρ log

.

exp(−ρxφ ) πφ∗ =  exp(−ρxφ )

(4.50)

φ ∈

We consider .πφ∗ as the stationary distribution to construct our time-sharing Markov chain. Once it converges, the time allocation .πφ∗ for .φ, ∀φ ∈  can be obtained, and .P 1 − MW I S will be solved based on the most time assigned to .φ ∗ . Due to the product form of .πφ∗ , we can design at least one Markov chain while satisfying the following conditions. • any two states are reachable from each other, • the following equation is satisfied .

πφ∗ qφ,φ = πφ∗ qφ ,φ

(4.51)

where .qφ,φ denotes the transition rate from .φ to .φ . There exist many forms of .qφ,φ . In this work, we design it as follows: .

qφ,φ =

exp (−ϕ)    1 + exp −ρ xφ − xφ

(4.52)

where .ϕ is a constant, while the above two conditions are satisfied. To this end, we proposed the CSRAO (Collaborative Service Resource Allocation and Offloading) and show it in Algorithm 3. It is a distributed algorithm and only requires that MUs broadcast their new states with each other. To start, all MUs initialize offloading policies and generate their own timers randomly (lines 1–4). Consequently, each MU i determines .fi and .pi (lines 5–6). In each iteration, the first MU whose timer expires randomly updates its offloading decision with only one element changed; it informs other MUs to obtain new .f and .p. The system has to decide whether to stay in the new state or not (lines 8–13) until P 1 converges. It is worth mentioning that not only could CSRAO converge fast but also it possesses a mechanism (adjustable parameter .ρ) to prevent falling into local optimum.

4.3.3.3

Algorithm Analysis

In this subsection, we will analyze the complexity, the feasibility of convergence, the approximation gap, and the convergence time of our proposed CSRAO algorithm. Complexity analysis As explained, we applied a Lagrangian multiplier method ∗ and .p ∗ , by assuming a is given. In to derive the closed-form expression of .fi,j i,j

140

4 AI/ML for Computation Offloading

Algorithm 3 Collaborative Service Resource Allocation and Offloading (CSRAO) Algorithm In: Mobility information of each MUs, network state information and DAG information s Out: Resource allocation pi,j and fi,j and offloading policy ai,j for each MU i ∈ N do Initialise the offloading policy γi randomly Generate an exponential random timer with ' mean as .exp(ϕ) Vi Calculate the objective P 1 as λ and μ converge Denote the current system state as φ and let all MUs begin counting down end for if P 1 converges then if there exists one timer of MU expires then Denote the MU as i and update its offloading policy randomly as new state φ with only one element changed MU i broadcasts the new state to 'other MUs to obtain the value  of f ∗ and p ∗ φ ← φ with probability .1 − exp(−ρxφ ) exp(−ρxφ ) + exp(−ρxφ ) All MUs refresh timers and begin counting down end if end if return γ , f ∗ and p ∗

∗ , and .p ∗ under diminishing step size, CSRAO Algorithm 3, by updating .ai,j , .fi,j i,j could quickly converge with complexity as .O(T Mi ), where T represents the number of iteration. Due to .(t), the value of T is small, and the searching space ( of a is .O( i∈N (Mi )S+1 ). By introducing Markov approximation algorithm, we could obtain the sub-optimal solution within polynomial time when .ρ is a properly set according to Theorem 4.1, which will significantly reduce the exponential complexity of the exhaustive search. The feasibility of convergence To determine the feasibility of convergence, we propose the following theorem.

Theorem 4.2 P 1 can'be solved globally by Algorithm 4.16, with transition prob ability as .exp(−ρxφ ) exp(−ρxφ ) + exp(−ρxφ ) . When .ρ → 0, Algorithm 4.16 converges with probability “1.” Proof Let .φ denote the current state with .a, f, p given. The objective value .xφ could be derived as (4.39) . In each iteration, the first user i whose countdown times expire first randomly changes its offloading policy .ai,j , ∀j ∈ Mi to update a. In this case, .f, p could converge fast according to Eqs. (4.46) and (4.47), so as to obtain

.x . According to Algorithm 4.16, under state .φ, user i counts down with rate as φ' .V exp(ϕ), and there is qφ,φ = .

=

exp(−ρxφ ) 1 Vi · · Vi exp(−ρxφ ) + exp(−ρxφ ) exp(ϕ) exp(−ϕ)    1 + exp −ρ xφ − xφ

(4.53)

4.3 AI/ML Optimizes Task Offloading the Partial Mode

141

which is equal to Eq. (4.52) designed before. It means that our design is satisfied to realize time-sharing among different states in the Markov chain. Thus, the optimal solution will be gained with the stationary distribution convergent to Eq. (4.51). Let .φ ∗ be the global optimum, that is, .xφ ∗ ≤ xφ , ∀φ ∈ . According to Eq. (4.52) , the system is prone to stay in .xφ with larger probability .qφ,φ whose objective value is lower. It indicates the system will converge to .φ ∗ and be allocated with time percentage as .πφ∗ . We re-express Eq. (4.51) as

.

πφ∗∗ =  φ ∈

1 exp(−ρ(xφ ∗ − xφ ))

(4.54)

It can be concluded that .πφ∗∗ (x) increases as .ρ decreases. When .ρ → 0, the time percentage allocated to .φ ∗ is .πφ∗∗ → 1. However, there exists a trade-off that large .ρ may lead to system getting trapped in local optimum.   The approximation gap following theorem.

To determine the approximation gap, we propose the

' Theorem 4.3 The approximation gap is upper bounded by .0 ≤  − ε ≤ log || ρ, where . denotes the approximation solution of CSRAO and .ε indicates the optimum [35, 36]. Proof Following [36], we introduced a Dirac delta function to represent the distribution .π¯ φ of the theoretical optimality with definition .φmin  arg minφ∈ xφ .  .

π¯ φ =

1, if φ = φmin 0, otherwise

(4.55)

According to the optimal stationary distribution of .πφ∗ , we derive  .

πφ∗ xφ +

φ∈

 1  ∗ 1  πφ log πφ∗ ≤ π¯ φ xφ + π¯ φ log π¯ φ = ε ρ ρ φ∈

φ∈

(4.56)

φ∈

Based on the Jensen inequality, we obtain  .

⎛ πφ∗ log πφ∗ ≥ − log ⎝

φ∈



φ∈

⎞ πφ∗ ∗ ·

1 ⎠ = − log || πφ∗

(4.57)

By substituting Eq. (4.57) with Eq. (4.56) , we have

.

=

 φ∈

πφ∗ xφ ≤ ε +

log || ρ

(4.58)

142

4 AI/ML for Computation Offloading

 

With . > ε, this theorem is proved. The convergence time lowing theorem.

To determine the convergence time, we propose the fol-

Theorem 4.4 (Adapted from Theorem 2.5 in [37]) . The mixing time (convergence time) of our designed continuous-time Markov chain is upper bounded by .O(log(N )) and lower bounded if   1 Mi S+1 − 1 , 0 < ρ < ρth , /2 exp(−ϕ − ρxmin ) 2ν

ln

.

i∈N

where 



1

ρth = ln 1 + 

.

i∈N

(Mi S+1 − 1)

/2(xmax − xmin ).

Proof Let .Pt (φ) represent the probability distribution of all states in . at time t, with the initial state as .φ. According to [37], the mixing time can be given by  & * * ∗* * . tmix (ν) = inf t > 0 : max Pt (φ) − π ≤ν TV 

φ∈



(4.59)



 We denote .xmax = maxφ∈ xφ and .xmin = minφ∈ xφ , since .|| exp(−βxmax ) ≤ φ ∈ exp(−βxφ ) ≤ || exp(−βxmin ). According to Eq. (4.50), we could derive the minimum probability of stationary 

distribution .πmin = min πφ∗ ≥

exp(−ρ(xmax −xmin )) . ||

 Based on the uniformization technique [38], let .Q = qφ,φ denote the transition rate matrix of our Markov chain and develop a discrete-time φ∈



Markov chain .ς with transition rate matrix .P = 1 + Q θ , where .θ =  S+1   M − 1 exp(−ϕ − ρx ) is the uniformization constant and I is the i min i∈N identity matrix. 1 1 Applying spectral gap inequality [39], we have .tmix (ν) ≥ θ(1− ln 2ν where 2) .2 indicates the second largest eigenvalue of P . With Cheeger’s inequality [39], we have .1 − 2χ ≤ 2 ≤ 1 − 12 χ 2 where .χ is bounded by . πmin θ · exp(−ϕ − ρxmax ) ≤ χ ≤ 1. .tmix (ν) satisfies t (ν) ≥ . mix

1 ln 2ν    S+1 Mi 2 exp(−ϕ − ρxmin ) −1

(4.60)

i∈N

Next, we prove the upper bound of .tmix (ν). Similarly, we construct' a uniformized Markov chain .ς whose transition matrix is .P = I + Q θ

4.4 AI/ML Optimizes Complex Jobs

143



by uniformization technique  S+1  on our Markov chain. .θ is given by .θ  exp(−ϕ) i∈N Mi −1 φ∈ exp(−ρxφ ). Using the path coupling method [40], there is

dT V (Pt (φ), p∗ ) ≤ .

N · exp(− exp(−ϕ)Kt exp(−ρ(2xmax − xmin )))



=

(4.61)

  (Mi S+1 − 1)(exp(2ρ(xmax − xmin )) , 0 < ρ < ρth = where .K = i∈N   1  ln 1 + /2(xmax − xmin ). S+1 i∈N

(Mi

−1)

Thus, according to [37], we have exp(ρ(2xmax − xmin ) + ϕ) · ln Nν   t (ν) ≤  . mix (Mi S+1 − 1)(exp(2ρ(xmax − xmin ))) i∈N

(4.62)  

4.4 AI/ML Optimizes Complex Jobs In recent years, the micro-services application architecture has achieved rapid advances. By changing applications from monolithic to small pieces of code as functions, Function-as-a-Service (FaaS) is leading its way to the future service pattern of cloud computing [41, 42]. Combining FaaS with lightweight containerization and service orchestration tools, such as the Docker runtime [43] and the Kubernetes engine [44], the concept of serverless computing becomes increasingly popular [45]. Serverless computing offers a platform that allows the execution of software without providing any notion of the underlying computing clusters, operating systems, VMs, or containers [46]. It is one step ahead in the abstraction staircase from Infrastructure-as-a-Service (IaaS) to Platform-as-a-Service (PaaS) [42]. Meanwhile, with more and more applications offloaded to remote cloud data centers, it is hard to meet the QoS requirements of latency-sensitive applications [47]. To mitigate latency, near-data processing within the network edge is a more applicable way to gain insights, which leads to the birth of edge computing. Generally, edge computing refers to leveraging the computation- and communicationenabled servers, located at the network edge, to make quick response to mobile and IoT applications [48, 49]. Edge servers can be co-located with telecommunication equipment in multiple places across radio access networks (RAN) to the core network (5GC), in the way of small-scale data centers or machine rooms [50]. Demonstrating common features with the requirements of Internet of Things (IoT) applications, the adaptation of serverless in edge computing has attracted special attention from both industrial community and academia, leading to the birth of serverless edge computing [51]. The paradigm of serverless edge computing

144

4 AI/ML for Computation Offloading

allows users to execute their differentiated applications without managing the underlying servers and clusters. Serverless has proven to be more cost-efficient and user-friendly compared to traditional IaaS architectures in many pilot projects [42]. Nevertheless, serverless edge computing faces a series of problems that need to be solved urgently. One of the problems restricting its development concerns the scheduling of functions with complex inter-task dependency to resource-constrained edges [51–53]. To address this issue, many works have studied the placement of optimal functions in heterogeneous edge servers [54–56]. In these works, applications are structured as a service function chain (SFC) or a directed acyclic graph (DAG) composed of dependent functions, and the placement of each function is obtained through minimizing the makespan of the application, under the trade-off between function processing time and cross-server data transferring overhead. When minimizing the makespan of the application, state-of-the-art approaches only optimize the placement of functions, that is, how the request to call the function’s successors being routed and how the corresponding data stream being mapped onto the virtual links between edge servers are typically ignored [54, 55]. This is despite the fact that routing and management of flow traffic are of great importance for cloud-native applications. In Kubernetes-native systems, Istio is a popular tool for traffic management; it relies on the Envoy proxies co-configured along with the functions [57]. Istio enables the edge-cloud cluster manager to configure how each function’s request invokes its successors, along with its internal output data and routes within an Istio service mesh. Based on flexible and smart configurations and with the help of Istio, we can find better routing and stream mapping and consequently obtain a lower makespan. This phenomenon is captured in Fig. 4.5. The upper half of this figure is an undirected connected graph of six edge servers, abstracted from the physical infrastructure of the heterogeneous edge. The numbers tagged in each vertex and beside each edge of the graph are the processing power (measured in gflop/s) and guaranteed bandwidth (measured in GB/s), respectively. The bottom half is an SFC with three functions. The number tagged inside each function is the required processing power (measured in gflops). The number tagged beside each data stream is the size of it (measured in GB). Figure 4.5 shows two solutions of the function placement. The numbers tagged beside the vertices and edges of each solution are the time consumed. In terms of function placement, solution 1 leads to lower function processing time (2.5 vs. 4), whereas the makespan of solution 2 is much better (7.5 vs. 9) simply because it has a better traffic routing policy. The above example implies that different traffic routing policies could affect the makespan of an application significantly. This motivates us to take traffic routing into account proactively. In this chapter, we name the combination of function placement and stream mapping as “function embedding.” Moreover, if stream splitting is allowed (i.e., the internal output of a function can be split and routed on multiple paths), the makespan decreases further. This phenomenon is captured in Fig. 4.6. The structure of this figure is the same as Fig. 4.5, except that it demonstrates two function embedding solutions where stream splitting is allowed.

4.4 AI/ML Optimizes Complex Jobs

145

Fig. 4.5 Two function placement solutions for an SFC with different traffic routing policies. ©2022 IEEE, reprinted, with permission, from S. Deng et al. (2022)

Fig. 4.6 Two function placement solutions with stream splitting allowed or not, respectively. ©2022 IEEE, reprinted, with permission, from S. Deng et al. (2022)

In solution 2, the output stream of the first function is divided into two parts, each with two or three units. Correspondingly, the times consumed on routing are 3 and .2.5, respectively. Although the two solutions have the same function placement, the makespan of solution 2 is calculated as .1 + max{3, 2.5} + 1 + 1.5 + 1 = 7.5. This is much less than solution 1 with the makespan of 12.

146

4 AI/ML for Computation Offloading

4.4.1 System Model and Problem Formulation 4.4.1.1

A Working Example

The edge network is organized as a weighted directed graph [58, 59], where the vertices are edge servers with heterogeneous processing power and the edges are virtual links with certain propagation speed. The edge network is managed with a lightweight Kubernetes platform (e.g., the KubeEdge [60]). Let us deploy an application of surveillance video processing. The procedure is captured in Fig. 4.7. An edge device (a surveillance camera) uploads the raw video and pre-prepared configurations to a nearby edge server periodically. With raw video input, the functions are pulled from remote Docker registries and triggered immediately. For video processing, the ffmpeg library (https://www. ffmpeg.org/) could be used to produce the corresponding Docker images. When done, the processed results are saved into a PersistentVolume (PV). In the above procedure, the camera only needs to upload the raw video data. The functions will

Fig. 4.7 The architecture of surveillance video processing by leveraging the elastic edge. ©2022 IEEE, reprinted, with permission, from S. Deng et al. (2022)

4.4 AI/ML Optimizes Complex Jobs

147

be triggered automatically. The intermediate data produced by each function is saved into a local volume located at some host path. To make the most of the quick response of the serverless edge, we need to study where each function is processed and how the flow traffic is mapped, to minimize the completion time as much as possible.

4.4.1.2

Problem Formulation

Let us formulate the heterogeneous edge network as an undirected connected graph G  (N, L), where .N  {n1 , ..., nN } is the set of edge servers and .L  {lij }ni ,nj ∈N is the set of virtual links. A virtual link .lij = (ni , nj ) is an augmented link between edge servers .ni and .nj . In .G, each edge server .n ∈ N has a processing power .ψn , max , measured in tflop/s, while each virtual link .lij ∈ L has a maximum bandwidth .bij max max measured in GB/s. We assume .bij = bj i . When .ni = nj , we simply set the data transferring time as 0, since the intra-server processing is usually negligible. The computation of functions, which is non-preemptive, can be overlapped with communication. Let us use .P(ni , nj ) to denote the set of simple paths (i.e., paths with no loops) from source .ni to target .nj . For an arbitrary given edge network, all simple paths .{P(ni , nj )}∀ni ,nj ∈N can be obtained through depth-first search (DFS). The algorithm designed in this chapter needs to know the simple paths between any two edge servers as a priori. In the next section, we give a simple algorithm to calculate them based on DFS. Each IoT application with dependent functions is modelled as a DAG. Let us use .(F, E) to represent the DAG, where .F  {f1 , ..., fF } is the set of F -dependent functions listed in a topological order. A topological order of a DAG is a linear ordering of all the vertices such that for every directed edge .(fi , fj ) in this DAG, .fi comes before .fj in this ordering. .∀fi , fj ∈ F, i = j , if the output stream of .fi is the input of its downstream function .fj , a directed edge .eij exists. .E  {eij |∀fi , fj ∈ F} is the set of all directed edges. For each function .fi ∈ F, we write .ci for the required number of floating point operations of it. For each directed link .eij ∈ E, the data stream size is denoted as .sij (measured in GB). To model the entry function that has no predecessors and the exit function that has no successors, we write .Fentry for the set of entry functions and .Fexit ⊂ F for the set of exit functions of the DAG. We make no restrictions on the shape of DAGs, that is, they could be single-entry-single-exit or multi-entry-multi-exit. The dependent function embedding problem is decomposed into two subproblems, where each function to be dispatched to and how each data stream is mapped onto virtual links. We write .p(f ) ∈ N for the chosen edge server which .f ∈ F to be dispatched to. For any function pair .(fi , fj ) and the associated edge .eij ∈ E, the data stream of size .sij can be split and routed through different paths     in .P p(fi ), p(fj ) . .∀ ∈ P p(fi ), p(fj ) ; we use .z to represent the size of the non-negative data stream allocated for the path .. .

148

4 AI/ML for Computation Offloading

Fig. 4.8 An example of data stream splitting. ©2022 IEEE, reprinted, with permission, from S. Deng et al. (2022)

Then, .∀eij ∈ E, we have the following constraint:  .



∈P p(fi ),p(fj )



z = sij .

Note that if .p(fi ) = p(fj ) (i.e., .fi and .fj are dispatched to the same edge server), then .P(eij ) = ∅ and the data transferring time is zero. Figure 4.8 gives an example for data splitting. The connected graph has four edge servers and five virtual links. There are four simple paths between .n1 and .n4 . The two squares represent the source function .fi and the destination function .fj . From the edge server .p(fi ) to the edge server .p(fj ), .eij routes through three (out of four) simple paths with data size of 3 GB, 2 GB, and 1 GB, respectively. In this example, .sij = 6. On a closer observation, we can find that two data streams route through .l1 . Each of them is from path .1 and .2 with 3 GB and 2 GB, respectively. Theoretically, overhead exists in the splitting and merging operations of the data stream .eij at the source .p(fi ) and the target .p(fj ), respectively. However, in our video transcoding scenarios, this overhead is negligible compared with extensive computation of functions  or transferring of data streams in gigabytes. Let us use .T p(fi ) to denote the finish time of .fi if it is scheduled to the edge server .p(fi ). Considering that the functions   of the DAG have dependent relations, for each function .fj ∈ F\Fentry , .T p(fj ) should involve according to    &    T p(fi ) + t (eij ) .T p(fj ) = max ιp(fj ) , max   +t p(fj ) .

∀i:eij ∈E

(4.63)

In (4.63), .ιp(fj ) is the earliest idle time of the edge server .p(fj ). In our model, the processing of functions is non-preemptive; thus, an edge server is idle if and only if the functions assigned to it are complete. .t (eij ) is the transfer time of the

4.4 AI/ML Optimizes Complex Jobs

149

  data stream .eij , and .t p(fj ) is the processing time of .fj on the edge server .p(fj ). According to (4.63), for each entry function .fi ∈ Fentry ,     T p(fi ) = ιp(fj ) + t p(fi )

.

(4.64)

  In the following, we demonstrate the calculation of .t (eij ) and .t p(fj ) . .t (eij ) is constitutive of two parts. The first part is the communication startup cost between the two functions .fi and .fj . This cost is mainly decided by the configurations in kube-proxy in this edge-cloud Kubernetes cluster [61]. For example, if we use Envoy to implement the network proxy and communication bus for the cluster, data transfer is mainly handled by Envoy route filters. Before data transfer, the Envoy proxy needs to do some preparatory works (e.g., looking for the route tables to get the actual routing paths). We simply use .σ (fi , fj ) to represent the communication startup cost. The second part is the actual communication cost between .p(fi ) and .p(fj ); this is  determined by the slowest data transfer time of .z among all . ∈ P p(fi ), p(fj ) . Considering that the edge network serves thousands of services and applications, we assume that for each data stream .z , the bandwidth allocated to it on each virtual   max holds. link .lmn ∈  is fixed as .bmn and .bmn  bmn To sum up, .t (eij ) is defined as t (eij )  σ (fi , fj ) +

.

 max

∈P p(fi ),p(fj )



 z  , b mn l ∈

(4.65)

mn

Note that the above formulation ignores the data splitting and merging costs, as we have mentioned earlier. In real-world scenarios, the real-time bandwidth for transferring data stream at some point is unknowable because the network status is always in dynamic changes. Even so, our assumption on the fixed bandwidth allocation is reasonable since it can be guaranteed by the QoS level defined in the generic network slice template (GST) [62]. Besides, our model does not require that each function pair  of the  DAG receives the same bandwidth on each virtual link. .∀fj ∈ F, .t p(fj ) is decided by the processing power of the chosen edge server .p(fj ). Since the processing of functions   is non-preemptive, each function is executed with full power. Therefore, .t p(fj ) is defined as   t p(fj ) 

.

cj ψp(fj )

.

(4.66)

After all functions are scheduled, the makespan will be the finish time of the slowest exit function (i.e., the function without successors). Our target is to minimize the makespan of the DAG by finding the optimal .p  {p(f )}∀f ∈F and

150

4 AI/ML for Computation Offloading

  the optimal .z  {z |∀ ∈ P p(fi ), p(fj ) }eij ∈E . Thus, the dependent function embedding problem is formulated as follows. P:

.

s.t.



  minp,z maxf ∈Fexit T p(f )   z = sij , ∀eij ∈ E, .

∈P p(fi ),p(fj )

z ≥ 0.

(4.67) (4.68)

4.4.2 Algorithm Design In this section, we firstly give the optimal substructure hidden in .P. Then, we propose the DPE algorithm and provide theoretical analysis on its optimality and complexity. In the end, we give a method to obtain simple paths for the given edge  network based  on DFS. To simplify  the notations, in the following, we replace .P p(fi ), p(fj ) and .σ p(fi ), p(fj ) by .Pij and .σij , respectively.

4.4.2.1

Finding Optimal Substructure

In consideration of the dependency relationship between the before and the after functions, the optimal placement of functions and optimal mapping of data streams cannot be obtained at the same time. Nevertheless, we can solve it optimally stepby-step based on its optimal substructure.  Let us use .T  p(f ) to denote the earliest finish time of function f when it is placed on edge server .p(f ). Based on (4.63), .∀fj ∈ F\Fentry , we have   p(fj ) = max .T

+





max

∀i:eij ∈E

min

p(fi ),{z }∀∈Pij

&   ,   T  p(fi ) + t (eij ) , ιp(fj ) + t p(fj )

(4.69)

  Besides, for all the entry functions .fi ∈ Fentry , .T  p(fi ) is calculated by (4.64) immediately. With (4.69), for each function pair .(fi , fj ) where .eij exists, we can define the sub-problem .Psub : Psub :

.

min

p(fi ),{z }∀∈Pij

s.t.



    ij  T  p(fi ) + t (eij )

z = sij ,

.

(4.70)

∈Pij

z ≥ 0, ∀ ∈ Pij .

(4.71)

4.4 AI/ML Optimizes Complex Jobs

151

Note that (4.70) and (4.71) are from a subset of constraints (4.67) and (4.68), respectively. In .Psub , we need to decide where .fi is placed and how .eij is mapped.  Through solving .Psub , we can obtain the earliest completion time .T  p(fj ) by (4.69). In this way, .P is solved optimally by calculating the earliest finish time of each function in topological order.

4.4.2.2

Optimal Data Splitting

To solve .Psub optimally, we first fix the position of .fi (i.e., .p(fi )) and then concentrate on the optimal mapping of .eij . To minimize .t (eij ), we define a diagonal matrix .A as follows. A  diag

 

.

1 

lmn ∈1

1 bmn



,

1 

lmn ∈2

2 bmn



1

lmn ∈|Pij |

bmnij

, ...,

|P

 |

.

As can be seen, all the diagonal elements of .A are positive real numbers. The variables that need to be determined can be written as .zij  [z1 , z2 , ..., z|Pij | ] ∈

R|Pij | . Thus, .Psub can be converted to

Pnorm : min Azij ∞

.

zij

 s.t.

1 zij = sij , zij ≥ 0.

(4.72)

  We drop .T  p(fi ) and .σij , as the constant does not change the optimal solution  .z . .Pnorm is an infinity norm minimization problem. By introducing slack variables ij |P | .τ ∈ R and .y ∈ R ij , .Pnorm can be transformed into the following form: P slack :

.

s.t.

min

  z ij [z ij ,y ]

τ

⎧ ⎪ ⎨ ∈Pij z = sij , Azij + y = τ · 1, ⎪ ⎩ z ij ≥ 0.

P slack is feasible and its optimal objective value is finite. As a result, simplex method and interior point method can be applied to obtain the optimal solution efficiently. However, these standard methods might not be scalable, especially for large graphs (.G); this is because the simplex method has exponential complexity and the interior point method is at least .O(|z ij |3.5 ) in the worst case [63]. Fortunately, .

152

4 AI/ML for Computation Offloading

we can directly obtain the analytical expression of the optimal .zij . The result is introduced in the following theorem. Theorem 4.5 The optimal objective value of .Pnorm is sij min Azij ∞ = |P | , ij zij k=1 1/Ak,k

(4.73)

(v) Au,u z(u) ij = Av,v zij , 1 ≤ u = v ≤ |Pij |,

(4.74)

.

if and only if .

(u)

where .zij is the uth component of vector .zij and .Au,u is the uth diagonal element of .A. Based on (4.74), we can infer that the optimal variable .zij > 0 holds; this means that .∀ ∈ Pij , .z = 0. Algorithm 4 summarizes the way to obtain the optimal data splitting and mapping. Algorithm 4 Optimal Stream Mapping (OSM) 1: for each .m ∈ N in parallel do 2: .p(fi ) ← m   sij (m)  p(f ) 3: .ij ←  i (m) + T k

1/Ak,k

4: end for (m) 5: .p  (fi ) ← argminm∈N ij 6: Calculate .zij by (4.72) and (4.74) with .A = A(p

 (f )) i

In this algorithm, lines 1–5 solve .Pnorm through solving .|N| times of .Psub in parallel, each with a different .p(fi ). The procedure can be executed in parallel because there is no intercoupling. In line 4, the objective of .Psub is obtained with the analytical solution (4.73) directly. The most time-consuming operation lies in lines 4 and 6, as they have at least one traversal over edge servers and simple paths,  respectively. The OSM algorithm has the complexity of .O max{|N|, |Pij |} .

4.4.2.3

Dynamic Programming-Based Embedding

Combining OSM with dynamic programming, we designed the DPE (dynamic programming-based embedding) algorithm and presented it in Algorithm 5. In DPE, the loop starts from non-entry functions with a topological order. For each non-entry function .fj ∈ N\Nentry , DPE fixes its placement .p(fj ) at an edge server n in (i) line 3. Then, from lines 4 to 12, DPE solves the sub-problem .Psub by calling the  OSM algorithm for the function pairs .(fi , fj ), eij ∈ E. If .p (fi ) has been decided

4.5 Summary

153

beforehand (in the case where a function is a predecessor of multiple functions), DPE will skip .fi and go to process the next predecessor .fi (lines 5–7). At the end, (i) DPE updates the finish time of .fj based on the solution of .{Psub }∀eij ∈E in line 13. Note that the finish time of each entry function .fi should be calculated with (i) (4.64) before solving .Psub . When all finish times of functions have been calculated, the global minimal makespan of the DAG can be obtained by .

  max argmin T  p (f ) .

f ∈Fexit

p

The optimal embedding of each function can be calculated using .z and .p . Algorithm 5 DP-Based Embedding (DPE) 1: for .j = |Fentry | + 1 to .|F| do 2: for each .n ∈ N do 3: .p(fj ) ← n 4: for each .fi ∈ {fi |eij exists} do 5: if .p  (fi ) has been decided then 6: continue 7: end if 8: if .fi ∈ Fentry then    9: .∀p(fi ) ∈ N, update .T p(fi ) by (4.64) 10: end if 11: Obtain the optimal .ij , .p  (fi ), and .zij by calling OSM 12: end for   13: Update .T  p(fj ) by (4.69) 14: end for 15: end for

4.5 Summary In this chapter, we introduce AI/ML approaches used for computation offloading in edge computing systems, especially long-term optimization and Markov decision optimization, solving binary offloading, partial offloading, and complex jobs’ offloading problems. We systematically review the division of computation offloading problems and discuss the AI methods that can be used to optimize end user QoE. We also discussed the binary and partial computation offloading problems, as well as how the computation offloading of complex jobs can be helped with ML approaches.

154

4 AI/ML for Computation Offloading

References 1. S. Kosta, A. Aucinas, P. Hui, R. Mortier, and X. Thinkair Zhang. Dynamic resource allocation and parallel execution in the cloud for mobile code offloading. 2012 Proceedings IEEE Infocom, pages 945–953, 2012. 2. R. Kemp, N. Palmer, T. Kielmann, and H. Cuckoo Bal. a computation offloading framework for smartphones. International Conference On Mobile Computing, Applications, And Services, pages 59–79, 2010. 3. J. Oueis, E. Strinati, and S. The fog balancing Barbarossa. Load distribution for small cell cloud computing. 2015 IEEE 81st Vehicular Technology Conference (VTC Spring), pages 1–6, 2015. 4. G. Orsini, D. Bade, and W. Lamersdorf. Computing at the mobile edge: Designing elastic android applications for computation offloading. 2015 8th IFIP Wireless And Mobile Networking Conference (WMNC), pages 112–119, 2015. 5. Y. Xiong, Y. Sun, L. Xing, and Y. Huang. Extend cloud to edge with kubeedge. 2018 IEEE/ACM Symposium On Edge Computing (SEC), pages 373–377, 2018. 6. G. Carvalho, B. Cabral, V. Pereira, and J. Bernardino. Computation offloading in edge computing environments using artificial intelligence techniques. Engineering Applications Of Artificial Intelligence, 95, 2020. 7. C. Meurisch, J. Gedeon, T. Nguyen, F. Kaup, and M. Muhlhauser. Decision support for computational offloading by probing unknown services. 2017 26th International Conference On Computer Communication And Networks (ICCCN), pages 1–9, 2017. 8. B. Yang, X. Cao, J. Bassey, X. Li, and L. Qian. Computation offloading in multi-access edge computing: A multi-task learning approach. IEEE Transactions On Mobile Computing, 20:2745–2762, 2020. 9. K. Lin, S. Pankaj, and D. Wang. Task offloading and resource allocation for edge-of-things computing on smart healthcare systems. Computers & Electrical Engineering, 72:348–360, 2018. 10. A. Crutcher, C. Koch, K. Coleman, J. Patman, F. Esposito, and P. Calyam. Hyperprofile-based computation offloading for mobile edge networks. 2017 IEEE 14th International Conference On Mobile Ad Hoc And Sensor Systems (MASS), pages 525–529, 2017. 11. S. Yu, X. Wang, and R. Langar. Computation offloading for mobile edge computing: A deep learning approach. 2017 IEEE 28th Annual International Symposium On Personal, Indoor, And Mobile Radio Communications (PIMRC), pages 1–6, 2017. 12. Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Neurosurgeon Tang. Collaborative intelligence between the cloud and mobile edge. ACM SIGARCH Computer Architecture News, 45:615–629, 2017. 13. A. Alelaiwi. An efficient method of computation offloading in an edge cloud platform. Journal Of Parallel And Distributed Computing, 127:58–64, 2019. 14. X. Sui, D. Liu, L. Li, H. Wang, and H. Yang. Virtual machine scheduling strategy based on machine learning algorithms for load balancing. EURASIP Journal On Wireless Communications And Networking, 2019:1–16, 2019. 15. H. Mao, M. Schwarzkopf, S. Venkatakrishnan, Z. Meng, and M. Alizadeh. Learning scheduling algorithms for data processing clusters. Proceedings Of The ACM Special Interest Group On Data Communication, pages 270–288, 2019. 16. Z. Tong, X. Deng, H. Chen, J. Mei, and H. Ql-heft Liu. a novel machine learning scheduling scheme base on cloud computing environment. Neural Computing And Applications, 32:5553– 5570, 2020. 17. X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. In-edge ai Chen. Intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Network, 33:156– 165, 2019. 18. R. Hu and Others Mobility-aware. 
edge caching and computing in vehicle networks: A deep reinforcement learning. IEEE Transactions On Vehicular Technology, 67:10190–10203, 2018.

References

155

19. X. Qiu, L. Liu, W. Chen, Z. Hong, and Z. Zheng. Online deep reinforcement learning for computation offloading in blockchain-empowered mobile edge computing. IEEE Transactions On Vehicular Technology, 68:8050–8062, 2019. 20. Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief. A survey on mobile edge computing: The communication perspective. IEEE Communications Surveys Tutorials, 19(4):2322–2358, Fourthquarter 2017. 21. L. Yang, J. Cao, H. Cheng, and Y. Ji. Multi-user computation partitioning for latency sensitive mobile cloud applications. IEEE Transactions on Computers, 64(8):2253–2266, Aug 2015. 22. S. Deng, Z. Xiang, J. Yin, J. Taheri, and A. Y. Zomaya. Composition-driven iot service provisioning in distributed edges. IEEE Access, 6:54258–54269, 2018. 23. S. Deng, H. Wu, W. Tan, Z. Xiang, and Z. Wu. Mobile service selection for composition: An energy consumption perspective. IEEE Transactions on Automation Science and Engineering, 14(3):1478–1490, July 2017. 24. T. Kim, W. Qiao, and L. Qu. A series-connected self-reconfigurable multicell battery capable of safe and effective charging/discharging and balancing operations. In 2012 27th Annual IEEE Applied Power Electronics Conference and Exposition (APEC), pages 2259–2264, Feb 2012. 25. S. Ulukus, A. Yener, E. Erkip, O. Simeone, M. Zorzi, P. Grover, and K. Huang. Energy harvesting wireless communications: A review of recent advances. IEEE Journal on Selected Areas in Communications, 33(3):360–381, March 2015. 26. S. Sudevalayam and P. Kulkarni. Energy harvesting sensor nodes: Survey and implications. IEEE Communications Surveys Tutorials, 13(3):443–461, Third 2011. 27. Cambridge Broadband Networks. Backhauling x2. http://cbnl.com/resources/backhauling-x2/. Accessed Feb 2, 2018. 28. M. Magno and D. Boyle. Wearable energy harvesting: From body to battery. In 2017 12th International Conference on Design Technology of Integrated Systems In Nanoscale Era (DTIS), pages 1–6, April 2017. 29. Lin Jia, Zhi Zhou, and Hai Jin. Optimizing the performance-cost tradeoff in cross-edge analytics. In 2018 IEEE International Conference on Ubiquitous Intelligence Computing, pages 564–571, October 2018. 30. Michael J. Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):211, 2010. 31. Yang Yu, Hong Qian, and Yi-Qi Hu. Derivative-free optimization via classification. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI’16, pages 2286– 2292. AAAI Press, 2016. 32. H. Qian and Y. Yu. On sampling-and-classification optimization in discrete domains. In 2016 IEEE Congress on Evolutionary Computation (CEC), pages 4374–4381, July 2016. 33. J. Yan, S. Bi, Y. J. Zhang, and M. Tao, Optimal task offloading and resource allocation in mobile-edge computing with inter-user task dependency. IEEE Transactions on Wireless Communications, 19(1):235–250, 2020. 34. W. Zhang, Y. Wen, K. Guan, D. Kilper, H. Luo, and D. O. Wu. Energy-optimal mobile cloud computing under stochastic wireless channel. IEEE Transactions on Wireless Communications, 12(9):4569–4581, 2013. 35. C. Lemarechal, S. Boyd, and L. Vandenberghe. Convex optimization. European Journal of Operational Research, 170(1):326–327, 2006 (Cambridge Uni. Press). 36. H. Zhang, An optimized video-on-demand system: Theory, design and implementation, Ph.D. dissertation, Electr. Eng. Comput. Sci., Univ. California, Berkeley, Berkeley, CA, 2012. [Online]. 
Available: http://www.escholarship.org/uc/item/74k0723z 37. M. Chen, S. C. Liew, Z. Shao, and C. Kai. Markov approximation for combinatorial network optimization. IEEE Transactions on Information Theory, 59(10):6301–6327, 2013. 38. R. B. Lund, Markov processes for stochastic modelling. Journal of the American Statistical Association, 93(442):842–843, 1998. 39. P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of Markov chains. Annals of Applied Probability, 1:36–61, 1991.

156

4 AI/ML for Computation Offloading

40. R. Bubley and M. Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. In Proceedings, 38th Annual Symposium on Foundations of Computer Science, pages 223– 231, 1997. 41. Paul Castro, Vatche Ishakian, Vinod Muthusamy, and Aleksander Slominski. The rise of serverless computing. Commun. ACM, 62(12):44–54, November 2019. 42. Mohammad S. Aslanpour, Adel N. Toosi, Claudio Cicconetti, Bahman Javadi, Peter Sbarski, Davide Taibi, Marcos Assuncao, Sukhpal Singh Gill, Raj Gaire, and Schahram Dustdar. Serverless edge computing: Vision and challenges. In 2021 Australasian Computer Science Week Multiconference, ACSW ’21, New York, NY, USA, 2021. Association for Computing Machinery. 43. Docker: Accelerate how you build, share and run modern applications. https://www.docker. com/. 44. Kubernetes: Production-grade container orchestration. https://kubernetes.io/. 45. Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Jayant Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Patterson. Cloud programming simplified: A berkeley view on serverless computing. CoRR, abs/1902.03383, 2019. 46. E. van Eyk, L. Toader, S. Talluri, L. Versluis, A. U¸ta˘ , and A. Iosup. Serverless is more: From paas to present cloud computing. IEEE Internet Computing, 22(5):8–17, 2018. 47. Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief. A survey on mobile edge computing: The communication perspective. IEEE Communications Surveys Tutorials, 19(4):2322–2358, 2017. 48. P. Porambage, J. Okwuibe, M. Liyanage, M. Ylianttila, and T. Taleb. Survey on multi-access edge computing for internet of things realization. IEEE Communications Surveys Tutorials, 20(4):2961–2991, 2018. 49. S. Deng, H. Zhao, W. Fang, J. Yin, S. Dustdar, and A. Y. Zomaya. Edge intelligence: The confluence of edge computing and artificial intelligence. IEEE Internet of Things Journal, 7(8):7457–7469, 2020. 50. 5G PPP Architecture Working Group. View on 5g architecture: Version 3.0. https://doi.org/10. 5281/zenodo.3265031, Feb 2020. 51. Bahman Javadi, Jingtao Sun, and Rajiv Ranjan. Serverless architecture for edge computing, 2020. 52. P. Aditya, I. E. Akkus, A. Beck, R. Chen, V. Hilt, I. Rimac, K. Satzke, and M. Stein. Will serverless computing revolutionize nfv? Proceedings of the IEEE, 107(4):667–678, 2019. 53. Luciano Baresi, Danilo Filgueira Mendonça, and Martin Garriga. Empowering low-latency applications through a serverless edge computing architecture. In Flavio De Paoli, Stefan Schulte, and Einar Broch Johnsen, editors, Service-Oriented and Cloud Computing, pages 196– 210, Cham, 2017. Springer International Publishing. 54. Liuyan Liu, Haisheng Tan, Shaofeng H.-C. Jiang, Zhenhua Han, Xiang-Yang Li, and Hong Huang. Dependent task placement and scheduling with function configuration in edge computing. In Proceedings of the International Symposium on Quality of Service, IWQoS ’19, New York, NY, USA, 2019. 55. Shweta Khare, Hongyang Sun, Julien Gascon-Samson, Kaiwen Zhang, Aniruddha Gokhale, Yogesh Barve, Anirban Bhattacharjee, and Xenofon Koutsoukos. Linearize, predict and place: Minimizing the makespan for edge-based stream processing of directed acyclic graphs. In Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, SEC ’19, page 1–14, New York, NY, USA, 2019. 56. Z. Zhou, Q. Wu, and X. Chen. Online orchestration of cross-edge service function chaining for cost-efficient edge computing. 
IEEE Journal on Selected Areas in Communications, 37(8):1866–1880, 2019. 57. Istio: Connect, secure, control, and observe services. https://istio.io/latest/. 58. X. Foukas, G. Patounas, A. Elmokashfi, and M. K. Marina. Network slicing in 5g: Survey and challenges. IEEE Communications Magazine, 55(5):94–100, 2017.

References

157

59. S. Vassilaras, L. Gkatzikis, N. Liakopoulos, I. N. Stiakogiannakis, M. Qi, L. Shi, L. Liu, M. Debbah, and G. S. Paschos. The algorithmic aspects of network slicing. IEEE Communications Magazine, 55(8):112–119, 2017. 60. Kubeedge: An open platform to enable edge computing. https://kubeedge.io/en/. 61. Envoy: An open source edge and service proxy, designed for cloud-native applications. https:// www.envoyproxy.io/. 62. GSM Association. Official document ng.116 - generic network slice template v4.0. https:// www.gsma.com/newsroom/wp-content/uploads//NG.116-v4.0-2.pdf, Nov 2020. 63. John Fearnley and Rahul Savani. The complexity of the simplex method. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 201–208, 2015.

Chapter 5

AI/ML Data Pipelines for Edge-Cloud Architectures

Abstract This chapter elaborated on hybrid cloud-/edge-tier computing solutions that enable unprecedented development in industrial practice to achieve more clear added values into business scenarios. We will motivate how high-speed interregional networks and Internet of Things (IoT) devices enabled data processing in the edge-tier network as an effective solution for real-time processing of raw data produced by IoT devices. We will also elaborate on how companies can effectively derive significant insights into their business by analyzing big data in a nearreal-time manner and have a combined view of both structured and unstructured customer data. We discuss how modern hardware architectures provide foundations to seamlessly perform real-time computation in the edge devices. Finally, we will present frameworks to achieve these goals and focus on the availability and consistency of services and resources in both edge and cloud tiers, as they are essential to the success of such paradigm in distributed IoT-edge-cloud platforms.

5.1 Introduction After a great technological evolutionary step on high-speed Internet, mobiles, and the Internet of Things (IoT) technological innovations, now we are witnessing that big data era and its associated technologies are stepping up to be the greatest ever evolutionary step. Such results in fully new scaling and multiplication effects for companies and economies to the design and optimization of the digital and analytical value-added data pipelines toward achieving sustainable competitive enterprise-level advantages in the digital ecosystems. Such evolution toward the data-driven and analytics-driven design of business processes and models, in the This chapter reuses literal text and materials from • M. Reza HoseinyFarahabady et al., “Q-Flink: A QoS-Aware Controller for Apache Flink,” 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), 2020, https://doi.org/10.1109/CCGrid49817.2020.00-30. ©2020 IEEE, reprinted with permission.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Taheri et al., Edge Intelligence, https://doi.org/10.1007/978-3-031-22155-2_5

159

160

5 AI/ML Data Pipelines for Edge-Cloud Architectures

enterprise levels, directly correlates with the development of the edge-tier and cloud-tier computational capacity. Thanks to the development of the IoT devices, almost every equipment is able to progressively and proactively communicate with other devices. As computers become more complex, the collection, preparation, and analysis of large amounts of raw data demand for more effective computing resources and automated processes, so that experts can spend more time on the interpretation and implementation of the analytical results (such as the process of customer acquisition and the observation of competition). As cloud platforms are becoming more complex, many experts suggested to push most of such complexities to the edge devices. Such complications pose complex challenges for developing an effective resource management of computing devices in the edge-cloud platform. Many past research studies conducted across different cloud platforms have clearly demonstrated the difficulty of developing a high available solution under heavy traffics. One can immediately imagine that such difficulty can be extended to the more complex extended versions across edge-cloud platforms in a similar manner. Perhaps such difficulty can be considered as the main reason why service disruptions of major cloud service providers in recent years occurred without warning, causing massive damages to core services and applications that are highly dependent on such core services. However, understanding of different kinds of failures and their impacts on other services in the extended edge-cloud platform is considered as one of the main areas to be investigated by researchers in the coming years. At present, there are a great many informatics-related explanations by experts in the cloud-edge computations. In equal measure, there are a wide number of popular publications and discussions by scholars in the academic and industrial research centers. To our understanding, more exploration and investigation on performing highly complex data pipeline within the edge-cloud architecture design space are needed in order to detect failures and improve the efficiency of existing solutions. Literature on the subject of edge-tier data processing and industry-oriented IoT is frequently very technical and informatics-focused. In this chapter, we highlight such areas in more details and illustrate potentials by numerous best practices in this domain.

5.2 State-of-the-Art Stream Processing Solutions for Edge-Cloud Architectures Authors of [1] proposed a stream processing framework that improves the overall throughput and minimizes the end-to-end latency in a hybrid edge-cloud environment. The authors showed that there exist three possible deficiencies in a stream processing network that may result in high overall latency: (1) lack of data locality awareness when the bandwidth is the bottleneck, (2) lack of load awareness when computational resources are the bottleneck, and (3) not considering dependencies when synchronization between the data flows in the network is required. Based

5.2 State-of-the-Art Stream Processing Solutions for Edge-Cloud Architectures

161

on such constraints, the authors proposed a new design that consists of three main phases. First, the authors proposed locating the source operator on the node where the input stream data of the network is generated. Similarly, the sink operator is located on the node where the sink output service is running. In this setting, when there are multiple stream sources to be fetched, the data locality optimization technique splits a single logic source operator into multiple physical source operators, each to be mapped to a distinct node so that it can be co-located with its corresponding stream source. Furthermore, to decrease the amount of data transferred among nodes, they suggested two approaches: (1) push selective operators whose output rate is less than their input rate (e.g., filter operators), toward their upstream operators, and (2) push operators whose output outpaces their input (e.g., join operators) to be co-located with their downstream operators. Second, authors proposed solutions to employ a load optimization technique for avoiding back pressures resulting from insufficient computing power. Using such a technique, the computational load placed on each node is determined based on the resource capacity of that node. If there exist overloaded nodes in a network, the proposed solution modifies the operator placement plan by offloading the operators whose displacement has the lowest effect on the network utilization to their neighboring nodes with lower resource utilization. Third, the authors proposed a rate-controlling plan to avoid back pressures caused by dependencies between the data flows. Such a technique can result in smaller co-flows to gain priority over larger co-flows to be completed first. This guarantees the bandwidth requirements of the network are satisfied, thereby enhancing the network usage efficiency. Authors of [2] proposed a software engineering approach, and its implementation as a C++ library, to build a new stream processing system (SPS), called WindFlow, for multicore systems. Their design unifies two primary stream processing approaches to scale up systems: (1) continuous streaming model to process inputs immediately after they are received and (2) discretized streaming model to buffer inputs and process them in batches. They show that their design has lower latency as compared with existing SPSs designed for continuous streaming, but produces at least the same throughout as existing SPSs designed for discretized streaming. Authors also introduce a set of customized building blocks and a set of formal transition rules to show how various streaming applications can be modelled through connecting these building blocks according to the transition rules. WindFlow defines sequential building blocks (SBBs) that are basic operators in an SPS and execute a combination of some sequential codes. Subsequently, parallel building blocks (PBBs) are presented, where each PBB is a pipeline of SBBs running in parallel. Authors then presented Matryoshka to model a set of pipelines with direct or shuffle connections. Matryoshkas are linked together to form a MultiPipe, whose instances can be related to one another by split or merge operators. Finally, PipeGraph is presented as an application environment to maintain a list of MultiPipes and their relationship. The implementation of WindFlow is based on thread-based parallelism, which avoids scheduling overhead by dedicating a thread to each node. 
However, since this approach may lead to over-subscription or reduced performance, WindFlow also employs the chaining technique (i.e., placing replicas

162

5 AI/ML Data Pipelines for Edge-Cloud Architectures

of various operators in a single thread). In contrast to traditional SPSs where the OS schedules the threads, threads in WindFlow are mapped to cores based on their topology. In particular, threads related to communicating operators are mapped to sibling cores that share some levels of cache; this is to ensure that the data forwarding latency and data access overhead are both reduced. Authors of [3] proposed a fog computing supported spatial big data clustering (FCS-SBC) solution for fog computing environments in disaster situations. The design is a grid-based spatial big data processing scheme to reduce the overall latency of making decisions in a disaster situation in comparison with when all data is collected and analyzed in a centralized cloud. Their proposed design maintains the data resolution, the amount of information, and the detail that the data keeps, at an acceptable level, thereby improving the precision and quality of the final spatial clustering. Authors propose implementing two layers of data processing on each fog node: (1) content analysis on the data processing layer I and (2) spatial analysis on the data processing layer II. In this setting, there are three types of data on each fog node: (1) raw data, which is unprocessed data; (2) point data, which is the data only processed in data processing layer I; and (3) page data, which is the data processed in both data processing layers I and II. Following the content analysis, several point data are processed to find local clusters through spatial analysis in the data processing layer II. Such operations are then abstracted as a page data. Each fog node first divides its covered sub-region, as well as its own and its children’s covered regions, into .m × n cells in a grid structure. It then aggregates spatial data collected from its own and its children’s covered regions. After that, it implements spatial clustering to discover hot-spot areas through locating the mesh of cells that contains the highest density of data. Finally, after finding local clusters, all target point data in each cell of a local cluster are replaced with a representative point, which usually represents the weighted average of all covered data points. The authors also examined how to optimize data processing ratios in fog nodes to effectively utilize all computational resources available in the fog layer; their aim was to maintain the precision and quality of the final spatial clustering to a high degree. For such a purpose, they initially formalized the problem of maximizing the data resolution efficiency by taking into account both the data resolution and overall delay and then provided a solution to this problem by employing a genetic algorithm that adjusts the data processing ratios for all data processing layers in all movable base stations. The aim was to find the distribution patterns of hot-spot areas (e.g., disaster areas) and analyze the migration patterns of such areas to establish a new communications network when necessary.

5.3 Data Pipeline in Existing Platforms

More recently, we have seen an astonishing progression in the amount of data produced from different data sources. Today, the proliferation, processing, and analysis of data using computer systems has become an integral part of every enterprise. The advent of unstructured interactive data has recently added variety and velocity characteristics to the ever-growing data reservoir and, subsequently, poses significant challenges to enterprises. It has been shown that effectively managing data can deliver immense business value within an enterprise. A decade ago, the acceptance of concepts like the enterprise data warehouse, business intelligence, and analytics helped enterprises transform raw data collections into informative actions, which have been fully reflected in a variety of applications such as customer analytics, financial analytics, risk analytics, product analytics, and healthcare business applications. In the early 2010s, the ubiquity of computing devices dramatically changed the functionality of enterprises by bringing in the concept of digital business. New technological paradigms such as Web 2.0, social media, cloud computing, and Software-as-a-Service applications further contributed to the explosion of data. While data became a core business asset for almost all kinds of enterprises, this shift added several new dimensions to the complexity of data, including the explosion of data volumes that are generated at high velocity and in a greater variety of formats and types. Today, enterprises need to utilize effective tools to lower the cost of managing the ever-increasing volumes of data while building systems that are scalable, highly performant (to meet business requirements), and secure, and that address privacy- and data quality-related concerns. The National Research Council categorizes the computational tasks that can be conducted for massive data analysis into the following seven groups.

1. Basic statistics
2. Generalized N-body problems
3. Linear algebraic computations
4. Graph-theoretic computations
5. Optimization
6. Integration
7. Alignment problems

Such a characterization aims to provide a taxonomy of analyses that have proved useful in data analysis, mostly based on the intrinsic mathematical structure and computational strategy of the problem. Such a mapping can be further grouped into four different analytic types. Descriptive analytics comprises analyzing historical data to describe patterns in the data in a summarized format for easier interpretation, through heavy usage of statistical functions such as counts, maximum, minimum, mean, top-N, and percentage. Diagnostic analytics includes data analytical activities over past data to diagnose the reasons for the occurrence of a certain set of events, for example, by applying linear algebraic and generalized N-body algorithms to provide more insight into why a certain event occurred in the past based on the patterns in the ingested data.

Fig. 5.1 An architectural view of data pipeline solutions in a hybrid cloud-edge platform

Predictive analytics includes data analytical activities for predicting the occurrence of future events using prediction models that are usually trained over existing data. Prescriptive analytics uses multiple prediction models for different inputs to predict various outcomes and suggest the best course of action for each outcome. A data pipeline, in the most general abstract terminology, refers to an architectural blueprint that defines a streaming data processing system that delivers the information a client requests at the moment it is needed. Such a design often consists of the following tiers (Fig. 5.1).

• An ingestion tier to collect data from distributed sources across different physical locations and to ingest it into the rest of the system.
• A message queuing tier to temporarily store and collect data from different locations across a platform.
• An analysis tier where the actual processing takes place and meaningful insights are extracted from the data.
• A long-term storage tier to store the most valuable data in a persistent data store.
• An in-memory data store tier to store recent data for fast processing and delta computations in the near future.
• A data access tier/layer to allow end users to query the data pipeline service.

A complex data processing task in the analysis tier can be further split into a series of distinct tasks to improve the system's performance, scalability, and reliability. Such a schema is often referred to as the "Pipes and Filters" pattern, where pipes are the connections between the processing components (filters). One of the biggest advantages of this pattern is that a complex data processing job can be divided into simpler tasks so that they can be performed/scaled independently. By running multiple workers for each task, processing can be done in parallel. In fact, running multiple copies of each task makes the system more reliable, as the entire pipeline does not fail if some workers in the pipeline encounter runtime errors. The Pipes and Filters pattern is particularly beneficial for real-time analytics/applications that involve processing streams of data in a near-real-time fashion, as sketched in the example below.
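The following framework-free Python sketch illustrates the Pipes and Filters idea; the stage names and sample records are illustrative assumptions, and each generator stage stands in for a filter that could be run and scaled independently in a real deployment.

def ingest(records):
    """Source filter: emit raw records one by one."""
    for record in records:
        yield record

def drop_noise(records):
    """Selective filter: discard malformed or empty readings."""
    for record in records:
        if record.get("value") is not None:
            yield record

def transform(records):
    """Transformation filter: derive a Fahrenheit value (input assumed in Celsius)."""
    for record in records:
        record["value_f"] = record["value"] * 9 / 5 + 32
        yield record

def sink(records):
    """Sink filter: deliver results (printing stands in for a storage tier)."""
    for record in records:
        print(record)

raw = [{"sensor": "s1", "value": 21.5}, {"sensor": "s2", "value": None}]
# Pipes are simply the generator connections between the filters.
sink(transform(drop_noise(ingest(raw))))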


From the abovementioned tiers/layers, the ingestion layer is probably the most critical one. It is an essential data layer in enterprise systems that is responsible for handling the high volume, high velocity, and variety of the input data. It is also supposed to separate noise from the relevant information while having the capability to validate, cleanse, transform, reduce, and integrate the input data into the other components of a big data analytics framework. A multi-source extractor can be seen in scenarios where enterprises that have large collections of unstructured data need to investigate disparate datasets or non-relational databases. Typical industry examples are financial trading, telecommunications, e-commerce, fraud detection, social media, and gaming. The building blocks of the ingestion layer often include components for performing the following tasks.

• The identification layer to collect data from various data formats, which might include unstructured data.
• The filtration layer to filter the inbound information relevant to the enterprise, based on the enterprise data repository.
• The validation layer to continuously analyze the data against the new enterprise metadata.
• The noise reduction layer to cleanse data by removing the noise and minimizing disturbances in the input data.
• The transformation layer to split, converge, de-normalize, and/or summarize data.
• The compression layer to reduce the size of the data in the process.
• The integration layer to integrate the final (transmitted) dataset into storage systems such as file systems or SQL/NoSQL databases.

5.4 Critical Challenges for Data Pipeline Solutions

Several different problems need to be tackled when building a shared platform in a hybrid cloud-edge solution. The following list highlights the most important and challenging ones.

Scalability
Scalability is an essential concern to avoid rewriting the entire process whenever existing demands can no longer be satisfied with the current capacity. Any effective data pipeline solution needs to address the data volume, velocity, variety, and complexity of raw inputs. In this context, velocity reflects the speed with which different types of data enter the enterprise for storage or further analyses. Variety reflects the unstructured nature of the data in contrast to structured data in other streams. While traditional cloud-based business intelligence solutions work on the principle of assembling all the enterprise data in a central server or a data warehouse, the data in a hybrid cloud-edge data pipeline solution is retained in multiple file systems across distributed devices. In fact, it is sometimes highly desirable to move the processing functions toward the location of the data rather than transfer the data to the functions (the data locality principle). The desire to share physical resources in either the cloud or the edge tier brings up other scalability-related issues such as multi-tenancy, isolation, massively parallel processing, and security as well.


Users interacting with a hybrid cloud-edge platform that is designed to serve as a long-running service also need to be able to depend on its reliable and highly available operation. Scalability has its own requirements. The next-generation data pipeline computing platform in a hybrid cloud-edge solution needs to scale horizontally to service tens of thousands of concurrent applications in a distributed manner. To achieve such a requirement, the framework must be able to migrate significant parts of the data pipeline onto new nodes in both cloud and edge tiers without the need for major refactoring.

Serviceability
This challenge in hybrid cloud-edge solutions refers to the ability of the data pipeline computing platform to be completely decoupled from users' applications and upgrade dependencies. Such a feature allows administrative teams, developers, testers, and end users to quickly set up and utilize a hybrid cloud-edge solution to achieve a high-performance computing environment. Such a requirement also states that the computing capacity of devices and nodes in both cloud and edge tiers should be available to be simultaneously allocated to multiple applications, as long as the organizational policies are not violated. Such a feature allows users and developers to list all possible operations and queries on the available capacity of resources for managing multiple concurrent edge tiers. This feature also requires the underlying cloud-edge platform to host multiple coexisting tenants on the same computing devices while enabling fine-grained sharing of individual nodes among different tenants.

Locality awareness
This challenge states that an effective cloud-edge solution needs to support locality awareness by moving computation to the data and placing tasks close to the associated input data. Most existing solutions, such as a traditional Spark engine running on a distributed HDFS file system, allocate computational nodes without accounting for locality. Such an ineffective allocation can yield an unbalanced distribution of jobs; for example, a computational cluster with a very small number of nodes processes many large jobs, while a large fraction of computing nodes processes only a few small jobs. In addition, an effective cloud-edge solution platform should enable high utilization of the underlying physical resources, for example, by packing data processing tasks onto shared computing devices as much as possible. While the move to shared clusters improved utilization and locality compared to a fully isolated strategy, it can bring concerns regarding scalability, serviceability, and availability bottlenecks. In particular, it is important that a decision to increase throughput does not introduce scalability bottlenecks (e.g., caused by the memory management layer or coarse-grained locking mechanisms). Part of this problem arises from the fact that the resource management layer needs to keep the state of computations from all running jobs in the limited memory of devices (which could potentially be unbounded). In addition, any failure in computational nodes can potentially cause a loss of all running jobs within the environment; this requires the administrative team to manually recover the entire workflow from the last checkpoint.


5.5 MapReduce

MapReduce is among the earliest programming models that work across distributed clusters of commodity hardware to process big data inputs [4]. MapReduce was adopted by Google to efficiently execute a set of functions against a large amount of data in batch mode [5]. The MapReduce system takes care of receiving the large-size input data and distributing it across the computer network. It then processes the data in a parallel fashion across a large number of machines. It handles the placement of tasks in a way that distributes the load and manages recovery from failures, as well as combining the output for further aggregation [6]. The processing model introduced by MapReduce consists of two separate steps: map and reduce. The map phase is an embarrassingly parallel model where the input data is split into discrete chunks to be processed independently. In the reduce phase, the output of each map process is aggregated to produce the final result. Programmers have to design Map functions that process a key-value pair to generate a set of intermediate key-value pairs, and then a set of Reduce functions that merge all the intermediate values associated with the same intermediate key [7]. A classic MapReduce system contains different components for job submission and initialization, assignment and execution of submitted tasks across available resources, progress updating, and job completion-related tasks. Such activities are mainly managed by a master node in a distributed setting. In particular, the master daemon is responsible for the execution of jobs; the scheduling of mappers, reducers, combiners, and partition functions; as well as monitoring the successes/failures of individual job tasks while executing a batch job. To optimize the operation of a MapReduce program, it is important to partition, shuffle, and sort the set of emitted intermediate (key, value) pairs after the completion of the map phase and before sending them to the reducer nodes. When the MapReduce paradigm is paired with a distributed file system (such as HDFS), it can provide very high aggregate I/O bandwidth across a large cluster of commodity servers. One of the key features of the MapReduce paradigm is that it tries to move the compute tasks to the data servers hosting the data; this minimizes or eliminates the need to move large data files to the compute servers [8].
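As a concrete illustration of the map, shuffle/sort, and reduce steps, the following self-contained Python sketch runs the classic word-count example in a single process; it mimics the paradigm rather than any particular MapReduce implementation, and the sample documents are illustrative.

from collections import defaultdict

def map_phase(document_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input chunk."""
    for word in text.lower().split():
        yield word, 1

def shuffle_and_sort(intermediate_pairs):
    """Group all intermediate values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: merge all values associated with the same intermediate key."""
    return key, sum(values)

documents = {1: "edge computing moves compute to data",
             2: "data pipelines move data to compute"}

intermediate = [pair for doc_id, text in documents.items()
                for pair in map_phase(doc_id, text)]
result = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate))
print(result)   # e.g., {'compute': 2, 'data': 3, ...}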

5.5.1 Limitations of MapReduce

The intrinsic limitation of the MapReduce paradigm is its limited scalability for real-time streaming processing, especially when data needs to be shuffled over the network. The original design of MapReduce only allows a program to scale up to process very large amounts of data in batch mode; in fact, it constrains a program's ability to process streaming data. The main limitations of the MapReduce paradigm are listed below [9, 10].


• It is hard (if not impossible) to leverage the MapReduce paradigm for complex computational logic such as real-time streaming, iterative algorithms, graph processing, and message passing.
• It is difficult and inefficient to query distributed data in the MapReduce paradigm, because data is not indexed by default. In this case, the responsibility for maintaining an index is left to programmers whenever data items are added to or removed from the system.
• The reduce tasks cannot be run in parallel with the map tasks to decrease the overall processing time, because the input to the reduce tasks is entirely dependent on the output of the map tasks.
• It has been shown that MapReduce has poor performance when some of the map tasks have long completion times.

5.5.2 Beyond MapReduce

The industry learned from the limitations and failures of the MapReduce paradigm and continued to innovate and improve specialized techniques. Employing hardware optimization techniques to process queries and designing tailor-made analytical platforms for specific use cases are the most common techniques to efficiently store and analyze large quantities of data. Over time, a large ecosystem of tools was also developed to create and manage data and to keep track of data models and metadata. Some of the more prominent technologies include, but are not limited to, the following.

• Extract-transform-load (ETL) tools
• Extract-load-transform (ELT) tools
• Data quality (DQ) and profiling tools
• Master data management (MDM) systems
• Enterprise information integration (EII)
• Business intelligence (BI)
• Online analytical processing (OLAP)
• Data federation tools
• Data virtualization tools

5.6 NoSQL Data Storage Systems

Relational databases have historically been a great fit for storing structured data within organizations, because they lead to high reliability and consistent results for most business transactions. Relational databases have strict schemas that are designed for the purpose of data integrity [11]. Because traditional relational databases model data as tables with rows (records) and columns (fields), several


business cases have reported that it sometimes becomes too expensive to scale them for large databases at web scale. NoSQL databases have been introduced specifically to address such exponential data growth challenges [12]. NoSQL databases are designed to provide a more efficient way to store certain data volumes and types. In NoSQL databases, overcoming the hurdle of scaling data is traded against relaxing the consistency requirement that slows down relational databases. The CAP theorem states that one can only have two (out of three) desirable characteristics of a storage system, that is, consistency, availability, or partition tolerance [13]. The CAP theorem can help designers choose the right database based on each enterprise's needs. In the most general form, relational databases do not offer partition tolerance and high availability, as they keep records in a format that is most optimized for insert, join, and update operations. However, relational databases can offer highly reliable and consistent data views [14]. On the contrary, NoSQL databases offer partition tolerance, along with properties related to availability and consistency in a more general form; this highly depends on the specific type of NoSQL database [15]. In a NoSQL system that focuses on partition tolerance and availability, all users can write to a particular machine, but users can read from other machines as well. In this type, the most accurate data exists on the first machine and is eventually replicated to the other machines in the system. Therefore, the imperfect consistency of the data becomes less severe over time. Examples of such systems include Cassandra, CouchDB, and DynamoDB [13]. In this section, we will introduce and elaborate on some of the key features of the most commonly used NoSQL database engines.

5.6.1 Apache Cassandra

Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system. It can store large amounts of data across many commodity servers. Cassandra stores data in tables very similar to those of relational databases. Cassandra tables have primary keys to access data. The primary keys uniquely identify rows in a table. Unlike in relational databases, primary keys are not enough to locate data rows in Cassandra, as a highly available Cassandra system is designed to run on a cluster of servers. Cassandra organizes its data quite differently from a relational database; that is, queries (not data) drive the data modelling methodology in Cassandra. In this way, data duplication is considered quite a normal action, as a side effect of nesting data. Cassandra pre-computes queries at write time to get optimized reads as a free by-product. This is different from relational databases, which use expensive operations (such as JOIN and ORDER BY) to compute queries when reading data [16]. Referential integrity in a relational database plays an important role in combining data from multiple relations to answer a query. In Cassandra, the data are nested to answer a query within the same table, and referential integrity is omitted [17]. Cassandra tables use a partition key to determine which node in the cluster


to store a data row. Cassandra offers support for clusters spanning multiple data centers, with asynchronous master-less replication yielding low-latency operations for all clients [18]. Cassandra has a special query language, called CQL, which in many ways is similar to standard SQL, with commands to define data structures (tables and indexes) and commands to modify data (updates). Contrary to conventional relational databases, a Cassandra table does not have a fixed schema; that is, some rows may have different columns than other rows. Furthermore, Cassandra only provides an eventually consistent database, where the replicas of a row on different machines might have different versions of the data. In practice, it can take some period of time before all copies of the data on different nodes are updated. The main design objective of Cassandra is to provide global availability at low latency while scaling out on commodity hardware. It aims to meet emerging large-scale data footprints and query volumes. Cassandra implements "no single point of failure," which is achieved through redundant nodes and data. Cassandra implements a master-less "ring" architecture; this is different from legacy relational database systems that are mostly based on master-slave architectures. All nodes in a Cassandra cluster communicate with each other using a scalable and distributed protocol called gossip. The Cassandra data model is also based primarily on managing columns to store and retrieve data (which is structurally very different from the relational data model). Another important feature of Cassandra is its ability to provide a linear throughput increase when adding processing devices; it uses inline load balancing as the cluster size grows. The Cassandra Query Language (CQL) is a built-in tool to create flexible schemas, update database schemas, and access data. By using concepts such as Keyspace, Table, Partition, Row, and Column, end users can organize data within a cluster of Cassandra nodes [19, 20]. In contrast to traditional relational databases that require setting up complex replication architectures to propagate data to multiple nodes, Cassandra automatically replicates data (in either synchronous or asynchronous replication mode) to multiple nodes to enhance the level of fault tolerance. Several benchmarks have shown that Cassandra's performance outstrips that of other NoSQL databases (in terms of achieved throughput during writes for the maximum number of nodes in a distributed setting). Such predictable high performance, despite an increase in the workload, allows developers to design applications that can meet strict business SLA levels. Table 5.1 provides a comparison between the data models that are mostly used in Cassandra and similar concepts in conventional relational databases [17].

Table 5.1 Cassandra data model and RDBMS equivalences

Cassandra data model    | Definition               | RDBMS equivalent
Schema/Keyspace         | Set of column families   | Schema/database
Table/Column-Family     | Set of rows              | Table
Row                     | Ordered set of columns   | Row
Column                  | Key/value pair           | (name, value) pair


The main performance drawbacks of Cassandra are the extra overhead for performing mutations and deletes over data. Cassandra stores its data in immutable data structures on disk (namely, SSTables), in which the data can be spread across several SSTables; hence, a non-negligible overhead is imposed to ensure that updates and deletes are applied correctly across the cluster. In addition to such write limitations, there is no embedded rollback mechanism in Cassandra similar to that of relational databases, which use traditional locking mechanisms to ensure transactions [17]. As we mentioned before, any NoSQL engine makes trade-offs among data consistency, availability, and partition tolerance. Eventual consistency requires that all updates to a data item are eventually reflected on all the replicas in a distributed database (although with some delay). Cassandra allows developers to implement tunable consistency, where the administration team can balance the consistency level against the replication factor. In fact, the level of consistency of the data in Cassandra's data model is often determined by the client applications. Cassandra's engine automatically replicates data across all the nodes in the cluster; however, the inherent latency in replicating the data might sometimes prevent the last updated value of a data item from being returned [17, 21].
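The snippet below is a minimal sketch of these concepts using the DataStax Python driver (cassandra-driver); the keyspace, table, contact point, and replication factor are illustrative assumptions. It creates a query-driven table keyed by a partition key plus a clustering column and issues a read with a per-request (tunable) consistency level.

from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # assumed contact point of the cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS edge_metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS edge_metrics.sensor_readings (
        sensor_id text,
        read_time timestamp,
        value     double,
        PRIMARY KEY ((sensor_id), read_time)   -- partition key + clustering column
    ) WITH CLUSTERING ORDER BY (read_time DESC)
""")

insert = session.prepare(
    "INSERT INTO edge_metrics.sensor_readings (sensor_id, read_time, value) VALUES (?, ?, ?)")
session.execute(insert, ("node-17", datetime.now(timezone.utc), 23.4))

# Tunable consistency: require a quorum of replicas to acknowledge this read.
query = SimpleStatement(
    "SELECT read_time, value FROM edge_metrics.sensor_readings WHERE sensor_id = %s LIMIT 10",
    consistency_level=ConsistencyLevel.QUORUM)
for row in session.execute(query, ("node-17",)):
    print(row.read_time, row.value)

cluster.shutdown()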

5.6.2 Apache Flink

Apache Flink is a free and open-source, distributed stream processing and analytical framework; it is built to process continuous streams of data. The Flink framework provides job scheduling, resource assignment, horizontal scaling, parallel processing, and reliability for fast and scalable data processing pipelines that can scale up to terabytes of data. The Flink engine offers high throughput and low latency, which is ideal for stream processing with millisecond latency. Flink's relational APIs provide a simple to use, yet powerful and scalable, interface for exploratory data analytics such as projections, filtering, grouping, aggregations, and joins. Flink offers two standard relational APIs.

1. The Table API, which is tightly integrated with the semantics of native programming languages such as Scala or Java.
2. The SQL API, which allows users to perform analytics using standard SQL queries.

Both APIs are tightly integrated with Flink's processing APIs, namely, Data-Sets and Data-Streams. Both Flink APIs produce the same results in the form of a table object. Such a design allows the incoming streaming data to be smoothly transformed from/to these data structures. The relational APIs also support unified batch and stream processing.

5.6.2.1 Flink Connectors

Flink's Table and SQL APIs are tightly integrated with other capabilities (such as the Data-Set APIs and the Data-Stream APIs) to help build end-to-end solutions. Data-Sets are used for bounded data, where the data is known beforehand; they support all standard SQL analytics functions. The Data-Stream APIs are designed to handle unbounded data. As an example, Flink supports windowing capabilities on streams and is thus able to be used in real-time applications. Flink tables can be created from Data-Sets and Data-Streams to build powerful pipelines. In addition, the Flink engine supports direct table integration with external data sources such as Linux file systems, Apache Kafka, Elasticsearch, HBase, and JDBC. Flink provides a unified analytical platform for both batch and streaming data. However, there are some unique characteristics and limitations when analyzing streaming data. Because streaming data is unbounded, it is not possible to know the total number of records entering the system within the next intervals. Constantly arriving new data also means that the results are not repeatable if the same query runs over the input data in different time intervals. This also means that the results derived from stream queries expire in each interval, and the query must be executed again to update the latest status. Flink SQL supports the "Dynamic Tables" and "Continuous Queries" concepts to support stream processing. A dynamic table is a logical table that represents a streaming data source. It is created in a stream table environment. The contents of the table are constantly changing as more data arrives. There is no guarantee about the number of rows in the table or the repeatability of query results at any point in time. Similar to batch tables, Flink dynamic tables can also support a schema for batch Table API and SQL operations (such as projections, filtering, and grouping). A query executed on a dynamic table is called a continuous query. It is called continuous because Flink executes the query continuously, again and again, against the dynamic table and publishes continuous results, which form another dynamic table. Each execution of the continuous query changes the content of the resulting dynamic table. Dynamic tables and continuous queries form the basis for stream analytics in Flink. Continuous queries can be further executed on the result of a previous dynamic table to extend the data pipeline operations. The final results are often pushed out to a streaming storage system such as Redis or Apache Kafka. The query latency can be defined as the time taken for data movement between the origination nodes and the analytics nodes. It is also important to control the resource requirements for streaming analytics in terms of hardware, storage, and network bandwidth. Latency and scalability requirements may impact the resource allocation and the total cost of operation.
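A minimal PyFlink sketch of a dynamic table and a continuous query is given below, assuming PyFlink is installed and the Kafka SQL connector JAR is available on the classpath; the topic, broker address, schema, and window size are illustrative assumptions.

from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming table environment: tables defined here are dynamic tables.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A dynamic table backed by an unbounded Kafka topic of sensor readings.
t_env.execute_sql("""
    CREATE TABLE readings (
        sensor_id STRING,
        reading   DOUBLE,
        ts        TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'sensor-readings',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# A continuous query: its result is another dynamic table that Flink keeps
# updating as new rows arrive (10-second tumbling event-time windows).
result = t_env.sql_query("""
    SELECT sensor_id,
           TUMBLE_START(ts, INTERVAL '10' SECOND) AS window_start,
           AVG(reading) AS avg_reading
    FROM readings
    GROUP BY sensor_id, TUMBLE(ts, INTERVAL '10' SECOND)
""")
result.execute().print()   # continuously prints updates of the result table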

5.6.2.2 Flink Architecture

Figure 5.2 depicts a high-level architectural overview and Flink's main components for the distributed execution of data processing programs across a cluster of nodes. Flink applications are structured as directed graphs, also known as a "JobGraph." Each vertex of such a graph represents a computational operation that can exchange data with other operations along the edges of the JobGraph. In particular, the two building blocks of an application are:

1. Intermediate results, which represent data streams.
2. Transformations, which are stateful operators that take one or more streams as input and generate one or more output streams. Each operator can contain further iterative operations as well.

Each computation is attached to one or more data stream sources (e.g., IoT sensors, a Kafka topic, or database transactions) and ends in one or more data stream sinks (e.g., files or databases). The Flink runtime core layer builds the JobGraph by receiving the application code. It represents an abstract parallel data flow with arbitrary computational tasks that consume and produce data streams. The Data-Stream and Data-Set APIs can be used to compile the application code for building the JobGraph. A query optimizer can also be used to determine the optimal deployment plan for the program across the runtime environment (e.g., local or YARN). Application developers can exploit different tools (from the libraries and/or APIs that are bundled with Flink) to generate data stream or batch processing programs.

[Figure: Flink software stack — libraries for stream and batch processing (Table/Gelly API, Java/Scala API, complex event processing, machine learning); runtime core for distributed streaming dataflow execution (Data-Set API, Data-Stream API, query optimizer, stream builder); deployment and cluster management (local single JVM, standalone cluster, YARN cluster, cloud such as GCE/EC2); storage management (local file system, JDBC store, HDFS, cloud such as S3).]

Fig. 5.2 Flink software stack and its core components ©2019 IEEE, Reprinted, with permission, from M. Reza HoseinyFarahabady et al. (2020)

5.6.2.3 Flink Deployment Plan

Flink offers a powerful paradigm for developing streaming analytical applications that require event-time semantics and stateful exactly-once processing. At runtime, the logical computations, as defined by distinct operators, can be executed in a concurrent fashion across a multi-node Flink cluster. Such an execution model is different from traditional programming models (e.g., MapReduce) where an application is first divided into individual phases and each stage must then be executed sequentially [22]. In Flink, operator "subtasks" are executed in the context of a distributed execution environment and, hence, must be designed with the capability to run independently from each other (e.g., in different threads/containers across distributed nodes). The number of operator subtasks, referred to as the parallelism degree of an operator, is a configurable parameter that can be overwritten by the controller to improve the performance of application programs. Different operators in a Flink application can have dissimilar degrees of parallelism when exchanging data with each other or following predefined forwarding patterns. The data exchange model involves several objects, including the following.

JobManager nodes are responsible for scheduling, recovery, and coordination among tasks by maintaining the DataFlow data structure.

TaskManager nodes execute the set of submitted tasks in a JVM using concurrently running threads. Each TaskManager node contains its own local communication manager and memory manager.

The ExecutionGraph (the DataFlow graph) data structure lives in the JobManager and represents computation tasks, intermediate results, and ExecutionEdges.

The distributed engine in Flink adapts the execution plan to the cluster environment; accordingly, it runs distinct configurable execution plans and/or data distribution deployments. Flink is compatible with a number of cluster management and storage solutions, such as Apache HDFS [23] and Apache Hadoop YARN [24]. The StreamBuilder and Common API components, as highlighted in Figs. 5.2 and 5.3, are responsible for translating the represented schematic directed graphs of logical operations into generic data stream programs to be deployed on Flink's runtime environment. The query optimizer component automatically optimizes the data flow process, for example, by finding the most efficient concrete filtering execution plan for a user who only requests an abstract filter operation using the provided Scala or Java APIs.

5.6.3 Apache Storm

Apache Storm has emerged as an important open-source technology for performing stream processing with very tight latency constraints over a cluster of computing nodes. The Storm engine allows machines to be easily added in order to increase the processing capacity in real time.


Fig. 5.3 Abstracted view of the control flow for data exchange between tasks in Flink ©2019 IEEE, Reprinted, with permission, from M. Reza HoseinyFarahabady et al. (2020)

Storm also allows developers to run incremental functions over data with guaranteed performance for message-processing applications; this includes a well-defined framework for performing certain actions when failures occur.

5.6.3.1 Storm Concepts

A Storm topology is a graph of computation where the nodes represent individual computations and the edges represent the results of computations being passed among nodes. Topologies, streams, spouts, and bolts are the most basic components in a Storm ecosystem.

Storm topology
A Storm topology is an abstraction that defines the entire stream processing pipeline as a computation graph running over data. A topology is a DAG (directed acyclic graph) where each node performs processing operations and forwards the results to the next node(s) in the flow across the edges.

Data tuples and fields
The basic unit of data that can be processed by any Storm application is called a data tuple. Each tuple often consists of a list of predefined attributes, known as fields. A node can create and then send tuples to any number of nodes in the


graph. The process of sending a tuple to be handled by any number of nodes is called emitting a tuple.

Stream
A stream in Storm is an unbounded sequence of tuples between two nodes in the topology. A topology can contain any number of streams. Nodes in a Storm topology accept one or more streams as input, perform some computation or transformation, and emit (create) new tuples as a new output stream. Such output streams then act as input streams for other nodes in the topology.

Spouts
Spouts are building blocks in a Storm ecosystem. Each spout reads data from an external data source and emits tuples into the topology. Spouts can listen to message queues for incoming messages, examine log files for new entries, or query databases for real-time changes.

Bolts
Bolts are also building blocks in a Storm ecosystem. Each bolt accepts a tuple from its input stream, performs some computation or transformation (filtering, aggregation, join, etc.) on that tuple, and then optionally emits new tuples to its output streams [25].

Storm task
Each running instance of a spout or bolt is called a task. The key to the Storm model is that tasks are inherently parallel, though not necessarily running on the same machine. Apache Storm also provides developers with mechanisms to support serialization, message passing, task discovery, and fault-tolerant low-latency computations. The Storm model requires stream groupings to specify how tuples should be partitioned among consuming tasks. The simplest kind of stream grouping is the shuffle grouping, where tuples are distributed to tasks using a random round-robin algorithm. This grouping evenly splits the processing load among all consumers. Another common grouping is the fields grouping, where tuples are distributed by hashing a subset of the tuple fields and modding the result by the number of consuming tasks. To summarize, a topology in a Storm ecosystem consists of nodes (representing either spouts or bolts) and edges (representing streams of tuples between spouts and bolts). It is worth mentioning that each spout and bolt might have one or many individual instances (tasks) to perform processing in parallel.
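The following framework-free Python sketch illustrates the two grouping strategies described above; it does not use the Storm API, and the tuple layout and task count are illustrative assumptions.

import itertools

NUM_TASKS = 4                                   # assumed number of consuming tasks
_round_robin = itertools.cycle(range(NUM_TASKS))

def shuffle_grouping(_tup):
    # Shuffle grouping: spread tuples evenly across tasks (plain round-robin here;
    # Storm additionally randomizes the assignment).
    return next(_round_robin)

def fields_grouping(tup, key_fields=("user_id",)):
    # Fields grouping: hash the chosen fields and mod by the number of tasks,
    # so tuples with the same key always reach the same consuming task.
    key = tuple(tup[f] for f in key_fields)
    return hash(key) % NUM_TASKS

events = [{"user_id": "alice", "clicks": 1},
          {"user_id": "bob", "clicks": 2},
          {"user_id": "alice", "clicks": 3}]

for e in events:
    print(e["user_id"],
          "-> shuffle task", shuffle_grouping(e),
          "| fields task", fields_grouping(e))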

5.6.3.2 Storm Deployment Architecture

At runtime, the logical computations, as defined by the Storm topology, are executed concurrently across a multi-node Storm cluster. The Storm deployment architecture includes the following components.

Nimbus
A Storm cluster contains a master node that is called Nimbus. It manages and distributes application code and runs it among multiple worker nodes. The


Nimbus node accepts requests from users and deploys a topology across a Storm cluster by assigning workers around the cluster to execute the topology. The Nimbus node is also responsible for detecting failures occurring in any worker node and reassigning failed tasks to other nodes if necessary. By default, the Nimbus node is stateless and saves all data related to the state of the computation in the ZooKeeper nodes of a Storm cluster.

ZooKeeper
ZooKeeper is an application that runs on a single machine or a cluster of nodes to provide coordination among different components of a distributed application. It is mainly used to share configuration information among various processes. In Storm, Nimbus and worker nodes do not communicate directly with each other but through ZooKeeper.

Supervisor
Each worker node in a Storm cluster runs a daemon called the supervisor that is responsible for creating, starting, and stopping worker processes to execute the tasks assigned to that node. A single supervisor daemon, running on each worker node, communicates with the Nimbus master node to determine the set of tasks to run. It then manages multiple worker processes on that machine to accomplish the tasks. The supervisor daemon can start or stop the worker processes on the machine as dictated by Nimbus. Once running, worker nodes can discover the location of other workers through ZooKeeper so that they can transfer data directly to each other.

5.6.4 Apache Spark

Apache Spark is a fast and general engine for large-scale data processing and real-time stream event processing [26]. Spark has an advanced engine that can achieve up to a hundred times faster processing than traditional MapReduce jobs on Hadoop, because Spark distributes the execution across its cluster and performs many of its operations in memory. Spark supports Java, Scala, Python, and R, as well as more than a hundred high-level operators that help developers build data processing streaming applications. Spark implements SQL, streaming, machine learning, and graph processing using open-source libraries built into the platform. The Spark platform can consume data from many sources, including the Hadoop distributed file system (HDFS), Cassandra, HBase, and Hive, as well as any other Hadoop data source. Spark 2.0 can also connect directly to traditional relational databases using data frames. Spark can perform data integration or ETL (extract, transform, and load) processes to move data from one location to another. Spark is heavily used by data analysts, business intelligence units, and data scientists.

5.6.4.1 Spark Architecture

A Spark cluster can consist of multiple executor nodes capable of executing a program in parallel. The level of parallelism and performance achieved depends on how the pipeline is designed. A typical pipeline starts by reading from an external source database and turning the input into structured data that can later be converted to a data frame. The "Resilient Distributed Dataset" (RDD) is the lowest-level API for working with data in Spark. An RDD can be seen as a container that allows Spark to work with data objects. RDD objects can be of varying types and spread across many machines in a cluster. When a parallel transforming operation (e.g., map or filter) is executed, the operation is pushed to the executor nodes to be executed locally on the assigned partitions. The executor nodes may create new partitions. Apache Spark is built around a batch-first design philosophy where the input data (which may be scattered across a distributed storage system such as HDFS or Amazon S3) remains unchanged during the course of execution. The Spark engine executes computations over a set of JVMs (Java virtual machines) that last for the duration of a data processing application. The Spark cluster manager component is responsible for orchestrating the distribution of application tasks across several worker nodes. A Spark application needs to be built around a specific abstraction model, namely, the "Resilient Distributed Dataset (RDD)," which represents a statically typed and distributed data collection (https://spark.apache.org/docs/latest/rdd-programming-guide.html). Developers can use a variety of programming languages (Scala, Java, Python, etc.) either to use predefined "coarse-grained" operations (e.g., map, join, or reduce) or to define their own transformation functions for manipulating RDDs. Figure 5.4 lists a Scala script for creating RDDs in two ways: (1) referencing a dataset from a data file in an external storage system and (2) parallelizing an existing collection. The Spark ecosystem also includes other components (such as Spark SQL, Spark Streaming, and GraphX) that provide more specific interfaces for performing specialized functionality [27]. In particular, Spark Streaming employs the core scheduling of Spark to perform analytical operations on semi-structured mini-batches of data by defining a window size for batches.

import org.apache.spark.sql.SparkSession

object SparkTestInScala {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SparkTest").getOrCreate()
    val sc = spark.sparkContext
    // (1) Reference a dataset stored in an external storage system (3 partitions)
    val rdd_from_textFile = sc.textFile("hdfs://input_file.txt", 3)
    // (2) Parallelize an existing in-memory collection
    val rdd_from_list = sc.parallelize(List("This example", "is created", "for test", "purpose!"))
    spark.stop()
  }
}

Fig. 5.4 Scala script for creating RDDs by referencing an external dataset and by parallelizing a collection

5.6.4.2 Spark Execution Engine

Once a user submits a processing job over RDDs, the Spark engine translates it into a DAG of tasks for execution in a distributed environment using parallel worker nodes across a cluster. Each DAG vertex represents the corresponding RDD, and each DAG edge reflects the operation to be applied to the RDD once the data is ready. The Spark "DAGScheduler" is responsible for locating the right partitions for executing each task such that its relevant RDDs are close to it. The Spark engine currently supports three cluster manager models: the standalone cluster manager, Apache Mesos, and Hadoop YARN. The Spark standalone cluster manager schedules the tasks in different stages following a FIFO (first-in-first-out) policy. Using FIFO, each job is divided into several stages (e.g., map, filter, exchange, join, and reduce). Then, the first submitted job gets priority over all available resources as long as its running phases have some pending operation tasks to launch. After that, the second priority is assigned to the subsequently submitted job; the other jobs are treated similarly. When a job is submitted to the Apache Spark engine, it first analyzes the given code to build an execution plan. Spark uses an optimizing engine to analyze the steps needed for processing the data; it aims to optimize for performance and resource utilization. Spark only executes code when an action such as reduce or collect is performed. The optimizer can also identify opportunities to optimize parallel I/O, shuffling, and the memory usage of operations to further improve performance and/or reduce memory requirements on the driver. The strategy employed by the Spark engine for handling data streams has some constraints. Most importantly, a series of batches is constructed in accordance with the processing time, instead of the event time. Moreover, each RDD is a read-only object; hence, algorithms that require iterative computation have to create new sets of RDDs based on the application's DAG structure. This can seriously affect Spark as the size of iterative computations increases. Although an RDD can be replaced to speed up the running time, a series of pre-computation processes is required for each data update. Thus, for most machine learning and deep learning algorithms, developers must manually deal with all issues that can be caused by iterative computations. Spark uses a check-pointing mechanism to maintain the current state for computing aggregations and managing transitions in stream processing. Check-pointing is the ability to save the state of the pipeline to a persistent data store such as HDFS. When a job fails and needs to be restarted, the information saved during check-pointing is used to resume processing from where it left off. Check-pointing stores a number of metadata elements, as well as some RDDs, at periodic intervals in the checkpoint store. Apache Spark watermarks are another important feature for event-time operations. Event time is the time a specific event is created at a source. If windowing needs to be done based on event time, the processing of any window needs to wait until all events for that window arrive at the Spark executors.


Watermarks are used to set the delay required before processing can happen. Using watermarks, Spark keeps track of the events and waits for this delay until all events for that window have arrived.
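A minimal PySpark Structured Streaming sketch of watermarking and check-pointing is given below (note that it uses the Structured Streaming API rather than the RDD-based DStream API); the source, rate, window size, lateness bound, and checkpoint path are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("WatermarkSketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows and stands in
# for a real unbounded event stream here.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Tolerate events arriving up to 30 seconds late, then count per
# 1-minute event-time window; the watermark bounds how long Spark waits.
counts = (events
          .withWatermark("timestamp", "30 seconds")
          .groupBy(window("timestamp", "1 minute"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/watermark-checkpoint")  # checkpoint store
         .start())
query.awaitTermination()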

5.7 Conclusion

The emergence of Internet of Things (IoT) applications with potentially billions of machine inputs puts pressure on relational database usage and the ACID transaction model. To support many IoT applications, programmers must develop applications that are capable of handling unstructured data and run them on edge devices with limited memory and computing power. Hadoop, as one of the earliest transformations in database architecture, provides storage and processing power for unstructured and semi-structured data, but cannot handle transactional and online operations. To close this gap, several classes of new data storage/processing engines have been introduced to address the consistency and availability requirements of massive real-time processing in both batch and streaming modes. This chapter listed some of the engines commonly used in industry and briefly explained their characteristics.

References

1. Jinlai Xu, Balaji Palanisamy, Qingyang Wang, Heiko Ludwig, and Sandeep Gopisetty. Amnis: Optimized stream processing for edge computing. Journal of Parallel and Distributed Computing, 160:49–64, 2022.
2. Gabriele Mencagli, Massimo Torquati, Andrea Cardaci, Alessandra Fais, Luca Rinaldi, and Marco Danelutto. WindFlow: High-speed continuous stream processing with parallel building blocks. IEEE Transactions on Parallel and Distributed Systems, 32(11):2748–2763, 2021.
3. Junbo Wang, Michael Conrad Meyer, Yilang Wu, and Yu Wang. Maximum data-resolution efficiency for fog-computing supported spatial big data processing in disaster scenarios. IEEE Transactions on Parallel and Distributed Systems, 30(8):1826–1842, 2019.
4. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
5. Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce online. In NSDI, volume 10, page 20, 2010.
6. Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 938–948. SIAM, 2010.
7. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. 2004.
8. Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu. Distributed data management using MapReduce. ACM Computing Surveys (CSUR), 46(3):1–42, 2014.
9. Ibrahim Abaker Targio Hashem, Nor Badrul Anuar, Abdullah Gani, Ibrar Yaqoob, Feng Xia, and Samee Ullah Khan. MapReduce: Review and open challenges. Scientometrics, 109(1):389–422, 2016.


10. Dawei Jiang, Beng Chin Ooi, Lei Shi, and Sai Wu. The performance of MapReduce: An in-depth study. Proceedings of the VLDB Endowment, 3(1–2):472–483, 2010.
11. Christof Strauch, Ultra-Large Scale Sites, and Walter Kriha. NoSQL databases. Lecture Notes, Stuttgart Media University, 20:24, 2011.
12. Yishan Li and Sathiamoorthy Manoharan. A performance comparison of SQL and NoSQL databases. In 2013 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pages 15–19. IEEE, 2013.
13. Jaroslav Pokorny. NoSQL databases: A step to database scalability in web environment. International Journal of Web Information Systems, 2013.
14. Michael Stonebraker. SQL databases v. NoSQL databases. Communications of the ACM, 53(4):10–11, 2010.
15. Ameya Nayak, Anil Poriya, and Dikshay Poojary. Type of NoSQL databases and its comparison with relational databases. International Journal of Applied Information Systems, 5(4):16–19, 2013.
16. Sam R. Alapati. Apache Cassandra: An introduction. In Expert Apache Cassandra Administration, pages 3–34. Springer, 2018.
17. Sam R. Alapati. Expert Apache Cassandra Administration. Springer, 2018.
18. Joaquin Casares. Multi-datacenter replication in Cassandra. DataStax Blog, 2012.
19. Reuven M. Lerner. At the forge: Cassandra. Linux Journal, 2010(198):7, 2010.
20. Apache Cassandra. Apache Cassandra documentation v3.2, 2017.
21. Raul Estrada and Isaac Ruiz. Big Data SMACK. Apress, Berkeley, CA, 2016.
22. Flink Docs Stable: Introduction to Apache Flink. https://flink.apache.org/. Accessed: Nov. 2019.
23. Dhruba Borthakur et al. HDFS architecture guide. Hadoop Apache Project, 53(1–13):2, 2008.
24. Vinod Kumar Vavilapalli et al. Hadoop YARN: Yet another resource negotiator. In Symposium on Cloud Computing, page 5. ACM, 2013.
25. Matthew Jankowski, Peter Pathirana, and Sean Allen. Storm Applied: Strategies for Real-Time Event Processing. Simon and Schuster, 2015.
26. Matei Zaharia et al. Spark: Cluster computing with working sets. HotCloud, 10:95, 2010.
27. Holden Karau and Rachel Warren. High Performance Spark. O'Reilly Media, 2017.

Chapter 6

AI/ML on Edge

Abstract This chapter covers the details of implementing AI/ML approaches in edge computing environments. Before introducing the four key technical components of edge intelligence (caching, training, inference, and offloading), we first give a fundamental introduction to the core concepts and analyze the development trends that led to them. We then focus on the overall workflow and architecture of an intelligent edge system and present the general disciplines and challenges of the four components separately. We provide an in-depth discussion of two strongly related components of any intelligent algorithm (i.e., training and inference) for edge platforms. For training on the edge, we discuss the basic architecture and model optimization techniques, followed by a demonstration of a typical case example on federated learning. For model inference on the edge, we present various lightweight/simplified networks suitable for resource-constrained devices and introduce inference learning models suitable for edge devices.

6.1 Introduction

Artificial intelligence (AI) refers to the exploitation of a set of codes, technologies, algorithms, and/or data that enables computer systems to develop and simulate human-like behavior in making decisions similar to (or in some cases better than) people [1]. The first official appearance of AI can be dated back to the 1950s, and as far as we know, AI has three key components: data, models/algorithms, and computing. Initially, most AI techniques were model-driven; that is, they were designed to study the characteristics of a given application and mathematically form a model to describe it. In recent years, however, the majority of algorithms currently used are based on machine learning (ML), which is driven by data. Modern ML methods were inspired by the early computational model of neurons proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943 [2]. In their model, artificial neurons receive one or more inputs, each independently weighted. Neurons sum these weighted inputs, and the result is passed through a non-linear function (called an activation function) to represent the neuron's action potential; the result is then passed along the neuron's axon to other neurons.
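The weighted-sum-plus-activation computation described above can be written in a few lines; the following sketch uses a sigmoid activation and illustrative weights (the original McCulloch-Pitts formulation used a simple threshold function instead).

import math

def neuron(inputs, weights, bias=0.0):
    """Artificial neuron: weighted sum of inputs passed through an activation."""
    pre_activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-pre_activation))   # sigmoid activation

# Example: three weighted inputs feeding one neuron
print(neuron(inputs=[0.5, 0.1, 0.9], weights=[0.4, -0.6, 0.8]))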


These algorithms converge to an optimal solution and generally improve their performance as the number of samples available for learning increases. Several types of learning algorithms exist: supervised learning, unsupervised learning, and reinforcement learning. Thanks to recent technological advancements, AI technology is now thriving with the continuous development of algorithms, big data, and computing power. AI technology is developing rapidly as a key strategic element in many commercial, medical, and government sectors around the world, with applications such as self-driving cars [3–5], chatbots [6, 7], autonomous planning and scheduling [8–11], gaming [12], translation [13, 14], medical diagnosis [15–18], and even spam countermeasures [19, 20]. Considering the ever-growing requirements for computing and storage, centralized cloud computing resources [21] are undoubtedly excellent hosts to foster ML, especially for computationally intensive tasks like deep learning (DL). The move toward the Internet of "Everything" has driven the proliferation of the Internet of Things (IoT) in our lives [22, 23], leading to a large amount of data being generated by the things/devices immersed in our surroundings. Cisco predicts that there will be 30.9 billion IoT-connected devices by 2023, generating an even larger amount of data [24]. Typically, these data should be stored in mega-scale data centers and processed by powerful servers, but this comes with high latency induced by network overload and high security risks during transmission. Therefore, cloud computing is not efficient enough to support these applications in certain scenarios. This naturally moves us into a post-cloud era in which data needs to be processed at the edge of the network to satisfy the requirements of IoT; this is known as edge computing [25]. Formally, edge computing is a highly virtualized computing platform that migrates computing tasks from the remote cloud to the near-end network edge for execution [26]. The edge can be any entity from the data source to the data center that provides services such as computing, storage, and/or data analytics. Unlike cloud computing, edge computing does not rely on powerful centralized servers; instead, it clusters weaker and more dispersed computing nodes. By offloading computation and data analysis to the edge of the network, it can significantly reduce latency while keeping data safe. The main advantages of edge computing can be listed as follows.

Transmission shortening
Computing tasks are usually performed near the data sources, which greatly alleviates the uncertainty in transmission time and path, ensuring optimal latency and enhanced security [27, 28].

Energy saving
Since end devices can offload some computing tasks to the edge server, the energy consumption of end devices is greatly reduced, effectively prolonging their battery life [29].

Scalability
Cloud computing is still available if there are not enough resources on edge devices or edge servers. Moreover, idle devices can communicate with each other to complete tasks cooperatively. This means that the computing power of the edge computing paradigm can be flexibly adapted to different application scenarios [22, 30].


Given the advantages mentioned above, the edge has great potential to provide, or even extend, analytics capabilities that were formerly confined to clouds. Furthermore, considering that AI is functionally necessary for quickly analyzing large volumes of data and extracting insights, there is a strong demand to host AI at the edge. Pushing the AI frontier to the edge ecosystem is a demand-side trend that introduces two different research areas: AI for the edge and AI on the edge. To complement previous chapters that elaborated various aspects of "AI for the edge," this chapter is dedicated to "AI on the edge," that is, how to run AI models at the edge while satisfying requirements on algorithm performance, cost, privacy, reliability, efficiency, etc. In this context, edge intelligence refers to a set of connected devices and systems for data collection, caching, processing, and analysis in proximity to where the data is collected, with the purpose of enhancing the quality and speed of data processing and protecting the privacy and security of the data [31]. Compared to the traditional cloud intelligence shown in Fig. 6.1a, edge intelligence has significant implications for the Internet of Things (IoT) and distributed networks (shown in Fig. 6.1b). It is closer to the end user or data and is thus able to maximize business efficiency. Performing intelligent tasks closer to the data, rather than sending it to a remote server or elsewhere, increases task efficiency and reduces the likelihood of data being intercepted or leaked. Several works have demonstrated the feasibility of edge intelligence by applying it to real-world applications; for example, edge intelligence is starting to emerge in areas such as the smart home and autonomous driving, providing new experiences and opportunities in safety. To this end, in this chapter, we elaborate on how to implement edge intelligence in a systematic way, including (1) how to cache data at the edge, (2) how to train at the edge, (3) how to infer at the edge, and (4) how to provide enough computing resources at the edge.

6.2 System Overview Research on edge intelligence emerged gradually around 2011 and has been flourishing ever since. Benefiting from both edge computing and AI technologies, it supports people's daily usage of mobile applications in an easy way and allows them to become less dependent on centralized clouds. The main external, supply-side driving force promoting the sustainable development of edge intelligence is that AI technologies have achieved outstanding results in most areas of our lives. With the continuous penetration of AI into our lives, people also desire this technology to support mobile devices anytime and anywhere to meet users' demands. However, currently popular AI methods are usually deployed in clouds, a centralized architecture that is not well suited to providing continuous services for mobile applications. Another internal, supply-side driving factor is that, with the rapid development of IoT technology, more and more big data is generated and distributed in the edge surroundings.


Fig. 6.1 Different intelligence architectures. (a) Cloud Intelligence. (b) Edge Intelligence

In fact, sending all these data to the remote cloud server is avoided as much as possible, because it can heavily consume/waste network bandwidth and jeopardize data security during long-distance transmission. Pushing the frontiers of intelligence to the edge is certainly a promising solution to address these issues and unleash the potential of big data at the edge.


Fig. 6.2 The complete prototype of edge intelligence

In addition to these two supply-side drivers, there is also a demand-side driver: the continuous improvement of edge computing systems, along with people's increasing demand for smart living, has facilitated the implementation of edge intelligence. Naturally, large efforts from both academia and industry have been made to realize these demands. Looking at a complete edge AI execution process, we identify four key components (Fig. 6.2): data/model caching, model training, model inference, and offloading.


6.2.1 Caching on the Edge Caching was initially proposed to fill the throughput gap between main memory and registers by exploiting correlations in memory access patterns [32]. Edge caching, in contrast, refers to collecting and storing the data generated by edge devices and the surrounding environment within a distributed system close to end users (presented as the yellow lines in Fig. 6.2). Edge caching not only accelerates the processing and analysis performed by intelligent algorithms but also effectively saves limited transmission and computation resources. It ultimately provides end users with services that are better matched to mobile scenarios and demands. To address caching successfully, we need to clarify the following three questions. (1) What to cache? As mentioned above, the process of edge intelligence involves feeding an intelligent application with collected data and then sending the results back to where the data is stored. Therefore, there are two kinds of redundancy at the edge: data redundancy and computational redundancy [33]. Data redundancy, also known as communication redundancy, means that the inputs to an intelligent application may be identical or partially identical. For example, in continuous motion visual analysis, there are a large number of similar pixels between consecutive frames. Some resource-constrained edge devices need to upload the collected videos to the edge server or cloud for further processing. With caching, edge devices only need to upload the differing pixels or frames; for parts with the same content, edge devices can reuse the cached content to avoid unnecessary communication. Computational redundancy means that the computing tasks requested by smart applications may be the same. For example, different users in the same zoo are likely to request the same task of identifying species through image recognition; in this case, the edge server can directly send the previously obtained identification results back to multiple users. Such caching can significantly reduce computation and execution time. (2) Where to cache? Existing work mainly suggests deploying caches in three possible locations: on macro base stations, on micro base stations, and on edge devices. In edge intelligence, macro base stations usually act as edge servers to provide intelligent services by caching data. Compared with macro base stations, micro base stations have smaller coverage but higher quality of experience. Compared with macro and micro base stations, edge devices usually have limited resources and high mobility, so very little attention has been paid to caching on individual edge devices. The differences among these three cache locations are given in Table 6.1. (3) How to cache? Due to the limited storage capacity of macro base stations, micro base stations, and edge devices, content replacement must be considered for better caching decisions. To address this problem, it is necessary to design efficient replacement policies that maximize service quality.


Table 6.1 Comparison among different cache places

Cache place          Coverage radius  User range  Structure          Computational capability
Devices              10 m             Small       Unstable           Low
Micro base station   20–200 m         Medium      Slightly unstable  Medium
Macro base station   500 m            Large       Stable             High

For example, the authors of [34] discussed content replacement by designing decentralized algorithms, and the authors of [35] designed a greedy algorithm to find optimal content replacement decisions.
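To make the reuse of computational redundancy concrete, the following is a minimal sketch of an edge-side result cache with a least-recently-used (LRU) replacement policy. It is an illustrative example only, not a policy from the works cited above; the class and function names, and the use of a hash of the input as the cache key, are our own assumptions.

```python
import hashlib
from collections import OrderedDict

class EdgeResultCache:
    """Cache inference results on an edge server; evict the least recently used entry."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._store = OrderedDict()  # key -> cached inference result

    @staticmethod
    def _key(payload: bytes) -> str:
        # Identical inputs (e.g., the same animal photo) map to the same key.
        return hashlib.sha256(payload).hexdigest()

    def get(self, payload: bytes):
        key = self._key(payload)
        if key in self._store:
            self._store.move_to_end(key)      # mark as recently used
            return self._store[key]
        return None

    def put(self, payload: bytes, result):
        key = self._key(payload)
        self._store[key] = result
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:  # evict the least recently used entry
            self._store.popitem(last=False)

# Usage: answer repeated "identify this species" requests without recomputing.
cache = EdgeResultCache(capacity=100)
image = b"...raw image bytes..."
result = cache.get(image)
if result is None:
    result = "lion"                # placeholder for an actual model inference call
    cache.put(image, result)
```

In practice, the eviction criterion in such a sketch would be replaced by whatever policy the deployment calls for (e.g., popularity- or deadline-aware policies), which is exactly the design space explored by the replacement algorithms cited above.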

6.2.2 Training on the Edge Edge training refers to a distributed learning procedure that learns the optimal values of the weights, or the hidden patterns, based on the training set cached at the edge. The process of edge training is shown as solid lines (both green and blue) in Fig. 6.2. In essence, edge training performs learning tasks on the data generated or collected through resources at the edge, instead of sending the data to a central server, which yields more secure and robust services. Unlike the traditional centralized training process on powerful servers or computing clusters, edge training usually occurs on edge servers or edge devices, which are typically far less powerful. Therefore, the training method at the edge also changes accordingly. As the most prominent change is the distributed architecture, data distribution, computing capability, and network conditions should be considered comprehensively. This brings new challenges and problems, such as training efficiency, communication efficiency, privacy and security issues, and uncertainty estimation [36]. Therefore, to facilitate model training at the edge, we should focus on the following three key issues. (1) How to train? There are two training architectures, classified by the processing capabilities of the devices: solo training and collaborative training. Solo training means that training tasks are performed on a single device without the help of other devices; collaborative training means that multiple devices cooperate to train a model/algorithm. Solo training is a closed system that requires only iterative computations on a single device to obtain the optimal parameters or patterns. Due to its high hardware requirements, solo training is not widely implemented at present, but with the continuous upgrading of hardware, this method still has strong application prospects. For collaborative training, there are two main subcategories: master-slave [37] and peer-to-peer [38].


(2) How to optimize? Different from centralized intelligence with its powerful processing ability, edge training is usually based on the collaboration of multiple resource-limited edge devices and usually requires periodic communication for updates. Therefore, time efficiency, energy consumption, and security guarantees are among the optimization objectives to be considered. To the best of our knowledge, there are six key performance indicators that are widely used in edge training optimization research; they are:

• Training loss
• Convergence
• Privacy
• Communication cost
• Latency
• Energy efficiency

(3) How to estimate the uncertainty of the model output? Uncertainty estimation can be implemented easily in centralized intelligence, where it consumes relatively negligible resources; performing it during edge training must instead be done with the limited resources available at the edge.

6.2.3 Inference on the Edge Edge inference is the stage of using a trained model/algorithm to infer on test-set instances and obtain results through edge devices and servers. The process of edge inference is shown as dashed lines (in both green and blue) in Fig. 6.2. Most existing AI inference models are designed to be deployed on powerful central servers with strong CPUs and GPUs, which does not suit edge environments. However, the efficient implementation of model inference at the edge is critical for enabling high-quality edge intelligence services. Therefore, the key points for adopting inference at the edge are as follows. (1) How to adapt models for edge devices or servers? Since edge devices are typically resource-constrained, methods of adapting models to those devices are mainly divided into two categories: (a) designing new models/algorithms that require fewer resources and are thus applicable to the edge, and (b) compressing existing models to reduce unnecessary operations during inference. For the first direction, there are two paths to designing new models: (1) letting machines design the optimal models themselves (referred to as architecture search) and (2) implementing human-invented architectures using techniques such as depth-wise separable convolution and group convolution. Model compression means compressing existing mature models to obtain simpler and smaller models that are more computationally acceptable and energy efficient, with negligible loss in accuracy. There are many strategies for model compression [39], including


low-rank approximation, knowledge distillation, compact layer design, network pruning, and parameter quantization. (2) How to accelerate inference at the edge to provide real-time responses? Similar to edge training, edge devices and servers are not as powerful as centralized servers or computing clusters; hence, edge inference is much slower and needs to be accelerated on both the hardware and software sides.

6.2.4 Offloading on the Edge As a necessary component of edge intelligence, edge offloading refers to a distributed computing paradigm (depicted by the large blue-filled arrows in Fig. 6.2) that provides computing services for edge caching, edge training, and edge inference. If a single edge device does not have enough resources for a specific intelligent task, it can offload all or part of its tasks to edge servers, other edge devices, or even cloud servers. The edge offloading layer transparently provides computing services for the other three components of edge intelligence. In edge offloading, the offloading strategy [40] is of utmost importance, as it must make full use of the available resources at the edge. Available computing resources are distributed among cloud servers, edge servers, and edge devices. Correspondingly, four popular strategies can be described.
Device-to-Cloud (D2C) offloading prefers to leave pre-processing tasks on edge devices and offload the rest of the tasks to a cloud server. This can significantly reduce the amount of uploaded data and the latency.
Device-to-Edge (D2E) offloading is similar to D2C but can further reduce latency and the dependency on the cellular network.
Device-to-Device (D2D) offloading focuses on smart home scenarios in which IoT devices, smartwatches, and smartphones collaboratively perform training/inference tasks.
Hybrid offloading strategies have the strongest adaptiveness and make the most of all available resources.
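As a rough illustration of how such a strategy might be realized in software, the following is a minimal sketch that picks an offloading target by comparing estimated completion times. The latency model (transfer time plus compute time), the class and function names, and all numbers are simplifying assumptions for illustration, not a strategy taken from the literature cited above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Target:
    name: str            # "device", "edge", or "cloud"
    flops_per_s: float   # effective compute speed available to this task
    uplink_bps: float    # bandwidth from the device to this target (inf for local)

def estimated_latency(task_bits: float, task_flops: float, t: Target) -> float:
    """Very rough completion time: transfer time + compute time."""
    transfer = 0.0 if t.uplink_bps == float("inf") else task_bits / t.uplink_bps
    return transfer + task_flops / t.flops_per_s

def choose_target(task_bits: float, task_flops: float, targets: List[Target]) -> Target:
    """Hybrid-style decision: pick whichever target minimizes estimated latency."""
    return min(targets, key=lambda t: estimated_latency(task_bits, task_flops, t))

# Example: a 5 MB input requiring 20 GFLOPs of compute.
targets = [
    Target("device", flops_per_s=5e9,   uplink_bps=float("inf")),
    Target("edge",   flops_per_s=50e9,  uplink_bps=100e6),
    Target("cloud",  flops_per_s=500e9, uplink_bps=20e6),
]
best = choose_target(task_bits=5 * 8e6, task_flops=20e9, targets=targets)
print(best.name)  # with these numbers, the edge server wins
```

A real offloading policy would, of course, also account for energy, monetary cost, queueing at the edge server, and privacy constraints; this sketch only shows the basic trade-off between transfer and compute time that underlies the D2C/D2E/D2D/hybrid choices above.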

6.3 Edge Training The role of model training is to set the parameters of the applied ML framework (e.g., a neural network) based on the data. Due to limited computational capability, this process is usually assumed to occur off-device (e.g., on a central server), leaving the on-device computation resources to the model inference process. Therefore, it is still less common to train ML models on edge nodes or servers. However, the rationale of edge intelligence is to utilize the data generated or collected by the edge devices and to train the model locally rather than sending it to a central server.


This is because the edge training prototype can effectively address privacy and network issues; it offers a more secure and robust model training step on which the entire practical learning service can be built. In this section, we mainly focus on one of the most important procedures of AI on the edge: how to train models at the network edge. To that end, we introduce the basic architectures of edge training and the corresponding optimization techniques, followed by the concepts of federated learning as the most common approach for this purpose.

6.3.1 Architecture Because edge servers are typically not as powerful as central servers, we need to design training architectures and methods that are suitable for the edge. The deployment of the model training architecture mainly depends on the computation capability of the edge servers or devices. If the computation capability becomes comparable to that of central servers (e.g., through the continuous upgrading of hardware), an architecture similar to a traditional centralized deployment can be adopted and the training task performed on a single device independently (called solo training). For example, there are reports about constrained training and inference processes on edge devices [41, 42]. However, along with the development of deep learning, the parameters of learning networks have become more and more complex, which may consume many more resources for computation, networking, and storage. In this case, because the power of an edge node may no longer be sufficient to handle most of the training procedure, the model needs to be co-trained among devices (called collaborative training); in other words, several devices and servers work together to perform training tasks at the edge. Collaborative training methods are classified into two strategies: master-slave and peer-to-peer [43].
Master-Slave The master-slave type can be recognized as a centralized-distributed architecture with a hierarchical relationship, where the master node owns abundant resources for computation and storage and is usually responsible for storage management, task scheduling, and parts of the data-relevant maintenance within the architecture, while the slave nodes usually provide data processing functions only. The master-slave architecture is appropriate for application scenarios in which slaves are willing to share the required information to allow centralized decision-making. However, sending the collected information to the master node and distributing the adaptation plans may impose a significant communication overhead. Moreover, the solution may be problematic in large-scale distributed systems, where the master may become a bottleneck. Federated learning [44] is a typical example of the master-slave architecture, where a server uses multiple devices and assigns training tasks to them.


Fig. 6.3 Peer-to-peer architecture

Peer-to-Peer The peer-to-peer type (Fig. 6.3) can be identified as a decentralized-distributed flat architecture, where every node is treated equally and the architecture can be established in a self-organized manner among multiple edge devices. We can also apply knowledge transfer, even cross-model transfer learning, between edge devices to speed up training and to cope with label deficiency. The peer-to-peer prototype can fully utilize neighbors for outsourcing computations so that every available resource at the edge of the network can be exploited. It is a well-established communication model that can aid in realizing computing at the edge, because such systems allow endpoints to cooperate with each other to achieve common goals [45]. Peer-to-peer also shows potential for handling distributed infrastructures in a scalable manner. Because of this scalability, it is suitable for scenarios with a large number of devices, such as the IoT. However, the IoT also includes resource-constrained nodes that cannot implement complex communication mechanisms. Nevertheless, such nodes can still join peer-to-peer applications by connecting to a compute node that acts as a gateway to/from the peer-to-peer network. The peer-to-peer model can also benefit from edge computing, since stable resources at the edge can aid in resolving the fault tolerance and transient availability issues of such systems.
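To illustrate the peer-to-peer idea, the following is a minimal sketch of gossip-style parameter averaging, in which each node repeatedly averages its model parameters with a randomly chosen neighbor. This is a generic illustration under our own simplifying assumptions (NumPy, a fixed ring of neighbors, synchronous rounds), not a specific protocol from the references above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each peer holds its own copy of the model parameters (here, a small vector).
num_peers, dim = 5, 4
params = [rng.normal(size=dim) for _ in range(num_peers)]

# A simple ring topology: each peer can talk to its two neighbors.
neighbors = {i: [(i - 1) % num_peers, (i + 1) % num_peers] for i in range(num_peers)}

def gossip_round(params, neighbors):
    """One synchronous gossip round: every peer averages with one random neighbor."""
    new_params = [p.copy() for p in params]
    for i in range(len(params)):
        j = rng.choice(neighbors[i])
        new_params[i] = (params[i] + params[j]) / 2.0   # peer i adopts the pairwise average
    return new_params

for _ in range(30):
    params = gossip_round(params, neighbors)

# After enough rounds, all peers approach the same (consensus) parameters.
print(np.round(np.std(np.stack(params), axis=0), 4))
```

In a training setting, each peer would interleave such averaging rounds with local gradient steps on its own data, which is the decentralized counterpart of the master-slave aggregation described earlier.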


6.3.2 Training Optimization After adopting model training at the edge, we need to further consider the optimization of the training process. The principle of optimization is to take data distribution, computing power, and the network fully into consideration, and to make distributed edge deployment feasible. Since solo training is similar to the centralized architecture, our attention in this chapter is mainly on collaborative training. In this context, training optimization refers to optimizing the training process to meet requirements such as time efficiency, energy consumption, accuracy, and privacy protection. Here, the primary consideration is how to accelerate the training process on resource-constrained edge devices; the approaches can be categorized into the following three main directions.
Processing Efficiency The complexity of the training model is one of the most important factors affecting time efficiency when the computing resources on the device are insufficient. To speed up the process, we can shorten the training time by applying transfer learning, or by transferring learned features and caching them locally for secondary training. Meanwhile, edge devices are able to learn from each other to improve training efficiency.
Communication Efficiency To achieve communication efficiency, we focus on reducing the communication frequency and cost; in other words, reducing the number of communications and the size of each communication is key to reducing communication costs. For example, the authors of [46] introduced FedAvg to reduce the number of communication rounds for the updates. In general, distributed training looks at how to make synchronous stochastic gradient descent (SGD) faster or how to make asynchronous SGD converge to a better solution. For example, elastic averaging can reduce the communication cost of synchronous and asynchronous SGD training methods by allowing each device to perform more local training computations and to deviate/explore further from a globally shared solution before synchronizing its updates [47]. In addition to the frequency of training updates, the size of the training updates also has an impact on bandwidth usage. We can use gradient compression methods (such as gradient quantization [48] and gradient sparsification [49]) to effectively reduce the size of updates, thereby improving communication efficiency.
Privacy and Security Although local training on edge devices improves data privacy and training security (e.g., by not sharing data among the edge devices), the gradient information communicated between edge devices can still indirectly disclose information about private data. Therefore, further privacy-enhancing techniques are needed. In this section, we consider two main classes of privacy-enhancing techniques: adding noise to gradients or to data transferred as part of training, and secure computation for training DNNs.
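As an illustration of one of the gradient compression ideas mentioned above, the following is a minimal sketch of top-k gradient sparsification: each device sends only the k largest-magnitude gradient entries (values plus indices) instead of the full dense gradient. The NumPy-based implementation and parameter names are our own illustrative assumptions, not the exact schemes of [48, 49].

```python
import numpy as np

def sparsify_top_k(grad: np.ndarray, k: int):
    """Keep only the k entries with the largest magnitude; return (indices, values)."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest magnitudes
    return idx, flat[idx]

def densify(indices, values, shape):
    """Rebuild a dense gradient (zeros everywhere except the transmitted entries)."""
    flat = np.zeros(int(np.prod(shape)))
    flat[indices] = values
    return flat.reshape(shape)

rng = np.random.default_rng(42)
grad = rng.normal(size=(4, 8))          # a fake local gradient
idx, vals = sparsify_top_k(grad, k=6)   # transmit only 6 of 32 entries
recovered = densify(idx, vals, grad.shape)

# Compression ratio and the approximation error introduced by sparsification.
print(f"sent {len(vals)}/{grad.size} entries, "
      f"relative error {np.linalg.norm(grad - recovered) / np.linalg.norm(grad):.2f}")
```

Practical systems usually also keep the untransmitted residual locally and add it to the next round's gradient (error feedback); that refinement is omitted here for brevity.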


6.3.3 Federated Learning In this subsection, we elaborate on federated learning, including its model training and the corresponding optimization operations. Federated learning was first proposed by Google to allow mobile phones to collaboratively learn a shared model by training on their respective local data [46], rather than uploading all the data to a central cloud server for global training. It is a common distributed learning framework that allows datasets and training models to reside at different, non-centralized locations, so that the training process can be performed regardless of time and location. The learning process of federated learning is shown in Fig. 6.4. As we can see, the untrained shared model is first distributed from the federated (central) node to the edge nodes individually, and then the training server on each edge node takes its local data as the model input for training. After the model is trained by each edge node, the updated parameters of each edge node are sent separately to the central server through encrypted communication. The central server averages the updated parameters it has received from the edge nodes and updates the original shared model with the averaged result (called FedAvg). Edge nodes download the updated model to their devices and repeat the process to continuously improve the generalization and accuracy of the shared model. During this learning process, only encrypted parameters are uploaded to the cloud, and the local training data never leave the mobile device.

Fig. 6.4 Federated Learning Architecture
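The following is a minimal sketch of the averaging step described above (in the spirit of FedAvg [46]) for a simple linear model trained with local gradient descent. It is a single-machine NumPy simulation; the data generation, learning rate, and round counts are illustrative assumptions, and a real deployment would add encrypted communication, client sampling, and weighting by local dataset size.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])

# Three edge nodes, each with its own local dataset (never shared).
local_data = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    local_data.append((X, y))

def local_update(w, X, y, lr=0.1, epochs=5):
    """A few epochs of full-batch gradient descent on one node's local data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

w_global = np.zeros(2)
for round_ in range(10):
    # Each node starts from the current shared model and trains locally.
    local_models = [local_update(w_global.copy(), X, y) for X, y in local_data]
    # The server averages the returned parameters to update the shared model.
    w_global = np.mean(local_models, axis=0)

print(np.round(w_global, 2))  # close to the underlying true weights [2, -1]
```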


We formulate this method as follows. Assume there are $L$ distributed edge nodes $E_0, \ldots, E_L$ holding sample sets $X_0, \ldots, X_L$, where each

X_l = \begin{bmatrix} x_{l0} \\ \vdots \\ x_{lm} \end{bmatrix}

involves $m$ samples on the $l$th edge node. All entries on the $l$th edge node, together with their labels, can be written as $[(x_{l0}^0, \ldots, x_{l0}^d, y_{l0}), \ldots, (x_{lm}^0, \ldots, x_{lm}^d, y_{lm})]$. To implement a shared model under the federated learning framework, the key idea is to calculate the parameters $g_i$ and $h_i$ (the first- and second-order gradient statistics used by XGBoost) at each local party and then pass them to the central aggregator, which determines an optimal split and iteratively (e.g., by averaging) updates the model. In short, XGBoost under the FL framework is summarized below; a simplified split-finding sketch follows the list.
• Each edge node downloads the latest XGBoost model from the central aggregation server.
• Each edge node uses its local data to train the downloaded XGBoost model and uploads the gradients to the central aggregation server.
• The server aggregates the gradients of each user to update the model parameters.
• The central aggregation server distributes the updated model to each party.
• Each edge node updates its local model accordingly.
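The following is a minimal sketch of the split-finding step implied above: each party computes sums of its local first- and second-order gradient statistics ($g_i$, $h_i$) on either side of a candidate threshold, and the aggregator adds them up and scores the split with the standard XGBoost gain formula. It is a didactic simulation only (no encryption, a single feature, hand-picked thresholds), and all names and constants are our own assumptions.

```python
import numpy as np

def local_histogram(x, g, h, threshold):
    """One party's contribution: summed (g, h) for samples left/right of the threshold."""
    left = x <= threshold
    return (g[left].sum(), h[left].sum(), g[~left].sum(), h[~left].sum())

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Standard XGBoost split gain computed from aggregated statistics."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

rng = np.random.default_rng(7)
parties = []
for _ in range(3):                       # three edge nodes with private data
    x = rng.uniform(0, 10, size=200)     # a single feature
    y = (x > 6).astype(float)            # the "true" pattern behind the labels
    pred = np.full_like(y, 0.5)          # current model prediction
    g, h = pred - y, pred * (1 - pred)   # logistic-loss gradients and Hessians
    parties.append((x, g, h))

best = None
for threshold in np.linspace(1, 9, 17):  # candidate splits proposed by the aggregator
    # Each party reports only aggregated statistics, never raw samples.
    stats = np.array([local_histogram(x, g, h, threshold) for x, g, h in parties])
    GL, HL, GR, HR = stats.sum(axis=0)
    gain = split_gain(GL, HL, GR, HR)
    if best is None or gain > best[0]:
        best = (gain, threshold)

print(f"best split at x <= {best[1]:.1f} (gain {best[0]:.2f})")  # near the true cut at 6
```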


In this architecture, we also need to consider the optimization concerns of federated learning. Edge devices in federated learning are generally assumed to be smartphones with unreliable or slow network connections, and they might not work continuously due to their unpredictable movement; therefore, ensuring communication efficiency between the smartphones and the central server is crucial to the entire training process. Specifically, the factors that affect communication efficiency are listed below.
Communication Efficiency In federated learning, communication between the edge nodes and the aggregation server is the most important operation; it uploads the updates from the edge devices to the cloud server and downloads the aggregated update of the shared model to the local models. Due to the possibly unreliable network conditions of edge devices, minimizing the number of update rounds (i.e., the communication frequency) between the edge devices and the cloud server is necessary. One of the most practical methods is to replace the commonly used synchronous updating methods with asynchronous ones. A synchronous update first uploads the updates from the edge devices to the central server and then aggregates them to update the shared model; the central server then distributes the aggregated update to each edge device (in each update round). However, this assumption of synchronous updates is generally unachievable for several reasons: firstly, edge devices have significantly heterogeneous computing resources, and local models may be trained for a long time on each edge device; secondly, the connection between the edge devices and the central server is unstable; and thirdly, edge devices may be intermittently available or respond with longer delays due to poor connectivity. Therefore, an asynchronous, step-by-step update method may effectively reduce the communication frequency and improve operational efficiency. Communication cost is another factor that affects the communication efficiency between edge devices and the central server; reducing the communication cost can significantly save bandwidth and improve communication efficiency.
Privacy and Security In federated learning, the data never leaves the local device, and thus it is assumed to be relatively safe. However, the entire training process may still be attacked, and data or models could be corrupted. Under the ideal assumption of federated learning, the updates provided by each edge node are harmless and all contribute positively to the central shared model. In practical scenarios, however, there might be malicious edge nodes that upload erroneous updates to the central aggregation server, ultimately causing the model training process to fail. Such attacks can be divided into data-poisoning and model-targeted attacks. Data poisoning compromises the behavior and performance of a model (e.g., its accuracy) by altering the training set, while a model-targeted attack changes the model's behavior only for particular inputs, without affecting the performance on other inputs. In fact, data-poisoning attacks are usually less harmful to federated learning, as the number of harmful nodes is usually small and their arbitrary updates are offset by average aggregation. Conversely, model-targeted attacks have more impact, as attackers can directly corrupt the global shared model, thereby affecting thousands of participants simultaneously.

6.4 Edge Inference Edge inference is another major component of edge intelligence. Modern neural networks are getting larger, deeper, and more complex and require ever more computing resources, which makes it quite difficult to run high-performance models directly on edge devices with limited computing resources, such as mobile devices, IoT terminals, and embedded devices. Nevertheless, edge inference has to be performed at the edge, where its overall performance (e.g., execution time, accuracy, and energy efficiency) can be severely limited by the devices' capabilities. In this section, we discuss various frameworks and approaches for bridging the gap between task demand and device supply.


6.4.1 Model Design Many recent works have focused on designing lightweight neural network models that can run on edge devices with lower hardware requirements. According to the approach to model design, the existing literature can be divided into two categories: architecture search and human-invented architectures. The former lets the machine automatically design the optimal architecture, while the latter relies on architectures designed by humans.
Architecture Search Designing architectures by hand is quite time-consuming and usually requires substantial expert effort. Therefore, using artificial intelligence to search candidate architectures and pick the optimal one for the edge environment is a more efficient method. Currently, automatically searched architectures such as NASNet [50] and AmoebaNet [51] can achieve competitive or even better performance in classification and recognition. However, although architecture search shows good capabilities in model design, its high hardware requirements still prevent it from becoming popular.
Human-Invented Architecture Although architecture search has shown great promise for model design, it cannot yet overcome these hardware requirements. Thus, researchers pay more attention to human-invented strategies, for example, building lightweight deep neural networks for mobile and embedded devices using depth-wise separable convolutions, as in MobileNets [52], or using group convolution, another way to reduce computational cost, to design basic architectures like Xception [53].
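To make the depth-wise separable idea concrete, the following is a minimal PyTorch sketch of such a block and a rough parameter-count comparison with a standard convolution. PyTorch is assumed to be available, and this is an illustrative block of our own, not the actual MobileNets [52] implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a depthwise conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Rough parameter comparison against a standard 3x3 convolution.
std = nn.Conv2d(64, 128, 3, padding=1, bias=False)
sep = DepthwiseSeparableConv(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(sep))  # the separable block uses far fewer parameters
```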

6.4.2 Model Compression Although DNNs have been widely discussed and applied due to their high performance and other strengths, their shortcomings in computational complexity cannot be ignored, especially those that prevent DNN methods from being employed on resource-constrained edge devices. Excessive power consumption and latency can affect the performance of the entire system or even cause a system crash, as most edge devices are not designed for compute-intensive tasks. Several methods have been applied to overcome this issue. Firstly, the development of specific chips for deep learning has gained attention in recent years; this approach accelerates a given task using dedicated hardware. Another solution is to deploy software-based approaches that consider whether all computations in the model are actually necessary. If they are not, it is possible to simplify the models and reduce the computational load and storage footprint. Similar to many other machine learning methods, deep neural networks can be divided into two phases: training and inference. The training phase learns


the parameters in the model based on the training dataset; in the inference phase, test data is fed into the model to derive the final result. In this context, over-parameterization refers to the situation where many parameters are needed during the training phase (e.g., to capture the fluctuations of the model), but once training is finished, not nearly as many parameters are needed in the inference phase. This allows the trained model to be simplified before it is deployed at the edge for subsequent inference. There are many benefits to model simplification, including but not limited to the following.
1. The amount of computation is reduced, thereby reducing computation time and power consumption.
2. The memory footprint becomes smaller, so the model can be deployed on low-end devices.
3. Smaller packages are used for application releases and updates.
This software-based approach is referred to as model compression; it has a low application cost and does not conflict with hardware acceleration; in fact, the two can even benefit from each other. Model compression can be divided into four major methods: network pruning, quantization, knowledge distillation, and low-rank factorization.

6.4.2.1 Network Pruning

DNN models usually have a large number of redundant parameters, from the convolutional layers to the fully connected layers; this is known as model over-parameterization. Specifically, each layer passes information to the next layer according to the respective weights of its neurons. After training, the weights of some neurons are close to zero, indicating that they have little effect on the information transmission of the model. If neurons with negligible weights are deleted, the model is effectively simplified without affecting its accuracy. This method is called network pruning, and it is the most common method for model compression. The aim is to obtain smaller, more efficient neural networks without losing accuracy, which is especially critical when deploying models to mobile or other edge devices. Due to its cascading effects, network pruning is divided into the following steps (Fig. 6.5).
1. Train the network.
2. Evaluate the importance of weights and neurons. Thresholds, for example, can be used to evaluate the importance of weights, and the number of times a neuron's output is not "0" can be used to measure the importance of the connected neurons.
3. Sort the weights or neurons by importance and remove the unimportant ones.
4. Fine-tune/re-train the new network using the original training data, because removing weights or neurons can reduce the accuracy of the network.


Fig. 6.5 Network pruning

5. To avoid excessive damage to the model, we do not prune too many weights or neurons at the same time. In other words, this process requires iteration: after each round of pruning and fine-tuning, the model is re-evaluated to ensure its quality. If the pruning proves unsatisfactory, the model is reverted to its previous state; otherwise, the iteration continues by pruning other weights or neurons.
The basic idea of network pruning is to prune the unimportant parts of a network, as mathematically described below:

\min_{\omega} L(D; \omega) + \lambda \|\omega\|_{0} ,    (6.1)

We can rewrite this formula in the following constrained optimization form:

\min_{\omega} L(D; \omega) \quad \text{s.t.} \quad \|\omega\|_{0} \leq \kappa ,    (6.2)

where $\kappa$ is the pruning budget, i.e., the maximum number of nonzero weights. Because of the L0 norm, network pruning becomes a combinatorial optimization problem.


Network pruning can be divided into two categories: structured pruning and unstructured pruning. Some early methods were mainly based on unstructured pruning, i.e., pruning at the granularity of individual neurons. Unstructured pruning is more fine-grained and can remove "redundant" parameters up to any desired proportion of the network. There are several advantages to pruning parameters directly. First, the approach is simple, since replacing the weight values with zeros in the parameter tensors is enough to prune the connections. Such fine granularity also allows pruning of very subtle patterns, such as individual parameters within convolution kernels. Since pruning individual weights is free from any structural constraint, this paradigm is called unstructured pruning. However, this approach has a major disadvantage: because most frameworks and hardware cannot accelerate sparse matrix computations, no matter how many zeros fill the parameter tensors, the actual cost of running the network is not reduced. Moreover, the network structure after pruning is usually irregular and thus difficult to accelerate effectively. Structured pruning is a way to solve this issue, as it directly alters the network architecture so that any framework can handle it. Structured pruning can be further subdivided into channel-wise, filter-wise, or shape-wise approaches. The granularity of structured pruning is relatively coarse, and the smallest unit of pruning is a group of parameters in a filter. By setting evaluation factors for filters or feature maps, it is even possible to remove entire filters or parts of channels to narrow the network, resulting in effective speedups directly on existing software/hardware. This is why many works have focused on pruning larger structures, such as entire neurons or their direct equivalent in modern deep convolutional networks, convolutional filters. Because large networks tend to contain many convolutional layers, each with hundreds or thousands of filters, filter pruning offers a granularity that is both exploitable and sufficiently fine. Removing such structures not only results in sparse layers that can be directly instantiated as thinner layers but also eliminates the feature maps that are the outputs of those filters. Therefore, the pruned networks are easier to store, due to their fewer parameters, and also require less computation and memory, due to their lighter intermediate representations. In fact, sometimes reducing the bandwidth of intermediate representations is more beneficial than reducing the number of parameters. Once we have decided which structures to prune, the next question is how to decide which structures to keep and which to prune. The following is a list of the most commonly used strategies.
Pruning based on weight magnitude A very intuitive and very effective way is to prune "redundant weights," i.e., the weights with the smallest absolute value (or magnitude). Despite its simplicity, the magnitude criterion is still widely used in state-of-the-art methods, making it a common strategy in the field. Although this strategy is easy to implement for unstructured pruning, adapting it to structured pruning requires some effort. A straightforward approach is to sort the filters according to their norm (e.g., L1 or L2). Another approach is to encapsulate sets of parameters


in a single metric, for example, a convolutional filter, its bias, and its batch normalization parameters, or even the corresponding filters in parallel layers. A more advanced approach is to insert a learnable multiplicative parameter for each feature map after each set of layers, so that the set can be pruned without needing to compute the combined norm of its parameters. When this parameter is reduced to zero, the entire set of parameters responsible for this channel is effectively pruned, and the magnitude of this parameter accounts for the importance of all of them.
Pruning based on gradient magnitude Another major strategy is to use the magnitude of the gradient. Popular implementations of this method accumulate gradients over mini-batches of training data and prune based on the product of this gradient and the corresponding weight of each parameter.
Besides choosing what to prune, we should also adjust the training procedure; the following are the most common approaches for this purpose, and a small sketch combining magnitude pruning with fine-tuning is given after this list.
Training, pruning, and fine-tuning In this method, we first train a network and then prune it by zeroing out the parameters selected by the targeted pruning structure and criterion. Finally, we train the network with a low learning rate for a few additional epochs to give it a chance to recover from the performance loss caused by pruning. The last two steps (pruning and re-training) are usually iterated, each time increasing the pruning rate.
Sparse training This method rests on the underlying assumption of training under sparsity constraints. The principle enforces a constant sparsity rate during training; the distribution of that rate can be changed and/or gradually adjusted during the process. It consists of the following steps.
1. Initialize a network with a random mask that prunes a certain proportion of the network.
2. Train the pruned network for one epoch.
3. Prune a certain number of weights (e.g., those with the lowest magnitude).
4. Randomly re-grow the same number of pruned weights.
In this method, although the pruning mask is random at first, it gradually adjusts to target the least important weights. The sparsity level can be adjusted for each layer or set globally. Other methods extend sparse training by using some criterion to re-grow weights instead of selecting them randomly. Sparse training periodically prunes and grows different weights during training; this results in adjusted masks that should target only the relevant parameters.
Penalty-based methods Rather than manually pruning connections or penalizing auxiliary parameters, many methods impose various penalties on the weights themselves, which causes them to gradually shrink to "0."
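The following is a minimal sketch of one train–prune–fine-tune round with the magnitude criterion: a fixed fraction of the smallest-magnitude weights is zeroed out, and a mask keeps them at zero during subsequent updates. It uses plain NumPy on a toy linear model; the prune fraction and training details are our own illustrative assumptions rather than a method from any specific paper.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20); true_w[:5] = rng.normal(size=5)   # only 5 weights really matter
y = X @ true_w + 0.05 * rng.normal(size=200)

def train(w, mask, epochs=200, lr=0.05):
    """Gradient descent on squared error; pruned weights stay at zero via the mask."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

w = np.zeros(20)
mask = np.ones(20)
w = train(w, mask)                         # 1) train the dense model

prune_fraction = 0.7                       # 2) prune the 70% smallest-magnitude weights
threshold = np.quantile(np.abs(w), prune_fraction)
mask = (np.abs(w) > threshold).astype(float)
w = w * mask

w = train(w, mask, epochs=50, lr=0.01)     # 3) fine-tune the remaining weights

print(f"{int(mask.sum())} of {mask.size} weights kept, "
      f"fit error {np.linalg.norm(X @ w - y) / np.linalg.norm(y):.3f}")
```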

6.4.2.2 Quantization

Like other compression methods, model quantization is based on a consensus: models with complex, high-precision representations are required at training time but not at the inference stage. Furthermore, experiments show that neural networks are robust to noise, and quantization can be seen as a form of noise. Therefore, we can simplify the model before deploying it, and reducing the representation precision is one of the important means of simplification. Most deep learning training frameworks use 32-bit floating-point parameters by default, and calculations are also performed in 32-bit floating point. The basic idea of model quantization is to replace the original floating-point precision with a lower precision (such as an 8-bit integer). The core challenge of quantization is how to reduce the precision of the representation without degrading the accuracy of the model, that is, to make a trade-off between the compression ratio and the loss of accuracy. There are three main quantization objects.
Weight The quantization of weights is the most conventional and most common. Quantizing weights reduces the model size and memory footprint.
Activation In practice, activations often account for the bulk of memory usage, so quantizing activations not only greatly reduces the memory footprint; more importantly, combined with weight quantization, it makes full use of integer computation to gain performance improvements.
Gradient Its usage is slightly less common than the above two, because it mainly applies to training. Its main function is to reduce communication overhead in distributed computing, and it can also reduce the backward-pass overhead during single-machine training.
The basic principles are as follows. Float-to-integer quantization maps a segment of the real domain to an integer range. If linear quantization is used, its general form can be expressed as

q = \mathrm{round}(s \cdot x + z) ,    (6.3)

where $x$ and $q$ represent the numbers before and after quantization, respectively, $s$ is the scaling factor, and $z$ is the zero point. The zero point is the quantized value of 0 in the original range. There are many zeros in weights and activations (e.g., from padding or after ReLU), so we need to ensure that the real 0 is accurately representable during and after quantization. To keep the quantized values within a specified range, the scaling factor can be computed as

s = \frac{2^{n} - 1}{\max_x - \min_x} ,    (6.4)


where $\max_x$ and $\min_x$ are the upper and lower bounds of the dynamic range of the quantized object (such as a weight or activation), respectively, and $n$ is the number of quantization bits. If the dynamic range is very wide, or there are some extreme outliers, bits will be wasted on sparsely populated areas, resulting in a loss of useful precision. Therefore, another method is to clip the dynamic range first and cut off the areas that carry little information. This can be described as

q = \mathrm{round}(s \cdot \mathrm{clip}(x, \alpha, \beta) + z) ,    (6.5)

where $\alpha$ and $\beta$ are the lower and upper bounds of the clip, respectively. In the above method, since neither the original range nor the quantized range is required to be symmetrical around 0, it is called asymmetric quantization. If the zero point above is set to 0, it is called symmetric quantization; symmetric quantization can thus be regarded as a special case of asymmetric quantization. The advantage of asymmetric quantization is that it makes better (i.e., full) use of the bits compared with symmetric quantization. For example, if the dynamic range is severely asymmetrical around 0 (such as activations after ReLU), symmetric quantization yields a significant loss of useful information. Therefore, asymmetric quantization is more efficient but more computationally expensive. To demonstrate the difference, take multiplication as an example: for symmetric quantization, the multiplication of two quantized values is simply the multiplication of the original values scaled by the scaling factors, $s_x x \times s_y y = s_x s_y xy$; for asymmetric quantization, however, it is calculated as $(s_x x + z_x) \times (s_y y + z_y) = s_x s_y xy + s_x x z_y + s_y y z_x + z_x z_y$. There are pros and cons to using either symmetric or asymmetric quantization, depending on where they are used. Quantization methods can also be classified as uniform or non-uniform. In uniform quantization, which is the simpler case, the distance between quantized levels is equal; in some cases information is lost because some regions of values are much denser than others. Non-uniform quantization addresses this issue by allocating unequal intervals between quantization levels; log quantization, for example, is a practical use case that benefits from this type. The number of bits used for quantization can also vary. It can be roughly divided into several categories: Float16, Float8, etc. Float16 quantization is a safer approach, and in most cases there is a significant performance improvement without losing too much precision. Float8 uses 8 bits and is more common; there are many related studies, and various mainstream frameworks support it. Below 8 bits are 4, 2, and 1 bits (powers of 2 perform better and are easier to implement). If the precision is as low as 1 bit, it is called binarization, and it can be computed using bitwise operations, which are the most processor-friendly. Depending on whether training is required in the quantization process, it can be divided into two categories.


Post-training quantization (PTQ) As the name suggests, this is quantization applied after a model is fully trained. It can be subdivided into two types.
• Calibration data is required. In this type, data is used to obtain the quantization parameters (e.g., through statistical analysis). Labeling of the data is not required, and a significant amount of data is generally used.
• No dataset is required. This type is suitable for scenarios where the training environment and data are completely unavailable.
Quantization-aware training (QAT) This can involve re-training or fine-tuning. It is used mainly because quantization to 4 bits (or below) requires training intervention in many methods to compensate for the loss of information.
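As a concrete illustration of Eqs. (6.3)–(6.5) in a PTQ-like setting, the following is a minimal NumPy sketch of asymmetric 8-bit linear quantization: the scale and zero point are derived from the observed (optionally clipped) range of a tensor, and values are quantized and then dequantized to measure the rounding error. The clipping percentile and other constants are our own illustrative choices.

```python
import numpy as np

def calibrate(x, n_bits=8, clip_percentile=None):
    """Derive (scale, zero_point) from the observed range, optionally clipped."""
    if clip_percentile is not None:
        lo = np.percentile(x, clip_percentile)            # lower clip bound (alpha)
        hi = np.percentile(x, 100 - clip_percentile)      # upper clip bound (beta)
    else:
        lo, hi = x.min(), x.max()
    scale = (2 ** n_bits - 1) / (hi - lo)                 # Eq. (6.4)
    zero_point = np.round(-lo * scale)                    # integer that represents real 0
    return scale, zero_point, lo, hi

def quantize(x, scale, zero_point, lo, hi, n_bits=8):
    x = np.clip(x, lo, hi)                                # clip as in Eq. (6.5)
    q = np.round(scale * x + zero_point)                  # Eq. (6.3)
    return np.clip(q, 0, 2 ** n_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) / scale

rng = np.random.default_rng(0)
w = rng.normal(0, 1, size=10_000).astype(np.float32)      # fake weight tensor
scale, zp, lo, hi = calibrate(w, clip_percentile=0.1)     # clip the extreme 0.1% tails
q = quantize(w, scale, zp, lo, hi)
w_hat = dequantize(q, scale, zp)

print(f"mean abs error {np.mean(np.abs(w - w_hat)):.4f}, "
      f"storage {q.nbytes} bytes vs {w.nbytes} bytes")
```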

6.4.2.3 Knowledge Distillation

Knowledge distillation is another model compression method; it is based on the idea of a "teacher-student network." More specifically, it "distills" the "knowledge" obtained from a trained model into another model. In this method, the teacher is the exporter of the knowledge, and the student is its receiver. The process of knowledge distillation is divided into two stages (Fig. 6.6).
Teacher model: original model training The teacher model, referred to as Net-T for short, is a relatively complex model and can also be an ensemble of multiple separately trained models. There is practically no restriction on the architecture, parameter count, or composition of the teacher model. The only requirement is that, for an input X, it outputs Y, where Y is mapped by softmax so that the output values correspond to the probabilities of the corresponding categories.

Fig. 6.6 Steps of knowledge distillation


Student model: simplified model training The student model, referred to as Net-S for short, is a single model with a small number of parameters and a relatively simple structure. Similarly, for an input X, it outputs Y, where Y is again mapped by softmax to the probabilities of the corresponding categories. These stages are shown in Fig. 6.6. Neural networks typically produce class probabilities by using a "softmax" output layer that converts the logit $z_i$ computed for each class into a probability $q_i$ by comparing $z_i$ with the other logits:

q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} ,    (6.6)

where $T$ is a temperature that is normally set to 1. Using a higher value for $T$ produces a softer probability distribution over the classes. In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set, produced by running the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1. When the correct labels are known for all (or some) of the transfer set, this method can be significantly improved by also training the distilled model to produce the correct labels. One way to do this is to use the correct labels to modify the soft targets. The other way, which proves to be better, is simply to use a weighted average of two different objective functions. The first objective function is the cross-entropy with the soft targets; this cross-entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross-entropy with the correct labels; this cross-entropy is computed using exactly the same logits in the softmax of the distilled model but at $T = 1$. It has been found that the best results are generally obtained by using a considerably lower weight on the second objective function. Since the magnitudes of the gradients produced by the soft targets scale as $1/T^2$, it is important to multiply them by $T^2$ when using both hard and soft targets. This ensures that the relative contributions of the hard and soft targets remain roughly unchanged if the temperature used for distillation is changed while experimenting with meta-parameters. Each case in the transfer set contributes a cross-entropy gradient, $\partial C / \partial z_i$, with respect to each logit $z_i$ of the distilled model. If the cumbersome model has logits $v_i$ that produce soft target probabilities $p_i$ and the transfer training is done at a temperature of $T$, the gradient can be calculated as

\frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left( \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}} \right) .    (6.7)


If the temperature is high compared with the magnitude of the logits, we can approximate

\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left( \frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T} \right) .    (6.8)

If we assume that the logits have been zero-meaned separately for each transfer case, so that $\sum_j z_j = \sum_j v_j = 0$, the above equation can be further simplified to

\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^{2}}(z_i - v_i) .    (6.9)

So, in the high-temperature limit, distillation is equivalent to minimizing $\frac{1}{2}(z_i - v_i)^2$, provided the logits are zero-meaned separately for each transfer case. At lower temperatures, distillation pays much less attention to matching logits that are much more negative than the average. This is potentially advantageous, because such logits are almost completely unconstrained by the cost function used for training the cumbersome model and could therefore be very noisy. On the other hand, the very negative logits may convey useful information about the knowledge acquired by the cumbersome model. Which of these effects dominates is an empirical question. When the distilled model is much too small to capture all of the knowledge in the cumbersome model, intermediate temperatures work best; this strongly suggests that ignoring the large negative logits can be helpful.
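The weighted combination of soft- and hard-target objectives described above can be written compactly in code. The following is a minimal NumPy sketch of a distillation loss with temperature $T$ and weight $\alpha$ on the soft term (including the $T^2$ rescaling); the constants and array shapes are illustrative assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)       # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """alpha * T^2 * CE(soft teacher targets, soft student) + (1 - alpha) * CE(labels, student)."""
    p = softmax(teacher_logits, T)               # soft targets from the cumbersome model
    q = softmax(student_logits, T)               # student distribution at the same T
    soft_ce = -(p * np.log(q + 1e-12)).sum(axis=-1).mean()
    q1 = softmax(student_logits, T=1.0)          # hard-label term uses T = 1
    hard_ce = -np.log(q1[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 10)) * 3.0         # fake logits for 8 samples, 10 classes
student = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
print(round(distillation_loss(student, teacher, labels), 3))
```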

6.5 Summary This chapter introduced the basic concepts and related technologies of edge intelligence, identifying its key components and discussing them separately. In addition, we focused on the edge computing infrastructure and platforms of edge intelligence and on the real-time, distributed implementation of ML/AI algorithms in resource-constrained edge environments, and we considered the challenges they face, such as communication efficiency, security, privacy, and cost efficiency. With the explosive development of the Internet of Things and artificial intelligence, we believe that the deployment of edge intelligence systems, algorithms, and services will flourish in the next decade.


References 1. Flavio Bonomi, Rodolfo Milito, Jiang Zhu, and Sateesh Addepalli. Fog computing and its role in the internet of things. In Proceedings of the first edition of the MCC workshop on Mobile cloud computing, pages 13–16, 2012. 2. Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943. 3. Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius B Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago M Paixao, Filipe Mutz, et al. Self-driving cars: A survey. Expert Systems with Applications, 165:113816, 2021. 4. Abhishek Gupta, Alagan Anpalagan, Ling Guan, and Ahmed Shaharyar Khwaja. Deep learning for object detection and scene perception in self-driving cars: Survey, challenges, and open issues. Array, 10:100057, 2021. 5. Alexander Cui, Sergio Casas, Abbas Sadat, Renjie Liao, and Raquel Urtasun. Lookout: Diverse multi-future prediction and planning for self-driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16107–16116, 2021. 6. Basit Ali and Vadlamani Ravi. Developing dialog manager in chatbots via hybrid deep learning architectures. In Intelligent data engineering and analytics, pages 301–310. Springer, 2021. 7. Weijiao Huang, Khe Foon Hew, and Luke K Fryer. Chatbots for language learning—are they really useful? a systematic review of chatbot-supported language learning. Journal of Computer Assisted Learning, 38(1):237–257, 2022. 8. Chuang Gan, Siyuan Zhou, Jeremy Schwartz, Seth Alter, Abhishek Bhandwaldar, Dan Gutfreund, Daniel LK Yamins, James J DiCarlo, Josh McDermott, Antonio Torralba, et al. The threedworld transport challenge: A visually guided task-and-motion planning benchmark for physically realistic embodied ai. arXiv preprint arXiv:2103.14025, 2021. 9. Fengli Zhang, Qianzhe Qiao, Jinjiang Wang, and Pinpin Liu. Data-driven ai emergency planning in process industry. Journal of Loss Prevention in the Process Industries, page 104740, 2022. 10. Diaz Jorge-Martinez, Shariq Aziz Butt, Edeh Michael Onyema, Chinmay Chakraborty, Qaisar Shaheen, Emiro De-La-Hoz-Franco, and Paola Ariza-Colpas. Artificial intelligence-based kubernetes container for scheduling nodes of energy composition. International Journal of System Assurance Engineering and Management, pages 1–9, 2021. 11. Fateh Boutekkouk. Ai-based methods to resolve real-time scheduling for embedded systems: A review. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 15(4):1–44, 2021. 12. Toka Haroun, Vikas Rao Naidu, and Aparna Agarwal. Artificial intelligence as futuristic approach for narrative gaming. In Deep Learning in Gaming and Animations, pages 37–64. CRC Press, 2021. 13. Wei-Jen Ko, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzmán, Pascale Fung, Philipp Koehn, and Mona Diab. Adapting high-resource nmt models to translate low-resource related languages without parallel data. arXiv preprint arXiv:2105.15071, 2021. 14. Qianqian Dong, Rong Ye, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, and Lei Li. Listen, understand and translate: Triple supervision decouples end-to-end speech-to-text translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12749– 12759, 2021. 15. Yucen Nan, Wei Li, Feng Lu, Chengwen Luo, Jianqiang Li, and Albert Zomaya. Developing practical multi-view learning for clinical analytics in p4 medicine. 
IEEE Transactions on Emerging Topics in Computing, 2021. 16. Norah Alballa and Isra Al-Turaiki. Machine learning approaches in covid-19 diagnosis, mortality, and severity risk prediction: A review. Informatics in Medicine Unlocked, 24:100564, 2021.

References

209

17. Vivek Lahoura, Harpreet Singh, Ashutosh Aggarwal, Bhisham Sharma, Mazin Abed Mohammed, Robertas Damaševiˇcius, Seifedine Kadry, and Korhan Cengiz. Cloud computingbased framework for breast cancer diagnosis using extreme learning machine. Diagnostics, 11(2):241, 2021. 18. Zhijian Wang, Wenlei Zhao, Wenhua Du, Naipeng Li, and Junyuan Wang. Data-driven fault diagnosis method based on the conversion of erosion operation signals into images and convolutional neural network. Process Safety and Environmental Protection, 149:591–601, 2021. 19. Andronicus A Akinyelu. Advances in spam detection for email spam, web spam, social network spam, and review spam: Ml-based and nature-inspired-based techniques. Journal of Computer Security, 29(5):473–529, 2021. 20. Qussai Yaseen et al. Spam email detection using deep learning techniques. Procedia Computer Science, 184:853–858, 2021. 21. Brian Hayes. Cloud computing, 2008. 22. Karrar Hameed Abdulkareem, Mazin Abed Mohammed, Ahmad Salim, Muhammad Arif, Oana Geman, Deepak Gupta, and Ashish Khanna. Realizing an effective covid-19 diagnosis system based on machine learning and iot in smart hospital environment. IEEE Internet of Things Journal, 8(21):15919–15928, 2021. 23. Seyoung Huh, Sangrae Cho, and Soohyung Kim. Managing iot devices using blockchain platform. In 2017 19th international conference on advanced communication technology (ICACT), pages 464–467. IEEE, 2017. 24. Cisco. Cisco annual internet report (2018–2023) white paper, March 2020. 25. Weisong Shi and Schahram Dustdar. The promise of edge computing. Computer, 49(5):78–81, 2016. 26. Mahadev Satyanarayanan. The emergence of edge computing. Computer, 50(1):30–39, 2017. 27. Lejun Zhang, Yanfei Zou, Weizheng Wang, Zilong Jin, Yansen Su, and Huiling Chen. Resource allocation and trust computing for blockchain-enabled edge computing system. Computers & Security, 105:102249, 2021. 28. ASM Sanwar Hosen, Pradip Kumar Sharma, In-Ho Ra, and Gi Hwan Cho. Sptm-ec: A security and privacy-preserving task management in edge computing for iiot. IEEE Transactions on Industrial Informatics, 2021. 29. Jiawei Zhang, Xiaochen Zhou, Tianyi Ge, Xudong Wang, and Taewon Hwang. Joint task scheduling and containerizing for efficient edge computing. IEEE Transactions on Parallel and Distributed Systems, 32(8):2086–2100, 2021. 30. Lu Zhao, Wenan Tan, Bo Li, Qiang He, Li Huang, Yong Sun, Lida Xu, and Yun Yang. Joint shareability and interference for multiple edge application deployment in mobile edge computing environment. IEEE Internet of Things Journal, 2021. 31. Shuiguang Deng, Hailiang Zhao, Weijia Fang, Jianwei Yin, Schahram Dustdar, and Albert Y Zomaya. Edge intelligence: The confluence of edge computing and artificial intelligence. IEEE Internet of Things Journal, 7(8):7457–7469, 2020. 32. Fernando G. Tinetti. Computer architecture: A quantitative approach. Journal of Computer Science & Technology, 8(3): 168–170, 2008. 33. Jingjing Yao, Tao Han, and Nirwan Ansari. On mobile edge caching. IEEE Communications Surveys Tutorials, 21(3):2525–2553, 2019. 34. Francesco Pantisano, Mehdi Bennis, Walid Saad, and Mérouane Debbah. In-network caching and content placement in cooperative small cell networks. In 1st International Conference on 5G for Ubiquitous Connectivity, pages 128–133, 2014. 35. Xiuhua Li, Xiaofei Wang, and Victor C. M. Leung. Weighted network traffic offloading in cache-enabled heterogeneous networks. In 2016 IEEE International Conference on Communications (ICC), pages 1–6, 2016. 36. 
Irina Valeryevna Pustokhina, Denis Alexandrovich Pustokhin, Deepak Gupta, Ashish Khanna, K. Shankar, and Gia Nhu Nguyen. An effective training scheme for deep neural network in edge computing enabled internet of medical things (iomt) systems. IEEE Access, 8:107112– 107123, 2020.

210

6 AI/ML on Edge

37. Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey. IEEE Communications Surveys Tutorials, 22(3):2031–2063, 2020. 38. Emmanouil Krasanakis, Symeon Papadopoulos, and Ioannis Kompatsiaris. p2pgnn: A decentralized graph neural network for node classification in peer-to-peer networks. IEEE Access, 10:34755–34765, 2022. 39. Tejalal Choudhary, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. A comprehensive survey on model compression and acceleration. The Artificial intelligence review, 53(7):5113–5155, 2020. 40. Dianlei Xu, Tong Li, Yong Li, Xiang Su, Sasu Tarkoma, Tao Jiang, Jon Crowcroft, Pan Hui. Edge Intelligence: Architectures, Challenges, and Applications, 2020. https://arxiv.org/abs/ 2003.12172. https://doi.org/10.48550/ARXIV.2003.12172 41. Nicholas D. Lane, Sourav Bhattacharya, Akhil Mathur, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. Squeezing deep learning into mobile and embedded devices. IEEE Pervasive Computing, 16(3):82–88, 2017. 42. Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. An early resource characterization of deep learning on wearables, smartphones and internetof-things devices. In Proceedings of the 2015 International Workshop on Internet of Things towards Applications, IoT-App ’15, page 7–12, New York, NY, USA, 2015. Association for Computing Machinery. 43. Danny Weyns, Bradley Schmerl, Vincenzo Grassi, Sam Malek, Raffaela Mirandola, Christian Prehofer, Jochen Wuttke, Jesper Andersson, Holger Giese, and Karl M. Göschka. On Patterns for Decentralized Control in Self-Adaptive Systems, pages 76–107. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. 44. Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020. 45. Vasileios Karagiannis, Alexandre Venito, Rodrigo Coelho, Michael Borkowski, and Gerhard Fohler. Edge computing with peer to peer interactions: Use cases and impact. In Proceedings of the Workshop on Fog Computing and the IoT, IoT-Fog ’19, page 46–50, New York, NY, USA, 2019. Association for Computing Machinery. 46. D. Ramage B. McMahan. Federated learning: Collaborative machine learning without centralized training data. Google Research Blog, 2017. 47. Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with Elastic Averaging SGD. In Advances in Neural Information Processing Systems, vol. 28, ed. C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett. Curran Associates, Inc. 2015. https:// proceedings.neurips.cc/paper/2015/file/d18f655c3fce66ca401d5f38b48c89af-Paper.pdf 48. Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017. https://arxiv.org/abs/ 1712.01887. https://doi.org/10.48550/ARXIV.1712.01887 49. Yi Cai, Yujun Lin, Lixue Xia, Xiaoming Chen, Song Han, Yu Wang, Huazhong Yang. Long Live TIME: Improving Lifetime for Training-in-Memory Engines by Structured Gradient Sparsification. In Proceedings of the 55th Annual Design Automation Conference, DAC ’18, 2018. New York, NY: Association for Computing Machinery. https://doi.org/10.1145/3195970. 3196071 50. Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. 
In The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings, pages 8697–, Piscataway, 2018. The Institute of Electrical and Electronics Engineers, Inc. (IEEE). 51. Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01): 4780–4789, 2019.

References

211

52. Whui Kim, Woo-Sung Jung, and Hyun Kyun Choi. Lightweight driver monitoring system based on multi-task mobilenets. Sensors (Basel, Switzerland), 19(14):3200–, 2019. 53. Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807. IEEE, 2017.

Chapter 7

AI/ML for Service-Level Objectives

Abstract This chapter presents "SLO Script," a language and accompanying framework, highly motivated by real-world industrial needs, that allows service providers to define complex, high-level SLOs in an orchestrator-independent manner. SLO Script was created because most existing approaches focus on low-level SLOs that are closely related to resources (e.g., average CPU or memory usage) and are thus usually bound to specific elasticity controllers. The main features of SLO Script include (1) novel abstractions with type safety features to ensure compatibility between SLOs and elasticity strategies, (2) abstractions that decouple SLOs from elasticity strategies, (3) a strongly typed metrics API, and (4) an orchestrator-independent object model that enables language extensibility. We also present a middleware for SLO Script that provides an orchestrator-independent SLO controller for periodically evaluating SLOs and triggering elasticity strategies. We evaluate SLO Script and our middleware by implementing a motivating use case, featuring a cost efficiency SLO for an application deployed on Kubernetes.

This chapter reuses literal text and materials from:
• T. Pusztai et al., "A Novel Middleware for Efficiently Implementing Complex Cloud-Native SLOs," 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 2021, https://doi.org/10.1109/CLOUD53861.2021.00055. © 2021 IEEE, reprinted with permission.
• T. Pusztai et al., "SLO Script: A Novel Language for Implementing Complex Cloud-Native Elasticity-Driven SLOs," 2021 IEEE International Conference on Web Services (ICWS), 2021, https://doi.org/10.1109/ICWS53863.2021.00017. © 2021 IEEE, reprinted with permission.


7.1 SLO Script: A Language to Implement Complex Elasticity-Driven SLOs

7.1.1 SLOs and Elasticity

In cloud computing, it is common for providers and consumers to agree on Service Level Agreements (SLAs) to define bounds for certain cloud services [1]. An SLA consists of one or more Service Level Objectives (SLOs), where an SLO is defined as a "commitment to maintain a particular state of the service in a given period" [2]. SLOs usually provide directly measurable capacity guarantees, such as available memory, even though service consumers generally prefer performance guarantees that can be related to business-relevant Key Performance Indicators (KPIs). Because the vast majority of today's cloud providers offer only rudimentary support for SLOs, customers who want a high-level SLO need to manually map it to directly measurable low-level metrics (e.g., CPU or memory) [3]. The same concepts that govern SLOs in the cloud also apply to the edge.

Elasticity is a flagship property of cloud and edge computing. Herbst et al. [4] define it as "the degree to which a system is able to adapt to workload changes by provisioning and deprovisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible." This definition already suggests that today's cloud offerings usually deal with resource elasticity, that is, adding resources (CPU, memory, etc.) or additional service instances when the demand is high and removing resources when the demand is low. Elasticity should not be limited to the resource dimension; it must include, at least, the following dimensions as well: (1) cost elasticity, the amount of money a consumer is willing to pay for a service, and (2) quality elasticity, the desired precision of the output data, for example, related to a machine learning system [5]. In line with these requirements, we define an elasticity strategy as a sequence of altering actions that adjust the amount of resources provisioned for a workload, their type, or both. Additionally, it can also change the workload configuration, that is, alter quality parameters. Thus, an elasticity strategy is capable of affecting all three elasticity dimensions.

From a business perspective, it is important to be able to map business goals to measurable KPIs that can be translated to SLOs. Because this is not possible with low-level metrics (e.g., average CPU usage), a high-level SLO that combines multiple elasticity dimensions must be defined, for example, one that combines resource usage with the total cost of the system.

In this chapter, we continue our work envisioned in [3], which we refer to as the Polaris SLO Cloud (Polaris) project, and present SLO Script,1 a language and accompanying framework that permits service providers to define complex SLOs on their services and service consumers to configure and apply them to their workloads. SLO Script embodies the following attributes:
1. Novel abstractions (StronglyTypedSLO) with type safety features to ensure compatibility between workloads, SLOs, and elasticity strategies.
2. Language constructs (ServiceLevelObjectives, ElasticityStrategies, and SloMappings) that enable decoupling of SLOs from elasticity strategies. This promotes reuse and increases the number of possible SLO/elasticity strategy combinations. Details are provided in Sects. 7.1.5.1 and 7.1.5.2.
3. A strongly typed metrics API that boosts productivity when writing queries. It will be presented in Sect. 7.1.5.3.
4. An orchestrator-independent object model that promotes extensibility, as detailed in Sect. 7.1.5.4.

1 SLO Script is referred to as "SLO Elasticity Policy Language" in [3].

7.1.2 Motivation

In the open-source2 Polaris project [3], we aim to establish SLOs as first-class entities and bring multidimensional elasticity capabilities to cloud and edge computing environments. Polaris is part of the Linux Foundation's Centaurus project,3 a novel open-source platform targeted toward building unified and highly scalable public or private distributed cloud infrastructure and edge systems.

To motivate the need for such a language, we present a use case featuring a cloud service provider that wants to offer an e-commerce platform in the form of Software-as-a-Service to its customers. The service provider offers customers the E-Commerce-as-a-Service for deployment on the cloud infrastructure. Service consumers are customers who integrate the E-Commerce-as-a-Service into their applications. Figure 7.1 shows an overview of the use case. The deployment consists of two major components: an online store and an Elasticsearch4 database. Both need to be managed transparently for the service consumers. Each service exposes one or more metrics, for example, CPU usage, response time, or even complex metrics such as cost efficiency. The service provider defines a set of SLOs that are supported by the service.

The more requests per second a service must handle, the more resources it needs, and, thus, the more expensive it becomes. Different service consumers have different needs with regard to requests per second; some are willing to pay different prices for these guarantees.

2 https://polaris-slo-cloud.github.io. 3 https://www.centauruscloud.io. 4 https://www.elastic.co/elasticsearch/.


Fig. 7.1 E-Commerce-as-a-Service scenario overview. © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021a) [8]

However, for most of them, it is difficult, if not impossible, to specify a low-level, resource-bound SLO that delivers the best performance within their budget. This is mainly due to a lack of detailed technical understanding of the services and also because a resource-bound SLO only captures a single elasticity dimension. As can be inferred from this use case, the service consumers would prefer to simply specify a high-level cost efficiency for the micro-services. The cost efficiency could be defined, for example, as the number of requests per second served faster than N milliseconds divided by the total cost of the micro-service [6]. To achieve this with our approach, the service consumer only needs to perform a few simple tasks. First, the consumer deploys the E-Commerce-as-a-Service platform; we refer to this deployment as a workload. To apply the cost efficiency SLO to the workload, the service consumer then creates an SLO mapping that associates an SLO offered by the service provider with a workload of the service consumer. After creating the SLO mapping, the service consumer is finished; the cloud is now responsible for automatically performing elasticity actions to ensure that the SLO is fulfilled.

Therefore, by allowing service consumers to specify a high-level SLO such as cost efficiency [7], our approach enables service consumers to specify a value that can be easily communicated to the non-technical, management layers of their companies; this is important for approving the budget and checking conformance with the business goals. The complex task of mapping this cost efficiency to low-level resources and performing complicated elasticity actions to achieve the SLO is left to the service provider, who knows the infrastructure and the requirements of the offered services. Using SLO Script, the service provider is able to efficiently use this know-how about the services to implement complex SLOs.
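As a concrete illustration, the following minimal sketch computes such a cost efficiency value; the interface and function names are hypothetical and only make the definition above tangible; they are not part of SLO Script's API.

// Hypothetical input for one micro-service (names are illustrative assumptions).
interface CostEfficiencyInput {
  requestsFasterThanThresholdPerSec: number; // requests/s served faster than N ms
  totalCost: number;                         // total cost of the micro-service
}

// Cost efficiency as defined above: requests/s under the latency threshold
// divided by the total cost of the micro-service.
function computeCostEfficiency(input: CostEfficiencyInput): number {
  if (input.totalCost <= 0) {
    return 0; // guard against division by zero for an unpriced deployment
  }
  return input.requestsFasterThanThresholdPerSec / input.totalCost;
}

// Example: 800 req/s under the threshold at a total cost of 4 units yields an efficiency of 200.
console.log(computeCostEfficiency({ requestsFasterThanThresholdPerSec: 800, totalCost: 4 }));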

7.1.3 Research Challenges

SLO Script addresses the following research challenges.

Enable complex elasticity strategies: The majority of systems provide only simple elasticity strategies, with horizontal scaling being the most common [9].

7.1 SLO Script: A Language to Implement Complex Elasticity-Driven SLOs

217

For example, Kubernetes,5 which has the most capabilities for production-level services among commonly used container orchestration systems [10], usually ships with the Horizontal Pod Autoscaler (HPA) [11]. Because some cloud providers have shown little or no further increase in application performance beyond certain instance counts [12], a more complex elasticity strategy (e.g., combining horizontal and vertical scaling) may achieve better results.

Enable high-level SLOs based on complex metrics: The majority of metrics used nowadays are directly measurable at the system or the application level; notable examples are CPU and memory utilization or response time [9, 13, 14]. For example, HPA uses the average CPU utilization of all pods of a workload. We define a composed metric as a metric that can be obtained by aggregating and composing other metrics. In HPA, composed metrics can be supplied through a custom metrics API or an external metrics API.6 Both entail the registration of a custom API server, called an adapter API server, to which the Kubernetes API can proxy requests. This leads to additional development and maintenance effort. The custom metrics API [15] and the external metrics API [16] allow exposing arbitrary metrics (e.g., from the monitoring solution Prometheus7) as Kubernetes resources. However, apart from summing all values if an external metric matches multiple time series [17], the computation or aggregation of these metrics must be implemented by the adapter API server. HPA allows specifying multiple metrics for scaling, but it calculates a desired replica count for each of them separately and then scales to the highest value [18]. This, although useful, cannot properly address the need for a high-level SLO as shown in our motivating use case.

Decouple SLOs from elasticity strategies: If a system provides only a single elasticity strategy (e.g., horizontal scaling in HPA), the SLO is tied to that strategy; this makes the system rigid and inflexible. A tight coupling between SLO evaluation and elasticity strategy in the same controller would require re-implementing every needed SLO in every elasticity controller, which leads to duplicate code and difficult maintenance. Furthermore, a specific SLO may not yet have been implemented on a certain elasticity controller, despite being needed by a consumer.

Unify APIs for multiple metrics sources: Each major time series database has its own query language; for example, Prometheus has PromQL, InfluxDB8 has Flux, and Google Cloud has MQL.9 Thus, an implementation in a particular language ties the SLO to a certain DB, as there is no common query language for time series databases, like SQL is for relational databases.

5 https://kubernetes.io.
6 https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-metrics-apis.
7 https://prometheus.io.
8 https://www.influxdata.com.
9 https://cloud.google.com/monitoring/mql/reference.


Prevent cloud vendor lock-in: Common autoscaling solutions are tied to a specific orchestrator or cloud provider. In fact, all major cloud vendors (i.e., AWS [19], Azure [20], and Google Cloud [21]) have their own non-portable way of configuring their dedicated autoscaler, presumably to foster vendor lock-in. HPA, although not tied to a particular cloud provider, is still specific to Kubernetes.

7.1.4 Language Requirements Overview

SLO Script is at the heart of the Polaris project and supports the definition and implementation of metrics, SLOs, and elasticity strategies, where each metric can be generic or specifically tailored to a particular service. An SLO evaluates metrics to determine whether a system conforms to the expectations defined by the service consumer. When the SLO is violated (reactive triggering) or when it is likely to be violated in the near future (proactive triggering), it may trigger an elasticity strategy. These can range from a simple horizontal scaling strategy to more complex strategies that combine horizontal and vertical scaling, or application-specific elasticity strategies that combine scaling with adaptations of the service's configuration.

The goal of this language is to present a significant usability improvement over raw configurations that rely on YAML or JSON. To this end, the language must support higher-level abstractions than raw configuration files; it must also provide type safety to reduce errors and boost productivity. The requirements derived from our motivating use case and the core objectives of SLO Script are as follows.
1. Allow service consumers to configure and map an SLO to a workload.
2. Allow service consumers to choose any compatible elasticity strategy when configuring an SLO (loose coupling).
3. Allow SLOs to instantiate, configure, and trigger the elasticity strategy chosen by the service consumer.
4. Support the definition of composed metrics.
5. Support the definition of elasticity strategies.
6. Ensure compatibility between SLOs and elasticity strategies during programming (i.e., type safety).
7. Make SLO Scripts orchestrator-independent.
8. Allow specific orchestrators to plug them into their infrastructure.
9. Allow service providers to focus only on the business logic of their metrics, SLOs, and elasticity strategies.
10. Present a DB-independent API for querying metrics.
11. Support packaging metrics, SLOs, and elasticity strategies into plugins.

SLO Script supports the use of any metrics source using adapters, as well as elasticity strategies developed in any language, as long as their input data types match the output data types of the SLOs. This allows reusing an elasticity strategy written in a different language, for example, when an orchestrator-specific API client is written in that language. The next subsection explains the design of the SLO Script language and how it achieves orchestrator independence.

7.1.5 SLO Script Language Design and Main Abstractions

In this subsection, we describe how SLO Script provides the main contributions announced in the introduction, that is, (1) high-level StronglyTypedSLO abstractions with type safety features, (2) constructs that enable decoupling of SLOs from elasticity strategies, (3) a strongly typed metrics API, and (4) an orchestrator-independent object model that promotes extensibility.

7.1.5.1 SLO Script Overview and Language Meta-Model

SLO Script consists of high-level, domain-specific abstractions and restrictions that constitute a language abstraction. It does not provide its own textual syntax, but uses TypeScript as its base. Using a publicly available and well-supported language increases the chances of SLO Script being accepted by developers and reduces maintenance effort (because language and compiler maintenance is handled by the TypeScript authors). The previously presented requirements result in the meta-model for SLO Script, depicted as a UML class diagram in Fig. 7.2.

ServiceLevelObjective is one of the central constructs of the SLO Script language. An example instance is the CostEfficiencySlo, which implements the cost efficiency scenario described in Sect. 7.1.2. An instance of the ServiceLevelObjective construct defines and implements the business logic of an SLO; it is configured by the service consumer using an SloConfiguration. The ServiceLevelObjective uses instances of SloMetric to determine the current state of the system and compares it to the parameters specified by the service consumer in the SloConfiguration. The metrics are obtained using our strongly typed metrics API, which abstracts a monitoring system such as Prometheus. The metrics might be low-level (directly observable on the system), higher-level (instances of ComposedSloMetric), or a combination of both. Every evaluation of the ServiceLevelObjective produces an SloOutput that describes to what degree the SLO is currently fulfilled and is used as part of the input to an ElasticityStrategy. ServiceLevelObjective and ElasticityStrategy define the type of SloOutput they, respectively, produce or require.

Fig. 7.2 SLO Script meta-model (partial view). © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021a) [8]

The ElasticityStrategy construct represents the implementation of an elasticity strategy. It executes a sequence of elasticity actions to ensure that a workload fulfils an SLO. Elasticity actions may include provisioning or deprovisioning of resources, changing the types of resources used, or adapting the configuration of a service. The input to an ElasticityStrategy is a corresponding ElasticityStrategyConfiguration, consisting of the SloOutput produced by the ServiceLevelObjective, as well as the static configuration provided by the consumer. There is no direct connection between a ServiceLevelObjective and an ElasticityStrategy; this clearly shows that these two constructs are decoupled from each other. In fact, a connection between them can only be established through additional constructs, such as SloOutput or SloMapping.

The SloMapping construct is used by the service consumer to establish relationships among a ServiceLevelObjective, an ElasticityStrategy, and an SloTarget, that is, the workload to which the SLO applies. The SloMapping contains the SloConfiguration (SLO-specific bounds that the consumer can define), the SloTarget (the workload to which the SLO is applied), and any static configuration for the chosen ElasticityStrategy.

7.1.5.2 StronglyTypedSLO

When defining a ServiceLevelObjective using SLO Script's StronglyTypedSLO mechanism, the service provider must first create an SloConfiguration data type to be used by the service consumer for configuring the ServiceLevelObjective and an SloOutput data type to describe its output. While each ServiceLevelObjective will likely have its own SloConfiguration type, it is recommended to reuse an SloOutput data type for multiple ServiceLevelObjectives to allow loose coupling between ServiceLevelObjectives and ElasticityStrategies.

To create the actual SLO, a service provider must instantiate the ServiceLevelObjective meta-model construct, which can be represented using the ServiceLevelObjective TypeScript interface. It takes three generic parameters to enable type safety: C denotes the type of SloConfiguration object that will carry the parameters from an SloMapping, O is the type of SloOutput that will be fed to the elasticity strategy, and T is used to define the type of target workload the SLO supports.
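The following minimal sketch shows how such a generically parameterized interface could look in TypeScript; the member names and the SloTargetRef helper type are illustrative assumptions and not the exact Polaris API.

// Sketch of a generically parameterized SLO interface (names are assumptions).
// C = SloConfiguration type, O = SloOutput type, T = supported target workload type.
interface SloTargetRef {
  group: string;
  version: string;
  kind: string;
  name: string;
}

interface ServiceLevelObjective<C, O, T extends SloTargetRef> {
  // Called with the consumer-supplied configuration and the target workload.
  configure(config: C, target: T): Promise<void>;

  // Called periodically by the runtime; the result is later fed to the
  // elasticity strategy chosen in the SloMapping.
  evaluate(): Promise<O>;
}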

An ElasticityStrategy uses the same mechanism to define the type of SloOutput that it expects as input. Figure 7.3 illustrates how the type safety feature of SLO Script works. There are two sets of types: those determined by the ServiceLevelObjective and those determined by the ElasticityStrategy. The ServiceLevelObjective defines that it needs a certain type of SloConfiguration (indicated by the yellow color) as configuration input. The SloConfiguration defines the type of SloTarget (orange), which may be used to scope the SLO to specific types of workloads. The ElasticityStrategy defines its type of ElasticityStrategyConfiguration (purple), which in turn specifies the type of SloOutput (blue) that is required by the ElasticityStrategy. Thus, the bridge between these two sets is the SloOutput type. Once the service consumer has chosen a particular ServiceLevelObjective type, the possible SloTarget types are fixed (because of the SloConfiguration). Since the ServiceLevelObjective defines an SloOutput type, the set of compatible elasticity strategies will be composed of exactly those ElasticityStrategies that have defined an ElasticityStrategyConfiguration (with the same SloOutput type as input).


Fig. 7.3 Type safety provided by StronglyTypedSLO. © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021a) [8]


Type checking is especially useful in enterprise scenarios, where hundreds of SLOs need to be managed. Using YAML or JSON files for this purpose provides no way of verifying that the used SLOs, workloads, and elasticity strategies are compatible. SLO Script provides this feature. In fact, by using a type-safe language, significant time is saved when a set of SLOs and their mappings need to be refactored.

The Polaris runtime for SLO Script invokes the SLO instance at configurable intervals to check whether the SLO is currently fulfilled or the elasticity strategy needs to take corrective actions. It may simply check if the metrics currently match the requirements of the SLO, or it can use predictions and machine learning to determine if the SLO is likely to be violated in the near future. The result of this operation is an instance of the defined SloOutput type that is returned asynchronously. The Polaris runtime will be described in detail in the following sections.
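Building on the interface sketched above, the following skeleton indicates what a StronglyTypedSLO implementation such as the CostEfficiencySlo might look like; all type names, fields, and the compliance calculation are simplified assumptions rather than the actual Polaris implementation.

// Simplified, self-contained skeleton (all names are illustrative assumptions).
interface CostEfficiencySloConfig { responseTimeThresholdMs: number; targetCostEfficiency: number; }
interface SloCompliance { currSloCompliancePercentage: number; }
interface WorkloadRef { kind: string; name: string; }

interface ServiceLevelObjective<C, O, T> {
  configure(config: C, target: T): Promise<void>;
  evaluate(): Promise<O>;
}

class CostEfficiencySlo implements ServiceLevelObjective<CostEfficiencySloConfig, SloCompliance, WorkloadRef> {
  private config!: CostEfficiencySloConfig;
  private target!: WorkloadRef;

  async configure(config: CostEfficiencySloConfig, target: WorkloadRef): Promise<void> {
    this.config = config;
    this.target = target;
  }

  // Invoked periodically by the runtime; compares the observed cost efficiency
  // to the configured target and reports the result as a compliance percentage
  // (100 = met, >100 = violated, <100 = outperformed).
  async evaluate(): Promise<SloCompliance> {
    const observed = await this.fetchObservedCostEfficiency();
    const compliance = Math.round((this.config.targetCostEfficiency / observed) * 100);
    return { currSloCompliancePercentage: compliance };
  }

  private async fetchObservedCostEfficiency(): Promise<number> {
    // Placeholder: a real SLO would obtain this via the strongly typed metrics API.
    return 150;
  }
}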

7.1.5.3 Strongly Typed Metrics API

The strongly typed metrics API provides two types of abstractions: (1) raw metrics queries for querying time series databases independent of the query language they use and (2) composed metrics for creating higher-level metrics from aggregated and composed lower-level metrics obtained through raw metrics queries. Since our API is based on objects, rather than on a textual language, it also comes with type safety features. When using PromQL or Flux directly, developers often need to write queries as plain strings in their application code, thereby breaking the type safety of that code.

Figure 7.4 shows a class diagram with a simplified view of our strongly typed metrics API. The raw metrics query abstractions are mainly inspired by PromQL with some influences from Flux and MQL. For raw metrics queries, the central model type is TimeSeries, which describes a sequence of sampled values for a metric. In addition to the metricName, a TimeSeries has a map of labels that can be used to further describe its samples; for example, a metric named http_requests_per_sec could have a label service to identify the particular service from which this metric was observed.

The base interface for querying time series is TimeSeriesQuery. In contrast to a relational DB query that results in a set of one or more rows, a time series DB query results in a set of one or more time series, each with a distinct metric name and label combination. For example, a query for http_requests_per_sec could result in two distinct TimeSeries: one with the label service = 'online_store' and another with the label service = 'elasticsearch'. This is why the execution of a TimeSeriesQuery could result in a QueryResult with multiple TimeSeries instances.

A time series DB allows retrieving a time series with particular properties, as well as applying functions to the data (e.g., various types of aggregations or sorting). Certain functions (e.g., aggregations) require time series with multiple samples as input, while others (e.g., sorting) only work on time series with a single sample.



Fig. 7.4 Strongly typed metrics API (simplified view). © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021a) [8]

For example, one may first query all time series for http_requests_per_sec, then compute the sum for each single time series, and finally sort the results to see which service receives the most requests. This feature is not implemented in most current platforms; for example, Prometheus will return an error when trying to sort time series with multiple samples. The TimeSeriesInstant model type represents time series that are limited to a single sample.


To support both time series types, the TimeSeriesQuery interface is extended by multiple sub-interfaces: TimeRangeQuery for queries that result in a set of TimeSeries and TimeInstantQuery for queries that result in a set of TimeSeriesInstants. Each of these interfaces exposes only methods for DB functions that are applicable to the respective time series type. A function may also change the time series type; for example, sum() is applied to a TimeSeries but returns a TimeSeriesInstant. LabelFilterableQuery is another sub-interface of TimeSeriesQuery that allows applying filters on labels. Since our metrics query API needs to produce valid DB-specific queries, label filtering is a capability of a query that is lost after applying the first DB function (e.g., sum()), due to the structure of PromQL queries.

A composed metric is designated by a composed metric type. It defines the name of the composed metric, the data type used for its values, and the parameters needed to obtain it (e.g., the name of the target workload). The metric values are supplied by a composed metric source that may use raw metrics queries to obtain and aggregate multiple lower-level metrics. For each composed metric type, there may be multiple composed metric sources. This allows decoupling the type of a composed metric from the implementation that computes it. It enables multiple implementations, each tailored to a specific type of workload (REST APIs, databases, etc.), for delivering the same type of composed metric.
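To make these abstractions more concrete, the following sketch expresses the http_requests_per_sec example above against a strongly typed query API; the interfaces are simplified and the factory and method names are assumptions that only mirror the concepts of Fig. 7.4, not the exact Polaris API.

// Simplified query abstractions (names are assumptions mirroring Fig. 7.4).
interface Sample<V> { timestamp: number; value: V; }

interface TimeSeriesInstant<V> {
  metricName: string;
  labels: Record<string, string>;
  samples: [Sample<V>];           // exactly one sample
}

interface TimeInstantQuery<V> {
  sort(): TimeInstantQuery<V>;    // only valid on single-sample time series
  execute(): Promise<TimeSeriesInstant<V>[]>;
}

interface TimeRangeQuery<V> {
  filterOnLabel(label: string, value: string): TimeRangeQuery<V>;
  sum(): TimeInstantQuery<V>;     // aggregation: TimeRangeQuery -> TimeInstantQuery
}

interface MetricsSource {
  getTimeRangeQuery<V>(metricName: string): TimeRangeQuery<V>;
}

// Sum each http_requests_per_sec time series and sort the single-sample results
// to find the service that currently receives the most requests.
async function busiestService(metrics: MetricsSource): Promise<string | undefined> {
  const results = await metrics
    .getTimeRangeQuery<number>('http_requests_per_sec')
    .sum()
    .sort()
    .execute();
  const top = results.length > 0 ? results[results.length - 1] : undefined;
  return top?.labels['service'];
}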

7.1.5.4 Polaris Object Model

The Polaris object model, a subset of which is shown in Fig. 7.5, is an instantiation of SLO Script’s meta-model in the Polaris framework. This abstract object model yields orchestrator independence and promotes extensibility.


Fig. 7.5 Core Polaris object model types (partial view). © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021a) [8]


Every object that is submitted to the orchestrator must be of type ApiObject or a subclass of it. It contains an objectKind attribute that describes its type. The ObjectKind.group attribute denotes the API group of the type (similar to a package in UML), the version attribute identifies the version of the API group, and kind conveys the name of the type. ApiObject also has a metadata attribute that contains additional information about the object, including the name of the instance. The spec attribute contains the actual "payload" content of the object. ObjectReference extends ObjectKind with a name attribute to reference existing object instances in the orchestrator. This is needed to refer to the target workload of an SLO in an SloTarget, which is derived from ObjectReference.

ApiObject is the root extension point for objects that need to be stored in the orchestrator. For example, to instantiate the SloMapping construct from the meta-model, a TypeScript class needs to inherit from the SloMappingBase class. It contains the type information for the spec and sets up the correct ObjectKind for this SloMapping. The SloConfiguration construct can be represented by an arbitrary TypeScript interface or class; it needs to be wrapped in a class implementing SloMappingSpec to store the configuration. The SloMapping represents a custom resource type that needs to be registered with the orchestrator. A concrete example will be shown in the "Evaluation" section.

To identify an ElasticityStrategy when configuring an SloMapping, the ObjectKind subclass ElasticityStrategyKind is used. For each ElasticityStrategy, an ElasticityStrategyKind subclass has to be created and parameterized with the SloOutput and SloTarget types expected by the ElasticityStrategy. This, in conjunction with the SloOutput type configured on a ServiceLevelObjective and its corresponding SloMapping, enables the type checking discussed in the previous sections.

The SloOutput meta-model construct is instantiated by creating an arbitrary TypeScript class. To allow compatibility between as many SLOs and elasticity strategies as possible, generic SloOutput data types that are supported by multiple ServiceLevelObjectives and ElasticityStrategies are recommended. The SloCompliance class provided by the core object model conforms to this requirement. It expresses the current state of the SLO as a percentage of conformance: a value of 100% indicates that the SLO is precisely met, a value greater than 100% indicates that the SLO is violated and therefore additional resources (e.g., by scaling out) are needed, and a value below 100% indicates that the SLO is being outperformed, so a reduction of resources (e.g., by scaling in) should be considered.

Because every orchestrator has its own set of abstractions, an orchestrator-independent framework must provide mechanisms to transform objects into the native structures of each supported orchestrator. To this end, SLO Script provides a transformation service that allows each orchestrator-specific connector library to register transformers for those object types. It converts objects that require transforming and directly copies the ones that do not require any transformation. The transformation is not limited to the type of the root object; instead, the appropriate transformer is recursively applied to all nested objects as well.
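As an informal illustration of these abstractions, the sketch below declares a hypothetical cost efficiency SLO mapping type; the base classes are simplified stand-ins for the Polaris ones, and the configuration fields are assumptions (the API group 'slo.polaris.github.io' follows Fig. 7.8).

// Simplified stand-ins for the core object model types (not the exact Polaris classes).
class ObjectKind {
  constructor(public group: string, public version: string, public kind: string) {}
}

class ApiObjectMetadata {
  constructor(public name: string) {}
}

abstract class SloMappingBase<TSpec> {
  objectKind!: ObjectKind;
  metadata!: ApiObjectMetadata;
  spec!: TSpec;
}

// Hypothetical consumer-facing configuration of a cost efficiency SLO.
interface CostEfficiencySloConfig {
  responseTimeThresholdMs: number;
  targetCostEfficiency: number;
}

// Wrapper playing the SloMappingSpec role for this SLO type.
interface CostEfficiencySloMappingSpec {
  sloConfig: CostEfficiencySloConfig;
  staticElasticityStrategyConfig?: Record<string, unknown>;
}

// The custom resource type that would be registered with the orchestrator.
class CostEfficiencySloMapping extends SloMappingBase<CostEfficiencySloMappingSpec> {
  constructor(name: string, spec: CostEfficiencySloMappingSpec) {
    super();
    this.objectKind = new ObjectKind('slo.polaris.github.io', 'v1', 'CostEfficiencySloMapping');
    this.metadata = new ApiObjectMetadata(name);
    this.spec = spec;
  }
}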


The Polaris object model is heavily influenced by that of Kubernetes, but the two are not equal. For example, in Kubernetes, there is no objectKind property on an object returned from the API, but an apiVersion and a kind property: the former is a combination of the Polaris group and version attributes of ObjectKind, and the latter is the equivalent of ObjectKind.kind. The transformation service is responsible for transforming instances of the orchestrator-independent Polaris classes into plain JavaScript objects, so that they can be serialized (e.g., to JSON or YAML) by the orchestrator-specific connector library. It also transforms plain JavaScript objects supplied by the orchestrator connector into instances of Polaris classes.

7.2 A Middleware for SLO Script

In this section, we describe the Polaris Middleware that implements a runtime for SLO Script. To efficiently adjust the elasticity of a deployed application, which we refer to as a workload, a Monitor Analyze Plan Execute (MAPE) loop [22] must implement the following routines:
1. monitoring of system and workload metrics using tools such as Prometheus,
2. analysis of metrics (by SLOs) to evaluate whether the defined goals are met,
3. planning of actions (e.g., an elasticity strategy) to correct a violated SLO,
4. execution of the planned actions by the cloud orchestrator, such as Kubernetes.

In this section, we focus on the realization of SLOs, that is, the analysis step of the control loop. In the analysis step, an SLO must obtain one or more metrics from the monitoring step and pass its evaluation result on to the planning step. A common approach is to implement the SLO as a control loop itself. The variety of monitoring solutions and databases (DBs) makes obtaining metrics difficult without tying the implementation to a particular vendor. Once the metrics have been obtained, they may need to be aggregated to gain deeper insights. When the current status of the SLO has been determined, the outcome needs to be conveyed to one or several elasticity strategies.

To facilitate the implementation of complex SLOs, we present the Polaris Middleware. Its implementation is published as open source, as part of the Polaris project,10 with the following features.
1. An orchestrator-independent SLO controller periodically evaluates SLOs and triggers elasticity strategies while keeping SLOs and elasticity strategies decoupled, which increases the number of possible SLO/elasticity strategy combinations.
2. A provider-independent SLO metrics collection and processing mechanism allows querying raw time series metrics, as well as composing multiple metrics into reusable higher-level metrics.
3. A Command-Line Interface (CLI) tool creates and manages projects that rely on the Polaris Middleware.

Additionally, we provide platform connectors for Kubernetes (known to have the most capabilities for production-level services among container orchestrators [10]) and Prometheus (known to be a popular choice for a time series DB).

10 https://polaris-slo-cloud.github.io.

7.2.1 Research Challenges

The following research challenges highlight future directions for improving such systems.

Decoupling of SLOs from elasticity strategies: Many SLOs are tightly coupled with the elasticity strategy they trigger; for example, HPA in Kubernetes provides the average CPU usage SLO to trigger horizontal scaling. This rigid coupling reduces the flexibility of a system. Because re-implementing every useful elasticity strategy for every SLO controller is infeasible, decoupling of SLOs from elasticity strategies is needed.

Realization of high-level SLOs using complex metrics: Most metrics that guide cloud or edge elasticity today are directly measurable at the system or application level [9, 13, 14]. While HPA supports custom metrics through the custom and external metrics APIs,11 realizing such approaches requires writing custom API servers to which the Kubernetes API can proxy requests. This significantly increases development and maintenance efforts. The external metrics API supports the specification of custom queries, but this feature must also be implemented by the custom API server. Therefore, an SLO middleware must provide easy-to-deploy mechanisms for combining multiple low-level metrics into high-level metrics so that they can be reused by multiple SLOs.

Making cloud platforms and data stores independent: The configuration of autoscaling solutions is commonly specific to a cloud vendor or orchestrator. Likewise, there is a distinct query language for each major time series DB. Therefore, portable SLOs require mechanisms to make them independent of particular vendors.

11 https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-metrics-apis.


7.2.2 Framework Overview

In this section, we provide a high-level overview of the Polaris Middleware's architecture and the Polaris CLI.

7.2.2.1 Architecture

The architecture of the Polaris Middleware is divided into two major layers (Fig. 7.6): (1) the Core Runtime layer, which contains orchestrator-independent abstractions and algorithms, and (2) the Connectors layer, which contains orchestrator- and DB-specific implementations of interfaces to connect to specific orchestrators or time series DBs. SLO controllers are built on top of the core runtime, which shields them from orchestrator- and DB-specific APIs.

Core model contains abstractions for defining and implementing SLOs. The most important ones are ServiceLevelObjective, SloTarget, SloMapping, and ElasticityStrategy.

Fig. 7.6 Polaris Middleware architecture (the colors indicate which connector realizes interfaces from a particular component). © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021b) [23]


ServiceLevelObjective defines the interface that the SLO implementation needs to realize to plug into the control loop provided by the runtime.

SloTarget is an abstraction used to identify the target workload that the SLO should be applied to.

An SLO is configured through an SloMapping to associate a particular SLO type with a target workload and an elasticity strategy. It establishes loose coupling between them. SLO mappings are deployed to the orchestrator as custom resources. Each SLO mapping type entails the definition of a custom resource type in the orchestrator. The addition of a new SLO mapping resource instance activates the respective SLO controller to enforce the SLO.

ElasticityStrategy is specified as part of an SLO mapping; it is used to identify the strategy that returns a violated SLO back to its acceptable range. Similar to an SLO mapping type, each ElasticityStrategy type is represented by a custom resource type in the orchestrator.

SLO control loop is used in an SLO controller to watch the orchestrator for new or changed SLO mappings, as well as to periodically evaluate the SLOs. It relies solely on Polaris Middleware abstractions and does not need to be customized by an orchestrator connector or an SLO controller.

SLO evaluation facilities are used by the SLO control loop to evaluate SLOs and trigger elasticity strategies on the orchestrator, if necessary. The evaluation of the SLO is handled by the core runtime. The mechanisms for triggering each elasticity strategy are specific to its orchestrator and thus must be provided by the respective orchestrator connector.

Transformation Service allows transforming orchestrator-independent Polaris Middleware objects into orchestrator-specific objects for a particular target platform and vice versa. The runtime provides a transformation mechanism that allows orchestrator connectors to register a type transformer for every object type that needs customization.

Object watch facilities allow observing a set of resource instances of a specific type in the orchestrator for additions, changes, and removals of instances. For example, they are used by the SLO control loop to monitor modifications (additions or changes) to SLO mappings. Orchestrator connectors must implement these facilities for their respective platforms.

Raw Metrics Service enables DB-independent access to time series data to obtain metrics. A DB connector must transform the generic queries produced by this service into queries for its particular DB.

Composed Metrics Service provides access to higher-level metrics, called composed metrics, which combine multiple lower-level metrics into a reusable high-level metric. To make it accessible through the Composed Metrics Service, a composed metric may be packaged into a library that can be included in an SLO controller, be exposed as a service, or be stored in a shared DB. This promotes loose coupling between the metrics providers and the SLO controllers. The implementation of the sharing mechanism can be provided by either the orchestrator or the DB connector.


Kubernetes connector library provides Kubernetes-specific realizations of the three runtime facilities that are highlighted in green in Fig. 7.6. Kubernetes-specific transformers are plugged into the Transformation Service to enable the transformation of objects from the core model into Kubernetes-specific objects. The library also implements the object watch facilities for the Kubernetes orchestrator. This allows the SLO control loop in an SLO controller to watch a particular SLO mapping Custom Resource Definition (CRD) for additions of new resource instances or changes to existing ones. The SLO evaluation realization for Kubernetes augments the generic evaluation facility from the core runtime to allow triggering elasticity strategies using Kubernetes CRD instances.

Prometheus connector implements the generic Raw Metrics Service using queries specific to a Prometheus time series database. It also supplies a mechanism for reading composed metrics from Prometheus.

7.2.2.2 Polaris CLI

The Polaris Command-Line Interface (CLI) provides mechanisms for project creation, building, and deployment. Its aim is to provide a convenient user interface, as well as a starting point for integrating Polaris Middleware projects into Continuous Integration (CI) pipelines. Developers can use the Polaris Middleware to create custom SLOs and controllers. The CLI provides the following commands.

polaris-cli generate adds a component of the specified type to the project. The CLI currently supports ten different componentTypes, including the following list; for the full list of commands, please refer to the CLI's documentation.12
• slo-mapping-type creates a new SLO mapping type that can be used to apply and configure an SLO.
• slo-controller creates an SLO controller for an SLO mapping type, together with deployment configuration files.
• slo-mapping creates a new mapping instance for an existing SLO mapping type. This is intended to be used by consumers who want to configure and apply a particular SLO to their workload.
• composed-metric-type creates a new composed metric type for defining a reusable complex metric.
• composed-metric-controller creates a controller for computing a composed metric.

polaris-cli (docker-)build executes the build process for the specified component to produce deployable artifacts. For controllers, this is a container image with the executable controller for the orchestrator.

polaris-cli serialize serializes the specified SLO mapping instance for submission to the orchestrator.

polaris-cli gen-crds generates Kubernetes CRDs for the SLO mappings, composed metric types, and elasticity strategies in each specified component.

polaris-cli deploy deploys the build artifact of the specified component to the currently selected default orchestrator.

Polaris CLI provides a default implementation for all commands. It also allows developers to override these defaults in the project file and enables the use of a different tool for deployment of the artifacts.

12 https://polaris-slo-cloud.github.io/polaris/features/cli.html.

7.2.3 Mechanisms

In this section, we describe the two main mechanisms provided by the Polaris Middleware: the orchestrator-independent SLO controller and the provider-independent SLO metrics collection and processing mechanism.

7.2.3.1 Orchestrator-Independent SLO Controller

The central mechanism in an SLO controller is the SLO control loop, which monitors, reacts to, and enforces an SLO configured by a user. The SLO control loop itself is orchestrator-independent and merely requires a few supporting services to be implemented by the orchestrator connector. The SLO controller can use the control loop without further adaptation. The SLO control loop consists of two sub-loops (Fig. 7.7): the watch loop and the evaluation loop. The watch loop (on the left side) is concerned with observing additions, changes, or removals of SLO mappings in the orchestrator using the object watch facilities. It also maintains the list of SLOs managed by the control loop. The evaluation loop (on the right side) periodically evaluates each SLO and triggers the configured elasticity strategies using the SLO evaluation facilities.

Watch Loop: The watch loop begins by observing the SLO mapping custom resource types that the SLO controller supports. To this end, it uses the object watch facilities to produce an event whenever an object of the watched types (i.e., the supported SLO mappings) is added, changed, or removed by the orchestrator. This functionality must be implemented by the orchestrator connector. Each watch event entails receiving the raw SLO mapping object that has been added, changed, or removed. Since this object is specific to the underlying orchestrator, it is transformed using the Transformation Service into an orchestrator-independent object. The watch loop then acts according to the type of watch event. If a new SLO mapping has been added or changed, the appropriate SLO object that is capable of evaluating the SLO is instantiated, configured, and added to or replaced in the list of SLOs for the evaluation loop.
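A rough sketch of how the watch loop's reaction to SLO mapping events could look; the event shape, registry, and factory function are hypothetical and only illustrate the add/replace/remove behaviour described above.

// Hypothetical event shape emitted by the object watch facilities.
type WatchEventType = 'ADDED' | 'MODIFIED' | 'DELETED';

interface SloInstance {
  configure(rawMapping: unknown): void;
}

interface SloMappingWatchEvent {
  type: WatchEventType;
  mappingKey: string;    // unique key of the SLO mapping instance
  rawMapping: unknown;   // orchestrator-independent object after transformation
}

// Registry of active SLO instances maintained for the evaluation loop.
const activeSlos = new Map<string, SloInstance>();

// Placeholder factory; a real controller would instantiate the SLO type
// registered for the given SLO mapping kind.
function createSloFor(rawMapping: unknown): SloInstance {
  return { configure: (_mapping: unknown) => { /* read the mapping's spec */ } };
}

// React to a watch event: add or replace the SLO on additions and changes,
// and drop it from the evaluation loop on removals.
function onSloMappingEvent(event: SloMappingWatchEvent): void {
  if (event.type === 'DELETED') {
    activeSlos.delete(event.mappingKey);
    return;
  }
  const slo = createSloFor(event.rawMapping);
  slo.configure(event.rawMapping);
  activeSlos.set(event.mappingKey, slo);
}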



Fig. 7.7 SLO control loop. © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021b)

If an existing SLO mapping has been removed, the corresponding SLO object is removed from the evaluation loop as well. Subsequently, the watch loop returns to waiting for the next event.

Evaluation Loop: The evaluation loop is executed at predefined intervals that are configurable by the SLO controller. Whenever it is triggered, the evaluation loop iterates through the list of all its SLOs. For each SLO, the current status is evaluated using the SLO evaluation facilities. The exact evaluation process depends on the implementation of the particular SLO that is built on top of the Polaris Middleware.


These metrics may be further processed and combined and might subsequently be compared to the ideal values configured by the user in the SLO mapping. This results in an SLO output that indicates whether the SLO is currently fulfilled, violated, or outperformed, i.e., fulfilled by such a large margin that a resource reduction is possible, as well as any additional information necessary to return it to a fulfilled state, if needed. This output is wrapped in an elasticity strategy object of the type specified by the SLO configuration. The elasticity strategy object is subsequently transformed into an orchestrator-specific object (using the Transformation Service) and submitted to the orchestrator, where it triggers the respective controller for the elasticity strategy. This submission to the orchestrator is the part of the SLO evaluation facilities that must be implemented by the orchestrator connector. The SLO control loop is designed to handle errors during the evaluation of an SLO gracefully, so that a buggy or problematic SLO does not cause the entire controller to fail.

The SLO control loop relies on the Transformation Service to convert between orchestrator-independent and orchestrator-specific objects. All objects that are received from or submitted to the orchestrator pass through this service. Orchestrator connector libraries can register transformers for object types whose orchestrator-specific data structure does not match that of the corresponding type in the Polaris Middleware. The Transformation Service is responsible only for transforming the structure of objects; serialization and deserialization (e.g., to/from JSON) are handled by the object watch and SLO evaluation facilities. To transform an object's structure, the Transformation Service recursively iterates through all attributes of an input object. If a transformer has been registered for an attribute's type, it is executed on the attribute's value according to the direction of the current transformation operation, that is, from orchestrator-independent to orchestrator-specific or vice versa. If no transformer is registered for a particular type, the value is copied, and the recursive iteration continues on the value's attributes. Figure 7.8 exemplifies how an SLO mapping object for a cost efficiency SLO is transformed into a Kubernetes resource object. Note that the objectKind attribute of the Polaris object is transformed into two attributes (apiVersion and kind) of the Kubernetes object.

The orchestrator-independent object (before transformation):

  :CostEfficiencySloMapping
    objectKind: { group: 'slo.polaris.github.io', version: 'v1',
                  kind: 'CostEfficiencySloMapping' },
    metadata: { name: 'my-slo' },
    spec: { ... }

The Kubernetes resource object (after transformation):

  :KubernetesObject
    apiVersion: 'slo.polaris.github.io/v1',
    kind: 'CostEfficiencySloMapping',
    metadata: { name: 'my-slo' },
    spec: { ... }

Fig. 7.8 Cost efficiency SLO mapping before and after transformation. © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021b)
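To make the transformation in Fig. 7.8 more concrete, the following TypeScript sketch shows how such a conversion could look in both directions; the interface names and the shape of the transformer object are assumptions made for this illustration and do not reproduce the actual Polaris transformer API.

  // Illustrative only: these interfaces and the transformer object are
  // assumptions for this sketch, not the actual Polaris Middleware API.
  interface ObjectKind { group: string; version: string; kind: string; }

  interface PolarisSloMappingObj {
    objectKind: ObjectKind;
    metadata: { name: string };
    spec: unknown;
  }

  interface KubernetesResourceObj {
    apiVersion: string;
    kind: string;
    metadata: { name: string };
    spec: unknown;
  }

  const sloMappingTransformer = {
    // Polaris object -> Kubernetes resource: objectKind is split into
    // apiVersion ('<group>/<version>') and kind (cf. Fig. 7.8).
    toOrchestrator(obj: PolarisSloMappingObj): KubernetesResourceObj {
      return {
        apiVersion: `${obj.objectKind.group}/${obj.objectKind.version}`,
        kind: obj.objectKind.kind,
        metadata: obj.metadata,
        spec: obj.spec,
      };
    },

    // Kubernetes resource -> Polaris object: apiVersion is parsed back
    // into the group and version of the ObjectKind.
    fromOrchestrator(obj: KubernetesResourceObj): PolarisSloMappingObj {
      const [group, version] = obj.apiVersion.split('/');
      return {
        objectKind: { group, version, kind: obj.kind },
        metadata: obj.metadata,
        spec: obj.spec,
      };
    },
  };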


Another essential mechanism in an SLO controller is the decoupling of SLOs and elasticity strategies. The goal of this mechanism is twofold. Firstly, it allows an SLO to trigger a user-configurable elasticity strategy that is unknown at the time the SLO controller is built; the SLO controller therefore cannot have a hardcoded set of elasticity strategy options. Secondly, it allows an elasticity strategy to be reused by multiple SLOs, avoiding the re-implementation of the same strategy for each SLO.

To achieve both goals, we have defined a common structure for elasticity strategy resources; it consists of three parts: (1) a reference to the target workload, (2) the output data from the SLO evaluation, and (3) static configuration parameters supplied by the user. The target workload reference and the static configuration parameters are copied from the SLO mapping by the Polaris Middleware. The configuration parameters are specific to the elasticity strategy that the user has chosen. This does not limit the generality of the mechanism, because the parameters are statically specified together with the identifier of the chosen elasticity strategy and are not modified by the SLO. Conversely, the SLO evaluation output data are entirely produced by the SLO controller. The structure of the SLO output determines which elasticity strategies can be combined with that SLO: if an elasticity strategy supports the SLO's output data type as input, the two are compatible. The Polaris Middleware only needs to copy the SLO output data to the elasticity strategy resource. Using a generic data structure that is supported by multiple SLOs and elasticity strategies as an SLO's output data type increases the number of possible SLO/elasticity strategy combinations. Any suitable data structure can be used for this purpose. The Polaris Middleware includes the generic SloCompliance data type, which captures the compliance to an SLO as a percentage: a compliance value of 100% indicates that the SLO is exactly met, a higher value means that the SLO is violated and an increase in resources is needed, and a value lower than 100% indicates that the SLO is being outperformed and that resources can be reduced to save costs. To avoid overly frequent scaling, SloCompliance allows specifying a tolerance band within which no elasticity action should be performed.
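A minimal sketch of such a generic output type, and of how an elasticity strategy controller could interpret it, is shown below; the field and function names are assumptions chosen for this illustration and do not necessarily match the exact SloCompliance definition in the Polaris Middleware.

  // Illustrative sketch; field names are assumptions, not the exact Polaris API.
  interface SloComplianceSketch {
    // 100 means the SLO is exactly met; values above 100 indicate a violation
    // (more resources needed); values below 100 indicate that the SLO is
    // outperformed (resources can be reduced).
    currSloCompliancePercentage: number;
    // Tolerance band around 100% within which no elasticity action should be
    // performed, to avoid overly frequent scaling.
    tolerance?: number;
  }

  // How an elasticity strategy controller might interpret the compliance value.
  function scalingAction(c: SloComplianceSketch): 'scaleOut' | 'scaleIn' | 'none' {
    const tolerance = c.tolerance ?? 10;
    if (c.currSloCompliancePercentage > 100 + tolerance) { return 'scaleOut'; }
    if (c.currSloCompliancePercentage < 100 - tolerance) { return 'scaleIn'; }
    return 'none';
  }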

7.2.3.2 Provider-Independent SLO Metrics Collection and Processing Mechanism

The metrics required for evaluating an SLO can be obtained through two mechanisms: Raw Metrics Service and Composed Metrics Service. The former is intended for low-level metrics that are directly measurable on a workload (CPU usage, network throughput, etc.); the latter allows obtaining higher-level metrics that are aggregations of several lower-level metrics or predictions of metrics.


Raw Metrics Service The Raw Metrics Service enables the DB-independent construction of queries for time series data. To this end, it allows specifying the metric name and the target workload, as well as the time range and filter criteria. Furthermore, it provides arithmetic and logical operators and aggregation functions to operate on the metrics. Upon execution, a query is transformed into the native query language of the used time series DB. The result of a query is an ordered sequence, or a set of ordered sequences, of primarily simple (numeric or Boolean) raw, low-level metric values. The Raw Metrics Service is designed as a fluent API [24, 25], meaning that the code resulting from its use should read naturally; specifically, this entails chaining of method calls, supporting nested function calls, and relying on object scoping. Listing 7.1 shows a query for the sum of the durations of all HTTP requests that were directed to a specific workload in the last minute, grouped by their request paths.

Listing 7.1 Raw Metrics Service query for the total duration of all HTTP requests in the last minute, grouped by paths

  rawMetricsService.getTimeSeriesSource()
    .select('my_workload', 'request_duration_seconds_count',
      TimeRange.fromDuration(Duration.fromMinutes(1)))
    .filterOnLabel(LabelFilters.regex('http_controller', 'my_workload.*'))
    .sumByGroup(LabelGrouping.by('path'))
    .execute();

Composed Metrics Service The Composed Metrics Service is aimed at high-level metrics. These may be simple values (e.g., numbers or Booleans) or complex data structures. Unlike a raw (low-level) metric, a composed metric is not directly observable on a workload but needs to be calculated, for example, by aggregating several lower-level metrics. A composed metric may also represent predictions of future values of a metric. Every composed metric has a composed metric type definition that specifies the data structure of its values and a unique name for identification. The calculation of a composed metric requires an additional entity, termed a composed metric source, to perform this calculation. Each composed metric source supplies a metric of a specific composed metric type. A composed metric type is similar to an interface in object-oriented programming: it specifies the type of composed metric that is delivered and may be supplied by multiple composed metric sources. Apart from its composed metric type, a composed metric source is also identified by the type of target workload it supports. This enables high-level metrics (e.g., cost efficiency) that need to be computed differently for various workload types. For example, for a REST service, cost efficiency relies on the response time of the incoming HTTP requests, a metric that is not available for a SQL database; in that case, the execution time of the queries could be used instead.


This entails different composed metric sources that can be registered for the respective workload types. The Composed Metrics Service supports both composed metric sources integrated into the SLO controller through libraries and out-of-process composed metric sources that execute within their own metric controller. The former option computes the composed metric within the SLO controller; it is easy for developers to realize, because it only requires creating a library that is imported into the SLO controller and registered with the Composed Metrics Service. The latter option is more flexible and decouples the implementation and maintenance of the SLO controller from that of the composed metric source. Out-of-process composed metric sources may be implemented as REST services or through the use of a shared DB; the latter allows the composed metric to be calculated once and reused by multiple SLO controllers. An out-of-process composed metric source can be leveraged to flexibly update or change the way a certain composed metric type is computed. For example, a TotalCost composed metric type is of interest to multiple SLOs. It may be supplied by a metric controller with a refresh rate of 5 min; in other words, the total cost of a workload is updated every 5 min. This metric controller can be replaced by a newer version with a refresh rate of 1 min, without having to recompile and redeploy the SLO controllers that depend on this composed metric.
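The relationship between a composed metric type and its sources can be sketched as follows in TypeScript; all names and signatures below are assumptions made for this illustration and are not the actual Polaris Middleware interfaces.

  // Illustrative sketch; names and signatures are assumptions, not the Polaris API.

  // The composed metric type plays the role of an interface: it defines the
  // value structure and is identified by a unique name.
  interface TotalCost {
    currentCostPerHour: number;
  }

  // A composed metric source supplies values of one composed metric type for
  // a particular kind of target workload.
  interface ComposedMetricSourceSketch<V> {
    getCurrentValue(): Promise<V>;
  }

  // One possible source: derives the total cost from resource usage and prices.
  class ResourceUsageTotalCostSource implements ComposedMetricSourceSketch<TotalCost> {
    constructor(
      private readonly cpuCostPerCoreHour: number,
      private readonly memCostPerGibHour: number,
      private readonly getUsage: () => Promise<{ cpuCores: number; memGib: number }>,
    ) {}

    async getCurrentValue(): Promise<TotalCost> {
      const usage = await this.getUsage();
      return {
        currentCostPerHour:
          usage.cpuCores * this.cpuCostPerCoreHour +
          usage.memGib * this.memCostPerGibHour,
      };
    }
  }

A different source supplying the same TotalCost type could instead read the cost from a cloud provider's billing data or from a shared DB maintained by an out-of-process metric controller, without any change to the consumers of the metric.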

7.2.4 Implementation

In this section, we briefly describe the implementation of the mechanisms from Sect. 7.2.3 in our core runtime and the connectors for Kubernetes and Prometheus. The Polaris Middleware and its CLI are realized in TypeScript and published as a set of npm library packages. An SLO controller is a Node.js application that uses these packages (as dependencies) to implement SLO checking and enforcement mechanisms. All middleware and CLI code, as well as example controllers, are available as open source.13

7.2.4.1 Orchestrator-Independent SLO Controller

The orchestrator-independent SLO controller relies on the abstractions provided by the core model, as well as the object watch and SLO evaluation facilities. Figure 7.9 shows the main components involved in the SLO control loop.

13 https://polaris-slo-cloud.github.io.


[Figure 7.9 shows the main classes and interfaces of the SLO control loop: SloControlLoop with its DefaultSloControlLoop implementation and SloControlLoopConfig (interval$, sloTimeoutMs); the registered ServiceLevelObjective instances (configure(), evaluate(), onDestroy()) and their SloMappingBase mappings; the SloEvaluator (evaluateSlo(), onBeforeEvaluateSlo(), onAfterEvaluateSlo()) with its SloEvaluatorBase; and the WatchManager (startWatchers(), stopWatchers()) with its WatchEventsHandler and the active ObjectKindWatcher instances.]

Fig. 7.9 SLO control loop components (simplified). © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021b)

In case the default control loop implementation does not suffice for a particular scenario, the runtime may be configured to use a custom implementation of the SloControlLoop interface. The SLO control loop manages ServiceLevelObjective objects that are implemented by the SLO controller. The ObjectKindWatcher is provided by the orchestrator connector library to enable observation of the supported SLO mapping types. The evaluation loop evaluates registered SLOs using the SloEvaluator provided by the orchestrator connector. The default implementation handles the evaluation of the SLO and the wrapping of its output in an elasticity strategy object; the connector library only needs to implement the submission to the orchestrator. The Kubernetes connector for the Polaris Middleware relies on kubernetes-client,14 the officially supported JavaScript client library for Kubernetes. It is important to note that the adaptation decisions are made inside the SLO- and elasticity strategy-specific code in the respective controllers; the purpose of the Polaris Middleware is to connect an SLO to any compatible elasticity strategy and to provide reusable facilities that reduce the effort of developing these types of controllers.

The Transformation Service relies on the open-source library class-transformer15 for executing the transformation process but provides its own, more flexible, transformer registration mechanism.

14 https://github.com/kubernetes-client/javascript. 15 https://github.com/typestack/class-transformer.


We assume that all raw orchestrator resources contain a metadata property that uniquely identifies their type. An orchestrator connector library is required to register a transformer that converts these metadata into an ObjectKind object, which is the Polaris abstraction used for identifying orchestrator resource types. The Transformation Service supports associating object kinds with Polaris classes to enable the transformation into the correct runtime objects.

The decoupling of SLOs and elasticity strategies relies on a common layout of elasticity strategy resources and the use of the same data type for the output of an SLO and the input of an elasticity strategy. The user selects an elasticity strategy for an SLO by specifying its object kind in the SLO mapping that configures the SLO. After evaluating an SLO, the Polaris Middleware instantiates the elasticity strategy class associated with this object kind and copies the SLO output data to it. An SLO mapping requires the user to choose exactly one elasticity strategy. An elasticity strategy is responsible for ensuring that its sub-actions do not conflict with each other (e.g., when it combines horizontal and vertical scaling). Unlike metrics, SLOs and elasticity strategies cannot be composed. However, it is possible to configure multiple SLOs for a single workload. Since such combinations are highly use case-specific, there is no generic conflict resolution mechanism; instead, the user needs to ensure that there are no conflicts, which does not limit the expressiveness of the solution.
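To make the developer's side of the control loop more tangible, the sketch below outlines an SLO implementation with the life cycle methods shown in Fig. 7.9 (configure(), evaluate(), onDestroy()); the generic parameters, signatures, and the response time SLO itself are assumptions made for this sketch and do not reproduce the actual Polaris interfaces.

  // Illustrative sketch; the interface and signatures are assumptions, not the
  // actual Polaris ServiceLevelObjective definition.
  interface SloOutputSketch {
    currSloCompliancePercentage: number;
  }

  interface ServiceLevelObjectiveSketch<C, O> {
    configure(config: C): Promise<void>; // Called when the SLO mapping is added or changed.
    evaluate(): Promise<O>;              // Called by the evaluation loop at every interval.
    onDestroy(): void;                   // Called when the SLO mapping is removed.
  }

  class AvgResponseTimeSlo
    implements ServiceLevelObjectiveSketch<{ targetMs: number }, SloOutputSketch> {

    private targetMs = 1;

    async configure(config: { targetMs: number }): Promise<void> {
      this.targetMs = config.targetMs;
    }

    async evaluate(): Promise<SloOutputSketch> {
      const observedMs = await this.fetchAvgResponseTimeMs();
      // A value above 100% signals a violation, i.e., more resources are needed.
      return { currSloCompliancePercentage: (observedMs / this.targetMs) * 100 };
    }

    onDestroy(): void {
      // Release metric sources, timers, or watchers held by this SLO.
    }

    private async fetchAvgResponseTimeMs(): Promise<number> {
      return 250; // Placeholder: a real SLO would query the metrics services here.
    }
  }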

7.2.4.2 Provider-Independent SLO Metrics Collection and Processing Mechanism

Raw Metrics Service To create a raw metrics query, the Raw Metrics Service is used to obtain a TimeSeriesSource; it realizes a DB-independent interface for assembling time series queries for a particular target DB. The supported sources are registered with the Polaris Middleware when the SLO controller starts. The select() method of TimeSeriesSource creates a new query by specifying the name of the metric and the target workload. Each method call on a query (see Listing 7.1) returns an immutable object that models the query up to this point. The query may be executed using the execute() method or extended by adding another query clause with an additional method call, which yields a new, immutable query object. This approach allows reusing a base query object for multiple queries without side effects, for example, using a time series of all HTTP request durations to sum up all request durations or to calculate the average duration of a request. When execute() is called on a query object q, the segments of the query chain, starting from the select() query object up to query object q, are passed to a NativeQueryBuilder. This builder needs to be provided by a DB connector library (e.g., the Prometheus connector).
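Because each clause yields a new, immutable query object, a base query can be extended in several directions. The short sketch below reuses the query from Listing 7.1 for two different groupings; it assumes the same imports and rawMetricsService instance as Listing 7.1 and is meant purely as an illustration.

  // Base query: request duration metrics of the workload from the last minute.
  const baseQuery = rawMetricsService.getTimeSeriesSource()
    .select('my_workload', 'request_duration_seconds_count',
      TimeRange.fromDuration(Duration.fromMinutes(1)))
    .filterOnLabel(LabelFilters.regex('http_controller', 'my_workload.*'));

  // Each extension creates a new immutable query object, so the base query
  // can be reused without side effects.
  const byPath = await baseQuery.sumByGroup(LabelGrouping.by('path')).execute();
  const byController = await baseQuery.sumByGroup(LabelGrouping.by('http_controller')).execute();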


Composed Metrics Service To get a composed metric, a ComposedMetricSource is obtained from the Composed Metrics Service using a composed metric type and the target workload. Upon startup, the SLO controller registers all ComposedMetricSource realizations that are provided through libraries, together with their corresponding composed metric types and supported target workload types, in the Polaris Middleware. These composed metric sources execute their metric computation logic inside the SLO controller, for example, by using the Raw Metrics Service internally to retrieve and aggregate multiple raw metrics. If no ComposedMetricSource has been registered for a particular composed metric type, the Composed Metrics Service assumes that this is an out-of-process composed metric source that is realized by a connector library. The Prometheus connector provides a ComposedMetricSource realization that relies on Prometheus as a shared DB, where standalone composed metric controllers store their computed metrics.

7.3 Evaluation

We implement the motivating cost efficiency SLO use case from Sect. 7.1.2 to show the productivity benefits of using SLO Script and the Polaris Middleware, and we run experiments in our cluster testbed to evaluate its performance.

7.3.1 Demo Application Setup

Figure 7.10 provides an overview of the components of the cost efficiency SLO implementation and their relationships. The blue components are implemented for the use case; the white components are orchestrator- and DB-independent parts of the Polaris Middleware; and the green and orange components are part of the Kubernetes and Prometheus connectors, respectively. All code is available in the Polaris project's repository. Our test cluster provides an elasticity strategy for horizontal scaling; it is part of the Polaris project and accepts generic SloCompliance data as input.

Unlike with a simple CPU utilization SLO, it is not possible to determine whether an increase or a decrease in resources is needed by examining only the current and target values of the cost efficiency. In fact, a low cost efficiency is ambiguous, as it may indicate either of the following situations:

• The system cannot handle the current high demand in time, and thus an increase in resources is needed.
• All requests are handled in time, but too many resources are provisioned for the few incoming requests, and thus a decrease in resources is needed.

We must account for this when creating our SLO and include additional information to distinguish between these cases. To this end, we set up an SLO mapping type in an npm library to allow users to configure the cost efficiency SLO.


[Figure 7.10 shows the components involved in the cost efficiency SLO implementation and how they relate to the Polaris Middleware interfaces (SloControlLoop, ServiceLevelObjective, SloEvaluator, SloCompliance, Transformation Service, Transformer, ComposedMetricsService, RawMetricsService, ObjectKindWatcher, NativeQueryBuilder) and to the connector components: CostEfficiencySlo, CostEfficiencyMetric, RestApiCostEfficiencyMetricSource, TotalCostMetric, KubeCostMetricSource, HorizontalElasticityStrategyResource, KubernetesSloEvaluator, Kubernetes Watcher, Kubernetes Transformers, and the Prometheus Connector.]

Fig. 7.10 Cost efficiency SLO implementation (blue), Kubernetes connector (green), and Prometheus connector (orange). © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021b)

The Polaris CLI can create a TypeScript class CostEfficiencySloMapping that can be extended with the configuration parameters. In SLO Script, we define the type CostEfficiencySloConfig as shown in Listing 7.2. To handle the ambiguity problem we just described, we add an additional parameter to this configuration type: the minimum percentile of requests that should be handled within the response time threshold. If the percentage of requests handled within the threshold is below this percentile, the service does not have enough resources to handle the load; otherwise, the service has too many resources.

Listing 7.2 Cost efficiency SLO configuration

  export interface CostEfficiencySloConfig {
    responseTimeThresholdMs: number;
    targetCostEfficiency: number;
    minRequestsPercentile?: number;
  }
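The decision logic implied by the minRequestsPercentile parameter can be sketched as follows; the function and its inputs are illustrative assumptions and are not part of the actual cost efficiency SLO implementation.

  // Illustrative only: disambiguating a low cost efficiency value.
  interface CostEfficiencyObservation {
    costEfficiency: number;            // Requests handled within the threshold / total cost.
    percentileWithinThreshold: number; // Share of requests handled within the threshold (0-100).
  }

  function resolveScalingDirection(
    obs: CostEfficiencyObservation,
    config: { targetCostEfficiency: number; minRequestsPercentile?: number },
  ): 'scaleOut' | 'scaleIn' | 'none' {
    if (obs.costEfficiency >= config.targetCostEfficiency) {
      return 'none'; // SLO fulfilled.
    }
    const minPercentile = config.minRequestsPercentile ?? 90;
    if (obs.percentileWithinThreshold < minPercentile) {
      return 'scaleOut'; // Too few requests are served in time: add resources.
    }
    return 'scaleIn';    // Requests are served in time but cost is high: remove resources.
  }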

Listing 7.3 shows the SloMappingSpec and SloMapping classes. The spec class defines the generic type parameters for its superclass: the configuration type for this SLO is CostEfficiencySloConfig, the output type is SloCompliance, and the target workload must be of type RestServiceTarget. This short definition ensures that (1) the SLO can only be applied to workloads of the correct type (i.e., workloads that expose the required metrics) and (2) only an elasticity strategy that supports the SLO's output data can be used, because each ElasticityStrategyKind needs to specify its compatible input types in an analogous way. This greatly reduces the possibility of deploy-time or runtime errors, because SLO Script enforces that only compatible workloads and elasticity strategies are used.

7.3 Evaluation

241

The constructor of the CostEfficiencySloMapping class initializes the objectKind property to ensure that the correct API group and kind are configured, and the @PolarisType decorator sets the appropriate class for the transformation of the spec property. After configuring all properties of the SLO mapping, we use the Polaris CLI to generate a Kubernetes CRD that allows the SLO mapping type to be registered in the orchestrator.

Listing 7.3 Cost efficiency SLO mapping

  export class CostEfficiencySloMappingSpec extends SloMappingSpecBase<
    CostEfficiencySloConfig, SloCompliance, RestServiceTarget> { }

  export class CostEfficiencySloMapping extends SloMappingBase<CostEfficiencySloMappingSpec> {
    constructor(initData?: SloMappingInitData) {
      super(initData);
      this.objectKind = new ObjectKind({
        group: 'slo.polaris-slo-cloud.github.io',
        version: 'v1',
        kind: 'CostEfficiencySloMapping'
      });
      initSelf(this, initData);
    }

    @PolarisType(() => CostEfficiencySloMappingSpec)
    spec: CostEfficiencySloMappingSpec;
  }
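For illustration, an SLO mapping instance that links the SLO configuration from Listing 7.2 to a target workload and a user-selected elasticity strategy might be constructed roughly as follows; the spec field names (targetRef, elasticityStrategy, sloConfig), the metadata shape, and the elasticity strategy group are assumptions made for this sketch and may differ from the actual Polaris API.

  // Hedged sketch; field names are assumptions, not necessarily the Polaris API.
  const costEfficiencySloMapping = new CostEfficiencySloMapping({
    metadata: { name: 'my-rest-service-cost-efficiency' },
    spec: {
      // Reference to the workload that the SLO applies to (assumed field name).
      targetRef: { group: 'apps', version: 'v1', kind: 'Deployment', name: 'my-rest-service' },
      // The user-selected elasticity strategy, identified by its object kind.
      elasticityStrategy: new ObjectKind({
        group: 'elasticity.polaris-slo-cloud.github.io',
        version: 'v1',
        kind: 'HorizontalElasticityStrategy',
      }),
      // SLO-specific configuration (cf. Listing 7.2).
      sloConfig: {
        responseTimeThresholdMs: 400,
        targetCostEfficiency: 1000,
        minRequestsPercentile: 90,
      },
    },
  });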

To enable reusing the cost efficiency metric, we implemented it as a composed metric in a library. In our use case, cost efficiency is defined as the number of REST requests handled faster than N milliseconds, divided by the total cost of the workload. However, cost efficiency is not only useful for REST services but can be applied to other types of services as well, for example, a weather prediction service, albeit with a different raw metric as the numerator of the equation. To allow this, we define a generic cost efficiency composed metric type (composed metric types are shown as interfaces in Fig. 7.10) that can be implemented by multiple composed metric sources. Thus, we enable a cost efficiency SLO controller to support multiple workload types (REST services, prediction services, etc.), either by registering multiple cost efficiency composed metric sources from libraries or by relying on out-of-process composed metric services to provide the cost efficiency metric for the various workload types. In our use case, we supply a cost efficiency implementation for REST services; because the Composed Metrics Service differentiates between workload types when obtaining a composed metric source, implementations for an arbitrary number of additional workload types can be added.

Since total cost is an important metric in cloud and edge computing, this part of the cost efficiency composed metric could be reused by other composed metrics or SLOs as well. To this end, we create a total cost composed metric type that may be supplied by multiple composed metric sources.


We provide an implementation that relies on KubeCost,16 which we use to export the hourly resource costs to Prometheus. In the implementation of the KubeCostMetricSource, we use the Raw Metrics Service to obtain these costs and the recent CPU and memory usage of the involved workload components to calculate the total cost. The RestApiCostEfficiencyMetricSource also relies on the Raw Metrics Service to read the HTTP request metrics from the time series DB, and it uses the Composed Metrics Service to obtain the total cost composed metric source for the current workload in order to calculate the cost efficiency. The modular approach of the composed metrics allows changing parts of the implementation (e.g., using a different cost provider) without affecting the rest of the composed metrics. Note that even though we use Prometheus in our use case, the implementation of both composed metrics is completely DB-independent; a DB connector must be initialized by the SLO controller (the Prometheus connector in Fig. 7.10) to provide a NativeQueryBuilder for generating queries for a specific DB.

Next, the SLO controller needs to be created. Its bootstrapping code, generated by the Polaris CLI, initializes the Polaris Middleware and the Kubernetes and Prometheus connectors, registers the cost efficiency SLO and its SLO mapping type with the SLO control loop, and starts the control loop. For the CostEfficiencySlo class, a skeleton is generated to realize the ServiceLevelObjective interface. Since the cost efficiency composed metric has been developed as a library, we need to call its initialization function during controller startup to register the cost efficiency metric with the Composed Metrics Service.

The SLO control loop monitors CostEfficiencySloMapping resources in the orchestrator through the object watch facilities. To this end, the Kubernetes connector provides an implementation of the ObjectKindWatcher interface that relies on the Transformation Service to transform Kubernetes resources using the transformers supplied by the Kubernetes connector as well. When a CostEfficiencySloMapping resource is received by the SLO control loop, the CostEfficiencySlo class is instantiated to handle its evaluation, which is periodically triggered by the control loop through the SLO evaluation facilities. In the CostEfficiencySlo class, we use the Composed Metrics Service to obtain the composed metric source for the cost efficiency metric. The current value of the metric is compared to the target value configured by the user; then, an SLO compliance value is calculated and returned to the SLO evaluation facilities, whose orchestrator-specific parts are realized by the Kubernetes connector. They use the elasticity strategy object kind, which is configured in the SLO mapping instance, to create a HorizontalElasticityStrategy resource that wraps the SloCompliance output and submit it to the orchestrator to trigger the elasticity strategy controller.
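The way the REST cost efficiency source combines the two metric services can be sketched as follows; all names and signatures are assumptions made for this illustration and do not reproduce the actual RestApiCostEfficiencyMetricSource.

  // Illustrative only: combining a raw request metric with the total cost
  // composed metric to compute cost efficiency.
  interface TotalCostValue { currentCostPerHour: number; }

  class RestCostEfficiencySketch {
    constructor(
      // Requests per second handled faster than the response time threshold,
      // e.g., obtained through Raw Metrics Service queries.
      private readonly getFastRequestsPerSecond: () => Promise<number>,
      // The total cost composed metric source for the same workload,
      // e.g., obtained through the Composed Metrics Service.
      private readonly totalCostSource: { getCurrentValue(): Promise<TotalCostValue> },
    ) {}

    async getCostEfficiency(): Promise<number> {
      const fastRequests = await this.getFastRequestsPerSecond();
      const totalCost = await this.totalCostSource.getCurrentValue();
      // Cost efficiency = requests handled within the threshold / total cost.
      return fastRequests / totalCost.currentCostPerHour;
    }
  }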

16 https://www.kubecost.com.


Table 7.1 Lines of code (excl. comments and blanks)

  Component            Lines of code   % of total
  Composed metrics     209             7%
  SLO controller       119             4%
  Polaris Middleware   2594            89%
  Total                2922            100%

7.3.2 Qualitative Evaluation

Due to the use of the generic SloCompliance type (depicted as an interface in Fig. 7.10) and the dynamic instantiation of the elasticity strategy resource, the cost efficiency SLO does not need to know about the specific elasticity strategy that will be used. Similarly, the horizontal elasticity strategy controller does not require any information about the SLO that created the elasticity strategy resource. The type of SLO output data is the only link that connects an SLO to an elasticity strategy; apart from having to share the same output/input data type, they are completely decoupled. For example, switching to a vertical elasticity strategy only requires the user to alter the SLO mapping instance to reference a vertical instead of a horizontal elasticity strategy object kind.

All orchestrator-specific actions used in the SLO control loop are encapsulated in the object watch and SLO evaluation facilities, as well as the transformers used by the Transformation Service, which in this use case are implemented by the Kubernetes connector library. Switching to a different orchestrator, e.g., OpenStack,17 only entails exchanging the Kubernetes connector library for an OpenStack connector library (i.e., importing a different library and changing one initialization function call); the rest of the cost efficiency SLO controller's implementation would remain unchanged. The same applies to changing the type of time series DB used as the source for the raw metrics needed to compute the cost efficiency composed metric: the Prometheus connector library could be exchanged for an InfluxDB connector library, for example, without altering the implementation of the cost efficiency composed metric source.

Table 7.1 summarizes the line counts of the involved components. The Polaris Middleware accounts for the largest part, with 89% of the total code. The reusable total cost and cost efficiency metrics together add up to 209 lines, or 7% of the code. The cost efficiency SLO controller is the smallest part, with only 119 lines (4% of the total code), about half of which can be generated by the Polaris CLI. This shows that using the Polaris Middleware greatly increases productivity when developing complex SLOs while keeping them portable across multiple orchestrators and DBs. To better illustrate the usage of the Polaris CLI, we have published a demo video online.18

17 https://www.openstack.org. 18 https://www.youtube.com/watch?v=qScTsLGyOi8.


The orchestrator-independent object model of SLO Script eases the porting of SLOs and their mappings to other orchestration platforms; this promotes flexibility, limits the possibility of vendor lock-in for consumers, and fosters open-source collaboration on SLOs for multiple platforms. Many SLOs may be implemented in a completely orchestrator-independent manner as well, allowing the creation of "standard SLO libraries" for instant reuse on other platforms.

7.3.3 Performance Evaluation

Our testbed consists of a three-node Kubernetes cluster, with one control plane node and two worker nodes, all running MicroK8s19 v1.20 (which is based on Kubernetes v1.20). The underlying Virtual Machines (VMs) run Debian Linux 10 and have the following configurations:

• Control plane and Worker1: 4 vCPUs and 16 GB of RAM
• Worker2: 8 vCPUs and 32 GB of RAM

We use a synthetic workload for the performance tests. Because, to the best of our knowledge, there is no other middleware that offers the same features as Polaris, we compare our results to the production-ready mechanisms offered by the HPA. However, realizing composed metrics with the HPA would require adding a custom Kubernetes API server to provide these metrics, which means that it could not compete with Polaris with respect to the lines of code. We conduct two experiments, in which we create 100 cost efficiency SLO mappings and let the SLO controller evaluate them at an interval of 20 s.

SLO Controller Resource Usage First, we show that an SLO controller built with the Polaris Middleware does not consume excessive resources, even when handling numerous SLOs. For this experiment, we deploy the cost efficiency SLO controller to our cluster in a pod with resource limits of 1 vCPU and 512 MiB RAM. We observe the resource usage over a period of 20 min using Grafana20 to fetch and visualize metrics from Prometheus. While evaluating 100 SLOs every 20 s, the CPU usage stays between 0.2 and 0.25 vCPUs, and the memory usage stays between 102 and 140 MiB. Thus, both CPU and memory usage remain far below the pod's limits and constitute reasonable values for execution in the cloud.

Execution Performance of the Polaris Middleware Next, we demonstrate that the Polaris Middleware does not add significant overhead to an SLO controller. To this end, we execute the cost efficiency SLO controller on a development machine (Intel Core i7 Whiskey Lake-U with 4 CPU cores, clocked at 1.8 GHz, and 16 GiB RAM) under the Visual Studio Code JavaScript debugger and profiler, while being connected to our cluster's control plane node through SSH.

19 https://microk8s.io. 20 https://grafana.com.


Fig. 7.11 Average total execution times of executeControlLoopIteration() and its children across all 300 s profiling sessions. © 2021 IEEE, reprinted, with permission, from Pusztai et al. (2021b)

As in the previous experiment, we use 100 cost efficiency SLO mappings to generate load. We execute three profiling sessions, each 300 s (i.e., 5 min) long. Figure 7.11 shows a flame chart with the total execution times of all SLO control loop iterations and the major methods invoked by them; the numbers are the mean values across all profiling sessions.

The sum of the execution times of all SLO control loop iterations in a 300 s profiling session is on average 12,480 milliseconds (ms). The SLO control loop itself and the triggering of elasticity strategies using the results from the SLO evaluations only take about 9% of that time; the remaining 91% is consumed by the evaluation of the cost efficiency SLO. The SLO relies on the cost efficiency composed metric, which takes up most of the SLO's execution time. The composed metric sets up one raw metrics query itself, for the HTTP request metrics, and delegates the creation of the query for the costs to the total cost composed metric. The execution of both raw metrics queries takes about 58% of the total SLO control loop execution time, and more than half of this (35% of the total) is spent on query execution in the third-party Prometheus client library. This analysis demonstrates that the evaluation of SLOs using the Polaris Middleware is efficient and does not show any evidence of bottlenecks.
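As a rough, back-of-the-envelope reading of these numbers (an interpretation added here, not part of the original measurements): with an evaluation interval of 20 s over a 300 s session, there are about 15 control loop iterations, so one iteration takes on average roughly 12,480 ms / 15 ≈ 832 ms for 100 SLOs, i.e., roughly 8 ms per SLO. Of each iteration, about 9% ≈ 75 ms is loop and elasticity strategy overhead and about 58% ≈ 483 ms is raw metrics query execution, of which about 291 ms (35% of the total) falls inside the Prometheus client library.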


7.4 Summary

This chapter has presented SLO Script and the Polaris Middleware, a language and an accompanying framework for defining and implementing service-level objectives. They are based on TypeScript and are parts of the open-source Polaris project. We have motivated, using a real-world use case, why SLO Script and the middleware are needed. We showed SLO Script's meta-model and then described SLO Script's design, as well as how it fulfils its main contributions: (1) a high-level StronglyTypedSLO abstraction with type safety features, (2) the decoupling of SLOs from elasticity strategies, (3) a strongly typed metrics API, and (4) an orchestrator-independent object model that promotes extensibility. We presented our design and implementation of the Polaris Middleware and the mechanisms that enable its core contributions: (1) the orchestrator-independent SLO controller for periodically evaluating SLOs and triggering elasticity strategies; (2) the provider-independent SLO metrics collection and processing mechanism for obtaining raw, low-level metrics from a time series DB and composing them into reusable, higher-level composed metrics; and (3) a CLI tool for creating and managing projects that rely on the Polaris Middleware. Finally, we used the realization of the motivating use case with SLO Script and the Polaris Middleware to evaluate the efficiency of SLO Script and its performance on our middleware engine, showing that they provide substantial benefits and flexibility when implementing SLOs.

References

1. Vincent C. Emeakaroha, Ivona Brandic, Michael Maurer, and Schahram Dustdar. Low level metrics to high level SLAs—LoM2HiS framework: Bridging the gap between monitored metrics and SLA parameters in cloud environments. In 2010 International Conference on High Performance Computing & Simulation, pages 48–54. IEEE, 2010.
2. Alexander Keller and Heiko Ludwig. The WSLA framework: Specifying and monitoring service level agreements for web services. Journal of Network and Systems Management, 11(1):57–81, 2003.
3. Stefan Nastic, Andrea Morichetta, Thomas Pusztai, Schahram Dustdar, Xiaoning Ding, Deepak Vij, and Ying Xiong. SLOC: Service level objectives for next generation cloud computing. IEEE Internet Computing, 24(3):39–50, 2020.
4. Nikolas Roman Herbst, Samuel Kounev, and Ralf Reussner. Elasticity in cloud computing: What it is, and what it is not. In 10th International Conference on Autonomic Computing (ICAC 13), pages 23–27, San Jose, CA, 2013. USENIX Association.
5. Schahram Dustdar, Yike Guo, Benjamin Satzger, and Hong-Linh Truong. Principles of elastic processes. IEEE Internet Computing, 15(5):66–71, 2011.
6. Zheng Li, Liam O'Brien, He Zhang, and Rainbow Cai. On a catalogue of metrics for evaluating commercial cloud services. In 2012 ACM/IEEE 13th International Conference on Grid Computing, pages 164–173. IEEE, 2012.
7. Tor Atle Hjeltnes and Borje Hansson. Cost effectiveness and cost efficiency in e-learning. QUIS-Quality, Interoperability and Standards in e-Learning, Norway, 2005.
8. T. Pusztai, S. Nastic, A. Morichetta, V. Casamayor Pujol, S. Dustdar, X. Ding, D. Vij, and Y. Xiong. SLO Script: A novel language for implementing complex cloud-native elasticity-driven SLOs. In 2021 IEEE International Conference on Web Services (ICWS), 2021.


9. Chenhao Qu, Rodrigo N. Calheiros, and Rajkumar Buyya. Auto-scaling web applications in clouds. ACM Comput. Surv., 51(4):1–33, 2018.
10. Isam Mashhour Al Jawarneh, Paolo Bellavista, Filippo Bosi, Luca Foschini, Giuseppe Martuscelli, Rebecca Montanari, and Amedeo Palopoli. Container orchestration engines: A thorough functional and performance comparison. In ICC 2019—2019 IEEE International Conference on Communications (ICC), pages 1–6. IEEE, May 2019.
11. Thanh-Tung Nguyen, Yu-Jin Yeom, Taehong Kim, Dae-Heon Park, and Sehan Kim. Horizontal pod autoscaling in Kubernetes for elastic container orchestration. Sensors (Basel, Switzerland), 20(16), 2020.
12. Qingye Jiang, Young Choon Lee, and Albert Y. Zomaya. The limit of horizontal scaling in public clouds. ACM Trans. Model. Perform. Eval. Comput. Syst., 5(1), 2020.
13. Emanuel Ferreira Coutinho, Flávio Rubens de Carvalho Sousa, Paulo Antonio Leal Rego, Danielo Gonçalves Gomes, and José Neuman de Souza. Elasticity in cloud computing: A survey. Annals of Telecommunications—Annales des Télécommunications, 70(7–8):289–309, 2015.
14. Amjad Ullah, Jingpeng Li, Yindong Shen, and Amir Hussain. A control theoretical view of cloud elasticity: Taxonomy, survey and challenges. Cluster Computing, 21(4):1735–1764, 2018.
15. The Kubernetes Authors. Custom metrics API—design proposal, 2018-01-22.
16. The Kubernetes Authors. External metrics API—design proposal, 2018-12-14.
17. The Kubernetes Authors. HPA v2 API extension proposal, 2018-02-14.
18. The Kubernetes Authors. Horizontal pod autoscaler with arbitrary metrics—design proposal, 2018-11-19.
19. Amazon Web Services, Inc. AWS auto scaling features, 2020.
20. Microsoft. Autoscaling, 2017.
21. Google, LLC. Autoscaling groups of instances, 2020.
22. Edson Manoel, Morten Jul Nielsen, Abdi Salahshour, Sai Sampath K.V.L., and Sanjeev Sudarshanan. Problem Determination Using Self-Managing Autonomic Technology. IBM Redbooks. IBM International Technical Support Organization, Austin, TX, 1st edition, 2005.
23. T. Pusztai, S. Nastic, A. Morichetta, V. Casamayor Pujol, S. Dustdar, X. Ding, D. Vij, and Y. Xiong. A novel middleware for efficiently implementing complex cloud-native SLOs. In 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 2021.
24. Martin Fowler. FluentInterface, 2005.
25. Martin Fowler. Domain-Specific Languages. Addison-Wesley, Upper Saddle River, NJ, 2010.