Communication Efficient Federated Learning for Wireless Networks [1st ed. 2024] 3031512650, 9783031512650

This book provides a comprehensive study of Federated Learning (FL) over wireless networks. It consists of three main parts: (a) fundamentals and preliminaries of FL, (b) analysis and optimization of FL over wireless networks, and (c) applications of wireless FL for Internet-of-Things systems.


Table of contents :
Preface
Acknowledgements
Contents
1 Introduction
1.1 Motivation of Distributed Learning
1.2 Challenges of Deploying Distributed Learning
1.3 Potential Techniques for Deploying Federated Learning
1.4 Summary and Book Overview
2 Fundamentals and Preliminaries of Federated Learning
2.1 Preliminaries of FL
2.1.1 Common Federated Learning
2.1.2 Federated Multi-Task Learning
2.1.3 Model Agnostic Meta Learning Based FL
2.2 Performance Metrics of FL Over Wireless Networks
2.2.1 Training Loss
2.2.2 Convergence Time
2.2.3 Energy Consumption
2.2.4 Reliability
2.3 Effects of Wireless Factors on FL Metrics
2.4 Research Directions of Deploying FL Over Wireless Networks
2.4.1 Wireless Resource Management
2.4.2 Compression and Sparsification
2.4.3 FL with Over-the-Air Computation
2.4.4 FL Training Method Design
2.4.5 Industry Interest
3 Resource Management for Federated Learning
3.1 Resource Management for FL Training Loss Minimization
3.1.1 Wireless FL Model
3.1.1.1 FL Parameter Transmission Model
3.1.1.2 FL Parameter Error Rates
3.1.1.3 Energy Consumption Model
3.1.2 Problem Formulation
3.1.3 Analysis of the FL Convergence Rate
3.1.4 Optimization of FL Training Loss
3.1.4.1 Optimal Transmit Power
3.1.4.2 Optimal Uplink Resource Block Allocation
3.1.5 Simulation Results and Analysis
3.2 Resource Management for FL Convergence Time Minimization
3.2.1 Wireless FL Model
3.2.2 Problem Formulation
3.2.3 Minimization of FL Convergence Time
3.2.3.1 Gradient Based User Association Scheme
3.2.3.2 Optimal RB Allocation Scheme
3.2.3.3 Prediction of the Local FL Models
3.2.4 Simulation Results and Analysis
3.3 Resource Management for Energy Efficiency Optimization
3.3.1 Wireless FL Model
3.3.1.1 Local Computation
3.3.1.2 Wireless Transmission
3.3.1.3 Global FL Model Generation and Broadcast
3.3.2 Problem Formulation
3.3.3 Iterative Algorithm
3.4 Conclusions
4 Quantization for Federated Learning
4.1 Universal Vector Quantization for Federated Learning
4.1.1 Wireless FL Model
4.1.2 Problem Formulation
4.1.3 Universal Vector Quantization based FL
4.1.4 Performance Analysis
4.1.4.1 Local SGD
4.1.4.2 Quantization Error Bound
4.1.4.3 FL Convergence Analysis
4.1.5 Numerical Evaluations
4.1.5.1 Quantization Error
4.1.5.2 FL Convergence
4.2 Variable Bitwidth Federated Learning
4.2.1 Variable Bitwidth FL Model
4.2.1.1 Training Process of Bitwidth Federated Learning
4.2.1.2 Training Delay of Bitwidth Federated Learning
4.2.2 Problem Formulation
4.2.3 Optimization Methodology
4.2.3.1 Components of Model Based RL Method
4.2.3.2 Calculation of State Transition Probability
4.2.3.3 Optimization of Device Selection and Quantization Scheme
4.2.4 Numerical Evaluation
4.2.4.1 Datasets and ML Models
4.2.4.2 Convergence Performance Analysis
4.3 Conclusions
5 Federated Learning with Over the Air Computation
5.1 AirComp Principle and Techniques
5.1.1 AirComp Principle
5.1.2 Broadband AirComp
5.1.3 MIMO AirComp
5.1.4 Design of AirComp Federated Learning
5.1.4.1 Model Update Distortion
5.1.4.2 Device Scheduling
5.1.4.3 Coding Against Interference
5.1.4.4 Power Control
5.2 Power Control Optimization for AirComp FL
5.2.1 AirComp FL Model
5.2.2 AirComp FL Convergence Analysis
5.2.2.1 Basic Assumptions on Learning Model
5.2.2.2 Optimality Gap Versus Aggregation Errors
5.2.2.3 Optimality Gap Versus Transmission Power Control
5.2.2.4 Convergence Analysis for AirComp-FL in Case I
5.2.2.5 Convergence Analysis for AirComp-FL in Case II
5.2.3 Power Control Optimization
5.2.3.1 Power Control Optimization for Case I
5.2.3.2 Power Control Optimization for Case II
5.2.3.3 Feasibility of Problem (P2.1)
5.2.3.4 Optimal Solution to Problem (P2.1)
5.2.4 Simulation Results
5.2.4.1 Simulation Setup and Benchmark Schemes
5.3 Beamforming Design for MIMO AirComp FL
5.3.1 Digital AirComp FL Model
5.3.1.1 Digital Pre-processing at the Devices
5.3.1.2 Post-processing at the PS
5.3.2 Problem Formulation
5.3.3 Optimization of Beamforming for FL Training Loss Minimization
5.3.3.1 Analysis of the Convergence of the Designed FL
5.3.3.2 Prediction of the Local FL Models
5.3.3.3 Optimization of the Beamforming Matrices
5.3.4 Simulation Results and Analysis
5.4 Conclusions
6 Federated Learning for Autonomous Vehicles Control
6.1 Autonomous Vehicle System Model
6.1.1 Adaptive Longitudinal Controller Model
6.1.2 FL Model
6.1.3 Communication Model
6.2 Dynamic Federated Proximal Algorithm for CAV Controller Design
6.2.1 Dynamic Federated Proximal Algorithm
6.2.2 Convergence of the DFP Algorithm
6.3 Contract-Theory Based Incentive Mechanism Design
6.3.1 Utility Function of the Parameter Server
6.3.2 Utility Function of the CAVs
6.3.3 Contract Design
6.4 Simulation Results
6.5 Conclusions
7 Federated Learning for Mobile Edge Computing
7.1 MEC Network Model
7.1.1 Transmission Model
7.1.2 Computing Model
7.1.2.1 Edge Computing Model
7.1.2.2 Local Computing Model
7.1.3 Time Consumption Model
7.1.4 Energy Consumption Model
7.2 Problem Formulation
7.3 Federated Learning for Proactive User Association
7.3.1 Components of the SVM-Based FL
7.3.2 Training of SVM-Based FL
7.4 Optimization of Service Sequence and Task Allocation
7.5 Simulation Results and Analysis
7.6 Conclusion
References


Wireless Networks

Mingzhe Chen Shuguang Cui

Communication Efficient Federated Learning for Wireless Networks

Wireless Networks Series Editor Xuemin Sherman Shen, University of Waterloo, Waterloo, ON, Canada

The purpose of Springer’s Wireless Networks book series is to establish the state of the art and set the course for future research and development in wireless communication networks. The scope of this series includes not only all aspects of wireless networks (including cellular networks, WiFi, sensor networks, and vehicular networks), but related areas such as cloud computing and big data. The series serves as a central source of references for wireless networks research and development. It aims to publish thorough and cohesive overviews on specific topics in wireless networks, as well as works that are larger in scope than survey articles and that contain more detailed background information. The series also provides coverage of advanced and timely topics worthy of monographs, contributed volumes, textbooks and handbooks.

Mingzhe Chen • Shuguang Cui

Communication Efficient Federated Learning for Wireless Networks

Mingzhe Chen Department of Electrical and Computer Engineering University of Miami Coral Gables, FL, USA

Shuguang Cui School of Science and Engineering and Future Network of Intelligence Institute (FNii) The Chinese University of Hong Kong at Shenzhen Shenzhen, China

ISSN 2366-1186 ISSN 2366-1445 (electronic)
Wireless Networks
ISBN 978-3-031-51265-0 ISBN 978-3-031-51266-7 (eBook)
https://doi.org/10.1007/978-3-031-51266-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.

Preface

Machine learning and data-driven approaches have recently received considerable attention as key enablers for next-generation intelligent networks. Currently, most existing learning solutions for wireless networks rely on centralizing the training and inference processes by uploading data generated at edge devices to data centers. However, such a centralized paradigm may lead to privacy leakage, violate the latency constraints of mobile applications, or be infeasible due to the limited bandwidth or power constraints of edge devices. To address these issues, distributing machine learning at the network edge provides a promising solution, where edge devices collaboratively train a shared model using real-time generated mobile data. Avoiding data uploading to a central server not only helps preserve privacy but also reduces network traffic congestion as well as communication cost.

Federated learning (FL) is one of the most important distributed learning algorithms. In particular, FL enables devices to train a shared machine learning model while keeping data locally. However, in FL, training machine learning models requires communication between wireless devices and edge servers over wireless links. Therefore, wireless impairments such as noise, interference, and uncertainties in wireless channel states will significantly affect the training process and performance of FL. For example, transmission delay can significantly impact the convergence time of FL algorithms. Consequently, it is necessary to optimize wireless network performance for the implementation of FL algorithms.

On the other hand, FL can also be used for solving wireless communication problems and optimizing network performance. For example, FL endows edge devices with the capabilities of user behavior prediction, user identification, and wireless environment analysis. Moreover, federated reinforcement learning leverages distributed computation power and data to solve complex convex and nonconvex optimization problems that arise in various use cases, such as network control, user clustering, resource management, and interference alignment. Besides, FL traditionally makes the idealized assumption that edge devices will unconditionally participate in training tasks when invited, which is not practical in reality due to the resource costs incurred by model training and devices' unwillingness to bear them. Therefore, building incentive mechanisms is indispensable for FL networks.


The goal of this book is to provide a comprehensive study of FL over wireless networks. The book consists of three main parts: (a) fundamentals and preliminaries of FL, (b) analysis and optimization of FL over wireless networks, and (c) applications of wireless FL for Internet-of-Things systems. In particular, in the first part, we provide a detailed overview of the widely studied FL frameworks. In the second part of this book, we comprehensively discuss three key wireless techniques, including wireless resource management, quantization, and over-the-air computation, to support the deployment of FL over realistic wireless networks. This part also presents several solutions based on optimization theory, graph theory, and machine learning to optimize the performance of FL over wireless networks. In the third part of this book, we introduce the use of wireless FL algorithms for autonomous vehicle control and mobile edge computing optimization.

Coral Gables, FL, USA
Shenzhen, China
July 2023

Mingzhe Chen Shuguang Cui

Acknowledgements

This work was supported in part by NSFC with Grant No. 62293482, the Basic Research Project No. HZQB-KCZYZ-2021067 of the Hetao Shenzhen-HK S&T Cooperation Zone, the National Key R&D Program of China with Grant No. 2018YFB1800800, the Shenzhen Outstanding Talents Training Fund 202002, the Guangdong Research Projects No. 2017ZT07X152 and No. 2019CX01X104, the Guangdong Provincial Key Laboratory of Future Networks of Intelligence (Grant No. 2022B1212010001), the Shenzhen Key Laboratory of Big Data and Artificial Intelligence (Grant No. ZDSYS201707251409055), and the Key Area R&D Program of Guangdong Province with Grant No. 2018B030338001. This work was also supported by the U.S. National Science Foundation under Grant CNS-2312139.



Chapter 1

Introduction

1.1 Motivation of Distributed Learning

Over the past five years, the field of machine learning (ML) has witnessed a major shift from the so-called "big data" paradigm, in which large volumes of data are collected and processed at a central cloud, towards a "small data" paradigm [1–3], in which a set of agents or devices distributively process their data at the edge of a mobile network. The main motivation of this paradigm shift is to allow edge devices to rapidly access real-time data for fast ML model training, which in turn endows the devices with human-like intelligence to respond to real-time events [4–9]. This paradigm shift is driven by two trends in the evolution of computing. First, as computer chips become cheaper, computers are built into tens of billions of devices. These devices are connected to form Internet-of-Things (IoT) networks, which provide platforms for executing large-scale tasks but also generate very large amounts of useful data. Second, the spread of computing from the cloud towards the network edge enables the deployment of ML algorithms in the proximity of edge devices to distill their collected data into intelligence. This paradigm shift means that the classical centralized ML approach requiring large training datasets is no longer dominant. There is a growing need for novel distributed learning solutions that can leverage rich distributed data and computation resources at the edge without the need to transport data across the network.

The new framework of distributed learning finds a wide range of applications, especially those related to IoT, such as connected autonomy (e.g., connected vehicles or drones). In such systems, under the constraint of data privacy, devices have to find an intelligent way to cooperate in training an ML model while overcoming their local-data scarcity. In such scenarios, the direct exchange of raw data is undesirable due to privacy concerns or, in some cases, even infeasible due to communication and computing constraints.


1.2 Challenges of Deploying Distributed Learning

Realizing a successful evolution towards distributed learning requires overcoming several key challenges. The first is to find methods for distributed learning without raw-data sharing. This gives rise to an interesting trade-off between privacy and learning accuracy, as regulated by the level of information exchange among distributed agents. The second challenge stems from the fact that one common denominator of all such distributed systems is the need to perform training and inference of an ML model over wireless links, as devices are usually connected using a cellular network or a wireless local area network (WLAN). As such, the characteristics of wireless propagation (interference, noise, and fading) will introduce new impediments to the learning process [10]. For example, it has been shown in [11] that bit errors and communication delay can significantly affect the convergence and accuracy of distributed learning. Furthermore, it is shown in [12] that the wireless network architecture can also have a significant effect on the convergence speed of a learning algorithm. Hence, the deployment of distributed learning in wireless networks entails a need to account for wireless factors in the design. A third, related challenge is that distributed learning requires many rounds of exchanging high-dimensional ML model updates or model parameters between parameter servers and devices, whereas the radio spectrum is a scarce resource. Resolving this conflict calls for the design of communication-efficient techniques for distributed learning. The fourth challenge pertains to computing. Distributed learning requires efficient ways to perform distributed computation, both over-the-air [13–15] and off-the-air. The delay and efficiency associated with distributed ML and distributed computing over large-scale wireless networks will directly impact the learning performance. The final challenge is that efficient distributed learning over wireless systems requires new distributed optimization frameworks that enable multiple agents to collaboratively solve complex optimization problems in a distributed way.

Research efforts aiming at addressing these challenges have led to the emergence of many important distributed learning frameworks in the past few years. Perhaps the most widely studied one is federated learning (FL) [16, 17], which enables a group of agents to collaboratively execute a common learning task by exchanging only their local model parameters instead of raw data. Thereby, FL helps preserve data privacy while achieving high learning accuracy. Following the seminal work in [16], a broad range of FL techniques have been developed to tackle individual challenges among those mentioned earlier.

1.3 Potential Techniques for Deploying Federated Learning

To achieve very high performance in different dimensions, 6G will feature the integrated design of sensing, communication, computing, and control. In the context of FL, the objective of 6G design is no longer rate maximization but the acceleration of ML model training using distributed data [18–23]. This requires new algorithms and techniques for integrated communication and learning. A first approach is radio resource management [24, 25], which enables wireless networks to efficiently use limited resources such as spectrum, transmit power, and computational capabilities to complete the FL training process. A second approach is compression. Compression techniques aim at using fewer bits to quantize each ML model parameter [26–28] so as to decrease the size of the ML model parameters or updates exchanged among devices, thus reducing the communication overhead. Over-the-air computation (OAC or AirComp) [29] is a third approach that provides the scalability needed for multiple access in FL, enabling the participation of many edge devices, which is crucial for satisfactory learning performance. A fourth approach is to develop novel training methods that jointly consider FL training parameters, wireless network dynamics (e.g., wireless channel conditions), and wireless network topologies [30] (e.g., locations and mobility patterns of wireless devices).

1.4 Summary and Book Overview

Clearly, deploying FL over wireless networks faces a plethora of challenges and opportunities. In the rest of the book, we will explore these challenges and associated problems, while focusing on the following themes:

• Chapter 2. Fundamentals and Preliminaries of Federated Learning: In this chapter, we first introduce the preliminaries of FL. Then, we introduce several important performance metrics to quantify the FL performance over wireless networks and analyze how wireless factors affect these metrics. Finally, we present the research directions and industry interest of designing communication efficient FL over wireless networks.
• Chapter 3. Resource Management for Federated Learning: In Chap. 3, we introduce the joint optimization of FL training parameters (i.e., the number of local and global updates) and wireless resources, particularly spectrum, transmit power, and device selection and scheduling, to optimize the FL performance metrics introduced in Chap. 2.
• Chapter 4. Quantization for Federated Learning: In Chap. 4, we introduce the use of quantization theory for wireless FL deployment. In particular, we introduce two quantization schemes to reduce the size of FL parameters that must be exchanged among edge devices while considering unique FL training settings and wireless network dynamics.
• Chapter 5. Federated Learning with Over the Air Computation: In this chapter, we discuss the use of the over-the-air computation technique for FL performance optimization. Here, we first introduce the basic principle and techniques of over-the-air computation. Then, we analyze how over-the-air techniques affect FL convergence. Finally, we introduce the joint optimization of over-the-air transmission parameters (i.e., power control and beamforming) and FL parameters to improve FL performance.
• Chapter 6. Federated Learning for Autonomous Vehicles Control: In Chap. 6, we introduce the use of FL for the design of connected autonomous vehicle controllers, including a dynamic federated proximal algorithm and a contract-theory based incentive mechanism.
• Chapter 7. Federated Learning for Mobile Edge Computing: In this chapter, we conclude the book by introducing the use of the designed FL for the optimization of wireless network performance, particularly focusing on the optimization of task offloading in mobile edge computing based networks.

Chapter 2

Fundamentals and Preliminaries of Federated Learning

In this chapter, we first introduce the preliminaries of FL. In particular, we introduce federated averaging and two personalized FL algorithms. Then, we introduce four important performance metrics to quantify the FL performance over wireless networks and analyze how wireless factors affect these metrics. Finally, we present the research directions and industry interest of designing communication efficient FL over wireless networks.

2.1 Preliminaries of FL

Consider a set $\mathcal{U}$ of $U$ devices orchestrated by a parameter server (PS) to jointly train a common ML model. We assume that each participating device $i$ owns a dataset $\mathcal{K}_i$ of $K_i$ training samples, where each training sample $k \in \mathcal{K}_i$ consists of an input vector $\mathbf{x}_{i,k}$ and a corresponding output vector $\mathbf{y}_{i,k}$. Next, we introduce different FL problems.

2.1.1 Common Federated Learning

The training objective of common FL is given as follows:

$$\min_{\mathbf{m}} \sum_{i=1}^{U} \frac{p_i}{K_i} \sum_{k \in \mathcal{K}_i} f\left(\mathbf{m}, \mathbf{x}_{i,k}, \mathbf{y}_{i,k}\right), \qquad (2.1)$$

where $\mathbf{m} \in \mathbb{R}^d$ is the ML model that the devices aim to find collaboratively; $f(\cdot)$ is a loss function that captures the accuracy of the considered FL algorithm by building a relationship between an input vector $\mathbf{x}_{i,k}$ and the corresponding output vector $\mathbf{y}_{i,k}$; and $p_i$ is a scaling parameter that scales the weight of device $i$'s average loss, $\frac{1}{K_i} \sum_{k \in \mathcal{K}_i} f\left(\mathbf{m}, \mathbf{x}_{i,k}, \mathbf{y}_{i,k}\right)$, on the total training loss, with $\sum_{i=1}^{U} p_i = 1$. Problem (2.1) is commonly solved by using iterative distributed optimization techniques orchestrated by the PS.

Federated Averaging (FedAvg) [31] is the first FL algorithm proposed by Google to solve problem (2.1). The training process of FedAvg at iteration $t$ proceeds as follows:

(a) The PS broadcasts the current global model $\mathbf{b}(t)$ to all (or a subset of) the devices.
(b) Each device participating in this iteration uses some local learning method, such as stochastic gradient descent (SGD), to train its ML model using locally available data (called the local ML model).
(c) Each device sends its updated ML model parameters, $\mathbf{m}_i(t+1)$, to the PS.
(d) The PS updates the global model as follows: $\mathbf{b}(t+1) = \sum_{i=1}^{U} p_i \left(\mathbf{m}_i(t+1) - \mathbf{b}(t)\right) + \mathbf{b}(t)$.
(e) Steps (a) to (d) are repeated for a certain number of iterations, or until some convergence criterion is met.

From the training procedure, we observe that, in FedAvg, each device transmits its model update $\mathbf{m}_i(t+1) - \mathbf{b}(t)$ to the PS instead of sending its private dataset, thus promoting data privacy for devices. Hereinafter, we define the implementation of steps (b) to (d) as one learning step. Meanwhile, at step (b), each device can update its ML model multiple times; hereinafter, a device using the SGD method to update its ML model once is said to perform one local update.

Since FedAvg finds a common ML model for all the devices, the training loss of each device will be significantly increased when the data distribution across devices is non independent and identically distributed (Non-IID). To deal with Non-IID data, we next introduce personalized FL, in particular two classical personalized FL algorithms: federated multi-task learning [32] and model agnostic meta learning (MAML) based FL [33]. First, the sketch below illustrates one FedAvg learning step.
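The following is a minimal NumPy sketch of steps (a)-(e) for a simple quadratic loss. It is an illustration under assumed settings, not the book's implementation; all names (local_sgd, m_star, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
U, d, K = 5, 10, 100                 # devices, model dimension, samples per device
p = np.full(U, 1.0 / U)              # aggregation weights p_i, summing to 1

# Synthetic local datasets: y = <x, m*> + noise (illustrative quadratic loss).
m_star = rng.normal(size=d)
Xs = [rng.normal(size=(K, d)) for _ in range(U)]
data = [(X, X @ m_star + 0.1 * rng.normal(size=K)) for X in Xs]

def local_sgd(b, X, y, lr=0.01, local_updates=5):
    """Step (b): start from the broadcast global model b(t), run a few SGD updates."""
    m = b.copy()
    for _ in range(local_updates):
        j = rng.integers(len(y))
        grad = (X[j] @ m - y[j]) * X[j]          # gradient of 0.5 * (x^T m - y)^2
        m -= lr * grad
    return m

b = np.zeros(d)                      # initial global model b(0)
for t in range(50):                  # learning steps
    local_models = [local_sgd(b, X, y) for X, y in data]   # steps (a)-(c)
    # Step (d): aggregate the model updates m_i(t+1) - b(t) at the PS.
    b = b + sum(p_i * (m_i - b) for p_i, m_i in zip(p, local_models))

print("distance to ground truth:", np.linalg.norm(b - m_star))
```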

2.1.2 Federated Multi-Task Learning

In federated multi-task learning (FMTL), devices are considered to implement correlated but different learning tasks. In other words, the Non-IID data distributions of devices can be treated as different tasks. The training objective of FMTL is given as follows:

$$\min_{\mathbf{M}, \boldsymbol{\Omega}} \sum_{i=1}^{U} \sum_{k \in \mathcal{K}_i} f\left(\mathbf{m}_i, \mathbf{x}_{i,k}, \mathbf{y}_{i,k}\right) + R\left(\mathbf{M}, \boldsymbol{\Omega}\right), \qquad (2.2)$$

where $\mathbf{M} = [\mathbf{m}_1, \ldots, \mathbf{m}_U]$, $\boldsymbol{\Omega}$ models the relationship among the different learning tasks of the devices, and the function $R(\cdot)$ is a regularizer. To solve problem (2.2), one can separate it into several subproblems so as to enable the devices to solve it in a distributed manner. For example, the authors in [32] used a dual method and quadratic approximation to divide problem (2.2). Then, each device $i$ can individually optimize its ML model $\mathbf{m}_i$ under a given $\boldsymbol{\Omega}$ while the PS updates $\boldsymbol{\Omega}$ using the updated $\mathbf{M}$. After the devices and the PS iteratively optimize $\mathbf{M}$ and $\boldsymbol{\Omega}$, problem (2.2) can be solved.

From (2.1) and (2.2), we can see that, in FedAvg, devices will have the same ML model at convergence. In contrast, in FMTL, devices may have different ML models at convergence. This is due to the fact that, for Non-IID data or different learning tasks, devices with different ML models can achieve a smaller sum training loss than devices with a common ML model.
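As a concrete instance of (2.2), the sketch below evaluates the FMTL objective for a quadratic loss with a graph-Laplacian-style regularizer $R(\mathbf{M}, \boldsymbol{\Omega}) = \lambda \sum_{i,j} \Omega_{ij} \|\mathbf{m}_i - \mathbf{m}_j\|^2$. This choice of regularizer and all names are illustrative assumptions, not the specific construction used in [32].

```python
import numpy as np

rng = np.random.default_rng(1)
U, d, K, lam = 4, 8, 50, 0.1

# Per-device models (columns of M) and a symmetric task-relationship matrix Omega.
M = rng.normal(size=(d, U))
Omega = np.abs(rng.normal(size=(U, U)))
Omega = (Omega + Omega.T) / 2

# Synthetic per-device data with related but distinct ground-truth tasks.
base = rng.normal(size=d)
tasks = [base + 0.3 * rng.normal(size=d) for _ in range(U)]
Xs = [rng.normal(size=(K, d)) for _ in range(U)]
data = [(X, X @ m_i) for X, m_i in zip(Xs, tasks)]

def fmtl_objective(M, Omega, data, lam):
    """Sum of per-device quadratic losses plus the pairwise task-similarity regularizer R."""
    loss = sum(0.5 * np.sum((X @ M[:, i] - y) ** 2) for i, (X, y) in enumerate(data))
    reg = lam * sum(Omega[i, j] * np.sum((M[:, i] - M[:, j]) ** 2)
                    for i in range(U) for j in range(U))
    return loss + reg

print("FMTL objective:", fmtl_objective(M, Omega, data, lam))
```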

2.1.3 Model Agnostic Meta Learning Based FL

MAML based FL aims to find an ML model from which each device can obtain a personalized ML model via one or a few gradient descent steps. The training objective of MAML based FL is given as follows:

$$\min_{\mathbf{m}} \sum_{i=1}^{U} \frac{p_i}{K_i} \sum_{k \in \mathcal{K}_i} f\left(\mathbf{m} - \lambda \nabla f_i, \mathbf{x}_{i,k}, \mathbf{y}_{i,k}\right), \qquad (2.3)$$

where $\lambda$ is the learning rate and $\nabla f_i$ is the gradient of the local loss function of device $i$. From (2.3), we can see that MAML based FL aims to find a common ML model for all devices. The devices can then use their own data to update this common ML model via a few gradient descent steps so as to obtain their personalized ML models.

Given the overview of FedAvg, FMTL, and MAML based FL, we remark the following (a minimal sketch of the MAML-based personalization step follows this list):

• FMTL directly optimizes the personalized ML model of each device, while MAML based FL optimizes the initialization of the ML model of each device.
• FedAvg is recommended for processing IID data, while FMTL and MAML based FL are recommended for processing Non-IID data.
• Choosing between FMTL and MAML based FL depends on whether the PS knows the relationship among the data distributions of the devices.
• All FL algorithms must be trained by a distributed iterative process.
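The sketch below illustrates the personalization step implied by (2.3): starting from a common model $\mathbf{m}$, a device takes one gradient step on its own data. The quadratic loss and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, lam = 8, 50, 0.05              # model dimension, samples, learning rate

def local_grad(m, X, y):
    """Gradient of the average quadratic loss (1/K) * sum 0.5 * (x^T m - y)^2."""
    return X.T @ (X @ m - y) / len(y)

def personalize(m, X, y, lam):
    """One-step MAML-style adaptation: m_i = m - lam * grad f_i(m)."""
    return m - lam * local_grad(m, X, y)

m_common = rng.normal(size=d)        # meta-trained common model (placeholder)
X, y = rng.normal(size=(K, d)), rng.normal(size=K)
m_personal = personalize(m_common, X, y, lam)
print("loss before:", 0.5 * np.mean((X @ m_common - y) ** 2),
      "after:", 0.5 * np.mean((X @ m_personal - y) ** 2))
```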


2.2 Performance Metrics of FL Over Wireless Networks

Next, we introduce four key metrics that evaluate the performance of FL implemented over wireless networks: (a) training loss, (b) convergence time, (c) energy consumption, and (d) reliability.

2.2.1 Training Loss

Training loss is the value of the loss functions $f(\cdot)$ defined in (2.1), (2.2), and (2.3). From the FL training procedure, we can see that the FL training loss depends on the ML models of all devices. In wireless networks, devices' ML models are transmitted over imperfect wireless links. Therefore, they may experience transmission errors, thus negatively impacting the training loss. Meanwhile, due to limited energy and computational capacity, only a subset of devices can participate in FL. Therefore, only a subset of devices' ML models can be used to generate the global ML model, thus negatively impacting the training loss.

2.2.2 Convergence Time

For FL implemented over wireless networks, the convergence time $T$ is expressed as

$$T = (T_C + T_T) \times N_T, \qquad (2.4)$$

where $T_C$ is the time that each device uses to update its local ML model at each learning step, $T_T$ is the maximum ML model transmission time per learning step, and $N_T$ is the number of learning steps that FL needs to converge. From (2.4), we can see that the FL convergence time depends on three components: (a) the ML parameter transmission delay $T_T$, (b) the time $T_C$ needed by each device to train its local ML model, and (c) the number of learning steps $N_T$. Here, we note that $T_C$ and $N_T$ are dependent. In particular, increasing the number of SGD steps used to update the local ML model at each learning step (i.e., increasing $T_C$) can decrease the number of learning steps $N_T$ that FL needs to converge.

2.2.3 Energy Consumption

The energy consumption $E$ of each device participating in the entire FL training is expressed as

$$E = (E_C + E_T) \times N_T, \qquad (2.5)$$

where $E_C$ is the energy consumption of each device training its ML model at each learning step and $E_T$ is the energy consumption of transmitting ML parameters to the PS at each learning step. From (2.5), we can see that the energy consumption of each device depends on three components: (a) the energy consumption for ML parameter transmission, (b) the energy consumption for training the local ML model, and (c) the number of learning steps that FL needs to converge. Here, since increasing the number of SGD steps used to update the local ML model at each learning step can decrease the number of learning steps $N_T$ that FL needs to converge, a trade-off exists between $E_C$ and $N_T$.
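The sketch below gives a small worked example of (2.4) and (2.5), illustrating the trade-off between per-step local computation ($T_C$, $E_C$) and the number of learning steps $N_T$. All numbers are assumed for illustration.

```python
# Worked example of T = (T_C + T_T) * N_T and E = (E_C + E_T) * N_T.
# Assumed (illustrative) values: more local SGD steps raise per-step time/energy
# but reduce the number of learning steps N_T needed to converge.
configs = [
    # (local SGD steps, T_C [s], E_C [J], N_T)
    (1,  0.05, 0.2, 400),
    (5,  0.25, 1.0, 120),
    (20, 1.00, 4.0,  60),
]
T_T, E_T = 0.10, 0.5   # per-step transmission time [s] and energy [J] (assumed)

for steps, T_C, E_C, N_T in configs:
    T = (T_C + T_T) * N_T   # Eq. (2.4)
    E = (E_C + E_T) * N_T   # Eq. (2.5)
    print(f"{steps:>2} local updates: T = {T:6.1f} s, E = {E:6.1f} J")
```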

2.2.4 Reliability

FL reliability is defined as the probability of FL achieving a target training loss. For wireless FL, due to limited wireless resources, only a subset of devices can participate in the FL training at each learning step. Hence, the devices that transmit FL parameters to the PS at different learning steps may differ, which will affect the FL convergence time and training loss. Meanwhile, imperfect wireless links will cause errors in the FL parameters used to generate the global ML model, hence increasing the training loss.

2.3 Effects of Wireless Factors on FL Metrics

Given the metrics defined in the previous section, we first explain how wireless network factors such as spectrum, transmit power, and computational capacity affect these FL metrics. Table 2.1 summarizes the relationship between various wireless factors and FL performance metrics. In Table 2.1, a tick implies that the communication factor affects the corresponding FL performance metric. For example, the spectrum resource allocated to each device for FL parameter transmission affects the training loss, the FL parameter transmission time per learning step $T_T$, the energy consumption of FL parameter transmission $E_T$, and the reliability of FL.

Table 2.1 Summary of effects of communication factors on FL metrics

| Communication factors | Training loss | Local training time $T_C$ | FL parameter transmission time $T_T$ | Energy used for local training $E_C$ | Energy used for FL transmission $E_T$ | Total number of learning steps $N_T$ | Reliability |
|---|---|---|---|---|---|---|---|
| Spectrum resource | √ | | √ | | √ | | √ |
| Computational capacity | √ | √ | | √ | | √ | |
| Transmit power | √ | | √ | | √ | √ | √ |
| Wireless channel | √ | | √ | | √ | √ | √ |
| Set of devices that participate in FL | √ | | √ | | | √ | √ |
| Size of the FL parameters trained by each device | √ | √ | | √ | | √ | √ |
| Size of the FL parameters transmitted by each device | √ | | √ | | √ | | √ |

Next, we explain how these wireless factors affect the FL performance metrics as follows:

• Spectrum resource allocated to each device determines the signal-to-interference-plus-noise ratio (SINR), the data rate, and the probability that the transmitted FL parameters include errors. Hence, the spectrum resource affects the training loss, $T_T$, $E_T$, and reliability.
• Computational capacity determines the number of SGD updates that each device can perform at each learning step. Hence, computational capacity affects the time and energy used for local training. Meanwhile, as the number of SGD updates decreases, the training loss increases and the number of learning steps that FL needs to converge increases.
• Transmit power and the wireless channel determine the SINR, the data rate, and the probability that the transmitted FL parameters include errors. Therefore, as the transmit power of each device increases, the training loss, $T_T$, and $N_T$ decrease and the reliability improves, but $E_T$ increases (see the link-budget sketch after this list).
• As the number of devices that participate in FL increases, the training loss and $N_T$ decrease while $T_T$ and the reliability increase.
• As the size of the FL parameters trained by each device increases, the FL training loss and the total number of learning steps may decrease and the reliability may improve. However, the energy and time used for training the FL model increase.
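To make the bullet on transmit power concrete, here is a small sketch of how bandwidth and transmit power shape $T_T$ and $E_T$ through the Shannon rate. All link-budget numbers are assumed for illustration and are not from the book.

```python
import numpy as np

# Assumed link budget: how bandwidth B and transmit power P shape T_T and E_T.
S = 1e6 * 32                    # model size: 1M parameters at 32 bits [bits]
N0 = 1e-20                      # noise power spectral density [W/Hz] (assumed)
g = 1e-13                       # channel gain (assumed)

def transmission_time_energy(B, P):
    snr = P * g / (N0 * B)                  # interference-free SNR for simplicity
    rate = B * np.log2(1 + snr)             # Shannon rate [bit/s]
    T_T = S / rate                          # transmission time per learning step
    E_T = P * T_T                           # transmission energy per learning step
    return T_T, E_T

for B, P in [(1e6, 0.1), (2e6, 0.1), (1e6, 0.5)]:
    T_T, E_T = transmission_time_energy(B, P)
    print(f"B = {B/1e6:.0f} MHz, P = {P} W: T_T = {T_T:.1f} s, E_T = {E_T:.1f} J")
```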

2.4 Research Directions of Deploying FL Over Wireless Networks

Next, we present a comprehensive overview of the key research directions that must be pursued for practically deploying FL over wireless networks. For each research direction, we first outline the key challenges, and then we discuss the state of the art.


2.4.1 Wireless Resource Management

As shown in Table 2.1, wireless resources such as spectrum, transmit power, and computational capabilities jointly determine the FL training loss, convergence time, energy consumption, and reliability. Due to the limited resources in wireless networks, it is necessary to optimize resource allocation so as to enable wireless networks to efficiently complete the FL training process. However, analyzing the effects of resource allocation on the FL performance faces several challenges. First, the FL training process is distributed and iterative, and it is challenging to quantify how each single model update affects the entire training process. Also, since each device only exchanges its gradient vector with the PS, the PS does not have any information about the devices' local datasets and cannot use the sample distribution or the values of the data samples to decide how resource allocation will affect the FL convergence.

State of the Art Now, we discuss a number of recent works on the optimization of spectrum resources for deploying FL over wireless networks. In [12], the authors considered the implementation of FL over a hierarchical network architecture and showed that global convergence can be accelerated if local training is enabled with the help of small base stations (BSs), which only occasionally communicate with the macro BS for global consensus. Meanwhile, local learning not only speeds up the learning process, but also reduces the energy consumption of communication thanks to short-distance transmissions, and increases communication efficiency through frequency reuse across multiple small cells, enabling parallel local learning processes. The authors in [34, 35] studied the trade-off between local ML model updates and global ML model aggregation so as to minimize the total energy consumption for local ML model training and transmission, or the FL training loss. The authors in [36] studied the use of gradient statistics to optimize the set of devices that participate in FL at each training round. The authors in [37] assumed that the local FL model transmitted by a device can be decoded by the PS only when the SINR is above a target threshold, and analyzed how user scheduling affects the FL convergence. The work in [38] designed an FL algorithm that can handle heterogeneous user data without further assumptions beyond strongly convex and smooth loss functions, and then optimized the resource allocation to improve the FL convergence and training loss. The authors in [39] jointly optimized device scheduling and resource allocation policies to maximize the model accuracy within a given total training time budget for latency-constrained wireless FL. In [40], the authors designed a multi-armed bandit based algorithm to select the devices that must participate in FL without knowing the wireless channel state information and the statistical characteristics of the devices. In [41], data-driven experiments were designed to show that different temporal device selection patterns lead to considerably different learning performance. With the obtained insights, device selection and bandwidth allocation are jointly optimized utilizing only currently available wireless channel information.


2.4.2 Compression and Sparsification

A major challenge in FL, particularly over wireless channels, is the communication bottleneck due to the large size of the trained models. For emerging DNNs with hundreds of millions of training parameters, transmitting so many locally trained parameter values from each participating device to the PS at every iteration of the learning algorithm over a shared wireless channel is a significant challenge. We note that the transmission of locally trained model parameters to the PS over a noisy wireless channel is a joint source-channel coding problem. Indeed, considering that the PS is interested in the average of the models, rather than the individual model updates from different devices, this can be classified as a joint source-channel function computation problem [42, 43]. In general, we do not have an optimal solution to such a problem, particularly in the practical finite blocklength regime. The conventional approach to this problem is to separate the compression of DNN parameters from the transmission over the channel. This so-called 'digital' approach converts all the local updates into bits, which are then transmitted over the channel as reliably as possible, and all the decoded 'lossy' reconstructions are averaged by the PS. A more efficient method would be to directly map each locally trained model parameter to a channel input in an 'analog' fashion [13]. While we will explore this approach in Chap. 5 in detail, here we focus on digital schemes, and assume that each device individually compresses its own parameters.

Numerous communication efficient learning strategies have been proposed in the ML literature to reduce the amount of information, that is, the number of bits exchanged between the devices and the PS per global iteration. We classify these approaches into two main groups, namely sparsification and quantization. We highlight that, thanks to the separation between compression and the transmission of the compressed bits to the PS, these strategies are independent of the communication medium and the communication protocol employed to exchange model updates between the devices and the PS, as they mainly focus on reducing the size of the messages exchanged. Therefore, these techniques can be incorporated into the resource allocation and device selection policies presented above.

The objective of sparsification is to transform the $d$-dimensional model update $\mathbf{m}$ at a device into a sparse representation $\tilde{\mathbf{m}}$ by setting some of its elements to zero. Sparsification can also be considered as applying a $d$-dimensional mask vector $\mathbf{M} \in \{0, 1\}^d$ to $\mathbf{m}$, such that $\tilde{\mathbf{m}} = \mathbf{M} \otimes \mathbf{m}$, where $\otimes$ denotes element-wise multiplication. We can define the sparsification level of this mask as $\phi \triangleq \|\mathbf{M}\|_1 / d$, i.e., the ratio of its non-zero elements to its dimension. Note that, when conveying a sparse model update to the PS, rather than conveying all $d$ values of the model update, each device needs to convey only the $\phi d$ non-zero values and their locations. Therefore, the lower the sparsification level, the higher the compression ratio, and the lower the communication load. It is known that when training a complex DNN model using stochastic gradient descent methods, model updates can be highly sparse. Indeed, it has been shown that when training some of the popular large-scale architectures, such as ResNet [44] or VGG [45], sparsification levels of $\phi \in [0.001, 0.01]$ provide a significant reduction in the communication load with almost no loss in generalization performance [46, 47].

Top-K sparsification is probably the most common strategy used in distributed learning. In top-K sparsification, each device constructs its own sparsification mask $\mathbf{M}_{i,t}$ at each iteration by identifying the $K$ values in its local update with the largest absolute values [48–50]. A simpler alternative to top-K is rand-K sparsification [50], which selects the sparsification mask $\mathbf{M}_{i,t}$ randomly from the set of masks with sparsification level $K$. Both rand-K and top-K are biased compression strategies. In the case of rand-K, unbiased model updates can be obtained by scaling $\mathbf{M}_{i,t}$ by $d/K$, albeit at the expense of increased variance, which is not desirable in practice [50]. Top-K sparsification has been shown to outperform rand-K in practical applications in terms of both test accuracy and convergence speed; however, top-K sparsification requires sorting the elements of the model update vector at each iteration, which can significantly slow down the learning process. Moreover, as mentioned above, top-K sparsification requires transmitting the locations of the non-zero values within the model update vector, which increases its communication load, whereas this is not needed for rand-K if a pseudo-random generator with a common seed is used across all the devices to generate the same mask. A time-correlated sparsification strategy is introduced in [51], where a common mask is sent from the PS at each iteration to be employed by all the devices, removing the additional communication load due to sending the locations of the non-zero values; instead, each device sends only a limited number of significant values that are not present in this common mask, enabling the exploration of more efficient masks. This approach exploits the time correlations between model updates across different iterations, and can provide up to a 2000-fold reduction in the communication load with minimal loss in model accuracy. We also note that, when employed for the distributed training of DNN architectures, these sparse communication strategies can be applied to each layer of the network separately, since it is observed that different layers have different tolerance to sparsification of their weights [46, 51].

As mentioned above, the weights of a DNN take values from the real numbers, and hence, even after sparsification they cannot be transmitted to the PS as they are; they must be quantized. In practice, since even the computation of local iterations is carried out using 32-bit floating point representations, we can assume that each weight can be conveyed to the PS perfectly using 32 bits. Quantization techniques aim at identifying more efficient representations of the network weights that use fewer than 32 bits per weight [26–28]. At the extreme, a single bit can be used to represent only the sign of each element, which results in a 32-fold reduction in the communication load. Sign-based compression techniques for distributed optimization have been studied for a long time, mainly to improve the robustness and convergence of learning algorithms [52].
It has been recently shown that simple sign-based quantization together with majority voting converges to the optimal solution (under certain assumptions), and provides an extremely communication-efficient and viable alternative in practice as well [53–55]. A more advanced vector quantization scheme is considered in [56] (a minimal sketch of the basic sparsification and quantization operations appears at the end of this subsection).

State of the Art Next, we discuss a number of recent works on the use of compression and sparsification techniques for deploying FL over wireless networks. The authors in [57] studied the use of compression and sparsification techniques for local ML model transmission and analyzed their convergence properties under both homogeneous and heterogeneous local data distributions. In [58], the authors studied the use of lossy compression for global ML model parameter transmission. The work in [59] introduced a ternary quantization approach for the training and inference stages of devices. The authors in [60] investigated the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error, and performed both worst-case and average-case analyses of the FL convergence. In [61], the authors designed a hyper-sphere quantization based FL algorithm so as to achieve a continuum of trade-offs between communication efficiency and gradient accuracy. The authors in [62] focused on the design and analysis of physical layer quantization and transmission methods for wireless FL and evaluated the impact of various quantization and transmission options of the ML models on the learning performance. The work in [63] designed a novel FL algorithm based on random linear coding and developed efficient power management and channel usage techniques to manage the trade-offs between power consumption, communication bit-rate, and convergence rate. In [64], the count sketch is used to compress the local ML parameters, thus overcoming the challenges of sparse device participation while still achieving high compression rates and convergence speed. Additional forms of probabilistic scalar quantization for FL were considered in [28, 65–67]. We highlight that most of the literature on distributed learning, and particularly on its implementation over a wireless network, focuses on the limitation of the uplink resources, studying the quantization and sparsification of model updates from the devices while assuming that the global model from the PS is conveyed perfectly to all the participating devices. However, in the case of bandwidth-limited wireless networks, broadcasting the global model to all the wireless devices can be a challenge as well. The convergence of FL with noisy downlink transmission of the global model is studied in [68], where both digital and analog transmission of global model updates are considered.
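The following is a minimal sketch of the compression operations discussed above: top-K sparsification, rand-K with a shared seed, and 1-bit sign quantization. It is illustrative only, under assumed settings, and is not the scheme of any specific reference.

```python
import numpy as np

def top_k(update, k):
    """Top-K sparsification: keep the k entries with largest magnitude, zero the rest."""
    mask = np.zeros_like(update)
    idx = np.argpartition(np.abs(update), -k)[-k:]   # indices of the k largest |values|
    mask[idx] = 1.0
    return mask * update

def rand_k(update, k, seed):
    """Rand-K: random mask with k non-zeros; a shared seed lets PS and devices agree
    on the mask without transmitting locations. Scaling by d/k makes the compressed
    update unbiased, at the cost of higher variance."""
    d = update.size
    idx = np.random.default_rng(seed).choice(d, size=k, replace=False)
    mask = np.zeros(d)
    mask[idx] = 1.0
    return (d / k) * mask * update

def sign_quantize(update):
    """1-bit quantization: transmit only the sign of each entry (32x fewer bits),
    plus one scalar (the mean magnitude) to preserve the update's scale."""
    return np.sign(update) * np.mean(np.abs(update))

u = np.random.default_rng(3).normal(size=1000)
for name, comp in [("top-K", top_k(u, 10)), ("rand-K", rand_k(u, 10, seed=42)),
                   ("sign", sign_quantize(u))]:
    err = np.linalg.norm(u - comp) / np.linalg.norm(u)
    print(f"{name:>6}: relative reconstruction error {err:.3f}")
```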

2.4.3 FL with Over-the-Air Computation

Researchers have attempted to reduce the resultant communication latency using different approaches, such as excluding slow devices ("stragglers") [69, 70], selecting only those devices whose updates can significantly accelerate learning [71, 72], or compressing updates by exploiting their sparsity using the techniques outlined in Sect. 2.4.2. An alternative approach, of interest in this section, is to design new


multiple access schemes targeting FL. The main drawback of the classic orthogonal-access schemes (e.g., OFDMA or TDMA) is that they do not scale well with the number of devices. Specifically, the required radio resources increase linearly with the number of transmitters, or else the latency grows linearly. A recently emerged approach, called over-the-air computation, also known as AirComp, can provide the needed scalability for multi-access in FL. Specifically, the deployment of AirComp to support FL exploits the waveform superposition property of a multi-access channel together with simultaneous transmission to realize over-the-air model/gradient aggregation [13–15]. Given simultaneous access, the latency becomes independent of the number of devices. This overcomes the communication bottleneck and facilitates the implementation of FL over many devices.

State of the Art
One challenge in AirComp is to enable learning in a broadband communication scenario. The authors in [73] first analyzed how user selection and transmit power affect the convergence of AirComp based FL and then optimized these wireless factors to improve the performance of AirComp based FL. The work in [74] studied the use of 1-bit compressive sensing (CS) for analog ML model aggregation, thereby reducing the size of the FL parameters transmitted over wireless links. The work in [75] used a Markovian probability model to characterize the temporal structure of the local ML parameter aggregation over a series of learning steps. Based on the Markovian model, the authors developed a turbo message passing algorithm to efficiently recover the desired global ML model from all the historical noisy observations at the PS. Researchers have also designed AirComp FL systems over multiple-antenna channels [43, 76, 77]. While the beamforming vectors are optimized in [43] to exploit the available multiple antennas for FL, it is shown in [78] that having sufficiently many receive antennas at the PS can compensate for the lack of channel state information at the transmitters. It is further shown in [78] that, since only the summation of the transmitted symbols needs to be decoded at the receiver, the channel state estimation requirements at the receiver are also reduced: only an estimate of the sum channel gain from the devices to each antenna is needed. Another important potential benefit of AirComp in the FL setting concerns privacy. Even though FL has been proposed as a privacy-sensitive learning paradigm, as the devices only transmit their model updates to the PS and the datasets remain localized, it has been shown that the gradient information can reveal significant information about the datasets, which is called gradient leakage [79, 80]. Several works have proposed privacy mechanisms to prevent gradient leakage. In particular, differential privacy (DP) is used as a rigorous privacy measure in this context [81]. A common method to provide DP guarantees is to add noise to data before sharing it with third parties. In the digital implementation of FL, each device can add noise to its local gradient estimate before sharing it with the PS [82], which results in a trade-off between privacy and the accuracy of learning. However, note that the gradients (or model updates) in the case of AirComp are received at the PS with additional channel noise. Several recent works have developed privacy-aware AirComp schemes based on this observation. In [83], if the channel noise is not


sufficient to satisfy the DP target, some of the devices transmit additional noise, benefiting all the devices. Instead, in [84] and [85], the transmit power is adjusted for the same privacy guarantee. The authors in [86] showed that jointly optimizing wireless aggregation and user sampling can further improve differential privacy. Hence, the authors designed a private wireless gradient aggregation scheme that relies on device selection to improve differential privacy. While these works benefit mainly from the presence of channel noise and depend critically on perfect channel knowledge at the transmitters, in [87], the authors exploit the anonymity provided by AirComp for privacy, which prevents the PS from detecting which devices are participating in each round.
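To make the superposition principle concrete, the following is a toy baseband sketch written for this text. It assumes perfect channel state information at the transmitters (so each device can invert its channel) and ignores transmit power limits; all quantities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices, d = 20, 1000

updates = rng.normal(size=(num_devices, d))      # local model updates
h = rng.rayleigh(scale=1.0, size=num_devices)    # channel magnitudes

# Transmit-side channel inversion: device i sends w_i / h_i, the channel
# scales it by h_i, and all transmissions superpose in one channel use.
precoded = updates / h[:, None]
noise = rng.normal(scale=0.05, size=d)
received = (h[:, None] * precoded).sum(axis=0) + noise

aircomp_avg = received / num_devices             # over-the-air average
true_avg = updates.mean(axis=0)
print(np.linalg.norm(aircomp_avg - true_avg))    # error set by channel noise
```

Note how the number of channel uses is independent of the number of devices: the channel itself computes the sum.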

2.4.4 FL Training Method Design

Beyond the use of wireless techniques, one can design novel FL training methods and adjust the learning parameters (e.g., step size) to enable FL to be efficiently implemented over wireless networks. Naturally, wireless devices have a limited amount of energy and computational resources for ML model training and transmission. In consequence, the size of the ML model parameters that can be trained and transmitted by a wireless device is typically small, and the time duration for which the wireless devices can be used for training FL is typically short. Hence, while designing FL training methods, the energy, computation, and training time constraints need to be explicitly taken into account. Meanwhile, FL training methods determine the network topologies formed by the devices, thereby significantly affecting the FL training complexity and the FL convergence time. In consequence, designing FL training methods also requires jointly considering the locations and mobility patterns of wireless devices as well as the wireless channel conditions.

State of the Art
Designing communication-efficient FL training methods has been studied from various perspectives. In particular, an error-feedback-based SignSGD update method is proposed in [55] to improve both convergence and generalization. In [12], hierarchical FL is proposed, where devices are grouped into clusters, and devices within each cluster carry out local learning with the help of a small BS or a cluster head, while a global model is trained at the macro BS. This framework is extended in [88], which designed a training method for a multi-layer FL network. The authors in [72] and [89] proposed gradient aggregation methods that decrease the number of devices that must transmit their local ML parameters to the PS, thereby reducing the FL communication overhead. In [90], the authors introduced a non-parametric generalized Bayesian inference framework for FL so as to reduce the number of learning steps that FL needs to converge. The authors in [91] proposed a post-local SGD update method that enables each device to update its ML parameters once per round in the initial learning steps, while updating its ML parameters several times per round in the following learning steps. This post-local SGD method can significantly improve the generalization performance and communication efficiency. The work


in [92] designed a parallel restarted SGD method in which each device averages its ML model with the other devices every given number of learning steps and performs local SGD updates to its ML model in the other learning steps. In [93], the authors designed a personalized FL algorithm using Moreau envelopes as each device's regularized loss function, which can decouple personalized ML model optimization from global model learning in a bi-level problem stylized for personalized FL. The work in [94] addressed the FL problem in which the users are distributed and partitioned into clusters. In particular, the authors proposed a new framework, dubbed the iterative federated clustering algorithm, which alternately estimates the cluster identities of the users and optimizes the model parameters for the user clusters via gradient descent. In [95], the authors studied FL over wireless device-to-device networks by providing theoretical insights into the performance of digital and analog implementations of decentralized stochastic gradient descent. The authors in [96] designed a novel FL optimization objective, inspired by fair resource allocation in wireless networks, that encourages more uniform accuracy distributions across devices. The work in [97] developed a one-shot unsupervised federated clustering scheme based on Lloyd's method for k-means clustering.

2.4.5 Industry Interest

As we have already mentioned, centralized algorithms cannot fulfil the low-latency demands of near-real-time applications in 5G and beyond cellular networks while at the same time satisfying security and privacy requirements. Therefore, approaches that keep local data on resource-constrained edge nodes (such as mobile phones, IoT devices, or radio sites) and employ edge computation to learn a shared model for prediction have become increasingly attractive for the networking and IoT industry, and in recent years several implementations of distributed ML have appeared. In April 2017, Google published a blog post [98] describing how they had successfully tested an FL method with many Android mobile devices. Using a federated averaging algorithm, a global model had been trained and deployed on Android mobiles to suggest search queries based on typing context from the Android Gboard. The mobile used the model stored on the device to predict search queries (such as suggesting next words and expressions), but training and model updates would only take place once the mobile was connected to WiFi and charging, ensuring that only the user holds a copy of their data. Besides Google, many other industrial researchers have also recently started exploring FL. Intel [99] used FL for medical imaging, where the personal data used for training a global model is kept local. At MWC 2019, ByteLake and Lenovo [100] demonstrated an industrial FL IoT application that enables IoT devices in 5G networks to learn from each other and makes it possible to leverage local ML models on IoT devices.


As we discuss in this book, despite the apparent opportunities FL offers in wireless networks, it is still in its early stages, as there exist several critical challenges that need to be researched, especially for large-scale telecom applications, such as computational resource allocation for training FL models at edge devices, selection of users for FL, energy efficiency of FL implementation, spectrum resource allocation for FL parameter transmission, and design of communication-efficient FL. Nevertheless, the telecom industry has recently started applying distributed ML industrially to improve privacy when using ML for network optimization, time-series forecasting [101], predictive maintenance, and quality of experience (QoE) modeling [102, 103]. To better understand the potential of FL in a telecom environment, the Ericsson authors in [103] have tested it on a number of use cases, migrating the models from conventional, centralized ML to FL, using the accuracy of the original model as a baseline. Their research has indicated that the usage of a simple neural network results in a significant reduction in network utilization, due to the sharp drop in the amount of data that needs to be shared. Besides being improved by 5G techniques, FL has also been integrated in the 5G Network Data Analytics (NWDA) architecture, where it has been used within the Network Data Analytics Function (NWDAF) [104] in order to improve privacy.

Chapter 3

Resource Management for Federated Learning

In this chapter, we introduce the joint optimization of FL parameters and wireless resources (e.g., computational resources, spectrum, and transmit power) to optimize wireless FL performance metrics (i.e., FL training loss, convergence time, and energy efficiency). In particular, we first analyze how wireless factors (e.g., transmit power, spectrum interference, and packet error rates) affect the FL performance. Then, using the FL convergence analysis results, we introduce several optimization-theory-based methods for resource management that aim to optimize the wireless FL performance metrics. Finally, several simulations are presented to demonstrate the performance of the designed FL solutions.

3.1 Resource Management for FL Training Loss Minimization

3.1.1 Wireless FL Model

First, we introduce the FL training process over a wireless network. In particular, we consider a cellular network in which one BS and a set $\mathcal{U}$ of $U$ users cooperatively perform an FL algorithm for data analysis and inference. Each user uses its collected training data to train an FL model. Hereinafter, the FL model that is trained at the device of each user (using the data collected by the user itself) is called the local FL model. The BS is used to integrate the local FL models and generate a shared FL model. This shared FL model is used to improve the local FL model of each user so as to enable the users to collaboratively perform a learning task without transferring training data. Hereinafter, the FL model that is generated by the BS using the local FL models of its associated users is called the global FL model. As shown in Fig. 3.1, the uplink from the users to the BS is used to transmit the local FL model parameters, while the downlink is used to transmit the global FL model parameters.


Fig. 3.1 The architecture of an FL algorithm that is being executed over a wireless network with multiple devices and a single base station

In our model, each user $i$ collects a matrix $X_i = [x_{i1}, \ldots, x_{iK_i}]$ of input data, where $K_i$ is the number of samples collected by user $i$ and each element $x_{ik}$ is an input vector of the FL algorithm. The size of $x_{ik}$ depends on the specific FL task. Our approach, however, is applicable to any generic FL algorithm and task. Let $y_{ik}$ be the output of $x_{ik}$. For simplicity, we consider an FL algorithm with a single output; however, the introduced approach can be readily generalized to the case with multiple outputs [105]. The output data vector for training the FL algorithm of user $i$ is $y_i = [y_{i1}, \ldots, y_{iK_i}]$. We define a vector $w_i$ to capture the parameters of the local FL model that is trained by $X_i$ and $y_i$. In particular, $w_i$ determines the local FL model of each user $i$. For example, in a linear regression learning algorithm, $x_{ik}^T w_i$ represents the predicted output and $w_i$ is a weight vector that determines the performance of the linear regression learning algorithm. For each user $i$, the local training problem seeks to find the optimal learning model parameters $w_i^*$ that minimize its training loss. The training process of an FL algorithm solves the following optimization problem:

$$\min_{w_1, \ldots, w_U}\; \frac{1}{K} \sum_{i=1}^{U} \sum_{k=1}^{K_i} f(w_i, x_{ik}, y_{ik}), \tag{3.1}$$

$$\text{s.t.}\quad w_1 = w_2 = \cdots = w_U = g, \tag{3.1a}$$

where $K = \sum_{i=1}^{U} K_i$ is the total size of the training data of all users, $g$ is the global FL model that is generated by the BS, and $f(w_i, x_{ik}, y_{ik})$ is a loss function. The loss function captures the performance of the FL algorithm. For different learning tasks, the FL performance captured by the loss function is different. For example, for a prediction learning task, the loss function captures the prediction accuracy of FL, whereas for a classification learning task, it captures the classification accuracy. Constraint (3.1a) ensures that, once the FL algorithm converges, all of the users and the BS share the same FL model for their learning task. To solve (3.1), the BS transmits the parameters $g$ of the global FL model to its users so that they can train their local FL models. Then, the users transmit their local FL models to the BS to update the global FL model. The detailed procedure of training an FL algorithm to minimize the loss function in (3.1) is given in [106]. In FL, the update of each user $i$'s local FL model $w_i$ depends on the global model $g$, while the update of the global model $g$ depends on all of the users' local FL models. The update of the local FL model $w_i$ depends on the learning algorithm. For example, one can use gradient descent, stochastic gradient descent, or randomized coordinate descent [105] to update the local FL model. The update of the global model $g$ is given by Konečný et al. [105]:

$$g_t = \frac{\sum_{i=1}^{U} K_i w_{i,t}}{K}. \tag{3.2}$$

During the training process, each user first uses its training data $X_i$ and $y_i$ to train the local FL model $w_i$ and then transmits $w_i$ to the BS via wireless cellular links. Once the BS receives the local FL models from all participating users, it updates the global FL model based on (3.2) and transmits the global FL model $g$ to all users to improve their local FL models. As time elapses, the BS and the users can find their optimal FL models and use them to minimize the loss function in (3.1). Since all of the local FL models are transmitted over wireless cellular links, they may contain erroneous symbols when received by the BS, due to the unreliable nature of the wireless channel, which, in turn, will have a significant impact on the performance of FL. Meanwhile, the BS must update the global FL model once it receives all of the local FL models from its users and, hence, the wireless transmission delay will significantly affect the convergence of the FL algorithm.

3.1.1.1 FL Parameter Transmission Model

For the uplink, an orthogonal frequency division multiple access (OFDMA) technique in which each user occupies one RB is used. The uplink rate of user $i$ transmitting its local FL model parameters to the BS is

$$c_i^U(r_i, P_i) = \sum_{n=1}^{R} r_{i,n} B^U\, \mathbb{E}_{h_i}\left[\log_2\left(1 + \frac{P_i h_i}{I_n + B^U N_0}\right)\right], \tag{3.3}$$

where $r_i = [r_{i,1}, \ldots, r_{i,R}]$ is an RB allocation vector with $R$ being the total number of RBs, $r_{i,n} \in \{0, 1\}$, and $\sum_{n=1}^{R} r_{i,n} = 1$; $r_{i,n} = 1$ indicates that RB $n$ is allocated to user $i$, and $r_{i,n} = 0$ otherwise; $B^U$ is the bandwidth of each RB and $P_i$ is the transmit power of user $i$; $h_i = o_i d_i^{-2}$ is the channel gain between user $i$ and the BS, with $d_i$ being the distance between user $i$ and the BS and $o_i$ being the Rayleigh fading parameter; $\mathbb{E}_{h_i}(\cdot)$ is the expectation with respect to $h_i$; $N_0$ is the noise power spectral density; and $I_n$ is the interference caused by the users that are located in other service areas (e.g., other BSs not participating in the FL algorithm) and use RB $n$. Note that, although we ignore the optimization of resource allocation for the users located in the other service areas, we must consider the interference caused by those users (if they share RBs with the considered FL users), since this interference may significantly affect the packet error rates and the performance of FL. Similarly, the downlink data rate achieved by the BS when transmitting the parameters of the global FL model to each user $i$ is given by

$$c_i^D = B^D\, \mathbb{E}_{h_i}\left[\log_2\left(1 + \frac{P_B h_i}{I^D + B^D N_0}\right)\right], \tag{3.4}$$

where $B^D$ is the bandwidth that the BS uses to broadcast the global FL model to each user $i$, $P_B$ is the transmit power of the BS, and $I^D$ is the interference caused by other BSs not participating in the FL algorithm. Given the uplink data rate $c_i^U$ in (3.3) and the downlink data rate $c_i^D$ in (3.4), the transmission delays between user $i$ and the BS over the uplink and downlink are, respectively,

$$l_i^U(r_i, P_i) = \frac{Z(w_i)}{c_i^U(r_i, P_i)}, \tag{3.5}$$

$$l_i^D = \frac{Z(g)}{c_i^D}, \tag{3.6}$$

where the function $Z(x)$ is the data size of $x$, defined as the number of bits that the users or the BS require to transmit the vector $x$ over wireless links. In particular, $Z(w_i)$ represents the number of bits that each user $i$ requires to transmit its local FL model $w_i$ to the BS, while $Z(g)$ is the number of bits that the BS requires to transmit the global FL model $g$ to each user. Here, $Z(w_i)$ and $Z(g)$ are determined by the type of the implemented FL algorithm. From (3.2), we see that the number of elements in the global FL model $g$ is the same as that of each user $i$'s local FL model $w_i$. Hence, we assume $Z(w_i) = Z(g)$.

3.1.1.2 FL Parameter Error Rates

For simplicity, we assume that each local FL model $w_i$ is transmitted as a single packet in the uplink. A cyclic redundancy check (CRC) mechanism is used to check for data errors in the received local FL models at the BS. In particular, $C(w_i) = 0$ indicates that the local FL model received by the BS contains data errors; otherwise, we have $C(w_i) = 1$. The packet error rate experienced by the transmission of each local FL model $w_i$ to the BS is given by Xi et al. [107]:

$$q_i(r_i, P_i) = \sum_{n=1}^{R} r_{i,n} q_{i,n}, \tag{3.7}$$

where $q_{i,n} = \mathbb{E}_{h_i}\left[1 - \exp\left(-\frac{m\left(I_n + B^U N_0\right)}{P_i h_i}\right)\right]$ is the packet error rate over RB $n$, with $m$ being a waterfall threshold [107].

In the considered system, whenever the received local FL model contains errors, the BS will not use it for the update of the global FL model. We also assume that the BS will not ask the corresponding users to resend their local FL models when the received local FL models contain data errors. Instead, the BS directly uses the remaining correct local FL models to update the global FL model. As a result, the global FL model in (3.2) can be written as

$$g(a, P, R) = \frac{\sum_{i=1}^{U} K_i a_i w_i C(w_i)}{\sum_{i=1}^{U} K_i a_i C(w_i)}, \tag{3.8}$$

where

$$C(w_i) = \begin{cases} 1, & \text{with probability } 1 - q_i(r_i, P_i), \\ 0, & \text{with probability } q_i(r_i, P_i), \end{cases} \tag{3.9}$$

$a = [a_1, \ldots, a_U]$ is the user selection vector, with $a_i = 1$ indicating that user $i$ performs the FL algorithm and $a_i = 0$ otherwise, $R = [r_1, \cdots, r_U]$, and $P = [P_1, \cdots, P_U]$. $\sum_{i=1}^{U} K_i a_i C(w_i)$ is the total number of training data samples, which depends on the user selection vector $a$ and the packet transmission outcomes $C(w_i)$. $K_i a_i w_i C(w_i) = 0$ indicates that the local FL model of user $i$ contains data errors and, hence, the BS will not use it to generate the global FL model; $g(a, P, R)$ is the global FL model that explicitly incorporates the effect of wireless transmission. From (3.8), we see that the global FL model also depends on the resource allocation matrix $R$, the user selection vector $a$, and the transmit power vector $P$.
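As a minimal sketch of this error-aware aggregation rule, the following snippet implements (3.8) and (3.9) with hypothetical values for $K_i$ and $q_i(r_i, P_i)$; it assumes at least one local model is received correctly.

```python
import numpy as np

rng = np.random.default_rng(1)
U, d = 10, 50
K = rng.integers(50, 200, size=U)     # K_i: training samples per user
w = rng.normal(size=(U, d))           # local FL models w_i
a = np.ones(U, dtype=int)             # user selection a_i
q = rng.uniform(0.0, 0.3, size=U)     # packet error rates q_i(r_i, P_i)

# C(w_i) per (3.9): 1 with probability 1 - q_i (packet decoded), else 0.
C = (rng.random(U) > q).astype(int)

# Global model per (3.8): only correctly received local models are used,
# weighted by their dataset sizes (assumes weights.sum() > 0).
weights = K * a * C
g = (weights[:, None] * w).sum(axis=0) / weights.sum()
```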

3.1.1.3 Energy Consumption Model

In our network, the energy consumption of each user consists of the energy needed for two purposes: (a) transmission of the local FL model and (b) training of the local FL model. The energy consumption of each user $i$ is given by Pan et al. [108]:

$$e_i(r_i, P_i) = \varsigma \omega_i \vartheta^2 Z(w_i) + P_i l_i^U(r_i, P_i), \tag{3.10}$$

where $\vartheta$ is the frequency of the central processing unit (CPU) clock of each user $i$, $\omega_i$ is the number of CPU cycles required for computing per-bit data of user $i$, which is assumed to be equal for all users, and $\varsigma$ is the energy consumption coefficient depending on the chip of each user $i$'s device [108]. In (3.10), $\varsigma \omega_i \vartheta^2 Z(w_i)$ is the energy consumption of user $i$ training the local FL model on its own device and $P_i l_i^U(r_i, P_i)$ represents the energy consumption of local FL model transmission from user $i$ to the BS. Note that, since the BS can have a continuous power supply, we do not consider the energy consumption of the BS in our optimization problem.

3.1.2 Problem Formulation

To jointly design the wireless network and the FL algorithm, we now formulate an optimization problem whose goal is to minimize the training loss while factoring in the wireless network parameters. This minimization problem includes optimizing the transmit power as well as the resource allocation for each user. The minimization problem is given by

$$\min_{a, P, R}\; \frac{1}{K} \sum_{i=1}^{U} \sum_{k=1}^{K_i} f\big(g(a, P, R), x_{ik}, y_{ik}\big), \tag{3.11}$$

$$\text{s.t.}\quad a_i, r_{i,n} \in \{0, 1\}, \quad \forall i \in \mathcal{U},\; n = 1, \ldots, R, \tag{3.11a}$$

$$\sum_{n=1}^{R} r_{i,n} = a_i, \quad \forall i \in \mathcal{U}, \tag{3.11b}$$

$$l_i^U(r_i, P_i) + l_i^D \le \gamma_T, \quad \forall i \in \mathcal{U}, \tag{3.11c}$$

$$e_i(r_i, P_i) \le \gamma_E, \quad \forall i \in \mathcal{U}, \tag{3.11d}$$

$$\sum_{i \in \mathcal{U}} r_{i,n} \le 1, \quad \forall n = 1, \ldots, R, \tag{3.11e}$$

$$0 \le P_i \le P_{\max}, \quad \forall i \in \mathcal{U}, \tag{3.11f}$$

where $\gamma_T$ is the delay requirement for implementing the FL algorithm and $\gamma_E$ is the energy consumption requirement of the FL algorithm. Constraints (3.11a) and (3.11b) indicate that each user can occupy only one RB for uplink data transmission. Constraint (3.11c) is the delay requirement for executing the FL algorithm at each learning step, and (3.11d) is the energy consumption requirement for performing the FL algorithm at each learning step. Constraint (3.11e) indicates that each uplink RB can be allocated to at most one user, and (3.11f) is the maximum transmit power constraint. From (3.11), we can see that the user selection vector $a$, the RB allocation matrix $R$, and the transmit power vector $P$ will not change during the FL training process, and the optimized $a$, $R$, and $P$ must meet the delay and energy consumption requirements of each learning step in (3.11c) and (3.11d). From (3.7) and (3.8), we see that the transmit power and resource allocation determine the packet error rates and thus affect the update of the global FL model. In consequence, the loss function of the FL algorithm in (3.11) depends on the resource allocation and transmit power. Moreover, (3.11c) shows that, in order to perform the FL algorithm, the users must satisfy a specific delay requirement. In particular, in an FL algorithm, the BS must wait to receive the local model of each user before updating its global FL model. Hence, the transmission delay plays a key role in the FL performance. In a practical FL algorithm, it is desirable that all users transmit their local FL models to the BS simultaneously. From (3.11d), we see that, to perform the FL algorithm, a given user must have enough energy to transmit and update its local FL model throughout the FL iterative process. If a given user does not have enough energy, the BS should not select this user to participate in the FL process. In consequence, in order to implement an FL algorithm in a real-world network, the wireless network must provide low energy consumption, low latency, and highly reliable data transmission.

3.1.3 Analysis of the FL Convergence Rate

To solve (3.11), we first need to analyze how the packet error rate affects the performance of FL. To find the relationship between the packet error rates and the FL performance, we must first analyze the convergence rate of FL. However, since the update of the global FL model depends on the instantaneous signal-to-interference-plus-noise ratio (SINR), we can analyze only the expected convergence rate of FL. Here, we first analyze the expected convergence rate of FL. Then, we show how the packet error rate affects the performance of FL in (3.11). In the studied network, the users adopt a standard gradient descent method to update their local FL models, as done in [105]. Therefore, during the training process, the local FL model $w_i$ of each selected user $i$ ($a_i = 1$) at step $t$ is

$$w_{i,t+1} = g_t(a, P, R) - \frac{\lambda}{K_i} \sum_{k=1}^{K_i} \nabla f\big(g_t(a, P, R), x_{ik}, y_{ik}\big), \tag{3.12}$$

where $\lambda$ is the learning rate and $\nabla f(g_t(a, P, R), x_{ik}, y_{ik})$ is the gradient of $f(g_t(a, P, R), x_{ik}, y_{ik})$ with respect to $g_t(a, P, R)$. We define $F(g) = \frac{1}{K} \sum_{i=1}^{U} \sum_{k=1}^{K_i} f(g, x_{ik}, y_{ik})$ and $F_i(g) = \sum_{k=1}^{K_i} f(g, x_{ik}, y_{ik})$, where $g$ is short for $g(a, P, R)$. Based on (3.12), the update of the global FL model $g$ at step $t$ is given by

$$g_{t+1} = g_t - \lambda\left(\nabla F(g_t) - o\right), \tag{3.13}$$

where $o = \nabla F(g_t) - \dfrac{\sum_{i=1}^{U} a_i \sum_{k=1}^{K_i} \nabla f(g, x_{ik}, y_{ik})\, C(w_i)}{\sum_{i=1}^{U} K_i a_i C(w_i)}$. We also assume that the FL

algorithm converges to an optimal global FL model $g^*$ after a sufficient number of learning steps. To derive the expected convergence rate of FL, we first make the following assumptions, as done in [37, 109]:

• First, we assume that the gradient $\nabla F(g)$ of $F(g)$ is uniformly Lipschitz continuous with respect to $g$ [110]. Hence, we have

$$\left\|\nabla F(g_{t+1}) - \nabla F(g_t)\right\| \le L \left\|g_{t+1} - g_t\right\|, \tag{3.14}$$

where $L$ is a positive constant and $\|g_{t+1} - g_t\|$ is the norm of $g_{t+1} - g_t$.

• Second, we assume that $F(g)$ is strongly convex with positive parameter $\mu$, such that

$$F(g_{t+1}) \ge F(g_t) + \left(g_{t+1} - g_t\right)^T \nabla F(g_t) + \frac{\mu}{2} \left\|g_{t+1} - g_t\right\|^2. \tag{3.15}$$

• We also assume that $F(g)$ is twice continuously differentiable. Based on (3.14) and (3.15), we have

$$\mu I \preceq \nabla^2 F(g) \preceq L I. \tag{3.16}$$

• We also assume that $\left\|\nabla f(g_t, x_{ik}, y_{ik})\right\|^2 \le \zeta_1 + \zeta_2 \left\|\nabla F(g_t)\right\|^2$ with $\zeta_1, \zeta_2 \ge 0$.

These assumptions can be satisfied by several widely used loss functions such as the mean squared error, logistic regression, and cross entropy [110]. These popular loss functions can be used to capture the performance of implementing practical FL algorithms for identification, prediction, and classification. For future work, we can investigate how to extend our work to other non-convex loss functions. The expected convergence rate of the FL algorithm can now be obtained from the following theorem.


Theorem 3.1 Given the transmit power vector $P$, RB allocation matrix $R$, user selection vector $a$, optimal global FL model $g^*$, and the learning rate $\lambda = \frac{1}{L}$, the upper bound of $\mathbb{E}\big[F(g_{t+1}) - F(g^*)\big]$ is given by

$$\mathbb{E}\big[F(g_{t+1}) - F(g^*)\big] \le A^t\, \mathbb{E}\big[F(g_0) - F(g^*)\big] + \underbrace{\frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) \frac{1 - A^t}{1 - A}}_{\text{impact of wireless factors on FL convergence}}, \tag{3.17}$$

where $A = 1 - \frac{\mu}{L} + \frac{4\mu\zeta_2}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right)$ and $\mathbb{E}(\cdot)$ is the expectation with respect to the packet error rate.

Proof To prove Theorem 3.1, we first rewrite $F(g_{t+1})$ using the second-order Taylor expansion, which can be expressed as

$$F(g_{t+1}) = F(g_t) + \left(g_{t+1} - g_t\right)^T \nabla F(g_t) + \frac{1}{2}\left(g_{t+1} - g_t\right)^T \nabla^2 F(g) \left(g_{t+1} - g_t\right) \le F(g_t) + \left(g_{t+1} - g_t\right)^T \nabla F(g_t) + \frac{L}{2}\left\|g_{t+1} - g_t\right\|^2, \tag{3.18}$$

where the inequality stems from the assumption in (3.16). Given the learning rate $\lambda = \frac{1}{L}$ and based on (3.13), the expected objective $\mathbb{E}\big[F(g_{t+1})\big]$ can be expressed as

$$\mathbb{E}\big[F(g_{t+1})\big] \le \mathbb{E}\left[F(g_t) - \lambda\left(\nabla F(g_t) - o\right)^T \nabla F(g_t) + \frac{L\lambda^2}{2}\left\|\nabla F(g_t) - o\right\|^2\right] \stackrel{(a)}{=} \mathbb{E}\left[F(g_t) - \frac{1}{2L}\left\|\nabla F(g_t)\right\|^2\right] + \frac{1}{2L}\mathbb{E}\big[\|o\|^2\big], \tag{3.19}$$

where (a) stems from the fact that $\frac{L\lambda^2}{2}\left\|\nabla F(g_t) - o\right\|^2 = \frac{1}{2L}\left\|\nabla F(g_t)\right\|^2 - \frac{1}{L} o^T \nabla F(g_t) + \frac{1}{2L}\|o\|^2$. Next, we derive $\mathbb{E}\big[\|o\|^2\big]$, which can be given as follows:


$$\begin{aligned} \mathbb{E}\big[\|o\|^2\big] &= \mathbb{E}\left[\left\|\frac{\sum_{i=1}^{U} a_i \sum_{k=1}^{K_i} \nabla f(g, x_{ik}, y_{ik})\, C(w_i)}{\sum_{i=1}^{U} K_i a_i C(w_i)} - \nabla F(g_t)\right\|^2\right] \\ &= \mathbb{E}\left[\left\|\frac{\left(K - \sum_{i=1}^{U} K_i a_i C(w_i)\right) \sum_{i \in N_1} \sum_{k=1}^{K_i} \nabla f(g, x_{ik}, y_{ik})}{K \sum_{i=1}^{U} K_i a_i C(w_i)} - \frac{\sum_{i \in N_2} \sum_{k=1}^{K_i} \nabla f(g, x_{ik}, y_{ik})}{K}\right\|^2\right] \\ &\le \mathbb{E}\left[\left(\frac{\left(K - \sum_{i=1}^{U} K_i a_i C(w_i)\right) \sum_{i \in N_1} \sum_{k=1}^{K_i} \left\|\nabla f(g, x_{ik}, y_{ik})\right\|}{K \sum_{i=1}^{U} K_i a_i C(w_i)} + \frac{\sum_{i \in N_2} \sum_{k=1}^{K_i} \left\|\nabla f(g, x_{ik}, y_{ik})\right\|}{K}\right)^2\right], \end{aligned} \tag{3.20}$$

where $N_1 = \{i \in \mathcal{U} \mid a_i = 1, C(w_i) = 1\}$ is the set of users that correctly transmit their local FL models to the BS and $N_2 = \{i \in \mathcal{U} \mid i \notin N_1\}$. The inequality in (3.20) follows from the triangle inequality. Since $\left\|\nabla f(g_t, x_{ik}, y_{ik})\right\| \le \sqrt{\zeta_1 + \zeta_2 \left\|\nabla F(g_t)\right\|^2}$, we have $\sum_{i \in N_1} \sum_{k=1}^{K_i} \left\|\nabla f(g, x_{ik}, y_{ik})\right\| \le \sqrt{\zeta_1 + \zeta_2 \left\|\nabla F(g_t)\right\|^2} \times \sum_{i=1}^{U} K_i a_i C(w_i)$ and $\sum_{i \in N_2} \sum_{k=1}^{K_i} \left\|\nabla f(g, x_{ik}, y_{ik})\right\| \le \sqrt{\zeta_1 + \zeta_2 \left\|\nabla F(g_t)\right\|^2} \times \left(K - \sum_{i=1}^{U} K_i a_i C(w_i)\right)$. Hence, $\mathbb{E}\big[\|o\|^2\big]$ can be bounded by

$$\mathbb{E}\big[\|o\|^2\big] \le \frac{4}{K^2}\, \mathbb{E}\left[\left(K - \sum_{i=1}^{U} K_i a_i C(w_i)\right)^2 \left(\zeta_1 + \zeta_2 \left\|\nabla F(g_t)\right\|^2\right)\right]. \tag{3.21}$$

Since $K \ge K - \sum_{i=1}^{U} K_i a_i C(w_i) \ge 0$, we have

$$\mathbb{E}\big[\|o\|^2\big] \le \frac{4}{K}\, \mathbb{E}\left[\left(K - \sum_{i=1}^{U} K_i a_i C(w_i)\right) \left(\zeta_1 + \zeta_2 \left\|\nabla F(g_t)\right\|^2\right)\right]. \tag{3.22}$$

Since $K = \sum_{i=1}^{U} K_i$ and $\mathbb{E}\big(C(w_i)\big) = 1 - q_i(r_i, P_i)$, (3.22) can be simplified as follows:

$$\mathbb{E}\big[\|o\|^2\big] \le \frac{4}{K}\, \mathbb{E}\left[\sum_{i=1}^{U} K_i \left(1 - a_i C(w_i)\right) \left(\zeta_1 + \zeta_2 \left\|\nabla F(g_t)\right\|^2\right)\right] = \frac{4}{K} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) \left(\zeta_1 + \zeta_2 \left\|\nabla F(g_t)\right\|^2\right). \tag{3.23}$$

Substituting (3.23) into (3.19), we have

$$\mathbb{E}\big[F(g_{t+1})\big] \le \mathbb{E}\big[F(g_t)\big] + \frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) - \frac{1}{2L}\left(1 - \frac{4\zeta_2}{K} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right)\right) \left\|\nabla F(g_t)\right\|^2. \tag{3.24}$$

Subtracting $\mathbb{E}\big[F(g^*)\big]$ from both sides of (3.24), we have

$$\mathbb{E}\big[F(g_{t+1}) - F(g^*)\big] \le \mathbb{E}\big[F(g_t) - F(g^*)\big] + \frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) - \frac{1}{2L}\left(1 - \frac{4\zeta_2}{K} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right)\right) \left\|\nabla F(g_t)\right\|^2. \tag{3.25}$$

Given (3.15) and (3.16), we have [111]

$$\left\|\nabla F(g_t)\right\|^2 \ge 2\mu \left(F(g_t) - F(g^*)\right). \tag{3.26}$$

Substituting (3.26) into (3.25), we have

$$\mathbb{E}\big[F(g_{t+1}) - F(g^*)\big] \le \frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) + A\, \mathbb{E}\big[F(g_t) - F(g^*)\big], \tag{3.27}$$

where $A = 1 - \frac{\mu}{L} + \frac{4\mu\zeta_2}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right)$. Applying (3.27) recursively, we have

$$\begin{aligned} \mathbb{E}\big[F(g_{t+1}) - F(g^*)\big] &\le \frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) \sum_{k=0}^{t-1} A^k + A^t\, \mathbb{E}\big[F(g_0) - F(g^*)\big] \\ &= \frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) \frac{1 - A^t}{1 - A} + A^t\, \mathbb{E}\big[F(g_0) - F(g^*)\big]. \end{aligned} \tag{3.28}$$

This completes the proof. ∎
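To make the bound concrete, the following sketch evaluates $A$ and the asymptotic gap of (3.17) for hypothetical values of $L$, $\mu$, $\zeta_1$, $\zeta_2$, and the packet error rates; the constants are illustrative and are not derived from a particular loss function.

```python
import numpy as np

# Hypothetical constants satisfying the assumptions above.
L, mu, zeta1, zeta2 = 10.0, 1.0, 0.5, 0.1
K_i = np.array([12, 10, 8, 4, 2]); K = K_i.sum()
a = np.ones_like(K_i)                           # all users selected
q = np.array([0.05, 0.1, 0.2, 0.0, 0.3])        # packet error rates

penalty = (K_i * (1 - a + a * q)).sum()         # wireless penalty term
A = 1 - mu / L + 4 * mu * zeta2 / (L * K) * penalty
assert A < 1, "A >= 1: the bound of Theorem 3.1 does not contract"

gap = 2 * zeta1 / (L * K) * penalty / (1 - A)   # asymptotic gap (A^t -> 0)
print(A, gap)   # lowering the packet error rates shrinks both A and the gap
```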

3.1.4 Optimization of FL Training Loss

In this section, our goal is to minimize the FL loss function while considering the underlying wireless network constraints. To solve the problem in (3.11), we must first simplify it. From Theorem 3.1, we can see that, to minimize the training loss in (3.11), we only need to minimize the gap $\frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) \frac{1 - A^t}{1 - A}$. When $A \ge 1$, the FL algorithm will not converge. In consequence, here we only consider the minimization of the FL training loss when $A < 1$. Hence, when $t$ is large enough, which captures the asymptotic convergence behavior of FL, we have $A^t \to 0$. The gap can be rewritten as

$$\frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right) \frac{1 - A^t}{1 - A} = \frac{\frac{2\zeta_1}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right)}{\frac{\mu}{L} - \frac{4\mu\zeta_2}{LK} \sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right)}. \tag{3.29}$$

From (3.29), we can observe that minimizing the gap only requires minimizing $\sum_{i=1}^{U} K_i \left(1 - a_i + a_i q_i(r_i, P_i)\right)$. Meanwhile, since $a_i = \sum_{n=1}^{R} r_{i,n}$ and $q_i(r_i, P_i) = \sum_{n=1}^{R} r_{i,n} q_{i,n}$, we have $q_i(r_i, P_i) \le 1$ when $a_i = 1$, and


$q_i(r_i, P_i) = 0$ if $a_i = 0$. In consequence, we have $a_i q_i(r_i, P_i) = q_i(r_i, P_i)$. The problem in (3.11) can be simplified as

$$\min_{P, R}\; \sum_{i=1}^{U} K_i \left(1 - \sum_{n=1}^{R} r_{i,n} + q_i(r_i, P_i)\right), \tag{3.30}$$

$$\text{s.t.}\quad \text{(3.11c)–(3.11f)},$$

$$r_{i,n} \in \{0, 1\}, \quad \forall i \in \mathcal{U},\; n = 1, \ldots, R, \tag{3.30a}$$

$$\sum_{n=1}^{R} r_{i,n} \le 1, \quad \forall i \in \mathcal{U}. \tag{3.30b}$$

Next, we first find the optimal transmit power of each user given the uplink RB allocation matrix $R$. Then, we find the uplink RB allocation that minimizes the FL loss function.

3.1.4.1 Optimal Transmit Power

The optimal transmit power of each user $i$ can be determined by the following proposition.

Proposition 3.1 Given the uplink RB allocation vector $r_i$ of each user $i$, the optimal transmit power of each user $i$, $P_i^*$, is given by

$$P_i^*(r_i) = \min\left\{P_{\max},\, P_{i,\gamma_E}\right\}, \tag{3.31}$$

where $P_{i,\gamma_E}$ satisfies the equality $\varsigma \omega_i \vartheta^2 Z(w_i) + \dfrac{P_{i,\gamma_E}\, Z(w_i)}{c_i^U\left(r_i, P_{i,\gamma_E}\right)} = \gamma_E$.

Proof To prove Proposition 3.1, we first prove that $e_i(r_i, P_i)$ is an increasing function of $P_i$. Based on (3.3) and (3.10), we have

$$e_i(r_i, P_i) = \varsigma \omega_i \vartheta^2 Z(w_i) + \frac{P_i\, Z(w_i)}{\sum_{n=1}^{R} r_{i,n} B^U \log_2\left(1 + \kappa_{i,n} P_i\right)}, \tag{3.32}$$

where $\kappa_{i,n} = \frac{h_i}{\sum_{i' \in \mathcal{U}_n'} P_{i'} h_{i'} + B^U N_0}$, with $\mathcal{U}_n'$ being the set of users in other service areas that occupy RB $n$. The first derivative of $e_i(r_i, P_i)$ with respect to $P_i$ is given by

$$\frac{\partial e_i(r_i, P_i)}{\partial P_i} = \frac{(\ln 2)\, Z(w_i) \sum_{n=1}^{R} \frac{r_{i,n} B^U}{1 + \kappa_{i,n} P_i}\left[\left(1 + \kappa_{i,n} P_i\right)\ln\left(1 + \kappa_{i,n} P_i\right) - \kappa_{i,n} P_i\right]}{\left(\sum_{n=1}^{R} r_{i,n} B^U \ln\left(1 + \kappa_{i,n} P_i\right)\right)^2}. \tag{3.33}$$

Since $(1 + x)\ln(1 + x) - x > 0$ for all $x > 0$, $\frac{\partial e_i(r_i, P_i)}{\partial P_i}$ is always positive when $P_i > 0$, and hence $e_i(r_i, P_i)$ is a monotonically increasing function for $P_i > 0$. We prove Proposition 3.1 by contradiction. Assume that $P_i'$ ($P_i' \ne P_i^*$) is the optimal transmit power of user $i$. Since $e_i(r_i, P_i)$ is a monotonically increasing function of $P_i$, if $P_i' > P_i^*$, then either $e_i(r_i, P_i') > \gamma_E$, which violates the energy constraint (3.11d), or $P_i' > P_{\max}$, which violates the power constraint (3.11f). From (3.7), we see that the packet error rates decrease as the transmit power increases. Thus, if $P_i' < P_i^*$, we have $q_i(r_i, P_i^*) \le q_i(r_i, P_i')$ and, in consequence, $P_i'$ cannot minimize the objective function in (3.30). Hence, we must have $P_i' = P_i^*$. This completes the proof. ∎
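Because $e_i(r_i, P_i)$ is increasing in $P_i$, the threshold power $P_{i,\gamma_E}$ of Proposition 3.1 can be found numerically, e.g., by bisection. The sketch below assumes a single allocated RB, drops the channel expectation in (3.3), and uses illustrative values loosely based on Table 3.1; the effective SINR slope $\kappa$ and the energy budget used here are hypothetical.

```python
import numpy as np

def uplink_rate(P, kappa, BU=1e6):
    """Deterministic single-RB form of the uplink rate in (3.3)."""
    return BU * np.log2(1 + kappa * P)

def energy(P, Z, kappa, varsigma=1e-27, omega=40, theta=1e9):
    """e_i(r_i, P_i) per (3.10): local training energy + transmit energy."""
    return varsigma * omega * theta**2 * Z + P * Z / uplink_rate(P, kappa)

def optimal_power(Z, kappa, gamma_E, p_max=0.01, tol=1e-12):
    """P_i* = min(P_max, P_{i,gammaE}) per Proposition 3.1; P_{i,gammaE}
    is found by bisection, valid because e_i is increasing in P_i."""
    if energy(p_max, Z, kappa) <= gamma_E:
        return p_max                      # energy budget never binds
    lo, hi = 1e-12, p_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if energy(mid, Z, kappa) <= gamma_E else (lo, mid)
    return lo

print(optimal_power(Z=5e4, kappa=1e3, gamma_E=0.00205))   # ~0.001 W
```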

From Proposition 3.1, we see that the optimal transmit power depends on the size of the local FL model $Z(w_i)$ and the interference on each RB. In particular, as the size of the local FL model increases, each user must spend more energy on training the FL model and, hence, the energy that can be used for data transmission decreases. In consequence, the training loss increases. Hereinafter, for simplicity, $P_i^*$ is short for $P_i^*(r_i)$.

3.1.4.2 Optimal Uplink Resource Block Allocation

Based on (3.7), the optimization problem in (3.30) can be simplified as follows:

$$\min_{R}\; \sum_{i=1}^{U} K_i \left(1 - \sum_{n=1}^{R} r_{i,n} + \sum_{n=1}^{R} r_{i,n} q_{i,n}\right), \tag{3.34}$$

$$\text{s.t.}\quad \text{(3.30a), (3.30b), and (3.11e)},$$

$$l_i^U\left(r_i, P_i^*\right) + l_i^D \le \gamma_T, \quad \forall i \in \mathcal{U}, \tag{3.34a}$$

$$e_i\left(r_i, P_i^*\right) \le \gamma_E, \quad \forall i \in \mathcal{U}. \tag{3.34b}$$

Obviously, the objective function in (3.34) is linear, the constraints are nonlinear, and the optimization variables are integers. Hence, problem (3.34) can be solved using a bipartite matching algorithm [112]. Compared to traditional convex optimization algorithms, using bipartite matching to solve problem (3.34) requires neither computing the gradients of each variable nor dynamically adjusting a step size for convergence.


To use a bipartite matching algorithm for solving problem (3.34), we first transform the optimization problem into a bipartite matching problem. We construct a bipartite graph $\mathcal{A} = (\mathcal{U} \times \mathcal{R}, \mathcal{E})$, where $\mathcal{R}$ is the set of RBs that can be allocated to the users, each vertex in $\mathcal{U}$ represents a user, each vertex in $\mathcal{R}$ represents an RB, and $\mathcal{E}$ is the set of edges connecting the vertices of $\mathcal{U}$ and $\mathcal{R}$. Let $\chi_{in} \in \mathcal{E}$ be the edge connecting vertex $i$ in $\mathcal{U}$ and vertex $n$ in $\mathcal{R}$, with $\chi_{in} \in \{0, 1\}$, where $\chi_{in} = 1$ indicates that RB $n$ is allocated to user $i$; otherwise, we have $\chi_{in} = 0$. Let the matching $\mathbb{T}$ be a subset of edges in $\mathcal{E}$ in which no two edges share a common vertex in $\mathcal{R}$, such that each RB $n$ can only be allocated to one user (constraint (3.11e) is satisfied). Similarly, no two edges in $\mathbb{T}$ share a common vertex in $\mathcal{U}$, such that each user $i$ can occupy only one RB (constraint (3.11b) is satisfied). The weight of edge $\chi_{in}$ is given by

$$\psi_{in} = \begin{cases} K_i \left(q_{i,n} - 1\right), & l_i^U\left(r_{i,n}, P_i^*\right) + l_i^D \le \gamma_T \text{ and } e_i\left(r_{i,n}, P_i^*\right) \le \gamma_E, \\ 0, & \text{otherwise}. \end{cases} \tag{3.35}$$

From (3.35), we can see that, when RB $n$ is allocated to user $i$ but the delay and energy requirements cannot be satisfied, we have $\psi_{in} = 0$, which indicates that RB $n$ will not be allocated to user $i$. The goal of the formulated bipartite matching problem is to find the optimal matching set $\mathbb{T}^*$ that minimizes the sum of the weights of the edges in $\mathbb{T}^*$. A standard Hungarian algorithm [113] can be used to find the optimal matching set $\mathbb{T}^*$. Once the optimal matching set is found, the optimal RB allocation is determined. Given the optimal RB allocation vector $r_i^*$, the optimal transmit power of each device can be determined by (3.31), and the optimal user selection can be determined by $a_i^* = \sum_{n=1}^{R} r_{i,n}^*$. Algorithm 1 summarizes the entire process of optimizing the user selection vector $a$, the RB allocation matrix $R$, and the transmit power vector $P$ for training the FL algorithm.
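The matching step itself can be carried out with an off-the-shelf Hungarian solver. The sketch below builds the weights $\psi_{in}$ of (3.35), with random packet error rates and random feasibility flags standing in for the delay and energy checks in (3.34a) and (3.34b), and solves the assignment with SciPy; the problem sizes are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
U_n, R_n = 5, 5                               # users, RBs (illustrative)
K_i = np.array([12, 10, 8, 4, 2])
q = rng.uniform(0.0, 0.3, size=(U_n, R_n))    # q_{i,n} as in (3.7)
feasible = rng.random((U_n, R_n)) > 0.1       # stand-in for (3.34a)-(3.34b)

# Edge weights psi_{in} per (3.35): K_i (q_{i,n} - 1) if feasible, else 0.
psi = np.where(feasible, K_i[:, None] * (q - 1.0), 0.0)

# Hungarian algorithm: minimize the total weight of the matching.
row, col = linear_sum_assignment(psi)
for i, n in zip(row, col):
    if psi[i, n] < 0:                         # only feasible edges count
        print(f"user {i} -> RB {n} (a_i* = 1)")
```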

3.1.5 Simulation Results and Analysis

For simulations, we consider a circular network area with a radius of $r = 500$ m, with one BS at its center servicing $U = 15$ uniformly distributed users. The other parameters used in the simulations are listed in Table 3.1. The FL algorithm is simulated using the Matlab Machine Learning Toolbox for handwritten digit identification. Each user trains an FNN that consists of 50 neurons using the MNIST dataset [114]. The loss function is the cross-entropy loss. For comparison purposes, we use three baselines: (a) an FL algorithm that optimizes user selection with random resource allocation, (b) an FL algorithm that randomly determines user selection and resource allocation, which can be seen as a standard FL algorithm (e.g., similar to the one in [105]) that is not wireless-aware, and (c) a wireless optimization algorithm that minimizes the sum packet error rate of all users by optimizing user selection and transmit power while ignoring FL parameters. The code is available at: https://github.com/mzchen0/Wireless-FL.


Table 3.1 System parameters

Parameter      Value              Parameter      Value
α              2                  N_0            −174 dBm/Hz
P_B            1 W                B^D            20 MHz
m              0.023 dB           B^U            1 MHz
σ_i            1                  P_max          0.01 W
ϑ              10^9               K_i            [12, 10, 8, 4, 2]
ς              10^{-27}           γ_T            500 ms
ω_i            40                 γ_E            0.003 J

Fig. 3.2 Identification accuracy as the number of iterations varies

In Fig. 3.2, we show how the identification accuracy changes as the number of iterations varies. From Fig. 3.2, we see that, as the number of iterations increases, the identification accuracy of all considered learning algorithms increases first and then remains unchanged. The fact that the identification accuracy eventually remains unchanged demonstrates that the FL algorithms converge. From Fig. 3.2, we can also see that the speed of increase of the identification accuracy differs across iterations. This is due to the fact that the local FL models received by the BS may contain data errors, in which case the BS cannot use them for the update of the global FL model. In consequence, at each iteration, the number of local FL models that can be used for the update of the global FL model is different. Figure 3.2 also shows that a gap exists between the designed algorithm and baselines (a), (b), and (c). This gap is caused by the packet errors.


Fig. 3.3 Identification accuracy as the total number of users varies ($R = 12$)

Figure 3.3 shows how the identification accuracy changes as the total number of users varies. In this figure, an appropriate subset of users is selected to perform the FL algorithm. From Fig. 3.3, we can observe that, as the number of users increases, the identification accuracy increases. This is due to the fact that an increase in the number of users provides more data for training the FL algorithm and, hence, improves the accuracy of the approximation of the gradient of the loss function. Figure 3.3 also shows that the designed algorithm improves the identification accuracy by up to 1.2%, 1.7%, and 2.3% compared to baselines (a), (b), and (c), respectively, when the network consists of 18 users. The 1.2% improvement stems from the fact that the designed algorithm optimizes the resource allocation. The 1.7% improvement stems from the fact that the designed algorithm jointly considers learning and wireless effects and, hence, can optimize the user selection and resource allocation to reduce the FL loss function. The 2.3% improvement stems from the fact that the designed algorithm optimizes wireless factors while considering FL parameters such as the number of training data samples. Figure 3.3 also shows that, when the number of users is less than 12, the identification accuracy increases quickly, whereas, as the number of users continues to increase, the identification accuracy increases slowly. This is because, for a higher number of users, the BS already has enough data samples to accurately approximate the gradient of the loss function. Figure 3.4 shows how the identification accuracy changes as the number of RBs varies. From Fig. 3.4, we can see that, as the number of RBs increases, the identification accuracy resulting from all of the considered FL algorithms increases. This is due to the fact that, as the number of RBs increases, the number of users that can perform the FL algorithm increases.


Fig. 3.4 Identification accuracy as the number of RBs varies ($U = 15$)


Fig. 3.5 An example of implementing FL for handwritten digit identification

From Fig. 3.4, we can also see that the designed FL algorithm can achieve up to 1.4%, 3.5%, and 4.1% gains in terms of identification accuracy compared to baselines (a), (b), and (c), respectively, for a network with 9 RBs. This is because the wireless-aware FL algorithm can optimize the RB allocation, transmit power, and user selection, thus minimizing the loss function value.


Figure 3.5 shows one example of implementing the wireless-aware FL algorithm for handwritten digit identification. In particular, each user trains a convolutional neural network (CNN) using the MNIST dataset. In this simulation, the CNNs are generated by the Matlab Machine Learning Toolbox, and each user has 2000 training data samples to train the CNN. From this figure, we can see that, out of 36 handwritten digits, the designed algorithm correctly identifies 30 while baseline (b) correctly identifies 27. Hence, the designed FL algorithm can identify the handwritten digits more accurately than baseline (b). This is because the designed FL algorithm can minimize the packet error rates of the users and, hence, improve the FL performance. From Fig. 3.5, we can also see that, even though CNNs are not convex, the designed FL algorithm still improves the FL performance. This is further evidence that our assumptions can still apply to practical FL solutions.

3.2 Resource Management for FL Convergence Time Minimization

In this section, we introduce the optimization of resource management for FL convergence time minimization.

3.2.1 Wireless FL Model

Let $c_i^U(r_{i,\mu})$ be the uplink rate of user $i$ that is transmitting its local FL model parameters to the BS at iteration $\mu$, with $r_{i,\mu} = [r_{i1,\mu}, \ldots, r_{iR,\mu}]$ being an RB allocation vector, and let $c_i^D$ be the downlink data rate of the BS when transmitting the global FL model parameters to each user $i$. Let $Z$ be the data size of a global or local FL model. The transmission delays between user $i$ and the BS over the uplink and downlink at iteration $\mu$ are

$$l_i^U(r_{i,\mu}) = \frac{Z}{c_i^U(r_{i,\mu})}, \tag{3.36}$$

$$l_i^D = \frac{Z}{c_i^D}. \tag{3.37}$$

The time that the users and the BS require to jointly complete an update of their respective local and global FL models at iteration $\mu$ is given by

$$t_\mu(a_\mu, R_\mu) = \max_{i \in \mathcal{U}}\; a_{i,\mu}\left(l_i^U(r_{i,\mu}) + l_i^D\right), \tag{3.38}$$


where $R_\mu = [r_{1,\mu}, \ldots, r_{U,\mu}]$. When $a_{i,\mu} = 0$, we have $a_{i,\mu}\left(l_i^U(r_{i,\mu}) + l_i^D\right) = 0$. Here, $a_{i,\mu} = 0$ implies that user $i$ will not send its local FL model to the BS at iteration $\mu$ and, hence, user $i$ will not cause any delay at iteration $\mu$. When $a_{i,\mu} = 1$, user $i$ transmits its local FL model to the BS and the transmission delay is $l_i^U(r_{i,\mu}) + l_i^D$. Hence, (3.38) is essentially the worst-case transmission delay among all selected users.
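A minimal sketch of the per-iteration delay computation in (3.36)–(3.38), with hypothetical uplink and downlink rates:

```python
import numpy as np

Z = 5e4                                   # bits per local/global FL model
cU = np.array([2e6, 1e6, 3e6, 0.5e6])     # c_i^U(r_{i,mu}) per user (bit/s)
cD = 10e6                                 # downlink rate c_i^D (bit/s)
a = np.array([1, 1, 0, 1])                # a_{i,mu}: users selected this round

l_up, l_down = Z / cU, Z / cD             # delays per (3.36) and (3.37)
t_mu = np.max(a * (l_up + l_down))        # (3.38): slowest selected user
print(t_mu)                               # unselected users add no delay
```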

3.2.2 Problem Formulation

Having defined the system model, the next step is to introduce a joint RB allocation and user selection scheme that minimizes the time the users and the BS need in order to complete the FL training process. This optimization problem is formulated as follows:

$$\min_{A, R}\; \sum_{\mu=1}^{T} t_\mu(a_\mu, R_\mu)\, \Omega_\mu \tag{3.39}$$

$$\text{s.t.}\quad a_{i,\mu}, r_{in,\mu}, \Omega_\mu \in \{0, 1\}, \quad \forall i \in \mathcal{U},\; n = 1, \ldots, R, \tag{3.39a}$$

$$\sum_{i \in \mathcal{U}} r_{in,\mu} \le 1, \quad \forall n = 1, \ldots, R, \tag{3.39b}$$

$$\sum_{n=1}^{R} r_{in,\mu} = a_{i,\mu}, \quad \forall i \in \mathcal{U}, \tag{3.39c}$$

where $A = [a_1, \ldots, a_\mu, \ldots, a_T]$ is the user selection matrix for all iterations, $R = [R_1, \ldots, R_\mu, \ldots, R_T]$ is the RB allocation matrix for all users at all iterations, and $T$ is a constant that is large enough to guarantee the convergence of FL. In other words, the number of iterations that the FL algorithm requires to converge will not be larger than $T$. In (3.39), $\Omega_\mu = 1$ implies that the FL algorithm has not yet converged; otherwise, we have $\Omega_\mu = 0$. Constraints (3.39a) and (3.39b) imply that each user can occupy only one RB, and (3.39c) implies that all RBs must be allocated to the users that are associated with the BS. From (3.39), we see that the time used for the update of the local and global FL models, $t_\mu$, depends on the user selection vector $a_\mu$ and the RB allocation matrix $R_\mu$. Meanwhile, the total number of iterations that the FL algorithm needs in order to converge depends on the user selection vector $a_\mu$. In consequence, the time duration of each FL training iteration and the number of iterations needed for the FL algorithm to converge are interdependent. Moreover, given the global model $g(a_\mu)$ at iteration $\mu$, the BS cannot calculate the number of iterations that the FL algorithm needs to converge, since all of the training data samples are located at the users' devices. Hence, problem (3.39) is challenging to solve.


Fig. 3.6 The training procedure of the proposed FL

3.2.3 Minimization of FL Convergence Time

To solve problem (3.39), we first determine the user association at each iteration. Given the user selection vector, the optimal RB allocation scheme can be derived. To further improve the convergence speed of FL, artificial neural networks (ANNs) are introduced to estimate the local FL models of the users that are not allocated any RBs for transmitting their local model parameters at a given learning step. Figure 3.6 summarizes the training procedure of the introduced FL scheme.

3.2.3.1 Gradient Based User Association Scheme

To predict the local FL model of each user, the BS needs to use the local FL model of a given user as an ANN input, as will be explained in the following subsections. Hence, in the proposed user association scheme, one user must be selected to connect with the BS during all training iterations. To determine the user that connects to the BS during the entire training process, we first assume that the distance $d_i$ between user $i$ and the BS satisfies $d_1 \le d_2 \le \ldots \le d_U$. Hence, the user $i^*$ that always connects to the BS can be found from

$$i^* = \arg\max_{i \in \mathcal{U}}\; \left\|\sum_{k=1}^{K_i} \nabla f\big(g(a_{\mu-1}), x_{ik}, y_{ik}\big)\right\|, \tag{3.40}$$

$$\text{s.t.}\quad d_1 \le d_i \le d_{\gamma_R}, \quad \forall i \in \mathcal{U}, \tag{3.40a}$$


where $\gamma_R$ is the number of users considered in (3.40), with $1 \le \gamma_R \le U$. As $\gamma_R$ increases, the number of users considered in (3.40) increases. Hence, the transmission delay of user $i^*$ may increase, thus increasing the time used to complete one FL iteration. However, as $\gamma_R$ increases, the value of $\max_{i \in \mathcal{U}} \left\|\sum_{k=1}^{K_i} \nabla f\big(g(a_{\mu-1}), x_{ik}, y_{ik}\big)\right\|$ may increase and, thus, the number of iterations required for FL to converge decreases. Here, user $i^*$ is determined at the first iteration.

At each iteration $\mu$, the global FL model $g(a_{\mu-1})$ will change by $\lambda \sum_{k=1}^{K_i} \nabla f\big(g(a_{\mu-1}), x_{ik}, y_{ik}\big)$ due to the local FL model of a given user $i$. We define a vector $e_{i,\mu} = \lambda \sum_{k=1}^{K_i} \nabla f\big(g(a_{\mu-1}), x_{ik}, y_{ik}\big)$ as the change in the global FL model due to user $i$'s local FL model. To enable the BS to predict each user's local FL model at each learning step, each user must have a chance to connect to the BS so as to provide training data samples (local FL model parameters) to the BS for training the ANNs. Therefore, a probabilistic user association scheme is developed, which is given by

$$p_{i,\mu} = \begin{cases} \dfrac{\left\|e_{i,\mu}\right\|}{\sum_{i=1, i \ne i^*}^{U} \left\|e_{i,\mu}\right\|}, & \text{if } i \ne i^*, \\ 1, & \text{if } i = i^*, \end{cases} \tag{3.41}$$

where $p_{i,\mu}$ represents the probability that user $i$ connects to the BS at iteration $\mu$, and $\left\|e_{i,\mu}\right\|$ is the norm of the vector $e_{i,\mu}$. From (3.41), we can see that, as $\left\|e_{i,\mu}\right\|$ increases, the probability of associating user $i$ with the BS increases. In consequence, the probability that the BS uses user $i$'s local FL model to generate the global FL model increases. Hence, using the proposed user association scheme in (3.41), the BS has a high probability of connecting to the users whose local FL models significantly affect the global FL model, thus improving the FL convergence speed. From (3.41), we also see that user $i^*$ always connects to the BS so as to provide information for the prediction of the other users' local FL models. Based on (3.41), the user association scheme at each iteration can be determined. To calculate $p_{i,\mu}$ in (3.41), the BS only needs to know $\left\|e_{i,\mu}\right\|$ of each user $i$, without requiring the exact training data information. In fact, $\left\|e_{i,\mu}\right\|$ can be directly calculated by user $i$, and each user $i$ needs to transmit only the scalar $\left\|e_{i,\mu}\right\|$ to the BS.
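The following sketch computes the association probabilities of (3.41) from the reported norms $\|e_{i,\mu}\|$. For illustration it samples a single additional user per iteration, whereas in the scheme above the number of connected users is set by the available RBs.

```python
import numpy as np

rng = np.random.default_rng(3)
U_n, i_star = 8, 0                         # user 0 plays the role of i*
e_norm = rng.uniform(0.1, 2.0, size=U_n)   # ||e_{i,mu}|| reported by users

# Probabilities per (3.41): proportional to ||e_{i,mu}|| for i != i*.
p = e_norm / e_norm[np.arange(U_n) != i_star].sum()
p[i_star] = 1.0                            # i* always connects

# Sample which other user connects this iteration; users whose local
# models change the global model most are picked more often.
others = np.array([i for i in range(U_n) if i != i_star])
chosen = rng.choice(others, p=p[others] / p[others].sum())
print(sorted({i_star, int(chosen)}))
```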

3.2.3.2 Optimal RB Allocation Scheme

Given the user association scheme at each iteration $\mu$, problem (3.39) at iteration $\mu$ can be simplified as follows:

$$\min_{R_\mu}\; t_\mu(R_\mu) = \min_{R_\mu}\; \max_{i \in \mathcal{U}}\; a_{i,\mu}\left(l_i^U(r_{i,\mu}) + l_i^D\right) \tag{3.42}$$

$$\text{s.t.}\quad r_{in,\mu} \in \{0, 1\}, \quad \forall i \in \mathcal{U},\; n = 1, \ldots, R, \tag{3.42a}$$

$$\sum_{i \in \mathcal{U}} r_{in,\mu} \le 1, \quad \forall n = 1, \ldots, R, \tag{3.42b}$$

$$\sum_{n=1}^{R} r_{in,\mu} = a_{i,\mu}, \quad \forall i \in \mathcal{U}. \tag{3.42c}$$

We assume that there exists a variable $m$ that satisfies $a_{i,\mu}\left(l_i^U(r_{i,\mu}) + l_i^D\right) \le m$. Problem (3.42) can be simplified as follows:

$$\min_{R_\mu}\; m \tag{3.43}$$

$$\text{s.t.}\quad \text{(3.42a), (3.42b), and (3.42c)}, \tag{3.43a}$$

$$m \ge a_{i,\mu}\left(l_i^U(r_{i,\mu}) + l_i^D\right), \quad \forall i \in \mathcal{U}. \tag{3.43b}$$

Since (3.43b) is nonlinear, we must transform it into a linear constraint. We U ⎛Z ⎞ , which represents the delay of user i = first assume that .lin,μ P hi Blog2 1+In +BN 0 ( ) transmitting the local FL model over RB n at iteration .μ. Then, we have .liU r i,μ = R Σ U . Hence, problem (3.43) can be rewritten as follows: rin,μ lin,μ n=1

.

min m Rμ

s. t.. (3.42a), (3.42b), and (3.42c), . ⎛ R ⎞ Σ U D m ≥ ai,μ rin,μ lin,μ + li , ∀i ∈ U.

(3.44) (3.44a) (3.44b)

n=1

Problem (3.44) is equivalent to (3.43) and is an integer linear programming problem, which can be solved by known optimization algorithms such as interior-point methods [115].
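As an illustration of how (3.44) can be fed to an off-the-shelf solver, the following sketch uses SciPy's mixed-integer linear programming interface; the function and array names are ours, not from the book:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def allocate_rbs(l_uplink, l_down, active):
    """Solve (3.44): min-max completion-time RB assignment.

    l_uplink: (U, R) array of uplink delays l^U_{in}.
    l_down:   (U,)  array of downlink delays l^D_i.
    active:   (U,)  0/1 array a_{i,mu} of users selected this iteration.
    Returns a (U, R) 0/1 assignment matrix.
    """
    U, R = l_uplink.shape
    n_r = U * R                               # variables: r_{in}, then m
    c = np.zeros(n_r + 1); c[-1] = 1.0        # objective: minimize m
    A, lb, ub = [], [], []
    for n in range(R):                        # (3.42b): one user per RB
        row = np.zeros(n_r + 1); row[n:n_r:R] = 1.0
        A.append(row); lb.append(0.0); ub.append(1.0)
    for i in range(U):                        # (3.42c): a_i RBs per user
        row = np.zeros(n_r + 1); row[i*R:(i+1)*R] = 1.0
        A.append(row); lb.append(float(active[i])); ub.append(float(active[i]))
    for i in range(U):                        # (3.44b): delays bounded by m
        row = np.zeros(n_r + 1)
        row[i*R:(i+1)*R] = active[i] * l_uplink[i]
        row[-1] = -1.0
        A.append(row); lb.append(-np.inf); ub.append(-active[i] * l_down[i])
    integrality = np.ones(n_r + 1); integrality[-1] = 0   # r binary, m continuous
    res = milp(c, constraints=LinearConstraint(np.array(A), lb, ub),
               integrality=integrality,
               bounds=Bounds(np.zeros(n_r + 1), np.r_[np.ones(n_r), np.inf]))
    return res.x[:n_r].reshape(U, R).round().astype(int)
```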

3.2.3.3 Prediction of the Local FL Models

The previous subsections determine the users that are associated with the BS at each iteration and minimize their transmission delay by optimizing the RB allocation. Next, we introduce an ANN-based algorithm to predict the local FL model parameters of the users that are not allocated any RBs for local FL model transmission at each given learning step. In particular, ANNs are used to build a relationship between the local FL models of different users. Since fully-connected multilayer perceptrons (MLPs) are good at function-fitting tasks, and finding a relationship among different users' local FL models is a regression task, we prefer to use an MLP instead of other neural networks such as recurrent neural networks. Next, we first introduce the architecture of our MLP-based algorithm. Then, we explain how to implement this algorithm to predict the local FL model parameters of the users at each given learning step. Our MLP-based prediction algorithm consists of three components: (a) input, (b) a single hidden layer, and (c) output, which are defined as follows:

• Input: The input of the MLP that is used for the prediction of user j's local FL model is a vector w_{i*,μ}, which represents the local FL model of user i*. As mentioned in Sect. 3.2.3.1, user i* will always connect with the BS so as to provide the input information for the MLP to predict the local FL models of the other users.
• Output: The output of the MLP for the prediction of user j's local FL model is a vector o = w_{i*,μ} − w_{j,μ}, which represents the difference between user i*'s local FL model and user j's local FL model. Based on the prediction output o and user i*'s local FL model, we can obtain the local FL model of user j, i.e., ŵ_{j,μ} = w_{i*,μ} − o, with ŵ_{j,μ} being the predicted local FL model of user j.
• A single hidden layer: The hidden layer of an MLP allows it to learn nonlinear relationships between the input vector w_{i*,μ} and the output vector o. Mathematically, a single hidden layer consists of N neurons. The weight matrix that represents the connection strengths between the input vector and the neurons in the hidden layer is $\boldsymbol{v}_{\mathrm{in}} \in \mathbb{R}^{N \times W}$. Meanwhile, the weight matrix that captures the strengths of the connections between the neurons in the hidden layer and the output vector is $\boldsymbol{v}_{\mathrm{out}} \in \mathbb{R}^{W \times N}$.

Given the components of the MLP, next, we introduce the use of the MLP to predict each user's local FL model. The states of the neurons in the hidden layer are

$$\boldsymbol{\vartheta} = \sigma\left(\boldsymbol{v}_{\mathrm{in}} \boldsymbol{w}_{i^*,\mu} + \boldsymbol{b}_\vartheta\right), \tag{3.45}$$

where $\sigma(x) = \frac{2}{1+\exp(-2x)} - 1$ and $\boldsymbol{b}_\vartheta \in \mathbb{R}^{N \times 1}$ is the bias. Given the neuron states, we can obtain the output of the MLP as follows:

$$\boldsymbol{o} = \boldsymbol{v}_{\mathrm{out}} \boldsymbol{\vartheta} + \boldsymbol{b}_o, \tag{3.46}$$


where $\boldsymbol{b}_o \in \mathbb{R}^{W \times 1}$ is a vector of biases. Based on (3.46), we can calculate the predicted local FL model of each user j at each iteration μ, i.e., ŵ_{j,μ} = w_{i*,μ} − o. To enable the MLP to predict each user's local FL model, the MLP must be trained by the online gradient descent method [116]. Given the prediction of the users' FL models, the update of the global FL model can be rewritten as

$$\boldsymbol{g}\left(\boldsymbol{a}_\mu\right) = \frac{\sum\limits_{i=1}^{U} K_i a_{i,\mu} \boldsymbol{w}_{i,\mu} + \sum\limits_{i=1}^{U} K_i \left(1 - a_{i,\mu}\right) \hat{\boldsymbol{w}}_{i,\mu} \mathbb{1}_{\{E_{i,\mu} \le \gamma\}}}{\sum\limits_{i=1}^{U} K_i a_{i,\mu} + \sum\limits_{i=1}^{U} K_i \left(1 - a_{i,\mu}\right) \mathbb{1}_{\{E_{i,\mu} \le \gamma\}}}, \tag{3.47}$$

where $\sum_{i=1}^{U} K_i a_{i,\mu} \boldsymbol{w}_{i,\mu}$ is the sum of the local FL models of the users that connect to the BS at iteration μ, $\sum_{i=1}^{U} K_i \left(1 - a_{i,\mu}\right) \hat{\boldsymbol{w}}_{i,\mu} \mathbb{1}_{\{E_{i,\mu} \le \gamma\}}$ is the sum of the predicted local FL models of the users that are not associated with the BS at iteration μ, $E_{i,\mu} = \frac{1}{2W}\left\|\hat{\boldsymbol{w}}_{i,\mu} - \boldsymbol{w}_{i,\mu}\right\|^2$ is the prediction error at iteration μ, and γ is the prediction requirement. In (3.47), when the prediction accuracy of the MLP cannot meet the prediction requirement (i.e., E_{i,μ} > γ), the BS will not use the prediction result for updating its global FL model. From (3.47), we can also observe that, using the MLP, the BS can include additional local FL models in the generation of the global FL model, so as to decrease the FL training loss and improve the FL convergence speed. Equation (3.47) is used to generate the global FL model in Step c. of the FL training procedure specified in Sect. 3.2. The proposed FL algorithm that jointly minimizes the FL convergence time and the FL training loss is shown in Algorithm 1. From Algorithm 1, we can see that the user selection and RB allocation are optimized at each FL iteration, and, hence, problem (3.39) is solved at each FL iteration.
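The prediction and aggregation steps in (3.45)–(3.47) can be summarized by the following minimal sketch; the function names are ours, and the prediction-error gate assumes the BS evaluates E_{i,μ} against the most recent models it has actually received:

```python
import numpy as np

def predict_local_model(w_star, v_in, v_out, b_h, b_o):
    """Predict a user's model from user i*'s model via (3.45)-(3.46)."""
    theta = np.tanh(v_in @ w_star + b_h)  # sigma(x) = 2/(1+e^{-2x}) - 1 = tanh(x)
    o = v_out @ theta + b_o               # MLP output, (3.46)
    return w_star - o                     # predicted model w_hat = w_star - o

def aggregate(w, w_hat, a, K, gamma):
    """Global model update of (3.47) with the prediction-error gate."""
    W = w.shape[1]
    E = 0.5 / W * np.sum((w_hat - w) ** 2, axis=1)   # E_{i,mu}
    use_pred = (1 - a) * (E <= gamma)                # gated predicted models
    num = (K * a) @ w + (K * use_pred) @ w_hat
    den = np.sum(K * a) + np.sum(K * use_pred)
    return num / den
```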

3.2.4 Simulation Results and Analysis

For the simulations, we consider a circular network area having a radius r = 500 m, with one BS at its center servicing U = 15 uniformly distributed users. The value of the inter-cell interference over each RB is randomly selected from [10⁻⁴, 0.01]. The FL algorithm is used for a classification task that identifies handwritten digits from 0 to 9. Here, each user trains an MLP using the MNIST dataset [117]. The sizes of the neuron weight matrices are 784 × 50 and 50 × 10. The BS also implements an MLP for each user to predict its local FL model parameters. The MLP is generated based on the MATLAB machine learning toolbox [118]. 1000 handwritten digits are used to test the trained FL algorithms. The other parameters used in the simulations are listed in Table 3.2.


Algorithm 1 FL convergence time minimization
Init: Local FL model of each user i, w_i; MLP model for the prediction of local FL models, v_in, v_out, b_ϑ, b_o; user i* that always connects to the BS.
1: for iteration μ do
2:   Each user i trains its local FL model to obtain w_{i,μ}.
3:   Each user i calculates the change of gradient of the local FL model, e_{i,μ}, and sends ‖e_{i,μ}‖ to the BS.
4:   The BS calculates p_{i,μ} using (3.41) and determines a_μ.
5:   The BS determines R_μ by solving (3.42).
6:   The selected users (a_{i,μ} = 1) transmit their local FL models to the BS based on R_μ and a_μ.
7:   The BS uses the MLP to estimate the local FL models of the users that are not associated with the BS (a_{i,μ} = 0).
8:   The BS calculates the global FL model g using (3.47).
9:   The BS uses the collected local FL models to train the MLP.
10: end for

Table 3.2 System parameters

  α = 2            N_0 = −174 dBm/Hz
  P = 1 W          B = 1 MHz
  R = 5            B^D = 20 MHz
  N = 5            P_B = 1 W
  N_out = 10       K_i = 500
  N_in = 784       γ = 0.01
  W = 5000         γ_R = 5

For comparison purposes, we use two baselines: (a) an FL algorithm that uses the proposed user association policy without the prediction of the users' local FL models at each given learning step, and (b) a standard FL algorithm [105] that randomly determines the user selection and resource allocation, without using MLPs to estimate the local FL model parameters of each user at each given learning step. The simulation results are averaged over a large number of independent runs.

In Fig. 3.7, we show how the FL identification accuracy changes as time elapses. From this figure, we can see that, as time elapses, the FL identification accuracy of all considered algorithms increases. This is because the local FL models and the global FL model are trained by the users and the BS as time elapses. From Fig. 3.7a, we can see that the proposed FL algorithm can reduce the number of iterations needed for convergence by up to 9% and 14% compared to baselines (a) and (b), respectively. The 9% gain stems from the fact that the proposed FL algorithm uses the MLP to estimate the local FL model parameters of the users that are not allocated any RBs for local FL model transmission at each given learning step. The 14% gain stems from the fact that the proposed FL algorithm uses the proposed probabilistic user selection scheme to select the users for local FL model transmission and uses the ANNs to estimate the local FL model parameters of the users that are not allocated any RBs for local FL model transmission at each given learning step.

Fig. 3.7 Identification accuracy as the number of iterations varies: (a) proposed algorithm vs. baselines (a) and (b); (b) proposed algorithm vs. proposed algorithm with full gradient descent


Fig. 3.8 Identification accuracy as the number of users varies: proposed algorithm vs. baselines (a) and (b)

From Fig. 3.7b, we can see that the proposed algorithm converges faster than the proposed algorithm with full gradient descent. However, the proposed algorithm with full gradient descent can achieve a better classification accuracy than the proposed algorithm. This is due to the fact that a full gradient descent method uses all training data samples to train the local FL models at each iteration, while the SGD method uses only a subset of the training data samples.

Figure 3.8 shows how the identification accuracy changes as the number of users varies. In this figure, we can see that, as the number of users increases, the FL identification accuracy of all considered algorithms increases. This is because, as the number of users increases, the number of data samples used for training FL increases. From Fig. 3.8, we can also see that, for a network with 20 users, the proposed FL algorithm can improve the identification accuracy by up to 1% and 3% compared to baselines (a) and (b), respectively. These gains stem from the fact that, in the proposed FL algorithm, a probabilistic user selection scheme is developed for user selection and local FL model transmission. Meanwhile, to include additional local FL models in the generation of the global FL model, at each given learning step, the proposed FL algorithm uses ANNs to estimate the local FL model parameters of the users that are not allocated any RBs for local FL model transmission, hence improving the classification accuracy. Figure 3.8 also shows that, as the number of users increases, the gap between the identification accuracy resulting from the proposed FL algorithm and the baselines increases.

Fig. 3.9 Identification accuracy as the number of training data samples per user varies: proposed algorithm vs. baselines (a) and (b)

This is because, for the considered baselines, as the number of users increases, the number of users that can transmit their local FL models to the BS remains the same due to the limited number of RBs. In contrast, the proposed FL algorithm can use the MLP to estimate the local FL models of the users that are not allocated any RBs for transmitting their local model parameters at each given learning step and, hence, include more local FL models in the generation of the global FL model.

Figure 3.9 shows how the handwritten digit identification accuracy changes as the number of training data samples varies. From Fig. 3.9, we can see that, as the number of training data samples increases from 200 to 800, the identification accuracy of all considered algorithms increases. This is because, as the number of training data samples increases, each user can use more data samples to train its local FL model, thus improving the identification accuracy of the FL algorithms. As the number of training data samples continues to increase, the identification accuracy of all considered algorithms improves slowly. This is because 800 training data samples may already capture all features of the MNIST dataset. Figure 3.9 also shows that, as the number of training data samples increases, the gap between the identification accuracy resulting from the proposed FL algorithm and baseline (a) decreases. This is due to the fact that, as the number of training data samples per user increases, each local FL model is trained on a dataset that contains all features of the MNIST dataset, and, hence, the BS can use fewer local FL models to generate the global FL model.


3.3 Resource Management for Energy Efficiency Optimization

Next, we introduce the joint optimization of (1) the bandwidth allocated to each device for FL parameter transmission, (2) the computational capacity of each device, (3) the transmit power of each device, (4) the number of local and global updates per FL iteration, and (5) the time each device spends transmitting its FL parameters to the BS, so as to minimize the energy consumption of FL training and FL parameter transmission.

3.3.1 Wireless FL Model

Each FL iteration consists of three steps: (1) local computation, in which each user calculates its local FL model using its local dataset and the received global FL model; (2) local FL parameter transmission for each user; and (3) FL parameter aggregation and broadcast at the BS.

3.3.1.1 Local Computation

Let f_k be the computation capacity of user k, measured by the number of CPU cycles per second. The time of user k training its local FL model is

$$\tau_k = \frac{I_k C_k D_k}{f_k}, \quad \forall k \in \mathcal{K}, \tag{3.48}$$

where C_k (cycles/sample) is the number of CPU cycles required for computing one sample of data at user k, and I_k is the number of local iterations at user k. According to Lemma 1 in [119], the energy consumption for computing a total of C_k D_k CPU cycles at user k is

$$E_{k1}^{C} = \kappa C_k D_k f_k^2, \tag{3.49}$$

where κ is the effective switched capacitance, which depends on the chip architecture. To compute the local FL model, user k needs to compute C_k D_k CPU cycles in each of its I_k local iterations, which means that the total computation energy at user k is

$$E_k^{C} = I_k E_{k1}^{C} = \kappa I_k C_k D_k f_k^2. \tag{3.50}$$

3.3.1.2 Wireless Transmission

Given the local computation model, we introduce the FL parameter transmission model. We assume that all users upload their local FL models to the BS via frequency-domain multiple access (FDMA). The achievable rate of user k is

$$r_k = b_k \log_2\left(1 + \frac{g_k p_k}{N_0 b_k}\right), \quad \forall k \in \mathcal{K}, \tag{3.51}$$

where b_k is the bandwidth allocated to user k, p_k is the average transmit power of user k, g_k is the channel gain between user k and the BS, and N_0 is the power spectral density of the Gaussian noise. Due to the limited bandwidth of the system, we have $\sum_{k=1}^{K} b_k \le B$, where B is the total bandwidth. In this step, user k needs to upload its local FL model to the BS. Since the dimensions of the local FL model are fixed for all users, the data size that each user needs to upload is constant and can be denoted by s. To upload data of size s within transmission time t_k, we must have t_k r_k ≥ s. To transmit data of size s within a time duration t_k, the wireless transmit energy of user k is E_k^T = t_k p_k.

3.3.1.3 Global FL Model Generation and Broadcast

In this step, the BS aggregates the received local FL models to generate a global FL model. Then, the BS broadcasts the global FL model to all users. The energy consumption of each user includes both the local computation energy E_k^C and the wireless transmission energy E_k^T. Denoting the number of global iterations by I_0, the total energy consumption of all users that participate in FL is

$$E = I_0 \sum_{k=1}^{K}\left(E_k^{C} + E_k^{T}\right). \tag{3.52}$$

Hereinafter, the total time needed for completing the execution of the FL algorithm is called the completion time. The completion time of each user includes the local computation time and the transmission time, as shown in Fig. 3.10. Based on (3.48), the completion time T_k of user k is

$$T_k = I_0\left(\tau_k + t_k\right) = I_0\left(\frac{I_k C_k D_k}{f_k} + t_k\right). \tag{3.53}$$

Let T be the maximum completion time for training the entire FL algorithm; then,

$$T_k \le T, \quad \forall k \in \mathcal{K}. \tag{3.54}$$
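The cost model of (3.48)–(3.54) reduces to a few array operations; the following sketch is ours (the value of κ is merely illustrative):

```python
import numpy as np

def fl_cost(I0, Ik, Ck, Dk, fk, tk, pk, kappa=1e-28):
    """Energy and completion time of (3.48)-(3.54); per-user arrays of length K."""
    tau = Ik * Ck * Dk / fk                 # local training time, (3.48)
    E_comp = kappa * Ik * Ck * Dk * fk**2   # computation energy E_k^C, (3.50)
    E_tx = tk * pk                          # transmit energy E_k^T
    E_total = I0 * np.sum(E_comp + E_tx)    # total energy, (3.52)
    T_k = I0 * (tau + tk)                   # per-user completion time, (3.53)
    return E_total, T_k.max()               # (3.54): max_k T_k must not exceed T
```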


Fig. 3.10 An implementation for the FL algorithm via FDMA

Each user is assumed to minimize the local FL training loss with a target accuracy η. Next, in Lemma 3.1, we derive a lower bound on the number of local iterations needed to achieve a local accuracy η.

Lemma 3.1 Let $v = \frac{2}{(2-L\delta)\delta\gamma}$. If the step size satisfies δ < 2/L, the number of local iterations needed to achieve a local accuracy η is lower bounded by $I_k \ge v \log_2(1/\eta)$.

Moreover, we always have $0 < t_k^{\min} \le t_k^{\max} < 1$ and $\lim_{\eta\to 0^+} \beta_k(\eta) = -\infty$. With the help of (3.66), problem (3.64) can be simplified as (3.62). Theorem 3.3 shows that it is optimal to transmit with the minimum time for each user. Based on this finding, problem (3.59) is equivalent to problem (3.62) with only one variable. Obviously, the objective function of (3.62) has a fractional form, which is generally hard to solve. By using the parametric approach in [120], we consider the following problem:

$$H(\zeta) = \min_{\eta^{\min} \le \eta \le \eta^{\max}} \alpha_1 \log_2(1/\eta) + \alpha_2 - \zeta(1-\eta). \tag{3.67}$$

It has been proved in [120] that solving (3.62) is equivalent to finding the root of the nonlinear function H(ζ). Since (3.67) with fixed ζ is convex, the optimal solution η* can be obtained by setting the first-order derivative to zero, yielding $\eta^* = \frac{\alpha_1}{(\ln 2)\zeta}$. Thus, problem (3.62) can be solved by using the Dinkelbach method in [120].
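A minimal sketch of the Dinkelbach iteration for (3.62)/(3.67) follows; the coefficient names, the initial ζ, and the stopping tolerance are our choices:

```python
import numpy as np

def dinkelbach(alpha1, alpha2, eta_min, eta_max, tol=1e-9, max_iter=100):
    """Minimize (alpha1*log2(1/eta) + alpha2) / (1 - eta) by driving H(zeta)
    of (3.67) to zero; the inner minimizer is eta* = alpha1 / (ln(2)*zeta)."""
    zeta = 1.0                                   # initial guess for the ratio
    for _ in range(max_iter):
        eta = np.clip(alpha1 / (np.log(2) * zeta), eta_min, eta_max)
        H = alpha1 * np.log2(1 / eta) + alpha2 - zeta * (1 - eta)
        if abs(H) < tol:                         # root of H(zeta) reached
            break
        zeta = (alpha1 * np.log2(1 / eta) + alpha2) / (1 - eta)  # update ratio
    return eta, zeta
```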

In the second step, given (t, η) calculated in the first step, problem (3.58) can be simplified as:

$$\min_{\boldsymbol{b},\boldsymbol{f},\boldsymbol{p}} \frac{a}{1-\eta}\sum_{k=1}^{K}\left(\kappa A_k \log_2(1/\eta) f_k^2 + t_k p_k\right) \tag{3.68}$$

$$\text{s.t.} \quad \frac{a}{1-\eta}\left(\frac{A_k \log_2(1/\eta)}{f_k} + t_k\right) \le T, \quad \forall k \in \mathcal{K}, \tag{3.68a}$$

$$t_k b_k \log_2\left(1 + \frac{g_k p_k}{N_0 b_k}\right) \ge s, \quad \forall k \in \mathcal{K}, \tag{3.68b}$$

$$\sum_{k=1}^{K} b_k \le B, \tag{3.68c}$$

$$0 \le p_k \le p_k^{\max}, \quad \forall k \in \mathcal{K}, \tag{3.68d}$$

$$0 \le f_k \le f_k^{\max}, \quad \forall k \in \mathcal{K}. \tag{3.68e}$$

Since both the objective function and the constraints can be decoupled, problem (3.68) can be decoupled into two subproblems:

$$\min_{\boldsymbol{f}} \frac{a\kappa \log_2(1/\eta)}{1-\eta}\sum_{k=1}^{K} A_k f_k^2 \tag{3.69}$$

$$\text{s.t.} \quad \frac{a}{1-\eta}\left(\frac{A_k \log_2(1/\eta)}{f_k} + t_k\right) \le T, \quad \forall k \in \mathcal{K}, \tag{3.69a}$$

$$0 \le f_k \le f_k^{\max}, \quad \forall k \in \mathcal{K}, \tag{3.69b}$$

and

$$\min_{\boldsymbol{b},\boldsymbol{p}} \frac{a}{1-\eta}\sum_{k=1}^{K} t_k p_k \tag{3.70}$$

$$\text{s.t.} \quad t_k b_k \log_2\left(1 + \frac{g_k p_k}{N_0 b_k}\right) \ge s, \quad \forall k \in \mathcal{K}, \tag{3.70a}$$

$$\sum_{k=1}^{K} b_k \le B, \tag{3.70b}$$

$$0 \le p_k \le p_k^{\max}, \quad \forall k \in \mathcal{K}. \tag{3.70c}$$

According to (3.69), it is always efficient to utilize the minimum computation capacity f_k. The optimal f_k* is therefore obtained when constraint (3.69a) holds with equality, which gives:

$$f_k^* = \frac{a A_k \log_2(1/\eta)}{T(1-\eta) - a t_k}, \quad \forall k \in \mathcal{K}. \tag{3.71}$$
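For completeness, (3.71) is a one-line computation once t and η are fixed; the array names are ours:

```python
import numpy as np

def optimal_freq(a, A, eta, T, t):
    """Minimum-energy CPU frequencies from (3.71); assumes T(1-eta) > a*t_k."""
    return a * A * np.log2(1.0 / eta) / (T * (1.0 - eta) - a * t)
```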

We solve problem (3.70) using the following theorem.

Theorem 3.4 The optimal solution (b*, p*) of problem (3.70) satisfies:

$$b_k^* = \max\{b_k(\mu), b_k^{\min}\}, \tag{3.72}$$

and

$$p_k^* = \frac{N_0 b_k^*}{g_k}\left(2^{\frac{s}{t_k b_k^*}} - 1\right), \tag{3.73}$$

where

$$b_k^{\min} = -\frac{(\ln 2)s}{t_k\left[W\left(-\frac{(\ln 2)N_0 s}{g_k p_k^{\max} t_k}\, e^{-\frac{(\ln 2)N_0 s}{g_k p_k^{\max} t_k}}\right) + \frac{(\ln 2)N_0 s}{g_k p_k^{\max} t_k}\right]}, \tag{3.74}$$

with W(·) denoting the Lambert-W function, b_k(μ) is the solution to

$$\frac{N_0 t_k}{g_k}\left(e^{\frac{(\ln 2)s}{t_k b_k(\mu)}}\left(1 - \frac{(\ln 2)s}{t_k b_k(\mu)}\right) - 1\right) + \mu = 0, \tag{3.75}$$

and μ satisfies

$$\sum_{k=1}^{K} \max\{b_k(\mu), b_k^{\min}\} = B. \tag{3.76}$$

Proof See Appendix D of [109]. ⨆⨅
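Since the left side of (3.76) is monotone in μ, a bisection on μ recovers the optimal bandwidth split of Theorem 3.4. The sketch below is ours: it solves (3.75) by an inner bisection, and all brackets and names are assumptions:

```python
import numpy as np

def b_of_mu(mu, N0, g, t, s, b_lo=1.0, b_hi=1e9, iters=60):
    """Solve (3.75) for b_k(mu) by bisection; the bracket is our choice."""
    c = np.log(2) * s / t                      # (ln 2) s / t_k
    def lhs(b):
        x = min(c / b, 700.0)                  # guard against exp overflow
        return N0 * t / g * (np.exp(x) * (1.0 - x) - 1.0) + mu
    for _ in range(iters):
        b_mid = 0.5 * (b_lo + b_hi)
        if lhs(b_mid) < 0.0:
            b_lo = b_mid                       # too little bandwidth: move up
        else:
            b_hi = b_mid
    return 0.5 * (b_lo + b_hi)

def bandwidth_allocation(N0, g, t, s, b_min, B, mu_lo=1e-15, mu_hi=1e6, iters=60):
    """Bisect the multiplier mu so that (3.76) holds, yielding b* of (3.72)."""
    for _ in range(iters):
        mu = np.sqrt(mu_lo * mu_hi)            # geometric midpoint for a wide range
        b = np.maximum([b_of_mu(mu, N0, gk, tk, s) for gk, tk in zip(g, t)], b_min)
        if np.sum(b) > B:
            mu_lo = mu                         # total exceeds B: raise the price
        else:
            mu_hi = mu
    return b
```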

By iteratively solving problem (3.59) and problem (3.68), the algorithm that solves problem (3.58) is given in Algorithm 3. Since the optimal solution of problem (3.59) or (3.68) is obtained in each step, the objective value of problem (3.58) is nonincreasing in each step. Moreover, the objective value of problem (3.58) is lower bounded by zero. Thus, Algorithm 3 always converges to a locally optimal solution.

Algorithm 3 Iterative algorithm
1: Initialize a feasible solution (t^(0), b^(0), f^(0), p^(0), η^(0)) of problem (3.58) and set l = 0.
2: repeat
3:   With given (b^(l), f^(l), p^(l)), obtain the optimal (t^(l+1), η^(l+1)) of problem (3.59).
4:   With given (t^(l+1), η^(l+1)), obtain the optimal (b^(l+1), f^(l+1), p^(l+1)) of problem (3.68).
5:   Set l = l + 1.
6: until the objective value (3.58a) converges

3.4 Conclusions

In this chapter, we have introduced the joint optimization of wireless resource allocation (i.e., the transmit power of each device, spectrum, and computational capacity) and FL parameters (i.e., the number of local and global updates) to minimize wireless FL performance metrics, including training loss, energy consumption, and convergence time. In particular, we have analyzed how wireless factors such as packet error rates and computational capacity affect FL convergence. Then, we have jointly optimized the wireless factors and FL parameters to improve FL performance. The FL convergence analysis and methods can be easily extended to analyze other wireless factors (e.g., quantization, channel fading) and FL parameters (e.g., data importance).

Chapter 4

Quantization for Federated Learning

In this chapter, we introduce the use of quantization theory to reduce the size of the FL parameters transmitted over wireless networks, thus reducing the FL parameter transmission time.

4.1 Universal Vector Quantization for Federated Learning

In this section, we introduce a universal vector quantization scheme for FL, designed around the unique settings of FL, so as to reduce the size of the FL parameters transmitted over wireless links while guaranteeing FL performance.

4.1.1 Wireless FL Model

We consider a common FL framework, introduced in Chap. 2, where a centralized server trains a model consisting of m parameters based on labeled samples available at a set of K remote users, in order to minimize some loss function l(·;·). Letting $\{\boldsymbol{x}_i^{(k)}, y_i^{(k)}\}_{i=1}^{n_k}$ be the set of n_k labeled training samples available at the kth user, $k \in \{1,\ldots,K\} \triangleq \mathcal{K}$, FL aims at recovering the m × 1 weights vector $\boldsymbol{w}_o$ satisfying

$$\boldsymbol{w}_o = \arg\min_{\boldsymbol{w}}\left\{F(\boldsymbol{w}) \triangleq \sum_{k=1}^{K} \alpha_k F_k(\boldsymbol{w})\right\}. \tag{4.1}$$

Here, the weighting average coefficients {α_k} are non-negative and satisfy $\sum_k \alpha_k = 1$, and the local objective functions are defined as the empirical average over the corresponding training set, i.e.,

Fig. 4.1 Federated learning with bit rate constraints

$$F_k(\boldsymbol{w}) \equiv F_k\left(\boldsymbol{w}; \{\boldsymbol{x}_i^{(k)}, y_i^{(k)}\}_{i=1}^{n_k}\right) \triangleq \frac{1}{n_k}\sum_{i=1}^{n_k} l\left(\boldsymbol{w}; (\boldsymbol{x}_i^{(k)}, y_i^{(k)})\right).$$

FedAvg [31] aims at recovering $\boldsymbol{w}_o$ using iterative subsequent updates. In each update of time instance t, the server shares its current model, represented by the vector $\boldsymbol{w}_t \in \mathbb{R}^m$, with the users. The kth user, k ∈ 𝒦, uses its set of n_k labeled training samples to retrain the model $\boldsymbol{w}_t$ over τ time instances into an updated model $\tilde{\boldsymbol{w}}_{t+\tau}^{(k)} \in \mathbb{R}^m$. Having updated the model weights, the kth user should convey its model update, denoted $\boldsymbol{h}_{t+\tau}^{(k)} \triangleq \tilde{\boldsymbol{w}}_{t+\tau}^{(k)} - \boldsymbol{w}_t$, to the server. Since uploading throughput is typically more limited than its downloading counterpart [121], the kth user needs to communicate a finite-bit quantized representation of its model update. Quantization consists of encoding the model update into a set of bits, and decoding each bit combination into a recovered model update [122]. The kth model update $\boldsymbol{h}_{t+\tau}^{(k)}$ is therefore encoded into a digital codeword of R_k bits, denoted $\boldsymbol{u}_{t+\tau}^{(k)} \in \{0,\ldots,2^{R_k}-1\} \triangleq \mathcal{U}_k$, using an encoding function whose input is $\boldsymbol{h}_{t+\tau}^{(k)}$, i.e.,

$$e_{t+\tau}^{(k)} : \mathbb{R}^m \mapsto \mathcal{U}_k. \tag{4.2}$$

The uplink channel is modeled as a bit-constrained link. In such channels, each R_k-bit codeword is recovered by the server without errors, representing, e.g., coded communications at rates below the channel capacity, where arbitrarily small error rates can be guaranteed by proper channel coding. The server uses the received codewords $\{\boldsymbol{u}_{t+\tau}^{(k)}\}_{k=1}^{K}$ to reconstruct $\hat{\boldsymbol{h}}_{t+\tau} \in \mathbb{R}^m$, obtained via a joint decoding function

$$d_{t+\tau} : \mathcal{U}_1 \times \ldots \times \mathcal{U}_K \mapsto \mathbb{R}^m. \tag{4.3}$$

The recovered $\hat{\boldsymbol{h}}_{t+\tau}$ is an estimate of the weighted average $\sum_{k=1}^{K} \alpha_k \boldsymbol{h}_{t+\tau}^{(k)}$. Finally, the global model $\boldsymbol{w}_{t+\tau}$ is updated via

$$\boldsymbol{w}_{t+\tau} = \boldsymbol{w}_t + \hat{\boldsymbol{h}}_{t+\tau}. \tag{4.4}$$

An illustration of this FL procedure is depicted in Fig. 4.1. Clearly, if the number of allowed bits is sufficiently large, the distance $\|\hat{\boldsymbol{h}}_{t+\tau} - \sum_{k=1}^{K}\alpha_k \boldsymbol{h}_{t+\tau}^{(k)}\|^2$ can be made arbitrarily small, allowing the server to update the global model as the desired weighted average, denoted $\boldsymbol{w}_{t+\tau}^{\mathrm{des}}$, via:

$$\boldsymbol{w}_{t+\tau}^{\mathrm{des}} = \sum_{k=1}^{K} \alpha_k \tilde{\boldsymbol{w}}_{t+\tau}^{(k)}. \tag{4.5}$$

In the presence of a limited bit budget, i.e., small values of {R_k}, distortion is induced, which can severely degrade the ability of the server to update its model. To tackle this issue, various methods have been proposed for quantizing the model updates, commonly based on sparsification or probabilistic scalar quantization. These approaches are suboptimal from a quantization theory perspective; namely, the gap between the distortion they achieve in quantizing a signal using a given number of bits and the most accurate achievable finite-bit representation, dictated by rate-distortion theory, can be further reduced by, e.g., using vector quantization [123, Ch. 23]. This motivates the study of efficient and practical quantization methods for FL.

4.1.2 Problem Formulation

Our goal is to propose an encoding-decoding system which mitigates the effect of quantization errors on the ability of the server to accurately recover the updated model (4.5). To faithfully represent the FL setup, we design our quantization strategy in light of the following requirements and assumptions:

A1 All users share the same encoding function, denoted as $e_t(\cdot) = e_t^{(k)}(\cdot)$ for each k ∈ 𝒦.
A2 No a-priori knowledge of the distribution of $\boldsymbol{h}_{t+\tau}^{(k)}$ is assumed.
A3 The users and the server share a source of common randomness. This is achieved by, e.g., letting the server share with each user a random seed along with the weights. Once a different seed is conveyed to each user, it can be used to obtain a dedicated source of common randomness, shared by the server and each of the users, for the entire FL procedure.

Requirement A2 gives rise to the need for a universal quantization approach, namely, a scheme which operates reliably regardless of the distribution of the model updates and without prior knowledge of it. In light of the above requirements, we introduce a universal vector quantization based FL in the following section.

4.1.3 Universal Vector Quantization Based FL

We now introduce an FL scheme that conveys the model updates $\{\boldsymbol{h}_{t+\tau}^{(k)}\}$ from the users to the server over the rate-constrained channel using a universal quantization method. Specifically, the scheme encodes each model update using subtractive dithered lattice quantization [124], which operates in the same manner for each user, satisfying Requirement A1. The designed FL allows the server to recover the updates with small average error regardless of the distribution of $\{\boldsymbol{h}_{t+\tau}^{(k)}\}$, as required by Requirement A2, by exploiting the source of common randomness assumed in Requirement A3. In addition to its compliance with the model requirements stated in Sect. 4.1.2, the designed approach is particularly suitable for FL, as the distortion is mitigated by federated averaging. This significantly improves the overall FL capabilities, as numerically demonstrated in Sect. 4.1.5.

First, we present the encoding and decoding functions, $e_{t+\tau}(\cdot)$ and $d_{t+\tau}(\cdot)$. Following Requirement A1, we utilize universal vector quantization, i.e., a quantization scheme which maps each set of continuous-amplitude values into a discrete representation in a manner which is ignorant of the underlying distribution. Common universal quantization methods are based on selection from an ensemble of source codes [125] or, alternatively, on subtractive dithering [126]. The latter is simpler to implement, being based on adding dither, i.e., noise, to the discretized quantity, but requires knowledge of the dither, as it is subtracted from the discrete quantity when parsing the quantized value. The source of common randomness assumed in A3 implies that the server and the users can generate the same realizations of a dither signal. We thus design an FL scheme based on dithered vector quantization, and particularly on lattice quantization, detailed in the following.

Let L be a fixed positive integer, referred to henceforth as the lattice dimension, and let G be a non-singular L × L matrix, which denotes the lattice generator matrix. For simplicity, we assume that $M \triangleq \frac{m}{L}$ is an integer, where m is the number of model parameters, although the scheme can also be applied when this does not hold by replacing M with ⌈M⌉. Next, we use 𝓛 to denote the lattice, which is the set of points in $\mathbb{R}^L$ that can be written as an integer linear combination of the columns of G, i.e., the set of all points $\boldsymbol{x} \in \mathbb{R}^L$ which can be written as $\boldsymbol{G}\boldsymbol{l}$ with $\boldsymbol{l}$ having integer entries:

$$\mathcal{L} \triangleq \{\boldsymbol{x} = \boldsymbol{G}\boldsymbol{l} : \boldsymbol{l} \in \mathbb{Z}^L\}. \tag{4.6}$$

A lattice quantizer $Q_\mathcal{L}(\cdot)$ maps each $\boldsymbol{x} \in \mathbb{R}^L$ to its nearest lattice point, i.e., $Q_\mathcal{L}(\boldsymbol{x}) = \boldsymbol{l}_x$, where $\boldsymbol{l}_x \in \mathcal{L}$ satisfies $\|\boldsymbol{x} - \boldsymbol{l}_x\| \le \|\boldsymbol{x} - \boldsymbol{l}\|$ for every $\boldsymbol{l} \in \mathcal{L}$. Finally, let $\mathcal{P}_0$ be the basic lattice cell [127], i.e., the set of points $\boldsymbol{x} \in \mathbb{R}^L$ which are closer to $\boldsymbol{0}$ than to any other lattice point:

$$\mathcal{P}_0 \triangleq \{\boldsymbol{x} \in \mathbb{R}^L : \|\boldsymbol{x}\| < \|\boldsymbol{x} - \boldsymbol{p}\|, \; \forall \boldsymbol{p} \in \mathcal{L}\setminus\{\boldsymbol{0}\}\}. \tag{4.7}$$
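For small L, the nearest-lattice-point map $Q_\mathcal{L}(\cdot)$ can be approximated (and computed exactly for rectangular lattices) by rounding in the generator's coordinates; this Babai-style sketch is our illustration, not the book's algorithm:

```python
import numpy as np

def lattice_quantize(x, G):
    """Approximate Q_L(x) = argmin_{l in L} ||x - G l|| by rounding the
    coordinates of x in the basis G (exact when G has orthogonal columns)."""
    l = np.round(np.linalg.solve(G, x))   # nearest integer coefficients
    return G @ l
```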

As $\mathcal{P}_0$ represents the set of points which are closer to the origin than to any other lattice point, its shape depends on the lattice $\mathcal{L}$, and in particular on the generator matrix G. For instance, when $\boldsymbol{G} = \Delta \cdot \boldsymbol{I}_L$ for some Δ > 0, then $\mathcal{L}$ is the square lattice, for which $\mathcal{P}_0$ is the set of vectors $\boldsymbol{x} \in \mathbb{R}^L$ whose $l_\infty$ norm is not larger than $\frac{\Delta}{2}$. In the two-dimensional case, such a generator matrix results in $\mathcal{P}_0$ being a square centered at the origin. For this setting, $Q_\mathcal{L}(\cdot)$ implements entry-wise scalar uniform quantization with spacing Δ [123, Ch. 23]. In general, the basic cell can take different shapes, such as hexagons for two-dimensional hexagonal lattices.

Using the above definitions in lattice quantization, we now present the encoding and decoding procedures of the designed FL, which are based on subtractive dithered lattice quantization (a code sketch of the complete encoder-decoder round trip is given after Step D4 below):

Encoder The designed encoding function $e_{t+\tau}(\cdot)$ implements dithered lattice quantization in four stages. It first normalizes the model update and partitions it into sub-vectors of the lattice dimension, where the normalization is used to prevent overloading the finite lattice. Then, each vector is dithered before it is quantized, so that the resulting distortion term is not deterministically determined by the model updates and is thus reduced by averaging. The quantized representation is compressed in a lossless manner using entropy coding to further reduce its volume without inducing additional distortion. These steps are detailed in the following:

E1 Normalize and partition: The kth user scales $\boldsymbol{h}_{t+\tau}^{(k)}$ by $\zeta\|\boldsymbol{h}_{t+\tau}^{(k)}\|$, for some ζ > 0, and divides the result into M distinct L × 1 vectors, denoted $\{\bar{\boldsymbol{h}}_i^{(k)}\}_{i=1}^M$. The scalar quantity $\zeta\|\boldsymbol{h}_{t+\tau}^{(k)}\|$ is quantized separately from $\{\bar{\boldsymbol{h}}_i^{(k)}\}_{i=1}^M$ using some fine-resolution quantizer.
E2 Dithering: The encoder utilizes the source of common randomness, e.g., a shared seed, to generate the set of L × 1 dither vectors $\{\boldsymbol{z}_i^{(k)}\}_{i=1}^M$, which are randomized in an i.i.d. fashion, independently of $\boldsymbol{h}_{t+\tau}^{(k)}$, from a uniform distribution over $\mathcal{P}_0$.
E3 Quantization: The vectors $\{\bar{\boldsymbol{h}}_i^{(k)}\}_{i=1}^M$ are discretized by adding the dither vectors and applying lattice quantization, i.e., by computing $\{Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)})\}$.
E4 Entropy coding: The discrete values $\{Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)})\}$ are encoded into a digital codeword $\boldsymbol{u}_{t+\tau}^{(k)}$ in a lossless manner.

In order to utilize entropy coding in Step E4, the discretized $\{Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)})\}$ must take values in a finite set. This is achieved by the normalization in Step E1, which guarantees that the $\{\bar{\boldsymbol{h}}_i^{(k)}\}_{i=1}^M$ all reside inside the L-dimensional ball with radius $\zeta^{-1}$, in which the number of lattice points is not larger than $\frac{\pi^{L/2}}{\zeta^L\, \Gamma(1+L/2)\det(\boldsymbol{G})}$ [128, Ch. 2], where Γ(·) is the Gamma function. The overhead in accurately quantizing the single scalar quantity $\zeta\|\boldsymbol{h}^{(k)}\|$ is typically negligible compared to the number of bits required to convey the set of vectors $\{\bar{\boldsymbol{h}}_i^{(k)}\}_{i=1}^M$, hardly affecting the overall quantization rate.


Fig. 4.2 Subtractive dithered lattice quantization illustration

Decoder The decoding mapping $d_{t+\tau}(\cdot)$ is comprised of four stages. The purpose of the first three steps is to invert the encoding procedure by decoding the lossless entropy code used in E4, subtracting the dither added in E2, and reforming the full model update vector from the partitioned sub-vectors generated in E1. The final stage uses the recovered model updates to compute the aggregated global model. These stages are detailed in the following:

D1 Entropy decoding: The server first decodes each digital codeword $\boldsymbol{u}_{t+\tau}^{(k)}$ into the discrete values $\{Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)})\}$. Since the encoding is carried out using a lossless source code, the discrete values are recovered without any errors.
D2 Dither subtraction: Using the source of common randomness, the server generates the dither vectors $\{\boldsymbol{z}_i^{(k)}\}$, which can be carried out rapidly and at low complexity using random number generators, as the dither vectors obey a uniform distribution. The server then subtracts the corresponding vector from each lattice point, i.e., computes $\{Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)}) - \boldsymbol{z}_i^{(k)}\}$. An illustration of the subtractive dithered lattice quantization procedure is given in Fig. 4.2.
D3 Collecting and scaling: The values $\{Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)}) - \boldsymbol{z}_i^{(k)}\}$ are collected into an m × 1 vector $\hat{\boldsymbol{h}}_{t+\tau}^{(k)}$ using the inverse operation of the partitioning and normalization in Step E1.
D4 Model recovery: The recovered model updates are combined into an updated model based on (4.4). Namely,


Fig. 4.3 Encoding-decoding block diagram of the designed FL

$$\boldsymbol{w}_{t+\tau} = \boldsymbol{w}_t + \sum_{k=1}^{K} \alpha_k \hat{\boldsymbol{h}}_{t+\tau}^{(k)}. \tag{4.8}$$
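The following sketch implements the round trip E1–E3/D2–D3 for the scalar-lattice special case $\boldsymbol{G} = \Delta\boldsymbol{I}_L$; entropy coding (E4/D1) is omitted, the shared seed stands in for the common randomness of A3, and the function names are ours:

```python
import numpy as np

def uveqfed_encode(h, L, Delta, zeta, seed):
    """Steps E1-E3 for the square lattice G = Delta * I_L (m divisible by L)."""
    scale = zeta * np.linalg.norm(h)                    # E1: normalization factor
    hbar = (h / scale).reshape(-1, L)                   # E1: M sub-vectors of dim L
    rng = np.random.default_rng(seed)                   # A3: shared randomness
    z = rng.uniform(-Delta / 2, Delta / 2, hbar.shape)  # E2: dither over P_0
    q = Delta * np.round((hbar + z) / Delta)            # E3: lattice quantization
    return q, scale                                     # E4 (entropy coding) omitted

def uveqfed_decode(q, scale, L, Delta, seed):
    """Steps D2-D3: regenerate the dither, subtract it, and rescale."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-Delta / 2, Delta / 2, q.shape)     # same dither as the encoder
    return (q - z).reshape(-1) * scale                  # recovered update h-hat

# D4: the server then aggregates the recovered updates as in (4.8):
# w_new = w + sum(alpha_k * h_hat_k over the users k)
```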

A block diagram of the proposed scheme is depicted in Fig. 4.3. The usage of subtractive dithered lattice quantization in Steps E2–E3 and D2 allows obtaining a digital representation which is relatively close to the true quantity, as illustrated in Fig. 4.2, without relying on prior knowledge of its distribution. The joint decoding aspect of the proposed scheme is introduced in the final model recovery Step D4. The remaining encoding-decoding procedure, i.e., Steps E1–D3, is carried out independently for each user.

Discussion The designed FL has several clear advantages. While it is based on information-theoretic arguments, the resulting architecture is rather simple to implement. Both subtractive dithered quantization and entropy coding are concrete and established methods which can be realized with relatively low complexity and feasible hardware requirements. In particular, the main novel aspect of the designed FL, i.e., the usage of subtractive dithered lattice quantization, first requires generating the dither signal via, e.g., the methods discussed in [129] for randomizing uniformly distributed random vectors. Then, the encoder carries out a lattice projection of each sub-vector which, for finite and small L, involves a complexity term that grows only linearly with the number of parameters m. The resulting additional complexity is of the same order as that of previous quantized FL strategies, e.g., QSGD [65], which also uses entropy coding, and is typically dominated by the computational burden involved in training a deep model with m parameters. The source of common randomness needed for generating the dither vectors can be obtained by sharing a common seed between the server and the users, as also assumed in [105]. The statistical characterization of the quantization error of such quantizers does not depend on the distribution of the model updates. This analytical tractability allows us to rigorously show that its combination with federated averaging mitigates the quantization error in Sect. 4.1.4. A similar approach was also used in the analysis of probabilistic quantization schemes for average consensus problems [130].


The encoding Steps E1–E3 can be viewed as a generalization of probabilistic scalar quantizers, used in, e.g., QSGD [65]. When the lattice dimension is L = 1 and ζ = 1, Steps E1–E3 implement the same encoding as QSGD. However, the decoder is not the same as in QSGD, due to the dither subtraction in Step D2, which is known to reduce the distortion and to yield an error term that does not depend on the model updates [131]. Furthermore, the designed FL allows using vector quantizers, i.e., setting L > 1, which is known to further improve the quantization accuracy [127]. Specifically, the usage of vector quantizers allows the designed FL to combine dimensionality reduction methods with quantization schemes by jointly mapping sets of samples into discrete representations. The gains of subtracting the dither at the decoder and of using vector quantizers over scalar ones are numerically demonstrated in our experimental study in Sect. 4.1.5.

The usage of lossless source coding in Steps E4 and D1 allows exploiting the typically non-uniform distribution of the quantizer outputs. A similar approach was also used in QSGD [65], where Elias codes were utilized. Since Steps E4 and D1 involve multiple encoders and a single decoder, improved compression can be achieved by utilizing distributed source coding methods, e.g., Slepian-Wolf coding [132, Ch. 15.4]. In such cases, the server decodes the received codewords $\{\boldsymbol{u}_{t+\tau}^{(k)}\}$ into $\{Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)})\}$ in a joint manner, instead of decoding each $Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)})$ from its corresponding $\boldsymbol{u}_{t+\tau}^{(k)}$ separately. Similarly, the distributed nature of FL can be exploited to optimize the reconstruction fidelity for a given bit budget using Wyner-Ziv coding [133]. However, such distributed coding schemes typically require a-priori knowledge of the joint distribution of $\{\bar{\boldsymbol{h}}_i^{(k)}\}$ and utilize different encoding mappings for each user, thus not meeting Requirements A1–A2.

Finally, we note that the FL performance is affected by the selection of the lattice $\mathcal{L}$ and the coefficient ζ. In general, lattices of higher dimensions typically result in more accurate representations, at the cost of increased complexity. Methods for designing the lattice generator matrix G can be found in [134]. The coefficient ζ should be set to allow the usage of a limited number of lattice points, which translates into fewer bits, without concentrating the resulting vectors such that they become indistinguishable after quantization. For example, using ζ = 1 results in most quantized values being mapped to zero, as also observed in [65]. A reasonable setting is $\zeta = 3\sqrt{\frac{1}{M}}$, resulting in $\zeta\|\boldsymbol{h}^{(k)}\|$ approaching 3 times the standard deviation of the quantized vectors when they are zero-mean and i.i.d., and thus assuring that they reside inside the unit L-ball with probability of over 88% by Chebyshev's inequality [135].

4.1.4 Performance Analysis

Next, we analyze the performance of the designed FL, characterizing its distortion and studying its convergence properties. We consider the conventional local SGD training method, detailed in Sect. 4.1.4.1, and characterize the resulting distortion of the designed FL and the convergence of the global model in Sects. 4.1.4.2 and 4.1.4.3, respectively.

4.1.4.1 Local SGD

Local SGD is arguably the most common training method used for FedAvg. Here, each user updates the weights using τ SGD iterations before sending the updated model to the server for aggregation. Let $\boldsymbol{\epsilon}_t^{(k)}$ denote the error induced in quantizing the model update $\boldsymbol{h}_t^{(k)}$, and let $i_t^{(k)}$ be the sample index chosen uniformly from the local data of the kth user at time t. By defining the gradient computed at a single sample of index i as $\nabla F_k^i(\boldsymbol{w}) \triangleq \nabla l\left(\boldsymbol{w}; (\boldsymbol{x}_i^{(k)}, y_i^{(k)})\right)$, the local weights at the kth user, denoted $\tilde{\boldsymbol{w}}_t^{(k)}$, are updated via:

$$\tilde{\boldsymbol{w}}_{t+1}^{(k)} = \begin{cases} \tilde{\boldsymbol{w}}_t^{(k)} - \eta_t \nabla F_k^{i_t^{(k)}}\left(\tilde{\boldsymbol{w}}_t^{(k)}\right), & t+1 \notin \mathcal{T}_\tau,\\[1.5ex] \sum\limits_{k'=1}^{K} \alpha_{k'}\left(\tilde{\boldsymbol{w}}_t^{(k')} - \eta_t \nabla F_{k'}^{i_t^{(k')}}\left(\tilde{\boldsymbol{w}}_t^{(k')}\right) + \boldsymbol{\epsilon}_{t+1}^{(k')}\right), & t+1 \in \mathcal{T}_\tau, \end{cases} \tag{4.9}$$

where $\eta_t$ is the learning rate at time instance t, and $\mathcal{T}_\tau$ is the set of positive integer multiples of τ. We focus on the case in which the users compute a single stochastic gradient in each time instance. Hence, the performance in terms of convergence rate can be further improved by using mini-batches, i.e., replacing the random index $i_t^{(k)}$ with a set of random indices. The fact that the model updates are quantized when conveyed to the server is encapsulated in the per-user model update quantization error $\boldsymbol{\epsilon}_t^{(k)}$.
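A compact sketch of the update rule (4.9), reusing the hypothetical uveqfed_encode/uveqfed_decode helpers from Sect. 4.1.3, follows; the user interface (sample_grad) and all names are our assumptions:

```python
import numpy as np

def local_sgd_round(w_global, users, tau, eta, L, Delta, zeta, alphas):
    """One FedAvg round of (4.9): tau local SGD steps, then quantized averaging.

    users: objects with a .sample_grad(w) method returning a stochastic
           gradient (assumed interface, not from the book).
    """
    h_hat = []
    for k, user in enumerate(users):
        w = w_global.copy()
        for _ in range(tau):                   # t+1 not in T_tau: plain SGD steps
            w -= eta * user.sample_grad(w)
        q, scale = uveqfed_encode(w - w_global, L, Delta, zeta, seed=k)
        h_hat.append(uveqfed_decode(q, scale, L, Delta, seed=k))
    # t+1 in T_tau: the server aggregates the quantized updates as in (4.8)/(4.9)
    return w_global + sum(a * h for a, h in zip(alphas, h_hat))
```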

4.1.4.2 Quantization Error Bound

The need to represent the model updates $\boldsymbol{h}_{t+\tau}^{(k)}$ using a finite number of bits inherently induces some distortion, i.e., the recovered vector is $\hat{\boldsymbol{h}}_{t+\tau}^{(k)} = \boldsymbol{h}_{t+\tau}^{(k)} + \boldsymbol{\epsilon}_{t+\tau}^{(k)}$. The error in representing $\zeta\|\boldsymbol{h}_{t+\tau}^{(k)}\|$ is assumed to be negligible. For example, the normalized quantization error is of the order of $10^{-7}$ for 12-bit quantization of a scalar value, and decreases exponentially with each additional bit [123, Ch. 23]. Letting $\bar{\sigma}_L^2$ be the normalized second-order lattice moment, defined as $\bar{\sigma}_L^2 \triangleq \int_{\mathcal{P}_0}\|\boldsymbol{x}\|^2 d\boldsymbol{x} \,/ \int_{\mathcal{P}_0} d\boldsymbol{x}$ [136], the moments of the quantization error satisfy the following:

Theorem 4.1 The quantization error vector $\boldsymbol{\epsilon}_{t+\tau}^{(k)}$ has zero-mean entries and satisfies

$$\mathbb{E}\left\{\left\|\boldsymbol{\epsilon}_{t+\tau}^{(k)}\right\|^2 \,\Big|\, \boldsymbol{h}_{t+\tau}^{(k)}\right\} = \zeta^2\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|^2 M \bar{\sigma}_L^2. \tag{4.10}$$


Proof To prove the theorem, we note that, by decoding Step D3, the error vector $\boldsymbol{\epsilon}_{t+\tau}^{(k)}$, scaled by $\zeta\|\boldsymbol{h}_{t+\tau}^{(k)}\|$, consists of M vectors $\{\bar{\boldsymbol{\epsilon}}_i^{(k)}\}$. Each $\bar{\boldsymbol{\epsilon}}_i^{(k)}$ is an L × 1 vector representing the ith subtractive dithered quantization error, defined as $\bar{\boldsymbol{\epsilon}}_i^{(k)} \triangleq Q_\mathcal{L}(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)}) - \boldsymbol{z}_i^{(k)} - \bar{\boldsymbol{h}}_i^{(k)}$. The fact that we have used subtractive dithered quantization, via encoding Steps E2–E3 and decoding Step D2, implies that, regardless of the statistical model of $\{\bar{\boldsymbol{h}}_i^{(k)}\}$, the quantization error vectors $\{\bar{\boldsymbol{\epsilon}}_i^{(k)}\}$ are zero-mean, i.i.d. (over both i and k), and uniformly distributed over $\mathcal{P}_0$ [127]. Consequently,

$$\mathbb{E}\left\{\left\|\boldsymbol{\epsilon}_{t+\tau}^{(k)}\right\|^2 \,\Big|\, \boldsymbol{h}_{t+\tau}^{(k)}\right\} = \zeta^2\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|^2 \sum_{i=1}^{M} \mathbb{E}\left\{\left\|\bar{\boldsymbol{\epsilon}}_i^{(k)}\right\|^2\right\} = \zeta^2\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|^2 M \bar{\sigma}_L^2,$$

thus proving the theorem. ⨆⨅
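As a quick sanity check of (4.10) in the scalar-lattice case, the error moment can be estimated empirically with the sketch from Sect. 4.1.3; the Monte-Carlo size, parameters, and the square-lattice moment formula used below are our assumptions:

```python
import numpy as np

# For G = Delta*I_L, the second moment of P_0 is L*Delta^2/12 per cell,
# i.e., the familiar Delta^2/12 when L = 1.
L, Delta, m = 1, 0.1, 512
zeta = 3 / np.sqrt(m)
errs = []
for seed in range(200):
    h = np.random.default_rng(seed).normal(size=m)
    q, scale = uveqfed_encode(h, L, Delta, zeta, seed)
    h_hat = uveqfed_decode(q, scale, L, Delta, seed)
    errs.append(np.sum((h_hat - h) ** 2) / np.sum(h ** 2))
M = m // L
print(np.mean(errs), zeta**2 * M * L * Delta**2 / 12)  # empirical vs. (4.10)
```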

Theorem 4.1 characterizes the distortion in quantizing the model updates using UVeQFed. Unlike the corresponding characterizations of previous quantizers used in FL, which obtained an upper bound on the quantization error, e.g., [65, Lem. 1], the dependence of the expected error norm on the number of bits is not explicit in (4.10), but rather encapsulated in the lattice moment $\bar{\sigma}_L^2$. To observe that (4.10) indeed represents a lower distortion compared to previous FL quantization schemes, we note that even when scalar quantizers are used, i.e., L = 1, for which $\frac{1}{L}\bar{\sigma}_L^2$ is known to be largest [136], the resulting quantization error is reduced by a factor of 2 compared to conventional probabilistic scalar quantizers, such as QSGD, due to the subtraction of the dither upon decoding in Step D2 [131, Thms. 1–2].

The model updates are recovered in order to update the global model via $\boldsymbol{w}_{t+\tau} = \sum_k \alpha_k \tilde{\boldsymbol{w}}_{t+\tau}^{(k)}$ at the server. We next show that the statistical characterization of the distortion in Theorem 4.1 contributes to the accuracy in recovering the desired $\boldsymbol{w}_{t+\tau}^{\mathrm{des}}$ of (4.5) via $\boldsymbol{w}_{t+\tau}$. To that aim, we introduce the following assumption on the stochastic gradients:

AS1 The expected squared $l_2$ norm of the random vector $\nabla F_k^i(\boldsymbol{w})$, representing the stochastic gradient evaluated at $\boldsymbol{w}$, is bounded by some $\xi_k^2 > 0$ for all $\boldsymbol{w} \in \mathbb{R}^m$.

We can now bound the distance between the desired model $\boldsymbol{w}_{t+\tau}^{\mathrm{des}}$ and the recovered one, $\boldsymbol{w}_{t+\tau}$, as stated in the following theorem:

Theorem 4.2 When AS1 holds, the mean-squared distance between $\boldsymbol{w}_{t+\tau}$ and $\boldsymbol{w}_{t+\tau}^{\mathrm{des}}$ satisfies

$$\mathbb{E}\left\{\left\|\boldsymbol{w}_{t+\tau} - \boldsymbol{w}_{t+\tau}^{\mathrm{des}}\right\|^2\right\} \le M\zeta^2\bar{\sigma}_L^2\, \tau \left(\sum_{t'=t}^{t+\tau-1}\eta_{t'}^2\right)\sum_{k=1}^{K}\alpha_k^2\xi_k^2. \tag{4.11}$$


Proof To prove the theorem, we first express the updated global model as the sum of the desired global model and the quantization noise. Then, we show that the distance between $\boldsymbol{w}_{t+\tau}$ and $\boldsymbol{w}_{t+\tau}^{\mathrm{des}}$ can be bounded via (4.11) due to the statistical properties of the subtractive dithered quantization error [127]. To formulate this distance, we use $\{\bar{\boldsymbol{w}}_{t,i}\}_{i=1}^M$, $\{\bar{\boldsymbol{w}}_{t+\tau,i}\}_{i=1}^M$, and $\{\bar{\boldsymbol{w}}_{t+\tau,i}^{\mathrm{des}}\}_{i=1}^M$ to denote the partitions of $\boldsymbol{w}_t$, $\boldsymbol{w}_{t+\tau}$, and $\boldsymbol{w}_{t+\tau}^{\mathrm{des}}$, respectively, into M distinct L × 1 vectors, as done in Step E1. From the decoding and model recovery Steps D3–D4, it follows that

$$\bar{\boldsymbol{w}}_{t+\tau,i} = \bar{\boldsymbol{w}}_{t,i} + \sum_{k=1}^{K}\alpha_k\zeta\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|\left(Q_\mathcal{L}\left(\bar{\boldsymbol{h}}_i^{(k)} + \boldsymbol{z}_i^{(k)}\right) - \boldsymbol{z}_i^{(k)}\right) = \bar{\boldsymbol{w}}_{t,i} + \sum_{k=1}^{K}\alpha_k\zeta\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|\bar{\boldsymbol{h}}_i^{(k)} + \sum_{k=1}^{K}\alpha_k\zeta\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|\bar{\boldsymbol{\epsilon}}_i^{(k)}, \tag{4.12}$$

where $\bar{\boldsymbol{\epsilon}}_i^{(k)}$ is the subtractive dithered quantization error. Now, since $\boldsymbol{h}_{t+\tau}^{(k)} = \tilde{\boldsymbol{w}}_{t+\tau}^{(k)} - \boldsymbol{w}_t$, combining this with (4.5) and the fact that $\sum_{k=1}^{K}\alpha_k = 1$, it holds that $\bar{\boldsymbol{w}}_{t,i} + \sum_{k=1}^{K}\alpha_k\zeta\|\boldsymbol{h}_{t+\tau}^{(k)}\|\bar{\boldsymbol{h}}_i^{(k)} = \bar{\boldsymbol{w}}_{t+\tau,i}^{\mathrm{des}}$. Substituting this into (4.12) yields

$$\bar{\boldsymbol{w}}_{t+\tau,i} - \bar{\boldsymbol{w}}_{t+\tau,i}^{\mathrm{des}} = \sum_{k=1}^{K}\alpha_k\zeta\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|\bar{\boldsymbol{\epsilon}}_i^{(k)}. \tag{4.13}$$

Since the $\{\bar{\boldsymbol{\epsilon}}_i^{(k)}\}$ are zero-mean, i.i.d. (over both i and k), and independent of $\boldsymbol{h}_{t+\tau}^{(k)}$, it consequently holds, by the law of total expectation, that

$$\mathbb{E}\left\{\left\|\boldsymbol{w}_{t+\tau} - \boldsymbol{w}_{t+\tau}^{\mathrm{des}}\right\|^2\right\} = \mathbb{E}\left\{\sum_{i=1}^{M}\left\|\sum_{k=1}^{K}\alpha_k\zeta\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|\bar{\boldsymbol{\epsilon}}_i^{(k)}\right\|^2\right\} \stackrel{(a)}{=} \mathbb{E}\left\{\mathbb{E}\left\{\sum_{i=1}^{M}\left\|\sum_{k=1}^{K}\alpha_k\zeta\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|\bar{\boldsymbol{\epsilon}}_i^{(k)}\right\|^2 \,\Big|\, \boldsymbol{h}_{t+\tau}^{(k)}\right\}\right\} \stackrel{(b)}{=} \mathbb{E}\left\{M\sum_{k=1}^{K}\alpha_k^2\zeta^2\bar{\sigma}_L^2\left\|\boldsymbol{h}_{t+\tau}^{(k)}\right\|^2\right\}, \tag{4.14}$$

where (a) follows from the law of total expectation, and (b) holds by (4.10). Next, we note that, by (4.9), the model update $\boldsymbol{h}_{t+\tau}^{(k)} = \tilde{\boldsymbol{w}}_{t+\tau}^{(k)} - \boldsymbol{w}_t$ can be written as the sum of the stochastic gradients, $\boldsymbol{h}_{t+\tau}^{(k)} = \sum_{t'=t}^{t+\tau-1}\eta_{t'}\nabla F_k^{i_{t'}^{(k)}}\left(\tilde{\boldsymbol{w}}_{t'}^{(k)}\right)$. Since the


indices $\{i_t^{(k)}\}$ are i.i.d. over t and k, applying the law of total expectation to (4.14) yields

$$\begin{aligned}
\mathbb{E}\left\{\left\|\boldsymbol{w}_{t+\tau} - \boldsymbol{w}_{t+\tau}^{\mathrm{des}}\right\|^2\right\} &= \mathbb{E}\left\{M\zeta^2\bar{\sigma}_L^2\sum_{k=1}^{K}\alpha_k^2\,\mathbb{E}\left\{\left\|\sum_{t'=t}^{t+\tau-1}\eta_{t'}\nabla F_k^{i_{t'}^{(k)}}\left(\tilde{\boldsymbol{w}}_{t'}^{(k)}\right)\right\|^2 \,\Big|\, \{\tilde{\boldsymbol{w}}_{t'}^{(k)}\}\right\}\right\}\\
&\stackrel{(a)}{\le} \mathbb{E}\left\{M\zeta^2\bar{\sigma}_L^2\sum_{k=1}^{K}\alpha_k^2\,\tau\sum_{t'=t}^{t+\tau-1}\eta_{t'}^2\,\mathbb{E}\left\{\left\|\nabla F_k^{i_{t'}^{(k)}}\left(\tilde{\boldsymbol{w}}_{t'}^{(k)}\right)\right\|^2 \,\Big|\, \{\tilde{\boldsymbol{w}}_{t'}^{(k)}\}\right\}\right\}\\
&\stackrel{(b)}{\le} M\zeta^2\bar{\sigma}_L^2\,\tau\left(\sum_{t'=t}^{t+\tau-1}\eta_{t'}^2\right)\sum_{k=1}^{K}\alpha_k^2\xi_k^2,
\end{aligned} \tag{4.15}$$

where in (a) we used the inequality $\left\|\sum_{t'=t+1-\tau}^{t+1}\boldsymbol{r}_{t'}\right\|^2 \le \tau\sum_{t'=t+1-\tau}^{t+1}\left\|\boldsymbol{r}_{t'}\right\|^2$, which holds for any multivariate sequence $\{\boldsymbol{r}_t\}$; and (b) holds since the uniform distribution of the random index $i_k$ implies that the expected value of the stochastic gradient is the full gradient, i.e., $\mathbb{E}\{\nabla F_k^{i_t^{(k)}}(\boldsymbol{w})\} = \nabla F_k(\boldsymbol{w})$, and consequently $\mathbb{E}\{\|\nabla F_k^{i_t^{(k)}}(\tilde{\boldsymbol{w}}_t^{(k)}) - \nabla F_k(\tilde{\boldsymbol{w}}_t^{(k)})\|^2\} \le \mathbb{E}\{\|\nabla F_k^{i_t^{(k)}}(\tilde{\boldsymbol{w}}_t^{(k)})\|^2\} \le \xi_k^2$ by AS1. Equation (4.15) proves the theorem. ⨆⨅

4.1 Univrersal Vector Quantization for Federated Learning

4.1.4.3

69

FL Convergence Analysis

We next study the convergence of the designed FL. Our analysis is carried out under the following assumptions, commonly used in FL convergence studies [137]: AS2 The local objective functions .{Fk (·)} are all .ρs -smooth, namely, for all m .v 1 , v 2 ∈ R it holds that 1 Fk (v 1 ) − Fk (v 2 ) ≤ (v 1 − v 2 )T ∇Fk (v 2 ) + ρs ||v 1 − v 2 ||2 . 2

.

AS3 The local objective functions .{Fk (·)} are all .ρc -strongly convex, namely, for all .v 1 , v 2 ∈ Rm it holds that 1 Fk (v 1 ) − Fk (v 2 ) ≥ (v 1 − v 2 )T ∇Fk (v 2 ) + ρc ||v 1 − v 2 ||2 . 2

.

We do not restrict the labeled data of each of the users to be generated from an identical distribution, i.e., we consider a statistically heterogeneous scenario, thus faithfully representing FL setups [138]. Such heterogeneity is in line with Assumption A2, which does not impose any specific distribution structure on the underlying statistics of the training data. Following [137], we define the heterogeneity gap, ψ Δ F (w o ) −

K Σ

.

k=1

αk min Fk (w).

(4.16)

w

The value of .ψ quantifies the degree of heterogeneity. If the training data originates from the same distribution, then .ψ tends to zero as the training size grows. However, for heterogeneous data, its value is positive. The convergence of the designed FL is characterized in the following theorem: Theorem 4.3 Set .γ = τ max(1, 4ρs /ρc ) and consider the designed FL satisfyτ ing AS1–AS3. Under this setting, local SGD with step size .ηt = ρc (t+γ ) for each .t ∈ N satisfies E{F (wt )} − F (wo )

.



⎛ 2 ⎞ ρc + τ 2 b ρs , γ ||w 0 − wo ||2 , max 2(t + γ ) τρc

(4.17)

where K K ⎛ ⎞ Σ Σ 2 2 b Δ 1 + 4Mζ 2 σ¯ L αk2 ξk2 + 6ρs ψ + 8(τ − 1)2 αk ξk2 . τ

.

k=1

k=1

70

4 Quantization for Federated Learning

Proof See Appendix C of [139].

⨆ ⨅

Theorem 4.3 implies that the designed FL with local SGD converges at a rate of .O(1/t). The physical meaning of this asymptotic convergence rate is that as the number of iterations t progresses, the learned model converges to the optimal one with a difference decaying as .1/t. Specifically, the difference between the objective of the model learned in a federated manner and the optimal objective decays to zero at least as quickly as .1/t (up to some constant). Nonetheless, it is noted that the need to quantize the model updates yield an additive term in the coefficient b which grows with the number of parameter via M. This term adds to the linear dependence of b on the bound on the gradients norm .ξ 2 , which is expected to grow with the number of parameters, and also appears in the corresponding bounds for local SGD without quantization constraints. This implies that FL typically converges slower for larger models, i.e., the larger the dimensionality of the model updates which have to be quantized.

4.1.5 Numerical Evaluations Next, we numerically evaluate the designed FL. We first compare the quantization errors induced by the proposed quantization method. Then, we numerically demonstrate how the reduced distortion is translated in FL performance gains using both MNIST and CIFAR-10 datasets.1

4.1.5.1

Quantization Error

We begin by focusing only on the compression method, studying its accuracy using synthetic data. We evaluate the distortion induced by the designed quantization scheme operating √ with a two-dimensional hexagonial lattice, i.e., .L = 2 and .G = [2, 0; 1, 1/ 3] [140], as well as with scalar quantizers, namely, .L = 1 and 2+R/5 √ .G = 1. The normalization coefficient is set to .ζ = . The distortion of the M designed quantization scheme is compared to QSGD [65], as well as to uniform quantizers with random unitary rotation [105], and to subsampling by random masks followed by uniform three-bit quantizers [105], all operating with the same quantization rate, i.e., the same overall number of bits. Let .H be a .128×128 matrix with Gaussian i.i.d. entries, and let .Σ be a .128×128 matrix whose entries are given by .(Σ)i,j = e−0.2|i−j | , representing an exponentially decaying correlation.

1 The source code used in the numerical evaluations detailed in this section is available online at https://github.com/mzchen0/UVeQFed.

4.1 Univrersal Vector Quantization for Federated Learning Fig. 4.4 Quantization distortion, i.i.d. data Average squared error

10 1

71

UVeQFed, L=2 UVeQFed, L=1 QSGD Random rotation + uniform quantizers Subsampling with 3 bits quantizers

10 0

10 -1

10 -2

10 -3

1

1.5

2

2.5

3

3.5

4

4.5

5

4

4.5

5

Quantization rate

Fig. 4.5 Quantization distortion, correlated data Average squared error

10 1

10 0

10 -1

UVeQFed, L=2 UVeQFed, L=1 QSGD Random rotation + uniform quantizers Subsampling with 3 bits quantizers

10 -2

10 -3

1

1.5

2

2.5

3

3.5

Quantization rate

In Figs. 4.4 and 4.5 we depict the per-entry squared-error in quantizing .H and .ΣH Σ T , representing independent and correlated data, respectively, versus the quantization rate R, defined as the ratio of the number of bits to the number of entries of .H . The distortion is averaged over 100 independent realizations of .H . To meet the bit rate constraint when using lattice quantizers we scaled .G such that the resulting codewords use less than .1282 R bits. For the scalar quantizers and subsampling-based scheme, the rate determines the quantization resolution and the subsampling ratio, respectively. We observe in Figs. 4.4 and 4.5 that UVeQFed achieves a more accurate digital representation compared to previously proposed methods. It is also observed that UVeQFed with vector quantization, outperforms its scalar counterpart, and that the gain is more notable when the quantized entries are correlated. This demonstrates the improved accuracy of jointly encoding multiple samples via vector quantization as well as its ability to exploit statistical correlation in a universal manner by using

72

4 Quantization for Federated Learning

fixed lattice-based quantization regions which do not depend on the underlying distribution.

4.1.5.2

FL Convergence

Next, we demonstrate that the reduced distortion of the designed method also translates into FL performance gains. To that aim, we evaluate its application for training neural networks using the MNIST and CIFAR-10 data sets, and compare its performance to that achievable using previous quantization methods for FL. We first compare the accuracy of models trained using UVeQFed to those obtained using federated averaging combined with the quantization methods considered in Sect. 4.1.5.1, i.e., QSGD [65] and the schemes proposed in [105] of uniform quantizers with random rotation as well as random subsampling followed by three-bit uniform quantizers. To that aim, we train a fully-connected network with a single hidden layer of 50 neurons and an intermediate sigmoid activation for detecting handwritten digits based on the MNIST data set. Training is carried out using .K = 100 users, each has access to 500 training samples distributed in an i.i.d. fashion, such that each user has an identical number of images from each label. The users update their weights using gradient descent, where federated averaging is carried out on each iteration. The resulting accuracy versus the number of iterations of these quantized FL schemes compared to federated averaging without quantization is depicted in Figs. 4.6 and 4.7 for quantization rates .R = 2 and .R = 4, respectively. Observing Figs. 4.6 and 4.7, we note that UVeQFed with vector quantization, i.e., .L = 2, achieves the most rapid and accurate convergence among all considered schemes. In particular, for .R = 4, UVeQFed with .L = 2 achieves a convergence profile within a minor gap from federated averaging without quantization constraints. Among the previous schemes, QSGD demonstrates steady accuracy improvements, though it is still outperformed by UVeQFed with .L = 1, indicating that the reduced distortion achieved by using subtractive dithering is translated into improved trained models. The quantization methods proposed in [105] result in notable variations in the trained model accuracy and in slower convergence due to their increased error induced in quantization, as noted in Sect. 4.1.5.1. We next evaluate UVeQFed for both heterogeneous as well as i.i.d. distributions of the training data. Based on the results observed in Figs. 4.6 and 4.7 and to avoid cluttering, we compare UVeQFed only to QSGD and to the accuracy achieved using federated averaging without quantization. Here, we train neural classifiers for both the MNIST and the CIFAR-10 data sets, where for each data set we use both heterogeneous and i.i.d. division of the data. For MNIST, we again use a fully-connected network with a single hidden layer of 50 neurons and an intermediate sigmoid activation with gradient descent optimization. Each of the .K = 15 users has 1000 training samples. We consider the case where the samples are distributed sequentially among the users, i.e., the first user has the first 1000 samples in the data set, and so on, resulting in an

Fig. 4.6 Convergence profile, MNIST, $R = 2$, $K = 100$ (identification accuracy vs. number of iterations)

Fig. 4.7 Convergence profile, MNIST, $R = 4$, $K = 100$ (identification accuracy vs. number of iterations)

We also train using an i.i.d. data division, where the labels are uniformly distributed among the users. The resulting accuracy versus the number of iterations is depicted in Figs. 4.8 and 4.9 for quantization rates $R = 2$ and $R = 4$, respectively.

For CIFAR-10, we train a deep convolutional neural network that consists of three convolutional layers and two fully-connected layers. Here, we consider two methods for distributing the 50,000 training images of CIFAR-10 among the $K = 10$ users: an i.i.d. division, where each user has the same number of samples from each of the 10 labels, and a heterogeneous division, in which at least 25% of the samples of each user correspond to a single distinct label. Each user completes a single epoch of SGD with mini-batch size 60 before the models are aggregated.

Fig. 4.8 Convergence profile, MNIST, $R = 2$, $K = 15$

Fig. 4.9 Convergence profile, MNIST, $R = 4$, $K = 15$

The resulting accuracy versus the number of epochs is depicted in Figs. 4.10 and 4.11 for quantization rates $R = 2$ and $R = 4$, respectively.

We observe in Figs. 4.8, 4.9, 4.10, and 4.11 that UVeQFed with vector quantization, i.e., $L = 2$, converges to the most accurate model in all the considered scenarios. In fact, when training a deep convolutional network, for which the loss surface is extremely complex and non-convex, we observe in Figs. 4.10 and 4.11 that UVeQFed with $L = 2$ trained using i.i.d. data achieves improved accuracy over federated averaging without quantization. This follows from the fact that the stochastic nature of the quantization error in UVeQFed causes it to implement a noisy variant of local SGD, which is known to be capable of boosting convergence and avoiding local minima when training FL models with non-convex loss surfaces. The observed gains are more dominant for $R = 2$, implying that the use of UVeQFed with multi-dimensional lattices can notably improve performance over low-rate channels.

Fig. 4.10 Convergence profile, CIFAR-10, $R = 2$

Fig. 4.11 Convergence profile, CIFAR-10, $R = 4$

Particularly, we observe in Figs. 4.8, 4.9, 4.10, and 4.11 that similar gains of UVeQFed are noted for both i.i.d. and heterogeneous setups, while the heterogeneous division of the data degrades the accuracy of all considered schemes compared to the i.i.d. division. It is also observed that UVeQFed with scalar quantizers, i.e., $L = 1$, achieves improved convergence compared to QSGD in most considered setups, which stems from its reduced distortion. The results presented in this section demonstrate that the theoretical benefits of UVeQFed, which rigorously hold under AS1–AS3, translate into improved convergence when operating under rate constraints with non-synthetic data.


4.2 Variable Bitwidth Federated Learning

In this section, we introduce a variable bitwidth FL framework in which edge devices train and transmit quantized versions of their local FL model parameters to a coordinating server, which, in turn, aggregates them into a quantized global model and synchronizes the devices. To optimize the quantized FL algorithm, we introduce a model-based RL method that can estimate the FL training parameters and mathematically model the FL training process without continually interacting with the devices.

4.2.1 Variable Bitwidth FL Model

We consider a wireless network that consists of a set $\mathcal{M}$ of M devices connected to a parameter server. These devices aim to execute an FL algorithm for training a machine learning model. Each device m has $N_m$ training data samples, and each training data sample n consists of an input feature vector $x_{m,n} \in \mathbb{R}^{N_I \times 1}$ and (in the case of supervised learning) a corresponding label vector $y_{m,n} \in \mathbb{R}^{N_O \times 1}$. The objective of the server and the devices is to minimize the global loss function over all data samples, which is given by

$$F(g) = \min_{g} \frac{1}{N}\sum_{m=1}^{M}\sum_{n=1}^{N_m} f\left(g, x_{m,n}, y_{m,n}\right),  \qquad (4.18)$$

where $g \in \mathbb{R}^{Y \times 1}$ is a vector that captures the global FL model of dimension Y trained across the devices, with $N = \sum_{m=1}^{M} N_m$ being the total number of training data samples of all devices. $f\left(g, x_{m,n}, y_{m,n}\right)$ is a loss function (e.g., squared error) that measures the accuracy of the generated global FL model $g$ in building a relationship between the input vector $x_{m,n}$ and the output vector $y_{m,n}$.
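To make the objective in (4.18) concrete, the following minimal Python sketch evaluates the global loss as an average of per-sample losses pooled over all devices; the squared-error loss and the function names are illustrative assumptions, not the book's implementation.

```python
import numpy as np

def sample_loss(g, x, y):
    # Illustrative squared-error loss f(g, x, y) for a linear model g.
    return 0.5 * np.sum((x @ g - y) ** 2)

def global_loss(g, devices):
    # Global loss in (4.18): average of f over all N samples of all M devices.
    # `devices` is a list of (X_m, Y_m) arrays holding each device's N_m samples.
    total, n_samples = 0.0, 0
    for X_m, Y_m in devices:
        for x, y in zip(X_m, Y_m):
            total += sample_loss(g, x, y)
        n_samples += len(X_m)
    return total / n_samples
```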

4.2.1.1 Training Process of Bitwidth Federated Learning

In FL, the devices and the server iteratively exchange their model parameters to find the optimal global model $g$ that minimizes the global loss function in (4.18). However, due to limited computational and wireless resources, devices may not be able to train and transmit such large model parameter vectors (e.g., as in the case of deep learning). To reduce the computation and transmission delays, bitwidth federated learning was proposed. Compared to the widely studied FedAvg [31], the FL model parameters in bitwidth FL are quantized. The overall training process of bitwidth FL is as follows:


1. The server quantizes the initialized global learning model and broadcasts it to each device.
2. Each device calculates the training loss using the quantized global learning model and its collected data samples.
3. Based on the calculated training loss, the quantized learning model in each device is updated.
4. Each device quantizes its updated learning model.
5. The server selects a subset of devices for local FL model transmission.
6. The server aggregates the collected local FL models into a global FL model that will be transmitted to the devices.

Steps 2–6 are repeated until the optimal vector $g$ is found. From the training process, we see that, in bitwidth FL, each device uses a quantized FL model to calculate the training loss and gradient vectors during the training process. Therefore, the quantization scheme in bitwidth FL affects the resource requirements of both FL model training and transmission. This is significantly different from current quantization-based FL algorithms, which must recover the quantized FL model during the training process, thus introducing additional computational complexity and reducing training efficiency. Next, we describe the training process mathematically.

Calculation of Training Loss of Each Device We first introduce the calculation of each device's training loss for step 2. Without loss of generality, we assume that a neural network is being trained; the quantization method can be used with other machine learning approaches (such as support vector machines (SVM) [141]) as well. The weights of each device's local FL model are quantized into $\alpha_t$ bits. Through this, the full-precision neural network is transformed into a quantized neural network (QNN). When $\alpha_t = 1$, each QNN weight has two possible values, namely $-1$ or $+1$. A neural network whose weights take two possible values is called a binary neural network (BNN) [142]. Given the input vector $h_{m,t}^k$ and the weight vector $\hat{g}_t^k$ of the neurons in layer k, represented by $\alpha_t$ bits, the output of each layer k at iteration t is given by Young et al. [143]

$$h_{m,t}^{k+1} = \begin{cases} \sigma\left(h_{m,t}^{k} \odot \hat{g}_t^{k}\right), & \text{if } \alpha_t = 1, \\ \sigma\left(\displaystyle\sum_{i=0}^{\alpha_t-1}\sum_{j=0}^{\alpha_t-1} 2^{i+j}\left(h_{m,t}^{k,i} \odot \hat{g}_t^{k,j}\right)\right), & \text{if } \alpha_t > 1, \end{cases}  \qquad (4.19)$$

where $\sigma(\cdot)$ is the activation function and $\odot$ represents the inner product for vectors with bitwise operations. Given the outputs of all neuron layers $h_{m,t} = [h_{m,t}^1, \ldots, h_{m,t}^K]$, the cross-entropy loss function can be expressed based on the neurons in the output layer $h_{m,t}^K$ as

$$f\left(\hat{g}_t, x_{m,n}, y_{m,n}\right) = -y_{m,n}\log\left(h_{m,t}^K\right) + \left(1 - y_{m,n}\right)\log\left(1 - h_{m,t}^K\right),  \qquad (4.20)$$


where $\hat{g}_t = [\hat{g}_t^1, \ldots, \hat{g}_t^k, \ldots, \hat{g}_t^K]$ is the quantized global FL model.
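As a sanity check on the bit-plane arithmetic in (4.19), the sketch below (an assumption-laden illustration, not the book's code) decomposes two unsigned fixed-point vectors into bit planes and evaluates their inner product as $\sum_i\sum_j 2^{i+j}(h^i \odot g^j)$, which is what allows a QNN layer to be computed with bitwise operations.

```python
import numpy as np

def bit_planes(v: np.ndarray, alpha: int):
    # Decompose unsigned integers v (each < 2**alpha) into alpha bit planes,
    # least-significant bit first, so that v = sum_i 2**i * planes[i].
    return [(v >> i) & 1 for i in range(alpha)]

def bitplane_inner_product(h: np.ndarray, g: np.ndarray, alpha: int) -> int:
    # Inner product via (4.19): sum_{i,j} 2**(i+j) * (h_i . g_j), where each
    # h_i . g_j is a binary dot product (an XNOR/popcount in hardware).
    h_planes, g_planes = bit_planes(h, alpha), bit_planes(g, alpha)
    return sum((2 ** (i + j)) * int(hp @ gp)
               for i, hp in enumerate(h_planes)
               for j, gp in enumerate(g_planes))

# The decomposition is exact, e.g., for alpha = 4:
h = np.array([3, 7, 1]); g = np.array([2, 5, 6])
assert bitplane_inner_product(h, g, 4) == int(h @ g)
```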

FL Model Update A backward propagation (BP) algorithm based on stochastic gradient descent is used to update the parameters of the QNN. The update function is expressed as

$$w_{m,t+1} = \hat{g}_t - \lambda \sum_{n \in \mathcal{N}_{m,t}} \frac{\partial f\left(g, x_{m,n}, y_{m,n}\right)}{\partial g},  \qquad (4.21)$$

where $\lambda$ is the learning rate, $\mathcal{N}_{m,t}$ is the subset of training data samples (i.e., mini-batch) selected from device m's training dataset $\mathcal{N}_m$ at iteration t, $w_{m,t+1}$ is the updated local FL model of device m at iteration $t+1$, and

$$\frac{\partial f}{\partial g} = \frac{\partial f_{m,t}}{\partial \hat{g}_t} \times \frac{\partial \hat{g}_t}{\partial g_t} = \frac{\partial f_{m,t}}{\partial \hat{g}_t} \times \mathrm{Htanh}(g_t),  \qquad (4.22)$$

where $g_t$ represents the full-precision weights and $\mathrm{Htanh}(x) = \max(-1, \min(1, x))$ is used to approximate the derivative of the quantization function, which is not differentiable. From (4.21) and (4.22), we can see that the weights are updated with full-precision values, since the changes of the learning model at each update step are small.

FL Model Quantization at Device After each local FL model is updated, its full-precision weights must be quantized into $\alpha_t$ bits, which is given by

$$\hat{w}_{m,t}^{k,j}(\alpha_t) = Q\left(w_{m,t}^{k,j}, \alpha_t\right) = \begin{cases} \mathrm{sign}\left(w_{m,t}^{k,j}\right), & \text{if } \alpha_t = 1, \\ \dfrac{R\left(\left(2^{\alpha_t}-1\right) w_{m,t}^{k,j}\right)}{2^{\alpha_t}-1}, & \text{if } 1 < \alpha_t < V, \\ w_{m,t}^{k,j}, & \text{if } \alpha_t = V, \end{cases}  \qquad (4.23)$$

where V is the bitwidth of the full-precision weights, and $\mathrm{sign}(x) = 1$ if $x \ge 0$ and $\mathrm{sign}(x) = -1$ otherwise. $R(\cdot)$ is a rounding function with $R(x) = \lfloor x \rfloor$ if $x \le \frac{\lfloor x \rfloor + \lceil x \rceil}{2}$, and $R(x) = \lceil x \rceil$ otherwise. From (4.23), we see that when $\alpha_t = 1$ (i.e., the binary case), $\hat{w}_{m,t}^{k,j} = 1$ if $w_{m,t}^{k,j} > 0$ and $\hat{w}_{m,t}^{k,j} = -1$ otherwise, with $w_{m,t}^{k,j}$ and $\hat{w}_{m,t}^{k,j}$ being the j-th elements of $w_{m,t}^k$ and $\hat{w}_{m,t}^k$, respectively. For $1 < \alpha_t < V$, $w_{m,t}^k$ is quantized with increasing precision between $-1$ and $1$. Finally, when $\alpha_t = V$, there is no quantization.

FL Model Transmission and Aggregation Due to limited wireless bandwidth, the server may need to select a subset of devices to upload their local FL models for aggregation into the global model. Given the quantized local FL model $\hat{w}_{m,t}$ of each device m at each iteration t, the update of the global FL model at iteration t is

$$g_t(u_t, \alpha_t) = \sum_{m=1}^{M} \frac{u_{m,t} N_{m,t}}{\sum_{m'=1}^{M} u_{m',t} N_{m',t}}\, \hat{w}_{m,t}(\alpha_t),  \qquad (4.24)$$

where $\frac{u_{m,t} N_{m,t}}{\sum_{m'=1}^{M} u_{m',t} N_{m',t}}$ is the scaling update weight of $\hat{w}_{m,t}$, with $N_{m,t}$ being the number of data samples used to train $\hat{w}_{m,t}$ at device m. $g_t(u_t)$ is the global FL model at iteration t, and $u_t = [u_{1,t}, \ldots, u_{M,t}]$ is the device selection vector, with $u_{m,t} = 1$ indicating that device m uploads its quantized local FL model $\hat{w}_{m,t}$ to the server at iteration t, and $u_{m,t} = 0$ otherwise.

FL Model Quantization at the Server As the global FL model is aggregated from the collected local FL models, the server must quantize it to a low bitwidth so that it can be directly used to calculate the training loss at each device. This is given by Ji and Chen [144]

$$\hat{g}_t^k = Q\left(g_t^k, \alpha_t\right) = \begin{cases} \mathrm{sign}\left(g_t^k\right), & \text{if } \alpha_t = 1, \\ \dfrac{R\left(\left(2^{\alpha_t}-1\right) g_t^k\right)}{2^{\alpha_t}-1}, & \text{if } 1 < \alpha_t < V, \\ g_t^k, & \text{if } \alpha_t = V. \end{cases}  \qquad (4.25)$$
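As a sketch of how (4.23)–(4.25) fit together, the following Python snippet quantizes local models, aggregates the selected ones with the weights of (4.24), and re-quantizes the global model at the server; the helper names, the use of np.round to approximate $R(\cdot)$, and the assumption that weights lie in $[-1, 1]$ are illustrative.

```python
import numpy as np

def quantize(w: np.ndarray, alpha: int, V: int = 32) -> np.ndarray:
    # Bitwidth quantizer Q(w, alpha) from (4.23)/(4.25).
    if alpha == 1:
        return np.where(w >= 0, 1.0, -1.0)       # binary case: sign only
    if alpha >= V:
        return w                                  # full precision: no quantization
    levels = 2 ** alpha - 1
    w = np.clip(w, -1.0, 1.0)                     # weights assumed in [-1, 1]
    return np.round(levels * w) / levels          # uniform grid; np.round ~ R(.)

def aggregate(local_models, n_samples, selected, alpha, V=32):
    # Weighted aggregation (4.24) over selected devices, followed by
    # server-side quantization (4.25) of the resulting global model.
    weights = np.array([u * n for u, n in zip(selected, n_samples)], dtype=float)
    weights /= weights.sum()
    g = sum(wt * quantize(w, alpha, V) for wt, w in zip(weights, local_models))
    return quantize(g, alpha, V)
```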

4.2.1.2 Training Delay of Bitwidth Federated Learning

We next study the training delay of bitwidth FL. From the training steps, we can see that the delay consists of four components: (a) the time used to calculate the training loss, (b) the FL model update delay, (c) the FL model quantization delay, and (d) the FL model transmission delay. The FL model update delay is unrelated to the number of quantization bits $\alpha_t$, since the models are updated with full-precision values. Thus, component (b) is constant with respect to our methodology and can be ignored. The remaining delay components are specified as follows:

Time Used to Calculate the Training Loss The time used to calculate the training loss depends on the number of multiplication operations in (4.19) and (4.20). From (4.19), we can see that the computational complexity of each multiplication operation is related to the number of bits $\alpha_t$ used to represent each element of the FL model vector. Specifically, given $\alpha_t$, the time used to calculate the training loss is

$$l_{m,t}^{C}(\alpha_t) = \rho\, \frac{\alpha_t^2 N^C}{\vartheta f},  \qquad (4.26)$$

where $\rho$ is a time consumption coefficient that depends on the chip of each device and $N^C$ is the number of multiplication operations in the neural network, while f and $\vartheta$


represent the frequency of the central processing unit (CPU) and the number of bits that can be processed by the CPU in one clock cycle, respectively.

FL Model Quantization Delay Since the updated local FL model is in full precision, each device must quantize its updated local FL model using (4.23) to reduce the transmission delay. Given $\alpha_t$, the quantization delay can be represented as

$$l_{m,t}^{Q}(\alpha_t) = \begin{cases} 0, & \text{if } \alpha_t = 1 \text{ or } \alpha_t = V, \\ \dfrac{D}{\vartheta f}, & \text{if } 1 < \alpha_t < V, \end{cases}  \qquad (4.27)$$

where D is the number of neurons in the neural network. In (4.27), when $\alpha_t = 1$ or $\alpha_t = V$, the quantization delay is 0. When $\alpha_t = 1$, the value of each quantized weight in $\hat{w}_{m,t}$ can be directly decided by the sign bit. When $\alpha_t = V$, no quantization takes place since we are dealing with full-precision weights, i.e., $\hat{w}_{m,t} = w_{m,t}$. When $1 < \alpha_t < V$, the quantization delay increases with the number of neurons in the neural network, since for each neuron the device arithmetically performs the rounding, multiplication, and division operations in (4.23).

FL Model Transmission Delay To generate the global FL model aggregated from the quantized local FL models, each device must transmit $\hat{w}_{m,t}$ to the server. To this end, we adopt an OFDMA transmission scheme for quantized local FL model transmission. In particular, the server can allocate a set $\mathcal{U}$ of U uplink orthogonal RBs to the devices for quantized weight transmission. Let W be the bandwidth of each RB and P be the transmit power of each device. The uplink channel capacity between device m and the server over each RB i is

$$c_{m,t}\left(u_{m,t}\right) = u_{m,t} W \log_2\left(1 + \frac{P h_{m,t}}{\sigma_N^2}\right),$$

where $u_{m,t} \in \{0, 1\}$ is the user association index, $h_{m,t}$ is the channel gain between device m and the server, and $\sigma_N^2$ represents the variance of the additive white Gaussian noise. Then, the uplink transmission delay between device m and the server is $l_{m,t}^{T}\left(u_{m,t}, \alpha_t\right) = \frac{D_{\alpha_t}}{c_{m,t}\left(u_{m,t}\right)}$, where $D_{\alpha_t}$ is the data size of the quantized FL parameters $\hat{w}_{m,t}$.

Since the server has sufficient computational resources and transmit power, we do not consider the delay of global FL model quantization and downlink transmission. Thus, the time that the devices and the server require to jointly complete the update of their respective local and global FL models at iteration t is

$$l_t(u_t, \alpha_t) = \max_{m\in\mathcal{M}}\ u_{m,t}\left(l_{m,t}^{C}(\alpha_t) + l_{m,t}^{Q}(\alpha_t) + l_{m,t}^{T}\left(u_{m,t}, \alpha_t\right)\right).  \qquad (4.28)$$

Here, $u_{m,t} = 0$ implies that device m will not send its quantized local FL model to the server, and thus does not cause any delay.
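The per-iteration delay in (4.26)–(4.28) can be sketched as below; the parameter names are illustrative, and $D_{\alpha_t}$ is approximated as the number of model parameters times $\alpha_t$ bits.

```python
import math

def iteration_delay(alpha, selected, h_gains, n_mults, n_neurons, n_params,
                    rho, f, theta, W, P, sigma2, V=32):
    # Per-iteration delay (4.28): the max over the selected devices of the
    # loss-computation delay (4.26), the quantization delay (4.27), and the
    # OFDMA uplink transmission delay.
    delays = []
    for u, h in zip(selected, h_gains):
        if u == 0:
            continue                                   # unselected devices add no delay
        l_comp = rho * alpha ** 2 * n_mults / (theta * f)              # (4.26)
        l_quant = 0.0 if alpha in (1, V) else n_neurons / (theta * f)  # (4.27)
        rate = W * math.log2(1 + P * h / sigma2)       # RB capacity c_{m,t}
        l_tx = n_params * alpha / rate                 # D_alpha / c_{m,t}
        delays.append(l_comp + l_quant + l_tx)
    return max(delays, default=0.0)
```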


4.2.2 Problem Formulation

The goal is to minimize the FL training loss while meeting a delay requirement on the completion of each FL iteration. This minimization problem involves jointly optimizing the device selection scheme and the quantization scheme, and is formulated as follows:

$$\min_{U,\,\alpha}\ F\left(g(u_T, \alpha)\right),  \qquad (4.29)$$

$$\text{s.t.}\quad u_{m,t} \in \{0, 1\},\ \alpha_t \in [0, V],\ \alpha_t \in \mathbb{N}^{+},\ \forall m \in \mathcal{M},\ \forall t \in \mathcal{T},  \qquad (4.29a)$$

$$\sum_{m=1}^{M} u_{m,t} \le U,\ \forall m \in \mathcal{M},\ \forall t \in \mathcal{T},  \qquad (4.29b)$$

$$l_t(u_t, \alpha_t) \le \Gamma,\ \forall m \in \mathcal{M},\ \forall t \in \mathcal{T},  \qquad (4.29c)$$

where $U = [u_1, \ldots, u_t, \ldots, u_T]$ is the device selection matrix over all iterations, with $u_t = [u_{1,t}, \ldots, u_{M,t}]$ being the user association vector at iteration t, $\alpha = [\alpha_1, \ldots, \alpha_t, \ldots, \alpha_T]$ is the quantization precision vector of all devices over all iterations, and $\mathcal{T} = \{1, \ldots, T\}$ is the training period. $\Gamma$ is the delay constraint for completing FL training per iteration, and T is a large constant that ensures the convergence of FL; in other words, the number of iterations that FL needs to converge will be less than T. Constraint (4.29a) indicates that each device can quantize its local FL model and can occupy at most one RB for FL model transmission. Constraint (4.29b) ensures that the server can select at most U devices for FL model transmission per iteration. Constraint (4.29c) limits the FL training delay per iteration.

The problem in (4.29) is challenging to solve with conventional optimization algorithms, for the following reasons. First, as the central controller, the server must select a subset of devices from which to collect quantized local FL models for aggregating the global FL model. However, each local FL model depends on the characteristics of the generating device's local dataset. Without such information about the datasets, the server cannot determine the optimal device selection and quantization scheme for minimizing the FL training loss. Second, as the stochastic gradient descent method is adopted to generate each local FL model, the relationship between the training loss and the device selection as well as the quantization scheme cannot be captured by the server via conventional optimization algorithms. This is because stochastic gradient descent lets each device randomly select a subset of data samples from its local dataset for local FL model training, and hence the server cannot directly optimize the training loss of each device.

To tackle these challenges, we introduce a model-based RL algorithm that enables the server to capture the relationship between the FL training loss and


the chosen device selection and quantization scheme. Based on this relationship, the server can proactively determine $u_t$ and $\alpha_t$ so as to minimize the FL training loss.

4.2.3 Optimization Methodology

In this section, a model based RL approach for optimizing the device selection scheme $U$ and the quantization scheme $\alpha$ in (4.29) is proposed. Compared to traditional model-free RL approaches, which continuously interact with edge devices to learn the device selection and quantization schemes, model based RL approaches enable the server to mathematically model the FL training process and thus find the optimal device selection and quantization scheme based on the learned state transition probability matrix. Next, we first introduce the components of the proposed model based RL method; here, a linear regression method is used to learn the dynamic environment model of the RL approach. Then, we explain the process of using the proposed model based RL method to find the globally optimal $U$ and $\alpha$. Finally, the convergence and complexity of the proposed RL method are analyzed.

4.2.3.1 Components of Model Based RL Method

The proposed model based RL method consists of six components: (a) agent, (b) action, (c) states, (d) state transition probability, (e) reward, and (f) policy, which are specified as follows:

• Agent: The agent that performs the proposed model based RL algorithm is the server. In particular, at each iteration, the server must select a suitable subset of devices to transmit their local FL models and determine the number of bits used to represent each element of the FL model matrix.
• Action: An action of the server is $a_t = [u_t, \alpha_t] \in \mathcal{A}$, consisting of the device selection scheme $u_t$ and the quantization scheme $\alpha_t$ of all devices at iteration t, with $\mathcal{A}$ being the discrete set of available actions.
• States: The state is $s_t = F(g_t) \in \mathcal{S}$, which measures the performance of the global FL model at iteration t, with $F(g_t)$ being the FL training loss and $\mathcal{S}$ being the set of available states.
• State Transition Probability: The state transition probability $P(s_{t+1}|s_t, a_t)$ denotes the probability of transiting from state $s_t$ to state $s_t'$ when action $a_t$ is taken, which is given by

$$P(s_{t+1}|s_t, a_t) = \Pr\{s_{t+1} = s_t'\,|\,s_t, a_t\}.  \qquad (4.30)$$

Here, we note that in model-free RL algorithms, the server does not know the values of the state transition probability matrix. In our work, however, we analyze the convergence of FL and estimate the FL training parameters in the FL


convergence analysis so as to calculate the state transition probabilities. Using the state transition probability matrix reduces the interactions between the server and the edge devices, thus improving the convergence speed of RL.
• Reward: Based on the current state $s_t$ and the selected action $a_t$, the reward function of the server is given by

$$r(s_t, a_t) = -F\left(g(u_t, \alpha_t)\right),  \qquad (4.31)$$

where $F\left(g(u_t, \alpha_t)\right)$ is the training loss at iteration t. Note that $r(s_t, a_t)$ increases as $F\left(g(u_t, \alpha_t)\right)$ decreases, which implies that maximizing the reward of the server minimizes the FL training loss.
• Policy: The policy is the probability of the agent choosing each action in a given state. The model based RL algorithm uses a deep neural network parameterized by $\theta$ to map the input state to the output action. The policy can then be expressed as $\pi_\theta(s_t, a_t) = P(a_t|s_t)$.

4.2.3.2 Calculation of State Transition Probability

In this section, we introduce the process of calculating the state transition probability, which is used to reduce the interactions between the server and the edge devices, thus improving the convergence speed of RL. To this end, we must analyze the relationship between $s_{t+1}$ and $(s_t, a_t)$. We make the same assumptions as in Sect. 3.1.3. Based on these assumptions, we first derive the upper bound of the improvement of the FL training loss over one FL training step under the non-i.i.d. setting. Then, we further analyze the relationship between the FL training loss improvement and the selected action (i.e., the relationship between $s_{t+1}$ and $s_t$ when $a_t$ is given). Based on the analytical result, we can calculate the state transition probability $P(s_{t+1}|s_t, a_t)$. To obtain the upper bound of the FL training loss improvement at one FL training step under the non-i.i.d. setting, we first define the degree of the non-i.i.d. data distribution.

Definition 4.1 The degree of non-i.i.d. in the global data distribution can be characterized by

$$\epsilon = \frac{\displaystyle\sum_{m=1}^{M}\sum_{n=1}^{N_m} u_{m,t}\, \epsilon_m\, N_{m,t}}{\displaystyle\sum_{m=1}^{M} N_{m,t}},  \qquad (4.32)$$

where $\epsilon_m = \nabla F(g_t) - \nabla \tilde{F}_m(g_t)$ is the difference between the data distribution of device m and the global data distribution. We also assume that $\left\|\nabla f(g_t, x_{mn}, y_{mn}) + \epsilon_m\right\|^2 \le \zeta_1 + \zeta_2\left\|\nabla F(g_t)\right\|^2 + B\epsilon^2$ for some positive B, with $F(g_t) = \frac{1}{N}\sum_{m=1}^{M}\sum_{n=1}^{N_m} f\left(g_t, x_{m,n}, y_{m,n}\right)$.

Using Definition 4.1, we derive the upper bound of the FL training loss improvement at one FL training step under the non-i.i.d. setting.

Lemma 4.1 The FL training loss improvement over one iteration (i.e., the gap between $\mathbb{E}\left(F(g_{t+1})\right)$ and $\mathbb{E}\left(F(g_t)\right)$) with a non-i.i.d. data distribution can be upper bounded as

$$\mathbb{E}\left(F(g_{t+1})\right) - \mathbb{E}\left(F(g_t)\right) \le \mathbb{E}\left(\left(\hat{g}_{t+1} - g_t\right)\left(\nabla F(g_t) - \epsilon\right)\right) + \frac{L}{2}\,\mathbb{E}\left(\left\|\hat{g}_{t+1} - g_t\right\|^2\right) + \frac{L}{2}\,\mathbb{E}\left(\left\|\hat{g}_{t+1} - g_{t+1}\right\|^2\right),  \qquad (4.33)$$

where $g_t$ and $\hat{g}_t$ are short for $g_t(u_t, \alpha_t)$ and $\hat{g}_t(u_t, \alpha_t)$, respectively, and $\mathbb{E}(\cdot)$ is the expectation with respect to the Rayleigh fading channel gain $h_{m,t}$ and the quantization error.

Proof This lemma is proved using the same method that is used to prove Theorem 3.1. □

From Lemma 4.1, we can see that the upper bound of the FL training loss improvement at one iteration depends on $\hat{g}_{t+1}(u_{t+1}, \alpha_{t+1}) - g_t(u_t, \alpha_t)$, which is determined by the device selection vector $u_t$ and the quantization scheme $\alpha_t$. To investigate how an action $a_t = [u_t, \alpha_t]$ affects the state transition in the considered bitwidth FL algorithm with non-i.i.d. data distribution, we derive the following theorem.

Theorem 4.4 Given the user selection vector $u_t$ and quantization scheme $\alpha_t$, the upper bound of $\mathbb{E}\left(F(g_{t+1})\right) - \mathbb{E}\left(F(g_t)\right)$ with a non-i.i.d. data distribution is

$$\mathbb{E}\left(F(g_{t+1})\right) - \mathbb{E}\left(F(g_t)\right) \le \left(-1 + \frac{1}{2L}\,\frac{4(N-A)^2\left(\mathbb{E}\left\|\Delta(\alpha_t)\right\| + 1\right)\zeta_2}{N^2}\right)\left\|\nabla F(g_t)\right\|^2 + \frac{\mathbb{E}\left\|\Delta(\alpha_t)\right\| + 1}{2L}\,\frac{4(N-A)^2\left(\zeta_1 + B\epsilon^2\right)}{N^2} + L^2\,\mathbb{E}\left\|\Delta(\alpha_t)\right\| + \mathbb{E}\left(\Delta(\alpha_t)^2\right),  \qquad (4.34)$$

where $A = \sum_{m=1}^{M} u_{m,t} N_{m,t}$ represents the total number of data samples that the selected devices use to train their local models, $\Delta(\alpha_t) = \hat{g}_t(\alpha_t) - g_t$ is the quantization error of the global FL model, which depends on the quantization scheme $\alpha_t$, and $\mathbb{E}\left\|\Delta(\alpha_t)\right\| = M 2^{-\alpha_t}$ follows from the unbiased quantization function defined in (4.23).

Proof See Appendix B of [145]. □

From Theorem 4.4, we can see that the relationship between $\mathbb{E}\left(F(g_{t+1})\right)$ and $\mathbb{E}\left(F(g_t)\right)$ (i.e., between $s_{t+1}$ and $s_t$) depends on the selected action $a_t$ as well as on the constants $1/L$, $\zeta_1$, $\zeta_2$, and $B\epsilon^2$. However, we do not know the values of $1/L$, $\zeta_1$, $\zeta_2$, and $B\epsilon^2$, since they are predefined in the assumptions. To find the tightest bound in (4.34), we must find the values of $1/L$, $\zeta_1$, $\zeta_2$, and $B\epsilon^2$ so as to build the relationship between $s_{t+1}$ and $s_t$ and calculate the state transition probability $P(s_{t+1}|s_t, a_t)$. To this end, a linear regression method [146] is used to determine the values of L, $\zeta_1$, $\zeta_2$, and $B\epsilon^2$, since the relationship between $\mathbb{E}\left(F(g_{t+1})\right) - \mathbb{E}\left(F(g_t)\right)$ and these constants is linear. The regression loss function is defined as

$$J\left(L, \zeta_1, \zeta_2, B\epsilon^2\right) = \frac{1}{I}\sum_{i=1}^{I}\left(\left(\mathbb{E}\left(F(g_{t+1})^{(i)}\right) - \mathbb{E}\left(F(g_t)^{(i)}\right)\right) - K\left(L, \zeta_1, \zeta_2, B\epsilon^2\,\middle|\,F(g_t)^{(i)}, a_t^{(i)}, F(g_{t+1})^{(i)}\right)\right)^2,  \qquad (4.35)$$

where I is the number of real interactions between the server and the edge devices used to estimate $1/L$, $\zeta_1$, $\zeta_2$, and $B\epsilon^2$. $K\left(L, \zeta_1, \zeta_2, B\epsilon^2\,\middle|\,F(g_t)^{(i)}, a_t^{(i)}, F(g_{t+1})^{(i)}\right)$ is the upper bound of the FL training loss improvement at one FL training step. $b^{(i)} = \left(F(g_t)^{(i)}, a_t^{(i)}, F(g_{t+1})^{(i)}\right)$ is a recorded triple consisting of the FL training losses and the selected action observed by the server and the devices; the $b^{(i)}$ are used to estimate the values of $1/L$, $\zeta_1$, $\zeta_2$, and $B\epsilon^2$. Specifically, given $\mathcal{B} = \left\{b^{(0)}, \ldots, b^{(i)}, \ldots, b^{(I)}\right\}$, L, $\zeta_1$, $\zeta_2$, and $B\epsilon^2$ are updated using a standard gradient descent method

$$L = L - \iota_L \frac{\partial J(L, \zeta_1, \zeta_2, B')}{\partial L}, \qquad \zeta_1 = \zeta_1 - \iota_{\zeta_1} \frac{\partial J(L, \zeta_1, \zeta_2, B')}{\partial \zeta_1},$$
$$\zeta_2 = \zeta_2 - \iota_{\zeta_2} \frac{\partial J(L, \zeta_1, \zeta_2, B')}{\partial \zeta_2}, \qquad B' = B' - \iota_{B'} \frac{\partial J(L, \zeta_1, \zeta_2, B')}{\partial B'},  \qquad (4.36)$$

where $B' = B\epsilon^2$, and $\iota_L$, $\iota_{\zeta_1}$, $\iota_{\zeta_2}$, and $\iota_{B'}$ are the learning rates for the parameters L, $\zeta_1$, $\zeta_2$, and $B'$. Given the values of L, $\zeta_1$, $\zeta_2$, and $B\epsilon^2$, the gap between $\mathbb{E}\left(F(g_{t+1})\right)$ and $\mathbb{E}\left(F(g_t)\right)$ can be estimated according to the derived upper bound. Based on the definition


of the state, the state transition probability $P(s_{t+1}|s_t, a_t)$ is given by

$$P(s_{t+1}|s_t, a_t) = \begin{cases} 1, & \text{if } s_{t+1} = s_t + K\left(L, \zeta_1, \zeta_2, B\epsilon^2\,\middle|\,F(g_t)^{(i)}, a_t^{(i)}, F(g_{t+1})^{(i)}\right), \\ 0, & \text{otherwise}. \end{cases}  \qquad (4.37)$$
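To illustrate the estimation step in (4.35)–(4.36), this sketch fits the constants by gradient descent on the squared gap between the observed loss improvements and the bound; the `features` matrix is a hypothetical stand-in for the linear coefficients of the bound extracted from (4.34) for each recorded interaction.

```python
import numpy as np

def fit_bound_params(observed_gaps, features, iters=1000, lr=0.02):
    # Linear regression (4.35)-(4.36): the bound K is linear in the unknowns
    # theta = (1/L, zeta1, zeta2, B*eps^2); `features` holds the corresponding
    # coefficients derived from (4.34) for each recorded interaction b^(i).
    theta = np.zeros(features.shape[1])
    for _ in range(iters):
        pred = features @ theta                      # K(theta | b^(i))
        grad = 2 * features.T @ (pred - observed_gaps) / len(observed_gaps)
        theta -= lr * grad                           # gradient step as in (4.36)
    return theta
```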

4.2.3.3 Optimization of Device Selection and Quantization Scheme

Given the state transition probability $P(s_{t+1}|s_t, a_t)$, we next introduce the optimization of $\pi_\theta$ so as to find the optimal device selection scheme $u_t$ and quantization scheme $\alpha_t$. Optimizing $\pi_\theta$ to minimize the FL training loss corresponds to maximizing

$$\mathcal{L}(\theta) = \sum_{(s_t, a_t)\in\tau} P(s_0) \prod_{t=1}^{T} \pi_\theta(s_{t-1}, a_t) P(s_t|s_{t-1}, a_t) \sum_{t=1}^{T} r(s_t, a_t),  \qquad (4.38)$$

where $\tau = \{s_0, a_0, \ldots, s_T, a_T\}$ is the trajectory replay buffer. Given (4.38), the optimization of the policy network $\theta$ is

$$\max_{\theta}\ \mathcal{L}(\theta).  \qquad (4.39)$$

We update $\pi_\theta$ using a standard gradient ascent method,

$$\theta = \theta + \iota \nabla_\theta \mathcal{L}(\theta),  \qquad (4.40)$$

where $\iota$ is the learning rate and the policy gradient is

$$\nabla_\theta \mathcal{L}(\theta) = \sum_{(s_t, a_t)\in\tau} P(s_0) \prod_{t=1}^{T} \pi_\theta(s_{t-1}, a_t) P(s_t|s_{t-1}, a_t) \sum_{t=1}^{T} r(s_t, a_t)\, \nabla\log\pi_\theta(s_t, a_t) = \frac{1}{T}\sum_{t=1}^{T} r(s_t, a_t)\, \nabla\log\pi_\theta(s_t, a_t).  \qquad (4.41)$$
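A compact sketch of the policy update in (4.39)–(4.41), using a linear softmax policy over the discrete action set; this parameterization is an illustrative assumption (the book uses a deep neural network for $\pi_\theta$).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(theta, trajectory, lr=0.02):
    # trajectory: list of (state_features, action_index, reward) tuples.
    # (4.41): grad = (1/T) * sum_t r(s_t, a_t) * grad log pi_theta(s_t, a_t),
    # applied as the ascent step theta <- theta + lr * grad from (4.40).
    grad = np.zeros_like(theta)
    T = len(trajectory)
    for s, a, r in trajectory:
        probs = softmax(theta @ s)        # pi_theta(a | s) from linear scores
        dlog = -np.outer(probs, s)        # d log pi / d theta, all actions
        dlog[a] += s                      # indicator term for the taken action
        grad += r * dlog
    return theta + lr * grad / T
```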

4.2.4 Numerical Evaluation

For our simulations, we consider a circular network area with a radius of $r = 1500$ m, with one server at its center serving $M = 15$ uniformly distributed devices.


Table 4.1 Simulation parameters

  $M = 15$               $U = 6$                     $N_m = 200$
  $W = 15$ kHz           $P = 0.5$ W                 $\sigma_N^2 = -174$ dBm
  $K = 8$                $I = 20$                    $\Gamma = 1$ s
  $T = 1000$             $f = 3.3$ GHz               $B = 64$
  $D = 217{,}728$        $\rho = 2.8 \times 10^6$    $\iota = 0.02$
  $\iota_L = 0.02$       $\iota_{\zeta_1} = 0.02$    $\iota_{\zeta_2} = 0.02$

The other parameters used in the simulations are listed in Table 4.1, unless otherwise stated. For comparison purposes, we use three baselines:

• (a) The binary FL scheme from [147], which lets the server randomly select a subset of devices to cooperatively train the FL model at each iteration; each parameter of the trained FL model is quantized into one bit.
• (b) An FL algorithm in which the server randomly selects a subset of devices to cooperatively train the FL model in full precision (i.e., without quantization), which can be seen as standard FL.
• (c) An FL algorithm that optimizes the device selection and quantization schemes using a model-free RL method. For (c), a policy gradient-based RL update is employed without estimating the state transition probabilities.

4.2.4.1 Datasets and ML Models

We consider two popular ML tasks: handwritten digit identification on the MNIST dataset [114], and image classification on the CIFAR-10 dataset [148]. The quantized FL algorithm used for handwritten digit identification consists of three fully-connected layers. The total number of model parameters in the used fully-connected neural network (FNN) is 217,728 ($= 28 \times 28 \times 256 + 256 \times 64 + 64 \times 10$). To verify the feasibility of the proposed calculation time model in (4.26), we first simulate the actual calculation time using the clock module in GEneral Matrix Multiply (GEMM) [149], as shown in Fig. 4.12. Figure 4.12 shows that the actual calculation time is almost the same as the theoretical calculation time given by (4.26). The quantized FL algorithm used for image classification consists of three convolutional layers and two fully-connected layers. In the used convolutional neural network (CNN), the size of the convolutional kernel is $5 \times 5$ and the total number of model parameters in the CNN is 116,704 ($= 5 \times 5 \times (3 \times 32 + 32 \times 32 + 32 \times 64) + 576 \times 64 + 64 \times 10$). For both datasets, we consider two cases of data distribution across clients: (i) non-i.i.d., where each client is allocated samples from only 3 of the 10 labels; and (ii) i.i.d., where each client is allocated samples from all labels.


Fig. 4.12 Calculation time of low bitwidth federated learning vs. the quantization precision

Fig. 4.13 Identification accuracy vs. number of iterations

All FL algorithms are considered to have converged when the variance of the FL loss, calculated over 20 consecutive iterations, is less than 0.001.

4.2.4.2 Convergence Performance Analysis

Figure 4.13 shows how the FL handwritten digit identification accuracy changes as the number of iterations varies. From this figure, we can see that the proposed algorithm obtains a noticeable improvement in convergence speed compared with model-free RL in the non-i.i.d. case.

Fig. 4.14 Identification accuracy vs. number of iterations

This implies that our proposed algorithm models the FL training process effectively in the non-i.i.d. case by estimating the key meta-parameters, which speeds up convergence. Moreover, our algorithm comes within 2% of the accuracy obtained by standard FL at convergence. It achieves this while reducing the interactions between the server and the edge devices, thereby saving communication cost.

Figure 4.14 shows how the identification accuracy of all considered algorithms changes as the number of iterations varies on the CIFAR-10 dataset in the non-i.i.d. case. We see that our proposed methodology can achieve up to a 22% improvement in the number of iterations required to converge compared to the model-free RL method. Similar to the previous figures, this demonstrates the advantage of the server estimating the associated FL model parameters based on information captured during the training process, thus optimizing the FL training process with minimal interaction with each device. Figure 4.14 also shows that the binary FL algorithm (i.e., when the weights of the CNN are binary) can only achieve 21% identification accuracy. This is due to the fact that binary FL neither has a pre-training process nor uses full-precision scale factors to recover the full-precision model, again emphasizing the benefit of optimizing the quantization precision.

Figure 4.15 shows one example of implementing the proposed FL algorithm for the identification of 40 handwritten digits. From this figure, we see that, as the delay requirement $\Gamma$ for completing each FL training iteration increases, the average number of quantization bits $\alpha$ and the identification accuracy increase. This is because, as $\Gamma$ increases, the time that the selected devices can use for training and transmitting local FL parameters increases, resulting in an improvement of $\alpha$ and of the identification accuracy. From Fig. 4.15, we can see that, out of 40 handwritten digits, the proposed algorithm correctly identifies 35, whereas the model-free RL identifies 34 and the binary FL correctly identifies 33. This is because the proposed FL algorithm can mathematically model the FL training process by obtaining the transition probability, so as to find the optimal device selection and quantization scheme and thus achieve a higher identification accuracy.


Fig. 4.15 An example of implementing quantized FL for handwritten digit identification

Fig. 4.16 An example of implementing quantized FL for image identification (CIFAR-10)

Figure 4.16 shows how the identification accuracy of the proposed FL framework changes as the delay requirement varies. This figure is simulated using the CIFAR-10 dataset. From Fig. 4.16, we see that, as the delay requirement increases, the identification accuracy of all considered learning algorithms increases. This is because, as the delay requirement increases, all considered learning algorithms enable the selected devices to fully utilize the training and transmission time to perform FL training, which results in an increase of the average number of quantization bits and of the achievable accuracy. Figure 4.16 also shows that, as the average number of quantization bits $\alpha$ decreases, the number of iterations required to reach a fixed achievable accuracy increases slightly. This is due to the fact that, as $\alpha$ decreases, the quantization error increases, which decreases the accuracy of modeling the FL training process. However, with a decrease of $\alpha$, the time used to perform FL training at each iteration decreases; thus, the total time used to reach a fixed achievable accuracy decreases rapidly, which implies that the time used for training the proposed quantized FL framework decreases.


4.3 Conclusions

In this chapter, we have introduced the use of quantization methods to reduce the size of FL parameters, thus reducing the FL convergence time. In particular, we have introduced the use of universal vector quantization for FL model compression, which can minimize the size of the FL parameters that each user needs to transmit while reducing the quantization errors based on the unique FL settings (i.e., FL parameter aggregation). Then, we have introduced the use of model-based RL to determine the number of bits used to represent each FL parameter according to dynamic wireless environments and device conditions. Simulation results have demonstrated the effectiveness of the introduced FL algorithms.

Chapter 5

Federated Learning with Over the Air Computation

In this chapter, we first discuss the basic principle and techniques of over-the-air computation (AirComp), and then explore its deployment in a communication-efficient FL system.

5.1 AirComp Principle and Techniques

5.1.1 AirComp Principle

The idea of AirComp is elaborated as follows. Given simultaneous time-synchronized transmission by devices, their signals are superimposed over the air, and their weighted sum, called the aggregated signal, is received by the PS, where the weights correspond to the channel coefficients. For AirComp-based FL (AirComp-FL), it is desirable to have uniform weights so that the aggregated signal is not biased towards any device and can easily be converted to the desired average of the transmitted signals (i.e., model updates). Making this possible requires each device to modulate its signal using linear analog modulation and to invert its fading channel by transmission-power control. The former operation is necessary to exploit the channel's analog-waveform superposition property, while the latter aligns the received magnitudes of the individual signal components, an operation called magnitude alignment. One may question the optimality of using seemingly primitive analog modulation compared with sophisticated digital modulation and coding. Interestingly, from the information-theoretic perspective, it was shown in [42] that AirComp can be optimal in terms of minimizing the mean squared error (MSE) distortion if all the multi-access channels and sources are Gaussian and independent.



Though FL requires only over-the-air averaging, AirComp is capable of computing a broad class of so-called nomographic functions [43]. These are characterized by a post-processing function applied to a summation, with each summand being a pre-processing function of an individual data sample. Besides averaging, examples include the weighted sum, geometric mean, polynomials, and the Euclidean norm. Consequently, except for averaging, implementing AirComp for a nomographic function usually requires pre-processing of the data before transmission and post-processing at the receiver. A general function can be decomposed as a sum of nomographic functions [150], which suggests the possibility of approximately computing general functions with AirComp.

A key requirement for implementing AirComp is time synchronization of the devices' transmissions. Such requirements also exist for uplink transmission (e.g., TDMA and SC-FDMA) in practical systems (e.g., LTE and 5G). In such systems, a key synchronization mechanism is "timing advance," which can also be adopted for AirComp synchronization. With timing advance, each device estimates its propagation delay and transmits in advance to "cancel" the delay. Thereby, different signals arrive at the BS in their assigned slots (in the case of TDMA) or overlap with sufficiently small misalignment (in the cases of SC-FDMA and AirComp). Considering a synchronization channel used for propagation-delay estimation, the estimation accuracy is proportional to the channel bandwidth [151]. For instance, the estimation error is no larger than 0.1 microsecond for a bandwidth of 1 MHz. If AirComp is deployed in a broadband OFDM system (see the next subsection), the error gives rise to only a phase shift of a symbol received over a sub-channel, so long as the error is shorter than the cyclic prefix (CP). The phase shift can then be compensated by sub-channel equalization. In an LTE system, the CP length is several microseconds, which is more than sufficient for coping with synchronization errors. This suggests the feasibility of AirComp deployment in practical systems. The impact of potential remaining synchronization errors on the performance of AirComp, and techniques to tackle them, have recently been studied in [152].

The distortion of digital modulation originates from quantization and decoding errors. In contrast, for AirComp, the main source of signal distortion is channel noise and interference that directly perturb the analog modulated signals. Hence, a commonly used performance metric for AirComp is the MSE distortion of the received functional values with respect to the ground truth. In the context of AirComp-FL, channel noise and interference perturb the model updates, and their effects can be evaluated using the relevant metric of learning performance.

Finally, it is worth mentioning that AirComp is similar to non-orthogonal multiple access (NOMA) in that both are simultaneous-access schemes. However, the distinction of AirComp is the harnessing of "interference" for functional computation via the devices' cooperation, whereas NOMA attempts to suppress interference as the devices (subscribers), transmitting independent data, compete for the use of radio resources.
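To make the nomographic-function idea concrete, the sketch below computes a geometric mean over the air in a noiseless toy model: each device applies the pre-processing $\varphi(x) = \log x$, the channel sums the transmitted values, and the receiver applies the post-processing $\psi(s) = \exp(s/K)$. The function names are illustrative.

```python
import numpy as np

def aircomp_geometric_mean(x_devices):
    # Nomographic decomposition: geometric mean = psi(sum_k phi(x_k)) with
    # phi = log and psi(s) = exp(s / K); the sum is what the multiple-access
    # channel computes "for free" when all devices transmit simultaneously.
    K = len(x_devices)
    tx = [np.log(x) for x in x_devices]   # pre-processing at each device
    rx = sum(tx)                          # over-the-air superposition (noiseless)
    return np.exp(rx / K)                 # post-processing at the receiver

x = [2.0, 8.0]                            # geometric mean is 4.0
assert abs(aircomp_geometric_mean(x) - 4.0) < 1e-12
```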


5.1.2 Broadband AirComp

In a practical broadband system, the spectrum is divided into sub-carriers using OFDM modulation. The deployment of AirComp-FL in such a system involves simultaneous over-the-air aggregation of model-update coefficients transmitted over sub-carriers, subject to the power constraints of the individual devices. Channel inversion needs to be generalized to the case of multiple sub-carriers as follows. Consider a specific uploading device. For each OFDM symbol, ideally each sub-carrier is linearly analog modulated with a single model/gradient element, whose power is determined by channel inversion. However, due to the power constraint, it is impractical to invert sub-carriers in deep fade; hence they are excluded from transmission, an operation called channel truncation. AirComp requires all devices to have fixed and identical mappings between update coefficients and sub-carriers. As a result, channel truncation results in the erasure of coefficients mapped to sub-carriers in deep fade, as they cannot be remapped to other sub-carriers. Channel truncation can potentially suffer a near-far problem, where the fraction of erased coefficients, called the truncation ratio, is much larger for a device far from the PS (hence with more severe path loss) than for a nearby device (with smaller loss). This problem introduces bias and degrades the learning performance. One solution is to apply channel truncation based only on small-scale fading, with the twofold advantage of (1) approximately equalizing the truncation ratios among devices, and (2) allowing the PS to exploit data even at faraway devices. The resulting scheme of truncated channel inversion scales the symbol transmitted over the m-th sub-carrier by a coefficient $p_k^{(m)}$ given as

$$p_k^{(m)} = \begin{cases} \dfrac{\eta}{r_k^{-\alpha/2}\, h_k^{(m)}}, & |h_k^{(m)}|^2 \ge g_{\mathrm{th}}, \\ 0, & \text{otherwise}, \end{cases}  \qquad (5.1)$$

where $r_k^{-\alpha}$ is the path loss and $h_k^{(m)}$ the fading gain. The parameter $\eta$ represents the aligned received magnitude of the different signal components and is chosen subject to the individual power constraints of all devices. Next, given truncated channel inversion, the PS demodulates a certain number of OFDM symbols and thereby receives from the sub-carriers an over-the-air aggregated model update, which is then used to update the global model.
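A minimal sketch of truncated channel inversion in (5.1), with illustrative names and a placeholder path-loss exponent; it returns both the power-control coefficients and the truncation mask.

```python
import numpy as np

def truncated_channel_inversion(h, r_k, eta, g_th, alpha=3.0):
    # Power-control coefficients of (5.1): sub-carriers whose small-scale
    # fading gain falls below the threshold g_th are truncated (p = 0); the
    # rest are inverted so the received magnitudes align at eta.
    h = np.asarray(h, dtype=float)                # post-compensation fading gains
    p = np.zeros_like(h)
    active = h ** 2 >= g_th                       # truncation on fading only
    p[active] = eta / (r_k ** (-alpha / 2) * h[active])
    return p, active
```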

5.1.3 MIMO AirComp

MIMO (or multi-antenna) communication is widely adopted in practical systems (e.g., LTE and 5G) to support high-rate access by spatial multiplexing of data streams. The deployment of AirComp-FL in a MIMO system can leverage spatial


multiplexing to reduce the communication latency by a factor equal to the multiplexing gain. Realizing this benefit requires the design of MIMO AirComp, a technique for multiplexing parallel over-the-air aggregation or, equivalently, AirComp of vector symbols, each comprising multiple update coefficients. The main distinction of MIMO AirComp is the use of receive beamforming, called aggregation beamforming, to enhance the received signal-to-noise ratios (SNRs) of the aggregated observations at the PS array. The intuition behind the design of aggregation beamforming is that, in terms of subspace distance, the beamformer should be steered away from the relatively strong MIMO links and closer to the relatively weaker ones. The purpose is to enhance the received SNRs of the latter at the cost of those of the former, thereby equalizing their channel gains. This facilitates the subsequent spatial magnitude alignment to enhance the post-aggregation SNRs. Given the aggregation beamformer, the effective MIMO channel can be inverted at each device to implement spatial magnitude alignment after aggregation beamforming. Finding the optimal aggregation beamformer is a non-convex program and intractable. An approximate solution can, however, be found in closed form that mathematically expresses the above design intuition. Specifically, the received SNRs of the spatial data streams of an individual MIMO link, as observed after aggregation beamforming, can be approximated by the smallest SNR, corresponding to the weakest eigenmode of the effective channel. Using this approximation, an approximation of the optimal aggregation beamformer is obtained as the first L left eigenvectors of the matrix $G = \sum_k \lambda_{\min,k}^2 U_k U_k^H$, where $\lambda_{\min,k}$ is the smallest singular value of the k-th link and $U_k$ its left eigen subspace [43]. The matrix shows that the aggregation beamformer is a weighted centroid of the eigen subspaces of the individual MIMO links, where the weights are their smallest eigenvalues. This is aligned with the design intuition mentioned above.
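The closed-form approximation above can be sketched as follows, assuming each device's effective channel has already been decomposed to yield its left eigen subspace $U_k$ and smallest singular value $\lambda_{\min,k}$; the names are illustrative.

```python
import numpy as np

def aggregation_beamformer(U_list, lam_min, L):
    # Weighted centroid of eigen subspaces: G = sum_k lam_min_k^2 U_k U_k^H;
    # the aggregation beamformer is formed by the L dominant eigenvectors of G.
    G = sum((lam ** 2) * (U @ U.conj().T) for lam, U in zip(lam_min, U_list))
    eigvals, eigvecs = np.linalg.eigh(G)          # G is Hermitian by construction
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:L]]                  # top-L eigenvectors
```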

5.1.4 Design of AirComp Federated Learning

Consider the AirComp-FL system in Fig. 5.1. In this section, we discuss several issues concerning the design of such a system.

5.1.4.1 Model Update Distortion

Given the deployment of broadband AirComp, the received aggregated model update at the PS is distorted in two ways. First, the local-model update transmitted by each device may lose some coefficients due to the truncated channel inversion in (5.1). Second, the uncoded aggregated update is directly perturbed by channel noise. There exists a trade-off between these two factors. The sub-carrier/coefficient truncation ratio can be reduced by lowering the truncation threshold in (5.1). As a result, sub-carriers with small gains are used for transmission and are thus involved in channel inversion, consuming more transmission power.


Fig. 5.1 The AirComp-FL system

Due to the individual devices' fixed power budgets, the magnitude alignment factor $\eta$ in (5.1) then has to be reduced. This leads to a reduction in the received SNR and a noisier aggregated update received by the PS. For this reason, there exists a trade-off between the truncation ratio of each local-model update and the received SNR. In system design, this trade-off should be balanced so as to regulate the overall distortion and prevent it from significantly degrading the learning performance. The operating point on this trade-off may also be adjusted along the iterations of the learning process, as more accurate estimates of the model updates are needed when the learning process gradually converges to its optimal value.

5.1.4.2 Device Scheduling

In a conventional radio-access system, throughput or link reliability can be enhanced by scheduling cell-interior users at the cost of the quality-of-service of cell-edge users. In the context of AirComp-FL, the penalty of doing so is a loss of data diversity, since the data at cell-edge devices cannot be exploited for model training, which can significantly reduce the generalization power of the learned model. To elaborate, due to the signal-magnitude alignment required in AirComp, the received SNR of the aggregated model update is dominated by the weakest link among the participating devices. Consequently, including faraway devices with severe path loss can expose the model updates to strong noise, and hence potentially slow down convergence and reduce model accuracy. On the other hand, including more devices, which are data sources, means more training data; from this perspective, they may have the opposite effect. Therefore, designing a scheduling scheme for AirComp-FL needs to balance this trade-off between update quality and data quantity. For example, when the device density is high, the path-loss threshold for selecting contributing devices can be raised, and vice versa.


On the other hand, mobility can alleviate this issue even when only cell-interior devices are employed: since the devices are mobile, the selected set changes over rounds, which benefits model training by providing data diversity. In scenarios with low mobility, one can also alleviate the issue by alternating between cell-edge and cell-interior devices over different rounds [14].

5.1.4.3 Coding Against Interference

Existing AirComp with uncoded linear analog modulation exposes model training to interference and potential attacks. Most existing works target single-cell systems and overcome the noise effect by increasing the transmission power. However, in scenarios with multi-cell networks or multiple coexisting services, the signal-to-interference ratios are independent of the power. Besides coping with interference, making FL secure is equally important. This motivates the need for coding in AirComp. Possible methods include scrambling the signals using pseudo-random spreading codes from spread spectrum, or encoding the signals using Shannon-Kotelnikov mappings from joint source-channel coding [153], prior to transmission. Both coding schemes have the potential of providing the desired property that AirComp remains feasible after coding, so long as the participating devices apply an identical code (spreading code or Shannon-Kotelnikov mapping), while interference is suppressed by despreading/decoding at the BS.

5.1.4.4 Power Control

Channel inversion is adopted in typical AirComp to realize magnitude alignment [154]. Its drawback is that it either excludes devices with weak links from FL, at the cost of data diversity, or consumes too much power by inverting such links. In other words, channel-inversion transmission is sub-optimal in terms of minimizing the errors in the aggregated gradients/models. Targeting a sensor system with i.i.d. data sources, it was shown in [155] that the optimal power-control policy for error minimization exhibits a threshold-based structure: when its channel gain is below a fixed threshold, a device should transmit at full power; otherwise, it should adopt channel inversion. Nevertheless, the assumption of i.i.d. data sources does not hold for AirComp-FL, since the stochastic gradients or local models of different devices are highly correlated. It is proposed in [156] that information on the gradient distribution can be exploited in power control for AirComp-FL. While this provides significant gains in learning accuracy, the optimal power-control strategy in general remains an open problem.
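The threshold structure reported in [155] can be sketched as below; the names and the clipping of the inversion power to the budget are illustrative assumptions, not the paper's exact policy.

```python
def sensor_power_control(h_gain: float, g_th: float, p_max: float, eta: float) -> float:
    # Threshold-based policy from [155]: transmit at full power when the
    # channel gain is below the threshold; otherwise invert the channel so the
    # received magnitude aligns at eta (clipped to the power budget).
    if h_gain < g_th:
        return p_max
    return min(p_max, eta / h_gain)
```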


5.2 Power Control Optimization for AirComp FL

In this section, we investigate transmit power control to combat aggregation errors in AirComp-FL. Different from conventional power control designs (e.g., minimizing the individual mean squared error (MSE) of the over-the-air aggregation at each round), we introduce a new power control design that directly maximizes the convergence speed. Towards this end, we first analyze the convergence behavior of AirComp-FL (in terms of the optimality gap) subject to aggregation errors at different communication rounds. It is revealed that if the aggregation estimates are unbiased, then AirComp-FL converges exactly to the optimal point under mild conditions, while if the aggregation estimates are biased, then AirComp-FL converges with an error floor determined by the accumulated estimation bias over the communication rounds. Next, building upon the convergence results, we optimize the power control to directly minimize the derived optimality gaps, both without and with unbiased aggregation constraints, subject to a set of average and maximum power constraints at the individual edge devices. Finally, numerical results show that the proposed power control policies achieve significantly faster convergence for AirComp-FL compared with benchmark policies using fixed power transmission or conventional MSE minimization.

5.2.1 AirComp FL Model

We consider an AirComp FL system consisting of an edge server and K edge devices, as shown in Fig. 5.1. We assume that the learning model is represented by the parameter vector $w \in \mathbb{R}^q$, with $w = [w_1, \cdots, w_q]^T$ and q denoting the learning model size. Let $\mathcal{D}_k$ denote the local dataset at edge device k, in which the i-th sample and its ground-truth label are denoted by $x_i$ and $\tau_i$, respectively. Define $f(w, x_i, \tau_i)$ as the sample-wise loss function quantifying the prediction error of the learning model $w$ on sample $x_i$ with respect to (w.r.t.) its ground-truth label $\tau_i$. Then the local loss function of the learning model vector $w$ on $\mathcal{D}_k$ is

$$F_k(w) = \frac{1}{|\mathcal{D}_k|}\sum_{(x_i, \tau_i)\in\mathcal{D}_k} f(w, x_i, \tau_i).  \qquad (5.2)$$

For notational convenience, we denote $f(w, x_i, \tau_i)$ as $f_i(w)$ and assume that the sizes of the local datasets at different edge devices are uniform, i.e., $D \triangleq D_k = |\mathcal{D}_k|, \forall k \in \mathcal{K}$. Then, the global loss function on all the distributed datasets $\mathcal{D}_{\mathrm{tot}} = \cup_{k\in\mathcal{K}} \mathcal{D}_k$ evaluated at parameter vector $w$ is given by


$$F(w) = \frac{1}{D_{\mathrm{tot}}}\sum_{k\in\mathcal{K}} D_k F_k(w) = \frac{1}{K}\sum_{k\in\mathcal{K}} F_k(w),  \qquad (5.3)$$

where $D_{\mathrm{tot}} = |\mathcal{D}_{\mathrm{tot}}| = KD$. The objective of the training process is to find a parameter vector $w$ that minimizes the global loss function $F(w)$ in (5.3), i.e.,

$$w^* = \arg\min_{w} F(w).  \qquad (5.4)$$

Consider a particular iteration or communication round n, with the learning model before updating denoted by $w^{(n)}$. In this round, each edge device $k \in \mathcal{K}$ computes the local gradient estimate of the loss function, $g_k^{(n)}$, based on a randomly sampled mini-batch from the local dataset. We denote the set of mini-batch data used by edge device k at round n as $\tilde{\mathcal{D}}_k^{(n)}$ and its size $m_b = |\tilde{\mathcal{D}}_k^{(n)}|, \forall k \in \mathcal{K}$. Then we have

$$g_k^{(n)} = \frac{1}{m_b}\sum_{(x_i, \tau_i)\in\tilde{\mathcal{D}}_k^{(n)}} \nabla f_i\left(w^{(n)}\right).  \qquad (5.5)$$

Next, the edge devices upload their local gradients to the edge server for aggregation. If the aggregation is error-free, then the global gradient estimate can be obtained as an average of the local gradient estimates from all edge devices, i.e.,¹

$$\bar{g}^{(n)} = \frac{1}{K}\sum_{k\in\mathcal{K}} g_k^{(n)}.  \qquad (5.6)$$

Then, the edge server broadcasts the obtained global gradient estimate $\bar{g}^{(n)}$ to the edge devices, based on which the edge devices synchronously update their own learning models via

$$w^{(n+1)} = w^{(n)} - \eta^{(n)} \cdot \bar{g}^{(n)},  \qquad (5.7)$$

¹ Although we consider the same data size D at different edge devices, the proposed AirComp-FL can easily be extended to the case where the devices have different data sizes, i.e., where the $D_k$'s are different. In this case, we only need to revise the global gradient estimate in (5.6) to a weighted average of the local ones, i.e., $\bar{g}^{(n)} = \sum_{k\in\mathcal{K}} \frac{D_k}{D_{\mathrm{tot}}} g_k^{(n)}$. Via AirComp, the desired weighted aggregation of the local gradient estimates can easily be attained by adding an additional pre-processing $\psi(\cdot)$ to the transmitted signal $s_k$, with $\psi(s_k) = \frac{D_k}{D_{\mathrm{tot}}} s_k$.

5.2 Power Control Optimization for AirComp FL

101

where .η(n) is the learning rate at communication round n. The above procedure continues until the convergence criteria is met or the maximum number of communication rounds is achieved. Notice that we consider the over-the-air aggregation approach to achieve fast gradient aggregation, based on which the received aggregated gradient at the edge server in (5.6) may be erroneous due to perturbation caused by the channel fading and noise. To implement AirComp, during the gradient-uploading phase, all devices transmit simultaneously over the same time-frequency block with proper phase compensation. For ease of exposition, it is assumed that the channel coefficients remain unchanged within each communication round, but may change over different rounds. It is also assumed that the edge devices can perfectly know their own channel state information (CSI), so they can compensate for the channel phase differences. Let .hˆ (n) k denote the complex channel coefficient from edge device k to the edge (n) server at communication round n, and .hk denote its post-compensated real-valued (n) (n) channel coefficient, i.e., .hk = |hˆ k |. Then, the received aggregated signal via AirComp (after phase compensation) is given by y(n) =

Σ

.

k∈K

(n)

hk

/ (n) (n) pk gk + z(n) ,

(5.8)

in which .pk(n) denotes the power scaling factor at edge device k, and .z(n) ∈ Rq denotes the additive white Gaussian noise (AWGN) with .z(n) ∼ CN(0, σz2 I) and 2 .σz being the noise power. Based on (5.8), the global gradient estimate at the edge server is given by2 .

gˆ (n) =

y(n) . K

(5.9)

It thus follows from (5.8) and (5.9) that the aggregation error caused by the overthe-air aggregation in global gradient estimation is given by ε (n) = gˆ (n) − g¯ (n) ⎛ ⎞ / 1 Σ z(n) (n) h(n) = p − 1 g(n) , k k + k K K

 k∈K

  Noise

.

(5.10)

Signal misalignment error

2 Unlike

the conventional AirComp, using an additional scaling factor at the receiver, in (5.9) we directly use .y(n) /K as the estimated value of global gradient for AirComp-FL. This is due to the fact that the learning rate of local FL model update can play the equivalent role of scaling factor, and thus dedicated scaling factors are not needed as in conventional AirComp FL.

102

5 Federated Learning with Over the Air Computation

which consists of two components representing the signal misalignment error and noise-induced error, respectively. (n) The devices can adaptively adjust their transmit powers by controlling .{pk } to reduce the aggregation errors for enhancing the learning performance. We consider that each edge device .k ∈ K is subject to a maximum power budget .Pˆkmax for each communication round, and an average power budget denoted by .Pˆkave over the whole training period. Therefore, we have ⎛||/ || ⎞ || (n) (n) ||2 1 || || ˆ max . E || pk gk || ≤ Pk , ∀k ∈ K, ∀n ∈ N, q

(5.11)

(n)

where q is the size of the gradient vector .gk , as well as ⎛||/ || ⎞ || (n) (n) ||2 1 Σ || ˆ ave E || pk gk || . || ≤ Pk , ∀k ∈ K, Nq n∈N

(5.12)

where .N  {1, · · · , N} with N denoting the total number of communication rounds for model training. Next, we will establish a direct learning performance metric, namely the optimality gap, linking with the aggregation errors over communication rounds. Based on the analysis, in Sect. 5.2.3 we will propose to minimize the optimality gap via optimizing the power control subject to a set of individual maximum and average power constraints.

5.2.2 AirComp FL Convergence Analysis In this section, we present a convergence analysis framework for the FL in the presence of aggregation errors by using the optimality gap as the performance metric, which sheds light on how the imperfect gradient updates affect the convergence of FL. As will be shown shortly, depending on whether the aggregated gradient estimate is unbiased or not, the FL will have different convergence behaviors.

5.2.2.1

Basic Assumptions on Learning Model

To facilitate the convergence analysis, we make several assumptions on the loss functions and gradient estimates. Assumption 5.1 (Smoothness) Let .∇F (w) denote the gradient of the loss function evaluated at point .w ∈ Rq . Then there exists a non-negative constant vector .L ∈ Rq with .L = [L1 , · · · , Lq ]T , such that

5.2 Power Control Optimization for AirComp FL

103

⎡ ⎤ F (w)− F (w' )+∇F (w)T (w−w' )

.

1 Σ ≤ Li (wi − wi' )2 , ∀w, w' ∈ Rq . 2 q

i=1

Assumption 5.1 guarantees that the gradient of the loss function would not change arbitrarily quickly w.r.t. the parameter vector. Note that such an assumption is essential for convergence analysis of gradient decent methods to provide a good indicator for how far to decrease to the minimum loss. Assumption 5.2 (Polyak-Łojasiewicz Inequality) Let .F * denote the optimal loss function value to problem (5.4). There exists a constant .δ ≥ 0 such that the global loss function .F (w) satisfies the following Polyak-Łojasiewicz condition: ||∇F (w)||2 ≥ 2δ(F (w) − F * ).

.

(5.13)

Notice that Assumption 5.2 is more general than the standard assumption of strong convexity. The inequality in (5.13) simply requires that the gradient grows faster than a quadratic function when away from the optimal function value and implies that every stationary point is a global minimum. Assumption 5.3 (Variance Bound) The local gradient estimates .{gk }, defined in (5.5), where the index n is omitted for simplicity, are assumed to be independent and unbiased estimates of the batch gradient .∇F (w) with coordinate bounded variance, i.e., E[gk ] = ∇F (w), ∀k ∈ K, .

.

E[(gk,i − ∇F (wi ))2 ] ≤

σi2 , ∀k ∈ K, ∀i, mb

(5.14) (5.15)

where .gk,i and .∇F (wi ) are defined as the i-th element of .{gk } and .∇F (w), respectively, .σ = [σ1 , · · · , σq ] is a vector of non-negative constants, and the denominator .mb accounts for the fact that the local gradient estimate is computed over a mini-batch of data with size .mb .

5.2.2.2

Optimality Gap Versus Aggregation Errors

) ( Suppose that at each communication round n, .F w(n) is the value of loss function w.r.t. the parameter vector .w(n) . Thus, with the lossy gradient aggregation in (5.10), the update of learning model at communication round n in (5.7) is represented as ⎛ ⎞ w(n+1) = w(n) − η(n) · g¯ (n) + ε(n) ,

.

(5.16)

104

5 Federated Learning with Over the Air Computation

where .ε (n) represents the induced random aggregation error (including the signal misalignment error and noice-induced error) at each communication round n. Let (n) ] and .E[||ε (n) ||2 ] denote the bias and MSE of the global gradient estimate at .E[ε each communication round n, respectively, where the expectation operation is taken over the stochastic sample selection on the local gradient estimation over a minibatch dataset, as well as the receiver noise due to AirComp. Depending on the value of .E[ε(n) ], we define two cases for the gradient aggregation. • Case I without unbiased aggregation constraints: The aggregation can either be biased (i.e., .E[ε (n) ] /= 0) or unbiased (.E[ε(n) ] = 0). In this case, no additional constraints on the aggregation biasness are introduced during the power control designs. • Case II with unbiased aggregation constraints: The aggregation is unbiased, i.e., the constraints .E[ε (n) ] = 0, ∀n ∈ N, are introduced in the aggregation designs (e.g., transmission power control). ( ) Define the optimality gap after N communication rounds as .F w(N +1) − F * and .L  ||L||∞ . Then, by considering a properly chosen fixed learning rate, we establish the following theorem. Theorem 5.1 Under Assumption 5.1, suppose that the AirComp FL algorithm is 2 implemented with a fixed learning rate .η  η(n) , ∀n ∈ N, with .0 ≤ η ≤ 2+L ≤ 1δ and fixed mini-batch size .mb = N [53]. Then, the expected optimality gap satisfies the inequality (5.17), where .C = 1 − δη with .0 < C < 1. .E

⎤ ⎞⎤ ⎡ ⎛ ⎡ Σ C N −n F w(N +1) − F * ≤ || E ε (n) ||2 2

  n∈N Bias 

 Error floor

⎛ ⎛ ⎡ ⎛ ⎞⎤ ⎞ Σ C N −n + C N E F w(1) − F * +

  n∈N 2

Initial optimality gap

⎞ ⎡|| ⎜ η2 L ||σ ||2 ||2 ⎤⎟ ⎤ ⎡ ⎜ || || ⎟ 2 +η2 L2 ≤ || E ε (n) ||2 +η2 L E ||ε(n) || ⎟ . ⎜ ⎝ 2δNK 2 ⎠

 

   

Gradient variance



Bias

MSE



The gap to the error floor Δ(N)

(5.17) Proof See Appendix A of [154].

⨆ ⨅

From Theorem 5.1, we have the following observations. • The FL algorithm converges eventually as .N → ∞, with the optimality gap possibly landed on an error floor instead of diminishing to zero. It is observed from (5.17) that the upper bound of the optimality gap can be ⎡ ⎤|| Σ C N−n || ||E ε (n) ||2 that decomposed into two components, i.e., the error floor . 2 n∈N cannot vanish as N grows, and the gap to the error floor, denoted by .Δ(N ),

5.2 Power Control Optimization for AirComp FL

105

which can approach zero as N increases. To see this, .⎡Δ(N ( ) is)⎤observed to contain four terms related to the initial optimality gap (.E F w(1) − F * ), the ⎡|| ||2 ⎤ ⎡ ⎤ η2 L||σ ||2 gradient variance . 2δN K 22 , as well as the bias .E ε (n) and MSE .E ||ε (n) || of the aggregation errors, respectively. All the four terms diminish as N goes to infinity, or become negligible under a sufficiently small learning rate. On the || ⎡ ⎤||2 other hand, the error floor is determined by the accumulated bias .||E ε (n) || over rounds. Hence, as N increases, the error floor would approach a constant while the gap to it .Δ(N ) would vanish. • The FL algorithm shows different convergence behaviors depending on whether the gradient aggregation ⎡ is ⎤ biased or not. For the case with unbiased aggregation constraints, i.e., .E ε (n) = 0, ∀n ∈ N, as N becomes sufficiently large, the model under training can converge exactly to the optimal point with minimum training loss with zero error floor in the training process. By contrast, for the case without unbiased aggregation constraints, the model under training may only converge to a neighborhood of the optimal point (if the aggregation is biased). However, the case with unbiased aggregation constraints may converge slower compared with its counterpart without unbiased aggregation constraints, as the enforcement of the unbiasness generally comes at a cost of elevated MSE that translates to a larger gap to the error floor .Δ(N ). The observation is also validated via experiments shown in Sect. 5.2.4. ⎤ ⎡ • Latter rounds are more sensitive to aggregation error. The bias .E ε (n) and ⎤ ⎡|| ||2 MSE .E ||ε (n) || at the later communication rounds (with large n) contribute more on the optimality gap than that of the initial rounds (with small n), as the effect of the aggregation error introduced at early stages (small n) is discounted by .C N −n on the right hand side of (5.17).

5.2.2.3

Optimality Gap Versus Transmission Power Control

In this subsection, we obtain the optimality gap w.r.t. the transmission power control variables based on the results in Sect. 5.2.2.2, in order to facilitate the power control design in the sequel. In particular, we consider ⎡ the⎤AirComp-FL in the cases without and with unbiased aggregation constraints .E ε (n) = 0, ∀n ∈ N. Before proceeding, we introduce the following assumption on the sample-wise gradient bound. Assumption 5.4 (Bounded Sample-Wise Gradient) At any communication ( ) round n, the sample-wise gradient .∇f w(n) , x, y for any training sample .(x, y) is upper bounded by a given constant .G(n) , i.e., .

|| ⎛ ⎞|| || || ||∇f w(n) , x, y || ≤ G(n) , ∀n ∈ N.

(5.18)

106

5 Federated Learning with Over the Air Computation

|| || || ( )|| Based on Assumption 5.4, we have .||∇F (w(n) )|| ≤ max(x,y)∈D ||∇f w(n) , x, y || ≤ G(n) . Together with Assumption 5.3, it thus holds that ⎡|| || ⎤ || ||2 ||σ ||2 || || (n) ||2 || 2 ≤ ||∇F (w(n) )|| + .E ||g k || mb ⎞2 ||σ ||2 ⎛ 2 ˆ (n)  G(n) + ≤G . mb 5.2.2.4

(5.19)

Convergence Analysis for AirComp-FL in Case I

In this part, we formally characterize the convergence behavior of AirComp-FL w.r.t. the transmission power in the case without unbiased aggregation constraints. According to the definition of .ε (n) in formula (5.10), at each communication round n, the bias and MSE of gradient estimates through the over-the-air gradient aggregation for AirComp-FL are derived as ⎞ || || ⎛ / || ⎡ ⎤|| ||∇F (w(n) )|| Σ || || (n) (n) ⎝ ε (n) || = hk pk − K ⎠ . ||E K k∈K ⎞ ⎛ / G(n) ⎝ Σ (n) (n) ≤ hk pk − K ⎠ , K k∈K 2 || || ⎡|| ⎞2 || ⎤ ||∇F (w(n) )||2+||σ ||2 Σ ⎛ / σ 2q || (n) ||2 mb (n) (n) hk pk −1 + z 2 E. ||ε || ≤ K K k∈K ⎞2 ˆ (n) Σ ⎛ (n) / (n) σ 2q G hk pk − 1 + z 2 , ≤ K K k∈K

(5.20)

(5.21)

where both inequalities follow from Assumptions 5.1 and 5.4. By substituting (5.20) and (5.21) into (5.17), we have the following proposition. Proposition 5.1 The expected optimality gap for AirComp-FL in the case without unbiased aggregation constraints is upper bounded by ⎞⎤ ⎛⎡ ⎛ ⎞⎤ ⎞ ⎡ ⎛ ⊓ C (n) F w(1) − F * E F w(N +1) −F * ≤ n∈N

.

5.2 Power Control Optimization for AirComp FL

⎛ N Σ



107

⎞2

/



Σ (n) ⎜ (n) hk pk − K ⎠ J (n) ⎝A(n) ⎝ n=1 k∈K ⎛ ⎞ N Σ Σ ⎛ (n) / (n) ⎞2 (η(n) )2 σz2 Lq ⎠, + J (n) ⎝B (n) hk pk −1 + K2 n=1 k∈K

+

( ) 1+(η(n) )2 L2 (G(n) )2

(η(n) )2 L ||σ ||22⎟ + ⎠ 2mb K 2

ˆ (n)

(n) 2

(5.22) ⊓N

C (i)

i=n where .A(n) = for , .B (n) = (η )KLG , and .J (n) = 2C (n) K2 u (n) the diminishing learning rates .η = n+v , ∀n ∈ N, with .v > 0, u > 1/δ, and

( ) 1+η2 L2 (G(n) )2

2

ˆ (n)

2 ; while .A(n) = η(1) ≤ 2+L , .B (n) = η LKG , and .J (n) = K2 2 ≤ 1δ . fixed learning rate with .η = η(n) , ∀n ∈ N, with .0 ≤ η ≤ 2+L

.

5.2.2.5

C N−n 2

for the

Convergence Analysis for AirComp-FL in Case II

Next, ⎤ consider the case with unbiased aggregation constraints, where we have ⎡ we E ε(n) = 0, ∀n ∈ N. Similar to Proposition 5.1, we have the following proposition.

.

Proposition 5.2 The expected optimality gap for AirComp-FL in the case with unbiased aggregation constraints is upper bounded by ⎡ ⎛ ⎞⎤ ⎛⎡ ⎛ ⎞⎤ ⎞ ⊓ C (n) F w(1) − F * E F w(N+1) − F * ≤ n∈N ⎛ ⎞ ⎞2 N (n) )2 σ 2 Lq Σ Σ ⎛ (n) / (n) (η z ⎠ hk pk − 1 + J (n) ⎝B (n) + K2 n=1 k∈K

.

+

N Σ

J (n)

n=1

where .B (n) = η(n) =

ˆ (n) (η(n) )2 LG K2

u n+v , ∀n ∈ N, N −n and .J (n) = C 2 for 1 2 2+L ≤ δ .

.

(η(n) )2 L ||σ ||22 , 2mb K 2 and .J (n) =

(5.23) ⊓N

i=n C 2C (n)

(i)

for the diminishing learning rates ˆ (n) 2 (n) = η2 LG 2+L ; while .B K2 η(n) , ∀n ∈ N, with .0 ≤ η ≤

with .v > 0, u > 1/δ, and .η(1) ≤ the fixed learning rate with .η =

Since the derived convergence results for both the cases of diminishing and fixed learning rates share similar form, the subsequent power control optimization will be presented targeting the case with diminishing learning rates only for brevity, while the yielded insights hold for both cases.

108

5 Federated Learning with Over the Air Computation

5.2.3 Power Control Optimization Given the convergence results of AirComp-FL, we next present the power control optimization methods for speeding up the convergence rate. We first reformulate the power constraints in (5.11) and (5.12) by leveraging Assumption 5.4 and inequality (5.19) to avoid the requirement of non-causal (n) gradient information .gk . Hence, the individual power constraints at each communication round and the entire training process are respectively reformulated as (n) ˆ (n) pk G ≤ Pkmax , ∀k ∈ K, n ∈ N, . 1 Σ (n) ˆ (n) pk G ≤ Pkave , ∀k ∈ K, N n∈N

(5.24)

.

(5.25)

where .Pkmax  q Pˆkmax and .Pkave  q Pˆkave , ∀k ∈ K, are defined for notational convenience.

5.2.3.1

Power Control Optimization for Case I

We start with the case I without unbiased aggregation constraints. Discarding the irrelevant terms in (5.22) in Proposition 5.1 (i.e., the terms related to the initial )⎤ ⎡ ( (η(n) )2 L||σ ||2 optimality gap .E F w(1) −F * , the gradient variance bound . 2m K 2 2 , and the b

(η(n) )2 σ 2 Lq

z ˜ noise power . ) in Proposition 5.1, we denote .⏀({p k }) in the following as K2 the effective optimality gap to be optimized.

(n) ˜ .⏀({p k })



(n)

N Σ

⎛ J

(n)

(n) ⎝

A

k∈K

n=1 N Σ

+

Σ

J

(n)

B

(n)

n=1

(n) hk

/

⎞2 (n) pk

− K⎠

⎞2 Σ ⎛ (n) / (n) pk − 1 . hk k∈K

(5.26)

The optimization problem is thus formulated as P1 : min

.

(n)

{pk ≥0}

(n) ˜ ⏀({p k })

s.t. (5.24) and (5.25). By introducing a set of auxiliary variables, .pˆ k(n) = objective is re-expressed as

/ pk(n) , ∀k ∈ K, n ∈ N, the

5.2 Power Control Optimization for AirComp FL

(n) .⏀({p ˆ k })

N Σ



⎛ J (n) A(n) ⎝

k∈K

n=1

+

Σ

N Σ

J (n) B (n)

⎞2 (n) (n) hk pˆ k

Σ ⎛ k∈K

n=1

109

− K⎠

⎞2 (n) (n) hk pˆ k − 1 ,

(5.27)

and problem (P1) is re-expressed as P1.1 . : min

(n) {pˆ k ≥0}

(n)

⏀({pˆ k }) (n)

max ≤ Pk,n , ∀k ∈ K, n ∈ N. ⎛ 1 Σ (n) ⎞2 ˆ (n) G ≤ Pkave , ∀k ∈ K, qˆk N n∈N

s.t. qˆk

(5.28) (5.29)

where constraints / (5.28) and (5.29) follow from (5.24) and (5.25), respectively, max  and .Pk,n

Pkmax , ∀k ˆ (n) G

∈ K, n ∈ N. Problem (P1.1) is convex and can thus

be optimally solved by the standard convex optimization techniques such as the interior point method [157]. Alternatively, to gain engineering insights, we resort to the Lagrange duality method to derive the structured optimal solution for problem (n)opt opt (P1.1). Let .{pˆ k } denote the optimal solution to problem (P1.1), and .ϕk , ∀k ∈ K the optimal dual variable associated with the k-th constraint in (5.29). Then we have the following proposition. (n)opt

Proposition 5.3 The optimal solution .pˆ k

, ∀k ∈ K, n ∈ N to problem (P1.1) is ⎤

⎡ (n)opt

pˆ k

.

(n)

where .Mk

⎢ ⎢ = min ⎢ ⎣

(n)

 B (n) hk +

⎥ B (n) + A(n) K max ⎥ , Pk,n ⎥, (n) ⎦ (n) (n) Σ hi (n) Mk + A Mk (n) Mi i∈K opt ˆ (n) ϕk G (n)

N J (n) hk

(5.30)

, ∀k ∈ K, n ∈ N. ⨆ ⨅

Proof See Appendix B of [154]. (n)opt

According to Proposition 5.3, the optimal power scaling factors .pk K, n ∈ N to problem (P1) is

, ∀k ∈

110

5 Federated Learning with Over the Air Computation

⎡⎛ (n)opt

p. k

⎢⎜ ⎢⎜ = min⎢⎜ ⎣⎝

⎞2



⎟( ⎥ B (n) + A(n) K ⎟ max )2 ⎥ , P ⎟ ⎥. Σ h(n) ⎠ k,n ⎦ i Mk(n) +A(n) Mk(n) (n) M i∈K i

(5.31)

(n)opt

According to Proposition 5.3, the optimal .{pˆ k } (equivalently the optimal (n)opt (n)opt 2 = (pˆ k ) , ∀k ∈ K, n ∈ N) exhibits a regularized power scaling factor .pk (n) Σ A(n) h(n) i Mk channel inversion structure with the regularized term . related to all (n) Mi i∈K opt dual variables .ϕk associated with the average power budgets at all edge devices in (5.29). Considering the special case when the average power budgets .{Pkave } at all devices are sufficiently large, such that the dual variables become zero at opt the same time (i.e., .ϕk = 0, ∀k ∈ K, n ∈ N). In this case, the optimal (n)opt = power ⎡ scaling strategy⎤ reduces to the channel inversion policy, i.e., .pk ⎞2 ⎛ 1 max , .∀k ∈ K, n ∈ N. Interestingly, this result is equivalent min ⎛ (n) ⎞2 , Pk,n hk

to minimizing the MSE in isolation at each communication round. In other words, in the special case when all devices have a sufficiently large average power budgets, the conventional MSE minimization can be sufficient to minimize the optimality gap.

5.2.3.2

Power Control Optimization for Case II

Next, we consider the power control optimization for the case with unbiased aggregation constraints,⎡ where ⎤ the power control policy needs to enforce the additional constraint .E ε(n) = 0, ∀n ∈ N. According to (5.20), it follows Σ (n) / (n) that . pk = K, ∀n ∈ N. In this case, the effective optimality gap in hk k∈K Proposition 5.2 is given by ⎞2 N ⎞ Σ ⎛ Σ ⎛ (n) / (n) (n) (n) (n) ˜ hk .Θ {p J B pk − 1 . k }  n=1 k∈K

(5.32)

Accordingly, we formulate the power control optimization problem as P2 : min

.

(n)

{pk ≥0}

s.t.

⎞ ⎛ ˜ {p(n) } Θ k Σ k∈K

(n)

hk

/ (n) pk = K, ∀n ∈ N

(5.24) and (5.25).

(5.33)

5.2 Power Control Optimization for AirComp FL

111 (n)

Note that problem (P2) is non-convex. However, via a change of variables .qk / pk(n) , ∀k ∈ K, n ∈ N, the objective can be re-expressed as N ⎛ ⎞ Σ ⎞2 Σ ⎛ (n) (n) hk qk − 1 , Θ {qk(n) }  J (n) B (n) n=1 k∈K

.



(5.34)

and problem (P2) can be transformed into the following equivalent convex form: P2.1 . : min (n)

{qk ≥0}

s.t.

⎛ ⎞ (n) Θ {qk } Σ k∈K

(n) (n)

hk qk

= K, ∀n ∈ N.

(n)

max ≤ Pk,n , ∀k ∈ K, ∀n ∈ N. ⎛ ⎞ Σ 1 (n) 2 ˆ (n) qk G ≤ Pkave , ∀k ∈ K, N n∈N

qk

(5.35) (5.36) (5.37)

where constraints (5.36) and (5.37) follow from (5.24) and (5.25), respectively.

5.2.3.3

Feasibility of Problem (P2.1)

Before solving problem (P2.1), we first check its feasibility, i.e., whether the power budget can support the required unbiased estimation level denoted by .l or not. Let * .l denote the maximum unbiased estimation level, which can be expressed as l* = max

.

(n)

{qk ≥0}

s.t.

(5.38)

l Σ k∈K

(n) (n)

hk qk

≥ l, ∀n ∈ N

(5.36) and (5.37). If .l* ≥ K, then problem (P2.1) is feasible; otherwise, problem (P2.1) is not feasible. Hence, the feasibility checking procedure corresponds to finding .l* by solving problem (5.38). Notice that problem (5.38) is convex, which can thus be efficiently solved via standard convex optimization techniques, such as the interior point method [157]. By comparing .l* versus K, the feasibility of problem (P2.1) is checked. In the following, we solve problem (P2.1) when it is feasible.

112

5 Federated Learning with Over the Air Computation

5.2.3.4

Optimal Solution to Problem (P2.1)

(n)opt

Let .{qk } denote the optimal solution to problem (P2.1). We have the following opt opt proposition by leveraging the Lagrange duality method, where .μn and .λk are the optimal dual variables associated with constraints (5.35) and (5.37), respectively. (n)opt

Proposition 5.4 The optimal solution .qk given as

, ∀k ∈ K, n ∈ N to problem (P2.1) is



(n)opt

qk

.

⎛ (n) where .αk  1 −

opt

=

μn 2J (n) B (n)

(n) (n) hk αk min ⎣ opt ˆ (n) 2λ G (n) (hk )2 + N Jk(n) B (n)

⎞+

⎤ max ⎦ , Pk,n ,

(5.39)

, ∀k ∈ K, n ∈ N. ⨆ ⨅

Proof See Appendix C of [154].

From (5.39) in Proposition 5.4, we can accordingly obtain the optimal power (n)opt scaling factors .pk , ∀k ∈ K, n ∈ N to problem (P2) as (n)opt

pk

.

⎞ ⎛ (n)opt 2 = qk ⎡⎛

(n) (n) hk αk ⎢ = min⎣⎝ opt ˆ (n) 2λ G (n) (hk )2 + N Jk(n) B (n)

⎞2



( max )2⎥ ⎠ , Pk,n ⎦.

(5.40)

(n)opt

According to (5.39), the optimal solution of .{qk } to problem (P2.1) (equivalently (n)opt (n)opt 2 = (qk ) , ∀k ∈ K, n ∈ N) has a similar the optimal power scaling factor .pk regularized channel inversion structure as that in (5.30), but the regularized term opt



ˆ (n) G

therein (i.e.,. N Jk(n) B (n) ) is only related to its own device k’s average power budget opt

in (5.37) through the dual variable .λk , as opposed to all devices’ budgets in (5.29) for the case without unbiased aggregation constraints. Furthermore, it is observed opt that for any edge device .k ∈ K, if .λk > 0 holds, then the average power constraint Σ ⎛ (n)opt ⎞2 (n) ˆ −P ave = of edge device k must be tight at the optimality (i.e., . N1 qk G k n∈N 0) due to the complementary slackness condition, and thus this edge device should use up its average power budget based on the regularized channel inversion power opt control over communication rounds; otherwise, if .λk = 0, then edge device k should transmit with channel-inversion power control without using up its average power budget.

5.2 Power Control Optimization for AirComp FL

113

5.2.4 Simulation Results In this section, we provide simulation results to validate the performance of the proposed power control policies for AirComp-FL. The proposed algorithms are implemented using the Matlab and Pytorch for handwritten digit recognition.

5.2.4.1

Simulation Setup and Benchmark Schemes

In the simulation, the wireless channels from the edge devices to the edge server over different communication rounds follow i.i.d. Rayleigh fading, i.e., .h(n) k ’s are modeled as i.i.d. circularly symmetric complex Gaussian (CSCG) random variables with zero mean and unit variance. We set the number of devices as .K = 10, the noise variance .σz2 = 0.1, and the average power budgets at different devices .Pˆkave to be heterogeneous3 . We set the maximum power budget to be .5Pˆ ave . We consider both u the fixed and diminishing learning rates with .η = 0.05 and .η(n) = n+v under .u = 2 and .v = 8, respectively. As for the performance metrics, the loss function value and test (recognition) accuracy are considered for handwritten digit recognition on MNIST dataset. We implement a 6-layer CNN as the classifier model, which consists of two .5 × 5 convolution layers with ReLU activation (the first with 32 channels, the second with 64), each followed by a .2 × 2 max pooling; a fully connected layer with 512 units and ReLU activation; and a final softmax output layer (.582,026 parameter in total). The local batch size at each edge device is set to be .mb = 512. Notice that Assumptions 5.1 and 5.2 may not hold in this case, but our proposed power control policies still work well as will be shown shortly. For performance comparison, we consider the following two benchmark schemes. • Fixed power transmission: The edge devices transmit with fixed power over (n) different communication rounds by setting .pk = Pkave , ∀k ∈ K. • Conventional MSE minimization: The edge devices optimize their power control to minimize the aggregation MSE in isolation at each communication round. For each round, the MSE minimization problem has been solved in [158]4 . Figure 5.2 shows the learning performance versus the varying number of communication rounds N, where the learning rates are set to be diminishing in

ave = 15W, .i = average power budgets at different devices are set as, .Pˆiave = 5W and .Pˆi+1 {1, · · · , K/2}. 4 Although the conventional channel inversion power control can achieve the unbiased aggregation, it is not the only way to achieve the unbiased aggregation and just a sufficient condition leading to unbiased aggregation. Moreover, as validated in [158], the conventional MSE minimization scheme can achieve the minimum communication distortion in AirComp. Therefore, we only consider the conventional MSE minimization scheme as one benchmark, which always outperforms the generally sub-optimal channel inversion scheme. 3 The

114 2.5

Proposed power control (Case I) Proposed power control (Case II) Fixed power transmission Conventional MSE minimization

2

Loss value

Fig. 5.2 Learning performance of AirComp-FL on MNIST dataset over number of communication rounds. (a) Loss value versus N under diminishing learning rate. (b) Test accuracy versus N under diminishing learning rate.

5 Federated Learning with Over the Air Computation

1.5

1

0.5

0 10 20

50

80

100

150

200

250

Number of communication rounds, N

(a)

(b)

Fig. 5.2a and b and those are set to be fixed with .η = 0.01 in Fig. 5.2c and d. First, it is observed that the proposed power control policies achieve lower loss function values and higher test accuracy than both the fixed-power-transmission and conventional-MSE-minimization schemes. Furthermore, the power control policy under Case II is observed to outperform that under Case I when .N > 200 with the fixed learning rate and .N > 150 with the diminishing learning rates.

5.3 Beamforming Design for MIMO AirComp FL Fig. 5.2 (continued) (c) Loss value versus N under fixed learning rate.. (d) Test accuracy versus N under fixed learning rate

115

2.5

Proposed power control (Case I) Proposed power control (Case II) Fixed power transmission Conventional MSE minimization

Loss value

2

1.5

1

0.5

0 10 20

50

80

100

150

200

250

Number of communication rounds, N

(c)

(d)

5.3 Beamforming Design for MIMO AirComp FL In this section, we introduce the performance optimization of digital AirComp FL when deployed over a MIMO communication system. We first introduce the considered digital AirComp FL model and then introduce the FL metrics and problem that we aim to optimize. We then analyze the convergence of the designed FL framework and derives a closed-form optimal design of the transmit and receive beamforming matrices based on the analysis. Finally, numerical evaluation is presented and discussed.

116

5 Federated Learning with Over the Air Computation

Fig. 5.3 The structure of a digital AirComp FL algorithm deployed over in a MIMO communication system

5.3.1 Digital AirComp FL Model We consider an FL system implemented over a cellular network, where K wireless edge devices train their individual machine learning models and send the machine learning parameters to a central PS through a noisy wireless MAC as shown in Fig. 5.3. In the considered model, the PS is equipped with .Nr antennas while each device k is equipped with .Nt antennas. Each device has .Nk training data samples and each training data sample n in device k consists of an input feature vector .x k,n ∈ RNI ×1 and a corresponding label vector .y k,n ∈ RNO ×1 where .NI and .NO are the dimension of input and output vectors, respectively. Let .g ∈ RV ×1 be a vector that represents the global FL model K Σ of dimension V trained across the devices with .N = Nk being the total number k=1 ) ( of training data samples of all devices. .f g, x k,n(, y k,n is the ) local loss function of each device k with FL model .g and data sample . x k,n , y k,n . To minimize the global loss function .F (g) in a distributed manner, each device .w k,t can update its FL model .w k using its local dataset with a backward propagation (BP) algorithm based on SGD Given .w k,t , distributed devices must simultaneously exchange their model parameters with the PS via bandwidth-limited wireless fading channels for model aggregation. To ensure all devices can participate in FL model exchanging via wireless fading channels, each device adopts digital modulation to mitigate wireless fading and the PS adopts beamforming to maximize the number of devices scheduled for FL parameter transmission. Next, we will mathematically introduce the FL training and transmission process integrated with digital modulation in the considered MIMO communication system. In particular, we first introduce our designed digital modulation process that consists of two steps: (i) digital pre-processing at the devices and (ii) digital post-processing at the PS.

5.3 Beamforming Design for MIMO AirComp FL

5.3.1.1

117

Digital Pre-processing at the Devices

To transmit .wk,t over wireless fading channels, each device k leverages digital preprocessing to represent each numerical FL parameter in .w k,t using a symbol vector, which is given by .

( ) ˆ k,t = l w k,t , w

(5.41)

ˆ k,t ∈ RW is a modulated symbol vector with W being the number where .w of symbols, and .l (·) denotes the digital pre-processing function that combines decimal-to-binary conversion and digital modulation where the decimal-to-binary conversion is used to represent each numerical FL parameter with a binary coded bit-interleaved vector and the digital modulation is used to modulate several binary ˆ k,t is normalized bits as a symbol [159]. For convenience, the modulated signal .w ˆ k,t | = 1). We use rectangular M-quadrature-amplitude modulation (QAM) (i.e., .|w for digital modulation and it can be extended to other types of digital modulation schemes. ˆ k,t to the PS at each iteration t using fullyIn our model, each device sends .w digital beamforming with low RF complexity. Given the transmit beamforming matrix .Ak,t ∈ CNt ×W and the maximal transmit power .P0 at device k, the power constraint can be expressed as .

⎛| |2 ⎞ | |2 ˆ k,t | = |Ak,t | ≤ P0 . E |Ak,t w

(5.42)

where .E (x) represents the expectation of x.

5.3.1.2

Post-processing at the PS

Considering the multiple access channel property of wireless communication, the received signal at the PS is given by

.

s t (At ) =

K Σ

ˆ k,t + nt H k Ak,t w

(5.43)

k=1

where .At = [A1,t , · · · , AK,t ] denotes the transmit beamforming matrices of all devices, .H k ∈ CNr ×Nt denotes the MIMO channel vector for the link from device k to the PS, and .nt ∈ CNr denotes additive white Gaussian noise. The entries of .H k and .nt are assumed to be i.i.d. complex Gaussian variables with zero mean. Since .s t (At ) is the weighted sum of all users’ local FL models, we consider directly generating the global FL model .g t+1 from .s t (At ). This is a major difference

118

5 Federated Learning with Over the Air Computation

between the existing works and this work. The digital beamformer output signal can be expressed as .

sˆ t (B t , At ) = B t H s t (At ),

(5.44)

where .B t ∈ CNr ×W is the digital receive beamforming matrix. Given the received symbol vector .sˆ t (B t , At ), the PS can reconstruct the numerical parameters in global FL model .g t+1 , which can be expressed as .

( ) g t+1 (B t , At ) = l −1 sˆ t (B t , At ) ,

(5.45)

where .l −1 (·) is the inverse function with respect to .l (·) that combines the binaryto-decimal function and the digital demodulation function.

5.3.2 Problem Formulation Next, we introduce our optimization problem. Our goal is to minimize the FL training loss by designing the transmit and receive beamforming matrices under the total transmit power constraint of each device, which is formulated as follows: min F (g (B T , AT )) ,

(5.46)

| |2 s.t. |Ak,t | ≤ P0 , ∀k ∈ K, ∀t ∈ T.

(5.46a)

.

.

B,A

where .A = [A1 , . . . , AT ] and .B = [B 1 , . . . , B T ] are the transmit and receive beamforming matrices for all iterations, respectively. T is a constant which is large enough to guarantee the convergence of FL. From (5.46), we can see that the FL training loss .F (g (B T , AT )) depends on the global FL model .g (B T , AT ) that is trained iteratively. Meanwhile, as shown in (5.43) and (5.44), edge devices and the PS must dynamically adjust .At and .B t based on current FL model parameters to minimize the gradient deviation caused by AirComp in the considered MIMO system with digital modulation. However, the PS does not know the gradient vector of each edge device and hence the PS cannot proactively adjust the receive beamforming matrix using traditional optimization algorithms. To tackle this challenge, we propose an ANN-based algorithm that enables the PS to predict the local FL gradient parameters of each device. Based on the predicted local FL model parameters, the PS and edge devices can cooperatively optimize the beamforming matrices to improve the performance of FL. Next, we first mathematically analyze the FL update process in the considered AirCompbased system to capture the relationship between the beamforming matrix design

5.3 Beamforming Design for MIMO AirComp FL

119

and the FL training loss per iteration. Based on this relationship, we then derive the closed-form solution of optimal .At and .B t that depends on the predicted FL models achieved by an ANN-based algorithm.

5.3.3 Optimization of Beamforming for FL Training Loss Minimization To solve (5.46), we first analyze the convergence of the considered FL so as to find the relationship between digital beamforming matrices .At , .B t , and FL training loss in (5.46). The analytical result shows that the optimization of beamforming matrices .At and .B t depends on the FL parameters transmitted by each device. However, the PS does not know these FL parameters since it must determine the beamforming matrices .At and .B t before the FL parameter transmission. Therefore, we propose to use neural networks to predict the local FL models of each device and proactively determine the beamforming matrices using these predicted FL parameters.

5.3.3.1

Analysis of the Convergence of the Designed FL

We first analyze the convergence of the considered FL system. Since the update of the global FL model depends on the instantaneous signal-to-interference-plusnoise ratio (SINR) affected by the digital beamforming matrices .At and .B t , we can analyze only the expected convergence rate of FL. To analyze the expected convergence rate of FL, we first assume that a) the loss function .F (g) is .L−smooth with the Lipschitz constant .L > 0, b) .F (g) is strongly convex with positive param||2 || eter .μ, c) .F (g) is twice-continuously differentiable, and d) .||∇f (g t , x kn , y kn )|| ≤ ||2 || ζ1 +ζ2 ||∇F (g t )|| . These assumptions can be satisfied by several widely used loss functions such as mean squared error, logistic regression, and cross entropy. Based on these assumptions, next, we first derive the upper bound of the FL training loss at one FL training step. The expected convergence rate of the designed FL algorithm can now be obtained by the following theorem. Theorem 5.2 Given the optimal global FL model .g ∗ , the current global FL model .g t , the transmit beamforming matrix .At , and the receive beamforming matrix .B t , ( ( ) ) ∗ .E F g t+1 (At , B t ) − F (g ) can be upper bounded as ( ( ) ( )) ( ( ) ( )) E F g t+1 (At , B t ) − F g ∗ ⦤ E F g t − F g ∗ .

( )|| 1 || ||∇F g t ||2 2L || ||)2 1 ( + E ||et || + ||eˆt (At , B t )|| , 2L −

(5.47)

120

5 Federated Learning with Over the Air Computation

where ⎞⎞ ⎛ ⎛ K ) ( Σ Σ ) ( Σ ⎜ l⎝ ∇f g t , x n,k , y n,k ⎠ ⎟ ∇f g t , x n,k , y n,k ⎟ ⎜ k=1 k=1 n∈Nk,t ⎟ ⎜ n∈Nk,t −1 ⎟ ⎜ . et = −l ⎜ ⎟ K K Σ Σ ⎟ ⎜ |Nk,t | |Nk,t | ⎠ ⎝ K Σ

k=1

k=1

(5.48) with the first term being the gradient trained by SGD and the second term being from a sum of all selected devices’ symbols (i.e., ⎞ ⎛ the gradient demodulated K Σ ⎜ Σ l⎝

k=1 .

n∈

Nk,t

⎟ ∇f (g t ,x n,k ,y n,k )⎠

K Σ k=1

), and

|Nk,t |

⎞⎞ ) ( ⎜ l⎝ ∇f g t , x n,k , y n,k ⎠ ⎟ ⎟ ⎜ k=1 ⎟ ⎜ n∈Nk,t ⎟ eˆt (At , B t ) =l −1 ⎜ ⎟ ⎜ K Σ ⎟ ⎜ |Nk,t | ⎠ ⎝ ⎛

K Σ



Σ

k=1

.

⎞ ⎞⎞ ) ( ⎜ Bt ⎝ H k Ak,t l ⎝ ∇f g t , x n,k , y n,k ⎠ + nt ⎠ ⎟ ⎜ ⎟ k=1 ⎜ ⎟ n∈Nk,t −1 ⎜ ⎟. −l ⎜ ⎟ K Σ ⎜ ⎟ |N | ⎝ ⎠ k,t ⎛





K Σ

Σ

k=1

(5.49) ⨆ ⨅

Proof See Appendix A of [160].

From Theorem 5.2, we see that, since .et does not depend on .At or .B t , the optimization of the digital beamforming matrices cannot minimize .et . In consequence, || || we can only minimize .||eˆ t || to decrease the gap between ( ( the )FL training) loss at iteration .t + 1 and the optimal FL training loss (i.e., .E F g t+1 − F (g ∗ ) ). Thus, problem (5.46) can be rewritten as || ||2 min ||eˆt ||

(5.50)

| |2 s.t. |Ak,t | ≤ P0 , ∀k ∈ K, ∀t ∈ T.

(5.50a)

.

.

B t ,At

5.3 Beamforming Design for MIMO AirComp FL

121

|| || To minimize .||eˆ t || in (5.50), the PS and edge devices must obtain the information of MIMO channel ⎞ vector .H k as well as the trained gradients ⎛ l⎝

.

Σ

) ( ∇f g t , x n,k , y n,k ⎠ so as to adjust .At and .B t . However, the trained

n∈Nk,t FL gradients .Δwk,t =

Σ

) ( ∇f g t , x n,k , y n,k cannot be obtained by the PS

n∈Nk,t before edge devices sending FL model parameters. || || Hence, the PS must predict ˆ t ||. .Δw k,t for optimizing .At and .B t and minimizing .||e

5.3.3.2

Prediction of the Local FL Models

Next, we explain the use of neural networks to predict the local FL model updates of all devices. Since finding a relationship among each device’s local FL model updates at different iterations is a regression task and the fully-connected multilayer perceptrons (MLPs) in ANNs are good at such tasks, we propose to use MLPs. Next, we first explain the components of the proposed ANN-based algorithm. Then, the details to implement this algorithm for predicting each local FL model update are presented. The proposed MLP-based prediction algorithm consists of three components: (a) input, (b) a single hidden layer, and (c) output, which are defined as follows: • Input: The input of the MLP that is implemented by the PS for predicting device k’s local FL model is a vector .g t−1 . As we mentioned in (5.43), all devices are able to connect with the PS so as to provide the input information for the MLP to predict the local FL models for next iteration. • Output: The output of the MLP is a vector .Δ~ w k,t that represents device k’s local FL model update at current iteration t. Based on the predicted .Δ~ wk,t , the PS can adjust the transmit and receive beamforming matrices proactively to minimize (5.50). • A Single Hidden Layer: The hidden layer of a MLP is used to learn the nonlinear w k,t . The weight relationships between input vector .g t−1 and the output vector .Δ~ matrix that represents the connection strength between the input vector and the neurons in the hidden layer is .v in ∈ CD×V where D is the number of neurons in the single hidden layer. Meanwhile, the weight matrix that captures the strengths of the connections between the neurons in the hidden layer and the output vector is .v out ∈ CV ×D . Having the components of the MLP, next, we introduce the use of the MLP to predict each device’s local FL model update. The states of the neurons in the hidden layer are given by .

⎞ ⎛ v = σ v in g k,t−1 + bv ,

(5.51)

122

5 Federated Learning with Over the Air Computation

2 where .σ (x) = 1+exp(−2x) − 1 and .bv ∈ CD×1 is the bias. Then, the output of the MLP can be given by .

Δ~ wk,t = v out v + bo ,

(5.52)

where .bo ∈ CV ×1 is a vector of bias. To predict each device’s local FL model update, the MLP must be trained by an online gradient descent method. However, in the considered model, the PS can only obtain .g t that is directly demodulated from the received signal from all devices. Hence, the PS and the devices must exchange information to train the MLP cooperatively. In particular, at each iteration, device k first generates .wk,t using its local dataset and .g t−1 received from the PS. Then, device k calculates the training loss of MLP and transmits it to the PS. Based on the value of the training loss, the PS and device k can update its MLP synchronously. Since each device only needs to transmit its training loss, the cost for information exchange can be ignored compared with the communicated DNN model weights.

5.3.3.3

Optimization of the Beamforming Matrices

Having the predicted local FL model updates .Δ~ wk,t , the PS can optimize the beamforming matrices .At and .B t to solve Problem (5.50). Substituting .Δ~ wk,t , (5.43), and (5.44) into (5.50), we have || ⎛ K ⎛ ⎛K ⎞ ⎞ ⎞||2 || || ) ) ( Σ Σ ( || H k Ak,t l Δ~ wk,t + nt ⎟|| l Δ~ wk,t ⎟ || || ⎜ Bt ⎜ || −1 ⎜ k=1 ⎜ ⎟|| ⎟ k=1 . min ||l ⎜ K || ⎟ ⎟ − l −1 ⎜ K ⎝ Σ ⎝ ⎠|| ⎠ B t ,At || Σ || || |N | |N | k,t k,t || || k=1 k=1 (5.53)

.



K Σ

| |2 s.t. |Ak,t | ≤ P0 , ∀k ∈ K, ∀t ∈ T.

(5.53a)

⎞ l (Δ~ w k,t )

⎜ In (5.53), .l −1 ⎝ k=1Σ K

k=1

|Nk,t |

⎟ ⎠ is independent of .At and .B t and can be regarded as

a constant. However, the existence of the inverse function .l −1 (·) defined in (5.45) significantly increases the complexity for solving (5.53). Considering .l −1 (·) that is used to demodulate the symbols into numerical parameters, the minimization of ⎛ FL ⎛ ⎞⎞ ⎞ ⎛K K Σ Σ w k,t )+nt l (Δ~ wk,t ) ⎜ B t k=1 H k Ak,t l (Δ~ ⎟ ⎟ ⎜ −1 ⎜ ⎟ is equivthe gap between .l −1 ⎝ k=1Σ and .l ⎠ K K ⎝ ⎠ Σ |Nk,t | |Nk,t | k=1

k=1

5.3 Beamforming Design for MIMO AirComp FL

123

Fig. 5.4 An example of 16-QAM constellation at the PS with 4 devices

K Σ

⎛ Bt

l (Δ~ w k,t )

alent to minimize the distance between . k=1Σ K k=1

|Nk,t |

and .

K Σ

⎞ H k Ak,t l (Δ~ w k,t )+nt

k=1 K Σ k=1

|Nk,t |

in

the decision region of digital demodulation, as shown in Fig. 5.4. To this end, in this K Σ

l (Δ~ w k,t )

section, we first derive the position of . k=1Σ K k=1

|Nk,t |

in the decision region and remove

l −1 (·) from (5.53) for simplification. Then, we present a closed-form optimal design of the transmit and receive beamforming matrices. Given .Δ~ w k,t and the digital pre-processing function .l(·) defined in (5.41), the ) ( Q I I ˆ k,t = l Δ~ modulated symbol vector .Δw wk,t = [Δwˆ k,t,1 Δwˆ k,t,1 , . . . , Δwˆ k,t,L

.

Q

Q

I Δwˆ k,t,L ] can be obtained where .Δwˆ k,t,i and .Δwˆ k,t,i are the i-th in-phase wk,t , respectively. Since in-phase and and quadrature symbols modulated by .Δ~ quadrature-phase symbols that have vertical and⎞horizontal decision regions are ⎛ K Σ

⎜ mutually independent, the value of .l −1 ⎝ k=1 K Σ k=1

ˆ k,t Δw |Nk,t |

⎟ ⎠ can be obtained via individually

analyzing the decision region of each in-phase and quadrature-phase symbols which are

124

5 Federated Learning with Over the Air Computation

| | | | | | K | | ξ Σ 1 | I I| Δwˆ k,t,i − ai | ⦤ , .| K | 2 | Σ | | |Nk,t | k=1 | | k=1

| | | | | | K | | ξ Σ 1 | Q Q| Δwˆ k,t,i − ai | ⦤ , | K | Σ | 2 | | |Nk,t | k=1 | | k=1 (5.54)

{ √ } √ √ √ Q M−1 where .aiI , ai ∈ M = 1−2 M ξ, 3−2 M ξ, . . . , M−3 2 ξ, 2 ξ are the constellation points | in the decision region with .M being the set of all constellation points. ξ =

⎛√

.

4P0

M−1

⎞2

is the minimum Euclidean distance between two constellation Q

points. Using (5.54), .aiI and .ai are given by

.

I at,i =

⎧ ⎪ ⎪ ⎪ ⎨

K Σ

K Σ

I Δwˆ k,t,i

⎫ ⎪ ⎪ ⎪ ⎬

I Δwˆ k,t,i

ξ ξ k=1 k=1 x∈M:− + ⦤ x ⦤ + ∩M K K ⎪ ⎪ 2 2 Σ Σ ⎪ ⎪ ⎪ ⎪ |Nk,t | |Nk,t | ⎭ ⎩ k=1

(5.55)

k=1

and

.

Q

at,i =

⎧ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎩

⎫ ⎪ ⎪ ⎪ ⎬ ξ ξ k=1 k=1 ∈M:− + ⦤ x ⦤ + ∩M . K K ⎪ 2 2 Σ Σ ⎪ ⎪ |Nk,t | |Nk,t | ⎭ K Σ

x

Q

Δwˆ k,t,i

k=1

K Σ

Q

Δwˆ k,t,i

k=1

(5.56) I a , . . . , aI a Given .a ∗t = [at,1 t,W t,W ], problem (5.53) can be rewritten as t,1 Q

Q

|| ||2 K || || Σ || || . min ˆ k,t − B t H nt || H k Ak,t Δw ||a ∗t − B t H || B t ,At ||

(5.57)

k=1

.

|| ||2 s.t. ||Ak,t || ≤ P0 , ∀k ∈ K, ∀t ∈ T.

(5.57a)

) ( ˆ k,t = l Δ~ where .Δw wk,t is a modulated symbol vector of .Δ~ wk,t . Problem (5.57) can be solved by an iterative optimization algorithm. In particular, to solve problem (5.57), we first fix .B t , then the objective functions and constraints with respect to .At are convex and can be optimally solved by using a dual method [111].

5.3 Beamforming Design for MIMO AirComp FL

125

Algorithm 4 Digital AirComp FL over MIMO Based System 1: Init: Global FL model g 0 , beamforming metrics A0 and B 0 , MIMO channel matrix H . 2: for iterations t = 0, 1, · · · , T do 3: for k ∈ {1, 2, · · · , K} in parallel over K devices do 4: Each device calculates and returns wk,t based on local dataset and g t . 5: Each device leverages digital pre-processing to modulate each model parameter into a symbol. 6: Each device sends the symbol vector wˆ n,k to the PS using the optimized transmit beamforming matrix Ak,t . 7: end for 8: The PS directly demodulates the global FL model g t+1 from the received superpositioned signal using (5.45). 9: The PS predicts the local FL model wˆ k,t+1 of each device based on demodulated g t+1 using trained ANNs. 10: The PS proactively adjusts the transmit and receive beamforming matrices using the augmented Lagrangian method and broadcast the transmit beamforming matrix Ak,t+1 to each device k. 11: end for



⎞H

⎜ Similarly, given .At , problem (5.57) is minimized as .B ∗t = ⎝ Σ K k=1

a ∗t ˆ k,t H k A∗k,t Δw

⎟ ⎠ . The

entire algorithm for solving problem (5.46) is summarized in Algorithm 4.

5.3.4 Simulation Results and Analysis We consider a circular network area having a radius .r = 1500 m with one PS at its center serving .K = 20 uniformly distributed devices. In particular, the PS allocates 56 subcarriers to all devices and the bandwidth of each subcarrier is 15 kHz. The channels between the PS and devices are modeled as the independent and identically distributed Rayleigh fading channels. All statistical results are averaged over 5000 independent runs. For comparison purposes, we consider three baselines: (a) the proposed FL algorithm implemented over noiseless wireless channels, (b) an FL algorithm that uses digital beamforming and analog modulation for FL parameter transmission [161], and (c) an FL algorithm that uses digital beamforming and BPSK for FL parameter transmission [162]. To evaluate the performance of the proposed FL, MNIST dataset [114] is used. In particular, we adopt a fully-connected neural network (FNN) that consists of two full-connection layers with 7840 (=28.×28.×10) model parameters. Each device collects 2000 data samples for training the adopted FNNs and the PS uses one MLP that consists of three layers to predict the FL gradient vector of each device. We assume that all local datasets are independent and identically distributed across the devices. All FL algorithms are considered to have converged when the value of the FL loss variance calculated over 20 consecutive iterations is less than 0.001.

126

5 Federated Learning with Over the Air Computation

Fig. 5.5 Identification accuracy vs. number of iterations on MNIST dataset

Figure 5.5 shows how the identification accuracy of all considered algorithms changes as the number of iterations varies. From Fig. 5.5, we can see that the proposed AirComp method converges much faster compared to analog FL and BPSK FL. In particular, the proposed method can improve FL convergence speed by up to 75 and 85% compared to analog FL and BPSK FL. The 75% gain stems from the fact that, the proposed FL uses digital modulation (i.e., 64 QAM) which can combat channel impairments and misalignments thus reducing the errors incurred by model transmission. The 85% gain stems from the fact that the proposed algorithm uses high-order quantization scheme instead of using one bit to represent each FL parameter so as to reduce quantization errors. Figure 5.5 also shows that the convergence speed and the identification accuracy of the proposed AirComp method are very close to the proposed FL over noiseless channels, which illustrates that our proposed method can use digital modulation to significantly reduce transmission errors caused by channel fading and noise. From Fig. 5.5, we can also see that without using MLP for FL gradient predictions, the proposed method cannot converge. This is because the PS cannot adjust beamforming matrices without knowing the gradient vectors of each device thus introducing demodulation errors. In Fig. 5.6, we show how the identification accuracy of the proposed AirComp methods changes as the modulation order M varies. In this figure, we can see that as M increases, the identification accuracy of the considered algorithms increases. This is due to the fact that, as M increases, each device can use more bits to represent one FL parameter thus reducing quantization errors. However, as M continues to increase, the identification accuracy of the proposed algorithm remains a constant. This is because as M is larger than 64, quantization errors are minimized.

5.3 Beamforming Design for MIMO AirComp FL

127

Fig. 5.6 Identification accuracy vs. number of iterations on MNIST dataset 0.95

Identification accuracy

0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 45

Proposed FL over noiseless channels Analog FL BPSK FL Proposed method with 64 QAM 35

25

15

5

SNR Fig. 5.7 Identification accuracy vs. SNR on MNIST dataset

In Fig. 5.7, we show how the identification accuracy changes as SNR decreases. From this figure, we can see that, the identification accuracy decreases as SNR decreases (equivalently noise power increases). This is because as SNR decreases, the probability of incurring transmission error increases, which results in additional errors in FL model training and decreases the FL identification accuracy. Figure 5.7

128

5 Federated Learning with Over the Air Computation

also shows that the proposed AirComp method improves the identification accuracy by up to .15% compared to analog FL when SNR is 15 dB. This is due to the fact that the proposed method can reduce transmission errors introduced by wireless channel noise via digital demodulation. Meanwhile, compared to BPSK FL, the proposed method can achieve up to .10% gain in terms of identification accuracy when SNR is 25 dB. This is because BPSK FL uses one bit to represent each FL parameter thus introducing quantization errors and decreasing FL performance in terms of identification accuracy. From Fig. 5.7, we can also see that the identification accuracy of BPSK FL remains a constant as SNR decreases to 15 dB. This is because the decision threshold of BPSK in BPSK FL is larger than that of 64 QAM in the proposed FL. However, as SNR continues to decrease, the identification accuracy of BPSK FL decreases and achieves a 15% accuracy gap compared to the proposed method.

5.4 Conclusions In this chapter, we have introduced the use of AirComp to improve the FL performance. In particular, we have first introduced the basic principle and techniques of AirComp. Then, we introduced two recent works that optimized power control and beamforming matrix design for edge devices and the PS that jointly implement FL. For each recent work, we have introduced the considered FL model, problem formulation, and the corresponding solution. We have also analyzed the simulation results for each recent work to demonstrate the effectiveness of the designed FL algorithms.

Chapter 6

Federated Learning for Autonomous Vehicles Control

The deployment of future intelligent transportation systems is contingent upon seamless and reliable operation of connected and autonomous vehicles (CAVs). One key challenge in developing CAVs is the design of an autonomous controller that can accurately execute near real-time control decisions, such as a quick acceleration when merging to a highway and frequent speed changes in a stop-and-go traffic. However, the use of conventional feedback controllers or traditional learning-based controllers, solely trained by each CAV’s local data, cannot guarantee a robust controller performance over a wide range of road conditions and traffic dynamics. In this chapter, a new federated learning (FL) framework enabled by large-scale wireless connectivity is introduced for designing the autonomous controller of CAVs. In this framework, the learning models used by the controllers are collaboratively trained among a group of CAVs. To capture the varying CAV participation in the FL training process and the diverse local data quality among CAVs, a novel dynamic federated proximal (DFP) algorithm is explained that accounts for the mobility of CAVs, the wireless fading channels, as well as the unbalanced and non-independent and identically distributed data across CAVs. A rigorous convergence analysis is performed for the introduced algorithm to identify how fast the CAVs converge to using the optimal autonomous controller. Next, we first present the control, learning, and communication models. Then, the control algorithm and its convergence proof are explained. Given the introduced control algorithm and convergence analysis, the contract-theory based incentive mechanism is introduced. Finally, simulation results are presented and discussed.

6.1 Autonomous Vehicle System Model Consider a cellular BS serving a set .N of N CAVs with the same type of vehicle dynamics that move along a road system, as shown in Fig. 6.1a. Each CAV will © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Chen, S. Cui, Communication Efficient Federated Learning for Wireless Networks, Wireless Networks, https://doi.org/10.1007/978-3-031-51266-7_6

129

130

6 Federated Learning for Autonomous Vehicles Control

Traffic jam

Crash

Base station

Road work

(a)

(b)

Fig. 6.1 Illustration of our system model. The traffic model is presented in (a) where green triangles and red squares, respectively, represent CAVs that do and do not participate in the FL process. The adaptive controller and learning models are shown in (b)

perceive its surrounding environment and accordingly adjust the controller decisions in order to achieve the target movement. FL is used to learn the controller so that CAVs can automatically change their control parameters, execute control decisions, and adapt to their local traffic. Note that, to adapt the road traffic at different times, the frequency that FL is used to update the controller design will be time-varying. We will next introduce the controller, communication, and learning models used for our FL-based autonomous controller design framework.

6.1.1 Adaptive Longitudinal Controller Model To perceive their surrounding environment, CAVs will use sensors and communicate with nearby CAVs and BS. This environmental perception enables the longitudinal controller of each CAV to automatically adjust its acceleration or deceleration and maintain a safe spacing and target speed. Due to the simplicity and ease of implementation of a PID controller, we assume that it is used by CAVs to control their longitudinal movement. Then, the acceleration .un (t) of vehicle .n ∈ N at time sample t is [163] ⎛ ⎞ Kn,d un (t) =un (t − 1) + Kn,p + Kn,i Δt + en (t) + . Δt ⎞ ⎛ Kn,d 2Kn,d en (t − 1) + en (t − 2), −Kn,p − Δt Δt

.

(6.1) (6.2)

6.1 Autonomous Vehicle System Model

131

where non-negative coefficients .Kn,p , Kn,i , and .Kn,d are, respectively, the proportional gain, integral time constant, and derivative time constant used by the PID controller at CAV .n ∈ N. .Δt is the sampling period and .en (t) = vn,r (t) − vn,a (t) captures the difference between the target reference speed .vn,r (t) and the actual speed .vn,a (t) at sample t. Note that the target reference speed is decided by the motion planner [164] in the CAV based on the environmental perception. According to (6.1), we can calculate the actual speed at sample .t + 1 as .vn,a (t + 1) = vn,a (t) + un (t)Δt and the distance traversed between samples t v (t+1)+vn,a (t) and .t + 1 as .dn,p = n,a Δt. Clearly, achieving the target speed and safe 2 spacing will depend on the control parameter setting of the PID controller. Hence, it is imperative to adjust these control parameters adaptively to deal with varying traffic dynamics and road conditions. To this end, as shown in Fig. 6.1b, we assume that the CAV will use an adaptive PID controller enabled by an artificial neural network (ANN) based auto-tuning unit. Here, we use an ANN because it is capable of capturing the nonlinear relationship between the PID control parameter setting and the longitudinal controller performance (i.e., velocity errors) [165]. Hence, the ANN-based auto-tuning unit can dynamically adjust the control parameters, and the CAVs can adapt to varying traffic scenarios. Meanwhile, to guarantee that the PID control parameters will be always positive, we use the sigmoid function as an activation function in the ANN. In this case, to adapt to various traffic conditions, the CAV will train the auto-tuning unit by using the back-propagation algorithm over its own local data and adjust the control parameters accordingly. This is an emerging approach for adaptive controller design, as discussed in [165].

6.1.2 FL Model The ANN based auto-tuning unit in Fig. 6.1b can adaptively tune the PID control parameters to achieve the target speed. However, the CAV’s local training data (e.g., camera data containing the longitudinal movement) is constrained by the onboard memory of the CAV, and, thus, the information that can be stored will be limited to a few traffic scenarios. For example, for CAVs driving on the highway, the longitudinal movement data captured by the camera will be mostly high speed data. As a result, the trained controller can only operate in the highway scenario and cannot adapt to stop-and-go traffic with frequent stops and accelerations when CAVs exit the highway and drive in urban settings. In other words, by solely training the local data for the auto-tuning unit, the controller can only work in limited traffic scenarios but not in presence of a more general traffic pattern which could jeopardize the safe operation of CAVs. To address this challenge, we can use the wireless connectivity of CAVs to build a cooperative, learning-based training framework, i.e., FL, among multiple CAVs for the controller design. Here, we consider that CAVs will engage in an FL process to collaboratively learn the ANN auto-tuning units for their adaptive controller design. In particular,

132

6 Federated Learning for Autonomous Vehicles Control

a wireless BS, operating as a parameter server, will first generate an initial global ANN model parameter .w 0 for the auto-tuning unit and send it to all CAVs over a downlink broadcast channel. Then, in the first communication round, CAVs will use the received model parameters .w0 to independently train their own model based on their local data for I iterations. Note that, due to the temporal correlation within the road traffic, the CAV will train the ANN with initial training parameters (i.e., weights and bias) that are close to the target values, guaranteeing stability [166]. Meanwhile, the motion planner can take into account the stability of the controller when designing target velocity traces [167], thereby further enhancing the controller’s stability. In the uplink, the CAVs transmit their trained model parameters to the BS. Next, the BS will aggregate all the received local model parameters to update the global model parameters which are then sent back to all CAVs over the downlink broadcast channel. This FL process is repeated over uplinkdownlink channels and the local and global ANN models are sequentially updated in the following communication rounds. Ultimately, the ANN model parameters used by the CAVs will converge to the optimal model after solving the following optimization problem that captures the FL process [11]:

.

sn N Σ Σ sn fn (w (n) , ξi ), . s (1) (N) d N w ,...,w ∈R

arg min

(6.3)

n=1 i=1

s.t. w(1) = w (2) = ... = w (N ) = w, Σ

(6.4)

where .sN = n∈N sn is the size of all training data available at the local memory of CAVs with .sn being the size of the local data at CAV n. .fn (w(n) , ξi ) is the loss function of CAV n when using the ANN model parameters .w(n) in the auto-tuning unit for the selected data .ξi . Note that, the loss function plays a pivotal role in determining the performance of the trained auto-tuning unit. The loss function used for the controller design can be either convex [168] or non-convex [169]. We assume (n) = w, n ∈ N. .f (w) to be the value of the objective function in (6.3) when .w When training the local ANN models at CAVs, we can calculate the energy consumption for CAV .n ∈ N in each communication round as .En,comp = κcφ 2 s¯ I , where .κ is the energy consumption coefficient that depends on the computing system and .s¯ is the size of training data at the local iteration. c is the number of computing cycles needed per bit, and .φ is the frequency of the central processing unit (CPU) clock of CAV. Accordingly, we can obtain the computing delay as s¯ c .tn,comp = I φ . Due to the mobility of CAVs and the wireless fading channels, some CAVs cannot finish their local training and uplink transmission within the duration .t¯ of the communication round. With this in mind, next, we present the communication model used to determine whether the locally trained model at a particular CAV can be used in the model aggregation or not.

6.1 Autonomous Vehicle System Model

133

6.1.3 Communication Model For the uplink transmissions, we consider an orthogonal frequency-division multiple access (OFDMA) scheme where each CAV in set .N will use a unique orthogonal resource block to transmit the trained ANN model parameters to the BS. In particular, the BS will allocate orthogonal subcarriers to CAVs so as to avoid interference between concurrent uplink transmissions. This is a practical assumption given that a single BS will only service a handful of CAVs that are of the same vehicular type. The data rate for the link between a CAV .n ∈ N and the BS will be ⎞ ⎛ Pn hn dn−α , .rn = B log2 1 + δn + BN0

(6.5)

where B is the bandwidth of each resource block, .Pn is the transmit power of CAV n, and .hn is the wireless fading channel gain. In particular, since line-of-sight links between CAVs and the BS do not always exist, we model these channels as independent Rayleigh fading channels [170]. Moreover, .dn is the distance between CAV n and the BS, .α is the path-loss exponent, and .N0 is the noise power spectral Σ −α P density. In addition, .δn = is the received interference power j /∈N j hj dj generated by CAVs in other cells that share the same resource block with CAV n. From (6.5), the uplink transmission delay for CAV .n ∈ N can be calculated (n) as .tn,comm = s(wrn ) , where .s(w(n) ) is the size of the data packet that depends on the trained model parameters, .w(n) , transmitted by CAV n. The uplink energy consumption is .En,comm = Pn tˆ where .tˆ is calculated by the product of the total number of data symbols and the symbol duration. In the downlink, since the BS can have a higher transmit power and a larger bandwidth, the downlink transmission delay is considered to be negligible compared to the uplink transmission delay, as assumed in [34]. In addition, given the higher computing power of BSs, the computing delay at the BS can be ignored. Hence, to identify whether the local learning model update from CAV .n ∈ N can be used for the model aggregation in the BS, we can compare the time for uplink transmission and local computing at the CAV with the duration .t¯ of the communication round. In this case, the probability that CAV .n ∈ N participates at communication round t of FL (i.e., the locally trained model at CAV n is used in the model aggregation) will be given by .pn,t = P(tn,comp + tn,comm ≤ t¯). When developing the FL framework for the CAV’s controller design, we need to address a number of challenges. The first challenge is that the BS can only aggregate a varying subset of CAVs to update the global model at each communication round as a result of the mobility of the CAVs and the uncertainty of wireless channels. A fast convergence for the controller design will be challenging to achieve when the participation of the CAVs in the FL process varies over time [171]. Meanwhile, as the local data is generated under various traffic scenarios and road incidents, its distribution and size will be different across CAVs. Hence, the second challenge will be mitigating the impact of the non-IID and unbalanced local data on the

134

6 Federated Learning for Autonomous Vehicles Control

convergence of the controller design. In the following section, we introduce a novel FL algorithm to tackle these two challenges. Moreover, due to the energy cost in the model training and uplink transmission, another challenge will be designing an incentive mechanism that encourages CAVs to participate in the introduced FL algorithm. However, to improve the convergence performance, the incentive mechanism should only motivate a subset of CAVs which can improve the convergence process of controller design, while preventing other CAVs that impede the convergence from engaging in the FL process. Such an incentive mechanism is of great importance for enabling CAVs to quickly adapt to the local traffic dynamics when exploiting the introduced FL algorithm. Next, to address this challenge, we will use the insights obtained from the convergence study of the introduced FL algorithm and design a contract-theoretic incentive mechanism.

6.2 Dynamic Federated Proximal Algorithm for CAV Controller Design To address the challenges imposed by the varying CAVs’ participation in the learning process and the non-IID and unbalanced data, we introduce a new DFP algorithm. In particular, we study how the mobility of the CAVs, wireless fading channels, and the diverse local data affect the convergence of the learning model. Here, we will first introduce the DFP algorithm and then study its convergence.

6.2.1 Dynamic Federated Proximal Algorithm The DFP algorithm is summarized in Algorithm 5. In particular, we assume that the CAVs will run I iterations of stochastic gradient descent (SGD) at each round. In each iteration of SGD, CAV .n ∈ N will solve the following optimization problem that minimizes the sum of the loss of a randomly selected local training sample γt 2 .ξ ∈ Sn and an .L2 regularizer: .arg minw∈Rn fn (w, ξ ) + 2 ||w − w t || , ξ ∈ Sn , where .γt is the coefficient of the regularizer and .w t captures the received learning model parameters from the BS at communication round t. Different from FedAvg algorithm [105], we introduce the .L2 regularizer to guarantee that the trained model parameters .w of CAV .n ∈ N will be close to .w t during the local training, reducing the variance introduced by the non-IID and unbalanced data. Meanwhile, in contrast to popular FL algorithms, such as FedProx [172], we explicitly consider the impact of CAVs’ mobility and uncertainty of wireless channels and model the participation probability as a dynamic variable for each CAV at each communication round. After I iterations of SGD at communication round t, we obtain the trained model parameters of CAV n as follows:

6.2 Dynamic Federated Proximal Algorithm for CAV Controller Design

135

Algorithm 5 Dynamic Federated Proximal (DFP) Algorithm Iutput: N, Nt , Sn , ηt , w0 , I , ut , γt , sn , n = 1, . . . , N Output: ANN-based auto-tuning unit w for the CAV’s controller t = 0, . . . , T − 1 1. The BS sends w t to all CAVs over broadcast downlink channels. (n) 2. CAV n ∈ N updates w t for I iterations of SGD with a step size as ηt in (6.6) and obtain w t+1,I which will be sent to the BS. 3. Due to the mobility and wireless fading channels, the BS can only aggregate the trained model parameters from a subset Nt of Nt CAVs and update the global model parameters as w t+1 = Σ Σ (n) sn n∈Nt sN w t+1,I with sNt = n∈Nt sn . t

I−1 ⎛ ⎞ ⎛ ⎞ Σ (n) (n) (n) fn wt+1,I = w t +ηt ∇fn(wt,i , ξi )+γt (w t,i −w t ) ,

.

(6.6)

i=0

where .w nt,0 = wt , n ∈ N.

6.2.2 Convergence of the DFP Algorithm Next, we perform a convergence study to determine how fast CAVs converge to using the optimal model in (6.3) when exploiting the DFP algorithm. Unlike the convergence study done by existing works such as [105], we need to consider how both the dynamic participation probability of CAVs and the .L2 regularizer in the local training affect the convergence. To this end, we make the following standard assumptions: • The gradient .∇fn (w), n ∈ N, is uniformly Lipschitz continuous in terms of .w with positive parameter L. • The upper bound of the variance of SGD with respect to the full gradient descent of each CAV .n ∈ N is .Eξ ∈Sn ||∇fn (w, ξ )−∇fn (w)||22 ≤ σ 2 , ∀n ∈ N, ∀w ∈ Rd , where .σ 2 is the upper bound. Both assumptions are commonly used in the convergence study of machine learning algorithms (e.g., see [173]). The first constraint can be satisfied by some popular loss functions used in control theory, such as the squared error loss function. The second constraint is often adopted in stochastic optimization where the gradient estimator is always assumed to have a bounded variance. In the autonomous controller design problem, the second constraint can be justified by the fact that CAVs have limited acceleration and deceleration capabilities. Using these two assumptions, we can bound the expected loss function at communication round .t + 1 as shown by the following theorem. Theorem 6.1 Given that the BS sends the global learning model parameters .w t to all CAVs at communication round t, an upper bound for the expected loss function at communication round .t + 1 can be written as

136

6 Federated Learning for Autonomous Vehicles Control

Eξ,n (f (wt+1 )) ≤f (w t )−

.

(ηt+γt ηt )

ΣN

2 2 n=1 pn,t sn I ||∇fn (w t )||2 ΣN 2sN j =1 pj,t sj

⎞ ΣN ⎛ pn,t sn2 2 ηt Lηt2 I 2 ηt γt 2 2 2 + (I +I (1+ηt ) )+Lηt I Σn=1 σ , + N 2sN 2sN j =1 pj,t sj

(6.7)

if the following two conditions are satisfied: .

L2 ηt2 I 2 + γt I 2 (1 + ηt )2 + 2sN Lηt I ≤ 1, .

(6.8)

L2 ηt2 γt I 2 + γt2 ηt2 I 2 + 2sN ηt γt LI ≤ 1,

(6.9)





⎜ 0 ⎜ where .pn,t = exp ⎝− δPn +BN ⎝2 d −α

⎛ ⎞ (n) s wt ⎞ ⎛ B t¯−I s¯φc

n n

⎞⎞ ⎟⎟ − 1⎠⎠.

Proof The proof is provided in Appendix A of [174].

⨆ ⨅

Using Theorem 6.1, we can calculate how much the total loss decreases between two consecutive communication rounds and determine the speed with which the model converges to the optimal auto-tuning model in (6.3). In particular, as observed from Theorem 6.1, the convergence speed hinges on the value of participation probability .pn,t , .n ∈ N. This participation probability depends on the quality of the wireless channel and the distance between the CAVs and the server, as determined by the mobility of the CAVs. In addition, to identify how the participation of a particular CAV in the FL affects the convergence, we also need to consider the size and distribution of the local data at CAVs. To do so, in the following corollary, we will first mathematically define the local data quality of CAVs and then study the impact of local data quality on the convergence of learning models. Corollary 6.1 Given the conditions in (6.8) and in (6.9), the local data quality of CAV .n ∈ N can be defined as ┌ ⎛ βn =sn2

.

ηt γt η t + 2sN 2sN

⎞ I ||∇fn (wt )||22

┐ ⎞ ηt γt ηt 2 2 2 2 2 2 2 − Lη I +Lηt I σ + (I +I (1+ηt ) )σ . 2sN t 2sN ⎛

The set .N can be divided into two subsets .N(1) and .N(2) with the negative and positive data quality, respectively. In this case, the results in (6.7) can be simplified as Σ Σ pn,t βn n∈N(1) pn,t βn . (6.10) + .f (w t )−Eξ,n (f (w t+1 )) ≥ ΣN sN j =1 pj,t sj n∈N(2)

6.2 Dynamic Federated Proximal Algorithm for CAV Controller Design

137

⨆ ⨅

Proof The proof is provided in Appendix B of [174].

According to Corollary 6.1, the local data quality for a CAV .n ∈ N can be calculated based on the size .sn of its local data samples and the loss function .fn (w t ). Also, from Corollary 6.1, we observe that the participation of CAVs within the subset .N(1) in the FL will impede the convergence whereas the participation of CAVs from subset .N(2) will improve the FL convergence. In other words, depending on the value of the data quality .βn , n ∈ N, the convergence gain contributed by different CAVs can be negative or positive. Different from the previous works in [105] and [172], we mathematically capture the local data quality and analyze the impact of diverse data quality on convergence. In the following corollary, we also extend Theorem 6.1 to the case in which the vanilla FedAvg is used for the autonomous controller design. Corollary 6.2 When using FedAvg algorithm, i.e., no .L2 regularizer in each SGD, we can obtain the following upper bound for the expected loss: ηt .Eξ,n (f (w t+1 )) ≤f (w t )− 2sN ⎛

ΣN

2 2 n=1 pn,t sn I ||∇fn (w t )||2 ΣN j =1 pj,t sj

ηt + Lη2 I 2 +Lηt2 I 2sN t

⎞ ΣN

2 n=1 pn,t sn

ΣN

j =1 pj,t sj

σ 2,

if .L2 ηt2 I 2 + 2sN Lηt I ≤ 1. Proof We can replace .γt = 0 in Theorem 6.1 to obtain the bound.

⨆ ⨅

By comparing Theorem 6.1 and Corollary 6.2, we can prove that, when the constraint (6.8) is satisfied, the DFP algorithm can achieve a smaller upper bound for the expected loss than FedAvg. In other words, the DFP method can achieve a faster convergence for the controller design in comparison to the FedAvg algorithm, leading to a fast adaptation to the traffic dynamics for CAVs. To minimize the energy spent on model training, CAVs can also dynamically adjust the number of iteration .In , n ∈ N, of the local SGD performed at each communication round. In this case, we can simplify the results in Theorem 6.1 and obtain f (wt ) − Eξ,n (f (wt+1 )) ≥ ┌ ⎛ ⎞ ┐ ΣN 2 − ηt γt (1 + η )2 σ 2 + ηt Lη2 σ 2 I 2 p s t t n n=1 n,t n 2sN 2sN + ΣN j =1 pj,t sj ⎞ ⎞ ┐ ┌ ⎛⎛ 2 ΣN γt ηt ηt 2 ηt γt σ 2 2 2 n=1 pn,t sn 2sN + 2sN ||∇fn (w t )||2− 2sN −Lηt σ In . ΣN j =1 pj,t sj

.

(6.11)

138

6 Federated Learning for Autonomous Vehicles Control

The result in (6.11) is useful for applications with stringent energy constraints, such as electric CAVs. Also, (6.11) can provide guidelines on how to choose the number of local SGD iterations at each CAV so as to facilitate the convergence to the optimal controller model. In summary, in this section, we introduced the DFP algorithm to tackle the challenges of non-IID and unbalanced data and varying participation of CAVs in the learning process when using FL for the autonomous controller. We further proved the convergence and theoretically studied how the data quality, mobility, wireless fading channels, and number of local training iterations affect the overall convergence. Based on these insights, next, we will design a contract-theory based incentive mechanism to further improve the convergence performance of the DFP algorithm.

6.3 Contract-Theory Based Incentive Mechanism Design To improve the controller convergence performance, one can design the incentive mechanism which motivates the CAVs with positive .β to participate in FL and prevents CAVs with negative .β from engaging in the FL process. However, due to the information asymmetry between the server and the CAVs, the server cannot obtain the needed information on the distribution of the local data at each CAV, let alone the data quality. To address such information asymmetry, a framework of contract theory [175] is introduced to design an efficient incentive mechanism for the FL-based autonomous controller design where the parameter server and CAVs are modeled as, respectively, employer and employees in a labor market. Contract theory is apropos here because the parameter server can avoid iterative communications with CAVs and increase its utility by allowing the CAVs to instantly choose from a limited number of designed contracts. There are many conventional approaches to design the incentive mechanism, but unlike the introduced contract-theoretic approach, they are not suitable for the CAV controller design. For example, when using the deep reinforcement learning approach [176], it will take a long time to converge to an effective incentive mechanism, inevitably delaying the controller training process and jeopardizing the CAVs’ operation. Moreover, another alternative approach is to use a Stackelberg game [177]. However, in a game setting, each CAV will seek to maximize its own individual utility and, thus, such a strategy may not maximize the parameter server’s utility as done in the contract-based approach. As will be evident from the discussion below, the utility at the parameter server is modeled as the convergence of the learning process, and maximizing the utility at the server is the key goal of our problem. Hence, to avoid a long delay and improve the FL convergence, we prefer to use contract theory over other alternatives. In the designed contract, the parameter server groups CAVs into different types according to the data quality .βn , n ∈ N, and then designs a unique contract for each type of CAVs. In this case, when faced with a list of contracts offered by the parameter server, each CAV will

6.3 Contract-Theory Based Incentive Mechanism Design

139

self-reveal the type of its local data quality by choosing the contract designed for its type. Since the data quality is contingent on how CAVs impact the FL convergence, the designed contract can improve the convergence of the FL-based controller to the optimal CAV controller. Next, we will define the utility functions for the parameter server and CAVs and design the contract for the FL-based autonomous controller design.

6.3.1 Utility Function of the Parameter Server From Corollary 6.1, we can obtain a modified data quality as .θn = sβNn , n ∈ N. Based on the modified data quality, we assume that all CAVs in set .N(2) can be categorized into M types sorted in an ascending order: .0 < θ1 ≤ . . . ≤ θM . For CAVs in the set .N(1) , their corresponding type is denoted as type 0 with .θ0 = 0. Clearly, for CAVs belonging to a higher type, their data quality is better and their participation in the FL can expedite the convergence to the optimal autonomous controller model used by CAVs. While the parameter server cannot identify the type of a CAV .n ∈ N, we assume that the parameter server has the knowledge of the probability .p¯ m that a CAV belongs to type .m ∈ {1, . . . , M} based on the historical data and previous observations, as considered in [178]. To achieve the self-revealing property, the parameter server will design the contract, i.e., the resource-reward bundle, for each type of CAVs. In particular, to compensate the energy consumption spent on the uplink transmission and local training, the resource-reward bundle for CAVs of type .m ∈ {1, . . . , M} can be written as .(Pm , Rm ), where .Rm is the reward to the CAVs with an uplink transmit power .Pm . Since CAVs belonging to subset .N(1) actually impede the FL convergence, the parameter server will not give them any compensation, i.e., .R0 = 0. This zero compensation can result in the unwillingness of those CAVs to participate in the FL process, leading to .P0 = 0. However, when incentivizing CAV .n ∈ N(2) of type .m ∈ {1, . . . , M} into FL aggregation, the utility function of the parameter server at communication round t will be ⎛



⎜ δn+BN0 ⎜ Ups (m)=u1 exp ⎝− ⎝2 Pm dn−α

.

⎛ ⎞ (n) s wt ⎛ ⎞ B t¯−I s¯φc

⎞⎞ ⎟⎟ −1⎠⎠ θm − u2 Rm ,

where .u1 captures the valuation factor for the convergence gain brought by the participation of CAVs and .u2 is the unit cost of providing a reward to the CAVs. As the CAVs in .N(1) are sorted into type 0 with the reward as .R0 = 0 and the transmit power as .P0 = 0, the average utility for the parameter server at communication round t can be written as

140

6 Federated Learning for Autonomous Vehicles Control ⎛

(n)



s w ⎞ ⎛ ⎛ δ +BN ( ⎛ t s¯c ⎞ )⎞ n 0 B t¯−I φ −1 θ −u R .Ups= p¯ m u1 exp − 2 m 2 m . Pm dn−α n=1m=1

M N Σ Σ

(6.12)

6.3.2 Utility Function of the CAVs For the CAVs, the reward received from the parameter server will be used to compensate the energy consumption spent on local model training and uplink transmission. The utility of CAVs of type .m ∈ {1, . . . , M} is thereby obtained as UCAV (m) = θm Rm − u3 (κcφ 2 s¯ I + Pm tˆ),

.

(6.13)

where .u3 is the unit cost of the energy consumption.

6.3.3 Contract Design With the utility functions obtained in (6.12) and (6.13), respectively, for the parameter server and CAVs, we can design the optimal contract which can maximize the utility, i.e., the convergence gain between two consecutive communication rounds, at the parameter server. In particular, to design a feasible contract for the autonomous controllers, two constraints must be satisfied. First, the designed contract must meet the individual rationality (IR) constraint where every CAV is rational and will not accept the contract with a negative utility [175]. That is, UCAV (m) = θm Rm −u3 (κcφ 2 s¯ I +Pm tˆ) ≥ 0, m ∈ {1, . . . , M}.

.

(6.14)

For CAVs in type 0, since .R0 = 0, the CAVs will not train their local controller model and will not participate in the uplink transmission, justifying .P0 = 0. Moreover, for a feasible contract, we must impose an incentive compatibility (IC) constraint ensuring that each type of CAVs must always prefer to choose the contract designed for their type over contracts for other types [175]. In particular, the IC constraints of contract types m and .m, ˆ ∀m, m ˆ ∈ {1, . . . , M}, will be θm Rm − u3 (κcφ 2 s¯ I + Pm tˆ) ≥ θm Rmˆ − u3 (κcφ 2 s¯ I + Pmˆ tˆ),

.

∀m, m ˆ ∈ {1, . . . , M}.

(6.15)

According to (6.14) and (6.15), we can further simplify the IR and IC constraints and obtain the list of five following conditions for a feasible contract.

6.3 Contract-Theory Based Incentive Mechanism Design

141

Lemma 6.1 The designed contract .(Pm , Rm ), m ∈ {1, . . . , M}, will be feasible if and only if the following five conditions are satisfied: •

M Σ

p¯ m RM ≤ Rtotal ,

(6.16)

m=1

• 0 ≤ R1 ≤ . . . ≤ Rm ≤ . . . ≤ RM , 0 ≤ P1 ≤ . . . ≤ Pm ≤ . . . ≤ PM ≤ Pmax • θ1 R1 − u3 (κcφ 2 s¯ I + P1 tˆ) ≥ 0,

(6.17) (6.18)

• θm−1 (Rm −Rm−1 ) ≤ u3 tˆ(Pm −Pm−1 ) ≤ θm (Rm −Rm−1 ), m ∈ {1, . . . , M},

(6.19)

where .Rtotal is total reward at the parameter server and .Pmax denotes the maximum transmit power of CAVs. The condition in (6.16) stems from the fact that the parameter server has a limited reward to offer in a contract. The proofs for conditions in (6.17)–(6.19) are similar to [175]. Based on the utility function defined in (6.12) and the conditions presented in Lemma 6.1, we can formulate the contract design into an optimization problem whose goal is to maximize the average utility at the parameter server, as follows:

.

⎛ ⎞ ⎞ ⎛ −(δn+BN0)An θ −u R p¯ m u1 exp m 2 m (Pm ,Rm )1≤m≤M Pm dn−α n=1m=1 max

M N Σ Σ

(6.20)

s.t. (6.16), (6.17), (6.18), (6.19), (n)

⎞ ⎛ ⎛s(wt s¯c) ⎞ B t¯−I φ where .An = 2 − 1 . Due to the non-concave objective function and the complex constraints, directly solving the optimization problem in (6.16)–(6.20) will be challenging. Alternatively, we will use a sequential method where the optimal power allocation is first determined in terms of the reward assignment and the optimal reward assignment for each data quality type is then derived. In the following theorem, we will study the optimal power allocation when the reward assignment is given. Theorem 6.2 Given a reward assignment .R = (R1 , . . . , RM ) that satisfies condi∗ ) that maximizes tions (6.16) and (6.17), the power allocation .P ∗ = (P1∗ , . . . , RM the average utility at the parameter server will be Pm∗ =

.

θ1 R1 − u3 κcφ 2 s¯ I Σ ρk , m ∈ {1, . . . , M}, + u3 tˆ m

k=1

(6.21)

142

6 Federated Learning for Autonomous Vehicles Control

where .ρk = 0, if .k = 1; otherwise, .ρk =

θk (Rk −Rk−1 ) . u3 tˆ

Proof To prove the optimality of the solutions in (6.21), we will proceed by contradiction. In particular, we assume there exists another feasible contract .(P ' , R) which achieves a higher average utility for the parameter server than the contract ∗ .(P , R). Since the utility function at the parameter server is an increasing function of the transmit power, there will be at least one type, e.g., type .m ˆ ∈ {1, . . . , M}, of ' CAVs with .Pmˆ > Pm∗ˆ . Here, we consider two cases with .m ˆ = 1 and .m ˆ /= 1. When .m ˆ = 1, .P1' > P1∗ . As defined in (6.21), .θ1 R1 − u3 (κcφ 2 s¯ I + P1∗ tˆ) = 0. When the CAVs belonging to type 1 are assigned to power .P1' > P1∗ , .θ1 R1 − u3 (κcφ 2 s¯ I + P1' tˆ) < 0, violating the contract feasibility condition (6.18). When .m ˆ /= 1, we have .Pm'ˆ > Pm∗ˆ . From condition (6.19), the feasible contract .(P ' , R) will satisfy the following condition: ' u3 tˆ(Pm'ˆ − Pm−1 ) ≤ θmˆ (Rmˆ − Rm−1 ). ˆ ˆ

(6.22)

.

Using the definition of .Pm∗ , m ∈ {1, . . . , M}, in (6.21), the values of .Rmˆ and .Rm−1 ˆ will meet Rmˆ − Rm−1 = ˆ

.

∗ ) u3 tˆ(Pm∗ˆ − Pm−1 ˆ

θmˆ

(6.23)

.

Based on the result in (6.23), we can simplify the results in (6.22) and obtain ' ∗ . As .P ' ≥ P ∗ , .P ' ∗ . Iteratively, the transmit Pm'ˆ − Pm∗ˆ ≤ Pm−1 − Pm−1 ≥ Pm−1 ˆ ˆ m ˆ m ˆ m−1 ˆ ˆ ' power allocated to the type 1 CAVs in .(P , R) will be less than the one in .(P ∗ , R), i.e., .P1' ≤ P1∗ , which is proved to violate the basic feasible contract constraint. Hence, there will not exist a feasible contract that achieves a better average utility at the parameter server than the contract .(P ∗ , R). In other words, for a given reward assignment .R, the power allocation in the optimal contract is calculated in (6.21). ⨆ ⨅

.

With the optimal power allocation in Theorem 6.2, we can verify that the feasible ∗ ≤ conditions in (6.17)–(6.19) will be automatically satisfied when .P1∗ ≥ 0 and .PM ∗ Pmax . Next, we can replace .Pm with .Pm , .m ∈ {1, . . . , M}, in (6.16)–(6.20) and reformulate the optimization problem as follows:

.

max R

N Σ M Σ

⎛ p¯ m u1 e

−(u3 t¯(δn +BN0 )dnα )An Σ ¯ θ1 R1 −u3 κcφ 2 s¯ I + m k=1 u3 t ρk

⎞ θm −u2 Rm

.

(6.24)

n=1m=1

s.t. R1 ≥ M Σ m=1

Pmax u3 tˆ + u3 κcφ 2 s¯ I u3 κcφ 2 s¯ I , RM ≤ ,. θ1 θM

p¯ m RM ≤ Rtotal , .

(6.25)

(6.26)

6.4 Simulation Results

143

Rm ≤ Rm+1 , m ∈ {1, . . . , M − 1},

(6.27)

∗ ≤ P where the constraints in (6.25) result from .P1∗ ≥ 0 and .PM max , and the constraint in (6.27) is derived from the feasibility constraint in (6.17). Define .R as a set of all possible non-negative reward assignments where the constraints in (6.25) are met. The Lagrangian dual function will be M N Σ Σ

⎛ p¯ m u1 ×

L(R, λ, μ) = max R∈R n=1 m=1 ⎞ ⎞ ⎛ u3 t¯(δn + BN0 )dnα Σ An θm − u2 Rm exp − ¯ θ1 R1 − u3 κcφ 2 s¯ I + m k=1 u3 t ρk

.

+ λ(Rtotal −

M Σ

p¯ m RM ) +

m=1

M−1 Σ

μm (Rm+1 − Rm ),

(6.28)

m=1

where .λ and .μ = {μ1 , . . . , μM−1 } are the Lagrangian multipliers associated to the inequality constraints (6.26) and (6.27). Hence, the dual optimization problem will be .

min L(R, λ, μ) s.t. λ ≥ 0, μ ≥ 01×(M−1) . λ,μ

(6.29)

As the dual optimization problem is always convex, it can be solved by updating Lagrangian multipliers using basic gradient based algorithms. Note that, since the objective function in (6.24) is not concave, the solution obtained in the dual optimization problem will be suboptimal. However, instead of tackling the original problem in (6.24)–(6.27) with a high complexity, the parameter server can spend less computation cost and delay when solving the low-complexity dual optimization problem. For example, when choosing the ellipsoid method to solve the dual optimization problem, the complexity will be .O((M)2 ln(1/ε)) where .ε is the accuracy requirement [111]. Once the reward assignment is determined, the transmit power allocation in the contract design can be derived using Theorem 6.2.

6.4 Simulation Results To evaluate the performance of the DFP algorithm, we use two real datasets: The BDD data [180] and the DACT data [181]. The BDD data is a large-scale driving video dataset with extensive annotations for heterogeneous tasks, and such dataset is collected under diverse geographic, environmental, and weather conditions across the United States. The DACT data is a collection of trajectories collected in the city of Columbus, Ohio, where each trajectory records more than 10 minutes of

144

6 Federated Learning for Autonomous Vehicles Control

Table 6.1 Simulation parameters Parameter .η .γ

I .Pmax .Δt .t¯ .κ c .φ .N0 B .s¯ M .α .Rtotal

Description Learning rate Coefficient for the .L2 regularizer Iteration number of local SGD Maximum transmit power Sampling period Duration of each communication round Energy consumption efficiency Number of computing cycles per bit Frequency of the CPU Noise power spectral density Bandwidth Size of randomly selected data Total number of CAV types Path-loss exponent Total reward at the parameter server

Value .0.01 .0.1

20 1W 1s .0.02 s −28 [179] .10 3 .10 [179] 9 .10 cycles/s [179] .−174 dBm/Hz 1 MHz 1000 bits 7 2.5 5.0

driving data and can be divided into multiple segments annotated by the operating pattern, like speed-up and slow-down. In terms of the traffic model, we consider a 2 .×2 km square area with 20 lanes randomly located around the center of the square area. When using BDD data and the DACT data, we assume that CAVs are randomly assigned to these 20 lanes and all the training data is randomly split among CAVs to capture the unbalanced distribution of local data. The CAVs’ velocity is determined by the headway distance to the preceding CAVs. For the auto-tuning unit used by CAVs, we consider an ANN model with two hidden layers. In particular, each hidden layer has eight fully connected neurons where the initial weights are chosen randomly from .[0, 1] and the mean squared error is used as the loss function. The values of the parameters used for simulations are summarized in Table 6.1. Figure 6.2 shows the velocity tracking performance comparison between the autonomous controllers solely trained by the local data (i.e., smooth slow-down) and trained by the DFP algorithm under different traffic scenarios. In this simulation, we consider three traffic scenarios from the DACT dataset. In particular, we choose a use case with a dramatic speed decline to represent a harsh brake in a traffic accident, the speed variations around zero as the stop-and-go traffic in a congestion, and the change of the average speed as the speed limit changes in a roadwork zone. As shown in Fig. 6.2, the controller trained by the DFP algorithm can accurately execute the control decisions and track the target speed under all three traffic scenarios. However, when using the controller trained with the local data, we can face large speed variations around the target values. For example, as shown in Fig. 6.2a, to achieve a harsh brake, the controller trained by the local data will generate sequential deceleration and acceleration instead of a constant deceleration as done by the controller trained by the DFP algorithm. In the traffic congestion and roadwork zone of Fig. 6.2b and c, the controller trained by the local data will have a more

6.4 Simulation Results

145

Fig. 6.2 Velocity variations over different traffic scenarios. (a) Harsh brake in a traffic accident. (b) Stop-and-go traffic in a congestion. (c) Speed limit changes in a work zone

(a)

(b)

(c)

146

6 Federated Learning for Autonomous Vehicles Control

Fig. 6.3 Velocity variations over time

frequent switch between acceleration and deceleration than the target speed traces, adversely impacting the driving experience of the passengers. Also, in Fig. 6.2b and c, the controller trained by the local data can make aggressive deceleration and acceleration and such behaviors will not only increase the CAVs’ maintenance costs, but it will also endanger following and preceding CAVs especially when the spacing is small. In Fig. 6.2, we also compare the controller trained by the DFP algorithm with the popular MPC with loss function as the objective function and maximum acceleration constraint as .2.5 m/s.2 , and maximum deceleration as 2 .2.5 m/s. . Note that, for MPC, the sampling rate is chosen as 2 s, due to the fact the MPC needs to solve a quadratic program with a computation complexity higher than the counterparts in the FL based controller design. In particular, we can observe that, when using the controller trained by the DFP algorithm, the longitudinal velocity trace better aligns with the target reference speed compared to the counterpart of MPC, especially in the stop-and-go traffic. Meanwhile, we calculate the mean squared errors for the controller trained by our DFP algorithm which are .0.0993, .0.0114, and .0.0032 for harsh brake, stop-and-go traffic, and speed limit changes. For the MPC, the corresponding mean squared errors will be .0.4231, .0.2751, and .0.0561, verifying the effectiveness of the DFP algorithm for the longitudinal controller design. Figure 6.3 shows the velocity tracking performance comparison between the autonomous controllers solely trained by the local data (i.e., smooth speed-up) and trained by our proposed DFP algorithm over time. In this simulation, the trajectory data in the DACT dataset is randomly assigned to the CAVs. Figure 6.3 shows that the DFP-based controller design can accurately track the target velocity over time. However, the actual velocity generated by the controller trained with local data can deviate from the target value. In particular, at time .t = 311 s, the error between the actual and target velocities can be as large as .3.17 miles/hour (.1.42 meters/second),

6.4 Simulation Results

147

1 0.9 0.8 0.7 0.6

CDF

0.5 0.4 Only trained by the local data

0.3

DFP, B = 10 MHz

0.2

DFP, B = 5 MHz

0.1 0

DFP, B = 1 MHz

0

20

40

60

80

100

Absolute distance error (m)

120

140

160

Fig. 6.4 The CDF of absolute distance errors

violating the two commonly used design criteria for a vehicle’s controller, i.e., 0.5 meters/second error upper bound [182] and .5% maximum allowable error [183]. Figure 6.4 shows the cumulative distribution function (CDF) when the controllers tracks the DACT dataset. In particular, the autonomous controllers are trained, respectively, by local data and by our proposed DFP algorithm with different bandwidth. Also, the absolute distance error is calculated by the absolute difference between the target distance in the DACT dataset and the actual distance traversed by the CAV with the designed controller at the end of each trajectory. As observed from Fig. 6.4, the controller trained by the proposed DFP algorithm yields a much smaller distance error compared with the case in which the CAVs only use their local data to train the controller model. In particular, with a .0.90 probability, the controller solely trained with local data will generate an absolute distance error of less than 80 m, two times larger than the error resulting from the DFP-based autonomous controller. Moreover, as shown in Fig. 6.4, for a larger bandwidth, the proposed DFP-based controller design will more likely yield a smaller distance error. For example, when the bandwidth .B = 10 MHz, the probability that the distance error generated by DFP-based controller remains below 20 m is around .0.80, while the counterpart for the case with a bandwidth .B = 1 MHz is around .0.68. That is because with a larger bandwidth, more CAVs can meet the time constraint .t¯ and participate in the FL, leading to a better training performance. As shown in Figs. 6.2, 6.3, and 6.4, it is clear that the autonomous controller based on the proposed DFP algorithm outperforms the baseline scheme that solely relies on the local data for training. Figure 6.5 compares the proposed DFP with FedAvg and FedProx. To test the ability of dealing with unbalanced and non-IID data for these three algorithms, we choose a larger BDD dataset. In particular, the BDD data collected under different traffic scenarios will be assigned to different vehicles unevenly to capture the unbalanced and non-IID distribution of local data. As observed from Fig. 6.5, when faced with unbalanced and non-IID training data, FedAvg and FedProx fail to converge near zero loss over 100 communication rounds. In particular, after

.

148

6 Federated Learning for Autonomous Vehicles Control 0.7 0.6

Loss

0.5 0.4 0.3

FedAvg

0.2

FedProx

0.1

DFP

0

0

20

40

60

Communication round

80

100

Fig. 6.5 Convergence performance of the DFP, FedAvg, and FedProx algorithms

100 communication rounds, the loss values for FedAvg and FedProx are near .0.62 and .0.38, respectively. The slow convergence of FedAvg stems from the fact that the training performance of FedAvg is negatively impacted by the unbalanced and non-IID data. The poor performance of FedProx can be explained by the fact that, in FedProx, the CAVs that are randomly selected for the training process might not finish the uplink transmission in time due to the path loss and fading. However, as shown in Fig. 6.5, our proposed DFP algorithm needs only around 20 communication rounds (i.e., .0.2 s) to achieve convergence, much faster than its counterparts FedAvg and FedProx. In other words, when dealing with the diverse local data and varying participation of CAVs, our proposed DFP algorithm exhibits a fast convergence to the optimal autonomous controller for CAVs. Such a fast convergence can enable the CAV to quickly adapt to the traffic dynamics and correctly track the speed determined by the motion planner. Figure 6.6 shows the training performance difference when the DFP algorithm uses the contract-theory based incentive mechanism and two baseline schemes for the power allocation among CAVs. The two baseline schemes include the maximum power allocation where all CAVs use the highest transmit power for their uplink transmission and the random power allocation where all CAVs use a randomly selected transmit power in the range from zero to the maximum power. In addition, we show the convergence for the optimal contract design where an exhaustive search algorithm is used to determine the optimal reward assignment in (6.24) and then the power allocation in the optimal contract is derived using Theorem 6.2. As shown in Fig. 6.6, we can observe that when using these four assignment strategies, the training loss will decrease as the communication round increases. However, the FL process using our proposed contract-theoretic incentive mechanism for the power allocation can achieve a faster convergence compared with random and maximum power allocation schemes. In particular, to achieve a .0.05 loss for the training

6.5 Conclusions

149

0.4

Maximum power allocation

0.35

Random power allocation

0.3

Proposed sub-optimal contract design Optimal contract design

Loss

0.25 0.2 0.15 0.1 0.05 0

0

10

20

30

40

50

60

Communication round

70

80

90

100

Fig. 6.6 Training performance of the proposed contract-based approach and two baselines

process of the controller design, the FL process with the designed scheme will only need around 30 communication rounds; whereas the corresponding communication rounds for both baseline schemes will be around 50. In other words, the introduced strategy can achieve 40% faster FL convergence speed compared with both baseline schemes. The reason is that our proposed incentive mechanism will only allocate the transmit power to the CAVs in .N(2) which bring positive convergence gain to the FL process. However, in the maximum power allocation and the random power allocation, CAVs in .N(1) will also be allowed to participate in the FL and their negative convergence gain will offset the positive gain brought by CAVs in .N(2) . Moreover, we can observe that the convergence of the suboptimal contract design is closely aligned to the optimal contract solution. In other words, the suboptimal solution is effective to design a contract which can improve the convergence of the DFP algorithm.

6.5 Conclusions In this chapter, we have introduced an FL framework to enable collaborative learning of the autonomous controller model across a group of CAVs. In particular, we have introduced a new DFP algorithm that accounts for the varying participation of CAVs in the FL process as well as diverse data quality across CAVs. We have performed a rigorous theoretical convergence analysis for the introduced algorithm, and we have explicitly studied the impact of CAVs’ mobility, uncertainty of wireless channels, as well as unbalanced and non-IID local data on the overall convergence performance. To improve the convergence of the introduced algorithm, we have designed a contract-theoretic incentive mechanism. Simulation results from using

150

6 Federated Learning for Autonomous Vehicles Control

real traces have shown that the autonomous controller designed by the introduced algorithm can track the target speed over time and under different traffic scenarios and the DFP algorithm can lead to a better controller design in comparison to the FedAvg and FedProx algorithms. Also, the simulation results have validated the feasibility of our introduced contract-based incentive mechanism and shown that the incentive mechanism can accelerate the convergence of controller models in CAVs.

Chapter 7

Federated Learning for Mobile Edge Computing

In this chapter, we introduce the use of designed FL to minimize energy and time consumption for task computation and transmission in a mobile edge computing (MEC)-enabled balloon network. In the considered network, each user needs to process a computational task in each time instant, where high-altitude balloons (HABs), acting as flying wireless base stations, can use their powerful computational abilities to process the tasks offloaded from their associated users. Since the data size of each user’s computational task varies over time, the HABs must dynamically adjust the user association, service sequence, and task partition scheme to meet the users’ needs. This problem is posed as an optimization problem whose goal is to minimize the energy and time consumption for task computing and transmission by adjusting the user association, service sequence, and task allocation scheme. To solve this problem, a support vector machine (SVM)-based FL algorithm is used to determine the user association proactively. The introduced SVM-based FL method enables each HAB to cooperatively build an SVM model that can determine all user associations without any transmissions of either user historical associations or computational tasks to other HABs. Given the prediction of the optimal user association, the service sequence and task allocation of each user can be optimized so as to minimize the weighted sum of the energy and time consumption. Simulations with real data of city cellular traffic show that the introduced solution can reduce the weighted sum of the energy and time consumption of all users by up to 16.1% compared to a conventional centralized method. Next, we first introduce the system model of the considered MEC network and then explain the problem formulation. Then, we discuss the use of the FL framework to predict user association. Given the prediction results, the optimization of service sequence and task allocation are introduced. Finally, numerical results are presented and discussed.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Chen, S. Cui, Communication Efficient Federated Learning for Wireless Networks, Wireless Networks, https://doi.org/10.1007/978-3-031-51266-7_7

151

152

7 Federated Learning for Mobile Edge Computing

7.1 MEC Network Model Consider an MEC-enabled HAB network that consists of a set .N of N HABs serving a set .M of M users over both uplink and downlink in a given geographical area, as shown in Fig. 7.1. In this model, the users are associated with the HABs via wireless links and each HAB is equipped with computational resources to provide communication and computing services to the users. For example, HABs can be equipped with computational resources for analyzing the optimal route from the current location to the destination of each ground vehicle so as to provide navigation service to ground vehicles [184]. In this network, the uplink is used to transmit the computational task that each user offloads to the HAB while the downlink is used to transmit the computing result of the offloading task. We assume that the size of each task that user m needs to process in each time instant t is .zm,t , which will be changed as time elapses.

7.1.1 Transmission Model In the considered scenario, all the communication links will use the millimeter wave (mmWave) frequency bands to provide high data rate services for ground users so as to satisfy the delay requirement of computational tasks. A time division multiple access (TDMA) scheme is adopted to support directional transmissions over the mmWave band. Note that, the channel gains of the mmWave links depend on the

Fig. 7.1 An illustration of the considered MEC-enabled HAB network model

7.1 MEC Network Model

153

instantaneous large scale and small scale fading. For HAB-ground user transmission links (air-to-ground transmission links), the large scale fading is the free space path loss and attenuation due to rain and clouds [185]. Small scale fading is modeled as Ricean fading due to the presence of line-of-sight rays from the HAB to most of the locations in the HAB service area [186]. The channel gains .gmn,t and .hmn,t between HAB n and user m over uplink and downlink during each time instant t are given by: ⎛ gmn,t =

.

⎛ hmn,t =

.

C 4π rmn fc C 4π rmn fc

⎞ · GH (𝜓 mn ) · Gm · A(dmn ) · ϕn,t ,

(7.1)

· GH (𝜓 mn ) · Gm · A(dmn ) · ϕm,t ,

(7.2)



respectively, where C is the speed of light, .fc is the carrier frequency, and .rmn is 32log2 √ the distance between HAB n and user m; .GH (𝜓 mn ) = cos(𝜓 mn )ρ 2(2 arccos( ρ 0.5))2 is the gain seen at an angle .𝜓 mn between user m and HAB n’s boresight axis with .ρ being ⎛ the⎞ roll-off factor of the antenna. .Gm is the antenna gain of user m. .A(rmn ) = 3χ rmn

10 10H is the attenuation due to clouds and rain with H being the HAB height and .χ being the attenuation through the cloud and rain in dB/km. .ϕn,t and .ϕm,t represent the small scale Ricean gain during time instant t for HAB n and user m, respectively. Since a directional antenna is adopted at each HAB, the connectivity between HAB and user can be available for data transmission only if the directional antenna is directed towards each user and hence, interference is negligible. Given a bandwidth B for each HAB, the rates of data transmission for uplink and downlink between user m and HAB n during time instant t will be: ⎛ ⎞ ( ) PU gmn,t , umn,t amn,t = amn,t Blog2 1 + σ2

(7.3)

⎛ ⎞ ( ) PB hmn,t , .dmn,t amn,t = amn,t Blog2 1 + σ2

(7.4)

.

respectively, where .amn,t is the index of the user association with .amn,t = 1 indicating that user m connects to HAB n at time instant t, otherwise, we have .amn,t = 0. .PB and .PU are the transmit power of each HAB and user, which are assumed to be equal for all HABs and users, respectively. .σ 2 represents the variance of the additive white Gaussian noise. The uplink and downlink transmission delay between user m and HAB n at time instant t can be given by: ( ) U βmn,t , amn,t = lmn,t

.

βmn,t zm,t ( ), umn,t amn,t

( ) D βmn,t , amn,t = lmn,t

βmn,t zm,t ( ), dmn,t amn,t (7.5)

154

7 Federated Learning for Mobile Edge Computing

respectively, where .βmn,t zm,t is the fraction of the task that user m transmits to HAB n for processing in each time instant t with .βmn,t ∈ [0, 1] being the task division parameter.

7.1.2 Computing Model In the considered model, each user’s computational task can be cooperatively processed on the HAB, a process that we call edge computing, or it can use local computing on the user itself. Next, we introduce the models of edge computing and local computing in detail. 7.1.2.1

Edge Computing Model

Given the data size .βmn,t zm,t of the task that is offloaded from user m, the time used for HAB n to process the task can be given by: ( ) ωB βmn,t zm,t B βmn,t = , lmn,t fB

.

(7.6)

where .f B is the frequency of the central processing unit (CPU) clock of each HAB n assumed to be equal for all HABs. .ωB is the number of CPU cycles required for computing data (per bit). 7.1.2.2

Local Computing Model

Given the data size .(1 − βmn,t )zm,t of the task that is computed locally, the time used for user m to process the task can be given by: L .lmn,t

( ) U 1−β ( ) ωm mn,t zm,t βmn,t = , fmU

(7.7)

U is the number of where .fmU is the frequency of the CPU clock of user m and .ωm CPU cycles required for computing the data (per bit) of user m.

7.1.3 Time Consumption Model In the proposed model, since users and HABs can process their computational task simultaneously, the total time used for the task computation is determined by the maximum time between the local computing time and edge computing time. Thus,

7.1 MEC Network Model

155

based on (7.5)–(7.7), the time needed by user m and HAB n to cooperatively process the computational task of user m can be given by: ⎧ ( ) ( ) ( ) U B βmn,t , amn,t + lmn,t βmn,t lmn,t βmn,t , amn,t = max lmn,t

.

D + lmn,t

⎫ ( ) L ( ) βmn,t , amn,t , lmn,t βmn,t ,

(7.8)

( ) ( ) ( ) U B D βmn,t , amn,t + lmn,t βmn,t + lmn,t βmn,t , amn,t represents the edge where .lmn,t ( ) L βmn,t represents the local computing time. computing time and .lmn,t Moveover, since TDMA is used in the considered model, each user must wait for service, thus incurring a wireless access delay. For a given user m that is associated with HAB n, this access delay can be given by: S lmn,t (qmn,t ) =

Σ

.

lm' n,t (am' n,t , βm' n,t ),

(7.9)

Qm

m' ∈

| | | | where .qmn,t is a service sequence variable that satisfies .1 ≤ qmn,t ≤ |a n,t |. .|a n,t | is the module of |.a n,t and represents the number of users that are associated with HAB n. .Qm = {m' |qm' n,t < qmn,t } is the set of users that are served by HAB n before user m. Given the access delay and processing delay of each user, the total delay for user m to process a computational task can be given by: ( ) S tm,t (βmn,t , amn,t , qmn,t ) = lmn,t (qmn,t ) + lmn,t βmn,t , amn,t .

.

(7.10)

7.1.4 Energy Consumption Model In our model, the energy consumption of each user consists of three components: (a) Device operation energy consumption, (b) Data transmission energy consumption, and (c) Data computing energy consumption. Here, the device operation energy consumption relates to the energy consumption caused by the users using their devices for any applications. The energy consumption of user m can be given by [187]: ⎛ ⎞2 ( ( ) ) ( ) U 1 − βmn,t zm,t + PU lmn,t βmn,t , amn,t , em,t βmn,t , amn,t = Om + ςm fmU

.

(7.11) where .Om is the energy consumption of device operation and .ςm is the energy consumption coefficient depending on the chip of user m’s device. In (7.11), ( U )2 ( ) 1 − βmn,t zm,t is the energy consumption of user m computing the size .ςm fm

156

7 Federated Learning for Mobile Edge Computing

( ) ( ) U βmn,t , amn,t represents the of task . 1 − βmn,t zm,t at its own device and .PU lmn,t energy consumption of task transmission from user m to HAB n. Similarly, the energy consumption of each HAB can be given by: ⎛ ⎞2 ( ) ( ) D en,t βmn,t , amn,t = On + ς f B βmn,t zm,t + PB lmn,t βmn,t , amn,t ,

.

(7.12)

where .On is the energy consumption of hover for the HAB and .ς is the energy consumption coefficient depending on the chip of HAB’s device. In (7.12), ( B )2 .ς f βmn,t zm,t is the energy consumption of HAB n computing the data size ( ) D .βmn,t zm,t of task that is offloaded from user m and .PB lmn,t βmn,t , amn,t represents the energy consumption of task transmission from HAB n to user m.

7.2 Problem Formulation We now formally pose our optimization problem whose goal is to minimize weighted sum of the energy and time consumption of each user. The minimization problem of the energy and time consumption for all users involves determining user association, service sequence, and the size of the data that must be transmitted to the HAB, as per the below formulation:

.

min

At ,Qt ,β t

.

M T Σ Σ ( ( ) ( )) γE em,t βmn,t , amn,t + γT tm,t βmn,t , amn,t , qmn,t

(7.13)

t=1 m=1

s. t. amn,t ∈ {0, 1} , ∀n ∈ N, ∀m ∈ M,. Σ amn,t ≤ 1, ∀m ∈ M, . n∈N | | 1 ≤ qmn,t ≤ |a n,t | , qmn,t ∈ Z+ , ∀m ∈ M, ∀n ∈ N, . '

'

(7.13a) (7.13b) (7.13c)

qmn,t /= qm' n,t , ∀m /= m , m, m ∈ M, ∀n ∈ N, .

(7.13d)

0 ≤ βmn,t ≤ 1, ∀m ∈ M, ∀n ∈ N, .

(7.13e)

M Σ

( ) en,t βmn,t , amn,t ≤ Et , ∀n ∈ N,

(7.13f)

m=1

where .At = [a 1,t , . . . , a N,t ] with .a n,t = (a1n,t , . . . , aMn,t ), .Qt = [q 1,t , . . . , q N,t ] with .q n,t = (q1n,t , . . ., .qmn,t ), and .β t = [β 1,t , . . . , β N,t ] with .β n,t = (β1n,t , . . . , βMn,t ). .γE and .γT are weighting parameters that combine the value of energy and time consumption into an integrated utility function. Equations (7.13a) and (7.13b) ensure that each user can connect to only one HAB for task processing.

7.3 Federated Learning for Proactive User Association

157

Equations (7.13c) and (7.13d) guarantee that each HAB can only process one computational task at each time instant. Equation (7.13e) indicates that the data requested by each user can be cooperatively processed by both HABs and users. Equation (7.13f) is the energy constraint of HAB n at time instant t. As the data size of the requested computational task varies, the HABs must dynamically adjust the user association, service sequence, and task allocation to minimize each user’s energy and time consumption. The problem in (7.13) is challenging to solve by conventional optimization algorithms due to the following reasons. First, each HAB must collect the information related to the computational task requested by each user so as to minimize the energy and time consumption of ground users. However, each computational task is generated by a ground user and, hence, each HAB can only collect the information related to the computational tasks of its associated users instead of all users’ computational information. When using optimization techniques, given the computational task information of only a fraction of the users, each HAB must use traditional iterative methods to find the globally optimal user association thus increasing the delay for processing computational task. Second, as the data size of each computational task varies, the HABs must re-execute the iterative methods which leads to additional delays and overhead. Thus, we need a machine learning approach that can predict the optimal user association via using the information collected by each HAB itself. Based on the predicted optimal user association, each HAB can collect the data size of the computational task from its associated users thus optimizing service sequence and task allocation for the users. User association can be considered as a multi-classification problem and SVM methods are good at solving such problems [141]. Hence, we introduce an SVM-based machine learning approach for predicting user association. In addition, exchanging the information related to historical computational task request among HABs can lead to significant energy consumption [188]. Thus, we introduce an SVM-based FL algorithm to determine the user association proactively so as to minimize the energy and time consumption. The introduced algorithm enables each HAB to use its local dataset to collaboratively train an optimal SVM model that can determine user association for all users while keeping the training data local. Based on the proactive user association, the optimization problem in (7.13) can be simplified and solved.

7.3 Federated Learning for Proactive User Association

Next, we introduce the training process of the SVM-based FL model for predicting user association. The proposed algorithm first lets each HAB train an SVM model locally, using its locally collected data, so as to build a relationship between each user's future association and the data size of the task that the user must currently process. Each HAB then exchanges its trained SVM model with the other HABs to integrate the trained SVM models and improve its local model, so that the HABs collaboratively perform a prediction for each user without exchanging training data.


7.3.1 Components of the SVM-Based FL

An SVM-based FL algorithm consists of four components: (a) agents, (b) input, (c) output, and (d) the SVM model, which are defined as follows:

• Agents: The agents in our system are the HABs. Since each SVM-based FL algorithm typically performs prediction for just one user, each HAB must implement M SVM-based FL algorithms to determine the optimal user association for all users. Hereinafter, we introduce the SVM-based FL algorithm for the prediction of user m's future association; for simplicity, "an SVM-FL model of HAB n" is short for the SVM-FL model that HAB n uses for this prediction.
• Input: The input of the SVM-based FL algorithm implemented by HAB n for predicting user m's future association is defined by $\mathcal{X}_{mn}$, which includes user m's associations and the data sizes of its requested tasks at historical time instants. Here, $\mathcal{X}_{mn}=\{(\boldsymbol{x}_{m,1},a_{mn,1}),\ldots,(\boldsymbol{x}_{m,K},a_{mn,K})\}$, where K is the number of data samples of user m collected by HAB n. In $(\boldsymbol{x}_{m,k},a_{mn,k})$, $\boldsymbol{x}_{m,k}=[x^{X}_{m,k},x^{Y}_{m,k},z_{m,k}]^{T}$, with $x^{X}_{m,k}$ and $x^{Y}_{m,k}$ being the location of user m at the current time instant and $z_{m,k}$ the data size of its requested task, while $a_{mn,k}$ is the index of the association between user m and HAB n at the next time instant.
• Output: The output of the proposed algorithm performed by HAB n for predicting user m's future association at time instant t is $a_{mn,t+1}$, which represents the user association between HAB n and user m at the next time instant.
• SVM model: For each user m, we define an SVM model represented by a vector $\boldsymbol{w}_{mn}$ and a matrix $\Omega_m\in\mathbb{R}^{N\times N}$, where $\boldsymbol{w}_{mn}$ approximates the prediction function between the input $\boldsymbol{x}_{m,t}$ and the output $a_{mn,t+1}$, thus building the relationship between the future user association and the data size of the task that user m currently needs to process. $\Omega_m$ measures the difference between the SVM model generated by HAB n and the SVM models generated by the other HABs for determining user m's future association, hence improving the prediction performance of HAB n's local SVM model.
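For illustration, the sketch below encodes one local training pair $(\boldsymbol{x}_{m,k},a_{mn,k})$ from $\mathcal{X}_{mn}$ in Python, assuming the three-dimensional feature layout $[x^{X}_{m,k},x^{Y}_{m,k},z_{m,k}]^{T}$ described above; the field names and numerical values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    x_loc: float    # x^X_{m,k}: user m's x-coordinate at time instant k
    y_loc: float    # x^Y_{m,k}: user m's y-coordinate at time instant k
    z_size: float   # z_{m,k}: data size (bits) of the task requested at instant k
    a_next: int     # a_{mn,k}: association with HAB n at the next time instant

# A toy local dataset X_mn with K = 2 samples collected by HAB n for user m.
X_mn = [Sample(1200.0, 3400.0, 2.5e5, 1), Sample(1250.0, 3390.0, 4.0e4, 0)]
```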

7.3.2 Training of SVM-Based FL

We must train the SVM-based FL algorithm to determine each user m's association with all HABs. Training is performed by solving [32]:

$$\min_{W_m,\,\Omega_m}\ \sum_{n=1}^{N}\sum_{k=1}^{K}\ell_n\big(\boldsymbol{w}_{mn},(\boldsymbol{x}_{m,k},a_{mn,k})\big)+R(W_m,\Omega_m), \qquad (7.14)$$

$$\text{s.t.}\quad \Omega_m\succeq 0,\qquad \mathrm{tr}(\Omega_m)=1, \qquad (7.14a)$$


where $\ell_n\big((\boldsymbol{w}_{mn})^{T}\boldsymbol{x}_{m,k},a_{mn,k}\big)=\big(a_{mn,k}-(\boldsymbol{w}_{mn})^{T}\boldsymbol{x}_{m,k}\big)^2$ is a loss function that measures the squared error between the predicted and the target user association. $R(W_m,\Omega_m)=\lambda_1\|W_m\|_F^2+\lambda_2\,\mathrm{tr}\big(W_m(\Omega_m)^{-1}(W_m)^{T}\big)$ with $\lambda_1,\lambda_2>0$ is used to collaboratively build an SVM-based FL model, where $\|W_m\|_F^2$ performs $L_2$ regularization on each local model and $\mathrm{tr}(W_m(\Omega_m)^{-1}(W_m)^{T})$ captures the relationship among the SVM models so as to improve the performance of the SVM models used to determine user m's association. In (7.14a), $\Omega_m\succeq 0$ implies that matrix $\Omega_m$ is positive semidefinite. To solve the optimization problem in (7.14), we observe the following: (a) given $\Omega_m$, updating $W_m$ depends on the data pairs $(\boldsymbol{x}_{m,k},a_{mn,k})$ collected by HAB n, and (b) given $W_m$, optimizing $\Omega_m$ depends only on $W_m$ and not on the data pairs. Based on these observations, it is natural to divide the training process of the proposed algorithm into two stages: (a) a $W_m$ training stage in which HAB n updates $\boldsymbol{w}_{mn}$ using its locally collected data, and (b) an $\Omega_m$ training stage in which HAB n first transmits $\boldsymbol{w}_{mn}$ to the other HABs to generate $W_m$ and then calculates $\Omega_m$ from $W_m$, capturing the relationship between the SVM model generated by HAB n and the SVM models generated by the other HABs for determining user m's future association, thus improving $\boldsymbol{w}_{mn}$ for each HAB n. Next, we introduce the two stages of the training process.

• $W_m$ training stage: In this stage, HAB n updates $\boldsymbol{w}_{mn}$ based on its local dataset $\mathcal{X}_{mn}$ and the $\Omega_m$ calculated at the last iteration. We first introduce the use of a quadratic approximation to divide the optimization problem in (7.14) into distributed subproblems; then the distributed subproblem solved by each HAB is presented. Given $\Omega_m$, the dual problem of (7.14) can be written as:

$$\min_{\boldsymbol{\alpha}_m}\ D(\boldsymbol{\alpha}_m)=\sum_{n=1}^{N}\sum_{k=1}^{K}\ell_n^{*}(-\alpha_{mn,k})+R^{*}\big(X_m\boldsymbol{\alpha}_m\,\big|\,\Omega_m\big), \qquad (7.15)$$

where $\ell_n^{*}(-\alpha_{mn,k})=\max\big(-\alpha_{mn,k}\boldsymbol{w}_{mn}\boldsymbol{x}_{m,k}-\ell_n(\boldsymbol{w}_{mn}\boldsymbol{x}_{m,k})\big)$ and $R^{*}(X_m\boldsymbol{\alpha}_m|\Omega_m)=\max\big(X_m\boldsymbol{\alpha}_m W_m-R(W_m|\Omega_m)\big)$. In (7.15), $X_m=\mathrm{Diag}[X_{m1},\ldots,X_{mN}]$ and $\boldsymbol{\alpha}_m=[\boldsymbol{\alpha}_{m1},\ldots,\boldsymbol{\alpha}_{mN}]$, where $\boldsymbol{\alpha}_{mn}=[\alpha_{mn,1},\ldots,\alpha_{mn,K}]$ with $\alpha_{mn,k}$ being the dual variable for the data sample $(\boldsymbol{x}_{m,k},a_{mn,k})$. Note that, given the dual variables $\boldsymbol{\alpha}_{mn}$, the primal variables $\boldsymbol{w}_{mn}$ can be found via $W_m(\boldsymbol{\alpha}_m)=\nabla R^{*}(X_m\boldsymbol{\alpha}_m|\Omega_m)$, where $\boldsymbol{w}_{mn}$ is column n of $W_m(\boldsymbol{\alpha}_m)$. To solve (7.15) in a distributed manner, we define a local dual problem that approximates (7.15). Using a quadratic approximation, the local dual problem


will be:

$$\min_{\Delta\boldsymbol{\alpha}_{mn}}\ G_n^{\sigma}\big(\Delta\boldsymbol{\alpha}_{mn};\boldsymbol{w}_{mn},\boldsymbol{\alpha}_{mn}\,|\,\Omega_m\big)=\sum_{k=1}^{K}\ell_n^{*}\big(-\alpha_{mn,k}-\Delta\alpha_{mn,k}\big)+\frac{\sigma}{2\mu_1}\big\|X_{mn}\Delta\boldsymbol{\alpha}_{mn}\big\|^2+R^{*}\big(X_{mn}\boldsymbol{\alpha}_{mn}\,|\,\Omega_m\big), \qquad (7.16)$$

where $\sigma=\max_{\boldsymbol{\alpha}_{mn}\in\mathbb{R}^{K}}\frac{\|X_m\boldsymbol{\alpha}_{mn}\|^2}{\sum_{n=1}^{N}\|X_{mn}\boldsymbol{\alpha}_{mn}\|^2}\in(0,1)$ measures the correlation between the HABs' datasets, each of which includes user m's historical user associations and the data sizes of the requested tasks. $\Delta\boldsymbol{\alpha}_{mn}=[\Delta\alpha_{mn,1},\ldots,\Delta\alpha_{mn,K}]$ represents the difference between $\boldsymbol{\alpha}_m$ in (7.15) and $\boldsymbol{\alpha}_{mn}$ in (7.16). From (7.16), we can see that solving the local dual problem requires only the data collected by HAB n itself. Hence, the problem in (7.15) can be approximated by (7.16) and solved by each HAB in a distributed manner. Note that, since a quadratic approximation is used to solve $D(\boldsymbol{\alpha}_m)$ in (7.15), the performance loss introduced by this approximation is $D(\boldsymbol{\alpha}_m)-\sum_{n=1}^{N}G_n^{\sigma}(\Delta\boldsymbol{\alpha}_{mn};\boldsymbol{w}_{mn},\boldsymbol{\alpha}_{mn}|\Omega_m)$. In Sect. 7.5, we quantify this performance loss and show that, as the number of iterations increases, the value of $D(\boldsymbol{\alpha}_m)-\sum_{n=1}^{N}G_n^{\sigma}(\Delta\boldsymbol{\alpha}_{mn};\boldsymbol{w}_{mn},\boldsymbol{\alpha}_{mn}|\Omega_m)$ decreases, and thus the solution of the local dual problem in (7.16) converges to the solution of the global dual problem in (7.15).

• $\Omega_m$ training stage: In this stage, each HAB n first transmits $\boldsymbol{w}_{mn}$ to the other HABs and generates $W_m$. Based on $W_m$, each HAB n calculates a structure matrix $\Omega_m$ that measures the difference of the $\boldsymbol{w}_{mn}$ among HABs and builds an SVM model that quantifies the relationship between user association and the historical computational task information, so as to predict the association result for all users. Given $W_m$, (7.14) can be rewritten as:

$$\min_{\Omega_m}\ \mathrm{tr}\big(W_m(\Omega_m)^{-1}(W_m)^{T}\big), \qquad (7.17)$$

$$\text{s.t.}\quad \Omega_m\succeq 0,\qquad \mathrm{tr}(\Omega_m)=1. \qquad (7.17a)$$

From (7.17), we can see that, compared with a standard FL algorithm that directly averages the learning parameters $W_m$, the proposed FL algorithm uses a matrix $\Omega_m$ to capture the relationship among all HABs' user association schemes. This approach can, in turn, improve the FL prediction performance. Given (7.17) and (7.17a), we have:

$$\mathrm{tr}\big(W_m(\Omega_m)^{-1}(W_m)^{T}\big)=\mathrm{tr}\big(W_m(\Omega_m)^{-1}(W_m)^{T}\big)\,\mathrm{tr}(\Omega_m)\ \ge\ \Big(\mathrm{tr}\big((\Omega_m)^{-\frac12}\big((W_m)^{T}W_m\big)^{\frac12}(\Omega_m)^{\frac12}\big)\Big)^{2}=\Big(\mathrm{tr}\big(\big((W_m)^{T}W_m\big)^{\frac12}\big)\Big)^{2}, \qquad (7.18)$$

where the inequality holds due to the Cauchy-Schwarz inequality for the Frobenius norm. Moreover, $\mathrm{tr}(W_m(\Omega_m)^{-1}(W_m)^{T})$ achieves its minimum value $\big(\mathrm{tr}\big(((W_m)^{T}W_m)^{\frac12}\big)\big)^{2}$ if and only if $(\Omega_m)^{-\frac12}\big((W_m)^{T}W_m\big)^{\frac12}=a\,\Omega_m$ for some constant a and $\mathrm{tr}(\Omega_m)=1$. Given (7.18), we have:

$$\Omega_m=\frac{\big((W_m)^{T}W_m\big)^{\frac12}}{\mathrm{tr}\big(\big((W_m)^{T}W_m\big)^{\frac12}\big)}. \qquad (7.19)$$

At each learning step, HAB n first updates $\boldsymbol{w}_{mn}$ based on $\mathcal{X}_{mn}$ and $\Omega_m$, and then broadcasts $\boldsymbol{w}_{mn}$ to the other HABs and calculates $\Omega_m$. Note that the data size of $\boldsymbol{w}_{mn}$ is negligible compared with the data size of each computational task; hence, the energy and time consumption for training the proposed FL model is neglected. As the proposed algorithm converges, the optimal $W_m$ and $\Omega_m$ are found, solving problem (7.14). The entire process of training the proposed SVM-based FL algorithm is shown in Algorithm 6.

Algorithm 6 Support vector machine based federated distributed learning framework
1: Input: Data $\mathcal{X}_{mn}$ from $n=1,\ldots,N$ HABs, each stored on one of the N HABs.
2: Initialize: $\Omega_m$ generated randomly via a uniform distribution.
3: $\boldsymbol{\alpha}^{(0)}:=\boldsymbol{0}$.
4: for iterations $i=0,1,\ldots$ do
5:   for $n\in\{1,2,\ldots,N\}$ in parallel over N HABs do
6:     Calculate and return the $\Delta\boldsymbol{\alpha}_{mn}$ of the local subproblem in (7.16).
7:     Update the local variables $\boldsymbol{\alpha}_{mn}\leftarrow\boldsymbol{\alpha}_{mn}+\Delta\boldsymbol{\alpha}_{mn}$.
8:     Return the updates $\boldsymbol{w}_{mn}$.
9:   end for
10:  Broadcast $\boldsymbol{w}_{mn}$ and collect the trained SVM models from the other HABs; save them as $W_m$.
11:  Update $\Omega_m$ based on $W_m$ for the latest $\boldsymbol{\alpha}_{mn}$.
12: end for
Output: $W_m:=[\boldsymbol{w}_{m1},\boldsymbol{w}_{m2},\ldots,\boldsymbol{w}_{mN}]$.
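As a concrete illustration of the $\Omega_m$ training stage, the following numpy sketch implements the closed-form update (7.19) and verifies the constraints in (7.17a); the dimensions and the random $W_m$ are hypothetical, and `scipy.linalg.sqrtm` stands in for the matrix square root $((W_m)^{T}W_m)^{1/2}$.

```python
import numpy as np
from scipy.linalg import sqrtm

def update_omega(W: np.ndarray) -> np.ndarray:
    """Structure-matrix update of Eq. (7.19): Omega = (W^T W)^{1/2} / tr((W^T W)^{1/2})."""
    S = np.real(sqrtm(W.T @ W))    # W^T W is PSD, so its square root is real
    return S / np.trace(S)

W = np.random.default_rng(0).normal(size=(6, 4))   # d = 6 features, N = 4 HABs (toy)
Omega = update_omega(W)
assert np.isclose(np.trace(Omega), 1.0)            # tr(Omega_m) = 1, cf. (7.17a)
assert np.all(np.linalg.eigvalsh(Omega) >= -1e-9)  # Omega_m is positive semidefinite
```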

7.4 Optimization of Service Sequence and Task Allocation

Once the user association is determined, the HABs can optimize the service sequence and task allocation for each user so as to solve (7.13). Since we use directional antennas, interference among HABs is negligible. In consequence,


problem (7.13) is independent for each HAB and can be decoupled into multiple subproblems. Given the user association, problem (7.13) for HAB n can be rewritten as:

$$\min_{\boldsymbol{\beta}_{n,t},\,\boldsymbol{q}_{n,t}}\ \sum_{t=1}^{T}\sum_{m=1}^{M}\Big(\gamma_E\, e_{m,t}\big(\beta_{mn,t}\big)+\gamma_T\, t_{m,t}\big(\beta_{mn,t},q_{mn,t}\big)\Big) \qquad (7.20)$$

$$\text{s.t.}\quad (7.13c)\text{–}(7.13f).$$

Problem (7.20) is a mixed-integer programming problem due to the discrete variables $q_{mn,t}$ and the continuous variables $\beta_{mn,t}$. To solve (7.20), the following result is used to separate the variables $q_{mn,t}$ and $\beta_{mn,t}$:

Lemma 7.1 Given the data size $z_{m,t}$ of each computational task, the user association index $a_{mn,t}$, and the service sequence variable $q_{mn,t}$, the processing delay for the users associated with HAB n will be:

$$\sum_{m=1}^{|\boldsymbol{a}_n|}\gamma_T\, t_{mn,t}\big(q_{mn,t},\beta_{mn,t}\big)=\sum_{m=1}^{|\boldsymbol{a}_n|}\gamma_T\big(|\boldsymbol{a}_n|-q_{mn,t}+1\big)\,l_{mn,t}(\beta_{mn,t}). \qquad (7.21)$$

Proof The enumeration method is used to prove Lemma 7.1.

– If the number of users associated with HAB n is 1, i.e., $|\boldsymbol{a}_{n,t}|=1$, then the sum delay for processing the computational task requested by user m is:

$$\sum_{m=1}^{|\boldsymbol{a}_{n,t}|}\gamma_T\, t_{mn,t}\big(q_{mn,t},\beta_{mn,t}\big)=\gamma_T\, l^{S}_{mn,t}(q_{mn,t})+\gamma_T\, l_{mn,t}(\beta_{mn,t})=\gamma_T\, l_{mn,t}(\beta_{mn,t}),$$

where the last equality stems from the fact that the first scheduled user associated with HAB n finishes its computational task without access delay, i.e., $l^{S}_{mn,t}(q_{mn,t})=0$ with $q_{mn,t}=1$.
– If the number of users associated with HAB n is 2, i.e., $|\boldsymbol{a}_{n,t}|=2$, then the sum delay for processing the computational tasks requested by the two users is:

$$\sum_{m=1}^{|\boldsymbol{a}_{n,t}|}\gamma_T\, t_{mn,t}\big(q_{mn,t},\beta_{mn,t}\big)=\gamma_T\Big(l^{S}_{m'n,t}(q_{m'n,t})+l_{m'n,t}(\beta_{m'n,t})+l^{S}_{mn,t}(q_{mn,t})+l_{mn,t}(\beta_{mn,t})\Big)=\gamma_T\Big(2\,l_{m'n,t}(\beta_{m'n,t})+l_{mn,t}(\beta_{mn,t})\Big),$$

where the last equality holds since the access delay of the first scheduled user does not exist and the access delay of the second user equals the processing delay of the first user.
– Proceeding by enumeration, if the number of users associated with HAB n is $|\boldsymbol{a}_{n,t}|$, then, given the processing delay of each associated user, the sum delay for processing the computational tasks requested by the associated users is:

$$\sum_{m=1}^{|\boldsymbol{a}_{n,t}|}\gamma_T\, t_{mn,t}\big(q_{mn,t},\beta_{mn,t}\big)=\sum_{m=1}^{|\boldsymbol{a}_{n,t}|}\gamma_T\big(|\boldsymbol{a}_{n,t}|-q_{mn,t}+1\big)\,l_{mn,t}(\beta_{mn,t}).$$

This completes the proof. ∎
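The closed form (7.21) can be checked numerically: the following sketch compares the direct sum of completion times under a random service sequence against the weighted form of Lemma 7.1 (the processing delays $l_{mn,t}$ are hypothetical numbers).

```python
import numpy as np

rng = np.random.default_rng(0)
l = rng.uniform(0.5, 3.0, size=5)      # hypothetical processing delays l_{mn,t}
q = rng.permutation(len(l)) + 1        # a service sequence: positions 1..5

# Direct accounting: the user at position p finishes after all users at positions <= p.
order = np.argsort(q)                  # users listed in service order
direct = np.cumsum(l[order]).sum()

# Closed form of Lemma 7.1: sum_m (|a_n| - q_m + 1) * l_m.
closed = ((len(l) - q + 1) * l).sum()
assert np.isclose(direct, closed)
```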

From Lemma 7.1, we can see that the time needed by user m and HAB n to cooperatively process the computational task is determined by $\beta_{mn,t}$, while the access delay of each user m is determined by $q_{mn,t}$. Next, to determine the optimal service sequence $q_{mn,t}$ of each user m, we state the following result:

Theorem 7.1 Given the data size $z_{m,t}$ of each computational task and the user association index $a_{mn,t}$, the optimal service sequence of a user m associated with HAB n will be $q_{mn,t}=Q_m$, with $Q_m$ being the number of users in $\mathcal{Q}_m=\{m'\mid l_{m'n,t}(\beta_{m'n,t})\le l_{mn,t}(\beta_{mn,t})\}$.

Proof We prove Theorem 7.1 by contradiction. First, assume that the set of users is served by HAB n in ascending order of the time consumption for processing the computational tasks. For a specific user m, the optimal service sequence is then $q_{mn,t}=Q^{*}_m$, with $Q^{*}_m$ being the number of users in $\mathcal{Q}^{*}_m=\{m'\mid l_{m'n,t}\le l_{mn,t}\}$, where $l^{S}_{mn,t}$, $l_{mn,t}$, and $t_{mn,t}$ are simplified notations for $l^{S}_{mn,t}(q_{mn,t})$, $l_{mn,t}(\beta_{mn,t})$, and $t_{mn,t}(q_{mn,t},\beta_{mn,t})$. The total processing delay is:

$$\sum_{m=1}^{|\boldsymbol{a}_n|}\gamma_T\, t_{mn,t}=\gamma_T\Big(\sum_{m'=1}^{Q^{*}_m-1}t_{m'n,t}+t_{mn,t}\big(Q^{*}_m,\beta_{mn,t}\big)+\sum_{m'=Q^{*}_m+1}^{|\boldsymbol{a}_n|}t_{m'n,t}\Big). \qquad (7.22)$$

Then, if the service sequence of user m is changed from $Q^{*}_m$ to $Q_m$, the total processing delay becomes:

$$\sum_{m=1}^{|\boldsymbol{a}_n|}\gamma_T\, t_{mn,t}=\gamma_T\Big(\sum_{m'=1}^{Q_m-1}t_{m'n,t}+t_{mn,t}\big(Q_m,\beta_{mn,t}\big)+\sum_{m'=Q_m+1}^{|\boldsymbol{a}_n|}t_{m'n,t}\Big). \qquad (7.23)$$

The gap between (7.22) and (7.23) is:

$$\begin{aligned}(7.22)-(7.23)&=\gamma_T\Big(\sum_{m'=Q_m}^{Q^{*}_m-1}t_{m'n,t}\big(q_{m'n,t}\big)-\sum_{m'=Q_m}^{Q^{*}_m-1}t_{m'n,t}\big(q_{m'n,t}+1\big)+t_{mn,t}\big(Q^{*}_m,\beta_{mn,t}\big)-t_{mn,t}\big(Q_m,\beta_{mn,t}\big)\Big)\\&=\gamma_T\Big(\sum_{m'=Q_m}^{Q^{*}_m-1}\big(|\boldsymbol{a}_n|-q_{m'n,t}+1\big)l_{m'n,t}-\sum_{m'=Q_m}^{Q^{*}_m-1}\big(|\boldsymbol{a}_n|-(q_{m'n,t}+1)+1\big)l_{m'n,t}-\big(Q^{*}_m-Q_m\big)l_{mn,t}\Big)\\&=\gamma_T\Big(\sum_{m'=Q_m}^{Q^{*}_m-1}l_{m'n,t}-\big(Q^{*}_m-Q_m\big)l_{mn,t}\Big).\end{aligned}$$

• If $Q^{*}_m>Q_m$, user m is served before users whose required service time is less than user m's, i.e., $l_{m'n,t}<l_{mn,t}$; then $\sum_{m'=Q_m}^{Q^{*}_m-1}l_{m'n,t}-(Q^{*}_m-Q_m)l_{mn,t}<0$.
• If $Q^{*}_m<Q_m$, user m is served after users whose required service time is larger than user m's, i.e., $l_{m'n,t}>l_{mn,t}$; then $\sum_{m'=Q_m}^{Q^{*}_m-1}l_{m'n,t}-(Q^{*}_m-Q_m)l_{mn,t}=-\sum_{m'=Q^{*}_m}^{Q_m-1}l_{m'n,t}+(Q_m-Q^{*}_m)l_{mn,t}<0$.

From the above analysis, we can see that, whenever the service sequence of user m deviates from $Q^{*}_m$, the time needed by all associated users to process the computational tasks, $\sum_{m=1}^{|\boldsymbol{a}_n|}\gamma_T t_{mn,t}$, increases. Thus, the sum delay is minimized when the associated users are served in ascending order of the time consumption for processing the computational tasks, which corresponds to the service sequence $q_{mn,t}=Q_m$ with $Q_m$ being the number of elements in $\mathcal{Q}_m=\{m'\mid l_{m'n,t}\le l_{mn,t}\}$. This completes the proof. ∎

From Theorem 7.1, we can see that the service sequence $\boldsymbol{q}_{n,t}$ can be determined according to the time consumption for processing the computational tasks using a sorting algorithm, such as bubble sort.
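In code, the ascending-order rule of Theorem 7.1 is a single sort; the brute-force check below confirms on a toy instance that it attains the minimum sum delay among all service sequences (the delays are hypothetical).

```python
import numpy as np
from itertools import permutations

l = np.array([2.0, 0.5, 1.5, 1.0])          # hypothetical processing delays l_{mn,t}

def total_delay(l, q):
    return ((len(l) - q + 1) * l).sum()     # closed form of Lemma 7.1

q_spt = np.argsort(np.argsort(l)) + 1       # each user's position when sorted by l
best = min(total_delay(l, np.array(p)) for p in permutations(range(1, len(l) + 1)))
assert np.isclose(total_delay(l, q_spt), best)
```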

Based on Theorem 7.1, optimization problem (7.20) can be rewritten as:

$$\min_{\boldsymbol{\beta}_{n,t},\,\boldsymbol{q}_{n,t}}\ \sum_{t=1}^{T}\sum_{n=1}^{N}\sum_{m=1}^{|\boldsymbol{a}_n|}\Big(\gamma_E\, e_{m,t}\big(\beta_{mn,t}\big)+\gamma_T\big(|\boldsymbol{a}_n|-q_{mn,t}+1\big)\,l_{mn,t}(\beta_{mn,t})\Big) \qquad (7.24)$$

$$\text{s.t.}\quad 0\le\beta_{mn,t}\le 1,\ \forall m\in\mathcal{M},\ \forall n\in\mathcal{N}, \qquad (7.24a)$$
$$\sum_{m=1}^{M}e_{n,t}\big(\beta_{mn,t}\big)\le E_t,\ \forall n\in\mathcal{N}. \qquad (7.24b)$$

Problem (7.24) is a linear, and hence convex, problem, since both the objective function and the constraints are linear; it can thus be solved optimally and efficiently by a well-established optimization toolbox such as CVX [157].
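As an illustration, the CVXPY sketch below solves an instance of (7.24) for a single HAB and time slot; the linear delay and energy models (`c_user`, `c_hab`, `p_user`, `p_hab`) and all numerical values are hypothetical stand-ins for $e_{m,t}(\beta_{mn,t})$ and $l_{mn,t}(\beta_{mn,t})$, and the users are assumed pre-sorted per Theorem 7.1 so that $q_m = m$.

```python
import cvxpy as cp
import numpy as np

M = 5
rng = np.random.default_rng(1)
z = rng.uniform(1e6, 5e6, size=M)              # task sizes in bits (hypothetical)
c_user, c_hab = 1e-7, 2e-8                     # per-bit processing delay (s/bit)
p_user, p_hab = 0.5, 20.0                      # processing powers (W)
gamma_E, gamma_T, E_max = 0.5, 0.5, 40.0

beta = cp.Variable(M)                          # fraction of each task done at the user
delay = cp.multiply(z * c_user, beta) + cp.multiply(z * c_hab, 1 - beta)
e_user = p_user * cp.multiply(z * c_user, beta)
e_hab = p_hab * cp.multiply(z * c_hab, 1 - beta)
w = np.arange(M, 0, -1)                        # (|a_n| - q_m + 1) with q_m = m

objective = cp.sum(gamma_E * (e_user + e_hab) + gamma_T * cp.multiply(w, delay))
problem = cp.Problem(cp.Minimize(objective),
                     [beta >= 0, beta <= 1, cp.sum(e_hab) <= E_max])  # (7.24a)-(7.24b)
problem.solve()
print(problem.status, beta.value)
```

Because every expression here is affine in `beta`, the instance is a linear program, mirroring the claim above.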

7.5 Simulation Results and Analysis

In our simulations, an MEC-enabled HAB network area of radius $r = 2.5$ km is considered, with $M = 10$ uniformly distributed users and $N = 4$ uniformly distributed HABs. The values of the other parameters are listed in Table 7.1. All statistical results are averaged over 5000 independent runs. The real data used to train the proposed algorithm is obtained from the OMNILab at Shanghai Jiao Tong University [189]; we treat the data size of the cellular traffic in this dataset as the data size of each user's computational task. The optimal user associations used for training the SVM model to minimize the utility function of all users are obtained by exhaustive search. In the simulations, we compare the introduced scheme with two baselines: SVM-based local learning, in which each HAB trains its local SVM model individually, and SVM-based global learning, in which each HAB transmits its local dataset to the other HABs for training purposes.

Table 7.1 Simulation parameters [34]

Parameter | Value         | Parameter | Value         | Parameter | Value
B         | 10 MHz        | ς         | 3.44 × 10⁻²³  | φ_{n,t}   | −20 dB
P_B       | 20 W          | ω_U       | 1500          | φ_{m,t}   | −20 dB
P_U       | 0.5 W         | ω_B       | 1500          | χ         | 1.45 dB/km
γ_E       | 0.5           | f_m^U     | 0.5 GHz       | ρ         | 65
γ_T       | 0.5           | f_B       | 10 GHz        | σ²        | −95 dBm
ς_m       | 3.44 × 10⁻²³  | f_c       | 28 GHz        | H         | 17 km

[Fig. 7.2 Accuracy rate as the number of training samples varies (accuracy rate vs. number of training samples, 30–80, for SVM-based global learning, SVM-based local learning, and the proposed SVM-based federated learning)]

In Fig. 7.2, we show how the accuracy rate changes as the number of data samples varies. In this figure, the accuracy rate is the probability with which the considered algorithms accurately predict the optimal user association. Clearly, as the number of data samples increases, the accuracy rate of all algorithms increases, since a larger training set reduces the probability of underfitting. Figure 7.2 also shows that the proposed algorithm exhibits only a 3% accuracy gap compared with SVM-based global learning. However, the SVM-based global learning algorithm requires each HAB to transmit all of its datasets to the other HABs for training, which results in high overhead as well as significant energy and time consumption for data transmission.

[Fig. 7.3 Accuracy rate as the number of users varies (accuracy rate vs. number of users, 8–20, for the three considered algorithms)]

Figure 7.3 shows how the accuracy rate changes as the number of users varies. Clearly, as the number of users increases, the accuracy rate of the proposed algorithm increases. This is due to the fact that, as the number of users increases, the average energy available to process each user's computational tasks decreases, and hence the probability that each user changes its association increases. In consequence, the computational task information of each user is collected by different HABs, which increases the correlation between the datasets at the HABs and, in turn, the accuracy rate of the proposed algorithm. Figure 7.3 also shows that the proposed algorithm yields up to a 19.4% gain in terms of accuracy rate compared with SVM-based local learning. This implies that the proposed algorithm


enables each HAB to train the learning model cooperatively, building a relationship of the user association among the HABs and improving the prediction performance.

[Fig. 7.4 Value of the utility function as the total number of iterations varies (utility value vs. number of iterations, 0–60, for the three considered algorithms)]

Figure 7.4 shows the number of iterations needed until convergence for all considered algorithms. From this figure, we can see that, as time elapses, the value of the utility function of the considered algorithms decreases until convergence. Figure 7.4 also shows that the proposed algorithm incurs a 16.7% loss in terms of the number of iterations needed to converge compared with SVM-based global learning and SVM-based local learning. This is because the proposed algorithm trains the learning model not only on the historical data samples but also on the trained parameters from other HABs, which slows the convergence. Although exchanging the trained parameters increases the number of iterations needed to converge, the proposed algorithm achieves a gain of up to 19.4% in terms of prediction performance compared with SVM-based local learning.

[Fig. 7.5 An example of the prediction of the user association performed by the proposed algorithm. (a) The predicted and the optimal user association, together with the data size of each computational task, as the data size varies over 60 time slots. (b) The map of the distribution of the HABs and users in the 5000 m × 5000 m area]

Figure 7.5 shows an example of the user association prediction performed by the proposed algorithm for a network with 4 HABs and 12 users. In this figure, we can see that, as the data size of the computational task requested by user 1 varies, the predicted user association changes, as shown in Fig. 7.5a. This implies that the proposed algorithm enables each HAB to predict the optimal user association based on the data size of the computational task that user 1 currently needs to process. Specifically, the proposed algorithm achieves up to a 90% accuracy rate in predicting the optimal user association. Figure 7.5a also shows that user 1 connects to HAB 3 whenever the data size of the computational task is larger than 100 KB, and to HAB 2 otherwise. This is due to the fact that, when the data size of the computational task is smaller than 100 KB, user 1 associates with HAB 2 for task processing, since HAB 2 is nearest to user 1 and has enough energy to process the computational task. Moreover, as the data size of the computational task increases, each HAB needs more energy and time to process the computational task offloaded from each user. However, from Fig. 7.5b, we can see that the number of users associated with HAB 3 is smaller than the number of users associated with HAB 2. Thus, as the data size of the computational task offloaded from user 1 increases, the energy of HAB 2 becomes insufficient to process the computational tasks of its associated users, and hence user 1 associates with HAB 3 for task processing.

[Fig. 7.6 Value of the utility function as the total number of users varies (utility value vs. number of users, 12–20, for the optimal solution, the proposed SVM-based federated learning, SVM-based global learning, and SVM-based local learning, with delay and energy-consumption components shown)]

Figure 7.6 shows how the value of the utility function changes as the number of users varies. From Fig. 7.6, we can see that the value of the utility function increases with the number of users. This stems from the fact that, as the number of users increases, the number of tasks that must be processed increases, which increases the sum energy and time consumption for task processing. Figure 7.6 also shows that, as the number of users increases, the sum energy consumption increases linearly while the sum time consumption increases exponentially. This is because the sum energy consumption is linearly related to the number of users in the considered TDMA system, while the sum access delay is exponentially related to the number of users. From Fig. 7.6, we can also see that the proposed algorithm reduces the value of the utility function by up to 16.1% and 26.7% compared with SVM-based global learning and SVM-based local learning, respectively. This gain stems from the fact that the proposed algorithm enables each HAB to build the SVM model cooperatively without transmitting its local training data samples to the other HABs, hence reducing the energy consumption for data transmission while guaranteeing a better prediction of the optimal user association.

7.6 Conclusion

In this chapter, we have introduced the use of SVM-based FL for minimizing the energy and time consumption of task computation and transmission in an HAB-based network. We have first introduced the considered HAB model and then the optimization problem that seeks to minimize the weighted sum of the energy


and time consumption of all users. To solve this problem, we have explained the use of an SVM-based FL algorithm that enables the HABs to cooperatively train an optimal SVM model, with each HAB using only its own data. The SVM model analyzes the relationship between the future user association and the data size of the task that each user needs to process in the current time slot, so as to determine the user association proactively. Based on the predicted association, the service sequence and task allocation are optimized so as to minimize the energy and time consumption for task computing and transmission. Simulation results have shown that the introduced FL-based approach yields significant gains in terms of sum energy and time consumption compared with conventional approaches.

References

1. K.B. Letaief, W. Chen, Y. Shi, J. Zhang, Y.J.A. Zhang, The roadmap to 6G: AI empowered wireless networks. IEEE Commun. Mag. 57(8), 84–90 (2019) 2. I.F. Akyildiz, A. Kak, S. Nie, 6G and beyond: The future of wireless communications systems. IEEE Access 8, 133995–134030 (2020) 3. S. Dang, O. Amin, B. Shihada, M.S. Alouini, What should 6G be? Nat. Electron. 3(1), 20–29 (2020) 4. J. Posner, L. Tseng, M. Aloqaily, Y. Jararweh, Federated learning in vehicular networks: Opportunities and solutions. IEEE Netw. 35(2), 152–159 (2021) 5. X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, M. Chen, In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Netw. 33(5), 156– 165 (2019) 6. S. Niknam, H.S. Dhillon, J.H. Reed, Federated learning for wireless communications: Motivation, opportunities, and challenges. IEEE Commun. Mag. 58(6), 46–51 (2020) 7. Y. Liu, X. Yuan, Z. Xiong, J. Kang, X. Wang, D. Niyato, Federated learning for 6G communications: Challenges, methods, and future directions. China Commun. 17(9), 105– 118 (2020) 8. Z. Zhao, C. Feng, H.H. Yang, X. Luo, Federated-learning-enabled intelligent fog radio access networks: Fundamental theory, key techniques, and future trends. IEEE Wirel. Commun. 27(2), 22–28 (2020) 9. J. Kang, Z. Xiong, D. Niyato, Y. Zou, Y. Zhang, M. Guizani, Reliable federated learning for mobile networks. IEEE Wirel. Commun. 27(2), 72–80 (2020) 10. O.A. Wahab, A. Mourad, H. Otrok, T. Taleb, Federated machine learning: Survey, multilevel classification, desirable criteria and future directions in communication and networking systems. IEEE Commun. Surv. Tutorials 23(2), 1342–1397 (Secondquarter 2021) 11. M. Chen, Z. Yang, W. Saad, C. Yin, H.V. Poor, S. Cui, A joint learning and communications framework for federated learning over wireless networks. IEEE Trans. Wirel. Commun. 20(1), 269–283 (2021) 12. M.S.H. Abad, E. Ozfatura, D. Gündüz, O. Ercetin, Hierarchical federated learning across heterogeneous cellular networks, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May (2020) 13. M.M. Amiri, D. Gündüz, Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air. IEEE Trans. Signal Process. 68, 2155–2169 (2020) 14. G. Zhu, K. Huang, Broadband analog aggregation for low-latency federated edge learning. IEEE Trans. Wirel. Commun. 19(1), 491–506 (2020)


15. M.M. Amiri, D. Gündüz, Federated learning over wireless fading channels. IEEE Trans. Wirel. Commun. 19(5), 3546–3557 (2020) 16. K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C.M. Kiddon, J. Konecny, S. Mazzocchi, B. McMahan, T.V. Overveldt, D. Petrou, D. Ramage, J. Roselander, Towards federated learning at scale: System design, in Proc. Systems and Machine Learning Conference, Stanford, CA, USA, Feb. (2019) 17. M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A.V. Feljan, H.V. Poor, Distributed learning in wireless networks: Recent progress and future challenges. IEEE J. Sel. Areas Commun. 39(12), 3579–3605 (2021) 18. A. Tak, S. Cherkaoui, Federated edge learning: Design issues and challenges. IEEE Netw. 35(2), 252–258 (2021) 19. C. Shen, J. Xu, S. Zheng, X. Chen, Resource rationing for wireless federated learning: Concept, benefits, and challenges. IEEE Commun. Mag. 59(5), 82–87 (2021) 20. G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, K. Huang, Toward an intelligent edge: Wireless communication meets machine learning. IEEE Commun. Mag. 58(1), 19–25 (2020) 21. T. Li, A.K. Sahu, A. Talwalkar, V. Smith, Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 37(3), 50–60 (2020) 22. K. Yang, Y. Shi, Y. Zhou, Z. Yang, L. Fu, W. Chen, Federated machine learning for intelligent IoT via reconfigurable intelligent surface. IEEE Netw. 34(5), 16–22 (2020) 23. J. Park, S. Samarakoon, A. Elgabli, J. Kim, M. Bennis, S.L. Kim, M. Debbah, Communication-efficient and distributed learning over wireless networks: Principles and applications. Proc. IEEE 109(5), 796–819 (2021) 24. W.Y.B. Lim, N.C. Luong, D.T. Hoang, Y. Jiao, Y.C. Liang, Q. Yang, D. Niyato, C. Miao, Federated learning in mobile edge networks: A comprehensive survey. IEEE Commun. Surv. Tutorials 22(3), 2031–2063 (Thirdquarter 2020) 25. Y. Sun, W. Shi, X. Huang, S. Zhou, Z. Niu, Edge learning with timeliness constraints: Challenges and solutions. IEEE Commun. Mag. 58(12), 27–33 (2020) 26. F. Seide, H. Fu, J. Droppo, G. Li, D. Yu, 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs, in Proc. Annual Conference of the International Speech Communication Association, Singapore, Singapore, Sept. (2014) 27. N. Strom, Scalable distributed DNN training using commodity GPU cloud computing, in Proc. Annual Conference of the International Speech Communication Association, Dresden, Germany, Sept. (2015) 28. W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, H. Li, Terngrad: Ternary gradients to reduce communication in distributed deep learning, in Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, Dec. (2017) 29. S. Dorner, S. Cammerer, J. Hoydis, S.T. Brink, Deep learning based communication over the air. IEEE J. Sel. Top. Signal Process. 12(1), 132–143 (2018) 30. S. Hosseinalipour, C.G. Brinton, V. Aggarwal, H. Dai, M. Chiang, From federated to fog learning: Distributed machine learning over heterogeneous wireless networks. IEEE Commun. Mag. 58(12), 41–47 (2020) 31. B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in Proc. International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, April (2017) 32. V. Smith, C.K. Chiang, M. Sanjabi, A.S. Talwalkar, Federated multi-task learning, in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, USA, Dec. (2017) 33. A. Fallah, A. Mokhtari, A. 
Ozdaglar, Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach, in Proc. Advances in Neural Information Processing Systems, vol. 33, Virtual Conference, Dec. (2020), pp. 3557–3568 34. N.H. Tran, W. Bao, A. Zomaya, M.N.H. Nguyen, C.S. Hong, Federated learning over wireless networks: Optimization model design and analysis, in Proc. IEEE Conference on Computer Communications, Paris, France (2019)


35. S. Wang, T. Tuor, T. Salonidis, K.K. Leung, C. Makaya, T. He, K. Chan, Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun. 37(6), 1205–1221 (2019) 36. R. Balakrishnan, M. Akdeniz, S. Dhakal, N. Himayat, Resource management and fairness for federated learning over wireless edge networks, in Proc. IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, May (2020) 37. H.H. Yang, Z. Liu, T.Q.S. Quek, H.V. Poor, Scheduling policies for federated learning in wireless networks. IEEE Trans. Commun. 68(1), 317–333 (2020) 38. C.T. Dinh, N.H. Tran, M.N.H. Nguyen, C.S. Hong, W. Bao, A.Y. Zomaya, V. Gramoli, Federated learning over wireless networks: Convergence analysis and resource allocation. IEEE/ACM Trans. Netw. 29(1), 398–409 (2021) 39. W. Shi, S. Zhou, Z. Niu, M. Jiang, L. Geng, Joint device scheduling and resource allocation for latency constrained wireless federated learning. IEEE Trans. Wirel. Commun. 20(1), 453– 467 (2021) 40. W. Xia, T.Q.S. Quek, K. Guo, W. Wen, H.H. Yang, H. Zhu, Multi-armed bandit-based client scheduling for federated learning. IEEE Trans. Wirel. Commun. 19(11), 7108–7123 (2020) 41. J. Xu, H. Wang, Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective. IEEE Trans. Wirel. Commun. 20(2), 1188–1200 (2021) 42. M. Gastpar, Uncoded transmission is exactly optimal for a simple Gaussian sensor network. IEEE Trans. Inf. Theory 54, 2008–2017 (2008) 43. G. Zhu, K. Huang, MIMO over-the-air computation for high-mobility multimodal sensing. IEEE Internet Things J. 6(4), 6089–6103 (2019) 44. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, June (2016) 45. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proc. International Conference on Learning Representations, San Diego, California, USA, May (2015) 46. N.F. Eghlidi, M. Jaggi, Sparse communication for training deep networks, arXiv 2009.09271 (2020) 47. J. Wangni, J. Wang, J. Liu, T. Zhang, Gradient sparsification for communication-efficient distributed optimization, in Proc. Advances in Neural Information Processing Systems, Montreal, Canada, Dec. (2018) 48. A.F. Aji, K. Heafield, Sparse communication for distributed gradient descent, in Proc. Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, Sept. (2017) 49. D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, C. Renggli, The convergence of sparsified gradient methods, in Proc. Advances in Neural Information Processing Systems, Montreal, Canada, Dec. (2018), pp. 5976–5986 50. S.U. Stich, J.B. Cordonnier, M. Jaggi, Sparsified SGD with memory, in Proc. Advances in Neural Information Processing Systems, Montreal, Canada (2018), pp. 4448–4459 51. E. Ozfatura, K. Ozfatura, D. Gündüz, Time-correlated sparsification for communicationefficient federated learning, in Proc. IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, July (2021), pp. 461–466 52. M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: The RPROP algorithm, in Proc. IEEE International Conference on Neural Networks, San Francisco, CA, USA, Mar. (1993) 53. J. Bernstein, Y.X. Wang, K. Azizzadenesheli, A. 
Anandkumar, SignSGD: Compressed optimisation for non-convex problems, in Proc. International Conference on Machine Learning (ICML), Stockholm, Sweden, Jul. (2018) 54. J. Bernstein, J. Zhao, K. Azizzadenesheli, A. Anandkumar, SignSGD with majority vote is communication efficient and fault tolerant, in Proc. International Conference on Learning Representations, New Orleans, LA, USA, May (2019)


55. S.P. Karimireddy, Q. Rebjock, S. Stich, M. Jaggi, Error feedback fixes SignSGD and other gradient compression schemes, in Proc. International Conference on Machine Learning, Long Beach, CA, USA, Jun. (2019) 56. M. Chen, N. Shlezinger, H.V. Poor, Y.C. Eldar, S. Cui, Communication efficient federated learning. Proc. Natl. Acad. Sci. U. S. A. 118(17), e2024789118 (2021) 57. F. Haddadpour, M.M. Kamani, A. Mokhtari, M. Mahdavi, Federated learning with compression: Unified analysis and sharp guarantees, in Proc. International Conference on Artificial Intelligence and Statistics, vol. 130, Virtual Conference, Apr. (2021), pp. 2350–2358 58. S. Caldas, J. Koneˇcny, H.B. McMahan, A. Talwalkar, Expanding the reach of federated learning by reducing client resource requirements, Preprint. arXiv:1812.07210 (2018) 59. J. Xu, W. Du, Y. Jin, W. He, R. Cheng, Ternary compression for communication-efficient federated learning. IEEE Trans. Neural Netw. Learn. Syst. 33(3), 1162–1176 (2022) 60. A. Albasyoni, M. Safaryan, L. Condat, P. Richtárik, Optimal gradient compression for distributed and federated learning. Preprint. arXiv:2010.03246 (2020) 61. X. Dai, X. Yan, K. Zhou, H. Yang, K.K.W. Ng, J. Cheng, Y. Fan, Hyper-sphere quantization: Communication-efficient SGD for federated learning. Preprint. arXiv:1911.04655 (2019) 62. S. Zheng, C. Shen, X. Chen, Design and analysis of uplink and downlink communications for federated learning. IEEE J. Sel. Areas Commun. 39(7), 2150–2167 (2021) 63. A. Abdi, Y.M. Saidutta, F. Fekri, Analog compression and communication for federated learning over wireless MAC, in Proc. IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, May (2020) 64. D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, R. Arora, FetchSGD: Communication-efficient federated learning with sketching, in Proc. International Conference on Machine Learning, Virtual Conference, Jul. (2020) 65. D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic, QSGD: Communication-efficient SGD via gradient quantization and encoding, in Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, Dec. (2017) 66. S. Horvath, C.Y. Ho, L. Horvath, A.N. Sahu, M. Canini, P. Richtarik, “Natural compression for distributed deep learning. Preprint. arXiv:1905.10988 (2019) 67. A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, R. Pedarsani, Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization, in Proc. International Conference on Artificial Intelligence and Statistics, Palermo, Sicily, Italy, Oct. (2020) 68. M.M. Amiri, D. Gündüz, S.R. Kulkarni, H.V. Poor, Convergence of federated learning over a noisy downlink. IEEE Trans. Wirel. Commun. 21(3), 1422–1437 (2022) 69. J. Chen, X. Pan, R. Monga, S. Bengio, R. Jozefowicz, Revisiting distributed synchronous SGD. [Online]. Available: https://arxiv.org/abs/1604.00981 70. R. Tandon, Q. Lei, A.G. Dimakis, N. Karampatziakis, Gradient coding: Avoiding stragglers in distributed learning, in Proc. International Conference on Machine Learning (ICML), Sydney, Australia, Aug. (2017) 71. M. Kamp, L. Adilova, J. Sicking, F. Huger, P. Schlicht, T. Wirtz, S. Wrobe, Efficient decentralized deep learning by dynamic model averaging. [Online]. Available: https://arxiv. org/abs/1807.03210 72. T. Chen, G. Giannakis, T. Sun, W. Yin, Lag: Lazily aggregated gradient for communicationefficient distributed learning, in Proc. 
of Advances in Neural Information Processing Systems, Montreal Canada, Dec. (2018) 73. X. Fan, Y. Wang, Y. Huo, Z. Tian, Joint optimization of communications and federated learning over the air. IEEE Trans. Wirel. Commun. 21(6), 4434–4449 (2022) 74. X. Fan, Y. Wang, Y. Huo, Z. Tian, 1-bit compressive sensing for efficient federated learning over the air. IEEE Trans. Wirel. Commun. 22(3), 2139–2155 (2023) 75. D. Fan, X. Yuan, Y.J.A. Zhang, Temporal-structure-assisted gradient aggregation for overthe-air federated edge learning. IEEE J. Sel. Areas Commun. 39(12), 3757–3771 (2021) 76. K. Yang, T. Jiang, Y. Shi, Z. Ding, Federated learning via over-the-air computation. IEEE Trans. Wirel. Commun. 19(3), 2022–2035 (2020)


77. S. Wang, Y. Hong, R. Wang, Q. Hao, Y.C. Wu, D.W.K. Ng, Edge federated learning via unitmodulus over-the-air computation. IEEE Trans. Commun. 70(5), 3141–3156 (2022) 78. M.M. Amiri, T.M. Duman, D. Gündüz, Collaborative machine learning at the wireless edge with blind transmitters, in Proc. IEEE Global Conference on Signal and Information Processing (GlobalSIP), Ottawa, ON, Canada, Nov. (2019) 79. L. Zhu, Z. Liu, S. Han, Deep leakage from gradients, in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, Dec. (2019) 80. L. Melis, C. Song, E. De Cristofaro, V. Shmatikov, Exploiting unintended feature leakage in collaborative learning, in Proc. IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, May (2019) 81. C. Dwork, A. Roth, The algorithmic foundations of differential privacy. Found. Trends Theoret. Comput. Sci. 9(3–4), 211–407 (2014) 82. M. Abadi, A. Chu, I. Goodfellow, H.B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in Proc. of ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, Oct. (2016) 83. M. Seif, R. Tandon, M. Li, Wireless federated learning with local differential privacy, in Proc. IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, June (2020) 84. Y. Koda, K. Yamamoto, T. Nishio, M. Morikura, Differentially private aircomp federated learning with power adaptation harnessing receiver noise, Preprint. arXiv:2004.06337 (2020) 85. D. Liu, O. Simeone, Privacy for free: Wireless federated learning via uncoded transmission with adaptive power control. IEEE J. Sel. Areas Commun. 39(1), 170–185 (2021) 86. M. Seif, W.T. Chang, R. Tandon, Privacy amplification for federated learning via user sampling and wireless aggregation. IEEE J. Sel. Areas Commun. 39(12), 3821–3835 (2021) 87. B. Hasircioglu, D. Gündüz, Private wireless federated learning with anonymous over-theair computation, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual Conference, June (2021) 88. S. Hosseinalipour, S.S. Azam, C.G. Brinton, N. Michelusi, V. Aggarwal, D.J. Love, H. Dai, Multi-stage hybrid federated learning over large-scale D2D-enabled fog networks. IEEE/ACM Trans. Netw. 30(4), 1569–1584 (2022) 89. J. Sun, T. Chen, G. Giannakis, Z. Yang, Communication-efficient distributed learning via lazily aggregated quantized gradients, in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada (2019) 90. R. Kassab, O. Simeone, Federated generalized bayesian learning via distributed stein variational gradient descent. IEEE Trans. Signal Process. 70, 2180–2192 (2022) 91. T. Lin, S.U. Stich, K.K. Patel, M. Jaggi, Don’t use large mini-batches, use local SGD, in Proc. International Conference on Learning Representations, Addis Ababa, Ethiopia, Apr. (2020) 92. H. Yu, S. Yang, S. Zhu, Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning, in Proc. the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, Jan. (2019) 93. C.T. Dinh, N. Tran, J. Nguyen, Personalized federated learning with moreau envelopes, in Proc. Advances in Neural Information Processing Systems (NIPS), Virtual Conference, Dec. (2020), pp. 21394–21405 94. A. Ghosh, J. Chung, D. Yin, K. Ramchandran, An efficient framework for clustered federated learning, in Proc. Advances in Neural Information Processing Systems (NIPS), Virtual Conference, Dec. (2020) 95. H. Xing, O. 
Simeone, S. Bi, Federated learning over wireless device-to-device networks: Algorithms and convergence analysis. Preprint. arXiv:2101.12704 (2021) 96. T. Li, M. Sanjabi, A. Beirami, V. Smith, Fair resource allocation in federated learning, in Proc. International Conference on Learning Representations (ICLR), Virtual Conference, Apr. (2020) 97. D.K. Dennis, T. Li, V. Smith, Heterogeneity for the win: One-shot federated clustering, in Proc. International Conference on Machine Learning, Virtual Conference, July (2021), pp. 2611–2620


98. B. McMahan, D. Ramage, Federated learning: Collaborative machine learning without centralized training data. Google Research Blog 3, April (2017) 99. M.J. Sheller, G.A. Reina, B. Edwards, J. Martin, S. Bakas, Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation, in Proc. International MICCAI Brainlesion Workshop, Granada, Spain, Sept. (2018) 100. M. Rojek, R. Daigle, AI FL for IoT, Presentation at MWC 2019. https://www.slideshare. net/byteLAKE/bytelake-and-lenovo-presenting-federated-learning-at-mwc-2019 (2019). Accessed 17 Jan 2021 101. F. Díaz González, FL for time series forecasting using LSTM networks: Exploiting similarities through clustering, Master thesis, KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science. http://urn.kb.se/resolve?urn=urn:nbn:se:kth: diva-254665 (2019). Accessed 17 Jan 2021 102. S. Ickin, K. Vandikas, M. Fiedler, Privacy preserving QoE modeling using collaborative learning, in Proc. the Internet-QoE Workshop on QoE-based Analysis and Management of Data Communication Networks, Los Cabos Mexico, Oct. (2019) 103. K. Vandikas, S. Ickin, G. Dixit, M. Buisman, J. Åkeson, Privacy-aware machine learning with low network footprint, Ericsson Technology Review article. https://www.ericsson.com/en/ ericsson-technologyreview/archive/2019/privacy-aware-machine-learning (2019). Accessed 17 Jan 2021 104. M. Isaksson, K. Norrman, Secure federated learning in 5G mobile networks, in Proc. IEEE Global Communications Conference, Taipei, Taiwan, Dec. (2020) 105. J. Koneˇcn`y, H.B. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device intelligence. Preprint. arXiv:1610.02527 (2016) 106. J. Konecny, H.B. McMahan, F.X. Yu, P. Richtarik, A.T. Suresh, D. Bacon, Federated learning: Strategies for improving communication efficiency,” in Proc. of NIPS Workshop on Private Multi-Party Machine Learning, Barcelona, SPAIN, Dec. (2016). [Online]. Available: https:// arxiv.org/abs/1610.05492 107. Y. Xi, A. Burr, J. Wei, D. Grace, A general upper bound to evaluate packet error rate over quasi-static fading channels. IEEE Trans. Wirel. Commun. 10(5), 1373–1377 (2011) 108. Y. Pan, C. Pan, Z. Yang, M. Chen, Resource allocation for D2D communications underlaying a NOMA-based cellular network. IEEE Wirel. Commun. Lett. 7(1), 130–133 (2018) 109. Z. Yang, M. Chen, W. Saad, C.S. Hong, M. Shikh-Bahaei, Energy efficient federated learning over wireless communication networks. IEEE Trans. Wirel. Commun. 20(3), 1935–1949 (2021) 110. M.P. Friedlander, M. Schmidt, Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), A1380–A1405 (2012) 111. B. Stephen, V. Lieven, Convex Optimization (Cambridge University Press, 2004) 112. M. Mahdian, Q. Yan, Online bipartite matching with random arrivals: An approach based on strongly factor-revealing LPs, in Proc. ACM Symposium on Theory of Computing, San Jose, California, USA, June (2011) 113. R. Jonker, T. Volgenant, Improving the hungarian assignment algorithm. Oper. Res. Lett. 5(4), 171–175 (1986) 114. Y. LeCun, The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ 115. Y. Zhang, Solving large-scale linear programs by interior-point methods under the Matlab environment. Optim. Methods Softw. 10(1), 1–31 (1998) 116. W. Wu, J. Wang, M. Cheng, Z. Li, Convergence analysis of online gradient method for BP neural networks. Neural Netw. 24(1), 91–98 (2011) 117. Y. Lecun, L. 
Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998) 118. Classify mnist digits using a feedforward neural network with matlab. https://www. machinelearningtutorial.net/2017/01/14/matlab-mnist-dataset-tutorial/#m1 119. Y. Mao, J. Zhang, K.B. Letaief, Dynamic computation offloading for mobile-edge computing with energy harvesting devices. IEEE J. Sel. Areas Commun. 34(12), 3590–3605 (2016) 120. W. Dinkelbach, On nonlinear fractional programming. Manag. Sci. 13(7), 492–498 (1967)


121. Speedtest.net, Speedtest united states market report (2019). [Online]. Available: http://www. speedtest.net/reports/united-states/ 122. R.M. Gray, D.L. Neuhoff, Quantization. IEEE Trans. Inform. Theory 44(6), 2325–2383 (1998) 123. Y. Polyanskiy, Y. Wu, Lecture notes on information theory. Lecture Notes for 6.441 (MIT), ECE563 (University of Illinois Urbana-Champaign), and STAT 664 (Yale), 2012–2017 124. R. Zamir, M. Feder, On universal quantization by randomized uniform/lattice quantizers. IEEE Trans. Inform. Theory 38(2), 428–436 (1992) 125. P.A. Chou, M. Effros, R.M. Gray, A vector quantization approach to universal noiseless coding and quantization. IEEE Trans. Inform. Theory 42(4), 1109–1138 (1996) 126. J. Ziv, On universal quantization. IEEE Trans. Inform. Theory 31(3), 344–347 (1985) 127. R. Zamir, M. Feder, On lattice quantization noise. IEEE Trans. Inform. Theory 42(4), 1152– 1159 (1996) 128. J.H. Conway, N.J.A. Sloane, Sphere Packings, Lattices and Groups, vol. 290 (Springer Science & Business Media, 2013) 129. R. Rubinstein, Generating random vectors uniformly distributed inside and on the surface of different regions. Eur. J. Oper. Res. 10(2), 205–209 (1982) 130. T.C. Aysal, M.J. Coates, M.G. Rabbat, Distributed average consensus with dithered quantization. IEEE Trans. Signal Process. 56(10), 4905–4918 (2008) 131. R.M. Gray, T.G. Stockham, Dithered quantizers. IEEE Trans. Inform. Theory 39(3), 805–812 (1993) 132. T.M. Cover, J.A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012) 133. A. Wyner, J. Ziv, The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inform. Theory 22(1), 1–10 (1976) 134. E. Agrell, T. Eriksson, Optimization of lattices for quantization. IEEE Trans. Inform. Theory 44(5), 1814–1828 (1998) 135. K. Ferentios, On Tcebycheff’s type inequalities. Trabajos de Estadistica y de Investigacion Operativa 33(1), 125 (1982) 136. J. Conway, N. Sloane, Voronoi regions of lattices, second moments of polytopes, and quantization. IEEE Trans. Inform. Theory 28(2), 211–226 (1982) 137. X. Li, K. Huang, W. Yang, S. Wang, Z. Zhang, On the convergence of FedAvg on non-IID data. Preprint. arXiv:1907.02189 (2019) 138. P. Kairouz et al., Advances and open problems in federated learning, arXiv:1912.04977 (2019) 139. N. Shlezinger, M. Chen, Y.C. Eldar, H.V. Poor, S. Cui, UVeQFed: Universal vector quantization for federated learning. IEEE Trans. Signal Process. 69, 500–514 (2020) 140. A. Kirac, P. Vaidyanathan, Results on lattice vector quantization with dithering. IEEE Trans. Circuits Syst. II 43(12), 811–826 (1996) 141. M. Sharma, S. Soman, Jayadeva, Minimal complexity machines under weight quantization. IEEE Trans. Comput. 70(8), 1189–1198 (2021) 142. H. Qin, R. Gong, X. Liu, X. Bai, J. Song, N. Sebe, Binary neural networks: A survey. Pattern Recogn. 105, 1–14 (2020) 143. S.I. Young, W. Zhe, D. Taubman, B. Girod, Transform quantization for CNN compression. IEEE Trans. Pattern Anal. Mach. Intell. 44(9), 5700–5714 (2022) 144. Y. Ji, L. Chen, FedQNN: A computation-communication-efficient federated learning framework for IoT with low-bitwidth neural network quantization. IEEE Internet Things J. 10(3), 2494–2507 (2023) 145. S. Wang, M. Chen, C.G. Brinton, C. Yin, W. Saad, S. Cui, Performance optimization for variable bitwidth federated learning in wireless networks. Preprint. arXiv:2209.10200 (2022) 146. N. Megiddo, A. Tamir, Finding least-distances lines. J. Algebraic Discrete Methods 4(2), 207–211 (1983) 147. Y. Yang, Z. 
Zhang, Q. Yang, Communication-efficient federated learning with binary neural networks. IEEE J. Sel. Areas Commun. 39(12), 3836–3850 (2021)


148. A. Krizhevsky, Learning multiple layers of features from tiny images. Available Online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, April (2009) 149. Y. Umuroglu, M. Jahre, Streamlined deployment for quantized neural networks. Available Online: https://arxiv.org/abs/1709.04060v2, May (2018) 150. R.C. Buck, Approximate complexity and functional representation. J. Math. Anal. Appl. 70(1), 280–298 (1979) 151. G. Arunabha, J. Zhang, J.G. Andrews, R. Muhamed, Fundamentals of LTE. Prentice Hall Communications Engineering and Emerging Technologies Series (2010) 152. Y. Shao, D. Gündüz, S.C. Liew, Federated edge learning with misaligned over-the-air computation. IEEE Trans. Wirel. Commun. 21(6), 3951–3964 (2022) 153. P.A.F.F. Hekland, T.A. Ramstad, Shannon-kotelnikov mappings in joint source-channel coding. IEEE Trans. Commun. 57(1), 94–105 (2009) 154. X. Cao, G. Zhu, J. Xu, Z. Wang, S. Cui, Optimized power control design for over-the-air federated edge learning. IEEE J. Sel. Areas Commun. 40(1), 342–358 (2022) 155. X. Cao, G. Zhu, J. Xu, K. Huang, Optimal power control for over-the-air computation in fading channels. IEEE Trans. Wirel. Commun. 19(11), 7498–7513 (2020) 156. N. Zhang, M. Tao, Gradient statistics aware power control for over-the-air federated learning. IEEE Trans. Wirel. Commun. 20(8), 5115–5128 (2021) 157. M. Grant, S. Boyd, CVX: Matlab software for disciplined convex programming (2016). [Online]. Available: http://cvxr.com/cvx 158. X. Cao, G. Zhu, J. Xu, K. Huang, Optimized power control for over-the-air computation in fading channels. IEEE Trans. Wireless Commun. 19(11), 7498–7513 (2020) 159. H. Jedda, A. Mezghani, A.L. Swindlehurst, J.A. Nossek, Quantized constant envelope precoding with PSK and QAM signaling. IEEE Trans. Wirel. Commun. 17(12), 8022–8034 (2018) 160. S. Wang, M. Chen, C. Shen, C. Yin, C.G. Brinton, Cross-layer federated learning optimization in mimo networks. Preprint. arXiv:2302.14648 (2023) 161. S. Xia, J. Zhu, Y. Yang, Y. Zhou, Y. Shi, W. Chen, Fast convergence algorithm for analog federated learning, in Proc. IEEE International Conference on Communications, Montreal, QC, Canada, June (2021) 162. X. Zhao, L. You, R. Cao, Y. Shao, and L. Fu, Broadband digital over-the-air computation for asynchronous federated edge learning, in Proc. IEEE International Conference on Communications, Seoul, South Korea, May (2022), pp. 5359–5364 163. L. dos Santos Coelho, D.L. de Andrade Bernert, An improved harmony search algorithm for synchronization of discrete-time chaotic systems. Chaos Solitons Fractals 41(5), 2526–2532 (2009) ˇ 164. B. Paden, M. Cáp, S.Z. Yong, D. Yershov, E. Frazzoli, A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. Intell. Veh. 1(1), 33–55 (2016) 165. L. Hewing, K.P. Wabersich, M. Menner, M.N. Zeilinger, Learning-based model predictive control: Toward safe learning in control. Annu. Rev. Control Robot. Auton. Syst. 3(1), 269– 296 (2020) 166. S.S. Ge, C.C. Hang, T.H. Lee, T. Zhang, Stable Adaptive Neural Network Control, vol. 13 (Springer Science & Business Media, 2013) 167. T. Gu, J. Atwood, C. Dong, J.M. Dolan, J.-W. Lee, Tunable and stable real-time trajectory planning for urban autonomous driving, in Proc. of International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, Sept. (2015) 168. Q. Song, J.C. Spall, Y.C. Soh, J. Ni, Robust neural network tracking controller using simultaneous perturbation stochastic approximation. IEEE Trans. Neural Netw. 
19(5), 817– 835 (2008) 169. X. Liu, P. Lu, Solving nonconvex optimal control problems by convex optimization. J. Guidance Control Dyn. 37(3), 750–765 (2014) 170. G. Acosta-Marum, M. Ingram, Doubly selective vehicle-to-vehicle channel measurements and modeling at 5.9 GHz, in Proc. of IEEE International Symposium Wireless Personal Multimedia Communication, San Diego, CA, USA, Sept. (2006)


171. M. Chen, H.V. Poor, W. Saad, S. Cui, Wireless communications for collaborative federated learning. IEEE Commun. Mag. 58(12), 48–54 (2020) 172. T. Li, A.K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, V. Smith, Federated optimization in heterogeneous networks, in Proc. of Conference on Machine Learning and Systems (MLSys), Austin, TX, USA (2020) 173. L. Bottou, F. Curtis, J. Nocedal, Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018) 174. T. Zeng, O. Semiari, M. Chen, W. Saad, M. Bennis, Federated learning on the road autonomous controller design for connected and autonomous vehicles. IEEE Trans. Wirel. Commun. 21(12), 10407–10423 (2022) 175. P. Bolton, M. Dewatripont et al., Contract Theory (MIT Press, 2005) 176. Y. Zhan, P. Li, Z. Qu, D. Zeng, S. Guo, A learning-based incentive mechanism for federated learning. IEEE Internet Things J. 7(7), 6360–6368 (2020) 177. L.U. Khan, S.R. Pandey, N.H. Tran, W. Saad, Z. Han, M.N.H. Nguyen, C.S. Hong, Federated learning for edge networks: Resource optimization and incentive mechanism. IEEE Commun. Mag. 58(10), 88–93 (2020) 178. W.Y.B. Lim, Z. Xiong, C. Miao, D. Niyato, Q. Yang, C. Leung, H.V. Poor, Hierarchical incentive mechanism design for federated machine learning in mobile networks. IEEE Internet Things J. 7(10), 9575–9588 (2020) 179. F. Zhou, Y. Wu, R.Q. Hu, Y. Qian, Computation rate maximization in UAV-enabled wirelesspowered mobile-edge computing systems. IEEE J. Sel. Areas Commun. 36(9), 1927–1941 (2018) 180. F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, T. Darrell, BDD100K: A diverse driving dataset for heterogeneous multitask learning, in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, Jun. (2020) 181. S. Moosavi, B. Tehrani, R. Ramnath, Trajectory annotation by discovering driving patterns, in Proc. of ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics, Redondo Beach, CA, USA, Nov. (2017) 182. S. Xiong, H. Xie, K. Song, G. Zhang, A speed tracking method for autonomous driving via ADRC with extended state observer. Appl. Sci. 9(16), 1–21 (2019) 183. G. Tagne, R. Talj, A. Charara, Higher-order sliding mode control for lateral dynamics of autonomous vehicles, with experimental validation, in Proc. of IEEE Intelligent Vehicles Symposium, Gold Coast, QLD, Australia, Jun. (2013) 184. S. Zhang, W. Quan, J. Li, W. Shi, P. Yang, X. Shen, Air-ground integrated vehicular network slicing with content pushing and caching. IEEE J. Sel. Areas Commun. 36(9), 2114–2127 (2018) 185. F. Yuan, Y.H. Lee, Y.S. Meng, S. Manandhar, J.T. Ong, High-resolution ITU-R cloud attenuation model for satellite communications in tropical region. IEEE Trans. Antennas Propag. 67(9), 6115–6122 (2019) 186. E. Falletti, M. Laddomada, M. Mondin, F. Sellone, Integrated services from high-altitude platforms: A flexible communication system. IEEE Commun. Mag. 44(2), 85–94 (2006) 187. M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, C.S. Hong, Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-ofexperience. IEEE J. Sel. Areas Commun. 35(5), 1046–1061 (2017) 188. M. Mohammadi Amiri, D. Gündüz, Computation scheduling for distributed machine learning with straggling workers. IEEE Trans. Signal Process. 67(24), 6270–6284 (2019) 189. J. Long, City Cellular Traffic Map (C2TM). Available Online: http://xiaming.me/city-cellulartraffic-map/